EuroGOV: Engineering a Multilingual Web Corpus
Börkur Sigurbjörnsson, Jaap Kamps, and Maarten de Rijke.
In: Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005. Lecture Notes in Computer Science. 2006.
Link: springerlink
Abstract
EuroGOV is a multilingual web corpus that was created to serve as the document collection for WebCLEF, the CLEF 2005 web retrieval task. EuroGOV is a collection of web pages crawled from the European Union portal, European Union member state governmental web sites, and Russian governmental web sites. The corpus contains over 3 million documents written in more than 20 different European languages. In this paper we provide a detailed description of the EuroGOV collection.
Bibtex
@inproceedings{sigurbjornsson2006eurogov,
  author    = {B{\"o}rkur Sigurbj{\"o}rnsson and Jaap Kamps and Maarten de Rijke},
  title     = {EuroGOV: Engineering a Multilingual Web Corpus},
  editor    = {C. Peters and F.C. Gey and J. Gonzalo and G.J.F. Jones and M. Kluck and B. Magnini and H. M{\"u}ller and M. de Rijke},
  booktitle = {Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005},
  series    = {Lecture Notes in Computer Science},
  volume    = {4022},
  pages     = {825--836},
  publisher = {Springer-Verlag},
  year      = {2006},
  doi       = {http://dx.doi.org/10.1007/11878773_90},
}