Cross-language Information Retrieval (CLIR) has been studied at TREC and subsequent evaluations for more than twenty years. Prior to the application of deep learning, strong statistical approaches were developed that work well across many languages. As with most other language technologies though, neural computing has led to significant performance improvements in information retrieval. CLIR has just begun to incorporate neural advances.
The TREC 2022 NeuCLIR track presents a cross-language information retrieval challenge. NeuCLIR topics are written in English. NeuCLIR has three target language collections in Chinese, Persian, and Russian. Topics are written in the traditional TREC format: a short title and a sentence-length description. Systems are to return a ranked list of documents for each topic. Results will be pooled, and systems will be evaluated on a range of metrics.
- Tasks
- Data Formats
- Data
- Additional Resources
- Software
- Submission
- Assessment & Scoring
- Carbon Neutrality
- Registration
- Timeline
- Organizers
The main task in the NeuCLIR track is ad hoc cross-language information retrieval. Systems will receive a document collection in Chinese, Persian, or Russian, and a set of English topics. For each topic, the system will return a ranked list of 1000 documents drawn from the document collection, ordered by likelihood of relevance to the topic.
The reranking task is an extension of the ad hoc task. In addition to a document collection and a set of topics, systems also receive as input for each topic a ranked list of 1000 documents drawn from the document collection. Each ranked list is the output of a retrieval system operating on the ad hoc task. Reranking task results must be drawn only from the documents that appear in these lists. In all other ways, the reranking task is identical to the ad hoc task. We will release runs for reranking for the HC4 training topics.
While monolingual retrieval is not a focus of the NeuCLIR track, monolingual runs can improve assessment pools, and serve as good points of reference for cross-language runs. The monolingual retrieval task is identical to the ad hoc task, but uses topic files that are human translations of the English topics into the target language in a way that would be expressed by speakers of the language.
Documents are distributed in JSONL format, with one document in JSON format on each line. The fields present for each document are:
id
(string): The document ID for the documenttext
(string): Complete text of the documentdate
(string): YYYY-MM-DD or empty stringLang
(string): ISO 639-3
Participating teams can get a copy of the processed corpora directly from NIST. Alternatively, anybody can replicate the corpora by downloading them from the Common Crawl using provided scripts (this takes longer than downloading from NIST).
NeuCLIR will use the standard TREC ad hoc submission format for submissions and for ranked lists that serve as input to the reranking task. Each set of ranked results for a set of topics appears in a single file. Each line of this file contains six whitespace-separated entries:
- Topic (query) number
- The fixed string
Q0
- Document ID
- Rank
- Score (integer or float)
- Run ID
Field 1 is the topic ID taken from the topics file. Entries for each topic must be contiguous in a ranked results file. Field 3 is the document ID, taken from the document collection. The scores in Field 5 must appear in non-increasing order within a given topic. In reranking input files, these values will be non-increasing, but will otherwise not be meaningful. Field 6 is the run ID, generated by the submitter (the first characters of the run ID must be the name of the submitting team). Fields 2 and 4 are ignored, although they must be present. Here is a portion of a sample ranked results file:
1 Q0 pid1 1 2.73 team1-run1
1 Q0 pid2 2 2.71 team1-run1
1 Q0 pid3 3 2.61 team1-run1
1 Q0 pid4 4 2.05 team1-run1
1 Q0 pid5 5 1.89 team1-run1
Up to 1,000 results per query will be accepted. Runs containing queries with more than 1,000 results will be rejected.
Relevance judgments are in standard TREC QRELS format. Each line of the QRELS file will contain four whitespace-separated entries:
- A topic ID
- The string
0
(zero) - A document ID
- A relevance judgment drawn from the following set:
3
: document is fully relevant to the topic (i.e., it contains facts that would be included in lead paragraph of a report on the topic)1
: document is somewhat relevant to topic (i.e., it contains facts that would be included elsewhere in a report on the topic)0
: document is not relevant to topic (i.e., it does not contain information that would be included in a report written about the topic)
See the assessment section below for further explanation of these relevance levels. Qrels files will be distributed with development data but not evaluation data.
Development data, called HC4, include document sets of about 5 million Russian documents and ½ million each of Chinese and Persian documents. There are 60 development topics each in Chinese and Persian, and 54 development topics in Russian. Many of the HC4 documents are in NeuCLIR1. You will be able to score the development topics against the NeuCLIR1 collection if you intersect the two document sets, throw away retrieved documents not in HC4 and filter the qrels for documents in NeuCLIR1. HC4 is available from the ir_datasets package under hc4.
NEW: Details about our document collection, NEUCLIR1, are available on this page.
The document collections will be drawn from the Common Crawl News Collection, spanning a five year time window from August 2016 to July 2021. Very short and very long documents were filtered. Russian was randomly sampled to have 5 million documents pre-deduplication. Automatic deduplication removes duplicate documents. There are 4 ½ million Russian, 3 million Chinese, and 2 million Persian documents. Evaluation topics will be released in June 2022.
The download script for the collection is currently available in the NeuCLIR/download-collection repository. A gzipped version of the collection is available here (password will be available on the TREC website).
A list of multilingual topics is available here.
- Multiple languages
- mMARCO: Translations, using OPUS-MT and Google Translate, of MS MARCO into many languages (but not Persian):
- ir_datasets: mmarco
- On HuggingFace: unicamp-dl/mmarco
- neuMARCO: Translations of MS MARCO via a sockeye translation system
- Download link neuMSMARCO.tar.gz
- WikiCLIR: Multilingual CLIR collection, including Chinese and Russian
- ir_datasets: wikiclir
- CLIRMatrix: Another multi-lingual CLIR collection, including Chinese, Russian, and Persian.
- ir_datasets: clirmatrix
- Example code
- mMARCO: Translations, using OPUS-MT and Google Translate, of MS MARCO into many languages (but not Persian):
- Chinese
- mMARCO Chinese v1 (OPUS MT)
- ir_datasets: mmarco/zh
- mMARCO Chinese v2 (Google Translate)
- ir_datasets: mmarco/v2/zh
- neuMARCO Chinese
- Download link: neuMSMARCO.tar.gz
- mMARCO Chinese v1 (OPUS MT)
- Persian
- neuMARCO Persian
- Download link: neuMSMARCO.tar.gz
- neuMARCO Persian
- Russian
- mMARCO Russian v1 (Opus MT)
- ir_datasets: mmarco/ru
- mMARCO Russian v2 (Google Translate)
- ir_datasets: mmarco/v2/ru
- neuMARCO Russian
- Download link: neuMSMARCO.tar.gz
- Mr. TyDi: Monolingual retrieval in several languages, including Russian
- ir_datasets: mr-tydi
- mMARCO Russian v1 (Opus MT)
Patapsco is a CLIR framework that makes it easy to get started with CLIR. An overview of Patapsco’s capabilities can be found in this paper, and the code is available from the hltcoe/patapsco repository. A Jupyter notebook that steps through basic Patapsco usage is available here.
The Chinese NeuCLIR collection includes both traditional and simplified characters. For those who would like to deal with only one Chinese character set, a script to convert traditional to simplified characters is available in the NeuCLIR/download-collection repository. Users should bear in mind that the transliteration from traditional to simplified characters is imperfect, and there is no guarantee that the resulting text accurately captures the meaning of the text in the original traditional characters.
A validator for track submissions will be posted to the Active Participants area on trec.nist.gov.
Submissions will be evaluated using ir-measures. The software uses the official trec_eval implementation for measures that trec_eval (nDCG, MAP, etc.), and delegates computation of measures unsupported by trec_eval to alternative implementations (e.g., Rank Biased Precision uses cwl_eval). Evaluation will be conducted using the following command:
ir_measures qrels_file run_file \
'nDCG@20 MAP RBP(rel=1) R@100 R@1000'
Runs will be submitted through the NIST submission system at trec.nist.gov/act_part/act_part.html. Runs that do not pass validation will be rejected outright. Submitted runs will be asked to specify (a) single-stage or multi-stage (i.e., ranking or reranking); (b) neural, statistical, or a combination; and (c) manual or automatic.
Each run submission must indicate whether the run is manual or automatic. An automatic run is any run that receives no human intervention once the system is started and provided with the task inputs. We expect most NeuCLIR runs to be automatic runs. Note that using the provided human and machine translations without further human intervention count as automatic runs. Runs that use the provided human translations (i.e., the monolingual task) will be reported separately from cross-language runs.
Results on manual runs will be specifically identified when results are reported. A manual run is any run in which changes are made to the queries, the system, or the system’s results after the topics have been seen. This includes, for example, manually creating queries from the topic description, or based on manual examination of retrieval results. or implementing new automated processing capabilities such as stop structure removal that are created after the topics have been seen. Simple bug fixes that address only format handling do not result in manual runs, but the changes should be described.
Relevance assessments for each topic will be made by a single person, who would optimally be the person who created the topic.
Scoring will be performed by ir-measures. The main reported measure will be:
- Normalized Discounted Cumulative Gain at 20 (nDCG@20). Weights for the three levels of relevance are:
- Not relevant: 0
- Somewhat relevant: 1
- Highly relevant: 3
- Additional evaluation measures include MAP, RBP, Recall@100, and Recall@1000.
We also solicit self-reported mean response time per query and the total number of model parameters for each run. This will allow us to analyze the efficiency of various approaches. We recognize that neither of these measures allows for a perfectly fair comparison (e.g., MRT is hardware-dependent and the number of parameters in a model does not correlate directly with its efficiency), so these measurements will only be grouped into coarse-grained categories. Reporting these values is optional, and good-faith approximations are permitted.
Our aim for NeuCLIR is to be a carbon-neutral track. Therefore, we strongly encourage all participants to track carbon emissions associated with their submission, and, if they are able to, buy carbon offsets to minimize the impact of their submissions. For details, please visit this page. Participants will be asked on submission whether they tracked their carbon emissions and whether they offset their impact. However, this will not affect the evaluation of individual runs; the information will be anonymized and will only be reported publicly in aggregate (e.g., “85% of teams tracked their carbon”).
- Sign up with NIST, using the link at the bottom of the Call for Participation. That’s when you choose a team name. If you don’t do this, your team name won’t be in the submission system, and you won’t be able to submit runs. This also gets you one email address on the NIST mailing list, which is where all the NIST coordination information will be sent.
- Sign up for our track's Google group. This is where we discuss things like the track guidelines.
-
November 2021: NeuCLIR track is first discussed at TREC 2021; feedback from potential participants is welcome.
-
December 2021: Release preliminary track guidelines. Complete download, cleaning and formatting of new document collection.
-
January 2022: Initial training data (HC4) and Patapsco infrastructure released
-
March 2022: Release new document collection.
-
May 2022: Topic development complete.
-
June 2022: Release test topics and baseline runs for re-ranking.
-
July 2022: Submissions due to NIST
-
October 2022: Distribute results to participants.
-
November 2022: TREC 2022.
- Dawn Lawrie, Johns Hopkins University, HLTCOE
- Sean MacAvaney, University of Glasgow
- James Mayfield, Johns Hopkins University, HLTCOE
- Paul McNamee, Johns Hopkins University, HLTCOE
- Douglas W. Oard, University of Maryland
- Luca Soldaini, Allen Institute for AI
- Eugene Yang, Johns Hopkins University, HLTCOE
- We encourage all interested researchers to join our mailing list for the latest announcement and news.
- For any questions, please reach out to the organizers at neuclir-organizers@googlegroups.com.
- Follow us on Twitter at @neuclir to connect with other NeuCLIR participants.