Code for our paper "Noise-Robust De-Duplication at Scale" [NBER], [arXiv], [ICLR]
This repo includes:
- NEWS-COPY dataset: a 27,210-document dataset with 122,876 positive duplicate pairs, for studying noise-robust de-duplication.
- Rule-based de-duplication methods: hashing and N-gram overlap.
- Neural de-duplication methods: a contrastively trained bi-encoder and a "re-rank"-style approach combining a bi- and cross-encoder, plus pre-trained models for both.
- Inference at scale for hashing and the bi-encoder methods.
If you find this work useful, please cite the following paper:
```bibtex
@inproceedings{silcock-etal-2020-noise,
    title = "Noise-Robust De-Duplication at Scale",
    author = "Silcock, Emily and D'Amico-Wong, Luca and Yang, Jinglin and Dell, Melissa",
    booktitle = "International Conference on Learning Representations (ICLR)",
    year = "2023",
}
```
To get started, clone the repository and create the conda environment:

```bash
git clone https://github.com/dell-research-harvard/NEWS-COPY.git
cd NEWS-COPY
conda env create -f environment.yml
```
- Historical Newspapers: train, evaluation, and test sets can be downloaded here. For more detail, see the paper above.
- C4: C4 can be downloaded thanks to AllenAI; see allenai/allennlp#5056.
The codebase covers neural de-duplication methods as well as rule-based N-gram overlap and locality-sensitive hashing (LSH). The rule-based approaches predominate in the literature, but are significantly outperformed by the neural methods below.
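To make the rule-based baseline concrete, here is a minimal sketch of N-gram shingling with MinHash LSH using the `datasketch` library. This is not the exact implementation in this repo; the shingle size, `num_perm`, and Jaccard `threshold` values are illustrative assumptions.

```python
from datasketch import MinHash, MinHashLSH

def ngrams(text, n=5):
    # Word 5-gram shingles; the shingle size is an illustrative choice.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

def minhash(text, num_perm=128):
    # Hash each shingle into a MinHash signature.
    m = MinHash(num_perm=num_perm)
    for gram in ngrams(text):
        m.update(gram.encode("utf8"))
    return m

docs = {
    "doc0": "the quick brown fox jumps over the lazy dog",
    "doc1": "the quick brown fox jumps over a lazy dog",
    "doc2": "an entirely different article about something else",
}

# Index every document, then query each one for near-duplicate candidates.
lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard threshold is an assumption
signatures = {doc_id: minhash(text) for doc_id, text in docs.items()}
for doc_id, sig in signatures.items():
    lsh.insert(doc_id, sig)

for doc_id, sig in signatures.items():
    candidates = [c for c in lsh.query(sig) if c != doc_id]
    print(doc_id, candidates)
```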
Training and evaluation scripts for a contrastively trained bi-encoder and a "re-rank"-style approach combining a bi- and cross-encoder. Both outperform the rule-based approaches above.
Pretrained models for both the bi-encoder and cross-encoder can be found here.
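As an illustration of the re-rank pattern (not the training or inference code in this repo), the sketch below uses the `sentence_transformers` library: a bi-encoder retrieves candidate duplicate pairs cheaply, and a cross-encoder re-scores each pair jointly. The model names are generic placeholders rather than the pretrained checkpoints released with the paper, and the thresholds are assumptions.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = ["first article text ...", "second article text ...", "third article text ..."]

# Stage 1: bi-encoder retrieves candidate duplicate pairs via embedding similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
embeddings = bi_encoder.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(embeddings, embeddings, top_k=5)

candidate_pairs = set()
for i, neighbours in enumerate(hits):
    for hit in neighbours:
        j = hit["corpus_id"]
        if i < j and hit["score"] > 0.8:  # candidate threshold is an assumption
            candidate_pairs.add((i, j))
candidate_pairs = sorted(candidate_pairs)

# Stage 2: cross-encoder re-scores each candidate pair jointly.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")  # placeholder model
scores = cross_encoder.predict([(docs[i], docs[j]) for i, j in candidate_pairs])
duplicates = [pair for pair, score in zip(candidate_pairs, scores) if score > 0.9]
print(duplicates)
```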
Inference at scale for hashing (LSH) and the bi-encoder methods over C4 and SuperGLUE. The bi-encoder scales well, de-duplicating a corpus of 10 million texts on a single GPU in a matter of hours.
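A rough sketch of how bi-encoder de-duplication can scale, under assumed choices that need not match this repo's inference code: embed documents in batches, find nearest neighbours with FAISS, and group high-similarity matches into duplicate clusters with union-find. The model name and similarity threshold are illustrative.

```python
import faiss
from sentence_transformers import SentenceTransformer

docs = ["article a ...", "article a ...", "article b ..."]  # stand-in for a large corpus

# Embed in batches; normalized vectors make inner product equal cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
emb = model.encode(docs, batch_size=256, normalize_embeddings=True,
                   convert_to_numpy=True).astype("float32")

# Exact inner-product index; for tens of millions of documents an approximate
# index such as faiss.IndexIVFFlat or faiss.IndexHNSWFlat keeps search tractable.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)
scores, neighbours = index.search(emb, 10)  # top-10 neighbours per document

# Union-find over pairs above a similarity threshold yields duplicate clusters.
parent = list(range(len(docs)))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

THRESHOLD = 0.92  # illustrative; tune on a labelled set such as NEWS-COPY
for i in range(len(docs)):
    for score, j in zip(scores[i], neighbours[i]):
        if i != j and score >= THRESHOLD:
            union(i, int(j))

clusters = {}
for i in range(len(docs)):
    clusters.setdefault(find(i), []).append(i)
print([members for members in clusters.values() if len(members) > 1])
```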