Deduplication, entity resolution, record linkage, author disambiguation, and others ...
As different research communities encountered this problem, they each gave it a new name but, ultimately, it's all about trying to figure out which records refer to the same thing.
Dedupe is an open source Python library that quickly de-duplicates large sets of data.
- machine learning - reads in human labeled data to automatically create optimum weights and blocking rules
- runs on a laptop - makes intelligent comparisons so you don't need a powerful server to run it
- built as a library - so it can be integrated into your applications or import scripts
- extensible - supports adding custom data types, string comparators and blocking rules
- open source - anyone can use, modify or add to it
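To give a feel for why blocking rules let this kind of work run on a laptop, here is a minimal sketch of the general idea. It is not Dedupe's API: the `blocking_key` rule, the `candidate_pairs` helper, and the sample records are all hypothetical, and Dedupe learns its blocking rules from labeled data rather than hard-coding one.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical hand-written rule: first three characters of the
    # lowercased name. Dedupe learns its rules instead of fixing one.
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    """Return only the record-id pairs that share a blocking key."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[blocking_key(record)].append(record_id)

    pairs = []
    for ids in blocks.values():
        # Records are compared only with others in the same block.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Made-up sample data for illustration.
records = {
    1: {"name": "Acme Hardware"},
    2: {"name": "ACME Hardware Inc"},
    3: {"name": "Bolt Supply"},
    4: {"name": "bolt supply co"},
    5: {"name": "Cedar Cafe"},
}

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print("comparing %d of %d possible pairs: %s" % (len(pairs), all_pairs, pairs))
```

Without blocking, every record would be compared against every other record, which grows quadratically with the number of rows. A good blocking rule keeps likely duplicates together while discarding most of the pairs that could never match.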
- Dedupe Google group
- ChiPy presentation
- IRC channel, #dedupe on irc.freenode.net
Dedupe requires numpy, which can be complicated to install. If you are installing numpy for the first time, follow these instructions. You'll need version 1.6 of numpy or higher.
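If you are not sure which numpy you already have, a quick check along these lines may help. This is only a sketch: `version_tuple` and `numpy_meets_minimum` are hypothetical helper names, not part of Dedupe or numpy, and the only numpy feature used is the standard `numpy.__version__` string.

```python
import re


def version_tuple(version):
    """Turn '1.6.2' or '1.8.0rc1' into a tuple of ints for comparison."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def numpy_meets_minimum(minimum="1.6"):
    """Return True if numpy is importable and at least `minimum`."""
    try:
        import numpy
    except ImportError:
        return False
    return version_tuple(numpy.__version__) >= version_tuple(minimum)


if __name__ == "__main__":
    if numpy_meets_minimum():
        print("numpy is recent enough")
    else:
        print("numpy is missing or older than 1.6")
```

Comparing tuples of integers rather than raw strings avoids treating a version such as 1.10 as older than 1.6.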
After numpy is set up, install the following:
Using pip:
git clone git://github.com/open-city/dedupe.git
cd dedupe
pip install "numpy>=1.6"
# for python 2.7
pip install -r requirements.txt
# OR for python 2.6
pip install -r py26_requirements.txt
python setup.py install
Using easy_install:
git clone git://github.com/open-city/dedupe.git
cd dedupe
easy_install "numpy>=1.6"
easy_install "fastcluster>=1.1.8"
easy_install "hcluster>=0.2.0"
easy_install networkx
python setup.py install
Dedupe is a library and not a stand-alone command line tool. To demonstrate its usage, we have come up with a few example recipes for different sized datasets.
CSV example (<10,000 rows)
python examples/csv_example/csv_example.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)
To see how you might use dedupe with smallish data, see the annotated source code for csv_example.py.
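The 'y', 'n', 'u' and 'f' prompt mentioned above can be pictured roughly as the loop below. This is an illustrative sketch only, not the code Dedupe runs: `collect_labels`, its `read_key` parameter, and the label names are hypothetical, and real active learning also chooses which pair to show next, which is left out here.

```python
def collect_labels(pairs, read_key=input):
    """Ask about each candidate pair until the list ends or 'f' is given."""
    labels = {"match": [], "distinct": []}
    prompt = "Same thing? (y)es / (n)o / (u)nsure / (f)inished: "
    for pair in pairs:
        print("%r  <->  %r" % pair)
        answer = ""
        while answer not in ("y", "n", "u", "f"):
            # Keep asking until one of the four keys is entered.
            answer = read_key(prompt).strip().lower()
        if answer == "f":
            break
        if answer == "y":
            labels["match"].append(pair)
        elif answer == "n":
            labels["distinct"].append(pair)
        # 'u' (unsure) records nothing and moves on to the next pair.
    return labels


# Scripted answers stand in for a person at the keyboard.
scripted = iter(["y", "x", "n", "u", "f"])
demo_pairs = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h"), ("i", "j")]
demo_labels = collect_labels(demo_pairs, read_key=lambda _prompt: next(scripted))
print(demo_labels)
```

Passing `read_key` as a parameter keeps the loop testable without a terminal; called with the default, it reads from standard input.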
MySQL example (10,000 - 1,000,000+ rows)
This can take a few hours and will noticeably tax your laptop. You might want to run it overnight.
To follow this example you need to:
- Create a MySQL database called 'contributions'
- Copy examples/mysql_example/mysql.cnf_LOCAL to examples/mysql_example/mysql.cnf
- Update examples/mysql_example/mysql.cnf with your MySQL username and password
- Run easy_install MySQL-python or pip install MySQL-python
Once that's all done you can run the example:
python examples/mysql_example/mysql_init_db.py
python examples/mysql_example/mysql_example.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)
To see how you might use dedupe with biggish data, see the annotated source code for mysql_example.
We are trying to figure out a range of typical runtimes for different hardware. Please let us know your run time for the MySQL example.
The documentation for the dedupe library is on our wiki.
Unit tests of core dedupe functions
python test/test_dedupe.py
Using random sample data for training
python test/canonical_test.py
Using active learning for training
python test/canonical_test.py --active True
Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.
If something is not behaving intuitively, it is a bug, and should be reported. Report it here
- Fork the project.
- Make your feature addition or bug fix.
- Send us a pull request. Bonus points for topic branches.
Copyright (c) 2013 Forest Gregg and Derek Eder. Released under the MIT License.
Third-party copyright in this distribution is noted where applicable.
If you use Dedupe in an academic work, please give this citation:
Gregg, Forest, and Derek Eder. 2013. Dedupe. https://github.com/open-city/dedupe.