ERPub is a tool designed for resolving entities across multiple academic publication datasets (specifically ACM and DBLP) by employing various matching functions. This pipeline takes advantage of blocking, matching, and clustering techniques to identify and resolve duplicate entities within the given datasets.
Clone this repository to your local machine and navigate to the project directory:
git clone https://github.com/dia-exercise/ERPub
cd ERPub
Install the required dependencies:
pip install -r requirements.txt
The pipeline requires datasets in a specific format. To obtain and prepare the required datasets, you can use the provided script:
python erpub/data_preparation.py
This script downloads the DBLP and ACM datasets, keeps only publications published between 1995 and 2004, and removes duplicates. The resulting CSV files (DBLP_1995_2004.csv and ACM_1995_2004.csv) will be stored in the data/prepared directory.
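To give a rough idea of what this preparation step amounts to, here is a minimal standard-library sketch of year filtering and duplicate removal. It is not the code in erpub/data_preparation.py: the `year` column name, the inclusive year bounds, the sample rows, and the `filter_and_deduplicate` helper are all assumptions made for illustration.

```python
import csv
import io


def filter_and_deduplicate(rows, first_year=1995, last_year=2004):
    """Keep rows whose year lies in [first_year, last_year] and drop exact repeats."""
    seen = set()
    kept = []
    for row in rows:
        try:
            year = int(row["year"])  # "year" is an assumed column name
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric year
        if not first_year <= year <= last_year:
            continue
        key = tuple(sorted(row.items()))  # identical rows produce identical keys
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Made-up sample data standing in for a downloaded dataset
sample_csv = """paper_title,author_names,year
Query Optimization,A. Smith,1996
Query Optimization,A. Smith,1996
Data Cleaning,B. Jones,2010
Entity Matching,C. Lee,2004
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
prepared = filter_and_deduplicate(rows)
print([r["paper_title"] for r in prepared])
```

In this sketch the repeated 1996 row is kept once and the 2010 row is dropped.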
from erpub.pipeline import pipeline, blocking, matching, preprocessing

# Directory containing the prepared CSV files
file_dir = "data/prepared"

# Configure preprocessing, blocking, and one matching function per attribute
pipeline = pipeline.Pipeline(
    file_dir=file_dir,
    preprocess_data_fn=preprocessing.all_lowercase_and_stripped,
    blocking_fn=blocking.same_year_of_publication,
    matching_fns={
        "paper_title": matching.jaccard_similarity,
        "author_names": matching.specific_name_matcher,
    },
)
pipeline.run("output_directory", similarity_threshold=0.8)
pipeline.resolve("resolved_output_directory")

# Alternative configuration: naive_all_pairs blocking and no preprocessing function
from erpub.pipeline import pipeline, blocking, matching

file_dir = "data/prepared"

pipeline = pipeline.Pipeline(
    file_dir=file_dir,
    blocking_fn=blocking.naive_all_pairs,
    matching_fns={
        "paper_title": matching.jaccard_similarity,
        "author_names": matching.specific_name_matcher,
    },
)
pipeline.run("output_directory", similarity_threshold=0.8)
pipeline.resolve("resolved_output_directory")
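
To make the blocking, matching, and clustering stages more concrete, the following self-contained sketch walks through them on a few made-up records. It does not use ERPub and is not how ERPub is implemented: the `jaccard` and `resolve` helpers, the record ids, the `year` field, and the choice of connected components (via union-find) as the clustering step are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def resolve(records, threshold=0.8):
    """Group record ids whose titles match within the same publication year."""
    # Blocking: only records that share a year become candidate pairs.
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[rec["year"]].append(rec_id)

    # Union-find structure used to merge matched pairs into clusters.
    parent = {rec_id: rec_id for rec_id in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Matching: score every candidate pair inside each block.
    for ids in blocks.values():
        for left, right in combinations(ids, 2):
            score = jaccard(records[left]["paper_title"], records[right]["paper_title"])
            if score >= threshold:
                parent[find(left)] = find(right)  # clustering: merge the pair

    clusters = defaultdict(set)
    for rec_id in records:
        clusters[find(rec_id)].add(rec_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical records; ids and titles are invented for this example
records = {
    "acm-1": {"paper_title": "Efficient Query Processing in Databases", "year": 1999},
    "dblp-1": {"paper_title": "efficient query processing in databases", "year": 1999},
    "dblp-2": {"paper_title": "Efficient Query Processing in Databases", "year": 2001},
    "acm-2": {"paper_title": "Mining Association Rules", "year": 1999},
}
clusters = resolve(records)
print(clusters)
```

Note that "dblp-2" stays in its own cluster even though its title is identical, because year-based blocking never compares it with the 1999 records. This is the usual trade-off of blocking: fewer comparisons, at the risk of missing matches that fall into different blocks.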
matching.vector_embeddings requires the embeddings_path parameter of pipeline.Pipeline to be set. We used GloVe embeddings, which you can download to the embeddings/ directory by running this script:
python erpub/download_glove_embeddings.py
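For background, GloVe embeddings are distributed as plain text, one token per line followed by its vector components. The sketch below shows one common way to use such vectors for text similarity: average the word vectors and compare with cosine similarity. Whether matching.vector_embeddings works this way is an assumption; the helper names and the tiny two-dimensional vectors are invented for the example.

```python
import io
import math


def load_glove(lines):
    """Parse GloVe's text format: a token followed by space-separated floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def embed(text, vectors):
    """Average the vectors of the known words; return None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a GloVe file; real files have many more tokens and dimensions
toy_file = io.StringIO("query 1.0 0.0\nsearch 0.9 0.1\nprotein 0.0 1.0\n")
vectors = load_glove(toy_file)
sim_close = cosine(embed("query", vectors), embed("search", vectors))
sim_far = cosine(embed("query", vectors), embed("protein", vectors))
print(sim_close > sim_far)
```

With a real embeddings file you would pass an open file handle to `load_glove` instead of the in-memory string.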
Check out the existing preprocessing, blocking, and matching functions in erpub/pipeline/. You can also add your own custom functions to the pipeline.
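As a hedged illustration of a custom matching function, the sketch below assumes that a matcher takes two strings and returns a similarity between 0 and 1. That signature is an assumption, so check the existing functions in erpub/pipeline/matching.py before relying on it. The name `sequence_ratio` is hypothetical; the function uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def sequence_ratio(a: str, b: str) -> float:
    """Hypothetical custom matcher: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


print(sequence_ratio("entity resolution", "entity resolution"))  # 1.0
print(sequence_ratio("entity resolution", "entity resolutoin"))
```

If the assumed signature holds, such a function could be passed as a value in matching_fns in place of matching.jaccard_similarity.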
Set the verbose parameter of the pipeline to False to disable logging.
Install pytest for testing:
pip install pytest
Run unit tests using:
pytest tests/
Explore the experiments in the experiments.ipynb notebook, which covers different use cases and scenarios for applying the entity resolution pipeline.
This project is licensed under the MIT License - see the LICENSE file for details.