ERPub

ERPub is a tool for resolving entities across multiple academic publication datasets (specifically ACM and DBLP). Its pipeline combines blocking, matching, and clustering techniques with configurable matching functions to identify and resolve duplicate entities within the given datasets.

Installation

Clone this repository to your local machine and navigate to the project directory:

git clone https://github.com/dia-exercise/ERPub
cd ERPub

Install the required dependencies:

pip install -r requirements.txt

Dataset Preparation

The pipeline requires datasets in a specific format. To obtain and prepare the required datasets, you can use the provided script:

python erpub/data_preparation.py

This script downloads the DBLP and ACM datasets, keeps only publications published between 1995 and 2004, and removes duplicates. The resulting CSV files (DBLP_1995_2004.csv and ACM_1995_2004.csv) will be stored in the data/prepared directory.
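
The filtering and de-duplication steps can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in erpub/data_preparation.py; the column names (paper_title, year), the sample rows, and the helper name filter_and_deduplicate are assumptions made for the example.

```python
import csv
import io


def filter_and_deduplicate(rows, start_year=1995, end_year=2004):
    """Keep rows whose year lies in [start_year, end_year] and drop exact duplicates."""
    seen = set()
    result = []
    for row in rows:
        try:
            year = int(row["year"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable year
        if not start_year <= year <= end_year:
            continue
        key = tuple(sorted(row.items()))  # the whole record acts as the duplicate key
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Hypothetical sample data standing in for a downloaded dataset.
sample = io.StringIO(
    "paper_title,year\n"
    "Query Optimization,1996\n"
    "Query Optimization,1996\n"
    "Old Paper,1990\n"
)
prepared = filter_and_deduplicate(csv.DictReader(sample))
print(len(prepared))
```

Here only exact duplicate rows are removed; the actual script may use a different notion of duplicate.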

Usage

Importing the Pipeline and its required functions

from erpub.pipeline import pipeline, blocking, matching, preprocessing

Initializing the Pipeline

file_dir = "data/prepared"
pipeline = pipeline.Pipeline(
    file_dir=file_dir,
    preprocess_data_fn=preprocessing.all_lowercase_and_stripped,
    blocking_fn=blocking.same_year_of_publication,
    matching_fns={
        "paper_title": matching.jaccard_similarity,
        "author_names": matching.specific_name_matcher,
    },
)

Running the Pipeline

pipeline.run("output_directory", similarity_threshold=0.8)
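
To make the roles of blocking_fn, matching_fns and similarity_threshold more concrete, here is a rough, self-contained sketch of how such a run could work: records are grouped by publication year, pairs inside each block are scored, and pairs at or above the threshold are kept. It illustrates the general technique under stated assumptions and is not ERPub's actual implementation; the record layout, the helper names, and the averaging of per-attribute scores are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of the whitespace-separated token sets of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def block_by_year(records):
    """Group record indices by publication year (the blocking step)."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec["year"]].append(idx)
    return blocks


def find_matches(records, threshold=0.8):
    """Compare pairs within each block and keep those scoring at least threshold."""
    matches = []
    for indices in block_by_year(records).values():
        for i, j in combinations(indices, 2):
            scores = [
                jaccard(records[i][attr], records[j][attr])
                for attr in ("paper_title", "author_names")
            ]
            # Assumption: per-attribute scores are combined by a plain average.
            if sum(scores) / len(scores) >= threshold:
                matches.append((i, j))
    return matches


# Made-up records for illustration only.
records = [
    {"paper_title": "efficient query processing", "author_names": "a smith", "year": 1999},
    {"paper_title": "efficient query processing", "author_names": "a smith", "year": 1999},
    {"paper_title": "efficient query processing", "author_names": "a smith", "year": 2001},
]
matches = find_matches(records)
print(matches)
```

The third record is never compared with the first two because it falls into a different block, which is the trade-off blocking makes: fewer comparisons at the risk of missing matches across blocks.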

Resolving the Entities

pipeline.resolve("resolved_output_directory")
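
One common way to turn matched pairs into resolved entities is to treat each match as an edge and take the connected components as clusters. The sketch below shows that idea with a small union-find structure. It is an assumption about the general clustering technique, not necessarily how pipeline.resolve works internally, and cluster_matches is a hypothetical helper.

```python
def cluster_matches(num_records, matched_pairs):
    """Group record indices into clusters using union-find over matched pairs."""
    parent = list(range(num_records))

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for idx in range(num_records):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


clusters = cluster_matches(5, [(0, 1), (1, 2), (3, 4)])
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Each cluster can then be collapsed into a single representative record.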

Example setup

from erpub.pipeline import pipeline, blocking, matching

file_dir = "data/prepared"
pipeline = pipeline.Pipeline(
    file_dir=file_dir,
    blocking_fn=blocking.naive_all_pairs,
    matching_fns={
        "paper_title": matching.jaccard_similarity,
        "author_names": matching.specific_name_matcher,
    },
)

pipeline.run("output_directory", similarity_threshold=0.8)

pipeline.resolve("resolved_output_directory")

Note for vector embeddings

matching.vector_embeddings requires the embeddings_path parameter of pipeline.Pipeline to be set. We used GloVe embeddings; you can download them to the embeddings/ directory by running this script:

python erpub/download_glove_embeddings.py
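
For intuition, embedding-based matching is often done by averaging the word vectors of each string and comparing the averages with cosine similarity. The sketch below illustrates that idea with made-up two-dimensional toy vectors; it is not taken from matching.vector_embeddings, and the helper names are hypothetical. Real GloVe vectors have many more dimensions and are read from the downloaded file.

```python
import math


def average_vector(text, embeddings):
    """Average the vectors of the known tokens in text; None if no token is known."""
    vectors = [embeddings[t] for t in text.split() if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy vectors invented for illustration; they are not real GloVe values.
toy = {"data": [1.0, 0.0], "base": [0.0, 1.0], "query": [1.0, 1.0]}
sim = cosine_similarity(average_vector("data base", toy), average_vector("query", toy))
print(sim)
```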

Customization

Functions

Check out the existing preprocessing, blocking and matching functions in erpub/pipeline/. You can also write your own functions and pass them to the pipeline.
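
As a sketch of what a custom matching function might look like, the example below wraps difflib.SequenceMatcher from the standard library. It assumes that a matching function receives two attribute values as strings and returns a similarity between 0 and 1; check the existing functions in erpub/pipeline/ for the signature the pipeline actually expects. The name sequence_ratio_similarity is hypothetical.

```python
from difflib import SequenceMatcher


def sequence_ratio_similarity(value_1: str, value_2: str) -> float:
    """Hypothetical custom matcher: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, value_1, value_2).ratio()


score = sequence_ratio_similarity("entity resolution", "entity resolutoin")
print(round(score, 2))
```

If the assumed signature holds, such a function could be passed in matching_fns in place of matching.jaccard_similarity, for example to tolerate small typos in titles.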

Verbosity

Set the verbose parameter of the pipeline to False to disable logging.

Unit Tests

Install pytest for testing:

pip install pytest

Run unit tests using:

pytest tests/

Experiments

Explore various experiments in the experiments.ipynb notebook. The notebook provides insights into different use cases and scenarios for applying the entity resolution pipeline.

License

This project is licensed under the MIT License - see the LICENSE file for details.
