GitHub - carmanzhang/PubMed-AND-method: Aggregating Large-Scale Databases for PubMed Author Name Disambiguation

PubMed Author Name Disambiguation

This project was developed for PubMed author name disambiguation. Name ambiguity arises when two authors with the same (or a similar) name appear in different citations and are hard to tell apart. The problem is serious in PubMed: as of 2019, there were nearly 10,000 namespaces (last name and first initial, also known as blocks) with a size over 1,000. This problem not only hinders the dissemination of valuable discoveries in the biomedical field, but also restricts many downstream studies and applications, such as author-centric bibliometric analysis and expert identification.
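To illustrate what a namespace (block) is, here is a minimal sketch that groups author instances by last name and first initial. The function names, the key format, and the sample records are hypothetical and are not taken from this repository.

```python
from collections import defaultdict


def block_key(last_name, fore_name):
    """Build a 'last name + first initial' key, e.g. ('Zhang', 'Li') -> 'zhang_l'."""
    last = last_name.strip().lower()
    initial = fore_name.strip()[:1].lower()
    return f"{last}_{initial}"


def build_blocks(author_instances):
    """Group (pmid, last_name, fore_name) tuples by their block key."""
    blocks = defaultdict(list)
    for pmid, last_name, fore_name in author_instances:
        blocks[block_key(last_name, fore_name)].append(pmid)
    return dict(blocks)


# Hypothetical author instances (made-up PMIDs and names).
instances = [
    (1001, "Zhang", "Li"),
    (1002, "Zhang", "Lei"),
    (1003, "Smith", "John"),
]
blocks = build_blocks(instances)
```

All instances inside one block share the same key, so disambiguation only needs to compare instances within a block rather than across the whole database.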

Setup

The project is implemented mainly in Python 3.6 and uses the following packages.

scipy, numpy, pandas, sklearn,
clickhouse-driver==0.2.0
geograpy3==0.1.24
jaro-winkler==2.0.0
python-Levenshtein==0.12.0
nltk==3.5

The Python module extracts most of the features in use and builds the disambiguation model with machine learning models. Extraction of some features from raw input, however, was already implemented in other languages, such as Java. To integrate these features, "Dependency-Feature" is a Java-based module that extracts them. In this folder, "maui" is a keyword-generation tool, and "tc2011" extracts Journal Descriptors and Semantic Types for each PubMed citation. The module also detects geographic fields in author affiliations using the NER technique provided by "stanford-corenlp" (see the dependencies in pom.xml).
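As a rough illustration of pairwise feature extraction, the sketch below turns two author records into a small numeric feature vector. It is an assumption-laden example: the record fields, the chosen features, and the function names are hypothetical, and the standard-library `difflib` is used as a stand-in for the jaro-winkler and python-Levenshtein packages listed above.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity in [0, 1]; difflib stands in for Jaro-Winkler/Levenshtein."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a, b):
    """Jaccard overlap of whitespace-separated tokens."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def pair_features(rec_a, rec_b):
    """Hypothetical feature vector for a pair of author records."""
    return [
        name_similarity(rec_a["fore_name"], rec_b["fore_name"]),
        jaccard(rec_a["affiliation"], rec_b["affiliation"]),
        float(rec_a["journal"] == rec_b["journal"]),
    ]


# Made-up records for illustration only.
record = {"fore_name": "Li", "affiliation": "Example University", "journal": "J Example"}
features = pair_features(record, record)
```

Vectors of this kind, computed for labeled pairs, could then be passed to a classifier such as those in sklearn.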

Database

The "database" folder contains a set of SQL scripts with self-explanatory names. These scripts associate additional metadata from external databases with the gold standard datasets; the steps implemented here include "database linkages", "metadata extraction", and "author profile building".

Collected Resources

The "resources" folder contains the resources needed to develop this project. The two validation datasets contain no metadata other than author names and positions. To obtain more discriminative information, we developed a program to crawl the official PubMed site. The XML-format citations for the datasets are included in "gs-dataset-articles" and "song-dataset-articles".
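One possible way to work with such XML citations is sketched below. It assumes the records follow the PubMed XML layout returned by the NCBI E-utilities `efetch` endpoint; the project's own crawler may work differently. The helper names and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmids):
    """Build an efetch URL that requests PubMed records as XML."""
    params = {"db": "pubmed", "id": ",".join(str(p) for p in pmids), "retmode": "xml"}
    return EFETCH_URL + "?" + urlencode(params)


def parse_authors(xml_text):
    """Extract one row per author (name, position, affiliation) from PubMed XML."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        authors = article.findall("MedlineCitation/Article/AuthorList/Author")
        for position, author in enumerate(authors, start=1):
            rows.append({
                "pmid": pmid,
                "position": position,
                "last_name": author.findtext("LastName"),
                "fore_name": author.findtext("ForeName"),
                # None when the record carries no affiliation for this author.
                "affiliation": author.findtext("AffiliationInfo/Affiliation"),
            })
    return rows


# Made-up record for illustration; real files hold full PubMed citations.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1001</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>Zhang</LastName><ForeName>Li</ForeName>
            <AffiliationInfo><Affiliation>Example University</Affiliation></AffiliationInfo>
          </Author>
          <Author><LastName>Smith</LastName><ForeName>John</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

authors = parse_authors(SAMPLE_XML)
```

Parsing is kept separate from URL construction so that the same function can be applied to files already stored on disk without any network access.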
