Publication Metadata Extraction Pipeline

The publication extraction pipeline is used to extract publication metadata from web pages. It consists of two modules:

A PublicationExtractor responsible for extracting the publication strings located within web pages. Everything related to this module is located in /publication_extraction.
A MetadataExtractor responsible for extracting various metadata fields located within publication strings. Everything related to this module is located in /metadata_extraction

The pipeline itself is represented by the Pipeline class.

Prerequisites

Some models assume that pre-trained GloVe word embeddings are located in data/glove/.
Some models assume that the HomePub dataset is located in data/homepub-2500/.
Some models assume that the UMass Citation Field Extraction Dataset is located in data/umass/.

The glove.sh, homepub.sh, and umass.sh scripts can be used to download these.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
pipeline		pipeline
.flake8		.flake8
.gitignore		.gitignore
README.md		README.md
glove.sh		glove.sh
homepub.sh		homepub.sh
requirements.txt		requirements.txt
umass.sh		umass.sh