The publication extraction pipeline extracts publication metadata from web pages. It consists of two modules:
- A `PublicationExtractor`, responsible for extracting the publication strings located within web pages. Everything related to this module is located in `/publication_extraction`.
- A `MetadataExtractor`, responsible for extracting various metadata fields located within publication strings. Everything related to this module is located in `/metadata_extraction`.

The pipeline itself is represented by the `Pipeline` class.
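
The two-stage structure described above can be illustrated with a minimal sketch. The constructor arguments and method names of the real `Pipeline`, `PublicationExtractor`, and `MetadataExtractor` are not documented in this section, so everything below is a hypothetical stand-in: the `Toy*` classes, the `extract` method, and the regex-based logic are assumptions chosen only to show how a publication-string stage can feed a metadata stage.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


class ToyPublicationExtractor:
    """Hypothetical stand-in: treats each <li> element as one publication string."""

    def extract(self, html: str) -> List[str]:
        items = re.findall(r"<li>(.*?)</li>", html, flags=re.DOTALL)
        # Collapse internal whitespace so each publication is a single clean line.
        return [re.sub(r"\s+", " ", item).strip() for item in items]


class ToyMetadataExtractor:
    """Hypothetical stand-in: pulls out only a four-digit year, if present."""

    def extract(self, publication: str) -> Dict[str, str]:
        match = re.search(r"\b(19|20)\d{2}\b", publication)
        return {"year": match.group(0)} if match else {}


@dataclass
class ToyPipeline:
    """Chains the two stages: web page -> publication strings -> metadata."""

    publication_extractor: ToyPublicationExtractor
    metadata_extractor: ToyMetadataExtractor

    def extract(self, html: str) -> List[Dict[str, str]]:
        results = []
        for publication in self.publication_extractor.extract(html):
            metadata = self.metadata_extractor.extract(publication)
            results.append({"publication": publication, **metadata})
        return results


page = "<ul><li>A. Author. A Title. 2019.</li><li>B. Author. Another Title.</li></ul>"
pipeline = ToyPipeline(ToyPublicationExtractor(), ToyMetadataExtractor())
records = pipeline.extract(page)
```

Keeping the two stages behind separate objects means either one can be swapped for a different model without touching the other, which mirrors the module split described above.
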
- Some models assume that pre-trained GloVe word embeddings are located in `data/glove/`.
- Some models assume that the HomePub dataset is located in `data/homepub-2500/`.
- Some models assume that the UMass Citation Field Extraction Dataset is located in `data/umass/`.

The `glove.sh`, `homepub.sh`, and `umass.sh` scripts can be used to download these.
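
Before training or running a model, it can help to confirm that the expected data directories are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_datasets` is hypothetical, and the pairing of each directory with a download script is an assumption inferred from the script names.

```python
from pathlib import Path
from typing import Dict, List

# Directory -> download script. The pairing is assumed from the file names.
EXPECTED_DIRS: Dict[str, str] = {
    "data/glove": "glove.sh",
    "data/homepub-2500": "homepub.sh",
    "data/umass": "umass.sh",
}


def missing_datasets(root: Path) -> List[str]:
    """Return one hint per expected data directory that is absent or empty."""
    hints = []
    for relative_dir, script in EXPECTED_DIRS.items():
        directory = root / relative_dir
        if not directory.is_dir() or not any(directory.iterdir()):
            hints.append(f"{relative_dir}/ is missing or empty; try running {script}")
    return hints


if __name__ == "__main__":
    # Check relative to the current working directory.
    for hint in missing_datasets(Path(".")):
        print(hint)
```

An empty result means all three directories exist and contain at least one entry; it does not verify that the downloaded files are complete.
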