Extract publication metadata from web pages. Developed as part of my specialization project (TDT4501 @ NTNU).

olafapl/publication_extraction

Publication Metadata Extraction Pipeline

The publication extraction pipeline is used to extract publication metadata from web pages. It consists of two modules:

  • A PublicationExtractor responsible for extracting the publication strings located within web pages. Everything related to this module is located in /publication_extraction.
  • A MetadataExtractor responsible for extracting various metadata fields located within publication strings. Everything related to this module is located in /metadata_extraction.

The pipeline itself is represented by the Pipeline class.
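
The two-stage flow described above can be illustrated with a minimal sketch. This is not the repository's code: the Toy* classes, their extract and run methods, and the year-matching heuristic are all hypothetical stand-ins, chosen only to show how a publication-string stage can feed a metadata stage. The real PublicationExtractor, MetadataExtractor, and Pipeline classes may expose a different interface.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical pattern: a four-digit year starting with 19 or 20.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


class ToyPublicationExtractor:
    """Stand-in for stage 1: find publication strings in page text."""

    def extract(self, page_text: str) -> List[str]:
        # Toy heuristic (an assumption): a line containing a year is a publication.
        lines = (line.strip() for line in page_text.splitlines())
        return [line for line in lines if YEAR_PATTERN.search(line)]


class ToyMetadataExtractor:
    """Stand-in for stage 2: pull metadata fields out of one publication string."""

    def extract(self, publication: str) -> Dict[str, str]:
        match = YEAR_PATTERN.search(publication)
        return {"raw": publication, "year": match.group(0) if match else ""}


@dataclass
class ToyPipeline:
    """Chains the two stages: page text -> publication strings -> metadata."""

    publication_extractor: ToyPublicationExtractor
    metadata_extractor: ToyMetadataExtractor

    def run(self, page_text: str) -> List[Dict[str, str]]:
        publications = self.publication_extractor.extract(page_text)
        return [self.metadata_extractor.extract(p) for p in publications]


if __name__ == "__main__":
    page = "Home\nA. Author. A paper title. In Proc. of X, 2015.\nContact me\n"
    pipeline = ToyPipeline(ToyPublicationExtractor(), ToyMetadataExtractor())
    for record in pipeline.run(page):
        print(record)
```

Keeping the stages behind a small shared interface like this would let either one be swapped for a trained model without touching the other, which is presumably why the repository separates them into two modules.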

Prerequisites

  • Some models assume that pre-trained GloVe word embeddings are located in data/glove/.
  • Some models assume that the HomePub dataset is located in data/homepub-2500/.
  • Some models assume that the UMass Citation Field Extraction Dataset is located in data/umass/.

The glove.sh, homepub.sh, and umass.sh scripts can be used to download these.
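
Before training or running a model, it may help to check which of these directories are present. The helper below is a sketch, not part of the repository: the function name missing_prerequisites is hypothetical, and the pairing of each directory with a script is inferred from the file names listed above. It also assumes the paths are relative to the repository root.

```python
from pathlib import Path
from typing import Dict, List

# Data directory -> download script. The pairing is an assumption based on
# the names given in the Prerequisites section.
PREREQUISITES: Dict[str, str] = {
    "data/glove": "glove.sh",
    "data/homepub-2500": "homepub.sh",
    "data/umass": "umass.sh",
}


def missing_prerequisites(root: str = ".") -> List[str]:
    """Return the download scripts whose data directory is absent under root."""
    base = Path(root)
    return [
        script
        for directory, script in PREREQUISITES.items()
        if not (base / directory).is_dir()
    ]


if __name__ == "__main__":
    for script in missing_prerequisites():
        print(f"Data directory missing; consider running {script}")
```

The check only tests that each directory exists; it does not verify that the download completed or that the files inside are intact.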
