Skip to content

Python PDF parser for scientific publications: content and figures

License

Notifications You must be signed in to change notification settings

skuam/scipdf_parser

 
 

Repository files navigation

SciPDF Parser

A Python parser for scientific PDF based on GROBID.

Installation

Use pip to install from this Github repository

pip install git+https://github.com/skuam/scipdf_parser

Note

  • We also need an en_core_web_sm model for spacy, where you can run python -m spacy download en_core_web_sm to download it
  • You can change GROBID version in serve_grobid.sh to test the parser on a new GROBID version
python -m spacy download en_core_web_sm

Usage

Run the GROBID using the given bash script before parsing PDF

bash serve_grobid.sh

This script will download GROBID and run the service at default port 8070 (see more here). To parse a PDF provided in example_data folder or direct URL, use the following function:

import json
from scipdf.parse_pdf import SciPDFParser
from scipdf.models import Article

parser = SciPDFParser()

article:Article = parser.parse_pdf('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf')

print(json.dumps(article.dict(), indent = 4))
# output example
{
    "title": "A new method for measuring daytime sleepiness: the Epworth sleepiness scale.",
    "authors": "Murray Johns",
    "pub_date": "1991",
    "abstract": "Text of abstract",
    "sections": [
        {
            "heading": "Introduction",
            "text": "Text of introduction",
            "n_publication_ref": 1,
            "n_figure_ref": 1
        }
    ],
    "references": [
        {
            "title": "The Epworth Sleepiness Scale in Clinical Practice",
            "journal": "Sleep Breath",
            "year": "2017",
            "authors": "Chervin RD, et al."
        },
        {
            "title": "A new method for measuring daytime sleepiness: the Epworth sleepiness scale.",
            "journal": "Sleep",
            "year": "1991",
            "authors": "Johns MW"
        }
    ],
    "figures": [
        {
            "figure_label": "Figure 1",
            "figure_type": "table",
            "figure_id": "fig1",
            "figure_caption": "Caption of figure 1",
            "figure_data": "Data of figure 1"
        }
    ],
    "formulas": [
        {
            "formula_id": "f1",
            "formula_text": "a^2 + b^2 = c^2",
            "formula_coordinates": [
                1,
                2,
                3,
                4
            ]
        }
    ],
    "doi": "10.1111/j.1365-2869.1991.tb00031.x"
}

!!! Warning Parsing of figures is not supported yet in pydantic models, so you need to parse it manually. !!!

To parse figures from PDF using pdffigures2, you can run

from scipdf.parse_pdf import SciPDFParser
parser = SciPDFParser()
parser.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files

You can see example output figures in figures folder.

About

Python PDF parser for scientific publications: content and figures

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 99.0%
  • Shell 1.0%