A Python parser for scientific PDF based on GROBID.
Use pip
to install from this Github repository
pip install git+https://github.com/skuam/scipdf_parser
Note
- We also need an
en_core_web_sm
model for spacy, where you can runpython -m spacy download en_core_web_sm
to download it - You can change GROBID version in
serve_grobid.sh
to test the parser on a new GROBID version
python -m spacy download en_core_web_sm
Run the GROBID using the given bash script before parsing PDF
bash serve_grobid.sh
This script will download GROBID and run the service at default port 8070 (see more here).
To parse a PDF provided in example_data
folder or direct URL, use the following function:
import json
from scipdf.parse_pdf import SciPDFParser
from scipdf.models import Article
parser = SciPDFParser()
article:Article = parser.parse_pdf('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf')
print(json.dumps(article.dict(), indent = 4))
# output example
{
"title": "A new method for measuring daytime sleepiness: the Epworth sleepiness scale.",
"authors": "Murray Johns",
"pub_date": "1991",
"abstract": "Text of abstract",
"sections": [
{
"heading": "Introduction",
"text": "Text of introduction",
"n_publication_ref": 1,
"n_figure_ref": 1
}
],
"references": [
{
"title": "The Epworth Sleepiness Scale in Clinical Practice",
"journal": "Sleep Breath",
"year": "2017",
"authors": "Chervin RD, et al."
},
{
"title": "A new method for measuring daytime sleepiness: the Epworth sleepiness scale.",
"journal": "Sleep",
"year": "1991",
"authors": "Johns MW"
}
],
"figures": [
{
"figure_label": "Figure 1",
"figure_type": "table",
"figure_id": "fig1",
"figure_caption": "Caption of figure 1",
"figure_data": "Data of figure 1"
}
],
"formulas": [
{
"formula_id": "f1",
"formula_text": "a^2 + b^2 = c^2",
"formula_coordinates": [
1,
2,
3,
4
]
}
],
"doi": "10.1111/j.1365-2869.1991.tb00031.x"
}
!!! Warning Parsing of figures is not supported yet in pydantic models, so you need to parse it manually. !!!
To parse figures from PDF using pdffigures2, you can run
from scipdf.parse_pdf import SciPDFParser
parser = SciPDFParser()
parser.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files
You can see example output figures in figures
folder.