Supplementary Data and images
#suppdata and images/figures Comments by Petermr
Supplementary data varies considerably by publisher and repository, so don't expect a consistent description. It is also called:
- supplement(al/ary) data
- supporting information
- additional material
etc.
We take PMC7200000 as an example (from the publisher PLoS). Other publishers each treat this slightly differently.
There are no clear boundaries.
These are usually links (often hyperlinks) to other data sets or repositories such as Zenodo, Figshare, and biomedical databases (e.g. Genbank, Protein Data Bank). The metadata is varied and may be very sparse. The formats may be well known, but can be binary (tar.gz) or proprietary (XLS, Matlab, etc.).
These may be images, tables, maths, statistics, chemical schemes. Sometimes there is a diagram in the text and this is the supporting data.
This is very publisher-dependent. Typical components are referees' reviews and authors' responses, as well as required statements from authors about regulatory or other compliance.
In favourable cases the publisher may describe the data. In others you have to make your own judgement. It may be possible to create a per-publisher or per-repository workflow. Or it may not.
In some cases the supplementary data and images in the full text are bundled together (as in PMC7200000 below).
pygetpapers 0.0.4.2 was used to download supplementary files. Note that -s (supplementary files) has been available from the start of getpapers, but -z (FTP'ed zip files) was implemented by Ayush last week (2021-05-16). Note that PMC7200000_project is the CProject; in this case it contains one CTree (PMC7200000), but it could contain any number:
pm286macbook:test pm286$ pygetpapers -q PMC7200000 -s -x -p -z -o PMC7200000_project/
INFO: Total Hits are 1
WARNING: Could not find more papers
WARNING: Keywords not found for paper 1
INFO: Saving XML files to /Users/pm286/temp/test/PMC7200000_project/*/fulltext.xml
INFO: Wrote supplementary files for PMC7200000
INFO: Wrote zip files for PMC7200000
INFO: */Wrote xml for PMC7200000/
INFO: Wrote the pdf file for PMC7200000
(base) pm286macbook:test pm286$ tree
The resulting CProject is shown with comments (//). Note that there are 8 figures in the paper, 5 supplemental PDFs, 1 reviewers' section, and 5 tables.
The paper itself contains:
- actual images at different resolutions
- actual character-based tables, together with medium-res screenshots of them
- actual review text
- hyperlinks to supplementary/additional files
S1 Fig
Frequency histogram of distances moved by San Francisco garter snakes
(Thamnophis sirtalis tetrataenia) between captures at five sites sampled in 2018.
(PDF)
Click here for additional data file.(191K, pdf)
.
├── PMC7200000_project
│ ├── PMC7200000
│ │ ├── eupmc_result.json
│ │ ├── ftpfiles // triggered by -z
// images of special characters/glyphs in the text, about 21x21 pixels. generally not worth keeping
│ │ │ ├── pone.0231744.e001.jpg
│ │ │ ├── pone.0231744.e002.jpg
│ │ │ ├── pone.0231744.e003.jpg
│ │ │ ├── pone.0231744.e004.jpg
│ │ │ ├── pone.0231744.e005.jpg
│ │ │ ├── pone.0231744.e006.jpg
│ │ │ ├── pone.0231744.e007.jpg
│ │ │ ├── pone.0231744.e008.gif // duplicate. I don't know why it's just for this one
│ │ │ ├── pone.0231744.e008.jpg
// the 8 figures in the text. each has a thumbnail GIF (ca 100x100) and a higher resolution JPG (e.g. 780x1000)
│ │ │ ├── pone.0231744.g001.gif
│ │ │ ├── pone.0231744.g001.jpg
│ │ │ ├── pone.0231744.g002.gif
│ │ │ ├── pone.0231744.g002.jpg
│ │ │ ├── pone.0231744.g003.gif
│ │ │ ├── pone.0231744.g003.jpg
│ │ │ ├── pone.0231744.g004.gif
│ │ │ ├── pone.0231744.g004.jpg
│ │ │ ├── pone.0231744.g005.gif
│ │ │ ├── pone.0231744.g005.jpg
│ │ │ ├── pone.0231744.g006.gif
│ │ │ ├── pone.0231744.g006.jpg
│ │ │ ├── pone.0231744.g007.gif
│ │ │ ├── pone.0231744.g007.jpg
│ │ │ ├── pone.0231744.g008.gif
│ │ │ ├── pone.0231744.g008.jpg
// the full text of the paper (identical to fulltext.xml)
│ │ │ ├── pone.0231744.nxml
// supplementary files (presumably the same as the targets of the hyperlinks in the paper)
│ │ │ ├── pone.0231744.s001.pdf
│ │ │ ├── pone.0231744.s002.pdf
│ │ │ ├── pone.0231744.s003.pdf
│ │ │ ├── pone.0231744.s004.pdf
│ │ │ ├── pone.0231744.s005.pdf
// reviewers' and authors' comments (see below)
// there is no metadata describing what this file is and I don't know if it's universal
│ │ │ ├── pone.0231744.s006.docx
// tables as thumbnails (GIF) and medium-res (JPG). NOTE that the actual tables are published as
// characters in fulltext.xml and AMI can extract them. So these are probably not worth keeping
│ │ │ ├── pone.0231744.t001.gif
│ │ │ ├── pone.0231744.t001.jpg
│ │ │ ├── pone.0231744.t002.gif
│ │ │ ├── pone.0231744.t002.jpg
│ │ │ ├── pone.0231744.t003.gif
│ │ │ ├── pone.0231744.t003.jpg
│ │ │ ├── pone.0231744.t004.gif
│ │ │ ├── pone.0231744.t004.jpg
│ │ │ ├── pone.0231744.t005.gif
│ │ │ └── pone.0231744.t005.jpg
│ │ ├── fulltext.pdf
│ │ ├── fulltext.xml
// identical in content and name to the additional files above so only one copy needs keeping
│ │ └── supplementaryfiles
│ │ ├── pone.0231744.s001.pdf
│ │ ├── pone.0231744.s002.pdf
│ │ ├── pone.0231744.s003.pdf
│ │ ├── pone.0231744.s004.pdf
│ │ ├── pone.0231744.s005.pdf
│ │ └── pone.0231744.s006.docx
│ └── eupmc_results.json
└── eupmc_results.json
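The comment in the listing above says the files under supplementaryfiles are identical in name and content to the s00N files under ftpfiles. A minimal sketch for checking that claim before deleting anything is shown here. It is not part of pygetpapers; the function names `file_digest` and `find_duplicates` are made up for illustration, and it assumes the CTree layout shown above (sibling `ftpfiles` and `supplementaryfiles` directories).

```python
import hashlib
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_duplicates(ctree: Path) -> List[Path]:
    """List files in supplementaryfiles/ that also exist, byte for byte, in ftpfiles/.

    Assumes the CTree layout shown in the listing above. Returns an empty
    list if either directory is missing.
    """
    ftp_dir = ctree / "ftpfiles"
    supp_dir = ctree / "supplementaryfiles"
    if not (ftp_dir.is_dir() and supp_dir.is_dir()):
        return []
    duplicates = []
    for supp_file in sorted(supp_dir.iterdir()):
        twin = ftp_dir / supp_file.name
        # Same name is not enough: compare content hashes as well.
        if supp_file.is_file() and twin.is_file():
            if file_digest(supp_file) == file_digest(twin):
                duplicates.append(supp_file)
    return duplicates
```

Calling `find_duplicates(Path("PMC7200000_project/PMC7200000"))` would list the copies that could be dropped. Files that share a name but differ in content are deliberately not reported, so they can be inspected by hand.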
pygetpapers -q PMC7300000 -z -o SciRep_project
INFO: Total Hits are 1
WARNING: Could not find more papers
WARNING: Keywords not found for paper 1
INFO: Wrote zip files for PMC7300000
(base) pm286macbook:PMC_download_test pm286$ tree SciRep_project/
SciRep_project/
├── PMC7300000
│ ├── eupmc_result.json
│ └── ftpfiles
// small math equations/expressions
│ ├── 41598_2020_66260_Article_Equa.gif
│ ├── 41598_2020_66260_Article_Equb.gif
│ ├── 41598_2020_66260_Article_Equc.gif
│ ├── 41598_2020_66260_Article_Equd.gif
│ ├── 41598_2020_66260_Article_Eque.gif
// larger math equations/expressions
│ ├── 41598_2020_66260_Article_IEq1.gif
│ ├── 41598_2020_66260_Article_IEq2.gif
│ ├── 41598_2020_66260_Article_IEq3.gif
│ ├── 41598_2020_66260_Article_IEq4.gif
│ ├── 41598_2020_66260_Article_IEq5.gif
│ ├── 41598_2020_66260_Article_IEq6.gif
│ ├── 41598_2020_66260_Article_IEq7.gif
│ ├── 41598_2020_66260_Article_IEq8.gif
// Figures: GIF = thumbnail, JPG = medium-res
│ ├── 41598_2020_66260_Fig1_HTML.gif
│ ├── 41598_2020_66260_Fig1_HTML.jpg
│ ├── 41598_2020_66260_Fig2_HTML.gif
│ ├── 41598_2020_66260_Fig2_HTML.jpg
│ ├── 41598_2020_66260_Fig3_HTML.gif
│ ├── 41598_2020_66260_Fig3_HTML.jpg
│ ├── 41598_2020_66260_Fig4_HTML.gif
│ ├── 41598_2020_66260_Fig4_HTML.jpg
│ ├── 41598_2020_66260_Fig5_HTML.gif
│ ├── 41598_2020_66260_Fig5_HTML.jpg
│ ├── 41598_2020_66260_Fig6_HTML.gif
│ ├── 41598_2020_66260_Fig6_HTML.jpg
│ ├── 41598_2020_66260_Fig7_HTML.gif
│ ├── 41598_2020_66260_Fig7_HTML.jpg
// remote supporting information
│ ├── 41598_2020_66260_MOESM1_ESM.pdf
│ ├── 41598_2020_66260_MOESM2_ESM.xlsx
│ ├── 41598_2020_66260_MOESM3_ESM.xlsx
// fulltext.xml
│ └── 41598_2020_Article_66260.nxml
└── eupmc_results.json
No zip file, and crashes (pygetpapers needs mending)
pygetpapers -q PMC7500000 -z -o BMC_project
INFO: Total Hits are 1
WARNING: Could not find more papers
INFO: Wrote zip files for PMC7500000
(base) pm286macbook:PMC_download_test pm286$ tree BMC_project/
BMC_project/
├── PMC7500000
│ ├── eupmc_result.json
│ └── ftpfiles
│ ├── 12920_2020_789_Fig1_HTML.jpg
│ ├── 12920_2020_789_Fig2_HTML.jpg
│ ├── 12920_2020_789_MOESM1_ESM.pdf
│ ├── 12920_2020_789_MOESM2_ESM.xls
│ └── 12920_2020_Article_789.nxml
└── eupmc_results.json
The JPGs are the figures in the paper.
The ESM files are electronic supplementary material (i.e. suppdata).
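Since each publisher names its files differently, a per-publisher workflow probably starts with sorting filenames into kinds. The sketch below does this with regular expressions inferred only from the listings on this page (PLoS `.e/.g/.s/.t` infixes, `_Fig`, `_MOESM`, `_Article_Equ`/`IEq`, and the `-g001` style figures). The labels and the function names `classify` and `summarise` are invented for illustration, and other journals, even from the same publisher, may not follow these patterns.

```python
import re
from collections import Counter
from typing import Iterable

# Patterns are guesses based on the example listings only, checked in order.
PATTERNS = [
    ("fulltext", re.compile(r"\.nxml$")),
    ("supplementary", re.compile(r"\.s\d+\.\w+$|_MOESM\d+_ESM\.\w+$")),
    ("figure", re.compile(r"\.g\d+\.(gif|jpg)$|_Fig\d+_HTML\.(gif|jpg)$|-g\d+\.jpg$")),
    ("table_image", re.compile(r"\.t\d+\.(gif|jpg)$")),
    ("equation_or_glyph", re.compile(r"\.e\d+\.(gif|jpg)$|_Article_(Equ[a-z]+|IEq\d+)\.gif$")),
]


def classify(filename: str) -> str:
    """Return a rough kind for one ftpfiles filename, or "other" if nothing matches."""
    for label, pattern in PATTERNS:
        if pattern.search(filename):
            return label
    return "other"


def summarise(filenames: Iterable[str]) -> Counter:
    """Count how many files of each kind appear in a listing."""
    return Counter(classify(name) for name in filenames)
```

Note that figure and table counts are per file, so a figure with both a GIF thumbnail and a JPG counts twice. Anything reported as "other" is a sign that the patterns need extending for that publisher.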
pygetpapers -q PMC7600000 -z -o MDPI_project
INFO: Total Hits are 1
WARNING: Could not find more papers
INFO: Wrote zip files for PMC7600000
(base) pm286macbook:PMC_download_test pm286$ tree MDPI_project/
MDPI_project/
├── PMC7400000
├── PMC7600000
│ ├── eupmc_result.json
│ └── ftpfiles
│ ├── children-07-00174-g001.jpg
│ └── children-07-00174.nxml
└── eupmc_results.json
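The listing above contains an empty PMC7400000 directory alongside PMC7600000; the notes do not say where it came from. A small audit like the following sketch can flag such CTrees after a download. It assumes, as a simplification, that every subdirectory of the CProject is a CTree; the function name `audit_cproject` and the report keys are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def audit_cproject(cproject: Path) -> Dict[str, List[str]]:
    """Group CTree directory names by whether they hold a non-empty ftpfiles/ directory.

    Assumes every subdirectory of the CProject is a CTree.
    """
    report: Dict[str, List[str]] = {"with_ftpfiles": [], "without_ftpfiles": []}
    for ctree in sorted(p for p in cproject.iterdir() if p.is_dir()):
        ftp_dir = ctree / "ftpfiles"
        # A CTree counts as populated only if ftpfiles/ exists and has entries.
        has_files = ftp_dir.is_dir() and any(ftp_dir.iterdir())
        key = "with_ftpfiles" if has_files else "without_ftpfiles"
        report[key].append(ctree.name)
    return report
```

Run on a project shaped like MDPI_project above, this would put PMC7600000 under `with_ftpfiles` and PMC7400000 under `without_ftpfiles`.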
pygetpapers -q PMC7700000 -z -o Dove_project
INFO: Total Hits are 1
WARNING: Could not find more papers
INFO: Wrote zip files for PMC7700000
(base) pm286macbook:PMC_download_test pm286$ tree Dove_project/
Dove_project/
├── PMC7700000
│ ├── eupmc_result.json
│ └── ftpfiles
│ ├── OPTH-14-4047-g0001.jpg
│ └── opth-14-4047.nxml
└── eupmc_results.json
pygetpapers -q PMC7800000 -z -o Mitchondrial_project
INFO: Total Hits are 1
WARNING: Could not find more papers
Traceback (most recent call last):
File "/opt/anaconda3/bin/pygetpapers", line 8, in <module>
sys.exit(main())
File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 603, in main
callpygetpapers.handlecli()
File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 586, in handlecli
self.apipaperdownload(args.query, args.limit,
File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 377, in apipaperdownload
self.makexmlfiles(read_json, getpdf=getpdf, makecsv=makecsv, makexml=makexml, makehtml=makehtml,
File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 260, in makexmlfiles
self.download_tools.getsupplementaryfiles(
File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/download_tools.py", line 337, in getsupplementaryfiles
z = zipfile.ZipFile(io.BytesIO(r.content))
File "/opt/anaconda3/lib/python3.8/zipfile.py", line 1268, in __init__
self._RealGetContents()
File "/opt/anaconda3/lib/python3.8/zipfile.py", line 1335, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
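The traceback shows `zipfile.ZipFile(io.BytesIO(r.content))` raising BadZipFile, which means the downloaded bytes were not a zip archive (for example, the server may have returned something else for this paper). One possible way to avoid the crash is to test the payload with `zipfile.is_zipfile` before opening it. The sketch below is not the pygetpapers code and not necessarily how it should be mended; `extract_zip_bytes` is a hypothetical helper, and returning an empty list for a non-zip payload is just one choice.

```python
import io
import zipfile
from pathlib import Path
from typing import List


def extract_zip_bytes(payload: bytes, destination: Path) -> List[str]:
    """Extract a downloaded payload if it is a zip archive and return the member names.

    Returns an empty list, instead of raising BadZipFile, when the payload
    is not a zip file.
    """
    buffer = io.BytesIO(payload)
    if not zipfile.is_zipfile(buffer):
        return []
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        archive.extractall(destination)
        return archive.namelist()
```

A caller could log a warning such as "no zip file for PMC7800000" when the returned list is empty and then carry on with the next paper.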