Supplementary Data and images

petermr edited this page May 23, 2021 · 6 revisions

#suppdata and images/figures Comments by Petermr

Supplementary data varies considerably by publisher and repository, so don't expect a consistent description. It is also called:

  • supplement(al/ary) data
  • supporting information
  • additional material

etc.

example

We take PMC7200000 as an example (from the publisher PLoS). Other publishers would each treat this slightly differently.

types

There are no clear boundaries.

information external to the manuscript

These are usually links (often hyperlinks) to other data sets or repositories such as Zenodo, Figshare, and biomedical databases (e.g. GenBank, Protein Data Bank). The metadata is varied and may be very sparse. The formats may be well known, but can be binary (tar.gz) or proprietary (XLS, MATLAB, etc.).
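As a rough illustration of spotting such external links in a paper's text, here is a minimal sketch. The host names `zenodo.org` and `figshare.com`, the `REPOSITORY_HOSTS` table and the helper `find_repository_links` are assumptions made for this example; they are not part of pygetpapers or AMI, and real papers will link to many more repositories.

```python
import re
from urllib.parse import urlparse

# Assumed host names for two of the repositories named above; extend as needed.
REPOSITORY_HOSTS = {"zenodo.org": "Zenodo", "figshare.com": "Figshare"}

# Deliberately simple URL pattern: stops at whitespace, quotes, angle brackets and ")".
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def find_repository_links(text):
    """Return (repository name, url) pairs for URLs whose host is a known repository."""
    found = []
    for url in URL_PATTERN.findall(text):
        url = url.rstrip(".,;")  # drop sentence punctuation stuck to the URL
        host = (urlparse(url).hostname or "").lower()
        for known_host, name in REPOSITORY_HOSTS.items():
            # Match the host itself or any subdomain of it.
            if host == known_host or host.endswith("." + known_host):
                found.append((name, url))
    return found
```

Matching on the host name rather than on the whole URL keeps the sketch independent of each repository's path layout, which varies.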

information partially or wholly published in "print"

These may be images, tables, maths, statistics, or chemical schemes. Sometimes a diagram in the text is itself the supporting data.

ancillary publication metadata

This is very publisher-dependent. Typical components are referees' reviews and authors' responses, along with required statements from authors about regulatory or other compliance.

summary document/s

In favourable cases the publisher may describe the data. In others you have to make your own judgement. It may be possible to create a per-publisher or per-repository workflow. Or it may not.

suppdata and images

In some cases the supplementary data and the images in the full text are bundled together (as in PMC7200000 below).

example PMC7200000 from EuropePMC

pygetpapers 0.0.4.2 was used to download the supplementary files. Note that -s (supplementary files) has been available from the start of getpapers, but -z (FTP'ed zip files) was implemented by Ayush only last week (2021-05-16). Note that PMC7200000_project is the CProject; in this case it contains one CTree (PMC7200000), but it could contain any number:

 pm286macbook:test pm286$ pygetpapers -q PMC7200000 -s -x -p -z -o PMC7200000_project/
INFO: Total Hits are 1
WARNING: Could not find more papers
WARNING: Keywords not found for paper 1
INFO: Saving XML files to /Users/pm286/temp/test/PMC7200000_project/*/fulltext.xml
INFO: Wrote supplementary files for PMC7200000
INFO: Wrote zip files for PMC7200000
INFO: */Wrote xml for PMC7200000/
INFO: Wrote the pdf file for PMC7200000
(base) pm286macbook:test pm286$ tree

The resulting CProject is shown with comments (//). Note that there are 8 figures in the paper, 5 supplemental PDFs, 1 reviewers section, and 5 tables.

The paper itself contains:

  • actual images at different resolutions
  • actual character-based tables, together with medium-res screenshots of them
  • actual review text
  • hyperlinks to supplementary/additional files, for example:

S1 Fig
Frequency histogram of distances moved by San Francisco garter snakes
(Thamnophis sirtalis tetrataenia) between captures at five sites sampled in 2018.
(PDF)

Click here for additional data file. (191K, pdf)
.
├── PMC7200000_project
│   ├── PMC7200000
│   │   ├── eupmc_result.json
│   │   ├── ftpfiles                    // triggered by -z

// images of special characters/glyphs in the text, about 21x21 pixels; generally not worth keeping
│   │   │   ├── pone.0231744.e001.jpg
│   │   │   ├── pone.0231744.e002.jpg
│   │   │   ├── pone.0231744.e003.jpg
│   │   │   ├── pone.0231744.e004.jpg
│   │   │   ├── pone.0231744.e005.jpg
│   │   │   ├── pone.0231744.e006.jpg
│   │   │   ├── pone.0231744.e007.jpg
│   │   │   ├── pone.0231744.e008.gif   // duplicate; I don't know why there is a GIF just for this one
│   │   │   ├── pone.0231744.e008.jpg
// the 8 figures in the text. each has a thumbnail GIF (ca 100x100) and a higher resolution JPG (e.g. 780x1000)
│   │   │   ├── pone.0231744.g001.gif
│   │   │   ├── pone.0231744.g001.jpg
│   │   │   ├── pone.0231744.g002.gif
│   │   │   ├── pone.0231744.g002.jpg
│   │   │   ├── pone.0231744.g003.gif
│   │   │   ├── pone.0231744.g003.jpg
│   │   │   ├── pone.0231744.g004.gif
│   │   │   ├── pone.0231744.g004.jpg
│   │   │   ├── pone.0231744.g005.gif
│   │   │   ├── pone.0231744.g005.jpg
│   │   │   ├── pone.0231744.g006.gif
│   │   │   ├── pone.0231744.g006.jpg
│   │   │   ├── pone.0231744.g007.gif
│   │   │   ├── pone.0231744.g007.jpg
│   │   │   ├── pone.0231744.g008.gif
│   │   │   ├── pone.0231744.g008.jpg

// the full text of the paper (identical to fulltext.xml)
│   │   │   ├── pone.0231744.nxml

// supplementary files (presumably the same as the targets of the hyperlinks in the paper)
│   │   │   ├── pone.0231744.s001.pdf
│   │   │   ├── pone.0231744.s002.pdf
│   │   │   ├── pone.0231744.s003.pdf
│   │   │   ├── pone.0231744.s004.pdf
│   │   │   ├── pone.0231744.s005.pdf

// reviewer's and author's comments (see below)
// there is no metadata describing what this file is and I don't know if it's universal

│   │   │   ├── pone.0231744.s006.docx

// tables as thumbnails (GIF) and medium-res (JPG). NOTE that the actual tables are published as 
// characters in fulltext.xml and AMI can extract them. So these are probably not worth keeping

│   │   │   ├── pone.0231744.t001.gif
│   │   │   ├── pone.0231744.t001.jpg
│   │   │   ├── pone.0231744.t002.gif
│   │   │   ├── pone.0231744.t002.jpg
│   │   │   ├── pone.0231744.t003.gif
│   │   │   ├── pone.0231744.t003.jpg
│   │   │   ├── pone.0231744.t004.gif
│   │   │   ├── pone.0231744.t004.jpg
│   │   │   ├── pone.0231744.t005.gif
│   │   │   └── pone.0231744.t005.jpg

│   │   ├── fulltext.pdf
│   │   ├── fulltext.xml

// identical in content and name to the additional files above so only one copy needs keeping
│   │   └── supplementaryfiles
│   │       ├── pone.0231744.s001.pdf
│   │       ├── pone.0231744.s002.pdf
│   │       ├── pone.0231744.s003.pdf
│   │       ├── pone.0231744.s004.pdf
│   │       ├── pone.0231744.s005.pdf
│   │       └── pone.0231744.s006.docx
│   └── eupmc_results.json
└── eupmc_results.json
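The comments in the listing above suggest that, for this PLoS paper, the letter before the serial number encodes the kind of file (e, g, s, t). A minimal sketch of sorting filenames on that basis follows. The mapping `KIND_BY_CODE` and the helper `classify_plos_file` are hypothetical, inferred only from this one listing, and may not hold for other PLoS papers or other publishers.

```python
import re

# Hypothetical mapping from the letter code in names such as
# pone.0231744.g001.jpg to the categories described in the listing.
KIND_BY_CODE = {
    "e": "glyph/equation image",
    "g": "figure",
    "s": "supplementary file",
    "t": "table image",
}

# journal prefix . article number . letter code + serial . extension
PLOS_NAME = re.compile(
    r"^(?P<journal>[a-z]+)\.(?P<article>\d+)\.(?P<code>[a-z])(?P<serial>\d+)\.(?P<ext>\w+)$"
)


def classify_plos_file(filename):
    """Return (kind, serial, extension) for a PLoS-style filename, or None if it does not fit."""
    if filename.endswith(".nxml"):
        return ("full text", None, "nxml")
    match = PLOS_NAME.match(filename)
    if match is None:
        return None
    kind = KIND_BY_CODE.get(match["code"], "unknown")
    return (kind, int(match["serial"]), match["ext"])
```

With a classification like this, the glyph images and table images (which the comments judge not worth keeping) could be skipped, while figures and supplementary files are retained.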

Example PMC7300000 from Scientific Reports

pygetpapers -q PMC7300000 -z -o SciRep_project
INFO: Total Hits are 1
WARNING: Could not find more papers
WARNING: Keywords not found for paper 1
INFO: Wrote zip files for PMC7300000
(base) pm286macbook:PMC_download_test pm286$ tree SciRep_project/
SciRep_project/
├── PMC7300000
│   ├── eupmc_result.json
│   └── ftpfiles

// small math equations/expressions
│       ├── 41598_2020_66260_Article_Equa.gif
│       ├── 41598_2020_66260_Article_Equb.gif
│       ├── 41598_2020_66260_Article_Equc.gif
│       ├── 41598_2020_66260_Article_Equd.gif
│       ├── 41598_2020_66260_Article_Eque.gif
// larger math equations/expressions
│       ├── 41598_2020_66260_Article_IEq1.gif
│       ├── 41598_2020_66260_Article_IEq2.gif
│       ├── 41598_2020_66260_Article_IEq3.gif
│       ├── 41598_2020_66260_Article_IEq4.gif
│       ├── 41598_2020_66260_Article_IEq5.gif
│       ├── 41598_2020_66260_Article_IEq6.gif
│       ├── 41598_2020_66260_Article_IEq7.gif
│       ├── 41598_2020_66260_Article_IEq8.gif
// Figures: GIF = thumbnail, JPG = medium-res
│       ├── 41598_2020_66260_Fig1_HTML.gif
│       ├── 41598_2020_66260_Fig1_HTML.jpg
│       ├── 41598_2020_66260_Fig2_HTML.gif
│       ├── 41598_2020_66260_Fig2_HTML.jpg
│       ├── 41598_2020_66260_Fig3_HTML.gif
│       ├── 41598_2020_66260_Fig3_HTML.jpg
│       ├── 41598_2020_66260_Fig4_HTML.gif
│       ├── 41598_2020_66260_Fig4_HTML.jpg
│       ├── 41598_2020_66260_Fig5_HTML.gif
│       ├── 41598_2020_66260_Fig5_HTML.jpg
│       ├── 41598_2020_66260_Fig6_HTML.gif
│       ├── 41598_2020_66260_Fig6_HTML.jpg
│       ├── 41598_2020_66260_Fig7_HTML.gif
│       ├── 41598_2020_66260_Fig7_HTML.jpg
// remote supporting information
│       ├── 41598_2020_66260_MOESM1_ESM.pdf
│       ├── 41598_2020_66260_MOESM2_ESM.xlsx
│       ├── 41598_2020_66260_MOESM3_ESM.xlsx
// fulltext.xml
│       └── 41598_2020_Article_66260.nxml
└── eupmc_results.json

Example PMC7400000 from MDPI

No zip file is available, and pygetpapers crashes (it needs mending).

Example PMC7500000 from BMC

pygetpapers -q PMC7500000 -z -o BMC_project
INFO: Total Hits are 1
WARNING: Could not find more papers
INFO: Wrote zip files for PMC7500000
(base) pm286macbook:PMC_download_test pm286$ tree BMC_project/
BMC_project/
├── PMC7500000
│   ├── eupmc_result.json
│   └── ftpfiles
│       ├── 12920_2020_789_Fig1_HTML.jpg
│       ├── 12920_2020_789_Fig2_HTML.jpg
│       ├── 12920_2020_789_MOESM1_ESM.pdf
│       ├── 12920_2020_789_MOESM2_ESM.xls
│       └── 12920_2020_Article_789.nxml
└── eupmc_results.json

The JPGs are from the figures in the paper.

The ESM files are electronic supplemental material (i.e. suppdata).
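The Scientific Reports and BMC listings share a naming scheme (`_Fig1_HTML`, `_MOESM1_ESM`, `_Article_...`), so a small ordered rule table may be enough to sort these files. This is a sketch only: `SPRINGER_RULES` and `classify_springer_file` are hypothetical names, and the patterns are inferred from the two listings above, not from any publisher documentation.

```python
import re

# Ordered (pattern, kind) rules; the first match wins. The kinds follow the
# comments in the listings and are a judgement, not publisher metadata.
SPRINGER_RULES = [
    (re.compile(r"_Article_IEq\d+\.\w+$"), "equation image (IEq)"),
    (re.compile(r"_Article_Equ[a-z]+\.\w+$"), "equation image (Equ)"),
    (re.compile(r"_Fig\d+_HTML\.\w+$"), "figure"),
    (re.compile(r"_MOESM\d+_ESM\.\w+$"), "supplementary file"),
    (re.compile(r"_Article_\d+\.nxml$"), "full text"),
]


def classify_springer_file(filename):
    """Return the kind of a file named in the style seen above, or "unknown"."""
    for pattern, kind in SPRINGER_RULES:
        if pattern.search(filename):
            return kind
    return "unknown"
```

A per-publisher workflow could keep one such rule table per naming scheme and fall back to "unknown" for anything it does not recognise.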

Example PMC7600000 from MDPI

pygetpapers -q PMC7600000 -z -o MDPI_project
INFO: Total Hits are 1
WARNING: Could not find more papers
INFO: Wrote zip files for PMC7600000
(base) pm286macbook:PMC_download_test pm286$ tree MDPI_project/
MDPI_project/
├── PMC7400000
├── PMC7600000
│   ├── eupmc_result.json
│   └── ftpfiles
│       ├── children-07-00174-g001.jpg
│       └── children-07-00174.nxml
└── eupmc_results.json

Example PMC7700000 from Dove

pygetpapers -q PMC7700000 -z -o Dove_project
INFO: Total Hits are 1
WARNING: Could not find more papers
INFO: Wrote zip files for PMC7700000
(base) pm286macbook:PMC_download_test pm286$ tree Dove_project/
Dove_project/
├── PMC7700000
│   ├── eupmc_result.json
│   └── ftpfiles
│       ├── OPTH-14-4047-g0001.jpg
│       └── opth-14-4047.nxml
└── eupmc_results.json

Example PMC7800000 from Mitochondrial DNA

pygetpapers -q PMC7800000 -z -o Mitchondrial_project
INFO: Total Hits are 1
WARNING: Could not find more papers
Traceback (most recent call last):
  File "/opt/anaconda3/bin/pygetpapers", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 603, in main
    callpygetpapers.handlecli()
  File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 586, in handlecli
    self.apipaperdownload(args.query, args.limit,
  File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 377, in apipaperdownload
    self.makexmlfiles(read_json, getpdf=getpdf, makecsv=makecsv, makexml=makexml, makehtml=makehtml,
  File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 260, in makexmlfiles
    self.download_tools.getsupplementaryfiles(
  File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/download_tools.py", line 337, in getsupplementaryfiles
    z = zipfile.ZipFile(io.BytesIO(r.content))
  File "/opt/anaconda3/lib/python3.8/zipfile.py", line 1268, in __init__
    self._RealGetContents()
  File "/opt/anaconda3/lib/python3.8/zipfile.py", line 1335, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
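The traceback ends in `zipfile.BadZipFile`, which is raised when the downloaded bytes are not a zip archive (for example, when the server returns something else in place of the zip). A minimal sketch of one way to guard against this, assuming the response body is available as bytes (as `r.content` in the traceback suggests). `extract_if_zip` is a hypothetical helper for illustration and is not part of pygetpapers.

```python
import io
import zipfile


def extract_if_zip(content, target_dir):
    """Extract zip bytes into target_dir.

    Returns the list of member names, or None if content is not a zip archive.
    """
    buffer = io.BytesIO(content)
    # is_zipfile accepts a file-like object and returns False instead of raising.
    if not zipfile.is_zipfile(buffer):
        return None
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

A caller could log a warning and move on to the next paper when the function returns None, so that one missing zip does not stop a whole download run.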