Issues Encountered when Metadata Scraping
James A. Fellows Yates edited this page Nov 9, 2022 · 24 revisions
Many papers report metadata in a heterogeneous way. This can make rapid retrieval and compilation of useful data difficult when generating comparative datasets.
This page acts as a 'whiteboard' for people to note down common problems they have when adding studies to AncientMetagenomeDir. This will help the ancient metagenomics community to begin to define reporting standards in the field, in a manner that is easy to access and understand.
- In many cases radiocarbon dates are reported inconsistently
- No uncalibrated date
- No radiocarbon lab code (like OxA-0000 or MAMS-0000)
- Only ranges provided (no median or midpoint)
- Reports in AD or BP (or both!?!!)
- No calibration curve reported
- In some cases sample codes reported in the manuscript are not the same codes that are used in data upload to ENA/SRA
- Multiple sample codes within a paper, difficult to trace
- Sample collection date is not frequently reported in manuscripts
- The latitude/longitude coordinates of sites are not often reported
- Particularly an issue when the site name is common and no additional locality information is given
- Range of uploaded data types (particularly for single genomes)
- Consensus sequences in github BEAST XML files, not on GenBank
- Mapped reads only uploaded as FASTQ files
- Not reproducible! What if I want to do different pre-processing/mapping?
- Data stated to be consensus in article, but in GenBank/ENA it is raw data
- Each library gets its own sample accession despite being from the same sample [PRJEB35483]
- People upload e.g. BAM files that have non-UDG and UDG-full reads merged together, so can't tell which is which [PRJNA348634]
- Discrepancy between instrument model reported in paper vs ENA metadata [PRJNA688065, PRJNA643812, PRJEB19769, PRJNA417381]
- No record of which libraries are UDG treated vs non-UDG treated (even though both were generated)
- Discrepancies between sequencing cycles reported in paper, and those detected during ENA processing [PRJEB31971, PRJEB24499]
- No library names [PRJEB32319]
- No polymerase or library construction information (just fill in, nothing about indexing etc) [common]
- Citation chain that leads to non-relevant citation. E.g. [PRJEB41353] -> partial-UDG treatment as per a modified version of Rohland as described in Krause-Kyora 2018a, but the two protocols in KK2018 are for non-UDG or UDG full, Rohland not cited...; [PRJNA320875] Following Graham et al. 2016, who cites Meyer and Kircher with modifications reported in Heintzman 2015, but Heintzman 2015 reports no modifications..?
- Or Guellil2022b -> Scheib2018 -> Rasmussen 2014 just to find the polymerase (and Rasmussen 2014 is a variant of Meyer and Kircher...)...
- Multiple sample/library names, some not reported in paper (e.g. the AncientMetagenomeDir sample_name (from paper), ENA sample_alias etc., uploaded FASTQ files etc.) [PRJEB31971, PRJEB19769]
- Note: not necessarily bad if well reported in publication like this one - but it can happen that the paper doesn't match the ENA data
- Reported as paired-end sequencing, but only a single FASTQ uploaded (presumably already merged?) [PRJEB45013]
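
The last item above (paired-end reported, but only one FASTQ uploaded) is the kind of mismatch that could be flagged automatically. The following is only a rough sketch in Python, not an AncientMetagenomeDir tool. It assumes each run is described by a dictionary with `run_accession`, `library_layout` and `fastq_ftp` keys, with `fastq_ftp` holding file paths separated by `;`. The function name `flag_layout_mismatches` and the example rows (including the accessions and paths) are made up for illustration.

```python
def flag_layout_mismatches(rows):
    """Return run accessions whose declared layout disagrees with the
    number of FASTQ files listed for that run."""
    flagged = []
    for row in rows:
        # Split the ';'-separated file list, ignoring empty entries.
        files = [f for f in row.get("fastq_ftp", "").split(";") if f]
        layout = row.get("library_layout", "").upper()
        if layout == "PAIRED" and len(files) < 2:
            # Declared paired-end, but fewer than two files uploaded.
            flagged.append(row["run_accession"])
        elif layout == "SINGLE" and len(files) > 1:
            # Declared single-end, but more than one file uploaded.
            flagged.append(row["run_accession"])
    return flagged


# Hypothetical rows for illustration only (not real accessions or paths).
example_rows = [
    {"run_accession": "RUN_A", "library_layout": "PAIRED",
     "fastq_ftp": "example/RUN_A_1.fastq.gz;example/RUN_A_2.fastq.gz"},
    {"run_accession": "RUN_B", "library_layout": "PAIRED",
     "fastq_ftp": "example/RUN_B.fastq.gz"},
    {"run_accession": "RUN_C", "library_layout": "SINGLE",
     "fastq_ftp": "example/RUN_C.fastq.gz"},
]

print(flag_layout_mismatches(example_rows))
```

A flagged run is not necessarily an error: a single file may hold already-merged or interleaved reads, so each hit still needs checking against the paper.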