Issues Encountered when Metadata Scraping

Introduction

Many papers report metadata in a heterogeneous way. This can make rapid retrieval and compilation of useful data difficult when generating comparative datasets.

This page acts as a 'whiteboard' for people to note down common problems they have when adding studies to AncientMetagenomeDir. This will help the ancient metagenomics community to begin to define reporting standards in the field, in an easy to access and understand manner.

Issues

Samples

Sample Ages

In many cases radiocarbon dates are reported inconsistently:
- No uncalibrated date
- No radiocarbon lab code (like OxA-0000 or MAMS-0000)
- Only provides ranges (no median midpoint)
- Reports in AD or BP (or both!?!!)
- No calibration curve reported
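
When a paper gives only a calibrated range, in AD/BC or in BP, a curator has to bring it onto one scale by hand. The sketch below is one possible way to do that, not an AncientMetagenomeDir tool: the function names `to_cal_bp` and `range_midpoint_bp` are hypothetical, and it assumes the usual convention that BP counts back from 1950 CE with no year zero. It applies only to calibrated calendar dates, since uncalibrated radiocarbon ages cannot be converted this way. The midpoint of a range is also only a rough stand-in for the median of the calibrated probability distribution.

```python
def to_cal_bp(year: int, era: str) -> int:
    """Convert a calibrated calendar year to cal BP (years before 1950 CE)."""
    era = era.upper()
    if era in ("AD", "CE"):
        return 1950 - year
    if era in ("BC", "BCE"):
        # There is no year zero, so 1 BC corresponds to 1950 BP.
        return 1949 + year
    if era == "BP":
        return year
    raise ValueError(f"unknown era: {era!r}")


def range_midpoint_bp(start: int, end: int, era: str) -> int:
    """Midpoint (in cal BP, rounded down) of a calibrated date range.

    The two bounds may be given in either order.
    """
    first = to_cal_bp(start, era)
    second = to_cal_bp(end, era)
    return (first + second) // 2


# A range reported as 1000-1200 AD spans 950-750 cal BP.
print(range_midpoint_bp(1000, 1200, "AD"))
```

Reporting the uncalibrated date, lab code and calibration curve alongside the range would remove the need for this kind of approximation altogether.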

Sample Codes

In some cases sample codes reported in the manuscript are not the same codes that are used in data upload to ENA/SRA
Multiple sample codes within a paper, difficult to trace

Sample Collection Date

Sample collection date is not frequently reported in manuscripts

Site latitude/longitude

The latitude/longitude coordinates of sites are not often reported
- A particular issue when the site name is common and no additional locality information is given

Sequencing Data / Accession Codes

Range of uploaded data types (particularly for single genomes)
Consensus sequences only in BEAST XML files on GitHub, not on GenBank
Only mapped reads uploaded, as FASTQ files
- Not reproducible! What if I want to do different pre-processing/mapping?
Data stated to be consensus in article, but in GenBank/ENA it is raw data

Libraries

Each library gets its own sample accession despite being from the same sample [PRJEB35483]
People upload e.g. BAM files that have non-UDG and UDG-full reads merged together, so can't tell which is which [PRJNA348634]
Discrepancy between instrument model reported in paper vs ENA metadata [PRJNA688065, PRJNA643812, PRJEB19769, PRJNA417381]
No record of which libraries are UDG treated vs non-UDG treated (even though both were generated)
Discrepancies between sequencing cycles reported in paper and those detected during ENA processing [PRJEB31971, PRJEB24499]
No library names [PRJEB32319]
No polymerase or library construction information (just the fill-in step, nothing about indexing etc.) [common]
- Citation chain that leads to a non-relevant citation. E.g. [PRJEB41353] -> partial-UDG treatment as per a modified version of Rohland as described in Krause-Kyora 2018a, but the two protocols in Krause-Kyora 2018a are for non-UDG or UDG-full, and Rohland is not cited; [PRJNA320875] follows Graham et al. 2016, who cite Meyer and Kircher with modifications reported in Heintzman 2015, but Heintzman 2015 reports no modifications
  - Or Guellil2022b -> Scheib2018 -> Rasmussen 2014 just to find the polymerase (and Rasmussen 2014 is a variant of Meyer and Kircher)
Multiple sample/library names, some not reported in paper (e.g. the AncientMetagenomeDir sample_name (from paper), ENA sample_alias, uploaded FASTQ file names etc.) [PRJEB31971, PRJEB19769]
- Note: not necessarily bad if well reported in the publication, as in this one, but it can occur that the paper doesn't match the ENA data
Reported as paired-end sequencing, but only a single FASTQ uploaded (presumably already merged?) [PRJEB45013]
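
Some of the mismatches above can be caught mechanically before a study is added. The following is a minimal sketch, not part of any AncientMetagenomeDir tooling. It assumes a tab-separated run report with the columns `run_accession`, `library_layout`, `instrument_model` and `fastq_ftp` (the names used in ENA file reports, with multiple FASTQ paths separated by `;`). The function names, the `RUN_A`-style accessions and the file paths are all made up for illustration.

```python
import csv
import io


def find_layout_mismatches(report_tsv: str) -> list[str]:
    """Return runs declared as PAIRED that list fewer than two FASTQ files."""
    flagged = []
    reader = csv.DictReader(io.StringIO(report_tsv), delimiter="\t")
    for row in reader:
        files = [f for f in row["fastq_ftp"].split(";") if f]
        if row["library_layout"].upper() == "PAIRED" and len(files) < 2:
            flagged.append(row["run_accession"])
    return flagged


def find_instrument_mismatches(
    report_tsv: str, paper_models: dict[str, str]
) -> list[tuple[str, str, str]]:
    """Return (run, model in paper, model in report) where the two disagree.

    `paper_models` maps run accession to the instrument named in the paper;
    runs missing from it are skipped.
    """
    flagged = []
    reader = csv.DictReader(io.StringIO(report_tsv), delimiter="\t")
    for row in reader:
        run = row["run_accession"]
        reported = row["instrument_model"].strip()
        in_paper = paper_models.get(run)
        if in_paper is not None and in_paper.strip().casefold() != reported.casefold():
            flagged.append((run, in_paper, reported))
    return flagged


# Hypothetical report: RUN_A is labelled PAIRED but has a single FASTQ file.
EXAMPLE_REPORT = "\n".join([
    "run_accession\tlibrary_layout\tinstrument_model\tfastq_ftp",
    "RUN_A\tPAIRED\tIllumina HiSeq 4000\thost/RUN_A.fastq.gz",
    "RUN_B\tPAIRED\tIllumina HiSeq 4000\thost/RUN_B_1.fastq.gz;host/RUN_B_2.fastq.gz",
    "RUN_C\tSINGLE\tNextSeq 500\thost/RUN_C.fastq.gz",
])

print(find_layout_mismatches(EXAMPLE_REPORT))
print(find_instrument_mismatches(EXAMPLE_REPORT, {"RUN_C": "Illumina HiSeq 4000"}))
```

A flagged run is only a prompt to look more closely: a single FASTQ for a paired-end run may simply mean the reads were merged before upload, which the paper should then state.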
