Issues Encountered when Metadata Scraping

Introduction

Many papers report metadata in a heterogeneous way. This can make rapid retrieval and compilation of useful data difficult when generating comparative datasets.

This page acts as a 'whiteboard' for people to note down common problems they have when adding studies to AncientMetagenomeDir. This will help the ancient metagenomics community begin to define reporting standards for the field in a way that is easy to access and understand.

Issues

Samples

Sample Ages

In many cases radiocarbon dates are reported inconsistently:
- No uncalibrated date
- No radiocarbon lab code (like OxA-0000 or MAMS-0000)
- Only provides ranges (no median or midpoint)
- Reports in AD or BP (or both!)
- No calibration curve reported
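To illustrate why mixed AD/BP reporting and range-only dates slow down scraping, here is a minimal Python sketch of the conversions a curator ends up doing by hand. The function names `ad_to_bp` and `range_midpoint_bp` are hypothetical and not part of any AncientMetagenomeDir tooling. It assumes BP means years before 1950 CE and that BC years are passed as negative numbers. The midpoint of a range is only a crude stand-in for the median of a calibrated date distribution, which cannot be recovered without the uncalibrated date and the calibration curve.

```python
BP_REFERENCE_YEAR = 1950  # "Before Present" is counted back from 1950 CE


def ad_to_bp(year_ad):
    """Convert a calendar year to years BP.

    Assumption: AD/CE years are positive, BC/BCE years are negative,
    and there is no year 0 (1 BC is immediately followed by 1 AD).
    """
    if year_ad == 0:
        raise ValueError("There is no year 0 in the AD/BC calendar")
    if year_ad > 0:
        return BP_REFERENCE_YEAR - year_ad
    # Shift BC years by one to account for the missing year 0
    return BP_REFERENCE_YEAR - (year_ad + 1)


def range_midpoint_bp(older_bp, younger_bp):
    """Return the arithmetic midpoint of a reported BP range."""
    return (older_bp + younger_bp) / 2


# Example: a paper reports "1000 to 1200 AD" with no median
older = ad_to_bp(1000)    # 950 BP
younger = ad_to_bp(1200)  # 750 BP
midpoint = range_midpoint_bp(older, younger)
```

Reporting the uncalibrated date, lab code and calibration curve alongside the range would make this kind of guesswork unnecessary.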

Sample Codes

In some cases, sample codes reported in the manuscript are not the same codes that are used in the data upload to ENA/SRA
Multiple sample codes used within a paper, which are difficult to trace

Sample Collection Date

Sample collection date is not frequently reported in manuscripts

Site latitude/longitude

The latitude/longitude coordinates of sites are not often reported
- This is an issue when a site has a common name and no additional locality information is given

Sequencing Data / Accession Codes

Range of uploaded data types (particularly for single genomes)
Consensus sequences only in BEAST XML files on GitHub, not on GenBank
Mapped reads only uploaded as FASTQ files
- Not reproducible! What if I want to do different pre-processing/mapping?
Data stated to be consensus in article, but in GenBank/ENA it is raw data

Libraries

Each library gets its own sample accession despite being from the same sample [PRJEB35483]
People upload e.g. BAM files that have non-UDG and UDG-full reads merged together, so you can't tell which is which [PRJNA348634]
Discrepancy between instrument model reported in paper vs ENA metadata [PRJNA688065]
No record of which libraries are UDG treated vs non-UDG treated (even though both were generated)
Discrepancies between sequencing cycles reported in paper and those detected during ENA processing [..can't remember...]
No library names [PRJEB32319]
No polymerase or library construction information (just fill in, nothing about indexing etc)
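Several of the library issues above come down to the paper and the archive disagreeing, or one of them being silent. A rough Python sketch of how such mismatches could be flagged is below. The function `find_discrepancies`, the field names and the example values are all hypothetical placeholders for illustration, not real records from any of the projects listed and not the actual AncientMetagenomeDir schema.

```python
def find_discrepancies(paper, archive, fields):
    """Compare library metadata from a paper against archive metadata.

    Returns a dict mapping each field name to (paper_value, archive_value)
    for every field that is missing on either side or differs between them.
    """
    mismatches = {}
    for field in fields:
        paper_value = paper.get(field)
        archive_value = archive.get(field)
        if paper_value is None or archive_value is None or paper_value != archive_value:
            mismatches[field] = (paper_value, archive_value)
    return mismatches


# Hypothetical example records (values invented for illustration)
paper_record = {
    "instrument_model": "Illumina HiSeq 4000",
    "read_count": 1000000,
    "udg_treatment": "half",
}
archive_record = {
    "instrument_model": "NextSeq 500",
    "read_count": 1000000,
    # udg_treatment is not recorded in the archive metadata
}

issues = find_discrepancies(
    paper_record,
    archive_record,
    ["instrument_model", "read_count", "udg_treatment"],
)
```

Here `issues` would hold the instrument model mismatch and the missing UDG treatment, while the matching `read_count` is left out.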
