Issues Encountered when Metadata Scraping

James A. Fellows Yates edited this page Nov 8, 2021 · 24 revisions

Introduction

Many papers report metadata in a heterogeneous way. This makes rapid retrieval and compilation of useful data difficult when generating comparative datasets.

This page acts as a 'whiteboard' for people to note down common problems they have when adding studies to AncientMetagenomeDir. This will help the ancient metagenomics community begin to define reporting standards for the field in a way that is easy to access and understand.

Issues

Samples

Sample Ages

  • In many cases radiocarbon dates are reported inconsistently
    • No uncalibrated date
    • No radiocarbon lab code (like OxA-0000 or MAMS-0000)
    • Only ranges provided (no median or midpoint)
    • Reported in AD or BP (or both!?!!)
    • No calibration curve reported
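
When a paper gives only a calendar range in AD/BC, a curator has to bring it onto a single scale before it can be compared with other studies. The sketch below shows one way this could be done, assuming the usual convention that "present" in BP means AD 1950 and that there is no year 0 between 1 BC and AD 1. The function names `to_bp` and `range_midpoint_bp` are made up for illustration and are not part of any AncientMetagenomeDir tooling. Note that the midpoint of a range is only a crude stand-in for the median of a calibrated date, which depends on the calibration curve.

```python
def to_bp(year: int, era: str) -> int:
    """Convert a calendar year to years before present (BP).

    Assumes 'present' is AD 1950 and that the calendar has no year 0.
    """
    if year <= 0:
        raise ValueError("year must be a positive number; use era for BC/AD")
    era = era.upper()
    if era in ("AD", "CE"):
        return 1950 - year
    if era in ("BC", "BCE"):
        # 1 BC is the year directly before AD 1, hence 1949 rather than 1950
        return 1949 + year
    raise ValueError(f"unknown era: {era!r}")


def range_midpoint_bp(older_bp: int, younger_bp: int) -> float:
    """Arithmetic midpoint of a BP range (not a calibrated median)."""
    return (older_bp + younger_bp) / 2


# Hypothetical example: a paper reports only '400-200 BC'
older = to_bp(400, "BC")    # 2349
younger = to_bp(200, "BC")  # 2149
print(range_midpoint_bp(older, younger))  # 2249.0
```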

Sample Codes

  • In some cases the sample codes reported in the manuscript are not the same as the codes used in the data uploaded to ENA/SRA
  • Multiple sample codes within a paper, difficult to trace

Sample Collection Date

  • Sample collection date is not frequently reported in manuscripts

Site latitude/longitude

  • The latitude/longitude coordinates of sites are not often reported
    • A problem when a site has a common name and no additional locality information is given

Sequencing Data / Accession Codes

  • Range of uploaded data types (particularly for single genomes)
  • Consensus sequences in BEAST XML files on GitHub, not on GenBank
  • Only mapped reads uploaded as FASTQ files
    • Not reproducible! What if I want to do different pre-processing/mapping?
  • Data stated to be consensus in article, but in GenBank/ENA it is raw data

Libraries

  • Each library gets its own sample accession despite being from the same sample [PRJEB35483]
  • People upload e.g. BAM files that have non-UDG and UDG-full reads merged together, so can't tell which is which [PRJNA348634]
  • Discrepancy between instrument model reported in paper vs ENA metadata [PRJNA688065]
  • No record of which libraries are UDG treated vs non-UDG treated (even though both were generated)
  • Discrepancies between sequencing cycles reported in paper, and those detected during ENA processing [..can't remember...]
  • No library names [PRJEB32319]
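
Some of the library issues above can be spotted automatically once the run-level metadata for a project is available as a table. The sketch below is one possible check, assuming a tab-separated run report with the columns `run_accession`, `sample_accession`, `instrument_model` and `library_name` (such as the ENA Portal API file report can produce for a project accession). The function name `check_runs` and the example rows are hypothetical and do not come from any of the projects listed above.

```python
import csv
import io


def check_runs(report_tsv: str, reported_model: str) -> list[str]:
    """List runs with no library name, or whose instrument model
    differs from the one stated in the paper."""
    problems = []
    reader = csv.DictReader(io.StringIO(report_tsv), delimiter="\t")
    for row in reader:
        run = row["run_accession"]
        # Missing columns come back as None, so fall back to an empty string
        library = (row.get("library_name") or "").strip()
        model = (row.get("instrument_model") or "").strip()
        if not library:
            problems.append(f"{run}: no library name")
        if model.lower() != reported_model.lower():
            problems.append(
                f"{run}: archive says '{model}', paper says '{reported_model}'"
            )
    return problems


# Made-up rows for illustration only
EXAMPLE_TSV = (
    "run_accession\tsample_accession\tinstrument_model\tlibrary_name\n"
    "RUN0001\tSAMPLE0001\tIllumina HiSeq 4000\tLIB_A\n"
    "RUN0002\tSAMPLE0001\tNextSeq 500\t\n"
)

for problem in check_runs(EXAMPLE_TSV, "Illumina HiSeq 4000"):
    print(problem)
```

A check like this only flags candidates for manual follow-up: it cannot tell whether the paper or the archive record is the correct one.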