Issues Encountered when Metadata Scraping
James A. Fellows Yates edited this page Nov 19, 2021
·
24 revisions
Many papers report metadata in a heterogeneous way. This can make rapid retrieval and compilation of useful data difficult when generating comparative datasets.
This page acts as a 'whiteboard' for people to note down common problems they have when adding studies to AncientMetagenomeDir. This will help the ancient metagenomics community begin to define reporting standards for the field, in a way that is easy to access and understand.
- In many cases radiocarbon dates are reported inconsistently
  - No uncalibrated date
  - No radiocarbon lab code (like OxA-0000 or MAMS-0000)
  - Only provides ranges (no median midpoint)
  - Reports in AD or BP (or both!?!!)
  - No calibration curve reported
- In some cases the sample codes reported in the manuscript are not the same as the codes used in the data upload to ENA/SRA
- Multiple sample codes within a paper, difficult to trace
- Sample collection date is not frequently reported in manuscripts
- The latitude/longitude coordinates of sites are not often reported
  - This is a particular issue when the site has a common name and no additional locality information is given
- Range of uploaded data types (particularly for single genomes)
  - Consensus sequences in BEAST XML files on GitHub, not on GenBank
  - Only mapped reads uploaded as FASTQ files
    - Not reproducible! What if I want to do different pre-processing/mapping?
  - Data stated to be consensus in the article, but in GenBank/ENA it is raw data
- Each library gets its own sample accession despite being from the same sample [PRJEB35483]
- People upload e.g. BAM files that have non-UDG and UDG-full reads merged together, so you can't tell which is which [PRJNA348634]
- Discrepancy between instrument model reported in paper vs ENA metadata [PRJNA688065]
- No record of which libraries are UDG-treated vs non-UDG-treated (even though both were generated)
- Discrepancies between sequencing cycles reported in paper, and those detected during ENA processing [..can't remember...]
- No library names [PRJEB32319]
- No polymerase or library construction information (just fill in, nothing about indexing etc)
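
To illustrate why the inconsistent radiocarbon date reporting noted above costs time, here is a minimal Python sketch of the kind of normalisation a curator ends up doing by hand. Everything in it is an assumption made for illustration: the function names (`to_bp`, `range_midpoint_bp`, `looks_like_lab_code`) are hypothetical, the lab code pattern is a deliberately loose guess based only on the `OxA-0000` / `MAMS-0000` examples, and none of it reflects an AncientMetagenomeDir convention. It relies only on the standard definition of BP as years before AD 1950, and on there being no year zero between 1 BC and AD 1.

```python
import re

# Loose, hypothetical pattern: letters, a hyphen, then digits (e.g. OxA-0000).
# Real lab codes can be more varied, so treat a failed match as "check by hand".
LAB_CODE_PATTERN = re.compile(r"^[A-Za-z]+-\d+$")


def to_bp(year: int, era: str) -> int:
    """Convert a calendar year in a given era to years BP (before AD 1950)."""
    era = era.strip().upper()
    if era == "BP":
        return year
    if era in ("AD", "CE"):
        return 1950 - year
    if era in ("BC", "BCE"):
        # No year zero: 1 BC is the year directly before AD 1.
        return 1949 + year
    raise ValueError(f"Unrecognised era: {era!r}")


def range_midpoint_bp(start: int, end: int, era: str) -> int:
    """Arithmetic midpoint (in BP, rounded down) of a reported date range.

    This is only the middle of the interval. It is not the median of the
    calibrated probability distribution, which needs the uncalibrated date
    and the calibration curve.
    """
    return (to_bp(start, era) + to_bp(end, era)) // 2


def looks_like_lab_code(code: str) -> bool:
    """Return True if the string matches the loose lab code pattern above."""
    return bool(LAB_CODE_PATTERN.match(code.strip()))
```

For example, `range_midpoint_bp(400, 200, "BC")` gives a single BP value for a paper that only reports "400-200 BC". Even so, the uncalibrated date, lab code and calibration curve cannot be reconstructed this way, which is why they need to be reported in the first place.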