diff --git a/Getting_Started_with_eva_sub_cli.md b/Getting_Started_with_eva_sub_cli.md
index 50d058f..4383140 100644
--- a/Getting_Started_with_eva_sub_cli.md
+++ b/Getting_Started_with_eva_sub_cli.md
@@ -4,7 +4,7 @@ The eva-sub-cli tool requires the following inputs:

 - One or several valid VCF files
 - Completed metadata spreadsheet
-- list 3 Reference genome in fasta format
+- Reference genome in fasta format

 The VCF file must adhere to official VCF specifications, and the metadata spreadsheet provides contextual information about the dataset. In the following sections, we will examine each of these inputs in detail.

@@ -13,10 +13,37 @@ The VCF file must adhere to official VCF specifications, and the metadata spread
 A VCF (Variant Call Format) file is a type of file used in bioinformatics to store information about genetic variants. It includes data about the differences (or variants) between a sample's DNA and a reference genome. Typically, generating a VCF file involves several steps: preparing your sample, sequencing the DNA, aligning it to a reference genome, identifying variants, and finally, formatting this information into a VCF file. The overall goal is to systematically capture and record genetic differences in a standardised format.
 A VCF file consists of two main parts: the header and the body.
 Header: The header contains metadata about the file, such as the format version, reference genome information, and descriptions of the data fields. Each line in the header starts with a double ##, except for the last header line which starts with a single #.
+File format version
+
+```
+##fileformat=VCFv4.2
+##INFO=
+##FILTER=
+##FORMAT=
+
+```
+
+Body: The body of the VCF file contains the actual variant data, with each row representing a single variant. The columns in the body are : CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Sample Columns
+
+```
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT [SampleIDs...]
+```
+Here's a small example to illustrate the structure of a VCF file: Example VCF file archived at EVA to be inserted
+

 # Metadata Spreadsheet

 The spreadsheet provides comprehensive contextual information about the dataset, ensuring that each submission is accompanied by detailed descriptions that facilitate proper understanding and use of the data. Key elements included in the metadata spreadsheet are analysis and project information, sample information, sequencing methodologies, experimental details.
+
+| WORKSHEET | EXPLANATION |
+| ----------------- | -------------------- |
+| Submitter Details | This sheet captures the details of the submitter|
+| Project | The objective of this sheet is to gather general information about the Project including submitter, submitting centre, collaborators, project title, description and publications. |
+| Sample | Projects consist of analyses that are run on samples. We accept sample information in the form of BioSample, ENA or EGA accession(s). We also accept BioSamples sampleset accessions. If your samples are not yet accessioned, and are therefore novel, please use the "Novel sample(s)" sections of the Sample(s) worksheet to have them registered at BioSample |
+| Analysis | For EVA, each analysis is one vcf file, plus an unlimited number of ancillary files. This sheet allows EVA to link vcf files to a project and to other EVA analyses. Additionally, this worksheet contains experimental meta-data detailing the methodology of each analysis. Important to note; one project can have multiple associated analyses |
+| Files | Filenames and associated checking data associated with this EVA submission should be entered into this worksheet. Each file should be linked to one, or more, analysis. |
+
+
 # Validation checks

 The CLI tool performs the following validation checks and generates corresponding reports:
@@ -57,7 +84,7 @@ Key points to note before validating your VCF file with the eva-sub-cli tool:
 - Header Metadata: Should include the reference genome, information fields (INFO), filters (FILTER), AF and genotype metadata
 - Variant Information: VCF files must provide either sample genotypes and/or aggregated sample summary-level allele frequencies.
 - Unique Variants: Variant lines should be unique and not specify duplicate loci.
-- Reference Genome: All variants must be submitted with positions on a reference genome accessionned by a member of the INSDC consortium (Genbank, ENA, or DDBJ).
+- Reference Genome: All variants must be submitted with positions on a reference genome accessionned by a member of the INSDC consortium [Genbank](https://www.ncbi.nlm.nih.gov/genbank/), [ENA](https://www.ebi.ac.uk/ena/browser/home), or [DDBJ](https://www.ddbj.nig.ac.jp/index-e.html).

 Common Errors Seen with VCF Checks:

@@ -75,7 +102,9 @@ Key points to note before validating your data with the eva-sub-cli Tool:

 - Ensure that the reference sequences in the FASTA file used to call the variants are accessioned in INSDC.
 - Verify that the VCF file does not use non-GCA contig aliases by cross-checking with the reference assembly report.
-- Common errors seen with assembly checks:
+
+ Common errors seen with assembly checks:
+
 - VCF file uses a non-GCA contig alias causing the assembly check to fail
 - Contigs used do not exist in the assembly report of the reference genome
 - Major Allele Used as REF Allele: This typically occurs when a specific version of Plink or Tassel is used to create VCF files, causing the tool to use the major allele as the reference allele. In such cases, submitters should use the GCA FASTA sequence to create corrected files.
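
Reviewer note: the "Unique Variants" key point in the guide could be illustrated with a short pre-submission script. The following is only a sketch and not how eva-sub-cli implements its checks. The function name `find_duplicate_variants` is hypothetical, and it assumes that two variant lines count as duplicates when they share CHROM, POS, REF and ALT, and that body columns are tab-separated.

```python
from collections import Counter


def find_duplicate_variants(vcf_lines):
    """Return the (CHROM, POS, REF, ALT) keys that appear on more than one body line.

    Assumption: a variant is identified by CHROM, POS, REF and ALT.
    The real eva-sub-cli validation may use different criteria.
    """
    seen = Counter()
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Header lines start with "##" (metadata) or "#" (column names); skip them.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # too few columns to identify a variant; ignored in this sketch
        chrom, pos, _variant_id, ref, alt = fields[:5]
        seen[(chrom, pos, ref, alt)] += 1
    return [key for key, count in seen.items() if count > 1]


# Made-up example data, not taken from a real EVA submission.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t200\t.\tC\tT\t.\t.\t.",
]
print(find_duplicate_variants(example))  # [('chr1', '100', 'A', 'G')]
```

For a real file, the same function can be fed an open file handle (for example `find_duplicate_variants(open(path))`), since it only iterates over lines.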