Update dataset-genomics-data-lake.md #29

Open: wants to merge 12 commits into base: main
2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-1000-genomes.md
@@ -9,6 +9,8 @@ ms.date: 07/10/2024

# 1000 Genomes

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>
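
For readers updating scripts ahead of the retirement date, the conversion the notice describes (dropping the ‘?’ and everything after it) can be sketched as follows. This is a minimal illustration using only the Python standard library. The function name `strip_sas_token` and the example URL are hypothetical and do not come from the dataset documentation.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_sas_token(url: str) -> str:
    """Return the URL with its query string (and any fragment) removed."""
    parts = urlsplit(url)
    # Keep scheme, host, and path; drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical signed URL, for illustration only (not a real container or token).
signed_url = "https://example.blob.core.windows.net/dataset/sample.vcf.gz?sv=2020-01-01&sig=PLACEHOLDER"
public_url = strip_sas_token(signed_url)
print(public_url)
```

A URL that has no query string passes through unchanged, so the helper can be applied to a mixed list of old and new URLs.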

The 1000 Genomes Project ran between 2008 and 2015 to create the largest public catalog of human variation and genotype data. The final data set contains data for 2,504 individuals from 26 populations and 84 million identified variants. For more information, visit the 1000 Genomes Project [website](https://www.internationalgenome.org/) and these publications:

[Pilot Analysis: A map of human genome variation from population-scale sequencing Nature 467, 1061-1073 (28 October 2010)](https://www.nature.com/articles/nature09534)
2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-clinvar-annotations.md
@@ -9,6 +9,8 @@ ms.date: 06/13/2024

# ClinVar Annotations

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>

The [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) resource is a freely accessible, public archive of reports - with supporting evidence - about the relationships among human variations and phenotypes. It facilitates access to and communication about the claimed relationships between human variation and observed health status, and about the history of that interpretation. It provides access to a broader set of clinical interpretations that researchers can incorporate into genomics workflows and applications.

Visit the [Data Dictionary](https://www.ncbi.nlm.nih.gov/projects/clinvar/ClinVarDataDictionary.pdf) and the [FAQ resource](https://www.ncbi.nlm.nih.gov/clinvar/docs/faq/) for more information about the data.
2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-encode.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021

# ENCODE: Encyclopedia of DNA Elements

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>

The [Encyclopedia of DNA Elements (ENCODE) Consortium](https://www.encodeproject.org/help/project-overview/) is an ongoing international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). ENCODE's goal is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.

ENCODE investigators employ various assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a diverse range of RNA sources, comparative genomics, integrative bioinformatic methods, and human curation. Regulatory elements are typically investigated through DNA hypersensitivity assays, assays of DNA methylation, and immunoprecipitation (IP) of proteins that interact with DNA and RNA, that is, modified histones, transcription factors, chromatin regulators, and RNA-binding proteins, followed by sequencing.
2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-gatk-resource-bundle.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021

# GATK Resource Bundle

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>

The [GATK resource bundle](https://gatk.broadinstitute.org/hc/articles/360035890811-Resource-bundle) is a collection of standard files for working with human resequencing data with the GATK.

[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-usage-note.md)]
2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-genomics-data-lake.md
@@ -30,6 +30,8 @@ The Genomics Data Lake is hosted in the West US 2 and West Central US Azure regi
| [GATK Resource Bundle](dataset-gatk-resource-bundle.md) | GATK Resource bundle |
| [TCGA Open Data](dataset-the-cancer-genome-atlas.md) | TCGA Open Data |
| [Pan UK-Biobank](dataset-panancestry-uk-bio-bank.md) | Pan UK-Biobank |
| [ImmuneCODE database](dataset-immunecode.md) | ImmuneCODE database |
| [Open Targets dataset](dataset-panancestry-uk-bio-bank.md) | Open Targets dataset |

## Next steps

2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-human-reference-genomes.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021

# Human Reference Genomes

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>

This dataset includes two human-genome references assembled by the [Genome Reference Consortium](https://www.ncbi.nlm.nih.gov/grc): Hg19 and Hg38.

For more information on Hg19 (GRCh37) data, see the [GRCh37 report at NCBI](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/).
4 changes: 3 additions & 1 deletion articles/open-datasets/dataset-illumina-platinum-genomes.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021

# Illumina Platinum Genomes

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>

Whole-genome sequencing is enabling researchers worldwide to characterize the human genome more fully and accurately. This requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes as a benchmark. Illumina has generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree. Illumina has called variants in each genome using a range of currently available algorithms.

For more information on the data, see the official [Illumina site](https://www.illumina.com/platinumgenomes.html).
@@ -206,4 +208,4 @@ run gatk VariantsToTable -V NA12877.vcf.gz -F CHROM -F POS -F TYPE -F AC -F AD -

## Next steps

View the rest of the datasets in the [Open Datasets catalog](dataset-catalog.md).
View the rest of the datasets in the [Open Datasets catalog](dataset-catalog.md).
2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-immunecode.md
@@ -8,6 +8,8 @@ ms.date: 11/09/2023

# ImmuneCODE database

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>

The ImmuneCODE™ database includes hundreds of millions of T-cell Receptor (TCR) sequences from over 1,400 subjects exposed to or infected with the SARS-CoV-2 virus, and over 160,000 high-confidence SARS-CoV-2-specific TCRs.
The database is accessible at no cost. Its data can be analyzed to aid global initiatives aimed at comprehending the immune response to the SARS-CoV-2 virus and crafting novel interventions. To learn more about the dataset, refer to the associated [publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7418738/).

2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-open-cravat.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021

# OpenCravat: Open Custom Ranked Analysis of Variants Toolkit

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>

OpenCRAVAT is a Python package that performs genomic variant interpretation including variant impact, annotation, and scoring. OpenCRAVAT has a modular architecture with a wide variety of analysis modules and annotation resources that can be selected and installed/run based on the needs of a given study.

For more information on the data, see the [OpenCRAVAT website](https://opencravat.org/).
2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-panancestry-uk-bio-bank.md
@@ -9,6 +9,8 @@ ms.date: 05/17/2023

# Pan UK-Biobank: Pan-ancestry genetic analysis of the UK Biobank

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>

The [Pan-ancestry genetic analysis of the UK Biobank (Pan-UKBB)](https://pan.ukbb.broadinstitute.org) is a resource for researchers that promotes more inclusive research practices, accelerates scientific discoveries, and improves the health of all people equitably. In genetics research, it's statistically necessary to study groups of individuals with similar ancestries together. In practice, this method has meant that most previous research has excluded individuals with non-European ancestries. Pan-UKBB uses one of the most widely accessed sources of genetic data, the UK Biobank, in a manner that is more inclusive than most previous efforts, namely by studying groups of individuals with diverse ancestries. The results of this research have many important limitations, which should be carefully considered when researchers use this resource in their work and when they and others interpret subsequent findings.

[!INCLUDE [Open Dataset usage notice](./includes/open-datasets-usage-note.md)]
2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-snpeff.md
@@ -8,6 +8,8 @@ ms.date: 04/16/2021

# SnpEff: Genomic variant annotations and functional effect prediction toolbox

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>

[SnpEff](https://pcingola.github.io/SnpEff/) is a genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes).

For more information on the data, see the [User Manual](https://pcingola.github.io/SnpEff/snpeff/introduction/).
2 changes: 2 additions & 0 deletions articles/open-datasets/dataset-the-cancer-genome-atlas.md
@@ -12,6 +12,8 @@ ms.date: 09/22/2022

# TCGA Open Data

<em> Important Update 9/19/2024: all URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Plan to use the public URLs without a query string after this date (remove the ‘?’ and everything after it). </em>

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types[[1]](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). The publicly available TCGA cancer data comes in two tiers: open or controlled access.

- Open access [available on Azure]: This dataset contains de-identified clinical and biospecimen data or summarized data that doesn't contain any individually identifiable information. The data types included are gene expression, methylation beta values, and protein quantification. DNA-level data types include gene-level copy number and masked copy number segment.