Bias in sequencing the genome

Structural variants (large genetic mutations) are often identified in the human genome using algorithms that search for hills and valleys in read depth of coverage, that is, the number of sequenced "reads" that align to each part of the genome.
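To make "depth of coverage" concrete, here is a minimal sketch that counts how many reads overlap each position. The half-open `(start, end)` interval representation and the function name `depth_of_coverage` are assumptions made for illustration; real pipelines typically derive depth from aligned BAM/CRAM files.

```python
def depth_of_coverage(reads, genome_length):
    """Return per-base depth for reads given as half-open intervals [start, end)."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, genome_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth = []
    running = 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


# Three toy reads on a 10-base "genome".
reads = [(0, 4), (2, 6), (3, 8)]
depth = depth_of_coverage(reads, genome_length=10)
print(depth)  # [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
```

A structural variant caller looks for extended runs where this profile sits well above or below its typical value.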

But certain "motif" sequences in the genome are known to be associated with changes in coverage, even in the absence of a structural variant.

Such systematic biases in depth of coverage must be corrected before the data are passed to a structural variant caller; otherwise the caller may report variants that are not there.

Learning and correcting the bias

Convolutional Neural Networks have recently been used to classify genomic sequences. I illustrate the approach here using toy sequence data.
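As a rough illustration of what a convolutional layer does with DNA, the sketch below one-hot encodes a sequence and slides a single filter along it. The filter here is set by hand to match "ATAT" (a trained network would learn its filter weights from data), and the helper names `one_hot` and `convolve` are hypothetical, not taken from this repository.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    # Any character outside ACGT (e.g. N) becomes an all-zero row.
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def convolve(encoded, kernel):
    """Slide a (width x 4) kernel along the encoded sequence ('valid' mode)."""
    width = len(kernel)
    scores = []
    for i in range(len(encoded) - width + 1):
        score = sum(
            encoded[i + j][c] * kernel[j][c]
            for j in range(width)
            for c in range(len(BASES))
        )
        scores.append(score)
    return scores


# Hand-set filter that responds most strongly to the motif "ATAT".
motif_kernel = one_hot("ATAT")
scores = convolve(one_hot("GGATATCC"), motif_kernel)
print(scores)       # [2.0, 0.0, 4.0, 0.0, 2.0]
print(max(scores))  # 4.0, the max-pooled activation (motif found at offset 2)
```

Max-pooling the filter output, as in the last line, yields a position-independent signal that a motif is present somewhere in the sequence.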

We adapted this idea and built a Convolutional Neural Network that models the read depth associated with a given sequence as a mixture of Poisson distributions.
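For reference, a mixture of Poisson distributions assigns an observed depth k the probability sum_i w_i * Poisson(k; rate_i), where the weights w_i sum to one. The sketch below evaluates the log of that quantity; in a network of this kind the weights and rates would be outputs computed from the sequence, but how this repository parameterizes them is not stated here, so the function signature is an assumption.

```python
import math


def poisson_log_pmf(k, rate):
    """Log-probability of observing count k under a Poisson with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def mixture_log_likelihood(depth, weights, rates):
    """Log-likelihood of one observed depth under a mixture of Poissons."""
    # Combine components with log-sum-exp for numerical stability.
    terms = [
        math.log(w) + poisson_log_pmf(depth, r)
        for w, r in zip(weights, rates)
    ]
    largest = max(terms)
    return largest + math.log(sum(math.exp(t - largest) for t in terms))


# Hypothetical two-component mixture: a low-coverage and a normal-coverage mode.
weights = [0.3, 0.7]
rates = [5.0, 30.0]
loss = -mixture_log_likelihood(28, weights, rates)  # negative log-likelihood
print(loss)
```

Training would then minimize the negative log-likelihood summed over all (sequence, depth) pairs in the training set.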

When this model is trained on a mix of random sequences and sequences containing an AT-dinucleotide repeat, it corrects depth of coverage in sequences harboring the AT-dinucleotide repeat.
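One simple way to apply such a correction, shown here as an assumption rather than as the method used in this repository, is to divide the observed depth by the depth the model expects for that sequence and rescale to the genome-wide mean. The function name `correct_depth` and the toy numbers are illustrative only.

```python
def correct_depth(observed, expected, genome_wide_mean):
    """Rescale observed depths by the model's expected depth at each window."""
    corrected = []
    for obs, exp in zip(observed, expected):
        if exp <= 0:
            # No meaningful expectation; leave the window undefined.
            corrected.append(float("nan"))
        else:
            corrected.append(genome_wide_mean * obs / exp)
    return corrected


# Toy example: the middle window contains a motif that halves coverage.
observed = [30, 15, 29]
expected = [30.0, 15.0, 30.0]  # model predictions (hypothetical values)
corrected = correct_depth(observed, expected, genome_wide_mean=30.0)
print(corrected)  # [30.0, 30.0, 29.0]
```

After correction the motif-induced dip in the middle window disappears, so a downstream caller would be less likely to mistake it for a deletion.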

Next steps

What is needed now is a training set enriched for ALL motifs in the genome that affect coverage. (The training set used above contains only one motif.) With the more diverse training set in hand, the model could be trained to correct all systematic biases present in the genome.

Possible ways to obtain such a training set include:

pooling depths across multiple samples (individuals), thereby increasing the signal-to-noise ratio, and then selecting examples with, e.g., lower-than-expected read depth;
using the HOMER and MEME-ChIP toolsets to find motifs that appear in the training set more often than in an equal-sized set of random DNA sequences, and then retaining only training examples that contain one or more of these motifs.
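The first option above could be sketched as follows. Each sample's profile is divided by its own mean depth so that samples sequenced to different depths are comparable, the normalized profiles are averaged per window, and windows whose pooled value falls below a cutoff are selected. The function names, the cutoff of 0.5, and the toy depths are all assumptions for illustration.

```python
import statistics


def pool_depths(depths_by_sample):
    """Average mean-normalized depth profiles across samples, window by window."""
    normalized = []
    for depths in depths_by_sample:
        mean_depth = statistics.fmean(depths)  # assumed to be non-zero
        normalized.append([d / mean_depth for d in depths])
    # zip(*...) groups the values that belong to the same window.
    return [statistics.fmean(column) for column in zip(*normalized)]


def select_low_depth(pooled, threshold=0.5):
    """Return indices of windows whose pooled relative depth is below threshold."""
    return [i for i, value in enumerate(pooled) if value < threshold]


# Three samples, four windows; window 2 is consistently under-covered.
depths_by_sample = [
    [30, 28, 10, 32],
    [20, 22, 6, 20],
    [40, 44, 12, 40],
]
pooled = pool_depths(depths_by_sample)
low_windows = select_low_depth(pooled)
print(low_windows)  # [2]
```

Because random fluctuations tend to average out across individuals while sequence-driven dips recur in every sample, the pooled profile highlights windows that are candidates for the enriched training set.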

petermchale/denoising_coverage_profiles

Folders and files

Latest commit

History

Repository files navigation

Bias in sequencing the genome

Learning and correcting the bias

Next steps

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages