Merge pull request #47 from eastgenomics/development
1.2 - development -> main
mattgarner authored Jul 7, 2021
2 parents c0fcfea + d283759 commit 755255b
Showing 10 changed files with 1,708 additions and 21,667 deletions.
43 changes: 27 additions & 16 deletions README.md
@@ -13,7 +13,7 @@ Athena is a tool to generate coverage statistics for NGS data, and combine these

Dependencies may be installed from the requirements.txt file using ```pip install -r requirements.txt```.
This should contain all the Python packages required to generate coverage statistics and reports.
For optional calculating of variant coverage from VCFs, [BEDtools][bedtools-url] is also required to be installed.
In addition, [BEDtools][bedtools-url] must be installed and available on the PATH.

Tested on Ubuntu 18.04.4 and macOS 10.15.4

@@ -22,35 +22,52 @@ Tested on Ubuntu 18.04.4 and macOS 10.15.4
It is written to take in per base coverage data (as output from tools such as [mosdepth][mosdepth-url]) as input to calculate coverage for target regions defined in a bed file. <br></br>

The general workflow for generating the statistics and report is as follows: <br>
- Annotate each region of the bed file with the gene, exon and per base coverage data using `annotate_bed.sh`
- Annotate each region of the bed file with the gene, exon and per base coverage data using `annotate_bed.py`
- Generate per exon and per gene statistics using `coverage_stats_single.py`
- Generate HTML coverage report with `coverage_report_single.py`

For DNAnexus cloud platform users, an Athena [dx applet][dx-url] has also been built.


### Expected file formats

As a minimum, Athena requires 3 input files. These are a bed file for the gene panel, a file of transcript information and the output of your coverage tool (mosdepth, samtools etc.). These files MUST have the following columns:

- panel bed file: `chromosome start end transcript`
- transcript file: `chromosome start end gene transcript exon`
- coverage file: `chromosome start end coverage`
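
As a rough illustration of the column requirements above, the sketch below checks that every row of a tab-separated input has the expected number of columns before any intersecting is attempted. The helper `check_columns` and the `EXPECTED_COLUMNS` mapping are hypothetical names and are not part of Athena; the sketch also assumes tab-delimited input with no header row.

```python
import csv
import io

# Expected column names per input type, taken from the list above.
EXPECTED_COLUMNS = {
    "panel": ["chromosome", "start", "end", "transcript"],
    "transcript": ["chromosome", "start", "end", "gene", "transcript", "exon"],
    "coverage": ["chromosome", "start", "end", "coverage"],
}


def check_columns(handle, file_type):
    """Return (line_number, n_columns) for each row with the wrong width."""
    expected = len(EXPECTED_COLUMNS[file_type])
    bad_rows = []
    reader = csv.reader(handle, delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        # skip completely empty lines, flag anything else of the wrong width
        if row and len(row) != expected:
            bad_rows.append((line_number, len(row)))
    return bad_rows


# In-memory example: the second row is missing its transcript column.
example_panel = io.StringIO("1\t100\t200\tNM_000001.1\n1\t300\t400\n")
problems = check_columns(example_panel, "panel")
print(problems)
```

An empty list means every row has the expected width; it does not check that the values themselves are sensible.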

n.b. the process for creating the transcript file may be found [here][transcript-file-url].

### Annotating BED file
The BED file containing regions of interest is first required to be annotated with gene, exon and coverage information prior to analysis. This may be done using [BEDtools intersect][bedtools-intersect-url], with a file containing transcript to gene and exon information, and then the per base coverage data. <br>
The BED file containing regions of interest is first required to be annotated with gene, exon and coverage information prior to analysis. This may be done using [BEDtools intersect][bedtools-intersect-url], with a file containing transcript to gene and exon information, and then the per base coverage data. Currently, 100% overlap is required between coordinates in the panel bed file and the transcript annotation file, therefore you must ensure any added flank regions etc. are the same.<br>
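
To make the overlap requirement concrete, here is a minimal pure-Python sketch that lists panel regions with no matching region in the transcript file. It assumes that "100% overlap" amounts to identical chromosome, start and end for the same transcript, which is what a reciprocal `-f 1.0 -r` intersect implies; `find_unmatched_regions` is a hypothetical helper and not part of Athena.

```python
def find_unmatched_regions(panel_regions, transcript_regions):
    """Return panel regions lacking an identical region in the transcript file.

    panel_regions: iterable of (chrom, start, end, transcript)
    transcript_regions: iterable of (chrom, start, end, gene, transcript, exon)
    """
    # Index transcript regions by coordinates plus transcript ID.
    known = {
        (chrom, start, end, transcript)
        for chrom, start, end, _gene, transcript, _exon in transcript_regions
    }
    return [region for region in panel_regions if region not in known]


panel = [
    ("1", 100, 200, "NM_000001.1"),
    ("1", 295, 405, "NM_000001.1"),  # 5 bp flank added on each side
]
transcripts = [
    ("1", 100, 200, "GENE1", "NM_000001.1", 1),
    ("1", 300, 400, "GENE1", "NM_000001.1", 2),
]
unmatched = find_unmatched_regions(panel, transcripts)
print(unmatched)
```

Running a check like this before annotation can show which regions carry flanks that the transcript file does not.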

Included is a Bash script (`annotate_bed.sh`) to perform the required BED file annotation.
Included is a Python script (`annotate_bed.py`) to perform the required BED file annotation.

Expected inputs:

```
-i : Input panel bed file; must have ONLY the following 4 columns chromosome, start position, end position, gene/transcript.
-g : Exons nirvana file, contains required gene and exon information.
-b : Per base coverage file (output from mosdepth or similar).
-p / --panel_bed : Input panel bed file; must have ONLY the following 4 columns chromosome, start position, end position, gene/transcript
-t / --transcript_file : Transcript annotation file, contains required gene and exon information. Must have ONLY the following 6 columns:
chromosome, start, end, gene, transcript, exon
-c / --coverage_file : Per base coverage file (output from mosdepth or similar)
-s / --chunk_size : (optional) number of rows per chunk when splitting the per-base coverage file for intersecting; with <16GB RAM, intersecting the whole file at once can cause bedtools intersect to fail. Recommended values: 16GB RAM -> 20000000; 8GB RAM -> 10000000
-n / --output_name : (optional) Prefix for naming output file, if not given will use name from per base coverage file
Example usage:
$ annotate_bed.sh -i panel_bed_file.bed -g exons_nirvana -b {input_file}.per_base.bed
$ annotate_bed.py -p panel_bed_file.bed -t transcript_file.tsv -c {input_file}.per_base.bed
```
<br>
This wraps the bedtools intersect commands below. These commands are given as an example; the output file column ordering must match that given in /data/example example_annotated_bed for calculating coverage statistics:
<br>
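
The first of the commands below pipes the intersect output through `awk` and `cut`. As a hedged illustration of what those two stages do, this sketch applies the same filter in plain Python. It assumes 10 tab-separated columns per row (4 from the panel bed followed by 6 from the transcript file), and `filter_intersect_rows` is a hypothetical helper.

```python
def filter_intersect_rows(rows):
    """Keep rows whose panel and annotation transcripts match, then
    select columns 1, 2, 3, 8, 9 and 10 (1-based), as awk and cut do."""
    kept = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if fields[3] == fields[8]:  # awk: $4 == $9
            # cut -f 1,2,3,8,9,10
            kept.append([fields[i] for i in (0, 1, 2, 7, 8, 9)])
    return kept


intersect_output = [
    "1\t100\t200\tNM_000001.1\t1\t100\t200\tGENE1\tNM_000001.1\t1",
    "1\t100\t200\tNM_000001.1\t1\t100\t200\tGENE1\tNM_000002.1\t1",
]
annotated = filter_intersect_rows(intersect_output)
print(annotated)
```

Only the row where both transcript columns agree survives, leaving chromosome, start, end, gene, transcript and exon.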

```
$ bedtools intersect -a beds/sorted_bed_file.bed -b beds/exons_nirvana2010_no_PAR_Y_noflank.bed -wa -wb | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > sample1_genes_exons.bed
$ bedtools intersect -a beds/sorted_bed_file.bed -b beds/exons_nirvana2010_no_PAR_Y_noflank.bed -wa -wb -f 1.0 -r | awk 'OFS="\t" {if ($4 == $9) print}' | cut -f 1,2,3,8,9,10 > sample1_genes_exons.bed
- sorted_bed_file.bed -- bed file defining regions of interest (columns: chromosome, start, end, transcript)
- exons_nirvana2010_no_PAR_Y.bed -- a bed file containing transcript -> exon and gene information
Expand Down Expand Up @@ -108,13 +125,6 @@ $ python3 bin/coverage_report_single.py --gene_stats output/sample1-exon-coverag
```


### For development

Features to be developed:
- Generate run level statistics from multiple samples
- Generate run level report from multiple samples
- Add interactive elements to tables to increase useability (i.e sorting, filtering, searching)

Any bugs or suggestions for improvements please raise an issue.


@@ -130,3 +140,4 @@ Any bugs or suggestions for improvements please raise an issue.
[mosdepth-url]: https://github.com/brentp/mosdepth

[dx-url]: https://github.com/eastgenomics/eggd_athena
[transcript-file-url]: https://cuhbioinformatics.atlassian.net/wiki/spaces/P/pages/2241101840/Generating+transcripts+file+for+Athena
253 changes: 253 additions & 0 deletions bin/annotate_bed.py
@@ -0,0 +1,253 @@
"""
Script to annotate a panel bed file with transcript information and per
base coverage data.
Requires: bedtools
Jethro Rainford
20/06/2021
"""

import argparse
import os
import pandas as pd
from pathlib import Path
import pybedtools as bedtools

from load_data import loadData


class annotateBed():

    def add_transcript_info(self, panel_bed, transcript_info_df):
        """
        Use pybedtools to annotate panel bed file with transcript information
        Args:
            - panel_bed (df): panel bed file regions df
            - transcript_info_df (df): transcript info file df
        Returns:
            - bed_w_transcript (df): panel bed file with transcript information
        """
        print("calling bedtools to add transcript info")

        # get total number of transcripts before to ensure none are dropped
        panel_transcripts = panel_bed.transcript.unique().tolist()

        # turn dfs into BedTools objects
        bed = bedtools.BedTool.from_dataframe(panel_bed)
        transcript_info = bedtools.BedTool.from_dataframe(transcript_info_df)

        # intersecting panel bed file with transcript/gene/exon information
        # requires 100% overlap on panel -> transcript coordinates
        bed_w_transcript = bed.intersect(
            transcript_info, wa=True, wb=True, F=1.0
        )

        # convert pybedtools object to df
        bed_w_transcript = bed_w_transcript.to_dataframe(names=[
            "p_chrom", "p_start", "p_end", "p_transcript",
            "t_chrom", "t_start", "t_end", "t_gene", "t_transcript", "t_exon"
        ])

        # check for empty file
        assert len(bed_w_transcript.index) > 0, """Empty file returned from
        intersecting panel bed and transcript file. Check if flanks are
        being used as 100% coordinate overlap is currently required."""

        # panel bed file defines transcript to use, filter transcript file for
        # just those transcripts
        bed_w_transcript = bed_w_transcript[
            bed_w_transcript["p_transcript"] == bed_w_transcript["t_transcript"]
        ]

        # drop duplicate columns
        bed_w_transcript = bed_w_transcript.drop(columns=[
            't_chrom', 't_start', 't_end', 'p_transcript'
        ])

        intersect_transcripts = bed_w_transcript.t_transcript.unique().tolist()

        # ensure no transcripts dropped from panel due to missing from
        # transcripts file
        assert len(panel_transcripts) == len(intersect_transcripts), (
            f"Transcript(s) dropped from panel during intersecting with "
            f"transcript file. Total before {len(panel_transcripts)}. Total "
            f"after {len(intersect_transcripts)}. Dropped transcripts: "
            f"{set(panel_transcripts) - set(intersect_transcripts)}"
        )

        return bed_w_transcript


    def add_coverage(self, bed_w_transcript, coverage_df, chunks=False):
        """
        Use pybedtools to add coverage bin data to selected panel regions
        Args:
            - bed_w_transcript (df): panel bed file with transcript information
            - coverage_df (df / list): coverage bin data df / list of dfs if
                chunks value passed
            - chunks (bool): True if coverage_df is a list of dfs
        Returns:
            - bed_w_coverage (df): panel bed with transcript and coverage info
        """
        print("calling bedtools to add coverage info")

        # turn dfs into BedTools objects
        bed_w_transcript = bedtools.BedTool.from_dataframe(bed_w_transcript)

        col_names = [
            "t_chrom", "t_start", "t_end", "t_gene", "t_transcript", "t_exon",
            "c_chrom", "cov_start", "cov_end", "cov"
        ]

        if not chunks:
            # per-base coverage all in one df
            coverage_df = bedtools.BedTool.from_dataframe(coverage_df)

            bed_w_coverage = bed_w_transcript.intersect(
                coverage_df, wa=True, wb=True
            )
            bed_w_coverage = bed_w_coverage.to_dataframe(names=col_names)
        else:
            # coverage data in chunks, loop over each df and intersect
            bed_w_coverage = pd.DataFrame(columns=col_names)

            for num, df in enumerate(coverage_df):
                print(f"intersecting {num + 1}/{len(coverage_df)} coverage chunks")
                # read each to bedtools object, intersect and add back to df
                chunk_df = bedtools.BedTool.from_dataframe(df)

                bed_w_coverage_chunk = bed_w_transcript.intersect(
                    chunk_df, wa=True, wb=True
                )

                bed_w_coverage_chunk = bed_w_coverage_chunk.to_dataframe(
                    names=col_names
                )

                bed_w_coverage = pd.concat(
                    [bed_w_coverage, bed_w_coverage_chunk],
                    ignore_index=True
                )

        # check again for empty output of bedtools, can happen due to memory
        # maxing out and doesn't seem to raise an exception...
        assert len(bed_w_coverage) > 0, """Error intersecting with coverage
        data, empty file generated. Is this the correct coverage data for
        the panel used? bedtools may also have reached memory limit and
        died, try re-running with --chunk_size 1000000"""

        # drop duplicate chromosome col and rename
        bed_w_coverage.drop(columns=["c_chrom"], inplace=True)

        bed_w_coverage.columns = [
            "chrom", "exon_start", "exon_end", "gene", "tx", "exon",
            "cov_start", "cov_end", "cov"
        ]

        return bed_w_coverage


def write_file(bed_w_coverage, outfile):
    """
    Write annotated bed to file
    Args:
        - bed_w_coverage (df): bed file with transcript and coverage info
        - outfile (str): path of output file to write
    Outputs: annotated bed file written to outfile
    """
    # tiny function but want this separate for writing a wrapper script later
    bed_w_coverage.to_csv(outfile, sep="\t", header=False, index=False)
    print(f"annotated bed file written to {outfile}")


def parse_args():
    """
    Parse cmd line arguments
    Args: None
    Returns:
        - args (arguments): args passed from cmd line
    """
    parser = argparse.ArgumentParser(
        description='Annotate panel bed file with transcript & coverage data.'
    )
    parser.add_argument(
        '--panel_bed', '-p',
        help='panel bed file'
    )
    parser.add_argument(
        '--transcript_file', '-t',
        help='file with gene and exon information'
    )
    parser.add_argument(
        '--coverage_file', '-c',
        help='per base coverage data file'
    )
    parser.add_argument(
        '--chunk_size', '-s', type=int,
        help='number of lines of per-base coverage file to read in one go'
    )
    parser.add_argument(
        '--output_name', '-n',
        help='name prefix for output file, if none will use coverage file'
    )

    args = parser.parse_args()

    return args


def main():
    annotate = annotateBed()
    load = loadData()  # class of functions for reading in data

    args = parse_args()

    if not args.output_name:
        # output name not defined, use sample identifier from coverage file
        args.output_name = Path(args.coverage_file).name.split('_')[0]

    # set dir for writing to
    bin_dir = os.path.dirname(os.path.abspath(__file__))
    out_dir = os.path.join(bin_dir, "../output/")
    outfile_name = f"{args.output_name}_annotated.bed"
    outfile = os.path.join(out_dir, outfile_name)

    # read in files
    panel_bed_df = load.read_panel_bed(args.panel_bed)
    transcript_info_df = load.read_transcript_info(args.transcript_file)
    pb_coverage_df = load.read_coverage_data(
        args.coverage_file, args.chunk_size
    )

    # add transcript info
    bed_w_transcript = annotate.add_transcript_info(
        panel_bed_df, transcript_info_df
    )

    # add coverage
    if args.chunk_size:
        # per-base coverage split to multiple dfs to limit memory usage
        bed_w_coverage = annotate.add_coverage(
            bed_w_transcript, pb_coverage_df, chunks=True
        )
    else:
        bed_w_coverage = annotate.add_coverage(
            bed_w_transcript, pb_coverage_df, chunks=False
        )

    # sense check generated file isn't empty, should be caught earlier
    # message parts are joined by implicit string concatenation (no commas),
    # otherwise the assertion message would be printed as a tuple
    assert len(bed_w_coverage.index) > 0, (
        'An error has occurred: annotated bed file is empty. This is likely '
        'due to an error in regions defined in bed file (i.e. different '
        'transcripts to those in the transcripts file). Start debugging by '
        'intersecting files manually...'
    )

    write_file(bed_w_coverage, outfile)


if __name__ == "__main__":

    main()