weng-lab/umitools

Description

A toolset for handling sequencing data with unique molecular identifiers (UMIs)

Installation

This toolset requires Python 3.

To install umitools, run

pip3 install umitools  # add --user if you want to install it to your own directory

How to process UMI small RNA-seq data

0. (Skip to the next step if you have data.) Download the test data

wget -O clipped.fq.gz "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.sRNA-seq.fq.gz"

1. Identify UMIs:

umitools reformat_sra_fastq -i clipped.fq.gz -o sra.umi.fq -d sra.dup.fq

How to process UMI RNA-seq data

0. (Skip to the next step if you have data.) Download the test data

wget -O "r1.fq.gz" "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.r1.fq.gz"
wget -O "r2.fq.gz" "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.r2.fq.gz"

1. To identify reads with proper UMIs and parse out their UMIs, you can run:

umitools reformat_fastq -l r1.fq.gz -r r2.fq.gz -L r1.fmt.fq.gz -R r2.fmt.fq.gz

It will also output some stats for your UMI RNA-seq data.

2. Then you can use your favorite RNA-seq aligner (e.g. STAR) to map these reads to the genome and get a BAM/SAM file (e.g., fmt.bam).

To download an example BAM file, run

wget -O fmt.bam https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.sorted.bam

To mark the reads that are PCR duplicates (assuming you want to use 8 threads), run

umitools mark_duplicates -f fmt.bam -p 8

This produces fmt.deumi.sorted.bam, in which reads identified as PCR duplicates carry the flag 0x400. If your downstream analysis (e.g., Picard) takes this flag into account, you are good to go! Otherwise, you can simply remove the PCR duplicates:

samtools view -b -h -F 0x400 fmt.deumi.sorted.bam > fmt.deumi.F400.sorted.bam

You can then feed the BAM file without PCR duplicates (fmt.deumi.F400.sorted.bam) to your downstream analysis.
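To illustrate what the 0x400 filter does, here is a minimal Python sketch that applies the same bit test to SAM text records. It is not part of umitools. The function names and the example records are made up for illustration. It assumes plain SAM text, where the flag is the second tab-separated column.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit meaning "PCR or optical duplicate"


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM flag value."""
    return bool(flag & DUPLICATE_FLAG)


def drop_duplicates(sam_lines):
    """Yield header lines and alignments whose flag lacks the 0x400 bit."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines are always kept
            yield line
            continue
        flag = int(line.split("\t")[1])  # column 2 of a SAM record is FLAG
        if not is_duplicate(flag):
            yield line


# Hypothetical records: read2 has flag 1123 = 99 + 1024, so 0x400 is set.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr1\t100\t255\t50M\t=\t200\t150\t*\t*",
    "read2\t1123\tchr1\t100\t255\t50M\t=\t200\t150\t*\t*",
]
kept = list(drop_duplicates(example))
```

Here kept holds the header and read1 only, which mirrors what -F 0x400 does in the samtools command above.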

How UMI locators are handled

For UMI RNA-seq, the UMI locator in each read is required to exactly match GGG, TCA, or ATC. You can customize the locator sequence by setting --umi-locator LOCATOR1,LOCATOR2,LOCATOR3,LOCATOR4 when you run umi_reformat_fastq.
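As a rough illustration of exact locator matching, the Python sketch below checks whether one of the default locators follows a UMI at the start of a read. This is not the umitools implementation. The function name, the umi_len parameter, and the assumption that the locator sits immediately after the UMI are all hypothetical and only serve the example.

```python
DEFAULT_LOCATORS = ("GGG", "TCA", "ATC")  # default locators named in the text


def find_locator(read_seq, umi_len, locators=DEFAULT_LOCATORS):
    """Return (umi, locator) if a locator exactly follows the first
    umi_len bases of the read, otherwise None.

    umi_len and the UMI-then-locator layout are assumptions for this sketch.
    """
    for locator in locators:
        if read_seq[umi_len:umi_len + len(locator)] == locator:
            return read_seq[:umi_len], locator
    return None


# Hypothetical reads with a 5-nt UMI followed by a 3-nt locator.
hit = find_locator("ACGTAGGGTTTT", umi_len=5)
miss = find_locator("ACGTAGGATTTT", umi_len=5)
```

Because the match is exact, a single substitution in the locator (GGG to GGA in the second read) is enough for the read to be rejected.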

For UMI small RNA-seq, the default setting requires that the 5' UMI locator in each read match NNNCGANNNTACNNN or NNNATCNNNAGTNNN, AND that the 3' UMI locator match NNNGTCNNNTAGNNN. The N positions are not required to match, and at most 1 error is allowed across all non-N positions. You can customize the locator sequences for small RNA-seq by setting --umi-pattern-5 and --umi-pattern-3. You can further tweak the number of errors allowed by changing N_MISMATCH_ALLOWED_IN_UMI_LOCATOR in the script.
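The matching rule for small RNA-seq can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used by umitools: the function names are hypothetical, errors are treated as substitutions only, the 5' pattern is assumed to start at the first base of the clipped read and the 3' pattern to end at its last base, and the mismatch limit is applied to each pattern separately.

```python
UMI_PATTERNS_5 = ("NNNCGANNNTACNNN", "NNNATCNNNAGTNNN")  # defaults from the text
UMI_PATTERN_3 = "NNNGTCNNNTAGNNN"


def count_mismatches(seq, pattern):
    """Count mismatches at non-N pattern positions; N matches any base."""
    return sum(1 for base, p in zip(seq, pattern) if p != "N" and base != p)


def matches_pattern(seq, pattern, max_mismatches=1):
    """Check the first len(pattern) bases of seq against pattern."""
    if len(seq) < len(pattern):
        return False
    return count_mismatches(seq[:len(pattern)], pattern) <= max_mismatches


def has_proper_umis(read_seq, max_mismatches=1):
    """True if the 5' end matches either 5' pattern and the 3' end
    matches the 3' pattern (placement is an assumption of this sketch)."""
    five_ok = any(matches_pattern(read_seq, p, max_mismatches)
                  for p in UMI_PATTERNS_5)
    three_end = read_seq[-len(UMI_PATTERN_3):]
    three_ok = matches_pattern(three_end, UMI_PATTERN_3, max_mismatches)
    return five_ok and three_ok


# Hypothetical 48-nt read: 15-nt 5' UMI + 18-nt insert + 15-nt 3' UMI.
read = "AAACGATTTTACGGG" + "TGAGGTAGTAGGTTGTAT" + "CCCGTCAAATAGTTT"
ok = has_proper_umis(read)
```

With max_mismatches=1, a 5' end such as AAACGTTTTTACGGG (one substitution in CGA) still passes, while AAACTTTTTTACGGG (two substitutions) does not.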

Other utilities

umi_simulator

A simple in silico PCR simulator for UMI reads. Run it with -h to see options.

FAQ

Other ways to run umitools?

In addition to being run as subcommands of umitools (e.g., umitools mark_duplicates), these commands can also be called individually.

  • umitools reformat_fastq is equivalent to umi_reformat_fastq.
  • umitools mark_duplicates is equivalent to umi_mark_duplicates.
  • umitools reformat_sra_fastq is equivalent to umi_reformat_sra_fastq.

How to remove 3' end small RNA-seq adapter

There are many tools to remove adapters. This is just one example. To process a fastq file (raw.fq.gz) from your UMI small RNA-seq data, you can first remove the 3' end small RNA-seq adapter. For example, you can use fastx_clipper from the FASTX-Toolkit with the adapter sequence TGGAATTCTCGGGTGCCAAGG:

zcat raw.fq.gz | fastx_clipper -a TGGAATTCTCGGGTGCCAAGG -l 48 -c -Q33 2> raw.clipped.log | gzip -c - > clipped.fq.gz

where -l 48 specifies the minimum length of the reads after adapter removal. This ensures that every small RNA insert is at least 18 nt long (18 nt + 15 nt in the 5' UMI + 15 nt in the 3' UMI = 48 nt).
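The value 48 follows directly from the default UMI pattern lengths. A small Python sketch of the arithmetic, in case you change the patterns or the minimum insert length (the function name is hypothetical, and the patterns are the defaults quoted in this README):

```python
UMI_PATTERN_5 = "NNNCGANNNTACNNN"  # one of the default 5' patterns (15 nt)
UMI_PATTERN_3 = "NNNGTCNNNTAGNNN"  # default 3' pattern (15 nt)


def min_clipped_length(min_insert_len, pattern_5, pattern_3):
    """Minimum read length after adapter clipping: 5' UMI + insert + 3' UMI."""
    return len(pattern_5) + min_insert_len + len(pattern_3)


# 15 + 18 + 15 = 48, the value passed to fastx_clipper -l above.
min_len = min_clipped_length(18, UMI_PATTERN_5, UMI_PATTERN_3)
print(min_len)
```

If you use custom patterns of a different length, the same sum gives the value to pass to -l.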

Not sure if your libraries have high-quality UMIs at proper positions?

To see which reads have improper UMIs, run

umitools reformat_sra_fastq -i clipped.fq.gz -o sra.umi.fq -d sra.dup.fq --reads-with-improper-umi sra.improper_umi.fq

where sra.umi.fq contains all the non-duplicate reads, sra.dup.fq contains all duplicates, and sra.improper_umi.fq contains the reads with improper UMIs.

Feeling adventurous? You can install the git version

  1. Grab the version on GitHub:
git clone https://github.com/weng-lab/umitools.git
  2. Install it in editable mode:
pip3 install -e /path/to/umitools

Citation

Fu, Y., Wu, P.-H., Beane, T., Zamore, P.D., and Weng, Z. (2018). Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics 19, 531.

Contact us

Yu Fu (Yu.Fu {at} umassmed.edu)