mt1022/gppy

Genomic Positioning with Python

gppy is a lightweight, easy-to-install Python package with no third-party dependencies. It converts genomic intervals to support transcriptome and translatome analysis.

Main features include:

  • convert transcript/CDS coordinates/intervals to genomic coordinates/intervals in bed12 format and vice versa, correctly handling introns.
  • extract mRNA/CDS/UTR intervals from GTF files and export them in bed12 format.
  • extract metadata from GTF files (including gene names, biotypes, canonical status, and transcript/CDS/UTR lengths) and export it in tabular format.

News

  • 2024.11.14: Added a custom version (gtf_flybase.py) that works with FlyBase GTF files. FlyBase GTF files are formatted differently from those in the Ensembl genome browser, so the original gppy may fail when extracting metadata. In such cases, try python scripts/gtf_flybase.py.

Installation

pip install gppy

# alternatively, download wheel and install
pip install gppy-version-py3-none-any.whl

Run without installation

Scripts in this package rely only on the Python standard library (tested with Python >= 3.7). No third-party dependency is required. All scripts can be run from the command line without installation once downloaded.

wget https://raw.githubusercontent.com/mt1022/gppy/main/gppy/gtf.py

To run gppy:

# as package
gppy subcommand -h

# script
python gppy/gtf.py subcommand -h

How to specify coordinates when using gppy?

  • Genomic positions (gpos), genomic intervals (giv), transcriptomic positions (tpos), and transcriptomic intervals (tiv) are all 1-based and inclusive. For example, if a region spans the fifth to the tenth nucleotide of an mRNA, the tiv is (5, 10) and the tpos of the first nucleotide in this region is 5.
  • bed files generated by gppy represent genomic regions as zero-based, half-open intervals, following common practice. For example, if a region spans the fifth to the tenth nucleotide of chr1, the first three columns for this region in bed are chr1 4 10.
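
The two conventions above differ only in the start coordinate. A minimal Python sketch of the conversion follows; `to_bed` and `from_bed` are hypothetical helper names used for illustration and are not part of gppy.

```python
def to_bed(start, end):
    """Convert a 1-based inclusive interval to a 0-based half-open one."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def from_bed(bed_start, bed_end):
    """Convert a 0-based half-open interval back to 1-based inclusive."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("expected 0 <= bed_start < bed_end")
    return bed_start + 1, bed_end


# Fifth to tenth nucleotide, as in the example above.
print(to_bed(5, 10))    # (4, 10)
print(from_bed(4, 10))  # (5, 10)
```

The interval length is `end - start + 1` in the 1-based form and `bed_end - bed_start` in the bed form; both give 6 here.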

Examples

Extract transcript length stats and metadata:

gppy txinfo -g test/human.chrY.gtf >test/human.chrY.txinfo.tsv
cut -f1-9,12,15,19-22 test/human.chrY.txinfo.tsv | head
# tx_name	gene_id	chrom	strand	nexon	tx_len	cds_len	utr5_len	utr3_len	gene_name	transcript_biotype	ccds	ensembl_canonical	mane_select	basic
# ENST00000431340	ENSG00000215601	Y	+	4	443	0	0	0	TSPY24P	unprocessed_pseudogene	False	True	False	True
# ENST00000415010	ENSG00000215603	Y	-	1	1191	0	0	0	ZNF92P1Y	processed_pseudogene	False	True	False	True
# ENST00000449381	ENSG00000231436	Y	-	8	1145	0	0	0	RBMY3AP	unprocessed_pseudogene	False	True	False	True
# ENST00000436888	ENSG00000225878	Y	-	1	1164	0	0	0	SERBP1P2	processed_pseudogene	False	True	False	True
# ENST00000421279	ENSG00000236435	Y	-	5	868	0	0	0	TSPY12P	unprocessed_pseudogene	False	True	False	True
# ENST00000430032	ENSG00000278478	Y	+	1	279	0	0	0		processed_pseudogene	False	True	False	True
# ENST00000557448	ENSG00000258991	Y	+	1	1267	0	0	0	DUX4L19	unprocessed_pseudogene	False	True	False	True
# ENST00000651670	ENSG00000237048	Y	+	4	1123	0	0	0	TTTY12	lncRNA	False	False	False	True
# ENST00000413466	ENSG00000237048	Y	+	3	1046	0	0	0	TTTY12	lncRNA	False	True	False	False

Note: if your GTF file is not formatted like those in the Ensembl genome browser, gppy may fail when extracting metadata. In such cases, try gppy txinfo_basic, which reports only basic information: name, id, chrom, strand, and length-related features.
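
The txinfo table can be read with the Python standard library. The sketch below assumes that the file starts with a header row holding the column names shown above (without a leading `#`) and that boolean columns are written as the strings `True` and `False`; both are assumptions based on the sample output, and `canonical_lengths` is a hypothetical helper, not a gppy function.

```python
import csv
import io


def canonical_lengths(handle):
    """Map tx_name -> tx_len for rows flagged as ensembl_canonical."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["tx_name"]: int(row["tx_len"])
        for row in reader
        if row["ensembl_canonical"] == "True"  # assumed string encoding
    }


# Small in-memory stand-in for test/human.chrY.txinfo.tsv (subset of columns).
demo = io.StringIO(
    "tx_name\ttx_len\tensembl_canonical\n"
    "ENST00000651670\t1123\tFalse\n"
    "ENST00000413466\t1046\tTrue\n"
)
print(canonical_lengths(demo))  # {'ENST00000413466': 1046}
```

To use it on a real file, pass an open handle instead, for example `open("test/human.chrY.txinfo.tsv", newline="")`.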

Extract CDS regions of each protein-coding transcript and export in bed12 format

gppy convert2bed -g test/human.chrY.gtf -t cds >test/human.chrY.cds.bed12
head test/human.chrY.cds.bed12
# Y	22501564	22514067	ENST00000303728	ENSG00000169789	+	0	0	0	3	69,116,256,	0,2644,12247,
# Y	22501564	22512665	ENST00000477123	ENSG00000169789	+	0	0	0	3	69,116,28,	0,2644,11073,
# Y	12709447	12859413	ENST00000651177	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,
# Y	12709447	12859413	ENST00000338981	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,
# Y	12847044	12859413	ENST00000453031	ENSG00000114374	+	0	0	0	5	109,157,213,96,135,	0,9295,9588,10521,12234,
# Y	22072325	22084839	ENST00000303804	ENSG00000169807	-	0	0	0	3	256,116,69,	0,9754,12445,
# Y	22073730	22084839	ENST00000472391	ENSG00000169807	-	0	0	0	3	28,116,69,	0,8349,11040,
# Y	20575871	20592343	ENST00000361365	ENSG00000198692	+	0	0	0	7	16,84,104,51,82,92,3,	0,3736,6718,8602,12152,13612,16469,
# Y	20575871	20592343	ENST00000382772	ENSG00000198692	+	0	0	0	6	16,84,104,82,92,3,	0,3736,6718,12152,13612,16469,
# Y	22992343	22992376	ENST00000602732	ENSG00000183753	+	0	0	0	1	33,	0,
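
To make the bed12 layout concrete, the following Python sketch expands one of the rows above into its individual CDS blocks in 1-based inclusive genomic coordinates. It relies only on the generic bed12 convention (column 2 is the start, columns 11 and 12 hold block sizes and block starts relative to column 2); `bed12_blocks` is a hypothetical helper, not part of gppy.

```python
def bed12_blocks(line):
    """Return the blocks of a bed12 row as 1-based inclusive (start, end) pairs."""
    fields = line.rstrip("\n").split("\t")
    chrom_start = int(fields[1])
    # Both lists end with a trailing comma, so drop empty items.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    return [
        (chrom_start + offset + 1, chrom_start + offset + size)
        for offset, size in zip(offsets, sizes)
    ]


row = (
    "Y\t22501564\t22514067\tENST00000303728\tENSG00000169789\t+\t0\t0\t0\t3"
    "\t69,116,256,\t0,2644,12247,"
)
blocks = bed12_blocks(row)
print(blocks)
# [(22501565, 22501633), (22504209, 22504324), (22513812, 22514067)]
print(sum(end - start + 1 for start, end in blocks))  # 441
```

The block lengths sum to 441 nt, the total CDS length of this transcript.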

Convert CDS regions in genome coordinates to transcriptome coordinates

awk -v OFS="\t" '{print $4, $2 + 1, $3, $6}' test/human.chrY.cds.bed12 >test/human.chrY.cds.giv.tsv
gppy giv2tiv -g test/human.chrY.gtf -i test/human.chrY.cds.giv.tsv >test/human.chrY.cds.tiv.tsv

head test/human.chrY.cds.giv.tsv
# ENST00000303728	22501565	22514067	+
# ENST00000477123	22501565	22512665	+
# ENST00000651177	12709448	12859413	+
# ENST00000338981	12709448	12859413	+
# ENST00000453031	12847045	12859413	+
# ENST00000303804	22072326	22084839	-
# ENST00000472391	22073731	22084839	-
# ENST00000361365	20575872	20592343	+
# ENST00000382772	20575872	20592343	+
# ENST00000602732	22992344	22992376	+

head test/human.chrY.cds.tiv.tsv
# ENST00000303728	22501565	22514067	+	228	668	exon	exon
# ENST00000477123	22501565	22512665	+	228	440	exon	exon
# ENST00000651177	12709448	12859413	+	587	8251	exon	exon
# ENST00000338981	12709448	12859413	+	946	8610	exon	exon
# ENST00000453031	12847045	12859413	+	1	710	exon	exon
# ENST00000303804	22072326	22084839	-	228	668	exon	exon
# ENST00000472391	22073731	22084839	-	228	440	exon	exon
# ENST00000361365	20575872	20592343	+	97	528	exon	exon
# ENST00000382772	20575872	20592343	+	79	459	exon	exon
# ENST00000602732	22992344	22992376	+	527	559	exon	exon

Convert CDS regions in transcriptome coordinates to genome coordinates

cut -f1,5,6 test/human.chrY.cds.tiv.tsv >test/human.chrY.cds.tiv2.tsv
gppy tiv2giv -g test/human.chrY.gtf -i test/human.chrY.cds.tiv2.tsv -a >test/human.chrY.cds.giv2.bed12

head test/human.chrY.cds.tiv2.tsv
# ENST00000303728	228	668
# ENST00000477123	228	440
# ENST00000651177	587	8251
# ENST00000338981	946	8610
# ENST00000453031	1	710
# ENST00000303804	228	668
# ENST00000472391	228	440
# ENST00000361365	97	528
# ENST00000382772	79	459
# ENST00000602732	527	559

head test/human.chrY.cds.giv2.bed12
# Y	22501564	22514067	ENST00000303728	ENSG00000169789	+	0	0	0	3	69,116,256,	0,2644,12247,	ENST00000303728	228	668
# Y	22501564	22512665	ENST00000477123	ENSG00000169789	+	0	0	0	3	69,116,28,	0,2644,11073,	ENST00000477123	228	440
# Y	12709447	12859413	ENST00000651177	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,	ENST00000651177	587	8251
# Y	12709447	12859413	ENST00000338981	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,	ENST00000338981	946	8610
# Y	12847044	12859413	ENST00000453031	ENSG00000114374	+	0	0	0	5	109,157,213,96,135,	0,9295,9588,10521,12234,	ENST00000453031	1	710
# Y	22072325	22084839	ENST00000303804	ENSG00000169807	-	0	0	0	3	256,116,69,	0,9754,12445,	ENST00000303804	228	668
# Y	22073730	22084839	ENST00000472391	ENSG00000169807	-	0	0	0	3	28,116,69,	0,8349,11040,	ENST00000472391	228	440
# Y	20575871	20592343	ENST00000361365	ENSG00000198692	+	0	0	0	7	16,84,104,51,82,92,3,	0,3736,6718,8602,12152,13612,16469,	ENST00000361365	97	528
# Y	20575871	20592343	ENST00000382772	ENSG00000198692	+	0	0	0	6	16,84,104,82,92,3,	0,3736,6718,12152,13612,16469,	ENST00000382772	79	459
# Y	22992343	22992376	ENST00000602732	ENSG00000183753	+	0	0	0	1	33,	0,	ENST00000602732	527	559

# the above should be identical to the CDS regions we extracted from GTF with `convert2bed`
head test/human.chrY.cds.bed12
# Y	22501564	22514067	ENST00000303728	ENSG00000169789	+	0	0	0	3	69,116,256,	0,2644,12247,
# Y	22501564	22512665	ENST00000477123	ENSG00000169789	+	0	0	0	3	69,116,28,	0,2644,11073,
# Y	12709447	12859413	ENST00000651177	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,
# Y	12709447	12859413	ENST00000338981	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,
# Y	12847044	12859413	ENST00000453031	ENSG00000114374	+	0	0	0	5	109,157,213,96,135,	0,9295,9588,10521,12234,
# Y	22072325	22084839	ENST00000303804	ENSG00000169807	-	0	0	0	3	256,116,69,	0,9754,12445,
# Y	22073730	22084839	ENST00000472391	ENSG00000169807	-	0	0	0	3	28,116,69,	0,8349,11040,
# Y	20575871	20592343	ENST00000361365	ENSG00000198692	+	0	0	0	7	16,84,104,51,82,92,3,	0,3736,6718,8602,12152,13612,16469,
# Y	20575871	20592343	ENST00000382772	ENSG00000198692	+	0	0	0	6	16,84,104,82,92,3,	0,3736,6718,12152,13612,16469,
# Y	22992343	22992376	ENST00000602732	ENSG00000183753	+	0	0	0	1	33,	0,
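
The round-trip check can also be scripted rather than compared by eye. This sketch assumes that the `-a` output consists of the 12 bed columns followed by the appended input columns, as in the listing above; `first12` and `same_intervals` are hypothetical helper names.

```python
def first12(lines):
    """Keep only the first 12 tab-separated columns of each line."""
    return [tuple(line.rstrip("\n").split("\t")[:12]) for line in lines]


def same_intervals(lines_a, lines_b):
    """True if both inputs describe the same bed12 records in the same order."""
    return first12(lines_a) == first12(lines_b)


# In-memory stand-ins for the two files compared above.
original = ["Y\t22992343\t22992376\tENST00000602732\tENSG00000183753\t+\t0\t0\t0\t1\t33,\t0,\n"]
roundtrip = [original[0].rstrip("\n") + "\tENST00000602732\t527\t559\n"]
print(same_intervals(original, roundtrip))  # True
```

With real files, pass two open handles, for example `open("test/human.chrY.cds.bed12")` and `open("test/human.chrY.cds.giv2.bed12")`.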

Conversion between genomic and transcriptomic positions for individual sites

cut -f1,2 test/human.chrY.cds.tiv2.tsv >test/human.chrY.cds.start.tpos.tsv
gppy t2g -g test/human.chrY.gtf -i test/human.chrY.cds.start.tpos.tsv >test/human.chrY.cds.start.gpos.tsv

head test/human.chrY.cds.start.tpos.tsv
# ENST00000303728	228
# ENST00000477123	228
# ENST00000651177	587
# ENST00000338981	946
# ENST00000453031	1
# ENST00000303804	228
# ENST00000472391	228
# ENST00000361365	97
# ENST00000382772	79
# ENST00000602732	527

head test/human.chrY.cds.start.gpos.tsv
# ENST00000303728	228	Y	+	22501565
# ENST00000477123	228	Y	+	22501565
# ENST00000651177	587	Y	+	12709448
# ENST00000338981	946	Y	+	12709448
# ENST00000453031	1	Y	+	12847045
# ENST00000303804	228	Y	-	22084839
# ENST00000472391	228	Y	-	22084839
# ENST00000361365	97	Y	+	20575872
# ENST00000382772	79	Y	+	20575872
# ENST00000602732	527	Y	+	22992344

cut -f1,5 test/human.chrY.cds.start.gpos.tsv >test/human.chrY.cds.start.gpos2.tsv
gppy g2t -g test/human.chrY.gtf -i test/human.chrY.cds.start.gpos2.tsv >test/human.chrY.cds.start.tpos2.tsv

head test/human.chrY.cds.start.gpos2.tsv
# ENST00000303728	22501565
# ENST00000477123	22501565
# ENST00000651177	12709448
# ENST00000338981	12709448
# ENST00000453031	12847045
# ENST00000303804	22084839
# ENST00000472391	22084839
# ENST00000361365	20575872
# ENST00000382772	20575872
# ENST00000602732	22992344

head test/human.chrY.cds.start.tpos2.tsv
# ENST00000303728	22501565	228	exon
# ENST00000477123	22501565	228	exon
# ENST00000651177	12709448	587	exon
# ENST00000338981	12709448	946	exon
# ENST00000453031	12847045	1	exon
# ENST00000303804	22084839	228	exon
# ENST00000472391	22084839	228	exon
# ENST00000361365	20575872	97	exon
# ENST00000382772	20575872	79	exon
# ENST00000602732	22992344	527	exon
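
The idea behind these conversions can be illustrated with a short Python sketch. It is not gppy's implementation: the function names mirror the subcommands only for readability, the exon coordinates are made up, and positions outside exons simply raise an error here.

```python
def t2g(exons, strand, tpos):
    """Map a 1-based transcript position to a 1-based genomic position.

    exons: 1-based inclusive (start, end) blocks sorted by genomic start.
    """
    if tpos < 1:
        raise ValueError("tpos must be >= 1")
    # Transcripts on the minus strand are read from the last block backwards.
    ordered = exons if strand == "+" else exons[::-1]
    remaining = tpos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1 if strand == "+" else end - remaining + 1
        remaining -= length
    raise ValueError("tpos is beyond the transcript end")


def g2t(exons, strand, gpos):
    """Map a 1-based genomic position inside an exon to a transcript position."""
    ordered = exons if strand == "+" else exons[::-1]
    offset = 0  # exonic bases already passed in transcript order
    for start, end in ordered:
        if start <= gpos <= end:
            within = gpos - start if strand == "+" else end - gpos
            return offset + within + 1
        offset += end - start + 1
    raise ValueError("gpos does not fall in an exon")


toy_exons = [(101, 150), (201, 260)]  # hypothetical two-exon transcript
print(t2g(toy_exons, "+", 51))   # 201: first base of the second exon
print(t2g(toy_exons, "-", 1))    # 260: transcript start on the minus strand
print(g2t(toy_exons, "-", 150))  # 61: first base after the 60-nt upstream exon
```

Skipping intron bases while accumulating exon lengths is what keeps transcript coordinates contiguous across splice junctions.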

Usage

List utilities

$ gppy -h
usage: gppy|gtf.py [-h] {txinfo,convert2bed,t2g,g2t,tiv2giv,giv2tiv,extract_thick} ...

GTF file manipulation

options:
  -h, --help            show this help message and exit

GTF operations:
  {txinfo,convert2bed,t2g,g2t,tiv2giv,giv2tiv,extract_thick}
                        supported operations
    txinfo              summary information of each transcript
    convert2bed         convert GTF to bed12 format
    t2g                 convert tpos to gpos
    g2t                 convert gpos to tpos
    tiv2giv             convert tiv to giv
    giv2tiv             convert giv to tiv
    extract_thick       Extract nested thick regions from bed12

Extract basic transcript information

$ gppy txinfo -h
usage: gppy|gtf.py txinfo [-h] [-g GTF]

options:
  -h, --help         show this help message and exit
  -g GTF, --gtf GTF  input gtf file (default: -)

Extract transcript/CDS/UTR features in GTF as bed12 format

$ gppy convert2bed -h
usage: gtf.py convert2bed [-h] [-g GTF] [-t {exon,cds,utr5,utr3}] [-e EXTEND]

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     input gtf file (default: -)
  -t {exon,cds,utr5,utr3}, --type {exon,cds,utr5,utr3}
                        types of intervals to be converted to bed for each transcript (default: exon)
  -e EXTEND, --extend EXTEND
                        number of bases to extend at both sides (default: 0)

Convert transcript positions to genomic positions

$ gppy t2g -h
usage: gppy|gtf.py t2g [-h] [-g GTF] [-i INFILE]

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     input gtf file (default: -)
  -i INFILE, --infile INFILE
                        tab-delimited file with the first two columns composed of tx_id and transcript coordinates (default: None)

Convert transcript intervals to genomic intervals (allow spliced regions)

$ gppy tiv2giv -h
usage: gppy|gtf.py tiv2giv [-h] [-g GTF] [-i INFILE] [-a]

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     input gtf file (default: -)
  -i INFILE, --infile INFILE
                        tab-delimited file with the first three columns composed of tx_id, start and end coordinates (default: None)
  -a, --append          whether to append input at the end of the ouput (default: False)

Convert genomic positions to transcript positions

$ gppy g2t -h
usage: gppy|gtf.py g2t [-h] [-g GTF] [-i INFILE]

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     input gtf file (default: -)
  -i INFILE, --infile INFILE
                        tab-delimited file with the first two columns composed of tx_id and genomic coordinates (default: None)

Convert genomic intervals to transcript intervals

$ gppy giv2tiv -h
usage: gppy|gtf.py giv2tiv [-h] [-g GTF] [-i INFILE]

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     input gtf file (default: -)
  -i INFILE, --infile INFILE
                        tab-delimited file with the first three columns composed of tx_id, start and end coordinates (default: None)

Links

  • GTF format check and fix: AGAT

Other

Please use the issues section to report bugs or request features :)