gencode-feature-lengths.Rmd

---
title: "Micro-first exons in GENCODE 26"
author: "Charles Plessy"
output: 
  html_document: 
    keep_md: yes
    toc: yes
---

```{r options, message = FALSE}
knitr::opts_chunk$set(cache = TRUE)
```

## The question

I am interested in the length of first exons.  To get an estimate of what to expect,
I inspected the contents of GENCODE 26.  I am surprised to find very small first
exons, as I do not see how these could be produced by splicing: how could the
spliceosome handle a donor fragment that is only a few nucleotides long ?

So my question is *what do these micro-first exons repesent*.  Are they just
bugs in processing pipelines ?  Do they represent processed transcripts (in which
case its short length is not in contradiction with splicing mechanisms).  Or
is there something else I am missing ?

## Data load and preparation

This analysis is done in R.  First, let's load R libraries.

```{r load-libraries, message = FALSE}
library(magrittr)
library(GenomicRanges)
```

Then, let's download GENCODE 26.

```{r download-gencode, engine='bash'}
wget --quiet --continue ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz
```

Then, let's load GENCODE 26 in R as a GenomicRanges object and select all
exon features.

```{r load-data}
g <- rtracklayer::import.gff("gencode.v26.annotation.gtf.gz")
summary(g)
colnames(mcols(g))
table(g$type)
ge <- g[g$type == "exon"]
mcols(ge) %<>% droplevels
```

## Exon lengths and number

Many transcripts have more than one exon.

```{r}
tapply(ge$exon_number %>% as.numeric, ge$transcript_id, max) %>% table %>% head(50) %>%
  plot(main = "Number of exons per GENCODE transcripts", sub = "Transcripts with > 50 exons were omitted.")
```

Exons tend to be short.

```{r exon-length}
width(ge) %>% summary
width(ge) %>% table %>% head(1000) %>%
  plot(main = "Length of GENCODE exons", sub = "Exons longer than 100 nt were omitted.")
```

Some Exons are very short; they can be microexons.

```{r micro-exon-length}
width(ge) %>% table %>% head(20) %>%
  plot(main = "Length of shortest GENCODE exons", sub = "Exons longer than 20 nt were omitted.")
```

## First exons of spliced transcripts

Let's focus on spliced transcripts.

```{r spliced-transcripts}
spliced <- tapply(ge$exon_number %>% as.numeric, ge$transcript_id, max) > 1
gs <- ge[ge$transcript_id %in% names(which(spliced))]
summary(gs)

tapply(gs$exon_number %>% as.numeric, gs$transcript_id, max) %>%
  table %>% head(50) %>%
  plot(main = "Number of exons per spliced transcripts", sub = "Transcripts with > 50 exons were omitted.")
 
width(gs) %>% table %>% head(20) %>%
  plot( main = "Length of shortest GENCODE spliced exons"
      , sub = "Exons longer than 20 nt were omitted.")
```

And within them, let's focus on first exons.  Altogether, their size distribution
is not very different from all the exons together.

```{r first-exons}
gs1 <- gs[gs$exon_number == 1]

width(gs1) %>% summary

width(gs1) %>% table %>% head(1000) %>%
  plot(main = "Length of spliced first exons", sub = "Exons longer than 500 nt were omitted.")
```

Surprisingly, even within first exons, there are microexons.

```{r first-microexons}

width(gs1) %>% table %>% head(20) %>%
  plot(main = "Length of spliced first exons", sub = "Exons longer than 20 nt were omitted.")
```

## Examples

Inspection in the UCSC browser suggested that many of these micro-first exons
coincided with translation start sites.  Indeed, it seems to happen often, but
not systematically.

```{r examples}
options(width=110)
options("showTailLines"=Inf) # http://stackoverflow.com/questions/16949545/show-all-lines-in-genomicrange-package-output

x <- g[g$transcript_id %in% gs1[width(gs1) < 4]$transcript_id]

# For display purposes, let's give names to each entry, and re-sort the names
# of the metadata's columns.

names(x) <- x$transcript_name

mcols(x) <- mcols(x)[,c("type", "transcript_type", "gene_id",
  "source", "gene_type", "transcript_id", "level", "havana_gene",  
  "transcript_name", "transcript_support_level", "tag", "havana_transcript", 
  "exon_number", "exon_id", "ont", "protein_id", "ccdsid", "score", "phase",
  "gene_name")]

x[x$type %in% c("start_codon", "exon") & x$exon_number == 1]
options("showTailLines"=NULL)
```