---
title: "ORFik Overview"
author: "Haakon Tjeldnes & Kornel Labun"
date: "`r BiocStyle::doc_date()`"
package: "`r pkg_ver('ORFik')`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{ORFik Overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

Welcome to the `ORFik` package. This vignette will walk you through our main 
package usage with examples. `ORFik` is an R package containing various 
functions for analysis of RiboSeq, RNASeq and CageSeq data.  

`ORFik` currently supports:  

1. Finding Open Reading Frames (very fast) in the genome of interest or on the 
set of transcripts/sequences.  
2. Automatic estimations of RiboSeq footprint shift.  
3. Utilities for metaplots of RiboSeq coverage over gene START and STOP codons 
allowing to spot the shift.  
4. Shifting functions for the RiboSeq data.  
5. Finding new Transcription Start Sites with the use of CageSeq data.  
6. Various measurements of gene identity e.g. FLOSS, coverage, ORFscore, 
entropy that are recreated based on many scientific publications.  
7. Utility functions to extend GenomicRanges for faster grouping, splitting, 
tiling etc.  

# Finding Open Reading Frames 

In molecular genetics, an Open Reading Frame (ORF) is the part of a reading 
frame that has the ability to be translated. It does not mean that every ORF is
being translated or is functional, but to be able to find novel genes we must be
able to firstly identify potential ORFs.

To find all Open Reading Frames (ORFs) and possibly map them to genomic 
coordinates `ORFik` gives you three main functions:

* `findORFs` - find ORFs in sequences of interest,
* `findMapORFs` - find ORFs and map them to their respective genomic coordinates 
* `findORFsFasta` - find ORFs in Fasta file or `BSGenome` (supports circular genomes!)

## Example of finding ORFs in on 5' UTR of hg19

```{r eval = TRUE, echo = TRUE, message = FALSE}
library(ORFik)
library(GenomicFeatures)
```

After loading libraries, load example data from `GenomicFeatures`. We load gtf 
file as txdb. We will extract the 5' leaders to find all upstream open reading 
frames.

```{r eval = TRUE, echo = TRUE}
txdbFile <- system.file("extdata", "hg19_knownGene_sample.sqlite", 
                        package = "GenomicFeatures")
txdb <- loadDb(txdbFile)
fiveUTRs <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fiveUTRs
```

As we can see we have extracted 5' UTRs for hg19 annotations. Now we can load
`BSgenome` version of human genome (hg19). If you don't have this package 
installed you will not see the result from the code below. You might have to 
install `BSgenome.Hsapiens.UCSC.hg19` and run the code for yourself as we don't
install this package together with `ORFik`.

```{r eval = TRUE, echo = TRUE, message = FALSE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {

  # Extract sequences of fiveUTRs.
  tx_seqs <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19::Hsapiens, fiveUTRs) 
  
  # Find all ORFs on those transcripts and get their genomic coordinates
  fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs)
  fiveUTR_ORFs
}
```

In the example above you can see that fiveUTR_ORFs are grouped by transcript, 
the first group is from transcript "uc010ogz.1". Meta-column names contains name 
of the transcript and identifier of the ORF separated by "_". When ORF is 
separated into two exons you can see it twice, like the first ORF with name 
"uc010ogz.1_1". The first ORF will always be the one most upstream for "+"" 
strand, and least upstream for "-"" strand.

# CageSeq data for 5' UTR re-annotation

In the prerevious example we used the refence annotation of the 5' UTRs
from the package GenomicFeatures. Here we will use advantage of CageSeq data to 
set new Transcription Start Sites (TSS) and re-annotate 5' UTRs.

```{r eval = TRUE, echo = TRUE} 
# path to example CageSeq data from hg19 heart sample
cageData <- system.file("extdata", "cage-seq-heart.bed.bgz", 
                        package = "ORFik")
# get new Transcription Start Sites using CageSeq dataset
newFiveUTRs <- reassignTSSbyCage(fiveUTRs, cageData)
newFiveUTRs
```

You will now see that most of the transcription start sites have changed. 
Depending on the species, regular annotations might be incomplete or not 
specific enough for your purposes. 

# RiboSeq footprints automatic shift detection and shifting

In RiboSeq data ribosomal footprints are restricted to their p-site positions 
and shifted with respect to the shifts visible over the start and stop 
codons. `ORFik` has multiple functions for processing of RiboSeq data. We will
go through an example processing of RiboSeq data below.

Load example raw RiboSeq footprints (unshifted).
```{r eval = TRUE, echo = TRUE} 
bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik")
footprints <- GenomicAlignments::readGAlignments(bam_file)
```

Investigate what footprint lengths are present in our data.
```{r eval = TRUE, echo = TRUE} 
table(qwidth(footprints))
```

For the sake of this example we will focus only on most abundant length of 29.
```{r eval = TRUE, echo = TRUE} 
footprints <- footprints[qwidth(footprints) == 29]
footprintsGR <- granges(footprints, use.mcols = TRUE)
footprintsGR
```

Restrict footprints to their 5' starts (after shifting it will be a p-site).
```{r eval = TRUE, echo = TRUE} 
footprintsGR <- resize(footprintsGR, 1)
footprintsGR
```

Now, lets prepare annotations and focus on START and STOP codons.
```{r eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE} 
gtf_file <- system.file("extdata", "annotations.gtf", package = "ORFik")
txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_file, format = "gtf")
cds <- GenomicFeatures::cdsBy(txdb, by = "tx", use.names = TRUE)
cds[1]
```

Filter cds to only those who have some minimum trailer and leader lengths, as 
well as cds. And get start and stop codons with extra window of 30bp around 
them.
```{r eval = TRUE, echo = TRUE, warning = FALSE} 
txNames <- txNamesWithLeaders(txdb)
windows <- getStartStopWindows(txdb, txNames)
windows
```

Calculate meta-coverage over start and stop windowed regions.
```{r eval = TRUE, echo = TRUE, warning = FALSE} 
hitMapStart <- metaWindow(footprintsGR, windows$start)
hitMapStop <- metaWindow(footprintsGR, windows$stop)
```

Plot start/stop windows for length 29.
```{r eval = TRUE, echo = TRUE, warning = FALSE} 
if (requireNamespace("ggplot2")) {
  library(ggplot2)
  ggplot(hitMapStart, aes(x = factor(position), y = avg_counts, fill = factor(frame))) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    labs(title = paste0("Length 29 over START of canonical CDS")) +
    xlab("\nshift from first START nucleotide [bp]") +
    ylab("Averaged counts") +
    guides(fill = FALSE)
}
```

```{r eval = TRUE, echo = TRUE, warning = FALSE} 
if (requireNamespace("ggplot2")) {
  ggplot(hitMapStop, aes(x = factor(position), y = avg_counts, fill = factor(frame))) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    labs(title = paste0("Length 29 over STOP of canonical CDS")) +
    xlab("\nshift from last STOP nucleotide [bp]") +
    ylab("Averaged counts") +
    guides(fill = FALSE)
}
```

We can also use automatic detection of RiboSeq shifts using the code below. As
we can see reasonable conclusion from the plots would be to shift length 29 by
12, it is in agreement with the automatic detection of the offsets.
```{r eval = TRUE, echo = TRUE, warning = FALSE} 
shifts <- detectRibosomeShifts(footprints, txdb, stop = TRUE)
shifts
```

Fortunately `ORFik` has function that can be used to shift footprints using 
desired shifts. Check documentation for more details.
```{r eval = TRUE, echo = TRUE, warning = FALSE} 
shiftedFootprints <- shiftFootprints(
  footprints, shifts$fragment_length, shifts$offsets_start)
shiftedFootprints
```

# Gene identity functions for ORFs or genes

`ORFik` contains couple functions of gene identity that can be used to predict 
which ORFs are potentially coding and functional.

- FLOSS `floss`,   
- coverage `coverage`,  
- ORFscore `orfScore`,  
- entropy `entropy`,  
- translational effiency `translationalEff`,  
- inside outside score `insideOutsideScore`,  
- distance between orfs and cds' `distToCds`,  
- other  

All of the features are implemented based on scientific article published in
per reviewed journal. `ORFik` supports seemingles calculation of all available 
features. See example below.

```{r eval = TRUE, echo = TRUE, warning = FALSE}
 if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  library(GenomicFeatures)

  # Extract sequences of fiveUTRs.
  fiveUTRs <- fiveUTRs[1:10]
  faFile <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens
  tx_seqs <- extractTranscriptSeqs(faFile, fiveUTRs)

  # Find all ORFs on those transcripts and get their genomic coordinates
  fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs)
  unlistedORFs <- unlistGrl(fiveUTR_ORFs)
  # group GRanges by ORFs instead of Transcripts, use 4 first ORFs
  fiveUTR_ORFs <- groupGRangesBy(unlistedORFs, unlistedORFs$names)[1:4]

  # make some toy ribo seq and rna seq data
  starts <- unlist(ORFik:::firstExonPerGroup(fiveUTR_ORFs), use.names = FALSE)
  RFP <- promoters(starts, upstream = 0, downstream = 1)
  score(RFP) <- rep(29, length(RFP)) # the original read widths

  # set RNA seq to duplicate transcripts
  RNA <- unlist(exonsBy(txdb, by = "tx", use.names = TRUE), use.names = TRUE)
  # transcript database
  txdb <- loadDb(txdbFile)
  computeFeatures(grl = fiveUTR_ORFs, orfFeatures =  TRUE, RFP = RFP,
                  RNA = RNA, Gtf = txdb, faFile = faFile)
}
```
You will now get a data.table with one column per score, the columns are named after
the different scores, you can now go further with prediction, or making plots.

# Calculating Kozak sequence score for ORFs

Instead of getting all features, we can also extract single features.

To understand how strong the binding affinitity of an ORF promoter region might be, we can use kozak sequence score. The kozak functions supports
several species. In the first example we use human kozak sequence,
then we make a self defined kozak sequence.

```{r eval = TRUE, echo = TRUE}
  # In this example we will find kozak score of cds' 

  if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {

    cds <- cdsBy(txdb, by = "tx", use.names = TRUE)[1:10]

    faFile <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens

    kozakSequenceScore(cds, faFile, species = "human")

    # A few species are pre supported, if not, make your own input pfm.

    # here is an example where the human pfm is sent in again, even though
    # it is already supported.

    pfm <- t(matrix(as.integer(c(29,26,28,26,22,35,62,39,28,24,27,17,
                                 21,26,24,16,28,32,5,23,35,12,42,21,
                                 25,24,22,33,22,19,28,17,27,47,16,34,
                                 25,24,26,25,28,14,5,21,10,17,15,28)),
                    ncol = 4))

   kozakSequenceScore(cds, faFile, species = pfm)

  }

```


# GRanges and GRangesList utilities

`ORFik` contains couple functions that can be utilized to speed up your coding.
Check documentations for these functions: `unlistGrl`, `sortPerGroup`, 
`strandBool`, `tile1`.

## Grouping ORFs

Sometimes you want a GRangesList of ORFs grouped by transcript, or you might 
want each ORF as groups in the GRangesList. To do this more easily you can use 
the function `groupGRangesBy`.

```{r eval = TRUE, echo = TRUE}

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # the orfs are now grouped by orfs. If we want to go back to transcripts we do:
  unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
  unlisted_ranges
  test_ranges <- groupGRangesBy(unlisted_ranges, names(unlisted_ranges))
  
  # test_ranges is now grouped by transcript, but we want them grouped by ORFs:
  # we use the orfs exon column called ($names) to group, it is made by ORFik.
  unlisted_ranges <- unlistGrl(test_ranges)
  test_ranges <- groupGRangesBy(unlisted_ranges, unlisted_ranges$names)
}
```

## Filtering example

Lets say you found some ORFs, and you want to filter out some of them.
ORFik provides several functions for filtering. A problem with the
original GenomicRanges container, is that filtering on GRanges objects
are much easier than on GRangesList objects, ORFik tries to fix this.

In this example we will filter out all orfs as following:
1. First group GRangesList by ORFs
2. width < 60
3. number of exons < 2
4. strand is negative

```{r eval = TRUE, echo = TRUE}

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # lets use the fiveUTR_ORFs
  #1. Group by ORFs
  unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
  ORFs <- groupGRangesBy(unlisted_ranges, unlisted_ranges$names)
  length(ORFs)
  #2. Remove widths < 60
  ORFs <- ORFs[widthPerGroup(ORFs) >= 60]
  length(ORFs)
  #3. Keep only ORFs with at least 2 exons
  ORFs <- ORFs[numExonsPerGroup(ORFs) > 1]
  length(ORFs)
  
  #4. Keep only positive ORFs
  ORFs <- ORFs[strandPerGroup(ORFs) == "+"]
  # all remaining ORFs where on positive strand, so no change
  length(ORFs)
}
```
Many more operations are also supported for manipulation