---
title: "An Introduction To strandCheckR"
author:
-   name: Thu-Hien To
    affiliation: Bioinformatics Hub, University of Adelaide
    email: hien.to@adelaide.edu.au
-   name: Steve Pederson
    affiliation: Bioinformatics Hub, University of Adelaide
    email: stephen.pederson@adelaide.edu.au
package: strandCheckR
output:
    BiocStyle::html_document
vignette: >
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    %\VignetteIndexEntry{An Introduction To strandCheckR}
---

# Introduction

*strandCheckR* is a package used a window that slides across a bam file and
gets the count/coverage of reads mapping to +/- strand in each window.
The returned Data Frame gives information across the whole genome through each
sliding window. That helps to quantify the amount of reads coming from 
double strand genomic DNA. It can be applied to detect DNA contamination within 
a strand specifc RNA-seq by assessing whether each window contains reads mainly 
from one strand, as would be expected from RNA molecules, or from both strands 
as would be expected from contaminating DNA. Hence, it is considered as an
additional quality checking for strand specific RNA data.
The package can also be used to detect regions with specific behavior (e.g. 
highest number of reads) through the whole genome and not only feature regions.

The window read count function is designed flexibly so that user can filter
low mapping quality reads, set read proportion to overlap a window, set window 
length & step etc. It was also implemented in an efficient way to manange big
bam file. In a typical example of a human RNA-seq bam file of more than *4Gb*, 
it takes about *3* minutes to scan the whole file and return a Data Frame of 
about than *1.5Gb*.

A function to filter out reads coming from double strand DNA is also provided.
For each sliding window, it considers the proportion of +/- stranded reads 
comparing to a given threshold, to decide if a window contains reads derived 
from single strand RNA or double stranded DNA.
A read that does not belong to any window coming from RNA will be removed. A
read that belongs to some windows coming from RNA will be kept with a 
probability caluclated based on the proportion of +/- stranded of those windows.

# Get windows information

The function *getWinFromBamFile* is used to get the number of +/- stranded 
reads from all sliding windows throughout a list of bam files.
The bam files should be sorted and have their index files located at the same 
path as well.


```{r getWin, message=FALSE,warning=FALSE}
library(strandCheckR)
files <- system.file(
    "extdata",c("s1.sorted.bam","s2.sorted.bam"),package = "strandCheckR"
    )
win <- getWinFromBamFile(files, sequences = "10")
# shorten the file name
win$File <- basename(as.character(win$File))
win
```

The returned DataFrame has 10 columns that correspond to the read type (R1 or
R2 or single end read), sequence names, starting & ending bases of the windows 
(by default the window length is 1000 and the window step is 100), number of 
positive & negative reads that overlap each window (*NbPositive*, 
*NbNegative*), number of positive & negative bases that overlap each window 
(*CovPositive*, *CovNegative*), the maximum coverage (*MaxCoverage*) and the 
file name (*File*). Windows that do not overlap with any read are not reported.

The windows with highest read counts, for example, can be obtained as follows.

```{r highestCoverage, eval=TRUE, message=FALSE,warning=FALSE}
head(win[order((win$NbPos+win$NbNeg),decreasing=TRUE),])
```

Here is an example for paired end bam file.

```{r paired end, eval=TRUE,message=FALSE,warning=FALSE}
fileP <- system.file("extdata","paired.bam",package = "strandCheckR")
winP <- getWinFromBamFile(fileP, sequences = "10")
winP$File <- basename(as.character(winP$File)) #shorten file name
winP
```

# Intersect with an annotation GRanges object

If you have an annotation data, you can integrate it with the sliding windows 
obtained from the previous step using the function *intersectWithFeature*. The 
annotation must be a GRanges object, with or without *mcols*. Make sure that 
the sequence names in windows and annotation data are consistent.
By default, you will have an additional column in the returned windows which 
indicates wheather a window overlap with any feature in the annotation object. 
You can also get details of the overlapped features in the *mcols* of the 
annotation object by specifying it in the *mcolsAnnnot* parameter.

```{r intersect, eval=TRUE, warning=FALSE,message=FALSE}
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
annot <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
#add chr before the sequence names to be consistent with the annot data
win$Seq <- paste0("chr",win$Seq) 
win <- intersectWithFeature(
    windows = win, annotation = annot, overlapCol = "OverlapTranscript"
    )
win
```


If you have a *gtf* or *gff* file, you can use the function *gffReadGR* of the 
*ballgown* package to read it as a GRanges object.

# Plot histogram and windows information

With these windows, you can have some plots via *plotHist* and *plotWin* 
functions which can be saved to an appropriate location.

To plot the histogram of positive proportion of the sliding windows, we use
the function *plotHist*. This function will first calculates the frequency of 
positive proportion over all windows, and also group/normalize them based on 
given column names. Then it uses the *geom_bar* function of *ggplot2* to plot
those frequencies. The plot gives you an idea on how much double-strand DNA is 
contained in your sample. In perfectly clear ss-RNA-seq, the positive 
proportion of every window should be either around 0\% or 100\%. 
The more amount of windows have this proportion around 50\%, the more the 
sample was contaminated by DNA. 

```{r plotHist, eval=TRUE, message=FALSE,warning=FALSE}
plotHist(
    windows = win, group_by = c("File","OverlapTranscript"), 
    normalize_by = "File", scales = "free_y"
    )
```

In this example, file *s2.sorted.bam* seems to be contaminated with double 
strand DNA, while file *s1.sorted.bam* is cleaner. By default, the windows are 
splitted in to 4 coverage groups *<10*, *10-100*, *100-1000*, and *>1000*.
It can be set via option *split*.

For paired-end data, we can group the Data Frame by read type:

```{r plotHistPaired, eval=TRUE,message=FALSE,warning=FALSE}
plotHist(
    windows = winP, group_by = "Type", normalize_by = "Type", scales = "free_y"
    )
```

Heatmap can be used instead of classic barplot for histogram when specifying 
*heatmap=TRUE*. This would be usefull to visualize mutliple files in the 
same plot.

```{r heatMap, eval=TRUE, message = FALSE, warning=FALSE,}
plotHist(
    windows = win, group_by = c("File","OverlapTranscript"), 
    normalize_by = "File", heatmap = TRUE
    )
```


*plotWin* creates a plot on the number of reads vs positive proportion of each 
window. There are also 4 lines correspond to 4 representative thresholds 
(0.6, 0.7, 0.8, 0.9). Threshold is a parameter that is used when filtering a 
bam file using *filterDNA*. Given a threshold, a positive (resp. negative) 
window is kept if and only if it is above (resp. below) the corresponding 
threshold line on this plot. This can be used to give guidance as to the best 
threshold to choose when filtering windows.

```{r plotwin,eval=TRUE,message=FALSE,warning=FALSE}
plotWin(win, group_by = "File")
```

# Filter bam files

The functions *filterDNA* removes potential double strand DNA from a bam file 
using a given threshold.

```{r filterDNA, eval=TRUE, message=FALSE, warning=FALSE, results=FALSE}
win2 <- filterDNA(
    file = files[2], sequences = "10", destination = "s2.filter.bam", 
    threshold = 0.7, getWin = TRUE
    )
```


Other parameters can be specified for more flexible filtering: define the 
ranges that you want to always keep, the minimum number of reads under which 
you want to ignore, the *pvalue* threshold for keeping a windows etc. Look at
the manual page of the function for more options and explanations.

Plotting the windows before and after filering:

```{r compare,eval=TRUE,message=FALSE,warning=FALSE}
win2$File <- basename(as.character(win2$File))
win2$File <- factor(win2$File,levels = c("s2.sorted.bam","s2.filter.bam"))
library(ggplot2)
plotHist(win2,group_by = "File",normalize_by = "File",scales = "free_y") +
    ggtitle("Histogram of positive proportions over all sliding windows before 
            and after filtering reads coming from double strand DNA")
```