--- title: "An Introduction To strandCheckR" author: - name: Thu-Hien To affiliation: Bioinformatics Hub, University of Adelaide email: hien.to@adelaide.edu.au - name: Steve Pederson affiliation: Bioinformatics Hub, University of Adelaide email: stephen.pederson@adelaide.edu.au package: strandCheckR output: BiocStyle::html_document vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{An Introduction To strandCheckR} --- # Introduction *strandCheckR* is a package used a window that slides across a bam file and gets the count/coverage of reads mapping to +/- strand in each window. The returned Data Frame gives information across the whole genome through each sliding window. That helps to quantify the amount of reads coming from double strand genomic DNA. It can be applied to detect DNA contamination within a strand specifc RNA-seq by assessing whether each window contains reads mainly from one strand, as would be expected from RNA molecules, or from both strands as would be expected from contaminating DNA. Hence, it is considered as an additional quality checking for strand specific RNA data. The package can also be used to detect regions with specific behavior (e.g. highest number of reads) through the whole genome and not only feature regions. The window read count function is designed flexibly so that user can filter low mapping quality reads, set read proportion to overlap a window, set window length & step etc. It was also implemented in an efficient way to manange big bam file. In a typical example of a human RNA-seq bam file of more than *4Gb*, it takes about *3* minutes to scan the whole file and return a Data Frame of about than *1.5Gb*. A function to filter out reads coming from double strand DNA is also provided. For each sliding window, it considers the proportion of +/- stranded reads comparing to a given threshold, to decide if a window contains reads derived from single strand RNA or double stranded DNA. A read that does not belong to any window coming from RNA will be removed. A read that belongs to some windows coming from RNA will be kept with a probability caluclated based on the proportion of +/- stranded of those windows. # Get windows information The function *getWinFromBamFile* is used to get the number of +/- stranded reads from all sliding windows throughout a list of bam files. The bam files should be sorted and have their index files located at the same path as well. ```{r getWin, message=FALSE,warning=FALSE} library(strandCheckR) files <- system.file( "extdata",c("s1.sorted.bam","s2.sorted.bam"),package = "strandCheckR" ) win <- getWinFromBamFile(files, sequences = "10") # shorten the file name win$File <- basename(as.character(win$File)) win ``` The returned DataFrame has 10 columns that correspond to the read type (R1 or R2 or single end read), sequence names, starting & ending bases of the windows (by default the window length is 1000 and the window step is 100), number of positive & negative reads that overlap each window (*NbPositive*, *NbNegative*), number of positive & negative bases that overlap each window (*CovPositive*, *CovNegative*), the maximum coverage (*MaxCoverage*) and the file name (*File*). Windows that do not overlap with any read are not reported. The windows with highest read counts, for example, can be obtained as follows. ```{r highestCoverage, eval=TRUE, message=FALSE,warning=FALSE} head(win[order((win$NbPos+win$NbNeg),decreasing=TRUE),]) ``` Here is an example for paired end bam file. ```{r paired end, eval=TRUE,message=FALSE,warning=FALSE} fileP <- system.file("extdata","paired.bam",package = "strandCheckR") winP <- getWinFromBamFile(fileP, sequences = "10") winP$File <- basename(as.character(winP$File)) #shorten file name winP ``` # Intersect with an annotation GRanges object If you have an annotation data, you can integrate it with the sliding windows obtained from the previous step using the function *intersectWithFeature*. The annotation must be a GRanges object, with or without *mcols*. Make sure that the sequence names in windows and annotation data are consistent. By default, you will have an additional column in the returned windows which indicates wheather a window overlap with any feature in the annotation object. You can also get details of the overlapped features in the *mcols* of the annotation object by specifying it in the *mcolsAnnnot* parameter. ```{r intersect, eval=TRUE, warning=FALSE,message=FALSE} library(TxDb.Hsapiens.UCSC.hg38.knownGene) annot <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene) #add chr before the sequence names to be consistent with the annot data win$Seq <- paste0("chr",win$Seq) win <- intersectWithFeature( windows = win, annotation = annot, overlapCol = "OverlapTranscript" ) win ``` If you have a *gtf* or *gff* file, you can use the function *gffReadGR* of the *ballgown* package to read it as a GRanges object. # Plot histogram and windows information With these windows, you can have some plots via *plotHist* and *plotWin* functions which can be saved to an appropriate location. To plot the histogram of positive proportion of the sliding windows, we use the function *plotHist*. This function will first calculates the frequency of positive proportion over all windows, and also group/normalize them based on given column names. Then it uses the *geom_bar* function of *ggplot2* to plot those frequencies. The plot gives you an idea on how much double-strand DNA is contained in your sample. In perfectly clear ss-RNA-seq, the positive proportion of every window should be either around 0\% or 100\%. The more amount of windows have this proportion around 50\%, the more the sample was contaminated by DNA. ```{r plotHist, eval=TRUE, message=FALSE,warning=FALSE} plotHist( windows = win, group_by = c("File","OverlapTranscript"), normalize_by = "File", scales = "free_y" ) ``` In this example, file *s2.sorted.bam* seems to be contaminated with double strand DNA, while file *s1.sorted.bam* is cleaner. By default, the windows are splitted in to 4 coverage groups *<10*, *10-100*, *100-1000*, and *>1000*. It can be set via option *split*. For paired-end data, we can group the Data Frame by read type: ```{r plotHistPaired, eval=TRUE,message=FALSE,warning=FALSE} plotHist( windows = winP, group_by = "Type", normalize_by = "Type", scales = "free_y" ) ``` Heatmap can be used instead of classic barplot for histogram when specifying *heatmap=TRUE*. This would be usefull to visualize mutliple files in the same plot. ```{r heatMap, eval=TRUE, message = FALSE, warning=FALSE,} plotHist( windows = win, group_by = c("File","OverlapTranscript"), normalize_by = "File", heatmap = TRUE ) ``` *plotWin* creates a plot on the number of reads vs positive proportion of each window. There are also 4 lines correspond to 4 representative thresholds (0.6, 0.7, 0.8, 0.9). Threshold is a parameter that is used when filtering a bam file using *filterDNA*. Given a threshold, a positive (resp. negative) window is kept if and only if it is above (resp. below) the corresponding threshold line on this plot. This can be used to give guidance as to the best threshold to choose when filtering windows. ```{r plotwin,eval=TRUE,message=FALSE,warning=FALSE} plotWin(win, group_by = "File") ``` # Filter bam files The functions *filterDNA* removes potential double strand DNA from a bam file using a given threshold. ```{r filterDNA, eval=TRUE, message=FALSE, warning=FALSE, results=FALSE} win2 <- filterDNA( file = files[2], sequences = "10", destination = "s2.filter.bam", threshold = 0.7, getWin = TRUE ) ``` Other parameters can be specified for more flexible filtering: define the ranges that you want to always keep, the minimum number of reads under which you want to ignore, the *pvalue* threshold for keeping a windows etc. Look at the manual page of the function for more options and explanations. Plotting the windows before and after filering: ```{r compare,eval=TRUE,message=FALSE,warning=FALSE} win2$File <- basename(as.character(win2$File)) win2$File <- factor(win2$File,levels = c("s2.sorted.bam","s2.filter.bam")) library(ggplot2) plotHist(win2,group_by = "File",normalize_by = "File",scales = "free_y") + ggtitle("Histogram of positive proportions over all sliding windows before and after filtering reads coming from double strand DNA") ```