---
title: "NADfinder Guide"
author: "Jianhong Ou, Jun Yu, Paul Kaufman and Lihua Julie Zhu"
date: "`r doc_date()`"
package: "`r pkg_ver('NADfinder')`"
bibliography: bibliography.bib
csl: nature.csl
abstract: >
  Nucleolus is an important structure inside the nucleus 
    in eukaryotic cells. It is the site for transcribing rDNA 
    into rRNA and for assembling ribosomes, aka ribosome 
    biogenesis. In addition, nucleoli are dynamic hubs 
    through which numerous proteins shuttle and contact 
    specific non-rDNA genomic loci. Deep sequencing 
    analyses of DNA associated with isolated nucleoli (NAD-
    seq) have shown that specific loci, termed nucleolar-
    associated domains (NADs) form frequent three-
    dimensional associations with nucleoli. NAD-seq has 
    been used to study the biological functions of NAD and 
    the dynamics of NAD distribution during embryonic 
    stem cell (ESC) differentiation. 
    Here, we developed a 
    Bioconductor package NADfinder for bioinformatic 
    analysis of the NAD-seq data, including normalization, 
    smoothing, peak calling, peak trimming and annotation.
vignette: >
  %\VignetteIndexEntry{NADfinder Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---

```{r, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
  library(NADfinder)
  library(BSgenome.Mmusculus.UCSC.mm10)
  library(rtracklayer)
})
knitr::opts_chunk$set(warning=FALSE, message=FALSE, eval=TRUE)
```

# Introduction
Nucleoli serve as major organizing hubs for the three-dimensional structure 
of mammalian heterochromatin[@politz2016redundancy]. Specific loci, termed
nucleolar-associated domains (NADs), form frequent three-dimensional 
associations with nucleoli [@nemeth2010initial, @van2010high]. Early mammalian 
development is a critical period to study NAD biological function, because 
interactions between pericentromeric chromatin and perinucleolar regions are 
particularly dynamic during embryonic development 
[@aguirre20123d, @popken2014reprogramming]. 
We therefore propose for the first time to map the 
NADs in the mouse genome, determine how these associations are altered during 
embryonic stem cell (ESC) differentiation, and develop tools for study of 
these higher-order chromosome interactions in fixed and live single cells 
(Fig. 1). 

![Fig. 1 Workflow for NAD peak identification. Read counts are summarized for 
each 50kb moving window with step size of 1kb for nucleolar and genomic 
samples. Log2 ratio between nucleolar and genomic samples was computed for 
each window, followed by local background correction, smoothing, peak calling, 
filtering and annotation.
](Picture1.png)

# Single sample
Here is an example to use NADfinder for peak calling.

## Coverage calculation

Here is the code snippet for calculating coverage with a sliding window in a 
given step along the genome using a pair of bam files from genomic sample as 
background and purified nucleoli as target signal. 


```{r quickStart, fig.width=6, fig.height=4}
## load the library
library(NADfinder)
library(SummarizedExperiment)
## bam file path
path <- "path/to/your/bam/files"
f <- c(genome="genome.bam", nucleosome="nucleosome.bam")
## window size for tile of genome. Here we set it to a big window size,
## ie., 50k because of the following two reasons:
## 1. peaks of NADs are wide;
## 2. data will be smoothed better with bigger window size.
ws <- 50000
## step for tile of genome
step <- 1000
## Set the background. 
## 0.25 means 25% of the lower ratios will be used for background training.
backgroundPercentage <- 0.25
## Count the reads for each window with a given step in the genome.
## The output will be a GRanges object.
library(BSgenome.Mmusculus.UCSC.mm10)
```
```{r eval=FALSE}
se <- tileCount(reads=file.path(path, f), 
                genome=Mmusculus, 
                windowSize=ws, 
                step=step)
```

Here we load the pre-computed coverage data single.count to save build time.

```{r}
data(single.count)
se <- single.count
```

For quality asessment, `cumulativePercentage` can be used to plot the 
cumulative sums of sorted read counts for nucleosome and genomic DNA. 
We expect the cumulative sum in the genomic DNA to be close to a straight 
line because the coverage for the genomic DNA sample should be uniformly 
distributed. Unlike ChIP-seq data, the cumulative sum in the nucleosome 
sample will not exhibit sharp changes because most of NADs are broad 
regions as wide as 100 kb. However, we should observe clear differences 
between the two curves.

```{r}
## Calculate ratios for peak calling. We use nucleosome vs genomic DNA.
dat <- log2se(se, nucleosomeCols = "nucleosome.bam", genomeCols="genome.bam")
## Smooth the ratios for each chromosome.
## We found that for each chromosome, the ratios are higher in 5'end than 3'end.
datList <- smoothRatiosByChromosome(dat, N=100)
## check the difference between the cumulative percentage tag allocation 
## in genome and nucleosome samples.
cumulativePercentage(datList[["chr18"]])
```

## Call peaks

Before peak calling, the function `smoothRatiosByChromosome` is used for 
log ratios calculation, normalization and smoothing.

The peaks will be called if the ratios are significantly higher than 
chromosome-specific background determined by `trimPeaks`. 
The following figure shows the peaks (black rectangles) called 
using normalized (green curve) and smoothed (red curve) log2 ratios.

```{r}
## check the results of smooth function
chr18 <- datList[["chr18"]] ## we only have reads in chr18 in test data.
chr18subset <- subset(chr18, seq.int(floor(length(chr18)/10))*10)
chr18 <- assays(chr18subset)
ylim <- range(c(chr18$ratio[, 1], 
                chr18$bcRatio[, 1], 
                chr18$smoothedRatio[, 1]))
plot(chr18$ratio[, 1], 
     ylim=ylim*c(.9, 1.1), 
     type="l", main="chr18")
abline(h=0, col="cyan", lty=2)
points(chr18$bcRatio[, 1], type="l", col="green")
points(chr18$smoothedRatio[, 1], type="l", col="red")
legend("topright", 
       legend = c("raw_ratio", "background_corrected", "smoothed"), 
       fill = c("black", "green", "red"))
## call peaks for each chromosome
peaks <- lapply(datList, trimPeaks, 
                backgroundPercentage=backgroundPercentage, 
                cutoffPvalue=0.05, countFilter=1000)
## plot the peaks in "chr18"
peaks.subset <- countOverlaps(rowRanges(chr18subset), peaks$chr18)>=1
peaks.run <- rle(peaks.subset)
run <- cumsum(peaks.run$lengths)
run <- data.frame(x0=c(1, run[-length(run)]), x1=run)
run <- run[peaks.run$values, , drop=FALSE]
rect(xleft = run$x0, ybottom = ylim[2]*.75, 
     xright = run$x1, ytop = ylim[2]*.8,
     col = "black")
## convert list to a GRanges object
peaks.gr <- unlist(GRangesList(peaks))
```

The following shows how to save or export the called peaks for 
downstream analysis.

```{r}
## output the peaks
write.csv(as.data.frame(unname(peaks.gr)), "peaklist.csv", row.names=FALSE)
## export peaks to a bed file.
library(rtracklayer)
export(peaks.gr, "peaklist.bed", "BED")
```

# Samples with duplicates

Data analysis with multiple biological replicates follows the same steps as 
that of a single paired samples, i.e., coverage calculation, normalization 
and smoothing, and peak calling. The only difference is that limma is used 
to determine the statistical significance in peak calling.

```{r}
library(NADfinder)
## bam file path
path <- "path/to/your/bam/files"
f <- c("G26.bam", "G28.bam", "G29.bam", "N26.bam", "N28.bam", "N29.bam")
ws <- 50000
step <- 1000
library(BSgenome.Mmusculus.UCSC.mm10)
```

```{r eval=FALSE}
se <- tileCount(reads=file.path(path, f), 
                genome=Mmusculus, 
                windowSize=ws, 
                step=step)
```

Load saved coverage.

```{r}
data(triplicates.counts)
se <- triplicates.counts

## Calculate ratios for nucleoli vs genomic sample.
gps <- c("26", "28", "29")
se <- log2se(se, 
             nucleosomeCols = paste0("N", gps, ".bam"),
             genomeCols = paste0("G", gps, ".bam"))
getCorrelations(se, "chr18")
seList<- smoothRatiosByChromosome(se, chr="chr18")
cumulativePercentage(seList[["chr18"]])
peaks <- lapply(seList, callPeaks, 
                cutoffAdjPvalue=0.05, countFilter=1000)
peaks <- unlist(GRangesList(peaks))
peaks
```

The peaks can be visualized along the ideogram using `plotSig`. 

```{r fig.height=1.5}
ideo <- readRDS(system.file("extdata", "ideo.mm10.rds", 
                            package = "NADfinder"))
plotSig(ideo=ideo, grList=GRangesList(peaks), mcolName="AveSig", 
        layout=list("chr18"), 
        parameterList=list(types="heatmap"))
```

# Session Info
```{r sessionInfo}
sessionInfo()
```

# References