---
title: "BAM Quality Control Measures"
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document:
        toc: true # table of content true
vignette: >
    %\VignetteIndexEntry{3. BAM Quality Control Measures}
    %\VignetteEngine{knitr::rmarkdown}
editor_options:
    chunk_output_type: console
---

# Loading and saving RaMWAS objects

Steps 1 and 2 of RaMWAS 
scan BAM files, calculate and collect 
quality control (QC) metrics
The QC information is stored in .rds files in "rqc" directory.
They are also summarizes in text files in the "qc" directory
and illustrated in mulpiple plots.

## QC text summary

The text summary is saved in two versions:

*   `Summary_QC.txt` -- formatted for viewing in Excel.
*   `Summary_QC_R.txt` -- formatted for easy import in R.

Excel\ friendly\ column\ name | R\ friendly column name | brief\ description\ and\ link\ to\ section
:---|:---|:---
Sample | Sample | Sample / BAM name
\# BAMs | NBAMs | [Number of BAMs in the sample](#NBAMs)
Total reads | TotalReads | [Total number of reads](#TotalReads)
Reads aligned | ReadsAligned | [The number of aligned reads](#ReadsAligned)
RA \% of total | ReadsAlignedPct | Percent of reads aligned
Reads after filter | ReadsAfterFilter | [Number of reads that passed minimum score filter](#ReadsAfterFilter)
RAF \% of aligned | ReadsAfterFilterPct | Percent of reads passing the filter out of all aligned reads
Reads removed as duplicate | ReadsRemovedAsDuplicate | [Reads removed as duplicate](#ReadsRemovedAsDuplicate)
RRAD % of aligned | ReadsRemovedAsDuplicatePct | Percent of reads removed as duplicate out of all aligned reads
Reads used for coverage | ReadsUsedForCoverage | Number of aligned reads used for coverage (left after filtering and removal of duplicate reads)
RUFC % of aligned | ReadsUsedForCoveragePct | Percent of reads used for coverage out of all aligned reads
Forward strand (%) | ForwardStrandPct | [Percent of reads used for coverage aligned to the forward strand](#ForwardStrandPct)
Avg alignment score | AvgAlignmentScore | [Average alignment score](#AvgAlignmentScore)
Avg aligned length | AvgAlignedLength | [Average length of the aligned part of the read](#AvgAlignedLength)
Avg edit distance | AvgEditDistance | [Average number of mismatches between the aligned part of the read and the reference genome](#AvgEditDistance)
Non-CpG reads (%) | NonCpGreadsPct | [Percent of reads aligned away from CpGs](#NonCpGreadsPct)
Avg non-CpG coverage | AvgNonCpGcoverage | [Average CpG score for non-CpG locations](#AvgNonCpGcoverage)
Avg CpG coverage | AvgCpGcoverage | [Average CpG score for CpGs](#AvgNonCpGcoverage)
Non-Cpg/CpG coverage ratio | NonCpg2CpGcoverageRatio | [Ratio of average non-CpG and CpG scores](#AvgNonCpGcoverage)
ChrX reads (%) | ChrXreadsPct | [Fraction of reads aligned to chromosome X](#ChrXreadsPct)
ChrY reads (%) | ChrYreadsPct | [Fraction of reads aligned to chromosome Y](#ChrXreadsPct)
Peak SQRT | PeakSQRT | [Square root of the CpG density with highest average CpG score](#PeakSQRT)

# Quality control measures

All QC measures are designed to be additive, in the sense that any QC
measure calculated for a combination of two BAM files is equal to the
sum of the respective measure calculated for those BAMs separately.

Many QC measures can be visualized by calling the `plot` function.
For most QC measures, a single number summary is 
available via the `qcmean` function.

A sample QC file, which contains accumulated
QC measures from 42 BAM files,
can be loaded with the following code.

```{r global_options, include=FALSE}
#getwd()
#knitr::opts_chunk$set(fig.align="center", fig.retina=1)
knitr::opts_chunk$set(fig.retina=1)
library(ramwas)
```

```{r loadCgGset}
library(ramwas)
filename = system.file("extdata", "bigQC.rds", package = "ramwas")
qc = readRDS(filename)$qc
```

Next we describe the QC metrics calculated by RaMWAS.

## The number of BAM files {#NBAMs}

The `nbams` QC metric counts the number of BAM files.
This cumulative metric is calculated for total of 42 BAMs.

```{r nbams}
cat("Number of BAMs:", qc$nbams)
```

## Total number of reads in the BAM file(s) {#TotalReads}

The `reads.total` QC metric counts the number of reads
scanned in the BAM file(s). This includes reads later excluded
due to low alignment score or as duplicates.\
The 42 BAMs contain 2.46 billion reads.

```{r reads.total}
cat("Reads total:", qc$reads.total)
```

## Number of reads aligned to the reference genome {#ReadsAligned}

The `reads.aligned` QC metric indicates the number of reads
that were successfully aligned to the reference genome.\
The number of aligned reads is only 2\% smaller, 2.42 billion.

```{r reads.aligned}
{
cat("Reads aligned:", qc$reads.aligned, "\n")
cat("This is ", qc$reads.aligned / qc$reads.total * 100,
    "% of all reads", sep="")
}
```

## Number of reads that passed minimum score filter {#ReadsAfterFilter}

At step 1 of RaMWAS, reads are filtered by the `scoretag` parameter,
which is usually either the "MAPQ" field or "AS" tag in the BAM file.
Reads with scores below `minscore` are excluded.

The `reads.recorded` QC metric counts the number of reads
that passed the score threshold.\
Almost of 2.2 billion reads passed the score threshold.

```{r reads.recorded}
{
cat("Reads recorded:",qc$reads.recorded,"\n")
cat("This is ", qc$reads.recorded / qc$reads.aligned * 100,
    "% of aligned reads", sep="")
}
```

## Number of reads after removal of duplicate reads {#ReadsRemovedAsDuplicate}

Reads that start at the same nucleotide positions
and aligned to the same strand are called duplicate reads.
When sequencing a whole genome, duplicate-reads often arise from 
template preparation or amplification artifacts. 
In the context of sequencing an enriched genomic fraction,
duplicate-reads are increasingly likely to occur because
reads align to a much smaller fraction of the genome. 
RaMWAS allows the user to define a threshold for the number
of reads starting at the same position and 
limits the read count to this threshold 
(implicitly assuming that an excess of 
reads are tagging the same clonal fragment). 

The  threshold is set by `maxrepeats` parameter with the default value 3.

When there are mulptiple reads with the same start
position, we suspect them to be falsely duplicated.

The `reads.recorded.no.repeats` QC metric records the
total number of reads after removal of duplicates.

In our example, 10\% of reads are removed as duplicates.

```{r reads.recorded.no.repeats}
{
cat("Reads without duplicates:", qc$reads.recorded.no.repeats, "\n")
cat("This is ", qc$reads.recorded.no.repeats / qc$reads.recorded * 100,
    "% of aligned reads", "\n", sep="")
}
```

## Number of recorded reads aligned to each strand {#ForwardStrandPct}

The `frwrev` QC metric records two values -- the number of reads
aligned to the forward and reverse strands respectively,
after filtering reads by alignment score.

The `frwrev.no.repeats` is similar to `frwrev`,
but it excludes duplicate reads.

For these measures, the `qcmean` function returns 
the fraction of reads on the forward strand.
Normally, the number of reads on the forward 
and reverse strands are very close,
so `qcmean` should give a number close to `0.5`.

```{r frwrevNR}
{
    cat("Excluding duplicate reads", "\n")
    cat("Reads on forward strand:", qc$frwrev.no.repeats[1], "\n")
    cat("Reads on reverse strand:", qc$frwrev.no.repeats[2], "\n")
    cat("Fraction of reads on forward strand:",
        qcmean(qc$frwrev.no.repeats), "\n")
}
```

Clearly before removal of duplicate reads the 
fraction of reads on the forward strand was further away from 0.5.

```{r frwrev}
{
    cat("Not excluding duplicate reads", "\n")
    cat("Reads on forward strand:", qc$frwrev[1], "\n")
    cat("Reads on reverse strand:", qc$frwrev[2], "\n")
    cat("Fraction of reads on forward strand (before QC):",
        qcmean(qc$frwrev), "\n")
}
```

## Distribution of the alignment scores {#AvgAlignmentScore}

The QC metrics `bf.hist.score1` and `hist.score1` record the distribution of
the alignment scores before and after the filter.
The score is defined by the `scoretag` parameter.
While `hist.score1` contain the distribution for reads that passed the filter,
`bf.hist.score1` has the distribution for all reads.
The `qcmean` function for these QC measures returns
the average score for the respective group.
The first element of the vector `qc$hist.score1` contains the number of
reads with score of 0, the second with score of 1, and so on.
Negative scores (if any) are ignored.

```{r hist.score1, fig.width=8}
{
cat("Average alignment score, after filter:", qcmean(qc$hist.score1),    "\n")
cat("Average alignment score, no filter:   ", qcmean(qc$bf.hist.score1), "\n")
par(mfrow=c(1,2))
plot(qc$hist.score1)
plot(qc$bf.hist.score1)
}
```

## Distribution of the length of the aligned part of the reads {#AvgAlignedLength}

The `hist.length.matched` QC metrix records the 
distribution of the length of the aligned part of the reads.
The length of the aligned part of a read is calculated from
the CIGAR string in the BAM file using the `cigarWidthAlongQuerySpace` function.
The vector `hist.length.matched` has the distribution
for reads that passed the filter, 
`bf.hist.length.matched` -- for all reads.
The `qcmean` function for these QC measures returns the
average length
The first element of the vector contains the number of
reads with 1 aligned basepair, the second with 2, and so on.

```{r hist.length.matched, fig.width=8}
{
cat("Average aligned length, after filter:", 
    qcmean(qc$hist.length.matched),    "\n")
cat("Average aligned length, no filter:   ",
    qcmean(qc$bf.hist.length.matched), "\n")
par(mfrow = c(1,2))
plot(qc$hist.length.matched)
plot(qc$bf.hist.length.matched)
}
```


## Distribution of edit distance {#AvgEditDistance}

The `hist.edit.dist1` QC metricrecords the 
distribution of the number of mismatches between 
the aligned part of the read and the reference genome.
The mismatches are caused by base call errors and genetic variation.
The metric is calculated from the NM tag in BAM files.

The `bf.hist.edit.dist1` QC metric records
the distribution before read filtering, while `hist.edit.dist1` after.
The first element of the vector contains the number of reads
with 0 edit distance (perfect match), the second with edit distance 1,
and so on.

The `qcmean` function for this QC metric returns the average edit distance.

```{r hist.edit.dist1}
{
cat("Average edit distance, after filter:", 
    qcmean(qc$hist.edit.dist1),    "\n")
cat("Average edit distance, no filter:   ", 
    qcmean(qc$bf.hist.edit.dist1), "\n")
par(mfrow = c(1,2))
plot(qc$hist.edit.dist1)
plot(qc$bf.hist.edit.dist1)
}
```

## Number of reads away from CpGs {#NonCpGreadsPct}

MBD-seq detects CpG methylation, 
such that reads aligning to loci that are 
at least `maxfragmentsize` away from any CpGs represent "noise".
These reads occur due to alignment errors or 
imperfect enrichment leading to sequencing of non-methylated fragments).

The `cnt.nonCpG.reads` QC metric contains the number of
non-CpG reads in it's first element and
the total number of reads in the second.

The `qcmean` function for this QC metric returns the
percent of non-CpG reads out of all reads. \

For our data there is less than 1\% of non-CpG reads,
which is consistent with low level of noise.

```{r cnt.nonCpG.reads}
{
cat("Non-CpG reads:", qc$cnt.nonCpG.reads[1], "\n")
cat("This is ", qcmean(qc$cnt.nonCpG.reads)*100, "% of recorded reads", sep="")
}
```

## Average CpG score for CpGs and non-CpGs {#AvgNonCpGcoverage}

The `avg.cpg.coverage` and `avg.noncpg.coverage` QC metrics record
the average CpG score (fragment coverage) for 
all CpGs and for all non-CpGs (locations away from CpGs).

For successfull enrichment, the average CpG score
should be much larger that average non-CpG score
The ratio of these metrics,
gives us a lower bound on the enrichment level (higher is better).
The opposite ratio measures the noise level (lower is better).

```{r avg.cpg.coverage}
{
cat("Summed across", qc$nbams, "bams", "\n")
cat("Average     CpG coverage:", qc$avg.cpg.coverage, "\n")
cat("Average non-CpG coverage:", qc$avg.noncpg.coverage, "\n")
cat("Enrichment ratio:", qc$avg.cpg.coverage / qc$avg.noncpg.coverage, "\n")
cat("Noise level:", qc$avg.noncpg.coverage / qc$avg.cpg.coverage)
}
```

## Average CpG score vs. CpG density {#PeakSQRT}

Enrichment profiles are not only affected by the 
total amount of methylation of the DNA fragments,
which is a function of the number of CpGs and how many of them are methylated, 
but also by variability in the laboratory protocol. 
To capture this variability, 
the `avg.coverage.by.density` QC metric
records the dependence of average CpG score 
as a function of CpG density.
The `qcmean` function returns the square root of the CpG density 
where maximum average CpG score is achieved. 

This 'sqrt peak' sensitivity measure can be used as a covariate in 
downstream analyses to regress out 
variability in enrichment profiles across samples
caused by lab-technical factors.

```{r avg.coverage.by.density}
{
cat("Highest coverage is observed at CpG density of",
    qcmean(qc$avg.coverage.by.density)^2)
plot(qc$avg.coverage.by.density)
}
```

## Coverage around isolated CpGs

A CpG is called isolated if
there are no other CpGs within 
a sufficient distance from it.
The distance is usually the longest possible fragment size.

The distribution of reads around isolated CpGs 
also contains information about the enrichment profile. 
Because the total amount of methylation in fragments 
containing isolated CpGs is small, 
this distribution reflects the sensitivity of the enrichment.
That is, a sensitive assay would be characterized by a pattern where 
most reads start close to the isolated CpG. 
The read counts should decrease further away from the CpG until 
they eventually stabilize to a "noise" level 
at distances larger than the maximum fragment size. 
This pattern of read counts around isolated CpGs 
also form the basis of the fragment size estimation needed 
to calculate CpG score for single end sequencing data.

The `hist.isolated.dist1` QC metric records the distribution of
distances from read start sites to isolated CpGs.
In our example, as we expected,
there are more reads starting closer to the isolated CpGs.

```{r hist.isolated.dist1}
plot(qc$hist.isolated.dist1)
```

## Fraction of reads from chrX and chrY {#ChrXreadsPct}

The fractions of reads from chromosome X and chromosome Y
can be used to test whether the investigated biosamples
have the same sex as recorded in the phenotype data.
Mismatches often indicate swapped biosamples or phenotypes information.
They can also indicate contamination across biosamples.

The `chrX.count` QC metric records the number of chromosome X reads 
in its first element and the total number of reads in the second.
The `qcmean` function returns the percent of chromosome X reads 
out of the total.

The `chrY.count` QC metric is defined analogously for chromosome Y.

```{r chrXY}
{
cat("ChrX reads: ", qc$chrX.count[1], ", which is ",
    qcmean(qc$chrX.count)*100, "% of total", sep="", "\n")
cat("ChrY reads: ", qc$chrY.count[1], ", which is ",
    qcmean(qc$chrY.count)*100, "% of total", sep="", "\n")
}
```