%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{The 'qcmetrics' infrastructure for quality control and reporting}
%\VignetteKeywords{Bioinformatics, proteomics, genomics, transcriptomics, mass-spectrometry, Quality control, reporting}
%\VignettePackage{qcmetrics}

\documentclass[12pt, oneside]{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\usepackage[final]{pdfpages}
\includepdfset{nup=2x2, pages=-, frame=TRUE}
\usepackage{wasysym}

\bioctitle[\Biocpkg{qcmetrics}]{The \Biocpkg{qcmetrics} infrastructure
  for quality control and automatic reporting}

\author{
  Laurent Gatto\footnote{\email{lg390@cam.ac.uk}}\\  
  Computational Proteomics Unit\footnote{\url{http://cpu.sysbiol.cam.ac.uk}}\\
  University of Cambridge, UK
}


\begin{document}

\maketitle

%% Abstract and keywords %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vskip 0.3in minus 0.1in
\hrule
\begin{abstract}
  The \Biocpkg{qcmetrics} package is a framework that provides simple
  data containers for quality metrics and support for automatic report
  generation. This document briefly illustrates the core data
  structures and then demonstrates the generation and automation of
  quality control reports for microarray and proteomics data.
\end{abstract}
\textit{Keywords}: Bioinformatics, Quality control, reporting, visualisation 
\vskip 0.1in minus 0.05in
\hrule
\vskip 0.2in minus 0.1in
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage

\tableofcontents

<<env, include=FALSE, echo=FALSE, cache=FALSE>>=
library("knitr")
opts_chunk$set(fig.align = 'center', 
               fig.show = 'hold', 
               par = TRUE,
               prompt = FALSE,
               tidy = FALSE,
               eval = TRUE,
               stop_on_error = 1L,
               comment = "##")
options(replace.assign = TRUE, 
        width = 55)

suppressPackageStartupMessages(library("qcmetrics"))
suppressPackageStartupMessages(library("MAQCsubsetAFX"))
suppressPackageStartupMessages(library("yaqcaffy"))
suppressPackageStartupMessages(library("affy"))
suppressPackageStartupMessages(library("AnnotationDbi"))
suppressPackageStartupMessages(library("RforProteomics"))
suppressPackageStartupMessages(library("mzR"))
suppressPackageStartupMessages(library("MSnbase"))
set.seed(1)
@ 
%%$

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Introduction}\label{sec:intro} 

Quality control (QC) is an essential step in any analytical
process. Data of poor quality can at best lead to the absence of
positive results or, much worse, false positives that stem from
uncaught faulty and noisy data and much wasted resources in pursuing
red herrings.

Quality is often a relative concept that depends on the nature of the
biological sample, the experimental settings, the analytical process
and other factors. Research and development in the area of QC has
generally lead to two types of work being disseminated. Firstly, the
comparison of samples of variable quality and the identification of
metrics that correlate with the quality of the data. These quality
metrics could then, in later experiments, be used to assess their
quality. Secondly, the design of domain-specific software to
facilitate the collection, visualisation and interpretation of various
QC metrics is also an area that has seen much development. QC is a
prime example where standardisation and automation are of great
benefit. While a great variety of QC metrics, software and pipelines
have been described for any assay commonly used in modern biology, we
present here a different tool for QC, whose main features are
flexibility and versatility. The \Biocpkg{qcmetrics} package is a
general framework for QC that can accommodate any type of data. It
provides a flexible framework to implement QC items that store
relevant QC metrics with a specific visualisation mechanism. These
individual items can be bundled into higher level QC containers that
can be readily used to generate reports in various formats. As a
result, it becomes easy to develop complete custom pipelines from
scratch and automate the generation of reports. The pipelines can be
easily updated to accommodate new QC items of better visualisation
techniques.

Section \ref{sec:qcclasses} provides an overview of the framework. In
section \ref{sec:pipeline}, we use microarray (subsection
\ref{sec:marray}) and proteomics data (subsection \ref{sec:prot}) to
demonstrate the elaboration of QC pipelines: how to create individual
QC objects, how to bundle them to create sets of QC metrics and how to
generate reports in multiple formats. We also show how the above steps
can be fully automated through simple wrapper functions in section
\ref{sec:wrapper}. Although kept simple in the interest of time and
space, these examples are meaningful and relevant. In section
\ref{sec:report}, we provide more detail about the report generation
process, how reports can be customised and how new exports can be
contributed. We proceed in section \ref{sec:qcpkg} to the
consolidation of QC pipelines using \R and elaborate on the
development of dedicated QC packages with \Biocpkg{qcmetrics}.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{The QC classes}\label{sec:qcclasses}

The package provides two types of QC containers. The
\Robject{QcMetric} class stores data and visualisation functions for
single metrics. Several such metrics can be bundled into
\Robject{QcMetrics} instances, that can be used as input for automated
report generation. Below, we will provide a quick overview of how to
create respective \Robject{QcMetric} and \Robject{QcMetrics}
instances. More details are available in the corresponding
documentations.

\subsection{The \Robject{QcMetric} class}

A QC metric is composed of a description (\Robject{name} in the code
chunk below), some QC data (\Robject{qcdata}) and a \Robject{status}
that defines if the metric is deemed of acceptable quality (coded as
\Robject{TRUE}), bad quality (coded as \Robject{FALSE}) or not yet
evaluated (coded as \Robject{NA}). Individual metrics can be displayed
as a short textual summary or plotted. To do the former, one can use
the default \Rfunction{show} method.

<<qcmetric>>=
library("qcmetrics")
qc <- QcMetric(name = "A test metric")
qcdata(qc, "x") <- rnorm(100)
qcdata(qc) ## all available qcdata
summary(qcdata(qc, "x")) ## get x
show(qc) ## or just qc
status(qc) <- TRUE
qc
@

Plotting \Robject{QcMetric} instances requires to implement a plotting
method that is relevant to the data at hand. We can use a
\Rfunction{plot} replacement method to define our custom function. The
code inside the \Rfunction{plot} uses \Rfunction{qcdata} to extract
the relevant QC data from \Robject{object} that is then passed as
argument to \Rfunction{plot} and uses the adequate visualisation to
present the QC data.

<<qcmetricplot, dev='pdf', fig.width = 4, fig.height = 4, tidy = FALSE>>=
plot(qc)
plot(qc) <-
    function(object, ... ) boxplot(qcdata(object, "x"), ...)
plot(qc)
@

\subsection{The \Robject{QcMetrics} class}

A \Robject{QcMetrics} object is essentially just a list of individual
\Robject{QcMetric} instances. It is also possible to set a list of
metadata variables to describe the source of the QC metrics. The
metadata can be passed as an \Robject{QcMetadata} object (the way it
is stored in the \Robject{QcMetrics} instance) or directly as a named
\Robject{list}. The \Robject{QcMetadata} is itself a \Robject{list}
and can be accessed and set with \Rfunction{metadata} or
\Rfunction{mdata}. When accessed, it is returned and displayed as a
\Robject{list}.

<<qcmetrics>>=
qcm <- QcMetrics(qcdata = list(qc))
qcm
metadata(qcm) <- list(author = "Prof. Who",
                      lab = "Big lab")
qcm
mdata(qcm)
@

The metadata can be updated with the same interface. If new named
items are passed, the metadata is updated by addition of the new
elements. If a named item is already present, its value gets updated.

<<qcmdataupdate>>=
metadata(qcm) <- list(author = "Prof. Who",
                      lab = "Cabin lab",
                      University = "Universe-ity")
mdata(qcm)
@


The \Robject{QcMetrics} can then be passed to the \Rfunction{qcReport}
method to generate reports, as described in more details below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Creating QC pipelines}\label{sec:pipeline}


\subsection{Microarray degradation}\label{sec:marray}

We will use the \Robject{refA} Affymetrix arrays from the
\Biocexptpkg{MAQCsubsetAFX} package as an example data set and
investigate the RNA degradation using the \Rfunction{AffyRNAdeg} from
\Biocpkg{affy} \cite{Gautier:2004} and the actin and GAPDH
$\frac{3'}{5'}$ ratios, as calculated in the \Biocpkg{yaqcaffy}
package \cite{yaqcaffy}. The first code chunk demonstrate how to load
the data and compute the QC data\footnote{%%
  The pre-computed objects can be directly loaded with
  \Rfunction{load(system.file("extdata/deg.rda", package =
    "qcmetrics"))} and \Rfunction{load(system.file("extdata/deg.rda",
    package = "qcmetrics"))}.}.


<<maqcdata, eval=FALSE>>=
library("MAQCsubsetAFX")
data(refA)
library("affy")
deg <- AffyRNAdeg(refA)
library("yaqcaffy")
yqc <- yaqc(refA)
@

<<maqcdata0, echo=FALSE>>=
load(system.file("extdata/deg.rda", package = "qcmetrics"))
load(system.file("extdata/yqc.rda", package = "qcmetrics"))
@

We then create two \Robject{QcMetric} instances, one for each of our quality metrics. 

<<maqc1>>=
qc1 <- QcMetric(name = "Affy RNA degradation slopes")
qcdata(qc1, "deg") <- deg
plot(qc1) <- function(object, ...) {
    x <- qcdata(object, "deg")
    nms <- x$sample.names
    plotAffyRNAdeg(x, col = 1:length(nms), ...)
    legend("topleft", nms, lty = 1, cex = 0.8, 
           col = 1:length(nms), bty = "n")
}
status(qc1) <- TRUE
qc1
@


<<maqc2>>=
qc2 <- QcMetric(name = "Affy RNA degradation ratios")
qcdata(qc2, "yqc") <- yqc
plot(qc2) <- function(object, ...) {
    par(mfrow = c(1, 2))
    yaqcaffy:::.plotQCRatios(qcdata(object, "yqc"), "all", ...)
}
status(qc2) <- FALSE
qc2
@

Then, we combine the individual QC items into a \Robject{QcMetrics} instance. 

<<maqcm>>=
maqcm <- QcMetrics(qcdata = list(qc1, qc2))
maqcm
@

%% Running this one with echoing so that the auxiliary files, 
%% in particular the figure directory does not get deleted, as 
%% it is also created and needed by the vignette itself.
<<maqcreport0, echo = FALSE, message = FALSE>>=
qcReport(maqcm, reportname = "rnadeg", clean = FALSE)
@

With our \Robject{QcMetrics} data, we can easily generate quality
reports in several different formats. Below, we create a \texttt{pdf}
report, which is the default type. Using \texttt{type = "html"} would
generate the equivalent report in \texttt{html} format. See
\Rfunction{?qcReport} for more details.

<<maqcreport, eval = FALSE>>=
qcReport(maqcm, reportname = "rnadeg", type = "pdf")
@

The resulting report is shown below. Each \Robject{QcMetric} item
generates a section named according to the object's name. A final
summary section shows a table with all the QC items and their
status. The report concludes with a detailed session information
section.

\bigskip

In addition to the report, it is of course advised to store the actual
\Robject{QcMetrics} object. This is most easily done with the \R
\Rfunction{save}/\Rfunction{load} and
\Rfunction{saveRDS}/\Rfunction{readRDS} functions. As the data and
visualisation methods are stored together, it is possible to reproduce
the figures from the report or further explore the data at a later
stage.

\includepdf{rnadeg.pdf}

\clearpage

\subsection{A wrapper function}\label{sec:wrapper}

Once an appropriate set of quality metrics has been identified, the
generation of the \Robject{QcMetrics} instances can be wrapped up for
automation.

<<maqcwrap, tidy=FALSE>>=
rnadeg
@

It is now possible to generate a \Robject{QcMetrics} object from a set
of CEL files or directly from an \Robject{affybatch} object. The
\Robject{status} argument allows to directly set the statuses of the
individual QC items; these can also be set later, as illustrated
below. If a report type is specified, the corresponding report is
generated.

<<qcwrap2, eval = FALSE>>=
maqcm <- rnadeg(refA)
@

<<qcwrapstatus0, echo=FALSE>>=
status(maqcm) <- c(NA, NA)
@

<<qcwrapstatus>>=
status(maqcm)
## check the QC data 
(status(maqcm) <- c(TRUE, FALSE))
@

The report can be generated manually with \Rfunction{qcReport(maqcm)} or
directly with the wrapper function as follows:

<<qcwrap3, eval = FALSE>>=
maqcm <- rnadeg(refA, type = "pdf")
@

\subsection{Proteomics raw data}\label{sec:prot}

To illustrate a simple QC analysis for proteomics data, we will
download data set \texttt{PXD00001} from the ProteomeXchange
repository in the \texttt{mzXML} format \cite{Pedrioli:2004}. The
MS$^2$ spectra from that mass-spectrometry run are then read into
\R\footnote{%%
  In the interest of time, this code chunk has been pre-computed and a
  subset (1 in 3) of the \Robject{exp} instance is distributed with
  the package. The data is loaded with
  \Rfunction{load(system.file("extdata/exp.rda", package =
    "qcmetrics"))}.} %%
and stored as an \Robject{MSnExp} experiment using the
\Rfunction{readMSData} function from the \Biocpkg{MSnbase} package
\cite{Gatto:2012}.

<<protdata0, echo=FALSE>>=
load(system.file("extdata/exp.rda", package = "qcmetrics"))
@

<<protdata, eval=FALSE>>=
library("RforProteomics")
msfile <- getPXD000001mzXML()
library("MSnbase")
exp <- readMSData(msfile, verbose = FALSE)
@

The \Robject{QcMetrics} will consist of 3 items, namely a chromatogram
constructed with the MS$^2$ spectra precursor's intensities, a figure
illustrating the precursor charges in the MS space and an
$\frac{m}{z}$ delta plot illustrating the suitability of MS$^2$
spectra for identification (see \Rfunction{?plotMzDelta} or
\cite{Foster:2011}).


<<protqc1, cache=TRUE, tidy=FALSE>>=
qc1 <- QcMetric(name = "Chromatogram")
x <- rtime(exp)
y <- precursorIntensity(exp)
o <- order(x)
qcdata(qc1, "x") <- x[o]
qcdata(qc1, "y") <- y[o]
plot(qc1) <- function(object, ...)
    plot(qcdata(object, "x"),
         qcdata(object, "y"),
         col = "darkgrey", type ="l",
         xlab = "retention time",
         ylab = "precursor intensity")
@

<<protqc2, cache=TRUE>>=
qc2 <- QcMetric(name = "MS space")
qcdata(qc2, "p2d") <- plot2d(exp, z = "charge", plot = FALSE)
plot(qc2) <- function(object) { 
    require("ggplot2")
    print(qcdata(object, "p2d"))
}
@

<<protqc3, cache=TRUE, messages=FALSE, tidy=FALSE, warnings=FALSE>>=
qc3 <- QcMetric(name = "m/z delta plot")
qcdata(qc3, "pmz") <- plotMzDelta(exp, plot = FALSE,
                                  verbose = FALSE)
plot(qc3) <- function(object) 
    suppressWarnings(print(qcdata(object, "pmz")))
@ 

Note that we do not store the raw data in any of the above instances,
but always pre-compute the necessary data or plots that are then
stored as \Robject{qcdata}. If the raw data was to be needed in
multiple \Robject{QcMetric} instances, we could re-use the same
\Robject{qcdata} \emph{environment} to avoid unnecessary copies using
\Rfunction{qcdata(qc2) <- qcenv(qc1)} and implement different views
through custom \Rfunction{plot} methods.

\bigskip

Let's now combine the three items into a \Robject{QcMetrics} object,
decorate it with custom metadata using the MIAPE information from the
\Robject{MSnExp} object and generate a report.

<<protqcm, tidy=FALSE>>=
protqcm <- QcMetrics(qcdata = list(qc1, qc2, qc3))
metadata(protqcm) <- list(
    data = "PXD000001",
    instrument = experimentData(exp)@instrumentModel,
    source = experimentData(exp)@ionSource,
    analyser = experimentData(exp)@analyser,
    detector = experimentData(exp)@detectorType,
    manufacurer = experimentData(exp)@instrumentManufacturer)
@

%% Running this one with echoing so that the auxiliary files, 
%% in particular the figure directory does not get deleted, as 
%% it is also created and needed by the vignette itself.
<<protreport0, echo = FALSE, message = FALSE>>=
qcReport(protqcm, reportname = "protqc", clean=FALSE, quiet=TRUE)
@

The status column of the summary table is empty as we have not set the
QC items statuses yet.

<<protreport, eval=FALSE>>=
qcReport(protqcm, reportname = "protqc")
@

\includepdf{protqc.pdf}

\subsection{Processed $^{15}$N labelling data}\label{sec:n15}

In this section, we describe a set of $^{15}$N metabolic labelling QC
metrics \cite{Krijgsveld:2003}. The data is a phospho-enriched
$^{15}$N labelled \textit{Arabidopsis thaliana} sample prepared as
described in \cite{Groen:2013}. The data was processed with in-house
tools and is available as an \Robject{MSnSet} instance. Briefly,
MS$^2$ spectra were search with the Mascot engine and identification
scores adjusted with Mascot Percolator. Heavy and light pairs were
then searched in the survey scans and $^{15}$N incorporation was
estimated based on the peptide sequence and the isotopic envelope of
the heavy member of the pair (the \Robject{inc} feature
variable). Heavy and light peptides isotopic envelope areas were
finally integrated to obtain unlabelled and $^{15}$N quantitation
data. The \Robject{psm} object provides such data for PSMs (peptide
spectrum matches) with a posterior error probability \textless 0.05
that can be uniquely matched to proteins.

We first load the \Biocpkg{MSnbase} package (required to support the
\Robject{MSnSet} data structure) and example data that is distributed
with the \Biocpkg{qcmetrics} package. We will make use of the
\CRANpkg{ggplot2} plotting package.

<<n15ex>>=
library("ggplot2")
library("MSnbase")
data(n15psm)
psm
@
 
 The first QC item examines the $^{15}$N incorporation rate, available
 in the \Robject{inc} feature variable. We also defined a median
 incorporation rate threshold \Robject{tr} equal to 97.5 that is used
 to set the QC status.

<<qcinc, tidy=FALSE>>=
## incorporation rate QC metric
qcinc <- QcMetric(name = "15N incorporation rate")
qcdata(qcinc, "inc") <- fData(psm)$inc
qcdata(qcinc, "tr") <- 97.5
status(qcinc) <- median(qcdata(qcinc, "inc")) > qcdata(qcinc, "tr")
@

Next, we implement a custom \Rfunction{show} method, that prints 5
summary values of the variable's distribution.

<<qcinc2, tidy=FALSE>>=
show(qcinc) <- function(object) {
    qcshow(object, qcdata = FALSE) 
    cat(" QC threshold:", qcdata(object, "tr"), "\n")
    cat(" Incorporation rate\n")
    print(summary(qcdata(object, "inc")))
    invisible(NULL)
}
@


We then define the metric's \Rfunction{plot} function that represent
the distribution of the PSM's incorporation rates as a boxplot, shows
all the individual rates as jittered dots and represents the
\Robject{tr} threshold as a dotted red line.

<<qcinc3, tidy=FALSE>>=
plot(qcinc) <- function(object) {
    inc <- qcdata(object, "inc")
    tr <- qcdata(object, "tr")
    lab <- "Incorporation rate"
    dd <- data.frame(inc = qcdata(qcinc, "inc"))
    p <- ggplot(dd, aes(factor(""), inc)) +
        geom_jitter(colour = "#4582B370", size = 3) + 
    geom_boxplot(fill = "#FFFFFFD0", colour = "#000000",
                 outlier.size = 0) +
    geom_hline(yintercept = tr, colour = "red",
               linetype = "dotted", size = 1) +
    labs(x = "", y = "Incorporation rate") 
    p
}
@

$^{15}$N experiments of good quality are characterised by high
incorporation rates, which allow to deconvolute the heavy and light
peptide isotopic envelopes and accurate quantification.

\bigskip

The second metric inspects the log$_2$ fold-changes of the PSMs,
unique peptides with modifications, unique peptide sequences (not
taking modifications into account) and proteins. These respective data
sets are computed with the \Rfunction{combineFeatures} function (see
\Rfunction{?combineFeatures} for details).

<<combinefeatures, tidy = FALSE>>=
fData(psm)$modseq <- ## pep seq + PTM
    paste(fData(psm)$Peptide_Sequence, 
          fData(psm)$Variable_Modifications, sep = "+")
pep <- combineFeatures(psm,
                       as.character(fData(psm)$Peptide_Sequence), 
                       "median", verbose = FALSE)
modpep <- combineFeatures(psm,
                          fData(psm)$modseq,
                          "median", verbose = FALSE)
prot <- combineFeatures(psm,
                        as.character(fData(psm)$Protein_Accession), 
                        "median", verbose = FALSE)
@

The log$_2$ fold-changes for all the features are then computed and
stored as QC data of our next QC item. We also store a pair of values
\Robject{explfc} that defined an interval in which we expect our
median PSM log$_2$ fold-change to be.

<<qclfc, tidy=FALSE>>=
## calculate log fold-change
qclfc <- QcMetric(name = "Log2 fold-changes")
qcdata(qclfc, "lfc.psm") <-
    log2(exprs(psm)[,"unlabelled"] / exprs(psm)[, "N15"])
qcdata(qclfc, "lfc.pep") <-
    log2(exprs(pep)[,"unlabelled"] / exprs(pep)[, "N15"])
qcdata(qclfc, "lfc.modpep") <-
    log2(exprs(modpep)[,"unlabelled"] / exprs(modpep)[, "N15"])
qcdata(qclfc, "lfc.prot") <-
    log2(exprs(prot)[,"unlabelled"] / exprs(prot)[, "N15"])
qcdata(qclfc, "explfc") <- c(-0.5, 0.5)

status(qclfc) <-
    median(qcdata(qclfc, "lfc.psm")) > qcdata(qclfc, "explfc")[1] &
    median(qcdata(qclfc, "lfc.psm")) < qcdata(qclfc, "explfc")[2]
@

As previously, we provide a custom \Rfunction{show} method that
displays summary values for the four fold-changes. The
\Rfunction{plot} function illustrates the respective log$_2$
fold-change densities and the expected median PSM fold-change range
(red rectangle). The expected 0 log$_2$ fold-change is shown as a
dotted black vertical line and the observed median PSM value is shown
as a blue dashed line.

<<qclfc2, tidy=FALSE>>=
show(qclfc) <- function(object) {
    qcshow(object, qcdata = FALSE) ## default
    cat(" QC thresholds:", qcdata(object, "explfc"), "\n")
    cat(" * PSM log2 fold-changes\n")
    print(summary(qcdata(object, "lfc.psm")))
    cat(" * Modified peptide log2 fold-changes\n")
    print(summary(qcdata(object, "lfc.modpep")))
    cat(" * Peptide log2 fold-changes\n")
    print(summary(qcdata(object, "lfc.pep")))
    cat(" * Protein log2 fold-changes\n")
    print(summary(qcdata(object, "lfc.prot")))    
    invisible(NULL)
}
plot(qclfc) <- function(object) {
    x <- qcdata(object, "explfc")
    plot(density(qcdata(object, "lfc.psm")),
         main = "", sub = "", col = "red",
         ylab = "", lwd = 2,
         xlab = expression(log[2]~fold-change))    
    lines(density(qcdata(object, "lfc.modpep")),
          col = "steelblue", lwd = 2)
    lines(density(qcdata(object, "lfc.pep")),
          col = "blue", lwd = 2)
    lines(density(qcdata(object, "lfc.prot")),
          col = "orange")
    abline(h = 0, col = "grey")
    abline(v = 0, lty = "dotted")    
    rect(x[1], -1, x[2], 1, col = "#EE000030",
         border = NA)
    abline(v = median(qcdata(object, "lfc.psm")),
           lty = "dashed", col = "blue")
    legend("topright",
           c("PSM", "Peptides", "Modified peptides", "Proteins"),
           col = c("red", "steelblue", "blue", "orange"), lwd = 2,
           bty = "n")
}
@

A good quality experiment is expected to have a tight distribution
centred around 0. Major deviations would indicate incomplete
incorporation, errors in the respective amounts of light and heavy
material used, and a wide distribution would reflect large variability
in the data.

\bigskip

Our last QC item inspects the number of features that have been
identified in the experiment. We also investigate how many peptides
(with or without considering the modification) have been observed at
the PSM level and the number of unique peptides per protein. Here, we
do not specify any expected values as the number of observed features
is experiment specific; the QC status is left as \Robject{NA}.

<<qcnb, tidy=FALSE>>=
## number of features
qcnb <- QcMetric(name = "Number of features")
qcdata(qcnb, "count") <- c(
    PSM = nrow(psm),
    ModPep = nrow(modpep),
    Pep = nrow(pep),
    Prot = nrow(prot))
qcdata(qcnb, "peptab") <-
    table(fData(psm)$Peptide_Sequence)
qcdata(qcnb, "modpeptab") <-
    table(fData(psm)$modseq)
qcdata(qcnb, "upep.per.prot") <- 
    fData(psm)$Number_Of_Unique_Peptides
@

The counts are displayed by the new \Rfunction{show} and plotted as bar charts by the \Rfunction{plot} methods.

<<qcnb2, tidy=FALSE>>=
show(qcnb) <- function(object) {
    qcshow(object, qcdata = FALSE)
    print(qcdata(object, "count"))
}
plot(qcnb) <- function(object) {
    par(mar = c(5, 4, 2, 1))
    layout(matrix(c(1, 2, 1, 3, 1, 4), ncol = 3))
    barplot(qcdata(object, "count"), horiz = TRUE, las = 2)
    barplot(table(qcdata(object, "modpeptab")),
            xlab = "Modified peptides")
    barplot(table(qcdata(object, "peptab")),
            xlab = "Peptides")
    barplot(table(qcdata(object, "upep.per.prot")),
            xlab = "Unique peptides per protein ")
}
@

In the code chunk below, we combine the 3 QC items into a
\Robject{QcMetrics} instance and generate a report using meta data
extracted from the \Robject{psm} \Robject{MSnSet} instance.

<<n15qcm, tidy=FALSE>>=
n15qcm <- QcMetrics(qcdata = list(qcinc, qclfc, qcnb))
qcReport(n15qcm, reportname = "n15qcreport",
         title = expinfo(experimentData(psm))["title"],
         author = expinfo(experimentData(psm))["contact"],
         clean = FALSE)
@


We provide with the package the \Rfunction{n15qc} wrapper function
that automates the above pipeline. The names of the feature variable
columns and the thresholds for the two first QC items are provided as
arguments. In case no report name is given, a custom title with date
and time is used, to avoid overwriting existing reports.


% \begin{small}
% <<n15qcwrapper, tidy=FALSE>>=
% n15qc
% @
% \end{small}

\includepdf{n15qcreport.pdf}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Report generation}\label{sec:report}

The report generation is handled by dedicated packages, in particular
\CRANpkg{knitr} \cite{Xie:2013} and \CRANpkg{markdown}
\cite{markdown}.

\subsection{Custom reports}

\subsubsection*{Templates}

It is possible to customise reports for any of the existing types. The
generation of the \texttt{pdf} report is based on a \texttt{tex}
template, \texttt{knitr-template.Rnw}, that is available with the
package\footnote{You can find it with
  \Rfunction{system.file("templates", "knitr-template.Rnw", package =
    "qcmetrics")}.}. The \Rfunction{qcReport} method accepts the path
to a custom \Robject{template} as argument.

The template corresponds to a \LaTeX~preamble with the inclusion of
two variables that are passed to the \Rfunction{qcReport} and used to
customise the template: the author's name and the title of the
report. The former is defaulted to the system username with
\Rfunction{Sys.getenv("USER")} and the later is a simple
character. The \Rfunction{qcReport} function also automatically
generates summary and session information sections. The core of the QC
report, i.e the sections corresponding the the individual
\Robject{QcMetric} instances bundled in a \Robject{QcMetrics} input
(described in more details below) is then inserted into the template
and weaved, or more specifically \Rfunction{knit}'ted into a
\texttt{tex} document that is (if \Robject{type=pdf}) compiled into a
\texttt{pdf} document.

The generation of the \texttt{html} report is enabled by the creation
of a \R markdown file (\texttt{Rmd}) that is then converted with
\CRANpkg{knitr} and \CRANpkg{markdown} into \texttt{html}. The
\texttt{Rmd} syntax being much simpler, no \texttt{Rmd} template is
needed. It is possible to customise the final \texttt{html} output by
providing a \texttt{css} definition as \Robject{template} argument
when calling \Rfunction{qcReport}.

Initial support for the \CRANpkg{Nozzle.R1} package \cite{nozzle} is
available with type \texttt{nozzle}.

\subsubsection*{\Robject{QcMetric} sections}

The generation of the sections for \Robject{QcMetric} instances is
controlled by a function passed to the \Robject{qcto} argument. This
function takes care of transforming an instance of class
\Robject{QcMetric} into a \texttt{character} that can be inserted into
the report. For the \texttt{tex} and \texttt{pdf} reports,
\Rfunction{Qc2Tex} is used; the \texttt{Rmd} and \texttt{html} reports
make use of \Rfunction{Qc2Rmd}. These functions take an instance of
class \Robject{QcMetrics} and the index of the \Robject{QcMetric} to
be converted.

<<Qc2Tex, tidy=FALSE>>=
qcmetrics:::Qc2Tex
qcmetrics:::Qc2Tex(maqcm, 1)
@

Let's investigate how to customise these sections depending on the
\Robject{QcMetric} status, the goal being to highlight positive QC
results (i.e. when the status is \Robject{TRUE}) with {\color{green}
  $\CIRCLE$} (or {\color{green} $\smiley$}), negative results with
{\color{red} $\CIRCLE$} (or {\color{red} $\frownie$}) and use
$\Circle$ if status is \Robject{NA} after the section
title\footnote{The respective symbols are \texttt{CIRCLE},
  \texttt{smiley}, \texttt{frownie} and \texttt{Circle} from the
  \LaTeX{} package \texttt{wasysym}.}.

Below, we see that different section headers are composed based on the
value of \Rfunction{status(object[[i]])} by appending the appropriate
\LaTeX~symbol.

<<Qc2Tex2, tidy=FALSE>>=
Qc2Tex2
@

To use this specific sectioning code, we pass our new function as
\Robject{qcto} when generating the report. To generate smiley labels,
use \Rfunction{Qc2Tex3}.

%% Running this one with echoing so that the auxiliary files, 
%% in particular the figure directory does not get deleted, as 
%% it is also created and needed by the vignette itself.
<<maqcreport2, echo = FALSE, message = FALSE>>=
qcReport(maqcm, reportname = "rnadeg2", clean = FALSE, qcto = Qc2Tex2)
@
<<maqcreport3, echo = FALSE, message = FALSE>>=
qcReport(maqcm, reportname = "rnadeg3", clean = FALSE, qcto = Qc2Tex3)
@

<<maqcreport4, eval = FALSE>>=
qcReport(maqcm, reportname = "rnadeg2", qcto = Qc2Tex2)
@

\includepdfmerge{rnadeg2.pdf, 1-2, rnadeg3.pdf, 1-2}

% \subsubsection*{The metadata section}

\subsection{New report types}

A reporting function is a function that 

\begin{itemize}
\item Converts the appropriate QC item sections (for example the \Rfunction{Qc2Tex2} function described above)
\item Optionally includes the QC item sections into addition header
  and footer, either by writing these directly or by inserting the
  sections into an appropriate template. The reporting functions that
  are available in \Biocpkg{qcmetrics} can be found in
  \Rfunction{?qcReport}: \Rfunction{reporting\_tex} for type
  \texttt{tex}, \Rfunction{reporting\_pdf} for type \texttt{pdf},
  \ldots These functions should use the same arguments as
  \Rfunction{qcReport} insofar as possible.
\item Once written to a report source file, the final report type is
  generated. \Rfunction{knit} is used to convert the \texttt{Rnw}
  source to \texttt{tex} which is compiled into \texttt{pdf} using
  \Rfunction{tools::texi2pdf}. The \texttt{Rmd} content is directly
  written into a file which is knitted and converted to \texttt{html}
  using \Rfunction{knit2html} (which call \Rfunction{markdownTOHTML}).
\end{itemize}

New \Rfunction{reporting\_abc} functions can be called directly or
passed to \Rfunction{qcReport} using the \Robject{reporter} argument.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{QC packages}\label{sec:qcpkg}

\subsection{A simple RNA degradation package}

While the examples presented in section \ref{sec:pipeline} and in
particular the wrapper function in section \ref{sec:wrapper} are
flexible and fast ways to design QC pipeline prototypes, a more robust
mechanism is desirable for production pipelines. The \R packaging
mechanism is ideally suited for this as it provides versioning,
documentation, unit testing and easy distribution and installation
facilities.

While the detailed description of package development is out of the
scope of this document, it is of interest to provide an overview of
the development of a QC package. Taking the wrapper function, it could
be used the create the package structure

<<qcpkg0, eval=FALSE>>=
package.skeleton("RnaDegQC", list = "rnadeg")
@

The \texttt{DESCRIPTION} file would need to be updated. The packages
\Biocpkg{qcmetrics}, \Biocpkg{affy} and \Biocpkg{yaqcaffy} would
need to be specified as dependencies in the \texttt{Imports:} line and
imported in the \texttt{NAMESPACE} file. The documentation file
\texttt{RnaDegQC/man/rnadeg.Rd} and the (optional)
\texttt{RnaDegQC/man/RnaDegQC-packge.Rd} would need to be updated.

Alternatively, the \Rfunction{rnadeg} function could be modularised so
that QC items would be created and returned by dedicated constructors
like \Rfunction{makeRnaDegSlopes} and
\Rfunction{makeRnaDegRatios}. This would provide other developers with
the means to re-use some components of the pipeline by using the
package.

\subsection{A QC pipeline repository}

The wiki on the \Biocpkg{qcmetrics} github
page\footnote{\url{https://github.com/lgatto/qcmetrics}} can be edited
by any github user and will be used to cite, document and share QC
functions, pipelines and packages, in particular those that make use
of the \Biocpkg{qcmetrics} infrastructure.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Conclusions}\label{sec:ccl}

\R and Bioconductor are well suited for the analysis of high
throughput biology data. They provide first class statistical
routines, excellent graph capabilities and an interface of choice to
import and manipulate various omics data, as demonstrated by the
wealth of
packages\footnote{\url{http://bioconductor.org/packages/release/BiocViews.html\#\_\_\_QualityControl}}
that provide functionalities for QC.

The \Biocpkg{qcmetrics} package is different than existing \R
packages and QC systems in general. It proposes a unique
domain-independent framework to design QC pipelines and is thus suited
for any use case. The examples presented in this document illustrated
the application of \Biocpkg{qcmetrics} on data containing single or
multiple samples or experimental runs from different technologies. It
is also possible to automate the generation of QC metrics for a set of
repeated (and growing) analyses of standard samples to establish
\emph{lab memory} types of QC reports, that track a set of metrics for
controlled standard samples over time. It can be applied to raw data
or processed data and tailored to suite precise needs. The
popularisation of integrative approaches that combine multiple types
of data in novel ways stresses out the need for flexible QC
development.

\Biocpkg{qcmetrics} is a versatile software that allows rapid and
easy QC pipeline prototyping and development and supports
straightforward migration to production level systems through its well
defined packaging mechanism.

%% Maybe elaborate on difference between single sample vs multiple
%% sample QC, and how to distinguish them in the latter case?

\clearpage

\section*{Acknowledgements}\label{sec:ack} 

Many thanks to Arnoud Groen for providing the $^{15}$N data and
Andrzej Oles for helpful comments and suggestions about the package
and this document.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section*{Session information}\label{sec:sessionInfo} 

All software and respective versions used to produce this document are listed below.

<<sessioninfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@

\bibliography{qcmetrics}

\end{document}