%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{The 'qcmetrics' infrastructure for quality control and reporting} %\VignetteKeywords{Bioinformatics, proteomics, genomics, transcriptomics, mass-spectrometry, Quality control, reporting} %\VignettePackage{qcmetrics} \documentclass[12pt, oneside]{article} <>= BiocStyle::latex() @ \usepackage[final]{pdfpages} \includepdfset{nup=2x2, pages=-, frame=TRUE} \usepackage{wasysym} \bioctitle[\Biocpkg{qcmetrics}]{The \Biocpkg{qcmetrics} infrastructure for quality control and automatic reporting} \author{ Laurent Gatto\footnote{\email{lg390@cam.ac.uk}}\\ Computational Proteomics Unit\footnote{\url{http://cpu.sysbiol.cam.ac.uk}}\\ University of Cambridge, UK } \begin{document} \maketitle %% Abstract and keywords %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \vskip 0.3in minus 0.1in \hrule \begin{abstract} The \Biocpkg{qcmetrics} package is a framework that provides simple data containers for quality metrics and support for automatic report generation. This document briefly illustrates the core data structures and then demonstrates the generation and automation of quality control reports for microarray and proteomics data. \end{abstract} \textit{Keywords}: Bioinformatics, Quality control, reporting, visualisation \vskip 0.1in minus 0.05in \hrule \vskip 0.2in minus 0.1in %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newpage \tableofcontents <>= library("knitr") opts_chunk$set(fig.align = 'center', fig.show = 'hold', par = TRUE, prompt = FALSE, tidy = FALSE, eval = TRUE, stop_on_error = 1L, comment = "##") options(replace.assign = TRUE, width = 55) suppressPackageStartupMessages(library("qcmetrics")) suppressPackageStartupMessages(library("MAQCsubsetAFX")) suppressPackageStartupMessages(library("yaqcaffy")) suppressPackageStartupMessages(library("affy")) suppressPackageStartupMessages(library("AnnotationDbi")) suppressPackageStartupMessages(library("RforProteomics")) suppressPackageStartupMessages(library("mzR")) suppressPackageStartupMessages(library("MSnbase")) set.seed(1) @ %%$ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction}\label{sec:intro} Quality control (QC) is an essential step in any analytical process. Data of poor quality can at best lead to the absence of positive results or, much worse, false positives that stem from uncaught faulty and noisy data and much wasted resources in pursuing red herrings. Quality is often a relative concept that depends on the nature of the biological sample, the experimental settings, the analytical process and other factors. Research and development in the area of QC has generally lead to two types of work being disseminated. Firstly, the comparison of samples of variable quality and the identification of metrics that correlate with the quality of the data. These quality metrics could then, in later experiments, be used to assess their quality. Secondly, the design of domain-specific software to facilitate the collection, visualisation and interpretation of various QC metrics is also an area that has seen much development. QC is a prime example where standardisation and automation are of great benefit. While a great variety of QC metrics, software and pipelines have been described for any assay commonly used in modern biology, we present here a different tool for QC, whose main features are flexibility and versatility. The \Biocpkg{qcmetrics} package is a general framework for QC that can accommodate any type of data. It provides a flexible framework to implement QC items that store relevant QC metrics with a specific visualisation mechanism. These individual items can be bundled into higher level QC containers that can be readily used to generate reports in various formats. As a result, it becomes easy to develop complete custom pipelines from scratch and automate the generation of reports. The pipelines can be easily updated to accommodate new QC items of better visualisation techniques. Section \ref{sec:qcclasses} provides an overview of the framework. In section \ref{sec:pipeline}, we use microarray (subsection \ref{sec:marray}) and proteomics data (subsection \ref{sec:prot}) to demonstrate the elaboration of QC pipelines: how to create individual QC objects, how to bundle them to create sets of QC metrics and how to generate reports in multiple formats. We also show how the above steps can be fully automated through simple wrapper functions in section \ref{sec:wrapper}. Although kept simple in the interest of time and space, these examples are meaningful and relevant. In section \ref{sec:report}, we provide more detail about the report generation process, how reports can be customised and how new exports can be contributed. We proceed in section \ref{sec:qcpkg} to the consolidation of QC pipelines using \R and elaborate on the development of dedicated QC packages with \Biocpkg{qcmetrics}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{The QC classes}\label{sec:qcclasses} The package provides two types of QC containers. The \Robject{QcMetric} class stores data and visualisation functions for single metrics. Several such metrics can be bundled into \Robject{QcMetrics} instances, that can be used as input for automated report generation. Below, we will provide a quick overview of how to create respective \Robject{QcMetric} and \Robject{QcMetrics} instances. More details are available in the corresponding documentations. \subsection{The \Robject{QcMetric} class} A QC metric is composed of a description (\Robject{name} in the code chunk below), some QC data (\Robject{qcdata}) and a \Robject{status} that defines if the metric is deemed of acceptable quality (coded as \Robject{TRUE}), bad quality (coded as \Robject{FALSE}) or not yet evaluated (coded as \Robject{NA}). Individual metrics can be displayed as a short textual summary or plotted. To do the former, one can use the default \Rfunction{show} method. <>= library("qcmetrics") qc <- QcMetric(name = "A test metric") qcdata(qc, "x") <- rnorm(100) qcdata(qc) ## all available qcdata summary(qcdata(qc, "x")) ## get x show(qc) ## or just qc status(qc) <- TRUE qc @ Plotting \Robject{QcMetric} instances requires to implement a plotting method that is relevant to the data at hand. We can use a \Rfunction{plot} replacement method to define our custom function. The code inside the \Rfunction{plot} uses \Rfunction{qcdata} to extract the relevant QC data from \Robject{object} that is then passed as argument to \Rfunction{plot} and uses the adequate visualisation to present the QC data. <>= plot(qc) plot(qc) <- function(object, ... ) boxplot(qcdata(object, "x"), ...) plot(qc) @ \subsection{The \Robject{QcMetrics} class} A \Robject{QcMetrics} object is essentially just a list of individual \Robject{QcMetric} instances. It is also possible to set a list of metadata variables to describe the source of the QC metrics. The metadata can be passed as an \Robject{QcMetadata} object (the way it is stored in the \Robject{QcMetrics} instance) or directly as a named \Robject{list}. The \Robject{QcMetadata} is itself a \Robject{list} and can be accessed and set with \Rfunction{metadata} or \Rfunction{mdata}. When accessed, it is returned and displayed as a \Robject{list}. <>= qcm <- QcMetrics(qcdata = list(qc)) qcm metadata(qcm) <- list(author = "Prof. Who", lab = "Big lab") qcm mdata(qcm) @ The metadata can be updated with the same interface. If new named items are passed, the metadata is updated by addition of the new elements. If a named item is already present, its value gets updated. <>= metadata(qcm) <- list(author = "Prof. Who", lab = "Cabin lab", University = "Universe-ity") mdata(qcm) @ The \Robject{QcMetrics} can then be passed to the \Rfunction{qcReport} method to generate reports, as described in more details below. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Creating QC pipelines}\label{sec:pipeline} \subsection{Microarray degradation}\label{sec:marray} We will use the \Robject{refA} Affymetrix arrays from the \Biocexptpkg{MAQCsubsetAFX} package as an example data set and investigate the RNA degradation using the \Rfunction{AffyRNAdeg} from \Biocpkg{affy} \cite{Gautier:2004} and the actin and GAPDH $\frac{3'}{5'}$ ratios, as calculated in the \Biocpkg{yaqcaffy} package \cite{yaqcaffy}. The first code chunk demonstrate how to load the data and compute the QC data\footnote{%% The pre-computed objects can be directly loaded with \Rfunction{load(system.file("extdata/deg.rda", package = "qcmetrics"))} and \Rfunction{load(system.file("extdata/deg.rda", package = "qcmetrics"))}.}. <>= library("MAQCsubsetAFX") data(refA) library("affy") deg <- AffyRNAdeg(refA) library("yaqcaffy") yqc <- yaqc(refA) @ <>= load(system.file("extdata/deg.rda", package = "qcmetrics")) load(system.file("extdata/yqc.rda", package = "qcmetrics")) @ We then create two \Robject{QcMetric} instances, one for each of our quality metrics. <>= qc1 <- QcMetric(name = "Affy RNA degradation slopes") qcdata(qc1, "deg") <- deg plot(qc1) <- function(object, ...) { x <- qcdata(object, "deg") nms <- x$sample.names plotAffyRNAdeg(x, col = 1:length(nms), ...) legend("topleft", nms, lty = 1, cex = 0.8, col = 1:length(nms), bty = "n") } status(qc1) <- TRUE qc1 @ <>= qc2 <- QcMetric(name = "Affy RNA degradation ratios") qcdata(qc2, "yqc") <- yqc plot(qc2) <- function(object, ...) { par(mfrow = c(1, 2)) yaqcaffy:::.plotQCRatios(qcdata(object, "yqc"), "all", ...) } status(qc2) <- FALSE qc2 @ Then, we combine the individual QC items into a \Robject{QcMetrics} instance. <>= maqcm <- QcMetrics(qcdata = list(qc1, qc2)) maqcm @ %% Running this one with echoing so that the auxiliary files, %% in particular the figure directory does not get deleted, as %% it is also created and needed by the vignette itself. <>= qcReport(maqcm, reportname = "rnadeg", clean = FALSE) @ With our \Robject{QcMetrics} data, we can easily generate quality reports in several different formats. Below, we create a \texttt{pdf} report, which is the default type. Using \texttt{type = "html"} would generate the equivalent report in \texttt{html} format. See \Rfunction{?qcReport} for more details. <>= qcReport(maqcm, reportname = "rnadeg", type = "pdf") @ The resulting report is shown below. Each \Robject{QcMetric} item generates a section named according to the object's name. A final summary section shows a table with all the QC items and their status. The report concludes with a detailed session information section. \bigskip In addition to the report, it is of course advised to store the actual \Robject{QcMetrics} object. This is most easily done with the \R \Rfunction{save}/\Rfunction{load} and \Rfunction{saveRDS}/\Rfunction{readRDS} functions. As the data and visualisation methods are stored together, it is possible to reproduce the figures from the report or further explore the data at a later stage. \includepdf{rnadeg.pdf} \clearpage \subsection{A wrapper function}\label{sec:wrapper} Once an appropriate set of quality metrics has been identified, the generation of the \Robject{QcMetrics} instances can be wrapped up for automation. <>= rnadeg @ It is now possible to generate a \Robject{QcMetrics} object from a set of CEL files or directly from an \Robject{affybatch} object. The \Robject{status} argument allows to directly set the statuses of the individual QC items; these can also be set later, as illustrated below. If a report type is specified, the corresponding report is generated. <>= maqcm <- rnadeg(refA) @ <>= status(maqcm) <- c(NA, NA) @ <>= status(maqcm) ## check the QC data (status(maqcm) <- c(TRUE, FALSE)) @ The report can be generated manually with \Rfunction{qcReport(maqcm)} or directly with the wrapper function as follows: <>= maqcm <- rnadeg(refA, type = "pdf") @ \subsection{Proteomics raw data}\label{sec:prot} To illustrate a simple QC analysis for proteomics data, we will download data set \texttt{PXD00001} from the ProteomeXchange repository in the \texttt{mzXML} format \cite{Pedrioli:2004}. The MS$^2$ spectra from that mass-spectrometry run are then read into \R\footnote{%% In the interest of time, this code chunk has been pre-computed and a subset (1 in 3) of the \Robject{exp} instance is distributed with the package. The data is loaded with \Rfunction{load(system.file("extdata/exp.rda", package = "qcmetrics"))}.} %% and stored as an \Robject{MSnExp} experiment using the \Rfunction{readMSData} function from the \Biocpkg{MSnbase} package \cite{Gatto:2012}. <>= load(system.file("extdata/exp.rda", package = "qcmetrics")) @ <>= library("RforProteomics") msfile <- getPXD000001mzXML() library("MSnbase") exp <- readMSData(msfile, verbose = FALSE) @ The \Robject{QcMetrics} will consist of 3 items, namely a chromatogram constructed with the MS$^2$ spectra precursor's intensities, a figure illustrating the precursor charges in the MS space and an $\frac{m}{z}$ delta plot illustrating the suitability of MS$^2$ spectra for identification (see \Rfunction{?plotMzDelta} or \cite{Foster:2011}). <>= qc1 <- QcMetric(name = "Chromatogram") x <- rtime(exp) y <- precursorIntensity(exp) o <- order(x) qcdata(qc1, "x") <- x[o] qcdata(qc1, "y") <- y[o] plot(qc1) <- function(object, ...) plot(qcdata(object, "x"), qcdata(object, "y"), col = "darkgrey", type ="l", xlab = "retention time", ylab = "precursor intensity") @ <>= qc2 <- QcMetric(name = "MS space") qcdata(qc2, "p2d") <- plot2d(exp, z = "charge", plot = FALSE) plot(qc2) <- function(object) { require("ggplot2") print(qcdata(object, "p2d")) } @ <>= qc3 <- QcMetric(name = "m/z delta plot") qcdata(qc3, "pmz") <- plotMzDelta(exp, plot = FALSE, verbose = FALSE) plot(qc3) <- function(object) suppressWarnings(print(qcdata(object, "pmz"))) @ Note that we do not store the raw data in any of the above instances, but always pre-compute the necessary data or plots that are then stored as \Robject{qcdata}. If the raw data was to be needed in multiple \Robject{QcMetric} instances, we could re-use the same \Robject{qcdata} \emph{environment} to avoid unnecessary copies using \Rfunction{qcdata(qc2) <- qcenv(qc1)} and implement different views through custom \Rfunction{plot} methods. \bigskip Let's now combine the three items into a \Robject{QcMetrics} object, decorate it with custom metadata using the MIAPE information from the \Robject{MSnExp} object and generate a report. <>= protqcm <- QcMetrics(qcdata = list(qc1, qc2, qc3)) metadata(protqcm) <- list( data = "PXD000001", instrument = experimentData(exp)@instrumentModel, source = experimentData(exp)@ionSource, analyser = experimentData(exp)@analyser, detector = experimentData(exp)@detectorType, manufacurer = experimentData(exp)@instrumentManufacturer) @ %% Running this one with echoing so that the auxiliary files, %% in particular the figure directory does not get deleted, as %% it is also created and needed by the vignette itself. <>= qcReport(protqcm, reportname = "protqc", clean=FALSE, quiet=TRUE) @ The status column of the summary table is empty as we have not set the QC items statuses yet. <>= qcReport(protqcm, reportname = "protqc") @ \includepdf{protqc.pdf} \subsection{Processed $^{15}$N labelling data}\label{sec:n15} In this section, we describe a set of $^{15}$N metabolic labelling QC metrics \cite{Krijgsveld:2003}. The data is a phospho-enriched $^{15}$N labelled \textit{Arabidopsis thaliana} sample prepared as described in \cite{Groen:2013}. The data was processed with in-house tools and is available as an \Robject{MSnSet} instance. Briefly, MS$^2$ spectra were search with the Mascot engine and identification scores adjusted with Mascot Percolator. Heavy and light pairs were then searched in the survey scans and $^{15}$N incorporation was estimated based on the peptide sequence and the isotopic envelope of the heavy member of the pair (the \Robject{inc} feature variable). Heavy and light peptides isotopic envelope areas were finally integrated to obtain unlabelled and $^{15}$N quantitation data. The \Robject{psm} object provides such data for PSMs (peptide spectrum matches) with a posterior error probability \textless 0.05 that can be uniquely matched to proteins. We first load the \Biocpkg{MSnbase} package (required to support the \Robject{MSnSet} data structure) and example data that is distributed with the \Biocpkg{qcmetrics} package. We will make use of the \CRANpkg{ggplot2} plotting package. <>= library("ggplot2") library("MSnbase") data(n15psm) psm @ The first QC item examines the $^{15}$N incorporation rate, available in the \Robject{inc} feature variable. We also defined a median incorporation rate threshold \Robject{tr} equal to 97.5 that is used to set the QC status. <>= ## incorporation rate QC metric qcinc <- QcMetric(name = "15N incorporation rate") qcdata(qcinc, "inc") <- fData(psm)$inc qcdata(qcinc, "tr") <- 97.5 status(qcinc) <- median(qcdata(qcinc, "inc")) > qcdata(qcinc, "tr") @ Next, we implement a custom \Rfunction{show} method, that prints 5 summary values of the variable's distribution. <>= show(qcinc) <- function(object) { qcshow(object, qcdata = FALSE) cat(" QC threshold:", qcdata(object, "tr"), "\n") cat(" Incorporation rate\n") print(summary(qcdata(object, "inc"))) invisible(NULL) } @ We then define the metric's \Rfunction{plot} function that represent the distribution of the PSM's incorporation rates as a boxplot, shows all the individual rates as jittered dots and represents the \Robject{tr} threshold as a dotted red line. <>= plot(qcinc) <- function(object) { inc <- qcdata(object, "inc") tr <- qcdata(object, "tr") lab <- "Incorporation rate" dd <- data.frame(inc = qcdata(qcinc, "inc")) p <- ggplot(dd, aes(factor(""), inc)) + geom_jitter(colour = "#4582B370", size = 3) + geom_boxplot(fill = "#FFFFFFD0", colour = "#000000", outlier.size = 0) + geom_hline(yintercept = tr, colour = "red", linetype = "dotted", size = 1) + labs(x = "", y = "Incorporation rate") p } @ $^{15}$N experiments of good quality are characterised by high incorporation rates, which allow to deconvolute the heavy and light peptide isotopic envelopes and accurate quantification. \bigskip The second metric inspects the log$_2$ fold-changes of the PSMs, unique peptides with modifications, unique peptide sequences (not taking modifications into account) and proteins. These respective data sets are computed with the \Rfunction{combineFeatures} function (see \Rfunction{?combineFeatures} for details). <>= fData(psm)$modseq <- ## pep seq + PTM paste(fData(psm)$Peptide_Sequence, fData(psm)$Variable_Modifications, sep = "+") pep <- combineFeatures(psm, as.character(fData(psm)$Peptide_Sequence), "median", verbose = FALSE) modpep <- combineFeatures(psm, fData(psm)$modseq, "median", verbose = FALSE) prot <- combineFeatures(psm, as.character(fData(psm)$Protein_Accession), "median", verbose = FALSE) @ The log$_2$ fold-changes for all the features are then computed and stored as QC data of our next QC item. We also store a pair of values \Robject{explfc} that defined an interval in which we expect our median PSM log$_2$ fold-change to be. <>= ## calculate log fold-change qclfc <- QcMetric(name = "Log2 fold-changes") qcdata(qclfc, "lfc.psm") <- log2(exprs(psm)[,"unlabelled"] / exprs(psm)[, "N15"]) qcdata(qclfc, "lfc.pep") <- log2(exprs(pep)[,"unlabelled"] / exprs(pep)[, "N15"]) qcdata(qclfc, "lfc.modpep") <- log2(exprs(modpep)[,"unlabelled"] / exprs(modpep)[, "N15"]) qcdata(qclfc, "lfc.prot") <- log2(exprs(prot)[,"unlabelled"] / exprs(prot)[, "N15"]) qcdata(qclfc, "explfc") <- c(-0.5, 0.5) status(qclfc) <- median(qcdata(qclfc, "lfc.psm")) > qcdata(qclfc, "explfc")[1] & median(qcdata(qclfc, "lfc.psm")) < qcdata(qclfc, "explfc")[2] @ As previously, we provide a custom \Rfunction{show} method that displays summary values for the four fold-changes. The \Rfunction{plot} function illustrates the respective log$_2$ fold-change densities and the expected median PSM fold-change range (red rectangle). The expected 0 log$_2$ fold-change is shown as a dotted black vertical line and the observed median PSM value is shown as a blue dashed line. <>= show(qclfc) <- function(object) { qcshow(object, qcdata = FALSE) ## default cat(" QC thresholds:", qcdata(object, "explfc"), "\n") cat(" * PSM log2 fold-changes\n") print(summary(qcdata(object, "lfc.psm"))) cat(" * Modified peptide log2 fold-changes\n") print(summary(qcdata(object, "lfc.modpep"))) cat(" * Peptide log2 fold-changes\n") print(summary(qcdata(object, "lfc.pep"))) cat(" * Protein log2 fold-changes\n") print(summary(qcdata(object, "lfc.prot"))) invisible(NULL) } plot(qclfc) <- function(object) { x <- qcdata(object, "explfc") plot(density(qcdata(object, "lfc.psm")), main = "", sub = "", col = "red", ylab = "", lwd = 2, xlab = expression(log[2]~fold-change)) lines(density(qcdata(object, "lfc.modpep")), col = "steelblue", lwd = 2) lines(density(qcdata(object, "lfc.pep")), col = "blue", lwd = 2) lines(density(qcdata(object, "lfc.prot")), col = "orange") abline(h = 0, col = "grey") abline(v = 0, lty = "dotted") rect(x[1], -1, x[2], 1, col = "#EE000030", border = NA) abline(v = median(qcdata(object, "lfc.psm")), lty = "dashed", col = "blue") legend("topright", c("PSM", "Peptides", "Modified peptides", "Proteins"), col = c("red", "steelblue", "blue", "orange"), lwd = 2, bty = "n") } @ A good quality experiment is expected to have a tight distribution centred around 0. Major deviations would indicate incomplete incorporation, errors in the respective amounts of light and heavy material used, and a wide distribution would reflect large variability in the data. \bigskip Our last QC item inspects the number of features that have been identified in the experiment. We also investigate how many peptides (with or without considering the modification) have been observed at the PSM level and the number of unique peptides per protein. Here, we do not specify any expected values as the number of observed features is experiment specific; the QC status is left as \Robject{NA}. <>= ## number of features qcnb <- QcMetric(name = "Number of features") qcdata(qcnb, "count") <- c( PSM = nrow(psm), ModPep = nrow(modpep), Pep = nrow(pep), Prot = nrow(prot)) qcdata(qcnb, "peptab") <- table(fData(psm)$Peptide_Sequence) qcdata(qcnb, "modpeptab") <- table(fData(psm)$modseq) qcdata(qcnb, "upep.per.prot") <- fData(psm)$Number_Of_Unique_Peptides @ The counts are displayed by the new \Rfunction{show} and plotted as bar charts by the \Rfunction{plot} methods. <>= show(qcnb) <- function(object) { qcshow(object, qcdata = FALSE) print(qcdata(object, "count")) } plot(qcnb) <- function(object) { par(mar = c(5, 4, 2, 1)) layout(matrix(c(1, 2, 1, 3, 1, 4), ncol = 3)) barplot(qcdata(object, "count"), horiz = TRUE, las = 2) barplot(table(qcdata(object, "modpeptab")), xlab = "Modified peptides") barplot(table(qcdata(object, "peptab")), xlab = "Peptides") barplot(table(qcdata(object, "upep.per.prot")), xlab = "Unique peptides per protein ") } @ In the code chunk below, we combine the 3 QC items into a \Robject{QcMetrics} instance and generate a report using meta data extracted from the \Robject{psm} \Robject{MSnSet} instance. <>= n15qcm <- QcMetrics(qcdata = list(qcinc, qclfc, qcnb)) qcReport(n15qcm, reportname = "n15qcreport", title = expinfo(experimentData(psm))["title"], author = expinfo(experimentData(psm))["contact"], clean = FALSE) @ We provide with the package the \Rfunction{n15qc} wrapper function that automates the above pipeline. The names of the feature variable columns and the thresholds for the two first QC items are provided as arguments. In case no report name is given, a custom title with date and time is used, to avoid overwriting existing reports. % \begin{small} % <>= % n15qc % @ % \end{small} \includepdf{n15qcreport.pdf} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Report generation}\label{sec:report} The report generation is handled by dedicated packages, in particular \CRANpkg{knitr} \cite{Xie:2013} and \CRANpkg{markdown} \cite{markdown}. \subsection{Custom reports} \subsubsection*{Templates} It is possible to customise reports for any of the existing types. The generation of the \texttt{pdf} report is based on a \texttt{tex} template, \texttt{knitr-template.Rnw}, that is available with the package\footnote{You can find it with \Rfunction{system.file("templates", "knitr-template.Rnw", package = "qcmetrics")}.}. The \Rfunction{qcReport} method accepts the path to a custom \Robject{template} as argument. The template corresponds to a \LaTeX~preamble with the inclusion of two variables that are passed to the \Rfunction{qcReport} and used to customise the template: the author's name and the title of the report. The former is defaulted to the system username with \Rfunction{Sys.getenv("USER")} and the later is a simple character. The \Rfunction{qcReport} function also automatically generates summary and session information sections. The core of the QC report, i.e the sections corresponding the the individual \Robject{QcMetric} instances bundled in a \Robject{QcMetrics} input (described in more details below) is then inserted into the template and weaved, or more specifically \Rfunction{knit}'ted into a \texttt{tex} document that is (if \Robject{type=pdf}) compiled into a \texttt{pdf} document. The generation of the \texttt{html} report is enabled by the creation of a \R markdown file (\texttt{Rmd}) that is then converted with \CRANpkg{knitr} and \CRANpkg{markdown} into \texttt{html}. The \texttt{Rmd} syntax being much simpler, no \texttt{Rmd} template is needed. It is possible to customise the final \texttt{html} output by providing a \texttt{css} definition as \Robject{template} argument when calling \Rfunction{qcReport}. Initial support for the \CRANpkg{Nozzle.R1} package \cite{nozzle} is available with type \texttt{nozzle}. \subsubsection*{\Robject{QcMetric} sections} The generation of the sections for \Robject{QcMetric} instances is controlled by a function passed to the \Robject{qcto} argument. This function takes care of transforming an instance of class \Robject{QcMetric} into a \texttt{character} that can be inserted into the report. For the \texttt{tex} and \texttt{pdf} reports, \Rfunction{Qc2Tex} is used; the \texttt{Rmd} and \texttt{html} reports make use of \Rfunction{Qc2Rmd}. These functions take an instance of class \Robject{QcMetrics} and the index of the \Robject{QcMetric} to be converted. <>= qcmetrics:::Qc2Tex qcmetrics:::Qc2Tex(maqcm, 1) @ Let's investigate how to customise these sections depending on the \Robject{QcMetric} status, the goal being to highlight positive QC results (i.e. when the status is \Robject{TRUE}) with {\color{green} $\CIRCLE$} (or {\color{green} $\smiley$}), negative results with {\color{red} $\CIRCLE$} (or {\color{red} $\frownie$}) and use $\Circle$ if status is \Robject{NA} after the section title\footnote{The respective symbols are \texttt{CIRCLE}, \texttt{smiley}, \texttt{frownie} and \texttt{Circle} from the \LaTeX{} package \texttt{wasysym}.}. Below, we see that different section headers are composed based on the value of \Rfunction{status(object[[i]])} by appending the appropriate \LaTeX~symbol. <>= Qc2Tex2 @ To use this specific sectioning code, we pass our new function as \Robject{qcto} when generating the report. To generate smiley labels, use \Rfunction{Qc2Tex3}. %% Running this one with echoing so that the auxiliary files, %% in particular the figure directory does not get deleted, as %% it is also created and needed by the vignette itself. <>= qcReport(maqcm, reportname = "rnadeg2", clean = FALSE, qcto = Qc2Tex2) @ <>= qcReport(maqcm, reportname = "rnadeg3", clean = FALSE, qcto = Qc2Tex3) @ <>= qcReport(maqcm, reportname = "rnadeg2", qcto = Qc2Tex2) @ \includepdfmerge{rnadeg2.pdf, 1-2, rnadeg3.pdf, 1-2} % \subsubsection*{The metadata section} \subsection{New report types} A reporting function is a function that \begin{itemize} \item Converts the appropriate QC item sections (for example the \Rfunction{Qc2Tex2} function described above) \item Optionally includes the QC item sections into addition header and footer, either by writing these directly or by inserting the sections into an appropriate template. The reporting functions that are available in \Biocpkg{qcmetrics} can be found in \Rfunction{?qcReport}: \Rfunction{reporting\_tex} for type \texttt{tex}, \Rfunction{reporting\_pdf} for type \texttt{pdf}, \ldots These functions should use the same arguments as \Rfunction{qcReport} insofar as possible. \item Once written to a report source file, the final report type is generated. \Rfunction{knit} is used to convert the \texttt{Rnw} source to \texttt{tex} which is compiled into \texttt{pdf} using \Rfunction{tools::texi2pdf}. The \texttt{Rmd} content is directly written into a file which is knitted and converted to \texttt{html} using \Rfunction{knit2html} (which call \Rfunction{markdownTOHTML}). \end{itemize} New \Rfunction{reporting\_abc} functions can be called directly or passed to \Rfunction{qcReport} using the \Robject{reporter} argument. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{QC packages}\label{sec:qcpkg} \subsection{A simple RNA degradation package} While the examples presented in section \ref{sec:pipeline} and in particular the wrapper function in section \ref{sec:wrapper} are flexible and fast ways to design QC pipeline prototypes, a more robust mechanism is desirable for production pipelines. The \R packaging mechanism is ideally suited for this as it provides versioning, documentation, unit testing and easy distribution and installation facilities. While the detailed description of package development is out of the scope of this document, it is of interest to provide an overview of the development of a QC package. Taking the wrapper function, it could be used the create the package structure <>= package.skeleton("RnaDegQC", list = "rnadeg") @ The \texttt{DESCRIPTION} file would need to be updated. The packages \Biocpkg{qcmetrics}, \Biocpkg{affy} and \Biocpkg{yaqcaffy} would need to be specified as dependencies in the \texttt{Imports:} line and imported in the \texttt{NAMESPACE} file. The documentation file \texttt{RnaDegQC/man/rnadeg.Rd} and the (optional) \texttt{RnaDegQC/man/RnaDegQC-packge.Rd} would need to be updated. Alternatively, the \Rfunction{rnadeg} function could be modularised so that QC items would be created and returned by dedicated constructors like \Rfunction{makeRnaDegSlopes} and \Rfunction{makeRnaDegRatios}. This would provide other developers with the means to re-use some components of the pipeline by using the package. \subsection{A QC pipeline repository} The wiki on the \Biocpkg{qcmetrics} github page\footnote{\url{https://github.com/lgatto/qcmetrics}} can be edited by any github user and will be used to cite, document and share QC functions, pipelines and packages, in particular those that make use of the \Biocpkg{qcmetrics} infrastructure. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Conclusions}\label{sec:ccl} \R and Bioconductor are well suited for the analysis of high throughput biology data. They provide first class statistical routines, excellent graph capabilities and an interface of choice to import and manipulate various omics data, as demonstrated by the wealth of packages\footnote{\url{http://bioconductor.org/packages/release/BiocViews.html\#\_\_\_QualityControl}} that provide functionalities for QC. The \Biocpkg{qcmetrics} package is different than existing \R packages and QC systems in general. It proposes a unique domain-independent framework to design QC pipelines and is thus suited for any use case. The examples presented in this document illustrated the application of \Biocpkg{qcmetrics} on data containing single or multiple samples or experimental runs from different technologies. It is also possible to automate the generation of QC metrics for a set of repeated (and growing) analyses of standard samples to establish \emph{lab memory} types of QC reports, that track a set of metrics for controlled standard samples over time. It can be applied to raw data or processed data and tailored to suite precise needs. The popularisation of integrative approaches that combine multiple types of data in novel ways stresses out the need for flexible QC development. \Biocpkg{qcmetrics} is a versatile software that allows rapid and easy QC pipeline prototyping and development and supports straightforward migration to production level systems through its well defined packaging mechanism. %% Maybe elaborate on difference between single sample vs multiple %% sample QC, and how to distinguish them in the latter case? \clearpage \section*{Acknowledgements}\label{sec:ack} Many thanks to Arnoud Groen for providing the $^{15}$N data and Andrzej Oles for helpful comments and suggestions about the package and this document. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Session information}\label{sec:sessionInfo} All software and respective versions used to produce this document are listed below. <>= toLatex(sessionInfo()) @ \bibliography{qcmetrics} \end{document}