\documentclass[10pt]{article} %\VignetteIndexEntry{TarSeqQC: Targeted Sequencing Experiment Quality Control} %\VignetteKeyword{targeted sequencing} %\VignetteKeyword{quality control} %\VignetteKeyword{exploration tool} %\VignettePackage{TarSeqQC} \usepackage{natbib} \usepackage{authblk} \usepackage{times} \usepackage[colorlinks=false, urlcolor=black, pdfborder={0 0 0}]{hyperref} \usepackage{amsmath} \usepackage{enumerate} \usepackage{graphicx} \usepackage[nogin]{Sweave} \textwidth=6.5in \textheight=8.5in \parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \newcommand{\bedfile}{\textit{Bed File}} \newcommand{\bamfile}{\textit{BAM File}} \newcommand{\fastafile}{\textit{FASTA File}} \newcommand{\pileup}{\textit{pileup}} \newcommand{\targetexperiment}{\Rclass{TargetExperiment}} \newcommand{\tarseq}{\Rpackage{TarSeqQC}} \title{\Rpackage{TarSeqQC}: Targeted Sequencing Experiment Quality Control} \author[1]{Gabriela A Merino} \author[1]{Crist\'{o}bal Fresno} \author[1]{Elmer A Fern\'{a}ndez} \affil[1]{CONICET-Universidad Cat\'{o}lica de C\'{o}rdoba, Argentina} \date{\today} \begin{document} \SweaveOpts{concordance=TRUE} % \SweaveOpts{concordance=FALSE} \maketitle \begin{center} gmerino@bdmg.com.ar \end{center} \vspace{2cm} \begin{abstract} Targeted Sequencing experiments are a Next Generation Sequencing application, designed to explore a small group of specific genomic regions. The \tarseq{}$~$ package models this kind of experiments in R, and its main goal is to allow the quality control and fast exploration over the experiment results. To do this, a new \R{} class, called \targetexperiment, was implemented. This class is based on the \bedfile, that characterize the experiment, the alignment \bamfile{} and the reference genome \fastafile. When the constructor is called, coverage and read count information are computed for the targeted sequences. After that, exploration and quality control could be carried out using graphical and numerical tools. Density, bar, read profile and box plots were implemented to achieve this task. A circular histogram plot was also implemented in order to summarize all experiment results. Coverage or median count intervals can be defined and explored to further assist quality control analysis. Library and pool preparation, sequencing errors, fragment length or gc content bias could be easily detected. Finally, an .xlsx file reporting quality control results can be built. \end{abstract} <>= options(prompt="> ", continue="+ ", width=78, useFancyQuotes=FALSE, digits=3) @ \newpage \tableofcontents \newpage \section{Introduction} \par Next Generation Sequencing (NGS) technologies produce huge volume of sequence data at relative low cost. Among the different NGS applications, Targeted Sequencing (TS) allows the exploration of specific genomic regions, called \textit{features}, of a small group of genes \citep{metzker2010sequencing}. An ordinary application of TS is to detect Single Nucleotide Polimorphisms (SNPs) involved in several pathologies. Nowadays, \textit{TS cancer panels} are emerging as a new screening methodology to explore specific regions of a small number of genes known to be related to cancer. %like cancer. In this case, each feature corresponding %with a small group of genes related to a particular pathology. \\ In TS, specific regions of a DNA sample are copied and amplified by PCR. If a target region is too large, several primers can be used to read it. In addition, if the panel also has a large number of interest genomic regions, different PCR pools could be required in order to achieve a good coverage. All DNA fragments are sequenced in a NGS machine, generating millions of short sequence reads, though less than if the whole genome was sequenced. Reads are then aligned against a reference genome and, after that, downstream analysis could be performed. However, prior to this, it is crucial to evaluate the run performance, as well as the experiment quality control, i.e., how well the features were sequenced, which feature and gene coverages were achieved, if some problems arise in the global setting or by specific PCR pools \citep{metzker2010sequencing}. \\ At present, several open access tools can be used to explore and control experiment results\citep{lee2012bioinformatics}. Those tools allow visualization and some level of read profiles quantification. But, they were %But they are %very powerful and computational intensive tools moreover if only a small %fraction of the genome is handled. developed as general purpose tools to cover a wide range of NGS applications, mainly for whole genome exploration. Consequently, they require great amount of computational resources and power. On the other hand, in TS only small group of regions, the features, are required to be explored and characterized in terms of coverage or counts, as well as, the evaluation and comparison of pool efficiency. In addition, this analysis should be also performed at a gene level in order to provide a general results overview. In this scenario, current genomic tools have became heavy and coarse for such amount of data. Consequently, the availability of light, fast and specific tools for TS data handling and visualization is a must in current labs.\\ % On the other hand, in TS it is required % to characterize and evaluate statistics such as the feature and gene % achieved coverage. If PCR pools were used, is also necessary an evaluation % and comparison of pool results.\\ Here we present \tarseq{} \R{} package, an exploration tool for fast visualization and quality control of TS experiments. Its use is not restricted to TS and can also be used to analyze data from others NGS applications in which \textit{feature-gene} structure could be defined, like exons or isoforms in RNA-seq and amplicons in DNA-seq.\\ This vignette intends to guide through to the use of the \tarseq{} \R{} \Bioconductor{} package. First, the input data format is described. Then, we show how to build an instance of the \targetexperiment{} class. After that, we will graphically explore the results and do the quality control over the sequenced features. Finally, we will build an .xlsx report that summarize the analysis above. \section{The \targetexperiment{} class} \tarseq{} \R{} package is based on the \targetexperiment{} class. The Figure \ref{fig:class} shows the \targetexperiment{} class structure.\\ \begin{figure}[!hbp] \begin{center} \includegraphics[scale=0.6]{class.pdf} \end{center} \caption{\targetexperiment{} class diagram.} \label{fig:class} \end{figure} The \targetexperiment{} class has nine slots: \begin{itemize} \item \Rfunarg{bedFile}: a \Rclass{GRanges} object that models the \bedfile{} \item \Rfunarg{bamFile}: a \Rclass{BamFile} object that is a reference to the \bamfile{}. \item \Rfunarg{fastaFile}: a \Rclass{FaFile} object that is a reference to the reference sequence file. \item \Rfunarg{featurePanel}: a \Rclass{GRanges} object that models the feature panel and related statistics. \item \Rfunarg{genePanel}: a \Rclass{GRanges} object that models the analyzed panel and related statistics at a gene level. \item \Rfunarg{scanBamP}: a \Rclass{ScanBamParam} containing the information to scan the \bamfile. \item \Rfunarg{pileupP}: a \Rclass{PileupParam} containing the information to build the pileup matrix. \item \Rfunarg{attribute}: a \Rclass{character} indicating which attribute \textit{coverage} or \textit{medianCounts} will be used to the analysis. \item \Rfunarg{feature}: a \Rclass{character} indicating the name of the analyzed features, e.g.: ``amplicon'', ``exon'', ``transcript''. \end{itemize} The next sections will illustrate how the \targetexperiment{} methods can be used. For this, the \tarseq{} \R{} package provides a \bedfile, a \bamfile, a \fastafile{} and a dataset that stores the \targetexperiment{} object built with those. This example case is based on a synthetic amplicon sequencing experiment containing 29 \textit{amplicons} of 8 genes in 4 chromosomes. \\ %\clearpage \section{Input Data} A TS experiment is characterized by the presence of a \bedfile{} which defines the \textit{features} that should be sequenced. The \tarseq{} package follows this architecture, where the \bedfile{} is the key data of the experiment. However, \tarseq{} also requires mainly three pieces of information that should be provided in order to call the \targetexperiment{} constructor. The \bedfile{}, the \bamfile, that contains the obtained alignment for the sequenced reads, and the sequence \fastafile. The complete path to these files should be defined when the \targetexperiment{} constructor is called. \\ Other parameters can also be specified in the \targetexperiment{} object constructor. The \Rfunarg{scanBamP} and \Rfunarg{pileupP} are instances of the \Rclass{ScanBamParam} and \Rclass{PileupParam} classes defined in the \Rpackage{RSamtools} \R{} \Bioconductor{} package \citep{rsamtools}. These parameters specify how to scan the \bamfile and how to build the corresponding \textit{pileup}, that will be used for exploration and quality control. The \Rfunarg{scanBamP} allows the specification of the features specified in the \bedfile, acording to \cite{rsamtools} specifications. The \Rfunarg{pileupP} establishes what information should be contained in the pileup matrix, for instance, if nucleotides and/or strand should be distinguished. If these two parameters are not specified, the deafult values of their constructors will be used. \\ Other important parameters that should be specified before to conduct the \textit{Quality Control} are \Rfunarg{feature} and \Rfunarg{attribute}. The first is a character that determines which kind of features are contained in the \bedfile{}. In the example presented here, \textit{amplicon} is the feature type. The second parameter, \Rfunarg{attribute}, can be \textit{coverage} or \textit{medianCounts} defining which meassures will be considered in the \textit{Quality Control} analysis. \subsection{Bed File} The \bedfile{} is stored as a \targetexperiment{} slot and it is modeled as a \Rclass{GRanges} object (\cite{genomicranges}). The \bedfile{} must be as a tabular file in which each row represents one feature. This file should contain at least ``chr'', ``start'', ``end'' , ``name'' and ``gene'' named columns. Additional columns like ``strand'' or another experimental information, could be included and would be conserved. For example, in some experiments, more than one PCR pool is neccesary. In this case, the \bedfile{} must also contain a ``pool'' specifying the pool in which each feature was defined. This information is an imperative requisite to evaluate the performance of each PCR pool. \\ A \Rclass{GRanges} object represents a collection of genomic features each having a single start and end location on the genome \citep{genomicranges}. In order to use it to represent the \bedfile, the ``chr'', ``start'' and ``end'' mandatory fields will be used to define the ``seqnames'', ``start'' and ``end'' \Rclass{GRanges} slots. The same will occur if the optional field ``strand'' is included in the \bedfile{}. The ``name'' column will be setted as ranges identifiers. Finally, ``gene'' and additional columns like ``pool'', will be stored as metadata columns. \\ In order to create a \targetexperiment{} object, the complete route to the \bedfile{} and its name must be specified as a \Rclass{character} \R{} object. Thus, to use the example \bedfile{} provided by \tarseq: <<>>= bedFile<-system.file("extdata", "mybed.bed", package="TarSeqQC", mustWork=TRUE) @ Note that any experiment, in which can be defined \textbf{\textit{feature-gene}} relations, could be analyzed using the \tarseq{} \R{} \Bioconductor{} package. For instance, if you have an RNA-seq experiment and you are interested in exploring some genes, you could build your customized \bedfile{} in which the \textit{feature} could be ``exon'' or ``transcript''. \subsection{BAM File} The \bamfile{} stores the alignment results \citep{li2009sequence}. In this example case, it corresponds to the amplicon sequencing experiment alignment. This file will be used to build the \textit{pileup} matrix for the selected features. Briefly, a \textit{pileup} is a matrix in which each row represents a genomic position and have at least three columns: ``pos'', ``chr'' and ``counts''. The first and second columns specify the genomic position and ``counts'' contains the total read counts for this position. Pileup matrix could contains four additional columns that store the read counts for each nucleotide at this position.\\ In order to call the \targetexperiment{} constructor, the complete route to the \bamfile{} and its name must be specified as a \Rclass{character} \R{} object. For example, we can define it in order to use \targetexperiment{} external data: <<>>= bamFile<-system.file("extdata", "mybam.bam", package="TarSeqQC", mustWork=TRUE) @ When the \targetexperiment{} constructor is called the \bamfile{}, will be stored as a \Rclass{BamFile} object \citep{rsamtools} and this object will be a \Rclass{TargetExperiment} slot. \subsection{FASTA File} The \fastafile{} contains the reference sequence previously used to align the \bamfile{} and will be used to extract the feature sequences. This information is useful to compare the pileup results with the reference, in order to detect potential \textit{nucleotide variants}. To create a \targetexperiment{} object, the full path to the \fastafile{} and its name must be specified as a character \R{} object. For example:\\ <<>>= fastaFile<-system.file("extdata", "myfasta.fa", package="TarSeqQC", mustWork=TRUE) @ The \fastafile{} will be stored as a \Rclass{FaFile} object (\cite{rsamtools}) and this object will be setted as a \Rclass{TargetExperiment} slot. \subsection{Additional input data} The previous files are mandatories to call the \targetexperiment{} constructor. Additional parameters can be set in order to apply several methods and perform quality control and results exploration. These parameters are: \begin{itemize} \item \Rfunarg{scanBamP}: is a \Rclass{ScanBamParam} object, that specifies rules to scan a \Rclass{BamFile} object. For example, if you wish only keep those reads that were properly paired, or those that have a specific Cigar code, \Rfunarg{scanBamP} can be used to specify it. In TS experiments, we want to analyze only the features. The way to specify this is using the \Rfunarg{which} parameter in the \textit{scanBamP} constructor. If the \Rfunarg{scanBamP} parameter was not specified in the \targetexperiment{} constructor calling, its default value will be used and then, the \Rfunarg{which} parameter will be specified using the \bedfile{}. \item \Rfunarg{pileupP}: is a \Rclass{PileupParam} object, that specifies rules to build the \pileup{} starting from a \Rclass{BamFile}. You can use the \Rfunarg{pileupP} parameter to specify if you want to distinguish between nucleotides and or strands, filter low read quality or low mapping quality bases. If the \Rfunarg{pileupP} parameter is not specified, its default value will be used. \item \Rfunarg{attribute}: is a \Rclass{character} that specifies which attribute must be used for the results exploration and quality control. The user can choice between \textit{medianCounts} or \textit{coverage}. If the \textit{attribute} parameter is not specified in the \targetexperiment{} constructor, it will be setted as ``''. But, prior to perform some exploration or control, this argument must be set using the \Rmethod{setAttribute()} method. \item \Rfunarg{feature}: is a \Rclass{character} that defines what means a \textit{feature}. In this vignette a little example using an synthetic amplicon targeted sequencing experiment is shown, thus the feature means an \textit{amplicon}. But, the use of \tarseq{} \R{} package is not restricted to analyze only this kind of experiments. If you don't specify the \Rfunarg{feature} parameter, it will be setted as ``''. But, as in the \Rfunarg{attribute} parameter, it must be set prior to perform some exploration or control. It can be done using the \Rmethod{setFeature()} method. \item \Rfunarg{BPPARAM}: is a \Rclass{BiocParallelParam} instance defining the parallel back-end to be used during evaluation (see \citep{biocparallel}). It allows the specification of how many \Rfunarg{workers} (cpus) will be used, etc. \end{itemize} For more information about \Rclass{ScanBamParam} and \Rclass{PileupParam} constructors see \Rpackage{Rsamtools} manual. \section{Creating a \targetexperiment{} object} \subsection{Constructor call} Once you have defined the input data presented above, the \targetexperiment{} constructor could be called using:\\ <>= suppressMessages(library("TarSeqQC")) suppressMessages(library("BiocParallel")) @ <>= library("TarSeqQC") library("BiocParallel") BPPARAM<-bpparam() myPanel<-TargetExperiment(bedFile, bamFile, fastaFile, feature="amplicon", attribute="coverage", BPPARAM=BPPARAM) @ <>= BPPARAM<-bpparam() data(ampliPanel, package="TarSeqQC") myPanel<-ampliPanel rm(ampliPanel) setBamFile(myPanel)<-system.file("extdata", "mybam.bam", package="TarSeqQC", mustWork=TRUE) setFastaFile(myPanel)<-system.file("extdata", "myfasta.fa", package="TarSeqQC", mustWork=TRUE) @ When \Rfunction(TargetExperiment) is called, some \targetexperiment{} methods are invoked in order to define two of the \targetexperiment{} slots. First, the \Rfunction{buildFeaturePanel} is internally used in order to build the \Rfunarg{featurePanel} slot. Then, the \Rfunction{summarizePanel} is invoked in order to build the \Rfunarg{genePanel} slot, which contain the information summarized at a gene level.\\ In the previous example, the \Rfunarg{feature} and \Rfunarg{attribute} parameter values were defined. If they aren't specified in the constructor call, the \targetexperiment{} object can be created, but a warning message will be printed. After that, the \Rfunction{setFeature} and \Rfunction{setAttribute} methods should be used to set these values. For example:\\ <>= # set feature slot value setFeature(myPanel)<-"amplicon" # set attribute slot value setAttribute(myPanel)<-"coverage" @ As mentioned earlier, when the \Rfunarg{scanBamP} and \Rfunarg{pileupP} are not specified in the constructor call, they assume their default constructor. But, after the constructor call, they could be specified using \Rfunarg{setScanBamP} and \Rfunarg{setPileupP} methods. \\ <>= # set scanBamP slot value scanBamP<-ScanBamParam() #set scanBamP which slot bamWhich(scanBamP)<-getBedFile(myPanel) setScanBamP(myPanel)<-scanBamP # set pileupP slot value setPileupP(myPanel)<-PileupParam(max_depth=1000) # build the featurePanel again setFeaturePanel(myPanel)<-buildFeaturePanel(myPanel, BPPARAM) # build the genePanel again setGenePanel(myPanel)<-summarizePanel(myPanel, BPPARAM) @ Note that the previous code specifies that the maximum read depth can be 1000. If you have some genomic positions that has more than 1000 reads, they will be truncated to this number. It is also important to note that, if any change in the \Rfunarg{scanBamP} and/or \Rfunarg{pileupP} slots was done, the \Rfunarg{featurePanel} and the \Rfunarg{genePanel} slots will be setted again.\\ The \tarseq{} \R{} package provides a dataset that stores the \targetexperiment{} object built in the previous steps. To use it, run: <>= data(ampliPanel, package="TarSeqQC") @ The loaded object is called \textit{ampliPanel}. As was mentioned before, some \targetexperiment() methods need to consult the \bamfile and \fastafile, and for this, the \Rfunarg{bamFile} and \Rfunarg{fastaFile} slots are used. Given that \textit{ampliPanel} was built by the package creators, the path file routes that have its slots are not the same in which these files are located in users' computers. Thus, after \textit{ampliPanel} loading it is necessary to re-define the \bamfile{} and \fastafile{} path files, running: <>= # Defining bam file and fasta file names and paths setBamFile(ampliPanel)<-system.file("extdata", "mybam.bam", package="TarSeqQC", mustWork=TRUE) setFastaFile(ampliPanel)<-system.file("extdata", "myfasta.fa", package="TarSeqQC", mustWork=TRUE) @ Note that \Rfunarg{featurePanel} and \Rfunarg{genePanel} do not need to be rebuilt. The redefinition file names is only neccesary in order to use \targetexperiment{} methods that query this files.\\ \subsection{Early exploration} The \targetexperiment{} class has typical \Rfunction{show/print} and \Rfunction{summary} \R{} methods implemented. In addition, the \Rfunction{summaryGeneLev} and \Rfunction{summaryFeatureLev} methods allow the summary exploration at ``gene'' and ``feature'' level. The next example illustrates how these methods can be called: <>= # show/print myPanel # summary summary(myPanel) #summary at feature level summaryFeatureLev(myPanel) #summary at gene level summaryGeneLev(myPanel) @ Using those methods you can easily find, for example, that amplicons were sequenced,in average, at a coverage of 256. It can also be observed that there is at least one amplicon that was not read. This is because the minimum value of the attribute (\textit{coverage}) is 0. To complement this analysis, the attribute distribution can be explored using:\\ <>= g<-plotAttrExpl(myPanel,level="feature",join=TRUE, log=FALSE, color="blue") x11(type="cairo"); g @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.3]{attrExpl.pdf} \end{center} \caption{Attribute distribution and density plots.} \label{attrexpl} \end{figure} \par In the Figure \ref{attrexpl}, the \Rfunarg{join} parameter was set as 'TRUE'. Thus, density and box plots are plotted together. If it is set as 'FALSE', the figure will contain the attribute box-plot on the left and the corresponding attribute density plot on the right. \\ Exploration of any metadata information is also provided. \textit{GC content, feature length} distributions and \textit{gene} or \textit{pool} frequencies can be explored using the \Rfunction{plotMetaDataExpl} method. Other metadata columns, specified in the \textit{bedFile}, can also be analyzed. If the analyzed metadata is numeric, then boxplot and density plot is built. Those can be ploted together, or not, and in log10 scale. On the other hand, if it is categorical like \textit{gene} or \textit{pool}, a bar frequency (absoulte or in relative percentages) is plotted. The following code allows the exploration of feature length distributions and gene frequencies along the features.\\ <>= # explore amplicon length distribution plotMetaDataExpl(ampliPanel, "length", log=FALSE, join=FALSE, color= "blueviolet") @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.4]{lengthExploration.pdf} \end{center} \caption{Amplicon length distribution' plot.} \label{fig:lengthplot} \end{figure} \par <>= # explore gene's relative frequencies plotMetaDataExpl(ampliPanel, "gene", abs=FALSE) @ \begin{figure}[!hbtp] \begin{center} \includegraphics[scale=0.4]{geneExploration.pdf} \end{center} \caption{Gene's relative frquencies.} \label{fig:geneplot} \end{figure} \par Figure \ref{fig:lengthplot} indicates that the mean amplicon length is 74.7 nucleotides with a standard deviations of 20.8. But, as can be observed in the density plot, the mode is lower than this mean. In addition, half of the amplicons have their lengths lower than 71, and the rest between 71 and 125.\\ The Figure \ref{fig:geneplot} shows the gene relative frequencies, in percentages. As can be viewed, more than the 22\% of the amplicons belong to 'gene8' and approximately 21\% belong to 'gene5'. On the other hand, both 'gene1' and 'gene3' have less than 5\% of total amplicons.\\ \section{Deep exploration and Quality Control} \subsection{Targeted selection performance} In the context of TS experiments, it is expected that most of sequencing reads come from the target regions and not the rest of the genome. If it does not, may be a resequencing using another primers will be needed. One way to check this is using the \Rfunction{readFrequencies} which returns a \Rclass{data frame} containing, for each chromosome, the amount and the percentage of reads fall in and out features. Then, the \Rfunction{plotInOutFeatures} method can be used in order to plot this \Rclass{data frame}. It can be achieved running the next code: <>= readFrequencies(ampliPanel) plotInOutFeatures(readFrequencies(ampliPanel)) @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.4]{plotInOut.pdf} \end{center} \caption{Percentages of reads falling in or out of the targeted regions.} \label{fig:plotInOut} \end{figure} \par In the Figure \ref{fig:plotInOut} can be observed that more than half sequence reads belong to genomic regions outside of the amplicons (red bars). This is a not good results, because it indicates that the used primers were not so specifics to capture only the targeted regions.\\ \subsection{Panel overview} When a person is working with a TS experiment, it is interesting to simultaneously evaluate the performance of all the features. In addition, if the user have prefixed attribute intervals, it could be important to compare features according to them. For example, five coverage intervals can be defined according to the Table \ref{table:featureInterval}.\\ \begin{table}[ht] \caption{Coverage intervals} % title name of the table \centering % centering table \begin{tabular}{| l | l | } % creating 10 columns \hline \textbf{Coverage Interval} & \textbf{Motivation} \\ \hline $[0,1)$ & \textit{Not sequenced}\\ [0.5ex] \hline $[1,50)$ & \textit{Low sequencing coverage}\\ [0.5ex] \hline $[50,200)$ & \textit{Regular sequencing coverage}\\ [0.5ex] \hline $[200,500)$ & \textit{Very good sequencing coverage}\\ [0.5ex] \hline $[500,Inf)$ & \textit{Excellent sequencing coverage}\\ [0.5ex] \hline \end{tabular} \label{table:featureInterval} \end{table} Tthese coverage intervals could be incorporated into the analysis. To do this, the \tarseq{} \R{} package needs an interval extreme definitions: <>= # definition of the interval extreme values attributeThres<-c(0,1,50,200,500, Inf) @ %In order to facilitate the object exploration and help the %\textit{Quality Control} process, additional methods were developed. A panel results overview is critical in order to compare and integrate those. To help this task, \tarseq package implements the \Rfunction{plot} method. This is a graphical tool consisting in a polar histogram, in which each gene is represented as a bar. Each bar is colored depending the percentage of features that have their attribute value in a particular prefixed interval. In addition, the bars (genes) can be grouped in chromosomes in order to facilitate the comparisson at this level. The next code help to build this plot: <>= # plot panel overview g<-plot(myPanel, attributeThres, chrLabels =TRUE) g @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.5]{plot.pdf} \end{center} \caption{Panel overview plot.} \label{fig:plot} \end{figure} \par In the example presented here, we can easily distinguish that the unique amplicon of the ``gene3'' was not sequenced. This is because in the Figure \ref{fig:plot}, the bar corresponding to ``gene3'' is colored in red and this color is related to the [0,1) coverage interval. In the same plot, can also be appreciated that this gene has only one amplicon, as depicted in parenthesis in the bar label \textit{``gene3(1)''}. Also it is possible to note that 40\% of ``gene4'' amplicons has a coverage between 1 and 50. Note that this gene have five amplicons, then the 40 \% corresponds to 2 amplicon. Another 20 \% (1 amplicon) has a coverage value between 50 and 200, other one ``gene4'' amplicon have a very good coverage value, it means, between 200 and 500 and the other amplicon have an excellent coverage higher than 500. \\ It is important to note that a small and simple example is presented here. The prevoious plot could have a grater impact when the panel has more features and genes. Figure \ref{fig:plotCPP} shows the panel overview for a TS Experiment based on the \textit{Ion AmpliSeq Cancer Panel Primer Pool}. This is a TS Panel offered by \textit{Life Technologies} (\cite{CPP}) that allows the inspection of 190 amplicons. In this case, can be easily observed that ``MLH1'' and ``CDKN2A'' genes were no sequenced. Can also be appreciated that several genes like ``ALK'', ``VHL'', ``AKT1'', ``ARBB2'', have uniforme coverage values along their amplicons. On the contrary, ``KDR'' and ``PTEN'' genes have some amplicons that were not sequenced and some other with a high coverage. \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.5]{plotCPP.pdf} \end{center} \caption{Cancer Panel Primer Pool overview plot.} \label{fig:plotCPP} \end{figure} Complementing the previous plot, \Rfunction{plotFeatPerform} illustrates a similar graphic where the bars are distributed along the x-axis. In order to expand the polar histogram showed in Figure \ref{fig:plot}, the parameter \Rfunarg{complete} is included. If you set it as TRUE, the resultant plot will contain two graphics. The upper panel is a bar plot at feature level, and the lower, at a gene level. Both graphics incorporate the prefixed attribute intervals information and contain a red line to indicate the average value of the attribute at the corresponding level. In our example, you could run:\\ <>= # plot panel overview g<-plotFeatPerform(myPanel, attributeThres, complete=TRUE, log=FALSE, featureLabs=TRUE, sepChr=TRUE, legend=TRUE) g @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.7]{featPerf.pdf} \end{center} \caption{Amplicon coverage performance. The upper panel is a bar plot at feature level, and the lower, at gene level.} \label{fig:perform} \end{figure} In Figure \ref{fig:perform} we could evaluate the coverage value for each amplicon and gene. We can observe that when coverage is summarized at gene level the highest value is lower than 500. However, at amplicon level, the highest value is grater than 800. \\ The previous plot is also very useful when we are working with panels made-up by several primer pools combination. For example, the \textit{Comprehensive Cancer Panel} is another \textit{Life Tachnologies} panel that allows the exploration of 16000 amplicons from 409 genes related to several cancer types using 4 primer pools (\cite{CCP}). In this case, the \bedfile{} contains a ``pool'' column that stores the pool number for each feature. This information will be conserved in the \targetexperiment{} object that was built from this panel.\\ In the \textit{Quality Control} context, it is so important, evaluate in early analysis stages, if some pool effect exists and if all pool results are comparable. Naturally, the \tarseq{} \R{} package uses this information to assist the user. For example, the Figure \ref{fig:perfCCP} illustrates the use of the \Rfunction{plotFeatPerform} in the described case. Now, you can see that the graphic corresponding to the amplicon level shows a separation between amplicons according to its pool value. Note that the same plot at a gene level is not showed because the \Rfunarg{complete} parameter was set to 'FALSE'. It is important to emphasize that, if correspond, the pool information will be included in all methods of the \targetexperiment{} class. Thus, for example, when you call the \Rfunction{summary} function for a \targetexperiment{} object that has pool information, the output will contain statistic results for the amplicon level and for each pool separately. \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.8]{perfExplCCP.pdf} \end{center} \caption{Performance exploration of an Ion AmpliSeq Comprehensive Cancer Panel experiment.} \label{fig:perfCCP} \end{figure} \clearpage \subsection{Controlling possible attribute bias} It is known that those genes having high GC content or high length tends to be 'more sequenced' that those with lower GC/length values. Then, GC content and feature length can be considered as possible bias sources, in particular, when the feature is an exon or a transcript. In order to check it, \tarseq{} incorporates \Rfunction{biasExploration} that allows the inspection of the selected attribute ('coverage' in the example) in terms of groups or intervals of bias sources. To do this, the method defines four source groups according to the source's distribution quartiles and then asigns one group to each feature. After that boxplot and, optionally, density plot for each group are plotted in order to appreciate possible source bias. It can be performed for GC content, feature length, or any other metadata column. If the source is a catergorical variable, like pool or gene, each cathegory will be considered as a group. For instance, the next lines allows the exploration of GC content effect: <>= x11(type="cairo") biasExploration(myPanel, source="gc", dens=TRUE) @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.5]{gcExploration.pdf} \end{center} \caption{Exploration of amplicon coverage in each of amplicon GC content's quartiles.} \label{fig:gcExploration} \end{figure} \subsection{Controlling low counts features} Low counts features should be detected in early analysis stages. The \Rfunction{summaryIntervals} method builds a frequency table of the fetures that have its attribute value between predefined intervals. For example, if you are interested in explore the ``coverage'' intervals defined before, you could do:\\ <>= # summaryIntervals summaryIntervals(myPanel, attributeThres) @ The previous methods is also useful when you are interesting in quantifying how many features have its attribute value (\textit{coverage}) lower or higher than a threshold. In this example, we are interested in knowing how many amplicons have shown at least a coverage of 50, because we consider that this is a minimum value that we will admit. This is a typical aspect that will be analyzed when you do an experiment \textit{Quality Control}. Complementing this method, the \Rfunction{plotAttrPerform} method allows the graphical exploration of relative and cumulative feature frequencies in the predefined intervals. The corresponding function call is:\\ <>= plotAttrPerform(myPanel, attributeThres) @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.7]{coveragePerform.pdf} \end{center} \caption{Relative and cumulative amplicon frequencies in the specified intervals.} \label{fig:covPerf} \end{figure} In the cases in which several pools were used, both \Rfunction{summaryIntervals} and \Rfunction{plotAttrPerform} allows the exploration at pool level. Thus, differences among pool can be easily detected.\\ Another method that could help this analysis is \Rfunction{getLowCtsFeatures}. This method returns a \Rclass{data frame} object containing all the features that have its attribute value lower than a threshold. The output \Rclass{data frame} also contains panel and attribute information for each feature. For example, to known which are the genes that have a coverage value lower than 50, you can do:\\ <>= getLowCtsFeatures(myPanel, level="gene", threshold=50) @ In addition, if you want to known which amplicons have a coverage value lower than 50, you should execute:\\ <>= getLowCtsFeatures(myPanel, level="feature", threshold=50) @ Graphical methods were also implemented. The \Rfunction{plotGeneAttrPerFeat} allows the attribute value exploration for all the features of a selected gene. For instance, if you want to explore the ``gene4'', you should do:\\ <>= g<-plotGeneAttrPerFeat(myPanel, geneID="gene4") # adjust text size g<-g+theme(title=element_text(size=16), axis.title=element_text(size=16), legend.text=element_text(size=14)) g @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.5]{gene4AttrFeat.pdf} \end{center} \caption{Performance attribute exploration of the \textit{gene4}.} \label{fig:perfgene4} \end{figure} The attribute value for each feature contained in the ``gene4'' can be observed in Figure \ref{fig:perfgene4}. \subsection{Read counts exploration} \textit{Quality Control} in TS experiments implies the analysis of coverage/median counts achieved for each feature. But, sometimes, the exploration of the read profile that was obtained for a particular genomic region or a feature could be interesting. For this reason, the \tarseq{} \R{} package provides methods to help the exploration at a nucleotide resolution. Those methods are based on the \Rclass{data frame} returned by the \Rfunction{pileupCounts} method. This contains the read counts information for each nuecleotide of the features contained in the \bedfile. Note that the columns in the obtained \Rclass{data frame} could change, depending on the \Rfunarg{pileupP} parameter definition. In our case we are working with its default constructor and only the \Rfunarg{maxdepth} parameter was modified. For this reason, the resultant \Rclass{data frame} will contain one column for each nucleotide and one column (``-'') storing deletion counts.\\ \Rfunction{pileupCounts} is a function, not a \targetexperiment{} method, thus it could be called externally to the class. In order to call it, the specification of several parameters is needed: \begin{itemize} \item \Rfunarg{bed}: is a \Rclass{GRanges} object that, at least, should have values in the \Rfunarg{seqnames}, \Rfunarg{start} and \Rfunarg{end} slots. \item \Rfunarg{bamFile}: is a \Rclass{character} indicating the full path to the \bamfile. \item \Rfunarg{fastaFile}: is a \Rclass{character} indicating the full path to the \fastafile. \item \Rfunarg{scanBamP}: is a \Rclass{ScanBamParam} object, that specifies rules to scan the \Rclass{BamFile}. If it was not specified, its default constructor values will be used and then, the \Rfunarg{which} parameter will be specified using the \Rfunarg{bed} parameter. \item \Rfunarg{pileupP}: is a \Rclass{PileupParam} object, that specifies rules to build the \pileup{}, starting from the \Rclass{BamFile}. If it was not specified, the \Rfunarg{pileupP} parameter will be defined using the constructor default values. \item \Rfunarg{BPPARAM}: is a \Rclass{BiocParallelParam} instance defining the parallel back-end to be used during evaluation (see \citep{biocparallel}). \end{itemize} In order to work with the example data, it is necessary do: <>= # define function parameters bed<-getBedFile(myPanel) bamFile<-system.file("extdata", "mybam.bam", package="TarSeqQC", mustWork=TRUE) fastaFile<-system.file("extdata", "myfasta.fa", package="TarSeqQC", mustWork=TRUE) scanBamP<-getScanBamP(myPanel) pileupP<-getPileupP(myPanel) #call pileupCounts function myCounts<-pileupCounts(bed=bed, bamFile=bamFile, fastaFile=fastaFile, scanBamP=scanBamP, pileupP=pileupP, BPPARAM=BPPARAM) @ <>= data("myCounts", package="TarSeqQC") @ <>= head(myCounts) @ The obtained \Rclass{data frame} contains the \textit{pileup} information. It can be used to build a \textit{read profile} plot, in which the x axis represents the genomic position and the y axis, the obtained read counts. The \Rfunction{plotRegion} allows the read profile exploration for a specific genomic region. Helping the region definition, the \Rfunction{getRegion} method extracts the information for a genomic region. For example, the code below returns a \Rclass{data frame} with location information of ``gene7'' amplicons: <>= #complete information for gene7 getRegion(myPanel, level="gene", ID="gene7", collapse=FALSE) #summarized information for gene7 getRegion(myPanel, level="gene", ID="gene7", collapse=TRUE) @ Then, the previous information can be used to specify a genomic region and plot its read count profile, as: <>= g<-plotRegion(myPanel, region=c(4500,6800), seqname="chr10", SNPs=TRUE, xlab="", title="gene7 amplicons",size=0.5) x11(type="cairo") g @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.35]{gene7Region.pdf} \end{center} \caption{Read counts profile for the gene7 genomic region.} \label{fig:regiongene7} \end{figure} Other method, \Rfunction{plotFeature}, allows the read profile exploration of a particular feature. In this case, only is neccesary to specify the feature name that should be explored. For example in order to explore the ``AMPL20'' amplicon of the ``gene7'', the code is: <>= g<-plotFeature(myPanel, featureID="AMPL20") x11(type="cairo") g @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.4]{featuregene7.pdf} \end{center} \caption{Read counts profile for the "20" gene7 amplicon.} \label{fig:featgene7} \end{figure} As can be seen in both Figure \ref{fig:regiongene7} and Figure \ref{fig:featgene7}, the gray shadow corresponds to the total counts that were obtained at each genomic position insight the selected amplicon. The violet line indicates the read counts matching with the reference sequence. In order to distinguish how many read counts could correspond to a genomic variation, it is crucial that the \Rfunarg{pileupP} definition contains the \Rfunarg{distinguish nucleotide} parameter as TRUE, which is its default value. In the plots, the green, blue, red and brown lines illustrate the read counts that do not match with the reference and inform about a possible nucleotide variation. In the \ref{fig:featgene7} can be appreciated that the selected amplicon shows a variation changing the reference nucleotide for a ``T''. Remember that the presented analysis is only an exploratory tool . Thus, the possible nucleotide variations detected in this stage should be confirmed or rejected in more specific downstream analysis. \\ The proportion of read counts that match and no match against the reference can be known using the \Rfunction{plotNtdPercentage} method as: <>= g<-plotNtdPercentage(myPanel, featureID="AMPL20") g @ \begin{figure}[!htp] \begin{center} \includegraphics[scale=0.4]{NTDPercAMPL20.pdf} \end{center} \caption{Nucleotide percentages for each genomic position on the "AMPL20" gene7 amplicon.} \label{fig:ntd} \end{figure} In Figure \ref{fig:ntd} can be observed that between 4910 and 4920 positions there are a nucleotide showing differences between reference and read sequences. This information can be also extracted using the previous read counts \Rclass{data frame}, \Rfunarg{myCounts} and the \Rfunarg{featurePanel} slot of the \targetexperiment object. To do this only the feature name is neccesary: <>= getFeaturePanel(myPanel)["AMPL20"] @ Using this information<, you could select a subset in the \Rfunarg{myCounts} object containing only those nucleotide rows corresponding with the feature: <>= featureCounts<-myCounts[myCounts[, "seqnames"] =="chr10" & myCounts[,"pos"] >= 4866 & myCounts[,"pos"] <= 4928,] @ Then, the position having the lowest value in the ``='' column can be found. The position with the minimum value of read counts matching against the reference is the same position that has the higher variation: <>= featureCounts[which.min(featureCounts[,"="]),] @ As can be observed, in the position 4912 the reference genome indicates that there should be a ``G'', and the read counts indicate that in this position there is a ``T'', showing a possible nucleotide variation. \\ \section{Quality Control Report} The \tarseq{} \R{} package provides a method that generates an .xlsx report in which \textit{Quality Control} relevant information is contained. This file has three sheets. In the first, a summary is presented, containing the results of \Rfunction{summary} and \Rfunction{summaryIntervals} methods. This sheet also includes a plot characterizing the experiment. Any graphic could be choosen, but if its name is not specified, the method calls the \tarseq{} \Rfunction{plot} method to build it. The second and third sheets store the panel information at a gene and a feature level respectively. Only the information corresponding to the selected attribute will be stored. Then, if only the report generation is desired, the \Rfunction{buildReport} method can be called after the object construction. In the present example, the image file that should be included in the report and the report creation is specified, using the code below: <>= imageFile<-system.file("extdata", "plot.pdf", package="TarSeqQC", mustWork=TRUE) buildReport(ampliPanel, attributeThres, imageFile ,file="Results.xlsx") @ \section{Troubleshoot} Remember that all \targetexperiment{} methods that need read count information at a nucleotide level work over the \bedfile, \bamfile{} and the \fastafile. For this reason, in order to use some of them, please make sure that the corresponding \targetexperiment{} slots have the file names well defined. For example, if the \tarseq{} example data is loading as follows: <>= data(ampliPanel, package="TarSeqQC") ampliPanel @ and you want to explore a feature using the \Rfunction{plotFeature} method, the execution will cause an error because the method cannot find the files. \\ \begin{Schunk} \begin{Soutput} plotFeature(ampliPanel, featureID="AMPL1") [1] "The index of your BAM file doesn't exist" [1] "Building BAM file index" open: No such file or directory Error in FUN(X[[i]], ...) : failed to open SAM/BAM file file: './mybam.bam' \end{Soutput} \end{Schunk} To solve the previous error, the next code should be executed before: <>= setBamFile(ampliPanel)<-system.file("extdata", "mybam.bam", package="TarSeqQC", mustWork=TRUE) setFastaFile(ampliPanel)<-system.file("extdata", "myfasta.fa", package="TarSeqQC", mustWork=TRUE) @ and then: <>= plotFeature(ampliPanel, featureID="AMPL1") @ \clearpage \section*{Session Info} <>= sessionInfo() @ \clearpage \bibliographystyle{apalike} \bibliography{TarSeqQCvignette} \end{document}