\documentclass[10pt]{article} %\VignetteIndexEntry{TarSeqQC: Targeted Sequencing Experiment Quality Control} %\VignetteKeyword{targeted sequencing} %\VignetteKeyword{quality control} %\VignetteKeyword{exploration tool} %\VignettePackage{TarSeqQC} \usepackage{natbib} \usepackage{authblk} \usepackage{times} \usepackage[colorlinks=false, urlcolor=black, pdfborder={0 0 0}]{hyperref} \usepackage{amsmath} \usepackage{enumerate} \textwidth=6.5in \textheight=8.5in \parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \newcommand{\bedfile}{\textit{Bed File}} \newcommand{\bamfile}{\textit{BAM File}} \newcommand{\fastafile}{\textit{FASTA File}} \newcommand{\pileup}{\textit{pileup}} \newcommand{\targetexperiment}{\Rclass{TargetExperiment}} \newcommand{\tarseq}{\Rpackage{TarSeqQC}} \title{\Rpackage{TarSeqQC}: Targeted Sequencing Experiment Quality Control} \author[1]{Gabriela A Merino} \author[1]{Crist\'{o}bal Fresno} \author[1]{Elmer A Fern\'{a}ndez} \affil[1]{CONICET-Universidad Cat\'{o}lica de C\'{o}rdoba, Argentina} \date{\today} \begin{document} \SweaveOpts{concordance=TRUE} % \SweaveOpts{concordance=FALSE} \maketitle \begin{center} gmerino@bdmg.com.ar \end{center} \vspace{2cm} \begin{abstract} Targeted Sequencing experiments are a Next Generation Sequencing application, designed to explore a small group of specific genomic regions. The \tarseq{}$~$ package models this kind of experiments in R and its main goal is to allow the quality control and fast exploration over the experiment results. To do this, a new \R{} class, called \targetexperiment, was implemented. This class is based on the \bedfile, that characterize the experiment, the alignment \bamfile{} and the reference genome \fastafile. When the constructor is called, coverage and read count information are computed for the targeted sequences. After that, exploration and quality control could be carried out using graphical and numerical tools. Density, bar, read profile and box plots were implemented to achieve this task. A circular histogram plot was also implemented in order to summarize all experiment results. Coverage or median counts intervals can be defined and explored to further assist quality control analysis. Thus, library and pool preparation or sequencing errors could be easily detected. Finally, an .xlsx report containing quality control results can be built. \end{abstract} <>= options(prompt="> ", continue="+ ", width=78, useFancyQuotes=FALSE, digits=3) suppressMessages(library("TarSeqQC")) suppressMessages(library("BiocParallel")) @ \newpage \tableofcontents \newpage \section{Introduction} Next Generation Sequencing (NGS) technologies produce huge volume of sequence data at relative low cost. Among the different NGS applications, Targeted Sequencing (TS) allows the exploration of specific genomic regions, called \textit{features}, of a small group of genes \citep{metzker2010sequencing}. An ordinary application of TS is to detect Single Nucleotide Polimorphisms (SNPs) involved in several pathologies. Nowadays, \textit{TS cancer panels} are emerging as a new screening methodology to explore specific regions of a small number of genes known to be related to cancer. %like cancer. In this case, each feature corresponding %with a small group of genes related to a particular pathology. \\ In TS, specific regions of a DNA sample are copied and amplified by PCR. If a target region is too large, several primers can be used to read it. In addition, if the panel has a large number of interest genomic regions, different PCR pools could be required to achieve a good coverage. All fragments are sequenced in a NGS machine, generating millions of short sequence reads, but its throughput obviously is less than if the whole genome was sequenced. The reads are then aligned against a reference genome and, after that, downstream analysis could be performed. However, prior to further analysis, it is crucial to evaluate the run performance, as well as the experiment quality control, i.e., how well the features were sequenced, which feature and gene coverages were achieved, if some problems arise in the global setting or by specific PCR pools \citep{metzker2010sequencing}. \\ At present, several open access tools can be used to explore and control experiment results\citep{lee2012bioinformatics}. Those tools allow visualization and some level of read profiles quantification. But, they were %But they are %very powerful and computational intensive tools moreover if only a small %fraction of the genome is handled. developed as general purpose tools to cover a wide range of NGS applications, mainly for whole genome exploration. Consequently, they require great amount of computational resources and power. On the other hand, in TS only small group of regions are required to be explored and characterized in terms of coverage, as well as, the evaluation and comparison of pool efficiency. In this scenario, current genomic tools have became heavy and coarse for such amount of data. Consequently, the availability of light, fast and specific tools for TS data handling and visualization is a must in current labs.\\ % On the other hand, in TS it is required % to characterize and evaluate statistics such as the feature and gene % achieved coverage. If PCR pools were used, is also necessary an evaluation % and comparison of pool results.\\ Here we present \tarseq{} \R{} package, an exploration tool for fast visualization and quality control of TS experiments. Its use is not restricted to TS and can also be used to analyze data from others NGS applications in which \textit{feature-gene} structure could be defined, like exons or isoforms in RNA-seq and amplicon in DNA-seq.\\ This vignette intends to guide through to the use of the \tarseq{} \R{} \Bioconductor{} package. First, the input data format is described. Then, we show how to build an instance of the \targetexperiment{} class. After that, we will graphically explore the results and do the quality control over the sequenced features. Finally, we will build an .xlsx report that summarize the analysis above. \section{The \targetexperiment{} class} \tarseq{} \R{} package is based on the \targetexperiment{} class. The Figure \ref{fig:class} shows the \targetexperiment{} class structure.\\ \begin{figure}[!hbp] \begin{center} \includegraphics[scale=0.15]{class.pdf} \end{center} \caption{\targetexperiment{} class diagram.} \label{fig:class} \end{figure} The \targetexperiment{} class has nine slots: \begin{itemize} \item \Rfunarg{bedFile}: a \Rclass{GRanges} object that models the \bedfile{} \item \Rfunarg{bamFile}: a \Rclass{BamFile} object that is a reference to the \bamfile{}. \item \Rfunarg{fastaFile}: a \Rclass{FaFile} object that is a reference to the reference sequence. \item \Rfunarg{featurePanel}: a \Rclass{GRanges} object that models the feature panel and related statistics. \item \Rfunarg{genePanel}: a \Rclass{GRanges} object that models the analyzed panel and related statistics at a gene level. \item \Rfunarg{scanBamP}: a \Rclass{ScanBamParam} containing the information to scan the \bamfile. \item \Rfunarg{pileupP}: a \Rclass{PileupParam} containing the information to build the pileup matrix. \item \Rfunarg{attribute}: a \Rclass{character} indicating which attribute \textit{coverage} or \textit{medianCounts} will be used to the analysis. \item \Rfunarg{feature}: a \Rclass{character} indicating the name of the analyzed features, e.g.: ``amplicon'', ``exon'', ``transcript''. \end{itemize} The next sections will illustrate how the \targetexperiment{} methods can be used. For illustrate this, the \tarseq{} \R{} package provides a \bedfile, a \bamfile, a \fastafile{} and a dataset that stores the \targetexperiment{} object built with those. This example case is based on a synthetic amplicon sequencing experiment containing 29 \textit{amplicons} of 8 genes in 4 chromosomes. \\ %\clearpage \section{Input Data} A TS experiment is characterized by the presence of a \bedfile{} which defines the \textit{features} that should be sequenced. The \tarseq{} package follows this architecture, where the \bedfile{} is the key data of the experiment. However, \tarseq{} also requires mainly three pieces of information that should be provided in order to call the \targetexperiment{} constructor. The \bedfile{}, the \bamfile, that contains the obtained alignment for the sequenced reads, and the sequence \fastafile. The complete path to these files should be defined when the \targetexperiment{} constructor is called. \\ Other parameters can also be specified in the \targetexperiment{} object constructor. The \Rfunarg{scanBamP} and \Rfunarg{pileupP} are instances of the \Rclass{ScanBamParam} and \Rclass{PileupParam} classes defined in the \Rpackage{RSamtools} \R{} \Bioconductor{} package \citep{rsamtools}. These parameters specify how to scan the \bamfile and how to build the corresponding \textit{pileup}, that will be used for exploration and quality control. The \Rfunarg{scanBamP} allows to specify the features of interest contained in the \bedfile, acording to \cite{rsamtools} specifications. The \Rfunarg{pileupP} establishes what information should be contained in the pileup matrix, for instance, if nucleotides and/or strand should be distinguished. If these two parameters are not specified, the deafult values of their constructors will be used. In addition, \Rfunarg{feature} and \Rfunarg{attribute} are other important parameters that should be specified in order to conduct the \textit{Quality Control}. The first is a character that determines which kind of features are contained in the \bedfile{}. In the example presented here, \textit{amplicon} is the feature type. The second parameter, \Rfunarg{attribute}, can be \textit{coverage} or \textit{medianCounts} defining which meassures will be considered in the \textit{Quality Control} analysis. \subsection{Bed File} The \bedfile{} is stored as a \targetexperiment{} slot and is modeled as a \Rclass{GRanges} object (\cite{genomicranges}). The \bedfile{} must be a tabular file in which each row represents one sequenced feature. This file should contain at least ``chr'', ``start'', ``end'' , ``name'' and ``gene'' columns. Additional columns like ``strand'' or another experimental information, could be included and would be conserved. For example, in some experiments, more than one PCR pool is neccesary. In this case, the \bedfile{} must also contain a ``pool'' column specifying in which of these pools each feature was defined. This information is an imperative requisite to evaluate the performance of each PCR pool. \\ A \Rclass{GRanges} object represents a collection of genomic features each having a single start and end location on the genome \citep{genomicranges}. In order to use it to model the \bedfile, the mandatory fields ``chr'', ``start'' and ``end'' will be used to define the ``seqnames'', ``start'' and ``end'' \Rclass{GRanges} slots. The same will occur if the optional field ``strand'' is included in the \bedfile{}. The ``name'' column will be setted as ranges identifiers. Finally, ``gene'' and additional columns like ``pool'', will be stored as metadata columns. \\ In order to create a \targetexperiment{} object, the complete route to the \bedfile{} and its name must be specified as a \Rclass{character} \R{} object. Thus, to use the example \bedfile{} provided by \tarseq: <<>>= bedFile<-system.file("extdata", "mybed.bed", package="TarSeqQC", mustWork=TRUE) @ Note that any experiment, in which can be defined \textbf{\textit{feature-gene}} relations, could be analyzed using the \tarseq{} \R{} \Bioconductor{} package. For instance, if you have an RNA-seq experiment and you are interested in exploring some genes, you could build your customized \bedfile{} in which the \textit{feature} could be ``exon'' or ``transcript''. \subsection{BAM File} The \bamfile{} stores the alignment results \citep{li2009sequence}. In this example case, it corresponds to the amplicon sequencing experiment alignment. This file will be used to build the \textit{pileup} for the selected features in which quality control is based. Briefly, a \textit{pileup} is a matrix in which each row represents a genomic position and have at least three columns: ``pos'', ``chr'' and ``counts''. The first and second columns specify the genomic position and ``counts'' contains the total read counts for this position. Pileup matrix could contains four additional columns that store the read counts for each nucleotide at this position. \\ In order to call the \targetexperiment{} constructor, the complete route to the \bamfile{} and its name must be specified as a \Rclass{character} \R{} object. For example, we can define it in order to use \targetexperiment{} external data: <<>>= bamFile<-system.file("extdata", "mybam.bam", package="TarSeqQC", mustWork=TRUE) @ When the \targetexperiment{} constructor is called the \bamfile{}, will be stored as a \Rclass{BamFile} object \citep{rsamtools} and this object will be a \Rclass{TargetExperiment} slot. \subsection{FASTA File} The \fastafile{} contains the reference sequence previously used to align the \bamfile{} and will be used to extract the sequences for the selected features. This information is useful to compare the pileup results with the reference, in order to detect \textit{nucleotide variants}. To create a \targetexperiment{} object, the full path to the \fastafile{} and its name must be specified as a character \R{} object. For example:\\ <<>>= fastaFile<-system.file("extdata", "myfasta.fa", package="TarSeqQC", mustWork=TRUE) @ The \fastafile{} will be stored as a \Rclass{FaFile} object (\cite{rsamtools}) and this object will be setted as a \Rclass{TargetExperiment} slot. \subsection{Additional input data} The previous files are mandatories to call the \targetexperiment{} constructor. Additional parameters can be set in order to apply several methods, perform the quality control and results exploration. These parameters are: \begin{itemize} \item \Rfunarg{scanBamP}: is a \Rclass{ScanBamParam} object, that specifies rules to scan a \Rclass{BamFile} object. For example, if you wish only keep those reads that were properly paired, or those that have a specific Cigar code, \Rfunarg{scanBamP} can be used to specify it. In TS experiments, we want to analyze only the features. The way to specify this is using the \Rfunarg{which} parameter in the \textit{scanBamP} constructor. If the \Rfunarg{scanBamP} parameter was not specified in the \targetexperiment{} constructor calling, its default value will be used and then, the \Rfunarg{which} parameter will be specified using the \bedfile{}. \item \Rfunarg{pileupP}: is a \Rclass{PileupParam} object, that specifies rules to build the \pileup{} starting from a \Rclass{BamFile}. You can use the \Rfunarg{pileupP} parameter to specify if you want to distinguish between nucleotides and or strands, filter low read quality or low mapping quality bases. If the \Rfunarg{pileupP} parameter is not specified, its default value will be used. \item \Rfunarg{attribute}: is a \Rclass{character} that specifies which attribute must be used for the results exploration and quality control. The user can choice between \textit{medianCounts} or \textit{coverage}. If the \textit{attribute} parameter is not specified in the \targetexperiment{} constructor, it will be setted as ``''. But, prior to perform some exploration or control, this argument must be set using the \Rmethod{setAttribute()} method. \item \Rfunarg{feature}: is a \Rclass{character} that defines what means a \textit{feature}. In this vignette a little example using an synthetic amplicon targeted sequencing experiment is shown, thus the feature means an \textit{amplicon}. But, the use of \tarseq{} \R{} package is not restricted to analyze only this kind of experiments. If you don't specify the \Rfunarg{feature} parameter, it will be setted as ``''. But, the same in \Rfunarg{attribute} parameter, it must be set prior to perform some exploration or control. It can be done using the \Rmethod{setFeature()} method. \item \Rfunarg{BPPARAM}: is a \Rclass{BiocParallelParam} instance defining the parallel back-end to be used during evaluation (see \citep{biocparallel}). It allows the specification of how many \Rfunarg{workers} (cpus) will be used, etc. \end{itemize} For more information about \Rclass{ScanBamParam} and \Rclass{PileupParam} constructors see \Rpackage{Rsamtools} manual. \section{Creating a \targetexperiment{} object} \subsection{Constructor call} Once you have defined the input data presented above, the \targetexperiment{} constructor could be called using:\\ <>= BPPARAM<-bpparam() myPanel<-TargetExperiment(bedFile, bamFile, fastaFile, feature="amplicon", attribute="coverage", BPPARAM=BPPARAM) @ <>= BPPARAM<-bpparam() data(ampliPanel, package="TarSeqQC") myPanel<-ampliPanel rm(ampliPanel) setBamFile(myPanel)<-system.file("extdata", "mybam.bam", package="TarSeqQC", mustWork=TRUE) setFastaFile(myPanel)<-system.file("extdata", "myfasta.fa", package="TarSeqQC", mustWork=TRUE) @ When \Rfunction(TargetExperiment) is called, some \targetexperiment{} methods are invoked in order to define two of the \targetexperiment{} slots. First, the \Rfunction{buildFeaturePanel} is internally used in order to build the \Rfunarg{featurePanel} slot. This method calls the \Rfunction{pileupCounts} function to build the pileup matrix. Then, the \Rfunction{summarizePanel} is invoked in order to build the \Rfunarg{genePanel} slot. \\ In the previous example, were defined the \Rfunarg{feature} and \Rfunarg{attribute} parameter values. If you don't do this, you can create the \targetexperiment{} object but a warning message will be printed. Then, you can use the \Rfunction{setFeature} and \Rfunction{setAttribute} methods to set these values. For example:\\ <>= # set feature slot value setFeature(myPanel)<-"amplicon" # set attribute slot value setAttribute(myPanel)<-"coverage" @ As we mentioned before, when the \Rfunarg{scanBamP} and \Rfunarg{pileupP} are not specified in the constructor call, they assume their default constructor. But, you could specify those after the constructor call, using \Rfunarg{setScanBamP} and \Rfunarg{setPileupP}. \\ <>= # set scanBamP slot value scanBamP<-ScanBamParam() #set which slot bamWhich(scanBamP)<-getBedFile(myPanel) setScanBamP(myPanel)<-scanBamP # set attribute slot value setPileupP(myPanel)<-PileupParam(max_depth=1000) # build the featurePanel again setFeaturePanel(myPanel)<-buildFeaturePanel(myPanel, BPPARAM) # build the genePanel again setGenePanel(myPanel)<-summarizePanel(myPanel, BPPARAM) @ Note that the previous code specifies that the maximum read depth can be 1000. If you have some genomic positions that has more than 1000 reads, they will not be computed. On the other hand, if you do any change in the \Rfunarg{scanBamP} and/or \Rfunarg{pileupP} slots you will need set the \Rfunarg{featurePanel} and the \Rfunarg{genePanel} slots again.\\ The \tarseq{} \R{} package provides a dataset that stores the \targetexperiment{} object built with the previous files. In order to use it, you can do: <>= data(ampliPanel, package="TarSeqQC") @ The loaded object is called \textit{ampliPanel}. If you want to use it, you need to re-define the \bamfile{} and \fastafile{} path files. In order to do this, you can use: <>= # Defining bam file and fasta file names and paths setBamFile(ampliPanel)<-system.file("extdata", "mybam.bam", package="TarSeqQC", mustWork=TRUE) setFastaFile(ampliPanel)<-system.file("extdata", "myfasta.fa", package="TarSeqQC", mustWork=TRUE) @ Note that \Rfunarg{featurePanel} and \Rfunarg{genePanel} do not need to be rebuilt. The redefinition file names is neccesary in order to use \targetexperiment{} methods that query this files.\\ \subsection{Early exploration} The \targetexperiment{} class has typical \Rfunction{show/print} and \Rfunction{summary} \R{} methods implemented. In addition, the \Rfunction{summaryGeneLev} and \Rfunction{summaryFeatureLev} methods allow the summary exploration at ``gene'' and ``feature'' level. The next example illustrates how do you call these methods: <>= # show/print myPanel # summary summary(myPanel) #summary at feature level summaryFeatureLev(myPanel) #summary at gene level summaryGeneLev(myPanel) @ Using those methods you can easily find, for example, that in average all amplicons were sequenced at a coverage of 256. You can also see that there is at least one amplicon that was not read. This is because the minimum value of the attribute (\textit{coverage}) is 0. In order to complement this analysis, you could explore the attribute distribution using:\\ <>= g<-plotAttrExpl(myPanel,level="feature",join=TRUE, log=FALSE, color="blue") x11(type="cairo"); g @ \begin{figure}[!htp] \begin{center} \includegraphics{attrExpl.pdf} \end{center} \caption{Attribute distribution and density plots.} \label{attrexpl} \end{figure} \par In the Figure \ref{attrexpl}, the \Rfunarg{join} parameter was set as 'TRUE'. If it is set as 'FALSE', the figure will contain the attribute box-plot on the left and the corresponding attribute density plot on the right. \\ \section{Deep exploration and Quality Control} \subsection{Panel overview} When you are working with a TS experiment, it is interesting to simultaneously evaluate the performance of all the features. In addition, if you have prefixed attribute intervals, it could be important to compare features according to them. For example, five coverage intervals can be defined according to the Table \ref{table:featureInterval}.\\ \begin{table}[ht] \caption{Coverage intervals} % title name of the table \centering % centering table \begin{tabular}{| l | l | } % creating 10 columns \hline \textbf{Coverage Interval} & \textbf{Motivation} \\ \hline $[0,1)$ & \textit{Not sequenced}\\ [0.5ex] \hline $[1,50)$ & \textit{Low sequencing coverage}\\ [0.5ex] \hline $[50,200)$ & \textit{Regular sequencing coverage}\\ [0.5ex] \hline $[200,500)$ & \textit{Very good sequencing coverage}\\ [0.5ex] \hline $[500,Inf)$ & \textit{Excellent sequencing coverage}\\ [0.5ex] \hline \end{tabular} \label{table:featureInterval} \end{table} Then, these coverage intervals could be incorporated into the analysis. To do this, the \tarseq{} \R{} package needs an interval extreme definitions: <>= # definition of the interval extreme values attributeThres<-c(0,1,50,200,500, Inf) @ %In order to facilitate the object exploration and help the %\textit{Quality Control} process, additional methods were developed. A panel results overview is critical in order to compare and integrate it. To help this task, we have implemented the \Rfunction{plot} method. This is a graphical tool consisting in a polar histogram, in which each gene is represented as a bar. Each bar is colored depending the percentage of features that have their attribute value in a particular prefixed interval. In addition, the bars (genes) can be grouped in chromosomes in order to facilitate the comparisson at this level. \\ To build this plot, you can do: <>= # plot panel overview g<-plot(myPanel, attributeThres, chrLabels =TRUE) g @ \begin{figure}[!htp] \begin{center} \includegraphics{plot.pdf} \end{center} \caption{Panel overview plot.} \label{fig:plot} \end{figure} \par In the example presented here, we can easily distinguish that the unique amplicon of the ``gene3'' was not sequenced. This is because in the Figure \ref{fig:plot}, the bar corresponding to ``gene3'' is colored in red and this color is related to the [0,1) coverage interval. In the same plot, we can also appreciate that this gene has only one amplicon, as depicted in parenthesis in the bar label \textit{``gene3(1)''}. Also it is possible to note that 40\% of ``gene4'' amplicons has a coverage between 1 and 50. Note that this gene have five amplicons, then the 40 \% corresponds to 2 amplicon. Another 20 \% (1 amplicon) has a coverage value between 50 and 200, other one ``gene4'' amplicon have a very good coverage value, it means, between 200 and 500 and the other amplicon have an excellent coverage higher than 500. \\ It is important to note that a small and simple example is presented here. The prevoious plot could have a grater impact when you have more features and genes. Figure \ref{fig:plotCPP} contains the panel overview for a TS Experiment based on the \textit{Ion AmpliSeq Cancer Panel Primer Pool}. This is a TS Panel offered by \textit{Life Technologies} (\cite{CPP}) that allows to explorae 190 amplicons. In this case, you can easily observe that ``MLH1'' and ``CDKN2A'' genes were no sequenced. You can also appreciate that several genes like ``ALK'', ``VHL'', ``AKT1'', ``ARBB2'', among others, have more uniforme coverage values along their amplicons. On the contrary, ``KDR'' and ``PTEN'' genes have some amplicons not sequenced and some other with a high coverage. \begin{figure}[!htp] \begin{center} \includegraphics{plotCPP.pdf} \end{center} \caption{Cancer Panel Primer Pool overview plot.} \label{fig:plotCPP} \end{figure} Complementing the previous plot, \Rfunction{plotFeatPerform} illustrates a similar graphic where the bars are distributed along the x-axis. In order to expand the polar histogram shown in the Figure \ref{fig:plot}, the parameter \Rfunarg{complete} is included. If you set it as TRUE, the resultant plot will contain two graphics. The upper panel is a bar plot at feature level, and the lower, at a gene level. Both graphics incorporate the prefixed attribute intervals information and contain a red line to indicate the mean value of the attribute at the corresponding level. In our example, you could run:\\ <>= # plot panel overview g<-plotFeatPerform(myPanel, attributeThres, complete=TRUE, log=FALSE, featureLabs=TRUE, sepChr=TRUE, legend=TRUE) g @ \begin{figure}[!htp] \begin{center} \includegraphics{featPerf.pdf} \end{center} \caption{Amplicon coverage performance. The upper panel is a bar plot at feature level, and the lower, at gene level.} \label{fig:perform} \end{figure} In Figure \ref{fig:perform} we could evaluate the coverage value for each amplicon and gene. We can observe that when coverage is summarized at gene level the highest value is lower than 500. However, at amplicon level, the highest value is grater than 800. \\ The previous plot is also very useful when we are working with panels made-up by several primer pools combination. For example, the \textit{Comprehensive Cancer Panel} is another \textit{Life Tachnologies} panel that allows the exploration of 16000 amplicons from 409 genes related to several cancer types using 4 primer pools (\cite{CCP}). In this case, the \bedfile{} contains a ``pool'' column that stores the number pool for each feature. This information will be conserved in the \targetexperiment{} object built from this panel.\\ In the \textit{Quality Control} context, it is so important to evaluate in early analysis stages if some pool effect exists and if all pool results are comparable. Naturally, the \tarseq{} \R{} package uses this information to assist the user. For example, the Figure \ref{fig:perfCCP} illustrates the use of the \Rfunction{plotFeatPerform} in the described case. Now, you can see that the graphic corresponding to the amplicon level shows a separation between amplicons according to its pool value. Note that the same plot at a gene level is not showed because the \Rfunarg{complete} parameter was set too 'FALSE'. It is important to emphasize that, if correspond, the pool information will be included in all methods of the \targetexperiment{} class. Thus, for example, when you call the \Rfunction{summary} function for a \targetexperiment{} object that has pool information, the output will contain statistic results for the amplicon level and for each pool separately. \begin{figure}[!htp] \begin{center} \includegraphics{perfExplCCP.pdf} \end{center} \caption{Performance exploration of an Ion AmpliSeq Comprehensive Cancer Panel experiment.} \label{fig:perfCCP} \end{figure} \clearpage \subsection{Controlling low counts features} Low counts features should be detected in early analysis stages. The \Rfunction{summaryIntervals} method builds a frequency table of the fetures that have its attribute value between predefined intervals. For example, if you are interested in explore the ``coverage'' intervals defined before, you could do:\\ <>= # summaryIntervals summaryIntervals(myPanel, attributeThres) @ The previous methods is also useful when you are interesting in quantifying how many features have at least it attribute value (\textit{coverage}) lower or higher than a threshold. In this example, you could be interested in knowing how many amplicons have shown at least a coverage of 50, because you consider that this is a minimum value that you will admit. This is a typical aspect that you will explore when you do an experiment \textit{Quality Control}.\\ Another method that could help you is \Rfunction{getLowCtsFeatures}. This method returns a \Rclass{data.frame} object that contains all the features that have its attribute value lower than a threshold. The output \Rclass{data.frame} also contains the panel and attribute information for each feature. For example, if you want to known which are the genes that have a coverage value lower than 50, you can do:\\ <>= getLowCtsFeatures(myPanel, level="gene", threshold=50) @ In addition, if you want to known which amplicons have a coverage value lower than 50, you should execute:\\ <>= getLowCtsFeatures(myPanel, level="feature", threshold=50) @ Graphical methods were also implemented. The \Rfunction{plotGeneAttrPerFeat} allows the attribute value exploration for all the features of a selected gene. For instance, if you want to explore the ``gene4'', you should do:\\ <>= g<-plotGeneAttrPerFeat(myPanel, geneID="gene4") # adjust text size g<-g+theme(title=element_text(size=16), axis.title=element_text(size=16), legend.text=element_text(size=14)) g @ \begin{figure}[!htp] \begin{center} \includegraphics{gene4AttrFeat.pdf} \end{center} \caption{Performance attribute exploration of the \textit{gene4}.} \label{fig:perfgene4} \end{figure} In Figure \ref{fig:perfgene4} you can observe the attribute value for each feature contained in the ``gene4'' gene. \subsection{Read counts exploration} When you are doing a \textit{Quality Control} the analysis of coverage/median counts achieved for each feature is important. But, sometimes could be interesting in exploring the read profile obtained for a particular genomic region or a feature. For this reason, the \tarseq{} \R{} package provides methods to help the exploration at a nucleotide resolution. \\ Remember that when the \Rfunarg{featurePanel} slot of a \targetexperiment{} object is built, the \Rfunction{pileupCounts} function is called. This and the \Rfunction{buildFeaturePanel} method are responsible for the pileup construction, read counts obtainment, and \textit{coverage}/\textit{medianCounts} calculation. Even though these functions are internally invoked by the \targetexperiment{} constructor, if you wish, you can call them. In particular, when \Rfunction{pileupCounts} is invoked, you will obtain a data.frame that contains the read counts information for each position that was specified in the \bedfile. In addition, this is a function, not a \targetexperiment{} method that could be called externally to the class. Note that the columns in the obtained object could change, depending on the \Rfunarg{pileupP} parameter definition. In our case we are working with its default constructor and only the \Rfunarg{maxdepth} parameter was modified. For this reason, the resultant \Rclass{data.frame} will contain one column for each nucleotide and one column (``-'') storing deletion counts.\\ In order to call the \Rfunction{pileupCounts} function is neccesary to specify several parameters: \begin{itemize} \item \Rfunarg{bed}: is a \Rclass{GRanges} object that, at least, should have values in the \Rfunarg{seqnames}, \Rfunarg{start} and \Rfunarg{end} slots. \item \Rfunarg{bamFile}: is a \Rclass{character} indicating the full path to the BAM file. \item \Rfunarg{fastaFile}: is a \Rclass{character} indicating the full path to the FASTA file. \item \Rfunarg{scanBamP}: is a \Rclass{ScanBamParam} object, that specifies rules to scan a \Rclass{BamFile} object. If it was not specified, its default value will be used and then, the \Rfunarg{which} parameter will be specified using the \bedfile{}. \item \Rfunarg{pileupP}: is a \Rclass{PileupParam} object, that specifies rules to build the \pileup{}, starting from a \Rclass{BamFile}. If it was not specified, the \Rfunarg{pileupP} parameter will be defined using the constructor default values. \end{itemize} In our case, to work with the example data, you could do: <>= # define function parameters bed<-getBedFile(myPanel) bamFile<-system.file("extdata", "mybam.bam", package="TarSeqQC", mustWork=TRUE) fastaFile<-system.file("extdata", "myfasta.fa", package="TarSeqQC", mustWork=TRUE) scanBamP<-getScanBamP(myPanel) pileupP<-getPileupP(myPanel) #call pileupCounts function myCounts<-pileupCounts(bed=bed, bamFile=bamFile, fastaFile=fastaFile, scanBamP=scanBamP, pileupP=pileupP, BPPARAM=BPPARAM) head(myCounts) @ Using the obtained read count information it is possible to build a \textit{read profile} plot, in which the x axis represents the genomic position and the y axis, the obtained read counts. It is also important to distinguish how many read counts correspond to the reference nucleotide and how many could correspond to a genomic variation. The \Rfunction{plotRegion} allows the read profile exploration for a specific genomic region. Helping the region definition, the \Rfunction{getRegion} method extracts the information for a genomic region. For example: <>= #complete information for gene7 getRegion(myPanel, level="gene", ID="gene7", collapse=FALSE) #summarized information for gene7 getRegion(myPanel, level="gene", ID="gene7", collapse=TRUE) @ Then, you could use the previous information to specify a genomic region, as: <>= g<-plotRegion(myPanel, region=c(4500,6800), seqname="chr10", SNPs=TRUE, xlab="", title="gene7 amplicons",size=0.5) x11(type="cairo") g @ \begin{figure}[!htp] \begin{center} \includegraphics{gene7Region.pdf} \end{center} \caption{Read counts profile for the gene7 genomic region.} \label{fig:regiongene7} \end{figure} The \Rfunction{plotFeature} allows the read profile exploration of a particular feature. For example if we wish to explore the ``AMPL20'' amplicon of the ``gene7'', we should do: <>= g<-plotFeature(myPanel, featureID="AMPL20") x11(type="cairo") g @ \begin{figure}[!htp] \begin{center} \includegraphics{featuregene7.pdf} \end{center} \caption{Read counts profile for the "20" gene7 amplicon.} \label{fig:featgene7} \end{figure} As you can see in Figure \ref{fig:featgene7}, the gray shadow correspond to the total counts that were obtained at each genomic position insight the selected amplicon. The violet line indicates the read counts matching with the reference sequence. The green, blue, red and brown lines illustrate the read counts that do not match with the reference and inform about the detected nucleotide variation. In this example, the selected amplicon show a variation that change the reference nucleotide for a ``T''. If you wan to know exactly the proportion of read counts that match and no match against the reference, you can use the \Rfunction{plotNtdPercentage} as: <>= g<-plotNtdPercentage(myPanel, featureID="AMPL20") g @ \begin{figure}[!htp] \begin{center} \includegraphics{NTDPercAMPL20.pdf} \end{center} \caption{Nucleotide percentages for each genomic position on the "AMPL20" gene7 amplicon.} \label{fig:ntd} \end{figure} In Figure \ref{fig:ntd} you can observe that in the position 4912 of the reference genome indicates that there should be a ``G'' and the read counts indicate that in this position is a ``T''. You could also extract this information using the previous read counts \Rclass{data.frame} \Rfunarg{myCounts}. For this, remember that the \Rfunarg{featurePanel} slot is a \Rclass{GRanges} object. Then, you could subset this directly using the feature name: <>= getFeaturePanel(myPanel)["AMPL20"] @ Using this information, you could subset in \Rfunarg{myCounts} object only those rows that corresponding with the feature: <>= featureCounts<-myCounts[myCounts[, "seqnames"] =="chr10" & myCounts[,"pos"] >= 4866 & myCounts[,"pos"] <= 4928,] @ Then, you could find which position have the lowest value in the ``='' column. It means, the minimum value of read counts matching against the reference: <>= featureCounts[which.min(featureCounts[,"="]),] @ \section{Quality Control Report} The \tarseq{} \R{} package provides a method that generates an .xlsx report in which \textit{Quality Control} relevant information is contained. This file has three sheets. In the first, a summary is presented, containing the results of \Rfunction{summary} and \Rfunction{summaryIntervals} methods. This sheet also includes a plot that characterize the experiment. You could choose any graphic, but if you don not specify its name, the method calls the \Rfunction{plot} \tarseq{} method to build it. The second and third sheets store the panel information at a gene and a feature level respectively. Only the information corresponding to the selected attribute will be stored. Then, if you only want to generate the report, you could call the \Rfunction{buildReport} after the object construction. In our case, we want to specify the image file that we want to include in the report, to do this, we should do: <>= imageFile<-system.file("extdata", "plot.pdf", package="TarSeqQC", mustWork=TRUE) buildReport(ampliPanel, attributeThres, imageFile ,file="Results.xlsx") @ \section{Troubleshoot} Remember that all \targetexperiment{} methods that need read count information at a nucleotide level work over the \bedfile, \bamfile{} and the \fastafile. For this reason, if you use some of them, please make sure that the corresponding \targetexperiment{} slots have the file names well defined. For example, if you wish loading the \tarseq{} example data, you can do: <>= data(ampliPanel, package="TarSeqQC") ampliPanel @ But, if you want to re-built the \Rfunarg{featurePanel} slot, the \Rfunction{pileupCounts} execution will cause an error because the method cannot find the files. \\ \begin{Schunk} \begin{Soutput} buildFeaturePanel(ampliPanel) [1] "The index of your BAM file doesn't exist" [1] "Building BAM file index" open: No such file or directory Error in FUN(X[[i]], ...) : failed to open SAM/BAM file file: './mybam.bam' \end{Soutput} \end{Schunk} To solve the previous error, you should do: <>= setBamFile(ampliPanel)<-system.file("extdata", "mybam.bam", package="TarSeqQC", mustWork=TRUE) setFastaFile(ampliPanel)<-system.file("extdata", "myfasta.fa", package="TarSeqQC", mustWork=TRUE) @ and then: <>= setFeaturePanel<-buildFeaturePanel(ampliPanel) @ \clearpage \section*{Session Info} <>= sessionInfo() @ \clearpage \bibliographystyle{apalike} \bibliography{TarSeqQCvignette} \end{document}