%\VignetteIndexEntry{Rcade Vignette} %\VignetteDepends{} %\VignetteKeywords{Rcade} %\VignettePackage{Rcade} \documentclass{article} <>= BiocStyle::latex() @ \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{cite} \usepackage{Sweave} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in %\headheight=-.3in %\newcommand{\Rfunction}[1]{{\texttt{#1}}} %\newcommand{\Robject}[1]{{\texttt{#1}}} %\newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} %\newcommand{\Rclass}[1]{{\textit{#1}}} \SweaveOpts{keep.source=TRUE} \newcommand{\classdef}[1]{ {\em #1} } \begin{document} \title{\Rpackage{Rcade}: R-based analysis of ChIP-seq And Differential Expression data} \author{Jonathan Cairns} \maketitle \tableofcontents \section{Introduction} \Rpackage{Rcade} is a tool that analyses ChIP-seq data and couples the results to an existing Differential Expression (DE) analysis. A key application of \Rpackage{Rcade} is in inferring the direct targets of a transcription factor (TF) - these targets should exhibit TF binding activity, and their expression levels should change in response to a perturbation of the TF. The ChIP-seq analysis element of \Rpackage{Rcade} is performed through methods from the \Rpackage{baySeq} package, with respect to a user-defined universe of potential binding sites. This means that \Rpackage{Rcade} avoids the noise issues associated with peak-calling and focuses instead on robust quantification of binding activity. Any universe can be selected, and \Rpackage{Rcade} provides functionality to accommodate the common case where bins are defined relative to genomic annotation features. In some situations, it may be appropriate to define the binding site universe based on a set of peak-calls from other data sets. However, it is inappropriate to use peak-calls from the ChIP-seq data used in the \Rpackage{Rcade} analysis - such an analysis would be prone to confirmation bias. \Rpackage{Rcade} uses a fully Bayesian modelling approach. In particular, it uses log-odds values, or $B$-values, in both its input and output. The log-odds value is related to the posterior probability ($PP$) of an event, as per the formula $B = \log\left(\frac{PP}{1 - PP}\right)$. $PP$-values should not be confused with the frequentist concept of \emph{p}-values. %FAQs? TODO \section{Model} \Rpackage{Rcade} perform analysis on any set of genes, with each gene uniquely identified by a gene ID. In the below example, we use Ensembl gene IDs. Each gene is assumed to have some number of associated binding sites, each of which can be active or inactive as inferred from the ChIP-seq data. Additionally, every gene has one or more expression values, each of which is either DE or not DE under some perturbation - for example, knockdown or stimulation of a TF of interest. It is assumed that, conditional on a gene having both a ChIP-seq signal and a DE signature, the ChIP-seq and expression data associated with that gene are independent. All pairwise interactions between ChIP and DE data are considered for a given gene. The ``gene ID" need not literally refer to genes. For example, a user with transcript-specific expression data could use transcript IDs instead. %FIXME Transcript \section{Simple workflow} In this vignette, we will use the example data provided in the \Rpackage{Rcade} package. These data were obtained from two sources, each pertaining to the transcription factor STAT1. All of the experiments were performed with cells from the HeLa cell line. The two data sets are: \begin{description} \item[Differential Expression data] from Array Express, \\{\tt http://www.ebi.ac.uk/arrayexpress}, under accession number E-GEOD-11299. In this experiment, HeLa cells were stimulated with IFN$\gamma$, and a time course microarray experiment was performed. We have assessed DE status between the 0h and 6h time points. \item[STAT1 ChIP-seq data] from the Snyder lab, as part of the ENCODE consortium\\ %(\cite{Birney2007}). Input DCC accession numbers: wgEncodeEH000611 and wgEncodeEH000612\\ ChIP DCC accession number: wgEncodeEH000614\\ Here, HeLa cells were stimulated with IFN$\gamma$ for 30 minutes, then cells were harvested and STAT1 ChIP-seq was performed. \end{description} To keep the size of the package down, all of these files have been truncated. Thus, they only contain data pertinent to a handful of selected genes. %Instructions for recreating the full files from the original data sources are in the appendix. FIXME separate vignette? The location of these data files will vary from system to system. To find the data directory on your computer, use the following code: <>= dir <- file.path(system.file("extdata", package="Rcade"), "STAT1") dir @ \subsection{Preliminary} Load the \Rpackage{Rcade} package: <<>>= library(Rcade) @ To perform an \Rpackage{Rcade} analysis, you will need the following input data: \subsubsection{DE summary} \begin{description} \item[DE matrix:] Bayesian information about the DE status of each gene. Any Bayesian source can be used for this purpose: \begin{itemize} \item \Rpackage{limma}: Full DE results obtained with \Rfunction{topTable(..., number=Inf)} \item \Rpackage{baySeq}. %FIXME instructions for obtaining B values from baySeq \end{itemize} At time of writing, \Rpackage{edgeR} and \Rpackage{DEseq} do not supply Bayesian output. The DE data must contain the following fields: \begin{description} \item[geneID] - gene IDs used to link DE results to genes. \item[logFC] - The log fold change associated with each gene. \item[B] - B values (log-odds). \end{description} <<>>= DE <- read.csv(file.path(dir, "DE.csv")) @ \item[DElookup:] We also need to tell \Rpackage{Rcade} which columns of the DE matrix are important. This is done using an object of the form \Rfunction{list(RcadeField1=DEdataField1, ...)}. You may omit the name \Rfunction{RcadeField1} if it is the same as \Rfunction{DEdataField1}. Any fields that are specified in addition to the three required fields above will not be manipulated in the analysis, but will be carried through and appear in the output. <<>>= DElookup <- list(GeneID="ENSG", logFC="logFC", B="B", "Genes.Location", "Symbol") @ \end{description} \subsubsection{ChIP-seq data} \begin{description} \item[.bam and .bai files:] Sets of aligned reads. These reads should have already undergone sequence-level pre-processing, such as any read trimming and adaptor removal that may be required. Moreover, they should have appropriate index files - for example, using the \Rfunction{indexBam} function in the package \Rpackage{Rbamtools}. Example .bam and .bai files are provided in the package: <<>>= dir(dir, pattern = ".bam") @ \item[Targets information:] A matrix containing information about the .Bam files to be used in the analysis. Required fields: \begin{description} \item[fileID] -- ID associated with the file. \item[sampleID] -- ID associated with the sample that was sequenced. Technical replicates of the same population should have the same sampleID. \item[factor] -- The antibody used in the experiment. Control files should be labelled "Input". \item[filepath] -- The .bam file's name/path. \end{description} Optional fields: \begin{description} \item[shift] -- Half of the fragment length. That is, before counting, \Rpackage{Rcade} will shift reads on the positive strand forwards by \Rfunction{shift}, and reads on the negative strand backwards by \Rfunction{shift}. \end{description} <<>>= targets <- read.csv(file.path(dir, "targets.csv"), as.is = TRUE) targets @ \end{description} %FIXME Input -> control \subsubsection{Annotation information} In the ChIP-seq analysis, \Rpackage{Rcade} performs its analysis based on the counts in user-defined bin regions. These regions are specified with a \Robject{GRanges} object from the \Rpackage{GenomicRanges} package. A common requirement is to define bins about genomic annotation features: \Rpackage{Rcade} provides simple functionality to generate an appropriate \Robject{GRanges} object, through the \Rfunction{defineBins()} function. Since STAT1 is a promoter-bound TF, we define bins about Ensembl-derived transcription start sites. In this vignette, we use a very reduced annotation file as follows: %we're using a reduced version of the annotation, as follows. Do not use this in your own analysis! <<>>= anno <- read.csv(file.path(dir, "anno.csv")) anno <- anno[order(anno$chromosome_name),] colnames(anno) <- c("ENSG","chr","start","end","str") @ Only use the preceding code to recreate the analysis in this vignette -- do not use it at any other point! When analysing genome-wide data, use full annotation information -- for example, download full transcript annotation information from Ensembl using \Rpackage{biomaRt}, as follows: <>= library(biomaRt) anno <- getBM( attributes= c("ensembl_gene_id", "chromosome_name", "transcript_start", "transcript_end", "strand"), mart= useDataset("hsapiens_gene_ensembl", useMart("ensembl")) ) ##order, to reduce size of ChIPannoZones object later anno <- anno[order(anno$chromosome_name),] ##use appropriate column names colnames(anno) <- c("ENSG","chr","start","end","str") @ We define bins based on the annotation information through the \Rfunction{defineBins()} function, as follows: <<>>= ChIPannoZones <- defineBins(anno, zone=c(-1500, 1500), geneID="ENSG") @ The \Rfunarg{zone=c(-1500,1500)} argument defines the zone of interest: it starts 1500bp $5'$ of each TSS (\Rfunarg{-1500}), and ends 1500bp $3'$ of each TSS (\Rfunarg{1500}). This zone was chosen based on mapping peak-calls to TSSs using the \Rpackage{ChIPpeakAnno} package, and plotting the positional distribution of these peak-calls. %FIXME more detail The object \emph{ChIPannoZones} can now be used in the \Rpackage{Rcade} analysis. \subsubsection{Prior specification} By default, \Rpackage{Rcade}'s prior belief is that each gene's DE and ChIP-seq statuses are independent. This is unlikely to be true in real data, as genes with ChIP-seq signal are more likely to be DE than other genes. Thus, we should give \Rpackage{Rcade} appropriate priors, if possible. For example, the prior dependency can be specified as follows: <<>>= DE.prior = 0.01 prior.mode = "keepChIP" prior = c("D|C" = 0.05, "D|notC" = 0.005) @ We specify \Rfunarg{DE.prior = 0.01} because that is the prior probability used by \Rpackage{limma}, the package that our DE analysis was performed with. The remaining settings ensure that genes with ChIP-seq signal have a higher than average probability of DE (\Rfunarg{"D|C" = 0.05}), whereas genes without ChIP-seq signal have a lower that average probability of DE (\Rfunarg{"D|notC" = 0.005}). (The values \Rfunarg{0.05} and \Rfunarg{0.005} were selected arbitrarily, before analysis. An advanced user might select \Rfunarg{prior} in a less arbitrary manner -- for example, by looking at the overlap between ChIP-seq and DE in ``similar" datasets. We do not go into further details of such an analysis here.) If you do not supply this information, you may get unreasonably small $B$ values at the end of the analysis. \subsection{Analysis function} We can parallelize \Rpackage{Rcade} by supplying a compute cluster (though this is optional): <>= library(parallel) cl <- makeCluster(4, "SOCK") @ If you don't have a compute cluster or a multicore machine, you can disable parallel processing: <<>>= cl <- NULL @ We now process the data with the \Rfunction{RcadeAnalysis()} function, obtaining an \Rpackage{Rcade} object: <<>>= Rcade <- RcadeAnalysis(DE, ChIPannoZones, annoZoneGeneidName="ENSG", ChIPtargets=targets, ChIPfileDir = dir, cl=cl, DE.prior=DE.prior, prior.mode=prior.mode, prior=prior, DElookup=DElookup) Rcade @ The \Rpackage{Rcade} object stores information from the DE analysis: <<>>= x <- getDE(Rcade) @ It also contains the ChIP-seq analysis: <<>>= x <- getChIP(Rcade) @ and the Rcade table obtained from linking the ChIP-seq analysis with the DE data: <<>>= x <- getRcade(Rcade) @ \section{Plotting, QC} We can perform Principle Component Analysis on the counts with the \Rfunction{plotPCA} function. Usually, one would expect similar samples to cluster together on this plot, and so we can identify samples that do not fit this assumption - further investigation is required to determine the significance of such a finding. (Output is shown in Figure \ref{fig:P1}.) %FIXME choose components <>= plotPCA(Rcade) @ \begin{figure}[p] \begin{center} <>= <> @ \end{center} \caption{Output from \Rfunction{plotPCA(Rcade)}.} \label{fig:P1} \end{figure} The MM plot shows log-ratios from DE plotted against log-ratios from the ChIP-seq. The colour of each point corresponds to the probability of both ChIP and DE being present. (Output is shown in Figure \ref{fig:P2}.) <>= plotMM(Rcade) @ \begin{figure}[p] \begin{center} <>= <> @ \end{center} \caption{Output from \Rfunction{plotMM(Rcade)}.} \label{fig:P2} \end{figure} The function \Rfunction{plotBBB()} plots log-odds values for ChIP-seq, DE, and combined ChIP-seq/DE analysis together. The package \Rpackage{rgl} is required for this 3D plot. (Output is shown in Figure \ref{fig:P3}.) <>= library(rgl) plotBBB(Rcade) @ \begin{figure}[p] \begin{center} \includegraphics[width=\textwidth]{plotBBB.png} \end{center} \caption{Output from \Rfunction{plotBBB(Rcade)}, subsequently saved to disk with the command \Rfunction{rgl.snapshot(filename="plotBBB.png")}} \label{fig:P3} \end{figure} % - FIXME Read histogram in individual bins \section{Output} To export the Rcade results to disk in a user-friendly .csv format, use \Rfunction{exportRcade()} function as follows: <>= exportRcade(Rcade, directory="RcadeOutput") @ Usually, the file of interest is ``DEandChIP.csv", which contains the genes most likely to have both DE and ChIP signals. A full explanation of all of the files can be found in the help page: <>= ?exportRcade @ By default, Rcade outputs the top 1000 geneIDs for each hypothesis. To increase the number of geneIDs, use a larger value for cutoffArg: <>= exportRcade(Rcade, directory="RcadeOutput", cutoffArg=2000) @ Alternative cutoff methods are described in the \Rfunction{exportRcade} help page - for example, <>= exportRcade(Rcade, directory="RcadeOutput", cutoffMode="B", cutoffArg=0) @ % - TODO Extract sequences? (e.g. to FASTQ) %\section{Advanced} % - Recovering from failed analysis (TODO) % - Multiple epitopes (TODO) % - FIXME Fiddling with the prior? \section{Questions/Bug reports} If you have any questions about \Rpackage{Rcade}, or encounter a bug, please post a message on the Bioconductor support site at \url{https://support.bioconductor.org/}. This will help anybody who later searches for help on the same query. \section{Session Info} <<>>= sessionInfo() @ \section{Acknowledgements} Differential Expression data from Array Express, {\tt http://www.ebi.ac.uk/arrayexpress}, under accession number E-GEOD-11299. STAT1 ChIP-seq data from the Snyder lab, as part of the ENCODE consortium\\ %(\cite{Birney2007}). Input DCC accession numbers: wgEncodeEH000611 and wgEncodeEH000612\\ ChIP DCC accession number: wgEncodeEH000614 %\bibliographystyle{plainnat} %\bibliography{Rcade} FIXME \end{document}