--- title: "The ChIPanalyser User's Guide" author: "Patrick Martin" date: "22/08/2017" output: pdf_document: null keep_tex: yes number_sections: no html_document: default package: '`ChIPanalyser''' vignette: | %\VignetteIndexEntry{ChIPanalyser User's Guide} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- # Introduction Transcriptional regulation is undeniably a key aspect of cellular homeostasis. It comes to no surprise that modern molecular biology and genomics have showed a keen interest in the subject. Transcription factors (TF) are a force to be reckoned with in the world of transcriptional regulation. Transcription factors are proteins that bind to DNA in a site-specific manner. Experimentally, this binding site can be determined by various methods such as SELEX-seq, EMSA or DNAse footprinting. The final result will be a sequence to which a given TF will bind preferentially. In many case, these results are presented in the form of a Position Frequency Matrix or Position Weight Matrix. However at a genome wide scale, modern molecular biology relies on methods such as Chromatin Immuno-precipitation linked to sequencing. This method generates a genome wide profile with peaks at sites of high TF occupancy. These experiments may be very costly and it would be interesting to be able to predict TF occupancy sites *in silico*. With this idea in mind, we present **ChIPanalyser** , a R package developed in the effort of understanding Transcription factor binding. At the core of this package resides an approximation of statistical thermodynamcis as suggested by Zabet (Zabet et al. 2015). The statistical thermodynamcis framework proposed by Zabet offers a strong ground for binding site prediction as it requires minimal data input. In its current version, ChIPAnalyser requires a DNA sequence, a Position Weight Matrix, the number of bound molecules (or TFs bound to DNA) and a scaling factor for TF specificity. To improve the accuracy of the model, it is also possible to incorporate DNA accessibility data. # Methods As described above, ChIPAnalyser is based on an approximation of statistical thermodynamics. The core formula describing TF binding is given by : $$P(N,a,\lambda,\omega)_{j} = \frac{N \cdot a_{j} \cdot e^{(\frac{1}{\lambda} \cdot \omega_{j})}}{N \cdot a_{j} \cdot e^{(\frac{1}{\lambda} \cdot \omega_{j})}+ L \cdot n \cdot [a_{i} \cdot e^{(\frac{1} \lambda \cdot \omega_j)}]_ {i}} $$ with * N , the number of TF molecules bound to DNA * a , DNA accessibility * $\lambda$ , a parameter scaling the specificity of a given TF * $\omega$ , a Position Weight Matrix. # Work Flow - Quick start ## Example data Loading Before going through the inner workings of the package and the work flow, this section will quickly demonstrate how to load example datasets stored in the package. This data represents a minimal workable examples for the different functions. All data is derived from real biological data in *Drosophila melanogaster* (The *Drosophila melanogaster* genome can be found as a `BSgenome` ). ```{r, eval=TRUE, warnings = FALSE, echo=TRUE,message=FALSE} library(ChIPanalyser) #Load data data(ChIPanalyserData) # Loading DNASequenceSet from BSgenome object library(BSgenome.Dmelanogaster.UCSC.dm3) DNASequenceSet <-getSeq(BSgenome.Dmelanogaster.UCSC.dm3) #Loading Position Frequency Matrix PFM <- file.path(system.file("extdata",package="ChIPanalyser"),"BCDSlx.pfm") #Checking if correctly loaded ls() ``` The global environment should now contain a few new variables: DNASequenceSet,PFM,Access,geneRef, eveLocus, eveLocusChip. * `DNASequenceSet` is DNAStringSet extracted from the *Drosophila melanogaster* genome (BSgenome). It is advised to use a full genome sequence for this object. * `PFM` is a path to file. In this case, it is a Position Frequency Matrix derived from the Bicoid Transcription factor in *Drosophila melanogaster*. This PFM is in _raw_ format. Although it is possible to to directly use a PFM R matrix (see motifDB R package), we chose to use a path to a file for this example. Most PFM's found online will come in a text file (with various formats: RAW, TRANSFAC, JASPAR). `ChIPanalyser` is capable of handling all these formats and parsing these files to usable objects within the package. * `Access` is a `GRanges` object containing accessible DNA for the sequence above. * `geneRef` is a `GRanges` containing genetic information (exon, intron, 3'UTR, 5'UTR) for the sequence above. * `eveLocus` is a `GRanges` object with genomic postion for the eve strip locus in *Drosophila melanogaster*. * `eveLocusChip` is a data frame with ChIP score in the format of a simple bed file ( 4 columns : chromosome, start, end and score) for Bicoid transcription factor. ## Quick Start This section presents a quick work flow. For details on the work flow and objects, see section **Work Flow - Full Guide** ### Step 1 - Building Data objects and Pre-processing ChIP data The first step is to set up your data storing objects and extract normalised ChIP scores at loci of interest. When provided with a PFM, the `genomicProfileParameters` object will automatically convert it to a PWM. The `occupancyProfileParameters` object requires parameters extracted from ChIP data. If using the `processingChIPseq` function, the `occupancyProfileParameters` will be generated internally. ```{r,eval=TRUE, warnings = FALSE} # Building a genomicProfileParameters objects for data # storage and PWM computation GPP <- genomicProfileParameters(PFM=PFM,PFMFormat="raw", BPFrequency=DNASequenceSet, ScalingFactorPWM = 1.5, PWMThreshold = 0.7) GPP # Building occupancyProfileParameters with default values OPP <- occupancyProfileParameters() OPP # Building occupancyProfileParameters with custom values OPP <- occupancyProfileParameters(ploidy= 2, boundMolecules= 1000, chipMean = 200, chipSd = 200, chipSmooth = 250, maxSignal = 1.847, backgroundSignal = 0.02550997) OPP ## Extracting ChIP score eveLocusChip<-processingChIPseq(eveLocusChip,eveLocus,noiseFilter="zero",cores=1) str(eveLocusChip) ### Extracting occupancy profile parameters object built from ChIP data OPP<-eveLocusChip[[2]] eveLocusChip<-eveLocusChip[[1]] ``` ### Step 2 - Optimal Parameters The model is based on the approximation of statistical thermodynamics with inference of two parameters (ScalingFactorPWM and boundMolecules). In order to infer these parameters, we suggest to use `computeOptimal`. Values that should be tested for `ScalingFactorPWM` and for `boundMolecules` should be provided by user as described above. If these values are not provided (default value OR only one value for each parameter), then they will be assigned internally. ChIPanalyser also has multi-core support. If you are using large genomes, using multiple cores will significantly decrease computational time. The internal values are the following: ```{r,eval=FALSE} ScalingFactorPWM(genomicProfileParameters) <- c(0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 3.5 ,4 ,4.5, 5) boundMolecules(occupancyProfileParameters) <- c(1, 10, 20, 50, 100, 200, 500,1000,2000, 5000,10000,20000,50000, 100000, 200000, 500000, 1000000) ``` `computeOptimal`contains the following arguments: ```{r, eval = TRUE, warnings = FALSE} optimalParam <- suppressWarnings(computeOptimal(DNASequenceSet = DNASequenceSet, genomicProfileParameters = GPP, LocusProfile = eveLocusChip, setSequence = eveLocus, DNAAccessibility = Access, occupancyProfileParameters = OPP, optimalMethod = "all", peakMethod="moving_kernel", cores=1)) str(optimalParam) ``` **This Function might take some time to compute. Do not be alarmed if it takes some time to run. You should be notified of the progress of the function as it goes** This function is a combination of all the functions bellow with some more magic to it. In the following steps we will describe each of the functions. ### Step 3 - Genome Wide Scoring Computing Genome Wide metrics that will be used further down the line. It is possible to set a higher number of cores to decrease computational time. ```{r, eval=TRUE, warnings = FALSE} genomeWide <- computeGenomeWidePWMScore(DNASequenceSet=DNASequenceSet, genomicProfileParameters=GPP, DNAAccessibility = Access,cores=1) genomeWide ``` `computeGenomeWidePWMScore` will return a `genomicProfileParameters` object with updated values for `maxPWMScore`, `minPWMScore`,`averageExpPWMScore`, and `DNASequenceLength`. ### Step 4 - PWM Scores Above Threshold Once genome wide scores have been computed, the `genomeWide` object (previously computed) should be parsed to the next function. The next function will compute sites above the assigned threshold (see below) for a given locus (or set of loci). If no Locus is provided then the whole genome will be considered. It is possible to set a higher number of cores to decrease computational time. ** It is important to set names to your setSequence object (see below). We recommend to set names yourself to make your analysis easier to keep track of. Names will be set internally for computational reasons but there must be concordance between setSequence and ChIP data ** ** This aspect cannot be stressed enough** ```{r,eval=TRUE, warnings = FALSE} SitesAboveThreshold <- computePWMScore(DNASequenceSet=DNASequenceSet, genomicProfileParameters=genomeWide, setSequence=eveLocus, DNAAccessibility = Access,cores=1) SitesAboveThreshold ``` This function returns another `genomicProfileParameters` object with an updated `AllSitesAboveThreshold` slot. This slot contains a `GRanges` object with sites above threshold and associated PWMScores. ### Step 4 - compute Occupancy From the PWMScores, `ChIPanalyser` will compute occupancy for each sites above threshold. ```{r,eval=TRUE, warnings = FALSE} Occupancy <- computeOccupancy(SitesAboveThreshold, occupancyProfileParameters= OPP) Occupancy ``` This function will return a `genomicProfileParameters` object with an updated `AllSitesAboveThreshold`. Now the Occupancy values for each sites are included. ### Step 5 - compute ChIP -seq like profiles The ultimate goal of `ChIPanalyser` is to produce ChIP-seq like profile predicting transcription factor binding. To do so, the following function will compute ChIP-seq like scores from occupancy values. ```{r, eval=TRUE, warnings = FALSE} chipProfile <- computeChipProfile(setSequence = eveLocus, occupancy = Occupancy,occupancyProfileParameters = OPP, method="moving_kernel") chipProfile ``` This function will return a`List` of `GRangesList`s of `GRanges`. Each element of the list represents a combination of `ScalingFactorPWM` and `boundMolecules`. The `GRangesList` contains the Loci of interest. Finally, the individual `GRanges` contains ChIP-seq like scores for every *n* base pairs (with n = stepSize, see bellow). This object may be difficult to navigate if many different parameters, or Loci are used. In order to facilitate navigation, we included a search function. **See function: searchSites** This function can also be used to navigate `AllSitesAboveThreshold` slot after occupancy scores have been computed. ### Step 6 - Model Accuracy Assessing model quality (predicted model against real ChIP-seq data). ```{r, eval=TRUE, warnings = FALSE} AccuracyEstimate <- profileAccuracyEstimate(LocusProfile = eveLocusChip, predictedProfile = chipProfile, occupancyProfileParameters = OPP,method="all") AccuracyEstimate <-AccuracyEstimate[[1]][[1]][[1]] AccuracyEstimate ``` ### Step 7 - Plotting Finally, once all has been computed, it is possible to plot the results. ```{r, eval=TRUE, warnings = FALSE, fig.width=10, fig.height=8} # Plotting Optimal heat maps par(oma=c(0,0,3,0)) layout(matrix(1:8,ncol=4, byrow=T),width=c(6,1.5,6,1.5),height=c(1,1)) plotOptimalHeatMaps(optimalParam,layout=FALSE) ``` ```{r, eval=TRUE, warnings = FALSE, fig.width=10, fig.height=4.5} # Plotting occupancy Profile ## plotOccupancyProfile(predictedProfile=chipProfile[[1]][[1]], setSequence=eveLocus, chipProfile = eveLocusChip[[1]], DNAAccessibility = Access, occupancy = AllSitesAboveThreshold(Occupancy)[[1]][[1]], profileAccuracy=AccuracyEstimate, occupancyProfileParameters = OPP, geneRef=geneRef ) ``` # Work Flow - Full Guide This section will describe `ChIPanalyser`'s work flow. However in this section we will describe in detail data objects, parameters, and functions. Please refer to this section if in doubt. If the doubt persists, don't hesitate to send an email to the maintainer. ## Data objects - Genomic Profile Parameters The very first aspect to consider when using ChIPAnalyser is data input. Many (if not all functions) require specific data inputs and parameters in order to carry out the computation. To facilitate, the storage of these parameters, we created a `genomicProfileParameters` object (S4 class). This is the very first step before any other work. All other functions rely on this `genomicProfileParameters` object in one form or another. The output of most functions will be a `genomicProfileParameters` object. Thus the output of one functions should be used as an input for the next functions in the pipeline. All functions are described bellow in section **Work Flow - Analysis**. This object comes in the following form: ```{r, eval=FALSE, echo=TRUE} genomicProfileParameters(PWM, PFM, ScalingFactorPWM, PFMFormat, pseudocount, BPFrequency, naturalLog, noOfSites, minPWMScore, maxPWMScore, PWMThreshold, AllSitesAboveThreshold, DNASequenceLength, averageExpPWMScore, strandRule,whichstrand, NoAccess) ``` To build a `genomicProfileParameters` object : ```{r, eval=FALSE, echo=TRUE} # Assign Value wanted for each parameter GPP <- genomicProfileParameters(PWM, PFM,ScalingFactorPWM, PFMFormat, pseudocount, BPFrequency, naturalLog, noOfSites, PWMThreshold, DNASequenceLength, strandRule, whichstrand) ``` As one can see, `genomicProfileParameters` contains many arguments. However many of these arguments already have default values assigned to them. Some of the arguments should not be set by user. These values are computed internally and will automatically updated (minPWMScore, maxPWMScore, AllSitesAboveThreshold, NoAccess). In this situation, most arguments are not required to build a `genomicProfileParameters` object and a minimal build can be described as: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} # return empty genomicProfileParameters object GPP <- genomicProfileParameters() # return minimal working object GPP <- genomicProfileParameters(PFM=PFM,PFMFormat="raw") # Suggested Minimal Build GPP <- genomicProfileParameters(PFM=PFM,PFMFormat="raw", BPFrequency=DNASequenceSet) ``` Although many parameters have assigned default values, it is recommended to use custom parameters to better fit the needs of the analysis. The method described above will build a new `genomicProfileParameters` object with the values that were assigned to each argument. Only three slots are required in order to build a `genomicProfileParameters` object (see below - **The compulsory ones**).Most other slots are optional. If after building `genomicProfileParameters`, you wish to modify the value of only *one* slot and keep the values that you had previously assigned, it is possible to modify each slot individually by using the slot *access/setter* methods. Each slot and it's *access/setter* method is described below. ### Position Matricies - The compulsory ones * `PWM` , a Position Weight Matrix. If a Position Weight Matrix is readily available it is possible to directly use this Matrix. This `PWM` should contain four rows ( one for each base pair; ACTG in order). The number c olumns will depend on the length of the preferred binding motif of a given Transcription Factor. This argument is only necessary IF and ONLY IF, no `PFM` (Position Frequency Matrix) is available. Choosing between `PWM` or `PFM` comes down to personal choice as long a `PWM` is available for further computation (see `PFM`). If a `PFM` is available (see below), the Position Weight Matrix will be directly computed from the Position Frequency Matrix. Although it is possible to assign a new PWM to the `genomicProfileParameters` object without creating a new object, we suggest that if you were to decided to use another Position Weight Matrix to create a new `genomicProfileParameters`. ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} #Accessing PositionWeightMatrix slot PositionWeightMatrix(GPP) ``` ```{r,eval=FALSE, warnings= FALSE, echo=TRUE} # Setting PositionWeightMatrix slot PositionWeightMatrix(GPP) <- newPWM ### This is not the advised method ### newPWM is a matrix following the format described above ``` * `PFM` , a Position Frequency Matrix. The Position Frequency Matrix argument may come in multiple forms: in the form of a Matrix containing four rows (one for each base pair ACTG) and columns depending of the length of the binding motif or in the form of a path to file linking to a `PFM`. Position Frequency Matricies come in various configurations. The most common ones (all supported by ChIPAnalyser) are RAW (similar to the simple matrix described previously), Transfac and JASPAR. Finally, if the binding sequences are available, the `PFM` will be generated from sequence information. We suggest to use a path/to/file linking towards the `PFM` file. Most `PFM` will come in one of the formats described above and `ChIPanalyser` will parse these files in a usable format. However, PLEASE NOTE THAT THE FORMAT SHOULD BE SPECIFIED. See `PFMFormat` bellow. If a `PWM` is readily available, `PFM` is not necessary. However, keep in mind that at least one is necessary. Although it is possible to assign a new PFM to the `genomicProfileParameters` object without creating a new object, we suggest that if you were to decided to use another Position Frequency Matrix to create a new `genomicProfileParameters`. ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} # Accessing PositionFrequencyMatrix slot PositionFrequencyMatrix(GPP) ``` ```{r,eval=FALSE, warnings=FALSE, echo=TRUE} # Setting PositionFrequencyMatrix slot PositionFrequencyMatrix(GPP) <- newPFM ``` In this situation, `newPFM` is either a path to file or a PFM matrix. The PFMFormat will be the one assigned to the `genomicProfileParameters` object. At least one of **PWM** or **PFM** is required to create a `genomicProfileParameters` storage object. If a `PFM` is provided then the `PWM` will be automatically computed and updated. * `PFMFormat`, a file format for `PositionFrequencyMatrix` file. When Loading a PFM from a file (as described above), one should included the format of the file that they are using. `PFMFormat` may be one of the following: "raw","transfac","JASPAR" or "sequences". ```{r,eval=FALSE, echo=TRUE} PFMFormat(GPP) PFMFormat(GPP)<-"raw" ``` Default is set at "raw". All other arguments are optional however we strongly recommend to tailor the values assigned to `genomicProfileParameters` to your needs. The following sections will describe these optional parameters. ### Genomic Parameters - The optional ones * `ScalingFactorPWM` , a scaling factor for TF specificity. Although this parameter is optional (Default value is set at 1), the *scaling factor* (or *lambda* as described in the equations above) is crucial for many functions (described below). ScalingFactorPWM, must be a positive numeric value or a vector containing positive numeric values. The optimal value for `ScalingFactorPWM` may be inferred by using `computeOptimal`. Different values for `ScalingFactorPWM` will influence the goodness of fit of the model. For more information, see `computeOptimal` and `profileAccuracyEstimate`. ```{r, eval=F, echo=TRUE} ScalingFactorPWM(GPP) ScalingFactorPWM(GPP) <- 0.5 ScalingFactorPWM(GPP) <- c(0.5, 1, 1.5, 2) ``` * `PWMpseudocount`, a probability modifier. When computing a `PWM` from a `PFM`, it is possible that certain base pairs are completely absent from the Position Frequency Matrix. This absence will lead to odd results as part of this transformation requires a logarithmic transformation (at Position probability matrix step - a Matrix that describes the simple probability of a base pair being in that position of a binding motif given the PFM). *zeroes* will give minus infinities. In order to overcome this problem,a `PWMpseudocount` is introduced in the Position Probability Matrix. a `PWMpseudocount` of 1 (Default Value is 1) will then become a 0 after logarithmic transformation thus removing any mathematical discomforts. ```{r, eval=F, echo=TRUE} PWMpseudocount(GPP) PWMpseudocount(GPP) <- 1 ``` * `BPFrequency`, the frequency at which each base pair will occur in a given organism. Probabilistically speaking, all base pairs have an equal chance of occurring in the genome (Default value for this slot is set at 0.25 per base pair). However, biologically speaking this is not the case. `BPFrequency` may be supplied in various forms. If base pair frequency is known, it may be supplied as a vector containing the probability of occurrence of each base pair. If however, this frequency is unknown, `genomicProfileParameters` will compute `BPFrequency` from a `BSgenome` or a `DNAStringSet`. Bare in mind that `BPFrequency` is used to generate a *PWM* from a *PFM*, thus if one were to change the BPFrequency after creating a `genomicProfileParameters` with an already computed *PWM* , this would not influence the value of the *PWM*. It would be necessary to rebuild a new `genomicProfileParameters` object. ```{r, eval=F, echo=TRUE} BPFrequency(GPP) BPFrequency(GPP)<-c(0.2900342,0.2101426,0.2099192,0.2899039) BPFrequency(GPP) <- DNASequenceSet ``` * `naturalLog`, a logical value. As described previously (see `pseudocount`), the transformation from PFM to PWM requires a logarithmic transformation. The user may choose which logarithmic transformation, they would rather apply (Default is `TRUE`). If `naturalLog = TRUE`, then the natural logarithm will be used for transformation. If `naturalLog = FALSE`, then *log2* will be used instead. Keep in mind that, the goal is to avoid any funky business during PFM to PWM transformation (e.g. Minus infinities or division by zero). ```{r, eval=F, echo=TRUE} naturalLog(GPP) naturalLog(GPP) <- FALSE ``` * `noOfSites` , the number of sites used to compute the `PWM` from the `PFM`. In the event that a `PFM` contains a large amount of sites (as it sometimes is the case with Transfac PFM), it is possible to restrict this number of sites. The default value is 0. When `noOfSites = 0`, the whole `PFM` is used to compute the `PWM`. ```{r, eval=F, echo=TRUE} noOfSites(GPP) noOfSites(GPP) <- 8 ``` * `PWMThreshold`, a numeric threshold against which PWMScores are selected (Default is 0.7). Although it is possible to compute every single motif present in a stretch of DNA (if this is of interest, set `PWMThreshold` to 0), in most cases, only the sites with a high PWM Score will be of interest. The `PWMThreshold` , a numeric value between 0 and 1, will select regions above that given threshold. For the default threshold of 0.7, only the top 30% of PWMScores will be selected. ```{r, eval=F, echo=TRUE} PWMThreshold(GPP) PWMThreshold(GPP) <- 0.7 ``` * `strandRule`, indicates how the genome should be scored with the `PWM` (Default is "max"). As DNA is double stranded, it is necessary to specify how a strand of DNA should be scored. If `strandRule = "max"`, both strands will be scored and the highest score between each strand will be selected. If `strandRule = "sum"`, both strands will be scored and their respective score will be summed. If `strandRule = "mean"`, both strands will be score and the average score between both strands will selected as PWM Score. Only three possibilities: "max", "sum" and "mean" ```{r, eval=F, echo=TRUE} strandRule(GPP) strandRule(GPP) <- "mean" ``` * `whichstrand`, indicates which strand will be used to score the genome with the `PWM` (Default is both strand and is indicated by "+-"). Three options exist: plus strand ("+"), minus strand ("-") or both ("+-" or "-+"). ```{r, eval=F, echo=TRUE} whichstrand(GPP) whichstrand(GPP) <- "+" ``` ### Genomic Parameters - The Updated ones Some of the slots `genomicProfileParameters` should not be changed by user. We strongly advise against changing these slots. Certain Parameters are updated after a certain computation has been carried out. For example, `maxPWMScore` and `minPWMScore` are computed during the `computeGenomeWidePWMScore` function (see below) and represent both the highest and the lowest score of the given DNA sequence. These slots will be updated in the `genomicProfileParameters` object as one makes its way through the ChIPAnalyser work flow. Essentially, they are place holders for information required further down the work flow. Only slots that are of interest for the user are available for visualisation. If these slots have note been updated, the function will not return any value. * `maxPWMScore`, a numeric value describing the highest PWM Score on a given DNA sequence and the value assigned to `lambda`. It is still possible to access this slot using: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} maxPWMScore(Occupancy) ``` * `minPWMScore`, a numeric value describing the lowest PWM Score on a given DNA sequence and the value assigned to `lambda`. It is possible to access this slot using: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} minPWMScore(Occupancy) ``` * `averageExpPWMScore` a numeric value representing the exponential of the average PWM Score. This score depends on the values assigned to `lambda`. It is possible to access this slot using: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} averageExpPWMScore(Occupancy) ``` * `DNASequenceLength` , a numeric value describing the length of the DNA sequence used. Although theoretically one could provide this information, DNA length is automatically computed and the slot updated during `computeGenomeWidePWMScore` function. The length of this sequence is the length of the sequence used to compute the scores previously mentioned (maxPWMScore, minPWMScore and averageExpPWMScore). This means that if DNA accessibility data is provided, the length of the sequence will only be the length of the accessible DNA. ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} DNASequenceLength(Occupancy) ``` * `NoAccess`, indicates if certain Loci of interest (see setSequence below) **do not** contain any accessible DNA. It is possible that certain of the loci you have chosen do not contain any accessible DNA (no overlap with DNA accessibility data provided). If this is the case, you will be notified during the computation and the loci will be s tored in the `NoAccess` slot. ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} NoAccess(Occupancy) ``` * `AllSitesAboveThreshold`, stores all sites above threshold with the associated PWM Score and Occupancy. This slot may contain a variety of objects however they all represent the same thing: it will always contain at its core a GRanges object (slot class defined as "GRlist" - can be one of the following `GRangesList` or `list`). This GRanges inlcudes sites above threshold (start, end and strand), PWMScores for those sites and possibly Occupancy (depending on what has already been computed). GRanges are encapsulated in a GRangesList as each GRanges represent a specific Loci. This GRangesList may also be encapsulated in a list. This list will represent a combination of `lambda` and number of bound Molecules (see `boundMolecules`). For more information on this list see `computeOccupancy`. It is possible to access this slot by using: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} AllSitesAboveThreshold(Occupancy) # Or searchSites(Occupancy) ``` The size of the `AllSitesAboveThreshold` slot will increase drastically as the number of values assinged to `ScalingFactorPWM` (or `lambda`) and `boundMolecules` increases. In order to navigate and search this slot with ease, it is possible to use the `searchSites` function (**See below: searchSites**). ## Data Objects - Occupancy Profile Parameters `genomicProfileParameters` represents a good chunk of the parameters needed to go through the entire ChIPAnalyser work flow. However, there are more to come! A second parameter storing object was created to handle non-compulsory parameters. This lightens `genomicProfileParameters` by handling part of the parameters. This second S4 object is called `occupancyProfileParameters`. The interesting aspect of this object is that none of the slots are compulsory. This means that if not provided , a new `occupancyProfileParameters` object will be created internally. All default values will be used for further computation. As stated previously, we strongly advise using custom parameters in order to increase goodness of fit of model. The `processingChIPseq` function returns a list, the second element of which is an `occupancyProfileParameter` object. This specific object will contain both max signal and background directly extracted from you ChIP data. These slots are explained below. However this function can also take a `occupancyProfileParameters` object as an argument if you wish to change the smoothing parameters (see `chipSd`,`chipMean` and `chipSmooth`). The `processingChIPseq` function will update the one that is provided. ** See processcingChIPseq below ** ```{r, eval=FALSE} OPP <- occupancyProfileParameters(ploidy = 2 ,boundMolecules = 1000 , backgroundSignal = 0 ,maxSignal = 1, chipMean = 150 , chipSd = 150 , chipSmooth = 250 , stepSize = 10 , removeBackground = 0 ) ``` As it is the case with `genomicProfileParameters`, it is also possible to *access/set* each slot individually after having created an `occupancyProfileParameters` object. Each slot is described as the following: * `ploidy`, the ploidy level of the of the organism of interest (Default is set at 2). This only considers simple polyploidy (or haploidy). The model does not (yet) consider hybrids such as wheat. ```{r, eval=FALSE} ploidy(OPP) ploidy(OPP) <- 2 ``` * `boundMolecules`, a positive integer (or vector of positive integers) describing the number of bound molecules (Transcription factors) to DNA (Default value is set at 2000). In this model, occupancy is reliant on the number of bound molecules. The number of molecules will influence the goodness of fit of the model. It is possible to infer the number of bound Molecules by using the `computeOptimal` function. For more information, see `computeOptimal` and `profileAccuracyEstimate`. ```{r, eval=FALSE} boundMolecules(OPP) boundMolecules(OPP) <- 5000 ``` * `backgroundSignal`, a numeric value representing the background Signal in real ChIP-seq data (Default is set at 0). The background siganl is defined as the mean ChIP score over the entire genome (see `processingChIPseq` below). ```{r, eval=FALSE, echo=TRUE} backgroundSignal(OPP) backgroundSignal(OPP) <- 0.02550997 ``` * `maxSignal`, a numeric value representing the maximum signal in real ChIP-seq data (Default is set at 1) See `processingChIPseq` below) ```{r, eval=FALSE, echo=TRUE} maxSignal(OPP) maxSignal(OPP) <- 1.86 ``` * `chipMean`, a numeric value representing the average peak width in base pairs in real ChIP-seq data (Default is set at 150). ```{r, eval=FALSE, echo=TRUE} chipMean(OPP) chipMean(OPP) <- 150 ``` * `chipSd`, a numeric value representing the standard deviation of peak width in real ChIP-seq data (Default is set at 150). ```{r, eval=FALSE, echo=TRUE} chipSd(OPP) chipSd(OPP) <- 150 ``` * `chipSmooth`, a numeric value representing the size of the window used for smoothing the profile (Default is set at 250). The goal of ChIPAnalyser is to produce ChIP-seq like profile from predicted high occupancy sites. In order to mimic these ChIP-seq profile, a smoothing algorithm is used to smooth occupancy profiles. This algorithm uses ChIP-seq parameters such as `chipMean`, `chipSd`, `maxSignal`, `backgroundSignal` and `chipSmooth`. ```{r, eval=FALSE, echo=TRUE} chipSmooth(OPP) chipSmooth(OPP) <- 250 ``` * `stepSize`, a numeric value describing the bin size (in base pairs) used for computing ChIP-seq like profiles (Default is set at 10). In the case of long sequences, it not always necessary to include ChIP-like occupancy at every base pair (mainly for speed and memory usage). `stepSize` will determine the size of the bins used to split your sequence of interest. As an example, if your sequence is 16 000 bp long with a `stepSize` of 10, the resulting profile will be composed of 1600 occupancy points. ```{r, eval=FALSE, echo=TRUE} stepSize(OPP) stepSize(OPP) <- 10 ``` * `removeBackground`, a numeric value describing a threshold at which Occupancy signals must be removed (Default is set at 0). ```{r, eval=FALSE, echo=TRUE} removeBackground(OPP) removeBackground(OPP) <- 0 ``` # Work Flow - Analysis Once a `genomicProfileParameter` object has been established, the rest of the analysis becomes fairly straight forward. Unless, you already have prior knowledge on the number of bound molecules (`boundMolecules`) and the PWM scaling factor (`ScalingFactorPWM` or referred to as *lambda*), we advise you to first infer the optimal set of parameters as described in `computeOptimal`. However, as this function is essentially a combination of all other functions in the package (with a little bit more magic to it), we will overview a simple analysis work flow first and finish with `computeOptimal` function and its associated plotting function `plotOptimalHeatMaps`. ## ChIP Score extraction. It can be required to provied ChIP data to some of the functions in ChIPanalyser. We provided a function that will extract and normalise ChIP data from various formats. Based on this score, an `occupancyProfileParameters` will also be builtor updated (see above) with value for the relevant slots (See `backgroundSignal` and `maxSignal`). ```{r, eval=FALSE} processingChIPseq(profile,loci=NULL,reduce=NULL, occupancyProfileParameters=NULL,noiseFilter="zero",peaks=NULL, Access=NULL,cores=1) ``` As input, this functions takes: * `profile` a path to file containing ChIP scores (_bed,wig,bigBed,..._), a `GRanges` or a `data.frame`containing ChIP scores. * `loci` argument is a `GRanges` of the loci of interest. If none are supplied, a set of Sequence will be built internally based on the different chromosome available in ChIP data. * `reduce` is a numeric value describing how many regions are to be selected. If a large amount of regions are supplied (let's say you split your favourite genome into bins of 20kbp and built a `GRanges` with those bins), you may choose to select the top regions based on the average ChIP singal in that region or if `peaks` are supplied select the regions overlapping with `peaks` with the highest ChIP score. * `occupancyProfileParameters` is an `occupancyProfileParameters` object. If you wish to change smoothing parameters (see `chipSd`,`chipMean` and `chipSmooth`) please provide the aformentioned object. `processingChIPseq` will return a updated `occupancyProfileParameters`. * `peaks` is a path to file or `GRanges` containing ChIP peaks. This argument is generally used if reduce is **not NULL** * `Access` is a `GRanges` object containing accesible DNA for a given organism/cell type. If reduce is not NULL, only regions with accessible DNA will be selected. If `peaks` is not NULL, regions with peaks *and* accessible DNA will be selected. * `noiseFilter` is one of the following: zero, mean, median or sigmoid. The noise filter is the method by which noise is buffered. Zero will remove all score below 0. Mean and median will remove all score below the mean and median respectively. Finally, sigmoid will assign a weight to each score based on a logistic regression curve (midpoint = ChIP-seq score 95 quantile). All scores above the midpoint will receive a weight between 1 and 2. All score below will receive a score between 0 and 1. * `cores` is the number of course that should be used to process ChIP data. ## Genome Wide Scoring In order to score the entire genome (or the accessible genome), it is possible to use the `computeGenomeWidePWMScore` function. The output of this function will be influenced by the value assigned to `lambda`. If more than one value was assigned to the scaling factor, parameters dependant on lambda will be updated accordingly (computed for each value of lambda). It is possible to run this functions and make use of multiple cores in order to decrease computational time. The arguments of the function are the following : ```{r, eval=F} computeGenomeWidePWMScore(DNASequenceSet, genomicProfileParameters, DNAAccessibility = NULL, GenomeWide = TRUE,cores=1, verbose = TRUE) ``` ### Input Data - Genome Wide scoring As input, `computeGenomeWidePWMScore` requires to obligatory arguments: `DNASequenceSet` and `genomicProfileParameters`. `DNASequenceSet` comes in the form of the following: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} DNASequenceSet ``` `DNASequenceSet` may also come in the form of a `BSgenome` object. However, we advise to use a DNAStringSet for a question of ease and speed. If you are unfamiliar with `BSgenome` and `DNAStringSet`, the following example demonstrates how to use these objects in this context. ```{r, eval=TRUE, warnings = FALSE} #Extracting DNAStringSet from BSgenome DNASequenceSet <- getSeq(BSgenome.Dmelanogaster.UCSC.dm3) ``` As a reminder a `genomicProfileParameters` are presented in the following format: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} GPP ``` `DNAAccessibility` is an optional argument in `computeGenomeWidePWMScore`. If present, then the genome will be scored only on the accessible DNA. `DNAAccessibility` comes as a `GRanges` containing accessible DNA sites. ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} # DNA accessibility Access ``` `verbose` will determine if progress messages should be printed in the console and cores will determine the number of cores that will be used to compute genome wide metrics. ### computeGenomeWidePWMScore As an example of `computeGenomeWidePWMScore` usage: ```{r, eval=FALSE, warnings = FALSE, echo=TRUE} # With DNAAccessibility GenomeWide <- computeGenomeWidePWMScore(DNASequenceSet = DNASequenceSet, genomicProfileParameters = GPP, DNAAccessibility = Access,cores=1) GenomeWide # Without DNA accessibility GenomeWide <- computeGenomeWidePWMScore(DNASequenceSet = DNASequenceSet, genomicProfileParameters = GPP,cores=1) GenomeWide ``` ## Scoring sites above threshold Once genome wide metrics have been computed, the next step in the analysis is to extract sites above threshold (Sites with strong binding sites according to PWM Scores). The `computePWMScore` function will score the genome and extract sites above a local threshold (dependant on `PWMThreshold`, `maxPWMScore` and `minPWMScore`). It is possible to run this functions and make use of multiple cores in order to decrease computational time. The arguments of this functions are the following: ```{r, eval=FALSE} computePWMScore(DNASequenceSet, genomicProfileParameter, setSequence = NULL, DNAAccessibility = NULL, cores=1 ,verbose = TRUE) ``` ### Input Data - Sites Above threshold Only two arguments are absolutely required: `DNASequenceSet` and `genomicProfileParameters`. However, `setSequence` represents the Loci of interest. If `setSequence = NULL`, then sites above threshold will computed and extracted on a genome wide scale (or accessible genome if DNA Accessibility is provided). `DNASequenceSet` and `DNAAccessibility` are in the same format as previously described (`verbose` plays the same role as previously described). `setSequence` is a `GRanges` representing the loci of interest (may contain more than one loci/range) and comes in the following format: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} eveLocus ``` An important aspect to mention, is that it is recommended you name your loci of interest (not to be confused with `seqnames`). If no names are supplied they will be named internally following the format: * ChromosomeName_startOfRange..endOfRange If you are unfamiliar with GRanges, the following examples demonstrates naming in the context of ChIPAnalyser. We recommend getting acquainted with GenomicRanges as many aspect of ChIPanalyser require the use of GRanges. ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} # Sequence names of Loci seqnames(eveLocus) # Names of Loci names(eveLocus) # Naming Loci in GRanges names(eveLocus) <- "eve" ``` ### computePWMScore To compute PWM Scores at sites above threshold: ```{r, eval=FALSE, warnings = FALSE, echo=TRUE} # With DNA Accessibility PWMScores <- computePWMScore(DNASequenceSet = DNASequenceSet, genomicProfileParameters = GenomeWide, setSequence = eveLocus, DNAAccessibility = Access,cores=1) PWMScores # Without DNA Accessibility PWMScores <- computePWMScore(DNASequenceSet = DNASequenceSet, genomicProfileParameters = GenomeWide, setSequence = eveLocus,cores=1) PWMScores ``` As you can see, the `genomicProfileParameters` argument is the `genomicProfileParameters` object computed in the previous example. ChIPAnalyser works in a sequential manner: resulting objects from one function are often parsed as arguments to other functions. Finally, if your sequence of interest does not contain any accessible DNA, you will be notified during the computation and it is possible to extract inaccessible loci by using `NoAccess(PWMScores)` (See NoAccess slot in `genomicProfileParameters`). ## Occupancy Occupancy scores are computed using the formula described in **Methods**. It is worth mentioning that Occupancy scores are dependant on values assigned to `ScalingFactorPWM` and `boundMolecules`. If more than one value were to be assigned to these parameters, the resulting output will be a combination of both. For more information see the `computeOptimal` example as we will demonstrate multiple value computation (Single Value for lambda and boundMolecules will return an object identical in structure as with multiple values). The arguments for `computeOccupancy` are the following: ```{r,eval=FALSE} computeOccupancy(AllSitesPWMScore, occupancyProfileParameters = NULL, norm = TRUE,verbose = TRUE) ``` ### Input Data - Occupancy `computeOccupancy` requires a `genomicProfileParameters` object result of the previous function (`computePWMScore`). If you are unsure, if your `genomicProfileParameter` contains the right information, it is possible to check by using: ```{r, eval=FALSE, warnings = FALSE, echo=TRUE} AllSitesAboveThreshold(PWMScores) ``` If your GRanges does not contain PWMScore as a metadata column, you are either using the wrong object or you have not yet computed PWM Scores. `occupancyProfileParameters` is an `occupancyProfileParameters` object. If not provided, a new one will be generated internally. As previously mentioned, we strongly recommend to set those parameters to improve the model's goodness of it. As a reminder, a `occupancyProfileParameters` object (previously created - see section **Data object - Occupancy profile Parameters**) should print on the screen as follows: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} OPP ``` Finally, if `norm = TRUE`, the occupancy profiles will be normalised and `verbose = TRUE` progress messages will be printed to the console. ### computeOccupancy To compute Occupancy scores with `computeOccupancy`: ```{r, eval=FALSE, warnings = FALSE, echo=TRUE} Occupancy <- computeOccupancy(AllSitesPWMScore = PWMScores, occupancyProfileParameters = OPP) Occupancy ``` As it is the case in the previous functions, `AllSitesPWMScore` should be the result of the previous function (`computePWMScore`). `computeOccupancy` will return a `genomicProfileParameters` object with an updated `AllSitesAboveThreshold` slot. This slot should now contain a list of GRangesLists containing GRanges (one for each Loci of interest) with two metadata columns (PWMScore and Occupancy). Each element in the list is named with the specific combination of *lambda* and *boundMolecules* used to compute this set of occupancies. Finally, if your sequence of interest does not contain any accessible DNA, you will be notified during the computation and it is possible to extract inaccessible loci by using `NoAccess(PWMScores)` (See NoAccess slot in `genomicProfileParameters`). ## ChIP-seq like profiles The ultimate goal of ChIPAnalyser is to produce *ChIP-seq like* profile from occupancy data (from sites that display a high TF occupancy). `computeChipProfile` creates *ChIP-seq like* profiles from occupancy data by smoothing occupancy *profiles* and mimicking real ChIP-seq data. It is possible to run this functions and make use of multiple cores in order to decrease computational time. The arguments of `computeChipProfile` are the following: ```{r, eval=FALSE} computeChipProfile( setSequence , occupancy, occupancyProfileParameters = NULL, norm = TRUE, method = c("moving_kernel","truncated_kernel","exact"), peakSignificantThreshold= NULL,cores=1 verbose = TRUE) ``` ### Input data - ChIP-seq profiles The `computeChipProfile` function requires two compulsory arguments `setSequence` and `occupancy`. `setSequence` is a GRanges describing the loci of interest (this is the same GRanges used in `computePWMScore`). `occupancy` is a `genomicProfileParameters` object result of `computeOccupancy` function. To make sure this is the right `genomicProfileParameters`, you may use `AllSitesAboveThreshold()` (See AllSitesAboveThreshold slot description above). `occupancyProfileParameters` is an `occupancyProfileParameters` object. If not supplied, it will be generated *de novo* internally. Once again, we recommend to set the parameters of this object in relationship to real ChIP-seq data. `norm = TRUE` and `method` respectively represent if the ChIP-seq like profile should be normalised and if you wish to use an approximation for ChIP-seq profile or not. `moving_kernel` will use Rcpp to approximate and compute peaks, `truncated_kernel` will also approximate peaks but without using Rcpp, and `exact` will not approximate peaks. These methods represent different way of computing and/or approximating ChIP-seq peaks. Finally, `peakSignificantThreshold` is a threshold at which peaks will be selected. If you select "moving_kernel" then this threshold is a numeric value describing the peak tail hight cut-off value. The default in this case is 0.001. In the case of "truncated_kernel" and "exact", the threshold represents a distance in base pair from the peak summit at which the peak should be cut. In this case, default is set at 1250 base pairs. It should be noted that these methods will produce very similar results. And by very similar results, we mean nearly identical. ### computeChipProfile To generate a ChIP-seq like profile: ```{r, eval=FALSE, warnings = FALSE, echo=TRUE} chipProfile <- computeChipProfile(setSequence = eveLocus, occupancy = Occupancy,occupancyProfileParameters = OPP,cores=1) chipProfile ``` The output of this functions is slightly different as it returns a named list (each element in the list is named after the specific combination of *lambda* and *boundMolecules* used to compute occupancies) containing a GRangesList of GRanges with ChIP profile values as a metadata column. These GRanges also differ in the sense that they now contain the whole loci (or accessible loci) cut into bins of size equal to `stepSize` (See stepSize slot in `occupancyProfileParameters`). Each GRangesList contains GRanges for each Loci of interest. ## Searching through SitesAboveThreshold and ChIP-seq profiles As described previously, The size of the `AllSitesAboveThreshold` slot will increase drastically as the number of values assigned to `ScalingFactorPWM` (or `lambda`) and `boundMolecules` increases. In order to navigate and search this slot with ease, it is possible to use the `searchSites` function. This function may also be used on predicted ChIP-seq profiles (result of `computeChipProfile`). `searchSites` comes in the following form: ```{r, eval=FALSE,echo=TRUE} searchSites(Sites,ScalingFactor="all", BoundMolecules="all",Locus="all") ``` It is possible to use this function as a simple extraction method similarly to the `AllSitesAboveThreshold` method. In this case, the usage is the following: ```{r, eval=TRUE, echo=TRUE} searchSites(Occupancy) ``` If you wish to navigate and extract only certain combinations of `ScalingFactorPWM` and/or `boundMolecules` and/or Loci, `searchSites` could be use as shown below: ```{r, eval=TRUE, echo=TRUE} searchSites(chipProfile, ScalingFactor=c(1.5,2.5), BoundMolecules=c(1000,1500) ,Locus=c("eve","odd")) ``` ## Estimating the accuracy of the model In order to determine how accurate the predicted model is, it is possible to compare the predicted *ChIP-seq like profile* (as built in `computeChipProfile`) to real ChIP-seq data for a given Transcription Factors at loci of interest. `profileAccuracyEstimate` provides a way to compare both profiles. The arguments for this function are the following: ```{r, eval=FALSE} profileAccuracyEstimate(LocusProfile, predictedProfile, occupancyProfileParameters = NULL,method="all") ``` ### Input data - Accuracy Estimate `profileAccuracyEstimate` requires only two arguments. `precitedProfile` is the result of `computeChipProfile`. Finally, `LocusProfile` is a list containing actual ChIP-seq profiles. This list is the result of the `processingChIPseq` function. We also strongly recommend that each loci in `LocusProfile` (each element of the list) should be named in an identical manner as the loci used in `setSequence` (See previous functions). This list should come in the following format: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} str(eveLocusChip) ``` In this example, there is only one element in the list. However, this list can be as long as you wish and contain all the Loci that you are interested in. `occupancyProfileParameters` is only required if you have change the step size slot. Each predicted profile, will only contain a fraction of the score (see stepSize) Finally, `method` is the quality assessment method that you wish to use. The following possibilities are available: pearson, spearman, kendall, ks, geometric,fscore and all. `pearson,spearman and kendall` are correlation methods. `ks` is a Kolmogorov-Smirnov test (Will return both KS Distance and KS p-value) `geometric` is in an house metric that describes the ratio of difference in the area between curves over shared area between curves. `fscore` is a combination of multiple metrics such as Precision,recall,Fscore,Accuracy,MCC, and AUC. `all` is a combination of all the above mentioned. In every case, the Mean Squared Error will also be returned. ### profileAccuracyEstimate To test the accuracy the model against ChIP-seq data: ```{r, eval=FALSE, warnings = FALSE, echo=TRUE} AccuracyEstimate <- profileAccuracyEstimate(LocusProfile = eveLocusChip, predictedProfile = chipProfile, occupancyProfileParameters = OPP) AccuracyEstimate ``` This function returns a list of two elements. The first element represents lists containing the model quality assessments for every combination of parameters (Bound Molecules and lambda) for every genomic region. The second element of the list contains the result of the ROCR package: False positives, False Negative, etc... Generally Speaking, we recommend to only use the first element of this list. However, we offer the possibility to choose from the other metrics available in the ROCR package by returning the ROCR prediction object. ## Finding optimal Parameters As described previously, it is not always possible to know the optimal set of parameters for `ScalingFactorPWM` and `boundMolecules`. `ChIPAnalyser` offers the possibility to backward infer the parameters using the `computeOptimal` function. By testing different combinations of `ScalingFactorPWM` and `boundMolecules`, the `computeOptimal` function will return a list of values that were the highest ranking parameter combination for each model quality assessment method selected (see`profileAccuracyEstimate` above). This function will also return a list of matricies containing relevant scores for each combination. It is possible to run this functions and make use of multiple cores in order to decrease computational time. Values that should be tested for `ScalingFactorPWM` and for `boundMolecules` should be provided by user. If these values are not provided (default value and only one value for each parameter), then they will be assigned internally. The internal values are the following: ```{r,eval=FALSE} ScalingFactorPWM(genomicProfileParameters) <- c(0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 3.5 ,4 ,4.5, 5) boundMolecules(occupancyProfileParameters) <- c(1, 10, 20, 50, 100, 200, 500,1000,2000, 5000,10000,20000,50000, 100000, 200000, 500000, 1000000) ``` In terms of its arguments,`computeOptimal` can be described as: ```{r, eval=FALSE} computeOptimal(DNASequenceSet, genomicProfileParameters, LocusProfile, setSequence, DNAAccessibility = NULL, occupancyProfileParameters = NULL, optimalMethod = "all", peakMethod="moving_kernel" cores=1) ``` **Please note that this functions will take some time to complete. Do not be alarmed if it seems to have stalled.** ### Input Data - Optimal Parameters `computeOptimal` is essentially a combination of previous functions (with a bit more magic to it). For this reason, data input in extremely similar to the functions described above. As a quick reminder: * `DNASequenceSet`, a DNAStringSet (or BSgenome) containing the sequences of the organism of interest. * `genomicProfileParameters`, a `genomicProfileParameters` object containing at least a *Position Weight Matrix* or *Position Frequency Matrix*. All other slots will be computed internally. * `LocusProfile`, a named list of ChIP-seq profile for loci of interest. * `setSequence`, a named GRanges containing loci of interest. * `DNAAccessibility`, a GRanges containing Accessible DNA. * `occupancyProfileParameters`, an `occupancyProfileParameters` object. Although optional, we strongly advise to tailor this object by using values directly extracted from `LocusProfile` `optimalMethod` defines which metric you wish to compute. There are four possible choices: *pearson*, *spearman*, *kendall*, *ks*, *geometric*, *fscore* or *all* It is imperative that the lists/GRanges are named with the name of the Loci of interest. `peakMethod`describes if you wish to use an approximation for ChIP-seq profile peaks. `moving_kernel` will use Rcpp to approximate and compute peaks, `truncated_kernel` will also approximate peaks but without using Rcpp, and `exact` will not approximate peaks. These methods represent different way of computing and/or approximating ChIP-seq peaks. Finally, cores describes the number of cores that will be used to compute the optimal set of parameters. ### computeOptimal As a example describing the usage of `compute optimal` ```{r, eval=FALSE, warnings = FALSE} optimalParam <- computeOptimal(DNASequenceSet = DNASequenceSet, genomicProfileParameters = GPP, LocusProfile = eveLocusChip, setSequence = eveLocus, DNAAccessibility = Access, occupancyProfileParameters = OPP, optimalMethod = "all", cores=1) optimalParam ``` The `computeOptimal` function will return a list of values that were the highest ranking parameter combination for each model quality assessment method selected (see`profileAccuracyEstimate` above). This function will also return a list of matricies containing relevant scores for each combination. # Plotting Results As it is the case in mamy fields, data visualisation is a key aspect in any analysis. For this purpose, `ChIPAnalyser` offers two plotting functions: `plotOptimalHeatMaps` and `plotOccupancyProfile`. ## Optimal Parameters Once you have computed the optimal set of parameters, it is possible to plot these results in the form of a heat map using `plotOptimalHeatMaps`. Depending on what you are interested in, this function will either plot *correlation ,MSE, theta* or *all of the previous*. This functions requires minimal input as described below: ```{r, eval=F} plotOptimalHeatMaps(optimalParam=optimalParam,contour=TRUE,col=NULL,main=NULL, layout=TRUE) ``` ### Input Data & Plotting `plotOptimalHeatMaps` only requires one data input in the form of the result of `computeOptimal` (see `computeOptimal`). `contour` defines if contour lines should be plotted. `col` are the colors that you wish to use for your heatmaps. It should be noted that the color vector will be recycled if not enough colors are provided. `main` is the main title. Finally, `layout` is a logical value that defines if standard layout should be selected. Standard layout will generate an individual heatmap for each matrix provided (`computeOptimal` result) with a heat map scale bar on the left. NOTE: If you use your own layout, you should be aware that the `plotOptimalHeatMaps` will always plot both the heat map and the scale bar. In R this is considered as two plots (rasterImage scales). As an example: ```{r, eval=FALSE, warnings = FALSE, echo=TRUE, fig.width=4, fig.height=10} plotOptimalHeatMaps(optimalParam) ``` See plot in **Quick Guide** The boxed tile represents the highest correlation or theta for a given combination of `ScalingFactorPWM` and `boundMolecules`. In the case of MSE the boxed tile represents the lowest Mean Squared Error. ## Plotting Profiles `ChIPAnalyser` produces ChIP-seq like profiles. It is possible to plot these profiles but also to add a variety of features to these plots as well graphical parameter parsing. `plotOccupancyProfile` takes care of plotting with the following arguments: ```{r,eval=FALSE, fig.width=4, fig.width=10, fig.height=3} plotOccupancyProfile <- function(predictedProfile, setSequence, chipProfile = NULL, DNAAccessibility = NULL, occupancy = NULL, profileAccuracy=NULL, PWM=FALSE, occupancyProfileParameters = NULL, geneRef = NULL,axis=TRUE,...) ``` ### Input Data & Profiles In order to increase plotting flexibility, `plotOccupancyProfile` only plots one profile at a time. In practice, this means that only simple data units should be parsed to this functions. This also means that the main title is left to the user discretion. The arguments described above should come in the following format: * `precitedProfile`, a GRanges object containing the predicted ChIP-seq like profile for *one* locus and one combination of `lambda` and `boundMolecules`. * `setSequence`, a GRanges object containing the locus of interest. * `profileAccuracy`, the profile Accuracy estimate for one loci and for *one* combination of `lambda` and `boundMolecules` * `chipProfile`, a vector containing ChIP-seq data for locus of interest. In previous functions, ChIP-seq data was stored in a named list. In this case, it is the individual numeric vector contained within that list. * `occupancy`, a GRanges object containing both PWMScore and Occupancy. This GRanges is the result of `computeOccupancy` and should only contain a GRanges object for one locus and one combination of `lambda` and `boundMolecules`. * `PWM`, a logical operator indicating wherever you wish to plot *occupancy* or *PWMScores*. It is necessary to also include `occupancy` data. * `DNAAccessibility`, a GRanges object containing DNAAccessibility. DNAAccessibility is similar to DNAAccessibility data described previously. * `occupancyProfileParameters`, an `occupancyProfileParameters` object. This object should be the same as the one used in functions described above. However, the minimal requirement is that the `stepSize` slot remains consistent with `stepSize` used previously. As a reminder, stepSize default value is set at 10. * `geneRef`, a List containing genetic information (3'UTR, 5'UTR, exons, intron and enhancers). Each element of this list, is a GRanges containing the information regarding 3'UTR, 5'UTR, exons, intron and enhancers. * `axis` determine if the axes should be included * `...` Any other graphical Parameter of the following : col, density, border, lty, lwd, cex, cex.axis, xlab, ylab, xlim, ylim, las and axislables. As this object has not yet be described, `geneRef` should come in a similar format as the following: ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} geneRef ``` It should be noted that only two arguments are necessary (`predictedProfile` and `setSequence`). The more arguments are provided the more information will be plotted. As an example: ```{r, eval=FALSE, warnings = FALSE, echo=TRUE} plotOccupancyProfile(predictedProfile=chipProfile[[1]][[1]], setSequence=eveLocus, chipProfile = eveLocusChip[[1]], DNAAccessibility = Access, occupancy = AllSitesAboveThreshold(Occupancy)[[1]][[1]], occupancyProfileParameters = OPP, geneRef =geneRef) ``` # Session Information ```{r, eval=TRUE, warnings = FALSE, echo=TRUE} sessionInfo() ``` # References Zabet NR, Adryan B (2015) Estimating binding properties of transcription factors from genome-wide binding profiles. Nucleic Acids Res., 43, 84–94.