--- title: "monaLisa - MOtif aNAlysis with Lisa" author: "Dania Machlab, Lukas Burger, Charlotte Soneson, Michael Stadler" date: "`r Sys.Date()`" bibliography: monaLisa-refs.bib output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{monaLisa - MOtif aNAlysis with Lisa} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, out.width = "80%", fig.align = "center", crop = NULL # suppress "The magick package is required to crop" issue ) library(BiocStyle) ``` # Introduction `r Biocpkg("monaLisa")` is a collection of functions for working with biological sequences and motifs that represent the binding preferences of transcription factors or nucleic acid binding proteins. For example, `r Biocpkg("monaLisa")` can be used to conveniently find motif hits in sequences (see section \@ref(findhits)), or to identify motifs that are likely associated with observed experimental data. Such analyses are supposed to provide potential answers to the question "Which transcription factors are the drivers of my observed changes in expression/methylation/accessibility?". Several other approaches have been described that also address this problem, among them REDUCE [@reduce] and ISMARA [@ismara]. In `r Biocpkg("monaLisa")`, we aim to provide a flexible implementation that integrates well with other Bioconductor resources, makes use of the sequence composition correction developed for Homer [@homer] or stability selection [@StabSel] and provides several alternative ways to study the relationship between experimental measurements and sequence motifs. You can use known motifs from collections of transcription factor binding specificities such as `r Biocpkg("JASPAR2020")`, also available from Bioconductor. Genomic regions could be for example promoters, enhancers or accessible regions for which experimental data is available. Two independent approaches are implemented to identify interesting motifs: - In **binned motif enrichment analysis** (`monaLisa::calcBinnedMotifEnrR`, see section \@ref(binned)), genomic regions are grouped into bins according to a numerical value assigned to each region, such as the change in expression, accessibility or methylation. Motif enrichments are then calculated for each bin, normalizing for differences in sequence composition in a very similar way as originally done by [Homer](http://homer.ucsd.edu/homer/index.html) [@homer]. The binned motif enrichment approach was first introduced in @ginno2018 and subsequently applied in e.g. @barisic2019. To see more details on how `calcBinnedMotifEnrR` resembles `Homer`, check the function help page. We recommend using this function to do the binned motif enrichment analysis, since it corrects for sequence composition differences similarly to `Homer`, but is implemented more efficiently. `monaLisa::calcBinnedMotifEnrHomer` implements the same analysis using Homer and therefore requires a local [installation of Homer](http://homer.ucsd.edu/homer/introduction/install.html), and `monaLisa::calcBinnedKmerEnr`(see section \@ref(binnedkmers)) implements the analysis for k-mers instead of motifs, to study sequence enrichments without the requirement of known motifs. - **Randomized Lasso stability selection** (`monaLisa::randLassoStabSel`, see `r Biocpkg("monaLisa", vignette="selecting_motifs_with_randLassoStabSel.html", label="the stability selection vignette")`, `r Biocpkg("monaLisa")` uses a robust regression approach (stability selection, @StabSel) to predict what transcription factors can explain experimental measurements, for example changes in chromatin accessibility between two conditions. Also this approach allows to correct for sequence composition. In addition, similar motifs have to "compete" with each other to be selected. For both approaches, functions that allow visualization of obtained results are provided. If you prefer to jump right in, you can continue with section \@ref(quick) that shows a quick hypothetical example of how to run a binned motif enrichment analysis. If you prefer to actually compute enrichments on real data, you can find below a detailed example for a binned motif enrichment analysis (section \@ref(binned)). # Installation `r Biocpkg("monaLisa")` can be installed from Bioconductor via the `r CRANpkg("BiocManager")` package: ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("monaLisa") ``` # Quick example: Identify enriched motifs in bins{#quick} The quick example below, which we do not run, illustrates how a binned motif enrichment analysis can be performed in `monaLisa`. We assume that you already have extracted a set of peaks. The sequences of the peak regions are stored in a `Biostrings::DNAStringSet` (`peak_seqs`), and additionally each peak is associated with a numeric value (e.g., the change of methylation between two conditions, stored in the `peak_change` vector), that will be used to bin the regions before finding motifs enriched in each bin. ```{r, quick, eval=FALSE} # load package library(monaLisa) # bin regions # (peak_change is a numerical vector) # (peak_change needs to be created by the user to run this code) peak_bins <- bin(x = peak_change, binmode = "equalN", nElement = 400) # calculate motif enrichments # (peak_seqs is a DNAStringSet, pwms is a PWMatrixList) # (peak_seqs and pwms need to be created by the user to run this code) se <- calcBinnedMotifEnrR(seqs = peak_seqs, bins = peak_bins, pwmL = pwms) ``` The returned `se` is a `SummarizedExperiment` with `assays` *negLog10P*, *negLog10Padj*, *expForegroundWgt*, *pearsonResid*, *log2enr*, *sumForegroundWgtWithHits* and *sumBackgroundWgtWithHits*, each containing a matrix with motifs (rows) by bins (columns). The values are: - *negLog10P*: the raw P value ($-\log_{10} p$) of a given motif enrichment in a given bin. Each P value results from an enrichment calculation comparing occurrences of each motif in the bin to its occurrences in background sequences, defined by the `background` argument (by default: sequences in all other bins). - *negLog10Padj*: Same as *negLog10P* but adjusted for multiple testing using the method of provided in the `p.adjust.method` argument, by default: Benjamini and Hochberg, 1995 (`p.adjust(..., method="fdr")`). - *expForegroundWgt*: The expected number of regions in the bin containing a given motif. - *pearsonResid*: Standardized Pearson residuals, a measure of motif enrichment akin to a z-score for the number of regions in the bin containing the motif. The standardized Pearson residuals are given by $resid = (o - \mu)/\sigma$, where $\mu$ is the expected count and $\sigma$ the standard deviation of the expression in the numerator, under the null hypothesis that the probability of containing a motif is independent of whether the sequence is in the foreground or the background (see e.g. @agresti2007, section 2.4.5). - *log2enr*: Motif enrichments, calculated as: $log2enr = log2((o + c)/(e + c))$, where $o$ and $e$ are the observed and expected numbers of regions in the bin containing a given motif, and $c$ is a pseudocount defined by the `pseudocount.log2enr` argument. - *sumForegroundWgtWithHits* and *sumBackgroundWgtWithHits* are the sum of foreground and background sequences that have at least one occurrence of the motif, respectively. The background sequences are weighted in order to adjust for differences in sequence composition between foreground and background. In addition, `rowData(se)` and `colData(se)` give information about the used motifs and bins, respectively. In `metadata(se)` you can find information about parameter values. # Binned motif enrichment analysis: Finding TFs enriched in differentially methylated regions{#binned} This example is based on an *in vitro* differentiation system, in which mouse embryonic stem (ES) cells are differentiated into neuronal progenitors (NP). In an earlier study [@LMRs], we have analyzed the genome-wide CpG methylation patterns in these cell types and identified so called *low methylated regions* (LMRs), that have reduced methylation levels and correspond to regions bound by transcription factors. We also developed a tool that systematically identifies such regions from genome-wide methylation data [@MethylSeekR]. Interestingly, a change in methylation of LMRs is indicative of altered transcription factor binding. We will therefore use these regions to identify transcription factor motifs that are enriched or depleted in LMRs that change their methylation between ES and NP cell states. ## Load packages We start by loading the needed packages: ```{r loadlib, message=FALSE} library(GenomicRanges) library(SummarizedExperiment) library(JASPAR2020) library(TFBSTools) library(BSgenome.Mmusculus.UCSC.mm10) library(monaLisa) library(ComplexHeatmap) library(circlize) ``` ## Genomic regions of interest `r Biocpkg("monaLisa")` contains a file with genomic coordinates (mouse mm10 assembly) of LMRs, with the respective changes of methylation: ```{r loadLMRs} lmrfile <- system.file("extdata", "LMRsESNPmerged.gr.rds", package = "monaLisa") lmr <- readRDS(lmrfile) lmr ``` We can see there are `r length(lmr)` LMRs, most of which gain methylation between ES and NP stages: ```{r deltameth} hist(lmr$deltaMeth, 100, col = "gray", main = "", xlab = "Change of methylation (NP - ES)", ylab = "Number of LMRs") ``` In order to keep the computation time reasonable, we'll select 10,000 of the LMRs randomly: ```{r lmrsel} set.seed(1) lmrsel <- lmr[ sample(x = length(lmr), size = 10000, replace = FALSE) ] ``` ## Bin genomic regions Now let's bin our LMRs by how much they change methylation, using the `bin` function. We are not interested in small changes of methylation, say less than 0.3, so we'll use the `minAbsX` argument to create a *no-change* bin in [-0.3, 0.3). The remaining LMRs are put into bins of 800 each: ```{r binlmrs} bins <- bin(x = lmrsel$deltaMeth, binmode = "equalN", nElement = 800, minAbsX = 0.3) table(bins) ``` We can see which bin has been set to be the zero bin using `monaLisa::getZeroBin`, or set it to a different bin using `monaLisa::setZeroBin`: ```{r} # find the index of the level representing the zero bin levels(bins) getZeroBin(bins) ``` Because of the asymmetry of methylation changes, there is only a single bin with LMRs that lost methylation and many that gained: ```{r plotbins} plotBinDensity(lmrsel$deltaMeth, bins, legend = "topleft") ``` Note that the bin breaks around the *no-change* bin are not exactly -0.3 to 0.3. They have been adjusted to have the required 800 LMRs per bin. ## Prepare motif enrichment analysis Next we prepare the motif enrichment analysis. We first need known motifs representing transcription factor binding site preferences. We extract all vertebrate motifs from the `r Biocpkg("JASPAR2020")` package as positional weight matrices (PWMs): ```{r getmotifs} pwms <- getMatrixSet(JASPAR2020, opts = list(matrixtype = "PWM", tax_group = "vertebrates")) ``` Furthermore, we need the sequences corresponding to our LMRs. As sequences in one bin are compared to the sequences in other bins, we would not want differences of sequence lengths or composition between bins that might bias our motif enrichment results. In general, we would recommend to use regions of similar or even equal lengths to avoid a length bias, for example by using a fixed-size region around the midpoint of each region of interest using `GenomicRanges::resize`. In addition, the resized regions may have to be constrained to the chromosome boundaries using trim: ```{r makewidthequal} summary(width(lmrsel)) lmrsel <- trim(resize(lmrsel, width = median(width(lmrsel)), fix = "center")) summary(width(lmrsel)) ``` We can now directly extract the corresponding sequences from the `r Biocpkg("BSgenome.Mmusculus.UCSC.mm10")` package ```{r getseqs} lmrseqs <- getSeq(BSgenome.Mmusculus.UCSC.mm10, lmrsel) ``` and check for differences in sequence composition between bins using the `monaLisa::plotBinDiagnostics` function. "GCfrac" will plot the distributions of the fraction of G+C bases, and "dinucfreq" creates a heatmap of average di-nucleotide frequencies in each bin, relative to the overall average. ```{r bindiag} plotBinDiagnostics(seqs = lmrseqs, bins = bins, aspect = "GCfrac") plotBinDiagnostics(seqs = lmrseqs, bins = bins, aspect = "dinucfreq") ``` From these plot, we can see that LMRs with lower methylation in NP cells compared to ES cells (bin `r levels(bins)[1]`) tend to be GC-poorer than LMRs in other bins. A strong bias of this kind can give rise to false positives in that bin, e.g. enrichments of AT-rich motifs. In this example, the bias is however not very strong and the built-in sequence composition correction in `monaLisa::calcBinnedMotifEnrR` should be able to handle it. ## Run motif enrichment analysis Finally, we run the binned motif enrichment analysis. This step will take a while, and typically you would use the `BPPARAM` argument to run it with parallelization using `n` cores as follows: `calcBinnedMotifEnrR(..., BPPARAM = BiocParallel::MulticoreParam(n)))`. For this example however, you can just load the pre-computed results as shown below. ```{r runbinned, eval=FALSE} se <- calcBinnedMotifEnrR(seqs = lmrseqs, bins = bins, pwmL = pwms) ``` In case you did not run the above code, let's now read in the results: ```{r getresults} se <- readRDS(system.file("extdata", "results.binned_motif_enrichment_LMRs.rds", package = "monaLisa")) ``` `se` is a `SummarizedExperiment` object which nicely keeps motifs, bins and corresponding metadata together: ```{r summarizedexperiment} # summary se dim(se) # motifs-by-bins # motif info rowData(se) head(rownames(se)) # bin info colData(se) head(colnames(se)) # assays: the motif enrichment results assayNames(se) assay(se, "log2enr")[1:5, 1:3] ``` We can plot the results using the `monaLisa::plotMotifHeatmaps` function, e.g. selecting all transcription factor motifs that have an $-log_{10} FDR$ of at least 4.0 in any bin (corresponding to an $FDR < 10^{-4}$). FDR values are stored in the `negLog10Padj` assay: ```{r plottfs, fig.height=10} # select strongly enriched motifs sel <- apply(assay(se, "negLog10Padj"), 1, function(x) max(abs(x), 0, na.rm = TRUE)) > 4.0 sum(sel) seSel <- se[sel, ] # plot plotMotifHeatmaps(x = seSel, which.plots = c("log2enr", "negLog10Padj"), width = 2.0, cluster = TRUE, maxEnr = 2, maxSig = 10, show_motif_GC = TRUE) ``` In order to select only motifs with significant enrichments in a specific bin, or in any bin except the "zero" bin, you could use: ```{r selsigvariants} # significantly enriched in bin 8 levels(bins)[8] sel.bin8 <- assay(se, "negLog10Padj")[, 8] > 4.0 sum(sel.bin8, na.rm = TRUE) # significantly enriched in any "non-zero" bin getZeroBin(bins) sel.nonZero <- apply( assay(se, "negLog10Padj")[, -getZeroBin(bins), drop = FALSE], 1, function(x) max(abs(x), 0, na.rm = TRUE)) > 4.0 sum(sel.nonZero) ``` Setting `cluster = TRUE` in `monaLisa::plotMotifHeatmaps` has re-ordered the rows using hierarchical clustering of the `"pearsonResid"` assay. As many transcription factor binding motifs are similar to each other, it is also helpful to show the enrichment heatmap clustered by motif similarity. To this end, we first calculate all pairwise motif similarities (measured as maximum Pearson correlation of all possible shifted alignments). This can be quickly calculated for the few selected motifs using the `monaLisa::motifSimilarity` function. For many motifs, this step may take a while, and it may be useful to parallelize it using the `BPPARAM` argument (e.g. to run on `n` parallel threads using the multi-core backend, you can use: `motifSimilarity(..., BPPARAM = BiocParallel::MulticoreParam(n))`). ```{r wmclustering} SimMatSel <- motifSimilarity(rowData(seSel)$motif.pfm) range(SimMatSel) ``` The order of the TFs in the resulting matrix is consistent with the elements of `seSel`, and the maximal similarity between any pair of motifs is 1.0. By subtracting these similarities from 1.0, we obtain distances that we use to perform a hierarchical clustering with the `stats::hclust` function. The returned object is then passed to the `cluster` argument to define the order of the rows in the heatmap. The plotting of the dendrogram is controlled by the argument `show_dendrogram`, and we also display the motifs as sequence logos using `show_seqlogo`: ```{r plottfsclustered, fig.height=10} # create hclust object, similarity defined by 1 - Pearson correlation hcl <- hclust(as.dist(1 - SimMatSel), method = "average") plotMotifHeatmaps(x = seSel, which.plots = c("log2enr", "negLog10Padj"), width = 1.8, cluster = hcl, maxEnr = 2, maxSig = 10, show_dendrogram = TRUE, show_seqlogo = TRUE, width.seqlogo = 1.2) ``` ## Convert between motif text file for `Homer` and motif objects in `R` `r Biocpkg("monaLisa")` provides two functions for performing binned motif enrichment analysis (`monaLisa::calcBinnedMotifEnrR` and `monaLisa::calcBinnedMotifEnrHomer`). `monaLisa::calcBinnedMotifEnrR` implements the binned motif enrichment analysis in `R`, similarly to `Homer` and does not require the user to have the `Homer` tool pre-installed. For more information on that function and how it resembles the `Homer` tool see the function documentation. A simple way to represent a DNA sequence motif that assumes independence of positions in the motif is a matrix with four rows (for the bases A, C, G and T) and `n` columns for the `n` positions in the motif. The values in that matrix can represent the sequence preferences of a binding protein in several different ways: - **Position frequency matrices (PFM)** contain values that correspond to the number of times (frequency) that a given base has been observed in at a given position of the motif. It is usually obtained from a set of known, aligned binding site sequences, and depending on the number of sequences, the values will be lower or higher. In `R`, PFMs are often represented using `TFBSTools::PFMatrix` (single motif) or `TFBSTools::PFMatrixList` objects (set of motifs). This is the rawest way to represent a sequence motif and can be converted into any other representation. - **Position probability matrices (PPM)** are obtained by dividing the counts in each column of a PFM by their sum. The values now give a probability of observing a given base at that position of the motif and sum up to one in each column. This is the representation used in motif text files for `Homer`. A PPM can only be converted back to a PFM by knowing or assuming how many binding site sequences were observed (see argument `n` in `homerToPFMatrixList`). - **Position weight matrices (PWM)** (also known as position specific scoring matrices, PSSM) are obtained by comparing the base probabilities in a PPM to the probabilities of observing each base outside of a binding site (background base probabilities), for example by calculating log-odds scores (see `TFBSTools::toPWM` for details). This is a useful representation for scanning sequences for motif matches. In `R`, PWMs are often represented using `TFBSTools::PWMatrix` (single motif) or `TFBSTools::PWMatrixList` (set of motifs). `monaLisa::calcBinnedMotifEnrR` takes PWMs as a `TFBSTools::PWMatrixList` object to scan for motif hits. `monaLisa::calcBinnedMotifEnrHomer` on the other hand takes a motif text file with PPMs, and requires the user to have `Homer` installed to use it for the binned motif enrichment analysis. Here, we show how one can get motif PFMs from `r Biocpkg("JASPAR2020")` and convert them to a `Homer`-compatible text file with PPMs (`monaLisa::dumpJaspar`) and vice versa (`monaLisa::homerToPFMatrixList` ), and how to convert a `TFBSTools::PFMatrixList` to a `TFBSTools::PWMatrixList` for use with `monaLisa::calcBinnedMotifEnrR` or `monaLisa::findMotifHits`: ```{r} # get PFMs from JASPAR2020 package (vertebrate subset) pfms <- getMatrixSet(JASPAR2020, opts = list(matrixtype = "PFM", tax_group = "vertebrates")) # convert PFMs to PWMs pwms <- toPWM(pfms) # convert JASPAR2020 PFMs (vertebrate subset) to Homer motif file tmp <- tempfile() convert <- dumpJaspar(filename = tmp, pkg = "JASPAR2020", pseudocount = 0, opts = list(tax_group = "vertebrates")) # convert Homer motif file to PFMatrixList pfms_ret <- homerToPFMatrixList(filename = tmp, n = 100L) # compare the first PFM # - notice the different magnitude of counts (controlled by `n`) # - notice that with the default (recommended) value of `pseudocount = 1.0`, # there would be no zero values in pfms_ret matrices, making # pfms and pfms_ret even more different as.matrix(pfms[[1]]) as.matrix(pfms_ret[[1]]) # compare position probability matrices with the original PFM round(sweep(x = as.matrix(pfms[[1]]), MARGIN = 2, STATS = colSums(as.matrix(pfms[[1]])), FUN = "/"), 3) round(sweep(x = as.matrix(pfms_ret[[1]]), MARGIN = 2, STATS = colSums(as.matrix(pfms_ret[[1]])), FUN = "/"), 3) ``` # Binned k-mer enrichment analysis{#binnedkmers} In some situations it may be beneficial to perform the enrichment analysis in a more 'unbiased' way, using k-mers rather than annotated motifs. Here, we will illustrate the process using the same LMR data set as used for the motif enrichment analysis above. Similarly to the motif enrichment, this step takes a while to perform, and we can also load the processed object directly. ```{r binnedkmerenr, eval=FALSE} sekm <- calcBinnedKmerEnr(seqs = lmrseqs, bins = bins, kmerLen = 6, includeRevComp = TRUE) ``` ```{r binnedkmerenr-load} sekm <- readRDS(system.file( "extdata", "results.binned_6mer_enrichment_LMRs.rds", package = "monaLisa" )) ``` Just as for the motif enrichment analysis, the return value is a `SummarizedExperiment` object, with the same set of assays and annotations. ```{r} sekm ``` As for the motif enrichment, we can extract any k-mer that is enriched in any of the bins. ```{r} selkm <- apply(assay(sekm, "negLog10Padj"), 1, function(x) max(abs(x), 0, na.rm = TRUE)) > 4 sum(selkm) sekmSel <- sekm[selkm, ] ``` Next, let's compare the enriched k-mers to the motifs that were found earlier. This can be done using the `monaLisa::motifKmerSimilarity` function. By showing the similarity between the enriched k-mers and motifs, we can see whether, e.g., strongly enriched k-mers do not seem to correspond to an annotated motif. ```{r, fig.width=10, fig.height=10} pfmSel <- rowData(seSel)$motif.pfm sims <- motifKmerSimilarity(x = pfmSel, kmers = rownames(sekmSel), includeRevComp = TRUE) dim(sims) maxwidth <- max(sapply(TFBSTools::Matrix(pfmSel), ncol)) seqlogoGrobs <- lapply(pfmSel, seqLogoGrob, xmax = maxwidth) hmSeqlogo <- rowAnnotation(logo = annoSeqlogo(seqlogoGrobs, which = "row"), annotation_width = unit(1.5, "inch"), show_annotation_name = FALSE ) Heatmap(sims, show_row_names = TRUE, row_names_gp = gpar(fontsize = 8), show_column_names = TRUE, column_names_gp = gpar(fontsize = 8), name = "Similarity", column_title = "Selected TFs and enriched k-mers", col = colorRamp2(c(0, 1), c("white", "red")), right_annotation = hmSeqlogo) ``` # Use `r Biocpkg("monaLisa")` to annotate genomic regions with predicted motifs{#findhits} As mentioned `r Biocpkg("monaLisa")` can also be used to scan sequences for motifs. Here is an example (just on a few sequences/motifs for illustration): ```{r findMotifs, warning=FALSE} # get sequences of promoters as a DNAStringSet # (could also be a single DNAString, or the name of a fasta file) library(TxDb.Mmusculus.UCSC.mm10.knownGene) gr <- trim(promoters(TxDb.Mmusculus.UCSC.mm10.knownGene, upstream = 1000, downstream = 500)[c(1, 4, 5, 10)]) library(BSgenome.Mmusculus.UCSC.mm10) seqs <- getSeq(BSgenome.Mmusculus.UCSC.mm10, gr) seqs # get motifs as a PWMatrixList # (could also be a single PWMatrix, or the name of a motif file) library(JASPAR2020) library(TFBSTools) pfms <- getMatrixByID(JASPAR2020, c("MA0885.1", "MA0099.3", "MA0033.2", "MA0037.3", "MA0158.1")) pwms <- toPWM(pfms) pwms name(pwms) # predict hits in sequences res <- findMotifHits(query = pwms, subject = seqs, min.score = 6.0, method = "matchPWM", BPPARAM = BiocParallel::SerialParam()) res # create hit matrix: # number of sites of each motif per sequence m <- table(factor(seqnames(res), levels = names(seqs)), factor(res$pwmname, levels = name(pwms))) m ``` The transformation of sequence and PWM names to factors with defined levels in the creation of the hit matrix above is not strictly needed, but it ensures that even sequences or motifs without any hits are reported in the matrix, and that the order of sequences (rows) and motifs (columns) is identical to the order in `seqs` and `pwms`. # Session info and logo The monaLisa logo uses a drawing that was obtained from http://vectorish.com/lisa-simpson.html under the Creative Commons attribution - non-commercial 3.0 license: https://creativecommons.org/licenses/by-nc/3.0/. This vignette was built using: ```{r, session} sessionInfo() ``` # References