--- title: "ChIP-seq Analysis" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{ChIP-seq Analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} NOT_CRAN <- identical(tolower(Sys.getenv("NOT_CRAN")), "true") eval_vignette <- NOT_CRAN & memes::meme_is_installed() knitr::opts_chunk$set( collapse = TRUE, comment = "#>", purl = eval_vignette, eval = eval_vignette ) ``` ```{r setup} library(memes) library(magrittr) library(ggplot2) suppressPackageStartupMessages(library(GenomicRanges)) ``` # See package website for full vignette The Bioconductor build system does not have the MEME Suite installed, therefore these vignettes will not contain any R output. To view the full vignette, visit this article page on the memes website [at this link](https://snystrom.github.io/memes-manual/articles/integrative_analysis.html) # Introduction In this vignette, we'll explore using `memes` to deeply analyze a set of ChIP-seq peaks to identify motifs to explain differences in transcription factor binding, and consequences to chromatin accessibility at these ChIP peaks. We will use a dataset from (Nystrom, 2020) which investigated the binding of the transcription factor E93 during *Drosophila* wing development. These data are useful because they contain multiple groupings with interesting motif properties between groups. The ChIP experiments were performed under two conditions: ectopic expression of E93 during early development, and under wild-type conditions in a late stage of development. Peaks are annotated based on whether they are found in both conditions ("Endogenous" peaks), only in the ectopic expression condition ("Ectopic" peaks), or only found under wild-type conditions ("Orphan" peaks). A key question from this observation is whether there are motifs which can distinguish these three categories of binding from eachother. To investigate this question, we will first look for enrichment of known transcription factor motifs within each binding category using the AME tool. To start, ensure memes can detect your local install of the MEME Suite (see `vignette("install_guide")` for help). ```{r} check_meme_install() ``` ### Prepare peaks for analysis A subset of annotated E93 ChIP summits are shipped with the memes package and can be loaded as follows: ```{r, message=F} data("example_chip_summits", package = "memes") head(example_chip_summits, 5) ``` The ChIP summits have 3 additional annotation columns: `id`, `peak_binding_description`, and `e93_sensitive_behavior`. `id` is a unique identifier for each peak, `peak_binding_description` describes the condition in which E93 is observed to bind to this region (described in detail below), and `e93_sensitive_behavior` describes the change in chromatin accessibility observed at this site following E93 expression. These annotations will be used throughout this vignette to identify motifs that are associated with differences in E93 binding, and differences in chromatin accessibility at E93 bound sites. The summit of a ChIP peak represents the single base-pair position with the highest amplitude ChIP signal within a ChIP peak. Typically, motif analysis gives the best results when performed in a small window (100 - 200 bp) around the summit. We can resize our summits to use a 100bp window using `plyranges`: ```{r} # These data use the dm3 reference genome dm.genome <- BSgenome.Dmelanogaster.UCSC.dm3::BSgenome.Dmelanogaster.UCSC.dm3 # Get sequences in a 100bp window around the peak summit summit_flank <- example_chip_summits %>% # this ensures we take 50bp up & downstream of the summit for a total width of 100bp plyranges::anchor_center() %>% plyranges::mutate(width = 100) ``` ## Determinants of ectopic and orphan binding One approach to identify enriched motifs is to search for enrichment of known transcription factor (TF) motifs within sequences of interest. The Fly Factor Survey is a database of known *Drosophila* TF motifs that we will use throught this vignette. These data are also packaged with `memes`: ```{r} meme_db_path <- system.file("extdata/flyFactorSurvey_cleaned.meme", package = "memes", mustWork = TRUE) ``` ### Pre-filtering database for expressed transcription factors A key consideration when using a database of known transcription factors is whether to include motifs for TFs which are not expressed. On one hand, including all motifs allows maximum discovery of potential regulators, but on the other hand may detect factors which are not relevant under your experimental conditions. Another consideration for many tools is that increasing the number of tested motifs increases the multiple testing penaly, so limiting the motif database to a smaller number of relevant candidates can be a way to improve statistical power. There is no one right answer to this question. For this vignette, I will give an example for how one could go about filtering a database to consider only expressed factors. For other analyses, this question will need to be carefully considered. First, we load the motif database into R using `read_meme`. To more easily manipulate the motif metadata, we can convert this to a data.frame using `to_df`. ```{r} library(universalmotif) meme_db <- read_meme(meme_db_path) %>% to_df() ``` To simplify the vignette, we've preproccessed RNAseq data from our tissue of interest and shipped them with `memes`. These are FPKM normalized counts for each transcription factor gene at early and late developmental times. ```{r} data("example_rnaseq", package = "memes") head(example_rnaseq, 4) ``` ```{r, fig.height=5, fig.width=5, eval=F, include=F, echo=F} library(ggplot2) example_rnaseq %>% dplyr::group_by(symbol) %>% dplyr::filter(fpkm == max(fpkm)) %>% ggplot(aes(fpkm)) + geom_histogram(binwidth = 1) + coord_cartesian(xlim = c(0, 25)) + theme_bw() + labs(title = "Largest FPKM value in RNAseq dataset for each gene", y = "Number of genes", x = "Largest FPKM Value") ``` Choosing a method to decide whether a gene is expressed or not is a non-trivial topic. **DO NOT BLINDLY DO WHAT I DO HERE**. To keep this vignette on topic, I will use an overly simplistic cutoff of FPKM > 5 to define expressed genes. In this example, I will consider any gene that exceeds the cutoff at either timepoint as expressed. ```{r} expressed_genes <- example_rnaseq %>% # For each gene, keep only those with max(FPKM) greater or equal to 5. dplyr::group_by(symbol) %>% dplyr::filter(max(fpkm) >= 5) ``` Finally, I will filter the full motif database to select only those motifs corresponding to expressed gene. ```{r} meme_db_expressed <- meme_db %>% # the altname slot of meme_db contains the gene symbol # (this is database-specific) dplyr::filter(altname %in% expressed_genes$symbol) ``` Below is a comparison of the number of motifs in the database before and after filtering. ```{r} #Number of motifs pre-filtering: nrow(meme_db) #Number of motifs post-filter: nrow(meme_db_expressed) ``` Using this filtering strategy, we've decreased our multiple-testing penalty by ~54%. As a convenience function, we can set a default motif database to use for each `memes` function by setting the R option `meme_db` to a `universalmotif` object. This can be done as follows: ```{r} # to_list() converts the database back from data.frame format to a standard `universalmotif` object. options(meme_db = to_list(meme_db_expressed, extrainfo = FALSE)) ``` Alternatively, you can pass a universalmotif object to the `database` parameter for any `memes` function and this will override the global option. ### Examination of binding categories with AME Next, to test for enrichment of known motifs between E93 binding categories, we first collect the sequences of each category like so: ```{r} by_binding <- summit_flank %>% # Get a list of chip peaks belonging to each set split(mcols(.)$peak_binding_description) %>% # look up the DNA sequence of each peak within each group get_sequence(dm.genome) ``` The data are returned as a Biostrings List, where each list entry represents the sequences of each E93 binding category. ```{r} head(by_binding) ``` Finally, we test each set of sequencing using AME with the `runAme` function. This will run AME on each set of input sequences in the Biostrings List. ```{r} ame_by_binding <- by_binding %>% runAme ``` This returns a list object where each entry is the AME results for each category. For example, here is what the results for the ectopic peaks look like: ```{r} head(ame_by_binding$ectopic, 5) ``` ### Visualizing AME results The `plot_ame_heatmap()` function provides a quick way to visualize AME results. It is built on top of `ggplot2`, so all `ggplot2` functions can be used to further modify the plot. By default, it uses the -log10(adjusted p-value) as the heat values. See the documentation (`?plot_ame_heatmap`) for additional notes on customization. We can visualize the top 10 hits from the ectopic binding results as follows: ```{r, fig.height=3.5, fig.width=7} ame_by_binding$ectopic %>% dplyr::filter(rank %in% 1:10) %>% plot_ame_heatmap(group_name = "Ectopic Sites") + ggtitle("Top 10 AME Hits in Ectopic Sites") ``` To plot results from multiple runs together, they must first be joined into a single data frame. The `ame_by_binding` object is a list whose names correspond to the E93 binding category. The list can be combined into a data.frame using `dplyr::bind_rows`. Setting `.id = "binding_type"` creates a new column `binding_type` that contains the names from the `ame_by_binding` list. In this way, the `ame_res` data.frame contains all AME results for each run, which can be distinguished by the `binding_type` column. ```{r} ame_res <- ame_by_binding %>% dplyr::bind_rows(.id = "binding_type") ``` It is possible to aggregate results from multiple runs into a heatmap by setting the `group` parameter in `plot_ame_heatmap()`. This will stratify the y-axis by the entries in the column passed to `group`. This results in clustering motifs with shared hits across categories along the x-axis for quickly visualizing motifs that may be shared or distinct between multiple groups. ```{r, fig.height=5,fig.width=15} ame_res %>% plot_ame_heatmap(group = binding_type) ``` ### Reducing redundant motif hits Another key consideration for the above visualization is that in the FlyFactorSurvey database we used, different TFs can have multiple motif entries in the database which are all detected separately by AME. Here, when returning the top 5 hits from each group, you can see, for example, that a motif matching "Aef1" is reported 2 times within the top 5 hits of the ectopic sites. In this situation, it makes sense to summarize the data at the TF level, instead of the motif level. **Note:** There may be exceptions to this if, for example, a TF has multiple DNA binding sequences it can recognize, in which case having multiple hits may reflect a biological property of your sequences. You will have to handle this on a case-by-case basis for interesting hits and different motif databases. Here, we can see that at least for Aef1, the consensus sequences are very similar. ```{r} ame_res %>% dplyr::filter(binding_type == "ectopic", rank %in% 1:5) %>% head(5) %>% dplyr::select(binding_type, rank, motif_id, motif_alt_id, consensus) ``` How to solve this problem will vary with different motif databases (For details on how to pre-process a motif database, see `vignette("tidy_motifs")`). In the tidy version of the FlyFactorSurvey database, the `altname`s of the motifs are set to the transcription factor gene symbol. This information is included in the AME results as the `motif_alt_id` column. In order to reduce the occurrance of redundant hits in the heatmap, we select the hit for each TF with the lowest p-value (i.e. the most significant hit) as follows: ```{r, fig.height=3, fig.width=12} ame_res %>% # perform the next dplyr operation on each TF within each binding type dplyr::group_by(binding_type, motif_alt_id) %>% # within each binding type, select the TF hit with the lowest adjusted p-value dplyr::filter(adj.pvalue == min(adj.pvalue)) %>% plot_ame_heatmap(group = binding_type, id = motif_alt_id) + labs(y = "Binding Category", x = "Transcription Factor Motif") ``` ### AME Heatmap Visualization If you have read `vignette("core_ame")`, the "normalized rank" heatmap was introduced as an alternative visualization method useful when AME produces very large numbers of hits, or when p-values are on very different scales between groups. The `ame_compare_heatmap_methods()` function can be used to visualize why. Below is a comparison of the distribution of values when using `-log10(adj.pvalue)` (A) vs normalized ranks (B). Because there are relatively few hits in the results (~30), and the number of hits between groups varies more than the `-log10(p-value)` distributions, the "normalize" method will produce a misleading heatmap relative to the `-log10(p-value)` map. ```{r, fig.height=3, fig.width=7.5} ame_res %>% dplyr::group_by(binding_type, motif_alt_id) %>% dplyr::filter(adj.pvalue == min(adj.pvalue)) %>% ame_compare_heatmap_methods(group = binding_type) ``` A third option for adjusting this visualization is to cap the heatmap scale values at a certain value. A good rule of thumb for selecting this cutoff is to view the cumulative distribution plot above in (A), and select a value on the x-axis which captures a majority of the data. Here, we see that `-log10(adj.pvalue) < 10` captures >90% of the entopic & ectopic hits, but only ~50% of the orphan hits. By selecting 10 as the heatmap scale cap, we can improve the dynamic range of the signal for values <10, while preserving the information that many orphan sites generally have higher scores than the other categories. The heatmap scale can be capped at 10 by setting `scale_max = 10`. Below is a comparison of the 3 different visualizations. Notice how the intensity of the orphan vs other categories hits changes. ```{r, fig.height=3, fig.width=12} best_ame_hits <- ame_res %>% dplyr::group_by(binding_type, motif_alt_id) %>% dplyr::filter(adj.pvalue == min(adj.pvalue)) pval_heatmap <- best_ame_hits %>% plot_ame_heatmap(group = binding_type, id = motif_alt_id) + labs(x = NULL, title = "-log10(adj.pvalue)") norm_heatmap <- best_ame_hits %>% plot_ame_heatmap(group = binding_type, id = motif_alt_id, value = "normalize") + labs(x = NULL, title = "normalize") pval_scaled_heatmap <- best_ame_hits %>% plot_ame_heatmap(group = binding_type, id = motif_alt_id, scale_max = 10) + labs(x = NULL, title = "-log10(adj.pvalue) (scale capped at 10)") ``` ```{r, fig.height=9, fig.width=12} cowplot::plot_grid(pval_heatmap, norm_heatmap, pval_scaled_heatmap, ncol = 1, labels = "AUTO") ``` ## *De-novo* motif similarity by binding Discovery of *de-novo* motifs within binding categories can be another way to identify meaningful motifs that distinguish between ectopic, entopic, and orphan sites which does not rely on known motif information. The MEME tool, Dreme, discovers short *de-novo* motifs in input sequences. We begin the analysis by searching for 5 *de-novo* motifs per binding category and shuffling the input sequences to use as the control set. We pass the same `by_binding` list as the input to `runDreme` which will run Dreme on each set of sequences independently. The results are returned in list format, which we combine into a data.frame using `dplyr::bind_rows` as above. ```{r dreme_by_binding, eval=F} dreme_by_binding <- by_binding %>% runDreme("shuffle", nmotifs = 5) %>% dplyr::bind_rows(.id = "binding_type") ``` ```{r} # The above code chunk takes a long time to run. # memes is packaged with the results of this run in the "example_dreme_by_binding" dataset # which can be loaded as follows: data("example_dreme_by_binding", package = "memes") dreme_by_binding <- example_dreme_by_binding %>% dplyr::bind_rows(.id = "binding_type") ``` Next, we want to compare the de-novo motifs between each binding category to identify motifs which could distinguish the three groups. One way to interrogate this is to examine the correlation score between each motif. Rename the motifs to indicate the binding category they were discovered in (for visualization purposes). ```{r} dreme_by_binding_renamed <- dreme_by_binding %>% dplyr::mutate(name = paste(binding_type, seq, sep = "_")) %>% # update_motifs updates the information in the special `motif` column update_motifs() ``` Create the correlation heatmap: ```{r, fig.height=5, fig.width=8} # Set the color values for the heatmap scale cols <- colorRampPalette(c("white", "dodgerblue4"))(255) # This is for adding the colored annotation blocks indicating group membership # to the heatmap anno.df <- dreme_by_binding_renamed %>% dplyr::select(name, binding_type) %>% tibble::remove_rownames() %>% tibble::column_to_rownames("name") dreme_by_binding_renamed %>% # Convert to universalmotif format to_list() %>% # Compute the pearson correlation for each motif with all other motifs universalmotif::compare_motifs(method = "PCC") %>% # Plot the correlation matrix along with the annotations pheatmap::pheatmap(color = cols, # This sets the heatmap range to be from 0-1 breaks = seq(0, 1, by = 1/255), annotation_col = anno.df, # the cutree options are just cosmetic to add some spacing # between some of the clusters cutree_rows = 6, cutree_cols = 6, show_colnames = FALSE) ``` This heatmap reveals that most motifs discovered in each group are highly similar to the motifs found in other groups. ### Test *de-novo* motif enrichment using AME The above analysis suggests the motif content of the different binding categories are highly similar in sequence composition. To extend these analyses, we can use AME to test for motif enrichment of the *de-novo* discovered motifs within each binding category, and determine whether the motifs detected in one category are indeed enriched in another. To do this, we can provide the *de-novo* motifs as the AME database to test for their enrichment in each sequence category. `runAme()` allows using a `runDreme()` results object as the `database` input by passing it within a `list()`. Naming the `list()` entry produces an informative `motif_db` name in the results data.frame. ```{r ame_by_binding} ame_denovo_by_binding <- by_binding %>% runAme(database = list("denovo_binding_motifs" = dreme_by_binding_renamed)) %>% dplyr::bind_rows(.id = "binding_type") ``` Plotting the heatmap of results reveals that indeed a majority of the *de-novo* motifs discovered within a single category are detected in all 3 categories, supporting the conclusion that orphan, ectopic, and entopic sites are highly similar in sequence content. ```{r, fig.height=4, fig.width=10} ame_denovo_by_binding %>% plot_ame_heatmap(group = binding_type, scale_max = 10) ``` However, there are 2 interesting motifs which distinguish orphan and ectopic sites from entopic sites. To help identify which TFs these motifs might belong to, we can use TomTom to match them to known TF motifs. First, we select the motif id's that are not found in entopic sites. ```{r} entopic_motifs <- ame_denovo_by_binding %>% dplyr::filter(binding_type == "entopic") %>% dplyr::pull(motif_id) ame_denovo_binding_unique <- ame_denovo_by_binding %>% dplyr::filter(!(motif_id %in% entopic_motifs)) ``` Next, we use the motif id's from the unique AME results to select those entries in the Dreme results object, and run TomTom on that subset. ```{r} dreme_by_binding_unique <- dreme_by_binding_renamed %>% dplyr::filter(name %in% ame_denovo_binding_unique$motif_id) %>% runTomTom(dist = "ed") ``` Finally, we visualize the TomTom results to identify candidate TFs driving the presence of ectopic and orphan sites. ```{r, fig.height=4, fig.width=12} dreme_by_binding_unique %>% view_tomtom_hits(3) %>% cowplot::plot_grid(plotlist = ., nrow = 1, labels = "AUTO", byrow = TRUE) ``` ## Motifs in opening vs closing sites Another category of ChIP peaks in this dataset are ones that overlap regions with dynamic changes to chromatin accessibility in response to E93 expression. These peaks are annotated in the `e93_sensitive_behavior` column which indicates whether sites have higher accessibility after ectopic E93 expression ("Increasing"), lower accessibility after ectopic E93 expression ("Decreasing"), or if the accessibility does not change ("Static"). To identify motifs associated with Increasing or Decreasing accessibility in response to ectopic E93 expression, we first grab the DNA sequences of the Increasing, Decreasing, and Static sites by splitting the ChIP peaks on the `e93_sensitive_behavior` column. ```{r} # split by response to E93 binding by_sens <- summit_flank %>% split(mcols(.)$e93_sensitive_behavior) %>% get_sequence(dm.genome) ``` Proper selection of the control set of sequences is critical to identifying biologically relevant motifs. Because we are interested in motifs associated with Increasing or Decreasing behavior, we want to exclude detection of motifs that are simply associated with E93 binding. To do this, we can use the "Static" category of E93 ChIP peaks, because these theoretically represent E93 binding sites without motifs that influence chromatin accessibility. To start, we will search for *de-novo* motifs within dynamic E93 binding sites using Dreme. The `by_sens` list has three entries: `Increasing`, `Decreasing`, and `Static`. When calling `runDreme` on a sequence list, you can pass the name of a list entry as the `control` argument to use that set as the background sequences for the remaining sets of sequences. Below we search for *de-novo* motifs in Increasing and Decreasing peaks using the Static peaks as the background set. ```{r dreme_by_sens_vs_static} # By setting "Static" as the control, runDreme will run: # - Increasing vs Static # - Decreasing vs Static dreme_by_sens_vs_static <- runDreme(by_sens, "Static") ``` The results are returned as a list with two entries: `Increasing`, and `Decreasing`. These can be joined into a single data.frame using `dplyr::bind_rows`. ```{r} dreme_results <- dreme_by_sens_vs_static %>% dplyr::bind_rows(.id = "e93_response") ``` Next, we match the *de-novo* motifs with known TF motifs using TomTom. The TomTom results will be added as new columns to the input data. TomTom data columns begin with `best_match_` summarizing the top hit, and the `tomtom` column is a nested data.frame storing the full results for each input motif. ```{r} denovo_dynamic_motifs <- dreme_results %>% runTomTom() ``` We can visualize some of the top TomTom hits for each *de-novo* motif using `view_tomtom_hits()`. Passing a number to `view_tomtom_hits()` shows that many matches ranked in descending order. Passing no value will show all hits. It's a good idea to inspect all hits visually. This can be an extremely revealing step, as it will give you insight into which motifs are generally similar to your input motif. You may also discover proteins with identical motifs, in which case you may need to decide which TF is the better candidate. For brevity, I'll only show the top 3 hits for each motif. ```{r, fig.height=8, fig.width=8.5} denovo_dynamic_motifs %>% # Plot the top 3 matches view_tomtom_hits(3) %>% # Arrange into figure panels # You don't need to do this if you're just exploring your data, I do it here to make a pretty figure cowplot::plot_grid(plotlist = ., labels = "AUTO") ``` After manually inspecting the above hits, we may decide that a lower-ranked motif is a better match by eye, or because of domain-specific knowledge that makes a different TF a better candidate. We can swap the best matched motif using `force_best_match()`. This will update the `best_match_` columns and re-rank the `tomtom` nested data column to set a lower ranked match to the top hit. The updates `best_match_` columns will reflect the values corresponding to the new match. As an example, I'll set the top hit for `m03_AKGG` to the lower ranked `pho_SANGER_10` since it has weaker consensus sequence in the regions that don't overlap the *de-novo* motif. ```{r} # use force_best_match to use update the best match info # %<>% is a shortcut from magrittr for updating the data in-place denovo_dynamic_motifs %<>% force_best_match(c("m03_AKGG" = "pho_SANGER_10")) ``` Replotting the top hits, you can see that "pho_SANGER_10" is now listed as the top hit for "m03_AKGG". ```{r, fig.height=4.5, fig.width=8.5} denovo_dynamic_motifs %>% view_tomtom_hits(1) %>% cowplot::plot_grid(plotlist = ., labels = "AUTO") ``` Finally, we can make publication quality figures using cowplot. This code is a little complicated, but hopefully gives an idea for how you can use R graphics to generate nice motif figures! A striking result from this is that the *de-novo* motifs associated with Decreasing sites match the E93 motif itself. ```{r, fig.height=3, fig.width=8.5} denovo_dynamic_motifs %>% # Create a new label that indicates whether the motif is found in Increasing or Decreasing sites dplyr::mutate(label = paste0(e93_response, " in response to E93")) %>% # Sort the top hits first dplyr::arrange(rank) %>% # Run the plot command separately for Increasing and Decreasing sets split(.$label) %>% # For each set: purrr::imap(~{ # Plot the top hit for each motif in the set as a panel figure top_hits <- view_tomtom_hits(.x, 1) %>% cowplot::plot_grid(plotlist = ., nrow = 1, labels = "AUTO") # Create a figure title from the label we created above title <- cowplot::ggdraw() + cowplot::draw_text(.y) # Combine the title & the motif hits figure cowplot::plot_grid(plotlist = list(title, top_hits), ncol = 1, rel_heights = c(0.1, 1) ) }) ``` ## Scanning for motif matches using FIMO Because we discover a *de-novo* motif from DREME that matches to E93 in Decreasing sites, this suggests that the E93 motifs in Decreasing sites have different properties relative to E93 motifs in Increasing or Static sites. For example, the E93 motif may have higher similarity to the canonical sequence, may be present in greater number, or may have different positioning in Decreasing sites relative to Increasing or Static sites. To test these questions directly, we can identify individual E93 motif matches within each sequence and investigate their properties. The FIMO tool is used to identify the positions of motif matches in a set of sequences. It will return the coordinates, sequence, and a score (higher values are more similar to the consensus) for each motif found in the input sequences. In order to scan for the E93 motif within target sequences, we first need a copy of the E93 motif. Although we could import the motifs from our local meme database, it is also possible to use motifs pulled from a [`MotifDb`](https://bioconductor.org/packages/release/bioc/html/MotifDb.html) query as follows. I'll use that approach here as an example, but if this were a real analysis, I'd take the result from our local database. ```{r} e93_flyfactor <- MotifDb::MotifDb %>% # Query the database for the E93 motif using it's gene name MotifDb::query("Eip93F") %>% # Convert from motifdb format to universalmotif format universalmotif::convert_motifs() %>% # The result is a list, to simplify the object, return it as a single universalmotif .[[1]] # Rename the motif from it's flybase gene number to a more user-friendly name e93_flyfactor["name"] <- "E93_FlyFactor" # Display the cleaned motif e93_flyfactor ``` Just like `runDreme()` and `runAme()`, `runFimo()` takes a `Biostrings::XStringSet` as input. Using `get_sequence()` to generate these for DNA sequences is preferred because it creates sequence inputs that can be parsed into genomic coordinates by Fimo. ```{r} fimo_res <- summit_flank %>% get_sequence(dm.genome) %>% runFimo(motifs = e93_flyfactor, thresh = 1e-3) ``` The results of `runFimo()` are returned as a `GRanges` object containing the positions of each motif discovered in the input sequences. The best way to integrate these data with our input peaks is to use the `plyranges` suite of tools for performing overlaps and joins between `GRanges` objects. ```{r} fimo_res ``` ### Counting the number of motifs per peak `plyranges` extends `dplyr`-like syntax to range objects. Here we add a count for each motif per peak in the `n_motifs` column. We also add a column `has_motif` which will be a binary indicator of whether a peak contains any motifs. ```{r} summit_flank %<>% # Note, if running FIMO with multiple motifs, this solution will not work # as it will count all motifs within the fimo-results without splitting by motif_id plyranges::mutate(n_motifs = plyranges::count_overlaps(., fimo_res), has_motif = n_motifs > 0) ``` First, we want to determine whether E93 sensitive sites are more likely to have E93 motifs in certain response types. Here we can see that sensitive decreasing sites are more likely to have E93 motifs than sensitive increasing or insensitive static sites. ```{r, fig.height=4, fig.width=5} summit_flank %>% data.frame %>% dplyr::mutate(has_match = dplyr::if_else(has_motif, "Match", "No Match")) %>% ggplot(aes(e93_sensitive_behavior)) + geom_bar(aes(fill = forcats::fct_rev(has_match)), position = "fill") + scale_fill_manual(values = c("Match" = "firebrick", "No Match" = "Black")) + labs(fill = "E93 Motif Match", y = "Fraction of Sites", x = "Response to E93 binding") ``` To investigate whether E93-sensitive sites that increase, decrease, or remain static have different numbers of E93 motifs, we plot the fraction of sites with each number of motifs per group. Here we can see that in addition to being more likely to contain an E93 motif, sensitive decreasing sites are more likely to to contain 2 or more matches, where 10% contain at least 2 motifs. ```{r, fig.height=5, fig.width=7} summit_flank %>% # currently, group operations are faster as data.frames, so we convert to data.frame data.frame %>% dplyr::group_by(e93_sensitive_behavior, n_motifs) %>% dplyr::count() %>% dplyr::group_by(e93_sensitive_behavior) %>% dplyr::mutate(frac = n/sum(n)) %>% ggplot(aes(n_motifs, frac)) + geom_line(aes(color = e93_sensitive_behavior), size = 1) + labs(y = "Fraction of Sites", x = "Number of E93 Motifs", color = "Response to E93 Binding") + theme_bw() ``` Finally, we want to assess whether the quality of E93 motifs is different between sensitivity categories. To examine this, we need to determine which motifs are found in which peaks. We use `plyranges::join_overlap_intersect` to return motif entries appended with peak-level metadata, like the peak id each motif is found within. ```{r} # return position of each motif match w/ peak metadata intersect <- fimo_res %>% plyranges::join_overlap_intersect(summit_flank) ``` We use the FIMO `score` as a proxy for quality, where higher scores are better matches to the motif. Here we examine only the best match (highest score) motif within each peak. ```{r} best_motifs <- intersect %>% # group by ChIP peak id plyranges::group_by(id) %>% # Keep only motifs w/ the highest score within a peak plyranges::filter(score == max(score)) %>% plyranges::ungroup() ``` Finally, visualize the FIMO scores for the best E93 motif within each ChIP peak across categories. This reveals that Decreasing sites appear to have higher FIMO scores, suggesting the E93 motifs within Decreasing sites are higher quality E93 motifs than those in Increasing or Static sites. ```{r, fig.height=4, fig.width=3} best_motifs %>% data.frame %>% ggplot(aes(e93_sensitive_behavior, score)) + geom_boxplot(aes(fill = e93_sensitive_behavior), notch = TRUE, width = 0.5) + guides(fill = "none") + labs(x = "Response to E93", y = "FIMO Score") + theme_bw() ``` We can test whether the distribution of FIMO scores between multiple groups is statistically significantly different using an [Anderson-Darling test](https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test). For a pairwise comparison (in case you only have 2 groups to test), a [Kolmogorov-Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) is commonly used instead. The `PMCMRplus::adAllPairsTest()` performs the Anderson-Darling test and corrects for multiple testing. This reveals that Decreasing FIMO scores are significantly different from Increasing and Static scores (p < 0.05), but Increasing and Static scores are not significantly different from eachother (p > 0.05), supporting the conclusion that E93 motifs in Decreasing sites are higher quality than those in Increasing or Static sites. ```{r} best_motifs %>% data.frame %>% dplyr::mutate(behavior = factor(e93_sensitive_behavior)) %>% PMCMRplus::adAllPairsTest(score ~ behavior, ., p.adjust.method = "fdr") ``` We can visualize the differences in E93 motif quality between groups by reconstructing PCM's from the matched motif sequences. Here, it is clear that although each category contains an E93 motif match, the sequence content of those motifs differs between categories, providing a nice visual representation of the statistical analysis above. ```{r, fig.height=6, fig.width=5} best_motifs %>% # Get sequences of each matched motif add_sequence(dm.genome) %>% # Split by E93 response category split(mcols(.)$e93_sensitive_behavior) %>% # Convert from GRangesList to list() to use `purrr::imap` as.list() %>% # imap passes the list entry as .x and the name of that object to .y # here, .y holds the e93_sensitive_behavior name, and .x holds the motif positions purrr::imap(~{ # This builds a new motif using the matched sequences from each group create_motif(.x$sequence, # Append the response category to the motif name name = paste0("E93_", .y)) }) %>% # Prepend E93 Fly Factor motif to beginning of motif list so it is plotted at the top c(list("E93_FlyFactor" = e93_flyfactor), .) %>% # Plot each motif view_motifs() ``` We can also build a more complex representation of these data by making a heatmap of the sequence variability for each matched sequence using `plot_sequence_heatmap()`. This function will plot each sequence as a row of a heatmap where each column represents the position relative to the motif start coordinate. The cells are colored by the DNA sequence found at that position for each sequence. Using this type of plot it becomes even more clear how variability at position 6 differs between Decreasing and Increasing sites, where Decreasing sites have a very consistent "C" at position 6, while Increasing sites are more likely to have non-C bases at this position. ```{r, fig.height=5.25, fig.width=8} best_motifs %>% # Get sequences of each matched motif add_sequence(dm.genome) %>% data.frame %>% # Split by E93 response category split(.$e93_sensitive_behavior) %>% # Extract the "sequence" column from each category purrr::map("sequence") %>% # Plot a heatmap for each category plot_sequence_heatmap(title_hjust = 0.5) %>% # Arrange into a grid cowplot::plot_grid(plotlist = ., nrow = 1) ``` ### Centrality of E93 motif Next we want to visualize whether the E93 motif has altered positioning across the response categories. To do this we add the metadata for each nearest motif to our peak summits. Setting `distance = TRUE` in `plyranges::join_nearest` adds a `distance` column indicating the distance to the nearest joined region. ```{r} summit_nearest_e93_motif <- summit_flank %>% plyranges::anchor_center() %>% plyranges::mutate(width = 1) %>% plyranges::join_nearest(fimo_res, distance = TRUE) ``` We can visualize these data as a cumuluative distribution plot. In this example, it does appear that decreasing sites tend to be closer to the peak summit than increasing or static sites. ```{r} summit_nearest_e93_motif %>% data.frame %>% # Limit our analysis to peaks that contain motifs dplyr::filter(has_motif == TRUE) %>% ggplot(aes(distance)) + stat_ecdf(aes(color = e93_sensitive_behavior), size = 1, pad = FALSE) + labs(x = "Distance to nearest E93 Motif", y = "Fraction of Sites", color = "E93 Response") + theme_linedraw() ``` ```{r, include=F, eval=F, echo=F} summit_nearest_e93_motif %>% data.frame %>% dplyr::filter(has_motif == TRUE) %>% dplyr::mutate(behavior = factor(e93_sensitive_behavior)) %>% # Anderson-darling test for multiple distributions PMCMRplus::adAllPairsTest(distance ~ behavior, ., p.adjust.method = "fdr") ``` # Conclusion This is just one example of how `memes` can integrate multiple data sources to perform deep analysis of genomics data. Here we went from a set of annotated ChIP peaks do multifaceted interrogation of motif content. Each approach has advantages and disadvantages. The most important thing is to critically evaluate the data at every step: visualize *de-novo* motif hits with `view_motifs`, compare matches using `view_tomtom_hits`, and think carefully about which sequences to use as control regions. # References Nystrom SL, Niederhuber MJ, McKay DJ. Expression of E93 provides an instructive cue to control dynamic enhancer activity and chromatin accessibility during development. Development 2020 Mar 16;147(6). # Session Info ```{r} sessionInfo() ```