--- title: "Spectrum Motif Analysis (SPMA)" author: "Konstantin Krismer" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Spectrum Motif Analysis (SPMA)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- In order to investigate how RBP targets are distributed across a spectrum of transcripts (e.g., all transcripts of a platform, ordered by fold change), Spectrum Motif Analysis visualizes the distribution of putative RBP binding sites across all transcripts. # Analysis ```{r} library(transite) ``` Load example data set from `transite` package, a data frame with 1000 rows and the following columns: * `refseq`: RefSeq identifiers * `value`: signal-to-noise ratios between treatment and control samples (sorting criterion) * `seq_type`: specifying the type of sequence in `seq` column; either 3'-UTR, 5'-UTR, or entire mRNAs (including 5'-UTRs, coding regions, and 3'-UTRs) * `seq`: sequence ```{r, message=FALSE} background_df <- transite:::ge$background_df ``` Sort sequences (i.e., transcripts) by ascending signal-to-noise ratio. Transcripts upregulated in control samples are at the beginning of list, transcripts upregulated in treatment samples at the end. ```{r, message=FALSE} background_df <- dplyr::arrange(background_df, value) ``` The DNA sequences in the data frame are converted to RNA sequences. Motifs in the Transite motif database are RNA-based. ```{r, message=FALSE} background_set <- gsub("T", "U", background_df$seq) ``` Prepare named character vector of presorted sequences for `run_matrix_spma`. The function expects a character vector containing sequences, sorted in a meaningful way, named according to the following format: `[REFSEQ]|[SEQ_TYPE]` The name is only important if results are cached. It is used as the key in a dictionary that stores the number of putative binding sites for each RefSeq identifier and sequence type. ```{r, message=FALSE} names(background_set) <- paste0(background_df$refseq, "|", background_df$seq_type) ``` In this example we limit our analysis to an arbitrarily selected motif in the motif database in order to reduce run-time. ```{r, message=FALSE} motif_db <- get_motif_by_id("M178_0.6") ``` Matrix-based SPMA is executed: ```{r, message=FALSE} results <- run_matrix_spma(background_set, motifs = motif_db, cache = FALSE) # Usually, all motifs are included in the analysis and results are cached to make subsequent analyses more efficient. # results <- run_matrix_spma(background_set) ``` # Results Matrix-based SPMA returns a number of result objects, combined in a list with the following components: * `foreground_scores` * `background_scores` * `enrichment_dfs` * `spectrum_info_df` * `spectrum_plots` * `classifier_scores` In the following, we plot the spectrum plot, showing how putative binding sites are distributed across the spectrum of transcripts. (`spectrum_plots`). In addition, we show statistics describing the spectrum plot (`spectrum_info_df`). ```{r, results='asis', echo=FALSE, fig.width=10, fig.height=7} cat("\n\n####", results$spectrum_info_df$motif_rbps, " (", results$spectrum_info_df$motif_id, ")\n\n", sep = "") cat("\n\n**Spectrum plot with polynomial regression:**\n\n") grid::grid.draw(results$spectrum_plots[[1]]) cat("\n\n**Classification:**\n\n") if (results$spectrum_info_df$aggregate_classifier_score == 3) { cat('\n\n

spectrum classification: non-random (3 out of 3 criteria met)

\n\n') } else if (results$spectrum_info_df$aggregate_classifier_score == 2) { cat('\n\n

spectrum classification: random (2 out of 3 criteria met)

\n\n') } else if (results$spectrum_info_df$aggregate_classifier_score == 1) { cat('\n\n

spectrum classification: random (1 out of 3 criteria met)

\n\n') } else { cat('\n\n

spectrum classification: random (0 out of 3 criteria met)

\n\n') } cat("\n\nProperty | Value | Threshold\n") cat("------------- | ------------- | -------------\n") cat("adjusted $R^2$ | ", round(results$spectrum_info_df$adj_r_squared, 3), " | $\\geq 0.4$ \n") cat("polynomial degree | ", results$spectrum_info_df$degree, " | $\\geq 1$ \n") cat("slope | ", round(results$spectrum_info_df$slope, 3), " | $\\neq 0$ \n") cat("unadjusted p-value estimate of consistency score | ", round(results$spectrum_info_df$consistency_score_p_value, 7), " | $< 0.000005$ \n") cat("number of significant bins | ", results$spectrum_info_df$n_significant, " | ", paste0("$\\geq ", floor(40 / 10), "$"), " \n\n") ``` Visual inspection of the spectrum plot as well as the classification and the underlying properties suggest that the putative binding sites of this particular RNA-binding protein are distributed in a random fashion across the spectrum of transcripts. There is no evidence that this RNA-binding protein is involved in modulating gene expression changes between treatment and control group. # Additional information Most of the functionality of the Transite package is also offered through the Transite website at https://transite.mit.edu. For a more detailed discussion on Spectrum Motif Analysis and Transite in general, please have a look at the Transite paper: **Transite: A computational motif-based analysis platform that identifies RNA-binding proteins modulating changes in gene expression** Konstantin Krismer, Molly A. Bird, Shohreh Varmeh, Erika D. Handly, Anna Gattinger, Thomas Bernwinkler, Daniel A. Anderson, Andreas Heinzel, Brian A. Joughin, Yi Wen Kong, Ian G. Cannell, and Michael B. Yaffe Cell Reports, Volume 32, Issue 8, 108064; DOI: https://doi.org/10.1016/j.celrep.2020.108064