--- title: "RcisTarget: Transcription factor binding motif enrichment" package: "`r pkg_ver('RcisTarget')`" abstract: > This tutorial shows how to use **RcisTarget** to obtain transcription factor binding motifs enriched on a gene list. vignette: > %\VignetteIndexEntry{RcisTarget: Transcription factor binding motif enrichment} %\VignetteEngine{knitr::rmarkdown} output: BiocStyle::html_document: toc: yes number_sections: false css: corrected.css pdf_document: toc: yes html_notebook: toc: yes --- ```{r libraries, echo=FALSE, message=FALSE, warning=FALSE} suppressPackageStartupMessages({ library(RcisTarget) library(RcisTarget.hg19.motifDBs.cisbpOnly.500bp) library(DT) library(data.table) #require(visNetwork) }) ``` # What is RcisTarget? RcisTarget is an R-package to identify transcription factor (TF) binding motifs over-represented on a gene list. RcisTarget is based on the methods previously implemented in [i-cisTarget](http://gbiomed.kuleuven.be/apps/lcb/i-cisTarget) (web interface, region-based) and [iRegulon](http://iregulon.aertslab.org) (Cytoscape plug-in, gene-based). If you use RcisTarget in your research, please cite: ```{r citation, echo=FALSE} print(citation("RcisTarget")[1], style="textVersion") ``` # Overview of the workflow The function `cisTarget()` allows to perform the motif-enrichment analysis on a gene list. The main input parameters are the gene list and the motif databases, which should be chosen depending on the organism and the search space around the TSS of the genes. This is a sample on how to run the analysis (see the following sections for details): ```{r overview, eval=FALSE} library(RcisTarget) # Load gene sets to analyze. e.g.: geneList1 <- c("gene1", "gene2", "gene3") geneLists <- list(geneListName=geneList1) # geneLists <- GSEABase::GeneSet(genes, setName="geneListName") # alternative # Select motif database to use (i.e. organism and distance around TSS) data(motifAnnotations_hgnc) motifRankings <- importRankings("hg19-500bp-upstream-7species.mc9nr.feather") # Motif enrichment analysis: motifEnrichmentTable_wGenes <- cisTarget(geneLists, motifRankings, motifAnnot=motifAnnotations_hgnc) ``` **Advanced use**: The `cisTarget()` function is enough for most simple analyses. However, for further flexibility (e.g. removing unnecessary steps on bigger analyses), RcisTarget also provides the possibility to run the inner functions individually. Running `cisTarget()` is equivalent to running this code: ```{r overviewAdvanced, eval=FALSE} # 1. Calculate AUC motifs_AUC <- calcAUC(geneLists, motifRankings) # 2. Select significant motifs, add TF annotation & format as table motifEnrichmentTable <- addMotifAnnotation(motifs_AUC, motifAnnot=motifAnnotations_hgnc) # 3. Identify significant genes for each motif # (i.e. genes from the gene set in the top of the ranking) # Note: Method 'iCisTarget' instead of 'aprox' is more accurate, but slower motifEnrichmentTable_wGenes <- addSignificantGenes(motifEnrichmentTable, geneSets=geneLists, rankings=motifRankings, nCores=1, method="aprox") ``` # Before starting ## Setup RcisTarget uses species-specific databases which are provided as independent R-packages. Prior to running RcisTarget, you will need to download and install the databases for the relevant organism (more details in the "Motif databases" section). In addition, some extra packages can be installed to run RcisTarget in parallel or run the interactive examples in this tutorial: ```{r installDep, eval=FALSE} source("http://bioconductor.org/biocLite.R") # To support paralell execution: biocLite(c("doMC", "doRNG")) # For the examples in the follow-up section of the tutorial: biocLite(c("DT", "visNetwork")) ``` ## Some tips... ### Help At any time, remember you an access the help files for any function (i.e. `?cisTarget`), and the other tutorials in the package with the following commands: ```{r vignette, eval=FALSE} # Explore tutorials in the web browser: browseVignettes(package="RcisTarget") # Commnad line-based: vignette(package="RcisTarget") # list vignette("RcisTarget") # open ``` ### Reports To generate an HTML report with your own data and comments, you can use the `.Rmd file` of this tutorial as template (i.e. copy the .Rmd file, and edit it as [R notebook](http://rmarkdown.rstudio.com/r_notebooks.html) in RStudio). ```{r editRmd, eval=FALSE} vignetteFile <- paste(file.path(system.file('doc', package='RcisTarget')), "RcisTarget.Rmd", sep="/") # Copy to edit as markdown file.copy(vignetteFile, ".") # Alternative: extract R code Stangle(vignetteFile) ``` # Running RcisTarget ## Input: Gene sets The main input to RcisTarget is the gene set(s) to analyze. The gene sets can be provided as a 'named list' in which each element is a gene-set (character vector containing gene names) or as a `GSEABase::GeneSet`. The gene-set name will be used as ID in the following steps. ```{r geneSets} library(RcisTarget) geneSet1 <- c("gene1", "gene2", "gene3") geneLists <- list(geneSetName=geneSet1) # or: # geneLists <- GSEABase::GeneSet(geneSet1, setName="geneSetName") ``` Some extra help: ```{r geneSetsFormat, eval=FALSE} class(geneSet1) class(geneLists) geneSet2 <- c("gene2", "gene4", "gene5", "gene6") geneLists[["anotherGeneSet"]] <- geneSet2 names(geneLists) geneLists$anotherGeneSet lengths(geneLists) ``` In this example we will be using a list of genes up-regulated in MCF7 cells under hypoxia conditions ([PMID:16565084](https://www.ncbi.nlm.nih.gov/pubmed/?term=16565084)). The original work highlighted the role of hypoxia-inducible factor (HIF1-alpha or HIF1A) pathways under hypoxia. This gene list is also used for the turorials in iRegulon (http://iregulon.aertslab.org/tutorial.html). ```{r sampleGeneSet} txtFile <- paste(file.path(system.file('examples', package='RcisTarget')), "hypoxiaGeneSet.txt", sep="/") geneLists <- list(hypoxia=read.table(txtFile, stringsAsFactors=FALSE)[,1]) head(geneLists$hypoxia) ``` ## Required databases To analyze the gene list, RcisTarget needs two types of databases: - 1. Gene-motif rankings: which provides the rankings (~score) of all the genes for each motif. - 2. The annotation of motifs to transcription factors. ### Gene-motif rankings The score of each pair of gene-motif can be performed using different parameters. Therefore, we provide multiple gene-rankings, according to several possibilities: - Species: Species of the input gene set. *Available values: Human (Homo sapiens), mouse (Mus musculus) or fly (Drosophila melanogaster)* - Scoring/search space: determine the search space around the transcription start site (TSS). *Available values: 500 bp uptream the TSS, 5kbp or 10kbp around the TSS (e.g. 10kbp upstream and 10kbp downstream of the TSS).* - Number of orthologous species taken into account to score the motifs (e.g. conservation of regions). *Available values: 7 or 10 species* If you don't know which database to choose, we would suggest using the *500bp upstream TSS*, and a larger search space (e.g. *TSS+/-5kbp* or *TSS+/-10kbp*) with 7 species. Of course, selecting Human or Mouse depending on your input gene list. For each ranking there are two files: one with extension `.feather` (which contains the main ranking, aprox 1GB), and one with extension `.descr` which contains the description of the file. You can download the files using your preferred method. For example, to download within R: ```{r downloadDatabases, eval=FALSE} featherURL <- "http://pyscenic.aertslab.org/databases/hg19-500bp-upstream-7species.mc9nr.feather" # .feather download.file(featherURL, destfile=basename(featherURL)) # saved in current dir featherURL <- "http://pyscenic.aertslab.org/databases/hg19-500bp-upstream-7species.mc9nr.descr" # .descr download.file(featherURL, destfile=basename(featherURL)) ``` **Download URLs** for the current databases (version 'mc9nr': including 24453 motifs) [Aprox file size: **1GB**]: Species | Search space | #ort | File ------- | ----- | ----- | ----- | ----- Human | 500bp upstream TSS | 7 | [hg19-500bp-upstream-7species.mc9nr](http://pyscenic.aertslab.org/databases/hg19-500bp-upstream-7species.mc9nr.feather) Human | TSS+/-5kbp | 7 | [hg19-tss-centered-5kb-7species.mc9nr](http://pyscenic.aertslab.org/databases/hg19-tss-centered-5kb-7species.mc9nr.feather) Human | TSS+/-10kbp | 7 | [hg19-tss-centered-10kb-7species.mc9nr](http://pyscenic.aertslab.org/databases/hg19-tss-centered-10kb-7species.mc9nr.feather) Human | 500bp upstream TSS | 10 | [hg19-500bp-upstream-10species.mc9nr](http://pyscenic.aertslab.org/databases/hg19-500bp-upstream-10species.mc9nr.feather) Human | TSS+/-5kbp | 10 | [hg19-tss-centered-5kb-10species.mc9nr](http://pyscenic.aertslab.org/databases/hg19-tss-centered-5kb-10species.mc9nr.feather) Human | TSS+/-10kbp | 10 | [hg19-tss-centered-10kb-10species.mc9nr](http://pyscenic.aertslab.org/databases/hg19-tss-centered-10kb-10species.mc9nr.feather) Mouse | 500bp upstream TSS | 7 | [mm9-500bp-upstream-7species.mc9nr](http://pyscenic.aertslab.org/databases/mm9-500bp-upstream-7species.mc9nr.feather) Mouse | TSS+/-5kbp | 7 | [mm9-tss-centered-5kb-7species.mc9nr](http://pyscenic.aertslab.org/databases/mm9-tss-centered-5kb-7species.mc9nr.feather) Mouse | TSS+/-10kbp | 7 | [mm9-tss-centered-10kb-7species.mc9nr](http://pyscenic.aertslab.org/databases/mm9-tss-centered-10kb-7species.mc9nr.feather) Mouse | 500bp upstream TSS | 10 | [mm9-500bp-upstream-10species.mc9nr](http://pyscenic.aertslab.org/databases/mm9-500bp-upstream-10species.mc9nr.feather) Mouse | TSS+/-5kbp | 10 | [mm9-tss-centered-5kb-10species.mc9nr](http://pyscenic.aertslab.org/databases/mm9-tss-centered-5kb-10species.mc9nr.feather) Mouse | TSS+/-10kbp | 10 | [mm9-tss-centered-10kb-10species.mc9nr](http://pyscenic.aertslab.org/databases/mm9-tss-centered-10kb-10species.mc9nr.feather) Fly | | | Coming soon... (Previous version: [dm6-5kb-upstream-full-tx-11species.mc8nr](http://pyscenic.aertslab.org/databases/dm6-5kb-upstream-full-tx-11species.mc8nr.feather) Once you have both files, they can be loaded with `importRankings()`: ```{r loadDatabases, eval=FALSE} # Search space: 10k bp around TSS - HUMAN motifRankings <- importRankings("hg19-500bp-upstream-7species.mc9nr.feather") # Load the annotation to human transcription factors data(motifAnnotations_hgnc) ``` ### Motif annotations All the calculations performed by RcisTarget are motif-based. However, most users will be interested on TFs potentially regulating the gene-set. The association of motif to transcription factors are provided in an independent file. In addition to the motifs annotated by the source datatabse (i.e. **direct** annotation), we have also **inferred** some further annotations based on motif similarity and gene ortology (e.g. similarity to other genes annotated to the motif). These annotations are typically used with the functions `cisTarget()` or `addMotifAnnotation()`. For the motifs in version 'mc9nr' of the rankings (24453 motifs), these annotations are already included in RcisTarget package and can be loaded with the command: ```{r loadAnnot, eval=TRUE} # mouse: # data(motifAnnotations_mgi) # human: data(motifAnnotations_hgnc) motifAnnotations_hgnc[199:202,] ``` For other versions of the rankings, the function `importAnnotations` allows to import the annotations from the source file. These annotations can be easily queried to obtain further information about specific motifs or TFs: ```{r motifAnnotQuery} motifAnnotations_hgnc[(directAnnotation==TRUE) & (TF %in% c("HIF1A")), ] ``` #### Database example (subset) In addition to the full versions of the databases (20k motifs), we also provide a subset containing only the 4.6k motifs from cisbp (human only: *RcisTarget.hg19.motifDBs.cisbpOnly.500bp*). These subsets are available in Bioconductor for demonstration purposes. They will provide the same AUC score for the existing motifs. However, we highly recommend to use the full version (~20k motifs) for more accurate results, since the normalized enrichment score (NES) of the motif depends on the total number of motifs in the database. To install this package: ```{r installDatabases, eval=FALSE} # From Bioconductor source("http://bioconductor.org/biocLite.R") biocLite("RcisTarget.hg19.motifDBs.cisbpOnly.500bp") ``` For this vignette (demonstration purposes), we will use this database: ```{r loadDatabasesCisbpOnly} library(RcisTarget.hg19.motifDBs.cisbpOnly.500bp) # Rankings data(hg19_500bpUpstream_motifRanking_cispbOnly) motifRankings <- hg19_500bpUpstream_motifRanking_cispbOnly motifRankings # Annotations data(hg19_motifAnnotation_cisbpOnly) motifAnnotations_hgnc <- hg19_motifAnnotation_cisbpOnly ``` ## Running the analysis Once the gene lists and databases are loaded, they can be used by `cisTarget()`. `cisTarget()` runs sequentially the steps to perform the (**1**) motif enrichment analysis, (**2**) motif-TF annotation, and (**3**) selection of significant genes. It is also possible to run these steps as individual commands. For example, to skip steps, for analyses in which the user is interested in one of the outputs, or to optimize the workflow to run it on multiple gene lists (see *Advanced* section for details). ```{r eval=FALSE} motifEnrichmentTable_wGenes <- cisTarget(geneLists, motifRankings, motifAnnot=motifAnnotations_hgnc) ``` ## Advanced: Execute step by step ### 1. Calculate enrichment The first step to estimate the over-representation of each motif on the gene-set is to calculate the Area Under the Curve (AUC) for each pair of motif-geneSet. This is calculated based on the recovery curve of the gene-set on the motif ranking (genes ranked decreasingly by the score of motif in its proximity, as provided in the motifRanking database). ```{r calcAUC, cache=TRUE} motifs_AUC <- calcAUC(geneLists, motifRankings, nCores=1) ``` The AUC is provided as a matrix of Motifs by GeneSets. In principle, the AUC is mostly meant as input for the next step. However, it is also possible to explore the distribution of the scores, for example in a gene-set of interest: ```{r AUChistogram, cache=TRUE, fig.height=5, fig.width=5} auc <- getAUC(motifs_AUC)["hypoxia",] hist(auc, main="hypoxia", xlab="AUC histogram", breaks=100, col="#ff000050", border="darkred") nes3 <- (3*sd(auc)) + mean(auc) abline(v=nes3, col="red") ``` ### 2. Select significant motifs and/or annotate to TFs The selection of significant motifs is done based on the Normalized Enrichment Score (NES). The NES is calculated -for each motif- based on the AUC distribution of all the motifs for the gene-set [(x-mean)/sd]. Those motifs that pass the given threshold (3.0 by default) are considered significant. Furthermore, this step also allows to add the TFs annotated to the motif. ```{r addMotifAnnotation, cache=TRUE} motifEnrichmentTable <- addMotifAnnotation(motifs_AUC, nesThreshold=3, motifAnnot=motifAnnotations_hgnc, highlightTFs=list(hypoxia="HIF1A")) ``` ```{r} class(motifEnrichmentTable) dim(motifEnrichmentTable) head(motifEnrichmentTable[,-"TF_lowConf", with=FALSE]) ``` The cathegories considered high/low confidence can me modified with the arguments `motifAnnot_highConfCat` and `motifAnnot_lowConfCat`. ### 3. Identify the genes with the best enrichment for each Motif Since RcisTarget searches for enrichment of a motif within a gene list, finding a motif 'enriched' does not imply that *all* the genes in the gene-list have a high score for the motif. In this way, the third step of the workflow is to identify which genes (of the gene-set) are highly ranked for each of the significant motifs. There are two methods to identify these genes: (1) equivalent to the ones used in iRegulon and i-cisTarget (`method="iCisTarget"`, recommended if running time is not an issue), and (2) a faster implementation based on an approximate distribution using the average at each rank (`method="aprox"`, useful to scan multiple gene sets). IMPORTANT: Make sure that the **motifRankings** are the **same as in Step 1**. ```{r addSignificantGenes, cache=TRUE} motifEnrichmentTable_wGenes <- addSignificantGenes(motifEnrichmentTable, rankings=motifRankings, geneSets=geneLists) dim(motifEnrichmentTable_wGenes) ``` Plot for a few motifs: ```{r getSignificantGenes, fig.height=7, fig.width=7} geneSetName <- "hypoxia" selectedMotifs <- c("cisbp__M6275", sample(motifEnrichmentTable$motif, 2)) par(mfrow=c(2,2)) getSignificantGenes(geneLists[[geneSetName]], motifRankings, signifRankingNames=selectedMotifs, plotCurve=TRUE, maxRank=5000-20, genesFormat="none", method="aprox") ``` ## Output The final output of RcisTarget is a `data.table` containing the information about the motif enrichment and its annotation organized in the following fields: * geneSet: Name of the gene set * motif: ID of the motif * NES: Normalized enrichment score of the motif in the gene-set * AUC: Area Under the Curve (used to calculate the NES) * TFinDB: Indicates whether the *highlightedTFs* are included within the high confidence annotation (two asterisks) or low confidence annotation (one asterisk). * TF_highConf: Transcription factors annotated to the motif according to 'motifAnnot_highConfCat'. * TF_lowConf: Transcription factors annotated to the motif according to 'motifAnnot_lowConfCat'. * enrichedGenes: Genes that are highly ranked for the given motif. * nErnGenes: Number of genes highly ranked * rankAtMax: Ranking at the maximum enrichment, used to determine the number of enriched genes. ```{r output} motifEnrichmentTable_wGenes_wLogo <- addLogo(motifEnrichmentTable_wGenes) resultsSubset <- motifEnrichmentTable_wGenes_wLogo[1:10,] library(DT) datatable(resultsSubset[,-c("enrichedGenes", "TF_lowConf"), with=FALSE], escape = FALSE, # To show the logo filter="top", options=list(pageLength=5)) ``` # Follow up examples ## TFs annotated to the enriched motifs Note that the TFs are provided based on the motif annotation. They can be used as a guide to select relevant motifs or to prioritize some TFs, but the motif annotation does not imply that all the TFs appearing in the table regulate the gene list. ```{r anotatedTfs, cache=TRUE} anotatedTfs <- lapply(split(motifEnrichmentTable_wGenes$TF_highConf, motifEnrichmentTable$geneSet), function(x) { genes <- gsub(" \\(.*\\). ", "; ", x, fixed=FALSE) genesSplit <- unique(unlist(strsplit(genes, "; "))) return(genesSplit) }) anotatedTfs$hypoxia ``` ## Building a network ```{r network, cache=FALSE, eval=FALSE} signifMotifNames <- motifEnrichmentTable$motif[1:3] incidenceMatrix <- getSignificantGenes(geneLists$hypoxia, motifRankings, signifRankingNames=signifMotifNames, plotCurve=TRUE, maxRank=5000-20, genesFormat="incidMatrix", method="aprox")$incidMatrix library(reshape2) edges <- melt(incidenceMatrix) edges <- edges[which(edges[,3]==1),1:2] colnames(edges) <- c("from","to") ``` *Output not shown:* ```{r visNetwork, eval=FALSE} library(visNetwork) motifs <- unique(as.character(edges[,1])) genes <- unique(as.character(edges[,2])) nodes <- data.frame(id=c(motifs, genes), label=c(motifs, genes), title=c(motifs, genes), # tooltip shape=c(rep("diamond", length(motifs)), rep("elypse", length(genes))), color=c(rep("purple", length(motifs)), rep("skyblue", length(genes)))) visNetwork(nodes, edges) %>% visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) ``` # sessionInfo() This is the output of `sessionInfo()` on the system on which this document was compiled: ```{r} date() sessionInfo() ```