--- title: 'cTRAP: identifying candidate causal perturbations from differential gene expression data' author: Bernardo P. de Almeida & Nuno Saraiva-Agostinho date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > \usepackage[utf8]{inputenc} %\VignetteIndexEntry{cTRAP: DGE comparison with cellular perturbations} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- --- # Introduction `cTRAP` is an R package designed to compare differential gene expression results with those from known cellular perturbations (such as gene knock-down, overexpression or small molecules) derived from the [Connectivity Map][clue.io] ([Subramanian et al., Cell 2017][subramanian2017]). Such analyses allow not only to infer the molecular causes of the observed difference in gene expression but also to identify small molecules that could drive or revert specific transcriptomic alterations. To illustrate the package functionalities, we will use an example based on a gene knock-down dataset from the [ENCODE project][ENCODE] for which there is available RNA-seq data. After performing differential expression analyses to the macthed-control sample, we will compare the respective transcriptomic changes with the ones caused by all Connectivity Map's gene knock-down perturbations to identify which ones have similar or inverse transcriptomic changes to the observed ones. As a positive control, we expect to find the knock-down of the gene depleted in the ENCODE experiment as one of the most similar transcriptomic perturbations. # Getting started To load the `cTRAP` package into your R environment type: ```{r load package} library(cTRAP) ``` # Load ENCODE RNA-seq data and perform differential gene expression analysis In this example, we will use the EIF4G1 shRNA knockdown followed by RNA-seq experiment in HepG2 cell line from the ENCODE project as the dataset of interest. The RNA-seq processed data (gene quantifications from RSEM method) for the EIF4G1 knock-down and respective controls (two replicates each) can be downloaded by typing: ```{r load ENCODE samples, eval=FALSE} gene <- "EIF4G1" cellLine <- "HepG2" ENCODEmetadata <- downloadENCODEknockdownMetadata(cellLine, gene) table(ENCODEmetadata$`Experiment target`) length(unique(ENCODEmetadata$`Experiment target`)) ENCODEsamples <- downloadENCODEsamples(ENCODEmetadata) counts <- prepareENCODEgeneExpression(ENCODEsamples) ``` ```{r load data, include=FALSE} gene <- "EIF4G1" cellLine <- "HepG2" data("ENCODEmetadata") data("counts") ``` Gene expression data (read counts) were quantile-normalized using [`voom`][voom] and differential expression analysis was performed using the [`limma`][limma] R package. ```{r differential gene expression} # Remove low coverage (at least 10 counts shared across two samples) minReads <- 10 minSamples <- 2 filter <- rowSums(counts[ , -c(1, 2)] >= minReads) >= minSamples counts <- counts[filter, ] # Convert ENSEMBL identifier to gene symbol library(biomaRt) mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl")) genes <- sapply(strsplit(counts$gene_id, "\\."), `[`, 1) geneConversion <- getBM(filters="ensembl_gene_id", values=genes, mart=mart, attributes=c("ensembl_gene_id", "hgnc_symbol")) counts$gene_id <- geneConversion$hgnc_symbol[ match(genes, geneConversion$ensembl_gene_id)] # Perform differential gene expression analysis diffExpr <- performDifferentialExpression(counts) ``` As our metric of differential expression after EIF4G1 shRNA knock-down, we will use the respective t-statistic. ```{r t-statistics} # Get t-statistics of differential expression with respective gene names # (expected input for downstream analyses) diffExprStat <- diffExpr$t names(diffExprStat) <- diffExpr$Gene_symbol ``` # Load Connectivity Map perturbation data We will compare our differential gene expression metric with all Connectivity Map's gene knock-down perturbations in the same cell line (HepG2). Note that this comparison can also be done to perturbations in a different cell line or in all cell lines, using in the latter the average result across cell lines. Differential gene expression data for each Connectivity Map's perturbation are available in z-score normalised values (see [Subramanian et al., Cell 2017][subramanian2017] for more details). ```{r L1000 metadata conditions} # Check available conditions for L1000 perturbations l1000metadata <- downloadL1000data("l1000metadata.txt", "metadata") getL1000conditions(l1000metadata) ``` ```{r L1000 knockdown perturbations, eval=FALSE} # Code for loading CMap gene KD HepG2 data l1000metadataKnockdown <- filterL1000metadata( l1000metadata, cellLine="HepG2", perturbationType="Consensus signature from shRNAs targeting the same gene") l1000zscores <- downloadL1000data("l1000zscores.gctx", "zscores", l1000metadataKnockdown$sig_id) l1000geneInfo <- downloadL1000data("l1000geneInfo.txt", "geneInfo") l1000perturbationsKnockdown <- loadL1000perturbations( l1000metadataKnockdown, l1000zscores, l1000geneInfo) ``` If the interest is in small molecules, one can download the differential gene expression z-scores for each small molecule perturbation in HepG2 (given a specific concentration and time point of RNA extraction) through the following command: ```{r L1000 small molecule perturbations, eval=FALSE} l1000metadataSmallMolecules <- filterL1000metadata( l1000metadata, cellLine="HepG2", timepoint="24 h", dosage="5 \U00B5M", # \U00B5 is the unicode code for the micro symbol perturbationType="Compound") l1000zscores <- downloadL1000data("l1000zscores.gctx", "zscores", l1000metadataSmallMolecules$sig_id) l1000geneInfo <- downloadL1000data("l1000geneInfo.txt") l1000perturbationsSmallMolecules <- loadL1000perturbations( l1000metadataSmallMolecules, l1000zscores, l1000geneInfo, sanitizeCompoundNames=TRUE) ``` ```{r load L1000 perturbations, include=FALSE} data("l1000perturbationsKnockdown") data("l1000perturbationsSmallMolecules") ``` # Comparison with Connectivity Map's perturbations This is the main function of the package. Here we compare the statistic for differential expression (the t-statistic, in this case) with the z-scores for the perturbations we have loaded using the three different methods available (Spearman's correlation, Pearson's correlation and Gene Set Enrichment Analysis (GSEA)). For GSEA, the default option is to use the top and bottom 150 genes (ranked by the user's t-statistic) as gene sets, but this can be changed by the user. ```{r compare knockdown, include=FALSE} compareKnockdown <- list() # Compare against L1000 using Spearman correlation compareKnockdown$spearman <- compareAgainstL1000( diffExprStat, l1000perturbationsKnockdown, cellLine, method="spearman") # Compare against L1000 using Pearson correlation compareKnockdown$pearson <- compareAgainstL1000( diffExprStat, l1000perturbationsKnockdown, cellLine, method="pearson") # Compare against L1000 using gene set enrichment analysis (GSEA) with the top # and bottom 150 genes compareKnockdown$gsea <- compareAgainstL1000( diffExprStat, l1000perturbationsKnockdown, cellLine, method="gsea", geneSize=150) ``` For small molecules: ```{r compare small molecules} compareSmallMolecule <- list() # Compare against L1000 using Spearman correlation compareSmallMolecule$spearman <- compareAgainstL1000( diffExprStat, l1000perturbationsSmallMolecules, cellLine, method="spearman") # Compare against L1000 using Pearson correlation compareSmallMolecule$pearson <- compareAgainstL1000( diffExprStat, l1000perturbationsSmallMolecules, cellLine, method="pearson") # Compare against L1000 using gene set enrichment analysis (GSEA) with the top # and bottom 150 genes compareSmallMolecule$gsea <- compareAgainstL1000( diffExprStat, l1000perturbationsSmallMolecules, cellLine, method="gsea", geneSize=150) ``` For each method used, this function will return a table with the results of the comparison with each perturbation tested, including the test statistics (depending on the method used: Spearman's correlation coefficient, Pearson's correlation coefficient or GSEA Enrichment Score), the respective p-value and the Benjamini-Hochberg-adjusted p-value. ```{r order knockdown perturbation comparison} # Order knockdown perturbations according to similarity compareKnockdown$spearman_ordered <- compareKnockdown$spearman[ order(compareKnockdown$spearman$HepG2_spearman_coef, decreasing=TRUE)] compareKnockdown$pearson_ordered <- compareKnockdown$pearson[ order(compareKnockdown$pearson$HepG2_pearson_coef, decreasing=TRUE)] compareKnockdown$gsea_ordered <- compareKnockdown$gsea[ order(compareKnockdown$gsea$HepG2_WTCS, decreasing=TRUE)] # Most positively associated perturbations (note that EIF4G1 knockdown is the # 6th, 1st and 2nd most positively associated perturbation based on Spearman # correlation, Pearson correlation and GSEA, respectively) head(compareKnockdown$spearman_ordered) head(compareKnockdown$pearson_ordered) head(compareKnockdown$gsea_ordered) # Most negatively associated perturbations tail(compareKnockdown$spearman_ordered) tail(compareKnockdown$pearson_ordered) tail(compareKnockdown$gsea_ordered) ``` For small molecules: ```{r order small molecule perturbation comparison} # Order small molecule perturbations according to similarity compareSmallMolecule$spearman_ordered <- compareSmallMolecule$spearman[ order(compareSmallMolecule$spearman$HepG2_spearman_coef, decreasing=TRUE)] compareSmallMolecule$pearson_ordered <- compareSmallMolecule$pearson[ order(compareSmallMolecule$pearson$HepG2_pearson_coef, decreasing=TRUE)] compareSmallMolecule$gsea_ordered <- compareSmallMolecule$gsea[ order(compareSmallMolecule$gsea$HepG2_WTCS, decreasing=TRUE)] # Most positively associated perturbations head(compareSmallMolecule$spearman_ordered) head(compareSmallMolecule$pearson_ordered) head(compareSmallMolecule$gsea_ordered) # Most negatively associated perturbations tail(compareSmallMolecule$spearman_ordered) tail(compareSmallMolecule$pearson_ordered) tail(compareSmallMolecule$gsea_ordered) ``` # Plot To analyse the relationship between the user-provided differential expression profile with that associated with a specific perturbation, we provide the option to create a scatter plot (for Spearman and Pearson analyses) or a GSEA plot. Below you can see how to plot the relationship between the ENCODE's EIF4G1 shRNA knockdown with the perturbations of interest. ```{r plot knockdown perturbation comparison} # Plot relationship with EIF4G1 knockdown EIF4G1knockdown <- grep("EIF4G1", compareKnockdown$gsea_ordered$genes, value=TRUE) plotL1000comparison(compareKnockdown$spearman, EIF4G1knockdown) plotL1000comparison(compareKnockdown$pearson, EIF4G1knockdown) plotL1000comparison(compareKnockdown$gsea, EIF4G1knockdown, topGenes=TRUE) plotL1000comparison(compareKnockdown$gsea, EIF4G1knockdown, topGenes=FALSE) # Plot relationship with most negatively associated perturbation plotL1000comparison(compareKnockdown$spearman, tail(compareKnockdown$spearman_ordered, 1)$genes) plotL1000comparison(compareKnockdown$pearson, tail(compareKnockdown$pearson_ordered, 1)$genes) plotL1000comparison(compareKnockdown$gsea, tail(compareKnockdown$gsea_ordered, 1)$genes, topGenes=TRUE) plotL1000comparison(compareKnockdown$gsea, tail(compareKnockdown$gsea_ordered, 1)$genes, topGenes=FALSE) ``` For small molecules: ```{r plot small molecule perturbation comparison} # Plot relationship with most positively associated perturbation plotL1000comparison(compareSmallMolecule$spearman, head(compareSmallMolecule$spearman_ordered, 1)$genes) plotL1000comparison(compareSmallMolecule$pearson, head(compareSmallMolecule$pearson_ordered, 1)$genes) plotL1000comparison(compareSmallMolecule$gsea, head(compareSmallMolecule$gsea_ordered, 1)$genes) # Plot relationship with most negatively associated perturbation plotL1000comparison(compareSmallMolecule$spearman, tail(compareSmallMolecule$spearman_ordered, 1)$genes) plotL1000comparison(compareSmallMolecule$pearson, tail(compareSmallMolecule$pearson_ordered, 1)$genes) plotL1000comparison(compareSmallMolecule$gsea, tail(compareSmallMolecule$gsea_ordered, 1)$genes) ``` # Feedback All feedback on the program, documentation and associated material (including this tutorial) is welcome. Please send any suggestions and comments to: > Nuno Saraiva-Agostinho (nunoagostinho@medicina.ulisboa.pt) > > Bernardo P. de Almeida (bernardo.almeida94@gmail.com) > > [Disease Transcriptomics Lab, Instituto de Medicina Molecular (Portugal)][iMM] [iMM]: http://imm.medicina.ulisboa.pt/group/distrans/ [ENCODE]: https://www.encodeproject.org/ [clue.io]: https://clue.io/ [limma]: https://www.ncbi.nlm.nih.gov/pubmed/25605792 [voom]: https://www.ncbi.nlm.nih.gov/pubmed/24485249 [subramanian2017]: https://doi.org/10.1016/j.cell.2017.10.049