--- title: "singscore" author: "Ruqian Lyu" date: "`r Sys.Date()`" package: singscore output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Single sample scoring} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: ../inst/REFERENCES.bib --- # Introduction This vignette demonstrates how to use the functions in R package 'singscore' to score a gene expression dataset against a gene set at a single-sample level and provides visualisation functions to improve interpretation of the results. `singscore` implements a simple single-sample gene-set (gene-signature) scoring method which scores individual samples independently without relying on other samples in gene expression datasets. It provides stable scores which are less likely to be affected by varying sample and gene sizes in datasets and unwanted variations across samples. The scoring method uses a rank-based statistics and is quick to compute. For details of the methods please refer to the paper [@Foroutan231217]. It also provides various visualisation functions to further explore results of the analysis. # Install "singscore" R package Install 'singscore' from Bioconductor ```{r installationBio, eval=FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("singscore") ``` The most updated version of 'singscore' is hosted on GitHub and can be easily installed using `devtools::install_github()` function provided by **devtools**, (https://cran.r-project.org/package=devtools) ```{r installationGit, eval=FALSE} # You would need to install 'devtools' package first. install.packages("devtools") # And install the 'singscore' package from the GitHub repository # 'singscore' requires these packages to be installed: methods, stats, graphics, ggplot2, ggsci, grDevices, # ggrepel, plotly, tidyr, plyr, magrittr, reshape, edgeR, RColorBrewer, Biobase, GSEABase, BiocParallel devtools::install_github('DavisLaboratory/singscore') # Set build_vignette = TRUE if would like to browseVignette() ``` #Scoring samples against a gene-set ## Load datasets To illustrate the usage of 'simpleScore()', we first need to load the example datasets. The datasets used in this vignette have been built within the package. You can use the following scripts to load them into your R environment. Detailed steps of obtaining the full datasets are included at the end of the vignette. The 'tgfb_expr_10_se' dataset was obtained from [@Foroutanmolcanres.0313.2016] and it is a ten-sample subset of the original dataset. We are going to score the integrated TGFb-treated gene expression dataset (4 cases and 6 controls) against a TGFb gene signature with an up-regulated and down-regulated gene-set pair [@Foroutanmolcanres.0313.2016]. ```{r loadDataset, message=FALSE} library(singscore) library(GSEABase) # The example expression dataset and gene signatures are included in the package # distribution, one can directly access them using the variable names # To see the description of 'tgfb_expr_10_se','tgfb_gs_up','tgfb_gs_dn', look at # their help pages using: # ?tgfb_expr_10_se # ?tgfb_gs_up # ?tgfb_gs_dn # Have a look at the object tgfb_expr_10_se containing gene expression data # for 10 samples tgfb_expr_10_se # Get the sample names by colnames(tgfb_expr_10_se) # View what tgfb_gs_up/dn contains tgfb_gs_up tgfb_gs_dn # Get the size of the gene sets length(GSEABase::geneIds(tgfb_gs_up)) length(GSEABase::geneIds(tgfb_gs_dn)) ``` ## Sample scoring To score samples, the gene expression dataset first needs to be ranked using the `rankGenes()` function which returns a rank matrix. This matrix along with the signatures are then passed to the `simpleScore()` function which returns a data.frame containing the scores for each sample. When only a single gene-set is available (i.e. not an up- and down- regulated pair), the same function can be called by setting the `upSet` argument to the gene-set. ```{r simplescoring} # The recommended method for dealing with ties in ranking is 'min', you can # change by specifying 'tiesMethod' parameter for rankGenes function. rankData <- rankGenes(tgfb_expr_10_se) # Given the ranked data and gene signature, simpleScore returns the scores and # dispersions for each sample scoredf <- simpleScore(rankData, upSet = tgfb_gs_up, downSet = tgfb_gs_dn) scoredf # To view more details of the simpleScore, use ?simpleScore # Note that, when only one gene set is available in a gene signature, one can # only input values for the upSet argument. In addition, a knownDirection # argument can be set to FALSE if the direction of the gene set is unknown. # simpleScore(rankData, upSet = tgfb_gs_up, knownDirection = FALSE) ``` The returned data.frame consists of the scores for the up- and down- regulated gene-sets along with the combined score (`TotalScore`). Dispersion is calculated using the `mad` function by default and can be substituted by passing another function to the `dispersionFun` argument in `simpleScore()` such as `IQR` to calculate the inter-quartile range. ## Sample scoring with a reduced number of measurements Singscore requires transcriptome-wide measurements whereby all or most transcripts are measured. This is required to assess the expression of a gene in relation to the transcriptome. Panel-based transcriptomic assays measure a much smaller selection of transcripts therefore assessing the relative expression of each gene becomes a challenge. In such a setting, stably expressed genes can be used to assess relative expression. Expression of such genes is invariable across samples therefore they can be used to calibrate sample-wise gene expression measurements. This property allows for estimation of transcriptome-wide ranks with a small set of measurements inclusing a small set of stable genes. Detailed explanation of the rank estimation procedure using this logic is available in Bhuva et al. (2020) (manuscript in preparation). Stably expressed genes in carcinoma transcriptomes and blood can be obtained using the `getStableGenes` function. ```{r} #get the top 5 stable genes in carcinomas getStableGenes(5, type = 'carcinoma') #get the top 5 stable genes in blood getStableGenes(5, type = 'blood') #get ensembl IDs instead of gene symbold getStableGenes(5, type = 'carcinoma', id = 'ensembl') ``` This list can be used to score samples using a variant of singscore called *stingscore*. ```{r} #here we specify a custom set of genes (Entrez IDs) stable_genes <- c('8315', '9391', '23435', '3190') #create a dataset with a reduced set of genes (signature genes and stable genes) measured <- unique(c(stable_genes, geneIds(tgfb_gs_up), geneIds(tgfb_gs_dn))) small_tgfb_expr_10 <- tgfb_expr_10_se[measured, ] dim(small_tgfb_expr_10) #rank genes using stable genes rankData_st <- rankGenes(small_tgfb_expr_10, stableGenes = stable_genes) head(rankData_st) #score samples using stingscore #simpleScore invoked with the new rank matrix will execute the stingscore # algorithm scoredf_st <- simpleScore(rankData_st, upSet = tgfb_gs_up, downSet = tgfb_gs_dn) scoredf_st #plot the two scores against each other plot(scoredf$TotalScore, scoredf_st$TotalScore) abline(coef = c(0, 1), lty = 2) ``` Scores computed using the classic singscore and the modified stingscore will be similar (correlated if not equal) if the selection of stable genes is good. Ideal stable genes will have stable gene expression across the samples being investigated and should cover a wide range of the expression spectrum. The genes produced by `getStableGenes` fulfill these conditions. Our approach using the product of ranks can be used to determine a set of stable genes for a new context (manuscript in preparation). # Visualisation and diagnostic functions In this section, we show example usages of the visualisation functions included in this package. ## Plot rank densities Scores of each sample are stored in `scoredf`. We can use the `plotRankDensity` function to plot the ranks of genes in the gene-sets for a specific sample. We plot the rank distribution for the second sample in `rankData` which combines a density plot (densities calculated using KDE) with a barcode plot. Please note that since we are subsetting the data.frame rankData by one column, we set `drop = FALSE` to maintain the structure of the data.frame/matrix. ```{r plotdt, fig.height = 4, fig.width = 8} # You can provide the upSet alone when working with unpaired gene-sets # We plot the second sample in rankData, view it by head(rankData[,2,drop = FALSE]) plotRankDensity(rankData[,2,drop = FALSE], upSet = tgfb_gs_up, downSet = tgfb_gs_dn, isInteractive = FALSE) ``` Setting `isInteractive = TRUE` generates an interactive plot using the `plotly` package. Hovering over the bars in the interactive plot allows you to get information such as the normalised rank (between 0 and 1) and ID of the gene represented by the bar. For the rest of the plotting functions, the `isInteractive = TRUE` argument has the same behavior. ## Plot dispersions of scores Function `plotDispersion` generates the scatter plots of the 'score VS. dispersions' for the total scores, the up scores and the down score of samples. It requires the scored data.frame from `simpleScore` function and annotations (via `annot` parameter) can be used for coloring the points. ```{r plotds, fig.height = 4, fig.width = 8} # Get the annotations of samples by their sample names tgfbAnnot <- data.frame(SampleID = colnames(tgfb_expr_10_se), Type = NA) tgfbAnnot$Type[grepl("Ctrl", tgfbAnnot$SampleID)] = "Control" tgfbAnnot$Type[grepl("TGFb", tgfbAnnot$SampleID)] = "TGFb" # Sample annotations tgfbAnnot$Type plotDispersion(scoredf,annot = tgfbAnnot$Type,isInteractive = FALSE) # To see an interactive version powered by 'plotly', simply set the # 'isInteractive' = TRUE, i.e : # # plotDispersion(scoredf,annot = tgfbAnnot$Type,isInteractive = TRUE) ``` ## Plot score landscape `plotScoreLandscape` plots the scores of the samples against two different gene signatures in a landscape for exploring their relationships. There are two styles of the landscape plot (i.e scatter and hexBin plot). When the number of samples in the gene expression dataset is above the default threshold (100), `plotScoreLandscape` generates a hex bin plot otherwise a scatter plot. The threshold can be modified by changing the `hexMin`. In order to better demonstrate the usage of `plotScoreLandscape`, we load some additional datasets consisting of pre-computed scores of larger public datasets. `scoredf_ccle_epi` and `scoredf_ccle_mes` are two scored results of a CCLE dataset [@barretina2012cancer] against an epithelial gene signature and mesenchymal gene signature [@Ep-MES] respectively. For details on how to obtain the dataset please see the section at the end of the vignette. ```{r loadCCLE,fig.height = 4, fig.width = 8} plotScoreLandscape(scoredf_ccle_epi, scoredf_ccle_mes, scorenames = c('ccle-EPI','ccle-MES'),hexMin = 10) ``` Similarly, pre-computed scores for the [TCGA](https://cancergenome.nih.gov/) breast cancer RNA-seq dataset against epithelial and mesenchymal gene signatures are stored in `scoredf_tcga_epi` and `scoredf_tcga_mes` respectively [@Ep-MES]. The utility of this function is enhanced when the number of samples is large. ```{r loadTCGA,fig.height = 4, fig.width = 8} tcgaLandscape <- plotScoreLandscape(scoredf_tcga_epi, scoredf_tcga_mes, scorenames = c('tcga_EPI','tcga_MES'), isInteractive = FALSE) tcgaLandscape # To get an interactive version of plot, set isInteractive = TRUE ``` You can also project new data points onto the landscape plot generated above by using the `projectScoreLandscape` function. For example, the plot below overlays 3 CCLE samples onto the TCGA epithelial-mesenchymal landscape. Points are labeled with their sample names by default. ```{r projectScore,fig.height = 4, fig.width = 8} # Project on the above generated 'tcgaLandscape' plot projectScoreLandscape(plotObj = tcgaLandscape, scoredf_ccle_epi, scoredf_ccle_mes, subSamples = rownames(scoredf_ccle_epi)[c(1,4,5)], annot = rownames(scoredf_ccle_epi)[c(1,4,5)], sampleLabels = NULL, isInteractive = FALSE) ``` Custom labels can be provided by passing a character vector to the `sampleLabels` argument. ```{r projectScore2,fig.height = 4, fig.width = 8} projectScoreLandscape(plotObj = tcgaLandscape, scoredf_ccle_epi, scoredf_ccle_mes, subSamples = rownames(scoredf_ccle_epi)[c(1,4,5,8,9)], sampleLabels = c('l1','l2','l3','l4','l5'), annot = rownames(scoredf_ccle_epi)[c(1,4,5,8,9)], isInteractive = FALSE) ``` # Estimate empirical p-values for the obtained scores in individual samples and plot null distributions ## Permutation test Hypothesis testing of the calculated scores is performed using a permutation test. The null hypothesis is that the gene-set is not enriched in the sample. For each sample, gene labels are randomly shuffled and scores computed against the gene-set. This is done $B$ times to generate the null distribution. The `generateNull()` function computes these for multiple samples ($n$) simultaneously resulting in an $n \times B$ matrix with permuted scores along the columns for each sample. The permutation function has parallel computing features provided by using [`BiocParallel`](http://bioconductor.org/packages/release/bioc/html/BiocParallel.html) ```{r pvalue, fig.height = 8, fig.width = 10} # tgfb_gs_up : up regulated gene set in the tgfb gene signature # tgfb_gs_dn : down regulated gene set in the tgfb gene signature # This permutation function uses BiocParallel::bplapply() parallel function, by # supplying the first 5 columns of rankData, we generate Null distribution for # the first 5 samples. # detect how many CPU cores are available for your machine # parallel::detectCores() ncores <- 1 # Provide the generateNull() function the number of cores you would like # the function to use by passing the ncore parameter permuteResult <- generateNull( upSet = tgfb_gs_up, downSet = tgfb_gs_dn, rankData = rankData, subSamples = 1:5, centerScore = TRUE, knownDirection = TRUE, B = 1000, ncores = ncores, seed = 1, useBPPARAM = NULL ) # Note here, the useBPPARAM parameter allows user to supply a BPPARAM variable # as a parameter which decides the type of parallel ends to use. # such as # snow <- BiocParallel::SnowParam(type = "SOCK") # permuteResult <- generateNull(upSet = tgfb_gs_up, downSet = tgfb_gs_dn, # rankData[,1:5], B = 1000, seed = 1,ncores = ncores, useBPPARAM = snow) # If you are not sure about this, just leave the value as NULL and set how many # CPU cores to use via the ncores argument. It will use the default parallel # backend for your machine. # For more information, please see the help page for ?generateNull() # Please also note that one should pass the same values to the upSet, # downSet, centerScore and knownDirection arguments as what they provide # for the simpleScore() function to generate a proper null distribution. head(permuteResult) ``` ## Estimate empirical p-values $p$-values can be estimated using the `getPvals()` function by providing the null distributions calculated above. Unless all permutations are exhausted (mostly infeasible), the minimum $p$-value obtained is $\frac{1}{B}$. ```{r getPvals, fig.height = 4, fig.width = 8} pvals <- getPvals(permuteResult, scoredf, subSamples = 1:5) # getPval returns p-values for each individual sample. # show the p-values for first 5 samples pvals ``` ## Plot null distribution Plot the null distributions for the first sample with the estimated $p$-value labelled. The function uses the `sampleNames` parameter to decide which samples to plot. ```{r plotNull1, fig.height = 4, fig.width = 8 } plotNull(permuteResult, scoredf, pvals, sampleNames = names(pvals)[1]) ``` You can provide multiple sample names to plot these samples in one plot. For example, plot the first 2 samples. ```{r plotNull2, fig.height = 6, fig.width = 12 } # plot the null distributions for the first 2 samples and the p-values # We can see from the plot, the control samples are not significant and TGFb # samples are very significant with very low p-values plotNull(permuteResult, scoredf, pvals, sampleNames = names(pvals)[1:2]) ``` # More on the datasets 1. TGFb-EMT data In the examples above, we loaded a gene expression matrix data `tgfb_expr_10_se`. This dataset is a ten-sample subset of a full dataset originally from the integrated TGFb-EMT data published by [@Foroutanmolcanres.0313.2016]. The full dataset can be accessed here https://figshare.com/articles/comBat_corrected_Foroutan_et_al_2017/5682862. The `tgfb_gs_up` and `tgfb_gs_dn` are the derived up-regulated/down-regulated genes of the TGFb-induced EMT gene signature by [@Foroutanmolcanres.0313.2016] (see Table S1. TGFβ-EMT signature). 2. CCLE dataset `scoredf_ccle_epi` and `scoredf_ccle_mes` are two data frames of the pre-computed scores using The Cancer Cell Line Encyclopaedia (CCLE) breast cancer cell line RNA-seq dataset [@barretina2012cancer]. The CCLE dataset was normalised by TPM and can be downloaded from https://www.synapse.org/#!Synapse:syn5612998 Cell lines were scored against the epithelial and mesenchymal gene signatures, which were obtained from [@Ep-MES] and can be found in the 'Table S1B. Generic EMT signature for cell line' in the supplementary files. 3. TCGA cancer samples dataset `scoredf_tcga_epi` and `scoredf_tcga_mes` are two data frames of the pre-computed scores using The Cancer Genome Atlas (TCGA) breast cancer RNA-seq data (RSEM normalised) [@TCGAcite] against the epithelial and mesenchymal gene signatures respectively. The gene signatures were obtained from [@Ep-MES] and can be found in the 'Table S1A. Generic EMT signature for tumour'. The TCGA dataset was downloaded from The UCSC Cancer Genomics Browser in February 2016 (https://genome-cancer.ucsc.edu) ```{r previewData} # preview the scored CCLE samples head(scoredf_ccle_epi) # preview the scored TCGA samples head(scoredf_tcga_epi) ``` # Session Info ```{r sessionInfo} sessionInfo() ``` # References