--- title: 'psichomics tutorial: command-line interface (CLI)' author: "Nuno Saraiva-Agostinho" date: "15 February 2016" bibliography: refs.bib output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Command-line interface tutorial} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- --- psichomics is an interactive R package for the analysis of alternative splicing using data from [The Cancer Genome Atlas (TCGA)][1] (containing molecular data associated with 34 tumour types) and from the [Genotype-Tissue Expression][] project (containing data for multiple normal human tissues). The data available in these including clinical information and transcriptomic data, such as the quantification of RNA-Seq reads aligning to splice junctions (henceforth called junction quantification) and exons. # Installing and starting the program Install psichomics by typing the following in an R console (the [R environment](https://www.r-project.org/) is required): ```{r install, eval=FALSE} ## try http:// if https:// URLs are not supported source("https://bioconductor.org/biocLite.R") biocLite("psichomics") ``` After the installation, load psichomics by typing: ```{r load, message=FALSE} library(psichomics) ``` # Retrieving data The quantification of each alternative splicing event is based on the proportion of junction reads that support the inclusion isoform, known as percent spliced-in or PSI [@wang2008]. To estimate this value for each splicing event, both alternative splicing annotation and junction quantification are required. While alternative splicing annotation is provided by the package, junction quantification may be retrieved from [TCGA][1]. ## Download and load data from TCGA/Firehose Data is downloaded from [Firehose][2], a service that hosts proccessed data from [TCGA][1] as required to run the downstream analyses. Before downloading data, check the following available options: ```{r TCGA options} # Available tumour types cohorts <- getFirehoseCohorts() # Available sample dates date <- getFirehoseDates() # Available data types dataTypes <- getFirehoseDataTypes() ``` After deciding on the options to use, download and load data of interest using: ```{r download, eval=FALSE} # Set download folder folder <- getDownloadsFolder() # Download and load most recent junction quantification and clinical data from # TCGA/Firehose for Adrenocortical Carcinoma data <- loadFirehoseData(folder=folder, cohort="BRCA", data=c("Clinical", "junction_quantification"), date="2016-01-28") # Select clinical and junction quantification dataset clinical <- data[[1]]$`Clinical data` junctionQuant <- data[[1]]$`Junction quantification (Illumina HiSeq)` ``` Data is only downloaded if the files are not present in the given folder. In other words, if the files were already downloaded, the function will just load the files, so it is possible to reuse the code above just to load the requested files. ```{r prepare examples, include=FALSE} clinical <- readRDS("BRCA_clinical.RDS") ``` ## Load local files To load local files instead, indicate the folder of interest. Any files located in this folder and sub-folders will be attempted to load. To avoid errors during this process, files of interest should be put in a dedicated folder. For instance, to load GTEx files, create a directory called **GTEx**, put all files of interest inside that folder and follow these commands: ```{r load local, eval=FALSE} folder <- "~/Downloads/GTEx/" ignore <- c(".aux.", ".mage-tab.") data <- loadLocalFiles(folder, ignore=ignore) # Select clinical and junction quantification dataset clinical <- data[[1]]$`Clinical data` junctionQuant <- data[[1]]$`Junction quantification (Illumina HiSeq)` ``` # Quantifying alternative splicing As previously mentioned, alternative splicing is quantified from the previously loaded junction quantification and an alternative splicing annotation file. To check current annotation files available: ```{r quantify options} # Available alternative splicing annotation annotList <- listSplicingAnnotations() annotList ``` Custom alternative splicing annotation obtained from SUPPA, rMATS, VAST-TOOLS or MISO can also be provided. Note that SUPPA and rMATS are able to create their splicing annotation based on transcript annotation. [Read more here][6]. To quantify alternative splicing, first select the junction quantification, alternative splicing annotation and alternative splicing event type(s) of interest: ```{r prepare to quantify splicing, eval=FALSE} # Load Human (hg19/GRCh37 assembly) annotation human <- listSplicingAnnotations()[[1]] annotation <- loadAnnotation(human) ``` ```{r event types} # Available alternative splicing event types (skipped exon, alternative # first/last exon, mutually exclusive exons, etc.) eventType <- getSplicingEventTypes() eventType eventType <- "SE" ``` Afterwards, quantify alternative splicing using the previously defined parameters: ```{r quantify splicing, eval=FALSE} # Min. reads threshold: number of reads required to quantify a splicing event minReads <- 10 psi <- quantifySplicing(annotation, junctionQuant, eventType=eventType, minReads=minReads) ``` ```{r load splicing, echo=FALSE} psi <- readRDS("BRCA_psi.RDS") ``` ```{r check splicing events} # Check the identifier of the splicing events in the resulting table events <- rownames(psi) head(events) ``` Note that the event identifier (for instance, `SE_1_-_2125078_2124414_2124284_2121220_C1orf86`) is composed of: - Event type (`SE` stands for skipped exon) - Chromosome (`1`) - Strand (`-`) - Relevant coordinates depending on event type (in this case, the first constitutive exon's end, the alternative exon' start and end and the second constitutive exon's start) - Associated gene (`C1orf86`) > WARNING: all the examples shown below are performed using a small, but representative subset of the data available. # Survival analysis Survival data can be analysed based on clinical attributes, for instance, by tumour stage and patient gender, using time to death as the follow-up time and death as the event of interest. Around 80% of breast cancers have cells that express estrogen receptors (ER) and require estrogen binding to grow. Estrogen receptors may be blocked by tamoxifen and other drugs. These drugs are thus used as treatment or prevention for ER-positive breast cancers. To compare the overall survival of ER-positive patients treated with and without tamoxifen, the following groups will be created: * ER-positive tamoxifen-treated patients, * ER-positive non-tamoxifen-treated patients and * non-ER-positive tamoxifen-treated patients. ```{r survival groups} # Get available groups in clinical data cols <- colnames(clinical) # Check the name from which to retrieve groups of interest er_expr <- grep("estrogen_receptor_status", cols, value=TRUE)[1] er_expr <- createGroupByColumn(col=er_expr, dataset=clinical) # Discard values other than "positive" and "negative" for ER expression er_expr <- er_expr[c("positive", "negative")] ############################################################################ # Now the same for patients treated with tamoxifen. However, TCGA presents # # data for the administred drugs spread through multiple columns, so... # ############################################################################ # Look for the appropriate columns with the drugs administred drug_name <- grep("drug_name", cols, value=TRUE)[1:23] # Using the previous columns, look for either "tamoxifen" or "tamoxiphen" tamoxifen <- lapply(drug_name, function(i) grep("tamoxi.*", clinical[ , i])) # Collect all previous results and create a single list named tamoxifen tamoxifen <- sort(unique(unlist(tamoxifen))) tamoxifen <- list("tamoxifen"=tamoxifen) # Combine all previously created groups groups <- c(er_expr, tamoxifen) # Assign each patient to a group patients <- nrow(clinical) g <- groupPerPatient(groups[c("tamoxifen", "positive")], patients) # Append the created groups to the clinical dataset cl <- cbind(clinical, groups=g) ``` Before continuing, be sure you undestand how to use formulas in R. ```{r help formula, eval=FALSE} help(formula) ``` To plot the survival curves: ```{r survival by er expression and tamoxifen} # Create the right-hand side of a formula formulaStr <- "groups" # Events are retrieved based on the information available for time to event: if # there is info for a patient, the event is considered as occurred daysToDeath <- "days_to_death" survTerms <- processSurvTerms(cl, censoring="right", event=daysToDeath, timeStart=daysToDeath, formulaStr=formulaStr, scale="years") require("survival") surv <- survfit(survTerms) pvalue <- testSurvival(survTerms) plotSurvivalCurves(surv, pvalue=pvalue) ``` Information regarding number of individuals and events is returned when hovering over each survival curve in the plot. The plot also allows zooming in by clicking and dragging and to omit data series by clicking on their name in the legend. ## Cox Proportional Hazards model Calculate Cox proportional hazards model using the previously used parameters by adding the argument `coxph=TRUE` when processing survival terms. ```{r cox model} survTermsCox <- processSurvTerms(cl, censoring="right", event=daysToDeath, timeStart=daysToDeath, formulaStr=formulaStr, coxph=TRUE) summary(survTermsCox) ``` # Exploring principal component analysis Principal component analysis (PCA) is a technique to reduce data dimensionality by identifying variable combinations (called principal components) that explain the variance in the data [@Ringner2008gb]. To explore alternative splicing quantification groups by estrogen receptor (ER) expression: ```{r plot PCA variance} # Percentage of missing values tolerated by splicing event: tolerated missing # values are replaced with the median value of that splicing event naTolerance <- 0 # Center bu do not scale values (they are already scaled between 0 and 1) center <- TRUE scale <- FALSE # Match patients with samples samples <- colnames(psi) match <- getPatientFromSample(samples, clinical) # Filter alternative splicing quantification to colour values by positive or # negative ER expression erGroups <- getMatchingSamples(groups[c("positive", "negative")], samples, clinical, match=match) filtered_psi <- psi[ , unlist(erGroups)] # Perform principal component analysis (transpose alternative splicing # quantification to have samples as rows) pca <- performPCA(t(filtered_psi), center=center, scale.=scale, naTolerance=naTolerance) # Plot the variance explained by each principal component plotVariance(pca) ``` Now plot the score plot of the PCA coloured according to the ER expression: ```{r plot PCA} plotPCA(pca, pcX="PC1", pcY="PC2", erGroups) ``` Moreover, let's plot the corresponding loadings plot that displays the variables (in this case, alternative splicing events): ```{r plot loadings} plotPCA(pca, pcX="PC1", pcY="PC2", individuals = FALSE, loadings = TRUE) ``` The bubble size of the loadings plot represent the total contribution of each alternative splicing event to the selected principal components. Note that the clinical samples from ER-positive individuals (i.e. patients whose cancer cells express estrogen receptors) seem to separate from samples from ER-negative individuals along the principal component 1. There is one alternative splicing event that may contribute to this separation: *SE 10 + 79797062 79799962 79799983 79800373 RPS24*. ## Differential splicing analysis By default, differential splicing analysis is performed by sample types (i.e. tumour, normal, metastasis, etc) for multiple parametric and non-parametric statistical tests. Let's analyse the previous splicing event found: ```{r diff splicing 1} # Choose a method to use when adjusting p-values ("none" is valid) # help(p.adjust.methods) pvalueAdjust <- "BH" # Check sample types available (normal, tumour, etc.) types <- parseSampleGroups(colnames(psi)) unique(types) # Analyse by sample types (by default) and perform all statistical analyses (by # default) event <- "SE_10_+_79797062_79799962_79799983_79800373_RPS24" eventPSI <- psi[event, ] stats <- diffAnalyses(eventPSI, pvalueAdjust=pvalueAdjust, analyses = c("wilcoxRankSum", "ttest", "kruskal", "levene", "fligner")) # View(stats) plotDistribution(as.numeric(eventPSI), groups=types) ``` Differential splicing analysis returns a data frame with the results of the statistical tests, as well as other info, such as associated gene and number of samples per group. The following statistical tests are available: * Unpaired t-test (`ttest`) * Wilcoxon rank sum test (`wilcoxRankSum`) * Kruskal test (`kruskal`) * Levene's test (`levene`) * Fligner-Killeen test (`fligner`) Not all statistical tests are performed depending on the number of groups available. Check if any tests show statistical significance using only tumour against normal samples (for instance, Log-rank p-values below 0.05): ```{r diff splicing 2} filter <- grep("Tumor|Normal", types) gr_event <- types[filter] eventPSI <- eventPSI[filter] stats <- diffAnalyses(eventPSI, groups=gr_event, pvalueAdjust=pvalueAdjust, analyses = c("wilcoxRankSum", "ttest", "kruskal", "levene", "fligner")) # View(stats) plotDistribution(as.numeric(eventPSI), groups=gr_event) ``` * Hover each group in the plot to compare the respective number of samples, median and variance. * To zoom in a specific region, click-and-drag in the region of interest. * To hide or show groups, click on their name in the legend. Now, change the groups to *positive* and *negative* for ER expression and check if any of the tests show statistical significance: ```{r diff splicing 3} # Filter alternative splicing quantification by positive or negative ER # expression; match between patient information and samples as such: erGroups <- getMatchingSamples(groups[c("negative", "positive")], samples, clinical, match=match) eventPSI <- psi[event, unlist(erGroups)] erGroups <- rep(names(erGroups), sapply(erGroups, length)) stats <- diffAnalyses(eventPSI, groups=erGroups, pvalueAdjust=pvalueAdjust, analyses = c("wilcoxRankSum", "ttest", "kruskal", "levene", "fligner")) # View(stats) plotDistribution(as.numeric(eventPSI), groups=erGroups) ``` ## Survival analysis To study the impact of an alternative splicing event on prognosis, Kaplan-Meier curves may be plotted for groups of patients separated by a given PSI cut-off for the selected alternative splicing event. Let's carry on with the splicing event from before: ```{r splicing event data} event <- "SE_10_+_79797062_79799962_79799983_79800373_RPS24" # Assign alternative splicing quantification to patients based on their samples clinicalPSI <- getPSIperPatient(psi, match, clinical) eventPSI <- as.numeric(clinicalPSI[event, ]) # Statistics of the alternative splicing event's quantification summary(eventPSI) ``` The optimal PSI cut-off that maximises the significance of their difference in survival (i.e. minimises the p-value of the Wald/Log/Logrank tests of difference in survival between individuals with PSI below and above that threshold) can be calculated as per: ```{r optimal cut-off} opt <- optimalPSIcutoff(clinical, eventPSI, censoring="right", event="days_to_death", timeStart="days_to_death") optimalCutoff <- opt$par # Optimal exon inclusion level optimalCutoff optimalPvalue <- opt$value # Respective p-value optimalPvalue ``` Finally, plot survival curves separated by the optimal quantification cut-off or any other cut-off of interest: ```{r} cutoff <- optimalCutoff group <- labelBasedOnCutoff(eventPSI, cutoff, label="PSI values") survTerms <- processSurvTerms(clinical, censoring="right", event="days_to_death", timeStart="days_to_death", group=group) require("survival") surv <- survfit(survTerms) pvalue <- testSurvival(survTerms) plotSurvivalCurves(surv, pvalue=pvalue) ``` ## Literature support and external database information If an event is differentially spliced and has an impact on patient survival, its association with the studied disease might be already described in the literature. Check for relevant research articles on [PubMed](http://pubmed.gov). It is also possible to plot transcripts and proteins related to the gene associated to a given splicing event. To retrieve and plot transcript information from [ENSEMBL][4]: ```{r plot transcripts} parsed <- parseSplicingEvent(event) info <- queryEnsemblByEvent(event, species="human", assembly="hg19") plotTranscripts(info, parsed$pos[[1]]) ``` To retrieve and plot protein information from [UniProt][5]: ```{r plot proteins} # Some of these transcripts have no corresponding protein proteins <- info$Transcript$Translation$id protein <- proteins[[6]] # Convert protein identifier from ENSEMBL to UniProt uniprot <- ensemblToUniprot(protein) uniprot uniprot <- uniprot[[1]] plotProtein(uniprot) ``` The protein plot shows the [UniProt][5] matches for the selected transcript. Hover the protein's rendered domains to obtain more information on them. Other database of interest include: - [Human Protein Atlas (Cancer Atlas)](http://www.proteinatlas.org/cancer) allows to check the evidence of a gene at protein level for multiple cancer tissues. - [VastDB](http://vastdb.crg.eu) shows multi-species alternative splicing profiles for diverse tissues and cell types. - [UCSC Genome Browser](https://genome.ucsc.edu) may reveal protein domain disruptions caused by the alternative splicing event. To check so, activate the **Pfam in UCSC Gene** and **UniProt** tracks (in *Genes and Gene Predictions*) and check if any domains are annotated in the alternative and/or constitutive exons of the splicing event. # Exploring differential splicing analysis To analyse differencial splicing, choose which clinical groups on which to perform the analyses. For instance, splicing events can be analysed based on sample types (e.g., tumour versus normal tissue, if available) or clinical groups of the patients (e.g. stage of the disease). Let's perform differential splicing analysis based on ER expression: ```{r exploring diff splicing} # Filter alternative splicing quantification by positive or negative ER # expression; match between patient information and samples as such: erGroups <- getMatchingSamples(groups[c("negative", "positive")], samples, clinical, match=match) eventsPSI <- psi[ , unlist(erGroups)] erGroups <- rep(names(erGroups), sapply(erGroups, length)) diff <- diffAnalyses(eventsPSI, erGroups) # View(diff) ``` ## Survival analysis To study the impact of the alternative splicing events on prognosis, significant differences in survival may be identified between patients separated by a given splicing quantification cut-off. Given the time-consuming process of identifying a cut-off that minimises the survival difference, it is recommended to filter differentially spliced events supported by statistical significance before calculating the optimal cut-off and associated p-value. ```{r survival analysis on filtered events, results="hide"} # Filter statistically significant events as you see fit filter <- diff$`Wilcoxon p-value (BH adjusted)` <= 0.05 filter <- filter[!is.na(filter)] events <- rownames(diff[filter, ]) # Assign alternative splicing quantification to patients based on their samples clinicalPSI <- getPSIperPatient(psi, match, clinical) # Get time for all survival analyses of interest (the same for all, so this is # faster) daysToDeath <- "days_to_death" survTime <- getColumnsTime(clinical, event=daysToDeath, timeStart=daysToDeath) # Prepare progress bar min <- 1 max <- nrow(clinicalPSI) pb <- txtProgressBar(min, max, "Calculating optimal PSI cut-off", style=3) # Prepare empty vectors where information will be stored (faster) optimalCutoff <- rep(NA, max) optimalPvalue <- rep(NA, max) # Retrieve the optimal PSI cut-off and respective p-value for multiple events for (row in min:max) { singlePSI <- as.numeric(clinicalPSI[row, ]) opt <- optimalPSIcutoff(clinical, singlePSI, censoring="right", event=daysToDeath, timeStart=daysToDeath, survTime=survTime) optimalCutoff[row] <- opt$par # Optimal exon inclusion level optimalPvalue[row] <- opt$value # Respective p-value setTxtProgressBar(pb, row) } # Bind columns to differential splicing analysis if desired diff <- cbind(diff, optimalCutoff, optimalPvalue) # View(diff) ``` Afterwards, check literature information and search external databases for more information on the events of interest, including on **UCSC Genome Browser** for putative protein domain disruptions resulting from the splicing event. # Exploring an alternative splicing event of interest The event *SE 6 - 46823711 46822518 46822452 46821808 GPR116* has been previously reported to have potential prognostic value in breast cancer patients, where patients with higher PSI values for this event have a lower 5-year survival rate than patients with lower PSI values [@Tsai2015jf]. Confirm so by performing differential splicing (for example, normal vs tumour samples) on this event and survival analysis by PSI cut-off. Also, check literature information and search external databases for more information, including on **UCSC Genome Browser** for putative protein domain disruptions resulting from the alternative splicing event. # Feedback All feedback on the program, documentation and associated material (including this tutorial) is welcome. Please send any suggestions and comments to: | Nuno Saraiva-Agostinho (nunodanielagostinho@gmail.com) | [Computation Biology Lab, Instituto de Medicina Molecular (Portugal)][3] # References [1]: https://tcga-data.nci.nih.gov/docs/publications/tcga [2]: http://gdac.broadinstitute.org [3]: http://imm.medicina.ulisboa.pt/group/compbio [4]: http://www.ensembl.org [5]: http://www.uniprot.org [6]: http://rpubs.com/nuno-agostinho/alt-splicing-annotation [7]: http://gtexportal.org