--- title: 'psichomics tutorial: visual interface' author: "Nuno Saraiva-Agostinho" date: "18 February 2016" bibliography: refs.bib output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Visual interface tutorial} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- --- psichomics is an interactive R package for the analysis of alternative splicing using data from [The Cancer Genome Atlas (TCGA)][5] (containing molecular data associated with 34 tumour types) and from the [Genotype-Tissue Expression][8] project (containing data for multiple normal human tissues). The data available in these including clinical information and transcriptomic data, such as the quantification of RNA-Seq reads aligning to splice junctions (henceforth called junction quantification) and exons. # Installing and starting the program Install psichomics by typing the following in an R console (the [R environment](https://www.r-project.org/) is required): ```{r, eval=FALSE} ## try http:// if https:// URLs are not supported source("https://bioconductor.org/biocLite.R") biocLite("psichomics") ``` After the installation, start the visual interface of the program in your default web browser by typing: ```{r, eval=FALSE} library(psichomics) psichomics() ``` # Downloading and loading TCGA data The quantification of each alternative splicing event is based on the proportion of junction reads that support the inclusion isoform, known as percent spliced-in or PSI [@wang2008]. To estimate this value for each splicing event, both alternative splicing annotation and junction quantification are required. While alternative splicing annotation is provided by the package, junction quantification may be retrieved from TCGA. For instance, load breast cancer data by following these instructions: 1. To load TCGA data, click on the blue panel **Download/load TCGA data**. 2. Fill in the **Tumour type** field with *Breast invasive carcinoma (BRCA)*. 3. Set the most recent date in the **Date** field. 4. In the **Data type** field, select *clinical* and *junction quantification* (more data types will soon be supported). 5. Confirm if the **Folder to store the data** field contains the folder where the files will be downloaded to. 6. Click **Load data**. If the required files are not available in the given folder, they will start downloading when you click **Download data** in the message that appears. When all downloads have finished, proceed by clicking on **Load data** again with the exact same parameters. After the data finish loading (keep an eye on the progress at the top-right corner), the on-screen instructions at the right will be replaced by the loaded datasets. Please note the following: - To optimise performance, not all clinical data columns will be visible by default. However, columns of interest can be added or removed in the field **Visible columns**. - Each column of a dataset is sortable by clicking on it to toggle between ascending and descending order and filtrable by clicking on and editing the field below the column name. Filtering and sorting of loaded data have **no** impact on the subsequent analyses available in the application. ```{r, echo=FALSE, fig.retina=NULL, out.width='400pt'} knitr::include_graphics("load_data.png") ``` **Figure 1:** Available options for TCGA data loading. # Quantifying alternative splicing After loading the clinical and alternative splicing junction quantification data from TCGA, quantify alternative splicing by clicking the blue panel **Alternative splicing quantification** on the left. 1. Select the junction quantification dataset to use from the loaded data. For many tumour types, only one dataset is provided. 2. Designate the alternative splicing event annotation. Currently, only the annotation for **Human (hg19/GRCh37 assembly)** is available[^1]. 3. Choose the event type(s) of interest. To follow the rest of this tutorial, only select **Skipped exon (SE)**. 4. Set the minimum read counts threshold to 10. Inclusion levels calculated with total read counts below this threshold are discarded from further analyses. [^1]: You can create additional alternative splicing annotations for psichomics by parsing the annotation from programs like [VAST-TOOLS][4], [MISO][3], [SUPPA][1] and [rMATS][2]. For more information, [read this tutorial][6]. Click on **Quantify events** to start quantifying alternative splicing. ```{r, echo=FALSE, fig.retina=NULL, out.width='400pt'} knitr::include_graphics("quantify_splicing.png") ``` **Figure 2:** Available options for alternative splicing quantification. # Survival analysis Survival data can be analysed based on clinical attributes, for instance, by tumour stage and patient gender, using time to death as the follow-up time and death as the event of interest. To analyse survival data, click on the **Analyses** tab located in the navigation menu at the top and select **Survival analysis**. Around 80% of breast cancers have cells that express estrogen receptors (ER) and require estrogen binding to grow. Estrogen receptors may be blocked by tamoxifen and other drugs. These drugs are thus used as treatment or prevention for ER-positive breast cancers. To compare the overall survival of ER-positive patients treated with and without tamoxifen: 1. Check **right** data censoring. 2. Use *days to death* for the **follow up time**. 3. Use *death* as the **event of interest**. 4. Display time in **years**. 5. Click on the blue button **Groups**. 6. To create groups of patients based on ER expression: 1. Click on the field below **Select attribute**. 2. Start typing `estrogen_receptor` and click the first suggestion. 3. Click on **Create group**. New groups have now been created based on the unique values of that attribute (*NA*, *indeterminate*, *negative* and *positive*). 4. Click on the *indeterminate* and *NA* groups and click on **Remove** as these groups will not be needed for this tutorial. 7. To create groups of patients treated with tamoxifen: 1. Click **Group by patients** and select **Regular expression**. 2. In the **Regular expression** field, type *tamoxi.\*en* (to retrieve records with either *tamoxifen* or *tamoxiphen*). 3. Click on the field below **Select column to GREP**, type *drug_name* and select the first suggestion. 4. Name the group as *tamoxifen*. 5. Click on **Create group**. 6. Go back to step 7.3 and create groups with the second, the third and the fourth suggestion[^4]. 7. Select all the *tamoxifen* groups by clicking on them one by one and click on **Merge**. 8. Rename the created group by selecting it, scrolling down to the bottom, writing **tamoxifen merged** in the text field and clicking **Rename**. 9. Close the group selection interface by clicking on the **Close** button. 8. In the group selection field, select the *tamoxifen merged* group and the *positive* group for ER expression. 9. Plot survival curves and fit a Cox proportional hazards (PH) model by clicking on the respective buttons at the bottom. [^4]: Unfortunately, TCGA data divides drugs taken by patients through multiple columns. Here, only the majority of tamoxifen-treated patients are selected. The resulting plot will return the survival curves for: * ER-positive tamoxifen-treated patients, * ER-positive non-tamoxifen-treated patients and * non-ER-positive tamoxifen-treated patients. Information regarding number of individuals and events is returned when hovering over each survival curve in the plot. The plot also allows zooming in by clicking and dragging and to omit data series by clicking on their name in the legend. ```{r, echo=FALSE, fig.retina=NULL, out.width='400pt'} knitr::include_graphics("survival.png") ``` **Figure 3:** Available options for patient survival. # Exploring principal component analysis Principal component analysis (PCA) is a technique to reduce data dimensionality by identifying variable combinations (called principal components) that explain the variance in the data [@Ringner2008gb]. To analyse principal components, click on the **Analyses** tab located in the navigation menu at the top and select **Principal component analysis (PCA)**. Explore alternative splicing quantification groups by estrogen receptor (ER) expression: 1. Confirm that **Inclusion levels** will be used as the input of the PCA. 2. In data preprocessing, check **Center values** and uncheck **Scale values**[^2]. 4. Set the tolerance of missing values[^3] to 0%. 5. In **Samples to use for PCA**, click on **Samples from selected groups** and select the *positive* and *negative* groups for ER expression. 7. Click on **Calculate PCA**. [^2]: As PSI values are fixed between an interval of 0 to 1, there is no need to scale values. [^3]: Missing values are replaced with the median value for the respective event across samples. After PCA is performed, options to plot the PCA result appear. Note that the explained variance of each principal component is shown next to the respective component. The **variance plot** is also available to compare the explained variance across principal components (by clicking **Show variance plot**). Now: 1. Choose **PC1** (principal component 1) as the X axis. 2. Choose **PC2** as the Y axis. 3. Select the *positive* and *negative* groups to guide the colouring of samples in the PCA plot. 4. Click on **Plot PCA**. ```{r, echo=FALSE, fig.retina=NULL, out.width='700pt'} knitr::include_graphics("pca.png") ``` **Figure 4:** Available options for PCA performance and plotting. Two PCA plots are then rendered. The plot above is a **score plot** that shows the clinical samples, while the **loadings plot** below displays the variables (in this case, alternative splicing events). The bubble size of the loadings plot represent the total contribution of each alternative splicing event to the selected principal components. By clicking on one alternative splicing event, the respective differential splicing analysis will be shown. Note that the clinical samples from ER-positive individuals (i.e. patients whose cancer cells express estrogen receptors) seem to separate from samples from ER-negative individuals along the principal component 1. There is one alternative splicing event that may contribute to this separation: *SE 10 + 79797062 79799962 79799983 79800373 RPS24*. Click on this alternative splicing event to perform differential splicing analysis. ## Differential splicing analysis In **Groups of samples to analyse**, check **Samples by selected groups** and select the *negative* and *positive* groups for ER-expression. Click **Perform analyses** to plot the PSI distribution and calculate multiple parametric and non-parametric statistical tests based on the selected groups. Check if any of the tests show statistical significance (for instance, p-values below 0.05). Also of interest: * Hover each group in the plot to compare the respective number of samples, median and variance. * To zoom in a specific region, click-and-drag in the region of interest. * To hide or show groups, click on their name in the legend. Moreover, it may be interesting to compare the distribution of normal versus tumour samples. To do so, click **Groups**. In the group selection dialog, click **Group by samples** and then on **Attribute**. In the **Select attribute** field, select **Sample types** and click **Create group**. The groups *Solid Tissue Normal*, *Primary solid Tumor* and *Metastatic* are then created. Click the **Close** button to dismiss. Now, to compare the distributions of the newly created groups, select only these three groups in the **Samples by selected groups** field and click on **Perform analyses**. Not all statistical tests are available depending on the number of groups available. Check if any statistical tests show statistical significance. To study survival analysis by alternative splicing quantification cut-off, click on the blue **Survival analysis by PSI cut-off** button at the sidebar box. ## Survival analysis To study the impact of an alternative splicing event on prognosis, Kaplan-Meier curves may be plotted for groups of patients separated by a given PSI cut-off for the selected alternative splicing event. The optimal PSI cut-off that maximises the significance of their difference in survival (i.e. minimises the p-value of the Wald/Log/Logrank tests of difference in survival between individuals with PSI below and above that threshold) is suggested in the green box and used as the default PSI cut-off, when available. This value can be manually adjusted using the slider named **Splicing quantification cut-off**. Click the buttons *Plot survival curves* and/or *Fit Cox PH model* whenever this slider is changed to update the Kaplan-Meier plot and/or the Cox model. ```{r, echo=FALSE, fig.retina=NULL, out.width='350pt'} knitr::include_graphics("psi_cutoff_2.png") ``` **Figure 5:** Options to adjust the alternative splicing quantification cut-off when performing survival analysis. ## Literature support and external database information If an event is differentially spliced and has an impact on patient survival, its association with the studied disease might be already described in the literature. To check so, go to **Analyses** > **Gene, transcript and protein information** where information regarding the associated gene (such as description and genomic position), transcripts and protein domain annotation are available. - The protein plot shows the UniProt matches for the selected transcript. Hover the protein's rendered domains to obtain more information on them. More information about each protein can be retrieved by clicking the respective **UniProt** link. - Links to related research articles are also available. Click **Show more articles** to be directed to PubMed. - Multiple links to related external databases are available too: - **Human Protein Atlas (Cancer Atlas)** allows to check the evidence of a gene at protein level for multiple cancer tissues. - **VastDB** shows multi-species alternative splicing profiles for diverse tissues and cell types. - **UCSC Genome Browser** may reveal protein domain disruptions caused by the alternative splicing event. To check so, activate the **Pfam in UCSC Gene** and **UniProt** tracks (in *Genes and Gene Predictions*) and check if any domains are annotated in the alternative and/or constitutive exons of the splicing event. # Exploring differential splicing analysis To analyse differential splicing, click on the **Analyses** tab located in the navigation menu at the top and select **Differential splicing analysis**. Scroll to the top of the page and click on **All events (table)**. 1. In **Groups of samples to analyse**, click on **Samples by selected groups**. 2. Select the *positive* and *negative* ER-expression groups. 3. Confirm that all statistical analyses of interest are checked. 4. Confirm p-values will be adjusted according to the **Benjamini-Hochberg's method**. 5. Click on **Perform analyses**. When the analyses complete, the results are shown in a filtrable and sortable table. ```{r, echo=FALSE, fig.retina=NULL, out.width='350pt'} knitr::include_graphics("diff_splicing.png") ``` **Figure 6:** Options for differential splicing analysis. ## Statistical significance Select statistically significant splicing events by performing the following: 1. Find a column whose name is *Fligner-Killeen p-value (BH adjusted)*. Note that **Horizontal scrolling** is required to visualise all available columns. 2. Click the box, labeled **All**, just below the column name and filter the values between 0 and around 0.05. 3. Sort the values by the difference in variance by clicking once on **Delta variance**. ## Distribution of alternative splicing quantifications The statistical analyses table provides a column with a density plot of the distribution of the alternative splicing quantification for each event. By clicking on the plot (or its respective event identifier), a page dedicated to that alternative splicing event's statistics and exhibiting the density plot in greater detail will show up. To go back to the table with all events, click on the button above the grey box titled **All events (table)**. ## Survival analysis To study the impact of an alternative splicing event on prognosis, survival data can be incorporated for 10 splicing events shown in the table at a time, by default. After performing differential splicing analysis, the options shown in **Figure 7** appear. Given the slow process of calculating the optimal splicing quantification cut-off for multiple events, it is recommended to perform this on either the events shown on-screen or after filtering the table for differentially spliced events supported by statistical significance. Perform survival analysis by alternative splicing quantification cut-off: 1. Check **right** data censoring. 2. Use *days to death* for the **follow up time**. 3. Use *death* as the **event of interest**. 4. Select to perform survival analyses based on the **splicing events shown in the screen**. 5. Click on **Plot survival curves**. ```{r, echo=FALSE, fig.retina=NULL, out.width='400pt'} knitr::include_graphics("psi_cutoff.png") ``` **Figure 7:** Available options for survival analysis. Kaplan-Meier plots with the results will appear below the table. Each plot corresponds to one alternative splicing event shown in the table above. To test differences in survival with another PSI cut-off, clicking on the plotted curves will lead the user to the **Survival analyses** tab, allowing to manually adjust the alternative splicing quantification cut-off. Click on one alternative splicing event with a p-value below 0.05 (such as the splicing events for *C1D* or *ALG13*). Check literature information and search external databases for more information, including on **UCSC Genome Browser** for putative protein domain disruptions resulting from the alternative splicing event. # Exploring an alternative splicing event of interest At any time during these analyses, the alternative splicing event of interest may be changed by clicking on **Change...** in the top-right corner relative to the selected alternative splicing event. Any analyses that depend on the selected alternative splicing event are now performed based on the currently selected event. This allows the user to explore an alternative splicing event of their choice. For instance, the event *SE 6 - 46823711 46822518 46822452 46821808 GPR116* has been previously reported to have potential prognostic value in breast cancer patients, where patients with higher PSI values for this event have a lower 5-year survival rate than patients with lower PSI values [@Tsai2015jf]. Confirm so by performing differential splicing (for example, normal vs tumour samples) on this event and survival analysis by PSI cut-off. Also, check literature information and search external databases for more information, including on **UCSC Genome Browser** for putative protein domain disruptions resulting from the alternative splicing event. # Feedback All feedback on the program, documentation and associated material (including this tutorial) is welcome. Please send any suggestions and comments to: > Nuno Saraiva Agostinho (nunodanielagostinho@gmail.com) > [Computation Biology Lab, Instituto de Medicina Molecular (Portugal)][7] # References [1]: https://bitbucket.org/regulatorygenomicsupf/suppa [2]: http://rnaseq-mats.sourceforge.net [3]: http://genes.mit.edu/burgelab/miso/ [4]: https://github.com/vastgroup/vast-tools [5]: https://tcga-data.nci.nih.gov/docs/publications/tcga [6]: http://rpubs.com/nuno-agostinho/alt-splicing-annotation [7]: http://imm.medicina.ulisboa.pt/group/compbio [8]: http://gtexportal.org