---
title: 'psichomics case study: command-line interface (CLI)'
author: "Nuno Saraiva-Agostinho"
date: "`r Sys.Date()`"
bibliography: refs.bib
csl: bioinformatics.csl
output: 
    rmarkdown::html_vignette:
        toc: true
vignette: >
    %\VignetteIndexEntry{Case study: command-line interface (CLI)}
    %\VignetteEngine{knitr::rmarkdown}
    \usepackage[utf8]{inputenc}
---

---

*psichomics* is an interactive R package for integrative analyses of alternative
splicing and gene expression based on [The Cancer Genome Atlas (TCGA)][TCGA]
(containing molecular data associated with 34 tumour types), the
[Genotype-Tissue Expression (GTEx)][GTEx] project (containing data for multiple
normal human tissues), [Sequence Read Archive][SRA] and user-provided data. The 
data from GTEx, TCGA and select SRA projects include subject/sample-associated
information and transcriptomic data, such as the quantification of RNA-Seq reads
aligning to splice junctions (henceforth called junction quantification) and 
exons.

# Installing and starting the program
Install *psichomics* by typing the following in an R console (the 
[R environment](https://www.r-project.org/) is required):

```{r install, eval=FALSE}
install.packages("BiocManager")
BiocManager::install("psichomics")
```

After the installation, load psichomics by typing:

```{r load, message=FALSE}
library(psichomics)
```

# Available functions: quick reference

* `psichomics`: Start the visual interface of *psichomics*
* `parseSplicingEvent`: Parse splicing events
* `getSplicingEventFromGenes`: Get alternative splicing events from genes
* `getGenesFromSplicingEvents`: Get genes from alternative splicing events

### Data retrieval

**TCGA/Firebrowse**

* `getDownloadsFolder`: Get the user's Downloads folder
* `isFirebrowseUp`: Check if Firebrowse web API is online
* `getFirebrowseCohorts`: Query the Firebrowse web API for TCGA cohorts
* `getFirebrowseDataTypes`: Query the Firebrowse web API for TCGA data types
* `getFirebrowseDates`: Query the Firebrowse web API for processing dates
* `loadFirebrowseData`: Download and load TCGA data through the Firebrowse web 
API
* `parseTcgaSampleInfo`: Parse sample information from TCGA samples

**GTEx**

* `getGtexTissues`: Check available tissues from a file containing sample
metadata
* `loadGtexData`: Load GTEx data

**SRA**

* `recount::recount_abstract`: Check available [SRA][SRA] projects from 
[*recount*][recount]
* `loadSRAproject`: Download and load [SRA][SRA] projects through the 
[*recount*][recount] R package

**Custom and/or local files**

* `loadLocalFiles`: Load local files from a given folder
* `prepareSRAmetadata`: Prepare metadata from [SRA][SRA] (as obtained from Run
Selector page)
* `prepareJunctionQuant`: Prepare junction quantification files from 
splice-aware aligners (currently, psichomics supports the output of the 
splice-aware aligner [STAR][STAR])
* `prepareGeneQuant`: Prepare gene quantification files from splice-aware 
aligners (currently, psichomics supports the output of the splice-aware aligner
[STAR][STAR])

### Gene expression pre-processing
* `plotRowStats`: Plot statistics (median, variance, etc.) per gene
* `plotGeneExprPerSample`: Plot distribution of gene expression per sample
* `filterGeneExpr`: Filter genes based on their expression
* `normaliseGeneExpression`: Normalise gene expression data
* `convertGeneIdentifiers`: Convert between different gene identifiers

### PSI quantification
* `getSplicingEventTypes`: Get quantifiable splicing event types
* `listSplicingAnnotations`: List available alternative splicing annotation 
files
* `loadAnnotation`: Load an alternative splicing annotation file
* `quantifySplicing`: Quantify alternative splicing
* `plotRowStats`: Plot statistics (median, variance, etc.) per alternative 
splicing event
* `filterPSI`: Filter alternative splicing quantification

**Custom alternative splicing annotation preparation**

* `prepareAnnotationFromEvents`
* `parseMatsAnnotation`: Parse splicing annotation from [rMATS][rMATS]
* `parseMisoAnnotation`: Parse splicing annotation from [MISO][MISO]
* `parseSuppaAnnotation`: Parse splicing annotation from [SUPPA][SUPPA]
* `parseVastToolsAnnotation`: Parse splicing annotation from 
[VAST-TOOLS][VAST-TOOLS]

### Data Grouping
* `getGeneList`: Get pre-created, literature-based lists of genes
* `createGroupByAttribute`: Split elements into groups based on a given 
attribute
* `getSampleFromSubject`: Get samples matching the given patients
* `getSubjectFromSample`: Get patients matching the given samples
* `groupPerElem`: Return a vector with one group for each element
* `testGroupIndependence`: Test multiple contigency tables comprised by two 
groups (one reference group and another containing remaing elements) and 
provided groups
* `plotGroupIndependence`: Plot -log10(p-values) of the results obtained after 
multiple group independence testing

### Dimensionality reduction

**Principal component analysis (PCA)**

* `performPCA`: Perform PCA
* `plotVariance`: Render variance plot
* `calculateLoadingsContribution`: Calculate the contribution of the variables
and return values in a data frame
* `plotPCA`: Plot PCA individuals (scores) and/or variable contributions

**Independent component analysis (ICA)**

* `performICA`: Perform ICA
* `plotICA`: Plot ICA scores

### Survival analysis
* `getAttributesTime`: Get time for given columns in a clinical dataset
* `assignValuePerSubject`: Assign average samples values to their corresponding
patients
* `labelBasedOnCutoff`: Label groups based on a given cutoff
* `optimalSurvivalCutoff`: Calculate optimal data cutoff that best separates 
survival curves
* `testSurvival`: Test the survival difference between groups of patients
* `processSurvTerms`: Process survival curves terms to calculate survival curves
* `survfit`: Compute estimates of survival curves
* `survdiffTerms`: Test differences between survival curves
* `plotSurvivalCurves`: Plot survival curves
* `plotSurvivalPvaluesByCutoff`: Plot p-values of survival difference between
groups based on multiple cutoffs

### Differential analyses
* `diffAnalyses`: Perform statistical analyses (including differential splicing
and gene expression)
* `plotDistribution`: Plot distribution using a density plot

### Gene expression and alternative splicing correlation
* `correlateGEandAS`: Test for association between paired samples' gene 
expression (for any genes of interest) and alternative splicing quantification
* `plotCorrelation`: Scatter plots of the correlation results
* `as.table`: Table of the correlation results

### Annotation retrieval
* `queryEnsemblByEvent`: Query Ensembl based on an alternative splicing event
* `queryEnsemblByGene`: Query Ensembl based on a gene
* `ensemblToUniprot`: Convert an Ensembl identifier to the respective UniProt 
identifier
* `plotProtein`: Plot domains of a given protein
* `plotTranscripts`: Plot transcripts of a given gene

# Exploration of clinically-relevant, differentially spliced events in breast cancer

> The following case study was adapted from *psichomics*' original article:
>
> Nuno Saraiva-Agostinho and Nuno L. Barbosa-Morais (2019). 
**[psichomics: graphical application for alternative splicing quantification and analysis][article]**. *Nucleic Acids Research*.

Breast cancer is the cancer type with the highest incidence and mortality in 
women [@Torre2015] and multiple studies have suggested that transcriptome-wide
analyses of alternative splicing changes in breast tumours are able to uncover
tumour-specific biomarkers [@Tsai2015; @DananGotthold2015; @Anczukow2015].
Given the relevance of early detection of breast cancer to patient survival, we 
can use *psichomics* to identify novel tumour stage-I-specific molecular 
signatures based on differentially spliced events.

## Downloading and loading TCGA data
The quantification of each alternative splicing event is based on the proportion
of junction reads that support the inclusion isoform, known as percent 
spliced-in or PSI [@wang2008].

To estimate this value for each splicing event, both alternative splicing
annotation and junction quantification are required. While alternative splicing
annotation is provided by the package, junction quantification may be retrieved
from [TCGA][TCGA], [GTEx][GTEx], [SRA][SRA] or user-provided files.

Data is downloaded from [Firebrowse][Firebrowse], a service that hosts 
proccessed data from [TCGA][TCGA], as required to run the downstream analyses. 
Before downloading data, check the following options:

```{r TCGA options}
# Available tumour types
cohorts <- getFirebrowseCohorts()

# Available sample dates
date <- getFirebrowseDates()

# Available data types
dataTypes <- getFirebrowseDataTypes()
```

> Note there is also the option for *Gene expression (normalised by RSEM)*.
However, we recommend to load the raw gene expression data instead, followed by 
filtering and normalisation as demonstrated afterwards.

After deciding on the options to use, download and load breast cancer data as 
follows:
```{r download, eval=FALSE}
# Set download folder
folder <- getDownloadsFolder()

# Download and load most recent junction quantification and clinical data from
# TCGA/Firebrowse for Breast Cancer
data <- loadFirebrowseData(folder=folder,
                           cohort="BRCA",
                           data=c("clinical", "junction_quantification",
                                  "RSEM_genes"),
                           date="2016-01-28")

# Select clinical and junction quantification dataset
clinical      <- data[[1]]$`Clinical data`
sampleInfo    <- data[[1]]$`Sample metadata`
junctionQuant <- data[[1]]$`Junction quantification (Illumina HiSeq)`
geneExpr      <- data[[1]]$`Gene expression`
```

Data is only downloaded if the files are not present in the given folder. In 
other words, if the files were already downloaded, the function will just load
the files, so it is possible to reuse the code above just to load the requested
files.

> **Windows limitations**: If you are using *Windows*, note that the downloaded 
files have huge names that may be over [*Windows* Maximum Path Length][maxpath].
A workaround would be to manually rename the downloaded files to have shorter 
names, move all downloaded files to a single folder and load such folder. Read
how in section **Load unspecified local files** at the end of this document.

```{r prepare examples, include=FALSE}
clinical <- readRDS("BRCA_clinical.RDS")
geneExpr <- readRDS("BRCA_geneExpr.RDS")
```

## Filtering and normalising gene expression
As this package does not focuses on gene expression analysis, we suggest to read
the RNA-seq section of `limma`'s user guide. Nevertheless, we present the 
following commands to quickly filter and normalise gene expression:

```{r normalise gene expression}
# Check genes where min. counts are available in at least N samples and filter
# out genes with mean expression and variance of 0
checkCounts <- rowSums(geneExpr >= 10) >= 10
filter <- rowMeans(geneExpr) > 0 & rowVars(geneExpr) > 0 & checkCounts
geneExprFiltered <- geneExpr[filter, ]

# Normalise gene expression and perform log2-transformation (after adding 0.5
# to avoid zeroes)
geneExprNorm <- normaliseGeneExpression(geneExprFiltered, log2transform=TRUE)
```


## Quantifying alternative splicing
After loading the clinical and alternative splicing junction quantification data
from TCGA, quantify alternative splicing by clicking the green panel 
**Alternative splicing quantification**.

As previously mentioned, alternative splicing is quantified from the previously
loaded junction quantification and an alternative splicing annotation file. To 
check current annotation files available:

```{r quantify options}
# Available alternative splicing annotation
annotList <- listSplicingAnnotations()
annotList
```

> **Custom splicing annotation:** Additional alternative splicing annotations 
can be prepared for *psichomics* by parsing the annotation from programs like 
[VAST-TOOLS][VAST-TOOLS], [MISO][MISO], [SUPPA][SUPPA] and [rMATS][rMATS]. Note
that SUPPA and rMATS are able to create their splicing annotation based on
transcript annotation. For more information, 
[read this tutorial][annotation-tutorial].

To quantify alternative splicing, first select the junction quantification,
alternative splicing annotation and alternative splicing event type(s) of 
interest:

```{r prepare to quantify splicing, eval=FALSE}
# Load Human (hg19/GRCh37 assembly) annotation
human <- listSplicingAnnotations()[[1]]
annotation <- loadAnnotation(human)
```

```{r event types}
# Available alternative splicing event types (skipped exon, alternative 
# first/last exon, mutually exclusive exons, etc.)
getSplicingEventTypes()
```

Afterwards, quantify alternative splicing using the previously defined
parameters:

```{r quantify splicing, eval=FALSE}
# Discard alternative splicing quantified using few reads
minReads <- 10 # default

psi <- quantifySplicing(annotation, junctionQuant, minReads=minReads)
```

```{r load splicing, echo=FALSE}
psi <- readRDS("BRCA_psi.RDS")
sampleInfo <- parseTcgaSampleInfo(colnames(psi))
```

```{r check splicing events}
# Check the identifier of the splicing events in the resulting table
events <- rownames(psi)
head(events)
```

Note that the event identifier (for instance,
`SE_1_-_2125078_2124414_2124284_2121220_C1orf86`) is composed of:

- Event type (`SE` stands for skipped exon)
- Chromosome (`1`)
- Strand (`-`)
- Relevant coordinates depending on event type (in this case, the first
constitutive exon's end, the alternative exon' start and end and the second
constitutive exon's start)
- Associated gene (`C1orf86`)

> **Warning:** all examples shown in this case study are performed using a 
small, yet representative subset of the available data. Therefore, values shown
here may correspond to those when performing the whole analysis.

## Data grouping
Let us create groups based on available samples types (i.e. *Metastatic*,
*Primary solid Tumor* and *Solid Tissue Normal*) and tumour stages. As tumour
stages are divided by sub-stages, we will merge sub-stages so as to have only
tumour samples from stages I, II, III and IV (stage X samples are discarded as 
they are uncharacterised tumour samples).

```{r data grouping}
# Group by normal and tumour samples
types  <- createGroupByAttribute("Sample types", sampleInfo)
normal <- types$`Solid Tissue Normal`
tumour <- types$`Primary solid Tumor`

# Group by tumour stage (I, II, III or IV) or normal samples
stages <- createGroupByAttribute(
    "patient.stage_event.pathologic_stage_tumor_stage", clinical)
groups <- list()
for (i in c("i", "ii", "iii", "iv")) {
    stage <- Reduce(union,
           stages[grep(sprintf("stage %s[a|b|c]{0,1}$", i), names(stages))])
    # Include only tumour samples
    stageTumour <- names(getSubjectFromSample(tumour, stage))
    elem <- list(stageTumour)
    names(elem) <- paste("Tumour Stage", toupper(i))
    groups <- c(groups, elem)
}
groups <- c(groups, Normal=list(normal))

# Prepare group colours (for consistency across downstream analyses)
colours <- c("#6D1F95", "#FF152C", "#00C7BA", "#FF964F", "#00C65A")
names(colours) <- names(groups)
attr(groups, "Colour") <- colours

# Prepare normal versus tumour stage I samples
normalVSstage1Tumour <- groups[c("Tumour Stage I", "Normal")]
attr(normalVSstage1Tumour, "Colour") <- attr(groups, "Colour")

# Prepare normal versus tumour samples
normalVStumour <- list(Normal=normal, Tumour=tumour)
attr(normalVStumour, "Colour") <- c(Normal="#00C65A", Tumour="#EFE35C")
```

## Principal component analysis (PCA)
PCA is a technique to reduce data dimensionality by identifying variable 
combinations (called principal components) that explain the variance in the data
[@Ringner2008]. Use the following commands to perform PCA:

```{r perform pca}
# PCA of PSI between normal and tumour stage I samples
psi_stage1Norm    <- psi[ , unlist(normalVSstage1Tumour)]
pcaPSI_stage1Norm <- performPCA(t(psi_stage1Norm))
```

> As PCA cannot be performed on data with missing values, missing values need to
be either removed (thus discarding data from whole splicing events or genes) or 
impute them (i.e. attributing to missing values the median of the non-missing 
ones). Use the argument `missingValues` within function `performPCA` to select
the number of missing values that are tolerable per event (i.e. if a splicing 
event or gene has less than N missing values, those missing values will be 
imputed; otherwise, the event is discarded from PCA).

```{r plot pca}
# Explained variance across principal components
plotVariance(pcaPSI_stage1Norm)

# Score plot (clinical individuals)
plotPCA(pcaPSI_stage1Norm, groups=normalVSstage1Tumour)

# Loading plot (variable contributions)
plotPCA(pcaPSI_stage1Norm, loadings=TRUE, individuals=FALSE)

# Table of variable contributions (as used to plot PCA, also)
table <- calculateLoadingsContribution(pcaPSI_stage1Norm)
knitr::kable(head(table, 5))
```

> For performance reasons, the loading plot is able to exclusively render the 
top variables that most contribute to the select principal components by using 
the argument `nLoadings` within function `plotPCA`.

> **Hint:** As most plots in *psichomics*, PCA plots can be zoomed-in by
clicking-and-dragging within the plot (click *Reset zoom* to zoom-out). To 
toggle the visibility of the data series represented in the plot, click its 
respective name in the plot legend.

To perform PCA using alternative splicing quantification and gene expression 
data (both using *all samples* and only *Tumour Stage I* and *Normal* samples):

```{r perform and plot remaining pca, eval=FALSE}
# PCA of PSI between all samples (coloured by tumour stage and normal samples)
pcaPSI_all <- performPCA(t(psi))
plotPCA(pcaPSI_all, groups=groups)
plotPCA(pcaPSI_all, loadings=TRUE, individuals=FALSE)

# PCA of gene expression between all samples (coloured by tumour stage and 
# normal samples)
pcaGE_all <- performPCA(t(geneExprNorm))
plotPCA(pcaGE_all, groups=groups)
plotPCA(pcaGE_all, loadings=TRUE, individuals=FALSE)

# PCA of gene expression between normal and tumour stage I samples
ge_stage1Norm    <- geneExprNorm[ , unlist(normalVSstage1Tumour)]
pcaGE_stage1Norm <- performPCA(t(ge_stage1Norm))
plotPCA(pcaGE_stage1Norm, groups=normalVSstage1Tumour)
plotPCA(pcaGE_stage1Norm, loadings=TRUE, individuals=FALSE)
```

## *NUMB* exon 12 inclusion and correlation with QKI gene expression
One of the splicing events that most contribute the separation between tumour 
stage I and normal samples is **NUMB exon 12 inclusion**, whose protein is 
crucial for cell differentiation as a key regulator of the Notch pathway. The
RNA-binding protein QKI has been shown to repress NUMB exon 12 inclusion in
lung cancer cells by competing with core splicing factor SF1 for binding to the
branch-point sequence, thereby repressing the Notch signalling pathway, which 
results in decreased cancer cell proliferation [@zong2014].

### Differential inclusion of *NUMB* exon 12
Let's check whether a significant difference in *NUMB* exon 12 inclusion
between tumour and normal TCGA breast samples. To do so:

```{r diff splicing NUMB exon 12}
# Find the right event
ASevents <- rownames(psi)
(tmp     <- grep("NUMB", ASevents, value=TRUE))
NUMBskippedExon12 <- tmp[1]

# Plot its PSI distribution
plotDistribution(psi[NUMBskippedExon12, ], normalVStumour)
```

Consistent with the cited article, *NUMB* exon 12 inclusion is significantly
increased in cancer.

Also of interest:

* Hover each group in the plot to compare the respective number of samples,
median and variance.
* To zoom in a specific region, click-and-drag in the region of interest.
* To hide or show groups, click on their name in the legend.

### Correlation between *NUMB* exon 12 inclusion and QKI expression
To verify if *NUMB* exon 12 inclusion is correlated with QKI expression:

```{r correlation, warning=FALSE}
# Find the right gene
genes <- rownames(geneExprNorm)
(tmp  <- grep("QKI", genes, value=TRUE))
QKI   <- tmp[1] # "QKI|9444"

# Plot its gene expression distribution
plotDistribution(geneExprNorm[QKI, ], normalVStumour, psi=FALSE)
plotCorrelation(correlateGEandAS(
    geneExprNorm, psi, QKI, NUMBskippedExon12, method="spearman"))
```

According to the obtained results and also consistent with the previous article,
the inclusion of the exon is negatively correlated with QKI expression.

## Differential splicing analysis

To analyse alternative splicing between normal and tumour stage I samples:

```{r exploratory diff analysis, message=FALSE}
diffSplicing <- diffAnalyses(psi, normalVSstage1Tumour)

# Filter based on |∆ Median PSI| > 0.1 and q-value < 0.01
deltaPSIthreshold <- abs(diffSplicing$`∆ Median`) > 0.1
pvalueThreshold   <- diffSplicing$`Wilcoxon p-value (BH adjusted)` < 0.01

# Plot results
library(ggplot2)
ggplot(diffSplicing, aes(`∆ Median`, 
                         -log10(`Wilcoxon p-value (BH adjusted)`))) +
    geom_point(data=diffSplicing[deltaPSIthreshold & pvalueThreshold, ],
               colour="orange", alpha=0.5, size=3) + 
    geom_point(data=diffSplicing[!deltaPSIthreshold | !pvalueThreshold, ],
               colour="gray", alpha=0.5, size=3) + 
    theme_light(16) +
    ylab("-log10(q-value)")
```

### Performing multiple survival analysis
To study the impact of alternative splicing events on prognosis, Kaplan-Meier
curves may be plotted for groups of patients separated by the optimal PSI 
cutoff for a given alternative splicing event that that maximises the 
significance of group differences in survival analysis (i.e. minimises the 
p-value of the log-rank tests of difference in survival between individuals 
whose samples have their PSI below and above that threshold).

Given the slow process of calculating the optimal splicing quantification 
cutoff for multiple events, it is recommended to perform this for a subset of
differentially spliced events.

```{r survival}
# Events already tested which have prognostic value
events <- c(
    "SE_9_+_6486925_6492303_6492401_6493826_UHRF2",
    "SE_4_-_87028376_87024397_87024339_87023185_MAPK10",
    "SE_2_+_152324660_152324988_152325065_152325155_RIF1",
    "SE_2_+_228205096_228217230_228217289_228220393_MFF",
    "MXE_15_+_63353138_63353397_63353472_63353912_63353987_63354414_TPM1",
    "SE_2_+_173362828_173366500_173366629_173368819_ITGA6",
    "SE_1_+_204957934_204971724_204971876_204978685_NFASC")

# Survival curves based on optimal PSI cutoff
library(survival)

# Assign alternative splicing quantification to patients based on their samples
samples <- colnames(psi)
match <- getSubjectFromSample(samples, clinical, sampleInfo=sampleInfo)

survPlots <- list()
for (event in events) {
    # Find optimal cutoff for the event
    eventPSI <- assignValuePerPatient(psi[event, ], match, clinical,
                                      samples=unlist(tumour))
    opt <- optimalSurvivalCutoff(clinical, eventPSI, censoring="right", 
                                 event="days_to_death", 
                                 timeStart="days_to_death")
    (optimalCutoff <- opt$par)    # Optimal exon inclusion level
    (optimalPvalue <- opt$value)  # Respective p-value
    
    label     <- labelBasedOnCutoff(eventPSI, round(optimalCutoff, 2), 
                                    label="PSI values")
    survTerms <- processSurvTerms(clinical, censoring="right",
                                  event="days_to_death", 
                                  timeStart="days_to_death",
                                  group=label, scale="years")
    surv <- survfit(survTerms)
    pvalue <- testSurvival(survTerms)
    plotSurvivalCurves(surv, pvalue=pvalue, mark=FALSE)
}
```

## Differential gene expression

Detected alterations in alternative splicing may simply be a reflection of 
changes in gene expression levels. Therefore, to disentangle these two effects, 
differential expression analysis between tumour stage I and normal samples 
should also be performed. In order to do so:

```{r exploratory diff expression, message=FALSE, warning=FALSE}
# Prepare groups of samples to analyse and further filter unavailable samples in
# selected groups for gene expression
ge           <- geneExprNorm[ , unlist(normalVSstage1Tumour), drop=FALSE]
isFromGroup1 <- colnames(ge) %in% normalVSstage1Tumour[[1]]
design       <- cbind(1, ifelse(isFromGroup1, 0, 1))

# Fit a gene-wise linear model based on selected groups
library(limma)
fit <- lmFit(ge, design)

# Calculate moderated t-statistics and DE log-odds using limma::eBayes
ebayesFit <- eBayes(fit, trend=TRUE)

# Prepare data summary
pvalueAdjust <- "BH" # Benjamini-Hochberg p-value adjustment (FDR)
summary <- topTable(ebayesFit, number=nrow(fit), coef=2, sort.by="none",
                    adjust.method=pvalueAdjust, confint=TRUE)
names(summary) <- c("log2 Fold-Change", "CI (low)", "CI (high)", 
                    "Average expression", "moderated t-statistics", "p-value", 
                    paste0("p-value (", pvalueAdjust, " adjusted)"),
                    "B-statistics")
attr(summary, "groups") <- normalVSstage1Tumour

# Calculate basic statistics
stats <- diffAnalyses(ge, normalVSstage1Tumour, "basicStats", 
                      pvalueAdjust=NULL)
final <- cbind(stats, summary)

# Differential gene expression between breast tumour stage I and normal samples
library(ggplot2)
library(ggrepel)
cognateGenes <- unlist(parseSplicingEvent(events)$gene)
logFCthreshold  <- abs(final$`log2 Fold-Change`) > 1
pvalueThreshold <- final$`p-value (BH adjusted)` < 0.01

final$genes <- gsub("\\|.*$", "\\1", rownames(final))
ggplot(final, aes(`log2 Fold-Change`, 
                  -log10(`p-value (BH adjusted)`))) +
    geom_point(data=final[logFCthreshold & pvalueThreshold, ],
               colour="orange", alpha=0.5, size=3) + 
    geom_point(data=final[!logFCthreshold | !pvalueThreshold, ],
               colour="gray", alpha=0.5, size=3) + 
    geom_text_repel(data=final[cognateGenes, ], aes(label=genes),
                    box.padding=0.4, size=5) +
    theme_light(16) +
    ylab("-log10(q-value)")
```

### *UHRF2* exon 10 inclusion

One splicing event with prognostic value is the alternative splicing of *UHRF2* 
exon 10. Cell-cycle regulator UHRF2 promotes cell proliferation and inhibits the
expression of tumour suppressors in breast cancer [@wu2012].

#### Differential splicing analysis

Let's check whether a significant difference in *UHRF2* exon 10 inclusion
between tumour stage I and normal samples. To do so:

```{r UHRF2 exon 10 diff splicing}
# UHRF2 skipped exon 10's PSI values per tumour stage I and normal samples
UHRF2skippedExon10 <- events[1]
plotDistribution(psi[UHRF2skippedExon10, ], normalVSstage1Tumour)
```

Higher inclusion of *UHRF2* exon 10 is associated with normal samples.

#### Survival analysis 
To study the impact of alternative splicing events on prognosis, Kaplan-Meier
curves may be plotted for groups of patients separated by a given PSI 
cutoff for a given alternative splicing event. The optimal PSI cutoff 
maximises the  significance of group differences in survival analysis (i.e. 
minimises the p-value of the log-rank tests of difference in survival between
individuals whose samples have a PSI below and above that threshold).

```{r UHRF2 PSI survival}
# Find optimal cutoff for the event
UHRF2skippedExon10 <- events[1]
eventPSI <- assignValuePerPatient(psi[UHRF2skippedExon10, ], match, clinical,
                                  samples=unlist(tumour))
opt <- optimalSurvivalCutoff(clinical, eventPSI, censoring="right", 
                             event="days_to_death", timeStart="days_to_death")
(optimalCutoff <- opt$par)    # Optimal exon inclusion level
(optimalPvalue <- opt$value)  # Respective p-value

label     <- labelBasedOnCutoff(eventPSI, round(optimalCutoff, 2), 
                                label="PSI values")
survTerms <- processSurvTerms(clinical, censoring="right",
                              event="days_to_death", timeStart="days_to_death",
                              group=label, scale="years")
surv <- survfit(survTerms)
pvalue <- testSurvival(survTerms)
plotSurvivalCurves(surv, pvalue=pvalue, mark=FALSE)
```

As per the results, higher inclusion of *UHRF2* exon 10 is associated with 
better prognosis.

#### Differential expression
To check whether alternative splicing changes are related with gene expression 
alterations, let us perform differential expression analysis on UHRF2:

```{r UHRF2 GE diff expression}
plotDistribution(geneExprNorm["UHRF2", ], normalVSstage1Tumour, psi=FALSE)
```

It seems UHRF2 is differentially expressed between *Tumour Stage I* and 
*Solid Tissue Normal*. However, going back to exploratory differential gene 
expression, *UHRF2* has a log2(fold-change) ≤ 1, low enough not to be 
biologically relevant. Following this criterium, the gene can thus be considered
not to be differentially expressed between these conditions.

#### Survival analysis
To confirm if gene expression has an overall prognostic value, perform the 
following:

```{r UHRF2 GE survival}
UHRF2ge <- assignValuePerPatient(geneExprNorm["UHRF2", ], match, clinical, 
                                 samples=unlist(tumour))

# Survival curves based on optimal gene expression cutoff
opt <- optimalSurvivalCutoff(clinical, UHRF2ge, censoring="right",
                             event="days_to_death", timeStart="days_to_death")
(optimalCutoff <- opt$par)    # Optimal exon inclusion level
(optimalPvalue <- opt$value)  # Respective p-value

# Process again after rounding the cutoff
roundedCutoff <- round(optimalCutoff, 2)
label     <- labelBasedOnCutoff(UHRF2ge, roundedCutoff, label="Gene expression")
survTerms <- processSurvTerms(clinical, censoring="right",
                              event="days_to_death", timeStart="days_to_death",
                              group=label, scale="years")
surv   <- survfit(survTerms)
pvalue <- testSurvival(survTerms)
plotSurvivalCurves(surv, pvalue=pvalue, mark=FALSE)
```

There seems to be no significant difference in survival between patient groups
stratified by UHRF2's optimal gene expression cutoff in tumour samples (log-rank
p-value > 0.05).

#### Literature support and external database information

If an event is differentially spliced and has an impact on patient survival, 
its association with the studied disease might be already described in the
literature. To check so, go to **Analyses** >
**Gene, transcript and protein information** where information regarding the
associated gene (such as description and genomic position), transcripts and
protein domain annotation are available.

- The protein plot shows the UniProt matches for the selected transcript. Hover 
the protein's rendered domains to obtain more information on them. More
information about each protein can be retrieved by clicking the respective
**UniProt** link.
- Links to related research articles are also available. Click **Show more 
articles** to be directed to PubMed.
- Multiple links to related external databases are available too:
    - **Human Protein Atlas (Cancer Atlas)** allows to check the evidence of a
    gene at protein level for multiple cancer tissues.
    - **VastDB** shows multi-species alternative splicing profiles for diverse 
    tissues and cell types.
    - **UCSC Genome Browser** may reveal protein domain disruptions caused by 
    the alternative splicing event. To check so, activate the **Pfam in UCSC
    Gene** and **UniProt** tracks (in *Genes and Gene Predictions*) and check if
    any domains are annotated in the alternative and/or constitutive exons of 
    the splicing event.

#### Interpretation

Higher inclusion of *UHRF2* exon 10 is associated with normal samples and better
prognosis, and potentially disrupts UHRF2's SRA-YDG protein domain, related to 
the binding affinity to epigenetic marks. Hence, exon 10 inclusion may suppress 
UHRF2's oncogenic role in breast cancer by impairing its activity through the
induction of a truncated protein or a non-coding isoform. Moreover, this 
hypothesis is independent from gene expression changes, as UHRF2 is not 
differentially expressed between tumour stage I and normal samples 
(|log2(fold-change)| < 1) and there is no significant difference in survival 
between patient groups stratified by its expression in tumour samples (log-rank 
p-value > 0.05).

# Loading data from other sources

## Load GTEx files

First, GTEx data needs to be downloaded from the
[GTEx Portal](http://gtexportal.org). Afterwards, load GTEx data (subject
phenotype, sample attributes and junction quantification for given tissues) by
following these commands:

```{r load GTEx, eval=FALSE}
# Replace with the correct path to these files
subjects <- "~/Downloads/GTEx_Data_V6_Annotations_SubjectPhenotypesDS.txt"
sampleAttr <- "~/Downloads/GTEx_Data_V6_Annotations_SampleAttributesDS.txt"
junctionQuant <- "~/Downloads/GTEx_junction_reads.txt"

# Check GTEx tissues available based on the sample attributes
getGtexTissues(sampleAttr)

tissues <- c("blood", "brain")
gtex <- loadGtexData(subjects, sampleAttr, junctionQuant, tissues)
```

If you desire to load junction quantification for all tissues, you can also do
so through the following commands:

```{r load all GTEx, eval=FALSE}
tissues <- NULL
gtex <- loadGtexData(subjectPhenotype, sampleAttr, junctionQuant, tissues)
```

## Load SRA project data using recount

[recount2][recount] is a resource of pre-processed data for thousands of 
[SRA][SRA] projects (including gene read counts, splice junction quantification 
and sample metadata). psichomics supports automatic downloading and loading of 
[SRA][SRA] data from recount2, as exemplified below:

```{r load recount, eval=FALSE}
library(recount)
View(recount_abstract)
sra <- loadSRAproject("SRP053101")
```

## Load other SRA and user-provided local files

> Although only select [SRA][SRA] projects are available to be automatically
downloaded (based on pre-processed data from the [recount2][recount] project), 
other SRA projects can be manually downloaded, aligned using a splice-aware 
aligner and loaded by the user, as per the instructions in
[Loading SRA and user-provided RNA-seq data][tutorial-custom-data]. 
Sample-associated files from SRA are also supported.

To load local files instead, indicate the folder of interest. Any files located
in this folder and sub-folders will be loaded. To mitigate any errors during 
this process, files of interest should be put in a dedicated folder.

For instance, to load [GTEx][GTEx] files in this way, create a directory called
**GTEx**, put all files of interest inside that folder and follow these 
commands:

```{r load local, eval=FALSE}
folder <- "~/Downloads/GTEx/"
ignore <- c(".aux.", ".mage-tab.")
data <- loadLocalFiles(folder, ignore=ignore)

# Select clinical and junction quantification dataset
clinical      <- data[[1]]$`Clinical data`
sampleInfo    <- data[[1]]$`Sample metadata`
junctionQuant <- data[[1]]$`Junction quantification (Illumina HiSeq)`
```

# Feedback

All feedback on the program, documentation and associated material (including
this tutorial) is welcome. Please send any suggestions and comments to:

> Nuno Saraiva-Agostinho (nunoagostinho@medicina.ulisboa.pt)
>
> [Disease Transcriptomics Lab, Instituto de Medicina Molecular (Portugal)][iMM]

# References

[SUPPA]: https://bitbucket.org/regulatorygenomicsupf/suppa
[rMATS]: http://rnaseq-mats.sourceforge.net
[MISO]: http://genes.mit.edu/burgelab/miso/
[VAST-TOOLS]: https://github.com/vastgroup/vast-tools
[TCGA]: https://tcga-data.nci.nih.gov/docs/publications/tcga
[annotation-tutorial]: http://rpubs.com/nuno-agostinho/alt-splicing-annotation
[iMM]: http://imm.medicina.ulisboa.pt/group/distrans/
[GTEx]: http://gtexportal.org
[maxpath]: https://msdn.microsoft.com/library/windows/desktop/aa365247.aspx#maxpath
[article]: https://doi.org/10.1093/nar/gky888
[Firebrowse]: http://firebrowse.org
[Ensembl]: http://www.ensembl.org
[UniProt]: http://www.uniprot.org
[SRA]: https://www.ncbi.nlm.nih.gov/sra
[recount]: https://jhubiostatistics.shinyapps.io/recount/
[STAR]: https://github.com/alexdobin/STAR
[tutorial-custom-data]: http://rpubs.com/nuno-agostinho/psichomics-custom-data