---
title: "TCGAutils: Helper functions for working with TCGA datasets"
author: "Marcel Ramos & Levi Waldron"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{TCGAutils Essentials}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, warning=FALSE}
suppressPackageStartupMessages({
    library(TCGAutils)
    library(curatedTCGAData)
    library(MultiAssayExperiment)
    library(RTCGAToolbox)
})
```

# Overview

The `TCGAutils` package completes a suite of Bioconductor packages for
convenient access, integration, and analysis of *The Cancer Genome Atlas*.
It includes:
    0.    helpers for working with TCGA through the Bioconductor packages
    `r Biocpkg("MultiAssayExperiment")` (for coordinated representation and
    manipulation of multi-omits experiments) and `r Biocpkg("curatedTCGAData")`,
    which provides unrestricted TCGA data as `MultiAssayExperiment` objects,
    0.    helpers for importing TCGA data as from flat data structures such as
    `data.frame` or `DataFrame` read from delimited data structures provided by
    the Broad Institute’s Firehose, and
    0.    functions for interpreting TCGA barcodes and for mapping between
    barcodes and Universally Unique Identifiers (UUIDs).

# Installation

```{r, eval=FALSE}
BiocInstaller::biocLite("TCGAutils")
```

Required packages for this vignette:

```{r, eval=FALSE}
library(TCGAutils)
library(curatedTCGAData)
library(MultiAssayExperiment)
library(RTCGAToolbox)
```

# `curatedTCGAData` helpers

Functions such as `getSubtypeMap` and `getClinicalNames` provide information
on data inside a `r Biocpkg("MultiAssayExperiment")` object downloaded from
`r Biocpkg("curatedTCGAData")`. `sampleTables` and `splitAssays` support useful
operations on these `MultiAssayExperiment` objects.

## obtaining TCGA as `MultiAssayExperiment` objects from `curatedTCGAData`

For demonstration we download part of the Colon Adenocarcinoma (COAD) dataset
using`curatedTCGAData` via `ExperimentHub`. This command will download any
data type that starts with `CN*` such as `CNASeq`:

```{r, echo = FALSE}
suppressMessages({
coad <- curatedTCGAData::curatedTCGAData(diseaseCode = "COAD",
    assays = "CN*", dry.run = FALSE)
})
```

```{r, eval = FALSE}
coad <- curatedTCGAData::curatedTCGAData(diseaseCode = "COAD",
    assays = "CN*", dry.run = FALSE)
```

For a list of all available data types, use `dry.run = FALSE` and an
asterisk `*` as the assay input value:

```{r}
curatedTCGAData("COAD", "*")
```

## `sampleTables`: what sample types are present in the data?

The `sampleTables` function gives a tally of available
samples in the dataset based on the TCGA barcode information.

```{r}
sampleTables(coad)
```

For reference in interpreting the sample type codes, see the `sampleTypes`
table:

```{r}
data("sampleTypes")
sampleTypes
```

## `splitAssays`: separate the data from different tissue types

TCGA datasets include multiple -omics for solid tumors, adjacent normal
tissues, blood-derived cancers and normals, and other tissue types, which may
be mixed together in a single dataset. The `MultiAssayExperiment` object
generated here has one patient per row of its `colData`, but each patient may
have two or more -omics profiles by any assay, whether due to assaying of
different types of tissues or to technical replication. `splitAssays` separates
profiles from different tissue types (such as tumor and adjacent normal) into
different assays of the `MultiAssayExperiment` by taking a vector of sample
codes, and partitioning the current assays into assays with an appended sample
code:

```{r}
(tnmae <- splitAssays(coad, c("01", "11")))
```

The `r Biocpkg("MultiAssayExperiment")` package then provides functionality to
merge replicate profiles for a single patient (`mergeReplicates()`), which
would now be appropriate but would **not** have been appropriate before
splitting different tissue types into different assays, because that would
average measurements  from tumors and normal tissues.

`MultiAssayExperiment` also defines the `MatchedAssayExperiment` class, which
eliminates any profiles not present across all assays and ensures identical
ordering of profiles (columns) in each assay. In this example, it will match
tumors to adjacent normals in subsequent assays:

```{r}
(matchmae <- as(tnmae, "MatchedAssayExperiment"))
```

Only about 12 participants have both a matched tumor and solid normal sample.

## `getSubtypeMap`: manually curated molecular subtypes

Per-tumor subtypes are saved in the `metadata` of the `colData`
slot of `MultiAssayExperiment` objects downloaded from `curatedTCGAData`.
These subtypes were manually curated from the supplemental tables of all
primary TCGA publications:

```{r}
getSubtypeMap(coad)
```

## `getClinicalNames`: key "level 4" clinical &  pathological data

The `curatedTCGAData` `colData` contain hundreds of columns, obtained from
merging all unrestricted levels of clinical, pathological, and biospecimen data.
This function provides the names of "level 4" clinical/pathological variables,
which are the only ones provided by most other TCGA analysis tools.
Users may then use these variable names for subsetting or analysis, and may
even want to subset the `colData` to only these commonly used variables.

```{r}
getClinicalNames("COAD")
```

*Warning*: some names may not exactly match the `colData` names in the object
due to differences in variable types. These variables are kept separate and
differentiated with `x` and `y`. For example, `vital_status` in this case
corresponds to two different variables obtained from the pipeline. One variable
is interger type and the other character:

```{r}
class(colData(coad)[["vital_status.x"]])
class(colData(coad)[["vital_status.y"]])

table(colData(coad)[["vital_status.x"]])
table(colData(coad)[["vital_status.y"]])
```

Such conflicts should be inspected in this manner, and conflicts resolved by
choosing the more complete variable, or by treating any conflicting values as
unknown ("NA").

## `mergeColData`: expanding the `colData` of a `MultiAssayExperiment`

This function merges a `data.frame` or `DataFrame` into the
`colData` of an existing `MultiAssayExperiment` object. It will match
column names and row names to do a full merge of both data sets. This
convenience function can be used, for example, to add subtype information
available for a subset of patients to the `colData`. Here is an example on an
empty `MultiAssayExperiment` just to demonstrate its usage:

```{r}
mergeColData(MultiAssayExperiment(), data.frame())
```

# Importing TCGA text data files to Bioconductor classes

A few functions in the package accept either files or classes such as
`data.frame` and `FirehoseGISTIC` as input and return standard Bioconductor
classes.

## `makeGRangesListFromExonFiles`

%??% Could you use the GenomicDataCommons library to get an example file
instead, or show the GDC command that would be used to obtain the file?

The `GRangesList` class from the `r Biocpkg("GenomicRanges")` package is
suitable for grouping `GRanges` vectors as a list, such as for grouping exons
by gene. In this example we use a legacy exon quantification file from the
Genomic Data Commons. We then use `makeGRangesListFromExonFiles` to create a
`GRangesList` from vectors of file paths and names (where necessary). Some
adjustments have been made to the file name for cross-platform compatibility
with Windows operating system limitations.

```{r}
## Load example file found in package
pkgDir <- system.file("extdata", package = "TCGAutils", mustWork = TRUE)
exonFile <- list.files(pkgDir, pattern = "cation\\.txt$", full.names = TRUE)
exonFile

## We add the original file prefix to query for the UUID and get the
## TCGAbarcode
filePrefix <- "unc.edu.32741f9a-9fec-441f-96b4-e504e62c5362.1755371."

## Add actual file name manually
makeGRangesListFromExonFiles(exonFile,
    fileNames = paste0(filePrefix, basename(exonFile)))
```

Note `GRangesList` objects must be converted to `r Biocpkg("RaggedExperiment")`
class to incorporate them into a `MultiAssayExperiment`.

## `makeGRangesListFromCopyNumber`

Other processed, genomic range-based data from TCGA data can be imported using
`makeGRangesListFromCopyNumber`. This tab-delimited data file of copy number
alterations from *bladder urothelial carcinoma* (BLCA) was obtained from the
Genomic Data Commons and is included in `TCGAUtils` as an example:

```{r}
grlFile <- system.file("extdata", "blca_cnaseq.txt", package = "TCGAutils")
grl <- read.table(grlFile)
head(grl)

makeGRangesListFromCopyNumber(grl, split.field = "Sample")

makeGRangesListFromCopyNumber(grl, split.field = "Sample",
    keep.extra.columns = TRUE)
```

## `makeSummarizedExperimentFromGISTIC`

This function is only used for converting the `FirehoseGISTIC` class of the
`r Biocpkg("RTCGAToolbox")` package. It allows the user to obtain thresholded
by gene data, probabilities and peak regions.

```{r}
tempDIR <- tempdir()
co <- getFirehoseData("COAD", clinical = FALSE, GISTIC = TRUE,
    destdir = tempDIR)

selectType(co, "GISTIC")
class(selectType(co, "GISTIC"))

makeSummarizedExperimentFromGISTIC(co, "Peaks")
```

# Translating and interpreting TCGA identifiers

## Translation

The TCGA project has generated massive amounts of data. Some data can be
obtained with **U**niversally **U**nique **ID**entifiers (**UUID**) and other
data with TCGA barcodes. The Genomic Data Commons provides a JSON API for
mapping between UUID and barcode, but it is difficult for many people to
understand. `TCGAutils` makes simple functions available for two-way
translation between vectors of these identifiers.

### TCGA barcode to UUID

Here we translate the first two TCGA barcodes of the previous copy-number
alterations dataset to UUID:

```{r}
(xbarcode <- head(colnames(coad)[["COAD_CNASeq-20160128"]], 4L))
barcodeToUUID(xbarcode)
```

### UUID to TCGA barcode

Here we have a known case UUID that we want to translate into a TCGA barcode.

```{r}
UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", id_type = "case_id")
```

Where we have a known file UUID that we translate into the associated TCGA
barcode. Optional barcode information can be included for sample,
portion/analyte, and plate/center.

```{r}
UUIDtoBarcode("0001801b-54b0-4551-8d7a-d66fb59429bf",
    id_type = "file_id", end_point = "center")
```

## Parsing TCGA barcodes

Several functions exist for working with TCGA barcodes, the main function being
`TCGAbarcode`. It takes a TCGA barcode and returns information about
participant, sample, and/or portion.

```{r}
## Return participant barcodes
TCGAbarcode(xbarcode, participant = TRUE)

## Just return samples
TCGAbarcode(xbarcode, participant = FALSE, sample = TRUE)

## Include sample data as well
TCGAbarcode(xbarcode, participant = TRUE, sample = TRUE)

## Include portion and analyte data
TCGAbarcode(xbarcode, participant = TRUE, sample = TRUE, portion = TRUE)
```

## Sample select

Based on lookup table values, the user can select certain sample types from a
vector of sample barcodes. Below we select "Primary Solid Tumors" from a vector
of barcodes, returning a logical vector identifying the matching samples.

```{r}
## Select primary solid tumors
TCGAsampleSelect(xbarcode, "01")

## Select blood derived normals
TCGAsampleSelect(xbarcode, "10")
```

## `data.frame` representation of barcode

The straightforward `TCGAbiospec` function will take the information contained
in the TCGA barcode and display it in `data.frame` format with appropriate
column names.

```{r}
TCGAbiospec(xbarcode)
```

# Reference data

The `TCGAutils` package provides several helper datasets for working with TCGA barcodes.

## `sampleTypes`

As shown previously, the reference dataset `sampleTypes` defines sample codes
and their sample types (see `?sampleTypes` for source url).

```{r}
## Obtained previously
sampleCodes <- TCGAbarcode(xbarcode, participant = FALSE, sample = TRUE)

## Lookup table
head(sampleTypes)

## Match codes found in the barcode to the lookup table
sampleTypes[match(unique(substr(sampleCodes, 1L, 2L)), sampleTypes[["Code"]]), ]
```

Source: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes

## `clinicalNames` - Firehose pipeline clinical variables

`clinicalNames` is a list of the level 4 variable names (the most commonly used
clinical and pathological variables, with follow-ups merged) from each
`colData` datasets in `curatedTCGAData`. Shipped `curatedTCGAData`
`MultiAssayExperiment` objects merge additional levels 1-3 clinical,
pathological, and biospecimen data and contain many more variables than the ones
listed here.

```{r}
data("clinicalNames")

clinicalNames

lengths(clinicalNames)
```

# `sessionInfo`

```{r}
sessionInfo()
```