--- title: "terraTCGAdata Introduction" author: "Marcel Ramos" date: "`r format(Sys.Date(), '%B %d, %Y')`" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Obtain Terra TCGA data as MultiAssayExperiment} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: number_sections: yes toc: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, cache = TRUE) ``` # terraTCGAData ## Installation ```{r,eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("terraTCGAdata") ``` # Overview The `terraTCGAdata` R package aims to import TCGA datasets, as [MultiAssayExperiment](http://bioconductor.org/packages/MultiAssayExperiment/), available on the Terra platform. The package provides a set of functions that allow the discovery of relevant datasets. It provides one main function and two helper functions: 1. `terraTCGAdata` allows the creation of the `MultiAssayExperiment` object from the different indicated resources. 2. The `getClinicalTable` and `getAssayTable` functions allow for the discovery of datasets within the Terra data model. The column names from these tables can be provided as inputs to the `terraTCGAdata` function. ## Data Some public Terra workspaces come pre-packaged with TCGA data (i.e., cloud data resources are linked within the data model). Particularly the workspaces that are labelled `OpenAccess_V1-0`. Datasets harmonized to the hg38 genome, such as those from the Genomic Data Commons data repository, use a different data model / workflow and are not compatible with the functions in this package. For those that are, we make use of the Terra data model and represent the data as `MultiAssayExperiment`. For more information on `MultiAssayExperiment`, please see the vignette in that package. # Requirements ## Loading packages ```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE} library(AnVIL) library(terraTCGAdata) ``` ## gcloud sdk installation A valid GCloud SDK installation is required to use the package. To get set up, see the Bioconductor tutorials for running RStudio on Terra. Use the `gcloud_exists()` function from the `r BiocStyle::Biocpkg("AnVIL")` package to identify whether it is installed in your system. ```{r} gcloud_exists() ``` You can also use the `gcloud_project` to set a project name by specifying the project argument: ```{r, eval=AnVIL::gcloud_exists()} gcloud_project() ``` # Default Data Workspace To get a table of available TCGA workspaces, use the `selectTCGAworkspace()` function: ```{r, eval=AnVIL::gcloud_exists()} selectTCGAworkspace() ``` You can also set the package-wide option with the `terraTCGAworkspace` function and check the setting with `getOption('terraTCGAdata.workspace')` or by running `terraTCGAworkspace` function. ```{r,eval=AnVIL::gcloud_exists()} terraTCGAworkspace("TCGA_COAD_OpenAccess_V1-0_DATA") getOption("terraTCGAdata.workspace") ``` # Clinical data resources In order to determine what datasets to download, use the `getClinicalTable` function to list all of the columns that correspond to clinical data from the different collection centers. ```{r, eval=AnVIL::gcloud_exists()} ct <- getClinicalTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA") ct names(ct) ``` # Clinical data download After picking the column in the `getClinicalTable` output, use the column name as input to the `getClinical` function to obtain the data: ```{r, eval=AnVIL::gcloud_exists()} column_name <- "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin" clin <- getClinical( columnName = column_name, participants = TRUE, workspace = "TCGA_COAD_OpenAccess_V1-0_DATA" ) clin[, 1:6] dim(clin) ``` # Assay data resources We use the same approach for assay data. We first produce a list of assays from the `getAssayTable` and then we select one along with any sample codes of interest. ```{r, eval=AnVIL::gcloud_exists()} at <- getAssayTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA") at names(at) ``` # Summary of sample types in the data You can get a summary table of all the samples in the adata by using the `sampleTypesTable`: ```{r, eval=AnVIL::gcloud_exists()} sampleTypesTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA") ``` # Intermediate function for obtaining only the data Note that if you have the package-wide option set, the workspace argument is not needed in the function call. ```{r, eval=AnVIL::gcloud_exists()} prot <- getAssayData( assayName = "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data", sampleCode = c("01", "10"), workspace = "TCGA_COAD_OpenAccess_V1-0_DATA", sampleIdx = 1:4 ) head(prot) ``` # MultiAssayExperiment Finally, once you have collected all the relevant column names, these can be inputs to the main `terraTCGAdata` function: ```{r, eval=AnVIL::gcloud_exists()} mae <- terraTCGAdata( clinicalName = "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin", assays = c("protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data", "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"), sampleCode = NULL, split = FALSE, sampleIdx = 1:4, workspace = "TCGA_COAD_OpenAccess_V1-0_DATA" ) mae ``` We expect that most `OpenAccess_V1-0` cancer datasets follow this data model. If you encounter any errors, please provide a minimally reproducible example at https://github.com/waldronlab/terraTCGAdata. # Session Info ```{r} sessionInfo() ```