--- title: "**pradMAEO**: The **PR**ostate **AD**enocarcinoma **M**ulti**A**ssay**E**xperiment **O**bject using data from _RTCGAToolbox_" author: "Lucas Schiffer" date: "`r doc_date()`" output: BiocStyle::html_document: number_sections: yes toc: yes abstract: > This vignette displays the processing steps needed in order to create a MultiAssayExperiment object from TCGA (The Cancer Genome Atlas) data. Specifically, the `getFirehoseData()` method of the `r Githubpkg("LiNk-NY/RTCGAToolbox")` package is used to access and read in data; output is then further coerced to fit the MultiAssayExperiment object specifications. A built HTML version of this vignette is available on [RPubs](https://rpubs.com/schifferl/pradMAEO-RTCGAToolbox) and the source is available on [GitHub](http://tinyurl.com/pradMAEO) vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{MultiAssayExperiment: Prostate Cancer Data} %\VignetteEncoding{UTF-8} --- # Prerequisites ```{r, include = FALSE} knitr::opts_chunk$set(eval = FALSE) ``` Methods from two packages hosted on GitHub are used in this vignette, the packages are installed as follows. ```{r} BiocInstaller::biocLite("LiNk-NY/RTCGAToolbox") BiocInstaller::biocLite("waldronlab/TCGAutils") ``` These and other packages available in Bioconductor or CRAN are loaded as follows. ```{r} library(MultiAssayExperiment) library(RTCGAToolbox) library(TCGAutils) library(readr) ``` # Argument Definitions The `r Githubpkg("LiNk-NY/RTCGAToolbox")` package provides the `getFirehoseDatasets()` method for obtaining the names of all 33 cohorts contained within the TCGA data. Beyond the 33 cohorts, there are 5 additional "pan" cohorts where data of multiple cohorts was merged - information about the cohorts is available via the TCGA [website](http://cancergenome.nih.gov/cancersselected). Additionally, the `getFirehoseRunningDates()` and `getFirehoseAnalyzeDates()` methods are used to obtain the most recent running and analysis dates. Finally, a character vector `dd` is created to specify the location of the data directory where output should be saved. ```{r} TCGAcode <- getFirehoseDatasets()[27] # PRAD stopifnot(identical(ds, "PRAD")) runDate <- getFirehoseRunningDates()[1] analyzeDate <- getFirehoseAnalyzeDates()[1] dataDirectory <- "data" ``` # Function Definition A function, `buildMultiAssayExperiments()`, is defined as shown below for the purpose of creating a new MultiAssayExperiment object with a single line of code. It accepts the arguments defined in the previous chunk and is capable of accepting multiple cohort names (e.g. `dataDirectory <- getFirehoseDatasets()[1:5]`). Even though the implementation is not parallel, low-level operations remain vectorized regardless of the for loop. In the first part of the function, the existence of the data directory is checked and it is created if necessary. Then a cohort object is either loaded or serialized from the `getFirehoseData()` method and saved to the data directory. Once serialized, `colData` is extracted from the clinical slot and the rownames are cleaned by `gsub()` and `type_convert()` functions. A named list of extraction targets is then created from the slot names of the cohort object and the `TCGAextract()` function is used within a try statement. The outputs are then passed to `generateMap()` which will generate a `sampleMap` specific to TCGA data. Finally, the named `ExperimentList` of extracted datasets, the `colData`, and the generated sample map can be passed to the `MultiAssayExperiment()` constructor function. The constructor function will ensure that orphaned samples, samples that don't match a record in `colData`, are removed. A `MultiAssayExperiment` will be created, serialized and saved to the data directory, making it easier to return to in the future. ```{r} buildMultiAssayExperiments <- function(TCGAcode, runDate, analyzeDate, dataDirectory) { if (!dir.exists(dataDirectory)) dir.create(dataDirectory) for (cancer in TCGAcodes) { message("\n######\n", "\nProcessing ", cancer, " : )\n", "\n######\n") serialPath <- file.path("data", paste0(cancer, ".rds")) if (file.exists(serialPath)) { cancerObject <- readRDS(serialPath) } else { cancerObject <- getFirehoseData(cancer, runDate = runDate, gistic2_Date = analyzeDate, RNAseq_Gene = TRUE, Clinic = TRUE, miRNASeq_Gene = TRUE, RNAseq2_Gene_Norm = TRUE, CNA_SNP = TRUE, CNV_SNP = TRUE, CNA_Seq = TRUE, CNA_CGH = TRUE, Methylation = TRUE, Mutation = TRUE, mRNA_Array = TRUE, miRNA_Array = TRUE, RPPA_Array = TRUE, RNAseqNorm = "raw_counts", RNAseq2Norm = "normalized_count", forceDownload = FALSE, destdir = "./tmp", fileSizeLimit = 500000, getUUIDs = FALSE) saveRDS(cancerObject, file = serialPath, compress = "bzip2") } ## Add clinical data from RTCGAToolbox pd <- Clinical(co) rownames(pd) <- toupper(gsub("\\.", "-", rownames(pd))) clinicalData <- type_convert(pd) ## slotNames in FirehoseData RTCGAToolbox class targets <- c("RNASeqGene", "RNASeq2GeneNorm", "miRNASeqGene", "CNASNP", "CNVSNP", "CNAseq", "CNACGH", "Methylation", "mRNAArray", "miRNAArray", "RPPAArray", "Mutations", "gistica", "gistict") names(targets) <- targets dataList <- lapply(targets, function(datType) { tryCatch({TCGAutils::TCGAextract(cancerObject, datType)}, error = function(e) { message(datType, " does not contain any data!") }) }) dataFull <- Filter(function(x) {!is.null(x)}, dataList) NewMap <- generateMap(dataFull, clinicalData, TCGAbarcode) MAEO <- MultiAssayExperiment(dataFull, clinicalData, NewMap) saveRDS(MAEO, file = file.path(dataDirectory, paste0(cancer, "_MAEO.rds")), compress = "bzip2") } } ``` # Function Call Lastly, it is necessary to call the `buildMultiAssayExperiments()` function defined above and pass it the arguments defined using the `r Githubpkg("LiNk-NY/RTCGAToolbox")` package. Using this function, a `MultiAssayExperiment` object for the prostate adenocarcinoma cohort (`PRAD`) is created with a single call. ```{r} buildMultiAssayExperiments(TCGAcode, runDate, analyzeDate, dataDirectory) ```