--- title: "RTCGAToolbox" author: "Mehmet Kemal Samur" date: "`r Sys.Date()`" output: BiocStyle::html_document: number_sections: yes toc: true references: - id: ref1 title: Comprehensive genomic characterization defines human glioblastoma genes and core pathways author: - family: Cancer Genome Atlas Research Network given: journal: Nature volume: 455 number: 7216 pages: 1061-1068 issued: year: 2008 - id: ref2 title: GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers author: - family: Mermel, C. H. and Schumacher, S. E. and Hill, B. and Meyerson, M. L. and Beroukhim, R. and Getz, G given: journal: Genome Biol volume: 12 number: 4 pages: R41 issued: year: 2011 - id: ref3 title: RTCGAToolbox\:\ A New Tool for Exporting TCGA Firehose Data author: - family: Samur MK. given: journal: Plos ONE volume: 9 number: 9 pages: e106397 issued: year: 2014 vignette: > %\VignetteIndexEntry{RTCGAToolbox Tutorial} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- # Introduction Managing data from large scale projects such as The Cancer Genome Atlas (TCGA)[@ref1] for further analysis is an important and time consuming step for research projects. Several efforts, such as Firehose project, make TCGA pre-processed data publicly available via web services and data portals but it requires managing, downloading and preparing the data for following steps. We developed an open source and extensible R based data client for Firehose Level 3 and Level 4 data and demonstrated its use with sample case studies. RTCGAToolbox could improve data management for researchers who are interested with TCGA data. In addition, it can be integrated with other analysis pipelines for further data analysis. RTCGAToolbox is open-source and licensed under the GNU General Public License Version 2.0. All documentation and source code for RTCGAToolbox is freely available. Please site the paper at [@ref3]. Currently, following functions are provided to access datasets and process datasets. * Control functions: + getFirehoseRunningDates: This function can be called to access valid stddata run dates. To access data, users have to provide valid dates. + getFirehoseAnalyzeDates: This function can be called to access valid analyze run dates. To access data, users have to provide valid dates. This function only affects the GISTIC2 [@ref2] processed copy estimate matrices. + getFirehoseDatasets: This function can be called to access valid dataset aliases. * Data client function: + getFirehoseData: This is the core function of the package. Users can access Firehose processed data via this function. Once it is called, several steps are realized by the library to access data. Finally this function returns an S4 object that keeps all the downloaded data. # Installation To install RTCGAToolbox, you can use Bioconductor. Source code is also available on GitHub. First time users use the following code snippet to install the package ```{r eval=FALSE} if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install("RTCGAToolbox") ``` # Data Client Before getting the data from Firehose pipelines, users have to check valid dataset aliases, stddata run dates and analyze run dates. To provide valid information RTCGAToolbox comes with three control functions. Users can list datasets with "getFirehoseDatasets" function. In addition, users have to provide stddata run date or/and analyze run date for client function. Valid dates are accessible via "getFirehoseRunningDates" and "getFirehoseAnalyzeDates" functions. Below code chunk shows how to list datasets and dates. ```{r} library(RTCGAToolbox) # Valid aliases getFirehoseDatasets() ``` ```{r} # Valid stddata runs getFirehoseRunningDates(last = 3) ``` ```{r} # Valid analysis running dates (will return 3 recent date) getFirehoseAnalyzeDates(last=3) ``` When the dates and datasets are determined users can call data client function ("getFirehoseData") to access data. Current version can download multiple data types except ISOFORM and exon level data due to their huge data size. Below code chunk will download READ dataset with clinical and mutation data. ```{r, message=FALSE} # READ mutation data and clinical data brcaData <- getFirehoseData(dataset="READ", runDate="20160128", forceDownload=TRUE, clinical=TRUE, Mutation=TRUE) ``` Printing the object will show the user what datasets are in the `FirehoseData` object: ```{r} brcaData ``` Users have to set several parameters to get data they need. Below "getFirehoseData" options has been explained: * dataset: Users should set cohort code for the dataset they would like to download. List can be accessiable via `getFirehoseDatasets()` like as explained above. * runDate: Firehose project provides different data point for cohorts. Users can list dates by using function above,`getFirehoseRunningDates()`. * gistic2Date: Just like cohorts Firehose project runs their analysis pipelines to process copy number data with GISTIC2 [@ref2]. Users who want to get GISTIC2 processed copy number data should set this date. List can be accessible via "getFirehoseAnalyzeDates()" Following logic keys are provided for different data types. By default client only download clinical data. * RNAseqGene * clinical * RNASeqGene * RNASeq2Gene * RNASeq2GeneNorm * miRNASeqGene * CNASNP * CNVSNP * CNASeq * CNACGH * Methylation * Mutation * mRNAArray * miRNAArray * RPPAArray Users can also set following parameters to set client behavior. * forceDownload: By default RTCGAToolbox checks your working directory before download data. If you have data in the working directory from previous run it loads data by using these exports. If you would like to suppress this and re download data you can force RTCGAToolbox. * fileSizeLimit: If you would like to set a limit for downloaded file size you can use this parameter. Huge data files require longer download time and memory to load. By default his parameter set as 500MB. * getUUIDs: Firehose provides TCGA barcodes for every sample. In some cases users may want to use UUIDs for samples. If this parameter set, then after processing data RTCGAToolbox gets UUIDs for each barcode. ## Example Dataset We've provided an abbreviated dataset from the 'ACC' (Adrenocortical carcinoma) that contains only the top 6 rows for each dataset and a full clinical dataset. This dataset can be invoked by doing: ```{r} data(accmini) accmini ``` * `accmini` data is a FirehoseData object that stores RNAseq, copy number, mutation, clinical data from the Adrenocortical Carcinoma (ACC) study. ## Conversion to Bioconductor classes The `biocExtract` function allows the user to take any downloaded dataset and convert it into a standard Bioconductor object. These can either be a `SummarizedExperiment`, `RangedSummarizedExperiment`, or `RaggedExperiment` based on features of the data. The user must provide the desired data type as input to the function along with the actual `FirehoseData` data object. This allows for easy adaptability to other software in the Bioconductor ecosystem. ```{r} biocExtract(accmini, "RNASeq2Gene") biocExtract(accmini, "CNASNP") ``` # Raw Data You can obtain the downloaded data in tabular or list format from the `FirehoseData` object by using 'getData()' function. ```{r} head(getData(accmini, "clinical")) getData(accmini, "RNASeq2GeneNorm") getData(accmini, "GISTIC", "AllByGene") ``` ## Session Info ```{r} sessionInfo() ``` # References