--- title: "cBioPortalData: User Guide" author: "Marcel Ramos & Levi Waldron" date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{cBioPortalData User Guide} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: number_sections: no toc: yes toc_depth: 4 bibliography: ../inst/REFERENCES.bib --- ```{r, setup, include=FALSE} knitr::opts_chunk$set(cache = TRUE) ``` # Installation ```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE} library(cBioPortalData) library(AnVIL) ``` # Introduction The cBioPortal for Cancer Genomics [website](https://cbioportal.org) is a great resource for interactive exploration of study datasets. However, it does not easily allow the analyst to obtain and further analyze the data. We've developed the `cBioPortalData` package to fill this need to programmatically access the data resources available on the cBioPortal. The `cBioPortalData` package provides an R interface for accessing the cBioPortal study data within the Bioconductor ecosystem. It downloads study data from the cBioPortal API (the full API specification can be found here https://cbioportal.org/api) and uses Bioconductor infrastructure to cache and represent the data. We demonstrate common use cases of `cBioPortalData` and `curatedTCGAData` during Bioconductor conference [workshops](https://waldronlab.io/MultiAssayWorkshop/). We use the [MultiAssayExperiment][1] (@Ramos2017-og) package to integrate, represent, and coordinate multiple experiments for the studies available in the cBioPortal. This package in conjunction with `curatedTCGAData` give access to a large trove of publicly available bioinformatic data. Please see our JCO Clinical Cancer Informatics publication [here][2] (@Ramos2020-ya). [1]: https://dx.doi.org/10.1158/0008-5472.CAN-17-0344 [2]: https://dx.doi.org/10.1200/CCI.19.00119 # Citations Our free and open source project depends on citations for funding. When using `cBioPortalData`, please cite the following publications: ```{r, eval=FALSE} citation("MultiAssayExperiment") citation("cBioPortalData") ``` ## Overview ### Data Structures Data are provided as a single `MultiAssayExperiment` per study. The `MultiAssayExperiment` representation usually contains `SummarizedExperiment` objects for expression data and `RaggedExperiment` objects for mutation and CNV-type data. `RaggedExperiment` is a data class for representing 'ragged' genomic location data, meaning that the measurements per sample vary. For more information, please see the `RaggedExperiment` and `SummarizedExperiment` vignettes. ### Identifying available studies As we work through the data, there are some datasest that cannot be represented as `MultiAssayExperiment` objects. This can be due to a number of reasons such as the way the data is handled, presence of mis-matched identifiers, invalid data types, etc. To see what datasets are currently not building, we can look refer to `getStudies()` with the `buildReport = TRUE` argument. ```{r} cbio <- cBioPortal() studies <- getStudies(cbio, buildReport = TRUE) head(studies) ``` The last two columns will show the availability of each `studyId` for either download method (`pack_build` for `cBioDataPack` and `api_build` for `cBioPortalData`). ### Choosing download method There are two main user-facing functions for downloading data from the cBioPortal API. * `cBioDataPack` makes use of the tarball distribution of study data. This is useful when the user wants to download and analyze the entirety of the data as available from the cBioPortal.org website. * `cBioPortalData` allows a more flexibile approach to obtaining study data based on the available parameters such as molecular profile identifiers. This option is useful for users who have a set of gene symbols or identifiers and would like to get a smaller subset of the data that correspond to a particular molecular profile. ## Two main functions ### cBioDataPack: Obtain Study Data as Zipped Tarballs This function will access the packaged data from \url{cBioPortal.org/datasets} and return an integrative MultiAssayExperiment representation. ```{r,message=FALSE,warning=FALSE} ## Use ask=FALSE for non-interactive use laml <- cBioDataPack("laml_tcga", ask = FALSE) laml ``` ### cBioPortalData: Obtain data from the cBioPortal API This function provides a more flexible and granular way to request a `MultiAssayExperiment` object from a study ID, molecular profile, gene panel, sample list. ```{r,warning=FALSE} acc <- cBioPortalData(api = cbio, by = "hugoGeneSymbol", studyId = "acc_tcga", genePanelId = "IMPACT341", molecularProfileIds = c("acc_tcga_rppa", "acc_tcga_linear_CNA") ) acc ``` Note. To avoid overloading the API service, the API was designed to only query a part of the study data. Therefore, the user is required to enter either a set of genes of interest or a gene panel identifier. ## Considerations Note that `cBioPortalData` and `cBioDataPack` obtain data diligently curated by the cBio Portal data team. The original data and curation lies in the GitHub repository. However, despite the curation efforts there may be some inconsistencies in identifiers in the data. This causes our software to not work as intended though we have made efforts to represent all the data from both API and tarball formats. ### metadata You may notice that the `metadata()` may have some additional data that was not able to be integrated in the `MultiAssayExperiment`. ```{r} metadata(acc) ``` ### Build prompts You will also get a message for `studyId`s whose data has not been fully integrated into a `MultiAssayExperiment`. ```{r,echo=FALSE} cat( "Our testing shows that '%s' is not currently building.\n", " Use 'downloadStudy()' to manually obtain the data.\n", " Proceed anyway? [y/n]: y" ) ``` ### Manual downloads For this reason, we have also provided the `downloadStudy`, `untarStudy`, and `loadStudy` functions to allow researchers to simply download the data and potentially, manually curate it. Generally, we advise researchers to report inconsistencies in the data in the [cBioPortal][3] data repository. [3]: https://github.com/cbioportal/datahub ## Clearing the cache ### cBioDataPack In cases where a download is interrupted, the user may experience a corrupt cache. The user can clear the cache for a particular study by using the `removeCache` function. Note that this function only works for data downloaded through the `cBioDataPack` function. ```{r,eval=FALSE} removeCache("laml_tcga") ``` ### cBioPortalData For users who wish to clear the entire `cBioPortalData` cache, it is recommended that they use: ```{r,eval=FALSE} unlink("~/.cache/cBioPortalData/") ``` ## Example Analysis: Kaplan-Meier Plot We can use information in the `colData` to draw a K-M plot with a few variables from the `colData` slot of the `MultiAssayExperiment`. First, we load the necessary packages: ```{r,message=FALSE,warning=FALSE} library(survival) library(survminer) ``` We can check the data to lookout for any issues. ```{r} table(colData(laml)$OS_STATUS) class(colData(laml)$OS_MONTHS) ``` Now, we clean the data a bit to ensure that our variables are of the right type for the subsequent survival model fit. ```{r} collaml <- colData(laml) collaml[collaml$OS_MONTHS == "[Not Available]", "OS_MONTHS"] <- NA collaml$OS_MONTHS <- as.numeric(collaml$OS_MONTHS) colData(laml) <- collaml ``` We specify a simple survival model using `SEX` as a covariate and we draw the K-M plot. ```{r} fit <- survfit( Surv(OS_MONTHS, as.numeric(substr(OS_STATUS, 1, 1))) ~ SEX, data = colData(laml) ) ggsurvplot(fit, data = colData(laml), risk.table = TRUE) ``` ### Data update requests If you are interested in a particular study dataset that is not currently building, please open an issue at our [GitHub repository][4] and we will do our best to resolve the issues with the code base. Data issues can be opened at the cBioPortal [data repository][3]. [4]: https://github.com/waldronlab/cBioPortalData/issues We appreciate your feedback! # sessionInfo
Click to see session info ```{r} sessionInfo() ```
# References