---
title: "Using `RTCGA` package to download mutations data that are included in `RTCGA.mutations` package"
subtitle: "Date of datasets release: 2015-11-01"
author: "Marcin Kosiński"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using RTCGA to download mutations data as included in RTCGA.mutations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
library(knitr)
opts_chunk$set(comment="", message=FALSE, warning = FALSE, tidy.opts=list(keep.blank.line=TRUE, width.cutoff=150), options(width=150), eval = FALSE)
```

# RTCGA package

> The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes. The key is to understand genomics to improve cancer care.

The `RTCGA` package downloads and integrates the variety and volume of TCGA data using the patient barcode as a key, which makes the data easier to obtain. This may have a beneficial influence on the development of science and on the improvement of patients' treatment. `RTCGA` is an open-source R package, available for download from Bioconductor:

```{r, eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("RTCGA")
```

or from GitHub:
```{r, eval=FALSE}
if (!require(devtools)) {
    install.packages("devtools")
    require(devtools)
}
biocLite("RTCGA/RTCGA")
```

Furthermore, the `RTCGA` package transforms TCGA data into a form that is convenient to use in the R statistical environment. These data transformations can be part of a statistical analysis pipeline, which `RTCGA` can make more reproducible.

Use cases and examples are shown in the `RTCGA` package vignettes:
```{r, eval=FALSE}
browseVignettes("RTCGA")
```

# How to download mutations data to obtain the same datasets as in the RTCGA.mutations package?

TCGA data have been released on many dates. To see them all, just type:
```{r, eval=FALSE}
library(RTCGA)
checkTCGA('Dates')
```

Version 20151101.0.0 of the `RTCGA.mutations` package contains mutations datasets that were released on `2015-11-01`. They were downloaded in the following way (which is mainly copied from [http://rtcga.github.io/RTCGA/](http://rtcga.github.io/RTCGA/)):

## Available cohorts

All cohort names can be checked using:
```{r, eval=FALSE}
library(magrittr) # provides %>%, in case it is not already attached
(cohorts <- infoTCGA() %>% 
   rownames() %>% 
   sub("-counts", "", x=.))
```
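
The `sub()` call above strips the `-counts` suffix from the row names. A minimal sketch of that step on a hypothetical vector (the values in `example_rownames` are assumed for illustration and are not fetched from TCGA):

```r
# Hypothetical row names, shaped like those assumed to come from infoTCGA()
example_rownames <- c("ACC-counts", "BRCA-counts", "OV-counts")

# Strip the "-counts" suffix to obtain plain cohort codes
example_cohorts <- sub("-counts", "", x = example_rownames)
example_cohorts
```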

For all cohorts the following code downloads the mutations data.

## Downloading mutations files
```{r, eval=FALSE}
# dir.create("data2") # name of a directory in which data will be stored
releaseDate <- "2015-11-01"
sapply(cohorts, function(element) {
   tryCatch({
      downloadTCGA(cancerTypes = element,
                   dataSet = "Mutation_Packager_Calls.Level",
                   destDir = "data2",
                   date = releaseDate)
   }, error = function(cond) {
      cat("Error: Maybe there weren't mutations data for ", element, " cancer.\n")
   })
})
```
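
The loop above relies on `tryCatch()` so that one failing cohort does not stop the remaining downloads. The following is only a sketch of that pattern: `fake_download()` is a hypothetical stand-in for `downloadTCGA()`, and the cohort that fails was picked arbitrarily. It also records which cohorts failed, which the code above does not do.

```r
# Hypothetical stand-in for downloadTCGA(): fails for one arbitrary cohort
fake_download <- function(cohort) {
  if (cohort == "FPPP") stop("no mutations data")
  paste0("downloaded_", cohort)
}

example_cohorts <- c("ACC", "FPPP", "OV")

# tryCatch() lets the loop continue after a failure; failures are stored as NA
results <- vapply(example_cohorts, function(element) {
  tryCatch(
    fake_download(element),
    error = function(cond) {
      message("No mutations data for ", element, ": ", conditionMessage(cond))
      NA_character_
    }
  )
}, character(1))

# Cohorts whose download failed
failed <- names(results)[is.na(results)]
failed
```

Keeping the failed cohort names in a vector makes it easy to retry them later or to report them.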

## Reading downloaded mutations dataset

### Shortening paths and directories 

```{r, eval=FALSE}
list.files( "data2") %>% 
   file.path( "data2", .) %>%
   file.rename( to = substr(.,start=1,stop=50))
```
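
To see what the truncation does without touching real downloads, here is a small sketch in a temporary directory. The file name and the 20-character limit are made up for illustration. Unlike the code above, which truncates the whole path including the `data2/` prefix, this sketch truncates only the base name.

```r
# Work in a throwaway directory so nothing real is renamed
demo_dir <- tempfile("shorten_demo")
dir.create(demo_dir)

# Hypothetical long file name
long_name <- paste0("gdac_ACC_", strrep("x", 80), ".txt")
file.create(file.path(demo_dir, long_name))

# Truncate base names (not the full path) to 20 characters
old_paths <- list.files(demo_dir, full.names = TRUE)
new_paths <- file.path(demo_dir, substr(basename(old_paths), 1, 20))
file.rename(old_paths, new_paths)

list.files(demo_dir)
```

Truncation can make two different names collide, so it may be worth checking `anyDuplicated(new_paths)` before renaming.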


### Removing `NA` files from data2 folder

If there were no mutations data for some cohorts, we should remove the corresponding `NA` files.

```{r, eval=FALSE}
list.files( "data2") %>%
   file.path( "data2", .) %>%
   sapply(function(x){
      if (x == "data2/NA")
         file.remove(x)      
   })
```
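
Since only a single path is checked, the same clean-up can be written without a loop. The sketch below uses a temporary directory and made-up entries. Using `unlink()` instead of `file.remove()` is a choice made here, not the vignette's, so that the call would also work if `NA` turned out to be a directory.

```r
# Throwaway directory with a stray "NA" entry and one entry to keep
demo_dir <- tempfile("na_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("NA", "keep_me")))

# Remove the "NA" entry only if it exists
na_path <- file.path(demo_dir, "NA")
if (file.exists(na_path)) {
  unlink(na_path, recursive = TRUE)  # handles a file or a directory
}

list.files(demo_dir)
```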

### Paths to mutations data folder

The code below automatically assigns the paths of all mutations files, for every cohort downloaded to the `data2` folder, to variables named `<cohort>.mutations.path`.

```{r}
cohorts %>%
   sapply(function(element){
      grep(paste0("_", element, "\\."),
           x = list.files("data2") %>%
              file.path("data2", .),
           value = TRUE)
   }) -> potential_datasets

# create one `<cohort>.mutations.path` variable per cohort that has data
for(i in seq_along(potential_datasets)){
   if(length(potential_datasets[[i]]) > 0){
      assign(value = potential_datasets[[i]],
             x = paste0(names(potential_datasets)[i], ".mutations.path"),
             envir = .GlobalEnv)
   }
}

```
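
The pattern `"_<cohort>\\."` requires a literal dot right after the cohort code, so `COAD` does not accidentally match a `COADREAD` folder. The sketch below shows this on hypothetical folder names (assumed shapes, not real download output). It collects the paths in a named list rather than in global variables, which is one possible alternative to `assign()`.

```r
# Hypothetical folder names, shaped like shortened download directories
example_dirs <- c("data2/gdac.broadinstitute.org_ACC.Mutation_Packag",
                  "data2/gdac.broadinstitute.org_COADREAD.Mutation_Pa",
                  "data2/gdac.broadinstitute.org_OV.Mutation_Packager")
example_cohorts <- c("ACC", "COAD", "OV")

# For each cohort keep the paths that contain "_<cohort>."
matches <- lapply(setNames(example_cohorts, example_cohorts), function(element) {
  grep(paste0("_", element, "\\."), example_dirs, value = TRUE)
})

# Drop cohorts without a match and name entries like the vignette's variables
mutation_paths <- Filter(length, matches)
names(mutation_paths) <- paste0(names(mutation_paths), ".mutations.path")
names(mutation_paths)
```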

### Reading mutations data using `readTCGA`

Because mutations data are stored in separate files, a special function, `readTCGA`, has been prepared to read and merge them automatically. The code is below:

```{r, eval=FALSE}
ls() %>%
   grep("mutations\\.path", x = ., value = TRUE) %>% 
   sapply(function(element){
      tryCatch({
         readTCGA(get(element, envir = .GlobalEnv),
                  dataType = "mutations") -> mutations_file
         # drop characters that cannot be represented in ASCII
         for(i in seq_len(ncol(mutations_file))){
            mutations_file[, i] <- iconv(mutations_file[, i],
                                         "UTF-8", "ASCII", sub = "")
         }

         assign(value = mutations_file,
                x = sub("\\.path", "", x = element),
                envir = .GlobalEnv)
      }, error = function(cond){
         # print the name of the path variable that could not be read
         cat(element, "\n")
      })
      invisible(NULL)
   })
```
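
The `iconv()` loop converts every column from UTF-8 to ASCII. With `sub = ""`, bytes that cannot be represented in ASCII are dropped instead of turning the whole string into `NA`. A minimal sketch on a hypothetical data frame (the column names and values are made up):

```r
# Hypothetical data frame with one non-ASCII value
example_df <- data.frame(
  gene = c("TP53", "BRCA1"),
  note = c("caf\u00e9", "plain"),
  stringsAsFactors = FALSE
)

# Same column-wise conversion as in the loop above
for (i in seq_len(ncol(example_df))) {
  example_df[, i] <- iconv(example_df[, i], "UTF-8", "ASCII", sub = "")
}

example_df$note
```

Note that `iconv()` returns character vectors, so a non-character column passed through this loop ends up as character too.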

# Saving mutations data to `RTCGA.mutations` package


```{r, eval=FALSE}
grep( "mutations", ls(), value = TRUE) %>%
   grep("path", x=., value = TRUE, invert = TRUE) %>%
   cat( sep="," ) # can one do it better? From the use_data documentation:
   # ...	Unquoted names of existing objects to save
devtools::use_data(ACC.mutations,BLCA.mutations,BRCA.mutations,
                   CESC.mutations,CHOL.mutations,COAD.mutations,
                   COADREAD.mutations,DLBC.mutations,ESCA.mutations,
                   GBMLGG.mutations,GBM.mutations,HNSC.mutations,
                   KICH.mutations,KIPAN.mutations,KIRC.mutations,
                   KIRP.mutations,LAML.mutations,LGG.mutations,
                   LIHC.mutations,LUAD.mutations,LUSC.mutations,
                   OV.mutations,PAAD.mutations,PCPG.mutations,
                   PRAD.mutations,READ.mutations,SARC.mutations,
                   SKCM.mutations,STAD.mutations,STES.mutations,
                   TGCT.mutations,THCA.mutations,UCEC.mutations,
                   UCS.mutations,UVM.mutations,
                   # overwrite = TRUE,
                   compress="xz")
```
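
One possible way to avoid typing the object names by hand is base R's `save()`, which accepts a character vector of names through its `list` argument. This is only a sketch and not how the package data were actually saved: the objects are hypothetical placeholders and the output goes to a temporary directory instead of the package's `data/` folder.

```r
# Hypothetical objects standing in for the cohort data frames
ACC.mutations <- data.frame(x = 1)
OV.mutations <- data.frame(x = 2)
ACC.mutations.path <- "data2/some_path"

# Collect the names of the data objects, excluding the *.path helpers
object_names <- grep("\\.mutations$", ls(), value = TRUE)

# Write one xz-compressed .rda file per object
out_dir <- tempfile("save_demo")
dir.create(out_dir)
for (nm in object_names) {
  save(list = nm,
       file = file.path(out_dir, paste0(nm, ".rda")),
       compress = "xz")
}

list.files(out_dir)
```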