SomaticCancerAlterations

1. Motivation
2. Data Sets
3. Exploring Mutational Data
4. Exploring Multiple Studies
5. Data Provenance
- 5.1. TCGA Data
6. Alternatives
7. Session Info

1 Motivation

In recent years, considerable effort has gone into characterizing the somatic landscape of cancers. Many of these studies make their results publicly available, providing a valuable resource for investigations beyond the level of individual cohorts. The SomaticCancerAlterations package collects mutational data of several tumor types, currently focusing on the TCGA call sets, and aims for a tight integration with R and Bioconductor workflows. In the following, we illustrate how to access this data and give examples of use cases.

2 Data Sets

The Cancer Genome Atlas (TCGA)¹ is a consortium effort to analyze a variety of tumor types, including gene expression, methylation, copy number changes, and somatic mutations². With the SomaticCancerAlterations package, we provide the call sets of somatic mutations for all publicly available TCGA studies. Over time, more studies will be added as they become available and unrestricted in their usage.

To get started, we get a list of all available data sets and access the metadata associated with each study.
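A minimal sketch of this step, assuming the package is installed from Bioconductor and that `scaListDatasets()` and `scaMetadata()` are the accessor functions it exports for this purpose:

```r
library(SomaticCancerAlterations)

# Names of all data sets shipped with the package
all_datasets <- scaListDatasets()
print(all_datasets)

# Study-level metadata describing each data set
meta_data <- scaMetadata()
print(meta_data)
```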

Next, we load a single dataset with the scaLoadDatasets function.
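As a sketch, loading one study could look like the following. It assumes the exported loader is named `scaLoadDatasets` (plural) and that `"ov_tcga"` is among the names reported by `scaListDatasets()`; substitute any other available name if it is not.

```r
library(SomaticCancerAlterations)

# Load one study; "ov_tcga" is an assumed data set name
ov <- scaLoadDatasets("ov_tcga", merge = TRUE)

# Show the first few variants
head(ov, 3)
```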

3 Exploring Mutational Data

The somatic variants of each study are represented as a GRanges object, ordered by genomic position. Additional columns describe properties of each variant and relate it to the affected gene, sample, and patient.

With such data at hand, we can identify the samples and genes harboring the most mutations.
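The counting itself can be sketched on a small toy object. The values below are made up for illustration, and the column name `Sample_ID` is an assumption about how samples are labeled; `Hugo_Symbol` follows the MAF column naming.

```r
library(GenomicRanges)

# Toy stand-in for a loaded study; all values are made up for illustration
toy <- GRanges(
  seqnames = c("17", "17", "13", "17", "1"),
  ranges = IRanges(start = c(7577000, 7578000, 32900000, 7579000, 115256000), width = 1),
  Hugo_Symbol = c("TP53", "TP53", "BRCA2", "TP53", "NRAS"),
  Sample_ID = c("S1", "S2", "S1", "S3", "S1")  # assumed column name
)

# Samples ranked by their number of mutations
sample_counts <- sort(table(toy$Sample_ID), decreasing = TRUE)
head(sample_counts)

# Genes ranked by their number of mutations
gene_counts <- sort(table(toy$Hugo_Symbol), decreasing = TRUE)
head(gene_counts, 10)
```

The same two `table()` calls should apply unchanged to a real study object, provided it carries columns with these names.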

4 Exploring Multiple Studies

Instead of focusing on an individual study, we can also import several at once. The results are stored as a GRangesList in which each element corresponds to a single study. This can be merged into a single GRanges object with merge = TRUE.
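A short sketch of both variants, assuming `scaLoadDatasets()` accepts a character vector of data set names and defaults to returning a list when `merge` is not set:

```r
library(SomaticCancerAlterations)

# Pick the first three available studies
study_names <- head(scaListDatasets(), 3)

# One list element per study
three_studies <- scaLoadDatasets(study_names)
names(three_studies)

# All studies combined into a single GRanges object
merged_studies <- scaLoadDatasets(study_names, merge = TRUE)
length(merged_studies)
```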

We then compute the number of mutations per gene and study:
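This cross-tabulation can be illustrated on toy data. The sketch assumes the merged object labels the study of origin in a column called `Dataset`; that name and all values are assumptions for illustration.

```r
library(GenomicRanges)

# Toy stand-in for a merged object; values are made up
merged_toy <- GRanges(
  seqnames = c("17", "17", "7", "17", "7"),
  ranges = IRanges(start = c(7577000, 7578000, 55240000, 7579000, 55241000), width = 1),
  Hugo_Symbol = c("TP53", "TP53", "EGFR", "TP53", "EGFR"),
  Dataset = c("gbm_tcga", "ov_tcga", "gbm_tcga", "ov_tcga", "gbm_tcga")  # assumed column name
)

# Cross-tabulate genes (rows) against studies (columns)
gene_study_count <- table(
  merged_toy$Hugo_Symbol,
  merged_toy$Dataset,
  dnn = c("Hugo_Symbol", "Dataset")
)
gene_study_count
```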

Further, we can subset the data by regions of interest and compute descriptive statistics only on the subset.

For example, we can investigate which types of somatic variants are found in TP53 across the studies.
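A sketch of the region-based subsetting on toy data follows. The TP53 coordinates are an approximate GRCh37 location and should be treated as an assumption, as should the `Dataset` column name; the variants themselves are made up.

```r
library(GenomicRanges)

# Toy variants; values are made up for illustration
toy <- GRanges(
  seqnames = c("17", "17", "17", "7"),
  ranges = IRanges(start = c(7577000, 7578000, 7579000, 55240000), width = 1),
  Variant_Classification = c("Missense_Mutation", "Nonsense_Mutation",
                             "Missense_Mutation", "Missense_Mutation"),
  Dataset = c("gbm_tcga", "ov_tcga", "ov_tcga", "gbm_tcga")  # assumed column name
)

# Approximate GRCh37 location of TP53 (assumption)
tp53_region <- GRanges("17", IRanges(7571720, 7590863))

# Keep only the variants overlapping the region
tp53_toy <- subsetByOverlaps(toy, tp53_region)

# Variant classes per study, with row and column totals
tp53_types <- table(tp53_toy$Variant_Classification, tp53_toy$Dataset)
addmargins(tp53_types)
```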

To go further, how many patients have mutations in TP53 for each cancer type?
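One way to answer this is to count each patient only once per study, since a patient may carry several variants in the same gene. The sketch below uses toy data; the column names `Patient_ID` and `Dataset` and the TP53 coordinates are assumptions.

```r
library(GenomicRanges)

# Toy variants; patient P1 has two variants in the region
toy <- GRanges(
  seqnames = c("17", "17", "17", "17", "7"),
  ranges = IRanges(start = c(7577000, 7577500, 7578000, 7579000, 55240000), width = 1),
  Patient_ID = c("P1", "P1", "P2", "P3", "P4"),                      # assumed column name
  Dataset = c("gbm_tcga", "gbm_tcga", "ov_tcga", "ov_tcga", "gbm_tcga")  # assumed column name
)

# Approximate GRCh37 location of TP53 (assumption)
tp53_region <- GRanges("17", IRanges(7571720, 7590863))
tp53_toy <- subsetByOverlaps(toy, tp53_region)

# Number of distinct patients with at least one variant, per study
patients_per_study <- tapply(
  tp53_toy$Patient_ID,
  tp53_toy$Dataset,
  function(ids) length(unique(ids))
)
patients_per_study
```

Dividing these counts by the total number of patients in each study, where that figure is available in the study metadata, would turn them into mutated fractions.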

5 Data Provenance

5.1 TCGA Data

When importing the mutation data from the TCGA servers, we checked the data for consistency and fixed common ambiguities in the annotation.

5.1.1 Processing

1. Selection of the most recent somatic variant calls for each study. These were stored as *.maf files in the TCGA data directory³. If both manually curated and automatically generated variant calls were available, the curated version was chosen.
2. Import of the *.maf files into R and check for consistency with the TCGA MAF specifications⁴. Please note that these guidelines are currently only suggestions, and most TCGA files violate some of them.
3. Transformation of the imported variants into a GRanges object, with one row for each reported variant. Only columns related to the genomic origin of the somatic variant were stored; additional columns describing higher-level effects, such as mutational consequences and alterations at the protein level, were dropped. The seqlevels information defining the chromosomal ranges was taken from the 1000 Genomes phase 2 reference assembly⁵.
4. Extraction of the patient barcode from the sample barcode.
5. Extraction of the metadata describing the design and analysis of the study.
6. Writing of the processed variants to disk, with one file for each study. The metadata for all studies were stored as a single, separate object.
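The transformation and barcode steps can be sketched on a toy table. This is not the package's actual import code: the column names follow common MAF conventions, the barcodes and positions are made up, and taking the first three dash-separated fields as the patient barcode is an assumption about the TCGA barcode layout.

```r
library(GenomicRanges)

# Toy MAF-like table; values are made up for illustration
maf <- data.frame(
  Hugo_Symbol = c("TP53", "EGFR"),
  Chromosome = c("17", "7"),
  Start_Position = c(7577000L, 55240000L),
  End_Position = c(7577000L, 55240001L),
  Strand = c("+", "+"),
  Tumor_Sample_Barcode = c("TCGA-02-0001-01C-01D-0182-01",
                           "TCGA-02-0003-01A-01D-0182-01"),
  stringsAsFactors = FALSE
)

# Patient barcode: first three dash-separated fields of the sample barcode (assumption)
maf$Patient_ID <- sub("^([^-]+-[^-]+-[^-]+).*$", "\\1", maf$Tumor_Sample_Barcode)

# One range per reported variant; remaining columns become metadata columns
variants <- makeGRangesFromDataFrame(
  maf,
  keep.extra.columns = TRUE,
  seqnames.field = "Chromosome",
  start.field = "Start_Position",
  end.field = "End_Position",
  strand.field = "Strand"
)

# Order the variants by genomic position
variants <- sort(variants)
variants
```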

5.1.2 Selection Criteria of Data Sets

We included data sets in the package that were

- conducted by the Broad Institute.
- cleared for unrestricted access and usage⁶.
- sequenced with Illumina platforms.

5.1.3 Consistency Check

According to the TCGA specifications for the MAF files, we screened and corrected for common artifacts in the data regarding annotation. This included:

- Transferring all genomic coordinates to the NCBI 37 reference notation (with the mitochondrial chromosome always denoted as 'MT').
- Checking the entries of each field against the values allowed for that field (currently for the columns Hugo_Symbol, Chromosome, Strand, Variant_Classification, Variant_Type, Reference_Allele, Tumor_Seq_Allele1, Tumor_Seq_Allele2, Verification_Status, Validation_Status, Sequencer).
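Both checks can be illustrated with a few lines of base R. The helper `to_ncbi()` is hypothetical, the input values are made up, and the list of allowed `Variant_Type` values is an assumption about the MAF specification rather than a copy of it.

```r
# Toy annotation values; made up for illustration
chromosome <- c("chr1", "17", "M", "chrMT", "X")
variant_type <- c("SNP", "DEL", "snp", "INS", "SNP")

# Hypothetical helper: drop a "chr" prefix and name the
# mitochondrial chromosome "MT"
to_ncbi <- function(x) {
  x <- sub("^chr", "", x)
  x[x == "M"] <- "MT"
  x
}
chromosome_ncbi <- to_ncbi(chromosome)
chromosome_ncbi

# Assumed set of allowed values for the Variant_Type column
allowed_variant_type <- c("SNP", "DNP", "TNP", "ONP", "INS", "DEL", "Consolidated")

# Flag entries that are not among the allowed values
invalid <- !(variant_type %in% allowed_variant_type)
which(invalid)
```

Flagged entries could then be corrected where the intended value is unambiguous (for example, a lowercase spelling) or reported otherwise.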

6 Alternatives

The TCGA data sets can be accessed in different ways. First, the TCGA itself offers access to certain types of its collected data⁷. Another approach has been taken by the cBioPortal for Cancer Genomics⁸ which has performed high-level analyses of several TCGA data sources, such as gene expression and copy number changes. This summarized data can be queried through an interface⁹.