SomaticCancerAlterations
1 Motivation
In recent years, considerable effort has gone into characterizing the somatic landscape of cancers. Many of these studies make their results publicly available, providing a valuable resource for investigations beyond the level of individual cohorts. The SomaticCancerAlterations package collects mutational data of several tumor types, currently focusing on the TCGA call sets, and aims for tight integration with R and Bioconductor workflows. In the following, we illustrate how to access this data and give examples of use cases.
2 Data Sets
The Cancer Genome Atlas (TCGA)1 is a consortium effort to analyze a variety of tumor types, including gene expression, methylation, copy number changes, and somatic mutations2. With the SomaticCancerAlterations package, we provide the call sets of somatic mutations for all publicly available TCGA studies. Over time, more studies will be added as they become available and unrestricted in their usage.
To get started, we get a list of all available data sets and access the metadata associated with each study.
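A minimal sketch of this step, assuming the package exports the accessors `scaListDatasets()` and `scaMetadata()` and that the package and its data are installed:

```r
library(SomaticCancerAlterations)

# Names of all studies shipped with the package
all_datasets <- scaListDatasets()
print(all_datasets)

# Metadata describing each study
meta_data <- scaMetadata()
print(meta_data)
```

The dataset names returned here are the identifiers used to load individual studies.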
Next, we load a single dataset with the scaLoadDatasets function.
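As a sketch, assuming the loader is exported as `scaLoadDatasets()` and accepts a `merge` argument as described later in this document, loading the first available study could look like this:

```r
library(SomaticCancerAlterations)

all_datasets <- scaListDatasets()

# Load the first listed study; with merge = TRUE the result is
# expected to be a single GRanges rather than a GRangesList
first_study <- scaLoadDatasets(all_datasets[1], merge = TRUE)
head(first_study, 3)
```

Selecting the study by position avoids hard-coding a dataset name, which may differ between package versions.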
3 Exploring Mutational Data
The somatic variants of each study are represented as a GRanges object, ordered by genomic position. Additional columns describe properties of the variant and relate it to the affected gene, sample, and patient.
With such data at hand, we can identify the samples and genes harboring the most mutations.
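The counting itself is a plain tabulation over the metadata columns. The sketch below uses a small toy `GRanges` in place of a real study; the positions are illustrative and the column name `Sample_ID` is an assumption about how the package labels samples.

```r
library(GenomicRanges)

# Toy stand-in for one study (illustrative values, not real calls).
# Sample_ID is an assumed column name; Hugo_Symbol follows the MAF convention.
variants <- GRanges(
  seqnames = c("17", "17", "12", "17", "7"),
  ranges = IRanges(
    start = c(7577120L, 7578406L, 25398284L, 7577538L, 140453136L),
    width = 1L
  ),
  Hugo_Symbol = c("TP53", "TP53", "KRAS", "TP53", "BRAF"),
  Sample_ID = c("S1", "S2", "S2", "S3", "S2")
)

# Count variants per sample and per gene, most frequent first
mutations_per_sample <- sort(table(variants$Sample_ID), decreasing = TRUE)
mutations_per_gene <- sort(table(variants$Hugo_Symbol), decreasing = TRUE)

head(mutations_per_sample)
head(mutations_per_gene, 10)
```

The same two `table()` calls should apply unchanged to a loaded study, provided its columns carry these names.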
4 Exploring Multiple Studies
Instead of focusing on an individual study, we can also import several at once. The results are stored as a GRangesList in which each element corresponds to a single study. This can be merged into a single GRanges object with merge = TRUE.
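A sketch of both variants, assuming the loader is exported as `scaLoadDatasets()` and that at least three studies are available:

```r
library(SomaticCancerAlterations)

all_datasets <- scaListDatasets()
selected <- head(all_datasets, 3)

# Default: one list element per study
three_studies <- scaLoadDatasets(selected)
class(three_studies)
lengths(three_studies)  # number of variants per study

# merge = TRUE: all studies combined into one object
merged_studies <- scaLoadDatasets(selected, merge = TRUE)
class(merged_studies)
```

The list form is convenient for per-study summaries, while the merged form suits cross-study tabulations.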
We then compute the number of mutations per gene and study:
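A gene-by-study count is a two-way contingency table. The sketch below builds one from toy data; the column name `Dataset`, used here to mark the study of origin in the merged object, is an assumption.

```r
library(GenomicRanges)

# Toy merged object (illustrative values); Dataset is an assumed column name
merged <- GRanges(
  seqnames = c("17", "17", "12", "17", "7", "12"),
  ranges = IRanges(
    start = c(7577120L, 7578406L, 25398284L, 7577538L, 140453136L, 25398285L),
    width = 1L
  ),
  Hugo_Symbol = c("TP53", "TP53", "KRAS", "TP53", "BRAF", "KRAS"),
  Dataset = c("study_a", "study_a", "study_a", "study_b", "study_b", "study_b")
)

# Cross-tabulate genes against studies
gene_study_count <- table(
  merged$Hugo_Symbol, merged$Dataset,
  dnn = c("Hugo_Symbol", "Dataset")
)

# Order genes by their total number of mutations, then add row/column totals
gene_study_count <- gene_study_count[
  order(rowSums(gene_study_count), decreasing = TRUE), 
]
gene_study_summary <- addmargins(gene_study_count)
head(gene_study_summary)
```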
Further, we can subset the data by regions of interest and compute descriptive statistics on the subset only.
For example, we can investigate which type of somatic variants can be found in TP53 throughout the studies.
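Region-based subsetting can be sketched with `subsetByOverlaps()` from GenomicRanges. The TP53 interval below is an approximate GRCh37 location and should be treated as an assumption, as should the toy variants and the `Dataset` column name.

```r
library(GenomicRanges)

# Toy merged object (illustrative values)
merged <- GRanges(
  seqnames = c("17", "17", "12", "17"),
  ranges = IRanges(
    start = c(7577120L, 7578406L, 25398284L, 7577538L),
    width = 1L
  ),
  Variant_Classification = c(
    "Missense_Mutation", "Nonsense_Mutation",
    "Missense_Mutation", "Missense_Mutation"
  ),
  Dataset = c("study_a", "study_a", "study_b", "study_b")
)

# Approximate TP53 location on GRCh37 (assumption; chromosome named "17")
tp53_region <- GRanges("17", IRanges(7571720, 7590863))

# Keep only variants overlapping the region, then tabulate type by study
tp53_variants <- subsetByOverlaps(merged, tp53_region)
tp53_table <- addmargins(
  table(tp53_variants$Variant_Classification, tp53_variants$Dataset)
)
tp53_table
```

Note that the chromosome names of the region and the variants must use the same notation ("17" rather than "chr17"), otherwise no overlaps are found.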
To go further, how many patients have mutations in TP53 for each cancer type?
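Answering this requires the cohort size as denominator, because patients without any reported variant do not appear in the variant table. The sketch below passes the cohort sizes explicitly; the toy data, the `Patient_ID` column name, and the TP53 coordinates are assumptions.

```r
library(GenomicRanges)

# Helper building a toy study on chromosome 17 (illustrative values)
make_study <- function(starts, patients) {
  GRanges("17", IRanges(start = starts, width = 1L), Patient_ID = patients)
}

studies <- GRangesList(
  study_a = make_study(c(7577120L, 7578406L, 41245466L), c("P1", "P1", "P2")),
  study_b = make_study(c(7577538L, 41245466L), c("P3", "P4"))
)

# Total number of patients per study (hypothetical cohort sizes)
cohort_size <- c(study_a = 4, study_b = 2)

# Approximate TP53 location on GRCh37 (assumption)
tp53_region <- GRanges("17", IRanges(7571720, 7590863))

# Fraction of the cohort with at least one variant in the region
fraction_mutated <- function(study, region, n_patients) {
  hits <- subsetByOverlaps(study, region)
  length(unique(hits$Patient_ID)) / n_patients
}

mutated_fraction <- sapply(names(studies), function(study_name) {
  fraction_mutated(studies[[study_name]], tp53_region, cohort_size[[study_name]])
})
mutated_fraction
```

Counting unique patients rather than variants prevents a patient with several TP53 variants from being counted more than once.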
5 Data Provenance
5.1 TCGA Data
When importing the mutation data from the TCGA servers, we checked the data for consistency and fixed common ambiguities in the annotation.
5.1.1 Processing
- Selection of the most recent somatic variant calls for each study. These were stored as *.maf files in the TCGA data directory3. If both manually curated and automatically generated variant calls were available, the curated version was chosen.
- Importing of the *.maf files into R and checking for consistency with the TCGA MAF specifications4. Please note that these guidelines are currently only suggestions and most TCGA files violate some of them.
- Transformation of the imported variants into a GRanges object, with one row for each reported variant. Only columns related to the genomic origin of the somatic variant were stored; additional columns describing higher-level effects, such as mutational consequences and alterations at the protein level, were dropped. The seqlevels information defining the chromosomal ranges was taken from the 1000genomes phase 2 reference assembly5.
- The patient barcode was extracted from the sample barcode.
- Metadata describing the design and analysis of the study was extracted.
- The processed variants were written to disk, with one file for each study. The metadata for all studies were stored as a single, separate object.
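Two of the steps above, deriving the patient barcode and converting the table into a GRanges object, can be sketched as follows. This is not the package's actual import code: the toy table, the MAF column names, and the new `Patient_ID` column are assumptions. The sketch relies on the TCGA barcode convention that the first 12 characters (project, tissue source site, participant) identify the patient.

```r
library(GenomicRanges)

# Toy MAF-like table (illustrative values, assumed column names)
maf <- data.frame(
  Hugo_Symbol = c("TP53", "KRAS"),
  Chromosome = c("17", "12"),
  Start_Position = c(7577120L, 25398284L),
  End_Position = c(7577120L, 25398284L),
  Strand = c("+", "+"),
  Tumor_Sample_Barcode = c(
    "TCGA-02-0001-01C-01D-0182-01",
    "TCGA-02-0003-01A-01D-0182-01"
  ),
  stringsAsFactors = FALSE
)

# Patient barcode: the first 12 characters of the sample barcode
maf$Patient_ID <- substr(maf$Tumor_Sample_Barcode, 1, 12)

# One GRanges row per reported variant; remaining columns become metadata
variants <- makeGRangesFromDataFrame(
  maf,
  keep.extra.columns = TRUE,
  seqnames.field = "Chromosome",
  start.field = "Start_Position",
  end.field = "End_Position",
  strand.field = "Strand"
)
variants
```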
5.1.2 Selection Criteria of Data Sets
We included data sets in the package that were
- conducted by the Broad Institute.
- cleared for unrestricted access and usage6.
- sequenced with Illumina platforms.
5.1.3 Consistency Check
According to the TCGA specifications for the MAF files, we screened and corrected for common artifacts in the data regarding annotation. This included:
- Transferring of all genomic coordinates to the NCBI 37 reference notation (with the mitochondrial chromosome always depicted as 'MT')
- Checking of the entries against all allowed values for the respective field (currently for the columns Hugo_Symbol, Chromosome, Strand, Variant_Classification, Variant_Type, Reference_Allele, Tumor_Seq_Allele1, Tumor_Seq_Allele2, Verification_Status, Validation_Status, Sequencer).
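Both checks can be sketched in base R. The helper names `normalize_chromosome` and `find_violations` are hypothetical, the toy table is illustrative, and the allowed values shown are only a small subset chosen for demonstration; the TCGA MAF specification remains the authoritative source.

```r
# Strip a leading "chr" and spell the mitochondrial chromosome as "MT"
normalize_chromosome <- function(chrom) {
  chrom <- sub("^chr", "", chrom, ignore.case = TRUE)
  chrom[chrom %in% c("M", "m")] <- "MT"
  chrom
}

# Row indices of entries outside the allowed values, per checked column
find_violations <- function(maf, allowed) {
  violations <- lapply(names(allowed), function(column) {
    which(!maf[[column]] %in% allowed[[column]])
  })
  names(violations) <- names(allowed)
  violations
}

# Toy MAF-like table with two deliberate annotation errors
maf <- data.frame(
  Chromosome = c("chr17", "M", "12"),
  Strand = c("+", "+", "-"),
  Variant_Type = c("SNP", "snp", "DEL"),
  stringsAsFactors = FALSE
)

# Illustrative subset of allowed values (assumption, see the MAF specification)
allowed_values <- list(
  Strand = c("+"),
  Variant_Type = c("SNP", "DNP", "TNP", "ONP", "INS", "DEL", "Consolidated")
)

maf$Chromosome <- normalize_chromosome(maf$Chromosome)
violations <- find_violations(maf, allowed_values)
violations
```

Reporting row indices rather than silently dropping rows keeps the decision on how to correct each entry explicit.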
6 Alternatives
The TCGA data sets can be accessed in different ways. First, the TCGA itself offers access to certain types of its collected data7. Another approach has been taken by the cBioPortal for Cancer Genomics8, which has performed high-level analyses of several TCGA data sources, such as gene expression and copy number changes. This summarized data can be queried through an interface9.