---
title: "autonomics"
subtitle: "Read + preprocess + analyze + visualize cross-platform omics data"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{using_autonomics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r echo=FALSE, message=FALSE, include=TRUE}
knitr::opts_chunk$set(
  echo = TRUE, collapse = TRUE, cache = TRUE, fig.show = 'hold')
#str(knitr::opts_chunk$get())
```
<!-- ![ ](https://bitbucket.org/graumannlabtools/autonomics/downloads/read_prepro_analyze.png) -->

# Abstract {-}

`autonomics` is created to make cross-platform omics analysis intuitive and enjoyable. It contains **import + preprocess + analyze + visualize** one-liners for *RNAseq*, *MS proteomics*, *SOMAscan* and *Metabolon* platforms and a *generic* importer for data from *any rectangular omics file*. With a focus on not only automation but also customization, these importers have a flexible  `formula`/ `block`/`contrastdefs` interface which allows to define any statistical formula, fit any general linear model, quantify any contrast, and use random effects or precision weights if required, building on top of the powerful `limma` platform for statistical analysis. It also offers exquisite support for analyzing **complex designs** such as the innovative **contrastogram** visualization to unravel and summarize the differential expression structure in complex designs. By autonomating repetitive tasks, generifying common steps, and intuifying complex designs, it makes cross-platform omics data analysis a fun experience. Try it  and enjoy :-).

# read + preprocess + analyze + visualize

Autonomics offers  **import + preprocess + analyze + visualize** one-liners for *RNAseq*, *MS proteomics*, *SOMAscan* and *Metabolon* platforms, as well a *generic* importer for data from *any rectangular omics file*. We will discuss each of these in more detail in the following sections, but all of them follow the following general steps:

  * **read**
    - exprs, features, samples
    - extract **subgroups** from sampleids / samplefile
    - wrap these into a **SummarizedExperiment**
  * **preprocess**
    - **normalize**
    - **log2** transform
    - **impute** in a platform-aware fashion
    - **filter**
  * **analyze**
    - plot **sample distributions**
    - plot **detections** (absent/present)
    - plot **pca** biplots
    - **fit** the linear model `~0+subgroup`
    - quantify **contrasts** between subsequent factor levels and **plot volcanoes**
  

## RNAseq

Autonomics provides ready-to-use importers for both **count** as well as **BAM** files, which read / preprocess / analyze RNAseq data. Specific to RNAseq is counting reads using Rsubread::featureCounts (for read_rna_bams)), normalizing for library composition (**cpm**) or gene length (**tpm**), estimating **voom** precision weights, and using these in linear modeling. Let's illustrate both of these on example datasets:

### read_rnaseq_counts {-}

**Billing et al. (2019)** studied the differentiation of embryonic (**E00**) into mesenchymal stem cells (**E00.8**, **E01**, **E02**, **E05**, **E15**, **E30**), with similar properties as bone-marrow mesenchymal stem cells (**M00**). Importing the RNAseq counts:

```{r}
require(autonomics, quietly = TRUE)
object <-  read_rnaseq_counts(file = download_data('billing19.rnacounts.txt'), 
                              pca = TRUE, fit = 'limma', plot = TRUE)
```

### read_rnaseq_counts(sfile) {-}


The `sfile/sfileby` arguments in both importers allow to merge in sample data from a separate file by a specified column. This is especially useful in clinical datasets, such as the GSE161731 Covid-19 dataset, for which both a **counts** file as well as a **sample** file can be downloaded from GEO:

```{r}
basedir <- file.path(tempdir(), 'datasets')
dir.create(basedir, showWarnings = FALSE, recursive = TRUE)
if (!dir.exists(file.path(basedir, 'GSE161731'))){
    GEOquery::getGEOSuppFiles("GSE161731", baseDir=basedir)
}
object <- read_rnaseq_counts(
  file  = file.path(basedir, 'GSE161731/GSE161731_counts.csv.gz'), 
  sfile = file.path(basedir, 'GSE161731/GSE161731_counts_key.csv.gz'), 
  sfileby = 'rna_id', 
  subgroupvar = 'gender', 
  pca = TRUE, 
  fit = 'limma', 
  plot = TRUE)
```

### read_rnaseq_counts(block) {-}

The `block` argument in both importers allows to model e.g. subject_id as a random effect, which is very common in clinical data. Repeating the GEO Covid-19 analysis with subject_id as a block effect. Outcomment the line below, which takes a few minutes to complete.

```{r}
#object <- read_rnaseq_counts(
#  file  = '~/autonomicscache/datasets/GSE161731/GSE161731_counts.csv.gz', 
#  sfile = '~/autonomicscache/datasets/GSE161731/GSE161731_counts_key.csv.gz', 
#  sfileby = 'rna_id', 
#  subgroupvar = 'gender', 
#  block = 'subject_id')
```

### read_rnaseq_bams {-}

**Billing et al. (2016)** compared three types of stem cells: embryonic (**E**), embryonic differentiated into mesenchymal  (**EM**), and bone-marrow mesenchymal (**BM**). Importing a downsized version of the RNAseq BAM files (with only a subset of all reads):

```{r, results = 'hide', eval = FALSE}
# not run to avoid issues with R CMD CHECK 
if (requireNamespace('Rsubread')){
  object <- read_rnaseq_bams(
                dir    = download_data('billing16.bam.zip'), 
                paired = TRUE, 
                genome = 'hg38', 
                pca    = TRUE, 
                fit   = 'limma', 
                plot   = TRUE)
}
```

### cpm/tmm/voom effects on power {-}

Proper preprocessing leads to increased power:

```{r}
  file <- download_data('billing19.rnacounts.txt')
# log2counts
  object <- read_rnaseq_counts(file, 
       cpm = FALSE, voom = FALSE, fit = 'limma', verbose = FALSE, plot = FALSE)
  colSums(summarize_fit(object, 'limma')[, -1])
# log2cpm
  object <- read_rnaseq_counts(file,
       cpm = TRUE,  voom = FALSE, fit = 'limma', verbose = FALSE, plot=FALSE)
  colSums(summarize_fit(object, 'limma')[, -1])
# log2cpm + voom
  object <- read_rnaseq_counts(file,  # log2 cpm + voom
       cpm = TRUE,  voom = TRUE,  fit = 'limma', verbose = FALSE, plot=FALSE)
  colSums(summarize_fit(object, 'limma')[, -1])
```

## Proteingroups/phosphosites

A popular approach in (DDA) MS proteomics data analysis is to use **MaxQuant** to produce **proteinGroups.txt** and **phospho (STY)Sites.txt** files. These can be imported using `read_proteingroups` / `read_phosphosites`, which :

* **Read**
    - **quantification cols**: LFQ intensities, Normalized Ratios, or Reporter intensities
    - **feature cols**: id, Majority protein IDs, Protein names, Gene names, Contaminant, Potential contaminant, Reverse
    - **subgroups** from sampleids / samplefile
* **Preprocess**
    - Convert **0 -> NA** (they represent lack of detection)
    - **Log2** transform
    - Rm **contaminants/reverse/unreplicated** features
    - **Impute** consistent NAs with values drawn from a half-normal distribution around zero
    - For phosphosites: **compute occupancies** by subtracting protein expressions
* **Analyze**
    - plot **sample distributions**
    - plot **detections**
    - **pca** biplot
    - **Fit** a linear model
    - Quantify **contrasts** between subsequent factor levels and plot **volcanoes**


### LFQ intensities {-}

An **LFQ intensities** example is the **Fukuda et al. (2020)** dataset, which compares the proteome of **30d** zebrafish juveniles versus **adults** using label-free quantification. In the volcano plot measured values are shown with circles, imputed values as triangles.

```{r}
object <- read_proteingroups(file = download_data('fukuda20.proteingroups.txt'), 
                             pca = TRUE, fit = 'limma', plot = TRUE)
```


### Normalized ratios {-}

**Normalized ratios** were used in the proteomics profiling of the earlier described **Billing et al. (2019)** study, which examined the differentiation of embryonic stem cells (**E00**) into mesenchymal stem cells (**E01**,**E02**, **E05**, **E15**, **E30**), and compared these to bone-marrow derived mesenchymal stem cells (**M00**). The proteomics profiling experiment was performed using MS1 labeling: a light (**L**) labeled internal standard was created by mixing all samples in equal amounts, the subsequent samples were either medium (**M**) or heavy (**H**) labeled. Not all label combinations were of interest, and the `select_subgroups` argument allows to specifies which subgroup combinations to retain:


```{r}
object <- read_proteingroups(
              file = download_data('billing19.proteingroups.txt'),
              select_subgroups = c('E00_STD', 'E01_STD', 'E02_STD',
                                   'E05_STD', 'E15_STD', 'E30_STD', 'M00_STD'), 
              pca = TRUE, fit = 'limma', plot = TRUE)
```

The phosphosites can be read in a similar fashion:

```{r}
object <- read_phosphosites(
              file = download_data('billing19.phosphosites.txt'), 
              proteinfile = download_data('billing19.proteingroups.txt'), 
              select_subgroups = c('E00_STD', 'E01_STD', 'E02_STD',
                                   'E05_STD', 'E15_STD', 'E30_STD', 'M00_STD'), 
              pca = TRUE, fit = 'limma', plot = TRUE)
```

## Metabolon

**Metabolon** delivers metabolomics results in the form of an **xlsx** file, with values of interest likely in the (second) **OrigScale** sheet. Features are laid out in rows, samples in columns, and additional feature/sample data are self-contained in the file. Metabolon data can be read/analyzed with the autonomics function `read_metabolon`, as illustrated below on the dataset of **Halama and and Atkin (2018)**, who investigates the effects of **hypoglycemia** on a number of subjects (**SUB**) at four different time points (**t0**, **t1**, **t2**, **t3**):

```{r, fig.width=7, fig.height=4, out.width="95%"}
object <- read_metabolon(
        file  = download_data('atkin18.metabolon.xlsx'),  subgroupvar = 'SET',
        pca = TRUE, fit = 'limma', block = 'SUB', plot = TRUE)
```

## Somascan

**Somascan** results are returned in a txt file with extension **.adat**. Features are laid out in columns and samples in rows. Rectangles with feature/sample data are also contained in the file. The autonomics function `read_somascan` reads/analyzes SOMAscan files, additionally filtering samples/features for quality and type. This is illustrated on the above described dataset from **Halama and and Atkin (2018)**, who investigated the effects of **hypoglycemia** on a number of subjects (**SUB**) at four different time points (**t0**, **t1**, **t2**, **t3**):

```{r}
object <- read_somascan(
            file = download_data('atkin18.somascan.adat'), 
            pca = TRUE, fit = 'limma', block = 'Subject_ID', plot = TRUE)
```

<!--
# complex designs

Multi-factor subgroups are automatically decomposed into `subfactors`, for each of which contrasts between subsequent levels are computed and visualized into a two-dimensional volcano grid.

## The contrastogram
The contrastogram

# Analyzing clinical data
-->

# SessionInfo {-}

```{r}
sessionInfo()
```


# References {-}

A  M Billing, S S Dib, A M Bhagwat, I T da Silva, R D Drummond, S Hayat, R Al-Mismar, H Ben-Hamidane, N Goswami, K Engholm-Keller, M R Larsen, K Suhre, A Rafii, J Graummann (2019). Mol Cell Proteomics. 18(10):1950-1966. doi: 10.1074/mcp.RA119.001356.

A M Billing, H Ben Hamidane, S S Dib, R J Cotton, A M Bhagwat, P Kumar, S Hayat, N A Yousri, N Goswami, K Suhre, A Rafii, J Graumann (2016). Comprehensive transcriptomics and proteomics characterization of human mesenchymal stem cells reveals source specific cellular markers. Sci Rep. 9;6:21507. doi: 10.1038/srep21507.

R Fukuda, R Marin-Juez, H El-Sammak, A Beisaw, R Ramadass, C Kuenne, S Guenther, A Konzer, A M Bhagwat, J Graumann, D YR Stainier (2020). EMBO Rep. 21(8): e49752. doi: 10.15252/embr.201949752

A Halama, M Kulinski, S Dib, S B Zaghlool, K S Siveen, A Iskandarani, J Zierer 3, K S Prabhu, N J Satheesh, A M Bhagwat, S Uddin, G Kastenmüller, O Elemento, S S Gross, K Suhre (2018). Accelerated lipid catabolism and autophagy are cancer survival mechanisms under inhibited glutaminolysis. Cancer Lett., 430:133-147. doi:10.1016/j.canlet.2018.05.017

A Halama, H Kahal, A M Bhagwat, J Zierer, T Sathyapalan, J Graumann, K Suhre, S L Atkin (2018). Metabolic and proteomics signatures of hypoglycaemia in type 2 diabetes. Diabetes, obesity and metabolism,  https://doi.org/10.1111/dom.13602