---
title: "Introduction to `awst`"
author: 
  - name: Davide Risso
    affiliation:
    - Department of Statistical Sciences, University of Padova
    email: risso.davide@gmail.com
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('awst')`"
vignette: >
  %\VignetteIndexEntry{Introduction to awst}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
)
```


```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    awst = citation("awst")[1]
)
```

# Basics

## Install `awst`

`R` is an open-source statistical environment which can be easily modified to 
enhance its functionality via packages. `r Biocpkg("awst")` is a `R` package 
available via the [Bioconductor](http://bioconductor.org) repository for 
packages. `R` can be installed on any operating system from 
[CRAN](https://cran.r-project.org/) after which you can install 
`r Biocpkg("awst")` by using the following commands in your `R` session:

```{r "install", eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("awst")

## Check that you have a valid Bioconductor installation
BiocManager::valid()
```

## Required knowledge

`r Biocpkg("awst")` is based on many other packages and in particular in those 
that have implemented the infrastructure needed for dealing with RNA-seq data. 
That is, packages like `r Biocpkg("SummarizedExperiment")`.

If you are asking yourself the question "Where do I start using Bioconductor?" 
you might be interested in [this blog 
post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU).

## Asking for help

As package developers, we try to explain clearly how to use our packages and in 
which order to use the functions. But `R` and `Bioconductor` have a steep 
learning curve so it is critical to learn where to ask for help. The blog post 
quoted above mentions some but we would like to highlight the 
[Bioconductor support site](https://support.bioconductor.org/) as the main 
resource for getting help: remember to use the `awst` tag and check 
[the older posts](https://support.bioconductor.org/t/awst/). 
Other alternatives are available such as creating GitHub issues and tweeting. 
However, please note that if you want to receive help you should adhere to the 
[posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). 
It is particularly critical that you provide a small reproducible example and 
your session information so package developers can track down the source of the 
error.

## Citing `awst`

We hope that `r Biocpkg("awst")` will be useful for your research. Please use 
the following information to cite the package and the overall approach. 
Thank you!

```{r "citation"}
## Citation info
citation("awst")
```

# What does `awst` does?

AWST aims to regularize the original read counts to reduce the effect of noise 
on the clustering of samples. In fact, gene expression data are characterized by
high levels of noise in both lowly expressed features, which suffer from 
background effects and low signal-to-noise ratio, and highly expressed features,
which may be the result of amplification bias and other experimental artifacts. 
These effects are of utmost importance in highly degraded or low input material 
samples, such as tumor samples and single cells.

AWST comprises two main steps. In the first one, namely the standardization 
step, we standardize the counts by centering and scaling them, exploiting the 
log-normal probability distribution. We refer to the standardized counts as 
z-counts. The second step, namely the smoothing step, leverages a highly skewed 
transformation that decreases the noise while preserving the influence of genes 
to separate molecular subtypes. These two steps are implemented in the `awst` 
function.

A further filtering method, implemented in the `gene_filter` function, is 
suggested to remove those features that only contribute noise to the clustering.

# Quick start

```{r "start", message=FALSE, warning=FALSE}
library(awst)
library(airway)
library(SummarizedExperiment)
library(EDASeq)
library(ggplot2)
```

Here, we will use the data in the `r Biocpkg("airway")` package to illustrate 
the `awst` approach.

Please, see our paper `r Citep(bib[["awst"]])` and 
[this repository](https://github.com/drisso/awst_analysis) for more extensive 
and biologically relevant examples.


```{r reading}
data(airway)
airway
```

The data are stored in a `RangedSummarizedExperiment`, a special case of the 
`SummarizedExperiment` class, one of the central classes in Bioconductor. If you
are not familiar with it, I recomment to look at its vignette available at 
`r Biocpkg("SummarizedExperiment")`.

First, we filter out non-expressed genes. For simplicity, we remove those genes
with fewer than 10 reads on average across samples.

```{r filtering}
filter <- rowMeans(assay(airway)) >= 10
table(filter)

se <- airway[filter,]
```

We are left with `r sum(filter)` genes. We are now ready to apply `awst` to the 
data.

```{r raw_awst}
se <- awst(se)
se
plot(density(assay(se, "awst")[,1]), main = "Sample 1")
```

We can see that the majority of the values have been shrunk around −2, while the
others values gradually increase up to around 4. The effect of reducing the 
contribution of lowly expressed genes, and of the winsorization for the highly 
expressed ones, results in a better separation of the samples, reflecting 
biological differences `r Citep(bib[["awst"]])`.

The other main function of the `r Biocpkg("awst")` package is `gene_filter`. 
It can be used to remove those genes that contribute little to nothing to the 
distance between samples. The function uses an entropy measure to remove the 
uninformative genes.

```{r genefilter}
filtered <- gene_filter(se)
dim(filtered)
```

Our final dataset is made of `r ncol(filtered)` genes.

We can see how the `awst` transformation leads to separation between treatment 
(along PC1) and cell line (along PC2).

```{r pca}
res_pca <- prcomp(t(assay(filtered, "awst")))
df <- as.data.frame(cbind(res_pca$x, colData(airway)))
ggplot(df, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
    geom_point() + theme_classic()
```

# Role of normalization

Although in this example `awst` applied to raw data works well, a prior 
normalization step can help. We have found that full-quantile normalization 
works well and has the computational advantage of allowing `awst` to estimate 
the parameters only once for all samples `r Citep(bib[["awst"]])`.

Here we show the results of `awst` after full-quantile normalization 
(implemented in `r Biocpkg("EDASeq")`).

```{r full}
assay(se, "fq") <- betweenLaneNormalization(assay(se), which="full")
se <- awst(se, expr_values = "fq")

res_pca <- prcomp(t(assay(se, "awst")))
df <- as.data.frame(cbind(res_pca$x, colData(airway)))
ggplot(df, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
    geom_point() + theme_classic()
```

# Reproducibility

The `r Biocpkg("awst")` package `r Citep(bib[["awst"]])` was made possible 
thanks to:

* R `r Citep(bib[["R"]])`
* `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])`
* `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])`
* `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`
* `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])`
* `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])`
* `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])`

This package was developed using `r BiocStyle::Biocpkg("biocthis")`.


Code for creating the vignette

```{r createVignette, eval=FALSE}
## Create the vignette
library("rmarkdown")
system.time(render("awst_intro.Rmd", "BiocStyle::html_document"))

## Extract the R code
library("knitr")
knit("awst_intro.Rmd", tangle = TRUE)
```

Date the vignette was generated.

```{r reproduce1, echo=FALSE}
## Date the vignette was generated
Sys.time()
```

Wallclock time spent generating the vignette.

```{r reproduce2, echo=FALSE}
## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)
```

`R` session information.

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```


# Bibliography

This vignette was generated using `r Biocpkg("BiocStyle")` 
`r Citep(bib[["BiocStyle"]])`
with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")`
`r Citep(bib[["rmarkdown"]])` running behind the scenes.

Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`.

```{r vignetteBiblio, results="asis", echo=FALSE, warning=FALSE, message=FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```