---
title: "perturbatr cookbook"
author: "Simon Dirmeier"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
    toc_float: true
bibliography: perturbatr.bib
vignette: >
  %\VignetteIndexEntry{perturbatr cookbook}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r style, echo=FALSE, results='asis'}
  BiocStyle::markdown()
```

```{r setup, include=FALSE}
  knitr::opts_chunk$set(echo = TRUE)
  options(warn = -1)
  library(dplyr)
  library(tibble)
  library(methods)
  library(perturbatr)
  data(rnaiscreen)
  rnaiscreen <- dataSet(rnaiscreen) %>%    
    dplyr::select(Condition, Replicate, GeneSymbol,
                  Perturbation, Readout, Control, 
                  Design, ScreenType, Screen) %>%
    as.tibble()
```

# Introduction

`perturbatr` does stage-wise analysis of large-scale genetic
 perturbation screens for integrated data sets consisting of multiple screens.
 For multiple integrated perturbation screens a hierarchical model that
 considers the variance between different biological conditions is fitted.
 That means that we first estimate relative effect sizes for all genes.
 The resulting hit lists is then further extended using a network
 propagation algorithm to correct for false negatives.

Here we show an example data analysis using a pan-pathogenic data set of 
three RNAi screening studies.
The data set consists of two kinome and a druggable genome wide RNAi screen 
and have been 
published in @reiss2011recruitment (HCV) and @de2015kinome (SARS).

# Data analysis tutorial

This tutorial walks you to the basic functionality of `perturbatr`.

## Creating a `PerturbationData` object

You supposedly start with something like a `data.frame` or `tibble`:

```{r, include=TRUE, size="tiny"}
  head(rnaiscreen)
```

In order to start your analysis you need to create a perturbation data set 
first.For this you only need to call the `as` method on your `data.frame`:

```{r, include=TRUE, eval=TRUE}
  rnaiscreen <- methods::as(rnaiscreen, "PerturbationData")
```

Coercing your `data.frame` to `PerturbationData` will automatically warn you if
your table is formatted wrongly. You need at least the following column names
order to be able to do analysis of perturbation screens using `perturbatr`:

* Condition: an identifier that best describes the respective screen. 
For instance this can be the name of a virus for pathogen screens, the name 
of a cell line, organoid or the like. The *condition* describes a single data 
set, i.e. if you want to integrate multiple different data sets, make sure to 
give each a different condition.
* Replicate: an integer representing the replicate number of a screen.
* GeneSymbol: the HUGO identifier, ENTREZ id, etc. as character.
* Perturbation: a siRNA id or gRNA id that describes the knockout/knockdown for 
the gene.
* Readout: a *normalized* readout like a log-fold change for gRNAs, 
a GFP signal, etc.
* Control: vector of integers representing perturbations that have been
used as negative or positive controls. A negative control is marked with
'-1', a positive control with '1' and a normal sample with '0'.

Depending on how you want to model the readout using the hierarchical model, 
you might want to add additional
columns. For the sake of simplicity this suffices though.

## Working with `PerturbationData` S4 objects

A `PerturbationData` object consists of a single slot that stores your data. We
bundled your data into an `S4` object such that dispatch is easier to handle
and to make sure that your data set has the correct columns:

```{r, eval=TRUE, include=TRUE}
  rnaiscreen
  dataSet(rnaiscreen)
```

`PerturbationData` has some basic `filter` and `rbind` functionality.
Similar to `dplyr::filter` you can select rows by some predicate(s). In the
example below we extract all rows from the data set that have a positive
readout.

```{r, eval=TRUE, include=TRUE}
  perturbatr::filter(rnaiscreen, Readout > 0)
```

Filtering on multiple rows works by just adding predicates:
```{r, eval=TRUE, include=TRUE}
  perturbatr::filter(rnaiscreen, Readout > 0, Replicate == 2)
```

If you want to combine data sets you can call `rbind`, which will 
automatically dispatch on `PerturbationData` object:
```{r, eval=TRUE, include=TRUE}
  dh <- perturbatr::filter(rnaiscreen, Readout > 0, Replicate == 2)
  rbind(dh, dh)
```

## Data analysis using a hierarchical model and network diffusion

Finally, after having set up the data set, we analyse it using a hierarchical 
model and network diffusion. 

We **expect you already normalized the data sets accordingly**. 
As noted above, if you want to analyse multiple data sets, make sure that
every data set corresponds to a unique `Condition`.

First, let's have a rough look at the data set that we are using:
```{r, eval=TRUE, include=TRUE, fig.align="center", fig.width=4, message=FALSE}
  plot(rnaiscreen)
```

We have roughly the same number of replicates per gene, but the HCV screen has
less genes
than the SARS data set. That is no problem however, because we automatically 
filter such that the genes are same.
We also automatically remove positive controls for obvious reasons.

Next we rank the genes using a hierarchical model which requires explicitely
modelling the readout of our data set using an
R `formula`. Let's look at the data in more detail first:

```{r}
  dataSet(rnaiscreen) %>% str()
```

Here, variables like `Replicate`, `Plate`, `RowIdx/ColIdx` should not be 
associated with a change in the response `Readout` as we normalized the data 
and corrected for batch effects.
However, the `Readout`s should definitely have been different between 
`ScreenType`s:

```{r}
  dataSet(rnaiscreen) %>% pull(ScreenType) %>% unique()
```

where `E/R` represents that the screen has measured the effect of a gene 
knockdown during the *entry and replication* stages of the viral 
lifecycle while `A/R` repesents the gene knockdown's effect having been 
measures during the *assembly and release* stages of the lifecycle.
In the life cycle of positive-sense RNA viruses we know 
that viruses make use of different host factors during their life cycle. 
That means while some genes are required during *entry and replication*, 
others might play a role in *assembly and release* of the virions. 
So we have reason to believe that the stage of the infection also introduces 
a clustering effect. In that case we would need to add a random effect 
for the stage of the infection.

A model selection using the Bayesian information criterion indeed suggests 
the following hierarchical random intercept model:

$$y_{cgtp} \mid\gamma_g, \delta_{cg}, \zeta_t , \xi_{ct} \sim \mathcal{N}(x_c \beta + \gamma_g  + \delta_{cg} + \zeta_t + \xi_{ct},  \sigma^2),$$
where $y_{cgtp}$ is the readout of virus $c$, gene $g$, stage of the viral
lifecycle $t$ (`E/R` vs `A/R`) and $p$ is the perturbation (siRNA) used for
gene $g$.
We estimate the parameters of the model using `lme4` [@bates2014lme4]:

```{r, eval=TRUE, include=TRUE, warning=FALSE, message=FALSE}
  frm <- Readout ~ Condition +
                   (1|GeneSymbol) + (1|Condition:GeneSymbol) +
                   (1|ScreenType) + (1|Condition:ScreenType)
  res.hm <- hm(rnaiscreen, formula = frm)
```

**Note that for your own data different effects might be visible. Thus,** 
**before modelling you need to exploratorily detect possible effects.**

Let's take the last result and plot them. This yields a list of multiple
plots. The first plot shows the first 25 strongest gene effects ranked by their
absolute effect sizes. Most of the genes are colored in blue which represents
that a
gene knockdown leads to an inhibition of viral growth on a **pan-viral** level. 
Bars colored in red represent genes for which a knockdown results in increased 
viral viability.
If you are interested in the complete ranking of genes, use 
`geneEffects(res.hm)`.

```{r, include=TRUE, fig.align="center", fig.width=4, message=FALSE}
  pl <- plot(res.hm)
  pl[[1]]
```

The second plots shows the *nested gene effects*, i.e. the estimated effects of
a gene knockdown for a **single** virus. The genes shown here are the same as
in the first plot,
so it might be possible that there are *nested gene effects* that are stronger
which are just not plotted.
You can get all nested gene effects using `nestedGeneEffects(res.hm)`.

```{r include=TRUE, fig.align="center", fig.width=4, message=FALSE}
  pl[[2]]
```

Next we might want to *smooth* the effect from the hierarchical model using 
network diffusion, by that possibly reduce the number of some false negatives.
For that we need to supply a graph as a `data.frame` and call the `diffuse`
function:

```{r include=TRUE, fig.align="center", fig.width=4, message=FALSE}
  graph <- readRDS(
    system.file("extdata", "graph_small.rds",package = "perturbatr"))
  diffu <- diffuse(res.hm, graph=graph, r=0.3)
```

If we plot the results we get a list of reranked genes. Note that the
ranking uses the network diffusion computes a stationary distribution
of a Markov random walk with restarts.
```{r include=TRUE, fig.align="center", fig.width=4, message=FALSE}
  plot(diffu)
```
Further note that we used a very small network here. You might want to redo
this analysis with the full graph which is located in 
`system.file("extdata", "graph_full.rds",package = "perturbatr")`.

## Session info

```{r eval=TRUE, include=TRUE}
  sessionInfo()
```

## References