---
title: "simPIC: simulating single-cell ATAC-seq data"
author: "Sagrika Chugh"
package: simPIC
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  \usepackage[utf8]{inputenc}
  %\VignetteIndexEntry{simPIC: simulating single-cell ATAC-seq data}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 80
---

```{r knitr-options, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(max.print = 30)
```

# Introduction to simPIC

simPIC is an R package for the simple simulation of single-cell RNA sequencing
data. This vignette gives an overview and introduction to simPIC's
functionality.

# Installation

simPIC can be installed from Bioconductor:

```{r install-bioc, eval = FALSE, include=TRUE}
BiocManager::install("simPIC")
```

# Quick start

Assuming you already have a matrix of count data similar to that you wish to
simulate there are two simple steps to creating a simulated data set with
simPIC.

```{r quickstart}
# Load package
suppressPackageStartupMessages({
    library(simPIC)
})

# Load test data
set.seed(12)
counts <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))

# Estimate parameters
est <- simPICestimate(counts)

# Simulate data using estimated parameters
sim <- simPICsimulate(est)
```

These steps will be explained in detail in the following sections but briefly
the first step takes a dataset and estimates simulation parameters from it and
the second step takes those parameters and simulates a new dataset.

# Input data

`simPIC` recommends to use a Paired-Insertion Count (PIC) matrix for optimal use
of the quantitative information present in scATAC-seq data. You can convert your
own matrix to PIC by following the `r BiocStyle::Biocpkg("PICsnATAC")`
[vignette][PIC-vignette]. Briefly, you need three input files for PIC -

1.  cell barcodes with metadata (`singlecell.csv`)
2.  list of peak regions (`peaks.bed`)
3.  fragment files (`fragment.tsv.gz`)

```{r pic, eval=FALSE, include=TRUE}
pic_mat <- PIC_counting(
    cells = cells,
    fragment_tsv_gz_file_location = fragment_tsv_gz_file_location,
    peak_sets = peak_sets
)
```

# The simPIC simulation

The core of the simPIC model is a gamma-Poisson distribution. This model is used
to generate a uniform quantification based peak-by-cell count matrix. Mean
chromatin accessibility signals for each peak are simulated from a [weibull
distribution](https://en.wikipedia.org/wiki/Weibull_distribution) with default
settings. Users have the flexibility to choose from `gamma`, `lognormal-gamma` or
`pareto` distribution as well. Each cell is given an expected library size,
simulated from a log-normal distribution to match to a given dataset. Sparsity
is imposed on counts simulated from a [Poisson
distribution](https://en.wikipedia.org/wiki/Poisson_distribution).

# The `simPICcount` class

All the parameters for the simPIC simulation are stored in a `simPICcount`
object. A class specifically desgined for storing simPIC scATAC-seq simulation
parameters. Let's create a new one and see what it looks like.

```{r simPICparams}
sim.params <- newsimPICcount()
```

we can see the default values for the `simPICcount` object parameters. These
values are based on provided test data.

```{r params}
sim.params
```

## Getting and setting parameters

To get a particular parameter, for e.g., number of peaks, we can use `simPICget`
function:

```{r getParam}
simPICget(sim.params, "nPeaks")
```

Alternatively, to give a parameter a new value we can use the
`setsimPICparameters` function:

```{r setParam}
sim.params <- setsimPICparameters(sim.params, nPeaks = 2000)
simPICget(sim.params, "nPeaks")
```

To get or set multiple parameters use `simPICget` or `setsimPICparameters`
functions:

```{r getParams-setParams}
# Set multiple parameters at once (using a list)
sim.params <- setsimPICparameters(sim.params,
    update = list(nPeaks = 8000, nCells = 500)
)
# Extract multiple parameters as a list
params <- simPICgetparameters(
    sim.params,
    c("nPeaks", "nCells", "peak.mean.shape")
)
# Set multiple parameters at once (using additional arguments)
params <- setsimPICparameters(sim.params,
    lib.size.sdlog = 3.5, lib.size.meanlog = 9.07
)
params
```

# Estimating parameters

simPIC allows you to estimate many of it's parameters from a
SingleCellExperiment object containing counts or a counts matrix using the
`simPICestimate` function.

```{r simPICestimate}
# Get the counts from test data
count <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))

# Check that counts is a dgCMatrix
class(count)
typeof(count)

# Check the dimensions, each row is a peak, each column is a cell
dim(count)

# Show the first few entries
count[1:5, 1:5]

new <- newsimPICcount()
new <- simPICestimate(count)

## estimating using gamma distribution
## new <- simPICestimate(count, pm.distr = "gamma")
```

Here we estimated parameters from a counts matrix using default parameters. The
estimation process has the following steps:

1.  Mean parameters are estimated by fitting a weibull distribution (default) to
    the peak means.
2.  Library size parameters are estimated by fitting a log-normal distribution
    to the library sizes.
3.  Sparsity parameter is estimated by fitting a Bernoulli distribution.

For more details of the estimation procedures see `?simPICestimate`.

# Simulating counts

Once we have a set of parameters we are happy with we can use `simPICsimulate`
to simulate counts. To make adjustments to the parameters provide them as
additional arguments. Alternatively if we don't supply any parameters defaults
will be used:

```{r simPICsimulate}
sim <- simPICsimulate(new, nCells = 1000)
sim

## simulating using gamma distribution
## sim <- simPICsimulate(new, nCells =1000, pm.distr = "gamma")
```

Looking at the output of `simPICsimulate` we can see that `sim` is
`SingleCellExperiment` object with `r nrow(sim)` features (peaks) and
`r ncol(sim)` samples (cells). The main part of this object is a features by
samples matrix containing the simulated counts (accessed using `counts`).
Additionally a `SingleCellExperiment` contains information about each cell
(accessed using `colData`) and each peak (accessed using `rowData`). simPIC uses
these slots, as well as `assays`, to store information about the intermediate
values of the simulation.

```{r SCE}
# Access the counts
counts(sim)[1:5, 1:5]
# Information about peaks
head(rowData(sim))
# Information about cells
head(colData(sim))
# Peak by cell matrices
names(assays(sim))
```

For more details about the `r BiocStyle::Biocpkg("SingleCellExperiment")` 
object refer to the[vignette][SCE-vignette].

The `simPICsimulate` function provides additional simulation details:

\* **Cell information (`colData`)**

\* `Cell` - Unique cell identifier.

\* `exp.libsize` - The expected library size for that cell. (not obtained from
the final simulated counts)

\* **Peak information (`rowData`)**

\* `Peak` - Unique peak identifier.

\* `exp.peakmean` - The expected peak means for that peak. (not obtained from
the final simulated counts)

\* **Peak by cell information (`assays`)**

\* `counts` - The final simulated counts.

For more information on the simulation see `?simPICsimulate`.

# Comparing simulations and real data

simPIC provides a function `simPICcompare` that aims to make these comparisons
easier. This function takes a list of `SingleCellExperiment` objects, combines
the datasets and produces comparison plots. Let's make two small simulations and
see how they compare.

```{r comparison}
sim1 <- simPICsimulate(nPeaks = 20000, nCells = 1000)
sim2 <- simPICsimulate(nPeaks = 20000, nCells = 1000)
comparison <- simPICcompare(list(real = sim1, simPIC = sim2))

names(comparison)
names(comparison$Plots)
```

The returned list has three items. The first two are the combined datasets by
peak (`RowData`) and by cell (`ColData`) and the third contains some comparison
plots (produced using `ggplot2`), for example a plot of the distribution of
means:

```{r comparison-means}
comparison$Plots$Means
```

These are only a few of the plots you might want to consider but it should be
easy to make more using the returned data.

# Citing simPIC

If you use simPIC in your work please cite our paper:

```{r citation}
citation("simPIC")
```

# Session information {.unnumbered}

```{r sessionInfo}
sessionInfo()
```
[SCE-vignette]:https://bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html
[PIC-vignette]:https://github.com/Zhen-Miao/PICsnATAC/blob/main/vignettes/Run_PIC_counting_on_pbmc_3k_data.ipynb).