---
title: "scDesign3 Quickstart"
author: 
  - name: Dongyuan Song
    affiliation:
    - Bioinformatics IDP, University of California, Los Angeles
    email: dongyuansong@ucla.edu
  - name: Qingyang Wang
    affiliation:
    - Department of Statistics, University of California, Los Angeles
    email: qw802@g.ucla.edu
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('scDesign3')`"
vignette: >
  %\VignetteIndexEntry{scDesign3-quickstart-vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---
```{css, echo=FALSE}
pre {
  white-space: pre !important;
  overflow-x: scroll !important;
}
```

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    message = FALSE,
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
tools::R_user_dir("scDesign3", which="cache")
```

```{r, message=FALSE, warning=FALSE, results='hide'}
library(scDesign3)
library(SingleCellExperiment)
library(ggplot2)
theme_set(theme_bw())
```

## Introduction
scDesign3 is a unified probabilistic framework that generates realistic in silico high-dimensional single-cell omics data of various cell states, including discrete cell types, continuous trajectories, and spatial locations by learning from real datasets. Since the functions of scDesign3 is very comprehensive, here we only introduce how scDesign3 simulates an scRNA-seq dataset with one continuous developmental trajectory. For more information, please check the Articles on our website: (https://songdongyuan1994.github.io/scDesign3/docs/index.html).

## Read in the reference data

The raw data is from the [scvelo](https://scvelo.readthedocs.io/scvelo.datasets.pancreas/), which describes pancreatic endocrinogenesis. We pre-select the top 1000 highly variable genes and filter out some cell types to ensure a **single trajectory**.

```{r}
example_sce <- readRDS((url("https://figshare.com/ndownloader/files/40581992")))
print(example_sce)
```
To save computational time, we only use the top 100 genes.
```{r}
example_sce <- example_sce[1:100, ]
```

## Simulation
The function `scdesign3()` takes in a `SinglecellExperiment` object with the cell covariates (such as cell types, pesudotime, or spatial coordinates) stored in the `colData` of the `SinglecellExperiment` object.
```{r, message=FALSE, warning=FALSE, results='hide'}
set.seed(123)
example_simu <- scdesign3(
    sce = example_sce,
    assay_use = "counts",
    celltype = "cell_type",
    pseudotime = "pseudotime",
    spatial = NULL,
    other_covariates = NULL,
    mu_formula = "s(pseudotime, k = 10, bs = 'cr')",
    sigma_formula = "1", # If you want your dispersion also varies along pseudotime, use "s(pseudotime, k = 5, bs = 'cr')"
    family_use = "nb",
    n_cores = 2,
    usebam = FALSE,
    corr_formula = "1",
    copula = "gaussian",
    DT = TRUE,
    pseudo_obs = FALSE,
    return_model = FALSE,
    nonzerovar = FALSE
  )
```
The output of `scdesign3()` is a list which includes:

* `new_count`: This is the synthetic count matrix generated by `scdesign3()`.
* `new_covariate`: 
  + If the parameter `ncell` is set to a number that is different from the number of cells in the input data, this will be a matrix that has the new cell covariates that are used for generating new data.
  + If the parameter `ncell` is the default value, this will be `NULL`.
* `model_aic`: This is a vector include the genes' marginal models' AIC, fitted copula's AIC, and total AIC, which is the sum of the previous two AIC. 
* `model_bic`: This is a vector include the genes' marginal models' BIC, fitted copula's BIC, and total BIC, which is the sum of the previous two BIC.
* `marginal_list`:
  + If the parameter `return_model` is set to `TRUE`, this will be a list which contains the fitted gam or gamlss model for all genes in the input data.
  + If the parameter `return_model` is set to the default value `FALSE`, this will be `NULL`.
* `corr_list`:
  + If the parameter `return_model` is set to `TRUE`, this will be a list which contains the either a correlation matrix (when `copula = "gaussian"`) or the fitted vine copula (when `copula = "vine`) for each user specified correlation groups (based on the parameter `corr_by`).
  + If the parameter `return_model` is set to the default value `FALSE`, this will be `NULL`.

In this example, since we did not change the parameter `ncell`, the synthetic count matrix will have the same dimension as the input count matrix. 
```{r}
dim(example_simu$new_count)
```
Then, we can create the `SinglecellExperiment` object using the synthetic count matrix and store the `logcounts` to the input and synthetic `SinglecellExperiment` objects.
```{r}
logcounts(example_sce) <- log1p(counts(example_sce))
simu_sce <- SingleCellExperiment(list(counts = example_simu$new_count), colData = example_simu$new_covariate)
logcounts(simu_sce) <- log1p(counts(simu_sce))
```

## Visualization
```{r}
set.seed(123)
compare_figure <- plot_reduceddim(ref_sce = example_sce, 
                                  sce_list = list(simu_sce), 
                                  name_vec = c("Reference", "scDesign3"),
                                  assay_use = "logcounts", 
                                  if_plot = TRUE, 
                                  color_by = "pseudotime", 
                                  n_pc = 20)
plot(compare_figure$p_umap)
```

## Session information
```{r}
sessionInfo()
```