--- title: "countsimQC - Comparing characteristic features across count data sets" author: "Charlotte Soneson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_caption: yes vignette: > %\VignetteIndexEntry{countsimQC User Guide} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r, include = FALSE} knitr::opts_chunk$set(echo = TRUE, crop = NA) ``` ## Introduction The `countsimQC` package provides a simple way to compare the characteristic features of a collection of (e.g., RNA-seq) count data sets. An important application is in situations where a synthetic count data set has been generated using a real count data set as an underlying source of parameters, in which case it is often important to verify that the final synthetic data captures the main features of the original data set. However, the package can be used to create a visual overview of any collection of one or more count data sets. ## Data preparation In this vignette we will show how to generate a comparative report from a collection of two simulated data sets and the original, underlying real data set. First, we load the object containing the three data sets. The object is a named list, where each element is a `DESeqDataSet` object, containing the count matrix, a sample information data frame and a model formula (necessary to calculate dispersions). For more information about the `DESeqDataSet` class, please see the [`DESeq2`](http://bioconductor.org/packages/release/bioc/html/DESeq2.html) Bioconductor package. For speed reasons, we use only a subset of the features in each data set for the following calculations. ```{r, warning = FALSE} suppressPackageStartupMessages({ library(countsimQC) library(DESeq2) }) data(countsimExample) countsimExample countsimExample <- lapply(countsimExample, function(cse) { cse[seq_len(1500), ] }) ``` ## Report generation Next, we generate the report using the `countsimQCReport()` function. Depending on the level of detail and the type of information that are required for the final report, this function can be run in different "modes": - by setting `calculateStatistics = FALSE`, only plots will be generated. This is the fastest way of running `countsimQCReport()`, and in many cases generates enough information for the user to make a visual evaluation of the count data set(s). - by setting `calculateStatistics = TRUE` and `permutationPvalues = FALSE`, some quantitative pairwise comparisons between data sets will be performed. In particular, the Kolmogorov-Smirnov test and the Wald-Wolfowitz runs test will be used to compare distributions, and additional statistics will be calculated to evaluate how similar the evaluated aspects are between pairs of data sets. - by setting `calculateStatistics = TRUE` and `permutationPvalues = TRUE` (and giving the requested number of permutations via the `nPermutations` argument), permutation of data set labels will be used to evaluate the significance of the statistics calculated in the previous point. Naturally, this increases the run time of the analysis considerably. Here, for the sake of speed, we calculate statistics for a small subset of the observations (`subsampleSize = 25`) and refrain from calculating permutation p-values. ```{r, results = "hide", warning = FALSE, message = FALSE} tempDir <- tempdir() countsimQCReport(ddsList = countsimExample, outputFile = "countsim_report.html", outputDir = tempDir, outputFormat = "html_document", showCode = FALSE, forceOverwrite = TRUE, savePlots = TRUE, description = "This is my test report.", maxNForCorr = 25, maxNForDisp = Inf, calculateStatistics = TRUE, subsampleSize = 25, kfrac = 0.01, kmin = 5, permutationPvalues = FALSE, nPermutations = NULL) ``` The `countsimQCReport()` function can generate either an HTML file (by setting `outputFormat = "html_document"` or `outputFormat = NULL`) or a pdf file (by setting `outputFormat = "pdf_document"`). The `description` argument can be used to provide a more extensive description of the data set(s) that are included in the report. ## Generation of individual figures If the argument `savePlots` is set to TRUE, an .rds file containing the individual ggplot objects will be generated. These objects can be used to perform fine-tuning of the visualizations if desired. Note, however, that the .rds file can become large if the number of data sets is large, or if the individual data sets have many samples or features. The convenience function `generateIndividualPlots()` can be used to quickly generate individual figures for all plots included in the report, using a variety of devices. For example, to generate each plot in pdf format: ```{r, warning = FALSE} ggplots <- readRDS(file.path(tempDir, "countsim_report_ggplots.rds")) if (!dir.exists(file.path(tempDir, "figures"))) { dir.create(file.path(tempDir, "figures")) } generateIndividualPlots(ggplots, device = "pdf", nDatasets = 3, outputDir = file.path(tempDir, "figures")) ``` ## Input data format In the example above, all data sets were provided as `DESeqDataSet` objects. The advantage of this is that it allows the specification of the experimental design, which is used in the dispersion calculations. `countsimQC` also allows a data set to be provided as either a `data.frame` or a `matrix`. However, in these situations, it will be assumed that all samples are replicates (i.e., a design `~1`). An example is provided in the `countsimExample_dfmat` data set, provided with the package. ```{r} data(countsimExample_dfmat) names(countsimExample_dfmat) lapply(countsimExample_dfmat, class) ``` ```{r, results = "hide", warning = FALSE, eval = FALSE} tempDir <- tempdir() countsimQCReport(ddsList = countsimExample_dfmat, outputFile = "countsim_report_dfmat.html", outputDir = tempDir, outputFormat = "html_document", showCode = FALSE, forceOverwrite = TRUE, savePlots = TRUE, description = "This is my test report.", maxNForCorr = 25, maxNForDisp = Inf, calculateStatistics = TRUE, subsampleSize = 25, kfrac = 0.01, kmin = 5, permutationPvalues = FALSE, nPermutations = NULL) ``` ## Session info ```{r} sessionInfo() ```