---
title: 'Data object'
author:
- name: Lindsay Rutter
date: '`r Sys.Date()`'
package: bigPint
bibliography: bigPint.bib
output:
  BiocStyle::html_document:
    toc_float: true
    tidy: TRUE
vignette: >
  \usepackage[utf8]{inputenc}
  %\VignetteIndexEntry{"Data object"}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignettePackage{bigPint}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE)
```

## About data object

All functions in the `bigPint` package require an input parameter called `data`, which should be a data frame that contains the full dataset of interest. If a researcher is using the package to visualize RNA-seq data, then this `data` object should be a count table that contains the read counts for all genes of interest.

____________________________________________________________________________________

## Example: two treatments

All `bigPint` functions require the `data` object to be in the same data frame format. The data frame should have $n$ rows, where $n$ is the number of genes, and $p + 1$ columns, where $p$ is the number of samples. The first column contains the gene names, and the remaining columns contain the read counts for all samples of interest. An example of this format is shown below:

```{r, eval=TRUE, include=TRUE, message=FALSE}
library(bigPint)
data("soybean_ir_sub")
head(soybean_ir_sub)
```

We can also examine the structure of an example `data` object as follows:

```{r, eval=TRUE, include=TRUE}
str(soybean_ir_sub, strict.width = "wrap")
```

This example dataset contains 5,604 genes and six samples [@soybeanIR]. There are two treatment groups, N and P. Each treatment group contains three replicates.
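
Count data often arrive as a matrix with the gene names stored as row names rather than in this layout. The following is a minimal sketch, using made-up counts and hypothetical object names (`counts` and `toy_data`), of one way to reshape such a matrix into the format described above with base R. It is an illustration only and not a `bigPint` function.

```{r, eval=TRUE, include=TRUE}
# Toy count matrix: genes in rows, samples in columns (values are made up)
set.seed(1)
counts <- matrix(
  rpois(18, lambda = 20), nrow = 3,
  dimnames = list(
    c("gene1", "gene2", "gene3"),
    c("N.1", "N.2", "N.3", "P.1", "P.2", "P.3")
  )
)

# Move the row names into a character column called "ID" and keep the
# counts as the remaining columns
toy_data <- data.frame(
  ID = rownames(counts),
  counts,
  stringsAsFactors = FALSE,
  check.names = FALSE
)
rownames(toy_data) <- NULL

str(toy_data)
```

Setting `stringsAsFactors = FALSE` keeps the `ID` column as `character` on older versions of R, where the default would otherwise convert it to a factor.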

____________________________________________________________________________________

## Data object rules

As demonstrated above, the `data` object must meet the following conditions:

* Be of type `data.frame`
* Contain at least two treatment groups and at least two replicates per treatment group
* Its first column must
    + Be called "ID"
    + Be of class `character`
    + Contain the names of the genes (or a unique set of names in general)
* Each of the rest of its columns must
    + Contain the read counts for a given sample (or quantitative values in general)
    + Be of class `integer` or `numeric`
    + Be named in a three-part format (such as "A.3" or "S4.1") that matches the Perl-style regular expression `^[a-zA-Z0-9]+\\.[0-9]+`, where
        - The first part indicates the treatment group name and must consist only of alphanumeric characters. Examples include "A", "AR", and "A9"
        - The second part consists of a dot "." to serve as a delimiter
        - The third part indicates the replicate number and must consist only of digits

It is important that the names of all columns except the first follow the three-part format delineated above. All functions in the `bigPint` package require this format to successfully produce plots. If your `data` object does not fit this format, `bigPint` will likely throw an informative error explaining why the format was not recognized.
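
The rules above can also be checked before calling any plotting function. The following is a minimal sketch of such a check in base R. The helper `check_data_format()` and the toy data frames are hypothetical and are not part of `bigPint`; the package's own checks may differ. The `$` anchor added to the pattern is an assumption that makes the check slightly stricter than the expression quoted above.

```{r, eval=TRUE, include=TRUE}
# Hypothetical helper (not part of bigPint): returns a character vector of
# problems found, or character(0) if the data frame passes every check
check_data_format <- function(data) {
  if (!is.data.frame(data)) {
    return("`data` is not a data.frame")
  }
  if (ncol(data) < 2 || !identical(names(data)[1], "ID")) {
    return("the first column must be called \"ID\", followed by sample columns")
  }
  problems <- character(0)
  if (!is.character(data$ID)) {
    problems <- c(problems, "column \"ID\" is not of class character")
  }
  if (anyDuplicated(data$ID) > 0) {
    problems <- c(problems, "column \"ID\" contains duplicated names")
  }

  # is.numeric() is TRUE for both integer and double columns
  sample_cols <- names(data)[-1]
  if (!all(vapply(data[sample_cols], is.numeric, logical(1)))) {
    problems <- c(problems, "some sample columns are not integer or numeric")
  }

  # Assumption: anchor the end of the name with `$` for a stricter match
  well_named <- grepl("^[a-zA-Z0-9]+\\.[0-9]+$", sample_cols)
  if (!all(well_named)) {
    bad <- paste(sample_cols[!well_named], collapse = ", ")
    return(c(problems, paste("badly named columns:", bad)))
  }

  # The treatment group is everything before the dot
  reps_per_group <- table(sub("\\.[0-9]+$", "", sample_cols))
  if (length(reps_per_group) < 2) {
    problems <- c(problems, "fewer than two treatment groups")
  }
  if (any(reps_per_group < 2)) {
    problems <- c(problems, "a treatment group has fewer than two replicates")
  }
  problems
}

# Toy data frames with made-up values
good <- data.frame(
  ID = c("g1", "g2"),
  A.1 = c(1L, 5L), A.2 = c(2L, 6L),
  B.1 = c(3L, 7L), B.2 = c(4L, 8L),
  stringsAsFactors = FALSE
)
bad <- data.frame(
  ID = c("g1", "g2"),
  ctrl_1 = c(1, 5), ctrl_2 = c(2, 6),
  B.1 = c(3, 7), B.2 = c(4, 8),
  stringsAsFactors = FALSE
)

check_data_format(good)
check_data_format(bad)
```

Returning the list of problems, rather than stopping at the first one, makes it easier to repair several column names in one pass.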

____________________________________________________________________________________

## Example: three treatments

Note that the `data` object can contain more than two treatment groups. In this case, the `bigPint` software will automatically create plots for all pairs of treatment groups. An example of this type of dataset is provided in the `bigPint` package and can be accessed as follows:

```{r, eval=TRUE, include=TRUE}
data(soybean_cn_sub)
```

```{r, eval=TRUE, include=TRUE}
str(soybean_cn_sub, strict.width = "wrap")
```

This example dataset contains 7,332 genes and nine samples [@brown2015developmental]. There are three treatment groups, S1, S2, and S3. Each treatment group contains three replicates. In such cases where the `data` object contains more than two treatment groups, all functions in the `bigPint` package (except `plotSMApp()`) will automatically produce a plot for each pairwise combination of treatment groups.

For example, `bigPint` functions will produce plots for S1 versus S2, S1 versus S3, and S2 versus S3 in this case. The same could be accomplished (although less efficiently) by splitting the dataset into three pairwise datasets and running a `bigPint` function of interest on each of them individually.

```{r, eval=TRUE, include=TRUE, message=FALSE, warning=FALSE}
library(dplyr)
# For each pair of treatment groups, keep "ID" plus that pair's sample columns
soybean_cn_sub_S1S2 <- soybean_cn_sub %>% select("ID", contains("S1"), contains("S2"))
soybean_cn_sub_S1S3 <- soybean_cn_sub %>% select("ID", contains("S1"), contains("S3"))
soybean_cn_sub_S2S3 <- soybean_cn_sub %>% select("ID", contains("S2"), contains("S3"))
```
```

```{r, eval=TRUE, include=TRUE}
head(soybean_cn_sub_S1S2, 3)
```

```{r, eval=TRUE, include=TRUE}
head(soybean_cn_sub_S1S3, 3)
```

```{r, eval=TRUE, include=TRUE}
head(soybean_cn_sub_S2S3, 3)
```
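
With more treatment groups, writing one subset per pair by hand becomes tedious. The following is a minimal sketch, on a made-up data frame with hypothetical object names (`toy`, `group_pairs`, and `pair_data`), of how the pairwise subsets could be generated in base R with `combn()`. Matching on the full group name also avoids a pitfall of substring matching, where `contains("S1")` would also select columns of a group called "S10".

```{r, eval=TRUE, include=TRUE}
# Toy data frame in the bigPint format with three treatment groups
# (values are made up)
toy <- data.frame(
  ID = c("g1", "g2"),
  S1.1 = c(10L, 0L), S1.2 = c(12L, 1L),
  S2.1 = c(30L, 2L), S2.2 = c(28L, 0L),
  S3.1 = c(5L, 7L), S3.2 = c(6L, 9L),
  stringsAsFactors = FALSE
)

# Treatment group of each sample column: the part before the dot
sample_cols <- names(toy)[-1]
groups <- sub("\\.[0-9]+$", "", sample_cols)

# All unordered pairs of treatment groups, one pair per list element
group_pairs <- combn(unique(groups), 2, simplify = FALSE)

# One data frame per pair, keeping "ID" plus the columns of that pair
pair_data <- lapply(group_pairs, function(pair) {
  toy[, c("ID", sample_cols[groups %in% pair]), drop = FALSE]
})
names(pair_data) <- vapply(group_pairs, paste, character(1), collapse = "_")

names(pair_data)
names(pair_data$S1_S3)
```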

____________________________________________________________________________________

## Preprocessing of data object

Some popular RNA-seq analysis packages (such as [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) [@robinson2010edger], [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) [@love2014moderated], and [limma](https://bioconductor.org/packages/release/bioc/html/limma.html) [@ritchie2015limma]) advise researchers to perform certain preprocessing steps on their data before visualization, such as filtering the genes, normalizing the read counts, and standardizing the read counts. The `data` object in the `bigPint` package can be set to a dataset whether or not it has been filtered, normalized, or standardized. Researchers who wish to can also use `bigPint` plots to investigate how their dataset changes after filtering, normalization, and standardization.
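
To make these steps concrete, the following is a minimal base R sketch of filtering, transforming, and standardizing a made-up dataset while keeping the required column format. It is not the procedure used by edgeR, DESeq2, or limma: the count threshold of 10 and the log2 transform with a pseudocount are arbitrary stand-ins chosen for illustration, and the object names are hypothetical.

```{r, eval=TRUE, include=TRUE}
# Toy data frame in the bigPint format (values are made up)
toy <- data.frame(
  ID = c("g1", "g2", "g3", "g4"),
  A.1 = c(100, 0, 15, 250), A.2 = c(120, 1, 18, 240),
  B.1 = c(30, 0, 40, 260), B.2 = c(25, 0, 44, 255),
  stringsAsFactors = FALSE
)
counts <- as.matrix(toy[, -1])

# Filter: keep genes whose total count reaches an arbitrary threshold
keep <- rowSums(counts) >= 10
filtered <- counts[keep, , drop = FALSE]

# Transform: log2 with a pseudocount of 1 (a simple stand-in, not a
# library-size normalization)
logged <- log2(filtered + 1)

# Standardize: centre and scale each gene (row) to mean 0 and sd 1
standardized <- t(scale(t(logged)))

# Reassemble the required format: "ID" first, then the sample columns
toy_std <- data.frame(
  ID = toy$ID[keep],
  standardized,
  stringsAsFactors = FALSE,
  check.names = FALSE
)
rownames(toy_std) <- NULL
toy_std
```

Genes with identical values in every sample have a standard deviation of zero and would produce `NaN` after scaling, so such rows would need to be removed before the standardization step.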

____________________________________________________________________________________

## References