systemPipeTools 1.14.0
systemPipeR
systemPipeTools package extends the widely used systemPipeR (SPR) (H Backman and Girke 2016) workflow environment with enhanced toolkit for data visualization, including utilities to automate the analysis of differentially expressed genes (DEGs). systemPipeTools provides functions for data transformation and data exploration via scatterplots, hierarchical clustering heatMaps, principal component analysis, multidimensional scaling, generalized principal components, t-Distributed Stochastic Neighbor embedding (t-SNE), and MA and volcano plots. All these utilities can be integrated with the modular design of the systemPipeR environment that allows users to easily substitute any of these features and/or custom with alternatives.
systemPipeTools
environment can be installed from the R console using the BiocManager::install
command. The associated package systemPipeR
can be installed the same way. The latter provides core functionalities for
defining workflows, interacting with command-line software, and executing both
R and/or command-line software, as well as generating publication-quality
analysis reports.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("systemPipeTools")
BiocManager::install("systemPipeR")
library("systemPipeTools") # Loads the package
library(help = "systemPipeTools") # Lists package info
vignette("systemPipeTools") # Opens vignette
The first step is importing the targets
file and raw reads counting table.
- The targets
file defines all FASTQ files and sample comparisons of the
analysis workflow.
- The raw reads counting table represents all the reads that map to gene
(row) for each sample (columns).
## Targets file
targetspath <- system.file("extdata", "targets.txt", package = "systemPipeR")
targets <- read.delim(targetspath, comment = "#")
cmp <- systemPipeR::readComp(file = targetspath, format = "matrix", delim = "-")
## Count table file
countMatrixPath <- system.file("extdata", "countDFeByg.xls", package = "systemPipeR")
countMatrix <- read.delim(countMatrixPath, row.names = 1)
showDT(countMatrix)
For gene differential expression, raw counts are required, however for data
visualization or clustering, it can be useful to work with transformed count
data.
exploreDDS
function is convenience wrapper to transform raw read counts
using the DESeq2
package transformations methods. The input
file has to contain all the genes, not just differentially expressed ones.
Supported methods include variance stabilizing transformation (vst
)
(Anders and Huber (2010)), and regularized-logarithm transformation or rlog
(Love, Huber, and Anders (2014)).
exploredds <- exploreDDS(countMatrix, targets, cmp = cmp[[1]], preFilter = NULL,
transformationMethod = "rlog")
exploredds
## class: DESeqTransform
## dim: 116 18
## metadata(1): version
## assays(1): ''
## rownames(116): AT1G01010 AT1G01020 ... ATMG00180 ATMG00200
## rowData names(51): baseMean baseVar ... dispFit rlogIntercept
## colnames(18): M1A M1B ... V12A V12B
## colData names(2): condition sizeFactor
Users are strongly encouraged to consult the DESeq2
vignette
for more detailed information on this topic and how to properly run DESeq2
on
datasets with more complex experimental designs.
To decide which transformation to choose, we can visualize the transformation effect comparing two samples or a grid of all samples, as follows:
exploreDDSplot(countMatrix, targets, cmp = cmp[[1]], preFilter = NULL, samples = c("M12A",
"M12A", "A12A", "A12A"), scattermatrix = TRUE)
The scatterplots are created using the log2 transform normalized reads count,
variance stabilizing transformation (VST) (Anders and Huber (2010)), and
regularized-logarithm transformation or rlog
(Love, Huber, and Anders (2014)).
The following computes the sample-wise correlation coefficients using the
stats::cor()
function from the transformed expression values. After
transformation to a distance matrix, hierarchical clustering is performed with
the stats::hclust
function and the result is plotted as a dendrogram,
as follows:
hclustplot(exploredds, method = "spearman")
The function provides the utility to save the plot automatically.
This function performs hierarchical clustering on the transformed expression
matrix generated within the DESeq2
package. It uses, by default, a Pearson
correlation-based distance measure and complete linkage for cluster join.
If samples
selected in the clust
argument, it will be applied the
stats::dist()
function to the transformed count matrix to get sample-to-sample
distances. Also, it is possible to generate the pheatmap
or plotly
plot format.
## Samples plot
heatMaplot(exploredds, clust = "samples", plotly = FALSE)
If ind
selected in the clust
argument, it is necessary to provide the list
of differentially expressed genes for the exploredds
subset.
## Individuals genes identified in DEG analysis DEG analysis with `systemPipeR`
degseqDF <- systemPipeR::run_DESeq2(countDF = countMatrix, targets = targets, cmp = cmp[[1]],
independent = FALSE)
DEG_list <- systemPipeR::filterDEGs(degDF = degseqDF, filter = c(Fold = 2, FDR = 10))
### Plot
heatMaplot(exploredds, clust = "ind", DEGlist = unique(as.character(unlist(DEG_list[[1]]))))
The function provides the utility to save the plot automatically.
This function plots a Principal Component Analysis (PCA) from transformed expression matrix. This plot shows samples variation based on the expression values and identifies batch effects.
PCAplot(exploredds, plotly = FALSE)
## using ntop=500 top features by variance
The function provides the utility to save the plot automatically.
MDSplot
This function computes and plots multidimensional scaling analysis for dimension
reduction of count expression matrix. Internally, it is applied the
stats::dist()
function to the transformed count matrix to get sample-to-sample
distances.
MDSplot(exploredds, plotly = FALSE)
The function provides the utility to save the plot automatically.
GLMplot
This function computes and plots generalized principal components analysis for dimension reduction of count expression matrix.
exploredds_r <- exploreDDS(countMatrix, targets, cmp = cmp[[1]], preFilter = NULL,
transformationMethod = "raw")
GLMplot(exploredds_r, plotly = FALSE)
The function provides the utility to save the plot automatically.
This function plots log2 fold changes (y-axis) versus the mean of normalized counts (on the x-axis). Statistically significant features are colored.
MAplot(degseqDF, comparison = "M12-A12", filter = c(Fold = 1, FDR = 20), genes = "ATCG00280")
The function provides the utility to save the plot automatically.
tSNEplot
This function computes and plots t-Distributed Stochastic Neighbor embedding
(t-SNE) analysis for unsupervised nonlinear dimensionality reduction of count
expression matrix. Internally, it is applied the Rtsne::Rtsne()
(Krijthe 2015)
function, using the exact
t-SNE computing with theta=0.0
.
tSNEplot(countMatrix, targets, perplexity = 5)
A simple function that shows statistical significance (p-value
) versus
magnitude of change (log2 fold change
).
volcanoplot(degseqDF, comparison = "M12-A12", filter = c(Fold = 1, FDR = 20), genes = "ATCG00280")
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] systemPipeR_2.12.0 ShortRead_1.64.0
## [3] GenomicAlignments_1.42.0 SummarizedExperiment_1.36.0
## [5] Biobase_2.66.0 MatrixGenerics_1.18.0
## [7] matrixStats_1.4.1 BiocParallel_1.40.0
## [9] Rsamtools_2.22.0 Biostrings_2.74.0
## [11] XVector_0.46.0 GenomicRanges_1.58.0
## [13] GenomeInfoDb_1.42.0 IRanges_2.40.0
## [15] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [17] systemPipeTools_1.14.0 BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 deldir_2.0-4 formatR_1.14
## [4] rlang_1.1.4 magrittr_2.0.3 compiler_4.4.1
## [7] png_0.1-8 vctrs_0.6.5 stringr_1.5.1
## [10] pwalign_1.2.0 pkgconfig_2.0.3 crayon_1.5.3
## [13] fastmap_1.2.0 magick_2.8.5 labeling_0.4.3
## [16] utf8_1.2.4 rmarkdown_2.28 UCSC.utils_1.2.0
## [19] tinytex_0.53 purrr_1.0.2 xfun_0.48
## [22] zlibbioc_1.52.0 cachem_1.1.0 aplot_0.2.3
## [25] jsonlite_1.8.9 highr_0.11 DelayedArray_0.32.0
## [28] jpeg_0.1-10 parallel_4.4.1 R6_2.5.1
## [31] stringi_1.8.4 bslib_0.8.0 RColorBrewer_1.1-3
## [34] GGally_2.2.1 jquerylib_0.1.4 Rcpp_1.0.13
## [37] bookdown_0.41 knitr_1.48 glmpca_0.2.0
## [40] Matrix_1.7-1 tidyselect_1.2.1 abind_1.4-8
## [43] yaml_2.3.10 codetools_0.2-20 hwriter_1.3.2.1
## [46] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
## [49] withr_3.0.2 treeio_1.30.0 evaluate_1.0.1
## [52] Rtsne_0.17 gridGraphics_0.5-1 ggstats_0.7.0
## [55] pillar_1.9.0 BiocManager_1.30.25 ggtree_3.14.0
## [58] DT_0.33 ggfun_0.1.7 plotly_4.10.4
## [61] generics_0.1.3 ggplot2_3.5.1 munsell_0.5.1
## [64] scales_1.3.0 tidytree_0.4.6 glue_1.8.0
## [67] pheatmap_1.0.12 lazyeval_0.2.2 tools_4.4.1
## [70] interp_1.1-6 data.table_1.16.2 locfit_1.5-9.10
## [73] fs_1.6.4 grid_4.4.1 tidyr_1.3.1
## [76] ape_5.8 crosstalk_1.2.1 latticeExtra_0.6-30
## [79] colorspace_2.1-1 nlme_3.1-166 GenomeInfoDbData_1.2.13
## [82] patchwork_1.3.0 cli_3.6.3 fansi_1.0.6
## [85] S4Arrays_1.6.0 viridisLite_0.4.2 dplyr_1.1.4
## [88] gtable_0.3.6 yulab.utils_0.1.7 DESeq2_1.46.0
## [91] sass_0.4.9 digest_0.6.37 SparseArray_1.6.0
## [94] ggrepel_0.9.6 ggplotify_0.1.2 htmlwidgets_1.6.4
## [97] farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4
## [100] httr_1.4.7 MASS_7.3-61
This project is funded by NSF award ABI-1661152.
Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biol. 11 (10): R106.
H Backman, Tyler W, and Thomas Girke. 2016. “systemPipeR: NGS workflow and report generation environment.” BMC Bioinformatics 17 (1): 388. https://doi.org/10.1186/s12859-016-1241-0.
Krijthe, Jesse H. 2015. Rtsne: T-Distributed Stochastic Neighbor Embedding Using Barnes-Hut Implementation. https://github.com/jkrijthe/Rtsne.
Love, Michael, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Genome Biol. 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.