--- title: "QUBIC Tutorial" author: - Yu Zhang - Juan Xie - Qin Ma classoption: hyperref, vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{QUBIC Tutorial} output: pdf_document: fig_caption: yes pandoc_args: --listings html_document: highlight: tango theme: flatly word_document: fig_caption: yes highlight: tango md_extensions: -autolink_bare_uris header-includes: - \usepackage{xcolor} - \lstset{ backgroundcolor=\color[RGB]{248,248,248}, basewidth=0.5em, basicstyle=\small\ttfamily, breakatwhitespace=true, breakautoindent=true, breaklines=true, captionpos=b, columns=flexible, commentstyle=\color[rgb]{0.56,0.35,0.01}\itshape, escapeinside={\%*}{*)}, frame=tb, keepspaces=true, keywordstyle=\color[rgb]{0.13,0.29,0.53}\bfseries, linewidth=\textwidth, numberstyle=\footnotesize, showspaces=false, showstringspaces=false, showtabs=false, stringstyle=\color[rgb]{0.31,0.60,0.02}, tabsize=2, } bibliography: bibliography.bib --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) options(width=98) ``` Gene expression data is very important in experimental molecular biology [@brazma00], especially for cancer study [@fehrmann15]. The large-scale microarray data and RNA-seq data provide good opportunity to do the gene co-expression analyses and identify co-expressed gene modules; and the effective and efficient algorithms are needed to implement such analysis. Substantial efforts have been made in this field, such as @cheng00, Plaid [@lazzeroni02], Bayesian Biclustering [BCC, @gu08], among them Cheng and Church and Plaid has the R package implementation. It is worth noting that our in-house biclustering algorithm, QUBIC [@li09], is reviewed as one of the best programs in terms of their prediction performance on benchmark datasets. Most importantly, it is reviewed as the best one for large-scale real biological data [@eren12]. Until now, QUBIC has been cited over 120 times (via [Google Scholar](https://scholar.google.com/scholar?cites=15047943471221701463)) and its web server, QServer, was developed in 2012 to facilitate the users without comprehensive computational background [@zhou12]. In the past five years, the cost of RNA-sequencing decreased dramatically, and the amount of gene expression data keeps increasing. Upon requests from users and collaborators, we developed this R package of QUBIC to void submitting large data to a webserver. The unique features of our R package [@zhang16] include (1) updated and more stable back-end resource code (re-written by C++), which has better memory control and is more efficient than the one published in 2009. For an input dataset in *Arabidopsis*, with 25,698 genes and 208 samples, we observed more than 40% time saving; and (2) comprehensive functions and examples, including discretize function, heatmap drawing and network analysis. # How to cite ```{r, comment=NA, message=FALSE} citation("QUBIC") ``` # Other languages If R is not your thing, there is also [a C version of *QUBIC*](https://github.com/maqin2001/QUBIC). # Help If you are having trouble with this R package, contact [the maintainer, Yu Zhang](mailto:zy26@jlu.edu.cn). # Install and load Stable version from BioConductor ```r source("https://bioconductor.org/biocLite.R") biocLite("QUBIC") ``` Or development version from GitHub ```r install.packages("devtools") devtools::install_github("zy26/QUBIC") ``` Load *QUBIC* ```{r, message=FALSE} library("QUBIC") ``` Functions ====================== There are nine functions provided by QUBIC package. * *qudiscretize()* creates a discrete matrix for a given gene expression matrix; * *BCQU()* performs a qualitative biclustering for real matrix; * *BCQUD()* performs a qualitative biclustering for discretized matrix; * *quheatmap()* can draw heatmap for singe bicluster or overlapped biclusters; * *qunetwork()* can automatically create co-expression networks based on the identified biclusters by QUBIC; * *qunet2xml()* can convert the constructed co-expression networks into XGMML format for further network analysis in Cytoscape, Biomax and JNets; * *query-based biclustering* allows users to input additional biological information to guide the biclustering progress; * *bicluster expanding* expands existing biclusters under specified consistency level; * *biclusters comparison* compares biclusters obtained via different algorithms or parameters. The following examples illustrate how these functions work. # Example of a random matrix with two diferent embedded biclusters ```{r random, cahce = TRUE, message = FALSE, fig.cap = 'Heatmap for two overlapped biclusters in the simulated matrix'} library(QUBIC) set.seed(1) # Create a random matrix test <- matrix(rnorm(10000), 100, 100) colnames(test) <- paste("cond", 1:100, sep = "_") rownames(test) <- paste("gene", 1:100, sep = "_") # Discretization matrix1 <- test[1:7, 1:4] matrix1 matrix2 <- qudiscretize(matrix1) matrix2 # Fill bicluster blocks t1 <- runif(10, 0.8, 1) t2 <- runif(10, 0.8, 1) * (-1) t3 <- runif(10, 0.8, 1) * sample(c(-1, 1), 10, replace = TRUE) test[11:20, 11:20] <- t(rep(t1, 10) * rnorm(100, 3, 0.3)) test[31:40, 31:40] <- t(rep(t2, 10) * rnorm(100, 3, 0.3)) test[51:60, 51:60] <- t(rep(t3, 10) * rnorm(100, 3, 0.3)) # QUBIC res <- biclust::biclust(test, method = BCQU()) summary(res) # Show heatmap hmcols <- colorRampPalette(rev(c("#D73027", "#FC8D59", "#FEE090", "#FFFFBF", "#E0F3F8", "#91BFDB", "#4575B4")))(100) # Specify colors par(mar = c(4, 5, 3, 5) + 0.1) quheatmap(test, res, number = c(1, 3), col = hmcols, showlabel = TRUE) ``` # Example of *Saccharomyces cerevisiae* ```{r yeast, cache = TRUE} library(QUBIC) data(BicatYeast) # Discretization matrix1 <- BicatYeast[1:7, 1:4] matrix1 matrix2 <- qudiscretize(matrix1) matrix2 # QUBIC x <- BicatYeast system.time(res <- biclust::biclust(x, method = BCQU())) summary(res) ``` We can draw heatmap for single bicluster. ```{r yeast-heatmap, cache = TRUE, message = FALSE, fig.cap = 'Heatmap for the second bicluster identified in the *Saccharomyces cerevisiae* data. The bicluster consists of 53 genes and 5 conditions', fig.show = 'asis'} # Draw heatmap for the second bicluster identified in Saccharomyces cerevisiae data library(RColorBrewer) paleta <- colorRampPalette(rev(brewer.pal(11, "RdYlBu")))(11) par(mar = c(5, 4, 3, 5) + 0.1, mgp = c(0, 1, 0), cex.lab = 1.1, cex.axis = 0.5, cex.main = 1.1) quheatmap(x, res, number = 2, showlabel = TRUE, col = paleta) ``` We can draw heatmap for overlapped biclusters. ```{r yeast-heatmap2, cache = TRUE, message = FALSE, fig.cap = 'Heatmap for the second and third biclusters identified in the *Saccharomyces cerevisiae* data. Bicluster #2 (topleft) consists of 53 genes and 5 conditions, and bicluster #3 (bottom right) consists of 37 genes and 7 conditions.', fig.show = 'asis'} # Draw for the second and third biclusters identified in Saccharomyces cerevisiae data par(mar = c(5, 5, 5, 5), cex.lab = 1.1, cex.axis = 0.5, cex.main = 1.1) paleta <- colorRampPalette(rev(brewer.pal(11, "RdYlBu")))(11) quheatmap(x, res, number = c(2, 3), showlabel = TRUE, col = paleta) ``` We can draw network for single bicluster. ```{r yeast-network1, cache = TRUE, message = FALSE, fig.cap = 'Network for the second bicluster identified in the *Saccharomyces cerevisiae* data.', fig.show = 'asis'} # Construct the network for the second identified bicluster in Saccharomyces cerevisiae net <- qunetwork(x, res, number = 2, group = 2, method = "spearman") if (requireNamespace("qgraph", quietly = TRUE)) qgraph::qgraph(net[[1]], groups = net[[2]], layout = "spring", minimum = 0.6, color = cbind(rainbow(length(net[[2]]) - 1), "gray"), edge.label = FALSE) ``` We can also draw network for overlapped biclusters. ```{r yeast-network2, cache = TRUE, message = FALSE, fig.cap = 'Network for the second and third biclusters identified in the *Saccharomyces cerevisiae* data.', fig.show = 'asis'} net <- qunetwork(x, res, number = c(2, 3), group = c(2, 3), method = "spearman") if (requireNamespace("qgraph", quietly = TRUE)) qgraph::qgraph(net[[1]], groups = net[[2]], layout = "spring", minimum = 0.6, legend.cex = 0.5, color = c("red", "blue", "gold", "gray"), edge.label = FALSE) ``` ```r # Output overlapping heatmap XML, could be used in other software such # as Cytoscape, Biomax or JNets sink('tempnetworkresult.gr') qunet2xml(net, minimum = 0.6, color = cbind(rainbow(length(net[[2]]) - 1), "gray")) sink() # We can use Cytoscape, Biomax or JNets open file named 'tempnetworkresult.gr' ``` # Example of *Escherichia coli* data The *Escherichia coli* data consists of 4,297 genes and 466 conditions. ```{r ecoli, cache = TRUE} library(QUBIC) library(QUBICdata) data("ecoli", package = "QUBICdata") # Discretization matrix1 <- ecoli[1:7, 1:4] matrix1 matrix2 <- qudiscretize(matrix1) matrix2 # QUBIC res <- biclust::biclust(ecoli, method = BCQU(), r = 1, q = 0.06, c = 0.95, o = 100, f = 0.25, k = max(ncol(ecoli)%/%20, 2)) system.time(res <- biclust::biclust(ecoli, method = BCQU(), r = 1, q = 0.06, c = 0.95, o = 100, f = 0.25, k = max(ncol(ecoli)%/%20, 2))) summary(res) ``` ```{r ecoli-heatmap, cache = TRUE, message = FALSE, fig.cap = 'Heatmap for the fifth bicluster identified in the *Escherichia coli* data. The bicluster consists of 103 genes and 38 conditions', fig.show = 'asis'} # Draw heatmap for the 5th bicluster identified in Escherichia coli data library(RColorBrewer) paleta <- colorRampPalette(rev(brewer.pal(11, "RdYlBu")))(11) par(mar = c(5, 4, 3, 5) + 0.1, mgp = c(0, 1, 0), cex.lab = 1.1, cex.axis = 0.5, cex.main = 1.1) quheatmap(ecoli, res, number = 5, showlabel = TRUE, col = paleta) ``` ```{r ecoli-heatmap2, cache = TRUE, message = FALSE, fig.cap = 'Heatmap for the fourth and eighth biclusters identified in the *Escherichia coli* data. Bicluster #4 (topleft) consists of 108 genes and 44 conditions, and bicluster #8 (bottom right) consists of 26 genes and 33 conditions', fig.show = 'asis'} library(RColorBrewer) paleta <- colorRampPalette(rev(brewer.pal(11, "RdYlBu")))(11) par(mar = c(5, 4, 3, 5), cex.lab = 1.1, cex.axis = 0.5, cex.main = 1.1) quheatmap(ecoli, res, number = c(4, 8), showlabel = TRUE, col = paleta) ``` ```{r ecoli-network, cache = TRUE, message = FALSE, fig.cap = 'Network for the fifth bicluster identified in the *Escherichia coli* data.', fig.show = 'asis' } # construct the network for the 5th identified bicluster in Escherichia coli data net <- qunetwork(ecoli, res, number = 5, group = 5, method = "spearman") if (requireNamespace("qgraph", quietly = TRUE)) qgraph::qgraph(net[[1]], groups = net[[2]], layout = "spring", minimum = 0.6, color = cbind(rainbow(length(net[[2]]) - 1), "gray"), edge.label = FALSE) ``` ```{r ecoli-network2, cache = TRUE, message = FALSE, fig.cap = 'Network for the fourth and eighth biclusters identified in the *Escherichia coli* data.', fig.show = 'asis'} # construct the network for the 4th and 8th identified bicluster in Escherichia coli data net <- qunetwork(ecoli, res, number = c(4, 8), group = c(4, 8), method = "spearman") if (requireNamespace("qgraph", quietly = TRUE)) qgraph::qgraph(net[[1]], groups = net[[2]], legend.cex = 0.5, layout = "spring", minimum = 0.6, color = c("red", "blue", "gold", "gray"), edge.label = FALSE) ``` ## Query-based biclustering We can conduct a query-based biclustering by adding the weight parameter. In this example, the instance file "511145.protein.links.v10.txt" [@szklarczyk14] was downloaded from string (http://string-db.org/download/protein.links.v10/511145.protein.links.v10.txt.gz) and decompressed and saved in working directory. ```r # Here is an example to download and extract the weight library(igraph) url <- "http://string-db.org/download/protein.links.v10/511145.protein.links.v10.txt.gz" tmp <- tempfile() download.file(url, tmp) graph = read.graph(gzfile(tmp), format = "ncol") unlink(tmp) ecoli.weight <- get.adjacency(graph, attr = "weight") ``` ```{r query-based-biclustering, cache = TRUE, message = FALSE} library(QUBIC) library(QUBICdata) data("ecoli", package = "QUBICdata") data("ecoli.weight", package = "QUBICdata") res0 <- biclust(ecoli, method = BCQU(), verbose = FALSE) res0 res4 <- biclust(ecoli, method = BCQU(), weight = ecoli.weight, verbose = FALSE) res4 ``` ## Bicluster-expanding we can expand existing biclustering results to recruite more genes according to certain consistency level: ```{r bicluster-expanding, cache = TRUE} res5 <- biclust(x = ecoli, method = BCQU(), seedbicluster = res, f = 0.25, verbose = FALSE) summary(res5) ``` ## Biclusters comparison We can compare the biclustering results obtained from different algorithms, or from a same algorithm with different combinations of parameter. ```{r biclusters-comparison, cache = TRUE} test <- ecoli[1:50,] res6 <- biclust(test, method = BCQU(), verbose = FALSE) res7 <- biclust (test, method = BCCC()) res8 <- biclust(test, method = BCBimax()) showinfo (test, c(res6, res7, res8)) ``` # References