--- title: "SQLDataFrame: Lazy representation of SQL database in DataFrame metaphor" author: - name: Qian Liu affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY date: "last compiled: `r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_float: true package: SQLDataFrame vignette: > %\VignetteIndexEntry{SQLDataFrame: Lazy representation of SQL database in DataFrame metaphor} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` last edit: 10/7/2019 # Introduction SQL database are very commonly used in the storage of very large genomic data resources. Many useful tools, such as [DBI][], [dbplyr][] have provided convenient interfaces for R users to check and manipulate the data. These tools represent the SQL tables in tidy formats and support lazy and quick aggregation operations (e.g, `*_join`, `union`, etc.) for tables from same resources. Cross database aggregation is also supported when opted (using `copy=TRUE`) but become very expensive due to the internal copying process of a whole table into the other connection. Use of advanced functions often involves specialized SQL knowledge which brings challenges for common _R_ users. The interoperability of existing bioinformatics tools are suboptimal, e.g., the [SummarizedExperiment][] container for representation of sequencing or genotyping experiments that many modern bioinformatics pipelines are based. The SQLDataFrame package was developed using familiar DataFrame-like paradigm and lazily represents the very large dataset from different SQL databases, such as SQLite and MySQL. The DataFrame-like interface provides familiarity for common _R_ users in easy data manipulations such as square bracket subsetting, rbinding, etc. For modern _R_ users, it also recognizes the `tidy` data analysis and [dplyr][] grammar by supporting `%>%`, `select`, `filter`, `mutate`, etc. More importantly, database type-specific strategies were implemented in SQLDataFrame to efficiently handle the cross-database operations without incurring any internally expensive processes (especially for database with write permission). Some previously difficult data operations are made quick and easy in _R_, such as cross-database ID matching and conversion, variant annotation extraction, etc. The scalability and interoperability of SQLDataFrame are expected to significantly promote the handling of very large genomic data resources and facilitating the overall bioinformatics analysis. Currently SQLDataFrame supports the DBI backend of SQLite, MySQL and Google BigQuery, which are most commonly used SQL-based databases. In the future or upon feature request, we would implement this package so that users could choose to use different database backend for `SQLDataFrame` representation. Here is a list of commonly used backends (bolded are already supported!): - **RSQLite**: embeds a SQLite database. (now used as default) - **RMySQL**: connects to MySQL and MariaDB - **bigrquery** connects to Google’s BigQuery. - _RPostgreSQL_: connects to Postgres and Redshift. - _odbc_ connects to many commercial databases via the open database connectivity protocol. [DBI]: https://CRAN.R-project.org/package=DBI [dbplyr]: https://CRAN.R-project.org/package=dbplyr [dplyr]: https://CRAN.R-project.org/package=dplyr [SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment/ # Installation 1. Download the package. ```{r getPackage, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SQLDataFrame") ``` The development version is also available to download from Github. ```{r getDevel, eval=FALSE} BiocManager::install("Liubuntu/SQLDataFrame") ``` 2. Load the package(s) into R session. ```{r Load, message=FALSE, eval = TRUE} library(SQLDataFrame) library(DBI) ``` # `SQLDataFrame` class ## `SQLDataFrame` constructor There are two ways to construct a `SQLDataFrame` object: 1) Provide an argument of `conn`, `dbtable` and `dbkey`. `conn` is a valid DBIConnection from SQLite or MySQL; `dbtable` specifies the database table name that is going to be represented as `SQLDataFrame` object. If only one table is available in the specified database name, this argument could be left blank. The `dbkey` argument is used to specify the column name in the table which could uniquely identify all the data observations (rows). 2) Provide `dbtable`, `dbkey` as specified above, and credentials to build valid DBIConnections. for SQLite, the credential argument includes `dbname`. For MySQL, the credential arguments are `host`, `user`, `password`. Additional to the credentials, users must provide the `type` argument to specify the SQL database type. Supported types are "SQLite" and "MySQL". If not specified, "SQLite" is used by default. Supported database tables could be on-disk or remote on the web or cloud. ```{r constructor_conn} dbfile <- system.file("extdata/test.db", package = "SQLDataFrame") conn <- DBI::dbConnect(DBI::dbDriver("SQLite"), dbname = dbfile) obj <- SQLDataFrame(conn = conn, dbtable = "state", dbkey = "state") ``` construction from database credentials: ```{r constructor_credentials} obj1 <- SQLDataFrame(dbname = dbfile, type = "SQLite", dbtable = "state", dbkey = "state") all.equal(obj, obj1) ``` Note that after reading the database table into `SQLDataFrame`, the key columns will be kept as fixed columns showing on the left hand side, with `|` separating key column(s) with the other columns. The `ncol`, `colnames`, and corresponding column subsetting will only apply to the non-key-columns. ```{r} obj dim(obj) colnames(obj) ``` ## Slot & accessors To make the `SQLDataFrame` object as light and compact as possible, there are only 5 slots contained in the object: `tblData`, `dbkey`, `dbnrows`, `dbconcatKey`, `indexes`. Metadata information could be returned through these 5 slots using slot accessors or other utility functions. ```{r} slotNames(obj) dbtable(obj) dbkey(obj) connSQLDataFrame(obj) ``` Besides, many useful common methods are defined on `SQLDataFrame` object to make it a more DataFrame-like data structure. e.g., we can use `dimnames()` to return the row/colnames of the data. It returns an unnamed list, with the first element being rownames which is always `NULL`, and 2nd element being colnames (could also use `colnames()` method). `dim()` method is defined to return the dimension of the database table, which enables the `nrow()/ncol()` to extract a specific dimension. `length()` method is also defined which works same as `ncol()`. Note that the `rownames(SQLDataFrame)` would always be `NULL` as rownames are not supported in `SQLDataFrame`. However, `ROWNAMES(obj)` was implemented for the `[` subsetting with characters. ```{r methods} dim(obj) dimnames(obj) length(obj) ROWNAMES(obj) ``` **NOTE** that the `dbtable()` accessor only works for a `SQLDataFrame` object that the lazy tbl carried in `tblData` slot corresponds to a single database. If the `SQLDataFrame` was generated from `rbind`, `union` or `*_join`, call `saveSQLDataFrame()` to save the lazy tbl to disk so that `dbtable()` will be activated. ```{r} dbtable(obj) aa <- rbind(obj[1:5, ], obj[6:10, ]) aa dbtable(aa) ## message bb <- saveSQLDataFrame(aa, dbname = tempfile(fileext=".db"), dbtable = "aa", overwrite = TRUE) connSQLDataFrame(bb) dbtable(bb) ``` # makeSQLDataFrame We could also construct a `SQLDataFrame` object directly from a file name. The `makeSQLDataFrame` function takes input of character value of file name for common text files (.csv, .txt, etc.), write into database tables, and open as `SQLDataFrame` object. Users could provide values for the `dbname` and `dbtable` argument. If NULL, default value for `dbname` would be a temporary database file, and `dbtable` would be the `basename(filename)` without extension. **NOTE** that the input file must have one or multiple columns that could uniquely identify each observation (row) to be used the `dbkey()` for `SQLDataFrame`. Also the file must be rectangular, i.e., rownames are not accepted. But users could save rownames as a separate column. ```{r} mtc <- tibble::rownames_to_column(mtcars)[,1:6] filename <- file.path(tempdir(), "mtc.csv") write.csv(mtc, file= filename, row.names = FALSE) aa <- makeSQLDataFrame(filename, dbkey = "rowname", sep = ",", overwrite = TRUE) aa connSQLDataFrame(aa) dbtable(aa) ``` # saveSQLDataFrame With all the methods (`[` subsetting, `rbind`, `*_join`, etc.,) provided in the next section, the `SQLDataFrame` always work like a lazy representation until users explicitly call the `saveSQLDataFrame` function for realization. `saveSQLDataFrame` write the lazy tbl carried in `tblData` slot into an on-disk database table, and re-open the `SQLDataFrame` object from the new path. It's also recommended that users call `saveSQLDataFrame` frequently to avoid too many lazy layers which slows down the data processing. ```{r} connSQLDataFrame(obj) dbtable(obj) obj1 <- saveSQLDataFrame(obj, dbname = tempfile(fileext = ".db"), dbtable = "obj_copy") connSQLDataFrame(obj1) dbtable(obj1) ``` # SQLDataFrame methods ## `[[` subsetting `[[,SQLDataFrame` Behaves similarly to `[[,DataFrame` and returns a realized vector of values from a single column. `$,SQLDataFrame` is also defined to conveniently extract column values. ```{r} head(obj[[1]]) head(obj[["region"]]) head(obj$size) ``` We can also get the key column values using character extraction. ```{r} head(obj[["state"]]) ``` ## `[` subsetting `SQLDataFrame` instances can be subsetted in a similar way of `DataFrame` following the usual _R_ conventions, with numeric, character or logical vectors; logical vectors are recycled to the appropriate length. **NOTE**, use `drop=FALSE` explicitly for single column subsetting if you want to return a `SQLDataFrame` object, otherwise, the default `drop=TRUE` would always return a realized value for that column. ```{r, subsetting} obj[1:3, 1:2] obj[c(TRUE, FALSE), c(TRUE, FALSE), drop=FALSE] obj[1:3, "population", drop=FALSE] obj[, "population"] ## realized column value ``` Subsetting with character vector works for the `SQLDataFrame` objects. With composite keys, users need to concatenate the key values by `:` for row subsetting (See the vignette for internal implementation for more details). ```{r} rnms <- ROWNAMES(obj) obj[c("Alabama", "Colorado"), ] ``` ```{r} obj1 <- SQLDataFrame(conn = conn, dbtable = "state", dbkey = c("region", "population")) rnms <- ROWNAMES(obj1) obj1[c("South:3615.0", "West:365.0"), ] ``` List style subsetting is also allowed to extract certain columns from the `SQLDataFrame` object which returns `SQLDataFrame` by default. ```{r} obj[1] obj["region"] ``` ## filter & mutate We have also enabled the S3 methods of `filter` and `mutate` from `dplyr` package, so that users could have the convenience in filtering data observations and adding new columns. ```{r} obj1 %>% filter(division == "South Atlantic" & size == "medium") ``` ```{r} obj1 %>% mutate(p1 = population/10, s1 = size) ``` ## union & rbind To be consistent with `DataFrame`, `union` and `rbind` methods were implemented for `SQLDataFrame`, where `union` returns the `SQLDataFrame` sorted by the `dbkey(obj)`, and `rbind` keeps the original orders of input objects. ```{r} dbfile1 <- system.file("extdata/test.db", package = "SQLDataFrame") con1 <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile1) dbfile2 <- system.file("extdata/test1.db", package = "SQLDataFrame") con2 <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile2) ss1 <- SQLDataFrame(conn = con1, dbtable = "state", dbkey = c("state")) ss2 <- SQLDataFrame(conn = con2, dbtable = "state1", dbkey = c("state")) ss11 <- ss1[sample(5), ] ss21 <- ss2[sample(10, 5), ] ``` - union ```{r, eval=FALSE} obj1 <- union(ss11, ss21) obj1 ## reordered by the "dbkey()" ``` - rbind ```{r} obj2 <- rbind(ss11, ss21) obj2 ## keeping the original order by updating the row index ``` ## *_join methods The `*_join` family methods was implemented for `SQLDataFrame` objects, including the `left_join`, `inner_join`, `semi_join` and `anti_join`, which provides the capability of merging database files from different sources. ```{r} ss12 <- ss1[1:10, 1:2] ss22 <- ss2[6:15, 3:4] left_join(ss12, ss22) inner_join(ss12, ss22) semi_join(ss12, ss22) anti_join(ss12, ss22) ``` # Support of MySQL database data SQLDataFrame now supports the MySQL database tables through [RMySQL][], for local MySQL servers, or remote ones on the web or cloud. The SQLDataFrame construction, `*_join` functions, `union`, `rbind`, and saving are all supported. Aggregation operations are supported for same or cross MySQL databases. Details please see the function documentations. Here I'll show a simple use case for MySQL tables from ensembl. [RMySQL]: https://CRAN.R-project.org/package=RMySQL ```{r, mysql} library(RMySQL) ensbConn <- dbConnect(dbDriver("MySQL"), host="genome-mysql.soe.ucsc.edu", user = "genome", dbname = "xenTro9") enssdf <- SQLDataFrame(conn = ensbConn, dbtable = "xenoRefGene", dbkey = c("name", "txStart")) enssdf1 <- enssdf[1:20, 1:2] enssdf2 <- enssdf[11:30,3:4] res <- left_join(enssdf1, enssdf2) ``` # Support of Google BigQuery SQLDataFrame has just added support for Google BigQuery tables. Construction and queries using `[` and `filter` are supported! "Authentication and authorization" will be needed when using [bigrquery][]. Check [here](https://github.com/r-dbi/bigrquery) for more details. Also note that, the support of BigQuery tables has implemented specialized strategy for efficient data representation. The `dbkey()` is assigned by default as `SurrogateKey`, and `dbkey` argument will be ignored during construction. [bigrquery]: https://CRAN.R-project.org/package=bigrquery ```{r bigquery, eval=FALSE} library(bigrquery) bigrquery::bq_auth() ## use this to authorize bigrquery in the ## browser. bqConn <- DBI::dbConnect(dbDriver("bigquery"), project = "bigquery-public-data", dataset = "human_variant_annotation", billing = "") ## if not previous provided ## authorization, must specify a ## project name that was already ## linked with Google Cloud with ## billing info. sdf <- SQLDataFrame(conn = bqConn, dbtable = "ncbi_clinvar_hg38_20180701") sdf[1:5, 1:5] sdf %>% filter(GENEINFO == "PYGL:5836") sdf %>% filter(reference_name == "21") ``` # SessionInfo() ```{r} sessionInfo() ```