---
title: "Example Usage"
author:
  - name: Jonathan Carroll
    email: rpkg@jcarroll.com.au
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('DFplyr')`"
vignette: >
  %\VignetteIndexEntry{Example Usage}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
)
```

# Basics

## Install `DFplyr`

`r Biocpkg("DFplyr")` is an `R` package available via the [Bioconductor](http://bioconductor.org) repository for packages and can be installed via `BiocManager::install()`:

```{r "install", eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("DFplyr")

## Check that you have a valid Bioconductor installation
BiocManager::valid()
```

## Background

`r Biocpkg("DFplyr")` is inspired by `r CRANpkg("dplyr")`, which implements a wide variety of common data manipulations (`mutate`, `select`, `filter`) but which only operates on objects of class `data.frame` or `tibble` (from `r CRANpkg("tibble")`).

When working with `r Biocpkg("S4Vectors")` `DataFrame`s - which are frequently used as components of, for example, `r Biocpkg("SummarizedExperiment")` objects - a common workaround is to convert the `DataFrame` to a `tibble` in order to then use `r CRANpkg("dplyr")` functions to manipulate the contents, before converting back to a `DataFrame`.

This has several drawbacks: `tibble` does not support rownames (and `r CRANpkg("dplyr")` frequently does not preserve them), it does not support S4 columns (e.g. `r Biocpkg("IRanges")` vectors), and the back-and-forth conversion is required any time manipulation is desired.

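As a rough sketch of that round trip, the workaround might look like the following. The chunk is not evaluated here and assumes `r CRANpkg("tibble")` and `r CRANpkg("dplyr")` are installed; the object names `tb`, `tb4` and `d4` are purely illustrative, and only plain numeric columns are used because S4 columns would not survive the conversion.

```{r "roundtrip", eval = FALSE}
library(S4Vectors)

# A small DataFrame that carries the car names as rownames
d <- as(mtcars[, c("cyl", "hp")], "DataFrame")

# Step 1: DataFrame -> data.frame -> tibble.
# as_tibble() drops row names by default, so the car names are lost here.
tb <- tibble::as_tibble(as.data.frame(d))

# Step 2: manipulate with dplyr (keep the four-cylinder cars)
tb4 <- dplyr::filter(tb, cyl == 4)

# Step 3: convert back to a DataFrame
d4 <- as(as.data.frame(tb4), "DataFrame")

# The original labels (e.g. "Datsun 710") are no longer attached to the rows
rownames(d4)
```

Every manipulation needs all three steps, which is the overhead that working directly on the `DataFrame` avoids.
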
# Quick start to using `DFplyr`

```{r "start", message=FALSE}
library("DFplyr")
```

To begin with, we create an `r Biocpkg("S4Vectors")` `DataFrame`, including some S4 columns:

```{r "create_d", message=FALSE}
library(S4Vectors)
m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")]
d <- as(m, "DataFrame")
d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width = 10))
d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10))
d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2)))
d
```

This will appear in RStudio's environment pane as a

```
Formal class DataFrame (dplyr-compatible)
```

when using `r Biocpkg("DFplyr")`. No interference with the actual object is required, but this helps identify that `r CRANpkg("dplyr")`-compatibility is available.

`DataFrame`s can then be used in `r CRANpkg("dplyr")`-like calls the same as `data.frame` or `tibble` objects. Support for working with S4 columns is enabled provided they have appropriate functions. Adding multiple columns will result in the new columns being created in alphabetical order.
For example, adding a new column `newvar` which is the sum of the `cyl` and `hp` columns

```{r "mutate_newvar"}
mutate(d, newvar = cyl + hp)
```

or doubling the `nl` column as `nl2`

```{r "nl2"}
mutate(d, nl2 = nl * 2)
```

or calculating the length of each `nl` cell with `lengths()` as `length_nl`

```{r "length_nl"}
mutate(d, length_nl = lengths(nl))
```

Transformations can involve S4-related functions, such as extracting the `seqnames()`, `strand()`, and `end()` of the `grX` column

```{r "s4cols"}
mutate(d,
    chr = GenomeInfoDb::seqnames(grX),
    strand_X = BiocGenerics::strand(grX),
    end_X = BiocGenerics::end(grX)
)
```

The object returned remains a standard `DataFrame`, and further calls can be piped with `%>%`, in this case extracting the newly created `newvar` column

```{r "pipe"}
mutate(d, newvar = cyl + hp) %>%
    pull(newvar)
```

Some of the variants of the `dplyr` verbs also work, such as transforming the numeric columns using a quosure-style lambda function, in this case squaring them

```{r "mutate_if"}
mutate_if(d, is.numeric, ~ .^2)
```

or extracting the `start` of all of the `"GRanges"` columns

```{r "mutate_if_granges"}
mutate_if(d, ~ isa(., "GRanges"), BiocGenerics::start)
```

Use of `r CRANpkg("tidyselect")` helpers is limited to within `vars()` calls and using the `_at` variants

```{r "at_mutate"}
mutate_at(d, vars(starts_with("c")), ~ .^2)
```

and also works with other verbs

```{r "at_select"}
select_at(d, vars(starts_with("gr")))
```

Importantly, grouped operations are supported. `DataFrame` does not natively support groups (the same way that `data.frame` does not), so these are implemented specifically for `DFplyr`, with group information shown at the top of the printed output

```{r "group_by"}
group_by(d, cyl, am)
```

Other verbs are similarly implemented, and preserve row names where possible.
For example, selecting a limited set of columns using non-standard evaluation (NSE)

```{r "rownames"}
select(d, am, cyl)
```

Arranging rows according to the ordering of a column

```{r "rownames_arrange"}
arrange(d, desc(hp))
```

Filtering to only specific values appearing in a column

```{r "rownames_filter"}
filter(d, am == 0)
```

Selecting specific rows by index

```{r "rownames_slice"}
slice(d, 3:6)
```

These also work for grouped objects, and also preserve the rownames, e.g. selecting the first two rows from _each group_ of `gear`

```{r "grouped_slice"}
group_by(d, gear) %>%
    slice(1:2)
```

`rename` is itself renamed to `rename2` due to conflicts between `r CRANpkg("dplyr")` and `r Biocpkg("S4Vectors")`, but works in the `r CRANpkg("dplyr")` sense of taking `new = old` replacements with NSE syntax

```{r "rename2"}
select(d, am, cyl) %>%
    rename2(foo = am)
```

Row names are not preserved when there may be duplicates or when they don't make sense; otherwise the first label is kept (according to the current de-duplication method; in the case of `distinct`, this is via `BiocGenerics::duplicated`). This may have complications for S4 columns.

```{r "distinct"}
distinct(d)
```

Behaviours are ideally the same as those of `r CRANpkg("dplyr")` wherever possible, for example a grouped tally

```{r "group_tally"}
group_by(d, cyl, am) %>%
    tally(gear)
```

or a count with weights

```{r "count"}
count(d, gear, am, cyl)
```

## Citing `DFplyr`

We hope that `r Biocpkg("DFplyr")` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

```{r "citation"}
citation("DFplyr")
```

## Session Information

```{r reproduce3, echo=FALSE}
options(width = 120)
sessioninfo::session_info()
```