--- title: "Organism.dplyr" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{Organism.dplyr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction The `r Biocpkg("Organism.dplyr")` creates an on disk sqlite database to hold data of an organism combined from an 'org' package (e.g., `r Biocpkg("org.Hs.eg.db")`) and a genome coordinate functionality of the 'TxDb' package (e.g., `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")`). It aims to provide an integrated presentation of identifiers and genomic coordinates. And a _src_organism_ object is created to point to the database. The _src_organism_ object is created as an extension of _src_sql_ and _src_sqlite_ from [dplyr][], which inherited all [dplyr][] methods. It also implements the `select()` interface from `r Biocpkg("AnnotationDbi")` and genomic coordinates extractors from `r Biocpkg("GenomicFeatures")`. # Constructing a _src_organism_ ## Make sqlite datebase from 'TxDb' package The `src_organism()` constructor creates an on disk sqlite database file with data from a given 'TxDb' package and corresponding 'org' package. When dbpath is given, file is created at the given path, otherwise temporary file is created. ```{r, echo=FALSE} suppressPackageStartupMessages({ library(Organism.dplyr) library(GenomicRanges) library(ggplot2) }) ``` ```{r, eval=FALSE} library(Organism.dplyr) ``` Running `src_organism()` without a given path will save the sqlite file to a `tempdir()`: ```{r, eval=FALSE} src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene") ``` Alternatively you can provide explicit path to where the sqlite file should be saved, and re-use the data base at a later date. ```{r, eval=FALSE} path <- "path/to/my.sqlite" src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene", path) ``` `supportedOrganisms()` provides a list of organisms with corresponding 'org' and 'TxDb' packages being supported. ```{r} supportedOrganisms() ``` ## Make sqlite datebase from organism name Organism name, genome and id could be specified to create sqlite database. Organism name (either Organism or common name) must be provided to create the database, if genome and/or id are not provided, most recent 'TxDb' package is used. ```{r, eval=FALSE} src <- src_ucsc("human", path) ``` ## Access existing sqlite file An existing on-disk sqlite file can be accessed without recreating the database. A version of the database created with [TxDb.Hsapiens.UCSC.hg38.knownGene][], with just 50 Entrez gene identifiers, is distributed with the Organism.dplyr package ```{r} src <- src_organism(dbpath = hg38light()) src ``` # The "dplyr" interface All methods from package [dplyr][] can be used for a _src_organism_ object. Look at all available tables. ```{r} src_tbls(src) ``` Look at data from one specific table. ```{r} tbl(src, "id") ``` Look at fields of one table. ```{r} colnames(tbl(src, "id")) ``` Below are some examples of querying tables using dplyr. 1. Gene symbol starting with "SNORD" (the notation `SNORD%` is from SQL, with `%` representing a wild-card match to any string) ```{r} tbl(src, "id") %>% filter(symbol %like% "SNORD%") %>% dplyr::select(entrez, map, ensembl, symbol) %>% distinct() %>% arrange(symbol) %>% collect() ``` 2. Gene ontology (GO) info for gene symbol "ADA" ```{r} inner_join(tbl(src, "id"), tbl(src, "id_go")) %>% filter(symbol == "ADA") %>% dplyr::select(entrez, ensembl, symbol, go, evidence, ontology) ``` 3. Gene transcript counts per gene symbol ```{r} txcount <- inner_join(tbl(src, "id"), tbl(src, "ranges_tx")) %>% dplyr::select(symbol, tx_id) %>% group_by(symbol) %>% summarize(count = n()) %>% dplyr::select(symbol, count) %>% arrange(desc(count)) %>% collect(n=Inf) txcount library(ggplot2) ggplot(txcount, aes(x = symbol, y = count)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("Transcript count") + labs(x = "Symbol") + labs(y = "Count") ``` 4. Gene coordinates of symbol "ADA" and "NAT2" as _GRanges_ ```{r} inner_join(tbl(src, "id"), tbl(src, "ranges_gene")) %>% filter(symbol %in% c("ADA", "NAT2")) %>% dplyr::select(gene_chrom, gene_start, gene_end, gene_strand, symbol, map) %>% collect() %>% GenomicRanges::GRanges() ``` # The "select" interface Methods `select()`, `keytypes()`, `keys()`, `columns()` and `mapIds` from `r Biocpkg("AnnotationDbi")` are implemented for _src\_organism_ objects. Use `keytypes()` to discover which keytypes can be passed to keytype argument of methods `select()` or `keys()`. ```{r} keytypes(src) ``` Use `columns()` to discover which kinds of data can be returned for the _src_organism_ object. ```{r} columns(src) ``` `keys()` returns keys for the _src\_organism_ object. By default it returns the primary keys for the database, and returns the keys from that keytype when the keytype argument is used. Keys of entrez ```{r} head(keys(src)) ``` Keys of symbol ```{r} head(keys(src, "symbol")) ``` `select()` retrieves the data as a _tibble_ based on parameters for selected keys columns and keytype arguments. If requested columns that have multiple matches for the keys, `select_tbl()` will return a _tibble_ with one row for each possible match, and `select()` will return a data frame. ```{r} keytype <- "symbol" keys <- c("ADA", "NAT2") columns <- c("entrez", "tx_id", "tx_name","exon_id") select_tbl(src, keys, columns, keytype) ``` `mapIds()` gets the mapped ids (column) for a set of keys that are of a particular keytype. Usually returned as a named character vector. ```{r} mapIds(src, keys, column = "tx_name", keytype) mapIds(src, keys, column = "tx_name", keytype, multiVals="CharacterList") ``` # Genomic Coordinates Extractor Interfaces Eleven genomic coordinates extractor methods are available in this package: `transcripts()`, `exons()`, `cds()`, `genes()`, `promoters()`, `transcriptsBy()`, `exonsBy()`, `cdsBy()`, `intronsByTranscript()`, `fiveUTRsByTranscript()`, `threeUTRsByTranscript()`. Data can be returned in two versions, for instance _tibble_ (`transcripts_tbl()`) and _GRanges_ or _GRangesList_ (`transcripts()`). Filters can be applied to all extractor functions. The output can be resctricted by an `AnnotationFilter`, an `AnnotationFilterList`, or an expression that can be tranlated into an `AnnotationFilterList`. Valid filters can be retrieved by `supportedFilters(src)`. ```{r} supportedFilters(src) ``` All filters take two parameters: value and condition, condition could be one of "==", "!=", "startsWith", "endsWith", ">", "<", ">=" and "<=", default condition is "==". ```{r} EnsemblFilter("ENSG00000196839") SymbolFilter("SNORD", "startsWith") ``` The following illustrates several ways of inputting filters to an extractor function. ```{r} smbl <- SymbolFilter("SNORD", "startsWith") transcripts_tbl(src, filter=smbl) filter <- AnnotationFilterList(smbl) transcripts_tbl(src, filter=filter) transcripts_tbl(src, filter=~smbl) ``` A `GRangesFilter()` can also be used as a filter for the methods with result displaying as _GRanges_ or _GRangesList_. ```{r} gr <- GRangesFilter(GenomicRanges::GRanges("chr15:25062333-25065121")) transcripts(src, filter=~smbl & gr) ``` Filters in extractor functions support `&`, `|`, `!`(negation), and `()`(grouping). Transcript coordinates of gene symbol equal to "ADA" and transcript start position less than 44619810. ```{r} transcripts_tbl(src, filter = AnnotationFilterList( SymbolFilter("ADA"), TxStartFilter(44619810,"<"), logicOp="&") ) ## Equivalent to transcripts_tbl(src, filter = ~symbol == "ADA" & tx_start < 44619810) ``` Transcripts coordinates of gene symbol equal to "ADA" or transcript end position equal to 243843236. ```{r} txend <- TxEndFilter(243843236, '==') transcripts_tbl(src, filter = ~symbol == "ADA" | txend) ``` Using negation to find transcript coordinates of gene symbol equal to "ADA" and transcript start positions NOT less than 44619810. ```{r} transcripts_tbl(src, filter = ~symbol == "ADA" & !tx_start < 44618910) ``` Using grouping to find transcript coordinates of a long filter statement. ```{r} transcripts_tbl(src, filter = ~(symbol == 'ADA' & !(tx_start >= 44619810 | tx_end < 44651742)) | (smbl & !tx_end > 25056954) ) ``` ```{r} sessionInfo() ``` [dplyr]: https://CRAN.R-project.org/package=dplyr [TxDb.Hsapiens.UCSC.hg38.knownGene]: https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg38.knownGene