--- title: "Using SynExtend" author: "Nicholas Cooley" date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{UsingSynExtend} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} editor_options: chunk_output_type: console --- # Introduction `SynExtend` is a package of tools for working with objects of class `Synteny` built from the package `DECIPHER`'s `FindSynteny()` function. Synteny maps provide a powerful tool for quantifying and visualizing where pairs of genomes share order. Typically these maps are built from predictions of orthologous pairs, where groups of pairs that provide contiguous and sequential blocks in their respective genomes are deemed a 'syntenic block'. That designation of synteny can then used to further interrogate the predicted orthologs themselves, or query topics like genomic rearrangements or ancestor genome reconstruction. `FindSynteny` takes a different approach, finding exactly matched shared k-mers and determining where shared k-mers, or blocks of proximate shared k-mers are significant. Combining the information generated by `FindSynteny` with locations of genomic features allows us to simply mark where features are linked by syntenic k-mers. These linked features represent potential orthologous pairs, and can be easily evaluated on the basis of the k-mers that they share, or alignment. # Package Structure Currently `SynExtend` contains one set of functions for performig orthology predictions, as well as a rearrangement estimation function that is currently under construction. ## Installation 1. Install the latest version of R using [CRAN](https://cran.r-project.org/). 2. Install `SynExtend` in R by running the following commands: ```{r, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("SynExtend") ``` # Usage Using the `FindSynteny` function in `DECIPHER` builds an object of class `Synteny`. In this tutorial, a prebuilt `DECIPHER` database is used. For database construction see `?Seqs2DB` in `DECIPHER`. This example starts with a database containing three endosymbiont genomes that were chosen to keep package data to a minimum. ```{r, build_synteny_object} library(SynExtend) DBPATH <- system.file("extdata", "Endosymbionts.sqlite", package = "SynExtend") Syn <- FindSynteny(dbFile = DBPATH) ``` Synteny maps represent where genomes share order. Simply printing a synteny object to the console displays a gross level view of the data inside. Objects of class `Synteny` can also be plotted to provide clear visual representations of the data inside. The genomes used in this example are distantly related and fairly dissimilar. ```{r, synplots01} Syn pairs(Syn) ``` Data present inside objects of class `Synteny` can also be accessed relatively easily. The object itself is functionally a matrix of lists, with data describing exactly matched k-mers present in the upper triangle, and data describing blocks of chained k-mers in the lower triangle. For more information see `?FindSynteny` in the package `DECIPHER`. ```{r, synplots02} print(head(Syn[[1, 2]])) print(head(Syn[[2, 1]])) ``` The above printed objects show the data for the comparison between the first and second genome in our database. To take advantage of these synteny maps, we can then overlay the gene calls for each genome present on top of our map. Next, GFF annotations for the associated genomes are parsed to provide gene calls in a use-able format. GFFs are not the only possible source of appropriate gene calls, but they are the source that was used during package construction and testing. Parsed GFFs can be constructed with `gffToDataFrame`, for full functionality, or GFFs can be imported via `rtracklater::import()` for limited functionality. GeneCalls for both the `PairSummaries` and `NucleotideOverlap` functions must be named list, and those names must match `dimnames(Syn)[[1]]`. ```{r, generate_genecalls} # generating genecalls with local data: GC <- gffToDataFrame(GFF = system.file("extdata", "GCF_021065005.1_ASM2106500v1_genomic.gff.gz", package = "SynExtend"), Verbose = TRUE) # in an effort to be space conscious, not all original gffs are kept within this package GeneCalls <- get(data("Endosymbionts_GeneCalls", package = "SynExtend")) ``` `SynExtend`'s `gffToDataFrame` function will directly import gff files into a usable format, and includes other extracted information. ```{r, print_gene_calls} print(head(GeneCalls[[1]])) ``` Raw GFF imports are also acceptable, but prevent alignments in amino acid space with `PairSummaries()`. ```{r, show_rtracklayer, eval = FALSE} X01 <- rtracklayer::import(system.file("extdata", "GCA_000875775.1_ASM87577v1_genomic.gff.gz", package = "SynExtend")) class(X01) print(X01) ``` `SynExtend`'s primary functions provide a way to identify where pairs of genes are explicitly linked by syntenic hits, and then summarize those links. The first step is just identifying those links. ```{r, generate_initial_links} Links <- NucleotideOverlap(SyntenyObject = Syn, GeneCalls = GeneCalls, Verbose = TRUE) ``` The `Links` object generated by NucleotideOverlap is a raw representation of positions on the synteny map where shared k-mers link genes between paired genomes. As such, it is analagous in shape to objects of class `Synteny`. This raw object is unlikely to be useful to most users, but has been left exposed to ensure that this data remains accessible should a user desire to have access to it. ```{r, link_printing} class(Links) print(Links) ``` This raw data can be processed to provide a straightforward summary of predicted pairs. ```{r, describe_links} LinkedPairs1 <- PairSummaries(SyntenyLinks = Links, DBPATH = DBPATH, PIDs = FALSE, Verbose = TRUE) ``` The object `LinkedPairs1` is a data.frame where each row is populated by information about a predicted orthologous pair. By default `PairSummaries` uses a simple model to determine whether the k-mers that link a pair of genes are likely to provide an erroneous link. When set to `Model = "Global"`, is is simply a prediction of whether the involved nucleotides are likely to describe a pair of genomic features whose alignment would result in a PID that falls within a random distribution. This model is effective if somewhat permissive, but is significantly faster than performing many pairwise alignments. ```{r, describe_more_links_again} print(head(LinkedPairs1)) ``` PairSummaries includes arguments that allow for aligning all pairs that are predicted, via `PIDs = TRUE`, while `IgnoreDefaultStringSet = FALSE` indicates that alignments should be performed in nucleotide or amino acid space as is appropriate for the linked sequences. Setting `IgnoreDefaultStringSet = TRUE` will force all alignments into nucleotide space. As of SynExtend v 1.3.13, the functions `ExtractBy` and `DisjointSet` have been added to provide users with direct tools to work with `PairSummaries` objects. ```{r, pairsummariesoperations} SingleLinkageClusters <- DisjointSet(Pairs = LinkedPairs1, Verbose = TRUE) ``` ```{r, clusters} # extract the first 10 clusters Sets <- ExtractBy(x = LinkedPairs1, y = DBPATH, z = SingleLinkageClusters[1:10], Verbose = TRUE) head(Sets) ``` Session Info: ```{r} sessionInfo() ```