--- title: "An introduction to sitePath" author: "Chengyang Ji" package: sitePath output: BiocStyle::html_document: toc_float: false abstract: > Detection of site fixation in molecular evolution. vignette: > %\VignetteIndexEntry{An introduction to sitePath} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction The `sitePath` package does hierarchical search for fixation events given multiple sequence alignment and phylogenetic tree. These fixation events can be specific to a phylogenetic lineages or shared by multiple lineages. This is achieved by three major steps: 1. Import tree and sequence alignment 2. Resolve phylogenetic lineages 3. Hierarchical search for fixation and parallel mutations # Import data There're various R packages for parsing phylogenetic tree and multiple sequence alignment files. For now, `sitepath` accepts `phylo` object and `alignment` object. Functions from `ggtree` and `seqinr` are able to handle most file formats. ## Parse phylogenetic tree The S3 phylo class is a common data structure for phylogenetic analysis in R. The CRAN package [ape](https://cran.r-project.org/web/packages/ape/index.html) provides basic parsing function for reading tree files. The Bioconductor package [ggtree](https://bioconductor.org/packages/release/bioc/html/ggtree.html) provides more comprehensive parsing utilities. ```{r import_tree, message=FALSE} library(sitePath) tree_file <- system.file("extdata", "ZIKV.newick", package = "sitePath") tree <- read.tree(tree_file) ``` It is highly recommended that the file stores a rooted tree as R would consider the tree is rooted by default and re-rooting the tree in R is difficult. Also, we expect the tree to have no super long branches. A bad example is shown below: ```{r bad_tree_example, echo=FALSE} bad_tree <- read.tree(system.file("extdata", "WNV.newick", package = "sitePath")) ggtree::ggtree(bad_tree) ``` ## Parse and match sequence alignment Most multiple sequence alignment format can be parsed by [seqinr](https://cran.r-project.org/web/packages/seqinr/index.html). There is a wrapper function for parsing and adding the sequence alignment. ```{r add_alignment, message=FALSE} alignment_file <- system.file("extdata", "ZIKV.fasta", package = "sitePath") tree <- addMSA(tree, alignment_file, "fasta") ``` # Phylogenetic lineages The names in tree and alignment must be matched. We exploit polymorphism of each site to find the major branches. Before finding putative phylogenetic lineages, there involves a few more steps to evaluate the impact of threshold on result. ## The impact of threshold on resolving lineages In the current version, the resolving function only takes sequence similarity as one single threshold. The impact of threshold depends on the tree topology hence there is no universal choice. The function `sneakPeak` samples thresholds and calculates the resulting number of paths. *The use of this function can be of great help in choosing the threshold.* ```{r sneakPeek_plot} preassessment <- sneakPeek(tree, makePlot = TRUE) ``` ## Choose a threshold Use the return of the function `lineagePath` for downstream analysis. The choice of the threshold really depends. You can use the result from `sneakPeak` as a reference for threshold choosing. Here 0.05 is used as an example. ```{r get_lineagePath} paths <- lineagePath(preassessment, similarity = 0.05) paths ``` You can visualize the result. ```{r plot_paths} plot(paths) ``` # Mutation detection Now you're ready to find fixation and parallel mutations. ## Entropy minimization The `sitesMinEntropy` function perform entropy minimization on every site for each lineage. The fixation and parallel mutations can be derived from the function's return value. ```{r min_entropy} minEntropy <- sitesMinEntropy(paths) ``` ## Fixation mutations The hierarchical search is done by `fixationSites` function. The function detects the site with fixation mutation. ```{r find_fixations} fixations <- fixationSites(minEntropy) fixations ``` To get the position of all the resulting sites, `allSitesName` can be used on the return of `fixationSites` and also other functions like `SNPsites` and `parallelSites`. ```{r} allSites <- allSitesName(fixations) allSites ``` If you want to retrieve the result of a single site, you can pass the result of `fixationSites` and the site index to `extractSite` function. The output is a `sitePath` object which stores the tip names. ```{r get_sitePath} sp <- extractSite(fixations, 139) ``` It is also possible to retrieve the tips involved in the fixation of the site. ```{r get_tipNames} extractTips(fixations, 139) ``` Use `plot` on a `sitePath` object to visualize the fixation mutation of a single site. Alternatively, use `plotSingleSite` on an `fixationSites` object with the site specified. ```{r plot_sitePath} plot(sp) plotSingleSite(fixations, 139) ``` To have an overall view of the transition of fixation mutation, use `plot` on an `fixationSites` object. ```{r plot_fixations} plot(fixations) ``` ## Parallel mutation Parallel mutation can be found by the `parallelSites` function. There are four ways of defining the parallel mutation: `all`, `exact`, `pre` and `post`. Here `exact` is used as an example. ```{r} paraSites <- parallelSites(minEntropy, minSNP = 1, mutMode = "exact") paraSites ``` The result of a single site can be visualized by `plotSingleSite` function. ```{r} plotSingleSite(paraSites, 105) ``` # Additional functions This part is extra and experimental but might be useful when pre-assessing your data. We'll use an example to demonstrate. ## Inspect one site The `plotSingleSite` function will color the tree according to amino acids if you use the output of `lineagePath` function. ```{r plot_sites} plotSingleSite(paths, 139) plotSingleSite(paths, 763) ``` ## SNP sites An SNP site could potentially undergo fixation event. The `SNPsites` function predicts possible SNP sites and the result could be what you'll expect to be fixation mutation. Also, a tree plot with mutation could be visualized with `plotMutSites` function. ```{r find_SNP} snps <- SNPsites(tree) plotMutSites(snps) plotSingleSite(paths, snps[4]) plotSingleSite(paths, snps[5]) ``` # Session info {.unnumbered} ```{r session_info} sessionInfo() ```