--- title: "GrafGen: Classifying Subpopulations of H. pylori Genomes" author: "William Wheeler, Difei Wang, Isaac Zhao, Yumi Jin and Charles Rabkin" output: rmarkdown::html_document vignette: > %\VignetteIndexEntry{GrafGen: Classifying H. pylori genomes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction The GrafGen package is for classifying Helicobacter pylori genomes according to genetic distance from nine reference populations as defined by equation 2 in Jin (2019). The main function is this package is `grafGen()` which requires a file of genotypes that can be either a PLINK bed file or a VCF file. # Installing the GrafGen package from Bioconductor ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("GrafGen") ``` # Loading the package Before using the GrafGen package, it must be loaded into an R session. ```{r} library(GrafGen) ``` # Example data The GrafGen package includes example data which is a subset of the reference data that was used to train the model. The data is stored in the extdata folder. ```{r} dir <- system.file("extdata", package="GrafGen", mustWork=TRUE) geno.file <- paste0(dir, .Platform$file.sep, "data.vcf.gz") print(geno.file) ``` # Running grafGen() The `grafGen()` function returns a list of class "grafpop" with two objects: `table` and `vertex`. The object `table` is a data frame containing hypothetical ancestry percents (F_percent, E_percent and A_percent) based on known African, European and Asian samples, respectively, normalized genetic distance scores (GD1_x, GD2_y, GD3_z), the predicted reference population (Refpop), nearest neighboring reference population, percent separation as defined in the user manual and the genetic distances to each reference populations (hpgpAfrica, hpgpAfrica-distant, hpgpAfroamerica, hpgpEuroamerica, hpgpMediterranea, hpgpEurope, hpgpEurasia, hpgpAsia, and hpgpAklavik86-like). The object `vertex` is a list containing the (fixed) x-y coordinates of the African, European and Asian vertex population centroids. ```{r} ret <- grafGen(geno.file, print=0) ret$table[seq_len(5), ] ``` Printing the return object from `grafGen()` will display a table of frequency counts for the predicted reference populations for the user input data. ```{r} print(ret) ``` Plotting the return object will display a plot of the genetic distance scores (GD1_x vs GD2_y) for the user input data and the reference data. Additional plots can be obtained by calling the `grafGenPlot()` function. ```{r, crop=NULL} plot(ret) ``` # Interactive plots The functions `interactiveReferencePlot` and `interactivePlot` create interactive plots for the reference data and user input data respectively. A call to `interactiveReferencePlot` will all show the results of all samples in the reference data. Hovering over a point in the plot will display three lines of information. Line 1 contains the type and id of that sample. Line 2 contains the sample's reference population, next nearest reference population, and separation percent to the next nearest reference population as defined in the user manual. Line 3 contains the percent African, European and Asian ancestry for that sample. The legend shows the types (which are the source countries in `interactiveReferencePlot`) for all samples, and clicking the name of a type will add or remove those samples from the plot. ```{r} if (interactive()) interactiveReferencePlot() ``` # R shiny app The `GrafGen` package also includes an R shiny app to view and filter the plot using up to two variables. The function `createApp` returns a list containing the app and data objects needed with the app. The app then can be launched with the `runApp` function. ```{r} tmp <- createApp(ret) if (interactive()) { reference_results <- tmp$reference_results user_results <- tmp$user_results user_metadata <- tmp$user_metadata shiny::runApp(tmp$app) } ``` # Session Information ```{r} sessionInfo() ```