--- title: "fastreeR Vignette" author: - name: "Anestis Gkanogiannis" email: anestis@gkanogiannis.com package: fastreeR output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{fastreeR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # About fastreeR The goal of fastreeR is to provide functions for calculating distance matrix, building phylogenetic tree or performing hierarchical clustering between samples, directly from a VCF or FASTA file. # Installation To install `fastreeR` package: ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("fastreeR") ``` # Preparation ## Allocate RAM and load required libraries You should allocate minimum 10kb per sample per variant of RAM for the JVM. The more RAM you allocate, the faster the execution will be (less pauses for garbage collection). In order to allocate RAM, a special parameter needs to be passed while JVM initializes. JVM parameters can be passed by setting `java.parameters` option. The `-Xmx` parameter, followed (without space) by an integer value and a letter, is used to tell JVM what is the maximum amount of heap RAM that it can use. The letter in the parameter (uppercase or lowercase), indicates RAM units. For example, parameters `-Xmx1024m` or `-Xmx1024M` or `-Xmx1g` or `-Xmx1G`, all allocate 1 Gigabyte or 1024 Megabytes of maximum RAM for JVM. ```{r, eval=TRUE, message=FALSE} options(java.parameters="-Xmx1G") unloadNamespace("fastreeR") library(fastreeR) library(utils) library(ape) library(stats) library(grid) library(BiocFileCache) ``` ## Download sample vcf file We download, in a temporary location, a small vcf file from 1K project, with around 150 samples and 100k variants (SNPs and INDELs). We use `BiocFileCache` for this retrieval process so that it is not repeated needlessly. If for any reason we cannot download, we use the small sample vcf from `fastreeR` package. ```{r, eval=TRUE} bfc <- BiocFileCache::BiocFileCache(ask = FALSE) tempVcfUrl <- paste0("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/", "1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/", "supporting/related_samples/", "ALL.chrX.shapeit2_integrated_snvindels_v2a_related_samples_27022019.", "GRCh38.phased.vcf.gz") tempVcf <- BiocFileCache::bfcquery(bfc,field = "rname", "tempVcf")$rpath[1] if(is.na(tempVcf)) { tryCatch( { tempVcf <- BiocFileCache::bfcadd(bfc,"tempVcf",fpath=tempVcfUrl)[[1]] }, error=function(cond) { tempVcf <- system.file("extdata", "samples.vcf.gz", package="fastreeR") }, warning=function(cond) { tempVcf <- system.file("extdata", "samples.vcf.gz", package="fastreeR") } ) } if(file.size(tempVcf) == 0L) { tempVcf <- system.file("extdata", "samples.vcf.gz", package="fastreeR") } ``` ## Download sample fasta files We download, in temporary location, some small bacterial genomes. We use `BiocFileCache` for this retrieval process so that it is not repeated needlessly. If for any reason we cannot download, we use the small sample fasta from `fastreeR` package. ```{r, eval=TRUE} tempFastasUrls <- c( #Mycobacterium liflandii paste0("https://ftp.ncbi.nih.gov/genomes/refseq/bacteria/", "Mycobacterium_liflandii/latest_assembly_versions/", "GCF_000026445.2_ASM2644v2/GCF_000026445.2_ASM2644v2_genomic.fna.gz"), #Pelobacter propionicus paste0("https://ftp.ncbi.nih.gov/genomes/refseq/bacteria/", "Pelobacter_propionicus/latest_assembly_versions/", "GCF_000015045.1_ASM1504v1/GCF_000015045.1_ASM1504v1_genomic.fna.gz"), #Rickettsia prowazekii paste0("https://ftp.ncbi.nih.gov/genomes/refseq/bacteria/", "Rickettsia_prowazekii/latest_assembly_versions/", "GCF_000022785.1_ASM2278v1/GCF_000022785.1_ASM2278v1_genomic.fna.gz"), #Salmonella enterica paste0("https://ftp.ncbi.nih.gov/genomes/refseq/bacteria/", "Salmonella_enterica/reference/", "GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz"), #Staphylococcus aureus paste0("https://ftp.ncbi.nih.gov/genomes/refseq/bacteria/", "Staphylococcus_aureus/reference/", "GCF_000013425.1_ASM1342v1/GCF_000013425.1_ASM1342v1_genomic.fna.gz") ) tempFastas <- list() for (i in seq(1,5)) { tempFastas[[i]] <- BiocFileCache::bfcquery(bfc,field = "rname", paste0("temp_fasta",i))$rpath[1] if(is.na(tempFastas[[i]])) { tryCatch( { tempFastas[[i]] <- BiocFileCache::bfcadd(bfc, paste0("temp_fasta",i), fpath=tempFastasUrls[i])[[1]] }, error=function(cond) { tempFastas <- system.file("extdata", "samples.fasta.gz", package="fastreeR") break }, warning=function(cond) { tempFastas <- system.file("extdata", "samples.fasta.gz", package="fastreeR") break } ) } if(!file.exists(tempFastas[[i]])) { tempFastas <- system.file("extdata", "samples.fasta.gz", package="fastreeR") break } if(file.size(tempFastas[[i]]) == 0L) { tempFastas <- system.file("extdata", "samples.fasta.gz", package="fastreeR") break } } ``` # Functions on vcf files ## Sample Statistics ```{r echo=TRUE, fig.cap="Sample statistics from vcf file", fig.wide=TRUE} myVcfIstats <- fastreeR::vcf2istats(inputFile = tempVcf) plot(myVcfIstats[,7:9]) ``` ## Calculate distances from vcf The most time consuming process is calculating distances between samples. Assign more processors in order to speed up this operation. ```{r, eval=TRUE} myVcfDist <- fastreeR::vcf2dist(inputFile = tempVcf, threads = 2) ``` ## Histogram of distances ```{r echo=TRUE, fig.cap="Histogram of distances from vcf file", fig.wide=TRUE} graphics::hist(myVcfDist, breaks = 100, main=NULL, xlab = "Distance", xlim = c(0,max(myVcfDist))) ``` We note two distinct groups of distances. One around of distance value 0.05 and the second around distance value 0.065. ## Plot tree from `fastreeR::dist2tree` Notice that the generated tree is ultrametric. ```{r echo=TRUE, fig.cap="Tree from vcf with fastreeR", fig.wide=TRUE} myVcfTree <- fastreeR::dist2tree(inputDist = myVcfDist) plot(ape::read.tree(text = myVcfTree), direction = "down", cex = 0.3) ape::add.scale.bar() ape::axisPhylo(side = 2) ``` Of course the same can be achieved directly from the vcf file, without calculating distances. ```{r echo=TRUE, fig.cap="Tree from vcf with fastreeR", fig.wide=TRUE} myVcfTree <- fastreeR::vcf2tree(inputFile = tempVcf, threads = 2) plot(ape::read.tree(text = myVcfTree), direction = "down", cex = 0.3) ape::add.scale.bar() ape::axisPhylo(side = 2) ``` As expected from the histogram of distances, two groups of samples also emerge in the tree. The two branches, one at height around 0.055 and the second around height 0.065, are clearly visible. ## Plot tree from `stats::hclust` For comparison, we generate a tree by using `stats` package and distances calculated by `fastreeR`. ```{r echo=TRUE, fig.cap="Tree from vcf with stats::hclust", fig.wide=TRUE} myVcfTreeStats <- stats::hclust(myVcfDist) plot(myVcfTreeStats, ann = FALSE, cex = 0.3) ``` Although it does not initially look very similar, because it is not ultrametric, it is indeed quite the same tree. We note again the two groups (two branches) of samples and the 4 samples, possibly clones, that they show very close distances between them. ## Hierarchical Clustering We can identify the two groups of samples, apparent from the hierarchical tree, by using `dist2clusters` or `vcf2clusters` or `tree2clusters`. By playing a little with the `cutHeight` parameter, we find that a value of `cutHeight=0.067` cuts the tree into two branches. The first group contains 106 samples and the second 44. ```{r, eval=TRUE} myVcfClust <- fastreeR::dist2clusters(inputDist = myVcfDist, cutHeight = 0.067) if (length(myVcfClust) > 1) { tree <- myVcfClust[[1]] clusters <- myVcfClust[[2]] tree clusters } ``` # Functions on fasta files Similar analysis we can perform when we have samples represented as sequences in a fasta file. ## Calculate distances from fasta Use of the downloaded sample fasta file : ```{r, eval=TRUE} myFastaDist <- fastreeR::fasta2dist(tempFastas, kmer = 6) ``` Or use the provided by `fastreeR` fasta file of 48 bacterial RefSeq : ```{r, eval=FALSE} myFastaDist <- fastreeR::fasta2dist( system.file("extdata", "samples.fasta.gz", package="fastreeR"), kmer = 6) ``` ## Histogram of distances ```{r echo=TRUE, fig.cap="Histogram of distances from fasta file",fig.wide=TRUE} graphics::hist(myFastaDist, breaks = 100, main=NULL, xlab="Distance", xlim = c(0,max(myFastaDist))) ``` ## Plot tree from `fastreeR::dist2tree` ```{r echo=TRUE, fig.cap="Tree from fasta with fastreeR", fig.wide=TRUE} myFastaTree <- fastreeR::dist2tree(inputDist = myFastaDist) plot(ape::read.tree(text = myFastaTree), direction = "down", cex = 0.3) ape::add.scale.bar() ape::axisPhylo(side = 2) ``` ## Plot tree from `stats::hclust` ```{r echo=TRUE, fig.cap="Tree from fasta with stats::hclust", fig.wide=TRUE} myFastaTreeStats <- stats::hclust(myFastaDist) plot(myFastaTreeStats, ann = FALSE, cex = 0.3) ``` # Session Info ```{r setup} utils::sessionInfo() ```