--- title: "VariantExperiment methods" author: - name: Qian Liu affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY date: "last edit: 08/14/2019" output: BiocStyle::html_document: toc: true toc_float: true package: VariantExperiment vignette: | %\VignetteIndexEntry{VariantExperiment-methods} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r options, eval=TRUE, echo=FALSE} options(showHeadLines=3) options(showTailLines=3) ``` # User instruction This vignette is about the conversion methods and statistical functions that are enabled on `VariantExperiment` objects, for the analysis of genotyping or DNA-seq data sets. If you want to learn more about the implementation of the `VariantExperiment` class, and basic methods, please refer to the other vignette. # Introduction The package of `VariantExperiment` is implemented to represent VCF/GDS files using standard SummarizedExperiment metaphor. It is a container for high-through genetic/genomic data with GDS back-end, and is interoperable with the statistical functions/methods that are implemented in `SeqArray` and `SeqVarTools` that are designed for GDS data. The `SummarizedExperiment` metaphor also gets the benefit of common manipulations within _Bioconductor_ ecosystem that are more user-friendly. First, we load the package into R session. ```{r Load, message=FALSE} library(VariantExperiment) ``` # Coercion methods To take advantage of the functions and methods that are defined on `SummarizedExperiment`, from which the `VariantExperiment` extends, we have defined coercion methods from `VCF` and `GDS` to `VariantExperiment`. ## From `VCF` to `VariantExperiment` The coercion function of `makeVariantExperimentFromVCF` could convert the `VCF` file directly into `VariantExperiment` object. To achieve the best storage efficiency, the assay data are saved in `GDSArray` format, and the annotation data are saved in `DelayedDataFrame` format (with no option of ordinary `DataFrame`), which could be retrieved by `rowData()` for feature related annotations and `colData()` for sample related annotations (Only when `sample.info` argument is specified). ```{r makeVariantExperimentFromVCF} vcf <- SeqArray::seqExampleFileName("vcf") ve <- makeVariantExperimentFromVCF(vcf) ve ``` Internally, the `VCF` file was converted into a on-disk `GDS` file, which could be retrieved by: ```{r retrieve GDS} gdsfile(ve) ``` assay data is in `GDSArray` format: ```{r makeVariantExperimentFromVCF2} assay(ve, 1) ``` feature-related annotation is in `DelayedDataFrame` format: ```{r makeVariantExperimentFromVCF3} rowData(ve) ``` User could also have the opportunity to save the sample related annotation info directly into the `VariantExperiment` object, by providing the file path to the `sample.info` argument, and then retrieve by `colData()`. ```{r sampleInfo} sampleInfo <- system.file("extdata", "Example_sampleInfo.txt", package="VariantExperiment") ve <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo) colData(ve) ``` Arguments could be specified to take only certain info columns or format columns from the vcf file. ```{r makeVariantExperimentFromVCFArgs} ve1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP")) rowData(ve1) ``` In the above example, only 2 info entries ("OR" and "GP") are read into the `VariantExperiment` object. The `start` and `count` arguments could be used to specify the start position and number of variants to read into `Variantexperiment` object. ```{r makeVariantExperimentFromVCFArgs_startCount} ve2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000) ve2 ``` For the above example, only 1000 variants are read into the `VariantExperiment` object, starting from the position of 101. ## From `GDS` to `VariantExperiment` The coercion function of `makeVariantExperimentFromGDS` coerces `GDS` files into `VariantExperiment` objects directly, with the assay data saved as `GDSArray`, and the `rowData()/colData()` in `DelayedDataFrame` by default (with the option of ordinary `DataFrame` object). ```{r makeVariantExperimentFromGDS} gds <- SeqArray::seqExampleFileName("gds") ve <- makeVariantExperimentFromGDS(gds) ve ``` ```{r makeVariantExperimentFromGDS2} rowData(ve) colData(ve) ``` Arguments could be specified to take only certain annotation columns for features and samples. All available data entries for `makeVariantExperimentFromGDS` arguments could be retrieved by the `showAvailable()` function with the gds file name as input. ```{r showAvailable} showAvailable(gds) ``` Note that the `infoColumns` from gds file will be saved as columns inside the `rowData()`, with the prefix of "info_". `rowDataOnDisk/colDataOnDisk` could be set as `FALSE` to save all annotation data in ordinary `DataFrame` format. ```{r makeVariantExperimentFromGDSArgs} ve3 <- makeVariantExperimentFromGDS(gds, rowDataColumns = c("ID", "ALT", "REF"), infoColumns = c("AC", "AN", "DP"), rowDataOnDisk = TRUE, colDataOnDisk = FALSE) rowData(ve3) ## DelayedDataFrame object colData(ve3) ## DataFrame object ``` # Lazy data operations `VariantExperiment` supports basic subsetting operations using `[`, `[[`, `$`, and ranged-based subsetting operations using `subsetByOverlap`. **NOTE** that after a set of lazy operations, you need to call `saveVariantExperiment` function to synchronize the on-disk file associated with the in-memory representation by providing a file path. Statistical functions could only work on synchronized `VariantExperiment` object, or error will return. Refer to the "VariantExperiment-class" vignette for more details. # Statistical functions Many statistical functions and methods are defined on `VariantExperiment` objects, most of which has their generic defined in _Bioconductor_ package of `SeqArray` and `SeqVarTools`. These functions could be called directly on `VariantExperiment` object as input, with additional arguments to specify based on user's need. More details please refer to the vignettes of [SeqArray][] and [SeqVarTools][]. [SeqArray]: https://bioconductor.org/packages/SeqArray [SeqVarTools]: https://bioconductor.org/packages/SeqVarTools Here is a list of the statistical functions with brief description: statistical functions | Description --------------------- | ------------ seqAlleleFreq | Calculates the allele frequencies seqAlleleCount | Calculates the allele counts seqMissing | Calculates the missing rate for variant/sample seqNumAllele | Calculates the number of alleles (for ref/alt allele) hwe | Exact test for Hardy-Weinberg equilibrium on Single-Nucleotide Variants inbreedCoeff | Calculates the inbreeding coefficient by variant/sample pca | Calculates the eigenvalues and eignevectors with Principal Component Analysis titv | Calculate transition/transversion ratio overall or by sample refDosage | Calculate the dosage of reference allele (matrix with integers of 0/1/2) altDosage | Calculate the dosage of alternative allele (matrix with integers of 0/1/2) countSingletons | Count singleton variants for each sample heterozygosity | Calculate heterozygosity rate by sample or by variants homozygosity | Calculate homozygosity rate by sample or by variants meanBySample | Calculate the mean value of a variable by sample over all variants isSNV | Flag a single nucleotide variant isVariant | Locate which samples are variant for each site Here are some examples in calculating the sample missing rate, hwe, titv ratio and the count of singletons for each sample. ```{r stats} ## sample missing rate mr.samp <- seqMissing(ve, per.variant = FALSE) head(mr.samp) ## hwe hwe <- hwe(ve) head(hwe) ## titv ratio by sample / overall titv <- titv(ve, by.sample=TRUE) head(titv) titv(ve, by.sample=FALSE) ## countSingletons countSingletons(ve) ``` As we have noted in the other vignette, after the subsetting by `[`, `$` or `Ranged-based operations`, we need to save the new `VariantExperiment` object to synchronize the gds file (on-disk) associated with the subset of data (in-memory representation) before any statistical analysis. Otherwise, an error will be returned. # Future work As a feature addition, we want to add the option of `VCFArray` in saving the `assay` data in the step of `makeVariantExperimentFromVCF`. We also seek to implement the [SQLDataFrame][] in representation of the annotation data. We also plan to connect _Bioconductor_ package [VariantAnnotation][] to implement the variant filtering and annotation functions based on `VariantExperiment` format, and with that, to develop a pipeline for using `VariantExperiment` object as the basic data structure for DNA-sequencing data analysis. [SQLDataFrame]: https://github.com/Bioconductor/SQLDataFrame [VariantAnnotation]: http://bioconductor.org/packages/VariantAnnotation # Session Info ```{r sessionInfo} sessionInfo() ```