---
title: "SeqArray Overview"
author: "Dr. Xiuwen Zheng (Department of Biostatistics, University of Washington, Seattle)"
date: "Jun 25, 2016"
output: slidy_presentation
vignette: >
  %\VignetteIndexEntry{SeqArray Overview}
  %\VignetteDepends{gdsfmt}
  %\VignetteKeywords{GWAS, Whole-genome, Sequencing, SNV}
  %\VignetteEngine{knitr::rmarkdown}
---

## Introduction

Whole-genome sequencing (WGS) data is being generated at an unprecedented rate

- 1000 Genomes Project Phase 3 (1KG)
    - 81 million variants and 2,504 individuals
    - http://www.1000genomes.org

- Variant Call Format (VCF)
    - a generic and flexible text-based format
    - VCF files are large and data retrieval is relatively slow


## Methods

**CoreArray (C++ library)**

- designed for large-scale data management of genome-wide variants
- data format (GDS) to store multiple array-oriented datasets in a single file

**Two R packages**

- gdsfmt -- R interface to CoreArray Genomic Data Structure (GDS) files
- SeqArray -- specifically designed for data management of genome-wide sequence variants from Variant Call Format (VCF) files


## Methods -- Advantages

- SeqArray provides the same capabilities as VCF

- Stores data in a binary and array-oriented manner
    - efficient access using the R language

- Genotypes are stored in a compressed manner
    - 2-bit array to store alleles (95% sites are bi-allelic)
    - rare variants: highly compressed
    - 1KG, 203.5 billion genotypes, saved in 4.3G (2.26% if a byte stores a genotype)

- Parallel access
    - multiple cluster nodes and/or cores


## Methods -- File Contents

```
File: SeqArray/extdata/CEU_Exon.gds (387.3K)
|--+ description   [  ] *
|--+ sample.id   { Str8 90 ZIP_ra(30.8%), 222B }
|--+ variant.id   { Int32 1348 ZIP_ra(35.7%), 1.9K }
|--+ position   { Int32 1348 ZIP_ra(86.4%), 4.6K }
|--+ chromosome   { Str8 1348 ZIP_ra(2.66%), 91B }
|--+ allele   { Str8 1348 ZIP_ra(17.2%), 928B }
|--+ genotype   [  ] *
|  \--+ data   { Bit2 2x90x1348 ZIP_ra(28.4%), 16.8K } *
|--+ phase   [  ]
|  \--+ data   { Bit1 90x1348 ZIP_ra(0.36%), 55B } *
|--+ annotation   [  ]
|  |--+ id   { Str8 1348 ZIP_ra(41.0%), 5.8K }
|  |--+ qual   { Float32 1348 ZIP_ra(0.91%), 49B }
|  |--+ filter   { Int32,factor 1348 ZIP_ra(0.89%), 48B } *
|  |--+ info   [  ]
|  |  |--+ AA   { Str8 1348 ZIP_ra(24.2%), 653B } *
|  |  \--+ HM2   { Bit1 1348 ZIP_ra(117.2%), 198B } *
|  \--+ format   [  ]
|     \--+ DP   [  ] *
|        \--+ data   { Int32 90x1348 ZIP_ra(33.8%), 160.3K }
\--+ sample.annotation   [  ]
   \--+ family   { Str8 90 ZIP_ra(34.7%), 135B }
```


## Methods -- Key Functions

**Table 1**: The key functions in the SeqArray package.

| Function     | Description |
|:-------------|:-------------------------------------------|
| seqVCF2GDS   | Reformats VCF files |
| seqSetFilter | Defines a data subset of samples or variants |
| seqGetData   | Gets data from a SeqArray file with a defined filter |
| seqApply     | Applies a user-defined function over array margins |
| seqParallel  | Applies functions in a computing cluster |


## Benchmark

- Dataset
    - 1000 Genomes Project Phase 3, chromosome 1
    - 6,468,094 variants, 2,504 individuals
    - original VCF.gz file: 1.2G
    - reformat to a SeqArray file: 458M (zlib compression)

- Calculate the frequencies of reference alleles
    1. R code (sequential version)
    2. R code (parallel version)
    3. R and C++ integration via the Rcpp package


## Benchmark -- Test 1 (sequentially)

```R
# load the R package
library(SeqArray)

# open the file
genofile <- seqOpen("1KG_chr1.gds")

# apply a user-defined function over variants
system.time(afreq <- seqApply(genofile, "genotype",
    FUN = function(x) { mean(x==0L, na.rm=TRUE) },
    as.is="double", margin="by.variant")
)
```

**10.8 minutes** on Linux with Intel Xeon CPU @2GHz and 128GB RAM
`function(x) { mean(x==0L, na.rm=TRUE) }` is a user-defined function, where `x` is an integer matrix:

```R
                           sample
  allele [,1] [,2] [,3] [,4] [,5]
    [1,]    0    1    0   NA    1
    [2,]    0    0    0    1    0
```

0 -- reference allele, 1 -- the first alternative allele


## Benchmark -- Test 2 (in parallel)

`seqParallel()` splits genotypes into 4 non-overlapping parts according to different cores.

```R
# load the R package
library(parallel)

# create a computing cluster with 4 cores
seqParallelSetup(4)

# run in parallel
system.time(afreq <- seqParallel(gdsfile=genofile,
    FUN = function(f) {
        seqApply(f, "genotype", as.is="double", margin="by.variant",
            FUN = function(x) mean(x==0L, na.rm=TRUE))
    }, split = "by.variant")
)
```

**3.1 minutes** (vs. 10.8m in Test 1)


## Benchmark -- Test 3 (C++ Integration)

```R
library(Rcpp)

# dynamically define an inline C/C++ function in R
cppFunction('double RefAlleleFreq(IntegerMatrix x) {
    int nrow = x.nrow(), ncol = x.ncol();
    int cnt=0, zero_cnt=0, g;
    for (int i = 0; i < nrow; i++) {
        for (int j = 0; j < ncol; j++) {
            if ((g = x(i, j)) != NA_INTEGER) {
                cnt ++;
                if (g == 0) zero_cnt ++;
            }
    }}
    return double(zero_cnt) / cnt;
}')

system.time(
    afreq <- seqApply(genofile, "genotype", RefAlleleFreq,
        as.is="double", margin="by.variant")
)
```

**1.5 minutes** (significantly faster! vs. 10.8m in Test 1)


## Conclusion

**SeqArray is of great interest to**

- R users involved in data analyses of large-scale sequence variants
- particularly those with limited experience of parallel / high-performance computing

**SeqVarTools (Bioconductor)**

- variant analysis, such like allele frequencies, HWE, Mendelian errors, etc
- functions to display genotypes / annotations in a readable format

**SNPRelate (Bioconductor)**

- a parallel computing toolset for relatedness and principal component analysis


## Resource

1000 Genomes Project: [http://bochet.gcc.biostat.washington.edu/seqarray/1000genomes](http://bochet.gcc.biostat.washington.edu/seqarray/1000genomes)

1. Phase 1:
	- [1KG_autosomes_SHAPEIT2_integrated_phase1_v3_20101123_snps_indels_svs_genotypes.seq.gds](http://bochet.gcc.biostat.washington.edu/seqarray/1000genomes/1KG_autosomes_SHAPEIT2_integrated_phase1_v3_20101123_snps_indels_svs_genotypes.seq.gds) [1.5G]
	- 1,092 individuals and 38,219,238 variants

2. Phase 3:
	- [1KG_autosomes_phase3_shapeit2_mvncall_integrated_v5_20130502_lzma.seq.gds](http://bochet.gcc.biostat.washington.edu/seqarray/1000genomes/1KG_autosomes_phase3_shapeit2_mvncall_integrated_v5_20130502_lzma.seq.gds) [2.4G]
	- 2,504 individuals and 81,271,745 variants


## Acknowledgements

Department of Biostatistics at University of Washington -- Seattle

Genetic Analysis Center:

- Dr. Stephanie M. Gogarten
- Dr. David Levine
- Dr. Cathy Laurie