---
title: "gwasurvivr Vignette"
author:
- name: Abbas Rizvi
  affiliation: The Ohio State University, Columbus, OH
- name: Ezgi Karaesmen
  affiliation: The Ohio State University, Columbus, OH
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY    
- name: Lara Sucheston-Campbell  
  affiliation: The Ohio State University, Columbus, OH
package: gwasurvivr
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output: 
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
vignette: >
  %\VignetteIndexEntry{gwasurvivr Vignette}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

  
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache=FALSE, eval = TRUE)
library(knitr)
```

# Introduction
`gwasurvivr` can be used to perform survival analyses of imputed genotypes from Sanger and Michigan imputation servers and IMPUTE2 software. This vignette is a tutorial on how to perform these analyses. This package can be run locally on a Linux, Mac OS X, Windows or conveniently batched on a high performing computing cluster. `gwasurvivr` iteratively processes the data in chunks and therefore intense memory requirements are not necessary.      
`gwasurvivr` package comes with three main functions to perform survival analyses using Cox proportional hazard (Cox PH) models depending on the imputation method used to generate the genotype data:    

1. `michiganCoxSurv`: Performs survival analysis on imputed genetic data stored in compressed VCF files generated via Michigan imputation server.    
2. `sangerCoxSurv`:  Performs survival analysis on imputed genetic data stored in compressed VCF files generated via Sanger imputation server.    
3. `impute2CoxSurv`: Performs survival analysis on imputed genetic data from IMPUTE2 output.
4. `plinkCoxSurv`: Performs survival analysis on directly typed genetic data from PLINK files (.BED, .BIM, and .FAM)  
5. `gdsCoxSurv`: Performs survival analysis on genetic data that user has already converted from IMPUTE2 format to GDS format.  

All functions fit a Cox PH model to each SNP including other user defined covariates and will save the results as a text file directly to the disk that contains survival analysis results. `gwasurvivr` functions can also test for interaction of SNPs with a given covariate. See examples for further details. 

## Main input arguments

All three functions require the following main arguments:

* `vcf.file`: A character string giving the path to genotype data file (`impute.file` for IMPUTE2; `b.file` for PLINK)
* `covariate.file`: A data frame comprising sample IDs (that link to the genotype data samples), phenotype (time, event) and additional covariate data
* `id.column`: A character string providing sample ID column name in covariate.file
* `time.to.event`: A character string that contains the time column name in covariate.file
* `event`: Character string that contains the event column name in covariate.file
* `covariates`: Character vector with contains column names in covariate.file of covariates to include in model
* `out.file`: A character string giving output file name

Further arguments can be passed depending on the user preference. For example, user can define minor allele frequency (MAF) or info/R2 score threshold to filter out SNPs that have low MAF or info/R2 score to avoid false-positive signals and to reduce computing time. User can also define a subset of samples to be analyzed by providing the set of sample IDs. Users can also control how chunk size -- the number of rows (SNPs) to include for each iteration. 

**IMPORTANT: In the `covariate.file`, categorical variables need to be converted to indicator (dummy) variables and be of class `numeric`. Ordinal variables represented as characters, ie "low", "medium" and "high" should be converted to the appropriate numeric values as well.**

## Main output format
The output for the 3 main functions in `gwasurvivr` are very similar but with subtle differences. In general the output includes the following main fields: RSID, TYPED, CHR, POS, REF, ALT, Allele Frequencies\*, INFO\*, PVALUES, HRs, HR confidence intervals, coefficient estimates, standard errors, Z-statistic, N, and NEVENT. Allele frequencies and INFO differ by the input imputation software. 

**Note: Invoking the `inter.term` argument for any of the functions will make the PVALUE and HRs and HR confidence intervals represent the INTERACTION term and not the SNP alone.**

The non-software specific fields are summarized below:

```{r, echo=FALSE}
cols <- c("RSID",
          "CHR",
          "POS",
          "REF",
          "ALT",
          "SAMP_FREQ_ALT",
          "SAMP_MAF",
          "PVALUE",
          "HR",
          "HR_lowerCI",
          "HR_upperCI",
          "COEF",
          "SE.COEF",
          "Z",
          "N",
          "NEVENT")

desc <- c("SNP ID",
          "Chromosome number",
          "Genomic Position (BP)",
          "Reference Allele",
          "Alternate Allele",
          "Alternate Allele frequency in sample being tested",
          "Minor allele frequency in sample being tested",
          "P-value of single SNP or interaction term",
          "Hazard Ratio (HR)",
          "Lower bound 95% CI of HR",
          "Upper bound 95% CI of HR",
          "Estimated coefficient of SNP",
          "Standard error of coefficient estimate",
          "Z-statistic",
          "Number of individuals in sample being tested",
          "Number of events that occurred in sample being tested")

df <- cbind(cols, desc)
colnames(df) <- c("Column", "Description")
kable(df)
```

The software specific fields are summarized below:  
1. `michiganCoxSurv` unique output columns are AF, MAF, R2, ER2. They are summarized below.  

```{r, echo=FALSE}
cols <- c("TYPED",
          "AF", 
          "MAF",
          "R2",
          "ER2")
desc <- c("Imputation status: TRUE (SNP IS TYPED)/FALSE (SNP IS IMPUTED)",
          "Minimac3 output Alternate Allele Frequency",
          "Minimac3 output of Minor Allele Frequency",
          "Imputation R2 score (minimac3 $R^2$)",
          "Minimac3 ouput empirical $R^2$")
df <- cbind(cols, desc)
colnames(df) <- c("Column", "Description")

kable(df)
```

Please see [Minimac3 Info File](https://genome.sph.umich.edu/wiki/Minimac3_Info_File) for details on output

2. `sangerCoxSurv`  

```{r, echo=FALSE}
cols <- c("TYPED",
          "RefPanelAF",
          "INFO")
desc <- c("Imputation status: TRUE (SNP IS TYPED)/FALSE (SNP IS IMPUTED)",
          "HRC Reference Panel Allele Frequency",
          "Imputation INFO score from PBWT")
df <- cbind(cols, desc)
colnames(df) <- c("Column", "Description")
kable(df)
```


3. `impute2CoxSurv`  

```{r, echo=FALSE}
cols <- c("TYPED",
          "A0",
          "A1", 
          "exp_freq_A1")
desc <- c("`---` is imputed, repeated RSID is typed",
          "Allele coded 0 in IMPUTE2",
          "Allele coded 1 in IMPUTE2",
          "Expected allele frequency of alelle code A1")
df <- cbind(cols, desc)
colnames(df) <- c("Column", "Description")
kable(df)
```

More statistics can be printed out by invoking the `print.covs` argument and setting it to `print.covs=all` (single SNP/SNP\*covariate interaction) or `print.covs=some` (SNP\*covariate ineraction). These options are available mainly for modeling purposes (covariate selection) and aren't suggested for very large analyses as it will greatly increase the number of columns in the output, depending on how many covariates users are adjusting for. 

# Getting started
Install `gwasurvivr` from the [Sucheston-Campbell Lab Github repository](http://github.com/suchestoncampbelllab/gwasurvivr) using `devtools`. 

```{r, eval=FALSE}
devtools::install_github("suchestoncampbelllab/gwasurvivr")
```

Or please install from the `devel` branch of Bioconductor (R version >= 3.5)

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("gwasurvivr", version = "devel")    
```

## Dependencies
**Note**:  This package depends on `GWASTools` which uses netcdf framework on linux. Therefore, for linux users, please install `libnetcdf-dev` and `netcdf-bin` before installing `gwasurvivr`. These linux libraries may already installed on an academic computing cluster. 

CRAN packages:  
1. `ncdf4`  
2. `matrixStats`  
3. `parallel`    
4. `survival` 

```{r, eval=FALSE}
install.packages(c("ncdf4", "matrixStats", "parallel", "survival"))
```

Bioconductor packages:  
1. `GWASTools`  
2. `VariantAnnotation`  
3. `SummarizedExperiment` 
4. `SNPRelate`

```{r, eval=FALSE}
BiocManager::install("GWASTools")    
BiocManager::install("VariantAnnotation")    
BiocManager::install("SummarizedExperiment")    
BiocManager::install("SNPRelate")    
```

Load `gwasurvivr`.

```{r}
library(gwasurvivr)
```

## User settings: parallelization setup

`gwasurvivr` uses `parallel` package for its internal parallelization to fit the Cox PH models. Users are not required to define a parallelization setup, by default `gwasurvivr` functions will detect the user's operating system and set the cluster object to `FORK` if the platform is Linux/OS X and to `SOCK` if Windows. However, parallelization settings can be modified by the user if needed. Users are given two ways to define their cluster settings:  

**1. Setting the number of cores to be used:**

Linux/OS X users can run analyses on a prespecified number of cores by setting the option in the R session as shown below. This option should be defined in the R session before running any of the `gwasurvivr` functions. Here we decide to use 4 cores. This option is not available to Windows users.   

```{r, eval=FALSE}
options("gwasurvivr.cores"=4)
```

**2. Providing a user defined cluster object**

To modify more settings, users can also provide a "cluster object" to any of the `gwasurvivr` functions. The cluster object can be generated via `makeCluster`, `makePSOCKcluster`, `makeForkCluster` functions from `parallel` package or similar cluster object functions from `snow` or `snowfall` packages. This method can be applied on any operating system. User can create a cluster object before running any of the functions and pass the cluster object to the `clusterObj` argument as shown below. For further details see `??parallel::parallel`.

```{r, eval=FALSE}
library(parallel)
cl <- makeCluster(detectCores())

impute2CoxSurv(..., clusterObj=cl)
sangerCoxSurv(..., clusterObj=cl)
michiganCoxSurv(..., clusterObj=cl)
```

# R Session Examples

##  Michigan Imputation Server
[Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html) pre-phases typed genotypes using HAPI-UR, SHAPEIT, or EAGLE (default is EAGLE2), imputes using Minimac3 imputation engine and outputs Blocked GNU Zip Format VCF files (`.vcf.gz`). These `.vcf.gz` files are used as input for `gwasurvivr`.  Minimac uses slightly different metrics to assess imputation quality ($R^2$ versus info score) and complete details as to minimac output are available on the [Minimac3 Wikipage](https://genome.sph.umich.edu/wiki/Minimac3_Imputation_Cookbook).

The function, `michiganCoxSurv` uses a modification of Cox proportional hazard regression from the R library `survival:::coxph.fit`. Built specifically for genetic data, `michiganCoxSurv` allows the user to filter on info ($R^2$) score (imputation quality metric) and minor allele frequency from the reference panel used for imputation using `RefPanelAF` as the input arguement for `maf.filter`. Users are also provided with the sample minor allele frequency (MAF) in the `sangerCoxSurv` output, which can be used for filtering post analysis. 

Samples can be selected for analyses by providing a vector of `sample.ids`. The output from Sanger imputation server returns the samples as `SAMP1, ..., SAMPN`, where `N` is the total number of samples. The sample order corresponds to the sample order in the `vcf.file` used for imputation. Note, sample order can also be found in the `.fam` file if genotyping data were initially in `.bed`, `.bim` and `.fam` (PLINK) format prior to conversion to VCF. If no sample list is specified all samples are included in the analyses.

```{r, message=FALSE}
vcf.file <- system.file(package="gwasurvivr",
                        "extdata", 
                        "michigan.chr14.dose.vcf.gz")
pheno.fl <- system.file(package="gwasurvivr",
                        "extdata", 
                        "simulated_pheno.txt")
pheno.file <- read.table(pheno.fl,
                         sep=" ", 
                         header=TRUE,
                         stringsAsFactors = FALSE)
head(pheno.file)

# recode sex column and remove first column 
pheno.file$SexFemale <- ifelse(pheno.file$sex=="female", 1L, 0L)
# select only experimental group sample.ids
sample.ids <- pheno.file[pheno.file$group=="experimental",]$ID_2
head(sample.ids)
```

In this example, we will select samples from the `experimental` group and will test survival only on these patients. The first column in the `pheno.file` are sample IDs, which link the phenotype file to the imputation file. We include `age`, `DrugTxYes`, and `sex` in the survival model as covariates. 

We perform the analysis using the `experimental` group to demonstrate how one may want to prepare their data if interested in testing only on a subset of samples (i.e. a case-control study and survival of cases is of interest). Note that how the IDs (`sample.ids`) need to be a vector of class `character`. The `chunk.size` refers to size of each data chunk read in and is defaulted to 10,000 rows. Users can customize that to their needs. The larger the `chunk.size` the more memory (RAM) required to run the analysis. The recommended `chunk.size=10000` and probably should not exceed `chunk.size=100000`. This does not mean that `gwasurvivr` is limited to only 100,000 SNPs, it just is how many SNPs are analyzed for each iteration. 

By default survival estimates and pvalues for the SNP adjusted for other covariates are outputted (`print.covs='only'`), however users can select `print.covs=all` to get the coefficient estimates for covariates included in the model. Depending on the number of covariates included this can add substantially to output file size. 

### Single SNP analysis
Next we run `michiganCoxSurv` with the default, `print.covs="only"`, load the results into R and provide descriptions of output by column. We will then run the analysis again using `print.covs="all"`. `verbose=TRUE` is used for these examples so the function display messages while running.

Use `?michiganCoxSurv` for argument specific documentation.

`print.covs="only"`

```{r,eval=FALSE}
michiganCoxSurv(vcf.file=vcf.file,
                covariate.file=pheno.file,
                id.column="ID_2",
                sample.ids=sample.ids,
                time.to.event="time",
                event="event",
                covariates=c("age", "SexFemale", "DrugTxYes"),
                inter.term=NULL,
                print.covs="only",
                out.file="michigan_only",
                r2.filter=0.3,
                maf.filter=0.005,
                chunk.size=100,
                verbose=TRUE,
                clusterObj=NULL)
```


```{r, echo=FALSE}
michiganCoxSurv(vcf.file=vcf.file,
                covariate.file=pheno.file,
                id.column="ID_2",
                sample.ids=sample.ids,
                time.to.event="time",
                event="event",
                covariates=c("age", "SexFemale", "DrugTxYes"),
                inter.term=NULL,
                print.covs="only",
                out.file=tempfile("michigan_only"),
                r2.filter=0.3,
                maf.filter=0.005,
                chunk.size=100,
                verbose=TRUE,
                clusterObj=NULL)
```

Here we load the data and glimpse the first few values in each column outputted from the SNP*interaction survival analyses using `print.covs="only"`.

```{r, message=FALSE, eval=FALSE}
michigan_only <- read.table("michigan_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
```

```{r, message=FALSE, echo=FALSE}
michigan_only <- read.table(dir(tempdir(), pattern="^michigan_only.*\\.coxph$", full.names = TRUE), sep="\t", header=TRUE, stringsAsFactors = FALSE)
```

```{r}
str(head(michigan_only))
```

### SNP with covariate interaction
A SNP*covariate interaction can be implemented using the `inter.term` argument. In this example, we will use `DrugTxYes` from the covariate file as the covariate we want to test for interaction with the SNP. 

`print.covs="only"`

```{r,eval=FALSE}
michiganCoxSurv(vcf.file=vcf.file,
                covariate.file=pheno.file,
                id.column="ID_2",
                sample.ids=sample.ids,
                time.to.event="time",
                event="event",
                covariates=c("age", "SexFemale", "DrugTxYes"),
                inter.term="DrugTxYes",
                print.covs="only",
                out.file="michigan_intx_only",
                r2.filter=0.3,
                maf.filter=0.005,
                chunk.size=100,
                verbose=FALSE,
                clusterObj=NULL)
```

```{r, echo=FALSE}
michiganCoxSurv(vcf.file=vcf.file,
                covariate.file=pheno.file,
                id.column="ID_2",
                sample.ids=sample.ids,
                time.to.event="time",
                event="event",
                covariates=c("age", "SexFemale", "DrugTxYes"),
                inter.term="DrugTxYes",
                print.covs="only",
                out.file=tempfile("michigan_intx_only"),
                r2.filter=0.3,
                maf.filter=0.005,
                chunk.size=100,
                verbose=FALSE,
                clusterObj=NULL)
```

Here we load the data and glimpse the first few values in each column outputted from the SNP*interaction survival analyses using `print.covs="only"`.

```{r, message=FALSE, eval=FALSE}
michigan_intx_only <- read.table("michigan_intx_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
```

```{r, message=FALSE, echo=FALSE}
michigan_intx_only <- read.table(dir(tempdir(), pattern="^michigan_intx_only.*\\.coxph$", full.names = TRUE), sep="\t", header=TRUE, stringsAsFactors = FALSE)
```

```{r}
str(head(michigan_intx_only))
```

## Sanger Imputation Server
[Sanger Imputation Server](https://imputation.sanger.ac.uk/) pre-phases typed genotypes using either SHAPEIT or EAGLE, imputes genotypes using PBWT algorithm and outputs a `.vcf.gz` file for each chromosome. These `.vcf.gz` files are used as input for `gwasurvivr`. The function, `sangerCoxSurv`  uses a modification of Cox proportional hazard regression from `survival::coxph.fit`. Built specifically for genetic data, `sangerCoxSurv` allows the user to filter on info score (imputation quality metric) and minor allele frequency from the reference panel used for imputation using `RefPanelAF` as the input arguement for `maf.filter`. Users are also provided with the sample minor allele frequency in the `sangerCoxSurv` output. 

Samples can be selected for analyses by providing a vector of `sample.ids`. The output from Sanger imputation server returns the samples as `SAMP1, ..., SAMPN`, where `N` is the total number of samples. The sample order corresponds to the sample order in the VCF file used for imputation. Note, sample order can also be found in the `.fam` file if genotyping data were initially in `.bed`, `.bim` and `.fam` (PLINK) format prior to conversion to VCF. If no sample list is specified, all samples are included in the analyses.

In this example, we will select samples from the `experimental` group and will test survival only on these patients. The first column in the `pheno.file` are sample IDs (we will match on these). We include `age`, `DrugTxYes`, and `sex` in the survival model as covariates. 

```{r, eval=TRUE}
vcf.file <- system.file(package="gwasurvivr",
                        "extdata", 
                        "sanger.pbwt_reference_impute.vcf.gz")
pheno.fl <- system.file(package="gwasurvivr",
                        "extdata", 
                        "simulated_pheno.txt")
pheno.file <- read.table(pheno.fl,
                         sep=" ",
                         header=TRUE,
                         stringsAsFactors = FALSE)
head(pheno.file)
# recode sex column and remove first column 
pheno.file$SexFemale <- ifelse(pheno.file$sex=="female", 1L, 0L)
# select only experimental group sample.ids
sample.ids <- pheno.file[pheno.file$group=="experimental",]$ID_2
head(sample.ids)
```

We perform the analysis using the `experimental` group to demonstrate how one may want to prepare their data if not initially all samples are patients or cases (i.e. a case-control study and survival of cases is of interest). We also are showing how the IDs (`sample.ids`) need to be a vector of class `character`. The `chunk.size` refers to size of each data chunk read in and is defaulted to 10,000 rows, users can customize that to their needs. The larger the `chunk.size` the more memory (RAM) required to run the analysis. The recommended `chunk.size=10000` and probably should not exceed `chunk.size=100000`. This does not mean that `gwasurvivr` is limited to only 100,000 SNPs, it just is how many SNPs are analyzed for each iteration. 

By default survival estimates and pvalues for the SNP adjusted for other covariates are outputted (`print.covs='only'`), however users can select `print.covs=all` to get the coefficient estimates for covariates included in the model. Depending on the number of covariates included this can add substantially to output file size. 

Use `?sangerCoxSurv` for argument specific documentation.

### Single SNP analysis
Next we run `sangerCoxSurv` with the default, `print.covs="only"`, load the results into R and provide descriptions of output by column. `verbose=TRUE` is used for these examples so the function display messages while running.

`print.covs="only"`

```{r, eval=FALSE}
sangerCoxSurv(vcf.file=vcf.file,
              covariate.file=pheno.file,
              id.column="ID_2",
              sample.ids=sample.ids,
              time.to.event="time",
              event="event",
              covariates=c("age", "SexFemale", "DrugTxYes"),
              inter.term=NULL,
              print.covs="only",
              out.file="sanger_only",
              info.filter=0.3,
              maf.filter=0.005,
              chunk.size=100,
              verbose=TRUE,
              clusterObj=NULL)
```

```{r, echo=FALSE}
sangerCoxSurv(vcf.file=vcf.file,
              covariate.file=pheno.file,
              id.column="ID_2",
              sample.ids=sample.ids,
              time.to.event="time",
              event="event",
              covariates=c("age", "SexFemale", "DrugTxYes"),
              inter.term=NULL,
              print.covs="only",
              out.file=tempfile("sanger_only"),
              info.filter=0.3,
              maf.filter=0.005,
              chunk.size=100,
              verbose=TRUE,
              clusterObj=NULL)
```

Here we load the data and glimpse the first few values in each column from the survival analyses.
```{r, message=FALSE, echo=FALSE}
sanger_only <- read.table(dir(tempdir(), pattern="^sanger_only.*\\.coxph$", full.names = TRUE), sep="\t", header=TRUE, stringsAsFactors = FALSE)
```

```{r}
str(head(sanger_only))
```

Column names with descriptions from the survival analyses with covariates, specifying the default `print.covs="only"`

### SNP with covariate interaction
A SNP*covariate interaction can be implemented using the `inter.term` argument. In this example, we will use `DrugTxYes` from the covariate file as the covariate we want to test for interaction with the SNP. 

`print.covs="only"`

```{r,eval=FALSE}
sangerCoxSurv(vcf.file=vcf.file,
              covariate.file=pheno.file,
              id.column="ID_2",
              sample.ids=sample.ids,
              time.to.event="time",
              event="event",
              covariates=c("age", "SexFemale", "DrugTxYes"),
              inter.term="DrugTxYes",
              print.covs="only",
              out.file="sanger_intx_only",
              info.filter=0.3,
              maf.filter=0.005,
              chunk.size=100,
              verbose=TRUE,
              clusterObj=NULL)
```


```{r, echo=FALSE, eval=FALSE}
sangerCoxSurv(vcf.file=vcf.file,
              covariate.file=pheno.file,
              id.column="ID_2",
              sample.ids=sample.ids,
              time.to.event="time",
              event="event",
              covariates=c("age", "SexFemale", "DrugTxYes"),
              inter.term="DrugTxYes",
              print.covs="only",
              out.file=tempfile("sanger_intx_only"),
              info.filter=0.3,
              maf.filter=0.005,
              chunk.size=100,
              verbose=TRUE,
              clusterObj=NULL)
```


## IMPUTE2 Imputation
IMPUTE2 is a genotype imputation and haplotype phasing program (Howie et al 2009). IMPUTE2 outputs 6 files for each chromosome chunk imputed (usually 5 MB in size). Only 2 of these files are required for analyses using `gwasurvivr`:

  - Genotype file (`.impute`)  
  - Sample file (`.sample`)  
  
[More information can be read about these formats](http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html)

We are going to load in and pre-process the genetic data and the phenotypic data (`covariate.file`).
```{r, message=FALSE}
impute.file <- system.file(package="gwasurvivr",
                            "extdata",
                            "impute_example.impute2.gz")
sample.file <- system.file(package="gwasurvivr",
                           "extdata", 
                           "impute_example.impute2_sample")
covariate.file <- system.file(package="gwasurvivr", 
                              "extdata",
                              "simulated_pheno.txt")
covariate.file <- read.table(covariate.file, 
                             sep=" ",
                             header=TRUE,
                             stringsAsFactors = FALSE)
covariate.file$SexFemale <- ifelse(covariate.file$sex=="female", 1L, 0L)
sample.ids <- covariate.file[covariate.file$group=="experimental",]$ID_2
```

To perform survival analysis using IMPUTE2 the function arguments are very similarto  `michiganCoxSurv` and `sangerCoxSurv`, however the function now takes a chromosome arguement. This is needed to properly annotate the file output with the chromosome that these SNPs are in. This is purely an artifact of IMPUTE2 and how we leverage `GWASTools` in this function. 

### Single SNP analysis
First we will do the analysis with no interaction term, followed by doing the analysis with the interaction term. The recommended output setting for single SNP analysis is `print.cov="only"`.

```{r, eval=FALSE}
impute2CoxSurv(impute.file=impute.file,
               sample.file=sample.file,
               chr=14,
               covariate.file=covariate.file,
               id.column="ID_2",
               sample.ids=sample.ids,
               time.to.event="time",
               event="event",
               covariates=c("age", "SexFemale", "DrugTxYes"),
               inter.term=NULL,
               print.covs="only",
               out.file="impute_example_only",
               chunk.size=100,
               maf.filter=0.005,
               exclude.snps=NULL,
               flip.dosage=TRUE,
               verbose=TRUE,
               clusterObj=NULL)
```

```{r, echo=FALSE}
impute2CoxSurv(impute.file=impute.file,
               sample.file=sample.file,
               chr=14,
               covariate.file=covariate.file,
               id.column="ID_2",
               sample.ids=sample.ids,
               time.to.event="time",
               event="event",
               covariates=c("age", "SexFemale", "DrugTxYes"),
               inter.term=NULL,
               print.covs="only",
               out.file=tempfile("impute_example_only"),
               chunk.size=100,
               maf.filter=0.005,
               exclude.snps=NULL,
               flip.dosage=TRUE,
               verbose=TRUE,
               clusterObj=NULL)
```
Here we load the data and glimpse the first few values in each column output.

```{r, message=FALSE, eval=FALSE}
impute2_res_only <- read.table("impute_example_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(impute2_res_only))
```


```{r, message=FALSE, echo=FALSE}
impute2_res_only <- read.table(dir(tempdir(), pattern="^impute_example_only.*\\.coxph$", full.names=TRUE), sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(impute2_res_only))
```

### SNP covariate interaction
Now we will perform a SNP*covariate interaction survival analysis using `impute2CoxSurv`.  
```{r, eval=FALSE}
impute2CoxSurv(impute.file=impute.file,
               sample.file=sample.file,
               chr=14,
               covariate.file=covariate.file,
               id.column="ID_2",
               sample.ids=sample.ids,
               time.to.event="time",
               event="event",
               covariates=c("age", "SexFemale", "DrugTxYes"),
               inter.term="DrugTxYes",
               print.covs="only",
               out.file="impute_example_intx",
               chunk.size=100,
               maf.filter=0.005,
               flip.dosage=TRUE,
               verbose=FALSE,
               clusterObj=NULL,
               keepGDS=FALSE)
```

```{r, echo=FALSE}
impute2CoxSurv(impute.file=impute.file,
               sample.file=sample.file,
               chr=14,
               covariate.file=covariate.file,
               id.column="ID_2",
               sample.ids=sample.ids,
               time.to.event="time",
               event="event",
               covariates=c("age", "SexFemale", "DrugTxYes"),
               inter.term="DrugTxYes",
               print.covs="only",
               out.file=tempfile("impute_example_intx"),
               chunk.size=100,
               maf.filter=0.005,
               flip.dosage=TRUE,
               verbose=FALSE,
               clusterObj=NULL,
               keepGDS=FALSE)
```
Here we load the data and glimpse the first few values in each column outputted from the SNP*interaction survival analyses using `print.covs="only"`.

```{r, message=FALSE, eval=FALSE}
impute2_res_intx <- read.table("impute_example_intx.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(impute2_res_intx))
```


```{r, message=FALSE, echo=FALSE}
impute2_res_intx <- read.table(dir(tempdir(), pattern="^impute_example_intx.*\\.coxph$", full.names=TRUE), sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(impute2_res_intx))
```

# plinkCoxSurv

# Batch Examples

## Batch Example sangerCoxSurv
Batch jobs for multiple analyses and different subsets are easy to implement using `gwasurvivr`. These types of analyses should be reserved for usage on a UNIX-based high performance computing cluster. This is facilitated by the package `batch`, which can internalize R variables from bash. First write an R script (e.g. `mysurvivalscript.R`) to pass in bash.

**Note: it is important to refer to the Getting Started part of the manual for packages that need to be installed for this**

```{r, eval=FALSE}
## mysurvivalscript.R
library(gwasurvivr)
library(batch)

parseCommandArgs(evaluate=TRUE) # this is loaded in the batch package

options("gwasurvivr.cores"=4)

vcf.file <- system.file(package="gwasurvivr",
                        "extdata", 
                        vcf.file)
pheno.fl <- system.file(package="gwasurvivr",
                        "extdata", 
                        pheno.file)
pheno.file <- read.table(pheno.fl, 
                         sep=" ",
                         header=TRUE, 
                         stringsAsFactors = FALSE)
# recode sex column and remove first column 
pheno.file$SexFemale <- ifelse(pheno.file$sex=="female", 1L, 0L)
# select only experimental group sample.ids
sample.ids <- pheno.file[pheno.file$group=="experimental",]$ID_2
## -- unlist the covariates 
## (refer below to the shell script as to why we are doing this)
covariates <- unlist(sapply(covariates, strsplit, "_", 1, "[["))

sangerCoxSurv(vcf.file=vcf.file,
              covariate.file=pheno.file,
              id.column="ID_2",
              sample.ids=sample.ids,
              time.to.event=time,
              event=event,
              covariates=covariates,
              inter.term=NULL,
              print.covs="only",
              out.file=out.file,
              info.filter=0.3,
              maf.filter=0.005,
              chunk.size=100,
              verbose=TRUE,
              clusterObj=NULL)
```

Now we can run a shell script. This can be used well with manifest files to set up multiple runs with different outcomes and different subsets of samples. We define a manifest file has columns that corresond to the functions that the user wants to pass and each row is a separate analysis that a user may want to run. The covariates are separated by an underscore (`"_"`). This is so it can be passed properly, and also why we used `str_split` to split the covariates. 

```{r, eval=FALSE}
#!/bin/bash
DIRECTORY=/path/to/dir/impute_chr

module load R 

R --script ${DIRECTORY}/survival/code/mysurvivalscript.R -q --args \
        vcf.file ${DIRECTORY}/sanger.pbwt_reference_impute.vcf.gz \
        pheno.file ${DIRECTORY}/phenotype_data/simulated_pheno.txt \
        covariates DrugTxYes_age_SexFemale\
        time.to.event time \
        event event \
        out.file ${DIRECTORY}/survival/results/sanger_example_output
```

The file paths above are completely arbitrary and were just used as an example of how file structure may be and where desirable output would be stored.

## Batch Example impute2CoxSurv
Exactly the same as for `sangerCoxSurv` but this time with the input arguments for `impute2CoxSurv`. See `?impute2CoxSurv` for help

## Batch Example michiganCoxSurv
Exactly the same as for `sangerCoxSurv` but this time with the input arguments for `michiganCoxSurv`. See `?michiganCoxSurv` for help

# Session info {.unnumbered}
Here is the output of `sessionInfo()` on the system that this document was compiled:
```{r, echo=FALSE}
sessionInfo()
```

# References {.unnumbered}
1. Terry M. Therneau and Patricia M. Grambsch (2000). Modeling Survival Data: Extending the Cox Model. Springer, New York. ISBN 0-387-98784-3.  

2. Martin Morgan, Valerie Obenchain, Jim Hester and Hervé Pagès (2017). SummarizedExperiment: SummarizedExperiment container. R package version 1.6.3.  

3. Gogarten SM, Bhangale T, Conomos MP, Laurie CA, McHugh CP, Painter I, Zheng X, Crosslin DR, Levine D, Lumley T, Nelson SC, Rice K, Shen J, Swarnkar R, Weir BS and Laurie CC (2012). “GWASTools: an R/Bioconductor package for quality control and analysis of genome-wide association studies.” Bioinformatics, 28(24), pp. 3329-3331. doi: 10.1093/bioinformatics/bts610.  

4. B. N. Howie, P. Donnelly and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529  

5. Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, Vrieze S, Chew EY, Levy S, McGue M, Schlessinger D, Stambolian D, Loh PR, Iacono WG, Swaroop A, Scott LJ, Cucca F, Kronenberg F, Boehnke M, Abecasis GR, Fuchsberger C. **Next-generation genotype imputation service and methods.** Nature Genetics 48, 1284–1287 (2016). [27571263](https://www.ncbi.nlm.nih.gov/pubmed/27571263)  

6. Efficient haplotype matching and storage using the Positional Burrows-Wheeler Transform (PBWT)", Richard Durbin Bioinformatics 30:1266-72 (2014).