---
title: "Standardise the format of summary statistics from GWAS with *MungeSumstats*"
author: "Alan Murphy and Nathan Skene"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 3
    code_folding: show
    df_print: paged
csl: nature.csl
vignette: >
    %\VignetteIndexEntry{Standardise the format of summary statistics from GWAS with MungeSumstats} 
    %\usepackage[utf8]{inputenc}
    %\VignetteEngine{knitr::rmarkdown} 
bibliography: MungeSumstats.bib
editor_options: 
  markdown: 
    wrap: 72
---

```{r style, echo=FALSE, results='asis', message=FALSE}
BiocStyle::markdown()
knitr::opts_chunk$set(tidy         = FALSE,
                      message      = FALSE)

```

# Citation

If you use the MungeSumstats package, please cite

[Murphy, et al. MungeSumstats: A Bioconductor package for the standardisation and quality control of many GWAS summary statistics](https://www.biorxiv.org/content/10.1101/2021.06.21.449239v1).

# Overview

The *MungeSumstats* package is designed to facilitate the
standardisation of GWAS summary statistics as utilised in our Nature
Genetics paper [@Skene2018].

The package is designed to handle the lack of standardisation of output
files by the GWAS community. There is a group who have now manually
standardised many GWAS: [R interface to the IEU GWAS database API •
ieugwasr](https://mrcieu.github.io/ieugwasr/) and
[gwasvcf](https://github.com/MRCIEU/gwasvcf) but because a lot of GWAS
remain closed access, these repositories are not all encompassing.

The [GWAS-Download
project](https://github.com/mikegloudemans/gwas-download) has collated
summary statistics from 200+ GWAS. This repository has been utilsed to
identify the most common formats, all of which can be standardised with
*MungeSumstats*.

Moreover, there is an emerging standard of VCF format for summary
statistics files with multiple, useful, associated R packages such as
*vcfR*. However, there is currently no method to convert VCF formats to
a standardised format that matches older approaches.

The *MungeSumstats* package standardises both VCF and the most common
summary statistic file formats to enable downstream integration and
analysis.

*MungeSumstats* also offers a number of Quality Control (QC) steps 
which are important prerequisites for downstream analysis like Linkage 
disequilibrium score regression (LDSC) and MAGMA.

# Aim

*MungeSumstats* will ensure that the all essential columns for analysis are 
present and syntactically correct. Generally, summary statistic files include 
(but are not limited to) the columns: 

* SNP : SNP ID (rs IDs) 
* CHR : Chromosome number 
* BP : Base pair positions
* A1 : Effect allele 
* A2 : Non-effect allele
* Z : Z-score
* BETA : Effect size estimate relative to the alternative allele 
* P : Unadjusted p-value for SNP
* N : Sample size
* INFO: The imputation information score
* FRQ: The minor allele frequency (MAF) of the SNP

Tests run by *MungeSumstats* include:

* Check VCF format
* Check tab, space or comma delimited
* Check for header name synonyms
* Check for multiple models or traits in GWAS
* Check for uniformity in SNP ID - no mix of rs/missing rs/chr:bp 
* Check for CHR:BP:A2:A1 all in one column
* Check for CHR:BP in one column
* Check for A1/A2 in one column
* Check if CHR and/or BP is missing (infer from reference genome)
* Check if SNP ID is missing (infer from reference genome)
* Check if A1 and/or A2 are missing (infer from reference genome)
* Check that vital columns are present (SNP,CHR,BP,P,A1,A2)
* Check for one signed/effect column (Z,OR,BETA,LOG_ODDS,SIGNED_SUMSTAT)
* Check for missing data
* Check for duplicated columns
* Check for small p-values (lower than  5e-324)
* Check N column is an integer
* Check for SNPs with N greater than 5 times standard dev. plus the mean  
* Check SNPs are RS ID's
* Check for duplicated rows, based on SNP ID
* Check for duplicated rows, based on base-pair position
* Check for SNPs on reference genome. Correct not found SNP IDs using CHR and BP (infer from reference genome)
* Check INFO score
* Check for strand-ambiguous SNPs
* Check for non-biallelic SNPs (infer from reference genome)
* Check for allele flipping
* Check for SNPs on chromosome X, Y, and mitochondrial SNPs (MT)
  
If a test is failed, the user will be notified and if possible, the input will 
be corrected. The QC steps from the checks above can also be adjusted to suit 
the user's analysis, see `MungeSumstats::format_sumstats`.

# Data

The *MungeSumstats* package contains small subsets of GWAS summary
statistics files. Firstly, on Educational Attainment by Okbay et al 2016: PMID:
27898078 PMCID: PMC5509058 DOI: 10.1038/ng1216-1587b.

Secondly, a VCF file (VCFv4.2) relating to the GWAS Amyotrophic lateral 
sclerosis from ieu open GWAS project. Dataset: ebi-a-GCST005647:
https://gwas.mrcieu.ac.uk/datasets/ebi-a-GCST005647/

These datasets will be used to showcase MungeSumstats functionality.

# Running MungeSumstats

*MungeSumstats* is available on Bioconductor. To install the package on 
Bioconductor run the following lines of code:

```
if (!require("BiocManager"))
    install.packages("BiocManager")

BiocManager::install("MungeSumstats")
```
Once installed, load the package:

```{r setup}
library(MungeSumstats)
```

To standardise the summary statistics' file format, simply call
`format_sumstats()` passing in the path to your summary statistics file.
You must also specify which genome build was used in the GWAS(GRCh37 or GRCh38).
The reference genome is used for multiple checks like deriving missing 
data such SNP/BP/CHR/A1/A2 and for QC steps like removing non-biallelic
SNPs or strand-ambiguous SNPs. The path to the reformatted summary statistics 
file is returned by the function call.

Note that for a number of the checks implored by *MungeSumstats* a reference 
genome is used. If your GWAS summary statistics file of interest relates to
*GRCh38*, you will need to install `SNPlocs.Hsapiens.dbSNP144.GRCh38` and 
`BSgenome.Hsapiens.NCBI.GRCh38` from Bioconductor as follows:
```
BiocManager::install("SNPlocs.Hsapiens.dbSNP144.GRCh38")
BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38")
```
If your GWAS summary statistics file of interest relates to *GRCh37*, you will 
need to install `SNPlocs.Hsapiens.dbSNP144.GRCh37` and 
`BSgenome.Hsapiens.1000genomes.hs37d5` from Bioconductor as follows:
```
BiocManager::install("SNPlocs.Hsapiens.dbSNP144.GRCh37")
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
```
These may take some time to install and are not included in the package as some 
users may only need one of *GRCh37*/*GRCh38*.

The Educational Attainment by Okbay GWAS summary statistics file is saved as a 
text document in the package's external data folder so we can just pass the file
path to it straight to *MungeSumstats*.

```{r, message=TRUE}
eduAttainOkbayPth <- system.file("extdata","eduAttainOkbay.txt",
                                  package="MungeSumstats")
reformatted <- 
  MungeSumstats::format_sumstats(path=eduAttainOkbayPth,ref_genome="GRCh37")
```

The hyperparameters `format_sumstats` in that control the level of QC conducted 
by *MungeSumstats* are:

* **convert_small_p** Binary, should p-values < 5e-324 be converted to 0? Small p-values pass the R limit and can cause errors with LDSC/MAGMA and should be converted. Default is TRUE.
* **convert_n_int** Binary, if N (the number of samples) is not an integer, should this be rounded? Default is TRUE.
analysis_trait If multiple traits were studied, name of the trait for analysis from the GWAS. Default is NULL
* **INFO_filter** 0-1 The minimum value permissible of the imputation information score (if present in sumstatsfile). Default 0.9
* **N_std** Numeric, the number of standard deviations above the mean a SNP's N is needed to be removed. Default is 5.
* **rmv_chr** vector or character The chromosomes on which the SNPs should be removed. Use NULL if no filtering necessary. Default is X, Y and mitochondrial. 
* **on_ref_genome** Binary, should a check take place that all SNPs are on the reference genome by SNP ID. Any that aren't on SNPs not on the reference genome, will be corrected from the reference genome (if possible) from the chromosome and base pair postion data. Default is TRUE
* **strand_ambig_filter** Binary, should SNPs with strand-ambiguous alleles be removed. Default is FALSE
* **allele_flip_check** Binary, should the allele columns be chacked against reference genome to infer if flipping is necessary. Default is TRUE
* **bi_allelic_filter** Binary, should non-biallelic SNPs be removed. Default is TRUE

VCF files can also be standardised to the same format as other summary statistic
files. A subset of the Amyotrophic lateral sclerosis GWAS from the ieu open GWAS
project (a .vcf file) has been added to *MungeSumstats* to demonstrate this 
functionality.Simply pass the path to the file in the same manner you would for 
other summary statistic files:

```{r, message=TRUE}
#save ALS GWAS from the ieu open GWAS project to a temp directory
ALSvcfPth <- system.file("extdata","ALSvcf.vcf",package="MungeSumstats")
#set a low INFO filter so we get a return
reformatted_vcf <- 
  MungeSumstats::format_sumstats(path=ALSvcfPth, ref_genome="GRCh37",INFO_filter=0.1)
```

# Future Enhancements

The *MungeSumstats* package aims to be able to handle the most common
summary statistic file formats including VCF. If your file can not be
formatted by *MungeSumstats* feel free to report the bug on github:
<https://github.com/neurogenomics/MungeSumstats> along with your summary
statistic file header. 

We also encourage people to edit the code to resolve their particular issues 
too and are happy to incorporate these through pull requests on github. If your
summary statistic file headers are not recognised by *MungeSumstats* but 
correspond to one of: 
```
SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS,
SIGNED_SUMSTAT, N, N_CAS, N_CON, NSTUDY, INFO or FRQ 
```
feel free to update the `MungeSumstats::sumstatsColHeaders` following the 
approach in the data.R file and add your mapping. Then use a pull request on 
github and we will incorporate this change into the package.

# Session Information
```{r, message=TRUE, echo=FALSE}
sessionInfo()
```

# References