--- title: "Standardise the format of summary statistics from GWAS with *MungeSumstats*" author: "Alan Murphy and Nathan Skene" date: "`r Sys.Date()`" output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 3 code_folding: show df_print: paged csl: nature.csl vignette: > %\VignetteIndexEntry{Standardise the format of summary statistics from GWAS with MungeSumstats} %\usepackage[utf8]{inputenc} %\VignetteEngine{knitr::rmarkdown} bibliography: MungeSumstats.bib editor_options: markdown: wrap: 72 --- ```{r style, echo=FALSE, results='asis', message=FALSE} BiocStyle::markdown() knitr::opts_chunk$set(tidy = FALSE, message = FALSE) ``` # Citation If you use the MungeSumstats package, please cite [Murphy, et al. MungeSumstats: A Bioconductor package for the standardisation and quality control of many GWAS summary statistics](https://www.biorxiv.org/content/10.1101/2021.06.21.449239v1). # Overview The *MungeSumstats* package is designed to facilitate the standardisation of GWAS summary statistics as utilised in our Nature Genetics paper [@Skene2018]. The package is designed to handle the lack of standardisation of output files by the GWAS community. There is a group who have now manually standardised many GWAS: [R interface to the IEU GWAS database API • ieugwasr](https://mrcieu.github.io/ieugwasr/) and [gwasvcf](https://github.com/MRCIEU/gwasvcf) but because a lot of GWAS remain closed access, these repositories are not all encompassing. The [GWAS-Download project](https://github.com/mikegloudemans/gwas-download) has collated summary statistics from 200+ GWAS. This repository has been utilsed to identify the most common formats, all of which can be standardised with *MungeSumstats*. Moreover, there is an emerging standard of VCF format for summary statistics files with multiple, useful, associated R packages such as *vcfR*. However, there is currently no method to convert VCF formats to a standardised format that matches older approaches. The *MungeSumstats* package standardises both VCF and the most common summary statistic file formats to enable downstream integration and analysis. *MungeSumstats* also offers a number of Quality Control (QC) steps which are important prerequisites for downstream analysis like Linkage disequilibrium score regression (LDSC) and MAGMA. # Aim *MungeSumstats* will ensure that the all essential columns for analysis are present and syntactically correct. Generally, summary statistic files include (but are not limited to) the columns: * SNP : SNP ID (rs IDs) * CHR : Chromosome number * BP : Base pair positions * A1 : Effect allele * A2 : Non-effect allele * Z : Z-score * BETA : Effect size estimate relative to the alternative allele * P : Unadjusted p-value for SNP * N : Sample size * INFO: The imputation information score * FRQ: The minor allele frequency (MAF) of the SNP Tests run by *MungeSumstats* include: * Check VCF format * Check tab, space or comma delimited * Check for header name synonyms * Check for multiple models or traits in GWAS * Check for uniformity in SNP ID - no mix of rs/missing rs/chr:bp * Check for CHR:BP:A2:A1 all in one column * Check for CHR:BP in one column * Check for A1/A2 in one column * Check if CHR and/or BP is missing (infer from reference genome) * Check if SNP ID is missing (infer from reference genome) * Check if A1 and/or A2 are missing (infer from reference genome) * Check that vital columns are present (SNP,CHR,BP,P,A1,A2) * Check for one signed/effect column (Z,OR,BETA,LOG_ODDS,SIGNED_SUMSTAT) * Check for missing data * Check for duplicated columns * Check for small p-values (lower than 5e-324) * Check N column is an integer * Check for SNPs with N greater than 5 times standard dev. plus the mean * Check SNPs are RS ID's * Check for duplicated rows, based on SNP ID * Check for duplicated rows, based on base-pair position * Check for SNPs on reference genome. Correct not found SNP IDs using CHR and BP (infer from reference genome) * Check INFO score * Check for strand-ambiguous SNPs * Check for non-biallelic SNPs (infer from reference genome) * Check for allele flipping * Check for SNPs on chromosome X, Y, and mitochondrial SNPs (MT) If a test is failed, the user will be notified and if possible, the input will be corrected. The QC steps from the checks above can also be adjusted to suit the user's analysis, see `MungeSumstats::format_sumstats`. # Data The *MungeSumstats* package contains small subsets of GWAS summary statistics files. Firstly, on Educational Attainment by Okbay et al 2016: PMID: 27898078 PMCID: PMC5509058 DOI: 10.1038/ng1216-1587b. Secondly, a VCF file (VCFv4.2) relating to the GWAS Amyotrophic lateral sclerosis from ieu open GWAS project. Dataset: ebi-a-GCST005647: https://gwas.mrcieu.ac.uk/datasets/ebi-a-GCST005647/ These datasets will be used to showcase MungeSumstats functionality. # Running MungeSumstats *MungeSumstats* is available on Bioconductor. To install the package on Bioconductor run the following lines of code: ``` if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("MungeSumstats") ``` Once installed, load the package: ```{r setup} library(MungeSumstats) ``` To standardise the summary statistics' file format, simply call `format_sumstats()` passing in the path to your summary statistics file. You must also specify which genome build was used in the GWAS(GRCh37 or GRCh38). The reference genome is used for multiple checks like deriving missing data such SNP/BP/CHR/A1/A2 and for QC steps like removing non-biallelic SNPs or strand-ambiguous SNPs. The path to the reformatted summary statistics file is returned by the function call. Note that for a number of the checks implored by *MungeSumstats* a reference genome is used. If your GWAS summary statistics file of interest relates to *GRCh38*, you will need to install `SNPlocs.Hsapiens.dbSNP144.GRCh38` and `BSgenome.Hsapiens.NCBI.GRCh38` from Bioconductor as follows: ``` BiocManager::install("SNPlocs.Hsapiens.dbSNP144.GRCh38") BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38") ``` If your GWAS summary statistics file of interest relates to *GRCh37*, you will need to install `SNPlocs.Hsapiens.dbSNP144.GRCh37` and `BSgenome.Hsapiens.1000genomes.hs37d5` from Bioconductor as follows: ``` BiocManager::install("SNPlocs.Hsapiens.dbSNP144.GRCh37") BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5") ``` These may take some time to install and are not included in the package as some users may only need one of *GRCh37*/*GRCh38*. The Educational Attainment by Okbay GWAS summary statistics file is saved as a text document in the package's external data folder so we can just pass the file path to it straight to *MungeSumstats*. ```{r, message=TRUE} eduAttainOkbayPth <- system.file("extdata","eduAttainOkbay.txt", package="MungeSumstats") reformatted <- MungeSumstats::format_sumstats(path=eduAttainOkbayPth,ref_genome="GRCh37") ``` The hyperparameters `format_sumstats` in that control the level of QC conducted by *MungeSumstats* are: * **convert_small_p** Binary, should p-values < 5e-324 be converted to 0? Small p-values pass the R limit and can cause errors with LDSC/MAGMA and should be converted. Default is TRUE. * **convert_n_int** Binary, if N (the number of samples) is not an integer, should this be rounded? Default is TRUE. analysis_trait If multiple traits were studied, name of the trait for analysis from the GWAS. Default is NULL * **INFO_filter** 0-1 The minimum value permissible of the imputation information score (if present in sumstatsfile). Default 0.9 * **N_std** Numeric, the number of standard deviations above the mean a SNP's N is needed to be removed. Default is 5. * **rmv_chr** vector or character The chromosomes on which the SNPs should be removed. Use NULL if no filtering necessary. Default is X, Y and mitochondrial. * **on_ref_genome** Binary, should a check take place that all SNPs are on the reference genome by SNP ID. Any that aren't on SNPs not on the reference genome, will be corrected from the reference genome (if possible) from the chromosome and base pair postion data. Default is TRUE * **strand_ambig_filter** Binary, should SNPs with strand-ambiguous alleles be removed. Default is FALSE * **allele_flip_check** Binary, should the allele columns be chacked against reference genome to infer if flipping is necessary. Default is TRUE * **bi_allelic_filter** Binary, should non-biallelic SNPs be removed. Default is TRUE VCF files can also be standardised to the same format as other summary statistic files. A subset of the Amyotrophic lateral sclerosis GWAS from the ieu open GWAS project (a .vcf file) has been added to *MungeSumstats* to demonstrate this functionality.Simply pass the path to the file in the same manner you would for other summary statistic files: ```{r, message=TRUE} #save ALS GWAS from the ieu open GWAS project to a temp directory ALSvcfPth <- system.file("extdata","ALSvcf.vcf",package="MungeSumstats") #set a low INFO filter so we get a return reformatted_vcf <- MungeSumstats::format_sumstats(path=ALSvcfPth, ref_genome="GRCh37",INFO_filter=0.1) ``` # Future Enhancements The *MungeSumstats* package aims to be able to handle the most common summary statistic file formats including VCF. If your file can not be formatted by *MungeSumstats* feel free to report the bug on github: along with your summary statistic file header. We also encourage people to edit the code to resolve their particular issues too and are happy to incorporate these through pull requests on github. If your summary statistic file headers are not recognised by *MungeSumstats* but correspond to one of: ``` SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS, SIGNED_SUMSTAT, N, N_CAS, N_CON, NSTUDY, INFO or FRQ ``` feel free to update the `MungeSumstats::sumstatsColHeaders` following the approach in the data.R file and add your mapping. Then use a pull request on github and we will incorporate this change into the package. # Session Information ```{r, message=TRUE, echo=FALSE} sessionInfo() ``` # References