\documentclass{article} %\VignetteIndexEntry{CalculatingrfPredscoreswithpackagerfPred} <>= BiocStyle::latex() @ \title{Calculating rfPred scores with package rfPred} \author{Hugo Varet, Fabienne Jabot-Hanin and Jean-Philippe Jais} \begin{document} \maketitle \newcommand{\up}[1]{\textsuperscript{#1}} \DefineVerbatimEnvironment{Sinput}{Verbatim}{formatcom = {\color[rgb]{0, 0, 0.56}}} \DefineVerbatimEnvironment{Soutput}{Verbatim}{formatcom = {\color[rgb]{0.56, 0, 0}}} \section{Background} Exome sequencing is becoming a standard tool for gene mapping of monogenic diseases. Given the vast amount of data generated by Next Generation Sequencing techniques, identification of disease causal variants is like finding a needle in a haystack. The impact assessment and the prioritization of potential pathogenic variants are expected to reduce work in biological validation, which is long and costly. \\ One of the possible approaches to determine the most probable deleterious variants in individual exomes is to use protein function alteration prediction algorithms. This package proposes a new method \cite{rfPred} based on five previously described algorithms (SIFT \cite{SIFT}, Polyphen2 \cite{Polyphen2}, LRT \cite{LRT}, PhyloP \cite{PhyloP} and MutationTaster \cite{MutationTaster}) compiled in the dbNSFP database \cite{dbNSFP}. A functional meta-score is derived from a random forest method trained on a dataset of 61,500 non-synonymous SNPs. On Two independent validation datasets, the random forest method appears to be globally better than each of the algorithms separately or in combination in a logistic regression model. rfPred scores have been pre-calculated and made available for all the possible non-synonymous SNPs of human exome. \section{Computing rfPred scores} After having launched the \R{} software, the user must load the \Rpackage{rfPred} package with: <<>>= library(rfPred) @ Several packages whose \Rpackage{rfPred} is depending on will also be loaded (especially \Biocpkg{Rsamtools}). The package contains one main function - \Rfunction{rfPred\_scores()} - which returns the rfPred scores. It works as follows: the user gives as input a list of variants for which he/she wants the corresponding scores and the function looks for these variants in a data base which stores the pre-calculated scores. It avoids re-calculating the scores for each request. The list of variants can ben either a \Rclass{data.frame}, or the path to a VCF (Variant Call Format) file as a \Rclass{character} string or a \Rclass{GRanges} object (see package \Biocpkg{GenomicRanges} on \Bioconductor{} \cite{GR}). For example, the list of variants can be the data frame \Robject{variant\_list\_Y}: <<>>= data(variant_list_Y) print(variant_list_Y) @ \Robject{variant\_list\_Y} is a \Rclass{data.frame} which contains four or five columns respecting this order: the chromosome, the HG19 position on the chromosome, the reference nucleotid allele, the alteration nucleotid allele and optionally the protein (Uniprot accession number). Here, this \Rclass{data.frame} is part of the package, but it is possible to import data from an Excel or CSV file in \R{} (see \R{} documentation). The user can then run the function: <<>>= result=rfPred_scores(variant_list=variant_list_Y, data=system.file("extdata/chrY_rfPred.txtz", package="rfPred"), index=system.file("extdata/chrY_rfPred.txtz.tbi", package="rfPred")) print(result) @ The argument \Robject{data} is the path to the data base (TabixFile) and the argument \Robject{index} is the path to the TabixFile index. Note that the default paths correspond to the example chromosome Y but the user can download the entire data base at \url{https://doi.org/10.5281/zenodo.7127736} (see section \ref{database} for details). \\ It is also possible to look for variants without specifying the protein. In this case, the user gives only the columns 1-4 as input: <<>>= result2=rfPred_scores(variant_list=variant_list_Y[,1:4], data=system.file("extdata/chrY_rfPred.txtz", package="rfPred"), index=system.file("extdata/chrY_rfPred.txtz.tbi", package="rfPred")) print(result2) @ The object \Robject{result2} contains more lines (13) than the object \Robject{result} (5) because the user does not impose any matching on the protein.\\ \section{TabixFile and TabixFile index} \label{database} To use the entire TabixFile/index containing all the chromosomes, the user must download the two files (about 3.3 Go) in order to use them locally on his/her computer. They are available on Zenodo at \url{https://doi.org/10.5281/zenodo.7127736}. In this case, the arguments \Robject{data} and \Robject{index} must be the paths to the TabixFile and index on the user's computer. \section{Exporting the results} In order to share the results with non-\R{} users, the \Rfunction{rfPred\_scores()} function allows one to export the results in a CSV file thanks to the \Robject{file.export} argument: <<>>= result3=rfPred_scores(variant_list=variant_list_Y, data=system.file("extdata/chrY_rfPred.txtz", package="rfPred"), index=system.file("extdata/chrY_rfPred.txtz.tbi", package="rfPred"), file.export="results.csv") @ The CSV file "results.csv" will be created in the current working directory of the \R{} session. \section{Number of cores} Computers with multi-core processors are able to run several tasks simultaneously. The \Robject{n.cores} argument of the \Rfunction{rfPred\_scores()} function exploits this possibility. Specifying \Robject{n.cores=4} instead of \Robject{n.cores=1} divides the computational time by about 2.5. For example, the function ran during 106 second on a 121,606 rows variants list using a 3.3 GHz Intel\textregistered Core\texttrademark i5 computer. However, for small requests, one should use only one core to avoid the overhead time due to opening hidden \R{} sessions and lauching the dependent packages. \section{rfPred random forest model} One may want to compute the rfPred score from his/her own SIFT, Polyphen2, LRT, PhyloP and MutationTaster scores. This functionnality does not appear within the package but we have made the model available on our website (\url{http://www.sbim.fr/rfPred/}) in a .RData file (about 52 Mo). The \R{} object \Robject{rfPred\_model} has to be given as input of the \Rfunction{predict} function of the \CRANpkg{randomForest} package in addition of the 5 scores. The scores used to build the random forest are \cite{dbNSFPweb}: \begin{itemize} \item SIFT\_score = $1-$original SIFT score; \item Polyphen2\_score = original HVAR Polyphen2 score; \item MutationTaster\_score = original MutationTaster score; \item PhyloP\_score = $1 - 0.5 \times 10$\up{phyloP} if phyloP $\geq 0$ or $0.5 \times 10$\up{-phyloP} if phyloP $<0$; \item LRT\_score = $1 - 0.5\ \times$ LRToriginal if LRT\_Omega $<1$ or $0.5\ \times$ LRToriginal if LRT\_Omega $\geq 1$. \end{itemize} Note that all these scores have to be stored in a \Rclass{data.frame} whose column names are textually \Robject{SIFT\_score}, \Robject{PhyloP\_score}, \Robject{Polyphen2\_score}, \Robject{LRT\_score} and \Robject{MutationTaster\_score} in order to match with the model parameters. \begin{thebibliography}{9} \bibitem{rfPred} Jabot-Hanin F and Varet H and Tores F and Jais JP, \emph{rfPred: a new meta-score for functional prediction of missense variants in human exome}. (submitted) 2013. \bibitem{SIFT} Kumar P and Henikoff S and Ng PC, \emph{Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm}. Nat Protoc 2009, VOL.4 NO.8. \bibitem{Polyphen2} Adzhubei IA and Schmidt S and Peshkin L et al, \emph{A method and server for predicting damaging missense mutations}. Nat Methods 2010 7(4):248-249. \bibitem{LRT} Muse SV and Gaut BS, \emph{A Likelihood Approach for Comparing Synonymous and Nonsynonymous Nucleotide Substitution Rates, with Application to the Chloroplast Genome}. Mol Biol Evol 1994 1(5):175-724. \bibitem{PhyloP} Margulies EH and Cooper GM and Asimenos G et al, \emph{Analyses of deep mammalian sequence alignments and constraint predictions for 1\% of the human genome}. Genome Res 2007 17:760-774 \bibitem{MutationTaster} Schwarz JM and Rodelsperger and Schuelke M and Seelow D, \emph{MutationTaster evaluates diseasecausing potential of sequence alterations}. Nat Methods 2010 7(8) \bibitem{dbNSFP} Liu X and Jian X and and Boerwinkle E, \emph{dbNSFP: a lightweight database of human non-synonymous SNPs and their functional predictions}. Hum Mutat 2011 32:894-899. \bibitem{GR} Aboyoun P and Pages H and Lawrence M, \emph{GenomicRanges: Representation and manipulation of genomic intervals}. R package version 1.12.4. \bibitem{dbNSFPweb} dbNSFP v1.3 READ-ME, \url{http://dbnsfp.houstonbioinformatics.org/dbNSFPzip/dbNSFPv1.3.readme.txt}. \end{thebibliography} \end{document}