\documentclass[a4paper]{article}
%\usepackage[british,UKenglish,USenglish,english,american]{babel}
\usepackage[none]{hyphenat}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{tabularx,ragged2e,booktabs,caption}
\usepackage{amsmath}
\usepackage[left=3cm,right=3cm,top=3cm,bottom=3cm]{geometry}
\usepackage{float}
\usepackage[linktoc=all]{hyperref}
\usepackage{titlesec}
\usepackage{pdflscape}

\hypersetup{
    colorlinks,
    citecolor=blue,
    filecolor=blue,
    linkcolor=blue,
    urlcolor=blue
}

\begin{document}
\newcommand{\subsubsubsection}[1]{\vspace{0.15cm}\noindent\textbf{#1}\vspace{0.15cm}}

%\VignetteIndexEntry{psygenet2r: An R package for querying PsyGeNET and to perform comorbidity studies in psychiatric disorders}

\title{\texttt{psygenet2r}: An R package for querying PsyGeNET and to perform comorbidity studies in psychiatric disorders}
\author{Alba Gutierrez-Sacristan \and Carles Hernandez-Ferrer \and Juan R. Gonzalez \and Laura I. Furlong}

\maketitle
\tableofcontents
\newpage

\section{Introduction}

The \texttt{psygenet2r} package contains functions to query PsyGeNET \cite{REF1}, a resource on psychiatric diseases and their genes. The package also includes analysis functions to study psychiatric diseases, their genes and disease comorbidities. Special emphasis is placed on the visualization of the results, with a variety of representation formats such as networks, heatmaps and barplots.

\subsection{Background}

\noindent In recent years there has been a growing interest in the genetics of psychiatric disorders, leading to a concomitant increase in the number of publications that report these studies \cite{REF2}. However, there is still limited understanding of the cellular and molecular mechanisms leading to psychiatric diseases, which has limited the application of this wealth of data in clinical practice. This situation also applies to psychiatric comorbidities.
Among the factors that explain the current situation are the heterogeneity of the information about psychiatric disorders, its fragmentation into knowledge silos, and the lack of resources that collect this wealth of data, integrate it, and supply it to the community in an intuitive, open-access manner. PsyGeNET has been developed to fill this gap. \texttt{psygenet2r} has been developed to facilitate statistical analysis of PsyGeNET data, allowing its integration with other R packages to build data analysis workflows.\\

\noindent PsyGeNET is a resource for the exploratory analysis of psychiatric diseases and their associated genes. The second release of PsyGeNET (version 2.0) contains updated information on depression, bipolar disorder, alcohol use disorders and cocaine use disorders, and has been expanded to cover other psychiatric diseases of interest: schizophrenia, substance-induced depressive disorder and psychoses, and cannabis use disorder. PsyGeNET allows the exploration of the molecular basis of psychiatric disorders by providing a comprehensive set of genes associated with each disease. Moreover, it allows the analysis of the molecular mechanisms underlying psychiatric disease comorbidities.\\

\noindent The PsyGeNET database is built from data extracted from the literature by text mining with BeFree \cite{REF4}, followed by manual curation by domain experts. A team of 22 experts participates as curators of the database.
The current version of PsyGeNET (version 2.0) contains 3,771 associations between 1,549 genes and 117 psychiatric disease concepts.\\

\noindent With the \texttt{psygenet2r} package the user can submit queries to PsyGeNET from R, perform a variety of analyses on the data, and visualize the results through different types of graphical representations.\\

\noindent The tasks that can be performed with the \texttt{psygenet2r} package are the following:

\begin{enumerate}
\item Retrieve Gene-Disease Associations (GDAs) from PsyGeNET using a gene or a disease of interest (a single item or a set of genes/diseases) as the query
\item Visualize the results according to the GDAs' attributes: PsyGeNET evidence index, number of publications, sentences that report the GDA, source database
\item Visualize the results according to the disease (disease class) or gene (Panther class) attributes
\item Analyze the association between two diseases from the molecular perspective (using the Jaccard index)
\end{enumerate}

\noindent The following sections present the specific functions that address each of these tasks.\\

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Installation}

\noindent The package \texttt{psygenet2r} is provided through Bioconductor. To install \texttt{psygenet2r} the user must type the following two commands in an R session:

<>=
source( "http://bioconductor.org/biocLite.R" )
biocLite( "psygenet2r" )
@

<>=
library( psygenet2r )
@

<>=
suppressMessages(suppressWarnings(library( psygenet2r )))
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\texttt{DataGeNET.Psy} object}

\noindent A \texttt{DataGeNET.Psy} object is obtained when the \texttt{psygenetGene} and \texttt{psygenetDisease} functions are applied.
This object is used as input for the rest of the \texttt{psygenet2r} functions, such as the \texttt{plot} function.\\

\noindent A \texttt{DataGeNET.Psy} object contains all the information retrieved from PsyGeNET about the diseases/genes associated with the gene/disease of interest. It also contains a summary of the search: the search input (gene or disease), the selected database, the gene or disease identifier, the number of associations found (N. Results) and the number of unique results obtained (U. Results).\\

<>=
t1 <- psygenetGene( gene = 4852, database = "ALL")
@

<>=
t1
class( t1 )
@

\noindent This object comes with a series of functions that let users interact with the information retrieved from PsyGeNET: \texttt{ngene}, \texttt{ndisease}, \texttt{extract} and \texttt{plot}. The function \texttt{ngene} returns the number of genes retrieved for a given query, and \texttt{ndisease} is its counterpart for diseases. The function \texttt{extract} returns a formatted \texttt{data.frame} with the complete set of information downloaded from PsyGeNET. Finally, the \texttt{plot} function visualizes the results in a variety of ways, such as gene-disease association networks or heatmaps.\\

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{PsyGeNET and \texttt{psygenet2r}}

\noindent The PsyGeNET web interface can be explored by searching for a specific gene or a specific disease, and the \texttt{psygenet2r} package offers the same options. Therefore, the starting points for \texttt{psygenet2r} are the \texttt{psygenetGene} and \texttt{psygenetDisease} functions.\\

\noindent PsyGeNET data is classified according to the database used as a source of information (``source database''). Therefore, any query run on PsyGeNET requires specifying the source database with the \texttt{database} argument.
Table \ref{tab:psygenet_databases} shows the source databases in PsyGeNET and their description. By default, the database \texttt{ALL} is used in \texttt{psygenet2r}. For illustration purposes, database \texttt{ALL} is used in most of the code snippets throughout this vignette.\\

\noindent
\begin{minipage}{\linewidth}
\small{
\centering
\captionof{table}{Source databases included in PsyGeNET} \label{tab:psygenet_databases}
\begin{tabular}{|l|l|}
\hline
Name & Description \\
\hline
\texttt{psycur15} & Genes associated with DEP, BD, AUD and CUD between 1980 and 2013 (PsyGeNET release v1.0) \\
\texttt{psycur16} & Genes associated with DEP, BD, AUD, CUD, SCHZ, S-DEP, CanUD and D-PSY between 1980 and 2015 \\
\texttt{ALL} & All previous databases \\
\hline
\end{tabular}
\par
%\bigskip
}
\end{minipage}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Retrieve gene-disease associations (GDAs) from psygenet2r}

\subsection{Using genes as a query}

The \texttt{psygenet2r} package allows PsyGeNET information to be explored using a specific gene or a list of genes. It retrieves the information that is available in PsyGeNET (associated diseases, source database, PsyGeNET evidence index, number of publications, attributes of genes, etc.) and allows the results to be visualized in different ways.

\subsubsection{Using as a query a single gene}

\noindent To look up a single gene in PsyGeNET, we can use the \texttt{psygenetGene} function. This function retrieves PsyGeNET information using either the NCBI gene identifier or the official HUGO gene symbol. It also accepts other arguments, such as the database to query and the PsyGeNET evidence index (\texttt{score} argument).\\

\noindent As an example, the gene \textit{NPY}, whose Entrez identifier is \textsl{4852}, is queried with the \texttt{psygenetGene} function, first by identifier and then by its official HUGO gene symbol.
In this example, database \texttt{"ALL"} is used.\\

<>=
t1 <- psygenetGene( gene = 4852, database = "ALL")
t1
@

<>=
t2 <- psygenetGene( gene = "NPY", database = "ALL" )
t2
@

\noindent Both cases result in a \texttt{DataGeNET.Psy} object:

<>=
class( t1 )
class( t2 )
@

\noindent In this particular example, by inspecting the \texttt{DataGeNET.Psy} object, we can see that the gene \textit{NPY} is associated with 13 different diseases in PsyGeNET (with no restriction on the PsyGeNET evidence index).

\subsubsection{Plotting the results of a Single Gene Query}

\noindent \texttt{psygenet2r} offers several options to visualize the results from PsyGeNET: a network showing the diseases related to the gene of interest, or a network showing the strength of the association between the eight main psychiatric disorders in PsyGeNET and the gene of interest. Each of these graphics can be obtained by changing the \texttt{type} argument.\\

\noindent By default, \texttt{psygenet2r} shows the first type of network when plotting a \texttt{DataGeNET.Psy} object obtained from a gene query. The result is a network where green nodes are diseases and the orange node is the gene of interest.\\

\begin{figure}
\begin{center}
<>=
plot( t1, type = "individual disease" )
@
\end{center}
\end{figure}

\noindent Alternatively, results can be visualized according to the 8 psychiatric disorder classes available in PsyGeNET (depression, bipolar disorder, alcohol use disorders, cocaine use disorder, schizophrenia, cannabis use disorder, substance-induced depressive disorder and substance-induced psychoses) by setting the \texttt{type} argument to \texttt{"disease class"}. As a result, a network with 5 nodes is obtained.
The size of each psychiatric disorder node is proportional to the number of disease concepts that belong to that disease class, out of the total number of diseases associated with the gene.\\

\begin{figure}[H]
\begin{center}
<>=
plot( t1, type = "disease class" )
@
\end{center}
\end{figure}

\noindent In our example, NPY is associated with four of the eight psychiatric disorders present in PsyGeNET, with an important contribution of depression.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Using as a query a list of genes}

\noindent In the same way, \texttt{psygenet2r} allows PsyGeNET to be queried with a list of genes of interest. The same function, \texttt{psygenetGene}, accepts a vector of NCBI gene identifiers or official HUGO gene symbols.\\

\noindent To illustrate this functionality, a list of 20 genes was extracted from the article entitled \textsl{"The Genetics of Major Depression"} \cite{REF5}, where these genes are associated with depression. The vector of genes can be defined as follows:

<>=
genesOfInterest <- c( "COMT", "CLOCK", "DRD3", "GNB3", "HTR1A",
    "MAOA", "HTR2A", "HTR2C", "HTR6", "SLC6A4", "ACE", "BDNF",
    "DRD4", "HTR1B", "HTR2B", "HTR2C", "MTHFR", "SLC6A3", "TPH1",
    "SLC6A2", "GABRA3" )
@

\noindent Then, the function \texttt{psygenetGene} is applied. In this case an extra argument called \texttt{verbose} is set to \texttt{TRUE}. This prints some information during the query process, for example a message reporting that there are repeated genes in the list (the gene \textsl{HTR2C} was deliberately placed twice in the list to raise this message) and a message reporting that one or more of the given genes is not in PsyGeNET (in this case the gene \textsl{HTR2B}).\\

<>=
m1 <- psygenetGene( gene = genesOfInterest, database = "ALL", verbose = TRUE )
@

\noindent A \texttt{DataGeNET.Psy} object is obtained.
In this particular example, 19 genes are present in PsyGeNET and are associated with 42 diseases, involving 212 GDAs.\\

<>=
m1
@

\subsubsection{Plotting the results of the query using a list of genes}

\noindent \texttt{psygenet2r} provides several options to visualize the results of these queries, such as networks, heatmaps and barplots.\\

\noindent As in the single-gene example, the default output in \texttt{psygenet2r} is a network chart, where the green nodes represent diseases and the orange nodes represent genes. It is also possible to group the diseases according to the psychiatric disorders present in PsyGeNET by specifying the \texttt{type} argument.\\

\begin{figure}[H]
\begin{center}
<>=
plot( m1 )
@
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
<>=
plot( m1, type = "disease class" )
@
\end{center}
\end{figure}

\noindent The \texttt{psygenet2r} package can also display the GDA attributes in a heatmap. The argument \texttt{type} must be \texttt{"heatmapGenes"}, and the user can filter by PsyGeNET evidence index by setting the \texttt{cut-off} argument to the evidence index of interest. In this example, the cut-off is set to 0 in order to obtain all the results. If we set the cut-off to 0.5, only those associations with at least half of the publications supporting the association will be shown. In this kind of representation we can identify genes that are associated with several diseases (e.g. SLC6A4) and others that are associated with only one disease (e.g. HTR6), and we can visualize the evidence index for each association.\\

\noindent Note that heatmap cells can be coloured in green, yellow or red.
Green represents those GDAs where all the evidence reviewed by the experts supports the existence of an association between the gene and the disease (Association, EI = 1); yellow indicates contradictory evidence for the GDA (some publications support the association while other publications do not, 1 > EI > 0); and red indicates that all the evidence reviewed by the experts reports no association between the gene and the disease (No Association, EI = 0).\\

\begin{landscape}
<>=
plot( m1, type="heatmapGenes" )
@
\end{landscape}

\noindent In this example we can see that there are 3 GDAs in red (GNB3-chronic alcoholic intoxication, HTR2A-drug psychoses and ACE-major affective disorder 2), indicating that all publications report that there is no association.\\

\noindent An alternative graphical representation is a heatmap of psychiatric disorders based on the percentage of diseases with which each gene is associated. It makes it possible to analyze whether the genes under study are specifically associated with one disorder subtype or with several subtypes within the same psychiatric disorder class. For each gene, the percentage of associated diseases within each psychiatric disorder is estimated. This percentage is relative to the total number of disorder subtypes present in PsyGeNET (33 for alcohol UD, 9 for bipolar disorder, 37 for depression, 24 for schizophrenia, 6 for cocaine UD, 3 for cannabis UD, 3 for DI-Psychosis and 2 for SI-Depression). The resulting values are represented in a heatmap on a blue color scale.\\

\noindent To obtain this graphic, \texttt{type} must be set to \texttt{"heatmap"}. The resulting heatmap shows which genes are more or less strongly associated with each of the psychiatric disorders.
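
\noindent The percentage described above can be written compactly. The symbols are introduced here for illustration only and do not come from the package: $n_{g,c}$ denotes the number of disease concepts of disorder class $c$ associated with gene $g$, and $N_c$ the total number of disease concepts of class $c$ in PsyGeNET.

\begin{equation*}
P_{g,c} = 100 \times \frac{n_{g,c}}{N_c}
\end{equation*}

\noindent For instance, a hypothetical gene associated with 3 of the 9 bipolar disorder concepts would obtain $P_{g,c} = 100 \times 3/9 \approx 33.3\%$ for that class.
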
<>=
plot( m1, type = "heatmap" )
@

\noindent As could be expected for this list of genes of interest, all of them are associated with depression, to a greater or lesser extent. Some of them are also associated with the other six psychiatric disorders, with alcohol UD and bipolar disorder following a similar pattern.\\

\noindent The \texttt{psygenet2r} package can also analyze a gene list according to the function of the proteins encoded by these genes. The PANTHER Protein Class Ontology classifies proteins according to their function.\\

\noindent The \texttt{pantherGraphic} function shows the Panther class to which the proteins belong according to their associated psychiatric disorder. It produces a graphic of the results of this analysis; its input is a list of genes and the database (\texttt{ALL}, \texttt{psycur15}, \texttt{psycur16}). The input genes can be given as a vector containing the genes of interest, or taken from the \texttt{DataGeNET.Psy} object returned by a disease or disease-list query. A \texttt{score} argument can be added to filter the results. The analysis can also be run on a \texttt{DataGeNET.Psy} object obtained by querying with a single gene.

<>=
genesOfInterest <- unique( genesOfInterest )
pantherGraphic( genesOfInterest, "ALL" )
@

\subsubsection{Plotting results of the multiple gene query}

\noindent \texttt{psygenet2r} offers several options to visualize the results of a query, starting either from a list of genes or from the genes contained in the \texttt{DataGeNET.Psy} object returned by a gene or gene-list query. Barplots and pie charts showing the gene attributes can be obtained by applying the \texttt{geneAttrPlot} function.\\

\noindent The \texttt{psygenet2r} package can show how many of our genes of interest are associated with each psychiatric disorder present in PsyGeNET and how many of them are exclusively associated with a particular psychiatric disorder.
This can be done by applying the \texttt{geneAttrPlot} function and setting the \texttt{type} argument to \texttt{"category"}.\\

<>=
geneAttrPlot( m1, type = "category" )
@

\noindent As a result, a barplot is obtained. The X axis contains the psychiatric disorders, sorted alphabetically, and the number of genes is represented on the Y axis. For each psychiatric disorder a maximum of two bars can be plotted. The blue bar represents the total number of genes for the psychiatric disorder. When some of these genes are specific to the disorder, a second bar, coloured in orange, is displayed.\\

\noindent Alternatively, the \texttt{psygenet2r} package can show, for each gene, how many disease concepts and how many psychiatric categories are associated with it. This can be done by applying the \texttt{geneAttrPlot} function and setting the \texttt{type} argument to \texttt{"gene"}.\\

<>=
geneAttrPlot( m1, type = "gene" )
@

\noindent As a result, a barplot is obtained. The X axis contains the genes, sorted alphabetically, and the number of CUIs and psychiatric categories is represented on the Y axis.
For each gene two bars are plotted: the purple bar represents the number of CUIs and the yellow one the number of categories.\\

\noindent In our example, the gene SLC6A4 is the one with the most associated CUIs (26) and categories (6).\\

\noindent If we are only interested in how many of our input genes are associated with each disease category, the \texttt{psygenet2r} package can display this in a pie chart by applying the \texttt{geneAttrPlot} function and setting the \texttt{type} argument to \texttt{"pie"}.\\

<>=
geneAttrPlot( m1, type = "pie" )
@

\noindent In our example, depression has 19 associated genes, and the majority of them are also associated with other psychiatric disorders, such as bipolar disorder (18 genes) and alcohol UD (16 genes).\\

\noindent In PsyGeNET it is important to keep track of both the ``positive'' and the ``negative'' findings, and to let users make their own judgements based on the available evidence. Thus, for each GDA and each supporting publication, the association type is provided. According to the evidence, there are two types: ``Association'' and ``No Association'' (i.e. the ``negative findings'').\\

The \texttt{psygenet2r} package can display this information in a barplot. This is done by applying the \texttt{geneAttrPlot} function and setting the \texttt{type} argument to \texttt{"index"}.
The result is a barplot showing, for each psychiatric disorder, the total number of gene-disease associations (red bar), how many of them are 100\% association (green bar), how many are 100\% no association (blue bar) and how many are supported by both association types (purple bar).\\

<>=
geneAttrPlot( m1, type = "index" )
@

\noindent In our example, the barplot shows that depression is the psychiatric disorder with the most gene-disease associations (127 GDAs), and there is no negative evidence for any of the associations.\\

\subsubsection{Enrichment analysis for a list of genes}

\noindent The R package \texttt{psygenet2r} can perform an enrichment analysis of a list of genes against the PsyGeNET diseases. This is done with the function \texttt{enrichedPD}. To illustrate this function, the previous list of 20 genes associated with depression \cite{REF5} is used.\\

<>=
tbl <- enrichedPD( genesOfInterest, database = "ALL")
tbl
@

\noindent The result is a table with the p-value of the enrichment of the given list of genes for each psychiatric disorder in PsyGeNET. With a p-value cut-off of 0.01, these genes are enriched in 5 of the 8 psychiatric disorders, with alcohol UD (p-value $\approx 9 \times 10^{-10}$) and depression (p-value $\approx 6 \times 10^{-9}$) showing the lowest p-values.\\

\subsubsection{Enrichment analysis based on anatomical terms (TopAnat) for a list of genes}

\noindent The \texttt{psygenet2r} package can perform a gene set enrichment test based on the expression of genes in anatomical structures, importing data from the Bgee database \cite{Bastian} and functions from the BgeeDB R package \cite{Komljenovic}.\\

\noindent This is done with the function \texttt{topAnatEnrichment}. This function performs the enrichment analysis using a list of genes (NCBI gene identifier or official Gene Symbol from HUGO).
It also accepts other arguments, such as \texttt{dataType} (\texttt{rna\_seq} or \texttt{affymetrix}), \texttt{statistics} (\texttt{fisher} by default) and \texttt{cutOff}.\\

<>=
tpAnat <- topAnatEnrichment( genesOfInterest, cutOff = 1 )
@

<>=
head( tpAnat )
@

\noindent The result is a data frame that contains the anatomical structures. Results are sorted by p-value, and FDR values are calculated.\\

\subsubsection{Sentences that report a GDA}

\noindent The \texttt{psygenet2r} package can also extract the sentences that report a gene-disease association from the supporting publications. This is done with two functions, \texttt{psygenetGeneSentences} and \texttt{extractSentences}. \texttt{psygenetGeneSentences} takes as input a gene list and a database to query. The output of this function is a \texttt{DataGeNET.Psy} object. This object is passed to the \texttt{extractSentences} function, which also needs the disorder of interest.\\

<>=
genesOfInterest
sss <- psygenetGeneSentences( geneList = genesOfInterest, database = "ALL")
sss
geneSentences <- extractSentences( object = sss, disorder = "alcohol abuse")
dim( geneSentences )
@

\noindent The result is a data frame that contains the gene symbol, gene identifier, disease name, original database, the PMID, the annotation type and the sentence.\\

\subsection{Using diseases as a query}

The \texttt{psygenet2r} package allows PsyGeNET information to be explored by searching for a specific disease or a list of diseases. As in the case of genes, it retrieves the information that is available in PsyGeNET and allows the results to be visualized in several ways.

\subsubsection{Using as a query a single disease}

\noindent To look up a single disease in PsyGeNET, \texttt{psygenet2r} provides the \texttt{psygenetDisease} function. This function retrieves PsyGeNET information using either the disease id or the disease name, together with the database to query (\texttt{ALL} by default).
\\

\noindent If the user does not know the disease identifier, the \texttt{getUMLs} function can be used to obtain disease names and UMLS CUIs from a string query. Given the term and the source of interest, the \texttt{getUMLs} function retrieves all the PsyGeNET concepts that contain the term. As an example, the query results for the term \texttt{depressive} in \texttt{ALL} databases are shown.\\

<>=
getUMLs( "depressive", database = "ALL" )
@

\noindent As an example, the disease \textit{major affective disorder 2}, whose disease id is \textsl{umls:C1839839}, is queried with the \texttt{psygenetDisease} function, first by disease id and then by disease name. For this example database \texttt{"ALL"} is selected:\\

<>=
d1 <- psygenetDisease( disease = "umls:C1839839",
    database = "ALL",
    score = c('>', 0.5 ) )
d1
@

<>=
d2 <- psygenetDisease( disease = "major affective disorder 2",
    database = "ALL",
    score = c('>', 0.5 ) )
d2
@

\noindent Both cases result in a \texttt{DataGeNET.Psy} object that contains the same kind of information as in the gene query search:

<>=
class( d1 )
class( d2 )
@

\noindent The \texttt{score} argument takes a vector. Its first element is \texttt{'<'} or \texttt{'>'}, indicating whether associations with a score below or above the threshold are kept. Its second element is the threshold itself, which is always included. This argument is also available in \texttt{psygenetGene}.

\subsubsection{Plotting results of a Single Disease Query}

\noindent The \texttt{psygenet2r} package offers several options to visualize the results from PsyGeNET for a given disease: a network showing the genes related to the disease of interest, and a barplot showing how many publications report each of the gene-disease associations.\\

\noindent By default, \texttt{psygenet2r} shows the GDA network when plotting a \texttt{DataGeNET.Psy} object obtained from a disease query.
The result is a network where orange nodes are genes and the central green node is the disease of interest.\\

<>=
plot ( d1 )
@

\subsubsection{Using a list of diseases as a query}

\noindent In the same way, \texttt{psygenet2r} allows PsyGeNET to be queried with a set of diseases of interest. The same function, \texttt{psygenetDisease}, accepts a vector of disease names or disease ids (UMLS codes).\\

\noindent To illustrate this functionality, two disorders have been selected: chronic schizophrenia and alcohol use disorder. The vector of diseases can be defined, for example, as follows:\\

<>=
diseasesOfInterest <- c( "chronic schizophrenia","alcohol use disorder" )
@

<>=
tt <- psygenetDisease( disease = diseasesOfInterest,
    database = "ALL" )
tt
@

<>=
dm <- psygenetDisease( disease = c( "umls:C0221765", "umls:C0001956" ),
    database = "ALL" )
dm
@

<>=
tm <- psygenetDisease( disease = c( "chronic schizophrenia","umls:C0001956" ),
    database = "ALL" )
tm
@

\noindent All three cases result in a \texttt{DataGeNET.Psy} object:

<>=
class( tt )
class( dm )
class( tm )
@

\noindent This type of object contains all the information retrieved from PsyGeNET about the genes associated with the diseases of interest. By inspecting the \texttt{DataGeNET.Psy} object we can see that, according to PsyGeNET and querying in ALL databases, the 2 disorders of interest are associated with 25 different genes in 25 different associations.

\subsubsection{Plotting results: Multiple Diseases}

\noindent \texttt{psygenet2r} provides a network graphic and a heatmap to visualize the results of a search with multiple input items.\\

\noindent As for a single disease, the GDA network is the default option in \texttt{psygenet2r}. In the resulting network chart, the green nodes represent diseases and the orange nodes represent genes.\\

<>=
plot( tm )
@

\noindent Another option is to visualize the results in a heatmap. To do so, the argument \texttt{type} can be set to \texttt{"heatmap"}.
<>=
plot( tm, type = "heatmap" )
@

\noindent The result is a heatmap where the genes are located on the X axis and the disorders on the Y axis. The intensity of the red color reflects the PsyGeNET score, with the darkest shade corresponding to the association with the highest score. The score is the PsyGeNET evidence index (EI), which ranges from 0 to 1 (EI = 1 when all the evidence reviewed by the experts supports the existence of an association between the gene and the disease; 1 > EI > 0 when there is contradictory evidence for the GDA; and EI = 0 when all the evidence reviewed by the experts reports that there is no association between the gene and the disease).\\

\subsubsection{Barplot according to number of publications that support the GDA}

\noindent The \texttt{psygenet2r} package can show how many publications support each gene-disease association. This can be visualized in a barplot by specifying the gene or disease id in the \texttt{name} argument and setting the \texttt{type} argument to \texttt{"barplot"}.\\

<>=
plot( d1, name = "major affective disorder 2", type = "barplot" )
@

\noindent As a result, a barplot is obtained. The X axis contains the genes related to the disease of interest, sorted by the number of PubMed ids in which the gene-disease association is found. Alternatively, the results can be visualized for the diseases associated with a gene.\\

<>=
plot( t1, name = "NPY", type = "barplot" )
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Analyze the association between two diseases from the molecular perspective}

\noindent We can study the association between two diseases from the point of view of their shared genetic contribution. More precisely, we can estimate the degree of association of two diseases by means of the number of genes that are shared between the two diseases, over the total number of disease genes. Similarity measures such as the Jaccard Index can be used to estimate disease similarity.
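
\noindent As a hedged illustration of the shared-gene measure described above, the following base R sketch computes it for two gene sets and compares it with randomly drawn sets of the same sizes. It is not the \texttt{psygenet2r} implementation: the helper \texttt{jaccard}, the two gene sets and the \texttt{background} population are hypothetical and only serve to make the idea concrete.

\begin{verbatim}
# Illustrative sketch only: base R, not the psygenet2r implementation.
# The gene sets and the background population are hypothetical.

# Jaccard index: shared genes over the union of both gene sets
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

disease1 <- c("SLC6A4", "DRD2", "HTR1B", "BDNF")
disease2 <- c("SLC6A4", "DRD2", "COMT")
observed <- jaccard(disease1, disease2)  # 2 shared / 5 distinct = 0.4

# Resampling: draw random gene sets of the same sizes from a background
# population and count how often their index exceeds the observed one
set.seed(42)
background <- paste0("GENE", seq_len(1000))
nboot <- 100
random_ji <- replicate(nboot, {
  jaccard(sample(background, length(disease1)),
          sample(background, length(disease2)))
})
pval <- sum(random_ji > observed) / nboot
\end{verbatim}

\noindent Here the two sets share 2 of 5 distinct genes, so \texttt{observed} is 0.4, and \texttt{pval} is the fraction of random draws whose index exceeds that value.
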
The significance of the Jaccard Index is estimated by a bootstrap procedure (see below).\\

\subsection{Using the Jaccard Index}

\noindent The Jaccard Index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity of two sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

\begin{equation*}
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\end{equation*}

\noindent \texttt{psygenet2r} comes with functions to compute the Jaccard Index as an estimate of the similarity of two diseases based on shared genes, given information retrieved from PsyGeNET. The user can compute the Jaccard Index with the function \texttt{jaccardEstimation}. This function accepts several kinds of input:

\begin{enumerate}
\item Given a list of genes of interest, the function computes the Jaccard Index between the set of genes and all the diseases in PsyGeNET.
<>=
genes_interest <- c("SLC6A4", "DRD2", "HTR1B", "PLP1", "TH", "DRD3")
ji1 <- jaccardEstimation(genes_interest, database = "ALL")
@
\item Given a list of genes of interest and a list of diseases of interest, the function computes the Jaccard Index between the set of genes and each disease:
<>=
disease_interest <- c("delirium", "bipolar i disorder", "severe depression", "cocaine dependence")
ji2 <- jaccardEstimation(genes_interest, disease_interest, database = "ALL")
@
\item Given a list of diseases of interest, the function computes the Jaccard Index between each pair of diseases:
<>=
ji3 <- jaccardEstimation(disease_interest, database = "ALL")
@
\end{enumerate}

\noindent To determine whether the association between two diseases as estimated by the Jaccard Index was statistically significant, we applied a bootstrap procedure to estimate the likelihood of obtaining by chance a Jaccard Index greater than the one observed for the association between the diseases.
In other words, we sampled at random gene sets of size n and p (where n and p are the numbers of genes associated with diseases 1 and 2, respectively) from a population of human disease genes obtained from DisGeNET \cite{REF3}. These random gene sets (of size n and p) were then used to compute the Jaccard Index for diseases 1 and 2. This procedure was repeated 100 times. Then, we counted the number of times the Jaccard Index of the random gene sets was larger than the observed value of the Jaccard Index.
% We only kept disease associations for which we obtained less than 5/100 random cases an Jaccard Index value larger than the observed one.

\noindent The raw results are stored in a \texttt{JaccardIndexPsy} object and can be obtained with the function \texttt{extract}. For example:

<>=
head(extract(ji1))
tail(extract(ji1))
@

\subsection{Plotting results: Jaccard Index}

\noindent For a single set of genes, plotting the result of \texttt{jaccardEstimation} gives a barplot of the Jaccard Index with each disease:

<>=
plot(ji1, cutOff = 0.1)
@

\noindent The barplot above shows the Jaccard Index values greater than 0.1 obtained when testing each disease in PsyGeNET. When given a set of genes and a set of diseases, the resulting plot is equivalent:

<>=
plot(ji2)
@

\noindent The plot resulting from more than one disease is a heatmap with the given diseases on the X axis and all the diseases that share genes with them on the Y axis. The intensity of the color represents the value of the Jaccard Index between them, with darker colors indicating higher values.\\

<>=
plot(ji3)
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Warnings}

\noindent The \texttt{heatmap} value of the \texttt{type} argument does not support single-gene queries:

\begin{verbatim}
> plot( t1, type = "heatmap" )
==> Error: For this type of chart, a multiple gene query created with
    'psygenetGene' is required.
\end{verbatim}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage

\begin{thebibliography}{9}

\bibitem{REF1}
Alba Gutierrez-Sacristan; Solene Grosdidier; Olga Valverde; Marta Torrens; Alex Bravo; Janet Pinero; Ferran Sanz; Laura I. Furlong.
\textbf{PsyGeNET: a knowledge platform on psychiatric disorders and their genes}.
Bioinformatics (2015) doi: 10.1093/bioinformatics/btv301

\bibitem{REF2}
Sullivan, Patrick F; Daly, Mark J; O'Donovan, Michael.
\textbf{Genetic architectures of psychiatric disorders: the emerging picture and its implications}.
Nature Reviews Genetics (2012) vol. 13 (8) p. 537-51

\bibitem{REF3}
Janet Piñero, Núria Queralt-Rosinach, Àlex Bravo, Jordi Deu-Pons, Anna Bauer-Mehren, Martin Baron, Ferran Sanz, Laura I Furlong.
\textbf{DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes}.
Database (2015) vol. 2015: article ID bav028; doi: 10.1093/database/bav028

\bibitem{REF4}
Bravo, À.; Piñero, J.; Queralt, N.; Rautschka, M.; Furlong, L.I.
\textbf{Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research}.
BMC Bioinformatics (2015) 16:55 doi: 10.1186/s12859-015-0472-9

\bibitem{REF5}
Flint, Jonathan; Kendler, Kenneth S.
\textbf{The genetics of major depression}.
Neuron (2014) vol. 81 (3) p. 484-503

\bibitem{Treutlein}
Jens Treutlein, Sven Cichon et al.
\textbf{Genome-wide association study of alcohol dependence}.
Archives of General Psychiatry (2009) vol. 66 (7) p. 773 doi: 10.1001/archgenpsychiatry.2009.83

\bibitem{Komljenovic}
Komljenovic A and Roux J.
\textbf{BgeeDB: an R package for annotation and gene expression data retrieval from Bgee database}.
(2016)

\bibitem{Bastian}
Bastian F.
\textbf{Bgee: Integrating and Comparing Heterogeneous Transcriptome Data Among Species}.
Data Integration Life Sci. Lecture Notes in Computer Science (2008), pp. 124-31

\end{thebibliography}

\end{document}