--- title: "INDEED R package for cancer biomarker candidate selection" author: "Zhenzhi Li and Yiming Zuo" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{INDEED R package for cancer biomarker candidate selection} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The biomarker candidates selected by **INDEED** lead to more accurate survival time prediction compared with those selected by differential expression (DE) analysis and differential network (DN) analysis. ## Introduction Differential expression (DE) analysis is commonly used to identify biomarker candidates that have significant changes in their expression levels between distinct biological groups. One drawback of DE analysis is that it only considers the changes on single biomolecular level. In differential network (DN) analysis, network is typically built based on the correlation and biomarker candidates are selected by investigating the network topology. However, correlation tends to generate over-complicated networks and the selection of biomarker candidates purely based on network topology ignores the changes on single biomolecule level. Thus, we have proposed a novel method INDEED, which considers both the changes on single biomolecular and network levels by integrating DE and DN analysis. INDEED has been published in Methods journal (PMID: 27592383). This is the R package that implements the algorithm. This R package will generate a list of dataframes containing information such as p-values, node degree and activity score for each biomolecule. A higher activity score indicates that the corresponding biomolecule has more neighbors connceted in the differential network and their p-values are more statistically significant. It will also generate a csv file for the differential network created by INDEED. The **INDEED** package doesn't get loaded automatically, so remember to load it first: ``` library(INDEED) ``` `glasso` function will also be loaded as **INDEED** depends on it to obtain the sparse differential network. ## partial correlation data preprocessing function `pre_partial()` To use the function `sig_select()`, users will need to have these data sets in hand: - (**data**) A data frame that contains the expression level of individual biomolecule from two biologically disparate groups (p * n) - (**class_label**) A binary array with group 1 labeled as 0 and group 2 as 1 - (**id**) An array that includes the corresponding ID for each biomolecule ## partial correlation result calculation function `partial_cor()` - (**data_list**) A list of dataframes output from pre_partial step - (**rho_group1**) Rule to choose rho for first group of data, user can input "min" for minimum rule, "ste" as one standard error rule, or user can input a number of choice - (**rho_group2**) Rule to choose rho for second group of data, user can input "min" for minimum rule, "ste" as one standard error rule, or user can input a number of choice - (**permutation**) A positive integer number of permutation, default 1000 - (**p_val**) A dataframe containing p-values obtained from DE analysis (optional) The demo data is presented in the **Demo Data** section below. It will be automatically loaded with the package. In partical correlation method, user will need to preprocess the data using `pre_partial()` function, followed by using `partial_cor()` and choosing desired rho number and complete the analysis. User can also choose to provide p-value table and number of permutations in `partial_cor()` function. Result will be saved in a list of two dataframes: activity_score and diff_network. Activity_score contains a dataframe with biomolecules ranked by activity score calculated from p-value and node degree. Diff_network contains a dataframe of connection weight between two nodes. The following example demonstrates how to use `pre_partial()` and `partial_cor()`function: ``` pre_data <- pre_partial(data=Met_GU,class_label = Met_Group_GU,id=Met_name_GU) partial_cor(data_list=pre_data,rho_group1='min',rho_group2="min",permutation = 1000,p_val=pvalue_M_GU) ``` In this case, the sparse differential network is based on partial correlation and p-value for each biomolecule is calculated for users. Also, **rho** is picked based on minimum rule and number of permutations is set to 1000. ## non-partial correlation data preprocessing function `non_partial_cor()` - (**data**) A data frame that contains the expression level of individual biomolecule from two biologically disparate groups (p * n) - (**class_label**) A binary array with group 1 labeled as 0 and group 2 as 1 - (**id**) An array that includes the corresponding ID for each biomolecule - (**method**) Type of correlation, user can choose between pearson and spearman correaltion, default pearson - (**permutation**) A positive integer number of permutation, default 1000 - (**p_val**) An optional dataframe containing p-values obtained from DE analysis In non partical correlation method, user only need to run `non_partial_cor()` function. No rho number will be needed.In this method, user input dataframe with expression level, class label, id, method, number of permutation and p-value similar to partical correaltion but in one step fashion. Result will be saved in a list of two dataframes: activity_score and diff_network. Activity_score contains a dataframe with biomolecules ranked by activity score calculated from p-value and node degree. Diff_network contains a dataframe of connection weight between two nodes. The following example demonstrates how to use `non_partial_cor()` function: ``` non_partial_cor(data=Met_GU,class_label = Met_Group_GU,id=Met_name_GU,method="spearman") ``` ## More Examples ### Example 1 ``` pre_data <- pre_partial(data=Met_GU,class_label = Met_Group_GU,id=Met_name_GU) partial_cor(data_list=pre_data,rho_group1=0.3,rho_group2=0.25,permutation = 500) ``` By calling the above function, rho for group 1 is 0.3 and rho for group 2 is 0.25, permutation is 500 and p-value will be calculated for users. ### Example 2 ``` non_partial_cor(data=Met_GU,class_label = Met_Group_GU,id=Met_name_GU,method="pearson") By calling the above function, the differential network is obtained based on Pearson correlation coefficient. ``` By calling the above function, the differential network is obtained based on Spearman correlation coefficient. And the table below is part of the csv file output.