--- title: "Working with StarBioTrek package" author: "Claudia Cava, Isabella Castiglioni" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true number_sections: false toc_depth: 2 highlight: haddock references: - id: ref1 title: Orchestrating high-throughput genomic analysis with Bioconductor author: - family: Huber, Wolfgang and Carey, Vincent J and Gentleman, Robert and Anders, Simon and Carlson, Marc and Carvalho, Benilton S and Bravo, Hector Corrada and Davis, Sean and Gatto, Laurent and Girke, Thomas and others given: journal: Nature methods volume: 12 number: 2 pages: 115-121 issued: year: 2015 - id: ref2 title: GC-content normalization for RNA-Seq data author: - family: Risso, Davide and Schwartz, Katja and Sherlock, Gavin and Dudoit, Sandrine given: journal: BMC bioinformatics volume: 12 number: 1 pages: 480 issued: year: 2011 - id: ref3 title: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments author: - family: Bullard, James H and Purdom, Elizabeth and Hansen, Kasper D and Dudoit, Sandrine given: journal: BMC bioinformatics volume: 11 number: 1 pages: 94 issued: year: 2010 - id: ref4 title: Inferring regulatory element landscapes and transcription factor networks from cancer methylomes author: - family: Yao, L., Shen, H., Laird, P. W., Farnham, P. J., & Berman, B. P. given: journal: Genome biology volume: 16 number: 1 pages: 105 issued: year: 2015 - id: ref5 title: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments author: - family: James H Bullard, Elizabeth Purdom, Kasper D Hansen and Sandrine Dudoit given: journal: BMC Bioinformatics volume: 11 number: 1 pages: 94 issued: year: 2010 - id: ref6 title: GC-content normalization for RNA-Seq data author: - family: Risso, D., Schwartz, K., Sherlock, G., & Dudoit, S. given: journal: BMC Bioinformatics volume: 12 number: 1 pages: 480 issued: year: 2011 - id: ref7 title: Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma author: - family: Noushmehr, H., Weisenberger, D.J., Diefes, K., Phillips, H.S., Pujara, K., Berman, B.P., Pan, F., Pelloski, C.E., Sulman, E.P., Bhat, K.P. et al. given: journal: Cancer cell volume: 17 number: 5 pages: 510-522 issued: year: 2010 - id: ref8 title: Molecular Profiling Reveals Biologically Discrete Subsets and Pathways of Progression in Diffuse Glioma author: - family: Ceccarelli, Michele and Barthel, Floris P and Malta, Tathiane M and Sabedot, Thais S and Salama, Sofie R and Murray, Bradley A and Morozova, Olena and Newton, Yulia and Radenbaugh, Amie and Pagnotta, Stefano M and others given: journal: Cell URL: "http://doi.org/10.1016/j.cell.2015.12.028" DOI: "10.1016/j.cell.2015.12.028" volume: 164 number: 3 pages: 550-563 issued: year: 2016 - id: ref9 title: Comprehensive molecular profiling of lung adenocarcinoma author: - family: Cancer Genome Atlas Research Network and others given: journal: Nature URL: "http://doi.org/10.1038/nature13385" DOI: "10.1038/nature13385" volume: 511 number: 7511 pages: 543-550 issued: year: 2014 - id: ref10 title: Comprehensive molecular characterization of gastric adenocarcinoma author: - family: Cancer Genome Atlas Research Network and others given: journal: Nature URL: "http://doi.org/10.1038/nature13480" DOI: "10.1038/nature13480" issued: year: 2014 - id: ref11 title: Comprehensive molecular portraits of human breast tumours author: - family: Cancer Genome Atlas Research Network and others given: journal: Nature URL: "http://doi.org/10.1038/nature11412" DOI: "10.1038/nature11412" volume: 490 number: 7418 pages: 61-70 issued: year: 2012 - id: ref12 title: Comprehensive molecular characterization of human colon and rectal cancer author: - family: Cancer Genome Atlas Research Network and others given: journal: Nature URL: "http://doi.org/10.1038/nature11252" DOI: "10.1038/nature11252" volume: 487 number: 7407 pages: 330-337 issued: year: 2012 - id: ref13 title: Genomic classification of cutaneous melanoma author: - family: Cancer Genome Atlas Research Network and others given: journal: Cell URL: "http://doi.org/10.1016/j.cell.2015.05.044" DOI: "10.1016/j.cell.2015.05.044" volume: 161 number: 7 pages: 1681-1696 issued: year: 2015 - id: ref14 title: Comprehensive genomic characterization of head and neck squamous cell carcinomas author: - family: Cancer Genome Atlas Research Network and others given: journal: Nature URL: "http://doi.org/10.1038/nature14129" DOI: "10.1038/nature14129" volume: 517 number: 7536 pages: 576-582 issued: year: 2015 - id: ref15 title: The somatic genomic landscape of chromophobe renal cell carcinoma author: - family: Davis, Caleb F and Ricketts, Christopher J and Wang, Min and Yang, Lixing and Cherniack, Andrew D and Shen, Hui and Buhay, Christian and Kang, Hyojin and Kim, Sang Cheol and Fahey, Catherine C and others given: journal: Cancer Cell URL: "http://doi.org/10.1016/j.ccr.2014.07.014" DOI: "10.1016/j.ccr.2014.07.014" volume: 26 number: 3 pages: 319-330 issued: year: 2014 - id: ref16 title: Comprehensive genomic characterization of squamous cell lung cancers author: - family: Cancer Genome Atlas Research Network and others given: journal: Nature URL: "http://doi.org/10.1038/nature11404" DOI: "10.1038/nature11404" volume: 489 number: 7417 pages: 519-525 issued: year: 2012 - id: ref17 title: Integrated genomic characterization of endometrial carcinoma author: - family: Cancer Genome Atlas Research Network and others given: journal: Nature URL: "http://doi.org/10.1038/nature12113" DOI: "10.1038/nature12113" volume: 497 number: 7447 pages: 67-73 issued: year: 2013 - id: ref18 title: Integrated genomic characterization of papillary thyroid carcinoma author: - family: Cancer Genome Atlas Research Network and others given: journal: Cell URL: "http://doi.org/10.1016/j.cell.2014.09.050" DOI: "10.1016/j.cell.2014.09.050" volume: 159 number: 3 pages: 676-690 issued: year: 2014 - id: ref19 title: The molecular taxonomy of primary prostate cancer author: - family: Cancer Genome Atlas Research Network and others given: journal: Cell URL: "http://doi.org/10.1016/j.cell.2015.10.025" DOI: "10.1016/j.cell.2015.10.025" volume: 163 number: 4 pages: 1011-1025 issued: year: 2015 - id: ref20 title: Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma author: - family: Linehan, W Marston and Spellman, Paul T and Ricketts, Christopher J and Creighton, Chad J and Fei, Suzanne S and Davis, Caleb and Wheeler, David A and Murray, Bradley A and Schmidt, Laura and Vocke, Cathy D and others given: journal: NEW ENGLAND JOURNAL OF MEDICINE URL: "http://doi.org/10.1056/NEJMoa1505917" DOI: "10.1056/NEJMoa1505917" volume: 374 number: 2 pages: 135-145 issued: year: 2016 - id: ref21 title: Comprehensive molecular characterization of clear cell renal cell carcinoma author: - family: Cancer Genome Atlas Research Network and others given: journal: Nature URL: "http://doi.org/10.1038/nature12222" DOI: "10.1038/nature12222" volume: 499 number: 7456 pages: 43-49 issued: year: 2013 - id: ref22 title: Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma author: - family: Cancer Genome Atlas Research Network and others given: journal: Cancer Cell URL: "http://dx.doi.org/10.1016/j.ccell.2016.04.002" DOI: "10.1016/j.ccell.2016.04.002" volume: 29 pages: 43-49 issued: year: 2016 - id: ref23 title: Complex heatmaps reveal patterns and correlations in multidimensional genomic data author: - family: Gu, Zuguang and Eils, Roland and Schlesner, Matthias given: journal: Bioinformatics URL: "http://dx.doi.org/10.1016/j.ccell.2016.04.002" DOI: "10.1016/j.ccell.2016.04.002" pages: "btw313" issued: year: 2016 - id: ref24 title: "TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages" author: - family: Silva, TC and Colaprico, A and Olsen, C and D'Angelo, F and Bontempi, G and Ceccarelli, M and Noushmehr, H given: journal: F1000Research URL: "http://dx.doi.org/10.12688/f1000research.8923.1" DOI: "10.12688/f1000research.8923.1" volume: 5 number: 1542 issued: year: 2016 - id: ref25 title: "StarBioTrek: an R/Bioconductor package for integrative analysis of TCGA data" author: - family: Colaprico, Antonio and Silva, Tiago C. and Olsen, Catharina and Garofano, Luciano and Cava, Claudia and Garolini, Davide and Sabedot, Thais S. and Malta, Tathiane M. and Pagnotta, Stefano M. and Castiglioni, Isabella and Ceccarelli, Michele and Bontempi, Gianluca and Noushmehr, Houtan given: journal: Nucleic Acids Research URL: "http://dx.doi.org/10.1093/nar/gkv1507" DOI: "10.1093/nar/gkv1507" volume: 44 number: 8 pages: e71 issued: year: 2016 vignette: > %\VignetteIndexEntry{Vignette Title} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(dpi = 300) knitr::opts_chunk$set(cache=FALSE) ``` ```{r, echo = FALSE,hide=TRUE, message=FALSE,warning=FALSE} devtools::load_all(".") ``` # Introduction Motivation: New technologies have made possible to identify marker gene signatures. However, gene expression-based signatures present some limitations because they do not consider metabolic role of the genes and are affected by genetic heterogeneity across patient cohorts. Considering the activity of entire pathways rather than the expression levels of individual genes can be a way to exceed these limits. This tool `StarBioTrek` presents some methodologies to measure pathway activity and cross-talk among pathways integrating also the information of network and TCGA data. New measures are under development. # Installation To install use the code below. ```{r, eval = FALSE} source("https://bioconductor.org/biocLite.R") biocLite("StarBioTrek") ``` # `Get data`: Get KEGG pathway, network and TCGA data ## `getKEGGdata`: Searching KEGG data for download You can easily search KEGG data and their genes using the `getKEGGdata` function. It can download 306 pathways from KEGG or the user can select a particular list of pathways manually curated based on their role in the cell as metabolism, genetic Information processing, environmental information processing, Cellular processes and Organismal Systems using the following parameters: METABOLISM * **Carb_met** Carbohydrate_metabolism * **Ener_met** Energy_metabolism * **Lip_met** Lipid_metabolism * **Amn_met** Aminoacid_metabolism * **Gly_bio_met** Glycan_biosynthesis_and_metabolism * **Cof_vit_met** Metabolism_of_cofactors_and_vitamins GENETIC INFORMATION PROCESSING * **Transcript** Transcription * **Transl** Translation * **Fold_degr** Folding_sorting_and_degradation * **Repl_repair** Replication_and_repair ENVIRONMENTAL INFORMATION PROCESSING * **sign_transd** Signal_transduction * **sign_mol_int** Signaling_molecules_and_interaction CELLULAR PROCESSES * **Transp_cat** Transport_and_catabolism * **cell_grow_d** Cell_growth_and_death * **cell_comm** Cellular_community ORGANISMAL SYSTEMS * **imm_syst** Immune_system * **end_syst** Endocrine_system * **circ_syst** Circulatory_system * **dig_syst ** Digestive_system * **exc_syst** Excretory_system * **nerv_syst ** Nervous_system * **sens_syst ** Sensory_system The following code is an example to download the pathways involved in Transcription: ```{r, eval = TRUE} patha<-getKEGGdata(KEGG_path="Transcript") ``` For example the group Transcript contains different pathways: ```{r, eval = TRUE, echo = FALSE} knitr::kable(colnames(path), digits = 2, caption = "List of patwhays for the group Transcript",row.names = FALSE) ``` The function `getKEGGdata` with KEGG_path="KEGG_path" will download all KEGG pathaway. ## `getNETdata`: Searching network data for download You can easily search human network data from GeneMania using the `getNETdata` function. The network category can be filtered using the following parameters: * **PHint** Physical_interactions * **COloc** Co-localization * **GENint** Genetic_interactions * **PATH** Pathway * **SHpd** Shared_protein_domains For default the organism is homo sapiens. The example show the shared protein domain network for Saccharomyces_cerevisiae. For more information see `SpidermiR` package. ```{r, eval = TRUE} organism="Saccharomyces_cerevisiae" netwa<-getNETdata(network="SHpd",organism) ``` # `Integration data`: Integration between KEGG pathway and network data ## `path_net`: Network of interacting genes for each pathway according a network type (PHint,COloc,GENint,PATH,SHpd) The function `path_net` creates a network of interacting genes for each pathway. Interacting genes are genes belonging to the same pathway and the interaction is given from network chosen by the user, according the paramenters of the function `getNETdata`. The output will be a network of genes belonging to the same pathway. ```{r, eval = TRUE} network_path<-path_net(pathway=path,data=netw) ``` ## `list_path_net`: List of interacting genes for each pathway (list of genes) according a network type (PHint,COloc,GENint,PATH,SHpd) The function `list_path_net` creates a list of interacting genes for each pathway. Interacting genes are genes belonging to the same pathway and the interaction is given from network chosen by the user, according the paramenters of the function `getNETdata`. The output will be a list of genes belonging to the same pathway and those having an interaction in the network. ```{r, eval = TRUE} list_path<-list_path_net(lista_net=network_path,pathway=path) ``` # `Pathway summary indexes`: Score for each pathway ## `GE_matrix`: Get human KEGG pathway data and a gene expression matrix in order to obtain a matrix with the gene expression for only genes given containing in the pathways given in input by the user. Starting from a matrix of gene expression (rows are genes and columns are samples, TCGA data) the function `GE_matrix` creates a of gene expression levels for each pathway given by the user: ```{r, eval = TRUE} list_path_gene<-GE_matrix(DataMatrix=tumo[,1:2],pathway=path) ``` ## `average`: Average of genes for each pathway starting from a matrix of gene expression Starting from a matrix of gene expression (rows are genes and columns are samples, TCGA data) the function `average` creates an average matrix of gene expression for each pathway: ```{r, eval = TRUE} score_mean<-average(dataFilt=tumo[,1:2],pathway=path) ``` ## `st_dv`: Standard deviations of genes for each pathway starting from a matrix of gene expression Starting from a matrix of gene expression (rows are genes and columns are samples, TCGA data) the function `st_dv` creates a standard deviation matrix of gene expression for each pathway: ```{r, eval = TRUE} score_st_dev<-st_dv(DataMatrix=tumo[,1:2],pathway=path) ``` # `Pathway cross-talk indexes`: Score for pairwise pathways ## `euc_dist_crtlk`: Euclidean distance for cross-talk measure Starting from a matrix of gene expression (rows are genes and columns are samples, TCGA data) the function `euc_dist_crtlk` creates an euclidean distance matrix of gene expression for pairwise pathway. ```{r, eval = TRUE} score_euc_dista<-euc_dist_crtlk(dataFilt=tumo[,1:2],pathway=path) ``` ## `ds_score_crtlk`: Discriminating score for cross-talk measure Starting from a matrix of gene expression (rows are genes and columns are samples, TCGA data) the function `ds_score_crtlk` creates an discriminating score matrix for pairwise pathway as measure of cross-talk. Discriminating score is given by |M1-M2|/S1+S2 where M1 and M2 are mean and S1 and S2 standard deviation of expression levels of genes in a pathway 1 and and in a pathway 2 . ```{r, eval = TRUE} cross_talk_st_dv<-ds_score_crtlk(dataFilt=tumo[,1:2],pathway=path) ``` # `Selection of pathway cross-talk`: Selection of pathway cross-talk ## `svm_classification`: SVM classification Given the substantial difference in the activities of many pathways between two classes (e.g. normal and cancer), we examined the effectiveness to classify the classes based on their pairwise pathway profiles. This function is used to find the interacting pathways that are altered in a particular pathology in terms of Area Under Curve (AUC).AUC was estimated by cross-validation method (k-fold cross-validation, k=10).It randomly selected some fraction of TCGA data (e.g. nf= 60; 60% of original dataset) to form the training set and then assigned the rest of the points to the testing set (40% of original dataset). For each pairwise pathway the user can obtain using the methods mentioned above a score matrix ( e.g.dev_std_crtlk ) and can focus on the pairs of pathways able to differentiate a particular subtype with respect to the normal type. ```{r, eval = FALSE} nf <- 60 res_class<-svm_classification(TCGA_matrix=score_euc_dist[1:2,],nfs=nf,normal=colnames(norm[,1:12]),tumour=colnames(tumo[,1:12])) ``` ```{r, fig.width=6, fig.height=4, echo=FALSE, fig.align="center"} library(png) library(grid) img <- readPNG("plot_class.png") grid.raster(img) ``` # `IPPI`: Driver genes for each pathway The function `IPPI`, using pathways and networks data, calculates the driver genes for each pathway. Please see Cava et al. BMC Genomics 2017. ```{r, eval = TRUE} DRIVER_SP<-IPPI(patha=path,netwa=netw) ``` # `plotting_cross_talk`: Plot pathway cross-talk The function `plotting_cross_talk` creates the input for the visualization using `qgraph` package. This function uses the relationship among genes as correlation (green for positive correlations and red for negative correlations) and the width of the lines indicates the strength of the correlation. The result from `plotting_cross_talk` is shown below: ```{r, fig.width=6, fig.height=4, echo=FALSE, fig.align="center"} library(png) library(grid) img <- readPNG("plot_path.png") grid.raster(img) ``` ****** ### Session Information ****** ```{r sessionInfo} sessionInfo() ``` # References