--- title: "SVMDO-Tutorial" author: "

Author: Mustafa Erhan Özer

" date: "

Most recent update: `r format( Sys.Date(), '%b-%d-%Y')`

" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{SVMDO-Tutorial} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, warning = FALSE, comment = "#>", cache = TRUE ) is_windows <- identical(.Platform$OS.type, "windows") ``` # Installation and Package Loading The day-to-day development version from the [Github repository](https://github.com/robogeno/SVMDO) can be installed. ```{r install, eval=FALSE} # From Github if (!requireNamespace("devtools", quietly=TRUE)) install.packages("devtools") devtools::install_github("robogeno/SVMDO") ``` ```{r eval = FALSE} # From Biocodunctor if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SVMDO") ``` ```{r eval = FALSE} #library(SVMDO) #Main screen of GUI #runGUI() # OR #SVMDO::runGUI() ``` # Introduction Transcriptome-based supervised classification can be difficult to handle due to including vast amount of junk features reducing the efficiency of the process. To minimize the efficiency problem, feature selection methods have been applied [1]. Ensemble methodology in constructing a feature selection method has been mainly preferred for its better accuracy in classification processes [2]. Cancer-based studies have been reported to co-occur with several chronic diseases [3] and oncogenic viral infections [4]. Disease Ontology (DO) enrichment analysis can provide a filtration approach through reported diseases that interact with the gene sets used [5]. Wilk's lambda criterion is one of the moderate techniques for feature selection. The method combines different features(genes) step by step based on their contributions to the discriminating power of overall model [6]. Thus, disease related genes with better discriminatory characteristics can be acquired. SVMDO is an easy-to-use GUI using disease information for detecting tumor/normal sample discriminating gene sets from differentially expressed genes. Our approach is based on an iterative algorithm filtering genes with disease ontology enrichment analysis and wilk's lambda criterion connected to SVM classification model construction. Along with gene set extraction, SVMDO also provides individual prognostic marker detection. The algorithm is designed for FPKM and RPKM normalized RNA-Seq transcriptome datasets. During the development our algorithm, Bioconductor provided a robust approach for acquiring significant disease-gene interaction information from disease ontology database. # Implementation SVMDO was developed by using Shiny package. It is available for Windows and Linux from Bioconductor website. SVMDO requires the following R packages: Shiny, ShinyFiles, golem, nortest, e1071, BSDA, data.table, sjmisc, klaR, caTools, dplyr, caret, survival, DOSE, AnnotationDbi, DOSE, org.Hs.eg.db # Dataset Preparation RNA-Seq cancer transcriptome and clinical datasets must be prepared before applying them in SVMDO. The datasets must be used in **txt** format. The expected designs of input datasets are indicated in Table-1 and Table-2 for transcriptome and clinical datasets respectively. ```{r echo=FALSE} id<-c("TCGA-AA-3662","TCGA-AA-3514","TCGA-D5-6541","...") tissue_type<-c("Normal","Normal","Tumour","...") AB1G<-c("Exp_1","Exp_2","Exp_3","...") A2M<-c("Exp_1","Exp_2","Exp_3","...") table_prep_1<-data.frame(id,tissue_type,AB1G,A2M) ``` `r knitr::kable(table_prep_1,"pipe",col.name=c("id","tissue_type","AB1G","A2M"),align = "cccc",caption="Structure of RNA-Seq Transcriptome Dataset")` ```{r echo=FALSE} id<-c("TCGA-AA-3662","TCGA-AA-3514","TCGA-D5-6541","...") days_to_death<-c(49,1331,225,"...") vital_status<-c("Alive","Dead","Dead","...") table_prep_2<-data.frame(id,days_to_death,vital_status) ``` `r knitr::kable(table_prep_2,"pipe",col.name=c("id","days_to_death","vital_status"),align = "cccc",caption="Structure of Clinical Dataset")` - **tissue_type:** Normal/Tumour (or Tumor) Tissue Information - **A1BG, A2M:** Gene Symbol and Gene Expression - **id:** TCGA Sample Ids - **days_to_death:** Survival Time of Patients - **vital_status:** Vitality of Patients Using **tissue_type** and **id** as column names for representing tissue information and TCGA sample id is **mandatory**. If there is not any requirement for **survival analysis**, preparation of **clinical dataset** and involvement of **id** column in the transcriptome dataset are optional. # SVMDO GUI Description The main dialog box is indicated in **Figure-1**. The GUI page consists of two main sections which are "Analysis" and "Result". In the Analysis section, steps of acquiring discriminative gene set and preparations of individual survival plots are included. In the Result section, visualization and download of discriminative gene set and survival plots are included. At each step of the GUI sections, necessary variables are saved as objects in the workspace environment to be used in the following steps. To provide experience about the GUI usage, a test section involving dummy example using SummarizedExperiment objects of transcriptome (small form) and clinical datasets is also included. ```{r, ,echo=FALSE, width = 200, height = 50,fig.align = "center",fig.show='hold'} # two figs side by side knitr::include_graphics("svmdo_fig_1.png") ```


```{r, ,echo=FALSE, width = 200, height = 50,fig.align = "center",fig.show='hold',fig.cap='Figure-1 SVMDO GUI Sections'} # two figs side by side knitr::include_graphics("svmdo_fig_2.png") ``` # How to open SVMDO Main Screen To open GUI screen, you can directly write **SVMDO::runGUI()** in R console. If library is previously activated you can open GUI screen by writing **runGUI()** in R console. # How to use SVMDO Main Screen (**Important:** Except file input processes, each step provides a message box indicating process success/failure. It disappears after clicking on any area in the GUI screen. This is necessary for continuing the steps.) ## Steps of Analysis Screen 1. To search your transcriptome dataset, use the file detection in **Choose Your Expression Dataset** section. The file will be automatically uploaded into the GUI. 2. To prevent clashing with test datasets, select **"None"** option from the radio button section. 3. By clicking on **DEG Analysis** button you further apply differential expression analysis. Labels of tissue_type column in dataset must contain "Nor" and "Tum" for determining normal/tumour (or tumor) samples. 4. When the differential expression process is completed, a user-defined input size (n) is selected to filter the initial gene list (i.e., n number of upregulated and downregulated genes) by entering a number in **Input Size** section. It is predetermined as **50** in GUI which can be changed based on the user. A message window saying **process completed** will appear if there is not any problem. If there is problem with the value of input size, you will get a warning about inappropriate input size selection. If the input size remains, algorithm selects all of the differentiallly expressed genes to be used in the next process. 5. To apply disease ontology-based gene filtration, click on **DO Analysis** button. 6. To further apply the following feature selection and classification processes, click on the **Classification** button. 7. Acquired discriminative gene set can be further used for survival analysis to detect individual prognostic genes. To apply this process, use the file detection in **Choose Clinical Data** section for searching clinical data about patient survival followed by clicking on **Survival Analysis** button. ## Steps of Result Screen 1. To visualize discriminative gene sets inside GUI screen, click on **Show Gene Results** button. When you click this button, a table of gene set will appear. 2. To visualize survival plots of individual genes, two steps have to be applied. First of all, click on **Prepare Plot Lists** button to feed plot information to the visualization system. After that, click on **Show Plots** button to visualize survival plots. 3. Before downloading files, you can adjust the output directory with **Choose Directory** button. It can be used for separating files by selecting a destination before clicking download buttons. If it is desired, files can be downloaded to the same folder by selecting an output directory just one time before the download steps. If you do not select any output directory, files will be downloaded to your **working directory**. 4. To download the resulting discriminative gene set, it is obligatory to define a filename in the **Enter Final Gene Set Filename** section. After that, you can click on **Download Gene List** button to complete the process. 5. To download survival plots, you have to click on **Download Plot List** button. Names of plot files are automatically done by assigning gene names. # Application of test datasets SVMDO includes test datasets providing dummy examples for gaining experience on the GUI usage. Test datasets consists of summarized experiment objects including expression and clinical datasets. These objects are saved in RDA forms and called from **extdata** folder of the package. As expression datasets, test files includes simplified forms of TCGA-COAD (COAD) and TCGA-LUSC (LUSC) with 400 genes In test-based analysis, predetermined expression and clinical datasets are automatically uploaded into the GUI. Furthermore, predefined input size (n=50) is also automatically applied. Therefore, users have to continue with **DO Analysis** after **DEG Analysis** without adjusting input size. # Workspace Clearance When the user task is completed, click on the **Clear Environment** button to remove the global variables created during the algorithm sections. To prevent error in the next usages of GUI, it is a necessary process. It can be applied at any moment without the necessity of completing all of the steps of algorithm. # Output Files of SVMDO - **Discriminative Gene Set:** Final form of gene set providing optimal classification performance - **Individiual Survival Plots:** Separate survival plots of individual genes with prognostic performance # References 1. Cai,J. et al. (2018) Feature selection in machine learning: A new perspective. Neurocomputing, 300, 70--79. 2. Gallagher,E.J. and LeRoith,D. (2015) Obesity and diabetes: The increased risk of cancer and cancer-related mortality. Physiological Reviews, 95, 727--748. 3. Kori,M. and Arga,K.Y. (2020) Pathways involved in viral oncogenesis: New perspectives from virus-host protein interactomics. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease, 1866, 165885. 4. LePendu,P. et al. (2011) Enabling enrichment analysis with the human disease ontology. Journal of Biomedical Informatics, 44, S31--S38. 5. Liu,B. et al. (2004) BMC Bioinformatics, 5, 136. 6. Ouardighi,A. el et al. (2007) Feature selection on supervised classification using wilks lambda statistic. In, 2007 international symposium on computational intelligence and intelligent informatics. IEEE. # How to install R and RStudio GUI RStudio requires R 3.3.0+. Choose a version of R that matches your computer's operating system ## Windows Operating System \*\*R-base Install: 1. Copy/Paste the address 2. Click on "Download and install R" 3. In the new page, click on "Download R for Windows" 4. Click on "install R for the first time" 5. Click on "Download R-X.X.X-win.exe for Windows"to install R-base \*\*RStudio Install: 1. Copy/Paste the address 2. Click on either download RStudio desktop for windows 3. Alternatively, .exe /.zip versions can be downloaded from All Installers section ## Ubuntu Operating System \*\*R-base Install: 1. Copy/Paste the address 2. Click on "Download and install R". 3. Click on "Download R for Linux" 4. In the parent directory, click on ubuntu/ 5. Run the following lines in terminal (as root or by prefixing sudo) to provide access to recent version of R: - sudo apt install --no-install-recommends software-properties-common dirmngr \# add the signing key (by Michael Rutter) for these repos \# To verify key, run gpg --show-keys /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc \# Fingerprint: E298A3A825C0D65DFD57CBB651716619E084DAB9 - wget -qO- \| sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc \# add the R 4.0 repo from CRAN -- adjust 'focal' to 'groovy' or 'bionic' as needed - sudo add-apt-repository "deb \$(lsb_release -cs)-cran40/" 5. After the previous lines, run sudo apt install --no-install-recommends r-base \*\*RStudio Install: 1. Copy/Paste the address 2. In All Installers section, select .deb or .debian.tar.gz version for Ubuntu 3. Right click on the download and/or unzipped .deb file to install RStudio (Alternatively, .deb file can be installed from terminal by typing sudo dpkg -i installed_Rstudio_file.deb) # Session Info ```{r SessionInfo, eval=TRUE, message=FALSE, echo=FALSE} sessionInfo() ```