---
title: "Suppl. Ch. 3 - Creating Data Objects"
author: "Gabriel Odom"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
  word_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{Suppl. 3. Create Data Objects}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE,
                      cache = FALSE,
                      comment = "#>")
```

# 1. Overview
This vignette is the third chapter in the "Pathway Significance Testing with `pathwayPCA`" workflow, providing a detailed perspective on the [Creating Data Objects](https://gabrielodom.github.io/pathwayPCA/articles/Supplement1-Quickstart_Guide.html#create-an-omics-data-object) section of the Quickstart Guide. This vignette builds on the material covered in the ["Import and Tidy Data"](https://gabrielodom.github.io/pathwayPCA/articles/Supplement2-Importing_Data.html) vignette. This guide will outline the major steps needed to create a data object for analysis with the `pathwayPCA` package. These objects are called `Omics`-class objects.

## 1.1 Outline
Before we move on, we will outline our steps. After reading this vignette, you should be able to:

1. Describe the data components within the `Omics` object class.
2. Create a few `Omics` objects.
3. Inspect and edit individual elements contained in these objects.

First, load the `pathwayPCA` package and the [`tidyverse` package suite](https://www.tidyverse.org/).
```{r packageLoad, message=FALSE}
library(tidyverse)
library(pathwayPCA)
```

## 1.2 Import Data
Because this is the third chapter in the workflow, we assume that

1. Your assay is "tidy".
2. Your gene pathways list is stored in a `pathwayCollection` object.
3. Your phenotype and assay data have already been ID-matched.

If you are unsure about any of the three points above (or you don't know what these mean), please review the [Import and Tidy Data](https://gabrielodom.github.io/pathwayPCA/articles/Importing_Data.html) vignette first.
It isn't very long, but it will help you set up your data in the right way. If your data is not in the proper form, the steps in this vignette may be very difficult.

For the purpose of example, we will load some "toy" data: a combined assay / phenotype data frame and a `pathwayCollection` list. These objects already fit the three criteria above. This tidy data set has 656 gene expression measurements (columns) on 250 colon cancer patients (rows).
```{r tumour_data_load}
data("colonSurv_df")
colonSurv_df
```

Notice that the assay and survival response information have already been merged, so we have two additional columns (for Overall Survival Time and its corresponding death indicator).

We also have a small collection of 15 pathways which correspond to our example colon cancer assay.
```{r pathway_list_load}
data("colon_pathwayCollection")
colon_pathwayCollection
str(colon_pathwayCollection$pathways, list.len = 10)
```

The pathway collection and tidy assay (with matched phenotype information) are all the information we need to create an `Omics`-class data object.

*******************************************************************************

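
As a reminder of what the third assumption in Section 1.2 (ID-matched phenotype and assay data) means in practice, here is a minimal base-`R` sketch. The objects `toy_assay` and `toy_pheno` and their column names are hypothetical and invented for illustration; they are not part of `pathwayPCA`, and this is only one of several ways to align two tables by sample ID.

```r
# Hypothetical toy tables; names and values are assumptions for illustration
toy_assay <- data.frame(
  sampleID = c("s1", "s2", "s3"),
  geneA    = c(0.1, 0.5, 0.9),
  stringsAsFactors = FALSE
)
toy_pheno <- data.frame(
  sampleID = c("s3", "s1", "s2"),
  time     = c(5, 10, 3),
  stringsAsFactors = FALSE
)

# Find, for each assay row, the matching row of the phenotype table
row_idx <- match(toy_assay$sampleID, toy_pheno$sampleID)

# Reorder the phenotype table so that its rows line up with the assay
toy_pheno_matched <- toy_pheno[row_idx, ]

# TRUE when the two tables list the samples in the same order
identical(toy_assay$sampleID, toy_pheno_matched$sampleID)
```

If the final check is `FALSE` for your own data, or if `row_idx` contains `NA`, the two tables do not yet describe the same samples in the same order.
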
# 2. `Omics`-Class Objects Defined
Now that we have our data loaded, we can create an analysis object for the `pathwayPCA` package.

## 2.1 Class Overview
In this package, all primary input data will be in an `Omics` data object. There are three response-specific classes of `Omics*` objects (plus their parent class), but one function (`CreateOmics`) creates all of them. Each class contains a tidy assay and `pathwayCollection` list. The classes differ in the type of response information they can hold. The classes, and their responses, are

1. `OmicsSurv`: a data object for survival information, which includes event time (the time of last follow-up with a subject) and event indicator (did the subject die, or was the observation right-censored).
2. `OmicsReg`: a data object for continuous responses (usually a linear regression response).
3. `OmicsCateg`: a data object for categorical responses, the dependent variable of a generalized linear model. Currently, we only support binary classification (through logistic regression).
4. `OmicsPathway`: a data object with no response. This is the "parent" class for the other three `Omics` classes.

## 2.2 Review of Data Types in `R`
Take a quick look back at the structure of our `colonSurv_df` object. We have a table data frame with the sample IDs in the first column, the subject response information in the next two columns, and the rest as an expression design matrix. Look at the types of the columns of this data frame (the `<dbl>` and `<int>` tags directly under the column names): these tags tell us that the columns contain "double / numeric" (`dbl`) and "integer" (`int`) information. The other tags we could potentially see here are `<chr>` (character), `<lgl>` (logical), or `<fct>` (factor). These tags are important because they identify which "class" of data is in each column.

Here are some examples of how to change data between types. We inspect the first 10 entries of each object.

```{r data_types}
# Original integer column
head(colonSurv_df$OS_event, 10)

# Integer to Character
head(as.character(colonSurv_df$OS_event), 10)

# Integer to Logical
head(as.logical(colonSurv_df$OS_event), 10)

# Integer to Factor
head(as.factor(colonSurv_df$OS_event), 10)
```

The `CreateOmics` function puts the response information into specific classes:

- Survival data is stored as a pair of `numeric` (time) and `logical` (death indicator) vectors.
- Regression data is stored in a `numeric` or `integer` vector.
- Binary classification data is stored in a `factor` vector.

These restrictions are on purpose: the internal data creation functions in the `pathwayPCA` package have very specific requirements about the types of data they take as inputs. This ensures the integrity of your data analysis.

*******************************************************************************

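
To connect the three storage rules above to the coercion functions just shown, here is a minimal base-`R` sketch. The data frame `toy_df` and its column names are hypothetical; the sketch only illustrates the target types and does not reproduce what `CreateOmics` does internally.

```r
# Hypothetical toy data; the column names are assumptions for illustration
toy_df <- data.frame(
  time  = c(12L, 30L, 7L, 45L),
  event = c(1L, 0L, 1L, 0L)
)

# Survival-style response: numeric time and logical event indicator
time_num  <- as.numeric(toy_df$time)
event_lgl <- as.logical(toy_df$event)

# Regression-style response: a numeric vector
resp_num <- as.numeric(toy_df$time)

# Binary classification-style response: a factor with two levels
resp_fct <- as.factor(toy_df$event)

# Inspect the resulting types
class(time_num)
class(event_lgl)
class(resp_num)
class(resp_fct)
nlevels(resp_fct)
```

Checking `class()` and `nlevels()` on your own response columns before building an `Omics` object is a quick way to see whether they already have the expected types.
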
# 3. Create New `Omics` Objects

## 3.1 Overview of Subtypes
All new `Omics` objects are created with the `CreateOmics` function. You should use this function to create `Omics`-class objects for survival, regression, or categorical responses. This `CreateOmics` function *internally* calls on a specific creation function for each response type:

- Survival response: the `CreateOmicsSurv()` function creates an `Omics` object with class `OmicsSurv`. This object will contain:
    + `eventTime`: a `numeric` vector of event times.
    + `eventObserved`: a `logical` vector of death (or other event) indicators. This format precludes the option of recurrent-event survival analysis.
    + `assayData_df`: a tidy `data.frame` or `tibble` of assay data. Rows are observations or subjects; the columns are -Omics measures (e.g. transcriptome). The column names *must* match a subset of the genes provided in the pathways list (in the `pathwayCollection` object).
    + `pathwayCollection`: a `list` of pathway information, as returned by the `read_gmt` function (see the [Import and Tidy Data](https://gabrielodom.github.io/pathwayPCA/articles/Importing_Data.html#the-read_gmt-function) vignette for more details). The names of the genes in these pathways *must* match a subset of the genes recorded in the assay data frame (in the `assayData_df` object).
- Regression response: the `CreateOmicsReg()` function creates an `Omics` object with class `OmicsReg`. This object will contain:
    + `response`: a `numeric` vector of the response.
    + `assayData_df`: a tidy `data.frame` or `tibble` of assay data, as described above.
    + `pathwayCollection`: a `list` of pathway information, as described above.
- Binary Classification response: the `CreateOmicsCateg()` function creates an `Omics` object with class `OmicsCateg`. In future versions, this function will be able to take in $n$-ary responses and ordered categorical responses, but we only support binary responses for now.
This object will contain:
    + `response`: a `factor` vector of the response.
    + `assayData_df`: a tidy `data.frame` or `tibble` of assay data, as described above.
    + `pathwayCollection`: a `list` of pathway information, as described above.

In order to create example `Omics`-class objects, we will consider the overall patient survival time (and corresponding censoring indicator) as our survival response, the event time as our regression response, and the event indicator as our binary classification response.

## 3.2 Create a Survival `Omics` Data Object
Now we are prepared to create our first survival `Omics` object for later analysis with either AES-PCA or Supervised PCA. Recall that the `colonSurv_df` data frame has the sample IDs in the first column, the survival time in the second column, the event indicator in the third column, and the assay expression data in the subsequent columns. Therefore, the four arguments to the `CreateOmics` function will be:

- `assayData_df`: this will be the sample ID column followed by the *expression columns* of the `colonSurv_df` data frame (i.e. all but the second and third columns). In `R`, we can remove the second and third columns of the `colonSurv_df` data frame by negative subsetting: `colonSurv_df[, -(2:3)]`.
- `pathwayCollection_ls`: this will be the `colon_pathwayCollection` list object. Recall that you can import a `.gmt` file into a `pathwayCollection` object via the `read_gmt` function, or create a `pathwayCollection` list object by hand with the `CreatePathwayCollection` function.
- `response`: this will be the first three columns of the `colonSurv_df` data frame: the sample IDs, the survival time (stored in the `OS_time` column), and the event indicator (stored in the `OS_event` column).
- `respType`: this will be the word `"survival"` or an abbreviation of it.

Also, when you create an `Omics*`-class object, the `CreateOmics()` function prints helpful diagnostic messages about the overlap between the features in the supplied assay data and those in the pathway collection.

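
To make the idea of "overlap" concrete, here is a minimal base-`R` sketch of how two such overlap percentages can be computed from a set of assay gene names and a list of pathways. The objects `assay_genes` and `toy_pathways` are hypothetical toy inputs, and the calculation is an illustration only; it is not the code that `CreateOmics()` runs.

```r
# Hypothetical toy inputs (not the colon cancer data)
assay_genes <- c("A", "B", "C", "D", "E")
toy_pathways <- list(
  path1 = c("A", "B", "X"),
  path2 = c("C", "D", "Y", "Z")
)

# Unique genes named by any pathway
pathway_genes <- unique(unlist(toy_pathways))

# Genes present in both the assay and the pathways
shared_genes <- intersect(assay_genes, pathway_genes)

# Percentage of pathway genes that were not measured in the assay
pct_missing <- 100 * (1 - length(shared_genes) / length(pathway_genes))

# Percentage of assay genes that at least one pathway calls for
pct_covered <- 100 * length(shared_genes) / length(assay_genes)

round(pct_missing, 1)
round(pct_covered, 1)
```

In this toy case the pathways name seven unique genes, four of which are measured, so most of the assay is covered even though several pathway genes are missing.
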
```{r create_OmicsSurv_object}
colon_OmicsSurv <- CreateOmics(
  assayData_df = colonSurv_df[, -(2:3)],
  pathwayCollection_ls = colon_pathwayCollection,
  response = colonSurv_df[, 1:3],
  respType = "surv"
)
```

The last three sentences inform you of how strong the overlap is between the genes measured in your data and the genes selected in your pathway collection. This message tells us that 9% of the 676 total genes included in all pathways were not measured in the assay; zero pathways were removed from the pathways list for having too few genes after gene trimming; and the genes in the pathways list call for 93.8% of the 656 genes measured in the assay.

The last number is the most important: it measures how well your pathway collection overlaps with the genes measured in your assay. This number should be as close to 100% as possible. These diagnostic messages depend only on the overlap between the pathway collection and the assay, so they are response agnostic.

## 3.3 View the New Object
To view a summary of the contents of the `colon_OmicsSurv` object, simply print it to the `R` console.
```{r print_colonOmicsSurv}
colon_OmicsSurv
```

Also notice that the `CreateOmics()` function stores a "cleaned" copy of the pathway collection. The object creation functions within the `pathwayPCA` package subset the feature data frame by the genes in each pathway. Therefore, if we have genes in the pathways that are not recorded in the data frame, then we will necessarily create missing (`NA`) predictors. To circumvent this issue, we check if each gene in each pathway is recorded in the data frame, and remove from each pathway the genes for which the assay does not have recorded expression levels. However, if we remove genes from pathways which do not have recorded levels in the predictor data frame, we could theoretically remove all the genes from a given pathway.
Thus, we also check to make sure that each pathway in the given pathways list still has some minimum number of genes present (defaulting to three or more) after we have removed genes without corresponding expression levels.

The `IntersectOmicsPwyCollct()` function performs these two actions simultaneously, and this function is called and executed automatically within the object creation step. This function removes the unrecorded genes from each pathway, trims the pathways that have fewer than the minimum number of genes allowed, and returns a "trimmed" pathway collection. If any pathways are removed by this execution, the `pathways` list within the `trimPathwayCollection` object within the `Omics` object will have a character vector of the removed pathways stored as the `"missingPaths"` attribute. Access this attribute with the `attr()` function.

## 3.4 Regression and Classification `Omics` Data Objects
We create regression- and categorical-type `Omics` data objects in the same way as survival-type `Omics` objects. We will use the survival time as our toy regression response and the death indicator as the toy classification response.
```{r create_OmicsReg_object}
colon_OmicsReg <- CreateOmics(
  assayData_df = colonSurv_df[, -(2:3)],
  pathwayCollection_ls = colon_pathwayCollection,
  response = colonSurv_df[, 1:2],
  respType = "reg"
)
colon_OmicsReg
```

```{r create_OmicsCateg_object}
colon_OmicsCateg <- CreateOmics(
  assayData_df = colonSurv_df[, -(2:3)],
  pathwayCollection_ls = colon_pathwayCollection,
  response = colonSurv_df[, c(1, 3)],
  respType = "categ"
)
colon_OmicsCateg
```

*******************************************************************************

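
Returning to the pathway trimming described in Section 3.3, the two steps (dropping unrecorded genes, then dropping pathways that become too small) can be sketched in base `R` as follows. This is an illustration under stated assumptions, not the source of `IntersectOmicsPwyCollct()`: the objects `assay_genes`, `toy_pathways`, and `min_genes` are hypothetical, and only the `"missingPaths"` attribute name is taken from the text above.

```r
# Hypothetical toy inputs
assay_genes <- c("A", "B", "C", "D", "E")
toy_pathways <- list(
  path1 = c("A", "B", "C", "X"),
  path2 = c("D", "Y", "Z"),
  path3 = c("B", "C", "D", "E")
)
min_genes <- 3L

# Step 1: within each pathway, keep only genes recorded in the assay
trimmed <- lapply(toy_pathways, intersect, assay_genes)

# Step 2: drop pathways that now have fewer than the minimum number of genes
keep_lgl <- lengths(trimmed) >= min_genes
kept_pathways <- trimmed[keep_lgl]

# Record the names of the removed pathways as an attribute
attr(kept_pathways, "missingPaths") <- names(trimmed)[!keep_lgl]

# Inspect the result
names(kept_pathways)
attr(kept_pathways, "missingPaths")
```

Here `path2` keeps only one measured gene, so it is removed and its name is stored in the attribute, which is then read back with `attr()`.
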
# 4. Inspecting and Editing `Omics`-Class Objects
In order to access or edit a specific component of an `Omics` object, we need to use specific *accessor* functions. These functions are named with the component they access.

## 4.1 Example "Get" Function
The `get*` functions access the part of the data object you specify. You can save these objects to their own variables, or simply print them to the screen for inspection. Here we print the assay data frame contained in the `colon_OmicsSurv` object to the screen:
```{r getAssay_get}
getAssay(colon_OmicsSurv)
```

This function is rather simple: it shows us what object is stored in the `assayData_df` slot of the `colon_OmicsSurv` data object. As we should expect, we see all the columns of the `colonSurv_df` data frame except for the first three (the sample IDs, survival time, and event indicator).

## 4.2 Example "Set" Function
If we need to edit the assay data frame in the `colon_OmicsSurv` object, we can use the "replacement" syntax of the `getAssay` function. These are the "set" functions, and they use the `getSLOT(object) <- value` syntax. For example, if we want to remove all of the genes except for the first ten from the assay data, we can replace this assay data with a subset of the original `colonSurv_df` data frame. The `SLOT` shorthand name is `Assay`, and the replacement value is the first ten gene expression columns (in columns 4 through 13) of the `colonSurv_df` data frame: `colonSurv_df[, (4:13)]`.
```{r getAssay_set}
getAssay(colon_OmicsSurv) <- colonSurv_df[, (4:13)]
```

Now, when we inspect the `colon_OmicsSurv` data object, we see only ten variables measured in the `assayData_df` slot, instead of our original 656.
```{r view_new_colon_OS}
colon_OmicsSurv
```

Before we move on, we should reset the data in the `assayData_df` slot to the full data by
```{r getAssay_set2}
getAssay(colon_OmicsSurv) <- colonSurv_df[, -(1:3)]
```

## 4.3 Table of Accessors
Here is a table listing each of the "get" and "set" methods for the `Omics` class, and which sub-classes they can access or modify.

| Command                                 | `Omics` Sub-class | Function                                                   |
|-----------------------------------------|:-----------------:|------------------------------------------------------------|
| `getAssay(object)`                      | All               | Extract the `assayData_df` data frame stored in `object`.  |
| `getAssay(object) <- value`             | All               | Set `assayData_df` stored in `object` to `value`.          |
| `getSampleIDs(object)`                  | All               | Extract the `sampleIDs_char` vector stored in `object`.    |
| `getSampleIDs(object) <- value`         | All               | Set `sampleIDs_char` stored in `object` to `value`.        |
| `getPathwayCollection(object)`          | All               | Extract the `pathwayCollection` list stored in `object`.   |
| `getPathwayCollection(object) <- value` | All               | Set `pathwayCollection` stored in `object` to `value`.     |
| `getEventTime(object)`                  | `Surv`            | Extract the `eventTime_num` vector stored in `object`.     |
| `getEventTime(object) <- value`         | `Surv`            | Set `eventTime_num` stored in `object` to `value`.         |
| `getEvent(object)`                      | `Surv`            | Extract the `eventObserved_lgl` vector stored in `object`. |
| `getEvent(object) <- value`             | `Surv`            | Set `eventObserved_lgl` stored in `object` to `value`.     |
| `getResponse(object)`                   | `Reg` or `Categ`  | Extract the `response` vector stored in `object`.          |
| `getResponse(object) <- value`          | `Reg` or `Categ`  | Set `response` stored in `object` to `value`.              |

The `response` vector accessed or edited with the `getResponse` method depends on whether the `object` supplied is a "regression" `Omics`-class object or a "categorical" one. For regression `Omics` objects, `getResponse(object)` and `getResponse(object) <- value` get and set, respectively, the `response_num` slot.
However, for categorical `Omics` objects, `getResponse(object)` and `getResponse(object) <- value` get and set, respectively, the `response_fact` slot. This is because regression objects contain `numeric` response vectors while categorical objects contain `factor` response vectors.

## 4.4 Inspect the Updated `pathwayCollection` List
As we mentioned in the [Importing with the `read_gmt` Function](https://gabrielodom.github.io/pathwayPCA/articles/Supplement2-Importing_Data.html#import-gmt-files-with-read_gmt) subsection of the previous vignette, the `pathwayCollection` object will be modified upon `Omics`-object creation. Before, this list had only two elements, `pathways` and `TERMS` (we skipped importing the "description" field). Now, it has a third element: `setsize`, the number of genes contained in each pathway.
```{r inspect_updated_pathwayCollection}
getPathwayCollection(colon_OmicsSurv)
```

*******************************************************************************

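
To illustrate what the `setsize` element from Section 4.4 records, here is a minimal base-`R` sketch on a plain list. The object `toy_collection` is hypothetical: it only mimics the `pathways` and `TERMS` elements named above and is not a true `pathwayCollection` object, and how the package itself computes and names `setsize` is not shown here.

```r
# Hypothetical plain list mimicking the two original elements
toy_collection <- list(
  pathways = list(c("A", "B", "C"), c("B", "D")),
  TERMS    = c("toy_path1", "toy_path2")
)

# Number of genes in each pathway
toy_collection$setsize <- lengths(toy_collection$pathways)

# Label each count with its pathway name (a choice made for this sketch)
names(toy_collection$setsize) <- toy_collection$TERMS

toy_collection$setsize
```

The `lengths()` function returns one integer per list element, which is exactly the "number of genes per pathway" summary.
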
# 5. Review
We now summarize our steps so far. We have

1. Defined the `Omics` class and three sub-classes: survival, regression, and categorical (and the "parent" class).
2. Created an `Omics` object for each of the three sub-classes.
3. Inspected and edited individual elements contained in these objects.

Now we are prepared to analyze our created data objects with either AES-PCA or Supervised PCA. Please read vignette chapter 4 next: [Test Pathway Significance](https://gabrielodom.github.io/pathwayPCA/articles/Supplement4-Methods_Walkthrough.html).

Here is the R session information for this vignette:
```{r sessionDetails}
sessionInfo()
```