--- title: "Suppl. Ch. 2 - Import and Tidy Data" author: "Gabriel Odom" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true word_document: toc: true vignette: > %\VignetteIndexEntry{Suppl. 2. Importing Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, cache = FALSE, comment = "#>") ``` # 1. Overview This vignette is the second chapter in the "Pathway Significance Testing with `pathwayPCA`" workflow, providing a detailed perspective to the [Import Data](https://gabrielodom.github.io/pathwayPCA/articles/Supplement1-Quickstart_Guide.html#import-data) section of the Quickstart Guide. This vignette will discuss using the the `read_gmt` function to import Gene Matrix Transposed (`.gmt`) [pathway collection files](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29) as a list object with class `pathwayCollection`. Also, we will discuss importing assay and response data, and how to make your assay data [tidy](https://www.jstatsoft.org/article/view/v059i10). For our pathway analysis to be meaningful, we need gene expression data (from a microarray or something similar), corresponding phenotype information (such as weight, type of cancer, or survival time and censoring indicator), and a pathway collection. Before we move on, we will outline our steps. After reading this vignette, you should be able to 1. Import a `.gmt` file and save the pathways stored therein as a `pathwayCollection` object using the `read_gmt` function. 2. Import an assay `.csv` file with the `read_csv` function from the `readr` package, and transpose this data frame into "tidy" form with the `TransposeAssay` function. 3. Import phenotype information stored in a `.csv` file, and join (merge) it to the assay data frame with the `inner_join` function from the `dplyr` package. First, load the `pathwayPCA` package and the [`tidyverse` package suite](https://www.tidyverse.org/). ```{r packageLoad, message=FALSE} library(tidyverse) # Set tibble data frame print options options(tibble.max_extra_cols = 10) library(pathwayPCA) ``` *******************************************************************************
# 2. GMT Files The `.gmt` format is a commonly used file format for storing [pathway collections](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). Lists of pathways in the Molecular Signatures Database (MSigDB) can be downloaded from the [MSigDB Collections page](http://software.broadinstitute.org/gsea/msigdb/collections.jsp). ## 2.1 GMT Format Description GMT-formatted files follow a very specific set of rules: 1. Each row of the file represents a pathway, and only one pathway is allowed per line. 2. The first entry in each row is the pathway name; e.g. `"KEGG_STEROID_BIOSYNTHESIS"`. 3. The second entry in each row is an optional brief description of the pathway; e.g. `"http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_STEROID_BIOSYNTHESIS"`. 4. The third to the last entry on each row are the gene names in the pathway; e.g. `"SOAT1" "LSS" "SQLE" "EBP" "CYP51A1" "DHCR7" "CYP27B1" "DHCR24" "HSD17B7" "MSMO1" "FDFT1" "SC5DL" "LIPA" "CEL" "TM7SF2" "NSDHL" "SOAT2"`. 5. Each entry in each line is seperated by a tab. ## 2.2 Import GMT files with `read_gmt` Based on the clearly-organized `.gmt` file format, we were able to write a very fast function to read `.gmt` files into R. The `read_gmt` function takes in a path specifying where your `.gmt` file is stored, and outputs a pathways list. ```{r read_gmt} gmt_path <- system.file("extdata", "c2.cp.v6.0.symbols.gmt", package = "pathwayPCA", mustWork = TRUE) cp_pathwayCollection <- read_gmt(gmt_path, description = TRUE) ``` We now carefully discuss the form of this information. This `cp_pathwayCollection` object has class `pathwayCollection` and contains the following components: 1. `pathways`: A list of character vectors. Each character vector should contain a subset of the names of the -Omes measured in your assay data frame. These pathways should not be too short, otherwise we devolve the problem into simply testing individual genes. The `pathwayPCA` package requires each pathway to have a minimum of three genes recorded in the assay data frame. **Important**: some protein set lists have proteins markers recorded as character numerics (e.g. "3"), so make sure the feature names of your assay have an overlap with the gene or protein names in the `pathwayCollection` list. Ensure that there is a non-empty overlap between the gene names in the pathways list and the feature names of the assay. Not every gene in your assay data frame will be in the pathways list, and not every gene in each pathway will have a corresponding measurement in the assay data frame. *However, for meaningful results, there should be a significant overlap between the genes measured in the assay data frame and the gene names stored in the pathways list.* If your pathways list has very few matching genes in your assay, then your pathway-based analysis results will be significantly degraded. **Make sure your pathways list and assay data are compatible.** 2. `TERMS`: A character vector comprised of the proper name of each pathway in the pathway collection. 3. `description`: (OPTIONAL) A character vector the same length as the pathways list with descriptive information. For instance, the `.gmt` file included with this package has hyperlinks to the MSigDB description card for that pathway in this field. This field will be imported by the `read_gmt` function when `description = TRUE` (it defaults to `FALSE`). 4. `setsize`: the number of genes originally recorded in each pathway, stored as an integer vector. NOTE: *this information is calculated and added to the pathways list at `Omics`-class object creation (later in the workflow).* This information is useful to measure the ratio of the number of genes from each pathway recorded in your assay to the number of genes defined to be in that pathway. For each pathway, this ratio should be at least 0.5 for best pathway analysis results. The object itself has the following structure: ```{r pathwayCollection_structure} cp_pathwayCollection ``` This object will be the list supplied to the `pathwayCollection_ls` argument in the `CreateOmics` function. ## 2.3 Creating Your Own `pathwayCollection` List Additionally, you can create a `pathwayCollection` object from scratch with the `CreatePathwayCollection` function. This may be useful to users who have their pathway information stored in some form other than a `.gmt` file. You must supply a list of vectors of gene names to the `pathways` argument, and a vector of the proper names of each pathway to the `TERMS` argument. You could also store any other pertinant pathway information by passing a ` = ` pair to this function. ```{r create_test_pathwayCollection} myPathways_ls <- list( pathway1 = c("Gene1", "Gene2"), pathway2 = c("Gene3", "Gene4", "Gene5"), pathway3 = "Gene6" ) myPathway_names <- c( "KEGG_IMPORTANT_PATHWAY_1", "KEGG_IMPORTANT_PATHWAY_2", "SOME_OTHER_PATHWAY" ) CreatePathwayCollection( sets_ls = myPathways_ls, TERMS = myPathway_names, website = "URL_TO_PATHWAY_CITATION" ) ``` ## 2.4 Importing a Pathway Collection from Wikipathways To download a `.gmt` file from Wikipathways, we recommend the `R` package [`rWikiPathways`](https://bioconductor.org/packages/release/bioc/html/rWikiPathways.html). From their [vignette](https://bioconductor.org/packages/release/bioc/vignettes/rWikiPathways/inst/doc/Overview.html): > WikiPathways also provides a monthly data release archived at http://data.wikipathways.org. The archive includes GPML, GMT and SVG collections by organism and timestamped. There’s an R function for grabbing files from the archive... > > `downloadPathwayArchive()` > > This will simply open the archive in your default browser so you can look around (in case you don’t know what you are looking for). By default, it opens to the latest collection of GPML files. However, if you provide an organism, then it will download that file to your current working directory or specified destpath. For example, here’s how you’d get the latest GMT file for mouse: > > `downloadPathwayArchive(organism = "Mus musculus", format = "gmt")` > > And if you might want to specify an archive date so that you can easily share and reproduce your script at any time in the future and get the same result. Remember, new pathways are being added to WikiPathways every month and existing pathways are improved continuously! > > `downloadPathwayArchive(date = "20171010", organism = "Mus musculus", format = "gmt")` > ## 2.4 Writing a `pathwayCollection` Object to a `.gmt` File Finally, we can save the `pathwayCollection` object we just created via the `write_gmt()` function: ```{r write_gmt, eval=FALSE} write_gmt( pathwayCollection = cp_pathwayCollection, file = "../test.gmt" ) ``` *******************************************************************************
# 3. Import and Tidy an Assay Matrix We assume that the assay data (e.g. transcriptomic data) is either in an Excel file or flat text file. For example, your data may look like this: ![](example_assay_data.PNG) In this data set, the columns are individual samples. The values in each row are the -Omic expression measurements for the gene in that row. ## 3.1 Import with `readr` To import data files in `.csv` (comma-separated), `.fwf` (fixed-width), or `.txt` (tab-delimited) format, we recommend the [`readr` package](https://readr.tidyverse.org/). You can `.csv` files with the `read_csv` function, fixed-width files with `read_fwf`, and general delimited files with `read_delim`. These functions are all from the `readr` package. Additionally, for data in `.xls` or `.xlsx` format, we recommend the [`readxl`](http://readxl.tidyverse.org/) package. We would read a `.csv` data file via ```{r read_assay} assay_path <- system.file("extdata", "ex_assay_subset.csv", package = "pathwayPCA", mustWork = TRUE) assay_df <- read_csv(assay_path) ``` The `read_csv` function warns us that the name of the first column is missing, but then automatically fills it in as `X1`. Further, this function prints messages to the screen informing you of the assumptions it makes when importing your data. Specifically, this message tells us that all the imported data is numeric (`.default = col_double()`) except for the gene name column (`X1 = col_character()`). Let's inspect our assay data frame. Note that the gene names were imported as a character column, as shown by the `` tag at the top of the first column. This data import step stored the row names (the gene names) as the first column, and preserved the column names (sample labels) of the data. ```{r assay_print} assay_df ``` ## 3.2 Tidy the Assay Data Frame The assay input to the `pathwayPCA` package must be in [*tidy data*](https://www.jstatsoft.org/article/view/v059i10) format. The "Tidy Data" format requires that each observation be its own row, and each measurement its own column. This means that we must transpose our assay data frame, while preserving the row and column names. To do this, we can use the `TransposeAssay` function. This function takes in a data frame as imported by the three `readr` functions based on data in a format similar to that shown above: genes are the rows, gene names are the first column, samples are stored in the subsequent columns, and all values in the assay (other than the gene names in the first column) are numeric. ```{r transpose} (assayT_df <- TransposeAssay(assay_df)) ``` This transposed data frame has the gene names as the column names and the sample names as a column of character (`chr`) values. Notice that the data itself is 17 genes measured on 36 samples. Before transposition, we had 37 columns because the feature names were stored in the first column. After transposition, we have 36 rows but 18 columns: the first column stores the sample names. This transposed data frame (after filtering to match the response data) will be supplied to the `assayData_df` argument in the `CreateOmics` function. (*See the [Creating `Omics` Data Objects](https://gabrielodom.github.io/pathwayPCA/articles/Create_Omics_Objects.html) vignette for more information on creating `Omics`-class objects.*) ## 3.3 Subsetting a Tidy Data Frame If ever we need to extract individual components of a tidy data frame, we can use the `assay[row, col]` syntax. If we need entire measurements (columns), then we can call the column by name with the `assay$ColName` syntax. For example, - If we need the second row of `assayT_df`---corresponding to Sample "T21101312"---then we type ```{r subset_2ndrow} assayT_df[2, ] ``` Notice that the `tibble` object has 1 row and 18 columns. - If we need the third column of `assayT_df`---corresponding to Gene "LSS"---then we type ```{r subset_3rdcol} assayT_df[, 3] ``` This `tibble` object has 36 rows and 1 column. - If we need the intersection of these two (the expression level of Gene "LSS" in Sample "T21101312"), then we type ```{r subset_23} assayT_df[2, 3, drop = TRUE] ``` This output would normally be a 1 by 1 `tibble` (which isn't terribly helpful), so we add the `drop = TRUE` argument to "drop" the dimensions of the table. This gives us a single basic number (*scalar*). - If we need the third column of `assayT_df`, but we want the result back as a vector instead of a `tibble`, we call the column by name: ```{r subset_3rdcol_byname} assayT_df$LSS ``` ## 3.4 Data from a `SummarizedExperiment` Object Oftentimes, genomic experiment data is stored in a `SummarizedExperiment`-class object. If your assay and response data are stored in such an object, use the `SE2Tidy()` function to extract the necessary information and return it as a tidy data frame. Because `SummarizedExperiment` objects can have more than one assay, you must specify the index for the assay of your choice with the `whichAssay` argument. Here is an example using the `airway` data: ```{r stdExpr_Example, message=FALSE} library(SummarizedExperiment) data(airway, package = "airway") airway_df <- SE2Tidy(airway) ``` Now we can look at a nice summary of the tidied assay and response data. This will drop all of the gene-specific metadata, as well as any experiment metadata. However, `pathwayPCA` can't make use of this data anyway, so we haven't lost much. ```{r} airway_df[, 1:20] ``` *******************************************************************************
# 4. Import and Join Response Data We now have an appropriate pathways list and a tidy -Omics assay data frame. All we need now is some response data. Let's imagine that your phenotype data looks something like this: ![](example_pInfo_data.PNG) We next import this response information. We can use the `read_csv` function once again: ```{r read_pinfo} pInfo_path <- system.file("extdata", "ex_pInfo_subset.csv", package = "pathwayPCA", mustWork = TRUE) pInfo_df <- read_csv(pInfo_path) ``` This phenotype data frame has a column for the sample labels (`Sample`) and the response information. In this case, our response is a survival response with an event time and observation indicator. ```{r pInfo} pInfo_df ``` This `pInfo` data frame has the sample names as a column of character values, just like the transposed assay data frame. This is crucially important for the "joining" step. We can use the `inner_join` function from the `dplyr` library to retain only the rows of the `assayT_df` data frame which have responses in the `pInfo` data frame and vice versa. This way, every response in the phenotype data has matching genes in the assay, and every recorded gene in the assay matches a response in the phenotype data. ```{r innerJoin} joinedExperiment_df <- inner_join(pInfo_df, assayT_df, by = "Sample") joinedExperiment_df ``` **This requires you to have a *key* column in both data frames with the same name.** If the key column was called "Sample" in the `pInfo_df` data set but "SampleID" in the assay, then the `by` argument should be changed to `by = c("Sample" = "SampleID")`. It's much nicer to just keep them with the same names, however. Moreover, it is vitally important that you check your sample IDs. Obviously the recorded genetic data should pair with the phenotype information, but **it is your responsibility as the user to confirm that the assay rows match the correct responses.** You are ultimately responsible to defend the integrity of your data and to use this package properly. *******************************************************************************
# 5. Example Tidy Assay and Pathways List Included in this package, we have a small tidy assay and corresponding gene subset list. We will load and inspect this assay. This data set has 656 gene expression measurements on 250 colon cancer patients. Further notice that the assay and overall survival response information have already been matched. ```{r tumour_data_load} data("colonSurv_df") colonSurv_df ``` We also have a small list of 15 pathways which correspond to our example colon cancer assay. To create a toy example, we have curated this artificial pathways list to include seven significant pathways and eight non-significant pathways. ```{r pathway_list_load} data("colon_pathwayCollection") colon_pathwayCollection ``` The pathways list and tidy assay (with matched phenotype information) are all the information we need to create an `Omics`-class data object. *******************************************************************************
# 6. Review We now summarize our steps so far. We have 1. Imported a `.gmt` file and saved the pathways stored therein as a `pathwayCollection` object using the `read_gmt` function. 2. Imported an assay `.csv` file with the `read_csv` function from the `readr` package, and transposed this data frame into "tidy" form with the `TransposeAssay` function. 3. Imported a phenotype information `.csv` file, and joined it to the assay data frame with the `inner_join` function from the `dplyr` package. Now we are prepared to create our first `Omics`-class object for analysis with either AES-PCA or Supervised PCA. Please read vignette chapter 3: [Creating `Omics` Data Objects](https://gabrielodom.github.io/pathwayPCA/articles/Supplement3-Create_Omics_Objects.html). Here is the R session information for this vignette: ```{r sessionDetails} sessionInfo() ```