--- title: "Extending MetMashR" author: "Gavin Rhys Lloyd" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_depth: 2 number_sections: true toc_float: true vignette: > %\VignetteIndexEntry{Extending MetMashR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(BiocStyle) ``` # Introduction There is a number software solutions available to annotate an LCMS metabolomics datasets. We have only included a small number `annotation_sources` in `MetMashR` as they are the ones we use the most. In this document we describe how to extend the provided templates to include new sources bespoke to your requirements. # Getting Started The latest versions of `r Biocpkg("struct")` and `MetMashR` that are compatible with your current R version can be installed using BiocManager. ```{r,eval = FALSE, include = TRUE} # install BiocManager if not present if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } # install MetMashR and dependencies BiocManager::install("MetMashR") ``` Once installed you can activate the packages in the usual way: ```{r, eval=TRUE, include=FALSE} suppressPackageStartupMessages({ # load the packages library(struct) library(MetMashR) library(metabolomicsWorkbenchR) library(ggplot2) }) ``` ```{r, eval=FALSE, include=TRUE} # load the packages library(struct) library(MetMashR) library(metabolomicsWorkbenchR) library(ggplot2) ``` # Example study In this example we are going to import some annotations from a study on Metabolomics Workbench and build a workflow to clean up the table and search for PubChem CIDs so that we can present some images of some of the annotated metabolites for the study. First, we need to import the table of annotations from Metabolomics Workbench. For this we make use of the `metabolomicsWorkbenchR` package. ```{r example-get-data,eval=FALSE,include=TRUE} # get annotations AN <- do_query( context = "study", input_item = "analysis_id", input_value = "AN000465", output_item = "metabolites" ) ``` The imported table has 747 rows and 8 columns. For brevity in this vignette we have cached the first 10 rows and stored them in the package. ```{r, include=FALSE,eval=TRUE} AN <- readRDS( system.file("extdata/AN000465_subset.rds", package = "MetMashR") ) ``` We can convert the imported table into an `annotation_table`: ```{r} AT <- annotation_table(data = AN, id_column = NULL) ``` All `annotation_table` objects must have an id for each row in the table. This is so that later, when examining the outputs of workflow steps, you can easily trace an annotation through the workflow. The default `id_column = NULL` will create a new column '.MetMashR_id' containing the row index as an identifier unless provided with an alternative column name, which should be the name of a column already in the table. Two steps needed to clean up this table are not provided by `MetMashR`: - removal of empty columns - removal of suffix from duplicate molecule names We will implement these steps here as an example of how to add new workflow steps using the `r Biocpkg("struct")` package template system. We will also implement an new `annotation_source` to import the data from Metabolomics Workbench as part of the workflow. # Annotation Sources Annotation sources are the mechanism used by `MetMashR` to get tables of annotation data into R and into a well-defined format. Each extension of the base `annotation_source` template provides methods to import the raw annotation table and convert it into a type of `annotation_table` suitable for the input data. For example, the `cd_source` defines methods to import annotation data from Compound Discoverer's Excel format and convert it into a `lcms_table` object. # Adding new annotation sources If we plan to import annotation data from Metabolomics Workbench a lot we could consider implementing a new `annotation_source` that would import the annotations as part of a workflow model sequence. We will do this now as an example of implementing a new `annotation_source`. All `annotation_source` objects use the `model` template from the struct package, so we could create the definition of a source on-the-fly. However, annotation_sources are likely to be used multiple times to read in data from the same source, so instead we define the new source in a more permanent way, by using code that can be included in a script to be sourced, or in a new R package. There are three key components to an annotation source object, or indeed any struct model object: 1. A definition of the object which defines what the input and output parameters will be. 2. A function to create an instance of the object and populate initial values for input parameters. 3. A `model_apply` method. For annotation sources this method will read in data from the source file and parse it into an `annotation_table`. We will implement each of these steps now. ## The class definition In this example there are no input and output slots, but they could be defined using the `slots` parameter, and then differentiated by name using `.params` and `.outputs` in the prototype. We have defined `libraries` in the prototype. This is a list of R package names that are needed to use this object in addition to the depends/imports of `MetMashR`. ``` {r} .mwb_source <- setClass( "mwb_source", contains = c("annotation_database"), prototype = list( name = "Import from Metabolomics Workbench", libraries = "metabolomicsWorkbenchR" ) ) ``` ## The calling function The calling function is a small function that creates a new instance of the source object using `new_struct` and provides a mechanism to initialise values based on input parameters. The use of ellipsis `...` allows us to pass additional values to the base `struct` object like name and description without including them in the function definition, in case we want to override the values defined in the prototype of the class definition in the previous section. ```{r} mwb_source <- function(...) { # new object out <- new_struct( "mwb_source", ... ) return(out) } ``` ## The import method Like all model objects, the workhorse method for annotation sources is `method_apply`. For annotation sources this method is used to import and parse the source files into an `annotation_table`. Here we use `setMethod` to define the method for the new source. ```{r, eval=TRUE,include=TRUE} setMethod( f = "read_database", signature = c("mwb_source"), definition = function(obj) { ## get annotations using metabolomicsWorkbenchR # AN = do_query( # context = "study", # input_item = "analysis_id", # input_value = M$analysis_id, # output_item = "metabolites") ## for vignette use locally cached subset AN <- readRDS( system.file("extdata/AN000465_subset.rds", package = "MetMashR") ) return(AN) } ) ``` Note how we pass `M$analysis` id to the `do_query` function so that we can import annotations from any Metabolomics Workbench study when using our new source object. ## Using the new source The new `mwb_source` and `method_apply` method are ready to use. To import the table we can use the `import_source` method: ```{r} # initialise source SRC <- mwb_source( source = "AN000465" ) # import AT <- read_source(SRC) ``` Variable `AT` is the imported and cleaned `annotation_database` from Metabolomics Workbench and can be used as input to `model_apply` for other models and sequence, e.g. to look up PubChem Ids etc. # Adding annotation workflow steps `MetMashR` workflow steps use the `model` template from the `r Biocpkg("struct")` package. Although originally intended for statistical methods, the `model` template is a flexible way to implement many different kinds of workflow step. The `r Biocpkg("struct")` package provides convenience functions for "on-the-fly" implementation of new model objects. Here will will use the convenience functions to implement the first new model. Examples of this are also included in the `r Biocpkg("struct")` package vignette. ## Empty column removal To define a new `model` object we can use the `set_struct_obj` function. The new object is fairly simple in that no input parameters are required. The only output slot will contain the `annotation_table` after we have removed the empty columns. For the `prototype` input we provide a name and a description for our new model. This is good practice as our intentions for this model are explicitly stated and stay with the model wherever we use it, helping to ensure transparency, reproducibility etc. We also defined the default output of the model by providing `predicted` in the `prototype`. ```{r new-empty-removal-obj} set_struct_obj( class_name = "drop_empty_columns", struct_obj = "model", params = character(0), outputs = c(updated = "annotation_source"), private = character(0), prototype = list( name = "Drop empty columns", description = paste0( "A workflow step that removes columns from an annotation table ", "where all rows are NA." ), predicted = "updated" ) ) ``` We can now create an instance of our new model, and use the `show` method to display information about it. ```{r} M <- drop_empty_columns() show(M) ``` Before we can use our new model we need to implement the `model_apply` method for it. This is the function that actually does all the work. In this case it will search for columns of missing values and then remove them from the `annotation_table`. To define this method we use the `set_obj_method` function. ```{r} set_obj_method( class_name = "drop_empty_columns", method_name = "model_apply", signature = c("drop_empty_columns", "annotation_source"), definition = function(M, D) { # search for columns of NA W <- lapply( # for each column D$data, # in the annotation table function(x) { all(is.na(x)) # return TRUE if all rows are NA } ) # get index of columns with all rows NA idx <- which(unlist(W)) # if any found, remove from annotation table if (length(idx) > 0) { D$data[, idx] <- NULL } # update model object M$updated <- D # return object return(M) } ) ``` Note that in the `signature` we specify that the second input is an `annotation_table`. The new model is ready to use. We can test it using the `model_apply` method, and check that some columns have been removed. ```{r} M <- model_apply(M, AT) ``` The number of columns before was: ```{r} ncol(AT$data) ``` The number of columns afterwards is: ```{r} ncol(M$updated$data) ``` ## Suffix removal The second model removes the suffix from the molecule names. The is necessary if we want to use the molecule names to e.g search for PubChem identifiers using the REST API; we wont get a match if the suffix is part of the molecule name. As we did for `drop_empty_columns` we use the `set_struct_obj` and `set_obj_method` function to create our new workflow step. ```{r} # define new model object set_struct_obj( class_name = "remove_suffix", struct_obj = "model", params = c(clean = "logical", column_name = "character"), outputs = c(updated = "annotation_source"), prototype = list( name = "Remove suffix", description = paste0( "A workflow step that removes suffixes from molecule names by ", "splitting a string at the last underscore an retaining the part", "of the string before the underscore." ), predicted = "updated", clean = FALSE, column_name = "V1" ) ) # define method for new object set_obj_method( class_name = "remove_suffix", method_name = "model_apply", signature = c("remove_suffix", "annotation_source"), definition = function(M, D) { # get list of molecule names x <- D$data[[M$column_name]] # split string at last underscore s <- strsplit(x, "_(?!.*_)", perl = TRUE) # get left hand side s <- lapply(s, "[", 1) # if clean replace existing column, otherwise new column if (M$clean) { D$data[[M$column_name]] <- unlist(s) } else { D$data$name.fixed <- unlist(x) } # update model object M$updated <- D # return object return(M) } ) ``` For this model we included two input parameters and provided default values in the `prototype`: - column_name: the name of the column in the annotation table containing molecule names - clean: a flag that will replace the old column if TRUE or add a new column if FALSE # Metabolite mashing Now that we have defined our new workflow steps we can use them in a `model_seq` (workflow), alongside other existing steps to mash the annotations with other tables containing identifiers and additional information. In this case we do the following: 1. import the annotations from Metabolomics Workbench 1. remove empty columns 2. remove suffixes from molecule names 3. search the Metabolomics Workbench refmet database for identifiers 4. use the PubChem REST API so search for compound CIDs 5. combine the CID columns from refmet and pubchem into a single column, giving priority to refmet, to obtain a single column of identifiers we have confidence in for as many metabolites as possible 6. use the PubChem REST API to obtain SMILES for each metabolite For clarity in the workflow we import the refmet database before using it in the workflow instead of importing it in-line. We also import some cached REST API responses so that we dont overburden the service when generating the vignette. ```{r, eval=FALSE,include = TRUE} # refmet refmet <- mwb_refmet_database() # pubchem caches pubchem_cid_cache <- rds_database( source = system.file("cached/pubchem_cid_cache.rds", package = "MetMashR" ) ) pubchem_smile_cache <- rds_database( source = system.file("cached/pubchem_smiles_cache.rds", package = "MetMashR" ) ) ``` ```{r, eval=TRUE,include=FALSE} refmet <- mwb_refmet_database() pubchem_cid_cache <- rds_database( source = file.path( system.file("cached", package = "MetMashR"), "pubchem_cid_cache.rds" ) ) pubchem_smile_cache <- rds_database( source = file.path( system.file("cached", package = "MetMashR"), "pubchem_smiles_cache.rds" ) ) ``` In practice it is better to create your own cache using e.g. `rds_database`, `sql_database` or other `struct_database` objects with write access. See the vignette [TODO] for more details. ```{r,message=FALSE, include=TRUE, eval=TRUE} # prepare sequence M <- import_source() + drop_empty_columns() + remove_suffix( clean = TRUE, column_name = "metabolite_name" ) + database_lookup( query_column = "refmet_name", database_column = "name", database = refmet, suffix = "_mwb", include = "pubchem_cid" ) + pubchem_compound_lookup( query_column = "metabolite_name", search_by = "name", suffix = "_pc", output = "cids", records = "best", delay = 0.2, cache = pubchem_cid_cache ) + prioritise_columns( column_names = c("pubchem_cid_mwb", "CID_pc"), output_name = "pubchem_cid", source_name = "pubchem_cid_source", source_tags = c("mwb", "pc"), clean = TRUE ) + pubchem_property_lookup( query_column = "pubchem_cid", search_by = "cid", suffix = "", property = "CanonicalSMILES", delay = 0.2, cache = pubchem_smile_cache ) # apply sequence M <- model_apply(M, mwb_source(source = "AN000465")) ``` Note that because the first step in the workflow is to import data from Metabolomics Workbench, we only need to provide an empty annotation_table as input to `model_apply` as it will be updated when we run the workflow. Once the table has been processed by our workflow we then use OpenBabel to generate images of the molecular structures from SMILES. ```{r} # prepare chart C <- openbabel_structure( smiles_column = "CanonicalSMILES", row_index = 1, scale_to_fit = FALSE, view_port = 400, image_size = 500 ) # loop over some records and plot some of the molecules G <- list() x <- 1 for (k in c(3, 5)) { C$row_index <- k G[[x]] <- chart_plot(C, predicted(M)) x <- x + 1 } cowplot::plot_grid(plotlist = G, nrow = 1, labels = "AUTO") ``` # Summary Extending the templates provided by `struct` to implement workflow steps for metabolite mashing is straight forward using either the provided on-the-fly functions or scripting more permanent solutions. All objects that use the templates will be compatible with other workflow objects provided you follow the templates. As well as new workflow steps MetMashR can also be extended to include additional annotation sources. # Session Info ```{r} sessionInfo() ```