--- title: "Data-Mining Biomarkers and High-Dimensional Testing" author: "[Wilson Cai](http://www.stat.berkeley.edu/~wcai/) and [Nima Hejazi](http://www.stat.berkeley.edu/~nhejazi/)" date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Data-Mining Biomarkers and High-Dimensional Testing} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction The `adaptest` R package can be used to perform data-mining and high-dimensional statistical tests that is common in differential expression studies. The package utilizes a two-stage procedure: 1. The _data-mining_ stage: reduce the dimension of biomarkers based on the associations of biomarkers with an exposure variable. 2. The _multiple testing_ stage: adjust for multiple testing to control false positives. In this vignette, we illustrate how to use `adaptest` to perform such analysis, using a data set containing microarray expression measures. --- ## Data-mining stage First, we load the `adaptest` package and the (included) `simpleArray` data set: ```{r setup_data, message=FALSE} library(adaptest) data(simpleArray) set.seed(1234) Y <- simulated_array A <- simulated_treatment ``` In order to perform Targeted Minimum Loss-Based Estimation, we need three separate data structures: (1) _W_, baseline covariates that could potentially confound the association of biomarkers with the exposure of interest; (2) _A_, the point exposure of interest; and (3) _Y_, the biomarkers of interest. All values in _A_ ought to be binarized, in order to avoid practical violations of the assumption of positivity. To invoke the data-adaptive testing function (`adaptest`), we also need to specify the number of top biomarkers `n_top` to the data-mining algorithm, and the number of folds `n_fold` for cross- validation. The smaller `n_top` is, the more selective data-mining algorithm we have. The larger `n_fold` is, more folds are carried our in cross validaiton. The TMLE-based biomarker discovery process can be invoked using the `adaptest` function. The procedure is quite resource-intensive because it evaluates the association of each individual potential biomarker (of which there are 1e3 in the included data set) with an exposure of interest, while accounting for potential confounding based on all other covariates included in the design matrix. We demonstrate the necessary syntax for calling `adaptest` below: ```{r adaptest_eval, message=FALSE} adaptest_out <- adaptest(Y = Y, A = A, W = NULL, n_top = 35, n_fold = 5, learning_library = c("SL.glm", "SL.mean"), parameter_wrapper = adaptest::rank_DE, absolute = FALSE, negative = FALSE) ``` The output of `adaptest` is an object of class `adaptest`, containing the following objects: 1. `top_index`: (integer vector) - indices for the biomarkers selected by way of data-mining. 2. `top_colname`: (character vector) - names for the biomarkers selected by way of data-mining. 3. `top_colname_significant_q`: (character vector) - names for the biomarkers selected by way of data-mining, which are significant after multiple testing stage. 4. `DE`: (numeric vector) - differential expression effect sizes for the biomarkers in `top_colname`. 5. `p_value`: (numeric vector) - p-values for the biomarkers in `top_colname`. 6. `q_value`: (numeric vector) - q-values for the biomarkers in `top_colname`. 7. `significant_q`: (integer vector) - indices of `top_colname` which is significant after multiple testing stage. 8. `mean_rank_top`: (numeric vector) - average ranking across cross-validation folds for the biomarkers in `top_colname`. 9. `folds`: (origami::folds class) - object of cross-validation folds. After invoking `adaptest`, the resultant `adaptest` object will have the slots described above completely filled in. The statistical results of this procedure can be extracted using `summary` method. --- ## Interpret and Visualize Results This package provides several interpretation methods that can be used to tabular and visualize the results of the data-adaptive tests. The `get_composition` method for a `adaptest` object will produce a table of composition of each data-adaptive parameters that is significant after multiple testing stage: ```{r get_comp_small} get_composition(object = adaptest_out, type = "small") ``` Setting the argument `type = "big"` will instead produce a table of composition of each data-adaptive parameters before multiple testing stage, so that there are more columns ```{r get_comp_big} get_composition(object = adaptest_out, type = "big") ``` The `plot` method for a `adaptest` object will produce two plots that help user interpret the results. The first plot is a plot of sorted average CV-rank for all the biomarkers in the original data set (`Y`). The second plot is a plot of sorted q-values with labels corresponding to the indices of the data-adaptive parameter (as returned in `get_composition`) ```{r plot} plot(adaptest_out) ``` --- ## Bioconductor Integration with `SummarizedExperiment` Now, let's try to acquire a taste for how we would use these algorithmic tools with objects common in computational biology -- we'll do this by performing the same analysis we did above, but using the core `SummarizedExperiment` container object and the popular `airway` data set as an example. To start, let's load the required packages: ```{r sum_exp, message=FALSE} library(SummarizedExperiment) library(airway) data(airway) ``` For simplicity, we'll restrict ourselves to just a random subset of the genes or transcripts available from the _airway_ data set. ```{r augment_se} genes_sub <- order(sample(seq_len(1000))) air_reduced <- airway[genes_sub, ] ``` Generally, finding data-adaptive target parameters is a computationally and data-intensive procedure, requiring a fairly __large sample size__. To work with the relatively small `airway` data set, we'll simply augment the data by _artificially_ doubling it in size in a naive manner: ```{r augment_airway} simple_air <- cbind(air_reduced, air_reduced) ``` We'll now extract the variable of interest -- a `factor` label for whether the unit received treatment or not -- and coerce it to a binary `numeric` vector: ```{r make_tx_var} # use a binary treatment variable (must be 0/1 only) dex_var = as.numeric(as.matrix(colData(simple_air))[, 3] - 1) ``` Now, we can perform the same data-adaptive analysis we discussed above on this simplified version of the _airway_ data set, simply by calling the `bioadaptest` wrapper function: ```{r bioadaptest} airway_out <- bioadaptest(data_in = simple_air, var_int = dex_var, cntrl_set = NULL, n_top = 5, n_fold = 2, parameter_wrapper = rank_DE) ``` --- ## Session Information ```{r session_info, echo=FALSE} sessionInfo() ```