--- title: "Running an AnVIL workflow within R" author: - name: Kayla Interdonato affiliation: Roswell Park Comprehensive Cancer Center - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center email: Martin.Morgan@RoswellPark.org package: AnVIL output: BiocStyle::html_document abstract: | This vignette demonstrates how a user can edit, run, and stop a Terra / AnVIL workflow from within their R session. The configuration of the workflow can be retrieved and edited. Then this new configuration can be sent back to the Terra / AnVIL workspace for future use. With the new configuration defined by the user will then be able to run the workflow as well as stop any jobs from running. vignette: | %\VignetteIndexEntry{Running an AnVIL workflow within R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( eval = AnVIL::gcloud_exists(), collapse = TRUE, cache = TRUE ) options(width=75) ``` # Installation Install the _AnVIL_ package with ```{r, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos = "https://cran.r-project.org") BiocManager::install("AnVIL") ``` Once installed, load the package with ```{r, message = FALSE, eval = TRUE, cache = FALSE} library(AnVIL) ``` # Workflow setup: DESeq2 ## Setting up the workspace and choosing a workflow The first step will be to define the namespace (billing project) and name of the workspace to be used with the functions. In our case we will be using the Bioconductor AnVIL namespace and a DESeq2 workflow as the intended workspace. ```{r workspace} avworkspace("bioconductor-rpci-anvil/Bioconductor-Workflow-DESeq2") ``` Each workspace can have 0 or more workflows. The workflows have a `name` and `namespace`, just as workspaces. Discover the workflows available in a workspace ```{r workflows} avworkflows() ``` From the table returned by `avworkflows()`, record the namespace and name of the workflow of interest using `avworkflow()`. ```{r workflow} avworkflow("bioconductor-rpci-anvil/AnVILBulkRNASeq") ``` ## Retrieving the configuration Each workflow defines inputs, outputs and certain code execution. These workflow 'configurations' that can be retrieved with `avworkflow_configuration_get`. ```{r configuration} config <- avworkflow_configuration_get() config ``` This function is using the workspace namespace, workspace name, workflow namespace, and workflow name we recorded above with `avworkspace()` and `avworkflow()`. # Updating workflows ## Changing the inputs / outputs There is a lot of information contained in the configuration but the only variables of interest to the user would be the inputs and outputs. In our case the inputs and outputs are pre-defined so we don't have to do anything to them. But for some workflows these inputs / outputs may be blank and therefore would need to be defined by the user. We will change one of our inputs values to show how this would be done. There are two functions to help users easily see the content of the inputs and outputs, they are `avworkflow_configuration_inputs` and `avworkflow_configuration_outputs`. These functions display the information in a `tibble` structure which users are most likely familiar with. ```{r inputs_outputs} inputs <- avworkflow_configuration_inputs(config) inputs outputs <- avworkflow_configuration_outputs(config) outputs ``` Let's change the `salmon.transcriptome_index_name` field; this is an arbitrary string identifier in our workflow. ```{r change_input} inputs <- inputs |> mutate( attribute = ifelse( name == "salmon.transcriptome_index_name", '"new_index_name"', attribute ) ) inputs ``` ## Update configuration locally Since the inputs have been modified we need to put this information into the configuration of the workflow. We can do this with `avworkflow_configuration_update()`. By default this function will take the inputs and outputs of the original configuration, just in case there were no changes to one of them (like in our example our outputs weren't changed). ```{r update_config} new_config <- avworkflow_configuration_update(config, inputs) new_config ``` ## Set a workflow configuration for reuse in AnVIL Use `avworkflow_configuration_set()` to permanently update the workflow to new parameter values. ```{r set_config} avworkflow_configuration_set(new_config) ``` Actually, the previous command validates `new_config` only; to update the configuration in AnVIL (i.e., replacing the values in the workspace workflow graphical user interface), add the argument `dry = FALSE`. ```{r set_config_not_dry} ## avworkflow_configuration_set(new_config, dry = FALSE) ``` # Running and stopping workflows ## Running a workflow To finally run the new workflow we need to know the name of the data set to be used in the workflow. This can be discovered by looking at the table of interest and using the name of the data set. ```{r entityName} entityName <- avtable("participant_set") |> pull(participant_set_id) |> head(1) avworkflow_run(new_config, entityName) ``` Again, actually running the new configuration requires the argument `dry = FALSE`. ```{r run_not_dry} ## avworkflow_run(new_config, entityName, dry = FALSE) ``` `config` is used to set the `rootEntityType` and workflow method name and namespace; other components of `config` are ignored (the other components will be read by Terra / AnVIL from values updated with `avworkflow_configuration_set()`). ## Monitoring workflows We can see that the workflow is running by using the `avworkflow_jobs` function. The elements of the table are ordered chronologically, with the most recent submission (most likely the job we just started!) listed first. ```{r checking_workflow} avworkflow_jobs() ``` ## Stopping workflows Use `avworkflow_stop()` to stop a currently running workflow. This will change the status of the job, reported by `avworkflow_jobs()`, from 'Submitted' to 'Aborted'. ```{r stop_workflow} avworkflow_stop() # dry = FALSE to stop avworkflow_jobs() ``` # Managing workflow output Workflows can generate a large number of intermediate files (including diagnostic logs), as well as final outputs for more interactive analysis. Use the `submissionId` from `avworkflow_jobs()` to discover files produced by a submission; the default behavior lists files produced by the most recent job. ```{r files} submissionId <- "fb8e35b7-df5d-49e6-affa-9893aaeebf37" avworkflow_files(submissionId) ``` Workflow files are stored in the workspace bucket. The files can be localized to the persistent disk of the current runtime using `avworkflow_localize()`; the default is again to localize files from the most recently submitted job; use `type=` to influence which files ('control' e.g., log files, 'output', or 'all') are localized. ```{r localize} avworkflow_localize( submissionId, type = "output" ## dry = FALSE to localize ) ``` # Session information ```{r sessionInfo} sessionInfo() ```