---
title: "Introduction to ExperimentHubData"
author: "Valerie Obenchain"
date: "Modified: October 2016. Compiled: `r format(Sys.Date(), '%d %b %Y')`"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to ExperimentHubData}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Overview

`ExperimentHubData` provides tools to add or modify resources in Bioconductor's `ExperimentHub`. This 'hub' houses curated data from courses, publications or experiments. The resources are generally not files of raw data (as can be the case in `AnnotationHub`) but instead are `R` / `Bioconductor` objects such as GRanges, SummarizedExperiment, data.frame etc. Each resource has associated metadata that can be searched through the `ExperimentHub` client interface.

# New resources

Resources are contributed to `ExperimentHub` in the form of a package. The package contains the resource metadata, man pages, vignette and any supporting `R` functions the author wants to provide. This is a similar design to the existing `Bioconductor` experimental data packages except the data are stored in AWS S3 buckets instead of the data/ directory of the package.

Below are the steps required for adding new resources.

## Notify `Bioconductor` team member

The man page and vignette examples in the software package will not work until the data are available in ExperimentHub. Adding the data to AWS S3 and the metadata to the production database involves assistance from a `Bioconductor` team member. If you are interested in submitting a package, please send an email to packages@bioconductor.org so a team member can work with you through the process.

## Building the software package

When a resource is downloaded from `ExperimentHub` the associated software package is loaded in the workspace making the man pages and vignettes readily available.
Because documentation plays an important role in understanding these curated resources, please take the time to develop clear man pages and a detailed vignette. These documents provide essential background to the user and guide appropriate use of the resources.

Below is an outline of package organization. The files listed are required unless otherwise stated.

* inst/extdata/

    - metadata.csv: This file contains the metadata in the format of one row per resource to be added to the `ExperimentHub` database. The file should be generated from the code in inst/scripts/make-metadata.R where the final data are written out with write.csv(..., row.names=FALSE). The required column names and data types are specified in `AnnotationHubData::readMetadataFromCsv()`. See ?`readMetadataFromCsv` for details.

* inst/scripts/

    - make-data.R: A script describing the steps involved in making the data object(s). This includes where the original data were downloaded from, pre-processing, and how the final R object was made. Include a description of any steps performed outside of `R` with third party software. Data objects should be serialized with save() with the .rda extension on the filename.

    - make-metadata.R: A script to make the metadata.csv file located in inst/extdata of the package. See ?`readMetadataFromCsv` for a description of expected fields and data types. `readMetadataFromCsv()` can be used to validate the metadata.csv file before submitting the package.

* vignettes/

    One or more vignettes describing analysis workflows.

* R/

    - zzz.R: This sample .onLoad() function in a zzz.R file makes each resource name (i.e., title) into a function which allows the data to be loaded by name. Substitute the name of your package in place of PACKAGENAME below.
        ```
        .onLoad <- function(libname, pkgname) {
            ## Resource titles, one per row of metadata.csv
            titles <- read.csv(system.file("extdata", "metadata.csv",
                                           package="PACKAGENAME"),
                               stringsAsFactors=FALSE)$Title
            rda <- gsub(".rda", "", titles, fixed=TRUE)
            if (!length(rda))
                stop("no .rda objects found in metadata")

            ## Functions to load resources by name:
            ns <- asNamespace(pkgname)
            sapply(rda,
                function(xx) {
                    func <- function(metadata = FALSE) {
                        if (!isNamespaceLoaded("ExperimentHub"))
                            attachNamespace("ExperimentHub")
                        eh <- query(ExperimentHub(), "PACKAGENAME")
                        ehid <- names(query(eh, xx))
                        if (!length(ehid))
                            stop(paste0("resource ", xx,
                                        " not found in ExperimentHub"))
                        ## metadata = TRUE returns the record, FALSE the data
                        if (metadata)
                            eh[ehid]
                        else
                            eh[[ehid]]
                    }
                    assign(xx, func, envir=ns)
                    namespaceExport(ns, xx)
                })
        }
        ```

    - Optional functions to enhance data exploration.

* man/

    - package man page: The package man page serves as a landing point and should briefly describe all resources associated with the package. There should be an \alias entry for each resource title either on the package man page or individual man pages.

    - resource man pages: It is recommended (but not required) that each resource have a dedicated man page.

    - document data loading: When the package is loaded the zzz.R file makes the title of each resource into a function with a single argument, 'metadata'. The man page should document that data can be loaded by name, e.g.,

        ```
        resourceA(metadata = FALSE)  ## data are loaded into the workspace
        resourceA(metadata = TRUE)   ## only the metadata are displayed
        ```

        Optionally, the data can be loaded through the ExperimentHub interface.

        ```
        library(ExperimentHub)
        eh <- ExperimentHub()
        myfiles <- query(eh, "PACKAGENAME")
        myfiles[[1]]  ## load the first resource in the list
        ```

* DESCRIPTION / NAMESPACE

    The package should depend on and fully import ExperimentHub. Package authors are encouraged to use the ExperimentHub::listResources() and ExperimentHub::loadResource() functions in their man pages and vignette.
    These helpers are designed to facilitate data discovery within a specific package rather than across all of ExperimentHub.

## Data objects

Data are not formally part of the software package and are stored separately in AWS S3 buckets. The author should make the data available via Dropbox, FTP or another mutually accessible application and it will be uploaded to S3 by a member of the `Bioconductor` team. Data files should be created with save() and have the .rda extension.

## Metadata

When you are satisfied with the representation of your resources in make-metadata.R (which produces metadata.csv) the `Bioconductor` team member will add the metadata to the production database.

## Package review

Once the data are in AWS S3 and the metadata have been added to the production database the man pages and vignette can be finalized. When the package passes R CMD build and check it can be submitted to the [package tracker](https://github.com/Bioconductor/Contributions) for review.

# Add additional resources

Multiple versions of the data can be added to the same package as they become available. Be sure the title is descriptive and reflects the distinguishing information such as genome build. Adding new resources to an existing package requires the following steps:

* make data available via Dropbox, FTP, etc. and notify maintainer@bioconductor.org; if you have access to the AWS S3 bucket you can upload the files yourself

* update make-metadata.R with the new metadata information

* re-generate the metadata.csv file

* bump package version and commit to svn/git

* notify maintainer@bioconductor.org that an update is ready and a team member will add the new metadata to the production database; new resources will not be visible in ExperimentHub until the metadata are added to the database.

# Bug fixes

A bug fix may involve a change to the metadata, data resource or both.
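
Whether a fix touches the metadata or new resources are being added, the same step recurs: edit make-metadata.R and regenerate metadata.csv. The following is a minimal sketch of that step, not the required layout. Only the `Title` column name comes from this vignette; the `Description` column, the resource titles and the temporary output path are assumptions for illustration, and the full set of required columns is documented in ?`readMetadataFromCsv`.

```r
## Sketch of inst/scripts/make-metadata.R: one row per resource.
## 'Description' and the titles below are hypothetical; see
## ?readMetadataFromCsv for the columns that are actually required.
meta <- data.frame(
    Title = c("resourceA_hg19", "resourceA_hg38"),
    Description = c("Example resource on genome build hg19",
                    "Example resource on genome build hg38"),
    stringsAsFactors = FALSE
)

## Each title becomes a function name in zzz.R, so titles must be unique
stopifnot(!anyDuplicated(meta$Title))

## In a package the target would be inst/extdata/metadata.csv; a temporary
## file keeps this sketch self-contained.
out <- file.path(tempdir(), "metadata.csv")
write.csv(meta, file = out, row.names = FALSE)

## Read the file back to confirm that no row-name column was written
check <- read.csv(out, stringsAsFactors = FALSE)
names(check)
```

In a real package the regenerated file would still need to pass `readMetadataFromCsv()` before the version bump is committed.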

## Update the resource

* the replacement resource must have the same name as the original

* notify maintainer@bioconductor.org that you want to replace the data and make the files available via Dropbox, FTP, etc.

## Update the metadata

* notify maintainer@bioconductor.org that you want to change the metadata

* update make-metadata.R with modified information

* regenerate metadata.csv

* bump the package version and commit to svn/git

# Remove resources

When a resource is removed from ExperimentHub the 'status' field in the metadata is modified to explain why it is no longer available. Once this status is changed the ExperimentHub() constructor will not list the resource among the available ids. An attempt to extract the resource with '[[' and the EH id will return an error along with the status message.

To remove a resource from ExperimentHub contact maintainer@bioconductor.org.

# `ExperimentHub_docker`

The [ExperimentHub_docker](https://github.com/Bioconductor/ExperimentHub_docker) repository offers an isolated test environment for inserting / extracting metadata records in the `ExperimentHub` database. The README in the repository explains how to set up the Docker container; records are inserted with `ExperimentHubData::addResources()`.

In general this level of testing should not be necessary when submitting a package with new resources. The best way to validate record metadata is to read inst/extdata/metadata.csv with `ExperimentHubData::readMetadataFromCsv()`. If that is successful the metadata are ready to go.
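
Before running that validation, a quick pass over the `Title` column in base R can catch problems specific to the zzz.R loader, which turns each title into an exported function. The sketch below is illustrative only: `checkTitles()` is a hypothetical helper, not part of `ExperimentHubData`, and the syntactic-name check is a convenience (a non-syntactic title still works but must be called with backticks), not a documented requirement.

```r
## Hypothetical helper: basic checks on the titles that zzz.R turns
## into function names. Returns a character vector of problems found.
checkTitles <- function(csv) {
    meta <- read.csv(csv, stringsAsFactors = FALSE)
    if (!"Title" %in% names(meta))
        stop("metadata.csv has no 'Title' column")
    ## Same transformation as the sample .onLoad() function
    fun <- gsub(".rda", "", meta$Title, fixed = TRUE)
    problems <- character()
    if (!length(fun))
        problems <- c(problems, "no resources listed")
    if (anyDuplicated(fun))
        problems <- c(problems, "duplicated titles")
    if (any(is.na(fun) | fun != make.names(fun)))
        problems <- c(problems, "titles that are not syntactic R names")
    problems
}

## Self-contained demonstration using temporary files
bad <- file.path(tempdir(), "metadata-bad.csv")
write.csv(data.frame(Title = c("resourceA.rda", "resource B.rda")),
          file = bad, row.names = FALSE)
good <- file.path(tempdir(), "metadata-good.csv")
write.csv(data.frame(Title = c("resourceA.rda", "resourceB.rda")),
          file = good, row.names = FALSE)

checkTitles(bad)   ## flags the title containing a space
checkTitles(good)  ## character(0): nothing to report
```

An empty result here says nothing about the other required columns; those are still checked by `readMetadataFromCsv()`.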