---
title: "Introduction to ExperimentHubData"
author: "Valerie Obenchain"
date: "Modified: December 2016. Compiled: `r format(Sys.Date(), '%d %b %Y')`"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to ExperimentHubData}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Overview

`ExperimentHubData` provides tools to add or modify resources in 
Bioconductor's `ExperimentHub`. This 'hub' houses curated data from courses,
publications or experiments.  The resources are generally not files of raw data 
(as can be the case in `AnnotationHub`) but instead are `R` / `Bioconductor` 
objects such as GRanges, SummarizedExperiment, data.frame etc.  Each resource 
has associated metadata that can be searched through the `ExperimentHub` client interface.

# New resources

Resources are contributed to `ExperimentHub` in the form of a package.  The
package contains the resource metadata, man pages, vignette and any supporting
`R` functions the author wants to provide.  This is a similar design to the
existing `Bioconductor` experimental data packages except the data are
stored in AWS S3 buckets instead of the data/ directory of the package.

Below are the steps required for adding new resources.

## Notify `Bioconductor` team member

The man page and vignette examples in the software package will not work until
the data are available in ExperimentHub. Adding the data to AWS S3 and the
metadata to the production database involves assistance from a `Bioconductor`
team member.  If you are interested in submitting a package, please send an
email to packages@bioconductor.org so a team member can work with you through
the process.

## Building the software package

When a resource is downloaded from `ExperimentHub`, the associated software
package is loaded in the workspace, making the man pages and vignettes readily
available. Because documentation plays an important role in understanding these
curated resources, please take the time to develop clear man pages and a
detailed vignette. These documents provide essential background to the user and
guide appropriate use of the resources.

Below is an outline of package organization. The files listed are required
unless otherwise stated. 

* inst/extdata/
    - metadata.csv: 
    This file contains the metadata in the format of one row per resource
    to be added to the `ExperimentHub` database. The file should be generated
    from the code in inst/scripts/make-metadata.R where the final data are
    written out with write.csv(..., row.names=FALSE). The required column 
    names and data types are specified in 
    `AnnotationHubData::readMetadataFromCsv()`. See ?`readMetadataFromCsv` for 
    details.

* inst/scripts/
    - make-data.R: 
    A script describing the steps involved in making the data object(s). This
    includes where the original data were downloaded from, pre-processing,
    and how the final R object was made. Include a description of any
    steps performed outside of `R` with third party software. Data objects
    should be serialized with save() with the .rda extension on the filename.

    - make-metadata.R: 
    A script to make the metadata.csv file located in inst/extdata of the 
    package. See ?`readMetadataFromCsv` for a description of expected fields 
    and data types.  `readMetadataFromCsv()` can be used to validate the 
    metadata.csv file before submitting the package.

* vignettes/

    One or more vignettes describing analysis workflows. 

* R/

  - zzz.R:
    Optional. You can include a .onLoad() function in a zzz.R file that
    exports each resource name (i.e., title) into a function. This allows
    the data to be loaded by name, e.g., resource123().

    ```
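    ## Create one accessor function per resource title listed in metadata.csv
    .onLoad <- function(libname, pkgname) {
        ## Locate the metadata file installed with the package
        fl <- system.file("extdata", "metadata.csv", package=pkgname)
        ## The 'Title' column holds the resource names
        titles <- read.csv(fl, stringsAsFactors=FALSE)$Title
        createHubAccessors(pkgname, titles)
    }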
    ```
    Implementation details are in the ExperimentHub::createHubAccessors and 
    ExperimentHub:::.hubAccessorFactory functions. The resource-named function 
    has a single 'metadata' argument. When metadata=TRUE, the metadata are 
    loaded (equivalent to the single-bracket method on an ExperimentHub object) 
    and when metadata=FALSE, the full resource is loaded (equivalent to the 
    double-bracket method).

  - Optional functions to enhance data exploration.

* man/
    - package man page: 
      The package man page serves as a landing point and should briefly describe
      all resources associated with the package. There should be an \alias
      entry for each resource title either on the package man page or individual
      man pages.

    - resource man pages: 
      Resources can be documented on the same page, grouped by common type
      or have their own dedicated man pages.

    - document how data are loaded:
      Data can be accessed via the standard ExperimentHub interface with
      single and double-bracket methods, e.g., 
      
      ```
      library(ExperimentHub)
      eh <- ExperimentHub()
      myfiles <- query(eh, "PACKAGENAME")
      myfiles[[1]]        ## load the first resource in the list
      myfiles[["EH123"]]  ## load by EH id
      ```
      
      If a .onLoad() function is used to export each resource as a function, 
      also document that method of loading, e.g.,
 
      ```
      resourceA(metadata = FALSE) ## data are loaded
      resourceA(metadata = TRUE)  ## metadata are displayed
      ```


* DESCRIPTION / NAMESPACE

    The package should depend on and fully import ExperimentHub. If using
    the suggested .onLoad() function, import the utils package in the
    DESCRIPTION file and selectively importFrom(utils, read.csv) in the
    NAMESPACE.

    Package authors are encouraged to use the ExperimentHub::listResources() and 
    ExperimentHub::loadResources() functions in their man pages and vignette.
    These helpers are designed to facilitate data discovery within a specific
    package rather than across all of ExperimentHub.
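
As a rough sketch of the DESCRIPTION / NAMESPACE guidance above, the NAMESPACE
file of a package using the suggested .onLoad() function might contain the
directives below. The DESCRIPTION fields shown in the comments are an
assumption about how "depend on" and "import" would typically be declared;
adjust them to your own package.

```r
## NAMESPACE (sketch, not a complete file)
## Assumed matching DESCRIPTION fields:
##   Depends: ExperimentHub
##   Imports: utils
import(ExperimentHub)          ## fully import ExperimentHub
importFrom(utils, read.csv)    ## read.csv() is used by the suggested .onLoad()
```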


## Data objects

Data are not formally part of the software package and are stored separately in
AWS S3 buckets. The author should make the data available via Dropbox, FTP or
another mutually accessible application, and they will be uploaded to S3 by a
member of the `Bioconductor` team.

Data files should be created with save() and have the .rda extension.
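
A minimal sketch of serializing a data object as described above. The object
name `resourceA`, its contents and the temporary file location are
hypothetical; a real make-data.R script would build the object from the
original source data.

```r
## Hypothetical data object standing in for a curated resource
resourceA <- data.frame(
    gene  = c("g1", "g2", "g3"),
    score = c(0.5, 1.2, 3.4),
    stringsAsFactors = FALSE
)

## Serialize with save(), using the .rda extension on the filename
rda_file <- file.path(tempdir(), "resourceA.rda")
save(resourceA, file = rda_file)

## Sanity check: load into a fresh environment and compare with the original
env <- new.env()
loaded_names <- load(rda_file, envir = env)
stopifnot(identical(loaded_names, "resourceA"),
          identical(env$resourceA, resourceA))
```

Loading into a separate environment avoids overwriting objects in the
workspace while confirming the file contains what was intended.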

## Metadata

When you are satisfied with the representation of your resources in 
make-metadata.R (which produces metadata.csv) the `Bioconductor` team
member will add the metadata to the production database.
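
The outline of a make-metadata.R script might look like the sketch below. It
only illustrates the "one row per resource, written with
write.csv(..., row.names=FALSE)" pattern. The resource names and values are
made up, and the columns other than `Title` are shown for illustration only;
see ?`readMetadataFromCsv` for the actual required columns and data types.

```r
## inst/scripts/make-metadata.R (sketch)
## One row per resource. 'Title' is the column read by the suggested
## .onLoad(); the remaining columns are illustrative and incomplete.
meta <- data.frame(
    Title       = c("resourceA", "resourceB"),
    Description = c("Hypothetical resource A", "Hypothetical resource B"),
    RDataClass  = c("data.frame", "GRanges"),
    stringsAsFactors = FALSE
)

## In a package this path would be inst/extdata/metadata.csv;
## a temporary file is used here so the sketch runs anywhere.
out_file <- file.path(tempdir(), "metadata.csv")
write.csv(meta, file = out_file, row.names = FALSE)

## Read the titles back the same way the suggested .onLoad() does
titles <- read.csv(out_file, stringsAsFactors = FALSE)$Title
```

Writing without row names keeps the header aligned with the data columns, so
the file reads back with exactly the column names that were written.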

## Package review 

Once the data are in AWS S3 and the metadata have been added to the
production database the man pages and vignette can be finalized. When the
package passes R CMD build and check it can be submitted to the 
[package tracker](https://github.com/Bioconductor/Contributions) for 
review.


# Add additional resources

Metadata for new versions of the data can be added to the same package as they
become available. 

* The titles for the new versions must be unique and not match the title of
  any resource currently in ExperimentHub. Good practice would be to 
  include the version and / or genome build in the title.

* Make data available via dropbox, ftp, etc. and notify 
  maintainer@bioconductor.org

* Update make-metadata.R with the new metadata information

* Generate a new or updated metadata.csv file. The package should contain
  metadata for all versions of the data in ExperimentHub.  When adding a new
  version it might be helpful to write a new csv file named by version, e.g.,
  metadata_v84.csv, metadata_v85.csv etc.

* Bump package version and commit to svn/git

* Notify maintainer@bioconductor.org that an update is ready and
  a team member will add the new metadata to the production database;
  new resources will not be visible in ExperimentHub until
  the metadata are added to the database.
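
Because new titles must be unique, a quick check before notifying the team can
catch clashes early. The helper below is a hypothetical sketch and not part of
any `Bioconductor` package. It compares proposed titles against a vector of
existing titles, which is hard-coded here with made-up values; in practice that
vector would come from the metadata of the resources already in the hub.

```r
## Hypothetical helper: report problems with a set of proposed titles
check_new_titles <- function(new_titles, existing_titles) {
    ## Titles repeated within the new set itself
    dup_within <- unique(new_titles[duplicated(new_titles)])
    ## Titles that match a resource that already exists
    clashing <- intersect(new_titles, existing_titles)
    list(duplicated = dup_within,
         clashing   = clashing,
         ok         = length(dup_within) == 0L && length(clashing) == 0L)
}

## Made-up example: the second new title repeats an existing one
existing_titles <- c("myData_v84", "otherData_hg19")
new_titles      <- c("myData_v85", "myData_v84")

res <- check_new_titles(new_titles, existing_titles)
res$clashing   ## titles that need to be renamed
res$ok         ## FALSE while any problem remains
```

Including the version or genome build in each title, as suggested above, makes
such clashes unlikely in the first place.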

Contact maintainer@bioconductor.org with any questions.

# Bug fixes

A bug fix may involve a change to the metadata, data resource or both.

## Update the resource 

* The replacement resource must have the same name as the original

* Notify maintainer@bioconductor.org that you want to replace the data
  and make the files available via dropbox, ftp, etc. 

## Update the metadata

New metadata records can be added for new resources but modifying existing
records is discouraged. Record modification will only be done in the case of 
bug fixes.

* Notify maintainer@bioconductor.org that you want to change the metadata

* Update make-metadata.R with modified information

* Bump the package version and commit to svn/git

# Remove resources

Removing resources should be done with caution. The intent is that
ExperimentHub be a 'reproducible' resource by providing a stable snapshot
of the data. Data made available in Bioconductor version x.y.z should be
available for all versions greater than x.y.z. Unfortunately this is not 
always possible. If you find it necessary to remove data from ExperimentHub 
please contact maintainer@bioconductor.org for assistance.

When a resource is removed from ExperimentHub, the 'status' field in the 
metadata is modified to explain why it is no longer available. Once
this status is changed, the ExperimentHub() constructor will not list the 
resource among the available ids. An attempt to extract the resource with 
'[[' and the EH id will return an error along with the status message.


# `ExperimentHub_docker`

The [ExperimentHub_docker](https://github.com/Bioconductor/ExperimentHub_docker)
repository offers an isolated test environment for inserting / extracting
metadata records in the `ExperimentHub` database. The README in the repository
explains how to set up the Docker container. Records are inserted with
`ExperimentHubData::addResources()`.

In general this level of testing should not be necessary when submitting
a package with new resources. The best way to validate record metadata is to 
read inst/extdata/metadata.csv with `ExperimentHubData::readMetadataFromCsv()`.
If that is successful the metadata are ready to go.