---
title: "Creating An AnnotationHub Package"
author: "Valerie Obenchain and Lori Shepherd"
date: "Modified: March 2019. Compiled: `r format(Sys.Date(), '%d %b %Y')`"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  % \VignetteIndexEntry{AnnotationHub: Creating An AnnotationHub Package}
  % \VignetteEngine{knitr::rmarkdown}
  % \VignetteEncoding{UTF-8}
---


# Overview

The `AnnotationHubData` package provides tools to acquire, annotate, convert
and store data for use in Bioconductor's `AnnotationHub`. BED files from the
Encode project, gtf files from Ensembl, or annotation tracks from UCSC, are
examples of data that can be downloaded, described with metadata, transformed
to standard `Bioconductor` data types, and stored so that they may be
conveniently served up on demand to users via the AnnotationHub client. While
data are often manipulated into a more R-friendly form, the data themselves
retain their raw content and are not filtered or curated like those in
[ExperimentHub](http://bioconductor.org/packages/ExperimentHub/).
Each resource has associated metadata that can be searched through the
`AnnotationHub` client interface.

# Setting up a package to use AnnotationHub

## New AnnotationHub package

Multiple, related resources are added to `AnnotationHub` by creating a software
package similar to the existing annotation packages. The package itself does
not contain data but serves as a light weight wrapper around scripts that
generate metadata for the resources added to `AnnotationHub`.

At a minimum the package should contain a man page describing the resources.
Vignettes and additional `R` code for manipulating the objects are optional.

Creating the package involves the following steps:

1. Notify `Bioconductor` team member:
   Man page and vignette examples in the software package will not work until
   the data are available in `AnnotationHub`. Adding the data to AWS S3 and the
   metadata to the production database involves assistance from a `Bioconductor`
   team member.  Please read the section "Uploading Data to S3". The
   metadata.csv file will have to be created before the data can officially be
   added to the hub (See inst/extdata section below).

2. Building the software package:
   Below is an outline of package organization. The files listed are required
   unless otherwise stated.

* inst/extdata/

    - metadata.csv:
    This file contains the metadata in the format of one row per resource
    to be added to the `AnnotationHub` database. The file should be generated
    from the code in inst/scripts/make-metadata.R where the final data are
    written out with write.csv(..., row.names=FALSE). The required column
    names and data types are specified in
    `AnnotationHubData::makeAnnotationHubMetadata`.
    See ?`AnnotationHubData::makeAnnotationHubMetadata` for details. Ensure that
    the above function runs without ERROR.

    If necessary, metadata can be broken up into multiple csv files instead
    having of all records in a single "metadata.csv".

* inst/scripts/

    - make-data.R:
    A script describing the steps involved in making the data object(s). This
    includes where the original data were downloaded from, pre-processing, and
    how the final R object was made. Include a description of any steps
    performed outside of `R` with third party software. Output of the script
    should be files on disk ready to be pushed to S3. If data are to be hosted
    on a personal web site instead of S3, this file should explain any
    manipulation of the data prior to hosting on the web site. For data hosted
    on a public web site with no prior manipultaion this file is not needed.

    - make-metadata.R:
    A script to make the metadata.csv file located in inst/extdata of the
    package. See ?`AnnotationHubData::makeAnnotationHubMetadata` for a
    description of the metadata.csv file, expected fields and data types. The
    `AnnotationHubData::makeAnnotationHubMetadata()` function can be used to
    validate the metadata.csv file before submitting the package.

* vignettes/

    OPTIONAL vignette(s) describing analysis workflows.

* R/

    OPTIONAL functions to enhance data exploration.

* man/

    - package man page:
    OPTIONAL. The package man page serves as a landing point and should briefly
    describe all resources associated with the package. There should be an
    \alias entry for each resource title either on the package man page or
    individual man pages.

    - resource man pages:
    OPTIONAL. Man page(s) should describe the resource (raw data source,
    processing, QC steps) and demonstrate how the data can be loaded through
    the `AnnotationHub` interface. For example, replace "SEARCHTERM*" below
    with one or more search terms that uniquely identify resources in your
    package.

    ```
    library(AnnotationHub)
    hub <- AnnotationHub()
    myfiles <- query(hub, "SEARCHTERM1", "SEARCHTERM2")
    myfiles[[1]]  ## load the first resource in the list
    ```

* DESCRIPTION / NAMESPACE
The scripts used to generate the metadata will likely use functions from
AnnotationHub or AnnotationHubData which should be listed in Depends/Imports as
necessary. The biocViews should contain terms from
[AnnotationData](http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData)
and should also contain the term `AnnotationHub`.


3. Data objects:
Data are not formally part of the software package and are stored
separately in AWS S3 buckets. The author should follow instructions in the
section "Uploading Data to S3"

4. Confirm valid metadata:
Confirm the data in inst/exdata/metadata.csv are valid by running
AnnotationHubData:::makeAnnotationHubMetadata() on your package. Please
address and warnings or errors.

5. Package review:
Submit the package to the
[tracker](https://github.com/Bioconductor/Contributions) for review. The
primary purpose of the package review is to validate the metadata in the csv
file(s). It is ok if the package fails R CMD build and check because the
data and metadata are not yet in place. Once the metadata.csv is approved,
records are added to the production database. At that point the package man
pages and vignette can be finalized and the package should pass R CMD build
and check.


## Additional resources to existing AnnotationHub package

Metadata for new versions of the data can be added to the same package as they
become available.

* The titles for the new versions should be unique and not match the title of
  any resource currently in AnnotationHub. Good practice would be to
  include the version and / or genome build in the title. If the title is
  not unique, the `AnnotationHub` object will list multiple files with the
  same title. The user will need to use 'rdatadateadded' to determine which
  is the most current.

* Make data available: see section on "Uploading Data to S3"

* Update make-metadata.R with the new metadata information

* Generate a new metadata.csv file. The package should contain
  metadata for all versions of the data in AnnotationHub so the old file should
  remain.  When adding a new   version it might be helpful to write a new csv
  file named by version, e.g., metadata_v84.csv, metadata_85.csv etc.

* Bump package version and commit to git

* Notify Lori.Shepherd@Roswellpark.org that an update is ready and
  a team member will add the new metadata to the production database;
  new resources will not be visible in AnnotationHub until
  the metadata are added to the database.

Contact Lori.Shepherd@roswellpark.org or maintainer@bioconductor.org with any
questions.

## Converting a non AnnotationHub annotation package

The concepts and directory structure of the package would stay the same.
The main steps involved would be

1. Restructure the inst/extdata and inst/scripts to include metadata.csv and
make-data.R as described in the section above for creating new packages. Ensure the
metadata.csv file is formatted correctly by running `AnnotationHubData::makeAnnotationHubMetadata()`
on your package.

2. Add biocViews term "AnnotationHub" to DESCRIPTION

3. Upload the data to S3 and remove the data from the package. See the section
for Uploading Data to S3 below.

4. Once the data is officially added to the hub, update any code to utilize
AnnotationHub for retrieving data.


# Bug fixes

A bug fix may involve a change to the metadata, data resource or both.

## Update the resource

* The replacement resource must have the same name as the original and
  be at the same location (path).

* Notify Lori.Shepherd@roswellpark.org that you want to replace the data
  and make the files available: see section "Uploading Data to S3".

## Update the metadata

New metadata records can be added for new resources but modifying existing
records is discouraged. Record modification will only be done in the case of
bug fixes.

* Notify Lori.Shepherd@roswellpark.org that you want to change the metadata

* Update make-metadata.R and regenerate the metadata.csv file

* Bump the package version and commit to git

# Remove resources

When a resource is removed from `AnnotationHub` two things happen:
the 'rdatadateremoved' field is populated with a date and the 'status'
field is populated with a reason why the resource is no longer available. Once
these changes are made, the `AnnotationHub()` constructor will not list the
resource among the available ids. An attempt to extract the resource with
'[[' and the AH id will return an error along with the status message.

In general, resources are only removed when they are no longer available
(e.g., moved from web location, no longer provided etc.).

To remove a resource from `AnnotationHub` contact Lori.Shepherd@roswellpark.org
or maintainer@bioconductor.org.

# Versioning

Versioning of resources is handled by the maintainer. If you plan to provide
incremental updates to a file for the same organism / genome build, we
recommend including a version in the title of the resource so it is easy
to distinguish which is most current. We also would recommend when uploading the
data to S3 to have a directory structure accounting for versioning.

If you do not include a version, or make the title unique in some way,
multiple files with the same title will be listed in the `AnnotationHub`
object. The user will can use the 'rdatadateadded' metadata field
to determine which file is the most current.

# Visibility

Several metadata fields control which resources are visible when
a user invokes AnnotationHub(). Records are filtered based on these criteria:

- 'snapshotdate' >= the date of the Bioconductor release being used
- 'rdatadateadded'  >= today's date
- 'rdatadateremoved' is NULL / NA
- 'biocVersion' is <= to the Bioconductor version being used

Once a record is added to AnnotationHub it is visable from that point forward
until stamped with 'rdatadateremoved'. For example, a record added on
May 1, 2017 with 'biocVersion' 3.6 will be visible in all snapshots >=
May1, 2017 and in all Bioconductor versions >= 3.6.

A special filter for OrgDb is utilized. Only one OrgDb is available per
release/devel cycle. Therefore contributed OrgDb added to a devel cycle are
masked until the following release. There are options for debugging these masked
resources. see `?setAnnotationHubOption`

# Uploading Data to S3

Instead of providing the data files via dropbox, ftp, etc. we will grant
temporary access to an S3 bucket where you can upload your data. Please
email Lori.Shepherd@roswellpark.org for access.

You will be given access to the 'AnnotationContributor' user. Ensure that the
`AWS CLI` is installed on your machine. See instructions for installing `AWS
CLI` [here](https://aws.amazon.com/cli/). Once you  have requested access you
will be emailed a set of keys. There are two options to set the profile up for
AnnotationContributor

1.  Update your `.aws/config` file to include the following updating the keys
accordingly:

```
[profile AnnotationContributor]
output = text
region = us-east-1
aws_access_key_id = ****
aws_secret_access_key = ****
```
2. If you can't find the `.aws/config` file,  Run the following command entering
appropriate information from above

```
aws configure --profile AnnotationContributor
```

After the configuration is set you should be able to upload resources using

```
aws --profile AnnotationContributor s3 cp test_file.txt s3://annotation-contributor/test_file.txt --acl public-read

# to upload a full directory use recursive:

aws --profile AnnotationContributor s3 cp test_dir s3://annotation-contributor/teset_dir --recursive --acl public-read

```

Please upload the data with the appropriate directory structure, including
subdirectories as necessary (i.e. top directory must be software package name,
then if applicable, subdirectories of versions, ...)

Once the upload is complete, email Lori.Shepherd@roswellpark.org to continue the
process. To add the data officially the data will need to be uploaded and the
metadata.csv file will need to be created in the github repository.

# Example metadata.csv file and more information

As described above the metadata.csv file (or multiple metadata.csv files) will
need to be created before the data can be added to the database. To ensure
proper formatting one should run `AnnotationHubData::makeAnnotationHubMetadata`
on the package with any/all metadata files, and address any ERRORs that
occur. Each object uploaded to S3 should have an entry in the metadata
file. Briefly, a description of the metadata columns required:

* Title: 'character(1)'. Name of the resource. This can be the exact file name
  (if self-describing) or a more complete description.
* Description: 'character(1)'. Brief description of the resource, similar to the
  'Description' field in a package DESCRIPTION file.
* BiocVersion: 'character(1)'. The first Bioconductor version the resource was
  made available for. Unless removed from the hub, the resource will be
  available for all versions greater than or equal to this field. This generally
  is the current devel version of Bioconductor.
* Genome: 'character(1)'. Genome. Can be NA.
* SourceType: 'character(1)'. Format of original data, e.g., FASTA, BAM, BigWig,
  etc. 'getValidSourceTypes()' list currently acceptable values. If nothing seems
  appropiate for your data reach out to lori.shepherd@roswellpark.org
* SourceUrl: 'character(1)'. Location of original data files. Multiple urls
  should be provided as a comma separated string. If the data is simulated we
  recommend putting either a lab url or the url of the Bioconductor package.
* SourceVersion: 'character(1)'. Version of original data.
* Species: 'character(1)'. Species. For help on valid species see
  'getSpeciesList, validSpecies, or suggestSpecies.' Can be NA.
* TaxonomyId: 'character(1)'. Taxonomy ID. There are checks for valid taxonomyId
  given the Species which produce warnings. See GenomeInfoDb::loadTaxonomyDb()
  for full validation table. Can be NA.
* Coordinate_1_based: 'logical'. TRUE if data are 1-based. Can be NA.
* DataProvider: 'character(1)'. Name of company or institution that supplied the
  original (raw) data.
* Maintainer: 'character(1)'. Maintainer name and email in the following format:
  Maintainer Name <username@address>.
* RDataClass: 'character(1)'. R / Bioconductor class the data are stored in,
  e.g., GRanges, SummarizedExperiment, ExpressionSet etc.
* DispatchClass: 'character(1)'. Determines how data are loaded into R. The
  value for this field should be 'Rda' if the data were serialized with 'save()'
  and 'Rds' if serialized with 'saveRDS'. The filename should have the
  appropriate 'rda' or 'rds' extension. The function
  'AnnotationHub::DispatchClassList()' will output a matrix of currently
  implemented DispatchClass and brief description of utility. If a predefine
  class does not seem appropriate contact lori.shepherd@roswellpark.org
* Location_Prefix: 'character(1)'. Do not include this field if data are stored
  in the Bioconductor AWS S3; it will be generated automatically. If data will
  be accessed from a location other than AWS S3 this field should be the base
  url.
* RDataPath: 'character(1)'.This field should be the remainder of the path to
  the resource. The 'Location_Prefix' will be prepended to 'RDataPath' for the
  full path to the resource.  If the resource is stored in Bioconductor's AWS S3
  buckets, it should start with the name of the package associated with the
  metadata and should not start with a leading slash. It should include the
  resource file name.
* Tags: 'character() vector'.  'Tags' are search terms used to define a subset
  of resources in a 'Hub' object, e.g, in a call to 'query'.

Any additional columns in the metadata.csv file will be ignored but could be
included for internal reference.

This is a bad example because these annotations are already in the hubs but it
should give you an idea of the format. Let's say I have a package myAnnotations
and I upload two annotation files for dog and cow with information extracted from
ensembl.  You would want the following saved as a csv (comma seperated output)
but for easier view we show in a table:


Title | Description | BiocVersion | Genome | SourceType | SourceUrl | SourceVersion | Species | TaxonomyId | Coordinate_1_based | DataProvider | Maintainer | RDataClass | DispatchClass | RDataPath
---------------|----------------------------------------------|-----|-------------|-----|----------------------------------------------------------------------------------------------------|------------|-------------|------|------|---------|-------------------------------------------------------|-----------|----------|--------------------------------------------------------------------------
Dog Annotation | Gene Annotation for Canis lupus from ensembl | 3.9 | Canis lupus | GTF | ftp://ftp.ensembl.org/pub/release-95/gtf/canis_lupus_dingo/Canis_lupus_dingo.ASM325472v1.95.gtf.gz | release-95 | Canis lupus | 9612 | true | ensembl | Bioconductor Maintainer <maintainer@bioconductor.org> | character | FilePath | myAnnotations/canis_lupus_dingo.ASM325472v1.95.gtf.gz
Cow Annotation | Gene Annotation for Bos taurus from ensemble | 3.9 | Bos taurus  | GTF | ftp://ftp.ensembl.org/pub/release-74/gtf/bos_taurus/Bos_taurus.UMD3.1.74.gtf.gz | release-74 | Bos taurus | 9913 | true | ensembl | Bioconductor Maintainer <maintainer@bioconductor.org> | character | FilePath | myAnnotations/Bos_taurus.UMD3.1.74.gtf.gz