% NOTE -- ONLY EDIT THE .Rnw FILE!
% The .tex file will be overwritten.
%
%\VignetteIndexEntry{FlowRepository R Interface}
%\VignetteDepends{FlowRepositoryR, flowCore}
%\VignetteKeywords{FlowRepository}
%\VignettePackage{FlowRepositoryR}
\documentclass[11pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{times}
\usepackage{comment}
\usepackage{graphicx}
\usepackage{subfigure}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}


\title{FlowRepositoryR: the FlowRepository R Interface}
\author{Josef Spidlen}

\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle

\begin{abstract}

FlowRepository is a free public flow cytometry data repository 
intended for authors of peer-reviewed manuscripts to deposit their 
underlying flow cytometry data, provide annotations, and share annotated 
datasets upon publication.
Primarily, FlowRepository is accessed via a web-based user interface
(https://flowrepository.org/), however, it also includes an application
programming interface (API), which allows for programmatic access from
other software tools.
FlowRepositoryR is an R library that utilizes this API allowing users
to locate available datasets, review provided annotations and download 
releated data files to their local file system, all this conveniently
from within R without the need of opening a web browser. Downloaded 
datasets can then be easily analyzed using flowCore and other libraries
developed for this purpose.\\

\noindent \textbf{Keywords:} FlowRepository, flow cytometry, data repository, 
API, application programming interface, dataset download

\end{abstract}

\section{Introduction}

\subsection{Background}

Data associated with publications should be available and accessible.
Transparency and public availability of protocols, data, analyses, and 
results are crucial to make sense of the complex biology of human diseases.
Funding agencies, regulatory agencies, publishers, and the scientific 
community have all recognized the importance of protecting cumulative 
data outputs to accelerate subsequent exploitation through the 
community-based development of public data repositories \citep{pmid19815759}.

\subsection{FlowRepository}
Until recently, no public repository existed for flow cytometry data.
In order to address this issue, we developed 
FlowRepository \citep{pmid22887982,pmid22752950} - a public resource 
for authors to deposit their flow cytometry data, provide 
MIFlowCyt \citep{pmid18752282} compliant annotation,
and share annotated datasets upon publication.
Development and maintenance of FlowRepository is generously supported 
by the Wallace H. Coulter Foundation, the International Society for 
Advancement of Cytometry (ISAC), the International Clinical Cytometry 
Society, various research grants and the flow cytometry community
in general.
Technically, FlowRepository has been developed by extending and adapting 
Cytobank \citep{pmid20578106}, an online tool for storage and collaborative 
analysis of cytometric data.
Primarily, FlowRepository is accessed via a web-based user interface
(\url{https://flowrepository.org/}), however, it also includes an 
XML-based application programming interface (API).
This API allows for programmatic access from other software tools.

\subsection{FlowRepositoryR}
FlowRepositoryR is an R library that utilizes this API 
\citep{FlowRepositoryAPI} 
to allow users locate available datasets, review annotations and download 
releated data files to their local file system.
Below, we will demonstrate how this library can be used.

\section{Typical Use}

\subsection{Requirements}

You will need R libraries \Rpackage{XML}, \Rpackage{RCurl}, and 
\Rpackage{tools} in order to install \Rpackage{FlowRepositoryR}.
These are being used to parse the XML used to communicate with the
FlowRepository server and to establish the HTTP(s) connection.
If you are installing from BioConductor, BiocManager should resolve
those dependencies for you. 
In addition, it is recommended to also have the \Rpackage{RUnit}
library to be able to run unit tests. Finally, you will likely 
want to have \Rpackage{flowCore} \citep{pmid19358741} and other 
related libraries in order to analyze data from FlowRepository 
datasets obtained using the \Rpackage{FlowRepositoryR} package.
Assuming you have installed \Rpackage{FlowRepositoryR} already, 
we will start by loading the library. 

<<LoadPackage, echo=true,results=hide>>=
library(FlowRepositoryR)
@

\subsection{List available datasets}

FlowRepository has been live since 2012 and it continues to see a steady 
increase in users, data submissions and downloads.
As of March 2015, there are 440 datasets, 215 of those are public.
The majority of the private datasets are presumed to be related to 
manuscripts that are currently under peer-review and will be made 
public once these manuscripts are published. You can use the 
\Rfunction{flowRep.ls} function in order to list the identifiers
of currently available datasets.

<<ListDatasets, echo=true, results=verbatim>>=
dataSets <- flowRep.ls()
## We will only show a maximum of 10 identifiers so that we don't
## clutter the vignette
dataSets[1:min(10, length(dataSets))]
@

\subsection{Searching for a dataset}

You can use the \Rfunction{flowRep.search} function in order to 
search for public datasets matching your search criteria. 
Only public datasets are being searched at this point.
The search covers experiment names, repository identifiers, keywords,
researcher first and last names, reagents and reagent manufactures,
instruments and instrument manufactures, sample annotations and
manuscript identifiers. A vector of identifiers of matching
datasets is retrieved. NULL is returned if no matching datasets are found.

<<SearchDatasets, echo=true, results=verbatim>>=
flowRep.search("OMIP-016")
@

\subsection{Review information about a datasets}

While an extended search functionality is being developed, for now
we will assume that you know which dataset you are interested in.
You can use the \Rfunction{flowRep.get} function in order to obtain 
a dataset from FlowRepository. This will retrieve information about
the dataset but it will not download the data.

<<GetDataset, echo=true, results=verbatim>>=
## FR-FCM-ZZJ7 is a purposely picked dataset that is public and very
## small for the unit tests and the vignette and man pages to compile 
## quickly. Also, FlowRepository is not tracking the downloads of this 
## particular dataset since the stats would be based mainly on these 
## automated downloads.
ds <- flowRep.get("FR-FCM-ZZJ7")
summary(ds)
@

This will return a FlowRepository dataset represented by an object of
the \Rclass{flowRepData} class.
See section \ref{l:flowrepdataclass} for more details about the dataset, or 
you can also use the \Rfunction{str} command to inspect the returned object.

\subsection{Download the data}

Data associated with a FlowRepository dataset can be downloaded using
the \Rfunction{download} method of the \Rclass{flowRepData} class.

<<DownloadData, echo=true, results=verbatim>>=
ds <- download(ds)
summary(ds)
@

Assuming the dataset exists and you have permissions to access it, this 
will download the whole dataset including all FCS files and attachment
files associated with it. 
Unless specified otherwise (see section \ref{l:downloadoptions}), the download
method will create a new directory in your current working directory, name
it based on the identifier of the dataset, and dowload the files there.
A separate \texttt{attachments} subfolder will be created for the attachments.
The location where these files were downloaded can
be obtained from the local path slot of the file proxies. For example, the
local path of the first downloaded dataset can be obtained as follows:

<<WhereIsMyFile, echo=true, results=verbatim>>=
localpath(fcs.files(ds)[[1]])
@

If we wanted the local path of all the downloaded FCS files, we could use 
the \Rfunction{lapply} function as follows:

<<WhereAreMyFCSFiles, echo=true, results=verbatim>>=
unlist(lapply(fcs.files(ds), function(x) paste(localpath(x))))
@

Analogously, we can locate all the attachments as follows:

<<WhereAreMyAttachments, echo=true, results=verbatim>>=
unlist(lapply(attachments(ds), function(x) paste(localpath(x))))
@

\subsection{Downloading private datasets}

In order to download a private dataset, you will need to register with
FlowRepository. Open your web browser and navigate to 
\url{http://flowrepository.org/}. Then follow the \textit{Login} link
in the top right corner of the page. Next, either Sign-in or follow
the registration link if you haven't signed up yet.
FlowRepository uses OpenID or Google+ authentication. Those are used
for web-based authentication. The \Rpackage{FlowRepositoryR} package
(and FlowRepository API in general) use a email/password based authentication.
This needs to be set in your profile independently. Once you have logged in
in your web browser, click on the \textit{Welcome Your Name} link in
the top right corner next to the \textit{Logout} link. This will enter your
profile. Next, follow the \textit{Edit} link from the actions panel on your
left. Scroll down and set your API password as shown in Figure 
\ref{fig:SetFlowRepositoryAPIPassword}. The API password shall use 8 or 
more characters and include at least one number, one upper-case character 
and one lower-case character. Set your password and confirm it by
clicking on the \textit{Update} button.

\begin{figure}[h!]
\begin{center}
\includegraphics[width=1\textwidth]{FlowRepositorySetAPIpasswd.png}
\end{center}
\caption{\textbf{Setting FlowRepository API access password.}
FlowRepository uses OpenID or Google+ authentication for web-based access,
but those are separate from the application programming access, which needs
to be set in your profile by providing an API password.
}
\label{fig:SetFlowRepositoryAPIPassword}
\end{figure}

Once you have set your password online, you can use the 
\Rfunction{setFlowRepositoryCredentials} to set your
FlowRepository API credentials, which will give you access to
non-public datasets created by you or shared with you in FlowRepository.
<<setFlowRepositoryCredentials, echo=true, results=verbatim>>=
setFlowRepositoryCredentials(email="boo@gmail.com", password="foo123456")
@

Alternativelly, you can provide the \texttt{filename} argument instead
of the email and passwords arguments, which will read your credentials
from a text file. This file shall include 2 lines, email address in the 
first line, password in the second line.
Finally, the function will prompt for credentials if called without 
arguments in an interactive mode.

Once your credentials are set, you can use the \texttt{include.private=TRUE}
option of the \Rfunction{flowRep.ls()} function in order to include
non-public dataset in the list of available datasets. In the 
\Rfunction{download} method, if credentials are set then those will 
be used automaticaly. You can disable this by passing the 
\texttt{use.credentials=FALSE} argument to the \Rfunction{download} method of a
\Rclass{flowRepData} object. 

To conclude this section, let's forget the set credentials as the 
\texttt{boo@gmail.com} email and \texttt{foo123456} password are not real
credentials to access FlowRepository.
<<forgetFlowRepositoryCredentials, echo=true, results=verbatim>>=
forgetFlowRepositoryCredentials()
@

\subsection{Additional download options}
\label{l:downloadoptions}

The \texttt{dirpath} argument may be passed to the to the 
\Rfunction{download} method of a \Rclass{flowRepData} object.
This can be used to specify the directory on the local file system
where the dataset shall be downloaded. By default, the files will 
be downloaded to a folder named based on the dataset identifier (FR-FCM-xxxx), 
which will be created in your current working directory.

If you don't want to see the progress about files as they are being downloaded,
you can turn this off by passing the \texttt{show.progress=FALSE} argument
to the \Rfunction{download} method of a \Rclass{flowRepData} object.

\subsection{Downloading only certain files from a dataset}

Should you wish to download only some files of a FlowRepository
dataset, you can do so by using the \Rfunction{download} method of 
the \Rclass{fileProxy} objects (\textit{i.e}, \Rclass{fcsProxy} or 
\Rclass{attachmentProxy}). For example

<<DownloadDataPartially, echo=true, results=verbatim>>=
myDataset <- flowRep.get("FR-FCM-ZZJ7")
summary(myDataset)

## And download a single attachment file
at1 <- download(attachments(myDataset)[[1]])
localpath(at1)
summary(at1)

## A single FCS file proxy can be downloaded
fcs1 <- download(fcs.files(myDataset)[[1]])
localpath(fcs1)
summary(fcs1)
@


\section{Representing FlowRepository Datasets}

\subsection{The \Rclass{flowRepData} Class}
\label{l:flowrepdataclass}

FlowRepository datasets are represented by \Rclass{flowRepData} objects.
Slots of this class capture the metadata (information about) the dataset
as follows:

\begin{description}
\item[\texttt{id}:]{Object of class \texttt{character} containing the
    FlowRepository identified of the dataset. These identifiers are
    typically in the form of \texttt{FR-FCM-}xxxx where xxxx represents
    4 alphanumeric characters.}
\item[\texttt{public.url}:]{Object of class \texttt{character} or
    NULL containing the public URL of this dataset. This will 
    commonly be in the form of
    \texttt{https://flowrepository.org/id/}identifier, where 
    identifier is the FlowRepository identified of the dataset.}
\item[\texttt{name}:]{Object of class \texttt{character} or 
    NULL containing the name of this dataset.}
\item[\texttt{public}:]{Object of class \texttt{logical} or 
    NULL containing the information whether this dataset is public.}
\item[\texttt{primary.researcher}:]{Object of class \texttt{character} or
    NULL containing the name of the primary researcher 
    associated with this dataset.}
\item[\texttt{primary.investigator}:]{Object of class \texttt{character} or
    NULL containing the name of the primary investigator 
    associated with this dataset.}
\item[\texttt{uploader}:]{Object of class \texttt{character} or 
    NULL containing the name of the uploader of this dataset.}
\item[\texttt{experiment.dates}: ]{Object of class \texttt{character} or 
    NULL containing the dates associated with this dataset. 
    Typically, there will be two dates associated with the dataset, 
    the first one for the start of the experiment, the second one 
    for the end of the experiment. A single date indicates the start
    of an experiment that may still be ongoing. The dates shall 
    be encoded as "YYYY-MM-DD".}
\item[\texttt{purpose}:]{Object of class \texttt{character} or 
    NULL stating the purpose of this dataset (experiment).}
\item[\texttt{conclusion}:]{Object of class \texttt{character} or 
    NULL stating the conclusion associated with this dataset 
    (typically conclusions reached by analyzing the data).}
\item[\texttt{comments}:]{Object of class \texttt{character} or 
    NULL stating additional comments associated with this dataset.}
\item[\texttt{funding}:]{Object of class \texttt{character} or NULL 
    stating the funding used to collect the data in this dataset.}
\item[\texttt{qc.measures}:]{Object of class \texttt{character} or 
    NULL stating the quality control measures taken in order to 
    ensure the quality of data in this dataset.}
\item[\texttt{miflowcyt.score}:]{Object of class \texttt{numeric} or 
    NULL stating the MIFlowCyt compliance score of this experiment. 
    MIFlowCyt is the Minimum Information about a Flow Cytometry 
    Experiment - an ISAC Recommendation listing the minimum 
    information that shall be provided as annotation of flow 
    cytometry datasets. The MIFlowCyt compliance score is a value 
    between 0 and 100 percent indicating the level of compliance 
    with MIFlowCyt. Details about how FlowRepository calculates 
    this score ara available here:
    \url{http://flowrepository.org/quick_start_guide#MIFlowCytScoreReport}}
\item[\texttt{keywords}:]{Object of class \texttt{list} (of objects 
    of class \texttt{character}) enumerating keywords associated 
    with this dataset.}
\item[\texttt{publications}:]{Object of class \texttt{list} (of objects 
    of class \texttt{character}) enumerating publications associated 
    with this dataset. Publications are typically listed as
    "PMID:12345678" or "PMCID:PMC1234567".}
\item[\texttt{organizations}:]{Object of class \texttt{list} of objects 
    of class \texttt{flowRepOrganization} (see section
    \ref{l:flowRepOrganizationclass})
    enumerating organizations associated with this dataset.}
\item[\texttt{fcs.files}:]{Object of class \texttt{list} of objects 
    of class \texttt{fcsProxy}
    enumerating FCS files associated with this dataset.}
\item[\texttt{attachments}:]{Object of class \texttt{list} of objects 
    of class \texttt{attachmentProxy}
    enumerating attachments associated with this dataset.}
\end{description}


\subsection{The \Rclass{flowRepOrganization} Class}
\label{l:flowRepOrganizationclass}

The \Rclass{flowRepOrganization} class represents the name and address of 
an organization associated with a dataset stored in FlowRepository.
Slots of this class capture the information as follows:

\begin{description}
\item[\texttt{name}:]{Object of class \texttt{character} containing the
    name of the organization.}
\item[\texttt{street}:]{Object of class \texttt{character} or NULL 
    containing the street of the address of the organization.}
\item[\texttt{city}:]{Object of class \texttt{character} or NULL 
    containing the city of the address of the organization.}
\item[\texttt{zip}:]{Object of class \texttt{character} or NULL 
    containing the zip (or postal code) of the address of the 
    organization.}
\item[\texttt{state}:]{Object of class \texttt{character} or NULL 
    containing the state (or province) of the address of the 
    organization.}
\item[\texttt{country}:]{Object of class \texttt{character} or NULL 
    containing the country of the address of the organization.}
\end{description}


\clearpage
\bibliographystyle{plainnat} 
\bibliography{Refs}
\end{document}