% NOTE -- ONLY EDIT THE .Rnw FILE!
% The .tex file will be overwritten.
%
%\VignetteIndexEntry{FlowRepository R Interface}
%\VignetteDepends{FlowRepositoryR, flowCore}
%\VignetteKeywords{FlowRepository}
%\VignettePackage{FlowRepositoryR}

\documentclass[11pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{comment}
\usepackage{graphicx}
\usepackage{subfigure}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\title{FlowRepositoryR: the FlowRepository R Interface}
\author{Josef Spidlen}

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\begin{abstract}
FlowRepository is a free public flow cytometry data repository intended for authors of peer-reviewed manuscripts to deposit their underlying flow cytometry data, provide annotations, and share annotated datasets upon publication.
Primarily, FlowRepository is accessed via a web-based user interface (\url{https://flowrepository.org/}); however, it also includes an application programming interface (API), which allows for programmatic access from other software tools.
FlowRepositoryR is an R library that uses this API, allowing users to locate available datasets, review provided annotations, and download related data files to their local file system, all from within R and without opening a web browser.
Downloaded datasets can then be easily analyzed using flowCore and other libraries developed for this purpose.\\
\noindent \textbf{Keywords:} FlowRepository, flow cytometry, data repository, API, application programming interface, dataset download
\end{abstract}

\section{Introduction}

\subsection{Background}
Data associated with publications should be available and accessible.
Transparency and public availability of protocols, data, analyses, and results are crucial to make sense of the complex biology of human diseases.
Funding agencies, regulatory agencies, publishers, and the scientific community have all recognized the importance of protecting cumulative data outputs to accelerate subsequent exploitation through the community-based development of public data repositories \citep{pmid19815759}.

\subsection{FlowRepository}
Until recently, no public repository existed for flow cytometry data.
To address this issue, we developed FlowRepository \citep{pmid22887982,pmid22752950}, a public resource for authors to deposit their flow cytometry data, provide MIFlowCyt \citep{pmid18752282} compliant annotation, and share annotated datasets upon publication.
Development and maintenance of FlowRepository are generously supported by the Wallace H. Coulter Foundation, the International Society for Advancement of Cytometry (ISAC), the International Clinical Cytometry Society, various research grants, and the flow cytometry community in general.
Technically, FlowRepository has been developed by extending and adapting Cytobank \citep{pmid20578106}, an online tool for storage and collaborative analysis of cytometric data.
Primarily, FlowRepository is accessed via a web-based user interface (\url{https://flowrepository.org/}); however, it also includes an XML-based application programming interface (API).
This API allows for programmatic access from other software tools.
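
As a rough illustration of what programmatic access to an XML-based API involves, the following sketch parses a small XML document with the \Rpackage{XML} package and extracts dataset identifiers from it.
The XML payload, including its element names, attribute names, and the \texttt{FR-FCM-XXXX} identifier, is invented for this illustration and is an assumption; it does not describe the actual FlowRepository response format.

<<>>=
library(XML)

## Hypothetical payload: the element and attribute names below are invented
## for illustration and do not reflect the real FlowRepository API schema.
xmlText <- paste0(
  "<datasets>",
  "<dataset id='FR-FCM-ZZJ7' public='true'/>",
  "<dataset id='FR-FCM-XXXX' public='false'/>",
  "</datasets>")

## Parse the XML from a character string rather than from a file or URL
doc <- xmlParse(xmlText, asText = TRUE)

## Extract the id attribute of every dataset element
ids <- xpathSApply(doc, "//dataset", xmlGetAttr, "id")

## Keep only the identifiers whose (hypothetical) public flag is "true"
isPublic <- xpathSApply(doc, "//dataset", xmlGetAttr, "public") == "true"
ids[isPublic]
@

A client library can hide this kind of parsing behind ordinary R functions, so that users work with R objects rather than with raw XML.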

\subsection{FlowRepositoryR}
FlowRepositoryR is an R library that uses this API \citep{FlowRepositoryAPI} to allow users to locate available datasets, review annotations, and download related data files to their local file system.
Below, we demonstrate how this library can be used.

\section{Typical Use}

\subsection{Requirements}
You will need the R libraries \Rpackage{XML}, \Rpackage{RCurl}, and \Rpackage{tools} to install \Rpackage{FlowRepositoryR}.
These are used to parse the XML exchanged with the FlowRepository server and to establish the HTTP(S) connection.
If you are installing from Bioconductor, BiocManager should resolve those dependencies for you.
In addition, we recommend installing the \Rpackage{RUnit} library so that you can run the unit tests.
Finally, you will likely want to have \Rpackage{flowCore} \citep{pmid19358741} and other related libraries in order to analyze data from FlowRepository datasets obtained using the \Rpackage{FlowRepositoryR} package.

Assuming you have installed \Rpackage{FlowRepositoryR} already, we will start by loading the library.

<<>>=
library(FlowRepositoryR)
@

\subsection{List available datasets}
FlowRepository has been live since 2012 and it continues to see a steady increase in users, data submissions, and downloads.
As of March 2015, there are 440 datasets, 215 of which are public.
The majority of the private datasets are presumed to be related to manuscripts that are currently under peer review and will be made public once these manuscripts are published.
You can use the \Rfunction{flowRep.ls} function to list the identifiers of currently available datasets.

<<>>=
dataSets <- flowRep.ls()
## We will only show a maximum of 10 identifiers so that we don't
## clutter the vignette
head(dataSets, 10)
@

\subsection{Searching for a dataset}
You can use the \Rfunction{flowRep.search} function to search for public datasets matching your search criteria.
Only public datasets are being searched at this point.
The search covers experiment names, repository identifiers, keywords, researcher first and last names, reagents and reagent manufacturers, instruments and instrument manufacturers, sample annotations, and manuscript identifiers.
A vector of identifiers of matching datasets is returned.
NULL is returned if no matching datasets are found.

<<>>=
flowRep.search("OMIP-016")
@

\subsection{Review information about a dataset}
While an extended search functionality is being developed, for now we will assume that you know which dataset you are interested in.
You can use the \Rfunction{flowRep.get} function to obtain a dataset from FlowRepository.
This will retrieve information about the dataset, but it will not download the data.

<<>>=
## FR-FCM-ZZJ7 is a purposely picked dataset that is public and very
## small for the unit tests and the vignette and man pages to compile
## quickly. Also, FlowRepository is not tracking the downloads of this
## particular dataset since the stats would be based mainly on these
## automated downloads.
ds <- flowRep.get("FR-FCM-ZZJ7")
summary(ds)
@

This will return a FlowRepository dataset represented by an object of the \Rclass{flowRepData} class.
See section \ref{l:flowrepdataclass} for more details about this class; you can also use the \Rfunction{str} command to inspect the returned object.

\subsection{Download the data}
Data associated with a FlowRepository dataset can be downloaded using the \Rfunction{download} method of the \Rclass{flowRepData} class.

<<>>=
ds <- download(ds)
summary(ds)
@

Assuming the dataset exists and you have permission to access it, this will download the whole dataset, including all FCS files and attachment files associated with it.
Unless specified otherwise (see section \ref{l:downloadoptions}), the download method will create a new directory in your current working directory, name it based on the identifier of the dataset, and download the files there.
A separate \texttt{attachments} subfolder will be created for the attachments.
The location where these files were downloaded can be obtained from the local path slot of the file proxies.
For example, the local path of the first downloaded FCS file can be obtained as follows:

<<>>=
localpath(fcs.files(ds)[[1]])
@

If we wanted the local paths of all the downloaded FCS files, we could use the \Rfunction{lapply} function as follows:

<<>>=
unlist(lapply(fcs.files(ds), function(x) paste(localpath(x))))
@

Analogously, we can locate all the attachments as follows:

<<>>=
unlist(lapply(attachments(ds), function(x) paste(localpath(x))))
@

\subsection{Downloading private datasets}
In order to download a private dataset, you will need to register with FlowRepository.
Open your web browser and navigate to \url{http://flowrepository.org/}.
Then follow the \textit{Login} link in the top right corner of the page.
Next, either sign in or follow the registration link if you haven't signed up yet.
FlowRepository uses OpenID or Google+ for web-based authentication.
The \Rpackage{FlowRepositoryR} package (and the FlowRepository API in general) uses email/password based authentication instead, which needs to be set up separately in your profile.
Once you have logged in using your web browser, click on the \textit{Welcome Your Name} link in the top right corner next to the \textit{Logout} link.
This will open your profile.
Next, follow the \textit{Edit} link from the actions panel on your left.
Scroll down and set your API password as shown in Figure \ref{fig:SetFlowRepositoryAPIPassword}.
The API password must be at least 8 characters long and include at least one number, one upper-case character, and one lower-case character.
Set your password and confirm it by clicking on the \textit{Update} button.

\begin{figure}[h!]
\begin{center}
\includegraphics[width=1\textwidth]{FlowRepositorySetAPIpasswd.png}
\end{center}
\caption{\textbf{Setting FlowRepository API access password.}
FlowRepository uses OpenID or Google+ authentication for web-based access, but those are separate from the application programming access, which needs to be set up in your profile by providing an API password.
}
\label{fig:SetFlowRepositoryAPIPassword}
\end{figure}

Once you have set your password online, you can use the \Rfunction{setFlowRepositoryCredentials} function to set your FlowRepository API credentials, which will give you access to non-public datasets created by you or shared with you in FlowRepository.

<<>>=
setFlowRepositoryCredentials(email="boo@gmail.com", password="foo123456")
@

Alternatively, you can provide the \texttt{filename} argument instead of the email and password arguments, which will read your credentials from a text file.
This file must contain two lines: the email address on the first line and the password on the second line.
Finally, the function will prompt for credentials if it is called without arguments in an interactive session.

Once your credentials are set, you can use the \texttt{include.private=TRUE} option of the \Rfunction{flowRep.ls()} function to include non-public datasets in the list of available datasets.
In the \Rfunction{download} method, credentials will be used automatically if they are set.
You can disable this by passing the \texttt{use.credentials=FALSE} argument to the \Rfunction{download} method of a \Rclass{flowRepData} object.

To conclude this section, let's forget the credentials that we set, since the \texttt{boo@gmail.com} email and \texttt{foo123456} password are not real credentials for accessing FlowRepository.

<<>>=
forgetFlowRepositoryCredentials()
@

\subsection{Additional download options}
\label{l:downloadoptions}
The \texttt{dirpath} argument may be passed to the \Rfunction{download} method of a \Rclass{flowRepData} object.
This can be used to specify the directory on the local file system where the dataset shall be downloaded.
By default, the files will be downloaded to a folder named after the dataset identifier (FR-FCM-xxxx), which will be created in your current working directory.
If you don't want to see progress messages as files are being downloaded, you can turn them off by passing the \texttt{show.progress=FALSE} argument to the \Rfunction{download} method of a \Rclass{flowRepData} object.

\subsection{Downloading only certain files from a dataset}
Should you wish to download only some files of a FlowRepository dataset, you can do so by using the \Rfunction{download} method of the \Rclass{fileProxy} objects (\textit{i.e.}, \Rclass{fcsProxy} or \Rclass{attachmentProxy}).
For example:

<<>>=
myDataset <- flowRep.get("FR-FCM-ZZJ7")
summary(myDataset)
## Download a single attachment file
at1 <- download(attachments(myDataset)[[1]])
localpath(at1)
summary(at1)
## Download a single FCS file
fcs1 <- download(fcs.files(myDataset)[[1]])
localpath(fcs1)
summary(fcs1)
@

\section{Representing FlowRepository Datasets}

\subsection{The \Rclass{flowRepData} Class}
\label{l:flowrepdataclass}
FlowRepository datasets are represented by \Rclass{flowRepData} objects.
Slots of this class capture the metadata of (information about) the dataset as follows:
\begin{description}
\item[\texttt{id}:]{Object of class \texttt{character} containing the FlowRepository identifier of the dataset.
These identifiers are typically in the form of \texttt{FR-FCM-}xxxx where xxxx represents 4 alphanumeric characters.}
\item[\texttt{public.url}:]{Object of class \texttt{character} or NULL containing the public URL of this dataset.
This will commonly be in the form of \texttt{https://flowrepository.org/id/}identifier, where identifier is the FlowRepository identifier of the dataset.}
\item[\texttt{name}:]{Object of class \texttt{character} or NULL containing the name of this dataset.}
\item[\texttt{public}:]{Object of class \texttt{logical} or NULL indicating whether this dataset is public.}
\item[\texttt{primary.researcher}:]{Object of class \texttt{character} or NULL containing the name of the primary researcher associated with this dataset.}
\item[\texttt{primary.investigator}:]{Object of class \texttt{character} or NULL containing the name of the primary investigator associated with this dataset.}
\item[\texttt{uploader}:]{Object of class \texttt{character} or NULL containing the name of the uploader of this dataset.}
\item[\texttt{experiment.dates}:]{Object of class \texttt{character} or NULL containing the dates associated with this dataset.
Typically, there will be two dates associated with the dataset, the first one for the start of the experiment, the second one for the end of the experiment.
A single date indicates the start of an experiment that may still be ongoing.
The dates shall be encoded as ``YYYY-MM-DD''.}
\item[\texttt{purpose}:]{Object of class \texttt{character} or NULL stating the purpose of this dataset (experiment).}
\item[\texttt{conclusion}:]{Object of class \texttt{character} or NULL stating the conclusion associated with this dataset (typically conclusions reached by analyzing the data).}
\item[\texttt{comments}:]{Object of class \texttt{character} or NULL stating additional comments associated with this dataset.}
\item[\texttt{funding}:]{Object of class \texttt{character} or NULL stating the funding used to collect the data in this dataset.}
\item[\texttt{qc.measures}:]{Object of class \texttt{character} or NULL stating the quality control measures taken in order to ensure the quality of data in this dataset.}
\item[\texttt{miflowcyt.score}:]{Object of class \texttt{numeric} or NULL stating the MIFlowCyt compliance score of this experiment.
MIFlowCyt is the Minimum Information about a Flow Cytometry Experiment, an ISAC Recommendation listing the minimum information that shall be provided as annotation of flow cytometry datasets.
The MIFlowCyt compliance score is a value between 0 and 100 percent indicating the level of compliance with MIFlowCyt.
Details about how FlowRepository calculates this score are available here: \url{http://flowrepository.org/quick_start_guide#MIFlowCytScoreReport}}
\item[\texttt{keywords}:]{Object of class \texttt{list} (of objects of class \texttt{character}) enumerating keywords associated with this dataset.}
\item[\texttt{publications}:]{Object of class \texttt{list} (of objects of class \texttt{character}) enumerating publications associated with this dataset.
Publications are typically listed as ``PMID:12345678'' or ``PMCID:PMC1234567''.}
\item[\texttt{organizations}:]{Object of class \texttt{list} of objects of class \texttt{flowRepOrganization} (see section \ref{l:flowRepOrganizationclass}) enumerating organizations associated with this dataset.}
\item[\texttt{fcs.files}:]{Object of class \texttt{list} of objects of class \texttt{fcsProxy} enumerating FCS files associated with this dataset.}
\item[\texttt{attachments}:]{Object of class \texttt{list} of objects of class \texttt{attachmentProxy} enumerating attachments associated with this dataset.}
\end{description}

\subsection{The \Rclass{flowRepOrganization} Class}
\label{l:flowRepOrganizationclass}
The \Rclass{flowRepOrganization} class represents the name and address of an organization associated with a dataset stored in FlowRepository.
Slots of this class capture the information as follows:
\begin{description}
\item[\texttt{name}:]{Object of class \texttt{character} containing the name of the organization.}
\item[\texttt{street}:]{Object of class \texttt{character} or NULL containing the street of the address of the organization.}
\item[\texttt{city}:]{Object of class \texttt{character} or NULL containing the city of the address of the organization.}
\item[\texttt{zip}:]{Object of class \texttt{character} or NULL containing the zip (or postal code) of the address of the organization.}
\item[\texttt{state}:]{Object of class \texttt{character} or NULL containing the state (or province) of the address of the organization.}
\item[\texttt{country}:]{Object of class \texttt{character} or NULL containing the country of the address of the organization.}
\end{description}

\clearpage
\bibliographystyle{plainnat}
\bibliography{Refs}

\end{document}