---
title: "Vignette 1: Getting Started with BANDLE"
author:
- name: Oliver M. Crook 
  affiliation: Department of Statistics, University of Oxford, UK
- name: Lisa M. Breckels
  affiliation: Cambridge Centre for Proteomics, University of Cambridge, UK
package: bandle
abstract: >
  This vignette provides an introduction to the BANDLE package [@bandle] and 
  follows a short theortical example of how to perform differential  
  localisation analysis of quantitative proteomics data using the BANDLE
  model. Explanation and general recommendations of the input parameters 
  are provided here. For a more comprehensive workflow which follows a real-life 
  use case, please see the second vignette in this package.
output:
  BiocStyle::html_document:
    toc_float: true
bibliography: bandle.bib
vignette: >
  %\VignetteIndexEntry{Analysing differential localisation experiments with BANDLE: Vignette 1}
  %\VignetteEngine{knitr::rmarkdown}
  %%\VignetteKeywords{Mass Spectrometry, MS, MSMS, Proteomics, Metabolomics, Infrastructure, Quantitative}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(dpi=25,fig.width=7)
```

# Introduction

Bayesian ANalysis of Differential Localisation Experiments (BANDLE) is an
integrative semi-supervised functional mixture model, developed by [@bandle], 
to obtain the probability of a protein being differentially
localised between two conditions. 

In this vignette we walk users through how to install and use the R [@Rstat]
Bioconductor [@Gentleman:2004] [`bandle` package](https://github.com/ococrook/bandle) 
by simulating a well-defined differential localisation experiment from spatial
proteomics data from the `pRolocdata` package [@pRoloc:2014].

The BANDLE method uses posterior Bayesian computations performed using
Markov-chain Monte-Carlo (MCMC) and thus uncertainty estimates are
available [@Gilks:1995]. Throughout this vignette we use the term *differentially
localised* to pertain to proteins which are assigned to different sub-cellular
localisations between two conditions. One of the main outputs of BANDLE is the
probability that a protein is differentially localised between two conditions.

# Installation

The package can be installed with the `BiocManager` package:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("bandle")
```

and then loaded,

```{r ldpkg, message=FALSE}
library("bandle")
```

For visualisation we also load the packages,

```{r ldpkg2, message=FALSE}
library("pheatmap")
library("viridis")
```

# The data

In this vignette and @bandle, the main data source that we use to study
differential protein sub-cellular localisation are data from high-throughput
mass spectrometry-based experiments. The data from these types of experiments
traditionally yield a matrix of measurements wherein we have, for example, PSMs,
peptides or proteins along the rows, and samples/channels/fractions along the
columns. The `bandle` package uses the `MSnSet` class as implemented in the
Bioconductor `r Biocpkg("MSnbase")` package and thus requires users to import
and store their data as a `MSnSet` instance. For more details on how to create a
`MSnSet` see the relevant vignettes in `r Biocpkg("pRoloc")`. The 
`r Biocpkg("pRolocdata")` experiment data package is a good starting place to
look for test data. This data package contains tens of quantitative proteomics
experiments, stored as `MSnSet`s.

## A well-defined theoretical example

To get started with the basics of using `bandle` we begin by generating a simple
example dataset which simulates a differential localisation experiment (please
see the second vignette in this package for a full real-life biological use
case). In this example data, the key elements are replicates, and a perturbation
of interest. There is code within the `r Biocpkg("bandle")` package to simulate
an example experiment.

In the code chunk below we begin by loading the `r Biocpkg("pRolocdata")`
package to obtain a spatial proteomics dataset. This will be the basis of our
simulation which will use boostrapping to generate new datasets. The dataset we
have chosen to load is a dataset from 2009 (`tan2009r1`). This is data from a
early LOPIT experiment performed on Drosophila embryos [@Tan:2009]. The aim of
this experiment was to apply LOPIT to an organism with heterogeneous cell types.
This experiment used four isotopes across four distinct fractions and thus
yielded four measurements (features) per protein profile. We visualise the
data by using principal components analysis.

```{r loadpkgdat, message=FALSE, warning=FALSE, fig.width=7, fig.height=6}
library("pRolocdata")
data("tan2009r1")

## Let's set the stock colours of the classes to plot to be transparent
setStockcol(NULL)
setStockcol(paste0(getStockcol(), "90")) 

## Plot the data using plot2D from pRoloc
plot2D(tan2009r1,
       main = "An example spatial proteomics datasets", 
       grid = FALSE)
addLegend(tan2009r1, where = "topleft", cex = 0.7, ncol = 2)
```
    
The following code chuck simulates a differential localisation experiment. It
will generate `numRep/2` of each a control and treatment condition. We will also
simulate relocalisations for `numDyn` proteins.

```{r simdata, fig.width=8, fig.height=10}
set.seed(1)
tansim <- sim_dynamic(object = tan2009r1,
                      numRep = 6L,
                      numDyn = 100L)

```

The list of the 6 simulated experiments are found in `tansim$lopitrep`. Each one
is an `MSnSet` instance (the standard data container for proteomics experimental
data). The first 3 are the simulated control experiments (see
`tansim$lopitrep[1:3]`), and the following 3 in the list are the treatment
condition simulated experiments (see `tansim$lopitrep[4:6]`). We can plot them
using the `plot2D` function from `pRoloc`.

```{r plotsims}
plot_title <- c(paste0("Replicate ", seq(3), " condition", " A"), 
               paste0("Replicate ", seq(3), " condition", " B"))

par(mfrow = c(2, 3))
out <- lapply(seq(tansim$lopitrep), function(z) 
    plot2D(tansim$lopitrep[[z]], grid = FALSE, main = plot_title[z]))
```

For understanding, exploring and visualizing individual spatial proteomics
experiments, see the vignettes in `pRoloc` and `MSnbase` packages.
```{r,}
tansim$lopitrep[[1]]
```

# Preparing for `bandle` analysis

The main function of the package is `bandle`, this uses a complex model
to analyse the data. Markov-Chain Monte-Carlo (MCMC) is used to sample the
posterior distribution of parameters and latent variables. From which statistics
of interest can be computed. Here we only run a few iterations for brevity but
typically one needs to run thousands of iterations to ensure convergence, as
well as multiple parallel chains.

## Fitting Gaussian processes  

First, we need to fit non-parametric regression functions to the markers
profiles, upon which we place our analysis. This uses Gaussian processes. The
`fitGPmaternPC` function can be used and fits some default penalised complexity
priors (see `?fitGP`), which works well. However, these can be altered, which is
demonstrated in the next code chunk


```{r fitgps, fig.height=10, fig.width=8}
par(mfrow = c(4, 3))
gpParams <- lapply(tansim$lopitrep, function(x) 
  fitGPmaternPC(x, hyppar = matrix(c(10, 60, 250), nrow = 1)))
```

We apply the `fitGPmaternPC` function to each datasets by calling `lapply` over
the `tansim$lopitrep` list of datasets. The output of `fitGPmaternPC` returns a
list of posterior predictive means and standard deviations. As well as MAP
hyperparamters for the GP. 

Note here we the use the default `hyppar = matrix(c(10, 60, 250), nrow = 1)` as
a starting point for fitting the GPs to the marker profile distributions. In the
@bandle manuscript we found that these values worked well for smaller spatial
proteomics datasets. This was visually assessed by passing these values and
visualising the GP fit using the `plotGPmatern` function.

The `plotGPmatern` function can be used to plot the profiles for each
class in each replicate condition with the posterior predictive distributions
overlayed with the markers protein profiles.

For example, to plot the predictive distributions of the first dataset,
```{r plotgps, fig.height=10, fig.width=8}
par(mfrow = c(4, 3))
plotGPmatern(tansim$lopitrep[[1]], params = gpParams[[1]])
```

The prior needs to form a `K*3` matrix. `K` corresponds to the number of
subcellular classes in the data, and 3 columns for (1) the prior, (2)
length-scale amplitude and (3) standard deviation parameters (see `hyppar` in
`?fitGPmaternPC`). Increasing these values, increases the shrinkage. For more
details see the manuscript by @bandle. We strongly recommend users start with
the recommended `hyppar` parameters and change and assess them as necessary for
their dataset by visually evaluating the fit of the GPs using the `plotGPmatern`
function.

```{r sethyppar}
K <- length(getMarkerClasses(tansim$lopitrep[[1]], fcol = "markers"))
pc_prior <- matrix(NA, ncol = 3, K)
pc_prior[seq.int(1:K), ] <- matrix(rep(c(10, 60, 250),
                                       each = K), ncol = 3)

```


Now we have generated these complexity priors we can pass them as an
argument to the `fitGPmaternPC` function. For example,

```{r runhyppar}
gpParams <- lapply(tansim$lopitrep,
                   function(x) fitGPmaternPC(x, hyppar = pc_prior))
```

By looking at the plot of posterior predictives using the `gpParams` we can see
the GP fit looks sensible.

## Setting the prior on the weights

The next step is to set up the matrix Dirichlet prior on the mixing weights. These
weights are defined across datasets so these are slightly different to mixture
weights in usual mixture models. The $(i,j)^{th}$ entry is the prior probability
that a protein localises to organelle $i$ in the control and $j$ in the treatment.
This mean that off-diagonal terms have a different interpretation to diagonal terms.
Since we expect re-localisation to be rare, off-diagonal terms should be small.
The following functions help set up the priors and how to interpret them. The
parameter `q` allow us to check the prior probability that more than `q`
differential localisations are expected.

```{r setweightprior}
set.seed(1)
dirPrior = diag(rep(1, K)) + matrix(0.001, nrow = K, ncol = K)
predDirPrior <- prior_pred_dir(object = tansim$lopitrep[[1]],
                               dirPrior = dirPrior,
                               q = 15)
```

The mean number of re-localisations is small:
```{r,}
predDirPrior$meannotAlloc
```

The prior probability that more than `q` differential localisations are
expected is small
```{r,}
predDirPrior$tailnotAlloc
```

The full prior predictive can be visualised as histogram. The prior probability
that proteins are allocated to different components between datasets concentrates
around 0.

```{r,}
hist(predDirPrior$priornotAlloc, col = getStockcol()[1])
```

For most use-cases we indeed expect the number of differential
localisations to be small. However, there may be specific cases where one may
expect the either a smaller or larger number of differential localisations.
Users could try testing different values for the `dirPrior` for example,
replacing 0.001 with 0.0005 or smaller, for larger datasets to bring the number
of expected re-localisations inline with the biological expectation, and
visa-versa when we expect the number of proteins to have changed to be higher.


# Running the `bandle` function

We are now ready to run the main `bandle` function. Remember to carefully
select the datasets and replicates that define the control and treatment. 
As a reminder, in this introductory vignette we have used a small dataset
and generated theoretical triplicates of each theoretical condition. Please see
the second vignette in this package for a more detailed workflow and real
biological use-case. In the below code chunk we run `bandle` for only 50
iterations for the convenience of building the vignette, but typically we'd
recommend you run the number of iterations (`numIter`) in the 1000s.

Remember: the first 3 datasets are the first 3 elements of `tansim` and the
final 3 elements are the "treatment" triplicate datasets.

```{r runbandle, message=FALSE, warning=FALSE, error=FALSE, echo = TRUE, results = 'hide'}
control <- tansim$lopitrep[1:3] 
treatment <- tansim$lopitrep[4:6]

bandleres <- bandle(objectCond1 = control,
                    objectCond2 = treatment,
                    numIter = 20,  # usually 10,000
                    burnin = 5L,   # usually 5,000
                    thin = 1L,     # usually 20
                    gpParams = gpParams,
                    pcPrior = pc_prior,
                    numChains = 1,  # usually >=4
                    dirPrior = dirPrior)
```

The `bandle` function generates an object of class `bandleParams`. The `show`
method indicates the number of parallel chains that were run, this should
typically be greater than 4 (here we use 1 just as a demo). Normally, we
should also assess convergence but this omitted for the moment so that we 
can move forward with the analysis. Please see the end of the vignette for
convergence plots.

```{r,}
bandleres
```

# Analysing `bandle` output

Before we can begin to extract protein allocation information and a list of
proteins which are differentially localised between conditions, we first need to
populate the `bandleres` object by calling the `bandleProcess` function.

## Populating a `bandleres` object

Currently, the summary slots of the `bandleres` object are empty. The
`summaries` function accesses them.

```{r processbandle1}
summaries(bandleres)
```

These can be populated as follows
```{r processbandle2}
bandleres <- bandleProcess(bandleres)
```

These slots have now been populated
```{r processbandle3}
summary(summaries(bandleres))
```

### `bandle` results

We can save the results by calling `summaries`. We see that it is
of length 2. 1 for control and 1 for treatment.

```{r res}
res <- summaries(bandleres)
length(res)
```

There are a number of slots,

```{r seeslots}
str(res[[1]])
```

The main one of interest is the `posteriorEstimates` slot,

```{r postest}
posteriorEstimates(res[[1]])
```

This output object is a `data.frame` containing the protein allocations and
associated localisation probabilities (including the upper and lower quantiles
of the allocation probability distribution), the mean Shannon entropy and the
`bandle.differential.localisation` probability.

## Extracting posteriors and allocation results

We create two new objects `pe1` and `pe2` in the below code chunk which contain
the output of the `posteriorEstimates` slot. 

```{r getposteriors}
pe1 <- posteriorEstimates(res[[1]])
pe2 <- posteriorEstimates(res[[2]])
```

One quantity of interest is the protein allocations, which we can plot as a
barplot.

```{r barplotalloc, fig.width=8, fig.height=5}
alloc1 <- pe1$bandle.allocation
alloc2 <- pe2$bandle.allocation

par(mfrow = c(1, 2), oma = c(6,2,2,2))
barplot(table(alloc1), col = getStockcol()[2],
        las = 2, main = "Protein allocation: control")
barplot(table(alloc2), col = getStockcol()[2],
        las = 2, main = "Protein allocation: treatment")
```

The barplot tells us for this example that `bandle` has allocated the majority
of unlabelled proteins to the ER, followed by the Golgi (irrespective of the
posterior probability).

The associated posterior estimates are located in the `bandle.probability`
column.

```{r allocspost}
pe_alloc1 <- pe1$bandle.probability
pe_alloc2 <- pe1$bandle.probability
```

## Allocation probabilities

The full allocation probabilities are stored in the `tagm.joint` slot. These can
be visualised in a heatmap

```{r heatmap_control}
bjoint_control <- bandleJoint(summaries(bandleres)[[1]])
pheatmap(bjoint_control, cluster_cols = FALSE, color = viridis(n = 25))
```

```{r heatmap_treatment}
bjoint_treatment <- bandleJoint(summaries(bandleres)[[2]])
pheatmap(bjoint_treatment, cluster_cols = FALSE, color = viridis(n = 25))
```

## Predicting the subcellular location

We can append the results to our original `MSnSet` datasets using the 
`bandlePredict` function.

```{r bandpred}
xx <- bandlePredict(control, 
                    treatment, 
                    params = bandleres, 
                    fcol = "markers")
res_control <- xx[[1]]
res_treatment <- xx[[2]]
```

The output is a `list` of `MSnSets`. In this example,
we have 3 for the control and 3 for the treatment. 

```{r showlength}
length(res_control)
length(res_treatment)
```

The results are appended to the **first** `MSnSet` feature data slot 
for each condition.

```{r viewdata}
fvarLabels(res_control[[1]])
```

To access them use the `fData` function

```{r fdata, eval=FALSE}
fData(res_control[[1]])$bandle.probability
fData(res_control[[1]])$bandle.allocation
```

### Thresholding on protein allocations

It is common practice in supervised machine learning to set a specific threshold
on which to define new assignments/allocations, below which classifications are
left unassigned/unknown. Indeed, we do not expect the whole subcellular
diversity to be represented by the 11 niches defined here, we expect there to be
many more, many of which will be multiply localised within the cell. It is
important to allow for the possibility of proteins to reside in multiple
locations (this information is available in the `bandle.joint` slot - see above
for more details on multiple location).

As we are using a Bayesian model the outputs of the classifier are
probabilities. This not only allows us to look at the distribution of
probabilities over all subcellular classes but also allows us to extract a
probability threshold on which we can define new assignments.

The subcellular allocations are located in the `bandle.allocation` column of the
`fData` slot and the posteriors are located in the `bandle.probability` slot. We
can use the `getPredictions` function from the `pRoloc` package to return a set
of predicted localisations according to if they meet a probability threshold.

For example, in the below code chunk we set a 1% FDR for assigning proteins a
subcellular nice, below which we leave them unlabelled.

```{r setthreshold}
res_control[[1]] <- getPredictions(res_control[[1]], 
                                   fcol = "bandle.allocation",                   
                                   scol = "bandle.probability",                   
                                   mcol = "markers",                   
                                   t = .99)

res_treatment[[1]] <- getPredictions(res_treatment[[1]], 
                                   fcol = "bandle.allocation",                   
                                   scol = "bandle.probability",                   
                                   mcol = "markers",                   
                                   t = .99)
```

We may also wish to take into account the probability of the protein being an
outlier and thus use the results in the `bandle.outlier` column of the feature
data. We could calculate the product of the posterior and the outlier (as they
are both probabilities) to obtain a localisation score that takes into account
the outlier model. More details on this are found in the second vignette of this
package.

## Differential localisation probability

As previously mentioned the term "differentially localised" is used to pertain
to proteins which are assigned to different sub-cellular localisations between
two conditions. For the majority of users this is the main output they are keen
to extract using the BANDLE method.

Following on from the above example, after extracting posterior estimates for
each condition using the `summaries` function we can also access the
differential localisation probability as it is stored in the
`bandle.differential.localisation` column of the `data.frames` of `pe1` and
`pe2`, in the above sections.

The differential localisation probability tells us which proteins are most
likely to *differentially localise*. We can for example, examine how many
proteins get a differential probability greater than 0.99 to look for
the most confident differentially localised candidates.

```{r numtransloc}
diffloc_probs <- pe1$bandle.differential.localisation
head(diffloc_probs, 50)
length(which(diffloc_probs[order(diffloc_probs, decreasing = TRUE)] > 0.99))
```

We find there are `r length(which(diffloc_probs[order(diffloc_probs, decreasing = TRUE)] > 0.99))` 
proteins above this threshold.

This can also be seen on a rank plot

```{r extractdiffloc}
plot(diffloc_probs[order(diffloc_probs, decreasing = TRUE)],
     col = getStockcol()[3], pch = 19, ylab = "Probability",
     xlab = "Rank", main = "Differential localisation rank plot")
```

In-line with our expectations, the rank plot indicates that most proteins are
not differentially localised.

### Estimating uncertainty

#### Applying the `bootstrapdiffLocprob` function

We can examine the top `n` proteins (here we use an example of `top = 100`) and
produce bootstrap estimates of the uncertainty (note here the uncertainty is
likely to be underestimated as we did not produce many MCMC samples). These can
be visualised as ranked boxplots.

```{r diffloc_boot}
set.seed(1)
boot_t <- bootstrapdiffLocprob(params = bandleres, top = 100,
                               Bootsample = 5000, decreasing = TRUE)

boxplot(t(boot_t), col = getStockcol()[5],
        las = 2, ylab = "Probability", ylim = c(0, 1),
        main = "Differential localisation \nprobability plot (top 100 proteins)")
```


#### Applying the `binomDiffLoc` function

Instead of applying the `bootstrapdiffLocprob` we could use the `binomDiffLoc`
function to obtain credible intervals from the binomial distribution.


```{r diffloc_binom}
bin_t <- binomialDiffLocProb(params = bandleres, top = 100,
                             nsample = 5000, decreasing = TRUE)

boxplot(t(bin_t), col = getStockcol()[5],
        las = 2, ylab = "Probability", ylim = c(0, 1),
        main = "Differential localisation \nprobability plot (top 100 proteins)")
```

#### Obtaining probability estimates

There are many ways we could obtain probability estimates from either of the
above methods. We could, for example, take the mean of each protein estimate, or
compute the cumulative error (there is not really a false discovery rate in
Bayesian statistics) or we could threshold on the interval to reduce the number
of differential localisations if you feel the model has been overconfident.


```{r get_pe}
# more robust estimate of probabilities
dprobs <- rowMeans(bin_t)

# compute cumulative error, there is not really a false discovery rate in
# Bayesian statistics but you can look at the cumulative error rate
ce <- cumsum(1  - dprobs)

# you could threshold on the interval and this will reduce the number of
# differential localisations
qt <- apply(bin_t, 1, function(x) quantile(x, .025))
```


#### The expected false discovery rate

Instead of estimating the false discovery rate we can estimate the expected
false discovery rate from the posterior probabilities at a particular 
threshold. This mean that for fixed threshold, we compute the expected proportion
of false discoveries. Here is an example below. We can see that setting
a probability threshold of 0.9 leads to an expected false discovery rate of
less than $0.5\%$

```{r,}
EFDR(diffloc_probs, threshold = 0.90)
```

# Visualising differential localisation

We can visualise the changes in localisation between conditions on an alluvial
plot using the `plotTranslocations` function

```{r alluvial, warning=FALSE, message=FALSE, fig.height=8, fig.width=7}
plotTranslocations(bandleres)
```

Or alternatively, on a chord (circos) diagram

```{r chord, warning=FALSE, message=FALSE, fig.height=7, fig.width=7}
plotTranslocations(bandleres, type = "chord")
```

### Important consideration 
These visualisations are showing the change in class label between the two
conditions (as assigned by `bandle` i.e. the result stored in
`bandle.allocation`). The results are taken directly from `bandleres` and thus
no thresholding on the class label and posterior to allow for proteins to be
left "unassigned" or unknown, is conducted. Furthermore, there is no
thresholding on the `bandle.differential.localisation` probability.

It would be better to re-plot these figures with some thresholds on the above
quantities, to get a better representation of what is moving. The easiest way to
do this is to pass the `MSnSets` output after performing `bandlePredict` and
`getPredictions`.

For example, first let's identify which proteins get a high differential
localisation probability,

```{r plotafterthresh}
## identify which proteins get a high differential localisation probability
ind <- which(fData(res_control[[1]])$bandle.differential.localisation > 0.99)

## create two new MSnSets with only these proteins
res_dl_control <- res_control[[1]][ind, ]
res_dl_treatment <- res_treatment[[1]][ind, ]
```

Now we can plot only these `r length(which(diffloc_probs[order(diffloc_probs, decreasing = TRUE)] > 0.99))`
proteins which are deemed to move/differentially localise. We also specify where the prediction results are
located e.g. `fcol = "bandle.allocation.pred"`.

```{r plottlres, warning=FALSE, fig.height=8, fig.width=7}
## specify colour palette
mycols <- c(getStockcol()[seq(getMarkerClasses(res_control[[1]]))], "grey")
names(mycols) <- c(getMarkerClasses(res_control[[1]]), "unknown")

## Create a list of the datasets for plotTranslocations
res <- list(res_dl_control, res_dl_treatment)

plotTranslocations(res, fcol = "bandle.allocation.pred", col = mycols)
```

We can also use the function `plotTable` to display a summary table of the
number of proteins that have changed in location between conditions.

```{r summaryfinal, warning=FALSE, message=FALSE}
(tbl <- plotTable(res, fcol = "bandle.allocation.pred"))
```

# Assessing convergence

In this section, we demonstrate how to visually assess convergence of the MCMC
algorithm. In the chunk below, we use 4 chains so that we can assess the 
convergence of the method.

```{r, bandleagain, message=FALSE, warning=FALSE, error=FALSE, echo = TRUE, results = 'hide',fig.height=4, fig.width=4}
control <- tansim$lopitrep[1:3] 
treatment <- tansim$lopitrep[4:6]

bandleres <- bandle(objectCond1 = control,
                    objectCond2 = treatment,
                    numIter = 50, # usually 10,000
                    burnin = 10L, # usually 5,000
                    thin = 1L, # usually 20
                    gpParams = gpParams,
                    pcPrior = pc_prior,
                    numChains = 4,
                    dirPrior = dirPrior)

```
We then use the `plotConvergence` function which plots the ranks of the total
number of outliers within in each of the chains. If convergence has been reached
these plots should be uniform. As we can see these plots are not uniform and
are skewed towards the extremes, so they have not converged. Clearly one
of the chains has higher values on average than the other chains producing
the skew. Running the algorithm for more iterations (typically 10,000)
should produce convergence. This is not done here for brevity of the vignette.

```{r, convergence, fig.height=8}
par(mfrow = c(2, 2))
out <- plotConvergence(bandleres)
```


# Description of `bandle` parameters

The `bandle` function has a significant number of parameters to allow flexible 
and bespoke analysis. Here, we describe these parameters in more detail to 
allow user to make decisions on the level of flexibility they wish to exploit.

1. `objectCond1`. This is a list of `MSnSets` containing the first condition.

2. `objectCond2`. This is a list of `MSnSets` containing the second condition.
    a. These object should have the same observations and features. These will
    be checked during bandle analysis.
    
3. `fcol` indicates the feature column in the `MSnSets` that indicated the
markers. Proteins that are not markers should be labels `unknown`. The default
is `markers`.

4. `hyperLearn` is the algorithm used to learn the hyperparameters of the 
Gaussian processes. For speed the default is an optimization algorithm called
"LBFGS", however is users want to perform uncertainty quantification on these
parameters we can use Markov-chain Monte Carlo (MCMC) methods. This is implemented
using the Metropolis-Hastings algorithm. Though this latter methodology provides
more information, it is much more costly. The analysis is expected to take
several days rather than hours. 

5. `numIter` is the number of MCMC iterations for the algorithm. We typically
suggest around 10,000 iterations is plenty for convergence. Though some cases
may take longer. If resources are constrained, we suggest 4,000 iterations
as acceptable. A minimum number of iterations is around 1,000 though at this
level we expect the posterior estimates to suffer considerably. If possible
more parallel chains should be run in this case by changing `numChains` to, 
say, 9. The more chains and iterations the more computationally expensive
the algorithm. The time taken for the algorithm scales roughly linearly
in the number of iterations

6. `burnin` is the number of samples that should be discarded from the
beginning of the chain due to the bias induced by the starting point of the 
algorithm. We suggest sensible `burnin` values to be roughly $10-50\%$ of the
number of iterations

7. `thin` reduces auto-correlation in the MCMC samples. The default is $5$,
which means every 5th sample is taken. If memory requirements are an issue,
we suggest to increase the thinning amount. Though above $20$, you will see 
a decrease in performance.

8. `u` and `v` represent the prior hyperparameters of the proportion of outliers.
This is modelled using a `Beta(u,v)` with `u = 2` and `v = 10` a default. This
suggest that roughly $\frac{u}{u = V} = 16%$ of proteins are believed to be
outliers and that it is quite unlikely that more than $50%$ of proteins
are outliers. Users can examine the quantiles of the `Beta(u,v)` distribution
if they wish to place a more bespoke prior. For example, increasing `u`
will increase the number of a prior believed outliers.

9. `lambda` is a ridge parameter used for numerical stability and is set to 
$0.01$. If you experience the algorithm fails due to numerical issue then you
can set this value larger. If you require values above $1$ it is likely that
there are other issues with the analysis. We suggest checking the method
is appropriate for your problem and opening issue detailing the problems.

10. `gpParams` results from fitting Gaussian proccess (Gaussian random fields).
We refer the users to those functions. The default is `NULL` which will fit
GPs internally but we recommend setting these outside the bandle function
because it leads to more stable results.

11. `hyperIter` if the hyperparameters of the GP are learnt using MH algorithm
then this is the frequency at which these are updated relative to the bandle
algorithm. By default this is unused, but if `hyperLearn` is set to `MH` then
this proceed at every 20 iterations. 

12. `hyperMean` is the mean of the log normal prior used on the hyperparameters.
Though by default this is not used unless `PC` is set to false

13. `hyperSd` is the standard deviation of the log normal prior used on the 
hyperparameters. The default is `c(1,1,1)` for the 3 hyperparameters, increasing
these values increases the uncertainty in the prior values of the hyperparameters.

14. `seed` is the random-number seed.

15. `pg` indicates whether or not to use the Polya-Gamma (PG) prior. The default
is false and a Dirichlet prior is used instead. If set to true the `pg` is used.
In which case a default PG prior is used. This prior attempts to match the
default Dirichlet prior that is used when PG prior is set to false. The PG
prior is more computationally expensive but can provide prior information on 
correlations

16. `pgPrior` is by default NULL. We suggest using the `pg_prior` function
to help set this parameter and the documentation therein. This function
uses an empirical approach to compute a sensible default.

17. `tau` is a parameter used by the Polya-Gamma prior and we refer to 
BANDLE manuscript for details. By default it is only used if `pg` prior is true,
when the default becomes `0.2`. At this value the `pg` prior is similar to the
Dirichlet prior but with information on correlations.

18. `dirPrior` is the Dirichlet matrix prior on the correlations. This should
be provided as a K by K matrix, where K is the number of subcellular niches. 
The diagonal component should represent the prior belief that organelles do
not re-localise (same compartment), where as the off-diagonal terms represent
the prior terms of re-localisation. The `prior_pred_dir` can be used to provide
a prior predictive check based on the provided prior. It is recommended
that the off-diagonal terms are at least two orders of magnitude smaller than the
diagonal terms. An example is given in the vignette.

19. `maternCov` is this true the covariance function is the matern covariance,
otherwise a Gaussian covariance is used.

20. `PC` indicates whether a penalised complexity (PC) is used. The default 
is true and otherwise log normal priors are used. 

21. `pcPrior` is a numeric of length 3 indicating the parameters of the PC prior.
The prior is placed on the parameters of length-scale, amplitude, and variance 
in that order. The default values are $0.5,3,100$, and increasing the value 
increases the shrinkage towards straight-lines with zero variance.

22. `nu` which defaults to 2 is the smoothness of the matern covariance. By 
increasing `nu` you encourage smoother solutions. `nu` should be an integer,
though for values of `nu` above 3, we have observed numerical instability.

23.  `propSd` is the standard deviation of the random-walk update used in the MH
algorithm. We do not recommend changing this unless you are familiar with
Bayesian analysis. The default is `c(0.3,0.1,0.05)` for the 3 hyperparameters.
Changing these will alter the efficiency of the underlying samplers.

24. `numChains` is the number of parrallel chains and defaults to 4. We recommend
using as much processing resources as you have and frequently have used 9 in
practise.

25. `BPPARAM` is the BiocParallel back-end which defaults to 
`BiocParallel::bpparam()`. We refer you to the `r Biocpkg("BiocParallel")` 
package for details on setting this dependent on your computing system.

# Session information

All software and respective versions used to produce this document are listed below.
```{r sessionInfo}
sessionInfo()
```

# References {-}