%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{vignette-hierinf.Rnw}
%\VignetteEncoding{UTF-8}

\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{graphicx}
\usepackage{latexsym,amsmath,amssymb,amsthm,epic,eepic,multirow}
\usepackage{natbib}

%\usepackage{graphicx}
\usepackage{multirow}
\usepackage{subfigure}
% \usepackage{amsfonts}
%\usepackage{jmlr2e}
\usepackage{natbib}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{mdframed}
% \usepackage{sfsbib}
\usepackage{tikz}
\usepackage{hyperref}

\newcommand{\PP}{\mathbb{P}}
\newcommand{\EE}{\mathbb{E}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\Nat}{\mathbb{N}}
\newcommand{\eps}{\varepsilon}
\newcommand{\parcor}{\mbox{Parcor}}
\newcommand{\Cov}{\mbox{Cov}}
\newcommand{\Cor}{\mbox{Cor}}
\newcommand{\Var}{\mbox{Var}}
\newcommand{\peff}{\mathrm{peff}}
\newcommand{\by}{\mathbf{Y}}
\newcommand{\bx}{\mathbf{X}}
\newcommand{\argmin}{\mathrm{argmin}}
\newcommand{\pa}{\mathrm{pa}}

%\theoremstyle{plain}
\newtheorem{theo}{Theorem}
\newtheorem{prop}{Proposition}
\newtheorem{lemm}{Lemma}
\newtheorem{corr}{Corollary}

%\theoremstyle{definition}
\newtheorem{defi}{Definition}
\newtheorem{examp}{Example}

\begin{document}

\title{Hierarchical inference for genome-wide association studies}

\author{Claude Renaux, Laura Buzdugan, Markus Kalisch and Peter
  B\"uhlmann\\Seminar for Statistics, ETH Z\"urich}
%\date{}

\maketitle

\noindent \textbf{This vignette is based on the pre-print \cite{renaux18} and
contains the part about the illustration of our \textsf{R}-package
\texttt{hierinf}.}

\section{Cite \texttt{hierinf}}
If you use the \texttt{hierinf} package, please cite the paper Renaux, C.,
Buzdugan, L., Kalisch, M., and B\"uhlmann, P. (2018). Hierarchical inference
for genome-wide association studies: a view on methodology with software.
\textit{arXiv preprint arXiv:1805.02988}.

\section{Introduction}
Hierarchical inference is a key technique for computationally and statistically
efficient hypothesis testing and multiple testing adjustment. We consider
inference in a multivariate model which quantifies effects after adjusting for
all remaining single nucleotide polymorphism (SNP) covariates. The hierarchy
enables in a fully data-driven way to infer significant groups or
regions of SNPs at an adaptive resolution, by controlling the familywise
error rate (FWER). We have recently proposed high-dimensional hierarchical
inference for assigning statistical significance in terms of p-values for
groups of SNPs being associated to a response variable: \cite{buzduganetal16}
considers this approach for human GWAS and \cite{klasenetal16} for GWAS
with plants. The methodological and theoretical concepts have been worked
out in \cite{manpb16} and \cite{manpb16b}.

The \textsf{R} package \texttt{hierinf} is an implementation of the
hierarchical inference described in \cite{renaux18} and
it is easy to use for GWAS.
The package is a re-implementation of the \textsf{R} package \texttt{hierGWAS}
\citep{hierGWASpackage160} and includes new features like straightforward
parallelization, an additionally option for constructing a hierarchical
tree based on spatially contiguous genomic positions, and the possibility
of jointly analyzing multiple datasets.

\section{Software}\label{sec.hierinf}
To summarize the method, one starts by clustering the data
hierarchically. This means that the clusters can be represented by a tree.
The main idea is to pursue testing top-down and successively moving
downwards until the null-hypotheses cannot be rejected. The p-value of a
given cluster is calculated based on the multiple sample splitting approach
and aggregation of those p-values as described in \cite{renaux18}.

The work flow is straightforward and is composed in two function calls.
We note that the package \texttt{hierinf} requires complete observations,
i.e. no missing values in the data, because the testing procedure is based
on all the SNPs which is in contrast to marginal tests. If missing
values are present, they can be imputed prior to the analysis. This can
be done in \textsf{R} using e.g. \texttt{mice} \citep{vanbuuren11},
\texttt{mi} \citep{shi11}, or \texttt{missForest} \citep{stekhoven11}.

A small simulated toy example with two chromosomes is used to demonstrate
the procedure. The toy example is taken from \citep{hierGWASpackage160} and
was generated using \textsf{PLINK} where the SNPs were binned into different
allele frequency ranges. The response is binary with 250 controls
and 250 cases.
Thus, there are $n = 500$ samples, the number of SNPs is $p = 1000$, and
there are two additional control variables with column names ``age'' and
``sex''. The first 990 SNPs have no association
with the response and the last 10 SNPs were simulated to have a population
odds ratio of 2.
The functions of the package \texttt{hierinf} require
the input of the SNP data to be a \texttt{matrix} (or a list of matrices
for multiple datasets). We use a \texttt{matrix} instead of a
\texttt{data.frame} since this makes computation faster.

<<>>=
# load the package
library(hierinf)

# random number generator (for parallel computing)
RNGkind("L'Ecuyer-CMRG")

# We use a small build-in dataset for our toy example.
data(simGWAS)

# The genotype, phenotype and the control variables are saved in
# different objects.
sim.geno  <- simGWAS$x
sim.pheno <- simGWAS$y
sim.clvar <- simGWAS$clvar
@

The two following sections correspond to the two function calls in order
to perform hierarchical testing. The third section gives some notes
about running the code in parallel.

\subsection{Software for clustering} \label{subsec.clustering}
The package \texttt{hierinf} offers two possibilities to build a
hierarchical tree for corresponding hierarchical testing. The function
\texttt{cluster\char`_var} performs hierarchical clustering based on
some dissimilarity matrix and is described first. The function
\texttt{cluster\char`_position} builds a tree based on recursive binary
partitioning of consecutive positions of the SNPs. For a short
description, see at the end of Section 2.3 in \cite{renaux18}.

Hierarchical clustering is computationally expensive and prohibitive for
large datasets. Thus, it makes sense to pre-define dis-joint sets of SNPs
which can be clustered separately. One would typically assume that the
second level of a cluster tree structure corresponds to the blocks given
by the chromosomes as illustrated in Figure \ref{fig3}.
For the method based on binary partitioning of consecutive positions
of SNPs, we recommend to pre-define the second level of the hierarchical tree as well.
This allows to run the building of the hierarchical tree and the hierarchical
testing for each block or in our case for each chromosome in parallel, which
can be achieved by adding the two commented arguments in the function calls below.
If one does not want to specify the second level of the tree, then the
argument \texttt{block} in both function calls can be omitted.

\begin{figure}[!htb]
\begin{center}
\begin{tikzpicture}[level distance=1.6cm,
  level 1/.style={sibling distance=2.4cm},
  level 2/.style={sibling distance=1.75cm}]
\node (z) {entire data}
  child {node (a) {block 1}
    child {node (aa) {$\vdots$}}
    child {node (ab) {$\vdots$}}
  }
  child {node (b) {block 2}
    child {node (ba) {$\vdots$}}
    child {node (bb) {$\vdots$}}
  }
  child [ missing ]
  child {node (c) {block $k$}
    child {node (ca) {$\vdots$}}
    child {node (cb) {$\vdots$}}
  };
\path (b) -- (c) node (x) [midway] {$\cdots$};
\end{tikzpicture}
\caption{The top two levels of a hierarchical tree used
to perform multiple testing. The user can optionally specify the second
level of the tree with the advantage that one can easily run the code in
parallel over different clusters in the second level, denoted by block 1,
$\ldots$, block $k$.
A natural choice is to choose the chromosomes as the second level
of the hierarchical tree, which define a partition of the SNPs.
If the second level is not specified, then the first split is
estimated based on clustering the data, i.e. it is a binary split.
The user can define the second level of the tree structure using
the argument \texttt{block} in the functions \texttt{cluster\char`_var} /
\texttt{cluster\char`_position}. The function
\texttt{cluster\char`_var} / \texttt{cluster\char`_position} builds a separate
binary hierarchical tree for each of the blocks.
}\label{fig3}
\end{center}
\end{figure}

In the toy example, we define the second level of the tree structure as follows.
The first and second 500 SNPs of the SNP data \texttt{sim.geno} correspond
to chromosome 1 and chromosome 2, respectively.
The object \texttt{block} is a \texttt{data.frame} which contains two
columns identifying the two blocks.
The blocks are defined in the second column and the corresponding column
names of the SNPs are stored in the first column. The argument
\texttt{stringsAsFactors} of the function \texttt{data.frame} is set to
\texttt{FALSE} because we want both columns to contain integers or strings.

<<>>=
# Define the second level of the tree structure.
block <- data.frame("colname" = paste0("SNP.", 1:1000),
                    "block" = rep(c("chrom 1", "chrom 2"), each = 500),
                    stringsAsFactors = FALSE)

# Cluster the SNPs
dendr <- cluster_var(x = sim.geno,
                     block = block)
                     # # the following arguments have to be specified
                     # # for parallel computation
                     # parallel = "multicore",
                     # ncpus = 2)
@

By default, the function \texttt{cluster\char`_var} uses the agglomeration
method average linkage and the
dissimilarity matrix given by $1 - (\mbox{empirical correlation})^2$.

Alternatively, \texttt{cluster\char`_position} builds a hierarchical tree
using recursive binary partitioning of consecutive genomic positions of the
SNPs.
As for \texttt{cluster\char`_var}, the function
can be run in parallel if the argument block defines the second level
of the hierarchical tree and the two commented arguments parallel and
ncpus are added.

<<eval=FALSE>>=
# Store the positions of the SNPs.
position <- data.frame("colnames" = paste0("SNP.", 1:1000),
                       "position" = seq(from = 1, to = 1000),
                       stringsAsFactors = FALSE)


# Build the hierarchical tree based on the position
# The argument block defines the second level of the tree structure.
dendr.pos <- cluster_position(position = position,
                              block = block)
                              # # the following arguments have to be
                              # # specified for parallel computation
                              # parallel = "multicore",
                              # ncpus = 2)
@

\subsection{Software for hierarchical testing} \label{subsec.hiertesting}

The function \texttt{test\char`_hierarchy} is executed after the function
\texttt{cluster\char`_var} or \texttt{cluster\char`_position} since it requires
the output of one of those two functions as an input (argument \texttt{dendr}).

The function \texttt{test\char`_hierarchy} first randomly splits the data
into two halves (with respect to the observations), by default \texttt{B = 50}
times, and performs variable screening on the second half. Then, the function
\texttt{test\char`_hierarchy} uses those splits and corresponding selected
variables to perform the hierarchical testing according to the tree defined
by the output of one of the two functions \texttt{cluster\char`_var} or
\texttt{cluster\char`_position}.

As mentioned in Section \ref{subsec.clustering}, we can exploit the
proposed hierarchical structure which assumes the chromosomes to form the
second level of the tree structure as illustrated in Figure
\ref{fig3}. This allows to run the testing in parallel for each
block, which are the chromosomes in the toy example.

The following function call performs first the global null-hypothesis test
for the group containing all the variables/SNPs and continues
testing in the hierarchy of the two chromosomes and their children.

<<>>=
# Test the hierarchy using multi sample split
set.seed(1234)
result <- test_hierarchy(x = sim.geno,
                         y = sim.pheno,
                         clvar = sim.clvar,
                         # alternatively: dendr = dendr.pos
                         dendr = dendr,
                         family = "binomial")
                         # # the following arguments have to be
                         # # specified for parallel computation
                         # parallel = "multicore",
                         # ncpus = 2)
@

The function \texttt{test\char`_hierarchy} allows to fit models with
continuous or binary response, the latter being based on logistic
regression. The argument \texttt{family} is set to \texttt{"binomial"}
because the response variable in the toy example is binary.

The output looks as follows:
<<>>=
print(result, n.terms = 4)
@

The output shows significant groups of SNPs or even single SNPs if there is
sufficient strong signal in the data. The block names, the p-values, and
the column names (of the SNP data) of the significant clusters are
returned. There is no significant cluster in chromosome 1.
That's the reason why the p-value and the column names of the significant
cluster are \texttt{NA} in the first row of the output.
Note that the large significant cluster in the second row of the output is
shortened to better fit on screen. In our toy example, the last 8 column names
are replaced by ``\texttt{... [8]}''. The maximum number of terms can be
changed by the argument \texttt{n.terms} of the \texttt{print} function.
One can evaluate the object \texttt{result} in the console and the default
values of the \texttt{print} function are used. In this case, it would only
display the first 5 terms.

The only difference in the \textsf{R} code
when using a hierarchical tree based on binary recursive partitioning of the
genomic positions of the SNPs (whose output is denoted as \texttt{dendr.pos})
is to specify the corresponding hierarchy:
\texttt{test\char`_hierarchy(..., dendr = dendr.pos, ...).}

We can access part of the output
by \texttt{result\$res.hierarchy} which we use below to calculate the
$\mbox{R}^2$ value of the second row of the output, i.e.
\texttt{result\$res.hierarchy[[2, "significant.cluster"]]}. Note that we
need the double square brackets to access the column names stored in the column
\texttt{significant.cluster} of the output since the last column is a list
where each element contains a character vector of the column names. The two
other columns containing the block names and the p-values can both be indexed
using single square brackets as for any \texttt{data.frame}, e.g.
\texttt{result\$res.hierarchy[2, "p.value"]}.

<<>>=
(coln.cluster <- result$res.hierarchy[[2, "significant.cluster"]])
@

The function \texttt{compute\char`_r2} calculates the adjusted $\mbox{R}^2$ value
or coefficient of determination of a cluster for a continuous response. The
Nagelkerke's $\mbox{R}^2$ \citep{nagelkerke91} is calculated for a binary
response as e.g. in our toy example.

<<>>=
compute_r2(x = sim.geno, y = sim.pheno, clvar = sim.clvar,
           res.test.hierarchy = result, family = "binomial",
           colnames.cluster = coln.cluster)
@
The function \texttt{compute\char`_r2} is based on multi-sample splitting.
The $\mbox{R}^2$ value is calculated per split based on the second half of
observations and based on the intersection of the selected variables and
the user-specified cluster. Then, the $\mbox{R}^2$ values are
averaged over the different splits.
If one does not specify the argument \texttt{colnames.cluster}, then the
$\mbox{R}^2$ value of the whole dataset is calculated.

\subsection{Software for parallel computing} \label{subsec.parallel}

The function calls of \texttt{cluster\char`_var}, \texttt{cluster\char`_position},
and \texttt{test\char`_hierarchy} above are evaluated in parallel since we set
the arguments \texttt{parallel = "multicore"} and \texttt{ncpus = 2}.
The argument \texttt{parallel} can be set to \texttt{"no"} for serial
evaluation (default value), to \texttt{"multicore"} for parallel evaluation
using forking, or to \texttt{"snow"} for parallel evaluation using a parallel
socket cluster (PSOCKET); see below for more details. The argument
\texttt{ncpus} corresponds to the number of cores to be used for parallel
computing. We use the \texttt{parallel} package for our implementation
which is already included in the base \textsf{R} installation
\citep{rcoreteam17}.

The user has to select the ``L'Ecuyer-CMRG'' pseudo-random number
generator and set a seed such that the parallel computing of \texttt{hierinf}
is reproducible. This pseudo-random number generator can be selected by
\texttt{RNGkind("L'Ecuyer-CMRG")} and has to be executed once for every new
\textsf{R} session; see \textsf{R} code at the beginning of Section
\ref{sec.hierinf}. This allows us to create multiple streams of pseudo-random
numbers, one for each processor / computing node, using the \texttt{parallel}
package; for more details see the vignette of the \texttt{parallel} package
published by \citet{rcoreteam17}.

We recommend to set the argument \texttt{parallel = "multicore"} which will
work on Unix/Mac (but not Windows) operation systems. The function is
then evaluated in parallel using forking which is leaner on the memory usage.
This is a neat feature for GWAS since e.g. a large SNP dataset does not
have to be copied to the new environment of each of the processors. Note
that this is only possible on a multicore machine and not on a cluster.

On all operation systems, it is possible to create a parallel socket
cluster (PSOCKET) which corresponds to setting the argument
\texttt{parallel = "snow"}. This means that the computing nodes or
processors do not share the memory, i.e. an \textsf{R} session with an
empty environment is initialized for each of the computing nodes or processors.

How many processors should one use? If the user specifies the second level
of the tree, i.e. defines the \texttt{block} argument of the functions
\texttt{cluster\char`_var} / \texttt{cluster\char`_position} and
\texttt{test\char`_hierarchy}, then the building of the hierarchical tree and
the hierarchical testing can be easily performed in parallel across the different
blocks. Note that the package can make use of as many processors as there are
blocks, say, 22 chromosomes. In addition, the multi sample splitting
and screening step, which is performed inside the function
\texttt{test\char`_hierarchy}, can always be executed in parallel regardless
if we defined blocks or not. It can make use of at most $B$ processors where
$B$ is the number of sample splits.


\section{Meta-analysis for several datasets} \label{sec.meta}

The naive (and conceptually wrong) approach would be to pool the different
datasets and proceed as if it would be one homogeneous dataset, say, allowing
for a different intercept per dataset. We advocate
meta analysis and aggregating corresponding p-values; see
\cite[Sec. 4]{renaux18} for more details.

\paragraph{Fast computational methods for pooled GWAS.} There has been a
considerable interest for fast algorithms for GWAS with very large sample
size in the order of $10^5$; see \citet{lippert11, zhousteph14}.
Often though, such large sample size comes from pooling different studies
or sub-populations. We argue in favor of meta analysis and aggregating
corresponding p-values. Besides more statistical robustness against
heterogeneity (arising from the different sub-populations), meta-analysis
is also computationally very attractive: the computations can be trivially
implemented in parallel for every sub-population and the p-value aggregation
step comes essentially without any computational cost.

\subsection{Software for aggregating p-values of multiple studies} \label{subsec.multiplestudies}

It is very convenient to combine the information of multiple studies by
aggregating p-values as described in \cite[Sec. 4]{renaux18}.
The package \texttt{hierinf} offers two methods for jointly estimating
a single hierarchical tree for all datasets using either of the
functions
\texttt{cluster\char`_var} or \texttt{cluster\char`_position}; compare with
Section \ref{subsec.clustering}. Testing is performed by the function
\texttt{test\char`_hierarchy} in a top-down manner given by the joint
hierarchical tree. For a given cluster, p-values are calculated based on the
intersection of the cluster and each dataset (corresponding to a study) and
those p-values are then aggregated to obtain one p-value per cluster using
either Tippett's rule  or Stouffer's method as in \cite[Sec. 4]{renaux18};
see argument \texttt{agg.method} of the function
\texttt{test\char`_hierarchy}. The difference and issues of the two methods for
estimating a joint hierarchical tree are described in the following two
paragraphs.


The function \texttt{cluster\char`_var} estimates a hierarchical tree based on
clustering the SNPs from all the studies. Problems arise if the
studies do not measure the same SNPs and thus,
some of the entries of the dissimilarity matrix cannot be calculated.
By default, pairwise complete observations for each pair of SNPs are
taken to construct the dissimilarity matrix.
The issue affects the building of the hierarchical tree but the testing of
a given cluster remains as described before.

The function \texttt{cluster\char`_position} estimates a hierarchical tree based
on the genomic positions of the SNPs from all the studies.
The problems mentioned above do not show up here since SNPs, may be
different ones for various datasets, can still be uniquely assigned to
genomic regions.

The only differences in the function calls are that the arguments \texttt{x},
\texttt{y}, and \texttt{clvar} are now each a list of matrices instead of
just a single matrix.  Note that the order of the list elements of the arguments
\texttt{x}, \texttt{y}, and \texttt{clvar} matter, i.e. the user has to
stick to the order that the first element of the three lists corresponds to
the first dataset, the second element to the second datasets, and so on.
One would replace the corresponding element of the list containing the
control covariates (argument \texttt{clvar}) by \texttt{NULL} if some dataset
has no control covariates.
If none of the datasets have control covariates, then one can simply omit the
argument. Note that the argument \texttt{block} defines the second level of the
tree which is assumed to be the same for all datasets or studies.
The argument \texttt{block} has to be a \texttt{data.frame} which contains
all the column names (of all the  datasets or studies) and their assignment
to the blocks. The aggregation method can be chosen using the
argument \texttt{agg.method} of the function \texttt{test\char`_hierarchy},
i.e. it can be set to either \texttt{"Tippett"} or \texttt{"Stouffer"}.
The default aggregation method is Tippett's rule.

The example below demonstrates the functions \texttt{cluster\char`_var} and
\texttt{test\char`_hierarchy} for two datasets / studies measuring the
same SNPs.

<<>>=
# The datasets need to be stored in different elements of a list.
# Note that the order has to be the same for all the lists.
# As a simple example, we artificially split the observations of the
# toy dataset in two parts, i.e. two datasets.
set.seed(89)
ind1 <- sample(1:500, 250)
ind2 <- setdiff(1:500, ind1)
sim.geno.2dat  <- list(sim.geno[ind1, ],
                       sim.geno[ind2, ])
sim.clvar.2dat <- list(sim.clvar[ind1, ],
                       sim.clvar[ind2, ])
sim.pheno.2dat <- list(sim.pheno[ind1],
                       sim.pheno[ind2])

# Cluster the SNPs
dendr <- cluster_var(x = sim.geno.2dat,
                     block = block)
                     # # the following arguments have to be specified
                     # # for parallel computation
                     # parallel = "multicore",
                     # ncpus = 2)

# Test the hierarchy using multi sample split
set.seed(1234)
result <- test_hierarchy(x = sim.geno.2dat,
                         y = sim.pheno.2dat,
                         clvar = sim.clvar.2dat,
                         dendr = dendr,
                         family = "binomial")
                         # # the following arguments have to be
                         # # specified for parallel computation
                         # parallel = "multicore",
                         # ncpus = 2)
@
The above \textsf{R} code can be evaluated in parallel if one adds the
two commented arguments parallel and ncpus; compare with Section
\ref{subsec.parallel} for more details about the software for parallel
computing.

The output shows three significant groups of SNPs and one single SNP.

<<>>=
print(result, n.terms = 4)
@
The significance of a cluster is based on the information of both datasets.
For a given cluster, the p-values of each dataset were aggregated using
Tippett's rule as in \cite[Sec. 4]{renaux18} or \cite{tippett31}.
Those aggregated p-values are
displayed in the output above. We cannot judge which dataset (or both or
combined) inherits a strong signal such that a cluster is shown significant
but that is not the goal. The goal is to combine the information of multiple
studies.

The crucial point is that the testing procedure goes top-down through a single
jointly estimated tree for all the studies and only continues if at least one
child is significant (based on the aggregated p-values of the multiple datasets)
of a given cluster. The algorithm determines where to stop and naturally
we get one output for all the studies. A possible single jointly estimated tree
of the above \textsf{R} code is illustrated in Figure \ref{fig7}. In our
example, both datasets measure the same SNPs. If that would not be the case,
then intersection of the cluster and each dataset is taken before calculating
a p-value per dataset / study and then aggregating those.


\begin{figure}[!htb]
\begin{center}
\begin{tikzpicture}[level distance=1.8cm,
  level 1/.style={sibling distance=5.5cm},
  level 2/.style={sibling distance=3cm},
  level 3/.style={sibling distance=1.5cm}]
\node (z) {entire data}
  child {node (a) {chrom 1}
    child {node (aa) [text width=1.5cm]{SNP.22 SNP.41 $\ldots$}
        child {node (aaa) {$\vdots$}}
        child {node (aab) {$\vdots$}}}
    child {node (ab) [text width=1.5cm]{SNP.1 SNP.3 $\ldots$}
        child {node (aaa) {$\vdots$}}
        child {node (aab) {$\vdots$}}}
  }
  child {node (b) {chrom 2}
    child {node (ba) [text width=1.5cm]{SNP.544 SNP.513 $\ldots$}
            child {node (aaa) {$\vdots$}}
            child {node (aab) {$\vdots$}}}
    child {node (bb) [text width=1.5cm]{SNP.647 SNP.648 $\ldots$}
            child {node (aaa) {$\vdots$}}
            child {node (aab) {$\vdots$}}}
  };
\end{tikzpicture}
\caption{Illustration of a possible single jointly estimated tree for
multiple studies based on clustering the SNPs. The second level of the
hierarchical tree is defined by chromosome 1 and 2 (defined by the argument
\texttt{block} of the functions \texttt{cluster\char`_var} /
\texttt{cluster\char`_position}). The function \texttt{cluster\char`_var} /
\texttt{cluster\char`_position} builds a separate
hierarchical tree for each of the chromosomes.}\label{fig7}
\end{center}
\end{figure}


\bibliographystyle{apalike}
\bibliography{references}

\end{document}