\documentclass[10pt]{article}

%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{lfa Package}

\usepackage{tikz}
\usepackage{float}
\usepackage{gensymb}
\usepackage{times}
\usepackage{graphicx}
\usepackage{url}
\usepackage{fullpage}
\usepackage{float}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{bm}
\usepackage{bbm}
\usepackage{enumerate}
\usepackage[font=small,format=plain,labelfont=bf,up,textfont=up]{caption}
\usepackage{hyperref}
\usepackage{subfig}

\title{Logistic Factor Analysis Vignette}
\author{Wei Hao, Minsun Song, John D. Storey}
\date{\today}

\begin{document}
\maketitle

\section{Introduction}

Logistic Factor Analysis (LFA)~\cite{Hao2013}. Briefly, LFA fits a latent
variable model on categorical (i.e. SNP genotypes coded as 0, 1, and 2) data by
modelling the logit transformed binomial parameters in terms of latent
variables. The resulting ``logistic factors'' are analagous to principal
components, but fit into a convenient likelihood based model. As a result, the
logistic factors can power a number of other analyses.

\section{Sample usage}

We include a sample real dataset with the package as the variable 
\texttt{hgdp\_subset}---a small subset of the HGDP genotypes. The row 
names are the rsids for the SNPs and the column names are coarse 
geographical labels for the individuals.

<<preamble>>=
library(lfa)
dim(hgdp_subset)
@

\subsection{\texttt{lfa}}

The \texttt{lfa} function has two required arguments. The first is the 
genotype matrix, and the second is the number of logistic factors 
including the intercept.

<<lfa>>=
LF = lfa(hgdp_subset, 4)
dim(LF)
head(LF)
@

We can plot the first two logistic factors and color by geographical 
information:

<<lfa_plot, fig.width=5, fig.height=4.5>>=
dat = data.frame(LF[,1], LF[,2], colnames(hgdp_subset))
colnames(dat) = c("LF1", "LF2", "geo")
library(ggplot2)
ggplot(dat, aes(LF1, LF2, color=geo)) + geom_point() + theme_bw() +
  coord_fixed(ratio=(max(dat[,1])-min(dat[,1]))/(max(dat[,2])-min(dat[,2])))
@

One aspect of \texttt{lfa} is that the return value is a matrix of 
logistic factors, thus, an important part of subsequent analysis is to 
keep your matrix of logistic factors to pass as an argument.

\subsection{\texttt{af}}

Given a genotype matrix and logistic factors, the \texttt{af} function 
computes the individual-specific allele frequencies

<<af1>>=
allele_freqs = af(hgdp_subset, LF)
allele_freqs[1:5, 1:5]
@

Since the calculation is independent at each locus, you can pass a 
subset of the genotype matrix as an argument if you aren't interested 
in all the SNPs. 

<<af2>>=
subset = af(hgdp_subset[15:25,], LF)
subset[1:5,1:5]
@

Given the allele frequencies, you can do some other interesting 
calculations---for example, compute the log-likelihood for each SNP.

<<af3>>=
ll = function(snp, af){
    -sum(snp*log(af) + (2-snp)*log(1-af))
}
log_lik = sapply(1:nrow(hgdp_subset), function(i) {ll(hgdp_subset[i,], allele_freqs[i,])})
which(max(log_lik) == log_lik)
@

\section{Data Input}

The best way to load genotypes is by using the function \texttt{read.bed},
which assumes that you have binary PLINK formatted genotypes. The binary PLINK
format uses files: a \texttt{.bed} for the genotypes, a \texttt{.bim} for the
genotype information, and a \texttt{.fam} for the individuals information.
\texttt{read.bed} takes as an argument the prefix for your three files.

\bibliographystyle{plain}
\bibliography{lfa}

\end{document}