\documentclass[10pt]{article} %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{lfa Package} \usepackage{tikz} \usepackage{float} \usepackage{gensymb} \usepackage{times} \usepackage{graphicx} \usepackage{url} \usepackage{fullpage} \usepackage{float} \usepackage{amsmath} \usepackage{amsthm} \usepackage{bm} \usepackage{bbm} \usepackage{enumerate} \usepackage[font=small,format=plain,labelfont=bf,up,textfont=up]{caption} \usepackage{hyperref} \usepackage{subfig} \title{Logistic Factor Analysis Vignette} \author{Wei Hao, Minsun Song, John D. Storey} \date{\today} \begin{document} \maketitle \section{Introduction} Logistic Factor Analysis (LFA)~\cite{Hao2013}. Briefly, LFA fits a latent variable model on categorical (i.e. SNP genotypes coded as 0, 1, and 2) data by modelling the logit transformed binomial parameters in terms of latent variables. The resulting ``logistic factors'' are analagous to principal components, but fit into a convenient likelihood based model. As a result, the logistic factors can power a number of other analyses. \section{Sample usage} We include a sample real dataset with the package as the variable \texttt{hgdp\_subset}---a small subset of the HGDP genotypes. The row names are the rsids for the SNPs and the column names are coarse geographical labels for the individuals. <>= library(lfa) dim(hgdp_subset) @ \subsection{\texttt{lfa}} The \texttt{lfa} function has two required arguments. The first is the genotype matrix, and the second is the number of logistic factors including the intercept. <>= LF = lfa(hgdp_subset, 4) dim(LF) head(LF) @ We can plot the first two logistic factors and color by geographical information: <>= dat = data.frame(LF[,1], LF[,2], colnames(hgdp_subset)) colnames(dat) = c("LF1", "LF2", "geo") library(ggplot2) ggplot(dat, aes(LF1, LF2, color=geo)) + geom_point() + theme_bw() + coord_fixed(ratio=(max(dat[,1])-min(dat[,1]))/(max(dat[,2])-min(dat[,2]))) @ One aspect of \texttt{lfa} is that the return value is a matrix of logistic factors, thus, an important part of subsequent analysis is to keep your matrix of logistic factors to pass as an argument. \subsection{\texttt{af}} Given a genotype matrix and logistic factors, the \texttt{af} function computes the individual-specific allele frequencies <>= allele_freqs = af(hgdp_subset, LF) allele_freqs[1:5, 1:5] @ Since the calculation is independent at each locus, you can pass a subset of the genotype matrix as an argument if you aren't interested in all the SNPs. <>= subset = af(hgdp_subset[15:25,], LF) subset[1:5,1:5] @ Given the allele frequencies, you can do some other interesting calculations---for example, compute the log-likelihood for each SNP. <>= ll = function(snp, af){ -sum(snp*log(af) + (2-snp)*log(1-af)) } log_lik = sapply(1:nrow(hgdp_subset), function(i) {ll(hgdp_subset[i,], allele_freqs[i,])}) which(max(log_lik) == log_lik) @ \section{Data Input} The best way to load genotypes is by using the function \texttt{read.bed}, which assumes that you have binary PLINK formatted genotypes. The binary PLINK format uses files: a \texttt{.bed} for the genotypes, a \texttt{.bim} for the genotype information, and a \texttt{.fam} for the individuals information. \texttt{read.bed} takes as an argument the prefix for your three files. \bibliographystyle{plain} \bibliography{lfa} \end{document}