A Likelihood-Free Estimator of Population Structure Bridging Admixture Models and Principal Components Analysis

Characterizing genetic variation in humans is an important task in statistical genetics, enabling disease-gene mapping in genome-wide association studies (GWAS) and informing studies of human evolutionary history. A common approach to quantifying genetic variation...

U NDERSTANDING structured genetic variation in human populations remains a foundational problem in modern genetics. Such an understanding allows researchers to correct for population structure in GWAS studies, enabling accurate disease-gene mapping (Knowler et al. 1988;Marchini et al. 2004;Song et al. 2015). Additionally, characterizing genetic variation is important for the study of human evolutionary history (Cavalli-Sforza et al. 1988;Esteban et al. 1998;Li et al. 2008).
To this end, much work has been done to develop methods to estimate what Alexander et al. (2009) term global ancestry.
In the global ancestry framework, the goal is to simultaneously estimate two quantities: i. the allele frequencies of ancestral populations. ii. the admixture proportions of each modern individual.
Many popular global ancestry estimation methods have been developed within a probabilistic framework. In these methods, which we will refer to as likelihood-based approaches, the strategy is to fit a probabilistic model to the observed genome-wide genotype data by either maximizing the likelihood function (Tang et al. 2005;Alexander et al. 2009) or the posterior probability (Pritchard et al. 2000;Raj et al. 2014;Gopalan et al. 2016). The probabilistic model fit in each of these cases is the admixture model, described in detail in the Model and Theory section below, in which the global ancestry quantities (i) and (ii) are explicit parameters to be estimated.
A related line of work relies on principal components analysis (PCA) and other eigen-decomposition methods, rather than directly fitting probabilistic models; as such, we will refer to them collectively as PCA-based approaches. These methods find many of the same applications as global ancestry estimates while obviating a direct computation of global ancestry itself. For example, the EIGENSTRAT method of Patterson et al. (2006) and Price et al. (2006) uses the principal components of observed data to correct for population stratification in GWAS, avoiding altogether the estimation of admixture proportions or ancestral allele frequencies. Similarly, Hao et al. (2016) observe that many important applications of global ancestry really only require individual-specific allele frequencies. In a sense, individual-specific allele frequencies are simpler than global ancestry; while global ancestry specifies all of the individual-specific allele frequencies, the converse is not true. Therefore, Hao et al. (2016) introduce a simple truncated-PCA method that accurately and efficiently estimates individual-specific allele frequencies alone.
Both likelihood-based and PCA-based approaches have distinct merits and drawbacks. The PCA-based methods are computationally efficient and accurate in practice. It is shown, for instance, that the individual-specific allele frequencies obtained by truncated-PCA are empirically more accurate than those obtained by likelihood-based methods . Another attractive feature of PCA-based methods is that they make minimal assumptions about the underlying data-generative model. However, as mentioned before, PCA-based methods do not provide the full global ancestry estimates that their corresponding likelihood-based methods do. Most notably, they do not provide direct estimates of admixture proportions, which are often of primary interest in some applications. Additionally, PCA-based methods often have weaker statistical justifications, as they are typically not based on a probabilistic model. Although Tipping and Bishop (1999) introduced a probabilistic interpretation of PCA for multivariate Normal data, to our knowledge, no such interpretation of PCA exists when the data are Binomial, as is the case in the admixture model.
Recognizing the relative advantages of each approach, several researchers have attempted to bridge the gap between likelihood-based and PCA-based approaches. In spirit, this is also the approach that we take in the present work, and so we briefly review previous contributions to contextualize the advances made by our own method. Engelhardt and Stephens (2010) observed that fitting the admixture model was related to PCA in the sense that both could be posed as matrix factorization problems, which differ only in the constraints imposed on factors. They then introduced a third matrix factorization problem, called Sparse Factor Analysis (SFA), which encourages a sparsity through a particular prior. However, since SFA does not enforce the probabilistic constraints of the admixture model (nor the orthogonality constraints of PCA), its output cannot be directly interpreted as an estimate of global ancestry. Lawson et al. (2012) provided further insight into the mathematical relationship between admixture models and PCA and introduced a method for the analysis of phased haplotype data. This method, called fineSTRUCTURE, fits a version of the admixture model in which each observed individual belongs to a single (rather than admixed) population. Zheng and Weir (2016) introduced a method called EIGMIX that leverages PCA to infer admixture proportions from unphased genotype data. While EIGMIX allows individual genomes to be derived from a mixture of multiple ancestral populations (unlike fineSTRUCTURE), it requires a set of sampled individuals known to be derived from single ancestral populations. A related line of work uses PCA-based approaches to fit models of local ancestry, in which inferences about the ancestry of individual genetic loci are desired (for example, Brisbin et al. 2012).
While the aforementioned literature illustrates that PCA can be leveraged to provide information about population structure, each approach falls short of providing complete estimates of global ancestry under the general admixture model. The method which we introduce in the present work, called ALStructure, does precisely this. ALStructure requires no additional assumptions (such as the existence of unadmixed individuals in Zheng and Weir 2016), no specialized input (such as the unphased haplotypes of Lawson et al. 2012), and provides direct estimates of admixture proportions (unlike Engelhardt and Stephens 2010). As such, ALStructure is the only existing PCA-based method that can provide a direct substitute to the most popular likelihoodbased approaches. As an additional important advantage, the underlying mathematical theory that justifies ALStructure is sufficiently general so as to apply to a class of models that subsumes the admixture model. As such, we believe that imitable algorithms to ALStructure could be useful beyond the present genetics application.
The basic strategy of ALStructure is to eliminate the primary shortcomings of PCA-based methods while retaining their important advantages over likelihood-based methods. In particular, we extend the approach taken in Hao et al. (2016) in two ways. First, we replace classical PCA with the closely related method of Latent Subspace Estimation (LSE) (Chen and Storey 2015). In so doing, we will make mathematically rigorous the empirically effective truncated-PCA method of Hao et al. (2016) for estimating individual-specific allele frequencies. Second, we use the method of alternating least squares (ALS) to transform the individual-specific allele frequencies obtained via LSE into estimates of global ancestry.
We perform a body of simulations and analyze several globally and locally sampled human studies to demonstrate the performance of the proposed method, showing that ALStructure typically outperforms existing methods both in terms of accuracy and speed. We also discuss its implementation and the trade-offs between theoretical guarantees and run-time. We find that ALStructure is a computationally efficient and statistically accurate method for modeling admixture and decomposing systematic variation due to population structure.
The remainder of this paper is organized as follows. Model and Theory introduces the admixture model and details the mathematical underpinnings of our approach. The ALStructure Algorithm summarizes the ALStructure algorithm. A reader primarily interested in a basic understanding of the operational procedure of ALStructure and its applications may proceed to The ALStructure Algorithm after reading The admixture model subsection, as the remainder of Model and Theory is more technical in nature. Results From Simulated Data and Applications to Global Human Studies assess the performance of ALStructure on a wide range of real simulated datasets.

Model and Theory
In this section and the following we present the ALStructure method and detail some of its mathematical underpinnings. In The admixture model, we define the admixture model: the underlying probabilistic model assumed by ALStructure. Optimal model constraints describes the overall strategy of ALStructure as an optimality search subject to constraints rather than navigating a complex likelihood surface. Leveraging constraints to estimateF describes how the constraints can be used to estimate individual-specific allele frequencies. In Latent subspace estimation we present a mathematical result from Chen and Storey (2015) upon which the ALStructure algorithm heavily relies. Leveraging constraints to estimate P and Q describes why estimating global ancestry, given the individual-specific allele frequencies, is equivalent to a constrained matrix factorization problem. An efficient algorithm based on the method of ALS is also provided in this section for performing the constrained matrix factorization. The complete ALStructure algorithm is then presented in The ALStructure Algorithm.
Throughout this work, we adhere to the following notational convention: for a matrix A, we denote the i row vector of A by a iÁ , the j column vector of A by a Áj , and the ði; jÞ element of A by a ij .

The admixture model
The observed data X is an m 3 n matrix in which m (the number of SNPs) is typically much larger than n (the number of individuals). An element x ij of X takes values 0, 1, or 2 according to the number of reference alleles in the genotype at locus i for individual j.
ALStructure makes the assumption common to all likelihood-based methods that the data are generated from the admixture model. Under this model, the genotypes are generated independently according to x ij jf ij $ Binomialð2; f ij Þ, where F is an m 3 n matrix encoding all of the Binomial parameters. Each element f ij is an individual-specific allele frequency: the frequency of allele i in individual j. F is further assumed to be of rank d, where d ( n ( m. d may be thought of as the number of ancestral populations from which the observed population is derived. F then admits a factorization F ¼ PQ; in which P and Q have the following properties: P 2 ℝ m 3 d with p ij 2 ½0; 1 "ði; jÞ Q 2 ℝ d 3 n with q ij $ 0 "ði; jÞ and The matrices P and Q have the following interpretations: (i) each row p iÁ of P represents the frequencies of a single SNP for each of the d ancestral populations, and (ii) each column q Áj of Q represents the admixture proportions of a single individual. Together, P and Q encode the global ancestry parameters of the observed population; the goal of existing likelihoodbased methods is to estimate these matrices. By contrast, the truncated-PCA method of Hao et al. (2016) is focused on estimating F and not its factors. Equation 1 summarizes the admixture model.
The model introduced in Pritchard et al. (2000), which we refer to as the PSD model, is an important special case of the admixture model. It additionally assumes the following prior distributions on P and Q: The Balding-Nichols distribution (Balding and Nichols 1995) is a reparameterization of the Beta distribution, in which F i is the F ST (Weir and Cockerham 1984) at locus i, and p i is the population minor allele frequency at locus i. Specifically, Balding-Nichols ðF; pÞ ¼ Beta 1 2 F F p; 1 2 F F ð1 2 pÞ . (Other prior distributions can be used for P and Q (Pritchard et al. 2000), but here we refer to the PSD model as that using the priors listed here.) Existing Bayesian methods (Pritchard et al. 2000;Raj et al. 2014;Gopalan et al. 2016) fit the PSD model specifically, while existing maximum likelihood (ML) methods (Tang et al. 2005;Alexander et al. 2009) and ALStructure require only the admixture model assumptions.
Although we focus on fitting the admixture model in the present work, it is important to note that the general strategy of the ALStructure algorithm is insensitive to the particular details of this model. The necessary features that the theoretical underpinnings of ALStructure require are: (i) higher moments of x ij are bounded, (ii) F is low rank, and (iii) m ) n. (For a precise statement of the theoretical assumptions of LSE, see Chen and Storey 2015.) For example, an imitable algorithm could be applied to high dimensional data X where x ij jf ij $ Poissonðf ij Þ; and F is a low rank matrix whose factors P and Q potentially have natural constraints. Because of its generality, we hope that the approach of ALStructure will find useful application beyond the analysis of admixture.

Optimal model constraints
Most existing methods for fitting the admixture model employ various optimization techniques to search for the ML parameters (Pritchard et al. 2000;Alexander et al. 2009) or the maximum a posteriori estimate (Raj et al. 2014;Gopalan et al. 2016). Our approach has a fundamentally different character: rather than searching through a rough likelihood landscape in pursuit of an optimal solution, ALStructure seeks a feasible solution to a set of optimal constraints. To be more specific, we begin with the observation that any solution satisfying a particular set of constraints is risk-minimizing among a class of unbiased estimators. Because any feasible solution is optimal, we can think of the constraints themselves as being optimal. Notably, the need to maximize likelihood is altogether obviated.
The challenge of this approach is twofold. First, the constraints themselves need to be estimated from the data: they are not directly observable. This is done through the method of LSE detailed in Latent subspace estimation. Second, feasible solutions to the estimated constraints will not typically exist. For this reason, we seek solutions that approximately satisfy the constraints, thereby converting a feasibility problem to a least squares (LS) optimization problem. This procedure is done through the method of ALS and is detailed in Leveraging constraints to estimate P and Q. Throughout the remainder of the present subsection, we detail the constraints themselves.
There are several constraints that any reasonable estimate of the parameters of the admixture model must obey. The first is simply that the parameter estimatesF;P; andQ obey the relationshipF ¼PQ. We will refer to this constraint as the Equality constraint. The second obvious requirement is that entries of matricesP andQ obey the probabilistic constraints of the admixture model: As we will encounter these constraints frequently, we refer to Equation 2 as the "h" constraint, and Equation 3 and Equation 4 as the "4" constraint. This is simply because the constraints on P demarcate the boundaries of a d-dimensional unit cube (the generalization of a square), whereas the constraints on Q demarcate a d-dimensional simplex (the generalization of a triangle). Together we refer to the h and 4 constraints as the Boundary constraints.
The final constraint we require is that the row vectors ofF lie in the linear subspace spanned by the rows vectors of Q. If we denote hAi to be the rowspace of a matrix A, we can summarize this condition as: We will refer to Equation 5 as the LS (linear subspace) constraint. The LS constraint is the only nontrivial constraint that ALStructure enforces. The fact that hFi ¼ hQi is a simple consequence of the linearity of the admixture model; indeed, all rows of F are linear combinations of rows of Q since F ¼ PQ. The LS constraint thus requires the same property for our estimateF. It is important to note that the LS constraint is not the same as requiring that hFi ¼ hQi: this is ensured by the Equality constraint. Rather, the LS constraint requires that the row vectors ofF belong to the rowspace of the true Q matrix. The apparent challenge of enforcing the LS constraint is that, a priori, one does not have access to hQi. However, ALStructure takes advantage of a recent result from Chen and Storey (2015) that hQi can be consistently estimated directly from the data matrix X in the asymptotic regime of interest, when the number of SNPs m grows large. The result of Chen and Storey (2015) is, in fact, much more general than is needed in our setting, and, therefore, will likely be useful in many other problems. Because of its importance to this work, we further discuss this result in the context of the admixture model in Latent subspace estimation, and show that a modified PCA of X consistently recovers hQi.

Leveraging constraints to estimateF
The key step in ALStructure is to note that enforcing the LS constraint provides us with an immediate estimate for F. To motivate our estimator, first observe that the simple estimatẽ F ¼ 1 2 X is, in some sense, a reasonable approximation of F: it is unbiased since f ij ¼ 1 2 E½x ij under the admixture model. However, this estimate leaves much to be desired-most importantly, the estimateF will in general be of full rank (n) rather than of low rank (d) and it will have a large variance. Assuming, for now, that we are provided with the true rowspace hQi of F, a natural thing to try is to project the rows of 1 2 X onto this linear subspace. Below, we show that this estimator has some appealing properties.
Let us denote the operator Proj hSi ðXÞ such that the rows of the matrix X are projected onto the linear subspace hSi. (Note that the notation Proj hSi ðXÞ typically refers to projection of the columns of X onto the linear subspace hSi, but here we use this notation to denote projection of the rows of X onto hSi.) If we are given an orthonormal basis fs i g of the d-dimensional linear subspace hSi, then: Lemma 1 below provides us a simple condition under which estimators of F formed by such projections are unbiased.
Lemma 1: For a rank d matrix F that admits a factorization F ¼ PQ; and a random matrix X such that 1 2 E½X ¼ F; any estimator of F of the formF hSi [ 1 2 Proj hSi ðXÞ is unbiased if, and only if, hQi4hSi.
Lemma 1 is proved in Appendix A.1. In particular we note thatF is unbiased. In addition to being unbiased, the estimatorF is optimal in the following sense. Among all unbiased estimators constructed by projecting X onto a linear subspace,F minimizes a matrix equivalent of the mean squared error.
Lemma 2: For a rank d matrix F that admits a factorization F ¼ PQ; and a random matrix X; such that 1 2 E½X ¼ F, the estimatorF [ 1 2 Proj hQi ðXÞ is an unbiased estimator of F and has the smallest risk among all unbiased estimators of the formF [ 1 2 Proj hSi ðXÞ. We define the risk to be the expectation of the squared Frobenius norm: Lemma 2 is proved in detail in Appendix A.2; however, the basic intuition is straightforward. Projecting X onto a linear space hSi⊂hQi is biased (by Lemma 1). While projecting X onto a space hSi⊃hQi will result in an unbiased estimate of F (again, by Lemma 1), dimensions orthogonal to hQi fit noise, increasing the variance (and therefore the mean squared error) of the estimate. We note that this strategy is related to the strategy taken in Hao et al. (2016), in which F was estimated by projecting 1 2 X onto the space spanned by the first d principal components. In that work, it was observed that this simple strategy of estimating F typically outperformed existing methods. We will see in Latent subspace estimation that the space spanned by the first d principal components is a good estimator for hQi itself, but it can be improved practically and with theoretical guarantees by performing a modified PCA. Therefore, Lemma 2 provides a theoretical justification for the empirically accurate method put forward in Hao et al. (2016).

Latent subspace estimation
We have shown that the linear subspace hQi can be leveraged to provide a desirable estimate of F. However, as hQi is a linear subspace spanned by latent variables, it is not directly observable and must be estimated. Here, we show how a general technique developed in Chen and Storey (2015), which we will refer to as Latent Subspace Estimation (LSE), can be used to compute a consistent estimate of hQi from the observed data X.
LSE is closely related to PCA, a popular technique that identifies linear combinations of variables that sequentially maximize variance explained in the data (Jolliffe 2002). As PCA is commonly used to find low-dimensional structure in high-dimensional data, a natural approach to estimating hQi would be to employ SNP-wise PCA. More specifically, we might consider the linear space spanned by the first few eigenvectors of the n 3 n matrix, 1 m X T X as an estimate of hQi.
The LSE-based estimate of hQi almost exactly matches this PCA-based intuition. The only difference is that LSE accounts for the heteroscedastic nature of the admixture model, as detailed in Chen and Storey (2015). LSE has the theoretical advantage of asymptotically capturing hQi in the high-dimensional setting (i.e., as m/N). Theorem 1, as stated here, is a special case of a more general theorem from Chen and Storey (2015), rewritten here for the special case of the admixture model.
and let D be the diagonal matrix with jth entry equal tod j . The d eigenvectors fy 1 ; . . . ; y d g; corresponding to the top d eigenvalues of the with probability 1, where 4 denotes the symmetric set difference. Further, the smallest n 2 d eigenvalues of G converge to 0 with probability 1.
Theorem 1 provides us with a simple procedure for estimating hQi directly from data. One first computesd j and constructs the n 3 n matrix D. Next, an eigen-decomposition of the adjusted matrix 1 m X T X 2 D is computed. Finally, we estimate hQi as where V 1:d are the first d columns from the singular value decomposition of G.
We stress that the general form of Theorem 1 from Chen and Storey (2015) makes LSE applicable to a vast array of models beyond factor models and the admixture model discussed here. As a further benefit to the LSE methodology, it is both easy to implement and computationally appealing. The entire computation of d hQi requires a single eigen-decomposition of an n 3 n matrix, where the accuracy depends only on large m.
Leveraging constraints to estimate P and Q Now that we have a method for obtaining the estimateF by leveraging the LS constraint, what remains is to find estimates for P and Q. Since the estimateF has several appealing properties, as outlined in Leveraging constraints to estimateF, the approach of ALStructure is simply to keepF fixed and seek matricesP andQ that obey the Equality and Boundary constraints of the admixture model. Below we discuss some of the general properties of this approach: namely the question of existence and uniqueness of solutions. We will briefly discuss the general problem of nonidentifiability in the admixture model, and provide simple and interpretable conditions under which the admixture model is identifiable. Finally, we will provide simple algorithms for computingP andQ fromF based on the method of ALS.
Existence, uniqueness, and anchor conditions: First, we develop some terminology. We will say that an m 3 n matrix A admits an admixture-factorization if the following feasibility problem has a solution: find: ðB; CÞ (8) subject to: A ¼ BC and ðh; 4Þ In words, the feasibility problem in Equation 8 simply seeks a factorization of A that obeys the Equality and Boundary constraints from Optimal model constraints imposed by the admixture model. The smallest integer d for which ðB; CÞ is a solution to Equation 8 with B an m 3 d matrix and C a d 3 n matrix is the admixture-rank of A, which we denote rank ADM ðAÞ. By seeking a rank d admixture-factorization ofF, ALStructure converts a problem of high-dimensional statistical inference to a matrix factorization problem.
This simple approach has two apparent shortcomings: i. A rank d admixture-factorization ofF may not exist.
ii. If a valid factorization exists, it will not be unique.
Item (i) is a technical problem; though F admits a rank d admixture factorization by assumption, the same is not true forF in general. Even though the rank ofF is d by construction, rankðFÞ 6 ¼ rank ADM ðFÞ in general. ALStructure avoids this problem by changing the feasibility problem expressed in Equation 8 to the following optimization problem: subject to: ðh; 4Þ It is important to note that (ii) is not a problem unique to ALStructure, but is a fundamental limitation for any ML method as well. This is because the likelihood function depends onP andQ only through their productF; more formally, the admixture model is nonidentifiable. One unavoidable source of nonidentifiability is that any solution ðP;QÞ to the matrix factorization problem in Equation 8 will remain a valid solution after applying a permutation to the columns ofP and the rows ofQ. A natural question to ask is: "When is there a unique factorizationF ¼PQ up to permutations?" Two important types of sufficient conditions under which unique factorizations exist up to permutations are anchor SNPs and anchor individuals. We note that the concept of anchors has been previously employed in the field of topic modeling, where anchor words are of interest (Arora et al. 2013). We define an anchor SNP as one that is fixed in all ancestral populations except one. The anchor SNPs condition is then satisfied if each of the d ancestral populations has at least one corresponding anchor SNP. Analogously, we define an anchor individual as one whose entire genome is inherited from a single ancestral population. The anchor individuals condition is then satisfied if each of the d ancestral populations has at least one corresponding anchor individual. The assumption of anchor individuals is equivalent to the assumption of "surrogate ancestral samples" required by the EIGMIX method of Zheng and Weir (2016). The fact that either a set of d anchor SNPs or d anchor individuals makes the admixture model identifiable up to permutations follows from a simple argument found in Appendix A.3. For the special case of d ¼ 3, Figure 1 graphically displays the anchor conditions. It is important to remember that ALStructure does not require anchors to function. Rather, anchors provide interpretable conditions under which solutions provided by ALStructure, or any likelihood-based method, can be meaningfully compared to the underlying truth.
The anchor SNP and anchor individual conditions are not necessarily the only sufficient conditions for ensuring identifiability of the admixture model, and indeed, to the best or our knowledge, there is not currently a complete characterization of conditions for which the admixture model is identifiable. We regard this as an important open problem. In practice, ALStructure is capable of retrieving solutions remarkably close to the underlying truth even in simulated scenarios far from satisfying the anchor conditions, including conditions that are challenging for existing methods.
Computation: Here, we present two simple algorithms for solving the optimization problem: The first algorithm, which we call cALS (constrained ALS), has the advantage that it is guaranteed to converge to a stationary point of the nonconvex objective function in (10). While a stationary point will not generally correspond to a globally optimal solution, global optimization is seldom possible for nonconvex problems.
Although theoretically appealing, this algorithm relies on solving many constrained quadratic programming problems and is, consequently, potentially slow. To overcome this problem, we introduce a second algorithm called tALS (truncated ALS), which simply ignores the problematic quadratic constraints in cALS. Though lacking a theoretical guarantee of convergence, the increase in speed is significant and the outputs of the two algorithms are often practically indistinguishable.
We note that the general method of ALS is not novel. In particular, previous work has developed ALS methods for the problem of nonnegative matrix factorization (NNMF) (Paatero and Tapper 1994;Lee and Sebastian 1999). In NNMF, one seeks a low-rank factorization, A ¼ BC; in which all elements of the factors B and C are non-negative. Algorithms analogous to cALS and tALS, but with nonnegativity constraints rather than the h and 4 constraints, have previously been considered (Berry et al. 2007;Cichocki et al. 2007;Gillis and Glineur 2012;Kim et al. 2014).
An algorithm with provable convergence: While problem (10) is nonconvex as stated, the following two subproblems are convex: subject to: h subject to: 4 That (11) and (12) are convex problems is clear; norms are always convex functions, and h and 4 are convex constraints. In particular, (11) and (12) are both members of the well-studied class of Quadratic Programs (QP), for which many efficient algorithms exist (Boyd and Vandenberghe 2009). We propose the following procedure for factoringF, which we call Constrained ALS.
Despite the original problem being nonconvex, Algorithm 1 is guaranteed to converge to a stationary point of the objective function in (10) as a result of the following theorem from Grippo and Sciandrone (2000). InitializeP arbitrarily. repeat.
Solve (12) with P ¼P and returnQ. Solve (11) with Q ¼Q and returnP. until Convergence ofP andQ. return ðP;QÞ. then any limit point ðP; QÞ will be a stationary point of the original problem. (Note that the result from Grippo and Sciandrone (2000) is actually more general than this. We reproduce the special case above in order to make clear the connection to our problem.) An efficient heuristic algorithm: If we remove all constraints on P and Q from Equation 11 and Equation 12, the resulting optimization problems are simple linear LS.
Our algorithm proceeds by alternating between solving the unconstrained LS problems (13) and (14). After each step, the optimal solution will not necessarily obey the constraints of problem (10). To keep our algorithm from converging on an infeasible point, we truncate the solution to force it into the feasible set. More precisely, each element of the solution P * to (13) is truncated to satisfy h and each column of the solution Q * to (14) is projected to the closest point on the simplex defined by the 4 constraint. Simplex-projection is nontrivial; however, it is a well-studied optimization problem. Here we use a particularly simple and fast algorithm from Chen and Ye (2011). This algorithm, which we call the Truncated ALS Algorithm, is detailed in Algorithm 2.
An example dataset: Figure 2 displays the output of cALS and tALS on a dataset from the PSD model with the parameters: m ¼ 100; 000, n ¼ 500, d ¼ 3, a ¼ ð0:1; 0:1; 0:1Þ. As can be seen, the output fits for Q provided by cALS and tALS are practically indistinguishable to the eye, and are both excellent approximations of the ground truth. The cALS algorithm performed slightly better than the tALS algorithm (8:5 3 10 23 and 8:7 3 10 23 RMSE, respectively). However, cALS took 3.5 hr to complete while tALS terminated in under 1.5 min.
Because of the significant gains in efficiency, we use tALS exclusively throughout the remainder of this paper. The analyst who requires theoretical guarantees can, of course, use the cALS algorithm instead. Appendix B provides a more detailed comparison between the tALS and cALS algorithms on simulated data.
The ALStructure algorithm In this section, we briefly outline the entire ALStructure method whose components were motivated in depth in Model and Theory. In order to fit the admixture model, we obtain estimatesF,P, andQ from the SNP matrix X through the following three-part procedure: i. Estimate the linear subspace hQi from the data X.
ii. Project 1 2 X onto the estimate d hQi to obtain an estimate of F. iii. Factor the estimateF subject to the Equality and Boundary constraints to obtainP andQ.
For convenience, we detail the entire ALStructure algorithm in Algorithm 3, and annotate each of the three steps described above. We note that we have decided to use the tALS function rather than the cALS function in our definition of the ALStructure algorithm, valuing the speed advantage of tALS over the theoretical guarantees of cALS. If desired, one could of course choose to use the cALS function instead without making any other alterations to the ALStructure.
We emphasize here that ALStructure's estimateQ is ultimately derived from the LSE-based estimate of the latent subspace d hQi. As the method of LSE is closely linked to PCA, we consider ALStructure to be a unification of PCA-based and likelihood-based approaches. Perhaps the most striking feature of Algorithm 3 is its brevity. One advantage of this simplicity is its ease of implementation. Although Algorithm 3 has been implemented in the R package ALStructure, it can clearly be reimplemented in any language quite easily. Equally important is that all of the operations in Algorithm 3 are standard. The only two computationally expensive components are a single eigen-decomposition if n is large, and QR decompositions to find linear least squares (LLS) solutions in the tALS algorithm. Both of these problems have a rich history and consequently have many efficient algorithms. It is likely that the ALStructure implementation of Algorithm 3 can be Algorithm 3: ALStructure procedure ALStructureðX; dÞ repeat Solve (14) with P ¼P, and return the simplex-projected solutionQ.
Solve (13) with Q ¼Q and return the truncated solu-tionP. until Convergence ofP andQ return ðP;QÞ significantly sped up by utilizing approximate or randomized algorithms for the eigen-decomposition and/or LLS computations. In its current form, ALStructure simply uses the base R functions eigen() and solve() for the eigen-decomposition and LLS computations, respectively. Despite this, the current implementation of ALStructure is typically faster than existing algorithms as can be seen in Results From Simulated Data and Applications to Global Human Studies below.
The ALStructure method is a nonparametric estimator in the following ways. It makes no assumptions about the probability distributions of P or Q. Any random variable taking values in f0; 1g is by necessity Bernoulli. In this vein, the assumption that x ij $ Binomialð2; f ij Þ is not a parametric assumption per se, but rather an assumption about independence of alleles. Finally, the likelihood function is not utilized in estimating P and Q, making ALStructure likelihood-free.
For choosing the dimensionality of the model d, we recommend utilizing the recently proposed structural Hardy-Weinberg equilibrium (sHWE) test (Hao and Storey 2017). This test can perform a genome-wide goodness of fit test to the assumptions made in the admixture model over a range of d. It then identifies the minimal value of d that obtains the optimal goodness of fit. There are other ways to choose d, by using the theory and methods in Chen and Storey (2015) or by using other recent proposals Hao et al. 2016).

Simulated data sets
In this section, we compare the performance of ALStructure to three existing methods for global ancestry estimation, Admixture, fastSTRUCTURE and terastructure. Admixture, developed by Alexander et al. (2009), is a popular algorithm that takes a ML approach to fit the admixture model. Both fastSTRUCTURE (Raj et al. 2014) and terastructure (Gopalan et al. 2016) are Bayesian methods that fit the PSD model using variational Bayes approaches. We abbreviate these methods as ADX, FS, and TS in the figures. A comparison among these three methods appears in Gopalan et al. (2016), so we will focus on how they compare to ALStructure.
To this end, we first tested all algorithms on a diverse array of simulated datasets. The bulk of our simulated data sets come from the classical PSD model (defined in The admixture model), in which columns of Q are distributed according to the DirichletðaÞ distribution and the rows of P are drawn from the Balding-Nichols distribution. We varied the following parameters in our simulated datasets: m, n, d, and a. Of particular note is the variation of a. For this we used four a-prototypes: a 1 ¼ ð10; 10; 10Þ, a 2 ¼ ð1; 1; 1Þ, a 3 ¼ ð0:1; 0:1; 0:1Þ, and a 4 ¼ ð10; 1; 0:1Þ. These four prototypes were chosen because they represent four qualitatively different distributions on the Dirichlet simplex as shown in Figure 3: a 1 corresponds to points distributed near the center of the simplex, a 2 corresponds to points distributed evenly across the simplex, a 3 corresponds to points distributed along the edges of the simplex, and a 4 corresponds to an asymmetric distribution in which points are concentrated in one of the corners of the simplex.
When we produced datasets with d . 3, we extended the prototypes in the natural way; for example for d ¼ 6, the a 4 is becomes ð10; 10; 1; 1; 0:1; 0:1Þ. Table 1 lists all of the parameters we used to generate data under the Dirichlet model, for a total of 96 distinct combinations.
The parameters of the Balding-Nichols distributions from which rows of the P matrix were drawn were taken from real data, following the same strategy taken in Gopalan et al. (2016). Specifically, F i and p i were estimated for each SNP in the Human Genome Diversity Project (HGDP) dataset (Cavalli-Sforza 2005). Then, for each simulated dataset, m random samples are taken (with replacement) from the HGDP parameter estimates. Figure 3 Examples of typical random samples from the four different a-prototypes. As can be seen, only a 2 and a 3 approximately obey the "anchorindividuals" condition.  In addition to simulating Q matrices from the classical DirichletðaÞ distribution with many different parameters a, we also simulated data from the spatial model of admixture developed in Ochoa and Storey (2016). We deliberately chose to study this model because it is ill-suited for ALStructure; while ALStructure relies on the estimation of the d-dimensional linear subspace hQi, the columns of Q produced under the spatial model lie on a one-dimensional curve within hQi. Despite this fundamentally challenging scenario, we see that ALStructure is often capable of recovering an accurate approximation.

Results from the PSD model
In order to give a representative picture of the relative performance of ALStructure against existing algorithms, we first plot the fits of all of the algorithms for two particular data sets out of the total 96 model data sets: (i) the data set in which ALStructure performs the best and (ii) the data set in which ALStructure performs the worst, according to mean absolute error (defined below).
On the left side of Figure 4, we see that all four algorithms perform very well for the data set in which ALStructure performs best, which comes from the a 3 -prototype. On the right side of Figure 4, the dataset was generated from the a 4 -prototype. We see that, while ALStructure certainly deviates substantially from the truth, so does every algorithm. Both fastSTRUCTURE and terastructure provide results that are qualitatively very different from the truth; where fast-STRUCTURE compresses all columns of Q onto a single edge of the simplex, terastructure spreads them out through the interior of the simplex. Both Admixture and ALStructure provide solutions qualitatively similar to the truth. While the points in the Admixture solution extend much further along the edge of the simplex than the true model, the ALStructure solution spreads into the interior of the simplex more than the true model. Figure 5 provides a comprehensive summary of the performance of ALStructure against the existing algorithms on all simulated datasets. The top panels of Figure 5 summarize the accuracy of each of the four algorithms, according to two metrics: root mean squared error (RMSE) and mean absolute error (MAE).
The bottom left panel of Figure 5 shows mean per observation log-likelihood, 1 mn P m i¼1 P n j¼1 log P À x ij jf ij Á , on all simulated data sets. (To obtain full data log-likelihoods, multiple these numbers by mn.) It is interesting to note that ALStructure performs comparably to other methods from the likelihood perspective despite the fact that it is the only method that does not explicitly utilize the likelihood function. However, we emphasize that likelihood is an imperfect metric of model fit for two reasons. First, because of the highly nonidentifiable nature of the admixture model as discussed in Model and Theory, many Model fits by ALStructure, Admixture, fastSTRUCTURE, and terastructure on two particular simulated datasets. The left panel shows the fits corresponding to the dataset on which ALStructure performed the best. The right panel shows the fits corresponding to the dataset on which ALStructure performed the worst. Each point represents a column of the Q matrix and is plotted by the first and second coordinates. Blue points are plotted as a visual aid and delineate a common subset of individuals. models are equivalent from the likelihood perspective. Therefore, if one is primarily concerned about the accuracy of admixture estimates, the RMSE or MAE metrics may be more suitable. Second, in high-dimensional models, it has been demonstrated that high likelihood may yield far inferior estimates (Efron 2013). Starting with Stein's Paradox (Stein 1956), it has been shown in many settings that the ML estimator for several parameters may be uniformly worse in accuracy than methods that leverage shared information in the data.
The bottom right panel of Figure 5 shows the distributions of run times for each algorithm on all modeled datasets. Due to the size of the simulated datasets and our computational constraints, each algorithm did not terminate on each of the 96 datasets. In Figure 5, we plot only the datasets for which all four algorithms successfully terminated. See Appendix C for more details. It is clear that ALStructure is competitive with respect to both model fit and time. ALStructure outperforms all methods according to both RMSE and MAE. With respect to time, ALStructure is clearly favored (one should note that the y-axis is on the log scale).

Results from the spatial model
As a challenge to ALStructure, we simulate data from a model developed in Ochoa and Storey (2016), which we will refer to as the spatial model. This model mimics an admixed population that was generated by a process of diffusion in a one-dimensional environment. There are d unmixed ancestral populations equally spaced at positions fx 0 ; x 0 þ 1; . . . ; x 0 þ d 2 1g on an infinite line. If all populations begin to diffuse at time t ¼ 0 at the same diffusive rate, then population i will be distributed as a Gaussian with mean m i ¼ x 0 þ i 2 1 and SD s. Therefore, under the spatial model, an individual sampled from position x will have admixture proportions shown in Equation 15, where f ðm;sÞ denotes the Gaussian distribution with parameters ðm; sÞ.
Although this is just a special case of the admixture model, one would expect the spatial model to be particularly ðq 1 ðxÞ; q 2 ðxÞ; . . . ; q d ðxÞÞ ¼ f  challenging for ALStructure because the admixture proportions belong to a one-dimensional curve parameterized by x, and ALStructure necessitates the estimation of a d-dimensional linear subspace in ℝ n . The challenge is much more pronounced when the populations are highly admixed (large s). Figure 6 shows the model fits provided by ALStructure. Indeed, for large values of s ðs ¼ 2Þ, ALStructure fails to correctly capture the admixture proportions. However, for smaller values of s ðs ¼ f1; 0:5gÞ, it can be seen that the fits provided by ALStructure are excellent. In all simulations m ¼ 10 5 , n ¼ 10 3 , and d ¼ 3.
We note that Gopalan et al. (2016) tested Admixture, fastSTRUCTURE, and terastructure on data drawn from the spatial model (which they refer to as "Scenario B"). They showed this model posed a significant challenge for all three methods, but found that terastructure performed the best.

Applications to Global Human Studies
Here, we apply ALStructure and existing methods to three globally sampled human genotype datasets: the Thousand Genomes Project (TGP), HGDP, and Human Origins (HO) datasets (Cavalli-Sforza 2005;Lazaridis et al. 2014;The 1000Genomes Project Consortium et al. 2015. Table 2 summarizes several basic parameters of each of the datasets and Appendix D details the procedures used for building each dataset. Although we recommend using sHWE from Hao and Storey (2017) for choosing d, here we take the number of ancestral populations d directly from Gopalan et al. (2016) so that our results are easily comparable to those of the latter study. Figure 7 shows scatterplots of the first two rows ofQ for each of the three datasets provided by each of the four fits. To disambiguate the inherent nonidentifiability (see Model and Theory), we ordered the rows of the fitsQ by decreasing variation explained: Perhaps the most striking aspect of Figure 7 is the difference between the fits produced by each method. With the notable exception that Admixture and ALStructure have similar fits for the TGP and HGDP datasets, every pair of comparable scatterplots (i.e., within a single row of Figure 7) are qualitatively different. Figure 11 in the Appendix displays the same data represented as stacked barplots of the admixture proportions. In this representation too, qualitative differences between the fits are also evident. Table 3 shows the mean per observation log-likelihood of the fits provided by each of the four methods. Figure 12 in the Appendix shows that the distributions of per observation likelihood are nearly indistinguishable across all methods.
Next we compare the performance of ALStructure to existing methods both in terms of efficiency and accuracy. Unlike in the case of simulated datasets where the ground truth is known, here we cannot directly compare the quality of model fits across methods. Instead, we assess the quality of each method by its performance on data simulated from real data fits.  For concreteness, we briefly outline the process below: i. Fit each dataset with each of the four methods to obtain 12 model fits. ii. Simulate datasets from the admixture model using parameters obtained in the previous step.
iii. Fit each of the 12 simulated datasets with each of the four datasets (48 fits) and compute error measures.
The process above treats each of the four methods symmetrically, evaluating each method based on its ability to fit data simulated from both its own model fits as well as every other methods' model fits. Figure 8 summarizes the performance of each method with respect to both model fit and efficiency on data simulated from the above described process. As with the results on simulated datasets from Results from the PSD model, it is clear that ALStructure is competitive with respect to both model fit and time. Both Admixture and ALStructure outperform fastSTRUCTURE and terastructure by all quality of fit metrics. ALStructure far outperforms all methods with respect to time (one should note that the y-axis is on the log scale).
In Appendix E, we compare the performance of ALStructure to pre-existing methods on an additional nonglobal dataset from Basu et al. (2016). In this dataset, individuals are sampled from 18 modern Indian subpopulations. India's genetic admixture is of particular interest because of its long history of sociocultural norms promoting endogamy. We find that each of the four methods produce admixture estimates qualitatively similar to each other for this dataset (see Figure 10 in the Appendix). One possible explanation for this observed similarity is that the genetic history of India more closely mimics the admixture model than does global genetic history, as suggested by Lawson et al. (2018).

Discussion
In this work, we introduced ALStructure, a new method to fit the admixture model from observed genotypes. Our method attempts to find common ground between two previously distinct approaches to understanding genetic variation: likelihoodbased approaches and PCA-based approaches. ALStructure features important merits from both. Like the likelihood-based approaches, ALStructure is grounded in the probabilistic admixture model, and provides full estimates of global ancestry. However, operationally the ALStructure method closely resembles PCA-based approaches. In particular, ALStructure's estimates of global ancestry are derived from a consistent PCAderived estimate that captures the underlying low-dimensional latent subspace. In this way, ALStructure can be considered a unification of likelihood-based and PCA-based methods.
Because ALStructure is operationally similar to PCAbased methods, it is computationally efficient. Specifically, the only computationally expensive operations required by the ALStructure algorithm are singular value and QR decompositions. Both of these computations have been extensively studied and optimized. Although ALStructure Figure 7 Biplots of the first two rows of Q (ranked by variation explained) of the fits of the TGP (top), HGDP (middle), and HO (bottom) datasets for each algorithm. Individuals are colored by coarse subpopulation from which they are sampled. already performs favorably compared to preexisting algorithms in computational efficiency, it is likely that, by applying more sophisticated matrix decomposition techniques, ALStructure may see significant improvements in speed. Although extremely simple, ALStructure typically outperforms preexisting algorithms both in terms of accuracy and time. This observation holds under a wide array of datasets, both simulated and real. The usefulness of PCA-based approaches has been increasingly recognized in related settings, such as the mixed membership stochastic block model (Rubin-Delanchy et al. 2017) and topic models (Ke and Wang 2017). The basic approach we have presented is quite general. In particular, the set of models that satisfy the underlying assumptions of LSE is large, subsuming the admixture model as well as many other probabilistic models with low intrinsic dimensionality. Consequently, we expect that the ALStructure method can be trivially altered to apply to many similar problems beyond the estimation of global ancestry.

Acknowledgments
Thank you to Wei Hao and Alejandro Ochoa for feedback and advice on this manuscript. This research was supported in part by National Institutes of Health (NIH) grant HG006448.

Data availability
Publicly available data sets were used in this study. Information on the data sets and computer code used to analyze the data are available at https://github.com/StoreyLab/alstructure_ paper. An R package implementing the method proposed here is available at https://github.com/StoreyLab/alstructure.

Literature Cited
By studying Equation 19, we can see the estimatorF has several favorable properties. First note that the risk is a sum of m 3 k nonnegative numbers since Var½Z $ 0 for any random variable Z. If we were to project onto a larger subspace hSi, where hSi⊃hQi, we would add terms to Equation 19 and consequently increase our risk. If we were to project onto a smaller subspace hSi⊂hQi, then the risk may decrease; however, our new estimator will now be biased by Lemma 1. From these observations, we conclude thatF is optimal in the sense described in the Lemma 2.

A.3 Proof of sufficiency of anchors
Here, we show that either a set of anchor SNPs or a set of anchor individuals is sufficient to specify a unique factorization F ¼ PQ up to the nonidentifiability associated with row permutations. Proposition 1. For a rank d matrix F with a factorization F ¼ PQ, if there is a set S of d rows of P such that for each i 2 f1; 2; . . . ; dg there exists a row vector p iÁ 2 S such that p iÁ ¼ d i e i for d i 6 ¼ 0, where e i is a vector of length n in which element i is 1, and all other elements are 0, then the factorization is unique up to permutation. When such a set S exists, we say that we have "anchor SNPs." Proof. Let us denote the matrix D ¼ diagðd 1 ; d 2 ; . . . ; d d Þ. Without loss of generality, let us assume that S is the first d rows of P and are ordered such that P ¼ D A for some ðm 2 dÞ 3 d matrix A. Then there is a unique Q for this F matrix (up to permutation) which is The matrix A is also uniquely determined by F once Q is fixed. To see this, note that where f jÁ and p jÁ denote the j row of F and P respectively. Since f jÁ is fixed and Q is unique under the anchor SNP assumption, there is a unique solution for p jÁ by the linear independence of the rows of Q.
The interpretation of the anchor SNPs assumption is that every ancestral population has at least one SNP that appears only in it. The presence of such an SNP is therefore a guarantee that the individual is a member of a particular population. Note that an identical argument could be made when we have a set S of d columns of Q that have exactly one nonzero entry at unique locations. When such a set exists, we say that we have "anchor individuals." Under the admixture model, the simplex constraint requires that the nonzero entry of each anchor genotype is exactly one. In this scenario, there exists at least one individual from each ancestral population whose entire genome was inherited by a single ancestral population. We summarize these results in the following corollary and visualize the anchor SNP and anchor genotype scenarios in Figure 1.
Corollary 1. Whenever a rank d matrix F admits a factorization F ¼ PQ such that there are either a set of anchor SNPs or a set of anchor genotypes, the factorization is unique up to permutation.

Figure 11
The admixture proportions of each globally sampled dataset as a stacked barplot. Because of the nonidentifiability of the model, the order of the rows of Q are arbitrary. To disambiguate this, we order the rows of Q in each dataset by decreasing average admixture. The coloring in Figure 11 is then done according to this ordering. As an aid to the eye, we also reorder the columns of Q according to decreasing proportion of the first row of Q of the ALStructure fit. The choice to order all fits according to ALStructure is arbitrary; however, all fits must be ordered consistently to make meaningful comparisons possible. As can be seen, each of the fits differ significantly from each other on every dataset.

HGDP:
The HGDP samples globally from 51 populations and is available here: http://www.hagsc.org/hgdp/files.html. Individuals with firstor second-degree relatives and SNPs with minor allele frequency , 5% are removed. The dimensions of this dataset are 940 individuals and 550,303 SNPs.