Originally published as Genetics Published Articles Ahead of Print on June 11, 2007.

Genetics, Vol. 176, 2405-2420, August 2007, Copyright © 2007
doi:10.1534/genetics.107.071779

A Covariance Structure Model for the Admixture of Binary Genetic Variation

Department of Anthropology, University of California, Davis California 95616

1 Address for correspondence: Department of Anthropology, University of California, 1 Shields Ave., Davis, CA 95616.
E-mail: mngrote{at}ucdavis.edu

Manuscript received February 5, 2007. Accepted for publication May 25, 2007.

ABSTRACT

I derive a covariance structure model for pairwise linkage disequilibrium (LD) between binary markers in a recently admixed population and use a generalized least-squares method to fit the model to two different data sets. Both linked and unlinked marker pairs are incorporated in the model. Under the model, a pairwise LD matrix is decomposed into two component matrices, one containing LD attributable to admixture, and another containing, in an aggregate form, LD specific to the populations forming the mixture. I use population genetics theory to show that the latter matrix has block-diagonal structure. For the data sets considered here, I show that the number of source populations can be determined by statistical inference on the canonical correlations of the sample LD matrix.


ADMIXTURE, the mixing of genetically differentiated populations via migration and subsequent intermating, can create linkage disequilibrium (LD) between genes, even when the genes are not physically linked (see, e.g., CAVALLI-SFORZA and BODMER 1971, p. 69; PROUT 1973). In this work, I show that admixture contributions to LD can be statistically quantified and distinguished from LD attributable to the shared ancestry of linked alleles. OHTA (1982) used Wright's island model to decompose a squared coefficient of LD into within- and between-population terms, in analogy with WRIGHT's (1940) decomposition of the inbreeding coefficient. These decompositions assume that populations connected by migration can be identified and sampled for genetic variation. The method I propose uses a pairwise LD matrix sampled from an admixed population of unknown composition; the number of source populations and the components of LD are inferred by use of a multivariate statistical model.

The data: blocks of binary markers:

It is convenient to develop the model using gametes as the basic units of observation. The data are then n random binary vectors of the form x = (x1,...xL)', with xl isin {0, 1}; l = 1,..., L. Each vector represents the single nucleotide polymorphism (SNP) variation on one sampled gamete, under an arbitrary binary coding scheme, with xl indicating the allele on gamete x at the lth marker locus. The markers are assumed to be selectively neutral and variable in the sample.

Before statistical analysis begins, the markers in x are to be grouped into blocks by the investigator, based on physical criteria (e.g., the markers in a block share a localized region on a physical map), along with empirical evidence (e.g., the markers in a block are known by a previous linkage-mapping study to form a linkage group) independent of the sample under consideration. Each marker belongs to exactly one block; however, a particular marker may be the only member of a block. Any two markers l, m within the same block are assumed to be linked, with recombination fraction Formula. In contrast, markers l, j from different blocks are assumed to be unlinked, with recombination fraction clj {approx} Formula.

In the development below, block structure derives from physical and linkage relationships between markers, with blocks analogous to linkage groups; this is to be distinguished from empirical descriptions of "haplotype block structure" (see, e.g., GABRIEL et al. 2002; PHILLIPS et al. 2003). Recent work of the INTERNATIONAL HAPMAP CONSORTIUM (2005) and MYERS et al. (2005) suggests that haplotype block structure results from variation in recombination rates over small physical distances. The fine-scale rate estimates obtained by MYERS et al. (2005) demonstrate new tools for constructing blocks of tightly linked markers. In the data examples to follow, I form two blocks of markers on the arms of the human X chromosome, using simple physical criteria.

To fix notation, I assume that x can be partitioned into B blocks, labeled 1,..., B in any convenient order (though the same partitioning and labeling scheme is used for all gametes). For b = 1,...B, the bth block contains Lb ≥ 1 marker loci, with Formula. I order the vector elements x1,..., xL so that the indices Formula are assigned to the loci of the bth block. However, markers within a block need not be ordered in any particular way.

For binary loci, linkage disequilibrium is covariance:

For binary loci, a familiar pairwise LD parameter (LEWONTIN and KOJIMA 1960) is a covariance between locus-specific indicator variables. Let pl be the population frequency of the "1" allele at locus l, let plm be the population frequency of gametes carrying the "1" allele at both loci l and m, and let E be the expectation operator for the distribution on (xl, xm) isin {0, 1} x {0, 1} parametrized by pl, pm, and plm. The LD parameter for loci l and m is then

Formula 1(1)
For consistent definitions, I take Dll to be the variance of the locus-specific indicator:

Formula 2(2)
Given a random sample of binary gametes x1 , ... , xn, with Formula 2, I treat the l, m entry of the sample covariance matrix

Formula 3(3)
as an estimate of Dlm. I use the denominator n in (3), rather than the denominator n 1 typically used for unbiased estimation, for consistency with sample allele frequency calculations. The components of linkage disequilibrium under the admixture model are determined by the structure of S and the sample allele frequencies. Other sample-based estimates of the population covariance matrix may provide suitable alternatives to S; the composite LD estimates described by COCKERHAM and WEIR (1977; see also WEIR 1979) are particularly attractive, as they do not require direct observation of gametes. A complete description of the associations among L binary markers requires the use of third- and higher-order moments, along with the second-order moments considered here (see, e.g., EKHOLM et al. 1995). Nonetheless, relevant structure in the data induced by admixture can be detected by an analysis limited to covariances.

In a recent article, PATTERSON et al. (2006) also use a sample covariance matrix to make inferences about admixture. Although the locus-specific variables defined by Patterson et al. refer to diploid genotypes and take values in {0, 1, 2}, it is useful to establish a relationship between the matrix they analyze and S. In the notation above (up to differences in variable definitions, and omitting a normalization step), the n x n matrix of inter-individual covariances analyzed by PATTERSON et al. (2006) could be written as

Formula 4(4)
Thus the implicit expectation is over loci rather than over individuals in the population, and covariances are between multilocus genotypes rather than between locus-specific indicators. WICKENS (1995) uses the terms variable space and subject space to distinguish the multivariate spaces that give rise to S and (4), respectively. GOWER (1966) and JOLLIFFE (2002, Sect. 5.2) compare and contrast the two approaches, although a comparison in the context of admixture inference is beyond the scope of this work.


THE PULSE-DECAY MODEL OF ADMIXTURE AND A MATRIX DECOMPOSITION
The "pulse-decay" model (shown schematically in Figure 1) is a highly simplified admixture model that, somewhat unexpectedly, shares mathematical properties with traditional covariance structure models. It is a variation of the "immediate admixture" model of EWENS and SPIELMAN (1995), emphasizing different statistical properties. I give a detailed derivation to show that the statistical model used later for data analysis is implied by the pulse-decay model.


Figure 1
View larger version (10K):
In this window
In a new window
Download PPT slide
 
FIGURE 1.—

Schematic of the pulse-decay model of admixture, showing representative population parameters. Diagonal arrows indicate the transfer of randomly chosen genotypes, contributed to the admixed population in the proportions q(1),..., q(K). Vertical arrows indicate transitions of the Wright–Fisher process in the admixed population. 2N(k) is the number of gametes present in source population k at generation 0, pl(k) and pm(k) are the frequencies of the "1" allele at loci l and m, and Dlm(k) is the "standing" LD between loci l and m in source population k. Further details of the model are given in the text.

 
Under the pulse-decay model, generation 1 of the admixed population consists of N diploid genotypes sampled at random with replacement from K ≥ 2 source populations. The probability that a genotype is chosen from the kth source population is q(k) > 0; k = 1,..., K, with Formula 4. The mixing proportions q(1),... q(K) are unknown. In practice, K may be an unknown parameter requiring estimation. A genotype contributed by the kth source population consists of two gametes chosen at random with replacement from the 2N(k) gametes present at generation zero in the kth source population. At generation zero, the frequency of the "1" allele at locus l in source population k is pl(k), k = 1,..., K, and the covariance between loci l and m in source population k is Dlm(k). I call the vector of allele frequencies Formula 4' from source population k p(k), and I assume that the source population allele frequency vectors p(1),..., p(K) are linearly independent. The latter assumption implies that the number of markers L is at least as large as the number of source populations K. Linear independence of p(1),..., p(K) also implies that no source population is an admixture of the other source populations. I describe the effect of weakening the assumption of linear independence briefly in a later section. Under the pulse-decay model, the source populations make no further contributions to the sampled population after generation 1. Mating is at random in the admixed population from generation 1 onward.

By straightforward calculation, the expected frequency of the "1" allele at locus l, at generation 1 in the admixed population, is Formula 4, where the expectation is over samples drawn from the source populations. Similarly, the covariance between loci l, m at generation one in the admixed population is

Formula 5(5)
where Dll(k) is the locus-specific variance in the kth source population.

The expected covariance in generation t + 1, over realizations of the Wright–Fisher process in the admixed population, is

Formula 6(6)
where clm = cml isin (0, Formula 6] is the recombination fraction between loci l and m (see, e.g., KARLIN and MCGREGOR 1968, Equation 7). WEIR and COCKERHAM (1974) provide related expressions for "genotypic" disequilibria (see WEIR 1996, Chap. 3), which completely specify two-locus dynamics in diploid organisms. Whereas D1,lm of Equation 5 is a population parameter, Formula 6t+1,lm of Equation 6 is a random variable. In Equation 6, the value clm = 0, corresponding to complete linkage of l and m, is reserved for the variance Formula 6ll, where it gives the formula for the decay of heterozygosity in a finite population (see e.g., CROW and KIMURA 1970, Section 3.11).

Examining Equation 6 with statistical modeling in mind, it is attractive to write the realized covariance in generation t + 1 as

Formula 7(7)
where {rho}t,lm = {rho}t,ml is an unknown parameter giving the proportion of initial covariance (or variance) remaining after t transitions of the Wright–Fisher process. A sample from the admixed population in generation t + 1 produces an estimate of Dt+1,lm, via the l, m element of the sample covariance matrix. Although {rho}t,lm is an important structural component of the model, it will not be an object of estimation. To establish later mathematical results, I assume that there exist {epsilon} and {theta} such that

Formula 8(8)
This assumption places uniform upper and lower bounds on the extent to which covariances (or variances) have decayed since the admixture "pulse," for the loci included in the study. Covariances between unlinked loci (with clm = Formula 8) are expected to decay much more quickly than covariances between linked loci: for unlinked loci, {rho}t,lm is on the order of 2t. In light of this observation, (8) significantly limits the number of generations that could have occurred since the admixture pulse. Assumption (8) also rules out changes in the signs of covariances, up to and including generation t + 1. The pulse-decay model is thus revealed to be primarily suited for recent admixture. Accordingly, any additional changes in allele frequencies and covariances due to mutation are assumed to be small enough to ignore, within the time-scale of the model.

Under the pulse-decay model, the realized covariance matrix in the admixed population at generation t + 1 can be written as

Formula 9(9)
where RFormula 9 is the symmetric matrix having l, m element {rho}t,lm, {odot} is the Hadamard (elementwise) product,

Formula 9

Formula 9
Formula 9, and Formula 9 is the symmetric matrix having l, m element Formula 9. The columns of {Gamma} are weighted vectors of source-population allele frequency deviations. The number of generations elapsed since the admixture pulse is assumed to be unknown. I do not propose to explicitly model time-dependent properties, but rather to model the effect of recent admixture on observed covariances. Hence, the dependence on t is dropped in the final expression of (9).

The properties of Formula 9 and {Psi}:

The theory of population genetics predicts that covariances between unlinked loci will be negligible in the source populations, if the source populations are not very small. Here, I address the consequences of this prediction for {Psi}, along with other properties of Formula 9 and {Psi}.

In terms of understanding and specifying the model, much is gained by making clear how model parameters are derived from stochastically varying quantities. To describe the association between alleles at loci l and m in the kth source population, I imagine a 2 x 2 table, formed from the set of gamete counts

Formula 10(10)
The counts are arranged in the table in some natural way, e.g., by assigning rows to the alleles at locus l and columns to the alleles at locus m. The following discussion parallels WEIR (1996, pp. 112–114), with the important exception that here the entire kth source population at generation zero plays the role of the "sample" in Weir's development. Under the Wright–Fisher model, the gamete counts in (10) are random variables, having a multinomial distribution with parameters 2N(k) and Formula 10; I specify the probabilities Formula 10 in more detail below for the case of unlinked loci. I view the l, m-locus gamete counts in the kth source population at generation zero as a particular realization of (10), and the allele frequencies pl(k), pm(k) appearing in Figure 1 and Equation 5 as marginal frequencies of the realized 2 x 2 table. The covariance between xl and xm in the kth source population, written as a function of the random gamete counts, is

Formula 11(11)
I view the parameter Dlm(k), appearing in Figure 1 and on the right-hand side of Equation 5, as a realization of Formula 11lm(k).

For unlinked loci l, m and 2N(k) sufficiently large, gamete frequencies follow the "independent loci" model (ETHIER 1979); under this model, the probability distribution on 2 x 2 tables has the property of row-by-column independence. Let {pi}l(k) (respectively, {pi}m(k)) be the probability that a randomly chosen gamete of the kth source population carries the "1" allele at locus l (m). Under the independent loci model, the multinomial probabilities Formula 11 are {pi}11(k) = {pi}l(k){pi}m(k), Formula 11, Formula 11, and Formula 11. It is worth noting that the Wright–Fisher process in the kth source population need not be stationary in order for unlinked l, m to follow the independent loci model (ETHIER 1979). Further, for large source populations, even weakly linked loci (those with clm < Formula 11, but clm not very small) can be treated as independent (ETHIER 1979).

By straightforward calculation, under the independent loci model, E(Formula 11lm(k)) = 0, where the expectation is over multinomial samples forming the kth source population at generation zero. For 2N(k) sufficiently large, a calculation using the "delta method" gives

Formula 12(12)
(see WEIR 1996, Equations 2.17, 3.10), again assuming the independent loci model. As a consequence of the above, when l, m are unlinked and 2N(k) is sufficiently large, the parameter Dlm(k) in Equation 5 can be approximated by a normal deviate having mean zero and variance Q(k)/2N(k), where 0 < Q(k) < 1. Provided that all of the source populations are sufficiently large, the parameters Dlm(1), ... Dlm(K) for unlinked l, m may then be viewed as a set of small, independent fluctuations about zero. I assume that the weighted average of these fluctuations, Formula 12, can be equated to zero for unlinked l, m. The fact that unlinked loci l, m are assigned to different blocks leads to the observation that

Formula 12
Consequently, {Psi} is of block-diagonal form:

Formula 13(13)
where {Psi}b is of dimensions Lb x Lb; b = 1,..., B.

The sub-matrices {Psi}b, formed by locus pairs l, m in the same block, contain covariances attributable to the shared ancestry of linked loci. Calculations of OHTA and KIMURA (1969) suggest the approximation Formula 13, for linked loci with recombination fraction clm, where Formula 13. This variance characterizes the order of magnitude of Dlm(1),...Dlm(K) when l, m are linked. For Formula 13, Var(Formula 13lm(k)) is nonnegligible, compared to the scale of a sample covariance slm. In the pulse-decay model, the Dlm(k) for linked l, m are averaged over source populations in order to form the elements of {Psi}1,..., {Psi}B. In contrast to the elements of the off-diagonal-blocks, I do not equate the elements of {Psi}1,..., {Psi}B, to zero; yet, some elements of {Psi}1,..., {Psi}B may indeed be close to zero, if the covariances Dlm(k) for linked loci tend to be of opposite signs in different source populations, and if the weights q(1),..., q(K) are nearly equal. A scenario in which one source population has a relatively large mixing proportion q leads to a straightforward interpretation of the elements of {Psi}1,..., {Psi}B: here, the element {psi}lm mainly reflects the covariance between l, m in the dominant source population.

The matrix Formula 13 contains components of covariance resulting only from differences in allele frequencies between source populations. These components are independent of any notion of physical linkage between loci, and could be considered entirely artifacts of admixture.

Formula 13 and {Psi} are symmetric by definition, and the following proposition establishes that both have the other essential property of covariance matrices. A proof is given in the APPENDIX.

PROPOSITION 1. Formula 13 and {Psi} are nonnegative definite.

In the remaining sections, I write M ≥ 0 to indicate that a symmetric matrix M is nonnegative definite, and M > 0 to indicate that M is positive definite.

The multiple battery factor analysis model:

The matrix Formula 13 containing admixture covariances has a factorization given by the following proposition; the covariance structure model is then an immediate consequence. A proof is given in the APPENDIX.

PROPOSITION 2. There exists a matrix {Lambda}Lx(K–1) of full column rank such that Formula 13.

Proposition 2 establishes that the rank of {Lambda} is one less than the number of linearly independent source population vectors p(1),..., p(K). Linear dependencies among p(1),..., p(K), which could arise if the source populations are related by recent descent from an ancestral population, or are themselves connected by migration, will reduce the rank of {Lambda} accordingly.

The matrix equation

Formula 14(14)
with {Psi} block-diagonal, is a covariance structure model associated with the "inter-battery" (for B = 2 blocks, TUCKER 1958) and "multiple battery" (for B ≥ 3, BROWNE 1980, BROWNE and Tateneni 1997) factor analysis models of psychometrics. The preceding development has shown that the covariance structure Equation 14 is implied by the pulse-decay model. In psychometrics, (14) is the covariance structure implied by the linear latent variable model

Formula 15(15)
where µLx1 and {Lambda}Lx(K–1) are unknown parameter matrices, f isin Formula 15K–1 is a vector of unobserved "inter-battery" factors with E(f) = 0 and Var(f) = I, and z isin Formula 15L is a vector of unobserved "battery-specific" factors, independent of f, with E(z) = 0 and Var(z) = {Psi} (see BROWNE 1980). In ordinary factor analysis, {Psi} is a diagonal matrix containing the "unique variances" of the elements of x (LAWLEY and MAXWELL 1971). The present statistical problem is to estimate {Lambda} and {Psi} given an observation S of {Sigma}.

Psychometric studies of "difficulty factors" (see, e.g., MCDONALD and AHLAWAT 1974) caution against the naïve application of models (14) and (15) to binary data. However, alternative models for binary observations, including latent class and response function models (see e.g., LAZARSFELD and HENRY 1968; BARTHOLOMEW and KNOTT 1999; SATTEN et al. 2001), assume that all pairs of observed variables are conditionally independent given the latent variables, and thus do not allow for the within-block covariances encoded in {Psi}. The block-diagonal structure of {Psi} is an essential aspect of both the genetic model and the statistical methods described here.


ESTIMATION OF MODEL PARAMETERS
I first describe the basic model-fitting problem for an arbitrary number of blocks B; then in subsequent sections I discuss the cases B = 2 and B ≥ 3, and give two data examples.

For {Lambda} with a given column rank K – 1, the parameters of the model (14) are said to be identified if the map from ({Lambda}, {Psi}) to {Sigma} is one-to-one, i.e.,, if

Formula 15
(JÖRESKOG 1981). In practical terms, the identifiability of a model (or lack thereof) determines whether or not unique estimates of model parameters can be found, given the sample. When parameters of (14) are not identified, the strategy I adopt below is to obtain estimates which satisfy a generalized least-squares criterion. In particular situations I will describe, the parametrization {Sigma} = Formula 15 is partly or wholly identified.

We know from population genetics that sample allele frequencies impose constraints on observed LD values, so it is natural to ensure that the same constraints apply to model estimates. As the diagonal element sFormula 15 is the sample variance of a binary variable, clearly

Formula 16(16)

For distinct loci l, m, let Formula 16 and Formula 16, where Formula 16, Formula 16 are the sample frequencies of the "1" allele at loci l and m. The inequality constraints

Formula 17(17)
(LEWONTIN 1964) derive from the fact that two-locus gamete frequencies Formula 17 are nonnegative. THOMSON and BAUR (1984) described an additional set of inequality constraints on the elements of S, deriving from the fact that three-locus gamete frequencies Formula 17 are nonnegative. For the triplet of distinct loci l, m, j, the constraints can be written as

Formula 18(18)

Formula 19(19)

Formula 20(20)

Formula 21(21)
ROBINSON et al. (1991) determined that no additional constraints on the elements of S are imposed by considering loci in groups of four or more; thus, a complete set of inequality constraints can be obtained by applying (16) to the diagonal elements of S, (17) to the elements involving distinct locus pairs l, m, and (18)–(21) to the elements involving distinct locus triplets l, m, j. I will say that model estimates Formula 21 and Formula 21 satisfy the constraints Formula 21 if (16)–(21) hold, upon replacing sample quantities slm by their estimates Formula 21, for l, m isin {1, ... L}.

I use generalized least-squares (GLS) methods for fitting the model (14) to data for two main reasons: these methods do not require explicit distributional assumptions, and the statistical properties of GLS estimates are known under fairly general conditions (see, e.g., BROWNE 1974, 1984; JÖRESKOG 1981). For a "weight" matrix VLxL > 0 yet to be chosen, GLS estimates of {Lambda} and {Psi} are obtained by minimizing the discrepancy function

Formula 22(22)
on the set

Formula 22
where || · || is the Euclidean norm. I also examine the unconstrained problem, in which (22) is minimized on {({Lambda}, {Psi}): rank({Lambda}) = K – 1, {Psi} ≥ 0}; for, a minimizer Formula 22 of the unconstrained problem may, for a given data set, satisfy the constraints Formula 22. In such a situation, Formula 22 is also the minimizer of the constrained problem. When V is not a function of {Lambda} or {Psi}, straightforward matrix differentiation (see, e.g., MAGNUS and NEUDECKER 1988, section 9.9) gives the estimating equations

Formula 23(23)
and

Formula 24(24)
where for MLxL, Bdiag[M] gives M the same block-diagonal structure as {Psi}. Equations 23 and 24 give necessary conditions for a local (or global) minimum of fV, at a differentiable point of the parameter space (MAGNUS and NEUDECKER 1988, Sect. 7.4).

In the following sections, I use weight matrices V > 0 having the same block-diagonal structure as {Psi}. There are two main reasons for this choice: first, block-diagonal V can be chosen to reflect unique features (levels of polymorphism, etc.) of particular blocks; second, block-diagonal V leads to a simple estimating equation for {Psi}. For block-diagonal V, pre- and post-multiplication by V–1, along with straightforward matrix algebra, establishes that the necessary condition (24) is equivalent to

Formula 25(25)
Consequently, model estimates within diagonal blocks equal their observed counterparts:

Formula 26(26)


THE MODEL WITH B = 2 BLOCKS
For B = 2, it is convenient to write the two blocks of a gamete vector as Formula 26 and Formula 26, with Lx + Ly = L. The covariance structure model (14) is then

Formula 27(27)
where {Lambda}x, {Lambda}y are respectively of dimensions Lx x (K – 1), Ly x (K – 1). I assume without loss of generality that Lx ≤ Ly.

Unfortunately, {Lambda} and {Psi} are essentially unidentified for the model with B = 2 blocks. Given Formula 27, Formula 27, Formula 27, and Formula 27, which solve (27), many new solutions can be generated by the family of transformations

Formula 27
where M is any conformable, invertible matrix. All of these solutions will satisfy (27) for a given {Sigma}; thus we need to impose additional criteria to choose a "good" solution from the many. Although {Lambda} and {Psi} are not identified, the off-diagonal blocks Formula 27, under the parametrization Formula 27, are identified as

Formula 27
This parametrization is less informative mathematically, but shows that matrix elements describing admixture LD between loci of different blocks are unique.

I advocate a solution of model (27) first described by RAO (1973, 1979), and here give a rationale for its use in the admixture decomposition. I assume in the following that Sxx and Syy are nonsingular and that

Formula 27
The latter is a technical assumption which ensures that each latent dimension of the admixture model is represented in the sample. The singular value decomposition (SVD) of Formula 27 is accordingly

Formula 28(28)
where Formula 28 and Formula 28 are orthogonal, Formula 28,

Formula 28
and r1 ≥ ··· ≥ rd are the d ≤ Lx singular values.

The RAO (1979) solution for the model with B = 2 blocks is

Formula 29(29)
where {Delta}(K–1) = diag(r1, ... rK–1), and U(K–1), W(K–1) respectively contain the columns of U, W corresponding to the ordered singular values in {Delta}(K–1). The singular values r1 ≥ ··· ≥ rd of SFormula 29SxySFormula 29, all of which lie in [0, 1], are the sample canonical correlations between the variables x and y.

The Rao solution can be connected to the unconstrained GLS problem by a particular choice of V in (22), and by use of the corresponding necessary condition {partial}fV/{partial}{Psi} = 0. A proof of the proposition is given in the APPENDIX.

PROPOSITION 3. For Formula 29, the Rao solution (29) gives a global minimum of fV({Lambda}, {Psi}; S) on the set {({Lambda}, {Psi}): rank ({Lambda}) = K – 1, {Psi} ≥ 0}.

Thus the Rao solution is the GLS solution for the unconstrained problem, given the choice of V immediately above. BROWNE (1979) shows that the Rao solution (29) gives maximum-likelihood estimates of ({Lambda}, {Psi}) in the multivariate normal setting, under additional identifiability conditions. The connection given by Proposition 3 between the Rao solution and the minimization of the GLS discrepancy function is apparently new.

There is no guarantee that the Rao solution (29) will satisfy the constraints Formula 29 imposed by a given data set. Numerical methods for minimization of the discrepancy function (22), which explicitly incorporate inequality constraints, have been described by LEE (1980); these remain an option if the fit of the Rao solution is unsatisfactory.

Data from The SNP Consortium:

This is the first of two data examples using SNPs organized into two blocks on the human X chromosome. My aim is more to show a proof of principle rather than to demonstrate any elaborations of the analysis. Although the examples may seem rather narrow, I do not intend that the method be limited only to very similar situations. In both examples I use only X chromosomes typed in males, in order to avoid any complications introduced by either inferring phase or estimating pairwise LD by indirect methods. Both data sets are, in a broad sense, deliberately admixed, as they combine population samples meant to characterize human ethnic or geographic diversity. However, I will not assume that ethnic or geographic labels necessarily correspond to source populations, even if evidence for admixture is found.

The first data set is a sample of n = 110 independent X chromosomes, typed by Celera Genomics as part of The SNP Consortium (TSC) Linkage Map Project (CLARK et al. 2003; MATISE et al. 2003). The n = 110 males are comprised of the following (presumably self-identified) groups: 13 Asian males from the Celera diversity panel of 30 unrelated Asians; 15 African-American males from the Celera diversity panel of 30 unrelated African Americans; and 82 Caucasian males from the Utah CEPH pedigrees typed in "phase 2" of the Linkage Map Project. I selected CEPH males who could provide independent X chromosomes: typically for a three-generation pedigree, I included the father's and paternal grandfather's X chromosomes, along with a single X chromosome chosen at random from the maternal grandfather and sons.

The 11 SNPs used in this example are a subset of 39 X-linked SNPs typed by Celera Genomics in the diversity panels and CEPH Utah pedigrees. Two broad requirements guided the selection of these SNPs. First, the SNPs can be assigned to two blocks located on different arms of the X chromosome (see Figure 2 for the physical map). Second, the number of parameters estimated in fitting the model with 11 SNPs is moderate, compared to the sample size n = 110. With 11 SNPs there are 11 x 10/2 = 55 population covariances estimated by sample statistics, along with 11 allele frequencies. There are at most 5 canonical correlations to include in the model, determined by the size of the smaller block. Once the number of canonical correlations to retain has been decided (by statistical inference), the other components of the Rao solution are uniquely determined. Thus in the present case there are at most 55 + 11 + 5 = 71 quantities to be estimated in fitting the model.


Figure 2
View larger version (7K):
In this window
In a new window
Download PPT slide
 
FIGURE 2.—

Physical map of the X-linked SNPs used in the SNP Consortium sample. Reference SNP ("rs") identifiers were obtained by searching the original SNP identifier (a numeric code preceded by "TSC") at the dbSNP web page, http://www.ncbi.nlm.nih.gov/projects/SNP/. Map positions are given in the files TSCmap.p1.Celera.txt.gz and TSCmap.p2.Celera.txt.gz, obtained from http://snp.cshl.org/. The centromere location (~60 Mb) was obtained from the UCSC Genome Browser (http://genome.ucsc.edu/index.html; May 2004 assembly). Of the 39 X-linked SNPs typed by Celera, 20 SNPs for which multiple individuals had missing genotypes or obvious typing errors (e.g., heterozygous males) were eliminated. One additional SNP having a minor allele frequency of ~0.05 was eliminated as being relatively uninformative for inferences about LD. Of the remaining 18 SNPs, 8 were on the left arm, and 10 were on the right arm of the X chromosome. The 5 most distal SNPs of the left arm and the 6 most distal SNPs of the right arm form the two blocks and are shown in the figure.

 

Inference of the number of source populations:

In the case of B = 2 blocks, fitting the Rao solution (29) turns the problem of inferring the number of independent sources into that of inferring the number of nonzero canonical correlations of the population LD matrix. Although test-statistics for making this inference exist, their null distributions are known only when the data are multivariate normal, or more generally are from an elliptically-contoured family (MUIRHEAD and WATERNAUX 1980). I use a bootstrap approach in order to avoid making assumptions about the data distribution. It will be convenient to define some basic sample quantities: for i = 1,...n, let Formula 29, Formula 29 be the x and y blocks of the ith gamete, and let Formula 29, Formula 29 be the centered data matrices. The five sample canonical correlations for the TSC data are 0.46, 0.35, 0.25, 0.16, and 0.08.

The initial inference to be made is whether or not there is any support for admixture. For the pulse-decay model, this is equivalent to testing the null hypothesis that there is only one source population, hence no contribution from the term {Lambda}{Lambda}' in (14). Under this hypothesis {Sigma} itself is block-diagonal. A likelihood-ratio test for the null hypothesis of blockwise independence (block-diagonal covariance structure), assuming multivariate normal data, is well known (see, e.g., ANDERSON 1984, Chap. 9; SEBER 1984, Sect. 3.5). For B = 2 blocks, the test-statistic is

Formula 30(30)
where r1 ≥ ··· ≥ rd are the sample canonical correlations (ANDERSON 1984, Sect. 12.4.1). For multivariate normal data, the sample value t0 is compared to the {chi}2 distribution on LxLy degrees of freedom.

The null hypothesis of blockwise independence suggests a natural permutation strategy that I use to obtain a null distribution for T0: if the sample consists of pairs (x1, y1) , ... , (xn, yn), then a permuted sample under blockwise independence is (x1, y{pi}(1)), ... , (xn, y{pi}(n)), where {pi} is a uniform random permutation of {1, ... , n}. I computed the test-statistic T0 for each of R = 9, 999 permuted samples, and obtained an approximate p-value of

Formula 30
for the test of blockwise independence, where tFormula 30, ... , tFormula 30 are the values of T0 for the permuted samples (see DAVISON and HINKLEY 1997, Sect. 4.3). There is then good evidence against simple block-diagonal structure in the TSC data.

The permutation test does not extend readily to sequential tests which add nonzero canonical correlations to the model one at a time. So I proceed by computing sequential confidence regions, for the first canonical correlation, then for the first and second canonical correlations jointly, etc., by nonparametric bootstrapping. I take inclusion of the value zero in a confidence region as evidence that the corresponding correlation is nonsignificant.

Construction of a confidence interval for {rho}1 is the first step. I use a bootstrap approach described by DAVISON and HINKLEY (1997, section 2.7.2) which employs smooth transformations of sample statistics and corresponding variance estimates. An approximate pivotal quantity can be obtained from the first sample canonical correlation r1 using FISHER's (1921) transformation

Formula 31(31)
along with a variance approximation calculated from empirical influence values. The variance approximation is

Formula 31
where lj is the empirical influence value for the jth observation on r1 (see DAVISON and HINKLEY 1997, Equation 2.38). GU and FUNG (1998) show that lj can be written as

Formula 32(32)
where sj1 and tj1 are the jth elements of Formula 32 and Formula 32, respectively, and Formula 32, Formula 32 are the sample canonical vectors associated with r1. The basic quantity to be bootstrapped is then Formula 32. The function boot.ci, available in the boot library of the statistical software R (R DEVELOPMENT CORE TEAM 2005, version 2.2.1), handles the confidence interval calculations readily. A 95% studentized confidence interval for {rho}1 for the TSC data, obtained from R = 9, 999 ordinary nonparametric bootstraps, is (0.18, 0.47), confirming the existence of a first canonical correlation clearly distinct from zero. As the sample canonical correlation is r1 = 0.46, r1 is apparently highly biased as a naïve estimator of {rho}1. This can be confirmed by performing basic bootstraps of r1.

Bootstrapping the first two or more canonical correlations together requires additional care: the nonindependence of the canonical correlations (imposed by their rank ordering) calls for joint bootstrapping and the use of a covariance matrix reflecting the joint structure. For a joint bootstrap of the first two canonical correlations, each transformed according to Fisher's function, the covariance matrix estimate based on empirical influence values is d'Formula 32d, where Formula 32,

Formula 33(33)

Formula 33
Formula 33, Formula 33, and Formula 33, Formula 33 are the sample canonical vectors associated with r2 (see HAMPEL et al. 1986, Equation 4.2.2; DAVISON and HINKLEY 1997, Sect. 5.8). The basic quantity to be bootstrapped is then the quadratic form

Formula 34(34)
where

Formula 34
The quantile Formula 34 is the key quantity recovered from the bootstrap samples; from this, a 100(1 – {alpha})% joint confidence region for ({rho}1, {rho}2) is obtained as

Formula 35(35)
(see DAVISON and HINKLEY 1997, Sect. 5.8). I checked the stability of the required matrix inversion by a preliminary bootstrap of the singular values of d'Formula 35d. In the present case, d'Formula 35d was invertible in all R = 9, 999 bootstrap samples.

Figure 3 shows joint confidence regions for ({rho}1, {rho}2) based on R = 9, 999 bootstrap samples. The joint confidence regions clearly support the inference {rho}1 > 0, and give weaker, though measurable, support for {rho}2 > 0. Before settling on a model with two canonical correlations, support for {rho}3 > 0 needs to be investigated. I used the natural extensions of (33)–(35) to three variables, to compute joint confidence regions for ({rho}1, {rho}2, {rho}3) based on R = 9, 999 bootstrap samples. With three parameters the confidence regions are nested three-dimensional surfaces, similar to the Russian dolls known as "matreshki." The nested surfaces are not easy to display graphically, so in Figure 4 I show the intersections of the surfaces with the two-dimensional ({rho}1, {rho}2) plane, in which {rho}3 = 0. Each of the three confidence regions intersects the plane, so I conclude that there is little support for a third canonical correlation. Thus the inference is that there are two significant correlations in the TSC data, indicating the presence of three independent sources of variation.


Figure 3
View larger version (15K):
In this window
In a new window
Download PPT slide
 
FIGURE 3.—

Joint confidence regions for ({rho}1, {rho}2) for the TSC data, obtained by nonparametric bootstrapping. The 45° line and horizontal line through zero mark the region on which ({rho}1, {rho}2) are defined. Confidence regions are shown as contours of the quadratic form q, given by expressions (34) and (35).

 

Figure 4
View larger version (12K):
In this window
In a new window
Download PPT slide
 
FIGURE 4.—

Intersections of the joint confidence regions for ({rho}1, {rho}2, {rho}3) with the plane {rho}3 = 0. The contours enclose portions of the plane {rho}3 = 0 contained in the three-dimensional confidence regions; thus, for a given confidence level, the area within the contour indicates how much support there is for the inference {rho}3 = 0.

 

Other methods for the TSC data:

The computer program Structure, available at http://pritch.bsd.uchicago.edu/structure.html, implements Bayesian methods for fitting admixture models described in PRITCHARD et al. (2000) and FALUSH et al. (2003). PRITCHARD et al. (2000) describe a method for inferring the number of source populations K, based on estimates of the probability of the sample given different values of K. I used Structure, along with rules of thumb described in the accompanying user's manual, to infer the number of source populations for the TSC data, for comparison with the inference made above. For each of three models of ancestry implemented in Structure (the "no admixture", "admixture" and "linkage" models), I carried out five independent runs of the Structure program at each value of K = 1, ... , 5 (for a total of 75 runs). For each run, a "burn-in" of 100,000 iterations was followed by 1,000,000 Markov-chain Monte Carlo updates. Arguably, none of the ancestry models implemented in Structure are appropriate for the TSC SNPs, some of which are very tightly linked. Each ancestry model gave strongest support to the value K = 1 for the TSC data. For models with K > 1, in all runs, the overall sample proportion assigned to a "cluster" was 1/K for each cluster. These results are understood to indicate an absence of population structure.

The programs in Eigensoft (http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm) perform a principal components analysis described by PATTERSON et al. (2006) for inference of admixture. Under this method, K – 1 is inferred as the number of statistically significant eigenvalues of an interindividual covariance matrix [having a form similar to expression (4)]. As for Structure, I have reservations about using Eigensoft on the TSC sample, because the data architecture is of a form not intended by PATTERSON et al. (2006). In particular, Patterson et al. intend (but do not require) that the number of markers L be larger than the number of individuals n, and acknowledge that their method is not designed specifically to accomodate markers in linkage disequilibrium. I applied the Eigensoft genotype coding scheme to the TSC X chromosomes as follows: a male with the "reference" allele (chosen arbitrarily) at an X chromosome locus has the value 0 at the locus, whereas a male with the "variant" allele has the value 2. I used the regression method described in PATTERSON et al. (2006) to correct for LD, and varied the number of adjacent markers used for adjustment between 1, 2 and 3. In each case Eigensoft called all eigenvalues nonsignificant, thus the inference would be that only one population contributed to the TSC sample.

The admixture decomposition:

The Rao solution can be computed by straightforward matrix operations commonly implemented in statistical or numerical packages. I do the calculations in R (R DEVELOPMENT CORE TEAM 2005, version 2.2.1), and describe a numerically stable method for computing the decomposition in the APPENDIX.

The Rao decomposition for the TSC databased on two canonical correlations is shown graphically in Figure 5. The estimates {Lambda}*{Lambda}*' and {Psi}* satisfy the constraints Formula 35 imposed by the sample allele frequencies. Covariances due to admixture, involving rs1541354 and rs1859004 (respectively, the third and fourth SNPs from the left in Figure 5) with several of the other SNPs, can be seen as vertical bands in the third and fourth columns of {Lambda}*{Lambda}*'. A comparison of the off-diagonal block of the residual matrix with the corresponding block of S reveals a notable tendency for residual covariances to be corrected toward zero.


Figure 5
View larger version (32K):
In this window
In a new window
Download PPT slide
 
FIGURE 5.—

Graphical representation of the covariance structure model fitted to the TSC data. The shaded squares depict absolute values of matrix elements below the main diagonal, for the sample covariance matrix S, the matrix of estimated admixture covariances {Lambda}*{Lambda}*' (for the model with two canonical correlations), and the residual matrix S{Lambda}*{Lambda}*'. The key, top right, gives the shading scheme applied to absolute values of matrix elements. The off-diagonal block containing covariances between SNPs of different blocks is marked by thick lines. The estimate {Psi}* is obtained by setting the elements of the off-diagonal block of the residual matrix equal to zero.

 

Data from the International HapMap Consortium:

In broad terms, the second example repeats the analysis of the TSC data, using a different sample of X chromosomes typed by the INTERNATIONAL HAPMAP CONSORTIUM (2005), HapMap Public Release #21, http://www.hapmap.org. Here I want to demonstrate that a different sample having similar structure yields similar results. The HapMap sample consists of 53 males of the Yoruban parent-offspring trios from Ibadan, Nigeria; 23 Japanese males from Tokyo; 22 Han Chinese males from Beijing; and 44 males of the Utah CEPH parent-offspring trios. Some of the Utah CEPH pedigrees typed by Celera Genomics for TSC were also typed by HapMap; thus 20 of the Utah males I include in the HapMap sample were also in the TSC sample.

Although HapMap typed many more SNPs overall than TSC, the particular SNPs I used for the TSC sample were incompletely typed by HapMap. I elected to choose a new set of SNPs for the HapMap sample, located in the same X-chromosome regions represented by blocks 1 and 2 of the TSC sample. The HapMap browser locates the TSC SNPs of block 1 in the X-chromosome region 9.2 - 13.7Mb, and the SNPs of block 2 in 139.3 - 147.3Mb. I judged that a model with 15 SNPs would preserve a reasonable balance between sample size (n = 142) and number of parameters. I chose 15 SNPs at random from those typed by HapMap, allocated to blocks 1 and 2 in rough proportion to the sizes of the two regions. Prior to choosing the 15 SNPs at random, I excluded SNPs from the two regions which had obvious genotyping errors or which failed to type in one or more individuals. I discarded approximately three random SNP sets because the sample covariance matrices they produced were singular, or because the Rao decomposition based on 1, 2 or 3 canonical correlations would have failed to satisfy the constraints imposed by the sample allele frequencies. The latter biased the selection toward a SNP set which yields a good fit of the model, but did not privilege any particular value of K. Figure 6 shows the physical map of the 15 chosen SNPs.


Figure 6
View larger version (10K):
In this window
In a new window
Download PPT slide
 
FIGURE 6.—

Physical map of the X-linked SNPs used in the HapMap sample. Details of the SNP selection procedure are given in the text. The flat files containing raw data and SNP positions are available at http://www.hapmap.org/.

 
I used the same data analysis strategy for the HapMap sample as for the TSC sample, so I will summarize results briefly. The six sample canonical correlations for the HapMap data are 0.50, 0.41, 0.33, 0.23, 0.20 and 0.09. The permutation test of the statistic T0 produced an approximate p-value of 0.0007, indicating the presence of at least one significant canonical correlation. A 95% confidence interval for the first correlation is (0.25, 0.49), and joint confidence regions for the first two correlations are shown in Figure 7. The HapMap plot (not shown) analogous to Figure 4 has no contours in the ({rho}1, {rho}2) plane: thus {rho}3 = 0 is excluded from each of the 95, 90, and 85% confidence regions. I take this to imply support for a third correlation. Graphical display of 4-dimensional regions is highly problematic, so I checked for the presence of a fourth significant correlation numerically. I bootstrapped the four-dimensional analog of (34) in order to estimate quantiles q*, then evaluated (34) at candidate points ({rho}1, {rho}2, {rho}3, {rho}4 = 0), with {rho}1, {rho}2 and {rho}3 varying in a region of high support. I found many such parameter combinations inside the joint confidence regions, and concluded that there is little evidence in the HapMap sample for a fourth correlation. Thus the inference is that there are three significant correlations in the HapMap data, indicating the presence of four independent sources of variation. The Rao decomposition for the HapMap data, based on three canonical correlations, is shown graphically in Figure 8. The decomposition satisfies the constraints Formula 35.


Figure 7
View larger version (14K):
In this window
In a new window
Download PPT slide
 
FIGURE 7.—

Joint confidence regions for ({rho}1, {rho}2) for the HapMap data, analogous to Figure 3. The HapMap plot analogous to Figure 4 has no contours in the {rho}3 = 0 plane.

 

Figure 8
View larger version (33K):
In this window
In a new window
Download PPT slide
 
FIGURE 8.—

Graphical representation of the covariance structure model fitted to the HapMap data, analogous to Figure 5. The decomposition uses three canonical correlations.

 

Other methods for the HapMap data:

I followed the same strategy using Structure to infer K for the HapMap data as for the TSC data. In contrast to the TSC data, Structure finds evidence for multiple clusters in the HapMap data; in particular, the "linkage" model produces evidence for four clusters. In five runs of the linkage model with K = 4, Structure consistently assigned overall sample proportions near 0.27 to three clusters, and 0.2 to a fourth cluster. Posterior probabilities of cluster membership suggest that most individuals are of mixed ancestry, having roughly equal proportions of their alleles assigned to two or more clusters.

I followed the same strategy using Eigensoft for the HapMap data as for the TSC data. Eigensoft calls one eigenvalue significant when either 1 or 2 adjacent markers are used for LD adjustment, and calls no eigenvalues significant when 3 adjacent markers are used. Thus Eigensoft suggests that at most two sub-populations have contributed to the HapMap X chromosome data.


THE MODEL WITH B ≥ 3 BLOCKS
Under mild conditions, the model with B ≥ 3 is identified up to post-multiplication of {Lambda} by an orthogonal matrix (i.e., up to rotations of {Lambda}). However, explicit formulae for parameter estimates are not readily available.

To fix notation, the block-partitioned parameter matrices are {Psi}, of the form (13), and

Formula 35
where {Lambda}b is Lb x (K – 1); b = 1 , ... , B. For B ≥ 3, BROWNE (1980, Proposition 2) shows that a sufficient condition for identification of {Psi}, and of {Lambda} up to rotations, is that at least three of the sub-matrices {Lambda}b of {Lambda} be of full column rank.

An assumption of the pulse-decay model, built on the notion that Formula 35, is that the source-population vectors p(1), ... p(K) are linearly independent. If Lb, the number of markers in the bth block, is sufficiently large, it is reasonable to assume that linear independence also holds for the block-specific sub-vectors pb(1) , ... , pb(K). In fact, provided Lb ≥ K – 1, a block-specific version of Proposition 2 (of this paper) is readily established, using suitably defined matrices {Gamma}b, Formula 35bb, and a proof of the same form. Browne's Proposition consequently implies that {Psi} is identified, and {Lambda} is identified up to rotations, if Lb ≥ K – 1 for at least three blocks b isin {1, ... B}. Furthermore, when at least three blocks meet this minimum size criterion, the matrix Formula 35 = {Lambda}{Lambda}' is unique: for Formula 35 = {Lambda}{Lambda}' = {Lambda}Q ({Lambda}Q)' when Q is orthogonal.

Obtaining an analog of Proposition 3 (of this article) for B ≥ 3 appears to be an open problem. Using

Formula 35
{Psi} = B diag[S{Lambda}{Lambda}'] and the reasoning of Proposition 3, GLS estimates {Lambda}Formula 35 , ... , {Lambda}Formula 35 for the unconstrained problem would be obtained by minimizing

Formula 36(36)
It is here that the strategy used for the model with B = 2 blocks breaks down, for although the Eckart–Young theorem (GOWER and HAND 1996, Sect. A.4.4) can be used to minimize the individual summands of (36), it is not clear how to reconcile the different expressions for {Lambda}Formula 36 obtained as i varies.

At present, numerical estimation of ({Lambda}, {Psi}) appears to be the only feasible option for the model with B ≥ 3. LEE (1980) gives algorithms for GLS estimation under parameter constraints, which are appropriate for the data under consideration. For inference of K, an ad hoc strategy involving sequential goodness-of-fit tests, similar to a method used in ordinary factor analysis (see, e.g., BARTHOLOMEW and KNOTT 1999, Sect. 3.8–3.9), may be pursued. For a candidate value K and a GLS estimate {Lambda}*, where {Lambda}* has K – 1 columns, the fit of the model may be assessed by testing the residual matrix S{Lambda}*{Lambda}*' for block-diagonal structure (see e.g., ANDERSON 1984, Chap. 9; SEBER 1984, Sect. 3.5); this assumes that such tests could be suitably adapted to multivariate binary observations. The smallest candidate value