## Abstract

I derive a covariance structure model for pairwise linkage disequilibrium (LD) between binary markers in a recently admixed population and use a generalized least-squares method to fit the model to two different data sets. Both linked and unlinked marker pairs are incorporated in the model. Under the model, a pairwise LD matrix is decomposed into two component matrices, one containing LD attributable to admixture, and another containing, in an aggregate form, LD specific to the populations forming the mixture. I use population genetics theory to show that the latter matrix has block-diagonal structure. For the data sets considered here, I show that the number of source populations can be determined by statistical inference on the canonical correlations of the sample LD matrix.

ADMIXTURE, the mixing of genetically differentiated populations via migration and subsequent intermating, can create linkage disequilibrium (LD) between genes, even when the genes are not physically linked (see, *e.g.*, Cavalli-Sforza and Bodmer 1971, p. 69; Prout 1973). In this work, I show that admixture contributions to LD can be statistically quantified and distinguished from LD attributable to the shared ancestry of linked alleles. Ohta (1982) used Wright's island model to decompose a squared coefficient of LD into within- and between-population terms, in analogy with Wright's (1940) decomposition of the inbreeding coefficient. These decompositions assume that populations connected by migration can be identified and sampled for genetic variation. The method I propose uses a pairwise LD matrix sampled from an admixed population of unknown composition; the number of source populations and the components of LD are inferred by use of a multivariate statistical model.

#### The data: blocks of binary markers:

It is convenient to develop the model using gametes as the basic units of observation. The data are then *n* random binary vectors of the form **x** = (*x*_{1},…*x _{L}*)′, with

*x*∈ {0, 1};

_{l}*l*= 1,…,

*L*. Each vector represents the single nucleotide polymorphism (SNP) variation on one sampled gamete, under an arbitrary binary coding scheme, with

*x*indicating the allele on gamete

_{l}**x**at the

*l*th marker locus. The markers are assumed to be selectively neutral and variable in the sample.

Before statistical analysis begins, the markers in **x** are to be grouped into *blocks* by the investigator, based on physical criteria (*e.g.*, the markers in a block share a localized region on a physical map), along with empirical evidence (*e.g.*, the markers in a block are known by a previous linkage-mapping study to form a linkage group) independent of the sample under consideration. Each marker belongs to exactly one block; however, a particular marker may be the only member of a block. Any two markers *l*, *m* within the same block are assumed to be linked, with recombination fraction . In contrast, markers *l*, *j* from *different* blocks are assumed to be unlinked, with recombination fraction *c _{lj}* ≈ .

In the development below, block structure derives from physical and linkage relationships between markers, with blocks analogous to linkage groups; this is to be distinguished from empirical descriptions of “haplotype block structure” (see, *e.g.*, Gabriel *et al.* 2002; Phillips *et al.* 2003). Recent work of the International HapMap Consortium (2005) and Myers *et al.* (2005) suggests that haplotype block structure results from variation in recombination rates over small physical distances. The fine-scale rate estimates obtained by Myers *et al.* (2005) demonstrate new tools for constructing blocks of tightly linked markers. In the data examples to follow, I form two blocks of markers on the arms of the human X chromosome, using simple physical criteria.

To fix notation, I assume that **x** can be partitioned into *B* blocks, labeled 1,…, *B* in any convenient order (though the same partitioning and labeling scheme is used for all gametes). For *b* = 1,…*B*, the *b*th block contains *L _{b}* ≥ 1 marker loci, with . I order the vector elements

*x*

_{1},…,

*x*so that the indices are assigned to the loci of the

_{L}*b*th block. However, markers

*within*a block need not be ordered in any particular way.

#### For binary loci, linkage disequilibrium is covariance:

For binary loci, a familiar pairwise LD parameter (Lewontin and Kojima 1960) is a covariance between locus-specific indicator variables. Let *p _{l}* be the population frequency of the “1” allele at locus

*l*, let

*p*be the population frequency of gametes carrying the “1” allele at both loci

_{lm}*l*and

*m*, and let

*E*be the expectation operator for the distribution on (

*x*,

_{l}*x*) ∈ {0, 1} × {0, 1} parametrized by

_{m}*p*,

_{l}*p*, and

_{m}*p*. The LD parameter for loci

_{lm}*l*and

*m*is then(1)For consistent definitions, I take

*D*to be the variance of the locus-specific indicator:(2)Given a random sample of binary gametes

_{ll}**x**

_{1}, … ,

**x**

_{n}, with , I treat the

*l*,

*m*entry of the sample covariance matrix(3)as an estimate of

*D*. I use the denominator

_{lm}*n*in (3), rather than the denominator

*n*− 1 typically used for unbiased estimation, for consistency with sample allele frequency calculations. The components of linkage disequilibrium under the admixture model are determined by the structure of

**S**and the sample allele frequencies. Other sample-based estimates of the population covariance matrix may provide suitable alternatives to

**S**; the composite LD estimates described by Cockerham and Weir (1977; see also Weir 1979) are particularly attractive, as they do not require direct observation of gametes. A complete description of the associations among

*L*binary markers requires the use of third- and higher-order moments, along with the second-order moments considered here (see,

*e.g.*, Ekholm

*et al.*1995). Nonetheless, relevant structure in the data induced by admixture can be detected by an analysis limited to covariances.

In a recent article, Patterson *et al.* (2006) also use a sample covariance matrix to make inferences about admixture. Although the locus-specific variables defined by Patterson *et al.* refer to diploid genotypes and take values in {0, 1, 2}, it is useful to establish a relationship between the matrix they analyze and **S**. In the notation above (up to differences in variable definitions, and omitting a normalization step), the *n* × *n* matrix of inter-individual covariances analyzed by Patterson *et al.* (2006) could be written as(4)Thus the implicit expectation is over loci rather than over individuals in the population, and covariances are between multilocus genotypes rather than between locus-specific indicators. Wickens (1995) uses the terms *variable space* and *subject space* to distinguish the multivariate spaces that give rise to **S** and (4), respectively. Gower (1966) and Jolliffe (2002, Sect. 5.2) compare and contrast the two approaches, although a comparison in the context of admixture inference is beyond the scope of this work.

## THE PULSE-DECAY MODEL OF ADMIXTURE AND A MATRIX DECOMPOSITION

The “pulse-decay” model (shown schematically in Figure 1) is a highly simplified admixture model that, somewhat unexpectedly, shares mathematical properties with traditional covariance structure models. It is a variation of the “immediate admixture” model of Ewens and Spielman (1995), emphasizing different statistical properties. I give a detailed derivation to show that the statistical model used later for data analysis is implied by the pulse-decay model.

Under the pulse-decay model, generation 1 of the admixed population consists of *N* diploid genotypes sampled at random with replacement from *K* ≥ 2 source populations. The probability that a genotype is chosen from the *k*th source population is *q*(*k*) > 0; *k* = 1,…, *K*, with . The mixing proportions *q*(1),… *q*(*K*) are unknown. In practice, *K* may be an unknown parameter requiring estimation. A genotype contributed by the *k*th source population consists of two gametes chosen at random with replacement from the 2*N*(*k*) gametes present at generation zero in the *k*th source population. At generation zero, the frequency of the “1” allele at locus *l* in source population *k* is *p _{l}*(

*k*),

*k*= 1,…,

*K*, and the covariance between loci

*l*and

*m*in source population

*k*is

*D*(

_{lm}*k*). I call the vector of allele frequencies ′ from source population

*k*

**p**(

*k*), and I assume that the source population allele frequency vectors

**p**(1),…,

**p**(

*K*) are linearly independent. The latter assumption implies that the number of markers

*L*is at least as large as the number of source populations

*K*. Linear independence of

**p**(1),…,

**p**(

*K*) also implies that no source population is an admixture of the other source populations. I describe the effect of weakening the assumption of linear independence briefly in a later section. Under the pulse-decay model, the source populations make no further contributions to the sampled population after generation 1. Mating is at random in the admixed population from generation 1 onward.

By straightforward calculation, the expected frequency of the “1” allele at locus *l*, at generation 1 in the admixed population, is , where the expectation is over samples drawn from the source populations. Similarly, the covariance between loci *l*, *m* at generation one in the admixed population is(5)where *D _{ll}*(

*k*) is the locus-specific variance in the

*k*th source population.

The expected covariance in generation *t* + 1, over realizations of the Wright–Fisher process in the admixed population, is(6)where *c _{lm}* =

*c*∈ (0, ] is the recombination fraction between loci

_{ml}*l*and

*m*(see,

*e.g.*, Karlin and McGregor 1968, Equation 7). Weir and Cockerham (1974) provide related expressions for “genotypic” disequilibria (see Weir 1996, Chap. 3), which completely specify two-locus dynamics in diploid organisms. Whereas

*D*

_{1,lm}of Equation 5 is a population parameter, 𝒟

_{t+1,lm}of Equation 6 is a random variable. In Equation 6, the value

*c*= 0, corresponding to complete linkage of

_{lm}*l*and

*m*, is reserved for the variance 𝒟

_{ll}, where it gives the formula for the decay of heterozygosity in a finite population (see

*e.g.*, Crow and Kimura 1970, Section 3.11).

Examining Equation 6 with statistical modeling in mind, it is attractive to write the *realized* covariance in generation *t* + 1 as(7)where ρ_{t,lm} = ρ_{t,ml} is an unknown parameter giving the proportion of initial covariance (or variance) remaining after *t* transitions of the Wright–Fisher process. A sample from the admixed population in generation *t* + 1 produces an estimate of *D _{t}*

_{+1,lm}, via the

*l*,

*m*element of the sample covariance matrix. Although ρ

_{t,lm}is an important structural component of the model, it will not be an object of estimation. To establish later mathematical results, I assume that there exist ϵ and θ such that(8)This assumption places uniform upper and lower bounds on the extent to which covariances (or variances) have decayed since the admixture “pulse,” for the loci included in the study. Covariances between

*unlinked*loci (with

*c*= ) are expected to decay much more quickly than covariances between linked loci: for unlinked loci, ρ

_{lm}_{t,lm}is on the order of 2

^{−t}. In light of this observation, (8) significantly limits the number of generations that could have occurred since the admixture pulse. Assumption (8) also rules out changes in the signs of covariances, up to and including generation

*t*+ 1. The pulse-decay model is thus revealed to be primarily suited for recent admixture. Accordingly, any additional changes in allele frequencies and covariances due to mutation are assumed to be small enough to ignore, within the time-scale of the model.

Under the pulse-decay model, the realized covariance matrix in the admixed population at generation *t* + 1 can be written as(9)where **R** is the symmetric matrix having *l*, *m* element ρ_{t,lm}, ⊙ is the Hadamard (elementwise) product,, and is the symmetric matrix having *l*, *m* element . The columns of **Γ** are weighted vectors of source-population allele frequency deviations. The number of generations elapsed since the admixture pulse is assumed to be unknown. I do not propose to explicitly model time-dependent properties, but rather to model the effect of recent admixture on observed covariances. Hence, the dependence on *t* is dropped in the final expression of (9).

#### The properties of and Ψ:

The theory of population genetics predicts that covariances between unlinked loci will be negligible in the source populations, if the source populations are not very small. Here, I address the consequences of this prediction for **Ψ**, along with other properties of and **Ψ**.

In terms of understanding and specifying the model, much is gained by making clear how model parameters are derived from stochastically varying quantities. To describe the association between alleles at loci *l* and *m* in the *k ^{th}* source population, I imagine a 2 × 2 table, formed from the set of gamete counts(10)The counts are arranged in the table in some natural way,

*e.g.*, by assigning rows to the alleles at locus

*l*and columns to the alleles at locus

*m*. The following discussion parallels Weir (1996, pp. 112–114), with the important exception that here the entire

*k*th source population at generation zero plays the role of the “sample” in Weir's development. Under the Wright–Fisher model, the gamete counts in (10) are random variables, having a multinomial distribution with parameters 2

*N*(

*k*) and ; I specify the probabilities in more detail below for the case of unlinked loci. I view the

*l*,

*m*-locus gamete counts in the

*k*th source population at generation zero as a particular realization of (10), and the allele frequencies

*p*(

_{l}*k*),

*p*(

_{m}*k*) appearing in Figure 1 and Equation 5 as marginal frequencies of the realized 2 × 2 table. The covariance between

*x*and

_{l}*x*in the

_{m}*k*th source population, written as a function of the random gamete counts, is(11)I view the parameter

*D*(

_{lm}*k*), appearing in Figure 1 and on the right-hand side of Equation 5, as a realization of 𝒟

_{lm}(

*k*).

For *unlinked* loci *l*, *m* and 2*N*(*k*) sufficiently large, gamete frequencies follow the “independent loci” model (Ethier 1979); under this model, the probability distribution on 2 × 2 tables has the property of row-by-column independence. Let π_{l}(*k*) (respectively, π_{m}(*k*)) be the probability that a randomly chosen gamete of the *k*th source population carries the “1” allele at locus *l* (*m*). Under the independent loci model, the multinomial probabilities are π_{11}(*k*) = π_{l}(*k*)π_{m}(*k*), , , and . It is worth noting that the Wright–Fisher process in the *k*th source population need not be stationary in order for unlinked *l*, *m* to follow the independent loci model (Ethier 1979). Further, for large source populations, even weakly linked loci (those with *c _{lm}* < , but

*c*not very small) can be treated as independent (Ethier 1979).

_{lm}By straightforward calculation, under the independent loci model, *E*(𝒟_{lm}(*k*)) = 0, where the expectation is over multinomial samples forming the *k*th source population at generation zero. For 2*N*(*k*) sufficiently large, a calculation using the “delta method” gives(12)(see Weir 1996, Equations 2.17, 3.10), again assuming the independent loci model. As a consequence of the above, when *l*, *m* are unlinked and 2*N*(*k*) is sufficiently large, the parameter *D _{lm}*(

*k*) in Equation 5 can be approximated by a normal deviate having mean zero and variance

*Q*(

*k*)/2

*N*(

*k*), where 0 <

*Q*(

*k*) < 1. Provided that all of the source populations are sufficiently large, the parameters

*D*(1), …

_{lm}*D*(

_{lm}*K*) for unlinked

*l*,

*m*may then be viewed as a set of small, independent fluctuations about zero. I assume that the weighted average of these fluctuations, , can be equated to zero for unlinked

*l*,

*m*. The fact that unlinked loci

*l*,

*m*are assigned to different blocks leads to the observation thatConsequently,

**Ψ**is of block-diagonal form:(13)where

**Ψ**

_{b}is of dimensions

*L*×

_{b}*L*;

_{b}*b*= 1,…,

*B*.

The sub-matrices **Ψ**_{b}, formed by locus pairs *l*, *m* in the *same* block, contain covariances attributable to the shared ancestry of linked loci. Calculations of Ohta and Kimura (1969) suggest the approximation , for linked loci with recombination fraction *c _{lm}*, where . This variance characterizes the order of magnitude of

*D*(1),…

_{lm}*D*(

_{lm}*K*) when

*l*,

*m*are linked. For , Var(𝒟

_{lm}(

*k*)) is nonnegligible, compared to the scale of a sample covariance

*s*. In the pulse-decay model, the

_{lm}*D*(

_{lm}*k*) for linked

*l*,

*m*are averaged over source populations in order to form the elements of

**Ψ**

_{1},…,

**Ψ**

_{B}. In contrast to the elements of the off-diagonal-blocks, I do not equate the elements of

**Ψ**

_{1},…,

**Ψ**

_{B}, to zero; yet, some elements of

**Ψ**

_{1},…,

**Ψ**

_{B}may indeed be close to zero, if the covariances

*D*(

_{lm}*k*) for linked loci tend to be of opposite signs in different source populations, and if the weights

*q*(1),…,

*q*(

*K*) are nearly equal. A scenario in which one source population has a relatively large mixing proportion

*q*leads to a straightforward interpretation of the elements of

**Ψ**

_{1},…,

**Ψ**

_{B}: here, the element ψ

_{lm}mainly reflects the covariance between

*l*,

*m*in the dominant source population.

The matrix contains components of covariance resulting only from differences in allele frequencies between source populations. These components are independent of any notion of physical linkage between loci, and could be considered entirely artifacts of admixture.

and **Ψ** are symmetric by definition, and the following proposition establishes that both have the other essential property of covariance matrices. A proof is given in the appendix.

Proposition 1. *and* Ψ *are nonnegative definite.*

In the remaining sections, I write **M** ≥ 0 to indicate that a symmetric matrix **M** is nonnegative definite, and **M** > 0 to indicate that **M** is positive definite.

#### The multiple battery factor analysis model:

The matrix containing admixture covariances has a factorization given by the following proposition; the covariance structure model is then an immediate consequence. A proof is given in the appendix.

Proposition 2. *There exists a matrix* **Λ**^{L}^{×(K–1)} *of full column rank such that* .

Proposition 2 establishes that the rank of **Λ** is one less than the number of linearly independent source population vectors **p**(1),…, **p**(*K*). Linear dependencies among **p**(1),…, **p**(*K*), which could arise if the source populations are related by recent descent from an ancestral population, or are themselves connected by migration, will reduce the rank of **Λ** accordingly.

The matrix equation(14)with **Ψ** block-diagonal, is a covariance structure model associated with the “inter-battery” (for *B* = 2 blocks, Tucker 1958) and “multiple battery” (for *B* ≥ 3, Browne 1980, Browne and Tateneni 1997) factor analysis models of psychometrics. The preceding development has shown that the covariance structure Equation 14 is implied by the pulse-decay model. In psychometrics, (14) is the covariance structure implied by the linear latent variable model(15)where **μ**^{L}^{×1} and **Λ**^{L}^{×(K–1)} are unknown parameter matrices, **f** ∈ ^{K}^{−1} is a vector of unobserved “inter-battery” factors with *E*(**f**) = **0** and Var(**f**) = **I**, and **z** ∈ * ^{L}* is a vector of unobserved “battery-specific” factors, independent of

**f**, with

*E*(

**z**) =

**0**and Var(

**z**) =

**Ψ**(see Browne 1980). In ordinary factor analysis,

**Ψ**is a diagonal matrix containing the “unique variances” of the elements of

**x**(Lawley and Maxwell 1971). The present statistical problem is to estimate

**Λ**and

**Ψ**given an observation

**S**of

**Σ**.

Psychometric studies of “difficulty factors” (see, *e.g.*, McDonald and Ahlawat 1974) caution against the naïve application of models (14) and (15) to binary data. However, alternative models for binary observations, including latent class and response function models (see *e.g.*, Lazarsfeld and Henry 1968; Bartholomew and Knott 1999; Satten *et al.* 2001), assume that all pairs of observed variables are conditionally independent given the latent variables, and thus do not allow for the within-block covariances encoded in **Ψ**. The block-diagonal structure of **Ψ** is an essential aspect of both the genetic model and the statistical methods described here.

## ESTIMATION OF MODEL PARAMETERS

I first describe the basic model-fitting problem for an arbitrary number of blocks *B*; then in subsequent sections I discuss the cases *B* = 2 and *B* ≥ 3, and give two data examples.

For **Λ** with a given column rank *K* – 1, the parameters of the model (14) are said to be *identified* if the map from (**Λ**, **Ψ**) to **Σ** is one-to-one, *i.e.*,, if(Jöreskog 1981). In practical terms, the identifiability of a model (or lack thereof) determines whether or not unique estimates of model parameters can be found, given the sample. When parameters of (14) are not identified, the strategy I adopt below is to obtain estimates which satisfy a generalized least-squares criterion. In particular situations I will describe, the parametrization **Σ** = is partly or wholly identified.

We know from population genetics that sample allele frequencies impose constraints on observed LD values, so it is natural to ensure that the same constraints apply to model estimates. As the diagonal element *s* is the sample variance of a binary variable, clearly(16)

For distinct loci *l*, *m*, let and , where , are the sample frequencies of the “1” allele at loci *l* and *m*. The inequality constraints(17)(Lewontin 1964) derive from the fact that two-locus gamete frequencies are nonnegative. Thomson and Baur (1984) described an additional set of inequality constraints on the elements of **S**, deriving from the fact that three-locus gamete frequencies are nonnegative. For the triplet of distinct loci *l*, *m*, *j*, the constraints can be written as(18)(19)(20)(21)Robinson *et al.* (1991) determined that no additional constraints on the elements of **S** are imposed by considering loci in groups of four or more; thus, a complete set of inequality constraints can be obtained by applying (16) to the diagonal elements of **S**, (17) to the elements involving distinct locus pairs *l*, *m*, and (18)–(21) to the elements involving distinct locus triplets *l*, *m*, *j*. I will say that model estimates and satisfy the constraints if (16)–(21) hold, upon replacing sample quantities *s _{lm}* by their estimates , for

*l*,

*m*∈ {1, …

*L*}.

I use generalized least-squares (GLS) methods for fitting the model (14) to data for two main reasons: these methods do not require explicit distributional assumptions, and the statistical properties of GLS estimates are known under fairly general conditions (see, *e.g.*, Browne 1974, 1984; Jöreskog 1981). For a “weight” matrix **V**^{L}^{×L} > **0** yet to be chosen, GLS estimates of **Λ** and **Ψ** are obtained by minimizing the discrepancy function(22)on the setwhere ‖ · ‖ is the Euclidean norm. I also examine the unconstrained problem, in which (22) is minimized on {(**Λ**, **Ψ**): rank(**Λ**) = *K* – 1, **Ψ** ≥ **0**}; for, a minimizer of the unconstrained problem may, for a given data set, satisfy the constraints . In such a situation, is also the minimizer of the constrained problem. When **V** is not a function of **Λ** or **Ψ**, straightforward matrix differentiation (see, *e.g.*, Magnus and Neudecker 1988, section 9.9) gives the estimating equations(23)and(24)where for **M**^{L}^{×L}, Bdiag[**M**] gives **M** the same block-diagonal structure as **Ψ**. Equations 23 and 24 give necessary conditions for a local (or global) minimum of *f*** _{V}**, at a differentiable point of the parameter space (Magnus and Neudecker 1988, Sect. 7.4).

In the following sections, I use weight matrices **V** > 0 having the same block-diagonal structure as **Ψ**. There are two main reasons for this choice: first, block-diagonal **V** can be chosen to reflect unique features (levels of polymorphism, *etc.*) of particular blocks; second, block-diagonal **V** leads to a simple estimating equation for **Ψ**. For block-diagonal **V**, pre- and post-multiplication by **V**^{−1}, along with straightforward matrix algebra, establishes that the necessary condition (24) is equivalent to(25)Consequently, model estimates *within* diagonal blocks equal their observed counterparts:(26)

## THE MODEL WITH *B* = 2 BLOCKS

For *B* = 2, it is convenient to write the two blocks of a gamete vector as and , with *L _{x}* +

*L*=

_{y}*L*. The covariance structure model (14) is then(27)where

**Λ**

_{x},

**Λ**

_{y}are respectively of dimensions

*L*× (

_{x}*K*– 1),

*L*× (

_{y}*K*– 1). I assume without loss of generality that

*L*≤

_{x}*L*.

_{y}Unfortunately, **Λ** and **Ψ** are essentially unidentified for the model with *B* = 2 blocks. Given , , , and , which solve (27), many new solutions can be generated by the family of transformationswhere **M** is any conformable, invertible matrix. All of these solutions will satisfy (27) for a given **Σ**; thus we need to impose additional criteria to choose a “good” solution from the many. Although **Λ** and **Ψ** are not identified, the off-diagonal blocks , under the parametrization , are identified asThis parametrization is less informative mathematically, but shows that matrix elements describing admixture LD between loci of different blocks are unique.

I advocate a solution of model (27) first described by Rao (1973, 1979), and here give a rationale for its use in the admixture decomposition. I assume in the following that **S**_{xx} and **S**_{yy} are nonsingular and thatThe latter is a technical assumption which ensures that each latent dimension of the admixture model is represented in the sample. The singular value decomposition (SVD) of is accordingly(28)where and are orthogonal, ,and *r*_{1} ≥ ··· ≥ *r _{d}* are the

*d*≤

*L*singular values.

_{x}The Rao (1979) solution for the model with *B* = 2 blocks is(29)where **Δ**_{(K–1)} = diag(*r*_{1}, … *r _{K}*

_{–1}), and

**U**

_{(K–1)},

**W**

_{(K–1)}respectively contain the columns of

**U**,

**W**corresponding to the ordered singular values in

**Δ**

_{(K–1)}. The singular values

*r*

_{1}≥ ··· ≥

*r*of

_{d}**S**S

_{xy}

**S**, all of which lie in [0, 1], are the sample canonical correlations between the variables

**x**and

**y**.

The Rao solution can be connected to the unconstrained GLS problem by a particular choice of **V** in (22), and by use of the corresponding necessary condition ∂*f*** _{V}**/∂

**Ψ**=

**0**. A proof of the proposition is given in the appendix.

Proposition 3. *For* , *the Rao solution* (29) *gives a global minimum of f*** _{V}**(

**Λ**,

**Ψ**;

**S**)

*on the set*{(

**Λ**,

**Ψ**):

*rank*(

**Λ**) = K – 1,

**Ψ ≥**0}.

Thus the Rao solution is the GLS solution for the unconstrained problem, given the choice of **V** immediately above. Browne (1979) shows that the Rao solution (29) gives maximum-likelihood estimates of (**Λ**, **Ψ**) in the multivariate normal setting, under additional identifiability conditions. The connection given by Proposition 3 between the Rao solution and the minimization of the GLS discrepancy function is apparently new.

There is no guarantee that the Rao solution (29) will satisfy the constraints imposed by a given data set. Numerical methods for minimization of the discrepancy function (22), which explicitly incorporate inequality constraints, have been described by Lee (1980); these remain an option if the fit of the Rao solution is unsatisfactory.

#### Data from The SNP Consortium:

This is the first of two data examples using SNPs organized into two blocks on the human X chromosome. My aim is more to show a proof of principle rather than to demonstrate any elaborations of the analysis. Although the examples may seem rather narrow, I do not intend that the method be limited only to very similar situations. In both examples I use only X chromosomes typed in males, in order to avoid any complications introduced by either inferring phase or estimating pairwise LD by indirect methods. Both data sets are, in a broad sense, deliberately admixed, as they combine population samples meant to characterize human ethnic or geographic diversity. However, I will not assume that ethnic or geographic labels necessarily correspond to source populations, even if evidence for admixture is found.

The first data set is a sample of *n* = 110 independent X chromosomes, typed by Celera Genomics as part of The SNP Consortium (TSC) Linkage Map Project (Clark *et al.* 2003; Matise *et al.* 2003). The *n* = 110 males are comprised of the following (presumably self-identified) groups: 13 Asian males from the Celera diversity panel of 30 unrelated Asians; 15 African-American males from the Celera diversity panel of 30 unrelated African Americans; and 82 Caucasian males from the Utah CEPH pedigrees typed in “phase 2” of the Linkage Map Project. I selected CEPH males who could provide independent X chromosomes: typically for a three-generation pedigree, I included the father's and paternal grandfather's X chromosomes, along with a single X chromosome chosen at random from the maternal grandfather and sons.

The 11 SNPs used in this example are a subset of 39 X-linked SNPs typed by Celera Genomics in the diversity panels and CEPH Utah pedigrees. Two broad requirements guided the selection of these SNPs. First, the SNPs can be assigned to two blocks located on different arms of the X chromosome (see Figure 2 for the physical map). Second, the number of parameters estimated in fitting the model with 11 SNPs is moderate, compared to the sample size *n* = 110. With 11 SNPs there are 11 × 10/2 = 55 population covariances estimated by sample statistics, along with 11 allele frequencies. There are at most 5 canonical correlations to include in the model, determined by the size of the smaller block. Once the number of canonical correlations to retain has been decided (by statistical inference), the other components of the Rao solution are uniquely determined. Thus in the present case there are at most 55 + 11 + 5 = 71 quantities to be estimated in fitting the model.

#### Inference of the number of source populations:

In the case of *B* = 2 blocks, fitting the Rao solution (29) turns the problem of inferring the number of independent sources into that of inferring the number of nonzero canonical correlations of the population LD matrix. Although test-statistics for making this inference exist, their null distributions are known only when the data are multivariate normal, or more generally are from an elliptically-contoured family (Muirhead and Waternaux 1980). I use a bootstrap approach in order to avoid making assumptions about the data distribution. It will be convenient to define some basic sample quantities: for *i* = 1,…*n*, let , be the *x* and *y* blocks of the *i*th gamete, and let , be the centered data matrices. The five sample canonical correlations for the TSC data are 0.46, 0.35, 0.25, 0.16, and 0.08.

The initial inference to be made is whether or not there is any support for admixture. For the pulse-decay model, this is equivalent to testing the null hypothesis that there is only one source population, hence no contribution from the term **ΛΛ**′ in (14). Under this hypothesis **Σ** itself is block-diagonal. A likelihood-ratio test for the null hypothesis of blockwise independence (block-diagonal covariance structure), assuming multivariate normal data, is well known (see, *e.g.*, Anderson 1984, Chap. 9; Seber 1984, Sect. 3.5). For *B* = 2 blocks, the test-statistic is(30)where *r*_{1} ≥ ··· ≥ *r _{d}* are the sample canonical correlations (Anderson 1984, Sect. 12.4.1). For multivariate normal data, the sample value

*t*

_{0}is compared to the χ

^{2}distribution on

*L*degrees of freedom.

_{x}L_{y}The null hypothesis of blockwise independence suggests a natural permutation strategy that I use to obtain a null distribution for *T*_{0}: if the sample consists of pairs (**x**_{1}, **y**_{1}) , … , (**x**_{n}, **y**_{n}), then a permuted sample under blockwise independence is (**x**_{1}, **y**_{π(1)}), … , (**x**_{n}, **y**_{π(n)}), where π is a uniform random permutation of {1, … , *n*}. I computed the test-statistic *T*_{0} for each of *R* = 9, 999 permuted samples, and obtained an approximate *p*-value offor the test of blockwise independence, where *t*, … , *t* are the values of *T*_{0} for the permuted samples (see Davison and Hinkley 1997, Sect. 4.3). There is then good evidence against simple block-diagonal structure in the TSC data.

The permutation test does not extend readily to sequential tests which add nonzero canonical correlations to the model one at a time. So I proceed by computing sequential confidence regions, for the first canonical correlation, then for the first and second canonical correlations jointly, etc*.*, by nonparametric bootstrapping. I take inclusion of the value zero in a confidence region as evidence that the corresponding correlation is nonsignificant.

Construction of a confidence interval for ρ_{1} is the first step. I use a bootstrap approach described by Davison and Hinkley (1997, section 2.7.2) which employs smooth transformations of sample statistics and corresponding variance estimates. An approximate pivotal quantity can be obtained from the first sample canonical correlation *r*_{1} using Fisher's (1921) transformation(31)along with a variance approximation calculated from empirical influence values. The variance approximation iswhere *l _{j}* is the empirical influence value for the

*j*th observation on

*r*

_{1}(see Davison and Hinkley 1997, Equation 2.38). Gu and Fung (1998) show that

*l*can be written as(32)where

_{j}*s*

_{j}_{1}and

*t*

_{j}_{1}are the

*j*th elements of and , respectively, and , are the sample canonical vectors associated with

*r*

_{1}. The basic quantity to be bootstrapped is then . The function

*boot.ci*, available in the

*boot*library of the statistical software R (R Development Core Team 2005, version 2.2.1), handles the confidence interval calculations readily. A 95% studentized confidence interval for ρ

_{1}for the TSC data, obtained from

*R*= 9, 999 ordinary nonparametric bootstraps, is (0.18, 0.47), confirming the existence of a first canonical correlation clearly distinct from zero. As the sample canonical correlation is

*r*

_{1}= 0.46,

*r*

_{1}is apparently highly biased as a naïve estimator of ρ

_{1}. This can be confirmed by performing basic bootstraps of

*r*

_{1}.

Bootstrapping the first two or more canonical correlations together requires additional care: the nonindependence of the canonical correlations (imposed by their rank ordering) calls for joint bootstrapping and the use of a covariance matrix reflecting the joint structure. For a joint bootstrap of the first two canonical correlations, each transformed according to Fisher's function, the covariance matrix estimate based on empirical influence values is ** d′d,** where ,(33), , and , are the sample canonical vectors associated with

*r*

_{2}(see Hampel

*et al.*1986, Equation 4.2.2; Davison and Hinkley 1997, Sect. 5.8). The basic quantity to be bootstrapped is then the quadratic form(34)whereThe quantile is the key quantity recovered from the bootstrap samples; from this, a 100(1 – α)% joint confidence region for (ρ

_{1}, ρ

_{2}) is obtained as(35)(see Davison and Hinkley 1997, Sect. 5.8). I checked the stability of the required matrix inversion by a preliminary bootstrap of the singular values of

**. In the present case,**

*d*′𝒱*d***was invertible in all**

*d*′𝒱*d**R*= 9, 999 bootstrap samples.

Figure 3 shows joint confidence regions for (ρ_{1}, ρ_{2}) based on *R* = 9, 999 bootstrap samples. The joint confidence regions clearly support the inference ρ_{1} > 0, and give weaker, though measurable, support for ρ_{2} > 0. Before settling on a model with two canonical correlations, support for ρ_{3} > 0 needs to be investigated. I used the natural extensions of (33)–(35) to three variables, to compute joint confidence regions for (ρ_{1}, ρ_{2}, ρ_{3}) based on *R* = 9, 999 bootstrap samples. With three parameters the confidence regions are nested three-dimensional surfaces, similar to the Russian dolls known as “matreshki.” The nested surfaces are not easy to display graphically, so in Figure 4 I show the *intersections* of the surfaces with the two-dimensional (ρ_{1}, ρ_{2}) plane, in which ρ_{3} = 0. Each of the three confidence regions intersects the plane, so I conclude that there is little support for a third canonical correlation. Thus the inference is that there are *two* significant correlations in the TSC data, indicating the presence of *three* independent sources of variation.

#### Other methods for the TSC data:

The computer program *Structure*, available at http://pritch.bsd.uchicago.edu/structure.html, implements Bayesian methods for fitting admixture models described in Pritchard *et al.* (2000) and Falush *et al.* (2003). Pritchard *et al.* (2000) describe a method for inferring the number of source populations *K*, based on estimates of the probability of the sample given different values of *K*. I used *Structure*, along with rules of thumb described in the accompanying user's manual, to infer the number of source populations for the TSC data, for comparison with the inference made above. For each of three models of ancestry implemented in *Structure* (the “no admixture”, “admixture” and “linkage” models), I carried out five independent runs of the *Structure* program at each value of *K* = 1, … , 5 (for a total of 75 runs). For each run, a “burn-in” of 100,000 iterations was followed by 1,000,000 Markov-chain Monte Carlo updates. Arguably, none of the ancestry models implemented in *Structure* are appropriate for the TSC SNPs, some of which are very tightly linked. Each ancestry model gave strongest support to the value *K* = 1 for the TSC data. For models with *K* > 1, in all runs, the overall sample proportion assigned to a “cluster” was 1/*K* for each cluster. These results are understood to indicate an absence of population structure.

The programs in *Eigensoft* (http://genepath.med.harvard.edu/∼reich/EIGENSTRAT.htm) perform a principal components analysis described by Patterson *et al.* (2006) for inference of admixture. Under this method, *K* – 1 is inferred as the number of statistically significant eigenvalues of an interindividual covariance matrix [having a form similar to expression (4)]. As for *Structure*, I have reservations about using *Eigensoft* on the TSC sample, because the data architecture is of a form not intended by Patterson *et al.* (2006). In particular, Patterson *et al.* intend (but do not require) that the number of markers *L* be larger than the number of individuals *n*, and acknowledge that their method is not designed specifically to accomodate markers in linkage disequilibrium. I applied the *Eigensoft* genotype coding scheme to the TSC X chromosomes as follows: a male with the “reference” allele (chosen arbitrarily) at an X chromosome locus has the value 0 at the locus, whereas a male with the “variant” allele has the value 2. I used the regression method described in Patterson *et al.* (2006) to correct for LD, and varied the number of adjacent markers used for adjustment between 1, 2 and 3. In each case *Eigensoft* called all eigenvalues nonsignificant, thus the inference would be that only one population contributed to the TSC sample.

#### The admixture decomposition:

The Rao solution can be computed by straightforward matrix operations commonly implemented in statistical or numerical packages. I do the calculations in R (R Development Core Team 2005, version 2.2.1), and describe a numerically stable method for computing the decomposition in the appendix.

The Rao decomposition for the TSC databased on two canonical correlations is shown graphically in Figure 5. The estimates **Λ*****Λ***′ and **Ψ*** satisfy the constraints imposed by the sample allele frequencies. Covariances due to admixture, involving rs1541354 and rs1859004 (respectively, the third and fourth SNPs from the left in Figure 5) with several of the other SNPs, can be seen as vertical bands in the third and fourth columns of **Λ*****Λ***′. A comparison of the off-diagonal block of the residual matrix with the corresponding block of **S** reveals a notable tendency for residual covariances to be corrected toward zero.

#### Data from the International HapMap Consortium:

In broad terms, the second example repeats the analysis of the TSC data, using a different sample of X chromosomes typed by the International HapMap Consortium (2005), HapMap Public Release #21, http://www.hapmap.org. Here I want to demonstrate that a different sample having similar structure yields similar results. The HapMap sample consists of 53 males of the Yoruban parent-offspring trios from Ibadan, Nigeria; 23 Japanese males from Tokyo; 22 Han Chinese males from Beijing; and 44 males of the Utah CEPH parent-offspring trios. Some of the Utah CEPH pedigrees typed by Celera Genomics for TSC were also typed by HapMap; thus 20 of the Utah males I include in the HapMap sample were also in the TSC sample.

Although HapMap typed many more SNPs overall than TSC, the particular SNPs I used for the TSC sample were incompletely typed by HapMap. I elected to choose a new set of SNPs for the HapMap sample, located in the same X-chromosome regions represented by blocks 1 and 2 of the TSC sample. The HapMap browser locates the TSC SNPs of block 1 in the X-chromosome region 9.2 - 13.7Mb, and the SNPs of block 2 in 139.3 - 147.3Mb. I judged that a model with 15 SNPs would preserve a reasonable balance between sample size (*n* = 142) and number of parameters. I chose 15 SNPs at random from those typed by HapMap, allocated to blocks 1 and 2 in rough proportion to the sizes of the two regions. Prior to choosing the 15 SNPs at random, I excluded SNPs from the two regions which had obvious genotyping errors or which failed to type in one or more individuals. I discarded approximately three random SNP sets because the sample covariance matrices they produced were singular, or because the Rao decomposition based on 1, 2 or 3 canonical correlations would have failed to satisfy the constraints imposed by the sample allele frequencies. The latter biased the selection toward a SNP set which yields a good fit of the model, but did not privilege any particular value of *K*. Figure 6 shows the physical map of the 15 chosen SNPs.

I used the same data analysis strategy for the HapMap sample as for the TSC sample, so I will summarize results briefly. The six sample canonical correlations for the HapMap data are 0.50, 0.41, 0.33, 0.23, 0.20 and 0.09. The permutation test of the statistic *T*_{0} produced an approximate *p*-value of 0.0007, indicating the presence of at least one significant canonical correlation. A 95% confidence interval for the first correlation is (0.25, 0.49), and joint confidence regions for the first two correlations are shown in Figure 7. The HapMap plot (not shown) analogous to Figure 4 has no contours in the (ρ_{1}, ρ_{2}) plane: thus ρ_{3} = 0 is excluded from each of the 95, 90, and 85% confidence regions. I take this to imply support for a third correlation. Graphical display of 4-dimensional regions is highly problematic, so I checked for the presence of a fourth significant correlation numerically. I bootstrapped the four-dimensional analog of (34) in order to estimate quantiles *q**, then evaluated (34) at candidate points (ρ_{1}, ρ_{2}, ρ_{3}, ρ_{4} = 0), with ρ_{1}, ρ_{2} and ρ_{3} varying in a region of high support. I found many such parameter combinations inside the joint confidence regions, and concluded that there is little evidence in the HapMap sample for a fourth correlation. Thus the inference is that there are *three* significant correlations in the HapMap data, indicating the presence of *four* independent sources of variation. The Rao decomposition for the HapMap data, based on three canonical correlations, is shown graphically in Figure 8. The decomposition satisfies the constraints .

#### Other methods for the HapMap data:

I followed the same strategy using *Structure* to infer *K* for the HapMap data as for the TSC data. In contrast to the TSC data, *Structure* finds evidence for multiple clusters in the HapMap data; in particular, the “linkage” model produces evidence for four clusters. In five runs of the linkage model with *K* = 4, *Structure* consistently assigned overall sample proportions near 0.27 to three clusters, and 0.2 to a fourth cluster. Posterior probabilities of cluster membership suggest that most individuals are of mixed ancestry, having roughly equal proportions of their alleles assigned to two or more clusters.

I followed the same strategy using *Eigensoft* for the HapMap data as for the TSC data. *Eigensoft* calls one eigenvalue significant when either 1 or 2 adjacent markers are used for LD adjustment, and calls no eigenvalues significant when 3 adjacent markers are used. Thus *Eigensoft* suggests that at most two sub-populations have contributed to the HapMap X chromosome data.

## THE MODEL WITH *B* ≥ 3 BLOCKS

Under mild conditions, the model with *B* ≥ 3 is identified up to post-multiplication of **Λ** by an orthogonal matrix (*i.e.*, up to rotations of **Λ**). However, explicit formulae for parameter estimates are not readily available.

To fix notation, the block-partitioned parameter matrices are **Ψ**, of the form (13), andwhere **Λ**_{b} is *L _{b}* × (

*K*– 1);

*b*= 1 , … ,

*B*. For

*B*≥ 3, Browne (1980, Proposition 2) shows that a sufficient condition for identification of

**Ψ**, and of

**Λ**up to rotations, is that at least three of the sub-matrices

**Λ**

_{b}of

**Λ**be of full column rank.

An assumption of the pulse-decay model, built on the notion that , is that the source-population vectors **p**(1), … **p**(*K*) are linearly independent. If *L _{b}*, the number of markers in the

*b*

^{th}block, is sufficiently large, it is reasonable to assume that linear independence also holds for the block-specific sub-vectors

**p**

_{b}(1) , … ,

**p**

_{b}(

*K*). In fact, provided

*L*≥

_{b}*K*– 1, a block-specific version of Proposition 2 (of this paper) is readily established, using suitably defined matrices

**Γ**

_{b},

_{bb}, and a proof of the same form. Browne's Proposition consequently implies that

**Ψ**is identified, and

**Λ**is identified up to rotations, if

*L*≥

_{b}*K*– 1 for at least three blocks

*b*∈ {1, …

*B*}. Furthermore, when at least three blocks meet this minimum size criterion, the matrix =

**ΛΛ**′ is unique: for =

**ΛΛ**′ =

**ΛQ**(

**ΛQ**)′ when

**Q**is orthogonal.

Obtaining an analog of Proposition 3 (of this article) for *B* ≥ 3 appears to be an open problem. Using**Ψ** = B diag[**S** – **ΛΛ**′] and the reasoning of Proposition 3, GLS estimates **Λ** , … , **Λ** for the unconstrained problem would be obtained by minimizing(36)It is here that the strategy used for the model with *B* = 2 blocks breaks down, for although the Eckart–Young theorem (Gower and Hand 1996, Sect. A.4.4) can be used to minimize the individual summands of (36), it is not clear how to reconcile the different expressions for **Λ** obtained as *i* varies.

At present, numerical estimation of (**Λ**, **Ψ**) appears to be the only feasible option for the model with *B* ≥ 3. Lee (1980) gives algorithms for GLS estimation under parameter constraints, which are appropriate for the data under consideration. For inference of *K*, an *ad hoc* strategy involving sequential goodness-of-fit tests, similar to a method used in ordinary factor analysis (see, *e.g.*, Bartholomew and Knott 1999, Sect. 3.8–3.9), may be pursued. For a candidate value *K* and a GLS estimate **Λ***, where **Λ*** has *K* – 1 columns, the fit of the model may be assessed by testing the residual matrix **S** – **Λ*****Λ***′ for block-diagonal structure (see *e.g.*, Anderson 1984, Chap. 9; Seber 1984, Sect. 3.5); this assumes that such tests could be suitably adapted to multivariate binary observations. The smallest candidate value providing the required block-diagonal residual matrix may be taken as an estimate of *K*.

## DISCUSSION

Here I have shown that a multilocus model of pairwise LD in an admixed population implies a particular covariance structure model (Equation 14), and that the model can be fitted feasibly to data. The simple pulse-decay framework will be implausible for many samples; nonetheless it may be useful to fit the statistical model, even when the form of admixture is clearly more complex, in order to discover or describe covariance structure. In this case, the “existence” result which connects the admixture model to (14) cannot be invoked: instead, it should be acknowledged that a convenient model of quasi-independent sources of covariance is being imposed on the data. The model (14), along with the statistical methods for fitting it, are of intrinsic interest and will likely have further applications in genomics. Regrettably, this description is too long to include work on a more general admixture model, in which source populations contribute to an admixed population at each generation (*e.g.*, the “gradual” admixture model of Ewens and Spielman 1995). Preliminary calculations (not shown here) suggest strongly that a decomposition resembling Equation 9, involving a block-diagonal matrix and several non-block-diagonal matrices, will hold for the more general model. Questions about the ranks of these matrices and about statistical inference in the general model remain for future work.

I can suggest two immediate, practical applications of this work. First, the model provides a well-supported statistical method for inferring the number of source populations contributing to an admixed population, for samples in which SNP markers can be organized into two blocks. In multiblock samples, the same method may still be applied by focusing attention on two well-chosen blocks. A complete extension of the method to multiblock samples is a desirable objective for future work. For the TSC and HapMap samples, I used canonical correlations to infer the presence of *K* = 3 and *K* = 4 independent sources of covariance, respectively. However, the inferred sources need not correspond to the three (respectively four) groups defined in these samples by ethnic labels. Under the model, SNPs, not labeled individuals, come from independent sources.

A second application is the detection, through examination of *** = Λ*Λ*′**, of SNPs worthy of scrutiny in an association mapping study, biogeography survey, or other context in which genetic mixtures are of interest. In both data examples considered here, examination of **Λ*****Λ***′ revealed SNPs that had relatively large admixture LD in pairwise combinations with other markers. Such SNPs would be more likely to show “spurious” associations with phenotypic traits, in association-mapping studies not designed to detect admixture effects. The elements of **Ψ*** are components of LD remaining after admixture artifacts have been removed; thus **Ψ*** contains observed LD values adjusted for admixture effects. These quantities may provide suitable surrogates for physical linkage in admixed populations, similar to conventional LD measures in unstructured populations. We can imagine an association-mapping scenario in which a binary trait has been roughly mapped to a linkage group (say, represented by block 1), and it is desired to fine-map the trait in a potentially admixed population. We include the trait as a variable in block 1, and fit the covariance structure model to block 1 along with a second unlinked block, to detect associations between the trait and markers due to admixture alone. The elements of **Ψ*** then give corrected components of LD between the trait and markers of block 1 and suggest which markers lay in close proximity to the trait locus.

I do not use the model to make inferences at the level of the individual, concerning, *e.g.*, the proportion of alleles on a gamete inherited from a putative source population. The fine-grained information provided by the model consists of the matrix elements of the decomposition. Progress toward individual inference may be possible by noting that the canonical variable pairs (**s**_{k}, **t**_{k}); *k* = 1, … *d*, used here to calculate empirical influence values, are also used in classification and assignment (see Krzanowski and Marriott 1995, Sect. 4.11). Formal inference at the individual level apparently requires estimation of individual factor scores **f** in Equation 15. These factor scores are indeterminate from the view of multivariate analysis (see, *e.g.*, Lawley and Maxwell 1971, Chap. 8; Mardia *et al.* 1979, Sect. 9.7): a single sample cannot be used to simultaneously estimate **Λ**, **Ψ**, and the individual factor scores **f**_{1} , … , **f**_{n}.

A reviewer asked if the model and statistical methods still apply when phase-unknown genotypes, rather than gametes, are observed. Only the method for inference of the number of significant canonical correlations needs to be modified for genotype data. Pairwise composite LD estimates, which require only unphased genotypes (Cockerham and Weir 1977; Weir 1979), can be organized in matrix form to estimate the population covariance matrix. The covariance structure model (14) and the generalized least-squares method depend only on the covariance matrix and sample allele frequencies. Fitting the Rao solution (29) for *B* = 2 blocks requires a decision about the number of canonical correlations to include; although here I bootstrapped gametes to make the decision, other avenues are available. There is no impediment to bootstrapping unphased genotypes, although to perform calculations analogous to these, empirical influence values for genotypes will need to be derived, perhaps using techniques of Romanazzi (1992) or Gu and Fung (1998). Genotype-based influence values could then be used similarly to (32) and (33) in variance approximations. Dümbgen (1998) gives an elegant method for constructing simultaneous confidence intervals for canonical correlations. The numerical implementation of Dümbgen's method requires only bootstrap realizations of the sample covariance matrix, which could be formed by composite LD calculations on bootstrapped genotypes. A key quantity, β, used in Dümbgen's method is of order (*L*/*n*)^{1/2}, and a small value of β is desired to produce reasonably narrow intervals. After experimenting with Dümbgen's method, I concluded that samples larger than the TSC or HapMap samples would be needed for successful implementation.

In fitting the model to data, I have stayed within relatively strict limits on the number of parameters estimated, *vis à vis* the sample size, to avoid straining the statistical theory unduly. I view each sample covariance as an estimate of a population covariance parameter: thus the number of parameters of the model is a quadratic function of *L*, the number of SNPs. Admittedly, using the sample size as an upper bound on the number of parameters of the model seems severe, at a time when thousands of SNPs can be typed simultaneously in an individual. However, allowing the sample size and number of SNPs to be in the reverse relationship, *n* < *L*, calls upon multivariate theory still in early development.

Recently, Patterson *et al.* (2006) have also taken a multivariate approach to the analysis of population structure. A thorough comparison of the method I have described with that of Patterson *et al.* will require significant work, although a few points of comparison can be discerned. Patterson *et al.* note that blocks showing significant “standing” LD pose problems for their method, although they give an *ad hoc* strategy for handling this situation. The method I have described makes explicit use of such blocks, via the block-diagonal structure of **Ψ**. A key difference between the two approaches is that Patterson *et al.* decompose an *n* × *n* matrix of inter-individual covariances, whereas I decompose an *L* × *L* matrix of covariances between SNPs. Under circumstances described by Gower (1966) and Jolliffe (2002, Sect. 5.2), there is a *duality* relationship between subject-space and variable-space decompositions. If a model in variable space dual to the model of Patterson *et al.* can be described, it will be of interest to study the structure of the matrices analogous to **Λ** and **Ψ**. Notably, Patterson *et al.* have developed their approach with the relationship *n* < *L* specifically in mind. Although they use multivariate theory for *n* < *L* to the extent it is available, they necessarily rely on (well-argued) conjecture and numerical demonstration where theoretical results are lacking.

In my view, the relationship between *n* and *L* is connected to basic notions of degrees of freedom. The need for theoretical and analytic tools suitable for genomic data may prompt us to rethink how much can be inferred from a sample of fixed size. However, it seems prudent to admit from the start that the amount of information provided by a sample is limited and that the scope of inference should be limited accordingly.

I obtained the flat files TSCmap.p1.Celera.txt.gz and TSCmap.p2.Celera.txt.gz, from which the TSC data were extracted, at http://snp.cshl.org/. I obtained the flat files genotypes_chrX_CEU_r21_nr_fwd.txt.gz, genotypes_chrX_JPT+CHB_r21_nr_fwd.txt.gz, and genotypes_chrX_YRI_r21_nr_fwd.txt.gz, from which the HapMap build 35 data were extracted, at http://www.hapmap.org/. Python scripts used to perform data extraction and R scripts used to perform data analysis are available from the author.

## APPENDIX

#### Proof of Proposition 1:

Note that the Schur product theorem (see, *e.g.*, Horn and Johnson 1985, Sect. 7.5) cannot be used in the present case, as **R**_{t} ≥ **0** is not assumed.

Because **ΓΓ**′ ≥ **0** by definition, for any **y** ∈ * ^{L}*,(37)Hence ≥

**0**. Let

**D**

_{b}(

*k*) be the covariance matrix for the loci of block

*b*in the

*k*th source population, and let

**R**

_{t,b}be the

*b*th diagonal block of

**R**

_{t}, for

*b*= 1 , … ,

*B*. As

**D**

_{b}(

*k*) ≥

**0**and

*q*(

*k*) > 0, as well. The nonnegative definiteness of is then established by considering

**y**′

**Ψ**

_{b}

**y**in a manner analogous to (37), for . Finally, because each of the diagonal blocks

**Ψ**

_{1}, … ,

**Ψ**

_{B}of

**Ψ**is nonnegative definite,

**Ψ**≥

**0**as well (see,

*e.g.*, Harville 1997, Lemma 14.8.3).

#### Proof of Proposition 2:

is symmetric and nonnegative definite, so only its rank is in question. The rank of is obtained by showing that and **ΓΓ**′, the latter of which has an easily determined rank, have the same null-space .

For **y** ∈ * ^{L}*, the

*l*th element of the column vector satisfies the inequalities(38)If

**y**∈ (

**ΓΓ**′), thenand (38) implies ()

_{l}= 0;

*l*= 1, …

*L*. Hence

**y**∈ . By similar reasoning, if

**y**∈ , then

**y**∈ (

**ΓΓ**′). (

**Γ**) consists of scalar multiples of the vectorhence rank = rank(

**ΓΓ**′) = rank(

**Γ**) =

*K*– dim[(

**Γ**)] =

*K*– 1.

Having established that is symmetric, nonnegative definite with rank *K* – 1, the existence of the decomposition = **ΛΛ**′ is a standard result (see, *e.g.*, Harville 1997, Theorem 14.3.7).

#### Proof of Proposition 3:

The necessity of (25) for block-diagonal **V** implies that minimization of *f*** _{V}**, for the unconstrained problem, need only be carried out onDirect substitution of

**Ψ**= B diag[

**S**–

**ΛΛ**′] in

*f*

**, along with the particular choice of**

_{V}**V**, yields a “concentrated” discrepancy function

*f** depending only on

**Λ**:(39)The GLS problem is now reduced to one of finding an

*L*×

_{x}*L*matrix of rank

_{y}*K*– 1, which minimizes (39). The required matrix is obtained by rescaling the SVD of and retaining the components corresponding to the

*K*– 1 largest singular values. Rao (1979) giveswhich attains the global minimumIn current practice, the result is understood as an application of the Eckart–Young theorem in the metrics

**S**

_{xx}and

**S**

_{yy}(see,

*e.g.*, Gower and Hand 1996, Sect. A.4.4). The factorization , with

**Λ**and

**Λ**as in (29), is somewhat arbitrary, but produces

**Λ*** of the appropriate rank.

A modification of Rao's result (1973, Sect. 8f.3) for the statistical setting establishes that **Ψ*** ≥ **0**, as follows. Because **U** is orthogonal, **S**_{xx} = **S****UU**′**S**, and hencewhere are the columns of **U** and 1 ≥ *r*_{1} ≥ ··· ≥ *r _{K}*

_{–1}≥ 0. Because

**Z**≥ 0, the product ≥ 0 as well (see,

*e.g.*, Harville 1997, Theorem 14.2.9). Nonnegative definiteness of

**Ψ**is established using a similar argument, and together with

**Ψ**≥ 0 this implies

**Ψ*** ≥ 0 (see,

*e.g.*, Harville 1997, Lemma 14.8.3).

#### Numerical calculation of the Rao decomposition:

The following calculations are adapted from Seber (1984, Sect. 5.7, 10.1.5). The QR decompositions of the centered data matrices are , , where , have orthonormal columns, and , are upper triangular with positive diagonal elements. The sample canonical correlations *r*_{1} ≥ ··· ≥ *r*_{d} occur as the diagonal elements of **Δ** in the singular-value decomposition(40)where , are orthogonal and is as in Equation 28 (Seber 1984, Sect. 10.1.5). Equation 40, along with some matrix algebra, yields the decompositionwhich is equivalent to the decomposition of **S**_{xy} implied by (28). Parameter estimates equivalent to those of the Rao solution (29) are then given by(41)

## Acknowledgments

Many thanks go to Chuck Langley; many of these ideas originated in discussions with him. Terry Speed and Matthew Stephens read early drafts of the manuscript, giving valuable comments and suggestions. Michael Browne, Lutz Dümbgen, Michael Perlman, and Yutaka Tanaka answered key questions about multivariate analysis. The comments of three anonymous reviewers on previous versions led to significant revisions. This work was supported by National Human Genome Research Institute (National Institutes of Health) grants 5R01HG002107-03 and 7R01MH60007.

## Footnotes

Communicating editor: K. W. Broman

- Received February 5, 2007.
- Accepted May 25, 2007.

- Copyright © 2007 by the Genetics Society of America