- THIS ARTICLE
-
Abstract
- Full Text (PDF)
-
All Versions of this Article:
genetics.107.071779v1
176/4/2405 most recent - Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Grote, M. N.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Grote, M. N.
Originally published as Genetics Published Articles Ahead of Print on June 11, 2007.
Genetics, Vol. 176, 2405-2420, August 2007, Copyright © 2007
doi:10.1534/genetics.107.071779
A Covariance Structure Model for the Admixture of Binary Genetic Variation
Mark N. Grote1
Department of Anthropology, University of California, Davis California 95616
1 Address for correspondence: Department of Anthropology, University of California, 1 Shields Ave., Davis, CA 95616.
E-mail: mngrote{at}ucdavis.edu
>ABSTRACT
THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
I derive a covariance structure model for pairwise linkage disequilibrium (LD) between binary markers in a recently admixed population and use a generalized least-squares method to fit the model to two different data sets. Both linked and unlinked marker pairs are incorporated in the model. Under the model, a pairwise LD matrix is decomposed into two component matrices, one containing LD attributable to admixture, and another containing, in an aggregate form, LD specific to the populations forming the mixture. I use population genetics theory to show that the latter matrix has block-diagonal structure. For the data sets considered here, I show that the number of source populations can be determined by statistical inference on the canonical correlations of the sample LD matrix.
ADMIXTURE, the mixing of genetically differentiated populations via migration and subsequent intermating, can create linkage disequilibrium (LD) between genes, even when the genes are not physically linked (see, e.g., CAVALLI-SFORZA and BODMER 1971, p. 69; PROUT 1973). In this work, I show that admixture contributions to LD can be statistically quantified and distinguished from LD attributable to the shared ancestry of linked alleles. OHTA (1982) used Wright's island model to decompose a squared coefficient of LD into within- and between-population terms, in analogy with WRIGHT's (1940) decomposition of the inbreeding coefficient. These decompositions assume that populations connected by migration can be identified and sampled for genetic variation. The method I propose uses a pairwise LD matrix sampled from an admixed population of unknown composition; the number of source populations and the components of LD are inferred by use of a multivariate statistical model.
The data: blocks of binary markers:
It is convenient to develop the model using gametes as the basic units of observation. The data are then n random binary vectors of the form x = (x1,...xL)', with xl
{0, 1}; l = 1,..., L. Each vector represents the single nucleotide polymorphism (SNP) variation on one sampled gamete, under an arbitrary binary coding scheme, with xl indicating the allele on gamete x at the lth marker locus. The markers are assumed to be selectively neutral and variable in the sample.
Before statistical analysis begins, the markers in x are to be grouped into blocks by the investigator, based on physical criteria (e.g., the markers in a block share a localized region on a physical map), along with empirical evidence (e.g., the markers in a block are known by a previous linkage-mapping study to form a linkage group) independent of the sample under consideration. Each marker belongs to exactly one block; however, a particular marker may be the only member of a block. Any two markers l, m within the same block are assumed to be linked, with recombination fraction
. In contrast, markers l, j from different blocks are assumed to be unlinked, with recombination fraction clj
.
In the development below, block structure derives from physical and linkage relationships between markers, with blocks analogous to linkage groups; this is to be distinguished from empirical descriptions of "haplotype block structure" (see, e.g., GABRIEL et al. 2002; PHILLIPS et al. 2003). Recent work of the INTERNATIONAL HAPMAP CONSORTIUM (2005) and MYERS et al. (2005) suggests that haplotype block structure results from variation in recombination rates over small physical distances. The fine-scale rate estimates obtained by MYERS et al. (2005) demonstrate new tools for constructing blocks of tightly linked markers. In the data examples to follow, I form two blocks of markers on the arms of the human X chromosome, using simple physical criteria.
To fix notation, I assume that x can be partitioned into B blocks, labeled 1,..., B in any convenient order (though the same partitioning and labeling scheme is used for all gametes). For b = 1,...B, the bth block contains Lb
1 marker loci, with
. I order the vector elements x1,..., xL so that the indices
are assigned to the loci of the bth block. However, markers within a block need not be ordered in any particular way.
For binary loci, linkage disequilibrium is covariance:
For binary loci, a familiar pairwise LD parameter (LEWONTIN and KOJIMA 1960) is a covariance between locus-specific indicator variables. Let pl be the population frequency of the "1" allele at locus l, let plm be the population frequency of gametes carrying the "1" allele at both loci l and m, and let E be the expectation operator for the distribution on (xl, xm)
{0, 1} x {0, 1} parametrized by pl, pm, and plm. The LD parameter for loci l and m is then
![]() | (1) |
![]() | (2) |
, I treat the l, m entry of the sample covariance matrix
![]() | (3) |
In a recent article, PATTERSON et al. (2006) also use a sample covariance matrix to make inferences about admixture. Although the locus-specific variables defined by Patterson et al. refer to diploid genotypes and take values in {0, 1, 2}, it is useful to establish a relationship between the matrix they analyze and S. In the notation above (up to differences in variable definitions, and omitting a normalization step), the n x n matrix of inter-individual covariances analyzed by PATTERSON et al. (2006) could be written as
![]() | (4) |
ABSTRACT
>THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
|
Under the pulse-decay model, generation 1 of the admixed population consists of N diploid genotypes sampled at random with replacement from K
2 source populations. The probability that a genotype is chosen from the kth source population is q(k) > 0; k = 1,..., K, with
. The mixing proportions q(1),... q(K) are unknown. In practice, K may be an unknown parameter requiring estimation. A genotype contributed by the kth source population consists of two gametes chosen at random with replacement from the 2N(k) gametes present at generation zero in the kth source population. At generation zero, the frequency of the "1" allele at locus l in source population k is pl(k), k = 1,..., K, and the covariance between loci l and m in source population k is Dlm(k). I call the vector of allele frequencies
' from source population k p(k), and I assume that the source population allele frequency vectors p(1),..., p(K) are linearly independent. The latter assumption implies that the number of markers L is at least as large as the number of source populations K. Linear independence of p(1),..., p(K) also implies that no source population is an admixture of the other source populations. I describe the effect of weakening the assumption of linear independence briefly in a later section. Under the pulse-decay model, the source populations make no further contributions to the sampled population after generation 1. Mating is at random in the admixed population from generation 1 onward.
By straightforward calculation, the expected frequency of the "1" allele at locus l, at generation 1 in the admixed population, is
, where the expectation is over samples drawn from the source populations. Similarly, the covariance between loci l, m at generation one in the admixed population is
![]() | (5) |
The expected covariance in generation t + 1, over realizations of the Wright–Fisher process in the admixed population, is
![]() | (6) |
(0,
] is the recombination fraction between loci l and m (see, e.g., KARLIN and MCGREGOR 1968, Equation 7). WEIR and COCKERHAM (1974) provide related expressions for "genotypic" disequilibria (see WEIR 1996, Chap. 3), which completely specify two-locus dynamics in diploid organisms. Whereas D1,lm of Equation 5 is a population parameter,
t+1,lm of Equation 6 is a random variable. In Equation 6, the value clm = 0, corresponding to complete linkage of l and m, is reserved for the variance
ll, where it gives the formula for the decay of heterozygosity in a finite population (see e.g., CROW and KIMURA 1970, Section 3.11).
Examining Equation 6 with statistical modeling in mind, it is attractive to write the realized covariance in generation t + 1 as
![]() | (7) |
t,lm =
t,ml is an unknown parameter giving the proportion of initial covariance (or variance) remaining after t transitions of the Wright–Fisher process. A sample from the admixed population in generation t + 1 produces an estimate of Dt+1,lm, via the l, m element of the sample covariance matrix. Although
t,lm is an important structural component of the model, it will not be an object of estimation. To establish later mathematical results, I assume that there exist
and
such that
![]() | (8) |
) are expected to decay much more quickly than covariances between linked loci: for unlinked loci,
t,lm is on the order of 2–t. In light of this observation, (8) significantly limits the number of generations that could have occurred since the admixture pulse. Assumption (8) also rules out changes in the signs of covariances, up to and including generation t + 1. The pulse-decay model is thus revealed to be primarily suited for recent admixture. Accordingly, any additional changes in allele frequencies and covariances due to mutation are assumed to be small enough to ignore, within the time-scale of the model.
Under the pulse-decay model, the realized covariance matrix in the admixed population at generation t + 1 can be written as
![]() | (9) |
is the symmetric matrix having l, m element
t,lm,
is the Hadamard (elementwise) product,
![]() |
![]() |
, and
is the symmetric matrix having l, m element
. The columns of
are weighted vectors of source-population allele frequency deviations. The number of generations elapsed since the admixture pulse is assumed to be unknown. I do not propose to explicitly model time-dependent properties, but rather to model the effect of recent admixture on observed covariances. Hence, the dependence on t is dropped in the final expression of (9).
The properties of
and
:
The theory of population genetics predicts that covariances between unlinked loci will be negligible in the source populations, if the source populations are not very small. Here, I address the consequences of this prediction for
, along with other properties of
and
.
In terms of understanding and specifying the model, much is gained by making clear how model parameters are derived from stochastically varying quantities. To describe the association between alleles at loci l and m in the kth source population, I imagine a 2 x 2 table, formed from the set of gamete counts
![]() | (10) |
; I specify the probabilities
in more detail below for the case of unlinked loci. I view the l, m-locus gamete counts in the kth source population at generation zero as a particular realization of (10), and the allele frequencies pl(k), pm(k) appearing in Figure 1 and Equation 5 as marginal frequencies of the realized 2 x 2 table. The covariance between xl and xm in the kth source population, written as a function of the random gamete counts, is
![]() | (11) |
lm(k).
For unlinked loci l, m and 2N(k) sufficiently large, gamete frequencies follow the "independent loci" model (ETHIER 1979); under this model, the probability distribution on 2 x 2 tables has the property of row-by-column independence. Let
l(k) (respectively,
m(k)) be the probability that a randomly chosen gamete of the kth source population carries the "1" allele at locus l (m). Under the independent loci model, the multinomial probabilities
are
11(k) =
l(k)
m(k),
,
, and
. It is worth noting that the Wright–Fisher process in the kth source population need not be stationary in order for unlinked l, m to follow the independent loci model (ETHIER 1979). Further, for large source populations, even weakly linked loci (those with clm <
, but clm not very small) can be treated as independent (ETHIER 1979).
By straightforward calculation, under the independent loci model, E(
lm(k)) = 0, where the expectation is over multinomial samples forming the kth source population at generation zero. For 2N(k) sufficiently large, a calculation using the "delta method" gives
![]() | (12) |
, can be equated to zero for unlinked l, m. The fact that unlinked loci l, m are assigned to different blocks leads to the observation that
![]() |
is of block-diagonal form:
![]() | (13) |
b is of dimensions Lb x Lb; b = 1,..., B.
The sub-matrices
b, formed by locus pairs l, m in the same block, contain covariances attributable to the shared ancestry of linked loci. Calculations of OHTA and KIMURA (1969) suggest the approximation
, for linked loci with recombination fraction clm, where
. This variance characterizes the order of magnitude of Dlm(1),...Dlm(K) when l, m are linked. For
, Var(
lm(k)) is nonnegligible, compared to the scale of a sample covariance slm. In the pulse-decay model, the Dlm(k) for linked l, m are averaged over source populations in order to form the elements of
1,...,
B. In contrast to the elements of the off-diagonal-blocks, I do not equate the elements of
1,...,
B, to zero; yet, some elements of
1,...,
B may indeed be close to zero, if the covariances Dlm(k) for linked loci tend to be of opposite signs in different source populations, and if the weights q(1),..., q(K) are nearly equal. A scenario in which one source population has a relatively large mixing proportion q leads to a straightforward interpretation of the elements of
1,...,
B: here, the element
lm mainly reflects the covariance between l, m in the dominant source population.
The matrix
contains components of covariance resulting only from differences in allele frequencies between source populations. These components are independent of any notion of physical linkage between loci, and could be considered entirely artifacts of admixture.
and
are symmetric by definition, and the following proposition establishes that both have the other essential property of covariance matrices. A proof is given in the APPENDIX.
PROPOSITION 1.
and
are nonnegative definite.
In the remaining sections, I write M
0 to indicate that a symmetric matrix M is nonnegative definite, and M > 0 to indicate that M is positive definite.
The multiple battery factor analysis model:
The matrix
containing admixture covariances has a factorization given by the following proposition; the covariance structure model is then an immediate consequence. A proof is given in the APPENDIX.
PROPOSITION 2. There exists a matrix
Lx(K–1) of full column rank such that
.
Proposition 2 establishes that the rank of
is one less than the number of linearly independent source population vectors p(1),..., p(K). Linear dependencies among p(1),..., p(K), which could arise if the source populations are related by recent descent from an ancestral population, or are themselves connected by migration, will reduce the rank of
accordingly.
![]() | (14) |
block-diagonal, is a covariance structure model associated with the "inter-battery" (for B = 2 blocks, TUCKER 1958) and "multiple battery" (for B
3, BROWNE 1980, BROWNE and Tateneni 1997) factor analysis models of psychometrics. The preceding development has shown that the covariance structure Equation 14 is implied by the pulse-decay model. In psychometrics, (14) is the covariance structure implied by the linear latent variable model
![]() | (15) |
Lx(K–1) are unknown parameter matrices, f
K–1 is a vector of unobserved "inter-battery" factors with E(f) = 0 and Var(f) = I, and z
L is a vector of unobserved "battery-specific" factors, independent of f, with E(z) = 0 and Var(z) =
(see BROWNE 1980). In ordinary factor analysis,
is a diagonal matrix containing the "unique variances" of the elements of x (LAWLEY and MAXWELL 1971). The present statistical problem is to estimate
and
given an observation S of
.
Psychometric studies of "difficulty factors" (see, e.g., MCDONALD and AHLAWAT 1974) caution against the naïve application of models (14) and (15) to binary data. However, alternative models for binary observations, including latent class and response function models (see e.g., LAZARSFELD and HENRY 1968; BARTHOLOMEW and KNOTT 1999; SATTEN et al. 2001), assume that all pairs of observed variables are conditionally independent given the latent variables, and thus do not allow for the within-block covariances encoded in
. The block-diagonal structure of
is an essential aspect of both the genetic model and the statistical methods described here.
ABSTRACT
THE PULSE-DECAY MODEL OF...
>ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
3, and give two data examples.
For
with a given column rank K – 1, the parameters of the model (14) are said to be identified if the map from (
,
) to
is one-to-one, i.e.,, if
![]() |
=
is partly or wholly identified.
We know from population genetics that sample allele frequencies impose constraints on observed LD values, so it is natural to ensure that the same constraints apply to model estimates. As the diagonal element s
is the sample variance of a binary variable, clearly
![]() | (16) |
For distinct loci l, m, let
and
, where
,
are the sample frequencies of the "1" allele at loci l and m. The inequality constraints
![]() | (17) |
are nonnegative. THOMSON and BAUR (1984) described an additional set of inequality constraints on the elements of S, deriving from the fact that three-locus gamete frequencies
are nonnegative. For the triplet of distinct loci l, m, j, the constraints can be written as
![]() | (18) |
![]() | (19) |
![]() | (20) |
![]() | (21) |
and
satisfy the constraints
if (16)–(21) hold, upon replacing sample quantities slm by their estimates
, for l, m
{1, ... L}.
I use generalized least-squares (GLS) methods for fitting the model (14) to data for two main reasons: these methods do not require explicit distributional assumptions, and the statistical properties of GLS estimates are known under fairly general conditions (see, e.g., BROWNE 1974, 1984; JÖRESKOG 1981). For a "weight" matrix VLxL > 0 yet to be chosen, GLS estimates of
and
are obtained by minimizing the discrepancy function
![]() | (22) |
![]() |
,
): rank(
) = K – 1,
0}; for, a minimizer
of the unconstrained problem may, for a given data set, satisfy the constraints
. In such a situation,
is also the minimizer of the constrained problem. When V is not a function of
or
, straightforward matrix differentiation (see, e.g., MAGNUS and NEUDECKER 1988, section 9.9) gives the estimating equations
![]() | (23) |
![]() | (24) |
. Equations 23 and 24 give necessary conditions for a local (or global) minimum of fV, at a differentiable point of the parameter space (MAGNUS and NEUDECKER 1988, Sect. 7.4).
In the following sections, I use weight matrices V > 0 having the same block-diagonal structure as
. There are two main reasons for this choice: first, block-diagonal V can be chosen to reflect unique features (levels of polymorphism, etc.) of particular blocks; second, block-diagonal V leads to a simple estimating equation for
. For block-diagonal V, pre- and post-multiplication by V–1, along with straightforward matrix algebra, establishes that the necessary condition (24) is equivalent to
![]() | (25) |
![]() | (26) |
ABSTRACT
THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
>THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
and
, with Lx + Ly = L. The covariance structure model (14) is then
![]() | (27) |
x,
y are respectively of dimensions Lx x (K – 1), Ly x (K – 1). I assume without loss of generality that Lx
Ly.
Unfortunately,
and
are essentially unidentified for the model with B = 2 blocks. Given
,
,
, and
, which solve (27), many new solutions can be generated by the family of transformations
![]() |
; thus we need to impose additional criteria to choose a "good" solution from the many. Although
and
are not identified, the off-diagonal blocks
, under the parametrization
, are identified as
![]() |
I advocate a solution of model (27) first described by RAO (1973, 1979), and here give a rationale for its use in the admixture decomposition. I assume in the following that Sxx and Syy are nonsingular and that
![]() |
is accordingly
![]() | (28) |
and
are orthogonal,
,
![]() |
···
rd are the d
Lx singular values.
The RAO (1979) solution for the model with B = 2 blocks is
![]() | (29) |
(K–1) = diag(r1, ... rK–1), and U(K–1), W(K–1) respectively contain the columns of U, W corresponding to the ordered singular values in
(K–1). The singular values r1
···
rd of S
SxyS
, all of which lie in [0, 1], are the sample canonical correlations between the variables x and y.
The Rao solution can be connected to the unconstrained GLS problem by a particular choice of V in (22), and by use of the corresponding necessary condition
fV/
= 0. A proof of the proposition is given in the APPENDIX.
PROPOSITION 3. For
, the Rao solution (29) gives a global minimum of fV(
,
; S) on the set {(
,
): rank (
) = K – 1,
0}.
Thus the Rao solution is the GLS solution for the unconstrained problem, given the choice of V immediately above. BROWNE (1979) shows that the Rao solution (29) gives maximum-likelihood estimates of (
,
) in the multivariate normal setting, under additional identifiability conditions. The connection given by Proposition 3 between the Rao solution and the minimization of the GLS discrepancy function is apparently new.
There is no guarantee that the Rao solution (29) will satisfy the constraints
imposed by a given data set. Numerical methods for minimization of the discrepancy function (22), which explicitly incorporate inequality constraints, have been described by LEE (1980); these remain an option if the fit of the Rao solution is unsatisfactory.
Data from The SNP Consortium:
This is the first of two data examples using SNPs organized into two blocks on the human X chromosome. My aim is more to show a proof of principle rather than to demonstrate any elaborations of the analysis. Although the examples may seem rather narrow, I do not intend that the method be limited only to very similar situations. In both examples I use only X chromosomes typed in males, in order to avoid any complications introduced by either inferring phase or estimating pairwise LD by indirect methods. Both data sets are, in a broad sense, deliberately admixed, as they combine population samples meant to characterize human ethnic or geographic diversity. However, I will not assume that ethnic or geographic labels necessarily correspond to source populations, even if evidence for admixture is found.The first data set is a sample of n = 110 independent X chromosomes, typed by Celera Genomics as part of The SNP Consortium (TSC) Linkage Map Project (CLARK et al. 2003; MATISE et al. 2003). The n = 110 males are comprised of the following (presumably self-identified) groups: 13 Asian males from the Celera diversity panel of 30 unrelated Asians; 15 African-American males from the Celera diversity panel of 30 unrelated African Americans; and 82 Caucasian males from the Utah CEPH pedigrees typed in "phase 2" of the Linkage Map Project. I selected CEPH males who could provide independent X chromosomes: typically for a three-generation pedigree, I included the father's and paternal grandfather's X chromosomes, along with a single X chromosome chosen at random from the maternal grandfather and sons.
The 11 SNPs used in this example are a subset of 39 X-linked SNPs typed by Celera Genomics in the diversity panels and CEPH Utah pedigrees. Two broad requirements guided the selection of these SNPs. First, the SNPs can be assigned to two blocks located on different arms of the X chromosome (see Figure 2 for the physical map). Second, the number of parameters estimated in fitting the model with 11 SNPs is moderate, compared to the sample size n = 110. With 11 SNPs there are 11 x 10/2 = 55 population covariances estimated by sample statistics, along with 11 allele frequencies. There are at most 5 canonical correlations to include in the model, determined by the size of the smaller block. Once the number of canonical correlations to retain has been decided (by statistical inference), the other components of the Rao solution are uniquely determined. Thus in the present case there are at most 55 + 11 + 5 = 71 quantities to be estimated in fitting the model.
|
Inference of the number of source populations:
In the case of B = 2 blocks, fitting the Rao solution (29) turns the problem of inferring the number of independent sources into that of inferring the number of nonzero canonical correlations of the population LD matrix. Although test-statistics for making this inference exist, their null distributions are known only when the data are multivariate normal, or more generally are from an elliptically-contoured family (MUIRHEAD and WATERNAUX 1980). I use a bootstrap approach in order to avoid making assumptions about the data distribution. It will be convenient to define some basic sample quantities: for i = 1,...n, let
,
be the x and y blocks of the ith gamete, and let
,
be the centered data matrices. The five sample canonical correlations for the TSC data are 0.46, 0.35, 0.25, 0.16, and 0.08.
The initial inference to be made is whether or not there is any support for admixture. For the pulse-decay model, this is equivalent to testing the null hypothesis that there is only one source population, hence no contribution from the term 
' in (14). Under this hypothesis
itself is block-diagonal. A likelihood-ratio test for the null hypothesis of blockwise independence (block-diagonal covariance structure), assuming multivariate normal data, is well known (see, e.g., ANDERSON 1984, Chap. 9; SEBER 1984, Sect. 3.5). For B = 2 blocks, the test-statistic is
![]() | (30) |
···
rd are the sample canonical correlations (ANDERSON 1984, Sect. 12.4.1). For multivariate normal data, the sample value t0 is compared to the
2 distribution on LxLy degrees of freedom.
The null hypothesis of blockwise independence suggests a natural permutation strategy that I use to obtain a null distribution for T0: if the sample consists of pairs (x1, y1) , ... , (xn, yn), then a permuted sample under blockwise independence is (x1, y
(1)), ... , (xn, y
(n)), where
is a uniform random permutation of {1, ... , n}. I computed the test-statistic T0 for each of R = 9, 999 permuted samples, and obtained an approximate p-value of
![]() |
, ... , t
are the values of T0 for the permuted samples (see DAVISON and HINKLEY 1997, Sect. 4.3). There is then good evidence against simple block-diagonal structure in the TSC data. The permutation test does not extend readily to sequential tests which add nonzero canonical correlations to the model one at a time. So I proceed by computing sequential confidence regions, for the first canonical correlation, then for the first and second canonical correlations jointly, etc., by nonparametric bootstrapping. I take inclusion of the value zero in a confidence region as evidence that the corresponding correlation is nonsignificant.
Construction of a confidence interval for
1 is the first step. I use a bootstrap approach described by DAVISON and HINKLEY (1997, section 2.7.2) which employs smooth transformations of sample statistics and corresponding variance estimates. An approximate pivotal quantity can be obtained from the first sample canonical correlation r1 using FISHER's (1921) transformation
![]() | (31) |
![]() |
![]() | (32) |
and
, respectively, and
,
are the sample canonical vectors associated with r1. The basic quantity to be bootstrapped is then
. The function boot.ci, available in the boot library of the statistical software R (R DEVELOPMENT CORE TEAM 2005, version 2.2.1), handles the confidence interval calculations readily. A 95% studentized confidence interval for
1 for the TSC data, obtained from R = 9, 999 ordinary nonparametric bootstraps, is (0.18, 0.47), confirming the existence of a first canonical correlation clearly distinct from zero. As the sample canonical correlation is r1 = 0.46, r1 is apparently highly biased as a naïve estimator of
1. This can be confirmed by performing basic bootstraps of r1.
Bootstrapping the first two or more canonical correlations together requires additional care: the nonindependence of the canonical correlations (imposed by their rank ordering) calls for joint bootstrapping and the use of a covariance matrix reflecting the joint structure. For a joint bootstrap of the first two canonical correlations, each transformed according to Fisher's function, the covariance matrix estimate based on empirical influence values is d'
d, where
,
![]() | (33) |
![]() |
,
, and
,
are the sample canonical vectors associated with r2 (see HAMPEL et al. 1986, Equation 4.2.2; DAVISON and HINKLEY 1997, Sect. 5.8). The basic quantity to be bootstrapped is then the quadratic form
![]() | (34) |
![]() |
is the key quantity recovered from the bootstrap samples; from this, a 100(1 –
)% joint confidence region for (
1,
2) is obtained as
![]() | (35) |
d. In the present case, d'
d was invertible in all R = 9, 999 bootstrap samples.
Figure 3 shows joint confidence regions for (
1,
2) based on R = 9, 999 bootstrap samples. The joint confidence regions clearly support the inference
1 > 0, and give weaker, though measurable, support for
2 > 0. Before settling on a model with two canonical correlations, support for
3 > 0 needs to be investigated. I used the natural extensions of (33)–(35) to three variables, to compute joint confidence regions for (
1,
2,
3) based on R = 9, 999 bootstrap samples. With three parameters the confidence regions are nested three-dimensional surfaces, similar to the Russian dolls known as "matreshki." The nested surfaces are not easy to display graphically, so in Figure 4 I show the intersections of the surfaces with the two-dimensional (
1,
2) plane, in which
3 = 0. Each of the three confidence regions intersects the plane, so I conclude that there is little support for a third canonical correlation. Thus the inference is that there are two significant correlations in the TSC data, indicating the presence of three independent sources of variation.
|
|
Other methods for the TSC data:
The computer program Structure, available at http://pritch.bsd.uchicago.edu/structure.html, implements Bayesian methods for fitting admixture models described in PRITCHARD et al. (2000) and FALUSH et al. (2003). PRITCHARD et al. (2000) describe a method for inferring the number of source populations K, based on estimates of the probability of the sample given different values of K. I used Structure, along with rules of thumb described in the accompanying user's manual, to infer the number of source populations for the TSC data, for comparison with the inference made above. For each of three models of ancestry implemented in Structure (the "no admixture", "admixture" and "linkage" models), I carried out five independent runs of the Structure program at each value of K = 1, ... , 5 (for a total of 75 runs). For each run, a "burn-in" of 100,000 iterations was followed by 1,000,000 Markov-chain Monte Carlo updates. Arguably, none of the ancestry models implemented in Structure are appropriate for the TSC SNPs, some of which are very tightly linked. Each ancestry model gave strongest support to the value K = 1 for the TSC data. For models with K > 1, in all runs, the overall sample proportion assigned to a "cluster" was 1/K for each cluster. These results are understood to indicate an absence of population structure.
The programs in Eigensoft (http://genepath.med.harvard.edu/
reich/EIGENSTRAT.htm) perform a principal components analysis described by PATTERSON et al. (2006) for inference of admixture. Under this method, K – 1 is inferred as the number of statistically significant eigenvalues of an interindividual covariance matrix [having a form similar to expression (4)]. As for Structure, I have reservations about using Eigensoft on the TSC sample, because the data architecture is of a form not intended by PATTERSON et al. (2006). In particular, Patterson et al. intend (but do not require) that the number of markers L be larger than the number of individuals n, and acknowledge that their method is not designed specifically to accomodate markers in linkage disequilibrium. I applied the Eigensoft genotype coding scheme to the TSC X chromosomes as follows: a male with the "reference" allele (chosen arbitrarily) at an X chromosome locus has the value 0 at the locus, whereas a male with the "variant" allele has the value 2. I used the regression method described in PATTERSON et al. (2006) to correct for LD, and varied the number of adjacent markers used for adjustment between 1, 2 and 3. In each case Eigensoft called all eigenvalues nonsignificant, thus the inference would be that only one population contributed to the TSC sample.
The admixture decomposition:
The Rao solution can be computed by straightforward matrix operations commonly implemented in statistical or numerical packages. I do the calculations in R (R DEVELOPMENT CORE TEAM 2005, version 2.2.1), and describe a numerically stable method for computing the decomposition in the APPENDIX.
The Rao decomposition for the TSC databased on two canonical correlations is shown graphically in Figure 5. The estimates
*
*' and
* satisfy the constraints
imposed by the sample allele frequencies. Covariances due to admixture, involving rs1541354 and rs1859004 (respectively, the third and fourth SNPs from the left in Figure 5) with several of the other SNPs, can be seen as vertical bands in the third and fourth columns of
*
*'. A comparison of the off-diagonal block of the residual matrix with the corresponding block of S reveals a notable tendency for residual covariances to be corrected toward zero.
|
Data from the International HapMap Consortium:
In broad terms, the second example repeats the analysis of the TSC data, using a different sample of X chromosomes typed by the INTERNATIONAL HAPMAP CONSORTIUM (2005), HapMap Public Release #21, http://www.hapmap.org. Here I want to demonstrate that a different sample having similar structure yields similar results. The HapMap sample consists of 53 males of the Yoruban parent-offspring trios from Ibadan, Nigeria; 23 Japanese males from Tokyo; 22 Han Chinese males from Beijing; and 44 males of the Utah CEPH parent-offspring trios. Some of the Utah CEPH pedigrees typed by Celera Genomics for TSC were also typed by HapMap; thus 20 of the Utah males I include in the HapMap sample were also in the TSC sample.Although HapMap typed many more SNPs overall than TSC, the particular SNPs I used for the TSC sample were incompletely typed by HapMap. I elected to choose a new set of SNPs for the HapMap sample, located in the same X-chromosome regions represented by blocks 1 and 2 of the TSC sample. The HapMap browser locates the TSC SNPs of block 1 in the X-chromosome region 9.2 - 13.7Mb, and the SNPs of block 2 in 139.3 - 147.3Mb. I judged that a model with 15 SNPs would preserve a reasonable balance between sample size (n = 142) and number of parameters. I chose 15 SNPs at random from those typed by HapMap, allocated to blocks 1 and 2 in rough proportion to the sizes of the two regions. Prior to choosing the 15 SNPs at random, I excluded SNPs from the two regions which had obvious genotyping errors or which failed to type in one or more individuals. I discarded approximately three random SNP sets because the sample covariance matrices they produced were singular, or because the Rao decomposition based on 1, 2 or 3 canonical correlations would have failed to satisfy the constraints imposed by the sample allele frequencies. The latter biased the selection toward a SNP set which yields a good fit of the model, but did not privilege any particular value of K. Figure 6 shows the physical map of the 15 chosen SNPs.
|
I used the same data analysis strategy for the HapMap sample as for the TSC sample, so I will summarize results briefly. The six sample canonical correlations for the HapMap data are 0.50, 0.41, 0.33, 0.23, 0.20 and 0.09. The permutation test of the statistic T0 produced an approximate p-value of 0.0007, indicating the presence of at least one significant canonical correlation. A 95% confidence interval for the first correlation is (0.25, 0.49), and joint confidence regions for the first two correlations are shown in Figure 7. The HapMap plot (not shown) analogous to Figure 4 has no contours in the (
1,
2) plane: thus
3 = 0 is excluded from each of the 95, 90, and 85% confidence regions. I take this to imply support for a third correlation. Graphical display of 4-dimensional regions is highly problematic, so I checked for the presence of a fourth significant correlation numerically. I bootstrapped the four-dimensional analog of (34) in order to estimate quantiles q*, then evaluated (34) at candidate points (
1,
2,
3,
4 = 0), with
1,
2 and
3 varying in a region of high support. I found many such parameter combinations inside the joint confidence regions, and concluded that there is little evidence in the HapMap sample for a fourth correlation. Thus the inference is that there are three significant correlations in the HapMap data, indicating the presence of four independent sources of variation. The Rao decomposition for the HapMap data, based on three canonical correlations, is shown graphically in Figure 8. The decomposition satisfies the constraints
.
|
|
Other methods for the HapMap data:
I followed the same strategy using Structure to infer K for the HapMap data as for the TSC data. In contrast to the TSC data, Structure finds evidence for multiple clusters in the HapMap data; in particular, the "linkage" model produces evidence for four clusters. In five runs of the linkage model with K = 4, Structure consistently assigned overall sample proportions near 0.27 to three clusters, and 0.2 to a fourth cluster. Posterior probabilities of cluster membership suggest that most individuals are of mixed ancestry, having roughly equal proportions of their alleles assigned to two or more clusters.I followed the same strategy using Eigensoft for the HapMap data as for the TSC data. Eigensoft calls one eigenvalue significant when either 1 or 2 adjacent markers are used for LD adjustment, and calls no eigenvalues significant when 3 adjacent markers are used. Thus Eigensoft suggests that at most two sub-populations have contributed to the HapMap X chromosome data.
3 BLOCKS
ABSTRACT
THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
>THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
3 is identified up to post-multiplication of
by an orthogonal matrix (i.e., up to rotations of
). However, explicit formulae for parameter estimates are not readily available.
To fix notation, the block-partitioned parameter matrices are
, of the form (13), and
![]() |
b is Lb x (K – 1); b = 1 , ... , B. For B
3, BROWNE (1980, Proposition 2) shows that a sufficient condition for identification of
, and of
up to rotations, is that at least three of the sub-matrices
b of
be of full column rank.
An assumption of the pulse-decay model, built on the notion that
, is that the source-population vectors p(1), ... p(K) are linearly independent. If Lb, the number of markers in the bth block, is sufficiently large, it is reasonable to assume that linear independence also holds for the block-specific sub-vectors pb(1) , ... , pb(K). In fact, provided Lb
K – 1, a block-specific version of Proposition 2 (of this paper) is readily established, using suitably defined matrices
b,
bb, and a proof of the same form. Browne's Proposition consequently implies that
is identified, and
is identified up to rotations, if Lb
K – 1 for at least three blocks b
{1, ... B}. Furthermore, when at least three blocks meet this minimum size criterion, the matrix
= 
' is unique: for
= 
' =
Q (
Q)' when Q is orthogonal.
Obtaining an analog of Proposition 3 (of this article) for B
3 appears to be an open problem. Using
![]() |
= B diag[S – 
'] and the reasoning of Proposition 3, GLS estimates 
, ... , 
for the unconstrained problem would be obtained by minimizing
![]() | (36) |

obtained as i varies.
At present, numerical estimation of (
,
) appears to be the only feasible option for the model with B
3. LEE (1980) gives algorithms for GLS estimation under parameter constraints, which are appropriate for the data under consideration. For inference of K, an ad hoc strategy involving sequential goodness-of-fit tests, similar to a method used in ordinary factor analysis (see, e.g., BARTHOLOMEW and KNOTT 1999, Sect. 3.8–3.9), may be pursued. For a candidate value K and a GLS estimate
*, where
* has K – 1 columns, the fit of the model may be assessed by testing the residual matrix S –
*
*' for block-diagonal structure (see e.g., ANDERSON 1984, Chap. 9; SEBER 1984, Sect. 3.5); this assumes that such tests could be suitably adapted to multivariate binary observations. The smallest candidate value
providing the required block-diagonal residual matrix may be taken as an estimate of K.
ABSTRACT
THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
THE MODEL WITH B...
>DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
I can suggest two immediate, practical applications of this work. First, the model provides a well-supported statistical method for inferring the number of source populations contributing to an admixed population, for samples in which SNP markers can be organized into two blocks. In multiblock samples, the same method may still be applied by focusing attention on two well-chosen blocks. A complete extension of the method to multiblock samples is a desirable objective for future work. For the TSC and HapMap samples, I used canonical correlations to infer the presence of K = 3 and K = 4 independent sources of covariance, respectively. However, the inferred sources need not correspond to the three (respectively four) groups defined in these samples by ethnic labels. Under the model, SNPs, not labeled individuals, come from independent sources.
A second application is the detection, through examination of
* =
*
*', of SNPs worthy of scrutiny in an association mapping study, biogeography survey, or other context in which genetic mixtures are of interest. In both data examples considered here, examination of
*
*' revealed SNPs that had relatively large admixture LD in pairwise combinations with other markers. Such SNPs would be more likely to show "spurious" associations with phenotypic traits, in association-mapping studies not designed to detect admixture effects. The elements of
* are components of LD remaining after admixture artifacts have been removed; thus
* contains observed LD values adjusted for admixture effects. These quantities may provide suitable surrogates for physical linkage in admixed populations, similar to conventional LD measures in unstructured populations. We can imagine an association-mapping scenario in which a binary trait has been roughly mapped to a linkage group (say, represented by block 1), and it is desired to fine-map the trait in a potentially admixed population. We include the trait as a variable in block 1, and fit the covariance structure model to block 1 along with a second unlinked block, to detect associations between the trait and markers due to admixture alone. The elements of
* then give corrected components of LD between the trait and markers of block 1 and suggest which markers lay in close proximity to the trait locus.
I do not use the model to make inferences at the level of the individual, concerning, e.g., the proportion of alleles on a gamete inherited from a putative source population. The fine-grained information provided by the model consists of the matrix elements of the decomposition. Progress toward individual inference may be possible by noting that the canonical variable pairs (sk, tk); k = 1, ... d, used here to calculate empirical influence values, are also used in classification and assignment (see KRZANOWSKI and MARRIOTT 1995, Sect. 4.11). Formal inference at the individual level apparently requires estimation of individual factor scores f in Equation 15. These factor scores are indeterminate from the view of multivariate analysis (see, e.g., LAWLEY and MAXWELL 1971, Chap. 8; MARDIA et al. 1979, Sect. 9.7): a single sample cannot be used to simultaneously estimate
,
, and the individual factor scores f1 , ... , fn.
A reviewer asked if the model and statistical methods still apply when phase-unknown genotypes, rather than gametes, are observed. Only the method for inference of the number of significant canonical correlations needs to be modified for genotype data. Pairwise composite LD estimates, which require only unphased genotypes (COCKERHAM and WEIR 1977; WEIR 1979), can be organized in matrix form to estimate the population covariance matrix. The covariance structure model (14) and the generalized least-squares method depend only on the covariance matrix and sample allele frequencies. Fitting the Rao solution (29) for B = 2 blocks requires a decision about the number of canonical correlations to include; although here I bootstrapped gametes to make the decision, other avenues are available. There is no impediment to bootstrapping unphased genotypes, although to perform calculations analogous to these, empirical influence values for genotypes will need to be derived, perhaps using techniques of ROMANAZZI (1992) or GU and FUNG (1998). Genotype-based influence values could then be used similarly to (32) and (33) in variance approximations. DÜMBGEN (1998) gives an elegant method for constructing simultaneous confidence intervals for canonical correlations. The numerical implementation of Dümbgen's method requires only bootstrap realizations of the sample covariance matrix, which could be formed by composite LD calculations on bootstrapped genotypes. A key quantity, ß, used in Dümbgen's method is of order (L/n)1/2, and a small value of ß is desired to produce reasonably narrow intervals. After experimenting with Dümbgen's method, I concluded that samples larger than the TSC or HapMap samples would be needed for successful implementation.
In fitting the model to data, I have stayed within relatively strict limits on the number of parameters estimated, vis à vis the sample size, to avoid straining the statistical theory unduly. I view each sample covariance as an estimate of a population covariance parameter: thus the number of parameters of the model is a quadratic function of L, the number of SNPs. Admittedly, using the sample size as an upper bound on the number of parameters of the model seems severe, at a time when thousands of SNPs can be typed simultaneously in an individual. However, allowing the sample size and number of SNPs to be in the reverse relationship, n < L, calls upon multivariate theory still in early development.
Recently, PATTERSON et al. (2006) have also taken a multivariate approach to the analysis of population structure. A thorough comparison of the method I have described with that of Patterson et al. will require significant work, although a few points of comparison can be discerned. Patterson et al. note that blocks showing significant "standing" LD pose problems for their method, although they give an ad hoc strategy for handling this situation. The method I have described makes explicit use of such blocks, via the block-diagonal structure of
. A key difference between the two approaches is that Patterson et al. decompose an n x n matrix of inter-individual covariances, whereas I decompose an L x L matrix of covariances between SNPs. Under circumstances described by GOWER (1966) and JOLLIFFE (2002, Sect. 5.2), there is a duality relationship between subject-space and variable-space decompositions. If a model in variable space dual to the model of Patterson et al. can be described, it will be of interest to study the structure of the matrices analogous to
and
. Notably, Patterson et al. have developed their approach with the relationship n < L specifically in mind. Although they use multivariate theory for n < L to the extent it is available, they necessarily rely on (well-argued) conjecture and numerical demonstration where theoretical results are lacking.
In my view, the relationship between n and L is connected to basic notions of degrees of freedom. The need for theoretical and analytic tools suitable for genomic data may prompt us to rethink how much can be inferred from a sample of fixed size. However, it seems prudent to admit from the start that the amount of information provided by a sample is limited and that the scope of inference should be limited accordingly.
I obtained the flat files TSCmap.p1.Celera.txt.gz and TSCmap.p2.Celera.txt.gz, from which the TSC data were extracted, at http://snp.cshl.org/. I obtained the flat files genotypes_chrX_CEU_r21_nr_fwd.txt.gz, genotypes_chrX_JPT+CHB_r21_nr_fwd.txt.gz, and genotypes_chrX_YRI_r21_nr_fwd.txt.gz, from which the HapMap build 35 data were extracted, at http://www.hapmap.org/. Python scripts used to perform data extraction and R scripts used to perform data analysis are available from the author.
ABSTRACT
THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
>APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
Proof of Proposition 1:
Note that the Schur product theorem (see, e.g., HORN and JOHNSON 1985, Sect. 7.5) cannot be used in the present case, as Rt
0 is not assumed.
Because 
'
0 by definition, for any y
L,
![]() | (37) |
0. Let Db(k) be the covariance matrix for the loci of block b in the kth source population, and let Rt,b be the bth diagonal block of Rt, for b = 1 , ... , B. As Db(k)
0 and q(k) > 0,
as well. The nonnegative definiteness of
is then established by considering y'
by in a manner analogous to (37), for
. Finally, because each of the diagonal blocks
1 , ... ,
B of
is nonnegative definite,
0 as well (see, e.g., HARVILLE 1997, Lemma 14.8.3).
Proof of Proposition 2:
is symmetric and nonnegative definite, so only its rank is in question. The rank of
is obtained by showing that
and 
', the latter of which has an easily determined rank, have the same null-space
.
For y
L, the lth element of the column vector
satisfies the inequalities
![]() | (38) |
(
'), then
![]() |
)l = 0; l = 1, ... L. Hence y

. By similar reasoning, if y

, then y
(
').
(
) consists of scalar multiples of the vector
![]() |
= rank(
') = rank(
) = K – dim[
(
)] = K – 1.
Having established that
is symmetric, nonnegative definite with rank K – 1, the existence of the decomposition
= 
' is a standard result (see, e.g., HARVILLE 1997, Theorem 14.3.7).
Proof of Proposition 3:
The necessity of (25) for block-diagonal V implies that minimization of fV, for the unconstrained problem, need only be carried out on
![]() |
= B diag[S – 
'] in fV, along with the particular choice of V, yields a "concentrated" discrepancy function f* depending only on
:
![]() | (39) |
of rank K – 1, which minimizes (39). The required matrix is obtained by rescaling the SVD of
and retaining the components corresponding to the K – 1 largest singular values. RAO (1979) gives
![]() |
![]() |
, with 
and 
as in (29), is somewhat arbitrary, but produces
* of the appropriate rank.
A modification of Rao's result (1973, Sect. 8f.3) for the statistical setting establishes that
*
0, as follows. Because U is orthogonal, Sxx = S
UU'S
, and hence
![]() |
are the columns of U and 1
r1
···
rK–1
0. Because Z
0, the product
0 as well (see, e.g., HARVILLE 1997, Theorem 14.2.9). Nonnegative definiteness of 
is established using a similar argument, and together with 
0 this implies
*
0 (see, e.g., HARVILLE 1997, Lemma 14.8.3).
Numerical calculation of the Rao decomposition:
The following calculations are adapted from SEBER (1984, Sect. 5.7, 10.1.5). The QR decompositions of the centered data matrices are
,
, where
,
have orthonormal columns, and
,
are upper triangular with positive diagonal elements. The sample canonical correlations r1
···
rd occur as the diagonal elements of
in the singular-value decomposition
![]() | (40) |
,
are orthogonal and
is as in Equation 28 (SEBER 1984, Sect. 10.1.5). Equation 40, along with some matrix algebra, yields the decomposition
![]() |
![]() | (41) |
ABSTRACT
THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
APPENDIX
>ACKNOWLEDGEMENTS
LITERATURE CITED
ABSTRACT
THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
>LITERATURE CITED
ANDERSON, T. W., 1984 An Introduction to Multivariate Statistical Analysis, Ed. 2. Wiley & Sons, New York.
BARTHOLOMEW, D. J., and M. KNOTT, 1999 Latent Variable Models and Factor Analysis. Arnold, London.
BROWNE, M. W., 1974 Generalized least-squares estimators in the analysis of covariance structures. South African Stat. J. 8: 1–24.
BROWNE, M. W., 1979 The maximum-likelihood solution in inter-battery factor analysis. Br. J. Math. Statist. Psychol. 32: 75–86.
BROWNE, M. W., 1980 Factor analysis of multiple batteries by maximum likelihood. Br. J. Math. Statist. Psychol. 33: 184–199.
BROWNE, M. W., 1984 Asymptotically distribution-free methods for the analysis of covariance structures. Br. J. Math. Statist. Psychol. 37: 62–83.[Medline]
BROWNE, M. W., and K. TATENENI, 1997 Noniterative estimation for the multiple battery factor analysis model. Behaviormetrika 24: 3–18.
CAVALLI-SFORZA, L. L., and W. F. BODMER, 1971 The Genetics of Human Populations. W. H. Freeman, San Francisco.
CLARK, A. G., R. NIELSEN, J. SIGNOROVITCH, T. C. MATISE, S. GLANOWSKI et al., 2003 Linkage disequilibrium and inference of ancestral recombination in 538 single-nucleotide polymorphism clusters across the human genome. Am. J. Hum. Genet. 73: 285–300.[CrossRef][Medline]
COCKERHAM, C. C., and B. S. WEIR, 1977 Digenic descent measures for finite populations. Genet. Res. 30: 121–147.
CROW, J. F., and M. KIMURA, 1970 An Introduction to Population Genetics Theory. Burgess Publishing, Minneapolis.
DAVISON, A. C., and D. V. HINKLEY, 1997 Bootstrap Methods and Their Application. Cambridge University Press, Cambridge, UK.
DÜMBGEN, L., 1998 Perturbation inequalities and confidence sets for functions of a scatter matrix. J. Multivariate Anal. 65: 19–35.[CrossRef]
EKHOLM, A., P. W. F. SMITH and J. W. MCDONALD, 1995 Marginal regression analysis of a multivariate binary response. Biometrika 82: 847–854.
ETHIER, S. N., 1979 A limit theorem for two-locus diffusion models in population genetics. J. Appl. Prob. 16: 402–408.[CrossRef]
EWENS, W. J., and R. S. SPIELMAN, 1995 The transmission/disequilibrium test: history, subdivision, and admixture. Am. J. Hum. Genet. 57: 455–464.[Medline]
FALUSH, D., M. STEPHENS and J. K. PRITCHARD, 2003 Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–1587.
FISHER, R. A., 1921 On the probable error of a coefficient of correlation deduced from a small sample. Metron 1: 3–32.
GABRIEL, S. B., S. F. SCHAFFNER, H. NGUYEN, J. M. MOORE, J. ROY et al., 2002 The structure of haplotype blocks in the human genome. Science 296: 2225–2229.
GOWER, J. C., 1966 Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53(3–4): 325–338.
GOWER, J. C., and D. J. HAND, 1996 Biplots. Chapman & Hall, London.
GU, H., and W. K. FUNG, 1998 Assessing local influence in canonical correlation analysis. Ann. Inst. Statist. Math. 50(4): 755–772.[CrossRef]
HAMPEL, F. R., E. M. RONCHETTI, P. J. ROUSSEEUW and W. A. STAHEL, 1986 Robust Statistics: The Approach Based on Influence Functions. Wiley & Sons, New York.
HARVILLE, D. A., 1997 Matrix Algebra From a Statistician's Perspective. Springer-Verlag, New York.
HORN, R. A., and C. R. JOHNSON, 1985 Matrix Analysis. Cambridge University Press, Cambridge, UK.
INTERNATIONAL HAPMAP CONSORTIUM, 2005 A haplotype map of the human genome. Nature 437(27): 1299–1320.[CrossRef][Medline]
JOLLIFFE, I. T., 2002 Principal Component Analysis. Springer-Verlag, New York.
JÖRESKOG, K. G., 1981 Analysis of covariance structures. Scand. J. Statist. 8: 65–92.
KARLIN, S., and J. MCGREGOR, 1968 Rates and probabilities of fixation for two locus random mating finite populations without selection. Genetics 58: 141–159.
KRZANOWSKI, W. J., and F. H. C. MARRIOTT, 1995 Multivariate Analysis Part 2: Classification, Covariance Structures and Repeated Measurements. Arnold, London.
LAWLEY, D. N., and A. E. MAXWELL, 1971 Factor Analysis as a Statistical Method, Ed. 2. Butterworth, London.
LAZARSFELD, P. F., and N. W. HENRY, 1968 Latent Structure Analysis. Houghton Mifflin, Boston.
LEE, S. Y., 1980 Estimation of covariance structure models with parameters subject to functional restraints. Psychometrika 45: 309–324.[CrossRef]
LEWONTIN, R. C., 1964 The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49: 49–67.
LEWONTIN, R. C., and K. KOJIMA, 1960 The evolutionary dynamics of complex polymorphisms. Evolution 14: 458–472.[Medline]
MAGNUS, J. R., and H. NEUDECKER, 1988 Matrix Differential Calculus With Applications in Statistics and Econometrics. Wiley & Sons, New York.
MARDIA, K. V., J. T. KENT and J. M. BIBBY, 1979 Multivariate Analysis. Academic Press, London.
MATISE, T. C., R. SACHIDANANDAM, A. G. CLARK, L. KRUGLYAK, E. WIJSMAN et al., 2003 A 3.9-centimorgan-resolution human single-nucleotide polymorphism linkage map and screening set. Am. J. Hum. Genet. 72: 271–284.
MCDONALD, R. P., and K. S. AHLAWAT, 1974 Difficulty factors in binary data. Br. J. Math. Statist. Psychol. 27: 82–99.
MUIRHEAD, R. J., and C. M. WATERNAUX, 1980 Asymptotic distributions in canonical correlation analysis and other multivariate procedures for nonnormal populations. Biometrika 67: 31–43.
MYERS, S., L. BOTTOLO, C. FREEMAN, G. MCVEAN and P. DONNELLY, 2005 A fine-scale map of recombination rates and hotspots across the human genome. Science 310: 321–324.
OHTA, T., 1982 Linkage disequilibrium with the island model. Genetics 101: 139–155.
OHTA, T., and M. KIMURA, 1969 Linkage disequilibrium at steady state determined by random genetic drift and recurrent mutation. Genetics 63: 229–238.
PATTERSON, N., A. L. PRICE and D. REICH, 2006 Population structure and eigenanalysis. PLoS Genetics 2: 2074–2093.
PHILLIPS, M. S., R. LAWRENCE, R. SACHIDANANDAM, A. P. MORRIS, D. J. BALDING et al., 2003 Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet. 33: 382–387.[CrossRef][Medline]
PRITCHARD, J. K., M. STEPHENS and P. DONNELLY, 2000 Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
PROUT, T., 1973 Appendix, pp. 494–496 in Mitton, J.B., and R.K.Koehn, 1973, Population genetics of marine pelecypods. III. Epistasis between functionally related isoenzymes of Mytilus edulis. Genetics 73: 487–496.
R DEVELOPMENT CORE TEAM, 2005 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. http://www.r-project.org.
RAO, C. R., 1973 Linear Statistical Inference and Its Applications, Ed. 2. Wiley & Sons, New York.
RAO, C. R., 1979 Separation theorems for singular values of matrices and their applications in multivariate analysis. J. Multivar. Anal. 9: 362–377.[CrossRef]
ROBINSON, W. P., M. A. ASMUSSEN and G. THOMSON, 1991 Three-locus systems impose additional constraints on pairwise disequilibria. Genetics 129: 925–930.[Abstract]
ROMANAZZI, M., 1992 Influence in canonical correlation analysis. Psychometrika 57: 237–259.[CrossRef]
SATTEN, G. A., W. D. FLANDERS and Q. YANG, 2001 Accounting for unmeasured population structure in case-control studies of genetic association using a novel latent-class model. Am. J. Hum. Genet. 68: 466–477.[CrossRef][Medline]
SEBER, G. A. F., 1984 Multivariate Observations. Wiley & Sons, New York.
THOMSON, G., and M. BAUR, 1984 Third order linkage disequilibrium. Tissue Antigens 24: 250–255.[Medline]
TUCKER, L. R., 1958 An inter-battery method of factor analysis. Psychometrika 23: 111–136.[CrossRef]
WEIR, B. S., 1979 Inferences about linkage disequilibrium. Biometrics 35: 235–254.[CrossRef][Medline]
WEIR, B. S., 1996 Genetic Data Analysis. Sinauer Associates, Sunderland, MA.
WEIR, B. S., and C. C. COCKERHAM, 1974 Behavior of pairs of loci in finite monoecious populations. Theor. Pop. Biol. 6: 323–354.[CrossRef][Medline]
WICKENS, T. D., 1995 The Geometry of Multivariate Statistics. Erlbaum Associates, Hillsdale, NJ.
WRIGHT, S., 1940 Breeding structure of populations in relation to speciation. Am. Nat. 74: 232–248.[CrossRef]
Communicating editor: K. W. BROMAN
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
-
All Versions of this Article:
genetics.107.071779v1
176/4/2405 most recent - Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Grote, M. N.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Grote, M. N.






































































