- THIS ARTICLE
-
Abstract
- Full Text (PDF)
-
All Versions of this Article:
genetics.107.071779v1
176/4/2405 most recent - Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Grote, M. N.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Grote, M. N.
Originally published as Genetics Published Articles Ahead of Print on June 11, 2007.
Genetics, Vol. 176, 2405-2420, August 2007, Copyright © 2007
doi:10.1534/genetics.107.071779
A Covariance Structure Model for the Admixture of Binary Genetic Variation
Mark N. Grote1
Department of Anthropology, University of California, Davis California 95616
1 Address for correspondence: Department of Anthropology, University of California, 1 Shields Ave., Davis, CA 95616.
E-mail: mngrote{at}ucdavis.edu
>ABSTRACT
THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
I derive a covariance structure model for pairwise linkage disequilibrium (LD) between binary markers in a recently admixed population and use a generalized least-squares method to fit the model to two different data sets. Both linked and unlinked marker pairs are incorporated in the model. Under the model, a pairwise LD matrix is decomposed into two component matrices, one containing LD attributable to admixture, and another containing, in an aggregate form, LD specific to the populations forming the mixture. I use population genetics theory to show that the latter matrix has block-diagonal structure. For the data sets considered here, I show that the number of source populations can be determined by statistical inference on the canonical correlations of the sample LD matrix.
ADMIXTURE, the mixing of genetically differentiated populations via migration and subsequent intermating, can create linkage disequilibrium (LD) between genes, even when the genes are not physically linked (see, e.g., CAVALLI-SFORZA and BODMER 1971, p. 69; PROUT 1973). In this work, I show that admixture contributions to LD can be statistically quantified and distinguished from LD attributable to the shared ancestry of linked alleles. OHTA (1982) used Wright's island model to decompose a squared coefficient of LD into within- and between-population terms, in analogy with WRIGHT's (1940) decomposition of the inbreeding coefficient. These decompositions assume that populations connected by migration can be identified and sampled for genetic variation. The method I propose uses a pairwise LD matrix sampled from an admixed population of unknown composition; the number of source populations and the components of LD are inferred by use of a multivariate statistical model.
The data: blocks of binary markers:
It is convenient to develop the model using gametes as the basic units of observation. The data are then n random binary vectors of the form x = (x1,...xL)', with xl
{0, 1}; l = 1,..., L. Each vector represents the single nucleotide polymorphism (SNP) variation on one sampled gamete, under an arbitrary binary coding scheme, with xl indicating the allele on gamete x at the lth marker locus. The markers are assumed to be selectively neutral and variable in the sample.
Before statistical analysis begins, the markers in x are to be grouped into blocks by the investigator, based on physical criteria (e.g., the markers in a block share a localized region on a physical map), along with empirical evidence (e.g., the markers in a block are known by a previous linkage-mapping study to form a linkage group) independent of the sample under consideration. Each marker belongs to exactly one block; however, a particular marker may be the only member of a block. Any two markers l, m within the same block are assumed to be linked, with recombination fraction
. In contrast, markers l, j from different blocks are assumed to be unlinked, with recombination fraction clj
.
In the development below, block structure derives from physical and linkage relationships between markers, with blocks analogous to linkage groups; this is to be distinguished from empirical descriptions of "haplotype block structure" (see, e.g., GABRIEL et al. 2002; PHILLIPS et al. 2003). Recent work of the INTERNATIONAL HAPMAP CONSORTIUM (2005) and MYERS et al. (2005) suggests that haplotype block structure results from variation in recombination rates over small physical distances. The fine-scale rate estimates obtained by MYERS et al. (2005) demonstrate new tools for constructing blocks of tightly linked markers. In the data examples to follow, I form two blocks of markers on the arms of the human X chromosome, using simple physical criteria.
To fix notation, I assume that x can be partitioned into B blocks, labeled 1,..., B in any convenient order (though the same partitioning and labeling scheme is used for all gametes). For b = 1,...B, the bth block contains Lb
1 marker loci, with
. I order the vector elements x1,..., xL so that the indices
are assigned to the loci of the bth block. However, markers within a block need not be ordered in any particular way.
For binary loci, linkage disequilibrium is covariance:
For binary loci, a familiar pairwise LD parameter (LEWONTIN and KOJIMA 1960) is a covariance between locus-specific indicator variables. Let pl be the population frequency of the "1" allele at locus l, let plm be the population frequency of gametes carrying the "1" allele at both loci l and m, and let E be the expectation operator for the distribution on (xl, xm)
{0, 1} x {0, 1} parametrized by pl, pm, and plm. The LD parameter for loci l and m is then
![]() | (1) |
![]() | (2) |
, I treat the l, m entry of the sample covariance matrix
![]() | (3) |
In a recent article, PATTERSON et al. (2006) also use a sample covariance matrix to make inferences about admixture. Although the locus-specific variables defined by Patterson et al. refer to diploid genotypes and take values in {0, 1, 2}, it is useful to establish a relationship between the matrix they analyze and S. In the notation above (up to differences in variable definitions, and omitting a normalization step), the n x n matrix of inter-individual covariances analyzed by PATTERSON et al. (2006) could be written as
![]() | (4) |
ABSTRACT
>THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
|
Under the pulse-decay model, generation 1 of the admixed population consists of N diploid genotypes sampled at random with replacement from K
2 source populations. The probability that a genotype is chosen from the kth source population is q(k) > 0; k = 1,..., K, with
. The mixing proportions q(1),... q(K) are unknown. In practice, K may be an unknown parameter requiring estimation. A genotype contributed by the kth source population consists of two gametes chosen at random with replacement from the 2N(k) gametes present at generation zero in the kth source population. At generation zero, the frequency of the "1" allele at locus l in source population k is pl(k), k = 1,..., K, and the covariance between loci l and m in source population k is Dlm(k). I call the vector of allele frequencies
' from source population k p(k), and I assume that the source population allele frequency vectors p(1),..., p(K) are linearly independent. The latter assumption implies that the number of markers L is at least as large as the number of source populations K. Linear independence of p(1),..., p(K) also implies that no source population is an admixture of the other source populations. I describe the effect of weakening the assumption of linear independence briefly in a later section. Under the pulse-decay model, the source populations make no further contributions to the sampled population after generation 1. Mating is at random in the admixed population from generation 1 onward.
By straightforward calculation, the expected frequency of the "1" allele at locus l, at generation 1 in the admixed population, is
, where the expectation is over samples drawn from the source populations. Similarly, the covariance between loci l, m at generation one in the admixed population is
![]() | (5) |
The expected covariance in generation t + 1, over realizations of the Wright–Fisher process in the admixed population, is
![]() | (6) |
(0,
] is the recombination fraction between loci l and m (see, e.g., KARLIN and MCGREGOR 1968, Equation 7). WEIR and COCKERHAM (1974) provide related expressions for "genotypic" disequilibria (see WEIR 1996, Chap. 3), which completely specify two-locus dynamics in diploid organisms. Whereas D1,lm of Equation 5 is a population parameter,
t+1,lm of Equation 6 is a random variable. In Equation 6, the value clm = 0, corresponding to complete linkage of l and m, is reserved for the variance
ll, where it gives the formula for the decay of heterozygosity in a finite population (see e.g., CROW and KIMURA 1970, Section 3.11).
Examining Equation 6 with statistical modeling in mind, it is attractive to write the realized covariance in generation t + 1 as
![]() | (7) |
t,lm =
t,ml is an unknown parameter giving the proportion of initial covariance (or variance) remaining after t transitions of the Wright–Fisher process. A sample from the admixed population in generation t + 1 produces an estimate of Dt+1,lm, via the l, m element of the sample covariance matrix. Although
t,lm is an important structural component of the model, it will not be an object of estimation. To establish later mathematical results, I assume that there exist
and
such that
![]() | (8) |
) are expected to decay much more quickly than covariances between linked loci: for unlinked loci,
t,lm is on the order of 2–t. In light of this observation, (8) significantly limits the number of generations that could have occurred since the admixture pulse. Assumption (8) also rules out changes in the signs of covariances, up to and including generation t + 1. The pulse-decay model is thus revealed to be primarily suited for recent admixture. Accordingly, any additional changes in allele frequencies and covariances due to mutation are assumed to be small enough to ignore, within the time-scale of the model.
Under the pulse-decay model, the realized covariance matrix in the admixed population at generation t + 1 can be written as
![]() | (9) |
is the symmetric matrix having l, m element
t,lm,
is the Hadamard (elementwise) product,
![]() |
![]() |
, and
is the symmetric matrix having l, m element
. The columns of
are weighted vectors of source-population allele frequency deviations. The number of generations elapsed since the admixture pulse is assumed to be unknown. I do not propose to explicitly model time-dependent properties, but rather to model the effect of recent admixture on observed covariances. Hence, the dependence on t is dropped in the final expression of (9).
The properties of
and
:
The theory of population genetics predicts that covariances between unlinked loci will be negligible in the source populations, if the source populations are not very small. Here, I address the consequences of this prediction for
, along with other properties of
and
.
In terms of understanding and specifying the model, much is gained by making clear how model parameters are derived from stochastically varying quantities. To describe the association between alleles at loci l and m in the kth source population, I imagine a 2 x 2 table, formed from the set of gamete counts
![]() | (10) |
; I specify the probabilities
in more detail below for the case of unlinked loci. I view the l, m-locus gamete counts in the kth source population at generation zero as a particular realization of (10), and the allele frequencies pl(k), pm(k) appearing in Figure 1 and Equation 5 as marginal frequencies of the realized 2 x 2 table. The covariance between xl and xm in the kth source population, written as a function of the random gamete counts, is
![]() | (11) |
lm(k).
For unlinked loci l, m and 2N(k) sufficiently large, gamete frequencies follow the "independent loci" model (ETHIER 1979); under this model, the probability distribution on 2 x 2 tables has the property of row-by-column independence. Let
l(k) (respectively,
m(k)) be the probability that a randomly chosen gamete of the kth source population carries the "1" allele at locus l (m). Under the independent loci model, the multinomial probabilities
are
11(k) =
l(k)
m(k),
,
, and
. It is worth noting that the Wright–Fisher process in the kth source population need not be stationary in order for unlinked l, m to follow the independent loci model (ETHIER 1979). Further, for large source populations, even weakly linked loci (those with clm <
, but clm not very small) can be treated as independent (ETHIER 1979).
By straightforward calculation, under the independent loci model, E(
lm(k)) = 0, where the expectation is over multinomial samples forming the kth source population at generation zero. For 2N(k) sufficiently large, a calculation using the "delta method" gives
![]() | (12) |
, can be equated to zero for unlinked l, m. The fact that unlinked loci l, m are assigned to different blocks leads to the observation that
![]() |
is of block-diagonal form:
![]() | (13) |
b is of dimensions Lb x Lb; b = 1,..., B.
The sub-matrices
b, formed by locus pairs l, m in the same block, contain covariances attributable to the shared ancestry of linked loci. Calculations of OHTA and KIMURA (1969) suggest the approximation
, for linked loci with recombination fraction clm, where
. This variance characterizes the order of magnitude of Dlm(1),...Dlm(K) when l, m are linked. For
, Var(
lm(k)) is nonnegligible, compared to the scale of a sample covariance slm. In the pulse-decay model, the Dlm(k) for linked l, m are averaged over source populations in order to form the elements of
1,...,
B. In contrast to the elements of the off-diagonal-blocks, I do not equate the elements of
1,...,
B, to zero; yet, some elements of
1,...,
B may indeed be close to zero, if the covariances Dlm(k) for linked loci tend to be of opposite signs in different source populations, and if the weights q(1),..., q(K) are nearly equal. A scenario in which one source population has a relatively large mixing proportion q leads to a straightforward interpretation of the elements of
1,...,
B: here, the element
lm mainly reflects the covariance between l, m in the dominant source population.
The matrix
contains components of covariance resulting only from differences in allele frequencies between source populations. These components are independent of any notion of physical linkage between loci, and could be considered entirely artifacts of admixture.
and
are symmetric by definition, and the following proposition establishes that both have the other essential property of covariance matrices. A proof is given in the APPENDIX.
PROPOSITION 1.
and
are nonnegative definite.
In the remaining sections, I write M
0 to indicate that a symmetric matrix M is nonnegative definite, and M > 0 to indicate that M is positive definite.
The multiple battery factor analysis model:
The matrix
containing admixture covariances has a factorization given by the following proposition; the covariance structure model is then an immediate consequence. A proof is given in the APPENDIX.
PROPOSITION 2. There exists a matrix
Lx(K–1) of full column rank such that
.
Proposition 2 establishes that the rank of
is one less than the number of linearly independent source population vectors p(1),..., p(K). Linear dependencies among p(1),..., p(K), which could arise if the source populations are related by recent descent from an ancestral population, or are themselves connected by migration, will reduce the rank of
accordingly.
![]() | (14) |
block-diagonal, is a covariance structure model associated with the "inter-battery" (for B = 2 blocks, TUCKER 1958) and "multiple battery" (for B
3, BROWNE 1980, BROWNE and Tateneni 1997) factor analysis models of psychometrics. The preceding development has shown that the covariance structure Equation 14 is implied by the pulse-decay model. In psychometrics, (14) is the covariance structure implied by the linear latent variable model
![]() | (15) |
Lx(K–1) are unknown parameter matrices, f
K–1 is a vector of unobserved "inter-battery" factors with E(f) = 0 and Var(f) = I, and z
L is a vector of unobserved "battery-specific" factors, independent of f, with E(z) = 0 and Var(z) =
(see BROWNE 1980). In ordinary factor analysis,
is a diagonal matrix containing the "unique variances" of the elements of x (LAWLEY and MAXWELL 1971). The present statistical problem is to estimate
and
given an observation S of
.
Psychometric studies of "difficulty factors" (see, e.g., MCDONALD and AHLAWAT 1974) caution against the naïve application of models (14) and (15) to binary data. However, alternative models for binary observations, including latent class and response function models (see e.g., LAZARSFELD and HENRY 1968; BARTHOLOMEW and KNOTT 1999; SATTEN et al. 2001), assume that all pairs of observed variables are conditionally independent given the latent variables, and thus do not allow for the within-block covariances encoded in
. The block-diagonal structure of
is an essential aspect of both the genetic model and the statistical methods described here.
ABSTRACT
THE PULSE-DECAY MODEL OF...
>ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
3, and give two data examples.
For
with a given column rank K – 1, the parameters of the model (14) are said to be identified if the map from (
,
) to
is one-to-one, i.e.,, if
![]() |
=
is partly or wholly identified.
We know from population genetics that sample allele frequencies impose constraints on observed LD values, so it is natural to ensure that the same constraints apply to model estimates. As the diagonal element s
is the sample variance of a binary variable, clearly
![]() | (16) |
For distinct loci l, m, let
and
, where
,
are the sample frequencies of the "1" allele at loci l and m. The inequality constraints
![]() | (17) |
are nonnegative. THOMSON and BAUR (1984) described an additional set of inequality constraints on the elements of S, deriving from the fact that three-locus gamete frequencies
are nonnegative. For the triplet of distinct loci l, m, j, the constraints can be written as
![]() | (18) |
![]() | (19) |
![]() | (20) |
![]() | (21) |
and
satisfy the constraints
if (16)–(21) hold, upon replacing sample quantities slm by their estimates
, for l, m
{1, ... L}.
I use generalized least-squares (GLS) methods for fitting the model (14) to data for two main reasons: these methods do not require explicit distributional assumptions, and the statistical properties of GLS estimates are known under fairly general conditions (see, e.g., BROWNE 1974, 1984; JÖRESKOG 1981). For a "weight" matrix VLxL > 0 yet to be chosen, GLS estimates of
and
are obtained by minimizing the discrepancy function
![]() | (22) |
![]() |
,
): rank(
) = K – 1,
0}; for, a minimizer
of the unconstrained problem may, for a given data set, satisfy the constraints
. In such a situation,
is also the minimizer of the constrained problem. When V is not a function of
or
, straightforward matrix differentiation (see, e.g., MAGNUS and NEUDECKER 1988, section 9.9) gives the estimating equations
![]() | (23) |
![]() | (24) |
. Equations 23 and 24 give necessary conditions for a local (or global) minimum of fV, at a differentiable point of the parameter space (MAGNUS and NEUDECKER 1988, Sect. 7.4).
In the following sections, I use weight matrices V > 0 having the same block-diagonal structure as
. There are two main reasons for this choice: first, block-diagonal V can be chosen to reflect unique features (levels of polymorphism, etc.) of particular blocks; second, block-diagonal V leads to a simple estimating equation for
. For block-diagonal V, pre- and post-multiplication by V–1, along with straightforward matrix algebra, establishes that the necessary condition (24) is equivalent to
![]() | (25) |
![]() | (26) |
ABSTRACT
THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
>THE MODEL WITH B...
THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
and
, with Lx + Ly = L. The covariance structure model (14) is then
![]() | (27) |
x,
y are respectively of dimensions Lx x (K – 1), Ly x (K – 1). I assume without loss of generality that Lx
Ly.
Unfortunately,
and
are essentially unidentified for the model with B = 2 blocks. Given
,
,
, and
, which solve (27), many new solutions can be generated by the family of transformations
![]() |
; thus we need to impose additional criteria to choose a "good" solution from the many. Although
and
are not identified, the off-diagonal blocks
, under the parametrization
, are identified as
![]() |
I advocate a solution of model (27) first described by RAO (1973, 1979), and here give a rationale for its use in the admixture decomposition. I assume in the following that Sxx and Syy are nonsingular and that
![]() |
is accordingly
![]() | (28) |
and
are orthogonal,
,
![]() |
···
rd are the d
Lx singular values.
The RAO (1979) solution for the model with B = 2 blocks is
![]() | (29) |
(K–1) = diag(r1, ... rK–1), and U(K–1), W(K–1) respectively contain the columns of U, W corresponding to the ordered singular values in
(K–1). The singular values r1
···
rd of S
SxyS
, all of which lie in [0, 1], are the sample canonical correlations between the variables x and y.
The Rao solution can be connected to the unconstrained GLS problem by a particular choice of V in (22), and by use of the corresponding necessary condition
fV/
= 0. A proof of the proposition is given in the APPENDIX.
PROPOSITION 3. For
, the Rao solution (29) gives a global minimum of fV(
,
; S) on the set {(
,
): rank (
) = K – 1,
0}.
Thus the Rao solution is the GLS solution for the unconstrained problem, given the choice of V immediately above. BROWNE (1979) shows that the Rao solution (29) gives maximum-likelihood estimates of (
,
) in the multivariate normal setting, under additional identifiability conditions. The connection given by Proposition 3 between the Rao solution and the minimization of the GLS discrepancy function is apparently new.
There is no guarantee that the Rao solution (29) will satisfy the constraints
imposed by a given data set. Numerical methods for minimization of the discrepancy function (22), which explicitly incorporate inequality constraints, have been described by LEE (1980); these remain an option if the fit of the Rao solution is unsatisfactory.
Data from The SNP Consortium:
This is the first of two data examples using SNPs organized into two blocks on the human X chromosome. My aim is more to show a proof of principle rather than to demonstrate any elaborations of the analysis. Although the examples may seem rather narrow, I do not intend that the method be limited only to very similar situations. In both examples I use only X chromosomes typed in males, in order to avoid any complications introduced by either inferring phase or estimating pairwise LD by indirect methods. Both data sets are, in a broad sense, deliberately admixed, as they combine population samples meant to characterize human ethnic or geographic diversity. However, I will not assume that ethnic or geographic labels necessarily correspond to source populations, even if evidence for admixture is found.The first data set is a sample of n = 110 independent X chromosomes, typed by Celera Genomics as part of The SNP Consortium (TSC) Linkage Map Project (CLARK et al. 2003; MATISE et al. 2003). The n = 110 males are comprised of the following (presumably self-identified) groups: 13 Asian males from the Celera diversity panel of 30 unrelated Asians; 15 African-American males from the Celera diversity panel of 30 unrelated African Americans; and 82 Caucasian males from the Utah CEPH pedigrees typed in "phase 2" of the Linkage Map Project. I selected CEPH males who could provide independent X chromosomes: typically for a three-generation pedigree, I included the father's and paternal grandfather's X chromosomes, along with a single X chromosome chosen at random from the maternal grandfather and sons.
The 11 SNPs used in this example are a subset of 39 X-linked SNPs typed by Celera Genomics in the diversity panels and CEPH Utah pedigrees. Two broad requirements guided the selection of these SNPs. First, the SNPs can be assigned to two blocks located on different arms of the X chromosome (see Figure 2 for the physical map). Second, the number of parameters estimated in fitting the model with 11 SNPs is moderate, compared to the sample size n = 110. With 11 SNPs there are 11 x 10/2 = 55 population covariances estimated by sample statistics, along with 11 allele frequencies. There are at most 5 canonical correlations to include in the model, determined by the size of the smaller block. Once the number of canonical correlations to retain has been decided (by statistical inference), the other components of the Rao solution are uniquely determined. Thus in the present case there are at most 55 + 11 + 5 = 71 quantities to be estimated in fitting the model.
|
Inference of the number of source populations:
In the case of B = 2 blocks, fitting the Rao solution (29) turns the problem of inferring the number of independent sources into that of inferring the number of nonzero canonical correlations of the population LD matrix. Although test-statistics for making this inference exist, their null distributions are known only when the data are multivariate normal, or more generally are from an elliptically-contoured family (MUIRHEAD and WATERNAUX 1980). I use a bootstrap approach in order to avoid making assumptions about the data distribution. It will be convenient to define some basic sample quantities: for i = 1,...n, let
,
be the x and y blocks of the ith gamete, and let
,
be the centered data matrices. The five sample canonical correlations for the TSC data are 0.46, 0.35, 0.25, 0.16, and 0.08.
The initial inference to be made is whether or not there is any support for admixture. For the pulse-decay model, this is equivalent to testing the null hypothesis that there is only one source population, hence no contribution from the term 
' in (14). Under this hypothesis
itself is block-diagonal. A likelihood-ratio test for the null hypothesis of blockwise independence (block-diagonal covariance structure), assuming multivariate normal data, is well known (see, e.g., ANDERSON 1984, Chap. 9; SEBER 1984, Sect. 3.5). For B = 2 blocks, the test-statistic is
![]() | (30) |
···
rd are the sample canonical correlations (ANDERSON 1984, Sect. 12.4.1). For multivariate normal data, the sample value t0 is compared to the
2 distribution on LxLy degrees of freedom.
The null hypothesis of blockwise independence suggests a natural permutation strategy that I use to obtain a null distribution for T0: if the sample consists of pairs (x1, y1) , ... , (xn, yn), then a permuted sample under blockwise independence is (x1, y
(1)), ... , (xn, y
(n)), where
is a uniform random permutation of {1, ... , n}. I computed the test-statistic T0 for each of R = 9, 999 permuted samples, and obtained an approximate p-value of
![]() |
, ... , t
are the values of T0 for the permuted samples (see DAVISON and HINKLEY 1997, Sect. 4.3). There is then good evidence against simple block-diagonal structure in the TSC data. The permutation test does not extend readily to sequential tests which add nonzero canonical correlations to the model one at a time. So I proceed by computing sequential confidence regions, for the first canonical correlation, then for the first and second canonical correlations jointly, etc., by nonparametric bootstrapping. I take inclusion of the value zero in a confidence region as evidence that the corresponding correlation is nonsignificant.
Construction of a confidence interval for
1 is the first step. I use a bootstrap approach described by DAVISON and HINKLEY (1997, section 2.7.2) which employs smooth transformations of sample statistics and corresponding variance estimates. An approximate pivotal quantity can be obtained from the first sample canonical correlation r1 using FISHER's (1921) transformation
![]() | (31) |
![]() |
![]() | (32) |
and
, respectively, and
,
are the sample canonical vectors associated with r1. The basic quantity to be bootstrapped is then
. The function boot.ci, available in the boot library of the statistical software R (R DEVELOPMENT CORE TEAM 2005, version 2.2.1), handles the confidence interval calculations readily. A 95% studentized confidence interval for
1 for the TSC data, obtained from R = 9, 999 ordinary nonparametric bootstraps, is (0.18, 0.47), confirming the existence of a first canonical correlation clearly distinct from zero. As the sample canonical correlation is r1 = 0.46, r1 is apparently highly biased as a naïve estimator of
1. This can be confirmed by performing basic bootstraps of r1.
Bootstrapping the first two or more canonical correlations together requires additional care: the nonindependence of the canonical correlations (imposed by their rank ordering) calls for joint bootstrapping and the use of a covariance matrix reflecting the joint structure. For a joint bootstrap of the first two canonical correlations, each transformed according to Fisher's function, the covariance matrix estimate based on empirical influence values is d'
d, where
,
![]() | (33) |
![]() |
,
, and
,
are the sample canonical vectors associated with r2 (see HAMPEL et al. 1986, Equation 4.2.2; DAVISON and HINKLEY 1997, Sect. 5.8). The basic quantity to be bootstrapped is then the quadratic form
![]() | (34) |
![]() |
is the key quantity recovered from the bootstrap samples; from this, a 100(1 –
)% joint confidence region for (
1,
2) is obtained as
![]() | (35) |
d. In the present case, d'
d was invertible in all R = 9, 999 bootstrap samples.
Figure 3 shows joint confidence regions for (
1,
2) based on R = 9, 999 bootstrap samples. The joint confidence regions clearly support the inference
1 > 0, and give weaker, though measurable, support for
2 > 0. Before settling on a model with two canonical correlations, support for
3 > 0 needs to be investigated. I used the natural extensions of (33)–(35) to three variables, to compute joint confidence regions for (
1,
2,
3) based on R = 9, 999 bootstrap samples. With three parameters the confidence regions are nested three-dimensional surfaces, similar to the Russian dolls known as "matreshki." The nested surfaces are not easy to display graphically, so in Figure 4 I show the intersections of the surfaces with the two-dimensional (
1,
2) plane, in which
3 = 0. Each of the three confidence regions intersects the plane, so I conclude that there is little support for a third canonical correlation. Thus the inference is that there are two significant correlations in the TSC data, indicating the presence of three independent sources of variation.
|
|
Other methods for the TSC data:
The computer program Structure, available at http://pritch.bsd.uchicago.edu/structure.html, implements Bayesian methods for fitting admixture models described in PRITCHARD et al. (2000) and FALUSH et al. (2003). PRITCHARD et al. (2000) describe a method for inferring the number of source populations K, based on estimates of the probability of the sample given different values of K. I used Structure, along with rules of thumb described in the accompanying user's manual, to infer the number of source populations for the TSC data, for comparison with the inference made above. For each of three models of ancestry implemented in Structure (the "no admixture", "admixture" and "linkage" models), I carried out five independent runs of the Structure program at each value of K = 1, ... , 5 (for a total of 75 runs). For each run, a "burn-in" of 100,000 iterations was followed by 1,000,000 Markov-chain Monte Carlo updates. Arguably, none of the ancestry models implemented in Structure are appropriate for the TSC SNPs, some of which are very tightly linked. Each ancestry model gave strongest support to the value K = 1 for the TSC data. For models with K > 1, in all runs, the overall sample proportion assigned to a "cluster" was 1/K for each cluster. These results are understood to indicate an absence of population structure.
The programs in Eigensoft (http://genepath.med.harvard.edu/
reich/EIGENSTRAT.htm) perform a principal components analysis described by PATTERSON et al. (2006) for inference of admixture. Under this method, K – 1 is inferred as the number of statistically significant eigenvalues of an interindividual covariance matrix [having a form similar to expression (4)]. As for Structure, I have reservations about using Eigensoft on the TSC sample, because the data architecture is of a form not intended by PATTERSON et al. (2006). In particular, Patterson et al. intend (but do not require) that the number of markers L be larger than the number of individuals n, and acknowledge that their method is not designed specifically to accomodate markers in linkage disequilibrium. I applied the Eigensoft genotype coding scheme to the TSC X chromosomes as follows: a male with the "reference" allele (chosen arbitrarily) at an X chromosome locus has the value 0 at the locus, whereas a male with the "variant" allele has the value 2. I used the regression method described in PATTERSON et al. (2006) to correct for LD, and varied the number of adjacent markers used for adjustment between 1, 2 and 3. In each case Eigensoft called all eigenvalues nonsignificant, thus the inference would be that only one population contributed to the TSC sample.
The admixture decomposition:
The Rao solution can be computed by straightforward matrix operations commonly implemented in statistical or numerical packages. I do the calculations in R (R DEVELOPMENT CORE TEAM 2005, version 2.2.1), and describe a numerically stable method for computing the decomposition in the APPENDIX.
The Rao decomposition for the TSC databased on two canonical correlations is shown graphically in Figure 5. The estimates
*
*' and
* satisfy the constraints
imposed by the sample allele frequencies. Covariances due to admixture, involving rs1541354 and rs1859004 (respectively, the third and fourth SNPs from the left in Figure 5) with several of the other SNPs, can be seen as vertical bands in the third and fourth columns of
*
*'. A comparison of the off-diagonal block of the residual matrix with the corresponding block of S reveals a notable tendency for residual covariances to be corrected toward zero.
|
Data from the International HapMap Consortium:
In broad terms, the second example repeats the analysis of the TSC data, using a different sample of X chromosomes typed by the INTERNATIONAL HAPMAP CONSORTIUM (2005), HapMap Public Release #21, http://www.hapmap.org. Here I want to demonstrate that a different sample having similar structure yields similar results. The HapMap sample consists of 53 males of the Yoruban parent-offspring trios from Ibadan, Nigeria; 23 Japanese males from Tokyo; 22 Han Chinese males from Beijing; and 44 males of the Utah CEPH parent-offspring trios. Some of the Utah CEPH pedigrees typed by Celera Genomics for TSC were also typed by HapMap; thus 20 of the Utah males I include in the HapMap sample were also in the TSC sample.Although HapMap typed many more SNPs overall than TSC, the particular SNPs I used for the TSC sample were incompletely typed by HapMap. I elected to choose a new set of SNPs for the HapMap sample, located in the same X-chromosome regions represented by blocks 1 and 2 of the TSC sample. The HapMap browser locates the TSC SNPs of block 1 in the X-chromosome region 9.2 - 13.7Mb, and the SNPs of block 2 in 139.3 - 147.3Mb. I judged that a model with 15 SNPs would preserve a reasonable balance between sample size (n = 142) and number of parameters. I chose 15 SNPs at random from those typed by HapMap, allocated to blocks 1 and 2 in rough proportion to the sizes of the two regions. Prior to choosing the 15 SNPs at random, I excluded SNPs from the two regions which had obvious genotyping errors or which failed to type in one or more individuals. I discarded approximately three random SNP sets because the sample covariance matrices they produced were singular, or because the Rao decomposition based on 1, 2 or 3 canonical correlations would have failed to satisfy the constraints imposed by the sample allele frequencies. The latter biased the selection toward a SNP set which yields a good fit of the model, but did not privilege any particular value of K. Figure 6 shows the physical map of the 15 chosen SNPs.
|
I used the same data analysis strategy for the HapMap sample as for the TSC sample, so I will summarize results briefly. The six sample canonical correlations for the HapMap data are 0.50, 0.41, 0.33, 0.23, 0.20 and 0.09. The permutation test of the statistic T0 produced an approximate p-value of 0.0007, indicating the presence of at least one significant canonical correlation. A 95% confidence interval for the first correlation is (0.25, 0.49), and joint confidence regions for the first two correlations are shown in Figure 7. The HapMap plot (not shown) analogous to Figure 4 has no contours in the (
1,
2) plane: thus
3 = 0 is excluded from each of the 95, 90, and 85% confidence regions. I take this to imply support for a third correlation. Graphical display of 4-dimensional regions is highly problematic, so I checked for the presence of a fourth significant correlation numerically. I bootstrapped the four-dimensional analog of (34) in order to estimate quantiles q*, then evaluated (34) at candidate points (
1,
2,
3,
4 = 0), with
1,
2 and
3 varying in a region of high support. I found many such parameter combinations inside the joint confidence regions, and concluded that there is little evidence in the HapMap sample for a fourth correlation. Thus the inference is that there are three significant correlations in the HapMap data, indicating the presence of four independent sources of variation. The Rao decomposition for the HapMap data, based on three canonical correlations, is shown graphically in Figure 8. The decomposition satisfies the constraints
.
|
|
Other methods for the HapMap data:
I followed the same strategy using Structure to infer K for the HapMap data as for the TSC data. In contrast to the TSC data, Structure finds evidence for multiple clusters in the HapMap data; in particular, the "linkage" model produces evidence for four clusters. In five runs of the linkage model with K = 4, Structure consistently assigned overall sample proportions near 0.27 to three clusters, and 0.2 to a fourth cluster. Posterior probabilities of cluster membership suggest that most individuals are of mixed ancestry, having roughly equal proportions of their alleles assigned to two or more clusters.I followed the same strategy using Eigensoft for the HapMap data as for the TSC data. Eigensoft calls one eigenvalue significant when either 1 or 2 adjacent markers are used for LD adjustment, and calls no eigenvalues significant when 3 adjacent markers are used. Thus Eigensoft suggests that at most two sub-populations have contributed to the HapMap X chromosome data.
3 BLOCKS
ABSTRACT
THE PULSE-DECAY MODEL OF...
ESTIMATION OF MODEL PARAMETERS
THE MODEL WITH B...
>THE MODEL WITH B...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
3 is identified up to post-multiplication of
by an orthogonal matrix (i.e., up to rotations of
). However, explicit formulae for parameter estimates are not readily available.
To fix notation, the block-partitioned parameter matrices are
, of the form (13), and
![]() |
b is Lb x (K – 1); b = 1 , ... , B. For B
3, BROWNE (1980, Proposition 2) shows that a sufficient condition for identification of
, and of
up to rotations, is that at least three of the sub-matrices
b of
be of full column rank.
An assumption of the pulse-decay model, built on the notion that
, is that the source-population vectors p(1), ... p(K) are linearly independent. If Lb, the number of markers in the bth block, is sufficiently large, it is reasonable to assume that linear independence also holds for the block-specific sub-vectors pb(1) , ... , pb(K). In fact, provided Lb
K – 1, a block-specific version of Proposition 2 (of this paper) is readily established, using suitably defined matrices
b,
bb, and a proof of the same form. Browne's Proposition consequently implies that
is identified, and
is identified up to rotations, if Lb
K – 1 for at least three blocks b
{1, ... B}. Furthermore, when at least three blocks meet this minimum size criterion, the matrix
= 
' is unique: for
= 
' =
Q (
Q)' when Q is orthogonal.
Obtaining an analog of Proposition 3 (of this article) for B
3 appears to be an open problem. Using
![]() |
= B diag[S – 
'] and the reasoning of Proposition 3, GLS estimates 
, ... , 
for the unconstrained problem would be obtained by minimizing
![]() | (36) |

obtained as i varies.
At present, numerical estimation of (
,
) appears to be the only feasible option for the model with B
3. LEE (1980) gives algorithms for GLS estimation under parameter constraints, which are appropriate for the data under consideration. For inference of K, an ad hoc strategy involving sequential goodness-of-fit tests, similar to a method used in ordinary factor analysis (see, e.g., BARTHOLOMEW and KNOTT 1999, Sect. 3.8–3.9), may be pursued. For a candidate value K and a GLS estimate
*, where
* has K – 1 columns, the fit of the model may be assessed by testing the residual matrix S –
*
*' for block-diagonal structure (see e.g., ANDERSON 1984, Chap. 9; SEBER 1984, Sect. 3.5); this assumes that such tests could be suitably adapted to multivariate binary observations. The smallest candidate value


























































