Genetics, Vol. 160, 1707-1719, April 2002, Copyright © 2002

Homozygosity and Linkage Disequilibrium

Chiara Sabatti (a) and Neil Risch (b)
(a) Department of Human Genetics and Statistics, University of California, Los Angeles, California 90095-7088
(b) Department of Genetics, Stanford University, Stanford, California 94305-5120

Corresponding author: Chiara Sabatti, UCLA School of Medicine, 695 Charles E. Young Dr. S., Los Angeles, CA 90095-7088. E-mail: csabatti@mednet.ucla.edu

Communicating editor: G. A. CHURCHILL


*  ABSTRACT

We illustrate how homozygosity of haplotypes can be used to measure the level of disequilibrium between two or more markers. An excess of either homozygosity or heterozygosity signals a departure from the gametic phase equilibrium: We describe the specific form of dependence that is associated with high (low) homozygosity and derive various linkage disequilibrium measures. They feature a clear biological interpretation, can be used to construct tests, and are standardized to allow comparison across loci and populations. They are particularly advantageous to measure linkage disequilibrium between highly polymorphic markers.


TESTING for the presence of linkage disequilibrium (LD) and measuring its value are two important instruments of statistical genetics that have recently received a great deal of attention. The first studies of LD were mainly in the context of population genetics; for example, disequilibrium between markers was used to assess the age of various populations. In the last decade, measures of LD have also been rediscovered as a tool for disease mapping, so that investigation has focused on measuring the disequilibrium between an unknown disease gene and a known set of markers. Indeed, the presence of linkage disequilibrium between a disease gene and a given set of markers identifies the chromosomal region spanned by the markers as a candidate location of the disease gene. Moreover, the pattern of variation of LD values along a stretch of DNA also carries information: It can be used to pinpoint the most likely location of a disease gene within a region or to reconstruct the modality of recombination. The hope of exploiting the relation between the amount of linkage disequilibrium and the recombination fraction between two loci has motivated, on the one hand, the development of a series of statistical methodologies (HASTBACKA et al. 1992; KAPLAN et al. 1995; TERWILLIGER 1995; DEVLIN et al. 1996; XIONG and GUO 1997; GRAHAM and THOMPSON 1998; LAZZERONI 1998; MCPEEK and STRAHS 1999; SERVICE et al. 1999; LAM et al. 2000; MORRIS et al. 2000; LIU et al. 2001) and, on the other hand, the design of genome screens where a high number of densely spaced markers are typed in a population-like sample, to be analyzed with linkage disequilibrium techniques (COLLINS et al. 1997; KRUGLYAK 1998; LONJOU et al. 1999; WRIGHT et al. 1999). As more extensive and systematic data are collected (KIDD et al. 1998; HUTTLEY et al. 1999; REICH et al. 2001; STEPHENS et al. 2001), it has become apparent that levels of disequilibrium vary greatly between genomic regions and across populations. To design and to interpret LD genome screens one needs to refer to a "map" of the background levels of disequilibrium that can be expected in a given region of the genome and in a given population. To construct such a map, the researchers' attention has been directed, once again, to measuring in the most effective manner the levels of disequilibrium between close-by markers. The literature on these measures is quite rich (see DEVLIN and RISCH 1995 and WEIR 1996 for reviews), but there are still open problems. In particular, there is no generally satisfactory measure of disequilibrium between two markers that have more than two alleles or between more than two markers. And yet, most of the data being collected in LD genome screens are of this form. In this work we analyze how it is possible to address this specific question using haplotype homozygosity (the probability of selecting two identical haplotypes at random from the population).

Among the numerous suggestions that are documented in the literature for testing and measuring LD, various references to homozygosity can be found. SVED 1968, AVERY and HILL 1979, and BROWN et al. 1980 proposed to use the variance of homozygosity; OHTA 1980 suggested a measure of disequilibrium that is based on the homozygosity of two loci and is analyzed by HEDRICK 1987 in his review article. MORTON and SIMPSON 1983 define kinship between loci as a homozygosity index and use it to reconstruct distances. Even though the cited literature illustrates the presence of a connection between variation in homozygosity and linkage disequilibrium, the nature of this connection has never been precisely analyzed and hence the reliability of homozygosity to test and measure disequilibrium remains unclear. It is our goal to show what property of the population frequencies of the haplotypes defined by two markers is captured by homozygosity. While the focus of this article is on the definition of measures of disequilibrium calculated from the true population distribution, we also briefly consider the associated inferential problems. In particular, we show how Markov chain Monte Carlo algorithms can be used to conduct permutation tests and to measure disequilibrium on the basis of sample haplotypes.


*  RELATIONS BETWEEN LINKAGE DISEQUILIBRIUM AND HOMOZYGOSITY

Notation and definition of homozygosity:
In the following we consider two markers A and B, respectively, with r and c possible alleles, A1, A2, ... , Ar and B1, B2, ... , Bc. The population frequencies of the above alleles and of the haplotypes defined by these two markers are described in

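$$
\begin{array}{c|cccc|c}
 & B_1 & B_2 & \cdots & B_c & \\ \hline
A_1 & \pi_{11} & \pi_{12} & \cdots & \pi_{1c} & \pi_{1\cdot} \\
A_2 & \pi_{21} & \pi_{22} & \cdots & \pi_{2c} & \pi_{2\cdot} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
A_r & \pi_{r1} & \pi_{r2} & \cdots & \pi_{rc} & \pi_{r\cdot} \\ \hline
 & \pi_{\cdot 1} & \pi_{\cdot 2} & \cdots & \pi_{\cdot c} & 1
\end{array}
\tag{1}
$$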

where $\pi_{ij}$ is the population frequency of the haplotype $(A_i, B_j)$; $\pi_{i\cdot}$ is the population frequency of allele $A_i$; and $\pi_{\cdot j}$ is the population frequency of allele $B_j$. We indicate with $H_A = \sum_i \pi_{i\cdot}^2$ and $H_B = \sum_j \pi_{\cdot j}^2$ the homozygosities of the two markers and with $H_{AB}(\{\pi_{ij}\}) = \sum_{i,j} \pi_{ij}^2$ the haplotype homozygosity (the probability of selecting two identical haplotypes at random from the population). When there is no room for confusion, we omit the argument $(\{\pi_{ij}\})$ in the formula above. Typically, the population frequencies above are unknown and one estimates them from a random sample of haplotypes. However, for the time being we assume $\{\pi_{ij}\}$ to be known and investigate the relation between $\sum_{i,j} \pi_{ij}^2$ and linkage disequilibrium.
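The definitions above translate directly into a few lines of code. The following is a minimal sketch, assuming the haplotype frequencies are held in a NumPy array with rows indexed by the alleles of A and columns by the alleles of B; the function name `homozygosities` and the example table are illustrative and not part of the original article. It also returns the difference between the haplotype homozygosity and the product of the marker homozygosities, the quantity the rest of the article studies.

```python
import numpy as np

def homozygosities(pi):
    """Return (H_A, H_B, H_AB, H_AB - H_A*H_B) for a frequency table pi.

    pi[i, j] is the population frequency of haplotype (A_i, B_j);
    the entries are assumed to be non-negative and to sum to 1.
    """
    pi = np.asarray(pi, dtype=float)
    row = pi.sum(axis=1)          # allele frequencies at marker A
    col = pi.sum(axis=0)          # allele frequencies at marker B
    h_a = float(np.sum(row ** 2))
    h_b = float(np.sum(col ** 2))
    h_ab = float(np.sum(pi ** 2))
    return h_a, h_b, h_ab, h_ab - h_a * h_b

# Hypothetical example: an independence table built from two sets of
# allele frequencies, so the haplotype homozygosity equals H_A * H_B.
pi_indep = np.outer([0.4, 0.6], [0.2, 0.3, 0.5])
h_a, h_b, h_ab, excess = homozygosities(pi_indep)
```

For the independence table in the example, the last returned value is zero up to rounding error, as expected when the haplotype frequencies are products of the allele frequencies.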

We note that homozygosity has been previously used to identify the location of disease genes with the strategy that goes under the name of homozygosity mapping (SMITH 1953; LANDER and BOTSTEIN 1987). In such cases, however, the data came from inbred families, while the measures we consider here are appropriate for a random or case-control sample from the entire population of haplotypes; indeed, related haplotypes should be excluded from the analysis.

Linkage disequilibrium:
Loci A and B are said to be in gametic phase equilibrium (GPE) if $\pi_{ij} = \pi_{i\cdot}\pi_{\cdot j}$ for all i, j (that is, if the qualitative random variables A and B are independent). Linkage disequilibrium is defined as a departure from GPE. This broad definition of disequilibrium as association between A and B poses some problems. A deviation from GPE could be due to a number of population genetic phenomena such as stratification, admixture, or genetic drift. It is often impossible, on the basis of tables such as (1) alone, to determine the origin of the disequilibrium. Moreover, there is not a precise notion of distance from independence that allows one to order a set of tables. We show how homozygosity measures a specific direction of the departure from independence. The utility of this particular measure, then, depends on its genetic interpretability and its connection with the specific problem at hand.

The value of haplotype homozygosity under maximal dependency and equilibrium:
The existence of a connection between haplotype homozygosity and linkage disequilibrium is easily established. The homozygosity of any given marker is higher when fewer alleles are present with a significant frequency. Indeed, in statistics, heterozygosity is known as the Gini index of diversity (see, for example, BHARGAVA and UPPULURI 1977a, BHARGAVA and UPPULURI 1977b). Similarly, when the contingency table (1) has few cells different from zero, the value of the haplotype homozygosity is high. If we do not fix the values of the marginal distributions, this happens in the case of maximum disequilibrium: Each allele at one marker is found in combination with one and only one allele at the other marker; that is, only one cell both per row and per column is different from zero. High homozygosity is, thus, associated with high disequilibrium.


 
Table 1. Linkage disequilibrium values across populations at DRD2 (data from KIDD et al. 1998)

On the other hand, under linkage equilibrium, the multiplicative property $\pi_{ij} = \pi_{i\cdot}\pi_{\cdot j}$ translates into $\sum_{i,j} \pi_{ij}^2 = \sum_i \pi_{i\cdot}^2 \sum_j \pi_{\cdot j}^2$, and the haplotype homozygosity is equal to the product of the marker homozygosities. However, this equality holds not only in the case of linkage equilibrium. A brief consideration of a 2 x 2 table clarifies the issue. In a 2 x 2 contingency table, let $\pi_{1\cdot} = p$, $\pi_{\cdot 1} = q$, and $D = \pi_{11} - pq$. Then, we can reexpress $\{\pi_{ij}\}$ in the following form that emphasizes the existing linear constraints and the departure from independence,

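$$
\begin{array}{c|cc|c}
 & B_1 & B_2 & \\ \hline
A_1 & pq + D & p(1-q) - D & p \\
A_2 & (1-p)q - D & (1-p)(1-q) + D & 1-p \\ \hline
 & q & 1-q & 1
\end{array}
\tag{2}
$$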

with max(-pq,-(1 - p)(1 - q)) <= D <= min(p(1 - q),q(1 - p)). The homozygosity associated with this table is

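$$
H_{AB} = H_A H_B + 2D(2p-1)(2q-1) + 4D^2,
\tag{3}
$$

where $H_A = p^2 + (1-p)^2$ and $H_B = q^2 + (1-q)^2$ are the marker homozygosities.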

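From (3), it is clear that $H_{AB} = H_A H_B$ when $D = 0$, but also when $D = -(2p-1)(2q-1)/2$. Indeed, the haplotype homozygosity can be smaller than the product of the marker homozygosities. An expression similar to the above can be obtained for tables of any dimension. Indeed, letting $D_{ij} = \pi_{ij} - \pi_{i\cdot}\pi_{\cdot j}$ and $H = H_{AB} - H_A H_B$, one obtains

$$
H = \sum_{i,j} D_{ij}^2 + 2 \sum_{i,j} \pi_{i\cdot}\pi_{\cdot j} D_{ij}
$$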

(see OHTA 1980). By extending the results in (3), one notes that for multiallelic markers the haplotype homozygosity $H_{AB}$ can be equal to $H_A H_B$ for an unlimited number of tables. Fig 1 illustrates the relation between haplotype homozygosity and linkage disequilibrium described up to this point. We consider one biallelic marker (with allele frequencies 0.4 and 0.6) and a marker with three alleles (frequencies 0.2, 0.3, and 0.5). The space of all possible tables with these marginals can be parametrized as a function of two parameters x and y:

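$$
\begin{array}{c|ccc|c}
 & B_1 & B_2 & B_3 & \\ \hline
A_1 & x & y & 0.4 - x - y & 0.4 \\
A_2 & 0.2 - x & 0.3 - y & 0.1 + x + y & 0.6 \\ \hline
 & 0.2 & 0.3 & 0.5 & 1
\end{array}
\tag{4}
$$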



Figure 1. Homozygosity values for the class of haplotype distributions described in (4). The shaded area represents the admissible tables, in the space of x and y. The solid circle identifies the table corresponding to gametic phase equilibrium. The open circle signals the table with highest haplotype homozygosity. The ellipses are level sets of homozygosity. It is apparent that there is a set of tables that share the same homozygosity value as the independence one and that there are tables with higher heterozygosity than the independence one.

Requiring that $0 \le \pi_{ij} \le 1$ for all i and j is equivalent to requiring that 0 <= x <= 0.2, 0 <= y <= 0.3, and x + y <= 0.4. The space of all the possible tables that satisfy these constraints is represented in Fig 1 by the shaded area. The table of linkage equilibrium, corresponding to the values x = 0.08, y = 0.12, is represented with a solid circle. For each table in the space, we can calculate the homozygosity value. Contour levels of homozygosity as a function of x and y are depicted in Fig 1; it can be seen that there is a set of tables that have the same homozygosity level as the $\{\pi_{ij}\}$ corresponding to linkage equilibrium. It is also evident that there exist tables with homozygosity levels lower than $H_A H_B$. Another table emphasized in Fig 1 is the one that leads to the highest chi-square statistic, which in this case is also the one with highest homozygosity (corresponding to the values x = 0.1, y = 0.3). The fact that $H = H_{AB} - H_A H_B = 0$ not only in the case of independence clearly signals that H is not, strictly speaking, a measure of dependence, but rather of one particular form of association that can be zero even if the table $\{\pi_{ij}\}$ shows dependence. This is a common characteristic of measures of association. For example, the correlation coefficient between two random variables is zero not only in the case of independence, but whenever there is no linear association. Yet, it is often used as a measure of dependence, but with due caution. We clarify below the form of association measured by homozygosity.
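The statements about this family of tables can be checked numerically. The sketch below assumes that x and y are the frequencies of the haplotypes (A1, B1) and (A1, B2), with A the biallelic marker; this parametrization is inferred from the constraints 0 <= x <= 0.2, 0 <= y <= 0.3, x + y <= 0.4 and is an assumption, as are the function names. Because haplotype homozygosity is a convex function of (x, y), its maximum over the admissible polygon is attained at a vertex, so only the five vertices are compared.

```python
import numpy as np

def table(x, y):
    """2 x 3 haplotype table with margins (0.4, 0.6) and (0.2, 0.3, 0.5)."""
    return np.array([[x, y, 0.4 - x - y],
                     [0.2 - x, 0.3 - y, 0.1 + x + y]])

def excess(pi):
    """Haplotype homozygosity minus the product of marker homozygosities."""
    pi = np.asarray(pi, dtype=float)
    h_a = np.sum(pi.sum(axis=1) ** 2)
    h_b = np.sum(pi.sum(axis=0) ** 2)
    return float(np.sum(pi ** 2) - h_a * h_b)

# Vertices of the admissible region in the (x, y) plane.
vertices = [(0.0, 0.0), (0.2, 0.0), (0.2, 0.2), (0.1, 0.3), (0.0, 0.3)]
best = max(vertices, key=lambda v: excess(table(*v)))

h_equilibrium = excess(table(0.08, 0.12))   # independence: zero excess
h_same_level = excess(table(0.05, 0.12))    # dependent table, zero excess
h_below = excess(table(0.065, 0.12))        # dependent table, negative excess
```

The table at (0.05, 0.12) is not an independence table, yet it has the same haplotype homozygosity as the equilibrium one, while the table at (0.065, 0.12) has a lower one.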

Connection between homozygosity and recombination fraction:
Much of the current interest in linkage disequilibrium between markers is due to the fact that its evolution over time can be related to the recombination fraction between the loci. Consider a simplified model where each individual has one chromosome and chromosomes of the next generation (t + 1) are obtained by either sampling one from the present generation (t) and not recombining it or sampling two and recombining them. Then, for each i, j it is easily seen that

$$
\pi^{t+1}_{ij} = (1-\theta)\,\pi^{t}_{ij} + \theta\,\pi_{i\cdot}\pi_{\cdot j},
\tag{5}
$$

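where $\theta$ is the recombination fraction between the two loci. This dynamic ensures that $\pi^t_{ij} \to \pi_{i\cdot}\pi_{\cdot j}$ as $t \to \infty$. An immediate consequence is that $H_t \to 0$ as $t \to \infty$. It is of interest to monitor the behavior of this convergence. By the same reasoning used above, writing $D^t_{ij} = \pi^t_{ij} - \pi_{i\cdot}\pi_{\cdot j}$, so that $D^{t+1}_{ij} = (1-\theta) D^t_{ij}$,

$$
H_{t+1} = (1-\theta)^2 \sum_{i,j} (D^t_{ij})^2 + 2(1-\theta) \sum_{i,j} \pi_{i\cdot}\pi_{\cdot j} D^t_{ij}.
$$

From the last expression it is evident that $H_t = 0$ is not a sufficient condition for stability: Unless $D^t_{ij} = 0$ for all i, j, equilibrium is not reached. Then, even if there are numerous tables such that $H = 0$, only the table corresponding to independence represents an equilibrium for the system. The difference equation describing the behavior of $H_t$ can be further simplified recalling (5) and defining $S_t = \sum_{i,j} (D^t_{ij})^2$:

$$
H_{t+1} = (1-\theta) H_t - \theta(1-\theta) S_t.
\tag{6}
$$

Then, by recursion we get

$$
H_t = (1-\theta)^t \left[ H_0 - \left(1 - (1-\theta)^t\right) S_0 \right].
\tag{7}
$$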

The evolution of $\{\pi_{ij}\}$ and $H_t$ for a given value of $\theta$ is illustrated in Fig 2 for two different starting disequilibrium situations: (x0, y0) (open circle) and (x'0, y'0) (open square). On the left, the evolution in the space of all possible tables is emphasized: Arrows indicate the convergence path from the two initial points to the linkage equilibrium situation. It can be seen that one of the paths crosses the locus of tables with $H = 0$ once before reaching equilibrium. On the right, the values of $H_t$ for the two systems are plotted as a function of the number of generations t: In one case the homozygosity is monotonically decreasing toward the equilibrium value, while in the other it assumes values slightly smaller than $H_A H_B$ before converging to equilibrium. Equation 7 specifies the relation between evolution of haplotype homozygosity over time and recombination fraction $\theta$ between the considered markers. Fig 3 illustrates the values of the excess of homozygosity over the equilibrium one as a function of recombination fraction for a population that is 100 generations old and two distinct initial haplotype frequencies, corresponding to the two table values (x0, y0) and (x'0, y'0) defined above. It is clear that for some values of $H_0$ and of the initial disequilibrium the relation between homozygosity and recombination fraction is not monotonic. An obvious implication is that H should be used with caution for mapping purposes. However, Fig 2 and Fig 3 do demonstrate monotonic behavior of $H_t$ with both t and $\theta$ when $H_t$ is restricted to positive values.
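The behavior described here can be reproduced by iterating the recombination model directly. The sketch below assumes the update in which a haplotype frequency moves a fraction theta of the way toward the product of its allele frequencies at each generation; the starting table is a hypothetical one chosen so that the excess homozygosity starts positive and dips below zero before converging, and it is not one of the two starting points used in the article's figures. The closed-form function is our own rearrangement of the recursion, with `s0` denoting the sum of squared deviations from independence at generation 0 (an assumed name).

```python
import numpy as np

def excess(pi):
    """Haplotype homozygosity minus the product of marker homozygosities."""
    pi = np.asarray(pi, dtype=float)
    h_a = np.sum(pi.sum(axis=1) ** 2)
    h_b = np.sum(pi.sum(axis=0) ** 2)
    return float(np.sum(pi ** 2) - h_a * h_b)

def trajectory(pi0, theta, generations):
    """Excess homozygosity at generations 0, 1, ..., generations."""
    pi = np.asarray(pi0, dtype=float)
    equilibrium = np.outer(pi.sum(axis=1), pi.sum(axis=0))
    values = [excess(pi)]
    for _ in range(generations):
        # Non-recombinants keep their haplotype; recombinants pair alleles
        # drawn independently, which leaves the allele frequencies unchanged.
        pi = (1 - theta) * pi + theta * equilibrium
        values.append(excess(pi))
    return values

def closed_form(pi0, theta, t):
    """Excess homozygosity after t generations, without iterating."""
    pi0 = np.asarray(pi0, dtype=float)
    equilibrium = np.outer(pi0.sum(axis=1), pi0.sum(axis=0))
    s0 = float(np.sum((pi0 - equilibrium) ** 2))
    decay = (1 - theta) ** t
    return decay * (excess(pi0) - (1 - decay) * s0)

# Hypothetical starting table with margins (0.4, 0.6) and (0.2, 0.3, 0.5).
start = [[0.03, 0.12, 0.25],
         [0.17, 0.18, 0.25]]
h = trajectory(start, theta=0.1, generations=200)
```

In this example the first entry of `h` is positive, the minimum is negative, and the last entries are numerically zero, which is the non-monotonic pattern the text warns about.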



Figure 2. Convergence over time to linkage equilibrium. On the left, the space of tables is as described in (4). Two disequilibrium situations are considered and identified by an open circle and an open square. The solid circle indicates the table corresponding to linkage equilibrium. The lines with arrows indicate the path to equilibrium in successive generations for the considered tables. The ellipse identifies the set of tables that have the same homozygosity value as the equilibrium one. On the right, the values of homozygosity for the two populations are depicted as a function of generations; the solid line corresponds to the evolution of the table identified by an open circle on the left and the dashed line to the evolution of the table identified with an open square. The boldface solid line identifies the equilibrium homozygosity value.



Figure 3. Relation between homozygosity and recombination fraction. Letting t = 100 and considering the two disequilibrium situations depicted in Fig 2, Equation 7 leads to this graph where the excess of homozygosity over the equilibrium situation (on the y-axis) is depicted as a function of recombination fraction (on the x-axis).


*  MEASURING DISEQUILIBRIUM WITH HOMOZYGOSITY

The preceding section illustrated how haplotype homozygosity captures a particular form of departure from equilibrium. In this section we make precise the nature of this dependence and give operative definitions of measures of disequilibrium on the basis of homozygosity. The key idea is that haplotype homozygosity measures agreement between markers; it indicates how likely it is that, sampling two haplotypes at random from a population, if they are identical at one marker they are also identical at the other one, or, vice versa, if they are different at one marker, they are also different at the other one. To make this more precise, it is useful to introduce the notion of agreement between partitions.

Agreement between the partition of haplotypes by two markers:
Let S be the set of all the existing population haplotypes defined by markers A and B. Any subdivision of S into subsets Si such that each haplotype in S belongs to exactly one of the subsets Si is called a partition of S. Each of the two markers A and B identifies a partition of S by putting in the same subset haplotypes with the same allele. For example, for a population with eight haplotypes, suppose that the set of the haplotypes S is

$$
S = \{h_1 = (A_1, B_2),\ h_2 = (A_1, B_3),\ h_3 = (A_2, B_1),\ h_4 = (A_2, B_1),\ h_5 = (A_2, B_2),\ h_6 = (A_1, B_3),\ h_7 = (A_3, B_3),\ h_8 = (A_4, B_4)\}.
$$

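The partition of the haplotypes according to the first marker is {h1, h2, h6}, {h3, h4, h5}, {h7}, {h8}, while the partition of the haplotypes according to the second marker is {h3, h4}, {h1, h5}, {h2, h6, h7}, {h8}. Every possible partition can be represented by a matrix with as many rows and columns as the number of haplotypes in S. For example, for a population with eight haplotypes, we can represent the partitions according to the loci A and B given above by two matrices $\mathcal{A}$ and $\mathcal{B}$. The element $\alpha_{lm}$ of $\mathcal{A}$ is equal to 1 if haplotypes l and m are in the same group in the partition defined by A and zero otherwise. The definition of $\mathcal{B}$, with elements $\beta_{lm}$, is similar. Again, in our example, the matrices $\mathcal{A}$ and $\mathcal{B}$ would be

$$
\mathcal{A} = \begin{pmatrix}
1&1&0&0&0&1&0&0\\
1&1&0&0&0&1&0&0\\
0&0&1&1&1&0&0&0\\
0&0&1&1&1&0&0&0\\
0&0&1&1&1&0&0&0\\
1&1&0&0&0&1&0&0\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&1
\end{pmatrix},
\qquad
\mathcal{B} = \begin{pmatrix}
1&0&0&0&1&0&0&0\\
0&1&0&0&0&1&1&0\\
0&0&1&1&0&0&0&0\\
0&0&1&1&0&0&0&0\\
1&0&0&0&1&0&0&0\\
0&1&0&0&0&1&1&0\\
0&1&0&0&0&1&1&0\\
0&0&0&0&0&0&0&1
\end{pmatrix}.
$$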


Let us now consider the agreement between these partitions. The agreement would be perfect if each allele at marker A corresponded to one, and only one, allele at marker B: Two haplotypes are in the same group according to B if and only if they are in the same group according to A. On the contrary, the agreement is lowest if whenever A puts two haplotypes in the same group, B separates them. Between these two extremes there is the agreement that one gets just by chance. A simple way to measure the agreement is to consider the two partition matrices as vectors (for example, reading the numbers left to right and top to bottom) and calculate the covariance between them (see HUBERT and BAKER 1978). Let N be the number of haplotypes in S; then

$$
\mathrm{Agr} = \frac{1}{N^2} \sum_{l,m} \alpha_{lm}\beta_{lm} - \left(\frac{1}{N^2} \sum_{l,m} \alpha_{lm}\right)\left(\frac{1}{N^2} \sum_{l,m} \beta_{lm}\right).
$$

To see how this is related to H, note that we can describe the set S of haplotypes with a contingency table {{pi}ij}. In our example,

$$
\{\pi_{ij}\} = \frac{1}{8}
\begin{array}{c|cccc}
 & B_1 & B_2 & B_3 & B_4 \\ \hline
A_1 & 0 & 1 & 2 & 0 \\
A_2 & 2 & 1 & 0 & 0 \\
A_3 & 0 & 0 & 1 & 0 \\
A_4 & 0 & 0 & 0 & 1
\end{array}
$$

Now, it can be verified that

$$
\frac{1}{N^2} \sum_{l,m} \alpha_{lm}\beta_{lm} = \sum_{i,j} \pi_{ij}^2, \qquad
\frac{1}{N^2} \sum_{l,m} \alpha_{lm} = \sum_{i} \pi_{i\cdot}^2, \qquad
\frac{1}{N^2} \sum_{l,m} \beta_{lm} = \sum_{j} \pi_{\cdot j}^2.
$$

Hence H = Agr; that is, the difference between the haplotype homozygosity and the product of the marginal homozygosities measures the agreement between the partitions defined by the markers: A positive value of H (excess homozygosity) indicates more agreement than that expected by chance; a negative value of H (excess heterozygosity) indicates less agreement. Either of these excesses signifies a departure from gametic phase equilibrium (independence). Indeed, a founder effect can generate both a positive and negative value of H. Suppose, for example, that you have a population of 100 chromosomes and a disease-causing mutation appears on one of them, close to a biallelic locus. If the chromosome that experienced the mutation had at the nearby locus an allele with population frequency <=0.5, there will be excess homozygosity for the disease locus-marker locus haplotype [according to Equation 3 and D2 being small, HAB - HAHB {cong} 2D(2p - 1)(2q - 1) > 0, provided q < 1/2 when p = 0.01]. For a marker allele frequency >0.5, there will be excess heterozygosity.
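The identity between the agreement of the two partitions and the excess homozygosity can be checked on the eight-haplotype example. In the sketch below the grouping of haplotypes follows the two partitions given in the text, but the allele labels attached to each group ("A1", "B2", and so on) are arbitrary and assumed, as are the function names. The co-membership matrices are built by comparing labels pairwise, and the agreement is computed as the covariance of their entries.

```python
from collections import Counter
import numpy as np

# Haplotypes h1..h8; labels are illustrative, the groupings follow the text.
haplotypes = [("A1", "B2"), ("A1", "B3"), ("A2", "B1"), ("A2", "B1"),
              ("A2", "B2"), ("A1", "B3"), ("A3", "B3"), ("A4", "B4")]

def co_membership(labels):
    """Matrix with entry 1 when two haplotypes carry the same allele."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

alpha = co_membership([a for a, _ in haplotypes])
beta = co_membership([b for _, b in haplotypes])

# Covariance between the two matrices read as vectors of length N * N.
agreement = float(np.mean(alpha * beta) - np.mean(alpha) * np.mean(beta))

def sum_of_squared_frequencies(items):
    """Probability that two draws with replacement give the same item."""
    n = len(items)
    return sum((count / n) ** 2 for count in Counter(items).values())

h_ab = sum_of_squared_frequencies(haplotypes)
h_a = sum_of_squared_frequencies([a for a, _ in haplotypes])
h_b = sum_of_squared_frequencies([b for _, b in haplotypes])
excess = h_ab - h_a * h_b
```

Both `agreement` and `excess` are computed from the same eight haplotypes by two different routes, and they coincide.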

References to the literature on agreement between partitions can be obtained from HUBERT and BAKER 1978 and FOWLKES and MALLOWS 1983. Incidentally we note that BLOCH and KRAEMER 1989 proposed a translation of measures of agreement into measures of dependence, which is entirely different from the present one.

Measures of disequilibrium based on H:
Having clarified the nature of dependence captured by H, we now set out to define a measure based on H that allows comparisons across tables. Generally speaking, it is useful to standardize H to obtain an index that has absolute value <1 and is equal to ±1 in case of maximal dependence and 0 in correspondence of independence. In defining maximal dependence, recall that the degree of linkage disequilibrium between markers should be independent of the allele frequencies of each of the markers considered separately. This implies that H should be standardized using the extreme values it can take on for the given marginal distributions $\pi_{i\cdot}$, $\pi_{\cdot j}$. That is, for a table $\{\pi_{ij}\}$, we are interested in the index of dependence,

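$$
H'(\{\pi_{ij}\}) =
\begin{cases}
\dfrac{H(\{\pi_{ij}\})}{\max_{\{p_{ij}\}} H(\{p_{ij}\})} & \text{if } H(\{\pi_{ij}\}) \ge 0, \\[2ex]
\dfrac{H(\{\pi_{ij}\})}{\left|\min_{\{p_{ij}\}} H(\{p_{ij}\})\right|} & \text{if } H(\{\pi_{ij}\}) < 0,
\end{cases}
\tag{8}
$$

where the maximum and the minimum are taken over all tables $\{p_{ij}\}$ with $p_{i\cdot} = \pi_{i\cdot}$ and $p_{\cdot j} = \pi_{\cdot j}$.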

Unfortunately, the maximization involved in the definition of H' does not have a closed form solution for all c and r. In the simple case of a 2 x 2 table, this constrained quadratic maximization is, however, easy to solve. Recalling the parameterization of a 2 x 2 table given in (2), one quickly realizes that the problem is quadratic in D and that the solution is on the boundaries. The table corresponding to the maximal homozygosities will have the following value of {pi}11:

$$
\pi_{11} =
\begin{cases}
\min(p, q) & \text{if } (2p-1)(2q-1) \ge 0, \\
\max(0,\ p+q-1) & \text{otherwise},
\end{cases}
\qquad p = \pi_{1\cdot},\ q = \pi_{\cdot 1}.
$$

The minimal homozygosity is achieved for

$$
\pi_{11} = \min\left\{ \max\left[ \frac{2p + 2q - 1}{4},\ \max(0,\ p+q-1) \right],\ \min(p, q) \right\}.
$$

This allows us to define an index H' that takes on value 1 in correspondence of maximal homozygosity and -1 in correspondence of maximal heterozygosity. To illustrate briefly the meaning of H' and its difference with a traditional measure of disequilibrium, let us consider the following tables with identical margins:

For 2 x 2 tables, LEWONTIN 1964 has popularized a measure, D', that is a standardization of the value $D = \pi_{11} - \pi_{1\cdot}\pi_{\cdot 1}$, so that D' is always <1 in absolute value, is equal to 0 in case of independence, and has positive sign when the association is along the main diagonal of the contingency table (A1 with B1, A2 with B2). Recalling the parametrization of 2 x 2 tables given in (2), the definition of the measure D' is as follows:

$$
D' =
\begin{cases}
\dfrac{D}{\min(p(1-q),\ q(1-p))} & \text{if } D \ge 0, \\[2ex]
\dfrac{D}{\min(pq,\ (1-p)(1-q))} & \text{if } D < 0.
\end{cases}
$$

For the tables above, and , while and . The sign of D' depends on the order of the rows and columns of the tables; when there is not a natural order for the outcomes of variables A and B, this seems a rather arbitrary decision. In contrast to this, the sign of H' indicates excessive homozygosity or heterozygosity and is independent from row or column order.
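The contrast between the two indices under relabeling can be illustrated with a short computation. The sketch below is only an illustration under stated assumptions: the tables are hypothetical (they are not the tables of the article), the function names are ours, and the 2 x 2 extremes of homozygosity are located by evaluating the two boundary values of the (A1, B1) frequency for the maximum and the clamped vertex of the parabola for the minimum.

```python
def excess_2x2(pi11, p, q):
    """H_AB - H_A * H_B for a 2 x 2 table with margins p, q and cell pi11."""
    cells = [pi11, p - pi11, q - pi11, 1 - p - q + pi11]
    h_ab = sum(c * c for c in cells)
    return h_ab - (p * p + (1 - p) ** 2) * (q * q + (1 - q) ** 2)

def d_prime(pi11, p, q):
    """Lewontin's D', with the sign of D = pi11 - p * q."""
    d = pi11 - p * q
    if d >= 0:
        bound = min(p * (1 - q), q * (1 - p))
    else:
        bound = min(p * q, (1 - p) * (1 - q))
    return d / bound if bound > 0 else 0.0

def h_prime(pi11, p, q):
    """Excess homozygosity scaled by its extreme value for fixed margins."""
    low, high = max(0.0, p + q - 1), min(p, q)
    h = excess_2x2(pi11, p, q)
    if h >= 0:
        h_max = max(excess_2x2(low, p, q), excess_2x2(high, p, q))
        return h / h_max if h_max > 0 else 0.0
    vertex = min(max((2 * p + 2 * q - 1) / 4, low), high)
    return h / abs(excess_2x2(vertex, p, q))

# Hypothetical table, and the same table with the two B alleles relabeled
# (q becomes 1 - q and the (A1, B1) cell becomes p - pi11).
original = (0.2, 0.3, 0.4)
relabeled = (0.3 - 0.2, 0.3, 1 - 0.4)

d_original, d_relabeled = d_prime(*original), d_prime(*relabeled)
h_original, h_relabeled = h_prime(*original), h_prime(*relabeled)
```

Swapping the labels of the two alleles at B changes the sign of D' but leaves H' unchanged, since homozygosity does not depend on how alleles are named.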

For generic c > 2 and r > 2, in the absence of an exact solution of the maximization in (8), one can bound the denominator in the definition of H', obtaining an index that will always have absolute value <1 and may attain value 1 only for some particular marginal distributions. There are multiple ways of obtaining such bounds, by considering the following table:

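$$
\begin{array}{c|cc|c}
 & \text{identical at } B & \text{different at } B & \\ \hline
\text{identical at } A & H_{AB} & H_A - H_{AB} & H_A \\
\text{different at } A & H_B - H_{AB} & 1 - H_A - H_B + H_{AB} & 1 - H_A \\ \hline
 & H_B & 1 - H_B & 1
\end{array}
\tag{9}
$$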

Using the same reasoning that is behind the construction of the common measure D', we can define

$$
\frac{H}{\min(H_A, H_B) - H_A H_B} \ \text{ if } H \ge 0, \qquad
\frac{H}{H_A H_B - \max(0,\ H_A + H_B - 1)} \ \text{ if } H < 0, \qquad \text{with } H = H_{AB} - H_A H_B.
$$

If one wants to use the same standardization for positive and negative values of H, one can use . Furthermore, by equating, at each marker, homozygosity with the numeric value 1 and heterozygosity with the numeric 0, one can get an index that is the analog of the correlation coefficient:

$$
\frac{H_{AB} - H_A H_B}{\sqrt{H_A (1 - H_A)\, H_B (1 - H_B)}},
$$

which will attain the maximal values 1 and -1 for an even more restricted set of marginals.

Note that once we decide to restrict our attention to table (9), any measure of dependence defined on it will give an indication of how much HAB differs from HAHB. In particular, one may choose to look at the odds ratio

$$
\Omega = \frac{H_{AB}\,(1 - H_A - H_B + H_{AB})}{(H_A - H_{AB})(H_B - H_{AB})}
$$

and at its standardized version ({Omega} - 1)/({Omega} + 1). We decided to focus on the rescaling of H for ease of interpretation.
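The indices discussed in this subsection all depend on the data only through the three homozygosities, so they are simple to compute together. The following is a minimal sketch; the function name and return layout are assumptions, and the bounded index follows the D'-style standardization applied to the 2 x 2 table of identical versus different haplotype pairs at each marker.

```python
import math

def homozygosity_indices(h_a, h_b, h_ab):
    """Return (bounded index, correlation analog, odds ratio).

    h_a, h_b are marker homozygosities and h_ab the haplotype homozygosity;
    both markers are assumed polymorphic, so 0 < h_a < 1 and 0 < h_b < 1.
    """
    h = h_ab - h_a * h_b
    if h >= 0:
        bound = min(h_a, h_b) - h_a * h_b
    else:
        bound = h_a * h_b - max(0.0, h_a + h_b - 1)
    bounded = h / bound if bound > 0 else 0.0
    correlation = h / math.sqrt(h_a * (1 - h_a) * h_b * (1 - h_b))
    off_diagonal = (h_a - h_ab) * (h_b - h_ab)
    if off_diagonal > 0:
        odds_ratio = h_ab * (1 - h_a - h_b + h_ab) / off_diagonal
    else:
        odds_ratio = math.inf   # an off-diagonal cell of the table is empty
    return bounded, correlation, odds_ratio

# Hypothetical values: a table with strong excess homozygosity, and the
# independence value h_ab = h_a * h_b for the same marker homozygosities.
strong = homozygosity_indices(0.52, 0.38, 0.36)
independent = homozygosity_indices(0.52, 0.38, 0.52 * 0.38)
```

Under independence the first two indices are zero and the odds ratio is one, so its standardized version is zero as well.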

The notion of homozygosity can be applied to haplotypes that contain more than one marker. Consider, for example, the case of three loci. Then, H naturally generalizes to {sum}ijk {pi}2ijk - {sum}i {pi}2i·· {sum}j {pi}2·j· {sum}k {pi}2··k. The maximization (minimization) of H given the marginal distributions, however, is computationally even more demanding. Nonetheless, it is possible to define an index that is appropriate when the haplotype homozygosity is greater than the product of the individual marker homozygosities (H > 0):

$$
H_m = \frac{H_{ABC} - H_A H_B H_C}{\min(H_A, H_B, H_C) - H_A H_B H_C}, \qquad H_{ABC} = \sum_{ijk} \pi_{ijk}^2 .
$$

This index is based on the observation that haplotype homozygosity necessarily has to be smaller than each marker homozygosity. We illustrate its application with one example.


*  SAMPLE-BASED MEASURE OF DEPENDENCE

Estimating H from sample frequencies:
In contrast to what has been assumed thus far, the matrix {{pi}ij} of the true haplotype frequencies for loci A and B is unknown. Linkage disequilibrium between the markers must then be estimated from a sample of haplotypes of size n, leading to the counts represented in the following matrix:

$$
\begin{array}{c|cccc|c}
 & B_1 & B_2 & \cdots & B_c & \\ \hline
A_1 & n_{11} & n_{12} & \cdots & n_{1c} & n_{1\cdot} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
A_r & n_{r1} & n_{r2} & \cdots & n_{rc} & n_{r\cdot} \\ \hline
 & n_{\cdot 1} & n_{\cdot 2} & \cdots & n_{\cdot c} & n
\end{array}
$$

The measures described in the previous section are applicable to analysis of sample data using the "plug-in" principle, that is, substituting for the theoretical quantities their sample analogs. Hence, instead of {pi}i·, one uses ni·/n, etc. It is worth noting that homozygosity can be estimated from sample haplotypes in two ways, to which we refer as direct count and maximum likelihood. For marker A, homozygosity is estimated by direct count as

$$
H^{count}_{A} = \frac{\text{number of sampled individuals homozygous at } A}{\text{number of sampled individuals}}
$$

or, assuming Hardy-Weinberg (HW) equilibrium, by maximum likelihood as

$$
H^{mle}_{A} = \sum_i \left(\frac{n_{i\cdot}}{n}\right)^2,
$$

the latter method being more efficient when HW holds. Similarly, the joint homozygosity can be estimated in two ways:

$$
H^{mle}_{AB} = \sum_{i,j} \left(\frac{n_{ij}}{n}\right)^2
\tag{10}
$$


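$$
H^{count}_{AB} = \frac{\text{number of sampled individuals homozygous at both } A \text{ and } B}{\text{number of sampled individuals}}
\tag{11}
$$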

To evaluate the first of these estimators, one has to estimate $\pi_{ij}$ by its sample counterpart $n_{ij}/n$; this is immediate whenever the phase of haplotypes is known. In such cases, $H^{mle}_{AB}$ or its unbiased version is preferable to $H^{count}_{AB}$, as it will have a smaller variance; it is effectively the expected value of $H^{count}_{AB}$ given the sufficient statistics for this model (see LEHMANN 1983). The expressions for the variances follow (see BHARGAVA and UPPULURI 1977b):

However, when the phase of the genotypes is not available, the count estimator (11) becomes a handy alternative. Note that to ensure that the estimates of the indices take on values between -1 and 1, one should use the same estimation procedure for the haplotype and marker homozygosities.
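The two ways of estimating haplotype homozygosity can be compared on a small phased sample. The sketch below uses hypothetical data and assumed function names; each individual is represented by a pair of phased haplotypes. The plug-in estimator squares the sample haplotype frequencies, the unbiased variant uses the standard correction for sampling without replacement, and the count estimator is the fraction of individuals carrying two identical haplotypes.

```python
from collections import Counter

def plug_in(haplotypes):
    """Sum of squared sample haplotype frequencies."""
    n = len(haplotypes)
    return sum(c * c for c in Counter(haplotypes).values()) / n ** 2

def unbiased(haplotypes):
    """Proportion of ordered pairs of distinct sampled haplotypes that match."""
    n = len(haplotypes)
    counts = Counter(haplotypes).values()
    return sum(c * (c - 1) for c in counts) / (n * (n - 1))

def direct_count(individuals):
    """Fraction of individuals whose two haplotypes are identical."""
    return sum(1 for first, second in individuals if first == second) / len(individuals)

# Hypothetical phased sample of four individuals (eight haplotypes).
individuals = [
    (("A1", "B1"), ("A1", "B1")),
    (("A1", "B1"), ("A2", "B2")),
    (("A2", "B2"), ("A2", "B2")),
    (("A1", "B2"), ("A2", "B1")),
]
haplotypes = [h for pair in individuals for h in pair]

estimates = (plug_in(haplotypes), unbiased(haplotypes), direct_count(individuals))
```

Note that `direct_count` only needs to know whether an individual is homozygous at both markers, which can be read from unphased genotypes, whereas the other two estimators require the haplotype counts.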

Testing for linkage disequilibrium and sample size effects:
We have outlined how the plug-in principle can be used to obtain measures of disequilibrium on the basis of H from sample data. However, analyzing a random sample, one has to evaluate the possibility that the observed counts—with their associated disequilibrium— are generated by a table {{pi}ij} characterized by independence. In other words, prior to measuring disequilibrium, one should conduct a test to assess whether the hypothesis of GPE can or cannot be rejected. It is possible to use homozygosity to test for GPE; we do not intend to propose the following procedures as an alternative to the numerous tests already studied in the literature, but rather consider them for completeness. It is easy to construct asymptotic tests:

  1. The statistic

    has, under independence, an approximate N(0, 1) distribution for n -> {infty} and leads to a Gaussian test.

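  2. From the 2 x 2 table of observed haplotype homozygosity, which cross-classifies the sampled individuals as homozygous or heterozygous at each of the two markers,

$$
\begin{array}{c|cc}
 & \text{homozygous at } B & \text{heterozygous at } B \\ \hline
\text{homozygous at } A & \text{count} & \text{count} \\
\text{heterozygous at } A & \text{count} & \text{count}
\end{array}
$$

    one can obtain a $\chi^2$ test, again assuming n -> {infty} and a sizeable number of observations per cell.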

We do not present details of the power of these two tests, but do note that their power can be zero for those alternatives to linkage equilibrium that give the same joint homozygosity as independence (see, for example, Fig 2). Hence, technically, the tests above are useful only if they result in a rejection of the null hypothesis. We also note that the second test is particularly practical in the case of unphased data: It can be evaluated directly on the sample data without requiring phasing.

The tests outlined above are based on asymptotic approximations; however, the assumption of n -> {infty} sometimes represents a serious limitation. This can be overcome with exact permutation tests that are based on the statistic $H(\{n_{ij}/n\})$. In this context, one is interested in considering all the possible tables $\{m_{ij}\}$ with the same marginal counts as the observed $n_{i\cdot}$, $n_{\cdot j}$ and evaluating the probability of the set of these tables that leads to an excess of homozygosity greater than or equal to the observed $H(\{n_{ij}/n\})$. Fig 4 illustrates the space of all tables $\{m_{ij}\}$ with a fixed total number of haplotypes and marginal relative frequencies as in (4). The table corresponding to independence and the one with highest homozygosity excess are identified. With regard to the probability with which each table is observed under independence, it is well known that $\{m_{ij}\}$ has a Fisher-Yates (FY) distribution. The probability

$$
P_{FY}\left( \left|H(\{m_{ij}/n\})\right| \ge \left|H(\{n_{ij}/n\})\right| \right)
\tag{12}
$$

represents the achieved significance level (P value) of an exact permutation test. It is possible to evaluate (12) either by direct computation (as in the algorithm described in MEHTA and PATEL 1983) or with a Markov chain Monte Carlo (MCMC) procedure as described in LAZZERONI and LANGE 1997. We draw attention in particular to the use of MCMC samples, as they represent the only method effectively applicable for multidimensional contingency tables with highly polymorphic markers. An MCMC is used to obtain a sample of contingency tables with distribution $FY(n_{i\cdot}, n_{\cdot j})$. The percentage of tables $\{m^s_{ij}\}$ in the sample such that $|H(\{m^s_{ij}/m\})| \ge |H(\{n_{ij}/n\})|$ is taken as an estimate of the exact P value (12). LAZZERONI and LANGE 1997 describe how to obtain a sample $\{m^s_{ij}\}$ with the appropriate Fisher-Yates distribution. DIACONIS and STURMFELS 1998 give another MCMC algorithm that can be used for this purpose. The chain that these authors propose is, however, more directly applicable to the evaluation of another quantity that provides significant information on the amount of disequilibrium in the observed table. Recall that the maximization problem required in the definition of H' (8) does not have a closed-form solution. When dealing with haplotype counts, one can consider the following corresponding discrete problem:

$$
\max_{\{m_{ij}\}:\ m_{i\cdot} = n_{i\cdot},\ m_{\cdot j} = n_{\cdot j}} H(\{m_{ij}/n\}).
$$



Figure 4. Space of all possible $\{m_{ij}\}$ with marginal relative frequencies $m_{i\cdot}/m$ and $m_{\cdot j}/m$ as in (4) and a fixed total number of haplotypes m. Tables are identified by bullets. The shaded area represents the space of all probability distributions $\{\pi_{ij}\}$ with the same marginals. The solid circle indicates the table corresponding to independence and the bullet with darker perimeter identifies the table with highest homozygosity. The ellipses are level sets of absolute deviation of the haplotype's homozygosity from its value under independence.

As $n$, $c$, and $r$ increase, this problem also becomes computationally difficult, but its solution can be approximated with an MCMC algorithm. In particular, the chain described by DIACONIS and STURMFELS 1998 leads to a sample of tables $\{m^s_{ij}\}$ with uniform distribution among the tables with fixed marginal counts $n_{i\cdot}$ and $n_{\cdot j}$ (that is, uniform on the space of tables described in Fig 4). A sample-dependent version of $H'$ can then be evaluated by restricting the maximization in (8) to the sampled tables.
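
A minimal sketch of such a uniform chain on two-way tables is given below. It uses the basic move usually associated with DIACONIS and STURMFELS 1998 for this setting: pick two rows and two columns at random, add 1 to one diagonal of the resulting 2 x 2 subtable and subtract 1 from the other, and reject the move if a count would become negative. The proposal is symmetric and the margins never change, so the chain is uniform over tables with the given margins. The form of $H$ (excess of haplotype homozygosity over its value under independence), the function names, and the number of steps are assumptions made for illustration; the sketch only tracks the largest $|H|$ among visited tables and does not attempt the final normalization of definition (8).

```python
import random


def homozygosity_excess(table):
    """Assumed form of H: haplotype homozygosity minus its value under independence."""
    n = sum(sum(row) for row in table)
    haplotype = sum((x / n) ** 2 for row in table for x in row)
    locus_1 = sum((sum(row) / n) ** 2 for row in table)
    locus_2 = sum((sum(col) / n) ** 2 for col in zip(*table))
    return haplotype - locus_1 * locus_2


def max_abs_excess_uniform_chain(table, n_steps=20000, seed=0):
    """Random walk over tables with fixed margins; returns (max |H| seen, last table)."""
    rng = random.Random(seed)
    current = [list(row) for row in table]
    n_rows, n_cols = len(current), len(current[0])
    best = abs(homozygosity_excess(current))
    for _ in range(n_steps):
        # Ordered random pairs: swapping i1 and i2 gives the move of opposite sign.
        i1, i2 = rng.sample(range(n_rows), 2)
        j1, j2 = rng.sample(range(n_cols), 2)
        # Accept only if no count would become negative; otherwise stay put.
        if current[i1][j2] > 0 and current[i2][j1] > 0:
            current[i1][j1] += 1
            current[i2][j2] += 1
            current[i1][j2] -= 1
            current[i2][j1] -= 1
            best = max(best, abs(homozygosity_excess(current)))
    return best, current
```

Because the walk is only guaranteed to approach the true maximum as the number of steps grows, the returned value is a lower bound on the maximum of $|H|$ over all tables with the observed margins.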


*  EXAMPLES

We now consider two previously published datasets for which measuring disequilibrium is particularly interesting: the first because of its implications for the presence of recombination in mitochondria and the second because of its implications for the history of populations. The first dataset consists of biallelic markers: We evaluate the sample analog of $H'$, substituting $n_{ij}/n$ for $\pi_{ij}$. In the second dataset, four different markers are considered at the same time to obtain a "global" measure of disequilibrium. We evaluate an empirical version of $H_m$, where $\sum_{ijk} \pi_{ijk}^2$ is replaced by the direct count of haplotype homozygosity.

Example 1. Recombination in mitochondria:
We consider here a dataset that has recently been used to provide evidence for the presence of recombination in mitochondria (AWADALLA et al. 1999). It is particularly interesting since the conclusion of the analysis depends critically on which measure of disequilibrium is used: It represents, then, a clear example of the need for reliable measures of disequilibrium. The data come from the analysis of (I) six sites (7025, 10,394, 12,308, 13,366, 15,606, 15,925) in 86 Swedish and Finnish individuals; (II) seven sites (1715, 5176, 7933, 8391, 10,394, 10,397, 13,262) in 167 Siberians; and (III) five sites (663, 5176, 10,394, 10,397, 13,262) in 153 Native Americans. Detailed descriptions of these sites and samples can be found in the original articles cited by AWADALLA et al. 1999. Every possible pairing of the sites has been considered, and the amount of disequilibrium measured between each pair has been plotted against the distance separating the two sites.

The measure of disequilibrium used in AWADALLA et al. 1999 is $R^2 = (\pi_{11}\pi_{22} - \pi_{12}\pi_{21})^2 / (\pi_{1\cdot}\pi_{2\cdot}\pi_{\cdot 1}\pi_{\cdot 2})$. Fig 5A reproduces the article's findings: The level of disequilibrium decreases as the distance between the markers increases, as is to be expected in a system with recombination (we plotted $|R|$ rather than $R^2$ to ease the comparison with $D'$). Fig 5B illustrates the effect of using $|D'|$ rather than $|R|$ as a measure of disequilibrium: The decrease with distance completely disappears. The difference between $R$ and $D'$ lies mainly in the standardization of the measures: While $D'$ is standardized so that the values -1 and 1 are achievable for any set of marginals, in $R$ the extremes 1 and -1 are attainable in theory only. The graph in Fig 5B would seem to suggest that the effect noted by Awadalla et al. is due exclusively to the variation in marginal frequencies rather than to disequilibrium. However, there is a sample-size effect associated with $D'$ that has to be considered in interpreting Fig 5B. As soon as one of the cells of a 2 x 2 contingency table is empty, the absolute value of $D'$ is equal to one. When the marginal allele frequencies are such that the probability associated with that cell is very small under independence, and the sample size is small, there is a risk of evaluating as complete disequilibrium what is really quite close to independence. It is of interest, then, to analyze the datasets with other measures of disequilibrium that, unlike $R^2$, take into account marginal distributions but also, unlike $D'$, do not inflate disequilibrium for small sample sizes. $H'$ is the ideal candidate based on homozygosity; Fig 6 shows the results of applying $H'$ to the datasets in question. It is clear that the effect observed by AWADALLA et al. 1999 disappears with an appropriate consideration of the marginal allele frequencies.
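
The contrast between $R$ and $D'$ described above can be made concrete with a short sketch that computes both from a 2 x 2 table of haplotype counts. It uses the standard definitions $D = \pi_{11}\pi_{22} - \pi_{12}\pi_{21}$, $R = D / \sqrt{\pi_{1\cdot}\pi_{2\cdot}\pi_{\cdot 1}\pi_{\cdot 2}}$, and $D' = D / D_{max}$, with $D_{max}$ the largest value of $|D|$ of the same sign compatible with the marginals. The function name and the example counts are hypothetical and are not taken from the mitochondrial datasets.

```python
def pairwise_ld(table):
    """Return (D, R, D') for a 2 x 2 table of haplotype counts [[n11, n12], [n21, n22]]."""
    (n11, n12), (n21, n22) = table
    n = n11 + n12 + n21 + n22
    p11, p12, p21, p22 = (x / n for x in (n11, n12, n21, n22))
    p1_, p2_ = p11 + p12, p21 + p22  # allele frequencies at the first locus
    p_1, p_2 = p11 + p21, p12 + p22  # allele frequencies at the second locus
    d = p11 * p22 - p12 * p21
    r = d / (p1_ * p2_ * p_1 * p_2) ** 0.5
    # Largest |D| of the same sign that the marginals allow.
    if d >= 0:
        d_max = min(p1_ * p_2, p2_ * p_1)
    else:
        d_max = min(p1_ * p_1, p2_ * p_2)
    return d, r, d / d_max


# Hypothetical sample of 50 haplotypes with two rare alleles: the empty cell has an
# expected count of only 0.08 under independence, yet it forces |D'| to 1.
d, r, d_prime = pairwise_ld([[45, 4], [1, 0]])
```

In this toy table $|D'| = 1$ while $|R|$ is about 0.04, which is the small-sample inflation of $D'$ discussed in the text. The function assumes that both alleles are observed at each locus, so that no marginal frequency is zero.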



Figure 5. The pattern of linkage disequilibrium values in the datasets considered by AWADALLA et al. 1999 using (a) $R^2$ and (b) $|D'|$. The x-axis shows the distance between markers in base pairs; the y-axis shows the measured disequilibrium values.



Figure 6. The pattern of linkage disequilibrium values in the datasets considered by AWADALLA et al. 1999 using $H'$. The x-axis shows the distance between markers in base pairs; the y-axis shows the measured disequilibrium values.

Example 2. Variation of disequilibrium across populations:
According to the "out of Africa" hypothesis, there was a single migration of modern Homo sapiens out of Africa and an additional loss of variation as that initial non-African founder population grew and expanded to the East and later into the Americas. Estimating the values of linkage disequilibrium in various populations can help corroborate this hypothesis. To this purpose, four sites [three single nucleotide polymorphisms (SNPs) and one short tandem repeat polymorphism (STRP)] have been studied at the DRD2 locus on chromosome 11q by KIDD et al. 1998. The physical map for this region is SNP1–4.7 kb–SNP2–1.4 kb–STRP–19.3 kb–SNP4; thus a total of 25.4 kb is spanned by the four sites. Data from 28 populations covering five continents and 1324 subjects have been generated and analyzed to determine the overall pattern of disequilibrium in this chromosomal segment and how it varies across populations. We have reanalyzed the data using the global measure of disequilibrium defined above ($H_m$) on the basis of haplotype homozygosity for the four sites and obtained the results presented in Table 1. The table shows a clear pattern of increasing LD moving from African to European/Western Asian to Eastern Asian and Amerind populations, which is consistent with the out-of-Africa hypothesis. One can note an aberrantly high value of $H_m$ for Ethiopians. Examination of the haplotype frequencies for this population reveals a pattern of nearly complete LD. Although we have included this as an African population, it is actually intermediate between Africans and Europeans/Western Asians and could reasonably also be included in the latter group. Also, this population has the smallest sample size (n = 32), possibly leading to extreme variability. To address the significance of geographic origins, we have calculated the average variance within continents vs. the variance between continent means. The within-continent variance is 0.0186 (or 0.0149 leaving out the Ethiopians) vs. 0.0308 between continents. The ratio (between vs. within) is 1.66, or 2.07 omitting the Ethiopians.
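
The comparison of variances above can be sketched as follows. The grouping and the values in `toy_groups` are hypothetical and are not the entries of Table 1; the use of sample variances (divisor $k - 1$) is also an assumption, since the text does not say which estimator was used. The last two lines simply check the reported ratios against the reported variances.

```python
from statistics import mean, variance


def between_within_ratio(groups):
    """Variance of group means divided by the average within-group variance."""
    within = mean([variance(values) for values in groups.values()])
    between = variance([mean(values) for values in groups.values()])
    return between / within


# Hypothetical disequilibrium values for two groups of populations.
toy_groups = {"continent_A": [0.1, 0.3], "continent_B": [0.5, 0.7]}
toy_ratio = between_within_ratio(toy_groups)

# Consistency check of the ratios quoted in the text.
ratio_all = round(0.0308 / 0.0186, 2)            # all populations
ratio_without_outlier = round(0.0308 / 0.0149, 2)  # omitting the Ethiopians
```

Each group needs at least two populations for its sample variance to be defined.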


*  DISCUSSION

We have discussed the use of haplotype homozygosity to measure linkage disequilibrium or, equivalently, of $\sum_{ij} \pi_{ij}^2$ to measure the amount of dependency in a contingency table $\{\pi_{ij}\}$. The statistical literature contains references to this index from two different perspectives: as an index of agreement between partitions (see HUBERT and BAKER 1978) and as an index of diversity of the distribution $\{\pi_{ij}\}$ (see BHARGAVA and UPPULURI 1977a). As we illustrated, both points of view provide a statistical interpretation of the relation between homozygosity and linkage disequilibrium. What remains to be discussed is the relevance for genetic purposes of the direction of disequilibrium measured by homozygosity; this will require further examination. We limit ourselves to considering the four properties that a measure of disequilibrium should have according to HEDRICK 1987: (1) a simple biological interpretation (obviously satisfied for homozygosity), (2) an available statistical test (we showed how to construct it), (3) a direct relationship to evolutionary factors such as recombination, and (4) a standardization that allows comparison across loci and populations (we illustrated the available standardizations and their limits). Point (3) is particularly relevant when one wants to use disequilibrium measures to identify the location of a disease gene (see the review by DEVLIN and RISCH 1995). We have seen that, unfortunately, the relation between homozygosity and recombination fraction is not always direct, although it is so for excess homozygosity. The fact that homozygosity is defined independently of the number of alleles per locus makes $H$ particularly suitable for measuring LD between highly polymorphic markers. As the most recent LD-based genome screens have brought to general attention, it is important to collect information on the expected pattern of disequilibrium in different regions of the genome and in different populations. KIDD et al. 1998, HUTTLEY et al. 1999, REICH et al. 2001, and STEPHENS et al. 2001 represent a step in this direction. The LD measures used in these works are either the P value of a test of hypothesis (a solution acceptable in their case, but not robust to sample size fluctuations) or $D'$, which applies only to biallelic markers. The definition of robust measures of disequilibrium that can be used for successful comparison is crucial to this goal, and we believe that measures based on homozygosity can play a significant role.

The measures we have defined on the basis of haplotype homozygosity are particularly suited to assessing linkage disequilibrium of multiple sites (such as SNPs) and multiallelic systems (such as STRPs), on the basis of randomly ascertained samples. The problem of localizing a disease gene among a group of closely linked markers usually entails nonrandom sampling, where disease allele-bearing chromosomes are oversampled (DEVLIN and RISCH 1995). The measures we have described are not robust to such nonrandom sampling. For this particular application of linkage disequilibrium analysis, many different approaches have been described, either analyzing one marker locus at a time (HASTBACKA et al. 1992; KAPLAN et al. 1995; TERWILLIGER 1995; DEVLIN et al. 1996; XIONG and GUO 1997; GRAHAM and THOMPSON 1998; LAZZERONI 1998) or analyzing full multilocus haplotypes (MCPEEK and STRAHS 1999; SERVICE et al. 1999; LAM et al. 2000; MORRIS et al. 2000; LIU et al. 2001).


*  ACKNOWLEDGMENTS

Chiara Sabatti was supported by the Nancy Pritzker Foundation. Neil Risch was supported in part by National Institutes of Health grant GM057672.

Manuscript received October 9, 2001; Accepted for publication January 14, 2002.


*  LITERATURE CITED

AVERY, P. and W. HILL, 1979  Variance in quantitative traits due to linked dominant genes and variance in heterozygosity in small populations. Genetics 91:817-844.

AWADALLA, P., A. EYRE-WALKER, and J. MAYNARD-SMITH, 1999  Linkage disequilibrium and recombination in hominid mitochondrial DNA. Science 286:2524-2525.

BHARGAVA, T. and V. UPPULURI, 1977a  An axiomatic derivation of the Gini's index of diversity with applications. Metron 33:41-53.

BHARGAVA, T. and V. UPPULURI, 1977b  Sampling distribution of Gini's index of diversity. Appl. Math. Comput. 3:1-24.

BLOCH, D. and H. KRAEMER, 1989  2 x 2 kappa coefficients: measures of agreement or association. Biometrics 45:269-287.

BROWN, A., M. FELDMAN, and E. NEVO, 1980  Multilocus structure of natural populations of Hordeum spontaneum. Genetics 96:523-536.

COLLINS, F., M. GUYER, and A. CHAKRAVARTI, 1997  Variations on a theme: cataloging human DNA sequence variation. Science 278:1580-1581.

DEVLIN, B. and N. RISCH, 1995  A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29:311-322.

DEVLIN, B., N. RISCH, and K. ROEDER, 1996  Disequilibrium mapping: composite likelihood for pairwise disequilibrium. Genomics 29:311-316.

DIACONIS, P. and B. STURMFELS, 1998  Algebraic algorithms for sampling from conditional distributions. Ann. Stat. 26:363-397.

FOWLKES, E. and C. MALLOWS, 1983  A method for comparing two hierarchical clusterings. J. Am. Statist. Assoc. 78:553-569.

GRAHAM, J. and E. THOMPSON, 1998  Disequilibrium likelihoods for fine-scale mapping of a rare allele. Am. J. Hum. Genet. 63:1517-1530.

HASTBACKA, J., A. DE LA CHAPELLE, I. KAITILA, P. SISTONEN, and A. WEAVER et al., 1992  Linkage disequilibrium mapping in isolated founder populations: diastrophic dysplasia in Finland. Nat. Genet. 2:204-211.

HEDRICK, P., 1987  Gametic disequilibrium measures: proceed with caution. Genetics 117:331-341.

HUBERT, L. and F. BAKER, 1978  Evaluating the conformity of sociometric measurements. Psychometrika 43:31-41.

HUTTLEY, G., M. SMITH, M. CARRINGTON, and S. O'BRIEN, 1999  A scan for linkage disequilibrium across the human genome. Genetics 152:1711-1722.

KAPLAN, N., W. HILL, and B. WEIR, 1995  Likelihood methods for locating disease genes in nonequilibrium populations. Am. J. Hum. Genet. 56:18-32.

KIDD, K., B. MORAR, C. M. CASTIGLIONE, H. ZHAO, and A. J. PAKSTIS, 1998  A global survey of haplotype frequencies and linkage disequilibrium at the DRD2 locus. Hum. Genet. 103:211-227.

KRUGLYAK, L., 1998  Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22:139-144.

LAM, J., K. ROEDER, and B. DEVLIN, 2000  Haplotype fine mapping by evolutionary trees. Am. J. Hum. Genet. 66:659-673.

LANDER, E. and D. BOTSTEIN, 1987  Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children. Science 236:1567-1570.

LAZZERONI, L., 1998  Linkage disequilibrium and gene mapping: an empirical least-squares approach. Am. J. Hum. Genet. 62:159-170.

LAZZERONI, L. and K. LANGE, 1997  Markov chains for Monte Carlo tests of genetic equilibrium in multidimensional contingency tables. Ann. Stat. 25:138-168.

LEHMAN, E., 1983  Theory of Point Estimation. John Wiley & Sons, New York.

LEWONTIN, R., 1964  The interaction of selection and linkage. I. General considerations: heterotic models. Genetics 49:49-67.

LIU, J., C. SABATTI, J. TENG, B. KEATS, and N. RISCH, 2001  Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res. 11:1716-1724.

LONJOU, C., A. COLLINS, and N. MORTON, 1999  Allelic association between marker loci. Proc. Natl. Acad. Sci. USA 96:1621-1626.

MCPEEK, M. and A. STRAHS, 1999  Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am. J. Hum. Genet. 65:858-875.

MEHTA, C. and N. PATEL, 1983  A network algorithm for performing Fisher's exact test in r x c contingency tables. J. Am. Stat. Assoc. 78:427-434.

MORRIS, A., J. WHITTAKER, and D. BALDING, 2000  Bayesian fine-scale mapping of disease loci by hidden Markov models. Am. J. Hum. Genet. 67:155-169.

MORTON, N. and S. SIMPSON, 1983  Kinship mapping of multilocus systems. Hum. Genet. 64:103-104.

OHTA, T., 1980  Linkage disequilibrium between amino acid sites in immunoglobulin genes and other multigene families. Genet. Res. 36:181-197.

REICH, D., M. CARGILL, S. BOLK, J. IRELAND, and P. SABETI et al., 2001  Linkage disequilibrium in the human genome. Nature 411:199-204.

SERVICE, S., D. TEMPLE-LANG, N. FREIMER, and L. SANDKUIJL, 1999  Linkage disequilibrium mapping of disease genes by reconstruction of ancestral haplotypes in founder populations. Am. J. Hum. Genet. 64:1728-1738.

SMITH, C., 1953  The detection of linkage in human genetics. J. R. Stat. Soc. B 15:153-184.

STEPHENS, J., J. A. SCHNEIDER, D. A. TANGUAY, J. CHOI, and T. ACHARYA et al., 2001  Haplotype variation and linkage disequilibrium in 313 human genes. Science 293:489-493.

SVED, J., 1968  The stability of linked systems of loci with a small population size. Genetics 59:543-563.

TERWILLIGER, J., 1995  A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci. Am. J. Hum. Genet. 56:777-787.

XIONG, M. and S. GUO, 1997  Fine-scale genetic mapping based on linkage disequilibrium: theory and application. Am. J. Hum. Genet. 60:1513-1531.

WEIR, B., 1996  Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.

WRIGHT, A., A. CAROTHERS, and M. PIRASTU, 1999  Population choice in mapping genes for complex diseases. Nat. Genet. 23:397-404.



