- THIS ARTICLE
-
Abstract
- Full Text (PDF)
-
All Versions of this Article:
genetics.108.089912v1
179/3/1577 most recent - Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Related articles in Genetics
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Abney, M.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Abney, M.
Originally published as Genetics Published Articles Ahead of Print on July 13, 2008.
Genetics, Vol. 179, 1577-1590, July 2008, Copyright © 2008
doi:10.1534/genetics.108.089912
Identity-by-Descent Estimation and Mapping of Qualitative Traits in Large, Complex Pedigrees
Mark Abney1
Department of Human Genetics, University of Chicago, Chicago, Illinois 60637
1 Address for correspondence: Department of Human Genetics, University of Chicago, 920 E. 58th St., Chicago, IL 60637.
E-mail: abney{at}genetics.uchicago.edu
>ABSTRACT
METHODS
RESULTS
DISCUSSION
APPENDIX: MULTIPOINT HBD...
ACKNOWLEDGEMENTS
LITERATURE CITED
Computing identity-by-descent sharing between individuals connected through a large, complex pedigree is a computationally demanding task that often cannot be done using exact methods. What I present here is a rapid computational method for estimating, in large complex pedigrees, the probability that pairs of alleles are IBD given the single-point genotype data at that marker for all individuals. The method can be used on pedigrees of essentially arbitrary size and complexity without the need to divide the individuals into separate subpedigrees. I apply the method to do qualitative trait linkage mapping using the nonparametric sharing statistic Spairs. The validity of the method is demonstrated via simulation studies on a 13-generation 3028-person pedigree with 700 genotyped individuals. An analysis of an asthma data set of individuals in this pedigree finds four loci with P-values <10–3 that were not detected in prior analyses. The mapping method is fast and can complete analyses of
150 affected individuals within this pedigree for thousands of markers in a matter of hours.
COMPUTATION of identical-by-descent (IBD) allele sharing between related individuals is a necessary ingredient in many methods for linkage mapping of complex traits. Typically, IBD allele sharing is used either directly to assess whether affected individuals are sharing more at a locus than expected under the null hypothesis or as a component in the covariance matrix in a variance component model. A number of algorithms for computing IBD exactly exist (e.g., ELSTON and STEWART 1971; LANDER and GREEN 1987; KRUGLYAK et al. 1996; FISHELSON and GEIGER 2002); however, these methods become computationally infeasible when pedigrees are very large and complex. Under such circumstances approximate methods become necessary, whether Markov chain Monte Carlo (THOMPSON et al. 1993; SOBEL and LANGE 1996; HEATH 1997) or regression based (FULKER et al. 1995; ALMASY and BLANGERO 1998). Even these methods, however, have difficulty when the pedigree is very deep with many generations of individuals with no data.
In humans, very deep, and possibly complex, pedigrees often arise in conjunction with genetic studies of isolated populations. Isolated populations are commonly thought to have characteristics that may prove advantageous for mapping (WRIGHT et al. 1999; PELTONEN et al. 2000; ESCAMILLA 2001; SHIFMAN and DARVASI 2001; SERVICE et al. 2006), yet may require specialized statistical methods to both properly leverage these advantages and provide a valid test for the presence of a trait-influencing gene (BOURGAIN and GENIN 2005). Large pedigrees also arise in other animal systems where breeding is carefully controlled. For example, there is interest in methods that are applicable to complex pedigrees for both livestock (THALLMAN et al. 2001) and dogs (SUTTER and OSTRANDER 2004).
What I present here is a rapid computational method for estimating, in large complex pedigrees, the probability that pairs of alleles are IBD given the single-point genotype data at that marker for all individuals. Because the method is very fast, it can easily be used on genomewide data with many thousands of markers on hundreds of related individuals. It can be used directly to do linkage mapping with affected individuals using the Spairs statistic or to compute approximate multipoint probabilities both for alleles being IBD, using regression-based approaches (e.g., ALMASY and BLANGERO 1998), and for alleles being homozygous by descent (HBD) using a hidden Markov model (HMM) (ABNEY et al. 2002). Here, I describe this computational method and its application to qualitative trait linkage analysis. Although computing Spairs is straightforward, in principle, a number of challenges must be overcome in creating a practical and valid mapping method for very large, and possibly complex, pedigrees. In particular, it is common in studies involving large pedigrees to have one, or a few, pedigrees, making the asymptotic distribution of the test statistic, which is appropriate when there are many independent pedigrees, not necessarily applicable. Also, the allele-frequency distribution may have a major influence on the test statistic when inheritance information is obscured by missing data. Unfortunately, the relevant allele-frequency distribution is that in the founders of the pedigree, which in large pedigrees may be many generations earlier than the sampled individuals. As a result, estimation of the founder allele-frequency distribution from the sampled data can result in a large bias of the conditional expected sharing statistic. The difficulties posed by not knowing the true allele-frequency distribution can be largely overcome through the use of simulations, but the capacity to do many simulations requires a computationally efficient method, particularly when a large number of markers are involved. Below, I describe the theoretical basis of the IBD estimation method, the approximations used, and how it differs from earlier methods that take a similar approach (WANG et al. 1995; DAVIS et al. 1996). I then show its application to single-point linkage mapping using Spairs and how the difficulties mentioned above are solved. The APPENDIX describes how to use the IBD estimation method to obtain multipoint estimates of HBD by modifying the HMM of ABNEY et al. (2002).
ABSTRACT
>METHODS
RESULTS
DISCUSSION
APPENDIX: MULTIPOINT HBD...
ACKNOWLEDGEMENTS
LITERATURE CITED
IBD estimation
The objective is to compute the probability of two alleles being IBD given all available genotype data at that locus and the entire, unbroken pedigree. The method is based on the recursive strategy suggested by WANG et al. (1995) and DAVIS et al. (1996). In both of these studies the probability is computed in a manner analogous to the recurrence relation for kinship coefficients,
, where
and
are the mother and the father of individual
, and individual
is not a descendant of
. The equivalent recurrence equation when there are genotype data at the locus, as given by WANG et al. (1995) and DAVIS et al. (1996), is valid only when there are no missing genotypes and is
![]() | (1) |
is the probability that the ith allele from
was inherited from the jth allele of
's mother, given the observed genotype data; and Ai
Bj means allele Ai is IBD with allele Bj. This equation is applied repeatedly until the founders of the pedigree are reached and boundary conditions are used to obtain the probability. When there are no individuals with missing genotypes the method is both fast and returns exact probabilities.
Unfortunately, as recognized by both WANG et al. (1995) and DAVIS et al. (1996), Equation 1 is not valid when there are missing genotypes in the data. Although neither study formulated a version of Equation 1 that holds under missing data conditions, they each suggested approaches for this case. The most recent version of SimIBD (DAVIS et al. 1996) uses a Monte Carlo procedure where, for each realization, a random genotype is assigned to each missing genotype, and the recursive algorithm is applied. The final probability is the average of the probabilities computed at each Monte Carlo realization. In contrast, WANG et al. (1995) suggest two different possibilities. In the first, when the recursive algorithm encounters an individual who has a missing genotype, the relevant inheritance probability [e.g.,
] is computed by summing over all possible genotypes for the missing data weighted by the probability of the genotype given the observed genotypes. To simplify the computation, one can use the probability of the genotype given only the genotypes of close relatives rather than all observed genotypes. The second possibility is to find the genotype configuration for all individuals with missing genotypes that has the highest probability and apply the recursive algorithm to that configuration. Finding the highest-probability genotype configuration, however, can be computationally demanding if there are many missing genotypes.
A common situation when analyzing large pedigrees is to have several generations of the pedigree completely untyped. None of the above strategies are entirely sufficient in such a situation. The problem is that there is little information in the untyped portion of the genealogy from which to infer the genotype probability distribution in those individuals. Simulating over valid genotype configurations can then be time consuming and, possibly, inaccurate. Summing over all possible genotypes, on the other hand, may be computationally impracticable.
The approach I propose relies on classifying individuals into two groups, A and S. An S individual is someone who either is genotyped or has at least one ancestor who is genotyped, while an A individual is someone who is not among the S group. Note that by this definition the S group may contain individuals for whom no data were actually collected. I also define a set of individuals called "quasi-founders," where each quasi-founder is either an S individual with both parents in the A group or an A individual with a spouse in S. A version of recurrence Equation 1, reformulated to hold true even under missing data, is applied to the S individuals until the quasi-founders are reached, at which point boundary conditions are employed to determine the final probability. This allows one to avoid using the recurrence equation over those generations with no genotype data, thereby speeding up the computation significantly. Furthermore, additional computational efficiency is gained by applying approximations designed specifically to work well when the rate of missing genotype data in S is reasonably low (e.g., <20%). Note that this constraint on the rate of missing genotype data in S still allows for potentially many generations of untyped individuals in A.
The algorithm is described in four parts. First, I describe the general form of Equation 1 and how to use this as a recurrence relation by updating the genotype information the probabilities are conditional on. I then show how the conditional probability of two alleles being IBD given some genotype information should be expressed when the allelic type of either of those alleles is unknown. This provides a general expression that can be applied recursively to compute the IBD probability. Applying this expression requires computing transmission probabilities in the presence of missing data. I derive an equation for calculating this probability and describe the approximations made to assure computational efficiency. Finally, the recursive algorithm is completed by specifying the boundary conditions to the recurrence equations.
Recurrence rules:
The following notation is used throughout the remainder of this article. Individuals are indicated with uppercase script characters (e.g.,
,
,
,
), while the true genotype of, for instance, individual
comprises two random variables (A1, A2), where the alleles A1 and A2 may each take on one of the L possible allelic types at the locus,
. Note that the ordering of A1 and A2 is arbitrary. Throughout, I assume that the pattern of missing genotype data is noninformative and that there is no genotyping error. Hence, observed allelic types indicate the true underlying genotype whereas the event that a genotype is missing provides no information, by itself, on the true genotype. Furthermore, the entire analysis is done conditional on the pattern of missing data. I let G represent the genotype information, which, as I show below, will grow during the course of the algorithm. Then, G = gr represents the information at the rth stage, where gr is a vector with two elements for each quasi-founder and for each person in S, where the element
=
i if the allelic type of the first allele of individual
at the rth stage is known to be
i (i.e., A1 =
i at stage r) or is equal to zero if unknown. The vector g0, then, has elements representing all directly observed genotype data or data that can be inferred without ambiguity. The vector
has the same length as gr with entries equal to zero at all locations except for the two elements representing the alleles of
. Then, for instance,
, where · is the inner product, is a vector with entries for
equal to the corresponding entries of gr and all other entries equal to zero.
To extend Equation 1 to the case when some genotypes are missing, first note that it includes conditional probabilities for descent events involving only one allele at a time from
(e.g., {
}, {
}, etc.). In fact, if A1 came from the mother, for instance, then A2 must have come from the father. A version of Equation 1 that includes the descent events for the other allele and is true even with missing data is
![]() | (2) |
and
are not the same individual and
is not a descendant of
. If
and
are the same individual the equation becomes
![]() | (3) |
Unlike Equation 1, Equations 2 and 3 are not strictly recurrence equations because the IBD probabilities on the right-hand side have additional descent conditions not present in the left-hand side probability. In the case of no missing genotype data, the equations may be applied recursively by noting that terms such as
on the right-hand side of Equations 2 and "expanding" these terms using the appropriate recurrence relation. Also, the descent probabilities
, etc., are easily tabulated on the basis of the possible genotype configurations of
,
, and
. When there are missing genotypes, it is still possible to employ a recursive method based on Equations 2 and 3 by updating G with the genotype information provided by the descent events (e.g.,
). To show this I describe the application of the updating scheme to the first term on the right-hand side of Equation 2, but the arguments apply equally well to all terms on the right-hand side. First, focus on the conditional IBD probability
, keeping in mind that Equation 2 holds only when
is neither
nor a descendant of
. If
has a known genotype but either
or
does not, then the additional information from the conditions
and
must be included in the probability calculation. If, for example,
and
have known genotypes,
does not, and A2 =
l, then 
, where gr+1 is identical to gr, but with component
. The subsequent applications of the recurrence Equation 2 from this term must be done conditional on G = gr+1 rather than G = gr. Note that this implies that the computations must allow for the case of a partially known genotype, as F2 is known but F1 may not be.
In general, then, the IBD probabilities Pr(A1
B1 | G = gr) and Pr(A1
A2 | G = gr) are conditional on both the observed genotype data and the additional genotype information that results from the previous application of recurrence Equations 2 and 3. Even with the additional information, however, it is possible for A1 or B1 to be unknown. In this case, the probabilities must be written as a sum over the allelic types for the unknown alleles before the recurrence equations are applied. So, if A1 is unknown and
,
![]() | (4) |
B1 | G = gr, A1 =
k) = Pr(A1
B1 | G = gr+1). Then, to compute the probability Pr(A1
B1 | G = gr) one applies Equation 2 to the right-hand side of Equation 4, obtaining
![]() | (5) |
and
(and the other alleles of
and
if known) and, in practice, generally serves as a very good approximation.
Equation 5 provides a general recurrence equation that may be applied recursively to determine the probability that A1 and B1 are IBD, as long as
and
are not the same individual and
is not a descendant of
. If
and
are the same individual, Equation 3 may be generalized similarly. To compute the IBD probability, it is necessary to determine the transmission probabilities
in the presence of missing data. The derivation of these probabilities and the approximations made are described in Equations 6–10. Although only the probability
is computed here, extending this derivation to the other transmission probabilities is straightforward.
To compute this probability we must consider the case where A1 or A2 may be unknown. If
, one must sum over possible values of A2,
![]() | (6a) |
![]() | (6b) |
![]() | (6c) |
and
are known, and where Equation 6c results from approximating
. Approximation (6c) allows one to compute the probability without performing a sum over all alleles by assuming that the sum over all allelic types is approximated by the conditional probability with A2 unknown. Note that when A2 is known and equal to, for instance,
j, Equation 6c becomes
+ 1
and is exact if
and
are known. Equation 6c says that we may approximate the conditional probability of transmission events
,
given all the genotype data with the probability given just the genotype data of
, and
. In fact, as described below, these probabilities will be computed using the genotype data from first-degree relatives. In computing the conditional probability of these transmission events I assume
, where
and
are allowed to equal either the unknown state 0 or one of the known allelic types
. This allows us to replace
with
in Equation 6c.
The probability in Equation 6c may be computed using Bayes' rule,
![]() | (7) |
, the observed genotypes, at the r + 1 step, of
and
only. Consider the numerator of this equation,
![]() | (8) |
![]() | (9) |
in Equation 9 is conditional only on the known values of
and
, I improve the missing data approximations in Equations 6b and 6c by instead computing this probability conditional on the observed genotypes of all first-degree relatives of
.
To complete the computation of Equation 8 one needs to determine the probability
. This probability is D(A1, M1), but in the case where
is known and
is unknown, the two conditions
and
result in the conditional probability on the right-hand side of Equation 9 becoming
, when
and
. Hence,
, and combining this with Equations 9, 8, and 7, Equation 6c becomes
![]() | (10) |
Computing Equation 9 may require knowing the conditional genotype distribution of
if
is unknown. Although this distribution is computed conditional on the genotypes of first-degree relatives, it also depends on the relatedness of
and
(which is zero in outbred individuals). The current implementation of the algorithm to compute this probability ignores the relatedness of
and
because, when there are a fair amount of genotype data among the first-degree relatives, this has a relatively small effect, unless
and
are very closely related. The algorithm for obtaining the conditional genotype distribution for an individual is to descend through the pedigree, one generation at a time, beginning with the quasi-founders, and for each person encountered who has missing genotype data, compute the genotype-frequency distribution given the genotypes of the parents, the spouse, and the offspring. If the person is a quasi-founder, the computation is done conditional only on the offspring and population allele frequencies. If the person is not a quasi-founder, but one or both of the parents are untyped, the computation is done conditional on the genotype-frequency distribution already computed for that parent. If an offspring is untyped, that offspring is ignored in the computation. To propagate information from more distant relatives when close relatives are untyped, additional iterations could be done, as needed. Experience suggests, however, that when the rate of missing genotypes is fairly low (<
20% in the set S), a single pass through the pedigree is sufficient to obtain reasonably accurate estimates of the conditional genotype distributions in individuals with missing genotype data.
The general form of Equation 10 is particularly useful in that it naturally provides a framework to include effects such as mutation and genotyping error. In particular, while normally D(A1, M1) is the zero or one indicator function when
and
are known, this may be generalized to a nontrivial function of the observed genotypes. Selecting a particular model for mutation or genotyping error where the observed genotypes depend on the true underlying genotypes would allow one to devise a more general form for D(A1, M1) than what is given in Equation 9.
Boundary conditions:
The recursive algorithm is completed by specifying boundary conditions that are applied to Equations 2 and 3 when
and
are quasi-founders. I assume below that the boundary conditions are applied at step t of the recurrence. The first rule is a general condition that applies to all sample individuals.
Boundary condition 1:
For any allele Ai of individual
, Pr(Ai
Ai | G = gt) = 1.
For the next two boundary conditions pi is the frequency of allele i in the founding population. To properly account for missing allele information I encode a missing allele as 0 and use the convention p0 = 1. I also define the following two functions,
![]() |
![]() |
's genotype
and the alleles of
's genotype
, where any of i, j, k, l may be unknown (i.e., there may be partial or no genotype information on one or both individuals). The conditional probabilities given the genotypes of
and
in boundary conditions 2 and 3 are exact, even in the presence of missing data, but approximate the conditional probability given all the genotype data.
Boundary condition 2:
Let
be the inbreeding coefficient of individual
. When
is a quasi-founder,
![]() |
and
, one needs the condensed identity coefficients (JACQUARD 1974) for the pair. Without loss of generality, I consider only the probability that A1 and B1 are IBD.
Boundary condition 3:
When
and
are quasi-founders,
![]() |
where Sr is condensed identity state r and
r is its unconditional probability.
Furthermore, the probabilities in the sum are
|
Frequency estimation:
When
and
are not just quasi-founders but actual founders, the boundary conditions reduce to those given in WANG et al. (1995) and DAVIS et al. (1996). Both boundary conditions 2 and 3 depend on the allele frequencies in the founding population, unless all the founders are genotyped. These frequencies may be estimated either from the founding population itself, if available, or through the genotypes of those individuals in the pedigree. With deep pedigrees, where the founders lived many generations ago, it may be impossible to obtain accurate estimates of the allele frequencies in the founding population. Unless this population is large enough that genetic drift will have had a negligible effect, sampling from the present-day population from which the pedigree originated might not provide accurate estimates of the founder-allele frequencies. Instead, allele frequencies are often estimated from the genotyped study sample.
The current method uses the best linear unbiased estimator of allele frequencies of MCPEEK et al. (2004) to obtain estimates from the genotyped study sample. The estimator of the frequency p
for allele
is given by
![]() |
ij with
ij being the kinship coefficient between individuals i and j, and y
is the vector whose ith element equals one-half the number of alleles of type
in individual i.
The danger of using the same population both for estimating frequencies and for computing sharing is that the sharing estimate will now be negatively biased. This is a general phenomenon when estimating IBD sharing among individuals and can be understood as follows. When related individuals share an allele identical by state there is some probability that this allele is also shared IBD. The rarer this allele is in the founding population, the higher the probability is that these two copies were inherited from a common ancestral allele and are IBD. Conversely, an allele shared between individuals that was fairly common in the founders is relatively unlikely to be IBD. Because of genetic drift, the frequency of an allele in the typed individuals will vary from its frequency in the founding population, and, when IBD is estimated for individuals sharing this allele, the estimate will be either too high or too low depending on whether the allele frequency has drifted down or up, respectively. Underestimating the probability will, on average, happen more often because common alleles, where underestimation most often takes place, are necessarily more frequent than rare alleles. Without some adjustment for this effect, Pr(A1
B1 | G) will tend to be below its expected value. The approach taken here is to use simulations to estimate the amount of bias when the allele frequencies are estimated. This is discussed in more detail below.
Linkage mapping
I implement a nonparametric, affecteds-only mapping method by using the sharing statistic Spairs. For affected pair k (composed of individuals
and
),
![]() |
![]() |
is the kinship coefficient between
and
. Here, the summation for Spairs is done over all possible pairs, including when
and
represent the same individual, in which case
and
, where
is the inbreeding coefficient of
. Under the null hypothesis of no linkage, we have
![]() |
![]() |
The definition of Spairs,k relies on knowing the true, unobserved IBD state of the alleles. Instead, we use its expectation given the available genotype data at that locus G,
![]() |
![]() |
for every possible pair of individuals at a locus. Under the null hypothesis of no linkage
will form a distribution with a mean of zero, whereas under the alternative of linkage there should be excess sharing and
will generally be positive. Hence, to test for linkage one would do a one-sided test of
and reject the null hypothesis if
for some value of t.
Significance:
A major challenge of the method is determining the proper statistical significance of the estimated sharing
at a marker. Three problems, in particular, must be overcome: (1) Computation of an exact likelihood-ratio statistic, as done in KONG and COX (1997), is computationally prohibitive; (2) with only a single large pedigree, asymptotic theory for the distribution of
might not hold and, hence, using a normal distribution approximation, as in KRUGLYAK et al. (1996), might not be valid; and (3) obtaining an empirical distribution for
for each marker, conditional on that marker's allele frequency, could entail analysis of 102–105 simulated data sets per locus, depending on the degree of evidence for linkage at the locus. (The varying number of simulations per locus results from a time-saving strategy of using a limited number of simulations to screen out potentially interesting markers for which more accurate empirical distributions are then found.) This may be computationally impractical for data sets with many markers.
A potential solution is to develop a more computationally efficient variation on approach (3), in which the screening step is based on an appropriate transformation of the empirical distribution from a single completely informative marker. Because a completely informative marker allows immediate identification of which alleles are IBD, it is possible to rapidly compute the distribution of Spairs. A first-pass P-value could then be estimated at each locus by comparing
at the marker against this distribution. For markers showing strong evidence of linkage, the more accurate and time-consuming approach (3) could be used. The difficulty with this approach is twofold. First, the variance of Spairs is much larger than the variance of
. The less informative the marker, the greater the difference in the variances. For essentially any marker the P-value obtained by comparing
to the distribution of Spairs is far too conservative to be useful. Second, when allele frequencies are estimated from the sample
is negatively biased. Without some form of correction the P-value becomes even more conservative.
With a fast enough method, simulations can be used to estimate the bias and variance at a marker and correct the estimate of
. Note that the number of simulations needed to correct the bias and variance at a marker would typically be much less than the number needed to assess the empirical P-value at the marker by method (3). To do this, for each simulation the founders are given marker alleles according to the allele-frequency distribution estimated in the real data set. The allele frequencies for each simulated data set are computed from the same individuals who are typed in the real data set, and these frequencies are used to compute
for that simulation. The mean value over the simulated
is used as an estimator for the bias,
, where Nsim is the number of simulations. The statistic for that marker is adjusted by this bias. Although this procedure can dramatically reduce the bias, it cannot completely eliminate it, even as Nsim becomes very large, because estimates of the bias are themselves biased. This comes about because the estimated frequencies are, in general, larger than the true frequencies (e.g., rare alleles are often lost) and simulating founder genotypes using these estimated frequencies results in the simulated estimates of
being slightly more biased than the actual data. The final result is a slight overcorrection. The ideal solution would be to use allele frequencies determined not from the study sample itself but from the population from which the founders originated.
The distribution problem is solved by assuming that the distributions of Spairs and
have the same shape, but different variance. Normalizing both Spairs and
to have a variance of one would allow one to compare
to the distribution of Spairs. I estimate the variance of
from the values of
obtained from the bias estimation procedure. A new statistic is defined as
. The statistic
can now be compared directly against the empirical distribution obtained for
to obtain an approximate P-value. The amount of computation per marker now includes a single analysis of the real data and analysis of Nsim simulated data sets with allele frequencies and marker informativeness based on the real marker data. Note that in the case where the founder-allele frequencies are known, adjusting for the bias is unnecessary, but simulations are still needed to estimate
. The procedure of adjusting the bias and normalizing the variance has the benefit that now
(and, hence, the approximate P-value) is directly comparable across all markers, regardless of markers' allele-frequency distributions. This is unlike
, where the distribution depends significantly on the informativeness and allele frequencies of the marker.
In practice, the assumption that the shapes of the distribution for Spairs and
are identical does not always hold for very large values of the statistics, where P-values are very small. In this case, the approximate P-value obtained using the above method tends to be highly conservative. Estimates for very small P-values can be obtained empirically by doing many simulations under the null hypothesis, using either the given or the estimated allele frequencies. Although many simulations are needed, the method is fast enough to be able to determine the empirical distribution of a marker under the null hypothesis for a small number of markers.
ABSTRACT
METHODS
>RESULTS
DISCUSSION
APPENDIX: MULTIPOINT HBD...
ACKNOWLEDGEMENTS
LITERATURE CITED
Simulations:
Simulations were done to test the accuracy and validity of the proposed methods. Genotype data for a designated set of individuals were obtained by assigning the founders of a pedigree a genotype at a single locus on the basis of given allele frequencies and allowing these alleles to "drop" through the pedigree according to Mendelian inheritance. Only a single locus (i.e., not multiple linked markers) was simulated. Three different types of markers were used in the simulations. The first was a single-nucleotide polymorphism (SNP) having two alleles of equal frequency. The second marker represented a microsatellite and had five equifrequent alleles. The third marker was meant to be representative of a "super marker," where each allele is actually a haplotype of tightly linked SNPs. The strategy of using haplotypes as alleles of a marker provides a straightforward means of using multipoint data to increase information about the inheritance process. Hence, the marker has 21 alleles with allele frequencies given by 2 x 0.2, 0.12, 0.07, 0.06, 0.05, 15 x 0.02. This allele-frequency distribution was chosen as being approximately representative of haplotype distributions for 90-kb segments on the basis of HAPMAP data for Caucasians.Using both the method proposed here and an exact computation as done by MERLIN (ABECASIS et al. 2002), I checked the accuracy of the proposed method by estimating the number of alleles shared IBD for the two individuals in the pedigree shown by solid symbols in Figure 1. For each of the three types of markers, 1000 simulations were done with the genotypes from the top two generations removed and 10% of the remaining genotypes randomly set to be missing. For the purposes of the proposed method, then, all individuals in the third generation from the top were considered quasi-founders. Both exact and approximate estimates of the number of alleles shared IBD were computed for each simulation with the results shown in Figure 2. The plots show good agreement between the approximate and the exact computations with the square root of the mean squared error being 0.024, 0.049, and 0.044 for the SNP, microsatellite, and haplotype markers, respectively.
|
|
The above methods were applied to simulated data sets in a large, complex pedigree taken from the Hutterite population. The Hutterites are an isolated religious sect that originated in the Tyrolean Alps during the 1500s and now largely reside in the northern United States and western Canada (MANGE 1964; HOSTETLER 1974). The pedigree used here is an extension of the pedigree described in ABNEY et al. (2000), comprising 3028 individuals in 13 generations. To assess the efficacy of the bias-correcting and scaling procedures, simulations were done with 700 individuals' assigned genotypes. These 700 are the same sample that is currently being genotyped using the Affymetrix 500k SNP array. From this group of 700, 148 have been diagnosed with bronchial hyperresponsiveness and were labeled as "affected." Figures 3 and 4 show the histograms of Spairs and
from 10,000 simulations with the haplotype marker before and after the bias-correction and scaling procedure was applied. The distributions in Figure 4 are fairly well matched over much of the range of statistic values, indicating that the scaled distribution of Spairs can be used as a reference distribution from which approximate P-values can be obtained. It is also evident that Spairs has a heavier tail at very large values, resulting in generally conservative estimates for very small approximate P-values.
|
|
The main utility of the bias-correction and scaling procedure is to put all markers in the study, regardless of informativeness, on a common scale over which they may be compared. This allows one to select the best markers to follow up with an empirical analysis by choosing those with the largest
(or, equivalently, those with the smallest approximate P-value). Figure 5 compares the distributions of
for the microsatellite and SNP markers over 10,000 simulations. The two distributions show good agreement, suggesting
is a reliable measure for selecting the markers that have the largest amount of IBD sharing. In Figure 6 the distribution of
for the haplotype marker is compared against the distributions of the microsatellite and SNP markers. In this case, where the allele-frequency distributions of the markers are radically different, the distributions do not compare as well, although the microsatellite marker distribution is notably closer to the haplotype marker distribution than is the SNP marker distribution. Most genetic mapping studies, however, tend to have markers with allele-frequency distributions that are relatively similar (e.g., all SNPs, all microsatellites, or possibly a mixture of the two), suggesting that
will generally be an effective measure of sharing across markers. It is worth noting that the haplotype marker allele-frequency distribution, with many alleles with small frequencies, presents the most challenging scenario when trying to estimate the bias and scale of the distributions. Inevitably, genetic drift within the sampled population results in a number of lost alleles and frequencies within the current population that may not be very representative of the frequencies in the founding population. This results in inefficiency in the bootstrapping method used and a less accurate estimate of the bias of
.
|
|
To check the type-I error rate of the empirical P-value estimates, 5000 replicates were generated for each marker type in the Hutterite pedigree. In each replicate 910 individuals in the bottom three to four generations of the pedigree were assigned genotypes and analyses were done with 71 individuals assigned as affected, the same individuals as were previously diagnosed with asthma in this population (OBER et al. 2000). In each replicate the empirical P-value of the marker was estimated on the basis of 5000 simulations. Two analyses were done within each replicate, one where 10% of the genotype data was removed at random to emulate missing data and another with no missing genotypes in the sample individuals. Results are in Table 1 and, when there are no missing data, show good agreement with the nominal type I error. With 10% of data missing, the empirical P-values are slightly conservative. This results from the empirical distribution being computed under simulations with no missing data. Although this results in slightly conservative P-values, the computation time can be significantly reduced when doing 10,000–100,000 simulations. Optionally, with data sets that can be analyzed quickly, one can condition on a particular set of individuals having missing data to obtain more accurate empirical P-values.
|
Asthma data set:
A genome screen for asthma using 563 markers, including both microsatellites and SNPs, that was previously published (OBER et al. 2000) was reanalyzed using the method described here. In the previous analysis, the pedigree was divided into 20 subpedigrees with all inbreeding loops trimmed. A single-point, semiparametric method was used where a likelihood was maximized over parameters representing the penetrances, disease-susceptibility allele frequency, and recombination frequency. A likelihood-ratio
2 was obtained by comparing to a maximized likelihood where the recombination frequency was set to 0.5. Because of the size of the subpedigrees, each marker was analyzed separately. This analysis resulted in five markers with P-values <0.001 with the smallest P-value being 0.0002. The genome screen data were analyzed using the methods described here and approximate P-values were obtained for each marker. Four markers had approximate P-values <0.01 and these were selected for follow-up with empirical P-value estimates. Empirical P-values were estimated on the basis of 100,000 simulations for each marker. Results for these four markers are shown in Table 2. Three of the four markers had smaller P-values than the smallest P-value of the previous analysis with two being much more significant. There was no overlap between these markers and the five markers reported previously. Approximate P-values for the five markers identified earlier ranged from 0.13 to 0.65. Of the four markers identified with the new method all had P-values >0.05 in the old analysis except for D5S1505, which had a P-value of 0.0441. That the markers found to have significant linkage in the two studies are different is not surprising, considering the very different methods used. The previous analysis finds markers that show some evidence for linkage under some genetic model in at least some of the subpedigrees. Although this may, it does not necessarily indicate higher than expected levels of IBD sharing among affecteds when looking at the pedigree as a whole, which is what the current method detects.
|
I also assessed the computation time required to do the analyses. These were accomplished on an AMD Opteron 252 at 2.6 GHz, running Linux 2.6. The exact computation of the number of alleles shared IBD, as done by MERLIN, for 1000 simulations, took 9.4 x 104, 3930, and 2880 sec for the SNP, microsatellite, and haplotype markers, respectively. In contrast, the approximate computations took 0.4, 0.2, and 1.5 sec for the same data sets. The empirical P-value estimates, using the asthma sample, took 41, 9.5, and 7 hr to estimate P-values for 1000 markers, for the SNP, microsatellite, and haplotype markers, respectively, where the estimates were based on 5000 replicates for each marker. That is, 5 x 106 estimates of IBD sharing were computed over all 2556 pairs, averaging 0.03, 0.0068, and 0.0052 sec per
estimate. When analyzing the real genotype data for the asthma sample, where markers consisted of both microsatellites and SNPs of varying informativeness, each estimate of
took
0.074 sec. This corresponds to obtaining approximate P-values in a genome screen of 1000 markers, with bias estimated using 50 simulations for each marker, in
1 hr. In contrast, obtaining empirical P-values based on 100,000 replicates took
2 hr per marker. ABSTRACT
METHODS
RESULTS
>DISCUSSION
APPENDIX: MULTIPOINT HBD...
ACKNOWLEDGEMENTS
LITERATURE CITED
A direct application of the method is linkage mapping for qualitative traits. In this case, one computes the
statistic for affected individuals and determines significance from a simulated distribution. There are a number of challenges inherent in accomplishing this, using pedigrees such as the one described here. A major difficulty is determining the distribution of the statistic
. Because studies involving large pedigrees often are restricted to a single, or a few, pedigrees rather than many independent ones, the usual central limit theorem argument may not lead to an accurate approximation of the null distribution. Without a theoretical basis for the nature of the distribution, we must simulate to determine its characteristics under the null. Because an empirical distribution for a marker depends on the characteristics of that particular marker (e.g., founder allele frequencies, rates of missing data, etc.), one would, in principle, need to determine the null distribution for each marker in the study. This approach is problematic not only when there are many markers, but also for any single marker if the founder allele frequencies are not known. Here, I propose an alternative approach that may be used even when the number of markers is large. The distribution for a marker with perfect information is determined through simulation and scaled to have a mean of zero and a variance of one. The score
at a marker is then shifted—to remove bias—and scaled to also have a variance of one, on the basis of a limited number of simulations. If the shape of the perfect marker distribution matches that of the real marker, then this provides a computationally efficient means of obtaining a reference distribution for
and an approximate P-value. Although the approximate P-values obtained in this manner are fairly accurate over much of the 0–1 interval, the P-values tend to become increasingly conservative as they approach zero. Nevertheless, being able to obtain rapid estimates of the P-value allows one to select those "best" markers for which to obtain a more accurate, empirical P-value from a large number of simulations. Note that when allele frequencies are estimated from the data, additional error beyond the Monte Carlo uncertainty of simulations is introduced due to the bootstrap nature of simulating data conditional on the estimated allele frequencies. If at all possible, then, it is best to obtain population allele frequencies that are independent of the data. A further complication of the need to do simulations to obtain a P-value is the difficulty of including genotyping error. Although it is straightforward to include genotyping error in the computation for IBD, when estimating significance, genotyping error would also have to be included in the simulations. An accurate P-value, then, necessitates simulating from a genotyping error model that is representative of the actual errors in the data. Because determination of this model may be difficult, at present the best strategy is to take any necessary precautions to ensure that errors in the genotype data are minimized.
Although the current method obtains P-values that are smaller than those done in an earlier analysis with different methods, these results do not directly address the question of the relative power of the two approaches. The merits of the proposed method, however, can be evaluated in light of the findings of the Genetic Analysis Workshop 12, where the asthma data set described here was analyzed by a number of different research groups. As summarized by CHAPMAN and WIJSMAN (2001, pp. S222–S223), none of the groups were able to perform linkage analysis on the entire, intact pedigree, including those who used Markov chain Monte Carlo or regression-based approaches that are normally capable of handling large genealogies. They also suggest that two lessons could be learned from the results as a whole: "First, the results were sensitive to the method used to simplify the pedigree." They conclude that "minimal simplification of the pedigree is desirable. In general, stronger linkage signals came from data sets that used larger sub-pedigrees." Second, they conclude that "the method used to simplify the pedigree may be more important than the exact method of analysis used." The method described here is unique and advantageous in that it is capable of performing linkage mapping without any pedigree simplification. This should not only maximize the power but also avoid the difficulties of both creating appropriate subpedigrees and interpreting the results when multiple pedigree splittings have been analyzed. A power comparison with other methods involving possible pedigree simplification strategies (FALCHI et al. 2004; BROCKLEBANK et al. 2007; LIU et al. 2008) and genetic models would, nevertheless, be informative and will be pursued in future work.
In addition to linkage mapping for a qualitative trait, the proposed IBD estimation method is directly applicable to quantitative trait linkage mapping based on variance components (AMOS 1994; ALMASY and BLANGERO 1998). There, the covariance due to a quantitative trait locus (QTL) is modeled by the IBD sharing between all pairs of individuals in the pedigree. The method described here would allow rapid determination of the QTL covariance matrix for pedigrees that have been too large to consider as a whole. The variance component mapping strategy is, in many ways, considerably more straightforward than the Spairs linkage mapping approach. Because the hypothesis test is not based explicitly on the degree of sharing estimated, the problems associated with determining the proper distribution of the statistic (e.g., bias) are avoided. Although bias will still exist in the IBD estimates, if allele frequencies are estimated, the amount of bias for any particular pair will generally be small. In contrast, the bias when computing
can be large because it is the sum over very many pairs, each of which has a small amount of bias. Furthermore, an empirical distribution for the sharing statistic need not be constructed to determine statistical significance. The consequence is that it is not necessary to do the simulations at each marker to determine the bias and variance of the statistic.
The utility of a method that addresses a computationally difficult problem depends critically on the time it takes to solve the problem, particularly when many analyses and simulations are necessary. The above methods have been implemented in freely available software coded in C (available at http://www.genes.uchicago.edu/abney.html) that can analyze the data presented here with, as yet, unprecedented speed. The analyses accomplished here, using both simulated and real data, show that it is possible to analyze large data sets (
700 genotyped individuals,
150 affecteds) in a large pedigree (
3000 individuals) on a timescale of hours for thousands of markers. As far as I am aware, there are no other available methods that can perform linkage mapping with this quantity of data on such a large, unbroken pedigree in any reasonable length of time. With the methods and software tools introduced here researchers with large, complex pedigrees will be able to leverage their genetic data to a degree that was not possible before.
ABSTRACT
METHODS
RESULTS
DISCUSSION
>APPENDIX: MULTIPOINT HBD...
ACKNOWLEDGEMENTS
LITERATURE CITED
An HMM is normally defined in the following way (BAUM 1972; RABINER 1989). Let
![]() |
![]() |
![]() |
![]() |
![]() | (A1) |
![]() |
ABSTRACT
METHODS
RESULTS
DISCUSSION
APPENDIX: MULTIPOINT HBD...
>ACKNOWLEDGEMENTS
LITERATURE CITED
ABSTRACT
METHODS
RESULTS
DISCUSSION
APPENDIX: MULTIPOINT HBD...
ACKNOWLEDGEMENTS
>LITERATURE CITED
ABECASIS, G. R., S. S. CHERNY, W. O. COOKSON and L. R. CARDON, 2002 Merlin–rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet. 30: 97–101.[CrossRef][Medline]
ABNEY, M., M. S. MCPEEK and C. OBER, 2000 Estimation of variance components of quantitative traits in inbred populations. Am. J. Hum. Genet. 66: 629–650.[CrossRef][Medline]
ABNEY, M., C. OBER and M. S. MCPEEK, 2002 Quantitative-trait homozygosity and association mapping and empirical genomewide significance in large, complex pedigrees: fasting serum-insulin level in the Hutterites. Am. J. Hum. Genet. 70: 920–934.[CrossRef][Medline]
ALBERS, C. A., T. HESKES and H. J. KAPPEN, 2007 Haplotype inference in general pedigrees using the cluster variation method. Genetics 177: 1101–1116.
ALMASY, L., and J. BLANGERO, 1998 Multipoint quantitative-trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 62: 1198–1211.[CrossRef][Medline]
AMOS, C. I., 1994 Robust variance-components approach for assessing genetic linkage in pedigrees. Am. J. Hum. Genet. 54: 535–543.[Medline]
BARUCH, E., J. I. WELLER, M. COHEN-ZINDER, M. RON and E. SEROUSSI, 2006 Efficient inference of haplotypes from genotypes on a large animal pedigree. Genetics 172: 1757–1765.
BAUM, L. E., 1972 An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3: 1–8.
BOURGAIN, C., and E. GENIN, 2005 Complex trait mapping in isolated populations: Are specific statistical methods required? Eur. J. Hum. Genet. 13: 698–706.[CrossRef][Medline]
BROCKLEBANK, D., J. GAYAN and L. R. CARDON, 2007 Novel combinatorial optimisation methods to partition large pedigrees for genetic analysis. The American Society of Human Genetics Annual Meeting, October 25, 2007, San Diego.
CHAPMAN, N. H., and E. M. WIJSMAN, 2001 Introduction: linkage analyses in the Hutterites. Genet. Epidemiol. 21(Suppl. 1): S222–S223.[Medline]
DAVIS, S., M. SCHROEDER, L. R. GOLDIN and D. E. WEEKS, 1996 Nonparametric simulation-based statistics for detecting linkage in general pedigrees. Am. J. Hum. Genet. 58: 867–880.[Medline]
ELSTON, R. C., and J. STEWART, 1971 A general model for the genetic analysis of pedigree data. Hum. Hered. 21: 523–542.[CrossRef][Medline]
ESCAMILLA, M. A., 2001 Population isolates: their special value for locating genes for bipolar disorder. Bipolar Disord. 3: 299–317.[CrossRef][Medline]
FALCHI, M., P. FORABOSCO, E. MOCCI, C. C. BORLINO, A. PICCIAU et al., 2004 A genomewide search using an original pairwise sampling approach for large genealogies identifies a new locus for total and low-density lipoprotein cholesterol in two genetically differentiated isolates of sardinia. Am. J. Hum. Genet. 75: 1015–1031.[CrossRef][Medline]
FISHELSON, M., and D. GEIGER, 2002 Exact genetic linkage computations for general pedigrees. Bioinformatics 18(Suppl. 1): S189–S198.[Abstract]
FULKER, D. W., S. S. CHERNY and L. R. CARDON, 1995 Multipoint interval mapping of quantitative trait loci, using sib pairs. Am. J. Hum. Genet. 56: 1224–1233.[Medline]
HEATH, S. C., 1997 Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. Am. J. Hum. Genet. 61: 748–760.[Medline]
HOSTETLER, J. A., 1974 Hutterite Society. Johns Hopkins University Press, Baltimore.
JACQUARD, A., 1974 The Genetic Structure of Populations. Springer-Verlag, New York.
KONG, A., and N. J. COX, 1997 Allele-sharing models: Lod scores and accurate linkage tests. Am. J. Hum. Genet. 61: 1179–1188.[CrossRef][Medline]
KRUGLYAK, L., M. J. DALY, M. P. REEVE-DALY and E. S. LANDER, 1996 Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58: 1347–1363.[Medline]
LANDER, E. S., and P. GREEN, 1987 Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84: 2363–2367.
LIU, F., A. KIRICHENKO, T. AXENOVICH, C. VAN DUIJN and Y. AULCHENKO, 2008 Anapproach for cutting large and complex pedigrees for linkage analysis. Eur. J. Hum. Genet. 16: 854–860.[CrossRef][Medline]
MANGE, A. P., 1964 Growth and inbreeding of a human isolate. Hum. Biol. 36: 104–133.[Medline]
MCPEEK, M. S., X. WU and C. OBER, 2004 Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics 60: 359–367.[CrossRef][Medline]
OBER, C., A. TSALENKO, R. PARRY and N. J. COX, 2000 A second-generation genomewide screen for asthma-susceptibility alleles in a founder population. Am. J. Hum. Genet. 67: 1154–1162.[Medline]
O'CONNELL, J. R., 2000 Zero-recombinant haplotyping: applications to fine mapping using snps. Genet. Epidemiol. 19(Suppl. 1): S64–S70.
PELTONEN, L., A. PALOTIE and K. LANGE, 2000 Use of population isolates for mapping complex traits. Nat. Rev. Genet. 1: 182–190.[CrossRef][Medline]
RABINER, L. R., 1989 A tutorial on hidden Markov-models and selected applications in speech recognition. Proc. IEEE 77: 257–286.[CrossRef]
SERVICE, S., J. DEYOUNG, M. KARAYIORGOU, J. L. ROOS, H. PRETORIOUS et al., 2006 Magnitude and distribution of linkage disequilibrium in population isolates and implications for genome-wide association studies. Nat. Genet. 38: 556–560.[CrossRef][Medline]
SHIFMAN, S., and A. DARVASI, 2001 The value of isolated populations. Nat. Genet. 28: 309–310.[CrossRef][Medline]
SOBEL, E., and K. LANGE, 1996 Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. Am. J. Hum. Genet. 58: 1323–1337.[Medline]
SUTTER, N. B., and E. A. OSTRANDER, 2004 Dog star rising: the canine genetic system. Nat. Rev. Genet. 5: 900–910.[CrossRef][Medline]
THALLMAN, R. M., G. L. BENNETT, J. W. KEELE and S. M. KAPPES, 2001 Efficient computation of genotype probabilities for loci with many alleles: Ii. Iterative method for large, complex pedigrees. J. Anim. Sci. 79: 34–44.
THOMPSON, E. A., S. LIN, A. B. OLSHEN and E. M. WIJSMAN, 1993 Monte Carlo analysis on a large pedigree. Genet. Epidemiol. 10: 677–682.[CrossRef][Medline]
WANG, T., R. FERNANDO, S. VANDERBEEK, M. GROSSMAN and J. VANARENDONK, 1995 Covariance between relatives for a marked quantitative trait locus. Genet. Sel. Evol. 27: 251–274.[CrossRef]
WINDIG, J., and T. MEUWISSEN, 2004 Rapid haplotype reconstruction in pedigrees with dense marker maps. J. Anim. Breed. Genet. 121: 26–39.[CrossRef]
WRIGHT, A. F., A. D. CAROTHERS and M. PIRASTU, 1999 Population choice in mapping genes for complex diseases. Nat. Genet. 23: 397–404.[CrossRef][Medline]
ZHANG, K., F. SUN and H. ZHAO, 2005 Haplore: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics 21: 90–103.
Communicating editor: R. W. DOERGE
Related articles in Genetics:
ISSUE HIGHLIGHTS
Genetics 2008 179: NP.
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
-
All Versions of this Article:
genetics.108.089912v1
179/3/1577 most recent - Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Related articles in Genetics
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Abney, M.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Abney, M.


































