- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Real Haplotype Data
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Orzack, S. H.
- Articles by Stanton, V. P.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Orzack, S. H.
- Articles by Stanton, V. P., Jr.
Analysis and Exploration of the Use of Rule-Based Algorithms and Consensus Methods for the Inferral of Haplotypes
Steven Hecht Orzacka, Daniel Gusfieldb, Jeffrey Olsonc, Steven Nesbittc, Lakshman Subrahmanyanc, and Vincent P. Stanton, Jr.ca Fresh Pond Research Institute, Cambridge, Massachusetts 02140,
b Department of Computer Science, University of California, Davis, California 95616
c Variagenics, Cambridge, Massachusetts 02139
Corresponding author: Steven Hecht Orzack, 173 Harvey St., Cambridge, MA 02140., orzack{at}freshpond.org (E-mail)
Communicating editor: M. UYENOYAMA
| ABSTRACT |
|---|
The difficulty of experimental determination of haplotypes from phase-unknown genotypes has stimulated the development of nonexperimental inferral methods. One well-known approach for a group of unrelated individuals involves using the trivially deducible haplotypes (those found in individuals with zero or one heterozygous sites) and a set of rules to infer the haplotypes underlying ambiguous genotypes (those with two or more heterozygous sites). Neither the manner in which this "rule-based" approach should be implemented nor the accuracy of this approach has been adequately assessed. We implemented eight variations of this approach that differed in how a reference list of haplotypes was derived and in the rules for the analysis of ambiguous genotypes. We assessed the accuracy of these variations by comparing predicted and experimentally determined haplotypes involving nine polymorphic sites in the human apolipoprotein E (APOE) locus. The eight variations resulted in substantial differences in the average number of correctly inferred haplotype pairs. More than one set of inferred haplotype pairs was found for each of the variations we analyzed, implying that the rule-based approach is not sufficient by itself for haplotype inferral, despite its appealing simplicity. Accordingly, we explored consensus methods in which multiple inferrals for a given ambiguous genotype are combined to generate a single inferral; we show that the set of these "consensus" inferrals for all ambiguous genotypes is more accurate than the typical single set of inferrals chosen at random. We also use a consensus prediction to divide ambiguous genotypes into those whose algorithmic inferral is certain or almost certain and those whose less certain inferral makes molecular inferral preferable.
THEORETICAL and practical considerations suggest that a more complete causal understanding of many complex traits, such as human diseases, will often be gained by haplotype analysis, since such traits may often be the partial result of many genetic determinants (![]()
![]()
![]()
![]()
Since molecular inferral of haplotypes is difficult (e.g., see ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Here, we analyze the accuracy of the rule-based approach when applied to a sample of genotypes with known haplotypes at the human apolipoprotein E (APOE) locus (see ![]()
![]()
![]()
| MATERIALS AND METHODS |
|---|
Subjects studied:
Genomic DNA from each of 31 cell lines from the Coriell Cell Respository was sequenced to identify polymorphisms in the APOE locus. Using the data on race and ethnicity at the Coriell web site, we classified 8 cell lines as derived from Asian individuals, 8 from Blacks, and 15 from Caucasians. Some of the cell lines were from individuals with rare inborn errors of metabolism, although none suffered from a condition known to be associated with their APOE genotype. Of the 80 individuals whose haplotypes were phased, we classified 18 as Asians, 19 as Blacks, and 43 as Caucasians. All individuals were unrelated.
DNA sequencing and genotyping:
Polymorphisms were discovered in the 31 cell lines by DNA sequencing using an ABI 3700 DNA sequencer and Polyphred software for identification of candidate polymorphic sites, substantially as described by ![]()
![]()
![]()
Experimental haplotype determination:
Haplotypes were determined first by using the genotype data to identify the 5' flanking heterozygous site for each phase-unknown individual. (Sequence orientation is determined relative to the APOE locus.) Second, each of the two alleles of such individuals was amplified in a PCR reaction primed with an allele-specific primer at the 5' flanking heterozygous site and a constant (non-allele-specific) primer located just 3' of exon 4. Third, the haplotypes of the two allele-specific amplicons from each individual were determined by genotyping all internal heterozygous sites. A pair of allele-specific primers was designed and tested for all sites except the most 3' site, 21388, which is next to the constant primer and therefore cannot circumtend two heterozygous sites. Primer design and PCR conditions were optimized for maximal allele selectivity at each site using known pairs of haplotypes. The sequence of both alleles and the genotyping results provided independent sources of information for all haplotype inferrals. Primer sequences are available upon request.
The rule-based approach ![]()
We first define some terms. Any inferral method "resolves" an ambiguous genotype when it chooses a "resolution," that is, a pair of underlying haplotypes for an ambiguous genotype. A genotype left unresolved is an "orphan."
![]()
![]()
For each known haplotype, we then look at all the remaining unresolved [genotypes] and ask whether the known haplotype can be made from some combination of the ambiguous sites. Each time such a haplotype is found, we immediately recover the complement of the haplotype as another potential haplotype. This chain of inference continues until all haplotypes have been recovered, or until no more haplotypes can be found.
Clark also proposed that this procedure be run more than once with the data being randomly reordered at the start each time so as to determine whether distinct resolutions would be generated for any given ambiguous genotype. Following Clark, we refer to the ensemble of pairs of inferred haplotypes associated with a given reordering as a "solution" for the genotypic data. On the basis of such iterated calculations, Clark (pp. 117118) went on to describe an important claim: "... the solution with the fewest orphans is the valid solution and suggests that when a solution resolves all haplotypes it is likely to be unique." We interpret this claim to be the prediction that the set of inferred haplotypes that resolves the most individuals is the most accurate solution; that is, a higher percentage of its inferrals is correct than for any other solution. This is an important claim because it implies that the single, most accurate solution can be identified using this approach. However, despite this suggestion by Clark to generate and compare multiple solutions, it appears that no subsequent analysis is so structured.
The rule-based approacha general overview:
We define the list of "real" haplotypes as those obtained by serially inspecting the ordered set of input genotypes and adding the haplotypes of unambiguous genotypes to the end of a list. How duplicates are handled is discussed below.
During haplotype inferral, haplotypes derived from ambiguous genotypes might be added to the list of real haplotypes; this larger list is the "reference list" of haplotypes. Each iteration of a given variation began with a random reordering of the list of genotypes and the creation of the list of real haplotypes. Haplotype inferral proceeded in either of two ways. In the first, proceeding downward from the genotype at the top of the list, we scanned down the reference list of haplotypes to find one that could resolve the genotype in question. In the second, proceeding downward from the haplotype at the top of the reference list, we scanned down the list of ambiguous genotypes to find one that could be resolved by the haplotype in question. In either case, if a resolution was found, we then searched for additional resolutions. We either chose the first haplotype found as the basis for the resolution of the genotype in question or chose among all of the haplotypes that could resolve it (see below). Finally, we decided whether the chosen haplotype and/or its complement haplotype was added to the reference list.
Thus, each iteration of our implementation of the rule-based approach consisted of either a haplotype selection loop, which in turn contained a genotype selection loop, or a genotype selection loop, which in turn contained a haplotype selection loop. In a given iteration, one of these pairs of loops was executed until no further genotypes could be resolved. We found no systematic differences between the results stemming from the two methods; we present results based upon the second.
We developed eight variations of the rule-based approach that differ in the way that the initial list of real haplotypes and the reference haplotype list are formulated and in how the list of genotypes is analyzed. The development of these variations was motivated by questions that a user of the rule-based approach must answer before using it:
- Does the reference list include just real haplotypes or inferred ones as well?
- Are duplicate haplotypes removed or retained in the reference list at the beginning of each iteration? If they are removed, the initial list is like a phone book, with a unique "name" for each haplotype. If they are retained, they are represented as distinct copies.
- Are duplicate ambiguous genotypes "consolidated" at the beginning of each iteration or left separate? If they are consolidated, identical genotypes will be resolved identically. If they are left separate, identical genotypes may be resolved differently, except when the reference list is not randomized (see below).
- Before trying to resolve an ambiguous genotype, it is important to ask is the current reference list of haplotypes randomized or not? Since the list is scanned from the top, reordering or not may determine whether a particular haplotype and its complement are used in the resolution.
- If multiple haplotypes can resolve a given ambiguous genotype, is the haplotype chosen as the "resolving" haplotype the first one found, the one with the currently highest frequency, or one chosen randomly?
- After a genotype is resolved by a given haplotype, how is the reference list of haplotypes updated? We used two alternatives. In the first, the resolving haplotype's complement haplotype is added to the list only if it is new, thereby maintaining the list as a phone book of names. In the second, a resolving haplotype and its complement are added to the reference list, thereby allowing any subsequent choice among competing resolving haplotypes for an ambiguous genotype to be influenced by their population frequencies.
In all of our variations, we assumed that inferred haplotypes are included on the reference list. We focused on how the answers to questions 2, 3, 4, 5, and 6 affect inferral accuracy. None of these questions has one right answer; for example, the decision as to whether identical genotypes should be resolved with the same haplotypes comes down to a judgment as to the prevalence of homoplasy.
The choices made for each variation in regard to the treatment of duplicate haplotypes, haplotype list randomization, and consolidation of ambiguous genotypes are shown in Table 1. These choices have consequences in regard to the choice of real vs. inferred haplotypes, the choice of more frequent haplotypes, and the resolution of identical ambiguous genotypes. For example, variation 1 is perhaps the simplest inferral algorithm: duplicate haplotypes were retained on the reference list, which therefore included an entry for each haplotype derived from unambiguous genotypes and resolved ambiguous genotypes. The order of the haplotypes on the reference list was randomized before each iteration. Duplicate ambiguous genotypes were not consolidated. The resolving haplotype was selected randomly from among all possible resolving haplotypes. These choices generated a "population" frequency preference: a more common resolving haplotype was likely chosen more often than a less common resolving haplotype, although identical genotypes might not be resolved identically.
|
Comments on the other variations are:
- Variation 2: the same as variation 1 except that duplicate ambiguous genotypes were consolidated, which caused them to be resolved identically by the first resolving haplotype found.
- Variation 2a: removal of duplicate real haplotypes from the reference list at the beginning of each iteration generated a "weak" frequency preference. Its magnitude was typically less than that of the population frequency preference because the reference list had no initial frequency differences (any duplicate inferred haplotypes were retained).
- Variation 2b: lack of randomization of the haplotype reference list caused real haplotypes to have a preference as resolving haplotypes since they are nearer the top of the list than are inferred ones. Identical ambiguous genotypes were resolved identically by the first haplotype that could do so because haplotype order was fixed.
- Variation 2c: identical ambiguous genotypes were resolved identically by the first haplotype that could do so because haplotype order was never changed. Consequently, there was a "no" frequency preference.
- Variation 3: there was a "strong" frequency preference in that any given ambiguous genotype was solved by the currently most frequent haplotype that could do so. This is the only variation in which frequency preference was an explicit choice.
- Variation 4a: the complement of a solution haplotype was added to the reference list only if it was new. Identical genotypes were solved identically because of consolidation.
- Variation 4b: the same as variation 4a except that the order of the haplotypes in the reference list was not randomized. There was a preference for real over inferred haplotypes (since they were nearer the top of the reference list). These choices generated a "no" frequency preference. This variant appears to be the approach described by
CLARK 1990 although he did not provide an explicit algorithm.
One can devise additional rule-based algorithms. For example, one might assess the real or inferred status of a complement haplotype when deciding among resolving haplotypes. However, our present choices allowed us to explore the substantial biological heterogeneity among the eight variations of the rule-based approach to haplotype inferral. So, for example, the variations differed in having no frequency preference (variations 2c, 4a, and 4b), a weak frequency preference (2a), a population frequency preference (variations 1, 2, and 2b), or a strong frequency preference (variation 3). In addition, within either the first or the third group, the variations differed substantially in how a given frequency preference was manifested. For example, variation 1 had a population preference with no priority for real-over-inferred haplotypes. In contrast, variation 2b had a population preference that favored real haplotypes. We ran at least two independent sample paths of 10,000 iterations for each variation.
| RESULTS |
|---|
Identification of polymorphisms in the apolipoprotein E locus:
A 415-bp segment of the APOE locus extending from 570 bp upstream of the transcription start site to the end of exon 4 (nucleotides 1780021958 of GenBank accession no.
AF050154) was sequenced and polymorphisms were identified at nucleotides 17874, 17937, 18145, 18476, 19311, 19753, 20334, 21250, 21349, and 21388. These polymorphisms have been described previously (![]()
![]()
![]()
|
Experimental identification of haplotypes:
We were unable to develop reliable allele-specific PCR reactions for one of the 10 polymorphic sites; this site (19753) was omitted from subsequent analyses. As shown in Table 2, a total of 17 haplotypes involving 9 sites were experimentally identified; these range in frequency from 0.62% (1 of 160) to 33.1% (53 of 160). Eleven of the haplotypes can be inferred from unambiguous genotypes.
Of the 80 individuals, 33 have unambiguous pairs of haplotypes and 47 have ambiguous pairs. There are 17 genotypes with two polymorphic sites, 20 genotypes with three such sites, 6 genotypes with four such sites, and 4 genotypes with five such sites. Haplotype pairs for all genotypes are available at http://www.genetics.org.
Genotypic proportions at all nine sites do not deviate significantly (
= 0.05) from Hardy-Weinberg proportions: 17874 (
2 = 1.769, P = 0.183), 17937 (3.775, 0.052), 18145 (0.250, 0.617), 18476 (0.290, 0.590), 19311 (0.356, 0.551), 20334 (0.003, 0.955), 21250 (0.868, 0.352), 21349 (0.029, 0.864), and 21388 (0.786, 0.375).
Computational results:
Our first goal was to compare the performances of the variations. The second goal was to determine whether the important inferential principle enunciated by Clark was true, that is, whether the solution with the fewest orphans is the most accurate solution. Finally, we wished to determine whether multiple complete solutions exist and, if so, to understand how they could be of use.
The frequency distribution of accuracy for a 10,000-iteration sample path of variation 1 is shown in Fig 1. The first important point is that there are "different" solutions, that is, those that differ in the resolution of at least one ambiguous genotype. In fact, each of the 10,000 iterations resulted in a different complete solution of the 47 ambiguous genotypes. None of these solutions had an orphan. It is clear for this locus, then, that the principle enunciated by Clark concerning the accuracy of the solution with the fewest orphans cannot apply since there is no one solution with the fewest orphans. Analyses of two other sets of phase-known genotypes and of simulated genotypic data also resulted in multiple solutions with no orphans (S. ORZACK and D. GUSFIELD, unpublished results). This underscores the potential for mistaken inferrals particularly when an investigator runs a method only once on a given data set. We expect the occurrence of multiple complete solutions to be common when using the rule-based approach.
|
Multiple distinct solutions were manifested in different ways, as shown in Table 3. In one variation there are only two solutions, in one there are <500, in some there are
2000, and several had 10,000 or close to it. Thus, while the algorithmic differences among the variations at first may seem minor, they led to substantially different outcomes.
|
The normal-like distribution of accuracy shown in Fig 1 is not typical, as can be seen in Fig 2, which contains the distribution of accuracy for variation 2, and in Fig 3, which contains the distribution of accuracy for variation 3; here, there are only two solutions. This last result reflects the never-decreasing influence of frequency on the process of inferral; this restricts resolutions to a small part of the solution space.
|
|
The average accuracy score and its 95% confidence interval for each of the variations are shown in Table 4. It is clear that the variations differed substantially in their predictive accuracy, with variations 2a, 2c, 4a, and 4b performing poorly (3640% accuracy) and variations 1, 2, 2b, and 3 performing better (6268% accuracy).
|
Variations 2c, 4a, and 4b, with no frequency preference, performed poorly. Average accuracy tends to increase with increasing reliance on frequency, although most of this increase is achieved by a population frequency preference as a strong frequency preference added little to average accuracy.
Dealing with multiple solutions:
For this data set, all of the rule-based variations generate multiple solutions of quite variable quality. Without an understanding of what sense, if any, can be made of multiple solutions, it is difficult to understand what insights any rule-based algorithm could provide an investigator. However, almost all of the variations generate some solutions in which >80% of the ambiguous genotypes are correctly inferred. If one could identify such solutions, it would be extremely advantageous.
Consensus methods:
These considerations led us to develop consensus methods that are based on the multiple solutions that are present in any given simulation. We evaluated several ways to generate a consensus prediction.
The first way was to tally the different inferrals for any given genotype across all 10,000 iterations of the genotypic data. The inferral appearing in the highest number of iterations was taken to be the "full" consensus prediction for the genotype.
Our knowledge of the real haplotype pairs for each genotype allowed us to determine whether particular types of solutions are more likely to have a higher number of correct inferrals. To generate a second kind of consensus prediction we used information on the number of haplotypes needed to resolve all of the ambiguous genotypes or to resolve all of the genotypes.
A plot of the relationship between the number of different haplotypes used in a solution and the average number of correct inferrals is shown in Fig 4. There is a strong negative relationship between the two, implying that smaller lists perform better than larger lists (Spearman rank correlation corrected for ties, ambiguous genotypes, -0.996, 13 d.f., P < 0.001; all genotypes, -1.0, 13 d.f., P < 0.001). This result occurred for all of our variations for this data set and in the analysis of two other phase-known data sets (S. ORZACK and D. GUSFIELD, unpublished results). We surmise that the underlying reason for the negative relationship between list size and accuracy is that typical mutation and recombination rates do not substantially increase the expected number of haplotypes for genotypes with k ambiguous sites above k + 1, the number possible with no recombination and no recurrent mutation. This implies that a population will tend to contain few haplotypes (relative to k) and therefore that solutions generated by fewer haplotypes will be more accurate (S. ORZACK, D. GUSFIELD and C. WIUF, unpublished results).
|
We believe haplotype number will prove generally useful as the basis for making more accurate consensus predictions. Accordingly, the second way we generated a consensus prediction for each genotype was to use either just the solutions that were based on the fewest number of haplotypes ("minimum") or those solutions plus those that were based on one more than the fewest number of haplotypes ("minimum + 1"). As for full consensus, the most common inferral was taken to be the consensus prediction for a genotype.
The results shown in Fig 4 are relevant to claims that the rule-based approach satisfies a parsimony criterion in regard to the number of haplotypes used. For example, ![]()
![]()
![]()
![]()
The results of consensus:
In Table 5, we show the results of the full consensus and the minimum + 1 consensus calculations for our data. The results of consensus calculations are very consistent across sample paths of a given variation and we present results from only a single sample path.
|
A consensus solution can be used in two ways. First, such a solution provides a single prediction for any genotype; without such a synthetic prediction the investigator is left to pick a solution at random from among multiple complete solutions that are often of highly variable quality (see Fig 1). The results in Table 6 indicate that the average accuracy of complete solutions was always less than the accuracy of the consensus solution (although their difference can be small or substantial). This difference underscores the advantage of using a consensus prediction, even apart from its necessity when dealing with multiple solutions.
|
The distinction between the average number of correct inferrals and the consensus number of correct inferrals is reflected in the fact that the correlation between the two was negative for all five consensus methods (Spearman rank correlation corrected for ties: full, -0.195; minimum, -0.708; minimum + 1, -0.878; minimum (all), -0.192; minimum + 1 (all), -0.169) and the average correlation is significantly negative (-0.509, 95% confidence interval: -0.747 to -0.158). These negative relationships indicate that the variations differed in the extent to which they explore the solution space. For example, variation 1 achieved a higher average percentage of correct inferrals (
62%) as compared to, say, variation 4b (
37%), but variation 1 had a poorer minimum consensus accuracy (36) as compared to variation 4b (40). This reveals that any given iteration of variation 4b tended to correctly predict only a subset of the ambiguous genotypes but not the same subset as another iteration; yet the ensemble prediction of such subsets had good accuracy.
The second way in which a consensus prediction can be used involves the assessment of the consensus values themselves so as to distinguish between more and less reliable inferrals. As shown in Table 5, as the consensus threshold or number of identical inferrals increases, there is a threshold value for which all inferrals are correct. One can use this approach to divide the data into inferrals that are certain or nearly certain as opposed to those that are less certain and for which experimental inferral is mandated. For example, a full consensus threshold of 8000 (80%) would reduce the number of required experimental inferrals by
50% (22/47). Similarly, a minimum + 1 consensus threshold of 80% would reduce the required experimental inferrals by
>50% (25/47). This is a very substantial saving of experimental labor. We note that such consensus calculations did not perform well for variation 3 in that even high consensus thresholds are associated with a substantial proportion of incorrect inferrals (not shown). The strong frequency preference appears to amplify the "signal" of both incorrect and correct inferrals.
| DISCUSSION |
|---|
We believe that consensus calculations hold great promise for allowing the investigator to ultimately achieve correct inferrals for all of the genotypes contained in a sample. We are guided here by a belief that some measure of the reliability for any given inferral is an essential element of any inferral method (see also ![]()
![]()
We are developing methods to refine and extend these consensus results. Consensus calculations for another more complex locus do not perform as well in that even some high consensus scores for variation 1 are associated with some incorrect inferrals, although partitioning the locus so that separate inferrals occur for regions with different recombination rates improves consensus performance (S. ORZACK and D. GUSFIELD, unpublished results). We hope to develop metrics so that an estimate of linkage disequilibrium can be used to select a consensus threshold that will be reliable at providing correct inferrals with certainty or near certainty. Our motivation is the belief that inferral methods should not be viewed as statistical machines competing against one another. Instead, they compete against experimental inferral, which can be viewed in the present context as being error free (or at least as having a much smaller error rate than that of algorithmic inferral). To this extent, the role of algorithmic inferral should be to guide the expenditure of experimental effort to achieve a correct solution.
The nature of mistaken inferrals and a guide to the investigator:
The variations of the rule-based approach described here differ substantially in their algorithmic structure and population-genetic assumptions. It is essential to understand the basis for their differences in performance so that the potential user can compare the inferral capabilities of particular variations.
Table 7 contains a breakdown of the inferrals for a single sample path for each of the eight variations. Consider the results concerning full consensus calculations. The distribution of correct and incorrect inferrals does not differ across the variations for either two-ambiguous-site genotypes (uncorrected
2 = 10.46, 7 d.f., P > 0.05) or three-ambiguous-site genotypes (uncorrected
2 = 4.12, 7 d.f., P > 0.05). These results suggest that consensus inferral of haplotypes with two and three ambiguous sites can be performed with any of the variations. However, the variations differ significantly (
= 0.05) in regard to their inferral success for more complex genotypes as indicated by the significant
2 values for four- and five-ambiguous-site genotypes (see Table 7). Inspection of the results indicates that variations 2a and 4a perform best in regard to the correct prediction of these more complex genotypes.
|
The breakdowns of correct inferrals for restricted consensus predictions are shown in Table 7. The
2 values for minimum consensus predictions indicate that the variations do not differ in their ability to correctly infer haplotypes involving up to five ambiguous sites. In contrast, the
2 values for minimum + 1 consensus predictions indicate that the variations do differ in their predictive ability for four- and five-site genotypes, with variations 2a and 4a again performing best.
The results shown in Table 6 and Table 7 also indicate that restricted consensus calculations based on the number of haplotypes used to solve only ambiguous genotypes result in higher average numbers of correct inferrals than do restricted consensus calculations based on the number needed to solve all genotypes.
The rule-based approach and the approach of ![]()
Both the consensus approach described here and the inferral method of ![]()
These two methods differ in important ways. The rule-based approach is simpler to understand, to program oneself, and to modify. In addition, it is quicker since multiple iterations and associated consensus predictions can be obtained in a few minutes using typical microprocessors currently available. The approach of ![]()
![]()
![]()
We stress that we regard Stephens et al.'s approach to be an important inferral tool. In the present analysis, the performance of their program matches the best consensus performance of the rule-based approach (42 correctly inferred genotypes out of 47 possible); these best solutions differ (only 2 of the 5 incorrectly inferred genotypes are the same).
In our opinion, the performance of these methods is surprisingly good. After all, the vagaries of natural selection, genetic drift, and sampling biases, and the absence of assumptions about the genetic details of the APOE locus all combine to make haplotype inferral a formidably complex problem in an a priori sense. However, this performance does not mean that these methods can now be applied to most loci with the confidence that they will work well. Such a conclusion must await additional studies like the present one in which real phase-known genotypic data are analyzed.
Tests of haplotype inferral accuracy:
Our results and those of ![]()
![]()
![]()
![]()
![]()
We have analyzed the ![]()
|
|
Finally, we note that ![]()
The general utility of consensus calculations:
Consider the application of the consensus approach to the inferrals generated by the program of ![]()
|
Consensus methods can also be applied to the results generated by the method of ![]()
Any algorithmic inferral method has the potential to generate different solutions. This is as it should be, given that many genotypic configurations can support different haplotype configurations. Indeed, such a general consideration suggests that no inferral program be viewed as allowing "one-time-use" only. To this extent, techniques for reconciling competing solutions are needed. One approach is consensus; another is the use of goodness-of-fit statistics, as described by ![]()
The future development of rule-based algorithms and consensus calculations:
If one has genotypes with known haplotype pairs, one advantage of the rule-based approach is that one can assess, using an integer programming analysis, whether a given variation could result in perfect prediction (see ![]()
Such an analysis shows that such a perfect solution exists for variation 1. However, the best result we obtained for a single iteration was 40 correct resolutions out of a possible 47. Hence, even with a large number of iterations, this variation failed to achieve optimal performance. This failure motivates ongoing research on how to best implement particular variations to achieve optimal performance and on techniques that force these algorithms to do so (S. ORZACK and D. GUSFIELD, unpublished results).
| ACKNOWLEDGMENTS |
|---|
We thank Anne Ferentz, Alan Templeton, David Posada, Carsten Wiuf, and three reviewers for critical comments. This work was partially supported by National Science Foundation (NSF) awards SES-9906997, DBI-9723346, and EIA-0220154 and by Variagenics.
Manuscript received January 13, 2003; Accepted for publication June 20, 2003.
| LITERATURE CITED |
|---|
ARTIGA, M. J., M. J. BULLIDO, I. SASTRE, M. RECUERO, and M. A. GARCIA et al., 1998 Allelic polymorphisms in the transcriptional regulatory region of apolipoprotein E gene. FEBS Lett. 421:105-108.[Medline]
BULLIDO, M. J., M. J. ARTIGA, M. RECUERO, I. SASTRE, and M. A. GARCIA et al., 1998 A polymorphism in the regulatory region of APOE associated with risk for Alzheimer's dementia. Nat. Genet. 18:69-71.[Medline]
CHAKRAVARTI, A., 1999 Population geneticsmaking sense out of sequence. Nat. Genet. 21(Suppl.):56-60.[Medline]
CLARK, A. G., 1990 Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7:111-122.[Abstract]
DRACOPOLI, N. C., J. L. HAINES, B. R. KORF, C. M. MORTON, C. E. SEIDMAN et al., 2001 Current Protocols in Human Genetics. John Wiley & Sons, New York.
DRYSDALE, C. M., D. W. MCGRAW, C. B. STACK, J. C. STEPHENS, and R. S. JUDSON et al., 2000 Complex promotor and coding region ß2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc. Natl. Acad. Sci. USA 97:10483-10488.
EXCOFFIER, L. and M. SLATKIN, 1995 Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12:921-927.[Abstract]
FALLIN, D. and N. J. SCHORK, 2000 Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am. J. Hum. Genet. 67:947-959.[Medline]
FULLERTON, S. M., A. G. CLARK, K. M. WEISS, D. A. NICKERSON, and S. L. TAYLOR et al., 2000 Apolipoprotein E variation at the sequence haplotype level: implications for the origin and maintenance of a major human polymorphism. Am. J. Hum. Genet. 67:881-900.[Medline]
GUSFIELD, D., 2001 Inference of haplotypes from samples of diploid populations: complexity and algorithms. J. Comput. Biol. 8:305-323.[Medline]
HARTMAN, J. L., IV, B. GARVIK, and L. HARTWELL, 2001 Principles for the buffering of genetic variation. Science 291:1001-1004.
HAWLEY, M. E. and K. K. KIDD, 1995 HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Hered. 86:409-411.
JUDSON, R., J. C. STEPHENS, and A. WINDEMUTH, 2000 The predictive power of haplotypes in clinical response. Pharmacogenomics 1:15-26.[Medline]
LIN, S., D. J. CUTLER, M. E. ZWICK, and A. CHAKRAVARTI, 2002 Haplotype inference in random population samples. Am. J. Hum. Genet. 71:1129-1137.[Medline]
LONG, J. C., R. C. WILLIAMS, and M. URBANEK, 1995 An E-M algorithm and testing strategy for multiple-locus haplotypes. Am. J. Hum. Genet. 56:799-810.[Medline]
MICHALATOS-BELOIN, S., S. A. TISHKOFF, K. L. BENTLEY, K. K. KIDD, and G. RUANO, 1996 Molecular haplotyping of genetic markers 10 kb apart by allele-specific long- range PCR. Nucleic Acids Res. 24:4841-4843.
NICKERSON, D. A., V. O. TOBE, and S. L. TAYLOR, 1997 Polyphred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res. 25:2745-2751.
NICKERSON, D. A., S. L. TAYLOR, S. M. FULLERTON, K. M. WEISS, and A. G. CLARK et al., 2000 Sequence diversity and large-scale typing of SNPs in the human apolipoprotein E gene. Genome Res. 10:1532-1545.
NIU, T., Z. S. QIN, X. XU, and J. S. LIU, 2002 Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 70:157-169.[Medline]
PRITCHARD, J. K., 2001 Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 69:124-137.[Medline]
REICH, D. E., M. CARGILL, S. BOLK, J. IRELAND, and P. C. SABETI et al., 2001 Linkage disequilibrium in the human genome. Nature 411:199-204.[Medline]
SCIENCE CITATION INDEX, 2003 ISI Web of Knowledge, September 2003 (http://isiknowledge.com).
STEPHENS, M., N. J. SMITH, and P. DONNELLY, 2001a A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68:978-989.[Medline]
STEPHENS, M., N. J. SMITH, and P. DONNELLY, 2001b Reply to Zhang et al. Am. J. Hum. Genet. 69:912-914.
STEPHENS, M., N. J. SMITH and P. DONNELLY, 2002 Documentation for PHASE, version 1.0 (http://www.stat.washington.edu/stephens/phase.html).
TEMPLETON, A. R., C. F. SING, A. KESSLING, and S. HUMPHRIES, 1988 A cladistic analysis of phenotype associations with haplotypes inferred from restriction endonuclease mapping. II. The analysis of natural populations. Genetics 120:1145-1154.
XU, C., K. LEWIS, K. L. CANTONE, P. KHAN, and C. DONNELLY et al., 2002 Effectiveness of computational methods in haplotype prediction. Hum. Genet. 110:148-156.[Medline]
ZHU, X., C. A. MCKENZI, T. FORRESTER, D. A. NICKERSON, and U. BROECKEL et al., 2001 Localization of a small genomic region associated with elevated ACE. Am. J. Hum. Genet. 67:1144-1153.
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Real Haplotype Data
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Orzack, S. H.
- Articles by Stanton, V. P.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Orzack, S. H.
- Articles by Stanton, V. P., Jr.





