Efficient Inference of Haplotypes From Genotypes on a Large Animal Pedigree

We present a simple algorithm for reconstruction of haplotypes from a sample of multilocus genotypes. The algorithm is aimed specifically at the analysis of very large pedigrees for small chromosomal segments, where recombination frequency within the chromosomal segment can be assumed to be zero. The algorithm was tested both on simulated pedigrees of 155 individuals in a family structure of three generations and on real data of 1149 animals from the Israeli Holstein dairy cattle population, including 406 bulls with genotypes, but no females with genotypes. The rate of haplotype resolution for the simulated data was >91% with a standard deviation of 2%. With 20% missing data, the rate of haplotype resolution was 67.5% with a standard deviation of 1.3%. In both cases all recovered haplotypes were correct. In the real data, allele origin was resolved for 22% of the heterozygous genotypes, even though 70% of the genotypes were missing. Haplotypes were resolved for 36% of the males. Computing time was insignificant for both data sets. Despite the intricacy of large-scale real pedigree genotypes, the proposed algorithm provides a practical rule-based solution for resolving haplotypes for small chromosomal segments in commercial animal populations.

The genotyping data of diploid organisms obtained by current laboratory techniques provide unordered allele pairs for each marker. Reconstruction of haplotypes from these data is a crucial step in many applications. Haplotypes of tightly linked markers provide insight into old and rare recombination events, and thus are more informative than single markers. In particular, haplotypes are essential for linkage disequilibrium (LD)-based gene mapping. Numerous studies have shown that individual loci affecting economic traits (QTL) can be detected via linkage to genetic markers (reviewed by Weller 2001). However, using genetic linkage, the location of a QTL in animal populations generally cannot be resolved to <10 cM, and an interval of this magnitude will still contain 80 genes. Using LD mapping, Meuwissen and Goddard (2000) proposed a method to narrow the confidence interval for a QTL to a few centimorgan units. Haplotype resolution is also important for commercial animal breeding. Mackinnon and Georges (1998) proposed that if a QTL of economic importance is localized to a relatively small chromosomal segment, frequency of the favorable allele can be increased by selection on haplotypes, even though the actual QTL has not been identified.
Approaches to reconstruction of haplotypes by genotype inference can be divided into statistical methods and nonstatistical methods. The first approach applies computational or statistical inference to find the most likely haplotype configuration consistent with the observed genotypic data. Several recent studies have proposed modifications for this method, e.g., Clark's (1990) parsimony method, the expectation maximization algorithm (Excoffier and Slatkin 1995), the partition ligation variant, Phase (Stephens et al. 2001), Haplotyper, and the phylogenetic approach (Gusfield 2001; Eskin et al. 2003). However, these algorithms generally require large data sets (on the order of tens of thousands of individuals) from populations where a biological "bottleneck" has previously occurred, as presumably happened in humans (Daly et al. 2001; Helmuth 2001; Gabriel et al. 2002).
In the nonstatistical approach, every possible trio of two parents and an offspring is examined. The haplotypes are resolved by forward inference of Mendelian rules from the parental genotypes to the offspring genotype and backward inference from offspring genotype to the parents. As the pedigree size increases, the number of computations for most rule-based algorithms increases exponentially. Thus, analysis of pedigrees consisting of hundreds of individuals would require several hours or even days of computations.
Many studies tried to deal with the time-consuming computations by applying specific biological assumptions, such as absence of recombination within the haplotype. O'Connell (2000) with ZAPLO described a genotype elimination algorithm, which was intended for single nucleotide polymorphism (SNP) markers under the assumption of zero recombination. Tapadar et al. (1999) proposed an evolution-based method with an optimality criterion of a minimum number of recombinations over all the possible haplotype configurations of pedigree members. The difficulty arises when there is a need to handle missing genotypes, as is usually the case in actual data sets. Qian and Beckmann (2002) developed a six-rule-based algorithm that exhaustively searches all possible minimum recombinant haplotype configurations and completes missing genotypes. To overcome the poor performance of this algorithm in large pedigrees, Li and Jiang (2003) proposed a polynomial-time exact algorithm for haplotype reconstruction without recombinants. This algorithm first identifies all the necessary constraints on the basis of Mendelian laws and the zero recombination assumption, but can be applied only to data that have no missing genotypes. Another formulation of the same algorithm is the branch-and-bound strategy (Li and Jiang 2004) that utilizes a partial order relationship and some other special relationships among variables to decide the branching order. This algorithm can be applied to pedigrees with missing data, but the running time will (linearly) increase with the rate of missing data. Moreover, when multiple solutions exist, which is the usual case, especially in real pedigree data, the best haplotype configuration is selected on the basis of a maximum-likelihood approach. Hence it may lead to an incorrectly resolved haplotype configuration.
The aim of this study was to develop an efficient rule-based haplotype reconstruction method suitable for pedigrees of several hundred individuals, assuming no recombination and allowing for missing data. The method was applied to both simulated and actual data from the Israeli dairy cattle population and compared to alternative algorithms.
Following Eronen et al. (2004), we assumed a set (map for a chromosomal segment) $M$ of $l$ markers $1, \ldots, l$ and denote the set of alleles of marker $i$ by $A_i$, where $A_i = \{a_{i1}, a_{i2}, a_{i3}, \ldots, a_{ik}\}$. The number of different alleles of marker $i$ in the population is denoted by $k$. For SNP, we will assume $k = 2$. The single-locus genotype $G(i)$ over $M$ is an unordered allele pair at locus $i$, which corresponds to the pair of homologous chromosomes, $G(i) \in \{\{a_{ip}, a_{iq}\} \mid a_{ip}, a_{iq} \in A_i\}$. Let $G(s, t)$ denote the allelic sequence from the $s$th to the $t$th marker on both chromosomes: $G(s, t) = \{H_1(s, t), H_2(s, t)\}$, where $H_1(s, t) \in \prod_{i=s,\ldots,t} A_i$ is a vector of (known) alleles on one of the homologous chromosomes and $H_2(s, t)$ is its counterpart on the other chromosome. We will denote $H(i, i)$ by $H(i)$.

Definitions
The $i$th genotype will be denoted "resolved" if two haplotypes, $H_1(i)$ and $H_2(i)$, are determined, such that $G(i) = \{H_1(i), H_2(i)\}$, and the $H_1(i)$ and $H_2(i)$ haplotype sources (paternal or maternal) are also determined. $H_1(i)$, $H_2(i)$, and $G(i)$ are termed "consistent" when all markers in $G(1, \ldots, l)$ are resolved and $\{H_1(i), H_2(i)\}$ for $i = 1, \ldots, l$ is a possible haplotype configuration for genotypes $G(1, l)$. Thus a haplotype configuration fully describes the haplotypes in a member of the pedigree and the origin of each allele included in the haplotypes.
The $i$th marker has a "conflict" if $G(i)$ must contain $\{H_1(i), H_2(i)\}$, based on Mendelian rules, but does not. The conflicts were divided into three categories. Conflicts due to no common allele between the parent and the putative progeny that were both originally genotyped in the raw data are denoted type 1. Conflicts due to no common allele between the parent and putative progeny, but either the parent or the progeny had a missing genotype which was reconstructed by application of Mendelian rules, are denoted type 2. A pedigree of more than two generations is required to detect a type 2 conflict. Sequences of more than one marker in which there is a shared allele between parent and progeny for each marker, but neither of the progeny haplotypes corresponds to either of the parental haplotypes, are denoted type 3. All three types of conflicts can be due to mutations, incorrect parentage recording, or genotyping errors. Type 3 conflicts can also be due to recombination.
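The single-marker conflict categories can be illustrated with a minimal Python sketch. The function names, the tuple encoding of genotypes, and the `*_reconstructed` flags are assumptions made for illustration only and are not taken from the program described here; type 3 conflicts, which involve several markers and resolved haplotypes, are not covered.

```python
from typing import Optional, Tuple

# Hypothetical encoding: a genotype is an unordered allele pair, None = missing.
Genotype = Optional[Tuple[int, int]]


def shares_allele(parent: Genotype, offspring: Genotype) -> Optional[bool]:
    """True/False when both genotypes are known, None when either is missing."""
    if parent is None or offspring is None:
        return None
    return bool(set(parent) & set(offspring))


def conflict_type(parent: Genotype, offspring: Genotype,
                  parent_reconstructed: bool = False,
                  offspring_reconstructed: bool = False) -> int:
    """Return 0 (no detectable conflict), 1, or 2 for a single marker.

    Type 1: no common allele, both genotypes observed in the raw data.
    Type 2: no common allele, at least one genotype was reconstructed.
    """
    shared = shares_allele(parent, offspring)
    if shared is None or shared:
        return 0
    if parent_reconstructed or offspring_reconstructed:
        return 2
    return 1
```

For example, a sire recorded as 1/1 with a son recorded as 2/2 would be reported as type 1, whereas the same pair with a reconstructed sire genotype would be reported as type 2.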
When the program identifies a conflict, it marks the locus and individuals in conflict. The haplotypes of the two individuals in conflict are considered unresolved, and the user is prompted to resolve the conflict by changing or deleting a genotype. If the user does not resolve the conflict, the algorithm will proceed under the assumption that both genotypes in conflict are correct, and these genotypes will be used to resolve haplotypes of other relatives.
A haplotype $H(s, t)$ and a genotype $G(s, t)$ are denoted a "match" if there exists a string $\hat{H} \in \prod_{s \le i \le t} A_i$, such that $\{H(s, t), \hat{H}\}$ is consistent with $G(s, t)$. Using the definitions above, the "haplotype reconstruction problem" is defined as follows: given a set $\hat{G}$ of an individual's genotypes for several closely linked markers with all possible haplotype configurations, the objective is to derive the correct haplotype configurations, $G$, where $G \in \hat{G}$.
Reconstruction of missing genotypes: Animals without parents recorded in the data file are denoted "founders." The sire, dam, and a single direct offspring are denoted a "nuclear family." In actual data, genotypes for some markers are missing, and some genotypes will be incorrect. Some missing genotypes can be inferred using Mendelian rules and the data of relatives. We propose that reconstruction of missing genotypes should precede haplotype resolution, since genotype (or allele) reconstruction is a more fundamental phase than haplotype resolution, which requires determination not only of the alleles at a locus, but also of their parental associations. Nevertheless, during haplotype resolution a missing genotype might also be resolved due to its flanking alleles. In the proposed algorithm the entire pedigree is first scanned from bottom to top, considering each nuclear family. If either the parent or the offspring of the individual with unknown genotype is homozygous, then the algorithm determines that the individual with the unknown genotype must have at least one copy of this allele. If information from a homozygous progeny and its parents is insufficient to resolve both alleles of the individual with unknown genotype, the algorithm then checks if there is an allele in any of the offspring of this individual that does not appear in its mate. If so, the algorithm determines that this allele was derived from the parent with the missing genotype. Any contradiction to these two rules is evidence of marker conflict.
Resolving haplotypes: In every nuclear family there is an offspring-parent pair with a maximum number of parental unresolved markers that are resolved in its offspring. This can be formulated as $\max\{|\{i \mid G_O(i) \text{ resolved and } G_P(i) \text{ not resolved}\}|\}$, where $G_P(i)$ is the parental genotype of marker $i$, and $G_O(i)$ is the offspring genotype for this marker. The offspring with the maximum number of resolved markers that are not resolved in its parent can potentially contribute the most information to resolving its parents' markers. Haplotype resolution is divided into two sequential steps: solving for the parent by separation of its two haplotypes, $G(1, l) = \{H_1(1, l), H_2(1, l)\}$, and assigning the progeny haplotype ($H_1$ or $H_2$) to the parent from the offspring whose haplotype was most complete. This haplotype is then denoted as the resolved haplotype (RHap). In the second step, the RHap is assigned to each of the parent's progeny, where possible. The criterion of which haplotype to assign when either solving for the parent by its progeny or solving for the progeny by its parent depends on the existence of a resolved marker that is heterozygous in the parental genotype and has been resolved in the progeny. This criterion is denoted the "separation rule." An example is given in Figure 1.
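The "match" definition can be made concrete with a short Python sketch. The function name and the list-of-pairs encoding are hypothetical; the sketch assumes all genotypes in the segment are known.

```python
from typing import List, Optional, Tuple


def complement_haplotype(hap: List[int],
                         genotypes: List[Tuple[int, int]]) -> Optional[List[int]]:
    """Return the counterpart haplotype if hap matches the genotypes, else None.

    hap[i] must be one of the two alleles of genotypes[i]; the counterpart
    then carries the other allele of that unordered pair.
    """
    other = []
    for h, (a, b) in zip(hap, genotypes):
        if h == a:
            other.append(b)
        elif h == b:
            other.append(a)
        else:
            return None  # hap cannot be one of the two haplotypes
    return other
```

Under this encoding, a haplotype matches a genotype sequence exactly when the function returns a list rather than `None`.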
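The two reconstruction rules can be sketched for a single marker as follows. This is a simplified illustration, not the program's implementation: the function name and data layout are assumptions, and the sketch only collects the alleles that must be present without tracking copy number.

```python
from typing import Iterable, List, Optional, Set, Tuple

Genotype = Optional[Tuple[int, int]]  # unordered allele pair, None = missing


def required_alleles(parents: Iterable[Genotype],
                     offspring_and_mates: List[Tuple[Genotype, Genotype]]) -> Set[int]:
    """Alleles that an ungenotyped individual must carry at one marker.

    parents: genotypes of the individual's own sire and dam.
    offspring_and_mates: (offspring genotype, genotype of the other parent).
    Raises ValueError when the rules contradict each other (marker conflict).
    """
    required: Set[int] = set()
    # Rule 1: a homozygous parent or offspring implies at least one copy.
    for g in parents:
        if g is not None and g[0] == g[1]:
            required.add(g[0])
    for child, mate in offspring_and_mates:
        if child is None:
            continue
        if child[0] == child[1]:
            required.add(child[0])
        # Rule 2: an offspring allele absent from the mate came from this parent.
        if mate is not None:
            foreign = set(child) - set(mate)
            if len(foreign) == 2:
                raise ValueError("offspring shares no allele with the genotyped mate")
            required |= foreign
    if len(required) > 2:
        raise ValueError("more than two alleles required: marker conflict")
    return required
```

When the returned set holds two alleles the genotype is fully reconstructed; with one allele only a single copy is known, and the remaining allele stays missing.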
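The selection of the most informative offspring can be sketched in a few lines of Python. The boolean "resolved" flags per marker and the function names are assumed representations chosen for illustration.

```python
from typing import Dict, List


def informative_count(parent_resolved: List[bool],
                      offspring_resolved: List[bool]) -> int:
    """Number of markers resolved in the offspring but not in the parent."""
    return sum(1 for p, o in zip(parent_resolved, offspring_resolved)
               if o and not p)


def most_informative_offspring(parent_resolved: List[bool],
                               offspring: Dict[str, List[bool]]) -> str:
    """Identifier of the offspring maximizing the count defined above."""
    return max(offspring,
               key=lambda name: informative_count(parent_resolved, offspring[name]))
```

Ties are broken here by dictionary order, which is an arbitrary choice of this sketch.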
Although it is not possible to determine haplotype source (paternal or maternal) for founders, it is possible to determine both haplotypes. Therefore, when assigning RHap for founders, the algorithm determines the founder's offspring with maximal contribution to the determination of the parental haplotypes, i.e., the offspring with the maximum number of resolved markers not resolved in its parent, and applies the separation rule. Haplotype determination can be improved in nonfounder families by using information from other offspring of the same parent. The more resolved loci found in offspring that are not resolved in the parent, the more the offspring contributes to the composition of RHap. The offspring can then be utilized as the source for RHap, as long as the separation rule can be applied, as in the example in Figure 1, thus gathering haplotype information not just from a single offspring, but also from its siblings.
In a parent with heterozygous loci, it is then possible to identify which haplotype was passed to a progeny, provided that the progeny genotypes differ from the parent in at least one locus heterozygous in the parent. In this case, it is possible to improve RHap iteratively each time an offspring is found to have a resolved marker that is not resolved in its parent, which fulfills the separation rule. This can be done only in nonfounder families, because in founders it is not possible to determine paternal or maternal haplotypes, except in the case where all the founder's genotypes are homozygous.
After maximum RHap resolution is achieved, RHap then can be used as a haplotype source. For example, in the case given in Figure 1, resolution of parental genotype can be used to better resolve the offspring's genotype, which then can be used to resolve its siblings and its own offspring, which in turn can be used to resolve the offspring haplotype and its own parent and vice versa. Offspring without resolved markers can be resolved only to the level of the best-resolved offspring in the same nuclear family. Resolving other families is then processed recursively, with the offspring of the first family considered as the parent of a new family.
Figure 1. The parent and progeny genotypes are shown, and the parental haplotypes are assumed known. Each locus is denoted by a square. The first row (M) lists the alleles of the parent's maternal haplotype, and the second row (P) lists the alleles of the parent's paternal haplotype. Missing alleles are denoted with zeros. Considering each marker separately, the offspring can be resolved only for the fifth and sixth loci, and these are indicated accordingly. For the remaining markers, the two alleles of the progeny's genotype are given in a single square. The fifth marker is circled, because allelic origin for the offspring can be determined only for this marker. For the sixth marker it is possible to determine that the progeny received allele 2 from its parent, but it is not possible to determine whether this allele derives from the parent's maternal or paternal haplotypes, because the parent was homozygous for this locus. Under the assumption that the offspring received the M maternal haplotype in its entirety, it is possible to resolve three of the offspring's missing alleles and also to determine their origins. The progeny's resolved haplotypes are given in the bottom two rows.
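A minimal sketch of the separation rule under zero recombination is given below. The function names, the coding of missing alleles as zero, and the allele values in the usage note are hypothetical; markers with any unknown allele are simply skipped, which is a simplification.

```python
from typing import List, Optional

MISSING = 0  # assumed coding: missing alleles are zeros


def transmitted_haplotype(hap_m: List[int], hap_p: List[int],
                          received: List[int]) -> Optional[str]:
    """Decide which parental haplotype ('M' or 'P') an offspring received.

    received[i] is the allele known to come from this parent at marker i.
    Returns None when no marker that is heterozygous in the parent is
    resolved in the offspring, so the separation rule cannot be applied.
    """
    fits_m = fits_p = True
    informative = False
    for m, p, r in zip(hap_m, hap_p, received):
        if MISSING in (m, p, r):
            continue  # simplification: ignore markers with an unknown allele
        if m != p:
            informative = True  # parent heterozygous, offspring resolved
        fits_m = fits_m and r == m
        fits_p = fits_p and r == p
    if not (fits_m or fits_p):
        raise ValueError("offspring matches neither parental haplotype (conflict)")
    if not informative:
        return None
    return "M" if fits_m else "P"


def fill_from_parent(hap: List[int], received: List[int]) -> List[int]:
    """Complete the offspring's unknown alleles from the transmitted haplotype."""
    return [r if r != MISSING else h for h, r in zip(hap, received)]
```

With made-up haplotypes `M = [1, 2, 1, 0, 1, 2]` and `P = [2, 2, 0, 1, 2, 2]` and an offspring known only at the last two markers, `received = [0, 0, 0, 0, 1, 2]`, the fifth marker separates the haplotypes and `fill_from_parent(M, received)` completes three further alleles, leaving the fourth unknown.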
Generation and analysis of simulated data: Ten independent populations were simulated. The chromosomal segment analyzed consisted of five tightly linked biallelic markers. Genotypes were assumed known only for males. Population allelic frequencies were determined for each marker by simulation from a uniform distribution. Each population consisted of five founder males. Genotypes and haplotypes of founders were generated by random sampling of alleles on the basis of simulated population allelic frequencies. Each sire passed one of his two haplotypes intact to each progeny, each with a probability of 0.5. The dam haplotype of each progeny was determined by random sampling of alleles, on the basis of the population allelic frequencies. Five male progeny were generated for each male founder, and five male progeny were generated for each male of the second generation, for a total of 155 males with recorded genotypes. Genotypes and haplotypes of the grandsons of the founders were determined using the same procedure used to determine the genotypes of their sires. That is, for each grandson, one complete haplotype of his sire was selected as the paternal haplotype, and the maternal haplotype was determined by random selection of alleles on the basis of population frequencies. Thus zero recombination from sires to sons was assumed throughout. In addition, 10 more simulated populations were generated under the same conditions, but with 20% of genotypes randomly designated as missing.
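The simulation design can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used in the study: the function name, the dictionary layout, the allele labels 1 and 2, and the seeding are all hypothetical, and the step that masks 20% of genotypes as missing is omitted.

```python
import random
from typing import Dict, List, Optional


def simulate_population(n_markers: int = 5, n_founders: int = 5,
                        n_sons: int = 5, seed: int = 0) -> List[Dict]:
    """Simulate three generations of males with zero recombination."""
    rng = random.Random(seed)
    # Frequency of allele 1 at each biallelic marker, drawn uniformly.
    freqs = [rng.random() for _ in range(n_markers)]
    males: List[Dict] = []

    def random_haplotype():
        return tuple(1 if rng.random() < f else 2 for f in freqs)

    def make_male(sire: Optional[Dict]) -> Dict:
        # Founders draw both haplotypes from population frequencies; sons
        # receive one intact sire haplotype, each with probability 0.5.
        paternal = random_haplotype() if sire is None else rng.choice(sire["haps"])
        maternal = random_haplotype()
        male = {"id": len(males),
                "sire": None if sire is None else sire["id"],
                "haps": (paternal, maternal)}
        males.append(male)
        return male

    generation = [make_male(None) for _ in range(n_founders)]
    for _ in range(2):  # sons, then grandsons
        generation = [make_male(s) for s in generation for _ in range(n_sons)]
    return males
```

With the default arguments the sketch yields 5 + 25 + 125 = 155 males, matching the pedigree size described above.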
Simulated data sets were also analyzed by the block-extension algorithm (Li and Jiang 2003) and integer linear programming (ILP) algorithm (Li and Jiang 2004) of the haplotype resolution package PedPhase2 and by SimWalk2 version 2.91 (Sobel and Lange 1996). Results of these programs and LSPH were compared. PedPhase2 and SimWalk2 required generating "dummy" records for individuals listed as parents. Therefore it was necessary to generate dummy dam records for all nonfounders, increasing the number of individuals in each data set to 305. SimWalk2 could not run on the simulated data set, because the number of founders exceeded the maximum allowed. Therefore an abbreviated data set consisting of 155 individuals was generated for this program by deleting 75 grandsons and their 75 dams.
Analysis of actual data: A QTL affecting milk production traits located in the central region of chromosome 6 was detected independently by different research groups in several cattle populations (summarized by Ron et al. 2001). Twelve diallelic markers were detected within a 2-cM chromosomal segment distal to BM143, a multiallelic microsatellite, as described by Cohen-Zinder et al. (2005). Details of the polymorphisms genotyped on BTA6 are given in Table 1. These markers are located in 10 genes and included SNP or variations in simple sequence repeats.
Semen was obtained for 418 artificial insemination sires of the Israeli Holstein population with genetic evaluations. These bulls were genotyped for the 12 diallelic markers and BM143. Of these bulls, 12 had three or more genotype conflicts between their genotype and the genotype of their putative sires. We therefore concluded that either paternity was incorrect or that our semen sample was mislabeled. These 12 bulls were discarded from further analysis, leaving 406 bulls. The numbers of bulls genotyped for each marker are also listed in Table 1. There were 4374 valid genotypes of 5278 possible genotypes (406 animals × 13 markers). Thus genotyping efficiency was 83%. Of the valid genotypes, 2111 (48%) were heterozygous. Frequencies of heterozygotes by marker are also given in Table 1. The frequencies of the bulls' year of birth are given in Figure 2. More than 95% of the bulls were born between 1980 and 1996. All known parents and grandparents of bulls with genotypes were included in the pedigree file for a total of 1149 animals. Of these, 372 were "founders," that is, cows or bulls with parents not included in the pedigree file, and 584 were females. The maximum number of generations from founders to the youngest genotyped animals was six. None of the females or founders was genotyped, because DNA was not available for these animals. Nearly all of the females had only a single male progeny. Thirteen alleles ranging in fragment length from 90 to 118 bp were observed for BM143 and their allelic frequencies are given in Table 2. Most allele frequencies were quite low. Most of the conflicts detected by application of the algorithm involved BM143. Therefore the data were also analyzed with this marker deleted. The number of individuals analyzed and the genotypes with and without BM143 included are given in Table 3.
The actual data results were analyzed on the basis of the number and types of conflicts detected and the proportion of haplotype-resolved heterozygous marker loci.

RESULTS AND DISCUSSION
CPU time for the analysis of either simulated or real data sets by LSPH was never >2 sec on an Intel Pentium 1.4 GHz M processor computer with 1 GB RAM. The computing time of LSPH with relatively large data sets was much shorter than, or at least as short as, other tested rule-based methods. The results from the analysis of the simulated data are summarized in Table 4. The average rate of haplotype resolution with no missing data was 91.5% with a standard deviation of 2.0%. Haplotypes were resolved for 74% of the heterozygous markers with a standard deviation of 8.6%. With 20% missing data, the rate of haplotype resolution was 67.3% with a standard deviation of 1.3%, and resolution for the heterozygous markers was 54.2% with a standard deviation of 6.1%. None of the haplotype resolutions were incorrect. Unlike other algorithms (e.g., Li and Jiang 2004), the amount of missing data has no effect on CPU time.
LSPH results for the real data are presented in Table 5. Results are presented separately, including BM143, the only multiallelic marker, and with BM143 deleted. In both cases the program was able to resolve only 2% of the unrecorded genotypes. The fraction of resolved markers for real data was much smaller than that for the simulated data. This result was due chiefly to the greater fraction of missing data (70%) in the real data set. In the simulated data, a 20% increase in the amount of missing data led on the average to a 24% reduction in the number of resolved genotypes and about a 20% reduction in the number of resolved heterozygous genotypes. In addition, real data may contain recombination events as well as incorrect genotypes. These were marked as "type 3" conflicts and could not be resolved. LSPH was able to resolve haplotypes for 60% of the known genotypes, but this includes all homozygous genotypes, which are by definition "resolved." Allele origin was resolved for only 22% and 17% of the heterozygous genotypes with and without BM143 included, respectively. With BM143 included, the program was able to completely resolve at least one heterozygous marker for 203 (18%) animals of the total. Assuming no recombination within this chromosomal segment, this is also the number of animals with resolved haplotypes. As noted previously, the analysis included 584 female ancestors that were not genotyped. No female haplotypes were resolved. Thus haplotypes were resolved for 36% of the males. With BM143 deleted, the number of animals with resolved haplotypes was reduced to 150, or 13% of the total animals and 27% of the males. With BM143 included, 75 conflicts were detected. Of these, 22 were denoted type 1, 3 were denoted type 2, and 50 were denoted type 3. Only 36 conflicts were detected with BM143 deleted, and none were type 2. Thus inclusion of a microsatellite with multiple alleles increased the number of animals with resolved haplotypes by 50%, but doubled the number of conflicts.
As noted, type 1 and 2 conflicts are due to genotyping mistakes, incorrect parental assignments, sample switching, or mutations. Genotyping error rates for microsatellites are on the order of 1%, while mutation rates are much lower (Weller et al. 2004). Previous results indicate that the rate of incorrect paternity recording of cows during the 1990s was 11.7% (Weller et al. 2004), although it is expected that the incorrect paternity rate for AI bulls should be lower, and paternity was verified by genetic markers for at least 80% of the bulls (Y. Zaron, personal communication). In addition, incorrect paternity should have led to multiple conflicts, as was the case for 12 bulls, which were discarded from the analysis. Although type 3 conflicts could also be due to recombination, it is very unlikely that this is the case for most of these conflicts. Haplotypes were resolved for only 203 and 150 animals with and without BM143 included. Assuming that the chromosomal segment spans 2 cM, only three to four recombinations per generation should have occurred among the animals with resolved haplotypes, and most would be undetectable. In conclusion, genotyping mistakes are still the most likely explanation for most of the conflicts.
SNP have the advantages that they are more frequent throughout the genome and genotyping errors are less frequent than for microsatellites, but have the disadvantage that they are less polymorphic than microsatellites, which reduces the likelihood that allele origin can be unequivocally determined. Clearly genotyping females, especially dams of AI sires, could increase the fraction of haplotypes resolved; but in most cases this will not be a viable option, because unlike AI sires, DNA of females is not collected and stored.
Studies based on the minimum recombination principle have been previously published. Tapadar et al. (1999) implemented their model using a likelihood function, but could not deal with missing data. O'Connell (2000) estimated the haplotype frequencies for biallelic markers using ZAPLO, but had difficulties in completing missing data and analysis of markers with more than two alleles. The methods proposed by Qian and Beckmann (2002) and Li and Jiang (2003) deal with missing and inaccurate genotype data, but these methods can be applied only to pedigrees of several tens of individuals. The implementation package PedPhase 2.0 for the minimum recombinant haplotype configuration problem (Li and Jiang 2004) contains five functions for inferring haplotypes from genotypes for members of a general pedigree. Two functions [locus-based dynamic programming (DP) and member-based DP] are exponential in the size of the input and are not recommended by the authors for a large number of markers or individuals in the pedigree. The ILP function computes all possible solutions whenever it cannot unequivocally determine the haplotype. Moreover, the number of all possible haplotype solutions with zero recombinants, even for a moderately sized data set, can be huge. The constrained-finding function produces similar results to LSPH on simulated data with no missing data, but cannot deal with missing data. The last function of the PedPhase package, the block-extension algorithm, is intended for analysis of large pedigrees. This is a heuristic algorithm (Li and Jiang 2003, 2004) and as such provides one possible solution out of many, which may or may not be correct. In real data sets, the founders' haplotypes are generally unknown, which increases the number of possible solutions. Not all of the functions in the PedPhase package perform complete validity checks. LSPH, in contrast, performs several different types of validity checks from the basic level of pedigree structure (such as duplicate individuals or missing individual-family relationship) to the Mendelian consistency check and data completion (when possible) for each locus for each individual. In case of inconsistency, LSPH also determines the mismatch type, as described.
The results of LSPH, SimWalk2, and two functions of PedPhase (block-extension and ILP) on the simulated data sets are compared in Table 6. As noted previously, SimWalk2 could not run on the simulated data sets of 305 individuals, including dams. Running time for SimWalk2 for the truncated set of 155 individuals, including dams, was 50 min, as compared to <1 sec for LSPH or either PedPhase function (although for some pedigree files, the ILP algorithm running time was >8 sec). Of the three programs, LSPH is the only one that does not provide a solution for haplotypes that cannot be unequivocally resolved. Although both SimWalk2 and PedPhase present a single-haplotype resolution of all possible combinations, only SimWalk2 differentiates between these cases and the cases in which haplotypes are unequivocally determined. For both PedPhase and SimWalk2, nonexistent recombination events were reported. That is, haplotype solutions requiring recombinations in previous generations were derived, even though the simulated population was generated without recombinations.
Of the heterozygous loci, 24.4 and 10.7% were incorrectly resolved by the block-extension and ILP algorithms of PedPhase with no missing data. It should be noted that since there are only two alternatives with equal probabilities, random haplotype determination would result in 50% correct decisions. Only the ILP algorithm was able to run on the data sets with 20% missing data, and in this case 8.5% of the genotype determinations were incorrect. Since only 20% of genotypes were missing, 42.5% of the missing genotypes were incorrectly determined. Of the correctly determined genotypes, including both known and reconstructed, 15.9% of the heterozygous loci were incorrectly resolved.
The percentage of correct allele determinations of heterozygotes for LSPH and the two PedPhase algorithms are also presented, with and without missing data. The results for LSPH are the same as those given in Table 4. With no missing data, the block-extension algorithm value is only marginally higher than LSPH, even though 24.4% of all the determinations are incorrect. The ILP algorithm was able to correctly resolve allelic origin for 15% more of the heterozygote genotypes, but this is still only marginally better than could be obtained by LSPH, if the 25.6% unresolved heterozygotes are randomly assigned. On the basis of this comparison, the "resolution" rate for LSPH can be assumed to be 87.2%, with the advantage that the program indicates which haplotypes are known with certainty. With missing data, the ILP algorithm is able to correctly resolve 84.1% of the heterozygotes, as compared to the 54.2% of heterozygotes that are resolved with certainty by LSPH. Again, another 22.9% of the unresolved heterozygotes would be "correctly resolved" by random assignment.
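The random-assignment figures follow from simple arithmetic. As a worked check, assuming each unresolved heterozygote is assigned correctly with probability 1/2, and taking the certain resolution rate without missing data as 74.4% (the complement of the 25.6% unresolved):

```latex
\begin{align*}
\text{no missing data:}\quad
  & 74.4\% + \tfrac{1}{2}\,(25.6\%) = 74.4\% + 12.8\% = 87.2\% \\
\text{20\% missing data:}\quad
  & 54.2\% + \tfrac{1}{2}\,(100\% - 54.2\%) = 54.2\% + 22.9\% = 77.1\%
\end{align*}
```

The 77.1% total for the missing-data case is not stated in the text; it is only the sum implied by the two quoted values, and can be set against the 84.1% obtained by the ILP algorithm.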
LSPH was more robust than the other programs with respect to the input requirements. It could handle pedigree data with different types of missing information and discrepancies, such as only a single parent is known, only one allele of a genotype is recorded, gender conflicts, or the same individual is recorded twice. PedPhase terminated with error status if there were missing parents, redundant individuals, or Mendelian conflicts in genotype. PedPhase terminated normally if there were incorrect gender determinations, but did not note the discrepancy. Although all the programs were designed to run on the Windows operating system, only LSPH has the capability to produce output in Microsoft Excel XML format (among others), which was discovered to be extremely helpful in manipulating large amounts of data.
As noted in the Introduction, information on haplotype frequencies can be used to determine the most probable haplotype in situations in which more than a single solution is possible. The aim of statistics-based programs, such as SimWalk2, is to find a haplotype configuration with the maximum likelihood under the assumed model. Exact algorithms for finding the most probable haplotype configuration can work only for small data sets, where the number of consistent haplotype configurations for a given pedigree is in a manageable range.
We developed and implemented an algorithm for inference of haplotypes suitable for analysis of a large population without a well-defined pedigree structure, with individuals genotyped for many closely linked genetic markers under the assumption of no recombination. This was useful for inferring missing genotypes, as well as resolving haplotypes by using the information from closely linked markers. During the development of this algorithm, effort was directed toward handling data from a large pedigree with missing and erroneous genotypes. LSPH does not perform an exhaustive search, and does not make any assumptions on the underlying population, and therefore can also be applied to analysis of human populations.
The algorithm is rule based, and as such the reconstructed haplotypes are error free, provided that there are no genotyping mistakes or recombinations. Under these restrictions, the optimal solution can then be defined as resolution of the maximum number of haplotypes. We cannot prove that LSPH is optimal in this sense, although on simulated data 91% of the haplotypes were resolved with no missing data. Kerr and Kinghorn (1996) developed a segregation analysis method to determine genotype probabilities for individuals that were not genotyped on the basis of the known genotypes of their relatives. This method can be applied to a very large population, but considers only a single locus and does not resolve haplotypes. Windig and Meuwissen (2004) developed a rapid algorithm that determines the most probable haplotypes for short chromosomal segments with many markers and large families. Further study is suggested to combine these methods with our algorithm for further resolution of genotypes and haplotypes.