Empirical Evaluation of Genetic Clustering Methods Using Multilocus Genotypes From 20 Chicken Breeds
Noah A. Rosenberg(a), Terry Burke(1,b), Kari Elo(e), Marcus W. Feldman(1,a), Paul J. Freidlin(1,c), Martien A. M. Groenen(1,d), Jossi Hillel(1,c), Asko Mäki-Tanila(1,e), Michèle Tixier-Boichard(1,f), Alain Vignal(1,g), Klaus Wimmers(1,h), and Steffen Weigend(i)
a Department of Biological Sciences, Stanford University, Stanford, California 94305,
b Department of Animal and Plant Sciences, Sheffield University, S10 2TN, United Kingdom,
c Department of Genetics, The Hebrew University of Jerusalem, Faculty of Agriculture, Rehovot 76100, Israel,
d Institute of Animal Sciences, Wageningen Agricultural University, 6700 AH Wageningen, The Netherlands,
e Agricultural Research Centre, Institute of Animal Production, FIN-31600 Jokioinen, Finland,
f Institut National de la Recherche Agronomique, Centre de Recherches de Jouy-en-Josas, 78 352 Jouy-en-Josas Cedex, France,
g Institut National de la Recherche Agronomique, Centre INRA de Toulouse, 31326 Castanet Tolosan, France,
h Institute of Animal Breeding Science, Rheinische Friedrich-Wilhelms-Universitat, D-53012 Bonn, Germany
i Institute for Animal Science and Animal Behaviour, Mariensee, 31535 Neustadt, Germany
Corresponding author: Noah A. Rosenberg, Program in Molecular and Computational Biology, University of Southern California, 1042 W. 36th Pl., DRB 155, Los Angeles, CA 90089-1113. E-mail: noahr{at}usc.edu
Communicating editor: G. B. GOLDING
| ABSTRACT |
|---|
We tested the utility of genetic cluster analysis in ascertaining population structure of a large data set for which population structure was previously known. Each of 600 individuals representing 20 distinct chicken breeds was genotyped for 27 microsatellite loci, and individual multilocus genotypes were used to infer genetic clusters. Individuals from each breed were inferred to belong mostly to the same cluster. The clustering success rate, measuring the fraction of individuals that were properly inferred to belong to their correct breeds, was consistently ~98%. When markers of highest expected heterozygosity were used, genotypes that included at least 8–10 highly variable markers from among the 27 markers genotyped also achieved >95% clustering success. When 12–15 highly variable markers and only 15–20 of the 30 individuals per breed were used, clustering success was at least 90%. We suggest that in species for which population structure is of interest, databases of multilocus genotypes at highly variable markers should be compiled. These genotypes could then be used as training samples for genetic cluster analysis and to facilitate assignments of individuals of unknown origin to populations. The clustering algorithm has potential applications in defining the within-species genetic units that are useful in problems of conservation.
CHARACTERIZATIONS of the population structure of species are useful in a variety of contexts. Genetic ascertainment of within-species population structure has been widely applied for classifying subspecies, for defining intraspecific conservation units, for understanding events in the history of a species, for identifying ongoing speciation events, and for testing hypotheses about evolutionary processes. In other situations, the presence of population structure poses a practical nuisance. For example, allele frequencies in reference groups are central to calculations in forensic studies, and it is difficult to identify appropriate reference groups in structured populations (NATIONAL RESEARCH COUNCIL 1996). In case-control studies that test for statistical associations between a genotype at a particular locus and a phenotype, not taking into account population structure can lead to the false detection of associations (e.g., ![]()
Population structure assessment has often relied upon a priori groupings of individuals on the basis of phenotypes or sampling locations. A classification chosen by an investigator, however, might not accurately describe the genetic structure of the populations. Genetically similar groups of individuals might be labeled differently due to distinct geography, different phenotypes, or, in the case of human groups, cultural differences; however, a high level of geographic, phenotypic, or cultural diversity among a collection of populations need not imply that the groups are genetically divergent. Conversely, geographic overlap or phenotypic similarity may mask underlying genetic variation. Thus, a purely genetic analysis using no external information provides the most direct method of determining population structure. Only if a correspondence between genetic and geographic or phenotypic classifications is established can these characteristics also serve as appropriate classification tools.
The structure algorithm (![]()
In this article, we consider the utility of genetic cluster analysis on a large data set for which population structure is known, with the aim of making recommendations about its future uses. We employ a collection of 27-locus genotypes from 600 individuals representing 20 chicken breeds. This data set is substantially larger than previous data sets on which structure has been applied (![]()
We first characterize the genetic differences among the populations. We then demonstrate that genetic cluster analysis has great ability to correctly ascertain the population structure for these data, and we compare the cluster analysis to a cladogram derived from the neighbor-joining algorithm. To assess the success of clustering as a function of the number of markers, we consider subsets of the loci chosen by different criteria of variability. We also consider the success of clustering as a function of the number of individuals used per population. Finally, we discuss recommendations on the use of genetic cluster analysis for ascertaining population structure, for applications in the assignment of individuals of unknown origin to populations, and for identifying genetically distinctive populations.
| MATERIALS AND METHODS |
|---|
Breeds:
We genotyped 30 individuals from each of 20 breeds. These breeds form a subset of the populations studied in a survey of European chicken genetic diversity (![]()
Markers:
Genotypes were used for 27 microsatellite markers spread across the chicken genome (listed in Table 1). Except for ADL278, LEI94, LEI166, LEI194, LEI228, and LEI234, it has previously been reported that these markers show high levels of polymorphism within and between breeds (![]()
Genotyping:
Genotyping was performed in the laboratories of T. Burke, M. A. M. Groenen, J. Hillel, and S. Weigend, with similar procedures used in all labs. The example procedure that follows is from the laboratory of S. Weigend. PCR products were obtained in a 25-µl volume using Ready-To-Go PCR Beads (no. 27-9555-01; Amersham Pharmacia Biotech Europe, Freiburg, Germany) and a thermal cycler (Mastercycler; Eppendorf, Hamburg, Germany). Two pairs of microsatellite primers were run in one tube. Each PCR tube contained 20 ng of genomic DNA, 10 pmol of each forward primer labeled with either IRD700 or IRD800 (MWG-Biotech, Ebersberg, Germany), 10 pmol of each unlabeled reverse primer, and 1 mM tetramethylammoniumchloride. The amplification involved initial denaturation at 95° (1 min), 35 cycles of denaturation at 95° (1 min), primer annealing at temperatures varying between 58° (1 min) and extension at 72° (1 min), followed by final extension at 72° (10 min). Specific DNA fragments produced by amplification were visualized as bands by 8% PAGE, which was performed with a LI-COR automated DNA analyzer (LI-COR Biotechnology Division, Lincoln, NE 68504). Electrophoregram processing and allele-size scoring were performed with the RFLPscan package (Scanalytics, Division of CSP, Billerica, MA).
Missing data:
The proportion of missing data was 0.8%, and 12 of 27 loci had missing genotypes. For no locus were >3.5% of the possible genotypes missing. Missing genotypes were distributed across 88 individuals from 18 breeds. For no breed were >4.1% of its genotypes missing. Out of 600 individuals, 13 individuals originating from 6 breeds did not have available genotypes at >1 locus. These 13 individuals included 1 individual that was lacking genotypes at 9 loci and 3 individuals that were missing genotypes at 10 loci.
Statistical analysis:
Genetic differentiation:
For each pair of breeds, allele frequencies were tabulated at each locus, sequentially pooling the rarest alleles into one allelic class, until the average frequency for the two breeds exceeded 0.1 for each class. A chi-square association test statistic was computed for each locus, with the number of degrees of freedom equaling one fewer than the number of allelic classes. We counted how many loci produced test statistics that were significant at the 0.001 level.
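The pooling and testing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the dictionary layout of allele counts, and the example counts are all hypothetical, and the pooling rule is one reading of "sequentially pooling the rarest alleles."

```python
def pool_rare_alleles(counts_a, counts_b, threshold=0.1):
    """Pool the rarest alleles into one class until every class has an
    average frequency (over the two breeds) above `threshold`.

    counts_a, counts_b: hypothetical dicts mapping allele -> number of copies.
    Returns a list of (copies_in_a, copies_in_b) tuples, one per allelic class.
    """
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())

    def avg_freq(cls):
        return 0.5 * (cls[0] / n_a + cls[1] / n_b)

    alleles = set(counts_a) | set(counts_b)
    # Sort classes from rarest to most common by average frequency.
    classes = sorted(
        ((counts_a.get(x, 0), counts_b.get(x, 0)) for x in alleles),
        key=avg_freq,
    )
    pooled, pooled_any = (0, 0), False
    # Keep pooling while the rarest remaining class, or the pooled class
    # itself, is still at or below the threshold.
    while classes and (
        avg_freq(classes[0]) <= threshold
        or (pooled_any and avg_freq(pooled) <= threshold)
    ):
        a, b = classes.pop(0)
        pooled, pooled_any = (pooled[0] + a, pooled[1] + b), True
    return classes + ([pooled] if pooled_any else [])


def chi_square_statistic(classes):
    """Chi-square association statistic for a 2 x k table of allele counts.

    Returns (statistic, degrees_of_freedom) with df = k - 1.
    """
    row_a = sum(a for a, _ in classes)
    row_b = sum(b for _, b in classes)
    total = row_a + row_b
    stat = 0.0
    for a, b in classes:
        col = a + b
        for obs, row in ((a, row_a), (b, row_b)):
            expected = row * col / total
            stat += (obs - expected) ** 2 / expected
    return stat, len(classes) - 1


# Made-up counts for one locus in two breeds (60 allele copies each).
breed_a = {"A": 30, "B": 25, "C": 3, "D": 2}
breed_b = {"A": 10, "B": 40, "C": 6, "D": 4}
classes = pool_rare_alleles(breed_a, breed_b)
stat, df = chi_square_statistic(classes)
```

A P-value can then be read from a chi-square distribution with `df` degrees of freedom, for instance with `scipy.stats.chi2.sf(stat, df)`, and compared with 0.001.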
Genetic distance between breeds was calculated using the negative logarithm of the proportion of shared alleles (PSA) in the two breeds (![]()
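As a rough illustration of a distance based on the negative logarithm of the proportion of shared alleles, the sketch below assumes a common population-level formulation in which PSA is the mean over loci of the summed minimum allele frequencies in the two breeds. The source text here does not spell out the exact formula, so this definition, the function names, and the example frequencies are assumptions.

```python
import math


def proportion_shared_alleles(freqs_a, freqs_b):
    """Mean over loci of sum over alleles of min(frequency in A, frequency in B).

    freqs_a, freqs_b: hypothetical lists (one entry per locus) of dicts
    mapping allele -> frequency in that breed.
    """
    total = 0.0
    for locus_a, locus_b in zip(freqs_a, freqs_b):
        alleles = set(locus_a) | set(locus_b)
        total += sum(min(locus_a.get(x, 0.0), locus_b.get(x, 0.0)) for x in alleles)
    return total / len(freqs_a)


def psa_distance(freqs_a, freqs_b):
    """Negative natural logarithm of PSA; infinite if no alleles are shared."""
    psa = proportion_shared_alleles(freqs_a, freqs_b)
    return math.inf if psa == 0 else -math.log(psa)


# Two made-up loci: identical frequencies at the first, no sharing at the second.
breed_a = [{"1": 0.5, "2": 0.5}, {"1": 1.0}]
breed_b = [{"1": 0.5, "2": 0.5}, {"2": 1.0}]
distance = psa_distance(breed_a, breed_b)
```

With this formulation, breeds with identical allele frequencies have distance 0, and the distance grows without bound as the breeds share fewer alleles.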
Clustering of breeds:
Population structure was studied using two methods. First, we obtained an unrooted neighbor-joining cladogram (![]()
The second approach utilized the program structure, which identifies clusters of related individuals from multilocus genotypes (![]()
Evaluation of cluster analysis:
Each individual was assigned to a specific breed using structure (![]()
Once the individuals were clustered to the greatest extent possible at the conclusion of step 3, we followed step 4 to assign each individual to a single breed. The "clustering success rate" (step 5) was then defined as the proportion of individuals correctly assigned to their breeds of origin.
Note that we assumed that individuals were maximally clustered after step 3. This assumption avoided additional subclustering runs: In principle, a cluster C that was associated only with breed B in step 3 might have been decomposable into subclusters. However, each of the resulting subclusters would then be associated with either no breed or with the single breed B. Thus, this subclustering would not greatly affect the eventual assignment of individuals of cluster C to breeds. We also did not decompose any subclusters obtained in step 3 into "sub-subclusters." While it is conceivable that subclusters could be further divided, a single round of subclustering provided a convenient stopping point for the evaluation, allowing us to devise the precise procedure in Fig 1. Since only a small number of individuals would have been affected by sub-subclustering, the impact of this assumption on the clustering success rate was likely not very large. In the application of structure to data of unknown population structure, however, subclustering should be performed hierarchically, so that each cluster, subcluster, or lower-level grouping cannot be further decomposed.
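The clustering success rate defined above, the proportion of individuals correctly assigned to their breeds of origin, can be illustrated with a deliberately simplified sketch. It is not the multi-step procedure of Fig 1: it assumes each individual has already been placed in a single cluster, and it associates each cluster with the breed that contributes most of its members. The function name and example data are hypothetical.

```python
from collections import Counter


def clustering_success_rate(true_breeds, assigned_clusters):
    """Fraction of individuals whose cluster maps back to their own breed.

    Simplifying assumption: each cluster is associated with the breed that
    contributes the most individuals to it (majority rule).
    """
    members = {}
    for breed, cluster in zip(true_breeds, assigned_clusters):
        members.setdefault(cluster, Counter())[breed] += 1
    cluster_to_breed = {
        cluster: counts.most_common(1)[0][0] for cluster, counts in members.items()
    }
    correct = sum(
        cluster_to_breed[cluster] == breed
        for breed, cluster in zip(true_breeds, assigned_clusters)
    )
    return correct / len(true_breeds)


# Made-up example: six individuals from two breeds, one placed in the wrong cluster.
breeds = ["4", "4", "4", "19", "19", "19"]
clusters = [0, 0, 1, 1, 1, 1]
rate = clustering_success_rate(breeds, clusters)
```

Majority rule does not handle several breeds sharing one cluster, which is the situation the subclustering step in the text is meant to resolve.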
Pairwise cluster analysis:
We assessed populations two at a time with neighbor-joining tree diagrams of the individuals in two populations (![]()
For each pair of breeds, we also ran the cluster analysis using 20,000 iterations and a burn-in period of 5000, with K = 2. The clustering success rate was measured using the algorithm in Fig 1, though the criterion for subclustering was not met for any pair of populations.
Clustering success as a function of the number of markers:
To determine properties of markers that make them effective in cluster analysis, we performed cluster analysis using subsets of the original 27 markers according to several variability criteria. For each criterion, and for each value of M (M = 1, 2, 3, ..., 27), we selected the M markers that exhibited the highest values of that criterion, and we performed cluster analysis using that subset of loci. In cases where two or more criteria produced the same subset, we only performed one analysis for that subset. The criteria included the following: (1) Expected heterozygosity: treating the whole sample as one group, for each locus we computed one minus the sum of the squares of the sample allele frequencies; (2) total number of alleles in the sample: if two or more loci had the same number of alleles, we broke ties by ranking markers in order of the mean number of alleles per breed; (3) Fst: we estimated Fst according to ![]()
We also considered marker subsets taken in reverse order by expected heterozygosity, and we used a random ordering of the loci: For each value of M, we selected the M markers that were associated with the M highest random numbers. Rankings of markers are shown in Table 1. With the exception of orderings induced by the number of alleles and expected heterozygosity (Kendall coefficient = 0.464, P = 0.0007), we did not detect evidence for rank correlation among pairs of orderings (of course, the Kendall coefficient was -1 for the orderings by highest and lowest expected heterozygosity, and it equaled -0.464 for the orderings by highest number of alleles and lowest expected heterozygosity).
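The expected heterozygosity criterion (one minus the sum of squared sample allele frequencies, treating the whole sample as one group) and the selection of the M highest-ranking markers can be sketched as below. The function names, locus labels, and allele lists are hypothetical and serve only as an illustration.

```python
from collections import Counter


def expected_heterozygosity(allele_copies):
    """One minus the sum of squared allele frequencies in the pooled sample.

    allele_copies: list of all allele copies observed at one locus.
    """
    counts = Counter(allele_copies)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def top_markers(alleles_by_locus, m):
    """Return the m loci with the highest expected heterozygosity."""
    ranked = sorted(
        alleles_by_locus,
        key=lambda locus: expected_heterozygosity(alleles_by_locus[locus]),
        reverse=True,
    )
    return ranked[:m]


# Made-up allele copies at three hypothetical loci.
sample = {
    "locus_1": ["a", "a", "a", "a"],  # monomorphic, H = 0
    "locus_2": ["a", "b", "a", "b"],  # two equally common alleles, H = 0.5
    "locus_3": ["a", "b", "c", "d"],  # four equally common alleles, H = 0.75
}
best_two = top_markers(sample, 2)
```

The same ranking function applies to biallelic loci, where the statistic reduces to 2p(1 - p) for allele frequency p.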
Clustering success as a function of the number of individuals: To see how cluster analysis performed with fewer individuals, for each value of N (N = 5, 10, 15, 20, 25), we repeated the analysis (with all markers and with marker subsets) using N randomly chosen individuals from each breed.
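Drawing N randomly chosen individuals from each breed can be sketched as follows. The data layout, identifiers, and the fixed seed are assumptions made for the illustration; the source does not describe how the random draws were implemented.

```python
import random


def subsample_per_breed(individuals_by_breed, n, seed=0):
    """Choose n individuals without replacement from every breed.

    individuals_by_breed: hypothetical dict mapping breed label -> list of
    individual identifiers. A fixed seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    return {
        breed: rng.sample(individuals, n)
        for breed, individuals in individuals_by_breed.items()
    }


# Made-up identifiers in the style "breed_individual", 30 per breed.
data = {
    breed: [f"{breed}_{i}" for i in range(1, 31)]
    for breed in ("4", "19", "32")
}
subsets = {n: subsample_per_breed(data, n) for n in (5, 10, 15, 20, 25)}
```

Each subset can then be passed to the clustering analysis with all markers or with a chosen marker subset.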
| RESULTS |
|---|
Genetic differentiation:
For each pair of breeds, the null hypothesis that the two populations had equal allele frequencies was rejected at the 0.001 significance level for at least 6 loci (not shown). Even between the most closely related pairs of breeds, extremely significant differences were found. The only breed pairs for which 15 or fewer loci had significantly different allele frequencies at the 0.001 level were (44, 45), (5, 16), (16, 18), (18, 26), and (37, 3402). For several pairs, the null hypothesis of equal allele frequencies was rejected for at least 26 of 27 loci. These pairs included (4, 28), (4, 51), (4, 50), (26, 32), (28, 32), (32, 33), (32, 50), (32, 102), and (37, 102). Genetic distances were generally large as well (not shown), with only 10 pairwise comparisons <0.5 and with the average pairwise distance equaling 0.782. The lowest genetic distances were found for the following pairs: (44, 45), (5, 16), (16, 18), (18, 37), and (45, 51). The 25 largest genetic distances involved breeds 4, 19, 32, and 102.
Clustering of breeds:
Due to the complexity of the relationships among the individuals in the data and the existence of numerous likely clustering solutions, different runs of structure identified different potential clusterings of the individuals (Table 2). Some features of the clustering were consistent across runs. Most strikingly, breeds 4, 19, 27, 32, and 102 always fell into their own clusters, while breeds 44 and 45 always shared the same cluster. Breeds 5, 13, 21, 26, 28, 51, and 3402 usually occupied their own clusters, and breeds 18 and 37 were often found together in a single cluster.
In 9 of the 100 runs performed, 19 clusters were assigned nontrivial fractions of the data. The remaining runs included 43, 44, and 4 runs for which 18, 17, and 16 clusters were occupied, respectively. In the 8 solutions of highest likelihood, breeds 44 and 45 shared a single cluster and each of the other 18 breeds occupied an exclusive cluster. The most frequent groupings (Table 2), including (5, 16), (16, 18), (18, 37), (33, 44, 45), (37, 3402), (42, 50), (44, 45), and (44, 45, 51), appeared in high-likelihood solutions, while rare groupings were obtained in low-likelihood solutions. None of the rare groupings (13, 26), (16, 33), (21, 26), (26, 42), (28, 42), or (33, 51) occurred in any of the 40 solutions of highest likelihood. The 10 lowest-likelihood runs contained the single instances that produced (21, 26) and (28, 42), as well as three of four instances in which the grouping (13, 26) was obtained.
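Tallying how often particular sets of breeds share a cluster across repeated runs can be sketched as below. This is a hypothetical illustration: the function name, the per-run mapping from breed to cluster, and the three example runs are invented and do not reproduce Table 2.

```python
from collections import Counter


def grouping_frequencies(runs):
    """Count, across runs, each set of two or more breeds that share a cluster.

    runs: list of dicts, one per run, mapping breed label -> cluster index
    (for example, the cluster holding most of that breed's individuals).
    """
    tally = Counter()
    for breed_to_cluster in runs:
        groups = {}
        for breed, cluster in breed_to_cluster.items():
            groups.setdefault(cluster, []).append(breed)
        for members in groups.values():
            if len(members) > 1:
                tally[tuple(sorted(members))] += 1
    return tally


# Three made-up runs involving breeds 44, 45, and 51.
example_runs = [
    {44: 0, 45: 0, 51: 1},
    {44: 0, 45: 0, 51: 0},
    {44: 2, 45: 2, 51: 3},
]
frequencies = grouping_frequencies(example_runs)
```

Sorting the tallied groupings by the likelihood of the runs in which they occur would then separate frequent, high-likelihood groupings from rare ones.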
Breeds that grouped into clusters generally fell close to each other in their placement on the neighbor-joining cladogram (Fig 2), although frequently clustered groups did not always form clades. Of the eight most commonly clustered sets, three did not form clades, namely (33, 44, 45), (16, 18), and (18, 37). Bootstrap confidence values for groupings in the cladogram were generally low.
Pairwise clustering:
Although runs using all 20 breeds clustered pairs or triples of populations because 20 breeds were placed into 19 clusters, cluster analysis using only the individuals from 2 breeds separated them into 2 clusters. Of 190 pairs, 175 could be perfectly separated (Table 3). For the remaining 15 pairs, at most 5 individuals of 60 were placed incorrectly. However, for only 5 of these 15 pairs were individuals assigned to the wrong breed with high confidence (>75%). The clustering success rate was also high for the two triads of populations that grouped together: For both (33, 44, 45) and (44, 45, 51), only individual 45_1 was misplaced (with breed 44).
For 188 of 190 breed pairs, the neighbor-joining tree was weakly consistent with breed affiliation (Fig 3). Of these 188 trees, 170 were strongly consistent with breed affiliation. The 2 pairs for which trees were not consistent (Fig 3), (5, 16) and (44, 45), were among the pairs for which clustering was imperfect. The 18 pairs for which trees were weakly consistent but not strongly consistent with breed affiliation were (5, 18), (5, 33), (5, 50), (5, 102), (13, 26), (13, 102), (16, 18), (16, 50), (16, 102), (18, 37), (27, 102), (28, 102), (33, 45), (33, 102), (42, 50), (42, 102), (50, 102), and (51, 102).
Evaluation of clustering:
Using the complete set of 27 markers, cluster analysis obtained correct groupings of individuals with high accuracy (Fig 4). When only the most polymorphic markers were selected according to the greatest number of alleles or the highest expected heterozygosity, only ~8–10 markers were needed to attain 95% accurate clusterings. Once 11–12 markers were chosen, expected heterozygosity, number of alleles, and Fst performed similarly, achieving 95–98% in almost every run. Although the random ordering achieved 90% clustering accuracy with 10–12 markers, it required 17–20 markers to achieve 95%. The reverse ordering by expected heterozygosity required 14–15 loci to achieve 90% and 17–20 loci to reach 95%. When only a few markers were used, the discrepancy between the two most effective criteria and the other criteria was extremely high. The marker sets chosen by reverse order of expected heterozygosity performed particularly poorly compared to the other methods. However, as the number of markers increased, all criteria produced nondisjoint sets of markers, and when nearly all markers were used, the accuracy of clustering was ~98% for each criterion.
For further analysis and discussion, we used expected heterozygosity. This statistic and the number of alleles are highly correlated (in fact, the most variable marker and the set of seven most variable markers coincided according to these two criteria) and they produce similarly accurate clusterings. However, expected heterozygosity is more generally useful: for example, if single nucleotide polymorphisms are used, it provides a natural method to rank loci that all have two alleles.
Most breeds were clustered perfectly (Table 4), and although clustering solutions differed across runs, the same individuals tended to be misclassified across runs. For some breeds, including 4, 19, 27, and 32, all individuals were perfectly clustered using a small number of highly heterozygous loci. Others, including 5, 13, 16, 18, 44, 45, and 102, required many loci to obtain correct clustering. For breeds 5, 16, 45, and 102, large numbers of markers did not improve classification of a few specific individuals.
When subsets of the individuals were chosen, clustering accuracy declined. When 5 individuals were chosen from each breed, 27 markers were insufficient to obtain 90% accuracy (Fig 5). When 10 individuals were chosen, 21 markers were sufficient to achieve 90%. When 15 or more individuals were selected from each breed, 90% accuracy was attained using the 12 most variable loci.
| DISCUSSION |
|---|
Correspondence of inferred and known population structure:
When the full data set was used, inferred genetic clusters of individuals corresponded extremely well to predefined breed categorizations. Since similar likelihoods of many proposed clusterings make it difficult to label a "best" clustering of the data, we suggest that, for large data sets, cluster analysis should be performed multiple times before inferences are drawn. All solutions had in common that each cluster contained all or nearly all individuals from one or a few breeds. Upon further analysis, all clusters that contained more than one breed could be subdivided into a collection of subclusters, each of which matched a single breed.
While structure easily separated individuals into clusters that corresponded almost exactly to phenotypic labels, the bootstrap neighbor-joining cladogram was less capable of grouping subsets of the data with great regularity. Several possibilities can explain this discrepancy. First, while structure constructs genetic clusters from individual genotypes without reference to breed affiliation, cladograms assume that genetic clusters correspond to breed designations. This correspondence essentially holds, although 10–15 individuals frequently appeared more similar to breeds from which they did not originate. The inclusion of these individuals in breed groupings used for the neighbor-joining tree potentially decreases genetic distances of certain pairs and hence affects the cladogram. However, upon removal of all individuals that were sometimes placed incorrectly in pairwise clustering (Table 3), the cladogram was essentially unchanged (not shown); thus, these individuals cannot explain its poor reliability.
An alternate explanation for the performance of the neighbor-joining tree is the fact that in domesticated species such as chickens, population histories may not follow a bifurcating tree model, so that tree diagrams present a misleading or inaccurate representation of population relationships. The considerable frequency of gene exchange among historical chicken populations could potentially explain the low bootstrap values on internal edges of the tree and on edges that group feral and traditional breeds.
Finally, it is likely that structure simply uses individual genotypic data more efficiently than cladograms based on genetic distance matrices (![]()
It has been argued that 30 markers are insufficient for distinguishing related populations using phylogenetic analysis (![]()
1 year for these breeds, so that considerable genetic variation has built up within and among chicken breeds (![]()
In pairwise analysis of populations, clustering and neighbor-joining trees performed similarly. We note that neighbor-joining trees of individuals from two breeds are more useful if the strong criterion of separation is used. If genetic origins of individuals in two populations are known, the weak criterion of consistency for separating populations is applicable: two populations are separated if there exists a decomposition of the tree into two components, each corresponding to a population. However, if genetic origins are unknown beforehand, an objective method must be used to separate the tree into components; under these circumstances, the strong criterion of consistency must be applied. When this criterion is used, cluster analysis performs slightly better than neighbor-joining trees in separating populations. Clustering also offers the opportunity for significance testing using R × C tests of association (![]()
Strategies for successful clustering:
Highest expected heterozygosity and highest number of alleles provided the best ways to select loci for clustering and were better than highest Fst. This result was surprising: The most useful genetic marker for clustering populations and assigning individuals is one that varies greatly across populations but little within them. A perfect locus for these purposes would be monomorphic within any given breed but polymorphic across breeds (![]()
Several highly variable markers, the tetranucleotide loci LEI192 and LEI228 most dramatically, had many alleles that were specific to at most a few populations and frequent in those populations. These markers also had generally low Fst values, since more common alleles did not greatly differ in frequency across breeds. It seems likely that these "diagnostic" alleles were partly responsible for the extremely successful clustering with the number of alleles and expected heterozygosity statistics. For 27 markers, we observed 101 alleles private to a single breed out of 326 total alleles. For 62 of the private alleles, two or more copies were observed, and, thus, these alleles are unlikely to result from genotyping errors. Since so many alleles in this study were breed specific and since many more were found only in two or three breeds, it is possible that these alleles had a substantial effect on clustering success. However, the number of private alleles in a sample decreases as the sample size from a breed increases. A large number of private alleles is not a property of most data sets, and, thus, the highest number of private alleles cannot be recommended as a criterion method for choosing the best markers to use. In closely related populations, private alleles may be uncommon: for example, using data from a study of 11 human populations (![]()
The successful performances of all other criteria compared to the reverse ordering by expected heterozygosity demonstrate that a careful choice of markers increases the power to achieve accurate clustering. This idea that a careful choice of markers can improve statistical power has been employed in the estimation of population of origin for admixed individuals (![]()
Depending on the species under consideration, the relative cost of genotyping more individuals and genotyping more markers will vary. We did not achieve 90% success in clustering when only 5 individuals were used per breed, but with 10 or more individuals per breed, clustering was highly successful when enough markers were used. Similarly, we did not achieve 90% success when fewer than 6–7 markers were used, even when all individuals were included. Thus, as a minimum, for similarly diverged populations to those in our study, at least 12–15 highly variable markers should be genotyped in at least 15–20 individuals per hypothesized population to achieve accurate clustering. In species for which genetic research is still preliminary, genotyping can be done sequentially: A small number of individuals can be genotyped for many markers. The most variable markers can then be selected for future study and then genotyped in a large sample.
While using the most variable markers allows researchers to minimize genotyping effort, we caution against using this type of marker set with statistical methods that assume a random set of loci and that make inferences based on mean variabilities across loci. A marker set selected for maximal variability will inflate estimates of divergence times estimated using the genetic distance (δµ)² (![]()
We note that the 20 breeds genotyped here were chosen from among many breeds used in an earlier study (![]()
Problematic individuals:
In most runs, the clustering success rate remained below ~98%, though this level could be obtained when 15–20 highly variable markers were used. Given this observation, it is surprising that the full set of 27 markers did not achieve 100% accuracy. Errors in the clustering algorithm seem to be an unlikely explanation, since roughly the same sets of individuals were placed in the wrong clusters in runs that used different sets of loci or that produced different clustering solutions. Since all breeds were sampled from populations maintained in different locations, it is unlikely that recent admixture or labeling errors explain the improper placements. We suspect that the inability to achieve perfect clustering results from the fact that some individuals were genetically atypical of their breeds, and the algorithm could not recognize breeds of origin for these individuals. The frequently misplaced individuals 102_19, 102_20, and 102_21 derived from a flock of zoo animals that may have undergone considerable genetic drift. Individuals 16_21, 16_22, and 16_23 came from a single flock, one of many that were incorporated into the breed 16 sample; this flock may have been managed differently from the others. Interestingly, only one individual was misplaced from the closely related breeds 44 and 45: This suggests that structure may be useful for distinguishing lines from different breeding companies, in spite of common origin and similar selection objectives.
Cluster analysis and population assignment:
Placement of individuals into clusters is related, but not identical, to assignment of unknown individuals to populations. Assignment tests assume the existence of distinct populations and use properties of those groups, such as allele frequencies, to infer the source populations of unknown individuals (![]()
Our results are best interpreted as verification that these individuals indeed form genetic clusters that correspond to their breed designations and that they can be used in a training sample for assignment of future unknowns. This training sample can be utilized differently by various assignment algorithms. For example, the method of ![]()
The importance of training samples for population assignment suggests a strategy by which future assignment studies can be optimized. For any species of interest, the most variable markers according to expected heterozygosity, number of alleles, or Fst should be genotyped on a large scale. New markers could be tested by the criterion, and highly variable new markers could potentially be included in the set of most variable markers, reducing the number of markers needed for clustering studies below the current recommendation of 12–15. A database of individual genotypes at these most variable loci could then be made publicly available. New individuals could be genotyped for the most variable markers and could then be added to the database. Individuals in the database that are known to represent certain breeds could be used as a training sample for assignment tests. Individuals that were misassigned or that were difficult to assign correctly could be excluded from the database, so that only the individuals that can confidently be assigned to the correct breeds would be included.
Such a database might be extremely useful to researchers who may only have one or a few unknown individuals that they wish to identify (e.g., ![]()
Cluster analysis and genetic distinctiveness:
We observed that some breeds were easier to separate into clusters than others, in the sense that all individuals in some breeds were correctly placed with only a small number of markers. This likely derives from the presence of distinctive multilocus genetic combinations in the breeds that were easiest to separate. Thus, we suggest that the relative number of loci required for the correct clustering of several breeds can be used as a way of identifying populations that are genetically distinctive with respect to a collection.
In addition to resolving questions about population histories (![]()
Relationships of chicken breeds:
Considerable attention has been devoted to the study of genetic diversity and relationships of chickens. Some studies focused on commercial breeds (![]()
Conclusions:
We have discussed the application of genetic cluster analysis to 600 individuals from 20 chicken breeds, demonstrating that the technique has great potential to correctly identify population structure. We have argued that individual clustering provides a more appropriate characterization of population structure in these groups than does a neighbor-joining tree. Last, we have proposed recommendations on future uses of genetic cluster analysis and individual assignment tests in similarly diverged collections of populations: (1) At least 12-15 highly variable loci should be genotyped in at least 15-20 individuals per hypothesized population; (2) markers with the highest expected heterozygosity, number of alleles, and Fst can be used in genetic cluster analysis to minimize genotyping costs; (3) databases of multilocus genotypes obtained at highly variable markers in individuals of known origins can be established to provide training samples for assignment algorithms; (4) genetically distinctive populations can be identified on the basis of how difficult it is to separate them from other breeds when cluster analysis is used; and (5) cluster analysis can provide an additional tool for identification of population relationships, history, and within-species genetic units for conservation.
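To illustrate recommendation (3), the following minimal sketch shows how a training sample of individuals with known breed labels might be used to assign an unknown individual. It is a simple frequency-based likelihood calculation that assumes Hardy-Weinberg proportions and independent loci, with a pseudocount added so that unseen alleles do not produce zero likelihoods. It is not the clustering method evaluated in this study, and all breed names, locus names, and genotypes are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_counts(training):
    """Tally allele counts per breed and locus from labeled genotypes.

    `training` is a list of (breed, genotype) pairs, where a genotype maps
    a locus name to a pair of alleles.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    alleles_seen = defaultdict(set)
    for breed, genotype in training:
        for locus, pair in genotype.items():
            counts[breed][locus].update(pair)
            alleles_seen[locus].update(pair)
    return counts, alleles_seen


def log_likelihood(genotype, breed_counts, alleles_seen, pseudocount=1.0):
    """Log-likelihood of a genotype given one breed's allele counts."""
    total = 0.0
    for locus, (a, b) in genotype.items():
        locus_counts = breed_counts[locus]
        k = len(alleles_seen[locus] | {a, b})  # number of allele classes
        n = sum(locus_counts.values())

        def freq(allele):
            # Smoothed allele frequency estimate.
            return (locus_counts[allele] + pseudocount) / (n + pseudocount * k)

        p, q = freq(a), freq(b)
        total += math.log(p * p if a == b else 2 * p * q)
    return total


def assign(genotype, counts, alleles_seen):
    """Return the best-scoring breed and the per-breed log-likelihoods."""
    scores = {
        breed: log_likelihood(genotype, counts[breed], alleles_seen)
        for breed in counts
    }
    return max(scores, key=scores.get), scores


# Hypothetical training sample and unknown individual.
training = [
    ("breed_A", {"L1": (100, 100), "L2": (150, 150)}),
    ("breed_A", {"L1": (100, 102), "L2": (150, 150)}),
    ("breed_B", {"L1": (104, 104), "L2": (152, 154)}),
    ("breed_B", {"L1": (104, 102), "L2": (152, 152)}),
]
unknown = {"L1": (100, 100), "L2": (150, 150)}

counts, alleles_seen = build_counts(training)
best, scores = assign(unknown, counts, alleles_seen)
print(best)
```

In practice the difference between the best and second-best log-likelihoods could be used to flag individuals that are difficult to assign.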
| FOOTNOTES |
|---|
1 These authors are listed alphabetically.
| ACKNOWLEDGMENTS |
|---|
The authors thank Nina Dudnik and Jonathan Pritchard for helpful comments. This study arose during a visit by N.A.R. to the laboratory of J.H. N.A.R. is supported by a Program in Mathematics and Molecular Biology graduate fellowship. This research was supported by the European Community-funded project AVIANDIV (Development of Strategy and Application of Molecular Tools to Assess Biodiversity in Chicken Genetic Resources, BIO4CT980342) and by National Institutes of Health grant GM28428 to M.W.F.
Manuscript received March 26, 2001; Accepted for publication August 1, 2001.
| LITERATURE CITED |
|---|
BEAUMONT, M., E. M. BARRATT, D. GOTTELLI, A. C. KITCHENER, and M. J. DANIELS et al., 2001 Genetic diversity and introgression in the Scottish wildcat. Mol. Ecol. 10:319-336.
BOWCOCK, A. M., A. RUIZ LINARES, J. TOMFOHRDE, E. MINCH, and J. R. KIDD et al., 1994 High resolution of human evolutionary trees with polymorphic microsatellites. Nature 368:455-457.
BUCHANAN, F. C., L. J. ADAMS, R. P. LITTLEJOHN, J. F. MADDOX, and A. M. CRAWFORD, 1994 Determination of evolutionary relationships among sheep breeds using microsatellites. Genomics 22:397-403.
CIAMPOLINI, R., H. LEVEZIEL, E. MAZZANI, C. GROHS, and D. CIANCI, 2000 Genomic identification of the breed of an individual or its tissue. Meat Sci. 54:35-40.
CORNUET, J.-M., S. PIRY, G. LUIKART, A. ESTOUP, and M. SOLIGNAC, 1999 New methods employing multilocus genotypes to select or exclude populations as origins of individuals. Genetics 153:1989-2000.
CRANDALL, K. A., O. R. P. BININDA-EMONDS, G. M. MACE, and R. K. WAYNE, 2000 Considering evolutionary processes in conservation biology. Trends Ecol. Evol. 15:290-295.
CROOIJMANS, R. P. M. A., A. F. GROEN, A. J. A. VAN KAMPEN, S. VAN DER BEEK, and J. J. VAN DER POEL et al., 1996 Microsatellite polymorphism in commercial broiler and layer lines estimated using pooled blood samples. Poult. Sci. 75:904-909.
DAVIES, N., F. X. VILLABLANCA, and G. K. RODERICK, 1999 Determining the source of individuals: multilocus genotyping in nonequilibrium population genetics. Trends Ecol. Evol. 14:17-21.
DEVLIN, B. and K. ROEDER, 1999 Genomic control for association studies. Biometrics 55:997-1004.
DUNNINGTON, E. A., L. C. STALLARD, J. HILLEL, and P. B. SIEGEL, 1994 Genetic diversity among commercial chicken populations estimated from DNA fingerprints. Poult. Sci. 73:1218-1225.
FELSENSTEIN, J., 1993 PHYLIP (Phylogeny Inference Package). Department of Genetics, University of Washington, Seattle.
GOLDSTEIN, D. B., A. RUIZ LINARES, L. L. CAVALLI-SFORZA, and M. W. FELDMAN, 1995 Genetic absolute dating based on microsatellites and the origin of modern humans. Proc. Natl. Acad. Sci. USA 92:6723-6727.
GROENEN, M. A. M., H. H. CHENG, N. BUMSTEAD, B. F. BENKEL, and W. E. BRILES et al., 2000 A consensus linkage map of the chicken genome. Genome Res. 10:137-147.
HILLEL, J., A. KOROL, V. KIRZNER, P. FREIDLIN, S. WEIGEND et al., 1999 Biodiversity of chickens based on DNA pools: first results of the EC funded project AVIANDIV, pp. 22-29 in Poultry Genetics Symposium, Proceedings, edited by R. PREISINGER. Lohmann Tierzucht, Cuxhaven, Germany.
JIN, L., M. L. BASKETT, L. L. CAVALLI-SFORZA, L. A. ZHIVOTOVSKY, and M. W. FELDMAN et al., 2000 Microsatellite evolution in modern humans: a comparison of two data sets from the same populations. Ann. Hum. Genet. 64:117-134.
KAISER, M. G., N. YONASH, A. CAHANER, and S. J. LAMONT, 2000 Microsatellite polymorphism between and within broiler populations. Poult. Sci. 79:626-628.
MAFENI, M. J., K. WIMMERS, and P. HORST, 1997 Genetic diversity in indigenous Cameroon and German Dahlem Red fowl populations estimated from DNA fingerprints. Arch. Tierz. 40:581-589.
MINCH, E., A. RUIZ-LINARES, D. B. GOLDSTEIN, M. W. FELDMAN and L. L. CAVALLI-SFORZA, 1998 Microsat2: A Computer Program for Calculating Various Statistics on Microsatellite Allele Data. Department of Genetics, Stanford University, Stanford, CA.
MOAZAMI-GOUDARZI, K., D. LALOË, J. P. FURET, and F. GROSCLAUDE, 1997 Analysis of genetic relationships between 10 cattle breeds with 17 microsatellites. Anim. Genet. 28:338-345.
MORITZ, C., 1994 Defining evolutionarily significant units for conservation. Trends Ecol. Evol. 9:373-375.
MOUNTAIN, J. and L. L. CAVALLI-SFORZA, 1997 Multilocus genotypes, a tree of individuals, and human evolutionary history. Am. J. Hum. Genet. 61:705-718.
NATIONAL RESEARCH COUNCIL, 1996 The Evaluation of Forensic DNA Evidence. National Academy Press, Washington, DC.
NOTTER, D. R., 1999 The importance of genetic diversity in livestock populations of the future. J. Anim. Sci. 77:61-69.
PAETKAU, D., 1999 Using genetics to identify intraspecific conservation units: a critique of current methods. Conserv. Biol. 13:1507-1509.
PAETKAU, D., W. CALVERT, I. STIRLING, and C. STROBECK, 1995 Microsatellite analysis of population structure in Canadian polar bears. Mol. Ecol. 4:347-354.
PONSUKSILI, S., K. WIMMERS, and P. HORST, 1996 Genetic variability in chickens using polymorphic microsatellite markers. Thai J. Agric. Sci. 29:571-580.
PONSUKSILI, S., K. WIMMERS, and P. HORST, 1998 Evaluation of genetic variation within and between different chicken lines by DNA fingerprinting. J. Hered. 89:17-23.
PONSUKSILI, S., K. WIMMERS, F. SCHMOLL, P. HORST, and K. SCHELLANDER, 1999 Comparison of multilocus DNA fingerprints and microsatellites in an estimate of genetic distance in chicken. J. Hered. 90:656-659.
PRIMMER, C. R., M. T. KOSKINEN, and J. PIIRONEN, 2000 The one that did not get away: individual assignment using microsatellite data detects a case of fishing competition fraud. Proc. R. Soc. Lond. Ser. B 267:1699-1704.
PRITCHARD, J. K., M. STEPHENS, and P. J. DONNELLY, 2000 Inference of population structure using multilocus genotype data. Genetics 155:945-959.
RANNALA, B. and J. L. MOUNTAIN, 1997 Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA 94:9197-9201.
REED, T. E., 1973 Number of gene loci required for accurate estimation of ancestral population proportions in individual human hybrids. Nature 244:575-576.
ROSENBERG, N. A., E. WOOLF, J. K. PRITCHARD, T. SCHAAP, and D. GEFEL et al., 2001 Distinctive genetic signatures in the Libyan Jews. Proc. Natl. Acad. Sci. USA 98:858-863.
SAITOU, N. and M. NEI, 1987 The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
SHRIVER, M. D., M. W. SMITH, L. JIN, A. MARCINI, and J. M. AKEY et al., 1997 Ethnic-affiliation estimation by use of population-specific DNA markers. Am. J. Hum. Genet. 60:957-964.
TAKAHASHI, H., K. NIRASAWA, Y. NAGAMINE, M. TSUDZUKI, and Y. YAMAMOTO, 1998 Genetic relationships among Japanese native breeds of chicken based on microsatellite DNA polymorphisms. J. Hered. 89:543-546.
TIXIER-BOICHARD, M., G. COQUERELLE and C. VILELA-LAMEGO, 1999 Contribution of data on history, management and phenotype to the description of the diversity between chicken populations sampled within the AVIANDIV project, pp. 15-21 in Poultry Genetics Symposium, Proceedings, edited by R. PREISINGER. Lohmann Tierzucht, Cuxhaven, Germany.
VANHALA, T., M. TUISKULA-HAAVISTO, K. ELO, J. VILKKI, and A. MÄKI-TANILA, 1998 Evaluation of genetic variability and genetic distances between eight chicken lines using microsatellite markers. Poult. Sci. 77:783-790.
WEIGEND, S., 1999 Assessment of biodiversity in poultry with DNA markers, pp. 7-14 in Poultry Genetics Symposium, Proceedings, edited by R. PREISINGER. Lohmann Tierzucht, Cuxhaven, Germany.
WEIR, B. S., 1996 Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.
WIMMERS, K., S. PONSUKSILI, F. SCHMOLL, T. HARDGE, and E. B. SONAIYA et al., 1999 Application of microsatellite analysis to group chicken according to their genetic similarity. Arch. Tierz. 42:629-639.
WIMMERS, K., S. PONSUKSILI, T. HARDGE, A. VALLE-ZARATE, and P. K. MATHUR et al., 2000 Genetic distinctness of African, Asian and South American local chickens. Anim. Genet. 31:159-165.
ZHIVOTOVSKY, L. A., L. BENNETT, A. M. BOWCOCK, and M. W. FELDMAN, 2000 Human population expansion and microsatellite variation. Mol. Biol. Evol. 17:757-767.
ZHOU, H. and S. J. LAMONT, 1999 Genetic characterization of biodiversity in highly inbred chicken lines by microsatellite markers. Anim. Genet. 30:256-264.