Abstract
To determine the relationships among closely related populations or species, two methods are commonly used in the literature: phylogenetic reconstruction or multivariate analysis. The aim of this article is to assess the reliability of multivariate analysis. We describe a method that is based on principal component analysis and Mantel correlations, using a twostep process: The first step consists of a singlemarker analysis and the second step tests if each marker reveals the same typology concerning population differentiation. We conclude that if single markers are not congruent, the compromise structure is not meaningful. Our model is not based on any particular mutation process and it can be applied to most of the commonly used genetic markers. This method is also useful to determine the contribution of each marker to the typology of populations. We test whether our method is efficient with two real data sets based on microsatellite markers. Our analysis suggests that for closely related populations, it is not always possible to accept the hypothesis that an increase in the number of markers will increase the reliability of the typology analysis.
ANALYSIS of genetic relationships is useful for phylogenetic or biodiversity studies. For this purpose, it is customary to use genetic markers such as protein and blood group polymorphisms or DNA markers. Generally, the approach for the genetic analysis of these data consists of calculating genetic distances and constructing trees. For example, Estoup et al. (1995) highlighted the use of microsatellite loci for a precise dissection of the genetic structure of honey bee colonies. Barker et al. (1997) analyzed the phylogenetic relationships of Asian water buffalos by comparing results from protein loci and microsatellite loci.
On the basis of theoretical studies, Takezaki and Nei (1996) have shown that one of the important factors for analyzing the correct position of populations in a genetic study is the number of loci used. Within this framework, the main task is to obtain a mean or consensus phylogeny. The reliability of the results is addressed through studies of confidence limits on phylogenies using bootstrap as in Felsenstein (1985) and Efron et al. (1996) or Markov chain Monte Carlo methods as in Li et al. (2000).
Alternatively, representations of the genetic relationships among populations may be obtained by using multivariate procedures. These techniques condense the information from several alleles and loci into a few synthetic variables. The connection between tree procedures and multivariate procedures is close (CavalliSforzaet al. 1994). The first splits in a tree generally correspond to the separation of populations generated by the first dimensions of the multivariate procedure. The use of these procedures in studies on genetic variation among populations has been pioneered by CavalliSforza and his colleagues over the last 25 years to reconstruct the history of human populations (Chenet al. 1985; CavalliSforza and Piazza 1993; Bowcocket al. 1994; CavalliSforzaet al. 1994; CavalliSforza 1997; Underhillet al. 2000). Blood group, protein, and DNA markers were typed for large population samples around the world. These studies provided a broad picture of human populations from a genetic perspective, with additional descriptions based on archeological data and linguistics.
Multivariate procedures are particularly attractive when admixtures are known to have occurred among the populations under study, because construction of trees using admixed populations contradicts the principles of phylogeny reconstruction (Felsenstein 1982). This situation is often found for genetic studies concerning breeds or populations from the same species, as in human population genetic studies (Saha and Tay 1992; Ayalaet al. 1994; Sahaet al. 1995; Boschet al. 1997; Jordeet al. 1997) or in studies concerning domestic breeds (Grosclaudeet al. 1990; MacHughet al. 1997; Cymbronet al. 1999; Yanget al. 1999; Cañonet al. 2000; Hansliket al. 2000; Wimmerset al. 2000). For instance, Blott et al. (1998) used blood group and protein polymorphisms to examine the genetic relationships among 37 European cattle breeds. They have shown that relationships among breeds reflect their geographical origin and common ancestry rather than the agricultural use for which the breeds have been selected. MacHugh et al. (1997) used different markers such as microsatellites, mtDNA, and a Y chromosomal probe to determine the evolutionary relationships among 20 different cattle breeds from Africa, Europe, and Asia. They confirmed the large divergence between Bos taurus and B. indicus lineages. In addition, they revealed a pattern of zebu genetic introgression in African taurine populations.
However, the reliability of multivariate analysis is never discussed. In this article, we investigate this problem by addressing the following specific questions: Is the consensus representation meaningful? Which markers influence the typology of populations? Finally, we study their interest and efficiency with bovine data.
MATERIALS AND METHODS
Fisher’s exact test: Equality of allelic frequencies per breed was tested for each marker by a Fisher’s exact test of independence. P values were estimated by a Monte Carlo procedure (Agresti 1992). Tests were performed with the procedure FREQ of SAS 8.1.2 (SAS Institute 2000).
Principal component analysis: Among the many multidimensional analysis methods, principal component analysis (PCA) offers a simple and powerful mode of analysis of a set of populationbygene frequency data. A detailed presentation of the general method can be found, for instance, in Mardia et al. (1979).
Let us consider g populations and an nallelic marker. p_{ik} denotes the allelic frequency of the kth allele in the ith population, and p_{.k} the mean allelic frequency of the kth allele in all the populations. As advocated by CavalliSforza et al. (1994) the use of an a priori standardization of the allelic frequencies p_{ik} by their estimated standard deviation σ_{k} = [p_{.k} (1  p_{.k})]^{1/2}, can be expected to improve the recovery of information. This standardization emphasizes contribution of rare alleles. The general term of the array X of standardized allelic frequencies is then equal to x_{ik} = (p_{ik}  p_{·k})/σ_{k}. This standardization is used throughout this article.
Each population is represented by a point in an mdimensional space, the coordinates of which are the m standardized allelic frequencies. The core of the PCA is to find the most scattered directions, or principal components, of this cloud of points. Principal components (PCs) are eigenvectors of the variancecovariance matrix of standardized allelic frequencies. The sum of the eigenvalues is the trace of this matrix and it is sometimes called “total variance.” PCs are ranked according to the fraction of total variance that each of them can independently explain: For instance, the first PC is by definition more informative than the second PC and all the PCs are independent from each other.
This method can be applied to the allelic frequencies of one marker (singlemarker analysis) or of several markers (overall analysis) to get the induced typology of populations.
Relationships among singlemarker analyses: Phylogenetic studies typically involve several markers and lead to several typologies, trees, or multidimensional plots. The first problem is to evaluate the relationship among these typologies, which is commonly called “interstructure.” Use of PCA leads to an implicit euclidean distance between populations i and j:
Relative positions of markers in this circle indicate the magnitude of association between these markers. For instance, if all markers are positively correlated, their correlations with the first axis are positive, and they will be positioned to the right half of the circle. If diagonalizing this matrix, PerronFrobenius theorem ensures that all the coefficients of the first principal component are positive (e.g., Mardiaet al. 1979). Conversely, dispersed positions of the markers inside the circle are a visual indicator of the incongruence of markers.
Test of structure congruence: If singlemarker structures are not congruent, no compromise structure will exist, Mantel correlations between distances will be indiscriminately positive or negative, and the sum S of all the m(m  1) Mantel correlations among the distances will not be significantly different from 0. To test the significance of S, a randomization test is conducted as follows: (1) calculation of the observed statistic S_{obs}, (2) random permutation of entire rows of each table X of standardized allelic frequencies, and (3) recalculation of S_{rnd} values. The probability that no compromise exists is calculated as (number of S_{rnd} ≥ S_{obs} + 1)/(number of randomizations + 1). The 1 in the numerator and the denominator represents the observed value for the statistic being evaluated, which is considered as a possible value of the randomization distribution.
Another method (MoazamiGoudarziet al. 1997), based on a nonparametric analysis of the variance, considers the standardized distance d_{ijk} between populations i and j computed from the allelic frequencies of marker k as a result of the following oneway linear model:
The null hypothesis H0 is then: All distances are equal. Conversely, if some typology exists, the alternative hypothesis H1 will be that at least one distance is different from the others. This test can be performed by numerous nonparametric methods, for instance, the KruskallWallis test, based on Wilcoxon rank scores (Conover 1980), in the procedure NPAR1WAY of SAS 8.1.2. (SAS Institute 2000).
If significant relationships are found for all the markers, a global analysis is meaningful and can be done by a PCA on the entire data. An interesting feature of PCA in this multimarker context is the possibility of evaluating the relative contribution of each marker in the structure of the principal components, to see if one of the principal components, and therefore part of the typology among populations, is due to the action of a few markers only or to the whole set of markers.
All computations relative to PCA were performed with the SAS 8 package (SAS Institute 2000).
Application to data: Now we test the efficiency of our method on real data. We present results obtained by analyzing two data sets. Data set 1 was obtained from eight French and two European cattle breeds and data set 2 was obtained from 20 distinct populations from Africa (10) and Europe (10). B. taurus, B. indicus, and one crossbred population are included in data set 2.
Data set 1, taken from MoazamiGoudarzi et al. (1997), contains data for 17 microsatellite loci (INRA K, INRA 005, INRA 011, INRA 013, INRA 016, INRA 023, INRA 025, INRA 032, INRA 035, INRA 037, INRA 040, INRA 063, INRA 064, INRA 072, ETH 131, ETH 152, and ETH 225) genotyped in 10 cattle breeds (Breton Black Pied, Charolais, Holstein, Jersey, Limousin, MaineAnjou, Montbeliard, Normand, Parthenais, and Vosgien). Samples were collected throughout France.
In data set 2, nine microsatellite loci (INRA K, INRA 005, INRA 016, INRA 032, INRA 035, INRA 063, INRA 072, ETH 152, and ETH 225) were studied in 20 different cattle populations including the 10 populations of data set 1 and 10 African populations [Somba, Lagunaire, N’dama, Baoulé, Kuri, Sudanese Fulani zebu, Red Bororo zebu, Shuwa zebu, Madagascar zebu, and Borgou (Shorthorn × zebu crossbreds)]. West African populations were sampled in three neighboring countries: Somba cattle samples were collected in the Atacora highlands (Northwestern Benin/Northeastern Togo), which is the birthplace of this breed. Lagunaire cattle samples were collected in Benin, N’dama, Baoulé, and Borgou; Sudanese Fulani zebu samples in BurkinaFaso and Kuri; Red Bororo zebu and Shuwa zebu samples in Chad; and Madagascar zebu samples in Madagascar.
For Somba, Lagunaire, Borgou (Shorthorn × zebu crossbreds), and Sudanese Fulani zebu samples, we used protocols described in MoazamiGoudarzi et al. (1997). Data concerning N’dama, Baoulé, Kuri, Red Bororo zebu, Shuwa zebu, and Madagascar zebu populations were taken from Souvenir Zafindrajaona et al. (1999). Data concerning the 10 European breeds were taken from MoazamiGoudarzi et al. (1997).
RESULTS AND DISCUSSION
Fisher’s exact test: The frequency distributions of each microsatellite breed combination were significantly different as demonstrated by Fisher’s exact tests (P < 10^{4}).
Mantel correlations: Mantel correlations have been computed for the two data sets (Tables 1 and 2). For data set 1, 73 out of the 136 Mantel correlations, i.e., more than onehalf of the coefficients, are negative. These results are confirmed by the correlation circle (Figure 1A). All the markers are scattered in the circle and only a few markers are in the same area (for example, INRA 011, INRA 013, INRA 035, and INRA 063). In addition, high P values equal to 0.48 for the permutation test on the sum of Mantel correlations and 0.60 for the KruskallWallis test were obtained. In this case, tests for the existence of compromise structure are not significant. This lack of structure can be explained by the small level of differentiation among populations or by different typologies exhibited by different markers. As demonstrated by Fisher’s exact tests, significant differences among breeds are exhibited by each microsatellite. For this data set, the second explanation is more likely. Further analyses are meaningless.
However, in data set 2, only 5 out of the 36 Mantel correlations, i.e., ∼1/7, are negative. This low proportion is a first indication of a congruent typology. These results are confirmed by the correlation circle (Figure 1B.). All the markers, except INRA 035, are clustered in the same area. Tests for the existence of a common typology are very significant. P values are equal to 0.0001 for both the Mantel correlation test and the KruskallWallis test. Thus, the search for a compromise typology is meaningful and therefore we performed an overall PCA.
Principal component analysis for data set 2: The first three PCs account for 59%, i.e., most of the variance (Figure 2A). The scatterplot is in Figure 3. The first PC accounts for 30% of the total variance and it clearly distinguishes three clusters: (1) the Kouri, Borgou, and zebu cluster; (2) the African taurine cluster; and (3) the French taurine cluster. The second PC summarizes 16% of the total variance and it clearly separates the African taurine breeds from the Madagascar zebu breed. The third PC describes 13% of the total variance and it strongly isolates the Madagascar zebu.
From a geographical point of view (Europa/Africa) and from a species point of view (B. taurus/B. indicus), the breeds included in this analysis are very well characterized. The first cluster is heterogeneous, since it includes the Borgou, a crossbred population between zebu and taurine breeds, and the Kuri, considered as a taurine population since it is humpless and has the small metacentric Y chromosome of B. taurus. This intermediate position of the Kuri breed has also been found by Mahé et al. (1999) and Souvenir Zafindrajaona et al. (1999). Divergent explanations have been proposed. The genetic differences between Kuri and other taurine populations of western Africa could have a relatively remote historical origin. For example, the ancestors of the N’dama probably spread through northern and northwestern Africa while those of Kuri spread through the Sahara during the “green period.” The gene frequencies of the original domesticated population of southwest Asia were probably closer to those of zebu than to those of modern taurines of western Africa. Thus, it is conceivable that Kuri has remained genetically closer to zebu. MacHugh et al. (1997) have concluded that an easttowest introgression gradient of microsatellite alleles exists in Indian B. indicus into African populations, including subpopulations of the taurine type N’dama. Another nonexclusive explanation could be that these results reflect recent crossbreeding with Shuwa and Red Bororo zebus located around Lake Chad.
Influence of individual markers: Contributions of markers to the first three PCs are in Figure 4. Five markers contribute roughly equally to the first PC: ETH 152 (18%), ETH 225 (15%), INRA 063 (15%), INRA k (15%), and INRA 032 (14%); i.e., a total of 77%. Single PCA performed with these markers shows a good separation, particularly between the zebu cluster and the other breeds.
Other markers, especially INRA 035, which contributes only for 3.6%, do not exhibit any clustering between taurine and zebu breeds. INRA 035 separates Charolais and Parthenais breeds from the others. No explanation was found for this specific structure. No null alleles were observed for any of the microsatellites used in this study. The allele of INRA 035 that we have sequenced was composed of a perfect, uninterrupted, and homogeneous TG. It is also known that gene selection processes may lead to the fixation of alleles and the polymorphism in the flanking region may be wiped out (Schlötterer and Wiehe 1999). Until now, no gene has been mapped in the region 16q11 where INRA 035 is localized (BovMap database http://locus.jouy.inra.fr). In addition, no homology between the whole sequence of this microsatellite and other mammalian sequences has been found (Blast search).
Contributions to other PCs are more disparate. In particular, the third PC, which strongly separates Madagascar zebu from other breeds, is supported mainly by two markers only: ETH 152 (27%) and INRA 032 (32%). Thus, the excentric position of the Madagascar zebu is not really reliable and should be confirmed by further analysis.
Robustness of the compromise structure: When a compromise typology exists, the majority of markers will contribute to the construction of the first PCs. These PCs will explain a great percentage of the variance, while the other PCs, which express specific actions of markers, are of less importance. In this case, the omission of noncongruent markers will not change the compromise typology since they do not participate in its construction. This analysis is robust against the presence/absence of an incongruent marker. For instance, in data set 2, ignoring INRA 035 in the analysis will not change the general compromise typology.
When each marker leads to an independent typology, each PC of the global PCA is associated with one marker and therefore each PC will explain roughly the same variance proportion. In this case, the omission of markers will still lead to a nonmeaningful joint analysis. This situation corresponds to our data set 1. The variance proportions explained by each PC (Figure 2B) show a slow decrease of values that contrasts with the corresponding number for data set 2. Results concerning contributions of markers to the construction of PCs (Figure 5) are also different from the analysis of data set 2. PCs, even the first one, are not built by a majority of markers, but by only two of them.
Comparison with the neighborjoining tree: It is interesting to compare our approach with the neighborjoining tree of data set 1 (MoazamiGoudarziet al. 1997). While all microsatellite loci showed significant differences among breeds, a lack of strong differentiation was observed by bootstrap resampling of loci. The most robust feature of the typology concerned the Holstein/MaineAnjou grouping with a bootstrap value of 74%.
Note that this problem is not specific to French cattle breeds but common for geographically close populations. Similar results have been obtained from many different sources of data relevant to genetic diversity and evolutionary history of humans (CavalliSforza and Piazza 1993; Bowcocket al. 1994; Underhillet al. 2000) or domestic breeds (cattle, MacHughet al. 1997; goats, Saitbekovaet al. 1999; chicken, Wimmerset al. 2000; dogs, Koskinen and Bredbacka 2000).
This lack of structure is commonly explained by the fact that too few markers were used for the analysis. However, this explanation implicitly assumes that each marker reveals the same typology among populations. In this context, differences that may be observed among typologies will be considered as typical residual errors or white noise, the influence of which will decrease with the number of markers involved. Consequently, when the number of loci increases, estimation error variance of the distances among populations will decrease (Foulley and Hill 1999), and bootstrap values of dendrograms will increase (Takezaki and Nei 1996).
Alternatively, this lack of structure could be explained by discrepancies among the typologies exhibited by each marker. This is highlighted by the fact that onehalf of the Mantel correlations among markers are negative in data set 1. The study of Laval et al. (2000) illustrated this phenomenon. They analyzed the genetic diversity of 11 European pig breeds on the basis of 18 microsatellites. They reported that these breeds exhibit a very strong differentiation and it was difficult to infer any reliable phylogeny among those populations. When the analysis was restricted to the 9 breeds for which genotypes were available at 25 loci, even less reliable results were obtained. Increasing the number of loci decreased the reliability of the phylogeny. Laval et al. (2000) suggested that populations had differentiated according to a radiative scheme of divergence. According to this scheme, expected genetic distances between populations would be equal, and any casual differences among distances might be due to random genetic drift. In this context, incongruent singlemarker structures could be obtained. This hypothesis has to be confirmed, for example, by simulation studies.
Assignment: Several studies have demonstrated the high potential of microsatellites for discrimination among individuals (Cornuetet al. 1999; Pritchardet al. 2000; Bjornstad and Roed 2001; Cañonet al. 2001). The lack of congruence among structures is not a disadvantage for assignment studies, where independence of discriminating factors is a great asset. For instance, in data set 1, INRA 035 discriminates the Parthenais breed from the others (allele 116 is present in this breed only, with a frequency of 21%; the frequency of allele 114 is 13% while it is <6% for other breeds), and INRA 013 isolates the Maine Anjou breed (the frequency of allele 188 is 37% in this breed while it is <10% in other breeds). These two markers are not congruent, but their combination makes it possible to discriminate both Parthenais and MaineAnjou breeds. For this data set, the proportion of animals correctly assigned to their population of origin was 90% (MoazamiGoudarziet al. 1998).
The contrasts between the good efficiency of assignment techniques and the weak structuralization among breeds are also found in other studies. Goldstein et al. (1999) analyzed microsatellite variations in populations of island foxes (Urocyon littoralis) on California’s Channel Islands with 19 microsatellites. Out of the five nodes of the UPGMA consensus tree, only one could be considered as significant, with a bootstrap value of 0.96, while the other bootstrap values were 0.5, 0.49, and 0.74. However, assignment of individuals to their geographic origin was already perfect with 181 individuals out of 183 correctly assigned. Cañon et al. (2000) studied the genetic structure of seven Spanish Celtic horse breeds by genotyping 13 microsatellites. Bootstrap values of the five nodes were 0.49, 0.51, 0.59, 0.64, and 0.79, while 7797% of correct assignments were obtained. Vilà et al. (2001) analyzed 15 microsatellites of 10 horse breeds and generally their bootstrap values were not significant while 95% of the individuals were assigned correctly.
CONCLUSION
In this article, we have described a twostep process: The first step consists of performing singlemarker analyses and studying relationships between them (interstructure) and the second step is building a compromise plot (intrastructure). We have focused on particular techniques, such as PCA, but other multivariate methods exist, such as correspondence analysis (Hill 1974), or multidimensional scaling (Carroll 1981). The study of intrastructure and testing of the existence of a compromise typology were based on distances and Mantel correlations and can be applied outside a strictly multidimensional context.
We have illustrated our method with real data provided by microsatellites. These markers are substantially more complicated than assumed. Knowledge of the mutation processes at microsatellite loci is currently insufficient. Several factors have been found to be relevant to the evolution of these repeated sequences, such as asymmetry in the distribution of mutations, dependence of the mutation rate on the number of repeats, and purity of alleles or constraints on allele size (Estoup and Cornuet 1999). The method described here is independent from the mutation model of the marker used. The advantage of this method is that it can be applied to various types of markers (e.g., phenotypical markers, proteins, blood groups, microsatellites, amplified fragment length polymorphisms, single nucleotide polymorphisms...). Because of the heterogeneity of the genome, it is more realistic to use different genetic systems. This will decrease the potential of misinterpretation when investigating phylogeographic processes. It could be particularly useful in the context of protection of the biodiversity of domestic species, where the usefulness of molecular data alone in the identification of priorities for conservation is under debate (Ruane 1999). As advocated by Crepaldi et al. (2001), the combined use of random neutral markers and expressed genes will allow a global assessment of the genome that might complement information of quantitative and unique traits and serve as a rational basis for an objective choice of the genetic resources to be conserved. In a context where genomic heterogeneity is no longer a negative but an interesting parameter, applying methods that permit different compromise typologies to be revealed would be very appealing.
Acknowledgments
We acknowledge the assistance of the respective breeders associations in the collection of French cattle samples. We acknowledge the following persons for their help in planning and conducting the sampling missions for African samples: V. Codja (Bénin), N. T. Kouagou (Togo), I. Sidibé (BurkinaFaso), and P. Souvenir Zafindrajaona (Chad and Madagascar).
Footnotes

Communicating editor: C. Haley
 Received October 1, 2001.
 Accepted June 12, 2002.
 Copyright © 2002 by the Genetics Society of America