Abstract
Computer simulations are used to evaluate maximum likelihood methods for inferring male fertility in plant populations. The maximum likelihood method can provide substantial power to characterize male fertilities at the population level. Results emphasize, however, the importance of adequate experimental design and evaluation of fertility estimates, as well as limitations to inference (e.g., about the variance in male fertility or the correlation between fertility and phenotypic trait value) that can be reasonably drawn.
One half of the nuclear genes in most plants pass through the male reproductive pathway, yet estimates of male fertility based on ecological observations such as dispersal distances of pollen analogues or observed pollinator movements can be “disappointingly crude” (Snow and Lewis 1993, p. 332): any one of a large number of individuals capable of producing male gametes may potentially sire a particular offspring. This situation is attributable to unique features of plant biology, particularly the difficulty of reliably circumscribing the pool of potential fathers.
Genetic markers can assist male fertility estimation. The most powerful marker-based methods (Devlin et al. 1988; Roeder et al. 1989; Brown 1990; Adams et al. 1992) partition paternity among genetically possible fathers using a maximum likelihood argument (Roeder et al. 1989; Smouse and Meagher 1994). Estimated fertilities may be used to evaluate specific hypotheses (e.g., that all males have equal fertility) and to describe patterns such as variation in male fertility (e.g., Devlin and Ellstrand 1990; Devlin et al. 1992; Smouse and Meagher 1994) or the relationship between male trait value and fertility as a measure of selection (e.g., Schoen and Stewart 1986; Broyles and Wyatt 1990; Conner et al. 1996).
Here I use computer simulation to document statistical power of maximum likelihood methods and to identify conditions under which reasonable insight into male fertility variation can be obtained. The focus is on allozyme data, where factors contributing to manageable experimental designs are well understood; speculation on possible results from highly variable markers is presented in the discussion. Results indicate the importance of genetic exclusion probability (ε, see Chakraborty et al. 1988; Devlin et al. 1988), number and size of maternal progeny arrays, and estimation of a limited number of fertilities. Future paternity studies require further mathematical analysis of maximum likelihood methods, or extensive computer simulation, to adequately evaluate the accuracy of inferences made.
MATERIALS AND METHODS
Maximum likelihood estimation: Smouse and Meagher (1994; following Roeder et al. 1989) develop a maximum likelihood estimator of male fertility for use in conjunction with electrophoretic or other genetic marker data. The problem is to estimate a vector λ of male fertilities, using a matrix X of genetic data. Each element of the fertility vector λ_{j} corresponds to the fertility of the jth unique male genotype, while the matrix entry X_{ij} is the probability of observing offspring genotype i given the genotypes of the maternal parent and the jth putative paternal parent (Devlin et al. 1988; Roeder et al. 1989). The likelihood of a vector of male fertilities, given observed offspring genotypes, is
A maximum of the likelihood can be found using the expectation maximization algorithm (Roeder et al. 1989, p. 373). One iteration of this algorithm transforms a value of male fertility λ_{j} to a value λ′_{j} using the formula
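The estimation step described above can be illustrated with a minimal sketch. It assumes the standard mixture form of the likelihood, in which the log likelihood is the sum over offspring i of log(Σ_j X_{ij} λ_{j}), and the usual expectation maximization update for mixture proportions, in which λ′_{j} is the average over offspring of X_{ij} λ_{j} / Σ_k X_{ik} λ_{k}. Both forms are assumptions made for illustration rather than formulas quoted from this article, and the function names and the toy matrix `X_toy` are hypothetical.

```python
import math

def em_male_fertility(X, n_iter=1000, tol=1e-12):
    """Estimate relative male fertilities from X[i][j], the probability of
    offspring genotype i given its mother and putative father j.
    Assumes every offspring is compatible with at least one male."""
    n, m = len(X), len(X[0])
    lam = [1.0 / m] * m  # start from the equal-fertility vector
    for _ in range(n_iter):
        new = [0.0] * m
        for row in X:
            # likelihood of this offspring under the current fertilities
            mix = sum(x * l for x, l in zip(row, lam))
            for j in range(m):
                # fractional paternity assigned to male j for this offspring
                new[j] += row[j] * lam[j] / mix
        new = [v / n for v in new]
        converged = max(abs(a - b) for a, b in zip(new, lam)) < tol
        lam = new
        if converged:
            break
    return lam

def log_likelihood(X, lam):
    """Log likelihood of a fertility vector under the assumed mixture form."""
    return sum(math.log(sum(x * l for x, l in zip(row, lam))) for row in X)

# Hypothetical toy data: 4 offspring, 3 candidate males.
X_toy = [
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.0],
]
lam_hat = em_male_fertility(X_toy)
equal = [1.0 / 3] * 3
# Improvement in log likelihood of the estimate over equal fertilities.
delta_log_l = log_likelihood(X_toy, lam_hat) - log_likelihood(X_toy, equal)
```

Starting from equal fertilities mirrors the null hypothesis against which the estimated vector is later tested; because each expectation maximization step cannot decrease the likelihood, `delta_log_l` is non-negative in this sketch.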
Simulation methodology: Simulation was used to evaluate the statistical power of the estimation procedure and to evaluate inference about male fertility. Simulations centered around a “standard” parameter set. The standard set assumed a dioecious population of 25 male and 25 female parents, with 20 progeny assayed per maternal family. Genetic data in the standard set consist of eight loci, each with two equally frequent alleles (expected exclusion probability ε = 0.81; observed exclusions in simulations, e.g., in Figure 1, are less than this because of the finite number of paternal parents). This parameter set involves assaying a reasonable number of progeny for a combination of loci with exclusion probabilities toward the high end of that attainable with allozyme markers. Natural populations are likely to have more than 25 potential males, but the analyses presented below suggest that this realistic situation results in poor statistical properties. Loci are in Hardy-Weinberg and linkage equilibrium and are inherited in a Mendelian fashion. Parental genotypes are known without error. Expected male fertilities were chosen from a Gaussian distribution with mean equal to the number of progeny simulated and coefficient of variation equal to CV_{g}; zero fertility was assigned when negative deviates were drawn. The actual fertility coefficient of variation CV_{m} (i.e., variation in male fertility realized in a simulation) includes this source of variation and an additional multinomial component associated with sampling. Numbers of male and female parents, progeny array size, and number of loci were varied one at a time, with CV_{g} ranging between zero and one (with CV_{g} < 0.7, virtually all males sire some offspring, whereas for CV_{g} = 1, the distribution of male fertilities is nearly Poisson and ~35% of males sire no offspring). Each parameter combination involved 500 replicates.
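The expected exclusion probabilities quoted for the standard set can be checked with a short calculation. The sketch below assumes the usual single-locus paternity exclusion probability for a codominant two-allele locus with a known mother, pq(1 − pq), and combines loci by assuming independence (linkage equilibrium, as stated above). The function names are hypothetical.

```python
def single_locus_exclusion(p):
    """Exclusion probability at a codominant two-allele locus with allele
    frequency p, assuming the mother's genotype is known."""
    q = 1.0 - p
    return p * q * (1.0 - p * q)

def combined_exclusion(allele_freqs):
    """Probability that a random non-father is excluded by at least one
    locus, assuming loci are independent."""
    not_excluded = 1.0
    for p in allele_freqs:
        not_excluded *= 1.0 - single_locus_exclusion(p)
    return 1.0 - not_excluded

eps_8 = combined_exclusion([0.5] * 8)    # eight loci, equally frequent alleles
eps_12 = combined_exclusion([0.5] * 12)  # twelve loci, equally frequent alleles
print(round(eps_8, 2), round(eps_12, 2))  # 0.81 0.92
```

Under these assumptions each locus excludes a random male with probability 0.1875, so eight loci give ε of about 0.81 and twelve loci about 0.92, matching the expected values used in the text.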
Statistical power was evaluated using the likelihood ratio statistic suggested by Roeder et al. (1989). The test asks whether estimated male fertilities significantly improve the likelihood of the data when compared with the initial equal fertility vector. The test subtracts the log of the likelihood in Equation 1 calculated with the estimated fertilities from the log of the likelihood with equal fertilities, and is symbolized as Δ log L. For each statistical test, 500 data sets were simulated assuming equal male fertility, CV_{g} = 0. The Δ log L values from these simulations represent the null distribution against which fertility distributions with CV_{g} > 0 are to be compared. Statistical power for each scenario with CV_{g} > 0 is determined as the proportion of Δ log L values more extreme (larger) than 95% of the values under the assumption of equal expected fertility.
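The power calculation described above amounts to comparing simulated Δ log L values against an empirical critical value taken from the null simulations. A minimal sketch follows; the function name `empirical_power` and the example numbers are hypothetical, and taking the critical value as the order statistic at the 95% point is one reasonable convention among several.

```python
import math

def empirical_power(null_stats, alt_stats, alpha=0.05):
    """Proportion of statistics simulated with unequal fertility that exceed
    the (1 - alpha) point of the statistics simulated with equal fertility."""
    ordered = sorted(null_stats)
    # index of the smallest null value with at least (1 - alpha) of the
    # null distribution at or below it
    k = math.ceil((1.0 - alpha) * len(ordered)) - 1
    k = min(max(k, 0), len(ordered) - 1)
    critical = ordered[k]
    return sum(s > critical for s in alt_stats) / len(alt_stats)

# Hypothetical illustration: 100 null values and 4 alternative values.
null_example = list(range(100))
alt_example = [10, 50, 99.5, 200]
power_example = empirical_power(null_example, alt_example)  # 0.5
```

In the simulations described in the text, `null_stats` would hold the 500 values obtained with CV_{g} = 0 and `alt_stats` the 500 values for a given scenario with CV_{g} > 0.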
Two measures were used to characterize estimated vs. actual fertilities. The first,
RESULTS
Simulation results in Figure 1 indicate that statistical power to reject the null hypothesis of equal male fertility can be high, provided that male fertility is not too uniformly distributed. Paternity analyses benefit from large progeny sizes, many maternal progeny arrays, many loci (high exclusion probabilities), and few paternal parents. The lower panels of Figure 1 suggest that the total number of progeny assayed is important because similar curves result when comparable total progeny are assayed (e.g., 10 progeny from 25 mothers = 250 total progeny vs. 20 progeny from 12 mothers = 240 total progeny).
Estimation of the male fertility variance may be biased, and there may not be a strong correlation between actual and estimated fertility (Table 1). These difficulties are particularly apparent when the actual variance is limited or when many male fertilities are estimated. Even in scenarios with 12 loci and, hence, extraordinary exclusion probability (expected ε = 0.92), the maximum likelihood method overestimates variance in male fertility by 1.5- to 2-fold. With eight loci and moderate exclusion probability (expected ε = 0.81), the correlation between actual and estimated fertility ranges from 0.25, when there are many males with limited fertility variation, to 0.65, when substantial fertility variation among relatively few males is estimated using many or large maternal families. With the exclusion probability offered by 12 loci, the correlation between actual and estimated fertility can rise to 0.83. When males have equal expected fertility, replicates with 50 females or 40 progeny per female show a slight decrease in performance of the estimators compared with standard parameter values involving fewer females or progeny. A similar pattern is observed when male fertility variation is summarized as a ratio of expected values, rather than as the expected value of ratios, so that the difference is not likely to result from uncertainty in the denominator of
DISCUSSION
Maximum likelihood methods can detect significant male fertility variation when applied to appropriate data sets (Roeder et al. 1989). However, low statistical power (Figure 1), biased estimates of fertility variation, and low correlation between actual and estimated fertility (Table 1) occur with few loci, few maternal progeny arrays, few progeny per maternal family, or many potential fathers. The fertility coefficient of variation, and hence opportunity for selection, can be substantially overestimated, even with 12 loci and exclusion probability ε = 0.92. The imperfect correlation between estimated and actual fertility can reduce the correlation between trait value and relative fertility in a selection analysis by 50% or more (Table 1). These results suggest how experimental design can enhance statistical power, and they indicate limits to inference drawn from such experiments.
Experimental populations are well suited to inference of male fertility (Devlin and Ellstrand 1990; Devlin et al. 1992; Kohn and Barrett 1992; Conner et al. 1996), although some care must be taken in evaluating male fertility in natural populations. In experimental populations, the number of male fertilities requiring estimation can be small, and genotypes represented in the population can be chosen to ensure high exclusion probability. The most ambitious experimental study to date (Conner et al. 1996) involves 60 hermaphroditic plants, ~35 progeny per maternal parent, and exclusion probability between 0.85 and 0.89. Analysis by Conner et al. shows that the coefficient of variation of estimated individual male fertilities in this study is small (<5%). The results in Table 1 suggest that even in this data set, male fertility variation will be moderately overestimated, and the ability to detect selection on reproductive traits will be diminished by the imperfect correlation between estimated and actual fertility. Nonetheless, there is reasonable promise for application of paternity estimation techniques in populations of 25 possible paternal parents with substantial fertility and allozyme variation present. Clearly excluded as candidates for fertility estimation in nature are populations with large numbers of males (including species with extensive gene flow), populations with limited or moderate allozyme variation, or species with small progeny array sizes.
Genetic information (exclusion probability ε) plays a prominent but not exclusive role in male fertility estimation. For instance, all parameter sets involving eight loci in Figure 1 have the same exclusion probability, yet statistical power varies from near zero to one, depending on other aspects of experimental design and the actual amount of fertility variation. The results of Table 1 similarly show the importance of factors other than exclusion probability in characterizing fertility variation. Even if exclusion were complete and fertility assigned without error, under the hypothesis of uniform expected male fertility, the error of individual fertility estimates follows a multinomial distribution with sampling variance inversely proportional to the total number of progeny surveyed (Roeder et al. 1989). Thus, the best strategy for increasing accuracy of fertility estimates may not be maximizing genetic exclusion (e.g., through use of hypervariable markers). Perhaps the most encouraging result is the benefit of increasing the number of progeny sampled for statistical power (either sampling more progeny per mother or more maternal parents, see Figure 1) because assaying additional progeny is the factor most easily manipulated by the investigator interested in natural populations. Admittedly, Table 1 shows that increasing progeny sampled may only modestly increase the precision of estimated male fertility parameters.
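The multinomial argument above can be made concrete with a small calculation. The sketch assumes complete exclusion and equal expected fertility, so that the number of progeny sired by each male is a multinomial count with probability 1/m for each of m males; the function name `sampling_cv` is hypothetical.

```python
import math

def sampling_cv(n_males, n_progeny):
    """Coefficient of variation of progeny sired per male that arises from
    multinomial sampling alone, with equal expected fertility."""
    p = 1.0 / n_males
    mean = n_progeny * p
    sd = math.sqrt(n_progeny * p * (1.0 - p))
    return sd / mean

cv_500 = sampling_cv(25, 500)    # standard set: 25 mothers x 20 progeny
cv_1000 = sampling_cv(25, 1000)  # twice as many progeny assayed
print(round(cv_500, 3), round(cv_1000, 3))  # 0.219 0.155
```

Under these assumptions, doubling the number of progeny halves the sampling variance and so reduces the sampling coefficient of variation only by a factor of √2, which is consistent with the modest gains in precision noted above.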
Modern molecular markers may substantially expand the applicability of paternity analyses, although available data sets only hint at appropriate parameters for further investigation. Simple sequence repeats (SSRs) are one promising genetic marker with abundant polymorphism and codominant expression. Although many SSR loci are found in rice (Chen et al. 1997) or maize (Smith et al. 1997), published studies of natural plant populations document SSR variants at relatively few loci. For instance, four polymorphic loci with effective number of alleles (Hartl and Clark 1989, p. 126) between 1.9 and 5.24 were found in Pithecellobium elegans (Mimosoideae; Chase et al. 1996), while a single locus with six alleles was identified in the tropical tree Gliricidia sepium (Dawson et al. 1997). Table 2 shows simulation results when highly polymorphic loci are assayed in 250 progeny (10 offspring from 25 maternal parents) with between 25 and 200 potential male parents and male fertility differences resulting entirely from sampling (i.e., CV_{g} = 0). Variation similar to that reported from natural populations (e.g., four alleles at four loci) continues to provide biased estimates of male fertility variation and low correlation between actual and estimated fertility, even with only 25 potential male parents. A greater number of alleles per locus results in very favorable prospects for paternity analysis, but observation of many alleles per locus may be precluded by genetic drift in the small populations assumed here. Investing in development of additional loci offers very effective paternity analysis, even in moderate-sized populations.
Computer simulation and resampling techniques may continue to play an important part in paternity studies. Preliminary analysis, using knowledge of marker variation, population structure, and proposed experimental design, might help to determine whether a full-scale study will be informative (Roeder et al. 1989) and to identify an appropriate sampling strategy (e.g., polymorphism such as that in Table 2 suggests few progeny per maternal parent compared with that in Table 1). Interpretation of hypothesis tests and inferences from a paternity study also requires investigation of statistical properties of the inference to determine the expected bias in estimates of male fertility variation or the expected correlation between estimated and actual fertility. Computer simulation also offers the opportunity to incorporate idiosyncrasies of the data set under investigation. For instance, using many marker loci increases the likelihood of linkage, parental genotypes may not be in Hardy-Weinberg proportions, and markers may violate Mendelian patterns of segregation.
Finally, the method of estimating paternity used here represents only one form of analysis. Adams and coworkers (Adams and Birkes 1991; Adams 1992; Burczyk et al. 1996) use electrophoretic data to estimate the fraction of self-fertilizations, matings between neighboring individuals, and mating between individuals outside the local neighborhood. Matings between neighboring individuals are further estimated as a function of plant or population attributes (e.g., size of putative paternal parent, distance between maternal and putative paternal parent). This procedure has much to recommend it, because it restricts the pool of potential fathers (through estimation of neighborhood size) and directly estimates a small number of biologically interesting parameters (e.g., relationship between plant size and fertility) rather than relying on intermediary estimates of a large number of male fertilities. These methods were developed for seed orchards with relatively few maternal parents and well-defined populations, so their application to natural populations should be approached with caution.
Acknowledgments
This research was supported by a Natural Sciences and Engineering Research Council of Canada postdoctoral fellowship. Daniel Schoen, Peter Smouse, and anonymous reviewers provided many helpful comments on earlier versions.
Footnotes

Communicating editor: A. H. D. Brown
 Received November 18, 1997.
 Accepted February 27, 1998.
 Copyright © 1998 by the Genetics Society of America