Abstract
We introduce a Bayesian method for estimating parameters for a model of multiple mating and sperm displacement from genotype counts of broodstructured data. The model is initially targeted for Drosophila melanogaster, but is easily adapted to other organisms. The method is appropriate for use with field studies where the number of mates and the genotypes of the mates cannot be controlled, but where unlinked markers have been collected for a set of females and a sample of their offspring. Advantages over previous approaches include full use of multilocus information and the ability to cope appropriately with missing data and ambiguities about which alleles are maternally vs. paternally inherited. The advantages of including Xlinked markers are also demonstrated.
SPERM competition is an important factor in the evolution of reproduction of many organisms, particularly birds and insects. The phenomenon of sperm competition in Drosophila has been extensively studied by setting up matings in the laboratory with males that bear different visible genetic markers (Fowler 1973; Prout and Bundgaard 1977). Females mate with multiple males and store the collection of sperm in the seminal receptacle and the paired spermathecae for later use in fertilizing eggs. These laboratory experiments have shown that latermating males tend to father a greater proportion of the offspring and that there is great variability among genotypes of males and of females in the magnitude of this latermale advantage (Clarket al. 1995; Clark and Begun 1998). The precise mechanism that determines sperm success is not fully understood; however, by following fluorescently labeled sperm Civetta (1999) and Price et al. (1999) showed that the sperm that gets transported to the storage organs will be used in fertilization. This implies that the critical time is shortly after mating, when the decision is made as to which sperm will be stored. When the female remates, it appears that there is some loss of the first male’s sperm, through either physical removal or incapacitation (Priceet al. 1999), so the phenomenon is often called sperm displacement.
There has always been some concern that laboratory experiments in sperm competition are somewhat contrived in that the females must be exposed to males in a prescribed order and timing. The timing of opportunities for mating in natural populations no doubt differs dramatically from such laboratory experiments, so the whole phenomenon of sperm competition may be quite different in nature. It would be desirable to study sperm competition directly by sampling from natural populations. However, this presents more challenges than the laboratory setting. In natural populations, typically the mother and a sample of the offspring (the brood) are genotyped at one or more loci, but no information is available on the fathers, except perhaps population allele frequencies. However, it is still possible to use the available information to model the number of mates for each mother and to quantify attributes of sperm competition.
Harshman and Clark (1998) developed a model for multiple mating and sperm displacement, building on the approach of Cobbs (1977) and Griffiths et al. (1982). To fit the parameters of the model for data from a wild population, they reduced the singlelocus data for each brood to a summary statistic, the number of distinct paternal alleles. They then used simulation to estimate the distribution of this statistic for a grid of parameter values. The likelihood of a set of parameter values for a particular brood was then simply the height of the simulated distribution at the observed number of distinct paternal alleles at each locus. Information from multiple loci was combined by taking the product of likelihoods across loci.
Because this approach does not use the allele counts, differentiate between alleles of different frequencies, or use haplotype information in the multilocus case (i.e., is not based on sufficient statistics) some information is lost. Computational constraints also led Harshman and Clark to consider at most four possible mates per brood. We take the model from Harshman and Clark (1998), place it in a Bayesian framework, and use Markov chain Monte Carlo to examine the posterior distribution of the model parameters. This approach allows us to make full use of the data, remove limits on the number of possible mates per brood, and deal with missing data and ambiguities in which alleles are paternally inherited. We examine the performance of this method on both simulated data and the data considered in Harshman and Clark (1998).
METHODS
Model for mating and offspring production: The model of sperm competition is that each mate following the first mate displaces fraction β of the alreadypresent sperm. If there are K mates, the first mating male accounts for fraction (1 β)^{K}^{1} of the stored sperm; the ith mating male accounts for β(1 β)^{K}^{}^{i}. Laboratory experiments with just two mates have consistently shown that the latermating male fathers more offspring, so β> 0.5 is a biologically plausible constraint to place on the estimator. Thus we have put a prior on β that is uniform between 0.5 and 1.
The number of sampled offspring fathered by each mate is assumed to follow a multinomial distribution, where the multinomial probabilities correspond to the fraction of sperm from each father. The counts of offspring with different paternal haplotypes are thus from a different multinomial, where the probability of each genotype is a sum over fathers, weighted by the fraction of sperm from that father. For a haplotype h that segregates from the ith mate with probability x_{ih}, the probability is
The number of mates K is assumed to have a truncated Poisson distribution (since all the females produced offspring, the possibility of zero mates has been eliminated). This distribution is parameterized by α:
The prior on the mates’ genotypes is just the population frequencies (assuming HardyWeinberg and linkage equilibrium). Call this π_{G}(g). The population allele frequencies are assumed to be known, but in reality are estimated from the data. The estimates used are the observed allele frequencies among the mothers’ alleles and the offspring’s paternal alleles. For purposes of allele frequency estimation only, if there is ambiguity in which the allele is paternal, the paternal allele is arbitrarily designated. It would also be possible to sample adult males from the population and to compare the allele frequencies to those inferred from the progeny to infer differential mating success (Bundgaard and Christiansen 1972), but for now we assume that no adult male data are available.
In many cases, the probability that offspring i receives allele a_{il} at locus l through paternal inheritance, π_{A}(a_{il}), is simply 0 or 1: The paternal allele received by an offspring is ambiguous only if the offspring shares both of its alleles with its mother at that locus. If there are ambiguities, π_{A} places half its mass on each of the two possible alleles. Thus the probability of any feasible set of values for a is just (½)^{Number of ambiguities}.
The posterior probability of a particular set of (α, β, K, g, a) is then proportional to the prior times the likelihood:
Markov chain Monte Carlo parameter estimation: We construct an ergodic Markov chain whose stationary distribution is the posterior of α, β, the K’s, a’s, and g’s. A sample from this chain can thus be used to estimate the general shape of the posterior distribution, its mode, mean, and other quantities of interest.
The chain moves among the possible values for α, β, the K’s, a’s, and g’s. It is constructed using a reversible jump MetropolisHastings algorithm (Metropoliset al. 1953; Hastings 1970; Green 1995). Reversible jump simply refers to the fact that when the number of fathers changes, the dimension of g does as well. However, the Jacobian of the dimension “jumping” transformation is one, so the algorithm is identical to a standard MetropolisHastings. Moves are proposed, and then the associated Hastings ratio is computed:
The moves we use to sample the possible state for each brood were:
Change the ith father’s genotype at some locus l. Both i and l are chosen uniformly from the possible values.
Change the order of the fathers. Two fathers were selected at random for swapping.
Add a father. A new firstmating father is proposed; the alleles for this father are selected with equal probability for each paternal allele appearing in the brood and 1 allele representing all others in the population. [Each allele is proposed with probability 1/(number of observed alleles in this brood + 1).]
Subtract a father. This move proposes to delete the firstmating father.
Switch ambiguous alleles. We proposed to switch the allele designated as paternally inherited, a_{il}, from one of the offspring’s alleles to the other. Simultaneously, we propose to switch values of the fathers’ genotypes matching a_{il}’s original value. We propose to switch each instance with probability 0.5.
The broods are cycled through; for each brood one of the preceding moves is proposed, each with probability 0.2. After a cycle through the broods, α and β are updated. A new α is proposed by sampling uniformly from a window of width 1 centered on the current α; a similar mechanism is used to propose the new β, but the window has width 0.1.
The resulting chain is irreducible (each state is reachable from every other state, and the number of steps between two states need not be a multiple of any number greater than one) and thus samples from the desired posterior. This algorithm has been coded in C++ as SCARE (sperm competition and remating estimates), available from http://www.stat.psu.edu/~trix/software/scare.html.
Simulation study: The performance of the method was examined by simulating 100 data sets under each of nine different scenarios and by assessing the accuracy of the inferences made. The first three scenarios each involve typing a total of 400 offspring, the next three a total of 900, and the final three a total of 1600. The first three scenarios each consisted of 20 broods with 20 typed offspring. Scenario 1 used one autosomal locus, scenario 2 used two autosomal loci, and scenario 3 used a single Xlinked locus. (In this final case, it was presumed that only female offspring were sampled.) There were 15 equifrequent alleles per locus, the sort of polymorphism one might expect at microsatellite loci. (This information was not used in the inference procedure; the allele frequencies were estimated from each data set.) The remaining scenarios all used a single autosomal locus. A “square” design with 30 broods and 30 offspring per brood was compared with designs with 20 broods and 45 offspring per brood and 45 broods and 20 offspring per brood. Similarly, the scenarios with 1600 total offspring examined three designs with 40 broods of 40 offspring, 20 broods of 80 offspring, and 80 broods of 20 offspring.
For each scenario, broods were simulated by first sampling from the distribution of numbers of mates per female, with parameter α= 3. The relative contribution of the male genotypes to the sperm pool was as specified by our model with β= 0.7. Over the range of plausible values of α (16) and β (0.60.9), the simulations perform less well for a given sample size as the number of mates increases and as the magnitude of sperm competition increases, although a systematic quantification of these effects was not done. For each simulated data set a sample of 10,000 parameters from the posterior distribution of (α, β) was generated using the SCARE software. The samples are spaced by 100 cycles through the broods. This number of samples was determined to be adequate because repeated runs of this length, even with different starting points, gave very similar parameter estimates for the Ravenswood data, as discussed below.
Ravenswood data: We consider the same data studied in Harshman and Clark (1998). The data were collected from a natural population of Drosophila melanogaster living near the open fermenters of the Ravenswood winery in Sonoma County, California. Female flies were captured and stored in vials. The vials were maintained at room temperature until the females laid eggs and the eggs hatched. Each female and a sample of her offspring were then genotyped for two microsatellite loci, ula and nanos. A total of 19 broods were collected, with an average of 13 offspring typed per brood. The two loci are actually linked, with a recombination fraction of ∼20%, a deviation from the assumptions the SCARE software used in computing the haplotype probabilities (x_{ih}’s) in Equation 1. While it is conceptually straightforward to accommodate the linkage, we perform the estimation as if the loci are unlinked and discuss what effect this might have.
Again, inference based on a sample of 10,000 was taken from the joint posterior of α and β, spaced by 100 MetropolisHastings updates for each brood. The starting values for α and β were 2.0 and 0.6, respectively. To assess Monte Carlo error, this procedure was repeated 100 times with different random number seeds. The procedure was also tried with several different starting values for (α, β), spread over the region with prior support: (1.0, 0.5), (1.0, 0.9), (5.0, 0.5), and (5.0, 0.9).
RESULTS
Simulated data: Histograms of the posterior means for α inferred for the simulated data sets under each design scenario are in Figure 1. The posterior means for β are in Figure 2. The method shows a tendency to underestimate α when there are low numbers of offspring per brood and a single autosomal locus. Either adding additional offspring or using an Xlinked locus improves the situation.
β also tends to be slightly underestimated; although the range of the β estimates narrows as the total number of typed offspring increases, this bias seems to persist over the range of designs considered. Using an Xlinked locus or two loci also improves estimation for β.
Ravenswood data: A histogram of 10,000 samples from the joint posterior of α and β is shown in Figure 3. From this sample, posterior means and credible intervals were estimated. For α, the posterior mean was 2.44 with a 90% credible interval of (1.64, 3.32); for β the posterior mean was 0.61 with a 90% credible interval of (0.51, 0.69). The Monte Carlo standard deviation for the posterior mean estimate, estimated by 100 runs with different random number seeds, was 0.021 for α and 0.0045 for β. As desired, the uncertainty due to Monte Carlo error is dwarfed by the parameter uncertainty (as represented by the credible intervals).
The estimates of posterior means were also insensitive to the starting values for the parameters. Three of the four “extreme” starting values produced estimates falling within the interquartile span of the 100 estimates using starting value (2.0, 0.6). The exception was starting value (5.0, 0.9), which, despite starting with a value at the high end for both parameters, wound up with a (somewhat) unusually low estimate for both: (2.39, 0.59). This estimate still falls within the range of the samples produced with starting point (2.0, 0.6) and clearly shows that there is no problem with “stickiness” at the starting value.
DISCUSSION
Simulated data: The tendency to underestimate α with low numbers of offspring reflects the fact that for these designs it is more likely for a mate to have no offspring among those typed. The likelihood tends to favor parsimonious configurations of the fathers’ genotypes, so if a mate has no offspring among those sampled the estimate of K for that brood, and therefore the estimate of α, shifts downward. The problem of mates without offspring in the sample remains even when there is information that helps resolve paternity, such as using an Xlinked rather than an autosomal locus or multiple loci. In fact, additional simulations (not shown) demonstrate that the pressure for parsimonious paternal configurations increases with the polymorphism of the locus, or the number of loci, as the probability of chance matches between additional fathers and the observed offspring decreases. However, as the number of offspring increases the bias subsides; thus we see that as more offspring per brood are included, the histograms become more balanced around the true α value of 3.0.
While increasing the number of offspring per brood helps reduce bias in α, since the distribution parameterized by α describes variation in the number of mates between broods, increasing the number of broods is necessary to decrease the variance of the estimate of α. This is reflected in the fact that although the “rectangular” designs with more offspring per brood are more symmetric around the true α value, they have a larger spread than designs that have the same total number of offspring but more broods. Consequently, it seems that a “square” design provides the best solution, balancing the effects of variance and bias.
In contrast, if assignment to paternal groups could be done without error, increasing either the number of offspring or the number of broods (while holding the other steady) would lead to improved estimation of β. This seems to continue to hold in our situation where we have imperfect parentage information: There are not marked differences in the distributions of β estimates when the same total numbers of typed offspring are used.
Xlinked or multiple loci improve estimation of β by better resolving paternity. Our simulations show Xlinked loci are effective at resolving alternative fathers, and thus improving estimation of β, without additionally biasing the estimate of α. Thus it seems that when only a small number of offspring can be typed, switching to use of an Xlinked locus is a better way to improve the parentage information than using multiple autosomal loci. Once the number of offspring per brood is large enough to ensure a high probability of including offspring from all mates, we would ideally include enough genetic information (from a mix of Xlinked and autosomal markers) that each offspring has a high probability of uniquely specifying a paternal genotype.
Laboratory research suggests that β in fact varies among males (Clarket al. 1995). One goal might be to characterize this variation in a natural population. However, to do this the variation in the proportions of offspring with each genotype caused by differing β’s must be large compared to the variation due to multinomial sampling of the offspring from each paternal group and ambiguity in assigning the paternal groups. The histogram with 40 broods and 40 offspring (1600 offspring total) reflects these sources of variation and has a range of ∼0.07. Only if the variation of β between males exceeds this could we hope to detect it with such a design.
Ravenswood data: In this data set, we do not expect the fact that the loci are treated as independent to produce misleading results. The main effect of a deficiency of recombinant haplotypes is to reduce the information about which haplotypes are paired within a father: This information will be intermediate between that found in a data set with two independent loci and a data set with a single (more polymorphic) locus.
Allele frequencies were estimated for the data set as described in methods; it is worth noting that if one of the loci used was in linkage disequilibrium with a male locus affecting sperm competition, the allele frequencies for that locus could be substantially biased. However, we have no reason to believe that is the case for this data set. Although the frequencies are based on ∼271 individuals (the 19 mothers and their offspring), the paternal alleles of offspring in the same brood are correlated so the variance of the estimate is much larger than that for 271 independent individuals. Estimation of the allele frequencies could be incorporated into the Markov chain Monte Carlo and utilizes the inferred paternal genotypes (the g’s) directly; an advantage of this approach would be that posterior distribution would reflect the uncertainty in α and β due to imperfect knowlege of the allele frequencies.
Our analysis of the Ravenswood data shows a lower value for β (0.61 vs. 0.83) and a higher value for α (2.44 vs. 1.82) than that reported in Harshman and Clark (1998). The value of α reported by Harshman and Clark is within our 90% credible interval, so in this sense it is not dramatically different; however, their estimate of β is above the credible interval obtained here. We believe the main reason for this is that the method described in this article makes full use of multilocus information to resolve differing paternity, while the method described in Harshman and Clark does not. Their likelihoods are based on the number of distinct alleles observed at each locus. Many broods in this data set, looked at in that manner, are compatible with a smaller number of fathers than is possible if the multilocus paternal haplotypes are considered. That clearly explains the increase in α; the decrease in β is explained by the fact that with the small brood sizes (13 on average) we would not expect offspring from so many fathers to end up in our sample unless β is relatively low (resulting in a more even distribution of offspring among mates). This substantial change highlights the importance of our methodology’s ability to fully utilize multilocus information. Biologically, the finding that the degree of lastmale advantage gets him only 61% of the progeny may be quite important, because the assumption had been that the level of sperm competition in nature ought to be comparable to that in the laboratory, where more typically 90% or more of the progeny are sired by the last male. Of course, in the laboratory, typically only two successive males are tested, so the recurrent mating by females may serve to reduce the overall magnitude of sperm competition. This is contrary to the general belief that the higher the frequency of repeated matings, the stronger the opportunity for sperm competition should be. Clearly there is room to apply these methods to better understand the nature of sperm competition in field situations.
Acknowledgments
We thank Drs. Anthony Fiumera and Lawrence Harshman for helpful comments on a draft of this manuscript. This work is supported by National Science Foundation grant DEB 0108965 to A.G.C.
Footnotes

Communicating editor: G. Churchill
 Received July 24, 2002.
 Accepted December 12, 2002.
 Copyright © 2003 by the Genetics Society of America