A Bayesian Approach to Inferring Rates of Selfing and Locus-Specific Mutation
- Benjamin D. Redelings*,
- Seiji Kumagai*,
- Andrey Tatarenkov†,
- Liuyang Wang*,
- Ann K. Sakai†,
- Stephen G. Weller†,
- Theresa M. Culley‡,
- John C. Avise† and
- Marcy K. Uyenoyama*,1
- *Department of Biology, Duke University, Durham, North Carolina 27708-0338
- †Department of Ecology and Evolutionary Biology, University of California, Irvine, California 92697-2525
- ‡Department of Biological Sciences, University of Cincinnati, Cincinnati, Ohio 45220
- 1Corrresponding author: Department of Biology, Box 90338, Duke University, Durham, NC 27708-0338. E-mail: marcy{at}duke.edu
Abstract
We present a Bayesian method for characterizing the mating system of populations reproducing through a mixture of self-fertilization and random outcrossing. Our method uses patterns of genetic variation across the genome as a basis for inference about reproduction under pure hermaphroditism, gynodioecy, and a model developed to describe the self-fertilizing killifish Kryptolebias marmoratus. We extend the standard coalescence model to accommodate these mating systems, accounting explicitly for multilocus identity disequilibrium, inbreeding depression, and variation in fertility among mating types. We incorporate the Ewens sampling formula (ESF) under the infinite-alleles model of mutation to obtain a novel expression for the likelihood of mating system parameters. Our Markov chain Monte Carlo (MCMC) algorithm assigns locus-specific mutation rates, drawn from a common mutation rate distribution that is itself estimated from the data using a Dirichlet process prior model. Our sampler is designed to accommodate additional information, including observations pertaining to the sex ratio, the intensity of inbreeding depression, and other aspects of reproduction. It can provide joint posterior distributions for the population-wide proportion of uniparental individuals, locus-specific mutation rates, and the number of generations since the most recent outcrossing event for each sampled individual. Further, estimation of all basic parameters of a given model permits estimation of functions of those parameters, including the proportion of the gene pool contributed by each sex and relative effective numbers.
- selfing rate
- Ewens sampling formula
- Bayesian
- MCMC
- mating system
INBREEDING generates genome-wide, multilocus disequilibria of various orders, transforming the context in which evolution proceeds. Here, we address a simple form of inbreeding: a mixture of self-fertilization (selfing) and random outcrossing (Clegg 1980; Ritland 2002).
Various methods exist for the estimation of selfing rates from genetic data. Wright’s (1921) fundamental approach bases the estimation of selfing rates on the coefficient of inbreeding (), a summary of the departure from Hardy–Weinberg proportions of genotypes for a given set of allele frequencies. The maximum-likelihood method of Enjalbert and David (2000) detects inbreeding from departures of multiple unlinked loci from Hardy–Weinberg proportions, estimating allele frequencies for each locus and accounting for correlations in heterozygosity among loci [identity disequilibrium (Cockerham and Weir 1968)]. David et al. (2007) extend the approach of Enjalbert and David (2000) to accommodate errors in scoring heterozygotes as homozygotes. A primary objective of InStruct (Gao et al. 2007) is the estimation of admixture. It extends the widely used program structure (Pritchard et al. 2000), which bases the estimation of admixture on disequilibria of various forms, by accounting for disequilibria due to selfing. Progeny array methods (see Ritland 2002), which base the estimation of selfing rates on the genetic analysis of family data, are particularly well suited to plant populations. Wang et al. (2012) extend this approach to a random sample of individuals by reconstructing sibship relationships within the sample.
Methods that base the estimation of inbreeding rates on the observed departure from random union of gametes require information on expected Hardy–Weinberg proportions. Population-wide frequencies of alleles observed in a sample at locus l () can be estimated jointly in a maximum-likelihood framework (e.g., Hill et al. 1995) or integrated out as nuisance parameters in a Bayesian framework (e.g., Ayres and Balding 1998). Similarly, expected locus-specific heterozygosity,
(1)can be obtained from observed allele frequencies (Enjalbert and David 2000) or estimated jointly with the selfing rate (David et al. 2007).
Here, we introduce a Bayesian method for the analysis of mixed-mating systems that accounts for genetic variation through coalescence-based models and uses the Ewens sampling formula (ESF) (Ewens 1972) in determining likelihoods. Our approach replaces the estimation of allele frequencies or heterozygosity (Equation 1) with the estimation of a locus-specific mutation rate () under the infinite-alleles model of mutation. We use a Dirichlet process prior (DPP) to determine the number of classes of mutation rates, the mutation rate for each class, and the class membership of each locus. We assign the DPP parameters in a conservative manner so that a new mutational class is created only if sufficient evidence exists to justify doing so. Further, while other methods assume that the frequency in the population of an allelic class not observed in the sample is zero, the ESF provides the probability, under the infinite-alleles model of mutation, that the next-sampled gene represents a novel allele [see (21a)].
To estimate the probability that a random individual is uniparental (), we exploit identity disequilibrium (Cockerham and Weir 1968), the correlation in heterozygosity across loci. This association, even among unlinked loci, reflects that all loci within an individual share a history of inbreeding back to the most recent random outcrossing event. Conditional on the number of generations since this event, the genealogical histories of unlinked loci are independent. For each diploid individual in the sample, our method models coalescence events at each locus back to the most recent point at which all remaining lineages reside in distinct individuals. The ESF provides the exact likelihood of the ancestral allele frequency spectrum at that point, obviating the need for further genealogical reconstruction. This approach permits computationally efficient analysis of samples comprising large numbers of individuals and large numbers of loci observed across the genome.
We address the estimation of rates of inbreeding and other evolutionary processes in populations undergoing pure hermaphroditism, androdioecy (hermaphrodites and males), or gynodioecy (hermaphrodites and females). Application of the method to simulated data sets demonstrates its accuracy in parameter estimation and in assessing uncertainty. We apply the method to microsatellite data from the self-fertilizing killifish Kryptolebias marmoratus (Mackiewicz et al. 2006; Tatarenkov et al. 2012) and the gynodioecious Hawaiian endemic Schiedea salicaria (Wallace et al. 2011) to illustrate the simultaneous inference of various biologically significant aspects of mating systems in nature, including levels of inbreeding depression, population proportions of sexual forms, and effective numbers.
Evolutionary Model
We use the ESF (Ewens 1972) to determine likelihoods based on a sample of diploid multilocus genotypes. By subsampling a single gene from each locus from each diploid individual, we could apply the ESF to the reduced sample to determine a likelihood function with a single parameter: the mutation rate, appropriately scaled to account for the acceleration of the coalescence rate caused by inbreeding [ (Fu 1997; Nordborg and Donnelly 1997)]. Consideration of the full sample of diploid genotypes yields information about an additional parameter: the probability that a random individual is uniparental (uniparental proportion
).
We describe the dependence of composite parameters and
on the basic parameters of the iconic mating systems pure hermaphroditism and gynodioecy. In addition, we develop the Kryptolebias model, based on the mating system of the killifish K. marmoratus, in which only males fertilize eggs that are not self-fertilized by hermaphrodites (Furness et al. 2015). Although this mating system and that of the worm Caenorhabditis elegans have been described as androdioecious, we reserve this botanical term for plant systems comprising hermaphrodites and female steriles (males), with pollen from both sexes capable of fertilizing seeds that are not set by self-pollen.
Rates of coalescence and mutation
Here, we describe the structure of the coalescence process shared by our pure hermaphroditism, Kryptolebias, and gynodioecy models.
Relative rates of coalescence and mutation:
We use to denote the uniparental proportion (probability that a random individual is uniparental) and
to denote the rate of parent sharing (the probability that a pair of genes residing in distinct individuals descend from the same individual in the immediately preceding generation). These quantities determine the coalescence rate and the scaled mutation rate of the ESF.
A pair of lineages residing in distinct individuals derive from a single parent (P) in the preceding generation at rate They descend from the same gene (immediate coalescence) or from distinct genes in P with equal probability. In the latter case, P is itself either uniparental (probability
), implying descent once again of the lineages from a single individual in the preceding generation, or biparental, implying descent from distinct individuals. The ancestry of a pair of lineages residing in a single individual rapidly resolves either to coalescence, with probability
[the classical coefficient of identity (Wright 1921; Haldane 1924)], or to residence in distinct individuals, with the complement probability. The total rate of coalescence of lineages sampled from distinct individuals corresponds to
(2)Our model assumes that coalescence and mutation occur on comparable timescales,
(3)for u the rate of mutation under the infinite-alleles model and N an arbitrary quantity that goes to infinity at a rate comparable to
and
Here, E represents a measure of effective population size (the “inbreeding effective size” of Crow and Denniston 1988), scaled relative to a population comprising N reproductives.
In large populations, switching of lineages between uniparental and biparental carriers occurs on the order of generations, virtually instantaneously relative to the rate at which lineages residing in distinct individuals coalesce (Fu 1997; Nordborg and Donnelly 1997). Our model assumes independence between the processes of coalescence and mutation and that these processes occur on a much longer timescale than random outcrossing: (4)Using (2), we obtain the probability that the most recent event in the ancestry of m lineages, each residing in a distinct individual, corresponds to mutation,
in which
(5)for θ and E defined in (3). In inbred populations, the single parameter of the ESF for an allele frequency spectrum comprising genes sampled from separate individuals corresponds to
Uniparental proportion and the rate of parent sharing:
In a purely hermaphroditic population comprising reproductives, the rate of parent sharing (
) corresponds to
and the uniparental proportion (
) to
(6a)for
the fraction of uniparental offspring at conception and τ the rate of survival of uniparental relative to biparental offspring. For the pure-hermaphroditism model, we assign the arbitrary constant N in (3) as
implying
(6b)Under the Kryptolebias model, involving reproduction by
hermaphrodites and
males, the uniparental proportion (
) is identical to the case of pure hermaphroditism (Equation 6),
(7a)Because only males fertilize eggs that are not self-fertilized by hermaphrodites, a random gene derives from a male in the preceding generation with probability
The rate of parent sharing (
) corresponds to
(7b)which in the absence of inbreeding (
) agrees with the classical harmonic mean expression for effective population size (Wright 1969). For the Kryptolebias model, we assign the arbitrary constant N in (3) as the number of reproductives
implying a scaled rate of coalescence of
(7c)for
(8a)the proportion of males among reproductives. Relative effective number
takes its maximum under equality between the total number of reproductives (
) and effective number
determined by the rate of parent sharing. At
the probability that a random gene derives from a male parent corresponds to the proportion of males among reproductives:
(8b)In gynodioecious populations, in which
hermaphrodites and
females (male steriles) reproduce, the uniparental proportion (
) corresponds to
(9a)in which σ represents the seed fertility of females relative to hermaphrodites and
is the proportion of seeds of hermaphrodites set by self-pollen. A random gene derives from a female in the preceding generation with probability
for
(9b)the proportion of biparental offspring that have a female parent. The rate of parent sharing (
) corresponds to
(9c)We assign the arbitrary constant N in (3) as
implying a scaled rate of coalescence of
(9d)for
(10a)the proportion of females among reproductives. As for the Kryptolebias model,
achieves its maximum only if the proportion of females among reproductives equals the probability that a random gene derives from a female parent:

Likelihood
We here address the probability of a sample of diploid multilocus genotypes.
Genealogical histories:
For a sample comprising up to two alleles at each of L autosomal loci in n diploid individuals, we represent the observed genotypes by (11)in which
denotes the set of genotypes observed at locus l among the n individuals
(12)with
the genotype at locus l of individual k, which bears alleles
and
To facilitate accounting for the shared recent history of genes borne by an individual in sample, we introduce latent variables (13)for
denoting the number of consecutive generations of selfing in the immediate ancestry of the
individual, and
(14)for
indicating whether the lineages borne by the
individual at locus l coalesce within the most recent
generations. Independent of other individuals, the number of consecutive generations of inbreeding in the ancestry of the
individual is geometrically distributed,
(15)with
signifying that individual k is the product of random outcrossing. Irrespective of whether 0, 1, or 2 of the genes at locus l in individual k are observed,
indicates whether the two genes at that locus in individual k coalesce during the
consecutive generations of inbreeding in its immediate ancestry:
Because the pair of lineages at any locus coalesce with probability 1/2 in each generation of selfing,
(16)Figure 1 depicts the recent genealogical history at a locus l in five individuals. Individuals 2 and 5 are products of random outcrossing (
), while the others derive from some positive number of consecutive generations of selfing in their immediate ancestry (
). Both individuals 1 and 3 are homozygotes (
), with the lineages of individual 3 but not 1 coalescing more recently than the most recent outcrossing event (
). As individual 2 is heterozygous (
), its lineages necessarily remain distinct since the most recent outcrossing event (
). One gene in each of individuals 4 and 5 is unobserved (∗), with the unobserved lineage in individual 4 but not 5 coalescing more recently than the most recent outcrossing event (
).
Following the history of the sample () backward in time until all ancestors of sampled genes reside in different individuals (
). Ovals represent individuals and circles represent genes. Blue lines indicate the parents of individuals, while red lines represent the ancestry of genes. Black circles represent sampled genes for which the allelic class is observed (Greek letters) and their ancestral lineages. White circles represent genes in the sample with unobserved allelic class (∗). Gray circles represent other genes carried by ancestors of the sampled individuals. The relationship between the observed sample
and the ancestral sample
is determined by the intervening coalescence events
T indicates the number of consecutive generations of selfing for each sampled individual.
In addition to the observed sample of diploid individuals, we consider the state of the sampled lineages at the most recent generation in which an outcrossing event has occurred in the ancestry of all n individuals. This point in the history of the sample occurs generations into the past, for
In Figure 1, for example,
reflecting the most recent outcrossing event in the ancestry of individual 3. As all remaining lineages reside in distinct individuals at that point, the ESF provides the probability of the allele frequency spectrum at this point.
We represent the ordered list of allelic states of the lineages at generations into the past by
(17)for
a list of ancestral genes in the same order as their descendants in
Each gene in
is the ancestor of either 1 or 2 genes at locus l from a particular individual in
(Equation 12), depending on whether the lineages held by that individual coalesce during the consecutive generations of inbreeding in its immediate ancestry. We represent the number of genes in
by
(
). In Figure 1, for example,
contains 10 genes in five individuals, but
contains only 8 genes, with
the ancestor of only the first allele of
and
the ancestor of both alleles of
We assume (Equation 4) that the initial phase of consecutive generations of selfing is sufficiently short to ensure a negligible probability of mutation in any lineage at any locus and a negligible probability of coalescence between lineages held by distinct individuals more recently than In addition to constraints on relative rates within loci (Equation 4), this assumption may entail small numbers of observed loci relative to the population size (
). Under these assumptions, the coalescence history I (Equation 14) completely determines the correspondence between genetic lineages in X (Equation 11) and Y (Equation 17).
Computing the likelihood:
In principle, the likelihood of the observed data can be computed from the augmented likelihood by summation, (18)for
(19)the list of scaled, locus-specific mutation rates,
the population-wide uniparental proportion for the reproductive system under consideration (e.g., Equation 6 for the pure hermaphroditism model), and T (Equation 13) and I (Equation 14) the lists of latent variables representing the time since the most recent outcrossing event and whether the two lineages borne by a sampled individual coalesce during this period. Here we follow a common abuse of notation in using
to denote
for random variable X and realized value x. Summation (18) is computationally expensive: the number of consecutive generations of inbreeding in the immediate ancestry of an individual (
) has no upper limit (compare David et al. 2007) and the number of combinations of coalescence states (
) across the L loci and n individuals increases exponentially (
) with the total number of assignments. We perform Markov chain Monte Carlo (MCMC) to avoid both these sums.
To calculate the augmented likelihood, we begin by applying Bayes’ rule:Because the times since the most recent outcrossing event T depend only on the uniparental proportion
through (15), and not on the rates of mutation
Even though our model assumes the absence of physical linkage among any of the loci, the genetic data X and coalescence events I are not independent across loci because they depend on the times since the most recent outcrossing event T. Given T, however, the genetic data and coalescence events are independent across loci:
Further,
This expression reflects that the times to the most recent outcrossing event T affect the observed genotypes
only through the coalescence states
and that the coalescence states
depend only on the times to the most recent outcrossing event T, through (16).
To compute we incorporate latent variable
(Equation 17), describing the states of lineages at the most recent point at which all occur in distinct individuals (Figure 1),
(20a)reflecting that the coalescence states
establish the correspondence between the spectrum of genotypes in
and the spectrum of alleles in
and that the distribution of
given by the ESF, depends on the uniparental proportion
only through the scaled mutation rate
(Equation 5).
Given the sampled genotypes and coalescence states
at most one ordered list of alleles
produces positive
in (20a). Coalescence of the lineages at locus l in any heterozygous individual [e.g.,
with
in Figure 1] implies
for all
Any nonzero
precludes coalescence in any heterozygous individual and
must specify the observed alleles of
in the order of observation, with either 1 (
) or 2 (
) instances of the allele for any homozygous individual [e.g.,
]. For all cases with nonzero
Accordingly, expression (20a) reduces to
(20b)a sum with either 0 or 1 terms. Because all genes in
reside in distinct individuals, we obtain
from the Ewens sampling formula for a sample, of size
ordered in the sequence in which the genes are observed.
To determine in (20b), we use a fundamental property of the ESF (Ewens 1972; Karlin and McGregor 1972): the probability that the next-sampled (ith) gene represents a novel allele corresponds to
(21a)for
defined in (5), and the probability that it represents an additional copy of already-observed allele j is
(21b)for
the number of replicates of allele j in the sample at size
(
). Appendix A presents a first-principles derivation of (21a). Expressions (21) imply that for
the list of alleles at locus l in order of observance,
(22)in which
denotes the total number of distinct allelic classes,
the number of replicates of the jth allele in the sample, and
the number of lineages remaining at time
(Figure 1).
Missing data:
Our method allows the allelic class of one or both genes at each locus to be missing. In Figure 1, for example, the genotype of individual 4 is indicating that the allelic class of the first gene is observed to be β, but that of the second gene is unknown.
A missing allelic specification in the sample of genotypes leads to a missing specification for the corresponding gene in
unless the genetic lineage coalesces, in the interval between
and
with a lineage ancestral to a gene for which the allelic type was observed. Figure 1 illustrates such a coalescence event in the case of individual 4. In contrast, the lineages ancestral to the genes carried by individual 5 fail to coalesce more recently than their separation into distinct individuals, giving rise to a missing specification in
The probability of can be computed by simply summing over all possible values for each missing specification. Equivalently, those elements may simply be dropped from
before computing the probability via the ESF, the procedure implemented in our method.
Bayesian Inference Framework
Prior on mutation rates
Ewens (1972) showed for the panmictic case that the number of distinct allelic classes observed at a locus [e.g., in (22)] provides a sufficient statistic for the estimation of the scaled mutation rate. As each locus l provides relatively little information about the scaled mutation rate
(Equation 5), we make the assumption that mutation rates across loci cluster in a finite number of groups. Because we do not know a priori the group assignment of loci or even the number of distinct rate classes among the observed loci, we use the DPP to estimate simultaneously the number of groups, the value of
for each group, and the assignment of loci to groups.
The Dirichlet process comprises a base distribution, which here represents the distribution of the scaled mutation rate across groups, and a concentration parameter α, which controls the probability that each successive locus belongs to a new group. In assigning 0.1 to α, which implies a low expected number of rate classes, we adopt a conservative approach under which a new rate class is created only if the data provide sufficient support for doing so. Further, we place a gamma distribution [
] on the mean scaled mutation rate for each group. As this prior has a high variance relative to the mean (0.5), it is relatively uninformative about
Model-specific parameters
Derivations presented in the preceding section indicate that the probability of a sample of diploid genotypes under the infinite-alleles model depends on only the uniparental proportion and the scaled mutation rates
(Equation 19) across loci. These composite parameters are determined by the set of basic demographic parameters Ψ associated with each model of reproduction under consideration. As the genotypic data provide equal support to any combination of basic parameters that implies the same values of
and
the full set of basic parameters for any model is in general nonidentifiable, using the observed genotype frequency spectrum alone.
Even so, our MCMC implementation updates the basic parameters directly, with likelihoods determined from the implied values of and
This feature facilitates the incorporation of information in addition to the genotypic data that can contribute to the estimation of the basic parameters under a particular model or assessment of alternative models. We have
(23)for X the genotypic data and
the uniparental proportion determined by Ψ for the model under consideration. To determine the marginal distribution of
(Equation 3) for each locus l, we use (5), incorporating the distributions of
and
the scaling factor defined in (3):
For the pure hermaphroditism model (Equation 6),
for
the proportion of conceptions through selfing and τ the viability of uniparental individuals relative to biparental individuals. The default priors for
and τ are uniform:
(24)For the Kryptolebias model (Equation 7),
with uniform priors as the default:
(25)For the gynodioecy model (Equation 9),
including
the proportion of egg cells produced by hermaphrodites fertilized by selfing,
(Equation 10a) the proportion of females (male steriles) among reproductives, and σ the fertility of females relative to hermaphrodites. The default priors correspond to

Assessment of Accuracy and Coverage Using Simulated Data
We developed a forward-in-time simulator (https://github.com/skumagai/selfingsim) that tracks multiple neutral loci with locus-specific scaled mutation rates (Θ) in a population comprising reproducing diploid hermaphrodites of which a proportion
are of uniparental origin. We used this simulator to generate data under two sampling regimes: large (
loci in each of
diploid individuals) and small (
loci in each of
diploid individuals). We applied our Bayesian method and RMES (David et al. 2007) to simulated data sets. A description of the procedures used to assess the accuracy and coverage properties of the three methods is included in Supporting Information, File S1.
In addition, we determine the uniparental proportion () inferred from the departure from Hardy–Weinberg expectation (
) (Wright 1969) alone. Our
-based estimate entails setting the observed value of
equal to its classical expectation
(Wright 1921; Haldane 1924) and solving for
(27)In accommodating multiple loci, this estimate incorporates a multilocus estimate for
(Appendix B) but, unlike those generated by our Bayesian method and RMES, does not use identity disequilibrium across loci within individuals to infer the number of generations since the most recent outcross event in their ancestry. As our primary purpose in examining the
-based estimate (Equation 27) is to provide a baseline for the results of those likelihood-based methods, we have not attempted to develop an index of error or uncertainty for it.
Accuracy
To assess relative accuracy of estimates of the uniparental proportion we determine the bias and root-mean-square error of the three methods by averaging over
data sets (
independent samples from each of
independent simulations for each assigned
). In contrast with the point estimates of
produced by RMES, our Bayesian method generates a posterior distribution. To facilitate comparison, we reduce our estimate to a single value, the mode of the posterior distribution of
with the caveat that the median and mean may show different qualitative behavior (see File S1).
Figure 2 indicates that our method, RMES, and even the -based estimate (Equation 27) provide estimates of the uniparental proportion
that show little bias over most of its range. RMES differs from the other two methods in showing a steep rise in both bias and root-mean-square (RMS) error for high values of
with the change point occurring at lower values of the uniparental proportion
for the small-sample regime (
,
). A likely contributing factor to the increased error shown by RMES under high values of
is its default assumption that the number of generations in the ancestry of any individual does not exceed 20. Violations of this assumption arise more often under high values of
possibly promoting underestimation of the uniparental proportion. Further, RMES discards data at loci at which no heterozygotes are observed and terminates analysis altogether if the number of loci drops below 2. RMES treats all loci with zero heterozygosity (Equation 1) as uninformative, even if multiple alleles are observed. In contrast, our full-likelihood method uses data from all loci, with polymorphic loci in the absence of heterozygotes providing strong evidence of high rates of selfing (rather than low rates of mutation). Under the large sampling regime (
), RMES discards on average 50% of the loci for true
values exceeding 0.94, with
of data sets unanalyzable (<2 informative loci) even at
(Figure 3). Under the
regime, RMES discards on average 50% of loci for true
values exceeding 0.85, with ∼50% of data sets unanalyzable under
Errors for the full likelihood (posterior mode), RMES, and -based (Equation 27) methods for a large simulated sample (
individuals,
loci). In the key, rms indicates the root-mean-square error and bias the average deviation. Averages are taken across simulated data sets at each true value of
Fraction of loci and data sets that are ignored by RMES.
Coverage
We determine the fraction of data sets for which the confidence interval (C.I.) generated by RMES and the Bayesian credible interval (BCI) generated by our method contain the true value of the uniparental proportion This measure of coverage is a frequentist notion, as it treats each true value of
separately. A 95% C.I. should contain the truth 95% of the time for each specific value of
However, a 95% BCI is not expected to have 95% coverage at each value of
but rather 95% coverage averaged over values of
sampled from the prior. Of the various ways to determine a BCI for a given posterior distribution, we choose to report the highest posterior density BCI (rather than the central BCI, for example).
Figure 4 indicates that coverage of the 95% C.I.’s produced by RMES is consistently <95% across all true values under the large sampling regime (
). Coverage appears to decline as
increases, dropping from 86% for
to 64% for
In contrast, the 95% BCIs have slightly >95% frequentist coverage for each value of
except for
values very close to the extremes (0 and 1). Under very high rates of inbreeding (
), an assumption (Equation 4) of our underlying model (random outcrossing occurs on a timescale much shorter than the timescales of mutation and coalescence) is likely violated. We observed similar behavior under nominal coverage levels ranging from 0.5 to 0.99 (File S1).
Frequentist coverage at each level of for 95% intervals from RMES and the method based on the full likelihood under the large sampling regime (
). RMES intervals are 95% confidence intervals computed via profile likelihood. Full-likelihood intervals are 95% highest posterior density Bayesian credible intervals.
Number of consecutive generations of selfing
To check the accuracy of our reconstructed generations of selfing, we examine the posterior distributions of selfing times for
under the large sampling regime (
). We average posterior distributions for selfing times across 100 simulated data sets and across individuals
within each simulated data set. We then compare these averages based on the simulated data with the exact distribution of selfing times across individuals (Figure 5). The pooled posterior distribution closely matches the exact distribution. This simple check suggests that our method correctly infers the true posterior distribution of selfing times for each sampled individual.
Exact distribution of selfing times under compared to the posterior distribution averaged across individuals and across data sets.
Analysis of Microsatellite Data from Natural Populations
To illustrate the features of our method, we apply it to existing microsatellite data sets from natural populations of a self-fertilizing vertebrate and a plant. We note that the infinite-alleles model of mutation may fail to capture features of mutation processes of microsatellites.
Self-fertilizing vertebrate
Our analysis of data from the killifish K. marmoratus (Mackiewicz et al. 2006; Tatarenkov et al. 2012) incorporates genotypes from 32 microsatellite loci as well as information on the observed fraction of males. Our method jointly estimates the proportion of males in the population () together with rates of locus-specific mutation (
) and the uniparental proportion (
). We apply the method to two populations, which show highly divergent rates of inbreeding.
Parameter estimation:
Our analysis uses an expanded-likelihood expression, which directly incorporates the observation of males among
zygotes,
in which
(28)for
(Equation 8a) the fraction of males among reproductives, under the assumption that the sex ratio among observed individuals corresponds to the sex ratio among reproductives. The likelihood expression reflects that
and
are sufficient to account for
and
which are independent of
and
In the absence of direct information regarding the existence or intensity of inbreeding depression, we impose the constraint which permits estimation of the uniparental proportion
under a uniform prior:

Low outcrossing rate:
We applied our method to the BP data set described by Tatarenkov et al. (2012). This data set comprises a total of 70 individuals, collected in 2007, 2010, and 2011 from the Big Pine location in the Florida Keys.
Tatarenkov et al. (2012) report 2 males among the 201 individuals collected from various locations in the Florida Keys during this period, consistent with other estimates of ∼1% (e.g., Turner et al. 1992). Drawing on the long-term experience of the Tatarenkov–Avise laboratory, we assume observation of males of
individuals in (28). Our purpose here is to illustrate the application of the method, with researchers using the software for primary research encouraged to substitute actual numbers. Our analysis for the BP population generates a posterior distribution for the fraction of males in the population (
) with a posterior median of 0.01 and a 95% BCI of
Our estimates of mutation rates () indicate substantial variation among loci, with the median ranging over an order of magnitude (∼0.5–5.0) (Figure S8). The distribution of mutation rates across loci appears to be multimodal, with many loci having a relatively low rate and some having larger rates.
Figure 6 shows the posterior distribution of uniparental proportion with a median of 0.95 and a 95% BCI of
This estimate appears to be somewhat lower than the
-based estimate (Equation 27) of 0.97 and slightly higher than the RMES estimate of 0.94, which has a 95% C.I. of
We note that RMES discarded from the analysis data from the 9 loci (of 32) that showed no heterozygosity, even though 7 of the 9 were polymorphic in the sample.
Posterior distribution of the uniparental proportion for the BP population. The median is indicated by a black circle, with a maroon bar for the 95% BCI and a slate-colored bar for the 50% BCI.
Our method estimates the latent variables (Equation 13), representing the number of generations since the most recent outcross event in the ancestry of each individual (Figure S6). Figure 7 shows the empirical distribution of the time since outcrossing across individuals, averaged over posterior uncertainty, indicating a complete absence of biparental individuals (0 generations of selfing). Because we expect that a sample of size 70 would include at least some biparental individuals under the inferred uniparental proportion (
), this finding suggests that any biparental individuals that may exist in the sample show lower heterozygosity than expected from the observed level of genetic variation. This deficiency suggests that an extended model that accommodates biparental inbreeding or population subdivision may account for the data better than the present model, which allows only selfing and random outcrossing.
Empirical distribution of number of generations since the most recent outcross event (T) across individuals for the BP population of K. marmoratus, averaged across posterior samples. The right panel is constructed by zooming in on the panel on the left. “Expected” probabilities represent the proportion of individuals with the indicated number of selfing generations expected under the median uniparental proportion “Inferred” probabilities represent proportions inferred across individuals in the sample. The first inferred bar with positive probability corresponds to
Higher outcrossing rate:
We apply the three methods to the sample collected in 2005 from Twin Cays, Belize (TC05) (Mackiewicz et al. 2006). Compared to the BP data set, this TC data set shows considerably higher incidence of males and levels of polymorphism and heterozygosity.
We incorporate the observation of 19 males among the 112 individuals collected from Belize in 2005 (Mackiewicz et al. 2006) into the likelihood (see Equation 28). Our estimate of the population fraction of males among reproductives () has a posterior median of 0.17 with a 95% BCI of
Figure S9 indicates that the posterior medians of the locus-specific mutation rates span a wide range (∼0.5–23). Two loci appear to exhibit mutation rates substantially higher than those of other loci, both of which appear to have high rates in the BP population as well (Figure S8). The rank orders of median mutation rates estimated across loci from the two data sets show only diffuse correspondence (Figure S10).
All three methods confirm the inference of Mackiewicz et al. (2006) of much lower inbreeding in the TC population relative to the BP population. Our posterior distribution of uniparental proportion has a median and 95% BCI of
(Figure 8). This median again lies between the
-based estimate (Equation 27) of 0.39 and the RMES estimate of 0.33, which has a 95% C.I. of
In this case, RMES excluded from the analysis only a single locus, which was monomorphic in the sample.
Posterior distribution of the uniparental proportion for the TC population. Also shown are the 95% BCI (maroon), 50% BCI (slate color), and median (black circle).
Figure 9 shows the inferred distribution of the number of generations since the most recent outcross event (T) across individuals, averaged over posterior uncertainty. In contrast to the BP population, the distribution of time since the most recent outcross event in the TC population appears to conform to the distribution expected under the inferred uniparental proportion (), including a high fraction of biparental individuals (
). Figure S7 presents the posterior distribution of the number of consecutive generations of selfing in the immediate ancestry of each individual.
Empirical distribution of selfing times T across individuals, for K. marmoratus (population TC). The histogram is averaged across posterior samples.
Gynodioecious plant
We next examine data from Schiedea salicaria, a gynodioecious member of the carnation family endemic to the Hawaiian islands. We analyzed genotypes at nine microsatellite loci from 25 S. salicaria individuals collected from west Maui and identified by Wallace et al. (2011) as nonhybrids. We use this analysis to illustrate the incorporation of data in addition to the genotypic scores.
Parameter estimation:
Campbell et al. (2010) reported a 12% proportion of females ( females among
individuals). As for Kryptolebias (Equation 28), we model this information by
(29)obtaining estimates from an extended-likelihood function corresponding to the product of
and the likelihood of the genetic data.
Our analysis assumes equal seed set by females and hermaphrodites (), consistent with empirical results (Weller and Sakai 2005). In addition, we use results of experimental studies of inbreeding depression to develop an informative prior distribution for τ,
(30)the mean of which (0.2) is consistent with the results of greenhouse experiments reported by Sakai et al. (1989). We retain a uniform prior for the proportion of seeds of hermaphrodites set by self-pollen (
).
Results:
Table 1 presents posterior medians and 95% BCIs for the proportion of uniparentals among reproductives (), the proportion of seeds of hermaphrodites set by self-pollen (
), the viability of uniparental relative to biparental offspring (τ), the proportion of females among reproductives (
), and the probability that a random gene derives from a female parent [
]. Our full analysis incorporates genetic data (G), observations on the sex ratio (F), and an informative prior (I) for the relative viability of uniparentals (τ) based on results of manipulative experiments (Sakai et al. 1989). Each row represents an analysis that includes (Y) or excludes (N) information of type G, F, or I. Comparison of the YYY and NYY rows indicates that inclusion of the genetic data more than doubles the posterior median of
(from 0.112 to 0.247) and shrinks the credible interval. Comparison of the YYY and YYN rows indicates that information about the level of inbreeding depression increases the posterior median of the collective contribution of females to the gene pool [
], bringing it closer to the proportion of females (
), with equality (10b) implying maximization of relative effective number
(Equation 9d).
Analysis in the absence of data (NNN, bottom row of Table 1) provides a prior estimate for the proportion of uniparentals () of
While the proportion of seeds set by self-pollen (
) and the relative viability of uniparental offspring (τ) have uniform priors, the induced prior on composite parameter
departs from uniform on
In the absence of information about and τ, we recommend that researchers use the pure hermaphrodite model (Equation 6) with τ assigned as unity so that
will be estimated under a uniform prior. We adopt this approach to compare our method to RMES, which uses only the genotype counts. Our estimate of the uniparental proportion
[median 0.287, 95% BCI
] is similar to the estimate using all information (YYY in Table 1) and in line with the
-based estimate (Equation 27) of
In contrast, RMES gave an estimate of
[95% CI
], even though it excluded none of the loci. Application of our gynodioecy model to the genotypic counts with or without additional information (YYY, YYN, YNY, or YNN in Table 1) produces estimates of the selfing rate for which the 95% BCIs exclude zero. This unexpected estimate of RMES stands in opposition to previous work supporting the presence of selfing in this population of S. salicaria (Wallace et al. 2011).
Figure 10 presents the inferred distribution across individuals of the number of generations since the most recent outcross event T (Equation 15), averaged over posterior uncertainty, using all data (YYY). In contrast with the analysis of the K. marmoratus BP population (Figure 7), the distribution appears to be consistent with the inferred uniparental proportion
Empirical distribution of selfing times T across individuals, for S. salicaria. The histogram is averaged across posterior samples.
We include additional results obtained using all data (YYY) in File S1. Figure S11 presents posterior distributions of all basic parameters of the gynodioecy model (Equation 9). Unlike the K. marmoratus data sets, the S. salicaria data set does not appear to provide substantial evidence for large differences in locus-specific mutation rates across loci (Figure S13). Figure S12 presents the posterior distribution of the number of consecutive generations of selfing in the immediate ancestry of each individual.
Discussion
We introduce a model-based Bayesian method for the inference of the rate of self-fertilization and other aspects of mating systems. Designed to accommodate arbitrary numbers of loci, it uses the ESF to determine likelihoods in a computationally efficient manner from frequency spectra of genotypes observed at multiple unlinked sites throughout the genome. Our MCMC sampler explicitly incorporates the full set of parameters for each mating system considered (pure hermaphroditism, Kryptolebias, and gynodioecy). This construction permits incorporation of information in addition to genetic data, affording insight into components of the evolutionary process beyond the estimation of selfing rates alone.
Components of inference
Locus-specific mutation rates:
Our method permits variation among loci in the rate of mutation (Equation 3) by using the DPP to determine the number of rate classes, the mutation rate of each class, and the class to which each locus belongs. Our DPP adopts a conservative approach, creating a new rate class only if the data demand it. Under the DPP, loci belonging to the same group have identical mutation rates. This approach might be generalized, for example, by using a Dirichlet process mixture to allow variation in mutation rate among loci within a rate class.
Joint inference of mutation and inbreeding rates:
For the infinite-alleles model of mutation, the ESF (Ewens 1972) provides the probability of any allele frequency spectrum (AFS) observed at a locus in a sample derived from a panmictic population. Under partial self-fertilization, the ESF provides the probability of an AFS observed among genes, each sampled from a distinct individual. For such genic (as opposed to genotypic) samples, the coalescence process under inbreeding is identical to the standard coalescence process, but with a rescaling of time (Fu 1997; Nordborg and Donnelly 1997). Accordingly, genic samples may serve as the basis for the estimation of the single parameter of the ESF, the scaled mutation rate (Equation 5), but not the rate of inbreeding apart from the scaled mutation rate.
Our method uses the information in a genotypic sample, the genotype frequency spectrum, to infer both the uniparental proportion and the scaled mutation rate
Our sampler reconstructs the genealogical history of a sample of diploid genotypes only to the point of the most recent random-outcross event of each individual, with the number of consecutive generations of inbreeding in the immediate ancestry of a given individual (
for individual k) corresponding to a latent variable in our Bayesian inference framework. Invocation of the ESF beyond the point at which all lineages reside in separate individuals obviates the necessity of further genealogical reconstruction. As a consequence, our method may be better able to accommodate genome-scale magnitudes of observed loci (L).
Identity disequilibrium (Cockerham and Weir 1968), the correlation in heterozygosity across loci within individuals, reflects that all loci within an individual experience the most recent random-outcross event at the same time, irrespective of physical linkage. The heterozygosity profile of individual k provides information about (Equation 15), which in turn reflects the uniparental proportion
Observation of multiple individuals provides a basis for inference of both the uniparental proportion
and the scaled mutation rate
Estimation of the selfing rate
Accuracy and uncertainty:
Enjalbert and David (2000) and David et al. (2007) base estimates of selfing rate on the distribution of numbers of heterozygous loci. Both methods strip genotype information from the data, distinguishing between only homozygotes and heterozygotes, irrespective of the alleles involved. Loci lacking heterozygotes altogether (even if polymorphic) are removed from the analysis as uninformative about the magnitude of departure from Hardy–Weinberg proportions (Figure 3). As the observation of polymorphic loci with low heterozygosity provides strong evidence of inbreeding, exclusion of such loci by RMES (David et al. 2007) may contribute to its loss of accuracy for high rates of selfing (Figure 2).
Our method derives information from all loci. Like most coalescence-based models, it accounts for the level of variation as well as the way in which variation is partitioned within the sample. Even a locus monomorphic within a sample provides information about the age of the most recent common ancestor of the observed sequences, a property that was not widely appreciated prior to analyses of the absence of variation in a sample of human Y chromosomes (Dorit et al. 1995; Fu and Li 1996).
Both RMES and our method invoke independence of genealogical histories of unlinked loci, conditional on the time since the most recent outcrossing event. RMES seeks to approximate the likelihood by summing over the distribution of time since the most recent outcross event, but truncates the infinite sum at 20 generations. The increased error exhibited by RMES under high rates of inbreeding may reflect that the likelihood has a substantial mass beyond the truncation point in such cases. Our method explicitly estimates the latent variable of time since the most recent outcross for each individual (Equation 13). This quantity ranges over the nonnegative integers, but values assigned to individuals are explored by the MCMC according to their effects on the likelihood.
Estimates of the proportion of uniparental individuals (Equation 4) produced by our method appear to show greater accuracy than RMES over much of the parameter range (Figure 2). Even so, we note that all methods considered here provide fair estimates of the selfing rate, including the
-based method (Equation 27) that uses only the single-locus departures from Hardy–Weinberg proportions and not identity disequilibrium. However, our Bayesian method appears to provide a more accurate assessment of uncertainty than does the maximum-likelihood method RMES: our BCIs have good frequentist coverage properties (Figure S5), while the C.I.’s reported by RMES appear to perform less well (Figure 4).
Identifiability:
In an analysis based solely on the genotype frequency spectrum observed in a sample, the likelihood depends on just two composite parameters: the probability that a random individual is uniparental () and the scaled rates of mutation
(Equation 19) across loci. Even so, our MCMC implementation updates the full set of basic parameters, with likelihoods determined from the implied values of
and
Any model for which the parameter set Ψ (Equation 23) comprises more than one parameter is not fully identifiable from the genetic data alone. In the pure hermaphroditism model (Equation 6), for example, basic parameters (fraction of fertilizations by selfing) and τ (relative viability of uniparental offspring) are nonidentifiable: any assignments that determine the same values of composite parameters
and
have the same likelihood.
For each basic parameter in Ψ beyond one, identifiability requires incorporation of additional information beyond the genetic data. A full treatment of such information requires expansion of the likelihood function to encompass an explicit model of the new information. For example, the Kryptolebias model (Equation 7) comprises three basic parameters, including (Equation 8a), the frequency of males among reproductives. In our analysis of microsatellite data from the killifish K. marmoratus (Mackiewicz et al. 2006; Tatarenkov et al. 2012), the expanded-likelihood function corresponds to the product of the probability of the genetic data and the probability of the number of males observed among a total number of individuals (Equation 28). In a similar manner, our analysis of the data set from S. salicaria (Wallace et al. 2011) uses an extended-likelihood function that models the observed number of females as a binomial random variable (Equation 29), permitting estimation of the frequency of females among reproductives (
).
Nonidentifiable parameters can also be estimated through the incorporation of informative priors. Because identifiability is defined in terms of the likelihood, which is unaffected by priors, such parameters remain nonidentifiable. Even so, informative priors assist in their estimation through Bayesian approaches, which do not require parameters to be identifiable. Our analysis of the Schiedea data draws on experimental evidence in addition to the genotype counts to justify the assumption of equal seed set by females and hermaphrodites () (Weller and Sakai 2005) and to develop an informative prior for τ (Equation 30) (Sakai et al. 1989).
Guidance for applying the method:
Our present implementation of the method introduced here includes default priors for the basic parameters, with users encouraged to specify priors appropriate for their systems. For example, a biologically motivated prior for the relative viability of uniparentals (τ) might favor weak selection () or inbreeding depression of an intensity sufficient to maintain selfing (
).
In the Kryptolebias model (Equation 7), comprising basic parameters (proportion of eggs self-fertilized by hermaphrodites), τ (relative viability of uniparentals), and
(proportion of males among reproductives),
together with
determines the scaling of time (Equation 7c), which depends on
and τ only through
In the absence of information regarding
and τ, we recommend assigning
which permits estimation of
under the default uniform prior or a user-specified prior. This assignment produces estimates that are simply agnostic concerning the relative influence of
and τ in determining
In the four-parameter gynodioecy model (Equation 26), however, the scaling of time (Equation 9d) depends not only on (the proportion of uniparentals) and
(the proportion of females among reproductives), but also on F (the proportion of biparental offspring that have a female parent). Because F (Equation 9b) depends on
(the proportion of seeds set by self-pollen), information about τ affects inference of all basic parameters. In the absence of information about the intensity of inbreeding depression, we recommend using the pure hermaphroditism model (Equation 24) under the assignment
which permits estimation of the uniparental proportion
under a uniform prior.
Beyond estimation of the selfing rate
Unlike the other methods considered here, our method provides a framework for the incorporation of information in addition to counts of diploid genotypes and the inference of a number of aspects of the mating system beyond the selfing rate.
Time since the most recent outcross:
Our method incorporates as a latent variable (Equation 13), the number of generations since the most recent outcross event in the immediate ancestry of individual k, and provides posterior distributions for this quantity.
This aspect of the mating system is of biological interest in itself and also affords insight into the suitability of the underlying model. Pooling such estimates of times since the most recent outcross over individuals produces an empirical distribution of the number of consecutive generations of selfing. Under the assumption of a single population-wide rate of self-fertilization, we expect selfing time to have a geometric distribution with parameters corresponding to the estimated selfing rate. Empirical distributions of the estimated number of generations since the last outcross appear consistent with this expectation for the data sets derived from the TC population of K. marmoratus (Figure 9) and from Schiedea (Figure 10). In contrast, the empirical distribution for the highly inbred BP population of K. marmoratus (Figure 7) shows an absence of individuals formed by random outcrossing ().
That our method accurately estimates T from simulated data (Figure 5) argues against attributing the inferred deficiency of biparental individuals in the BP data set to an artifact of the method. Rather, the deficiency may indicate a departure from the underlying model, which assumes reproduction only through self-fertilization or random outcrossing. In particular, biparental inbreeding as well as selfing may reduce the fraction of individuals formed by random outcrossing. Misscoring of heterozygotes as homozygotes due to null alleles or other factors, a possibility directly addressed by RMES (David et al. 2007) but not by our method, may also in principle contribute to the apparent deficiency of individuals formed by random outcrossing.
Relative effective number:
Incorporation of additional information, either through extension of the likelihood or through informative priors, permits inference not only of the basic parameters but also of functions of the basic parameters. For example, Table 1 includes estimates of the proportion of seeds of hermaphrodites set by self-pollen () and the probability that a random gene derives from a female parent [
] in gynodioecious S. salicaria. We are not aware of other studies in which these quantities have been inferred from the pattern of neutral genetic variation observed in a random sample.
Among the most biologically significant functions to which this approach affords access is relative effective number E (Equation 3), a fundamental component of the reproductive value of the sexes (Fisher 1958). We denote the probability that a pair of genes, randomly drawn from distinct individuals, derive from the same parent in the preceding generation as the rate of parent sharing (). Its inverse (
) corresponds to the inbreeding effective size of Crow and Denniston (1988). Relative effective number E is the ratio of
to the total number of reproductive individuals. For example, in the absence of inbreeding (
),
in our gynodioecy model (Equation 9) corresponds to Wright’s (1969) harmonic mean expression for effective population size and E to the ratio of
and
the total number of reproductive females and hermaphrodites. In general (
), relative effective size E reflects reductions in effective size due to inbreeding in addition to differences in numbers of the sexual forms.
Figure 11 presents posterior distributions of E for the three data sets explored here. These results suggest that relative effective number E in each of the natural populations surveyed lies close to its maximum of unity, with the effective number defined through the rate of parent sharing approaching the total number of reproductives. Our estimates suggest that maximization of relative effective number would occur under a slight increase in the frequency of males (Equation 8b) in both K. marmoratus populations and a very slight decrease in the frequency of females
(Equation 10b) in the S. salicaria population.
Posterior distributions of relative effective number E (Equation 3) for data sets derived from K. marmoratus (BP and TC populations) and S. salicaria.
Acknowledgments
We thank the Reviewers, Associate Editor Rasmus Nielsen, and Senior Editor John Wakeley for valuable comments and suggestions. This project was initiated during a sabbatical visit of M.K.U. to the Department of Ecology and Evolutionary Biology at the University of California at Irvine. We thank Francisco J. Ayala for exceedingly gracious hospitality and Diane R. Campbell and all members of the Department for stimulating interactions. We thank Lisa E. Wallace for making available to us microsatellite data from S. salicaria. Public Health Service grant GM 37841 (to M.K.U.) provided partial funding for this research.
Appendix A
The Last-Sampled Gene
We present a first-principles derivation (not requiring knowledge of the ESF) of the probability that the last-sampled gene of i genes randomly sampled from distinct individuals represents a novel allele (Equation 21a).
Under the infinite-alleles model of mutation, a single mutation in a lineage suffices to distinguish a new allele. We designate as the focal gene the last-sampled gene and consider the level of the genealogical tree in which its ancestral lineage either receives a mutation or joins the gene tree of the sample at size Level ℓ of the entire (i-gene) gene tree corresponds to the segment in which ℓ lineages persist.
The probability that the line of descent of the focal gene terminates in a mutation immediately, in level i of the genealogy, isIn general, the probability that the lineage of the focal gene terminates on level
is
This expression illustrates the invariance over termination orders noted by Griffiths and Lessard (2005). Summing over all levels, including level 2, for which a mutation in either remaining lineage ensures that the focal gene represents a novel allele, we obtain the overall probability that the last-sampled gene represents a novel allele:

Appendix B
Estimators of 
We follow Weir (1996) in developing an estimate of the uniparental proportion from
alone (Equation 27).
For a single locus, a simple estimator of corresponds to
for O the observed fraction of heterozygotes in the sample and E the expected fraction based on Hardy–Weinberg proportions given the observed allele frequencies. Explicitly, we have
for
the frequency of allele u in the sample and
the frequency of homozygous genotype
in the sample. However, this estimator can be substantially biased for small samples, leading to underestimation of
(Weir 1996).
To address this bias and accommodate multiple loci, we instead adopt (B1)for n the number of diploid genotypes observed, L the number of loci, and
the number of alleles at locus l. While this estimator is also biased in general, it corresponds to the ratio of unbiased estimators of
and
in which
is the frequency of allele u at locus l in the entire population (Weir 1996). Our analysis of simulated data (Appendix D) indicates that this estimator is more accurate than an estimator that simply averages single-locus estimates:
(B2)Our
-based estimates (Equation 27) incorporate (B1) and not (B2).
Appendix C
Implementation of the MCMC
State space
The state space for the Markov chain of our MCMC sampler includes times across sampled individuals since the last outcross event T (Equation 13), coalescence events across individuals and loci since that event I (Equation 14), and model-specific parameters Ψ (Equation 23). The state space also comprises the scaled mutation rates (Equation 19), which are determined by C, a list specifying the mutation rate category
for locus
and Z, a list specifying the scaled mutation rate
for category
For example, the scaled mutation rate at locus l corresponds to
(C1)While the actual number of observed rate categories does not exceed the number of loci (L), expanding the size of the lists to
improves mixing of the MCMC by ensuring that multiple categories are available for the placement of a new, previously unobserved category (see section 6 of Neal 2000). At any given point in the MCMC, the state of the Markov chain corresponds to
Iterations
Each iteration of our MCMC sampler performs multiple updates, with each variable updated at least once per iteration. We recorded the state sampled by the MCMC at each iteration. We assessed convergence by computing for each parameter an effective number of independent samples [effective sample size (Liu 2001)]. Effective numbers for the large sample regime exceeded 500 samples for each parameter on average, while effective number for the small sample regime exceeded 250 samples for each parameter on average. We found that 2000 iterations were more than sufficient to achieve these effective numbers. About 14.5 min were required to complete 2000 iterations for the large sample regime on a Core i3–4030 processor; ∼10 sec were required for the small sample regime.
For analyses of simulated data sets, we ran Markov chains for 2000 iterations, discarding the first 200 iterations as burn-in. For analyses of the actual data sets, we ran Markov chains for 100,000 iterations, discarding the first 10,000 iterations as burn-in. Convergence appeared to occur as rapidly for actual data as for simulated data, but we found empirically that the larger number of samples was needed to achieve smooth density plots for the actual data sets.
Transition kernels
Updating of the continuous variables of mutation rates (Equation C1) and model-specific parameters Ψ (Equation 23) uses both Metropolis–Hastings (MH) transition kernels and autotuned slice-sampling transition kernels. Updating of the discrete variables
uses a Gibbs transition kernel.
Efficient inference on selfing times through collapsed Metropolis–Hastings
Simple MH proposals that separately update the time since the most recent outcross event () and coalescence history since that event (
) lead to extremely poor mixing efficiency. Strong correlations between
and
cause changes to
to be rejected with high probability unless
is updated as well. For example, consider proposing a change of
from 1 to 0. When
on average
will be 1 at half of the loci and 0 at the remaining loci. If any of the
a move to
will always be rejected because the probability of a coalescence event more recently than the most recent outcross event is 0 if the sampled individual is itself a product of outcrossing. To permit acceptance of changes to
we introduce a proposal for
that also changes
The scheme starts from the value and proposes a new value
In standard MH within Gibbs, we would compute the probability of
and of
given that all other parameters are unchanged. We modify this MH scheme to compute probabilities without conditioning on the coalescence indicators for individual k. However, the coalescence indicators for other individuals are still held constant. To compute this probability, let J indicate all the coalescence indicators
where
Then
We introduce
by summing over all possible values
Since the
for different loci are independent given
we have
Therefore, for specific values of T and
we can compute the sum over all possible values of
for
in computation time proportional to L instead of
This is possible because the L coalescence indicators for individual k each affect different loci and are conditionally independent given
and J.
After accepting or rejecting the new value of with
integrated out, we must choose new values for
given the chosen value of
Because of their conditional independence, we may separately sample each coalescence indicator
for
from its full conditional given the chosen value of
This completes the collapsed MH proposal.
Appendix D
Analysis of Simulated Data
Simulations
Our simulator (https://github.com/skumagai/selfingsim) was developed using simuPOP, publicly available at http://simupop.sourceforge.net/. It explicitly represents individuals, each bearing two genes at each of L unlinked loci. Mutations arise at locus l at scaled rate
(Equation 3), in accordance with the infinite-alleles model.
We assigned to uniparental proportion values ranging from 0.01 to 0.99, with half of the
loci assigned scaled mutation rate
and the remaining loci assigned
We conducted independent simulations for each assignment of
Each simulation was initialized with each of the
genes representing a unique allele. Most of this maximal heterozygosity was lost very rapidly, with allele number and allele frequency spectrum typically stabilizing well within
generations. After
generations, we recorded the realized population, from which 100 independent samples of
loci of size
were extracted. From this collection, we randomly chose
loci and subsampled 100 independent samples of size
Analysis
We applied our Bayesian method, the method, and RMES to
independent samples from each of
independent simulations for each assignment of the uniparental proportion
Our Bayesian method is open source and can be obtained at https://github.com/bredelings/BayesianEstimatorSelfing/. We used the implementation of RMES (David et al. 2007) provided at http://www.cefe.cnrs.fr/images/stories/DPTEEvolution/Genetique/fichiers%20Equipe/RMES%202009%282%29.zip.
Footnotes
Communicating editor: R. Nielsen
Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.179093/-/DC1.
- Received June 5, 2015.
- Accepted September 4, 2015.
- Copyright © 2015 by the Genetics Society of America