## Abstract

Features common to many mating-type regions include recombination suppression over large genomic tracts and cosegregation of genes of various functions, not necessarily related to reproduction. Model systems for homomorphic self-incompatibility (SI) in flowering plants share these characteristics. We introduce a method for the exact computation of the joint probability of numbers of neutral mutations segregating at the determinant of mating type and at a linked marker locus. The underlying Markov model incorporates strong balancing selection into a two-locus coalescent. We apply the method to obtain a maximum-likelihood estimate of the rate of recombination between a marker locus, *48A*, and *S-RNase*, the determinant of SI specificity in pistils of *Nicotiana alata*. Even though the sampled haplotypes show complete allelic linkage disequilibrium and recombinants have never been detected, a highly significant deficiency of synonymous substitutions at *48A* compared to *S-RNase* suggests a history of recombination. Our maximum-likelihood estimate indicates a rate of recombination of perhaps 3 orders of magnitude greater than the rate of synonymous mutation. This approach may facilitate the construction of genetic maps of regions tightly linked to targets of strong balancing selection.

IN many organisms, reproduction occurs only between individuals or gametes that express different sexes or mating types. Many such systems show high structural and sequence diversity among distinct mating types and greatly reduced recombination across extensive genomic tracts surrounding the determinants of mating type (Ferris *et al.* 1997; Hiscock and Kües 1999). Suppression of crossing-over between the human X and Y chromosomes appears to have occurred in a stepwise fashion, through a series of chromosomal rearrangements, with the nonrecombining euchromatic region of the Y now comprising some 23 Mb (Lahn and Page 1999; Iwase *et al.* 2003; Skaletsky *et al.* 2003). The majority of functional protein-coding genes in the male-specific (non-pseudoautosomal) region show testis-specific expression, occur in palindromes, and have their most closely related homologs dispersed throughout the genome (Rozen *et al.* 2003; Skaletsky *et al.* 2003). Most of the remaining protein-coding genes show ubiquitous expression, perform various metabolic functions, and have X-linked homologs (Skaletsky *et al.* 2003).

Two mating types exist in the unicellular green alga *Chlamydomonas reinhardtii* and in some fungi, both ascomycete (Neurospora species) and basiodiomycete (*Cryptococcus neoformans* and *Ustilago hordei*). In *C. reinhardtii*, recombination suppression over a 1-Mb tract surrounding the mating-type locus appears to derive from multiple inversions and translocations within the central 200-kb domain (R) in which mating-type-specific genes reside (Ferris *et al.* 2002). Not all genes embedded in this region contribute directly to reproduction: many perform general (“housekeeping”) functions or show maximal expression in vegetative stages of the life cycle (Ferris *et al.* 2002). In the smut fungus *U. hordei*, extensive rearrangements across a 500-kb region containing the determinants of mating type (transcription factors, pheromones, and pheromone receptors) appear to inhibit recombination between mating types (Lee *et al.* 1999). In the self-fertile *Neurospora tetrasperma*, the chromosomal segment in which the highly rearranged alternative mating-type alleles lie fails to pair during meiosis (Gallegos *et al.* 2000). In the human pathogen *C. neoformans*, alternative mating-type alleles also show extensive rearrangements, with the region of reduced recombination spanning >100 kb (Lengeler *et al.* 2002). As in Chlamydomonas, this region contains many genes that perform a variety of primary functions, not restricted to reproduction (Lengeler *et al.* 2002).

Homomorphic self-incompatibility (SI) in flowering plants prevents fertilization by pollen that express mating specificities in common with the pistil. In the best-known model systems, a single, large nonrecombining region (*S*-locus) encodes both pistil and pollen mating specificity. Genetic analysis of *S*-locus regions may afford insight into many issues of general relevance to sexual reproduction, including the genetic and evolutionary mechanisms that permit genes to maintain function in the face of recombination suppression and enforced heterozygosity.

*Brassica oleracea* and its relatives represent the model system for sporophytic self-incompatibility (SSI), in which the specificity expressed by an individual pollen grain is determined not by its own *S*-locus genotype but by the genotype of the plant that produced it. The *S*-locus region comprises several open reading frames bearing sequence similarity to genes of known function, some with no apparent relationship to SI or reproduction (Suzuki *et al.* 1999; Casselman *et al.* 2000). *S*-alleles segregating within Brassica populations differ greatly with respect to the physical distance between the determinants of pollen and pistil specificity and to the orientation of those genes relative to one another and to other genes embedded in the region (Nasrallah 2000).

In gametophytic SI (GSI) systems, the *S*-locus genotype of each pollen grain or tube determines its own specificity. In many species of the Solanaceae, a functional ribonuclease (*S*-RNase), expressed in pistil, exhibits highly specific inhibition of pollen tubes that express incompatible specificities (Lee *et al.* 1994; Murfett *et al.* 1994). Homologs of the *S-RNase* gene appear to mediate GSI in the distantly related plant families Rosaceae and Scrophulariaceae as well (Igic and Kohn 2001; Steinbachs and Holsinger 2002). While the Brassica *S*-locus spans some hundreds of kilobases (Boyes *et al.* 1997), the domain of reduced recombination surrounding *S-RNase* comprises at least 1 Mb (McCubbin and Kao 1999) and likely >4 Mb (Wang *et al.* 2003). To date, dozens of open reading frames have been identified for which linkage to *S-RNase* appears to be virtually complete (Li *et al.* 2000; McCubbin *et al.* 2000; Lai *et al.* 2002; Entani *et al.* 2003; Ushijima *et al.* 2003). If the *S*-locus region of *S*-RNase-based GSI shares with Brassica the feature of high gene density (one per 5.4 kb; Suzuki *et al.* 1999), a considerable number of expressed genes may cosegregate with *S-RNase*.

The advent of genomic sequence information has raised the prospect of using the pattern of variability across sites of known location to identify candidate targets of selection (*e.g.*, Nordborg 1997; Kim and Stephan 2002). While neutral substitution at a given site proceeds at the rate of neutral mutation, independent of linkage to targets of selection (Birky and Walsh 1988), selection can truncate or expand the time since the most recent common ancestor (MRCA) at the target and regions tightly linked to it (Hudson and Kaplan 1988; Kaplan *et al.* 1989; Takahata 1990; Charlesworth *et al.* 1993). Implementation of this method of recognizing candidate targets requires an understanding of the joint effects of multiple kinds of selection operating at nearby sites (Kim and Stephan 2000; Kelly and Wade 2002) and demographic history, including population-wide bottlenecks (Barton 1998; Galtier *et al.* 2000).

Our objective is the obverse: we seek to use the relative magnitudes of neutral diversity at markers tightly linked to the mating-type locus to infer their relative order. Other approaches for the detection of recombination recognize incongruence among gene genealogies as *prima facie* evidence of recombination (see Crandall and Templeton 1999; Wiuf *et al.* 2001).

Takebayashi *et al.* (2003) analyzed the pattern of variation at *48A*, a pollen-expressed, protein-encoding gene tightly linked to *S-RNase* in *Nicotiana alata*. Screens of hundreds of plants have yielded no recombinants (Li *et al.* 2000) and *48A* exhibits complete allelic linkage disequilibrium with *S-RNase* (Brewer 1998). However, the great deficiency in the number of segregating neutral mutations at *48A* relative to *S-RNase* departs highly significantly from expectation under the hypothesis of complete linkage (Takebayashi *et al.* 2003).

Here, we present a method for the maximum-likelihood (ML) estimation of the rate of recombination between a marker locus and a target of strong balancing selection. It explicitly incorporates symmetric, multiallelic balancing selection into a two-locus coalescent. We base our ML estimate of the recombination rate on computed (rather than simulated) probabilities of arbitrary numbers of neutral mutations observed segregating in a sample of haplotypes comprising two loci: one showing absolute linkage and the other partial linkage to the target of selection. Our method assumes the absence of intragenic recombination and that neutral mutations arise at each locus at identical per-site rates or at known relative per-locus rates. Application to the *S*-RNase-based system of GSI indicates a rate of recombination between *S-RNase* and *48A* of perhaps 3 orders of magnitude greater than the rate of synonymous mutation. Estimation of very low recombination rates, practicably undetectable by direct observation, may facilitate construction of genetic maps within mating-type regions. It also applies to other systems of symmetric balancing selection, perhaps including the vertebrate major histocompatibility complex region.

## BALANCING SELECTION

Ancestral selection graphs provide a means of incorporating selection into genealogical reconstruction (Krone and Neuhauser 1997; Neuhauser and Krone 1997). Neuhauser (1999) represented GSI as the strong-selection (lethal homozygosis) limit of symmetric overdominance, but concluded that the ancestral selection graph approach cannot be applied to this system because the diffusive limit on which it relies does not exist under selection pressures of this intensity. To accommodate GSI, we adopt a model that includes both slow diffusion and instantaneous jumps.

### Overdominance in viability:

Under symmetric overdominance (SOD), individuals that bear any specificity in homozygous form survive to reproduction at rate (1 − *s*) relative to heterozygotes. Coefficients describing the drift (mean change) in frequency of a given specificity (*x*) and the diffusion (variation around the mean change) generated by stochastic forces correspond to 1a1b

(see, for example, Takahata 1990), for *u*, the rate of mutation to alleles not presently segregating in common frequencies, and *F*, the homozygosity (inverse of the effective number of alleles; Kimura and Crow 1964). Under strong SOD, common alleles segregate in frequencies near the deterministic expectation of 1/*n*, for *n* the number of common alleles maintained at approximate steady state; in this case, *F* corresponds closely to 1/*n*. For *s* and *u* of order 1/2*N*, the coefficients in (1) describe a valid diffusion process. Under lethal homozygosis (*s* → 1), the change due to selection (1a) becomes infinite in the diffusion limit (*N* → ∞).

### Gametophytic self-incompatibility:

GSI differs from the strong-selection limit (*s* → 1) of SOD. Wright's (1939) diffusion approximation for GSI has endured various criticisms (*e.g.*, Fisher 1930; Moran 1962; Ewens 1964; Mayo 1966), but nevertheless appears to provide a remarkably accurate description of GSI evolution (Yokoyama and Nei 1979; Vekemans and Slatkin 1994). After a succession of modifications, the canonical drift and diffusion coefficients for GSI have become 2a2bfor *u* (∼*O*(1/2*N*)), the rate of mutation to new *S*-alleles (Wright 1969, pp. 405–406). The difference between the diffusion coefficients for SOD (1b) and GSI (2b) reflects that GSI interactions between diploid zygotes (pistils) and haploid gametes (pollen) cannot be reduced to interactions among gametes alone, as is the case for SOD. Even so, GSI shares with strong SOD (*s* > *O*(1/2*N*)) the property that in the diffusion limit (*N* → ∞), the mean change of the focal allele (2a) remains finite in only two frequency domains: very close to zero (*x* ∼ *O*(1/2*N*)) and very close to the deterministic equilibrium ((*F* − *x*) ∼ *O*(1/2*N*)).

Our model of GSI evolution adopts (2) within these two domains, together with jumps of the process between them (see Sasaki 1989, 1992; Gillespie 1983, 1994). A jump corresponds to the rise of an allele to common frequencies that is virtually instantaneous relative to the diffusion process. We restrict consideration to haplotypes under linkage sufficiently tight to justify ignoring recombination during jumps (contrast Kaplan *et al.* 1989 with Wiehe and Stephan 1993). Under these assumptions, jumps can induce multifurcating nodes in the gene genealogy, reflecting the simultaneous coalescence of multiple haplotypes expressing the same specificity into the single ancestral haplotype in which the specificity-determining mutation first arose.

### Invasion of new specificities:

Upon its introduction into the population in very low frequency, a new specificity wanders in a “boundary layer” (Gillespie 1983), primarily under the influence of drift, until it either declines to extinction or rises to sufficiently high frequency to permit selection to dominate the evolutionary process. In the latter case, strong selection then mediates a jump to common frequency (1/*n*). Let λ represent the rate at which new specificities rise to common frequencies. Under SOD (1), λ corresponds to 3

(Sasaki 1989, 1992), and under GSI (2), 4

Approximate stationarity of the number of common alleles (*n*) entails a balance between invasion of new alleles and extinction of common alleles. This invasion/extinction process sets the timescale of coalescence among allelic classes (Takahata 1990). Because balancing selection promotes the increase of any rare allele, allelic lineages persist in the population for longer intervals than expected under neutrality. Takahata (1990) compared the process of coalescence of symmetrically overdominant alleles in a population of size 2*N* to that among neutral alleles in a population of size 2*Nf*, for *f*, a scaling factor that represents the expansion of the timescale of the allelic genealogy caused by balancing selection. Coalescence among *j* neutral lineages occurs at rate 5and among *j* overdominant lineages at rate 6

The latter expression reflects the invasion of a new specificity (λ), without the extinction of its immediate parent specificity (1 − 1/*n*), and also the inclusion in the sample of descendants of both the offspring and parent lineages (see Hey 1992; Takahata *et al.* 1992). Equating (5) and (6) determines the scaling factor 7for λ given by (3) for SOD and by (4) for GSI.

## TWO-LOCUS LIKELIHOODS

We describe a method for determining the probability of any pattern of neutral segregating mutations observed in a sample of haplotypes. Each haplotype comprises two loci: *A* shows absolute linkage and *B* partial linkage to the specificity-determining region. We assume no intragenic recombination and an *r*-cM separation between *A* and *B*. Neutral mutations arise within *A* and *B* according to the infinite-sites model at the per-locus, per-generation rates *v _{A}* and

*v*, respectively. The population comprises

_{B}*N*diploid individuals, among which

*n*distinct common mating specificities segregate in equal frequencies.

We first summarize the derivation of probability-generating functions of numbers of segregating sites and the recursive computation of likelihoods. We then describe the Markov transition matrices specific to the evolutionary process under consideration.

### Probability-generating functions:

#### Transitions in state:

As an alternative to the analysis of full ancestral recombination graphs (Griffiths 1991; Griffiths and Marjoram 1996), we study correlations between the marginal genealogies using joint probability distributions. We represent the ancestry of the sample as two distinct but concurrent genealogies: level (*l _{A}*,

*l*) corresponds to the segment of the concurrent genealogies in which

_{B}*l*locus

_{A}*A*lineages and

*l*locus

_{B}*B*lineages remain (Figure 1).

A type 0 haplotype bears an ancestral lineage at neither locus, type 1 only at *A*, type 2 only at *B*, and type 3 at both *A* and *B* (compare Kaplan and Hudson 1985; Griffiths 1991). At any point in the genealogy, *n _{i}*

_{,}

_{j}_{,}

*of the*

_{k}*n*common specificity classes in the population contain

*i*type 1,

*j*type 2, and

*k*type 3 haplotypes (0 ≤

*i*,

*j*,

*k*≤

*L*), with 8

The state of the process corresponds to the configuration of lineages within alleles. We represent the state within level (*l _{A}*,

*l*) by

_{B}

**S**_{lA,lB}= (

*n*

_{0,0,0},

*n*

_{0,0,1}, … ,

*n*

_{i}_{,}

*, … ), for all*

_{j,k}*i*,

*j*, and

*k*. Transition matrix

**P**_{lA,lB}describes rates of transition of the process between states within level (

*l*,

_{A}*l*) and

_{B}

**Q**_{lA,lB}describes rates of transition from level (

*l*,

_{A}*l*) to lower levels (fewer lineages);

_{B}

**U**_{lA,lB}and

**V**_{lA,lB}give conditional transition rates, given that a transition has occurred, 9a9bfor

**C**_{lA,lB}a diagonal matrix, with the

*m*th diagonal element (

*C*

_{lA,lB}(

*m*)) corresponding to the total rate of transition from state

*m*to any other state.

#### Recursive derivation:

For state *m* on level (*l _{A}*,

*l*), the probability that a transition in state has occurred more recently than a mutational event in any of the lineages is

_{B}The joint probability-generating function (pgf) of the number of new mutations accumulated in any of the lineages since the most recent transition in state corresponds to 10

This expression reflects that the distribution of the total number of mutational events since the most recent transition is geometric with parameter *p*_{lA,lB}, and the number of mutations in *A* (or *B*) lineages given the total number is binomial.

Within a sample of any given configuration **S**_{lA,lB}, the segregating mutations arose either before or after the most recent transition. Because the numbers of mutations in these disjunct periods are independent, the pgf for the total number of segregating sites corresponds to the product of the pgfs for the two periods: 11a11bin which **g**_{lA,lB} represents the vector of pgfs for sample configurations on level (*l _{A}*,

*l*),

_{B}

**h**_{lA,lB}is the vector of pgfs on lower levels, and

**F**_{lA,lB}is a diagonal matrix with the

*m*th diagonal element given by

*f*

_{lA,lB}(10). Equivalently, 12a12bfor

**D**_{lA,lB}, a diagonal matrix of the denominators of

**F**_{lA,lB}. For states (1, 2) or (2, 1) (two lineages remaining in one marginal genealogy), the only lower level (1, 1) corresponds to the joint MRCA of the sample. Because any mutations in this state do not contribute to the number of segregating sites, 13

Beginning at this lowest level, the pgfs for samples of arbitrary sizes and configurations may be obtained recursively from (11) or (12).

Determination of the joint probability distribution of segregating sites requires only the derivatives of the pgfs rather than the pgfs themselves. We obtain the joint probability of observing *k _{A}* segregating sites at locus

*A*and

*k*at

_{B}*B*by taking derivatives of the vector of pgfs, 14in which the dummy variables of the pgf and the subscripts indicating level are suppressed for simplicity, and the

*c*

_{kA,jA,kB,jB}are determined from either of 15a15bwith initial conditions

*c*

_{1,0,0,0}=

*c*

_{1,1,0,0}=

*c*

_{0,0,1,0}=

*c*

_{0,0,1,1}= 1 and for any

*j*>

_{A}*k*or

_{A}*j*>

_{B}*k*. Beginning with initial levels (2, 1) and (1, 2) (13), we obtain derivatives of arbitrary orders for all sample sizes and configurations by successive application of (14), a double recursion in sample size and mutation numbers.

_{B}### Mating-type regions:

Application of this method to the estimation of the rate of recombination between the determinant of mating type and a neutral marker locus entails only specification of one-step Markov matrices (**P**_{lA,lB} and **Q**_{lA,lB}). Figure 2 illustrates the three processes that mediate changes in state: coalescence of haplotypes within specificity class through genetic drift, transfer of marker lineages among specificity classes by recombination, and turnover (invasion and extinction) of specificity classes. Drift and turnover contribute to between-level transitions (**Q**_{lA,lB}) and recombination contributes to within-level transitions (**P**_{lA,lB}).

#### Genetic drift:

We assume that at approximate steady state, each of the *n* common classes comprises 2*N*/*n* haplotypes, any pair of which may coalesce through genetic drift. Within any of the *n _{i}*

_{,}

_{j}_{,}

*specificity classes of type (*

_{k}*i*,

*j*,

*k*) (containing

*i*type 1,

*j*type 2, and

*k*type 3 haplotypes), coalescence between two type 1 haplotypes occurs at rate

*i*(

*i*− 1)

*n*/4

*N*and moves the process from level (

*l*,

_{A}*l*) to (

_{B}*l*− 1,

_{A}*l*). All other state transitions due to drift and their rates are similarly determined.

_{B}#### Recombination:

Because locus *A* shows absolute linkage to the specificity-determining region, only *B* locus regions can transfer between specificity classes by recombination. In any generation, a fraction *r*/2 of haplotypes are newly formed recombinants. We treat as negligible the probability of transmission of more than one haplotype derived from the single meiosis that generated a particular recombinant; consequently, its sister recombinant could not have borne an ancestral lineage (type 0). If the focal recombinant bears no ancestral *B* lineage (type 0 or 1), recombination would have induced no transition in state. At rate *rjn _{i}*

_{,}

_{j}_{,}

*/2, one of the type 2 haplotypes in a specificity class presently of type (*

_{k}*i*,

*j*,

*k*) represents a newly formed recombinant;

*rkn*

_{i}_{,}

_{j}_{,}

*/2 is the rate for a type 3 recombinant. If the new recombinant is type 2, the state of the recipient class immediately before the recombination event was (*

_{k}*i*,

*j*− 1,

*k*); if type 3, it was (

*i*+ 1,

*j*,

*k*− 1). The

*B*lineage of the recombinant was derived from a specificity class presently of type (

*x*,

*y*,

*z*) with probability

*n*

_{x}_{,}

_{y}_{,}

*/(*

_{z}*n*− 1) for (

*x*,

*y*,

*z*) ≠ (

*i*,

*j*,

*k*), and probability (

*n*

_{i}_{,}

_{j}_{,}

*− 1)/(*

_{k}*n*− 1) for (

*x*,

*y*,

*z*) = (

*i*,

*j*,

*k*). Our assumption that the sister to the focal recombinant was not transmitted entails that the former state of the donor class was (

*x*,

*y*+ 1,

*z*), irrespective of whether the newly formed recombinant is type 2 or 3.

#### Turnover:

At rate λ, one of the *n* common specificities segregating in the population represents a class that expanded in the immediately preceding generation. We assume that once initiated, the rise to common frequencies of an invading class occurs virtually instantaneously relative to mutation, recombination, and drift. Consequently, all lineages within the new class descend immediately and without mutation from the single ancestor in which the new specificity arose. Only the *n*_{s} specificities that contain a single type of lineage can have undergone expansion in the immediately preceding generation: 16

At approximate steady state for the number of common alleles (*n*), the extinction of an existing common allele accompanies the invasion of a new allele. At rate λ/*n*, a new specificity class (*O*) invades and replaces its immediate parent (*P*). For example, *O* presently has type (*i*, 0, 0) with probability *n _{i}*

_{,0,0}/

*n*

_{s}, having expanded from a type (1, 0, 0) ancestor. This event implies a transition in state from level (

*l*,

_{A}*l*) to (

_{B}*l*−

_{A}*i*+ 1,

*l*).

_{B}At rate λ(1 − 1/*n*), *O* invades and coexists with *P*. The probability that *O* currently has type (*i*, *j*, *k*), in which only one of *i*, *j*, and *k* is positive, and *P* currently has type (*x*, *y*, *z*) corresponds to *n _{i}*

_{,}

_{j}_{,}

_{k}n_{x}_{,}

_{y}_{,}

*/*

_{z}*n*

_{s}(

*n*− 1) for (

*x*,

*y*,

*z*) ≠ (

*i*,

*j*,

*k*), and to

*n*

_{i}_{,}

_{j}_{,}

*(*

_{k}*n*

_{i}_{,}

_{j}_{,}

*− 1)/*

_{k}*n*

_{s}(

*n*− 1) for (

*x*,

*y*,

*z*) = (

*i*,

*j*,

*k*). For example, if

*O*presently has type (0, 0,

*k*), then immediately before the turnover event,

*P*had type (

*x*,

*y*,

*z*+ 1) and the process was at level (

*l*−

_{A}*k*+ 1,

*l*−

_{B}*k*+ 1). Similar considerations provide the rates of all possible state transitions due to turnover.

#### Relative rates of transition:

For state *m* on level (*l _{A}*,

*l*), the joint pgf of the numbers of mutations accumulated at locus

_{B}*A*and locus

*B*before the first transition in state corresponds to 17for

*C*

_{D},

*C*

_{R}, and

*C*

_{T}rates of change due to genetic drift within specificity class, recombination, and turnover of specificity classes: 18a18b, 18c

We consider a rapid-drift approximation (compare Takahata 1991 with Notohara 2000), which entails that coalescence within specificity class occurs virtually instantaneously relative to recombination or coalescence between classes (*n*/2*N* ≫ *r*/2, λ). Lineages ancestral to a present-day sample in any configuration quickly come to reside in separate classes. Upon the formation, by recombination or turnover, of a class that contains more than one lineage-bearing haplotype, genetic drift dominates the evolutionary process (*C*_{D} ≫ *C*_{R}, *C*_{T}), ensuring immediate transition to a lower level. Between such formations, *C*_{D} = 0 and the rates of transition through turnover or recombination correspond to 19a19b

[from (18)].

## APPLICATION

We apply our method to obtain a maximum-likelihood estimate (MLE) of the rate of recombination between marker locus *48A* and *S-RNase*, the determinant of pistil specificity in *S*-RNase-based systems of GSI. We assume that substitution of neutral mutations occurs at the per-site rate *v* in both loci, with the per-locus rates proportional to sequence length (*v _{A}* =

*vX*,

_{A}*v*=

_{b}*vX*, for

_{B}*X*and

_{A}*X*numbers of sites). The parameters of our model include the number of common

_{B}*S*-alleles segregating in the population (

*n*), scaled per-site rate of neutral substitution (θ = 4

*Nv*), rate of invasion of new common

*S*-alleles [λ (4), or its inverse function

*f*(7)], and scaled rate of recombination (

*R*= 4

*Nr*).

Hudson and Kaplan's (1988) analysis of symmetric biallelic overdominance without allelic turnover suggested that the rate of recombination between the target of balancing selection and a linked neutral site might be inferred from the level of variation at the neutral site alone. In the presence of allelic turnover, however, this information is insufficient because balancing selection can decrease as well as increase the age of the MRCA at linked sites (Uyenoyama and Takebayashi 2004). Simultaneous consideration of variation at sites absolutely linked to the target of selection improves inference about the rate of recombination between the target and partially linked sites.

In this section, we summarize our previous estimates of the number of mutation events at *S-RNase*, assumed absolutely linked to the *S*-locus, and at the partially linked marker *48A*, and our conclusion that the pattern of variation at the marker locus alone provides an insufficient basis for the estimation of the rate of recombination between the loci. We then describe analyses of variation at each locus separately, comparing estimates obtained from the method-of-moments and likelihood approaches. Finally, we present results of our ML analysis of the joint pattern of variation at *S-RNase* and *48A*.

### Numbers of segregating sites:

Takebayashi *et al.* (2003) used the approach of Yang *et al.* (2000) to assign each codon within *S-RNase* and *48A* to either a positive selection partition or a conservative evolution partition. Because our primary goal was to infer phylogenetic history from neutral variation at the subset of sites from which all targets of positive selection had been excluded, we adopted a very liberal criterion for positive selection. Table 1 presents our estimates, corresponding to the total lengths of ML trees, of the numbers of neutral mutations segregating in the conservative evolution partition in a sample of five two-locus haplotypes, each representing a distinct *S*-allele specificity. The estimated number of synonymous mutations at *S-RNase* exceeds the number of synonymous sites, indicating that many sites have accumulated multiple hits within the span of the very deep *S*-allele genealogy. Even under this underestimation of the number of mutations in *S-RNase*, the >15-fold deficiency in *48A* departs highly significantly from expectation under the hypothesis of absolute linkage (Takebayashi *et al.* 2003).

Underestimation of the total number of mutations accumulated at *S-RNase* over the span of the *S*-allele genealogy may have caused underestimation of both genealogical depth (*f*θ) and recombination rate (*fR*). The net effect of these biases on our estimate of the rate of recombination relative to the rate of synonymous substitution (*R*/θ) remains unexplored.

### Insufficiency of information at marker locus alone:

A key determinant of the rate of coalescence among marker lineages is the rate of formation of *S*-allele classes that contain multiple lineages. Because allelic turnover as well as recombination can generate such classes, the number of segregating sites at the marker locus alone provides an insufficient basis for the estimation of the recombination rate.

Figure 3, from Uyenoyama and Takebayashi (2004), shows the likelihood surface generated by our method from the estimated number of segregating sites at *48A* under the assignment *n* = 20 and *f* = 10. While information on the number (*n*) of *S*-alleles segregating in ancestral natural populations of *N. alata* is unavailable, the extensive body of empirical and theoretical work on gametophytic SI systems suggests the segregation of dozens of *S*-alleles (Lawrence 2000). To an order of magnitude, the assigned value of the scaling factor (*f* = 10) is comparable to rough estimates for major histocompatibility loci (Takahata *et al.* 1992) and fungal mating-type genes (May *et al.* 1999).

The existence of a ridge of nearly equal likelihoods (Figure 3) indicates that the low level of neutral variation at *48A* relative to *S-RNase* is consistent with a continuum of values of *R*/θ and *f*θ. For example, low variation might reflect tight linkage to the *S*-locus (small *R*/θ) and frequent rapid expansions of newly arisen alleles (high turnover rate λ, or small *f*θ). Alternatively, loose linkage (large *R*/θ) would moderate the expected increase in age of the MRCA at sites linked to a target of balancing selection that is subject to infrequent turnover (large *f*θ). Consideration of additional marker loci would introduce a new recombination rate for each locus; even so, analysis of the joint numbers of segregating sites would improve the estimation if linkage relationships among the markers were known.

### Analysis of variation at each locus separately:

Under the rapid-drift approximation (19), the pgf of the number of segregating sites in samples of size *L* reduces, under a rescaling of time, to Watterson's (1975) classic result for populations without specificity classes, 20in which 21

(Uyenoyama and Takebayashi 2004). The rescaling corresponds to the expected waiting time for formation of a specificity class that contains more than one lineage (1/[*r*/2(*n* − 1) + λ/*n*^{2}]). That this pgf has only one parameter suggests that the number of segregating sites at the marker locus provides information about only one composite parameter, precluding separation of *R*/θ and *f*θ.

For comparison to our ML analysis, we outline a method-of-moments approach using the one-locus, rapid-drift pgf (20). This pgf provides familiar expressions for the first two moments of the number of segregating sites (*S*), 22a22bin which and , which suggest an unbiased moment estimator of θ* _{R}* of

*Ŝ*/

*a*, with variance

_{L}*Ŝ*/

*a*

_{L}+

*Ŝ*

^{2}

*b*

_{L}/

*a*

^{4}

_{L}(Watterson 1975).

Table 2 presents estimates of θ* _{R}* (21) based on the number of segregating sites at each locus separately. Shown are both method-of-moments estimates, with approximate standard deviations, and MLEs, with 2-unit support ranges (two log-likelihood units; Edwards 1972). In the absence of recombination between

*48A*and

*S-RNase*(

*R*= 0), composite parameters

*f*θ and θ

*(21) would be identical. In accord with the significant rejection of absolute linkage by an exact test (Takebayashi*

_{R}*et al.*2003), the 2-unit support ranges for the MLEs of these parameters do not overlap. In contrast, the moments approach fails to distinguish between the estimates of

*f*θ and θ

*, or even between either estimate and zero. Both the asymmetry of the 2-unit support ranges of the MLEs and the magnitude of the confidence intervals obtained from the moments approach suggest that the variance estimates provide little indication of dispersion about the expectation.*

_{R}### Two-locus likelihoods:

Considering variation at both loci, we obtained maximum-likelihood estimates of *R*/θ and *f*θ under the rapid-drift approximation (19). The joint likelihood represents the probability of the observed pair of numbers of segregating sites at the two loci, and the composite likelihood the product of the probabilities determined separately for each locus. Maximization of the composite likelihood would treat the numbers of segregating sites at *S-RNase* and *48A* as independent. Correlations between loci reflect periods of shared history between lineages while residing in the same haplotype (Kaplan and Hudson 1985; Wiuf and Hein 1999). Recombination tends to break up any type 3 haplotypes in the initial sample into types 1 and 2.

To assess the effect of shared history, we compared the MLEs determined by maximizing the joint likelihood to those determined by maximizing the composite likelihood obtained from the two one-locus models (20). Table 3 shows very close correspondence between the composite and joint MLEs across a range spanning realistic values of the number of *S*-alleles in the population (*n*). This agreement suggests that recombination has occurred at a sufficiently high rate to justify ignoring the correlation due to the initial sampling of type 3 haplotypes.

Assuming negligible correlation between the loci, we studied the shape of the composite-likelihood surface. Figure 4 shows composite likelihoods of *R*/θ and *f*θ under the assignment *n* = 20. The steep decline in the likelihood of small values of *R*/θ supports our earlier rejection of absolute linkage (Takebayashi *et al.* 2003). This analysis provides an MLE of *R*/θ of 841.56 (Figure 4, X).

Rearrangement of (21) suggests another means of estimating *R*/θ, 23for determined from the number of segregating sites at *48A* and from *S-RNase*. Assignment of *n* = 20 and and to the moments estimates given in Table 2 gives = 776.52 (Figure 4, horizontal line), somewhat less than the MLE.

## DISCUSSION

### Maximum-likelihood estimation of very low recombination rates:

We have described a method for the recursive computation of the exact sampling distribution of the numbers of neutral mutations segregating at a marker locus and a target of balancing selection in a sample of arbitrary size and configuration. A >15-fold deficiency in variation at marker locus *48A* relative to *S-RNase*, the determinant of SI specificity in pistil (Table 1; Takebayashi *et al.* 2003), corresponds to an MLE of the interlocus recombination rate of perhaps 3 orders of magnitude greater than the per-site rate of synonymous substitution (Figure 4).

To an order of magnitude, rates of synonymous substitution in plants (*Arabidopsis thaliana*; Zhang *et al.* 2002) as well as mammals (Li *et al.* 1985; Kumar and Subramanian 2002) fall in the range 10^{−8}–10^{−9}/site/year. Under the assumption of an annual life cycle, our estimates suggest a rate of recombination between *48A* and *S-RNase* on the order of 10^{−3} or 10^{−4} cM. Li *et al.* (2000) detected no recombination between the *S*-locus and three marker loci, including *48A*, in screens of hundreds of plants (Table 4). The minimum detectable rate of recombination (∼1 cM) exceeds by several orders of magnitude our MLE, which reflects recombination events that occurred within the span of the *S*-allele genealogy. Segregating *S*-allele classes appear to have diverged >30 million years (MY) ago in both the solanaceous *S*-RNase-based GSI system (Ioerger *et al.* 1990) and the Brassica SSI system (Uyenoyama 1995). Adopting this figure for the average pairwise divergence time suggests an expected total length of a five-allele genealogy of 120 MY. Over intervals of this magnitude, recombination at even the very low rate suggested by our analysis appears to have been sufficient to cause a significant reduction in the number of synonymous mutations segregating at *48A* relative to *S-RNase*.

### Countervailing effects of strong balancing selection:

Most studies of the effects of balancing selection on neutral variability in linked regions restrict consideration to the biallelic case (*e.g.*, Strobeck 1980; Kaplan *et al.* 1988; Barton and Navarro 2002). In contrast, homomorphic SI systems maintain dozens or even hundreds of *S*-alleles (Lawrence 2000). A phenomenon that arises in multiallelic but not biallelic systems held in polymorphism by strong balancing selection is the rapid increase in frequency of novel or rare alleles. Such abrupt expansions tend to decrease neutral variation within specificity class at hitchhiking sites. However, strong balancing selection can also increase divergence times and neutral diversity at linked sites by expanding the timescale of coalescence among allelic classes (Takahata 1990; Sasaki 1992; Vekemans and Slatkin 1994; Uyenoyama 2003). Recombination and allelic turnover jointly influence the overall pattern of nucleotide variation. In contrast, under the assumptions of constant allele frequencies and no allelic turnover, the observed decline with looser linkage in interallelic pairwise differences in a sliding-window analysis of the *Adh* locus of *Drosophila melanogaster* was deemed sufficient to identify the nucleotide site subject to balancing selection (Hudson and Kaplan 1988).

Distinguishing between an ancient episode of purifying selection at a tightly linked site and a recent episode at a loosely linked site presents a general challenge to methods that seek to infer the location of a target of selection from the pattern of neutral variation (Kim and Stephan 2002; Navarro and Barton 2002). The likelihood surface obtained from consideration of variation at *48A* alone (Figure 3) reflects these alternative explanations for low variation. In our study, the problem was exacerbated by the nature of the sample, which comprised only one representative of each *S*-allele. The magnitude of variation within mating-type alleles can provide information about the age of a given specificity and consequently the rate of specificity turnover (Slatkin and Rannala 1997; May *et al.* 1999).

### Evolution under tight linkage and enforced heterozygosity:

In addition to the intense balancing selection engendered by GSI, “Muller's ratchet” (Felsenstein 1974) and related degenerative processes (Charlesworth and Charlesworth 2000) likely operate within mating-type regions. Rozen *et al.* (2003) have proposed that gene conversion within the extensive palindromes of the human Y chromosome has permitted preservation of many essential male-specific genes. Degeneration appears to be the fate of Y-linked genes that lack recourse to such special mechanisms (Marshall Graves 2002).

In regions under complete linkage to the *S*-locus, enforcement of *S*-allele heterozygosity postpones formation of zygotes bearing a mutation in homozygous form until after the diversification in mating type of its carriers. Ancient (>30 MY) divergence among *S-*alleles and tight linkage to the *S*-locus of genomic tracts of considerable size (Boyes *et al.* 1997; McCubbin and Kao 1999; Wang *et al.* 2003) together imply the accumulation of mutations over very long periods and at a great many sites prior to their selective screening in homozygous form.

Small population size promotes the substitution of deleterious mutations (Nei 1970). Subdivision of the gamete pool into many mating types exacerbates degeneration at sites tightly linked to mating-type regions. In populations with equal frequencies of the sexes, Y-linked genes experience an effective population size fourfold lower than that of autosomal genes. Linkage to the *S*-locus reduces effective population size by *n*-fold, for *n* the typically large number of common *S*-alleles segregating in natural populations (Lawrence 2000). As a consequence, linkage disequilibrium between deleterious mutations and the *S*-locus (*S*-allele-specific load) may accumulate.

*S*-allele-specific load:

An analysis of the shape of *S-RNase* genealogies indicated significantly long terminal branches (Uyenoyama 1997). This method has been used to detect similar distortions in a larger sample of solanaceous *S*-alleles (Richman and Kohn 1999), among Brassica SSI *S*-alleles (Schierup *et al.* 1998), among fungal mating specificities (May *et al.* 1999), and at a major histocompatibility complex class II locus (Richman and Kohn 1999). Even the *Adh* polymorphism in *D. melanogaster*, an exemplar of balancing selection, shows this pattern (Hudson and Kaplan 1986).

Uyenoyama (1997)(2000) proposed that progressive intensification of allele-specific load may contribute to this kind of genealogical distortion. Analysis of an extension of Wright's (1939) classical diffusion model of GSI indicated that allele-specific load reduces the expected number of common alleles (*n*) maintained in populations and the rate of *S*-allele diversification (λ ∝ 1/*f*) for a given effective population size and rate of origin of new *S*-alleles (Uyenoyama 2003). Although *S*-allele-specific load is expected to change the nature of the dependence of *f* and *n* on population size and rate of origin of new alleles, our analysis treats *f* and *n* as phenomenological parameters. Consequently, we conjecture that the presence of *S*-allele-specific load may not severely compromise our estimate of the relative rate of recombination (*R*/θ).

### Method of moments:

Many workers have used backward recursions to determine the moments of divergence times or numbers of pairwise differences (*e.g*., Kaplan *et al.* 1988; Notohara 1990; Takahata 1991; Wakeley 1998). In particular, Kaplan and Hudson (1985) derived the mean, variance, and covariance of numbers of segregating sites at multiple loci in samples of arbitrary size. While the expectation corresponds well to the high-likelihood region (Figure 4), the confidence intervals appear uninformative (Table 2; Uyenoyama and Takebayashi 2004). For the system at hand, likelihood methods may provide a more reliable basis for statistical inference.

### Maximum-likelihood methods:

Felsenstein *et al.* (1999) and Stephens and Donnelly (2000) have provided lucid accounts of two powerful ML approaches using Monte Carlo computational methods. Felsenstein and colleagues have addressed the estimation of population parameters, using Metropolis-Hastings (MH) sampling of genealogies to average over the missing datum of the genealogy of the sampled genes. In the Felsenstein approach, a genealogy comprises topology and branch lengths in units of mutations. Alternatively, Griffiths and Tavaré (1994) characterize the probability distribution of entire histories as solutions of systems of linear equations. In the Griffiths/Tavaré approach, a history comprises topology together with assignment of mutations to branches. Because the central linear recursions are defined over the effectively infinite-dimensioned state space of history, importance sampling (IS) is used to approximate the underlying Markov chain.

The joint distribution of numbers of segregating sites at two neutral loci (Griffiths 1981) or of the ages of the most recent common ancestors of sampled genes at multiple neutral loci (Griffiths and Marjoram 1996; Griffiths 1999; Slade 2002) provides information about rates of recombination among the loci. Implementations of MH (Kuhner *et al.* 2000) and IS (Fearnhead and Donnelly 2001, 2002) methods for the estimation of recombination rates have been developed. Incorporation of recombination entails analysis of the even larger state space of ancestral recombination graphs (Griffiths 1991; Griffiths and Marjoram 1996).

Our primary objective is the estimation of population parameters rather than genealogy reconstruction. Our method is much simpler than other ML approaches because we base our estimate of the rate of recombination on the summary statistic of number of segregating sites. Computation of the likelihoods, which incorporate averaging over all possible histories of the sampled haplotypes, entails recursive solution of linear systems of equations, not simulation of a posterior distribution of genealogies. The variables of our recursion system correspond to probability-generating functions of numbers of segregating sites, and the state space of our Markov chain comprises all possible configurations of the sample (assignment of lineages to *S*-allele classes). Mutations are not part of the definition of state, as in the Griffiths/Tavaré approach, but are tracked separately by the pgfs. The enormous reduction in the dimension of the state space permits direct solution of the systems of linear equations without resort to IS computational methods. Further, the simplicity of the analytical description [*e.g.*, (11) and (14)] serves to clarify the evolutionary process.

### Applications:

Our approach may facilitate the determination of the relative order of markers in regions containing targets of various forms of strong balancing selection (see Introduction). Sampling strategies most useful for inferring gene order from the relative numbers of segregating sites include obtaining haplotypes that comprise the target of balancing selection as well as the marker loci, multiple haplotypes corresponding to the same specificity class, and multiple distinct specificity classes.

In principle, our method can accommodate additional summary statistics (*e.g.*, number of distinct haplotypes, frequency spectrum of mutations) and general biological processes or population structures. Customization to the ecology and genetics of the particular system under study entails redefinition of the state space and derivation of one-step Markov transition matrices (9). In our present analysis, these tasks were straightforward, even while accommodating a two-locus coalescent, intense balancing selection and gene genealogies with multifurcating nodes. In practice, the range of applications would be limited primarily by the size of the central system of linear recursions (14). Efficient IS methods have been described for some arbitrarily large systems (Griffiths and Tavaré 1994; Stephens and Donnelly 2000).

## Acknowledgments

We thank two anonymous reviewers for constructive comments. E.N. receives funding from the Australian Research Council. U.S. Public Health Service grant GM 37841 to M.K.U. provided support for this study.

## Footnotes

↵

^{1}*Present address:*Department of Biology and Wildlife, University of Alaska, Fairbanks, AK 99775.Communicating editor: Y.-X. Fu

- Received August 21, 2003.
- Accepted April 30, 2004.

- Genetics Society of America