## Abstract

In stock enhancement programs, it is important to assess mixing rates of released individuals in stocks. For this purpose, genetic stock identification has been applied. The allele frequencies in a composite population are expressed as a mixture of the allele frequencies in the natural and released populations. The estimation of mixing rates is possible, under successive sampling from the composite population, on the basis of temporal changes in allele frequencies. The allele frequencies in the natural population may be estimated from those of the composite population in the preceding year. However, it should be noted that these frequencies can vary between generations due to genetic drift. In this article, we develop a new method for simultaneous estimation of mixing rates and genetic drift in a stock enhancement program. Numerical simulation shows that our procedure estimates the mixing rate with little bias. Although the genetic drift is underestimated when the amount of information is small, reduction of the bias is possible by analyzing multiple unlinked loci. The method was applied to real data on mud crab stocking, and the result showed a yearly variation in the mixing rate.

IT is well known that many fisheries stocks have been reduced because of overfishing, habitat loss, and degradation. (Pauly *et al.* 2002). Some management procedures, such as setting a total allowable catch (TAC), are the usual practices for sustainable use of such fisheries stocks (*e.g.*, Hilborn and Walters 1992). In addition to such management procedures, stock enhancement programs through release of hatchery-reared juveniles are used to supplement reduced natural stocks (Blaxter 2000; Svåsand *et al.* 2000; Leber *et al.* 2004; Kitada and Kishino 2006).

In stock enhancement programs, the assessment of the contribution of the released population to the commercial catch is one of the important tasks (Kitada and Kishino 2006). Also, it is necessary to monitor the genetic impact of release on a natural population. Evaluation of the mixing conditions of a released population into a natural population must then be carried out. However, if juveniles are too small to tag and/or the tag-shedding rate is significant, it is difficult to determine their mixing compositions through experiments with artificial tags. For such cases, there is the possibility of stock identification using genetic tags such as mtDNA haplotypes and microsatellite alleles (*e.g.*, Leber *et al.* 2004).

We now consider a fisheries stock of a random-mating population in a closed fishing ground. Here, we call this population a natural population. Suppose that in each year hatchery-reared juveniles are released before the fishing season. Then, a population after the release of the juveniles composes a composite population, which consists of two source populations, that is, the natural and supplemented populations. The composite population could in turn be naturally reproducing in the next year. We call it a natural population again.

The haplotype or allele frequencies in a population in the fishing ground change temporally for two reasons: the release of juveniles and the genetic drift. The amount of genetic drift depends on the effective population size (*N*_{e}) (Nei and Tajima 1981; Waples 1989). The methods for estimating *N*_{e} on the basis of temporal changes in haplotype or allele frequencies have been developed. Waples (1989) introduced a moment-based estimation method using the standardized variance in allele frequency change. More efficient estimation by the maximum-likelihood method was introduced by Williamson and Slatkin (1999) and Anderson *et al.* (2000). Furthermore, Anderson (2005) presented a coalescent-based likelihood method for temporally sampled individuals.

Such temporal change in sampled haplotype or allele frequencies can also be used for the estimation of mixing rates. Typically, the estimation of mixing rates in a mixed population needs samples from different baseline populations as well as the composite population. For such situations, the maximum-likelihood (ML) and Bayesian methods were developed (*e.g.*, Millar 1987; Pella and Masuda 2001). Another situation occurs in stock enhancement programs. In this case, one baseline population is a natural population before release, and the other is a released population. A mixed population is a composite population after release. Usually, sampling from the composite population is conducted in the fishing season, and the composite population in the previous year acts as a baseline. In this sense, a composite population in one year plays two roles: as a mixed population for the sampling year and as a baseline natural population for the following year. This kind of method was applied to data for the mud crab, *Scylla paramamosain*, in Japan. Imai *et al.* (2002) used the mitochondrial DNA (mtDNA) D-loop region as a genetic tag and gave a point estimate for the mixing rate of juveniles released for a year. Obata *et al.* (2006) obtained point estimates of mixing rates and their confidence intervals using the maximum-likelihood method.

As mentioned above, haplotype or allele frequencies can vary due to genetic drift. Hatchery releases could be effective for depleted populations with limited recruitment. Supportive breeding (Ryman and Laikre 1991) programs also target small populations of endangered species. In such cases, the effective population size could be small and the amount of genetic drift cannot be ignored. If the effect of genetic drift is large, actual haplotype or allele frequencies in the natural population, regarded as a baseline population, are different from frequencies in the natural population at release. The natural population at release should actually act as a baseline. Obata *et al.* (2006) conducted a simulation for sensitivity tests to quantify the magnitude of changes in estimates and standard errors of mixing rates for various values of genetic drift assumed to be known. However, the effect of the genetic drift on the estimate of the mixing rate has not been fully evaluated and the reliability of the classical method for such situations remains unknown.

In this article, we develop a new method to simultaneously estimate mixing rates and the magnitude of the genetic drift, on the basis of successive allelic frequencies. Our statistical model includes further parameters such as frequencies in the initial population and latent vectors that represent random haplotype or allele frequencies due to genetic drift. While the complete likelihood function for such latent variables and observed data can be explicitly expressed, the likelihood function for observed data does not have a closed form. To obtain the maximum-likelihood estimate of the genetic drift, we use a Monte Carlo approximation of likelihood with a Markov chain Monte Carlo (MCMC) method (*e.g.*, Gelman *et al.* 2004; Robert and Casella 2004). Performance of the proposed method was evaluated via numerical simulations and real data of mud crab stocking in Japan were analyzed.

## MODELS AND METHODS

We mostly illustrate our statistical model and estimation method for the case where alleles at a single locus (or haplotypes in mtDNA) are observed.

### Statistical model:

Let be a vector of the true allele frequencies for a population in a fishing ground in year , where *J* is the number of alleles and . Let denote the observed allele frequencies of samples randomly collected from this natural population in year *y*. Then, given *p*_{1y}, the observed frequencies *n*_{1y} are assumed to have a multinomial distribution,where is twice the sample size for the diploid (the sample size itself for the haploid marker).

Suppose that juveniles are reared each year using some sampled individuals with minor alleles and then released into the natural population in the following year. In this case, the alleles used in the released population act as genetic tags in the natural population. Let be a vector of allele frequencies for released juveniles in year (). Let denote sampled counts of allele frequencies from the released population in year *y* with a multinomial distribution,where .

The allele frequencies in the population in the fishing ground change due to two factors: genetic drift and the release of juveniles into the population. To represent the variation due to the genetic drift, we assume a Dirichlet distribution with a probability density function(1)where is a vector of allele frequencies before release of juveniles in year *y*, and θ is a parameter that controls the amount of genetic drift; the smaller θ is, the larger the amount of genetic drift, and vice versa. In fact, under the assumption of nonoverlapping generations, the drift variance of allele frequencies between generations is given by *p*(1 − *p*)/2*N*_{e}, where *p* is an allele frequency in the previous generation and *N*_{e} is the effective population size (*e.g.*, Nei and Tajima 1981; Waples 1989). Note that the variance for mtDNA haplotype frequencies is *p*(1 − *p*)/*N*_{f}, where *N*_{f} is the effective female population size. In this model, the variance of based on the Dirichlet distribution above is given byHence, the parameter θ is closely related to the effective population size as 1 + θ = 2*N*_{e} or 1 + θ = *N*_{f}. In this article, we assume a constant θ over all years.

Now, the vector of frequencies cannot be observed because the sample is taken after the release of juveniles, and therefore it is regarded as a latent random vector. After the genetic drift among years, the vector of allele frequencies further changes because of mixing with the released juveniles, aswhere ω_{y} is the mixing rate of the released population into the natural population in year *y*. As a summary of our statistical model, a schematic diagram is shown in Figure 1.

### Likelihood:

In this model, unknown parameters are mixing rates, , the degree of the genetic drift, θ, the vector of the true allele frequencies in a natural population in the initial year, *p*_{11}, and those of the released population, . For notational convenience, we summarize random variables as , , , and . If the latent vectors can be observed, the likelihood function for the unknown parameters, ω, θ, *p*_{11}, *p*_{2}, based on the observations *n*_{1}, *n*_{2} and the latent variables , is explicitly expressed as follows:

Now, are unobservable latent variables, and hence we must consider a likelihood function based on the marginal distribution of *n*_{1} and *n*_{2} as follows:Unfortunately, an explicit formula for the marginal likelihood function is not available. To find the maximum-likelihood estimates, we use a Monte Carlo approximation of likelihood as explained below.

### Estimation algorithm:

For simplicity of illustration, we assume that the allele frequencies *p*_{2} are known. In this case, the observation *n*_{2} is not used in the estimation. The method with unknown *p*_{2} is straightforward.

Let ϕ denote the set of parameters (ω, θ, *p*_{11}). To approximate *L*_{M}(ϕ | *n*_{1}) via a Monte Carlo simulation, we use the formulawhere ϕ* is a reference vector, and is the conditional distribution of the latent variables at ϕ* given the observed data *n*_{1}. This integration can be approximated by Monte Carlo aswhere are the realization of drawn from the conditional distribution .

Unfortunately, a closed form of the conditional distribution does not exist because it includes the marginal distribution *L*_{M}(ϕ* | *n*_{1}). To generate the random numbers from , we use a Metropolis-within-Gibbs sampling method (*e.g.*, Robert and Casella 2004). Details of this algorithm are described in appendix a.

In general, the precision of the approximation depends on the choice of ϕ* as well as the size of *m*. One option for the determination of ω* and *p**_{11} in ϕ* is to use the maximum-likelihood estimates by ignoring the genetic drift. In this case, no latent variables are included in the likelihood. Another option is to use a Monte Carlo version of the EM algorithm (*e.g.*, McLachlan and Krishnan 1997). A detailed description of this algorithm is given in appendix b. It is known that convergence of the EM algorithm tends to be slow, but values of parameters that are innovated up to several steps can be used for ϕ*, including θ*.

### The case of multiple unlinked loci:

Our model can be extended for the case of multiple unlinked loci by considering the composite genotype frequencies. Different loci receive independent genetic drifts, and therefore the distribution of the latent allele frequencies, which is expressed in Equation 1, can be defined locus-by-locus. On the other hand, although the composite genotype frequencies are expressed as products of the genotype frequencies at the loci, the probability distribution of observed composite genotype is not given by the product of multinomial distributions for the loci. Here, we denote by *x*_{1yil} a composite genotype at the *l*th locus for the *i*th individual sampled from the composite population in year *y* (2 ≤ *y* ≤ *Y*) and by and *p*_{2yl} vectors of allele frequencies at the *l*th locus in the natural population before release and in the released population in year *y*, respectively. Then the distribution of *x*_{1yi} given and *p*_{2yl} is expressed aswhere *L* is the number of loci, and the *f*'s are probability distributions for the two source populations. The estimation algorithm with a MCMC becomes complicated for this case because the acceptance ratios in MCMC samplings depend on information from all the loci. In this article, we use the product of multinomial distributions from each locus as an approximated model.

## SIMULATION STUDIES

### Simulation scenarios:

To evaluate the estimation performance in our statistical model and the difference of estimates between considering and ignoring the genetic drift, we used simulation studies. As scenario I, the length of sampling years was fixed at 7, and the number of loci was assumed to be 1 or 5. Furthermore, as scenario II, we considered an extreme case where samples were taken for only 2 years. In this scenario, 5 or 10 loci were analyzed to cover the limited information from only 2 years. The allele frequencies in the natural population in the initial year and those in released populations were assumed to be common over loci, and the frequencies in the released populations were known. Note that, for the case of multiple loci, composite genotypes were generated from the true distribution, and then parameters were estimated using the approximated likelihood. To make the computational time shorter, we assumed a case of only three alleles. The mixing rates, of 0.1, were assumed to be the same over the years. Three levels of genetic drift, θ = 10, 100, and 1000, were used. The same sample size, *N*_{y} = 500, was assumed each year. Scenarios we used in this simulation are summarized in Table 1
. The frequencies in the released populations are assumed to be known. Throughout this simulation study, we set the true values of parameters for ϕ*. Considering the computational burden, the sample size of the MCMC was fixed at 5000. Among 5000 samples, the first 1000 samples were discarded as constituting the burn-in period, and every fifth simulation draw was kept. Therefore, the effective sample size was 800. The number of simulation replicas was 100 for each scenario. All the computation was carried out using our own program written by statistical software R (http://www.r-project.org/).

### Simulation results:

The results are shown in Table 2
. In scenario I (*Y* = 7), although the estimate of θ tended to be overestimated, especially for the case of *L* = 1 and larger values of θ, estimates of the mixing rates, ω's, by the model that considered the genetic drift were almost unbiased throughout the simulation. The bias of was reduced when multiple loci (*L* = 5) were analyzed. On the contrary, those where the genetic drift was not considered were subject to large bias when θ was small; that is, the degree of genetic drift was large. Note that the estimation performance for the mixing rates tended to be similar in both the models for θ = 1000, the case of small genetic drift.

In scenario II (*Y* = 2), because of the small size of replication with respect to years, the estimation was more difficult to make than in scenario I. In fact, especially for the case of θ = 1000, the estimates of θ tend to be infinity. Hence, we set an upper bound, 5000, for the estimate of θ. However, the simulation results suggest that the estimation of θ with many loci performs well when genetic drift is large. The estimation on mixing rates while considering the genetic drift was also stable in this scenario. On the other hand, biases were observed in the estimates without considering the drift. A noteworthy fact is that the biases are smaller than those in scenario I, where samples were taken for ≥7 years than the 2 in scenario II. This phenomenon indicates that the large bias in scenario I was due to the impact of the genetic drift through the year. Therefore, the longer the sampling period is the more effective it is to take the genetic drift into account.

To reduce the computation time for the simulation, we assumed only three alleles in our simulation studies. The allele or haplotype frequencies in the released population have a large impact on the estimation of mixing rates rather than the number of alleles. The results for our model with considering genetic drift showed that the estimation performance of mixing rates in years 4 and 7 for scenario I tended to be worse than that in other years. This is because the most common allele in the released population is also most common in the natural population before release. On the other hand, smaller standard deviations were observed in years 2 and 5, when the common allele in the released population is a rare allele in the natural population. In this way, use of rare alleles or haplotypes for the released population improves the estimation performance.

## CASE STUDY: ANALYSIS FOR MUD CRABS

### Data:

In Japan, three mud crab species, *S. paramamosain, S. serrata*, and *S. olivacea*, are commercially important resources in local scale fisheries. Urado Bay in Kochi Prefecture is the main area for *S. paramamosain* fisheries. The life span of this crab is from 1.5 to 2.5 years (Obata *et al.* 2006). May and June are the spawning season, and in September hatched crabs grow to commercial size and recruit into the fishing ground. The fishery is operated mainly from August to December with the highest catch in September. The commercial catch of this species has fallen since the mid-1970s from ∼12 tons and reached a record low of 1.6 tons in 1989. In the early 1980s, a stock enhancement program for this species started in Urado Bay, and the number of released juveniles has increased.

To examine the efficacy of the release program for this species, an experiment using genetic tagging was conducted (Obata *et al.* 2006). Samples were collected in Urado Bay around December in 1996–2002. In each year from 1996 to 2000, individuals with different haplotypes were chosen from samples to produce supplemental juveniles, and these minor haplotypes would then be used as a genetic tag. During mid-May/early June in each of the next years, juveniles were released into Urado Bay. Supplemented individuals, if not removed by fishing from the population in the year they are introduced, can contribute to the reproducing population in the second year.

We used observed haplotype frequencies in the mtDNA D-loop region of the mud crabs (Table 3
). Here, we grouped very minor haplotypes into one category called “others.” The haplotype frequencies for the released population are also shown in Table 3. Only one haplotype was released every year, except 1998. Using these data, we estimated mixing rates and the genetic drift simultaneously. Because the number of released juveniles was known, we can regard the parameters *p*_{2y} as fixed. We set the burn-in period at 5000 and kept every fifth simulation draw of 250,000 samples as an effective output for Monte Carlo sampling.

### Results:

Table 4 shows the summary of parameter estimations with and without taking the genetic drift into account. The estimates of mixing rates, , were small except for 1997. Little differences between the estimates of the two methods were observed. This may be due to the small genetic drift. In fact, the estimated θ was ∼341.1. Although θ may be overestimated considering the simulation results, severe genetic drift was not observed.

Although not so much difference in the number of released juveniles was recorded except for in 2001, a large yearly variation in the estimates of mixing rates was observed. In the next section, we discuss the effects of adaptation and selection on the estimation of mixing rates.

## DISCUSSION

In this article, we proposed a statistical model for the simultaneous estimation of mixing rates and genetic drift for stock enhancement programs. In our model, temporal changes in haplotype or allele frequencies are modeled, taking into account mixing and genetic drift. This model can be regarded as an extension of work by Anderson *et al.* (2000), in which temporal changes due to genetic drift only were assumed and the effective population size (*N*_{e}) was estimated using the relationship between the genetic drift and *N*_{e}. Our simulation studies demonstrated that the performance for the estimation of the mixing rates, ω, based on the proposed model, was much better than that based on the model ignoring the genetic drift if the degree of genetic drift was large. Also, even if the degree of genetic drift was small, the mixing rates were well estimated.

Meanwhile, the simulation results also showed that the estimation of genetic drift is difficult compared to that of mixing rates. The genetic drift θ was overestimated especially when the genetic drift was small. In fact, the standard errors for θ in the case studies for mud crabs were large. This is partly due to the small number of sampling years and partly due to the use of only the mtDNA haplotype frequencies. Longer-term research would improve the estimation performance for the genetic drift and the mixing rates. In such studies, however, we must also consider the change in θ or *N*_{e} through time, because the parameter θ is closely related to the effective population size. Such variation may be likely in situations where juveniles are artificially released. Note that our model does not take into account effects of removal of individuals. We supposed that individuals are randomly caught by fisheries from the composite populations, and therefore change in haplotype frequencies due to fisheries catch occurs in the same manner as the genetic drift. In this respect, we think that any changes in statistical model are not essentially necessary even if considering the effect. However, we regarded such random changes as only a result of genetic drift and related it to the effective population size. If the amount of sample is large, effective population size after the fishing season may be underestimated.

The estimates of θ for mud crabs in Urado Bay were 341.1. Because mtDNA haplotypes were used in the analysis here, the degree of genetic drift θ is equal to the effective female population size minus one, *N*_{f} − 1. If the sex ratio is 1:1, then the estimate of the effective population size is *N*_{e} = 2*N*_{f} = 684.2. Although the accuracy of this value is not high because of the large standard error for θ, it seems to be small compared to the actual population size. The catch, in weight, in the late 1990s was ∼5 tons, and the average weight was estimated as 411 g through a market survey (Obata *et al*. 2006). Hence, the catch in number can be calculated at 12,165. The mortality rate is unknown for this stock, but if the exploitation rate can be assumed to be 33% and the natural mortality is ignored, the actual population size after exploitation, *N*, is (1 − 0.33) · 5000/(0.411 · 0.33) ≈ 24,700. This value is considerably larger than the estimated *N*_{e}. However, as described in Turner *et al.* (2002), the estimate of the ratio *N*_{e}/*N* for the red drum, *Sciaenops ocellatus*, was <0.001. The ratio *N*_{e}/*N* can take small values in nature, for example, if heterogeneity in fecundity or reproductivity among individuals or habitat is large. In this case, genetic impacts of hatchery releases on wild populations cannot be ignored. In this regard, further studies are necessary for the investigation of *N*_{e} and mixing for mud crabs in Urado Bay.

Our statistical model does not take into account selection on the juvenile mortality and fertility. For our situation, there are two possible selection forces: domestication and natural selection. Domestication is defined as genetic changes that result from different selective pressures in the hatchery and the wild environment (Waples and Drake 2004). If a hatchery population was established from the natural population and then kept in the hatchery and multiplied, the hatchery-reared stock generally becomes adapted to hatchery conditions and maladapted to natural conditions. Therefore, the mixing rate in the fishery season can be different from the one at the time of release. Our model estimates the former. Domestication may also cause a difference in fertility between the released individuals and the others in the composite population. In addition, natural selection is another factor that has an impact on changes in haplotype frequencies. In this study, we consider only the genetic drift and ignore the difference in the fertility and fitness.

If such selection in the reproduction process is mild and the effective population size is small, the effect on the estimation of mixing rates may be small. Meanwhile, if the effect of selection is relatively large, the directional changes in the frequencies may cause a bias in our estimation of mixing rate. For example, assume that minor haplotype I is used for release in one year and another haplotype II is used in the next year. Suppose that fitness of haplotype I is low. In this case, the frequency of haplotype I before release in the next year decreases due to selection. Therefore, the mixing rate of the released population with haplotype II is overestimated. If the same haplotype I is used in both years, and fitness of haplotype I stays low, then mixing rate is in turn underestimated. In this way, our estimates of mixing rates might be biased if the effect of selection is large. On the other hand, selection increases the level of genetic drift because it increases the rate of inbreeding. The result in the mud crab analysis showed a small level of genetic drift. We used wild female crabs caught in the fishing ground for juvenile production every year and duration for juveniles kept in the hatchery environment was short, ∼1 month or 1.5 months. Therefore, the domestication selection might be neglected for this case. Minor haplotypes are considered the consequence of long-term natural selection and evolution. Although further investigation is necessary, it may have suggested a small impact of selection on the estimation.

In this article, we described mostly a statistical model only for cases where haplotype or allele counts at a single locus are observed. Now, many microsatellites are often used in the study of population genetics, including genetic stock identification (*e.g.*, Wirgin and Waldman 2005). The extension for multiple unlinked markers is direct. Even with loosely linked markers, our statistical models may perform well. However, the effect of linkage disequilibrium can be stronger than that in usual population genetics studies.

Linkage disequilibrium exponentially decays with successive generations. Hence, in a population reaching equilibrium, the assumption of independence among loci will be appropriate. Our article, however, considers admixture of yearly released populations, and it is impossible to expect such equilibrium. Therefore, the assumption of independent loci tends to no longer hold. Statistical models under linkage disequilibrium have been developed (*e.g.*, Falush *et al*. 2003; Kitada and Kishino 2004). However, we think that such modeling and developing of estimation methods themselves are pretty complicated and beyond the scope of our research. Therefore, statistical models considering linkage disequilibrium as well as the mixing rates and genetic drift will be developed in future research.

Our statistical model included latent vectors representing haplotype frequencies in mtDNA due to genetic drift. The likelihood function is not expressed in a closed form, and hence direct maximization is not possible. In this article, we consider the maximum-likelihood estimation with a Monte Carlo approximation of likelihood (and Monte Carlo EM, MCEM, algorithm). Meanwhile, the maximum-likelihood estimate is not always the best because the profile likelihood has a drawback when the sample size is small compared to the number of parameters (*e.g.*, Neyman and Scott 1948). In fact, for more precise estimation of θ, eliminating the other parameters including ω through integration may be efficient (Kitakado *et al*. 2006, accompanying article in this issue). In this case, some noninformative prior distributions are to be assumed, and the parameters, except for θ, are integrated out from the likelihood. The parameter θ is estimated by maximizing the integrated likelihood, and the parameters, except for θ, are inferred using their posterior distribution given data and the estimate of θ. This kind of hierarchical structure is also effective in a more practical situation where there is a random or systematic change in θ over years. Although in this case further hierarchical modeling is necessary, extension of our model and algorithm is not difficult for such purposes. Furthermore, if we consider the presence of locus-specific θ, the assumption of random variation of θ over loci could contribute to the reduction in the number of parameters as well as an improvement of estimation performance. In this case, further complicated algorithms would be necessary. These are matters to be investigated in future research.

In this article, we have developed a statistical method for estimating mixing rates of hatchery-released individuals in enhanced populations. Such mixing could also occur when the outside species or population comes from a natural population. In this sense, we note that our method under successive sampling can also be applied in conservation genetics to monitor the effect of invasion by outside populations.

## APPENDIX A: MCMC SAMPLING

We illustrate a method for generating random numbers from using a Metropolis-within-Gibbs sampling method. Let be the initial values for this algorithm. Then a sequence of random vectors is generated in turn,(A1)The method for actual generation is as follows:

Sample a candidate vector from a proposal distribution

Compute acceptance ratios defined by

Set as

For example, the form of the acceptance ratio is justified by the following formula:(A2)As a proposal distribution for , we use a Dirichlet distribution aswhere the parameter *c* controls a deviation between a candidate vector *p*_{1y}* and the previous vector . In fact, the variance of is given by , and therefore the mixing condition over parameter space becomes slow if *c* increases, while the acceptance ratio decreases if *c* decreases. We chose a suitable value by trial and error. The samples in the initial stage were discarded as a burn-in period, which is usually used in MCMC to eliminate the effect of the initial values (Gelman *et al.* 2004). In addition, the MCMC sequence was thinned to reduce autocorrelation (Gelman *et al.* 2004).

## APPENDIX B: MCEM ALGORITHM

Let be the logarithm of , and let ϕ^{[t]} denote a value of ϕ at the *t*th EM iteration. Define *Q*(ϕ | ϕ^{[t]}) by the conditional expectation of the *l*_{c}(ϕ) with respect to given *n*_{1},Then the original EM algorithm alternates between the two steps as follows:

E step: Compute

*Q*(ϕ | ϕ^{[t]}).M step: Define ϕ

^{[t+1]}by a value of ϕ maximizing*Q*(ϕ | ϕ^{[t]}).

In our model, the complete log-likelihood, , can be obtained aswith . The original EM algorithm expects an explicit formula for *Q*(ϕ | ϕ^{[t]}). This is the case at least for statistical models with simple exponential families. Unfortunately, another step is necessary for computation of *Q*(ϕ | ϕ^{[t]}) in our model. In fact, the function *Q*(ϕ | ϕ^{[t]}) is expressed asThe two parts of the conditional expectation with respect to given *n*_{1} have no explicit formulas. To compute the conditional expectation of the complete log-likelihood, *Q*(ϕ | ϕ^{[t]}), we use a Monte Carlo integration by generating simulation outcomes for given the observable data *n*_{1}. The random variables are generated from the conditional distribution, with an MCMC as described in appendix a. The size *m _{t}* at the

*t*th step should be increased with

*t*. Then the function

*Q*(ϕ | ϕ

^{[t]}) is approximately calculated aswhere

*m*is the length of the effective MCMC sequence considering the burn-in period and thinning.

_{t}## Acknowledgments

The authors are grateful to two anonymous reviewers for their valuable comments on the original version of this manuscript. We also thank Katsuyuki Hamasaki, Naohisa Kanda, and Hiroshi Okamura for their helpful comments.

## Footnotes

Communicating editor: R. W. Doerge

- Received February 1, 2006.
- Accepted June 8, 2006.

- Copyright © 2006 by the Genetics Society of America