## Abstract

The use of linkage disequilibrium to localize the genes underlying quantitative traits has received considerable attention in the livestock genetics community over the past few years. This has resulted in the investigation of linkage disequilibrium structures of several domestic livestock populations to assess their potential use in fine-mapping efforts. However, the linkage disequilibrium structure of free-living populations has been less well investigated. As the direct evaluation of linkage disequilibrium can be both time consuming and expensive the use of simulations that include as many aspects of population history as possible is advocated as an alternative. A simulation of the linkage disequilibrium structure of the Soay sheep population of St. Kilda, Scotland, is provided as an example. The simulated population showed significant decline of linkage disequilibrium with genetic distance and low levels of background linkage disequilibrium, indicating that the Soay sheep population is a viable resource for linkage disequilibrium fine mapping of quantitative trait loci.

RECENTLY, investigations into the linkage disequilibrium (LD) structures of domestic livestock have shown the potential for the use of LD to localize the genes underlying quantitative traits (Farnir *et al*. 2000; McRae *et al*. 2002; Tenesa *et al*. 2003; Nsengimana *et al*. 2004). This potential has been realized in at least two cases, with the fine mapping of a twinning locus to a region of <1 cM (Meuwissen *et al*. 2002) and the identification and subsequent characterization of a gene for milk yield and composition (Grisart *et al*. 2002).

Wild populations provide a valuable resource for the understanding of the genetics of quantitative traits. Not only can the genetics of traits such as birth weight be compared to domestic populations, providing an insight into the effects of population management on the expression of quantitative traits, but also other traits that cannot be studied in managed populations, such as traits associated with fitness (survival, lifetime breeding success), can be considered. However, before attempting to use LD for fine mapping genes in any population, it is important to have an understanding of the underlying LD structure to enable an appropriate choice of study design.

Direct evaluation of the LD structure of a population is both time consuming and expensive. However, the LD structure of a population should not be evaluated only as a secondary objective of a LD mapping experiment. LD could extend over prohibitively short distances given the density of markers available in the studied species. Alternatively, large amounts of LD between markers that are unlinked or separated by large distances will result in an increased number of false positive results and a lack of power to localize the regions of interest. In these scenarios, this may result in any generated marker data being unusable for its intended purpose. Theoretical estimation of the levels of LD for realistic populations is complex due to the large number of interrelated factors involved in the formation of LD, including genetic drift, population admixture, and natural selection. Any LD formed is broken down in a subsequent generation by recombination, which generally results in a pattern of LD decreasing with distance. With respect to LD mapping, the important factors in the LD structure are a significant decay with genetic distance and the amount of variation about the average decay.

Simulation studies provide a useful alternative to the direct evaluation of LD structure. This approach has been used in humans (Kruglyak 1999) and these predictions for the level of LD are reasonably close to those observed from experimental data, despite the very simple model applied. However, significant deviations are observed in populations where the model's assumptions about population structure are not applicable (reviewed in Pritchard and Przeworski 2001; Wall and Pritchard 2003). In this article, a simulation of a well-studied wild sheep population is performed. The large amount of data available on the population history allows this simulation to include the major aspects of the population dynamics, providing a model that should predict the overall structure of LD in the population with a reasonable accuracy.

## MATERIALS AND METHODS

A three-step strategy is used to model the LD in the Soay sheep population. First, census data are used to model various aspects of population dynamics, including population size, adult sex ratio, variation in reproductive success, and survival. These models are then used to simulate the population given the current understanding of its history. Finally, LD is measured from the loci simulated in the population.

#### Population data:

The Soay sheep is a primitive domestic sheep that has lived on the islands of St. Kilda, Scotland (54°49′N, 08°34′W), for many centuries (Figure 1). The exact date of their introduction is unknown but is likely to have been between 1000 and 2000 years ago (Boyd and Boyd 1990). The Soay sheep were eventually restricted to the island of Soay after which they were named. In 1932 a flock of 107 Soay sheep was moved from Soay to Hirta. The flock consisted of 20 rams, 44 ewes, and 43 lambs, of which 21 were female and the remaining were castrated males (Boyd 1953).

A census of Hirta is performed each year by three teams using designated census routes that have been followed since 1959 (Boyd *et al*. 1964). An intensive study of the sheep in the Village Bay area began in 1985, this being the largest concentration of sheep on the island, comprising ∼30% of the total number. In the spring, lambs are caught when a few days old, weighed, tagged, and blood sampled. Each year during the summer, an extensive effort is made to catch all the sheep in the Village Bay area, with all animals caught being recorded and any new animals tagged and blood sampled. Mortality is recorded by searches during late winter and early spring, with additional searches being performed in years of high mortality.

#### Statistical modeling:

##### Population size:

The population size appears to fluctuate randomly (see Figure 2), with local maxima occurring at irregular intervals. Boyd (1974) showed a weak upward trend in population size in the years until 1973 and this increase appears to have continued until today. This increase may be an artifact of improved census counts. This explanation suggests modeling population size using the log of the data, as the size of fluctuations will be confounded with the proportion of sheep being counted. Using the log of the data also allows for the indication of density-dependent mortality in the populations (Grenfell *et al*. 1992). The standardized residuals of the linear regression of the log of the population size on the year of the count are used in all following models as these allow for the interpretation of population dynamics in terms of deviations from the average population trend, thus removing any potential effects of improving census quality. For convenience, these standardized residuals are referred to as residual population size.

The changing residual population size was modeled using a self-exciting threshold autoregressive (SETAR) model (Tong 1990), with model selection based on the improved cross-validation criterion, *C _{u}*, proposed by De Gooijer (2001). SETAR models are useful in modeling data in which future values depend on previous values but this dependence varies with the current value of the system. For example, in years where the population size is small a model showing a rapid increase in population size may be appropriate. Once the population size becomes larger, this rate of increase may be smaller than that for small population sizes and then become negative for large population sizes. Each region where a different model is applied in predicting future values is called a regime, with its boundaries known as thresholds. For the above example three regimes are given by thresholds that define what constitutes small and large population sizes. In each regime an autoregressive (AR) model is used in predicting future values. This type of model has previously been applied to the Soay sheep population (Grenfell

*et al*. 1998) but without removing the trend and using a different model selection criterion.

##### Model selection:

A modified model selection procedure was used for modeling the proportion of lambs, adult sex ratio, and survival. All models are based on linear regressions on residual population sizes of the current and previous years. Initial models used residual population sizes from the previous 5 years, with this being successively reduced when the earliest year in the model was not significant at the 5% level. This modified selection procedure allows effects for years after the earliest significant year to be nonsignificant. As a delayed response to population size is unlikely, this type of model selection is useful, especially when the power to detect small effects is low. One disadvantage of this selection regime is the potential of false positive results that lead to a highly overparameterized model, but choosing reasonable starting models reduces the likelihood of this occurring.

##### Lamb proportion and adult sex ratio:

The proportion of lambs in the population in year *i*, denoted *p _{i}*, was modeled aswhere

*r*is the residual population size in year

_{j}*j*and is the normally distributed error term. The delay parameter

*d*was successively reduced from an initial value of 5 in the model selection procedure. The modeling of neonatal mortality is prevented as the model is based on census data that are collected several months after lambing but before the mating season. The ratio of number of male to female adult sheep in year

*i*, denoted

*s*, was modeled aswith model selection again reducing

_{i}*d*from an initial value of 5.

##### Survival:

For the modeling of survival, the data set was restricted to all sheep with known birth date and known fate (either a known death date or known to be still alive). Separate models for the effect of residual population size on survival were made for each age and sex. For each year, the hazard function was evaluated using the Nelson-Aalen estimatorwhere *d _{i}*

_{,j,s}is the number of observed deaths in the

*n*

_{i}_{,j,s}sheep of age

*i*and sex

*s*alive in year

*j*and followed to year

*j +*1. The effect of residual population size on the hazard function was modeled using a weighted regression aswhere

*d*was selected using the model selection procedure described above starting from an initial value of 4 and weights were given by

*n*

_{i,j,s}. In older age groups, where data were available from only a few years and there was no observable effect of residual population size on the hazard function, the hazard was modeled as a beta distribution with parameters and , where

*d*

_{i,s}and

*n*

_{i,s}are obtained from the summation over

*j*of

*d*

_{i}_{,j,s}and

*n*

_{i}

_{,j,s}, respectively. As no observations of males older than 12 years or females older than 16 years were made, these were considered the upper limit for the lifespan of the Soay sheep.

#### Simulation model:

The Soay sheep population was modeled to reflect the population's history since the introduction to Hirta in 1932. For each of the 1000 simulation replicates, the initial haplotypes, which contain a pair of loci separated by a random distance between 0 and 40 cM, were simulated using a coalescent process (see below). The ages of the adults in the original sheep were sampled from the equilibrium age structure given by the pooled survival estimates for each sex. The castrated ram lambs that were initially transferred to Hirta were not included in the simulation, as these would not leave any descendants. The initial population growth on Hirta was modeled by the doubling of population size each year until this exceeded the population size trend. From this point, until island census data were available, the population size was given by a realization from the model for population size. For each year that the simulation iterated, realizations from models for sex ratio, proportion of lambs, and survival were taken, conditioned on the realized residual population size.

Given the yearly pattern of mortality of the Soay sheep after the rut but before the birth of lambs that is observed in the Soay sheep and the fact that rams can have offspring born after their death but ewes cannot, the following steps were cycled through in each year of the simulation: (1) death of ewes, (2) creation of lambs from remaining sheep, (3) death of rams, and (4) aging of sheep that remain from the previous year. The number of sheep that were to die at each step of the simulation model was determined by the required adult sex ratio and proportion of lambs given the residual population size. The survival model was used only to give an appropriate age structure to the population. This was achieved by choosing a random sheep and comparing a random number to the realized hazard for sheep of that age, *h _{i}*. If the random number was less than

*h*, the sheep was removed from the population, otherwise it was returned to the sampling pool. This was repeated until the required population size was achieved. The efficiency of the steps involved was greatly increased by replacing

_{i}*h*with . The sampling process used in choosing parents for the generated lambs is described below. Haplotypes passed on to lambs were generated assuming recombination rates given by Haldane's map function.

_{i}##### Generation of initial haplotypes:

Although the population fluctuations on Hirta and its surrounding islands have been shown to be in synchrony (Grenfell *et al*. 1998) and estimates about the period of time that the Soay sheep have been present on Soay are available, it is difficult to extend these concepts to generating appropriate haplotypes for the initial sheep moved to Hirta. However, some of the properties of the current Soay sheep population can be used to provide an appropriate benchmark for the simulation of these haplotypes.

An estimate for the genome-wide heterozygosity of the Soay sheep population of 0.385 (SE = 0.009) was obtained by considering a panel of 144 microsatellite markers with a high polymorphic information content covering the majority of the genome of five rams from the current population chosen because of their high prolificacy (our unpublished observations). Simulating the population starting with completely heterozygous founders indicates that heterozygosity in the Soay sheep has decreased by 7.1% since their introduction to Hirta (10,000 replicates, data not shown). Thus the average heterozygosity of the microsatellite markers examined in the founding sheep is inferred to be 0.414. Under a neutral model, the heterozygosity at a locus equals , , where is the effective population size and μ is the mutation rate at the locus. A reasonable estimate for μ will be in the high end of the observed mutation rate in mammals given the highly polymorphic loci chosen to measure the average heterozygosity. The microsatellite markers used in paternity analysis are similarly chosen and have mutation rates ranging from 0.02 to 0.0001 (Ellegren 2004). Using gives , which is ∼15% of the average population size on Soay. This estimate is consistent with the observed decrease in heterozygosity given above and with estimates of in other wild populations (Frankham 1995).

For each replicate, 40 Mb of sequence was simulated with a neutral coalescent model using the program “ms” (Hudson 2002). A mutation rate of 10^{−8} was used in the simulation as ms uses an infinite-site model of mutation and thus generates loci with histories similar to SNP markers. Recombination was assumed to occur at a rate of 10^{−8} between base pairs, giving a 1 cM/Mb ratio as approximately observed on average in the sheep genome. This generated a large number of polymorphic sites (mean 320) of which the leftmost and another random site were chosen for use in the remaining simulation, giving an approximately uniform distribution of allele separation across the simulation replicates.

##### Modeling of reproduction:

The reproductive success of female Soay sheep is relatively uniform, with adult females only rarely not giving birth to lambs. However, the proportion of births producing twins decreases with population size. This association does not need to be modeled due to being intrinsically modeled through changes in population size. When the number of lambs to be generated was less than the number of ewes, the lambs were assigned randomly to individual ewes. If the number of lambs was greater than the number of ewes, all ewes were assigned one lamb with the remaining lambs being distributed to individual ewes.

Modeling of male reproductive success is more difficult as the promiscuous nature of the Soay sheep requires paternities to be inferred by molecular methods. From the inferred paternities (at 80% confidence) of 44% of the Village Bay lambs born between 1986 and 1996, the distribution of male lifetime breeding success was highly skewed with mean of 1.05 and variance 4.41 (Pemberton *et al*. 2004). Although these data are not complete, they provide information on the coefficient of variation of male breeding success. Male breeding success was modeled by assigning each male individual a random “success” variable from a two-parameter gamma distribution with mean and variance as given above. As with survival, the selection of male parents was achieved using rejection sampling methodology. For ewes having a single lamb, a male parent was chosen by randomly selecting a male and comparing its success variable (scaled to have maximum value one) to a uniform random number. The male was selected when the random number was less than the scaled success variable; otherwise the sampling process was repeated. When a ewe was assigned as having twins, first the number of male parents was randomly determined. The proportion of twins that are fathered by the same sire has been estimated as 26% (Pemberton *et al*. 1999) and any effect of population size on this value is likely to be small. The sampling of male parents for twins with different fathers was achieved using the same methodology as for single lambs. For sampling a single father for twin lambs, the square of the scaled success variable was compared to the random uniform number to correct for the fact that the selected individual was being assigned two paternities.

#### Measuring linkage disequilibrium:

LD was measured using Lewontin's (1964) standardized measure of LD, . This measure was chosen to enable comparisons with LD observed in previous studies of domestic livestock that have used Hedrick's (1987) multiallelic extension of this measure. is calculated aswhereandwhere *x _{ij}* is the frequency of haplotypes with allele

*i*at the first marker and allele

*j*at the second marker. The absolute value of is reported as the sign of the -statistic depends on the arbitrary choice of allele at each locus. As can be skewed when one or both markers contain rare alleles (Eyre-Walker 2000), replicates giving final allele frequencies <0.1 were repeated.

#### Stability of estimates:

The effect of model parameters on LD estimates was examined by replicating the simulation with varied parameters. Three areas need to be examined for their effect on the final LD estimates: the value of used in the construction of the original haplotypes, the parameter estimates in the models for population dynamics, and the model of male breeding success. The effect of the initial value of is examined by both reducing and increasing the assumed value of by a factor of two. For the examination of the effects of population model parameter estimates, an ∼95% confidence interval was constructed for each coefficient of residual population size. These were then categorized into groups of high and low effect of residual population size. The simulation was run using these groupings representing extremes of possible variation caused through modeling population dynamics. The effect of the model of male breeding structure was examined by simulating with the coefficient of variation of the success variable reduced and increased by a factor of two.

## RESULTS

Figure 3 compares aspects of the Village Bay population with that of the entire island. The census counts in the Village Bay area show a strong correlation with the island census counts (see Figure 3, a and b, *r* = 0.962, *P* < 0.001), highlighting that a fluctuation in population size is not area specific but affects the whole island. The proportions of lambs in the two populations are similar (see Figure 3c), although a slightly higher proportion of lambs is given by the island census. The adult sex ratio is higher in the Village Bay population (Figure 3d), but this is likely to be due to the increased accuracy of the Village Bay count with the island census, as rams are undercounted compared to ewes in the island census (Grubb 1974). This indicates that the use of the Village Bay data is preferential in the modeling of aspects of population structure, despite having fewer data points, as this is likely to give less biased results. However, the strong correlation between island census counts and the Village Bay population counts suggests that the use of the larger data set would be beneficial in modeling the population fluctuations as it is unlikely to bias the results.

The trend for increase in population size was significant (*P* = 0.018) and gave residual population sizes, *r*, aswhere *n _{i}* is the population size of year

*y*. The SETAR model for changing residual population size found no evidence for the existence of more than one regime. Furthermore, no autocorrelation or partial autocorrelation is observed between successive residual population sizes and these show no deviation from normality. This indicates that a simpler model of normally distributed noise is appropriate for residual population size.

_{i}The model selection procedure found no evidence for a relationship between residual population size and the proportion of lambs in the population. As the untransformed proportions showed no significant deviation from normality, a simpler model was used with the proportion of lambs being normally distributed with mean and standard deviation estimated from the data as μ = 0.3305 and σ = 0.0442. The adult sex ratio was significantly related to the residual population size, with residual population sizes from 3 years previous being included in the model. The model selected for sex ratio, *s*, waswith the error term being normally distributed with standard deviation σ = 0.2385. The coefficients for the linear models for survival are given in Table 1. Males aged 7 years and older and females aged 11 years and above showed no significant effect of residual population size on survival and were modeled using a beta distribution with parameters given in Table 2.

Figure 4 plots the simulated LD against distance for 1000 locus pairs. As expected from population genetic theory, LD was significantly negatively correlated with marker distance (*r* = −0.4188, *P* < 0.0001). There is a large amount of variation in the amount of linkage disequilibrium between loci separated by a small distance. As with the average LD value, this decreases as the distance marker pairs are separated by increases, with no absolute -value >0.2 being observed for marker pairs separated by >15 cM. As markers separated by >20 cM showed no significant relationship with distance (*P* = 0.444), these can be used to estimate nonsyntenic LD. These 490 marker pairs have a mean -statistic of 0.026 and standard deviation 0.026. This indicates low levels of LD between nonsyntenic marker pairs. Among marker pairs separated by <10 cM, 79% (219/276) have and 22% (62/276) have .

The stability of the LD estimates with respect to variation in the parameters used to simulate the population is examined in Figure 5. The main effect of variation of the parameters used in the simulation is the changing of the average amount of LD at closely linked loci. The distance over which the linkage disequilibrium decays and the amount of LD at more distantly linked loci remain similar with all parameter values. The variation of the estimated population dynamics parameters has the least effect on the LD structure (Figure 5, c and d). Altering parameter values of the effective population size used for generating the initial haplotypes (Figure 5, a and b) and variation in male breeding success (Figure 5, e and f) result in similar changes in the overall LD structure.

## DISCUSSION

Wild populations provide novel opportunities for the understanding of the genetics of quantitative traits. However, the approach used to locate the genes underlying quantitative traits will depend on the amount of information available from the population. If a detailed pedigree is available, linkage analyses can be performed using the two-step approach proposed by George *et al*. (2000). This approach has been used to map QTL for birth weight in a wild population of red deer (Slate *et al*. 2002). When pedigree information is unavailable LD mapping methods that rely upon associations among marker haplotypes to infer historical relationships in the population can be used. However, before such an experiment is started, it is important to know the structure of LD in the population to determine the necessary marker distances for such a scan. In this article, the use of simulations is advocated as an inexpensive alternative to the direct evaluation of LD and this is applied to a wild sheep population, the Soay sheep of St. Kilda, Scotland.

The simulated Soay sheep population showed considerable amounts of LD between tightly linked loci. As expected there is a significant decline in the amount of LD with distance. Perhaps the most striking feature about this decline is the low level of LD between effectively unlinked marker pairs. Previous studies of LD in domestic livestock have observed a mean nonsyntenic LD of 0.211 in Coopworth sheep (McRae *et al*. 2002), 0.12–0.20 in Dutch black and white dairy cattle (Farnir *et al*. 2000), 0.39 in the United Kingdom dairy cattle population (Tenesa *et al*. 2003), and between 0.11 and 0.22 in domestic pig lines (Nsengimana *et al*. 2004). The simulated Soay sheep population gave a mean *D*′-value of 0.026 for markers separated by >60 cM. McRae *et al*. (2003) showed that using a small number of haplotypes inflated the value of . Here a total of 3778 haplotypes were used in the estimation of , resulting in a very low inflation of compared to Farnir *et al*. (2000), who used 581–1254 haplotypes, McRae *et al*. (2003), 276 haplotypes, Tenesa *et al*. (2003), <100 haplotypes, and Nsengimana *et al*. (2004), 184–302 haplotypes. Correcting the mean *D*′-values based on the model given by McRae *et al*. (2002) accounts for most of the variation in the mean values. However, the range of LD for marker pairs separated by >60 cM is lower in the simulated Soay sheep population, giving a maximum of 0.25, which is under half of the value given by nonsyntenic markers in domestic livestock studies (Farnir *et al*. 2000; McRae *et al*. 2002; Tenesa *et al*. 2003). In this study, marker genotypes were simulated to represent SNP loci and restricted to have minor allele frequency of >0.1. This is likely to account for some of the differences observed in LD structure as the previous studies have used microsatellite loci and Hedrick's (1987) multiallele extension, which is susceptible to bias by small allele frequencies. A further explanation for the differences in the patterns of background linkage disequilibrium observed between domestic livestock populations and the simulated Soay sheep population is concurrent selection at unlinked loci. This will increase the range of linkage disequilibrium at unlinked loci without dramatically increasing mean levels and will not be observed in the simulated population as selection was not included in the population models. However, Farnir *et al*. (2000) found no significant evidence for selection causing an increase in LD between unlinked loci and showed that drift accounted for most of the variation in linkage disequilibrium observed in the Dutch black and white dairy cattle population. Thus, the cause of the difference between background LD observed in domestic livestock and the simulated Soay sheep population requires further investigation, possibly with direct measurement of LD in the Soay sheep population.

From the structure of LD given by the simulated Soay sheep an estimate of the required marker density for LD mapping can be obtained. Considering regions with consecutive LD statistics that are overall significantly greater than the mean LD statistic for nonlinked markers and single LD statistics outside the range of LD for nonlinked markers as indicators of linkage, a minimum marker density for LD mapping in the Soay sheep of ∼2 cM can be obtained. The latest linkage map of the sheep genome has a marker density on average of 3.4 cM (Maddox *et al*. 2001). Slate *et al*. (1998) found that only 42.5% of the bovine markers that amplified in the Soay sheep were polymorphic, indicating a higher-density linkage map will be required. However, this density is likely to be achievable with the use of SNP markers. This indicates that the Soay sheep population is a viable resource for linkage disequilibrium fine mapping of quantitative trait loci.

One potential pitfall of using a simulation study to assess the LD structure of a population is an incorrect specification of population dynamics and history. As the demography of the historical Soay sheep population is not recorded, a coalescent model was used to simulate the initial chromosomes used. This requires an estimate of effective population size that was derived from the heterozygosity observed in the current population. From Figure 5, a and b, it can be seen that changing this value has little effect on the overall conclusions of this study; LD still decays rapidly from its initial value and the amount of background LD is small. The main difference is the average LD between markers separated by small differences, which increases when decreases and decreases otherwise. A decrease in the average LD value at small differences causes the density of loci needed to fine map QTL using LD to increase due to a smaller overall decay of LD with distance. However, given the population dynamics of the Soay sheep population, a value of double the estimate (as used in Figure 5b) is unlikely. Similarly, the parameter values used to test the effect of the estimation of population dynamics models and variation of male breeding success are extreme values. Even using these, the overall LD structure does not deviate extremely from that observed using the average parameter values, indicating that the estimated LD structure provides an accurate representation of the actual LD structure.

In conclusion, the use of simulation studies provides a useful alternative to the direct evaluation of LD in wild populations. This can provide an indication of the population's potential for LD mapping and a first estimate of the required marker density. Although the accuracy of the simulated LD structure will depend on the inclusion of all major aspects of population dynamics, the simulated LD structure can be readily confirmed in the early stages of genotyping.

## Acknowledgments

We are grateful to the National Trust for Scotland and Scottish Natural Heritage for permission to work on St. Kilda and their assistance in the field. We also thank the Ministry of Defence, Defence Evaluation and Research Agency, Eurest, and QinetiQ staff stationed on St. Kilda and Benbecula for their assistance and support and all the field assistants and volunteers helping with the field work, especially J. Pilkington. The collection of the data used in this study is funded by the United Kingdom Natural Environmental Research Council and the Wellcome Trust. A.M. is funded by a Commonwealth Scholarship and P.V. is funded by a Biotechnology and Biological Sciences Research Council research development fellowship.

## Footnotes

Communicating editor: J. B. Walsh

- Received January 16, 2005.
- Accepted May 19, 2005.

- Copyright © 2005 by the Genetics Society of America