## Abstract

Widespread sharing of long, identical-by-descent (IBD) genetic segments is a hallmark of populations that have experienced recent genetic drift. Detection of these IBD segments has recently become feasible, enabling a wide range of applications from phasing and imputation to demographic inference. Here, we study the distribution of IBD sharing in the Wright–Fisher model. Specifically, using coalescent theory, we calculate the variance of the total sharing between random pairs of individuals. We then investigate the cohort-averaged sharing: the average total sharing between one individual and the rest of the cohort. We find that for large cohorts, the cohort-averaged sharing is distributed approximately normally. Surprisingly, the variance of this distribution does not vanish even for large cohorts, implying the existence of “hypersharing” individuals. The presence of such individuals has consequences for the design of sequencing studies, since, if they are selected for whole-genome sequencing, a larger fraction of the cohort can be subsequently imputed. We calculate the expected gain in power of imputation by IBD and subsequently in power to detect an association, when individuals are either randomly selected or specifically chosen to be the hypersharing individuals. Using our framework, we also compute the variance of an estimator of the population size that is based on the mean IBD sharing and the variance in the sharing between inbred siblings. Finally, we study IBD sharing in an admixture pulse model and show that in the Ashkenazi Jewish population the admixture fraction is correlated with the cohort-averaged sharing.

IN isolated populations, even purported unrelated individuals often share genetic material that is *identical-by-descent* (IBD). Traditionally, the term IBD sharing referred to coancestry at a single site (or autozygosity, in the case of a diploid individual) and was widely investigated as a measure of the degree of inbreeding in a population (Hartl and Clark 2006). Recent years have brought dramatic increases in the quantity and density of available genetic data and, together with new computational tools, these data have enabled the detection of IBD sharing of entire genomic segments (see, *e.g.*, Purcell *et al.* 2007; Kong *et al.* 2008; Albrechtsen *et al.* 2009; Gusev *et al.* 2009; Browning and Browning 2011; Carr *et al.* 2011; Brown *et al.* 2012). The availability of IBD detection tools that are efficient enough to detect shared segments in large cohorts has resulted in numerous applications, from demographic inference (Davison *et al.* 2009; Palamara *et al.* 2012) and characterization of populations (Gusev *et al.* 2012a) to selection detection (Albrechtsen *et al.* 2010), relatedness detection and pedigree reconstruction (Huff *et al.* 2011; Kirkpatrick *et al.* 2011; Stevens *et al.* 2011; Henn *et al.* 2012), prioritization of individuals for sequencing (Gusev *et al.* 2012b), inference of HLA type (Setty *et al.* 2011), detection of haplotypes associated with a disease or a trait (Akula *et al.* 2011; Gusev *et al.* 2011; Browning and Thompson 2012), imputation (Uricchio *et al.* 2012), and phasing (Palin *et al.* 2011).

Recently, some of us used coalescent theory to calculate several theoretical quantities of IBD sharing under a number of demographic histories. Then, shared segments were detected in real populations, and their demographic histories were inferred (Palamara *et al.* 2012). Here, we expand upon Palamara *et al.* (2012) to investigate additional aspects of the stochastic variation in IBD sharing. Specifically, we provide a precise calculation for the variance of the total sharing in the Wright–Fisher model, either between a random pair of individuals or between one individual and all others in the cohort.

Understanding the variation in IBD sharing is an important theoretical characterization of the Wright–Fisher model, and additionally, it has several practical applications. For example, it can be used to calculate the variance of an estimator of the population size that is based on the sharing between random pairs. In a different domain, the variance in IBD sharing is needed to accurately assess strategies for sequencing study design, specifically, in prioritization of individuals to be sequenced. This is because imputation strategies use IBD sharing between sequenced individuals and genotyped, not-sequenced individuals to increase the number of effective sequences analyzed in the association study (Palin *et al.* 2011; Gusev *et al.* 2012b; Uricchio *et al.* 2012).

In the remainder of this article, we first review the derivation of the mean fraction of the genome shared between two individuals (Palamara *et al.* 2012). We then calculate the variance of this quantity, using coalescent theory with recombination. We provide a number of approximations, one of which results in a surprisingly simple expression, which is then generalized to a variable population size and to the sharing of segments in a length range. We also numerically investigate the pairwise sharing distribution and provide an approximate fit. We then turn to the average total sharing between each individual and the entire cohort. We show that this quantity, which we term the cohort-averaged sharing, is approximately normally distributed, but is much wider than naively expected, implying the existence of hypersharing individuals. We consider several applications: the number of individuals needed to be sequenced to achieve a certain imputation power and the implications to disease mapping, inference of the population size based on the total sharing, and the variance of the sharing between siblings. We finally calculate the mean and the variance of the sharing in an admixture pulse model and show numerically that admixture results in a broader than expected cohort-averaged sharing. Therefore, large variance of the cohort-averaged sharing can indicate admixture. In the Ashkenazi Jewish population, we show that the cohort-averaged sharing is strongly anticorrelated with the fraction of European ancestry.

## Materials and Methods

### Coalescent simulations

To simulate IBD sharing in the Wright–Fisher model, we used the Genome haploid coalescent simulator (Liang *et al.* 2007). Recombination in Genome is discretized to short blocks and mutations (which we ignore in this study) are placed on the simulated branches. In all simulations, we generated one chromosome with recombination rate of 10^{−8} per generation per base pair and block lengths of 10^{4} bp (corresponding to resolution of 0.01 cM in the lengths of the shared segments).

### IBD sharing in simulations

We used an add-on to Genome that returns, for each pair of chromosomes, the locations of all shared segments (Palamara *et al.* 2012). In that add-on, a segment is shared as long as the two chromosomes share the same ancestor, even if there was a recombination event within the segment. We calculated, for each pair, the total length of shared segments longer than *m* and divided by the chromosome size. For Figures 2–6, we simulated *N*_{pop} ≥ 100 populations and *n* = 100 haploid sequences in each population and calculated all properties of the total sharing among all *n* chromosomes, their sharing to each of the other *n* − 1 chromosomes in the cohort and then used the *N*_{pop}*n* calculated values to obtain the variance and the distribution. Details on the simulation of an admixture pulse can be found in Supporting Information, File S1, section S4.

### The Ashkenazi Jewish cohort

The cohort we analyzed was previously described in Guha *et al.* (2012) and Palamara *et al.* (2012). Briefly, DNA samples from ≈ 2600 Ashkenazi Jews (AJ) were genotyped on the Illumina-1M SNP array. Genotypes (autosomal only) were subjected to quality control, including removal of close relatives, and phasing [Beagle (Browning and Browning 2009)], leaving finally ≈741,000 SNPs for downstream analysis. IBD sharing was calculated using Germline (Gusev *et al.* 2009) with the following parameters: bits, 25; err_hom, 0; err_het, 2; min_m, 1; h_extend, 1. The results presented in *IBD sharing after an admixture pulse* section remained qualitatively the same even when we used a longer length cutoff of *m* = 5 cM.

### Admixture analysis

For the admixture analysis, we merged the HapMap3 CEU population (Utah residents with ancestry from Northern and Western Europe; International HapMap Consortium 2007; release 2) with the AJ data, removed all SNPs with potential strand inconsistency, and pruned SNPs that were in linkage disequilibrium (Purcell *et al.* 2007). We then ran Admixture (Alexander *et al.* 2009) with default parameters and *K* = 2. Admixture consistently classified all individuals according to their population (CEU/AJ). Genome-wide, the AJ ancestry fraction was ≈85%, compared to ≈3% for the CEU population. Principal components analysis [SmartPCA (Patterson *et al.* 2006)] gave qualitatively similar results.

### Simulations of AJ demography

Demographic reconstruction of the AJ population was performed in Palamara *et al.* (2012), using chromosome 1 of 500 randomly selected individuals and using a novel IBD-based method described therein. Simulations presented here were performed using the final set of inferred demographic parameters: ancestral (diploid) population effective size of ≈2300 individuals, expansion starting 200 generations ago reaching ≈45,000 individuals 33 generations ago, a severe bottleneck of ≈270 individuals, and an expansion to the current size of ≈4.3 million individuals. Simulation of 100 populations was carried out using Genome (Liang *et al.* 2007).

## Results

### Variation in IBD sharing in the Wright–Fisher model

#### Definitions:

##### The Wright–Fisher model:

We consider the standard Wright–Fisher model for a finite, isolated population, described by 2*N* haploid chromosomes, where each pair of chromosomes corresponds to one diploid individual. Each chromosome in the current generation descends, with equal probability, from one of the chromosomes in the previous generation, and recombination occurs at rate 0.01/cM per generation. The Wright–Fisher model has been widely investigated both in forward dynamics and under the coalescent (Wakeley 2009). For simplicity of notation, we denote the number of individuals, or the population size, as *N*, even though we really refer to the number of haploids and not the number of individuals. Throughout most of the analysis, we assume that each individual carries a single chromosome of length *L* cM.

##### IBD sharing:

We say that a genomic segment is shared, or is IBD, between two individuals if it is longer than *m*(cM) and it has been inherited without recombination from a single common ancestor. We do not require the shared segments to be completely identical. That is, if any mutation has occurred since the time of the most recent common ancestor (MRCA), that would not disqualify the segments from being shared IBD according to our definition. The reason is that even in the presence of mutations, an order of magnitude calculation shows that regardless of the segment length, two individuals sharing a segment are expected to differ in just ≈1 site along the segment (see File S1, section S1.1). Therefore, in a long IBD segment, the number of differences should be very small compared to the number of matches. In practice, there are also other sources of error in IBD detection, most notably phase switch errors. We assume, however, that there always exists a large enough length threshold above which segments are detectable without errors (Browning and Browning 2011; Brown *et al.* 2012), which corresponds to the parameter *m* introduced above; the precise value of the threshold will depend on the genotyping/sequencing technology. We assume that information is available for *M* markers, uniformly distributed (in genetic distance) along the chromosome and densely enough that any effect caused by the discreteness of the markers is negligible (say, if *m* ⋅ (*M*/*L*) ≫ 1). We define the *total sharing* between two individuals as the fraction of their markers that are found in shared segments.

#### Mean total sharing:

In this subsection, we review the derivation of the mean fraction of the genome found in segments shared between two individuals (Palamara *et al.* 2012). We assume that the coalescent process along the chromosome can be approximated by the sequentially Markov coalescent (McVean and Cardin 2005) and ignore the different behavior of sites at the ends of the chromosome. Consider first a single site *s* and assume that its MRCA dates *g* generations ago. The total length ℓ of the segment in which the site is found is the sum of ℓ_{R} and ℓ_{L}, where ℓ_{R} and ℓ_{L} are the segment lengths to the right and left of *s*, respectively (all lengths are in centimorgans). The distributions of ℓ_{R} and ℓ_{L} are exponential with rate *g*/50, since the two individuals were separated by 2*g* meioses, each of which introduces a recombination event with rate 0.01/cM, and the nearest recombination would terminate the shared segment. The probability *π* of the total segment length, ℓ, to exceed *m* is, given *g*,*g* → *Nt* the times to the MRCA are exponentially distributed with rate 1: Φ(*t*) = *e*^{−}* ^{t}*. Therefore,

*I*(

*s*) is the indicator that site

*s*is in a shared segment, and the sum is over all sites. The mean fraction of the genome shared is

*mN*→ ∞, 〈

*f*

_{T}〉 → 0 and for

*mN*→ 0, 〈

*f*

_{T}〉 → 1. For large

*N*, we have 〈

*f*

_{T}〉 ≈ 100/(

*mN*).

#### The variance of the total sharing:

We now turn to calculating the variance of the total sharing. Using Equation 3,*π*_{2}(*s*_{1}, *s*_{2}) is the probability that both markers *s*_{1} and *s*_{2} are on shared segments and *π* is given by Equation 2. In the rest of the section, we assume that each individual carries one chromosome only, for if we have *c* chromosomes, each of length *L _{i}*, then (if the two individuals are not close relatives)

*f*

_{T,(}

_{i}_{)}is the total sharing in chromosome

*i*, assumed independent of the other chromosomes. Rewriting

*π*

_{2}(

*s*

_{1},

*s*

_{2}) as

*π*

_{2}(

*k*), where

*k*is the number of markers separating

*s*

_{1}and

*s*

_{2}, we have

*π*

_{2}(

*k*), for which we provide three approximations. The first is presented below and the second (which is a variation of the first) is presented in File S1, section S1.3. The third approximation, which is the most crude but yields an explicit dependence on the population parameters, is presented in

*An approximate explicit expression.*

In the first approach, we assume that once the times *t*_{1}, *t*_{2} to the MRCA at the two sites are known, the sites are (or are not) in shared segments independently of each other and with probabilities given by Equation 1. Clearly, this assumption is violated when both sites belong to the same shared segment, and in File S1, section S1.3, we show how this assumption can be avoided (but at the cost of significantly complicating the analysis). Nevertheless, it gives a good approximation, as we later see (Figure 2). We can therefore use Equation 1 to write*t*_{1}, *t*_{2}) is the joint probability density function (PDF) of *t*_{1} and *t*_{2} and*t*_{1}, *t*_{2}). We therefore reduced the problem of finding *π*_{2}(*k*) into that of finding

To find Φ(*t*_{1}, *t*_{2}) (or rather, its Laplace transform), we use the continuous-time Markov chain representation of the coalescent with recombination (Hudson 1983; Simonsen and Churchill 1997; Wakeley 2009). The chain is illustrated in Figure 1. Initially (present time), the chain is in state 1, corresponding to two chromosomes carrying two sites each. The chain terminates at state 8, when both sites have reached their MRCA. To construct the chain, coalescence events were assumed to occur at rate 1 and recombination events at rate *ρ*/2, where

Denote by *P _{i}*(

*t*) the probability that the chain is at state

*i*at time

*t*, given that it started at state 1. The probability that the two sites have reached their MRCA simultaneously in the time range [

*t*,

*t*+

*dt*] is

*P*

_{1}(

*t*)

*dt*, since this is the product of the probability that the chain is at state 1 at time

*t*(

*P*

_{1}(

*t*)) and the probability of the transition 1 → 8 in the given time interval (

*dt*). The probability that only the left site has reached its MRCA (and the right site has not) in [

*t*,

*t*+

*dt*] is

*P*

_{2}(

*t*)

*dt*+

*P*

_{3}(

*t*)

*dt*: this corresponds to the transitions 2 → 5 and 3 → 7. This is also the probability that only the right site has reached its MRCA in [

*t*,

*t*+

*dt*] (transitions 2 → 4, 3 → 6). Finally, the probability that the left site has reached its MRCA in [

*t*

_{1},

*t*

_{1}, +

*dt*

_{1}] and that the right site has reached its MRCA in [

*t*

_{2},

*t*

_{2}+

*dt*

_{2}] (

*t*

_{2}>

*t*

_{1}) is

*t*

_{2}−

*t*

_{1}) and then leave to the terminal state is

*t*

_{1}>

*t*

_{2}(with the transitions 2 → 4 and 3 → 6). In sum, for

*t*

_{1},

*t*

_{2}> 0,

*δ*(

*t*) is the Dirac delta function and Θ(

*t*) = 1 for

*t*> 0 and is otherwise zero. Laplace transforming the last equation,

*i*= 1, 2, 3) are the Laplace transforms of

*P*(

_{i}*t*). The Laplace transforms can be calculated using the general relation

*Q*is the transition rate matrix:

*Q*is the transition rate from

_{ij}*i*to

*j*≠

*i*and

*A*= (1 +

*q*

_{1})(1 +

*q*

_{2}),

*B*= (3 +

*q*

_{1}+

*q*

_{2})(6 +

*q*

_{1}+

*q*

_{2}),

*C*=

*ρ*(2 +

*q*

_{1}+

*q*

_{2}),

*D*= 13 + 3(

*q*

_{1}+

*q*

_{2}), and

*E*=

*ρ*

^{2}(2 +

*q*

_{1}+

*q*

_{2}). Equation 11 was also derived in Griffiths (1991), using the birth-and-death process equivalent of the coalescent with recombination, and can also be derived using the Feynman–Kac formula (see File S1, section S1.4). Substituting, using Mathematica, Equation 11 in Equation 7, and then using Equation 6 gives an expression for the variance,

*standard deviation*(SD) of the total sharing is defined as usual as

To evaluate the accuracy of our expressions for the mean and SD of the total sharing, we used the Genome coalescent simulator (Liang *et al.* 2007), along with an add-on that returns, for each generated genealogy, the locations of the segments that are IBD between each pair of individuals (Palamara *et al.* 2012). The simulation results (see also *Methods*) are presented and compared to the theory in Figure 2. In each panel, we varied one of *N*, *m*, and *L*, keeping the two others fixed (as long as the marker density is large enough, the number of markers *M* has no effect on the variance). Across most of the parameter space, our expressions agree well with simulations. Notable deviations, however, arise for the SD in particularly short or long chromosomes. For these cases, the second, more complicated approximation, which we mentioned above and appears in File S1, section S1.3, is more accurate (Figure 2).

#### An approximate explicit expression:

In this subsection, we derive another, simpler approximation of the variance, one that is less accurate but that has an explicit dependence on the population and genetic parameters. The gist of this approximation is that the main contribution to the variance comes from the long-distance probability of pairs of sites to reside on the same segment. Denote the distance between two given sites by *d*, and assume that *d* > *m*. For a given pair of individuals, if there was no recombination event between the two sites in the history of the two lineages, then both sites lie on a shared segment of length ≥*d* > *m*. Of course, even if there was a recombination event, the two sites could still be each on a different shared segment. However, this occurs with probability very close to *π*^{2}, the probability that the two sites are on shared segments given that they are independent.

In terms of Equation 6, the above approximation translates to, for *d* > *m*,*p*_{nr} is the probability of no recombination,*Nd* ≫ 50. This is true because in the ancestral process (Figure 1), no recombination corresponds to a coalescence event taking place before any recombination event. Since the coalescence rate is 1 and the recombination rate is *ρ*, Equation 14 follows. We can then further simplify and neglect the contribution to the variance from sites separated by short distance *d* < *m*. Finally, we can also neglect the single-site term of the variance, since it scales as 1/*M* and therefore vanishes when the markers are dense. Overall, the simplified Equation 6 gives*L* ≫ *m*. Nicely, Equation 15 provides an explicit (and rather simple) dependence on *N*, *L*, and *m*, and as expected, the expression does not depend on the marker density. Equation 15 is also plotted in Figure 2, showing that it fits quite well to the simulation results, although it is usually less accurate than Equation 12.

For the entire (autosomal) human genome, we use Equation 5,*m* ≈ 1(cM), the last equation gives

##### A variable population size:

The framework presented above can be extended to calculate the variance for a generalization of the Wright–Fisher model in which the population size is allowed to change in time. Denote the population size as *N*(*t*) = *N*_{0}*λ*(*t*), where *t* is the time (scaled by *N*_{0}) before present. The PDF of the (scaled) coalescence time for two lineages is (see, *e.g.*, Li and Durbin 2011)*et al.* (2012), the mean of the total sharing is obtained by substituting the above Φ(*t*) in Equation 2, giving*p*_{nr}. For sites distance *d* apart,*t* the sites are separated by 2*N*_{0}*t* meioses, in each of which the probability of no recombination is *e*^{−}^{d}^{/100}. Equation 18 reduces to Equation 14 for *λ*(*t*) = 1 [where Φ(*t*) = *e*^{−t}]. Equation 18 can then be substituted into Equation 15, giving

##### The total sharing in a length range:

Consider the quantity _{1}, ℓ_{2}]. Clearly, *m* = ℓ_{1} and when *m* = ℓ_{2}. The average is simply*et al.* (2012) and then used for demographic inference. Here, we calculate the variance, *I*(*s*; *m* = ℓ) is the indicator that site *s* is in a shared segment of length at least ℓ, *π _{m}*

_{=ℓ}is the probability associated with the indicator, and

*π*

_{2}(

*s*

_{1}, ℓ

_{1};

*s*

_{2}, ℓ

_{2}) is the probability that site

*s*

_{1}is in a shared segment of length at least ℓ

_{1}and site

*s*

_{2}is in a shared segment of length at least ℓ

_{2}. The key approximation, similar to the one made in

*An approximate explicit expression*section (Equation 15), is that

*longer than*ℓ

_{2}. Defining

*p*

_{nr}, as before, as the probability of no recombination between

*s*

_{1}and

*s*

_{2}in the history of the two individuals, we have

_{1}, ℓ

_{2}≪

*L*, the variance depends only on the ratio ℓ

_{2}/ℓ

_{1}.

#### The total sharing distribution and an error model:

Having the first two moments of the total sharing, we sought to find its distribution, *P*(*f*_{T}). While we could not find an exact expression, we could find, inspired by the numerical results of Huff *et al.* (2011), a reasonable fit. Huff *et al.* (2011) showed empirically that for HapMap’s Europeans (International HapMap Consortium 2007), the number of segments shared between random individuals was distributed as a Poisson and that the length of each segment was distributed exponentially with a lower cutoff at *m*, independently of the number of segments. If this is true also for the Wright–Fisher model, then the total length of the shared segments, defined as *L*_{T} = *Lf*_{T}, is distributed as a sum of a Poisson-distributed number of these exponentials. In equations,*n*_{0} is the mean number of segments, the density of the ℓ* _{i}*’s is exp[−(ℓ

*−*

_{i}*m*)/ℓ

_{0}]/ℓ

_{0}(ℓ

_{0}+

*m*is the mean segment length), and ℓ

*≥*

_{i}*m*. Such an expression is easier to handle in Laplace space, where the Laplace transform of

*P*(

*L*

_{T}),

*n*

_{0}and ℓ

_{0},

*P*(

*L*

_{T}) [and then

*P*(

*f*

_{T})] was uniquely determined from

*et al.*1982; Hollenbeck 1998). For specific values of (

*N*,

*L*,

*m*), we fitted the parameters

*n*

_{0}and ℓ

_{0}by minimizing the squared error between the simulated distribution and

*P*(

*f*

_{T}) (from Equation 24) in a grid search. The results are plotted in Figure 4, with the fitted

*n*

_{0}and ℓ

_{0}plotted in Figure S2. It can be seen that Equation 24 captures quite well the unique features of

*P*(

*f*

_{T}) (except in the tail; see Figure S2).

Inspection of the distributions (Figure 4) for several values of *N* leads to some interesting observations. For small *N* (*e.g.*, *N* ≈ 1000 and for *m* = 1 cM and *L* = 278 cM), where the typical amount of sharing is large (〈*f*_{T}〉 ≈ 5−10%, *n*_{0} ≈ 10, ℓ_{0} ≈ 1 cM), the distribution is unimodal (but not normal), centered around 〈*f*_{T}〉. As *N* increases (*e.g.*, *N* ≈ 3000), a discontinuous peak appears at *f*_{T} = 0, with *P*(*f*_{T}) = 0 for 0 < *f*_{T} < *m*/*L* (≈0.4%). This is of course due to the restriction on the minimal segment length: a pair of individuals can share either nothing or at least one segment of length *m*. For *f*_{T} > *m*/*L* the distribution is continuous, still centered around 〈*f*_{T}〉, but with small, yet notable peaks at *f*_{T} = *m*/*L*, 2*m*/*L*, 3*m*/*L*, … corresponding to pairs of individuals sharing a small number of minimal length segments. For even larger *N* (*e.g.*, *N* ≈ 10,000 and beyond), 〈*f*_{T}〉 drops below 1%, *n*_{0} ≈ 1 (ℓ_{0} still ≈1 cM), and the peaks at *f*_{T} = 0 and *f*_{T} = *m*/*L* increase such that the distribution decreases almost monotonically beyond *m*/*L*. An analytical bound on the fraction of pairs not sharing any segment is given in File S1, section S2.1 (Figure S3).

##### An error model:

To model errors during IBD detection, suppose that we set *m* large enough to avoid any false positives (*i.e.*, detected segments that are not truly IBD). We model false negatives as true IBD segments being missed with probability *ε* (independent of the segment length). It is possible to extend the above formulation (Equation 23) to the case with errors, as follows. Summing over the true number of segments, *n*′, the distribution of the number of detected segments, *n*, is*n*_{0}(1 − *ε*). Then, as a sum of a random number of independent variables, the mean and variance of *L*_{T} are 〈*L*_{T}〉 = 〈*n*〉〈ℓ〉 and Var[*L*_{T}] = 〈*n*〉 Var[ℓ] + 〈ℓ〉^{2} Var[*n*], where *n* is the number of segments and ℓ is the segment length. In our case,*ε*) times their noise-free values. This is confirmed by simulations in Figure 5.

##### Other approaches:

We note that a similar approach dates back to R. A. Fisher (Fisher 1954) and others (Bennet 1954; Stam 1980; Chapman and Thompson 2003) in their work on IBD sharing in a model where the population has been recently founded by a number of unrelated individuals. Briefly, those authors also assumed a Poisson number of IBD segments, each of which is exponentially distributed. They then matched the Poisson and exponential parameters to the average IBD sharing and the average number of segments, which they calculated using their population model. Here, we used a different population model (the coalescent; see also File S1, section S2.2) and assumed the exponentials have a cutoff at *m*. In principle, the parameters *n*_{0} and ℓ_{0} can also be directly calculated, by matching the mean and variance of the total sharing; see File S1, section S2.3. In practice, however, this does not give a good fit. In Palamara *et al.* (2012), a similar compound Poisson approach was developed but with a different, coalescent theory-based approximation of the segment length PDF, leading to an improved fit of the remaining parameter *n*_{0}.

### The cohort-averaged sharing

We have so far considered the total sharing between any two random individuals in a population. In practice, we usually collect genetic information on a cohort of *n* individuals. In this context, we can attribute each individual with the amount of genetic material it shares with the rest of the cohort. Define, for each individual, the *cohort-averaged* sharing *n* − 1 individuals in the cohort. Naively, one may anticipate that the width of the distribution of *n*, because the averaging will tend to eliminate any randomly arising differences between the individuals. We show that in fact, the width of the distribution approaches a nonzero limit. The individuals at the right tail of the cohort-averaged sharing distribution can be seen as “hypersharing”, meaning they are, on average, more genetically similar to members of the cohort than are others. Similarly, individuals at the left tail are “hyposharing”. The existence of hypersharing individuals is important for prioritizing individuals for sequencing, as we show in *Implications to sequencing study design* section.

Define the fraction of the genome shared by individuals *i* and *j* as *i*, *n* ≫ 1 and used the fact that the covariance term is identical for all (*i*, *j*_{1}, *j*_{2}) combinations and therefore, for simplicity of notation, we set *i* = 1, *j*_{1} = 2, and *j*_{2} = 3. Recall that *I*(*s*) is the indicator that site *s* is on a shared segment. Thus, the covariance can be written as*I*^{(}^{i}^{,}^{j}^{)}(*s*) is the indicator that site *s* is on a segment shared between individuals *i* and *j*, and *k* markers away from the first, is on a segment shared between 1 and 3. As in *An approximate explicit expression* section (*e.g.*, Equation 15), we will keep only the most dominant term in the sum. Consider the coalescent tree relating the three individuals 1, 2, and 3 and assume that the distance between the sites is *d* > *m*. If there was no recombination event in the entire tree between the two sites, then immediately *p*_{nr}, depends on *T*_{3}, the total size of the tree of three lineages. Since the PDF of *T*_{3} is *dN* ≫ 100,*N*, as the cohort size cannot exceed the population size), *n*. This implies that even for very large cohorts, the distribution of the cohort-averaged sharing retains a minimal width. Equation 28 is in good agreement with simulations, as shown in Figure 6A (although some deviations are seen for larger *n*). We note that the variance was computed for a given individual over all ancestral processes of a cohort of size *n*. Therefore, the variance within the cohort, for a given ancestral process might actually be smaller. Simulations results (Figure S4), however, show that unless *n* is very small, Equation (28) is a good approximation also for the variance within the cohort.

For a genome with *c* chromosomes,*i*. For the human genome and for small *n* and *m* ≈ 1 cM, Equation 16 gives*n*, Equation 27 gives*n*.

The fact that the width of the cohort-averaged sharing distribution does not approach zero for large *n* results from the “long-range” correlations between the averaged (*n* − 1) variables or, in other words, the fact that *i*, *j*_{1}, *j*_{2}. In Hilhorst and Schehr (2007), it was found that when all pairs of random variables are weakly correlated, the PDF of their average converges to a normal distribution. In our case, the covariance is *N*, *L*, and *m*, than *N*) of random pairs of individuals to be close relatives.

### Implications to sequencing study design

Suppose we have sparse genotype information for a large cohort, as well as whole-genome sequences for a subset of it. If the genotype data allow detection of IBD shared segments, then alleles not typed can be directly imputed if they lie on haplotypes shared with sequenced individuals (see, *e.g.*, Uricchio *et al.* 2012). In fact, such a strategy is expected to be quite successful; as we mentioned in the *Definitions* section, only about one recent mutation is expected on each shared segment. Since some individuals share more than others, their sequencing should be prioritized if imputation power is to be maximized. Recently, Gusev *et al.* (2012b) developed an algorithm (Infostip) for sample selection based on the observed IBD sharing. Here, using our results in *The cohort-averaged sharing* section, we calculate the theoretical maximal imputation power.

Assume first that individuals are haploids; the case of diploids is treated later. Assume a cohort of size *n*, a budget that enables the sequencing of *n _{s}* individuals, and two selection strategies: either of random

*n*individuals or of the

_{s}*n*individuals with the largest cohort-averaged sharing. Define an imputation metric,

_{s}*i*, an individual not sequenced, that is shared IBD with at least one sequenced individual. Let the selected individuals be

*i*shares with

*m*as

_{j}*j*

_{1},

*j*

_{2}, = 1, … ,

*n*(

_{s}*The cohort-averaged sharing*section). We also assume that the locations of the shared segments are independent and uniformly distributed along the genome. Under these assumptions, the probability of a locus to be covered by at least one sequenced individual is

*f*

_{T}≪ 1,

*f*

_{T}〉 is given by Equation 4. When selecting the highest-sharing individuals, values of

*The cohort-averaged sharing*section that

*f*

_{T}〉 and

*exactly*those individuals with the largest cohort-averaged sharing will not achieve the imputation power of Equation 34. This is because the top sharing individuals usually share many segments with each other and thus sequencing of all of them will be redundant (

*e.g.*, in the extreme case of siblings, both will appear as top sharing, but sequencing of both will add little power beyond sequencing just one). To avoid such redundancies, we selected the high-sharing (simulated) individuals using Infostip (Gusev

*et al.*2012b), which proceeds in a greedy manner, each time selecting the individual who shares the most with the rest of the cohort in regions that are not yet covered by the already selected individuals. We then compared the imputation power when individuals were selected either randomly or using Infostip. The results, shown in Figure 7, suggest good agreement between the theoretical Equations 32 and 34 and the simulations, at least as long as

*n*is not large (relative to

_{s}*n*). For large

*n*, the coverage is lower than predicted, likely due to the nonuniform concentration of the shared segments.

_{s}For a cohort of *n* diploid individuals (assuming phase can be resolved) we redefine the cohort-averaged sharing as*e.g.*, *i*) and assume that the individuals selected for sequencing have the largest diploid cohort-averaged sharing. Since the two terms in *p*_{c} is interpreted, as before, as the probability of a locus on a given chromosome to be in a segment shared with at least one sequenced chromosome. The theory developed above is still valid, provided that in Equations 32 and 34 *n _{s}* is replaced by 2

*n*and that in Equation 33

_{s}##### Increase in association power:

Using our results for the power of imputation by IBD, we calculate below the expected subsequent increase in power to detect rare variant association. We use the simple model of Shen *et al.* (2011), in which we consider rare variants that appear in cases but not in any control, and assume that the causal variant is dominant.

Assume that we have genotyped and detected IBD segments in a cohort of *n _{c}* (diploid) cases and

*n*controls and that we sequenced a subset of

_{t}*n*individuals, of which

_{s}*n*

_{c}_{,}

*are cases and*

_{s}*n*

_{t}_{,}

*are controls (*

_{s}*n*=

_{s}*n*

_{c}_{,}

*+*

_{s}*n*

_{t}_{,}

*). After imputation by IBD, a locus in a (diploid) individual not sequenced has probability*

_{s}*p*

_{c}is given by Equation 32 or Equation 34. For a given locus, we define the

*effective*number of cases (controls), as the number of cases (controls) for which genotypes are known either directly from sequencing or from imputation. Since there are

*n*−

_{c}*n*

_{c}_{,}

*cases not sequenced and*

_{s}*n*−

_{t}*n*

_{t}_{,}

*controls not sequenced,*

_{s}*b*cases but in no controls, under the null hypothesis that the variant assorts independently of the disease, is given by Fisher’s exact test,

*Q*as the threshold

*P*-value and denote by

*b** the smallest integer above which

*P*(cases only) <

*Q*. When the causal variant carrier frequency in cases is

*β*, the probability of the variant to appear in

*b*cases is binomial, and the power is, for a given

*Q*,

*vs. n*

_{c}_{,}

*, when the sequencing budget*

_{s}*n*=

_{s}*n*

_{c}_{,}

*+*

_{s}*n*

_{t}_{,}

*is fixed and for representative parameter values. In Figure 8, we plot the power*

_{s}*vs.*the carrier frequency for the optimal value of

*n*

_{c}_{,}

*. Figure 8 demonstrates that the power increases by severalfold when imputation by IBD is used. This is, however, an expected consequence of increasing the effective sample size and would likely be achieved with any imputation algorithm (*

_{s}*e.g.*, Howie

*et al.*2012). Figure 8 also shows an additional, slight increase when the highest-sharing individuals are selected for sequencing. Thus, while it should be easy to identify the highest-sharing individuals given a genotyped cohort [

*e.g.*, using Infostip (Gusev

*et al.*2012b)], and doing so will increase the association power, our results suggest that the gain in power over a random selection will be minor.

### Other applications of the variance of IBD sharing

#### An estimator of the population size:

Assume that we have genotyped or sequenced a diploid chromosome of one individual and calculated *f*_{T}, the fraction of the chromosome shared between the individual’s paternal and maternal chromosomes. Can we estimate the effective population size?

According to Equation 4, *N* gives (see also Palamara *et al.* 2012)*f*_{T}〉 ≪ 1. This suggests the following estimator,*f*_{T}], which we could not calculate, but could approximate as follows. Let us write *f*_{T} − 〈*f*_{T}〉| ≪ 〈*f*_{T}〉 (in which regime clearly *f*_{T}〉 ≈ 100/(*mN*) (Equation 4) and Equation 15 for *S*_{2} is the number of heterozygous sites and *μ* is the mutation rate (per chromosome per generation). Watterson’s estimator is unbiased,

Note that in practice, the proposed estimator is not very useful, as it diverges whenever *f*_{T} = 0 (which is common for large *N*). Suppose, however, that we have sequences for *n* (haploid) chromosomes and that we have computed the total sharing between all pairs. Define *n* (haploid) chromosomes, *N* and *n*), which decays to zero with increasing *n* slower than the IBD estimator. Simulation results, shown in Figure S8 and Figure S9, confirm the accuracy of Equation 39 and show that the bias is limited to very small values of *n*.

In the context of the error model in *The total sharing distribution and an error model* section, introducing a probability *ε* to miss a true IBD segment will decrease the average total sharing by (1 − *ε*) (Equation 25). Consequently, Equation 38 will estimate a population size ∼1/(1 − *ε*) [≈ (1 + *ε*) for small *ε*] larger than the true one.

#### IBD sharing between siblings:

The total IBD sharing between relatives can usually be decomposed into sharing due to the recent coancestry and “background” sharing due to population inbreeding (Huff *et al.* 2011; Henn *et al.* 2012). While much is known about the distribution of sharing in pedigrees (*e.g.*, Hill and Weir 2011), less is known about the population-level sharing, and relatedness detection algorithms (*e.g.*, Huff *et al.* 2011; Henn *et al.* 2012) estimate it empirically. In a different domain, the variance in sharing between relatives appears in theoretical calculations of the variance of heritability estimators (Visscher *et al.* 2006). Our results for the variance of the total sharing in the Wright–Fisher model (*Variation in IBD sharing in the Wright–Fisher model* section) can thus have practical applications if modified to account for recent coancestry.

Here, we calculate the variance of the sharing between siblings by combining the approach of Visscher *et al.* (2006) with that of our *An approximate explicit expression* section. Assume that two individuals are siblings, either half or full: we calculate, without loss of generality, only the sharing between the two chromosomes that descended from the same parent and denote the fraction of sharing as *f*_{S}. Assume as before a population of size *N* and one chromosome of length *L*. For a given marker to be on a shared segment, it can either be on a segment directly coinherited from the same grandparent (probability 1/2) or otherwise be on a segment shared between the grandparents (probability *π*/2, Equation 2). We ignore boundary effects near the sites of recombination at the parent. The mean fraction of the genome shared is therefore just 〈*f*_{S}〉 = (1 + *π*)/2. The variance can be written as in Equation 6,*π*_{2,S}(*k*) is the probability of two sites separated by *k* markers [or genetic distance *r* is the recombination fraction and we used Haldane’s map function (Visscher *et al.* 2006). Also with probability *p*_{same}, the sites are both inherited from different grandparents, and we use the expressions developed in *An approximate explicit expression* section for the probability of the sites to be in shared segments: *π*_{2,S}(*k*) = *π*^{2} + Θ(*d* − *m*)*p*_{nr} [where *p*_{nr} ≈ 50/(*Nd*) and Θ(*x*) = 1 for *x* > 0 and is zero otherwise]. With probability (1 − 2*p*_{same}), one site is coinherited and the other is not; in that case *π*_{2,S}(*k*) = *π*. Approximating the sum as an integral and simplifying, we finally have*N* as small as 500.

#### IBD sharing after an admixture pulse:

In this final subsection, we study the IBD sharing in a simple admixture model. In our model, a single population *A* of constant size *N* has received gene flow from population *B*, *G*_{a} generations ago. We assume that gene flow took place for one generation only (hence, an admixture pulse) and, further, that population *B* is sufficiently large that the chromosomes it donated to *A* share no detectable IBD segments. Denote the fraction of the lineages coming from population *A* at the admixture event as *α* (fraction 1 − *α* coming from *B*), and let *T*_{a} = *G*_{a}/*N* be the scaled admixture time. We are interested in IBD sharing between extant chromosomes in population *A*.

To approximate the mean IBD sharing in the sample, note that if admixture was very recent, then two chromosomes will be potentially shared only if both descend from population *A*, which occurs with probability *α*^{2}. Therefore, the mean sharing is *α*^{2} times its value without admixture. While this is a good approximation (Figure S10), it does not account for two chromosomes, one or two of which are from the external population *B*, having their common ancestor more recently than the admixture event. We therefore calculate the mean IBD sharing using Equation 17, using the following (nonnormalized) PDF for the coalescence times,*f*_{T}〉_{admix} ≈ *α*^{2}〈*f*_{T}〉_{no admix} + (1 − *α*^{2})*T*_{a}. The first term corresponds to lineages descending from population *A*; the second term corresponds to at least one of the lineages descending from population *B* but where the lineages have coalesced already in the hybrid population. The variance can be similarly calculated, by substituting Equation 41 into Equation 19,*γ* is the Euler–Mascheroni constant, and we solved the integrals in Mathematica and later simplified under specific assumptions (see File S1, section S4). Equation 43 usually predicts a variance slightly smaller than the case of no admixture. Simulation results are shown in Figure S10 for the mean and variance. While agreement is not perfect (as Equation 19 is itself approximate), Equations 42 and 43 capture the main effects of changing *α* and *T*_{a}. Note that the result of Equation 42 implies that, for small *T*_{a} and large *N*, the observed mean IBD sharing is as if the population is of size ≈*N*/*α*^{2}.

##### A test for admixture:

For recent admixture (small *T*_{a}), the fractions of ancestry vary among individuals (Verdu and Rosenberg 2011; Gravel 2012). In our model, since a pair of segments is shared mostly when both descend from population *A*, some individuals will share more than others merely due to having a larger fraction of *A* ancestry. In turn, this will increase the variance of the cohort-averaged sharing. This observation suggests the following test for a recent gene flow into a population: (i) extract IBD segments and calculate the mean fraction of total sharing over all pairs, *N*_{pop} populations of size *P*-value for rejecting the null hypothesis of no admixture is the fraction of the *N*_{pop} populations where the SD of the cohort-averaged sharing was larger than the observed one. Note that the identity of the external population need not be known, nor are the admixture fraction and time; the test relies on admixture creating a gradient of ancestry fractions and hence an increased variability in the similarity between individuals. Simulation results are plotted in Figure S10, showing that for a *P*-value of 0.05 and *G*_{a} = 5, gene flow with *α* ≈ 0.9 or lower can be detected (*α* ≈ 0.8 or lower for *G*_{a} = 10). We stress that a broader than expected distribution of cohort-averaged sharing does not *necessarily* indicate admixture, and there might be other factors responsible for the effect (see also the *Discussion*). We validated, however, that IBD detection errors alone (as in the model in *The total sharing distribution and an error model* section) as well as variable population size (in a simple two-size model) do not lead to significant *P*-values in the admixture test (Figure S11).

##### IBD sharing and admixture in the Ashkenazi Jewish population:

As our final result, we apply the admixture test to the real population of Ashkenazi Jews (AJ). Historical records, and recently also genetic studies, suggest that AJ form a genetically distinct group of likely Middle-Eastern origin. However, the AJ population was also shown to receive a significant amount of gene flow from neighboring European populations (Ostrer 2001; Atzmon *et al.* 2010; Behar *et al.* 2010; Bray *et al.* 2010; Guha *et al.* 2012). We analyzed a data set of ≈ 2600 AJ, details of which have been published elsewhere (Guha *et al.* 2012; Palamara *et al.* 2012) and are summarized in the *Methods* section. To detect IBD shared segments in the AJ population, we used Germline (Gusev *et al.* 2009). For 500 individuals on chromosome 1, and with *m* = 1 cM, the average fraction of sharing over all pairs is ≈4.4%, leading to an estimated population size of *et al.* (2012), a population model was inferred based on the fraction of the genome shared at different segment lengths. The model’s best estimate of AJ history is a slow expansion until ∼35 generations ago and then a severe bottleneck (effective population size of just 270) followed a by rapid expansion to a current size of a few millions. As can be seen in Figure 10, A and B, the model agrees well with the distribution of the fraction of total sharing over all pairs, but predicts a much narrower distribution of cohort-averaged sharing than the true one. Here too, in none of 100 simulated populations with the inferred demography was the SD of the cohort-averaged sharing as large as in the real data. These results, therefore, suggest (based on the AJ population alone) that the AJ population was the target of a recent gene flow. To confirm that the increase in the variance of the cohort-averaged sharing is due (at least partly) to admixture, we ran an admixture analysis [Admixture (Alexander *et al.* 2009)] comparing AJ to HapMap’s CEU (International HapMap Consortium 2007). As can be seen in Figure 10C, the fraction of “AJ ancestry” is indeed highly correlated with the cohort-averaged sharing (Pearson’s *r* = 0.59).

## Discussion

The recent availability of dense genotypes, together with sophisticated detection tools, has transformed IBD sharing into an increasingly important tool in population genetics. Here, we used coalescent theory to compute the variance and other properties of the total sharing in the Wright–Fisher model. For the variance, we suggested three derivations, one of which was more coarse but had a simple closed form that was later extended to populations of variable size. Investigating the cohort-averaged sharing, we discovered the curious phenomenon of hypersharing. We showed how this can be exploited to improve power in imputation and association studies. We also calculated the variance of the total sharing between siblings and briefly considered some implications to the accuracy of demographic inference. We finally investigated IBD sharing in a hybrid population and suggested a test for admixture based on the cohort-averaged sharing, which we then applied to the Ashkenazi Jewish population. We provide Matlab routines for the main results (File S2).

Most of our analytical results depend on certain assumptions and simplifications, as specified in the individual sections and in File S1, section S1.2. Additionally, in reality, the Wright–Fisher model and the coalescent are only approximations of the true ancestral process, and procedures such as phasing, IBD inference, and imputation are also prone to error. IBD detection errors will particularly affect our results for imputation and association studies (*Implications to sequencing study design* section), and these results should therefore be considered as idealized upper bounds. The error model we introduced, where each IBD segment is missed with a certain probability, gives a sense of the effect of errors. Investigation of more detailed models, *e.g.*, length-dependent error rate for segment misdetection or more realistic models for imputation and association studies, is challenging and left for future work.

Prospects of our work are in a few fields. First, as shown in Palamara *et al.* (2012), theoretical characterization of IBD sharing can lead to new methods for demographic inference, which are expected to perform particularly well when investigating the recent history of genetic isolates. Here, we expanded the theory of IBD sharing to compute the variance of the total sharing and the cohort-average sharing. This turned useful, for example, when we provided in *An estimator of the population size* section expressions for the variance of an estimator of the population size based on the average sharing over all pairs of chromosomes and in *IBD sharing after an admixture pulse* section a test for recent admixture. In another domain, understanding the distribution of sharing between relatives can improve the accuracy of relatedness detection (*IBD sharing between siblings* section). Other potential applications are in the detection of regions either positively selected or associated with a disease based on excess sharing, although more work is needed for these. Finally, our results provide the first estimate for the potential success of imputation by IBD strategies (*Implications to sequencing study design* section). We note that of course, once a given cohort has been genotyped, IBD can be calculated directly to estimate the expected success of imputation. However, in many cases, study design takes place before the actual recruiting and genotyping, and then, if a rough estimate of the population size is available, our results can be invoked to estimate the amount of resources needed.

One of our interesting findings was the presence of hypersharing individuals. While we did not define the term precisely, we referred to the fact that even for large cohorts, the variance of the cohort-averaged sharing does not decrease below a certain value. This result, while somewhat counterintuitive, follows naturally from the population model. In the real population of AJ, we showed that the distribution of the cohort-averaged sharing is even broader, indicating possible admixture, and indeed, we found that the cohort-averaged sharing is highly correlated with the Ashkenazi ancestry fraction. This is not to say that admixture was the only factor shaping the distribution of IBD sharing; other factors such as selection or population substructure could have been playing a role as well. Our results, however, emphasize the importance of reconstructing the AJ demography simultaneously with that of their neighboring populations.

## Acknowledgments

We thank the reviewers for insightful comments and Omer Bobrowski for discussions. S.C. thanks the Human Frontier Science Program for financial support. I.P. acknowledges support from National Science Foundation grant CCF 0845677 and National Institutes of Health grant U54 CA121852.

## Footnotes

*Communicating editor: Y. S. Song*

- Received October 26, 2012.
- Accepted December 14, 2012.

- Copyright © 2013 by the Genetics Society of America