## Abstract

The population-genetic statistic $F_{ST}$ is used widely to describe allele frequency distributions in subdivided populations. The increasing availability of DNA sequence data has recently enabled computations of $F_{ST}$ from sequence-based “haplotype loci.” At the same time, theoretical work has revealed that $F_{ST}$ has a strong dependence on the underlying genetic diversity of a locus from which it is computed, with high diversity constraining values of $F_{ST}$ to be low. In the case of haplotype loci, for which two haplotypes that are distinct over a specified length along a chromosome are treated as distinct alleles, genetic diversity is influenced by haplotype length: longer haplotype loci have the potential for greater genetic diversity. Here, we study the dependence of $F_{ST}$ on haplotype length. Using a model in which a haplotype locus is sequentially incremented by one biallelic locus at a time, we show that increasing the length of the haplotype locus can either increase or decrease the value of $F_{ST}$, and usually decreases it. We compute $F_{ST}$ on haplotype loci in human populations, finding a close correspondence between the observed values and our theoretical predictions. We conclude that effects of haplotype length are valuable to consider when interpreting $F_{ST}$ calculated on haplotypic data.

The quantity $F_{ST}$ has seen broad usage in studies of population structure and divergence (Holsinger and Weir 2009). Wright (1951) originally formulated $F_{ST}$ for a biallelic locus; subsequent perspectives that accommodate more than two alleles (Nei 1973) have enabled its computation on multiallelic loci such as microsatellites and haplotype loci.

Calculations of $F_{ST}$ from haplotypic data have provided insight into a variety of questions, especially following the development of a widely used haplotype-based statistical test for population subdivision (Hudson *et al.* 1992). Haplotypic computations of $F_{ST}$ have been useful for studying patterns of population structure, species divergence, and gene flow in numerous organisms (Hanson *et al.* 1996; Clark *et al.* 1998; Rocha *et al.* 2005; Jakobsson *et al.* 2008).

$F_{ST}$ can be computed from haplotypic data in multiple ways. One method computes sequence differences for pairs of sequences from the same population and from different populations, and relies on a connection between $F_{ST}$, pairwise sequence differences, and coalescence times (Slatkin 1991; Hudson *et al.* 1992). Both this approach and the related analysis of molecular variance framework of Excoffier *et al.* (1992) rely on comparisons of sequences. A fundamentally different method employs a clustering technique to place distinct haplotypes into a set of haplotype clusters, regards the clusters of a sequence at a specified location as alleles, and computes $F_{ST}$ from cluster membership frequencies (Jakobsson *et al.* 2008; San Lucas *et al.* 2012). A third method treats a specific segment of the genome as a “haplotype locus,” so that distinct haplotypes over that genomic segment represent distinct “haplotype alleles,” and computes $F_{ST}$ from the haplotype alleles (Clark *et al.* 1998; Oleksyk *et al.* 2010).

This last approach, treating each distinct haplotype as its own distinct allele, provides a theoretical framework for understanding an observed dependence of $F_{ST}$ on haplotype length. Studies that have computed $F_{ST}$ using both individual single-nucleotide polymorphisms (SNPs) and haplotypes in the same data set have consistently observed that haplotype $F_{ST}$ tends to be smaller than SNP $F_{ST}$ [Clark *et al.* 1998; Jakobsson *et al.* 2008 (Figure S29); Oleksyk *et al.* 2010; Sjöstrand *et al.* 2014 (Figure 2)]. An explanation for this basic pattern is suggested by the dependence of $F_{ST}$ on the frequency of the most frequent allelic type (Jakobsson *et al.* 2013; Edge and Rosenberg 2014; Alcala and Rosenberg 2017). A lower frequency for the most frequent type at a locus generally results in lower values of $F_{ST}$, and the most frequent haplotype at a particular haplotype locus is necessarily no more frequent than the most frequent SNP allele that it contains. We would then expect that because longer haplotype loci are likely to have a lower frequency for the most frequent haplotype, such loci would generate lower $F_{ST}$ values.

Here, we examine the effect of haplotype length on $F_{ST}$. We derive the value of $F_{ST}$ upon the addition of a biallelic SNP locus to an existing haplotype locus. Using this result, we predict the effect of haplotype length on values of $F_{ST}$, assuming for mathematical convenience that added SNPs are in linkage equilibrium with existing haplotype loci. Comparing values of $F_{ST}$ for haplotype loci in human genomic data to those obtained by our theoretical predictions, we find that our predictions largely match the observed values, despite the presence of linkage disequilibrium (LD) between the added SNPs and the existing haplotype loci in the data but not in the theory. In addition, we find that haplotype-based computations are likely to reduce $F_{ST}$ compared to single-SNP computations. We propose that a variety of haplotype lengths be used when computing $F_{ST}$ from haplotype loci and that the length of the haplotype locus be considered when interpreting the resulting $F_{ST}$ values.

## Model

### Definitions

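We compute $F_{ST}$ on a multiallelic locus in a pair of populations, 1 and 2, of equal size. Denote by $p_{ki}$ the frequency of allele *i* in population *k*, with $p_{ki} \geq 0$ for all $(k, i)$. For each *k*, $\sum_{i=1}^{I} p_{ki} = 1$, where *I* is the total number of distinct alleles at the locus. We use Nei’s (1973) formulation of $F_{ST}$,

$$F_{ST} = \frac{H_S - H_T}{1 - H_T}, \tag{1}$$

where

$$H_S = \frac{1}{2}\left(\sum_{i=1}^{I} p_{1i}^2 + \sum_{i=1}^{I} p_{2i}^2\right) \tag{2}$$

is the mean of the two population homozygosities, and

$$H_T = \sum_{i=1}^{I} \left(\frac{p_{1i} + p_{2i}}{2}\right)^2 \tag{3}$$

is the homozygosity of the population obtained by pooling populations 1 and 2 together.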

For $k = 1$ and $k = 2$, we define the population homozygosities by

$$J_k = \sum_{i=1}^{I} p_{ki}^2. \tag{4}$$

We define the dot product between the two population allele frequency vectors by

$$D = \sum_{i=1}^{I} p_{1i}\, p_{2i}. \tag{5}$$

Using Equations 4 and 5, we rewrite $F_{ST}$ (Equation 1) in the form that we use for our analysis:

$$F_{ST} = \frac{J_1 + J_2 - 2D}{4 - (J_1 + J_2 + 2D)}. \tag{6}$$

Note that a constraint exists on $D$ given $J_1$ and $J_2$:

$$0 \leq D \leq \sqrt{J_1 J_2}, \tag{7}$$

with equality in the upper bound if, and only if, each allele has the same frequency in both populations. Note that the upper bound is only achievable if $J_1 = J_2$ (see further discussion in Appendix A). The lower bound can be obtained by making each distinct allele unique to one of the two populations.
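As a quick numerical check of these definitions, the following Python sketch computes the statistic both from Nei's form (Equations 1 to 3) and from the population homozygosities and dot product (Equation 6). The function names and the example frequency vectors are illustrative assumptions, not part of the original analysis.

```python
import math

def fst_from_freqs(p1, p2):
    """Nei-style F_ST for two equal-sized populations, from allele frequency lists."""
    h_s = 0.5 * (sum(x * x for x in p1) + sum(x * x for x in p2))  # mean homozygosity
    h_t = sum(((x + y) / 2) ** 2 for x, y in zip(p1, p2))          # pooled homozygosity
    return (h_s - h_t) / (1 - h_t)  # undefined if both populations are fixed for one allele

def fst_from_components(j1, j2, d):
    """Same statistic from the two homozygosities (j1, j2) and the dot product (d)."""
    return (j1 + j2 - 2 * d) / (4 - (j1 + j2 + 2 * d))

# Hypothetical three-allele locus.
p1 = [0.6, 0.3, 0.1]
p2 = [0.2, 0.3, 0.5]
j1 = sum(x * x for x in p1)
j2 = sum(x * x for x in p2)
d = sum(x * y for x, y in zip(p1, p2))

print(fst_from_freqs(p1, p2), fst_from_components(j1, j2, d))
```

For these example vectors the two forms agree, and the dot product falls inside the range allowed by Equation 7.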

### Adding a SNP to a haplotype locus

We are concerned with the scenario in which the multiallelic locus is a “haplotype locus,” a genomic region of specified length for which each distinct haplotype is regarded as a distinct “allele.” We add a biallelic locus to our multiallelic locus, corresponding to a scenario in which the “haplotype locus” is augmented by one SNP. We refer to the multiallelic locus as a “haplotype locus,” to each of its alleles as a “haplotype,” and to the biallelic locus as a SNP. However, our results can apply to any kind of multiallelic locus augmented by a biallelic locus. We refer to the haplotype locus augmented by a SNP as an “extended haplotype locus.”

Our goal is to compute $F_{ST}$ over the extended haplotype locus defined by adding the SNP to the haplotype locus, given the population frequencies of the alleles of the haplotype locus and the SNP. The SNP has two alleles: a major allele, with frequency greater than or equal to $\tfrac{1}{2}$, and a minor allele. We identify these alleles by examining the mean allele frequency between the two populations, so that the minor allele has mean frequency $\tfrac{1}{2}$ or less, even if it is the more common allele in one but not the other of the two populations.

The alleles of the extended haplotype locus are cooccurrences of the alleles of the SNP with the haplotypes of the haplotype locus. Each of the *I* distinct haplotypes can cooccur with either the major or the minor allele of the SNP. Therefore, $2I$ alleles are possible for the extended haplotype locus as a result of combining the haplotype locus with the SNP. For each *i* from 1 to *I*, we index the allele formed by cooccurrence of the *i*th haplotype with the SNP minor allele by $2i - 1$, and the allele formed by cooccurrence of the *i*th haplotype with the SNP major allele by $2i$. Denote by $q_{ki}$ the frequency of the minor allele of the SNP on the *i*th haplotype in population *k*. In other words, $q_{ki}$ is the probability that haplotype *i* contains the minor allele of the SNP when augmented by the SNP. By a slight abuse of notation, using $p_{ki}$ for the frequency of allele *i* of the haplotype locus in population *k*, for each *i* from 1 to *I*, the allele frequencies of the extended haplotype locus in population *k* are

$$p_{k,2i-1} = p_{ki}\, q_{ki}, \tag{8}$$

$$p_{k,2i} = p_{ki}\,(1 - q_{ki}). \tag{9}$$

For convenience, we drop the comma in subscripts when possible.

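Written with conditional probability, if *A* is the event that the SNP minor allele is observed and *B* is the event that haplotype *i* is observed, then cooccurrence of *A* and *B* has probability $P(A \cap B) = P(A \mid B)\,P(B)$. Equation 8 merely encodes this result, with $P(A \cap B) = p_{k,2i-1}$, $P(A \mid B) = q_{ki}$, and $P(B) = p_{ki}$. If $A^c$ is the event that the major allele of the SNP is observed, then Equation 9 can be obtained by noting that $P(A^c \cap B) = P(A^c \mid B)\,P(B)$ and $P(A^c \mid B) = 1 - P(A \mid B)$, so that $P(A^c \cap B) = [1 - P(A \mid B)]\,P(B)$.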

Note that $q_{ki}$ is not necessarily equal to the overall frequency of the SNP minor allele in population *k*, or $q_k$. The notation in Equations 8 and 9 allows us to write $q_k$ as

$$q_k = \sum_{i=1}^{I} p_{ki}\, q_{ki}, \tag{10}$$

and the minor allele frequency of the SNP across all populations, *q*, as

$$q = \frac{q_1 + q_2}{2}. \tag{11}$$

Table 1 summarizes our allele frequency notation and Figure 1 provides a schematic of the process of adding a SNP to a set of haplotypes.
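The construction of the extended haplotype locus can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs: `extend_haplotype_locus`, `p`, and `q_on_hap` are hypothetical names, and the frequencies are made up.

```python
def extend_haplotype_locus(p, q_on_hap):
    """Split each haplotype into two extended alleles.

    p[i]        : frequency of haplotype i in one population
    q_on_hap[i] : frequency of the SNP minor allele on haplotype i
    """
    extended = []
    for p_i, q_i in zip(p, q_on_hap):
        extended.append(p_i * q_i)        # haplotype i with the SNP minor allele
        extended.append(p_i * (1 - q_i))  # haplotype i with the SNP major allele
    return extended

p = [0.5, 0.3, 0.2]
q_on_hap = [0.1, 0.4, 0.0]
ext = extend_haplotype_locus(p, q_on_hap)

# Overall minor allele frequency of the SNP in this population.
q_k = sum(p_i * q_i for p_i, q_i in zip(p, q_on_hap))
print(ext, q_k)
```

The extended frequencies still sum to 1, and summing the entries that carry the minor allele recovers the population-level minor allele frequency.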

## Results

### General formula: arbitrary LD between haplotype locus and SNP

We seek to evaluate $F_{ST}$ on the set of alleles of the extended haplotype locus. We call this quantity $F'_{ST}$. To compute $F'_{ST}$ using Equation 6, we use Equations 8 and 9 to obtain the values of the component quantities $J'_1$, $J'_2$, and $D'$ (the homozygosities and dot product of Equations 4 and 5) for the extended haplotype locus:

$$J'_k = \sum_{i=1}^{I} p_{ki}^2 \left[ q_{ki}^2 + (1 - q_{ki})^2 \right], \tag{12}$$

$$D' = \sum_{i=1}^{I} p_{1i}\, p_{2i} \left[ q_{1i}\, q_{2i} + (1 - q_{1i})(1 - q_{2i}) \right]. \tag{13}$$

Addition of the SNP splits each haplotype into two new alleles, so homozygosity (Equation 12) cannot increase: $J'_k \leq J_k$. For a fixed set of $p_{ki}$ for the haplotype locus in population *k*, equality can occur if and only if for all *i*, $q_{ki}$ is either 0 or 1. This condition is obtained if and only if each haplotype is associated with only a single SNP allele. Otherwise, adding a SNP always decreases homozygosity at the extended haplotype locus compared to the haplotype locus itself. Figure 2, A and B, provides geometric intuition for this result.

The dot product (Equation 13) also cannot increase, as $D' \leq D$. Equality occurs if and only if: (1) for all *i*, $p_{ki} = 0$ for some *k*, or (2) for each *i*, $q_{1i}$ and $q_{2i}$ are both 0 or both 1. In the former case, the alleles of the haplotype locus are each private to a single population. In the latter case, the SNP is partitioned so that each haplotype is associated with a single SNP allele, the same one in both populations. Otherwise, adding the SNP decreases the dot product at the extended haplotype locus. Figure 2, C and D, provides geometric intuition for this result.

Note that if $q = 0$, so that $q_1 = q_2 = 0$, then $q_{ki} = 0$ for all *i*. We then have $J'_k = J_k$, and $D' = D$. In this case, $F'_{ST}$ is equal to the $F_{ST}$ for the initial haplotype locus (Equation 6). Thus, addition of a monomorphic locus does not change $F_{ST}$.

Because $F_{ST}$ (Equation 6) monotonically increases with the homozygosities $J_k$, decreasing homozygosity decreases $F_{ST}$. In contrast, $F_{ST}$ monotonically decreases with the dot product $D$, so decreasing $D$ increases $F_{ST}$. Therefore, it is not immediately evident if modifying $J_1$, $J_2$, and $D$ in the manner of Equations 12 and 13 increases or decreases $F_{ST}$. Whether $F_{ST}$ increases or decreases with the addition of a SNP to a haplotype locus depends on whether the decrease in homozygosity (Equation 12) or the decrease in dot product (Equation 13) has a larger effect on Equation 6.
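Both directions of change can be seen in small hand-built examples. The Python sketch below is illustrative only: the helper names and the two toy configurations are assumptions chosen to make the arithmetic easy to follow, and the statistic is computed from the homozygosities and dot product of the allele frequency vectors.

```python
def extend(p, q):
    """Split haplotype i into (i, minor) and (i, major) with weights q[i] and 1 - q[i]."""
    out = []
    for p_i, q_i in zip(p, q):
        out += [p_i * q_i, p_i * (1 - q_i)]
    return out

def components(p1, p2):
    """Homozygosity of each population and the dot product between them."""
    j1 = sum(x * x for x in p1)
    j2 = sum(x * x for x in p2)
    d = sum(x * y for x, y in zip(p1, p2))
    return j1, j2, d

def fst(p1, p2):
    j1, j2, d = components(p1, p2)
    return (j1 + j2 - 2 * d) / (4 - (j1 + j2 + 2 * d))

def before_after(p1, p2, q1, q2):
    """F_ST at the haplotype locus and at the extended haplotype locus."""
    return fst(p1, p2), fst(extend(p1, q1), extend(p2, q2))

# Identical haplotype frequencies; the SNP is fixed for different alleles.
increase = before_after([0.5, 0.5], [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
# No shared haplotypes; the SNP has frequency 1/2 on every haplotype.
decrease = before_after([1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5])
print(increase, decrease)
```

In the first configuration the value rises from 0 to 1/3 because the dot product drops to zero while homozygosity is unchanged. In the second it falls from 1 to 1/3 because homozygosity halves while the dot product was already zero.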

We can investigate the relative impact of the decreases in $J_1$, $J_2$, and $D$ on the value of $F_{ST}$ by using Equations 12 and 13 in Equation 6 to compute $F'_{ST}$ (Equation 14, with an auxiliary quantity defined in Equation 15). We now proceed to examine Equation 14 in the simplest case, in which the SNP and the haplotype locus are in linkage equilibrium separately in the two populations.

### Special case: linkage equilibrium between haplotype locus and SNP

We focus the remainder of our analysis on the situation in which the SNP is in linkage equilibrium with the haplotype locus. Under this condition of independence, the frequency of the minor allele of the SNP on a particular haplotype *i* in population *k*, $q_{ki}$, is just the population frequency of the minor allele of the SNP in population *k*, $q_k$ (Equation 10).

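Plugging $q_{ki} = q_k$ into Equations 12 and 13 yields

$$J'_k = J_k \left[ q_k^2 + (1 - q_k)^2 \right], \tag{16}$$

$$D' = D \left[ q_1 q_2 + (1 - q_1)(1 - q_2) \right]. \tag{17}$$

If we denote the homozygosity of the SNP in population *k*, $j_k = q_k^2 + (1 - q_k)^2$, and the dot product of the SNP allele frequency vectors in the two populations, $d = q_1 q_2 + (1 - q_1)(1 - q_2)$, then we can write the quantities in Equations 16 and 17 by

$$J'_k = J_k\, j_k, \tag{18}$$

$$D' = D\, d. \tag{19}$$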

Using $J'_k$ and $D'$ from Equations 18 and 19 in Equation 6 yields the special case of Equation 14 in which the SNP is in linkage equilibrium with the haplotype locus:

$$F'_{ST} = \frac{J_1 j_1 + J_2 j_2 - 2Dd}{4 - (J_1 j_1 + J_2 j_2 + 2Dd)}. \tag{20}$$

Thus, adding an independent SNP to a set of existing haplotypes amounts to multiplying the haplotype homozygosities and dot product by the SNP homozygosities and dot product, respectively, and recomputing $F_{ST}$ (Equation 6) using the resulting products. This result also holds if the appended locus has more than two alleles. The general case appears in Appendix B.

Figure 3 provides a schematic of the special case of adding a SNP to a set of haplotypes where the SNP and the haplotypes are in linkage equilibrium.
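The multiplicative shortcut for an independent SNP can be checked against an explicit construction of the extended haplotype frequencies. The sketch below is a minimal Python illustration; the function names and the example frequencies are assumptions.

```python
def components(p1, p2):
    """Homozygosity of each population and the dot product between them."""
    j1 = sum(x * x for x in p1)
    j2 = sum(x * x for x in p2)
    d = sum(x * y for x, y in zip(p1, p2))
    return j1, j2, d

def fst(j1, j2, d):
    return (j1 + j2 - 2 * d) / (4 - (j1 + j2 + 2 * d))

def fst_extended_le(J1, J2, D, q1, q2):
    """Multiply haplotype components by SNP components, then recompute F_ST."""
    j1, j2, d = components([q1, 1 - q1], [q2, 1 - q2])
    return fst(J1 * j1, J2 * j2, D * d)

def extend_le(p, q_k):
    """Extended allele frequencies when every haplotype carries the SNP at frequency q_k."""
    out = []
    for p_i in p:
        out += [p_i * q_k, p_i * (1 - q_k)]
    return out

p1, p2 = [0.6, 0.3, 0.1], [0.2, 0.3, 0.5]  # hypothetical haplotype frequencies
q1, q2 = 0.1, 0.4                          # hypothetical SNP minor allele frequencies

shortcut = fst_extended_le(*components(p1, p2), q1, q2)
explicit = fst(*components(extend_le(p1, q1), extend_le(p2, q2)))
print(shortcut, explicit)
```

The two routes give the same value, which in this example is lower than the value at the original haplotype locus.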

#### Subcase: the SNP has the same minor allele frequency in the two populations

We now consider a series of further constraints on the alleles. First, we consider an independent SNP that is not differentiated between the two populations. This procedure is equivalent to taking all haplotypes and labeling them with two different labels in the same proportions in both populations. It might be expected to decrease $F_{ST}$, because within-population diversity increases but haplotypes are not split differently between the two populations.

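If the SNP has identical minor allele frequency in the two populations, then $q_1 = q_2 = q$, with $0 \leq q \leq \tfrac{1}{2}$. Inserting $q_1 = q_2 = q$ into Equations 16 and 17 and applying Equation 6 yields

$$F'_{ST} = \frac{J_1 + J_2 - 2D}{\dfrac{4}{1 - 2q(1 - q)} - (J_1 + J_2 + 2D)}. \tag{21}$$

Equation 21 also follows from Equation 20, noting that for this case, $j_1 = j_2 = d = 1 - 2q(1 - q)$.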

The constant 4 in the denominator of Equation 21 is divided by a quantity that is at most 1, with equality only in the monomorphic case of $q = 0$. Hence, the denominator of Equation 21 is always greater than or equal to that of Equation 6. Thus, the addition of a polymorphic SNP with the same minor allele frequency in the two populations always decreases $F_{ST}$, provided that $F_{ST}$ at the haplotype locus is positive.

The function in Equation 21 decreases monotonically with increasing minor allele frequency *q* (Figure 4). Considering all *q*, the maximal $F'_{ST}$ occurs at $q = 0$ and the minimum occurs at $q = \tfrac{1}{2}$.
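The monotone decline for an undifferentiated SNP can be checked numerically. The following Python sketch evaluates the extended-locus statistic on a grid of minor allele frequencies; the function name and the haplotype-locus values (homozygosities 0.46 and 0.38, dot product 0.26) are hypothetical.

```python
def fst_same_maf(J1, J2, D, q):
    """F_ST after adding an independent SNP with minor allele frequency q in both populations."""
    c = 1 - 2 * q * (1 - q)  # SNP homozygosity, which equals the SNP dot product here
    return (J1 + J2 - 2 * D) / (4 / c - (J1 + J2 + 2 * D))

# q = 0.00, 0.05, ..., 0.50
values = [fst_same_maf(0.46, 0.38, 0.26, step / 20) for step in range(11)]
print(values[0], values[-1])
```

At `q = 0` the function returns the value for the original haplotype locus, and each later grid point gives a smaller value.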

#### Subcase: the SNP minor allele occurs only in one population

We now consider the subcase in which the SNP minor allele is private to one population, assuming $q_1 = 0$ without loss of generality. The SNP splits some haplotypes into distinct new haplotypes in population 2 only, reducing allele sharing between populations. Therefore, unlike in the previous case in which adding a SNP always decreases $F_{ST}$, this case might be expected to increase $F_{ST}$.

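Inserting $q_1 = 0$ and $q_2 = 2q$ into Equations 16 and 17, and applying Equation 6, yields

$$F'_{ST} = \frac{J_1 + J_2\,(1 - 4q + 8q^2) - 2D\,(1 - 2q)}{4 - \left[ J_1 + J_2\,(1 - 4q + 8q^2) + 2D\,(1 - 2q) \right]}. \tag{22}$$

Equation 22 can also be derived from Equation 20, inserting $j_1 = 1$, $j_2 = 1 - 4q + 8q^2$, and $d = 1 - 2q$.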

The influence on $F'_{ST}$ (Equation 22) of the SNP minor allele frequency *q* depends on the value of the dot product $D$. If $D = 0$, then the two populations share no haplotypes; they are maximally diverged at the haplotype locus. In this case, $F'_{ST}$ becomes:

$$F'_{ST} = \frac{J_1 + J_2\,(1 - 4q + 8q^2)}{4 - J_1 - J_2\,(1 - 4q + 8q^2)}. \tag{23}$$

The function in Equation 23 is symmetric in *q* across $q = \tfrac{1}{4}$, as for each *a*, $0 \leq a \leq \tfrac{1}{4}$, $F'_{ST}(\tfrac{1}{4} - a) = F'_{ST}(\tfrac{1}{4} + a)$. It is minimized at $q = \tfrac{1}{4}$, and maximized at $q = 0$ and $q = \tfrac{1}{2}$ (Figure 5A). The maximum value is the value of haplotype $F_{ST}$ prior to the addition of a SNP and the minimum is $(2J_1 + J_2)/(8 - 2J_1 - J_2)$. Thus, if the populations are maximally diverged at the haplotype locus in the sense that they share no haplotypes, then adding a SNP whose minor allele appears in only one population always decreases $F_{ST}$, with two exceptions. If the SNP is monomorphic in each population, with either $q = 0$ or $q = \tfrac{1}{2}$, then the value remains the same.

If $D > 0$ and we disregard the case of a monomorphic haplotype locus with $J_1 = J_2 = D = 1$, then the two populations share at least one haplotype and therefore admit the possibility of increased divergence through decreased allele sharing. To understand the effect of the minor allele frequency (*q*) on whether $F_{ST}$ increases or decreases, we examine the derivative of Equation 22 and assess the monotonicity of $F'_{ST}$ with increasing *q*.

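From Appendix C, for fixed $J_1$, $J_2$, and $D$, $F'_{ST}$ has a critical point in the permissible region for *q* if and only if the root $q^*$ of the derivative satisfies $0 \leq q^* \leq \tfrac{1}{2}$, where

$$q^* = \frac{-2J_2(1 - D) + \sqrt{4J_2^2(1 - D)^2 + 2DJ_2\left[ 2J_2 - D(2 - J_1 + J_2) \right]}}{4DJ_2}. \tag{24}$$

We find that $q^* \geq 0$ if

$$D\,(2 - J_1 + J_2) \leq 2J_2, \tag{25}$$

and that $q^* < \tfrac{1}{2}$ if

$$2J_2 > -D\,(2 - J_1 - J_2). \tag{26}$$

Equation 26 always holds, as its left-hand side is positive and its right-hand side is negative.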

If Equation 25 holds, then we can see that the critical point is a local minimum: owing to Equation 25, at $q = 0$, the numerator of $dF'_{ST}/dq$ (Equation 39), and hence the derivative itself, is less than or equal to 0. Hence, if Equation 25 holds, then $F'_{ST}$ decreases as *q* increases from 0 to $q^*$ and increases as *q* increases from $q^*$ to $\tfrac{1}{2}$. If Equation 25 fails, then the derivative has positive numerator at $q = 0$, and no critical points occur in $[0, \tfrac{1}{2}]$. $F'_{ST}$ then increases with *q* on $[0, \tfrac{1}{2}]$.

The behavior of Equation 22 as a function of *q* appears in Figure 5. In Figure 5A, $J_1$ and $J_2$ are held fixed, and $D$ ranges over its permissible space from 0 to 0.5 (Equation 7). Equation 25 is always satisfied. As $D$ increases, allele sharing between populations increases, and the range of *q* at which the population-specific SNP increases $F_{ST}$ by decreasing allele sharing expands in turn.

In Figure 5B, $J_1$ and $D$ are held fixed, and $J_2$ ranges from 0.2 to 1. Equation 7 is always satisfied for these values of $J_2$. Equation 25 is satisfied for all $J_2$ values considered, except 0.2. For the $J_1$, $J_2$, and $D$ shown, except at $J_2 = 0.2$, $F'_{ST}$ (Equation 22) has a local minimum at $q^*$ (Equation 24). For $J_2 = 0.2$, Equation 25 is not satisfied, and $F'_{ST}$ increases monotonically with increasing *q*. As $J_2$ increases from 0.2 to 1 for fixed $J_1$ and $D$, the range of minor allele frequencies *q* for which an added population-specific allele increases $F_{ST}$ gets smaller.

In summary, the effect of adding a private SNP depends on *q*. For large *q*, $F_{ST}$ increases. For small *q*, $F_{ST}$ only increases if the haplotype locus has large dot product $D$ (Figure 5A) or if the population with the minor allele has low homozygosity at the haplotype locus (Figure 5B).
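The dependence on *q* for a private SNP can be explored numerically without the closed-form critical point. The Python sketch below evaluates the extended-locus statistic for a SNP that is absent from population 1 and locates its minimum on a grid. Function names, the grid resolution, and the haplotype-locus values are assumptions for illustration.

```python
def fst_private_snp(J1, J2, D, q):
    """F_ST after adding an independent SNP with frequency 0 in population 1 and 2q in population 2."""
    j2 = 1 - 4 * q * (1 - 2 * q)  # SNP homozygosity in population 2
    d = 1 - 2 * q                 # SNP dot product (SNP homozygosity in population 1 is 1)
    num = J1 + J2 * j2 - 2 * D * d
    den = 4 - (J1 + J2 * j2 + 2 * D * d)
    return num / den

def argmin_on_grid(J1, J2, D, steps=500):
    """Grid search for the q in [0, 1/2] that minimizes the statistic."""
    grid = [0.5 * s / steps for s in range(steps + 1)]
    return min(grid, key=lambda q: fst_private_snp(J1, J2, D, q))

no_sharing = argmin_on_grid(0.5, 0.5, 0.0)     # populations share no haplotypes
some_sharing = argmin_on_grid(0.5, 0.5, 0.25)  # partial haplotype sharing
print(no_sharing, some_sharing)
```

With no shared haplotypes the minimum sits at `q = 0.25`, and with partial sharing it moves toward smaller *q*, so a wider range of *q* raises the statistic above its starting value.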

#### Subcase: multiple SNPs with the same allele frequencies

The third subcase we consider is the construction of haplotypes from independent SNPs, with equivalent frequencies for all SNPs. Therefore, each SNP has the same values for the SNP homozygosities $j_1$ and $j_2$ and the SNP dot product $d$. For one of these SNPs, the “haplotype” $F_{ST}$ is $(j_1 + j_2 - 2d)/[4 - (j_1 + j_2 + 2d)]$ (Equation 6). If we now add another independent SNP with the same properties, then using Equation 20, we obtain

$$F_{ST}^{(2)} = \frac{j_1^2 + j_2^2 - 2d^2}{4 - (j_1^2 + j_2^2 + 2d^2)}. \tag{27}$$

Figure 3 provides a schematic of this case for one of the populations *k*, considering a SNP with minor allele frequency $q_k$. By induction, $F_{ST}$ for the extended haplotype locus constructed by concatenation of *n* independent SNPs with the same allele frequencies is

$$F_{ST}^{(n)} = \frac{j_1^n + j_2^n - 2d^n}{4 - (j_1^n + j_2^n + 2d^n)}. \tag{28}$$

We plot Equation 28 as a function of *n* with $j_1$, $j_2$, and $d$ fixed. In Figure 6A, $F_{ST}^{(n)}$ appears as a function of *n* for fixed $j_1$ and $j_2$ at each of several values of $d$. For each $d$, a decline occurs in $F_{ST}^{(n)}$ with increasing *n*. Figure 6B plots $F_{ST}^{(n)}$ as a function of *n* for fixed $j_1$ and $d$ at each of several $j_2$ values. As in Figure 6A, for each $j_2$, $F_{ST}^{(n)}$ decreases with increasing *n*.

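One special case has $q_1 = 0$ and $j_1 = 1$, so that population 1 is monomorphic for all SNPs. The SNPs are polymorphic in population 2, with $j_2 < 1$. Then $d < 1$, and

$$F_{ST}^{(n)} = \frac{1 + j_2^n - 2d^n}{3 - j_2^n - 2d^n} \to \frac{1}{3}, \tag{29}$$

with the limit taken as $n \to \infty$. The same limit occurs for $j_2 = 1$ and $j_1 < 1$ (Figure 6B, $j_2 = 1$). Otherwise, if both $j_1 < 1$ and $j_2 < 1$, then every term raised to the *n*th power in Equation 28 is less than 1, and $F_{ST}^{(n)} \to 0$ as $n \to \infty$ (Figure 6).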

We can conclude that if haplotypes are constructed by concatenating SNPs that all have the same allele frequencies, then $F_{ST}$ generally decreases with haplotype length. It has limit 0 in most cases and limit $\tfrac{1}{3}$ if one population is monomorphic for all SNPs.
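The two limiting behaviors can be illustrated with a short Python sketch that raises the SNP homozygosities and dot product to the *n*th power. The function name and the example allele frequencies are hypothetical, and the SNPs are assumed independent with identical frequencies.

```python
def fst_n_identical_snps(q1, q2, n):
    """F_ST for a haplotype locus built from n independent SNPs, each with
    minor allele frequency q1 in population 1 and q2 in population 2."""
    j1 = q1 ** 2 + (1 - q1) ** 2        # SNP homozygosity, population 1
    j2 = q2 ** 2 + (1 - q2) ** 2        # SNP homozygosity, population 2
    d = q1 * q2 + (1 - q1) * (1 - q2)   # SNP dot product
    s = j1 ** n + j2 ** n
    return (s - 2 * d ** n) / (4 - (s + 2 * d ** n))

# Both populations polymorphic: values shrink toward 0.
both_polymorphic = [fst_n_identical_snps(0.1, 0.4, n) for n in (1, 2, 10, 200)]
# Population 1 monomorphic: values approach 1/3.
one_monomorphic = [fst_n_identical_snps(0.0, 0.3, n) for n in (1, 2, 10, 200)]
print(both_polymorphic, one_monomorphic)
```

With 200 SNPs the first sequence is numerically indistinguishable from 0 and the second from 1/3.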

## Application to data

To evaluate the empirical applicability of our theoretical results, we examined $F_{ST}$ calculated on human SNP haplotypes. We used phased SNP data from Pemberton *et al.* (2012); the data contain 938 individuals from 53 populations from the Human Genome Diversity Panel (HGDP), with a total of 640,034 genome-wide autosomal SNPs.

Our theoretical results are applicable to $F_{ST}$ calculated in pairs of populations. For this empirical application, we treated the seven geographical regions in the HGDP data set (Africa, Europe, Middle East, Central and South Asia, East Asia, Oceania, and America) as “populations.” To obtain a set of haplotypes for a region, we pooled all sampled haplotypes from every individual in every population in that region.

### Haplotype construction

We constructed haplotypes from collections of *n* SNPs obtained in two different ways, choosing windows of up to 30 SNPs. First, we drew 10,000 sets of random SNPs without replacement from the entire set of SNPs, requiring all pairs of SNPs in a set to be separated by at least 5 Mb or to be located on different chromosomes. Each “haplotype” started with the first SNP in the set, and subsequent “haplotypes” were constructed by sequentially appending the remaining SNPs in the set.

The purpose of this first “random SNPs” procedure was to create “haplotypes” from SNPs that were not likely to be physically linked, a situation that accords with the assumptions of our theoretical computations. The value of 30 SNPs was chosen to be large enough that most haplotypes in a data set were likely to be distinct: for instance, at $n = 30$, the first random SNP set for the Europe/East Asia pair had 607 unique haplotypes in a sample of size 774 (387 individuals). In this circumstance, $F_{ST}$ is effectively zero (Figure 7A). The distance threshold of 5 Mb was chosen to exceed the scale of tens to hundreds of kilobases for LD decay in humans (Patil *et al.* 2001; Gabriel *et al.* 2002; Wall and Pritchard 2003).

In our second “SNP window” approach for constructing haplotypes, we randomly chose 10,000 starting SNPs without replacement, each with enough SNPs between it and the chromosome end to complete a window, as measured in order of increasing SNP position. Each haplotype started with the first SNP in the set, and subsequent haplotypes were constructed by sequentially appending remaining SNPs in the set. The purpose of this procedure was to test the theory on a situation in which the assumption of SNP independence is violated due to likely LD of neighboring SNPs.

### General observations

Figure 7A plots the observed $F_{ST}$ between Europe and East Asia, regions with relatively large samples in the data set (157 and 230 individuals, respectively), as a function of haplotype length. The decay with haplotype length is faster for sets of random SNPs than for neighboring windows of SNPs. This result accords with the fact that LD in SNP windows maintains haplotype homozygosity over larger numbers of SNPs than in the case of the largely independent random SNP sets. We observe that the mean $F_{ST}$ across SNP windows is greatest at a small number of SNPs, after which it decays. This pattern accords with the claim that as haplotypes increase in length, haplotype homozygosity decreases and the maximal $F_{ST}$ in terms of homozygosity decreases, so that empirical $F_{ST}$ values decrease.

To evaluate the agreement of our theoretical results with observed values, for each haplotype of length $n \geq 2$ SNPs, we used Equation 20 to compute a predicted $F_{ST}$ from the haplotype frequencies of the nested set of $n - 1$ SNPs and the allele frequencies of the *n*th SNP. The theoretical $F_{ST}$ produces the same qualitative decay with haplotype length and the same peak at a small number of SNPs as was seen for the empirical values (Figure 7B).

For each SNP set and haplotype length, we computed the ratio of the difference between observed and theoretically predicted values of $F_{ST}$ and the theoretical value, a quantity we term “rescaled error.” For a particular SNP set and haplotype length, rescaled error is:

$$\text{rescaled error} = \frac{F_{ST}^{\text{observed}} - F_{ST}^{\text{theoretical}}}{F_{ST}^{\text{theoretical}}}. \tag{30}$$

Values of rescaled error (Equation 30) as a function of haplotype length for the SNP sets in Figure 7, A and B, appear in Figure 7C. The rescaled error is small for small *n*, increasing with *n*. Our theoretical predictions are therefore more accurate for short haplotypes. Owing to the generally low $F_{ST}$ values recorded for longer haplotypes (Figure 7A), the absolute magnitude of the poorer predictions for longer haplotypes is relatively small. For small *n*, the prediction is more accurate for random SNP sets than for SNP windows. Interestingly, for larger *n*, the prediction is instead more accurate for the neighboring SNP windows, despite the fact that the prediction is designed for SNP sets with no LD. This change in accuracy might be explained by the fact that SNP windows of a particular length produce $F_{ST}$ values similar to those of random SNP sets of smaller length (Figure 7A), so that our predictions remain reasonably accurate for longer SNP windows than in the case of random SNP sets.
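The observed value, the independence-based prediction, and the rescaled error can be sketched on toy phased data as follows. This is a minimal Python illustration, not the pipeline used for the HGDP analysis: the function names, the 0/1 allele coding, and the four-haplotype samples are assumptions, and the rescaled error is taken here as observed minus predicted, divided by predicted.

```python
from collections import Counter

def components(p1, p2):
    """Homozygosity of each population and the dot product between them."""
    j1 = sum(x * x for x in p1)
    j2 = sum(x * x for x in p2)
    d = sum(x * y for x, y in zip(p1, p2))
    return j1, j2, d

def fst(j1, j2, d):
    return (j1 + j2 - 2 * d) / (4 - (j1 + j2 + 2 * d))

def hap_freqs(pop1, pop2, n):
    """Frequencies of the distinct haplotypes over the first n SNPs in each population."""
    h1 = [tuple(h[:n]) for h in pop1]
    h2 = [tuple(h[:n]) for h in pop2]
    alleles = sorted(set(h1) | set(h2))
    c1, c2 = Counter(h1), Counter(h2)
    return ([c1[a] / len(h1) for a in alleles],
            [c2[a] / len(h2) for a in alleles])

def observed_fst(pop1, pop2, n):
    return fst(*components(*hap_freqs(pop1, pop2, n)))

def predicted_fst(pop1, pop2, n):
    """Prediction for length n (n >= 2) from the (n-1)-SNP haplotypes and the
    nth SNP, treating that SNP as independent of the shorter haplotype."""
    J1, J2, D = components(*hap_freqs(pop1, pop2, n - 1))
    q1 = sum(h[n - 1] for h in pop1) / len(pop1)  # frequency of allele 1 at SNP n
    q2 = sum(h[n - 1] for h in pop2) / len(pop2)
    j1, j2, d = components([q1, 1 - q1], [q2, 1 - q2])
    return fst(J1 * j1, J2 * j2, D * d)

def rescaled_error(observed, predicted):
    return (observed - predicted) / predicted

# Toy phased haplotypes in which the second SNP is independent of the first.
pop1 = [(0, 0), (0, 1), (1, 0), (1, 1)]
pop2 = [(0, 0), (0, 0), (0, 1), (0, 1)]
obs = observed_fst(pop1, pop2, 2)
pred = predicted_fst(pop1, pop2, 2)
print(obs, pred, rescaled_error(obs, pred))
```

Because the second SNP is exactly independent of the first in both toy samples, the prediction matches the observed value and the rescaled error is zero. Replacing `pop1` with haplotypes in complete LD, such as `[(0, 0), (0, 0), (1, 1), (1, 1)]`, makes the observed value exceed the prediction.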

### Correlation between observations and theory

To study the change in $F_{ST}$ as SNPs are added to a haplotype locus, we considered the value of $F_{ST}$ with increasing haplotype length for each collection of SNPs. For each collection of SNPs, random SNPs or SNP windows, we obtained a “trajectory” of $F_{ST}$: the values of $F_{ST}$ as a function of the number of SNPs used to construct haplotypes for each *n* from 1 to 30. We then compared the observed $F_{ST}$ for haplotypes of length *n* to the theoretical $F_{ST}$ obtained by using Equation 20 on the set of haplotypes with length $n - 1$ together with the *n*th SNP.

In each trajectory, we also compared the observed $F_{ST}$ for haplotypes of length *n* to a value of $F_{ST}$ drawn with replacement from the set of all observed values of $F_{ST}$ for haplotypes of length *n*. These random draws were designed to serve as a null model of $F_{ST}$ as a function of haplotype length, where the value of $F_{ST}$ depends only on haplotype length without regard to values of $F_{ST}$ for previous entrants in the trajectory from 1 to $n - 1$.

Table 2 displays correlation coefficients between observed $F_{ST}$ values and both theoretical values obtained from Equation 20 and null model values drawn from the empirical distribution of $F_{ST}$. The correlations are computed from 290,000 pairs of values (10,000 SNP sets and 29 values per SNP set, $n = 2, \ldots, 30$). The value of $n = 1$ was not used because $F'_{ST}$ in Equation 20 only applies for $n \geq 2$. The correlations between observed and theoretical values range from 0.96 to 1.00 for random SNP sets, and from 0.94 to 0.98 for SNP windows, compared to 0.24 to 0.47 and 0.07 to 0.23 for the correlation between observed and null values for random SNP sets and SNP windows, respectively.

Supplemental Material, Figure S1 plots representative results from Table 2 for the Europe/East Asia pair of regions. As expected, theoretical values of $F_{ST}$ match observed values more closely for random SNP sets than for SNP windows. However, the SNP windows produce results that are comparable to the random SNP results, indicating that our theoretical results are reasonable in situations in which the assumption of linkage equilibrium does not hold. For both methods of haplotype construction, the theoretical results dramatically outperform the null model results, indicating that the theory predicts substantial additional information about haplotype-based $F_{ST}$ compared with null predictions.

### Trajectories as observations

For each collection of SNPs, considering the 29 values from $n = 2$ to 30, we fit a linear regression of observed $F_{ST}$ on the theoretical prediction from Equation 20 and computed the corresponding $R^2$ statistic for goodness-of-fit. The purpose of this analysis was to treat each trajectory as a separate observation with its own $R^2$, in contrast to grouping them as in Table 2 and Figure S1.

For the Europe/East Asia pair, Figure S2 plots $R^2$ distributions across 10,000 trajectories for theoretical and null models, for both random SNPs and SNP windows. The fit of the theoretical values is substantially closer than that of the null values. The fit is also closer for random SNP trajectories than for window trajectories (Figure S2).

Figure 8 displays the median $R^2$ trajectories for each category of result in Figure S2 for the Europe/East Asia pair. Figure 8 reveals a distinction between the null and theoretical results; the theoretical model (Figure 8, A and C) closely matches observations for shorter haplotypes but consistently underestimates the value of $F_{ST}$ for longer haplotypes. In contrast, the null model (Figure 8, B and D) produces a poor fit for shorter haplotypes but is less consistently biased for longer haplotypes. This observation provides more detail about the observation in Figure 7 that rescaled error (Equation 30) is higher for longer haplotypes than for shorter haplotypes; in particular, the longer-haplotype $F_{ST}$ is underestimated.

Figure 9 plots example $F_{ST}$ trajectories as a function of the frequency *M* of the most frequent haplotype instead of haplotype length, together with the upper bound on $F_{ST}$ given *M* (Jakobsson *et al.* 2013). The haplotype locus starts with one SNP, with major allele frequency at least $\tfrac{1}{2}$. As more SNPs are added, *M* either stays the same (if one SNP allele does not cooccur with the previous most frequent haplotype) or decreases (if both SNP alleles cooccur with the previous most frequent haplotype). Increasing haplotype length first increases the upper bound on $F_{ST}$, increasing the potential for an increase in $F_{ST}$ to occur upon addition of a SNP. Once *M* decreases below $\tfrac{1}{2}$, increasing the haplotype length decreases the upper bound, generally forcing $F_{ST}$ to decrease. In aggregate, these properties of the upper bound of $F_{ST}$ as a function of *M* can explain the tendency of $F_{ST}$ to increase upon addition of the first few SNPs before decreasing with more SNPs, as seen in Figure 7A.

### Error and LD

We expected that the primary cause of deviation of observed $F_{ST}$ values from theoretical values was greater LD in SNP windows than in random SNP sets. LD has been detected in these SNP data for nearby SNPs, decaying quickly so that it is unexpected for random SNP pairs [see Jakobsson *et al.* (2008), Figure 2 and Li *et al.* (2008), Figure 3].

To assess the effect of LD on rescaled error, Figure 10 plots rescaled error (Equation 30) against a multiallelic measure of LD (Hedrick 1987) for European SNP-haplotype pairs. This quantity, which we term $\mathrm{LD}_{\mathrm{Eur}}$, measures the deviation of extended haplotype allele frequencies from linkage equilibrium, and is plotted for each SNP-haplotype pair. For each SNP set, for each *n* from 2 to 30, we computed $\mathrm{LD}_{\mathrm{Eur}}$ between the haplotype locus of length $n - 1$ and the *n*th SNP. For East Asia, we denote the quantity analogous to $\mathrm{LD}_{\mathrm{Eur}}$ in Europeans by $\mathrm{LD}_{\mathrm{EAs}}$.

Figure 10, A and B, which consider random SNP sets and SNP windows, respectively, are split by quartile of values of $\mathrm{LD}_{\mathrm{EAs}}$. Increasing LD in one or both populations increases the rescaled error. This pattern is clear for SNP windows (Figure 10B), for which increasing $\mathrm{LD}_{\mathrm{Eur}}$ (within a plot) and $\mathrm{LD}_{\mathrm{EAs}}$ (moving left to right across plots) produce greater rescaled error. As LD increases, the model becomes less accurate, so that rescaled error increases.

The magnitude of the influence of LD on rescaled error is relatively small. When we separate SNP windows into quartiles by the physical distance between SNPs and *n*, representing four quartiles expected to have different LD levels, we see little difference among quartiles in the rescaled error (Figure S3).

### Data availability

See Pemberton *et al.* (2012) for the data used in this study. Supplemental material available at FigShare: https://doi.org/10.25386/genetics.8792594.

## Discussion

We have derived the value of $F_{ST}$ that is obtained when a haplotype locus is augmented by a SNP (Figure 1B), focusing on the situation in which the SNP is in linkage equilibrium with the haplotype locus. Three special cases we studied theoretically suggest a general pattern: a SNP with the same allele frequencies in both populations (Figure 4), a SNP whose minor allele appears only in one of the populations (Figure 5), and haplotype loci that are constructed from SNPs that all have the same allele frequencies (Figure 6). In each, $F_{ST}$ is likely to decrease when a SNP is added to a haplotype locus, even if the SNP itself has a high value of $F_{ST}$. Our empirical results using human SNP data corroborate this conclusion (Figure 7A).

The relationship between $F_{ST}$ and the within-population homozygosities and dot product of allele frequencies between populations assists in understanding the effect on $F_{ST}$ of adding a SNP to a haplotype locus. $F_{ST}$ decreases both by a reduction in the within-population homozygosities and by an increase in the between-population allele sharing. Adding a SNP to a haplotype locus necessarily decreases homozygosities within populations by subdividing each allele of the haplotype locus. The addition of the SNP might or might not decrease between-population allele sharing; if it does decrease allele sharing, then it might not do so sufficiently to overcome decreases in homozygosity, and $F_{ST}$ might still decrease. We have found that a decrease in allele sharing through differing SNP allele frequencies in the two populations only increases $F_{ST}$ compared to the haplotype locus if the SNP allele frequencies differ greatly between the two populations, the two populations are very similar in their frequencies at the haplotype locus, or they have high diversity at the haplotype locus.

In our trajectories, as more SNPs are added to SNP windows, $F_{ST}$ approaches 0. Typically, the first few SNPs enable an increase in $F_{ST}$ as the frequency of the most frequent haplotype across the population pair decreases toward $\frac{1}{2}$, the value that permits the greatest $F_{ST}$ (Figure 9). With enough SNPs, the extended haplotype locus becomes too heterozygous within populations for any population divergence information to be gleaned from $F_{ST}$.
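The approach to 0 can be illustrated with a small calculation. This is a sketch under simplifying assumptions, not a reproduction of the empirical trajectories: every appended SNP is taken to have the same allele frequencies `q1` and `q2` in the two populations and to be in linkage equilibrium with the existing haplotype locus, so that each homozygosity and the dot product are multiplied by a constant factor at every step. The two-population form $(J_1 + J_2 - 2D)/(4 - J_1 - J_2 - 2D)$ and the function names are assumptions made for illustration.

```python
def fst_from_summaries(j1, j2, d):
    """Assumed two-population F_ST from homozygosities j1, j2 and dot product d."""
    return (j1 + j2 - 2 * d) / (4 - j1 - j2 - 2 * d)


def fst_trajectory(q1, q2, n_snps):
    """F_ST of a haplotype locus built from 1..n_snps identical SNPs in
    linkage equilibrium. q1 and q2 are the SNP allele frequencies in
    populations 1 and 2 and must not both equal 0 or both equal 1."""
    j1_snp = q1**2 + (1 - q1) ** 2          # SNP homozygosity, population 1
    j2_snp = q2**2 + (1 - q2) ** 2          # SNP homozygosity, population 2
    d_snp = q1 * q2 + (1 - q1) * (1 - q2)   # SNP dot product
    j1 = j2 = d = 1.0  # empty haplotype locus: one allele at frequency 1
    values = []
    for _ in range(n_snps):
        # Under linkage equilibrium, the summaries multiply at each step.
        j1 *= j1_snp
        j2 *= j2_snp
        d *= d_snp
        values.append(fst_from_summaries(j1, j2, d))
    return values


traj = fst_trajectory(0.1, 0.3, 60)
print([round(v, 4) for v in traj[:3]])  # [0.0625, 0.0649, 0.064]
print(traj[-1] < 1e-4)                  # True
```

With these illustrative frequencies the value rises slightly at the second SNP and then declines toward 0, because the numerator shrinks geometrically while the denominator approaches 4.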

Because $F_{ST}$ has a systematic length dependence, a useful data analysis strategy is to report entire “profiles” of $F_{ST}$ in terms of haplotype length rather than to restrict attention to a single length. For example, Figure S4 examines the dependence of $F_{ST}$ on haplotype length for various population pairs. Some of the lines representing different comparisons cross, indicating that the length affects which of a pair of comparisons has a larger $F_{ST}$ value. In other cases, lines have the same relative position irrespective of the length considered. If $F_{ST}$ profiles are computed for multiple population pairs, and the same pairs have larger values across multiple lengths, then relative $F_{ST}$ values can potentially be regarded as robust.
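A profile of this kind could be computed along the following lines. The sketch assumes phased haplotypes stored as 0/1 matrices (rows are sampled chromosomes, columns are SNPs in window order), treats sample frequencies as population frequencies without any sample-size correction, and uses the two-population form $(J_1 + J_2 - 2D)/(4 - J_1 - J_2 - 2D)$. The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter

import numpy as np


def haplotype_fst(h1, h2):
    """F_ST treating each distinct row (haplotype) as an allele."""
    c1 = Counter(map(tuple, h1))
    c2 = Counter(map(tuple, h2))
    n1, n2 = len(h1), len(h2)
    alleles = set(c1) | set(c2)
    j1 = sum((c1[a] / n1) ** 2 for a in alleles)           # homozygosity, pop 1
    j2 = sum((c2[a] / n2) ** 2 for a in alleles)           # homozygosity, pop 2
    d = sum((c1[a] / n1) * (c2[a] / n2) for a in alleles)  # allele sharing
    denom = 4 - j1 - j2 - 2 * d
    # Undefined when both samples are fixed for the same haplotype.
    return (j1 + j2 - 2 * d) / denom if denom > 0 else float("nan")


def fst_profile(pop1, pop2):
    """F_ST for haplotype loci made of the first 1, 2, ..., L SNPs."""
    pop1 = np.asarray(pop1)
    pop2 = np.asarray(pop2)
    n_snps = pop1.shape[1]
    return [haplotype_fst(pop1[:, :n], pop2[:, :n]) for n in range(1, n_snps + 1)]


# Toy data: four chromosomes per population, two SNPs.
pop_a = [[0, 0], [0, 0], [0, 1], [1, 1]]
pop_b = [[1, 0], [1, 0], [1, 1], [0, 1]]
print([round(v, 4) for v in fst_profile(pop_a, pop_b)])  # [0.25, 0.1667]
```

Profiles from several population pairs can then be compared length by length to check whether their relative ordering is stable.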

This study augments recent attempts to analyze how population-genetic statistics change as the unit of analysis extends from a single SNP to a haplotype locus (*e.g.*, Morin *et al.* 2009; Gattepaille and Jakobsson 2012; Duforet-Frebourg *et al.* 2015; García-Fernández *et al.* 2018). In particular, our approach follows Gattepaille and Jakobsson (2012), who compared a statistic for ancestry information for two loci combined and treated as a single “haplotype locus” to the information content of the loci individually. We show how a two-locus framework can be used iteratively to examine haplotype loci on larger numbers of SNPs.

We have considered a particular form of $F_{ST}$, following recent work on the dependence of $F_{ST}$ on allele frequencies (Jakobsson *et al.* 2013; Edge and Rosenberg 2014; Alcala and Rosenberg 2017), by treating $F_{ST}$ as a function computed from allele frequencies rather than as a parameter of an evolutionary model. In our perspective, $F_{ST}$ values at different haplotype lengths are not expected to be equal, either numerically or conceptually. In an alternative and widely used perspective in which $F_{ST}$ is treated as an evolutionary parameter (*e.g.*, Holsinger and Weir 2009), haplotype loci of different lengths represent different scales for investigating the same underlying parameter. Thus, haplotype-based methods that consider each locus in the haplotype as part of a sum or average (Excoffier *et al.* 1992; Hudson *et al.* 1992) are expected to be less sensitive to haplotype length than in our case, in which haplotype loci of increasing lengths can be viewed as loci with an increasing mutation rate due to the larger number of SNP sites at which mutations can occur.

We note that although the scenario of interest assumes that the appended locus is biallelic, much of our theoretical analysis applies if the locus is multiallelic (Appendix B). Our main theoretical analysis focuses on the situation in which an added SNP is in linkage equilibrium with the haplotype locus (Equation 20). Indeed, we have found that the theory is least accurate when substantial LD is present (Figure 10). However, our more general theoretical result (Equation 14) does not assume linkage equilibrium and could be used for explicit linkage models that permit LD. Theoretical predictions of the values of the SNP allele frequencies for specific haplotypes under these alternative models could be used in the same way that we used the assumption, in the case of linkage equilibrium, that the SNP allele frequency is the same on every haplotype.

The assumption of linkage equilibrium between the SNP and haplotype locus nevertheless produces reasonably accurate predictions about $F_{ST}$ even under circumstances in which linkage equilibrium is not expected (Figure 7, Figure 8, Figure 10, Table 2, and Figures S1–S3). Although the LD level might be smaller in the data we examined than in dense DNA sequence data, the general robustness to the presence of some LD suggests that our results can apply in approximate form to the general situations we have studied in data from human populations.

## Acknowledgments

Support was provided by National Institutes of Health grant R01 HG005855, National Science Foundation grant DBI-1458059, and a Graduate Fellowship from the Stanford Center for Computational, Evolutionary, and Human Genomics.

## Appendix

### Appendix A: Bounds on

Here we derive the upper bound on for a locus with frequencies and in populations 1 and 2 (Equation 5), when and (Equation 4) are treated as fixed quantities in , permitting the number of distinct alleles at the locus to be arbitrarily large. Because we are concerned with nonnegative allele frequencies, .

By the Cauchy–Schwarz inequality, , with equality if and only if one allele frequency distribution is a scalar multiple of the other. Because allele frequency distributions must sum to 1, the equality occurs if and only if the two allele frequency distributions are identical, with for all *i*. This condition implies .

If , then no pair of allele frequency distributions satisfies . However, we can construct a pair of allele frequency distributions, each with a finite number of alleles, such that is arbitrarily close to .

Choose , and . Suppose and . Let *K* be an integer with(31)Then , and .

Consider the allele frequency distributions defined by(32)where *i* ranges from 2 to , and(33)Note that , so that when we add to the inequality , rearrange terms, and take the square root, we obtain that . Because , we have for all . Analogously, for all . Thus, alleles are placed in descending order of frequency in both populations.

It is straightforward to calculate , and . The dot product between the two allele frequency distributions exceeds the product , so that:

(34)Choose *K* large enough that(35)From Equation 33, solving for *K*, we find that for *K* exceeding the root , . Similarly, , so that . Thus, given in , allele frequency distributions exist for which is equal to or arbitrarily close to , with equality possible if and only if .

The case in which one but not the other homozygosity equals 1 remains. For and , we set . We set and as in Equation 32 for , with as in Equation 33, and with . Then . A similar argument holds for and .

### Appendix B: Multiallelic Loci with Linkage Equilibrium

Here, we relax the requirement that the appended “SNP” locus must be biallelic. We show that under linkage equilibrium between the appended locus and the haplotype locus, Equations 18–20 continue to hold for multiallelic loci. Suppose, as before, that there are *I* distinct haplotype alleles, and distinct alleles of the additional multiallelic locus. In population *k*, we can write the frequency of the extended haplotype allele that contains haplotype *i* and additional multiallelic locus allele *m* analogously to Equations 8 and 9 as(36)where is the frequency of haplotype allele *i* in population *k* and is the frequency of multiallelic locus allele *m* on haplotype allele *i* in population *k*.

Under linkage equilibrium, . We can then proceed, as with Equations 12 and 13, to obtain and , as in Equations 18 and 19:(37)(38)where and are the homozygosity in population *k* and the allele frequency dot product, respectively, of the additional multiallelic locus.

Using and from Equations 37 and 38 in Equation 6 produces Equation 20.
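The multiplicative structure described in this appendix can be checked numerically. The sketch below assumes that, under linkage equilibrium, each extended haplotype frequency is the product of a haplotype allele frequency and an appended-locus allele frequency, and it verifies that the homozygosities and the dot product of the extended locus then factor into the corresponding quantities for the two component loci. The frequency vectors and function names are arbitrary illustrations, not values or notation from the study.

```python
import numpy as np


def summaries(p1, p2):
    """Homozygosity in each population and the between-population dot product."""
    return float(p1 @ p1), float(p2 @ p2), float(p1 @ p2)


def extend_under_le(hap, locus):
    """Extended haplotype frequencies under linkage equilibrium: every
    (haplotype allele, appended-locus allele) pair has product frequency."""
    return np.outer(hap, locus).ravel()


# Illustrative frequencies: three haplotype alleles, three appended-locus alleles.
hap1, hap2 = np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.3, 0.5])
loc1, loc2 = np.array([0.6, 0.3, 0.1]), np.array([0.1, 0.4, 0.5])

J1, J2, D = summaries(hap1, hap2)   # haplotype locus
j1, j2, d = summaries(loc1, loc2)   # appended multiallelic locus
E1, E2, ED = summaries(extend_under_le(hap1, loc1), extend_under_le(hap2, loc2))

# The extended-locus summaries equal the products of the component summaries.
product_rule_holds = np.allclose([E1, E2, ED], [J1 * j1, J2 * j2, D * d])
print(product_rule_holds)  # True
```

Nothing in the check depends on the appended locus having exactly two alleles, which is the point of the appendix.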

### Appendix C: Roots of the Derivative in the Case that the Minor Allele of the SNP Occurs Only in One Population and

We use the derivative to determine conditions under which has a critical point in the permissible region for *q*, . Using Equation 22,(39)To find the roots of Equation 39, we first show that there are no discontinuities over the range of *q* with which we are concerned. The quantity in the denominator is negative for : at , its value is , which is negative for a polymorphic locus because , , and cannot simultaneously equal one; at , its value is . As a quadratic with positive leading term, it then has no roots in . The denominator is therefore never zero and Equation 39 has no discontinuities.

Consequently, the roots of Equation 39 are roots of the numerator. As a quadratic in *q*, the numerator of Equation 39 has two roots. One root, termed , appears in Equation 24; the other root subtracts rather than adds the term with the square root, and because it cannot be positive. Hence, if and only if , for fixed , , and , has a critical point in the permissible region for *q*.

## Footnotes

*Communicating editor: G. Coop*

- Received March 6, 2019.
- Accepted June 29, 2019.

- Copyright © 2019 by the Genetics Society of America