## Abstract

We examine the distribution of heterozygous sites in nine European and nine Yoruban individuals whose genomic sequences were made publicly available by Complete Genomics. We show that it is possible to obtain detailed information about inbreeding when a relatively small set of whole-genome sequences is available. Rather than focus on testing for deviations from Hardy–Weinberg genotype frequencies at each site, we analyze the entire distribution of heterozygotes conditioned on the number of copies of the derived (non-chimpanzee) allele. Using Levene’s exact test, we reject Hardy–Weinberg in both populations. We generalized Levene’s distribution to obtain the exact distribution of the number of heterozygous individuals given that every individual has the same inbreeding coefficient, *F*. We estimated *F* to be 0.0026 in Europeans and 0.0005 in Yorubans, but we could also reject the hypothesis that *F* was the same in each individual. We used a composite-likelihood method to estimate *F* in each individual and within each chromosome. Variation in *F* across chromosomes within individuals was too large to be consistent with sampling effects alone. Furthermore, estimates of *F* for each chromosome in different populations were not correlated. Our results show how detailed comparisons of population genomic data can be made to theoretical predictions. The application of methods to the Complete Genomics data set shows that the extent of apparent inbreeding varies across chromosomes and across individuals, and estimates of inbreeding coefficients are subject to unexpected levels of variation, which might be partly accounted for by selection.

DEVIATIONS from Hardy–Weinberg genotypes frequencies (HWF) can have many causes, including population subdivision, natural selection, assortative mating by phenotype, consanguineous mating, or the avoidance of consanguineous mating. In general, population subdivision and either inbreeding or inbreeding avoidance will affect the whole genome while selection and assortative mating will affect only those loci associated with particular phenotypes. Methods for testing for genome-wide and local deviations from HWF are useful for detecting these processes and for detecting evidence of inbreeding depression resulting from excess homozygosity (Charlesworth and Willis 2009). In this article, we develop and apply methods for using whole-genome sequences to detect small deviations from HWF, both locally and globally, and compare detailed theoretical predictions with publically available sequence data for two human populations. Even small deviations from HWF are detectable at individual loci if sample sizes are large enough. Several methods for testing for deviations from HWF have been developed and tested extensively (reviewed by Weir 1996, Chap. 3). The availability of whole-genome sequences from several individuals creates the opportunity to detect systematic deviations from HWF even if sample sizes are small.

In this article, we show that the distribution of the number of heterozygous individuals, stratified by the number of copies of the derived (*i.e.*, non-chimpanzee) allele, indicates significant deviations from HWF. To illustrate the utility of our approach, we examine two sets of nine genomic sequences made available by Complete Genomics (CG) (http://www.completegenomics.com/public-data/). These are high-coverage sequences (51×–89×). The error rate for the methodology used is reported to be low (1 miscalled variant per 100 kb) (Drmanac *et al.* 2010). We compare the distribution of the number of heterozygous sites with the expectations based on the exact distribution derived by Levene (1949) and show that we can reject HWF in both data sets. We then test whether we can fit a model in which each individual is inbred to the same extent and reject that model also. Next, we show that there are small but significant differences in the estimated inbreeding coefficient across individuals. Even with that assumption, however, there is more variation in the estimated inbreeding coefficients for different chromosomes than is consistent with sampling effects. We conclude that simple theoretical models do not fit the data closely and that there are additional sources of variation in estimated inbreeding coefficients across chromosomes that are unaccounted for by simple models. We find that in one of the two populations (CEU) there is a slight effect of distance from coding genes, suggesting a subtle role for selection. High-quality genomic sequences provide a detailed picture of deviations from HWF that warrant further exploration.

## Data

We downloaded the VCF file containing 54 unrelated individuals from Complete Genomics. We did not use any cutoff for minor allele frequencies but we did filter out all triallelic sites. Although we restricted our later analyses to the 18 individuals in the CEU and YRI populations (9 in each), we used the larger data set to filter out sites whose genotypes appeared to be the result of alignment errors of duplicated regions. Such regions would be expected to create sites that are apparently heterozygous in every or nearly every individual, thus creating artifactual deviations from HWF. Supporting information, Figure S1 shows the distribution of the number of heterozygous individuals in the whole data set and Figure S1 focuses on sites with large numbers of heterozygous sites. There is an excess of sites where all 54 individuals are heterozygous. We chose to restrict to sites at which <49 of 54 individuals were heterozygous. Using other cutoff values did not significantly change our results.

## Expectation with No Inbreeding

We assume that a sample of *n* diploid individuals from a population has been genotyped. Each polymorphic site is characterized by *i*, the number of copies of the derived (*i.e.*, non-chimpanzee) allele, *i* = 1, …, 2*n*–1. Levene (1949) showed that in a randomly mating population, conditional on *i* and *n*, the probability distribution of *h*, the number of heterozygous individuals, is*P*(*h*; *i*, *n*) is unchanged if *i* is replaced by 2*n*–*i*. The value of *i* constrains the possible values of *h*: *h* can take only even values if *i* is even and only odd values if *i* is odd, and *h* ≤ min(*i*, 2*n*–*i*).

Emigh (1980) compares an exact test of random mating (Haldane 1954) based on Equation 1 with other tests. Smith (1970) and Mantel and Li (1974) have also explored the utility of the exact test.

Equation 1 can be derived from a balls-and-boxes argument (Haldane 1954). There are *n* boxes (individuals) and 2*n* balls (alleles) to be placed randomly in them, *i* white balls (derived alleles), and 2*n*–*i* red balls (ancestral alleles). There are (2*n*)!/[*i*!(2*n*–*i*)!] distinguishable arrangements of the balls. Let *n*_{2}, *n*_{1} and *n*_{0} be the numbers of boxes with 2, 1, and 0 white balls: (*n*_{2} + *n*_{1} + *n*_{0} = *n*). Note that *n*_{1} = *h*, *n*_{2} = (*j* – *h*)/2 and *n*_{0} = (2*n* – *j* – *h*)/2 in the notation in Equation 1. Given *n*_{2}, *n*_{1,} and *n*_{0}, there are in *n* boxes. A factor of 2 is needed for each box containing 1 red and 1 white ball because they can be arranged in two ways. Equation 1 is the second number divided by the first.

## Single Inbreeding Coefficient

We next compare the empirical distribution of *h* with the distribution expected if every individual has the same positive inbreeding coefficient, *F*. We derive that distribution from Equation 1 by assuming that, at each polymorphic site, each individual has a probability *F* of being identical by descent (i.b.d.) and a probability 1 – *F* of not being i.b.d. If an individual is i.b.d., we assume it is homozygous at that site. That is, we assume that identity by descent is equivalent to identity in state. The number of individuals that are potentially heterozygous at that site is the difference between *n* and the number that are i.b.d. This derivation does not allow for negative values of *F*, which would indicate a significant excess of heterozygotes. Assuming a single positive *F* for each individual is appropriate when an excess of homozygotes is the result of the mixing of two or more subpopulations (the Wahlund effect).

We assume the distribution of the number of individuals *j* who are i.b.d. at a site is binomially distributed. If *i* is even, then that distribution has sample size *n* and probability *F*. If *i* is odd, then at least one individual is not i.b.d. because there has to be at least one heterozygous site. In that case, the distribution of *j* is binomial with sample size *n* – 1 and probability *F*. These two cases can be combined by defining *i* is even and *i* is odd:*j*, the number *k* of individuals who are homozygous for the derived allele because they are i.b.d. has a hypergeometric distribution*i* is even and *i* is odd. The logic is that there are *i*′/2 pairs of derived alleles available to be drawn with replacement from *n*′ pairs.

Given *j* and *k*, the probability that there are *h* individuals heterozygous at the site has the distribution given by Equation 1 with *n* = *n* – *j* and *j* = *j* – 2*k*. Therefore, summing over *j* and *k* yields*F* = 0, Equation 4 reduces to Equation 1 because the last two probabilities are 0 except for *j* = *k* = 0.

To use Equation 4 as a likelihood for each *i*, we replace *n* by *n _{i}*, the observed number of sites at which there are

*i*derived alleles, and

*h*by

*h*, the number of those sites that are heterozygous. It is then straightforward to obtain the maximum-likelihood estimate (MLE) of

_{i}*F*for that

*i*,

*F*for all

*i*. In that case, the likelihoods from Equation 4 as a function of

*F*for each

*i*are multiplied to obtain an overall likelihood of

*F*, from which the MLE is computed. To test whether a single

*F*is appropriate for all

*i*, we used a likelihood ratio test of the hypothesis that all

*F*to depend on

*i*lets us test whether lower frequency-derived alleles, which will be younger on average, indicate different levels of deviation from HWF, something that would possibly indicate the effects of selection.

We computed the empirical distribution of *h* for the CG data for both populations, as well as the expectations under the model with no inbreeding, the model with a single *F* for all individuals, and the model with a different *F* for each *i* (2 ≤ *i* ≤ 16). Table S1a and Table S1b show the results summarized for the nine European (CEU) and nine Yoruban (YRI) individuals. When fitting a single *F* to all allele frequencies, we find

We performed a likelihood ratio test to determine whether assuming a different *F _{i}* for each

*i*fit the data better than a single

*F*for all

*i*. In both the CEU and YRI populations the tests were highly significant: the likelihood ratios were 566 (

*P*= 0.0) and 188 (

*P*= 0.0), respectively with 14 d.f.

Once we computed the best fit of *π** _{i}* = (

*π*

_{i}_{0},

*π*

_{i}_{1}, …,

*π*) be the vector of the expected fraction of sites with

_{in}*i*copies of the derived allele that have

*h*= 0, 1, …,

*n*heterozygotes under a given model,

*p**= (*

_{i}*p*,

_{i0}*p*,

_{i1}*…*,

*p*) be the observed fraction of sites with

_{in}*i*copies of the derived allele that have 0, 1, …,

*n*heterozygotes, and

*h**= (*

_{i}*h*,

_{i0}*h*

_{i}_{1},

*…*,

*h*) be the observed count of sites with

_{in}*i*copies of the derived allele that have 0, 1, …,

*n*heterozygotes. We computed the test statistic

*P*-values, we used a parametric bootstrap (Kunsch 1989). In the CEU population, this results in Λ of 2165.685, 1854.912, and 1288.163 for the models that assume

*F*= 0,

*F*=

_{i}*F*for all

*i*, and

*F*not constrained to be equal respectively. These all correspond to

_{i}*P*≈ 0. In the YRI, the values of Λ are 773.375, 757.5121, and 568.4466, respectively, each corresponding to

*P*≈ 0. Thus, even allowing for a different

*F*for each

*i*, the data are not consistent with the assumption that the inbreeding coefficient is the same in each individual.

## Individual-Specific Inbreeding Coefficients

We next allow each individual to have a different inbreeding coefficient. Let *F _{j}* be the inbreeding coefficient for individual

*j*(

*j*= 1,…,

*n*). It is possible in principle to find the MLE of the set (

*F*

_{1}, …,

*F*), given

_{n}*i*(the number of copies of the derived allele) and (

*h*

_{1},…,

*h*), where

_{n}*h*=(

_{j}*h*,

_{j1}*h*,

_{j2}*…*,

*h*) in which

_{jL}*h*= 1 if individual

_{jl}*j*is heterozygous at site

*l*and 0 otherwise. It is computationally difficult, however, to compute the exact likelihood of the

*F*’s given the data. Instead we compute the composite likelihood obtained by assuming that each individual’s heterozygosity is determined independently:

*L*is the number of polymorphic sites and

*F*may be negative. Its lower limit is constrained by

_{j}*i*and

*n*, however, because the right-hand side has to be <1.

Assuming independence of individuals, we can obtain the MLE *F _{j}* for each from Equations 6 and 7. We used a block jackknife (Kunsch 1989) with 5-Mb block sizes to compute standard errors on the estimates of

*F*. Briefly, we broke the genome up into 5-Mb blocks and computed jackknife replicates by removing each block one at a time and finding the MLE of

_{j}*F*for the rest of the genome. The results are summarized in Table 1. By the theory of the jackknife, the estimate of

_{j}*F*divided by its standard error is normally distributed with 1 SD (Kunsch 1989). We first performed a two-sided Z-test of the hypothesis that

_{j}*F*= 0 for each

_{j}*j*. All individuals except for one YRI sample strongly rejected

*F*= 0. Next, we tested for heterogeneity between individuals in their estimated

_{j}*F*. If two individuals have equal

_{j}*F*, the difference in their estimated

_{j}*F*should be normally distributed with variance equal to the sum of their squared standard errors. We therefore performed a two-sided test of the null hypothesis that the difference was equal to 0 for every pair of individuals. In the CEU, the

_{j}*P*-values ranged from 0 to 0.38, while in YRI the

*P*-values ranged from 0 to 0.84.

Our likelihood method for estimating *F* for each individual differs from the method-of-moments estimator implemented in the widely used program package PLINK (Purcell *et al.* 2007). PLINK estimates *F* by equating the number of homozygous sites in an individual to *F*, where *L* is the number of polymorphic sites in each individual and *p _{i}* is the allele frequency at the ith site. We implemented this method and found a high correlation between estimated

*F*’s for each individual (Spearman’s

*ρ*= 0.96). The MLE estimates did not tend to be larger than the method-of-moments estimates (Wilcoxon test:

*P*= 0.86 for YRI,

*P*= 1 for CEU). Figure S2 compares the estimates of F obtained using the two methods. Keller

*et al.*(2011) found that estimates of individual

*F*based on runs of homozygosity (ROH) were somewhat better than estimates obtained from PLINK for the purpose of detecting inbreeding depression in apparently outbred populations. We did not attempt to implement the Keller

*et al.*ROH method because it was developed for SNP data, not sequence data of the type we analyzed.

## Variation Among Chromosomes

For each individual, we estimated *F _{j}* separately for each chromosome. Figure 1 shows the results summarized for both populations. We tested whether there is more variation between chromosomes within each individual than can be accounted for by random sampling from that individual’s genomic background. For a chromosome with

*L*polymorphic loci, we sampled each site by first sampling its allele frequency from the frequency spectrum and then sampling the status of that site (heterozygous or homozygous) from the individual’s background frequency of heterozygous sites for sites with the given allele frequency. We then inferred the

*F*that fit the bootstrapped data best. We performed this test 10000 times for each chromosome in each individual;

*P*-values are shown in Table S2a and Table S2b. Each population had 198 hypothesis tests, and after a Bonferroni correction we found that 141 and 132 tests were significant at the 0.05 level in the CEU and YRI populations, respectively. As shown, there is substantial variation among the median estimates for each chromosome. Furthermore, median estimates of

*F*per chromosome were not correlated between CEU and YRI (Figure 2), thus ruling out the possibility that there are chromosome-specific factors contributing to the variation among chromosomes we detected.

## Distance from Genes

We tested whether there is an effect of the distance from coding sequences on the extent of deviation from HWF. We stratified sites depending on whether they are more or less than 5 kb from the beginning or end of an annotated coding region and computed the MLE of *F* for each individual and each *i*. We did this by creating an artificial chromosome made by concatenating all regions that are within 5 kb from genes (or respectively 5 kb away from genes) and computing the likelihood using Equation 4 assuming the same *F* for each *i*. The values shown in Table 2 are the maximum-likelihood estimates of *F*. We performed a likelihood ratio test of whether there is a significant effect of stratifying by distance from genes. There is for CEU (*P* = 3 × 10^{−6}) but not for YRI (*P* = 1). Thus, there is some evidence that selection might be contributing to variation in the apparent inbreeding coefficients.

## Sequencing Error

Although the error rate in the CG data are low, it is not zero. As long as the probability of sequencing error is independent of whether the error is at a site that is actually heterozygous or homozygous, sequencing error will cause the estimated *F* to be closer to zero than it actually is. Thus, if the true *F* is slightly positive, the apparent *F* will be somewhat smaller. The reason is that the Levene distribution (Equation 1) is derived under the assumption of equally probable configurations of alleles in *n* pairs. Randomly relabeling alleles because of sequencing error will not disturb the underlying randomness of the configurations but the apparent number of sites with each *i* will differ from the actual number. Therefore, if the actual distributions of *h* for each *i* fit the Levene distribution, the apparent distribution for each *i* will also fit. What will change is the frequency spectrum, *i.e.*, the numbers of sites with different *i*.

If *F* is not zero and the probability of sequencing error does not depend on whether a site is homozygous or heterozygous, then the additional randomness introduced by sequencing error will tend to reduce *F* slightly because it will tend to restore randomness. Therefore sequencing error of this type will cause estimates of *F* to be slightly too small, with the difference between the apparent and true values being proportional to the error rate.

If the probability of sequencing error does depend on whether a site is heterozygous or homozygous, then the effect of sequencing error on estimates of *F* depends on whether there is a greater tendency to create homozygous or heterozygous sites. We have no information about the error structure of the CG data so we did not attempt to compensate for errors of different kinds.

## Discussion and Conclusion

We have shown that it is possible to test for deviations from HWF and to quantify those deviations when high-quality genome sequences are available for a relatively small number of individuals. We show that a detailed examination of the numbers of heterozygous sites allows us to test hypotheses that are not testable when only small numbers of polymorphic loci are examined. In particular, we can test whether the apparent inbreeding depends on the frequency of the derived allele or on the distance to coding sequence, and we can test for chromosome-specific or individual-specific differences. Finding dependence on allele frequency or distance to coding sequence indicates that selection might interact with deviations from HWF. Finding differences among individuals provides a way to test for subtle inbreeding depression (Keller *et al.* 2011).

We used the exact distribution of heterozygous sites conditioned on the number of copies of the derived allele at each polymorphic site, to test for a fit to HWF and to estimate inbreeding coefficients, both for the set of individuals considered together and for each individual separately. We applied our method to sequences of nine individuals from each of two populations, Europeans and Yorbans. We found that neither population fits HWF but that the deviation was slightly greater in Europeans than in Yorubans. We also found that inbreeding coefficients estimated for each individual differ significantly. In Europeans, but not in Yorubans, there was a significant effect of distance from coding sequence, suggesting a role for selection.

Although the overall fit of the theory to the data is quite good, there are still unexplained sources of variation. Although variation in the inbreeding coefficient among individuals is to be expected because mating is not perfectly random even in a well-mixed population, some of the differences in inbreeding coefficient may reflect past population substructure. Variation in the inbreeding coefficients among chromosomes is more difficult to account for. In humans, linkage disequilibrium decays on a scale of a few hundred kilobases (Reich *et al.* 2001), making it unlikely that differences among chromosomes reflect accidental correlations in the ancestries of different chromosomes.

It seems unlikely that sequencing error would be an important cause of variation among chromosomes. The sequence data we used was based on 51×–89× coverage, which implies a low error rate. Our analysis was of polymorphic sites that had between 2 and 16 copies of the derived nucleotide in 9 individuals. Sequencing errors that created an apparently single polymorphic nucleotide would have no effect. Sequencing errors at sites that were already polymorphic would have to create an additional copy of a nucleotide that was already present, because we ignored sites at which three or four nucleotides were present. As discussed above, sequencing error that is independent of whether a site is heterozygous or homozygous will tend to reduce *F* slightly, making our estimates somewhat conservative. Only if sequencing error had a strong tendency to create an excess of homozygous sites would our estimates of *F* be too large.

We conclude that high-quality genome sequences permit detailed examination of the fit to HWF. Although the fit to expectations is good when inbreeding coefficients are different for different individuals, there are still deviations from expectations that are not accounted for.

## Software Resources

We have posted the script used to carry out the analysis described in this article at http://cteg.berkeley.edu/software.html.

## Acknowledgments

This research was supported in part by Training Grant T32-HG00047 and National Institutes of Health Grant R01-GM40282 to M.S.

## Footnotes

*Communicating editor: N. A. Rosenberg*

- Received August 27, 2012.
- Accepted October 4, 2012.

- Copyright © 2012 by the Genetics Society of America