Skip to main content
  • Facebook
  • Twitter
  • YouTube
  • LinkedIn
  • Google Plus
  • Other GSA Resources
    • Genetics Society of America
    • G3: Genes | Genomes | Genetics
    • Genes to Genomes: The GSA Blog
    • GSA Conferences
    • GeneticsCareers.org
  • Log in
Genetics

Main menu

  • HOME
  • ISSUES
    • Current Issue
    • Early Online
    • Archive
  • ABOUT
    • About the journal
    • Why publish with us?
    • Editorial board
    • Early Career Reviewers
    • Contact us
  • SERIES
    • All Series
    • Genomic Prediction
    • Multiparental Populations
    • FlyBook
    • WormBook
    • YeastBook
  • ARTICLE TYPES
    • About Article Types
    • Commentaries
    • Editorials
    • GSA Honors and Awards
    • Methods, Technology & Resources
    • Perspectives
    • Primers
    • Reviews
    • Toolbox Reviews
  • PUBLISH & REVIEW
    • Scope & publication policies
    • Submission & review process
    • Article types
    • Prepare your manuscript
    • Submit your manuscript
    • After acceptance
    • Guidelines for reviewers
  • SUBSCRIBE
    • Why subscribe?
    • For institutions
    • For individuals
    • Email alerts
    • RSS feeds
  • Other GSA Resources
    • Genetics Society of America
    • G3: Genes | Genomes | Genetics
    • Genes to Genomes: The GSA Blog
    • GSA Conferences
    • GeneticsCareers.org

User menu

Search

  • Advanced search
Genetics

Advanced Search

  • HOME
  • ISSUES
    • Current Issue
    • Early Online
    • Archive
  • ABOUT
    • About the journal
    • Why publish with us?
    • Editorial board
    • Early Career Reviewers
    • Contact us
  • SERIES
    • All Series
    • Genomic Prediction
    • Multiparental Populations
    • FlyBook
    • WormBook
    • YeastBook
  • ARTICLE TYPES
    • About Article Types
    • Commentaries
    • Editorials
    • GSA Honors and Awards
    • Methods, Technology & Resources
    • Perspectives
    • Primers
    • Reviews
    • Toolbox Reviews
  • PUBLISH & REVIEW
    • Scope & publication policies
    • Submission & review process
    • Article types
    • Prepare your manuscript
    • Submit your manuscript
    • After acceptance
    • Guidelines for reviewers
  • SUBSCRIBE
    • Why subscribe?
    • For institutions
    • For individuals
    • Email alerts
    • RSS feeds
Previous ArticleNext Article

Inferring Heterozygosity from Ancient and Low Coverage Genomes

Athanasios Kousathanas, Christoph Leuenberger, Vivian Link, Christian Sell, Joachim Burger and Daniel Wegmann
Genetics January 1, 2017 vol. 205 no. 1 317-332; https://doi.org/10.1534/genetics.116.189985
Athanasios Kousathanas
Department of Biology and Biochemistry, University of Fribourg, 1700, SwitzerlandStatistical and Computational Evolutionary Biology Group, Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Christoph Leuenberger
Department of Mathematics, University of Fribourg, 1200, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Vivian Link
Department of Biology and Biochemistry, University of Fribourg, 1700, SwitzerlandStatistical and Computational Evolutionary Biology Group, Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Christian Sell
Paleogenetics Group, University of Mainz, 55122, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Joachim Burger
Paleogenetics Group, University of Mainz, 55122, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Daniel Wegmann
Department of Biology and Biochemistry, University of Fribourg, 1700, SwitzerlandStatistical and Computational Evolutionary Biology Group, Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: daniel.wegmann@unifr.ch
  • Article
  • Figures & Data
  • Supplemental
  • Info & Metrics
Loading

Abstract

While genetic diversity can be quantified accurately from high coverage sequencing data, it is often desirable to obtain such estimates from data with low coverage, either to save costs or because of low DNA quality, as is observed for ancient samples. Here, we introduce a method to accurately infer heterozygosity probabilistically from sequences with average coverage Embedded Image of a single individual. The method relaxes the infinite sites assumption of previous methods, does not require a reference sequence, except for the initial alignment of the sequencing data, and takes into account both variable sequencing errors and potential postmortem damage. It is thus also applicable to nonmodel organisms and ancient genomes. Since error rates as reported by sequencing machines are generally distorted and require recalibration, we also introduce a method to accurately infer recalibration parameters in the presence of postmortem damage. This method does not require knowledge about the underlying genome sequence, but instead works with haploid data (e.g., from the X-chromosome from mammalian males) and integrates over the unknown genotypes. Using extensive simulations we show that a few megabasepairs of haploid data are sufficient for accurate recalibration, even at average coverages as low as Embedded Image At similar coverages, our method also produces very accurate estimates of heterozygosity down to Embedded Image within windows of about 1 Mbp. We further illustrate the usefulness of our approach by inferring genome-wide patterns of diversity for several ancient human samples, and we found that 3000–5000-year-old samples showed diversity patterns comparable to those of modern humans. In contrast, two European hunter-gatherer samples exhibited not only considerably lower levels of diversity than modern samples, but also highly distinct distributions of diversity along their genomes. Interestingly, these distributions were also very different between the two samples, supporting earlier conclusions of a highly diverse and structured population in Europe prior to the arrival of farming.

  • heterozygosity
  • low coverage
  • ancient DNA
  • postmortem damage
  • base recalibration

THE genetic diversity at a particular location in the genome is the result of its evolutionary past. Comparing the genetic diversity between individuals or regions of the genome thus gives insight into differences in their respective evolutionary histories. For a diploid individual, the heterozygosity of a genomic region (the fraction of sites in a region at which the individual carries two different alleles) is the result of mutations that occurred since the two alleles shared a common ancestor. It is thus a function of the local mutation rate and the time to the most recent common ancestor, which, in turn, is reflective of the demographic and selective past at that locus. Variation in local mutation rates and, due to recombination, also in the strength of selection and genetic drift leads to variable diversity across the genome. Comparing heterozygosity between regions can thus identify locations that were affected differently by selection, or those with an increased mutation rate. Comparing heterozygosity between individuals may further highlight differences in the demographic histories of populations or different levels of inbreeding, which may lead to long runs of homozygosity.

While heterozygosity is readily obtained from high quality genotype calls by counting, it is much harder to infer accurately from low coverage genomes (i.e., genomes sequenced at low depth). This is primarily due to a substantial probability of observing only one of the two alleles and to sequencing errors, which occur at rates orders of magnitude higher than the expected heterozygosity in many species, including humans. Additional biases may be introduced by relying on a reference genome or by postmortem DNA damage (PMD) when working with ancient DNA.

A natural way of circumventing these issues is to infer genetic diversity probabilistically by taking many of the mentioned issues into account, and several such methods have been developed over the past decade. Johnson and Slatkin (2006), for instance, developed a method for estimating the population scaled mutation rate Embedded Image where N is the population size, and μ the mutation rate, from large metagenomic data sets in the presence of sequencing errors. Shortly after, multiple moment based estimators were introduced to infer heterozygosity from a single individual (Hellmann et al. 2008; Jiang et al. 2008). Lynch (2008) then introduced a likelihood-based estimator that relaxed the assumption of a known error rate by estimating it jointly with heterozygosity from the data itself. Despite the additional parameter, this likelihood-based estimator is generally more accurate, even if this implementation is ill-behaved at very low coverages (Lynch 2008).

Several methods to infer genetic diversity that relax the assumption of a constant error rate have since been proposed. These methods commonly make use of error rate estimates provided by the sequencing machines (the quality scores) to calculate the genotype likelihoods then used in the inference. A particularly frequently used approach infers the site frequency spectrum (SFS) while integrating over genotype uncertainty at individual sites (Nielsen et al. 2012). If applied to data from a single individual, the SFS is a direct estimation of heterozygosity. However, the current implementation of this approach in the software package ANGSD (Korneliussen 2014) requires a priori knowledge of the two potentially observable alleles at each site. While these may be inferred accurately in the case of multiple samples with decent coverage, such an inference from low coverage data are likely biased.

Here, we present a direct extension of these previous approaches to infer heterozygosity within a region that relaxes the assumption of infinitely many sites, does not require any a priori knowledge of the underlying alleles, and takes additional biases introduced by PMD fully into account. We achieve this by explicitly modeling PMD using an extension of the likelihood framework proposed by Maruki and Lynch (2015) to infer site allele frequencies by integrating over all possible genotypes, and by modeling genotype frequencies using the classic substitution model of Felsenstein (1981), which allows for back mutations.

A major hurdle for all approaches relying on quality scores provided by sequencing machines is that these scores are not reliable and must be recalibrated, particularly when coverage is low. This is commonly achieved by learning error rates from sites assumed a priori to be invariant, for instance, by masking polymorphic sites, repetitive elements, and large structural variants (DePristo et al. 2011). While we have extended this approach to tolerate PMD (Hofmanová et al. 2016), it requires detailed knowledge of the study species, which is often lacking for nonmodel organisms. We propose here to circumvent this problem by using a reference-free recalibration approach that relies on haploid sequences such as those from the X-chromosome in male mammals. Our approach does not require any a priori information on the underlying sequence, as it integrates over all possible but hidden genotypes while taking PMD and covariates such as position in read or read context into account. This renders our approach essentially free of reference biases, since the reference is only required for aligning raw reads by mapping and current mapping techniques tolerate a sequence divergence of up to 10% (e.g., Lunter and Goodson 2011).

Using computer simulations, we show that our method reliably estimates local genetic diversity in single, diploid individuals even with average coverage below Embedded Image for windows of ∼1 Mbp. We further show that a few megabasepairs of data at equally low coverage are sufficient to properly recalibrate distorted quality scores. Finally, we use the methods here developed to infer the genome-wide pattern of diversity for several ancient and modern human samples. We found that these patterns differ between European and African samples, but that samples from a few thousand years ago cluster well with modern samples. In contrast, European hunter-gatherer individuals differ strongly from modern Europeans, but also from each other, illustrating the high diversity that existed in Europe before the Neolithic transition.

Theory

Inferring heterozygosity

Here we develop a method to estimate local heterozygosity in a genomic window from a collection of aligned reads by integrating out the uncertainty of the local genotype, and by taking the potential effects of PMD into account. Specifically, we are interested in inferring the stationary base frequencies Embedded Image, together with the rate of substitutions Embedded Image, along the genealogy connecting the two alleles of an individual within a genomic region. Here, T corresponds to the time to the most recent common ancestor of the two lineages, and μ to the mutation rate per base pair per generation. Notably, it is not possible to infer T and μ independently, and we therefore only attempt to estimate the compound substitution rate θ from the data.

We extend Felsenstein’s model of substitutions (Felsenstein 1981) to infer θ while accounting for the uncertainty in the genotypes in the region of interest. Let us denote the hidden genotype at site i by Embedded Image, where Embedded Image consists of a pair of nucleotides Embedded Image with Embedded Image Under the substitution model, the probability of observing a specific genotype Embedded Image given the base frequencies Embedded Image, and the substitution rate θ, is given byEmbedded Image(1)To integrate out the uncertainty in observed genotypes, we adopt a model similar to Lynch (2008), and those commonly used for genotype calling (e.g., Li 2011). The subsequent notation will closely follow the notation we recently introduced (Hofmanová et al. 2016).

The observed data Embedded Image at site i shall correspond to what is typically obtained when individual reads of next generation sequencing approaches are mapped to a reference genome. Here, we will assume that all sequencing reads were accurately mapped and that reads with low mapping qualities have been filtered out. The data Embedded Image obtained at site i thus consists of a list of Embedded Image observed bases Embedded Image Embedded Image

We chose to model the observed data, Embedded Image, at site i as a function of the underlying genotype, Embedded Image, as well as the base-specific rates of sequencing errors Embedded Image for Embedded Image We assume here that these rates are known (e.g., through quality scores provided by the sequencing machine). Assuming further that sequencing errors occur independently, the likelihood of the full data at site, i, is given byEmbedded Imagewhere Embedded Image

Following Maruki and Lynch (2015), and commonly used approaches (e.g., Li 2011), we will assume that a sequencing read is equally likely to cover any of the two alleles of an individual, and that sequencing errors may result in any of the alternative bases with equal probability, Embedded Image The probability of observing a base, Embedded Image, given the underlying genotype, Embedded Image, is then given byEmbedded ImageAssuming sites to be independent, the full likelihood of our model is given byEmbedded Imagewhere the sum runs over all combinations Embedded Image

PMD damage:

We will next extend this model with the possibility of PMD. The most common form of PMD is C deamination, which leads to a Embedded Image transition on the affected strand and a Embedded Image transition on the complimentary strand (e.g., Briggs and Stenzel 2007). These deaminations do not occur randomly along the whole read, but are observed much more frequently at the beginning of a read. This is due to fragment ends being more often single-stranded, and thus subject to a much higher rate of deamination. Consequently, the rates of PMD, while specific to the sample and the sequencing protocol used, generally decay roughly exponentially with distance from the ends of the read Skoglund et al. (2014). Since ancient DNA is highly fragmented, one read can often cover an entire DNA molecule, and hence Embedded Image and Embedded Image transitions may be seen in a single read, but they are accumulated at opposite ends.

Here, we will develop our model for this form of PMD following the formulation of Skoglund et al. (2014) and Hofmanová et al. (2016), but we note that it is readily extendable to incorporate other forms of PMD as well. We feel that the rationale of the approach taken is best explained with a specific example. Consider Embedded Image, given the underlying genotype Embedded Image There are three possible ways to obtain a T: (i) by sequencing an allele T without error, (ii) by sequencing an allele C affected by PMD without error, (iii) or by sequencing an allele C not affected by PMD with error. We thus haveEmbedded Image(2)where Embedded Image denotes the probability that a Embedded Image PMD occurred at the base of read j covering site i.

The emission probabilities for all combinations of Embedded Image and Embedded Image derived following the same logic are found in the Appendix.

Inference using expectation-maximization:

In this section, we will detail how to find the maximum likelihood estimate (MLE) of the model parameters θ and Embedded Image in a genomic window of I sites using an expectation-maximization (EM) algorithm (Dempster et al. 1977). For this, we will make the assumption that base-specific sequencing error rates, Embedded Image, and rates of PMD, Embedded Image, are given constants. For the cases in which they are not known a priori, we show below how they can be learned accurately from genome-wide data prior to inferring θ and Embedded Image In an effort to unburden the notation, we will thus refer to the emission probabilities simply as Embedded Image in the following.

The relevant property to develop an EM algorithm is the complete data likelihood, which, in the case of our model, is given byEmbedded Imageand thus the complete data log-likelihood by

Embedded Image
E-step:

The expected complete data log-likelihood is calculated asEmbedded Imagewhere the sum runs over all combinations Embedded Image Only the second part Embedded Image of this sum depends on the parameters Embedded Image We haveEmbedded Imagewhere we use the shorthand notation Embedded Image We have by Bayes’ TheoremEmbedded Image(3)Let us write out Embedded Image explicitly:

Embedded Image
M-step:

We have to maximize Embedded Image subject to the constraintEmbedded ImageFor this reason, we form the LagrangianEmbedded Imagewhere μ is the Lagrange multiplier. We get the following partial derivatives of the Lagrangian:Embedded ImageWe have to set these equations to zero and solve for Embedded Image and μ. Since this is not possible analytically, we will revert to the Newton-Raphson algorithm.

To streamline the notation, we will rename our variables Embedded Image Embedded Image, and Embedded Image With these variable the above system can be simplified to a systemEmbedded Image(4)whereEmbedded Imagefor Embedded Image andEmbedded ImageEmbedded ImageTo apply the Newton-Raphson algorithm, we determine the Embedded Image Jacobian matrix Embedded Image The nonzeros entries of the Jacobian with Embedded Image are:Embedded Image(5)We can now approximate the zero of Equation 4 with the iterationEmbedded Image(6)After a few iterations, we get the new estimate for the original parameters by setting Embedded Image for Embedded Image and Embedded Image

Confidence intervals:

We calculate an approximate confidence interval for θ using the Fisher information. To simplify the calculations, we consider the Embedded Image as constant. The observed Fisher information at the ML value Embedded Image isEmbedded Imageand the corresponding derivatives areEmbedded Image(7)Observe that Embedded Image From this we easily get thatEmbedded Image(8)where we have setEmbedded Image(9)An approximate Embedded Image confidence interval is now given by

Embedded Image

Estimating rates of PMD

As mentioned earlier, the method above assumes that rates of PMD are known a priori. In cases in which they are not known, they can be readily inferred from genome-wide data as we outline in this section.

Following Jónsson et al. (2013), we present an approach to estimate PMD rates directly from genome-wide counts of Embedded Image and Embedded Image transitions as a function of distance within the read. For this, we first build the three-dimensional table Embedded Image, where each entry Embedded Image corresponds to the number of observed bases, s, read at a site with reference base, r, at position p within a read. While these counts depend on the divergence between the sequenced individual and the reference genome used for mapping, we develop here an approach that takes this divergence into account.

Let us denote by Embedded Image the probability of a true difference between the sequenced individual and the reference, such that the reference has base r and the sequenced individual base s. Since the reference and a sequenced chromosome form a genealogy on which these mutations occurred, it is safe to assume that Embedded Image We will further assume that the observed counts in a cell Embedded Image not affected by PMD are a direct function of Embedded Image

Since the rate of PMD is generally low far away from the read ends, position-specific estimates may become noisy for these positions, particularly if data are limited. We thus introduce a method to estimate parameters of a model of exponential decay with the position in the read. The use of such a model was first introduced by Skoglund et al. (2014), and we implement here a slightly more general version of their function. Specifically, we assume that the probability of observing base Embedded Image when the reference sequence is a C at position p is given byEmbedded Imagewhere Embedded Image again denotes true differences between the individual and the reference.

Note that some of the parameters of this model are nonidentifiable. In the Appendix, we show how to obtain ML estimates of the parameters of the probability functionEmbedded Image(10)using the standard Newton-Raphson algorithm.

To then obtain estimates for the original parameters Embedded Image, and c, we assume that Embedded Image and obtain a ML estimateEmbedded ImageThen, Embedded Image Embedded Image, and Embedded Image We use the analogous logic to infer PMD patterns for Embedded Image damages, but measuring positions from the opposite end of the read.

Estimating base-specific error rates (recalibration)

The challenge of inferring genetic diversity from next-generation sequencing data lies in the fact that the per base error rates are orders of magnitude higher than the expected heterozygosity of many species (Lynch 2008). While this issue can easily be overcome with high coverages, accurate inference from low-coverage data relies on an exact knowledge of base-specific error rates. Crude estimates of these rates are usually directly provided by the sequencing machines themselves. However, these estimates are often inaccurate, and are recommend to be recalibrated for genotype calling (DePristo et al. 2011).

The most commonly used approach for recalibration is BQSR (Base Quality Score Recalibration) implemented in GATK (McKenna et al. 2010; De-Pristo et al. 2011). This approach infers new quality scores by binning the data into groups based on covariates such as the raw quality score, the position in the read, or the sequence context. All bases within such a bin are assumed to share the same error rate, which can be readily inferred if the true underlying sequence is known. As an alternative, Cabanski et al. (2012) proposed to fit a logistic regression to the full data where the response variable is the probability of a sequencing error and the explanatory variables are the raw quality scores, and covariates such as position in the read, or base context.

For our purpose, these methods suffer from two shortcomings: first, they cannot be applied to ancient DNA since they do not take PMD into account. Second, both require a reference sequence as well as knowledge of polymorphic positions, such that they can be excluded from the analysis. While we have shown how to extend the BQSR method to ancient DNA (Hofmanová et al. 2016), we develop here an approach that also integrates over the unknown reference sequence.

To do so, we will assume the existence a genomic region for which the individual does not show any polymorphism. A good example of such a genomic region are nonhomologous sequences from sex chromosomes in heterogametic individuals, and we will describe our approach with this type of data in mind. However, we note that our approach can also be readily applied to diploid regions that are known to be monomorphic, such as positions that are highly conserved among species or positions retained after filtering out those with high minor allele counts (Cabanski et al. 2012).

Model

As above, let us denote the hidden (haploid) genotype at site i by Embedded Image where Embedded Image is one of the nucleotides Embedded Image At each site i, there are Embedded Image reads, and we denote by Embedded Image Embedded Image, the base of read j covering site i. A sequencing error occurs with probability Embedded Image These probabilities shall now be given by a modelEmbedded Image(11)where Embedded Image is a given external vector of information, and Embedded Image are the parameters of the model that have to be estimated. While our approach is flexible regarding the choice of included covariates, we will here consider the raw quality score, the position within the read, the squares of these to account for a nonlinear relationships, and all two-base contexts consisting of the bases of the read at positions Embedded Image and i.

Following Cabanski et al. (2012), we impose the logit modelEmbedded Image(12)withEmbedded ImageIn the case of monomorphic or haploid sites only, the probability of the read vector Embedded Image given the hidden state Embedded Image can be written more generally as

Embedded Image(13)

Here, the dependence on the parameters Embedded Image is given by (12),Embedded Imageand Embedded Image and Embedded Image refer to the known probability that a Embedded Image or Embedded Image PMD occurred at the position covering site i in read j.

Again, this model can be estimated using the EM algorithm with the Newton-Raphson algorithm in the M-step. Details are given in the Appendix.

Implementation

All approaches mentioned were implemented in a custom C++ program available at our laboratory website (http://www.unifr.ch/biology/research/wegmann/wegmannsoft). We used functions included in the library BamTools for manipulating bam files (Barnett et al. 2011).

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Simulations

Generating simulations

In this section, we illustrate the power and accuracy of our inference approaches with simulations. These were generated using a custom-made R script that implements the following steps:

  1. The first chromosome of length L was simulated using random bases with frequencies Embedded Image

  2. The second, homologous chromosome was simulated according to the Felsenstein (1981) substitution model (Equation 1) with Embedded Image and a chosen θ value.

  3. Sequencing reads of 100 bases were then generated by copying from one of the two chromosomes with equal probability, and by choosing a starting position uniformly between positions 1 and Embedded Image until the desired average coverage was reached. All reads copied from the second chromosome were considered to map to the reverse strand.

  4. PMD was simulated on all reads with probabilities following an exponential decay with increasing position in the read as proposed by Skoglund et al. (2014) to match realistic patterns. Specifically, we simulate PMD at position p within the read with probabilityEmbedded Image

    • where Embedded Image and Embedded Image for both Embedded Image and Embedded Image but with p counted from the 3′ and 5′ ends, respectively.

  5. For each simulated base, a phred-scaled quality score was simulated and sequencing errors were then added with probabilities given by these scores. If not stated otherwise, quality scores were simulated from a normal distribution with mean Embedded Image and SD Embedded Image truncated at zero. When testing our recalibration approach, however, the quality scores were simulated from a uniform distribution Embedded Image, and then transformed according to Equation 12 with coefficients Embedded Image to obtain the true error rate, with which sequencing errors were simulated.

  6. The simulated data were finally used to generate a reference FASTA file containing the first chromosome and a SAM file containing the reads. The latter was then transformed into a BAM file using SAMtools (Li et al. 2009).

Power to infer θ

To check the power of our approach to infer θ from low coverage data, we first simulated data within a 1 Mbp window with a true Embedded Image for various coverages. The specific value of Embedded Image was chosen to reflect the median heterozygosity in a modern, non-African human individual.

We found the median of our θ estimates across replicates to be very close to the true value, but the variance to be a function of coverage. At low coverage (Embedded Image), θ was often inferred to be zero. This is not surprising, as the information about genetic diversity can come only from sites covered at least twice, which is rare at average coverages Embedded Image The simulations that did result in an estimate above zero were thus enriched for cases with slightly above average number of polymorphic sites among those covered twice. As soon as average coverage exceeded Embedded Image however, our approach estimated θ at Embedded Image very accurately (Figure 1A).

Figure 1
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 1

Power to infer θ from low coverage data. Results from sets of 100 simulations with PMD for different average coverage, window size, and true θ values. (A) Estimated Embedded Image in windows of 1 Mbp as a function of average coverage. (B) Estimated Embedded Image as a function of window size and fixed average coverage of Embedded Image (C) Accuracy of estimating Embedded Image quantified as the median relative error (Embedded Image) over replicates indicated by contour lines as a function of both coverage and window size. (D) True vs. estimated θ for different average coverages (see color legend). Polygons indicate the Embedded Image quantile of estimated Embedded Image values among all replicates. The diagonal black line indicates the expectation for perfect estimation. In (A, B, and D), replicates resulting in a Embedded Image are not shown, but their fraction across replicates are printed below the horizontal black line.

We next performed simulations with a fixed coverage of Embedded Image but varying the window size (Figure 1). Interestingly, we found that an increase in window size has a positive effect on the estimate accuracy, similarly to an increase in coverage, suggesting that larger windows help to increase accuracy if coverage is very low. To illustrate this effect, we performed simulations at various window sizes and coverages, and recorded the relative estimation error for a series of replicates. As expected, we found the median relative estimation error to be a direct function of the product of window size and coverage (Figure 1C), thus suggesting our method will perform well also at average coverages below Embedded Image if the window size is large enough.

Using the same setting, we also checked the accuracy of the approximate confidence intervals obtained using the Fisher information. For this we inferred θ from 1000 windows of 1 Mbp simulated with θ at Embedded Image for a coverage of 1.0 and 0.2. We found the true value to be included in the 95% confidence interval in 93.6 and 98.6%, respectively, suggesting these confidence intervals to be a very accurate reflection of estimation uncertainty.

Using a third set of simulations, we found that, at equal coverage and window size, higher θ values are estimated more accurately than lower values (Figure 1D). This is expected since, in the case of low θ, only few heterozygous sites are present in a given window, rendering the estimate more dependent on the detection of individual sites. Nonetheless, we found our approach to infer Embedded Image very accurately in a window of 1 Mbp if the average coverage exceeds Embedded Image.

All results above were generated assuming base-specific quality scores to be normally distributed with Embedded Image and SD Embedded Image which is lower than the quality expected with current sequencing approaches (e.g., Utturkar et al. 2015). Sequences generated with higher quality will positively affect estimation accuracy. Indeed we found that simulating data with Embedded Image or Embedded Image resulted in much lower estimation error, effectively rendering the estimation of θ feasible even at very low average coverage (Figure 2). For instance, we found that, at an average coverage of Embedded Image more than 90% of windows with Embedded Image and Embedded Image were estimated within less than half an order of magnitude from the true value. At Embedded Image this accuracy was only reached with an average coverage of Embedded Image

Figure 2
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 2

Effect of sequencing quality on power to estimate θ. Results from sets of 100 simulations to assess the power to estimate θ of Embedded Image and Embedded Image for (A) and (B), respectively, for different average base qualities distributed normally with mean 20, 40, or 60, and a SD of 4.5, but truncated at 0. Polygon shapes indicate the Embedded Image confidence interval for estimated Embedded Image over all replicates, excluding those resulting in Embedded Image (the fraction excluded are printed below the horizontal black line). All simulations were conducted with PMD, and the true PMD probability functions were used during the estimation.

Comparison to an existing method

While there is currently no implemented method available to infer heterozygosity in a window or region from a single individual, our method is very similar to those inferring the SFS from multiple individuals (e.g., Nielsen et al. 2012; Korneliussen 2014). Indeed, inferring the SFS from a single individual gives a direct estimate of heterozygosity, H, which is related to θ byEmbedded Image(14)The most commonly used approach to infer the SFS while integrating over genotype uncertainty at individual sites (Nielsen et al. 2012) is implemented in the software package ANGSD (Korneliussen 2014). This implementation assumes that the two potential alleles are known a priori for each site, and hence have to be inferred first using the major-minor option implemented in ANGSD. While the method implemented in ANGSD can be extended to address this limitation, we used our simulation framework to assess its effect on the inference of heterozygosity from low coverage data. In order to make estimates comparable, we report those of ANGSD in terms of θ calculated according to Equation 14 from the fraction of sites reported to be heterozygous in the SFS.

As shown in Figure 3A, ANGSD inferred θ very accurately if the reference allele was provided for each site, and only the second allele needed to be inferred. Making explicit use of this additional information, the estimates were more accurate than those obtained with our approach, which does not make any assumption about the alleles present. However, when no information was given about the observable alleles, and both had to be inferred using the major-minor option of ANGSD prior to the inference of heterozygosity, θ was grossly underestimated whenever coverage was below 10×. While we note that ANGSD is designed for applications involving multiple samples, in which case inferring the two observable alleles accurately is much easier; this result illustrates the importance of taking the full genotype uncertainty into account when inferring diversity from low coverage data.

Figure 3
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 3

Performance comparison with ANGSD. Results from sets of 100 simulations with (A) or without (B) PMD, comparing the performance of the method presented in this study and ANGSD in inferring θ. ANGSD was run either with the reference sequence provided (ANGSD_wREF) or without (ANGSD_noREF), in which case the major and minor alleles were first inferred in an additional step. For all simulations, we further assumed that base qualities were distributed normally, with Embedded Image and Embedded Image but truncated at 0. Polygon shapes indicate the Embedded Image confidence interval for estimated Embedded Image but excluding replicates resulting in Embedded Image The fraction of replicates excluded are printed below the horizontal black line. When applying our method in simulations conducted with PMD, we provided the true PMD probability patterns during the estimation.

Again using simulations, we next studied the impact of PMD on the inference of heterozygosity (Figure 3B). Unsurprisingly, the diversity estimated using the method implemented in ANGSD that does not take PMD into account was more than one order of magnitude too large. In contrast, our method properly accounts for PMD if its pattern is well characterized. While ANGSD can readily be extended to account for PMD, these results highlight the importance of accounting for PMD for any population genetic inference or comparison involving ancient DNA.

Accuracy of recalibration

The results discussed so far were all obtained under the assumption that quality scores provided by the sequencing machine are accurate. Unfortunately, this is rarely the case, making recalibration of the quality scores necessary for most applications, and, in particular, when trying to infer genetic diversity from low coverage data. Here, we developed an approach to recalibrate quality scores without prior knowledge of the underlying sequencing information. Instead, we simply assume that a part of the sequence is known to be monomorphic, such as the haploid X-chromosome in mammalian males.

To investigate the power of our approach to infer recalibration parameters, we simulated sequencing reads from a haploid region where the quality scores provided in the SAM files were distorted. We did this by first simulating fake quality scores from a uniform distribution Embedded Image, and then transforming them into true quality scores according to Equation 12. We used the following coefficients: all context coefficients = 1.0, the coefficients for the raw quality score, Embedded Image the square of the raw quality score, Embedded Image the position within the read, Embedded Image, and the square of the position within the read, Embedded Image These values were chosen to reflect a distortion observed in real data from the ancient human samples analyzed in this study (see below). They also result in both a relatively strong distortion, as well as meaningful error rates for the evaluation of our approach.

We found all coefficients to be inferred with high accuracy from a 1 Mbp window with an average coverage above Embedded Image (Figure 4). If the amount of data were much lower than that, estimates were generally less accurate. In particular, we found the coefficients for the quality (Embedded Image and Embedded Image) to be often slightly overestimated at low coverages, likely because many sequencing errors go undetected since they can only be inferred at sites covered at least twice. However, this bias can be alleviated with larger window sizes if coverage is very low (see below).

Figure 4
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 4

Accuracy in inferring recalibration parameters. Results from sets of 100 simulations are shown where sequence data from a haploid 1 Mbp region was simulated assuming a uniform distribution of observed quality scores (Embedded Image) that were then transformed to true qualities according to Equation 12, with Embedded ImageEmbedded ImageEmbedded Image Embedded Image, and all context coefficients at 1.0. All simulations were conducted with PMD, and the true PMD probability patterns were used during the estimation. The coverage values refer to the diploid regions of the simulated genome.

Accuracy of full pipeline

We finally used simulations to assess the accuracy of the full pipeline, that is, when inferring first the pattern of PMD, then the recalibration coefficients given the inferred PMD pattern, and lastly using the recalibrated quality scores along with the inferred PMD pattern to estimate θ. In these simulations, the distortion of quality scores was, in addition to the four effects included above (Embedded Image Embedded Image Embedded Image and Embedded Image), also affected by sequence context, in that simulated sequencing errors were 1.5 times more likely to result in a C or G than in an A or T.

Regardless of the true θ value we used, we detected a strong bias toward high values in our estimates whenever very little data were used (Figure 5, 0.1 Mbp). This is a direct result of the overestimation of the quality scores during the recalibration step as reported above, which leads to an overestimation of diversity. Encouragingly, however, this bias is overcome with only slightly more data. Indeed, we found 1 Mbp of data with an average coverage of well below Embedded Image to be sufficient to accurately infer Embedded Image, and of Embedded Image for Embedded Image Notably, even lower average coverages were sufficient when data were available for 10 Mb. Finally, we found an average coverage of Embedded Image to be sufficient when conducting recalibration and inference in windows as small as 0.1 Mb. These results thus suggest that our approach may be useful not only for hemizygous individuals with large chunks of haploid DNA (the sex chromosomes), but may also work well in other individuals when using mtDNA, for which high coverage can be obtained, or ultraconserved elements for recalibration.

Figure 5
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 5

Accuracy in estimating θ using the full pipeline. Results from sets of 50 simulations, each consisting of data from a haploid as well as a diploid region used to conduct recalibration and inference of θ, respectively. The data sets in (A, B, and C) were simulated with different true values of θ, which are indicated with the dashed lines, and were Embedded Image Embedded Image and Embedded Image respectively. Each data set was simulated with PMD as well as distorted base quality scores according to Equation 12, with Embedded Image Embedded Image Embedded Image and Embedded Image In addition, these simulations also included context effects in that sequencing errors were simulated to result 1.5 times more often in a C or G than in an A or T. The average coverage indicated is for the diploid data, while the haploid data were simulated with half the coverage. Line segments and polygons correspond to the median and the Embedded Image quantile of all estimated Embedded Image within the set of simulations, respectively.

Application

We illustrate the benefit of our approach by inferring θ for several ancient human male samples, and comparing these estimates to those obtained for several male individuals from the 1000 Genomes Project. For the ancient genomes, we first inferred PMD patterns from chromosome 1 using the exponential model introduced here (Supplemental Material, Figure S1), then used the first 20 Mbp of the X chromosome to perform recalibration individually for each read group, taking the inferred PMD pattern into account (Figure S2). Finally, we used both the inferred PMD patterns, as well as the recalibrated quality scores, to infer θ in windows of 1 Mbp in the whole genome, excluding windows closer than 5 Mbp to telomeres or centromeres as defined by the track Gap in group Mapping and Sequencing in the UCSC Table Browser (Karolchik et al. 2008).

The samples that we analyzed this way were (1) two European hunter-gatherer individuals (Jones et al. 2015), namely the Mesolithic genome “Kotias” from Kotias Klde cave from Western Georgia (KK1), and the western European Late Upper Paleolithic genome, “Bichon” from Grotte du Bichon, Switzerland (Bich), ∼17,700 years old (2) an individual from the Bronze age burial site at Ludas-Varjú-dülö, Hungary (BR2 Gamba et al. 2014), and (3) a 4500-year-old male from Mota Cave in the southern Ethiopian highlands (Gallego Llorente and Jones 2015). All these samples had relatively high coverage (Embedded Image), and thus allowed us to infer fine scaled patterns of heterozygosity along the genomes, even for regions with low diversity (Embedded Image).

For comparison, we also inferred diversity patterns for nine modern males from three populations that were analyzed as part of the 1000 Genomes Project, phase 3 (alignment files downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/). These were the British males HG00115, HG00116, and HG00117, Tuscan males NA20509, NA20511, and NA20762, and Yoruban males NA18486, NA18519, and NA18522. As shown in Figure 6B, these nine individuals portrayed the expected pattern of higher diversity in African than European individuals, but they also revealed significant variation among individuals of the same population (t-test, Embedded Image in at least two out of three possible comparisons in each population). Larger differences in overall diversity were observed among the ancient samples analyzed. Unsurprisingly, the African sample Mota exhibited the highest diversity of all ancient samples, which was also higher than the diversity observed in modern day Europeans, yet lower than modern day Yorubans. The ancient sample with the second highest diversity was the Bronze Age sample BR2, whose diversity falls well within the range of estimates obtained from modern day Europeans. In contrast, the two European hunter-gatherer samples KK1 and Bichon showed much lower diversity than modern day Europeans with their median estimates being 15–25% lower than the median estimates of modern Europeans. These results suggest that while hunter-gatherer populations had much lower diversity, the diversity found in Europeans about 3000 years ago was very comparable to the diversity observed today. This conclusion is in perfect agreement with a temporal trend in the total length of runs of homozygosity (ROH) inferred from imputed genotypes among ancient samples from Hungary that spanned a period from 5700 to 100 bc and also included the sample BR2 (Gamba et al. 2014).

Figure 6
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 6

Local diversity in ancient and modern humans. (A) Heterozygosity (θ) inferred in 1 Mb windows along the first 75 Mbp of chromosome 1 (excluding windows closer than 5 Mbp of the telomere) for two modern Europeans (TSI2 and GBR2), and two ancient European hunter-gatherers (KK1 and Bich). Solid lines indicate the MLE estimate, shading indicates the 95% confidence intervals and dashed lines the genome-wide median for each sample. (B) Distribution of estimates Embedded Image in 1 Mbp windows across the first 22 chromosomes of each sample. (C) Similarity in the pattern of θ along the genome visualized by hierarchical clustering using 1 − Spearman correlation as distance.

The inference of local diversity patterns also allows us to compare the distribution of diversity in the genome between individuals, regardless of the overall level of diversity. This analysis revealed a substantial phylogenetic signal in the distribution of diversity as quantified by Spearman correlations. For instance, the diversity pattern is more strongly correlated among modern Yorubans (pairwise Spearman correlations between 0.55 and 0.60) than between Yorubans and Europeans (0.40–0.50). Similarly, the diversity pattern of the ancient African samples Mota is most strongly correlated with that of modern day Yorubans (0.491–0.537), and much less so with modern day Europeans (0.362–0.433). Interestingly, European samples are more diverse in their patterns than Africans (pairwise Spearman correlations between 0.42 and 0.48), and their correlations do not exceed those obtained when comparing African and European individuals. Nonetheless, hierarchical clustering groups all modern day Europeans together, and also puts the Bronze age sample BR2 at the basis of that clade (Figure 6C).

The lowest pairwise correlations (0.28–0.39) were found for comparisons involving the two European hunter-gatherer samples KK1 and Bich, with the overall lowest being the correlation between these samples (0.28). This is also illustrated visually when plotting our estimates of the first 75 Mbp of chromosome 1, where we found relatively high concordance in local diversity among the two European samples, but vastly different patterns among the hunter-gatherer samples (Figure 6C). These results suggest that, despite very comparable overall levels of diversity, the distribution of diversity along the genome was very diverse among European hunter-gatherer populations, and very different from the one observed among modern day individuals. Multiple observations support such a conclusion: first, the two samples analyzed here represent two vastly different geographic regions, with one being a sample from Switzerland, and the other from Georgia, and were previously reported to belong to two different clades that split 45,000 years ago, as inferred from genotyping data (Jones et al. 2015). Second, the ancestry of modern Europeans traces only partly back to European hunter-gatherers, with early Neolithic people from the Aegean (Hofmanová et al. 2016) and Yamnaya steppe herders (Haak et al. 2015) contributing the majority of the modern-day genetic makeup. Finally, the two European hunter-gatherer samples both exhibit many, but unique, regions of very low diversity (Embedded Image in 4% of all windows, cf. 0.00–0.03% in all modern Europeans), which do not have particularly low coverage. They are likely the result of small population sizes with some level of consanguinity in the population (Pemberton et al. 2012).

Discussion

Quantifying genetic diversity and comparing it between different individuals and populations is fundamental to understanding the evolutionary processes shaping genetic variation. Unfortunately, the inference of heterozygosity is confounded by both sequencing errors, resulting in false diversity as well as the statistical power to identify heterozygous sites, particularly when coverage is low. Several methods have been developed to learn about heterozygosity probabilistically, that is, without the need to first call genotypes. A rather recent such approach (Bryc et al. 2013) proposes to leverage data from external reference individuals to obtain an unbiased estimate of the probability that a specific sites is heterozygous. The expected heterozygosity is then estimated from these site-specific estimates. Since this approach requires per site estimates to be accurate, only sites with a coverage of Embedded Image or higher can be included in the analysis, which severely limits the scope of the application.

An alternative is to infer heterozygosity probabilistically from a collection of sites. Among the earliest such methods was a likelihood-based estimator (Lynch 2008), which infers heterozygosity of an individual jointly with the rate sequencing errors. We presented a natural extension of this approach that relaxes the infinite sites assumption and integrates PMD, a particular feature of ancient DNA that is not captured by base quality scores provided by sequencing machines. We achieved this by extending Felsenstein’s model of substitutions (Felsenstein 1981) with an explicit model of next-generation sequencing data that incorporates both sequencing errors as well as errors arising from PMD. We have previously used the same PMD error model for variant calling from ancient samples (Hofmanová et al. 2016), and we note that it may be used to extend any other inference method based on genotype likelihood to ancient DNA.

Following other recently developed approaches to infer genetic diversity from next-generation sequencing data (e.g., Nielsen et al. 2012; Korneliussen 2014; Maruki and Lynch 2015), our method also relaxes the assumption of constant error across all reads by benefiting from the base-specific quality information provided by current sequencing technologies. Yet since these provided quality scores are often distorted, we also introduced here a method to recalibrate the quality scores for low coverage genomes. In contrast to commonly used methods for recalibration (e.g., McKenna et al. 2010; DePristo et al. 2011; Cabanski et al. 2012), our approach does not require information about the underlying sequence context. It only assumes sites to be monomorphic while integrating over the uncertainty of the sequence itself. Examples of regions known to be monomorphic are the sex chromosomes in hemizygous individuals. But, since we found that our method recalibrates quality scores with high accuracy and reliably, even based on DNA stretches as short as 1 Mbp, we are confident that it will work even on ultraconserved elements or plasmid DNA. Finally, we note that if multiple individuals are sequenced together, they are likely affected by the same distortion of quality scores, and can hence be recalibrated with parameters inferred from a subset of them (e.g., the male samples).

As an illustration, we applied the methods developed here to modern and ancient human samples of various coverage. While our approach to infer heterozygosity incorporates the possibility of PMD, it assumes that the probability of a PMD event occurring is known. We thus also introduce a method to infer these probability functions from raw data, which is robust to divergence between the sample and the reference genome. By inferring PMD patterns for each sample, then the recalibration parameters, and, finally, local diversity in 1 Mbp windows, we found that both ancient and modern African samples exhibited much larger diversity than European individuals. In addition, the diversity inferred from two ancient European hunter-gatherer samples was much lower than that of modern samples, which is likely explained by smaller population sizes. Besides overall differences in diversity, also the pattern of diversity along the genome revealed a strong geographic clustering among modern and ancient samples. The exceptions were the two European hunter-gatherers that showed patterns very different from both modern samples, as well as from one another, further corroborating the view (Jones et al. 2015) that these samples represent different and ancient clades that contributed only marginally to the genetic make-up of modern day Europeans.

Acknowledgments

We are grateful to two anonymous reviewers for their very constructive comments on an earlier version of this work. This study was supported by Swiss National Foundation grant number 31003A_149920 to D.W.

Appendix

Emission Probabilities in the Presence of PMD

Following Lynch (2008) and commonly used approaches (e.g., Li 2011), we assume here that a sequencing read is equally likely to cover any of the two alleles of an individual, and that sequencing errors may result in any of the alternative bases with equal probability, Embedded Image In the absence of PMD, the probability of observing a base, Embedded Image, given the underlying genotype Embedded Image is then given byEmbedded ImageIn ancient DNA, differences between the base observed within a read and the underlying alleles may also be the result of PMD. Following Hofmanová et al. (2016), let us denote by Embedded Image and Embedded Image, the known probability that a Embedded Image or Embedded Image PMD occurred at the position covering site i in read j, respectively. In the presence of PMD, the probability of observing a base, Embedded Image, given the underlying genotype Embedded Image is given byEmbedded Image

Newton-Raphson Algorithm to Infer PMD Patterns

Under the model proposed in Equation 10, the log likelihood of the data are given byEmbedded Imagethe gradient vector Embedded Image byEmbedded Imageand the Jacobian matrix Embedded Image byEmbedded ImagewhereEmbedded ImageandEmbedded ImageThe Newton-Raphson iteration for Embedded Image is given by

Embedded Image(15)

EM Algorithm to Infer Base-Specific Error Rates

We propose an EM algorithm for this estimation that is similar to the one above, but assume here that the base frequencies Embedded Image Embedded Image are known, i.e., can be derived accurately from counting in the region. The complete data log-likelihood of our model is given byEmbedded ImageFrom this, we get the expected complete data log-likelihoodEmbedded ImageFor the M-step, we need only to consider the first part of Embedded ImageEmbedded ImagewhereEmbedded Imageby Bayes’ formula. From (13), we get more explicitlyEmbedded Imagewhere we used the abbreviations Embedded Image andEmbedded ImageIn order to maximize Embedded Image for Embedded Image we calculate the gradient vector Embedded Image with componentsEmbedded Image(16)for Embedded Image From (12), we obtainEmbedded ImageObserve that Embedded Image and Embedded Image for Embedded Image

We solve Embedded Image with the Newton-Ralphson method with the Jacobian matrix Embedded Image From (16), we getEmbedded Imagewhere

Embedded Image

Putting everything together we obtain

Embedded Image

The Newton-Ralphson iteration is

Embedded Image

Footnotes

  • Communicating editor: J. Novembre

  • Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.189985/-/DC1.

  • Received April 1, 2016.
  • Accepted October 19, 2016.
  • Copyright © 2017 by the Genetics Society of America

Available freely online through the author-supported open access option.

Literature Cited

  1. ↵
    1. Barnett D. W.,
    2. Garrison E. K.,
    3. Quinlan A. R.,
    4. Strömberg M. P.,
    5. Marth G. T.
    , 2011 Bamtools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics 27: 1691–1692.
    OpenUrlAbstract/FREE Full Text
  2. ↵
    1. Briggs A.,
    2. Stenzel U.
    , 2007 Patterns of damage in genomic DNA sequences from a Neandertal. Proc. Natl. Acad. Sci. USA 104: 14616–14621.
    OpenUrlAbstract/FREE Full Text
  3. ↵
    1. Bryc K.,
    2. Patterson N.,
    3. Reich D.
    , 2013 A novel approach to estimating heterozygosity from low-coverage genome sequence. Genetics 195: 553–561.
    OpenUrlAbstract/FREE Full Text
  4. ↵
    1. Cabanski C. R.,
    2. Cavin K.,
    3. Bizon C.,
    4. Wilkerson M. D.,
    5. Parker J. S.,
    6. et al.
    , 2012 ReQON: a bioconductor package for recalibrating quality scores from next-generation sequencing data. BMC Bioinformatics 13: 221.
    OpenUrlPubMed
  5. ↵
    1. Dempster A. A.,
    2. Laird N. N.,
    3. Rubin D. D. B.
    , 1977 Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39: 1–38.
    OpenUrl
  6. ↵
    1. DePristo M. a.,
    2. Banks E.,
    3. Poplin R.,
    4. Garimella K. V.,
    5. Maguire J. R.,
    6. et al.
    , 2011 A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43: 491–498.
    OpenUrlCrossRefPubMedWeb of Science
  7. ↵
    1. Felsenstein J.
    , 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17: 368–376.
    OpenUrlCrossRefPubMedWeb of Science
  8. ↵
    1. Gallego Llorente M.,
    2. Jones E. R.,
    3. Eriksson A.,
    4. Siska V.,
    5. Arthur K. W.,
    6. et al.
    , 2015 Ancient Ethiopian genome reveals extensive Eurasian admixture throughout the African continent. Science 350: 820–822.
    OpenUrlAbstract/FREE Full Text
  9. ↵
    1. Gamba C.,
    2. Jones E. R.,
    3. Teasdale M. D.,
    4. McLaughlin R. L.,
    5. Gonzalez-Fortes G.,
    6. et al.
    , 2014 Genome flux and stasis in a five millennium transect of European prehistory. Nat. Commun. 5: 5257.
    OpenUrlCrossRefPubMed
  10. ↵
    1. Haak W.,
    2. Lazaridis I.,
    3. Patterson N.,
    4. Rohland N.,
    5. Mallick S.,
    6. et al.
    , 2015 Massive migration from the steppe was a source for Indo-European languages in Europe. Nature 522: 207–211.
    OpenUrlCrossRefPubMed
  11. ↵
    1. Hellmann I.,
    2. Mang Y.,
    3. Gu Z.,
    4. Li P.,
    5. de la Vega F. M.,
    6. et al.
    , 2008 Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals. Genome Res. 18: 1020–1029.
    OpenUrlAbstract/FREE Full Text
  12. ↵
    1. Hofmanová Z.,
    2. Kreutzer S.,
    3. Hellenthal G.,
    4. Sell C.,
    5. Diekmann Y.,
    6. et al.
    , 2016 Early farmers from across Europe directly descended from Neolithic Aegeans. Proc. Natl. Acad. Sci. USA 113: 6886–6891.
    OpenUrlAbstract/FREE Full Text
  13. ↵
    1. Jiang R.,
    2. Tavare S.,
    3. Marjoram P.
    , 2008 Population genetic inference from resequencing data. Genetics 181: 187–197.
    OpenUrlCrossRefPubMed
  14. ↵
    1. Johnson P. L. F.,
    2. Slatkin M.
    , 2006 Inference of population genetic parameters in metagenomics: a clean look at messy data Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome Res. 16: 1320–1327.
    OpenUrlAbstract/FREE Full Text
  15. ↵
    1. Jones E. R.,
    2. Gonzalez-Fortes G.,
    3. Connell S.,
    4. Siska V.,
    5. Eriksson A.,
    6. et al.
    , 2015 Upper palaeolithic genomes reveal deep roots of modern Eurasians. Nat. Comm. 6: 1–8
    OpenUrl
  16. ↵
    1. Jónsson H.,
    2. Ginolhac A.,
    3. Schubert M.,
    4. Johnson P. L. F.,
    5. Orlando L.
    , 2013 mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics 29: 1682–1684.
    OpenUrlAbstract/FREE Full Text
  17. ↵
    1. Karolchik D.,
    2. Kuhn R. M.,
    3. Baertsch R.,
    4. Barber G. P.,
    5. Clawson H.,
    6. et al.
    , 2008 The UCSC genome browser database: 2008 update. Nucleic Acids Res. 36: D773–D779.
    OpenUrlAbstract/FREE Full Text
  18. ↵
    1. Korneliussen T.
    , 2014 ANGSD: analysis of next generation sequencing data. BMC Bioinformatics 15: 1–13.
    OpenUrlCrossRefPubMed
  19. ↵
    1. Li H.
    , 2011 A statistical framework for {SNP} calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27: 2987–2993.
    OpenUrlAbstract/FREE Full Text
  20. ↵
    1. Li H.,
    2. Handsaker B.,
    3. Wysoker A.,
    4. Fennell T.,
    5. Ruan J.,
    6. et al.
    , 2009 The sequence alignment/map format and SAMtools. Bioinformatics 25: 2078–2079.
    OpenUrlAbstract/FREE Full Text
  21. ↵
    1. Lunter G.,
    2. Goodson M.
    , 2011 Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 21: 936–939.
    OpenUrlAbstract/FREE Full Text
  22. ↵
    1. Lynch M.
    , 2008 Estimation of nucleotide diversity, disequilibrium coefficients, and mutation rates from high-coverage genome-sequencing projects. Mol. Biol. Evol. 25: 2409–2419.
    OpenUrlAbstract/FREE Full Text
  23. ↵
    1. Maruki T.,
    2. Lynch M.
    , 2015 Genotype-frequency estimation from high-throughput sequencing data. Genetics 201: 473–486.
    OpenUrlAbstract/FREE Full Text
  24. ↵
    1. McKenna A.,
    2. Hanna M.,
    3. Banks E.,
    4. Sivachenko A.,
    5. Cibulskis K.,
    6. et al.
    , 2010 The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20: 1297–1303.
    OpenUrlAbstract/FREE Full Text
  25. ↵
    1. Nielsen R.,
    2. Korneliussen T.,
    3. Albrechtsen A.,
    4. Li Y.,
    5. Wang J.
    , 2012 SNP calling, genotype calling, and sample allele frequency estimation from new-generation sequencing data. PLoS One 7: e37558.
    OpenUrlCrossRefPubMed
  26. ↵
    1. Pemberton T.,
    2. Absher D.,
    3. Feldman M.,
    4. Myers R.,
    5. Rosenberg N.,
    6. et al.
    , 2012 Genomic patterns of homozygosity in worldwide human populations. Am. J. Hum. Genet. 91: 275–292.
    OpenUrlCrossRefPubMed
  27. ↵
    1. Skoglund P.,
    2. Northoff B. H.,
    3. Shunkov M. V.,
    4. Derevianko A. P.,
    5. Pääbo S.,
    6. et al.
    , 2014 Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal. Proc. Natl. Acad. Sci. USA 111: 2229–2234.
    OpenUrlAbstract/FREE Full Text
  28. ↵
    1. Utturkar S. M.,
    2. Klingeman D. M.,
    3. Bruno-Barcena J. M.,
    4. Chinn M. S.,
    5. et al.
    , 2015 Sequence data for Clostridium autoethanogenum using three generations of sequencing technologies. Sci. Data 2: 150014.
    OpenUrl
View Abstract
Previous ArticleNext Article
Back to top

PUBLICATION INFORMATION

Volume 205 Issue 1, January 2017

Genetics: 205 (1)

ARTICLE CLASSIFICATION

INVESTIGATIONS
Population and evolutionary genetics
View this article with LENS
Email

Thank you for sharing this Genetics article.

NOTE: We request your email address only to inform the recipient that it was you who recommended this article, and that it is not junk mail. We do not retain these email addresses.

Enter multiple addresses on separate lines or separate them with commas.
Inferring Heterozygosity from Ancient and Low Coverage Genomes
(Your Name) has forwarded a page to you from Genetics
(Your Name) thought you would be interested in this article in Genetics.
Print
Alerts
Enter your email below to set up alert notifications for new article, or to manage your existing alerts.
SIGN UP OR SIGN IN WITH YOUR EMAIL
View PDF
Share

Inferring Heterozygosity from Ancient and Low Coverage Genomes

Athanasios Kousathanas, Christoph Leuenberger, Vivian Link, Christian Sell, Joachim Burger and Daniel Wegmann
Genetics January 1, 2017 vol. 205 no. 1 317-332; https://doi.org/10.1534/genetics.116.189985
Athanasios Kousathanas
Department of Biology and Biochemistry, University of Fribourg, 1700, SwitzerlandStatistical and Computational Evolutionary Biology Group, Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Christoph Leuenberger
Department of Mathematics, University of Fribourg, 1200, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Vivian Link
Department of Biology and Biochemistry, University of Fribourg, 1700, SwitzerlandStatistical and Computational Evolutionary Biology Group, Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Christian Sell
Paleogenetics Group, University of Mainz, 55122, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Joachim Burger
Paleogenetics Group, University of Mainz, 55122, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Daniel Wegmann
Department of Biology and Biochemistry, University of Fribourg, 1700, SwitzerlandStatistical and Computational Evolutionary Biology Group, Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: daniel.wegmann@unifr.ch
del.icio.us logo Digg logo Reddit logo Twitter logo CiteULike logo Facebook logo Google logo Mendeley logo
Citation

Inferring Heterozygosity from Ancient and Low Coverage Genomes

Athanasios Kousathanas, Christoph Leuenberger, Vivian Link, Christian Sell, Joachim Burger and Daniel Wegmann
Genetics January 1, 2017 vol. 205 no. 1 317-332; https://doi.org/10.1534/genetics.116.189985
Athanasios Kousathanas
Department of Biology and Biochemistry, University of Fribourg, 1700, SwitzerlandStatistical and Computational Evolutionary Biology Group, Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Christoph Leuenberger
Department of Mathematics, University of Fribourg, 1200, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Vivian Link
Department of Biology and Biochemistry, University of Fribourg, 1700, SwitzerlandStatistical and Computational Evolutionary Biology Group, Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Christian Sell
Paleogenetics Group, University of Mainz, 55122, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Joachim Burger
Paleogenetics Group, University of Mainz, 55122, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Daniel Wegmann
Department of Biology and Biochemistry, University of Fribourg, 1700, SwitzerlandStatistical and Computational Evolutionary Biology Group, Swiss Institute of Bioinformatics, Fribourg, 1700, Switzerland
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: daniel.wegmann@unifr.ch

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero

Related Articles

Cited By

More in this TOC Section

Investigations

  • The Fate of Deleterious Variants in a Barley Genomic Prediction Population
  • Comparative Genomics and Transcriptomics To Analyze Fruiting Body Development in Filamentous Ascomycetes
  • Retrospective Association Analysis of Longitudinal Binary Traits Identifies Important Loci and Pathways in Cocaine Use
Show more Investigations

Population and Evolutionary Genetics

  • Fine-Mapping Complex Inversion Breakpoints and Investigating Somatic Pairing in the Anopheles gambiae Species Complex Using Proximity-Ligation Sequencing
  • The Genomic Basis for Short-Term Evolution of Environmental Adaptation in Maize
  • Polygenic Adaptation to an Environmental Shift: Temporal Dynamics of Variation Under Gaussian Stabilizing Selection and Additive Effects on a Single Trait
Show more Population and Evolutionary Genetics
  • Top
  • Article
    • Abstract
    • Theory
    • Simulations
    • Application
    • Discussion
    • Acknowledgments
    • Appendix
    • Footnotes
    • Literature Cited
  • Figures & Data
  • Supplemental
  • Info & Metrics

GSA

The Genetics Society of America (GSA), founded in 1931, is the professional membership organization for scientific researchers and educators in the field of genetics. Our members work to advance knowledge in the basic mechanisms of inheritance, from the molecular to the population level.

Online ISSN: 1943-2631

  • For Authors
  • For Reviewers
  • For Subscribers
  • Submit a Manuscript
  • Editorial Board
  • Press Releases

SPPA Logo

GET CONNECTED

RSS  Subscribe with RSS.

email  Subscribe via email. Sign up to receive alert notifications of new articles.

  • Facebook
  • Twitter
  • YouTube
  • LinkedIn
  • Google Plus

Copyright © 2019 by the Genetics Society of America

  • About GENETICS
  • Terms of use
  • Advertising
  • Permissions
  • Contact us
  • International access