- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Wang, J.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Wang, J.
Sibship Reconstruction From Genetic Data With Typing Errors
Jinliang Wangaa Institute of Zoology, Zoological Society of London, London NW1 4RY, United Kingdom
Corresponding author: Jinliang Wang, Regent's Park, London NW1 4RY, United Kingdom., jinliang.wang{at}ioz.ac.uk (E-mail)
Communicating editor: J. B. WALSH
| ABSTRACT |
|---|
Likelihood methods have been developed to partition individuals in a sample into full-sib and half-sib families using genetic marker data without parental information. They invariably make the critical assumption that marker data are free of genotyping errors and mutations and are thus completely reliable in inferring sibships. Unfortunately, however, this assumption is rarely tenable for virtually all kinds of genetic markers in practical use and, if violated, can severely bias sibship estimates as shown by simulations in this article. I propose a new likelihood method with simple and robust models of typing error incorporated into it. Simulations show that the new method can be used to infer full- and half-sibships accurately from marker data with a high error rate and to identify typing errors at each locus in each reconstructed sib family. The new method also improves previous ones by adopting a fresh iterative procedure for updating allele frequencies with reconstructed sibships taken into account, by allowing for the use of parental information, and by using efficient algorithms for calculating the likelihood function and searching for the maximum-likelihood configuration. It is tested extensively on simulated data with a varying number of marker loci, different rates of typing errors, and various sample sizes and family structures and applied to two empirical data sets to demonstrate its usefulness.
KNOWLEDGE of the genealogical relationships among individuals in a population (sample) is important in many research areas in behavioral, ecological, and evolutionary genetics and in conservation biology. It is crucial in studying the social behavior, mating system, and sex and reproductive allocations in social insect and other species (![]()
![]()
![]()
Numerous methods have been advanced for inferring relationships among individuals solely from marker information (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Group-likelihood approaches consider all individuals in the entire sample and partition them simultaneously into distinctive genetic groups of variable sizes (![]()
![]()
![]()
![]()
![]()
![]()
![]()
The accuracy of both pairwise and group approaches relies heavily on the reliability of marker information used in relationship inference. The exclusion of a given relationship because of its incompatibility with the observed genotypes is legitimate only when the genetic data are perfect. Unfortunately, however, genotype errors can be quite common in practice and are difficult to avoid. Even in the most favorable situation where a large amount of high-quality DNA is available for repeated genotyping under optimized PCR conditions, relationship inference can still suffer from mutations that may occur at a rate as high as 1.4 x 102 for microsatellites (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
In contrast, all previous group-likelihood approaches ignored typing errors and mutations completely. This is unfortunate because, on one hand, typing errors are expected to have a much more devastating consequence on relationship inference for group approaches than for pairwise approaches. For the former, a typing error may not only cause the individual carrying it to be incorrectly assigned into a genetic group, but also affect the assignment of the sibs of this individual. When the individual with a typing error is assigned incorrectly to a sib family, it may drag along with it some of its sibs with similar genotypes into the same false family. On the other hand, typing errors can be potentially identified and accounted for more effectively by group approaches than by pairwise approaches. This is because the multilocus genotypes of a group of individuals serve as mutual references and collectively they provide information about both a given relationship and possible typing errors. The larger the group, the more effectively group-likelihood approaches can detect and account for typing errors in relationship determination.
It is possible to accommodate typing errors in marker data in group-likelihood approaches to sibship reconstruction. If a typing error occurs at a locus of an individual and leads to a genotype incompatible with those of its full-sibs, then the likelihood of the full-sib family is zero no matter how many other loci are correctly typed and thus support the full-sib relationship. When typing errors are allowed for in an appropriate model, however, the likelihood of this full-sib family is always greater than zero and the family can be correctly recovered if genotypes at most loci of most individuals support the full-sib relationship even though one or more individuals are incorrectly typed at one or more loci. In this article, I propose two simple models of typing errors and incorporate them into a group-likelihood approach to sibship reconstruction. I use simulations to show that typing errors can cause severe biases in sibship inference if they are ignored. Yet, sibships can be inferred accurately from data in which typing errors occur at high rates, if typing errors are taken properly into account in estimation. I also propose a novel method based on Bayes' theorem to estimate allele frequencies from samples using the inferred relationships, a method to identify typing errors at each locus of each reconstructed family, and a method to infer parental genotypes. The performance of these methods and their robustness to the violation of some assumptions are checked by extensive simulations. Finally, I apply the proposed methods to two empirical data sets to infer the sibship structures and mating systems.
| METHODS |
|---|
First, I briefly review the factors affecting sibship inference from markers. In particular, the effects of genotype errors and how to correct for them in sibship reconstruction are described. Second, the likelihood functions of a nested half-sib family given population allele frequencies are derived for both diploid and haplo-diploid species. Third, an algorithm to search for the maximum-likelihood configuration of sibship structures for an entire sample is presented. Fourth, a method to estimate population allele frequencies from a sample with the reconstructed sibships accounted for is described. Fifth, I propose a method to detect typing errors at each locus within each reconstructed family in data and a method to infer parental genotypes. Last, I describe the simulation procedures employed to generate simulated data sets and the statistics used to measure the accuracy and precision of the proposed methods in inferring relationships, typing errors, and allele frequencies.
Assumptions and models of typing errors:
To infer sibships from genetic markers without parental information, several assumptions are necessary in either pairwise- (![]()
![]()
![]()
![]()
![]()
![]()
A sample of individuals is assumed to be taken from a single cohort in a large random-mating population. This assumption implies that the genotype frequencies of the parents of sampled individuals can be calculated from population allele frequencies under Hardy-Weinberg equilibrium and that the probability of a given mating type is just the product of the frequencies of the two parental genotypes. It is possible to relax this assumption and incorporate nonrandom mating into the framework, if information about mating system (e.g., selfing rate) is available.
All genetic markers used in sibship inference are assumed to be neutral, unlinked between loci, and in linkage equilibrium. Each marker locus is assumed to have two or more codominant alleles and to follow Mendelian segregation. All of the observed genotypes in a sample are free of errors and mutations, so that they can be trusted completely in inferring sibships.
Here, I follow previous studies in adopting the above assumptions, except for typing errors. In this study, typing errors are broadly defined as any changes in a genotype that could potentially cause incorrect relationship inference. They can come from the inheritance process (e.g., mutations), the genotyping procedure (e.g., miscalling or allelic dropout), and the course of data analyses (e.g., data entry error). The detailed patterns of changes in genotype are different among various kinds of typing errors and may vary among marker types [e.g., microsatellites vs. restriction fragment length polymorphisms (RFLPs)] and samples or studies. Obviously it is difficult or impossible to provide a universal model that reflects the detailed patterns of all kinds of typing errors in all marker data sets. Here I focus on microsatellite markers, which are used widely in ecological, behavioral, and evolutionary studies, and categorize their common typing errors into two classes that are modeled separately.
Class I includes allelic dropouts only. An allelic dropout occurs when PCR fails to amplify one of an individual's two homologous genes (one from each parent) at a locus. If the individual is a heterozygote, then a dropout yields a false homozygote. When dropouts are the sole source of typing errors, an observed heterozygote is always correct but an observed homozygote can be either correct or incorrect. If incorrect, the actual (true) genotype can be any heterozygotes containing the observed allele. For microsatellites, allelic dropouts seem to be the most serious problem (![]()
![]()
1. Ignoring double dropouts at the same locus and individual (which rarely occurs and, if it does, can be easily detected and thus rectified by regenotyping in practice), we obtain the probabilities of 1 2e1, e1, and e1 [where e1 =
1/(1 +
1)] for an actual heterozygote, say A1A2, being observed as A1A2, A1A1, and A2A2, respectively. Error rate
1 is allowed to vary across loci.
Class II includes all kinds of stochastic typing errors other than allelic dropouts. These errors can come from mutations, false alleles (polymerase errors rendering an allele other than the true one), miscalling (allele identification error), contaminant DNA, and data entry. Systematic typing errors, such as misplacing or admixing samples during DNA extraction, which may cause the entire multilocus genotype of an individual to be erroneous, are excluded. Compared with class I, class II errors are usually less frequent and can affect any homozygous or heterozygous genotype. To account for class II errors in data, I assume that the two homologous genes in any individual genotype at a locus are independently and equally likely to be incorrectly observed, with rate
2. I also assume that, for a locus with k codominant alleles, any allele is observed to be any one of the other alleles with an equal probability, e2 =
2/(k 1). This error model is similar to that of ![]()
1,
2 can be variable among loci. Both
1 and
2 are assumed known for each locus in data.
Hereafter, observed and actual genotypes are distinguished and called phenotypes and genotypes, respectively, for simplicity. Genotypes and phenotypes are denoted by G and R, respectively, when subscripts indexing alleles are used. Phenotypes are also denoted by r when subscripts indexing individuals and families are used.
For a locus with k codominant alleles (denoted by Aw with index w = 1, ... , k), there are k(k + 1)/2 possible (ordered) genotypes and phenotypes, Gw,x
AwAx, and Rw,x
AwAx for w
x = 1, ... , k. Taking typing errors of both classes into account, I obtain the transitional probability from a genotype Gw,x to a phenotype Ru,v,
![]() |
(1) |
if Gw,x is a heterozygote (w
x), and
![]() |
(2) |
if Gw,x is a homozygote (w = x). In (1) and (2),
u,v is Kronecker delta variable with values 1 and 0 when u = v and u
v, respectively. In deriving the first two equations in (1), I assumed that class II errors occur after class I errors, because the latter are generally more frequent than the former. A reversed sequence of error events leads to a different formulation of but little numerical difference in Pr[Ru,v|Gw,x] when
1 and
2 are not high, as is expected because the probability of both error events occurring to a single-locus genotype is minute.
The likelihood of a putative half-sib family:
I assume a population of a dioecious species with one sex monogamous and the other sex polygamous. The polygamous sex can be either males or females. A sample of individuals taken from a single cohort in the population may thus contain full-sib families nested within half-sib families. I derive the likelihood of such a half-sib family for a single locus. If markers are statistically independent, the multilocus likelihood of a putative sib family is simply the product of the single-locus likelihoods.
For a locus with k codominant alleles, denote the population frequency of allele Aw by pw (w = 1, ... , k) and the population frequency of the parental diploid genotype Gw,x
AwAx by Qw,x. Under Hardy-Weinberg equilibrium, Qw,x = (2
w,x)pwpx, where
w,x is the Kronecker delta variable with values 1 and 0 when w = x and w
x, respectively.
Consider a putative half-sib family consisting of a number of f putative full-sib families. Suppose, in full-sib family j (j = 1, ... , f), we observe dj distinctive phenotypes {r1,j, r2,j, ... , rdj,j} with corresponding counts {n1,j, n2,j, ... , ndj,j}. Under random mating, the probability of the phenotype data of the putative half-sib family (i.e., likelihood) is
![]() |
(3) |
In (3), the probability of observing an offspring phenotype ri,j (i = 1, ... , dj) given parental genotypes Gw,x and Gy,z is derived from Mendelian segregation:
![]() |
(4) |
For a given offspring phenotype ri,j = AuAv, each term on the right side of (4) is calculated by (1) or (2).
Although (14) are complete for calculating the likelihood of a half-sib family, they require substantial computation. This problem becomes especially important when Monte Carlo techniques are used in searching for the sibship configuration with the maximum likelihood (below) and the markers are highly polymorphic (k large). Computation can be reduced dramatically by considering just the observed alleles and an "allele" pooled over all the unobserved alleles for a putative family.
Suppose a number of mj distinctive alleles are observed in the jth full-sib family in a given putative half-sib family. We pool the k mj unobserved alleles as allele Ak+j+1, whose population frequency is the sum of those of the unobserved alleles. Denote the set of indexes of the mj observed alleles and the pooled unobserved allele by
j. Similarly, for the entire half-sib family, the set of indexes of the m0 observed alleles and the single allele (Ak+1) pooled over all unobserved ones is denoted by
0. By these arrangements, the likelihood function (3) reduces to
![]() |
(5) |
The computational cost of (5) can be a tiny fraction of that of (3) because generally only a small subset of alleles is observed in a sib family.
Note that (3) and (5) are derived for a family with f (f
1) full-sibships nested within a half-sibship. Obviously, they apply to pure half-sib and pure full-sib families, which are just two special cases when f > 1 and each full-sibship has just one offspring and when f
1, respectively. For a pure full-sib family (f
1 and
0
1), the likelihood computational load can be further reduced by using
![]() |
(5) |
instead of (5). The computational burden of (5') is approximately half of that of (5).
Group-likelihood methods can use the phenotype data of individuals in a sample to partition them into sib families without the need for any parental information. However, if parental phenotypes are available, they should be used in sibship inference because they could improve the inference dramatically. Previous group-likelihood methods invariably ignored the use of parental genotypes in sibship inference (e.g., ![]()
![]()
![]()
![]()
![]()
Suppose the phenotype of the parent in the monogamous sex of the jth full-sib family is observed as Ru,v
AuAv. Given Ru,v, the posterior probability of genotype Gs,t (s
t
j) of the parent is calculated as Q*s,t = Qs,tPr[Ru,v|Gs,t]/(
y
j
z
jz
yQy,zPr[Ru,v|Gy,z]) from Bayes' theorem. The sum over all possible genotypes,
y
j
z
jz
yQy,z, of that parent in calculating (5) should then be replaced by
y
j
z
jz
yQ*y,z. A known phenotype of a parent of the polygamous sex can be treated similarly.
Likelihood of a sib family in haplo-diploid species:
For haplo-diploid species, there are two possible scenarios for the hierarchical sibship structure of full-sib families nested within a half-sib family: the polygamous and monogamous sexes are diploid and haploid, respectively, or are haploid and diploid, respectively. Here I consider the first scenario only since the second one can be treated similarly. Assuming sampled offspring are all diploids, the likelihood of a nested half-sib family is
![]() |
(6) |
In (6), Qw,x is the probability of the diploid parent's genotype being Gw,x
AwAx calculated as above, and pz is the probability of the haploid parent's genotype being Gz
Az, which is the population frequency of allele Az. The probability of observing the ith distinctive offspring phenotype in the jth full-sib family (ri,j) given parental genotypes Gw,x and Gz is derived from Mendelian segregation:
![]() |
(7) |
For a given offspring phenotype gi,j = AuAv, the right-side terms of (7) are calculated by (1) and (2).
A phenotype of a diploid parent, if available, can be used in sibship inference in the same way as the diploid case. Any phenotype of a haploid parent can also be incorporated in sibship inference after considering its possible typing errors. Suppose the haploid phenotype of the parent of the jth full-sib family is observed as Au; then the sum over all possible genotypes of this haploid parent in (6),
z
j pz, should be replaced by

derived from Bayes' theorem.
Algorithm for searching the maximum likelihood:
Suppose a number of N offspring are sampled and genotyped to infer their relationships. A particular partition of these N offspring into a number of sib families is called a sibship configuration. The total likelihood of a given configuration is the product of single-family likelihoods, each being calculated as shown above. Any prior information about the distribution of sib family sizes in the sample, if available, can be readily incorporated into the likelihood function.
There are many possible configurations even for a small sample size (N). With N = 10 and possible relationships constrained to either full-sibs or unrelated, for example, there are still 115,975 possible configurations (![]()
![]()
- Generate an initial configuration by allocating those offspring known to be full-sibs to a full-sib family, those known to be half-sibs to a half-sib family, and those with unknown relationships each to a single half-sib family (
THOMAS and HILL 2000 ;
SMITH et al. 2001 ). Calculate and store the likelihood of each sib family in the initial configuration.
- Generate a proposal configuration by changing part of the old one. Changes within and between half-sib families are allowed to occur with an equal probability. For a within-half-sib-family change, a full-sib family, F1, is drawn uniformly from the filled ones (containing at least one individual) in the current configuration. If F1 is known to be a genuine full-sib family (whether the parents' phenotypes are available or not) from another source of information, then it is replaced by another draw. Repeat this process until a full-sib family, F1, without prior information is obtained. Then, draw an integer number uniformly from [1, nF,1] (where nF,1 is the number of individuals in F1) and choose at random that number of individuals from F1. These chosen individuals are to be moved to another family, F2, selected at random from the full-sib families (including an empty one with no individual in it) within the half-sib family from which F1 comes. Like F1, F2 must not be a full-sib family known to be genuine from prior information. For a between-half-sib-family change, a half-sib family, H1, is chosen uniformly from the filled ones. Draw an integer number uniformly from [1, nH,1] (where nH,1 is the number of filled full-sib families in H1) and choose at random that number of full-sib families from H1 to be moved into another half-sib family, H2, chosen at random from the half-sib families (including an empty one with no individual in it) in the current configuration. Similar to within-half-sib-family changes, both H1 and H2 must not be half-sib families known to be authentic from prior information.
- Calculate the old likelihood (Lold) of the parts of the configuration proposed to be changed. For a within-half-sib-family change, Lold is the likelihood of the half-sib family from which F1 and F2 come. For a between-half-sib-family change, Lold is the product of the likelihoods of half-sib families H1 and H2.
- Calculate the new likelihood (Lnew) of the parts of the proposal configuration that have been changed. For a within-half-sib-family change, Lnew is the likelihood of the half-sib family that has been altered. For a between-half-sib-family change, Lnew is the product of the likelihoods of the two half-sib families that have been changed.
- Determine whether to accept or reject the new configuration. Calculate
= Min[(Lnew/Lold)1/T, 1], where T is the annealing temperature governing the rate at which a new configuration is accepted. Generate a random number uniformly distributed between 0 and 1, and compare it with
. If it is smaller than
, the new configuration is regarded as successful and is thus accepted; otherwise, the new configuration is rejected and the old one is recovered. - Repeat steps 25 a sufficiently large number of times. This iterative procedure ensures the likelihood to go uphill in general, but allows it to go downhill occasionally to avoid it being stuck on a local maximum. The probability of a downhill tour is controlled by T, which is decreased as the annealing process proceeds so that a new configuration with a smaller likelihood than the old one becomes less and less frequently accepted. T is set initially at a value of one and reduced in multiplicative steps, each amounting to a 10% decrease. Each new value of T is held constant for 5000N reconfigurations or for 100N successful reconfigurations, whichever comes first. When efforts to improve configurations (increase likelihood) become sufficiently discouraging, the iterative process is stopped and the best configuration with the maximum likelihood is reported.
Estimating population allele frequency:
Sibship reconstruction must use the allele frequencies in the parental population, which are assumed known above. In practice, however, population allele frequencies are generally unavailable and have to be estimated from the sample in which sibships are to be inferred. In other words, usually the only information available is the sample from which we have to deduce population allele frequencies necessary for sibship reconstruction. To better estimate population allele frequencies, sibships in a sample should be taken into account, especially when a sample is dominated by a few large families. Ignoring sibships in a sample leads to overestimates of population frequencies for the alleles present in large families, which results in the likelihoods of large families being too small and those of small families being too large. A possible consequence is that large families tend to split into smaller ones (![]()
![]()
![]()
Here, I propose a simple method to estimate allele frequencies of the parental population by using likelihood rather than family size as the weight. Consider, as an example, a half-sib family consisting of f full-sib families in a diploid species and the phenotypes of the f + 1 parents are unavailable. The count of an allele, Au, in parent s (indexed as s = 0 for the polygamous parent and s = 1, ... , f for the sth parent of the monogamous sex) can be estimated from Bayes' theorem as
|
(8) |
where L is calculated by (5). For an unobserved allele, Av, pooled into allele Ak+s+1, we first estimate the count
of the pooled allele by (8). The count of Av in the sth (s = 0, 1, ... , f) parent is then estimated by
|
(9) |
where
v and
k+s+1 are the estimated population frequencies of Av and Ak+s+1 before updating. Estimate allele counts for each parent in each putative half-sib family in the current configuration, and update population allele frequencies by the mean of the estimated allele counts across parents. The computational load of (8) and (9) is minimal, because all quantities in them are already known from the calculation of L.
For half-sib families with partially known parental genotypes, and for the case of haplo-diploid species, population allele frequencies are estimated similarly, using Bayes' theorem and the corresponding likelihood functions.
Population allele frequencies are estimated initially by the method from the initial configuration and updated periodically after a certain number of successful reconfigurations. Because of the minimal computational cost of the proposed method, it is possible to update allele frequencies after each reconfiguration. However, it is usually unnecessary to update so frequently because a few improvements on the configuration do not change allele frequencies much (![]()
Identifying possible typing errors:
The group-likelihood approach shown above also allows us to identify possible typing errors in data that occurred at each locus within each reconstructed sib family. From the best configuration with the maximum likelihood that is finally rendered by the method, we can calculate, for each family and each locus, the likelihoods considering both class I and II errors (L), class I errors only (L1, setting
2 = 0), class II errors only (L2, setting
1 = 0), and no typing errors (L3, setting
1 =
2 = 0). A likelihood-ratio test can then be carried out to screen the most likely hypothesis. Allelic dropouts are inferred to have occurred at a locus in a family when L1 calculated for the given locus and family is significantly larger than L3, for example.
Obviously, not all typing errors are identifiable. If a typing error causes little change in family likelihood, then it is unlikely to be detected. The power of error detection also depends critically on family size. In the extreme case of a family containing just a single individual, it is impossible to ascertain typing errors. Therefore, the typing errors identified should be treated as conservative.
Inferring parental genotypes:
From the offspring phenotypes and the reconstructed sibships, we can also infer the parental genotypes using Bayes' theorem. As an example, consider a half-sib family consisting of a number of f full-sib families in a diploid species. The posterior probability of a parental genotype, Gu,v, given data and the reconstructed sibships, is
![]() |
(10a) |
for the polygamous parent and is
![]() |
(10b) |
for the sth parent of the monogamous sex, where L is the family likelihood calculated by (3). The maximum-likelihood estimate of a parental genotype is the one with the maximal posterior probability. For a parent with observed phenotypes, its actual genotypes can be inferred similarly.
As is intuitively obvious, parental genotype inference relies heavily on the correctness of sibship reconstruction. It is likely to be inaccurate for an incorrectly reconstructed sib family. Even for a correctly reconstructed family, the inferred parental genotypes are not guaranteed to be correct, especially when family size is small. For a pure full-sib family (f = 1) in a diploid species, it is impossible to resolve the male and female parental genotypes no matter how much marker information is available. The parental genotype combination (
x
), A1A2 x A3A4, has exactly the same posterior probability as A3A4 x A1A2, for example. In such situations, one has to genotype one of the parents to infer accurately the genotypes of both. For a pure full-sib family in haplo-diploid species, the power of parental genotype inference is also reduced if the diploid parent is homozygous at a locus. In any case, the reliability of an inferred parental genotype is indicated by its posterior probability. The higher this value is in comparison with those of alternative genotypes, the higher the confidence we have in it.
Simulations:
To assess the precision and accuracy of the proposed methods and their robustness when some assumptions are violated, I generate simulated data with known parameters by Monte Carlo, reconstruct sibships from the simulated data by the proposed methods, and measure (below) the fit between the true and estimated sibships. Different combinations of parameter values are used in simulations to check the performance of the methods and to investigate the effects of different factors on the estimation. Full-sib family sizes in a sample are assumed to follow either a Poisson distribution with parameter
or a negative binomial distribution with parameters
(probability of success) and
(number of successes). For both distributions, families with no offspring are obviously not represented in a sample. The mean and variance of full-sib family sizes in a sample are therefore
/(1 e
) and e
(e
1
)
/(1 e
)2, respectively, for the Poisson distribution and (1
)
/(
(1 
)) and (1
)
(1 
(1 + (1
)
))/(
(1 
))2, respectively, for the negative binomial distribution. The number of full-sib families nested within a half-sib family is also assumed to follow a Poisson distribution. For a given family, parental genotypes are generated using population allele frequencies under random mating and Hardy-Weinberg equilibrium, and offspring genotypes are generated from parental genotypes following Mendelian segregation. These parents and offspring genotypes are then changed, at a given rate, following the models of class I and II typing errors to give their corresponding phenotypes. The phenotypes are then taken as observed data, which are used in sibship reconstruction. For a given parameter combination, 50 independent data sets are generated and analyzed.
Statistics measuring the performance of the estimation:
Several statistics can be employed to measure the fit of the reconstructed to the actual sibships in simulated data. A stringent measurement of the overall fit is the number of full-sib (
FS) and half-sib (
HS) families being completely recovered relative to the actual numbers in a sample. Obviously
FS = 1 and
HS = 1 for a sample containing full-sib families only and half-sib families, respectively, mean the reconstructed sibships are perfect, with no individual being incorrectly assigned a relationship with any other individual. To gain insight into the causes of an imperfect sibship reconstruction (
FS < 1 or
HS < 1), I examine the statistic P(a|b), the proportion of dyads assigned relationship a when their actual relationship is b (![]()
To assess the accuracy of allele frequency estimates, I use the square root of mean-squared deviation (RMSD) of estimates from true frequencies of all alleles within and between loci. The power of the method for detecting typing errors is measured by the proportion of typing errors being correctly identified (
1 = number of correctly detected errors/total number of detected errors) and the proportion of typing errors being detected (
2 = total number of detected errors/total actual number of errors) across loci and reconstructed families in a sample. For parental genotype inference, I use the average proportion of parental genotypes being correctly inferred (
) to measure the accuracy of the method. For a single diploid parent,
1, 1/2, and 0 if 2, 1, and 0 of its alleles at a locus are correctly inferred, respectively. For a haploid parent,
1 and 0 when its genotype is correctly and incorrectly inferred, respectively. When a parental genotype is unresolved for a pure full-sib family,
is calculated as the mean of the
values calculated for the two alternative genotypes inferred.
is then calculated as the average of
across the two parents of all the sampled individuals and across loci.
| RESULTS |
|---|
Simulation results:
Number of loci:
Assuming typing errors of both classes occur at rate 0.05 at each locus, I generated simulated data that were then analyzed comparatively with typing errors ignored and taken into account, respectively. The proportions of correctly assigned full-sib pairs (P[FS|FS]) are shown in Fig 1 as a function of the number of loci used in estimation. The proportions of correctly assigned unrelated pairs (P[NS|NS]) are not shown because they are always close to 1 regardless of the number of loci and whether typing errors are ignored or not. Typing errors, if ignored in sibship reconstruction, lead to true full-sibs showing sib-incompatible phenotypes and thus to a full-sib family being broken up into several smaller ones. With an increasing number of loci used in sibship inference, both information and noise due to typing errors increase. However, the impact of typing errors overwhelms that of information, and, as a result, sibship inference becomes increasingly inaccurate with an increasing number of marker loci used in estimation. This is understandable because no matter how many loci are correctly typed and thus support a true sib family, it still breaks up into two families in reconstruction if one typing error occurring at a single locus in a single individual leads to a phenotype incompatible with others as sibs. The total multilocus likelihood for a group of individuals as a sib family is the product of single-locus likelihoods and is zero if a single-locus likelihood is zero. When typing errors are accounted for in estimation, P[FS|FS] increases rapidly with an increasing number of loci and the full-sib relationships of a sample are completely reconstructed (
FS = 1) once the number of loci is approximately equal to eight. The large impact of typing errors shown in Fig 1 highlights the importance of accounting for typing errors of data in group-likelihood approaches to relationship inference.
|
Rate of typing errors:
The effect of the rate of typing errors in data on relationship inference is shown in Fig 2. When typing errors are ignored in estimation, P[FS|FS] declines rapidly with an increasing rate of typing errors in data. An error rate as low as 0.001, which is possible from mutations alone for microsatellites (e.g., ![]()
1 =
2 = 0.256. Even if in such situations where an observed phenotype is more likely to be erroneous than correct, 78% of the full-sib pairs are correctly identified and 40% of the full-sib families are fully reconstructed for a haplo-diploid species when typing errors are accounted for. In contrast, P[FS|FS] and
FS are only 2 and 4%, respectively, when typing errors are ignored in estimation.
|
The rate of typing errors that the method can tolerate to yield satisfactory estimation depends on the amount of information available from data (number of marker loci and alleles per locus) and actual family sizes. With an increasing amount of marker information and/or family size in data, the method can cope with an error rate >0.256 as shown in Fig 2. In practice, probably no data are so dirty.
From Fig 1 and Fig 2, we can see that typing errors have a greater impact on sibship inference for haplo-diploid species than for diploid species. This is because typing errors result in a larger probability of false exclusion of sibships in haplo-diploid than in diploid species.
Updating allele frequencies: Fig 3 depicts the impact of updating allele frequencies on estimating relationships and allele frequencies. As is expected, the benefit from updating allele frequencies increases with an increasing imbalance (variance) in family sizes (Fig 3A). When allele frequencies are not updated, large families tend to split into smaller ones, resulting in some full-sibs being incorrectly assigned as unrelated. With a variance of family size of 50 in Fig 3, for example, the proportions of full-sib pairs being inferred as unrelated are 2.0 and 6.4% when allele frequencies are and are not updated, respectively. The gain from the updating procedure in estimating allele frequencies also increases with an increasing variance in family size (Fig 3B).
|
Robustness of the models of typing errors:
In the above, the rates of typing errors (
1 and
2) actually employed in generating simulated data are used in sibship inference. In application, usually
1 and
2 are unknown but are guessed from prior information or estimated by repeated genotyping (![]()
1 and
2 is obviously of concern for practical applications. In Fig 4, the data are generated with
1 =
2 = 0.05 but are analyzed assuming various values of
1 and
2. As can be seen, the accuracy of sibship inference (indicated by P[FS|FS] and
FS) is quite high even though the assumed values of
1 and
2 deviate over several orders from their true value (0.05) used in simulation. Full-sibs tend to be assigned as unrelated and unrelated individuals tend to be assigned as full-sibs when the assumed error rate is much smaller and much larger than the actual value, respectively. All such incorrect assignments occur at a very low frequency, however, even if the assumed error rates are many times greater or smaller than the true values. It seems that accurate sibship inference can be obtained using a wildly guessed, rather inaccurate rate of typing errors provided sufficient information is contained in data. A similar conclusion was reached by ![]()
![]()
|
The error model assumed that typing errors occur independently across loci within an individual. This assumption can be violated if DNA quality, quantity, or both vary considerably among individuals. When DNA is extracted from noninvasive sources such as hair and feces or from ancient material such as bones and scales, for example, both its quantity and quality can be highly variable among individual samples, resulting in significantly different error rates between individuals (e.g., ![]()
and using it in generating typing errors of both classes across loci for a given individual. Therefore, the larger the value of
, the higher the intra-individual correlation of error occurrences among loci. The effect of
on sibship inference is shown in Fig 5, where P[FS|FS] and
FS are plotted against
. As can be seen, the model is robust to moderate levels of variation in typing error rate among individuals. With an increasing value of
, the proportion of individuals carrying erroneous genotypes at almost all loci increases. For these individuals, it is obviously impossible to correctly assign relationships between one of them and any others in the sample. When
1 =
2
0.25, an individual's phenotype is more likely to be erroneous than correct at each locus. The proportions of such individuals are
3.3 and 27% for
= 0.1 and
= 0.2, respectively, in the simulated data sets.
|
With miscalling or mutations in the single-stepwise model for microsatellites, a typing error usually involves a single tandem repeat change and an allele is more likely to be observed if its size is closer to that of the actual allele. In heterozygotes, larger alleles may be more likely than smaller alleles to drop out. Such size-dependent dropouts may bias allele frequency estimation, which may further affect sibship inference. Families with partially known parental genotypes may suffer a smaller rate of typing errors than families with no parental information available, because in the former case some typing errors may be identified and corrected using the known parent-offspring relationship before the genotype data are analyzed for sibship inference. Simulated data were generated following these error patterns but analyzed using the proposed simple error models. In all cases considered, accurate inference of sibships was obtained (results not shown) when typing error rate was not very high (say, <0.15), indicating that the proposed models of typing errors are quite robust. This is not surprising given the results in Fig 4 and Fig 5.
Hierarchical sibship structures and sample sizes: Fig 6 shows the proportions of the actual full-sib and half-sib pairs being assigned different relationships by the estimator applied to samples of various sizes and composed of full-sib families nested within half-sib families. The assignments of unrelated pairs are almost perfect [P(NS|NS) very close to 1] for various sample sizes and are omitted from the figure. Sibship inference becomes increasingly inaccurate with a decreasing sample size once it becomes very small (<50) and with an increasing sample size once it becomes very large (>800). The former is due to the fact that allele frequencies used in sibship inference are less accurately estimated with a smaller sample size. The latter is caused mainly by the increasing probability with sample size that individuals in a sibship could have, by chance, disparate albeit compatible genotypes and thus the sibship may be split in reconstruction. The magnitude of the effect of sample size on sibship inference depends on the amount of marker information and family size.
|
Overall, sibships are quite accurately inferred for haplo-diploid species using only five microsatellites, with at least 95% full- and half-sib pairs being correctly inferred when sample size is
50800. Even if the sample size is as large as 1600, P(FS|FS) and P(HS|HS) are still >92% in the examples shown in Fig 6. Sibship inference is much less accurate for diploid than for haplo-diploid species, as expected. The contrast is especially evident when the sample size is very large or small. Obviously, the amount of marker information (five microsatellites, each with 10 alleles at equal frequency) is insufficient for accurate sibship inference in diploid species. In Fig 6, P(FS|FS) and P(HS|HS) are 0.51 and 0.38, respectively, when N = 1600 for a diploid species. When the number of loci used in the estimation is increased to 10 loci, the corresponding values are 0.97 and 0.98, respectively.
Identifying typing errors and inferring parental genotypes:
The proportion of typing errors being correctly detected (
1) by the likelihood method is generally high. For example,
1 > 0.99 for various error rates assumed in Fig 2,
1
0.910.92 and
1 > 0.96 when the number of loci used in sibship inference is
2 and >2, respectively, in Fig 1.
1 > 0.99 is obtained even though the assumed error rate is several orders larger or smaller than the actual value (Fig 4) or the error model assumptions are violated (Fig 5). On the other hand, the proportion of overall typing errors detected (
2) by the likelihood method is generally low, being <80% in simulations shown in Fig 1 Fig 2 Fig 3 Fig 4 Fig 5. This is not surprising because some typing errors cause no or little change in likelihood and are thus not detectable. These results indicate that a typing error identified by the method is highly likely to be genuine, but not all typing errors are identifiable.
The inference of parental genotypes is generally less accurate than sibship inference and typing error detection (
1). This is because it relies on correct sibship reconstruction and sometimes male and female parental genotypes are unresolved for full-sib families. In Fig 1, for example, the proportion of parental genotypes being correctly inferred (
) is
80 and 50% for haplo-diploid and diploid species, respectively, when three to six loci are used in estimation. The accuracy of parental genotype inference improves with nested half-sib families, larger family sizes, and more marker information. When eight loci, each having 10 alleles of equal frequency and a typing error rate of 0.05, are used in estimating the relationships of 100 offspring coming from nested half-sib families with the numbers and sizes of full-sib families being in Poisson distributions with parameters of 3 and 5, respectively,
is 96 and 80% for haplo-diploid and diploid species, respectively.
Applications:
The method developed in this study has been applied to estimating the number of colonies of two bumble bee species (Bombus terrestris and B. pascuorum) whose workers visit and use a given foraging site (![]()
Analysis on an ant data set:
The data set is from a study on the mating frequency of an ant species, Leptothorax acervorum (![]()
The likelihood method completely reconstructed the sibships of the sampled 377 workers, using their phenotype information only, without a single worker being assigned an incorrect relationship with any other worker. The 100% successful assignments (
FS = 1) were obtained with a wide range of possible typing error rates (
0.0010.40) assumed in the analyses. However, if typing errors are ignored by setting the error rate as zero, only 6 colonies are fully recovered (
FS = 60%) and each of the remaining 4 colonies is split into 2 colonies, resulting in a total number of 14 reconstructed colonies and P[FS|FS] = 0.96. The split of the four colonies is due to typing errors. Indeed, a typing error at a single locus in each of the four colonies is identified and reported by the analysis when typing errors are accounted for. Among the four typing errors identified, three can be verified because the observed data show Mendelian inconsistency (e.g., four or more alleles at a locus are observed among workers from a single colony). The other typing error is highly supported by the original data, if Mendelian segregation applies to the locus and colony.
The analysis also inferred the parental genotypes at each locus for each reconstructed family. In total, 67 single-locus parental phenotypes are available from this data set. If these observed phenotypes are completely correct, the numbers of correctly, incorrectly, and partially (i.e., only one allele correctly inferred for a queen) recovered single-locus parental genotypes are 63, 2, and 2, respectively. The two incorrectly inferred genotypes are at the same locus of a queen and its mate, and the queen is a homozygote. The posterior probability of these two inferred genotypes is 0.54, and that of the alternatively inferred genotypes, which are in full agreement with observations, is 0.46. The two partially recovered parental genotypes occur in the smallest family containing seven offspring in the sample.
The changes in log-likelihood and the number of full-sib pairs as a function of the number of iterates (reconfigurations) during the annealing process are shown in Fig 7 for five independent analyses on the data set. The five replicate runs are carried out in the same conditions using an error rate of 0.05 for both classes of errors at each locus, except that different seeds for the random number generator and different initial configurations are adopted. When typing errors are allowed for at each locus, it is possible to start the simulated annealing algorithm in searching for the maximum-likelihood estimates from an arbitrary initial configuration. As can be seen, all runs converge to the same configuration with the same maximum likelihood after only
106 iterates, indicating the annealing procedure adopted is powerful and well converged. The same results are obtained assuming different values of error rate in the range of
0.0010.4. This differs from ![]()
|
Analysis on a turtle data set:
The data set comes from a study on detecting multiple paternity in the Kemp's ridley sea turtle (![]()
![]()


















