Abstract
Studies of genetics and ecology often require estimates of relatedness coefficients based on genetic marker data. However, with the presence of null alleles, an observed genotype can represent one of several possible true genotypes. This results in biased estimates of relatedness. As the numbers of marker loci are often limited, loci with null alleles cannot be abandoned without substantial loss of statistical power. Here, we show how loci with null alleles can be incorporated into six estimators of relatedness (two novel). We evaluate the performance of various estimators before and after correction for null alleles. If the frequency of a null allele is <0.1, some estimators can be used directly without adjustment; if it is >0.5, the potency of estimation is too low and such a locus should be excluded. We make available a software package entitled PolyRelatedness v1.6, which enables researchers to optimize these estimators to best fit a particular data set.
PAIRWISE relatedness is central to studies of population genetics, quantitative genetics, behavioral ecology, and sociobiology (e.g., Mattila et al. 2012). The relatedness coefficient between individuals can be calculated from a known pedigree (Karigl 1981). In the absence of pedigree information, this coefficient can be estimated by using genetic markers.
When codominant genetic markers are used to estimate relatedness, some genotyping errors may occur. For example, in polymerase chain reaction-based markers, allelic dropout, false allele, and null allele errors are common, especially with microsatellites. In allelic dropout, the low quality of the DNA template inhibits the amplification of one of the two alleles in a heterozygote, causing the heterozygote to be observed as a homozygote (Taberlet et al. 1996). False alleles are artificial alleles produced during amplification, resulting in homozygotes mistyped as heterozygotes (Taberlet et al. 1996). Null alleles are alleles that cannot be detected because of mutation, often within the primer site (Brookfield 1996). These incorrect genotypes can cause serious problems in the estimation of genetic relationships. For instance, in parentage analysis, genotyping error can mistakenly reject the true father due to an observed lack of shared alleles with the offspring (Blouin 2003). Similarly, when pairwise genetic relatedness is estimated, genotyping error can cause true relatives to be omitted. This is the subject of this article.
The first two error types can be eliminated by multiple-tube methods because the errors appear at random (Taberlet et al. 1996; He et al. 2011). Genotyping errors due to null alleles cannot be resolved in this manner. Null alleles are pervasive in microsatellite markers and many studies report null alleles (e.g., Kokita et al. 2013). Solutions for dealing with null alleles are limited, but when null alleles are detected, they can be eliminated by redesigning the primers or by developing new microsatellite loci. However, both solutions involve additional time and material costs and are not freely available (Wagner et al. 2006).
There are a number of estimators that have been developed using different assumptions, each with varying efficiencies under different conditions. A single estimator cannot fulfill the requirements of all studies. In general, estimators can be classified into two categories: (i) method of moment and (ii) maximum likelihood.
Method-of-moment estimators equate sample moments with unobservable population moments and are generally unbiased but they are not optimal in terms of statistical efficiency (Huang et al. 2015a). These estimators have been developed to estimate only the relatedness coefficient (e.g., Queller and Goodnight 1989; Li et al. 1993; Loiselle et al. 1995; Ritland 1996) or both relatedness and four-gene coefficients simultaneously (i.e., Lynch and Ritland 1999; Wang 2002; Thomas 2010).
The maximum-likelihood approach models the probability of observing a specific pairwise allele pattern, given two “higher-order” coefficients (φ and Δ) and allele frequencies (Milligan 2003; Anderson and Weir 2007). By searching the parameter space for values of φ and Δ that maximize the probability of the genotype pattern observed, maximum-likelihood values can be determined for these parameters.
Here, we modify four existing relatedness estimators (Lynch and Ritland 1999; Wang 2002; Anderson and Weir 2007; Thomas 2010). We also present two new estimators based on two of these previously published works (Lynch and Ritland 1999; Wang 2002) to estimate relatedness coefficients using codominant markers with null alleles. First, we compare the performances of all six estimators before and after correction, to evaluate the influence of null alleles on each estimator under different frequencies of null alleles. Second, we calculate the minimum number of loci needed to achieve a predefined criterion of statistical power. Finally, we simulate a finite population with inbreeding and genetic drift to imitate natural conditions and evaluate the efficiency of each estimator. We make available the software entitled polyrelatedness v1.6, to help other researchers calculate and compare these estimators and simulation functions.
Theory and Modeling
In this article, we adopt the identical-by-descent (IBD) definition of relatedness, which is the probability that an allele sampled from one individual at a locus is IBD to an allele from another individual. This definition is often asymmetric in inbred populations (the two reciprocal relatedness coefficients from different individuals may be unequal), with the geometric mean being equivalent to Wright’s (1921) definition in diploids (Huang et al. 2015b). The genetic correlation between a pair of individuals (x and y) sampled from an outbred population has three modes: (i) each allele in x is IBD to one allele in y, (ii) a single allele in x is IBD to one allele in y, and (iii) no alleles in x are IBD to any allele in y. If the probabilities of occurrences of the first and second events are denoted as Δ (four-gene coefficient) and φ (two-gene coefficient), respectively, the relatedness coefficient r between individuals x and y can be expressed as (1)For example, in an outbred population, parent–offspring pairs share an IBD allele with a probability of 1, so
and
full sibs share one or two IBD alleles with the probability
or
, so
and
. Most relatedness estimators are assumed to have the following conditions: (i) a population is large, outbred, and panmictic; (ii) the locus is autosomal and inheritance Mendelian; and (iii) the loci used are unlinked and are not in linkage disequilibrium (i.e., Lynch and Ritland 1999; Wang 2002; Milligan 2003). Under these assumptions, the presence of alleles and genotypes is random and concurs with the Hardy–Weinberg equilibrium. Thus, the probability of occurrence for a genotype (i.e., Wang 2002; Milligan 2003) or a genotype conditioned on another genotype (i.e., Lynch and Ritland 1999) can be expressed as a function of Δ and ϕ. Subsequently, linear algebra methods (method-of-moment estimators) or numerical algorithms (maximum-likelihood estimators) can be used to solve
and
.
The presence of null alleles can be realized in two ways: (i) the observed pairwise genotypes will become either more or less similar than their true values, and (ii) the sum of the frequencies of visible alleles becomes unified, but the ratio among frequencies of visible alleles is unchanged. These two effects can cause an over- or underestimation of the relatedness coefficient. Some methods can estimate the frequency of null alleles (e.g., Brookfield 1996; Vogl et al. 2002; van Oosterhout et al. 2004; Kalinowski et al. 2006; Hall et al. 2012; Dabrowski et al. 2013). We thus model the probability that a genotype pair (Anderson and Weir 2007) or a genotype pattern (Lynch and Ritland 1999) or a genotype pair pattern (Wang 2002; Thomas 2010) is observed under the assumption that the true frequencies of alleles are already known. Here, we describe briefly six estimators and give a correction of null alleles for each estimator.
Data availability
File S1 contains (i) derivations of Expressions (3a), (3b), (9a) (9b), (12), (14a) and (14b), (ii) derivations of . and
of corrected Lynch and Ritland's (1999) estimator, (iii) derivations of biases of method-of-moment estimators before and after correction, and (iv) a summary of expressions of method-of-moment estimators.
Lynch and Ritland’s (1999) Estimator
Lynch and Ritland’s (1999) estimator models the probability of an observed proband genotype conditioned on a reference genotype. To classify the degree of similarity for a pair of genotypes, this estimator employs a similarity index denoted by S and defined by (2)where
denotes the ith allele, and
,
, and
are different. By modeling the probability of each observed genotype (i.e.,
) when the reference genotype (i.e.,
) is homozygous or heterozygous, the following systems of linear equations with Δ and φ as the unknowns are established (for the derivation, see Supporting Information, File S1): for the case of homozygotes,
(3a)for the case of heterozygotes,
(3b)Here
and
denote the allele frequencies of
and
, respectively, and
is the sum of the frequencies of visible alleles that do not appear in the reference genotype. Expression (3a) or (3b) is simply written as the matrix equation
(4)which is the general form of all moment estimators that we subsequently describe, where the elements of E are the probabilities that certain genotype patterns are observed conditional on zero relatedness, the elements of the sum of E plus the first column (or second column) of M are the probabilities that there are two pairs of IBD alleles (or one pair of IBD alleles) between x and y, and Δ is the column vector consisting of the higher-order coefficients Δ and φ. The estimate
can be solved from the formula
(5)where
is the observed value of P, and V is the variance–covariance matrix of P, assuming x and y are nonrelatives. For the case of homozygotes, because M is a square matrix with order two, it can be seen from Equation 4 or Equation 5 that
. Moreover, under the assumption that x and y are nonrelatives, V can be used to find the weighted least-squares solution of Equation 4.
The elements of V are (6)where
is the ith element of E, and
for the case of homozygotes, as well as
for the case of heterozygotes. Then, using Equation 5, we can calculate the analytical solution
. On the other hand,
can be solved from Equation 1.
Lynch and Ritland (1999) introduced the Kronecker operator . This is defined as
if
and
are identical-by-state; otherwise
Using the Kronecker operator, the expressions of
and
in homozygous and heterozygous reference genotypes can be unified to write as
where
and
denote the frequencies of the two alleles in the reference genotype. The locus-specific weights
and
are defined by the inverses of the variances of
and
, respectively. However, the two variances (
and
) are functions with r and Δ as the independent variables, in which r and Δ are the quantities we are attempting to estimate. Due to a lack of a priori information, Lynch and Ritland (1999) assumed that the two individuals are nonrelatives and then used the variances of
and
to weight the locus. The weights
and
of each locus are given by
The weighted averages of
and
(denoted
and
) for all loci used in the estimation can be calculated by averaging the estimates of all loci by using the inverses of the two variances (
and
). Because there is no particular reason to choose the individual x or y as a reference,
and
can also be calculated by using another individual as a reference. The final estimates
and
are the arithmetic means of both parameters:
When a null allele (i.e.,
) is present, the heterozygote
will be mistakenly identified as
(where
represents the corresponding quantity of · with the presence of null alleles), whereas the genotype
will remain undetected and regarded as a negative result (no bands detected from electrophoresis). Negative results may also be obtained due to other reasons (e.g., operational errors), with null alleles being indistinguishable from homozygotes. Therefore the unobservable genotype should be discarded from the probability of the observed genotype pattern. These facts can be summarized by the potential genotype patterns of
where
and
are the ith and jth possible genotypes of x and y, respectively;
is the probability that
is the true reference genotype on condition that
is observed; and
is the probability that
is
given
and assuming that
is observable. For example, if
and
, letting ρ be
, then
Denote
for the corresponding column vector of P on the left side of expression (3a) or (3b), with the presence of null alleles and
being visible [e.g.,
for homozygous reference individual is
]. It is clear from the previous example that the ith element
of
can be written as
(7)where
and
(
) are polynomials with
and
as the variables [such as
,
, and
]. Unfortunately, although
can be solved from expression (7) and Equation 5, the estimator becomes more biased because the relationship between Δ and
is not linear. As an alternative, the following approximate form of
for the estimation in expressions (3a) and (3b) can be used:
(8)By extracting Δ and ϕ, expression (8) can be rewritten as
or equivalently,
It is now clear that
can be expressed approximatively as
, where the matrices
and
are as follows: for the case of homozygotes,
(9a)for the case of heterozygotes,
(9b)in which
is the partitioned matrix with two columns and every column consisting of
[the derivation of expressions (9a) and (9b) is shown in File S1]. Replacing E by
and M by
in Equation 5, the following formula follows,
where
is the observed value of
, and the elements in
[refer to system (6) of equalities] are
Applying the above formula, an estimate of
can be calculated. Moreover, using Equation 1 and the elements of
, a less biased
of a single locus can be found. The weights across loci are still the inverses of the variances of
and
. The symbolic expressions of the variances for
and
are complex and are presented in File S1.
Novel Estimator A
This estimator is a modification of Lynch and Ritland’s (1999) estimator, but it does not directly use a probability matrix to solve the corresponding system of linear equations. Instead, the method-of-moment approach is used. The definition of the similarity index S is stated in expression (2). The similarity index S as a random variable determines a probability mass function whose values are listed as follows: for the case of homozygotes,for the case of heterozygotes,
These two systems of functions can be unified as
. The symbol S is used to denote the moment vector consisting of the first and second moments of S. Then the relationship between S and Δ can be described as
(10)where
Expression (10) is actually a system of linear equations with two unknowns (i.e., Δ and φ) and two equations.
With the presence of null alleles, using the same approximation as before, the matrices E and M in expression (10) are modified and written as the following matrices and
for the case of homozygotes,
for the case of heterozygotes,
Substituting
for S,
for E, and
for M in expression (10), we now derive an approximate expression as
(11)where
is the moment vector consisting of the first and second moments of
. Then, the estimate
can be calculated from the following formula:
On the other hand,
can be calculated from Equation 1. The locus-specific weights are still the inverses of the variances of
and
, and the remaining procedure is the same as Lynch and Ritland’s (1999) estimator.
Wang’s (2002) Estimator
Wang’s (2002) estimator also uses the similarity index S as defined in expression (2), but differs from Lynch and Ritland’s (1999) estimator. Wang’s (2002) estimator does not use reference and proband individuals and is obtained by modeling the probability of each similarity index observed. For example, for the event , the probability
is defined by
where
,
(especially
because
). Generally, the following system of linear equations can be established,
(12)where
,
, and
[the derivation of expression (12) is shown in File S1]. Expression (12) can also be expressed as Equation 4, and
can be solved from Equation 5. However, the symbolic solutions of
and
are long (see Wang 2002, for details). The weighting of Wang’s (2002) estimator differs from that of most other estimators. The locus-specific weight in Wang’s (2002) estimator is defined by the inverse of the expected value of the similarity index and is used to calculate the weighted averages of
,
, and
and to simultaneously solve
. The expected value
is given by
(Li et al. 1993; Wang 2002).
If null alleles appear, expression (12) will be modified. For example, for the event , this is given as
where
is the probability that the genotypes of both individuals are visible, in which
,
, and
. From the previous example, the probability for
being equal to a specific value is the sum of the probabilities of a series of genotype pairs. To denote such a probability by symbols, we let
be the sum
then
is the counterpart of
when null alleles are present. In general,
differs slightly from
, and their relationship is described as
. For instance, let ρ be the sum
. Then
The result of this calculation can also be expressed as
It is now clear that the probabilities of
, and
can be unified as the formula
(13)where
(14a)
(14b)The derivation of expressions (14a) and (14b) is shown in File S1. Similar to expression (7), the estimation
solved by expression (13) is heavily biased. As an alternative, the approximate form of Equation 4 is used for estimation,
(15)where
(16)Replacing E by
and M by
in Equation 5, the following formula is obtained,
where
is the observed value of
and the elements in
are
Now, the estimate
can be calculated from the above formula, which has a smaller bias. There are more parameters that should be weighted, including
(
),
,
,
,
,
, and
. The weight of each locus is the inverse of the expected value of S for nonrelatives, which is given by

Thomas’s (2010) Estimator
Thomas’s (2010) estimator is a modification of Wang’s (2002) estimator. In Thomas’s (2010) estimator, the two events and
are combined into a single event (i.e., the event
or 1/2). Then, the second and third rows of
, E, and M in Wang’s (2002) estimator are combined as the second row of P, E, and M in Thomas’s (2010) estimator, respectively. This results in
Now, the estimate
can be solved from Equation 5. The weighting scheme differs from that of Wang’s (2002) estimator. In Thomas’s (2010) estimator, for each locus,
and
are obtained by Equations 1 and 5, respectively. The averages of
and
for all loci are obtained from the following two approaches: (i) the inverse of the expected value of the similarity index is identical to that of both Wang (2002) and Li et al. (1993) and (ii) the inverses of the sampling variances of
and
are equal to those of Lynch and Ritland’s (1999) estimator.
With the presence of null alleles, the probabilities of , and 1/2 can also be expressed as expression (13); namely
, where the matrices
and
are
This result is similar to Wang’s (2002) estimator, the solution being overly biased if
is solved directly. The approximation method for the correction of Wang’s (2002) estimator can also be applied to regulate this estimator [refer to expression (15)]. The weighting method is the same as that of Wang; the inverse of the expected value of the similarity index is calculated by Equation 17, while the inverse of the variance is obtained from the formula
.
Novel Estimator B
This estimator is also based on Wang’s (2002) method and uses the moment vector consisting of the first and second moments of the similarity index S to solve . The probability mass function of S is stated in expression (12), which can also be denoted as Equation 4. Hence, the relationship between the moment vector S and Δ can be established [which is identical to expression (10)], and
can be solved from Equation 5. The weight of each locus is defined by the inverse of the expected value of the similarity index, and the weighting scheme follows that of Wang (2002).
With the presence of null alleles, the probability mass function of is identical to expression (13), where
and
are listed in expressions (14a) and (14b). With the same approximation as expression (15), expression (11) is established, where
and
are given in expression (16). Therefore,
can be solved from Equation 5, and the weight across loci is defined by expression (2).
Maximum-Likelihood Estimators
Jacquard (1972) described nine IBD modes, denoted by (see Figure 1), that summarize the possible IBD relationships among the set of four alleles possessed by two diploids. The probability that a pair of individuals is in IBD mode
is denoted by
, so
, and
and
are equivalent to Δ and φ, respectively. Because IBD alleles appear in the same individual,
are IBD modes of inbreeding. Therefore, the relatedness coefficient is given by
(18)In an outbred population, Equation 18 is reduced to Equation 1.
Modes of identity-by-descent between two diploids. In each plot, the two top circles represent the two alleles in one individual, whereas the bottom circles represent the alleles in the second individual. The lines indicate alleles that are identical-by-descent.
There are also nine IBS modes, denoted by , which are interpreted by using different definitions of identical-by-state (identified by separated lines in Figure 1). Unlike method-of-moment estimators, the maximum-likelihood estimator models the probability of an observed IBS mode conditioned on each IBD mode (for details see table 1 in Milligan 2003).
The single-locus likelihood L is the probability that the genotype pair has been observed given a value of Δ, and L can be described by the formulaFor multilocus estimation, the global likelihood is the product of the whole single-locus likelihoods. Maximum-likelihood estimators assign implicitly the weights for each locus. The parameter space is Δ in which
,
, while in outbred populations, the first six kinds of IBD alleles cannot appear in an individual. Therefore
. A constraint,
, in diploids in an outbred population was proposed by Thompson (1976). This constraint is based on the assumption that two individuals have fathers that may be relatives and mothers that may also be relatives, but the parents of each individual are unrelated. This constraint can slightly reduce the bias and was applied by Anderson and Weir (2007), but not in Milligan’s (2003) estimator. In addition, Anderson and Weir (2007) model the probability of patterns of identity-in-state given modes of identity-by-descent in structured populations.
With the presence of null alleles, the data in Milligan’s (2003) table 1 should be modified as we present here in Table 1. This shows that ,
,
, and
, with
being the probability that a pair of genotypes is observed. Each coefficient in Table 1 thus needs to be multiplied by
, where
(19)Wagner et al. (2006) also applied similar corrections to the estimator of the maximum-likelihood method, but failed to remove the probabilities for unobservable genotype patterns and to apply Thompson’s (1976) constraint.
Simulations and Comparisons
We simulated four applications: (i) the effect of null alleles (how biases the result before adjustment), (ii) the reliability of the estimators after correction (bias of the adjusted
), (iii) the efficiencies of the corrected estimators, and (iv) the robustness of the corrections (populations under nonideal conditions were simulated, and the statistical behaviors of all estimators before and after correction were compared).
We used Monte Carlo simulations for the first three cases. For each dyad, the alleles of the first individual were randomly associated into genotypes according to the allele frequencies. Subsequently, the other genotype was then obtained by using the condition of the first genotype and their relationships (Δ and φ). For each locus, we generated a random number t that was uniformly distributed from 0 to 1: (i) if , the second genotype at this locus would be equal to the first; (ii) if
, one allele was randomly copied from the first genotype, and the other was randomly generated according to the allele frequency; and (iii) if
, both alleles of the second genotype were randomly generated according to the allele frequency.
Before Correction
We used both observed genotype and allele frequency for estimation and simulated four levels of null allele frequency: , 0.3, 0.5, and 0.6. For each level, the visible allele frequency was drawn from a triangular distribution, in which allele frequency followed
proportions.
Four relationships were simulated, including parent–offspring, full sibs, half sibs, and nonrelatives, and six estimators were compared, including that of Lynch and Ritland (1999) (L&R), the novel estimator A (NA), Wang (2002) (WA), Thomas (2010) (using the inverse of variance for weighting, TH), the novel estimator B (NB), and the estimator of Anderson and Weir (2007) (A&W). It is noteworthy that, because we were unable to apply both corrections (null alleles and structured populations) simultaneously, the population structure parameter θ for the A&W estimator was set to zero in the simulation (see Anderson and Weir 2007). For moment estimators, bias was obtained for a single locus by 6 million simulations. Because the final estimate is usually a weighted average of the estimates from all loci (except the WA estimator), increasing the number of loci does not reduce bias. Therefore, a single locus is sufficient to show an effect of null alleles on bias. In contrast, bias was obtained for 60,000 loci from 100 simulations for the A&W estimator. This estimator is asymptotically unbiased. Therefore, there are two sources of bias with the presence of null alleles: (i) the original bias, present even in normal loci, and (ii) the bias brought about by the presence of null alleles. We focused on the second type of bias and used several loci to eliminate the first type of bias. The results shown in Figure 2 are a function of the observed number of alleles, with
being simulated from 2 to 15.
Bias of before correction as a function of the number of visible alleles at loci with triangular distributed allele frequency. The data that have null alleles with a frequency of 0.1, 0.3, 0.5, or 0.6 are shown in the first to fourth columns. The six estimators compared are as follows: Lynch and Ritland (1999) (L&R), novel estimator A (NA), Wang (2002) (WA), Thomas (2010) (TH), novel estimator B (NB), and the Anderson and Weir (2007) estimator (A&W). Results were obtained from 6 million pairs for four relationships by Monte Carlo simulations except the A&W estimator, including parent–offspring (“—”), full sibs (“– –”), half sibs (“–⋅”), and nonrelatives (“⋯”).
The L&R and NA estimators cope relatively well with null alleles, because their biases are smaller at low frequencies of null alleles (Figure 2). The bias of the L&R estimator remains relatively unchanged as increases, as does that of the A&W estimator. Because the bias of each of these two estimators is small, each can be used directly without adjustment at low null allele frequencies. However, each begins to show a larger bias as
increases and reaches ∼0.15 at
The NA estimator performs worse than L&R at low levels of null alleles, but improves at high levels of null alleles when
The other three method-of-moment estimators (WA, TH, and NB) have similar bias curves. The bias of
varies as
increases, is negative at lower
, and begins to increase as
increases. For a highly polymorphic locus at low null allele frequencies, the biases of the WA and NB estimators are also small.
The bias curves in Figure 2 can be divided into three categories: (i) L&R and NA; (ii) WA, TH, and NB; and (iii) A&W. The biases of L&R and NA estimators are highly independent of the number of observed alleles because the bias of L&R is not a function of , while those of WA, TH, and NB estimators are dependent on
. Because the NA estimator is a modification of the L&R estimator, and because the TH and NB estimators are both modifications of the WA estimator, the resulting curves are similar within the same category. We present the bias curves of method-of-moment estimators in File S1. For the A&W estimator,
is calculated by a numerical algorithm. Therefore an analytical solution cannot be obtained.
After Correction
The configurations for this application are identical to those previously described. Although bias cannot be completely eliminated, it reduces to a negligible level after correction (Figure 3). The biases are <0.003, 0.015, 0.02, and 0.04 when the null allele frequency = 0.1, 0.3, 0.5, and 0.6, respectively. At , both the NA and WA estimators encounter a singular matrix problem (see Huang et al. 2014) and cannot yield a valid estimate, so the corresponding value is missing or highly negative. Full sibs yield the largest bias, while the biases of the other relationships are relatively small. Similarly, the biases of corrected method-of-moment estimators are also presented in File S1, where the biases of the WA, TH, and NB estimators are identical and do not depend on
.
Bias of after correction as a function of the number of alleles at loci with a triangular allele frequency distribution. Four levels of null allele frequency are shown in each column. Six estimators are compared, as shown in Figure 2. Results were obtained from 60 million pairs for four relationships by Monte Carlo simulations except the A&W estimator, including parent-offspring (“—”), full-sibs (“––”), half-sibs (“–⋅”) and nonrelatives (“⋯”).
Efficiency of Estimators
Multiple multiallelic loci were used in all estimations to increase reliability. Less variance suggests a more precise estimation, so variance can be a criterion of efficiency of estimators. However, bias cannot be excluded completely. We thus used the mean square error (MSE) to evaluate the efficiency of estimators, where . To demonstrate the efficiency of corrected estimators in multilocus estimations, we performed two simulations: (i) the minimum number of loci needed to reach a mean square error <0.01 (the inverse of the resulting value was defined as the potency of a locus and is shown to be a function of
) (Figure 4) and (ii) the bias and MSE of these estimators were compared when estimating relatedness with 20 microsatellites. In natural populations, nonrelative is the most likely relationship between two randomly selected individuals. Therefore, we simulated 10 million dyads, with nonrelatives, half sibs, full sibs, and parent–offspring contributing to 70%, 10%, 10%, and 10% of these dyads, respectively. The results as a function of
are shown in Figure 5.
The minimum number of loci by using multiple loci to reach for the four relationships shown in Figure 2. The six estimators are the same as in Figure 1. For each estimator, four frequencies of null alleles were simulated: 0.1 (—), 0.3 (– –), 0.5 (–⋅–), 0.6 (⋯), and 0.7 (
), and the observed allele frequency follows the triangular distribution. Results were obtained from 1 million Monte Carlo simulations.
The bias and MSE of estimated from 20 loci with a triangular allele frequency distribution. Ten million dyads were simulated, with nonrelatives, half sibs, full sibs, and parent–offspring contributing to 70%, 10%, 10%, and 10% of dyads, respectively. The frequency of null alleles was simulated at four levels (
, 0.3, 0.5, and 0.6), with each row showing a single level. The results of six estimators are compared: L&R (—), NA (– –), WA (∇), TH (Δ), NA (∘), and A&W (×).
The potency of a locus is an inverse function of (Figure 4) but increases with
, reaching an asymptote around
Locus power is low for high frequencies of null alleles. For example, at least 200 loci should be used when
The curve for
is close to that of
and is thus omitted from Figure 4.
When estimating relatedness with 20 microsatellites, the biases of the A&W estimator are the highest, while the MSEs of the A&W estimator are the lowest (Figure 5). For method-of-moment estimators, the WA, TH, and NB estimators are least biased, with the MSE curves of these estimators being similar.
Finite Populations
Although all of these estimators performed reasonably well under their given assumptions, real situations often diverge from ideal conditions. For example, allele frequencies are obtained from relatively few individuals, and other population effects such as self-fertilization, inbreeding, and genetic drift may occur.
We followed Toro et al. (2011) by simulating a small population founded from 20 unrelated and outbred individuals, whose genotypes were randomly generated according to the genotypic frequencies under the Hardy–Weinberg equilibrium. Twenty loci with four visible alleles were simulated. In the founder generation, two levels of the expected frequencies of null alleles were simulated ( or 0.3), the expected frequencies of these visible alleles being drawn from triangular distributions. Ten discrete generations of 20 individuals were produced, with the sex ratio at each generation fixed at
. The parents of each individual were randomly selected from the previous generation (some individuals may not have reproduced), resulting in a data set of 200 individuals and 20,100 dyads. The true kinship coefficient was calculated from the pedigree by a recursive algorithm (Karigl 1981), and the true Wright’s relatedness was obtained using the formula
(equation 12 in Huang et al. 2015a), where
is the kinship coefficient between x and y, and
is the kinship coefficient between x and itself. The frequencies of null and visible alleles were estimated by a maximum-likelihood estimator, using an EM algorithm (Kalinowski et al. 2006). The results of the statistical analysis from 100 replications including bias, root MSE (RMSE), coefficient of correlation (R), and the slope (
) and the intercept (
) of regression of the estimated relatedness on the true relatedness are shown in Table 2.

Most of the statistics improved after correction, with the corrected estimator becoming more stable as increases. There were some exceptions, for instance the L&R estimator became more biased, with the bias of the corrected estimator being negative and close to
.
Discussion
We examined the performance of six estimators in coping with null alleles under various types of pairwise relationship. First, estimator bias before and after correction for null alleles was compared. Although some bias remained, correction reduced this bias to a low level. Second, a comparison of the number of loci that is needed to achieve the same accuracy was performed to evaluate the impact of null alleles on estimator efficiency. This comparison also demonstrated the potency of a different locus and will help researchers to choose an optimal estimator to address their specific research questions. Finally, a small, inbred population prone to genetic drift was simulated to mimic real populations.
Effect of null alleles
Null alleles can affect the estimates of relatedness in two ways. First, they can result in mistyping of true genotypes and change the similarity among genotypes. Given a pair of genotypes, the observed similarity index is under- or overestimated for genotypes such as and
, because these genotypes are instead observed to be
and
. These can reduce the accuracy of estimations.
Second, the overestimation of observed allele frequencies can also affect the overall estimation. Nonrelatives share IBS alleles only by chance. However, the inflation of observed allele frequency can cause overestimation of the probability that two nonrelatives share IBS alleles and may give an inaccurate, reduced estimate. Figure 2 shows the null allele bias for each estimator. The estimators of Wang (2002) and Thomas (2010) have reduced performance in the presence of null alleles, as does the novel estimator B. For null allele frequencies ∼0.5, all estimators have a large bias.
Why are these estimators still biased?
These estimators were corrected after we modeled the probability of genetic relationships in the presence of null alleles. However, bias still existed, although it was not high. There are two sources of bias in moment estimators.
The first source is due to the approximation [(e.g. Expressions (8), (11) and (15)] described in the Theory and Modeling section and is present in all moment estimators. When φ or Δ is not equal to zero or one, bias arises because the approximation is not equivalent to the original form (e.g., the bias for full sibs was the largest in Figure 3, while the biases for both parent–offspring and nonrelatives were nearly zero). However, the unbiased estimator for expression (7) or (13) does not exist, because there is an inverse of in the
, and
is a binomial distribution if only one locus is used for estimation. Consider a simpler form as follows (by using only one parameter x):
Assume that there is an unbiased estimator
then
, so
Multiplying both sides of the last equality by
, and acknowledging the representations of
and
above, it follows that
The index of x is not equal in the two sides of the above equation, but x is a consistent parameter, and the equation cannot hold under any conditions. Therefore, an unbiased estimator for expression (7) or (13) does not exist. For
, the bias is manageable.
The second source of bias is that the genotypic frequencies deviate from expected values. This problem is inherent to the L&R and NA estimators (the bias increases as decreases). These estimators model the probability of an observed genotype occurrence relative to a reference genotype. With null alleles, the observed genotypic frequencies of two related individuals are drawn from different distributions. As an example, Table 3 lists the distribution of parent–offspring relatedness, with the second column and bottom row showing the distributions of the reference and proband genotypes, respectively. However, without a priori information, the estimator cannot determine which is the reference genotype, and bias cannot thus be avoided.


For the maximum-likelihood estimator, there are no such problems. However, the maximum-likelihood estimator is still biased because the estimate is limited to the parameter space, with negative relatedness being unobtainable. Therefore, a bias for relationships with Δ lying at the edge of the parameter space is always present.
The potency of loci with null alleles
When the frequencies of null alleles are high enough, loci with null alleles can be identified. Although bias can be reduced, the loss of efficiency for loci with null alleles cannot be recovered. For example, for and 0.7, only 56% and 26% of loci, respectively, can give a valid estimate for nonrelatives.
In simulations, the MSE is used to evaluate the efficiency of loci. The inverse of the minimum number of loci that enables is shown in Figure 4, which describes the potency of a single locus with null alleles. Approximately 19, 25, 34, 100, and 200 polymorphic loci should be used to fulfill this requirement for null allele frequencies being 0.1, 0.3, 0.5, 0.6, and 0.7, so each locus contributes ∼5%, 4%, 3%, 1%, and 0.5% to reliability (Figure 4). In addition, the potency of each normal locus is approximately that of
Loci showing higher potency (e.g.,
) can be used for estimation after correction. For loci displaying low potency, they should be discarded to search for new polymorphic loci or new primers should be designed. We suggest that loci with null allele frequencies >0.5 should not be typed, even if regulated estimators are used. This is because the information provided by these loci is less than half that of normal loci. Even if they have been typed, it does not matter if they are used for estimation because they contribute little to the MSE.
Finite populations
In our fourth application, nonideal conditions were simulated. The corrected estimators had improved statistical performance and were more robust at different levels of allele frequency. The L&R, NA, and A&W estimators performed well for (Figure 2), so the statistical performances of the corrected estimators were not improved. For the WA, TH, and NB estimators, there was large negative bias (
) for smaller values of
(Figure 2); their corrected estimators were therefore less biased and more accurate (lower RMSE).
There was a negative bias near for corrected estimators (Table 2). For panmictic populations, the estimator gives an expected estimate of zero. However, we found negative bias in populations that included relatives. Two randomly selected individuals in such a population have an expected relatedness estimate of zero but their true expected relatedness is 0.111. This results in negative bias.
Kinship estimators
We also modeled the kinship estimators of Loiselle et al. (1995) and Ritland (1996) for null alleles. We found that these estimators cannot incorporate null alleles because they model the event that one pair of alleles is sampled from two individuals. Because null alleles cannot be observed and sampled, this would instead be a resampling of alleles if null alleles were sampled. However, without Δ and φ, the probabilities cannot be modeled through a single kinship coefficient. We performed a simple simulation and found that the expected similarity index and kinship coefficient were different in full sibs and parent–offspring in which null alleles were present, even though the kinship coefficients were identical in these two relationships.
Selection of estimators
Although the A&W estimator performs well in simulations, there are some conditions under which others perform better according to specific metrics. Therefore, there is no single estimator that has superior performance under all conditions and using all metrics. The RMSE depended on the allele frequency, the number of alleles and loci, and the type of relationship between individuals. The simulations and comparisons we present here will probably help researchers choose the most suitable estimator for their particular application. Furthermore, for applications for specific genetic conditions, we show that it is possible to identify a single optimal estimator. To make this easier, we have made available a free software package, “PolyRelatedness,” that provides a simulation function to help researchers evaluate the performance of each estimator under their given conditions.
For novice users, we recommend the A&W estimator because of its robustness. However, there are at least two situations when the A&W estimator cannot be used. The first one is dealing with tied values in the data set. Some of the estimates of the A&W estimator lie at the edge of parameter space, which results in many estimates of 0 or 0.5. If the estimated relatedness between two kinds of dyads is compared with a rank-sum test, it is difficult to determine the ranks of these ties. The second one is the problem of bias, because the maximum-likelihood estimator cannot be used for applications that required a bias-free analysis. For example, the relationship between bee or ant workers within the same colony is either half sibs () or full sibs (
) if there is only a single queen inside this colony, so the ratio of full sibs can be calculated by
if an unbiased estimator is used (Huang et al. 2015b). In this case, the L&R estimator can be used as an alternative because the MSEs generated for nonrelatives are lower than other method-of-moment estimators (Figure 5), the most likely relationship between two randomly selected individuals being nonrelatives in medium to large populations.
Acknowledgments
We thank Stephanie T. Chen for language corrections and Ruo-du Wang for helpful comments on the mathematics. We also thank the associate editor, Bret Payseur, and two anonymous reviewers for helpful comments on an earlier version of the manuscript. This study was supported by the National Nature Science Foundation of China (31130061, 31501872, 31270441, and 31470455).
K.H. and B.L. designed the project, K.H. wrote the program and draft, S.G. and X.Q. provided the tools and data, and K.R. and D.W.D. checked the model and edited the manuscript.
Footnotes
Communicating editor: B. A. Payseur
Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.163956/-/DC1.
- Received March 16, 2015.
- Accepted October 17, 2015.
- Copyright © 2016 by the Genetics Society of America