Recent use of genomic (marker-based) relationships shows that relationships exist within and across base populations (breeds or lines). However, current treatment of pedigree relationships is unable to consider relationships within or across base populations, although such relationships must exist due to the finite size of the ancestral population and connections between populations. This complicates the reconciliation of both approaches and, in particular, combining pedigree with genomic relationships. We present a coherent theoretical framework to consider base populations in pedigree relationships. We suggest a conceptual framework that considers each ancestral population as a finite-sized pool of gametes. This generates across-individual relationships and contrasts with the classical view, in which each population is considered as an infinite, unrelated pool. Several ancestral populations may be connected and therefore related. Each ancestral population can be represented as a “metafounder,” a pseudo-individual included as founder of the pedigree and similar to an “unknown parent group.” Metafounders have self- and across-relationships according to a set of parameters, which measure ancestral relationships, i.e., homozygosities within populations and relationships across populations. These parameters can be estimated from existing pedigree and marker genotypes using maximum likelihood or a method based on summary statistics, for arbitrarily complex pedigrees. Equivalences of genetic variance and variance components between the classical and this new parameterization are shown. Segregation variance on crosses of populations is modeled. Efficient algorithms for computation of relationship matrices, their inverses, and inbreeding coefficients are presented. Use of metafounders leads to compatibility of genomic and pedigree relationship matrices and to simple computing algorithms. Examples and code are given.
POWELL et al. (2010) pointed out the conceptual conflict between identity-by-descent (IBD) relationships based on pedigree and identity-by-state (IBS) relationships based on marker genotypes. These are also known as pedigree and genomic (VanRaden 2008) relationships, respectively, and we use this terminology hereinafter. Whereas reference for pedigree relationships is formed by founders of the pedigree, reference for the genomic relationships is most often the current genotyped population (e.g., Powell et al. 2010; Vitezica et al. 2011). Powell et al. (2010) showed that one can (at least conceptually) refer genomic relationship coefficients to the pedigree scale and vice versa. In the context of applied genetic evaluation of livestock, similar notions were introduced by VanRaden (2008) and Vitezica et al. (2011), explicitly modifying genomic relationships to refer to pedigree coefficients. However, an implicit assumption in these proposals is that the genotyped population has no pedigree structure, e.g., no sib groups and only one generation (Christensen 2012), and the proposals are also difficult to extend to several base populations (Harris and Johnson 2010; Misztal et al. 2013; Makgahlela et al. 2014).
In addition, pedigree relationships have several problems. Pedigrees, which are incomplete by definition, end up in one or several base populations (lines or breeds). For instance, the pedigree of the Romane sheep synthetic breed traces back to two base populations, the Romanov and Berrichon du Cher breeds, and the pedigree of the global Holstein cattle population can often be traced back to “European” and “North American” base populations. In another more complex case, pedigrees are incomplete for some categories of animals. For instance, in dairy sheep, the father of all males is known, but only 5–80% of females have a known father. To further complicate things, in the presence of selection, assuming that all unknown parents belong to the same base population and have the same genetic level is inappropriate because younger (or sometimes foreign) animals are selected and therefore “better” than the base population. If not properly accounted for, this structure in several base populations results in biases (e.g., Ugarte et al. 1996; Misztal et al. 2013). Therefore, unknown fathers are assigned to different base populations, e.g., depending on year of birth, country of origin, sex, or path of selection. Current practice of genetic evaluations assumes that individuals in the different base populations (typically known as genetic groups or unknown parent groups) have different a priori average values, and these values are estimated as fixed effects within the model (Thompson 1979; Quaas 1988). However, the quantitative genetics theory of unknown parent groups has not been much further developed. For instance, Kennedy (1991) pointed out that genetic groups are incorrectly assumed to be unrelated to each other and that reduction of variance due to drift and selection should be accounted for. With molecular markers, there are more and more examples of observed relationships across base populations that were a priori unrelated (Kijaas et al. 2009; Gibbs et al. 2009; Ter Braak et al. 2010; VanRaden et al. 2011).
The hypothesis of unrelatedness of founders in a base population implies that the base population is drawn from a very large ancestral population. Not only is this false but it also contradicts marker-based information (animals seemingly unrelated share alleles at markers). Although unrelatedness can simply be seen as an arbitrary starting point, we suggest that relaxing this hypothesis gives more flexibility to the models.
On the other hand, genomic measures of relationships are not dependent on knowledge of pedigree. Further, they are more accurate because they consider realized, not expected, relationships (VanRaden 2008; Hayes et al. 2009; Hill and Weir 2011). Genomic relationships can be projected along the pedigree for animals with no genotypes (Legarra et al. 2009; Christensen and Lund 2010). The so-called single-step GBLUP (SSGBLUP) thus mixes pedigree and genomic relationships, and is becoming the de facto standard in genomic evaluations for livestock (e.g., Legarra et al. 2014b). However, SSGBLUP requires genomic and pedigree relationships to refer to the same base. This base is, however, hard to define. Genomic relationships of the current population change as more individuals are included and are poorly defined if populations are structured (i.e., in lines, breeds, or origins) (e.g., Harris and Johnson 2010). Defining a base is also difficult for pedigree relationships, as pedigrees are incomplete and possibly end up in several base populations. An alternative is truncation of the pedigree, to have a more homogeneous base population (e.g., Lourenco et al. 2014), but this is not always a feasible option. Furthermore, defining pedigree founders as unrelated contradicts the results obtained when these individuals are genotyped. Christensen (2012) suggested taking for genomic relationships an arbitrary reference, an ideal population with 0.5 allele frequency at the markers, and referring pedigree relationships to this base population. By doing so, he showed that founders of the base population should become related, and this extra relatedness can be understood as an excess of identical-by-descent homozygosity. The approach can be understood as a marginalization with respect to uncertainty in allelic frequencies, and a stable definition of the genetic base across time and different populations is obtained.
Extension of this method to several founder populations is, alas, not straightforward.
In this work, we present a theory to consider relationships within and across base, or founder, populations. This theory provides the tools, on the one hand, to generalize the “unknown parent groups” used in genetic evaluations and, on the other hand, to generalize Christensen’s results, which conciliate pedigree and genomic relationships. The concepts developed in this work aim to be rather general and are based on pedigree considerations, but their use is of large interest in two cases: first, when combining genomic and pedigree relationships across individuals (as the SSGBLUP mentioned above) and, second, when considering several base populations simultaneously.
The outline of this article is as follows. First, we show that base populations with related individuals can be understood as issued from finite size ancestral populations. This, although not strictly necessary for practical purposes, gives a conceptual model and a genetic interpretation (Jacquard 1974). Second, such an ancestral population can be represented as a single pseudo-individual (a metafounder) with a particular self-relationship (a measure of homozygosity) and represents a pool of gametes. Several base populations can be represented as several, possibly related, metafounders. Metafounders are convenient because they simplify the representation and the algorithms for computing relationships and inbreeding. Finally, we show how parameters (ancestral homozygosities and relationships across populations) of ancestral populations can be estimated from the combined use of marker and pedigree data. Our work is an extension and unification of existing works by Jacquard (1969, 1974), VanRaden (1992), Aguilar and Misztal (2008), VanRaden et al. (2011), Colleau and Sargolzaei (2011), and Christensen (2012).
Relationships in a finite population
Relationships across base individuals:
Let “ancestral” be the population from which founders of the pedigree are drawn and “base” population be the set of these pedigree founders (Figure 1). Typically, individuals in the base population are assumed to be drawn from a large, unrelated, ancestral population mating at random, so that the base population individuals will not be related. Jacquard (1969, 1974) considered relationships in a finite-size population, showing that inbreeding and relationships increase steadily. We redevelop his treatment in a simplified form. Pedigree founders in the base population are drawn at random, with replacement, from an ancestral finite monoecious population with effective size $N_e$, $2N_e$ gametes, “true” average breeding value μ, and genetic variance $\sigma^2_{related}$. In this ancestral population gametes are assumed to be independent (in a sense, the ancestral population becomes the new base). Imagine two gametes sampled with replacement from the finite ancestral population to form the base population. The second gamete will be identical to the first gamete $1/(2N_e)$ of the times. Therefore, the relationship coefficient (probability of identity by descent) between all pairs of gametes is $\gamma/2 = 1/(2N_e)$, and this relationship can be understood as the correlation between gametes of Wright (1922). Jacquard (1974) used $\alpha = \gamma/2$ and called it the “inbreeding coefficient of a population.”
Across-individual relationships in the base population are depicted in Figure 2. Consider diploid individuals X and Y. They are constituted by four gametes, a, b, c, d. These gametes have been drawn from a pool of gametes where the probability of being identical (by descent) is $\gamma/2$ across gametes and 1 with itself (Figure 2, left). Therefore, the coancestry coefficient between X and Y is the four-way average of probabilities of being identical for each possible pair of gametes, which equals $\gamma/2$. Additive relationship between X and Y is twice the coancestry and therefore γ (Figure 2, right). Now consider individual X. The self-coancestry considers four ways of sampling alleles a and b (with replacement), and because $P(a \equiv a) = P(b \equiv b) = 1$ and $P(a \equiv b) = \gamma/2$, self-coancestry is equal to $(1 + \gamma/2)/2$, and therefore self-relationship is equal to $1 + \gamma/2$.
The base population has associated breeding values $\mathbf{u}_{base}$. From the developments above, the variance-covariance matrix of breeding values is $Var(\mathbf{u}_{base}) = [\mathbf{I}(1 - \gamma/2) + \mathbf{J}\gamma]\,\sigma^2_{related}$, where I is the identity matrix and J is a matrix of ones. This covariance structure was suggested by Christensen (2012) to correctly compare genomic relationships and pedigree relationships. Due to random sampling of a limited number of founders, the mean of the base population composed of n individuals ($\bar{u}_{base} = \mathbf{1}'\mathbf{u}_{base}/n$) will drift around the mean of the ancestral population with variance $[\gamma + (1 - \gamma/2)/n]\,\sigma^2_{related}$.
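The covariance structure of the base population can be checked numerically. The following Python sketch assumes that γ = 1/N_e and that base individuals have self-relationship 1 + γ/2 and pairwise relationship γ, as derived above; the function names and the values N_e = 50 and n = 5 are hypothetical and serve only as an illustration.

```python
def gamma_from_effective_size(n_e):
    """gamma / 2 = 1 / (2 * Ne), hence gamma = 1 / Ne (assumed relation)."""
    return 1.0 / n_e


def base_covariance(n, gamma):
    """I * (1 - gamma / 2) + J * gamma for n base individuals."""
    return [[1.0 + gamma / 2.0 if i == j else gamma for j in range(n)]
            for i in range(n)]


def variance_of_mean(cov):
    """Variance of the sample mean: average of all elements of cov."""
    n = len(cov)
    return sum(sum(row) for row in cov) / (n * n)


gamma = gamma_from_effective_size(50)   # hypothetical Ne
cov = base_covariance(5, gamma)         # hypothetical number of founders
drift = variance_of_mean(cov)           # in units of the variance component
```

The drift variance returned here matches the closed form γ + (1 − γ/2)/n, so it tends to γ as the number of sampled founders grows.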
Pedigree relationships from related base populations:
VanRaden (1992) (unaware of the work of Jacquard 1974) assigned nonzero relatedness to animals in the base populations to correctly estimate inbreeding when pedigree information is missing. The value assigned to this relatedness, which is equivalent to γ, was set to the average relatedness of contemporary individuals with known relationships. Lutaaya et al. (1999) showed that the classical algorithm for calculating inbreeding is very sensitive to even a small loss of pedigree, whereas the VanRaden (1992) algorithm performs much better, although not perfectly. This idea was also applied by Aguilar and Misztal (2008), and Colleau and Sargolzaei (2011) used a closely related idea in a similar setting.
Obtaining pedigree relationships from related base populations is conceptually straightforward, can be done following the tabular rules (Emik and Terrill 1949), and leads to a matrix of additive relationships $\mathbf{A}^{(\gamma)} = \mathbf{A}(1 - \gamma/2) + \gamma\mathbf{J}$, where A is the matrix with regular relationships and J a matrix of 1’s. In Jacquard (1974, p. 169), this formula is presented using coancestries instead of relationships. However, algorithms for computation of inbreeding (e.g., Quaas 1976; Meuwissen and Luo 1992), Henderson’s (1976) sparse inverse of the pedigree relationships, and other algorithms (Colleau 2002) need to be modified to account for nonzero relatedness of founders (e.g., Aguilar and Misztal 2008; Christensen 2012). These changes are rather complex and do not generalize well to the case of several base populations, which is presented later. For this reason, and for its conceptual appeal, we have conceived the use of metafounders.
We now introduce a different, but equivalent, representation of related base populations that allows a greater flexibility. This representation uses so-called metafounders.
The notion of metafounder comes as an extension of VanRaden (1992) method for estimation of across-breed relationships. Imagine a pseudo-individual who can be considered as, simultaneously, father and mother of all base animals (Figure 3). We call this pseudo-individual a metafounder. The metafounder in Figure 3 represents the ancestral population in Figure 1.
In Figure 3, the metafounder (individual 1) represents a finite-size pool of gametes, from which the gametes constituting individuals 2–6 (the base population) are drawn. Picking two gametes at random with replacement, these gametes have an across-gamete relationship of $\gamma/2$. Therefore, the metafounder can be considered as having a self-relationship of $a_{11} = \gamma$ and an individual inbreeding coefficient of $F_1 = \gamma - 1$, which will usually be negative. Inbreeding means departure from Hardy–Weinberg equilibrium, and negative inbreeding represents excess of heterozygotes. Therefore, negative inbreeding means that in most cases two gametes are different, i.e., the size of the pool is large, which is a tenable genetic hypothesis. For instance, considering $\gamma = 0$ (and therefore $F_1 = -1$) means that the two gametes are always different (by descent) and unrelated, i.e., the size of the pool is infinite, heterozygosity (by descent) is complete, and all individuals in the base population are unrelated. Considering $\gamma = 2$ (and $F_1 = 1$) means that two gametes drawn at random are always identical, i.e., the pool consists of one gamete, there is complete homozygosity, and all individuals in the base population are identical and completely inbred.
Algorithms for relationships and inbreeding with a single metafounder:
With this representation using metafounders, regular rules for computation of relationships and inbreeding change only slightly. Consider the Emik and Terrill (1949) rules for computation of additive relationship coefficients. They start by assigning self-relationships of 1 to all animals in the base population and later two rules are used, $a_{ii} = 1 + a_{sd}/2$ and $a_{ij} = (a_{sj} + a_{dj})/2$, where d and s are the dam and sire of i, which must be younger than j. To include the metafounder, the only change is to set its self-relationship ($a_{11}$ in the example) to γ. The Emik and Terrill rules do not otherwise need to be changed. For instance, for individual 2 in Figure 3, $a_{22} = 1 + a_{11}/2 = 1 + \gamma/2$, and for individuals 1 and 2, $a_{12} = (a_{11} + a_{11})/2 = \gamma$. For individuals 2 and 3, $a_{23} = (a_{12} + a_{12})/2 = \gamma$. Therefore, assigning a metafounder with self-relationship γ is strictly equivalent to considering across-founder population relationships γ and founder self-relationships $1 + \gamma/2$. The recursive algorithms of Karigl (1981) and Aguilar and Misztal (2008) are versions of Emik and Terrill (1949) and therefore need no modification beyond setting $a_{11}$ to γ. Using these rules, $\mathbf{A}^{(\gamma)}$ is easily created.
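As an illustration of the tabular rules with one metafounder, the following Python sketch builds the relationship matrix for a toy pedigree shaped like Figure 3. It assumes the usual Emik and Terrill recursions, a_ii = 1 + a_sd/2 and a_ij = (a_sj + a_dj)/2; the function name, the 0-based numbering (individual 0 is the metafounder), and the value γ = 0.2 are hypothetical choices made only for this example.

```python
def tabular_relationships(pedigree, gamma):
    """Additive relationships by the tabular rules with one metafounder.

    pedigree: list of (sire, dam) index pairs, parents before offspring.
    Entry 0 is the metafounder; its own (sire, dam) pair is ignored.
    gamma: self-relationship assigned to the metafounder.
    """
    n = len(pedigree)
    a = [[0.0] * n for _ in range(n)]
    a[0][0] = gamma  # the only change with respect to the classical rules
    for i in range(1, n):
        s, d = pedigree[i]
        for j in range(i):  # relationships with all older individuals
            a[i][j] = a[j][i] = 0.5 * (a[s][j] + a[d][j])
        a[i][i] = 1.0 + 0.5 * a[s][d]
    return a


gamma = 0.2
# Metafounder 0, then five base animals whose sire and dam are both 0.
ped = [(None, None)] + [(0, 0)] * 5
A = tabular_relationships(ped, gamma)
```

In this toy case every base animal ends up with self-relationship 1 + γ/2 and every pair of base animals with relationship γ, which is the equivalence stated in the text.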
Consider Henderson’s (1976) inverse of the relationship matrix A. This consists of a product of the form $\mathbf{A}^{-1} = (\mathbf{T}^{-1})'\mathbf{D}^{-1}\mathbf{T}^{-1}$, where D is usually a diagonal matrix containing variances of the Mendelian sampling terms (deviation of an individual’s breeding value from its parents’ average) and $\mathbf{T}^{-1}$ contains ones in the diagonal and −0.5 coefficients linking parents to offspring. Elements of D are a function of inbreeding of the parents (see Thompson 1977 for the proof and Elzo 2008 for a detailed explanation). This reasoning applies equally well to the use of one metafounder. Thus, using pedigrees with a metafounder, all the information about covariance of gametes transmitted from base animals to their descendants is contained in the inbreeding of the base animals, and the algorithm of Henderson (1976) works without changes, provided (and this is important) that inbreeding for all individuals is computed previously. This is in contrast to Christensen (2012), who had to devise modifications of the algorithm.
Inbreeding coefficients can be computed by Emik and Terrill (1949) or, equivalently, using recursion (Karigl 1981; Aguilar and Misztal 2008). However, efficient algorithms for computation of inbreeding use Henderson’s (1976) decomposition of the numerator relationship matrix. These algorithms (e.g., Quaas 1976; Meuwissen and Luo 1992) proceed by computing the variance of the Mendelian sampling term, $D_{ii}$. Meuwissen and Luo (1992) presented one rule, $D_{ii} = 0.5 - 0.25(F_s + F_d)$, where, in the case of unknown ancestor, $s = 0$ (or $d = 0$), their programming set $F_0 = -1$. The same rule for computation of $D_{ii}$ applies to the pedigree with one metafounder in Figure 3, by setting $F_1 = \gamma - 1$. In fact, the Meuwissen and Luo (1992) algorithm can be understood as having one metafounder with $\gamma = 0$. Finally, the algorithm of Colleau (2002) for fast multiplication of matrix A with vector x, Ax, or extraction of elements of A also works.
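The rule for the Mendelian sampling variance can be sketched in a few lines of Python. The sketch assumes the rule reads D_ii = 0.5 − 0.25(F_s + F_d), with F = −1 for an unknown parent in the classical setting and F = γ − 1 for a metafounder; the function names and γ = 0.2 are hypothetical.

```python
def mendelian_variance(f_sire, f_dam):
    """Variance of the Mendelian sampling term: 0.5 - 0.25 * (F_s + F_d)."""
    return 0.5 - 0.25 * (f_sire + f_dam)


def metafounder_inbreeding(gamma):
    """Inbreeding of a metafounder with self-relationship gamma."""
    return gamma - 1.0


# Classical convention: an unknown parent behaves as if F = -1,
# which is the same as a metafounder with gamma = 0.
d_classical = mendelian_variance(-1.0, -1.0)

# Base animal whose sire and dam are both the same metafounder.
gamma = 0.2
f_mf = metafounder_inbreeding(gamma)
d_base = mendelian_variance(f_mf, f_mf)
```

For a base animal drawn from a single metafounder this gives D_ii = 1 − γ/2, which, added to the parent-average variance γ, recovers the self-relationship 1 + γ/2.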
Multiple base populations
An important case is the analysis of several populations at the same time, possibly with crosses. The conceptual model can easily be extended to several base populations, possibly with overlap as represented in Figure 4. In this case, we need to define within- and across-population relationships $\boldsymbol{\Gamma} = \begin{pmatrix} \gamma_A & \gamma_{A,B} \\ \gamma_{A,B} & \gamma_B \end{pmatrix}$. This was suggested by VanRaden (1992) and used by VanRaden et al. (2011). The interpretation of the across-base population coefficients like $\gamma_{A,B}$ is that the ancestor populations overlap, as seen in Figure 4. If population A is composed of $n_A$ gametes, population B of $n_B$ gametes, and they overlap to an extent of $n_{AB}$ gametes (for instance, in Figure 4 these are 6, 6, and 2, respectively), then $\gamma_A/2 = 1/n_A$, $\gamma_B/2 = 1/n_B$, and $\gamma_{A,B}/2 = n_{AB}/(n_A n_B)$. The last result can be explained as follows: $\gamma_{A,B}/2$ is the probability that the gamete from A comes from the overlap ($n_{AB}/n_A$), times the probability that the gamete from B comes from the overlap ($n_{AB}/n_B$), times the probability that both gametes are actually the same, given that they come from the overlap ($1/n_{AB}$). We allow values of $\gamma_A$, $\gamma_B$, and $\gamma_{A,B}$ in a continuous range, even though the formulas only support values corresponding to integer values of $n_A$, $n_B$, and $n_{AB}$. We also allow $\gamma_{A,B}$ to potentially be negative, in order to consider the situation where populations have diverged due to selection in opposite directions. However, there is the restriction that the matrix Γ should be positive definite.
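The gamete-pool interpretation can be turned into a small worked example. The Python sketch below assumes that the probability of identity between two gametes is 1/n within a pool and (n_AB/n_A)(n_AB/n_B)(1/n_AB) across pools, and that each γ is twice that probability; the function name is hypothetical and the counts 6, 6, and 2 are those quoted for Figure 4.

```python
from fractions import Fraction


def gamma_from_gamete_pools(n_a, n_b, n_ab):
    """Within- and across-population relationships from gamete counts.

    n_a, n_b: number of gametes in pools A and B.
    n_ab: number of gametes shared by both pools (the overlap).
    Returns the 2 x 2 matrix Gamma as nested lists of Fractions.
    """
    g_a = 2 * Fraction(1, n_a)
    g_b = 2 * Fraction(1, n_b)
    if n_ab == 0:
        g_ab = Fraction(0)  # no overlap: populations are unrelated
    else:
        g_ab = 2 * Fraction(n_ab, n_a) * Fraction(n_ab, n_b) * Fraction(1, n_ab)
    return [[g_a, g_ab], [g_ab, g_b]]


Gamma = gamma_from_gamete_pools(6, 6, 2)
determinant = Gamma[0][0] * Gamma[1][1] - Gamma[0][1] ** 2
```

Exact fractions are used so that the result can be compared by eye with the hand calculation; a positive determinant together with positive diagonal elements confirms that this 2 × 2 Γ is positive definite.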
The consideration of each ancestral population as a metafounder is straightforward. Metafounders would be related by relationships Γ (Figure 5). Actual numbers for the relationships within and across metafounders in Γ either can come from knowledge of the history of the populations (i.e., they diverged so many generations ago) or can be inferred from genomic relationships; this is detailed later.
Algorithms for relationships and inbreeding with several metafounders:
A pedigree with several metafounders defines a relationship matrix $\mathbf{A}^{(\Gamma)}$. Algorithms for creation of this matrix are extensions of previous ones. To form $\mathbf{A}^{(\Gamma)}$ using the tabular rules (Emik and Terrill 1949), the first step is to set Γ as relationships of the metafounders and then apply the regular rules. Rules for the inverse consist of, first, inverting Γ to create a small submatrix of $(\mathbf{A}^{(\Gamma)})^{-1}$ and then using Henderson’s rules (1976) with the elements $D_{ii}$ for all individuals modified according to self-relationships of metafounders, as in the previous section. Using generalized inverses for inversion of Γ results in an algorithm that, for $\boldsymbol{\Gamma} = \mathbf{0}$, gives the same $(\mathbf{A}^{(\Gamma)})^{-1}$ as with unknown parent groups as in Thompson (1979) or Quaas (1988). The reason for this is that the generalized inverse of $\boldsymbol{\Gamma} = \mathbf{0}$ is 0, and otherwise the rules for inversion and the values of $D_{ii}$ are identical. This shows that metafounders are a generalization of unknown parent groups.
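The tabular construction with several metafounders can be sketched as follows. This Python example assumes the rules a_ii = 1 + a_sd/2 and a_ij = (a_sj + a_dj)/2 applied after seeding the metafounder block with Γ; the function name, the toy pedigree, and the numerical values in Γ are hypothetical.

```python
def tabular_with_metafounders(Gamma, pedigree):
    """Relationship matrix with several metafounders by the tabular rules.

    Gamma: nmf x nmf relationships within and across metafounders.
    pedigree: (sire, dam) pairs for the remaining individuals, parents
    before offspring; indices refer to the full list in which the
    metafounders occupy positions 0 .. nmf - 1.
    """
    nmf = len(Gamma)
    n = nmf + len(pedigree)
    a = [[0.0] * n for _ in range(n)]
    for i in range(nmf):          # first step: set Gamma
        for j in range(nmf):
            a[i][j] = Gamma[i][j]
    for k, (s, d) in enumerate(pedigree):   # then the regular rules
        i = nmf + k
        for j in range(i):
            a[i][j] = a[j][i] = 0.5 * (a[s][j] + a[d][j])
        a[i][i] = 1.0 + 0.5 * a[s][d]
    return a


Gamma = [[0.4, 0.1], [0.1, 0.6]]   # metafounders 0 (A) and 1 (B)
ped = [(0, 0),   # 2: purebred A base animal
       (1, 1),   # 3: purebred B base animal
       (2, 3),   # 4: F1 from known parents
       (0, 1)]   # 5: F1 with unknown parents, one from each metafounder
A = tabular_with_metafounders(Gamma, ped)
```

Both F1 individuals obtain the same self-relationship, 1 + γ_AB/2, whether their parents are recorded or replaced by the two metafounders.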
Computing $(\mathbf{A}^{(\Gamma)})^{-1}$ involves computation of inbreeding coefficients, which can be done by recursion or by modifying Meuwissen and Luo (1992). The Meuwissen and Luo (1992) algorithm goes up the ancestors of a given animal i and adds contributions $L_{ij}^2 D_{jj}$ to the inbreeding coefficient of i; then animal j is deleted from the list of ancestors, and $L_{ij}$ is set to zero. However, this does not work in the particular case of a crossbred individual issued directly from two related metafounders, i.e., an F1 crossbred individual with unknown parents. This is a case that does sometimes exist, e.g., in sheep and cattle. In this case, the contribution from the metafounders to $F_i$ is a sum over all metafounders, $\sum_{k=1}^{nmf} (\mathbf{l}_i'\mathbf{k}_k)^2$, where $\mathbf{l}_i$ holds the elements $L_{ij}$ of animal i that correspond to the metafounders, $\mathbf{k}_k$ is the kth column of K, the lower triangular Cholesky decomposition of Γ, and nmf is the number of metafounders. Therefore, in the case of several metafounders, their contributions need to be processed simultaneously; this is the core modification for the Meuwissen and Luo code.
Finally, the algorithm of Colleau (2002) to efficiently compute products as $\mathbf{A}^{(\Gamma)}\mathbf{x} = \mathbf{T}\mathbf{D}\mathbf{T}'\mathbf{x}$ multiplies the result of $\mathbf{T}'\mathbf{x}$ by D, which has an upper diagonal block equal to Γ but is diagonal otherwise. A complete code is furnished in Supporting Information, File S2.
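The joint treatment of metafounders through the Cholesky factor of Γ can be illustrated with a small example. The Python sketch below is not the code of File S2; it assumes that the metafounder contribution to a self-relationship is the quadratic form l'Γl, written as a sum of squared projections on the columns of the lower Cholesky factor K. The function names, the values in Γ, and the vector l = (0.5, 0.5) for an F1 animal with unknown parents are hypothetical.

```python
import math


def cholesky_lower(m):
    """Lower triangular K with K K' = m, for a small positive definite m."""
    n = len(m)
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = sum(k[i][p] * k[j][p] for p in range(j))
            if i == j:
                k[i][j] = math.sqrt(m[i][i] - acc)
            else:
                k[i][j] = (m[i][j] - acc) / k[j][j]
    return k


def metafounder_contribution(l_row, Gamma):
    """Sum over metafounders of (l' k_k)^2, with k_k the kth column of K."""
    K = cholesky_lower(Gamma)
    nmf = len(Gamma)
    total = 0.0
    for col in range(nmf):
        projection = sum(l_row[r] * K[r][col] for r in range(nmf))
        total += projection * projection
    return total


Gamma = [[0.4, 0.1], [0.1, 0.6]]
l_f1 = [0.5, 0.5]   # half of the genome from each metafounder
contribution = metafounder_contribution(l_f1, Gamma)

# Mendelian sampling variance with metafounder "parents": F = gamma - 1.
d_ii = 0.5 - 0.25 * ((Gamma[0][0] - 1.0) + (Gamma[1][1] - 1.0))
self_relationship = contribution + d_ii
```

Adding the Mendelian sampling variance to the joint contribution gives 1 + γ_AB/2 for this animal, in agreement with the tabular rule, whereas processing each metafounder separately would miss the cross term in γ_AB.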
Genetic variance considering related base populations
Single base population:
The additive genetic variance is the variance of the breeding values of the set of individuals constituting a population. This definition does not involve a notion of (un)relatedness in itself. However, in the base population, these individuals are typically assumed unrelated, which simplifies the reasoning. A question is how to relate the genetic variance of a population modeled as “related” to the genetic variance of a population modeled as “unrelated.” The breeding value is defined as relative to the average of the population. For this reason, any statistical model relating phenotypes to breeding values is forced to include an overall mean or an environmental effect confounded with it. A typical model for the phenotype can be written as $\mathbf{y} = \mathbf{1}\mu + \mathbf{u} + \mathbf{e}$. We follow the argument of Strandén and Christensen (2011), but for the sake of discussion, consider the mean as a random variable with variance $\sigma^2_\mu$. The covariance of y is, for the classical model with unrelated base animals, $Var(\mathbf{y}) = \mathbf{J}\sigma^2_\mu + \mathbf{A}\sigma^2_{unrelated} + \mathbf{R}$, where $\mathbf{R} = Var(\mathbf{e})$. As for the new model with related base animals, $Var(\mathbf{y}) = \mathbf{J}\sigma^2_{\mu^*} + [\mathbf{A}(1 - \gamma/2) + \gamma\mathbf{J}]\sigma^2_{related} + \mathbf{R}$. Two equivalent models (with equivalent likelihoods under multivariate normality) should have the same covariance for y and therefore $\sigma^2_{unrelated} = (1 - \gamma/2)\,\sigma^2_{related}$ and $\sigma^2_\mu = \sigma^2_{\mu^*} + \gamma\sigma^2_{related}$. In other words, the general across-individual covariance γ is absorbed by the overall mean (and it will be the case even if the mean is considered as a “fixed” effect; Strandén and Christensen 2011). An intuitive explanation is that, when sampling a finite number of animals from a population, animals will tend to be related and therefore the mean will drift from zero; but this drift of the mean will be accounted for by the general mean of the model. The expression above agrees with the numerical results in Christensen (2012).
This result looks puzzling because it suggests that an “inbred” population has higher genetic variance than a non-inbred one, but this is not actually the case. The parameter $\sigma^2_{related}$ has to be interpreted as a parameter of the statistical linear model used for the analysis, and it cannot be interpreted as a genetic variance within the population (whereas $\sigma^2_{unrelated}$ can be). In fact, $\sigma^2_{related}$ would be the genetic variance in their hypothetical unrelated ancestral monoecious parents, and it would be reduced to $\sigma^2_{unrelated} = (1 - \gamma/2)\,\sigma^2_{related}$ assuming a rate of inbreeding $\gamma/2$ from parents to offspring, as relatedness γ decreases the genetic variance within a population. Thus, the genetic variance within the population is always $\sigma^2_{unrelated}$, and the variance component associated to the linear model is $\sigma^2_{related}$. Along the same line, genetic gain in the “related” base population is not proportional to $\sigma_{related}$ (because when selecting individuals, they will be related) but to $\sigma_{unrelated}$.
Multiple base populations:
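The reasoning extends to the case with several populations but no crosses. For simplicity, we consider only two purebred populations. For breeds $b \in \{A, B\}$ the model for phenotypes is $\mathbf{y}_b = \mathbf{1}\mu_b + \mathbf{u}_b + \mathbf{e}_b$, where the variance–covariance matrix of the combined vector of breeding values is
$$Var\begin{pmatrix}\mathbf{u}_A\\ \mathbf{u}_B\end{pmatrix} = \begin{pmatrix}\mathbf{A}_A(1-\gamma_A/2) + \gamma_A\mathbf{J}_{AA} & \gamma_{A,B}\mathbf{J}_{AB}\\ \gamma_{A,B}\mathbf{J}_{BA} & \mathbf{A}_B(1-\gamma_B/2)+\gamma_B\mathbf{J}_{BB}\end{pmatrix}\sigma^2_{related},$$
with $\mathbf{A}_b$ being the relationship matrix of breed b and $\mathbf{J}_{bb'}$ being matrices consisting of 1’s. Therefore, the vector of breeding values can be expressed as $\mathbf{u}_b = \mathbf{1}m_b + \mathbf{u}_{b,u}$, where subindex u on breeding values denotes that they are in the model with unrelated base populations, and $Var(\mathbf{u}_{b,u}) = \mathbf{A}_b(1-\gamma_b/2)\,\sigma^2_{related}$ and $Var\begin{pmatrix}m_A\\ m_B\end{pmatrix} = \boldsymbol{\Gamma}\sigma^2_{related}$, and assumed independence of breeding values $\mathbf{u}_{A,u}$ and $\mathbf{u}_{B,u}$. By an argument similar to above (i.e., Strandén and Christensen 2011), the parameters $m_A$ and $m_B$ are absorbed into the two general mean parameters $\mu_A$ and $\mu_B$, respectively. Therefore, the two models are equivalent in the sense that genetic variance parameters are just scaled by $(1-\gamma_b/2)$ and breeding values are just scaled and shifted. This model implies that phenotypes are separated by population and a mean (or distinct levels of fixed effects, e.g., herds) has to be fit by population. The argument above is not difficult to generalize to any number of populations, as long as crosses do not exist.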
Multiple base populations with crossing:
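For crossbred populations the equivalence above does not hold because $\gamma_{A,B}$ enters into the covariances across individuals. A different approximate equivalence of variances can be constructed as follows. Assume a set of n base population individuals (n is assumed large) drawn from each of m populations. Let the genetic values of the across-breed base populations be $\mathbf{u}$. The variance–covariance matrix is
$$Var(\mathbf{u}) = \left[\boldsymbol{\Gamma}\otimes\mathbf{J}_n + \mathrm{diag}(1-\gamma_b/2)\otimes\mathbf{I}_n\right]\sigma^2_{related}.$$
The sample variance of $\mathbf{u}$, across all populations, is
$$S^2 = \frac{1}{nm}\sum_i u_i^2 - \bar{u}^2,$$
which, for $Var(\mathbf{u}) = \mathbf{K}\sigma^2$ ($\sigma^2$ is a parameter), has expectation (Searle 1982, p. 355)
$$E(S^2) = \left(\overline{\mathrm{diag}(\mathbf{K})} - \bar{\mathbf{K}}\right)\sigma^2.$$
In the classical parameterization (unrelated founders) $\mathbf{K} = \mathbf{I}$ and thus
$$E(S^2) = \left(1 - \frac{1}{nm}\right)\sigma^2_{unrelated},$$
which is equal to $\sigma^2_{unrelated}$ if the population is reasonably large (a popular assumption) and therefore $E(S^2) = \sigma^2_{unrelated}$ if founders are unrelated. This means that when the founders are unrelated, the genetic variance is, on expectation, equal to the variance component of the covariance structure.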
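Consider now the structure above for $Var(\mathbf{u}) = \mathbf{K}\sigma^2_{related}$ with related base populations. The two terms are equal to
$$\overline{\mathrm{diag}(\mathbf{K})} = 1 + \frac{\overline{\mathrm{diag}(\boldsymbol{\Gamma})}}{2}, \qquad \bar{\mathbf{K}} = \bar{\boldsymbol{\Gamma}} + \frac{1}{nm}\left(1 - \frac{\overline{\mathrm{diag}(\boldsymbol{\Gamma})}}{2}\right),$$
in which we neglect the last term. This means that the genetic variance is, on expectation, equal to the variance component $\sigma^2_{related}$ times a constant $1 + \overline{\mathrm{diag}(\boldsymbol{\Gamma})}/2 - \bar{\boldsymbol{\Gamma}}$, which is <1. Equating these two expressions for $E(S^2)$ gives
$$\sigma^2_{related} = \frac{\sigma^2_{unrelated}}{1 + \overline{\mathrm{diag}(\boldsymbol{\Gamma})}/2 - \bar{\boldsymbol{\Gamma}}}.$$
This expression gives the previous result for a single population. Compared to the result in the previous subsection about multiple populations, this approximate equivalence is quite different. The result in the previous subsection is an equivalence between one genetic variance in a model with related base individuals and breed-specific genetic variances in a model with unrelated base individuals, whereas the result here is an approximate equivalence between two genetic variances, one being in a related base population and another being in an unrelated base population. This last expression is more general because it can correctly consider crosses across individuals. The difference also arises because in the previous expression there were separate means for each population, something that is not required here.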
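The conversion between the two variance components can be sketched numerically. The Python example below assumes the scaling constant 1 + mean(diag(Γ))/2 − mean(Γ), which reduces to 1 − γ/2 for a single population; the function names and the values of Γ and of the variance are hypothetical.

```python
def variance_scaling(Gamma):
    """Constant k such that var_unrelated is approximately k * var_related.

    k = 1 + mean(diag(Gamma)) / 2 - mean(Gamma), the large-n approximation.
    """
    m = len(Gamma)
    mean_diag = sum(Gamma[i][i] for i in range(m)) / m
    mean_all = sum(sum(row) for row in Gamma) / (m * m)
    return 1.0 + mean_diag / 2.0 - mean_all


def related_from_unrelated(var_unrelated, Gamma):
    """Variance component of the model with related base populations."""
    return var_unrelated / variance_scaling(Gamma)


k_single = variance_scaling([[0.2]])                     # one population
k_two = variance_scaling([[0.4, 0.1], [0.1, 0.6]])       # two populations
var_related = related_from_unrelated(10.0, [[0.2]])
```

With a single population the constant equals 1 − γ/2, so the variance component of the “related” model is larger than the genetic variance within the population, as discussed in the text.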
When crossing pure breeds, there is an increase of genetic variance due to the increase of heterozygosity at the QTL, for instance, if alternative alleles are fixed in each line. The additional variance in the F2 cross compared to the variance in the F1 cross is termed segregation variance (Lande 1981; Lo et al. 1993). This is typically ignored in a classical framework, although methods exist (Lo et al. 1993; Garcia-Cortes and Toro 2006). This increase in the genetic variance can be considered using related metafounders, as we show here. Two individuals in an F1 population (assuming, in a pedigree sense, unrelated and non-inbred parents, and factorizing out $\sigma^2_{related}$) have
$$\begin{pmatrix} 1 + \gamma_{A,B}/2 & b \\ b & 1 + \gamma_{A,B}/2 \end{pmatrix},$$
whereas two individuals in an F2 population (parents in F1 above) have
$$\begin{pmatrix} a & b \\ b & a \end{pmatrix},$$
with $a = 1 + (\gamma_A + \gamma_B + 2\gamma_{A,B})/8$ and $b = (\gamma_A + \gamma_B + 2\gamma_{A,B})/4$. This structure is transmitted forward and does not change in the F3, F4, etc. The genetic variance of such a population is thus $(a - b)\,\sigma^2_{related}$. The variance–covariance matrix of two individuals in the F2 can be expressed as
$$\begin{pmatrix} 1 + \gamma_{A,B}/2 & b \\ b & 1 + \gamma_{A,B}/2 \end{pmatrix} + \mathbf{I}\,\frac{\gamma_A + \gamma_B - 2\gamma_{A,B}}{8},$$
showing that the segregation variance is $(\gamma_A + \gamma_B - 2\gamma_{A,B})\,\sigma^2_{related}/8$. Because Γ is positive definite, this term must be ≥0. Slatkin and Lande (1994) showed that segregation variance is a function of within-loci squared differences of means at the two breeds, plus cross-products of differences across loci weighted by linkage. If Γ is estimated using markers, then it is implicitly assumed that genotypes at loci for the trait of interest have the same distribution across breeds and within the genome as marker genotypes. Reports of segregation variances in the livestock genetics literature are scarce (e.g., Cardoso and Tempelman 2004; Munilla-Leguizamon and Cantet 2011), partly because of poor data sets, partly because of computational difficulties, and partly because the bulk of crossbred animals is in poultry and swine, where crosses do not go beyond F1 populations. So it is uncertain whether accounting for segregation variance is of any practical relevance.
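A short numerical check of the segregation term may help. The Python sketch below assumes the tabular rules applied to pedigree-unrelated, non-inbred parents, which give F1 self-relationships of 1 + γ_AB/2, F2 self-relationships of 1 + (γ_A + γ_B + 2γ_AB)/8, and a common off-diagonal of (γ_A + γ_B + 2γ_AB)/4; the function name and the values of the γ parameters are hypothetical.

```python
def f1_f2_relationships(g_a, g_b, g_ab):
    """2 x 2 relationship blocks for pairs of F1 and of F2 individuals.

    Returns (f1, f2, segregation), all in units of the variance component
    of the model with related base populations.
    """
    off_diag = (g_a + g_b + 2.0 * g_ab) / 4.0
    f1_diag = 1.0 + g_ab / 2.0          # parents: one purebred A, one B
    f2_diag = 1.0 + off_diag / 2.0      # parents: two F1 individuals
    f1 = [[f1_diag, off_diag], [off_diag, f1_diag]]
    f2 = [[f2_diag, off_diag], [off_diag, f2_diag]]
    segregation = (g_a + g_b - 2.0 * g_ab) / 8.0
    return f1, f2, segregation


f1, f2, seg = f1_f2_relationships(0.4, 0.6, 0.1)
extra_f2_variance = f2[0][0] - f1[0][0]
```

The extra diagonal term in the F2 equals the segregation term, and it is non-negative whenever γ_A + γ_B ≥ 2γ_AB, which holds for any positive definite Γ.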
Estimation of metafounders ancestral relationships from genomic data
Because the within- and across-founder relationships cannot be inferred from pedigree, we suggest estimating these relationships using molecular markers, referring them to a genetic base defined according to genomic relationships (Christensen 2012). The objective of this section is to obtain estimators of Γ based on two kinds of statistical inference: a method of maximum likelihood and a method of moments (roughly, make first- and second-order statistics of genomic and pedigree relationships comparable).
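Genomic information sheds light on relationships across breeds (Gibbs et al. 2009; Kijaas et al. 2009; VanRaden et al. 2011; Legarra et al. 2014a). Genomic relationships (VanRaden 2008; Hayes et al. 2009) are estimators of relatedness based on the observation of thousands of molecular markers, and typically matrix $\mathbf{G} = \mathbf{Z}\mathbf{Z}'/s$ is used, where Z contains centered genotypes and s is a measure of global heterozygosity, for instance, $s = 2\sum_i p_i q_i$, the total heterozygosity at the markers. This information can in principle be used to infer the Γ coefficients as follows. Marker genotypes follow Mendelian transmission, and therefore the covariance of genotypes of two individuals is determined by their relationship. Christensen (2012) used this to estimate γ in a single population. First, he integrated the likelihood over the unknown allelic frequencies, which results in using allelic frequencies of 0.5 as a reference (Z coded as $\{-1, 0, 1\}$). Assuming multivariate normality for Z, the markers’ genotype, with a common variance $s/m$ for each of the m markers, the likelihood of observed genotypes conditional to γ and s is
$$\log p(\mathbf{Z}\mid\gamma, s) = -\frac{m}{2}\left[n\log\left(\frac{2\pi s}{m}\right) + \log\left|\mathbf{A}^{(\gamma)}_{22}\right| + \frac{1}{s}\,\mathrm{tr}\left(\left(\mathbf{A}^{(\gamma)}_{22}\right)^{-1}\mathbf{Z}\mathbf{Z}'\right)\right],$$
where n is the number of genotyped individuals and $\mathbf{A}^{(\gamma)}_{22}$ is the submatrix of $\mathbf{A}^{(\gamma)}$ corresponding to the genotyped individuals. The parameter s is a measure of heterozygosity in the genotyped population, and it is not equal to observed $2\sum_i p_i q_i$. The extension of this likelihood to multiple populations with different γ’s in Γ is straightforward,
$$\log p(\mathbf{Z}\mid\boldsymbol{\Gamma}, s) = -\frac{m}{2}\left[n\log\left(\frac{2\pi s}{m}\right) + \log\left|\mathbf{A}^{(\Gamma)}_{22}\right| + \frac{1}{s}\,\mathrm{tr}\left(\left(\mathbf{A}^{(\Gamma)}_{22}\right)^{-1}\mathbf{Z}\mathbf{Z}'\right)\right],$$
where $\mathbf{A}^{(\Gamma)}$ is the relationship matrix constructed with a given Γ matrix and $\mathbf{A}^{(\Gamma)}_{22}$ is the submatrix corresponding to the genotyped individuals. This likelihood can be factorized by markers as
$$p(\mathbf{Z}\mid\boldsymbol{\Gamma}, s) = \prod_{j=1}^{m} N\left(\mathbf{z}_j \mid \mathbf{0},\ \mathbf{A}^{(\Gamma)}_{22}\, s/m\right),$$
where $\mathbf{z}_j$ is the jth column of Z. The procedure can be completed by adding a prior distribution to Γ and using a Bayesian estimator instead of maximum likelihood. The prior distribution for Γ can be assigned based on spatial or temporal distances; for instance, Latxa sheep founders in 1990 and 1992 should be closer than 1990 and 2000.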
However, in none of these forms of the likelihood can Γ be factorized out, and the maximization of the likelihood needs to be done by a search method such as Simplex or Monte Carlo methods. For this reason, we present a method based on summary statistics.
Method of moments based on summary statistics:
This method matches summary statistics of across-individual and within-individual relationships in both $\mathbf{A}^{\Gamma}_{22}$ (the matrix of extended pedigree-based relationships) and G (VanRaden et al. 2011; Vitezica et al. 2011; Christensen et al. 2012). This forces the equivalence between expected changes of the mean and variance under genetic drift (Vitezica et al. 2011; Christensen et al. 2012) for the populations described by either the pedigree or the genomic relationship matrices. For a set of n random variables u with variance–covariance matrix K, the sample average $\bar{u}$ has variance $\overline{\mathbf{K}}$ (the average of all elements of K), whereas the sample variance has expectation $\overline{\mathrm{diag}(\mathbf{K})} - \overline{\mathbf{K}}$ (Searle 1982, p. 355). The idea of the method is to force these two statistics of K ($\overline{\mathbf{K}}$ and $\overline{\mathrm{diag}(\mathbf{K})} - \overline{\mathbf{K}}$) to be equal across both parameterizations (G and $\mathbf{A}^{\Gamma}_{22}$). We consider three situations.
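The two summary statistics of a covariance matrix K used by this method can be computed directly. The sketch below assumes K is stored as a list of rows and that the sample variance uses divisor n; the function names are illustrative.

```python
def mean_all(K):
    """Average of all elements of a square matrix."""
    n = len(K)
    return sum(sum(row) for row in K) / (n * n)


def mean_diag(K):
    """Average of the diagonal elements of a square matrix."""
    n = len(K)
    return sum(K[i][i] for i in range(n)) / n


def drift_statistics(K):
    """Return (variance of the sample mean, expected sample variance).

    Assumes the variables share a common mean and that the sample
    variance is computed with divisor n.
    """
    return mean_all(K), mean_diag(K) - mean_all(K)
```

Matching these two numbers between G and the pedigree-based matrix requires only averages, not the full matrices.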
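Single population:
Two unknowns need to be estimated: γ and s. Since $\mathbf{A}^{\gamma}_{22} = \mathbf{A}_{22}(1 - \gamma/2) + \gamma\mathbf{11}'$, the average of all elements is $\overline{\mathbf{A}^{\gamma}_{22}} = \gamma + (1 - \gamma/2)\,\overline{\mathbf{A}_{22}}$, and the average of the diagonal is $\overline{\mathrm{diag}(\mathbf{A}^{\gamma}_{22})} = \gamma + (1 - \gamma/2)\,\overline{\mathrm{diag}(\mathbf{A}_{22})}$, where $\mathbf{A}_{22}$ is the regular pedigree-relationship matrix for genotyped individuals. Therefore, a system of two equations needs to be set up,

$$\frac{\overline{\mathbf{ZZ}'}}{s} = \gamma + \left(1 - \frac{\gamma}{2}\right)\overline{\mathbf{A}_{22}}, \qquad \frac{\overline{\mathrm{diag}(\mathbf{ZZ}')}}{s} = \gamma + \left(1 - \frac{\gamma}{2}\right)\overline{\mathrm{diag}(\mathbf{A}_{22})},$$

with solutions

$$\gamma = \frac{2\left[\overline{\mathbf{ZZ}'}\;\overline{\mathrm{diag}(\mathbf{A}_{22})} - \overline{\mathrm{diag}(\mathbf{ZZ}')}\;\overline{\mathbf{A}_{22}}\right]}{\overline{\mathrm{diag}(\mathbf{ZZ}')}\left(2 - \overline{\mathbf{A}_{22}}\right) - \overline{\mathbf{ZZ}'}\left(2 - \overline{\mathrm{diag}(\mathbf{A}_{22})}\right)}, \qquad s = \frac{\overline{\mathrm{diag}(\mathbf{ZZ}')}\left(2 - \overline{\mathbf{A}_{22}}\right) - \overline{\mathbf{ZZ}'}\left(2 - \overline{\mathrm{diag}(\mathbf{A}_{22})}\right)}{2\left[\overline{\mathrm{diag}(\mathbf{A}_{22})} - \overline{\mathbf{A}_{22}}\right]}.$$

These solutions have an interpretation in terms of measures of inbreeding in the population. In a population large enough and mating at random, inbreeding of the individuals is equal to half the relationship of their parents, so that $\overline{\mathbf{A}_{22}} \approx 2\bar{F}_p$ ($\bar{F}_p$ is average pedigree inbreeding) and $\overline{\mathbf{G}} \approx 2\bar{F}_g$ ($\bar{F}_g$ is average genomic inbreeding). Therefore, in this case $\gamma/2 \approx (\bar{F}_g - \bar{F}_p)/(1 - \bar{F}_p)$. This is basically the reverse of the expression derived by Vitezica et al. (2011), who adjusted G to match A and called the adjustment parameter α. The expression shows that γ is a correction for underestimation of inbreeding of A with respect to G, following Wright’s F coefficients theory. An advantage of the method is that it needs only statistics of the $\mathbf{A}_{22}$ and G matrices, which might be more readily available than the full matrices.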
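The single-population estimator can be sketched in a few lines. The code assumes the moment equations mean(ZZ')/s = γ + (1 − γ/2)·mean(A22) and mean(diag(ZZ'))/s = γ + (1 − γ/2)·mean(diag(A22)), which follow from taking G = ZZ'/s and a single-metafounder matrix of the form A22(1 − γ/2) + γ11'. The function and argument names are illustrative, and the numbers in the usage lines are made up.

```python
def estimate_gamma_s(mean_zz, mean_diag_zz, mean_a22, mean_diag_a22):
    """Solve the two moment equations for (gamma, s).

    mean_zz, mean_diag_zz   : average of all / of diagonal elements of ZZ'
    mean_a22, mean_diag_a22 : the same two averages for A22
    """
    denom = mean_diag_zz * (2 - mean_a22) - mean_zz * (2 - mean_diag_a22)
    gamma = 2 * (mean_zz * mean_diag_a22 - mean_diag_zz * mean_a22) / denom
    s = denom / (2 * (mean_diag_a22 - mean_a22))
    return gamma, s


# Made-up summary statistics, generated from gamma = 0.4 and s = 100.
toy_gamma, toy_s_hat = estimate_gamma_s(48.0, 124.0, 0.10, 1.05)
```

Because only four averages enter the function, the estimator can be applied to published summary statistics when the raw genotypes are not available.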
Multiple pure populations:
Assume that a sample from each pure breed is genotyped. Consider the purebred parts of G and , for simplicity 2 breeds A and B:To meet the conditions of unbiasedness we need to force the equality of average diagonal and averages of G and and set up the four equationsThe solution is a generalization of the solutions for single populations. The scaling estimates for single populations are and with and , and and defined similarly. The solutions for two populations are
so that within-breed and across-breed average relationships agree. Assuming are close to zero, , which consist in setting to construct the G matrices (VanRaden 2008) and then simply quantify average relationships across breeds. This is the simple method used by VanRaden et al. (2011), although they did not define scaling s as we have done. This reasoning can be extended to as many breeds as needed. Again, this method can be used from published statistics without access to raw data.
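The simple approximation mentioned above, in which average genomic relationships within and across breeds are quantified directly, can be sketched as follows. It assumes that G has already been built with a common scaling s and that each genotyped animal carries a pure-breed label. The function name and the toy values are illustrative, and the block averages are only rough estimates of the elements of Γ, valid when average pedigree relationships are close to zero.

```python
def block_means(G, breeds):
    """Average of G over every (breed, breed) block, diagonals included."""
    sums, counts = {}, {}
    for i, breed_i in enumerate(breeds):
        for j, breed_j in enumerate(breeds):
            key = (breed_i, breed_j)
            sums[key] = sums.get(key, 0.0) + G[i][j]
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical G for two animals of breed A and one of breed B.
toy_G = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
toy_means = block_means(toy_G, ["A", "A", "B"])
```

The (A, B) entry of the result plays the role of the across-breed relationship, and the (A, A) and (B, B) entries the within-breed ones.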
Populations with crosses:
In some cases, pure populations may not be genotyped. For instance, Angus bulls may be mated to Limousine females and only the crossbreds and Angus genotyped. Another example is unknown parent groups (Quaas 1988), base populations that account for missing parentages. However, at some point descendants of these populations may be genotyped, and this information is usable. We propose an algorithm very similar to that of Harris and Johnson (2010). Let Q be a matrix containing in the cell the expected fraction of metafounder j in the individual i (Quaas 1988). This matrix can be efficiently obtained using Colleau (2002), recursion, or tracing down the pedigree. The following identity, which is an extension of , approximately holdsAnd therefore,. A linear model can be fit as , where E is an error term and is the section of Q containing proportions of metafounders in genotyped individuals. We neglect the term , which is small with respect to the rest of elements and obtain a further approximation , in which Γ is explicit. This expression can be linearized using the vec operator (Henderson and Searle 1979), and the least-squares estimator can be transformed back to a matrix form. This least-squares estimator of Γ isusing and assuming that the value of s is known. If only pure population animals are genotyped, this is identical to the approximation of the estimator above for “pure populations.” This solution for Γ is identical to the estimator proposed by Harris and Johnson (2010, Equations 13 and 14). As for s, one can usewhere the approximation is used for [in this case including ] such that and are linear functions of Γ. This system of two equations with two unknown is iterated until convergence. If there is little information for some metafounders (as is the case in ruminants), Bayesian estimation using a prior structure for Γ can be considered.
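The matrix Q of expected metafounder fractions can be obtained by tracing down the pedigree, since each individual receives half of the fractions of each parent. The following is a minimal sketch of that recursion only; the pedigree layout (a list of individual, sire, dam triplets sorted so that parents precede offspring) and all labels are assumptions made for illustration, and the faster algorithm of Colleau (2002) is not implemented here.

```python
def metafounder_fractions(pedigree, metafounders):
    """Return {individual: {metafounder: expected fraction}}.

    `pedigree` lists (individual, sire, dam) with parents before offspring;
    a parent is either an earlier individual or a metafounder label.
    """
    # Each metafounder is 100% itself.
    q = {m: {k: 1.0 if k == m else 0.0 for k in metafounders}
         for m in metafounders}
    for individual, sire, dam in pedigree:
        q[individual] = {
            k: 0.5 * (q[sire][k] + q[dam][k]) for k in metafounders
        }
    return {individual: q[individual] for individual, _, _ in pedigree}


# Hypothetical pedigree with metafounders "MA" and "MB".
toy_pedigree = [
    ("a1", "MA", "MA"),   # pure breed A
    ("b1", "MB", "MB"),   # pure breed B
    ("f1", "a1", "b1"),   # F1 cross
    ("bc", "f1", "a1"),   # backcross to breed A
]
toy_Q = metafounder_fractions(toy_pedigree, ["MA", "MB"])
```

Restricting the rows of the result to genotyped individuals gives the section of Q that enters the linear model for G.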
Combining pedigree relationships with metafounders and genomic relationships when not all individuals are genotyped
The SSGBLUP method for genomic evaluation (Aguilar et al. 2010; Christensen and Lund 2010; Legarra et al. 2014b) completes genomic information with pedigree-based information and in fact proceeds by correcting pedigree relationships in view of genomic relationships. Pedigree relationships are modified as (Legarra et al. 2009; Christensen and Lund 2010)

$$\mathbf{H} = \begin{pmatrix} \mathbf{A}_{11} + \mathbf{A}_{12}\mathbf{A}_{22}^{-1}(\mathbf{G} - \mathbf{A}_{22})\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{G} \\ \mathbf{G}\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{G} \end{pmatrix},$$

where H is a matrix with relationships after including pedigree and genomic relationships, G is a matrix including genomic relationships for genotyped individuals (subscript 2), which is projected upon relationships of ungenotyped animals (subscript 1), A is the pedigree-based relationship matrix, and $\mathbf{A}_{22}$ is the pedigree relationship matrix across genotyped individuals. This joint matrix H can be understood as a linear imputation of genotypes over all nongenotyped individuals (Christensen and Lund 2010), considering also the uncertainty in the imputation. This covariance matrix is increasingly used in genomic predictions of genetic merit (Aguilar et al. 2010; Christensen et al. 2012) and also in QTL detection (Dikmen et al. 2013).
The algebraic development of matrix H assumes that base allelic frequencies are known or, equivalently, that the mean and variance of the population do not change with time. This is notoriously false in small populations, with deep pedigrees, or in the presence of selection. Different adjustments have been suggested to modify genomic relationships so that their genetic base is the same as that of pedigree relationships (Vitezica et al. 2011; Christensen et al. 2012). This implicitly estimates the shift in breeding values (or allelic frequencies) from the pedigree base to the genotyped population (Vitezica et al. 2011). However, these adjustments do not consider the pedigree structure of the populations, and their generalizations to crosses of lines or breeds are neither completely satisfactory nor well understood (but see Harris and Johnson 2010; Makgahlela et al. 2014).
Christensen (2012) argued that, contrary to pedigree relationships, genomic relationships are independent of pedigree completeness and they should define the genetic base. He thus considered matching pedigree relationships to genomic relationships instead of the opposite. He showed that after marginalizing the allelic frequencies from the joint likelihood, the result was a related base population and suggested estimating γ and s using maximum likelihood. All our developments rely on this base and therefore, the extended pedigrees with metafounders automatically reconcile marker- and pedigree-based relationships, using estimates of Γ and s from markers. In particular, the inverse of the joint pedigree and marker relationship matrix is

$$\left(\mathbf{H}^{\Gamma}\right)^{-1} = \left(\mathbf{A}^{\Gamma}\right)^{-1} + \begin{pmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{G}^{-1} - \left(\mathbf{A}^{\Gamma}_{22}\right)^{-1} \end{pmatrix}.$$

This matrix can be fit into the mixed model equations of the SSGBLUP.
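The inverse of the joint relationship matrix has the usual single-step structure: the inverse of the pedigree-based matrix plus a correction confined to the genotyped block. The sketch below illustrates this on a toy case with dense pure-Python matrices. The argument `A` stands for whichever pedigree relationship matrix is used (here, the one built with metafounders); all names and numbers are illustrative, and real applications would use sparse inverses rather than a dense Gauss-Jordan routine.

```python
def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(M)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def h_inverse(A, G, genotyped):
    """inv(H) = inv(A) + (inv(G) - inv(A22)) added on the genotyped block."""
    H_inv = [row[:] for row in inverse(A)]
    A22 = [[A[i][j] for j in genotyped] for i in genotyped]
    G_inv, A22_inv = inverse(G), inverse(A22)
    for a, i in enumerate(genotyped):
        for b, j in enumerate(genotyped):
            H_inv[i][j] += G_inv[a][b] - A22_inv[a][b]
    return H_inv


# Toy case: animal 0 is an ungenotyped parent of half-sibs 1 and 2.
toy_A = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.25], [0.5, 0.25, 1.0]]
toy_G = [[1.1, 0.4], [0.4, 0.9]]
toy_H = inverse(h_inverse(toy_A, toy_G, [1, 2]))
```

Inverting the result back shows that the genotyped block of H equals G, while the relationships involving the ungenotyped parent are the regression of pedigree relationships on G.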
We have seen that the variance component assuming “related” founders is not the same as the genetic variance assuming “unrelated” founders; the latter is the one classically estimated and used. The most straightforward solution is to reestimate the variance using metafounders. Alternatively, to use current estimates of genetic variance in the implementation, the variance of the breeding values needs to be scaled to . On expectation, the following equivalence holds: Thusandsuch that the inverse of the combined relationship matrix () can be multiplied by a single scalar,.
Example 1: How pedigree relationships are modified
Consider the pedigree in Figure 4 and the relationships between the subset of individuals 8 (pure breed A), 10 (pure breed B) , and 14 (crossbred, 56% breed A and 44% breed B, grandson of 8 and of 10). Regular relationships () areConsider now . ThenAll within-breed relationships have increased, because each base population is now assumed self-related. However, animals 8 and 10 are unrelated. Considering across-base population relationships in giveswhere the relationship between 8 and 10 appears, which in turn slightly increases the inbreeding coefficient of 14. To standardize to the genetic variance estimated assuming unrelated base individuals, must be divided by
Example 2: Interpretation of γ in a single population
Legarra et al. (2014a) used dairy sheep data (Manech Tête Rousse) for genomic prediction including 38,287 markers and 1295 rams. The relevant statistics are (observed) , , , , Using the single population method above yields , . What do these numbers mean? They imply that heterozygosity of markers at the base population should have been (instead of observed 14771), to appropriately match the fact that the heterozygosity at the markers reduced from the base to the observed population, according to inbreeding observed in the pedigree. Based on this estimate, average genomic inbreeding is , which can be achieved with an effective size of the founder population Ne = 1/0.43 and therefore . Although this effective size is very small, it refers to a reference with allelic frequencies equal to 0.5. This has to be taken as a reference point for the linear model and has no clear biological meaning.
Example 3: Numerical example of two breeds and crossbred individuals
VanRaden et al. (2011) estimated relationship coefficients across Jersey, Holstein, and Brown Swiss using 43,385 markers. Based on their published statistics and using the method based on summary statistics outlined above, we obtained an estimate of for Holstein and Jersey. Assuming the pedigree in Figure 5, we constructed using , which is equivalent to use of regular unknown parent group rules (Quaas 1988) and we also constructed as described before with ; we scaled to refer to the same regular genetic variance multiplying it by the constant . Results are shown in Figure 6.
It can be observed that the sparsity pattern does not change, except for the nonnull values across metafounders. Also, the numbers do not change greatly, but diagonal elements are higher because there is shrinkage associated with the metafounders, which is not the case for regular unknown parent groups.
Consider now the variance in the hypothetical crossed Holstein–Jersey individuals. The segregation variance is, by the formula above, increased by compared to the variance in the F1.
This work presents new conceptual developments for pedigree relationships, including ancestral relationships at the founders due to finite size of the ancestral population and across-base population relationships due to overlapping. Such development is of conceptual interest per se (Kennedy 1991; VanRaden 1992; ter Braak 2010), but it is also needed for genomic evaluations integrating genotyped and nongenotyped individuals. In practice, regular genetic evaluations including several base populations and their crosses assume that ancestral populations are of infinite size and unrelated. This leads to unsolved questions. For instance, assume three pure breeds A, B, C and all of their F1 crosses, all in the same environment. If breeds A and B are more similar to each other than to breed C, does this need to be included in the genetic analysis? Another typical case is that of ruminant populations with missing parentages, which are modeled as animals entering from new base populations. These base populations will become gradually more inbred (VanRaden 1992) and they will drift from the oldest base population. Also, they will be related (e.g., the unknown parent group “Holstein2004” will be more related to “Holstein2002” than to “Holstein1994”). All this can be conveniently modeled, estimated, and included in the genetic evaluations using metafounders. As genomic evaluation procedures are becoming more comprehensive, examples of this kind of problem are showing up in the animal breeding literature: Harris and Johnson (2010), Misztal et al. (2013), Makgahlela et al. (2014), Winkelman et al. (2015).
Metafounders and unknown parent groups
Metafounders are closely related to unknown parent groups or genetic groups (Thompson 1979; Quaas 1988). Genetic groups allow estimation of different genetic bases across the same population, which is necessary if the selection process is unknown (i.e., importing animals or missing pedigrees). Genetic values of individuals in a genetic group model can be written as , where g has average values of the genetic groups. Genetic groups are usually considered as fixed but they can be conceived as random (Sullivan and Schaeffer 1994). For random g and and , . This is similar to , but does not correctly model crosses and overestimates inbreeding. As pointed out by Kennedy (1991) this traditional formulation of genetic groups did not consider inbreeding or drift. Our work can be seen as a generalization of genetic groups to include inbreeding, drift, and across-group relationships. This generalization overcomes the problems mentioned by Misztal et al. (2013), who realized that inclusion of unknown parent groups into single-step methods involved approximations in the setup of joint matrix H.
Inclusion of finite-size ancestral populations in genetic evaluation procedures has been largely neglected. Jacquard’s (1969, 1974) work on relationships in closed populations has been ignored. Independently, VanRaden (1992) made a first contribution to mitigate the lack of genealogical information in cattle. He used inbreeding coefficients for unknown parent groups based on inbreeding of contemporaries; here we suggest using genomic information instead. Both ideas can possibly be merged.
A notion related to that of metafounders is partial relationships across pairs of individuals due to sharing alleles from some particular origin. This allows modeling the genetic value of an individual as a sum of genetic values from several breeds, and this is known as “splitting breeding values” (Garcia-Cortes and Toro 2006). The relationship matrix with metafounders can be decomposed into such a structure, as explained in the Supporting Information, File S1.
Metafounders and pedigree and genomic relationships
The use of metafounders with Γ relationships allows a reconciliation of pedigree and genomic relationships and inbreeding (Powell et al. 2010; Vitezica et al. 2011). Homozygosity (or identity) can be considered as deviation from Hardy–Weinberg equilibrium (Wright 1922). These deviations cannot be easily measured because they depend on the assumed allelic frequencies, which change in time. Considering unrelated founders assumes that all founder alleles are different, which is not tenable in view of marker information. By assuming 0.5 allelic frequencies, the reference is constant and there are no ambiguities.
The fact that inbreeding automatically increases when considering metafounders may seem worrying. If the objective of quantifying inbreeding is to describe the a priori uncertainty about inbred animals (i.e., inbred animals tend to be more variable), this does not seem a concern. Use of pedigree inbreeding with metafounders to quantify inbreeding depression should not be problematic, for two reasons. The first is that adding a constant (roughly $\gamma/2$) to inbreeding will not change estimates. The second is that, due to purging, only the “new” rate of inbreeding ($\Delta F$) seems to have a measurable effect (e.g., Hinrichs et al. 2007). Recent inbreeding could even be better estimated using metafounders (for instance, in incomplete pedigrees; VanRaden 1992).
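The size of the shift in inbreeding can be illustrated for a single population. The sketch assumes the single-metafounder relationship matrix has the form A(1 − γ/2) + γ11', so that its diagonal gives 1 − F_γ = (1 − γ/2)(1 − F), in the spirit of Wright's hierarchical F coefficients. The function name and the numerical values are illustrative.

```python
def inbreeding_with_metafounder(f_pedigree, gamma):
    """Inbreeding coefficient referred to a related base population.

    Assumes a single metafounder with self-relationship gamma, so that
    1 - F_gamma = (1 - gamma / 2) * (1 - F_pedigree).
    """
    return 1 - (1 - gamma / 2) * (1 - f_pedigree)


# A non-inbred animal and an animal with pedigree inbreeding of 0.25,
# for a made-up value of gamma = 0.4.
toy_f_base = inbreeding_with_metafounder(0.0, 0.4)
toy_f_inbred = inbreeding_with_metafounder(0.25, 0.4)
```

Under this assumption every animal is shifted upward by γ/2 and differences between animals are multiplied by the constant (1 − γ/2), so the ranking of animals by inbreeding is unchanged.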
Genomic relationships are based on markers, and commercial marker chips are often biased toward intermediate frequencies or toward specific breeds. For instance many markers conceived for Bos taurus are monomorphic in Bos indicus and their use will result in biased estimates of Γ. For this reason, the approaches in this work should be considered with caution for such populations. Use of unbiased markers (e.g., from sequence data or from random genotyping across the genome) will result in more accurate estimates of relationships across metafounders, if the populations are distant ones.
Genetic background across populations
Use of metafounders assumes a common genetic background across all base populations. This is typically accepted as true within breed, but breed itself is somewhat ill defined. Some genomic predictions across breeds assume identical genetic background (e.g., Hayes et al. 2009; Harris and Johnson 2010). If the hypothesis of a homogeneous genetic background is not acceptable, for instance, in the case of genetic–environment interactions or scale effects, a genetic correlation model can be used (Wei and Vanderwerf 1994; Karoui et al. 2012).
Practical performance of our model has to be ascertained with real data, but we give an example of its interest. Winkelman et al. (2015), using a simplified single-step GBLUP, reported better performance of the Euclidean distance matrix (EDM) relationship matrix (Gianola and van Kaam 2008) across breeds and their crosses, compared to G adjusted as in Harris and Johnson (2010). We have observed that numerically, G matrices based on EDM and G matrices based on 0.5 allelic frequencies tend to be similar (unpublished). It would seem that the appeal of the EDM relationship matrix is therefore its independence of within-breed allelic frequencies, as proposed by Christensen (2012). In this work, we have aimed at creating tools to make pedigree relationships compatible with this kind of G matrix.
We have defined the notion of metafounders, which can be understood as a limited pool of gametes from which the founders of the pedigree are drawn. Metafounders can also be understood as a generalization of unknown parent groups or genetic groups, which are essential in genetic evaluation of livestock. Use of metafounders makes it possible to analyze pedigreed populations allowing for relatedness within and across base populations, something that is desirable for genetic evaluations combining pedigree and genetic markers. Metafounders can account for extra segregation variances due to crosses of populations. Efficient algorithms exist for computation of relationship matrices, their inverses, and inbreeding coefficients. Relationships across metafounders can be inferred from marker data. By doing so, compatibility of pedigree and genomic relationships is guaranteed by construction. This work provides new tools and concepts for genetic evaluation and management of populations.
We are grateful to the genotoul bioinformatics platform Toulouse Midi-Pyrenees for providing computing resources. We thank the reviewers and the editor for useful comments and suggestions. A.L. and Z.G.V. acknowledge financing from INRA SelGen projects X-Gen and SelDir. O.F.C. was supported by the Center for Genomic Selection in Animals and Plants (GenSAP) funded by the Danish Council for Strategic Research.
Communicating editor: G. A. Churchill
Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.177014/-/DC1.
- Received February 12, 2015.
- Accepted April 3, 2015.
- Copyright © 2015 by the Genetics Society of America