# The Dimensionality of Genomic Information and Its Effect on Genomic Prediction

^{*}Department of Animal and Dairy Science, University of Georgia, Athens, Georgia 30602^{†}Institut National de la Recherche Agronomique, GenPhySE, F-31326 Castanet-Tolosan, France

- 1Corresponding author: Department of Animal and Dairy Science, University of Georgia, Athens, GA 30602. E-mail: ipocrnic{at}uga.edu

## Abstract

The genomic relationship matrix (GRM) can be inverted by the algorithm for proven and young (APY) based on recursion on a random subset of animals. While a regular inverse has a cubic cost, the cost of the APY inverse can be close to linear. Theory for the APY assumes that the optimal size of the subset (maximizing accuracy of genomic predictions) is due to a limited dimensionality of the GRM, which is a function of the effective population size (*N*_{e}). The objective of this study was to evaluate these assumptions by simulation. Six populations were simulated with approximate effective population size (*N*_{e}) from 20 to 200. Each population consisted of 10 nonoverlapping generations, with 25,000 animals per generation and phenotypes available for generations 1–9. The last 3 generations were fully genotyped assuming genome length *L* = 30. The GRM was constructed for each population and analyzed for distribution of eigenvalues. Genomic estimated breeding values (GEBV) were computed by single-step GBLUP, using either a direct or an APY inverse of GRM. The sizes of the subset in APY were set to the number of the largest eigenvalues explaining *x*% of variation (EIG*x*, *x* = 90, 95, 98, 99) in GRM. Accuracies of GEBV for the last generation with the APY inverse peaked at EIG98 and were slightly lower with EIG95, EIG99, or the direct inverse. Most information in the GRM is contained in ∼*N*_{e}*L* largest eigenvalues, with no information beyond 4*N*_{e}*L*. Genomic predictions with the APY inverse of the GRM are more accurate than by the regular inverse.

- GenPred
- shared data resource
- genomic selection
- genomic relationship matrix
- inversion
- recursion
- effective population size
- single-step GBLUP

WHEN SNP information is available, genomic predictions most commonly use SNP-BLUP (and derivatives) or genomic BLUP (GBLUP) models (Meuwissen *et al.* 2001; VanRaden 2008). In the first model SNP effects are fitted directly, and the second model uses SNPs indirectly via a genomic relationship matrix. While both models are equivalent theoretically, analyses with complex models (multiple traits, several genetic effects, genotype-by-environment interactions) are simpler with GBLUP. For populations where only a small fraction of phenotyped animals are genotyped, there is a modification of GBLUP called single-step GBLUP (ssGBLUP) based on combining genomic and pedigree relationships (Aguilar *et al.* 2010; Christensen and Lund 2010). The ssGBLUP is becoming popular for commercial genetic evaluations because of simplicity of use, as existing models can be reused, and high accuracy due to joint modeling of phenotypes, pedigrees, and genotypes (Legarra *et al.* 2014).

GBLUP-based methods require an inverse of the genomic relationship matrix (GRM). A direct inverse has a cubic cost and can be computed efficiently for perhaps up to 150,000 individuals. Due to the popularity of commercial genotyping, some populations have >1 million genotyped animals (*e.g.*, U.S. Holstein cattle), and computing an inverse would be prohibitively expensive. Additionally, the GRM is not positive definite for larger dimensions and additional steps (*e.g.*, blending with a pedigree-based relationship matrix) are required to make the GRM positive definite (VanRaden 2008).

Misztal *et al.* (2014) postulated that the inverse can be computed efficiently using recursion on a small subset of animals (initially labeled as high-accuracy or “proven” in earlier studies) and named the method the algorithm for proven and young (APY). In this article, we refer to the inverse calculated with this algorithm as the APY inverse and to animals in the subset as core animals. While computing costs of APY are cubic for the subset, they are only linear for animals outside the subset. Fragomeni *et al.* (2015) analyzed Holstein data with 100,000 genotyped animals and found that any subset of animals containing at least 10,000 animals resulted in an accurate inverse. The optimal subset size was estimated as slightly >8000 for Angus cattle (Lourenco *et al.* 2015b). The APY inverse was successfully computed for ∼570,000 genotyped Holsteins in <2 hr of computing time on an average server (Masuda *et al.* 2016). More than 10,000 animals in the recursion did not improve genetic predictions. A regular inverse for 570,000 individuals would require weeks of computing and require memory available only in the largest computing clusters.

Misztal (2016) proposed a theory for the APY inverse. Assume that the additive information in a population is contained in a limited number (say, *n*) of independent chromosome segments (*M*_{e}) or effective SNP markers (ESM). If *M*_{e} or ESM completely explain the additive variation, breeding values of *n* animals are linear functions of *M*_{e} or ESM and contain nearly all the information included in *M*_{e} or ESM. Treating any subset of *n* animals as core animals, a recursion on any *n* animals is sufficient, because there is a high redundancy in genomic information. Whereas the number of *M*_{e} is a function of effective population size, the number of ESM could be computed as the number of eigenvalues explaining nearly all the variation in **G**. Assuming that *M*_{e} and ESM describe the same concept, the optimal subset size must be a function of effective population size (*N*_{e}) and can be derived from eigenvalue analysis of the GRM.

The purpose of this article was to test the theory of the APY with simulated data. In particular, we wanted to find whether (1) the optimal size of the recursion is related to effective population size, (2) the optimal size can be derived from eigenvalue analysis of the GRM, and (3) genetic predictions obtained with APY **G**^{−1} are superior to those with a regular inverse.

## Materials and Methods

### Data simulation

Data for this study were simulated using the QMSim software (Sargolzaei and Schenkel 2009). The historical population consisted of 1000 generations with a gradual increase in size from 1000 to 100,000 breeding individuals, with equal sex ratio, nonoverlapping generations, random mating, no selection, and no migration to create initial linkage disequilibrium (LD) and establish mutation–drift balance in the population. Six populations with different effective population size were created by selecting different numbers of breeding animals from the last generation of the historical population. Whereas the number of breeding females per generation was kept constant at 12,500, the number of males varied from 5 to 50 (5, 10, 20, 30, 40, and 50), aiming for approximate effective population sizes from 20 to 200 (data sets P20, P40, P80, P120, P160, and P200). In each generation randomly selected male offspring were used as sires for the next generation, while all the females were used as dams for the next generation. Ten recent generations were simulated for each population by random mating and with litter size of 2. All 75,000 individuals in generations 8–10 had genotypic information available. The simulated genome was assumed to have 30 chromosomes of equal length of 100 cM each, with 49,980 evenly allocated biallelic SNP markers and equal allele frequencies in the first generation of the historical population. A total of 4980 biallelic and randomly distributed QTL affected the trait, with allelic effects sampled from a gamma distribution with a shape parameter of 0.4. The recurrent mutation rate of the markers and QTL was assumed to be 2.5 × 10^{−5} per locus per generation (Solberg *et al.* 2008). Phenotypes were simulated with an overall mean as the only fixed effect and assuming heritability of 0.3. All animals in the recent generations had phenotypes available, except for animals in the last generation. The simulation was replicated five times.

### Matrices and models

The raw genomic relationship matrix was constructed as in VanRaden (2008),where **Z** is the centered matrix of gene content adjusted for gene frequencies, and *p _{j}* is the gene frequency for SNP

*j*. The observed allele frequencies were used and calculated from the genotyped animals. The eigenvalues of this matrix were computed using subroutine DSYEV in LAPACK. Because

**G**

_{0}was not full rank, estimation of breeding values was based on the inverse of a blended

**G**defined as (VanRaden 2008), where

**A**

_{22}is a pedigree-based numerator relationship matrix for genotyped animals. In preliminary tests, the blending had very little impact on realized accuracies (Aguilar

*et al.*2010).

The APY for inversion of **G** is based on a recursion on a subset of animals (Misztal *et al.* 2014; Misztal 2016). Split animals arbitrarily into core (*c*) and noncore (*n*) such that the number of core animals is close to the dimensionality of **G** or the number of effective SNPs. Also denoteAssume that the breeding values (BV) **u** for noncore animals are linear functions of those for core animals,where **P*** _{nc}* is a matrix relating BV of noncore to core animals and

**Φ**

*is the error term. Thenandwhere Subsequently,Using conditional distributions, and the final formula is as originally defined in Misztal*

_{n}*et al.*(2014):In this algorithm, the direct inversion is only for

**G**

*and computing*

_{cc}**G**

*is not needed.*

_{nn}Phenotypes were analyzed using the ssGBLUP modelin which **y** is the observation vector for the first nine of the recent generations, *µ* is an overall mean, **u** is the vector of additive animal effects, **S** is the incidence matrix relating observations in **y** to additive genetic effects in **u**, and **e** is the vector of random residuals. We assumed that the variances werewhere **H** is a matrix combining pedigree and genomic relationships, with its inverse as in Aguilar *et al.* (2010); *i.e.*,where **A**^{−1} is the inverse of a numerator relationship matrix for all animals included in the analysis. The partition in blocks refers to animals with/without genotypes.

### Computations

Effective population size was calculated using two formulas. Theoretical effective population size was calculated using the formula (Wright 1931), where *N*_{m} and *N*_{f} are the numbers of breeding males and females per generation, respectively. Inbreeding (or realized) effective population size was calculated from the realized increase in inbreeding by generation, using the formula by Falconer and Mackay (1996),whereand *F _{n}* is the average inbreeding in the

*n*th generation. The observed average inbreeding coefficients per generation were obtained from QMSim software (Sargolzaei and Schenkel 2009).

Genomic estimated breeding values (GEBV) were calculated using either an explicit inverse of **G** or the APY inverse. Core animals in the APY were selected randomly and their number corresponded to the number of the largest eigenvalues in **G**_{0} that explained 90% (EIG90), 95% (EIG95), 98% (EIG98), and 99% (EIG99) of the retained variance. Validation accuracies were computed only for animals in the 10th generation (without phenotypes) and defined as correlations between simulated breeding values and GEBV computed with either the regular inversion (GEBV_{REG}) or the APY (GEBV_{APY}) and a different number of core animals. All computations were applied to each of the six data sets.

### Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

## Results and Discussion

Figure 1 shows and with the different number of breeding males per generation. Both and were very similar and increased with the number of breeding males, from 20 and 19.3 to 199.2 and 188.1 for P20 and P200, respectively. If we take into account family size and variation in family size (Laporte and Charlesworth 2002), the *N*_{e} values would be in the same range (20–200) as in the simulation. Thus, the simulation scheme was effective in creating populations with close to the desired *N*_{e}. For simplicity, all graphs and discussions use rounded

The number of eigenvalues of **G**_{0} that accounted for 90%, 95%, 98%, and 99% of the original variation is shown in Figure 2 and *Appendix* Table A1. Accounting for 90% of the original variation (EIG90) required 814 ± 14 eigenvalues in population P20 and 5512 ± 19 eigenvalues in population P200. Accounting for 99% of the original variation (EIG99) required 6523 ± 68 eigenvalues in population P20 and 20,786 ± 29 eigenvalues in population P200. Thus, increasing *N*_{e} ∼10 times increased the number of selected eigenvalues by 6.8 for EIG90 and by 3.2 for EIG99. While the number of eigenvalues increased with *N*_{e}, the increase was less than proportional especially for higher *N*_{e}. Graphically (Figure 2), the increases in the number of eigenvalues corresponding to 90% and 95% were close to linear, but less so corresponding to 98% and especially for 99% past *N*_{e} = 120. The total number of positive eigenvalues in **G** is bounded by the number of SNPs (49,980 in this study) and the number of genotyped individuals (75,000 in this study). Subsequently, steeper declines from a linear trend when the number of eigenvalues is very high could be due to a limited number of SNPs and individuals used in the simulation.

The number of eigenvalues can also be expressed in terms of *N*_{e} and genome length *L* (*L* = 30). The value of EIG90 varies from ∼40*N*_{e} (P20) to 27*N*_{e} (P200), the value of EIG95 varies from 80*N*_{e} (P20) to 46*N*_{e} (P200), and the value of EIG98 varies from ∼185*N*_{e} (P20) to 75*N*_{e} (P200). Assuming that the increase in the number of eigenvalues is indeed linear with *N*_{e} but affected by limited number of SNPs and genotyped individuals, the approximate values would be EIG90 ≈ *N*_{e}*L*, EIG95 ≈ 2*N*_{e}*L*, and EIG98 ≈ 4*N*_{e}*L*.

Figure 3 shows the correlation between GEBV_{REG} and GEBV_{APY} for validation animals with variable numbers of core animals (from EIG90 to EIG99). Populations with greater *N*_{e} required a larger number of core animals to reach equivalent correlations. For all populations, the correlations were >0.99 with the number of core animals equal to EIG99 and >0.98 with the number of core animals equal to EIG98. Figure 4 shows the results as in Figure 3 but with the percentage of explained variance on the abscissa. The curves are linear and nearly identical. This means that the correlations between GEBV_{REG} and GEBV_{APY} are nearly a linear function of the percentage of the explained variance, regardless of *N*_{e}. The correlations are slightly higher than the percentage of explained variance, probably because GEBV contain not only contributions due to genomics but also some due to parent average (VanRaden 2008; Lourenco *et al.* 2015a).

Figure 5 shows true accuracies (defined as the correlation between simulated breeding value and GEBV) across the six simulated populations as a function of the number of eigenvalues explaining the given amounts of variance. All SDs of accuracies across replicates were ≤0.01. The accuracy is inversely related to *N*_{e} as it was highest for population P20 (0.89 ± 0.01) and lowest for P200 (0.77 ± 0.01). In simulated populations, Muir (2007) and Goddard (2009) showed that accuracy of GEBV decreases as *N*_{e} increases. Smaller *N*_{e} means fewer *M*_{e} or ESM to estimate and subsequently smaller prediction error variance of these effects. The accuracies were only ∼0.03 below the peak level with the number of core animals corresponding to EIG90; the accuracies increase by ∼0.02 at EIG95, peaking at EIG98; and they are slightly lower at EIG99. The accuracies with the regular inverse (noted as 100% in the graph) were slightly lower than with EIG99. The results indicate that the majority of the information for GEBV is provided by EIG90 largest eigenvalues. The accuracy provided by eigenvalues present beyond EIG95 in EIG98 was small but required almost doubling the number of core animals. Eigenvalues corresponding to the last 2% variation do not provide any information and in fact slightly reduce the accuracies, which shows that the genomic information may be redundant and in fact overfitted the data. Subsequently we can conclude that the dimensionality of the genomic information (defined as the number of the larger, informative eigenvalues) in this study does not exceed EIG98. For genomic prediction, using the number of core animals corresponding to EIG98 is sufficient, with reduction to EIG95 when computing is expensive.

The theory for the APY was developed either based on the dimensionality of **G** as computed from eigenvalues or based on the independent chromosome segments (*M*_{e}) (Misztal 2016). Both concepts may be closely related. In particular, the number of *M*_{e} is similar to the number of core animals beyond which the accuracy of GEBV does not increase. In this study, such a number corresponded to EIG98. Stam (1980) derived a probability density function for a size of an independent chromosome segment, which leads to the expected number of segments *M*_{e} = 4*N*_{e}*L*, where *L* is the size of the genome in morgans. With *L* = 30 in this study, *M*_{e} = 120*N*_{e}, which is close to an estimate of 140*N*_{e} for EIG98. The approximate values in this study, EIG90 ≈ 35*N*_{e}, EIG95 ≈ 70*N*_{e}, and EIG98 ≈ 140*N*_{e}, could be simplified to EIG90 ≈ *N*_{e}*L*, EIG95 ≈ 2*N*_{e}*L*, and EIG98 ≈ 4*N*_{e}*L*, respectively. As the segments are of variable size, Goddard (2009) argued that a more relevant formula is 2*N*_{e}*L*/log(4*N*_{e}*L*), which is equivalent to *M*_{e} = 8*N*_{e} (*N*_{e} = 20) to 6*N*_{e} (*N*_{e} = 200). Both numbers are well below EIG90. Several formulas for *M*_{e} were compared in a meta-analysis by Brard and Ricard (2015), and none were found satisfactory. Such conclusions could be due to several factors. First, their study looked at realized accuracies, and these are strongly affected by selection (Bijma 2012; Lourenco *et al.* 2015a). Second, the implicit assumption was of segments of equal size, while segment sizes were variable (Stam 1980). Third, Brard and Ricard (2015) pointed out that part of the difficulty resided in getting good estimates of *N*_{e}, a parameter that is not always well defined and that changes over time. We can posit that in genomic selection we can estimate the effects of the largest chromosome segments well and those of smaller segments not as well but they are still useful for prediction, and the remaining smallest segments provide insufficient accuracy for prediction. Compared to methods reviewed by Brard and Ricard (2015), the possible definition of *M*_{e} by EIG98 does not depend on realized accuracies or trait definition, but does require genotype collection.

This study focused on dimensionality of the GRM. In fact, the eigenvalue distribution of a SNP BLUP matrix (**Z**′**Z**, where **Z** is gene content) is the same as both share the same singular values from singular value decomposition of **Z**. Therefore, the dimensionality of the GRM can be defined as dimensionality of the SNP genomic information in general. In this study, the eigenvalues were computed from the GRM explicitly constructed. However, for large data sets, it is possible to compute them from the singular value decomposition of matrix **Z**, with a cost quadratic in the number of markers and only linear in the number of individuals (*e.g.*, by subroutine DGESVD in LAPACK).

Some results of this study could be influenced by simulation parameters. In particular, a larger number of genotyped animals and the number of SNP markers could have increased the dimensionality especially for higher *N*_{e}. Genotypes by simulation are perfect while in real data they are affected by quality control and possibly imputation. In addition, the simulated population was not selected and the number of genotyped generations was small. Further studies will show applicability of the results of this article to real populations undergoing selection.

Although the largest population size simulated in this study had *N*_{e} = 200, the dimensionality of the GRM can be extrapolated for populations with a larger *N*_{e}. In general, the dimensionality of the genomic information (**G** or **Z′Z**) is ≤min (*M*_{e}, *N*_{snp}, *N*_{ind}), where *N*_{snp} is the number of SNPs and *N*_{ind} is the number of genotyped individuals. In this study, *N*_{snp} and *N*_{ind} were several times larger than *M*_{e} although the dimensionality of the GRM was depressed by limited *N*_{snp} and *N*_{ind} especially for a large *N*_{e}. It appears that the dimensionality of the GRM is close to *M*_{e} when *N*_{snp} and *N*_{ind} are a few times larger than *M*_{e}. In fact, MacLeod *et al.* (2005) found that detection of 90% of junctions between independent chromosome segments required ∼12 times as many markers as the number of junctions (≈*M*_{e}). Assume *N*_{e} = 3000 and *M*_{e} = 360,000. If *N*_{ind} or *N*_{snp} is low in comparison to *M*_{e}, dimensionality will be close to min (*N*_{ind}, *N*_{snp}). The dimensionality will reach *M*_{e} only when both *N*_{ind} and *N*_{snp} are >>*M*_{e}. For polygenic traits, *N*_{snp} determines a fraction of the additive variance explained by the genomic information (Jensen *et al.* 2012). Hypothesizing that in fact it is the ratio of *N*_{e} to *M*_{e} that is important, even a large *N*_{snp} can create a “missing heritability” problem in humans where *N*_{e} and subsequently *M*_{e} are large (Yang *et al.* 2010).

Simulations in this study assumed a polygenic model with an equal variance of each SNP. Heterogeneous SNP variance can be incorporated via weights for each trait separately as discussed in Misztal (2016). In particular, if positions of all causative SNPs were known, the rank of **G** would be equal to the number of causative SNPs (Misztal 2016); with 200 causative SNPs the rank of **G** would be 200. If only a few causative SNPs are known and their position/variance is not known precisely, we can expect the rank of **G** to be lower than that estimated from the effective population size but larger than the number of causative SNPs.

Results of this study may be applicable toward understanding the limits of genome-wide association studies (GWAS) resolution. Wang *et al.* (2012) found in a simulation study that the highest correlation of a simulated QTL value was not with the SNP effect closest to the QTL but with the average of 8–16 adjacent SNP markers. Su *et al.* (2014) investigated individual or block variances on 50,000 SNPs in Holstein cattle and found that slightly higher accuracy was obtained when the same variances were imposed on a block of 30 SNPs, which corresponds to 2 Mb or ∼15*N*_{e} segments (assuming *N*_{e} = 100 for Holsteins). In a simulation study, Hassani *et al.* (2015) found that QTL effects were better predicted by averages of ±100 flanking markers than by an average of a smaller number of flanking markers. The resolution of GWAS may be limited to a size of an individual chromosome segment and subsequently by *N*_{e}.

In summary, when the number of SNP markers and genotyped animals is large, the dimensionality of the SNP genomic information defined by the eigenvalue of the GRM is approximately a linear function of effective population size. Subsequently, an inverse of the GRM based on limited recursion can be computed inexpensively for a large number of individuals. Such an inverse results in more accurate estimation of GEBV than a direct inverse.

## Acknowledgments

The authors thank Paul VanRaden for helpful suggestions. Editing by Heather L. Bradford and Shogo Tsuruta is gratefully acknowledged. This research was primarily supported by grants from the Holstein Association USA (Brattleboro, VT), the American Angus Association, Zoetis, Cobb-Vantress, Smithfield Premium Genetics, the Pig Improvement Company, and the U.S. Department of Agriculture’s National Institute of Food and Agriculture (Agriculture and Food Research Initiative competitive grant 2015-67015-22936).

## Appendix

## Footnotes

*Communicating editor: D. J. de Koning*

- Received January 19, 2016.
- Accepted February 29, 2016.

- Copyright © 2016 by the Genetics Society of America

Available freely online through the author-supported open access option.