## Abstract

Thousands of genes responsible for many diseases and other common traits in humans have been detected by Genome Wide Association Studies (GWAS) in the last decade. However, candidate causal variants found so far usually explain only a small fraction of the heritability estimated by family data. The most common explanation for this observation is that the missing heritability corresponds to variants, either rare or common, with very small effect, which pass undetected due to a lack of statistical power. We carried out a meta-analysis using data from the NHGRI-EBI GWAS Catalog in order to explore the observed distribution of locus effects for a set of 42 complex traits and to quantify their contribution to narrow-sense heritability. With the data at hand, we were able to predict the expected distribution of locus effects for 16 traits and diseases, their expected contribution to heritability, and the missing number of loci yet to be discovered to fully explain the familial heritability estimates. Our results indicate that, for 6 out of the 16 traits, the additive contribution of a great number of loci is unable to explain the familial (broad-sense) heritability, suggesting that the gap between GWAS and familial estimates of heritability may not ever be closed for these traits. In contrast, for the other 10 traits, the additive contribution of hundreds or thousands of loci yet to be found could potentially explain the familial heritability estimates, if this were the case. Computer simulations are used to illustrate the possible contribution from nonadditive genetic effects to the gap between GWAS and familial estimates of heritability.

UNDERSTANDING the genetic architecture of complex traits has become a fundamental topic of study in human genetics (Gibson 2012; Timpson *et al.* 2018). In recent years, huge efforts have been made to investigate the genetic basis of human complex traits through Genome-Wide Association Studies (GWAS) or meta-analyses of their results (Paternoster *et al.* 2015; Gormley *et al.* 2016; Justice *et al.* 2017; Visscher *et al.* 2017). There has been a parallel increase in the number of big Consortiums able to carry out large GWAS with higher and higher numbers of individuals, and, therefore, with increasing statistical power (SIGMA Type 2 Diabetes Consortium *et al.* 2014; Yengo *et al.* 2018), as well as of genomic repositories and online resources, including databases specialized in published GWAS results (Sudlow *et al.* 2015; MacArthur *et al.* 2017; Canela-Xandri *et al.* 2018). To date, thousands of SNPs have been identified to be associated with hundreds of human diseases or other traits with genome-wide significance, according to data recorded by the NHGRI-EBI GWAS Catalog (MacArthur *et al.* 2017). However, SNP markers of known variants explain but a small percentage of the heritability measured by cohort studies for almost every studied trait, what has been referred to as “missing” heritability (Manolio *et al.* 2009; Nolte *et al.* 2017).

The most common assumption to explain the missing heritability is that many common variants of small effect pass unnoticed in most GWAS due to a lack of statistical power (Yang *et al.* 2010), and a number of loci on the order of hundreds to thousands are yet to be found (Visscher *et al.* 2017). In fact, the missing heritability gap of well-studied traits such as human height has been reduced as GWAS had been performed with increasingly larger sample sizes and statistic power (Wood *et al.* 2014; Yengo *et al.* 2018), although the newly found SNPs tend to have smaller effect sizes on average (Park *et al.* 2010), and the gap is reduced slowly (Nolte *et al.* 2017). In addition, common genotyped SNPs can capture up to 60% of familial heritability estimates (Yang *et al.* 2010; de los Campos *et al.* 2013; Nolte *et al.* 2017) or even higher proportions (Yang *et al.* 2015).

The narrow-sense heritability explained by SNPs found in GWA studies is compared with estimates obtained from family data, usually twin designs. These may involve nonadditive (dominance and epistasis) genetic components as well as other interaction terms including environmental effects (Zuk *et al.* 2012; Chen *et al.* 2015; Ni *et al.* 2018). Therefore, although it has been suggested that most genetic variation for human traits is of additive nature (Hill *et al.* 2008; Polderman *et al.* 2015; Zhu *et al.* 2015), some part of the gap between GWAS and familial heritability estimates may also be due to the bias involved in the familial estimates (Zuk *et al.* 2012; Hemani *et al.* 2013). One way to address this issue is to try getting the expected full contribution to narrow-sense heritability from loci detected by GWAS, and compare it with the familial estimates. In this work we attempt to do so by using information from the GWAS Catalog.

Our analysis consists of extracting information on effects and frequencies of variants for a number of human traits and diseases from the GWAS Catalog with the following objectives: (1) To investigate the nature of the distribution of locus effect sizes already discovered and their contribution to narrow-sense heritability, and (2) to predict the expected full distribution of effects and frequencies of loci in order to ascertain whether or not this could be able to explain the estimates of heritability obtained from family studies. Our results indicate that the familial heritability of 10 out of the 16 traits studied could be potentially explained by the contribution of the average effects of hundreds to thousands of loci yet to be found by GWAS. However, for the other six traits there is a substantial gap between the expected GWAS heritability and the familial heritability, suggesting that an additive contribution of single loci is unable to explain the familial heritability values.

## Methods

In short, we began by processing the GWAS Catalog in two steps, in order to get a set of data with the most meaningful information associated to SNPs and GWA studies. First, by filtering incomplete or low informative data and, second, by clustering together traits with a highly overlapping genetic background. Additional processing was required for subsequent analyses involving locus effect sizes, frequencies, and contributions to heritability. Computer simulations were carried out to illustrate the possible impact of nonadditive genetic variation on familial estimates of heritability.

### Processing of the GWAS catalog

All data manipulation, including statistical analysis, was carried out using the R language (R Core Team 2017). We worked with the NHGRI-EBI GWAS Catalog data (MacArthur *et al.* 2017), publicly available at https://www.ebi.ac.uk/gwas/, and accessed on December 5, 2017. We started by selecting a limited number of fields from the database for each scientific study PubMed ID (PMID), as the SNP ID itself, the mapped gene, the effect, reported as an odds ratio (OR) or beta-coefficient (BETA), the frequency of the risk allele, and the reported *P*-value. PMID-related variables were also gathered, as the name of the disease or trait examined in the study and the total population sample, computed from the information of the initial and replication samples used. The Catalog contains some ambiguity regarding the units of the effects registered. Doubtful cases were checked by looking at the corresponding publications, and, if their effect could not be assigned as BETA or, *e.g.*, because it was measured in trait units rather than in standardized units, they were disregarded.

We checked for the occurrence of a list of necessary variables (effect, gene, *P*-value, SNP, and trait), and removed any row corresponding to a SNP without a complete information on these variables. We also limited our study to the most significant associations, eliminating SNPs with a significance level higher than the standard *P*-value = 5 × 10^{−8}. A separate dataset without filtering for statistical significance was also considered for the final set of traits. For all purposes, only one SNP per associated Catalog gene (that with the lowest *P*-value) was considered. Thus, the gene or intergenic name was the unit considered in the analysis aimed at representing a potential causal locus. Thus, hereafter each different GWAS-Catalog gene represents a locus corresponding to the information of a single SNP. For example, for the trait Height, the Catalog version analyzed contained a total of 855 SNP entrances from 25 different PMIDs. Many of these SNPs were associated to the same Catalog gene or intergenic sequence. Considering only different gene names and selecting the SNP with the lowest *P*-value associated to each gene, only 370 different loci remained. Later, after filtering by type of effect (BETA), which implied removing one PMID with ambiguous type and another with OR type, the remaining final set of data for this trait contained 346 loci arising from 10 PMID.

Because we wanted to investigate the distribution of locus effects with robustness, we only considered traits with a wide and well-known genetic background composed by at least 30 unique genes detected. We initially differentiated as many traits as unique names were given to the mapped disease or trait field in the original GWAS Catalog. However, it often occurs that different researchers studying the same trait publish their results using different trait names (*e.g.*, “LDL” and “LDL levels”). In order to avoid working with duplicated or redundant traits, we clustered some of them (see Supplemental Material, Table S1) on the basis of their common genetic background and carried out some additional processing steps, as explained in File S1. After this step, we restricted the traits analyzed to those represented by at least three different PMIDs.

### Contribution of loci to heritability

From the filtered GWAS database, narrow-sense locus-specific heritability was estimated through the calculation of the contribution of each locus to the additive variance by using the classical formula (Falconer and Mackay 1996), , where is the risk allele frequency and is the average effect of the gene substitution for the locus (henceforth, the average effect or locus effect). For BETA traits, the additive variance equals the heritability , as the average effects are measured in phenotypic SD. For OR traits, we estimated the locus-specific heritability (*i.e.*, variance in liability) following the method described by So *et al.* (2011), assuming additivity of SNP effects, and the prevalence values published in different epidemiology and genetic papers (Table S2). From the and frequency values, locus effects for OR traits were obtained in the same units of phenotypic SD as BETA traits. Finally, the contribution to heritability of all loci corresponding to a given trait were added together to obtain the GWAS heritability for each trait.

After all the above filtering steps, the dataset for locus effects and heritability analyses had a total of 7886 loci corresponding to 328 studies and 42 human traits. The estimated values are shown in Table S2 along with the reported values of familial heritability found in the literature.

In order to measure the proportional contribution of different classes of locus effects to global we defined three arbitrary, but well-defined, effect classes: low, medium, and high. These classes were assigned to each trait according to the mean and SD of their distribution of effect sizes. Low-effect sizes were defined as those with a value lower than *e*^{−1} SD below the mean effect size. Medium-effect sizes were those between *e*^{−1} SD below and above the mean, and high-effect sizes those with effects larger than *e*^{−1} SD above the mean. With these definitions, an average of ∼50% of the loci were in the low-effect class, ∼36% in the medium-effect class, and ∼14% in the high-effect class.

### Analysis of the change in locus effect size, frequency, and explained heritability for increasing sample sizes

We assumed that locus effects and frequencies would be better estimated in studies with larger sample sizes (*N*), in agreement with previous studies (Auer and Lettre 2015; Visscher *et al.* 2017) as well as our own observation from the GWAS Catalog. Thus, estimates obtained in studies with larger *N* were reassigned to the corresponding gene identity, independently of the study. That is, if we consider two studies, PMID_{1} and PMID_{2}, regarding the same human trait, with sample sizes , the SNP effects and frequencies associated to genes found in PMID_{1} that were also present in PMID_{2} were assigned the values of the corresponding genes in PMID_{2}. Therefore, a locus found in different studies would have an associated effect and frequency corresponding to a single SNP from the study with the largest sample size, usually, but not always, the most recent one.

We tested three different regression models to measure the relationship between variables in the analyses of locus effects, frequencies (in terms of the minor allele frequency, MAF) or heritability. These regression models were: simple linear regression: ; two-parameter exponential regression: ; and four-parameter logistic regression: ; where the dependent variable *Y* may refer to the mean locus effect size, frequency, heritability, or any other related variable, such as the parameters of the distribution of locus effects, and the independent variable *X* is the number of loci found at a given stage in studies with increasing sample sizes.

When these models were performed using the accumulated number of loci as an independent variable, we only considered those traits that had at least three observations (*i.e.*, three PMIDs) in which the cumulated number of loci was at least 30, so that every regression analysis had at least three points, each corresponding to an estimate obtained with at least 30 loci. This corresponds to a subset of 16 traits, 177 PMIDs, and 5692 loci. For model selection, the Akaike Information Criterion (AIC) was used (Akaike 1974). The final dataset with all unique loci described for each trait after all above processing steps, and after updating locus effect sizes and frequencies, is shown in Table S3. This is the dataset used for estimating heritabilities from the GWAS Catalog shown in Table S2.

### Inferring the distribution of locus effects, the missing number of loci, and the expected value of heritability

Locus effects and MAFs were fitted into known probability distributions using the R package “fitdistrplus” (Delignette-Muller and Dutang 2015) using the maximum likelihood estimation method. In order to determine which distribution best fitted the observed locus effects, we considered the following possible continuous distributions: Beta, Exponential, Gamma, Gaussian, Logistic, and Log-normal. We then selected the best fit by using AIC (see Table S4).

Given that the change in the parameters of the distribution of locus effects and MAF as new loci are being discovered could also be predicted with the regression parameters described in the previous section, the expected distribution of locus effects and frequencies, including those yet unobserved, could be inferred. From these, we could obtain the number of loci necessary to explain the observed value of familial heritability , or the closest one. To do so, we assumed an increasing number of loci for each trait, and sampled that number from the predicted distribution of effect sizes and MAF. For each number of loci sampled, the distribution parameters were those predicted from the corresponding regression parameters (Table S5). This process was repeated 10,000 times for each set of loci that were added (up to 20,000 loci), providing a distribution of expected heritability values. From this distribution, the expected parameters and numbers of loci that could explain within 95% confidence intervals were chosen. If could not be explained by any expected distribution and any number of added loci, the median heritability estimate closest to was chosen. A detailed example of the prediction procedure is shown in File S2 and Figure S1 therein.

### Cross-validation of predictions

We evaluated the accuracy of the predictions on a different set of data composed of new variants published in a more recent version of the GWAS Catalog accessed on August 27, 2018. For validation purposes, we only considered new gene-associated SNPs that belong to traits already present on our final dataset of 16 traits used for inferring the distribution of locus effects. This test set contained data of 153 SNPs mapping new gene names (loci) described in 11 different PMIDs corresponding to the following eight traits: Body mass index, Height, Prostate cancer, Psoriasis, Rheumatoid arthritis (including Rheumatoid arthritis ACPA-positive), Systemic lupus erythematosus, Type 2 diabetes, and Waist-to-hip ratio adjusted for BMI, *i.e.*, Waist-to-hip-related traits (Table S6).

### Computer simulations

Computer simulations were carried with an in-house C program to illustrate the possible biases inherent to estimates of heritability obtained from family data, particularly twin studies, when dominance and epistasis models are assumed. The expected distributions of locus effect sizes and frequencies for the trait “Digestive disease” were used for illustration. A population of size 2 × 10^{6} randomly mated diploid individuals was considered where alleles for 660 biallelic loci have homozygous effects *a*, where *a* values are twice the allelic average effects sampled from the inferred log-normal distribution of average effects for the Digestive disease trait (mean *a* = 0.029). Heterozygous effects (*ah*) were assumed either additive (*h* = 0.5), partially recessive (*h* = 0.2), or fully recessive (*h* = 0). Allelic frequencies (*q*) were taken from the expected distribution of frequencies for the Digestive disease. Individual genotypes for the quantitative trait were the sum of the genotypic values for all loci involved, and phenotypic effects were obtained by adding an environmental deviation normally distributed with mean zero and variance *V _{E}* adjusted such that the phenotypic variance is

*V*= 1.

_{P}The additive (*V _{Ag}*) and dominance (

*V*) variances in the absence of epistasis (genic variances) were obtained from the sum of the variances of individual loci. Thus,

_{Dg}*V*= Σ2

_{Ag}*α*

^{2}

*pq*and

*V*= Σ(2

_{Dg}*dpq*)

^{2}, where

*α*=

*ah*– 2

*dq*,

*d*=

*a*(

*h*– 1/2), the summation is over all loci, =

*V*/

_{Ag}*V*is the narrow-sense genic heritability, and =

_{P}*V*/

_{Dg}*V*the genic dominance contribution to phenotypic variance. The genotypic variance (

_{P}*V*) was calculated from the multilocus genotypic values of individuals, and the broad-sense heritability was obtained as

_{G}*H*

^{2}=

*V*/

_{G}*V*.

_{P}A twin design was carried out producing 10^{6} families of two monozygotic and two dizygotic twins. The phenotypic correlations between monozygotic (*t _{MZ}*) and dizygotic (

*t*) twins were obtained from ANOVA. Estimates of familial heritabilities were calculated as = 2(

_{DZ}*t*–

_{MZ}*t*). No shared environmental effects were assumed between twins. Thus, is expected to estimate + 1.5 in the absence of epistasis (Lynch and Walsh 1998, p. 538). Average locus effects (

_{DZ}*α*) were estimated from the regression of the individual phenotypic values on the number of copies of the alleles for each locus, and the contribution of each locus to heritability was obtained as because

*V*= 1, where

_{P}*q*is the allele frequency. The analogous to GWAS heritability was obtained as the sum of contributions from all loci.

A multilocus epistatic model was assumed where homozygous genotypes for the trait interact with one another. Thus, epistasis occurs only between homozygous loci such that their multilocus genotypic effects for the trait are doubled. Four scenarios were then considered combining within-locus additive or recessive gene action, and between-locus additive or epistatic gene action. Under dominance, the epistatic model assumed involves additive by additive (*V _{AA}*), additive by dominance (

*V*) and dominance by dominance (

_{AD}*V*) components. Allelic homozygous and dominance effects accounting for the epistatic effects imply an increase in the additive variance relative to the case of no epistasis (Cheverud and Routman 1995). The GWAS heritability is expected to estimate the narrow-sense heritability (

_{DD}*h*

^{2}) while the twins heritability is expected to estimate

*h*

^{2}+

*V*/

_{D}*V*+

_{P}*V*/

_{AA}*V*+

_{P}*V*/

_{AD}*V*+

_{P}*V*/

_{DD}*V*+ higher order epistatic components (Lynch and Walsh 1998, p. 583). All simulation values and estimates were averaged over 20 replicates.

_{P}### Data availability

The GWAS Catalog database is publicly accessible and downloadable from https://www.ebi.ac.uk/gwas/. The Supplemental Material contains two Files, five Figures, with Figure S1 inclu