## Abstract

Prediction of complex traits using molecular genetic information is an active area in quantitative genetics research. In the postgenomic era, many types of -omic (*e.g.*, transcriptomic, epigenomic, methylomic, and proteomic) data are becoming increasingly available. Therefore, evaluating the utility of this massive amount of information in prediction of complex traits is of interest. DNA methylation, the covalent change of a DNA molecule without affecting its underlying sequence, is one quantifiable form of epigenetic modification. We used methylation information for predicting plant height (PH) in *Arabidopsis thaliana* nonparametrically, using reproducing kernel Hilbert spaces (RKHS) regression. Also, we used different criteria for selecting smaller sets of probes, to assess how representative probes could be used in prediction instead of using all probes, which may lessen computational burden and lower experimental costs. Methylation information was used for describing epigenetic similarities between individuals through a kernel matrix, and the performance of predicting PH using this similarity matrix was reasonably good. The predictive correlation reached 0.53 and the same value was attained when only preselected probes were used for prediction. We created a kernel that mimics the genomic relationship matrix in genomic best linear unbiased prediction (G-BLUP) and estimated that, in this particular data set, epigenetic variation accounted for 65% of the phenotypic variance. Our results suggest that methylation information can be useful in whole-genome prediction of complex traits and that it may help to enhance understanding of complex traits when epigenetics is under examination.

- epigenetics
- DNA methylation
- MeDIP-Chip
- phenotypic prediction
- RKHS regression; GenPred; shared data resource

EPIGENETICS focuses on heritable changes of genetic materials that do not reside in the sequence of DNA, called epigenetic modifications (Riggs *et al.* 1996; Riggs and Porter 1996). Major forms of these changes are DNA methylation, histone modification, and noncoding RNAs (ncRNAs) (Rivera and Bennett 2010). DNA methylation is the most common epigenetic modification, which can have various forms depending on the targeting nucleotide of the modification (Ratel *et al.* 2006). In vertebrates and flowering plants, it is usually referred to as the covalent addition of a methyl group (-CH_{3}) to the 5-position carbon atom (^{5}C) of the cytosine pyrimidine ring, resulting in 5-methylcytosine (^{m5}C) (Jeltsch 2002; Meissner *et al.* 2005; Vanyushin 2006). Thus, “DNA methylation” stands for ^{m5}C throughout this article. Histone modification is the multivalent modification of histone tails of the core histones, which can be acetylation, methylation, phosphorylation, ubiquitination, and symoylation (Kouzarides 2007; Ruthenburg *et al.* 2007). Both DNA methylation and histone modification interact with the entering and binding of transcription factors (TFs) to the DNA molecule such that gene expression is altered. Usually, DNA methylation is associated with reduced gene expression (Bird 1984; Razin and Cedar 1991; Lim and Maher 2010) and histone modification can either enhance or repress expression, according to different modification targets (*e.g.*, which amino acids are at the histone tail) and modification types (*e.g.*, methylation or acetylation) (Berger 2002; Cheung and Lau 2005). Recently, ncRNAs were found to be composed of a hidden layer of internal signals that control various levels of gene expression associated with physiological and developmental processes. Their role in epigenetic regulation has been acknowledged as well (Zhou *et al.* 2010; Kaikkonen *et al.* 2011).

Epigenetic modifications have an important role in gene expression regulation, and malfunctioning of the regulation process can have severe consequences. In epidemiology and human genetics, many diseases and disorders, including cancer, have been confirmed to have an epigenetic basis (Jones and Baylin 2002, 2007; Jiang *et al.* 2004; Esteller 2008; Pembrey 2012; Tollefsbol 2012). For example, Prader–Willi syndrome (PWS) and Angelman syndrome (AS) are sister imprinting-related disorders involving deletion of DNA segments derived from different parents at the same genomic region (Meijers-Heijboer *et al.* 1992; Nicholls *et al.* 1998; Cassidy *et al.* 2000). Another example of epigenetics-related diseases is that of oncogenes; these exist in almost everyone’s genome while only a small proportion of the population develops a cancer. Here, the promoter region of a tumor suppressor gene is usually unmethylated such that the gene is expressed normally and, therefore, it prevents the formation of a tumor. In cases where there is hypermethylation in the promoter region, the tumor suppressor is deactivated and a cancer develops (Jones and Baylin 2002; Robertson 2002; Egger *et al.* 2004).

Due to the potentially important role of epigenetics in diseases, epigenome-wide association studies (EWAS), a counterpart of genome-wide association studies (GWAS) at the epigenome level, have been conducted in recent years (MacArthur 2008; Rakyan *et al.* 2011; Bell 2013), aiming at finding associations between epigenetic polymorphisms and traits of interest, instead of using DNA polymorphisms (*e.g.*, SNPs). Although epigenetic regulation is not restricted to DNA methylation, the latter is the most commonly used biomarker in EWAS at present, because it is more stable and easier to be quantified than other epigenetic regulatory mechanisms (Flanagan 2015). In EWAS, DNA methylation across the whole genome is converted into a certain measurement reflecting the “methylation level,” using methylation-sensitive enzyme digestion (Waalwijk and Flavell 1978; Kaput and Sneider 1979), methylated DNA immunoprecipitation (MeDIP) (Weber *et al.* 2005), or bisulfite sequencing (BS-Seq) that combines next-generation sequencing techniques with bisulfite conversion (Frommer *et al.* 1992), with BS-Seq being the most popular method used in methylation profiling. In BS-Seq, a DNA sample is treated with sodium bisulfite, which can convert unmethylated cytosine into uracil, whereas methylated cytosine is intact. Uracil is read as thymine in polymerase chain reaction (PCR) and sequence alignment after PCR amplification gives the counts of C (originally methylated cytosine) and T (originally unmethylated cytosine) at a single-base resolution. The ratio gives the absolute methylation level at that base, which is referred to as the *β*-value in methylation profiling literature and is usually considered as the “gold standard” in methylation quantification (Krueger *et al.* 2012). Once the methylation level is obtained, statistical methods are then applied to find associations between the “methylation profile” and the trait of interest in a selected sample.

Although some diseases associated with dysregulation of epigenetic modification at some genomic region have been found, EWAS has similar drawbacks to GWAS: it is difficult to estimate how much variation in phenotypes, especially for complex traits, is explained by epigenetic polymorphisms, even if there is evidence that they contribute to phenotypes, either biologically or statistically. Two studies attempting to solve this question have been published in recent years, with *Arabidopsis thaliana* used as experimental material (Johannes *et al.* 2009; Reinders *et al.* 2009). In Johannes *et al.* (2009), a wild-type inbred line was chosen as the paternal founder and a *ddm1* mutant was used as the maternal founder. The *ddm1* mutant was genetically identical to the wild type, except for the *DDM1* locus and few other loci. The *DDM1* locus encodes an ATPase chromatin remodeler that is involved in methylation maintenance, and the *ddm1* mutant used in their study was featured by a whole-genome-wide demethylation. The F_{1} generation was obtained by crossing the wild-type (as male) and *ddm1* mutant (as female), and then it was backcrossed with the wild type (as male) to create the backcross generation (BC_{1}). The BC_{1} individuals were selfed for several generations to construct a population of epigenetic recombinant inbred lines (epiRILs). In total, 505 epiRILs were obtained by Johannes *et al.* (2009) after four generations of selfing starting from the BC_{1} generation. Since these 505 lines were (almost) isogenic at the DNA level and differed only in methylation profile, all observable phenotypic variation was then regarded as due to epigenetic and environmental factors, with the impact of genetic polymorphism at the DNA level ruled out. By examining plant height and flowering time, Johannes *et al.* (2009) found that epigenetics contributed ∼30% of the phenotypic variation. A similar approach was used in Reinders *et al.* (2009) with the only difference being that the genetic polymorphism in the two parental *Arabidopsis* lines was at the *Met1-3* locus, which also has an impact on the whole-genome methylation level, and that instead of a backcrossing to a parental line, selfing of F_{1} was adopted. At the end of the F_{8} generation, 68 epiRILs were obtained.

Both Johannes *et al.* (2009) and Reinders *et al.* (2009) found that epigenetic variation contributed to a considerable proportion of phenotypic variation, hinting that epigenetic information may help prediction of quantitative traits. When using DNA polymorphisms, whole-genome-enabled prediction models can be viewed as an extension of the single-marker regression models used in GWAS, where instead of finding genomic regions that may be associated with a complex trait, integrating all marker information for prediction and/or artificial selection is the ultimate goal. In a similar context, EWAS can also be extended for prediction, using data mining and machine learning techniques. Because methylation profiles can explain phenotypic variation and it is widely believed that DNA methylation is the most stable epigenetic modification that can be retained in either mitosis or meiosis, perhaps prediction can be enhanced by using methylation data, as foreseen by González-Recio (2012). In this study, therefore, we used DNA methylation data for building statistical models suitable for prediction purposes, with the expectation that this information could potentially supplement that from DNA polymorphisms.

## Materials and Methods

### Data

This study used phenotypic and methylation data. The phenotypic data set is from Johannes *et al.* (2009), and it contains measurements of plant height (PH) and flowering time (FT) collected in two greenhouses for 505 *Arabidopsis* epiRILs and 2 parental lines. These data were analyzed by Johannes *et al.* (2009), using a mixed-effects model, to explore the proportion of phenotypic variance explained by different effects. Their model used greenhouses and microenvironments (*i.e.*, individual planting plots in the greenhouse) as fixed effects and the 505 epiRILs as a random factor. Greenhouse explained 39.61% and 2.45% of phenotypic variance for FT and PH, respectively, and microenvironment explained 4.12% and 0.086% of phenotypic variance for these two traits; the variance explained by random epiRIL effects accounted for ∼30% for both traits. Because the microenvironment arrangement data are no longer available (F. Johannes, personal communication), we decided to perform the analysis on PH only, as FT was apparently more strongly affected by this factor. The methylation data were downloaded from the Gene Expression Omnibus data repository (accession no. GSE37284). In this data set, 123 of the 505 epiRILs and the 2 parental lines were epi-profiled, using MeDIP with a customer-designed array chip. Each line was examined at 711,320 probes (loci) located on five *Arabidopsis* chromosomes. Each probe is associated with two values: one is the rescaled log_{2} of the signal/background intensity ratio, which describes the enrichment of methylated cytosine proxied by that probe. This information is referred to as methyl values in subsequent discussion, and a higher methyl value indicates higher level of methylation. The other value is methylation status [methylated (M), intermediately methylated (I), or unmethylated (U)] predicted from the methyl values. Note that the methyl values are generated from enrichment intensity ratios, so these are relative, rather than absolute, values. Due to this reason, there are typically no threshold values that can be used to perform methylation status calls, and hence the status was predicted using a hidden Markov model (Colomé-Tatché *et al.* 2012), a commonly used tool in bioinformatic analysis. This predicted methylation status is referred to as methyl status hereafter. There were no missing values in the methylation data, and after removing epiRILs without phenotypic data, 114 lines remained for subsequent analysis. Therefore, each epiRIL used in this study has 1 phenotypic record on PH and paired methyl-values/methyl-status records at each of 711,320 probes (loci). For more detailed information about the methylation data, see Colomé-Tatché *et al.* (2012) and the NCBI description page. A description on data processing was given in Cortijo *et al.* (2014a).

### Methods and prediction models

The methylation data described above were used by Cortijo *et al.* (2014b) to map epigenetic QTL (epiQTL) contributing to root length and FT, and three major epiQTL were found for both traits. Using analysis of variance, it was found that the broad sense (epi)heritabilities of these two traits were ∼60%, and major epiQTL explained 87% and 60% of (epi)heritability in the two traits, respectively. Due to the strong contribution of methylation to variation of phenotype, we decided to explore the predictive power of this information, as suggested by González-Recio (2012). Here, we built whole (epi)genome prediction models that are analogous to whole-genome prediction models, where instead of SNP markers, methylation information was used as predictor variables. Most genome-enabled prediction studies (*e.g.*, Meuwissen *et al.* 2001; de los Campos *et al.* 2013) use a linear model with the form (1)where ** y** is a vector of

*n*phenotypic records,

*μ*is an unknown constant (intercept) common to all individuals,

**is a vector of fixed effects with associated incidence matrix**

*b***,**

*X***is an matrix possessing SNP genotypic codes (**

*W**e.g.*, , 1, or 2), and

**is a vector of regression coefficients associated with all SNP loci. Model 1 is more statistical than biological when the**

*α***matrix contains epigenetic information, compared to when using genomic information such as SNP markers, with the reasons being stated next. When performing genomic prediction using SNPs,**

*W***represents allelic substitution effects at these marker loci, an important concept in quantitative genetics; hence, a statistical regression coefficient can be linked to a quantitative genetic parameter. Since such a concept does not exist in quantitative epigenetic analysis, this implies that a model with this form may not lead itself to interpretability of underlying biological processes. Therefore, we adopted kernel methods for prediction purposes, from which a biological interpretation is available with less difficulty.**

*α*#### Kernel methods: theory:

In kernel regression, phenotypes and predictor variables are often linked nonlinearly, via a kernel function. In a regression problem without nuisance variables, the relationship between an observation and its corresponding covariates is generally written as (2)where is the observation on the *i*th subject and is, say, a vector of covariates measured on *i*; is some function (usually unknown); and is the model residual. For the purpose of describing the kernel methods, it is assumed that phenotypes (*y*’s) and regression covariates (**x**’s) are centered, so Equation 2 does not include an intercept. In standard linear regression, is , where ** ω** is a vector of unknown coefficients to be inferred. The most common solution for the weights

**is obtained by using ordinary least squares (OLS). In whole-genome prediction of complex traits, many methods use this functional form but assign some penalties (or priors) to**

*ω***because the “curse of dimensionality” makes OLS not applicable, and often Bayesian techniques are employed (Gianola**

*ω**et al.*2009; Gianola 2013). The linear additive model often provides a reasonable approximation to the underlying statistical architecture of a complex trait and it is easy to interpret. However, nonadditive gene action, for example epistasis, is usually not accounted for, which may lead to incorrect attributions of genetic variation .

One can define as the conditional expectation of in Equation 2, given , which can be inferred using the Nadaraya–Watson estimator (Nadaraya 1964; Watson 1964), having the form (Silverman 1986; Gianola *et al.* 2006) (3)In genome-enabled prediction using high-density markers, *n* is the number of individuals, is the vector of SNP marker genotypes of individual *i*, ** x** is the focal point at which the kernel function is evaluated, and

*h*is a smoothing parameter of the kernel function. Because possesses the marker information of individual

*i*, measures the “genomic distance” between individuals

*i*and

*j*by definition. Therefore, the symmetric matrix measures the pairwise genomic distance of all individuals. According to Gianola and Van Kaam (2008), this kernel treatment can be written as (the “dual formulation”) the linear regression model (4)where

**is an vector; is an symmetric, positive definite matrix;**

*y***is an vector of regression coefficients; and**

*α***is the model residual with assumption , where is the residual variance. Under the reproducing kernel Hilbert spaces framework (**

*e**e.g.*, Gianola and Van Kaam 2008), one assumes that and, because is symmetric and invertible, is estimated as the solution to (5)Above, is the variance captured by the kernel. The vector estimates the vector of genetic effects marked by SNPs, that is, .

Alternatively, starting from , one can minimize a loss function with form (6)where *λ* is a regularization parameter and is the squared norm of ** g** under a Hilbert space ℋ. According to the representer theorem of Kimeldorf and Wahba (1971), the objective function

**is reduced to , as in Equation 4, and Equation 6 becomes . When minimizing by taking its first derivative with respect to**

*g***, Equation 5 is retrieved if is assumed. Because optimization of the penalty function is carried out under a Hilbert space, this approach is known as reproducing kernel Hilbert spaces (RKHS) regression, first proposed in computer sciences and machine learning (Aronszajn 1950; Kimeldorf and Wahba 1971; Wahba 1990, 1999, 2002).**

*α*Equation 4 has the same form as the “animal model” (*e.g.*, Henderson 1984; Mrode 2014) widely used in animal breeding, (7)where ** u** is the vector of infinitesimal additive effects and

**is the associated incidence matrix. Assumptions for this model are and , and the best linear unbiased predictor (BLUP) of**

*Z***can be obtained by solving (8)where and are the additive genetic and residual variances, respectively. Here, the additive relationship matrix**

*u***can be interpreted as a kernel matrix measuring the kinship between individuals based on pedigree, as discussed in de los Campos**

*A**et al.*(2009) and in Morota and Gianola (2014). Hence, the conventional animal model [pedigree-based BLUP (P-BLUP)] is a special case of RKHS regression. Similarly, the genomic BLUP (G-BLUP) proposed by VanRaden (2008) uses a genomic relationship matrix , with

**being the incidence matrix of marker genotypes, in lieu of the**

*X***matrix derived from pedigree. G-BLUP exploits “realized” relationship between individuals, using genomic information covering the entire genome. Therefore, G-BLUP is also a special case of RKHS regression. For more details on RKHS regression and its applications to animal breeding, see Gianola**

*A**et al.*(2006), Gianola and Van Kaam (2008), Gianola and de los Campos (2008), González-Recio

*et al.*(2008), de los Campos

*et al.*(2009, 2010), Morota

*et al.*(2013), and Morota and Gianola (2014).

In general, the role of a kernel matrix in RKHS regression is to convey pairwise similarity between individuals, using a certain type of input information, with methylation profiles used here. Although the choice of the kernel function is arbitrary, as any positive-definite function can be used as a kernel function, multiple factors may affect its choice in practice. For example, the diffusion kernel adopted by Morota *et al.* (2013) has a distance function (Manhattan distance) that may not be optimal for real numbers. Hence, we chose a Gaussian kernel for the continuous methyl values. By definition, the th element of the Gaussian kernel ** K** is calculated as (9)where is the Euclidean distance between vectors and , in our case the vectors of methyl values of epiRILs

*i*and

*j*, and

*h*is the bandwidth parameter of the kernel, which controls the smoothness of the fitted surface. The choice of the bandwidth parameter is important since it affects the performance of the regression. A number of algorithms have been proposed to optimize the bandwidth parameter (Jones

*et al.*1996). Here, we determined the optimal bandwidth parameter using a grid search approach under cross-validation, aiming at finding a value that maximized the predictive correlation within a model setting.

From the definition of the Gaussian kernel, all diagonal entries of the kernel matrix are 1, since the Euclidean distance between a vector and itself is always zero. Also, as the distance increases, approaches zero. Hence, the entries of ** K** range between 0 and 1, making the kernel act as a correlation matrix. Therefore, we considered Pearson’s correlation matrix

**as a naive kernel, where . Advantages of using the**

*P***matrix are computation related: (1) it is easy to obtain, and (2) tuning a bandwidth parameter is not needed. Comparisons between prediction performances obtained using the**

*P***and the**

*P***kernels are described later. A graphical comparison between the**

*K***and**

*P***kernels is shown in Figure 1. In Figure 1, the plot at the top left corner shows the**

*K***matrix created from the methyl-values data. Most between-lines correlations range from 0.7 to 0.9 and only few pairwise correlations are <0.65. The other three plots represent a**

*P***matrix with various**

*K**h*values. It can be seen that

*h*has a big impact on the values of the

**matrix. When**

*K**h*is large (1,000,000, top right corner), the majority of the entries range from 0.4 to 0.5; for intermediate

*h*(500,000, bottom left corner), most entries are between 0.2 and 0.7; when

*h*is small (250,000, bottom right corner), almost all entries are <0.5 except for the diagonal elements.

Given a kernel and a vector of fixed effects ** b** (in our case the greenhouses only), the prediction model can be written in matrix form as (10)where

**is the vector of phenotypic records (PH here);**

*y**μ*is an unknown intercept common to all observations;

**is the incidence matrix of fixed greenhouse effects;**

*X**α*is the random vector of regressions on the kernel associated with epigenetic variation, with assumed distribution , where is a variance component associated with the kernel; and

**is the model residual with distribution .**

*e*#### Prediction using preselected probes:

Our main goal is to build prediction models, using epigenetic information as a potential supplement to genomic variation (*e.g.*, SNP markers), as foreseen by González-Recio (2012). In animal and plant breeding, a training population with thousands of individuals is usually needed. However, methylation profiling experiments are extremely expensive, at least at present. Thus, cost is usually a main consideration in epigenetic studies, and data sets with hundreds of profiled individuals are commonly viewed as large-scale experiments. Due to Meissner *et al.* (2005), a molecular genetic technique called reduced representation bisulfite sequencing (RRBS) has been used to take only a small subset of all available probes as proxies to describe the methylation level of the whole genome, which may reduce experimental costs drastically and make experiments executed on a larger cohort possible. According to the mechanisms of DNA methylation known so far, cytosine in a CpG dinucleotide context (cytosine followed by guanine, where “p” indicates the phosphate bond in between) is the main target of DNA methylation in eukaryotic cells. Thus, genomic regions with high CpG content may represent the methylation profile of the entire genome and hence are chosen for BS-Seq in RRBS (note that CpG content is different from CG content; the latter evaluates cytosine and guanine frequencies separately). In the mouse, according to Meissner *et al.* (2005), these selected regions comprised only Mb of the whole genome (<0.5%), but captured most variation at the methylation level. This suggests that a subset of representative probes may perhaps provide a similar predictive performance to that from all probes. If this is the case, prediction using representative probes would be less expensive and computing burden would be lessened because generating a kernel matrix is potentially time-consuming.

Considering the size of the murine ( Mb) and the *Arabidopsis* ( Mb) genomes, we decided to select the top (see below) 10% of the profiled probes in *Arabidopsis* such that the genomic regions in which these probes reside had ∼12 Mb in total. Thus, the cost needed for the experiment would not exceed the magnitude of what was suggested by RRBS in mouse. The criterion for this selection was based on the observed/expected (O/E) CpG ratio defined as(Gardiner-Garden and Frommer 1987), which is a statistic describing the frequency of occurrence of CpG dinucleotides. Besides CpG dinucleotides, it has been found that trinucleotides CpHpG and CpHpH (H = A, C, or T) are target sites of DNA methylation in plants as well (Henderson and Jacobsen 2007; Lister *et al.* 2008). Thus, we also calculated the O/E ratio for these two trinucleotides. According to the reference genome (TAIR7, downloaded from http://www.arabidopsis.org), the total length of the *Arabidopsis* genome is 119,186,497 bp. With 711,320 probes on the designed chip, on average there is 1 probe for every 167 bp. The average length of all probe sequences is 55.2 bp (max 75 bp, min 50 bp), which means that the DNA segment between two probes is ∼112 bp long, on average. Considering that 55 bp may not be an adequate length for calculating the O/E ratio with accuracy, especially for the two trinucleotides, we decided to extend the region of examination by 120 bp to the upstream of each probe. After this extension, the estimation of O/E CpG ratio is expected to be more accurate, and the number 120 was chosen because (1) it fills the gap between two probes, so this ensures that the whole genome is under examination, and (2) the overlap between adjacent probes after extension is reduced.

To make up 10% of total probes, we chose the top 5% probes with highest O/E ratio for CpG dinucleotides and the top 2.5% probes with highest O/E ratio for each of CpHpG and CpHpH. This 2:1:1 partition comes from the fact that in *Arabidopsis*, the fractions of ^{m5}C identified in CpG, CpHpG, and CpHpH contexts are about 55%, 23%, and 22%, respectively (Lister *et al.* 2008). As a result, we selected 35,585, 17,783, and 17,783 probes based on the CpG, CpHpG, and CpHpH contents, respectively, and ended up with 65,506 probes (9.2% of all probes) in total (with some overlap between contents of different contexts). After mapping back to the genome annotation file (TAIR7, downloaded from http://www.arabidopsis.org), it was found that within these 65,506 probes, 10,044 (15.3%) were located in promoter regions of genes, and these 10,044 probes covered 40.9% of total promoters; 12,074 (18.4%) were found in coding DNA sequences (CDS); 2329 and 2005 (3.6% and 3.1%) were in 5′-UTR and 3′-UTR regions, respectively; and 2534 (3.9%) probes were in pseudogenic exons. Also, 1418 and 16,258 (2.2% and 24.8% of the 65,506 preselected probes) were found in the intron and transposon regions, respectively, with the reference information provided by Cortijo *et al.* (2014a). Finally, 18,620 (28.4%) probes did not map to any annotated region according to the current annotation file. A graphical representation of the distribution of selected probes by genomic element groups is shown in Figure 2. In addition to a model using all available probes, a prediction model using these 65,506 preselected probes was built as well.

In their RRBS study, Meissner *et al.* (2005) reported that the representative subset covered of gene promoter regions, while in our bioinformatic search, only 40% of the promoter regions were covered by preselected probes as described above. This is probably due to differences between species since the original RRBS method was developed in the mouse. Given the important role of gene promoter regions in epigenetic regulation of gene expression, we attempted to select a subset of probes with a different criterion such that more promoter regions could be covered. In epigenetics, CpG islands (CGIs) are CpG-rich regions that are usually unmethylated and located in the gene promoter region. In humans, at least gene promoter regions overlap with CGIs (Illingworth and Bird 2009). CGI shores are close proximity regions ( kb of upstream or downstream) of CGIs (Portela and Esteller 2010). Recent studies suggested that 70% of differentially methylated regions in epigenetic reprogramming are associated with CGI shores (Doi *et al.* 2009; Ji *et al.* 2010). Therefore, probes located in CGI shores were also selected such that more gene promoter regions can be covered and these probes may constitute another (independent) subset capturing most variation of the whole-genome methylation profile. Following the definition of CGI given by Gardiner-Garden and Frommer (1987), we found 23,640 CGIs, and the probes located in the shore regions of these CGIs covered 65.6% of all promoters (Table 1). Thus, apart from different kernel matrices applied, prediction was performed using (1) all probes available in the data set, (2) preselected probes based on CpG/CpHpG/CpHpH contents (referred to as contents rule hereafter), or (3) preselected probes located in the CGI shore region (referred to as CGI rule hereafter).

#### Prediction using methyl-status data:

When P-BLUP and G-BLUP are viewed as kernel methods, the ** A**- and

**-kernel matrices have an explicit biological meaning. For example, the kinship matrix**

*G***reflects the expected fraction of identical-by-descent alleles shared by a pair of relatives and the**

*A***matrix can be viewed as a realization of relationships given the observed molecular markers or as a “molecular similarity matrix” based on the DNA polymorphisms. Thus, variance components associated with**

*G***or**

*A***have a clear genetic basis. The correlation matrix**

*G***and any of the Gaussian kernels with specific bandwidth parameters used here, on the other hand, are constructed from methylation profiles and reflect only epigenetic similarity in some manner. Hence, variance components associated with these kernels do not have an easy biological interpretation except that of measuring a contribution to phenotypic variance. Further, when a Gaussian kernel**

*P***is used, the bandwidth parameter**

*K**h*has a large impact on the values in

**, as depicted in Figure 1. As such, one may expect that various distinct will be obtained when different values are assigned to**

*K**h*; hence, will vary as well. To obtain a more meaningful partition of phenotypic variation explained by epigenetic polymorphisms, we built an additional kernel matrix for RKHS.

Because the methylation state of one copy at a single locus (*e.g.*, a single cytosine at a CpG dinucleotide) can be only methylated or unmethylated, the absolute methylation level *β* (as obtained by BS-Seq, for example) is always measured as a ratio that ranges between 0 and 1, with the numerator being the number of methylation incidences in a sample. Under some circumstances, methylation at the locus under investigation can be classified into one of the three categories: M, I, or U, according to the *β*-value at that locus (Meissner *et al.* 2008; Du *et al.* 2010). If two DNA segments with similar nucleotide sequence but different methylation status (*e.g.*, one is methylated and the other is not) are considered as two epialleles, this classification provides an approximation to the underlying “epigenotypes” such that M and U stand for the “epihomozygotes” for one of the two epialleles and I is the “epiheterozygote.” Analogous to the SNP coding system, we can use 2, 1, and 0 to code M, I, and U and generate a kernel matrix mimicking the ** G** matrix in G-BLUP (VanRaden 2008), which we call the matrix. This required little extra effort since methyl status was available in the methylation data set. However, this approach has some pitfalls: (1) when continuous methyl values are converted to discrete methyl status, information is lost; and (2) once a numeric coding is arrived at, many probes would be excluded from downstream analysis because their “epi-MAF” would be <0.05 (MAF, “minor allele frequency”). In the current data set, only 206,600 probes were kept for subsequent analysis after this epi-MAF filtering. Nevertheless, this kernel may be more biologically intuitive than a Gaussian kernel generated from methyl values since the numeric coding used to generate the kernel is an absolute count of a certain epiallele of an epigenotype. Thus, a prediction model can be built and implemented as in G-BLUP, and the variance component associated with would estimate the proportion of total variance explained by epigenetic variation with a clearer biological sense.

### Data availability

Phenotypic data are from Johannes *et al.* 2009. Methylation data are downloaded froom Gene Expression Omnibus repository with accession number GSE37284.

## Results

### Prediction with different kernels

Considering that the data set had only 114 epiRILs, we used a leave-one-out cross-validation (LOO CV) for model evaluation throughout the study. When the correlation matrix ** P** was used as a naive kernel, the predictive correlation was 0.384. When a Gaussian kernel was used, the predictive correlation varied according to the bandwidth parameter chosen. In this case, when all probes were used to create the kernel matrix, the best prediction performance was obtained when the bandwidth parameter was set to 140,000, and the predictive correlation was 0.531, with predictive mean squared error (MSE) = 32.16.

It can be seen that a reasonable predictive correlation was reached when using the Gaussian kernel, which performed much better than the correlation kernel. However, the bandwidth parameter played an important role in model performance (Figure 3). Taking the four kernels in Figure 1 as an example, ** P** had entries ranging from ∼0.6 to 1, which means that the “dissimilarity” between each line must be distinguished within a 0.4 range. On the other hand, all three Gaussian kernels in Figure 1 ranged from ∼0 to 1, such that pairwise dissimilarity was better distinguished on a wider scale. Thus, it was not surprising that the Gaussian kernel outperformed the correlation kernel. Predicting an unobserved record borrows information from observations on similar individuals. Thus, the “similarity” between lines matters. From the definition of the kernel matrix (Equation 9), the off-diagonal elements are close to zero if

*h*is small (Figure 4). This makes the kernel matrix “confounded” with the identity matrix, which represents the variance–covariance structure of the model residuals. According to de los Campos

*et al.*(2010), this type of kernel matrix captures “local” similarity, focusing mainly on the comparison of an individual with itself and a few other individuals with highest similarities. A “global” kernel with a larger bandwidth parameter, on the other hand, will also take into account comparisons between more (epi)genetically distant individuals. Therefore, the “optimal” bandwidth parameter should provide a balance between local and global comparisons between different lines, using the available data. Unless multiple kernels with different bandwidth parameters are fitted simultaneously (

*e.g.*, Tusell

*et al.*2014), a kernel with an intermediate

*h*is expected to provide the best predictive correlation (Figure 3, black solid line). A similar pattern was observed for predictive MSE (Figure 3, blue dotted line).

### Prediction using preselected probes

For preselected probes, models using different kernel matrices were evaluated as well; again, the bandwidth parameter for the Gaussian kernel was determined based on a grid search via LOO CV. When using the ** P** kernel, the predictive correlations for contents rule and CGI rule probes were 0.398 and 0.395, respectively, slightly higher than when all probes were used for prediction. When a Gaussian kernel was used, the highest predictive correlations for these two sets of preselected probes were 0.532 and 0.531, respectively, given an appropriate bandwidth parameter. This result was the same as when all probes were used (Table 2, Figure 5). As a comparison, we also drew 10 subsets of probes, each consisting of a random 10% of all available probes, to evaluate the usefulness of preselection of representative probes according to different criteria. Results showed that the predictive correlations using randomly selected probes were all lower than when using representative probes selected according to an explicit criterion, regardless of the kernel used in prediction.

Our results suggest that a properly selected subset of all probes is able to capture most variation at the methylome level. Therefore, prediction of a larger cohort with a limited budget is possible since only a small fraction of “loci” is needed for methylation profiling with computation time decreasing drastically. This could be very useful in livestock or crop production since there are usually thousands of individuals in a breeding program that need to be chipped (*i.e.*, methylation profiled), which is still very costly. Lower predictive correlations obtained using randomly selected probes indicated that the most relevant methylation variation is harbored in previously identified regions, *i.e.*, high CpG/CpGpH/CpHpH-content regions or CGI shore regions.

### Prediction using the epi-G kernel

When using in prediction, the predictive correlation was 0.505 when all probes were used; predictive correlations were 0.494 and 0.507, if probes were preselected based on the contents or the CGI rules, respectively. Thus, the predictive correlation using was somewhat lower than that when a Gaussian kernel was used, perhaps due to the loss of information from discretization of methyl values. The estimated variance component associated with was 41.57 (SD = 1.69 under cross-validation) and the residual variance was estimated as 22.75 (SD = 0.45), so the proportion of total variance explained by epigenotype was 0.646, close to what was reported in Cortijo *et al.* (2014b). This proportion was 0.656 and 0.647 when only preselected probes (with two criteria, respectively) were used (Table 3). When using Gaussian kernels, on the other hand, the variance component associated with the kernel matrix represented 0.542 of the phenotypic variance (all probes used, bandwidth = 140,000), which was lower than with the kernel.

It is difficult to assess which kernel provides a more meaningful proportion of phenotypic variance explained by the methylation profile, since the true variance components are unknown. However, it is worth noting that in addition to a strong impact on predictive performance (Figure 3), the bandwidth parameter *h* had a big influence on variance component estimates as well (Figure 6). The estimated variance components associated with the Gaussian kernel were very large when *h* was large, and the proportion of phenotypic variation explained by the kernel matrix seemed excessive (up to 0.85). Note that the residual variance was essentially independent of the bandwidth parameter value. Therefore, caution needs to be exercised when interpreting the variance component associated with the kernel as variation explained by methylation polymorphisms. When using the kernel, on the other hand, the proportion of phenotypic variation explained by epigenetic polymorphisms seemed more reasonable, and predictions obtained using this kernel gave a better predictive correlation than when using the kernel (). Also, the regression of testing set observations on predicted values was 0.99, much higher than for the ** P** kernel (, Figure 7).

## Discussion

### Prediction using epigenomic data

It was found that methylation data produced a reasonable predictive correlation when predicting plant height in *Arabidopsis*. The kernel matrix used here reflected epigenetic similarity between epiRILs based on their methylation profiles, and such epigenomic information might complement genomic information at the DNA level. The predictive correlation and mean squared error values were similar when only preselected probes were used. Hence, use of representative probes may help to reduce the cost of methylation profiling and computing time as well, at least for prediction purposes.

Nonparametric prediction using kernel methods is relatively simpler than with Bayesian regression models based on Markov chain Monte Carlo involving an enormous number of proposal distributions. However, in most cases the variance components associated with a kernel matrix do not provide a meaningful explanation of underlying biological processes, except for P-BLUP and G-BLUP, two special cases of RKHS regression. It was observed that when a Gaussian kernel was used for prediction, the estimated variance component associated with the kernel varied much with different choices of the bandwidth parameter *h*, probably due to the big impact of *h* on the values of the kernel matrix. This variation produced a wide range of ratios, making it difficult to assess phenotypic variance explained by epigenetic polymorphisms (Figure 6), but the best predictive performance was obtained when a Gaussian kernel was used. To cope with this difficulty, an epi-** G** that mimics the

**matrix in G-BLUP was used as a kernel in RHKS regression. Since the epi-**

*G***kernel was generated from a discrete methyl status that was converted from continuous methyl-values data, a reduced predictive performance was observed probably due to loss of information during this data conversion process. However, the methyl-status data approximate the underlying epigenotypes of each locus. Hence the variance component associated with the epi-**

*G***is interpretable as in a G-BLUP model, which is clearly based on a biological concept.**

*G*In this study, we built prediction models with epigenetic information from MeDIP chip data. Alternatively, one should be able to build prediction models under the same statistical framework with BS-Seq data, which come from a combination of bisulfite conversion and next-generation sequencing (NGS) techniques with decreasing cost. Advantages of using BS-Seq data include the following: (1) unlike MeDIP chip data that rely on DNA segments, BS-Seq has single-base resolution inherited from NGS, making it more informative; and (2) instead of using the ratio between signal and background intensities to represent methylation level in a relative way, a ratio between the counts of methylated reads and total reads is used to measure the (absolute) methylation level, making it more accurate. However, when constructing the kernel matrix, any input information on methylation level, regardless of whether it is based on relative MeDIP data or on absolute BS-Seq data, can be turned into a relative measurement of epigenetic similarity, indicating that a better predictive performance may not be guaranteed when BS-Seq data are used. Moreover, the problem of information loss stemming from the discretization step when constructing the epi-** G** kernel will not be solved by the use of BS-Seq data. Therefore, even though NGS technologies are making inroads into the field of complex traits analysis, a potential next challenge is to develop a framework for BS-Seq data to take advantage of.

Despite the potential usefulness of epigenetic information in phenotype prediction suggested by our results, it should be noted that DNA methylation is reversible (*i.e.*, a methylated DNA molecule can be demethylated). Hence, methylation data are unstable relative to DNA polymorphisms. The reversibility of DNA methylation may produce “epimutation” events (Becker *et al.* 2011). Therefore, the entire methylome represents the dynamics of epimutations, and a particular methylation data set should be viewed as a “snapshot” of the methylome at a specific time from a specific tissue. To enhance phenotypic prediction performance further, information from multiple snapshots could be useful. Although methylation profiling is still expensive, its cost has decreased in recent years, and this trend is expected to continue.

### Integrating genomic and epigenomic data in prediction

Our results suggested that epigenetic information can be used alone for whole-genome prediction of plant height, as a reasonable prediction performance was obtained. Therefore, it is of interest to combine epigenetic and DNA information for the same purposes. Recently, Vázquez *et al.* (2014) showed that the inclusion of multilayer -omics data in human epidemiology can increase the predictive correlation of disease risk drastically. Likewise, Shah *et al.* (2015) presented evidence suggesting that combining genetic information with significant associations between phenotype and epiprofile may be useful for predicting body mass index in humans. These findings suggested a potential use of epigenomic data in addition to genomic data for prediction using RKHS regression since integrating multiple information sources by introducing extra kernels tends to enhance predictive performance. For example, Tusell *et al.* (2014) reported that a multikernel model performed better than a single-kernel model, as anticipated by de los Campos *et al.* (2010). Also, fitting a pedigree-based relationship matrix (the ** A** matrix) and a genome-based relationship matrix (the

**matrix) together can give a higher predictive correlation than when only one matrix was fitted (Crossa**

*G**et al.*2010). When viewing P-BLUP and G-BLUP as special cases of RKHS regression, a prediction model with both

**and**

*A***is then a model with multiple kernels. The benefit from fitting multiple kernels simultaneously can be enhanced if all kernels are mutually orthogonal (Morota**

*G**et al.*2014), which explains the result of Crossa

*et al.*(2010), since

**and**

*A***may provide information from different perspectives, with**

*G***supplementing information that is not captured by**

*G***. In a recent study, it was shown that genetic and epigenetic information can be uncoupled by epimutation over an evolutionary timescale (van der Graaf**

*A**et al.*2015), so a higher predictive correlation could be expected when information from the epigenome is included in a prediction model, as suggested by Vázquez

*et al.*(2014), since this extra information is distinct from the information conveyed by DNA polymorphisms.

Biologically, the preceding phenomenon can be interpreted as follows. The DNA sequence is transcribed into RNA and subsequently translated into protein, the building blocks of final phenotypes. Therefore, information at the protein layer (proteome) is “closest” to and genomic information is most “distant” from phenotypes in this biological pathway. Hence, proteomic information might provide better predictions of phenotypes than genomic information. Similarly, the epigenomic information, which lies between that conveyed by DNA and RNA layers, might be useful, if available. However, the availability of epigenomic data does not preclude the use of genomic information. On one hand, recent studies indicated that genomic variation and epigenomic variation may interact with each other (*e.g.*, Arnold *et al.* 2013; Wachter *et al.* 2014), such that epigenetics do not have a determinant effect on the ultimate phenotype, although it is a closer layer than DNA variation. On the other hand, epigenetic status can be predicted from genomic information (*e.g.*, Benveniste *et al.* 2014; Whitaker *et al.* 2015), suggesting that the inclusion of DNA polymorphisms may enhance predictive performance. Further, DNA information is crucial for artificial selection, and the epigenetic data would be informative in such a context only if transmission between generations is verified or if it enhances DNA-based predictions. For these reasons, integrating epigenomic with genomic data may be worthwhile for prediction purposes. Unfortunately, the epiRILs population used in this study did not have any SNP data, due to genetic identity between individual lines (C. Camilleri, personal communication). Nevertheless, one can expect that data integration will always be beneficial, along the lines that using information from multiple layers is expected to give stronger predictive correlations, as indicated by González-Recio (2012) and corroborated by Vázquez *et al.* (2014).

In short, including both DNA and epigenetic information into a prediction model may be fruitful. For example, if an kernel were to be used along with a ** G** matrix (using SNP data), the estimated variance components should help in interpreting the proportion of phenotypic variance attributed to genetic and epigenetic variation. Also, perhaps the loss of information incurred when forming the kernel might be compensated by

**. Therefore, using epigenomic and genomic information together has potential and additional study is warranted.**

*G*### Conclusion

We built prediction models nonparametrically, using DNA methylation data. We chose RKHS regression for prediction because, unlike with prediction using SNP data, estimated regressions using methylation data do not have an obvious interpretation that links to model parameters via some theory or biological concept. In RKHS regression, a kernel matrix describing epigenetic similarities between different epiRILs makes model interpretation less difficult. Further, the tuning procedure is easier than for a parametric model, where a Bayesian treatment and MCMC techniques are usually needed.

We used different kernels in RKHS regression, namely the naive correlation matrix ** P** and a Gaussian kernel

**with different bandwidth parameters. When the bandwidth parameter was selected appropriately, the model with a Gaussian kernel performed better than that with a**

*K***kernel. Since a reasonably good predictive correlation was observed, this suggested that epigenetic information may be useful in whole-genome prediction as a source of information that does not reside in a DNA sequence. Furthermore, the value of the predictive correlation was retained when using preselected representative probes, suggesting an avenue for cost reduction in prediction studies.**

*P*The performance of RKHS regression with a Gaussian kernel was strongly affected by its associated bandwidth parameter, not only in terms of the predictive correlation and predictive mean squared error, but also with respect to the variance component associated with the kernel matrix. This is because epigenetic similarities between individuals provided by the kernel matrix are based on a relative metric, instead of an absolute one. Therefore, the proportion of variance explained by the kernel does not give a meaningful interpretation of the proportion of phenotypic variance explained by epigenetic variation. On the other hand, a kernel matrix created from coded methylation status () mimicked the genomic relationship matrix ** G** and gave an estimated proportion of total variance explained by epigenetic variation of ∼65%. Although a small degradation in prediction performance is incurred when this kernel is applied, perhaps a better understanding of the importance of epigenetic variance can be obtained.

Using epigenetic information in addition to DNA polymorphisms in prediction has been studied by other authors in human epidemiology (*e.g.*, Vázquez *et al.* 2014), and results have suggested that this extra information may lead to a pronounced impact on prediction performance. Based on their results and on the empirical observation that RKHS regression with multiple kernels performs better than a single-kernel regression (Tusell *et al.* 2014), we conclude that inclusion of epigenetic information in prediction models may be warranted and possibly useful in livestock and crop production, as suggested by González-Recio (2012).

## Acknowledgments

The authors thank Frank Johannes and René Wardenaar from the Groningen Bioinformatics Centre, University of Groningen, The Netherlands; Vincent Colot from Institut de Biologie de l’Ecole Normale Supérieure, France; and Christine Camilleri from Institut Jean-Pierre Bourgin, France for their kind responses to our questions on experimental design and data collecting procedures. Editor Fred van Eeuwijk and two anonymous reviewers are also thanked for their valuable comments on our manuscript. This work was supported by the Wisconsin Agriculture Experiment Station and by a U.S. Department of Agriculture Hatch grant (142-PRJ63CV) (to D.G.).

Y.H. and G.M. conducted the study; G.J.M.R. and D.G. advised the analysis; Y.H. analyzed data and wrote the paper; and G.M., G.J.M.R., and D.G. revised the manuscript. All authors read and approved the final manuscript.

## Footnotes

*Communicating editor: F. van Eeuwijk*Data used in this article are from Johannes

*et al.*(2009) and the Gene Expression Omnibus data repository under accession no. GSE37284.

- Received April 8, 2015.
- Accepted August 2, 2015.

- Copyright © 2015 by the Genetics Society of America

Available freely online through the author-supported open access option.