Abstract
Targeted identification and purging of deleterious genetic variants has been proposed as a novel approach to animal and plant breeding. This strategy is motivated, in part, by the observation that demographic events and strong selection associated with cultivated species pose a “cost of domestication.” This includes an increase in the proportion of genetic variants that are likely to reduce fitness. Recent advances in DNA resequencing and sequence constraint-based approaches to predict the functional impact of a mutation permit the identification of putatively deleterious SNPs (dSNPs) on a genome-wide scale. Using exome capture resequencing of 21 barley lines, we identified 3855 dSNPs among 497,754 total SNPs. We generated whole-genome resequencing data of Hordeum murinum ssp. glaucum as a phylogenetic outgroup to polarize SNPs as ancestral vs. derived. We also observed a higher proportion of dSNPs per synonymous SNPs (sSNPs) in low-recombination regions of the genome. Using 5215 progeny from a genomic prediction experiment, we examined the fate of dSNPs over three breeding cycles. Adjusting for initial frequency, derived alleles at dSNPs reduced in frequency or were lost more often than other classes of SNPs. The highest-yielding lines in the experiment, as chosen by standard genomic prediction approaches, carried fewer homozygous dSNPs than randomly sampled lines from the same progeny cycle. In the final cycle of the experiment, progeny selected by genomic prediction had a mean of 5.6% fewer homozygous dSNPs relative to randomly chosen progeny from the same cycle.
GAINS from selection in plant and animal breeding could be improved through a better understanding of the genetic architecture of complex traits. One current source of debate is the relative frequency of genetic variants that contribute to complex traits. At mutation–drift equilibrium, the majority of genetic variants segregating in a population are expected to be rare (Ewens 1972; Watterson 1975). If a genetic variant affects a phenotype, it is more likely to be subject to selection, with the strength of selection proportional to the magnitude of phenotypic impact (Thornton et al. 2013). Since most newly arisen mutations with a phenotypic impact are expected to be deleterious (Thatcher et al. 1998; Sanjuán et al. 2004; Eyre-Walker et al. 2006) or have a pleiotropic effect on fitness (Keightley and Hill 1990; Eyre-Walker 2010), segregating variants that contribute to complex trait variation will likely be under purifying selection (Johnson and Barton 2005; Eyre-Walker 2010). Thus, a substantial portion of segregating genetic variants that affect phenotypes may occur as “rare alleles of large effect” (RALE) (Thornton et al. 2013). Consistent with the RALE hypothesis, association mapping studies find evidence that rare alleles have larger estimated phenotypic effects than common alleles (Stanton-Geddes et al. 2013). A recently reported mechanism by which rare alleles can impact phenotypes is the dysregulation of expression, which contributes to reduced fitness (Kremling et al. 2018).
Estimating the phenotypic contribution of rare alleles can be difficult. The potential for the detection of variants with phenotypic effects is dependent on allele frequency (Simons et al. 2018), and more generally, alleles with relatively large effects on phenotype are more readily detected (Lewontin 1974; Beavis 1994). Thus the most readily identified alleles with effects on fitness occur at an appreciable frequency and have a relatively large effect on phenotype; these variants are unlikely to be representative of the majority of genetic variants that contribute to phenotypic variation (Lewontin 1974; Rockman 2012).
Segregating variants that affect fitness are more likely to be deleterious than beneficial (Felsenstein 1974), and are thus more likely to be under purifying selection. Consistent with this postulate, low-frequency genetic variants in human populations are enriched for amino acid replacements (e.g., Marth et al. 2011), which likely have direct effects on protein function. The effect of individual deleterious SNPs (dSNPs) on fitness is expected to be small, but in aggregate, their impact may be substantial (cf. Felsenstein 1974). Domesticated plants and animal populations have often experienced reductions in effective population size, and strong selection associated with domestication and improvement that could result in exacerbated effects of deleterious variants as a genetic cost of domestication (Lu et al. 2006). Empirical evidence from a variety of organisms appears to support this conjecture, with comparisons in cassava (Ramu et al. 2017), dogs (Marsden et al. 2016), grapes (Zhou et al. 2017), and rice (Liu et al. 2017) showing evidence of an increased proportion of both fixed and segregating dSNPs relative to wild progenitors [see also Moyers et al. (2018) and Makino et al. (2018)].
Putative dSNPs can be readily identified based on phylogenetic conservation, particularly for coding polymorphisms (Ng and Henikoff 2003; Chun and Fay 2009). SNPs that are phenotype-changing in Arabidopsis thaliana are more likely to annotate as deleterious than “tolerated” (less-conserved) amino acid-changing SNPs at similar frequencies (Kono et al. 2018). Indeed, a number of putatively causative amino acid-changing SNPs that contribute to agronomic phenotypes annotate as deleterious (Kono et al. 2016). However, individual inbred lines for many cultivated species carry hundreds to thousands of dSNPs (Renaut and Rieseberg 2015; Kono et al. 2016). The vast majority of dSNPs occur at low frequency (Kono et al. 2016; Liu et al. 2017), and in predominantly outbreeding species such as maize, dSNPs are enriched in regions of the genome with elevated heterozygosity (Brandenburg et al. 2017). Any individual dSNP is unlikely to serve as the primary causative variant for essential agronomic traits. Because of their relative ease of identification, elimination of dSNPs either through selection against them in aggregate (Morrell et al. 2012; Moyers et al. 2018; Smith et al. 2018; Johnsson et al. 2019) or through targeted replacement of individual dSNPs (Morrell et al. 2012; Valluru et al. 2018; Johnsson et al. 2019) provides a potential means of crop improvement.
The phenotypic consequences of dSNPs are determined by their relative degree of dominance, the proportion of variants that occur in the homozygous state, and the fitness effects of individual dSNPs (Mezmouk and Ross-Ibarra 2014; Henn et al. 2015, 2016; Marsden et al. 2016; Yang et al. 2017; Moyers et al. 2018). Additionally, the genomic locations of dSNPs are important factors in how effective purifying selection can be in culling them from populations. This occurs because recombination rate places limits on the efficacy of purifying selection (Felsenstein and Yokoyama 1976; Charlesworth et al. 1993). A larger proportion of variants may be deleterious in low-recombination regions of a genome (Felsenstein 1974), as has been observed in sunflower (Renaut and Rieseberg 2015), rice (Liu et al. 2017), and soybean (Kono et al. 2016). There is evidence from studies of humans that variants in low-recombination regions may have larger fitness consequences or explain more of the variation for quantitative traits (Gazal et al. 2017; Pardiñas et al. 2018).
Modern breeding programs use genome-wide prediction approaches, which are designed to integrate large numbers of markers in the estimation of phenotypic values for quantitative traits (Meuwissen et al. 2001). This typically involves the use of a training panel of individuals with both genotypic and phenotypic information. Prediction and selection can be performed in a panel of related individuals with only genotypic data. There is evidence that the probable effect of genetic variants on quantitative phenotypic variation can vary by functional class and that prediction accuracy can be improved through differential weighting of variants (Edwards et al. 2016; Yang et al. 2017).
The purpose of this study is to assess the fate of dSNPs in a breeding population subject to genomic prediction and selection. We examine an experimental barley breeding population developed at the University of Minnesota (Tiede and Smith 2018). Genomic prediction was used to select lines with improvements in yield and resistance to the fungal disease Fusarium head blight (FHB), two unfavorably correlated quantitative traits. Phenotypic data were collected for yield, deoxynivalenol (DON) concentration (a measure of severity of fungal infection), and plant height, which was not under selection. The population showed gains in both yield (an increase of 186.1 kg/hectare) and in DON concentration (a reduction of 1.85 ppm) over three cycles of crossing and selection, with an index of yield and reduced DON concentration showing a consistent gain over cycles (Tiede and Smith 2018). The focus of the Tiede and Smith (2018) study was the optimization and updating of the training population to increase genetic gain. The pedigree design brings the rarest variants to ∼3% frequency, thus improving the potential to assess the contributions of putative dSNPs to agronomic phenotypes. The major questions we seek to address are: (i) how common are putative dSNPs in elite barley breeding material, (ii) are putative dSNPs uniformly distributed across the genome or concentrated in genomic regions with lower rates of recombination, and (ii) what is their fate through rounds of selection and breeding gain in an experimental breeding population? We also make use of a linear mixed model to estimate the proportion of phenotypic variance that can be explained based on SNPs genotyped in our panel or imputed from parents onto progeny.
Materials and Methods
Population design
Our experimental population consists of spring, six-row, malting barley adapted to the Upper Midwest of the United States. Three breeding programs (Busch Agricultural Resources, North Dakota State University, and University of Minnesota) contributed the 21 founders of the population, denoted as cycle 0 (C0) (Supplemental Material, Figure S1 and Table S1). Founders were used to produce 80 crosses (pedigrees available at https://doi.org/10.13020/d6w990). F1 progeny from each of the crosses were self-fertilized to the F3 generation, resulting in 1872 F3 progeny, denoted as cycle 1 (C1). A total of 72 lines were selected from C1 based on genomic estimated breeding value (GEBV) and randomly intercrossed to generate the next cycle of progeny. Training populations used for genomic prediction and approaches for updating those populations are detailed in Tiede and Smith (2018). The progeny from the intercrosses among selected lines were selfed to the F3 generation. The process of line selection, intercrossing, and inbreeding was repeated, creating three cycles of selection using genomic prediction. The total number of lines selected was 105 for C2 and 50 for C3 (Figure S1). Breeding program progress was evaluated by phenotypic comparison of the selected lines to a random subset of lines from the same cycle of progeny (Figure S1). The numbers of randomly selected lines were 298, 101, and 45 from C1, C2, and C3, respectively (Figure S1).
Selection was based on the predicted phenotypic values for grain yield and for reduced fungal disease severity using a proxy phenotype, the concentration of the mycotoxin DON, which is created during an active Fusarium infection (Tiede and Smith 2018). GEBV prediction was based on 384 SNPs evenly distributed across the seven barley chromosomes and chosen to maximize marker informativeness among the founders. Genotyping used an Illumina Veracode assay (Tiede and Smith 2018). Lines were selected for increased yield and reduced DON concentration. GEBVs were estimated with ridge regression, as implemented in the “rrBLUP” package (Endelman 2011) for R (R Core Team 2018).
Phenotypic data collection
F3:5 breeding lines in the selected and random pools for C1–3 were evaluated in yield trials at 5-year locations (see Appendix S1 in the supplemental material). Phenotypic data were collected on grain yield and DON concentration. Phenotypic data were spatially adjusted with a moving average across the field plots. Best linear unbiased estimates (BLUEs) for yield and DON concentration were then produced for each line using the “rrBLUP” package for R.
Raw and adjusted phenotypic data, including planting locations in the field trials, are available at https://github.com/MorrellLAB/Deleterious_GP and https://doi.org/10.13020/d6w990. For details of phenotypic data collection see Tiede and Smith (2018).
Genotypic data collection
A total of 5215 F3 progeny were genotyped across the three cycles using the 384 SNPs from the barley oligo pooled assay (BOPA) marker panel (Close et al. 2009). The physical locations of all SNPs were determined based on automated basic local alignment search tool (BLAST) searches against the barley reference genome (Mascher et al. 2017), using consensus genetic map positions to resolve ambiguous positions (Lei et al. 2019). A small number of SNPs were missing either a genetic or physical position. For these SNPs we used linear interpolation, as described in Appendix S2 in the supplemental material.
Genotypes were called using the signal-to-noise ratios from the raw probe intensities, as implemented in the machine-scoring algorithm ALCHEMY (Wright et al. 2010). ALCHEMY was used for genotype calls because it does not rely on clustering of samples to identify genotypic classes, thus avoiding Hardy–Weinberg equilibrium genotype frequency assumptions, and makes use of a prior estimate of the inbreeding coefficient to model the number of expected heterozygous genotypes. The prior inbreeding coefficient was specified as 0.99 for parental lines and as 0.75 for the F3 progeny, the average expected inbreeding coefficient after two generations of self-fertilization. Genotyping data were transformed to PLINK 1.9 format (Chang et al. 2015) and included pedigree information for each individual (data available at https://doi.org/10.13020/d6w990). PLINK was used to test for Mendelian errors in the inheritance of SNPs and to orient SNPs on the appropriate strand relative to the barley reference genome sequence from the cultivar Morex (Mascher et al. 2017). SNP genotypes from the barley BOPA markers genotyped in the Morex X Steptoe genetic mapping population (Kleinhofs et al. 1993; Muñoz-Amatriaín et al. 2011) were used to infer the reference strand of origin for each SNP. The “hybrid peeling” approach of Whalen et al. (2018) was used for simultaneous imputation and phasing of genotyping data, and thus to infer the parental contribution of chromosomal segments to progeny. The approach of Whalen et al. (2018) makes use of an extended pedigree so that phased genotype inference is improved by comparisons to both progenitors and progeny. The specified pedigree is available at https://github.com/MorrellLAB/Deleterious_GP/blob/master/Data/Pedigrees/AlphaPeel_Pedigree.txt. PLINK was used for a second round of Mendel error checking with imputed genotypes. Imputed genotypes in progeny were set to missing if their genotype probability was < 0.7.
DNA extraction, sequence analysis, and variant calling
DNA was extracted from young leaf tissue from each of the 21 founder lines using the Plant DNAzol extraction reagent and protocol from Thermo Fisher Scientific (Waltham, MA). Genomic DNA was captured with liquid-phase exome probes designed to capture 60 Mb of the barley genome (Mascher et al. 2013). Eighteen of the samples were sequenced with 100-bp paired-end technology on an Illumina HiSeq 2000, and three were sequenced with 125-bp paired end technology on an Illumina HiSeq 2500. Exomes were sequenced to a target depth of 30-fold coverage. Raw FASTQ files were cleaned of 3′ sequencing adapter contamination with Scythe (https://github.com/vsbuffalo/scythe), using a prior on contamination rate of 0.05. Adapter-trimmed reads were then aligned to the Morex pseudomolecule assembly (http://webblast.ipk-gatersleben.de/registration/) with the maximal exact match algorithm of the Burrows-Wheeler Aligner (BWA-MEM) (Li and Durbin 2009). Mismatch and alignment reporting parameters were tuned to allow for around three high-quality mismatches between the reads and the reference. This represents approximately the highest-observed coding sequence diversity in barley (Morrell et al. 2006, 2014). BAM files were cleaned of unmapped reads, split alignments, and sorted with SAMtools version 1.3 (Barrero-Sicilia et al. 2011). Duplicate reads were removed with Picard version 2.0.1 (http://broadinstitute.github.io/picard).
Alignment processing followed the Genome Analysis Toolkit (GATK) best practices workflow (McKenna et al. 2010; DePristo et al. 2011). Cleaned BAM alignments were realigned around putative insertion/deletion (indel) sites. Individual sample genotype likelihoods were then calculated with the HaplotypeCaller, with a haploid model and “heterozygosity” value of 0.008/bp. This value is the mean estimate of coding nucleotide sequence diversity, based on previous Sanger resequencing experiments (Caldwell et al. 2006; Morrell et al. 2014). SNP calls were made from the genotype likelihoods with the GATK tool GenotypeGVCFs (McKenna et al. 2010).
Estimates of read depth and coverage made use of “bedtools genomecov” relative to an empirical estimate of exome coverage. Briefly, estimated exome coverage was based on BWA-MEM mapping of roughly 241-fold exome capture reads from the reference barley line Morex (Sequence Read Archive accession number ERR271711), against the Morex draft genome. Read mapping was performed using the same parameters as for mapping the reads from the parental varieties against the reference assembly. Regions covered by ≥ 50 reads were considered covered by exome capture. Intervals that were separated by ≤ 50 bp were joined into a single interval. This results in ∼80 Mb of exome coverage relative to the 60 Mb based on capture probe design (Mascher et al. 2013). The recombination rate in centimorgans/Megabase was estimated based on the physical positions of SNPs in the reference genome (Mascher et al. 2017) and the estimated crossover rate from the consensus genetic map of Muñoz-Amatriaín et al. (2011).
Scripts to perform adapter contamination removal, read mapping, alignment cleaning, and the implementation of the GATK best practices workflow are available at https://github.com/MorrellLAB/Deleterious_GP. The BED file describing the empirical estimate of capture coverage is also available at the provided GitHub link and at https://doi.org/10.13020/d6w990.
Inference of ancestral state using an outgroup sequence
Whole-genome resequencing data for Hordeum murinum ssp. glaucum was collected using Illumina paired-end 150-bp reads on a NextSeq system. We chose H. murinum ssp. glaucum for ancestral state inference because phylogenetic analyses have placed this diploid species in a clade relatively close to H. vulgare (Jakob et al. 2004). Previous comparison of Sanger and exome capture resequencing from the most closely related species, H. bulbosum, identified shared polymorphisms at a proportion of SNPs, resulting in ambiguous ancestral states (Fang et al. 2014; Morrell et al. 2014). After adapter trimming, sequencing reads from H. murinum ssp. glaucum were mapped to the Morex reference genome using Stampy version 1.0.31 (Lunter and Goodson 2011), with prior divergence estimates of 3, 5, 7.5, 9, and 11%. Cleaned BAM files were generated using Samtools version 1.3.1 (Li et al. 2009) and Picard version 2.1.1 (http://broadinstitute.github.io/picard), and realigned around indels using GATK version 3.6. An H. murinum ssp. glaucum FASTA file was created using Analysis of Next-Generation Sequencing Data (ANGSD)/ANGSD-wrapper (Korneliussen et al. 2014; Durvasula et al. 2016). Inference of ancestral state for SNPs in this set of 21 parents was performed using a custom Python script. For the above sequence-processing pipeline, the following steps were performed using “sequence_handling” (Hoffman et al. 2018) for quality control, adapter trimming, cleaning of BAM files, and coverage summary. All other steps for processing H. murinum ssp. glaucum and inferring ancestral states are available on GitHub (https://github.com/ChaochihL/Barley_Outgroups).
Deleterious predictions
Variant annotation, including the identification of nonsynonymous variants, used gene models provided by Mascher et al. (2017). Annotations were applied to the reference genome using ANNOVAR (Wang et al. 2010). Nonsynonymous SNPs were tested with three prediction approaches: PROVEAN (Choi et al. 2012), Polymorphism Phenotyping 2 (PPH2) (Adzhubei et al. 2013), and BAD_Mutations (Kono et al. 2016, 2018), which implements a likelihood ratio test for neutrality (Chun and Fay 2009). All three approaches use phylogenetic sequence constraint to predict whether a base substitution is likely to be deleterious. PROVEAN and PPH2 used BLAST searches against the National Center for Biotechnology Information nonredundant protein sequence database, current as of August 30, 2016. BAD_Mutations was run with a set of 42 publicly available Angiosperm genome sequences, hosted on Phytozome (https://phytozome.jgi.doe.gov) and Ensembl Plants (http://plants.ensembl.org). A SNP was considered deleterious by PROVEAN if the substitution score was ≤ −4.1528, as determined by calculating 95% specificity from a set of known phenotype-altering SNPs in A. thaliana (Kono et al. 2018). PPH2 classifies SNPs as neutral or deleterious; the prediction was considered deleterious if the output was a deleterious call for a SNP. These programs use a heuristic process to test evolutionary constraint, as well as a training model for known human disease-causing polymorphisms. A SNP was considered deleterious by BAD_Mutations if the P-value from a logistic regression (Kono et al. 2018) was < 0.05. The logistic regression model used for dSNP identification is an update to the BAD_Mutations implementation reported by Kono et al. (2018). For comparative analyses, nonsynonymous SNPs were considered to be deleterious if they were identified as deleterious by all three approaches, or if they formed an early stop codon (nonsense SNP).
Population summary statistics
Pairwise diversity across classes of SNPs was calculated using VCFtools (Danecek et al. 2011) and diploid genotypes for each individual. Calculations were partitioned across breeding cycles and among functional classes including noncoding, synonymous, nonsynonymous, and deleterious. For progeny in C1–C3, calculations made use of imputed genotypes relative to the transmission of the Veracode genotyping for each partition and functional class.
Changes in derived allele frequency (DAF) among functional classes of SNPs were tested with a one-way ANOVA. For each SNP with an unambiguous ancestral state, we calculated a fold change value by subtracting the C1-DAF from the C3-DAF, and dividing by the C1-DAF. These fold change values were compared with a one-way ANOVA among functional classes.
Proportion of phenotypic variance explained
The proportion of phenotypic variance that could be explained from the genotyping data was estimated using a linear mixed-model approach, as implemented in the program Genome-wide Efficient Mixed Model Association (GEMMA) (Zhou and Stephens 2012). The model incorporates an estimated kinship matrix among samples that controls for family structure. We estimated the phenotypic variance explained for three phenotypes—yield, DON concentration, and plant height—using spatially adjusted best linear unbiased prediction (or BLUE) estimates averaged across years and locations, as reported by Tiede and Smith (2018). The mixed-model analysis used variants with a minor allele frequency (MAF) of ≥ 1% in the lines that were phenotyped. Heritability was estimated for SNPs from the 384-SNP Veracode panel and for SNPs imputed from parents to progeny based on the exome capture resequencing. Significance testing for association between individual SNPs and phenotypic variation was performed with a likelihood ratio test and a Benjamini–Hochberg false discovery rate of 0.01.
Data availability
All raw sequence reads for barley parental lines are available as BioProject identifier (ID) PRJNA399170. Raw reads for H. murinum ssp. glaucum are available as BioProject ID PRJNA491526. Additional files including a variant call format file of variants called in all parents, an empirical estimate of exome capture coverage, and all available genotypes and phenotypes are in a Data Repository University of Minnesota archive at https://doi.org/10.13020/d6w990. Scripts used for analysis are available at https://github.com/MorrellLAB/Deleterious_GP. All other relevant data are found in the paper and its supporting material. Supplemental material available at figshare: https://doi.org/10.25386/genetics.9872126.
Results
Summary of resequencing data
We made use of exome capture resequencing to identify nucleotide sequence variants in 21 barley breeding lines from three barley breeding programs (Table S1). The 5215 progeny in the experiment were genotyped using a 384-SNP Illumina assay (Tiede and Smith 2018). Based on observed genotypes in progeny in the known pedigree, we tracked the fate of genotyped and imputed SNPs through three breeding cycles (Figure S1). All lines were part of a genomic prediction experiment (Tiede and Smith 2018), where sets of progeny were selected based on genomic prediction for yield and fungal disease resistance. A second pool of progeny was drawn at random in each cycle, and subjected to the same field testing for yield and disease resistance as selected progeny.
The 21 parents (C0) in the experiment were subjected to exome capture resequencing, resulting in the identification of 497,861 SNPs. Of these, 487,473 mapped to portions of the reference genome that could be assigned to barley chromosomes and were subject to further analysis (Table 1). The intersections of three deleterious annotation approaches identified 3768 dSNPs at 62,826 nonsynonymous sites, including 1162 early stop codons in the founding parents. More of the dSNPs are private to North Dakota lines than to the other programs (Table S2), which has more private SNPs across classes. The number of dSNPs was remarkably similar among founder lines with a mean of 677.67 (± 16.51), though the number of dSNPs private to a founder varied from 11 to 172 (Table S3). The unfolded site frequency spectrum (SFS) for 283,021 SNPs with inferred ancestral state indicated that dSNPs occur primarily in the rarest frequency classes across cycles (Figure 1), a trend that was also evident among all variants in the folded SFS (Figure S2).
Derived site frequency spectra for 283,021 SNPs in the three cycles of the genomic prediction population. Ancestral state was based on majority state from H. murinum spp. glaucum resequencing mapped to the Morex assembly (for all SNPs, including those with no inferred ancestral state, see Figure S2). “Noncoding” refers to SNPs in regions that do not code for proteins, “Synonymous” refers to SNPs in coding regions that do not alter an amino acid sequence, and “Nonsynonymous” refers to SNPs that alter the amino acid sequence. SNPs listed as “Deleterious” are the intersect of variants that annotate as deleterious in each of three approaches.
SNP density was highest along chromosome arms and lower in pericentromeric regions (Figure S3), consistent with the reports of the distribution of gene density (Muñoz-Amatriaín et al. 2015; Mascher et al. 2017). Using pericentromeric regions as defined based on barley recombination rate and gene density reported by Muñoz-Amatriaín et al. (2015), we identified 71,939,192 bp (81.3%) of capture targets in euchromatic regions and 16,511,574 bp (18.7%) in pericentromeres (a BED file of positions covered by exome capture is available at: https://doi.org/10.13020/d6w990). Codon density was similar within exome capture from the two regions. Euchromatic regions included 6,945,584 bp (81.3%) of codons within capture targets and 1,592,281 bp (18.7%) in pericentromeric regions. The euchromatic regions included 401,148 (86.6%) of SNPs vs. 62,060 (13.4%) of SNPs in pericentromeres. This resulted in 3331 (87.7%) dSNPs in euchromatin and 466 (12.3%) dSNPs in pericentromeric regions. Thus, the proportion of dSNPs per codon was lower in the pericentromere than in higher recombination regions (Figure 2A). Another common measure of the effect of recombination rate on the density of deleterious variants contrasted recombination rate with the ratio of dSNPs to synonymous SNPs (sSNPs) (Figure 2B). We found a slightly negative, but highly significant correlation (P < 9e−4), meaning that dSNPs are more common in regions of low recombination.
(A) The number of nonsynonymous SNPs per covered codon in 10-Mb windows across the barley genome. SNPs that annotated as “Tolerated” or “Deleterious” are plotted separately. The light gray shading indicates the pericentromeric region, and the dark gray shading shows the centromere. (B) The ratio of putative deleterious SNPs (dSNPs) to synonymous SNPs (sSNPs) relative to recombination rate. chr, chromosome; dSNP, deleterious SNP; sSNP, [define] synonymous SNP.
To infer the ancestral state of variants in cultivated barley, we performed whole-genome resequencing of H. murinum ssp. glaucum, yielding 371,255,479 reads. A divergence rate setting of 3% in Stampy (Lunter and Goodson 2011) resulted in the largest percentage of reads mapping to the reference genome. Genome-wide coverage was estimated as 37×. This permitted estimation of ancestral state for 283,021 or 69.5% of barley SNPs. Results of ancestral state inference by functional class of variants are shown in Table S4.
Genotyping data
The final data set used for analysis consisted of 5215 individuals. Of the 384 SNPs on the custom Illumina Veracode assay (Tiede and Smith 2018), four were eliminated because of errors in Mendelian inheritance between parents and progeny. Three SNPs with > 20% missing genotypes were also excluded, resulting in 377 SNPs segregating among progeny (purple triangles in Figure S3). For 16 SNPs, either genetic or physical positions needed to be interpolated from flanking SNPs (see supplemental material). The parental lines and progeny produced an average of 366.5 (±40.1) genotyped SNPs. Pairwise diversity averaged 0.32 across cycles, with observed heterozygosity between 8 and 15% in C1 through C3 (Table S5).
Using the 377 genotyped Veracode SNPs, we imputed genotypes for all variants in the pedigreed populations using the program AlphaPeel (Whalen et al. 2018). Imputed genotypes are reported in AlphaPeel output as the expected dosage of the nonreference allele at each site. Recombination probabilities are modeled from interpolated genetic distances between observed markers with known genetic distances (Whalen et al. 2018). The unfolded SFS for parental lines demonstrates that dSNPs are at low initial frequencies (Figure S4). Both the unfolded (Figure 1a) and folded SFS (with all variants) (Figure S2), demonstrate that dSNPs remain at low frequency across generations in the population. Average pairwise diversity for SNPs resequenced in the founder lines and imputed onto progeny was ∼0.19 for synonymous SNPs and ∼0.12 for dSNPs, with noncoding and nonsynonymous SNPs having intermediate levels of diversity (Table S6).
Putatively deleterious SNPs and phenotypic variation
A total of 889 of the 5215 individuals, including the founders, had phenotypic data for grain yield, DON concentration, and plant height. Yield increased and average DON concentration decreased over three cycles of selection (Figure 3). While the individual traits do not show monotonic improvement, an index of yield and DON concentration showed steady improvement in each cycle (Tiede and Smith 2018). Plant height, which was not subject to selection in this population, increased over the course of the experiment (Figure 3). Yield was negatively correlated with the number of homozygous-derived SNPs across all classes (Figure 4). The correlation was greatest for noncoding (the largest class of) SNPs. Based on a product moment correlation, the correlation was significant at P < 0.05 for noncoding and nonsynonymous, and at P < 0.001 for dSNPs (Table 2). For DON concentration and plant height, where larger values are the less desirable trait, the correlations with the number of homozygous-derived SNPs were positive. These correlations are significant with the notable exception of DON and dSNPs (Table 2). The proportion of phenotypic variance explained by all genotypes jointly, also referred to as “SNP heritability,” was estimated using a linear mixed model implemented in GEMMA (Zhou and Stephens 2012). Retaining SNPs with a minimum MAF of ≥1% resulted in the inclusion of 357 of the 377 SNPs genotyped in all progeny. Heritability estimates for this SNP set were 0.198 for yield, 0.357 for DON concentration, and 0.237 for height.
Plots of yield (A), DON concentration (B), and plant height (C) data collected on the experimental population. Values for founding parents (C0), and each of three cycles are shown (C1–C3). In C1–C3, randomly selected lines are shown in white, and lines selected based on genomic estimated breeding values are shown in gray. Data shown are the BLUEs for individual lines based on yield, DON, and plant height observations at 5-year locations. BLUEs, best linear unbiased estimates; C, cycle; DON, deoxynivalenol.
The number of homozygous-derived SNPs across classes for each cycle of the experiment compared to the BLUE for yield. Values are shown for C0 to C3. Correlations and P-values are shown in Table 2. BLUEs, best linear unbiased estimates; C, cycle; ha, hectare; Hom., homozygous.
The population analyzed here involved a large number of relatively small families in a breeding design not specifically structured for linkage disequilibrium (association) mapping. However, given appropriate control for relatedness, we could query markers for genome-wide impacts on quantitative traits. There were no SNPs obtained from exome capture resequencing that crossed the thresholds for genome-wide significance for any of the traits measured (Figure S5A). Among the SNPs directly genotyped in the progeny, a total of 13 SNPs were significant with a 0.01 false discovery rate with the Benjamini–Hochberg procedure (Figure S5B). All of these SNPs had moderate (≥ 14%) MAF in the parental lines. No Veracode SNPs were detected to be significantly associated with DON concentration or plant height in this population (Figure S5B).
For linear mixed-model analysis using SNPs identified in exome capture, the ≥ 1% frequency threshold resulted in the retention of 419,956 SNPs (86% of all SNPs). Heritability estimates were 0.250 for yield, 0.514 for DON concentration, and 0.358 for plant height. These values are consistent with previous estimates: a study of a two-row barley double-haploid population grown across 25 locations reported average yield heritability of 0.35 and plant height of 0.33 (Tinker et al. 1996). Heritability for DON accumulation has been estimated as 0.46 in a separate study of crosses between two- and six-row barley (Urrea et al. 2002). Association mapping using all SNPs imputed onto progeny produced similar results to those described above with relatively few variants reaching genome-wide significance (Figure S5A), consistent with the highly quantitative nature of traits analyzed.
Change in SNP frequency over cycles
Using the parental assignment of genomic segments in the progeny, it was possible to track changes in frequency for segregating variation across various functional classes of SNPs. While all classes of SNPs became more homozygous over generations, dSNPs were lost from the population more frequently than synonymous SNPs (Table 3). Out of 37,766 synonymous SNPs with unambiguous ancestral state (required for dSNPs to infer which variant is likely deleterious) identified in the parents, 30,481 (80.7%) were still segregating in C3. Of the 1913 dSNPs identified with unambiguous ancestral state, 1278 (66.8%) were segregating in C3. However, this measure does not account for lower average DAFs for dSNPs. If measured as relative fold change in derived allele frequency, dSNPs more frequently decrease in frequency (Figure 5). The fold change in DAF was significantly different between each of the functional classes (one-way ANOVA P < 2.2e−16). Among classes, dSNPs experienced the largest decreases in DAF. A Tukey honest significant difference test identified significant differences among all six comparisons of functional classes (e.g., deleterious vs. nonsynonymous), with the largest observed P = 1e-5 (Table S7). The median change in DAF was −0.25 for dSNPs and closer to zero for all other classes (Table 3). Slightly more than one-half of the variants showed decreased DAF over breeding cycles, but this trend was observed at 0.627 of dSNPs. When using the pedigree to establish expectations for the allele frequencies in each cycle, we still observed a preferential loss of dSNPs as segregating variation (Figure 1, Table 3, Figure S2). When considering the variants with an inferred ancestral state, dSNPs had a larger proportion of variants that fixed for the ancestral state than other classes of variants (Figure S6). Fold change across the genome for individual classes of variants can be seen in Figure S7. Of the 1913 dSNPs with inferred ancestral state, 621 (32.5%) were fixed for the ancestral allele, while 14.6, 2.8, and 2.7% of noncoding, synonymous, and nonsynonymous SNPs were fixed for the ancestral allele, respectively.
The fold change in derived allele frequency (DAF) in SNPs from four classes of variants. The nonsynonymous class includes only SNPs determined to be “Tolerated” based on deleterious variant annotation.
The number of homozygous-derived dSNPs was reduced in each cycle, but the reduction was stronger for the lines selected for yield and reduced DON concentration than for randomly chosen lines from the same cycle (Figure 6). In other classes of variants, selected lines tend to have slightly more homozygous-derived variants than randomly chosen lines; across classes of SNPs, derived homozygous variants become less frequent over cycles.
Number of dSNPs in the homozygous state in parents and progeny over three breeding cycles, C1–C3. Values for all individuals are shown, with random samples in C1–C3 in black and selected samples in red. Boxplots summarize the data for each partition of samples. C, cycle; dSNP, deleterious SNP.
For homozygous-derived dSNPs, the difference in selected and random lines differed by cycle. For C1, selected lines had a mean of 224.25 (±22.72) homozygous dSNPs relative to 229.05 (±22.82) in randomly chosen lines, a difference that was not significant in a one-sided Student’s t-test, P = 0.060. The dSNP mean homozygosity was a slight decrease from 225.50 (±28.87) homozygous dSNPs in founders in C0. Selection in C2 saw a reduction in DON concentration but little change in yield (Figure 3); see also (Tiede and Smith 2018). In that generation, selected lines averaged more dSNPs than randomly chosen lines, 225.73 (±19.87) vs. 216.34 (±21.39). C3 progeny showed yield improvement, with minimal change in DON. The difference in selected and randomly chosen lines for mean dSNPs was ∼6%, with 205.86 (±16.43) vs. 218.02 (±15.52), with P = 0.00017 in a one-sided Student’s t-test. The number of homozygous dSNPs over generations changes more than the dosage of dSNPs in individual lines (Figure S8), consistent with the effects of dSNPs being primarily recessive.
Discussion
We examined the fate of multiple classes of variants in a population subject to genomic prediction and selection for two unfavorably correlated quantitative traits over three cycles. Selection was based on genomic prediction from a genome-wide set of 384 SNPs genotyped in all progeny. This selection did not make use of information on the functional annotation of variants. We identified 3855 putative dSNPs segregating in protein-coding regions; most of these SNPs were at low frequency in the founding parents (Figure S2 and Figure S4) and, on average, decreased slightly in frequency over the course of the experiment (Figure 5). The highest-yielding progeny in the population carried fewer dSNPs than progeny drawn at random (Figure 6).
Over three cycles of intercrossing and selection, the proportion of dSNPs occurring in the highest-derived frequency class (Figure 1) or reaching fixation (Table 3) was notably lower than other classes of SNPs. Taken together, these lines of evidence suggest that dSNPs that contribute to a diminution of yield are selected against despite the limitations of population size, and the countervailing effects of selection on predicted yield and FHB resistance.
Though progeny were selected for both predicted yield and FHB resistance, lines selected based on genomic breeding value typically had a lower total dosage of dSNPs (including SNPs in both the heterozygous and homozygous state) (Figure S8) and fewer dSNPs in the homozygous state (Figure 6). The reduction in the number of homozygous dSNPs occurred over successive generations in the experiment, resulting in a significant negative correlation between both yield and the number of homozygous SNPS, including dSNPs. A larger number of homozygous-derived SNPs was associated with higher DON, the undesirable state. The correlation between DON concentration and dSNPs was not statistically significant (Table 2). This is consistent with the expectation that dSNPs are more likely to be predictive to fitness-related phenotypes such as yield (Ramu et al. 2017; Yang et al. 2017; Valluru et al. 2018). Plant height was not under selection, but increasing plant height is generally not desirable. It was positively correlated with the number of homozygous-derived SNPs (Table 2).
The barley genome includes large pericentromeric regions with minimal crossover (Muñoz-Amatriaín et al. 2015; Mascher et al. 2017) (Figure S3). Based on our exome capture resequencing, these regions harbor fewer dSNPs per codon than the distal arms of chromosomes (Figure 2). This should not be taken as evidence that linked selection in these regions is unimportant, but rather that gene density plays an important role in determining the distribution of dSNPs across the genome. Previous studies have suggested that dSNPs occur at a higher frequency in lower-recombination regions of the genome in sunflower (Renaut and Rieseberg 2015), rice (Lu et al. 2006; Liu et al. 2017), and soybean (Kono et al. 2016). Evidence for this phenomenon in maize is mixed, with no evidence for higher mutational load reported by Mezmouk and Ross-Ibarra (2014), whereas it was identified by Rodgers-Melnick et al. (2015). Comparison among studies is made more difficult by differences in approaches for dSNP annotation and the sequence diversity statistics used as a point of comparison (e.g., the density of synonymous SNPs) [see Liu et al. (2017) and Moyers et al. (2018)]. There may also be a weaker relationship between recombination and diversity in predominantly self-fertilizing species (Marais et al. 2004). An implication is that for barley, and perhaps other species, the majority of dSNPs occur in genomic regions where crossover rates are relatively high. Thus many dSNPs can potentially be removed from populations based on the action of crossover and independent assortment.
This study involved simultaneous prediction and selection on two quantitative traits that are unfavorably correlated. This represents a somewhat realistic scenario for many applications of genomic prediction. All three traits measured in the study are highly quantitative. Only in the case of yield do we identify loci with a significant measurable effect on the phenotype. The 13 yield-related variants that survive adjustment for multiple testing, found among the SNPs that were genotyped directly, are scattered across the genome, and each variant has a relatively small effect on yield.
The identification of deleterious variants may provide useful information for genomics-based breeding (Morrell et al. 2012; Yang et al. 2017). However, further work is needed to refine the relationship between deleterious variants and agronomic traits of interest. The fitness effects of individual deleterious variants in crops remain largely unknown and the shape of the distribution of fitness effects of all variants is a challenging quantity to estimate (Eyre-Walker and Keightley 2007; Brandvain and Wright 2016). Distinguishing dSNPs with large vs. small effects on fitness could improve genomic prediction strategies. As with any examination of a complex trait, distinguishing classes of variants requires large sample sizes of phenotyped individuals. Also, because deleterious variants are completely commingled with other classes of variants, recombination limits the degree to which the effects of deleterious variants can be isolated. Conventional genomic selection relies on marker effects estimated from SNPs linked to causative variation and requires trait data to obtain those estimates. In contrast, dSNPs could be directly selected on and are identified without phenotypic data. While we observed little difference in the number of dSNPs per line, the number of private dSNPs varied widely, providing the opportunity to select progeny with fewer rare and potentially deleterious variants than either parent. Thus, fitness-related variants could be used in conjunction with trait-based genomic prediction to select superior individuals.
Acknowledgments
The authors thank Ana Poets and Tyler Tiede for sharing computer code used in data analysis; Maria Muñoz-Amatriaían and Timothy Close for providing SNP genotyping data from barley genetic mapping populations that included the variety Morex, used for the reference genome; Shiaoman Chao for providing raw genotyping data; Angelica Guercio for library preparation of the H. murinum sample that was used in this study; Sebastian Beier for physical positions for a portion of genotyped SNPs; and Li Lei, Yong Jiang, Jochen Reif, Albert Schulthess, Ruth Shaw, Robert Stupar, Peter Tiffin, and Yusheng Zhao for valuable comments on an earlier version of the text. This research was carried out with hardware and software support provided by the Minnesota Supercomputing Institute at the University of Minnesota. We acknowledge financial support from the National Science Foundation (grant IOS-1339393), the Minnesota Agricultural Experiment Station Variety Development fund, and a University of Minnesota Doctoral Dissertation Fellowship (in support of T.J.Y.K). The funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript.
Footnotes
Supplemental material available at figshare: https://doi.org/10.25386/genetics.9872126.
Communicating editor: A. Charcosset
- Received May 7, 2019.
- Accepted October 11, 2019.
- Copyright © 2019 by the Genetics Society of America