The Epidermal growth factor receptor is an essential gene with diverse pleiotropic roles in development throughout the animal kingdom. Analysis of sequence diversity in 10.9 kb covering the complete coding region and 6.4 kb of potential regulatory regions in a sample of 250 alleles from three populations of Drosophila melanogaster suggests that the intensity of different population genetic forces varies along the locus. A total of 238 independent common SNPs and 20 indel polymorphisms were detected, with just six common replacements affecting >1475 amino acids, four of which are in the short alternate first exon. Sequence diversity is lowest in a 2-kb portion of intron 2, which is also highly conserved in comparison with D. simulans and D. pseudoobscura. Linkage disequilibrium decays to background levels within 500 bp of most sites, so haplotypes are generally restricted to up to 5 polymorphisms. The two North American samples from North Carolina and California have diverged in allele frequency at a handful of individual SNPs, but a Kenyan sample is both more divergent and more polymorphic. The effect of sample size on inference of the roles of population structure, uneven recombination, and weak selection in patterning nucleotide variation in the locus is discussed.
THE Epidermal growth factor receptor (Egfr) in Drosophila is involved in and quite essential for a wide range of activities, but there is little indication that it has ever had a central role in promoting evolutionary change. Unlike the Hox family, for example, which is widely recognized as a key mediator of morphological diversification (Gellon and McGinnis 1998), or numerous enzymes in central metabolism that facilitate ecological adaptation (Eanes 1999), Egfr is more obviously regarded as a generic signaling component that likely helps to promote developmental stability. In the absence of compelling implication of involvement in some unusual phenomenon, it is the sort of gene that escapes the notice of evolutionary geneticists. Yet to the extent that quantitative variation is the product of subtle polymorphisms in hundreds of loci, it is also arguably the sort of gene that should receive attention.
Point estimation of the basic parameters of nucleotide variation is generally recognized to be unaffected by sample size, and since the tests of neutrality based on these estimates have low power to detect weak selection, most population genetic surveys deal with samples of no more than 30 alleles. However, our study was motivated primarily by quantitative genetic questions, requiring that we have confidence in estimates of the level and variance of linkage disequilibrium—and hence haplotype structure—throughout the locus, and assessment of the presence or absence of population structure. As shown in the accompanying article (Palsson and Gibson 2004, accompanying article in this issue), it turns out that complete sequence coverage rather than selective genotyping is critical to the detection of genotype-phenotype associations. We argue here that for some purposes sample sizes of several hundred alleles are likely to yield information of interest to molecular evolutionists that cannot be gleaned from smaller samples. In particular, significance testing of population structure, inference that recombination rates may vary along a locus, and monitoring of the frequency of rare and complex alleles all depend on larger sample sizes.
We chose Egfr over thousands of other typical loci for two main reasons. First, it has highly conserved roles in cell growth and differentiation throughout the animal kingdom (Freeman 1998; Stein and Staros 2000; Shilo 2003), so is intrinsically interesting in terms of understanding the genetics of developmental pathways. Second, it is an excellent candidate gene for quantitative association with a range of traits in Drosophila (Dworkin et al. 2003; Palsson and Gibson 2004, accompanying article in this issue) yet has so many pleiotropic functions (e.g., Clifford and Schüpbach 1994; Wang et al. 2000; Duchek and Rorth 2001; Galindo et al. 2002; Johnson-Hamlet and Perkins 2002; Zecca and Struhl 2002; Yang and Baker 2003) that it is easy to imagine that very little variation would be tolerated in the protein structure, as seems to be the case. These different functions could also set up conflicting trade-offs at different stages of development, and linkage disequilibrium between regulatory sites affecting different domains of expression could impact polymorphism, perhaps resulting in strong balancing selection. Similar to downstream signaling components (Riley et al. 2003), we do not find evidence for the latter, but discuss some unexpected haplotype structure and population differentiation in the context of weak selection, uneven recombination, and a predominance of purifying selection.
MATERIALS AND METHODS
Thirty-six Kenyan lines were established from stocks obtained from the Bowling Green Stock Center in 1997, based on flies collected by R. Woodruff in the 1970s. Second chromosomes of these lines were isogenized by passage over the CyO balancer in the Samarkand background. Several cases of probable gene conversion to the CyO allele were observed upon sequencing, but these were excluded from further analysis. Eighty-three Californian isofemale lines were collected by S. Nuzhdin from the Wolfskill orchard near Winters, California (CA), in 1998, and subjected to >40 generations of pairwise sib mating to generate the near-isogenic lines surveyed here (Yang and Nuzhdin 2003). Similarly, 150 North Carolinian near-isogenic lines were established from isofemales trapped by us from several bins in a peach orchard near West End, North Carolina (NC), in July 2000. After 10 generations of sib mating only 70% of the Egfr loci in these lines were homozygous, so sib mating was continued in multiple sublines for a further 5 generations, and homozygous lines were then chosen on the basis of sequencing. A total of 130 NC near-isogenic lines survived, 15 of which remained at least partially heterozygous (cf. 2 of the CA lines), in which case that portion of the locus was ignored. It is probable that the process of inbreeding and selection of homozygous chromosomes biases the sampling, but no indication of departure from Hardy-Weinberg proportions was observed in the data and if there is a bias in the data set, it is toward an excess of rare alleles. The Drosophila simulans allele was obtained from the same population of West End flies. Flies were maintained in vials on 10 ml standard cornmeal medium supplemented with yeast.
All single-nucleotide polymorphism (SNP) and insertion/deletion (indel) genotypes were obtained by direct sequencing of PCR products obtained from squish-preps of a single male fly per line. The 10.9 kb was surveyed with 6 amplification reactions, ranging from 1.2 to 2.1 kb in length, and 17 sequencing reactions generated from the external PCR primers and an internal primer (listed in supplementary Table 1 at http://statgen.ncsu.edu/ggibson/supplinfo/supplinfo6.htm) for most of the fragments, as diagrammed in Figure 1. Amplification reactions were performed with Promega (Madison, WI) Taq polymerase, products were purified from agarose gels with QIAquick columns (QIAGEN, Valencia, CA), and sequencing was performed with Big Dye II mix (Perkin-Elmer, Norwalk, CT) on ABI 3700 sequencers at the North Carolina State University Genome Research Laboratory. Most high-quality reads were in the range of 650–700 bases, but some regions proved particularly refractory to both amplification and sequencing, due mainly to high SNP and indel polymorphism and to homopolymeric runs. New primer combinations were tested, particularly in the vicinity of exons 1 and 2, with limited success, and consequently depth of coverage varies along the locus. No attempt was made to sequence all fragments on both strands, due to a trade-off in time and expense against gain in accuracy. However, we estimate the base-calling error rate at <0.1% per polymorphic site on the basis of repeat sequencing of some alleles and up to 100-bp overlaps between ends of sequence fragments that resulted in ∼1.5× coverage with 18% of all nucleotides represented in at least two traces. There is slightly >2 Mb of completed sequence, and just four unresolved polymorphisms remain after manual inspection and comparison of pairs of traces.
Nevertheless, the genotype matrix for the entire collection of 257 partial Egfr alleles is only 74% complete, with an average read length of 8067 (±131) bp. The three main reasons for incomplete coverage are (i) reduced sampling of problematic regions, most notably around exon 2 in the CA lines; (ii) lingering heterozygosity in portions of some alleles, primarily in the NC lines; and (iii) reduced finishing effort for the Kenyan lines, as these were of less interest to us in relation to our genotype-phenotype mapping studies. Consequently all molecular evolutionary conclusions should be interpreted with these caveats of possible sequence error and sampling bias in mind. Locations of variants are specified relative to GenBank entry 17571116 (also available as supplementary information at http://statgen.ncsu.edu/ggibson/SupplInfo/SupplInfo6.htm), which spans 48 kb encompassing the complete Egfr locus and up to 5 kb on either end. Coverage of the predicted coding region in our sequence is 100%; the extent of Egfr regulatory sequences is unknown, but at least one-third of the 24-kb intron 1 is occupied by three other open reading frames (ORFs), and the flanking genes are closely adjacent, so coverage of the entire locus lies between 25 and 75%.
All polymorphisms were aligned manually and inspected by two observers, and only unambiguous SNPs are reported. Trace files were imported into the ContigExpress module of VectorNTI Version 5 (Informax) for primary editing and assembly of contiguous alleles for each region from overlapping sequence reads. Sequence alignment was completed using ClustalW (Thompson et al. 1994), and the matrix of alleles was transferred to Genedoc (Nicholas et al. 1997), which facilitates manual adjustment of indel polymorphism. Final sequences are available from GenBank: D. melanogaster alleles are divided by exons and populations as follows: exon 1 [CA, AY460697–AY460774; Kenya (K), AY460775–AY460801; NC, AY460802–AY460927], exon 2 (CA, AY460928–AY460979; K, AY460980–AY461000; NC, AY461001–AY461101), and exons 3–6 (CA, AY461102–AY461184; K, AY461185–AY461220; NC, AY461221–AY461357). Outgroup sequences are from D. simulans, for exons 1 (AY460690–AY460692), 2 (AY460693), and 3–6 (AY460694), in addition to one D. sechellia allele for exon 1 (AY460689).
Analysis of nucleotide diversity:
Since the Kenyan sample had a large number of unconfirmed singletons and is less than one-fifth the size of the North American sample, all analyses aside from the FST comparison are based on the combined NC and CA sample. Nucleotide diversity was initially analyzed as the pairwise distance between alleles (π) and on the basis of the number of segregating sites (θ) in both DnaSP Version 3.53 (Rozas and Rozas 1999) and Tassel Version 2.0 (E. Buckler, http://www.maizegenetics.org). Inconsistencies between these estimates around exons 1 and 2 arose as a result of the way the two programs handle missing data, so we report more direct manual estimates. Nucleotide diversity, π, was estimated as the per-nucleotide expected heterozygosity in a sliding window of 100 bp in Figure 2A and over each functional region of the gene in Table 1. The estimates of θ per nucleotide in Table 1 were obtained as S/nai, where S is the observed number of segregating sites, and ai is the sum over i = 1 to N − 1 of 1/i for N, the mean number of alleles in the segment of length n nucleotides (Watterson 1975). These two estimates track closely and while there is a tendency for θ > π, this is not significant in any region or across the entire locus. Tests of neutrality (Hudson et al. 1987; Tajima 1989; Fu and Li 1993; Fay and Wu 2000) were implemented in DnaSP Version 3.53 on reduced data sets containing only complete sequences for sliding windows and/or specific segments of the locus, but did not provide any strong evidence for either directional or balancing selection: Palsson (2003) provides a detailed description. The effects of sample size on tests of neutrality were determined by 100 jackknife iterations with N = 12, 48, or 96, for three regions of ∼1000 bp each that had no missing data. Protein divergence (McDonald and Kreitman 1991) was assessed with Fisher's exact test. Alignment with the D. pseudoobscura homolog (raw contig 1071 downloaded from http://www.hgsc.bcm.tmc.edu) was analyzed with AVID and visualized with VISTA (Mayor et al. 2000).
Linkage disequilibrium (LD) was assessed in Tassel, using Fisher's exact test (Lewontin 1988; Weir 1996) to assess the significance of the squared correlation in allele frequencies, r2. All analyses were performed for NC and CA separately and as a combined data set, and results are reported for the combined North American data set since no qualitative differences in patterns of LD among populations were detected. Only sites that were present in at least 50 alleles were included in the analyses, and only if both variants were observed in 5 or more alleles. The second criterion is critical due to limitations of contingency tests with small marginal counts (Upton 1982), and this has demonstrated effects for estimates of linkage disequilibrium (Lewontin 1995). The effects of sample size, minimum number of alleles, and rare variants were explored by jackknife procedures in Tassel (with missing data) and those of sample size only in DnaSP 3.53 (without missing data).
Population differentiation among all three populations was estimated using the AMOVA procedure in Arlequin Version 2.0 (Schneider et al. 2000) for pairs of populations. The significance of FST parameters was assessed by 10,000 permutations with Bonferroni adjustment for the 258 comparisons involving common SNPs and indels in all three populations. Initial analyses were performed with sliding windows of haplotypes of 10 or 5 sites, and subsequently for each site separately. Demonstration of the power gained with larger samples was demonstrated by 1000 jackknife samples of the top four most population-stratified sites.
Homogeneity of the distribution of indel events along the locus, by regions defined in Table 1, was analyzed with a goodness-of-fit test (Schaeffer 2002). The effect of sample size on this test and on the variance in indel size was also estimated by 1000 jackknife iterations, with a sample size of 122. The shape of the microsatellite length distribution was studied by fitting a normal distribution and assessing deviations from uniformity by a Kolmogorov-Smirnov test. Skewness and kurtosis of the observed distributions were also compared to estimates derived from permutations programmed in R, with individual repeats scrambled with respect to one another.
Parameters of molecular variation in Egfr:
Almost 10.9 kb of Egfr was sequenced in three segments of 1250, 2090, and 7523 nucleotides, corresponding, respectively, to DNA including the first and second alternate 5′ exons and the cluster of four 3′ exons that encode the bulk of the protein. A schematic of the gene structure is shown in Figure 1, and parameters summarizing the polymorphism are provided in Table 1. A total of 523 biallelic SNPs, 24 triallelic SNPs, and 70 indels were detected, including 22 amino acid replacement polymorphisms. A Hudson-Kreitman-Aguadé test on intronic vs. exonic sequences was not significant (P = 0.89). Sequence diversity estimated on the basis of the number of segregating sites (θ) or as the average nucleotide heterozygosity (π) hovers around 0.01 substitutions per nucleotide, and there is an average of 1 SNP/20 bp. This is well within the normal range for D. melanogaster genes (Moriyama and Powell 1996). The estimates of θ and π vary along the locus, but do not differ significantly from one another in any region, suggesting that to a broad approximation the Egfr is evolving according to the expectations of neutral theory subject to varying levels of constraint in regulatory, exonic, and intronic sequences. Sample size >20 sequences has little effect on point estimation of π, θ, or haplotype diversity, but various tests of neutrality give more heterogeneous results with smaller samples, as indicated in supplementary Table 3 at http://statgen.ncsu.edu/ggibson/SupplInfo/SupplInfo6.htm.
There is a low level of amino acid replacement polymorphism in the Drosophila EGF receptor (DER) protein sequence. Only six “common” replacement polymorphisms (where both alleles have frequencies >0.05) are observed, and four of these are located within the putative signal peptide region of alternate exon 1. All of the remaining 15 nonsynonymous SNPs are present in just 1, 2, or 4 of the 210 sampled North American sequences as listed in supplementary Table 2 (http://statgen.ncsu.edu/ggibson/SupplInfo/SupplInfo6.htm), and half of these affect amino acids that are conserved in the D. pseudoobscura DER sequence. An exonic 9-bp deletion polymorphism that replaces a conserved ProAsn3 with His in exon 6 of a Californian allele is likely to be deleterious. Two of the North Carolinian alleles appear to have one instead of two initiating methionine codons in alternate exon 1, but polarization with the sibling species indicates that the derived allele is the insertion of 3 bp that is approaching fixation in D. melanogaster. The ratio of replacement to synonymous substitutions in the entire coding region for the comparison with a D. simulans allele is 0.05, which is normal for Drosophila, but is generally taken as a sign of strong functional constraint on the protein (Aquadro et al. 2001).
Each region of Egfr tells a slightly different story concerning the possible evolutionary forces acting on the gene, when intraspecific variation (Figure 2A) is contrasted with divergence over short (Figure 2B, D. simulans, ∼2.5 million years) or long (Figure 2C, D. pseudoobscura, 45 million years) timescales (Powell 1996). In the next four paragraphs, we discuss the parameters of variation in the two alternate 5′ exons, in intron 2, and in the main coding region.
Exon 1 is remarkable for being considerably more polymorphic than alternate first exon 2 at the protein level. It encodes by far the most polymorphic 51-amino-acid segment of the DER protein, and there are also six replacements relative to the D. simulans allele compared with just four in the remaining 1424 residues. Noncoding variation is at a slightly higher level than that observed elsewhere in the locus, and there are 15 indel variants in just 1.1 kb flanking the exon. Oddly, there are no synonymous substitutions relative to D. simulans, but this is not sufficient to provide evidence for positive selection by the McDonald-Kreitman test (McDonald and Kreitman 1991; two-tailed P = 0.068, Fisher's exact test). Comparison with D. pseudoobscura also indicates relaxed constraint: unlike the remainder of the locus, only a few short stretches of >50% nucleotide identity are seen, and apart from a putative start site and splice donor sites, alignment was impossible.
Exon 2 is characterized by an unusually low level of haplotype structure and by a slight but nonsignificant excess of rare segregating sites per nucleotide (see Table 1). This pattern extends upstream of the coding region, but the 1.4 kb downstream of the exon conforms more to the usual pattern of similar numbers of rare and common polymorphisms. Another intriguing feature associated with exon 2 is a 79-bp element 350 bp upstream of the translation start site. The element consists of two sets of CN dinucleotide repeats arranged as (CN)15X21 (CN)13 and is conserved in D. pseudoobscura although with reduced spacing between the sets of repeats. Ten of the sites are different in D. simulans, 9 of which change the N nucleotide in one of the 28 repeats. Of three segregating sites in D. melanogaster, only one is common, and it affects a C and is associated with variation for wing shape (Palsson and Gibson 2004, accompanying article). These data strongly suggest that the element is part of a motif that affects Egfr transcription in Drosophila. In addition, a complex CAA microsatellite described in the discussion is located 250 bp upstream of the translation start site for exon 2.
The 2.4-kb portion of intron 2 immediately preceding exon 3 is overall the least polymorphic region of the entire Egfr locus and includes five stretches of at least 100 bp that are almost monomorphic in D. melanogaster and show >50% nucleotide identity in D. pseudoobscura. Most of the 102 segregating sites in this region are rare (only 26 common sites observed compared to 57 expected), and similar to exon 2, there is little haplotype structure. By contrast with the sequences flanking exons 1 and 2, which contain 22 common deletions of which half are >4 bp, only 2 long deletions present in just three alleles are found upstream of exon 3. Similarly, only two short deletions are seen relative to the D. simulans allele. Although the cis-acting regulatory regions have not been mapped in the Egfr locus, the phylogenetic and intraspecific shadowing data in Figure 2 strongly suggest intron 2 as a candidate regulatory region.
Exons 3–6 encode the bulk of the DER protein, are highly conserved between species, and show low levels of polymorphism in D. melanogaster. Only 4 of the 1325 amino acids in this region differ in D. simulans, and there are only two segregating common replacement polymorphisms. The three introns of 170, 66, and 74 bp in the 4.3 kb of sequence covering these four exons have dramatically elevated divergence relative to the exons in both of the other species, as seen in Figure 2, B and C, although no increase in intraspecific polymorphism is observed in the two smaller introns (Figure 2A). Despite these unusual features, there is no formal support for departure from neutrality, perhaps because the intron sequences are so short. Intron 3 is particularly polymorphic, with 1 in every 6 nucleotides harboring a polymorphism. The two ends of the intron, 10 bases from their respective splice sites, show different recombination histories. A remarkable 3-base stretch before the start of exon 4 has polymorphisms with the rare allele at frequencies of 0.20, 0.34, and 0.50 that are almost in linkage equilibrium with one another, whereas four substitutions at a frequency of 0.04 in a 7-base stretch following exon 3 are in perfect linkage disequilibrium. Sequence conservation extends for at least 1 kb 3′ of the termination codon, in both the 3′-untranslated region (UTR) and the intergenic region.
Regional variation in the extent of haplotype block structure along the locus is shown in Figure 3, a plot of the degree (above the diagonal, D′) and significance (below the diagonal) of LD between sites along the gene. The plot is for the combined NC and CA data consisting of partial sequence from 210 lines and for clarity includes only SNPs where the less common allele has a frequency of at least 0.25. Two large blocks of strong LD correspond to 1.25 kb in and around exon 1 and the highly polymorphic 400-bp stretch encompassing exon 3 and intron 3. Both the exon 2 and intron 2 sequence stretches are essentially devoid of significant LD, while the 3′ half of the gene including the long exons shows a string of imperfect LD blocks up to 1 kb in length and a suggestion of occasional long-range associations. Significance is a function of the allele frequencies and correlation between them (r2) as well as the total sample size, whereas D′ shows the extent of LD relative to the maximum possible association. Even within the blocks of highly significant LD, adjacent sites are rarely in complete LD, and sites within a few hundred bases of one another can be in linkage disequilibrium. The depth of sampling enabled a comparison of LD between the North American populations, and it proved quite similar as the Pearson correlation between the two populations was 0.86 for r2 and 0.54 for D′. Table 2 shows that the power to detect pairwise associations that are significant experiment wide rises with sample size, but also depends critically on the absolute count of rare alleles, as argued by Lewontin (1995).
Linkage disequilibrium is generally expected to decay with distance, as a result of recombination breaking up haplotypes that are generated as new alleles arise or are introduced by admixture. The plot of LD with distance in Figure 4 confirms that this tends to be the case, with only sites within 50 bp of one another in linkage disequilibrium. LD rarely rises above the background level for sites that are separated by >1 kb. Note that the gaps in the plot without points arise because there were two unsequenced regions of 23.5 and 3 kb in introns 1 and 2. Some long-range associations are expected to arise by chance in the 20,100 pairs of alleles, so most of the r2 measures beyond 5 kb are probably sampling artifacts. Over a short range, LD can be maintained by factors such as low recombination, gene conversion, and epistatic selection, but a comprehensive analysis of these tendencies is beyond this study.
Another way to represent haplotype structure is to plot the actual alleles and haplotype networks, as is done for five representative sequence segments in Figure 5. On the left, each row represents a SNP, while alleles are columns ∼0.5 mm wide, and light shading indicates the rare genotype. In the networks on the right, circle diameters are proportional to the number of alleles of a common haplotype, and these are joined by lines with lengths proportional to the number of substitutions distinguishing the haplotypes. The short segments from exons 1 and 5 are examples of true haplotype blocks, with just a handful of alleles in each case having arisen by recombination. Exon 5 provides an example of a dimorphic haplotype, with one common type that is four or five substitutions distinct from the other two common types. The 5′ end of intron 2 by contrast shows several haplotypes and recombination over a stretch of 250 bp. Haplotype dimorphism is also observed in short stretches of exon 3 and intron 3, with deep branches of at least six substitutions separating the two major types in each case, but recombination and/or gene conversion has also generated networks of haplotypes that cannot be generated solely by stepwise mutation. Both of these blocks also present instances of linkage equilibrium between adjacent sites that are interspersed with other sites in LD.
Population differentiation was investigated using Wright's estimator of the proportion of variance in allele frequencies attributable to two or more subpopulations relative to the total population, FST (Weir and Hill 2002). Figure 6 shows FST as a sliding window of 10 polymorphic sites along the locus for the two North American populations below the abscissa, and each of these is compared to the smaller Kenyan sample above it. Bars below the plot highlight regions of significance for the three contrasts with four bars representing P < 0.0001 and one bar 0.01 < P < 0.05. For the comparison of North Carolina and California, one window shows a very large FST that is significant after Bonferroni correction for multiple comparisons (P < 0.0002), but the others must be regarded as marginally significant at best. Analysis of single sites summarized in Table 3 indicates that in all but one of the cases the population structure is focused on a single SNP around which linkage disequilibrium decays rapidly, the exception being the trio of sites 40428, 40458, and 40464 in exon 6.
Divergence is much more pronounced between the North American and Kenyan samples, with a maximum FST of 0.4. There is a tendency for elevated divergence associated with exon 2, intron 2, and exon 6, and it is perhaps noteworthy that the latter two regions are among the least polymorphic in the sample. Population structure can also be seen in the comparison of private allele frequencies, namely alleles that are observed in only one of the populations (Slatkin 1985). The Kenyan sample has 92 private alleles, almost three times as many as in the Californian sample, which is two and a half times larger. The North American samples share 50 sites that are not observed in Kenya, and 14 of these are common polymorphisms in the sense that the less common allele has a frequency >0.05. Moreover, the Californian sample shares 14 sites with the Kenyans while the larger North Carolina sample shares only 4 sites with the Kenyans.
Our survey is 20 times larger than most molecular evolutionary studies, which focus on 2 or 3 kb of 30 or so alleles. Increasing the sample size to >200 alleles has little effect on point estimation of most parameters of nucleotide variation, so for most molecular evolutionary studies it is not necessary. For this reason, Pluzhnikov and Donelly (1996) concluded that sequencing effort should be invested in length, not number, of alleles. However, consistent with Simonsen et al.'s (1995) simulation study of the power of Tajima's D to detect selective sweeps, we find that large samples decrease the variance of estimates to a degree that is sufficient to markedly increase the power of tests of neutrality. We argue here that detailed analysis of large samples of select regions of the Drosophila genome will provide extra insight into the evolution of population structure, haplotype structure, and complex insertion-deletion polymorphisms.
The only tests of neutrality that produce nominally significant deviations from the null hypothesis are Fu and Li's statistics, of which D* is plotted in Figure 7 for 100-nucleotide windows along the Egfr locus. These sites exceed significant thresholds only for the complete sample: jackknife samples of <100 alleles summarized in supplementary Table 3 (http://statgen.ncsu.edu/ggibson/SupplInfo/SupplInfo6.htm) do not provide any suggestion of departure from neutrality. Of the 341 tests, 21 are significant at P < 0.05, but it should also be noted that none of these exceed the conservative Bonferroni threshold for multiple comparisons. Consequently, marginal evidence for purifying selection in the 3′ half of the gene is rendered insignificant by inclusion of the 5′ regions in the analysis. The fact that there is a large excess of negative test statistics might also imply an excess of rare alleles, but Bustamante et al.'s (2002) comparison of multiple loci from several species implies that the baseline for comparison is species specific and that negative test statistics are common in Drosophila. What has not been studied carefully is the variance of tests like Tajima's D and Fu and Li's D* within loci and the effect of weak selection on that variance. Since none of the Egfr regions are even nominally significant in samples of <40 alleles, genome-scale sequencing of this sample size will not have the resolution to address all molecular evolutionary processes.
Estimation of haplotype structure is essential both for understanding the evolutionary history of the locus and for determining the density of SNP sampling that will support genotype-phenotype association studies. The two analytical approaches used to infer haplotypes are assessment of the level of linkage disequilibrium (Figures 3 and 4) and phylogenetic analysis of regions of low recombination (Figure 5). Both analyses imply that haplotypes in Egfr are generally restricted to fewer than five common polymorphisms that all lie within one 500-bp stretch of sequence. Beyond 1 kb, linkage disequilibrium drops below 10% of maximum, and almost all sites are effectively in linkage equilibrium. The few examples of stronger association are due to either relatively rare alleles or chance, given the very large number of tests (Pritchard and Przeworski 2001). Assessment of the significance of LD is one area where larger sample sizes have a dramatic impact, as the number of significant associations increases >10-fold by measurement of 128 rather than 68 alleles. Comparison of LD in multiple populations is emerging as a crucial aspect of the human HapMap project, and our results confirm the importance of large samples. The documented levels of LD are consistent with other surveys of variation in D. melanogaster (Schaeffer and Miller 1993; Richter et al. 1997; Langley et al. 2000; Zurovcova and Ayala 2002) and imply that haplotypes in the fruitfly are almost two orders of magnitude shorter than those observed in humans (Reich et al. 2001; Ardlie et al. 2002). Both the low level of linkage disequilibrium and high polymorphism can be attributed to the fact that flies are outcrossing and have had an enormous effective population size for millions of years (Aquadro et al. 2001).
More detailed analysis of the haplotype blocks, however, suggests a fine scale to the evolutionary forces acting on the locus. Several authors have remarked that there is in fact an excess of linkage disequilibrium in non-African flies relative to theoretical expectations based on empirical measurement of the recombination rate and neutral mutation parameter (Schaeffer and Miller 1993; Kirby and Stephan 1996; Andolfatto and Przeworski 2000; Wall 2001). Possible biological explanations include pervasive balancing selection or small selective sweeps, a history of admixture in the species, and unequal gene conversion and/or recombination rates along a chromosome (Nachman 2002; Wall et al. 2002; Andolfatto and Wall 2003). Figure 5 provides a hint of the latter, since nonoverlapping 250-bp segments of Egfr with similar SNP densities clearly produce haplotype networks with distinct topologies ranging from two deep branches to a network of recombined alleles. Similar to our previous observation of bimodality in random sequence stretches (Teeter et al. 2000), it appears that several of the putative haplotype blocks in Egfr are bimodal. Coalescent theory suggests that the deep branches are not unexpected in a sample of genes, but that their depth and prevalence will be affected by a variety of demographic factors (Slatkin and Hudson 1991). While we do not present new analytical statistics, we hope that the suggestions of variation in haplotype structure across the locus will inspire more theoretical work on the variance of branch lengths under uneven recombination and with weak selection.
An even more subtle pattern in the sequence variation is the existence of strong linkage disequilibrium between sites within 1 kb of one another that are nevertheless separated by several common SNPs in linkage equilibrium. A level of heterogeneity in LD with distance is anticipated due to stochastic evolutionary forces like mutation, genetic drift, and gene conversion but could also be a consequence of population history or directional or epistatic selection. Our analysis of a hypermorphic Egfr phenotype in the eye (Dworkin et al. 2003) identified one example of significant epistatic interaction between sites 40119 and 40620 in exon 6. Those sites are in LD but are separated by several common and uncoupled SNPs. In this case, linkage disequilibrium reduces the frequency of the hyperactive two-site haplotype, consistent with the possibility that purifying selection contributes to maintenance of the association between the sites. In general, for this mechanism to operate, the selection pressure must be greater than the recombination rate (Lewontin 1964), which is of the order of 10−5 events/generation between sites separated by 100 bp in Drosophila euchromatin. Consequently, weak epistatic selection would not be expected to maintain linkage disequilibrium between pairs of sites over much more than a few hundred base pairs. Several instances of departure from monotonic decrease within the linkage disequilibrium blocks are evident in Figure 5, but much greater sampling depth will be required to ascertain whether this is a general feature of the fly genome and to determine the causes of such events.
D. melanogaster is a human commensal species that is thought to have spread from Africa in the past 15,000 years (Powell 1996) and is conventionally regarded, largely on the basis of allozyme data, as panmictic throughout its range. Numerous recent studies have challenged the latter assumption: several populations from southern Africa show incipient reproductive isolation from the remainder of the species (Wu et al. 1995; Ting et al. 2001), microsatellite frequencies show that European samples are even more genetically restricted than North American ones (Caracristi and Schlötterer 2003), and examples of clines of genetic variation have been documented (Berry and Kreitman 1993; Verelli and Eanes 2001; Freydenberg et al. 2003). The two most common measures of population structure are differences in nucleotide diversity and FST statistics. Averaged across the entire 10.9 kb of sequence, the Kenyan sample is more diverse than either the Californian or the North Carolinian samples, due to both rare and common SNPs. Among alleles shared among all of the populations, Figure 6 provides strong evidence for allele frequency differences in several regions of the gene between African and American samples. There is no reason to suppose that these differences are not simply due to genetic drift. A more surprising finding is that 50 SNPs not seen in Africa are shared by the CA and NC samples, which implies an extensive period of isolation of the New World flies from their source population prior to their spread across North America. Our data both confirm that a southern African sample is more diverse than derived North American ones (Schlötterer and Harr 2002) and add another layer of evidence for population differentiation in the New World. Glinka et al. (2003) observed a similar differentiation between European and African flies in a broad survey of loci on the X chromosome and discussed the possible role of selection in the New World in producing the observed differentiation.
Population stratification can cause false-positive associations between nucleotide variants and phenotypes in population-based linkage disequilibrium mapping, and for this reason it is critical to establish whether there is global evidence for structure in this data set. The depth of sampling in the two North American data sets allows more accurate assessment of population differentiation along the locus. One important question is whether there is structure within a population. Our NC population was collected on a single afternoon from a series of peach bins. Since no two alleles are identical it is unlikely that a single male has contributed in a biased manner to the sample, but it is quite possible that demes congregate at collection points. We tested a random set of 15 SNPs from throughout the locus and found them to be evenly distributed between NC and CA using the Structure algorithm of Pritchard et al. (2000). However, Figure 6 also provides suggestive evidence for several SNPs having significantly different allele frequencies on either side of the continent. Further analysis in Table 3 shows that in all but one case the differentiation is restricted to a single site rather than a haplotype. None of these allele frequency differences is even nominally significant for comparisons of <30 alleles from each population (supplementary Table 5 at http://statgen.ncsu.edu/ggibson/SupplInfo/SupplInfo6.htm). It is unclear what could cause a change in frequency of at least 40% for a single site, without affecting closely linked sites, but if selection contributes, it must be very weak relative to the recombination rate, and it is possible that the high FST values are a sampling artifact.
Even so, if every locus throughout the fly genome harbors just a couple of SNPs that show allele frequency variation, there would be a total of at least 20,000 SNPs differentiating the geographic races, and a combined measure would provide an unambiguous geographic identifier. In this sense, flies are not so different from humans: well over 95% of the diversity is unstructured between geographic regions, but just a couple of percent of the sites may provide some differentiation at the DNA level (Aquadro et al. 2001; Rosenberg et al. 2002).
Insertion and deletion polymorphism:
As insertion and deletion polymorphism constitute only ∼10% of segregating polymorphisms in D. melanogaster, tests of deviation from neutral models are inevitably less powerful for this class of mutations. Schaeffer (2002) introduced a goodness-of-fit test for equal distribution of indels across segments of the Adh and Adh-related loci in D. pseudoobscura. The North American sample has 65 indel variants, of which more than half are <0.05 in frequency, with some evident constraint on indel size: 78% are <10 bp, compared with 56% anticipated by the mutation spectra for Drosophila (Petrov and Hartl 1998). The deletion events are not scattered randomly among the noncoding regions as determined by a goodness-of-fit test (P < 0.0001), while the insertions are marginally significant (P = 0.03). In particular, there is a deficit of large indels in 1900 bp upstream of exon 3 and in the 3′ UTR whereas larger indels reach high frequencies in the vicinity of exons 1 and 2 and in the three short introns (see supplementary Figure 3 at http://statgen.ncsu.edu/ggibson/SupplInfo/SupplInfo6.htm). Resampling demonstrates that this test does depend quite markedly on sample size, as shown in Table 4.
An even more compelling indication of purifying selection was observed for a complex microsatellite in promoter 2. The consensus sequence (CAA)2-9CA(GCA)1-3(ACC)0-1 is flanked by two 35-bp stretches that are nearly invariant in D. melanogaster and differ by only six SNPs when compared to D. pseudoobscura. The most variable repeat has a tendency toward bimodal distribution of length classes that is greatly enhanced after addition of the shorter trinucleotides (Figure 8 and supplementary Table 4 at http://statgen.ncsu.edu/ggibson/SupplInfo/SupplInfo6.htm). The distribution of summed repeat lengths deviates significantly from normal (Kolmogorov-Smirnov goodness-of-fit test, P < 0.01), being extremely platykurtic (kurtosis −0.41, P < 0.0001) and right skewed (0.91, P < 0.0001) as assessed by 1000 jackknife iterations. The fact that the two classes are separated by 12 bp, which corresponds to a turn in the helix, suggests the hypothesis that the length distribution may be molded by selective constraints on the orientation of the conserved flanking regions. These results should encourage a wider survey of length distributions of composite microsatellites and their relation to the functional dissection of evolutionarily conserved noncoding regions.
Evolution of Egfr:
The DER protein is among the least variable of all fly proteins. Disregarding alternate exon 1, there are just a handful of rare, and two common, replacement polymorphisms in >1400 amino acids of sequence. Three of the proteins that initiate the signaling cascade downstream of DER, namely DRK, RAS, and DSOR, are actually even less polymorphic and are identical in a sibling species (Gasperini and Gibson 1999; Riley et al. 2003), but DER is clearly subject to strong purifying selection. Aside from structural constraints, two possible explanations for functional constraint are high levels of pleiotropy (Waxman and Peck 1998) and occupancy of a critical control point in a biochemical pathway (Nijhout 2002; Olsen et al. 2002; Riley et al. 2003). Egfr exhibits both of these features, being involved in dozens of developmental processes and integrating intercellular signaling events. Given Fraser et al.'s (2002) demonstration of a relationship between sequence constraint and the complexity of protein-protein interactions in yeast, it may also be informative to conduct comparative phylogenetic analyses of the other seven receptor tyrosine kinase loci in the fly genome.
The high level of amino acid polymorphism in exon 1, including a derived second initiator codon, is quite surprising given that alternate exon 2 is just as constrained as the common portion of the protein. Most of the variation is in the presumptive signal peptide, so it does not alter the mature protein. The two exons are both used in most tissues, but exon 1 is also used in a subset of adult neurons (Lev et al. 1985; Schejter et al. 1986). Lesokhin et al. (1999) tested the effect of two of the exon 1 polymorphisms that happened to be present in an EgfrEllipse gain-of-function allele (W15R and L21W) and found them to be qualitatively neutral with respect to phosphorylation of DER targets in transgenic flies. Since sequence conservation is also much reduced in D. pseudoobscura and undetectable in the mosquito Anopheles gambiae, presumably precise function of exon 1 is less important to the organism than that of exon 2.
As remarked, the current sample size decreases the variance in estimators of linkage disequilibrium, molecular evolution, and population subdivision and may also influence analysis of the length distribution of indels and composite microsatellites. We regard the most striking aspect of the analysis, however, to be the subtle shifts in sequence diversity along the gene. Statistical measures associated with individual blocks of 1 or 2 kb differ only marginally if at all from expectations of neutral theory, but it is the diversity of patterns within such a small region of the genome that calls for explanation. As for the vast majority of the genome, there is nothing particularly noteworthy about diversity in the Egfr. Yet there are subtle signs that various modes of selection, drift within populations, uneven recombination rates, and migration have all helped shape the variation in this typical Drosophila gene.
A number of people were instrumental in helping with the sequence analysis, including Bryon Sosinski and Regina Brierley at the North Carolina State University Genome Research Laboratory and April Duty and Naruo Nikoh in our lab. We also thank Roland Carrillo for help with inbreeding the North Carolina lines, Sergey Nuzhdin for the Californian lines, Naruo Nikoh for help with analysis, and Ed Buckler for use of his Tassel software. A.P. was funded in part by awards from the American Scandinavian Foundation and North Atlantic Treaty Organization, and the project was funded by the National Institutes of Health (R01 GM61600).
- Received January 6, 2004.
- Accepted April 7, 2004.
- Genetics Society of America