## Abstract

The fitness landscape defines the relationship between genotypes and fitness in a given environment and underlies fundamental quantities such as the distribution of selection coefficients and the magnitude and type of epistasis. A better understanding of variation in landscape structure across species and environments is thus necessary to understand and predict how populations will adapt. An increasing number of experiments investigate the properties of fitness landscapes by identifying mutations, constructing genotypes with combinations of these mutations, and measuring the fitness of these genotypes. Yet these empirical landscapes represent a very small sample of the vast space of all possible genotypes, and this sample is often biased by the protocol used to identify mutations. Here we develop a rigorous statistical framework based on Approximate Bayesian Computation to address these concerns and use this flexible framework to fit a broad class of phenotypic fitness models (including Fisher’s model) to 26 empirical landscapes representing nine diverse biological systems. Despite uncertainty owing to the small size of most published empirical landscapes, the inferred landscapes have similar structure in similar biological systems. Surprisingly, goodness-of-fit tests reveal that this class of phenotypic models, which has been successful so far in interpreting experimental data, is plausible in only three of nine biological systems. More precisely, although Fisher’s model was able to explain several statistical properties of the landscapes—including the mean and SD of selection and epistasis coefficients—it was often unable to explain the full structure of fitness landscapes.

THE fitness landscape is defined by a set of genotypes, the mutational distance between them, and their associated fitness in a given environment (Wright 1931; Orr 2005). The structure of the fitness landscape determines the fitness effects of mutations and the interaction between mutations for fitness. These properties determine the pace of adaptation (Eyre-Walker and Keightley 2007), the predictability of evolution (Weinreich *et al.* 2006), the benefits of sexual reproduction (Kondrashov and Kondrashov 2001; de Visser *et al.* 2009), and the probability of speciation (Gavrilets 2004; Chevin *et al.* 2014). Thus, it is an important goal of evolutionary biology to characterize experimentally the properties of fitness landscapes across species and environments (de Visser and Krug 2014).

The most straightforward and popular experimental approach to access the properties of the fitness landscape consists of identifying mutations, constructing several genotypes that differ only by various combinations of these mutations, and measuring the fitness of these genotypes. This protocol allows reconstruction of what we call “empirical landscapes.” For example, several experiments identify a small number *L* of mutations and consider the fitness of 2^{L} genotypes with all possible combinations of these mutations. Early studies were primarily descriptive, with a focus on patterns of epistasis among mutations (Malcolm *et al.* 1990; de Visser *et al.* 1997; Whitlock and Bourguet 2000). In an influential study, Weinreich *et al.* (2006) studied the landscape between an ancestral strain of *Escherichia coli* and an evolved type with five mutations conferring high antibiotic resistance. They computed the number of paths up to the fitness maximum that could be followed by a population evolving by natural selection and showed that the ruggedness of the landscape implied that very few mutational paths could be used during biological evolution. This study suggested that the structure of fitness landscapes might severely constrain evolutionary trajectories, thus opening up the possibility that adaptation could be predicted to some extent. This finding has inspired the characterization of many other empirical landscapes [reviewed in Weinreich *et al.* (2013)].

In principle, empirical landscapes can be compared with predictions from theoretical fitness landscape models. For example, several studies fit specific models to empirical landscapes (Lunzer *et al.* 2005; Chou *et al.* 2011, 2014; Rokyta *et al.* 2011; Schenk *et al.* 2013). These models predict quantitatively the fitness values and epistasis coefficients and, as such, greatly improve our understanding of the form of epistasis that is typical of the particular system under study. However, the increasing number of empirical landscapes calls for a more general method to infer and compare the properties of fitness landscapes across species and environments. This possibility is very appealing and timely given that data accumulate on a diversity of empirical systems and selective environments, but it also raises several challenges.

The variability observed between empirical landscapes might be driven by biological differences of interest between organisms and environments of selection, but this variability is currently confounded with two other factors: stochastic variability due to sampling of a small number of mutations and variability in the protocol by which mutations are isolated. The full fitness landscape of a species in the environment of selection is defined as the fitness of all possible genotypes in that environment. This is an incredibly large space, scaling exponentially with the size of the genome. Most experiments explore a very small subset of the landscape because they examine at best a few dozen genotypes. Starting from the ancestral genotype, a single point in this large fitness landscape, the region of the fitness landscape that is explored depends on the particular mutations that were isolated. Thus, each empirical landscape results from a single realization of the stochastic sampling of a small number of mutations from a myriad of available mutations (Tenaillon *et al.* 2007; Salverda *et al.* 2010; Schenk *et al.* 2013; Szendro *et al.* 2013; de Visser and Krug 2014). In other words, a single constant underlying fitness landscape can give rise to a diversity of small genotypic landscapes depending on the mutations that are sampled (Blanquart *et al.* 2014). Moreover, the region of the underlying fitness landscape that is explored depends on the experimental protocol used to isolate mutations. For example, mutations are often obtained under protocols involving natural selection. While random mutations give more rugged empirical landscapes, mutations that have been sequentially selected in a single population give smoother empirical landscapes (Draghi and Plotkin 2013; Szendro *et al.* 2013; Blanquart *et al.* 2014).
Thus, inferring the properties of fitness landscapes from empirical data in meaningful ways requires (1) quantifying the uncertainty resulting from sampling of a limited number of mutations and (2) explicitly modeling how mutations were experimentally isolated.

In this study, we address these challenges and develop a statistical framework to infer the properties of the underlying fitness landscape from empirical landscapes. We use a broad class of phenotypic fitness landscape models that includes Fisher’s geometric model (Fisher 2000). Phenotypic fitness landscapes model how the genotype of an organism translates into a set of phenotypes, which themselves determine fitness. In other words, the very large space of all possible genotypes is projected onto a continuous phenotypic space of arbitrary dimensionality, and fitness depends only on the position in this phenotypic space. Fisher’s model, in particular, assumes that the phenotypes are under stabilizing selection toward a single optimum, that the effects of mutations in the phenotypic space are drawn from a multivariate Gaussian distribution, and that mutations combine additively in the phenotypic space. Phenotypes can be biological traits that need to be tuned to a precise level to maximize growth of the organism in the environment of selection, *e.g.*, the concentration of an enzyme in a metabolic pathway or the level of expression of a gene. Fisher’s model also can be viewed as an abstract statistical description of the genotype fitness map.

A number of reasons motivate the choice of Fisher’s model as the underlying fitness landscape. A phenotypic model solves the problem of high dimensionality of the genotypic space. Indeed, genotypic fitness landscape models such as the rough Mount Fuji model (Szendro *et al.* 2013) or the NK model (Kauffman and Levin 1987) require a number of parameters increasing linearly with the number of mutations or the number of genotypes. In contrast, a phenotypic model can describe an arbitrarily large number of genotypes using a small number of parameters. More fundamentally, it has been shown recently that Fisher’s model emerges from a set of “first principles” that specifies how fitness results from developmental integration of a large number of mutable traits (Martin 2014). Last, Fisher’s geometric model is simple yet can generate a diversity of empirical landscapes (Blanquart *et al.* 2014), and it successfully predicts experimental quantities, such as the distribution of epistasis coefficient between pairs of mutations (Martin and Lenormand 2006; Martin *et al.* 2007) and the dynamics of mean fitness over time (Perfeito *et al.* 2014).

This study focuses on the following questions: How much information on the structure of the underlying fitness landscape can be inferred from existing empirical landscapes? What properties of fitness landscapes can be inferred from empirical data available so far, and are underlying landscapes similar in similar species or environments? Is the structure of empirical landscapes compatible with a model assuming stabilizing selection on a set of underlying unknown phenotypes?

To answer these questions, we developed an inference framework that allows fitting Fisher’s model to a diversity of experimental data sets obtained under a range of protocols. Using this framework, we infer the parameters and quantify the goodness of fit of Fisher’s model on 26 published genotypic landscapes representing nine distinct biological systems. We infer the properties of the underlying fitness landscape of each data set while accounting for the protocol used to obtain the data, allowing a meaningful comparison of fitness landscapes across several species and environments. This survey reveals substantial differences in the shapes of underlying fitness landscapes across biological systems and environments of selection. We also show that Fisher’s model is able to fully account for the observed properties of genotypic landscapes in only three of nine biological systems.

## Materials and Methods

### Data set selection

We searched the literature for published empirical landscapes that include clearly identified sets of genotypes with combinations of two mutations or more together with their fitness. The way in which these mutations evolved or were obtained had to be sufficiently described such that we could reproduce it with simulations (see below). For selected mutations, we verified that the fitness measure reported is relevant to the environment in which the mutations evolved. We identified a total of 26 published data sets spanning nine independent biological systems meeting these criteria. In the following, we will identify the data sets representing these nine systems using the letters A–I (Table 1 and Supplemental Material, File S1). The data sets encompassed a diversity of species (including viruses, bacteria, fungi, and animals) and ecological scenarios (Table 1). Several experiments explored the fitness landscape of species in a laboratory environment using random mutations in the fungus *Aspergillus niger* (de Visser *et al.* 1997) (data sets A1 and A2), the fruit fly *Drosophila melanogaster* (data sets C1 and C2) (Whitlock and Bourguet 2000), and the budding yeast *Saccharomyces cerevisiae* (data sets B1–B10) (Costanzo *et al.* 2010). The latter data set is a large collection of 5596 deletion mutants. To reduce this large data set to a size that was amenable to our analysis, we drew 10 independent random subsets that each included 20 mutations, all single mutants, and 100 double mutants (all combinations of the first 10 mutations times the last 10 mutations, for a total of 121 possible genotypes but, in reality, 104 to 116 genotypes because some were missing).

Three data sets represented the fitness landscape of virus species adapting to their hosts (data sets D, E1, and E2) (Sanjuán *et al.* 2004; Rokyta *et al.* 2011). Two data sets represented landscapes of adaptation of microbial species to a novel environment, including a long-term selection experiment in a low-glucose environment (data set F) (Khan *et al.* 2011) and a selection experiment in a methanol environment (data set G) (Chou *et al.* 2011). Last, seven data sets represented empirical landscapes reconstructed from mutations that confer drug resistance. These included studies of mutations in the enzyme TEM-1 β-lactamase, which confer resistance to cefotaxime in bacteria (four data sets H1–H4) (Weinreich *et al.* 2006; Tan *et al.* 2011; Schenk *et al.* 2013), and studies of mutations in the dihydrofolate reductase gene, which confer pyrimethamine resistance (an antimalarial drug) in transgenic bacteria and yeast (three data sets I1–I3) (Lozovsky *et al.* 2009; Brown *et al.* 2010; Jiang *et al.* 2013).

### Data analysis

A number of fitness measures were reported in the published empirical landscapes we collected. Our analysis requires meaningful estimates of fitness value to model how selected mutations differ from random mutations.

Meaningful selection coefficients are expressed in units of log-fitness. They must be calculated either as s = log(W_{mut}/W_{anc}), where W_{mut} and W_{anc} are the multiplicative growth rates of the mutant and the ancestor (called “fitness” in most population genetics models), or as s = r_{mut} − r_{anc}, where r_{mut} and r_{anc} are exponential growth rates (Chevin 2011). Unfortunately, many studies only reported the ratio r_{mut}/r_{anc} (Table 1, data sets A, B, E, F, G, and I3), which in theory cannot be used to obtain a correct selection coefficient. To analyze the studies that only report r_{mut}/r_{anc}, we used log(r_{mut}/r_{anc}) as a log-fitness measure. This measure is approximately equal to (r_{mut} − r_{anc})/r_{anc} under weak selection, which is a quantity proportional to the selection coefficient. Moreover, this log-fitness measure, conveniently, does not depend on the unit of the growth rate and can be compared across landscapes.
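These conversions can be sketched as follows (a minimal illustration; the function names are ours, not from the original analysis):

```python
import math

def log_fitness_from_multiplicative(w_mut, w_anc):
    """Selection coefficient from multiplicative growth rates: s = log(w_mut / w_anc)."""
    return math.log(w_mut / w_anc)

def log_fitness_from_exponential(r_mut, r_anc):
    """Selection coefficient from exponential growth rates: s = r_mut - r_anc."""
    return r_mut - r_anc

def log_fitness_from_ratio(ratio):
    """Fallback when a study only reports the ratio r_mut / r_anc:
    log(r_mut / r_anc) ~ (r_mut - r_anc) / r_anc under weak selection,
    i.e., a quantity proportional to the selection coefficient."""
    return math.log(ratio)
```

Note that the last measure is unit-free, which is what allows comparison across landscapes.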

For drug-resistance fitness landscapes, only one data set reported a growth rate at a given drug concentration (data set I3, Table 1). Other studies reported the minimum inhibitory concentration (MIC) or a similar measure (Table 1, data sets H1–H2, I1, and I2). MIC, the concentration of drug above which the population cannot grow, is not easily related to fitness. For this reason, we present the results of MIC landscapes in Figure S2 (Weinreich *et al.* 2006; Tan *et al.* 2011; Schenk *et al.* 2013).

We proceeded to several additional steps of data cleaning. Three nonviable genotypes (fitness value of 0) were excluded from the analysis [one in a pyrimethamine landscape (I1) and two in a *Drosophila* landscape (C2)] because Fisher’s model cannot easily account for lethal mutations. In data set G, the order of fixation of coselected mutations was unknown. We assumed that mutations fixed from the largest-effect mutation to the smallest-effect mutation in accordance with the reported dynamics of mean fitness through time in the experiment. In data set I2, two mutations occurred at the same locus. We made this data set compatible with our framework (which assumes that each locus is diallelic) by excluding all genotypes bearing the third allele.

### Approximate Bayesian computation

Table 1 shows that a number of protocols were used to obtain empirical landscapes. Some of the empirical landscapes were formed of single and double mutants only, while others included all possible combinations of four or five mutations, thus including genotypes with three, four, or five mutations. Moreover, the way in which mutations were isolated also varied. Mutations were random, independently selected, or coselected. “Independently selected” means that the mutations emerged under the action of selection in separate populations evolving independently from a unique ancestral genotype. “Coselected” means that the mutations were selected sequentially in the same population. Modeling the way selection biased the resulting empirical landscape is already complicated. To make matters worse, several protocols included an additional step. These protocols were used to study the landscape of resistance to cefotaxime, a β-lactam antibiotic (landscapes H1–H4). Among a large set of 48 mutations found individually in cefotaxime-resistant natural isolates, three smaller subsets were studied in detail. These subsets were composed of the four mutations of smallest fitness effect, the four mutations of largest fitness effect (H3 and H4), and five mutations that together conferred a very high fitness (H1–H2). To account for this variety of protocols, we used a flexible approximate Bayesian computation (hereafter ABC) approach to infer from empirical data the parameters underlying Fisher’s geometric model.

#### Details of the ABC framework:

The original ABC rejection algorithm proceeds as follows: a large number of parameter sets are drawn from a prior distribution. For each parameter set θ, a data set D_{sim}(θ) is simulated, and a measure of distance d(D, D_{sim}(θ)) between the true data set D and each simulation is computed. A set of parameters θ is retained in the posterior distribution if the distance between *D* and D_{sim}(θ) is lower than a small value ε. In other words, the posterior distribution is composed of all the parameter sets θ such that d(D, D_{sim}(θ)) < ε. In practice, ε is chosen such that a given, small fraction of the prior parameter sets is retained in the posterior (Csilléry *et al.* 2012), but ABC will give the correct posterior distribution of parameters only in the limit where ε is close to zero.

The distance between the data set and a simulation is often defined based on a set of statistics. This set of statistics must be carefully chosen to be informative but of relatively low dimensionality. We conducted the analysis using either the full set of observed log-fitness values (16–121 fitness values) or a set of six summary statistics. The six summary statistics are as follows: (1) the mean coefficient of selection of all single mutants, (2) the mean epistasis coefficient between all pairs of mutations averaged over all genetic backgrounds, (3) the SD of selection coefficients, (4) the SD of epistasis, (5) the correlation between the epistasis coefficient and the background fitness (specifically, for each pair of mutations, we calculate the epistasis coefficient and the average fitness of the two genotypes with one of the mutations and compute the correlation between these two quantities across all pairs of mutations and all genetic backgrounds), and (6) the maximal fitness value (Table S1). The distance of each simulated data set to the experimental data set was

d = √(Σ_{i=1}^{N} ((S_{i} − S′_{i})/MAD_{i})^{2}),

where N is the number of statistics, S_{i} is the statistic *i* of the data, and S′_{i} is the simulated statistic *i*. Statistics are normalized by the median absolute deviation MAD_{i}, which is analogous to the SD but with medians instead of means. When statistics were the full set of fitness values, genotypes were uniquely identified by ordering mutations by their fitness effects.
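A minimal sketch of the rejection step with a MAD-normalized Euclidean distance might look as follows (the actual analysis used the R package “abc”; the toy example and all names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def mad(x):
    """Median absolute deviation: a robust analog of the SD."""
    return np.median(np.abs(x - np.median(x)))

def abc_rejection(observed_stats, prior_sampler, simulate_stats,
                  n_sims=10_000, tolerance=0.005):
    """Basic ABC rejection: keep the parameter sets whose simulated
    summary statistics fall closest to the observed statistics."""
    thetas = np.array([prior_sampler() for _ in range(n_sims)])
    stats = np.array([simulate_stats(t) for t in thetas])
    # Normalize each statistic by its MAD across simulations.
    scale = np.array([mad(stats[:, i]) for i in range(stats.shape[1])])
    d = np.sqrt((((stats - observed_stats) / scale) ** 2).sum(axis=1))
    n_keep = max(1, int(tolerance * n_sims))
    keep = np.argsort(d)[:n_keep]  # retain the closest simulations
    return thetas[keep]

# Toy check: recover the mean of a normal from its sample mean and SD.
obs = np.array([0.5, 1.0])  # observed (mean, SD)
post = abc_rejection(
    obs,
    prior_sampler=lambda: rng.uniform(-2, 2),
    simulate_stats=lambda mu: (lambda x: np.array([x.mean(), x.std()]))(
        rng.normal(mu, 1.0, size=100)),
)
```

With a tolerance of 0.005, the posterior here is the 50 retained parameter values, which concentrate around the true mean of 0.5.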

We detailed earlier the rejection algorithm, where the posterior is simply the fraction of parameters randomly drawn from the prior distribution that generate simulated landscapes closest to the data. For this algorithm, we used a tolerance (the fraction of retained simulations) of 0.005 (using the lower tolerance of 0.0005 did not improve accuracy, Figure S1). In addition to the rejection algorithm, we used a linear regression algorithm (Beaumont *et al.* 2002). In this method, the posterior parameters are corrected using a local linear regression of the parameter values onto the summary statistics, giving more weight to simulations closer to the data set. Last, we used a neural-network algorithm that adjusts the posterior distribution based on a nonlinear regression using neural networks (Blum and François 2010). The three methods are implemented in the R package “abc” (R Development Core Team 2010; Csilléry *et al.* 2012).

#### Details of the evolutionary simulations:

We simulated a large number of genotypic landscapes under Fisher’s model, seeding the simulation with parameters θ drawn from prior distributions (detailed later). The simulated landscapes were based on Fisher’s model, a phenotypic fitness landscape model whereby an organism evolves under stabilizing selection on *n* continuous phenotypic traits that together determine fitness. Each genotype is characterized by a phenotype vector **z** consisting of *n* traits, where *n* is the dimensionality of the phenotypic space. The parameter *n* defines the number of phenotypes under selection, or “complexity,” for an organism evolving in a given environment (Tenaillon *et al.* 2007; Lourenço *et al.* 2011; Chevin *et al.* 2014). The effects of mutations are assumed to be additive in the phenotypic space. For example, if we consider five mutations at five distinct loci of the genome, the genotype 00101, where the series of 0s and 1s denotes the absence or presence of mutations at each of five loci (relative to an ancestral strain with genotype 00000), has phenotype **z** = **z**_{anc} + Δ**z**_{3} + Δ**z**_{5}, where **z**_{anc} is the phenotype vector of the ancestral strain, and Δ**z**_{3} and Δ**z**_{5} are the phenotypic effects of the mutations at loci 3 and 5. The effects of mutations on phenotypes (the vectors Δ**z**_{i}) are drawn from a multivariate normal distribution with mean **0** and variance-covariance matrix σ^{2}**I**_{n}, where σ is the size of mutations and **I**_{n} is the identity matrix. Thus, each mutation jointly affects all phenotypes (assumption of full pleiotropy). The mapping of phenotype to fitness is defined by m(**z**) = m_{max} − ||**z**||^{Q} + e, where m_{max} is the maximal fitness, which determines the distance to the optimum of the ancestral strain, ||**z**|| is the Euclidean norm of the phenotype vector, and *e* is the experimental error on fitness measurements. Following Wilke and Adami (2001) and others (Tenaillon *et al.* 2007; Gros *et al.* 2009), we extended Fisher’s geometric model with the parameter *Q*, which quantifies how flat the peak is at the optimum (Figure 1).
Fisher’s model *sensu stricto* is the special case where Q = 2, *i.e.*, the fitness function is Gaussian. Our definition of fitness implies that the ancestral strain had log-fitness 0, corresponding to a phenotype **z**_{anc} at distance m_{max}^{1/Q} from the optimum. This normalization was done without loss of generality. Maximum fitness m_{max}, which is the height of the fitness peak in the environment where fitness is measured, was achieved when all phenotypes are at their optimal value, chosen here to be **z** = **0** without loss of generality. Lastly, *e* is the measurement error in the log-fitness measure and was assumed to be normally distributed with mean 0 and SD estimated from the empirical data (File S1). Figure 1 shows several examples of a single empirical genotypic landscape generated by sampling a small number of mutations in the underlying landscape.
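The landscape simulation described above can be sketched as follows (a simplified illustration assuming random mutations, all 2^{L} combinations, and no measurement error by default; function and parameter names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_genotypic_landscape(n, sigma, m_max, Q, L, error_sd=0.0):
    """Sketch of an empirical landscape drawn from Fisher's model:
    L random mutations, all 2^L combinations, with log-fitness
    m(z) = m_max - ||z||^Q + e and the ancestor at log-fitness 0."""
    # Ancestral phenotype: any point at distance m_max^(1/Q) from the optimum (the origin).
    z_anc = np.zeros(n)
    z_anc[0] = m_max ** (1.0 / Q)
    # Phenotypic effects of mutations: multivariate normal, covariance sigma^2 * I_n.
    dz = rng.normal(0.0, sigma, size=(L, n))
    landscape = {}
    for g in range(2 ** L):
        present = tuple((g >> i) & 1 for i in range(L))
        # Mutations combine additively in phenotype space.
        z = z_anc + sum(p * dz[i] for i, p in enumerate(present))
        m = m_max - np.linalg.norm(z) ** Q + rng.normal(0.0, error_sd)
        landscape[present] = m
    return landscape

land = simulate_genotypic_landscape(n=3, sigma=0.2, m_max=2.0, Q=2.0, L=5)
```

By construction, the ancestral genotype (all 0s) has log-fitness exactly 0 and no genotype can exceed m_{max}.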

For each set of parameters θ, we simulated the process by which mutations were isolated and generated a genotypic landscape. In practice, the sets of genotypes were of two broad categories: either four to five mutations were isolated and genotypes bearing all possible combinations of those mutations (2^{4} or 2^{5}) were constructed, or a larger number of mutations (seven to nine) were isolated and single and double mutants were constructed. Mutations were considered to be random, independently selected, or coselected. For random mutations, simulations consisted of drawing the phenotypic effects of mutations from the multivariate normal distribution, combining those mutations additively, and computing fitness using our phenotype-to-fitness mapping. When mutations were isolated in an experiment involving selection, we assumed that adaptation proceeded by successive invasion of beneficial mutations without clonal interference. This allowed us to conduct fast simulations based on the strong-selection, weak-mutation (SSWM) approximation (Kimura 1983; Gillespie 1991), making it possible to conduct the large number of simulations required by ABC. Under the SSWM regime, a selected mutation is drawn among the pool of random beneficial mutations, with each mutation weighted by its fitness effect *s*. This derives from the fact that the probability of fixation of a beneficial mutation scales linearly with its fitness effect *s* in this regime (Patwa and Wahl 2008). Fitness effects were calculated relative to the ancestor for independently selected mutations and relative to the genetic background with previously evolved mutations for coselected mutations. For the protocol where five mutations that together confer a large fitness effect are isolated (H1–H2), we chose the set of five mutations that confers the highest fitness among 1000 random combinations.
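The SSWM sampling step can be sketched as follows (an illustration with our own function name; a beneficial mutation is drawn with probability proportional to its fitness effect *s*, and deleterious or neutral mutations cannot be selected):

```python
import numpy as np

rng = np.random.default_rng(1)

def sswm_pick(fitness_effects):
    """Under strong-selection, weak-mutation, pick one mutation with
    probability proportional to its selection coefficient s, because the
    fixation probability of a beneficial mutation scales linearly with s."""
    s = np.asarray(fitness_effects, dtype=float)
    w = np.where(s > 0, s, 0.0)  # deleterious/neutral mutations never fix
    if w.sum() == 0.0:
        return None  # no beneficial mutation available: adaptation stalls
    return rng.choice(len(s), p=w / w.sum())
```

For coselected mutations this draw would be repeated, recomputing each *s* on the current genetic background after every fixation.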

For each empirical landscape, 10^{6} genotypic landscapes were generated using 10^{6} parameter sets drawn from prior distributions. Priors were chosen to be uninformative and to ensure that they could generate a diversity of fitness landscapes (Figure 1). The height of the peak in log-fitness, m_{max}, was drawn from an exponential distribution with mean 2. Maximum fitness on a log scale ranged from 3.7 × 10^{−7} to 29 (2.5–97.5% quantile 0.05–7.4). The complexity of the phenotypic space, *i.e.*, the number of phenotypic dimensions under selection, was given by n = ⌊η⌋ + 1, where ⌊·⌋ denotes the floor function, and η was drawn from an exponential distribution with mean 5. It ranged from 1 to 75 (2.5–97.5% quantile 1–7). We used an exponential prior for complexity because, under Fisher’s model with full pleiotropy, the distribution of fitness effects had unrealistically small variance at high complexity. The size of mutations σ in the phenotypic space was drawn from an exponential distribution with mean 0.2. It ranged from 1.7 × 10^{−7} to 2.6 (2.5–97.5% quantile 0.005–0.74). The choice of an exponential distribution was motivated by the fact that variations in fitness are modest in many of the data sets, and therefore, mutational effects are probably small. The shape of the peak *Q* was drawn from a uniform distribution (Figure 1).
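These prior draws can be sketched as follows (the bounds on the uniform prior for *Q* are not stated above, so the values used here are placeholders, as is the function name):

```python
import numpy as np

rng = np.random.default_rng(7)

def draw_prior():
    """One parameter set from the priors described above. The bounds on Q
    are placeholders: the text only states that Q is uniform."""
    m_max = rng.exponential(2.0)          # height of the peak (log-fitness)
    eta = rng.exponential(5.0)
    n = int(np.floor(eta)) + 1            # phenotypic complexity, n >= 1
    sigma = rng.exponential(0.2)          # mutation size in phenotype space
    Q = rng.uniform(0.5, 6.0)             # shape of the peak (assumed bounds)
    return m_max, n, sigma, Q
```

Each of the 10^{6} simulated landscapes would be seeded with one such parameter set.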

### Cross-validation

We checked the accuracy of inference from empirical landscapes using simulated pseudo–data sets generated under Fisher’s model. We performed cross-validation using pseudo–data sets generated under Fisher’s model for each type of experimental protocol (Figure 2 and Table 2). We applied the ABC algorithm to each data set and compared the posterior distribution of parameters to the true (known) parameters. We computed the prediction error, defined for each parameter θ as

E_{θ} = (1/M) Σ_{i=1}^{M} (θ̃_{i} − θ_{i})^{2} / V_{θ},

where θ_{i} is the true value of the parameter used for the *i*th simulated pseudo–data set, θ̃_{i} is the median of the posterior distribution, M is the number of pseudo–data sets, and V_{θ} is the variance of the prior distribution. The expected prediction error is 0 when inference is perfect (the median always matches the true parameter) and 1 when no inference can be made (the posterior parameters are drawn at random from the prior). For cross-validation, we assumed that experimental errors were 0 in order to compare the accuracy of inference across protocols in an ideal case where fitness values are perfectly known.
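The prediction error can be computed as follows (a sketch with our own function name):

```python
import numpy as np

def prediction_error(true_values, posterior_medians, prior_variance):
    """Prediction error: mean squared difference between the posterior
    median and the true parameter, scaled by the prior variance.
    0 = perfect inference; ~1 = no better than using the prior."""
    t = np.asarray(true_values, dtype=float)
    m = np.asarray(posterior_medians, dtype=float)
    return float(np.mean((m - t) ** 2) / prior_variance)
```

The scaling by the prior variance is what makes errors comparable across parameters with very different priors.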

### Posterior predictive checks

We next tested whether the empirical landscapes we analyzed were compatible with the hypothesis that Fisher’s landscape was the true model for the empirical data. We used posterior predictive checking (Gelman *et al.* 2014) to quantify the goodness of fit of Fisher’s model to each data set. For each experimental data set, we ran the ABC algorithm on 1000 random pseudo–data sets generated using parameters drawn from the joint posterior distribution of parameters. For each of these pseudo–data sets, we recorded the median distance between the pseudodata and the accepted (closest) simulated data in the ABC algorithm. This resulted in a null distribution for the median distance of the simulations retained in the ABC algorithm, which is the distribution of distance between simulations and data when Fisher’s model is truly underlying the data. We then used this distribution to compute a Bayesian *P*-value, also known as a “posterior predictive *P*-value” in Bayesian model checking (Gelman *et al.* 2014). This *P*-value is the probability that median distances for pseudo–data sets generated under Fisher’s model are greater than the median distance of the experimental data set. A low *P*-value suggests that the data are farther apart from Fisher’s model simulations than expected if the data followed Fisher’s model. A *P*-value was computed for the distances based on summary statistics and for the distances based on all fitness values. For the latter, we also decomposed the distance and computed an analogous *P*-value for each individual genotype to identify genotypes with fitness values that are particularly unlikely under Fisher’s model (those whose individual *P*-value is lower than 0.05).
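The Bayesian *P*-value itself reduces to a tail probability of the null distribution of median distances (a sketch; names are ours):

```python
import numpy as np

def posterior_predictive_p(observed_median_distance, null_median_distances):
    """Posterior predictive P-value: the fraction of pseudo-data sets
    (generated under the fitted model) whose median ABC distance exceeds
    that of the experimental data. A low value flags poor fit."""
    null = np.asarray(null_median_distances, dtype=float)
    return float(np.mean(null > observed_median_distance))
```

In the analysis above, `null_median_distances` would hold the 1000 median distances recorded for the pseudo–data sets drawn from the joint posterior.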

### Data availability

The authors state that all data necessary for confirming the conclusions presented in this article are represented fully within the article.

## Results

### Cross-validation and accuracy of parameter inference

We quantified the accuracy of inference from empirical landscapes using 500 simulated pseudo–data sets generated under Fisher’s model. This analysis revealed that the true parameters of the underlying landscape are generally inferred with mediocre accuracy under most protocols used in existing studies (Figure 2 and Table 2). Inference based on summary statistics (Table 2) always yielded lower error than inference based on all fitness values (Table S2). Using summary statistics makes the ABC algorithm more accurate because it alleviates the “curse of dimensionality”: the distance of the data to the accepted simulations is closer to 0 for the same number of simulations such that the main assumption of ABC is better respected. However, the use of summary statistics causes loss of information (Sünnaker *et al.* 2013). Here the gain in accuracy more than offsets the loss of information, making inference based on summary statistics better.

ABC is an approximate method, and we cannot totally rule out that the low accuracy was due to these approximations. However, low accuracy also may be caused by the limited information contained in small genotypic landscapes. In other words, even if the inference method were perfect, the true posterior distribution of parameters still might be quite wide and cause low accuracy. Because we explored a number of variations on the ABC algorithm (three different algorithms, full statistics *vs.* summary statistics, and several values of tolerance; Figure S1) and the accuracy of inference was always relatively low, we hypothesize that the main reason behind the low accuracy is probably the limited information contained in genotypic landscapes. Each empirical landscape conveys rather modest information on the underlying “true” fitness landscape.

In particular, empirical landscapes conveyed almost no information on the number *n* of phenotypes under selection. Prediction errors for this parameter were always higher than 0.5 and often close to 1. The size of mutations σ, the height of peak , and the shape of fitness peak *Q* were inferred with more accuracy. For all parameters, the regression and neural-network algorithms improved the accuracy of inference relative to the rejection algorithm, and the neural-network algorithm was most often the best (Table 2; compare the “rej,” “reg,” and “nn” columns for each parameter).

With the summary statistics we chose, the number of mutations that were combined together did not affect the quality of inference much. The experimental design with 32 genotypes made of all combinations of five mutations performed similarly to the one made of eight mutations and single and double mutants only (28 genotypes) (Table 2). The design where 20 random mutations were chosen (landscapes B1–B10) did not perform particularly better than the one with eight mutations and all single and double mutants (29 genotypes in total).

The protocol used to isolate mutations was of critical importance to the quality of inference (Figure 2 and Table 2). Generally, selected mutations allowed the most accurate inference (compare the “random,” “independently selected,” and “coselected” lines in Table 2 for a given experimental design). In these simulations, the protocol where the four largest mutations were isolated among 48 independently selected mutations performed best and allowed fairly precise inference of the size of mutations (error = 0.145), height of the peak (error = 0.068), and shape of the peak (error = 0.045) under the neural-network algorithm. The protocols that performed best for inferring the height and shape of the fitness peak were those allowing a better exploration of the underlying fitness landscape around the fitness optimum. Independently selected mutations, and particularly large-effect mutations, create genotypes that are more likely to lie around the fitness peak, especially when genotypes with more than two mutations are included. In contrast, genotypes constructed with random mutations do not always approach the fitness peak and may be confined to relatively linear and uninformative zones of the underlying fitness landscape.

### Parameter inference in experimental data sets

We obtained the posterior distribution of fitness landscape parameters for the 26 data sets. We used the ABC protocol based on summary statistics and the neural-network algorithm, which was shown to work best (Table 2). Note that the neural-network algorithm, on rare occasions, resulted in parameter estimates with biologically meaningless values, *e.g.*, negative values of dimensionality or maximal fitness. This is a known problem (Sünnaker *et al.* 2013) that occurs when none of the simulated summary statistics are very close to the data, such that the neural-network regression extrapolates and yields posterior values outside the range of the prior. Results were similar, but with wider posterior distributions, when inference was based on the full set of fitness values and/or the rejection algorithm.

First, as expected from cross-validation, the posterior distributions were broader for parameters describing dimensionality and shape of the peak (Figure 3 and Table 3). Each empirical landscape could have been generated under a diversity of underlying fitness landscapes. Despite the uncertainty in parameters, different biological systems exhibited different types of fitness landscapes (Figure 3).

Three of the experimental systems that were represented by several nonindependent empirical landscapes yielded similar posterior distributions across these “replicated” landscapes. This demonstrates the robustness of the ABC method to slight variation in the set of mutations, to variation in the fitness measure, and to experimental error. For *A. niger* (Figure 3, first row), two empirical landscapes, A1 and A2, were constructed using two partially overlapping sets of mutations (de Visser *et al.* 1997). For *D. melanogaster* (Figure 3, first row), the two landscapes, C1 and C2, corresponded to two correlated fitness measures, “productivity” (a measure of lifetime reproductive success) and “mating success” (Whitlock and Bourguet 2000). The posterior distributions of these two landscapes overlapped, had the same covariance structure, and had similar posterior medians. H1 and H2, two cefotaxime-resistance landscapes composed of the same mutations but with replicate MIC measurements, also had similar posterior distributions of parameters (Table 3).

Remarkably, independent empirical landscapes representing the same biological system had a similar posterior distribution of parameters. The 10 independent empirical landscapes extracted from the large yeast gene deletion data set B1–B10 (Costanzo *et al.* 2010) gave similar posterior distributions characterized in particular by mutations of small effect and a low maximal fitness. The two empirical landscapes of vesicular stomatitis virus, E1 and E2, had extremely similar posterior distributions of parameters, although they had very different statistical properties (Table S1). Different statistical properties arise because of differences in protocol: E1 is composed of independently selected mutations, while E2 is composed of random mutations. The fact that we recover similar underlying landscapes for E1 and E2 illustrates the ability of our method to correct for variation due to protocol.

Lastly, underlying landscapes were similar when using independent empirical landscapes obtained in similar biological systems, as revealed by comparison of the two empirical landscapes of virus on their host (landscapes D and E) and of the two landscapes of bacteria adapting to a novel environment (landscapes F and G) (Figure 3, third row). In each biological system, the two landscapes represented independent experiments, yet posteriors were similar in their marginal distributions and bivariate correlation structure, revealing similar underlying fitness landscapes. The landscape of resistance to pyrimethamine (I3) was also quite distinct, with large-effect mutations, large maximal fitness, and a flat peak (Figure 3, fourth row).

### Posterior predictive checks: are experimental landscapes compatible with Fisher’s model?

We tested whether the empirical landscapes we analyzed were compatible with the hypothesis that Fisher’s landscape was the underlying model for the empirical data. An informal test consisted of resimulating landscapes using the posterior distribution of parameters and examining how close these resimulated landscapes were to the data. We verified that resimulated landscapes are indeed close to the pseudodata in the cross-validation, *i.e.*, when the true model was Fisher’s model (Figure 4, left panels). For real data, in contrast, the resimulated fitness was close to the true fitness for some but not all landscapes (Figure 4, center panels). More formally, we computed a *P*-value expressing the probability of obtaining a distance between observed and simulated data sets as large as the one observed if the data followed Fisher’s model, as described under *Materials and Methods* (Figure 4, right panels). We computed this *P*-value both for the distance based on the full set of fitness values and for the distance based on summary statistics. The test of rejection based on summary statistics asks whether Fisher’s model can reproduce the average statistical properties of the landscapes. The test of rejection based on the full set of fitness values asks whether Fisher’s model can reproduce the data as a whole, including specific relationships between genotypes and fitness values not captured by the summary statistics. Because it conserves all the information in the landscape, the test based on the full set of fitness values is a stronger test of the adequacy of Fisher’s model and will reject it more often than the test based on summary statistics.
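A generic posterior predictive *P*-value of this kind can be sketched as below. This is our hedged reconstruction, not the exact procedure of the *Materials and Methods*: it compares the data-to-simulation distance with the simulation-to-simulation distance under draws from the posterior.

```python
import numpy as np

def posterior_predictive_pvalue(observed, posterior_draws, simulate,
                                distance, n_rep=500, rng=None):
    """For each replicate, draw parameters from the posterior and simulate
    two data sets; count how often the observed data are no farther from
    one simulation than the other simulation is. A small value means the
    data sit outside the cloud of data sets the fitted model produces."""
    rng = rng or np.random.default_rng()
    hits = 0
    for _ in range(n_rep):
        theta = posterior_draws[rng.integers(len(posterior_draws))]
        sim_a, sim_b = simulate(theta), simulate(theta)
        if distance(observed, sim_a) <= distance(sim_b, sim_a):
            hits += 1
    return hits / n_rep
```

Running this once with a distance on summary statistics and once with a distance on the full vector of fitness values yields the two tests contrasted in the text.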

Fisher’s model reproduced the overall statistical properties of all empirical landscapes, but in six of nine cases it could not reproduce the full structure of empirical landscapes (Table 3). The *P*-values based on summary statistics were almost always >0.05 (Table 3) (except for MIC landscapes H1–H4 and I1–I2; Figure S2). This indicates that the statistical properties of fitness landscapes described by the six summary statistics (mean and variance of epistasis and selection coefficients, correlation between epistasis and background fitness, and maximal fitness) could be reproduced by Fisher’s model. However, Fisher’s model was unable to fully explain the structure of six of nine fitness landscapes (landscapes B, C, E, F, H, and I, with *P <* 0.05 in Table 3). We did not identify a single reason why Fisher’s model was rejected, but rejection was often related to mutations with strong negative or positive epistasis (Figure 5). Fisher’s model could fully reproduce only the landscapes of *A. niger* (landscapes A1 and A2), of a bacteriophage adapting to its host (landscape D), and of bacteria adapting to a methanol environment (landscape G) (Table 3). In one of the landscapes compatible with Fisher’s model, the four beneficial mutations interacted almost additively (Figure 5, landscape G), but a very different landscape that includes beneficial and deleterious mutations and substantial sign epistasis among them was also compatible with Fisher’s model (Figure 5, landscape A1). In contrast, landscape C1, which looks superficially similar to landscape A1, rejected Fisher’s model. Landscape F also rejected Fisher’s model, one reason being that the third mutation had very strong positive epistasis with the first mutation. The landscape of pyrimethamine resistance (landscape I3) rejected Fisher’s model because of two cases of strong reciprocal sign epistasis.
Thus, although Fisher’s model appears valuable to predict statistical properties of landscapes, in a number of cases it could not explain more detailed properties of experimental landscapes, such as mutations presenting large positive or negative epistasis.
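For concreteness, the six summary statistics named above can be computed from a genotype-to-fitness map along these lines. The log-scale definitions of selection and epistasis coefficients, and the choice of single-mutant fitness as the “background,” are our assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def summary_statistics(landscape):
    """Six summary statistics from a genotype -> fitness map (genotypes are
    0/1 tuples): mean/variance of selection coefficients, mean/variance of
    pairwise epistasis, correlation of epistasis with background fitness,
    and maximal fitness. Coefficients are defined on the log scale."""
    n = len(next(iter(landscape)))
    wt = (0,) * n
    logf = {g: np.log(f) for g, f in landscape.items()}

    def single(i):                    # genotype carrying only mutation i
        return tuple(int(k == i) for k in range(n))

    def double(i, j):                 # genotype carrying mutations i and j
        return tuple(int(k in (i, j)) for k in range(n))

    s = [logf[single(i)] - logf[wt] for i in range(n)]
    pairs = [(i, j) for i, j in combinations(range(n), 2) if double(i, j) in logf]
    e = [logf[double(i, j)] - logf[single(i)] - logf[single(j)] + logf[wt]
         for i, j in pairs]
    bg = [logf[single(i)] for i, j in pairs]   # background = first single mutant
    corr = (float(np.corrcoef(e, bg)[0, 1])
            if len(e) > 1 and np.std(e) > 0 and np.std(bg) > 0 else 0.0)
    return {"mean_s": float(np.mean(s)), "var_s": float(np.var(s)),
            "mean_e": float(np.mean(e)), "var_e": float(np.var(e)),
            "corr_e_background": corr,
            "max_fitness": float(max(landscape.values()))}
```

On a multiplicative (log-additive) landscape all epistasis terms vanish, which is why these statistics alone can miss the specific genotype-level structure that the full-fitness test detects.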

In summary, our framework revealed biological differences between the underlying fitness landscapes of 26 experimental landscapes representing nine independent systems. Fisher’s model was generally able to reproduce the statistical properties of empirical landscapes but not their full structure. In particular, only three of nine biological systems (A, D, and G), spanning both very smooth, additive landscapes and more rugged ones, had a structure that Fisher’s model could reproduce.

## Discussion

Our understanding of the structure of fitness landscapes has greatly improved, in particular, thanks to experiments that identify mutations and systematically measure the fitness of a set of genotypes bearing combinations of these mutations. Yet the generality of insights drawn from these empirical landscapes has been questioned recently (Schenk *et al.* 2013; Szendro *et al.* 2013; Blanquart *et al.* 2014). The properties of empirical landscapes are heavily dependent on the particular mutations that are sampled (a small number, among a myriad of available mutations) and on the protocol used to identify mutations. We developed a novel framework based on approximate Bayesian computation to address these challenges and unravel the properties of the underlying fitness landscapes. More precisely, we inferred the underlying fitness landscape, parameterized with Fisher’s model, while accounting for the effects of the protocol on the empirical landscapes and quantifying the uncertainty due to sampling of a limited number of mutations. We used this statistical approach to conduct a survey of fitness landscapes across various species and ecological contexts.

### Summary of the results

Empirical landscapes, because they are composed of a small number of mutations, generally conveyed limited information on the underlying fitness landscape. This lack of information is manifest in wide posterior distributions and a low accuracy of inference. In other words, quite different underlying fitness landscapes may generate similar empirical landscapes. This relates to a previous study where we showed, conversely, that the same underlying landscape results in a variety of empirical landscapes when multiple sets of mutations are sampled (Blanquart *et al.* 2014). Because empirical landscapes are built with a small and often biased sample of mutations from the underlying fitness landscape, any extrapolation of the global properties of the fitness landscape from measurements on small empirical landscapes should be made with extreme caution.

While the size of mutations, the height of the peak (maximal fitness), and the shape of the peak were well inferred under some protocols, the number of dimensions under selection was not inferred with accuracy. Importantly, mutations independently selected in several replicates conveyed the most information on the underlying fitness landscape because they allowed an exploration of the most informative regions of the underlying landscapes. With a protocol that included as few as four mutations and all 16 possible genotypes carrying these four mutations, the size of mutations and the height and shape of the peak were well inferred (Table 2).

Fisher’s model did not accurately reproduce empirical landscapes in six of nine biological systems tested. The conceptual simplicity of Fisher’s model and its capacity, so far, to reproduce several experimental observations have made it a popular model to interpret experimental data and generate theoretical predictions (Tenaillon 2014). Fisher’s model has been used successfully before to predict the distribution of epistasis coefficients (Martin *et al.* 2007). Fisher’s model also generates sign epistasis by optimum overshooting when the ancestral strain is close to the optimum or by pleiotropic effects when two mutations have small positive fitness effects (Blanquart *et al.* 2014). We suggest here that although Fisher’s model is able to reproduce several statistical properties of fitness landscapes, it cannot account for their full structure in many cases. This leads to rejection of Fisher’s model even with data sets of modest size. Fisher’s model could not explain (1) sign epistasis far from the optimum (landscapes A1 and I3 in Figure 5), (2) large negative or positive epistasis (landscapes C1 and F in Figure 5), and (3) the large variance in selection coefficients and double-mutant fitness (landscapes B and E in Figure 5). It will be interesting to see whether these patterns can be explained by alternative phenotypic models that allow for some asymmetry around fitness peaks, restricted pleiotropy (mutations affect only a subset of the phenotypes), or anisotropy (mutations do not affect all traits to the same extent).
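The model variants mentioned above differ only in how mutational displacements are drawn. A hypothetical sketch of such a mutation kernel with restricted pleiotropy and anisotropy (the parameter names `k` and `scales` are ours):

```python
import numpy as np

def mutation_effects(n_mut, n, sigma=0.1, k=None, scales=None, rng=None):
    """Draw phenotypic effects of n_mut mutations in n dimensions.
    scales != None gives anisotropy (per-trait mutation sizes);
    k != None gives restricted pleiotropy (each mutation touches only
    k of the n traits)."""
    rng = rng or np.random.default_rng()
    scales = np.full(n, sigma) if scales is None else np.asarray(scales)
    dz = rng.normal(0.0, 1.0, size=(n_mut, n)) * scales
    if k is not None:
        for row in dz:
            off = rng.choice(n, size=n - k, replace=False)
            row[off] = 0.0            # traits this mutation does not affect
    return dz
```

Plugging such a kernel into the landscape simulation in place of the isotropic Gaussian would be one way to test whether asymmetric or restricted-pleiotropy models fit the rejected landscapes better.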

### Relationship with previous studies

To our knowledge, only three studies so far have attempted to compare properties of empirical landscapes across species. Szendro *et al.* (2013) quantified ruggedness for 10 experimental landscapes using a set of summary statistics. They showed that experimental levels of ruggedness are similar to those obtained with simulations of simple landscapes made of an additive component and random noise (rough Mount Fuji landscapes). They noticed the strong effect of the experimental protocol on the experimental landscape and in particular that coselected mutations tend to produce smoother empirical landscapes. However, their framework did not allow disentangling sampling variation resulting from protocol from variation owing to genuine biological differences between systems. Weinreich *et al.* (2013) analyzed 14 empirical landscapes, defined higher-order epistasis coefficients, and showed that these coefficients make an important contribution to fitness in all experimental landscapes. Lastly, Weinreich and Knies (2013) fitted Fisher’s model to seven published data sets using an elegant geometric interpretation of the relationship between the epistasis and selection coefficients. They found that Fisher’s model fits the data poorly. However, it is not clear whether this was due to the data itself or to the very strong assumptions on which the analytical approach was based: the ancestral strain was always assumed to be perfectly adapted because it was set at the fitness optimum, and all mutations were considered random, so the biasing effects of selection were not accounted for.

Some of the landscapes analyzed here have been analyzed previously with Fisher’s model or similar phenotypic landscapes. Martin *et al.* (2007) inferred the parameters of Fisher’s model from the distribution of selection and epistasis coefficients in an RNA virus (our data set E) (Sanjuán *et al.* 2004). They found that the distribution of epistasis coefficients is approximately normal, with a variance twice that of the distribution of selection coefficients, in agreement with theoretical predictions from Fisher’s model when the ancestral strain is close to the optimum (Blanquart *et al.* 2014). Accordingly, we found that the statistical properties of this landscape could be reproduced by Fisher’s model, but not its full structure. Lastly, the yeast deletion data set (B1–B10) also rejected Fisher’s model, as reported previously using a different analysis (Velenich and Gore 2013).

Several studies have attempted to fit phenotypic landscapes to data (Chou *et al.* 2011; Rokyta *et al.* 2011; Schenk *et al.* 2013). In those studies, the underlying phenotypic effects of mutations are treated as parameters that are explicitly estimated, and the mapping from phenotypes to fitness is defined by a function (*e.g.*, a Gaussian or a gamma function). This makes it easier to derive the likelihood but prevents the use of multivariate landscapes, which would require a number of parameters proportional to the number of dimensions. Explicitly estimating the phenotypes of individual mutations gives interesting insights into the system when the underlying phenotypes are biologically meaningful and sometimes even measurable. It is also useful if one wants to predict the fitness of combinations of mutations not present in the data. However, it requires many parameters even for a simple univariate phenotypic landscape: for example, in data set D (Rokyta *et al.* 2011), a univariate-gamma landscape includes 14 parameters, while Fisher’s model has only two, and both models perform similarly in terms of Akaike’s information criterion (AIC). Fisher’s model is a useful heuristic to make predictions about the statistical properties of fitness landscapes, but the precise value of the underlying phenotypes is less interesting in such an abstract model.

### Current challenges in the analysis of genotypic fitness landscapes

In this study, we address a number of challenges to fit Fisher’s model to a diversity of experimental landscapes. But several other challenges remain to improve our understanding of fitness landscapes across species and environments:

*Modeling the effects of the protocol on the experimental fitness landscape to infer properly the underlying fitness landscape*. Here selection was modeled using the SSWM approximation, which is valid when adaptation proceeds by successive invasions of rare beneficial mutants. This approximation was necessary to enable the fast simulations required by the ABC approach. However, in some situations of interest in experimental evolution, multiple beneficial mutations compete simultaneously in the population (clonal interference); under this regime, beneficial mutations of larger effect tend to invade the population (Nagel *et al.* 2012). Clonal interference may be important in particular in experimental evolution (landscapes D–G). The fitness values reported also need to be ecologically relevant in the sense that they can be used to predict the fate of new mutations competing with the ancestral strain. Exponential growth rates, as reported in many studies, fulfill this condition. But other fitness measures are more dubious. For example, in drug-resistance landscapes, the fitness measure is commonly the MIC. We showed in the example of pyrimethamine resistance that the fitness landscape was quite different when a more appropriate fitness measure, the growth rate at a given drug concentration, was used. This calls for caution when analyzing MIC landscapes from an evolutionary perspective.

*Fitting larger empirical landscapes*. Empirical landscapes contain little information on the parameters of their respective underlying landscape. Larger data sets (Costanzo *et al.* 2010; Hietpas *et al.* 2011; Bank *et al.* 2015) may allow more accurate inference and will become much more common in the future. Our ABC method is too computationally intensive to handle such large data sets. New theoretical developments and new statistical techniques need to be developed, and these must take into account the potential biases inherent in the data-production procedure. A likelihood approach would be ideal, but unfortunately, the probability of observing a set of fitness values under Fisher’s model is hard to compute as soon as genotypes carry two or more mutations, let alone when mutations were obtained using complex protocols. In essence, this is because computing the probability of a fitness value requires integration over all possible values of the unobserved phenotypes.

*Fitting other types of data*. Other types of data may prove more informative than empirical landscapes. For example, Martin and Lenormand (2006) used the fitness effects of mutations across environments to infer very precisely the shape of the fitness peak (*i.e.*, our *Q* parameter), which they found to be very close to *Q* = 2 (the Gaussian function). Perfeito *et al.* (2014) showed that temporal dynamics of fitness in experimental populations allow good inference of the underlying fitness landscape, including dimensionality, which is very hard to infer from genotypic landscapes. Again, new theoretical developments may reveal what type of empirical data informs best on the underlying fitness landscape.
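The SSWM regime invoked above can be illustrated by a short adaptive-walk sketch, in which a beneficial single-mutant neighbour fixes with probability proportional to its selection coefficient. This is a standard simplification; the exact simulation protocol of the paper may differ.

```python
import numpy as np

def sswm_walk(fitness, genotype, n_steps=100, rng=None):
    """Adaptive walk under the SSWM approximation: at each step one
    beneficial single-mutant neighbour fixes, with probability
    proportional to its selection coefficient; the walk stops at a
    local fitness optimum."""
    rng = rng or np.random.default_rng()
    g = tuple(genotype)
    for _ in range(n_steps):
        neighbours = [g[:i] + (1 - g[i],) + g[i + 1:] for i in range(len(g))]
        s = np.array([fitness[nb] / fitness[g] - 1.0 for nb in neighbours])
        s[s < 0] = 0.0                # deleterious mutants never fix under SSWM
        if s.sum() == 0:
            break                     # local optimum reached
        g = neighbours[rng.choice(len(neighbours), p=s / s.sum())]
    return g
```

Clonal interference would replace the proportional-to-*s* fixation rule with a stronger bias toward the largest-effect beneficial mutation, which is precisely why the choice of selection model matters when simulating the protocol.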

## Conclusion

We have developed a rigorous statistical framework based on Fisher’s model to infer the properties of the underlying fitness landscape from empirical landscapes. This framework differs conceptually from previous approaches because it considers an empirical landscape as a small sample of the vast space of all possible genotypes. This new approach reveals that most experimental protocols reconstruct small landscapes that carry limited information on the true underlying landscape. As a consequence, any analysis and interpretation of empirical landscapes must be embedded within a proper statistical framework that quantifies the uncertainty on the true landscape. Surprisingly, we find that a very broad class of phenotypic models that has been successful so far in interpreting experimental data is unable to explain the structure of most empirical fitness landscapes. Yet phenotypic models remain an interesting avenue for future research because they can represent landscapes of large dimensionality with a small number of parameters, and they are more biologically grounded than direct genotype–fitness maps. Much larger empirical landscapes will become more frequent in the future; a model-based and statistically grounded analysis of these large landscapes will improve our understanding of the structure of fitness landscapes across species and environments.

## Acknowledgments

We thank Guillaume Achaz, Luis-Miguel Chevin, Florence Débarre, Luca Ferretti, Thomas Lenormand, and Olivier Tenaillon for discussions and useful comments. Comments from Craig Miller, Daniel Weinreich, and one anonymous reviewer greatly improved the manuscript. This work was supported by the Danish Research Council (FFF-FNU), the European Research Council under the European Union’s Seventh Framework Program (ERC grant 311341 to T.B.), and the Bettencourt Foundation (Young Researcher Award to F.B.).

## Footnotes

*Communicating editor: D. M. Weinreich*

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.182691/-/DC1.

- Received September 15, 2015.
- Accepted March 25, 2016.

- Copyright © 2016 by the Genetics Society of America

Available freely online through the author-supported open access option.