## Abstract

An article reporting statistical evidence for epigenetic transfer of learned behavior has important implications, if true. With random sampling, real effects do not always result in rejection of the null hypothesis, but the reported experiments were uniformly successful. Such an outcome is expected to occur with a probability of 0.004.

Independent replications of empirical findings are critical for the development of science (*e.g.*, Prinz *et al.* 2011; Collins and Tabak 2014; McNutt 2014), but there are difficulties in interpreting replications of *statistical* findings. Because of random sampling, not every experiment will produce a successful statistical outcome, even if an effect actually exists. If the statistical power of a set of experiments is relatively low, then the absence of unsuccessful results implies that something is amiss with data collection, data analysis, or reporting (Ioannidis and Trikalinos 2007; Francis 2012, 2013, 2014). Here, I apply these ideas to a recent study reporting epigenetic transfer of olfactory conditioning (Dias and Ressler 2014) that has been hailed as both groundbreaking and puzzling (Hughes 2014; Szyf 2014; Welberg 2014).

The claim for epigenetic transfer is based on behavioral and neuroanatomical findings. The first experiment (coded as “Figure 1a” in Table 1) is representative of the behavioral studies. One group of male mice was subjected to fear conditioning in the presence of the odor acetophenone. Compared to the offspring of unconditioned control mice, the offspring of the conditioned mice exhibited significantly enhanced sensitivity to acetophenone as measured by the fear-potentiated startle (*P* = 0.043). A *post hoc* power calculation estimates that a replication experiment using the same sample sizes would produce a statistically significant outcome (*P* < 0.05) only 51% of the time if the true effect is similar to the one reported in the original experiment. Nine other behavioral experiments explored variations of the finding (using different odors, generations, mouse strains, and developmental contexts). As defined by Dias and Ressler (2014), success in those experiments usually involved rejecting the null hypothesis, but for some experiments success was based on a predicted null result or a pattern of significant and nonsignificant results. I estimated success probabilities for experiments like these with standard power calculations or simulated experiments that used the reported sample sizes, means, and standard deviations. For all of these calculations, the hypothesis tests of the original findings were assumed to be appropriate and valid for the data (*e.g.*, the data were sampled from populations having normal distributions with homogeneity of variance). R scripts for estimating the probabilities are provided with this article's supplemental material.
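The link between a reported *P* value just under 0.05 and a replication power near one half can be illustrated with a rough sketch. The Python snippet below is not the article's R script. It assumes a two-sided test, treats the observed effect as the true effect, and replaces the *t* distribution with a normal approximation, so its result only approximates the 51% reported above. The function name `approx_replication_power` is hypothetical.

```python
from statistics import NormalDist


def approx_replication_power(p_observed, alpha=0.05):
    """Approximate power of a same-size replication (normal approximation).

    Assumes a two-sided test and that the true standardized effect equals
    the one implied by the observed P value. This is an illustration, not
    the t-based calculation used in the article.
    """
    std_normal = NormalDist()
    # Test statistic implied by the observed two-sided P value
    z_obs = std_normal.inv_cdf(1.0 - p_observed / 2.0)
    # Critical value for the significance criterion
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Probability that a replication statistic lands in either rejection region
    return std_normal.cdf(z_obs - z_crit) + std_normal.cdf(-z_obs - z_crit)


power = approx_replication_power(0.043)
print(f"Approximate replication power: {power:.2f}")
```

Under these assumptions, any result that barely reaches significance implies a replication power close to one half, which is why a single *P* = 0.043 finding is weak evidence that a repeat experiment would succeed.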

Table 1 lists the sample sizes, the inferences that defined success, and the estimated probability of such outcomes for each experiment. I followed the treatment by Dias and Ressler (2014) of the experiments as statistically independent, so the probability of a set of 10 behavioral experiments like these all succeeding is the product of the probabilities: 0.023. This value is an estimate of the reproducibility of the statistical outcomes for these behavioral studies. Its low value suggests that the outcomes deemed by Dias and Ressler (2014) as support for their claim are unlikely with experiments similar to the ones they reported. It is important to recognize that such a low probability is not a necessary outcome for all possible experiment sets. When a reported experiment set includes unsuccessful results (as it should if the probabilities are modest), the excess success analysis estimates the probability of producing the observed or a greater number of successful outcomes. For example, if 3 of the 10 behavioral experiments reported in Dias and Ressler (2014) had been unsuccessful, then the probability of producing seven or more successful outcomes would be estimated as 0.65, which would not raise any concerns. R code for the calculation is provided in File S1 of the Supporting Information.
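Both calculations described above (the probability that every experiment succeeds, and the probability of observing at least a given number of successes) can be sketched in a few lines. The Python example below is an illustration rather than the R code in File S1. It assumes independent experiments, and the success probabilities in `hypothetical_probs` are placeholders chosen for the example, not the values in Table 1. The function names are likewise hypothetical.

```python
from math import prod


def prob_all_succeed(probs):
    """Probability that every independent experiment succeeds."""
    return prod(probs)


def prob_at_least(probs, k):
    """Probability of k or more successes among independent experiments
    whose success probabilities may differ."""
    # dist[j] holds the probability of exactly j successes so far
    dist = [1.0]
    for p in probs:
        updated = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            updated[j] += mass * (1.0 - p)   # this experiment fails
            updated[j + 1] += mass * p       # this experiment succeeds
        dist = updated
    return sum(dist[k:])


# Placeholder probabilities for 10 experiments (NOT the Table 1 estimates)
hypothetical_probs = [0.5, 0.6, 0.7, 0.8, 0.9, 0.5, 0.6, 0.7, 0.8, 0.9]

print("All 10 succeed:", prob_all_succeed(hypothetical_probs))
print("7 or more succeed:", prob_at_least(hypothetical_probs, 7))
```

With probabilities of this size, complete success is rare while seven or more successes is far more likely, which is the contrast the excess success analysis relies on.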

The argument of Dias and Ressler (2014) for epigenetic transfer of conditioning was bolstered by 12 neuroanatomical experiments, with the first one (marked as “Figure 3g” in Table 1) being representative. Staining indicated that the offspring of mice fear conditioned with acetophenone had larger acetophenone-responding glomeruli in the olfactory bulb compared to both the offspring of mice without conditioning and the offspring of mice conditioned to a different odor. Experimental success required a significant ANOVA and a significant contrast between the experimental group and each of the control groups. The probability of a successful outcome (estimated by simulated experiments as 0.782) differs from the ideal value of one because the test between mice conditioned to different odors has only modest experimental power due to the relatively small sample size for one of the groups (*n* = 18). Other neuroanatomical studies compared staining of odor-responding glomeruli in different brain regions and in different mouse strains, generations, and developmental contexts. As with the behavioral studies, every reported experiment produced a pattern of significant and nonsignificant findings deemed to provide support for the theoretical claims. The probability of experiments like these being so successful is the product of the appropriate probabilities listed in Table 1, which is 0.189. Although better than for the behavioral experiments, this analysis indicates only a one in five chance of successfully replicating the full set of neuroanatomical findings reported in Dias and Ressler (2014) with effects and sample sizes similar to the original report.

The claim that olfactory conditioning could epigenetically transfer to offspring is based on successful findings from both the behavioral and neuroanatomical studies. If that claim were correct, if the effects were accurately estimated by the reported experiments, and if the experiments were run properly and reported fully, then the probability of every test in a set of experiments like these being successful is the product of all the probabilities in Table 1, which is 0.004. The estimated reproducibility of the reported results is so low that we should doubt the validity of the conclusions derived from the reported experiments.

How could the findings of Dias and Ressler (2014) have been so positive with such low odds of success? Perhaps there were unreported experiments that did not agree with the theoretical claims; perhaps the experiments were run in a way that improperly inflated the success and type I error rates, which would render the statistical inferences invalid. Researchers can unintentionally introduce these problems with seemingly minor choices in data collection, data analysis, and result interpretation. Regardless of the reasons, too much success undermines reader confidence that the experimental results represent reality.

Even if some of the effects prove to be real, the findings reported in Dias and Ressler (2014) likely overestimate the effect magnitudes because unreported unsuccessful outcomes usually indicate a smaller effect than reported successful outcomes. Scientists planning to design experiments that replicate the significant behavioral findings in Dias and Ressler (2014) might find it prudent to halve the pooled effect size value from 1.0 to 0.5. To show statistical significance with a power of 0.8 for a difference of means, such a replication experiment requires sample sizes of 64 in each group, which is four times the size of the largest experimental samples used by Dias and Ressler (2014). Importantly, even for such high-power experiments, one would not expect all studies to produce successful outcomes. For proper experiments, the rate of experimental success has to match the characteristics of the experiments, effects, and analyses. Scientific claims based on hypothesis tests from a set of experiments require either highly powered successful experiments or pooling across both successful and unsuccessful experiments.
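The sample size quoted above can be checked approximately with the textbook normal-approximation formula for a two-sided, two-sample comparison of means, n ≈ 2(z₁₋α/₂ + z_power)² / d² per group. The Python sketch below assumes α = 0.05 and equal group sizes, and the function name `approx_n_per_group` is hypothetical. Because it uses the normal rather than the *t* distribution, it returns a value slightly below the 64 per group given in the text, which is consistent with a *t*-based calculation.

```python
from math import ceil
from statistics import NormalDist


def approx_n_per_group(effect_size, power=0.8, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)  # significance criterion
    z_power = std_normal.inv_cdf(power)              # desired power
    return ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


# Halved standardized effect size (d = 0.5) suggested in the text
print(approx_n_per_group(0.5))
# Original pooled effect size (d = 1.0), for comparison
print(approx_n_per_group(1.0))
```

Halving the assumed effect size roughly quadruples the required sample size, because the required n scales with 1/d².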

*Note added in proof*: See Dias and Ressler 2014 (p. 453) and Churchill 2014 (pp. 447–448) in this issue for related work.

## Footnotes

Supporting information is available online at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.163998/-/DC1.

*Communicating editor: M. Johnston*

Copyright © 2014 by the Genetics Society of America