A paper published in this Journal (Singh *et al.* 2013) estimated fine-scale recombination rates across the 2.1-Mb *garnet-scalloped* (*g-sd*) region of the *Drosophila melanogaster X* chromosome by pooling male progeny inheriting crossovers within the region in groups of 100 for DNA sequencing. In each pool, the proportion of the recombinant allele at each SNP therefore should be 0 to the left of the region and 1 to the right, and at each SNP, the proportion of recombinant alleles should equal the proportion of males that recombined to the left of that SNP. The authors inferred allele proportions from the proportion of sequencing reads matching the recombinant allele at each SNP. Because of sampling variance in the number of reads, the authors selected 451 SNPs across the region based on coverage criteria (5.9% of all SNPs) and applied a novel median-filtering method that set the proportion of recombinant alleles *p* for each SNP to the median of a window of *k* SNPs to the left and right, using the smallest value of *k* for each pool that resulted in a monotonically increasing DNA landscape. The amount of recombination was then estimated as the difference in that landscape *∆p* across nonoverlapping windows of five different sizes.
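The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, "monotonically increasing" is read as non-decreasing, and the treatment of windows that run past the first or last SNP (padding with 0 on the left and 1 on the right) is an assumption, since the edge handling is not specified here.

```python
from statistics import median


def median_filter(p, k, left_pad=0.0, right_pad=1.0):
    """Set each value to the median of itself and k neighbours on each side.

    ASSUMPTION: windows that extend past the ends are padded with 0 (left)
    and 1 (right), the expected proportions outside the region.
    """
    padded = [left_pad] * k + list(p) + [right_pad] * k
    return [median(padded[i:i + 2 * k + 1]) for i in range(len(p))]


def is_monotonic(values):
    """True if the landscape never decreases from one SNP to the next."""
    return all(a <= b for a, b in zip(values, values[1:]))


def filter_to_monotonicity(p):
    """Return (k, filtered) for the smallest k giving a monotonic landscape."""
    for k in range(len(p) + 1):
        filtered = median_filter(p, k)
        if is_monotonic(filtered):
            return k, filtered
    raise ValueError("no window size produced a monotonic landscape")


# Hypothetical read proportions for six SNPs in one pool.
noisy = [0.1, 0.3, 0.2, 0.5, 0.4, 0.9]
k, filtered = filter_to_monotonicity(noisy)
```

With this padding, and with all proportions between 0 and 1, the loop always terminates, because a window of *k* equal to the number of SNPs returns the fully sorted list.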

As part of another project, I set out to understand the properties of the authors’ method. I implemented their median-filtering and windowing algorithms based on their published descriptions and was able to recapitulate their results (Supporting Information, File S1). I then applied those algorithms to simulated data sets, which were generated by assuming that the recombination rate is completely uniform per nucleotide and then simulating the sequencing reads for each SNP as a binomial draw with the same *N* as the real read-count data. When the median-filtering and windowing algorithms were applied to the simulated reads, they yielded heterogeneous recombination rates that were similar, but not identical, to the authors’ published rates across all window sizes, with correlation coefficients from *r* = 0.7 to *r* = 0.85 across their five window sizes. However, when the sizes of their intervals were equalized and the simulations repeated, the heterogeneity disappeared.
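A simulation of this kind might look like the following sketch. It is not the code in File S1. The function name, the choice to draw 100 crossover positions per pool (rather than using the expected proportion directly), and the example coordinates and read depths are all assumptions made for illustration.

```python
import random


def simulate_pool(positions, coverage, region_start, region_end,
                  n_males=100, rng=None):
    """Simulate observed recombinant-read proportions for one pool.

    ASSUMPTIONS: each male carries one crossover placed uniformly along the
    region, and the reads at each SNP are a binomial sample (one Bernoulli
    draw per read) of the pool's true recombinant-allele proportion.
    Every entry of `coverage` must be at least 1.
    """
    if rng is None:
        rng = random.Random()
    crossovers = [rng.uniform(region_start, region_end) for _ in range(n_males)]
    observed = []
    for pos, n_reads in zip(positions, coverage):
        # True proportion: fraction of males that recombined left of this SNP.
        true_p = sum(c < pos for c in crossovers) / n_males
        hits = sum(rng.random() < true_p for _ in range(n_reads))
        observed.append(hits / n_reads)
    return observed


# Hypothetical SNP coordinates and read depths in a 2.1-Mb region.
example = simulate_pool([0, 500_000, 2_100_001], [30, 30, 30],
                        0, 2_100_000, rng=random.Random(1))
```

Because the true landscape here is a straight line by construction, any rate heterogeneity recovered from such data after filtering and windowing has to come from the analysis rather than from the simulated biology.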

This would appear to mean that the rate heterogeneity the authors reported is a consequence of the unequal lengths of their 450 intervals. Their longest interval is over 41,000 nucleotides long; their shortest three intervals are each 1 nucleotide long, meaning that recombination events between adjacent nucleotides were being inferred. The shortest 10% of their intervals combined cover less than 1 kb, only 0.05% of the *g-sd* region. The longest 10% of their intervals cover 39% of the region. These unequal interval lengths interact with their median-filtering algorithm to introduce artifactual heterogeneity to their rate estimates, as described next.

Consider a median-filtering window centered on SNP *i*. The proportions of recombinant reads for the 2*k* + 1 SNPs within the window (*p*_{i−k} to *p*_{i+k}) are sorted into an ordered list with index *j* = 1, 2, ..., 2*k* + 1, and the median-filtered value of *p* for SNP *i* is set equal to the proportion of recombinant reads at the SNP at the center of that list (*i.e.*, the SNP at index *j* = *k* + 1). Next, the window shifts one position to the right, to median-filter SNP *i* + 1. The sorted list, however, is nearly unchanged; *p*_{i−k} is removed, and *p*_{i+k+1} is inserted into its sorted position. Unless *p*_{i−k} is greater than the preceding median or *p*_{i+k+1} is less than the *p* value that was sorted to index *j* = *k* + 2, the value for SNP *i* + 1 will necessarily be set to the *p* value previously sorted to index *j* = *k* + 2. Because the filter windows were set large enough that the filtered *p* values increased monotonically, these exceptions should happen rarely, and the median-filtering algorithm ought to be nearly equivalent to a simple sort of the *p* values for all SNPs in each pool.
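The window-shift argument can be checked on a toy example. The sketch below uses hypothetical values with *k* = 2, and the helper name `slide` is introduced only for illustration. It removes the outgoing value from a sorted window, inserts the incoming one, and reads off the new median.

```python
from bisect import insort


def slide(sorted_window, outgoing, incoming):
    """Return the sorted window after dropping `outgoing` and adding `incoming`."""
    window = list(sorted_window)
    window.remove(outgoing)
    insort(window, incoming)
    return window


k = 2
window = [0.10, 0.20, 0.30, 0.40, 0.50]   # sorted; the median is window[k] = 0.30

# Typical case: the outgoing value is not above the old median and the
# incoming value is not below the old (k + 2)th value, so the new median
# is the old (k + 2)th value, 0.40.
typical = slide(window, 0.10, 0.60)[k]

# Exception 1: the outgoing value is above the old median, so the median stays.
outgoing_high = slide(window, 0.50, 0.60)[k]

# Exception 2: the incoming value falls below the old (k + 2)th value and
# becomes the new median itself.
incoming_low = slide(window, 0.10, 0.35)[k]
```

In the typical case the filter simply steps to the next value in sorted order, which is why repeated shifts behave like reading down a sorted list.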

Actually sorting the *p* values in each pool confirms this inference. Comparing sorted pools to median-filtered pools results in a correlation for individual landscapes of *r* = 0.9993 or greater (File S2). When the 25 pools are averaged, the correlation between the sorted and median-filtered landscapes is 0.9999. When the averaged sorted landscape is windowed to estimate recombination rates, the correlation of those sorted rates with the values published by the authors is *r* = 0.983 (5-kb windows) or better. Clearly, median filtering to monotonicity must be nearly equivalent to sorting.

This implies that the inferred ∆*p* values across each interval are decoupled from the crossover events that actually occurred within the interval because the random sampling of sequencing reads means that *p* values can be distributed above or below the true proportion of recombinant alleles for each SNP. This means that when the landscapes from all 25 pools were combined, the *p* values for up to 25 different SNPs were being averaged, and ∆*p* should approximate a constant, which is the average difference between adjacent sorted random values.

Keeping in mind that ∆*p* across an interval represents the fraction of the 2500 recombination events inferred to have occurred within that interval, the preceding makes a clear prediction: if ∆*p* is nearly constant, then the per-interval recombination rates reported by the authors ought to be inversely proportional to interval length. Dividing the change in *p* across each interval by the number of nucleotides within the interval gives a rate in centimorgans per megabase (the same justification as their nonoverlapping windowing approach), and the resulting rates show that the prediction is correct (Figure 1, red circles). The log-log correlation between the length of an interval and its estimated recombination rate is *r* = −0.99, showing that the great majority of the recombination rate heterogeneity reported by the authors is determined by their unequal interval lengths.
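The prediction is easy to reproduce on made-up numbers. In the sketch below the SNP coordinates and landscape values are hypothetical, and rates are expressed relative to the regional average rather than in centimorgans per megabase, which avoids assuming a genetic length for the region. When ∆*p* is exactly constant, the log-log correlation between interval length and rate is −1 up to rounding error.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def interval_rates(positions, landscape):
    """Per-interval rate (delta-p per nucleotide) relative to the regional average."""
    lengths = [b - a for a, b in zip(positions, positions[1:])]
    delta_p = [b - a for a, b in zip(landscape, landscape[1:])]
    average = (landscape[-1] - landscape[0]) / (positions[-1] - positions[0])
    rates = [dp / length / average for dp, length in zip(delta_p, lengths)]
    return lengths, rates


# Hypothetical SNPs with very unequal spacing and a constant delta-p of 0.25.
positions = [0, 1, 1001, 42001, 50000]
landscape = [0.0, 0.25, 0.5, 0.75, 1.0]
lengths, rates = interval_rates(positions, landscape)
r = pearson([math.log10(v) for v in lengths], [math.log10(v) for v in rates])
```

In this toy case the 1-nucleotide interval receives a rate 12,500 times the regional average purely because of its length.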

Figure 1 also demonstrates that the rates inferred by the authors’ method are at odds with reasonable expectations of how recombination rates should vary. As intervals get longer, recombination rates ought to trend toward the overall experimental average (being equal to it, by definition, at the upper limit of a single interval containing the entire experiment). However, Figure 1 shows that the longest intervals have the lowest recombination rates. Likewise, the recombination rate observed for many of the authors’ short intervals ought to be exactly zero because the chance of even one crossover occurring within those intervals is small. However, the averaged landscape is inferring multiple crossovers (a minimum of 2.08 events) within every one of their intervals, leading to rate estimates of thousands of centimorgans per megabase within intervals that are much shorter than the conversion tracts associated with crossover events.

This anticorrelation between recombination rates and interval length means that it should be possible to derive the same recombination rates using only the interval lengths. Setting ∆*p* for each interval to a constant 1/450 produced per-interval recombination rates close to the trend line of their sequence-derived data (Figure 1, black points). When this constant ∆*p* was converted to a landscape and windowed, it yielded recombination rate estimates very similar to the published rates based on sequence data (Figure 2), with correlation coefficients of *r* = 0.931 or better for all window sizes. Much of the residual difference appears to be caused by an edge effect, where the median-filtering window extended beyond the edges of the sequenced SNPs. [This edge effect can be seen clearly in figure 3 of Singh *et al.* (2013), where the landscapes from individual pools have steep slopes near 0 and 1.] If the first and last windowed intervals are excluded, the correlation improves to *r*′ = 0.963 or better for all window sizes.
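A constant-∆*p* landscape of this kind can be built from SNP positions alone, as in the sketch below. The function names and example coordinates are hypothetical, and the use of linear interpolation between SNPs to read the landscape at window boundaries is an assumption, since the exact windowing procedure is not spelled out here.

```python
from bisect import bisect_right


def constant_dp_landscape(positions):
    """Landscape rising by the same amount (1 / number of intervals) at every SNP."""
    n_intervals = len(positions) - 1
    return [i / n_intervals for i in range(len(positions))]


def interpolate(positions, landscape, x):
    """Landscape value at coordinate x (ASSUMPTION: linear between SNPs)."""
    if x <= positions[0]:
        return landscape[0]
    if x >= positions[-1]:
        return landscape[-1]
    j = bisect_right(positions, x)        # positions[j - 1] <= x < positions[j]
    x0, x1 = positions[j - 1], positions[j]
    y0, y1 = landscape[j - 1], landscape[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def windowed_delta_p(positions, landscape, window):
    """Change in the landscape across nonoverlapping windows of fixed size."""
    start, end = positions[0], positions[-1]
    edges = list(range(start, end, window)) + [end]
    values = [interpolate(positions, landscape, e) for e in edges]
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical SNPs: two short intervals followed by one long interval.
positions = [0, 10, 20, 100]
landscape = constant_dp_landscape(positions)
delta_p = windowed_delta_p(positions, landscape, 50)
```

In this example the first window, which contains the two short intervals, is assigned about 79% of the recombination even though no read data were used at all.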

The study by Singh *et al.* (2013) was an innovative attempt to leverage high-throughput sequencing technology to gain insight into fine-scale recombination rate variation in *Drosophila*. However, the read-count data were noisy, and without the ability to position individual recombination events, the authors attempted a novel median-filtering method to infer recombination rates. I have shown that this analytical method is not sufficient to obtain the desired parameters from the read-count data and, unfortunately, produces rate estimates that are mainly determined by the locations of the SNPs being used and not by the underlying biological recombination rates of interest.

## Acknowledgments

I would like to thank Matt Rockman and anonymous reviewers of this manuscript for their helpful comments. This work was supported by the National Institutes of Health (National Institute of General Medical Sciences award GM-099054).

## Footnotes

Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.177808/-/DC1

- Copyright © 2015 by the Genetics Society of America