## Abstract

For several decades, sequence alignment has been a widely used tool in bioinformatics. For instance, finding homologous sequences with a known function in large databases is used to get insight into the function of nonannotated genomic regions. Very efficient tools like BLAST have been developed to identify and rank possible homologous sequences. To estimate the significance of the homology, the ranking of alignment scores takes a background model for random sequences into account. Using this model we can estimate the probability to find two exactly matching subsequences by chance in two unrelated sequences. For two homologous sequences, the corresponding probability is much higher, which allows us to identify them. Here we focus on the distribution of lengths of exact sequence matches between protein-coding regions of pairs of evolutionarily distant genomes. We show that this distribution exhibits a power-law tail with an exponent Developing a simple model of sequence evolution by substitutions and segmental duplications, we show analytically and computationally that paralogous and orthologous gene pairs contribute differently to this distribution. Our model explains the differences observed in the comparison of coding and noncoding parts of genomes, thus providing a better understanding of statistical properties of genomic sequences and their evolution.

ONE of the first and most celebrated bioinformatic tools is sequence alignment (Needleman and Wunsch 1970; Smith and Waterman 1981; Altschul *et al.* 1990), and algorithms to search for similarity between sequences in a huge database are still actively studied.

For this matter, we need to be able to distinguish sequence alignments that are due to a biological relatedness of two sequences from those that occur randomly. Let us, for simplicity, disregard mismatching nucleotides and insertions and deletions (indels or gaps) in an alignment and consider only so-called maximal exact matches, *i.e.*, sequences that are 100% identical and cannot be extended on both ends. In this case, the length distribution of matches is equivalent to the score distribution and can easily be calculated for an alignment of two random sequences where each nucleotide represents an i.i.d. random variable. We expect the number of matches to be distributed according to a geometric distribution, such that the number, of exact maximal matches of length *r* is given by(1)where *L _{A}* and

*L*are the lengths of the two genomes,

_{B}*p*is the probability that

^{r}*r*nucleotides match, and is the probability that a match is flanked by two mismatches. Here where is the frequency of nucleotide

*α*in the genome of species

*X*and the sum is taken over all nucleotides. Thus, the number of matches for a given length

*r*is expected to decrease very fast as the length

*r*increases, and for generic random genomes of hundreds of megabase pairs, we do not expect any match >25 bp.

For long matches, real genomes strongly violate the prediction of Equation 1 due to the evolutionary relationships between and within genomes (Salerno *et al.* 2006). Comparing the genomes of recently diverged species, we find regions in the two genomes that have not acquired even a single substitution. In the following, substitution refers to any genomic change that would disrupt a 100% identical match (for instance, a nucleotide exchange, an insertion, or a deletion). As the divergence time between the two species increases, such matches will get smaller very fast and most remaining long matches will be found either in exons or in ultraconserved elements (Bejerano *et al.* 2004) that both evolve under purifying selection.

Computing the match length distribution (MLD) from the comparison of human and mouse genomes, we thus expect to observe much longer exact matches than in a comparison of two random sequences of the same lengths. The observed MLD for exons and noncoding sequences in the human and mouse RepeatMasked genomes is shown in Figure 1. At the left end of the distribution, *i.e.*, for *r* < 25 bp, the distribution is dominated by random matches, as described by Equation 1. As expected, this MLD deviates from the random model for matches >25 bp.

Interestingly, in this asymptotic regime, the MLD exhibits a power-law tail ∼ *r ^{α}*. In the comparison of exonic sequences, the exponent

*α*is close to –5, in contrast to the MLD of noncoding sequences, where the exponent

*α*is close to –4 (Gao and Miller 2011, 2014; Massip

*et al.*2015). This property appears to be impressively reproducible in the comparison of various pairs of species (see Supplemental Material, File S1, Figure S1). In all cases, the value of

*α*was calculated using the maximum-likelihood estimator. To assess the robustness of this estimator, we also performed a bootstrap analysis that showed good agreement with the estimated value of

*α*; see

*Materials and Methods*. Note also that discrepancies of the power-law behavior can be observed for very long matches, since such matches are scarce and random noise distorts the distributions.

It is tempting to speculate that this peculiar behavior of exonic sequences is a direct consequence of their coding function. However, we demonstrate in the following that this distribution can be accounted for by a simple evolutionary model that takes into account the generation of paralogous sequences (due to segmental duplications) and orthologous sequences (due to speciation) (Fitch 2000). Further, we assume that paralogous and orthologous exact matches are subsequently broken down by random substitutions. These dynamics can be modeled by a well-known stick-breaking process (Kuhn 1930; Ziff and McGrady 1985) introduced below. Since our model describes the existence of long matching sequence segments in two genomes, it also has to include selection. However, we model selection in a minimal way, since we assume only that regional substitution rates are distributed, such that there are regions that evolve very slowly. Our model can therefore be viewed as a minimal model for evolution of functional sequences, which reproduces certain statistical features of their score distributions. In the next section, we describe the main methods and the data analyzed in this article.

## Materials and Methods

### Computational and statistical analysis

#### Computing MLDs:

To find all identical matches (both in the case of self and comparative alignments), we used the mummer pipeline (Kurtz *et al.* 2004) (version 3.0), which allows us to find all maximal exact matches between a “query” and a reference sequence using a computationally efficient suffix tree approach. For our analyses, we used the following options: -maxmatch such that mummer searches for all matches regardless of their uniqueness; -n that states that only ′A′, ′T′, ′C′, and ′G′ can match; -b such that mummer searches for matches on both strands; and -l 20 to filter out matches <20 bp.

The number of matches expected for a random i.i.d. sequence grows quadratically with *L*. For instance, we expect 3.5 × 10^{16} matches of length 2 in a comparison of two sequences of length *L* = 10^{9} bp (see Equation 1). For this reason, we have to define a threshold for the length of matches that mummer should output especially when comparing entire eukaryotic genomes.

#### Logarithmic binning:

Power laws appear in the tail of distributions, meaning that they are associated to rare events, which are thus subject to strong fluctuations. The high impact of noise in the tail of the distribution can make the assessment of the exponent of the distribution difficult. A way to resolve this issue is to increase the size of the bins with the value of the horizontal axes and normalize the data accordingly. Namely, the observed values for each bin are divided by the size of the bin. The most common choice to do this is known as the logarithmic binning procedure, which consists of increasing the size of the bin by a constant factor. Note that by doing so, we dramatically reduce the number of data points and some information is lost as we aggregate different data points together in the same bin. Therefore it is often useful to consider both representations, with and without the logarithmic binning. See Newman (2005) for more details on the logarithmic binning procedure and on power-law distributions.

#### Estimating the value of the power-law exponent:

To estimate the value of the exponent of the power-law *α*, we compute the maximum-likelihood estimator. The estimator is simply the value of *α* that maximizes the log-likelihood ℒ,(2)such that(3)while the value of has to be determined visually. This estimator is also sometimes referred to as the Hill estimator (Hill *et al.* 1975).

To estimate the robustness of the value of the exponents found using this method, we proceeded to bootstrap experiments on the human to mouse exome comparison. For each bootstrap, we sampled 5% of the mouse exons and compared them to all human exons. In each experiment, we calculated the exponent of the MLD, using the maximum-likelihood estimator as described in Newman (2005). We repeated this procedure 100 times. Values of *a* were all in the range [–4.7, –5.2] and the mean value for the exponent was *α* = –4.9.

### Data availability

All genomes and their specific annotations (such as repeat elements and exons) were downloaded from the ensembl website (Cunningham *et al.* 2015), using the Perl API (version 80); the corresponding release of the human genome is GRCh38. In all cases, we downloaded the RepeatMasked versions of genomes publicly available in the ensembl databases.

Perl, R, and C++ scripts used to simulate the data and compute the match lentgh distributions are available at https://github.com/Flomass/MaLenDi. The MUMmer pipeline is freely available online (http://mummer.sourceforge.net/).

## Results

### Theory

#### The stick-breaking model:

Before we turn to the detailed description of our model, let us shortly introduce some relevant results on random stick breaking. Consider a stick of length *K* at time which will be sequentially broken at random positions into a collection of smaller sticks. Breaks occur with rate *μ* per unit length. The distribution of stick lengths at time *t*, denoted by follows the integro-differential equation (4)(Ziff and McGrady 1985; Massip and Arndt 2013), where the first term on the right-hand side represents the loss of sticks of length *r* due to any break in the given stick and the second term represents the gain of sticks of length *r* from the disruption of longer sticks. Note that for any stick of length there are two possible positions at which a break would generate a stick of length *r*.

The initial state is one unbroken stick of length *K*; *i.e.*, The corresponding time-dependent solution is (5)(Ziff and McGrady 1985), where we define the rescaled time Apart from the singularity at *r* = *K*, which accounts for the possibility that the stick is not even broken once, the distribution is dominated by an exponential function; *i.e.*, there are far more small sticks than long ones. The average stick length is given by

#### The match length distribution of evolving sequences:

The above stick-breaking process can be used to describe the breakdown of a long DNA match into several smaller ones by substitutions in either one of the two copies of the match. In a comparison of two species, *A* and *B*, long identical segments are the signature of homology relationships between the two sequences. These homologous sequences either result from the copy of the genetic material during the time of speciation and are then orthologous sequences (see the blue dashed line in Figure 2) or are due to segmental duplications in the ancestral genome, *i.e.*, paralogous sequences (see the red dashed line in Figure 2) (Fitch 2000).

The MLD is then given by the integral(6)where is the number of homologous sequences with divergence *τ* and is given in Equation 5; see also Massip *et al.* (2015). The divergence between a pair of orthologous sequences is the sum of two contributions where *t*_{sp} is the time since the two species diverged and *i* is an index for regions in the genomes. The regional mutation rates and in the two species are themselves distributed and assumed to be independent from each other. We therefore define *N _{AB}* as(7)where [resp. ] is the number of sequences with divergence

*τ*from the last common ancestor

*I*in species

*A*(resp.

*B*). However, if the two regions are paralogous, the divergence

*τ*is a sum of three independent contributions where represents the time elapsed between the segmental duplication and the split of the two species. There are paralogous sequences with divergence with(8)For our purposes we are not interested in the full functional form of the distributions in Equations 7 and 8 but have to consider only their behavior for small because long matches [and thus the tail of the distribution of the match length distribution

*M*(

*r*)] stem from homologous exons that exhibit a small divergence

*τ*. A more general discussion about the functional form of the distribution of pairwise distances can be found in Sheinman

*et al.*(2015). We therefore take the Taylor expansion of the distributions

*N*(

*τ*) around Using Leibniz’s formula to take the derivative under the integral sign (Flanders 1973), we find for orthologous exons (9)(see details in File S1) and subsequently the match length distribution (10)(Massip

*et al.*2015), as

*K*≫

*r*. In contrast, expanding Equation 8 around finds(11)(see details in File S1). Thus, for paralogous pairs, the number of regions with divergence

*τ*increases as in the small

*τ*limit. Therefore the match length distribution exhibits a power-law tail with exponent (12)as

*K*≫

*r*.

Depending on the number of orthologous sequences and paralogous sequences we will be able to distinguish two regimes: one where the MLD follows an power law and one where it follows an power law. From Equations 10 and 12, it is straightforward to find that the crossover point *r*_{c} between those regimes (see Figure 3) is at(13)Recall that is defined as the number of paralogous segments that have not mutated even a single time since the duplication event at the time of the split. Thus, this term is proportional to the ratio of the duplication rate over the mutation rate. If there are significantly more paralogous sequences compared to orthologous ones and the crossover value, is large. Then, only the power-law tail will be observed. On the other hand, if then the crossing point is expected to be <20 such that the power law holds only for lengths where the distribution is already dominated by random matches. In contrast to previous models (Massip *et al.* 2015), this model does take into account the contribution of paralogous sequences and can explain both power-law behaviors and therefore predicts the crossing point between the two regimes.

### Numerical validation

Our theoretical considerations predict a complex behavior of the match length distribution under the described evolutionary dynamics. The key ingredients are segmental duplications, generating paralogous sequences in an ancestral genome, and point mutations that break identical pairs of homologous sequences of the two genomes into smaller pieces. To illustrate our theoretical predictions concerning the two power laws, as well as the existence of the crossover point we simulated the evolution of sequences according to the discussed scenario.

We describe the evolution of a genome of length *L* according to two simple processes, point mutations and segmental duplications. Point mutations exchange one base pair by another one and occur with rate per base pair and unit of time. To mimic the existence of regions under different degrees of selective pressure, we allow for regional differences of the point mutation rates. Segmental duplications copy a contiguous segment of *K* nucleotides to a new position where it replaces the same amount of nucleotides, such that the total length of the genome stays constant. Segmental duplications occur with rate *λ* per base pair.

Our simulation has two stages (see Figure 2). At time *t* = 0, we generate a random i.i.d. sequence During a time this sequence evolves according to the two described processes. In this first stage, the mutation rate is the same for all positions. At the end of this stage, the sequence represents the common ancestral genome of two species. At the beginning of the second stage, we copy the entire sequence of the common ancestor to generate the genomes of the two species *A* and *B*. These sequences are then subdivided into *M* continuous regions of equal length. In each such region *j*, the point mutation rates are the same for all sites *i* and are drawn from an exponential distribution with mean (*i.e.*, the point mutation rate during the first stage). We chose the exponential distribution because it stipulates the least information under the given constraints. For more details about the simulation procedure, see File S1, Appendix A.

We show the result of the comparison of simulated sequences in Figure 4, left. We obtain a power-law tail in the match length distribution, which for match length has an exponent and an exponent for longer matches For simulated sequences, we can easily classify homologous sequences into orthologous and paralogous sequences (while for natural sequences, paralogs and orthologs are not easily distinguishable due to genomic rearrangements). We show the MLD obtained from the comparison of paralogs and the MLD obtained from the comparison of orthologs for simulated sequences in Figure 4, right. We can clearly observe that orthologous sequences generate an power-law distribution while paralogous matches generate an power-law distribution. We can further easily identify the crossing point as the value of for which we obtain more matches from the comparison of orthologs than from the comparison of paralogs.

In the previous section the value of this crossing point between the two regimes was predicted to be (see Equation 13), where is the number of paralogous segments with a divergence just before the species split. In our simulation procedure, is simply the number of sequences that have been duplicated but that have not been mutated yet at the splitting time This number is known to be (Massip and Arndt 2013). In our simulations, and and therefore we predict which is in good agreement with our observations in Figure 4. The results of our simulations are thus in good agreement with our analytical predictions.

## Discussion

We developed a simple model that accounts for power-law tails in the length distribution of exact matches between two genomes. Our model assumes regional differences of the selective pressure such that the substitution rates in a region are drawn from a certain distribution. However, for naturally evolving exons the selective pressure varies also on shorter length scales. For instance, some nucleotides for many codons can be synonymously substituted by another one, mostly at third codon positions. Therefore, these substitutions at third codon positions occur with a higher rate than nonsynonymous ones. Hence, exons are expected to break preferentially at positions 3*n*, with such that the matches with 100% identity would have lengths with integer Classifying genomic matches according to the remainder that is left when dividing their length by 3, we observe an almost 10-fold overrepresentation of matches with length over matches of lengths and see Figure 5. This suggests that the match-breaking process is dominated by the synonymous mutation rate

Using the presented model, the puzzling observation of an power-law tail in the MLD in the comparison of the human and mouse genomes and a corresponding power-law tail in the comparison of their exomes can be explained. Although the sequences stem from the same species, the relative amount of paralogous to orthologous sequence segments is different in the two data sets, which subsequently leads to different crossover points Because of the selective pressure on coding exons, the number of nonmutated paralogous sequences at the time of species divergence is higher (relative to the number of orthologous sequences) in the exonic data set than in the noncoding data set. Thus, the crossover point in exomes is larger than the longest observed match and only the power law can be observed.

The opposite is true for matches in the alignment of noncoding sequences. Quantitatively, in this set, paralogous sequences play a lesser role and therefore only the power law is observed (see Figure 1). This is surprising, as the duplication rate is thought to be roughly the same in the coding and noncoding parts of genomes. To confirm this paradoxical observation, we classified matches according to the uniqueness of their sequences in both genomes. Assuming that unique matches are more likely to be orthologous, this gives us a rough classification of homologs into orthologs and paralogs, although matches unique in both sequences can be paralogs. After the classification of all matches, our analysis made apparent that matches unique in both genomes dominate the MLD in the comparison of the noncoding parts of the genomes, while matches with several occurrences in either (or both) of the genomes dominate the distribution in the case of the comparison of exomes (see File S1, Figure S2). Moreover, we computed the MLD from the set of nonunique matches of the noncoding part of the genomes. In this comparison, the contribution of paralogs is expected to be much higher than in the full set. As expected, this MLD also exhibits an power law (see File S1, Figure S2), confirming that the relative contribution of orthologs and paralogs is responsible for the shape of the MLD. These differences in the proportion of paralogous sequences in the coding and noncoding DNA are likely due to the fact that paralogs are more often retained in the coding part than in the noncoding part of genomes. Since there are many more noncoding sequences in both genomes, we also observe at least 10 times more matches in the comparison of noncoding sequences than in the comparison of exomes.

The presented model does not account for changes in the divergence rates after a duplication, a phenomenon that is well documented following a gene duplication (Scannell and Wolfe 2008; Han *et al.* 2009; Panchin *et al.* 2010; Pegueroles *et al.* 2013). To assess the impact of this phenomenon on the MLD, we performed simulations where the two paralogous segments are assigned different and independent mutation rates. Interestingly, these simulations yield results qualitatively similar to those of the simpler model introduced above (see File S1, Figure S3). This new condition does not affect the value of the number of paralogous sequences that have not diverged at the time of the split [*i.e.*, the value of *Nj* (0)] and thus the shape of the distribution.

The model we present is very simple, and more realistic models of genome evolution include many more evolutionary processes (Dalquen *et al.* 2012). For instance, we could include a transition/transversion bias in the mutational process, variations of mutation rates in time, a codon usage bias, or different rates of duplication within and between chromosomes. Since in the end we consider just identical matching sequences and want to explain the power-law tail in the MLD, all these additional model details are not expected to affect the results.

In this article, we demonstrate that on the genome-wide scale, the length distribution of identical homologous sequence segments in a comparative alignment is nontrivial and exhibits a power-law tail, and we propose a simple model able to explain such distributions. While paralogous sequences, which had been duplicated before the species diverged, generate a power-law tail with exponent orthologous sequences generate a power-law tail with exponent Depending on the relative amount of paralogous to orthologous sequences there is a crossover between these two power-law regimes. The exponent of the power-law tail in the comparative MLD can therefore be a litmus test for the abundance of paralogous relative to orthologous sequences, while it is usually difficult to distinguish between orthologous and paralogous sequences using classical bioinformatic methods (Studer and Robinson-Rechavi 2009; Dalquen *et al.* 2013; Gabaldón and Koonin 2013). If paralogous sequences dominate, the crossover occurs for a large value of *r* and the apparent exponent is equal to otherwise it is equal to

Our method is very easy to apply. In particular, it does not require that genomes are fully assembled as long as the continuous sequences are >1 kbp, comparable to the longest matches one expects. A natural extension of our method would be to apply it to sequences from metagenomic samples to assess relative amounts of paralogous and orthologous sequences. However, we would also have to consider horizontal gene transfer, which is common among prokaryotes and generates homologous sequence segments even between unrelated genomes. Our computational model can easily be extended to take into account these and other more complex biological processes, using, for instance, already developed tools (Dalquen *et al.* 2012). This would allow us to assess their impact on our results and will be the subject of future work.

This study shows that even very simple models can often successfully be applied to seemingly very complex phenomena in biology. We were able to present a minimal model for the evolution of homologous sequences that includes effects due to segmental duplications and evolution under selective constraints—the two processes that are responsible for a power-law tail in the length distribution of identical matching sequences.

## Footnotes

*Communicating editor: E. Eskin*Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.193912/-/DC1.

- Received June 1, 2016.
- Accepted July 26, 2016.

- Copyright © 2016 by the Genetics Society of America