## Abstract

Fast genome sequencing offers invaluable opportunities for building updated and improved models of protein sequence evolution. We here show that Single Nucleotide Polymorphisms (SNPs) can be used to build a model capable of predicting the probability of substitution between amino acids in variants of the same protein in different species. The model is based on a substitution matrix inferred from the frequency of codon interchanges observed in a suitably selected subset of human SNPs, and predicts the substitution probabilities observed in alignments between *Homo sapiens* and related species at 85–100% of sequence identity better than any other approach we are aware of. The model gradually loses its predictive power at lower sequence identity. Our results suggest that SNPs can be employed, together with multiple sequence alignment data, to model protein sequence evolution. The SNP-based substitution matrix developed in this work can be exploited to better align protein sequences of related organisms, to refine the estimate of the evolutionary distance between protein variants from related species in phylogenetic trees and, in perspective, might become a useful tool for population analysis.

- protein sequence evolution
- SNP
- substitution matrices
- protein sequence alignment
- substitution rate variability

PREDICTING substitution probabilities between amino acids is of enormous practical and conceptual importance. For instance, phylogenetic analyses are rooted on the capability of accurately predicting substitution probabilities, which are used to determine evolutionary relationships between species, and to infer the time scales of speciation events. Substitution probabilities are also used to spot, among a set of sequences, those who have more likely evolved from a common ancestor (identification of homologs). Moreover, substitution probabilities can be used to classify of the whole proteome space in protein families of common function (Finn *et al.* 2015). Substitution probabilities are traditionally learned from alignments of proteins in different species. They are then used to build substitution matrices, which measure the relative probability that two amino acids are aligned because they descend from a common ancestor rather than by chance. The simplest procedure to infer such a matrix is based on counting amino acid substitutions in given ranges of sequence identity (Henikoff and Henikoff 1992), while other models are rooted on a Markov model of evolution (Dayhoff *et al.* 1978; Gonnet *et al.* 1992; Jones *et al.* 1992; Whelan and Goldman 2001; Schneider *et al.* 2005; Kosiol *et al.* 2007; Le and Gascuel 2008).

A source of information for inferring substitution probabilities more and more accurately is offered by the recent boost in DNA sequencing (Ronaghi *et al.* 1996; Bentley *et al.* 2008; Wheeler *et al.* 2008), which provides an increasing amount of sequenced genomes (1000 Genomes Project Consortium *et al.* 2010, 2015). The analysis of genomes has allowed not only the enormous enlargement of the database of protein families, but also the identification of a large number of Single Nucleotide Polymorphisms (SNPs) (Sherry *et al.* 2001)—isolated nucleotide variations in the DNA sequence that are commonly present among the individuals of a species. SNPs are nowadays at the center of clinical research (Giacomini *et al.* 2007), and are a key tool to understand population dynamics, such as migration and selection (Kingman 1982; Rosenberg and Nordborg 2002). A question that naturally arises is what is the relationship between the polymorphisms observed within a species and the substitutions observed between variants of the same proteins in different species. The interplay between these phenomena has drawn the attention of several recent studies (Wilson *et al.* 2011; De Maio *et al.* 2015). The picture that emerges from these works, and more in general from the literature on polymorphisms, is that SNPs should not be confused with amino acid substitutions (Sawyer and Hartl 1992; Fay *et al.* 2001; Tamuri *et al.* 2012). The probability of fixation of a polymorphism depends on its positive or negative impact on the fitness of its carrier. However, its destiny, especially for nondetrimental polymorphisms, is strongly connected to that of the population hosting it. In other words, polymorphisms lie in the “no man’s land” after the occurrence of a random mutation and before its definitive fixation or rejection. If their nature and their statistics are similar to those of random mutations, to those of fixed substitutions observed in proteins of different species, or to neither, is still the object of investigation.

Here, we shed some light on this issue: we show that an appropriately selected subset of SNPs encodes an information compatible with the substitutions observed in alignments and can be used to predict substitution probabilities in pairwise alignments at high sequence identity. We select from the dbSNP database (Sherry *et al.* 2001) only the human polymorphisms characterized by no known clinical significance, and which are present in at least 1% of the population, having therefore a high probability of being nearly neutral. We demonstrate that a protein evolution model based on a substitution matrix inferred from this subset of SNPs, and taking into account substitution rate variability (Miyazawa 2011a; Rizzato *et al.* 2016), can be used to predict the transition probabilities between amino acids observed in alignments. The accuracy of this model in the range of 85–100% of sequence identity is better than that achieved using any other substitution matrix we are aware of, including LG (Le and Gascuel 2008), WAG (Whelan and Goldman 2001), JTT (Jones *et al.* 1992) and BLOSUM90 (Henikoff and Henikoff 1992) for amino acid models or ECM (Kosiol *et al.* 2007) for codon models. This result is particularly significant because the SNP-based model presented here, in contrast to all the other models against which it is benchmarked, *is not learned from alignments*. Therefore, one would expect it to be disadvantaged with respect to the others in the prediction of substitution frequencies from alignments.

We also demonstrate that the SNP-based substitution matrix can be exploited to build phylogenetic trees, especially for *Homo sapiens* and related species. We therefore foresee it will be useful in fine-grain resolution of phylogenies in evolutionary genetics and population analysis.

We will also show that the accuracy of the SNP-based model slowly degrades at medium and low sequence identities, possibly due to the relatively poor statistics of the SNP database, or to a missing ingredient in the model of protein sequence evolution used in this work. This indicates that more research is needed to develop a model capable of describing protein sequence evolution at an arbitrary evolutionary distance. However, the results presented here provide evidence that two traditionally separated data domains, SNPs, and sequence alignments, can be employed together to build such a model.

## Materials and Methods

### Download and selection of SNP data

From the dbSNP database (Sherry *et al.* 2001), we selected the polymorphisms in the coding part of the human genome. We did not consider those SNPs that are likely to entail a positive or negative structural modification, having a known clinical significance (benign, likely benign, pathogenic, and likely pathogenic).

The SNPs were downloaded on the January 14, 2017 at www.ncbi.nlm.nih.gov/snp/advanced. Query: *(“homo sapiens”[Organism]) AND ((“snp”[SNP Class]) OR “multinucleotide polymorphism”[SNP Class]) AND “missense”[Function Class] NOT(((((“benign”[Clinical Significance]) OR “likely benign”[Clinical Significance]) OR “likely pathogenic”[Clinical Significance]) OR “pathogenic”[Clinical Significance]) OR “no info”[Validation Status])* for the nonsynonymous polymorphisms, and the same query with “synonymous codon” in place of “missense” for the synonymous polymorphisms. In this manner we gathered 3,228,872 polymorphisms, 1,144,770 of which preserving the original amino acid (synonymous) and 2,084,102 changing it (nonsynonymous, or missense).

### From the SNP database to codon substitutions

The identity of the codons involved in a SNP was derived with the following procedure. We downloaded the SNPs both in *flatfile* format and in *brief* format.

From the flatfile format we extracted the following data:

The label of that SNP.

Its Global Minor Allele Frequency (GMAF), which refers to the frequency of the second most common allele of the considered polymorphism.

The two (or more) alleles, the corresponding amino acids, and the indication on the frame of the codon in which the different nucleotides were found. For example, a frame equal to one means that the polymorphism is located on the first letter of a codon.

For each of the entries, we selected from the brief format the entry with the correct label, and read the nucleotide sequence two nucleotides before and after the polymorphism. We reconstructed the codons involved in the polymorphisms from a knowledge of the allele, the frame, and the amino acid. Since the sample may be taken in two possible orientations, we selected the orientation consistent with the genetic code. We divide the entries into three subsets according to their GMAF: a set of common polymorphisms, characterized by a set of medium-rare polymorphisms, characterized by and a set of rare polymorphisms, characterized by By this procedure, we obtained 10,585 synonymous and 9772 missense SNPs with 30,319 synonymous and 32,601 missense for , and 1,064,390 synonymous and 1,989,525 missense for

As usually done for alignments, the substitution process was treated as stationary, and the substitutions from codon *c* to codon were assumed to be as frequent as those from to *c*. So, the 64 × 64 matrix of occurrences was built by setting both and to one-half of the observed interchanges between the pair of codons From the matrix of occurrences, we computed the frequency of codon interchanges:(1)This codon matrix can be summed to a 20 × 20 matrix of interchange frequencies between amino acids by: where is the set of codons coding for amino acid A.

### Substitutions involving more than one nucleotide

Nucleotide substitutions are not necessarily isolated, but can happen simultaneously on neighboring nucleotides. Estimates of the fraction of multiple simultaneous substitutions range between the 0.3% (Smith *et al.* 2003) and 3% (Schrider *et al.* 2011). The number of multiple simultaneous substitutions satisfying the query mentioned above is 387. All these entries have a GMAF <0.01; therefore, for this subset of polymorphisms, we will not quantify the effect of the allele frequency as will do for SNPs in the *Results* section.

### Computing the transition probabilities for codons and amino acids

From the frequencies of interchanges we define the nondiagonal entries of the substitution rate matrix by:(2)where is the frequency of codon *c*, which we assume to be equal to those tabulated for the human codon usage bias (Nakamura *et al.* 2000). The diagonal entries of are, instead, defined as By construction, Therefore, the evolutionary time is measured in units of expected substitutions per site.

From or any other generic matrix on codons or amino acids, we compute the transition probability from *c* to after a time *t* by following Rizzato *et al.* (2016):(3)where *r* is the substitution rate, and is its probability distribution, assumed to be -shaped (Yang 1993, 1994). We remark that this procedure is not the standard -correction implemented in the standard algorithms for phylogenetic tree reconstruction (Stamatakis 2006; Yang 2007; Price *et al.* 2010): here, no information is available on the rate at different sites, so rate variability is treated by averaging over its among-site probability distribution. The distribution depends on the shape parameter, *α*, which is known to vary from protein family to protein family, and, inside the same protein family, also in time (Gaucher *et al.* 2001, 2002; Lopez *et al.* 2002). In order to take into account the variation of *α* with evolutionary time, we apply the procedure in Miyazawa (2011a, p. 5). In this procedure, the parameter *α* is assumed to vary linearly with time. Therefore, at time *t*, where is a reference evolutionary time and is the value of *α* at that time. The ratio is the only free parameter of this approach, and here was set optimizing the value of *α* at 92.5% of sequence identity, thus obtaining

We also computed the substitution probabilities in the more traditional framework of identical rates on all sites (-- model). In this case, the transition probability from *c* to at time *t* is given by:(4)From transition probabilities between codons, one estimates the transition probabilities between amino acids as follows:(5)where stands for the set of codons coding for amino acid A.

The expected sequence identity between an initial sequence and its evolution after a time *t* can be deduced from Equation 3 or Equation 4 by:(6)This equation implicitly specifies the time at which an evolutionary model should be compared with data characterized by a given sequence identity.

### Selection of alignments and computation of experimental substitution frequencies

To test how well the evolutionary models described above predict the transition probabilities in the alignments, we computed the frequencies of amino acid interchanges, from alignments at various sequence identities, grouping them in windows of 3% of sequence identity from 50 to 99%. These frequencies were learned from UniRef (Suzek *et al.* 2007), an arrangement of the UniProt database (The UniProt Consortium 2015) that clusters sequences above a sequence identity threshold. We downloaded from UniRef the clusters at 50% of sequence identity with at least one human sequence. The alignments were downloaded on July 23, 2015 from UniRef at www.uniprot.org/help/uniref with query: [query:count:[2 TO *] length:[50 TO *] taxonomy:Homo sapiens (Human) [9606] AND identity:0.5 ].

From each cluster, we detected the human sequence and aligned it with the other sequences. Sequences were aligned locally by the algorithm *water* (Smith and Waterman 1981) in the *emboss* software package (Rice *et al.* 2000). Only ungapped fractions of the pairwise alignments with ≤ 80 residues were considered. In order to avoid weighting bigger clusters more, from each cluster we collected at most one alignment per window of sequence identity. For each sequence identity window the average sequence identity () was computed, and the mismatches in the alignments for every amino acid pair were counted. The substitution process was treated as an equilibrium one: and were set equal to one-half of the number of mismatches between the unordered pair of amino acids A and B. The entries where were neglected, being affected by too large statistical errors.

From the frequency of interchanges in alignments at we also computed a matrix with an analogous procedure to that used to compute from (Equation 2).

### Download and implementation of benchmark models

We compared our evolutionary model with JTT (Jones *et al.* 1992), LG (Le and Gascuel 2008), ECM unrestricted (Kosiol *et al.* 2007), WAG (Whelan and Goldman 2001), and BLOSUM90 (Henikoff and Henikoff 1992). Here, we describe our implementation of these models.

The JTT matrix was deduced from the counts (Table 1) in the original paper (Jones *et al.* 1992) by the same procedure described there. The corresponding matrix was obtained by inverting Equation 4.

The LG matrix was reconstructed from the website of one of its authors (www.atgc-montpellier.fr/download/datasets/models/lg_LG.PAML.txt).

The ECM-unrestricted matrix was reconstructed from the Supplementary material of the original paper (Kosiol *et al.* 2007), and was chosen instead of its restricted version because it was proved by the authors to give better results.

The WAG matrix was reconstructed from the website of one of its authors (www.ebi.ac.uk/goldman/WAG/wag.dat).

The BLOSUM90 matrix was downloaded from the site of NCBI (www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/data/BLOSUM90). To adapt BLOSUM scores to our definition (see Equation 9), which is deprived of any prefactor and computed by a natural logarithm, we multiplied them by

In all cases, except BLOSUM, we derived the substitution probabilities at the desired times from Equation 4 for the no- version, or by Equation 3 in the version.

### Maximum likelihood tests

We also benchmarked the quality of our model by computing the likelihood of phylogenetic trees with respect to other substitution models. Maximum likelihood tests were performed using the PAML software package (Yang 2007) on reference alignments and phylogenetic trees downloaded from the Phylome Database (PhylomeDB v4, data accessed on February 1, 2016) (Huerta-Cepas *et al.* 2014). PhylomeDB provides a collection of highly accurate gene phylogenies for a wide variety of gene families and species. PhylomeDB phylogenies are derived from maximum likelihood tree inference, alignment trimming, and evolutionary model testing.

The reference Primates and Model Species Metaphylomes, seeded on *H. sapiens*, were downloaded from this database (Phylome IDs: 098 and 0500, respectively). We only analyzed multiple sequence alignments (and their corresponding phylogenetic trees) characterized by 500–900 pairs of protein sequences, with average sequence identity in the range 45–99%. This filter produces two reduced sets of 111 and 230 alignments, respectively, for the two phylomes.

The PAML software package was used to compute maximum likelihoods, keeping tree topologies intact and optimizing branch lengths. The standard -correction, as implemented in the PAML software, was included, and the free parameter *α* was optimized during the maximum likelihood estimations for all the tested models. We computed maximum likelihood values for each of these reference phylogenies using the the JTT (Jones *et al.* 1992), the WAG (Whelan and Goldman 2001), and the LG (Le and Gascuel 2008) evolutionary models.

The matrix used here is a reduction on amino acids if compared to that in Equation 2:

(7)### Data availability

The matrix obtained with our analysis is available as Supplemental Material, File S2, along with the obtained likelihoods for the two analyzed phylomes (File S3 and File S4). Furthermore, File S1 contains all the remaining supplementary materials. All the data analyzed in this paper were downloaded directly from public databases (dbSNP, UniRef, and PhylomeDB).

## Results

### The effect of global minor allele frequency

In this work, we propose to use a selection of SNPs to build an evolutionary model of protein sequences. Before estimating a substitution rate matrix from the SNPs, we attempt to quantify the effect of the fitness-related natural selection on the frequency of polymorphisms observed in the database. With this in mind, we divide our collection of SNPs into three subsets corresponding to different GMAF (see *Materials and Methods*). SNPs with a large GMAF are common, and are therefore more likely to get fixed by natural selection. Rare polymorphisms, characterized by a low GMAF, more likely resemble the properties of a random mutation on which fitness-related natural selection has not yet completed its action. We compare the properties of the subset at medium GMAF (label M, ), and the subset at high GMAF (label H, ). We compute the frequency of substitutions between codons *c* and in both datasets, and the corresponding error according to the Poisson statistics. For each pair of different codons, we compute:(8)where the denominator is an estimate of the statistical error on the difference at the numerator. Figure 1 shows the cumulative distribution function (cdf) of In the same figure, we plot the cdf of a Gaussian distribution with average 0 and SD 1. The similarity between the two curves shows that the difference () between the *H* and *M* datasets can be ascribed to statistical errors. We can then conclude that, for the subset of the SNPs chosen by our procedure, there is no evident bias in the frequency of observing a polymorphism at to medium and high range of GMAF. We then repeated the same test, comparing the set of SNPs at medium + high GMAF and a set with (label low, L). The cdf of estimated for the sets L and M+H is also shown in the figure, and is significantly different from the cdf of the considered Gaussian. This implies that the differences in the frequency of substitutions between the L and the M + H sets cannot be fully ascribed to statistical errors. This, according to our interpretation, means that the fitness-related natural selection has roughly completed its work for while this has not yet happened for rarer polymorphisms. Another possible interpretation is that selection has not completed its action even for polymorphisms at large GMAF, and that the sets M and H are indistinguishable by chance. However, in the light of the results that we are going to present, this last interpretation does not seem likely.

### The frequency of substitutions from SNPs and from alignments

We then compared the frequency of substitutions between amino acids *A* and *B* in our SNP dataset, and, the frequency from pairwise alignments at sequence identity namely at very high sequence identity (see *Materials and Methods*). In Figure 2, we plot the entry-by-entry comparison of these frequencies, both with error bars estimated according to Poisson statistics. The red points correspond to the amino acid pairs whose codons differ by only one nucleotide. For these entries, is estimated considering only substitutions with For these entries, the correlation is good, even if the statistical error is not large enough to explain entirely the deviations from the diagonal: statistical errors cover ∼30% of the difference.

The blue points correspond to amino acid pairs whose codons differ by two or three nucleotides. For example, histidine is coded by CAT and CAC, and phenylalanine by TTT and TTC. Therefore, at least two substitutions are necessary to transform histidine to phenylalanine. Multiple simultaneous substitutions are present in dbSNP, even if they are almost invariably associated with a low GMAF, and are then excluded when selecting only entries with We therefore estimated the frequency of polymorphisms associated with two nucleotide substitutions in the same codon without applying the filter on GMAF. The fraction of multiple nucleotide polymorphisms obtained in this manner is 0.46% of the total, a number of the same order of magnitude of the estimate in Smith *et al.* (2003), but low with respect to other estimates (Averof *et al.* 2000; Schrider *et al.* 2011). These entries are affected by large statistical errors, and possibly even by systematic errors, since they include entries with a very low GMAF. However, a mild correlation seems to show up.

This analysis indicates that the frequencies of amino acid substitutions estimated from the dbSNP database and from the alignments are correlated, but with deviations that cannot be fully ascribed to statistical errors. The deviations are particularly severe for entries associated to multiple nucleotide substitutions. In the next section we will show that the inconsistencies observed in Figure 2 can be largely accounted for by taking mutation rate variability into account.

### Prediction of substitution probabilities

We now show that, by taking mutation rate variability into account, the frequencies of interchanges between codons derived from our selection of SNPs can be used to predict substitution probabilities in alignments. As described in *Materials and Methods*, we derive from a substitution rate matrix on codons, . From this matrix, the transition probabilities between codons and amino acids are estimated by assuming that the substitution rates are -distributed (Equation 3).

To evaluate the quality of a model of protein sequence evolution we analyze the scores:(9)where is the transition probability from amino acid A to amino acid B in the evolutionary time *t* corresponding to the sequence identity is the frequency of amino acid *B*, and the shall here be intended as the natural logarithm. To check if the dynamics predicted by our model is a good descriptor of the real one, we compare with the equivalent score extracted from ungapped pairwise sequence alignments at the same sequence identity (see *Materials and Methods*).

In Figure 3A, we show the entry-by-entry comparison of in alignments at ∼92.5% of sequence identity and derived from our model (Equation 3) at the same sequence identity. For the SNP model, the parameter *α* of the distribution (see *Materials and Methods*) was set to after optimization (see Figure S5 in File S1). In such plots, a good model is characterized by points lying along the line with the smallest deviations. Amino acid pairs whose interchange may be determined by a single nucleotide change are identified by red circles, while those pairs for which this is not possible are labeled by blues crosses. The points lie near the line in both subsets, proving that the model correctly predicts the substitution probabilities in the alignments at 92.5% of sequence identity for both single and multiple nucleotide substitutions.

In Figure 3B, we compare (*x*-axis) and obtained by estimating the transition probabilities with Equation 4, namely, by assuming that protein sites evolve with the same rate. Contrary to panel (A), here the comparison is not very good, with deviations comparable to those observed in Figure 2. Even if the points lie in the proximity of the line the scores for the amino acid interchanges where double or triple mutations are necessary are systematically underestimated. It is evident that taking the variability of substitution rates makes the -based model more accurate.

In the other panels of Figure 3, we compare with the scores from some popular models for protein sequence evolution: JTT (Jones *et al.* 1992) in panel (C), LG (Le and Gascuel 2008) in panel (D), the codon matrix ECM-unrestricted (Kosiol *et al.* 2007) in panel (E), and BLOSUM90 (Henikoff and Henikoff 1992) in panel (F). For each model, time has been chosen in order to attain a sequence identity of 92.5% (see *Materials and Methods*). For JTT, LG, ECM, and also for the WAG matrix (Whelan and Goldman 2001) that we will consider below, we evolved the respective matrices both with and without considering the variation of rates across sites, similarly to what done for We checked that, in all those cases, computing transition probabilities by averaging over the distribution of the rates (Equation 3) worsens the performances rather than improving them (see Figure S1 in File S1). Therefore, in the comparison in Figure 3, and in all the following comparisons, we use the version (Equation 4). This may seem counterintuitive, since it is well known that the -correction tends to improve phylogenetic estimations with all these models. However, the distribution is only included on average, without associating a specific rate to each site (Miyazawa 2011a; Rizzato *et al.* 2016).

It is clear from Figure 3 that, at the sequence identity of 92.5%, none of the analyzed models estimates the probability of multiple substitutions more accurately than while the -- underestimates them, the standard models (JTT, LG, ECM-unrestricted and BLOSUM) overestimate them. BLOSUM90, in particular, dramatically fails to reproduce experimental data. This is somehow expected. In fact, while all the other models considered here are based on an evolutionary model of substitutions, BLOSUM90 is learned from conserved blocks of multiple sequence alignments (Henikoff and Henikoff 1991), whose maximum sequence identity is 90% and without any explicit lower bound. As a consequence, while BLOSUM90 is perfectly suited to score alignments at medium and low sequence identity, it is not optimal for scoring those at high sequence identity. The same argumentation can be extended to any other BLOSUM matrix.

To assess the quality of the score in a more quantitative way, we computed the average distance from the diagonal of the points in an entry-by-entry plot (as those in the panels of Figure 3) divided by the variance of the data:(10)where

Lower values of *δ* imply better predictions. In Figure 4, we plot the value of *δ* for all the models in the sequence identity range 75–100%. In particular, the thick dashed line describes the performance of the model obtained with fixed value of In the sequence identity range of 80–100%, this model performs comparably or even better than the JTT model, and outperforms the other tested models. However, in the lower sequence identity range, this model loses much of its predictive power (see also Figure S3 in File S1). This may be due to the approximation implicit in Equation 3 that rates remain constant in time, which is known not to be true (Fitch and Markowitz 1970; Penny *et al.* 2001; Lopez *et al.* 2002). Gaucher *et al.* (2001), (2002) observed that this phenomenon leads to a growth of the shape parameter *α* with evolutionary time. In order to approximately take this effect into account, we allowed *α* to vary linearly in time according to the model in Miyazawa (2011a) (see *Materials and Methods*), and obtained the performance described by the thick solid line in Figure 4. We here take at an evolutionary time corresponding to a sequence identity of 92.5%. Therefore, the model is identical to at fixed *α* at this sequence identity, while uses larger values of *α* at lower sequence identity. Even if this approach describes the time variability of the rates only approximately, it improves the performances of the model without adding any extra parameter.

### Likelihood tests on phylogenetic trees

In order to further benchmark the robustness of our model of protein sequence evolution, we perform likelihood ratio tests, using the PAML software package (Yang 2007), of our and other popular models, using reference datasets of phylogenetic trees and multiple sequence alignments on amino acids retrieved from the Phylome Database (Huerta-Cepas *et al.* 2014). Here, we use a reduced rate matrix on amino acids instead of that on codons, as described in *Materials and Methods*. Two phylomes are considered: one containing homologous sequences of closely related species (only primates), and another covering a wide diversity of species (mammals, birds, insects, plants, fungi, bacteria…). On a collection of 111 multiple sequence alignments of primates and their corresponding phylogenetic trees, our model provides the best likelihood values over the four tested models (SNP, JTT, WAG, and LG) for 48% of the phylogenetic trees being assessed (see File S3). Surprisingly, these results are obtained at practically any level of average sequence identity (40–99%) in this first dataset (see Table 1). This indicates that our model can also be considered for phylogenetic inference from multiple alignments of sequences in evolutionary close species, such as the primate phylome tested here. These tests are performed for each model by including the standard -correction implemented in the PAML software and, during the optimization, branch lengths are optimized keeping tree topologies fixed. We are confident that the quality of these results could be further improved if using our substitution rate matrix on codons instead of its reduction on amino acids, since they lead to different dynamics (Kosiol and Goldman 2011; Miyazawa 2013). When the same test is performed on a phylome that covers a wide variety of species (labeled as *Multiple species* in Table 1), the likelihood improves only for alignments at very high average sequence identity (see File S4). Indeed, our model is derived only from data at extremely high sequence identity and from the same species (*Homo sapiens* SNPs). However, also in the multiple species phylome, half of the phylogeny reconstructions in the range of high sequence identity can be improved with our SNP model.

## Discussion

Our results indicate that a selection of relatively common and nearly neutral SNPs provide information on the probability of amino acid substitution quantitatively consistent with alignments at high sequence identity. In particular, we verified that our selection of SNPs, when combined with an adequate treatment of the substitution rate variability, can be used successfully to learn scoring matrices and score alignments at high sequence identity (80–100%), with better performances with respect to the scoring matrices used nowadays. The capability of the model of reproducing substitution probabilities is remarkable if we consider that, at variance with all the other considered substitution matrices, it does not require learning the transition probabilities from alignments.

An important result of our analysis is that substitution frequencies learned from SNPs can be used to predict amino acid substitution probabilities *only if rate variability is taken into account* (Rizzato *et al.* 2016). Indeed, the performance obtained by a -based model in which this effect is not included is very bad (see Figure 4). On the contrary, taking into account rate variability worsens the performance for more standard models such as JTT, LG, WAG, and ECM (Figure S1 in File S1). A possible interpretation of these results is that standard models are learned from alignments at medium sequence identity. Then, the instantaneous rate matrices of these models are consistent with an effective Markovian dynamics, in which the effects of the degeneration of the genetic code (Kosiol and Goldman 2011) and of the rate variability (Rizzato *et al.* 2016) are averaged out. Instead, the matrix obtained from the SNPs describes evolution at extremely short evolutionary times. As a consequence, to give accurate results, it should be evolved with a dynamics that accounts for the among-site variability of the rates. In order to verify this scenario, we computed a substitution rate matrix from pairwise alignments with sequence identity >98%, and analyze its performances when using Equation 3 and Equation 4, respectively (Figure S6 in File S1). As for the Q matrix learned from SNPs, we find a clear improvement when accounting for the rate variability. This suggests that instantaneous substitution matrices learned at very short evolutionary times improve their performances when the rate variability is accounted for.

The observed worsening of the performances at lower sequence identities (see Figure 4 and Figure S3 in File S1) is likely to be related to the combination of several effects. First of all, in our model, the transition probabilities are estimated by extrapolating at large evolutionary time the information collected from the SNPs, namely at very short evolutionary times. This procedure unavoidably boosts the effect of errors in the training set of SNPs. A second weakness of our model is the treatment of multiple instantaneous substitutions, whose importance has already been established by other works (Miyazawa 2011b). These multiple substitutions are here approximately taken into account by measuring their frequency in the dbSNP database. However, the multiple substitutions in the database are almost invariably associated with a very low GMAF. Therefore, for these substitutions, it was not possible to check the effect of the GMAF on their frequency by the procedure followed for isolated substitutions, and this might introduce systematic errors. To investigate if a nonoptimal treatment of multiple instantaneous substitutions induces the worsening of performances at low sequence identities, we compared the performances of our SNP-derived model with a modified version of the JTT matrix where multiple substitutions are not allowed (Figure S4 and Figure S2 in File S1). The trend of the performance of the SNP model, and of the modified-JTT model, as a function of sequence identity are very similar. This indicates that the degradation of the performances at low sequence identity may indeed be ascribed to a nonoptimal treatment of multiple instantaneous substitutions. A last possible source of error in the model is our treatment of the among-site rate variability, which is taken into account only on average by Equation 3, and by the approach described in Miyazawa (2011a). Indeed, a separate optimization at different sequence identity of the parameter in the distribution improves the predictions, even if only marginally. This indicates that the model in Miyazawa (2011a) only approximates the true evolutionary dynamics, in which, for example, substitution rates at different sites are known to be correlated (Fitch and Markowitz 1970).

We also demonstrated that the SNP-based substitution matrix provides marginally better estimations of phylogenetic relationships in species closely related to *H. sapiens*, even when codon information is not available. The different performances between the two considered phylomes can be ascribed to the fact that both the primate phylome and the SNPs are specific for *H. sapiens* and share features, such as the equilibrium probability of amino acids, which improve the predictive power.

Even if, at high sequence identity, homology can be detected also by generic models, the use of more specific models for highly similar sequences can correct small local misalignments and errors in the alignment scores, and in the calculation of pairwise distances. Scoring the alignments with the approach introduced in this work may therefore become relevant in the framework of massive human genome sequencing projects aimed at deciphering human genetic variations among populations (1000 Genomes Project Consortium *et al.* 2015).

## Acknowledgments

We would like to acknowledge fruitful discussions with Edoardo Sarti, Stefano Zamuner, and Michele Allegra. This work was supported by Associazione Italiana per la Ricerca sul Cancro 5 per mille (grant 12214 to A.R. and A.L.).

## Footnotes

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.117.300078/-/DC1.

*Communicating editor: R. Nielsen*

- Received November 23, 2016.
- Accepted July 18, 2017.

- Copyright © 2017 by the Genetics Society of America