## Abstract

Proteins are among the most important constituents of biological systems. Because all protein-coding genes have a noncoding ancestral form, the properties of noncoding sequences and how they shape the birth of novel proteins may influence the structure and function of all proteins. Differences between the properties of young proteins and random expectations from noncoding sequences have previously been interpreted as the result of natural selection. However, interpreting such deviations requires a yet-unattained understanding of the raw material of *de novo* gene birth and its relation to novel functional proteins. We mathematically show that the average properties and selective filtering of the “junk” polypeptides of which this raw material is composed are not the only factors influencing the properties of novel functional proteins. We find that in some biological scenarios, they also depend on the variance of the properties of junk polypeptides and their correlation with the rate of allelic turnover, which may itself depend on mutational biases. This suggests for instance that any property of polypeptides that accelerates their exploration of the sequence space could be overrepresented in novel functional proteins, even if it has a limited effect on adaptive value. To exemplify the use of our general theoretical results, we build a simple model that predicts the mean length and mean intrinsic disorder of novel functional proteins from the genomic GC content and a single evolutionary parameter. This work provides a theoretical framework that can guide the prediction and interpretation of results when studying the *de novo* emergence of protein-coding genes.

THEORETICAL and empirical studies of how species acquire new proteins have described several mechanisms with distinct effects on genomes. Most of these mechanisms, such as gene duplication (Innan and Kondrashov 2010), horizontal gene transfer (Soucy *et al.* 2015), and gene fusion (Di Roberto and Peisajovich 2014), produce novelty by tweaking and rearranging preexisting gene sequences. In contrast, the mechanism of *de novo* gene birth consists of the emergence of new genes from reading frames that were ancestrally noncoding (McLysaght and Hurst 2016). This mechanism includes emergence from noncoding DNA, but also from alternative, noncoding reading frames in coding DNA (Keese and Gibbs 1992). Although *de novo* gene birth was once thought to be highly improbable (Jacob 1977), lineage-specific genes and proteins are observed in a variety of eukaryotes (McLysaght and Guerzoni 2015), bacteria (Neuhaus *et al.* 2016), and endosymbiotic organelles (Breton *et al.* 2011), which suggests that the contribution of *de novo* gene birth to the evolution of proteomes and cellular systems is not negligible. The biological activities of these novel sequences are often hard to infer since they lack well-studied homologs, but some of them have nevertheless been shown to play important and even vital biological roles (Heinen *et al.* 2009; Chen *et al.* 2010; Reinhardt *et al.* 2013). Since *de novo* gene birth is the only source of novel protein families and thus the only genetic mechanism which can add novel protein elements to cellular networks, it may have significantly influenced the diversity of existing protein structures through the initial emergence of unrelated proteins that gave rise to major protein families (Edwards *et al.* 2013).

Although many authors agree that conservation by natural selection should be part of the definition of *de novo* genes (Schlötterer 2015; McLysaght and Hurst 2016), the exact moment in the existence of a polypeptide at which *de novo* gene birth occurs has not been agreed upon, which makes “*de novo* gene birth” and related terms confusing in practice. For clarity, we hereinafter avoid these terms. Instead, we define a classification of polypeptides into three types: junk polypeptides (JPs), novel functional polypeptides (novFPs), and derived functional polypeptides (derFPs). Figure 1 illustrates this classification in terms of the underlying evolutionary processes and their implications for the comparison of sequence properties and *cis*-regulatory properties between classes.

A JP is a polypeptide that is encoded by some open reading frame (ORF) but whose beneficial effects, if it has any, has not yet caused the loss of a mutation that modifies its sequence and/or its *cis*-regulatory properties. JPs are a very wide class of polypeptides. They may be encoded by intergenic ORFs, but also by noncoding RNA genes and alternative ORFs in protein-coding genes. A JP may also have any sequence, any *cis*-regulatory properties, and any effect on fitness, with the sole condition that its beneficial effects are either too weak or too recent to have prevented the fixation of mutations at its locus. This definition of JPs is relevant in evolutionary proteomics because they have not been shaped by purifying selection for any activity, although they may be shaped by selection against deleterious effects such as metabolic cost, aggregation, and spurious interactions with other molecules (Figure 1C). As a result, evolutionary models that explain their structural and *cis*-regulatory properties will likely not apply to other polypeptides, and vice versa. The concept of JP is a more precise version of the concept of “spurious” expression producing the raw material of *de novo* gene birth (Wilson and Masel 2011).

Once the beneficial effects of a JP eliminate a mutation which modified its sequence or its *cis*-regulatory properties, it no longer meets the definition of a JP, and yet it is identical to the JP that it recently was. We refer to such a transitory polypeptide as a novFP. We use the term “functionalization” to refer to the transition between a JP and a novFP, which is consistent with the selected-effect definition of biological function (Doolittle *et al.* 2014). Even though each novFP has the same sequence and *cis*-regulation as its last ancestral JP, there may be important statistical differences between JPs and novFPs, since only a select subset of JPs become novFPs. Although the expression of a single JP is presumably unlikely to be strongly beneficial, this barrier to functionalization may be overcome by the “testing” of a large diversity of JPs during evolution. This diversity depends on the number of JPs expressed in the population, but also on their rate of allelic turnover, *i.e.*, the rate of appearance and disappearance of JP-expressing alleles.

Once a novFP undergoes the fixation of at least one nonsynonymous or *cis*-regulatory mutation, its properties are no longer the results of a biased “draw” from the pool of JPs, because they also depend on how the beneficial effects of the polypeptide filter the mutations that modify its sequence and its *cis*-regulation. We use the term derFP for such polypeptides that have changed since their functionalization. As most of the canonical coding genes (ORFs annotated by genome databases) have divergent homologs in multiple species, it is safe to assume that the large majority of the proteins they code for meet the definition of derFPs. Along with point mutations, genetic drift, and natural selection, derFPs are known to evolve through the loss, duplication, fusion, and fission of genes, which may also influence the distributions of their properties (Figure 1C).

This classification of the whole proteome into JPs, novFPs, and derFPs leads to the division of proteomic evolution into four parallel processes (Figure 1B): (1) the allelic turnover of JPs through the evolution of the sequence, transcription, and translation of ORFs without effective purifying selection for their polypeptides’ effects; (2) functionalization, which produces novFPs by filtering JPs without modifying their sequence and their *cis*-regulation, but can involve changes in the genetic background or the environment; (3) the subsequent evolution of the pool of novFPs and derFPs through sequence changes and through the loss, duplication, fusion, and fission of genes; and (4) the decay of novFPs and derFPs into JPs through the mitigation of selection, which can be driven by mutations, environmental changes, or an increase in genetic drift. Since most well-studied polypeptides are derFPs, we know very little about the first two processes, either experimentally or theoretically. Studying the allelic turnover of JPs and their functionalization is thus essential for completing our understanding of proteome evolution.

The set of all JPs expressed by a species, which we call the junk proteome, can be seen as a collection of fixed or segregating alleles in the genome. Although the extent of the junk proteome is still unknown, there is evidence of its existence based on diverse experimental approaches: experimental studies have shown that, in a variety of organisms, a large part of intergenic DNA is transcribed into 5′-capped and polyadenylated transcripts (Jensen *et al.* 2013) which can be translated (Ingolia *et al.* 2014; Ruiz-Orera *et al.* 2014). Contrary to canonical genes, these transcripts show signs of suboptimal translation (Guttman *et al.* 2013; Durand *et al.* 2019) and rapid evolution (Neme and Tautz 2016). Additionally, the so-called untranslated regions (UTRs) of canonical transcripts and the alternative reading frames within canonical ORFs are sometimes translated into polypeptides that lack known functions (Vanderperre *et al.* 2013; Ingolia *et al.* 2014; Landry *et al.* 2015; Mouilleron *et al.* 2016) and may thus be JPs. In mice, many translated ORFs in protein-coding genes and long noncoding RNAs were shown to evolve without any detectable selective constraints on the polypeptides that they encode (Ruiz-Orera *et al.* 2018). While it would be tempting to argue that the expression level of JPs should be kept to a minimum because it would represent a cost to the cell, recent studies have shown that this cost may not be high enough to be perceived and eliminated by natural selection in many species (Lynch and Marinov 2015). In addition, JPs could be the result of a trade-off between this cost and the need for the transcriptional and translational machineries to be dynamic and reactive for the expression of derFPs, which could decrease the specificity of these machineries (Hausser *et al.* 2019).

The determinants of the properties of novFPs are of particular interest, since they constrain the properties of genes at the very roots of gene families. Several studies have inferred lineage-specific functional polypeptides (*i.e.*, novFPs and young derFPs) and compared them with ancient derFPs, using *in silico* translations of noncoding DNA and with randomly generated polypeptides. Assuming that the young derFPs inferred by these studies are largely similar to novFPs, their results suggest that novFPs typically differ from ancient derFPs by their short length, weak expression (Toll-Riera *et al.* 2009; Neme and Tautz 2013; Schlötterer 2015), peripheral position in cellular networks, and random-sequence-like secondary structure (Abrusán 2013). It has been proposed that JPs and novFPs may be largely shaped by the genomic GC content through its effects on the properties of ORFs occurring in noncoding DNA sequences (Ángyán *et al.* 2012). Correlations supporting this role of GC content were observed for many quantities computed from the sequences of ORFs encoding inferred novFPs, although the averages of these quantities often depart from random expectations based on GC content (Basile *et al.* 2017). Such discrepancies were previously interpreted as the result of natural selection (Wilson *et al.* 2017), which is in line with the intuition that the probability of functionalization of a beneficial JP increases with its positive effect on fitness. However, several aspects of polypeptide functionalization require clarification before we can confidently interpret the average properties of observed novFPs and their differences from random or noncoding sequences.

In this article, we derive general mathematical results that link the average properties (*e.g.*, average length or average structural disorder) of novFPs to those of JPs. We find that the difference between the mean of a property among novFPs and the corresponding mean among JPs is proportional to the SD of this property among JPs. We also show that such mean discrepancies between JPs and novFPs may not result from natural selection alone, but also from the correlation of polypeptide properties with either the rate of allelic turnover of JPs, their probability of functionalization, or both. To illustrate how our general equations can be used to study polypeptide properties under specific models of the distribution of properties among JPs, we use a GC-content-based random-sequence model of JPs to predict how the genomic GC content and evolutionary parameters interact to determine the mean length and mean intrinsic structural disorder (ISD) of novFPs.

## Materials and Methods

An introduction of the mathematical concepts used in the following reasoning (mainly measures and related concepts from measure theory) can be found in the Supplemental Material, in the file "SuppMat_2019-06-20.pdf".

### A general model of the link between the properties of JPs and those of novFPs

Let Ω be the space of all possible polypeptides, each characterized by its sequence and its *cis*-regulatory properties. Over a given time period, the time averages of the number of JPs belonging to each possible category of polypeptides (each subset of Ω) can be divided by the time-averaged total number of JPs to form a probability measure P on Ω. In other words, for each subset S of Ω, is the ratio of the time-averaged number of JPs that belong to S and the time-averaged total number of JPs, which implies that . Similarly, the novFPs that emerge by functionalization in the same period of time form a probability measure on Ω, such that is the proportion of novFPs that belong to S. For any polypeptide property *i.e.*, any function that assigns a number to each possible polypeptide in Ω, statistics like the mean and variance of q are defined separately for each probability measure. Hereinafter, we use the subscript F to distinguish between statistics defined for P and those defined for . For example, the mean (expected value) of a property q among JPs will be denoted by , while its mean among novFPs will be denoted by .

Because functionalizationunder corresponds to the elimination of a mutation that would otherwise have modified a JP, each novFP is identical to its ancestral JP in terms of sequence and *cis*-regulation. As a result, the probability measures P and have a special relationship: for any subset S of Ω which satisfies , it is also true that . In other words, any category of polypeptides that occurs in novFPs necessarily occurred among JPs at some point. Because of this relationship between the two measures ( is “absolutely continuous” with respect to P), the Radon–Nikodym theorem for finite measures (Vestrup 2003) implies that there exists a polypeptide property such that, for any subset S of Ω with a well-defined , we have:where is the integral of the function over the set S with respect to the measure P. By dividing each side of the equation by , we get:The right side of this equation is the definition of the conditional average of a variable x knowing an event S (Çinlar 2011), with in this specific case. Although we interpret S as a class of polypeptides rather than as an “event” and as a polypeptide property rather than as a “random variable,” these terms refer to the same mathematical objects, so that the average of among JPs that belong to S is given by:Since this equation holds for any subset S of Ω with the polypeptide property can be interpreted as the factor by which the relative frequency of a polypeptide changes from JPs to novFPs. The average of among all JPs is:This fits the intuition according to which the increase in relative frequency of certain polypeptides between JPs and novFPs should be counterbalanced by a decrease in relative frequency of other polypeptides since these frequencies are relative.

If we define as the duration of the time period considered, F as the total number of functionalization events, and J as the time-averaged number of JPs, then is the time-averaged rate of functionalization events in the subset S of polypeptide space and is the time-averaged number of JPs that belong to The ratio of these two numbers is:If we define , this ratio becomes:Therefore, r is a polypeptide property representing the rate at which each region of the space of polypeptides produces novFPs, normalized by the time-averaged number of JPs that belong to that region. The average of this rate among JPs is: is thus a normalization of r by its own mean:Given a polypeptide property such as length or ISD, representing its mean among novFPs as a function of its mean among JPs would be useful in the study of *de novo* gene birth. The polypeptide properties and r that we just defined can be used to obtain such a representation. Because of the way we defined from the probability measures P and ( is the Radon–Nikodym derivative of with respect to P), it follows (Vestrup 2003) that for any polypeptide property q, we have:In probability theory, the expected value or average of a random variable (*e.g.*, q or ) among a population represented by a probability measure (*e.g.*, or P) is defined as the integral of the variable with respect to the probability measure over the space of all possibilities (*e.g.*, Ω). Therefore, the above equation is equivalent towhere is the average of the polypeptide property among novFPs and is the average of the product among JPs.

Now that we have an expression for the average of an arbitrary property among novFPs, we can obtain an expression for the difference between this average and the average of the same property among JPs. To achieve this, we will only use universal rules from probability theory without making any assumption, such that the results apply to all biological contexts. By applying the property of covariance , we obtain:Since , we obtain Equation 1:(1)The covariance in Equation 1 depends on both the variation of q among JPs and its relation with functionalization (as represented by ). We now seek to distinguish these two factors by modifying Equation 1. By applying the definition of the Pearson correlation coefficient , we obtain:Since , we can divide the right side of the equation by :By the definition of the coefficient of variation :If we define , we obtain Equation 2:(2)Because both the coefficient of variation and the correlation coefficient are insensitive to the multiplication of variables by positive constants, we have:where k can be any positive constant. Since and is a positive constant, we obtain:

### Distinguishing the role of the allelic turnover of JPs under the assumption of their evolutionary equilibrium

According to Equation 2, the parameter which we call the birth bias, is the only determinant of the average properties of novFPs that we cannot yet interpret in terms of the properties of JPs. To do so, we now make the assumption that the junk proteome is at evolutionary equilibrium, *i.e.*, JPs from any category are gained by mutation as often as they are either lost or functionalized. Since the polypeptide property r represents the ratio of the rate of functionalization events at one point in polypeptide space to the time-averaged number of JPs located at this point, it can be understood as the product where λ is the ratio of a JP’s frequency of appearance by mutation to the time-averaged number of loci expressing this exact JP, and f is the probability that such a gain leads to the functionalization of the polypeptide (the probability of functionalization). Because we assume the evolutionary equilibrium of the junk proteome, for each JP leaving a point in the polypeptide space, another mutant JP appears at this same point. Therefore, λ is also the rate at which a JP exits the junk proteome by either allele loss or functionalization (the inverse of its expected lifetime as a JP). Since λ is both a rate of arrival and a rate of departure of JPs at each point in polypeptide space, it is a rate of allelic turnover: a region of polypeptide space where λ is low will tend to be populated by mostly the same JPs for a long time, while a region where λ is high will have a large proportion of its JPs replaced by new ones in a short time. By combining with the definition of δ, we obtain Equation 3:(3)Equation 3 provides an interpretation of δ in terms of the product of the rate of allelic turnover of JPs with their probability of functionalization. However, it does not indicate how the correlation of only one of these two factors with a polypeptide property *q* could influence the associated value of δ, and thus the mean of q among novFPs. To find an expression of δ that makes this distinction, we transform Equation 3 using general rules of probability theory, which means that the results depend on the same assumption of evolutionary equilibrium as Equation 3. Using the definitions of the Pearson correlation coefficient and the coefficient of variation, we obtain:Using the identity By decomposing the numerator as the covariance of a product of random variables according to (Bohrnstedt and Goldberger 1969):where . By dividing the numerator and the denominator by By taking the factor out of the rightmost term of the numerator:By applying the definition of the Pearson correlation coefficient three times:By cancelling and rearranging factors within terms:By applying the definition of the coefficient of variation six times:By the definition of the coskewness of three variables , we obtain Equation 4:(4)The denominator in Equation 4 is strictly positive since, by the definition of the coefficient of variation and the properties of covarianceand since both and are positive.

### Defining a random-sequence model of JPs

This section defines a simple model of the sequence of JPs that can be used to predict the mean and SD of the sequence properties of JPs, which can then be used with Equation 2 to predict the effect of the birth bias δ on novFPs. To build such a model of JPs, we made five assumptions about the DNA encoding them: (1) all sites evolve independently, (2) the transition probability matrix is constant across sites, (3) the transition probability matrix is the same on both strands, (4) each site has reached evolutionary equilibrium, and (5) a random subset of ATG codons define ORFs that are translated into JPs. Assumptions 1 and 2 allow us to focus on a single site and generalize our findings to the whole sequence. Assumptions 3 and 4 entail that if two nucleotides are Watson–Crick complements, then a given site is equally likely to display either of them. As a result, complementary nucleotides are equally frequent within and between strands, and the frequency of each of the four nucleotides is a function of GC content. Since sites are independent, GC content is the only parameter needed to predict probability distributions for the properties of randomly occurring ORFs under this model. Assumption 5 allows us to extend our predictions to the properties of JPs expressed from these ORFs. In summary, this model implies that each JP is encoded by a random ORF appearing in a random DNA sequence with a fixed GC content.

### Predicting the mean length of novFPs from the GC content and the birth bias

In the resulting model, predicting the length distribution of JPs is equivalent to predicting the distribution of the number of in-frame sense codons separating each ATG codon from the closest downstream in-frame stop codon. The frequency of each nucleotide N is a function of the GC content, which we denote by . The frequencies of the four nucleotides are given by:Since, in this model, consecutive nonoverlapping DNA 3-mers are statistically independent and have the same probability of being stop codons, the number of sense codons in an ORF follows a geometric distribution with the following probability mass function:where n is any positive integer and is the probability that a given DNA 3-mer is a stop codon. Under our assumptions, the frequency of a DNA word is equal to the product of the frequencies of the nucleotides of which it is composed. Using this principle to calculate , we get:Using standard equations for the mean and SD of a geometric distribution, we obtained the mean and SD of the length of JPs as functions of the frequency of stop codons, which is itself determined by the GC content:Combining these two equations with Equation 2 results in an expression of the average length of novFPs in terms of the GC content and δ.

### Predicting the mean ISD of novFPs from the GC content and the birth bias

To predict the mean and SD of the ISD of JPs as functions of the GC content, we randomly generated the sequences of 100,000 JPs for each value of GC content from 20 to 80% with steps of 2.5%. We computed the per-amino-acid “long” and “short” disorder scores using IUPred (Dosztányi *et al.* 2005), averaged the two types of scores separately within each sequence, and computed the mean and SD of the sequence-wide average of each score for each GC content. We then applied Equation 2 to these means and SDs to predict the mean ISD of novFPs as a function of both the GC content and the birth bias of ISD.

### Data availability

All the code used in this study has been made available, along with supplemental figures and methods, in a total of five files. The file “Notebook.ipynb” is a Jupyter notebook containing the Python2 and Bash code used to generate Figure 2 and Figure S1. The Bash code in this notebook makes use of the three Python2 scripts “simulate_junk_proteome.py,” “iupred_multi.py,” and “stats_iupred_multi.py.” The file “SuppMat_2019-06-20.pdf” contains Figure S1, its caption, and the Supplemental Methods text section. The complete procedure for simulations can be found on page 7 of the manuscript. Supplemental material available at Figshare: https://doi.org/10.25386/genetics.8304533.

## Results

### The difference in the average of a property between JPs and novFPs is proportional to the SD of this property among JPs

We consider the evolution of the proteome in a species over a time period that could be, for instance, a single branch on a phylogenetic tree. We compare two distributions on the space of possible polypeptides. The first one, which we call JPs, is the time-averaged distribution where the relative frequency of any group of polypeptides is the ratio of their time-averaged number among JPs to the time-averaged total number of JPs. The second distribution, which we call novFPs, is the distribution of all novFPs that emerge by functionalization during the time period considered. We use the term polypeptide property for a quantity that is determined by the sequence and *cis*-regulation of a polypeptide. Such properties include, for instance, the length of the polypeptide, the prevalence of some amino acid in its sequence, its proportion of disordered residues, and its expression level in a given *trans*-regulatory background. These polypeptide properties have distributions whose summary statistics can be used to compare JPs and novFPs. As illustrated in Figure 1C, the properties of JPs are expected to constrain those of novFPs. The equations shown here specify what the resulting constraints can be, and what can counteract them.

Based on the fact that any novFP must first exist as a JP before functionalizing, we find that the difference in the mean of any polypeptide property q between novFPs and JPs is given by the following equation:(1)where is the expected value (the average) of q among novFPs (the subscript F specifies that a statistic describes novFPs rather than JPs), is the average of q among JPs, is the covariance of q and among JPs, and is the factor by which the relative frequency of a specific polypeptide changes from JPs to novFPs (for instance, because of its fitness effect). Equation 1 is analogous to the Robertson–Price identity from quantitative genetics (Robertson 1966; Price 1970; Lynch and Walsh 1998) which states that during a round of natural selection in a population, the mean of a quantitative phenotypic trait changes by an amount equal to the initial covariance of this trait with relative fitness. This analogy between functionalization and natural selection is due to the fact that they both involve the comparison of distributions between two populations, where the second population (*e.g.*, novFPs or postselection individuals) only features trait values that were present in the first population (*e.g.*, JPs or preselection individuals). This is because both functionalization and natural selection act on preexisting material without producing novelty on their own.

To better distinguish the effect of the distribution of a polypeptide property among JPs from the effects of other factors, Equation 1 can be transformed into:(2)where is the SD of q among JPs, is its coefficient of variation of among JPs (the ratio of its SD to its mean), and is the Pearson correlation coefficient of q and among JPs. Because of the mathematical properties of the correlation coefficient, the value of δ does not depend on the mean and variance of q among JPs, but rather on its relation with functionalization as symbolized by We call the parameter δ the birth bias of the polypeptide property The birth bias is equal to the average difference in q between novFPs and JPs, measured in units of SD of q among JPs. It is thus analogous to the “intensity of selection” in quantitative genetics (Matsumura *et al.* 2012).

Equation 2 has implications for the use of random and noncoding controls in the study of novFPs. Such controls were often used to compute expected means (or other measures of central tendency) for polypeptide properties (Ángyán *et al.* 2012; Abrusán 2013; Basile *et al.* 2017; Wilson *et al.* 2017). According to Equation 2, the SDs of the properties of control sequences could be just as useful as their means for predicting the properties of novFPs, provided that the control is representative of real JPs. Given such a representative control, Equation 2 can be used to estimate the birth bias of a polypeptide property. Thus, to interpret average differences between JPs and novFPs given a model of JPs, we need to decompose the birth bias into contributions from different evolutionary forces.

### Neutral evolutionary forces can cause discrepancies between JPs and novFPs through the rate of allelic turnover of JPs

We sought to further dissect the birth bias (δ in Equation 2) into readily interpretable components. Once we make the additional assumption that JPs are at evolutionary equilibrium—*i.e.*, on average over time, each region of the space of possible polypeptides is entered by mutant JPs as frequently as it loses JPs through allele loss or functionalization—then the birth bias becomes:(3)where λ is the polypeptide-specific rate of allelic turnover and f is the polypeptide-specific probability of functionalization. The rate of allelic turnover λ is the inverse of the expected time from the appearance of a specific JP by mutation to either the loss of its allele or its functionalization. Because of our assumption of evolutionary equilibrium, the landscape of λ across the space of polypeptides measures how fast a single locus can explore a given region of this space. For instance, JPs that evolve slowly (*e.g.*, because of selective constraints on their toxicity and metabolic cost or a low propensity to mutation) will have low values of these polypeptides tend to persist in the population and contribute less to the exploration of polypeptide space than those with a high The polypeptide property f is the probability that the appearance of a given JP by mutation will lead to its functionalization rather than its loss through the fixation of a nonsynonymous or *cis*-regulatory change. In other words, among the events of appearance of a specific JP by mutation, f is the proportion of such events which lead to the functionalization of this JP.

Since the rate λ and the probability f each take a single value in each possible JP, they summarize which genetic backgrounds and environments a JP is likely to encounter at the moment of its appearance and during its existence, and how these factors would determine the longevity of the JP and whether or not it functionalizes. There is no obvious relationship between λ and the former is the inverse of the expected time to either one of two events (allele loss or functionalization), while the latter is the probability that one of these events (functionalization) happens before the other (allele loss). However, there are specific scenarios in which λ and f are closely related. For instance, in a case where functionalization would be mainly driven by random environmental and genetic-background changes which are equally likely to favor any JP, the probability that a JP functionalizes before allele loss would be directly proportional to its expected life span and the product would thus be a constant. This would lead to δ being zero for all polypeptide properties (Equation 3) and thus to JPs and novFPs being indistinguishable in terms of their average properties (Equation 2). In fact, they would be indistinguishable in terms of the distributions of properties (not only their averages), since the distribution of a property q is fully determined by the averages of properties of the form where n is a positive integer. In other words, an inverse proportionality between λ and f would mean that an unbiased sample of JPs become novFPs.

Because of the mathematical properties of the coefficient of variation and the correlation coefficient, the birth bias is insensitive to the scales of and As a result, each of them can be replaced with a directly proportional quantity without changing the birth bias. This may help to model the birth bias and to estimate it from the observed properties of JPs. For instance, if a model of the evolution of JPs assumed that the allelic turnover rate of a JP is directly proportional to the GC content of its ORF (their ratio is a constant), then λ could be simply replaced by the ORF’s GC content in Equation 3.

When the birth bias of a polypeptide property is not zero, its mean differs between JPs and novFPs. While Equation 3 could be used to compute this birth bias given enough data or assumptions about the junk proteome and its evolution, it does not highlight intuitive possible explanations for the existence of such a bias. To do this, and without adding any assumption to those behind Equation 3, we obtained the following expression of the birth bias using identities from probability theory:(4)where is the coskewness of the three variables and f among JPs.

The second term of the numerator in Equation 4 confirms previous intuitions about the probability of functionalization. All else being equal, an increase in the correlation between the probability of functionalization and a given polypeptide property results in an increase of this property’s birth bias and thus of its mean among novFPs (Equation 2). More surprisingly, the first term of the numerator indicates that the same relation exists between the birth bias and the rate of allelic turnover This implies that even if a given polypeptide property does not correlate positively with the probability of functionalization through a positive effect on fitness, the mean of this property can still be different between JPs and novFPs if it is positively correlated with the rate of allelic turnover of JPs. For example, if functionalization were equally likely for JPs of any given length, and if the allelic turnover of long JPs were especially fast because JPs mutate at a frequency proportional to their length, then longer polypeptides would be overrepresented among novFPs relative to JPs. In other words, the frequency of successes (events of functionalization) depends as much on the frequency of trials (the allelic turnover rate) as on the probability of success for a single trial (the probability of functionalization). Consequently, before interpreting observed differences between the average properties of novFPs and those of random sequences as the results of natural selection, we should either show or explicitly assume that these differences are not caused by neutral components of the allelic turnover rate, such as mutational biases resulting from the specific mutation spectrum of the organism under study.

The coskewness that appears in the third term of the numerator in Equation 4 is, roughly speaking, a measure of how any of three variables linearly affects the correlation between the two others (see Supplemental Material, file "SuppMat_2019-06-20.pdf" for a formal explanation). Like the correlation coefficient, coskewness does not depend directly on the means and SDs of variables. Despite the difficulty of its interpretation in the context of Equation 4, it could be estimated from data on the allelic turnover rate and functionalization probability of JPs, or predicted from a model of their evolution.

The denominator of Equation 4 is necessarily positive (proof in Supplemental Material, file "SuppMat_2019-06-20.pdf") and indicates that the overall correlation between the allelic turnover rate and the probability of functionalization negatively affects the magnitude of the birth bias. Interestingly, this term does not involve which means that its value is the same for every polypeptide property in a given species. It can be thought of as a measure of the overall tendency of the junk proteome to preferentially explore polypeptides that are likely to functionalize. It constitutes a baseline to which each source of evolutionary bias represented in the numerator must compare favorably to have a strong effect on the average properties of novFPs.

### A simple model of the mean length and mean ISD of novFPs as functions of the birth bias and the genomic GC content

Length and secondary structure are properties of polypeptides that can be studied from DNA sequences generated *in silico*, which makes them ideal targets for the modeling of JPs and novFPs. In particular, ISD, which measures a protein’s lack of stable tridimensional structure, has been a recurrent topic in previous studies of novel polypeptides (Ángyán *et al.* 2012; Basile *et al.* 2017; Wilson *et al.* 2017). To exemplify the usefulness of our general equations for modeling purposes, we used them to build a model of the mean length and mean ISD of novFPs as functions of the birth bias and the genome-wide GC content.

GC content is known to vary among genomes, typically from 20 to 70% (Long *et al.* 2018). If GC content affects the birth process of novel genes, it could make their properties highly dependent on the species’ genomic content. We built a simple model where the sequences of JPs are randomly generated by a single GC content. We used this model to predict the means and SDs of length and ISD among JPs as functions of the GC content. Length predictions were made analytically. For ISD we used the model to simulate 100,000 polypeptide sequences for each of several GC contents from 20 to 70% and we estimated their individual ISD levels using the sequence-wide average of IUPred long disorder (Dosztányi *et al.* 2005). We applied Equation 2 to the resulting means and SDs to compute the expected means of length and ISD among novFPs as functions of their respective birth biases and the GC content (Figure 2).

Figure 2 illustrates the importance of modeling both the SD and the average of a polypeptide property among JPs when trying to predict its average among novFPs. The same birth bias has a larger effect on novFPs when JPs are more diverse, just like phenotypic variation in a population makes the average of a phenotypic trait more sensitive to natural selection. In the case of polypeptide length, the SD of the length of JPs increases with GC content (Figure 2C), which makes the mean length of novFPs in GC-rich genomes especially sensitive to both neutral and selective sources of birth bias. To a lesser extent, this seems to also be the case for ISD: the mean ISD of novFPs increases less steeply with the birth bias when the GC content is low (Figure 2B), because the SD of ISD among JPs is lower (Figure 2D).

The equations and simulations underlying Figure 2 can be combined with values of genomic GC content, average length, and average ISD from the literature to estimate actual values of the birth bias in real species. We exemplify this process in the case of ISD in the budding yeast *Saccharomyces cerevisiae* and the house mouse *Mus musculus*. In the house mouse, *in silico* predictions suggest that novFPs have higher ISD than potential polypeptides encoded by intergenic DNA, which was interpreted as a result of natural selection in favor of high ISD during *de novo* gene birth (Wilson *et al.* 2017). Other results suggest that this trend may be specific to certain organisms and certain values of genomic GC content (Basile *et al.* 2017). As the average GC content of house mouse DNA is 42% (Elhaik and Graur 2014) and the average IUPred long disorder of its novFPs is close to 55% (Wilson *et al.* 2017), our model predicts that the birth bias of this specific measure of ISD should be >1 in house mouse (Figure 2B), more precisely 1.2. This value being larger than zero is consistent with the conclusion of Wilson *et al.* (2017) that novFPs appear more disordered than the raw material of *de novo* gene birth, assuming that the noncoding control sequences they used are well summarized by a single GC content that is close to 42%. By a similar reasoning, given the 38% GC content observed in yeast DNA (Engel *et al.* 2014) and the 32% average IUPred long disorder of yeast novFPs reported by Wilson *et al.* (2017), the associated birth bias should be between 0 and 1 (Figure 2B), more precisely 0.5. Under our GC-content-based model of JPs, this suggests that given the GC contents of mouse and yeast genomes, the biases of allelic turnover and functionalization in favor of disordered polypeptides are stronger in the mouse than in yeast. As shown by the various terms in Equation 4, many different scenarios could explain this trend in terms of the relation between ISD, the allelic turnover rate of JPs, and their probability of functionalization, which calls for further observation and modeling of these variables.

## Discussion

The determinants of the properties of polypeptides resulting from *de novo* gene birth were previously studied empirically by comparing them to random and noncoding sequences (Ángyán *et al.* 2012; Abrusán 2013; Basile *et al.* 2017; Lu *et al.* 2017; Wilson *et al.* 2017), but the field lacked the theoretical tools needed to interpret these comparisons in terms of evolutionary forces. We have defined a classification of polypeptides and their evolutionary history (Figure 1) that clarifies the process of *de novo* gene birth sufficiently to link the properties of its raw material to those of its products through broadly applicable equations. These equations suggest potential roles for both natural selection and neutral forces in biasing the mean properties of novFPs with respect to the properties of the raw material from which they are born. We also showed how a simple GC-content-based model of nonfunctional polypeptides can be combined with our general theoretical framework to infer evolutionary parameters of *de novo* gene birth from the properties of its products, or vice versa.

Various terms have been used to classify polypeptides in studies of *de novo* gene birth, but they do not have exact equivalents in the JP–novFP–derFP classification. It is worth noting that novFPs will be more frequent among polypeptides that are called *de novo* or “novel,” although the latter most often correspond to relatively young derFPs (McLysaght and Hurst 2016) and sometimes include JPs (Lu *et al.* 2017). On the other hand, the term “protogene” seems to encompass ORFs encoding JPs, novFPs, and young derFPs (Carvunis *et al.* 2012) since it is associated with viewing *de novo* gene birth as a continuous process.

Our framework can be used to investigate the effects of the relative importance of two types of events that may drive functionalization: (1) “external” changes in the genetic background and the environment, and (2) beneficial mutations at JP-expressing loci. In a hypothetical case where functionalization would only be driven by external changes, each JP would first appear in an unfavorable combination of genetic background and environment, and functionalization would be systematically caused by the occurrence of the “right” external change during the existence of a JP. In such an extreme scenario, the probability of functionalization of any given JP would be the product of its expected life span with the rate of external changes that lead to its functionalization. This rate of favorable external changes would thus correspond to the product in Equation 3 and would be the only determinant of the birth bias of each polypeptide property. In such a case, modeling differences between JPs and novFPs would only require modeling how the sequence and *cis*-regulation of each JP relate to the rate of external changes that favor this JP, rather than separately modeling both the rate of allelic turnover and the probability of functionalization. This scenario is especially likely if the favorable external changes in question are genetic-background changes that can be conserved by selection along with the JP that they favor. However, if the favorable changes are environmental, they may be reverted before the functionalization of the JPs that they favor. In the more specific case where the rate of favorable external changes would not vary between JPs, the birth bias would be zero for all polypeptide properties and novFPs would be undistinguishable from JPs in all regards, as we mentioned in the *Results* section.

In a hypothetical case where functionalization always follows the appearance of a JP with certain “good” properties that do not depend on the genetic background or the environment, the probability of functionalization would be essentially binary: favored JPs would functionalize with a probability of one, others with a probability of zero. The product in Equation 3 would equal the allelic turnover rate in favored JPs and would equal zero in other JPs. In such a scenario, the properties of nonfavored JPs and their rate of allelic turnover would not have to be modeled, since they would have no effect on the properties of novFPs. This extreme limitation of functionalization by the mutation of individual JPs seems especially likely in species with large effective population sizes and a stable environment. Genome evolution tends to be slower when the effective population size is large (Lanfear *et al.* 2014), which may stabilize the genetic background and thus the selection coefficients of JPs. Moreover, the weakness of genetic drift would allow natural selection to maintain beneficial JPs and eliminate their deleterious modifications, making these JPs likely to functionalize.

Although our framework can be used to develop models for a wide diversity of scenarios, these models are always formulated in terms of the distributions of properties of JPs rather than the evolution of individual JP-expressing loci. Our framework offers no ways to predict, for instance, whether or not a single mutation can modify the rate of allelic turnover of a JP without modifying its probability of functionalization. Such considerations may be useful to understand why the junk proteome reaches a particular state of equilibrium with particular correlations between the properties of JPs. However, once this state of equilibrium is modeled or empirically estimated, our equations can be applied without thinking about the evolution of individual loci.

Implications of the length distribution of JPs have been largely ignored, since studies of *de novo* gene birth usually use random or noncoding controls that are intentionally biased against short ORFs (Ángyán *et al.* 2012; Abrusán 2013; Basile *et al.* 2017; Lu *et al.* 2017; Neme *et al.* 2017; Wilson *et al.* 2017). Such practices may be partly justifiable if the contribution of short JPs to *de novo* gene birth turns out to be negligible because of slow allelic turnover or low probability of functionalization, but this remains to be shown. Even though inferred novFPs tend to be shorter than derFPs, their average length is usually at least 100 residues (Neme and Tautz 2013; Basile *et al.* 2017), which is larger than the expected mean length of JPs for common GC contents (Figure 2C). Thus, if polypeptides that were detected and classified as novel are representative of novFPs, the birth bias of polypeptide length is likely to be positive in many species. As explained in our interpretation of Equation 4, this would not necessarily mean that a long polypeptide is typically more likely to functionalize than a shorter one; it could also have a faster allelic turnover. The fact that derFPs tend to be longer than novFPs is also not conclusive evidence for such a selective advantage of length among JPs, since the evolution of derFPs may be channeled toward long polypeptides that are very different from JPs of the same length. One intuitive alternative explanation for novFPs being longer than JPs is that since the mutation rate of an ORF is proportional to its length, the length of JPs may be positively correlated with their rate of allelic turnover, which would increase the birth bias of polypeptide length as shown in Equation 4. However, the strength of this correlation depends on how much variation in the rate of allelic turnover is independent of ORF length (such as the allelic turnover of promoter sequences), and its contribution to the birth bias also depends on the overall correlation between the rate of allelic turnover and the probability of functionalization, as shown by the denominator of Equation 4. It is therefore currently difficult to tell if this effect is strong enough to fully explain the observed shift in mean length between random ORFs and those expressing novFPs, although this point could be clarified by modeling or estimating the allelic turnover rate and probability of functionalization of JPs.

Despite the ambiguity as to the causes of the apparent length difference between JPs and novFPs, this difference should increase with the SD of the length of JPs, and thus with GC content, unless this effect cancels out with a decrease of δ in GC-rich genomes (Figure 2A). For instance, the mutation spectrum of an organism affects both its genomic GC content and the relation between the sequence of an ORF and its mutation rate, which could lead to an interspecific correlation between GC content and birth bias for various polypeptide properties.

Our positive estimates of δ for the sequence-wide average of IUPred long disorder in mouse and yeast may reflect positive correlations of this polypeptide property with the allelic turnover rate and/or the functionalization probability (see Equation 4). However, IUPred long disorder is an estimator of ISD and we can only assume that the landscape of the actual average proportion of disordered amino acid residues in novFPs is similar to Figure 2B under the GC-content-based model. As a warning against this assumption, the corresponding landscape computed from IUPred short disorder differs in terms of the magnitude of δ because the means and SDs of long and short predictors of ISD among JPs have different relations to the GC content (Figure S1), even though they are both meant to estimate the proportion of disordered residues. Nevertheless, the difference between mouse and yeast in the estimated birth bias for the same measure of ISD suggests that the difference in average ISD between their novFPs is in part driven by a difference in birth bias rather than only differences in the mean and SD of ISD among JPs. Future studies may reveal that some components of the birth bias (such as mutational biases) can be predicted from the GC content, which would make the latter even more useful than Figure 2 suggests for the prediction of interspecific differences in the average properties of novFPs.

When using a model of JPs to infer the effect of birth bias on the properties of novFPs (*e.g.*, Figure 2), the use of Equation 2 is inherently valid since this equation stems from the definitions of JPs and novFPs. However, the means and SDs of the properties of JPs, which are needed to apply Equation 2, are model dependent. Therefore, the predictions that we made in Figure 2 as to the value of δ under a GC-content-based model may not apply to organisms where JPs are not well described by such a model. For instance, mammalian genomes are known to be organized into compositional domains with various GC contents (Elhaik and Graur 2014). In a study in yeast, candidate *de novo* genes had a significant tendency to be located in GC-enriched regions of the genome (Vakirlis *et al.* 2017). If several different GC contents contribute to a single junk proteome, the means and SDs of the properties of JPs may be different from those expected under our random-sequence model given the average GC content. However, it is possible that such a model will apply to each compositional domain separately, in which case the junk proteome would be readily modeled by drawing values of GC content from an appropriate distribution and using each value to generate a random polypeptide. From there, Equations 1–4 would apply just as they did for the simpler case of a single GC content.

Although the predictions that we made using a random-sequence model of JPs only involve their sequence and structure, *cis*-regulatory aspects of their expression may also be understood as polypeptide properties and analyzed using equations presented here. Transcription and translation levels of JP-encoding ORFs seem especially relevant since, as they approach zero, the probability that a JP functionalizes also goes toward zero and its other properties become irrelevant. Since transcription and translation are controlled by local sequence elements, knowledge of these elements may eventually be combined with a random-sequence model to predict the regulatory properties of JPs, like we did for their length and ISD. Studies of the transcriptional activity of synthetic random DNA in *Escherichia coli* (Yona *et al.* 2018) and yeast (de Boer *et al.* 2018) show that such sequences frequently contain the patterns required for the initiation and regulation of transcription. Factors that are external to intergenic regions also seem to play a role in the expression of JPs and their functionalization, such as bidirectional promoters (Vakirlis *et al.* 2017), translated UTRs, and translated alternative ORFs within canonical ORFs (Vanderperre *et al.* 2013). Understanding the importance of these factors may require more than a simple random-sequence model, but their effects on JPs should be “inherited” by novFPs in accordance with the general equations that we developed.

By empirical and experimental means, the reality of *de novo* gene birth has been made undeniable, and yet the community’s quantitative understanding of this process suffers from the scarcity of theoretical contributions to the field. Our results specify how knowledge of the structure, expression, and evolution of the nonfunctional proteome can be used to explain and predict the properties of novFPs. However, much of this knowledge remains to be uncovered by further empirical, experimental, and theoretical investigation.

## Acknowledgments

The authors thank C.K. Griswold, H. Martin, A.R. Carvunis, J. Masel, and three anonymous referees for their comments on the manuscript. L.N.-T. is supported by an Alexander Graham Bell Ph.D. fellowship from the Natural Sciences and Engineering Research Council (NSERC). C.R.L. is supported by a Discovery grant from NSERC and a team grant from the Fonds de Recherche du Québec – Nature et Technologies (FRQNT) and holds the Canada Research Chair in Evolutionary Cell and Systems Biology.

Author contributions: L.N.-T. conceived the project with the help of C.R.L. L.N.-T. performed all research, analysis, and interpretation. L.N.-T. wrote the manuscript with feedback from C.R.L.

## Footnotes

Supplemental material available at Figshare: https://doi.org/10.25386/genetics.8304533.

*Communicating editor: J. Masel*

- Received January 18, 2019.
- Accepted June 14, 2019.

- Copyright © 2019 by the Genetics Society of America

Available freely online through the author-supported open access option.