## Abstract

Sequence-specific interactions between proteins and DNA play a central role in DNA replication, repair, recombination, and control of gene expression. These interactions can be studied *in vitro* using microfluidics, protein-binding microarrays (PBMs), and other high-throughput techniques. Here we develop a biophysical approach to predicting protein–DNA binding specificities from high-throughput *in vitro* data. Our algorithm, called BindSter, can model alternative DNA-binding modes and multiple protein species competing for access to DNA, while rigorously taking into account all sterically allowed configurations of DNA-bound factors. BindSter can be used with a hierarchy of protein–DNA interaction models of increasing complexity, including contributions of mononucleotides, dinucleotides, and longer words to the total protein–DNA binding energy. We observe that the quality of BindSter predictions does not change significantly as some of the energy parameters vary over a sizable range. To take this degeneracy into account, we have developed a graphical representation of parameter uncertainties called IntervalLogo. We find that our simplest model, in which each nucleotide in the binding site is treated independently, performs better than previous biophysical approaches. The extensions of this model, in which contributions of longer words are also considered, result in further improvements, underscoring the importance of higher-order effects in protein–DNA energetics. In contrast, we find little evidence of multiple binding modes for the transcription factors (TFs) and experimental conditions in our data set. Furthermore, there is limited consistency in predictions for the same TF based on microfluidics and PBM data.

SEQUENCE-SPECIFIC interactions between proteins and genomic DNA control numerous cellular processes. An important class of DNA-binding proteins is transcription factors (TFs), which regulate expression of their target genes. Knowledge of *in vivo* binding locations of TFs is important for reconstructing regulatory networks of gene transcription and for understanding which factors control which genes. These locations can be mapped genome-wide, using chromatin immunoprecipitation followed by microarray hybridization (ChIP-chip) or sequencing (ChIP-seq) (Ren *et al.* 2000; Park 2009). The latest version of this technique, ChIP-exo, uses exonucleases to trim immunoprecipitated DNA fragments to precise locations of bound proteins, achieving single-base-pair precision (Rhee and Pugh 2011). Many factors influence TF binding locations in living cells, including chromatin structure, which strongly modulates DNA accessibility (Felsenfeld and Groudine 2003), and cooperative or competing interactions with other DNA-binding proteins (Siggers and Gordan 2014). However, the primary determinant of TF binding locations is intrinsic DNA sequence specificity: TFs can recognize and bind their cognate sites on the background of numerous competing sites available in the genome.

Therefore, an accurate *in vitro* picture of intrinsic TF–DNA binding affinity and specificity can provide insights into *in vivo* TF function. For example, high-affinity TF binding sites that are not occupied *in vivo* might indicate that these sites are covered with nucleosomes [the nucleosome is a fundamental unit of chromatin that packages 147 base pairs (bp) of genomic DNA into a tightly bent left-handed superhelix (Richmond and Davey 2003)]. Conversely, low-affinity sites bound *in vivo* might be a sign of indirect or cooperative binding (Gordan *et al.* 2009). Recently, several new technologies have been developed that enable high-throughput *in vitro* determination of protein–DNA binding affinities (see Stormo and Zhao 2010 for a review). Here we focus on two of these technologies: mechanically induced trapping of molecular interactions (MITOMI) (Maerkl and Quake 2007; Fordyce *et al.* 2010) and protein-binding microarrays (PBM) (Bulyk *et al.* 2001; Berger *et al.* 2006).

The MITOMI device uses microfluidics to simultaneously measure binding affinities of a TF to a few thousand DNA probe sequences. The second-generation device (Fordyce *et al.* 2010) contains 4160 unit cells; each cell has DNA probes with identical sequences, and a given sequence appears in at least 2 unit cells. All DNA probes are 70 bp long: a probe starts with CGC followed by a 52-bp variable region and a 15-bp fixed sequence at the 3′ end used for labeling and primer extension. Sequences in the variable region are designed to accommodate all 65,536 DNA 8-mers. Two fluorescent labels are used to quantify the number of surface-immobilized protein molecules (BODIPY) and protein-bound DNA probes (Cy5); the ratio of Cy5 to BODIPY fluorescence is linearly proportional to the total protein occupancy of each probe. Fordyce *et al.* (2010) report MITOMI measurements for 28 TFs from *Saccharomyces cerevisiae*, comprising 10 different families.

The PBM approach to studying protein–DNA interactions utilizes microarrays with up to tens of thousands of spots. Each spot contains double-stranded DNA probes with identical sequences; the total number of spots is high enough to allow for all possible permutations of a 10-bp-long sequence. DNA probes are fluorescently labeled to monitor the consistency of primer-directed DNA synthesis responsible for creating double-stranded DNA at each spot on the array. TF molecules are added to the array and the array is subsequently washed to remove weakly bound and unbound molecules. Finally, the number of remaining bound proteins is quantified using a fluorescent antibody (Alexa488). The antibody fluorescence intensity (normalized by the DNA fluorescence intensity at each spot; DNA is labeled with Cy3) is then proportional to the total protein occupancy of a given probe sequence.

Here we present an algorithm, called BindSter, for inferring energetics of protein–DNA interactions from high-throughput measurements such as MITOMI and PBM, which report total protein occupancies for a set of DNA probe sequences. Under the assumption of thermodynamic equilibrium, these occupancies can be computed using efficient recursive techniques (Morozov *et al.* 2009; Locke *et al.* 2010) for an arbitrary protein–DNA interaction energy model. The parameters of the model that determine binding energies can then be fitted against the data. We consider all sterically allowed configurations of proteins bound to each probe; multiple binding events, overlapping sites, and multiple DNA-bound species (including alternative binding modes of the same TF) are treated rigorously as they do not entail significant computational costs. Thus our framework is more consistent in treating steric exclusion and multiple-species competition for DNA sequence compared with previous biophysical models of protein–DNA energetics: MatrixREDUCE (Foat *et al.* 2006) and BEEML-PBM (Zhao and Stormo 2011; Zhao *et al.* 2012).

Moreover, our approach is designed to test a hierarchy of protein–DNA energy models of increasing complexity. We start with a basic mononucleotide model in which the contribution of each nucleotide in the binding site to the total interaction energy is independent of all the other nucleotides. This assumption is commonly used in constructing position-specific weight matrices (PWMs) from TF binding site data (Stormo and Fields 1998). Next, we extend the mononucleotide model in two different ways by including (a) dinucleotide contributions and (b) energies of di- and trinucleotides regardless of their position within the binding site. In addition, we check for alternative TF binding modes by fitting two models to the data either simultaneously or sequentially. Although we have focused on 28 TFs with MITOMI data in this work, we have also used PBM data to make predictions for 12 TFs, and have carried out a detailed comparison of PBM- and MITOMI-derived models. The software for our algorithm is available online at http://nucleosome.rutgers.edu/nucenergen/bindster/.

## Materials and Methods

### Physical model of protein–DNA interactions

Consider an ensemble of *M* species of particles distributed along a DNA segment of *N* bp in length. The particles can bind anywhere within the DNA segment subject to steric exclusion (that is, adjacent particles cannot overlap). Partially bounds states (*e.g.*, where the protein hangs off the DNA) are also not allowed. Proteins may bind to either DNA strand. A grand-canonical partition function for this system of DNA-bound particles is given by (1)where “conf” denotes an arbitrary configuration of DNA-bound, nonoverlapping particles, and is the total DNA-binding energy of all the particles in the configuration. For simplicity we set , where is the Boltzmann constant and *T* is the temperature; note that chemical potentials are included implicitly into

One can compute *Z* efficiently, using a recursive relation (Segal *et al.* 2006; Morozov *et al.* 2009; Locke *et al.* 2010) (2)with the initial condition (). This procedure is similar to the forward algorithm widely employed in hidden Markov models (Durbin *et al.* 1998). Here, is the forward partial partition function, is the binding energy for a species *j* particle at position *i*, is the length of the DNA footprint for a species *j* particle, and is the Heavyside step function that is equal to 1 when *i* is positive and 0 otherwise. The zeroth species represents the background (no particle), with and for all *i*.

Likewise, we can compute the partial partition functions in the reverse direction (Segal *et al.* 2006; Morozov *et al.* 2009; Locke *et al.* 2010), (3)with the initial condition (). This step is analogous to the backward algorithm (Durbin *et al.* 1998). Note that by construction. Further, the probability of starting a particle of species *j* at position *i* is given by (4)where Note that gives the probability of finding no particle at position *i*. Equation 4 has a simple interpretation: the sum of Boltzmann weights of sterically allowed configurations with a particle of species *j* bound at base pairs , divided by the sum of Boltzmann weights of all possible configurations. The expected number of particles of species *j* covering base pairs *i* (particle occupancy ) is then given by (). Finally, the average total number of particles on the entire DNA segment is given by (5)where is the average total number of particles of species *j*. Correspondingly, yields the total occupancy of the DNA segment. Taking binding to both strands into account amounts to replacing energies with “free energies” , where is the binding energy of a particle of species *j* to the site on the upper strand, and is the binding energy of a particle of species *j* to the site on the lower strand.

We have also tested a simplified DNA-binding model that lacks steric exclusion. In this model, binding of each particle at a given position is independent of all the other particles. The probability of binding is then simply (6)and the quantities and are calculated as above.

### Models of protein–DNA binding energy

#### The mononucleotide model:

The mononucleotide model assumes independent contributions to the total protein–DNA interaction energy from each nucleotide in the binding site:

(7)Here, is the total energy of binding at position *i* on a DNA segment (we suppress the species label for brevity), *L* is the binding site length, is the nucleotide at position *k* within the DNA segment, is the energy contribution from the base *α* at position *j*, and is a sequence-independent offset that subsumes the chemical potential. We constrain the parameters at each position *j* so that , ensuring that the set of fitted parameters is nondegenerate (Locke *et al.* 2010). With these constraints, the mononucleotide model has independent parameters.

#### The dinucleotide model:

The dinucleotide model includes both mono- and dinucleotide energy contributions:

(8)Here, represents the dinucleotide contribution from dinucleotide at position all other symbols are as in Equation 7. The mononucleotide parameters are constrained as before, and the dinucleotide parameters are constrained so that and for all *j*. Thus the dinucleotide model has independent parameters.

#### The k-mer model:

Another extension of the mononucleotide model adds energy contributions of longer DNA words, irrespective of where these words occur within the binding site (Locke *et al.* 2010; Chereji and Morozov 2011). In this model, the binding energy is given by (9)where *N* is the maximum length of DNA words (two or three in our fits), is a DNA word of length *n*, is the number of times that word appears within the binding site, and is its energy contribution. The energies are constrained by (10)With , there are independent k-mer parameters in addition to mononucleotide parameters.

#### The position-independent model:

We have also constructed a model where contribution of a given DNA word is independent of its position within the binding footprint: (11)All symbols are as in Equation 9, and the energy parameters are constrained by Equation 10. We use and with this model.

#### Alternative motif models:

These models are designed to account for protein binding in two distinct sequence-dependent modes. The binding energy for one of the modes is given by the mononucleotide model (Equation 7). In the two-motif model, the other mode is also described by the mononucleotide model, and both models are fitted simultaneously. In the secondary motif model, one of the mononucleotide motifs is fitted first and then held fixed (except for ), while the parameters of the “secondary” mononucleotide model are allowed to vary. Finally, in the “mononucleotide + position-independent (PI)” model the alternative binding mode is described by the PI model (Equation 11), which is fitted simultaneously with the mononucleotide model.

### PBM location bias

A known source of bias in PBM experiments is the increasing difficulty for the proteins to bind at positions closer to the glass slide (Weirauch *et al.* 2013). We account for this effect by introducing a bias “energy,” modeled as a quadratic function of the absolute position on the DNA segment: (12)Here, *i* is the position along the DNA probe, and and are fitting parameters. The bias energy can be added to any of the protein–DNA energy models described above, separately for each species *j*.

### Fitting procedure

To fit a protein–DNA interaction model to the binding data, we perform 80 separate regressions and select the set of binding parameters (the “motif”) for which the linear correlation *r* between the predicted number of particles (Equation 5) for each DNA probe and the target data (MITOMI or PBM fluorescence ratios) is the highest. Each regression starts with a randomly generated motif. At each subsequent step, the motif from the previous step is modified by adding a random value , drawn from a normal distribution of unit variance, to a single randomly chosen parameter.

When one of the parameters (except ) is modified, some of the other parameters have to be changed as well to satisfy the constraints. For example, in the mononucleotide model, if , is used to offset the shift. In the dinucleotide model, mononucleotide parameters are treated as above, and dinucleotide parameters are modified as follows: if , for all dinucleotides that have *A* at either the first or the second position, and for all other dinucleotides. In general, for words of length *n* we have (13)where *m* is the number of letters in common between and the word whose energy was shifted by (*j* is omitted for position-independent parameters).

After modifying the fitting parameters, the new *r* is evaluated. If it is better than the previous one, the new motif is accepted, and the algorithm proceeds to the next regression step. If the new *r* is not an improvement, another trial step is made. If no improvement is found after 200 trial steps, the modification that decreased the correlation least is accepted, and the algorithm proceeds to the next regression step (Tolkunov and Morozov 2012). We call this trial procedure Try2Step; see Supporting Information, File S1 for the pseudocode. After regression steps have been made, *r* of the current step is compared to *r* obtained *T* steps ago. If this difference is <0.003, the current regression ends and its best *r* and the corresponding final motif are recorded. The motif that achieved the best linear correlation among 80 separate regression runs is reported as the final prediction.

### Motif correlation

We find correlations between motifs produced by energy models of arbitrary complexity. First, we convert each energy parameter to probabilities: (14)Here, is the probability of the word at position *i* (*i* is omitted for position-independent parameters). The information content is then given by the difference between the maximum possible and the observed entropy (Crooks *et al.* 2004), (15)where Ω is the alphabet size (*e.g.*, 4 for mono- and 16 for dinucleotides). Finally, the “height” of each word is defined as (16)similarly to letter heights in standard PWM logos (Crooks *et al.* 2004).

We use word heights to compare two motifs with the same number of energy parameters. We allow one motif to be offset from the other by up to 4 bases in either direction. We also create a reverse complement of the other motif, testing 18 alignments in total. For each alignment, a linear correlation between the two sets of word heights is computed; the best alignment is that with the maximum correlation. Note that when the two motifs are offset from one another, only the subset of word heights that corresponds to the aligned part is taken into account. Figure S1 shows a histogram of motif correlations among randomly generated mononucleotide models with To generate energy parameters for random motifs, we start with all parameters set to zero. Then for each letter *α* at each position *i* we draw a number from a normal distribution with zero mean and standard deviation 2. We add this number to the current and enforce the constraints using Equation 13. In the end, we compute word heights via Equation 16. Using this histogram, we approximate the *P*-value for a mononucleotide motif correlation of *r* as the fraction of correlations The *P*-values for dinucleotide motifs are obtained similarly.

### IntervalLogo

We have devised a novel way of displaying parameter confidence intervals in sequence logos, called IntervalLogo. We define the confidence interval for each energy parameter as the range of values of that parameter within which the correlation when the parameter is changed and all the other parameters are held fixed, except to satisfy the constraints. We demonstrate our procedure using the mononucleotide model; more complex models are treated in a similar manner. Let us say that are the values of the energy parameters at the confidence boundary. Here, is the parameter modified explicitly, and are the other parameters offset to satisfy the constraint (Equation 13). Given these energies, the probability of each letter at both confidence boundaries is defined by Equation 14, the information content by Equation 15, and the word heights and by Equation 16. By construction, we enforce , where is given by Equation 16 with best-fit parameters. Note that due to the nonmonotonic relationship between energy parameters and word heights, it is possible to have and associated with either end of the confidence interval for the energy parameter in question. Occasionally, and , where is now associated with the high-energy boundary for the energy parameter, and is associated with the low-energy boundary. In this case, we simply set Likewise, if and , we set

In the IntervalLogo, the height of each letter is given by the letters are sorted by their heights at the *lower* confidence boundary, so that letters with smaller and small errors occasionally appear on top of the letters with larger but also larger uncertainties. To show confidence boundaries in the logo, we draw the letter with light, medium, and dark colors (Figure 1; the choice of the overall color is arbitrary). The light color, representing the larger word height, is painted from the top of the letter down a distance (so that if , the light color is painted over the whole letter). The dark color, representing the smaller word height, is painted from the bottom of the letter up a distance If there is no overlap, the medium color appears between the light and dark regions. Otherwise, the light and dark colors are striped in the region of overlap, and the medium color is not visible.

### Mononucleotide logos for higher-order models

We can use a conventional mononucleotide sequence logo to represent higher-order motifs. To accomplish that, we predict the probability of finding base *α* at position *i* within the binding site, for all possible sites of length *L*. For mononucleotide motifs, from Equation 14; for higher-order motifs a recursive calculation is carried out.

Specifically, we need to compute (17)where *s* is a set of all sequences of length *L*, and is the subset of these sequences with the nucleotide *α* at position *i*. To compute recursively, we first define , the product of all Boltzmann weights due to substrings of the word covering position *k*, where is given by (18)We now introduce the forward partition function (19)where , and the initial condition is that for Note that with and , Equation 18 reduces to The backward partition function is given by (20)where , , and the initial condition is that for

We then compute the full partition function from or

(21)Finally, the marginalized probability is (22)where the and terms are set to 1 whenever Specifically, if and if in Equation 22.

## Results

### Overview of the modeling procedure

We have developed an algorithm, BindSter, which predicts TF–DNA binding energetics using a biophysical approach (Figure 2). The algorithm assigns a binding energy to every site of length *L* on both DNA strands. The energy is computed using a hierarchy of models of increasing complexity (*Materials and Methods*). The most basic of these, the mononucleotide model, assumes that the contribution of each DNA base pair to the total binding energy is independent of the other base pairs in the site (Equation 7).

The mononucleotide model can be extended by including either energies of dinucleotides (adjacent base pairs; Equation 8) or k-mers of maximum length *N* that can occur anywhere within the binding site (Equation 9). The dinucleotide model is designed to capture sequence specificity of DNA bending, which plays a major role in protein–DNA recognition (Morozov *et al.* 2005; Morozov and Siggia 2007; Rohs *et al.* 2010) and is typically described at the level of base-stacking energies (Olson *et al.* 1998). The k-mer model is capable of accounting for sequence-specific experimental biases (Weirauch *et al.* 2013), as well as capturing true sequence preferences for longer motifs. Similar models have been previously used to describe sequence preferences of nucleosome formation (Tillo and Hughes 2009; Locke *et al.* 2010; Chereji and Morozov 2011). In addition, it is impractical to extend the dinucleotide model to longer words, due to an excessively large number of fitting parameters. We have also studied a PI model in which protein–DNA energetics are described purely through k-mer contributions (Equation 11).

We assume thermodynamic equilibrium between proteins and DNA probes. Under this assumption, we compute the average total number of proteins bound to each DNA probe, using efficient transfer matrix-like algorithms (*Materials and Methods*) (Morozov *et al.* 2009; Locke *et al.* 2010). Unlike previous approaches (Foat *et al.* 2006; Zhao *et al.* 2012) that assume low protein concentrations, our algorithm sums over all configurations of DNA-bound proteins allowed by steric exclusion. Furthermore, several species of DNA-binding factors can compete for positions on the DNA probe, allowing us to fit two models of protein–DNA interactions simultaneously to detect alternative binding modes.

Ordinarily, BindSter regressions start from a set of random points in the parameter space, so that the fits are not biased by initial conditions. In contrast, MatrixREDUCE employs motifs overrepresented in probe sequences as seeds for subsequent position-specific affinity matrix fits (Foat *et al.* 2006; Fordyce *et al.* 2010). Input data for BindSter come in the form of the ratio of Cy5 to BODIPY fluorescence intensity for MITOMI devices (Fordyce *et al.* 2010) and Alexa488 to Cy3 for PBMs (Berger *et al.* 2006; Badis *et al.* 2009; Zhu *et al.* 2009) (Figure 2B). Each regression attempts to maximize a linear correlation coefficient *r* between measured fluorescence ratios and predicted protein occupancy (Figure 2, C and D). Parameter optimization is carried out using a simple derivative-free algorithm that we call Try2Step (Tolkunov and Morozov 2012) (*Materials and Methods*). Briefly, the algorithm tries out modifications of the current parameter set and accepts the first one that increases *r*; if no such modifications are found after 200 steps, the parameter set that decreased *r* the least is accepted. This offers a possibility of surmounting barriers on the landscape defined by the dependence of *r* on model parameters, although most accepted steps do lead to improvements (Figure 2C). The regression is stopped once improvements to *r* become too small to matter. After 80 independent regressions, the model that yields the highest *r* is retained as the final prediction (Figure 2D). There are significant computational costs associated with our regression approach, primarily due to its global, derivative-free nature. Each prediction consists of a set of final model parameters, typically shown as a sequence logo, and of the protein occupancy profile for each DNA probe.

### Sensitivity of prediction quality to model parameter values

Each BindSter run yields a set of model parameters that fit the data best. However, focusing only on best-fit models overlooks the fact that fit quality may be insensitive to the exact values of some of the fitting parameters. If the dependence of *r* on a given parameter is weak, different algorithms may yield apparently different motifs that would in fact have equivalent predictive power. To address this issue, we have developed a graphical representation of parameter uncertainties, called an IntervalLogo (see *Materials and Methods* for details). Similar to regular logos, in IntervalLogos letter heights correspond to predictions based on the best-fit model. In addition, however, color intensities and patterns are used to display the confidence interval associated with each letter (Figure 1). The intervals are determined by increasing/decreasing model parameters one at a time until *r* drops to 0.98 of its optimal value found in the best-model fit.

### The mononucleotide model and comparison with previous work

We have fitted the mononucleotide model to 28 *S. cerevisiae* TFs with MITOMI data (Fordyce *et al.* 2010) (Figure 3A, Table S1). Our results can be compared with previous fits of a PWM-like model to the same data, which employed MatrixREDUCE (Fordyce *et al.* 2010). When the same length of the binding motif is used (), our results are very similar to MatrixREDUCE predictions (*P* = 0.31, circles in Figure 3A). However, we find that prediction quality improves when motifs are longer (compare crosses and circles in Figure 3A) and therefore use in all subsequent fits. Despite comparable overall predictive power, MatrixREDUCE and BindSter sometimes yield significantly different motifs (Table S1, Figure 3C), indicating that (a) parameter uncertainty may be significant at some positions and (b) more than one model of binding specificity may explain the same data equally well.

To investigate this issue further, we have focused on mononucleotide BindSter fits for TFs Gcn4 and Cbf1, which produced non-canonical motifs (Table S1). The canonical AP-1 motif for Gcn4 is TGA(C/G)TCA (Ellenberger *et al.* 1992), whereas our fit finds TCA(C/G)TCA; the Cbf1 motif is CACGTG (Kuras *et al.* 1996), whereas our fit finds CACGTC. To investigate whether MITOMI data could be fitted equally well with canonical motifs, we performed seeded BindSter fits. Instead of starting the fits with random energy matrices as described above, we created seed motifs that prefer the canonical sites TGA(C/G)TCA (Gcn4) and CACGTG (Cbf1). At each position, energies of preferred letters were set 4 lower than energies of nonpreferred letters. The relative position of the canonical site within the longer motif was chosen to match the unseeded prediction (we have found that varying this position does not strongly affect the quality of fit). All energies outside of the canonical site were initialized to zero. For each seed motif, we performed four Try2Step regressions, reporting the result with the highest correlation to data. Our predictions are shown in Figure S2. For Gcn4, the seeded fit had nearly equivalent predictive power ( *vs.* for the randomly initialized fit). For Cbf1, the seeded fit performed marginally less well, achieving *vs.* but yielding a significantly different motif ( with the original prediction). Since the seeded fits did not result in a significant drop in performance, it appears that limited sequence diversity and measurement noise levels in MITOMI data sets may prevent reliable differentiation between closely related models. It is also possible that the MITOMI procedure, which involves binding TFs to the surface with antibodies, may be subtly altering the properties of the protein–DNA binding interface, perhaps enhancing monomer instead of dimer binding in some cases.

Note that in all MITOMI fits we have used only the 52-bp variable region from each DNA probe; refitting the mononucleotide model with the entire 70-bp probe sequence results in a modest improvement compared to the result shown in Figure 3A: the average improvement in correlation coefficients *μ* is 0.013, with The average motif correlation between 52- and 70-bp fits is 0.86, with motifs produced by poorly fitting models being more variable.

### Steric exclusion

One advantage of BindSter is its rigorous treatment of steric exclusion, which makes our algorithm applicable to arbitrary protein concentrations (at low concentrations, multiple proteins bound to the same DNA probe are rare and steric exclusion is not important). The additional computational effort is modest since we use an efficient recursive algorithm to calculate DNA probe occupancies (*Materials and Methods*). To investigate the extent to which neglecting steric exclusion affects the quality of predictions, we have compared the mononucleotide model to a simple model without steric exclusion, which treats each binding position as independent of all the other positions on the DNA probe (*Materials and Methods*). We find that rigorous treatment of steric exclusion has minimal effect on the predictive power of the model trained on MITOMI data (Figure S3). Both the correlation coefficients (Figure S3A) and the motifs (Figure S3B) are very similar, although there are occasionally minor differences in probe occupancy profiles. For example, Rox1 profiles are somewhat different with and without steric exclusion (Figure S3C), probably due to the preference for T/C at position 2 in the latter fit (Figure S3B). Our findings are consistent with the MITOMI experimental setup in which proteins are immobilized and DNA is recruited to the surface (Fordyce *et al.* 2010). It appears that in MITOMI experiments the effective concentration of immobilized proteins is relatively low, making steric exclusion a minor factor.

### Higher-order models of protein–DNA energetics

The mononucleotide model may not fully capture the complexity of amino acid–DNA base pair interactions at the protein binding interface (Luscombe *et al.* 2001), including the role of DNA shape in protein recognition (Rohs *et al.* 2010). To determine whether energetic contributions of longer motifs improve predictive performance, we have tested several extensions of the basic mononucleotide model (Figure 3B, Figure S4, Table S1, and Table S2; see *Materials and Methods* for model definitions). The extended models yielding the best improvements are the k-mer model (where *N* is the maximum word length) and the dinucleotide model (top and middle panels in Figure 3B; Table S1). Interestingly, the k-mer model has significantly less predictive power (Figure S4, top).

Either the or the position-independent model fitted simultaneously with the mononucleotide one does not substantially improve performance (Figure S4, two middle panels). Thus there is scant evidence in MITOMI data for nearly nonspecific binding driven by preferences for short sequence motifs, regardless of their position within the binding site. Finally, whereas fitting two mononucleotide models simultaneously is beneficial (the two-motif model in Figure 3B, bottom; Table S2), fitting one mononucleotide parameter set first and then fitting the other one on the residual signal is not (the secondary motif model in Figure S4, bottom; Table S2). Since the parameter spaces for these two models are equivalent, the issue lies with convergence: simultaneous fits achieve better optimization than sequential ones, probably because of greater flexibility afforded by a higher-dimensional optimization in the former case. Correspondingly, secondary motifs are often less specific and have little predictive power by themselves (*cf*. values for the secondary model motif_{2} in Table S2). In contrast, motif_{2} fits in the two-motif model are more predictive and the logos are often visibly related to motif_{1} and mononucleotide model fits (Table S2). The differences between motif_{1} and motif_{2} predictions may sometimes be interpreted as alternative binding modes. For example, in the case of Gcn4 motif_{1} is close to the symmetric AP-1 site sequence (TGA(C/G)TCA) that corresponds to homodimeric binding (Ellenberger *et al.* 1992); motif_{2}, on the other hand, may reflect monomeric binding to the TCA half-site (note, however, that C at position 7 has a large uncertainty associated with it). Consistent with the original study (Fordyce *et al.* 2010), we do not see evidence for Gcn4 binding to longer ATF/CREB (TGACGTCA) sites (Konig and Richmond 1993) in MITOMI data.

Interestingly, motif specificity (as seen in the heights of logo letters) does not necessarily confer predictive power, as has been previously observed (Weirauch *et al.* 2013). For example, the mononucleotide model prediction for the TF Cup9 is significantly more specific than the MatrixREDUCE one (compare Figure 3C, top left and bottom left), but their linear correlation coefficients are fairly similar ( and , respectively). The dinucleotide model, which provides a substantial improvement (), yields a motif that appears less specific still, although there are several prominent dinucleotide contributions (Figure 3C, top right and bottom right). Note that the dinucleotide model’s projected logo in Figure 3C is constructed using the full dinucleotide parameter set, as described in *Materials and Methods*.

In general, it is challenging to associate improvements in performance of higher-order models with a particular physical mechanism. For example, in the case of Cup9 the k-mer and dinucleotide models yield very similar correlation coefficients (0.71 and 0.73, respectively; Table S1), yet their physical underpinnings are completely different. The dinucleotide model is designed to capture DNA base pair stacking energies, whereas the k-mer model assigns energies to short words with lengths of 2 and 3 nucleotides independently of their position within the binding site, reflecting global preferences for these short motifs. Nonetheless, in the scatter plot of predicted *vs.* observed DNA probe occupancies (Figure S5A), predictions of both models colocalize for high-occupancy probes and are more accurate compared with the mononucleotide model. Furthermore, predictions of both higher-order models correlate equally well with the mononucleotide model and again colocalize for high-occupancy probes (Figure S5B). A direct comparison of k-mer and dinucleotide models reveals that they predict very similar distributions of probe occupancies ( Figure S5C). Similar behavior is observed across the entire MITOMI data set: k-mer and dinucleotide model predictions, and in particular their deviations from the mononucleotide model, are highly correlated (Figure S5, D and E). Thus the same signal is captured using two distinct higher-order models, which generalize the mononucleotide model in different ways.

The above observations (Figure 3B, Figure S4) lead us to conclude that modifying basic mononucleotide energetics appears to be more beneficial than introducing alternative binding modes, as far as extracting protein–DNA binding specificities from MITOMI data is concerned. Indeed, according to our fits there is relatively little evidence that TFs in the MITOMI experiments we analyzed switch between binding modes and in addition to the primary motif bind an alternative motif that could be characterized by either word counts (PI models) or another mononucleotide parameter set (two-motif and secondary-motif models). The most successful approach of this type, the two-motif fit, typically yields primary and secondary motifs that are clearly related (Table S2).

### Cross-validation of BindSter fits

For every TF, the MITOMI data set provides ∼3700 DNA probe occupancy measurements; the full data set contains two complete replicates and an incomplete replicate with a subset of probes from the complete one (Fordyce *et al.* 2010). Thus the number of free parameters in all our models (*Materials and Methods*) is significantly less than the total number of available measurements. Nonetheless, we have cross-validated our mononucleotide fits by comparing the models trained on first and second complete replicates with each other and with the model trained on all available data (Figure S6, Table S3, Table S4). Models trained on single replicates produce distributions of probe occupancies that are highly correlated with the full model (Figure S6A) and have similar predictive power against the data (Table S3). The fitted model parameters also tend to be similar, as measured by motif correlation coefficients against the mononucleotide model trained on the entire data set (Figure S6B, Table S4). As a rule, predictions with high correlation coefficients against the data tend to yield nearly identical motifs (see, *e.g.*, Rox1 and Sko1 examples in Figure S6C); motifs vary more for lower-quality predictions (see, *e.g.*, the Cad1 example in Figure S6C).

It is also possible that higher-order models fit the data better simply because they have more fitting parameters. However, there is no clear connection between the number of additional parameters and model performance. For example, the best-performing models in Figure 3B and Figure S4 are the dinucleotide model (112 independent parameters compared with 31 for the mononucleotide model) and the k-mer model (67 independent parameters with ). Despite many more parameters in the dinucleotide model, its performance is quite similar to that of the k-mer model, as discussed above.

The comparison between the k-mer and mononucleotide + PI models is particularly instructive. The parameter sets employed in these two models are nearly identical, but k-mer models typically outperform their PI counterparts. The physical mechanisms of protein–DNA interactions described by the two models are very different. In the mononucleotide + PI model, TFs are assumed to bind in two distinct modes (*Materials and Methods*). In the k-mer model, there is a single binding mode whose sequence preferences are described by a combination of position-specific mononucleotide parameters and position-independent contributions from longer words. Thus merely adding parameters will not significantly improve predictive performance unless the underlying physical model captures some essential aspects of protein–DNA energetics.

Since the dinucleotide model has the most fitting parameters, we have also cross-validated it directly (Table S3, Table S4). As with the mononucleotide model, predictions of DNA probe occupancies are largely consistent across the two models trained on separate replicates, with slight loss of predictive power that becomes more pronounced if the correlation coefficients from all models (replicate 1, replicate 2, and full) against the data are low overall (Table S3). However, dinucleotide motifs fitted on replicates are much more variable than their mononucleotide counterparts (Table S4). For example, Ace2 models fitted on separate replicates have the same predictive power as the full model. Nonetheless, the corresponding motifs are correlated with the full model motif only at and , respectively. For Met31, the dinucleotide motif fit on replicate 1 is correlated with the full motif at , whereas the replicate 2 fit yields Interestingly, both motifs predict DNA probe occupancies equally well ( and 0.80, respectively) and comparably to the full model prediction (). Overall, these observations underscore the fact that multiple models, and especially multiple higher-order models, can fit MITOMI data equally well.

### Comparison of PBM and MITOMI-based predictions

Predictions of protein–DNA binding specificity should not depend on the experimental platform being used to measure protein–DNA interactions. To compare the robustness of BindSter predictions across MITOMI and PBM platforms, we have carried out mononucleotide, dinucleotide, and *N* = 3 k-mer fits for 12 TFs for which both PBM and MITOMI measurements are available (Table S5). We have also compared our results with previously published BEEML-PBM mononucleotide fits (Zhao and Stormo 2011; Zhao *et al.* 2012).

One essential difference between BindSter MITOMI and PBM fits is the use of location bias in the latter. The location bias in our model is a quadratic energy contribution with two free parameters per DNA-binding species (Equation 12). It is designed to make binding sites closer to the glass slide less favorable, due to their lower accessibility for TFs (Weirauch *et al.* 2013). Indeed, although the coefficients of the quadratic function are not constrained in any way in our fits, for most TFs they fit to values that impose energetic penalties on the proximity of the protein to the glass slide (Figure S7A). Including location bias into PBM fits leads to sizable improvements in prediction quality for mononucleotide and k-mer models (Figure S7B). In contrast, including location bias into MITOMI fits leads to , several times smaller than the improvements observed with PBM data. Although including location bias increases predictive power of PBM fits, mononucleotide models fitted with and without it typically yield similar motifs (Figure S7C).

On average, our mononucleotide predictions outperform those from the BEEML-PBM algorithm (Figure 4A, Table S5). As with MITOMI data, k-mer and dinucleotide models provide statistically significant improvements compared to mononucleotide fits. Consistency between PBM- and MITOMI-based motifs (as measured by motif correlations) improves with overall prediction quality on PBM data; in several cases higher-order models recover motifs that, when projected, are closer to the MITOMI-based mononucleotide motifs than mononucleotide models inferred from PBM data (Figure 4B). We also note that BEEML-PBM motifs have the highest average correlation with BindSter MITOMI motifs on this data set. For example, the core GACACA Cup9 motif seen in both BEEML-PBM- and MITOMI-based mononucleotide fits is recovered only at the dinucleotide or k-mer level with BindSter PBM fits (Figure 4C, Table S5). This may be due to substantial k-mer bias present in PBM data (Weirauch *et al.* 2013), which requires use of higher-order models.

Predictive power of MITOMI-trained models on PBM data is modest overall, although it does improve if location bias is included (Table S6). As a rule, PBM-based fits yield lower correlation coefficients and partial or less specific motifs. For example, for Bas1 neither BEEML-PBM nor BindSter fits recover the highly predictive motif () inferred from MITOMI data, and even the high-quality prediction of the canonical Cbf1 motif with PBM-based mono- and dinucleotide models has less information content than the corresponding MITOMI result (Figure 4C, Table S5). Note, however, that with MITOMI data the symmetric CACGTG motif is replaced with GACGTG, perhaps as a sign of an alternative binding mode; PBM fits on the other hand yield the expected symmetric motif. Interestingly, the Cbf1 motif becomes significantly more specific when the potential k-mer bias is taken into account (Table S5).

## Discussion

We have developed a biophysical algorithm, BindSter, for predicting protein–DNA binding energetics from high-throughput *in vitro* measurements of total protein occupancies on a set of DNA probes. We have focused on MITOMI, a technique that uses a microfluidic device to measure binding interactions at thermodynamic equilibrium (Maerkl and Quake 2007; Fordyce *et al.* 2010). The MITOMI binding data are thus well suited for computational modeling based on equilibrium statistical mechanics, although the MITOMI approach is less high-throughput than PBMs (Berger *et al.* 2006; Badis *et al.* 2008, 2009; Zhu *et al.* 2009).

Our computational framework differs from previously published biophysical models of protein–DNA interactions (Foat *et al.* 2006; Zhao and Stormo 2011; Zhao *et al.* 2012) in two important aspects: (a) it employs a rigorous procedure in which all sterically allowed configurations of TFs bound to DNA probes are taken into account and (b) it is designed to accommodate a variety of protein–DNA interaction energy models. Thus, in contrast to previous work, BindSter is able to provide a quantitative treatment of overlapping binding sites and multiple DNA-binding species (as may occur if the TF switches between dimeric and monomeric binding or employs distinct binding interfaces to interact with different classes of sites). This built-in flexibility with respect to describing protein–DNA energetics has allowed us to explore a hierarchy of models of increasing complexity.

The most basic mononucleotide model, in which each nucleotide in the binding site contributes independently to the total protein–DNA binding energy, has predictive power on MITOMI data comparable to that of MatrixREDUCE (Fordyce *et al.* 2010). However, many BindSter motifs look substantially different (Table S1). Some of this discrepancy may be attributed to the fact that the models are insensitive to the values of a subset of fitting parameters. Such parameters could therefore change within a sizable range with little consequences for the quality of fit. To account for this possibility, we have developed a novel way of graphically displaying the uncertainty of model parameters, called IntervalLogos.

IntervalLogos are constructed as regular logos, but with letter color intensities and patterns used to indicate the confidence interval associated with each parameter as it is varied independently (*Materials and Methods*; Figure 1). However, visual inspection of mononucleotide and MatrixREDUCE motifs in Table S1 shows that even after accounting for parameter uncertainty the motifs remain related but distinct in many cases. Thus there is degeneracy among motifs that can explain MITOMI data equally well; moreover, there is no clear correlation between the specificity of a given motif and its predictive power. Finally, the role of steric exclusion is negligible in MITOMI experiments (Figure S3), likely due to the fact that proteins are immobilized on the surface (Fordyce *et al.* 2010), resulting in low effective protein concentrations.

Among seven extensions of the mononucleotide model that we have tested, the dinucleotide and k-mer models result in the most significant improvements to prediction quality (Figure 3B, Figure S4). Both of these models extend the mononucleotide description of protein–DNA energetics by including contributions of longer words. In contrast, models that test for the presence of alternative binding motifs fair less well, indicating that TFs in MITOMI experiments tend to have a single binding mode. The most successful model of the multiple-binding-mode type, in which two mononucleotide motifs are fitted simultaneously to the data, does result in a reasonable improvement (Figure 3B); some of the fitted model pairs can be interpreted as evidence of dimeric *vs.* monomeric binding or distinct subclasses of sequence motifs (Table S2). In general, it is challenging to associate an observed improvement in model performance with a particular physical mechanism of protein–DNA interactions. For example, k-mer and dinucleotide models tend to produce correlated predictions for high-occupancy DNA probes despite the fact that they are supposed to describe distinct aspects of protein–DNA energetics and/or experimental biases (Figure S5).

Finally, we find that BindSter fits on PBM data sometimes yield substantially different motifs compared to MITOMI-based fits, despite the fact that their predictive quality is as good as or better than previously published BEEML-PBM results (Zhao and Stormo 2011; Zhao *et al.* 2012) (Figure 4, Table S5). Motif correlations between MITOMI and PBM-fit models are modest overall, although they do improve with the quality of the PBM fit (Figure 4B). Correspondingly, the ability of MITOMI-trained models to predict the distribution of probe occupancies (fluorescent intensities) observed in PBM experiments also tends to be modest (Table S6). One potential explanation is limited applicability of thermodynamic models to the analysis of PBMs (Fordyce *et al.* 2010). Another possibility is experimental biases other than the k-mer bias and the location bias that are explicitly taken into account in the BindSter framework.

## Acknowledgments

A.V.M. acknowledges support from the National Institutes of Health (R01 HG004708) and an Alfred P. Sloan Research Fellowship.

## Footnotes

*Communicating editor: G. D. Stormo*Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178384/-/DC1.

- Received December 16, 2014.
- Accepted June 10, 2015.

- Copyright © 2015 by the Genetics Society of America