Abstract
Reconstructing a physical map of a chromosome from a genomic library presents a central computational problem in genetics. Physical map reconstruction in the presence of errors is a problem of high computational complexity that provides the motivation for parallel computing. Parallelization strategies for an approach to physical map reconstruction based on maximum-likelihood estimation are presented. The estimation procedure entails a gradient descent search for determining the optimal spacings between probes for a given probe ordering. The optimal probe ordering is determined using a stochastic optimization algorithm such as simulated annealing or microcanonical annealing. A two-level parallelization strategy is proposed wherein the gradient descent search is parallelized at the lower level and the stochastic optimization algorithm is simultaneously parallelized at the higher level. Implementation and experimental results on a distributed-memory multiprocessor cluster running the parallel virtual machine (PVM) environment are presented using simulated and real hybridization data.
GENERATION of entire chromosomal maps has been a central problem in genetics right from its early years (Sturtevant 1913). These maps are central to the understanding of the structure of genes, their function, their transmission, and their evolution. Recent advances in molecular genetics have led to a wealth of DNA markers along a chromosome and also eased the process by which these markers can be assayed. Consequently, the focus of current research has shifted from data collection to the computational problem of map assembly.
Chromosomal maps fall into two broad categories—genetic maps and physical maps. Genetic maps represent an ordering of genetic markers along a chromosome where the distance between two genetic markers is related to their recombination frequency. Genetic maps are typically of low resolution, i.e., 1–10 Mb (Lander and Green 1987). While genetic maps enable a scientist to narrow the search for genes to a particular chromosomal region, it is a physical map that ultimately allows the recovery and molecular manipulation of genes of interest. A physical map is defined as an ordering of distinguishable (i.e., sequenced) DNA fragments called clones or contigs by their position along the entire chromosome where the clones may or may not contain genetic markers. The physical mapping problem is therefore one of reconstructing the order of clones and determining their position along the chromosome. A physical map has a much higher resolution than a genetic map of the same chromosome, i.e., 10–100 kb (Brody et al. 1991). Physical maps have provided fundamental insights into gene development, gene organization, chromosome structure, recombination, and the role of sex in evolution and have also provided a means for the recovery and direct molecular manipulation of genes of interest (Prade et al. 1997).
Several techniques exist for generation of physical maps from contig libraries. These techniques are specific to an experimental protocol and the type of data collected, for example, mapping by nonunique probes (Alizadeh et al. 1995), mapping by unique probes (Alizadeh et al. 1994; Greenberg and Istrail 1995; Jain and Myers 1997), mapping by unique end-probes (Christof et al. 1997), mapping using restriction fragments (Fasulo et al. 1997; Jiang and Karp 1997), mapping using radiation-hybrid data (Ben-Dor and Chor 1997; Slonim et al. 1997), and optical mapping (Muthukrishnan and Parida 1997; Karp and Shamir 1998; Lee et al. 1998). Likewise, several computational techniques based on deterministic optimization and stochastic optimization have been reported in the literature in the context of physical mapping. Examples of stochastic optimization algorithms include simulated annealing (Cuticchia et al. 1992, 1993; Mott et al. 1993; Alizadeh et al. 1994, 1995) and the random cost algorithm (Wang et al. 1994a), whereas those of deterministic optimization algorithms include linear programming (Jain and Myers 1997), integer programming (Christof et al. 1997), integer linear programming with polyhedral combinatorics (Christof and Kececioglu 1999), and semidefinite programming (Chor and Sudan 1995). Various statistical analyses of the aforementioned physical mapping techniques have also been reported in the literature (Lander and Waterman 1988; Arratia et al. 1991; Zhang and Marr 1993; Balding 1994; Nelson and Speed 1994; Xiong et al. 1996; Wilson et al. 1997).
The physical mapping protocol: The physical mapping protocol essentially determines the nature of clonal data and the probe selection procedure. The physical mapping protocol adhered to in this project is the one based on sampling without replacement (Fu et al. 1992). This protocol has been used successfully in physical mapping of Aspergillus nidulans (Brody et al. 1991; Prade et al. 1997), Schizosaccharomyces pombe (Mizukami et al. 1993), and Pneumocystis carinii (Arnold and Cushion 1997) and is currently being used in physical mapping projects of A. flavus and Neurospora crassa (Aign et al. 2001; Kelkar et al. 2001) under the Fungal Genome Initiative (Arnold 1997; Bennett 1997).
The protocol that generates the probe set:
The clone-probe overlap pattern is represented in the form of a binary hybridization matrix H, where the matrix element H_{ij} denotes the hybridization of the ith clone with the jth probe.
In reality, the hybridization experiments are rarely error free. The hybridization matrix H could be expected to contain false positives and false negatives. The matrix element H_{ij} would be a false positive if H_{ij} = 1 (denoting hybridization of the ith clone with the jth probe) when in fact H_{ij} = 0. Conversely, H_{ij} would be a false negative if H_{ij} = 0 when in fact H_{ij} = 1. Other sources of error include chimerism wherein a single clone samples two or more distinct regions of a chromosome, deletions wherein certain regions of the chromosome are not sampled during the cloning process, and repeats wherein a clone samples a region of the chromosome with repetitive DNA structure. In this article, we confine ourselves to errors in the form of false positives and false negatives. Since the clones (and probes) in the mapping projects that use the aforementioned protocol are generated using cosmids, which makes them sufficiently small (∼40 kb), chimerism and deletions do not pose a serious problem. However, repeats do pose a problem but are not explicitly addressed here; rather they are treated as multiple isolated incidences of false positives.
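The error model just described—independent false positives with probability ρ and false negatives with probability η—can be illustrated with a small simulation. This is an illustrative sketch, not code from the project, and the function name is ours:

```python
import random

def corrupt_hybridization_matrix(H, rho, eta, rng):
    """Flip entries of a binary clone-probe hybridization matrix.

    A true 0 becomes a false positive with probability rho;
    a true 1 becomes a false negative with probability eta.
    """
    noisy = []
    for row in H:
        noisy_row = []
        for h in row:
            if h == 0:
                noisy_row.append(1 if rng.random() < rho else 0)
            else:
                noisy_row.append(0 if rng.random() < eta else 1)
        noisy.append(noisy_row)
    return noisy

# Hypothetical true overlap pattern for 3 clones and 4 probes.
H_true = [[1, 1, 0, 0],
          [0, 1, 1, 0],
          [0, 0, 1, 1]]
H_observed = corrupt_hybridization_matrix(H_true, rho=0.05, eta=0.10,
                                          rng=random.Random(42))
```

The estimator described below must recover the probe ordering from a matrix like `H_observed`, in which some entries have been flipped.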
In this article we present a maximum-likelihood (ML) estimator (Shete 1998; Kececioglu et al. 2000) that determines the ordering of probes in the probe set.
MATERIALS AND METHODS
Cosmid libraries used to construct the physical map of N. crassa discussed in this article are described in Kelkar et al. (2001). The physical mapping data were generated by DNA hybridization as described in Kelkar et al. (2001). Assignments of markers to physical and genetic maps were achieved by complementation, hybridization, and cosmid end sequencing as described in Kelkar et al. (2001).
MATHEMATICAL FORMULATION OF THE ML ESTIMATOR
In this section we present a brief synopsis of the ML estimator proposed in Kececioglu et al. (2000) and Shete (1998). The estimator reconstructs the ordering of probes in the probe set.
The probe ordering problem can be formally stated as follows. Given a set of n probes, a set of k clones, and an observed clone-probe hybridization matrix H with false-positive rate ρ and false-negative rate η, determine the probe ordering Π and the interprobe spacing vector Y that maximize the likelihood of observing H.
The problem as stated above is ill posed in the precise sense defined by Hadamard (1923). A problem is deemed well posed when its solution exists and is unique. A problem is ill posed when no solution exists or, if a solution does exist, it is not unique. In our case, the problem is ill posed since the underlying constraints do not imply a unique solution. Any probe ordering Π and any interprobe spacing vector Y that satisfy the underlying constraints constitute a candidate solution.
Mathematical notation: The mathematical notation used in the formulation of the maximumlikelihood estimator is given below:
N, length of the chromosome;
M, length of a clone/probe;
n, number of probes;
k, number of clones;
ρ, probability of false positive;
η, probability of false negative;
H = ((h_{i,j})), 1 ≤ i ≤ k, 1 ≤ j ≤ n; clone-probe hybridization matrix, where h_{i,j} = 1 if the ith clone hybridizes with the jth probe and h_{i,j} = 0 otherwise;
Π = (π_{1},..., π_{n}), permutation of {1, 2,..., n}, which denotes the probe labels in the ordering when scanned from left to right along the chromosome;
The model: Given a vector of interprobe spacings Y = (Y_{1}, ..., Y_{n+1}), there are 2^{n+1} possible cases to consider depending on whether 0 ≤ Y_{i} ≤ M or Y_{i} > M, where 1 ≤ i ≤ n + 1. Without loss of generality, we present the maximum-likelihood model for n = 3 and illustrate 3 of the 2^{4} = 16 possible cases.
Case 1: 0 ≤ Y_{1} ≤ M, 0 ≤ Y_{2} ≤ M, 0 ≤ Y_{3} ≤ M, and 0 ≤ Y_{4} ≤ M as depicted in Figure 2.
Case 2: 0 ≤ Y_{1} ≤ M, 0 ≤ Y_{2} ≤ M, 0 ≤ Y_{3} ≤ M, and Y_{4} > M as depicted in Figure 3.
Case 3: Y_{1} > M, 0 ≤ Y_{2} ≤ M, 0 ≤ Y_{3} ≤ M, and 0 ≤ Y_{4} ≤ M as depicted in Figure 4.
In Case 1 above, if the left end of a clone falls in regions A, C, or E, then the clone will hybridize only with P_{π1}, P_{π2}, or P_{π3}, respectively (Figure 2). Similarly, if the left end of a clone falls in region B or D, then the clone will hybridize with both P_{π1} and P_{π2} or P_{π2} and P_{π3}, respectively (Figure 2). In Case 2 above, if the left end of a clone falls in region E, then the clone will hybridize with only P_{π3} and if it falls in region F, then the clone will not hybridize with any of the probes (Figure 3). Similarly, in Case 3 above, if the left end of a clone falls in region A, the clone will not hybridize with any of the probes (Figure 4). Therefore, from the above three cases it is easy to see that, in general, there will be three different types of regions, namely,
Type 1: Both region between probes P_{πj} and P_{πj+1}, for j = 1, ..., n − 1. A clone hybridizes to both probes if its left end falls in this region.
Type 2: Only region of probe P_{πj}, for j = 1, ..., n. A clone hybridizes to P_{πj} only if its left end falls in this region.
Type 3: None region after probe P_{πj}, for j = 0, ..., n. A clone hybridizes to no probe if its left end falls in this region. Here probe P_{π0} is referred to as the beginning of the chromosome.
It can be shown that, for j = 1, ..., n − 1, the length of the Both region between probes P_{πj} and P_{πj+1}, the length of the Only region of probe P_{πj}, and the length of the None region after probe P_{πj} can each be expressed in terms of the clone length M and the interprobe spacings Y.
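The Both/Only/None region structure can be explored empirically with a sketch like the following, assuming clones and probes of equal length M laid out left to right and that any positive overlap between a clone and a probe produces a hybridization signal (the function names are ours):

```python
def probe_starts(Y, M):
    """Left endpoints of n probes of length M laid out with interprobe
    spacings Y = (Y_1, ..., Y_{n+1}): gap before the first probe, gaps
    between consecutive probes, and the gap after the last probe."""
    starts, pos = [], 0.0
    for gap in Y[:-1]:
        pos += gap          # skip the gap preceding this probe
        starts.append(pos)
        pos += M            # skip over the probe itself
    return starts

def hybridizing_probes(x, starts, M):
    """Indices of probes that a clone of length M with left end x
    overlaps (assumed here to imply hybridization)."""
    return [j for j, a in enumerate(starts)
            if x < a + M and x + M > a]
```

For example, with M = 5 and Y = (2, 3, 10, 2), a clone whose left end falls between the first two probes (spacing 3 < M) lands in a Both region, while the spacing 10 > M creates a None region after the second probe, matching Cases 1–3 above.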
The conditional probability of observing a clonal signature H_{i} (i.e., the ith row in the hybridization matrix H), given a probe ordering Π and an interprobe spacing vector Y, is given by
For a given probe ordering Π and interprobe spacing vector Y, event O_{i,j} implies that only h_{i,πj} = 1 and all the remaining entries in row H_{i} should be equal to 0. In other words, h_{i,πj} ≠ 1 implies a false negative and a 1 in any other column position in the row H_{i} implies a false positive. That is,
Similarly, for a given probe ordering Π and interprobe spacing vector Y, event B_{i,j} implies that only h_{i,πj} and h_{i,πj+1} should be equal to 1 and all the remaining entries of row H_{i} should be equal to 0. That is,
Hence,
Finally, for a given probe ordering Π and interprobe spacing vector Y, event N_{i,j} implies that all the elements of row H_{i} should be equal to 0. That is, for k = 1,..., n,
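All three event probabilities share the same structure: entries at the "true" positions contribute (1 − η) or η, and all other entries contribute ρ or (1 − ρ). A minimal sketch (our own helper, not the paper's code):

```python
def prob_row_given_true(row, true_ones, rho, eta):
    """P(observed clonal signature | the positions in true_ones truly
    hybridize).  A 1 outside true_ones is a false positive (prob rho);
    a 0 inside true_ones is a false negative (prob eta)."""
    p = 1.0
    for j, h in enumerate(row):
        if j in true_ones:
            p *= (1.0 - eta) if h == 1 else eta
        else:
            p *= rho if h == 1 else (1.0 - rho)
    return p

# Event O_{i,j}: true_ones = {j}; event B_{i,j}: true_ones = {j, j+1};
# event N_{i,j}: true_ones = {} (all observed 1's are false positives).
```

For instance, observing the signature (1, 0, 0) under event O_{i,1} has probability (1 − η)(1 − ρ)².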
Alternatively, consider the negative log-likelihood function f(Π, Y) given by
Computation of the ML estimate: The ML estimate is computed in two stages:
Stage 1: First determine the optimal spacing Ŷ_{Π} for a given probe ordering Π; i.e., determine the Ŷ_{Π} = (Ŷ_{1}, ..., Ŷ_{n+1}) that, for the given Π, minimizes f(Π, Y) over all feasible Y.
Stage 2: Second determine the probe ordering Π̂ that minimizes f(Π, Ŷ_{Π}).
Here the minimum is taken over all Π, where Π is a permutation of {1, ..., n}. The resulting values of Π̂ and Ŷ_{Π̂} constitute the ML estimate.
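The two-stage structure can be sketched as follows, with exhaustive enumeration standing in for the SA/MCA outer search (feasible only for very small n; the function names and callback signatures are ours):

```python
import itertools

def ml_estimate(probes, neg_log_likelihood, optimal_spacings):
    """Two-stage ML estimation with an exhaustive outer search.

    Stage 1: optimal_spacings(order) -> spacing vector minimizing the
    negative log-likelihood for a fixed probe ordering.
    Stage 2: choose the ordering whose optimal spacings give the
    smallest negative log-likelihood.  In practice the outer search is
    performed by SA or MCA rather than enumeration, since the number
    of orderings grows as n!.
    """
    best = None
    for order in itertools.permutations(probes):
        Y = optimal_spacings(order)          # stage 1 (inner problem)
        f = neg_log_likelihood(order, Y)     # evaluate f(order, Y)
        if best is None or f < best[2]:
            best = (order, Y, f)             # stage 2 (outer problem)
    return best
```

The factorial growth of the outer search is precisely what motivates the stochastic optimization and, later, its parallelization.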
Computation of Ŷ_{Π}: A region
The objective function f_{Π}(Y) (Equation 27) can be expressed as
Using the fact that (d^{2}/ds^{2})f_{i}(Y) = 0 along the ray, it can be shown that
Consider the four disjoint subregions
The local minimum of f_{Π}(Y) in each of the aforementioned four disjoint subregions within
The initial value of Y = (Y_{1}, ..., Y_{n+1}) can be determined in one of several ways; the simplest solution is to assign (N − nM)/(n + 1) to each of the Y_{i}'s, i.e., distribute the value of N − nM equally among the (n + 1) Y_{i}'s. Having obtained an initial value for Ŷ, the gradient of the negative log-likelihood function is computed. A function decreases most rapidly in the direction of the local negative (i.e., downhill) gradient. The local downhill gradient is given by
The current value of Ŷ = Ŷ_{old} is updated by moving along the downhill gradient direction U = −∇f(Π, Y)|_{Y=Ŷ_{old}}. The new value of Ŷ = Ŷ_{new} is given by
Having obtained the value of s^{*}, the new interprobe spacings are given by
The above constraints impose bounds on s given by
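For simple bound constraints of the form 0 ≤ Y_i ≤ M, the bound on s along a direction U can be computed as in this sketch (the helper name is ours):

```python
def max_feasible_step(Y, U, M):
    """Largest step s >= 0 such that Y + s*U stays within 0 <= Y_i <= M.

    A component of U moving toward a bound limits the step; components
    moving away from both bounds impose no limit.
    """
    s_max = float("inf")
    for y, u in zip(Y, U):
        if u < 0:                       # heading toward the lower bound 0
            s_max = min(s_max, -y / u)
        elif u > 0:                     # heading toward the upper bound M
            s_max = min(s_max, (M - y) / u)
    return s_max
```

The line search for s^{*} is then confined to the interval [0, s_max].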
The gradient computation and the solution update steps of the steepest descent method are continued until the gradient vector attains a magnitude less than a predefined threshold. During this iterative process, the current value of Y may land on one or more of the boundaries of the feasible region in which the solution is located. These boundaries are defined by the constraints on the Y_{i}'s for i = 1, ..., n as discussed earlier. If the current value of Y lies on one or more of these boundaries and the downhill gradient vector points outside the feasible region, the value of s is forced to zero and the iterative procedure halts even though the gradient vector has not vanished.
The above difficulty can be overcome by projecting the downhill gradient vector U onto the feasible region. Note that all the boundary conditions on the Y_{i}’s for i = 1,..., n are hyperplanes. Suppose that U is directed outside the feasible region, and we have k hyperplanes (corresponding to k boundary conditions) that force s to be equal to zero. In this situation one needs to project U onto the intersection of these hyperplanes. Let
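For the axis-aligned hyperplanes arising from bound constraints, projecting U onto the intersection of the active hyperplanes reduces to zeroing the components of U that point out of the feasible region, as in this sketch (the helper name is ours):

```python
def project_direction(Y, U, M, tol=1e-12):
    """Project the search direction U onto the active bound constraints.

    For axis-aligned hyperplanes Y_i = 0 and Y_i = M, projecting onto
    their intersection simply zeroes every component of U that points
    out of the feasible region at an active bound; the remaining
    components are unchanged.
    """
    P = list(U)
    for i, (y, u) in enumerate(zip(Y, U)):
        if (y <= tol and u < 0) or (y >= M - tol and u > 0):
            P[i] = 0.0
    return P
```

With the projected direction, the step bound is strictly positive again and the descent can continue along the boundary.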
Computation of Π̂:
A single iteration of a stochastic hill-climbing search algorithm consists of three phases: (i) perturb, (ii) evaluate, and (iii) decide. In the perturb phase, the current solution x_{i} to a multivariate objective function E(x), which is to be minimized, is systematically perturbed to yield another candidate solution x_{j}. Here, the probe ordering is perturbed by reversing the ordering within a block of probes whose endpoints are chosen at random. In the evaluate phase, E(x_{j}) is computed. Here, the function f(Π, Ŷ_{Π}) (Equation 27) is computed. In the decide phase, x_{j} is accepted and replaces x_{i} probabilistically, using a stochastic decision function. The stochastic decision function is annealed in a manner such that the search process resembles a random search in the earlier stages and a greedy local search or a deterministic hill-climbing search in the later stages. The major difference between simulated annealing and microcanonical annealing arises from the difference in the stochastic decision function used in the decide phase. Their common feature is that, starting from an initial solution, they generate, in the limit, an ergodic Markov chain of solution states that asymptotically converges to a stationary Boltzmann distribution (Aarts and Korst 1989). The Boltzmann distribution asymptotically converges to a globally optimal solution when subjected to the annealing process (Geman and Geman 1984).
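The perturb-evaluate-decide iteration with block-reversal perturbation might be sketched as follows (the decision function is passed in, since SA and MCA differ only there; the function names are ours):

```python
import random

def block_reversal(order, rng):
    """Perturb a probe ordering by reversing a randomly chosen block."""
    i, j = sorted(rng.sample(range(len(order)), 2))
    return order[:i] + order[i:j + 1][::-1] + order[j + 1:]

def anneal_step(order, energy, decide, rng):
    """One perturb-evaluate-decide iteration of stochastic hill climbing.

    decide(delta_e, rng) -> bool is the algorithm-specific stochastic
    decision function (Metropolis for SA, demon-based for MCA).
    """
    candidate = block_reversal(order, rng)             # perturb
    delta_e = energy(candidate) - energy(order)        # evaluate
    return candidate if decide(delta_e, rng) else order  # decide
```

A purely greedy decision function, `lambda d, rng: d < 0`, recovers deterministic hill climbing.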
Simulated annealing: In the decide phase of SA, the new candidate solution x_{j} is accepted with probability p, which is computed using the Metropolis function (Metropolis et al. 1953)
The Metropolis function and the Boltzmann function give SA the capability of climbing out of local minima. Several iterations of SA are carried out for a given value of T and are referred to as an annealing step. The value of T is systematically reduced using an annealing function. As can be seen from Equations 45 and 46, at sufficiently high temperatures SA resembles a completely random search, whereas at lower temperatures it acquires the characteristics of a deterministic hillclimbing (i.e., local or greedy) search.
SA generates an asymptotically ergodic (and hence stationary) Markov chain of solution states at a given temperature using Equations 45 and 46. Logarithmic annealing schedules of the form T_{k} = R/log k for some value of R > 0 have been shown to be asymptotically good (Geman and Geman 1984); i.e., they ensure asymptotic convergence to a global minimum with unit probability in the limit k → ∞ (Geman and Geman 1984). The convergence criterion used here was the fact that the value of the objective function had not changed for the past k annealing steps. A geometric annealing schedule of the form T_{k+1} = αT_{k} was used where α = 0.9. Although the geometric annealing schedule does not strictly ensure asymptotic convergence to a global optimum as does the logarithmic annealing schedule, it is much faster and has been found to yield very good solutions in practice (Romeo and Sangiovanni-Vincentelli 1991).
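A sketch of the Metropolis acceptance test and the geometric schedule T_{k+1} = 0.9 T_{k} described above (the function names are ours):

```python
import math
import random

def metropolis_accept(delta_e, T, rng):
    """Metropolis criterion: always accept improvements; accept an
    uphill move with probability exp(-delta_e / T)."""
    return delta_e <= 0 or rng.random() < math.exp(-delta_e / T)

def geometric_schedule(T0, alpha=0.9):
    """Yield T0, alpha*T0, alpha^2*T0, ... (the schedule used here)."""
    T = T0
    while True:
        yield T
        T *= alpha
```

At high T almost every move is accepted (random search); as T shrinks geometrically, uphill moves become vanishingly likely and the search turns greedy.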
Microcanonical annealing: MCA models a physical system whose total energy, i.e., the sum of kinetic energy and potential energy, is always conserved. The potential energy of the system is the multivariate objective function E(x) to be minimized, whereas the kinetic energy E_{k} > 0 is represented by a demon or a collection of demons. In the latter case, the total kinetic energy is the sum of all the demon energies. The demon energy (or energies) serve(s) to provide the system with an extra degree (or degrees) of freedom thus enabling MCA to escape from local minima.
In the decide phase of MCA, if E(x_{j}) < E(x_{i}), then x_{j} is accepted as the new solution. If E(x_{j}) ≥ E(x_{i}), then x_{j} is accepted as the new solution only if E_{k} ≥ E(x_{j}) − E(x_{i}). If E(x_{j}) ≥ E(x_{i}) and E_{k} < E(x_{j}) − E(x_{i}), then the current solution x_{i} is retained. In the event that x_{j} is accepted as the new solution, the kinetic energy demon is updated to E_{k} − (E(x_{j}) − E(x_{i})), so that the total energy is conserved.
In the context of probe ordering, a kinetic energy demon is assigned to each distinct pair of probes. A square symmetric matrix E_{k} of size n × n, where n is the number of probes, is used to store the demon energy for each probe pair. Thus, an entry E_{k}(i, j) refers to the kinetic energy of the demon associated with the ith and jth probes. As in the case of SA, the convergence criterion used was that the value of the objective function had not changed for the past k annealing steps. A geometric annealing schedule analogous to the one used for SA was applied to the demon energies.
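The decide phase of MCA with a per-probe-pair demon matrix might look like this sketch (the function name is ours; the demon update follows the energy-conservation rule given above):

```python
def mca_decide(delta_e, demons, i, j):
    """Microcanonical decide phase for a move touching probes i and j.

    A downhill move is always accepted and its energy gain is banked in
    the demon; an uphill move is accepted only if the demon for this
    probe pair can pay for it.  Total (potential + kinetic) energy is
    conserved.  Returns True if the move is accepted.
    """
    if delta_e < 0 or demons[i][j] >= delta_e:
        demons[i][j] -= delta_e        # demon absorbs or supplies energy
        demons[j][i] = demons[i][j]    # keep the matrix symmetric
        return True
    return False
```

Because the demon energy is always nonnegative, the system can never borrow energy it does not have, which is what distinguishes MCA's hard acceptance threshold from SA's probabilistic one.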
In spite of the robustness of SA and MCA to the presence of local minima in the solution landscape, the annealing schedule needed for asymptotic convergence is computationally intensive. This provides the motivation for the parallel computation of the maximumlikelihood estimator.
PARALLEL COMPUTATION OF THE MAXIMUMLIKELIHOOD ESTIMATOR
Computation of the maximumlikelihood estimator entails two levels of parallelism corresponding to the two stages of optimization discussed in the previous section:
Level 1: Parallel computation of the optimal interprobe spacing Ŷ_{Π} for a given probe ordering Π (Equation 27). This entails parallelization of the gradient descent search procedure for constrained optimization in the continuous domain.
Level 2: Parallel computation of the optimal probe ordering (Equation 28). This entails parallelization of the stochastic hill-climbing search procedure (SA or MCA) for optimization in the discrete domain.
Both levels of parallel computation were implemented on PVM (Sunderam 1990), which is a software environment designed to exploit parallel/distributed computing across a variety of computing platforms. PVM is based on a distributedmemory messagepassing paradigm of parallel computing. The interested reader is referred to Geist et al. (1994) for a more detailed description of PVM.
Parallel stochastic hill-climbing search: Parallelization of annealing algorithms has been attempted by several researchers, especially in the areas of computer-aided design, image processing, and operations research (Greening 1990; Azencott 1992). Parallelization strategies for the SA and MCA algorithms can be categorized as follows:
A. Functional parallelism within a move, where the task of evaluating each move is decomposed into subtasks that are performed in parallel by multiple processors (Wong and Fiebrich 1987).
B. Control parallelism with multiple active iterations, with processors engaged in speculative computation (Witte et al. 1991).
C. Control parallelism with parallel Markov chain generation using a systolic algorithm (Aarts et al. 1986; Greening 1990; Kim and Kim 1990; Azencott 1992).
D. Control parallelism with multiple searches of the solution space, where the searches could be interacting, in which case the processors exchange data periodically, or noninteracting, in which case the processors proceed independently of each other (Azencott 1992; Lee 1995).
E. Data parallelism with the state variables in a multivariate solution vector distributed among the individual processors in the multiprocessor architecture, accompanied by either (1) parallel evaluation of multiple moves with acceptance of a single move (Casotto et al. 1987), (2) parallel evaluation of multiple moves with acceptance of multiple noninteracting moves (Jayaraman and Rutenbar 1987), or (3) parallel evaluation and acceptance of multiple moves (Banerjee et al. 1990).
The authors’ earlier work in chromosome reconstruction involved ordering clones, i.e., rows of the hybridization matrix H as shown in Table 1 for physical map generation (Cuticchia et al. 1992, 1993). The clones were ordered on the basis of minimization of the total pairwise Hamming distance between the binary hybridization signatures of successive clones in a given permutation. This clone ordering problem was also shown to be isomorphic to the optimal linear arrangement problem (Garey and Johnson 1979), and SA and MCA were used to arrive at the optimal clone ordering, which was represented by a global minimum of the total pairwise Hamming distance objective function. The authors’ efforts at parallelizing SA and MCA in the context of the aforementioned clone ordering problem showed control parallelism based on multiple searches to be the most promising for implementation on a distributed memory multiprocessor platform such as PVM (Bhandarkar et al. 1996a,b, 1998; Bhandarkar 1997; Bhandarkar and Machaka 1997). Since a candidate solution in the serial SA or MCA algorithm can be considered to be an element of an asymptotically ergodic first-order Markov chain of solution states, two models of parallel SA (PSA) and parallel MCA (PMCA) algorithms were formulated and implemented based on the distribution of the Markov chain of solution states on a workstation cluster running PVM. These models incorporate the parallelization strategies discussed under item (D) above and are described below:
The noninteracting local Markov chain (NILM) PSA and PMCA algorithms.
The periodically interacting local Markov chain (PILM) PSA and PMCA algorithms.
In the NILM PSA and NILM PMCA algorithms, each processor within the PVM system runs an independent version of the serial SA or MCA algorithm. In essence, there are as many Markov chains of solution states as there are physical processors within the PVM system. Each Markov chain is local to a given processor and at any instant each processor maintains a candidate solution that is an element of its local Markov chain of solution states. The serial SA and MCA algorithms run asynchronously on each processor; i.e., at each temperature or kinetic energy value each processor iterates through the perturb-evaluate-decide cycle concurrently (but asynchronously) with all the other processors.
The perturbation function uses a parallel random number generator to generate the Markov chains of solution states. By assigning a distinct seed to each processor at the start of execution, it is ensured that each processor contains a Markov chain of solution states that is independent from those in most or all of the other processors. The evaluation function and the decision function are executed concurrently on the solution state within each processor. On termination of the annealing processes on all the processors, the best solution is selected from among all the solutions available on the individual processors. The NILM model is essentially that of multiple independent (i.e., noninteracting) searches as described in item (D) above.
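The NILM scheme—independent, distinctly seeded chains with a final selection of the best result—can be sketched sequentially; a real implementation would run the chains as concurrent PVM tasks (the function names are ours):

```python
import random

def nilm_search(seeds, anneal):
    """Noninteracting local Markov chains: one independent annealing
    run per (simulated) processor, each with its own distinctly seeded
    random number generator; the best final solution wins.

    anneal(rng) -> (solution, energy) is any serial SA/MCA run.
    """
    results = [anneal(random.Random(seed)) for seed in seeds]
    return min(results, key=lambda r: r[1])
```

The distinct seeds play the role of the parallel random number generator: they keep each chain's trajectory (effectively) independent of the others.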
The PILM PSA and PILM PMCA algorithms are similar to their NILM counterparts except for one major difference. Just before the parameter T or E_{k} is updated using the annealing function, the best candidate solution from among those in all the processors is selected and duplicated on all the other processors. The goal of this synchronization procedure is to focus the search in the more promising regions of the solution space. This suggests that the PILM PSA or PILM PMCA algorithm should be potentially superior to its NILM counterpart. It should be noted that in the case of all four algorithms, PILM PSA, PILM PMCA, NILM PSA, and NILM PMCA, a single Markov chain of solution states is generated entirely within a single processor. The PILM model is essentially that of multiple periodically interacting searches as described in item (D) above.
In the case of all the four algorithms, NILM PSA, NILM PMCA, PILM PSA, and PILM PMCA, a master process is used as the overall controlling process. The master process runs on one of the processors within the PVM system. The master process spawns child processes on each of the other processors within the PVM system, broadcasts the data subsets needed by each child process, collects the final results from each of the child processes, and terminates the child processes. The master process, in addition to the abovementioned functions for task initiation, task coordination, and task termination, also runs its own version of the SA or MCA algorithm just as does any of its child processes.
In the case of the PILM PSA and the PILM PMCA algorithms, before the parameter T or E_{k} is updated, the master process collects the results from each child process along with its own result, broadcasts the best result to all the child processes, and also replaces its own result with the best result. The master process updates its temperature or kinetic energy value using the annealing schedule and proceeds with its local version of the SA or MCA algorithm. On convergence, the master process collects the final results from each of the child processes along with its own, selects the best result as the final solution, and terminates the child processes.
Each of the child processes in the PILM PSA and the PILM PMCA algorithms receives the initial parameters from the master process and runs its local version of the SA or MCA algorithm. At the end of each annealing step, each child process conveys its result to the master process, receives the best result thus far from the master process, and replaces its result with the best result thus far before proceeding with the next annealing step. On convergence, each child process conveys its result to the master process. The master process and child process for the PILM PSA and the PILM PMCA algorithms are depicted in Figures 5 and 6, respectively.
The master and child processes for the NILM PSA and NILM PMCA algorithms are similar to those of their PILM counterparts except for the absence of the process coordination phase (Figures 5 and 6) in the former. Note that it is during the process coordination phase at the end of each annealing step that the master process and the child processes interact in the PILM PSA and PILM PMCA algorithms.
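The process-coordination phase that distinguishes PILM from NILM reduces to a gather-select-broadcast step, sketched here without actual message passing (the function name is ours):

```python
def pilm_sync(states, energies):
    """Process-coordination phase of PILM PSA/PMCA.

    At the end of an annealing step the master gathers every chain's
    current solution and objective value, selects the best, and
    broadcasts it so that all chains resume the next annealing step
    from the most promising solution found so far.
    """
    best = min(range(len(states)), key=lambda i: energies[i])
    n = len(states)
    return [states[best]] * n, [energies[best]] * n
```

In the PVM implementation the gather and broadcast are message exchanges between the master and child processes; here they are modeled as list operations.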
Parallel gradient descent: Deterministic optimization techniques such as steepest descent and conjugate gradient descent are generally used for unconstrained optimization in the continuous domain (Polak 1997). However, the steepest descent procedure in this case needs to be adapted to the fact that the solution space of the interprobe spacings is constrained, since 0 ≤ Y_{i} ≤ M for i = 1, ..., n. One of the well-known problems with the steepest descent method is that it takes several small steps while descending a valley in the solution landscape, which usually causes it to be much slower compared to other techniques in its class (Polak 1997). The conjugate gradient descent (CGD) procedure, on the other hand, is known to be one of the fastest in the class of gradient-descent-based optimization methods (Hestenes 1980; Hestenes and Stiefel 1952).
The conjugate gradient descent procedure is very similar to the steepest descent procedure, differing only in the directions followed while minimizing the objective function. Instead of consistently following the local downhill gradient (i.e., the direction of steepest descent), a set of n mutually conjugate direction vectors is generated from the downhill gradient vector, where n is the dimensionality of the solution space. The conjugacy condition ensures that minimization along any given direction vector does not jeopardize the minimization along another direction vector within this set. Unlike the steepest descent algorithm, the conjugate gradient descent algorithm guarantees convergence to a local minimum of a quadratic objective function within n steps. For this reason, the conjugate gradient descent algorithm was chosen instead of the steepest descent algorithm in the current implementation. The serial conjugate gradient descent algorithm is depicted in Figure 7. The Hessian matrix is usually required to generate a set of conjugate vectors for minimizing a particular objective function (Dorny 1980; Kincaid and Cheney 1991). However, Press et al. (1988) describe a method for generating conjugate direction vectors without the need for evaluating the Hessian matrix. The conjugate gradient descent algorithm depicted in Figure 7 is an adaptation of the one presented in Press et al. (1988).
Due to its inherently sequential nature, data parallelism was deemed to be the appropriate parallelization scheme for the conjugate gradient descent algorithm. The Y and G vectors are distributed among the different processors constituting the virtual machine. Each processor performs the required operations on its local Y_{loc} and G_{loc} subvectors concurrently with the other processors. Here, each of Y_{loc} and G_{loc} contains 1/N_{proc} of the elements of Y and G, respectively, where N_{proc} is the number of processors in the virtual machine. This entails some interprocessor communication and synchronization overhead since the individual subvectors have to be periodically scattered (i.e., distributed) among the processors and also periodically gathered (i.e., combined) to compute a global value. For example, in the bisection procedure (i.e., the procedure for minimizing along the direction G) one needs to evaluate the objective function and compute its one-dimensional derivative with respect to s during each iteration. Both of these operations require gathering of data (i.e., subvectors) from the different processors.
PVM permits one to define a group of processes such that reduction, scattering, and gathering operations can be performed selectively on the processes within the group. The pvm_scatter and pvm_gather primitives were used to scatter and gather the Y and G vectors among the different processors, respectively. It is to be noted, however, that scatter and gather operations in PVM are collective calls and need to be performed by all the processors concurrently. Synchronizing the processors, therefore, is necessary before any such call is initiated, thereby increasing the parallelization overhead. The parallelization scheme for the conjugate gradient descent algorithm follows the masterchild model used for the PSA and PMCA algorithms. The master and child processes are depicted in Figures 8 and 9, respectively.
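The scatter and gather operations can be sketched in plain Python, standing in for pvm_scatter and pvm_gather (the function names are ours):

```python
def scatter(vector, n_proc):
    """Split a vector into near-equal contiguous subvectors, one per
    processor (the role of pvm_scatter)."""
    q, r = divmod(len(vector), n_proc)
    chunks, start = [], 0
    for p in range(n_proc):
        size = q + (1 if p < r else 0)   # spread the remainder
        chunks.append(vector[start:start + size])
        start += size
    return chunks

def gather(chunks):
    """Concatenate the subvectors back into one vector
    (the role of pvm_gather)."""
    return [x for chunk in chunks for x in chunk]
```

Each processor would operate on its own chunk concurrently (e.g., computing a partial inner product of Y_{loc} with G_{loc}), with the partial values combined by a reduction; in PVM these are collective calls, hence the synchronization overhead noted above.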
A two-level parallelism approach for computation of the maximum-likelihood estimator: To ensure a scalable implementation, two levels of parallelism were incorporated in the computation of the maximum-likelihood estimator. The finer or lower level of parallelism pertains to the computation of Ŷ for a given probe ordering Π using the parallel conjugate gradient descent algorithm for continuous optimization, as discussed previously. The coarser or upper level of parallelization pertains to the determination of the optimal probe ordering using the parallel stochastic hill-climbing algorithms.
At the coarser level, the user has a choice of any of the four parallel stochastic hill-climbing algorithms: PILM PSA, NILM PSA, PILM PMCA, or NILM PMCA. The parallelization of the conjugate gradient descent algorithm at the finer level is transparent to the parallel stochastic hill-climbing algorithms at the coarser level. In other words, the communication and control scheme for the parallel stochastic hill-climbing algorithms is independent of that of the parallel conjugate gradient descent algorithm. This enhances the modularity and flexibility of the system. For example, one could use the serial or parallel version of the conjugate gradient descent algorithm or, for that matter, any other serial or parallel algorithm for continuous deterministic optimization at the finer level without having to make any changes to the parallel stochastic hill-climbing algorithms at the coarser level.
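This modularity can be sketched as an interface: the coarser-level annealing loop below treats the finer-level spacing estimator as an opaque callable, so a serial or parallel implementation can be substituted without changing the loop. This is an illustrative skeleton with assumed names and a toy cost function, not the paper's code:

```python
import math
import random

def anneal_ordering(probes, spacing_solver, t0=0.5, alpha=0.95, steps=200):
    """Coarser level: stochastic search over probe orderings.  The
    finer level is hidden behind spacing_solver(ordering) -> cost,
    which may be serial or parallel; the loop cannot tell the difference."""
    order = list(probes)
    cost = spacing_solver(order)
    best, best_cost = order[:], cost
    t = t0
    for _ in range(steps):
        i, j = random.sample(range(len(order)), 2)
        cand = order[:]
        cand[i], cand[j] = cand[j], cand[i]           # perturb the ordering
        c = spacing_solver(cand)
        if c < cost or random.random() < math.exp(-(c - cost) / t):
            order, cost = cand, c                     # accept perturbation
            if cost < best_cost:
                best, best_cost = order[:], cost
        t *= alpha                                    # geometric cooling
    return best, best_cost

# Stand-in for the MLE spacing estimator: distance of an ordering
# from the identity permutation (purely for demonstration).
toy_solver = lambda o: sum(abs(p - i) for i, p in enumerate(o))
```

Swapping `toy_solver` for a parallel estimator changes nothing in `anneal_ordering`, which is the transparency property described above.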
The interaction between the master and child stochastic hill-climbing (i.e., annealing) processes (NILM/PILM PSA/PMCA) is shown with the double-headed arrows between the larger components in Figure 10. A parallel conjugate gradient descent algorithm is embedded within each of the stochastic hill-climbing processes. When the parallel conjugate gradient descent procedure is invoked from within the master or child stochastic hill-climbing process, a new set of child conjugate gradient descent processes is spawned on the available processors, whereas the master conjugate gradient descent process runs on the same processor as the stochastic hill-climbing process (master or child). The master and child conjugate gradient descent processes cooperate to evaluate and minimize the value of the objective function for a specific probe ordering Π. Once the objective function is evaluated, the child conjugate gradient descent processes terminate, and the corresponding processors are available for future computation. The interaction between the master and child conjugate gradient descent processes is depicted by the double-headed arrows within each of the major components in Figure 10.
The two-level parallelism approach can be seen to induce a logical tree-shaped interconnection network on the processors in the PVM system. The first level is the group of processors that run the (master and child) stochastic hill-climbing processes, and the second level is the group of processors that run the child conjugate gradient descent processes. The processors that run the child conjugate gradient descent processes are connected to the processor running the parent stochastic hill-climbing process but are independent of the processors running other stochastic hill-climbing processes.
EXPERIMENTAL RESULTS
Experimental results on simulated data: The parallel algorithms were implemented on a dedicated PVM cluster of 200-MHz Pentium Pro processors running Solaris x86 and tested with artificially generated clone-probe hybridization data (Shete 1998). Two sets of artificial data were used with the specifications outlined in Table 2. The artificial data were generated using a program described in Shete (1998), which generates clonal data of a given length with the left endpoints of the clones and probes uniformly distributed along the length of an artificial chromosome.
The serial stochastic hill-climbing algorithms (SA and MCA) were implemented with the following parameters: the initial value for the temperature or demon energy was chosen to be 0.5, and the maximum number of iterations D for each annealing step was chosen to be 100 · n. The current annealing step was terminated when the maximum number of iterations was reached or when the number of successful perturbations equaled 10 · n (i.e., 10% of the maximum number of iterations), whichever was encountered first. The temperature or demon energy values were systematically reduced at the end of each annealing step, using a geometric annealing schedule with the annealing factor α = 0.95. The algorithm was terminated (and deemed to have reached a global optimum) when the number of successful perturbations in any annealing step equaled 0.
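The step-control logic just described (D = 100 · n trials per step, early exit after 10 · n successes, geometric cooling by α = 0.95, termination on a step with zero successes) can be sketched as follows; `perturb_and_test` is an assumed stand-in for one perturbation trial of the actual annealing algorithm:

```python
def run_annealing_schedule(n, perturb_and_test, t0=0.5, alpha=0.95):
    """Drive annealing steps with the parameters described above.
    perturb_and_test(t) performs one perturbation at temperature (or
    demon energy) t and returns True if the perturbation succeeded."""
    t = t0
    max_trials = 100 * n                 # D, the per-step iteration cap
    success_cap = 10 * n                 # 10% of the cap
    while True:
        successes = 0
        for _ in range(max_trials):
            if perturb_and_test(t):
                successes += 1
                if successes == success_cap:
                    break                # early exit for this step
        if successes == 0:
            return t                     # no successes: deemed converged
        t *= alpha                       # geometric annealing schedule
```

Returning the final temperature makes the controller easy to inspect; the real implementation would instead return the best ordering found.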
In the case of the parallel stochastic hill-climbing algorithms (NILM PSA, PILM PSA, NILM PMCA, and PILM PMCA), the product of N_{proc} and the maximum number of iterations D performed by a processor in a single annealing step was kept constant, i.e., D = (100 · n)/N_{proc}. This ensured that the overall workload remained constant as the number of processors was varied, thus enabling one to examine the scalability of the speedup and efficiency of the algorithms for a given problem size with an increasing number of processors. The other parameters for the parallel stochastic hill-climbing algorithms were identical to those of their serial counterparts. In the NILM PSA and NILM PMCA algorithms, each process was independently terminated when the number of successful perturbations in any annealing step for that process equaled 0. In the PILM PSA and PILM PMCA algorithms, each process was terminated when the number of successful perturbations in an annealing step equaled 0 for all the processes, i.e., the master process and all the child processes. This condition was checked during the synchronization phase at the end of each annealing step.
The parallel conjugate gradient descent algorithm was tested on artificial data sets with a varying number of probes (n = 100, 200, and 300). Table 3 shows the run-time results for the parallel conjugate gradient descent algorithm with a varying number of processors N_{proc} and a varying number of probes n. Table 3 also shows the run-time results for the serial steepest descent algorithm on the same data sets. As can be seen from Table 3, the serial version of the conjugate gradient descent algorithm is significantly faster than the serial version of the steepest descent algorithm. Moreover, Table 3 also shows how the performance of the parallel conjugate gradient descent algorithm scales with an increasing number of processors and increasing problem size.
The speedup results for Table 3 are shown in Table 4. For n = 100, the best speedup is attained for N_{proc} = 2, with a degradation in speedup for N_{proc} > 2. For n = 200, the results show the best speedup for N_{proc} = 4, with a degradation in speedup for N_{proc} > 4. For n = 300, the best speedup is seen for N_{proc} = 8. These results are in conformity with expectations, since the interprocessor communication and synchronization overheads tend to increasingly dominate the overall execution time with an increasing number of processors N_{proc} for a problem of given size (i.e., for a given value of n). These overheads represent a higher percentage of the computational load for smaller problem sizes, which explains why the maximum speedup for the two smaller problem sizes is obtained at values of N_{proc} < 8. In summary, the payoff in the parallelization of the conjugate gradient descent algorithm is realized only for large values of n (i.e., large problem sizes). This is a natural consequence of the network latency inherent in PVM systems composed of a network of workstations.
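The trade-off described above (per-processor compute time shrinking while communication and synchronization overhead grows with N_{proc}) can be captured in a toy cost model; the numbers below are illustrative, not the measured timings of Tables 3 and 4:

```python
def speedup(t_serial, t_parallel):
    """Speedup of a parallel run over the serial baseline."""
    return t_serial / t_parallel

def model_runtime(work, n_proc, comm_cost):
    """Toy model: computation divides evenly across processors, while
    communication/synchronization overhead grows with n_proc."""
    return work / n_proc + comm_cost * n_proc

def best_nproc(work, comm_cost, candidates=(1, 2, 4, 8)):
    """Processor count minimizing the modeled run time."""
    return min(candidates, key=lambda p: model_runtime(work, p, comm_cost))
```

Under this model a small problem saturates at a small processor count while a large one keeps scaling, mirroring the trend seen for n = 100 versus n = 300.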
The total number of iterations of the serial and parallel versions (with different N_{proc} values) of the conjugate gradient descent algorithm prior to convergence was also examined (Table 5). As can be seen in Table 5, the number of iterations prior to convergence varies significantly between the serial and the parallel versions of the conjugate gradient descent algorithm. On closer examination, we were able to conclude that this was due to numerical errors introduced by the parallelization process. Since the objective function is mathematically involved, and since the order in which numbers are added differs between the serial implementation and each of the parallel versions (with different N_{proc} values), a numerical error is introduced. This error is the difference in the objective function values computed in a given iteration by the serial and parallel versions of the conjugate gradient descent algorithm, both having started from the same initial point. It was noted that the numerical error is initially insignificant. However, due to the iterative nature of the conjugate gradient descent algorithm, the results of the ith iteration depend entirely on the results of the (i − 1)th iteration, thereby causing the error to build up throughout the computation. Also, to ensure convergence to a global minimum, very high-precision thresholds have been used to test for convergence. For instance, the bisection method terminates once the value of the objective function varies by <10^{-10}. The conjugate gradient descent algorithm itself is deemed to have converged to a global minimum when the value of the objective function varies by <10^{-5} (if the gradient vector has not vanished by then).
Consequently, all three factors, i.e., the different order of execution of arithmetic operations, the data dependency between successive iterations, and the high-precision convergence thresholds, contribute to the discrepancy in the number of iterations prior to convergence between the serial and parallel versions (with different N_{proc} values) of the conjugate gradient descent algorithm. The final results of the serial and the parallel versions of the conjugate gradient descent algorithm are equivalent up to a precision of 10^{-4}.
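The order-of-addition effect described above is easy to reproduce in miniature; this example uses repeated additions of 0.1 rather than the paper's objective function:

```python
import math

# Sequential accumulation, as a serial implementation would perform it:
serial_sum = sum([0.1] * 10)

# Chunked accumulation, as when two processors each sum half of the
# data and the two partial sums are then combined:
half = [0.1] * 5
chunked_sum = sum(half) + sum(half)

# The two execution orders round differently, so the results disagree
# even though both are correct to within machine precision:
assert serial_sum != chunked_sum

# A compensated summation gives the correctly rounded reference value:
assert math.fsum([0.1] * 10) == 1.0
```

The discrepancy here is a single unit in the last place; in an iterative algorithm with tight convergence thresholds, such differences can compound and change the iteration count, as observed in Table 5.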
Table 6 shows the average number of iterations per minute for both the serial and parallel versions of the conjugate gradient descent algorithm. This information is derived from the overall execution time (Table 3) and the number of iterations prior to convergence (Table 5). Table 7 shows the average time (in seconds) per iteration for the serial and parallel versions (with different N_{proc} values) of the conjugate gradient descent algorithm. Table 8 shows the speedup values for the parallel conjugate gradient descent algorithm for varying values of N_{proc} based on the results in Table 7. Although the speedup results in Table 8 compare well with those in Table 4, one can observe greater uniformity and consistency in the results presented in Table 8. The speedup curves corresponding to the speedup results in Tables 4 and 8 are shown in Figures 11 and 12, respectively.
In the case of the parallel stochastic hill-climbing algorithms, experiments were conducted on problem sizes of n = 10 and n = 30 probes. All four algorithms—NILM PSA, PILM PSA, NILM PMCA, and PILM PMCA—were tested using the MLE objective function. Since the value of n (i.e., the problem size) is small, the serial version of the conjugate gradient descent algorithm was used. The run-time results for these algorithms are shown in Table 9. It can be observed that the serial and parallel versions of the MCA algorithm took longer to converge than their SA counterparts, even though, as Table 10 shows, the average run time per annealing step of the serial or any of the parallel versions (with different N_{proc} values) of the SA algorithm is higher than that of its MCA counterpart. The higher overall run time of the MCA and PMCA algorithms is attributable to the fact that they took a larger number of annealing steps to converge to a globally optimal solution.
The speedup results for the PSA and PMCA algorithms are shown in Table 11. The corresponding speedup curves for n = 10 and n = 30 are shown in Figures 13 and 14, respectively. As can be observed, the PSA and PMCA algorithms exhibit consistent and scalable speedup with increasing N_{proc} values. As expected, the speedup scales better with increasing N_{proc} values for larger problem sizes (i.e., larger values of n). The PSA and PMCA algorithms arrived at the correct probe ordering in all cases but one. In the case n = 30 and N_{proc} = 8, the PMCA algorithm arrived at the reversed probe ordering (i.e., Π^{R} instead of Π), which is expected since the likelihood function is invariant under reversal of the probe ordering. Consequently, the MLE procedure is capable of recovering the correct probe ordering only up to reversal.
The absolute root mean squared error (RMSE) χ between the true interprobe spacings Y and the estimated interprobe spacings Ŷ is defined as χ = [(1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)^{2}]^{1/2}.
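With the spacings treated as vectors of length n, χ can be computed directly; the spacing values below are hypothetical, not taken from the experimental tables, and the normalization used for the percentage RMSE is one plausible choice rather than the paper's stated definition:

```python
import math

def rmse(y_true, y_est):
    """Absolute RMSE between true and estimated interprobe spacings."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_est)) / n)

def pct_rmse(y_true, y_est):
    """RMSE as a percentage of the mean true spacing (an assumed
    normalization; the paper's exact definition may differ)."""
    mean_spacing = sum(y_true) / len(y_true)
    return 100.0 * rmse(y_true, y_est) / mean_spacing

# Hypothetical spacings for illustration:
Y_true = [2.0, 4.0, 6.0]
Y_est  = [2.5, 3.5, 6.0]
```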
Experimental results on real data: The parallel implementation of the MLE-based physical mapping algorithm was also tested on real probe-clone hybridization data from linkage group VII of the fungal genome N. crassa. The data comprised 105 probes and 1678 clones. The physical map was generated by first ordering the probes and estimating the interprobe distances using the MLE procedure. The intervening clones were then inserted between pairs of successive probes to generate a physical map of the chromosome. The resulting physical map, exhibiting a total of 39 contig breaks, can be viewed on the Genetics web site at http://www.genetics.org/supplemental/. The 1's in the physical map that do not conform to the consecutive-1's property (Booth and Lueker 1976) are identified as false positives and replaced by *'s. However, due to the inherent sparsity of the clone-probe hybridization matrix H, it is not possible to identify the false negatives (Christof and Kececioglu 1999). A parallel version of the MLE-based physical mapping procedure that used the NILM PSA algorithm in conjunction with the serial CGD algorithm exhibited a speedup of 1.5 on a cluster of three SUN UltraSparc-1 workstations (350 MHz, 128 MB RAM) connected via 100-Mb/s fast Ethernet. Although the results are encouraging, the speedup of the parallel MLE-based physical mapping procedure leaves room for improvement.
CONCLUSIONS AND DISCUSSION
In this article, an MLE-based approach to physical map reconstruction under a probabilistic model of hybridization errors consisting of false positives and false negatives was presented. The maximum-likelihood estimator optimizes a likelihood function defined by the ordering of probes and the interprobe spacings under an experimental protocol wherein clones of equal length are hybridized to a maximal subset of nonoverlapping, equal-length clones termed probes. The estimation procedure was shown to involve a combination of continuous and discrete optimization, the former to determine a set of optimal interprobe spacings for a given probe ordering and the latter to determine the optimal probe ordering. The conjugate gradient descent procedure was used to determine the optimal spacings between probes for a given probe ordering. The optimal probe ordering was determined using stochastic combinatorial optimization procedures such as SA and MCA.
The problem of MLE-based physical map reconstruction in the presence of errors is a problem of high computational complexity, thus providing the motivation for parallel computing. A two-level parallelization strategy was proposed wherein the conjugate gradient descent procedure was parallelized at the lower level, and the stochastic hill-climbing algorithm was simultaneously parallelized at the higher level. The parallel algorithms were implemented on a distributed-memory multiprocessor cluster running the PVM environment. A data-parallel approach, where the components of the gradient vector are distributed among the individual processors in the PVM system, was deemed more suitable for the parallelization of the conjugate gradient descent procedure. A control-parallel scheme, where individual processors perform noninteracting or periodically interacting searches, was deemed more suitable for the parallelization of the combinatorial stochastic hill-climbing procedures.
Experimental results on artificial clone-probe data showed that the payoff in the data parallelization of the conjugate gradient descent procedure was realized only for large problem sizes (i.e., large values of n). A similar trend was observed in the case of the parallel stochastic hill-climbing algorithms. In all cases, the parallel implementation of the maximum-likelihood estimator resulted in the correct probe ordering, except in one case wherein the probe ordering was reversed. The RMSE between the resulting interprobe distances and the true interprobe distances was computed. The percentage RMSE was seen to exhibit a decreasing trend with increasing values of n (i.e., problem size), which is in conformity with the statistical theory underlying the MLE procedure. Experimental results on real hybridization data from linkage group VII of the fungal genome N. crassa yielded a physical map with 39 contig breaks. The parallel MLE-based physical mapping procedure that used a combination of the NILM PSA and serial CGD algorithms exhibited a speedup of 1.5 on a three-workstation cluster interconnected via 100-Mb/s fast Ethernet.
The formulation of the ML model entailed certain key assumptions. The assumption of a constant probe/clone size is reasonable (Kelkar et al. 2001). The assumption of a uniform distribution of clones along the length of the chromosome is also reasonable, since Kelkar et al. (2001) and Aign et al. (2001) have taken substantial effort to represent fragments of N. crassa DNA in multiple cloning vectors. The most difficult assumption is that of a uniform distribution of probes along the length of the chromosome due to the presence of repeats (Xiong et al. 1996). This has not been a serious problem in the case of N. crassa thus far, due to the paucity of off-diagonal hybridization signals in the physical maps (Aign et al. 2001; Hall et al. 2001). However, in general, the presence of DNA repeats does pose a serious problem in physical map creation, and the ML model will need to be extended to address this issue. The ML model will also need to be enhanced to take into account hybridization errors due to deletions and chimerism, especially when longer clones (and probes) are used in the sampling-without-replacement protocol. Another key assumption in the computation of the ML estimate is that the n-dimensional space of interprobe spacings is convex or can be decomposed into a finite number of convex regions for a given probe ordering. This assumption permits the use of local-search-based techniques to determine a local minimum (which also happens to be a global minimum) of the ML objective function for a given probe ordering. This assumption is valid only when the clone coverage is sufficiently large. In the absence of this assumption, the problem of determining a global minimum of the ML objective function for a given probe ordering becomes intractable.
Future research will attempt to improve the performance of the parallel MLE-based physical mapping procedure on real hybridization data. Future research will also investigate extensions and enhancements to the ML objective function to account for errors due to repeat DNA sequences, deletions, and chimerism. The consistency of the ML estimator needs to be rigorously analyzed. This would entail a proof of the asymptotic convergence of the inferred physical map to the true physical map with probability one as the number of probes (and clones) grows. Statistical methods need to be developed to assess the statistical reliability of the ML estimator. The rate of convergence of the inferred physical map to the true physical map could be used to establish asymptotic confidence levels for links between pairs of probes (or clones) on the physical map (Xiong et al. 1996). Variational analysis will provide the tools for an asymptotic analysis, and bootstrap resampling techniques (Wang et al. 1994b) could provide a tool for estimating the statistical measure of confidence in the links on the physical map. A rigorous sensitivity analysis is necessary to show the validity and robustness of the various assumptions underlying the ML model. Extensions to the ML objective function to integrate ordinal information from probes containing markers anchored to genetic maps (Hall et al. 2001) also need to be investigated. The current PVM implementation of the ML estimator is targeted toward a homogeneous distributed processing platform such as a network of identical workstations. Future research will explore and address issues that deal with the parallelization of the ML estimator on a heterogeneous distributed processing platform such as a network of workstations that differ in processing speeds.
Acknowledgments
The authors thank Dr. David Lowenthal for access to his PVM workstation cluster. This research was supported in part by a National Research Initiative Competitive Grants Program (NRICGP) grant by the U.S. Department of Agriculture to Dr. Suchendra M. Bhandarkar.
Footnotes

Communicating editor: J. Arnold
 Received September 1, 2000.
 Accepted December 19, 2000.
 Copyright © 2001 by the Genetics Society of America