## Abstract

Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It also can employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels because they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal reference panel—minimizing the average distance to the closest leaf (ADCL)—and compare its performance relative to an earlier algorithm: maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.

OWING to the existence of genetic variation within species, geneticists routinely make choices about which individuals, inbred strains, or representatives of populations or breeds merit prioritization for genotyping or DNA sequencing. Often such choices, though typically made by informal criteria, reflect an explicit or implicit goal of maximizing the potential for extrapolating the information in the genotyped or sequenced individuals to all members of a breed, population, or species of interest.

Genotype imputation algorithms infer unobserved genotypes by matching a set of markers to the haplotype patterns observed in a reference sample (Li *et al.* 2009; Marchini and Howie 2010), adding a new dimension to these choices. Reference panels are used to facilitate genotype imputation in other individuals beyond the members of the panels themselves, and they often can be optimally selected to maximize the imputed genotypic information obtained about those other individuals of interest (Kang and Marjoram 2012; Zhang *et al.* 2013; Peil *et al.* 2015). The evaluation of alternative ways to select imputation reference panels thus provides an approach for making sample choices for major genotyping or sequencing studies more systematically generalizable.

When conducting genotype imputation studies in a population sample, reference panels generally have been selected from databases external to the sample, such as the 1000 Genomes Project Consortium (2010) and the International HapMap Consortium (2005) databases. As a result of the rapidly decreasing cost of sequencing, however, it has become increasingly possible to carry out *internal* reference panel selection, in which additional sequencing is performed on a subset of the study sample, and the sequenced subset is then used to impute the remaining haplotypes. The use of reference sequences that originate from the study sample itself can reduce the potential mismatch of ancestral backgrounds between sample and reference populations, decreasing imputation errors. It also allows for genetic variants unique to the sample population to be imputed successfully (Fridley *et al.* 2010; Zhang *et al.* 2013).

Previous studies have observed that a mismatch of population origins between reference panels and study samples can reduce imputation accuracy compared with when they originate from the same or similar populations (Huang *et al.* 2009, 2011; Li *et al.* 2010; Paşaniuc *et al.* 2010; Shriner *et al.* 2010; Surakka *et al.* 2010). Jewett *et al.* (2012) demonstrated, using a coalescent model, that with other variables held constant, smaller internal reference panels are often likely to outperform larger external reference panels despite the difference in panel size. Empirical studies also have shown that using an internal reference panel drawn from a subset of the sample under study, in addition to an external reference panel, gives rise to an increase in imputation accuracy over using the external reference panel alone (Fridley *et al.* 2010; Sampson *et al.* 2012; Duan *et al.* 2013; Kreiner-Møller *et al.* 2015, Pistis *et al.* 2015).

The value of internal reference panels for imputation studies raises the question of how an internal panel should be selected. Two recent studies have proposed maximizing phylogenetic diversity (PD) as a criterion for internal reference panel selection (Kang and Marjoram 2012; Zhang *et al.* 2013). In this approach, the PD of a set of haplotypes is defined as the total branch length of a tree spanned by the haplotypes (Faith 1992; Hartmann and Steel 2007). Given a panel size, the goal is to select the subset of haplotypes whose subtree yields the longest total branch length. Conceptually, the idea of seeking a maximally diverse subset of haplotypes in the reference panel aims to sample haplotypes that best cover the full range of haplotypes observed in the sample. The maximum-PD panel, by choosing haplotypes from different regions of the tree of haplotypes (Figure 1b), is more likely than a random panel to supply the necessary diversity to impute sites localized in a subgroup within the entire sample population. Zhang *et al.* (2013) showed, using simulated sequence data and data from the 1000 Genomes Project, that by using the maximum-PD panel, higher imputation accuracy is obtained, and more sites are imputed as polymorphic in the sample population, than if the reference panel consists of randomly selected haplotypes.

Despite the utility of maximizing PD as a method for the selection of an internal reference panel, other approaches focusing on different principles might be preferable. Because the algorithm explicitly selects haplotypes that are genetically distant from one another, long, pendant branches of the tree, if present, are likely to be chosen (Bordewich *et al.* 2008). The haplotypes associated with such branches might not be representative of the sample at large. These haplotypes might contain a large amount of sequencing error or missing data, and their inclusion in the reference panel might not contribute substantially to an increase in imputation accuracy. Even if they have high-quality data, such haplotypes are relatively unique in the sample, and therefore might assist as imputation templates only for a small number of sampled lineages.

PD can be viewed as emphasizing *diversity* in the internal reference panel rather than *representativeness*. To determine whether an alternative focused on identifying the most representative subsample for use as the internal reference panel is preferable, we adapted another method borrowed from phylogenetic studies (Matsen *et al.* 2013): minimizing the average distance to the closest leaf (ADCL), an approach that identifies reference haplotypes based on their genetic proximity to the rest of the sample haplotypes. We compare the imputation accuracy of the maximum-PD, minimum-ADCL, and random reference panels on both simulated data and data from the 1000 Genomes Project, and find that the minimum-ADCL panel consistently provides higher imputation accuracy, irrespective of changes to parameters such as reference panel size, marker density, and sequence length.

## Methods

### Maximizing PD

Given a tree of *n* haplotypes, to select a reference panel of haplotypes whose subtree spans the longest branch length, Zhang *et al.* (2013) applied a well-known greedy algorithm (Pardi and Goldman 2005; Steel 2005) that takes as inputs the tree and a parameter *k ≤ n*, the desired number of haplotypes for the panel. Let *X* be the *k*-element subset of the sample haplotypes chosen for the reference panel, and let *T _{X}* be the subtree spanned by the haplotypes in

*X*. The algorithm first selects the haplotype pair that is phylogenetically most distant (

*i.e.*, largest pairwise branch length) and adds both haplotypes to

*X*.

*T*now consists of a single pair of branches. Sequentially, the haplotype that is the most distant from

_{X}*T*is placed into

_{X}*X*, updating

*T*with each inclusion. This process continues until the required

_{X}*k*haplotypes have been selected (Figure 1b).

Pardi and Goldman (2005) and Steel (2005) proved that among all possible subsets of size *k ≤ n* haplotypes from the study sample, the greedy algorithm achieves the globally maximal PD. Thus, selection of the most diverse reference panel is computationally efficient because there is no need to exhaustively examine all possible panels of size *k* to arrive at the correct solution. In addition, because the selection algorithm is greedy, the haplotypes in the reference panel can be ranked by their order of inclusion, in which every haplotype added contributes a nonincreasing amount of PD. The maximum-PD panels of sizes 2 to *k* form a series of nested sets, and all previously selected haplotypes in a panel smaller than *k* also will be included in a panel of size *k*.

### Minimizing ADCL

#### Overview of ADCL:

Instead of focusing on diversity in the selected set and targeting the potential for accurate imputation of unusual haplotypes, the minimum-ADCL algorithm focuses on representativeness, aiming to maximize imputation accuracy of typical haplotypes likely to appear in a sample. The problem can be viewed as choosing haplotypes that are, on average, genealogically close to the remaining haplotypes not included in the reference panel. As in the case of PD, the algorithm takes as inputs a tree of the *n* haplotypes in the study sample, and a parameter *k ≤ n*, indicating the desired reference panel size.

Let *H* be the set of *n* haplotypes, and let *X* be the selected *k*-element subset of *H*. The objective is then to find *X* such that the branch-length distance from a randomly chosen haplotype in *H* to its closest neighboring haplotype in *X* is minimized over all possible *k*-element subsets of *H* (Matsen *et al.* 2013). Note that because the haplotypes in *X* are also in *H*, each of these haplotypes is its own closest neighbor, and we can equivalently consider either *H* or . In essence, the goal is to return a set of reference panel haplotypes that occupy the most central positions within clusters of the tree (Figure 1c).

In a detailed study of ADCL, Matsen *et al.* (2013) demonstrated that unlike when choosing the subset that maximizes PD, the greedy algorithm need not give rise to the globally optimal ADCL solution. It is therefore necessary to produce alternative algorithms that seek to minimize ADCL. Note that because the greedy algorithm is not applicable, the haplotypes selected cannot be ranked by their order of inclusion: a haplotype included in a subset of size smaller than *k* is not necessarily also included in a subset of size *k* (Figure 1c).

#### Adapted partitioning-around-medoids (PAM) algorithm for minimizing ADCL:

Matsen *et al.* (2013) described two algorithms that, for a given set of haplotypes, seek to produce the subset of size *k* that minimizes ADCL. The first approach leverages similarities between the problem of minimizing ADCL and the technique known as *k-medoids clustering* (Kaufman and Rousseeuw 1987). In the *k*-medoids problem, a set of data points is partitioned into *k* clusters, where *k* is predetermined. Within each cluster, a single point is designated as the center. The *k*-medoids clustering method is similar to *k*-means clustering, except that in *k*-medoids, each cluster center must be chosen from the original set of data points, whereas *k*-means permits any location to be a cluster center. The objective function to be minimized in the *k*-medoids problem is the distance from a random data point to the center of the cluster to which it is assigned. A cluster center can be viewed as the data point most representative of the remainder of the data points within the cluster.

It is then clear how the problem of minimizing ADCL is analogous to the *k*-medoids problem. A data point is a haplotype, and distances between data points are branch-length (patristic) distances between haplotypes. The *k* cluster centers are akin to the *k* haplotypes that are selected.

As with minimizing ADCL, there is no greedy algorithm that solves the *k*-medoids problem, and obtaining the globally optimal solution has been demonstrated to be NP-hard (Sheng and Liu 2004). A widely used *k*-medoids heuristic algorithm is the *partitioning-around-medoids* (PAM) algorithm (Theodoridis and Koutroumbas 2008), which works by randomly selecting *k* medoids from the original set of *n* data points and then minimizing the objective function via hill-climbing. One iteration of the algorithm consists of looping over all *k*(*n − k*) possible pairs containing a medoid and nonmedoid, exchanging the medoid statuses of the points in the pair, and recording the new value of the objective function from the updated arrangement. Among all *k*(*n − k*) proposed exchanges, the single exchange that leads to the lowest-cost configuration is chosen. The algorithm then enters a new iteration, and the process repeats until no further changes to the set of medoids take place.

The first approach Matsen *et al.* (2013) considered for minimizing ADCL is an adaptation of the PAM algorithm. First, the set *X* of haplotypes included in the reference panel is initialized by randomly selecting, without replacement, *k* haplotypes from the initial set *H* of *n* haplotypes. Next, the following loop over the haplotypes is executed until no exchanges occur for one complete iteration over every :

For a haplotype , remove it from

*X*, and attempt to replace it with every other in its place.Keep the best such exchange if it decreases ADCL.

Continue with . In the case of , continue with .

This method for minimizing ADCL differs from the original formulation of the PAM algorithm in that it evaluates potential exchanges one medoid at a time instead of examining all *k*(*n − k*) medoid-nonmedoid pairs before finding the exchange that most decreases the objective function (Matsen *et al.* 2013). Because each step in the iteration causes the value of ADCL to either stay constant or decrease, the solution is guaranteed to converge on a local minimum. However, the algorithm remains a heuristic approach, and the minimum-ADCL solution it achieves could depend on the specific haplotypes selected during random initialization. Hence, the global minimum might not always be found.

Alongside the adapted PAM algorithm, Matsen *et al.* (2013) also developed a second approach: an exact but more computationally intensive algorithm that is guaranteed to find the global-minimum ADCL solution. Both algorithms were implemented in the rppr binary in the pplacer suite of programs. Comparing between the two, Matsen *et al.* (2013) demonstrated that for their simulated test sets, the adapted PAM algorithm only rarely gets trapped in local minima. For computational efficiency, we therefore chose to use the adapted PAM algorithm rather than the slower exact algorithm, first testing that, in our setting, multiple runs of the adapted PAM algorithm with different initial seeds select a large percentage of the same haplotypes (see *Results*).

### Simulated sequence data

To evaluate how the maximum-PD and minimum-ADCL panels perform relative to each other, we analyzed simulated data sets produced by the coalescent-based sequence sampling program ms (Hudson 2002), closely following the parameters used by Zhang *et al.* (2013) to ensure that the results are comparable.

First, we independently generated 50 data sets, each consisting of two thousand 1-Mb haplotypes, assuming a constant effective population size of , a mutation rate of per site per generation, and a recombination rate of per site per generation. Parameter values provided to ms were as follows: nsam = 2000, nreps = 50, -t = 400, -r = 400, and nsites = 10^{6}. From the simulated data sets, we removed all singleton sites to ensure that the sequence data were truly imputable. Within a data set, if the *n =* 2000 haplotypes contained *q* polymorphic sites after excluding the singletons, then we randomly selected, without replacement, *s =* 300 of the *q* sites, each with minor allele frequency (MAF) greater than 0.1. These markers were considered genotyped. The remaining *q − s* sites were masked. Markers with MAF ≤ 0.1 were excluded from being treated as genotyped so as to mimic the phenomenon that rarer variants have been less frequently included in typical SNP arrays.

Following Zhang *et al.* (2013), we calculated the pairwise Hamming distances between the *n =* 2000 haplotypes in each of the 50 data sets based on the genotype information at only the *s =* 300 randomly selected markers. With these distances, we then used the software RapidNJ (Simonsen *et al.* 2008) to construct a neighbor-joining tree (Saitou and Nei 1987) of the haplotypes. Note that it was possible, as a result of random sampling, for two or more haplotypes to be identical at all *s* markers. In such a case, a leaf in the tree would represent more than one haplotype, although the multiplicity of each haplotype was retained for the purpose of assembling a reference panel.

Using the Python library DendroPy (Sukumaran and Holder 2010), we calculated the patristic distance matrix for each neighbor-joining tree. We then applied the greedy algorithm to select the reference panel of size *k =* 200 that maximizes PD. Furthermore, on each neighbor-joining tree, we used the rppr binary in pplacer (Matsen *et al.* 2013) to execute the adapted PAM algorithm, returning a reference panel of size *k =* 200 that minimizes ADCL. In cases for which either algorithm selected a leaf representing more than one haplotype, one of the haplotypes was randomly chosen to be included in the panel.

To model diploid samples, we also created diploid reference panels for use with both the maximum-PD and minimum-ADCL algorithms. First, making the simplest possible assumption about the pairing of haplotypes within populations, we randomly paired the *n =* 2000 haplotypes into 1000 diploid genomes. For the diploid PD panel, following Zhang *et al.* (2013), we included diploid individuals carrying at least one of the top-ranked haplotypes into the panel until we reached the desired panel size *k*. More specifically, we proceeded down the list of *k* haplotypes in the maximum-PD panel, ranked based on the order of inclusion. At each step, we selected both the top-ranked haplotype and the haplotype with which it was paired (and which was not necessarily top ranked) for the diploid panel, if they had not already been picked previously. We continued this process until *k*/2 diploid genomes were selected, for a total of *k* haplotypes.

Unlike in maximum-PD panels, haplotype sets in minimum-ADCL panels are not nested. Therefore, we cannot use the same process to construct the diploid ADCL panel. To address this problem, we first constructed, again using rppr, a half-sized minimum-ADCL panel of size *k*/2 = 100. Each haplotype in the half-sized panel, along with the haplotype with which it was paired, was then included in the diploid panel. In the event that both haplotypes of a diploid genome were in the half-sized panel, they were each chosen only once. If the diploid panel was not fully filled at the end of this process, then haplotype pairs were randomly taken from the previously unselected diploid genomes until the requisite panel size of *k*/2 diploid genomes was reached.

For comparison, for each of the 50 data sets, we also generated 1000 random reference panels by sampling, without replacement, *k =* 200 of the original *n =* 2000 haplotypes, giving a total of 1004 reference panels. A diagram of the simulation pipeline appears in Figure 2.

For each of the *k* haplotypes in a reference panel, we unmasked the genotypes at the *q −* s masked sites and used the resulting full sequences as a reference to perform imputation, under the assumption that the haplotypes represent sequences with resolved phasing. Following Zhang *et al.* (2013), to avoid edge effects and to improve imputation accuracy, within each 1-Mb haplotype, we imputed only the middle 100-kb segment while still retaining the markers in both 450-kb flanking regions (Li *et al.* 2010). Similar to Zhang *et al.* (2013), we used the program minimac (Howie *et al.* 2012) to perform imputation. The parameter values entered into minimac were as follows: and .

For each choice of reference panel, we evaluated imputation accuracy at the *r* imputed sites over the *n*/2 diploid genomes, applying a discordance metric. These *r* imputed sites consisted of all masked sites within the middle 100-kb segment, regardless of whether they were polymorphic in the reference panel. At imputed site *j* in diploid genome *i*, we define and to be the true and imputed genotypes, respectively. Both and take on values in {0,1,2}, corresponding to the number of copies of an arbitrarily chosen allele at that specific site. The discordance rate *D* across all sites is given byWe also compute the discordance rate *H* across all true heterozygous genotypes ():In addition, based on the MAF values of their constituent alleles, as computed in the full set of 2000 haplotypes, we further split the true heterozygous sites into three mutually exclusive MAF bins: 0 < MAF < 0.1 (low), 0.1 ≤ MAF < 0.2 (medium), and 0.2 ≤ MAF ≤ 0.5 (high). This separation was made to evaluate how the PD and ADCL algorithms perform across the spectrum of rare to common variants. Note also that the calculations of *D* and *H* sum over all *n*/2 diploid genomes, irrespective of whether they have one, both, or neither of their haplotypes represented in the reference panel. These computations evaluate both errors from mistaken imputations in markers that are polymorphic in the reference panel and errors arising from imputing a polymorphic marker as monomorphic in the full set of haplotypes because it is monomorphic in the reference panel.

### 1000 Genomes Project sequence data

We also applied both the PD and ADCL algorithms to sequence data from the 1000 Genomes Project (available at http://csg.sph.umich.edu/abecasis/MACH/download/1000G-PhaseI-Interim.html). Following Zhang *et al.* (2013), we considered *n =* 762 phased haplotypes from 381 diploid individuals of European ancestry: 87 Utah residents of Northern and Western European ancestry, 93 Finnish from Finland, 89 British from England and Scotland, 14 Iberians from Spain, and 98 Toscani from Italy.

We first removed all singleton sites from the data, and we then selected thirty 1-Mb segments that were approximately evenly spaced across chromosome 20, avoiding the centromere, telomeres, and adjacent areas. Study samples then were created using a similar procedure to that employed for the simulated data. For each of the 30 segments, we randomly selected *s =* 400 markers with MAF > 0.1 in the full set of 762 haplotypes, and masked the genotypes of the remaining sites. We then chose *k =* 120 haplotypes to include in the maximum-PD and minimum-ADCL reference panels, as well as in 1000 randomly generated panels. For each choice of reference panel used for each segment, we imputed the middle 100 kb, retaining the markers in both 450-kb flanking regions. We then evaluated *D* and *H* analogously to the experiments with the simulated data.

The *k =* 120 haplotypes were selected for the maximum-PD and minimum-ADCL reference panels in two ways. First, we built a neighbor-joining tree for each of the thirty 1-Mb segments based on only the information contained within the segment, and then we performed the panel-selection algorithms on the tree. In this procedure, which we call *separate panel selection*, each segment has its own distinct maximum-PD panel and minimum-ADCL panel. Second, we also considered *combined panel selection*, performing panel selection on a single tree built from the combined information in all 30 segments. In this case, each of the 30 segments shares the same maximum-PD and minimum-ADCL panels. Combined panel selection can be viewed as a small-scale version of a whole-genome analysis, in which the combined information from across a genome is used to choose reference individuals suitable for imputation for the whole genome.

## Results

### Stability of the adapted PAM algorithm

Before considering the actual imputation results produced by the different algorithms for reference panel selection, we empirically validated the stability of the adapted PAM algorithm in choosing the minimum-ADCL panel. Beyond the initial run for each of our 50 simulated data sets, we repeated the selection of the minimum-ADCL panel five additional times. For each repetition, we executed the adapted PAM algorithm with a different starting seed, and then determined the number of haplotypes that were shared by the minimum-ADCL panels from both the initial run and the run with the modified seed.

When comparing two panels of 200 reference haplotypes drawn from a set of 2000 sample haplotypes, let *m* be the number of haplotypes that are shared by both panels (0 ≤ *m* ≤ 200). For each of the five replicates, we calculated the mean value of *m* across the 50 data sets, comparing each replicate to the initial run. All five mean values of *m* were observed to be ∼179 (Table 1); for comparison, the mean of the hypergeometric distribution describing the number of haplotypes shared between two panels of size 200 independently drawn from a pool of 2000 is 20, with SD = 4.03. Therefore, despite changing the specific haplotypes used in randomly initializing the adapted PAM algorithm, most haplotypes eventually chosen for inclusion in the minimum-ADCL panel remain the same. This result suggests that the adapted PAM algorithm is reasonably stable, and in subsequent analyses, we considered only a single starting seed.

### Polymorphic sites in reference panels

For each of the 1004 reference panels, we evaluated the number of masked sites within the imputed 100-kb segment that were polymorphic. This calculation is important because only sites that are polymorphic in the reference panel can produce a meaningful imputation result for the remainder of the study sample. Summing across all 50 data sets, we detected a total of 12,851 masked sites within the 100-kb segment of interest. We then compared how many of those masked sites appear as polymorphic in the maximum-PD panel, the minimum-ADCL panel, and a single random panel.

Of the 12,851 masked sites, 8879 sites (69.09%) were polymorphic in all three reference panel types. Of the 3972 remaining sites, 1138 (8.86%) were polymorphic in both the maximum-PD and minimum-ADCL panels, 244 (1.90%) were polymorphic in both the maximum-PD and random panels, and 374 (2.91%) were polymorphic in both the minimum-ADCL and random panels. In addition, 464 (3.61%), 473 (3.68%), and 391 (3.04%) sites were polymorphic in only the maximum-PD, minimum-ADCL, and random panels, respectively. Finally, 888 (6.91%) of the masked sites were monomorphic in all three panels (Figure 3).

Overall, 10,725 sites (83.46%) were polymorphic in the 50 maximum-PD panels, 10,864 sites (84.54%) were polymorphic in the 50 minimum-ADCL panels, and 9888 sites (76.94%) were polymorphic in the 50 random panels. Using the two-tailed Wilcoxon signed-rank test, we found that both the maximum-PD and minimum-ADCL methods of panel selection identify substantially more polymorphic sites compared to choosing the reference panel randomly (*P =* 7.686 × 10^{−10} and *P =* 8.175 × 10^{−10}, respectively).

### Polymorphic sites in imputed data sets

The maximum-PD and minimum-ADCL selection algorithms resulted in similar numbers of polymorphic sites as a fraction of the total number of masked sites in their respective reference panels. We next evaluated the number of imputed sites the two methods recovered as polymorphic. In each of the 50 simulated data sets, we calculated the percentage of masked sites that were polymorphic in the imputed sample using the maximum-PD panel, the minimum-ADCL panel, the diploid PD panel, the diploid ADCL panel, and the same random panel used to assess the number of polymorphic sites within the reference panels.

Figure 4 compares the proportion of polymorphic sites imputed with combinations of the five reference panel types. In each panel of Figure 4, the random panel is used as a baseline for evaluating two of the other four panel-selection methods.

We used the two-tailed Wilcoxon signed-rank test to evaluate differences in the fraction of sites identified as polymorphic by the different panel types. Both the maximum-PD and minimum-ADCL panels recovered a significantly larger percentage of polymorphic sites compared with their respective diploid panels (*P =* 3.448 × 10^{−9} and *P =* 2.309 × 10^{−9}, respectively). The minimum-ADCL panel also outperformed the maximum-PD panel (*P =* 4.944 × 10^{−4}). However, comparing the 50 diploid PD and 50 diploid ADCL panels, the percentage of imputed sites that are polymorphic in the imputed data sets showed no significant difference (*P =* 0.1625).

### Discordance rates

As a measure of imputation accuracy, for each of the 50 simulated data sets, we separately calculated the discordance rate *D* across all sites that were imputed with the maximum-PD panel, the minimum-ADCL panel, the diploid PD panel, and the diploid ADCL panel. For a baseline, we also calculated the mean discordance rate over the 1000 randomly selected reference panels. We were mainly interested in comparing the performance between the maximum-PD and minimum-ADCL panels, as well as between the diploid PD and diploid ADCL panels.

The discordance rates are shown in Figure 5, and their mean values are summarized in Table 2. Again using the two-tailed Wilcoxon signed-rank test, the minimum-ADCL panel exhibited significantly lower discordance rates than the maximum-PD panel (*P =* 1.342 × 10^{−9}). The diploid ADCL panel also had lower discordance rates than the diploid PD panel (*P =* 2.597 × 10^{−3}). The minimum-ADCL, maximum-PD, diploid ADCL, and diploid PD panels all provided lower discordance rates than the mean of the 1000 randomly selected panels (*P =* 7.789 × 10^{−10}, 9.928 × 10^{−10}, 8.797 × 10^{−10}, and 4.920 × 10^{−7}, respectively).

To generate a discordance measure for low-frequency variants, we also calculated the discordance rate *H* across the heterozygous sites with 0 < MAF < 0.1. From Figure 5 and Table 2, it can be seen that the mean discordance rates are higher for low-MAF loci than they are for high-MAF loci. Nevertheless, compared to the maximum-PD panel, the minimum-ADCL panel still achieved significantly higher imputation accuracy on low-MAF heterozygotes (*P =* 1.606 × 10^{−9}). The same relationship also held between the diploid ADCL and diploid PD panels (*P =* 1.871 × 10^{−4}). As was observed when considering all variants, the minimum-ADCL, maximum-PD, diploid ADCL, and diploid PD panels all had lower discordance rates than the mean of the 1000 random panels (*P =* 7.790 × 10^{−10}, 1.264 × 10^{−9}, 7.790 × 10^{−10}, and 2.244 × 10^{−6}, respectively).

### Discordance rates under different simulation settings

Following Zhang *et al.* (2013), to investigate how different parameter choices might have affected the simulation results, we repeated the analysis taking into consideration (1) different reference panel sizes *k*, (2) different marker densities *s*, and (3) different target-sequence lengths. When varying a parameter, we kept the other two parameters constant at their default values used in the initial analysis (reference panel size *k =* 200, number of markers per megabase *s =* 300, imputation length kb). The baseline for comparison here is the mean discordance rate over the 50 randomly selected reference panels. This number is smaller than the 1000 randomly selected reference panels used to calculate the baseline mean discordance rate in the initial analysis, owing to runtime considerations that arise when performing a large number of imputations using the random panels. Box plots of the results are shown in Figure 6, and mean discordance rates of the various panel types over all sites and over the low-frequency variants are shown in Table 3 and Table 4, respectively.

We first evaluated the influence of reference panel size on imputation accuracy, considering cases with *k* = 100, 300, 400, and 500 (compared to the initial analysis with *k =* 200). We observed that as the panel size *k* increased, discordance rates decreased across all reference panel types. However, we also observed a decrease in the difference in performance between the ADCL and PD algorithms in both the haploid (maximum-PD and minimum-ADCL) and diploid cases. In other words, the gain in imputation accuracy obtained by minimizing ADCL instead of maximizing PD diminished with large reference panel sizes.

Next, we examined how the initial genotyping density of the markers affected imputation accuracy by considering instances with *s* = 200, 400, 500, and 600 (compared to the initial choice of *s =* 300). Here, across all reference panel types, the discordance rates decreased slightly with increasing marker density *s*. Nevertheless, for all densities, both the haploid and diploid ADCL panels consistently outperformed their PD counterparts in terms of imputation accuracy across all sites, as well as across only the low-frequency variants.

Finally, we considered whether the length of the target imputation region has an effect on imputation accuracy. We imputed segments of length 500 kb and 1 and 2 Mb (compared to the initial imputation length choice of 100 kb). In all cases, a flanking 450-kb region was added to each end of the sequence to avoid edge effects. We observed that discordance rates remained relatively constant across different imputation lengths. Again, the ADCL panels produced significantly lower discordance rates compared to the PD panels regardless of the specific choice of imputation length.

### Discordance rates with 1000 Genomes Project sequence data

To confirm that our findings on the simulated data set are also observed when using actual sequence data, we performed a similar analysis for thirty 1-Mb segments generated on chromosome 20 using 381 diploid individuals of European ancestry from the 1000 Genomes Project. We were again interested in comparing the difference in imputation accuracy achieved by the minimum-ADCL and maximum-PD panels using the mean discordance rate over 1000 randomly selected reference panels as a baseline for comparison. We considered two ways of selecting the reference panels: separate panel selection, in which the 30 segments each had distinct sets of panels derived from the information in local 1-Mb regions, and combined panel selection, in which all 30 segments shared the same set of panels derived from the combined information across all segments. The discordance rates are shown in Figure 7, and their mean values are summarized in Table 5. For the three different panel types, Figure 8 compares the discordance rates examined in each of the 30 segments over all imputed sites, as well as over only the low-frequency variants.

Applying the two-tailed Wilcoxon signed-rank test, we observed that when using separate panel selection, the minimum-ADCL algorithm produced significantly lower discordance rates than the maximum-PD algorithm across all imputed sites (*P =* 2.367 × 10^{−3}), as shown in Table 5. Focusing solely on the low-frequency variants, the minimum-ADCL panel continued to produce better imputation accuracy than the maximum-PD panel (*P =* 0.0234). With combined panel selection, the same trend held. The minimum-ADCL algorithm outperformed the maximum-PD algorithm across all imputed sites (*P =* 0.0364), with the improvement in performance close to significant across only the low-frequency variants (*P =* 0.0667).

## Discussion

The decreasing cost of modern sequencing has enhanced the practicality of generating a reference panel from the haplotypes that are already present in the study sample. It generally remains prohibitive, however, to perform full sequencing for large numbers of haplotypes. Given this constraint in resources, what is the optimal approach for selecting the subset of the study sample to sequence to achieve the best imputation results? We explored two objective functions for optimization, with the aim of ensuring high imputation accuracy.

Maximizing PD as a way of ensuring that the total genetic diversity of a sample is well represented is one sensible approach. This type of panel-selection method achieves lower imputation discordance rates than assembling reference panels from randomly selected haplotypes (Kang and Marjoram 2012; Zhang *et al.* 2013). Nevertheless, it has not been clear whether PD represents the best objective function for panel selection.

Minimizing ADCL attempts to ensure that the subset of the study sample selected for the panel is representative of the total diversity present, albeit using a different approach. It is conceptually similar to a clustering problem, in that the number of clusters is predetermined, and the algorithm returns the cluster to which each haplotype belongs, as well as the haplotype that is the most central within its cluster. This haplotype is then included in the reference panel. Unlike when maximizing PD, the problem of selecting nonrepresentative branches is mostly avoided by minimizing ADCL because the algorithm tends to not select long, pendant branches of the tree (Figure 1c).

For both simulated and actual sequence data, we observed that minimizing ADCL does in fact provide an improvement in imputation accuracy compared to maximizing PD. When looking at the overall discordance-rate measures, minimizing ADCL produced a significantly lower discordance rate over all sites compared to maximizing PD. This result held across various choices of genotyping density and imputation length, suggesting that the observed result is robust to such changes. It is only with increasing panel sizes that the gain in imputation accuracy obtained by minimizing ADCL decreased compared to maximizing PD. This outcome potentially could be due to the diminishing returns, in terms of representative variants, contributed by each additional haplotype in the reference panel. Consider the extreme case, where all the haplotypes in the study sample are included in the reference panel. In such a situation, both algorithms return trivially identical imputation results.

One metric of particular interest is the performance of an algorithm in the imputation of low-frequency variants. Although early genome-wide association (GWA) studies focused on identifying common variants associated with particular diseases or phenotypic traits, the focus of GWA studies has increasingly shifted toward an interest in rare genetic variants (Asimit and Zeggini 2010; Cirulli and Goldstein 2010; Eichler *et al.* 2010). As such studies improve in their ability to detect the effects of rare variants on phenotype (Li *et al.* 2013; Lee *et al.* 2014), it is paramount that the imputation process carried out alongside them generates reasonably accurate imputed genotypes with low-frequency variants.

In this context, from Table 2 and Table 5, we observe, based on differences in the mean discordance rates, that minimizing ADCL improved on maximizing PD by the largest absolute amount in the low-MAF bin (0 < MAF < 0.1), in both the simulated and actual data. This result might be explained by the fact that the discordance rates obtained when imputing low-frequency variants are relatively high to begin with, and can be potentially reduced to a much greater extent with an improved choice of algorithm for panel selection. Over all sites, as well as for heterozygous sites in the high-MAF bin (0.2 ≤ MAF ≤ 0.5), the improvement in imputation accuracy by minimizing ADCL is limited by the already low discordance rates. Our analyses are consistent in suggesting that the minimum-ADCL algorithm can contribute to reducing imputation errors in GWA studies that seek to identify the effects of low-frequency variants on phenotypic traits.

In summary, we have demonstrated that internal reference panel selection via minimizing ADCL produces empirically improved imputation accuracy compared to maximizing PD, particularly for low-frequency variants. This finding applies to both simulated and actual sequence data and is robust to changes in the choice of initial parameter values. Note that both ADCL and PD represent intermediate criteria that provide practical objective functions, where the ultimate goal is maximizing imputation accuracy or other aspects of imputation performance. Although both algorithms produce considerably better imputation performance measures than the use of random panels, neither is guaranteed to produce the maximal value of such measures over all possible panels. Further, both approaches currently rely on imperfect assumptions, such as the assumption that haplotype phase is known. It remains to be determined whether a single simple criterion exists that could lead to identification of the best possible panel for maximizing imputation performance.

## Acknowledgments

This work was supported by National Institutes of Health grant R01 HG005855.

## Footnotes

*Communicating editor: C. Kendziorski*

- Received April 7, 2015.
- Accepted August 14, 2015.

- Copyright © 2015 by the Genetics Society of America

Available freely online through the author-supported open access option.