## Abstract

Understanding the relatedness of individuals within or between populations is a common goal in biology. Increasingly, relatedness features in genetic epidemiology studies of pathogens. These studies are relatively new compared to those in humans and other organisms, but are important for designing interventions and understanding pathogen transmission. Only recently have researchers begun to routinely apply relatedness to apicomplexan eukaryotic malaria parasites, and to date have used a range of different approaches on an *ad hoc* basis. Therefore, it remains unclear how to compare different studies and which measures to use. Here, we systematically compare measures based on identity-by-state (IBS) and identity-by-descent (IBD) using a globally diverse data set of malaria parasites, *Plasmodium falciparum* and *P. vivax*, and provide marker requirements for estimates based on IBD. We formally show that the informativeness of polyallelic markers for relatedness inference is maximized when alleles are equifrequent. Estimates based on IBS are sensitive to allele frequencies, which vary across populations and by experimental design. For portability across studies, we thus recommend estimates based on IBD. To generate estimates with errors below an arbitrary threshold of 0.1, we recommend ∼100 polyallelic or 200 biallelic markers. Marker requirements are immediately applicable to haploid malaria parasites and other haploid eukaryotes. C.I.s facilitate comparison when different marker sets are used. This is the first attempt to provide rigorous analysis of the reliability of, and requirements for, relatedness inference in malaria genetic epidemiology. We hope it will provide a basis for statistically informed prospective study design and surveillance strategies.

- identity-by-state
- identity-by-descent
- relatedness
- independence model
- hidden Markov model
- malaria
*Plasmodium falciparum**Plasmodium vivax*- genetic epidemiology

GENETIC relatedness is a measure of recent shared ancestry (Weir *et al.* 2006; Speed and Balding 2015). It ranges from zero between two unrelated individuals to one between clones, and in the absence of inbreeding is broken down by recombination (Wright 1922). Since the early 20th century, relatedness has been used across a wide variety of fields: agriculture, forensic science, disease mapping, and ecology (Weir *et al.* 2006; Waples *et al.* 2019). Nevertheless, studies of relatedness are niche in the nascent field of infectious disease genetic epidemiology because only a subset of pathogens are eukaryotes, *e.g.*, helminths and parasitic protoza, which include malaria parasites (Gardy *et al.* 2015; Blanton 2018). Because relatedness is broken down by outbreeding, it can change with each generation (Thompson 2013). Studies of malaria parasite relatedness thus provide a sensitive measure of recent gene flow (Taylor *et al.* 2017), generating insight on an operationally relevant scale for disease control efforts (Blanton 2018; Wesolowski *et al.* 2018).

Malaria parasites are haploid during the human stages of their complex life cycle, which includes an obligate stage of sexual recombination between gametocytes within the mosquito (Baton and Ranford-Cartwright 2005). The probability of selfing depends on the number of parasite clones in the human source infection: certain if monoclonal *vs.* uncertain if polyclonal. Polyclonal infections result from either a single mosquito inoculation, in which case most parasite clones are likely interrelated, or multiple inoculations, in which case parasite clones are likely unrelated (Wong *et al.* 2017, 2018; Nkhoma *et al.* 2018). The prevalence of polyclonal infections depends on many epidemiological factors, *e.g.*, transmission intensity (Anderson *et al.* 2000; Schoepflin *et al.* 2009; Nkhoma *et al.* 2013) and correlates of human host immunity (Ntoumi *et al.* 1995; Konaté *et al.* 1999; Owusu-Agyei *et al.* 2002; Kiwuwa *et al.* 2013).

The diploid coefficient of inbreeding is a measure of relatedness between haploid genotype pairs, defined as a probability of identity-by-descent (IBD) (Hill 1996). Two alleles are identical-by-descent (also IBD) if descended from a recent common ancestor in some ancestral reference population (Bink *et al.* 2008; Thompson 2013; Speed and Balding 2015). IBD can also be interpreted in terms of shared segments unbroken by recombination since a recent common ancestor (Thompson 2013; Speed and Balding 2015), where the segment length distribution relates to ancestor generation under a coalescent model (Speed and Balding 2015). IBD segments underpin many applications from disease mapping (Browning and Thompson 2012) to *Plasmodium falciparum* selection detection (Henden *et al.* 2018), and can be averaged to generate a measure of relatedness (Speed and Balding 2015). However, coalescent interpretation remains challenging for malaria parasites because the complexities of their life cycle convolute generation in a setting-dependent manner. Two alleles that share the same allelic type are identical-by-state (IBS), and include those that are both IBD and not IBD but identical due to chance sharing of common alleles (Weir *et al.* 2006; Bink *et al.* 2008; Stevens *et al.* 2011; Thompson 2013; Huang *et al.* 2015; Speed and Balding 2015). While identity-by-state (also IBS) is observed, IBD is hidden and must be inferred.

Many estimators of relatedness exist, some assuming independence between IBD states (Weir *et al.* 2006; Bink *et al.* 2008) and others not [*e.g.*, Leutenegger *et al.* (2003) and subsequent models (Brown *et al.* 2012)]. Those assuming independence have fewer parameters but impaired power in the presence of dependence (Anderson and Garza 2006). Those that do not assume independence are often based on hidden Markov models (HMMs) (Rabiner 1989; Brown *et al.* 2012; Druet and Gautier 2017; Ramstetter *et al.* 2017). The HMM framework enables inference of IBD segments via one or more additional parameters that can be more difficult to reliably estimate than relatedness. Measures of relatedness used in studies of malaria include those estimated under HMMs [hmmIBD (Schaffner *et al.* 2018), isoRelate (Henden *et al.* 2018), and DEploidIBD (Zhu *et al.* 2018)]. IBS-based measures, *e.g.*, proportions of alleles shared [the haploid equivalent of the “allele-sharing coefficient” (Speed and Balding 2015)], or counts of allele differences, require only simple calculation and are thus popular also (Orjuela-Sánchez *et al.* 2009; Anderson *et al.* 2010; Daniels *et al.* 2015; Omedo *et al.* 2017a,b; Oyebola *et al.* 2018; Chang *et al.* 2019).

Despite many IBD- and IBS-based analyses, there are few systematic comparisons applicable to malaria studies. We compare IBD- and IBS-based measures for monoclonal malaria parasite samples using simulated data; various data sets of *P. falciparum*, the parasite responsible for the most-deadly type of human malaria; and a data set of *P. vivax*, the parasite most commonly responsible for malaria relapses. We use a framework encompassing two models assuming independence and not. It is an error-modified version of that of Leutenegger *et al.* (2003), which is at the core of many models (Brown *et al.* 2012), including those designed for comparison across malaria parasite samples (Henden *et al.* 2018; Schaffner *et al.* 2018). To guide future relatedness studies of malaria parasites and haploid eukaryotes more generally, we explore marker and allele counts for relatedness inference. We focus on relatedness alone, averaging over all IBD segments however small (Brown *et al.* 2012). Relatedness estimates are thus liable to reflect some linkage disequilibrium (LD) at the population level (Slatkin 2008). From relatedness alone, we can distinguish pairs that are highly related and not, but we cannot distinguish a highly inbred pair from an outbred pair with the same relatedness.

## Methods

### Relatedness

In this study, relatedness *r* is defined as the probability that, at any locus on the genome, the alleles sampled from two individuals are IBD. Let *m* denote the number of genotyped markers, each with a locus indexed by *t* = 1, . . . ,*m*. Let denote the index of the chromosome of the *t*-th locus, and its position on that chromosome (all markers are treated as point polymorphisms). For two indices , we either have , or and . Let if two individuals are IBD at the *t*-th locus; otherwise . We assume that *r* is constant across the genome: for all . The sequence could be made of independent variables, or could be a Markov chain, in which case, if we write for the probability of given that , the model statesAbove, denotes a genetic distance in base pairs between loci and *t*. If , ; such that and are independent. The value parameterizes the switching rate of the Markov chain and ρ is a constant equal to the recombination rate, assumed fixed across the genome with value for *P. falciparum* parasites (Miles *et al.* 2016).

The model connects *r* to the data as follows. At each locus, let denote a set of alleles, where denotes the cardinality of (allelic richness of the *t*-th marker). For individuals at locus *t* we observe the pair . We assume that alleles occur with frequencies , with for all and . The data comprise , and at *m* loci. A simple observation model relates the data to by assuming that, if , then and are independent categorical variables taking values in with probabilities . If , then is such a categorical variable and with probability one. A more realistic model in Section B of File S1 accounts for observation error.

Combining the Markov model for with an observation model as above leads to an HMM (Figure 1) with likelihood function , which can be evaluated using the forward algorithm. An independence model can be retrieved by setting for all *t*. Let and denote the maximum likelihood estimators (MLEs) of *r* and *k*, respectively. For each pair of individuals , we compute them in R (R Core Team 2018). We use the one-dimensional optimize function to compute under independence, and optim to compute and under the HMM, with initial values equal to 0.5 and 8, respectively. The default algorithm is that of Nelder and Mead (1965). Convergence of optim can be monitored via the number of calls made to the log-likelihood.

Under assumptions on the data-generating process, could be shown to be consistent for *r* as . However, these asymptotic considerations are intricate in the present setting, where the degree of dependencies between observations increases with the sample size *m* due to decreasing intermarker distance (Hill and Weir 2011). This departs from standard asymptotic analysis where observations are not increasingly dependent as (Douc and Moulines 2012); see also Section B of File S1.

Without standard results such as asymptotic normality of the MLE, there is no simple formula for sample size determination relating *m* to the variance of . The estimators’ distributions can still be approximately normal if the log-likelihood is approximately quadratic (Geyer 2013), in which case C.I.s can be obtained through the second derivative of the log-likelihood at the MLE. However, the present setting poses an additional difficulty since the MLE can be located on the boundary of the parameter space, or (Self and Liang 1987). Therefore, we rely on the parametric bootstrap (Wasserman 2013) to construct C.I.s around . Unless otherwise stated, we use 500 bootstrap draws throughout.

### Fraction IBS

For a pair of samples *i* and *j*, we define the fraction IBS as(1)Its expectation is a linear function of relatedness, *e.g.*, when there is no genotyping error(2)where(3)Here, and are equivalent to Nei’s gene identity and diversity, respectively, or, for an outbred diploid, homozygosity and heterozygosity, respectively, (Nei 1972, 1973; Nei and Tajima 1981). Equation 2 might suggest that could converge to (where ) as under assumptions such as independent loci. Under this setup, the estimator would not be consistent for *r*, but could be corrected (Section A of File S1).

*Plasmodium* data

*P. falciparum* data are biallelic (*i.e.*, for all ) SNP data from monoclonal samples (Table 1). All data are published (Echeverry *et al.* 2013; Nkhoma *et al.* 2013; Cerqueira *et al.* 2017; Omedo *et al.* 2017a,b; Taylor *et al.* 2017). They were obtained either from sparse genome-wide panels of select markers, called barcodes, or from a dense whole-genome sequencing (WGS) data set; full details of sample collection and data generation can be found via the citations above, and references therein. Additional steps we took to process the data are as follows.

Besides mapping SNP positions to the *P. falciparum* 3d7 v3 reference genome and recoding heteroallelic calls as missing [since all available samples were previously classified monoclonal (Echeverry *et al.* 2013)], we did not postprocess the Colombian data in any way. Thailand 93-SNP and WGS samples were used as described in Taylor *et al.* (2017). However, 5299 SNPs on chromosome 14 that were unintentionally omitted from the WGS data set in Taylor *et al.* (2017) are included here. Data derived from Omedo *et al.* (2017a,b) were processed using steps described in “Sample and SNP cut-off selection criteria” of Omedo *et al.* (2017a). In addition, we removed samples with duplicate SNP calls; removed samples classified as not monoclonal using a % heteroallelic SNP call rate to classify samples as monoclonal following (Nkhoma *et al.* 2013); and, among samples classified monoclonal, treated heteroallelic SNP calls as missing and removed monomorphic SNPs.

For each *P. falciparum* processed data set, allele frequencies were estimated by simple proportions: for and each locus *t*, where denotes the number of samples not missing data at the *t*-th locus. Minor allele frequencies, , vary considerably due to different marker panels and spatiotemporal variation among parasite populations (Figure 2).

Samples in the *P. vivax* data set were collected between 2010 and 2014 from two clinical trials on the Thailand–Myanmar border (Chu *et al.* 2018a,b). They were genotyped at three to nine highly polyallelic microsatellites (MSs). In this study, we analyze samples genotyped at nine MSs with no evidence of polyclonality (detection of two or more alleles at one or more MS) from people, selecting one episode per person uniformly at random from all episodes per person. We use allele frequencies reported in Taylor *et al.* (2018). They have average expected homozygosity and effective cardinality (defined below, Equation 7) averaged over MSs of . Since there are only nine markers, we analyze these data under the independence model.

### Simulated data

Unless otherwise stated, data were simulated under the HMM with genotyping error using positions sampled uniformly from the Thailand WGS data set and with frequencies as follows. Biallelic marker data were simulated using frequencies sampled from the Thailand WGS data set with probability proportional to minor allele frequency estimates (to compensate for the skew toward rare alleles in WGS data set). Polyallelic marker data were simulated using frequencies sampled from a Dirichlet distribution using parameter vector α with entries each equal to 100 to generate frequencies for approximately equifrequent alleles, and α with entries each equal to 1 to generate frequencies uniform over the simplex, thus increasingly skewed toward rare alleles when .

### Marker requirements for prospective relatedness inference

We explore marker requirements for error of around *r*. By maximizing the likelihood we obtain estimates of both *r* and *k*, but we focus on the quality of the estimate of *r* only.

For a given setting [*e.g.*, *m*, *r*, *k*,] we simulate 500 pairs of haploid genotype calls, and for each pair compute and under the HMM. We compute the root mean squared error (RMSE) of around *r* over the 500 repeats. From the RMSEs, we derive *m* or required for RMSE under a prespecified value. Unless otherwise stated, when we fix *k* we use 8, the mean for from the WGS data set; when we fix *r* we use 0.5, which we find leads to the largest RMSE, rendering data requirements based on conservative. For simplicity, we fix to be the same for all *t*. To explore *m* and for markers with and without equifrequent alleles, we use effective cardinality (Equation 7) averaged over all *m* considered,(4)As an aside, comparison between and *r* differs from that between and “realized relatedness,” , where *L* is the length of the genome (Speed and Balding 2015). The former has the advantage of revealing RMSE due to the finite length of the genome [*i.e.*, Mendelian sampling (Hill and Weir 2011)], while at the same time revealing the excess and thus theoretically avoidable error due to marker limitations.

We consider the theoretical impact of at a single locus. For given , we measure the informativeness via the Fisher information matrix (FIM), which relates to the precision of the MLE if the log-likelihood is approximately quadratic. We define , where the expectation is with respect to given *r* and the allele frequencies; we assume no genotyping error for simplicity; the sign stands for the second-order derivative with respect to *r*. depends on the allele frequencies and on *r*:(5)For any and *r*, it is maximized over all by for all *l*, *i.e.*, by equifrequent alleles (proof in Section B of File S1), in agreement with high minor allele frequency (Thompson 1975). When alleles are equifrequent we obtain(6)To explore the theoretical gain of increasing we calculate the multiplicative increase in relative to (Figure 3, left). The largest increase in precision is obtained upon increasing from 2 to 3 with increasing returns as *r* approaches zero. However the justification of the FIM as a measure of precision breaks at the boundary of the parameter space. The plot on the right of Figure 3 shows a multiplicative increase in precision as a function of effective cardinality,(7)the noninteger number of equifrequent alleles concordant with based on the allele frequencies . For example, is the effective cardinality of an “ideal” biallelic SNP with minor allele frequency 0.5, whereas is the effective cardinality of a realistic biallelic SNP with minor allele frequency . Precision increases with as it does with .

### Data availability

All data used in this study are either simulated or published previously. Additional steps we took to process the data are described in section *Plasmodium data*. The processed data and code necessary for confirming the conclusions of the article are available at https://github.com/artaylor85/PlasmodiumRelatedness. All code was written in R (R Core Team 2018). Supplemental material available at FigShare: https://doi.org/10.25386/genetics.8977217.

## Results

This section is arranged as follows. First we consider the fraction IBS, , and show how it is problematic as an estimator of *r*. Second, we discuss for *Plasmodium* data. Third, the performance of the HMM is compared to that of the independence model using simulated data. Fourth, we explore marker requirements for the estimation of *r* using simulated data.

### Fraction IBS as an estimator of relatedness

As an estimator of *r*, does not satisfy favorable statistical properties but its expectation is a correlate of *r* (Equation 2). As such, studies have recovered trends in *r* (*e.g.*, with geographic distance) using IBS-based measures (Omedo *et al.* 2017a; Chang *et al.* 2019). However, quantitative trends and absolute values of are only comparable across data whose markers have the same allele frequencies (Chang *et al.* 2019). To illustrate the effect of differing frequencies, we simulated using and frequency estimates from published data sets (Figure 4, top). The distributions are far from (we would expect to see bigger and smaller distances for data simulated using and , respectively, with no difference for ). Their locations vary considerably, centering around and rendering absolute values nonportable across data sets. In contrast, distributions of all center around (Figure 4, bottom).

Figure 5 shows and distributions based on sample pairs from the published data sets. The locations and spreads of the distributions vary considerably. They are not comparable across data sets, *e.g.,* among SNP data sets, the left-most centering of the Thailand 93-SNP distribution is not evidence that *P. falciparum* parasites from Thailand are less related than those from Kenya. Despite very different absolute values, each distribution centers around , the expectation when . Thus, we conclude that many parasite pairs in these real data sets are unrelated, as corroborated by estimates based on IBD (Figure 5, bottom). The distribution based on *P. vivax* data (Thailand MS) most closely approximates its partner distribution due to highly polymorphic MSs.

### Relatedness of *Plasmodium* data

For each data set, values range from 0 to 1, suggesting the presence of unrelated, partially related, and clonal parasites (Figure 5, bottom plot). However, the vast majority are . The skew toward lowly related parasite pairs is consistent with primary IBD-based analyses of the Thai *P. falciparum* data (Cerqueira *et al.* 2017; Taylor *et al.* 2017), as well as mean IBD fractions reported elsewhere (Zhu *et al.* 2018). Though the majority are , mean values vary. Variation is caused by several factors. First, the mean is sensitive to small but variable counts of highly related parasite pairs: proportions of range from 0.003 in the Thailand 93-SNP data set (lowest mean ) to 0.062 and 0.065 in the Colombia and The Gambia data sets, respectively (highest mean ). These highly related pairs are often the focus of demographic analyses, *e.g.*, (Chang *et al.* 2019). Considering largely unrelated pairs, some variation among data sets is likely due to LD. For example, among , the mean of the Thai WGS data set is 0.08, equal to the mean IBD fraction reported for Cambodia (0.08) and greater than that reported for Ghana (0.002) (Zhu *et al.* 2018). Overall, the interpretation and comparison of point estimates hinges on them being sufficiently precise; otherwise C.I.s facilitate comparison across different data sets.

For 100 estimates selected specifically to span the [0, 1] range, Figure 6 shows 95% C.I.s. In general, they are tighter around estimates for data sets with larger , an observation we will return to. Considering the boundaries, intervals around estimates of *r* close to 1 are tighter, in general, than those for *r* close to 0. Due to the nonquadratic nature of the log-likelihood of *r* when is close to either 0 or 1 (*e.g.*, Figure B.3 of File S1, left top and middle), we construct C.I.s using the parametric bootstrap. For away from 0 and 1, the log-likelihood is quadratic (*e.g.*, Figure B.3 of File S1, bottom left plot) and thus normal approximation C.I.s could be constructed. As an aside, Figure B.3 also demonstrates both the difficulty in estimating *k* and the robustness of relative to when is close to the boundaries.

### The HMM *vs.* the independence model

The HMM was used to compute for biallelic *P. falciparum* data sets, all of which have (Table 1), whereas the independence model was used for the polyallelic *P. vivax* data set, Thai MS, whose . In this section, the performance of the HMM is compared to that of the independence model using data simulated under the HMM. The main difference between the HMM and the independence model is estimation uncertainty. Under a well-specified model, 95% C.I.s should have 95% coverage, *i.e.*, contain the value of *r* used to simulate the data 95% of the time. The HMM provides coverage close to 0.95 for , while the independence model (misspecified) provides waning coverage for , especially when *k* is small. For , both the HMM and the independence model provide similar coverage, above or around 0.85 (Figure 7, a and b). In terms of *r* estimation accuracy, the two models are similar, with only a slight increase in RMSE under the independence model when (Figure 7, c and d). The computational cost of obtaining the MLE under either model is comparable; timings are provided in Section B of File S1.

### Marker requirements for prospective relatedness inference

As Figure 4 exemplified using simulated data, estimates of concentrate around the value of *r* used to simulate the data. However, in Figure 4 they do so with large variability, due to limited data ( with ). We now consider how large *m* needs to be to estimate *r* with specified RMSE, first considering biallelic markers with for all *t*, and second considering polyallelic markers with .

#### Biallelic markers:

Biallelic markers include biallelic SNPs, the most abundant polymorphic marker type, commonly used for relatedness inference (Weir *et al.* 2006). Figure 8 shows the RMSE of ** generated under the HMM given allele frequencies drawn from the WGS data set, with probability proportional to their minor allele frequencies ***vs.* allele frequencies drawn uniformly at random. Errors obtained using the former approach are smaller (Figure 8, left) in agreement with the long-established result that higher minor allele frequencies are preferable for relationship inference (Thompson 1975). Either way, RMSE is relatively large for 24 markers, decreasing dramatically when **, with diminishing returns thereafter (it does not tend to zero due to the finite length of the genome). Also of note, RMSE decreases with increasing proximity of the data-generating ***r* to either 0 or 1 (especially the latter). As such, biallelic marker requirements for inference of ** constrain guidelines for inference of ***r* in general (Table 2).

#### Polyallelic markers:

Highly polyallelic MS markers have long been used for relatedness inference and there is growing interest in using microhaplotypes (regions of high SNP diversity, unbroken by recombination) (Weir *et al.* 2006; Baetscher *et al.* 2018). Neither MSs nor microhaplotypes are point polymorphisms. However, to explore the general utility of polyallelic markers for relatedness inference, we make the simplifying assumption that they are. We focus on , since for biallelic markers had the largest marker requirements in general (Table 2).

Figure 9a shows three notable results. First, if only a small number of markers (*e.g.*, 24) are available, a slight increase in their average effective cardinality markedly reduces RMSE, with diminishing returns as *m* grows. Second, to obtain RMSE less than some arbitrary amounts, one can either increase cardinality or *m*. For example, to obtain RMSE , our results suggest typing 96 markers with or ∼192 markers with (concordant with Table 2). However, third, within the range of *m* values explored here, markers with are necessary for optimally low RMSE (*i.e.*, RMSE comparable with Mendelian sampling).

The results shown in Figure 9a are projected onto a single axis in Figure 9b. Larger provides smaller RMSE with diminishing returns beyond . Informally, this result provides intuition as to why we obtain, in general, tighter C.I.s around based on *Plasmodium* data sets with larger (Figure 6). Moreover, it suggests that the C.I.s around the Thailand WGS estimates are as small as they can be.

## Discussion

Using a simple model framework, we call attention to properties of estimates of genetic relatedness, *r*, increasingly used in genetic epidemiology of malaria. These results are applicable more generally to haploid eukaryotes, while highly recombining prokaryotes would require model modifications.

The fraction IBS is a simple data statistic that includes the chance sharing of common alleles (Thompson 2013). It is not a statistically principled estimator of *r*. As such, it does not allow calculation of C.I.s for *r*, nor marker requirements. Its expectation is a correlate of *r*, but absolute values and quantitative trend estimates are not portable across studies due to dependence on allele frequencies, which vary in space and time, and with different marker panels and quality control procedures (Speed and Balding 2015). However, it is simple and its use will persist. To aid interpretation across studies that continue using IBS-based measures to investigate relatedness, we show how it is expected to change as a function of *r* and allele frequencies.

Model-based relatedness inference allows construction of C.I.s and marker requirements. Based on the parameters we explored, to achieve error arbitrarily below 0.10, data for ≥ 200 biallelic or 100 polyallelic markers are recommended (fewer are required if markers are highly polyallelic). In practice, a set of makers could combine different marker types.

We present results based on a gobal set of published data sets. The original studies all feature relatedness estimates either based on allele sharing (Echeverry *et al.* 2013; Nkhoma *et al.* 2013), SNP differences (Omedo *et al.* 2017a,b), or IBD (Cerqueira *et al.* 2017; Taylor *et al.* 2017). Those using IBS-based estimates legitimately focus on a single data set, recovering meaningful but data set-specific quantitative results. Where comparisons can be made, our results generally agree with primary analyses (Table C.1 of File S1). More widely, our results agree (in order of magnitude) with those reported for diploids and polyploids (Table C.2 of File S1). Relatedness inference for polyploids [*e.g.*, (Wang and Scribner 2014; Huang *et al.* 2015)] is similar to that for polyclonal malaria samples. However, the latter is more challenging, since the equivalence of ploidy is unknown and variable. Despite these challenges, methods to infer relatedness within polyclonal malaria samples exist (Henden *et al.* 2018; Zhu *et al.* 2018), while methods to infer relatedness across polyclonal malaria samples are under development. It will be interesting to see how marker requirements for monoclonal samples scale in this more complex setting.

Our results are limited by various simplifying assumptions; most problematically, fixed allele frequencies (Speed and Balding 2015; Waples *et al.* 2019). Typically, allele frequencies are estimated using data intended for relatedness inference yet assuming independent and identically distributed samples (Wang 2004; Voight and Pritchard 2005). These data-derived allele frequencies can lead to relatedness underestimation (Bink *et al.* 2008). Improving them could benefit inference more than increasing the number of markers (Bink *et al.* 2008). To better estimate malaria parasite allele frequencies, one could jointly model frequencies and relatedness (Wang 2004). Moreover, by borrowing information across samples and extending the inference framework, one could theoretically infer the ancestral recombination graph and thus the malaria parasite genetic map (presently assumed uniform across the malaria parasite genome (Henden *et al.* 2018; Schaffner *et al.* 2018; Zhu *et al.* 2018)). That said, complexities specific to malaria (*e.g.*, selfing and its association with transmission) present unique challenges (Speidel *et al.* 2019). Modular multi-way extensions of pairwise methods may also outperform pairwise methods (Ramstetter *et al.* 2018).

Formally stated in Equation 6, a highly polyallelic marker can be several times more informative than a biallelic marker for relatedness inference, as for population assignment (Rosenberg *et al.* 2003). Despite superior informativeness, MSs are being superceded by SNPs for relatedness inference, due to the abundance, and relative ease and reliability of typing SNPs (Weir *et al.* 2006). Microhaplotypes combine the ease of SNPs with the informativeness of polyallelic markers (Baetscher *et al.* 2018). They can be defined using an LD-based decision theoretic criterion (Rosenberg *et al.* 2003; Slatkin 2008; Gattepaille and Jakobsson 2012), and genotyped using amplicon sequencing (Neafsey *et al.* 2015; Baetscher *et al.* 2018) or molecular inversion probes (MIPs), also used to genotype MSs and SNPs (Mu *et al.* 2010; Hiatt *et al.* 2013; Aydemir *et al.* 2018). Amplicon and MIP approaches are especially useful given polyclonal samples, because they can capture within-host clonal densities and phases (Neafsey *et al.* 2015; Aydemir *et al.* 2018). A model that accurately reflects the fact that MSs and microhaplotypes are not point polymorphisms, while accounting for their associated mutation and observation error rates, merits consideration (Hoffman and Amos 2005; McDew-White *et al.* 2019).

Besides motif repeats within MSs and SNPs within microhaplotypes (presently overlooked), it is preferable to minimize dependence between markers. Dependence is a function of marker position and LD. When considering polyallelic markers, we sampled marker positions uniformly at random from the Thailand WGS data set. For microhaplotypes, a more realistic approach would draw from genomic intervals amenable to physical phasing and with high within-interval LD. If diverse windows are genomically clustered, this presents a trade-off between distance and cardinality. We do not consider the trade-off here, but it can be explored within the current framework and is a topic of future work. Regarding LD, some models commonly used in human genetics account for it (Browning 2008; Browning and Browning 2010) [also see Brown *et al.* (2012)], but those designed to estimate relatedness between malaria parasites do not (Henden *et al.* 2018; Schaffner *et al.* 2018; Zhu *et al.* 2018). LD reported in malaria parasite populations is highly setting-dependent but generally lower than that reported in human populations (Anderson *et al.* 2000; International HapMap Consortium *et al.* 2007; Neafsey *et al.* 2008; Echeverry *et al.* 2013; Samad *et al.* 2015). Its incorporation into methods for malaria parasite relatedness inference, both within and between polyallelic markers, warrants further research.

Here and elsewhere (Table C.2. of File S1), marker requirements are based on either down-sampled or simulated data. Standard asymptotic theory for HMMs is problematic in the present setting due to the finite length of the genome, and the increasing degree of dependencies between markers as their density grows. Understanding the finite sample properties of the MLE in this setting remains an open problem. Another open problem beyond the scope of this study, is that of sampling individuals for population-level inference (*e.g.*, how many parasite samples are required to reliably infer gene flow between different geographic locations using relatedness?). Work is ongoing to address these questions, which are very application-specific and dependent on many population factors (*e.g.*, transmission intensity, seasonality, and asymptomatic reservoir).

### Conclusion

For portability, we recommend estimates of relatedness based on IBD for malaria epidemiology. To generate estimates between monoclonal parasite samples with (which we find leads to the largest error) with < 0.1 error, ∼200 biallelic or 100 polyallelic markers are required. C.I.s facilitate comparison across studies that inevitably differ in terms of available genetic data. Together with anticipated work on population-level sampling, we hope this work on genetic-level sampling (and extensions thereof) will aid statistically informed design of prospective genetic epidemiological studies of malaria.

## Acknowledgments

We thank all the authors of the *Plasmodium* data sets for either sharing their data, or making them freely available online for use here and elsewhere, and Stephen Schaffner for helpful discussions. P.E.J. gratefully acknowledges support from the National Science Foundation through grant DMS-1712872 and the Harvard Data Science Initiative. A.R.T. and C.O.B. are supported by a Maximizing Investigators’ Research Award for Early Stage Investigators (R35 GM-124715). This project was funded in part with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under grant number U19 AI-110818 to the Broad Institute (D.E.N.).

## Footnotes

Supplemental material available at FigShare: https://doi.org/10.25386/genetics.8977217.

*Communicating editor: M. Beaumont*

- Received March 12, 2019.
- Accepted June 3, 2019.

- Copyright © 2019 Taylor et al.

Available freely online through the author-supported open access option.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.