## Abstract

A properly designed distance-based measure can capture informative genetic differences among individuals with different phenotypes and can be used to detect variants responsible for the phenotypes. To detect associated variants, various tests have been designed to contrast genetic dissimilarity or similarity scores of certain subject groups in different ways, among which the most widely used strategy is to quantify the difference between the within-group genetic dissimilarity/similarity (*i.e.*, case-case and control-control similarities) and the between-group dissimilarity/similarity (*i.e.*, case-control similarities). While it has been noted that for common variants, the within-group and the between-group measures should all be included; in this work, we show that for rare variants, comparison based on the two within-group measures can more effectively quantify the genetic difference between cases and controls. The between-group measure tends to overlap with one of the two within-group measures for rare variants, although such overlap is not present for common variants. Consequently, a dissimilarity or similarity test that includes the between-group information tends to attenuate the association signals and leads to power loss. Based on these findings, we propose a dissimilarity test that compares the degree of SNP dissimilarity within cases to that within controls to better characterize the difference between two disease phenotypes. We provide the statistical properties, asymptotic distribution, and computation details for a small sample size of the proposed test. We use simulated and real sequence data to assess the performance of the proposed test, comparing it with other rare-variant methods including those similarity-based tests that use both within-group and between-group information. As similarity-based approaches serve as one of the dominating approaches in rare-variant analysis, our results provide some insight for the effective detection of rare variants.

THE advent of next-generation sequencing technology has produced an enormous amount of genetic data that scientists can use to investigate how genetic variants contribute to disease mechanisms. This has led to the development of many statistical methods designed to examine such associations, including various single-marker tests and the more powerful multiple-marker analyses (also known as gene-/set-based analyses) where the functional role of a group of variants or a signaling pathway is of interest. One class of the marker-set methods is based on the nonparametric *U*-statistic and uses dissimilarity or similarity (dis/similarity in short) measures to test for association (*e.g.*, Pinheiro *et al.* 2005; Schaid *et al.* 2005; Wessel and Schork 2006; Wei *et al.* 2008, 2014; Tzeng *et al.* 2009; Li 2012; Wang *et al.* 2015). The formulation and derivation of these *U*-statistic-based tests are simple and intuitive; they are useful when not much information is available about the mode of inheritance and phenotype distribution.

These *U*-statistic tests contrast genetic dis/similarity scores of certain subject groups in different ways. For instance, in case-control studies, Schaid *et al.* (2005) proposed quantifying the difference between two within-group kernel measures (*i.e.*, to perform case-case *vs.* control-control comparisons). Their global test performed well with an asymptotic distribution when functional variants were independent. Wei *et al.* (2008) further incorporated a between-group measure, specifically the case-control dissimilarity using Hamming distance (HD), and used the ratio of between-group to within-group similarity in the test. When only common variants (CVs) [*i.e.*, the minor allele frequency (MAF) ≥5%] are considered, the inclusion of both between- and within-group measures improves the testing power. Wang *et al.* (2015) also adopted HD to first cluster SNP data and next test the phenotypic association of each cluster with the difference between the between-group and within-group dissimilarity. For a review of *U*-statistic-based tests and similarity-based kernel methods, see Schaid (2010) and Li (2012). These methods have demonstrated great success when dealing with causal CVs; they are robust against the ratio of neutral:causal variants and maintain consistent power when deleterious and protective variants are mixed. For variants that are not common, Wei *et al.* (2014) combined genetic similarity and phenotypic similarity into a weighted *U*-statistic for sequencing data (referred to as WU-SEQ) to investigate whether variants with MAF ≤3% contribute to disease etiology. Their test includes both between- and within-group comparisons, and it outperforms the sequence kernel association test (SKAT) (Wu *et al.* 2011) when the phenotype distribution has heavy tail probability.

Although the strategy of including both within- and between-group information in the *U*-statistic-based tests can increase testing power for CVs (Wei *et al.* 2008; Wang *et al.* 2015), we found that the contrast of these two factors is not optimal for rare variant (RV) (MAF ≤1%) analysis. This is because when RVs are involved the between-group dis/similarity information tends to fall in between the case-case dis/similarities and the control-control dis/similarities, or it overlaps with one of the two within-group dis/similarities (Figure 1), for RVs and for less frequent variants (1% ≤ MAF ≤ 5%). This observation suggests that higher power may be achieved by contrasting case-case dis/similarity with control-control dis/similarity instead of contrasting within-group comparisons with between-group comparisons. This suggestion is echoed by several recent studies which imply that contrasting two extreme phenotype groups may improve the statistical power of RV association tests (Guey *et al.* 2011; Li *et al.* 2011; Barnett *et al.* 2013; Lee *et al.* 2014).

In this work, we investigate the behavior of dissimilarity scores obtained from within-group comparisons and between-group comparisons for RVs based on HD. We show that, unlike the analysis with CVs (where the difference between the between-group dissimilarity and the within-group dissimilarity tends to be larger than the difference between case-case dissimilarity and control-control dissimilarity), the between-group dissimilarity of RVs falls between the two within-group dissimilarities. The inclusion of between-group information tends to attenuate the association signals and leads to power loss. To address this issue, we construct a *U*-statistic test including only the within-case and within-control comparisons; we derive its statistical properties and computation algorithms for obtaining *P*-values. Finally, we show the power gain of the proposed test by comparing it with existing RV similarity tests that include both within- and between-group information, using coalescent simulations based on COSI (Schaffner *et al.* 2005) and data from the psychiatric Cohorte Lausannoise (PsyCoLaus) study (Preisig *et al.* 2009).

## Materials and Methods

Let denote the single-nucleotide variant vector of the -th individual in the control (normal) group, where and contains the genotype coding of variants. For instance, the coding can be the number of minor alleles in the genotype. Similarly, let represent the vector of genotype coding of the -th subject in the case (disease) group, where A dissimilarity measure is defined for any pair of genotype vectors from two individuals. The degree of dissimilarity (heterogeneity) in the control group is measured as the average dissimilarity within this group,The average dissimilarity within the case groups isSpecifically, a sum kernel over the *L* loci, is recommended to account for variant-specific “dosage” effects. Here we consider the HD dissimilarity for illustration. The counts the number of discordant elements between two strings (vectors) of the same length,Note that here uses a slightly different scoring system from the allele-matching kernel (which counts the number of matching alleles among the four comparisons of two alleles per locus) besides the fact that the former scores based on dissimilarity and the latter scores based on similarity. To illustrate the difference, consider only one SNP. The allele-matching kernel assigns 4 for comparing *AA* *vs.* *AA*, 2 for *AA* *vs.* *Aa*, and 0 for *AA* *vs.* *aa*; in contrast, when comparing *AA* *vs.* *AA* and for the other two. Although HD differs from the identity-by-state (IBS) similarity measure for CVs, would be a location shift of IBS when dealing with the extremely low frequency of the minor allele homozygotes.

### Distribution of dissimilarity scores for case-case, control-control, and case-control comparisons

Figure 1 displays the empirical distributions of for control-control pairs, for case-case pairs, and for case-control pairs, based on the COSI simulated sequence data (Schaffner *et al.* 2005) as detailed in the section *Simulation settings*. The data contain 520 RVs, of which 473 have MAFs ≤0.5%, 19 have MAFs in the range of (0.5, 1%), and 28 in [1, 5%]. We randomly selected 19 variants from each MAF range as causal variants and we used case-control sampling to generate data for 2000 cases and 2000 controls. More details are in Supplemental Material, File S1.

Under the null hypothesis of no association (first row in Figure 1), the dissimilarity within the case group and within the control group are nondifferentiable. When the effects of the RVs are deleterious or protective (second and third row, respectively), and are distant from each other, while usually lies between the two and is closer to one of them. Also note that the distributions may contain two modes depending on the value of the MAF: When the MAF falls in the range (0.5, 5%), the second bump becomes apparent especially when the variants are deleterious. That is, when the variants have low but not rare MAF (MAF between 1 and 5%), the bimode becomes substantial. In addition, depending on whether the effect is deleterious or protective, the can be either larger (second row) or smaller (third row) than

These observations motivate the consideration of a test statistic based only on the within-group information ( and ) and excluding We refer to this test as the HD-based association test for RVs (HDAT-RV) and it is defined as where is equal to the average of the and is the average of the

### Asymptotic distribution of HDAT-RV

First note that for any paired individuals with single-nucleotide variants, takes a nonnegative integer value between 0 and where the 0/1 indicator follows a Bernoulli distribution with a variant-specific discordant probability

If these SNPs are independent, then represents the sum of independent Bernoulli distributions and follows a Poisson-Binomial distribution,Note that the index set in contains the indices of SNPs corresponding to and (more details in File S1). The mean and variance are and respectively.

If these SNPs are correlated, then the distribution of can be derived from the joint distribution of and becomeswhere contains higher order association parameters (details in File S1). The mean is and variance is

Following the asymptotic theorems for -statistics with the HD as the kernel function, is asymptotically distributed asThe variance represents the variability of the average dissimilarity measure within the control group. It is four times the covariance, where index represents the common control subject. Similarly, it is for the case group.

Figure 2 examines the empirical distribution of based on 19 randomly selected variants with MAF ≤0.5%, 19 variants with MAF between 0.5 and 1%, and 19 variants with MAF between 1 and 5%. Each plot contains 50 replications (details in File S1). It shows that the distribution is majorly affected by the range of MAFs, but it is not much affected by the number of variants or number of subjects.

*P*-value computation of small sample sizes

For relatively small sample sizes, as often encountered in RV tests, the above asymptotic distribution may not be applicable and obtaining *P*-values empirically based on permutations or bootstraps is often recommended (Lin and Tang 2011). To assure the validity of the test even for small sample sizes, we use permutations to calculate *P*-value. In our permutation, the pairwise dissimilarity among subjects only needs to be calculated once. Then, in each permutation, the dissimilarity between any two subjects can be retrieved as long as the indices of these subjects are identified. As explained in Wang *et al.* (2015), such computational complexity is of the order where is the number of variants and is the sample size.

### Simulation settings

The performance of the proposed method was evaluated under various conditions of effect direction (deleterious *vs.* protective), control:case (CN/CS in short) ratio, total sample size, and proportion of causal RVs within a given region. A set of 10,000 haplotypes was first generated using the COSI software in Schaffner *et al.* (2005) with a coalescent model mimicking the linkage disequilibrium and population history of the European population. We focused on a region that includes 492 RVs with MAF ≤1%. Next, for each subject, a genotype sequence was constructed by randomly selecting two haplotypes with replacement from the 10,000 haplotypes. Based on the genotype sequence, the disease status of subject was then determined via the following model:where was the number of causal alleles at the -th locus and was set to correspond to a prevalence of 5%. To allow larger effect size (regression coefficients) for rarer causal variants, we followed Wu *et al.* (2011) and set for causal variant Value was chosen to make the power within a reasonable range across the scenarios considered. It was 0.4 for a deleterious or for a protective effect. The proportion of causal loci was assumed as 5, 15, or 25%, corresponding to = 25, 74, or 123 causal loci, respectively. The total sample size () was 600 or 1200, each with three CN/CS ratios: 1/1, 1/2, and 2/1. We considered four scenarios of causal effects, where the ratio of deleterious:protective (Del/Pro in short) variants ranged from 100/0, 80/20, 50/50, to 0/100. Every simulation setting contained 1000 replications and each *P*-value was derived based on 1000 permutations. The proposed HDAT-RV is compared with two *U*-statistic-based tests (*i.e.*, Wei’s test and WU-SEQ), SKAT, the burden test of Madsen and Browning (2009), and a hybrid method of kernel and burden tests [*i.e.*, SKAT-O of Lee *et al.* (2012)]. When quantifying genetic similarity, SKAT and SKAT-O use linear kernel, WU-SEQ uses IBS kernel, and Wei and HDAT-RV use HD. The code and data files for performing simulations are provided in File S2 and File S3.

### Data availability

Detailed descriptions of the simulation settings are in File S1. File S2 contains the code for deriving HDAT-RV, conducting simulation studies, and calculating type I error rate and power. File S3 contains the data needed for the simulation studies with the code in File S2. CoLaus data are available at the Database of Genotypes and Phenotypes (dbGaP) with accession number: phs000145.v4.p2.

## Results

### Simulation studies

Table 1 lists the type I error rates at a nominal level of 0.05 for 600 and 1200 under different CN/CS ratios. The values are around the nominal level for all methods. Figure 3 shows the power performance for 600 at the 0.05 level for a spectrum of effect patterns and CN/CS ratios. The CN/CS ratios of 1/1, 1/2, and 2/1 are shown in the columns from left to right. The Del/Pro ratios of 100/0, 80/20, 50/50, and 0/100 are shown from top to bottom. Inside each of the 12 plots in Figure 3, we show the power results for the percentages of causal RVs as 5, 15, and 25%. Generally speaking, the best method varies across scenarios: When the Del/Pro ratio is not 1 (*i.e.*, 100/0, 80/20, and 0/100), HDAT-RV performs better; when the Del/Pro ratio is 50/50, WU-SEQ becomes the best method.

We also observe that HDAT-RV tends to yield reasonable power regardless of CN/CS and Del/Pro ratio, while other methods are somewhat sensitive to these ratios. For example, SKAT has relatively lower power when CN/CS is 1/2 (more cases) and the Del/Pro ratio is high (100/0 or 80/20) or when CN/CS is 2/1 (more controls) and the Del/Pro ratio is 0/100; WU-SEQ has low power in scenarios different from those of SKAT, *e.g.*, WU-SEQ has little power when CN/CS is 2/1 and the Del/Pro ratio is high (100/0 or 80/20) or when CN/CS is 1/2 and the Del/Pro ratio is 0/100. The burden test, as expected, has less power when there is a mixture of deleterious and protective RVs.

We further explore the behavior of WU-SEQ. The quantile-quantile plot of *P*-values under the null hypothesis (Figure S1) confirms that it has the expected null distribution regardless of the CN/CS ratio. On the other hand, under the alternative hypothesis, WU-SEQ can become negative when the causal variants are rare (MAF ≤1%) and the CN/CS ratio is not 1 under certain effect patterns (*e.g.*, when there are more controls with deleterious effects, or when there are more cases under protective effects; see Figure S2). The negative statistics may be related to the fact that the phenotypic similarity is limited to only three values when dealing with binary traits, leading to negative WU-SEQ in this case. Such an issue is not present when traits are continuous or when variants are less frequent (MAF between 1 and 5%).

We also present the results at a nominal 0.005 level (*i.e.*, type I error rate in Table S1 and power in Figure S3), as well as an additional analysis with *N* = 1200 at nominal levels of 0.05 (Figure S4 and Figure S5) and 0.005 (Figure S6 and Figure S7). The results follow a similar pattern. In summary, the relative power of different methods suggests that the CN/CS and Del/Pro ratio together affect the performance of a method. In practice, because the Del/Pro ratio is unknown, a more robust choice would be HDAT-RV and SKAT-O, which consistently give the best or comparable power to the best-performing methods across different scenarios.

### Data applications

We applied the proposed test to the Cohorte Lausannoise (CoLaus) sequence study (Firmann *et al.* 2008; Song *et al.* 2012), where the targeted sequencing genotypes on 202 drug-targeted genes were obtained for 1769 subjects. The information on anxiety disorders according to the Diagnostic and Statistical Manual of Mental Disorders, fourth edition, was available for 1689 subjects from the psychiatric evaluation (PsyCoLaus study; Preisig *et al.* 2009). Among these subjects, 580 had at least one specific adult anxiety disorder and 1109 did not meet criteria for such a disorder. This is close to a CN/CS ratio of 2/1. Using the same threshold of 1% for the definition of RVs, there were 197 genes containing at least one RV (with a total of 10,330 RVs).

We started with a PubMed search to identify any genes in the 197 genes that were presented in any abstracts containing both the disease status (anxiety) and the gene symbol. A total of 73 genes, covering 4368 RVs, were identified and defined as “candidate genes,” although most of these candidate genes were reported because of the association between CVs and disease phenotype. We conducted gene-based analyses using HDAT-RV, SKAT, SKAT-O, Burden, Wei, and WU-SEQ tests, and evaluated the performance of each method by assessing its ability to reproduce the association between anxiety disorder status and the genes. The *P*-values of HDAT-RV, Wei, and WU-SEQ were computed using 5000 permutations.

Among the 73 candidate genes found in PubMed and at the 0.05 nominal level, four genes, *fatty acid amide hydrolase* (*FAAH*), *PDE4A*, *GHSR*, and *CCL11* were identified by at least three tests (top panel in Table 2). The next three genes reached significance under two tests (second panel), while 12 genes (third panel) were identified significant with only one test. Table 2 also lists the total number of genes identified by each method (bottom row). HDAT-RV identified the most genes (*i.e.*, 10 out of the 73 PubMed genes), followed by WU-SEQ (6), Burden (5), SKAT-O (4), SKAT (4), and Wei (2).

Note that several of the genes have been considered for drug development, including *FAAH*, *neurexin-1* (*NRXN1*), *OPRK1*, *DRD2*, *PDE4A*, and *ADORA2A*. In addition, *FAAH*, *GHSR*, *EGR1*, *HCRTR2*, *OPRK1*, *DRD2*, *CHRNA3*, and *ADORA2A* have been reported to be associated with neurodevelopment, nicotine dependence, sleep disorders, stress, and/or anxiety. *FAAH* is known to play an important role in the endocannabinoid system; the association between *FAAH* and anxiety behavior has been reported in Dincheva *et al.* (2015). The FAAH inhibitors have been suggested as a target therapy for central nervous system disorders (Ahn *et al.* 2009), for anxiety through activation of cannabinoid receptors (Panlilio *et al.* 2013), and for post-traumatic stress disorder (Bisogno and Maccarrone 2013). For *PDE4A*, studies have reported its association with anxiety-like behavior (Hansen *et al.* 2014; Keil *et al.* 2016) and it has been a target for antidepressants like rolipram (Wallace *et al.* 2005). More reported associations and references are listed in Table 2.

## Discussion

In this study, we focus on comparing the dissimilarity measures of multiple RVs from two extreme phenotype groups. We show that the between-/cross-group dissimilarity measure often overlaps with one of the two within-group dissimilarity measures, and thus reduces testing power if the between-group information is considered in the test statistic. This was not an issue in earlier studies because such overlap is not present for analyses with CVs. While it has been noted for CVs that the two within-group measures and the between-group measure should all be included (Wang *et al.* 2015); for RVs, comparison of the two within-group measures can more effectively quantify the difference between the case and the control group. Using simulations, we show that the proposed HDAT-RV has reasonable and robust power performance across different CN/CS and Del/Pro ratios, while other methods that incorporate both within- and between-group information may be sensitive to these two ratios.

To address the potential concerns of insufficient sample size to apply asymptotics, we also construct a computationally efficient permutation algorithm to obtain the *P*-value of the HDAT-RV. In the procedure, the dissimilarity measure of every pair of individuals is computed only once; no calculation of the measure is needed in all subsequent permutations. Such a feature is different from WU-SEQ, where pairwise similarity scores of genotypes and phenotypes have to be recomputed in every permutation.

When quantifying genetic similarity in this work, SKAT uses linear kernel, WU-SEQ uses IBS kernel, and Wei and HDAT-RV use HD. For RVs, the linear kernel, IBS kernel, and HD become equivalent; except that HD measures distance while the other two measure similarity. So we expect that the power difference would be less likely due to the choices of distance or dis/similarity measures among SKAT, Wei, WU-SEQ, and HDAT-RV; but would largely be from the different strategies of using the between-group similarity. Certain combinations may be vulnerable to the imbalance of sample sizes in the case and control group, as demonstrated in this research. When the MAF is <1%, the proposed HDAT-RV is less sensitive to this imbalance. Another issue of concern is the choice of genetic similarity measure. As addressed in Wu *et al.* (2011), a function that better captures the true effect of multiple loci will increase power. Unfortunately, this information is usually unavailable. Further studies focusing on different relationships and distance measures may help to sort out the effectiveness of these tests, particularly when the percentage of causal variants is not known and when deleterious and protective causal RVs are mixed with an unknown ratio.

## Acknowledgments

The authors thank Peter Vollenweider and Gerard Waeber, principal investigators of the Cohorte Lausannoise study, and Meg Ehm and Matthew Nelson, collaborators at GlaxoSmithKline, for providing the CoLaus phenotype and sequence data. The CoLaus/PsyCoLaus study was supported by grants from the Swiss National Science Foundation (#3200B0–105993) and from GlaxoSmithKline (Psychiatry Center of Excellence for Drug Discovery and Genetics Division, Drug Discovery, R&D, Verona, Italy). This work was supported in part by Ministry of Science and Technology of Taiwan grants 103-2314-B-002-039-MY3, 105-2118-M-032-012-MY2, 106-2314-B-002-097-MY3, and 106-2811-B-002-006 and National Institutes of Health grant P01 CA-142538-01.

## Footnotes

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.118.300769/-/DC1.

*Communicating editor: E. Hauser*

- Received January 30, 2018.
- Accepted March 2, 2018.

- Copyright © 2018 by the Genetics Society of America