# “SNP Snappy”: A Strategy for Fast Genome-Wide Association Studies Fitting a Full Mixed Model

- Karin Meyer
^{1}and - Bruce Tier

- 1Corresponding author: Animal Genetics and Breeding Unit, University of New England, Armidale NSW 2351, Australia. E-mail: kmeyer.agbu{at}gmail.com

## Abstract

A strategy to reduce computational demands of genome-wide association studies fitting a mixed model is presented. Improvements are achieved by utilizing a large proportion of calculations that remain constant across the multiple analyses for individual markers involved, with estimates obtained without inverting large matrices.

GENOME-WIDE association studies (GWAS) have become a routine task for geneticists in a range of areas. Analyses employing a mixed model are widely used as this provides a flexible framework to account for systematic differences and covariances due to other sources, such as population stratification and a family structure among genotyped individuals (Kang *et al.* 2010; Price *et al.* 2010; Zhang *et al.* 2010). A common type of investigation involves solving a system of mixed model equations (MME) fitting one or a few single nucleotide polymorphism markers (SNP) at a time, treating SNP effects as covariables, with variance components fixed at their estimates from an analysis omitting SNP. Typically this is done by inverting the coefficient matrix in the MME for each analysis. While individual analyses take only seconds, analyzing all markers for a high-density chip imposes a considerable computational burden. Hence, estimation of SNP effects by first fitting the mixed model excluding any SNP effects and then applying the SNP-wise analysis to the resulting residuals has been suggested (Aulchenko *et al.* 2007). However, this may lead to biased results if genotypes are not randomized across the effects in the model or if SNP effects and population strata are partially confounded. A typical example is an analysis comprising animals of different breeds with different allele frequencies (Johnston and Graser 2010).

When fitting the full model, we can partition the pertaining MME into a small part due to SNP effects and a part due to the other effects in the model. For complete genotype information only the former changes as different SNPs are considered. This structure can be exploited to reduce computational requirements. We present the strategy to do so, describe its implementation in freely available mixed model software, and show an example application.

## Computing Strategy

Consider a mixed model(1) with **y**, **b*** ^{k}*,

**u**

*,*

^{k}**s**

*, and*

^{k}**e**

*denoting the vector of observations (phenotypes), fixed effects other than SNP effects, random effects, SNP effects and residuals, and*

^{k}**X**,

**Z,**and

**W**

*the incidence matrices pertaining to*

^{k}**b**

*,*

^{k}**u**

*, and*

^{k}**s**

*. As emphasized by the superscript*

^{k}*k*, only

**W**

*differs between analyses for different SNPs, with the elements of*

^{k}**W**

*equal to the number of copies of the reference allele—0, 1, or 2 in a biallelic model—for the SNP(s) in the*

^{k}*k*th analysis. To estimate we need to solve the MME(2) with

**R**= Var(

**e**) and

**G**= Var(

**u**) the covariance matrices among residuals and random effects, respectively. Rewrite Equation 2 as(3) with

**C**

_{11}of size

*n*

_{1}×

*n*

_{1}denoting the part of the coefficient matrix that is constant and

**r**

_{1}the pertaining vector of right-hand sides, , of size

*n*

_{2}×

*n*

_{2}, the corresponding terms for the effects changing with each analysis, and and the off-diagonal blocks in the coefficient matrix. With generally not of interest, we can estimate as a solution to(4) With

*n*

_{2}small, inversion of the coefficient matrix and direct solution of Equation 4 is undemanding. While remains constant and thus needs only to be determined once, computations for the inversion are proportional to and thus can be nontrivial for large

*n*

_{1}. Fortunately, we can obtain without inverting

**C**

_{11}. Let(5) denote the Cholesky factor of the coefficient matrix in Equation 3 with

**C**

_{11}=

**L**

_{11}

**L**

_{11}

**′**, and . Substituting these terms in Equation 4 yields(6) This suggests that estimates for

*k*= 1, …,

*K*can be obtained efficiently by splitting computations as follows.

### To be performed once

Set up

**C**_{11}and**r**_{1},*i.e.*, the MME omitting SNP effects.Perform the Cholesky factorization of

**C**_{11}to obtain**L**_{11}.Determine as a solution to . With

**L**_{11}triangular, this involves forward substitution steps and

for *r _{i}* and , the

*i*th element of

**r**

_{1}and , and ℓ

*the*

_{ij}*ij*th element of

**L**

_{11}.

### To be performed for each set of SNPs

Determine parts of the MME specific to the

*k*th analysis, , , and .Set up

**L***, representing the intermediate matrix arising in factorizing the coefficient matrix in Equation 3 after rows 1 to*n*_{1}have been processed:.“Complete” the factorization steps for rows

*n*_{1}+ 1 to*n*_{1}+*n*_{2}using

(for the *ij*th element of **L***). Processing columns 1 to *n*_{1} column-wise replaces in **L*** with . The remaining elements (in columns *n*_{1} + 1 to *n*_{1} + *n*_{2}) are then adjusted row-wise, overwriting with .

Determine a general inverse of , , to obtain . Sampling variances of are given by the diagonal elements of .

Note that can have diagonal elements of zero if a SNP is monomorphic or if SNPs with proportional allele counts are considered simultaneously. This is accounted for in the generalized inverse. If *n*_{2} is not small or if sampling variances are not required, an alternative to solve for is a series of forward and backward substitution steps.

## Implementation

The strategy described above has been implemented in the mixed model package wombat (Meyer 2007), utilizing the existing capabilities to set up the MME for an arbitrary model and sparse matrix calculations, including Cholesky factorization of the coefficient matrix. Estimation of SNP effects is invoked through a run-time option. In addition to the data, pedigree, and parameter files as required for standard analyses, allele counts for each SNP analysis are expected to be read sequentially from a separate file. The software and user manual together with a worked example illustrating its use for GWAS analyses are available for download from http://didgeridoo.une.edu.au/km/wmbdownloads.php.

## Application

Our strategy was applied to estimate effects for 4541 SNPs on age at first *corpus luteum* in beef cattle. Any missing allele counts were imputed so that marker information was complete. Records were a subset of data analyzed previously (Hawken *et al.* 2011). The model of analysis fitted five fixed effects and two linear covariables as well as a linear regression on a single SNP effect. Animals’ additive genetic effects were fitted as random effects with the relationship matrix determined from pedigree information. There were 941 animals with genotypes and phenotypes. Including additive genetic effects for parents without records yielded a total of 3858 animals in the model and 3909 equations in total.

Calculations were carried out on a desktop computer with an Intel I7 processor rated at 3.2 GHz. Performing single SNP analyses one by one, inverting the complete coefficient matrix in the MME each time required a total of 1784 sec CPU time. Estimation using our new strategy reduced this to 16 sec. Repeating SNP information 150 times to mimic a high-density chip with 681,150 SNPs, analysis was completed in 2295 sec. With a set-up time required of ∼1 sec, this gave an average of 297 SNPs analyzed per second.

## Conclusions

Computational demands for GWAS analyses fitting a full mixed model can be reduced by orders of magnitude utilizing that a large part of the MME and computations involved remain constant. The computing strategy described to exploit this is straightforward and is readily implemented in existing mixed model software. Savings that can be achieved increase with the number of effects in the mixed model and are proportional to the number of SNP effects considered.

## Acknowledgments

This work was supported by Meat and Livestock Australia under grant B.BFG.0050. We are indebted to the CRC for Beef Genetic Technologies for the data used in the example. AGBU is a joint venture of New South Wales Department of Primary Industries and the University of New England.

## Footnotes

*Communicating Editor: G. A. Churchill*

- Received September 21, 2011.
- Accepted October 19, 2011.

- Copyright © 2012 by the Genetics Society of America