Abstract
Schizophrenia is a severe mental disorder with a large genetic component. Recent genome-wide association studies (GWAS) have identified many schizophrenia-associated common variants. For most of the reported associations, however, the underlying biological mechanisms are not clear. The critical first step for their elucidation is to identify the most likely disease genes as the source of the association signals. Here, we describe a general computational framework of post-GWAS analysis for complex disease gene prioritization. We identify 132 putative schizophrenia risk genes in 76 risk regions spanning 120 schizophrenia-associated common variants, 78 of which have not been recognized as schizophrenia disease genes by previous GWAS. Even more significantly, 29 of them are outside the risk regions, likely under regulation of transcriptional regulatory elements contained therein. These putative schizophrenia risk genes are transcriptionally active in both brain and the immune system, and highly enriched among cellular pathways, consistent with leading pathophysiological hypotheses about the pathogenesis of schizophrenia. With their involvement in distinct biological processes, these putative schizophrenia risk genes, with different association strengths, show distinctive temporal expression patterns, and play specific biological roles during brain development.
SCHIZOPHRENIA is a debilitating brain disorder with a worldwide prevalence of ∼1% that results in substantial morbidity and mortality. It is characterized by constellations of symptoms such as hallucinations, delusions, and cognitive impairments. Most cases of schizophrenia start during adolescence and early adulthood, and often have a lifelong course. Converging evidence indicates that schizophrenia results from a disruption in brain development (du Bois and Huang 2007) caused by genetic predisposition and environmental factors, the latter of which include prenatal infection, maternal nutrition, and stress. Schizophrenia is a highly heritable disease, with an estimated heritability between 64 and 81% (Sullivan et al. 2003; Lichtenstein et al. 2009), confirming the major role of genetic factors in contributing to disease risk. Therefore, further dissection of the genetic underpinnings of schizophrenia is crucial toward advancing our understanding of its pathogenesis.
The genetic basis of schizophrenia involves complex interactions among risk variants across an allelic frequency spectrum. While no Mendelian inheritance patterns have been observed for schizophrenia risk variants (Giusti-Rodriguez and Sullivan 2013), accumulating evidence indicates that the polygenic component of risk is substantial (International Schizophrenia Consortium et al. 2009). Rare copy number variants (CNVs) have shown relatively high penetrance for schizophrenia: the majority of 11 known risk CNVs with genome-wide significance for schizophrenia association have minor allele frequencies (MAFs) <0.1%, and odds ratios (ORs) between 2 and 60 (Rees 2015). In addition, significant progress has been made recently, using large-scale exome-sequencing and genome-wide association studies (GWAS), on the role of risk variants with subtle effects. Enrichment of disruptive rare (MAF <0.1%) single nucleotide variants (SNVs) of small effect sizes (OR = 1.12), as well as enrichment of nonsynonymous de novo SNVs, was found in several gene sets associated with synaptic function (Fromer et al. 2014; Purcell et al. 2014). Previous studies suggest that common single nucleotide polymorphisms (SNPs) associated with schizophrenia generally have a small effect size (OR < 1.2), but, collectively, thousands of independent SNPs could account for up to 50% of variance in schizophrenia liability (Ripke et al. 2013). In particular, a recent large-scale meta-analysis based on past GWAS identified 108 schizophrenia risk regions with genome-wide significance (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014), and thus further confirmed the important contribution that common variants make to the genetic risk of schizophrenia. To date, over 20 GWAS studies have been conducted in schizophrenia, providing valuable data for downstream analysis.
Identification of genes that confer risk for developing schizophrenia is crucial to providing insight into the underlying disease mechanisms, and for identifying new drug targets. One of the best-known schizophrenia genes encodes the dopamine receptor D2 (DRD2). The fact that it can be used as a drug target to treat schizophrenia supports a major etiological hypothesis that abnormal brain signaling involving dopamine is a substantial factor in the pathophysiology of schizophrenia (Di Forti et al. 2007). In addition, genes implicated in schizophrenia by previous studies of common or rare variants (Fromer et al. 2014; Purcell et al. 2014; Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014) include genes involved in glutamatergic neurotransmission (GRM3, GRIN2A, SRR, GRIA1, and SLC38A7), calcium channel signaling (CACNA1C, CACNB2, CAMKK2, CACNA1I, NRGN, and RIMS1), and synaptic plasticity such as N-methy-d-aspartate receptor (NMDAR) and activity-regulated cytoskeleton-associated scaffold protein (ARC). However, these findings are mostly limited to the level of gene set enrichment due to difficulty in pinpointing risk genes. In contrast to exome sequencing studies, in which risk genes are directly implicated by risk exonic variants, GWAS can identify only risk regions instead of risk genes. This intrinsic limitation of GWAS cannot be resolved by increasing the sample size. Thus, in order to investigate the biological effects of common variants, new methodologies are required to track down risk genes responsible for the GWAS signals found in schizophrenia (Need and Goldstein 2014).
The challenge of pinpointing risk genes in disease-associated risk regions lies in several aspects. Most risk regions cover and implicate multiple genes, which, without other information, makes it exceedingly difficult to determine the true risk gene(s) within them. Furthermore, risk genes may reside outside risk regions, and be affected through regulatory elements. In this study, we propose a framework to tackle this challenge. To cover risk genes that reside outside of risk regions, we incorporated gene regulatory information to include candidate genes outside risk regions. In addition, we developed a computational method to score schizophrenia candidate genes based on Gene Ontology (GO) annotations and functional network characteristics of a group of known (and well-accepted) schizophrenia genes. We prioritized 132 schizophrenia risk gene candidates as putative schizophrenia risk genes in risk regions that we constructed from previous GWAS studies. Subsequent multiple integrated functional analyses of these putative susceptibility genes provide us with novel and deeper biological insight into the genetic architecture, enriched pathways, gene expression profiles, and penetrance of schizophrenia.
Materials and Methods
The overall strategy of our approach is depicted in Figure 1.
The flowchart of the integrated post-GWAS study of schizophrenia. The study consisted of two major parts: prioritization of schizophrenia risk gene candidates and subsequent functional analyses.
Identification of genomic risk regions for schizophrenia
We collected SNPs/indels from the PGC study (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014), and additional SNPs from the GWAS catalog (Hindorff et al. 2015) that were identified to be associated with schizophrenia (P < 1 × 10−5). The final set included 128 SNPs/indels from the PGC study, and 137 SNPs from the GWAS catalog. Using VCFtools (Danecek et al. 2011), and the 1KG reference panel (1000 Genomes Project Consortium et al. 2012), we calculated the linkage disequilibrium (LD) between each schizophrenia variant, and every 1KG variant in its 400-kb neighborhood. The neighboring SNPs with r2 > 0.5 define the LD block indexed by the enclosed schizophrenia variant. Finally, we combined overlapping or close (within 250 kb) LD blocks to form genomic risk regions for schizophrenia.
Identification of schizophrenia risk gene candidates
After pinpointing the schizophrenia risk regions, we identified schizophrenia risk gene candidates that are linked to these risk regions. Based on the genomic distance, a schizophrenia risk gene candidate is either proximal or distal to the schizophrenia risk regions. Proximal candidate genes are candidate genes inside or closest to risk regions, while distal candidate genes are candidate genes outside, and not closest to, risk regions (if there are genes inside risk regions, they are closest to risk regions). The proximal candidates were identified with the same approach as used in the PGC meta-analysis (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014): they are genes overlapping risk regions after extending them by 20 kb on both ends, or the closest genes to risk regions within 500 kb, when they contain no genes. In addition, in our analysis we also included possible distal risk genes by incorporating transcriptional regulatory interactions between expression quantitative trait loci (eQTL) or transcriptional regulatory elements (TREs) and their target genes. Both ENCODE and FANTOM5 provided enhancer-promoter connections, based on the correlation between their DNase hypersensitivity in different cell types, and between their expression activity, respectively. We used such enhancer–promoter connections to connect transcriptional regulatory elements to genes. Thus, the distal candidates are genes that are neither directly covered by, nor closest to, the risk regions (within 500 kb), but are likely regulated by eQTL or TREs within them. We collected eQTL, DHS, and enhancers, each set with their target genes from GTEx (Analysis Pilot V3), ENCODE (Thurman et al. 2012), and FANTOM5 (Andersson et al. 2014), respectively. To minimize inclusion of irrelevant distal genes, we only considered eQTL, DHS, or enhancers that are in the risk regions, and also contain at least one SNP or indel in strong LD (r2 > 0.5) with the schizophrenia GWAS SNPs or indels.
Scoring schizophrenia risk gene candidates
We have developed a statistical method to score the disease-relatedness of schizophrenia risk gene candidates, with predictive features extracted from gene networks, and annotation based on a set of training schizophrenia genes (Figure S1) (Supplemental Material, File S1). We used 56 training genes in our analysis, including (1) 38 manually curated schizophrenia genes with strong evidence (aka “core genes”) (Jia et al. 2010), (2) eight schizophrenia susceptibility genes cataloged in the Online Mendelian Inheritance in Man (OMIM) database (McKusick 2007), (3) six well-accepted schizophrenia genes from recent genetics studies (Hall et al. 2015; Kotlar et al. 2015), and (4) four schizophrenia genes with solid support from other sources (Canetta et al. 2014; Nawa et al. 2014; Bossu et al. 2015; Lv et al. 2015) (Table S1). The gene network that we used is the functional linkage network (Linghu et al. 2009), in which the functional association (the edge) between a pair of genes was predicted based on 16 genomic features.
The predictive features are either network features or annotation features. Network features are the frequent combinations of the neighbors of schizophrenia genes in the functional linkage network (Linghu et al. 2009), while annotation features are the frequent combinations of GO terms associated with schizophrenia genes. We extracted those frequent combinations of the network neighbors or GO terms by using the frequent item set mining algorithm (Tan et al. 2006). Network features characterize schizophrenia genes, indirectly, by a combination of genes that schizophrenia genes are usually functionally associated with, while those functionally associated genes can be any genes, not restricted to schizophrenia genes. In other words, network features characterize the context of functionally associated genes in which schizophrenia genes are usually enriched. Annotation features characterize schizophrenia genes directly in terms of GO terms. The two features characterize schizophrenia genes from different angles, and are complementary according to our evaluation (shown in the section “Evaluation of schizophrenia gene candidate scoring”). The final score integrates two scores, based on GO terms and functional linkage network (Figure S2), respectively.
Our evaluation showed that a final-score cutoff set at 80 could achieve a high prediction precision (Figure S3A). Meanwhile, the majority of the training genes, and the majority of genes with strong literature support for connection with schizophrenia, both have scores higher than 80 (Figure S3B and Table S2). Moreover, because our analysis showed that it is unlikely to observe scores higher than 80 from the same set of schizophrenia candidate genes by training with random or irrelevant genes (P = 0) (Figure S4, A and B), we set the high-scoring cutoff at 80. The majority of the prioritized genes have scores higher than 160 (Figure S4C).
Evaluation of schizophrenia gene scoring
We used two complementary approaches—binary classification tests and Wilcoxon rank sum tests—to evaluate our scoring method in discerning schizophrenia genes. The former directly assessed how well our scoring method distinguished schizophrenia genes from nonschizophrenia genes, while the latter compared the scores between schizophrenia genes and nonschizophrenia genes. For binary classification tests, the 56 training genes were used as the only positive testing gene set, while 56 genes randomly selected from the “background” gene set (Figure S5A) as the negative testing gene set in different binary classification tests. For Wilcoxon rank sum tests, we prepared an “enriched” gene set, which is composed of genes implicated in schizophrenia by rare mutations other than the 56 schizophrenia genes (Figure S5A), and compared both the schizophrenia gene set and the “enriched” gene set with the “background” gene set.
Gene sets association analysis
We compiled the following three gene sets, and compared our putative schizophrenia risk genes with each of them using Fisher’s exact test of association.
We collected from the Mouse Genome Informatics (MGI) database (as on April 23, 2015 at http://www.informatics.jax.org,) a list of 3765 genes whose knock-outs in mouse models generated phenotypes of nervous systems and neurological behaviors.
Using text-mining techniques, we compiled a list of 54 genes with strong literature support for connection with schizophrenia (Table S2).
We assembled a list of 1401 genes that have been shown in previous studies to be differential expressed between schizophrenia patients and normal controls.
Pathway and GO term enrichment analysis
We used GeneCoDis3 (Tabas-Madrid et al. 2012) to identify KEGG and Panther pathways enriched among schizophrenia risk genes. Briefly, putative schizophrenia risk genes that we identified and Ensemble human genes were used as the input and the reference gene sets, respectively. Pathway annotations from KEGG and Panther were searched and compared in both the input and the reference gene sets to find pathways significantly enriched in the putative schizophrenia risk genes. To measure the significance of enrichment, the hyper-geometric distribution was used to calculate P-values. Then, the false discovery rate was calculated for multiple test correction. Biological pathways with significant corrected P-values are candidates for involvement in the pathogenesis of schizophrenia. We used GO::TermFinder (Boyle et al. 2004) to analyze the enrichment of GO terms in the putative schizophrenia risk genes. To avoid potential confounding effects from the functional linkage network, we excluded associations between GO terms and genes based on Electronic Annotation (evidence code = IEA) from our enrichment analysis, and thus ensured that all associations between GO terms and genes were assigned manually by curators. P-values were adjusted for multiple tests using the Bonferroni method.
Tissue gene expression analysis
To examine the expression profiles of the putative schizophrenia risk genes in different tissues, we used the Gene Enrichment Profiler (http://xavierlab2.mgh.harvard.edu/EnrichmentProfiler/) (Benita et al. 2010), which catalogs normalized expression values of ∼12,000 genes across 126 primary human tissues. To investigate gene-tissue expression specificity, we grouped the putative schizophrenia risk genes into different clusters according to their different expression patterns across tissues using the Euclidean distance, and the Ward’s clustering method (Legendre 2014).
Data availability
The authors state that all the source of data necessary for reproducing the results are presented within the article. Strains are available upon request.
Results
Schizophrenia-associated common variants and genomic risk regions
We collected 261 schizophrenia-associated common variants (SNPs and indels) from 25 GWAS of the disease (see Materials and Methods). With a few exceptions, the associated variants reported by each study are within a range of effect size similar to one another (Figure S6). They represent at least 60 genomic loci harboring schizophrenia-associated variants that have been replicated in multiple independent studies (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014; Hindorff et al. 2015). Interestingly, the vast majority of these loci act independently of known risk factors, promising the discovery of hitherto unknown mechanisms influencing risk. These variants are distributed throughout the human genome, with local clustering (Figure 2A). Using neighboring 1000 Genomes Project (1KG) variants that are in high (r2 > 0.5) LD with schizophrenia GWAS variants, we identified 176 genomic schizophrenia-risk regions. After analyzing how schizophrenia GWAS variants, protein-coding genes (GENCODE v19) (Harrow et al. 2012), and TREs (we used ENCODE enhancers) (Encode Project Consortium 2012) are distributed together in human genome (Figure 2B), we found that many human genomic regions, such as the 1q43, 6p21, and 18q21 loci, are enriched with both schizophrenia GWAS SNPs and enhancers. These schizophrenia risk regions are either gene-rich or gene-poor.
Currently cataloged schizophrenia GWAS SNPs. (A) Genomic distribution of schizophrenia GWAS SNPs. Each red dot represents a schizophrenia GWAS SNP. Several local clusters are highlighted. (B) The numbers of schizophrenia GWAS SNPs, ENCODE enhancers, and protein-coding genes in risk regions. The bubble size indicates the number of genes. The risk regions are labeled with the chromosome bands, and the ones with the number of schizophrenia SNPs <4 are shown only by dots.
Evaluation of schizophrenia gene candidate scoring
Risk gene candidates can be scored with predictive features extracted from either the functional linkage network (Linghu et al. 2009), or GO annotation, or the two sources combined. In comparison, they may also be scored by another two approaches: one is with their node degrees alone, a simple but informative network characteristic; the other is the number of connections to schizophrenia training genes (aka “risk degree”), which characterizes the degree of relationship with training genes. We evaluated the performance of these five scoring designs by two different but complementary approaches. First, for each design, we constructed the receiver operating characteristic (ROC) curve, and calculated the area under the curve (AUC) (Figure S5B). When we used both network and annotation predictive features, our method achieved the best performance. Likewise, our method consistently outperformed the other two methods, and exhibited the most stable performance when confounding factors, such as the network degree and the gene size in evaluation gene sets, were controlled.
Next, we scored all 16,906 genes, and compared scores of 15,130 background genes with those of 56 known and 1718 “enriched” schizophrenia genes, respectively (Figure S5C). Wilcoxon rank sum tests showed that, with all five scoring designs, each of the schizophrenia gene sets scored significantly better than the background genes. For the test of evaluating score differences between 56 known/well-accepted schizophrenia genes and “background” genes, our method using both network and annotation predictive features exhibited similar performance to the method using risk degree as scores. However, for the test of evaluating score differences between the “enriched” gene set and the “background” gene set, our method using either network or annotation predictive features significantly outperformed the method of using risk degree as scores. This indicates that our network and annotation features are very effective in scoring unknown schizophrenia genes, not limited to scoring known schizophrenia genes that are highly functionally associated with other known schizophrenia genes. When both network and annotation predictive features were used, the differences in scores were the most significant.
Both evaluation approaches—the classification test and the Wilcoxon rank sum test—clearly showed that our scoring method can effectively prioritize schizophrenia genes. A large number of different combinations of functionally related genes as network features generated by training with seed genes can effectively capture the underlying genetic risk of schizophrenia. Network predictive features consider only genes functionally associated with query genes (but not query genes themselves), and may be insufficient to differentiate schizophrenia genes from genes with spurious functional linkage to the same neighbors in the network. On the other hand, scoring relying on GO annotation alone runs the risk of prediction being biased toward well-studied genes. Because the functional linkage gene network was built using high-throughput genomic data sets, by integrating both gene network and annotation, the risk of biased prediction was minimized.
High-scoring schizophrenia risk gene candidates
Using human genome annotation and transcriptional regulatory information, to 176 schizophrenia risk regions, we linked 643 schizophrenia risk gene candidates, of which 487 are proximal, covered by, or closest to, the schizophrenia risk regions, and the other 156 are distal, linked to the risk regions through long-range gene regulation. Our schizophrenia risk gene candidates show a size distribution very similar to that of all coding genes (Figure S7). By contrast, the set of genes closest to schizophrenia-associated SNPs, which were considered as risk genes by the current GWAS approach, show a striking bias toward large genes.
Due to the lack of GO annotation, or their absence from the functional gene linkage network, 58 candidates cannot be scored. Among the remaining 585 scored candidates, 132 genes from 76 schizophrenia risk regions achieve scores greater than the threshold (Table 1 and Table S3). Referred to as “schizophrenia risk genes” hereafter, these high-scoring candidates include 103 (78%) genes proximal to the schizophrenia risk genomic regions, and 29 (22%) distal genes that are likely regulated by TREs or eQTL in the risk regions (Table 1 and Table S4). For lack of a better approach, in most, if not all, GWAS in complex diseases, genes closest to the disease-associated SNPs were considered as the risk genes. This approach will certainly miss the distal risk genes, but, even among the proximal ones, only 189 out of 487 (39%) are genes closest to the schizophrenia GWAS signals.
We carefully examined the predicted schizophrenia risk genes to validate the effectiveness of our method. In one case (Figure 3A), a risk region on chromosome 8 indexed by schizophrenia-associated SNP rs16887244 is linked to nine protein-coding genes. Among them, rs16887244 is located in an intron of LSM1, and thus LSM1 was reported as a putative schizophrenia risk gene (Hindorff et al. 2015). Our method, however, identified different risk genes for rs16887244. While LSM1 scored low (4.8), we identified with high scores (822.7 and 926.8, respectively), two risk genes, STAR and FGFR1. STAR encodes a steroidogenic acute regulatory protein, which regulates the onset of steroidogenesis. Widely expressed throughout human brain, STAR may play a role in maintaining several brain functions, such as neurogenesis, neuroprotection, and synaptic plasticity (Sierra 2004). FGFR1 encodes fibroblast growth factor receptor 1, and is involved in many important signaling pathways, whose impairment could lead to abnormal brain development, and confer risk of schizophrenia (Terwisscha van Scheltinga et al. 2010). In contrast to STAR, which resides in the risk region, FGFR1 is a distal gene located outside the risk region. It was connected to the GWAS signal through two TREs in the risk region that may regulate its expression. This connection is strengthened by the strong LD between the schizophrenia SNP rs16887244 and the two SNPs, rs6999796 and rs16887343, each located in one of the TREs.
Prioritization of schizophrenia risk gene candidates. Schizophrenia-associated SNPs (shown with red “rs” IDs) were used to define schizophrenia risk genomic regions (red rectangles). Schizophrenia risk gene candidates are genes either overlapping schizophrenia risk regions or linked to them by TREs. The scores of candidates are indicated by the red bars. (A) A schizophrenia risk genomic region in 8p11.23. Among nine schizophrenia risk gene candidates, STAR and FGFR1 achieved high scores. FGFR1 is linked to this risk region through two TREs. (B) A schizophrenia risk genomic region in 16p11.2. 13 schizophrenia risk gene candidates are connected to this risk region (TBK6 is linked through eQTL). Three of them—TAOK2, DOC2A, and MAPK3—achieved high scores. Note that, in either case, the gene closest to the schizophrenia-associated SNPs has a score lower than the threshold (at 80, shown by the dashed line).
Another risk region on chromosome 16 indexed by schizophrenia-associated SNP rs12691307 is linked to 13 protein-coding genes (Figure 3B). The gene closest to the index SNP, KCTD13, was given a low score by our method (37.1). Instead, three other genes, DOC2A, MAPK3, and TAOK2, scored high (249.6, 915.7, and 93.7, respectively). In fact, the risk region is located at chromosome 16p11.2—a known risk locus for autism (Kumar et al. 2008). Autism is another neurodevelopmental disorder that shares a number of features with schizophrenia (Goldstein et al. 2002). Interestingly, all three genes have been implicated in autism in previous studies (Kumar et al. 2008; de Anda et al. 2012). DOC2A encodes calcium-signaling proteins responsible for neurotransmission (Glessner et al. 2010). MAPK3 encodes a serine/threonine protein kinase that plays an important role in the regulation of synaptic plasticity (Thomas and Huganir 2004). TAOK2 also encodes a serine/threonine protein kinase that affects basal dendrite formation (de Anda et al. 2012). Their biological roles are consistent with the current knowledge of schizophrenia etiology. In contrast to DOC2A and TAOK2, which reside in the risk region, MAPK3 is a distal gene ∼0.1 Mb downstream to the risk region. The connection between the GWAS signal and MAPK3 was established through a TRE in the risk region. SNP rs10871451 in this TRE is in strong LD with the nearby schizophrenia-associated SNP, and thus implicated as the underlying risk variant.
In the aforementioned cases, our method predicted risk genes different from genes closest to schizophrenia-associated GWAS SNPs. The high scores assigned by our method to those predicted risk genes were calculated based on solid gene annotation and functional linkage. Our post-GWAS analysis generated a high-confidence set of schizophrenia risk genes, many of which are new. Although their ultimate validation and confirmation can be achieved only experimentally (and thus beyond the scope of this work), we carried out computational analyses and the results—described in the following sections—show that our predictions are well supported by other resources.
Association among schizophrenia-related gene sets
To validate and characterize our schizophrenia risk genes, among 585 scored candidates, we compiled three sets of schizophrenia-related genes based on phenotypes found in transgenic mice, schizophrenia research literature, and differential gene expression studies in schizophrenia. Fisher’s exact tests of association among these four gene sets (Table S5) using 585 scored candidate genes as the background shows that our schizophrenia risk genes are highly associated with genes either rendering relevant phenotypes in transgenic mice (P = 9.58 × 10−19), or with schizophrenia literature support (P = 2.92 × 10−5). However, we did not detect association (P = 0.135) between our schizophrenia risk genes, and genes from differential expression studies of schizophrenia.
Tissue gene expression analysis
Although recognized as a brain disorder, accumulating evidence also shows that the etiology of schizophrenia is associated with immune dysfunction (Muller and Schwarz 2010). We examined the expression profiles of schizophrenia risk genes across different human tissues to investigate the tissue specificity of their transcriptional activities. Based on their expression patterns, we can cluster them into three groups (Figure S8). The first group of 39 genes is expressed almost exclusively in the central nervous system (CNS), especially the prefrontal cortex and the hippocampus. Many genes in this group, such as CACNA1C, CACNB2, and RIMS1 have been implicated in the pathogenesis of schizophrenia (Table S2).
Thirty-five genes in the second group are highly expressed in immune cells: B-c and T-lymphocytes. Genes in this group include the major histocompatibility complex (MHC) genes, such as HLA-DQB1, HLA-DRB1, and HLA-DRA. MHC genes code for proteins that regulate immune functions (Janeway 2001), while the MHC region on chromosome 6p implicated in schizophrenia in replicated GWAS (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014). Other immune-associated genes in this group, such as PTGS2 and FMR1, have been linked to schizophrenia in previous studies (Wei and Hemmings 2004; Kelemen et al. 2013). Recent studies have found the anatomical connection between the immune system and the CNS (Aspelund et al. 2015; Louveau et al. 2015), which could explain the involvement of immune-associated genes in schizophrenia. The third group consists of 54 genes that are expressed across a wide range of different tissues including the CNS. In contrast to genes in the first group, genes in this group are not exclusive to the CNS, and are expressed more ubiquitously. Many genes in this group, such as EGR1, FGFR1, CHRNA5, SREBF1, SREBF2, and PARD3, are known to be involved in schizophrenia (Table S2). According to our results, an unexpectedly high percentage (∼25%) of schizophrenia risk genes are not expressed in the CNS. How those genes expressed in the immune system play a role in the pathogenesis of schizophrenia requires further investigation.
Overlaps in schizophrenia genetic architecture
The common variant part of the genetic architecture of schizophrenia has been studied extensively in recent SNP array-based GWAS, which have identified a large number of associated SNPs, as noted above. A new frontier for schizophrenia genetics is to identify rare variants, and de novo mutations, associated with schizophrenia risk by whole exome (WES) or whole genome sequencing (WGS). Two recent studies—WES of 2536 schizophrenia individuals and 2543 healthy controls (Purcell et al. 2014), and WES of 623 schizophrenia trios (Fromer et al. 2014)—are the two largest sequencing-based studies to fill in the rare variant and de novo mutation part of the genetic architecture of schizophrenia. Although strong evidence from these large-scale genetics studies suggests that there is convergence of rare and common variants in genetic architecture of schizophrenia at broad gene functional levels (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014), it remains unclear, however, how commonly at gene levels, rare variants underlie schizophrenia GWAS signals (form common variants), and how commonly schizophrenia risk genes may exert their pathogenic effects through both common and rare variants. We were able to shed some new light on these questions by comparing our GWAS-derived schizophrenia risk genes with genes containing rare variants or de novo mutations implicated in schizophrenia by the previous exome-sequencing studies (Girard et al. 2011; Xu et al. 2012; Fromer et al. 2014; Purcell et al. 2014).
Of our schizophrenia risk genes, 37, 7, and 7 contain rare variants, de novo mutations, or both, respectively (Figure S9A). We conducted two statistical tests to assess the significance of overlap between schizophrenia risk genes that we predicted, and schizophrenia risk genes implicated by rare mutations (Figure S9, B and C). There is a statistically significant association (P = 6.7 × 10−4) between high scoring genes linked to schizophrenia GWAS loci and schizophrenia genes implicated by rare variants (Figure S9B). After eliminating the confounding effect of “high scoring,” the overlap between these two sets of schizophrenia risk genes remains significant (P = 8.3 × 10−4) (Figure S9C). Such overlaps indicate the possibility that some schizophrenia risk genes may contribute to the disease through both common and rare variants. Among the aforementioned 37 schizophrenia risk genes, we also found genes involved in glutamatergic neurotransmission (GRIK3 and GRIN2A), and genes encode calcium channels (CACNA1C and CACNB2) and synaptic plasticity (NMDAR genes such as FLNA and MAPK3) (Kirov et al. 2012). All these three gene classes have been implicated in schizophrenia by both rare and common variants in a previous study (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014).
Pathway enrichment
The Psychiatric Genomics Consortium (PGC) meta-analysis of schizophrenia (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014) could not, with statistical significance after multi-test correction, identify any enriched pathways among genes within the 108 loci. By focusing only on high scoring risk genes, and expanding gene candidates to include distal genes and genes associated with weak GWAS signals, many biologically plausible pathways were overrepresented (Table S6 and Table S7). In addition, we also found pathways not enriched with training schizophrenia genes, including pathways involved in neural development (FGF signaling and Adherens junction), synaptic function and plasticity (Endothelin signaling pathway), and immune system (B cell activation and intestinal immune network for IgA production), all of which are consistent with the current knowledge of the etiology of schizophrenia. By integrating GWAS signals and regulatory information, our approach can identify disease risk genes to uncover novel disease-related pathways.
Schizophrenia risk genes with different association strengths
In GWAS, variants show different degrees of association with the disease. Variants with smaller P-values in the same study imply higher risks than variants with larger P-values. To identify the biological factors underlying different genetic risks, we divided the range of schizophrenia association strength of the risk regions into three classes based on the single large PGC study (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014). According to the P-value distribution of GWAS SNPs (Figure S10), we divided 176 risk regions into three classes: 62 weak (P > 5 × 10−8), 70 moderate (10−10 < P < 5 × 10−8), and 39 strong (P < 10−10) regions—with different disease-association strengths based on the lowest P-values of associated PGC GWAS signals in each region (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014). The “weak” class consists of risk regions that contain no genome-wide significant GWAS signals from the PGC study. Five weak risk regions with either no, or contradictory, imputation signals in the PGC study were excluded from the analysis (Table S3). We then assigned the schizophrenia risk genes to these three association classes based on the GWAS variants to which they are linked (Table S8). GO term analysis reveals that genes in these three disease-association classes are enriched with GO terms of distinct biological processes (Figure 4): schizophrenia risk genes with weak association are enriched in biological processes related to cellular regulation and differentiation; ones with moderate association function mainly in response to stimulus and organismal processes; and strong association is connected with synaptic transmission and signaling. For example, weak associations involve many schizophrenia risk genes that play a role in cellular regulation of neural development, such as cell motion and axongenesis (L1CAM, ANK3, BMP7, CXCL12, and RELN). In contrast, strong associations involve many schizophrenia risk genes encoding calcium channels (CACNA1C and CACNB2) and neurotransmitter receptors (DRD2, CHRNA3, CHRNA5, CHRM4, and HTR3B) that are directly involved in synaptic transmission. To provide some biological context to the three disease-association classes, we compiled a set of 20 genes connected to schizophrenia from the OMIM database (http://www.omim.org, accessed November 2014) (McKusick 2007). These OMIM genes do not overlap with our predicted schizophrenia risk genes. As cataloged in the OMIM database, these genes have identifiable genetic factors that may have larger effect sizes on schizophrenia risk in general. Interestingly, like schizophrenia risk genes with strong association, these OMIM genes are also enriched in the biological process of synaptic transmission and signaling.
GO terms enriched among schizophrenia risk genes with different association strengths. The five most significantly enriched GO terms, and their P-values adjusted for multiple testings, are shown for each gene set. The label “OMIMS” in purple denotes 20 schizophrenia risk genes that we curated from the OMIM database. The labels “Strong,” “Moderate,” and Weak denote 36, 49, and 35 putative schizophrenia risk genes implicated by strong, moderate, and weak GWAS signals, respectively (see Table S8).
Consistent with the widely accepted hypothesis that schizophrenia symptoms are caused by the imbalance of neurotransmitter in brain, our result suggests that genes involved in synaptic transmission and signaling tend to have strong association with schizophrenia due to their direct influence on the balance of neurotransmitter in brain. The mutations of many genes involved in cellular regulation in brain may contribute to brain defects in the brain developmental process. However, this consequence may have implicit connection to the outcome of neurotransmitter imbalance, which is reflected by their weaker associations in general.
Expression of schizophrenia risk genes during brain development
Strong research findings indicate that schizophrenia is a complex neurodevelopmental disorder (Fatemi and Folsom 2009; Catts et al. 2013). Thus, we investigated how schizophrenia risk genes are expressed during brain development. Instead of studying them individually, or together as a whole, we examined the spatiotemporal expression profiles of the aforementioned three disease-association classes at eight brain locations, and 12 time points during brain development, using RNA-Seq data from BrainSpan (http://www.brainspan.org/, accessed March 2016) (Figure S11) (File S1). Expression analysis reveals that the timing of their transcriptional activity during brain development correlates well with the strength of their association with schizophrenia (Figure 5): schizophrenia risk genes with weak, moderate, and strong association tend to be more actively transcribed during the early, middle, and late time periods, respectively, during brain development. Again, like schizophrenia risk genes with strong associations, the OMIM schizophrenia genes tend to be transcribed more actively during the late time period. We generated new sets of prioritized genes using specially controlled training genes. Our comprehensive analysis of these genes showed essentially the same spatiotemporal expression patterns during brain development as before (Figure S12), and thus excluded the possibilities that the properties of training genes drive the patterns of transcriptional activities of schizophrenia risk genes with different association strengths. Although the binarization process used in the approach discards some transcriptional information, the advantage of our approach to identifying spatiotemporal expression patterns is the interpretability of its result, which shows the proportion of genes in the gene set that tend to be transcriptionally active, or suppressed at the corresponding time stage and brain region. To ensure that the observed spatiotemporal expression patterns are robust, we used a different transformation of the expression data, which gave results (Figure S13A) consistent with our previous observation. Moreover, we conducted statistical tests to assess the significance of transcriptional activities. The test results show consistent spatiotemporal expression patterns (Figure S13, B and C), indicating that the distinct patterns of transcriptional activities of our prioritized genes in different association classes are not due to the overall characteristics of genes linked to the genomic regions (Figure S13C). The three transcriptionally active time periods correspond to distinct brain developmental stages (Figure S14). The early time period is from 4 to 12 postconception weeks (PCW), when cell birth and migration occur in the embryonic and early prenatal brain. The middle time period includes 25–38 PCW (late prenatal), and 6–18 months after birth (late infancy), a major development stage for synaptogenesis. The late time period mainly consists of 8–19 years and 20–40 years, which include adolescence and early adulthood, when the onset of schizophrenia usually occurs.
Spatiotemporal expression patterns of schizophrenia risk genes during brain development. The heat maps show both the active (red), and the suppressed (blue), expression, respectively, of different gene sets. The rows are 12 developmental stages in a chronological order, and the columns are eight brain regions. The shade of the color in a heat map is proportional to the ratio of genes that manifest active (or suppressed) activities, at the corresponding brain location and time stage, to the total number of genes in the specific gene set. E.a-f and P.g-l denote six embryonic, and six postnatal, developmental stages (see Figure S14 for details). DFC, dorsolateral prefrontal cortex; VFC, ventrolateral prefrontal cortex; OFC, orbitofrontal cortex; MFC, medial prefrontal cortex; STC, posterior superior temporal cortex; ITC, inferior temporal cortex; HIP, hippocampus; AMY, amygdala.
The significantly enriched GO terms of biological processes among genes with weak association is consistent with the formation of brain “hardware” at the cellular level, for which early neurodevelopmental stages are critical times when these genes are most transcriptionally active. In addition to early stages of neurodevelopment, perinatal development is also potentially vulnerable to perturbations in schizophrenia susceptibility genes that may contribute to the future onset of the disorder (Catts et al. 2013). Considering that emerging evidence implicates postnatal development changes in schizophrenia (Catts et al. 2013), the observation that many schizophrenia risk genes with strong association are more active during this period is intriguing. The developmental trajectories of eight schizophrenia risk genes with strong associations (Figure S15) suggest that they are more active during the postnatal period, including adolescence.
Discussion
Schizophrenia is a complex genetic disease. As a severe lifelong mental disorder affecting ∼1% of the United States population, it creates an enormous burden to patients, their families and the community. In the past several years, GWAS have been applied successfully to schizophrenia, and a large number of associated genetic loci have been identified, which could lead to the development of targeted therapies. Interpreting the GWAS results, however, remains difficult due to both the design of GWAS, and the nature of many identified risk loci. First, SNPs used in GWAS are tagging SNPs, each representing a large LD block, which may contain a large number of genes and regulatory elements (and thus possibly affecting genes elsewhere). Second, most variants found in GWAS to be associated with diseases including schizophrenia lie outside of protein-coding regions, and this observation remains true even after fine-mapping around the associated loci (Wellcome Trust Case Control Consortium et al. 2012).
For lack of a better approach, the genes closest to, or in the vicinity of, disease-associated SNPs found in GWAS are generally assumed to be the risk genes. However, this assumption may be overly simplistic, and identifying putative disease risk genes using new computational tools is critical in properly interpreting GWAS signals for diagnostic and therapeutic purposes. Responding to this need, we used an integrated post-GWAS analysis, and identified 132 putative schizophrenia risk genes, and determined their functional roles in schizophrenia. In our analysis framework, we used new computational methods based on rigorous statistical modeling to integrate a large number of heterogeneous genomic data sets from diverse sources, and, with a sensible score threshold, achieved high accuracy in our risk gene prediction. Two advantages of our method are immediately clear from our analysis results. First, our method can identify putative disease risk genes not only in the vicinity of GWAS signals, but also at a distance by regulatory elements in the risk region that affect gene expression. Disease genes distal to GWAS signals have never been identified before. Second, our method can also identify putative disease risk genes for GWAS variants that did not reach the genome-wide significance level (P < 5 × 10−8). Such weak GWAS signals are usually ignored. In this study of schizophrenia, we identified 29 putative distal risk genes, and 36 putative risk genes with weak association. Together, there are 55 novel schizophrenia risk genes that were missed by previous GWAS.
Our pathway analysis result indicates that, even though our gene scoring method is based on the functional properties of known risk genes, by integrating with GWAS signals and regulatory information, our approach has potential to uncover novo risk pathways in which common risk variants are involved. The underlying reason is that, although high-scoring genes must have certain functional similarities with seed genes, they are also likely involved in other risk factors not associated with seed genes. Therefore, benefitting from the fact that GWAS is non-hypothesis-driven, the analysis of high scoring genes implicated by GWAS signals may reveal novel risk factors associated with common risk variants.
The extended MHC region is a gene-dense region with long LD blocks, and often drives false-positive predictions. Six risk regions are located in this complex region (Table S3), and they involve 98 candidate genes, of which 11 are high scoring (Table S9 and Table S10). If the extended MHC region is excluded from our analysis, the results stay essentially the same. The set of high scoring genes remains highly associated with genes with relevant phenotypes of transgenic mice (P = 3.11 × 10−16), and genes with literature support (P = 6.26 × 10−5). The percentage of high scoring genes expressed in immune related tissues but not in the CNS remains high (∼25%). The enrichment of GWAS risk genes among schizophrenia risk genes implicated by rare variants stays significant (P = 8 × 10−5). The extended MHC region is not involved in the analysis of schizophrenia risk genes with different association strengths, due to the uncertainty about the association strength of the risk regions within it (Table S3).
To explain the lack of association between 132 schizophrenia risk genes and genes from differential expression studies of schizophrenia, we investigated their topological arrangements in the functional linkage network. There are 932 differentially expressed genes among the neighbors of all 132 schizophrenia risk genes. On average, there are more differentially expressed genes among the neighbors of each of 132 schizophrenia risk genes, compared to 132 random genes (Figure S16). The result indicates that, although schizophrenia risk genes themselves may not be differentially expressed between schizophrenia patients and normal individuals, compared to nonrisk genes, they are more likely (P = 0, with 1000 replicates) to be functionally associated with differentially expressed genes.
In this study, we focused functional analyses on 132 prioritized genes out of 643 candidate genes. Despite the presence of potential false negatives [e.g., ZNF804A], the overall characteristics of the remaining 511 candidate genes are very different from our prioritized genes. For example, genes with relevant phenotypes in transgenic mice, and genes with literature support for schizophrenia risk, are both overrepresented in our prioritized genes, but not in the remaining candidate genes (Figure S17). As expected, the patterns of transcriptional activities for prioritized genes with different association strengths are not observed for the remaining candidate genes (Figure S13C and Figure S18).
Of the 176 schizophrenia risk regions derived from GWAS signals, 100 do not contain genes with high scores. Several reasons could account for this absence. First, for risk regions with weak associations, the possibility that the associated GWAS signals were false positives could not be excluded, especially for regions that do not contain genes with high scores. Second, some distal risk genes might not be included in the candidate gene list due to incomplete TRE/eQTL regulatory information. Third, our schizophrenia gene scoring method relied on previous knowledge of functional linkage network and GO annotations, and thus was limited by them. Fourth, our schizophrenia gene scoring method was trained by using the schizophrenia training gene set. Some schizophrenia risk genes exerting pathogenic effects through very different mechanisms from schizophrenia training genes would not score highly. Fifth, our method considered only coding schizophrenia genes, while noncoding RNAs, such as miRNAs, were not considered. It should be noted that emerging evidence showed that miRNAs could also be risk factors for schizophrenia (Mellios and Sur 2012).
We identified 132 putative schizophrenia risk genes using our method, of which the majority have not been recognized in previous schizophrenia GWAS. In particular, 36 putative risk genes associated with GWAS signals at genome wide significance level were identified. Those weak signals are usually ignored due to the lack of an approach to avoid false positive GWAS signals. However, identification of risk genes with weak association is important to investigate the disease mechanisms underlying association strength. Our analysis suggests that, despite the high diversity of risk factors involved in schizophrenia, genes involved in certain biological processes are more likely to have higher degrees of penetrance, which indicates that certain biological processes have a stronger linkage to developing the disorder. Our analysis also shows that schizophrenia risk genes that are transcriptionally active in certain brain developmental stages are more likely to have higher degrees of penetrance, implicating a stronger linkage between the biological events in those brain developmental stages, and developing the disorder.
Acknowledgments
The authors thank Herbert M. Lachman of the Department of Psychiatry and Behavioral Sciences at Albert Einstein College of Medicine, and Anne S. Bassett of the Department of Psychiatry at the University of Toronto, for comments and suggestions. This work was supported by the National Institutes of Health grant MH101720 from the National Institute of Mental Health to the International Consortium on Brain and Behavior in 22q11.2 Deletion Syndrome. The authors declare that they have no competing interests.
Footnotes
Communicating editor: C. Sabatti
Supplemental material is available online at http://www.genetics.org/cgi/content/full/genetics.116.187195/DC1.
- Received January 15, 2016.
- Accepted September 30, 2016.
- Copyright © 2016 by the Genetics Society of America