A Statistical Multiprobe Model for Analyzing cis and trans Genes in Genetical Genomics Experiments With Short-Oligonucleotide Arrays

Alberts, Rudi; Terpstra, Peter; Bystrykh, Leonid V; de Haan, Gerald; Jansen, Ritsert C

doi:10.1534/genetics.105.045930

Abstract

Short-oligonucleotide arrays typically contain multiple probes per gene. In genetical genomics applications a statistical model for the individual probe signals can help in separating “true” differential mRNA expression from “ghost” effects caused by polymorphisms, misdesigned probes, and batch effects. It can also help in detecting alternative splicing, start, or termination.

IN a genetical genomics experiment, a panel of 30 genetically different recombinant inbred mice was derived from a cross between parental strains C57BL/6 (B6) and DBA/2 (D2) (Jansen and Nap 2001; Bystrykh et al. 2005). These 30 mice were profiled with Affymetrix MG-U74Av2 arrays, using RNA isolated from hematopoietic stem cells and 12,422 probe sets. The observed array data were background corrected and quantile normalized (Bolstad et al. 2003; Gautier et al. 2004). Although various methods have been developed to compute a single expression value per probe set for further data analysis (e.g., Zhang et al. 2003; Wu and Irizarry 2004; Manly et al. 2005), we here develop an alternative statistical method to more fully exploit the information contained in the individual probe signals.

Differential expression for a given gene can result from trans-regulation by other genes or from cis-regulation due to variation in the region of the gene itself (altering functional motifs in the promoter region, changing the stability of the mRNA, or modifying the gene product in such a way that the feedback loop is shifted; Jansen and Nap 2004). In either case signal differences are supposed to be rather stable across probes (Figure 1A). The differences may, however, also change from one probe to another, due to known or unknown single-nucleotide polymorphisms (SNPs) or microdeletions between mRNA transcripts of different samples (Figure 1B; see also Doss et al. 2005), due to misdesigned probes (the majority of probe sets in Figure 1 are sequence verified; see also Mecham et al. 2004) or due to other known or unknown factors (e.g., alternative transcription). In such cases computing a single expression value per probe set, as in the current methods, leads to a loss of biologically relevant information. This will also hold for future alternative transcription/splicing arrays with probes located in different exons and not in the last exon or 3′-untranslated region only, as in the MG-U74Av2 array used in our experiment (Sharov et al. 2005). When the differences in signal between probes match with information in alternative splicing databases, they indicate that alternative transcription/splicing is the cause, and not an SNP.

Figure 1.—

Open in new tab Download slide

Model parameterization and fitting: two examples and whole-genome results. (A) Expression profiles of 30 mice, adjusted for the effects of probe, batch, and probe-specific batch effects, are shown for probe set 103036_at. The mice with solid (shaded) profile carry the B6 (D2) allele at marker D15Mit158. Expression difference is relatively constant across the 16 probes. (B) Similar plot for probe set 96243_f_at and marker D1Mit145. In this case the expression difference is highly variable between probes; i.e., there is probe-by-QTL interaction. We sequenced our D2 strain and found an influential polymorphism at relative base position 30 in probes 4–9 and a second polymorphism at position 57 in probes 11–15 (see vertical lines). The direction of the QTL effect is indeed such that the B6 allele has higher signal than the D2 allele. (C) Results of the QTL analyses on all probe sets. Significance of QTL (top) and of probe-by-QTL interaction (bottom) is shown for the 100 most significant cis-acting probe sets. The dotted lines indicate probe sets carrying a currently known SNP. (D) Similar plot for the 100 most significant trans-acting probe sets. (E) Results of the QTL analyses on all probe sets. At each genome position the number of probe sets is shown that significantly map to that position, on the basis of either a model ignoring possible batch effects (top) or a model including batch and probe-specific batch effects (bottom).

In a genetical genomics analysis we search with the aid of molecular markers for a genome position where the difference in expression between mice carrying the B6 marker allele and mice carrying the D2 marker allele is (most) significant; this genome position is commonly denoted “expression quantitative trait locus” (eQTL or just QTL). In our study the 30 mice have been expression profiled by using probe sets of size 16. The 30 × 16 signals in a given probe set are decomposed into

\[\mathrm{log}(y_{ij}){=}m{+}B_{i}{+}P_{j}{+}PB_{ij}{+}A_{i}{+}PA_{ij}{+}e_{i}{+}e_{ij},\]

where y_ij is the signal of the jth probe of the ith mouse, m is the average signal, B_i is the average effect of the batch to which the ith mouse belongs (three batches), P_j is the average effect of the jth probe, PB_ij is the interaction effect between probe and batch, A_i is the average effect of the allele (B6 or D2) carried by the ith mouse at a given genome position, PA_ij is the interaction effect between probe and allele type, e_i is an error term per mouse, and, finally, e_ij is a probe-specific error term per mouse. The model is computed at each of 705 marker positions to find the position (QTL) with the most significant allele effect A_i; probe-by-QTL interaction is derived from the corresponding PA_ij. Standard F-values are computed: F_1,25 at the mouse stratum for the QTL and F_15,390 at the lowest stratum for the probe-by-QTL interaction.

Genes colocalizing within 20 Mb of their QTL are termed here cis genes, and genes mapping elsewhere are termed trans genes. The cis genes show many more probe-specific QTL effects than the trans genes do (Figure 1, C and D). It is expected that probe sets carrying the more influential polymorphisms between probe and transcript will be picked up as cis-acting with probe-specific QTL effects. Indeed 10 cis-acting genes carry currently known SNPs in one or more of their probes and the directions of the probe-specific QTL effects are in agreement with the SNPs; i.e., the mouse allele perfectly matching the probes on the array has the higher signal (allowing us to use the array for genotyping as well; see also Jansen and Nap 2001; Rostoks et al. 2005). Further lab research (e.g., sequencing of the D2 strain) will make clear how many of the cis-acting genes are caused by SNPs, alternative transcription, or other (hidden) factors.

Our analysis separates probe sets that are “consistently” cis across probes from those that are more “probe-specific” cis and should be investigated in more detail in silico or in the lab. Figure 1C shows that P-values for probe-by-QTL interaction can be very extreme. At the one hand probe-specific QTL effects can be large relative to e_ij (and thus statistically significant), at the other hand they can still be small relative to the average QTL effect (and thus biologically not really relevant). The biologist can best inspect figures (Figure 1B and alike) to see how many probes underlie probe-by-QTL interaction and to decide which genes merit closer scrutiny in silico or in the lab.

Some genomic regions showed up as “master” regulators in trans of many genes, but only if the factor batch was excluded from the model (particularly chromosome 4, see Figure 1E). The 30 samples have been profiled in three batches and samples in the same batch generally showed a very similar profile across the 16 probes, whereas samples in different batches showed different profiles. Spurious linkage between batch and these genome regions leads to “ghost” regulators, and, since batch and probe-specific batch effects are likely to occur in experiments with multiple arrays, it underpins the need for careful statistical modeling. It is important to note that standard permutation tests in QTL analysis without batch effects will not protect against this batch artifact (permutation should be carried out within batch-probe combinations).

We have shown that current methods fail to consider various influential variations (technical, molecular, and sequence) in genetical genomics studies, with worse fit, loss of relevant information, and possibly wrong conclusions as a result. Our statistical analyses on the entire probe data set are therefore at the moment the methods of choice in genetical genomics applications with short-oligonucleotide arrays to help in separating “true” cis and trans genes from “ghost” ones. Data are available at www.genenetwork.org (Chesler et al. 2004).

Footnotes

Communicating editor: C. Haley

Acknowledgement

We thank J. P. Nap, D. J. de Koning, and C. S. Haley for stimulating discussions. R.A. was supported by Netherlands Organization for Scientific Research-Biomolecular Informatics grant 050-50-203.

References

Bolstad, B. M., R. A. Irizarry, M. Astrand and T. P. Speed,

2003

A comparison of normalization methods for high density oligonucleotide array data based on variance and bias.

Bioinformatics

19

:

185

–193.

Crossref

Bystrykh, L., E. Weersing, B. Dontje, S. Sutton, M. T. Pletcher et al.,

2005

Uncovering regulatory pathways that affect hematopoietic stem cell function using ‘genetical genomics’.

Nat. Genet.

37

:

225

–232.

Crossref

Chesler, E. J., L. Lu, J. Wang, R. W. Williams and K. F. Manly,

2004

WebQTL: rapid exploratory analysis of gene expression and genetic networks for brain and behavior.

Nat. Neurosci.

7

:

485

–486.

Crossref

Doss, S., E. E. Schadt, T. A. Drake and A. J. Lusis,

2005

Cis-acting expression quantitative trait loci in mice.

Genome Res.

15

:

681

–691.

Crossref

Gautier, L., L. Cope, B. M. Bolstad and R. A. Irizarry,

2004

affy—analysis of Affymetrix GeneChip data at the probe level.

Bioinformatics

20

:

307

–315.

Crossref

Jansen, R. C., and J. P. Nap,

2001

Genetical genomics: the added value from segregation.

Trends Genet.

17

:

388

–391.

Crossref

Jansen, R. C., and J. P. Nap,

2004

Regulating gene expression: surprises still in store.

Trends Genet.

20

:

223

–225.

Crossref

Manly, K. F., J. Wang and R. W. Williams,

2005

Weighting by heritability for detection of quantitative trait loci with microarray estimates of gene expression.

Genome Biol.

6

:

R27

.

Crossref

Mecham, B. H., D. Z. Wetmore, Z. Szallasi, Y. Sadovsky, I. Kohane et al.,

2004

Increased measurement accuracy for sequence-verified microarray probes.

Physiol. Genomics

18

:

308

–315.

Crossref

Rostoks, N., J. O. Borevitz, P. E. Hedley, J. Russsell, S. Mudie et al.,

2005

Single-feature polymorphism discovery in the barley transcriptome.

Genome Biol.

6

:

R54

.

Crossref

Sharov, A. A., D. B. Dudekula and M. S. Ko,

2005

Genome-wide assembly and analysis of alternative transcripts in mouse.

Genome Res.

15

:

748

–754.

Crossref

Wu, Z., and R. A. Irizarry,

2004

Preprocessing of oligonucleotide array data.

Nat. Biotechnol.

22

:

656

–658.

Zhang, L., M. F. Miles and K. D. Aldape,

2003

A model of molecular interactions on short oligonucleotide microarrays.

Nat. Biotechnol.

21

:

818

–821.

Crossref

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Download all slides

Month:	Total Views:
February 2021	1
March 2021	2
April 2021	2
May 2021	1
June 2021	12
August 2021	1
September 2021	1
November 2021	3
December 2021	1
January 2022	9
February 2022	7
March 2022	6
April 2022	2
May 2022	4
June 2022	6
July 2022	11
August 2022	10
September 2022	12
October 2022	2
November 2022	5
December 2022	14
January 2023	2
February 2023	9
March 2023	3
April 2023	5
May 2023	4
June 2023	2
August 2023	10
September 2023	1
October 2023	1
November 2023	6
December 2023	3
January 2024	5
February 2024	5
March 2024	7
April 2024	18

Article Contents

A Statistical Multiprobe Model for Analyzing cis and trans Genes in Genetical Genomics Experiments With Short-Oligonucleotide Arrays

Abstract

Footnotes

Acknowledgement

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

A Statistical Multiprobe Model for Analyzing cis and trans Genes in Genetical Genomics Experiments With Short-Oligonucleotide Arrays

Abstract

Footnotes

Acknowledgement

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only