We present an analysis of intronic sequences in the human DMD and UTRN genes. In both genes, accumulation of repeated elements could account for intron expansion. Out-of-frame rod-domain exons have stronger splice sites and are separated by significantly longer introns than in-frame exons. These features are unique to the two homologs and are not shared by other spectrin superfamily genes.
The DMD gene is the largest known human gene, spanning >2500 kb of the X chromosome and occupying ~0.1% of the genome (Nishio et al. 1994; Lander et al. 2001); it is composed of 79 exons that account for only 0.6% of the sequence (Ahn and Kunkel 1993). Its main protein product, dystrophin, a member of the spectrin superfamily, is a rod-shaped protein (Koenig et al. 1988) that localizes to the sarcolemma. In vertebrates another large gene (Love et al. 1989) encodes utrophin, a protein that displays a conserved structure with dystrophin over its entire length, with higher sequence similarity in the C- and N-terminal regions (Tinsley et al. 1992; Pearce et al. 1993); despite high structural homology, the utrophin gene (UTRN) is about one-third the length of the dystrophin gene.
Mutations in the dystrophin gene are responsible for either Duchenne or Becker muscular dystrophy (DMD and BMD). The majority of DMD and BMD patients carry deletions in the gene (Koenig et al. 1989) with long introns being preferential sites for deletion breakpoints. Worldwide incidence of DMD is 1 in 3500 male births, one-third of which arise from new mutations (Ahn and Kunkel 1993); it has been speculated that the size of the dystrophin gene might partially account for the high new mutation rate. The extreme length of dystrophin introns, a feature conserved in mammals, birds, and invertebrates (Dominguez-Steglich et al. 1990; Neuman et al. 2001), has long been a matter of debate and, in this respect, detailed analysis of intron sequences might be of help.
To allow a closer comparison to dystrophin gene structure, we used BLASTn analysis of utrophin cDNA against human genomic sequences to map intron/exon boundaries and to describe splice junctions and most intronic sequences: the gene consists of 74 exons ranging in length from 23 to 269 bp. Average intron length is ~7633 bp (26,137 bp for dystrophin). In the two genes, 56 of the exons are identical in size and pairwise sequence alignment of these exons revealed a mean identity score of 61.9%. In contrast, no relation seems to exist between corresponding introns (Figure 1). Sequence analysis was performed for the available intron regions of both dystrophin and utrophin (see Table 1 for dystrophin). A high concentration of repetitive elements was found to be a key feature of many dystrophin and utrophin long introns; overall 32.1% of total dystrophin intron length (28.4% for utrophin) is accounted for by repeated sequences, with LINE-1 elements representing the major contribution to dystrophin intron size. Interestingly, when total length of repetitive sequences per intron was compared with residual intron length (intron size after removal of all repeats), a highly significant correlation was found for both the dystrophin and utrophin genes (Spearman correlation coefficients = 0.93 and 0.70, respectively; P < 0.001 in both cases; Figure 2a). This finding might indicate that early insertional events (which are now obliterated by accumulated point mutations) triggered further insertions, leading to incremental intron growth. Nonetheless, it is also possible that residual sequences did not arise from early sequence insertion but rather were already present and started accumulating interspersed repeated elements in proportion to their original length.
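The comparison of repeat content with residual intron length described above can be illustrated with a minimal sketch. It assumes that the Spearman coefficient is computed as the Pearson correlation of average ranks; the function names and the intron values are hypothetical and are not data from this study, and the software actually used for the analysis is not stated here. No P value is computed.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    # Raises ZeroDivisionError if either variable is constant.
    return cov / (var_x * var_y) ** 0.5


# Hypothetical (total intron length, repeat length) pairs in bp.
# These are illustrative values only, NOT measurements from the paper.
introns = [(1000, 100), (5000, 1200), (20000, 7000), (110000, 30000), (40000, 15000)]

repeat_lengths = [repeats for _, repeats in introns]
# Residual length = intron size after removal of all repeats.
residual_lengths = [total - repeats for total, repeats in introns]

rho = spearman_rho(repeat_lengths, residual_lengths)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based coefficient is a natural choice for intron lengths, which span several orders of magnitude, because it is insensitive to the few very long introns that would dominate a Pearson correlation on raw lengths.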
When the presence of different repeats was analyzed as a function of time (Figure 2b), similar profiles were obtained for the two homologs and in both cases the trend was superlinear, indicating that the augmented intron size resulting from each insertion event has indeed favored further insertions in a process that produced, in the last 130 million years, a size increase of ~20% for both utrophin and dystrophin. These data indicate that gradual accumulation of repeated elements may be regarded as a convincing hypothesis to explain intron expansion in these genes. Repeated elements also represent a large target for homologous unequal recombination, yet only a few breakpoints in the dystrophin gene have been sequenced and associated with homologous DNA misalignment (McNaughton et al. 1998). Suminaga et al. (2000) recently indicated a nonhomologous recombination event between Alu and LINE-1 repeats as the cause of a deletion in the dystrophin gene and hypothesized the existence of a novel source of instability. Nonetheless, here we show that 31.2% of the dystrophin intron size is represented by repetitive elements; this implies that even if breakpoints were not promoted by any single sequence element, approximately one-third of them would be expected to involve repeated sequences. Whatever the molecular mechanisms involved, the longest dystrophin introns have been shown to be preferential sites for deletion breakpoints (Baumbach et al. 1989); in this view intron expansion can be regarded only as a genetic load, and much more so if energetic effort during transcription and potential problems in pre-mRNA processing are considered. We have previously shown (Sironi et al. 2001) that, in the dystrophin gene, out-of-frame (OF) exons have significantly stronger splice sites than in-frame (IF) exons. Here we extended this analysis to utrophin splice junctions, and Table 2 shows that the same holds for utrophin.
A similar bias is not observed when splice sites of other genes of the spectrin superfamily are considered; this implies that duplication, early in evolution, of a common structural motif cannot be indicated as an explanation. Interestingly, in both genes, significant differences between splicing parameters are accounted for by rod-domain exons (Table 2) that encode a region where the two proteins display the lowest degree of conservation. One possibility is that this feature represents a device to minimize energetic waste: skipping of an IF rod-domain exon due to a splicing error would produce an internally truncated “Becker-like” protein, which would retain partial activity in cellular processes; in contrast, exon missplicing in the C- and N-terminal domains, where different binding sites are located, is expected to determine protein dysfunction irrespective of frame conservation or alteration. Intron lengths were also considered in the two homologs (Table 2) and OF exons were found to be separated by significantly longer genomic distances than IF exons; again, significant differences were accounted for by rod-domain exons and were not found when other genes of the spectrin superfamily were considered. This finding is quite surprising since the probability of cryptic splice site activation is expected to increase with intron length. Nonetheless, despite the lack of any simple relation between lengths of corresponding introns, this feature has been preserved in both dystrophin and utrophin, suggesting an underlying functional significance. In this respect it should be noted that attention has recently been focused on intron sequences as modulators of both splicing and transcription (Brinster et al. 1988; Okkema et al. 1993); intron-dependent recruitment of splicing factors to transcription sites has been reported in HeLa cells (Huang and Spector 1996), while Neel et al. (1993) indicated that, in NIH-3T3 cells, intron removal rate is dependent upon the number of introns on the nascent transcript; the authors suggest that interaction between intron sequences can increase both the specificity and the efficiency of splicing.
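The distinction between IF and OF exons, and the comparison of the introns that separate them, can be sketched as follows. An exon is IF when its length is a multiple of 3, so that skipping it preserves the reading frame. How intron length was assigned to each exon for Table 2 is not detailed in the text above; taking the mean of the two flanking introns is an assumption made here for illustration only, and the function names and all exon and intron lengths below are hypothetical. No significance test is included.

```python
from statistics import mean


def is_in_frame(exon_length_bp):
    """An exon is in-frame (IF) if skipping it preserves the reading frame,
    i.e. if its length is a multiple of 3; otherwise it is out-of-frame (OF)."""
    return exon_length_bp % 3 == 0


def mean_flanking_intron_by_frame(exon_lengths, intron_lengths):
    """Mean flanking intron length for internal IF and OF exons.

    intron_lengths[i] is the intron between exon i and exon i + 1.
    ASSUMPTION: each internal exon is scored by the mean of its upstream
    and downstream introns; the paper may have used a different measure.
    """
    if len(intron_lengths) != len(exon_lengths) - 1:
        raise ValueError("expected exactly one intron between consecutive exons")
    groups = {"IF": [], "OF": []}
    # First and last exons have only one flanking intron and are skipped.
    for i in range(1, len(exon_lengths) - 1):
        flank = (intron_lengths[i - 1] + intron_lengths[i]) / 2
        key = "IF" if is_in_frame(exon_lengths[i]) else "OF"
        groups[key].append(flank)
    return {key: (mean(vals) if vals else None) for key, vals in groups.items()}


# Hypothetical gene model (lengths in bp); NOT data from DMD or UTRN.
exons = [120, 93, 173, 150, 62, 99]
introns = [2000, 8000, 30000, 12000, 40000]

summary = mean_flanking_intron_by_frame(exons, introns)
print(summary)
```

Restricting the same calculation to rod-domain exons would only require filtering the exon indices before grouping.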
In a scenario whereby introns can enhance transcriptional activity and, eventually, stimulate the accumulation of splicing factors, long intronic regions might turn out to be not as disadvantageous as expected.
We are grateful to Dr. R. Giorda for useful discussion about the manuscript. We thank the Celera publication site for allowing sequence retrieval and analysis.
Communicating editor: A. J. Lopez
- Received August 31, 2001.
- Accepted November 15, 2001.
- Copyright © 2002 by the Genetics Society of America