Abstract
The human genome uses alternative pre-mRNA splicing as an important mechanism to encode a complex proteome from a relatively small number of genes. An unknown number of these genes also possess multiple transcriptional promoters and alternative first exons that contribute another layer of complexity to gene expression mechanisms. Using a collection of more than 100 erythroid-expressed genes as a test group, we used genome browser tools and genetic databases to assess the frequency of alternative first exons in the genome. Remarkably, 35% of these erythroid genes show evidence of alternative first exons. The majority of the candidate first exons are situated upstream of the coding exons, whereas a few are located internally within the gene. Computational analyses predict transcriptional promoters closely associated with many of the candidate first exons, supporting their authenticity. Importantly, the frequent presence of consensus translation initiation sites among the alternative first exons suggests that many proteins have alternative N-terminal structures whose expression can be coupled to promoter choice. These findings indicate that alternative promoters and first exons are more widespread in the human genome than previously appreciated and that they may play a major role in regulating expression of selected protein isoforms in a tissue-specific manner. (Blood. 2006;107: 2557-2561)
Introduction
Alternative pre-mRNA splicing is an important cellular mechanism by which functionally diverse proteins can be synthesized from a single gene. As many as 60% of human genes use alternative RNA processing to generate, from a single gene, mature mRNAs that differ in exon composition at the 5′ end, within the internal coding regions, or at the 3′ end.1-3 By producing different splice variants of a single gene, multiple protein isoforms with diverse structural/functional properties can be generated; examples of this abound among molecules that are involved in transcriptional activation, ligand interaction at the cell surface, and intracellular binding interactions among cytoskeletal components.4,5 Therefore, the use of alternative exons is an essential mechanism for proper regulation of many cellular processes.
The erythroid protein 4.1R is a prime example of how splicing can significantly affect a molecule's characteristics. Many different 4.1R isoforms can be produced from the single complex locus during erythroid development. For example, exon 16, which encodes an essential part of the spectrin-actin binding domain of the 4.1R molecule, is excluded in early erythroid progenitors but included in erythrocytes of later stages; the inclusion of this critical spectrin-binding region allows 4.1R proteins to stabilize the developing membrane skeleton in mature red cells, by participating in associations with skeletal protein networks as well as contacts with integral membrane proteins in the lipid bilayer.6 In addition to a host of internal exons, the 4.1R gene also has considerable complexity at the 5′ end. Specifically, there are 3 mutually exclusive first exons that map far upstream of the coding exons, with each of these alternative exons having its own transcriptional promoter. These exons splice to different acceptor sites downstream in exon 2, producing 2 protein isoforms (80 kDa and 135 kDa) that possess distinct N-terminal structure and function.7
Besides 4.1R, several other erythroid genes are also known to have alternative first exons. The gene encoding uroporphyrinogen-III synthase, an enzyme essential for heme biosynthesis, has 2 alternative 5′ exons, one of which is specific for erythropoietic tissues.8 Two other genes vital for erythroid functioning, acetylcholinesterase9 and ankyrin-1,10 also exhibit complexity at the 5′ terminus. In the current study, we show that this genomic theme is present among many other erythroid genes as well. Genetic database analysis revealed that many erythroid genes possess 2 or more alternative first exons; the authenticity of these previously unappreciated first exons was supported by the presence of computationally predicted transcriptional promoters found near a large majority of the 5′ ends. The results of this study suggest that alternative first exons may represent an important mechanism by which the expression of erythroid genes is regulated.
Materials and methods
Erythroid genes analyzed
Many genes essential for human red cells were analyzed in this study. They were categorized into 6 different groups: (1) globins, (2) heme biosynthetic enzymes, (3) transcription factors, (4) cytoskeletal proteins, (5) plasma membrane proteins, and (6) glycolytic enzymes plus other cytoplasmic proteins. The majority of these genes were adopted from Hembase, a database of erythroid genes.11 A complete catalog of the genes analyzed in this study is displayed in Table 1.
Method of analysis
To investigate whether a particular gene has alternative first exons, the name of the gene (or its official gene symbol) was typed into the “Genome Search” option in the UCSC genome database.12 From the search results, choosing the “RefSeq” option provided extensive genomic information, including alignment of cDNA and EST clones. Clones with alternative 5′ ends were included in subsequent analyses only if they contained a majority of the gene's exons. For the erythroid genes that do possess alternative first exons, the accession IDs of their cDNA clones showing 5′ variability were recorded.
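The inclusion rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the procedure actually used, which was carried out by inspection in the genome browser. The `Clone` record, the clone names, the coordinates, and the reading of "a majority" as "more than half of the reference exons" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clone:
    accession: str         # placeholder label, not a real accession ID
    exons: frozenset       # indices of reference exons present in the clone
    first_exon_start: int  # genomic coordinate of the clone's 5' exon (hypothetical)


def majority_clones(clones, n_reference_exons):
    """Keep clones containing more than half of the gene's exons (assumed cutoff)."""
    return [c for c in clones if len(c.exons) > n_reference_exons / 2]


def distinct_first_exons(clones):
    """Return the sorted set of distinct 5' exon start coordinates."""
    return sorted({c.first_exon_start for c in clones})


# Invented data for a gene with 10 reference exons.
clones = [
    Clone("cloneA", frozenset(range(1, 11)), 1000),
    Clone("cloneB", frozenset(range(2, 11)), 5000),
    Clone("cloneC", frozenset({9, 10}), 9000),  # short fragment, filtered out
]

kept = majority_clones(clones, n_reference_exons=10)
first_exons = distinct_first_exons(kept)
has_alternative_first_exons = len(first_exons) > 1
```

In this toy case the short fragment is discarded, and the two remaining clones start at different 5′ exons, so the gene would be flagged as a candidate.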
Analysis of promoter regions
DNA sequences flanking alternative first exons were analyzed using PROSCAN v1.7 (BioInformatics and Molecular Analysis Section, National Institutes of Health [NIH]; http://thr.cit.nih.gov/molbio/proscan), a program designed to find putative eukaryotic RNA polymerase II promoter sequences in primary sequence data.13 Whenever possible, 3000 bp upstream and 1500 bp downstream of each alternative first exon were submitted to the program to search for potential promoter sites. For some genes, the first exons are clustered very close together, which makes this delineation impossible; in these cases, the search regions were adjusted to avoid overlap. Finally, a negative control was also performed for each gene that exhibited 5′ end complexity. This was accomplished by taking internal gene sequences far from the first exon and searching them for potential promoters using PROSCAN.
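The choice of search regions can be illustrated with a minimal sketch. Several details are assumptions rather than statements from the text: the window is taken to run from 3000 bp before the exon start to 1500 bp after the exon end, coordinates are treated as plus-strand half-open intervals, and overlapping windows are resolved by splitting at the midpoint between neighboring exons. The text says only that adjustments were made, so the midpoint rule is one possible choice, and the function names are hypothetical.

```python
def search_window(exon_start, exon_end, upstream=3000, downstream=1500):
    """Return (start, end) spanning `upstream` bp before and `downstream` bp after the exon."""
    return (max(0, exon_start - upstream), exon_end + downstream)


def non_overlapping_windows(exons, upstream=3000, downstream=1500):
    """Compute one search window per exon, trimming neighbors that would overlap.

    `exons` is a list of (start, end) tuples sorted by start coordinate.
    Overlaps are split at the midpoint between adjacent exons (assumed rule).
    """
    windows = [list(search_window(s, e, upstream, downstream)) for s, e in exons]
    for i in range(len(windows) - 1):
        if windows[i][1] > windows[i + 1][0]:
            midpoint = (exons[i][1] + exons[i + 1][0]) // 2
            windows[i][1] = midpoint
            windows[i + 1][0] = midpoint
    return [tuple(w) for w in windows]


# Invented coordinates: two clustered first exons and one distant first exon.
example = non_overlapping_windows([(10000, 10200), (11000, 11150), (50000, 50100)])
```

Here the first two windows would otherwise overlap, so they are cut at position 10600, while the distant exon keeps its full window.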
Results
Identification of alternative first exons in erythroid genes
We examined a total of 103 genes which are expressed in the human red cell, using mainly genes from the Hembase database,11 supplemented with a number of transcription factor genes from the literature. A complete inventory of the genes studied is shown in Table 1. These genes are categorized as globins, heme biosynthetic enzymes, transcription factors, cytoskeletal proteins, plasma membrane proteins, and glycolytic enzymes. For each of these genes, we analyzed all of the known cDNA clones in the UCSC genome browser database12 to look for cDNAs that share coding exons but contain alternative 5′ termini. Such termini were considered strong candidates as alternative first exons if they were located upstream of, and were correctly spliced to, coding exons of the gene. Computational evidence for adjacent transcriptional promoter activity, when available, increased confidence that these 5′ sequences represent authentic first exons (see “Promoter scanning of alternative first exons”).
As shown in Table 2, our major novel finding was that a high proportion of erythroid-enriched genes possess candidate alternative first exons (36 of 103, 35%). Although several of these have been reported in earlier studies, the vast majority represent previously unrecognized alternative first exons. This result suggests that alternative first exons are much more common than previously appreciated. However, it seems likely that even this figure may be an underestimate, because a few previously reported alternative first exons were not apparent in the database analysis. Among the missing first exons were one in the protein 4.1R gene,7 2 in the acetylcholinesterase gene (ACHE),9 a muscle-specific promoter in the ankyrin-1 (ANK1) gene,10 a kidney-specific promoter in the band 3 gene (AE1),14 and a testis-specific promoter in the GATA1 gene.15 Therefore, additional first exons will likely be discovered when a more complete set of cDNAs is available for mapping against the human genome assembly.
Comparison of the various erythrocytic functional categories indicates that certain gene classes appear more likely than others to display 5′ complexity (Table 3). None of the prototypical erythroid globin genes, for example, shows alternative first exons. In contrast, a high frequency of cytoskeletal proteins and transcription factors, at least 50% of both groups, exhibited alternative first exons. The other gene classes exhibited an intermediate value in between these 2 extremes. Furthermore, there is a pronounced tendency for individual members of paralogous gene families to exhibit similar structures. For example, genes within the adducin and NFE2-like families exhibited complex 5′ structures, whereas the Rhesus blood group genes did not. Presently, it is not known whether the differences among the functional categories represent a genuine phenomenon or whether it is an artifact contributed by a limited sample size; additional studies on a more expansive, genome-wide scale will be necessary to confirm this observation.
Promoter scanning of alternative first exons
An essential feature of a genuine first exon is the presence of an adjacent transcriptional promoter. To search computationally for potential promoter sites among the predicted first exons of these erythroid genes, we used a promoter-scanning program, PROSCAN (v1.7). This program predicts eukaryotic RNA polymerase II promoter sites based on recognition of the number and type of transcriptional elements that are typically associated with RNA polymerase II.13 It recognizes approximately 70% of primate promoter sequences. Therefore, if putative promoter elements can be found for a similar proportion of first exons analyzed in this study, it would lend credence to their being authentic, bona fide 5′ ends.
Excluding a small number of 5′ exons that have been analyzed experimentally in previous studies, we scanned the flanking regions of all of the alternative first exons identified in this study and found that 72.6% of them possessed a predicted promoter in the vicinity. These promoters are indicated by arrows in Figure S1 (available at the Blood website; see the Supplemental Figure link at the top of the online article). Most of these promoter sequences are found within 500 bp upstream of the first base pair of the cDNA, although a few were predicted as far as 1500 bp upstream or even slightly downstream from the first exon. Precedent for downstream sequences being required for promoter activity exists in the case of α-spectrin.16 For several transcripts, 2 candidate promoters were predicted, likely indicating a high concentration of transcription factor binding sites in the region.
To independently ascertain the fidelity of the program, we performed both positive and negative controls in the promoter-scanning analysis. As a positive control, we analyzed the DNA flanking the first exons of the protein 4.1R gene; in a previous study, using luciferase reporter constructs, we showed definitively that the DNA regions flanking exons 1A, 1B, and 1C function as transcriptional promoters.7 Computer scanning of these regions did in fact predict promoters in the expected locations for 2 of the three 5′ exons (Figure S1). The negative control was done by surveying randomly selected internal 3-kb (kilobase) sequences in each of the 36 genes with alternative first exons. Because these sequences were distant from any known 5′ terminus, they were not expected to possess promoter activity. Indeed, only 1 (2.8%) of the 36 negative control sequences contained a candidate promoter region. These controls suggest that the program, although accurate enough to predict known promoters, is not undesirably permissive.
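The negative-control selection can be sketched as a simple rejection sampler. The text does not state how far a control sequence had to be from a 5′ terminus, so the `min_distance` value of 10 kb is a purely hypothetical threshold, as are the function name and the example coordinates.

```python
import random


def sample_internal_window(gene_start, gene_end, five_prime_ends,
                           length=3000, min_distance=10000,
                           rng=None, max_tries=1000):
    """Pick a random `length`-bp window at least `min_distance` bp from every 5' end.

    Returns (start, end), or None if no suitable window is found.
    """
    if gene_end - gene_start < length:
        return None
    rng = rng or random.Random()
    for _ in range(max_tries):
        start = rng.randint(gene_start, gene_end - length)
        end = start + length
        # Accept only if the window is far from every known 5' terminus.
        if all(end + min_distance <= p or start - min_distance >= p
               for p in five_prime_ends):
            return (start, end)
    return None


# Invented 200-kb gene with first exons at positions 0 and 50000.
window = sample_internal_window(0, 200000, [0, 50000], rng=random.Random(0))

# Fraction of control sequences with a predicted promoter, as reported: 1 of 36.
false_positive_rate = 1 / 36
```

Seeding the generator, as in the usage line, makes the selection reproducible; the reported figure of 2.8% is simply 1/36 rounded to one decimal place.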
Taken together, these results substantiate the view that the majority of predicted first exons are indeed legitimate because they possess nearby, flanking elements that may function as promoters. Because the proportion of first exons predicted to possess associated promoter activity in this data set is similar to the known accuracy of the algorithm, it suggests that the large majority of candidate first exons represent authentic 5′ ends. Positive and negative controls further reinforce this notion by showing that the program does recognize known promoters while not being promiscuous in its identifications.
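The comparison between the observed hit rate and the accuracy of the algorithm can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a simple two-component mixture that is not part of the original analysis: authentic first exons are detected at the program's stated sensitivity (approximately 70%), and non-authentic sequences are flagged at the negative-control rate (1 of 36). Under that assumption, the observed rate equals p × sensitivity + (1 − p) × false-positive rate, which can be solved for the authentic fraction p.

```python
def implied_authentic_fraction(observed_hit_rate, sensitivity, false_positive_rate):
    """Solve observed = p*sensitivity + (1-p)*false_positive_rate for p, clamped to [0, 1]."""
    raw = (observed_hit_rate - false_positive_rate) / (sensitivity - false_positive_rate)
    return min(1.0, max(0.0, raw))


# Values taken from the text; the mixture model itself is an assumption.
p_authentic = implied_authentic_fraction(
    observed_hit_rate=0.726,
    sensitivity=0.70,
    false_positive_rate=1 / 36,
)

# For contrast, a hypothetical observed rate of 35% would imply roughly half authentic.
p_contrast = implied_authentic_fraction(0.35, 0.70, 1 / 36)
```

With the reported values the unclamped estimate slightly exceeds 1, so the clamped result is 1.0. Given the small sample sizes and the approximate sensitivity figure, this should be read only as consistent with most candidates being authentic, not as a precise estimate.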
Alternative first exons with unique features
Analysis of the genes possessing alternative first exons revealed that 5′ complexity can assume several different forms. Of the 36 genes found to have alternative first exons, most (28) exhibit 2 such exons, whereas 6 genes had 3 and 2 genes had 4. In terms of gene organization, the most common arrangement was the presence of mutually exclusive first exons that map upstream of, and splice accurately to, a common set of downstream exons (26 of 36, 72.2%). These upstream first exons can be located close together or tens of kilobases apart, differences that presumably reflect the regulatory requirements of the individual genes. The EPB41 gene, for example, has first exons as far as 101 kb upstream of the shared coding exon, whereas the distance is much less in EPB49 and NRF1. A complete listing of the genes belonging to this class is shown in Figure S1.
A second class of mutually exclusive alternative first exons differs from the conventional class by virtue of localization internally within the gene. That is, at least one of the apparent first exons occurs downstream of one or more internal coding exon(s), sometimes quite deep within the gene. This type of internal first exon has been reported previously in a number of nonerythroid genes as well as the erythroid AE1¹⁴ and ANK1¹⁰ genes; in the latter case, direct experimental evidence confirmed the presence of a transcriptional promoter in intron 39 associated with this novel 5′ end, located 228 kb from the most upstream first exon.10 Additional examples of this class include aldolase A (ALDOA) and TPM1 (tropomyosin-1), shown in Figure S1.
Finally, several genes possess “dual-capacity” 5′ exons predicted to function either as alternative first exons or as internal exons, depending on differential usage of transcriptional promoters and pre-mRNA splice sites. Precedent for this type of arrangement has been reported for the UROS (uroporphyrinogen synthase) gene, including experimental confirmation of erythroid-specific promoter activity associated with the previously designated exon 2.8 In the context of transcripts that initiate upstream at exon 1A, the erythroid promoter near exon 2 is ignored and instead the exon 2 splice acceptor site is used, converting this sequence into internal exon 2. In contrast, transcripts derived from the erythroid promoter express a 5′-extended variant of “exon 2” sequence that represents instead an alternative first exon. Among the erythroid genes analyzed here, similar organization is present in the ACHE (acetylcholinesterase)9 and SLC14A1 (hUT-A1 urea transporter; Figure S1) genes.
Alternative first exons versus protein N-termini and gene complexity
We next investigated the effects of alternative first exon utilization on the structures of the encoded proteins. Translation effects may occur directly, by insertion of alternative in-frame translation initiation sites, or indirectly by effects on downstream alternative splicing patterns of exons that contain initiation sites.7 Analysis of the erythroid alternative splicing data set in this study reveals that in about two thirds of the cases (23 of 36, 64%), genes with alternative first exons have the capacity to alter translation so as to encode protein isoforms with distinct N-terminal domains. In conventional cases of mutually exclusive upstream first exons, the alternative N-termini were most frequently in the range of 10 to 50 amino acids in length. At the other extreme, the internal alternative first exon in the ANK1 gene encodes a protein lacking 1726 amino acids,10 and the isoforms created by the various 4.1R first exons differ by 210 amino acids at the N-terminus.7 Therefore, it is clear that alternative first exons may have widespread effects on protein structure and function.
One final observation we made is that complexity of a gene (defined by the total number of exons in its transcription unit) is not correlated with its likelihood of having 5′ complexity. Specifically, we tabulated the average number of exons of the 36 genes that are found to have alternative first exons and also that of the 67 genes that did not show the feature (10.20 ± 1.30 and 9.93 ± 1.04, respectively). Comparison of the 2 values indicates no correlation between gene complexity and alternative first exon occurrence. This is a somewhat surprising result, because one would intuitively expect a gene with more exons to be more prone to 5′ complexity.
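The size of the difference between the two groups can be gauged from the summary values alone. The sketch below assumes that the ± figures are standard errors of the mean, which the text does not state; if they are standard deviations instead, the calculation would need the group sizes and would give a different result. The function name is hypothetical.

```python
import math


def z_from_summary(mean_a, sem_a, mean_b, sem_b):
    """Approximate z statistic for the difference of two independent means.

    Assumes sem_a and sem_b are standard errors of the mean (an assumption here).
    """
    return (mean_a - mean_b) / math.sqrt(sem_a ** 2 + sem_b ** 2)


# Mean exon counts for genes with (n = 36) and without (n = 67) alternative first exons.
z = z_from_summary(10.20, 1.30, 9.93, 1.04)
```

Under that assumption the statistic is about 0.16, far below the conventional threshold of roughly 2, which agrees with the conclusion that exon number and 5′ complexity are not detectably related in this data set.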
Discussion
The results presented here demonstrate that the incidence of alternative first exons is more widespread among erythroid genes than has been recognized previously. In this survey of 103 genes, 35% exhibited apparent alternative 5′ ends that fulfill the major requirements expected of authentic first exons: they map upstream of the coding exons, splice accurately to these exons, and are computationally predicted to possess closely associated transcriptional promoters. However, it seems likely that the true frequency of alternative first exons may be even higher than that predicted by database analysis, because there does not exist a complete cDNA data set available for mapping against the genome sequence.
The complex 5′ exon structure characteristic of many erythroid genes likely influences expression of the encoded proteome by several mechanisms. The presence of multiple transcriptional promoters, each potentially regulated by separate enhancer elements and responsive to different environmental cues, can facilitate proper spatial and temporal regulation of gene expression in the specialized differentiation programs unique to various cell types. For a few of the genes, it has been shown that one promoter serves a general housekeeping function to facilitate expression in nonerythroid cells, whereas the other is erythroid-specific and apparently functions to ensure proper quantitative, stage-specific expression during erythropoiesis. Included in this category are the genes UROS,8 GATA1,15 ANK1,10 and AE1.14 For the majority of cases described here, however, specific regulatory functions have yet to be attributed to the individual promoters.
Besides spatial and temporal regulation from alternative promoters, the unique sequence features of alternative first exons may also allow additional levels of gene regulation. The simplest way this is accomplished is through the inclusion of different start codons among the first exons; as this produces protein isoforms with different N-termini, the biochemical properties of a particular gene product can be greatly altered. However, first exons that do not alter translation products can still affect a gene's expression in other ways, such as altering the efficiency of mRNA translation. An estimated 20% to 48% of human transcripts contain at least one upstream open reading frame (uORF) preceding the “real” start codon.17-20 These uORFs can manipulate the translation initiation efficiency of the main, downstream ORF sometimes in tissue-specific patterns,21-25 or they may even influence initiation at different AUG codons.26 Alternatively, 5′ UTRs can dramatically alter translation efficiency of the downstream ORF through the action of specific binding proteins such as iron regulatory protein 1 (IRP1). IRP1 binding to iron responsive elements in the 5′ UTRs of ferritin and ALAS2 can inhibit translation in an iron-dependent fashion.27-29
It is important to note that the phenomenon of 5′ gene complexity and alternative promoters is not limited to erythroid cells, but rather is a general feature of human genes expressed in many cell types. Recent studies have used high-throughput microarray strategies to exhaustively map complete transcriptome sequences against the cognate genome assemblies, leading to identification of many novel alternative 5′ and 3′ exons in previously known genes.30,31 A general conclusion of this approach is that vertebrate genes frequently possess alternative promoters/alternative first exons.32 Alternatively, “ChIP on chip” assays can be used to map RNA polymerase II binding sites within the genome, representing an independent approach to the identification of transcription start sites.33 We speculate that these complementary strategies will ultimately reveal that alternative usage of first exons may prove to be as widespread and important in gene regulation as is alternative splicing of internal coding exons.
Prepublished online as Blood First Edition Paper, November 17, 2005; DOI 10.1182/blood-2005-07-2957.
Supported by the National Institutes of Health (grant DK32094) (N.M.) and the Director of the Office of Biological and Environmental Research, U.S. Department of Energy, under contract DE-AC03-765F0098.
The online version of the article contains a data supplement.
The publication costs of this article were defrayed in part by page charge payment. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 U.S.C. section 1734.
We thank Christina Schleupen for discussions and suggestions.
