Abstract
Transcription factors (TFs) and the regulatory proteins that control them play key roles in hematopoiesis, controlling basic processes of cell growth and differentiation; disruption of these processes may lead to leukemogenesis. Here we attempt to identify functionally novel and partially characterized TFs/regulatory proteins that are expressed in undifferentiated hematopoietic tissue. We surveyed our database of 15 970 genes/expressed sequence tags (ESTs) representing the transcriptosome of normal human CD34+ cells (http://westsun.hema.uic.edu/cd34.html), using the UniGene annotation text descriptor, to identify genes with motifs consistent with transcriptional regulators; 285 genes were identified. We also extracted the human homologues of the TFs reported in the murine stem cell database (SCdb; http://stemcell.princeton.edu/), selecting an additional 45 genes/ESTs. An exhaustive literature search of each of these 330 unique genes was performed to determine if any had been previously reported and to obtain additional characterizing information. Of the resulting gene list, 106 were considered to be potential TFs. Overall, the transcriptional regulator dataset consists of 165 novel or poorly characterized genes, including 25 that appeared to be TFs. Among these novel and poorly characterized genes are a cell growth regulator protein with a ring finger domain (CGR19, Hs.59106), an RB-associated KRAB repressor (RBAK, Hs.7222), a death-associated transcription factor 1 (DATF1, Hs.155313), and a p38-interacting protein (P38IP, Hs.171185). The identification of these novel and partially characterized potential transcriptional regulators adds a wealth of information to understanding the molecular aspects of hematopoiesis and hematopoietic disorders.
Introduction
Transcription factors (TFs) play a critical role in the process of lineage commitment and differentiation in hematopoietic tissue.1-4 Several such factors are known to control the basic molecular mechanisms that underlie this process, and their expression is tightly regulated in a stage- and lineage-specific manner.5 For example, the level of expression of PU.1 and GATA binding proteins plays a major regulatory role in myeloid development, with PU.1 being up-regulated with myeloid differentiation,6 whereas GATA1 and GATA2 are down-regulated.6,7 Disruption in the expression, sequence, or structure of critical TFs or their associated regulatory proteins can upset the delicate balance between proliferation and differentiation and lead to leukemogenesis. Most of the consistent translocations in myeloid leukemias that have been analyzed to date result in a fusion protein that alters the normal function of a TF or a related regulatory protein8,9; it is increasingly recognized that these genes might also contribute to leukemia by functional inactivation by mutation10 or chromosomal translocation.11-13 It has been speculated that the majority of translocations that have not yet been fully characterized probably also involve transcriptional regulatory proteins.14-16 Thus, the identification of novel transcriptional regulators, especially those that are located near translocation breakpoints, may help to specify new leukemia-related proteins, leading to better understanding and treatment of this disease.
In the present study, we took a global approach to identify novel and known transcriptional regulators that might participate in hematopoiesis and leukemogenesis by surveying databases of genes that are expressed in normal hematopoietic stem cells. We searched our previously reported database of 15 970 transcripts that are present in human bone marrow CD34 antigen–positive cells17 to identify those with functional motifs consistent with transcriptional regulators. We also searched a murine stem cell database18 to find the human homologues of TFs expressed in this tissue. Here we report the results of our search, which identified 330 genes that are potential transcriptional regulators, including 106 TFs, of which 25 are novel or poorly characterized. These TFs, especially the novel ones that have not been reported previously, may represent new pathways in hematopoiesis or leukemogenesis that have not yet been explored.
Materials and methods
The human CD34+ transcriptosome database
The preparation of the human CD34+ transcriptosome database was previously reported.17 The database, which is available online at http://westsun.hema.uic.edu/cd34.html, contains 15 970 complementary DNAs (cDNAs) expressed in CD34+ cells, and includes the GenBank accession number, the UniGene cluster identification number (http://www.ncbi.nlm.nih.gov/UniGene/) to which the GenBank clone belongs, its relative expression in CD34+ cells, the gene name, a functional description of the gene (from the UniGene text descriptor, build version 129), and its chromosomal location. The UniGene text descriptors of this database were searched for the following terms: transcription factor, leucine zipper, zinc finger, ring finger, helix-loop-helix, PHD, POU, forkhead, bromodomain, homeobox, oncogene, nuclear, activator, and repressor. The dataset was then updated to the most recent UniGene build (version 135, June 2001). Redundant cDNAs contained within the same UniGene cluster were removed, retaining only the clone with the highest expression level in the CD34+ database. Each cDNA sequence was then used to search the GenBank NR database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide) using BlastN alignment software, to verify homology to the predicted gene, using an arbitrary cut-off of E < e-40 to indicate sufficient sequence identity. If the E value was greater than e-40, the cDNA sequence was also used to search the TIGR database (http://www.tigr.org/) to verify that it was represented by a TIGR contig containing a transcript of the predicted function. The cDNAs that did not pass these GenBank or TIGR screens were removed from the database, as were genes of known function that clearly did not represent the categories that were being sought.
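The screening steps described above (descriptor text search, removal of redundant clones within a cluster, and the BlastN E-value screen) can be sketched in Python. This is only an illustrative sketch and not the pipeline actually used in the study: the `Clone` record, its field names, the function names, and the example records are hypothetical, and matching terms at the start of a word is an assumption about how the text search behaved. The search terms and the e-40 cut-off are taken from the text. The TIGR fallback screen and the UniGene build update are omitted.

```python
import re
from dataclasses import dataclass

# Search terms listed in the text.
SEARCH_TERMS = (
    "transcription factor", "leucine zipper", "zinc finger", "ring finger",
    "helix-loop-helix", "PHD", "POU", "forkhead", "bromodomain", "homeobox",
    "oncogene", "nuclear", "activator", "repressor",
)

# Assumption: a term must begin at a word boundary, so that a short term
# such as "POU" does not match inside an unrelated word like "compound".
TERM_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in SEARCH_TERMS) + ")",
    re.IGNORECASE,
)

E_VALUE_CUTOFF = 1e-40  # BlastN cut-off given in the text


@dataclass
class Clone:
    """Hypothetical record for one cDNA in the CD34+ database."""
    accession: str     # GenBank accession number
    cluster: str       # UniGene cluster ID
    descriptor: str    # UniGene text descriptor
    expression: float  # fold over background in CD34+ cells
    evalue: float      # best BlastN E value against the predicted gene


def matches_terms(descriptor: str) -> bool:
    """True if the descriptor contains any of the search terms."""
    return TERM_PATTERN.search(descriptor) is not None


def screen(clones):
    """Apply the text search, redundancy removal, and E-value screen."""
    # Step 1: keep clones whose descriptor contains a search term.
    hits = [c for c in clones if matches_terms(c.descriptor)]

    # Step 2: within each UniGene cluster keep only the clone with the
    # highest expression level.
    best = {}
    for clone in hits:
        current = best.get(clone.cluster)
        if current is None or clone.expression > current.expression:
            best[clone.cluster] = clone

    # Step 3: keep clones whose BlastN E value is below the cut-off.
    return [c for c in best.values() if c.evalue < E_VALUE_CUTOFF]


# Made-up example records (accessions, clusters, and values are invented).
example = [
    Clone("A1", "Hs.1", "zinc finger protein (example)", 12.0, 1e-60),
    Clone("A2", "Hs.1", "zinc finger protein (example)", 30.0, 1e-50),
    Clone("A3", "Hs.2", "ribosomal protein (example)", 200.0, 1e-80),
    Clone("A4", "Hs.3", "forkhead box protein (example)", 5.0, 1e-10),
    Clone("A5", "Hs.4", "compound transporter (example)", 8.0, 1e-70),
]
selected = screen(example)
```

With these invented records, only clone A2 survives: A1 is the lower-expressed duplicate in its cluster, A3 and A5 contain no search term, and A4 fails the E-value screen (in the study it would then have been checked against TIGR).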
The murine stem cell database
A murine stem cell database (http://stemcell.princeton.edu) consisting of expressed genes (devoid of housekeeping genes) in mouse fetal liver stem cells has been reported.18 The GenBank accession numbers of all 161 entries in the “transcription factor” category of this database were selected and used to identify the corresponding murine UniGene cluster for each entry (build version 129); the UniGene annotation was used to identify the human homologue, if available. All human genes that were also present in the CD34+ transcriptosome database were then selected for inclusion in the present study and updated to their corresponding entry in build version 135 of UniGene. GenBank and TIGR homology screens were performed as described above.
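The cross-referencing described above amounts to two lookups followed by a membership test. A minimal sketch in Python follows; the function name, the dictionary-based lookup tables, and all identifiers in the example are hypothetical stand-ins for the UniGene annotation, and the sketch does not reproduce the build update or the homology screens.

```python
def human_homologues_in_cd34(murine_accessions,
                             accession_to_mouse_cluster,
                             mouse_to_human_cluster,
                             cd34_clusters):
    """Return human UniGene clusters that are homologous to the murine
    entries and also present in the CD34+ database (order preserved,
    duplicates removed)."""
    selected = []
    for accession in murine_accessions:
        # GenBank accession -> murine UniGene cluster (may be missing).
        mouse_cluster = accession_to_mouse_cluster.get(accession)
        if mouse_cluster is None:
            continue
        # Murine cluster -> human homologue, if one is annotated.
        human_cluster = mouse_to_human_cluster.get(mouse_cluster)
        if human_cluster is None:
            continue
        # Keep only homologues present in the CD34+ database.
        if human_cluster in cd34_clusters and human_cluster not in selected:
            selected.append(human_cluster)
    return selected


# Invented example data; none of these identifiers are real.
accessions = ["M1", "M2", "M3", "M4"]
acc_to_mouse = {"M1": "Mm.10", "M2": "Mm.20", "M3": "Mm.30"}
mouse_to_human = {"Mm.10": "Hs.100", "Mm.20": "Hs.200"}
cd34 = {"Hs.100", "Hs.300"}

homologues = human_homologues_in_cd34(accessions, acc_to_mouse,
                                      mouse_to_human, cd34)
```

In this toy example only Hs.100 is returned: M2 maps to a human cluster that is absent from the CD34+ set, M3 has no annotated human homologue, and M4 has no murine cluster.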
Results
Selection of genes from the human CD34+ transcriptosome database
The 15 970 genes in the human CD34+ transcriptosome database were searched for cDNAs that encode known TFs, and for those containing motifs that are frequently found in TFs and their interacting proteins. The analysis was based on a text search of the UniGene descriptor of the clones in the CD34+ database, rather than a direct homology search of the clone sequence. UniGene is a database that automatically collects and partitions GenBank and expressed sequence tag (EST) sequences into a nonredundant set of gene-oriented clusters by establishing sequence overlaps; each cluster represents a single potential transcript. Each cluster is annotated with a descriptor of the transcript that is the result of automated searches for sequence homologies to proteins from 8 organisms, using both nucleotide and protein sequence alignment; thus, a fair amount of functional prediction is available for each gene cluster even if it represents an EST sequence that has not been further studied. Each cluster is assigned a chromosomal location, based on sequence alignment. Details of the construction and updating of the UniGene database are available at http://www.ncbi.nlm.nih.gov/UniGene/.
The UniGene cluster descriptors contained in the CD34+ transcriptosome database were searched for terms that are thought likely to annotate TFs, corepressors or coactivators, nuclear factors, and other DNA-interacting proteins. The resulting genes were updated, corrected for redundancy, and verified through homology screens. The database was visually inspected, and an additional 6 genes of known function, which clearly did not contain transcriptional regulatory activity, were removed from the database. A total of 285 genes resulted. Table 1 presents these genes, categorized according to their function or functional motifs, with their UniGene number, chromosomal location, and UniGene descriptor. The cDNAs in each category are presented in order from highest abundance to lowest, based on the measured level of expression in CD34+ cells, as reported in the database.17
Selection of genes from the murine stem cell database
The TF category of the murine hematopoietic stem cell database was analyzed to identify the human homologues of known and novel TFs expressed in human bone marrow CD34+ cells, by cross-referencing the murine and human UniGene databases. The murine UniGene clusters corresponding to each of the 161 TFs listed in the murine database were matched with the human clusters in the UniGene database version 129, resulting in 155 homologous human clusters. A total of 145 human genes remained after updating to UniGene version 135 and removing redundant entries. Of these 145 clusters, 87 were represented in the human CD34+ transcriptosome database, including 30 that had already been identified by our search using text descriptors. These 30 clusters are indicated with an asterisk in Table 1. Analysis of the remaining 57 human genes for homology to their assigned UniGene cluster or to a corresponding TIGR entry, and exclusion of those whose known function was obviously not in the category of a transcriptional regulator, resulted in 45 additional genes/ESTs. These additional 45 genes are listed in Table 2, and each entry includes the murine gene and its presumed human counterpart, its human UniGene cluster ID and descriptors, its chromosomal location, and the level of expression in human CD34+ cells. Of the 58 clusters that are not present in the CD34+ transcriptosome database, 38 were thought to be unexpressed in human CD34+ cells, based on an expression level less than 3-fold over background in the original expression studies, and the remaining 20 could not be evaluated because they had not been included in the original expression studies that resulted in the CD34+ transcriptosome database.
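The counts in this paragraph can be reconciled with simple arithmetic. The short Python sketch below uses only the figures quoted in the text; the variable names are ours, and the derived losses at each step (for example, the number of genes removed by the homology and function screens) are computed here rather than reported in the text.

```python
# Counts quoted in the text.
murine_tf_entries = 161
human_clusters_build_129 = 155
human_clusters_build_135 = 145
in_cd34_database = 87
already_found_by_text_search = 30
additional_genes = 45
unexpressed = 38
not_evaluated = 20

# Derived quantities (computed here, not stated in the text unless noted).
without_human_homologue = murine_tf_entries - human_clusters_build_129
lost_on_update = human_clusters_build_129 - human_clusters_build_135
# Stated in the text as "the remaining 57 human genes".
screened_for_homology = in_cd34_database - already_found_by_text_search
# Stated in the text as "the 58 clusters that are not present".
absent_from_database = human_clusters_build_135 - in_cd34_database
removed_by_screens = screened_for_homology - additional_genes

print(screened_for_homology, absent_from_database, removed_by_screens)
```

The two groups of absent clusters (38 unexpressed and 20 not evaluated) together account for all clusters missing from the CD34+ database.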
Literature analysis of the TF database
After combining the datasets mined from the human and murine databases, the total number of potential TFs or regulatory proteins was determined to be 330. This includes 106 genes that are recognized as TFs and 224 genes in other categories, which include zinc fingers (90 genes), enhancers (14 genes), activators (8 genes), forkhead (11 genes), oncogenes (20 genes), ring finger (16 genes), and the combination of helix-loop-helix, homeobox, leucine zipper, nuclear, PHD, POU, and repressor categories (21 genes). The remaining 44 cDNAs represent genes that are functionally characterized as transcriptional regulators but lacked any of the search terms used in our mining protocol. A literature search of each of these 330 genes was performed to determine what was known about each one, with emphasis on the identification of novel genes. The following convention was used to summarize our search results: K = known gene, well characterized; PC = partially characterized, the gene was reported and some preliminary studies have been performed to indicate its function; N = novel gene, no functional information other than its chromosomal location and sequence homology to a known gene or gene family has been reported. These summaries are given in Tables 1 and 2. As a result of the literature search, 165 (50%) of the 330 transcriptional regulators identified were found to be known genes, 86 (26%) have been partially characterized, and 79 (24%) are novel. The partially characterized and novel transcriptional regulators have been further categorized by their relative level of abundance in CD34+ cells, with 92 expressed at a low level (≥ 3-fold to < 10-fold over background), 27 at an intermediate level (≥ 10-fold to < 25-fold), 28 at a high level (≥ 25-fold to < 100-fold), and 18 at a very high level (≥ 100-fold), using the conventions reported in the CD34+ transcriptosome database.17
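The abundance conventions quoted above translate directly into a threshold function. The following Python sketch is illustrative only: the function name, the label for values below 3-fold, and the example fold values are our own, while the thresholds and the reported counts come from the text.

```python
from collections import Counter


def abundance_class(fold_over_background: float) -> str:
    """Classify relative expression using the thresholds quoted in the text."""
    if fold_over_background >= 100:
        return "very high"
    if fold_over_background >= 25:
        return "high"
    if fold_over_background >= 10:
        return "intermediate"
    if fold_over_background >= 3:
        return "low"
    # Below 3-fold over background is treated as unexpressed.
    return "not expressed"


# Invented fold values, used only to exercise the function.
example_levels = [4.2, 12.0, 30.5, 150.0, 1.5, 99.9]
tally = Counter(abundance_class(level) for level in example_levels)

# Counts reported in the text for the partially characterized and novel genes.
reported = {"low": 92, "intermediate": 27, "high": 28, "very high": 18}
partially_characterized, novel = 86, 79
total_reported = sum(reported.values())
```

The four reported abundance classes sum to 165, which matches the combined number of partially characterized (86) and novel (79) genes.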
In the current study, we emphasized the identification of novel TFs. Based on our literature search of the 106 identified TFs, 81 appear to be well characterized, known genes, whereas 18 have been partially characterized and 7 represent truly novel genes. These 25 partially characterized and novel genes are listed in Table 3 along with details of their presumed function and the supporting literature references.
Discussion
The current report presents our initial attempts to describe the TFs and related regulatory proteins that are present in the human CD34+ transcriptosome. The study is based primarily on the survey of a human CD34+ transcriptosome database, supplemented by homologues identified in a murine stem cell database and cross-referenced to the CD34+ database.
The human CD34+ transcriptosome database was prepared by hybridization of filter arrays, selecting transcripts that are common to both human and baboon bone marrow CD34 antigen–positive cells.17 This database is considered to be an accurate portrayal of the transcriptosome of the CD34+ cell and was estimated to contain 50% to 75% of the transcripts expressed in this tissue. It contains 15 970 genes/ESTs expressed in CD34+ cells and lists their relative level of expression; random sampling of selected transcripts verified (by semiquantitative reverse transcriptase–polymerase chain reaction) that most were expressed at the predicted level.
The murine database (http://stemcell.princeton.edu/) was the result of a cDNA library study, subtracting a stem cell–depleted (AA4.1-neg) cDNA library from a mouse fetal liver hematopoietic stem cell (Sca-pos AA4.1-pos Kit-pos Lin-neg/lo) cDNA library.18 The subtracted library represents genome-wide gene expression in mouse hematopoietic stem cells devoid of housekeeping genes. Sequence information on each of these clones was compared by BLAST against GenBank nonredundant protein and nucleotide databases, the EST database, Swissprot, and mouse and human DOTS contigs. Each clone was categorized according to its sequence homology to genes of known functions, resulting in a “transcription factor” category containing 161 entries.
The study reported here was based on a search of the UniGene text descriptors in the CD34+ transcriptosome database for domains generally present in TFs and their regulatory proteins, whereas the mining of the murine stem cell database relied on homology between the mouse TFs and human genes. The study resulted in the identification of 330 genes that are likely to be transcriptional regulators expressed in human CD34+ cells. Because this transcriptional regulator database was prepared using text descriptors rather than primary sequence analysis, it should only be regarded as a preliminary database survey, limited by the accuracy of the sequence searches compiled by UniGene and by the contents of the databases that were analyzed. Because this study relied heavily on the UniGene database, a considerable number of potential transcriptional regulators might have been missed because of the absence of search terms in the text descriptors, or because the UniGene database does not contain complete cDNA sequences for all human genes. This explains in part why the additional 45 TFs from the murine database were not selected during the CD34+ transcriptosome database analysis.
Despite these limitations, we believe that this gene list will prove to be very useful for further studies of normal and malignant hematopoiesis. One of the most striking features of this list is that many of the genes have been assigned functional roles in numerous other tissues besides bone marrow. Also of note is the identification of 165 partially characterized and novel genes, 18 of which are expressed at a very high level in CD34+ cells, suggesting that they have an important role in this tissue but have not been previously recognized as such. Some of the interesting novel or partially characterized genes include zinc finger protein 161 (ZFP161, Hs.156000), a cell growth regulator protein with a ring finger domain (CGR19, Hs.59106), zinc finger protein 198 (ZNF198, Hs.109526), RB-associated KRAB repressor (RBAK, Hs.7222), death-associated transcription factor 1 (DATF1, Hs.155313), and a p38-interacting protein (P38IP, Hs.171185). The human ZFP161 protein is highly homologous (98%) to ZF5, a putative murine repressor for MYC, with a growth-inhibitory function.19 We anticipate that both ZFP161 and RBAK20 are associated factors for 2 functionally very important proteins, MYC and RB, respectively, and may play important regulatory roles in cellular functions such as proliferation, differentiation, and apoptosis; to our knowledge, these genes have not been previously evaluated in hematopoiesis or leukemia. Another interesting protein is zinc finger protein 198 (ZNF198). This gene has not been functionally characterized, but it is reported to be involved in the t(8;13) translocation,21 resulting in a fusion protein with fibroblast growth factor receptor 1 (FGFR1). Studies of these and other novel genes are underway to ascertain their potential role in cell proliferation, differentiation, and apoptosis in the hematopoietic system.
Detailed studies will be required to verify that each of these genes is indeed expressed in hematopoietic CD34+ cells at the predicted level, to obtain the complete coding sequence for the partial cDNAs/ESTs in the database, and to verify the assigned chromosomal location. We predict that some of these genes may be disrupted by chromosomal translocations, thereby contributing to leukemogenesis. Overall, the database presented here represents a wealth of potential new information to aid in understanding the molecular aspects of normal and malignant hematopoiesis.
Supported by Public Health Service grant P01-75606 (to C.A.W.).
The publication costs of this article were defrayed in part by page charge payment. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 U.S.C. section 1734.
References
Author notes
Carol A. Westbrook, Department of Medicine, Section of Hematology and Oncology, 900 S Ashland Ave, M/C 734, Chicago, IL 60607; e-mail: cwcw@uic.edu.