Blood is a complex fluid that samples all tissues in the human body. Despite complete sequence determination of the human genome, defining genes and gene products remains a challenge. Here, we apply tandem mass spectrometry as a new source of unbiased data to interrogate genomic sequence and identify novel protein-coding sequences. A six-frame translation of the human genome was used as the query database to search for novel blood proteins in the data from the HUPO Plasma Proteome Project (PPP). Significance is assessed using a Poisson statistical model incorporating the length of the matching sequence and the frequency of spectrum matches observed in searching the database [Nat Biotech 2006;24(3):333–8]. Matches are binned by X!Tandem hyperscore, and statistics for each score class are considered independently. The overall probability that the matches to an open reading frame (ORF) occurred at random is calculated as the product of the probabilities that the matches in each score category occurred at random. The expected number of random matches, E, is calculated as the probability that an ORF match occurred at random multiplied by the number of ORFs searched. The confidence in an ORF identification is 1/(1+E). An ORF is considered significant if its confidence is greater than 95%. Expanding recently published work [Genome Biol 2006;7(4):R35], we have identified 837 significant ORFs coding for 18852 peptides falling within 914 exons of 413 genes. Of 8856 candidate ORFs outside the boundaries of known genes (XG ORFs), 3246 achieved a confidence >= 0.95. We also required the XG ORFs to be supported by at least 3 distinct ESTs. Twenty-four of the XG ORFs had a significant alignment to the mouse genome. Of these, 13 of the alignments encompassed a coding region for one of the diagnostic peptides associated with the ORF. Gene models for the XG ORFs were derived from the GENSCAN predictions made for their coding regions. This analysis suggests that alternative splicing of blood protein genes is common and that much remains to be learned about the protein constituents of blood.
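
The confidence calculation described in the abstract can be illustrated with a minimal sketch. It is not the published implementation. The abstract does not state the exact form of the Poisson rate, so the sketch assumes that the rate for each score class is the per-residue frequency of random matches in that class multiplied by the ORF length, and that the per-class probability is the upper tail P(X >= k) for k observed matches. All function and parameter names (`poisson_tail`, `orf_confidence`, `freq_per_bin`, and so on) are hypothetical.

```python
import math
from typing import Mapping


def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam).

    Computed as 1 - P(X <= k - 1); precision is limited for extremely
    small tail probabilities, which is acceptable for a sketch.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def orf_confidence(
    matches_per_bin: Mapping[str, int],
    freq_per_bin: Mapping[str, float],
    orf_length: int,
    n_orfs_searched: int,
) -> float:
    """Confidence 1 / (1 + E) for a single ORF.

    ASSUMPTION: the Poisson rate for a score class is
    (per-residue random match frequency) * (ORF length).
    """
    p_random = 1.0
    for score_bin, k in matches_per_bin.items():
        lam = freq_per_bin[score_bin] * orf_length
        # Score classes are treated as independent, so probabilities multiply.
        p_random *= poisson_tail(k, lam)
    expected_random = p_random * n_orfs_searched  # E in the text
    return 1.0 / (1.0 + expected_random)


def is_significant(confidence: float, threshold: float = 0.95) -> bool:
    """Apply the 95% confidence cutoff used in the text."""
    return confidence > threshold


# Usage with made-up numbers (illustrative only, not data from the study).
example_conf = orf_confidence(
    matches_per_bin={"low": 1, "high": 4},
    freq_per_bin={"low": 1e-3, "high": 1e-5},
    orf_length=300,
    n_orfs_searched=8856,
)
print(example_conf, is_significant(example_conf))
```

Because E scales with the number of ORFs searched, the same set of matches yields a lower confidence when the search database is larger, which is the behaviour a six-frame genome search needs to guard against.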

[This work was supported in part by grants R01LM008106, U54DA021519, P41RR018627, and MTTC6887.]

Disclosure: No relevant conflicts of interest to declare.
