Abstract
The Histone Database (HDB) is an annotated and searchable collection of all full-length sequences and structures of histone and non-histone proteins containing the histone fold motif. These sequences are both eukaryotic and archaeal in origin. Several new histone fold-containing proteins have been identified, including Spt7p, and a few false positives have been removed from the earlier version of HDB. Database contents include compilations of post-translational modifications for each of the core and linker histones, as well as genomic information in the form of map loci for the human histone gene complement, with the genetic loci linked to Online Mendelian Inheritance in Man (OMIM). Conflicts between similar sequence entries from a number of source databases are also documented. Newly added to the HDB are multiple sequence alignments in which predicted functions of histone fold amino acid residues are annotated. The database is freely accessible through the WWW at http://genome.nhgri.nih.gov/histones/
INTRODUCTION
Histone proteins play a primary role in the compaction and accessibility of eukaryotic genomic DNA, and probably archaeal genomic DNA as well (1,2). Two molecules of each of the four core histones (H2A, H2B, H3 and H4) form an octamer around which ~146 bp of DNA are wrapped in repeating units called nucleosomes (3). The main chain and sidechains of the basic octameric histones establish hydrogen and ionic bonds with the negatively charged phosphate backbone of DNA to effect nucleosomal packaging (4). Linker histones (H1 and H5) bind internucleosomal DNA and promote higher-order organization of chromatin (reviewed in 5), a function in which they may be aided by protruding domains of the core histones (4). Core histone sequences have been extraordinarily well conserved across evolution, indicating that there are strict structural constraints on histone function. Nevertheless, there has been enough latitude to allow the evolution of a handful of variant subclasses for each histone. Some of these variants are expressed in a tissue- or developmental stage-specific manner, indicating a specialized function.
A core histone structural motif dubbed the histone fold, consisting of three tandem α-helices connected by two short β-strand regions, is the primary site of histone–histone and histone–DNA binding (6,7). The histone fold also has been identified in a number of eukaryotic non-histone proteins, most of which are involved in functions related to DNA metabolism via protein–protein and protein–DNA interactions (8). These include several TATA-binding protein-associated factors (TAFs), which are components of the TFIID basal transcription complex (reviewed in 9). Histone fold-containing proteins have also been identified in archaea, which show many similarities to eukaryotes in their DNA replication and gene expression machinery (10).
The important roles played across the evolutionary spectrum by histones and histone-like proteins, and the proliferation of such sequences in various public databases, have led us to create and maintain a Web-based resource devoted exclusively to them. The Histone Database (HDB) represents a collection of all histone and histone fold-containing sequences available as of October 1999, with links for each to its GenBank flatfile and, where available, to its entry in a database of solved three-dimensional structures. The site also includes information on post-translational modifications extracted from database annotations for these proteins, human genomic locus information and sequence alignments in which the histone fold amino acid residues are functionally annotated according to the most recent crystallographic studies of the octamer–DNA complex (4,11), as well as being color-coded by physicochemical properties.
DATABASE CONTENTS
The database is divided into 10 subject areas.
(i) Background and summary data, including the primary reference (this paper), the protein databases searched, and a tabulation of the number of sequences, structures and gene loci included in the database.
(ii) A search engine for the database. Selectable search parameters include protein type, sequence set, organism, definition line keyword or sequence pattern.
(iii) All of the eukaryotic histone protein sequences in the database in FASTA format.
(iv) A non-redundant set of the same sequences in FASTA format.
(v) All of the archaeal and non-histone protein sequences in the database, in FASTA format. Both the complete sequences and the histone fold regions alone are available for downloading as FASTA libraries.
(vi) Multiple protein sequence alignments of the full-length core and linker histones, and of the histone fold regions of archaeal and non-histone proteins, performed using CLUSTALW (12) and rendered in downloadable PostScript format. Histone fold residues are color-coded by physicochemical criteria (e.g. polarity, acidity or basicity), and annotated for protein–protein and protein–DNA binding functionality.
(vii) A table of histone and histone fold protein structures available in three-dimensional structure databases. For each structure accession number, links to the Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) and the Molecular Modelling Database (MMDB; http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.html ) are provided, along with the protein name and source organism. We also provide a link to a molecular structure viewer (Cn3D; http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.html ).
(viii) A summary of post-translational modifications to histone proteins, rendered as multiple alignments with the modifications color-coded by type.
(ix) A graphical view of the seven human chromosomes (I, IV, VII, XI, XVII, XXII) which contain the human histone gene complement. Color-coded histone locus markers on each chromosome are linked to a table providing the histone name, the OMIM (Online Mendelian Inheritance in Man) accession number and chromosome map location.
(x) A list of discrepancies between multiple entries of the same sequence in primary databases such as GenBank.
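As an illustration of how the downloadable FASTA libraries in (iii)–(v) might be handled locally, the following minimal Python sketch parses FASTA text, derives a non-redundant set and filters records by definition line keyword or sequence pattern, loosely mirroring the search parameters in (ii). It is not the code behind the HDB: the function names are hypothetical, redundancy is assumed to mean exact sequence identity only, and the three records are toy data invented for the example.

```python
import re
from typing import List, Tuple

Record = Tuple[str, str]  # (definition line without '>', sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (defline, sequence) pairs."""
    records: List[Record] = []
    defline = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if defline is not None:
                records.append((defline, "".join(chunks)))
            defline, chunks = line[1:], []
        elif defline is not None:
            chunks.append(line.upper())
    if defline is not None:
        records.append((defline, "".join(chunks)))
    return records


def non_redundant(records: List[Record]) -> List[Record]:
    """Keep the first record seen for each distinct sequence.

    Assumption: 'redundant' means an exact sequence match; the criterion
    actually used by the database is not stated in the text.
    """
    seen = set()
    unique: List[Record] = []
    for defline, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((defline, seq))
    return unique


def search(records: List[Record], keyword: str = "", pattern: str = "") -> List[Record]:
    """Filter by case-insensitive defline keyword and/or a regex on the sequence."""
    regex = re.compile(pattern) if pattern else None
    hits: List[Record] = []
    for defline, seq in records:
        if keyword and keyword.lower() not in defline.lower():
            continue
        if regex is not None and not regex.search(seq):
            continue
        hits.append((defline, seq))
    return hits


# Toy records for illustration only; they are not entries from the database.
EXAMPLE = """>toy_H4_a hypothetical histone H4 fragment
SGRGKGGKGLGKGGAKRHRK
>toy_H4_b duplicate of the fragment above
SGRGKGGKGLGKGGAKRHRK
>toy_other hypothetical unrelated fragment
MAAAKKTTLL
"""

if __name__ == "__main__":
    all_records = parse_fasta(EXAMPLE)
    for defline, seq in search(non_redundant(all_records), keyword="histone"):
        print(defline, len(seq))
```

A sequence pattern is treated here as a Python regular expression, so a query such as `GK.GK` matches any residue at the third position.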
DETECTION OF NOVEL HISTONE FOLD PROTEINS AND IMPLICATIONS FOR CHROMATIN ORGANIZATION
Using more modern profile searching methods, we have extended and revised the collection of histone fold-containing sequences. The use of the powerful profile search method PSI-BLAST (13) (http://www.ncbi.nlm.nih.gov/BLAST ) vastly improves the ability to detect subtle members of the histone fold superfamily. Using the minimal histone fold represented by the histones from archaea as seeds for PSI-BLAST searches (inclusion threshold e-value 0.01; searches run to convergence), profiles were prepared for the detection of new histone fold proteins. These profiles were then used to search individual complete genomes of yeast and Caenorhabditis elegans. The newly found members were used in subsequent searches to detect homologs in other eukaryotes. Further, a hidden Markov model was made from the alignment of the originally identified histone fold domains using the HMMER2 package (14) (http://hmmer.wustl.edu/ ) and was used to search the protein sequences from different eukaryotes. This confirmed most of the findings obtained from the PSI-BLAST analysis. As a result, several previously unrecognized histone fold proteins were identified (Fig. 1). These proteins include the Spt7p and YPL011c proteins from Saccharomyces cerevisiae, Bip2 (POZ domain protein Bric-a-Brac binding protein 2) and Prodos from Drosophila melanogaster and several related proteins from Arabidopsis thaliana and other animals (Fig. 1). One notable feature was the co-occurrence of the histone fold with other modules in some of these polypeptides (Fig. 2). This modular organization had previously been observed only in Son of Sevenless (8), macroH2A (15) and a C.elegans gene product predicted to be a chromosomal protein (16).
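The control flow of an iterative search run to convergence under an inclusion threshold can be sketched schematically in Python. This is not PSI-BLAST and builds no position-specific profile: the `search` callable, the sequence identifiers and the e-values in `toy_search` are all hypothetical stand-ins, and the current member set merely stands in for the profile. The sketch shows only the loop in which hits at or below the threshold are added and the search is repeated until no new members appear.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Set

# Hypothetical interface: given the current members (standing in for a
# profile), return {sequence_id: e_value} for the database hits.
SearchFn = Callable[[FrozenSet[str]], Dict[str, float]]


def iterate_to_convergence(
    seeds: Iterable[str],
    search: SearchFn,
    inclusion_evalue: float = 0.01,
    max_rounds: int = 50,
) -> Set[str]:
    """Repeat the search, adding hits at or below the threshold, until none are new."""
    members: Set[str] = set(seeds)
    for _ in range(max_rounds):
        hits = search(frozenset(members))
        new = {sid for sid, e in hits.items() if e <= inclusion_evalue} - members
        if not new:  # converged: the last round added nothing
            break
        members |= new
    return members


def toy_search(members: FrozenSet[str]) -> Dict[str, float]:
    """Made-up hit table: which sequences each member 'detects', and how well."""
    table = {
        "archaeal_seed": {"hit_A": 1e-5, "hit_B": 0.5},
        "hit_A": {"hit_B": 1e-3, "hit_C": 2.0},
        "hit_B": {"hit_D": 0.009},
    }
    best: Dict[str, float] = {}
    for member in members:
        for sid, e in table.get(member, {}).items():
            best[sid] = min(e, best.get(sid, e))  # keep the best e-value per hit
    return best


if __name__ == "__main__":
    print(sorted(iterate_to_convergence({"archaeal_seed"}, toy_search)))
```

In the toy table, `hit_B` is too weak to be included from the seed alone but passes the threshold once `hit_A` has joined the member set, which is the behaviour that makes iterative searches more sensitive than a single pass.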
Figure 1.
Multiple alignment of selected histone fold domains. The previously identified histone fold proteins form the top panel and the newly identified proteins are in the lower panel (see brackets on right). The helices shown above indicate the secondary structural elements of the TAFII42/60 dimer (PDB:1TAF). The multiple alignment was constructed using PSI-BLAST (–m 4 option) and adjusted using the 1TAF structure and PHD predictions. Protein sequences are labeled with the protein name to the left followed by species abbreviation and GenBank identifier number. The coloring has been carried out using the 90% consensus, with the key at the bottom of the figure corresponding to the following code: h, hydrophobic residues (YFWLIVAM); l, aliphatic residues shaded yellow (LIVAM); s, small residues colored green (SAGTVPNHD); p, polar residues colored purple (STQNEDRKH); b, bulky residues shaded grey (KREQWFYLMI); and +/–, charged residues shaded violet (KRHED). The species abbreviations are: At, A.thaliana; Ce, C.elegans; Dm, D.melanogaster; Rn, Rattus norvegicus; Hs, Homo sapiens; Sp, Schizosaccharomyces pombe; Sc, S.cerevisiae; Ph, Pyrococcus horikoshii; Mta, Methanobacterium thermoautotrophicum.
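The 90% consensus used for the coloring can be illustrated with a short Python sketch that applies the residue classes listed in the legend to each alignment column. It is a simplified reading of the legend, not the program used to prepare the figure: gaps are assumed to count against every class, every qualifying class is reported rather than a single preferred one, the code `c` for charged residues is a label chosen for this sketch, and the four-row alignment in the usage lines is toy data.

```python
from typing import Dict, List, Set

# Residue classes as given in the figure legend. The key "c" (charged) is a
# label chosen for this sketch, not the symbol used in the figure key.
CLASSES: Dict[str, Set[str]] = {
    "h": set("YFWLIVAM"),    # hydrophobic
    "l": set("LIVAM"),       # aliphatic
    "s": set("SAGTVPNHD"),   # small
    "p": set("STQNEDRKH"),   # polar
    "b": set("KREQWFYLMI"),  # bulky
    "c": set("KRHED"),       # charged
}


def column_consensus(alignment: List[str], threshold: float = 0.9) -> List[List[str]]:
    """For each column, list the classes met by at least `threshold` of the rows.

    Gap characters belong to no class, so they lower the fraction for
    every class (an assumption of this sketch).
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_rows = len(alignment)
    consensus: List[List[str]] = []
    for col in range(width):
        residues = [row[col].upper() for row in alignment]
        qualifying = [
            code
            for code, members in CLASSES.items()
            if sum(r in members for r in residues) / n_rows >= threshold
        ]
        consensus.append(qualifying)
    return consensus


if __name__ == "__main__":
    toy_alignment = ["LKS", "IRA", "VKG", "AE-"]  # invented rows, three columns
    for index, classes in enumerate(column_consensus(toy_alignment), start=1):
        print(index, classes)
```

Because the classes overlap (for example, every aliphatic residue is also hydrophobic), a column can satisfy several classes at once; a rendering program would need an additional rule to choose which color to display.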
Figure 2.
Schematic showing the multidomain proteins that contain the histone fold domain. The detection of the alternative domains was carried out using hidden Markov models for the individual domains with the HMMER2 package. The alignments for many of these domains are available from the SMART database (23). The domain acronyms are: HF, histone fold; MACRO, an ancient conserved domain found fused to the histone H2A domain in MacroH2A; PH, pleckstrin homology domain; RhoGEF and RasGEF, GDP exchange factor domains for Rho and Ras GTPases; REM, conserved domain associated with RasGEF; PHD, plant homeodomain zinc finger domain; A, ankyrin repeat; POZ, Pox virus zinc finger domain; yellow triangle, AT hook; BROMO, bromodomain.
Analysis of these additional modules provides further evidence for a potential role of the histone fold proteins in the organization of chromatin structure. As shown in Figure 2, the histone fold is combined with a variety of other protein domain modules [e.g. PHD fingers (17), AT hooks (16) involved in DNA binding, the bromodomain that binds acetyl-lysine containing peptides (18,19) and the POZ domains that mediate homophilic interactions in transcription factors and chromosomal proteins (20)]. Functional clues for such multidomain histone fold proteins can be derived from Spt7p in S.cerevisiae (Figs 1 and 2), which is part of a multiprotein chromatin remodelling complex named SAGA (21). The SAGA complex possesses Gcn5p-dependent histone acetylase activity and contains another histone fold protein, Spt3p (21). Mutational analysis has shown that the SPT7 gene is central for the integrity and function of this complex (22). The multiple domains suggest that Spt7p acts as an adaptor, forming a nucleosome-like structure (probably in collaboration with the Spt3p family of proteins) using its histone fold while binding to acetylated peptides through the bromodomain. This nucleosome-like structure could provide the basis for the association of the SAGA complex with chromatin. Similar alternate nucleosome-like structures could be formed by other multidomain proteins such as CCA3, Bip2 and C11G6.1 from C.elegans, which could additionally bind DNA with their alternative DNA binding domains or could organize protein complexes with the POZ and ankyrin repeats (Fig. 2). Thus, the identification of novel histone fold proteins provides a tool to further investigate the role of alternative nucleosome-like structures in the assembly of chromosomal protein complexes.
DATABASE AVAILABILITY
The HDB is available on the WWW at http://genome.nhgri.nih.gov/histones/ . Studies utilizing this database should cite this paper as the primary reference.
REFERENCES
1. Kornberg,R.D. and Thomas,J.O. (1974) Science, 184, 865–868.
2. Pereira,S.L., Grayling,R.A., Lurz,R. and Reeve,J.N. (1997) Proc. Natl Acad. Sci. USA, 94, 12633–12637.
3. Thomas,J.O. and Kornberg,R.D. (1975) Proc. Natl Acad. Sci. USA, 72, 2626–2630.
4. Luger,K., Mader,A.W., Richmond,R.K., Sargent,D.F. and Richmond,T.J. (1997) Nature, 389, 251–260.
5. Ramakrishnan,V. (1997) Crit. Rev. Eukaryot. Gene Exp., 7, 215–230.
6. Arents,G., Burlingame,R.W., Wang,B.C., Love,W.E. and Moudrianakis,E.N. (1991) Proc. Natl Acad. Sci. USA, 88, 10148–10152.
7. Arents,G. and Moudrianakis,E.N. (1995) Proc. Natl Acad. Sci. USA, 92, 11170–11174.
8. Baxevanis,A.D., Arents,G., Moudrianakis,E.N. and Landsman,D. (1995) Nucleic Acids Res., 23, 2685–2691.
9. Burley,S.K. and Roeder,R.G. (1996) Annu. Rev. Biochem., 65, 769–799.
10. Makarova,K.S., Aravind,L., Galperin,M.Y., Grishin,N.V., Tatusov,R.L., Wolf,Y.I. and Koonin,E.V. (1999) Genome Res., 9, 608–628.
11. Luger,K. and Richmond,T.J. (1998) Curr. Opin. Struct. Biol., 8, 33–40.
12. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.
13. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.
14. Eddy,S.R. (1998) Bioinformatics, 14, 755–763.
15. Pehrson,J.R. and Fried,V.A. (1992) Science, 257, 1398–1400.
16. Aravind,L. and Landsman,D. (1998) Nucleic Acids Res., 26, 4413–4421.
17. Aasland,R., Gibson,T.J. and Stewart,A.F. (1995) Trends Biochem. Sci., 20, 56–59.
18. Haynes,S.R., Dollard,C., Winston,F., Beck,S., Trowsdale,J. and Dawid,I.B. (1992) Nucleic Acids Res., 20, 2603.
19. Dhalluin,C., Carlson,J.E., Zeng,L., He,C., Aggarwal,A.K. and Zhou,M.M. (1999) Nature, 399, 491–496.
20. Aravind,L. and Koonin,E.V. (1999) J. Mol. Biol., 285, 1353–1361.
21. Grant,P.A., Duggan,L., Cote,J., Roberts,S.M., Brownell,J.E., Candau,R., Ohba,R., Owen-Hughes,T., Allis,C.D., Winston,F., Berger,S.L. and Workman,J.L. (1997) Genes Dev., 11, 1640–1650.
22. Sterner,D.E., Grant,P.A., Roberts,S.M., Duggan,L.J., Belotserkovskaya,R., Pacella,L.A., Winston,F., Workman,J.L. and Berger,S.L. (1999) Mol. Cell. Biol., 19, 86–98.
23. Ponting,C.P., Schultz,J., Milpetz,F. and Bork,P. (1999) Nucleic Acids Res., 27, 229–232. Updated article in this issue: Nucleic Acids Res. (2000) 28, 231–234.


