CATH

http://www.biochem.ucl.ac.uk/bsm/cath_new

Pearl, F.M.¹, Bennett, C.F.¹, Bray, J.E.¹, Harrison, A.P.¹, Martin, N.², Shepherd, A.¹, Sillitoe, I.¹, Thornton, J.³, Orengo, C.A.¹

¹Biochemistry and Molecular Biology Department, University College London, University of London, Gower Street, London WC1E 6BT
²Department of Computer Science, Birkbeck College, University of London, Malet Street, London WC1E 7HX
³EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD

Database Description

The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath_new) currently contains 34,287 domain structures classified into 1,383 superfamilies and 3,285 sequence families. Each structural family is expanded with domain sequence relatives recruited from GenBank using a variety of efficient sequence search protocols and reliable thresholds. This extended resource, known as the CATH protein family database (CATH-PFDB), contains a total of 310,000 domain sequences classified into 26,812 sequence families. New sequence search protocols, based on these intermediate sequence libraries, have been designed to allow more regular updating of the classification. Further developments include the adaptation of GRATH (1), a recently developed method for rapid structure comparison based on secondary structure matching, to domain boundary assignment (CATHEDRAL (2)). The philosophy behind CATHEDRAL is the recognition of recurrent folds already classified in CATH. Benchmarking of CATHEDRAL against manually validated domain assignments demonstrated that 43% of domain boundaries could be assigned completely automatically. This is an improvement on a previous consensus approach, for which only 10-20% of domains could be reliably processed in a completely automated fashion. Since domain boundary assignment is a significant bottleneck in the classification of new structures, CATHEDRAL will also help to increase the frequency of CATH updates.
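To put the counts quoted above in proportion, the following minimal Python sketch derives a few summary ratios from them. The variable names and the `ratio` helper are illustrative assumptions and not part of any CATH software; the 310,000 figure is treated as the approximate total it appears to be, so the ratios that depend on it are indicative only.

```python
# Counts quoted in the database description (names are illustrative).
STRUCTURAL_DOMAINS = 34_287        # domain structures in CATH
SUPERFAMILIES = 1_383              # superfamilies
STRUCTURAL_SEQ_FAMILIES = 3_285    # sequence families of structural domains
PFDB_DOMAIN_SEQUENCES = 310_000    # domain sequences in CATH-PFDB (approximate)
PFDB_SEQ_FAMILIES = 26_812         # sequence families in CATH-PFDB


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator rounded to one decimal place."""
    return round(numerator / denominator, 1)


summary = {
    # Average number of domain structures per superfamily.
    "domains_per_superfamily": ratio(STRUCTURAL_DOMAINS, SUPERFAMILIES),
    # Average number of domain structures per sequence family.
    "domains_per_sequence_family": ratio(STRUCTURAL_DOMAINS, STRUCTURAL_SEQ_FAMILIES),
    # Domain sequences in CATH-PFDB per domain structure in CATH.
    "sequence_expansion_factor": ratio(PFDB_DOMAIN_SEQUENCES, STRUCTURAL_DOMAINS),
    # Sequence families in CATH-PFDB per structural sequence family.
    "family_expansion_factor": ratio(PFDB_SEQ_FAMILIES, STRUCTURAL_SEQ_FAMILIES),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

On these figures, recruiting sequence relatives increases the number of classified domains roughly ninefold and the number of sequence families roughly eightfold.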

Recent Developments

New sequence search protocols, based on the intermediate sequence libraries in CATH-PFDB, have been designed to allow more regular updating of the classification. Further developments include the adaptation of GRATH, a recently developed method for rapid structure comparison based on secondary structure matching, to domain boundary assignment (CATHEDRAL). The philosophy behind CATHEDRAL is the recognition of recurrent folds already classified in CATH. Benchmarking of CATHEDRAL against manually validated domain assignments demonstrated that 43% of domain boundaries could be assigned completely automatically.

Acknowledgements

Frances Pearl, Andrew Harrison, David Lee, Ian Sillitoe and Christine Orengo acknowledge the Medical Research Council for their funding. James Bray is currently supported by funding from the NIH. Adrian Shepherd acknowledges support from the Biotechnology and Biological Sciences Research Council, and Chris Bennett acknowledges support from the Wellcome Trust for research described in this manuscript.

REFERENCES

1. Harrison,A., Pearl,F., Sillitoe,I., Slidel,T., Mott,R., Thornton,J. and Orengo,C. (2002) A fast method for reliably recognising the fold of a protein structure. Submitted to Bioinformatics.
2. Harrison,A., Pearl,F., Sillitoe,I., Thornton,J. and Orengo,C. (2002) CATHEDRAL: an effective algorithm to delineate previously seen folds within a multi-domain structure. In preparation.
3. Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) A new approach to protein fold recognition. Nature 358, 86-89.
4. Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA 95, 6073-6078.
5. Pearl,F., Todd,A.E., Bray,J.E., Martin,A.C., Salamov,A.A., Suwa,M., Swindells,M.B., Thornton,J.M. and Orengo,C.A. (2000) Using the CATH domain database to assign structures and functions to the genome sequences. Biochem. Soc. Trans. 28, 269-75.
6. Altschul,S.F., Bundschuh,R., Olsen,R. and Hwa,T. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res. 29, 351-361.
7. Karplus,K., Sjolander,K., Barrett,C., Cline,C., Haussler,D., Hughey,R., Holm,L. and Sander,C. (1997) Predicting protein structure using hidden Markov models. Proteins Suppl. 1, 134-139.
8. Karplus,K., Barrett,C. and Hughey,R. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846-56.
9. Pearl,F.M., Martin,N., Bray,J., Buchan,D.W.A., Harrison,A.P., Lee,D., Reeves,G.A., Shepherd,A.J., Sillitoe,I., Todd,A.E., Thornton,J.M. and Orengo,C.A. (2001) A rapid classification protocol for the CATH Domain Database to support structural genomics. Nucleic Acids Res. 29, 223-227.
10. Pearl,F.M., Lee,D., Bray,J.E., Buchan,D.W., Shepherd,A.J. and Orengo,C.A. (2002) The CATH extended protein-family database: providing structural annotations for genome sequences. Protein Sci. 11, 233-44.
11. Orengo,C., Michie,A., Jones,S., Jones,D., Swindells,M. and Thornton,J. (1997) CATH--a hierarchic classification of protein domain structures. Structure 5, 1093-108.
12. Pearl,F.M., Martin,N., Bray,J.E., Buchan,D.W., Harrison,A.P., Lee,D., Reeves,G.A., Shepherd,A.J., Sillitoe,I., Todd,A.E., Thornton,J.M. and Orengo,C.A. (2001) A rapid classification protocol for the CATH Domain Database to support structural genomics. Nucleic Acids Res. 29, 223-227.
13. Taylor,W.R. and Orengo,C.A. (1989) Protein structure alignment. J. Mol. Biol. 208, 1-22.
14. Mitchell,E.M., Artymiuk,P.J., Rice,D.W. and Willett,P. (1990) Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol. 212, 151-66.
15. Harrison,A., Pearl,F., Mott,R., Thornton,J. and Orengo,C. (2002) Quantifying the similarities within fold space. J. Mol. Biol., in press.
16. Holm,L. and Sander,C. (1994) Parser for protein folding units. Proteins 19, 256-268.
17. Holm,L. and Sander,C. (1999) Protein folds and families: sequence and structure alignments. Nucleic Acids Res. 27, 244-7.
18. Orengo,C.A., Sillitoe,I., Reeves,G. and Pearl,F.M. (2001) Review: what can structural classifications reveal about protein evolution? J. Struct. Biol. 134, 145-65.
19. Jones,S., Stewart,M., Michie,A., Swindells,M.B., Orengo,C. and Thornton,J.M. (1998) Domain assignment for protein structures using a consensus approach: characterization and analysis. Protein Sci. 7, 233-42.
20. Todd,A., Orengo,C. and Thornton,J. (2001) Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307, 1113-1143.
21. Shepherd,A.L., Martin,N., Johnson,R.G., Kellam,P. and Orengo,C.A. (2002) PFDB: A generic protein family database integrating the CATH domain structure database with sequence based protein family resources. Bioinformatics, in press.

Category Structure

Go to the abstract in the NAR 2003 Database Issue.