Nucleic Acids Res. OUP

The Molecular Biology Database Collection: 2003 update

Andreas D. Baxevanis*

Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Building 50, Room 5222, Bethesda, MD 20892-8002, USA

The Molecular Biology Database Collection is an online resource listing key databases of value to the biological community. This Collection is intended to bring fellow scientists' attention to high-quality databases that are available throughout the world, rather than just be a lengthy listing of all available databases. As such, this up-to-date listing is intended to serve as the jumping-off point from which to find specialized databases that may be of use in advancing biological research. The databases included in this Collection provide new value to the underlying data by virtue of curation, new data connections or other innovative approaches. Short, searchable summaries and updates for each of the databases included in this Collection are available through the Nucleic Acids Research Web site at http://nar.oupjournals.org.

The biological community will mark the completion of the Human Genome Project's major goal in April 2003: complete, high-accuracy sequencing of the human genome (1). This remarkable achievement, often compared to landing a man on the moon, lays the groundwork for a fundamental shift in how biological and biomedical research will be performed in the future. The free, widespread availability of a wide variety of data beyond human genome sequence - sequence variation data, model organism sequence data, expression data and proteomic data, to name a few - will provide a fertile playground for biologists in all disciplines to better design and interpret their laboratory and clinical experiments, hopefully accelerating the pace of biological discovery.

Even though human sequencing is not yet 'complete' as a whole, sequencing has been completed on six human chromosomes as of the time of this writing (6, 7, 20, 21, 22 and Y). Along with the data available from numerous completed model genomes, the major public databases contain a phenomenal amount of sequence data. Currently, GenBank contains >17 billion nucleotide bases, representing >14 million sequences in 100 000 species. While the opportunities that this massive data set presents are mind-boggling, it also presents a problem in that the inexperienced user will either not know how to approach the data space or not know how to make best use of the data available to them. This problem will only continue to compound as GenBank continues its exponential rate of growth, with doubling rates on the order of 14 months or less. With the recent announcement of plans to sequence 'high-priority' model organisms by the National Human Genome Research Institute (NHGRI), it becomes more and more obvious that all biologists will need to avail themselves of the basic tools with which to navigate this large 'sequence space', as well as specialized databases that provide potentially easier access to subsets of the data.
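
To give a rough sense of what a 14-month doubling time implies, the following minimal sketch projects the size of the database forward under the assumption that growth stays strictly exponential at a constant rate. The function name `projected_bases` and the chosen time horizons are illustrative assumptions, not part of any GenBank tool; the starting value is the >17 billion bases quoted above, treated here as exactly 17 billion.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 14.0) -> float:
    """Project database size assuming constant exponential growth.

    Hypothetical helper for illustration: size(t) = size(0) * 2 ** (t / T),
    where T is the doubling time in months.
    """
    return current_bases * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    current = 17e9  # lower bound quoted in the text (>17 billion bases)
    for months in (14, 28, 60):
        size = projected_bases(current, months)
        print(f"{months:>3} months ahead: about {size / 1e9:.0f} billion bases")
```

Under these assumptions the holdings double after 14 months and quadruple after 28; if the doubling time is in fact shorter, as the text allows, the projections would be larger still.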

Despite the large amount of publicity surrounding the Human Genome Project, a recent survey conducted on behalf of the Wellcome Trust indicates that only half of biomedical researchers using genome databases are familiar with the tools that can be used to actually access the data. For example, only 11% of those surveyed used the European Bioinformatics Institute's Ensembl Web site regularly, with 24% using it occasionally. Half of the remaining users had never even heard of Ensembl or its Web site. This low level of usage has led the Wellcome Trust to establish an advertising campaign aimed at increasing the public awareness of the availability of free tools such as Ensembl for searching human genome sequence data. Anecdotally, there is a similar lack of awareness of or familiarity with the tools available through the University of California, Santa Cruz (UCSC) and, quite surprisingly, the National Center for Biotechnology Information (NCBI) at the National Institutes of Health, even though many biologists visit the NCBI Web site frequently. In response to this low level of awareness of the tools freely available to biologists, Wolfsberg et al. (2) developed a 'user's guide' to the human genome, intended to provide an elementary, hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium and other systematic sequencing efforts. The guide provides step-by-step instructions and strategies for using many of the most commonly used tools for sequence-based discovery. NCBI, Ensembl and UCSC are all also in the process of developing (or have already released) similar online guides for using the tools available on their respective Web sites.

While educational efforts such as these help to address the need for rational ways to approach mining genomic data, additional efforts in the form of providing curated views of the data in specialized databases have been taking place for many years now. These efforts afford tremendous value to the biological researcher since they, in essence, reduce the massive 'sequence space' to specific, tractable areas of inquiry and, by doing so, allow for the inclusion of many more types of data than are found in the larger data repositories. These databases often provide not just sequence-based information, but additional data such as gene expression, macromolecular interactions, or biological pathway information, data that might not fit neatly onto a large physical map of a genome. Most importantly, data in these smaller, specialized databases tend to be curated by experts in a particular specialty and are often experimentally verified, meaning that they represent the best state of knowledge in that particular area. This journal has devoted its first issue over the last several years to documenting the availability and features of these specialized databases in order to better serve its readership, to promote the use of these resources in the design and analysis of experiments and to encourage the continued development of these resources. These reviewed databases are collectively listed in the Molecular Biology Database Collection.

The databases listed in this Collection distinguish themselves by their approach to presenting the underlying data: by adding new value to the underlying data by virtue of curation, by providing new types of data connections, or by implementing other innovative approaches that facilitate biological discovery. The individual entries are classified by type, but the reader should recognize that the distinctions between these classes are often arbitrary, and that many of these databases provide more than one type of information to the user.

In addition to the list presented in this paper, an electronic version of the Database Issue and Collection can be accessed online and is freely available to everyone, regardless of subscription status, at http://nar.oupjournals.org. While the list contains the databases described in the papers comprising the current issue, it should be immediately apparent to the reader that there are simply not enough pages in this issue to accommodate full-length, printed descriptions of all of the databases making up the Collection. To address this, the online version of the Collection provides short summaries of many of the databases, the summaries having been provided directly by the investigators responsible for the individual databases. Contributors have been asked to point out new features of their databases in the Recent Developments section of their entry. It is hoped that this approach will provide the reader with an additional source of information that will facilitate finding and selecting the sources of data that would be of most value in addressing a specific biological problem. Contributors are encouraged to keep their entries up-to-date.

Suggestions for the inclusion of additional database resources in this Collection are encouraged and may be directed to the author (andy@nhgri.nih.gov).

ACKNOWLEDGEMENTS

I wish to thank Ken Trout for maintaining the online submission Web site for the Collection, as well as for his technical support throughout this project. I also wish to thank Debbie Wilson and Karen Otto for providing logistical support and for their assistance in tracking and processing the manuscripts that appear in this issue.

REFERENCES

1. Collins,F.S., Patrinos,A., Jordan,E., Chakravarti,A., Gesteland,R., Walters,L. and Members of the DOE and NIH Planning Groups (1998) New goals for the US Human Genome Project: 1998-2003. Science, 282, 682-689.

2. Wolfsberg,T.G., Wetterstrand,K.A., Guyer,M.S., Collins,F.S. and Baxevanis,A.D. (2002) A User's Guide to the Human Genome. Nature Genet., 32(suppl.), 1-79.

*To whom correspondence should be addressed. Tel: +1 3014968570; Fax: +1 3014802634; Email: andy@nhgri.nih.gov
