Question 4 A user wishes to find all the single nucleotide polymorphisms that lie between two sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within the coding region of a gene? Where can any additional information about the function of these genes be found?



doi:10.1038/ng969
volume 32 supplement pp 29 - 32

The starting point for this search is the web site for the Database of Single Nucleotide Polymorphisms (dbSNP) at the NCBI¹³, which is located at http://www.ncbi.nlm.nih.gov/SNP. A series of links on the page allows the user to search using either information about the database submission itself or information regarding genes and gene loci.

For this particular search, assume that the region of interest is known and defined by two STS markers, RH70674 and G32133. Begin by scrolling to the section labeled Between Markers at the bottom of the page. Enter the STS marker names 'RH70674' and 'G32133' into the two text boxes, and click on Submit STS Markers. This will produce a display showing SNPs 1–25 out of the total of 81 within the region of interest. Go to page 3 of the display by entering '3' in the Page box and clicking Display.
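The logic behind a Between Markers query and its paged display can be sketched in a few lines of Python. This is only an illustration, not dbSNP's implementation: the SNP identifiers and coordinates are hypothetical, and the page size of 25 is inferred from the 'SNPs 1–25' display described above.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    rs_id: str
    position: int  # hypothetical chromosome coordinate, in bases


def snps_between(snps, left, right):
    """Return the SNPs lying between two marker positions, sorted by position."""
    lo, hi = sorted((left, right))
    return sorted((s for s in snps if lo <= s.position <= hi),
                  key=lambda s: s.position)


def page_bounds(total, page, per_page=25):
    """Return the 1-based (first, last) record numbers shown on a given page."""
    first = (page - 1) * per_page + 1
    if page < 1 or first > total:
        raise ValueError("page out of range")
    return first, min(page * per_page, total)


# Hypothetical identifiers and positions, for illustration only.
snps = [Snp("rs_a", 1_500), Snp("rs_b", 900), Snp("rs_c", 2_400), Snp("rs_d", 3_100)]
hits = snps_between(snps, left=1_000, right=2_500)
print([s.rs_id for s in hits])   # ['rs_a', 'rs_c']
print(page_bounds(81, 3))        # (51, 75)
print(page_bounds(81, 4))        # (76, 81)
```

Under these assumptions, page 3 of an 81-SNP result set covers SNPs 51–75, and a fourth, shorter page holds the remaining six.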

The resulting page (Fig. 4.1) illustrates most of the possible types of result one would find on a typical dbSNP results page. In the table, starting from the left, the first column gives the individual dbSNP cluster IDs (all starting with 'rs'). The second column, labeled Map, shows whether a particular SNP has been mapped to a unique position in the genome (illustrated by a single green arrow, as in the first row of the example) or to multiple positions (not shown here).

The next set of columns, labeled Gene, indicates whether these SNPs are associated with particular features, such as genes, mRNAs or coding regions. The three columns (L, T and C) are either lit up or appear gray in every row. Taking each in order:

If the L (for locus) appears in blue, part or all of the marker position lies either within 2 kilobases (kb) of the 5' end of a gene feature or within 500 bases of the 3' end of a gene feature.

If the T (for transcript) appears in green, part or all of the marker position overlaps with a known mRNA. This does not mean, however, that the SNP marker necessarily falls within a coding region.

If the C (for coding) appears in orange, part or all of the marker position overlaps with a coding region.
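The three rules above amount to a set of interval tests, which the following Python sketch makes explicit. It rests on several assumptions that are not part of dbSNP itself: the `Gene` class and its fields are hypothetical, the L rule is interpreted as the gene span extended by 2 kb on its 5' side and 500 bases on its 3' side, and 'overlaps with a known mRNA' is modeled as falling within an exon.

```python
from dataclasses import dataclass, field

UPSTREAM = 2_000   # bases flanking the 5' end of the gene (from the text)
DOWNSTREAM = 500   # bases flanking the 3' end of the gene (from the text)


@dataclass
class Gene:
    """Hypothetical gene model; coordinates are 1-based and inclusive."""
    start: int                                   # leftmost genomic coordinate
    end: int                                     # rightmost genomic coordinate
    strand: str                                  # '+' or '-'
    exons: list = field(default_factory=list)    # (start, end) of mRNA exons
    cds: list = field(default_factory=list)      # (start, end) of coding segments


def _within(pos, intervals):
    """True if pos falls inside any (start, end) interval."""
    return any(a <= pos <= b for a, b in intervals)


def gene_flags(pos, gene):
    """Return which of the L, T and C indicators would be lit for a SNP position."""
    if gene.strand == "+":
        lo, hi = gene.start - UPSTREAM, gene.end + DOWNSTREAM
    else:  # on the minus strand the 5' end sits at the higher coordinate
        lo, hi = gene.start - DOWNSTREAM, gene.end + UPSTREAM
    return {
        "L": lo <= pos <= hi,
        "T": _within(pos, gene.exons),
        "C": _within(pos, gene.cds),
    }


# Hypothetical two-exon gene on the forward strand.
gene = Gene(10_000, 20_000, "+",
            exons=[(10_000, 10_500), (19_000, 20_000)],
            cds=[(10_200, 10_500), (19_000, 19_600)])

print(gene_flags(8_500, gene))    # upstream flank: L only
print(gene_flags(10_100, gene))   # 5' untranslated region: L and T, not C
print(gene_flags(19_300, gene))   # coding exon: L, T and C
```

The second call shows the point made above: a SNP can overlap an mRNA (T) without falling within a coding region (C).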

The next column, labeled Het, indicates the average heterozygosity observed for this marker, on a scale of 0–100%. A reading of zero means that no information is available for that particular marker, whereas the pink bars show a 95% confidence interval for the marker. The Validation column indicates whether the marker has been validated (shown by a star) or is unvalidated (shown by light blue boxes). Validated markers have been verified by independent re-analysis of the sequence. All of the unvalidated markers shown in Fig. 4.1 are denoted by three blue boxes, which, according to the scale at the top of the column, means that there is a >95% success rate in validation. This figure indicates the probability that this marker is real. (The success rate is defined as 1 – false-positive rate.)

In the penultimate column, the symbol TT (not shown here) indicates that individual genotypes are available for this marker. Finally, the Linkout Avail column indicates which markers are linked to other databases; a P in this column indicates that the variation has been mapped to a known protein structure. For a complete description of all the features within this display, click on any part of the header above the columns.

Returning to the original question, one of the SNPs displayed on this page does indeed fall within a coding region, as indicated by an orange C. To obtain more information on any particular SNP, simply click on the hyperlinked SNP Cluster ID. Clicking on rs1059133, for example, produces a new page with all available information on that SNP (Fig. 4.2). Under the header marked Submitter records for this RefSNP Cluster is a list of the individual SNPs (in this case, only one SNP) that have been clustered together to form this single reference SNP. The sequence of the SNP is shown under the next header. Under the header marked NCBI Resource Links are the GenBank and NCBI RefSeq entries that are associated with this SNP. Scrolling further down on the SNP page (Fig. 4.3), the gene within whose coding region this SNP falls is indicated in the LocusLink Analysis section (ADAM2, a disintegrin and metalloproteinase domain 2). The SNP allele is G/C, a non-synonymous change leading to replacement of the Asp residue in the reference sequence by a His residue. Links are also provided to the NCBI Map Viewer, Ensembl map and UCSC genome assembly in the section labeled Integrated Maps. The sections labeled Variation Summary and Validation Summary (not shown) give the raw data on this particular SNP.
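To see why a G/C allele can produce an Asp-to-His replacement, recall that in the standard genetic code Asp is encoded by GAT and GAC, and His by CAT and CAC, so a G-to-C change at the first codon position converts one into the other. The Python sketch below works through this; the actual codon in ADAM2 is not given here, so GAT is an assumed example, and only the four relevant codons of the genetic code are included.

```python
# Partial standard genetic code: only the codons needed for this example.
CODONS = {"GAT": "Asp", "GAC": "Asp", "CAT": "His", "CAC": "His"}


def classify_substitution(codon, offset, new_base):
    """Apply a single-base change to a codon and classify its effect.

    offset is the 0-based position within the codon (0, 1 or 2).
    """
    if not 0 <= offset < 3:
        raise ValueError("offset must be 0, 1 or 2")
    mutant = codon[:offset] + new_base + codon[offset + 1:]
    ref_aa, alt_aa = CODONS[codon], CODONS[mutant]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return mutant, ref_aa, alt_aa, kind


# Assumed reference codon GAT (Asp); the real ADAM2 codon may be GAC instead.
print(classify_substitution("GAT", 0, "C"))  # ('CAT', 'Asp', 'His', 'non-synonymous')
print(classify_substitution("GAT", 2, "C"))  # ('GAC', 'Asp', 'Asp', 'synonymous')
```

The second call is included for contrast: a change at the third codon position leaves the Asp residue intact, which is why not every coding-region SNP alters the protein.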

Answering the final part of this question requires jumping from dbSNP to LocusLink¹⁰. To do so, click on the ADAM2 link in the line marked LocusLink at the top of the page (Fig. 4.3). This brings the user to the LocusLink page for ADAM2 and provides numerous jumping-off points to the NCBI and affiliated resources through the boxed links at the top of the page. More information on these resources can be found by following the LocusLink FAQ link in the left-hand column of the page. By simply examining the LocusLink page itself, one sees that the ADAM2 protein belongs to a family of membrane-anchored proteins that have been implicated in processes as diverse as fertilization, muscle development and neurogenesis.

One often-overlooked source of information on genes and gene products is OMIM¹⁴. This is an electronic version of the catalog of human genes and genetic disorders developed by Victor McKusick at The Johns Hopkins University. OMIM provides the user with concise textual information from the published literature on most human disorders with a genetic basis, and links back to the primary literature as appropriate. Information comprising an OMIM entry includes the gene symbol, alternate names for the disease, a description of the disease (including clinical, biochemical and cytogenetic features), details of the mode of inheritance (including mapping information) and a clinical synopsis. These entries are manually curated, ensuring that the 'executive summary' is up to date and accurate. Although OMIM can be searched directly, many LocusLink entries also link to the OMIM record for the gene. The OMIM entry page for the ADAM2 protein is shown in Fig. 4.4. The page is fully hyperlinked to PubMed, GenBank and other related databases.

REFERENCES

1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).
2. Collins, F.S. & McKusick, V.A. Implications of the Human Genome Project for medical science. J. Am. Med. Assoc. 285, 540-544 (2001).
3. Watson, J.D. & Crick, F.H.C. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737-738 (1953).
4. Green, E.D. Strategies for the systematic sequencing of complex genomes. Nature Rev. Genet. 2, 573-583 (2001).
5. Ouellette, B.F.F. & Boguski, M.S. Database divisions and homology search files: a guide for the perplexed. Genome Res. 7, 952-955 (1997).
6. Bairoch, A. & Apweiler, R. The SWISS-PROT Protein Sequence Database and its supplement TREMBL in 2000. Nucleic Acids Res. 28, 45-48 (2000).
7. Hubbard, T. et al. The Ensembl Genome Database Project. Nucleic Acids Res. 30, 38-41 (2002).
8. Kent, W.J. BLAT--the BLAST-like Alignment Tool. Genome Res. 12, 656-664 (2002).
9. Stein, L. Genome annotation: from sequence to biology. Nature Rev. Genet. 2, 493-503 (2001).
10. Pruitt, K.D. & Maglott, D.R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137-140 (2001).
11. Burge, C.B. & Karlin, S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346-354 (1998).
12. Schuler, G.D. Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol. 16, 456-459 (1998).
13. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308-311 (2001).
14. Hamosh, A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30, 52-55 (2002).
15. Baxevanis, A.D. & Ouellette, B.F.F. (eds.) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (John Wiley & Sons, New York, 2001).
16. Solovyev, V.V., Salamov, A.A. & Lawrence, C.B. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 367-375 (1995).
17. Yeh, R.F., Lim, L.P. & Burge, C.B. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803-816 (2001).
18. Marchler-Bauer, A. et al. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 30, 281-283 (2002).
19. Apweiler, R. et al. InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16, 1145-1150 (2000).
20. Rebhan, M., Chalifa-Caspi, V., Prilusky, J. & Lancet, D. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14, 656-664 (1998).
21. Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A. & Eppig, J.T. The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res. 30, 113-115 (2002).
22. Hudson, T.J. et al. A radiation hybrid map of mouse genes. Nature Genet. 29, 201-205 (2001).
23. Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 30, 276-280 (2002).
24. Letunic, I. et al. Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res. 30, 242-244 (2002).
25. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997).
26. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, 1998).
27. Peri, S., Ibarrola, N., Blagoev, B., Mann, M. & Pandey, A. Common pitfalls in bioinformatics-based analyses: look before you leap. Trends Genet. 17, 541-545 (2001) [erratum Trends Genet. 18, 218 (2002)].
28. Ponting, C. Issues in predicting protein function from sequence. Brief. Bioinform. 2, 19-29 (2001).
29. Aparicio, S.A.J.R. How to count ... human genes. Nature Genet. 25, 129-130 (2000).
30. Beadle, G.W. & Tatum, E.L. Genetic control of biochemical reactions in Neurospora. Proc. Natl Acad. Sci. USA 27, 499-506 (1941).
31. Jeffery, C.J., Bahnson, B.J., Chien, W., Ringe, D. & Petsko, G.A. Crystal structure of rabbit phosphoglucose isomerase, a glycolytic enzyme that moonlights as neuroleukin, autocrine motility factor, and differentiation mediator. Biochemistry 39, 955-964 (2000).
32. Wistow, G. & Piatigorsky, J. Recruitment of enzymes as lens structural proteins. Science 236, 1554-1556 (1987).
33. Jeffery, C.J. Moonlighting proteins. Trends Biochem. Sci. 24, 8-11 (1999).
34. Chothia, C. Proteins. One thousand families for the molecular biologist. Nature 357, 543-544 (1992).
35. Hegyi, H. & Gerstein, M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147-164 (1999).
36. Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res. 28, 1481-1488 (2000).
37. Brenner, S.E. Errors in genome annotation. Trends Genet. 15, 132-133 (1999).
38. Smith, R.F. Perspectives: sequence data base searching in the era of large-scale genomic sequencing. Genome Res. 6, 653-660 (1996).