Question 6 How would one retrieve the sequence of a gene, along with all annotated exons and introns, as well as a certain number of flanking bases for use in primer design?



doi:10.1038/ng971
volume 32 supplement pp 40 - 43

This type of search can be initiated at the UCSC Genome Browser home page, located at http://genome.ucsc.edu. Select Human from the pull-down menu labeled Organism, and then click on Browser. This brings the user to the Human Genome Browser Gateway, from which a number of text- or position-based searches can be performed on current or older versions of the genome assembly. In this case, select the Dec. 2001 assembly, type the name of the gene of interest (PTPN1) into the position box, and then click Submit. The Browser returns all genes starting with the characters 'PTPN1' (Fig. 6.1). The gene of interest here is the one called PTPN1; click on the hyperlinked PTPN1 (arrow, Fig. 6.1) to view the genomic context of this gene (Fig. 6.2).

The text box at the top of Fig. 6.2 gives the absolute base pair position of this gene (chromosome 20, positions 48929540–49003636) and indicates that the gene spans 74 kb. The track labeled Chromosome Bands shows that PTPN1 is located at 20q13.13. Finally, the track marked Known Genes shows that the gene is on the forward strand, as the arrows on that track are pointing to the right. The exons within this gene are indicated by the vertical lines in the Known Genes track.

One way to obtain sequence upstream of a gene is described in Question 7. Here we explain how to retrieve flanking sequence on both sides of a gene. To retrieve enough sequence with which to design primers, one can enlarge the displayed region by changing the position numbers within the position box at the top of the figure. To add 1,000 nt at the 5' end and 200 nt at the 3' end, for example, change the text in the position box to 'chr20:48928540-49003836' and click Jump. The graphic is then redrawn with the new boundaries.
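The boundary arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `add_flanks` and its parameters are hypothetical and are not part of the Genome Browser; it assumes a position string of the form 'chrN:start-end' and that, for a forward-strand gene such as PTPN1, the 5' flank lies at the lower coordinate.

```python
import re


def add_flanks(position, upstream, downstream, strand="+"):
    """Widen a 'chrN:start-end' position string by flanking bases.

    upstream is added on the 5' side of the gene and downstream on the
    3' side; for a reverse-strand gene the two sides are swapped.
    (Hypothetical helper, for illustration only.)
    """
    match = re.fullmatch(r"(chr\w+):([\d,]+)-([\d,]+)", position)
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    chrom = match.group(1)
    # Allow thousands separators such as 48,929,540.
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if strand == "+":
        new_start, new_end = start - upstream, end + downstream
    else:
        new_start, new_end = start - downstream, end + upstream
    # Coordinates cannot run off the start of the chromosome.
    new_start = max(1, new_start)
    return f"{chrom}:{new_start}-{new_end}"


widened = add_flanks("chr20:48929540-49003636", upstream=1000, downstream=200)
print(widened)  # chr20:48928540-49003836
```

The resulting string is what would be typed into the position box before clicking Jump.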

To obtain the actual sequence within the region, click on the DNA link in the blue bar at the top of the page. This produces a new page, entitled Get DNA in Window (Fig. 6.3). Click the button next to extended case/color options and then click Submit. By selecting this option, the user can highlight features in the sequence by changing the format (case, underline, bold, italic) and/or color (red, green, blue) of the text. Colors can be varied in darkness and mixed together by changing the values in the boxes under Red, Green and Blue to any number between 0 and 255; examples of how to specify a color in RGB (red-green-blue) format are given below the table. At this point, check the Toggle Case box in the Known Genes row, change the red value to 255 and leave the other color values set at zero (Fig. 6.4). Once the user clicks Submit, a new page is presented with the entire length of the sequence specified above (chr20:48928540-49003836), in which the exons within this range are shown in red capital letters (Fig. 6.5). This genomic sequence can now be saved and imported into a primer design or sequence assembly package for further analysis.
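The effect of the Toggle Case option can be imitated offline once the sequence and the exon coordinates are in hand. The sketch below is not the Browser's own code: `mark_exons` and `rgb_to_hex` are hypothetical helpers, the toy sequence and exon coordinates are invented for the example, and exon coordinates are assumed to be 1-based and inclusive.

```python
def mark_exons(sequence, region_start, exons):
    """Return the sequence in lowercase with exon bases in uppercase.

    region_start is the chromosome coordinate of the first base of
    sequence; exons is a list of (start, end) pairs, 1-based inclusive.
    (Hypothetical helper, for illustration only.)
    """
    bases = list(sequence.lower())
    for exon_start, exon_end in exons:
        # Convert chromosome coordinates to string indices, clipping
        # exons that extend beyond the retrieved region.
        lo = max(exon_start - region_start, 0)
        hi = min(exon_end - region_start + 1, len(bases))
        for i in range(lo, hi):
            bases[i] = bases[i].upper()
    return "".join(bases)


def rgb_to_hex(red, green, blue):
    """Turn three 0-255 color values into an HTML-style hex string."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("color values must lie between 0 and 255")
    return f"#{red:02x}{green:02x}{blue:02x}"


# Toy region starting at coordinate 101 with two invented exons.
print(mark_exons("acgtacgtacgt", 101, [(103, 105), (110, 112)]))  # acGTAcgtaCGT
print(rgb_to_hex(255, 0, 0))  # #ff0000
```

Keeping introns in lowercase and exons in uppercase preserves the distinction even in plain-text files, where color is lost.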

The Extended DNA Case/Color Options page can be used to combine and differentiate between genomic tracks. For example, return to the Options page, leave the Known Genes row as before but now also check the Underline square in the Mouse Blat row of the table. Clicking Submit produces a page on which the human exons still appear in red capital letters, but hits from the mouse sequence are now shown as underlined text (Fig. 6.6). In this section of the gene, the conserved mouse sequence overlaps with the exons.
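Combining two tracks in one display amounts to deciding, base by base, which formats apply. A minimal sketch of that idea follows, assuming exon and mouse-hit intervals are available as 1-based inclusive coordinate pairs; the function names and the HTML markup chosen here are illustrative assumptions, not the output format of the Options page.

```python
def in_any(pos, intervals):
    """True if pos falls inside any (start, end) interval, inclusive."""
    return any(start <= pos <= end for start, end in intervals)


def render_html(sequence, region_start, exons, mouse_hits):
    """Mark up a sequence: exons in red capitals, mouse hits underlined.

    Each base is wrapped individually, which is verbose but keeps the
    two formats independent of one another.
    (Hypothetical helper, for illustration only.)
    """
    parts = []
    for offset, base in enumerate(sequence.lower()):
        pos = region_start + offset
        text = base
        if in_any(pos, exons):
            text = f'<span style="color:#ff0000">{text.upper()}</span>'
        if in_any(pos, mouse_hits):
            text = f"<u>{text}</u>"
        parts.append(text)
    return "".join(parts)


# Toy example: bases 2-3 are exonic, base 3 also lies in a mouse hit.
print(render_html("acg", 1, exons=[(2, 3)], mouse_hits=[(3, 3)]))
```

A base that is both exonic and covered by a mouse hit receives both formats, which corresponds to the overlap between conserved mouse sequence and exons noted above.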
