HS3D

Database Description

In recent years many computational tools for gene identification and characterization have been developed [1,2,3,4,5,6,7,8 and many others], most of them based on machine learning approaches. In a machine learning approach, a learning algorithm receives a set of training examples, each labelled as belonging to a particular class. The algorithm's goal is to produce a classification rule that correctly assigns new examples to these classes. The success of these methods depends largely on the quality of the data sets used for training [9]. Furthermore, a common data set is necessary when the prediction accuracy of different programs is to be compared [10,11]. The Irvine Primate Splice Junctions Dataset (UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html) is a de facto standard in the machine learning community [12,13,14,15 and many others], but it is now out of date and does not include enough material for the needs of most learning algorithms. A more recent, EST-confirmed data set [16] has the same limitation in extent. More recently, Burset et al. [17] developed an extensive database, but its data do not include false splice sites (negative examples) and, specifically, proximal false splice sites. The latter are a well-known critical point for classification systems [11].

We developed a new database (HS3D - Homo Sapiens Splice Site Dataset) of Homo Sapiens exon, intron and splice regions. The aim of this data set is to provide standardized material for training computational approaches to gene identification and characterization and for assessing their prediction accuracy. From the complete GenBank Primate Sequences Rel. 123 (8436 entries), 697 entries of human nuclear DNA that include a gene with complete CDS and more than one exon have been selected according to assessed selection criteria [18] (file genbank_filtered.inf).
From these entries, 4450 exons and 3752 introns have been extracted (files exons.seq and introns.seq). Several statistics for these exons and introns are reported (files exons.stat and introns.stat): overall nucleotides, average GC content, number of exons/introns including non-ACGT bases, number of exons/introns in which the annotated end is not found, minimum, maximum and average exon/intron length, exon/intron length standard deviation, number of introns whose sequence does not start with GT, and number of introns whose sequence does not end with AG.

Then 3762 + 3762 donor and acceptor sites have been extracted as windows of 140 nucleotides around each splice site. After discarding the sequences that do not include canonical GT–AG junctions (176 + 191), that include insufficient data (not enough material for a 140-nucleotide window) (590 + 547), or that include non-ACGT bases (30 + 32), 2955 + 2992 windows remain (files GT_true.seq and AG_true.seq). Information and several statistics about the splice site extraction are reported (files GT_true.inf, AG_true.inf, GT_true.stat, and AG_true.stat).

Finally, there are 287,296 + 348,370 windows of false splice sites, selected by searching for canonical GT–AG pairs in non-splicing positions (files GT_false.seq and AG_false.seq; related information in GT_false.inf and AG_false.inf). The false sites in a range of +/- 60 nucleotides from a true splice site are marked as proximal.

HS3D is available at the Web server of the University of Sannio: http://www.sci.unisannio.it/docenti/rampone/
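The window extraction and the marking of proximal false sites described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build HS3D. The function names are hypothetical, sequences are assumed to be uppercase strings, and the position of the GT or AG dinucleotide inside the 140-nucleotide window (starting at the window midpoint) is an assumption, since the description only says the window is taken "around" the splice site.

```python
WINDOW = 140         # window length stated in the description
FLANK = WINDOW // 2  # assumption: the dinucleotide starts at the window midpoint
PROXIMAL = 60        # +/- range that marks a false site as proximal


def window_at(seq, pos):
    """Return the 140-nt window around the dinucleotide at pos, or None if unusable."""
    start, end = pos - FLANK, pos + FLANK
    if start < 0 or end > len(seq):
        return None  # insufficient data for a full window
    win = seq[start:end]
    if set(win) - set("ACGT"):
        return None  # window contains a non-ACGT base
    return win


def true_window(seq, pos, dinuc):
    """Window for an annotated splice site, kept only if it is canonical (GT or AG)."""
    if seq[pos:pos + 2] != dinuc:
        return None  # non-canonical junction
    return window_at(seq, pos)


def false_sites(seq, dinuc, true_positions):
    """Yield (pos, window, is_proximal) for each dinuc occurrence that is not a true site."""
    true_set = set(true_positions)
    pos = seq.find(dinuc)
    while pos != -1:
        if pos not in true_set:
            win = window_at(seq, pos)
            if win is not None:
                proximal = any(abs(pos - t) <= PROXIMAL for t in true_set)
                yield pos, win, proximal
        pos = seq.find(dinuc, pos + 1)
```

For donor sites one would call `true_window(seq, pos, "GT")` for each annotated intron start and `false_sites(seq, "GT", donor_positions)` for the negatives; acceptor sites would use "AG" in the same way, with the annotated position taken as the start of the AG dinucleotide.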

REFERENCES

1. S. Brunak, J. Engelbrecht, and S. Knudsen (1991) Prediction of the human mRNA donor and acceptor sites from the DNA Sequence. J. Mol. Biol., 220, 49-65.
2. V.V. Solovyev, A.A. Salamov, and C.B. Lawrence (1994) Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Research, 22, 5156-5163.
3. J. Henderson, S. Salzberg, and K.H. Fasman (1997) Finding Genes in DNA with a Hidden Markov Model. J. Comput. Biol., 4(2), 127-141.
4. N. Friedman, D. Geiger, and M. Goldszmidt (1997) Bayesian network classifiers. Machine Learning, 29, 131-163.
5. M.Q. Zhang (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl. Acad. Sci. USA, 94, 565-568.
6. A. Krogh (1998) An Introduction to Hidden Markov Models for Biological Sequences. In Computational Methods in Molecular Biology, S.L. Salzberg, D.B. Searls, and S. Kasif eds., Elsevier, 45-63.
7. S. Rampone (1998) Recognition of Splice-Junctions on DNA Sequences by BRAIN learning algorithm. Bioinformatics, 14(8), 676-684.
8. D. Cai, A. Delcher, B. Kao, and S. Kasif (2000) Modelling splice sites with Bayes Networks. Bioinformatics, 16(2), 152-158.
9. C.M. Bishop (1995) Neural Networks for Pattern Recognition. Oxford University Press.
10. M. Burset and R. Guigo (1996) Evaluation of gene structure prediction programs. Genomics, 34, 353-367.
11. T.A. Thanaraj (2000) Positional Characterisation of False Positives from Computational Prediction of Human Splice Sites. Nucleic Acids Research, 28(3), 744-754.
12. M.O. Noordewier, G.G. Towell, and J.W. Shavlik (1991) Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences. In Advances in Neural Information Processing Systems, volume 3, Morgan Kaufmann.
13. G.G. Towell, J.W. Shavlik, and M.W. Craven (1991) Constructive Induction in Knowledge-Based Neural Networks. In Proceedings of the Eighth International Machine Learning Workshop, Morgan Kaufmann.
14. G.G. Towell (1991) Symbolic Knowledge and Neural Networks: Insertion, Refinement, and Extraction. PhD Thesis, University of Wisconsin - Madison.
15. G.G. Towell and J.W. Shavlik (1992) Interpretation of Artificial Neural Networks: Mapping Knowledge-based Neural Networks into Rules. In Advances in Neural Information Processing Systems, volume 4, Morgan Kaufmann.
16. T.A. Thanaraj (1999) A Clean data set of EST-confirmed Splice Sites from Homo Sapiens and Standards for Clean-up Procedures. Nucleic Acids Research, 27(13), 2627-2637.
17. M. Burset, I.A. Seledtsov, and V.V. Solovyev (2001) SpliceDB: database of canonical and non-canonical mammalian splice sites. Nucleic Acids Research, 29(1), 255-259.
18. T.A. Thanaraj (1999) Standards to Create Clean Data Sets for Gene Prediction. Bioinformer, Fall '99, http://bioinformer.ebi.ac.uk/newsletter/archives/5/gene_prediction.html.
19. P. Pollastro and S. Rampone (2002) IJMPC, 13(8).
