WHO

3rd South East Asian Training Course on Bioinformatics
Applied to Tropical Diseases
Sept 28-Oct 11, 2004,
International Centre for Genetic Engineering and Biotechnology (ICGEB)
New Delhi, India

ICGEB

Group Project: Focus Questions

Instructions:

Due: Monday, Oct 11, 2004 at 9am. There are 16 participants in the lecture and workshop sections of this year's WHO/TDR Regional Bioinformatics Training Course in India, divided into four groups of four. See the participant page for group assignments. This is a relatively long list of questions; do as many as possible, especially Exercise Question #1. Write up your answers and send them in the body of an email (NOT as an email attachment!) to urmila@bioinfo.ernet.in, dinesh@icgeb.res.in, scwsr@mahidol.ac.th, cathal@science.uct.ac.za, AND especially ap2@sanger.ac.uk, huynh@ncbi.nlm.nih.gov. These are focus questions, intended to help you concentrate on the main topics of the course. We (i.e. Arnab Pain, ap2@sanger.ac.uk) will comment on your group's presentation and go through your answers at the end of the course, during the group presentations on Monday, Oct 11, 2004, 10am-12 noon.

There are two sections to the Focus Questions: Discussion Questions and Exercise Questions. We will not have time to go over all the questions, so please choose, as a group, which questions you would like to answer. Suggested reading on gene annotation (PDF). Some tips ..


Discussion Questions

  1. Different tools for predicting gene function often give very different predictions. How useful are bioinformatics methods for functional annotation and what can be done to improve their accuracy?
  2. What is a protein domain? To what extent is it possible to infer the function of a protein by sequence and phylogenetic comparison with domains present in other proteins?
  3. Describe how you would profile the complete set of novel organic compounds found in a unique tropical plant found to have anticancer activity.
  4. How can functional genomics be applied to characterize the roles of malaria genes that do not have a known mutation?
  5. You are given two files of amino acid sequences. One of the files was created by taking all of the sequences in the GenBank nr database. The other file was created using a random amino acid sequence generator that created sequences using a random distribution for both amino acid composition and sequence length. Each file has the same number of sequences and amino acid residues. Your task is to determine which file has the sequences from the nr database. You may apply any computational, statistical, or bioinformatics method of your choosing to the files. However, you may only apply these methods to these two files. You may NOT search any sequence databases to look for similarities. You may NOT use known motifs to search the files. No external data (e.g. sequence, structure etc. ) may be used to test these files in any way. Please explain your rationale and approach for determining which file has the nr sequences.
  6. If someone gave you 1 Mb of genomic DNA sequence from a eukaryote, how could you identify the species? (Assume you cannot use BLAST to directly identify the species.) What features distinguish the genomic DNA sequence of a protozoan parasite from that of an insect or a fish?
  7. You are a bright young cellular parasitologist and your Ph.D. project is to characterize those proteins that are localized to the Golgi apparatus of the Plasmodium cell. You have just read a really cool paper in Nature in which a group has discovered that if a protein ends in the four amino acids "HDEL", the protein will be retained in the Golgi. You know that the genome sequence and annotation are available and that you should be able to search for the set of sequences ending in "HDEL". You try a BLASTP search of the amino acid sequences and you don't get any hits. In 15 words or fewer, explain why.
  8. What is the difference between RefSeq and GenBank? (choose one)
    1. RefSeq includes publicly available DNA sequences submitted from individual laboratories and sequencing projects.
    2. GenBank provides nonredundant curated data.
    3. GenBank sequences are derived from RefSeq.
    4. RefSeq sequences are derived from GenBank and provide nonredundant curated data.
  9. In your own words, explain the difference between a homolog, an ortholog, and a paralog.
  10. When would you prefer analyzing nucleotide sequences with BLAST instead of protein sequences?
  11. If your sequences are highly similar, how do changes to the scoring matrix, penalty settings, and window size affect the alignment output? Would your alignment results differ between BLAST and FASTA, assuming you use the same databases?
  12. What does it take to make the leap from similarity to homology when interpreting a BLAST report?
  13. True or False: Most proteins have more than one domain, so I should be careful when looking at BLAST results because not all reported hits belong to the same big family.
  14. You have isolated a strain of Plasmodium that is resistant to antimalarial drug X. Drug X only works if it can be transported into the parasite. It is not known which transporter is responsible, but now that you have this interesting mutant you may be able to determine which one it is. You decide to screen for mutations in all transporters. To do this, you need to generate PCR primers for each of the transporter genes in the entire genome. How many genes are currently annotated as being a transporter?
  15. If there were no repetitive DNA of any kind, how would the genomes of various eukaryotes (human, mouse, a plant, a parasite) compare in terms of size, gene content, gene order, nucleotide composition, or other features?
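
For question 7, the set of proteins ending in a given tetrapeptide can also be pulled out with a plain text filter. The Python sketch below is one possible way to do this, not a prescribed method. It assumes the annotated proteins are available as FASTA-formatted text; the function names and the three toy records are made up for illustration and do not come from any real Plasmodium annotation.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ends_with_motif(records: Iterable[Tuple[str, str]],
                    motif: str = "HDEL") -> List[str]:
    """Return headers of sequences whose C-terminus matches the motif."""
    motif = motif.upper()
    hits = []
    for header, seq in records:
        # Some protein files mark the stop codon with a trailing '*'.
        seq = seq.upper().rstrip("*")
        if seq.endswith(motif):
            hits.append(header)
    return hits


# Hypothetical toy records, for illustration only.
example = """>protA hypothetical
MKTLLVAAHDEL
>protB hypothetical
MHDELKKTAA
>protC hypothetical
MSSRVHDEL*
""".splitlines()

print(ends_with_motif(read_fasta(example)))
```

Only protA and protC are reported, because protB carries the motif internally rather than at its C-terminus. To run this on a real file, pass an open file handle to read_fasta in place of the example lines.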

Exercise Questions

  1. Apply the knowledge and skills that you have learned from the workshop on bioinformatics to crack a sequence code and to convert raw sequence data into biological knowledge for basic biomedical research. Your team has isolated a virus that has been implicated in causing an epidemic and has sequenced its genome. Each group receives a different set of viral genome sequences for genome analysis. Obtain your group's sequence. The objective is to characterize this sequence using bioinformatics tools and methods.
    1. Verify that your query sequence has a viral origin and is not contamination. Perform a database similarity search to retrieve similar high-scoring sequences (i.e. hits). [Hint: Search this sequence against a nucleotide sequence database (e.g. NR) and/or a protein sequence database (e.g. Swiss-Prot or PIR). Also try VecScreen.] So, where is your group's query sequence from?
    2. Annotate the genome using results of these searches.
    3. Draw a diagram of genome organization.
    4. Now select ONE protein of your choice from the genome.
      1. Carry out a multiple sequence alignment: extract sequences of similar viruses (for example, flaviviruses) from Swiss-Prot and PIR and align them with your protein.
      2. Find PROSITE patterns.
      3. Check whether these sequences are part of the BLOCKS database.
      4. Determine the domain architecture -- defined as the sequential order of conserved domains in proteins -- of the protein. What types of domains were found? How many of each of these domains are present in the unknown? [Hint: Use the Pfam database. Other choices not covered in the course include Profile Scan, DART at NCBI, etc. on the web.]
      5. Apply the domain name or sequence to retrieve a group of sequences with this domain [Hint: Pfam]
      6. Carry out a phylogenetic analysis of the sequences your group has retrieved from the protein databases.
      7. Predict epitopes using the antigenic program in the EMBOSS package.
      8. Predict the secondary structure of this protein domain. [Hint: choose one of the secondary structure prediction programs introduced in the course.]
      9. Check if structure could be predicted for any viral proteins. [Hint: Do a similarity search against the PDB database to find an ortholog with a known structure.]
    5. Using ACT, compare your group's sequence with the sequences of any two other groups. Then, as a class, compare:
      1. GroupA sequence <--> GroupB sequence;
      2. GroupB sequence <--> GroupC sequence
      3. GroupC sequence <--> GroupD sequence
      4. GroupD sequence <--> GroupA sequence
      5. GroupA sequence <--> GroupC sequence
      6. GroupB sequence <--> GroupD sequence
    6. SUMMARIZE what you discovered in your bioinformatics investigation in a brief scientific report (a little more than half a page, but no more than one page, would be sufficient) by answering
  2. ANSWER <-- [Will Add LINK :) ]
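
As a first pass at the annotation step in Exercise Question 1, it can help to list candidate open reading frames before interpreting the database hits. The Python sketch below is a minimal, illustrative six-frame ORF scan, not a substitute for the annotation tools used in the course. It assumes the standard stop codons (TAA, TAG, TGA) and ATG as the only start codon; the function names, the min_codons threshold, and the toy sequence are hypothetical. Coordinates are 0-based and refer to the strand being scanned, so minus-strand hits are given on the reverse complement.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Scan all six frames for ATG...stop ORFs.

    Returns (strand, frame, start, end, n_codons) tuples, longest first.
    n_codons counts the start codon but not the stop codon; end is the
    position just past the stop codon.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA genomes written with U
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None  # look for the next ORF in this frame
    return sorted(orfs, key=lambda orf: orf[4], reverse=True)


# Hypothetical toy sequence: one short ORF in forward frame 2.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
for orf in find_orfs(toy, min_codons=3):
    print(orf)
```

ORFs that run off the end of the sequence without a stop codon are not reported by this sketch, which matters for partial genome sequences. Candidate ORFs found this way still need to be checked against the similarity search results before they are drawn on the genome diagram.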

For additional questions for the ambitious, please contact the course instructors for more exercise questions.


Question #1: Query sequence:

Sequence for:



Last updated: Tuesday, October 5, 2004 18:37 (Delhi Time GMT+5:30)