WHO-TDR BLAST Exercises

Today's exercises have several goals:
  1. Perform a BLASTP search at NCBI
  2. Save selected sequences from an NCBI BLAST search (to use later)
  3. Perform a PSI-BLAST search
  4. Learn to use ENTREZ limits to download a custom set of sequences
  5. Think conceptually about BLAST
  6. Learn to download and install WU-BLAST
  7. Learn to download files to your computer
  8. Learn to make a custom WU-BLAST database on your computer
  9. Learn to search a custom WU-BLAST database on your computer
NOTE: you will not be able to finish these exercises in one sitting. Take your time and do each step as time permits. Be prepared to answer the questions in section 5 over dinner.

1. Perform a BLASTP search at NCBI
    • Open a browser and go to: http://www.ncbi.nlm.nih.gov
    • Click on "BLAST" on the page header
    • Click on "protein vs protein - BLASTP" search
    • Paste the following sequence in the box at the top of the page, choose the "nr" database and click "BLAST!"
>UniProt/Swiss-Prot|Q09746|YB65_SCHPO Hypothetical protein C12C2.05c in chromosome II
MSLETYKFSDELHDDFKVVDSWINNGAKWLEDIQLYYKERSSIEKEYAQKLASLSNKYGE
KKSRKSSALSVGDTPAMSAGSLECASLTTWSKILDELTRSSKTHQKLSDDYSLDIAEKLK
KLESHIEALRKVYDDLYKKFSSEKETLLNSVKRAKVSYHEACDDLESARQKNDKYREQKT
QRNLKLSESDMLDKKNKYLLRMLVYNAHKQKFYNETLPTLLNHMQVLNEYRVSNLNEIWC
NSFSIEKSLHDTLSQRTVEIQSEIAKNEPVLDSAMFGRHNSKNWALPADLHFEPSPIWHD
TDALVVDGSCKNYLRNLLVHSKNDLGKQKGELVSLDSQLEGLRVDDPNSANQSFESKKAS
INLEGKELMVKARIEDLEVRINKITSVANNLEEGGRFHDFKHVSFKLPTSCSYCREIIWG
LSKRGCVCKNCGFKCHARCELLVPANCKNGEPEVADDDAVDTSVTATDDFDASASSSNAY
ESYRNTYTDDMDSSSIYQTSLSNVKTEETTPAEPASKVDGVVLYDFTGEHEGVITASEGQ
EFTLLEPDDGSGWVRVKIDGTDGLIPASYVKLNDELNTSVTLDGDSSYVKALYAYTAQSD
MELSIQEGDIIQVTNRNAGNGWSEGILNGVTGQFPANYVTDV

    • Look at the CDD domain information while you are waiting for your BLAST results

    • When your results appear, look at them.
    • Look at the Taxonomy Report (the link is near the top of the page). What is the taxonomic distribution of the hits for this protein?
    • What do the Purple/Blue "G" boxes at the end of each definition line mean?

2. Save selected sequences from an NCBI BLAST search (to use later if you wish)
    • Select the 10 best alignments (based on E-value) from the alignments section and download them as FASTA-formatted sequences.
      • To do this, check the box next to the alignment for each sequence of interest.
      • Next, go to the top of the alignments section and select "Get Selected Sequences". An NCBI page with the 10 sequences will appear in NCBI "summary" format. 
      • Go to the second-level tool bar on the page and select "FASTA" from the "display" option. A new page will appear with all 10 sequences in FASTA format.
      [Image: selecting "FASTA" from the "display" option]
      • To save them in FASTA text format, use the "send" option to send all of them to a file. You will be prompted for a file name and a location in which to save the file. I suggest saving it to your desktop and naming it "For_MSA.txt".
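
If you would like to double-check the file you just saved, one quick way (once you are at a Linux prompt) is to count the header lines, since every FASTA record starts with ">". This is only a sketch: the path below assumes the file was saved as "For_MSA.txt" in a Desktop folder under your home directory, and count_fasta_records is a helper name made up for this example.

```shell
# Count FASTA records: every record starts with a ">" header line.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Assumed location; adjust to wherever you actually saved the file.
fasta_file="$HOME/Desktop/For_MSA.txt"
if [ -f "$fasta_file" ]; then
    echo "Sequences saved: $(count_fasta_records "$fasta_file")"
else
    echo "File not found: $fasta_file"
fi
```

If the number printed is not 10, go back and check which boxes were ticked before you selected "Get Selected Sequences".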


3. Perform a PSI-BLAST search
    • Use the FASTA "Uniprot" sequence from above to perform a PSI-BLAST search at NCBI
    • Go to http://www.ncbi.nlm.nih.gov
    • Click on BLAST
    • Choose PSI-BLAST
    • Search "nr"
    • Look at the BLAST results. Are they any different from the search above? (Hint: look at the 10 best sequences you saved. Are they the same?)
    • Do a second iteration. How many new sequences were found (an approximate number is sufficient)?
    • Do you believe they are real? Why?
    • How could you make the search more stringent?

4. Learn to use ENTREZ limits to download a custom set of sequences
    [Image: view of the tool bar]
    • Scroll down the Databases page and select "dbEST" from the nucleotide database section
    • Type in "Plasmodium falciparum" and click "GO"
    • How many "hits" are found?
    • Click on the "Limits" tab at the top of the page
    • In the "limited to" section, select "organism" from the first pulldown menu
    [Image: view of the limits]
    • Click on Go at the top of the page
    • Now how many hits are there?
    • Why is there a difference?
    • NOTE: You could download all of these sequences using the steps we used above. PLEASE do not do this now; this is just for your information. Now that you know how to download selected sets of FASTA sequences, you can make a custom BLAST database to search on your own local computer. However, in the interest of time, we have some files ready for you to download in step 7.

5. Test your conceptual knowledge of BLAST  (Hint: this is a test)

1.    What is the difference between a global and local alignment strategy?
2.    Calculate the score of the following alignment using this scoring scheme: +1 for a match, -1 for a mismatch, -3 to open a gap, and -1 for each additional gap position.

AATTAGATCCTA--GATTTTACCGGACCA
||||| ||||||  ||||| ||||||| | 
AATTACATCCTACAGATTT-ACCGGACGA


3.    If a match from a database search is reported to have an E-value of 0.0, is it considered highly insignificant or highly significant?
4.    Which approach guarantees you an optimal alignment: Smith-Waterman or a heuristic approach?
5.    Does increasing the value of “T”, the threshold value, generate more or fewer neighborhood words?
6.    Does increasing the word size increase or decrease the sensitivity of BLAST?
7.    Does increasing the word size increase or decrease the selectivity of BLAST?
8.    Which is better, in general, for database searching: a nucleotide or a protein sequence database? Why? Does it depend on what you are looking for?
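
The scoring scheme in question 2 can be written down as a small shell function, which may help you check your arithmetic. This is a sketch, not an answer key: score_alignment is a name invented here, and it assumes that the -3 gap-opening penalty covers the first gap position and that each further position in the same gap costs -1. The example uses a short toy alignment, not the one from the question.

```shell
# score_alignment TOP BOTTOM
# Scores two equal-length aligned strings ("-" marks a gap position):
#   +1 match, -1 mismatch, -3 for the first position of a gap,
#   -1 for each additional position in the same gap.
score_alignment() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        score = 0
        prev = ""
        for (i = 1; i <= length(a); i++) {
            x = substr(a, i, 1)
            y = substr(b, i, 1)
            if (x == "-" || y == "-") {
                # Remember which sequence carries the gap so that a gap
                # switching sides is counted as a newly opened gap.
                state = (x == "-") ? "gap_top" : "gap_bottom"
                score += (state == prev) ? -1 : -3
            } else {
                state = "aligned"
                score += (x == y) ? 1 : -1
            }
            prev = state
        }
        print score
    }'
}

# Toy alignment: 7 matches, 1 mismatch, one gap of length 2.
score_alignment "ACGT--ACGA" "ACGTTTACGT"   # prints 2
```

Work out the alignment in question 2 by hand first, then compare your result with what the function reports.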



6. Installing WU-BLAST:

1.    Inside Linux, go to your home directory. Type “cd” to do this.
2.    Make a software directory. Type “mkdir SoftWare”.
3.    Go to this new directory. To do this, type “cd SoftWare”.
4.    Make a new directory called WUBLAST. To do this, type “mkdir WUBLAST”.
5.    Go to this directory. To do this, type “cd WUBLAST”.
6.    Now, open an internet browser by clicking on the globe icon next to the start button.
7.    Go to the course web site, http://192.168.1.103
8.    Click on “software”.
9.    Click on wu_blast.tar.gz and download the program into the newly created directory. Be sure to download the software into the directory /home/workshop/SoftWare/WUBLAST
10.    Type “ls” to make sure that the file is present. Do you see it? If not, ask for HELP.
11.    Uncompress the software. Type “gunzip wu_blast.tar.gz”.
12.    Type “ls”. Is the “.gz” now gone?
13.    Untar the file. Type “tar -xvf wu_blast.tar”.
14.    Type “ls”. Is the “.tar” gone? Use the command “cd” to explore the new folder.
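
For reference, the terminal part of the steps above can be collected into one small function. This is a sketch under a few assumptions: install_wublast is a name invented for this example, the base directory defaults to your home directory, and the archive has already been downloaded with the browser into SoftWare/WUBLAST.

```shell
# install_wublast [BASE_DIR]
# Repeats the directory and unpacking steps under BASE_DIR (default: $HOME).
install_wublast() {
    base_dir="${1:-$HOME}"
    install_dir="$base_dir/SoftWare/WUBLAST"

    # mkdir -p creates SoftWare and WUBLAST in one step.
    mkdir -p "$install_dir" || return 1

    (
        cd "$install_dir" || exit 1
        # gunzip turns wu_blast.tar.gz into wu_blast.tar
        if [ -f wu_blast.tar.gz ]; then
            gunzip wu_blast.tar.gz
        fi
        # tar -xvf unpacks the archive and lists each file as it goes
        if [ -f wu_blast.tar ]; then
            tar -xvf wu_blast.tar
        else
            echo "No wu_blast.tar(.gz) in $install_dir yet; download it first."
        fi
        ls
    )
}

# Example: install_wublast          (uses your home directory)
```

Typing the commands one at a time, as in the numbered steps, is still the better way to learn what each one does.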

7. Getting the query and the database files:

1.    Return to your home directory. Type “cd” to do this.
2.    Type “ls” to verify that you are home. You should see the “SoftWare” directory.
3.    Make a new directory for your BLAST databases. Type “mkdir db”.
4.    Go to the course web site, http://192.168.1.103, and click on the “data” link.
5.    Click on the “seq_for_blast” link.
6.    You should see 5 files: three end in “.fa”, and the other two are query.nuc and query.prot.
7.    Download the three files ending in “.fa” into your /home/username/db directory.
8.    Download the two query files into your home directory, /home/username/ (remember to substitute your username).
9.    Use the “cd” and “ls” commands to verify that you now have all 5 files. (The two query files are in your home directory, the three .fa files are in the db directory, and BLAST is in the SoftWare/WUBLAST directory.)
10.    If all is in order, you are now ready to format your databases.
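
If you prefer to let the computer do the checking, the sketch below looks for all five files in one go. The helper name check_blast_files is made up for this example, and the three “.fa” file names are assumed to be the ones used later in this handout (plasmodiumGENE_nuc.fa, plasmodiumGENE_prot.fa and plasmodiumORFs.fa).

```shell
# check_blast_files [HOME_DIR]
# Reports which of the five downloaded files are in place under HOME_DIR
# (default: $HOME). Succeeds only if none are missing.
check_blast_files() {
    home_dir="${1:-$HOME}"
    missing=0
    for f in query.nuc query.prot \
             db/plasmodiumGENE_nuc.fa \
             db/plasmodiumGENE_prot.fa \
             db/plasmodiumORFs.fa
    do
        if [ -f "$home_dir/$f" ]; then
            echo "found:   $f"
        else
            echo "MISSING: $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example: check_blast_files        (checks your home directory)
```

Any line that starts with MISSING tells you which download to repeat.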


8. Learn to make a custom WU-BLAST database on your computer

1.    Locate the FASTA files you will use to make your BLAST-searchable database. (Hint: they are in “db”.)
2.    Go to this directory. Type "cd db".
3.    Type the command: “../SoftWare/WUBLAST/xdformat -n plasmodiumGENE_nuc.fa”. Formatting the database will take a minute, so please be patient. When your prompt returns, it is done and you can move on to the next command.
4.    Type “ls”. Are there three new files?
5.    Type the command: “../SoftWare/WUBLAST/xdformat -p plasmodiumGENE_prot.fa”
6.    Type the command: “../SoftWare/WUBLAST/xdformat -p plasmodiumORFs.fa”

      -n = nucleotide input data
      -p = amino acid input data

      Three files will be created for each input file and some statistics will be presented to you for each when it is done.

9. Learn to search a custom WU-BLAST database on your computer

Data and type                             Name                      Location
Plasmodium annotated genes, nucleotide    plasmodiumGENE_nuc.fa     db/
Plasmodium annotated gene, protein        plasmodiumGENE_prot.fa    db/
Plasmodium predicted ORFs, protein        plasmodiumORFs.fa         db/
 
Now, change the name of query.prot to "rifin.txt". Type "mv query.prot rifin.txt".

Here is the text of  "rifin.txt".
>query.prot
MKIHYINILLFELPLNILIYNQRNHNSTTPHHPPNTRLLCECELYAPATYDDDPQMKEVMVKFSKQTQQR
FEEYDERMKTTRQKCKDKCDKEIQKIILKDRLEKQMEQQLTTLETKIDTNDIPTCICEKSMADKLEKECL
KCAQNLGGIVAPSSGVLVGIAEGALYAWKPTAITAAKKAALAEATDAAIEAGMNAVSLKIEELGTVFKPS
EGFVNLSSIVNKLTYNNGDALVESAKNVIGGLYSNGKGGNTIFYNTTIHTKSGTLYVGNFGDIGRAAHDA
KLASETTALTKAKVGAVESTYGGCQTAIIASIVAIVVIVLIMVIIYLILRYRRKKKVNKKLQYIKLLNE


Note the structure of a BLAST command: "Algorithm/Program  Database  Query". If that were all you typed, the results would appear on your screen. I added "> filename.blast_type" to save the results to a file that you can look at. This is a way of "saving" them.
If you would like to see all of the BLAST options, just type "blastp" and hit return.
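
As an illustration of the "Program  Database  Query" pattern with the output redirected to a file, here is a minimal sketch. The helper run_blast and the output name it builds (query name plus program name, for example rifin.blastp) are assumptions made for this example, as is the location of blastp under SoftWare/WUBLAST. Adjust them to match your own setup.

```shell
# run_blast PROGRAM DATABASE QUERY
# Runs "PROGRAM DATABASE QUERY" and saves the report in the current
# directory as <query name without extension>.<program name>.
run_blast() {
    program="$1"
    database="$2"
    query="$3"
    out="$(basename "${query%.*}").$(basename "$program")"
    # Without the "> $out" part the report would scroll past on the screen.
    "$program" "$database" "$query" > "$out" && echo "Results saved to $out"
}

# Example (run from your home directory, after formatting the databases):
#   run_blast "$HOME/SoftWare/WUBLAST/blastp" db/plasmodiumGENE_prot.fa rifin.txt
# which would write the report to rifin.blastp
```

You can open the saved report with any text viewer, for example by typing "more rifin.blastp".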