Today's Exercises have several goals:
- Perform a BLASTP search at NCBI
- Save selected sequences from an NCBI BLAST search (to use later)
- Perform a PSI-BLAST search
- Learn to use ENTREZ limits to download a custom set of sequences
- Think conceptually about BLAST
- Learn to download and install WU-BLAST
- Learn to download files to your computer
- Learn to make a custom WU-BLAST database on your computer
- Learn to search a custom WU-BLAST database on your computer
NOTE: you will not be able to finish these exercises in one sitting.
Take your time and do each step as time permits. Be prepared to answer
the questions in section 5 over dinner.
1. Perform a BLASTP search at NCBI
- Open a browser and go to: http://www.ncbi.nlm.nih.gov
- CLICK on "BLAST" on the page header
- Click on "protein vs protein - BLASTP" search
- Paste the following sequence in the box at the top of the
page, choose the "nr" database and click "BLAST!"
>UniProt/Swiss-Prot|Q09746|YB65_SCHPO
Hypothetical protein C12C2.05c in chromosome II
MSLETYKFSDELHDDFKVVDSWINNGAKWLEDIQLYYKERSSIEKEYAQKLASLSNKYGE
KKSRKSSALSVGDTPAMSAGSLECASLTTWSKILDELTRSSKTHQKLSDDYSLDIAEKLK
KLESHIEALRKVYDDLYKKFSSEKETLLNSVKRAKVSYHEACDDLESARQKNDKYREQKT
QRNLKLSESDMLDKKNKYLLRMLVYNAHKQKFYNETLPTLLNHMQVLNEYRVSNLNEIWC
NSFSIEKSLHDTLSQRTVEIQSEIAKNEPVLDSAMFGRHNSKNWALPADLHFEPSPIWHD
TDALVVDGSCKNYLRNLLVHSKNDLGKQKGELVSLDSQLEGLRVDDPNSANQSFESKKAS
INLEGKELMVKARIEDLEVRINKITSVANNLEEGGRFHDFKHVSFKLPTSCSYCREIIWG
LSKRGCVCKNCGFKCHARCELLVPANCKNGEPEVADDDAVDTSVTATDDFDASASSSNAY
ESYRNTYTDDMDSSSIYQTSLSNVKTEETTPAEPASKVDGVVLYDFTGEHEGVITASEGQ
EFTLLEPDDGSGWVRVKIDGTDGLIPASYVKLNDELNTSVTLDGDSSYVKALYAYTAQSD
MELSIQEGDIIQVTNRNAGNGWSEGILNGVTGQFPANYVTDV
- Look at the CDD domain information while you are waiting for
your BLAST results
- When your results appear, look them over.
- Look at the Taxonomy Report (the link is near the top of the page). What is the taxonomic distribution of the hits for this protein?
- What do the Purple/Blue "G" boxes at the end of each definition
line mean?
2. Save selected sequences from an NCBI BLAST search (to use later if
you wish)
- Select the 10 best alignments (based on E-value) from the
alignments section and download them as fasta formatted sequences.
- To do this, check the box next to the alignment for each sequence of interest.
- Next, go to the top of the alignments section and select "Get
Selected Sequences". An NCBI page with the 10 sequences will appear in
NCBI "summary" format.
- Go to the second level tool bar on the page and
select "FASTA" from the "display" option. A new page will appear with
all 10 sequences in fasta format.
- To save them as a FASTA text file, click on the "send" option to send all of them to a file. You will be prompted for a file name and a location in which to save the file. I suggest saving it to your desktop and naming it "For_MSA.txt".
3. Perform a PSI-BLAST search
- Use the FASTA "UniProt" sequence from above to perform a PSI-BLAST search at NCBI
- Go to http://www.ncbi.nlm.nih.gov
- Click on BLAST
- Choose PSI-BLAST
- Search "nr"
- Look at the BLAST results. Are they any different from the search above? (Hint: look at the 10 best sequences you saved. Are they the same?)
- Do a second iteration. How many new sequences were found? (An approximate number is sufficient.)
- Do you believe they are real? Why?
- How could you make the search more stringent?
4. Learn to use ENTREZ limits to download a custom set of sequences
- Scroll down the Databases page and select "dbEST" from the nucleotide database section
- Type in "Plasmodium falciparum" and click "GO"
- How many "hits" are found?
- Click on the "Limits" tab at the top of the page
- In the "limited to" section, select "organism" from the first pulldown menu
- Click on "Go" at the top of the page
- Now how many hits are there?
- Why is there a difference?
- NOTE: you could download all of these sequences using the steps we used above. PLEASE do not do this now; this is just for your information. Now that you know how to download selected files of FASTA sequences, you can make a custom BLAST database to search on your own local computer. However, in the interest of time, we have some files ready for you to download in step 7.
5. Test your conceptual knowledge of BLAST (Hint: this is a test)
1. What is the difference between a global and local
alignment strategy?
2. Calculate the score of the following alignment using this scoring scheme: +1 for a match, -1 for a mismatch, -3 to open a gap and -1 for each additional gap position.
AATTAGATCCTA--GATTTTACCGGACCA
||||| ||||||  ||||| ||||||| |
AATTACATCCTACAGATTT-ACCGGACGA
3. If a match from a database search is reported to
have an E-value of 0.0, is it considered highly insignificant or
highly significant?
4. Which approach guarantees you an optimal
alignment, Smith-Waterman or a heuristic approach?
5. Does increasing the value of “T”, the threshold value, generate more or fewer neighborhood words?
6. Does increasing the word size increase or decrease
the sensitivity of BLAST?
7. Does increasing the word size increase or decrease
the selectivity of BLAST?
8. In general, which is better for database searching, a nucleotide or a protein sequence database? Why? Does it depend on what you are looking for?
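For question 2, the scoring scheme can be sketched as a small shell function. This is only an illustration under stated assumptions: the name score_alignment is made up for this handout, a gap of length k is taken to cost 3 + (k - 1), and the toy alignment used here is deliberately not the one in the question.

```shell
# score_alignment TOP BOTTOM  (hypothetical helper, not part of BLAST)
# Scores two equal-length gapped rows: +1 match, -1 mismatch,
# -3 for the first position of a gap, -1 for each further position.
score_alignment() {
    awk -v top="$1" -v bottom="$2" 'BEGIN {
        if (length(top) != length(bottom)) {
            print "rows differ in length" > "/dev/stderr"
            exit 1
        }
        score = 0; prev = 0
        for (i = 1; i <= length(top); i++) {
            x = substr(top, i, 1); y = substr(bottom, i, 1)
            # gap = 1 for a gap in the top row, 2 for the bottom row, 0 for none
            gap = (x == "-") ? 1 : ((y == "-") ? 2 : 0)
            if (gap) {
                # continuing the same gap costs -1, opening a new one costs -3
                score += (gap == prev) ? -1 : -3
            } else {
                score += (x == y) ? 1 : -1
            }
            prev = gap
        }
        print score
    }'
}

# Toy alignment: 6 matches, 1 mismatch, one gap of length 2
score_alignment "ACGTT--AC" "ACCTTGGAC"   # prints 1
```

Work out the score for the alignment in question 2 by hand first, then use the function to check your answer.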
6. Installing WU-BLAST:
1. In Linux, go to your home directory. To do this, type “cd”.
2. Make a software directory. Type “mkdir SoftWare”.
3. Go to this new directory. To do this, type “cd SoftWare”.
4. Make a new directory called WUBLAST. To do this, type “mkdir WUBLAST”.
5. Go to this directory. To do this, type “cd WUBLAST”.
6. Now, open an internet browser by clicking on the globe icon next to the start button.
7. Go to the course web site, http://192.168.1.103
8. Click on “software”.
9. Click on wu_blast.tar.gz and download the program. Be sure to download it into the newly made directory /home/workshop/SoftWare/WUBLAST
10. Type “ls” to make sure that the file is present. Do you see it? If not, ask for HELP.
11. Uncompress the software. Type “gunzip wu_blast.tar.gz”.
12. Type “ls”. Is the “.gz” now gone?
13. Untar the file. Type “tar -xvf wu_blast.tar”.
14. Type “ls”. Is the “.tar” gone? Use the command “cd” to explore the new folder.
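The directory and unpacking steps above can also be collapsed into a few commands. The sketch below is one possible shortcut, not the required procedure: the function name unpack_archive is made up for illustration, "mkdir -p" creates SoftWare and WUBLAST in one go, and tar's "-z" option does the gunzip step on the fly (so the .tar.gz file is left in place).

```shell
# unpack_archive ARCHIVE DEST  (hypothetical helper)
# Creates DEST (and any missing parent directories) and unpacks
# a gzip-compressed tar archive into it.
unpack_archive() {
    archive=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -x extract, -z gunzip on the fly, -f archive file, -C target directory
    tar -xzf "$archive" -C "$dest"
}

# Example call, assuming the archive was downloaded as described above:
#   unpack_archive ~/SoftWare/WUBLAST/wu_blast.tar.gz ~/SoftWare/WUBLAST
```

Doing the steps one at a time, as in the list, is still worthwhile the first time, because "ls" after each step shows you exactly what changed.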
7. Getting the query and the database files:
1. Return to your home directory. Type “cd” to do this.
2. Type “ls” to verify that you are home. You should see the “SoftWare” directory.
3. Make a new directory for your BLAST databases. Type “mkdir db”.
4. Go to the course web site, http://192.168.1.103, and click on the “data link”.
5. Click on the “seq_for_blast” link.
6. You should see 5 files: three end in “.fa” and the other two are query.nuc and query.prot.
7. Download the three files ending in “.fa” into your /home/username/db directory.
8. Download the two query files into your home directory, /home/username/ (remember to substitute your username).
9. Use the “cd” and “ls” commands to verify that you now have all 5 files. (The 2 query files are in your home directory, the three .fa files are in the db directory, and BLAST is in the SoftWare/WUBLAST directory.)
10. If all is in order, you are now ready to format your databases.
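The check in step 9 can be sketched as a short shell function. The function name check_blast_files is made up for illustration; the three ".fa" file names are the ones used in the formatting commands of the next section.

```shell
# check_blast_files BASE_DIR  (hypothetical helper)
# Reports which of the five downloaded files are present under BASE_DIR
# and returns a non-zero status if any are missing.
check_blast_files() {
    base=$1
    missing=0
    for f in query.nuc query.prot \
             db/plasmodiumGENE_nuc.fa \
             db/plasmodiumGENE_prot.fa \
             db/plasmodiumORFs.fa; do
        if [ -f "$base/$f" ]; then
            echo "found    $f"
        else
            echo "MISSING  $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example call from anywhere:
#   check_blast_files "$HOME"
```

If a file is reported as MISSING, it was most likely saved to a different directory by the browser; find it and move it with "mv".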
8. Learn to make a custom WU-BLAST database on your computer
1. Locate the FASTA files you will use to make your BLAST searchable database. (Hint: they are in “db”.)
2. Go to this directory. Type "cd db".
3. Type the command “../SoftWare/WUBLAST/xdformat -n plasmodiumGENE_nuc.fa”. Formatting the database will take a minute, so please be patient. When your prompt returns, it is done and you can move on to the next command.
4. Type “ls”. Are there three new files?
5. Type the command “../SoftWare/WUBLAST/xdformat -p plasmodiumGENE_prot.fa”.
6. Type the command “../SoftWare/WUBLAST/xdformat -p plasmodiumORFs.fa”.
-n = nucleotide input data
-p = amino acid input data
Three files will be created for each input file, and some statistics will be presented to you for each when it is done.
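To confirm that each database was formatted, you can count the new files instead of reading the "ls" listing by eye. This sketch assumes that xdformat names its output files after the input file with an extra extension added (which matches how the search commands refer to the database by the ".fa" name); the function name count_index_files is made up for illustration.

```shell
# count_index_files FASTA  (hypothetical helper)
# Counts files named FASTA.<something> that sit next to FASTA.
count_index_files() {
    n=0
    for f in "$1".*; do
        # when nothing matches, the pattern itself is left in $f, so test it
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example call from inside the db directory:
#   count_index_files plasmodiumGENE_nuc.fa
```

A count of 3 for each of the three ".fa" files would agree with the description above.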
9. Learn to search a custom WU-BLAST database on your computer
- You are now ready to do a local, command line BLAST search on
your linux computer.
- You have installed and formatted several databases:
Data and type | Name | Location
Plasmodium annotated genes, nucleotide | plasmodiumGENE_nuc.fa | db/
Plasmodium annotated genes, protein | plasmodiumGENE_prot.fa | db/
Plasmodium predicted ORFs, protein | plasmodiumORFs.fa | db/
- Create text files for your query sequences. You can do this with the Linux program "emacs" or with one of the other text editors available with Linux (check your tool bar). If you downloaded the files query.nuc and query.prot, use one of the editors to look at your files and make sure they are in FASTA format.
Now, change the name of query.prot to "rifin.txt". Type "mv query.prot rifin.txt".
Here is the text of "rifin.txt":
>query.prot
MKIHYINILLFELPLNILIYNQRNHNSTTPHHPPNTRLLCECELYAPATYDDDPQMKEVMVKFSKQTQQR
FEEYDERMKTTRQKCKDKCDKEIQKIILKDRLEKQMEQQLTTLETKIDTNDIPTCICEKSMADKLEKECL
KCAQNLGGIVAPSSGVLVGIAEGALYAWKPTAITAAKKAALAEATDAAIEAGMNAVSLKIEELGTVFKPS
EGFVNLSSIVNKLTYNNGDALVESAKNVIGGLYSNGKGGNTIFYNTTIHTKSGTLYVGNFGDIGRAAHDA
KLASETTALTKAKVGAVESTYGGCQTAIIASIVAIVVIVLIMVIIYLILRYRRKKKVNKKLQYIKLLNE
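A quick way to check that a query file looks like FASTA, without opening an editor, is sketched below. The function name fasta_records is made up for illustration, and the test is deliberately simple: it only checks that the first non-empty line starts with ">" and then counts the header lines.

```shell
# fasta_records FILE  (hypothetical helper)
# Prints the number of ">" header lines in FILE, or fails with a message
# if the first non-empty line is not a header.
fasta_records() {
    # first line that contains any non-blank text
    first=$(awk 'NF { print; exit }' "$1")
    case $first in
        ">"*) grep -c '^>' "$1" ;;
        *)    echo "not FASTA: $1" >&2
              return 1 ;;
    esac
}

# Example call from your home directory:
#   fasta_records rifin.txt
```

For a single-sequence query file such as rifin.txt, the count should be 1.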
- You have 3 local BLAST-searchable databases and 2 query sequence files. You are ready to BLAST.
- In your terminal window:
- Type "cd" and hit return
- Type "ls" and hit return. You should see your query files. You can look at them with "more" if you like.
- Type "SoftWare/WUBLAST/blastp db/plasmodiumGENE_prot.fa rifin.txt > rifin.blastp". This may take a few minutes to run.
- Type "SoftWare/WUBLAST/blastx db/plasmodiumGENE_prot.fa query.nuc > query.blastx"
- Look at your results files using "more".
- What is the putative identity of query.nuc?
- Is this the sequence of a complete gene? How do you know?
Note the structure of a BLAST command: "program database query". If this were all you typed, the results would appear on your screen. I added "> filename.blast_type" to save the results to a file that you can look at later.
If you would like to see all of the BLAST options, just type the program name by itself, for example "SoftWare/WUBLAST/blastp", and hit return.
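The "program database query > output" pattern can be made explicit with a small helper that only prints the command it would run, so you can check it before typing it. The function name blast_command and the output naming rule (query name without its extension, plus the program name) are illustrative choices that happen to reproduce the two commands used above.

```shell
# blast_command PROGRAM DATABASE QUERY  (hypothetical helper, dry run only)
# Prints the search command in the form used in this exercise.
blast_command() {
    program=$1
    database=$2
    query=$3
    # output file: query name without its extension, plus the program name
    out="${query%.*}.$program"
    echo "SoftWare/WUBLAST/$program $database $query > $out"
}

blast_command blastp db/plasmodiumGENE_prot.fa rifin.txt
blast_command blastx db/plasmodiumGENE_prot.fa query.nuc
```

The same pattern applies to the other databases: swap in a different program, database, or query file, and keep the order of the three parts the same.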