WHO

3rd South East Asian Training Course on Bioinformatics
Applied to Tropical Diseases
Sept 28-Oct 11, 2004,
International Centre for Genetic Engineering and Biotechnology (ICGEB)
New Delhi, India

ICGEB

Alternatives to Web BLAST

Hands-on session - Self Study

October 1, 2004

 

I.  Introduction

 

We have prepared a package of BLAST programs, databases, query files, and documentation that provide practice in using the standalone BLAST and Netblast (blastcl3) programs. The packages, one for PC and one for Mac, can be downloaded from the course ftp directory:

ftp://ftp.ncbi.nlm.nih.gov/pub/FieldGuide/FGPlus/August2004/Blast/Software

 

LINUX: DB/blast229.tar.gz [600 MB] -- local copy

 

Table 1 lists the contents of the PC archive, blast.zip. The contents of the Mac archive, blast.tar.gz, are equivalent.

 

Table 1. Contents of the BLAST archive (blast.zip)

Name              Size (bytes)  Type         Purpose
----------------  ------------  -----------  -------------------------------------
bl2seq.exe           2,039,808  Program      Compares two sequences
blast.txt               70,980  Read me      blastall/blastpgp readme
blastall.exe         1,970,176  Program      General purpose BLAST program
blastcl3.exe         2,154,496  Program      Netblast client-server program
blastn_api.pl            4,690  Perl script  Nucleotide blast search script
blastpgp.exe         2,154,496  Program      Standalone PSI-BLAST
blastp_api.pl            6,251  Perl script  Protein blast search script
bovine_ipp.txt             475  Text         Bovine IPP protein sequence
copymat.exe          1,470,464  Program      Not needed in this class
Data                         -  <DIR>        Directory with matrix files
ecoli_pur.txt              659  Text         E. coli PUR protein sequence
fastacmd.exe         1,576,960  Program      Sequence retrieval tool
fastacmd.txt             5,122  Read me      Fastacmd readme
firewall.html            7,169  Web page     Information on firewall configuration
formatdb.exe         1,642,496  Program      For formatting BLAST databases
formatdb.txt            29,982  Read me      Formatdb readme
megablast.exe        1,961,984  Program      Faster nucleotide BLAST program
megablast.txt           10,879  Read me      Megablast readme
netblast.txt            55,706  Read me      Blastcl3 readme
nr.phr             481,209,789  Database     Protein nr database file
nr.pin              14,627,744  Database     Protein nr database file
nr.pnd              27,251,696  Database     Protein nr database file
nr.pni                 106,500  Database     Protein nr database file
nr.psd             355,528,553  Database     Protein nr database file
nr.psi               7,748,498  Database     Protein nr database file
nr.psq             607,788,962  Database     Protein nr database file
query_aa.txt             5,139  Text         Protein query sequences
query_nt.txt            10,505  Text         Nucleotide query sequences
rpsblast.exe         1,945,600  Program      Reverse psi-blast for CDD search
rpsblast.txt            10,126  Read me      Rpsblast readme
seedtop.exe          1,888,256  Program      Not used in this class
swissprot.00.msk       228,562  Database     Swissprot database (masking file)
swissprot.pal              255  Database     Swissprot database (alias file)
worm_prt.txt               606  Text         Worm protein sequence
yeast.nt            12,308,631  Text         Yeast genome contigs

 

 

II.  Setup

 

1. Downloading the archive

 

Point your browser to the URL below, then change to the appropriate subdirectory: 

ftp://ftp.ncbi.nlm.nih.gov/pub/FieldGuide/FGPlus/August2004/Blast/Software/

 

LINUX: DB/blast229.tar.gz [600 MB] -- local copy

 

To save the archive, right-click the file, select “save link target as ...”, and follow the prompt to save the archive to the C:\ drive.

 

LINUX: Follow similar instructions as outlined in Seq Analysis for Raghava

 

2.  Installing the archive

 

Right-click the saved archive and choose “Extract to” from the popup menu.  In the popup window, select C:\ as the target, then click the “Extract” button to install the programs and the needed files.

 

This will place the blast229 directory in the C:\ directory with the list of files described in Table 1.

 

LINUX: Follow similar instructions as outlined in Seq Analysis for Raghava

 

3. Executing the program

 

If you double-click a program icon, you will see a window flash by briefly. This happens because none of the programs in the archive has a graphical interface. They must all be executed from the command line prompt.

 

To do this, we need to open a “Command Prompt” window, which is under “Start >> Programs >> Accessories >> Command Prompt”.  In this window, type “cd\” then “cd blast229” to get to the blast229 directory.  Programs can then be executed using the program name with appropriate options. 

 

Options for a given program can be displayed on the screen by typing the program name followed by a space, a dash, and the Return key.  For example, the following displays the blastall command line options:

            blastall - [return]

 

The complete program options for the commonly used blast programs are listed in the appendices.

 

III. Exercises

 

1. Client server netblast (blastcl3)

 

As mentioned in lecture, blastcl3 is a client server program that formulates and submits BLAST searches and retrieves the results. The actual search is performed at NCBI.

 

It has batch capability and can perform a variety of BLAST searches similar to blastall (standalone BLAST).  Only two options need to be specified for a BLAST search:

 

            -p         program: blastp, blastn, blastx, tblastn, tblastx

            -i          the input text file with FASTA formatted sequence(s).

           

Other commonly used options include:

            -d         database. The default is nr (you may want to search other databases)

            -o         output file name (to save result rather than viewing it on the screen)

            -F        Filter options. The default is T. To disable filter/masking, use -F F

            -e         e-value significance cut-off. Default is 10.

            -W       Word size. Affects speed and sensitivity. Default 11 (nt) or 3 (prt).
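For batch work it can help to assemble these options in a script. The following is a minimal Python sketch, assuming blastcl3 is on the PATH; the helper names build_blastcl3_command and run_blastcl3 are hypothetical and are not part of the BLAST package. Only the options listed above are used.

```python
import subprocess


def build_blastcl3_command(program, query_file, database=None, output=None,
                           filter_query=None, evalue=None, word_size=None):
    """Return the blastcl3 argument list for the options listed above."""
    # -p and -i are the only options that must be specified.
    cmd = ["blastcl3", "-p", program, "-i", query_file]
    if database is not None:
        cmd += ["-d", database]
    if output is not None:
        cmd += ["-o", output]
    if filter_query is not None:
        # -F takes T or F; leaving it out keeps the default (T).
        cmd += ["-F", "T" if filter_query else "F"]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    if word_size is not None:
        cmd += ["-W", str(word_size)]
    return cmd


def run_blastcl3(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)
```

For example, build_blastcl3_command("blastp", "worm_prt.txt", database="swissprot", output="worm") produces the argument list for the first search in the exercise below.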

 

  A. Protein-protein BLAST search

 

Start with a simple search to get used to the command line options and to see the effect of low complexity filtering. We will search with a single sequence against the swissprot database to identify the protein in the worm_prt.txt file.  The command line to use is:

 

blastcl3 -p blastp -d swissprot -i worm_prt.txt -o worm

 

Open the output file “worm” with a text editor (Notepad or similar).

-         Is the top match identical to our query?

-         Notice the “X’s” that have replaced a low complexity region in the query.

 

Now re-run the search with low complexity filtering disabled, using the following command line:

 

blastcl3 -p blastp -d swissprot -i worm_prt.txt -F F -o worm2

 

Examine the results again. Notice the effects of disabling low complexity filtering.
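One way to quantify the difference between the two runs is to count the masked residues shown on the query lines. This is a small Python sketch, assuming the default pairwise output in which each query line has the layout "Query: start sequence end"; the function name is hypothetical. Because the query is repeated in every alignment, the count is a total over the whole output file, and any genuine X residues in the query would be counted as well.

```python
def count_masked_query_residues(blast_text):
    """Count 'X' characters on the 'Query:' lines of pairwise BLAST output."""
    total = 0
    for line in blast_text.splitlines():
        fields = line.split()
        # Expected layout: Query: <start> <sequence> <end>
        if len(fields) == 4 and fields[0] == "Query:":
            total += fields[2].upper().count("X")
    return total
```

Reading the files “worm” and “worm2” and passing their text to this function should give a non-zero count for the filtered run and, if the query itself contains no X residues, zero for the unfiltered run.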

 

  B.  Nucleotide-nucleotide BLAST: batch searching and non-standard databases

 

1) Batch search against the polymorphism database (dbSNP)

 

The query file query_nt.txt contains 39 potential SNPs. We’ll first use blastn to see if these polymorphisms are already present in dbSNP. In order to use non-standard BLAST databases at NCBI (those not listed on the nucleotide-nucleotide or protein-protein BLAST pages) you’ll need the correct name and path for the database. A list of database paths is found in the netblast.txt file. For example, the path to the SNP database is snp/snp. Try the following command line to search against the snp database:

blastcl3 -p blastn -d snp/snp -i query_nt.txt -F F -o snp-out -v 20 -b 20

 

If this search takes too long, you can try the shorter nucleotide query file provided at www.ncbi.nlm.nih.gov/staff/tao/URLAPI/SNP_1.txt.  Open it in a web browser and save the page as text only to your BLAST directory.

Use the following command line to run the search:

 

blastcl3 -p blastn -d snp/snp -i SNP_1.txt -F F -o snp-out -v 20 -b 20

 

Do you see where the SNP is? It is marked by the r in the query. An “r to r” match indicates that this SNP was identified already. To facilitate the analysis, we can also use the “-m 1” option to display the multiple matches in a stacked up view.
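The letter r is the IUPAC ambiguity code for a purine (a or g), which is how the two alleles of the SNP are written in a single sequence. The Python sketch below lists the ambiguity codes in a nucleotide sequence; the function name is hypothetical, and positions are counted from the start of the string passed in, so add the alignment offset if you paste only part of a query.

```python
# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC_AMBIGUITY = {
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}


def find_ambiguity_codes(sequence):
    """Return (1-based position, code, possible bases) for each ambiguity code."""
    hits = []
    for position, base in enumerate(sequence.lower(), start=1):
        if base in IUPAC_AMBIGUITY:
            hits.append((position, base, IUPAC_AMBIGUITY[base]))
    return hits
```

Applied to the query segment in the alignment shown further down, which starts at query position 301, the function reports the r at position 1 of the segment, that is, position 301 of the query.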

 

We are running the search with the low-complexity filter off. This is reasonable where the goal is to find an exact match and we’re not concerned with BLAST statistics. Also, the position containing the SNP may fall within a masked region.  One drawback of turning filtering off is that it usually extends the search time. When searching with an input file that contains multiple queries, we can speed things up by invoking the megablast algorithm (-n T):

 

blastcl3 -p blastn -d snp/snp -i query_nt.txt -F F -o snp-out2 -m 1 -n T

 

The alignments in pair-wise and stacked pairwise views are given below for your reference.

                   

Query: 301 rccccacggggactacatcgagttcccctgctaccgctggatcaccggcgatgtcgaggt 360

           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 501 rccccacggggactacatcgagttcccctgctaccgctggatcaccggcgatgtcgaggt 560

 

QUERY         301  rccccacggggactacatcgagttcccctgctaccgctggatcaccggcgatgtcgaggt 360

rs2228064...> 501  ............................................................ 560

rs11239498..> 892  g..................................................          942

rs12252472..> 908  g...........................                                 935

 

2) Map SNPs onto the genome: Entrez query limitation

 

We can use netblast to map our SNPs onto the human genome sequence. There are two different databases that contain the assembled human genome sequence. One is the chromosome database that is available on the standard BLAST Web forms. The other is the human contig database “genome”, which is the default on the specialized human genome BLAST page. The chromosome database is made of the NCBI chromosome reference sequences, the NC_ accession series. These include complete microbial, organelle and plasmid genomes, as well as gap-filled (with N’s) chromosome records for certain higher genomes, notably the fruit fly, Arabidopsis, C. elegans and the reference human chromosomes. The human contig database contains the NCBI assembled sequences without gap filling; these are the segments shown in the NCBI map viewer and human genome BLAST.

  

The BLAST chromosome database contains much sequence from other organisms that is irrelevant to our present search. As with the Web service, we could use an Entrez query to restrict the search to human sequences, which reduces the search time and eliminates irrelevant hits. The Entrez query limitation is set through the -u option; the syntax would be -u "human[orgn]". We won’t use the -u option in this example, because a separate human_genomic database that contains only the human chromosomes serves the same purpose. Try the following command line:

 

blastcl3 -p blastn -d human_genomic -i SNP_1.txt -o chr_out -n T

 

The -F option can also specify the type of filtering explicitly: the m argument specifies “mask for lookup table only”, L specifies the low complexity filter, and R masks interspersed repeats such as Alus. Searches against human genomic DNA may be extremely slow if repetitive elements are not masked. The following command line will search the assembled human contigs:

 

blastcl3 -p blastn -d hs_genome/genome -i SNP_1.txt -o contig_out -n T

 

Compare these results with those from the chromosome database. Do you hit the same regions? Can you easily compare the coordinate systems? These searches could take several minutes.

 

 

2. Standalone blast

 

Running a BLAST search requires three components: the query, the program, and a database.  With Netblast and the Web BLAST services, including the URL-API, you don’t need to maintain the databases locally or have the BLAST engine. As discussed in lecture, though, there are circumstances where setting up a local BLAST service is warranted: for example, where a large amount of sequence data is generated locally and must be processed quickly, or where there are proprietary considerations. The two options NCBI offers for local BLAST searching are the command line standalone BLAST and the BLAST web server, WWWBLAST. Both require the maintenance of local databases and typically require dedicated high capacity hardware. The standard sets of databases from NCBI are downloadable from ftp.ncbi.nih.gov/blast/db/. In the archive for this class we provide the nr protein database and the swissprot database (as a mask file to nr). Searches against large nucleotide databases require server class machines to run in a reasonable time; however, more modest searches against protein sequences or other smaller datasets can be run on laptop computers, as we demonstrate below.

 

  A. Standalone protein BLAST

 

The query file, query_aa.txt, contains 12 human proteins associated with the inflammatory response. We’ll use standalone blastp to search against a local copy of Swissprot. The Swissprot database, as provided here and in ftp.ncbi.nih.gov/blast/db/, is a mask of the parent database nr, so the full nr database is required as well. Searches against Swissprot are useful because the records tend to be reliable and highly informative, and the database is small and rapidly searched.

 

If you use the option ‘-I T’, which adds gi numbers to the deflines in the output, then you can use a parser (below) to make a file of gi numbers. These gi numbers can then be used to retrieve the database hits in various formats.

 

blastall -p blastp -d swissprot -i query_aa.txt -o query_aa.out -b 5 -v 5 -I T

 

-         What species are represented in the output? It’s not always easy to tell. A list of gi numbers for the database hits can be parsed from the output and used in Batch Entrez to retrieve taxonomy links, FASTA sequences, etc. A JavaScript based parser is available at:

http://www.ncbi.nlm.nih.gov/staff/tao/tools/parser4.html

 

-         Copy and paste the one-line descriptions for the matched sequences into the text area, check the “Protein” button, and click the “Get Sequence” button to retrieve the sequences. In the result page, change the display to “Taxonomy Links” and click the “display” button again to see the organisms with hits to your query sequence.
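If you prefer to build the gi list locally rather than with the JavaScript parser, the same extraction can be sketched in Python. This assumes the search was run with ‘-I T’ so that deflines in the output carry identifiers of the form gi|number|...; the function name is hypothetical. Any gi numbers present in the query deflines would be picked up too, so remove those if they appear.

```python
import re

# Matches the gi field of an NCBI-style identifier such as gi|12345|sp|...
GI_PATTERN = re.compile(r"gi\|(\d+)\|")


def extract_gi_numbers(blast_text):
    """Return the unique gi numbers found in BLAST output, in order of appearance."""
    seen = set()
    ordered = []
    for match in GI_PATTERN.finditer(blast_text):
        gi = match.group(1)
        if gi not in seen:
            seen.add(gi)
            ordered.append(gi)
    return ordered
```

Writing the returned numbers one per line to a text file gives a list that can be uploaded to Batch Entrez.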

 

  B. Formatdb and fastacmd

 

Let’s create a BLAST-ready nucleotide database from yeast.nt, a file of FASTA-formatted contig sequences.

 

We’ll run formatdb with and without the ‘-o T’ option:

 

            formatdb -i yeast.nt -p F

            formatdb -i yeast.nt -p F -o T -n yeast.nt2

 

-         Note the difference in index files created by formatdb.

-         Below, we’ll use fastacmd to retrieve a sequence from formatted yeast.nt2. This is only possible when ‘-o T’ has been set.

 

1)      Use the yeast nucleotide database to identify potential matches to a worm protein

 

To search a protein query against a nucleotide database, use tblastn. Run a search against both versions of yeast.nt to see another effect of ‘-o T’:

 

blastall -p tblastn -i worm_prt.txt -d yeast.nt -o worm_nt.out

blastall -p tblastn -i worm_prt.txt -d yeast.nt2 -o worm_nt2.out

 

-         Compare the definition lines created by searching yeast.nt and yeast.nt2

-         Are these biologically significant hits?

 

2)      Use fastacmd for sequence retrieval

 

The first match is to a portion of NC_001142. Using this string we can retrieve this sequence from a formatted database using fastacmd.

 

fastacmd -s NC_001142 -d yeast.nt2 | more

 

            A more practical use of fastacmd might be to retrieve a defined piece of NC_001142. Use the ‘-L’ option to specify the sequence range encompassing the first alignment. Note that this hit was to “Frame -1”. Use the ‘-S’ option to indicate the strand:

 

fastacmd -s NC_001142 -d yeast.nt2 -L 333140,333250 -S 2
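To check the retrieved piece by hand, the range and strand selection can be imitated in Python. This is only a sketch under two assumptions: that the -L coordinates are 1-based, inclusive positions on the plus strand, and that -S 2 returns the reverse complement of that range. Consult fastacmd.txt for the exact behavior; the function name is hypothetical.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_range(sequence, start, stop, strand=1):
    """Return sequence[start..stop] (1-based, inclusive); strand 2 gives the reverse complement."""
    if not 1 <= start <= stop <= len(sequence):
        raise ValueError("range must satisfy 1 <= start <= stop <= sequence length")
    piece = sequence[start - 1:stop]
    if strand == 2:
        # Complement each base, then reverse the order.
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece
```

For the toy sequence AACCGGTT, the range 2 to 5 is ACCG on the plus strand and CGGT on the minus strand.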

 

 

C. Running standalone PSI-BLAST (blastpgp)

 

 

Position-Specific Iterated BLAST (PSI-BLAST) is implemented in standalone BLAST as the executable program blastpgp. As with blastall, you can list all command line options by typing a dash after the program name.

 

 blastpgp -

 

The blastpgp program can write out a position specific score matrix in a human readable format. In this example, we'll create such a matrix using the sequences collected by searching with inositol polyphosphate 1 phosphatase (IPPase). We’ll use the bovine IPPase sequence included in the file bovine_ipp.txt as a query.  Inositol monophosphatases and related enzymes contain conserved acidic residues that are essential for binding metal ions (Mg++). The PSSM generated should show high self-substitution scores for the residues in these positions.

 

The following command line will run blastpgp for 4 iterations with IPPase against the swissprot database and write out the PSSM. 

 

blastpgp -i bovine_ipp.txt -d swissprot -j 4 -Q pssm.txt

 

Unlike the web version, blastpgp requires you to specify the number of iterations in advance. You can use the -C option to create a checkpoint file so that you can pick up the search and perform additional iterations if desired.

 

Open pssm.txt using a text editor or a web browser. Can you identify three conserved acidic residues? Compare their self-substitution scores in the PSSM with those in BLOSUM62. There are also two other residues that are conserved. What are they? Confirm these findings by using CD-search on the web with IPPase. Display the alignment for the inositol_P domain and look for these conserved residues.
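For the comparison with BLOSUM62, the diagonal (self-substitution) scores of that matrix are tabulated in the Python sketch below, together with a hypothetical helper that flags positions whose PSSM self-score exceeds the BLOSUM62 value. The helper does not parse pssm.txt; it expects (position, residue, self-score) tuples that you read off the file yourself.

```python
# Self-substitution (diagonal) scores of the BLOSUM62 matrix.
BLOSUM62_SELF = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6,
    "H": 8, "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4,
    "T": 5, "W": 11, "Y": 7, "V": 4,
}


def enriched_self_scores(pssm_self_scores, margin=1):
    """Return (position, residue, pssm_score, blosum62_score) where the PSSM
    self-score exceeds the BLOSUM62 self-score by at least `margin`."""
    flagged = []
    for position, residue, score in pssm_self_scores:
        baseline = BLOSUM62_SELF[residue.upper()]
        if score - baseline >= margin:
            flagged.append((position, residue, score, baseline))
    return flagged
```

Conserved aspartate and glutamate positions would be expected to score above 6 and 5, respectively, the values they have in BLOSUM62.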

 

3. WWWBLAST

 

For this course, we demonstrate an existing WWWBLAST installation. The actual setup requires a web server (such as Apache) and fairly straightforward configuration (see ftp://ftp.ncbi.nlm.nih.gov/blast/documents/wwwblast.txt).

 

Go to:   www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blast/.

 

Select the second link (Regular BLAST with client-server support), which allows one to input a gi/accession number as query.

 

The following steps perform a test search with the query F12345 against test_na_db:

-         keep the program (blastn) and database (test_na_db) as they are

-         change “sequence in FASTA format” to “Accessions or GI”

-         type F12345 in the text box

-         click the “Search” button to initiate the search

-         The page will change to the Results display, similar to an NCBI web BLAST result

 

To add additional databases into the pull down list, we need to do the following:

 

-         Format the database using formatdb. We have previously done this in the standalone blast section.

-         Edit the blast.rc file to link the database with appropriate blast program

 

# Number of CPUs to use for a single request

#

NumCpuToUse     4

#

# Here are list of combinations program/database,

# that allowed by BLAST service. Format: <program> <db> <db> ...

#

blastn test_na_db yeast

blastp test_aa_db yeast.aa

blastx test_aa_db yeast.aa

tblastn test_na_db yeast

tblastx test_na_db yeast

 

-         edit blast_cs.html to add the new database to the pull down list

 

<select name = "DATALIB">

   <option VALUE = "test_na_db"> test_na_db

   <option VALUE = "test_aa_db"> test_aa_db

   <option VALUE = "yeast"> yeast nt           

   <option VALUE = "yeast.aa"> yeast aa        

</select>

 

NOTE: the yeast and yeast.aa entries are the newly added databases.
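Because the database names must agree between blast.rc and blast_cs.html, a quick consistency check can save some debugging. The Python sketch below assumes blast.rc has the layout shown above (comment lines starting with #, and one line per program followed by its databases); the function names are hypothetical.

```python
BLAST_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def parse_blast_rc(text):
    """Return {program: [databases]} from blast.rc-style text."""
    allowed = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and comments; other settings (e.g. NumCpuToUse) are ignored.
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] in BLAST_PROGRAMS:
            allowed[fields[0]] = fields[1:]
    return allowed


def missing_from_menu(allowed, menu_values):
    """Return databases configured in blast.rc but absent from the pull down list."""
    configured = {db for dbs in allowed.values() for db in dbs}
    return sorted(configured - set(menu_values))
```

Passing the option values from the DATALIB select element as menu_values shows which databases still need an entry in the pull down list.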

 

4. URL API

 

The BLAST URLAPI uses a standardized API (Application Programming Interface) to access the NCBI Qblast system. It makes direct HTTP-encoded requests to bypass the use of a browser, making it easily incorporated into automated scripts. For documentation see: http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html

 

This tool uses URLs to send encoded commands to the BLAST server through Blast.cgi, both for submitting searches and for requesting results. There are three steps:

 

-         Posting the search request to Blast.cgi using the “put” command

-         Getting the returned HTML document and extracting the RID

-         Formatting the result using the RID and the “get” command

 

Some batch capability and automation can be achieved with custom scripts.  Scripts should have a built-in 3 sec delay between requests and should also include error and status checking.
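The three steps, with the delay and status checking recommended above, can be sketched in Python as follows. This is an illustration only, not the provided Perl scripts. The endpoint URL, the parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE, FORMAT_OBJECT) and the "RID = ..." and "Status=..." markers are assumptions based on the URL API documentation and should be checked against it before use; all function names are hypothetical.

```python
import re
import time
import urllib.parse
import urllib.request

BLAST_CGI = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"  # assumed endpoint


def build_put_params(program, database, query):
    """Step 1: encode the search request for the Put command."""
    return urllib.parse.urlencode(
        {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": query})


def parse_rid(put_response):
    """Step 2: extract the request ID (RID) from the returned document."""
    match = re.search(r"RID = (\S+)", put_response)
    return match.group(1) if match else None


def parse_status(info_response):
    """Extract the search status (e.g. WAITING, READY) from a SearchInfo reply."""
    match = re.search(r"Status=(\w+)", info_response)
    return match.group(1) if match else None


def fetch(params):
    """POST url-encoded parameters to Blast.cgi and return the reply as text."""
    with urllib.request.urlopen(BLAST_CGI, data=params.encode("ascii")) as reply:
        return reply.read().decode("utf-8", errors="replace")


def run_search(program, database, query, delay=3, max_polls=100):
    """Submit a search, poll until it is ready, and return the text report."""
    rid = parse_rid(fetch(build_put_params(program, database, query)))
    if rid is None:
        raise RuntimeError("no RID found in the Put response")
    info = urllib.parse.urlencode(
        {"CMD": "Get", "FORMAT_OBJECT": "SearchInfo", "RID": rid})
    for _ in range(max_polls):
        time.sleep(delay)  # wait between requests, as recommended above
        status = parse_status(fetch(info))
        if status == "READY":
            # Step 3: format the result using the RID and the Get command.
            return fetch(urllib.parse.urlencode(
                {"CMD": "Get", "FORMAT_TYPE": "Text", "RID": rid}))
        if status != "WAITING":
            raise RuntimeError("search %s ended with status %s" % (rid, status))
    raise TimeoutError("search %s was not ready after %d polls" % (rid, max_polls))
```

Keeping the parsing helpers separate from the network calls makes it possible to test them on saved responses without contacting the server.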

 

We have provided some Perl scripts (www.ncbi.nlm.nih.gov/staff/tao/URLAPI/) that require the installation of the ActivePerl package. These scripts take a protein or nucleotide FASTA file, formulate the searches, and send them to the BLAST server.  The issued RIDs are parsed out and then used for result retrieval.  The provided scripts check for results only once.  Ideally, there should be additional error checking and repeated polling for results.


Last updated: Friday, October 1, 2004 18:45 (Delhi Time GMT+5:30)