
Alternatives to Web BLAST

Hands-on session - Self Study

October 1, 2004

I. Introduction

We have prepared a package of BLAST programs, databases, query files, and documentation that provide practice in using the standalone BLAST and Netblast (blastcl3) programs. The packages, one for PC and one for Mac, can be downloaded from the course ftp directory:

ftp://ftp.ncbi.nlm.nih.gov/pub/FieldGuide/FGPlus/August2004/Blast/Software

LINUX: DB/blast229.tar.gz [600 MB] -- local copy

The contents of the PC archive, blast.zip, are shown in Table 1. The contents of the Mac archive, blast.tar.gz, are equivalent.

Table 1. Contents of the BLAST archive (blast.zip)
Name	Size	Type	Purpose
bl2seq.exe	2,039,808	Program	Compares two sequences
blast.txt	70,980	Read me	blastall/blastpgp readme
blastall.exe	1,970,176	Program	General purpose blast program
blastcl3.exe	2,154,496	Program	Netblast client-server program
blastn_api.pl	4,690	Perl script	Nucleotide blast search script
blastpgp.exe	2,154,496	Program	Standalone PSI-BLAST
blastp_api.pl	6,251	Perl script	Protein blast search script
bovine_ipp.txt	475	Text	Bovine IPP protein sequence
copymat.exe	1,470,464	Program	Not needed in this class
Data	-	<DIR>	Directory with matrix files
ecoli_pur.txt	659	Text	E. coli PUR protein sequence
fastacmd.exe	1,576,960	Program	Sequence retrieval tool
fastacmd.txt	5,122	Read me	Fastacmd readme
firewall.html	7,169	Web page	Information on firewall configuration
formatdb.exe	1,642,496	Program	For formatting BLAST databases
formatdb.txt	29,982	Read me	Formatdb readme
megablast.exe	1,961,984	Program	Faster nucleotide BLAST program
megablast.txt	10,879	Read me	Megablast readme
netblast.txt	55,706	Read me	Blastcl3 readme
nr.phr	481,209,789	Database	Protein nr database file
nr.pin	14,627,744	Database	Protein nr database file
nr.pnd	27,251,696	Database	Protein nr database file
nr.pni	106,500	Database	Protein nr database file
nr.psd	355,528,553	Database	Protein nr database file
nr.psi	7,748,498	Database	Protein nr database file
nr.psq	607,788,962	Database	Protein nr database file
query_aa.txt	5,139	Text	Protein query sequences
query_nt.txt	10,505	Text	Nucleotide query sequences
rpsblast.exe	1,945,600	Program	Reverse psi-blast for CDD search
rpsblast.txt	10,126	Read me	Rpsblast readme
seedtop.exe	1,888,256	Program	Not used in this class
swissprot.00.msk	228,562	Database	Swissprot database (masking file)
swissprot.pal	255	Database	Swissprot database (alias file)
worm_prt.txt	606	Text	Worm protein sequence
yeast.nt	12,308,631	Text	Yeast genome contigs

II. Setup

1. Downloading the archive

Point your browser to the URL below, then change to the appropriate subdirectory:

ftp://ftp.ncbi.nlm.nih.gov/pub/FieldGuide/FGPlus/August2004/Blast/Software/

LINUX: DB/blast229.tar.gz [600 MB] -- local copy

To save the archive, right click on the file and select “save link target as ...” and follow the prompt to save the archive to C:\ drive.

LINUX: Follow similar instructions as outlined in Seq Analysis for Raghava

2. Installing the archive

Right-click the saved archive and choose “Extract to” from the popup menu. In the popup window, select C:\ as the target, then click the “Extract” button to install the programs and the needed files.

This will place the blast229 directory in the C:\ directory with the list of files described in Table 1.

LINUX: Follow similar instructions as outlined in Seq Analysis for Raghava

3. Executing the program

If you double-click a program icon, a window will flash by quickly and close. This happens because none of the programs in the archive has a graphical interface. They all need to be executed from the command line prompt.

To do this, we need to open a “Command Prompt” window, which is under “Start >> Programs >> Accessories >> Command Prompt”. In this window, type “cd\” then “cd blast229” to get to the blast229 directory. Programs can then be executed using the program name with appropriate options.

Options for a given program can be displayed on the screen by typing the program name followed by a space, a dash, and the return key. For example, this will display the blastall command line options:

blastall - [return]

The complete program options for the commonly used blast programs are listed in the appendices.

III. Exercises

1. Client server netblast (blastcl3)

As mentioned in lecture, blastcl3 is a client server program that formulates and submits BLAST searches and retrieves the results. The actual search is performed at NCBI.

It has batch capability and can perform a variety of BLAST searches similar to blastall (standalone BLAST). Only two options need to be specified for a BLAST search:

-p program: blastp, blastn, blastx, tblastn, tblastx

-i the input text file with FASTA formatted sequence(s).

Other commonly used options include:

-d database. The default is nr (you may want to search other databases)

-o output file name (to save result rather than viewing it on the screen)

-F Filter options. The default is T. To disable filter/masking, use -F F

-e e-value significance cut-off. Default is 10.

-W Word size. Affects speed and sensitivity. Default 11 (nt) or 3 (prt).
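For batch work it can be convenient to assemble these options in a script rather than retyping them. The following is a minimal Python sketch, not part of the course package: the function name and its keyword arguments are illustrative, and only the flags listed above (-p, -i, -d, -o, -F, -e, -W) are used.

```python
import shlex

def build_blastcl3_command(program, query_file, database=None, output=None,
                           filter_on=None, evalue=None, word_size=None):
    """Return a blastcl3 argument list using only the options listed above."""
    # -p and -i are the two options that must always be given.
    cmd = ["blastcl3", "-p", program, "-i", query_file]
    if database is not None:
        cmd += ["-d", database]
    if output is not None:
        cmd += ["-o", output]
    if filter_on is not None:
        # -F takes T or F; the program default is T.
        cmd += ["-F", "T" if filter_on else "F"]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    if word_size is not None:
        cmd += ["-W", str(word_size)]
    return cmd

cmd = build_blastcl3_command("blastp", "worm_prt.txt",
                             database="swissprot", output="worm")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from within the blast229 directory; options left as None fall back to the program defaults.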

Protein-protein blast search

Start with a simple search to get used to the command line options and to see the effect of low complexity filtering. We will search with a single sequence against the swissprot database to identify the protein in the worm_prt.txt file. The command line to use is:

blastcl3 -p blastp -d swissprot -i worm_prt.txt -o worm

Open the output file “worm” with a text editor (Notepad or similar).

- Is the top match identical to our query?

- Notice the “X’s” that have replaced a low complexity region in the query.

Now re-run the search using the following command line, with low complexity filtering disabled:

blastcl3 -p blastp -d swissprot -i worm_prt.txt -F F -o worm2

Examine the results again. Notice the effects of disabling low complexity filtering.
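One rough way to quantify the difference between the two runs is to count the masked residues shown in the query lines of each output file. This is a minimal Python sketch under two assumptions: alignment lines follow the “Query: start sequence end” layout of the default pairwise view, and every X in a query line comes from masking (a query that genuinely contains X residues would inflate the count). The sample text is invented for illustration.

```python
def count_masked_query_residues(blast_text):
    """Count 'X' characters in the sequence field of 'Query:' alignment lines."""
    masked = 0
    for line in blast_text.splitlines():
        if line.startswith("Query:"):
            fields = line.split()
            # Expected fields: "Query:", start, sequence, end
            if len(fields) == 4:
                masked += fields[2].upper().count("X")
    return masked

# Invented example of one alignment block from a filtered search.
sample = """\
Query: 1   MSTXXXXXXXXKLV 14
           MST        KLV
Sbjct: 1   MSTPPPPPPPPKLV 14
"""
print(count_masked_query_residues(sample))

# With real output files from the two searches:
# print(count_masked_query_residues(open("worm").read()))
# print(count_masked_query_residues(open("worm2").read()))
```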

Nucleotide-nucleotide BLAST: batch searching and non-standard databases

1) Batch search against the polymorphism database (dbSNP)

The query file query_nt.txt contains 39 potential SNPs. We’ll first use blastn to see if these polymorphisms are already present in dbSNP. In order to use non-standard BLAST databases at NCBI (those not listed on the nucleotide-nucleotide or protein-protein BLAST pages) you’ll need the correct name and path for the database. A list of database paths is found in the netblast.txt file. For example, the path to the SNP database is snp/snp. Try the following command line to search against the snp database:

blastcl3 -p blastn -d snp/snp -i query_nt.txt -F F -o snp-out -v 20 -b 20
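Before or after a batch run, it can be useful to confirm how many queries the input file actually holds. A minimal Python sketch, assuming standard FASTA in which each record begins with a “>” definition line; the function name is illustrative.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting definition lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))

# Small in-memory example with two records.
example = [">query_1", "acgtacgt", ">query_2", "ggccttaa"]
print(count_fasta_records(example))

# With the course file (run from the blast229 directory):
# with open("query_nt.txt") as handle:
#     print(count_fasta_records(handle))
```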

If this search takes too long you can try the shorter nucleotide query file provided at the following URL www.ncbi.nlm.nih.gov/staff/tao/URLAPI/SNP_1.txt. Open this with the web browser and save the page as text only to your BLAST directory.

Use the following command line to run the search.

blastcl3 -p blastn -d snp/snp -i SNP_1.txt -F F -o snp-out -v 20 -b 20

Do you see where the SNP is? It is marked by the r in the query. An “r to r” match indicates that this SNP was identified already. To facilitate the analysis, we can also use the “-m 1” option to display the multiple matches in a stacked up view.
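The r in the query is an IUPAC ambiguity code meaning “a or g”. A small Python sketch for locating such codes in a sequence line follows; the table holds the standard IUPAC nucleotide codes, the function name is illustrative, and the query segment is the one shown in the alignment in this section.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC_AMBIGUITY = {
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

def ambiguous_positions(sequence, start=1):
    """Return (position, code, possible bases) for each ambiguity code found."""
    hits = []
    for offset, base in enumerate(sequence.lower()):
        if base in IUPAC_AMBIGUITY:
            hits.append((start + offset, base, IUPAC_AMBIGUITY[base]))
    return hits

# Query segment from the alignment, which starts at query position 301.
query_line = "rccccacggggactacatcgagttcccctgctaccgctggatcaccggcgatgtcgaggt"
print(ambiguous_positions(query_line, start=301))
# [(301, 'r', 'ag')]
```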

We are running the search with the low-complexity filter off. This is reasonable where the goal is to find an exact match and we’re not concerned with BLAST statistics. Also, the position containing the SNP may fall within a masked region. One drawback of turning filtering off is that it usually extends the search time. When searching with an input file that contains multiple queries, we can speed things up by invoking the megablast algorithm (-n T):

blastcl3 -p blastn -d snp/snp -i query_nt.txt -F F -o snp-out2 -m 1 -nT

The alignments in pair-wise and stacked pairwise views are given below for your reference.

Query: 301 rccccacggggactacatcgagttcccctgctaccgctggatcaccggcgatgtcgaggt 360
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 501 rccccacggggactacatcgagttcccctgctaccgctggatcaccggcgatgtcgaggt 560

QUERY 301 rccccacggggactacatcgagttcccctgctaccgctggatcaccggcgatgtcgaggt 360

rs2228064...> 501 ............................................................ 560

rs11239498..> 892 g.................................................. 942

rs12252472..> 908 g........................... 935

2) Map SNPs onto the genome: Entrez query limitation

We can use netblast to map our SNPs onto the human genome sequence. There are two different databases that contain the assembled human genome sequence. One is the chromosome database that is available on the standard BLAST Web forms. The other is the human contig database “genome”, which is the default on the specialized human genome BLAST page. The chromosome database is made of the NCBI chromosome reference sequences, the NC_ accession series. These include complete microbial, organelle and plasmid genomes, as well as gap-filled (with N’s) chromosome records for certain higher genomes, notably the fruit fly, Arabidopsis, C. elegans and the reference human chromosomes. The human contig database contains the NCBI assembled sequences without gap filling; these are the segments shown in the NCBI map viewer and human genome BLAST.

The BLAST chromosome database contains much sequence from other organisms that is irrelevant to our present search. As with the Web service, we could use an Entrez query to restrict the search to human sequences only, which reduces the search time and eliminates irrelevant hits. The Entrez query limitation is set with the -u option; the syntax would be -u "human[orgn]". We won’t use the -u option in this example, because for human there is a separate human_genomic database that contains only the human chromosomes and serves the same purpose. Try the following command line:

blastcl3 -p blastn -d human_genomic -i SNP_1.txt -o chr_out -nT

The filter option (-F) can also specify the type of filtering explicitly: in the filter string, the m argument specifies “mask for lookup table only”, L specifies the low complexity filter, and R masks interspersed repeats such as Alus. Searches against human genomic DNA may be extremely slow if repetitive elements are not masked. The following command line will search the assembled human contigs:

blastcl3 -p blastn -d hs_genome/genome -i SNP_1.txt -o contig_out -nT

Compare these results with those from the chromosome database. Do you hit the same regions? Can you easily compare the coordinate systems? These searches could take several minutes.

2. Standalone blast

Running a BLAST search requires three components: the query, the program, and a database. With Netblast and the Web BLAST services, including the URL-API, you don’t need to maintain the databases locally or have the BLAST engine. As discussed in lecture, though, there are circumstances where setting up a local BLAST service is warranted; for example, where a large amount of sequence data is generated locally and must be processed quickly, or where there are proprietary considerations. The two options NCBI offers for local BLAST searching are the command line standalone BLAST and the BLAST web server, WWWBLAST. Both require the maintenance of local databases and typically require dedicated high capacity hardware. The standard sets of databases from NCBI are downloadable from ftp.ncbi.nih.gov/blast/db/. In the archive for this class we provided the nr protein database and the swissprot database (as a mask file to nr). Searches against large nucleotide databases require server class machines to run in a reasonable time; however, more modest searches against protein sequences or other smaller datasets can be run on laptop computers, as we demonstrate below.

Standalone protein BLAST

The query file, query_aa.txt, contains 12 human proteins associated with the inflammatory response. We’ll use standalone blastp to search against a local copy of Swissprot. The Swissprot database, as provided here and in ftp.ncbi.nih.gov/blast/db/, is a mask of the parent database nr, so the full nr database is required as well. Searches against Swissprot are useful because the records tend to be reliable and highly informative, and the database is small and rapidly searched.

If you use the option ‘-I T’, which adds gi numbers to the deflines in the output, then you can use a parser (below) to make a file of gi numbers. These gi numbers can then be used to retrieve the database hits in various formats.

blastall -p blastp -d swissprot -i query_aa.txt -o query_aa.out -b 5 -v 5 -I T

- What species are represented in the output? It’s not always easy to tell. A list of gi numbers for the database hits can be parsed from the output and used in Batch Entrez to retrieve taxonomy links, FASTA sequences, etc. A JavaScript based parser is available at:

http://www.ncbi.nlm.nih.gov/staff/tao/tools/parser4.html

- Copy and paste the one-line descriptions for the matched sequences into the text area, check the “Protein” button, and click the “Get Sequence” button to retrieve the sequences. On the result page, change the display to “Taxonomy Links” and click the “display” button again to see the organisms with hits to your query sequence.
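The same gi extraction can be scripted locally. The following is a minimal Python sketch, not the JavaScript parser referenced above: it assumes that deflines written with ‘-I T’ carry identifiers of the form gi|number|, and the function name and the sample deflines (with made-up gi numbers and accessions) are illustrative only.

```python
import re

# Matches identifiers such as "gi|12345|" and captures the number.
GI_PATTERN = re.compile(r"\bgi\|(\d+)\|")

def extract_gi_numbers(blast_text):
    """Return the unique gi numbers in the text, in order of first appearance."""
    seen = set()
    ordered = []
    for gi in GI_PATTERN.findall(blast_text):
        if gi not in seen:
            seen.add(gi)
            ordered.append(gi)
    return ordered

# Invented one-line descriptions; real output comes from query_aa.out.
sample = """\
gi|1111111|sp|P00001|EXAMPLE1_HUMAN Example protein one       250  1e-66
gi|2222222|sp|P00002|EXAMPLE2_MOUSE Example protein two       180  2e-45
gi|1111111|sp|P00001|EXAMPLE1_HUMAN Example protein one       250  1e-66
"""
print("\n".join(extract_gi_numbers(sample)))

# With the real output file:
# gis = extract_gi_numbers(open("query_aa.out").read())
# open("gi_list.txt", "w").write("\n".join(gis) + "\n")
```

A file with one gi number per line is the kind of list that can then be uploaded to Batch Entrez.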

Formatdb and fastacmd

Let’s create a BLAST-ready nucleotide database from yeast.nt, a file of FASTA-formatted contig sequences.

We’ll run formatdb with and without the ‘-o T’ option:

formatdb -i yeast.nt -p F

formatdb -i yeast.nt -p F -o T -n yeast.nt2

- Note the difference in index files created by formatdb.

- Below, we’ll use fastacmd to retrieve a sequence from formatted yeast.nt2. This is only possible when ‘-o T’ has been set.

1) Use the yeast nucleotide database to identify potential matches to a worm protein

To search a protein query against a nucleotide database, use tblastn. Run a search against both versions of yeast.nt to see another effect of ‘-o T’:

blastall -p tblastn -i worm_prt.txt -d yeast.nt -o worm_nt.out

blastall -p tblastn -i worm_prt.txt -d yeast.nt2 -o worm_nt2.out

- Compare the definition lines created by searching yeast.nt and yeast.nt2

- Are these biologically significant hits?

2) Use fastacmd for sequence retrieval

The first match is to a portion of NC_001142. Using this string we can retrieve this sequence from a formatted database using fastacmd.

fastacmd -s NC_001142 -d yeast.nt2 | more

A more practical use of fastacmd might be to retrieve a defined piece of NC_001142. Use the ‘-L’ option to specify the sequence range encompassing the first alignment. Note that this hit was to “Frame -1”. Use the ‘-S’ option to indicate the strand:

fastacmd -s NC_001142 -d yeast.nt2 -L 333140,333250 -S 2
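To make the range and strand options concrete, here is a minimal Python sketch of the operation being requested. It assumes 1-based inclusive coordinates and that strand 2 means the reverse complement of the selected range; the function name and the toy sequence are illustrative, and this is not how fastacmd itself is implemented.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def subsequence(sequence, start, stop, strand=1):
    """Return sequence[start..stop] (1-based, inclusive).

    strand=1 returns the range as stored; strand=2 returns its
    reverse complement.
    """
    piece = sequence[start - 1:stop]
    if strand == 2:
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece

toy = "ATGCCGTA"
print(subsequence(toy, 2, 5))            # TGCC
print(subsequence(toy, 2, 5, strand=2))  # GGCA
```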

Running standalone PSI-BLAST (blastpgp)

Position-Specific Iterated BLAST (PSI-BLAST) is implemented in standalone BLAST as the executable program blastpgp. As with blastall, you can list all command line options by typing a dash after the program name.

blastpgp -

The blastpgp program can write out a position-specific score matrix (PSSM) in a human-readable format. In this example, we'll create such a matrix using the sequences collected by searching with inositol polyphosphate 1-phosphatase (IPPase). We’ll use the bovine IPPase sequence included in the file bovine_ipp.txt as a query. Inositol monophosphatases and related enzymes contain conserved acidic residues that are essential for binding metal ions (Mg++). The PSSM generated should show high self-substitution scores for the residues in these positions.

The following command line will run blastpgp for 4 iterations with IPPase against the swissprot database and write out the PSSM.

blastpgp -i bovine_ipp.txt -d swissprot -j 4 -Q pssm.txt

Unlike the web version, blastpgp requires you to specify the number of iterations at the start. You can use the -C option to create a checkpoint file so that you can resume the search and perform additional iterations if desired.

Open pssm.txt using a text editor or a web browser. Can you identify three conserved acidic residues? Compare their self-substitution scores in the PSSM with those in BLOSUM62. There are also two other residues that are conserved. What are they? Confirm these findings by using CD-search on the web with IPPase. Display the alignment for the inositol_P domain and look for these conserved residues.
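For the comparison with BLOSUM62, the self-substitution (diagonal) scores of the standard matrix are tabulated in the Python sketch below. The function name is illustrative and the PSSM entries are hypothetical placeholders, not values taken from pssm.txt; substitute the positions and scores you read from your own matrix.

```python
# Diagonal (self-substitution) scores of the standard BLOSUM62 matrix.
BLOSUM62_SELF = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6,
    "H": 8, "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4,
    "T": 5, "W": 11, "Y": 7, "V": 4,
}

def conservation_gain(residue, pssm_self_score):
    """Return how far a PSSM self-score exceeds the BLOSUM62 self-score."""
    return pssm_self_score - BLOSUM62_SELF[residue]

# Hypothetical entries: (position, residue, PSSM self-substitution score).
example_positions = [(10, "D", 8), (42, "E", 7), (77, "A", 4)]
for position, residue, score in example_positions:
    gain = conservation_gain(residue, score)
    print(position, residue, score, BLOSUM62_SELF[residue], gain)
```

A clearly positive gain at a position suggests that the residue is more strongly conserved in the collected sequences than the general-purpose matrix would assume.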

3. WWWBLAST

For this course, we demonstrate an existing WWWBLAST installation. The actual setup requires a web server (such as Apache) and fairly straightforward configuration (see ftp://ftp.ncbi.nlm.nih.gov/blast/documents/wwwblast.txt).

Go to: www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blast/.

Select the second link (Regular BLAST with client-server support), which allows one to input a gi/accession number as query.

The following steps perform a test search with the query F12345 against test_na_db:

- keep the program (blastn) and database (test_na_db) as they are

- change “sequence in FASTA format” to “Accessions or GI”

- type F12345 in the text box

- click the “Search” button to initiate the search

- The page will change to the Results display, similar to an NCBI web BLAST result

To add additional databases into the pull down list, we need to do the following:

- Format the database using formatdb. We have previously done this in the standalone blast section.

- Edit the blast.rc file to link the database with appropriate blast program

# Number of CPUs to use for a single request

NumCpuToUse 4

# Here are list of combinations program/database,

# that allowed by BLAST service. Format: <program> <db> <db> ...

blastn test_na_db yeast

blastp test_aa_db yeast.aa

blastx test_aa_db yeast.aa

tblastn test_na_db yeast

tblastx test_na_db yeast

- edit blast_cs.html to add the new database to the pull down list

<option VALUE = "test_na_db"> test_na_db

<option VALUE = "test_aa_db"> test_aa_db

<option VALUE = "yeast"> yeast nt

<option VALUE = "yeast.aa"> yeast aa

</select>

NOTE: the yeast and yeast.aa entries are the newly added databases.
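When editing blast.rc and blast_cs.html by hand it is easy to list a database in one file but not the other. The following Python sketch reads the program/database lines of a blast.rc file into a dictionary so that the two can be compared. It assumes the format shown above (comment lines starting with #, settings such as NumCpuToUse, and lines of the form program followed by database names); the function name is illustrative.

```python
BLAST_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}

def parse_blast_rc(text):
    """Map each BLAST program to the list of databases allowed for it."""
    allowed = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Only lines that begin with a program name list databases;
        # other settings (e.g. NumCpuToUse) are ignored here.
        if fields[0] in BLAST_PROGRAMS:
            allowed[fields[0]] = fields[1:]
    return allowed

blast_rc = """\
# Number of CPUs to use for a single request
NumCpuToUse 4
blastn test_na_db yeast
blastp test_aa_db yeast.aa
blastx test_aa_db yeast.aa
tblastn test_na_db yeast
tblastx test_na_db yeast
"""
for program, databases in parse_blast_rc(blast_rc).items():
    print(program, databases)
```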

4. URL API

The BLAST URLAPI uses a standardized API (Application Programming Interface) to access the NCBI Qblast system. It makes direct HTTP-encoded requests to bypass the use of a browser, making it easily incorporated into automated scripts. For documentation see: http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html

This tool uses URLs to send encoded commands to the BLAST server through Blast.cgi, both for submitting searches and for requesting results. There are three steps:

- Posting the search request using “put” command to Blast.cgi

- Getting the returned html document and extracting out the RID

- Formatting the result using the RID and “get” command.

Some batch capability and automation can be achieved with custom scripts. Scripts should have a built-in 3-second delay between requests and should also include error and status checking.
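The three steps and the delay can be sketched in Python as follows. This is a hedged outline, not one of the provided Perl scripts. The parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE), the “RID = ...” line in the returned page, and the “Status=...” markers reflect our reading of the URL API documentation and should be checked against it. The fetch argument is a caller-supplied function that sends the encoded request to Blast.cgi and returns the response text, so no server address is hard-coded here.

```python
import re
import time
from urllib.parse import urlencode

def build_put_request(program, database, query):
    """Step 1: encode a search submission (the "put" command)."""
    return urlencode({"CMD": "Put", "PROGRAM": program,
                      "DATABASE": database, "QUERY": query})

def extract_rid(response_text):
    """Step 2: pull the request ID out of the returned document."""
    match = re.search(r"^\s*RID\s*=\s*(\S+)", response_text, re.MULTILINE)
    return match.group(1) if match else None

def build_get_request(rid, format_type="Text"):
    """Step 3: encode a result request (the "get" command) for a given RID."""
    return urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})

def poll_for_results(rid, fetch, delay_seconds=3, max_attempts=20,
                     sleep=time.sleep):
    """Request results repeatedly, waiting delay_seconds before each request."""
    for _ in range(max_attempts):
        sleep(delay_seconds)  # keep a delay between requests
        text = fetch(build_get_request(rid))
        if "Status=WAITING" in text:
            continue  # search still running; poll again
        if "Status=FAILED" in text or "Status=UNKNOWN" in text:
            raise RuntimeError("search %s failed or expired" % rid)
        return text
    raise TimeoutError("no result for %s after %d attempts" % (rid, max_attempts))

print(build_put_request("blastp", "swissprot", "MKV"))
```

Passing fetch and sleep as arguments keeps the polling logic testable without network access; a real script would supply a function that performs the HTTP request.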

We have provided some Perl scripts (www.ncbi.nlm.nih.gov/staff/tao/URLAPI/) that require installation of the ActivePerl package. These scripts take a protein or nucleotide FASTA file, formulate the searches, and send them to the BLAST server. The issued RIDs are parsed out and then used for result retrieval. The provided scripts perform only one round of result checking; ideally, there should be additional error checking and repeated result polling.