A major goal of bioinformatics is to assist biologists in their basic tasks. One repetitive and laborious task is reviewing the literature on a problem. Usually this is done by searching each key term in PubMed, and sometimes there are hundreds of key terms. This approach also has limitations because of server load and the web interface. My biologist friends frequently ask me to text mine PubMed for their key terms and give them the data in an Excel sheet. This gives them a bird's-eye view of the literature about their hypothesis and genes of interest. NCBI provides many tools for text mining PubMed, and the best thing about them is that the analyzed data is already available and can be downloaded easily from the relevant website. One just has to parse the data from these files. The complete list of tools and software available for this purpose is provided on the following link. The major tools include:

  • PubTator: I found this tool to be the most useful resource for text mining the complete PubMed for key biological entities, e.g. diseases and genes. The tool is freely available here. It can be accessed through an API, but I prefer to download all the files from the FTP site and parse them for faster and more customized results. I will discuss and share the code for basic parsing of PubTator files in a later article. An example of what you can do with these resources is available on this website at http://bioinfoguide.com/index.php/tools; this was the basic tool I developed in early 2018 for reviewing the literature on some of my genes of interest.

This tool contains all genes and their relevant information. If you click on any gene name, you will get a list of all genes reported together with this gene, along with all the PubMed articles available when the database was last updated (early 2018; I plan to update the database soon).


  • LitVar: This is a great resource for people working in genetics; it allows retrieval of variant-related information from the biomedical literature. It links the key biological features of a variant with genes, diseases and drugs.

Besides these two major tools, a large list of tools for more specialized tasks is available at the following link, but I believe these two tools offer all the basic functionality needed to review the literature on any gene, disease, drug or variant, which is a very frequent request.
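As a rough illustration of the "download and parse" approach described for PubTator, the sketch below groups PubMed IDs by gene and writes a CSV file that opens in Excel. It assumes a five-column, tab-separated bulk file (PMID, concept type, concept ID, mentions, resource); the sample rows, the function names and the output layout are made up for illustration, so check the layout of the file you actually download before relying on it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the ASSUMED layout:
# PMID <tab> Type <tab> ConceptID <tab> Mentions <tab> Resource
SAMPLE = (
    "1001\tGene\t7157\tp53|TP53\tGNormPlus\n"
    "1002\tGene\t7157\tTP53\tGNormPlus\n"
    "1002\tGene\t672\tBRCA1\tGNormPlus\n"
)


def pmids_by_concept(handle, wanted_ids):
    """Map each wanted concept ID to the sorted list of PMIDs that mention it."""
    hits = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        pmid, concept_id = fields[0], fields[2]
        if concept_id in wanted_ids:
            hits[concept_id].add(pmid)
    return {cid: sorted(pmids) for cid, pmids in hits.items()}


def write_csv(table, out_handle):
    """Write one row per concept: ID, article count, semicolon-joined PMIDs."""
    writer = csv.writer(out_handle)
    writer.writerow(["concept_id", "n_articles", "pmids"])
    for cid in sorted(table):
        writer.writerow([cid, len(table[cid]), ";".join(table[cid])])


# In practice `handle` would be an opened bulk file; here it is the sample text.
table = pmids_by_concept(io.StringIO(SAMPLE), {"7157", "672"})
out = io.StringIO()
write_csv(table, out)
print(out.getvalue())
```

Reading the file line by line keeps memory use low, which matters because the bulk files cover the whole of PubMed.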

BLAST (Basic Local Alignment Search Tool) is a method to ascertain sequence similarity. The program takes a query sequence and searches it against the database selected by the user, comparing the query against every subject sequence in that database. The results are reported as a ranked list of hits followed by a series of individual sequence alignments, plus various statistics and scores. Every hit in the list is assigned a similarity score S. The score is then assessed for how likely it is to arise by chance; for that purpose an E-value is calculated for every hit. The E-value for a score S is the expected number of hits with score S or higher that would occur in the database by chance.
For a detailed discussion of the statistics used in BLAST, check the following link.
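To make the relationship between the score S and the E-value concrete, here is a minimal sketch of the Karlin-Altschul formula E = K·m·n·e^(−λS), where m is the query length and n is the database length. The values of λ and K below are only illustrative (they depend on the scoring matrix and gap penalties), and the sketch ignores the effective-length corrections that BLAST applies, so the numbers will not match real BLAST output exactly.

```python
import math


def raw_to_bit_score(raw_score, lam, k):
    """Convert a raw score S to a normalized bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits with a score of at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative parameters only (assumptions, not taken from a real search).
LAMBDA, K = 0.267, 0.041
QUERY_LEN, DB_LEN = 250, 50_000_000

for score in (40, 80, 120):
    bits = raw_to_bit_score(score, LAMBDA, K)
    expect = e_value(score, QUERY_LEN, DB_LEN, LAMBDA, K)
    print(f"S={score:4d}  bits={bits:6.1f}  E={expect:.2e}")
```

The E-value falls exponentially as the score rises and grows linearly with database size, which is why the same alignment looks less significant against a larger database.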

This program can be accessed directly online on the NCBI web server (https://blast.ncbi.nlm.nih.gov/Blast.cgi), or one can download BLAST to run locally. It also offers an API that can be used in other applications. In this tutorial I am only going to discuss the basic types and options of the BLAST programs; most of the material on the NCBI website is self-explanatory.
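For readers who go the local route, the sketch below shows one way to assemble a BLAST+ command line from Python. The flags used (-query, -db, -evalue, -outfmt, -out) are standard BLAST+ options; the file and database names are hypothetical placeholders, and actually running the command requires a local BLAST+ installation and a formatted database.

```python
import shlex

GENERAL_PROGRAMS = {"blastp", "blastn", "blastx", "tblastn", "tblastx"}


def blast_command(program, query_path, db_name, evalue=10.0, out_path="results.tsv"):
    """Build the argument list for a local BLAST+ search with tabular output."""
    if program not in GENERAL_PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program}")
    return [
        program,
        "-query", query_path,
        "-db", db_name,
        "-evalue", str(evalue),
        "-outfmt", "6",  # tab-separated, one hit per line
        "-out", out_path,
    ]


# "query.fasta" and "my_proteins" are placeholder names for this example.
cmd = blast_command("blastp", "query.fasta", "my_proteins", evalue=1e-5)
print(shlex.join(cmd))

# To execute it (only where BLAST+ is installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one shell string avoids quoting problems with file names that contain spaces.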


Types of BLAST programs

We can divide the BLAST programs into two categories depending on their functionality: general search tools and specialized search tools. I will first discuss the general search tools, which all have nearly the same interface and features.

  1. BLASTP compares an amino acid query sequence against a protein sequence database

  2. BLASTN compares a nucleotide query sequence against a nucleotide sequence database

  3. BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database

  4. TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands)

  5. TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
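The five programs above differ only in the type of query, the type of database, and whether nucleotide sequences are translated before comparison. The small lookup below is one way to summarize that; the function name and argument values are my own choices for illustration.

```python
# (query type, database type, translate nucleotides first?) -> program
PROGRAM_TABLE = {
    ("protein", "protein", False): "blastp",
    ("nucleotide", "nucleotide", False): "blastn",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "nucleotide", True): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_program(query_type, db_type, translated=None):
    """Return the general BLAST program for a query/database combination.

    If `translated` is None, it defaults to True only when the two
    sequence types differ (translation is then the only option).
    """
    if translated is None:
        translated = query_type != db_type
    try:
        return PROGRAM_TABLE[(query_type, db_type, translated)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query_type} query vs {db_type} database "
            f"(translated={translated})"
        ) from None


print(choose_program("protein", "protein"))
print(choose_program("nucleotide", "protein"))
print(choose_program("nucleotide", "nucleotide", translated=True))
```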

Sequence input: BLAST accepts the sequence in FASTA format (a widely used file format that starts with a >Sequence_Definition line, followed on new lines by the sequence, commonly 60 characters per line) or as an accession number or GI number.
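The FASTA layout described above is simple enough to write and read with a few lines of code. This is a minimal sketch with function names of my own choosing; for real work a library parser such as Biopython's is usually a safer choice.

```python
def format_fasta(definition, sequence, width=60):
    """Return one FASTA record: a '>' definition line, then wrapped sequence."""
    lines = [">" + definition]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Return a list of (definition, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# A made-up 130-residue sequence, just to show the line wrapping.
example = format_fasta("Sequence_Definition", "ACGT" * 32 + "AC")
print(example)
print(parse_fasta(example))
```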

Subject databases: Many databases can be used as the subject database. One of the most commonly used is the nr database, a collection of "non-redundant" sequences from GenBank and other sequence databanks. Other options can be selected as required, e.g. the Protein Data Bank (PDB), the RefSeq Genome Database (RefSeq Genomes), etc.

FILTER (Low-complexity): Masks off segments of the query sequence that have low compositional complexity (i.e. regions of biased composition, such as short-period repeats), so that they do not produce spurious high-scoring matches.
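To give a feel for what "low compositional complexity" means, the sketch below masks windows whose Shannon entropy falls under a threshold. This is a deliberately simplified stand-in, not the algorithm BLAST itself uses (BLAST relies on SEG for proteins and DUST for nucleotides); the window size, the threshold and the example sequence are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.5, mask_char="X"):
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# Made-up protein with a poly-alanine stretch in the middle.
protein = "MKTAYIAKQRQISFVKSHFSRQ" + "A" * 15 + "LEERLGLIEVQAPILSRVGDGT"
print(protein)
print(mask_low_complexity(protein))
```

Because whole windows are masked, a few residues on either side of the repeat are masked too; real filters are more careful about the boundaries.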


Understanding BLAST results: BLAST results are largely self-explanatory, but a few key terms will help you understand them better.

Query Coverage: Query coverage should be as high as possible; it shows how much of your query sequence aligns to the target sequence. If only 10% of the query aligns to the target, even at 100% identity, the result would not be considered a good match.
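As a small worked example of the idea, query coverage can be computed from the query positions covered by the aligned segments of a hit, counting overlapping segments only once. The function name and the interval format (1-based, inclusive start and end positions on the query) are assumptions made for this sketch.

```python
def query_coverage(segments, query_len):
    """Percent of the query covered by at least one aligned segment.

    segments: iterable of (start, end) query positions, 1-based and inclusive.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted((min(s, e), max(s, e)) for s, e in segments):
        if cur_end is None or start > cur_end:
            # Disjoint from the running interval: close it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlaps the running interval: extend it if needed.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / query_len


# Two overlapping segments covering positions 1-100 of a 200-residue query.
print(query_coverage([(1, 50), (40, 100)], 200))
# The 10% case from the text: a short perfect match on a long query.
print(query_coverage([(1, 20)], 200))
```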

EXPECT value: The statistical significance threshold for reporting matches against database sequences. The default value is 10, meaning that up to 10 matches of that quality are expected to be found merely by chance. If the E-value of a match is greater than the EXPECT threshold, the match will not be reported. Increasing the EXPECT value makes the program report less significant matches; decreasing it makes the search more stringent.
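The effect of the EXPECT threshold can be imitated on saved results. The sketch below parses BLAST+ tabular output (-outfmt 6, whose default columns are listed in COLUMNS) and keeps only hits at or below a chosen E-value. The three hits in SAMPLE are invented for the example.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}

# Invented hits, for illustration only.
SAMPLE = (
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-110\t390\n"
    "query1\tsubjB\t45.0\t60\t33\t0\t10\t69\t100\t159\t0.002\t40.1\n"
    "query1\tsubjC\t35.0\t20\t13\t0\t150\t169\t7\t26\t8.3\t22.5\n"
)


def parse_tabular(text):
    """Parse tabular BLAST output into a list of dicts, one per hit."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        for name in FLOAT_COLUMNS:
            hit[name] = float(hit[name])
        hits.append(hit)
    return hits


def filter_by_evalue(hits, threshold=10.0):
    """Keep hits whose E-value does not exceed the threshold."""
    return [hit for hit in hits if hit["evalue"] <= threshold]


hits = parse_tabular(SAMPLE)
for threshold in (10.0, 1e-3):
    kept = [hit["sseqid"] for hit in filter_by_evalue(hits, threshold)]
    print(threshold, kept)
```

With the default threshold of 10 all three hits pass; tightening it to 0.001 leaves only the strongest one.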

Identity: The percentage of aligned positions at which the query and target sequences have identical residues.
In short, BLAST results should be judged only after checking all three of these values.
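For completeness, percent identity can be computed directly from a pair of aligned sequences. In this sketch the denominator is the full alignment length including gap columns, which is how BLAST reports it; the function name and the example alignments are mine.

```python
def percent_identity(aligned_query, aligned_subject, gap="-"):
    """Percent of alignment columns where both sequences carry the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != gap
    )
    return 100.0 * matches / len(aligned_query)


# One mismatch in ten columns.
print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
# One gap column in five.
print(percent_identity("AC-GT", "ACTGT"))
```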


The second category of BLAST tools contains the specialized search tools, which are listed below:

  1. SmartBLAST: It is used for finding proteins that are highly similar to the query sequence
  2. Primer-BLAST: It is one of the most important tools for designing primers specific to a PCR template
  3. GlobalAlign: It is an implementation of the Needleman-Wunsch algorithm, used for global alignment of two sequences across their entire span
  4. CD-Search: It is used for finding conserved domains in a sequence, which are important in evolutionary genomics, motif prediction etc.
  5. GEO: It can find matches to gene expression profiles by searching against the Gene Expression Omnibus (GEO) database
  6. IgBLAST: It is used for searching immunoglobulin and T-cell receptor sequences and is widely used in the field of immunology
  7. VecScreen: It is used for screening sequences for vector contamination. Vector contamination can cause problems in any kind of analysis, so it is necessary to remove all vector sequences from the query before performing further alignment or analysis
  8. CDART: This tool finds sequences with a similar conserved domain architecture and is widely used in proteomics, evolutionary genomics etc.
  9. TargetedLoci: Another valuable tool for evolutionary biology, capable of searching markers for phylogenetic analysis
  10. Multiple Alignment: It is used for multiple alignment of sequences using domain and protein constraints
  11. BioAssay: This tool can be used for searching protein or nucleotide targets in PubChem BioAssay, a large public repository of small-molecule and RNAi screening data that has provided open access to its content since 2004.
  12. MOLE-BLAST: It classifies multiple query sequences and discovers their relationship to each other, providing a taxonomic context for the queries. It is intended to work with a specific locus from a set of organisms rather than with sequences such as the entire genome of an organism or unannotated contigs.
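Since GlobalAlign in the list above is based on the Needleman-Wunsch algorithm, a compact sketch of that algorithm may help show how a global alignment differs from BLAST's local one. The scoring values (match +1, mismatch -1, linear gap penalty -2) are arbitrary assumptions for illustration and are not the parameters used by the NCBI tool.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

Unlike BLAST, which reports only the best-matching local regions, this procedure forces every residue of both sequences into the alignment, so it is best suited to sequences of similar length that are expected to be related end to end.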



© 2018 BioinfoGuide. All Rights Reserved.