Peter Y. Chou and Gerald D. Fasman developed the Chou–Fasman method in 1974 for predicting the secondary structure of proteins. The method uses information previously obtained from X-ray crystallography experiments, e.g. the relative frequencies of each amino acid in turns, alpha helices, and beta sheets. From these frequencies the algorithm derives propensity values for each amino acid in each type of secondary structure, and combines them to predict which conformation (turn, helix, or beta sheet) a given stretch of amino acids will adopt. The accuracy of the original algorithm was approximately 50-60%, which is low compared to modern techniques that exploit machine learning and artificial intelligence.

Amino Acid Propensities

The original parameters of the algorithm were not trustworthy because, at the time, data was available only from a small and non-representative sample of protein structures. Nevertheless, the parameters captured a real signal: the strong tendency of some amino acids to prefer one type of secondary structure over others. For example, alanine, glutamate, leucine, and methionine are strong helix formers, while proline and glycine have unique conformational propensities that tend to end a helix. Over time these parameters were improved, along with some modifications to the original algorithm. The algorithm considers only the propensities of each individual amino acid rather than the properties of the whole sequence, including its neighbors. This keeps it computationally efficient, but single-residue propensities are not sufficient to accurately predict secondary structure; newer methods, such as the GOR method, also take the propensities of neighboring amino acids into account for more accurate prediction.

Table of Amino Acid Propensities

Name           P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        142   83    66       0.060  0.076   0.035   0.058
Arginine       98    93    95       0.070  0.106   0.099   0.085
Aspartic Acid  101   54    146      0.147  0.110   0.179   0.081
Asparagine     67    89    156      0.161  0.083   0.191   0.091
Cysteine       70    119   119      0.149  0.050   0.117   0.128
Glutamic Acid  151   37    74       0.056  0.060   0.077   0.064
Glutamine      111   110   98       0.074  0.098   0.037   0.098
Glycine        57    75    156      0.102  0.085   0.190   0.152
Histidine      100   87    95       0.140  0.047   0.093   0.054
Isoleucine     108   160   47       0.043  0.034   0.013   0.056
Leucine        121   130   59       0.061  0.025   0.036   0.070
Lysine         114   74    101      0.055  0.115   0.072   0.095
Methionine     145   105   60       0.068  0.082   0.014   0.055
Phenylalanine  113   138   60       0.059  0.041   0.065   0.065
Proline        57    55    152      0.102  0.301   0.034   0.068
Serine         77    75    143      0.120  0.139   0.125   0.106
Threonine      83    119   96       0.086  0.108   0.065   0.079
Tryptophan     108   137   96       0.077  0.013   0.064   0.167
Tyrosine       69    147   114      0.082  0.065   0.114   0.125
Valine         106   170   50       0.062  0.048   0.028   0.053

Algorithm in simple steps

  1. Assign the appropriate parameter values to every amino acid in the sequence.
  2. Any region where at least 4 of 6 contiguous amino acids have P(a) > 100 nucleates an alpha-helix. The helix is extended in both directions until a set of four contiguous amino acids with an average P(a) < 100 is reached. If the segment so obtained is longer than 5 amino acids and its average P(a) > average P(b), it is assigned as a helix.
  3. Repeat this procedure to locate all of the helical segments in the sequence.
  4. Any region where at least 3 of 5 contiguous amino acids have P(b) > 100 nucleates a beta-sheet. The sheet is extended in both directions until a set of four contiguous amino acids with an average P(b) < 100 is reached. If the segment so obtained has an average P(b) > 105 and average P(b) > average P(a), it is assigned as a beta-sheet.
  5. Regions with overlapping alpha-helical and beta-sheet assignments are taken to be helical if the average P(a) > P(b), and a beta-sheet if the average P(b) > P(a).
  6. The turn probability at any residue j is calculated by the following formula:
    1. p(t) = f(j) × f(j+1) × f(j+2) × f(j+3), where f(j+1), f(j+2), and f(j+3) are the values for the residues at positions j+1, j+2, and j+3.
  7. A beta-turn is predicted at j if the following 3 criteria are fulfilled:
    1. p(t) > 0.000075
    2. The average value of P(turn) over the tetrapeptide is greater than 1.00
    3. The tetrapeptide averages obey the inequality P(a) < P(turn) > P(b)
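The helix-nucleation rule in step 2 can be sketched in a few lines of Python. The P(a) values below are taken from the propensity table; the function name, the small residue subset, and the omission of the extension and overlap-resolution steps are my simplifications for illustration, not part of the published method.

```python
# Minimal sketch of the Chou-Fasman helix-nucleation scan (step 2 above).
# P_ALPHA holds P(a) values for a few residues from the propensity table;
# segment extension and helix/sheet overlap resolution are omitted.
P_ALPHA = {"A": 142, "E": 151, "L": 121, "M": 145, "G": 57, "P": 57, "K": 114, "D": 101}

def helix_nuclei(seq, window=6, needed=4, threshold=100):
    """Return start indices of windows where >= 4 of 6 residues have P(a) > 100."""
    nuclei = []
    for start in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[start:start + window] if P_ALPHA[aa] > threshold)
        if formers >= needed:
            nuclei.append(start)
    return nuclei
```

A full implementation would then extend each nucleus in both directions and apply the averaging tests from steps 2 and 4.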


Hirschberg's algorithm, a dynamic programming algorithm developed by Dan Hirschberg, finds the best sequence alignment of two sequences. The algorithm is based on the Levenshtein edit distance: the distance between two strings is the number of single-character deletions, insertions, or substitutions required to transform one string into the other. It can also be viewed as a modified version of the Needleman-Wunsch algorithm that is more space efficient and uses a divide-and-conquer strategy.

Time and space calculations:
The major application of this algorithm is to find the optimal sequence alignment between two nucleotide or amino acid sequences. There are also suboptimal heuristic approaches to sequence alignment, e.g. BLAST and FASTA. Finding an optimal sequence alignment requires computational resources in terms of both time and space. For example, the Needleman-Wunsch algorithm takes O(nm) time and O(nm) space to find an optimal alignment of two strings x and y, where length(x) = n and length(y) = m. Hirschberg's algorithm still takes O(nm) time, but needs only O(min{n,m}) space. It is also a space-efficient way to calculate the longest common subsequence between two sets of data, as used in the common diff tool.

How we compute Levenshtein edit distance:
To calculate the least score, Edit(x,y), required to change x into y using insertions, substitutions, and deletions, we assign a score to each kind of event: a score for inserting a character "A" into the string, a score for substituting a character "B" with "A", and a score for deleting a character "A" from the string. The standard scores are Ins(a) = Del(a) = 1 for each character a, Sub(a,a) = 0, and Sub(a,b) = 1 if a is not equal to b.
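This scoring scheme can be made concrete in a short Python sketch. The function name and the idea of passing the event scores as parameters are mine; the defaults are the standard scheme just described.

```python
# Edit(x, y) with explicit per-event scores. The defaults follow the
# standard scheme: Ins(a) = Del(a) = 1, Sub(a, a) = 0, Sub(a, b) = 1.
def edit_score(x, y,
               ins=lambda a: 1,
               dele=lambda a: 1,
               sub=lambda a, b: 0 if a == b else 1):
    n, m = len(x), len(y)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # delete all of x
        E[i][0] = E[i - 1][0] + dele(x[i - 1])
    for j in range(1, m + 1):                      # insert all of y
        E[0][j] = E[0][j - 1] + ins(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = min(E[i - 1][j] + dele(x[i - 1]),               # deletion
                          E[i][j - 1] + ins(y[j - 1]),                # insertion
                          E[i - 1][j - 1] + sub(x[i - 1], y[j - 1]))  # substitution
    return E[n][m]
```

With the default unit costs this is exactly the Levenshtein distance; other cost functions can be plugged in without changing the recurrence.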

Differences with Needleman-Wunsch algorithm:
Levenshtein edit distances can be computed in linear space. What we call the "forward subprogram" computes the values of Edit(Prefix[x,i], Prefix[y,j]) for all i and j, just as the Needleman-Wunsch algorithm does, and returns the array {Edit(x, Prefix[y,j])} for 0 ≤ j ≤ m. The "backward subprogram" is similar, except that the dynamic program runs in the opposite direction, i.e., starting from the right ends of the strings. It returns the array {Edit(x, Suffix[y,j])} for 0 ≤ j ≤ m, where Suffix[y,j] is the suffix of y of length j.

The Algorithm:
Compute V(A, B) while saving the values of the (n/2)-th row. Denote this matrix as the Forward Matrix F.
Compute V(Ar, Br) while saving the values of the (n/2)-th row. Denote this matrix as the Backward Matrix B.
Find the column k* so that the crossing point (n/2, k*) satisfies:

F[n/2, k*] + B[n/2, m - k*] = V(A, B), i.e. k* optimizes F[n/2, k] + B[n/2, m - k] over k.

Now that k* is found, recursively partition the problem into two subproblems:
Find the path from (0, 0) to (n/2, k*).
Find the path from (n, m) to (n/2, m - k*).



High-level description of the forwards subprogram

Forwards[x,y] is

1. n = length(x); m = length(y)
2. Edit[Prefix[x,0],Prefix[y,0]] = 0;
3. For all i from 1 to n:
Edit[Prefix[x,i],Prefix[y,0]] = Edit[Prefix[x,i-1],Prefix[y,0]]
+ Del(x_i)
4. For all j from 1 to m:
A. Edit[Prefix[x,0],Prefix[y,j]] = Edit[Prefix[x,0],Prefix[y,j-1]]
+ Ins(y_j)
B. For all i from 1 to n, execute the following steps:
i. Edit[Prefix[x,i],Prefix[y,j]] =
min{Edit[Prefix[x,i-1],Prefix[y,j]] + Del(x_i),
Edit[Prefix[x,i-1],Prefix[y,j-1]] + Sub(x_i,y_j),
Edit[Prefix[x,i],Prefix[y,j-1]] + Ins(y_j)}
ii. Erase Edit[Prefix[x,i-1],Prefix[y,j-1]]
C. Erase Edit[Prefix[x,i-1],Prefix[y,j]]
5. RETURN Edit[x] %% an array of length m+1

High-level description of the backwards subprogram

Backwards[x,y] is

1. n = length(x); m = length(y)
2. Edit[Suffix[x,0],Suffix[y,0]] = 0;
For all i from 1 to n:
Edit[Suffix[x,i],Suffix[y,0]] = Edit[Suffix[x,i-1],Suffix[y,0]]
+ Del(x_{n-i+1})
3. For all j from 1 to m:
A. Edit[Suffix[x,0],Suffix[y,j]] = Edit[Suffix[x,0],Suffix[y,j-1]]
+ Ins(y_{m-j+1})
B. For all i from 1 to n:
i. Edit[Suffix[x,i],Suffix[y,j]] =
min{Edit[Suffix[x,i-1],Suffix[y,j]] + Del(x_{n-i+1}),
Edit[Suffix[x,i-1],Suffix[y,j-1]] + Sub(x_{n-i+1},y_{m-j+1}),
Edit[Suffix[x,i],Suffix[y,j-1]] + Ins(y_{m-j+1})}
ii. Erase Edit[Suffix[x,i-1],Suffix[y,j-1]]
C. Erase Edit[Suffix[x,i-1],Suffix[y,j]]
4. RETURN Edit[x] %% an array of length m+1
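With unit costs, the two subprograms can be sketched in Python. Note that the backward subprogram is simply the forward subprogram run on the reversed strings; the function names are mine, and the general Ins/Del/Sub scores are specialized to the standard unit costs.

```python
# Forward subprogram with unit costs: returns {Edit(x, Prefix[y, j])} for
# 0 <= j <= m, keeping only one row of the DP table at a time.
def forwards(x, y):
    prev = list(range(len(y) + 1))       # Edit(Prefix[x, 0], Prefix[y, j]) = j
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * len(y)        # Edit(Prefix[x, i], Prefix[y, 0]) = i
        for j in range(1, len(y) + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + sub)
        prev = curr
    return prev

# Backward subprogram: returns {Edit(x, Suffix[y, j])} for 0 <= j <= m.
# Running the forward pass on the reversed strings gives, at index j,
# Edit(x, suffix of y of length j), which is exactly the array we need.
def backwards(x, y):
    return forwards(x[::-1], y[::-1])
```

Both return arrays of length m+1, and the last entries of the two arrays agree: each equals the full edit distance Edit(x, y).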

High level description of Hirschberg's algorithm:

Hirschberg(x,y) is
1. n = length(x); m = length(y)
2. If n <= 1 or m <= 1:
OUTPUT Alignment(x,y) using Needleman-Wunsch.
3. Otherwise:
A. mid = floor(n/2)
B. x_left = Prefix[x,mid]
C. x_right = Suffix[x,n-mid]
D. Edit[x_left] = Forwards(x_left,y)
%% an array of length m+1
E. Edit[x_right] = Backwards(x_right,y)
%% an array of length m+1
F. cut = ArgMin{Edit[x_left,Prefix[y,j]] +
Edit[x_right,Suffix[y,m-j]]} %% j ranges from 1 to m-1
G. Hirschberg(x_left,Prefix[y,cut])
H. Hirschberg(x_right,Suffix[y,m-cut])
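Putting the pieces together, here is a compact Python sketch of the recursion above, specialized to unit-cost edit distance. The function names (`edit_row`, `nw_align`, `hirschberg`) are mine; a real implementation would take the scoring functions as parameters.

```python
def edit_row(x, y):
    """Forward subprogram with unit costs: last row {Edit(x, Prefix[y, j])}."""
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + sub)
        prev = curr
    return prev

def nw_align(x, y):
    """Full Needleman-Wunsch with traceback; only used on tiny base cases."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + sub)
    ax, ay, i, j = "", "", n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1):
            ax, ay, i, j = x[i - 1] + ax, y[j - 1] + ay, i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ax, ay, i = x[i - 1] + ax, "-" + ay, i - 1
        else:
            ax, ay, j = "-" + ax, y[j - 1] + ay, j - 1
    return ax, ay

def hirschberg(x, y):
    """Optimal global alignment in linear space via divide and conquer."""
    if len(x) <= 1 or len(y) <= 1:
        return nw_align(x, y)
    mid = len(x) // 2
    left = edit_row(x[:mid], y)                # Edit(x_left, Prefix[y, j])
    right = edit_row(x[mid:][::-1], y[::-1])   # index k holds Edit(x_right, Suffix[y, k])
    cut = min(range(len(y) + 1), key=lambda j: left[j] + right[len(y) - j])
    ax1, ay1 = hirschberg(x[:mid], y[:cut])
    ax2, ay2 = hirschberg(x[mid:], y[cut:])
    return ax1 + ax2, ay1 + ay2
```

The cut point is the crossing column k* from the description above; only O(m) cells are alive at any moment, while the recursion still explores the full O(nm) table over time.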

The Smith–Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. Developed by Temple F. Smith and Michael S. Waterman in 1981, it is a variation of the Needleman-Wunsch algorithm and likewise uses dynamic programming to find the optimal local alignment with respect to the scoring system being used. The major differences from the Needleman-Wunsch algorithm are:

  1. In order to highlight the best local alignments; negative scoring matrix cells are set to zero
  2. Traceback procedure starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered

This algorithm was designed to sensitively detect similarities in highly diverged sequences. For the purpose of explanation, I will first summarize the algorithm in simple steps and then move forward with examples and explanations.

The algorithm is similar to global alignment, with modified boundary conditions and recurrence rules:

  1. The top row and left column are now filled with 0
  2. If the (sub-)alignment score becomes negative, reset it to zero and restart the search
  3. Traceback starts from the maximum of F(i, j) in the whole matrix and proceeds until the first 0 is reached
  4. Example: the optimal local alignment between HEAGAWGHEE and PAWHEAE is AWGHE / AW-HE
  5. Issue: in gapped alignments, the expected score for a random match may be positive

Example of  Smith–Waterman Algorithm


1. Start with an integer matrix M with one row for each prefix of s and one column for each prefix of t, i.e. (n+1) x (m+1) for sequences of lengths n and m. Compute M[i][j] from the score matrix and the optimum scores computed so far (dynamic programming).


2. Understanding the matrix

  • Alignment
    t : - - - - - - - -
    s : c c c t a g g t


  • Alignment
    t : c g g g t a t ...
    s : - - - - - - - ...


3. Computing cell scores: finding M[i][j]. There are three ways to finish an alignment of s[1..i] and t[1..j]: align s[i] with t[j] (diag), align s[i] with a gap (up), or align t[j] with a gap (left).


4. Scoring process: each element M[i][j] is computed as the maximum of zero and the three finishing moves:

M[i][j] = max{0,
              M[i-1][j-1] + Score(s[i], t[j]),
              M[i-1][j] + gap,
              M[i][j-1] + gap}

where the 0 restarts the search whenever a (sub-)alignment score would become negative.


5. Backtracking process: to find the BEST local alignment, find the highest score in the matrix and then trace back to the first 0.

Scoring used in this example:

  • diag - the letters from the two sequences are aligned
  • left - a gap is introduced in the left sequence
  • up - a gap is introduced in the top sequence

Today I am going to explain one of the most basic and important algorithms of bioinformatics, the Needleman–Wunsch algorithm, developed by Saul B. Needleman and Christian D. Wunsch in 1970. It was designed to compare biological sequences and was one of the first applications of dynamic programming to biological sequence comparison. This algorithm is used for global alignment of two sequences (nucleotide or amino acid).

For the purpose of explanation, I will first summarize the algorithm in five steps and then move forward with examples and explanations.

  1. Consider all possible pairs of residues from the two sequences; the best way is to generate a 2D matrix over the two sequences. We will need two such matrices: one for the scores and one for the traceback.
  2. Initialize the score matrix. Scoring matrices are used to determine the relative score made by matching two characters in a sequence alignment. There are many flavors of scoring matrices for amino acid sequences, nucleotide sequences, and codon sequences, and each is derived from the alignment of "known" homologous sequences. These alignments are then used to determine the likelihood of one character being at the same position in the sequence as another character.
  3. Gap penalty: insertions and deletions (indels) are common in biological sequences, but one large indel is more likely than multiple small indels in a given sequence. To account for this, two kinds of penalties are used: a gap opening penalty (relatively high) and a gap extension penalty (relatively low).
  4. Calculate scores and fill the traceback matrix
  5. Deduce the alignment from the traceback matrix

Now, let's start with a small example: the alignment of two sequences using the BLOSUM62 substitution matrix. Assume sequence A is "SEND" and sequence B is "AND", with a gap opening penalty of 10 (no gap extension penalty):

The Score And Traceback Matrices: The cells of the score matrix are labelled C(i, j), where i = 1, 2, ..., N and j = 1, 2, ..., M.


Scoring: The score matrix cells are filled row by row, starting from cell C(2, 2). The score of any cell C(i, j) is the maximum of:

  1. qdiag = C(i-1, j-1) + S(i, j)
  2. qup = C(i-1, j) + g
  3. qleft = C(i, j-1) + g

where S(i; j) is the substitution score for letters i and j, and g is the gap penalty. The value of the cell C(i; j) depends only on the values of the immediately adjacent northwest diagonal, up, and left cells.


The Needleman-Wunsch Progression

1. The first step is to calculate the value of C(2, 2):



  • qdiag = C(1, 1) + S(S, A) = 0 + 1 = 1
  • qup = C(1, 2) + g = -10 + (-10) = -20
  • qleft = C(2, 1) + g = -10 + (-10) = -20

Where C(1, 1), C(1, 2), and C(2, 1) are read from the score matrix, and S(S, A) is the score for the S <-> A substitution taken from the BLOSUM62 matrix. For the score matrix, C(2, 2) = qdiag = 1. The corresponding cell of the traceback matrix is "diag":


2. The next step is to calculate C(2, 3):

  • qdiag = C(1, 2) + S(E, A) = -10 + (-1) = -11
  • qup = C(1, 3) + g = -20 + (-10) = -30
  • qleft = C(2, 2) + g = 1 + (-10) = -9

Thus C(2, 3) = -9 and the corresponding cell of the traceback matrix is "left". After calculating all of the cells, the score and traceback matrices are:


The Traceback: Traceback is the process of deduction of the best alignment from the traceback matrix. The traceback always begins with the last cell to be filled with the score, i.e. the bottom right cell. One moves according to the traceback value written in the cell. There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left. The traceback is completed when the first, top-left cell of the matrix is reached ("done" cell). The traceback performed on the completed traceback matrix:

The Best Alignment: The alignment is deduced from the values of the cells along the traceback path, by taking into account the value of each cell in the traceback matrix:

  • diag - the letters from two sequences are aligned
  • left - a gap is introduced in the left sequence
  • up - a gap is introduced in the top sequence

Sequences are aligned backwards.

The Traceback Step-By-Step:

  1. The first cell from the traceback path is "diag" implying that the corresponding letters are aligned:
  2. The second cell from the traceback path is also "diag" implying that the corresponding letters are aligned:
  3. The third cell from the traceback path is "left" implying the gap in the left sequence (i.e. we stay on the letter A from the left sequence):
  4. The fourth cell from the traceback path is also "diag" implying that the corresponding letters are aligned. We consider the letter A again, this time it is aligned with S:

Compare With The Exhaustive Search:

  • The best alignment via the Needleman-Wunsch algorithm:
  • The exhaustive search:
    -AND score: +1
    A-ND score: +3 the best
    AN-D score: -3
    AND- score: -8

Pros and Cons: In this example we considered very short sequences, SEND and AND, which are easy to compare by exhaustive search, but for real problems with longer sequences the situation quickly turns against it. For example, two 12-residue sequences would require considering ~1 million alignments, and two 150-residue sequences ~10^88 alignments (~10^78 is the estimated number of atoms in the Universe). In contrast, for two 150-residue sequences the Needleman-Wunsch algorithm only requires filling a 150 x 150 matrix, which is computationally cheap. Its main limitation is that it does the same amount of work regardless of the length or similarity of the sequences, which gives heuristic algorithms an advantage in speed; however, the Needleman-Wunsch algorithm is guaranteed to find the best alignment.
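The SEND vs. AND walk-through can be reproduced in a short Python sketch. The dictionary holds only the handful of BLOSUM62 entries this example needs, a flat gap penalty of -10 is used as above, and the function returns just the optimal score (the traceback is omitted for brevity).

```python
# Needleman-Wunsch on the SEND vs. AND example. Only the BLOSUM62
# entries needed for these two sequences are included.
BLOSUM62 = {("S", "A"): 1, ("S", "N"): 1, ("S", "D"): 0,
            ("E", "A"): -1, ("E", "N"): 0, ("E", "D"): 2,
            ("N", "A"): -2, ("N", "N"): 6, ("N", "D"): 1,
            ("D", "A"): -2, ("D", "N"): 1, ("D", "D"): 6}

def score(a, b):
    return BLOSUM62.get((a, b), BLOSUM62.get((b, a)))

def needleman_wunsch(x, y, gap=-10):
    """Return the optimal global alignment score of x and y."""
    n, m = len(x), len(y)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = i * gap                # all-gap left column
    for j in range(1, m + 1):
        C[0][j] = j * gap                # all-gap top row
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i][j] = max(C[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # qdiag
                          C[i - 1][j] + gap,                            # qup
                          C[i][j - 1] + gap)                            # qleft
    return C[n][m]
```

Running needleman_wunsch("SEND", "AND") gives 3, matching the best alignment A-ND found by the exhaustive search above.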




© 2018 BioinfoGuide. All Rights Reserved.