Today I am going to explain one of the most basic and important algorithm of bioinformatics "Needleman–Wunsch Algorithm" developed by Saul B. Needleman and Christian D. Wunsch in 1970. It was designed to compare biological sequences and was one of the first applications of dynamic programming to the biological sequence comparison. This algorithm is usually used for global alignment of two sequences (nucleotide or amino acids).

For the purpose of explanation, first summarizing the algorithm in five steps and then I will move forwards with examples and explanations.

- Consider all the possible pairs of residues from two sequences, the best way is to generate a 2D matrix of two sequences. We will need 2 such matrices one for sequence and one for scores.
- Initialize the score matrix. Scoring matrices are used to determine the relative score made by matching two characters in a sequence alignment. There are many flavors of scoring matrices for amino acid sequences, nucleotide sequences, and codon sequences, and each is derived from the alignment of "known" homologous sequences. These alignments are then used to determine the likelihood of one character being at the same position in the sequence as another character.
- Gap penalty: Usually there are high chances of insertions and deletions (indels) in biological sequences but one large indels is more likely rather than multiple small indels in a given sequences. In order to tackle this issue we give two kind of penalities;
**Gap opening penalty**(relatively higher) and**gap extension penalty**(relatively lower) - Calculate scores and fill the traceback matrix
- Deduce the alignment from the traceback matrix

Now, let's with a small example, the alignment of two sequences using BLOSUM62 substitution matrix. Assume sequence A as "SEND" and sequence B as "AND" and gap opening penalty of 10 (no gap extension):

**The Score And Traceback Matrices**: The cells of the score matrix are labelled C(i; j) where i = 1; 2; :::; N and j = 1; 2; :::; M

**Scoring**: The score matrix cells are filled by row starting from the cell C(2; 2). The score of any cell C(i; j) is the maximum of:

- q
^{diag}= C(i 1; j 1) + S(i; j) - q
^{up}= C(i 1; j) + g - q
^{left}= C(i; j 1) + g

where S(i; j) is the substitution score for letters i and j, and g is the gap penalty. The value of the cell C(i; j) depends only on the values of the immediately adjacent northwest diagonal, up, and left cells.

**The Needleman-Wunsch Progression**:

1. The first step is to calculate the value of C(2; 2):

- q
^{diag}= C(1; 1) + S(S; A) = 0 + 1 = 1 - q
^{up}= C(1; 2) + g = 10 + (10) = 20 - q
^{left}= C(2; 1) + g = 10 + (10) = 20

Where C(1; 1), C(1; 2), and C(2; 1) are read from the score matrix, and S(S; A) is the score for the S <-> A taken from the BLOSUM62 matrix. For the score matrix C(2; 2) =q^{diag }which is 1. The corresponding cell of the traceback matrix is "diag":

2. The next step is to calculate C(2; 3):

- q
^{diag}= C(1; 2) + S(E; A) = 10 + 1 = 11 - q
^{up}= C(1; 3) + g = 20 + (10) = 30 - q
^{left}=C(2; 2) + g = 1 + (10) = 9

Thus C(2; 3) = 9 and the corresponding cell of the traceback matrix is "left". After complete calculations of all cells, the score and traceback matrices are:

**The Traceback**: Traceback is the process of deduction of the best alignment from the traceback matrix. The traceback always begins with the last cell to be filled with the score, i.e. the bottom right cell. One moves according to the traceback value written in the cell. There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left. The traceback is completed when the first, top-left cell of the matrix is reached ("done" cell). The traceback performed on the completed traceback matrix:

**The Best Alignment**: The alignment is deduced from the values of cells along the traceback path, by taking into account the values of the cell in the traceback matrix:

- diag - the letters from two sequences are aligned
- left - a gap is introduced in the left sequence
- up - a gap is introduced in the top sequence

Sequences are aligned backwards.

**The Traceback Step-By-Step**:

- The first cell from the traceback path is "diag" implying that the corresponding letters are aligned:

D

D - The second cell from the traceback path is also "diag" implying that the corresponding letters are aligned:

ND

ND - The third cell from the traceback path is "left" implying the gap in the left sequence (i.e. we stay on the letter A from the left sequence):

END

-ND - The fourth cell from the traceback path is also "diag" implying that the corresponding letters are aligned. We consider the letter A again, this time it is aligned with S:

SEND

A-ND

**Compare With The Exhaustive Search**:

- The best alignment via the Needleman-Wunsch algorithm:

SEND

A-ND - The exhaustive search:

SEND

-AND score: +1

A-ND score: +3 the best

AN-D score: -3

AND- score: -8

**Pros and Cons**: In this example we considered very short sequences SEND and AND which are much easier to compare using exhaustive search but if we think of real problems that contains longer sequences the situation quickly turns against it. For example; two 12 residue sequences would require considering ~ 1 million alignments and two 150 residue sequences would require considering ~ 10ˆ88 alignments (~ 10ˆ78 is the estimated number of atoms in the Universe). On the other hand for two 150 residue sequences the Needleman-Wunsch algorithm requires only filling a 150 * 150 matrix, which is not much computationally expensive. The major problem is Needleman-Wunsch algorithm works in the same way regardless of the length or complexity of sequences which provides heuristic algorithm an upper hand over this algorithm but Needleman-Wunsch algorithm guarantees to provide the best alignment.