Sequence alignment

 

Sequence alignment algorithm: a dynamic programming approach, always trying to follow the best prior-result so far.

Try to align two sequences by inserting some gaps at different locations what are unknown, i.e. getting two new sequences with nulls, so as to maximize the score of this alignment.
Score measurement could be is self-defined. It is determined by "match award", "mismatch penalty" and "gap penalty". The higher the score, the better the alignment. If both penalties are set to 0, it aims to always find an alignment with maximum matches so far. Maximum match = largest number matches can have for one sequence by allowing all possible deletion of another sequence.
It's used to compare the similarity between two sequences of DNA or Protein, to predict similarity of their functionalities.

There are three algorithms available: Needleman-Wunsch(1970), Sellers(1974), Smith-Waterman(1981)

Global alignment: the first two algorithms are used for global alignment.

Suppose, we align and match the two sequences from left to right. Each cell in the score matrix denotes a sub-alignment between the two subsequences that forms a sub-matrix. It's the best alignment  we can get so far for these two sub-sequences. At this point, there are three ways to go, "insert a gap to X sequence"--moving down the matrix, "insert a gap to Y sequence"--moving right the matrix, "try for a matching"--both X and Y sequences move forward diagonally. For same reason, for any cell in the matrix, there are three sub-matrixes, i.e., three pairs of sub-sequences leads to this sub-alignment. Choosing the best score among these three ways, get the optimal score for the current sub-matrix.

Gap penalty:

a general gap-penalty function has the form W(k) = copen + clength * k, where k is the length of the gap, copen is the gap-open penalty constant, and clength the gap-length penalty constant, also sometimes cend will be applied.

There may be different formulas to compute score.

Problem:

To align two sequence: ABCNJRQCLCRPM, AJCJNRCKCRBP

Needleman Wunsch algorithm: M[i,j]=Max{S[i,j]+M[i-1,j-1],M[i-1,j-k], M[i-r,j-1]}, can't deal with case with gap

Condition:

Given: Match Award: Pa=1 points, no penalty for mismatch and gap.

Process:

Step 1: to fill the matrix S with 1 if match, else 0.

 

Step 2:  compute score based on the formula M[i,j]=Max{S[i,j]+M[i-1,j-1],M[i-1,j-k], M[i-r,j-1]}. The following case starts inversely from right-bottom, so the formula above should be modified relatively. Also, there is an assumption behind: the row(-1st row) and column(-1st col) that are not shown in the matrix, but will be used to compute first row(0th row) and column(0th col) are all 0s, which means an empty string matched to a sequence in case of without gap penalty.

Step 3: trace back from the left-top corner, and always select the maximum value from the outmost column and row, and jump to next maximum in the next row or column.

Step 4: present alignment as Seller's algorithm shown below.

 

Sellers algorithm: M[i,j]=Max{S[i,j]+M[i-1,j-1],M[i,j-1]+W(1),M[i-1,j]+W(1)}. //1 is number one, instead of letter 'l'.

Condition:

Given: Match Award: Pa=2 points, Mismatch Penalty: Pm=-1 point, Gap Penalty: Pg=-1 point. Source Code.

Process:

first step: create the initial matrix, (N+1)X(M+1), M is for X axis, N is for Y, and this matrix is called S.

That is, get the white part of the following matrix by filling Pa and Pm relatively.
If no match, no mismatch and no gap, cell value is default to 0.
First row(-1st row) and first column(-1st col) which is in gray means match a sequence to another empty sequence/a gap. Strictly to say, they belong to the second step. While in this step, they should be all Pg( such as -1 or 0), except the intersection between -1st row and -1st col which is 0.
However, it's common sense to have them available directly.

second step: create the score matrix called M, from top-left to low-right direction.

D: means from diagonal neighbor, L: left neighbor, U: upper neighbor
based on the formula:
M[i,j]=Max{S[i,j]+M[i-1,j-1],M[i-1,j]-Pg,M[i,j-1]-Pg}, 1<=i<=M+1,1<=j<=N+1, -Pg could be more general as:
and remember the arrow from the cell/cells which is/are Max{S[i,j]+M[i-1,j-1],M[i-1,j]-Pg,M[i,j-1]-Pg}, to M[i,j]

third step: backtrack from low-right corner in the following way:

Simple tracking: without remember the arrows, suitable for maximum matches, else may introduce error.
//always start from the largest in the outmost row and outmost column.
  for(i=M+1,j=N+1;i>0&&j>0;)
   {
        highlight the cell P(x,y) with biggest number in column P.x in 0~j and row P.y in 0~i;
        if(ties happen) trace every P(x,y);
        i=P.x; j=P.y;
        i--; j--;
   
}

Standard tracking:
start from the start, follow the arrows back to end.

fourth step: insert gaps and compute score.

Starting from upper-left corner (end), read X and Y sequences respectively.
If there is a down arrow, insert a gap for X sequence, a right arrow, insert a gap for Y sequence.
Results:

A B C N J   R Q C L C R   P M
A J C   J N R   C K C R B P  

2-1+2-1+2-1+2-1+2-1+2+2-1+2-1=9

A B C   N J R Q C L C R   P M
A J C J N   R   C K C R B P  

2-1+2-1+2-1+2-1+2-1+2+2-1+2-1=9

 

Local alignment: Smith-Waterman, M[i,j]=Max(M[i-1,j-1]+S[i,j], Max(M[i,r]+W(r)), Max(M[s,j]+W(s)), 0)

The idea comes from global alignment, and is similar especially to Sellers(1974). See following:

The difference between local alignment and global alignment algorithms lies in the following observation: local alignment always set a cell to 0 when it becomes negative, which means a match will always be restarted from this cell( because a cell's value comes from three directions, M[i,j] reaching 0 denotes all three directions' score are balanced to be 'worthless'), while global alignment will consider the result from the beginning which is cumulative to current match. Also, when tracing back, local alignment will start from the maximum value, however, global alignment will span whole length of two sequences. A local alignment should be the method of choice when the homology of two sequences is distant or the sequences are too long.

Process:

first step: initialize the matrix S to be filled with 0

second step: continue initializing the matrix S by adding Match/Mismatch points to the matrix got above. Notice, the row(-1st) and column(-1st) are not shown, because they will take no effect on the 0th row and 0th column after applying to the formula.

third step: compute the score matrix and trace back from the highest cell in the matrix and following the arrows till cell with value 0(not included).

fourth step: present result as Seller's algorithm above.