|
Sequence alignment
|
Sequence alignment algorithm:
a dynamic programming approach, always trying to follow the best
prior-result so far.
Try to align two sequences by inserting some gaps at different
locations what are unknown, i.e. getting two new sequences with nulls,
so as to maximize the score of this alignment. Score measurement
could be is self-defined. It is determined by "match award",
"mismatch penalty" and "gap penalty". The higher the score, the
better the alignment. If both penalties are set to 0, it aims to always
find an alignment with maximum matches so far. Maximum match = largest number
matches can have for one sequence by allowing all possible deletion of
another sequence. It's used to compare the similarity between
two sequences of DNA or Protein, to predict similarity of their
functionalities.
There are three algorithms available: Needleman-Wunsch(1970),
Sellers(1974), Smith-Waterman(1981)
|
|
Global alignment: the first two algorithms
are used for global alignment.
Suppose, we align and match the two sequences from left to right.
Each cell in the score matrix denotes a sub-alignment between the two
subsequences that forms a sub-matrix. It's the best alignment we
can get so far for these two sub-sequences. At this point, there are
three ways to go, "insert a gap to X
sequence"--moving down the matrix, "insert a gap to Y sequence"--moving right the matrix,
"try for a matching"--both X and Y sequences move forward
diagonally. For same reason, for any cell in the matrix, there are three
sub-matrixes, i.e., three pairs of sub-sequences leads to this
sub-alignment. Choosing the best score among these three ways, get the
optimal score for the current sub-matrix.

Gap penalty:
a general gap-penalty function has the form W(k) =
copen + clength * k, where k
is the length of the gap, copen is the
gap-open penalty constant, and clength the
gap-length penalty constant, also sometimes cend
will be applied.
There may be different formulas to compute score.
Problem:
To align two sequence: ABCNJRQCLCRPM, AJCJNRCKCRBP
|
|
Needleman Wunsch algorithm:
M[i,j]=Max{S[i,j]+M[i-1,j-1],M[i-1,j-k], M[i-r,j-1]}, can't
deal with case with gap
Condition:
Given: Match Award: Pa=1 points, no penalty for mismatch and
gap.
Process:
Step 1: to fill the matrix S with 1 if match, else
0.

Step 2: compute score based on the formula M[i,j]=Max{S[i,j]+M[i-1,j-1],M[i-1,j-k], M[i-r,j-1]}.
The following case starts inversely from right-bottom, so the
formula above should be modified relatively. Also, there is an
assumption behind: the row(-1st row) and column(-1st col) that are not
shown in the matrix, but will be used to compute first row(0th row)
and column(0th col) are all 0s, which means an empty string matched to
a sequence in case of without gap penalty.

Step 3: trace back from the left-top corner, and always
select the maximum value from the outmost column and row, and jump to
next maximum in the next row or column.

Step 4: present alignment as Seller's algorithm shown below.
|
|
Sellers algorithm:
M[i,j]=Max{S[i,j]+M[i-1,j-1],M[i,j-1]+W(1),M[i-1,j]+W(1)}. //1 is
number one, instead of letter 'l'.
Condition:
Given: Match Award: Pa=2 points, Mismatch Penalty:
Pm=-1 point, Gap Penalty: Pg=-1 point.
Source Code.
Process:
first step: create the initial matrix, (N+1)X(M+1), M is for X
axis, N is for Y, and this matrix is called S.
That is, get the white part of the following matrix by filling Pa
and Pm relatively. If no match, no mismatch and no gap, cell value is
default to 0. First row(-1st row)
and first column(-1st col) which is in gray means match a sequence to another
empty sequence/a gap. Strictly to say, they belong to the
second step. While in this step, they should be all Pg( such as
-1 or 0), except the intersection between -1st row and -1st col which
is 0. However, it's common sense to have them available directly.

second step: create the score matrix called M, from
top-left to low-right direction.
D: means from diagonal neighbor, L: left neighbor, U: upper
neighbor based on the formula: M[i,j]=Max{S[i,j]+M[i-1,j-1],M[i-1,j]-Pg,M[i,j-1]-Pg},
1<=i<=M+1,1<=j<=N+1, -Pg could be more general as:
and remember the arrow from the cell/cells which is/are
Max{S[i,j]+M[i-1,j-1],M[i-1,j]-Pg,M[i,j-1]-Pg}, to
M[i,j]

third step: backtrack from low-right corner in the following
way:
Simple tracking: without remember the arrows, suitable for
maximum matches, else may introduce
error. //always start from the largest in the outmost row
and outmost column. for(i=M+1,j=N+1;i>0&&j>0;)
{ highlight the cell
P(x,y) with biggest number in column P.x in 0~j and row P.y in
0~i; if(ties happen)
trace every P(x,y);
i=P.x; j=P.y; i--;
j--; }

Standard tracking: start from the start, follow
the arrows back to end.

fourth step: insert gaps and compute score.
Starting from upper-left corner (end), read X and Y sequences
respectively. If there is a down arrow, insert a gap for X
sequence, a right arrow, insert a gap for Y sequence. Results:
| A |
B |
C |
N |
J |
|
R |
Q |
C |
L |
C |
R |
|
P |
M |
| A |
J |
C |
|
J |
N |
R |
|
C |
K |
C |
R |
B |
P |
|
2-1+2-1+2-1+2-1+2-1+2+2-1+2-1=9
| A |
B |
C |
|
N |
J |
R |
Q |
C |
L |
C |
R |
|
P |
M |
| A |
J |
C |
J |
N |
|
R |
|
C |
K |
C |
R |
B |
P |
|
2-1+2-1+2-1+2-1+2-1+2+2-1+2-1=9
|
|
Local alignment: Smith-Waterman,
M[i,j]=Max(M[i-1,j-1]+S[i,j], Max(M[i,r]+W(r)), Max(M[s,j]+W(s)), 0)
The idea comes from global alignment, and is similar especially to
Sellers(1974). See following:

The difference between local alignment and global alignment
algorithms lies in the following observation: local alignment always set
a cell to 0 when it becomes negative, which means a match will always
be restarted from this cell( because a cell's value comes from three
directions, M[i,j] reaching 0 denotes all three directions' score are
balanced to be 'worthless'), while global alignment will consider the
result from the beginning which is cumulative to current match. Also,
when tracing back, local alignment will start from the maximum value,
however, global alignment will span whole length of two sequences. A
local alignment should be the method of choice when the homology of two
sequences is distant or the sequences are too long.
Process:
first step: initialize the matrix S to be filled with 0

second step: continue initializing the matrix S by
adding Match/Mismatch points to the matrix got above. Notice, the
row(-1st) and column(-1st) are not shown, because they will take no
effect on the 0th row and 0th column after applying to the formula.
third step: compute the score matrix and trace back from the
highest cell in the matrix and following the arrows till cell with value
0(not included).

fourth step: present result as Seller's algorithm above.
|