Longest common subsequence problem - AbsoluteAstronomy.com

The longest common subsequence (LCS) problem is to find the longest subsequence common to all sequences in a set of sequences (often just two). Note that a subsequence is different from a substring: a subsequence is derived from a sequence by deleting some elements without changing the order of the remaining elements, so it need not be contiguous. It is a classic computer science problem, the basis of file comparison programs such as diff, and has applications in bioinformatics.

Complexity

For the general case of an arbitrary number of input sequences, the problem is NP-hard. When the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming (see Solution below). Assume you have N sequences of lengths n_1, ..., n_N. A naive search would test each of the 2^(n_1) subsequences of the first sequence to determine whether they are also subsequences of the remaining sequences; each subsequence may be tested in time linear in the lengths of the remaining sequences, so the time for this algorithm would be

O(2^(n_1) × (n_2 + ... + n_N)).

For the case of two sequences of n and m elements, the running time of the dynamic programming approach is O(n × m). For an arbitrary number of input sequences, the dynamic programming approach gives a solution in

O(N × n_1 × n_2 × ... × n_N).

There exist methods with lower complexity, which often depend on the length of the LCS, the size of the alphabet, or both.

Notice that the LCS is not necessarily unique; for example, "ABC" and "ACB" have two longest common subsequences, "AB" and "AC". Indeed, the LCS problem is often defined to be finding all common subsequences of maximum length. This problem inherently has higher complexity, as the number of such subsequences is exponential in the worst case, even for only two input strings. The LCS problem should not be confused with the longest common substring problem, which is to find the longest string that is a substring of two or more strings (substrings are necessarily contiguous).
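The naive search described above can be sketched in a few lines of Python. This is only an illustration of the exponential approach, not a practical method; the names `is_subsequence` and `naive_lcs` are made up for this example.

```python
from itertools import combinations


def is_subsequence(candidate, sequence):
    """Return True if candidate can be obtained from sequence by deletions only."""
    remaining = iter(sequence)
    # `symbol in remaining` advances the iterator until it finds a match,
    # so the relative order of the symbols is respected.
    return all(symbol in remaining for symbol in candidate)


def naive_lcs(first, *others):
    """Try the subsequences of `first`, longest first, and return the first
    one that is also a subsequence of every other sequence."""
    for length in range(len(first), -1, -1):
        for positions in combinations(range(len(first)), length):
            candidate = "".join(first[p] for p in positions)
            if all(is_subsequence(candidate, other) for other in others):
                return candidate  # the empty candidate always matches


print(naive_lcs("ABC", "ACB"))  # one of the two LCSs, "AB" or "AC"
```

Each membership test is linear in the length of the sequence being scanned, but the number of candidates doubles with every element added to the first sequence.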

Solution for two sequences

The LCS problem has an optimal substructure: the problem can be broken down into smaller, simpler "subproblems", which can be broken down into yet simpler subproblems, and so on, until, finally, the solution becomes trivial. The LCS problem also has overlapping subproblems: the solution to a higher subproblem depends on the solutions to several of the lower subproblems. Problems with these two properties (optimal substructure and overlapping subproblems) can be approached by a problem-solving technique called dynamic programming, in which the solution is built up starting with the simplest subproblems. The procedure requires memoization: saving the solutions to one level of subproblem in a table so that the solutions are available to the next level of subproblems.
This method is illustrated here.

Prefixes

The subproblems become simpler as the sequences become shorter. Shorter sequences are conveniently described using the term prefix. A prefix of a sequence is the sequence with the end cut off. Let S be the sequence (AGCA). Then, a prefix of S is the sequence (AG). Prefixes are denoted with the name of the sequence, followed by a subscript to indicate how many characters the prefix contains. The prefix (AG) is denoted S₂, since it contains the first 2 elements of S. The possible prefixes of S are

S₁ = (A)

S₂ = (AG)

S₃ = (AGC)

S₄ = (AGCA).

The solution to the LCS problem for two arbitrary sequences, X and Y, amounts to constructing some function, LCS(X, Y), that gives the longest subsequences common to X and Y. That function relies on the following two properties.

First property

Suppose that two sequences both end in the same element. To find their LCS, shorten each sequence by removing the last element, find the LCS of the shortened sequences, and to that LCS append the removed element.

For example, here are two sequences having the same last element: (BANANA) and (ATANA).

Remove the same last element. Repeat the procedure until you find no common last element. The removed sequence will be (ANA).

The sequences now under consideration: (BAN) and (AT)

The LCS of these last two sequences is, by inspection, (A).

Append the removed elements, (ANA), giving (AANA), which, by inspection, is the LCS of the original sequences.

In terms of prefixes,

LCS(X_n, Y_m) = (LCS( X_n-1, Y_m-1), x_n)

where the comma indicates that the following element, x_n, is appended to the sequence. Note that the LCS for X_n and Y_m involves determining the LCS of the shorter sequences, X_n-1 and Y_m-1.

Second property

Suppose that the two sequences X and Y do not end in the same symbol.
Then the LCS of X and Y is the longer of the two sequences LCS(X_n,Y_m-1) and LCS(X_n-1,Y_m).

To understand this property, consider the two following sequences :

sequence X: ABCDEFG (n elements)

sequence Y: BCDGK (m elements)

The LCS of these two sequences either ends with a G (the last element of sequence X) or does not.

Case 1: the LCS ends with a G

Then it cannot end with a K. Thus it does not hurt to remove the K from sequence Y: if K were in the LCS, it would be its last character; as a consequence K is not in the LCS. We can then write: LCS(X_n,Y_m) = LCS(X_n, Y_m-1).

Case 2: the LCS does not end with a G

Then it does not hurt to remove the G from the sequence X (for the same reason as above). And then we can write: LCS(X_n,Y_m) = LCS(X_n-1, Y_m).

In either case, the LCS we are looking for is one of LCS(X_n, Y_m-1) or LCS(X_n-1, Y_m). Both of these are common subsequences of X and Y, and LCS(X, Y) is the longest common subsequence, so its value is the longer of LCS(X_n, Y_m-1) and LCS(X_n-1, Y_m).

LCS function defined

Let two sequences be defined as follows: X = (x₁, x₂, ..., x_m) and Y = (y₁, y₂, ..., y_n). The prefixes of X are X₁, X₂, ..., X_m; the prefixes of Y are Y₁, Y₂, ..., Y_n. Let LCS(X_i, Y_j) represent the set of longest common subsequences of prefixes X_i and Y_j. This set of sequences is given by the following.

LCS(X_i, Y_j) = Ø, if i = 0 or j = 0

LCS(X_i, Y_j) = (LCS(X_i-1, Y_j-1), x_i), if x_i = y_j

LCS(X_i, Y_j) = longest(LCS(X_i, Y_j-1), LCS(X_i-1, Y_j)), if x_i ≠ y_j

To find the longest subsequences common to X_i and Y_j, compare the elements x_i and y_j. If they are equal, then the sequence LCS(X_i-1, Y_j-1) is extended by that element, x_i. If they are not equal, then the longer of the two sequences, LCS(X_i, Y_j-1) and LCS(X_i-1, Y_j), is retained. (If they are both the same length, but not identical, then both are retained.) Notice that the subscripts are reduced by 1 in these formulas. That can result in a subscript of 0. Since the sequence elements are defined to start at 1, it is necessary to add the requirement that the LCS is empty when a subscript is zero.
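The definition above translates fairly directly into a memoized recursive function. The following Python sketch assumes the sequences are strings and returns the set of all longest common subsequences; the name `all_lcs` and the use of `functools.lru_cache` for the table are choices made for this illustration, and the recursion depth limits it to short inputs.

```python
from functools import lru_cache


def all_lcs(x, y):
    """Return the set of all longest common subsequences of strings x and y."""

    @lru_cache(maxsize=None)
    def lcs(i, j):
        # Set of LCSs of the prefixes X_i = x[:i] and Y_j = y[:j].
        if i == 0 or j == 0:
            return frozenset({""})        # empty prefix: only the empty sequence
        if x[i - 1] == y[j - 1]:          # x_i = y_j (Python indexes from 0)
            return frozenset(z + x[i - 1] for z in lcs(i - 1, j - 1))
        left, up = lcs(i, j - 1), lcs(i - 1, j)
        # All members of a cell have the same length, so one sample is enough.
        left_len = len(next(iter(left)))
        up_len = len(next(iter(up)))
        if left_len > up_len:
            return left
        if up_len > left_len:
            return up
        return left | up                  # same length: retain both

    return set(lcs(len(x), len(y)))


print(sorted(all_lcs("ABC", "ACB")))  # ['AB', 'AC']
```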

Worked example

The longest subsequence common to C = (AGCAT), and R = (GAC) will be found. Because the LCS function uses a "zeroth" element, it is convenient to define zero prefixes that are empty for these sequences: C₀ = Ø; and R₀ = Ø. All the prefixes are placed in a table with C in the first row (making it a column header) and R in the first column (making it a row header).

LCS Strings
	Ø	A	G	C	A	T
Ø	Ø	Ø	Ø	Ø	Ø	Ø
G	Ø
A	Ø
C	Ø

This table is used to store the LCS sequence for each step of the calculation. The second column and second row have been filled in with Ø, because when an empty sequence is compared with a non-empty sequence, the longest common subsequence is always an empty sequence.

LCS(R₁, C₁) is determined by comparing the first elements in each sequence. G and A are not the same, so this LCS gets (using the "second property") the longest of the two sequences, LCS(R₁, C₀) and LCS(R₀, C₁). According to the table, both of these are empty, so LCS(R₁, C₁) is also empty, as shown in the table below. The sequence comes from both the cell above, LCS(R₀, C₁), and the cell on the left, LCS(R₁, C₀).

LCS(R₁, C₂) is determined by comparing G and G. They match, so G is appended to the upper left sequence, LCS(R₀, C₁), which is (Ø), giving (ØG), which is (G).

For LCS(R₁, C₃), G and C do not match. The sequence above is empty; the one to the left contains one element, G. Selecting the longest of these, LCS(R₁, C₃) is (G). The sequence is taken from the cell on the left, since that is the longer of the two.

LCS(R₁, C₄) and LCS(R₁, C₅), likewise, are (G).

"G" Row Completed
	Ø	A	G	C	A	T
Ø	Ø	Ø	Ø	Ø	Ø	Ø
G	Ø	Ø	(G)	(G)	(G)	(G)
A	Ø
C	Ø

For LCS(R₂, C₁), A is compared with A. The two elements match, so A is appended to Ø, giving (A).

For LCS(R₂, C₂), A and G do not match, so the longest of LCS(R₁, C₂), which is (G), and LCS(R₂, C₁), which is (A), is used. In this case, they each contain one element, so this LCS is given two subsequences: (A) and (G).

For LCS(R₂, C₃), A does not match C. LCS(R₂, C₂) contains sequences (A) and (G); LCS(R₁, C₃) is (G), which is already contained in LCS(R₂, C₂). The result is that LCS(R₂, C₃) also contains the two subsequences, (A) and (G).

For LCS(R₂, C₄), A matches A, which is appended to the upper left cell, giving (GA).

For LCS(R₂, C₅), A does not match T. Comparing the two sequences, (GA) and (G), the longest is (GA), so LCS(R₂, C₅) is (GA).

"G" & "A" Rows Completed
	Ø	A	G	C	A	T
Ø	Ø	Ø	Ø	Ø	Ø	Ø
G	Ø	Ø	(G)	(G)	(G)	(G)
A	Ø	(A)	(A) & (G)	(A) & (G)	(GA)	(GA)
C	Ø

For LCS(R₃, C₁), C and A do not match, so LCS(R₃, C₁) gets the longest of the two sequences, (A).

For LCS(R₃, C₂), C and G do not match. Both LCS(R₃, C₁) and LCS(R₂, C₂) have one element. The result is that LCS(R₃, C₂) contains the two subsequences, (A) and (G).

For LCS(R₃, C₃), C and C match, so C is appended to LCS(R₂, C₂), which contains the two subsequences, (A) and (G), giving (AC) and (GC).

For LCS(R₃, C₄), C and A do not match. Combining LCS(R₃, C₃), which contains (AC) and (GC), and LCS(R₂, C₄), which contains (GA), gives a total of three sequences: (AC), (GC), and (GA).

Finally, for LCS(R₃, C₅), C and T do not match. The result is that LCS(R₃, C₅) also contains the three sequences, (AC), (GC), and (GA).

Completed LCS Table
	Ø	A	G	C	A	T
Ø	Ø	Ø	Ø	Ø	Ø	Ø
G	Ø	Ø	(G)	(G)	(G)	(G)
A	Ø	(A)	(A) & (G)	(A) & (G)	(GA)	(GA)
C	Ø	(A)	(A) & (G)	(AC) & (GC)	(AC) & (GC) & (GA)	(AC) & (GC) & (GA)

The final result is that the last cell contains all the longest subsequences common to (AGCAT) and (GAC); these are (AC), (GC), and (GA). The table also shows the longest common subsequences for every possible pair of prefixes. For example, for (AGC) and (GA), the longest common subsequences are (A) and (G).
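The completed table can be reproduced with a short bottom-up sketch in Python. The function name `lcs_set_table` is invented for this example, and each cell is stored as a Python set of strings, with the empty string standing in for Ø.

```python
def lcs_set_table(rows, cols):
    """Build a table T where T[i][j] is the set of LCSs of rows[:i] and cols[:j]."""
    # Row 0 and column 0 hold only the empty sequence (the Ø cells).
    table = [[{""} for _ in range(len(cols) + 1)] for _ in range(len(rows) + 1)]
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            if rows[i - 1] == cols[j - 1]:
                # Match: extend every sequence in the upper-left cell.
                table[i][j] = {z + rows[i - 1] for z in table[i - 1][j - 1]}
            else:
                up, left = table[i - 1][j], table[i][j - 1]
                up_len = len(next(iter(up)))
                left_len = len(next(iter(left)))
                if up_len > left_len:
                    table[i][j] = set(up)
                elif left_len > up_len:
                    table[i][j] = set(left)
                else:
                    table[i][j] = up | left   # equal length: keep both
    return table


table = lcs_set_table("GAC", "AGCAT")
for label, row in zip("ØGAC", table):
    print(label, [sorted(cell) for cell in row])
```

The last printed row corresponds to the "C" row of the completed table, and its final cell holds (AC), (GA), and (GC).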

Traceback approach

Calculating the LCS of a row of the LCS table requires only the solutions to the current row and the previous row. Still, for long sequences, these sequences can get numerous and long, requiring a lot of storage space. Storage space can be saved by saving not the actual subsequences, but the length of the subsequence and the direction of the arrows, as in the table below.

Storing length, rather than sequences
	Ø	A	G	C	A	T
Ø	0	0	0	0	0	0
G	0	0	1	1	1	1
A	0	1	1	1	2	2
C	0	1	1	2	2	2

The actual subsequences are deduced in a "traceback" procedure that follows the arrows backwards, starting from the last cell in the table. When the length decreases, the sequences must have had a common element. Several paths are possible when two arrows are shown in a cell. Below is the table for such an analysis. One path traces out the sequence (GA): it passes through the matching cell LCS(R₂, C₄), where A is read, and the matching cell LCS(R₁, C₂), where G is read.

Traceback example
	Ø	A	G	C	A	T
Ø	0	0	0	0	0	0
G	0	0	1	1	1	1
A	0	1	1	1	2	2
C	0	1	1	2	2	2

Relation to other problems

For two strings X[1..m] and Y[1..n], the length of the shortest common supersequence (SCS) is related to the length of the LCS by

|SCS(X, Y)| = n + m − |LCS(X, Y)|.

The edit distance when only insertion and deletion are allowed (no substitution), or when the cost of a substitution is double the cost of an insertion or deletion, is:

d'(X, Y) = n + m − 2 × |LCS(X, Y)|.
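Both relations can be checked numerically once the LCS length is known. In this Python sketch the helper names (`lcs_length`, `scs_length`, `indel_distance`) are made up for the example, and the LCS length is computed with the usual dynamic programming table.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    c = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c[len(x)][len(y)]


def scs_length(x, y):
    """Length of the shortest common supersequence: n + m - |LCS|."""
    return len(x) + len(y) - lcs_length(x, y)


def indel_distance(x, y):
    """Edit distance with insertions and deletions only: n + m - 2 * |LCS|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


print(scs_length("ABC", "ACB"))      # 4, for example "ABCB"
print(indel_distance("ABC", "ACB"))  # 2: delete one symbol, insert one symbol
```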

Computing the length of the LCS

The function below takes as input sequences X[1..m] and Y[1..n], computes the length of the LCS between X[1..i] and Y[1..j] for all 1 ≤ i ≤ m and 1 ≤ j ≤ n, and stores it in C[i,j]. C[m,n] will contain the length of the LCS of X and Y.

function LCSLength(X[1..m], Y[1..n])
    C := array(0..m, 0..n)
    for i := 0..m
        C[i,0] := 0
    for j := 0..n
        C[0,j] := 0
    for i := 1..m
        for j := 1..n
            if X[i] = Y[j]
                C[i,j] := C[i-1,j-1] + 1
            else
                C[i,j] := max(C[i,j-1], C[i-1,j])
    return C[m,n]

Alternatively, memoization could be used.
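As a rough Python rendering of the above, the sketch below shows both variants: a bottom-up function that fills the whole C table, and a top-down function that memoizes a recursive definition. The names `lcs_length_table` and `lcs_length_memo` are chosen for this illustration, and Python's 0-based indexing means X[i] becomes `x[i - 1]`.

```python
from functools import lru_cache


def lcs_length_table(x, y):
    """Bottom-up: return the full table C, where C[i][j] is the LCS length
    of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c


def lcs_length_memo(x, y):
    """Top-down: same recurrence, computed on demand and cached."""

    @lru_cache(maxsize=None)
    def length(i, j):
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return length(i - 1, j - 1) + 1
        return max(length(i, j - 1), length(i - 1, j))

    return length(len(x), len(y))


c = lcs_length_table("XMJYAUZ", "MZJAWXU")
print(c[7][7])                                # 4
print(lcs_length_memo("XMJYAUZ", "MZJAWXU"))  # 4
```

The memoized version only visits the cells it needs but is limited by the recursion depth for long inputs.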

Reading out an LCS

The following function backtracks the choices taken when computing the C table. If the last characters in the prefixes are equal, they must be in an LCS. If not, check which gave the larger LCS, keeping x_i or keeping y_j, and make the same choice. Just choose one if they were equally long. Call the function with i=m and j=n.

function backtrack(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return ""
    else if X[i] = Y[j]
        return backtrack(C, X, Y, i-1, j-1) + X[i]
    else
        if C[i,j-1] > C[i-1,j]
            return backtrack(C, X, Y, i, j-1)
        else
            return backtrack(C, X, Y, i-1, j)
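A Python sketch of the same backtracking follows. It assumes a helper, here called `lcs_length_table`, that builds the C table as described earlier; both function names are illustrative.

```python
def lcs_length_table(x, y):
    """C[i][j] is the LCS length of x[:i] and y[:j]."""
    c = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c


def backtrack(c, x, y, i, j):
    """Read out one LCS of x[:i] and y[:j] from the table c."""
    if i == 0 or j == 0:
        return ""
    if x[i - 1] == y[j - 1]:
        # Matching last symbols are part of the LCS.
        return backtrack(c, x, y, i - 1, j - 1) + x[i - 1]
    if c[i][j - 1] > c[i - 1][j]:
        return backtrack(c, x, y, i, j - 1)
    return backtrack(c, x, y, i - 1, j)   # ties go to the cell above


x, y = "XMJYAUZ", "MZJAWXU"
print(backtrack(lcs_length_table(x, y), x, y, len(x), len(y)))  # MJAU
```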

Reading out all LCSs

If choosing x_i and y_j would give an equally long result, read out both resulting subsequences. This is returned as a set by this function. Notice that this function is not polynomial, as it might branch in almost every step if the strings are similar.

function backtrackAll(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return {""}
    else if X[i] = Y[j]
        return {Z + X[i] for all Z in backtrackAll(C, X, Y, i-1, j-1)}
    else
        R := {}
        if C[i,j-1] ≥ C[i-1,j]
            R := backtrackAll(C, X, Y, i, j-1)
        if C[i-1,j] ≥ C[i,j-1]
            R := R ∪ backtrackAll(C, X, Y, i-1, j)
        return R
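The set-valued readout might look like this in Python. As before, `lcs_length_table` is an assumed helper that builds the C table, and the function names are illustrative; the exponential worst case mentioned above applies to this sketch as well.

```python
def lcs_length_table(x, y):
    """C[i][j] is the LCS length of x[:i] and y[:j]."""
    c = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c


def backtrack_all(c, x, y, i, j):
    """Read out every LCS of x[:i] and y[:j] as a set of strings."""
    if i == 0 or j == 0:
        return {""}
    if x[i - 1] == y[j - 1]:
        return {z + x[i - 1] for z in backtrack_all(c, x, y, i - 1, j - 1)}
    result = set()
    if c[i][j - 1] >= c[i - 1][j]:        # left is at least as good
        result |= backtrack_all(c, x, y, i, j - 1)
    if c[i - 1][j] >= c[i][j - 1]:        # above is at least as good
        result |= backtrack_all(c, x, y, i - 1, j)
    return result


r, s = "GAC", "AGCAT"
print(sorted(backtrack_all(lcs_length_table(r, s), r, s, len(r), len(s))))
# ['AC', 'GA', 'GC']
```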

Print the diff

This function will backtrack through the C matrix, and print the diff between the two sequences. Notice that you will get a different answer if you exchange ≥ and < with > and ≤ below.

function printDiff(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i > 0 and j > 0 and X[i] = Y[j]
        printDiff(C, X, Y, i-1, j-1)
        print "  " + X[i]
    else
        if j > 0 and (i = 0 or C[i,j-1] ≥ C[i-1,j])
            printDiff(C, X, Y, i, j-1)
            print "+ " + Y[j]
        else if i > 0 and (j = 0 or C[i,j-1] < C[i-1,j])
            printDiff(C, X, Y, i-1, j)
            print "- " + X[i]
        else
            print ""
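A Python sketch of the diff readout is given below. Instead of printing, this version returns the diff as a list of lines so that it can be inspected; that change, and the names `diff_lines` and `lcs_length_table`, are choices made for this example.

```python
def lcs_length_table(x, y):
    """C[i][j] is the LCS length of x[:i] and y[:j]."""
    c = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c


def diff_lines(c, x, y, i, j):
    """Return the diff of x[:i] against y[:j] as a list of marked items."""
    if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
        return diff_lines(c, x, y, i - 1, j - 1) + ["  " + x[i - 1]]  # unchanged
    if j > 0 and (i == 0 or c[i][j - 1] >= c[i - 1][j]):
        return diff_lines(c, x, y, i, j - 1) + ["+ " + y[j - 1]]      # added
    if i > 0 and (j == 0 or c[i][j - 1] < c[i - 1][j]):
        return diff_lines(c, x, y, i - 1, j) + ["- " + x[i - 1]]      # removed
    return []                                                         # i = j = 0


x, y = "XMJYAUZ", "MZJAWXU"
for line in diff_lines(lcs_length_table(x, y), x, y, len(x), len(y)):
    print(line)
```

For these two strings the unchanged items spell out "MJAU", the items marked "-" are those only in the first string, and the items marked "+" are those only in the second.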

Example

Let X be "XMJYAUZ" and Y be "MZJAWXU". The longest common subsequence between X and Y is "MJAU". The table C shown below, which is generated by the function LCSLength, shows the lengths of the longest common subsequences between prefixes of X and Y. The i-th row and j-th column shows the length of the LCS between X[1..i] and Y[1..j].

     | 0 1 2 3 4 5 6 7
     |   M Z J A W X U
-----|-----------------
0    | 0 0 0 0 0 0 0 0
1  X | 0 0 0 0 0 0 1 1
2  M | 0 1 1 1 1 1 1 1
3  J | 0 1 1 2 2 2 2 2
4  Y | 0 1 1 2 2 2 2 2
5  A | 0 1 1 2 3 3 3 3
6  U | 0 1 1 2 3 3 3 4
7  Z | 0 1 2 2 3 3 3 4

The function backtrack follows a path from the bottom right to the top left corner when reading out an LCS. If the current symbols in X and Y are equal, they are part of the LCS, and we go both up and left. If not, we go up or left, depending on which cell has a higher number. This corresponds to either taking the LCS between X[1..i-1] and Y[1..j], or X[1..i] and Y[1..j-1].

Code optimization

Several optimizations can be made to the algorithm above to speed it up for real-world cases.

Reduce the problem set

The C matrix in the naive algorithm grows quadratically with the lengths of the sequences. For two 100-item sequences, a 10,000-item matrix would be needed, and 10,000 comparisons would need to be done. In most real-world cases, especially source code diffs and patches, the beginnings and ends of files rarely change, and almost certainly not both at the same time. If only a few items have changed in the middle of the sequence, the beginning and end can be eliminated. This reduces not only the memory requirements for the matrix, but also the number of comparisons that must be done.

function LCS(X[1..m], Y[1..n])
    start := 1
    m_end := m
    n_end := n
    // trim off the matching items at the beginning
    while start ≤ m_end and start ≤ n_end and X[start] = Y[start]
        start := start + 1
    // trim off the matching items at the end
    while start ≤ m_end and start ≤ n_end and X[m_end] = Y[n_end]
        m_end := m_end - 1
        n_end := n_end - 1
    C := array(start-1..m_end, start-1..n_end)
    // only loop over the items that have changed
    for i := start..m_end
        for j := start..n_end
            // the algorithm continues as before ...

In the best case scenario, a sequence with no changes, this optimization would completely eliminate the need for the C matrix. In the worst case scenario, a change to the very first and last items in the sequence, only two additional comparisons are performed.
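A sketch of this trimming step in Python, computing only the LCS length, is shown below. The function name `lcs_length_trimmed` is made up for the example; the table is built only for the middle part that remains after the common beginning and end are removed.

```python
def lcs_length_trimmed(x, y):
    """LCS length of x and y, with the common prefix and suffix trimmed first."""
    start = 0
    x_end, y_end = len(x), len(y)
    # Trim off the matching items at the beginning.
    while start < x_end and start < y_end and x[start] == y[start]:
        start += 1
    # Trim off the matching items at the end.
    while start < x_end and start < y_end and x[x_end - 1] == y[y_end - 1]:
        x_end -= 1
        y_end -= 1
    trimmed = start + (len(x) - x_end)      # items matched without any table
    mid_x, mid_y = x[start:x_end], y[start:y_end]
    # Only the items in the middle take part in the dynamic programming step.
    c = [[0] * (len(mid_y) + 1) for _ in range(len(mid_x) + 1)]
    for i in range(1, len(mid_x) + 1):
        for j in range(1, len(mid_y) + 1):
            if mid_x[i - 1] == mid_y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return trimmed + c[len(mid_x)][len(mid_y)]


print(lcs_length_trimmed("abcXdef", "abcYYdef"))  # 6, using a 2 x 3 table
```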

Reduce the comparison time

Most of the time taken by the naive algorithm is spent performing comparisons between items in the sequences. For textual sequences such as source code, you want to view lines as the sequence elements instead of single characters. This can mean comparisons of relatively long strings for each step in the algorithm. Two optimizations can be made that can help to reduce the time these comparisons consume.

Reduce strings to hashes

A hash function or checksum can be used to reduce the size of the strings in the sequences. That is, for source code where the average line is 60 or more characters long, the hash or checksum for that line might be only 8 to 40 characters long. Additionally, the randomized nature of hashes and checksums means that comparisons would short-circuit faster, as lines of source code will rarely be changed at the beginning.

There are three primary drawbacks to this optimization. First, an amount of time needs to be spent beforehand to precompute the hashes for the two sequences. Second, additional memory needs to be allocated for the new hashed sequences. However, in comparison to the naive algorithm used here, both of these drawbacks are relatively minimal.

The third drawback is that of collisions. Since the checksum or hash is not guaranteed to be unique, there is a small chance that two different items could be reduced to the same hash. This is unlikely in source code, but it is possible. A cryptographic hash would therefore be far better suited for this optimization, as its entropy is going to be significantly greater than that of a simple checksum. However, the setup and computational requirements of a cryptographic hash may not be worth it for small sequence lengths.
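As an illustration of the hashing idea, the Python sketch below replaces each line with a short digest before running the length computation. The choice of BLAKE2b with an 8-byte digest and the names `line_digest` and `lcs_length_hashed` are assumptions made for this example; this sketch does not guard against collisions, which a stricter variant could do by re-comparing the original lines whenever two digests match.

```python
import hashlib


def line_digest(line):
    """8-byte BLAKE2b digest of a line (an arbitrary choice for this sketch)."""
    return hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()


def lcs_length_hashed(a_lines, b_lines):
    """LCS length of two lists of lines, comparing digests instead of text."""
    a = [line_digest(line) for line in a_lines]   # precomputed once per line
    b = [line_digest(line) for line in b_lines]
    c = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:              # fixed-size comparison
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c[len(a)][len(b)]


old = ["int main() {", "  return 0;", "}"]
new = ["#include <stdio.h>", "int main() {", '  puts("hi");', "  return 0;", "}"]
print(lcs_length_hashed(old, new))  # 3 lines in common
```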

Reduce the required space

If only the length of the LCS is required, the matrix can be reduced to a 2 × min(n, m) matrix with ease, or to a min(m, n) + 1 vector (smarter), as the dynamic programming approach only needs the current and previous columns of the matrix. Hirschberg's algorithm allows the construction of the optimal sequence itself in the same quadratic time and linear space bounds.
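The single-vector variant can be sketched in Python as follows. The name `lcs_length_linear_space` is invented for this example, and the sketch covers only the length computation, not Hirschberg's algorithm.

```python
def lcs_length_linear_space(x, y):
    """LCS length using one vector of min(m, n) + 1 entries."""
    if len(y) > len(x):
        x, y = y, x                  # make y the shorter sequence
    row = [0] * (len(y) + 1)         # holds C[i-1][*], overwritten into C[i][*]
    for symbol in x:
        diagonal = 0                 # C[i-1][j-1]; C[i-1][0] is always 0
        for j, other in enumerate(y, start=1):
            above = row[j]           # C[i-1][j], saved before it is overwritten
            if symbol == other:
                row[j] = diagonal + 1
            else:
                row[j] = max(row[j - 1], above)   # row[j-1] is already C[i][j-1]
            diagonal = above
    return row[-1]


print(lcs_length_linear_space("XMJYAUZ", "MZJAWXU"))  # 4
```

Because the vector is overwritten in place, the table needed for reading out the subsequence itself is no longer available.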

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.