KMP String Searching Algorithm
String strStr Knuth-Morris-Pratt algorithm
The Knuth-Morris-Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a “word” S within a main “text string” T by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. (Paragraph cited from https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm)
The most straightforward strStr() algorithm is to look for a match of S at every index in text T, which takes O(TS) time in the worst case, while the KMP has a better worst case performance of O(T + S) = O(T).
The difference is that KMP makes use of previous match information from the jump table of string S that the straightforward algorithm does not.
Example of KMP searching
T: ABC ABCDAB ABCDABCDABDE
S: ABCDABD
i = 3
This round T and S matches until i = 3. Next round KMP will reuse the information of the visited substring "ABC" in T, and know that "BC" does not contain a prefix of S. So next round S skips T[1], T[2] and jumps to T[3].
T: ABC ABCDAB ABCDABCDABDE
S: ABCDABD
i = 0
This round T and S instantly mismatch. Move S forward by 1 step.
T: ABC ABCDAB ABCDABCDABDE
S: ABCDABD
i = 6
This round T and S matches until i = 6. Next round KMP reuse the information of the visited substring "ABCDAB" and moves S to the first occurrence of its prefix in "BCDAB", which is the "AB" at T[8].
T: ABC ABCDAB ABCDABCDABDE
S: ABCDABD
i = 2
This round T and S matches until i = 2. No prefix of S is found in "AB ". So move S by 2 steps.
T: ABC ABCDAB ABCDABCDABDE
S: ABCDABD
i = 0
Mismatch instantly. Move S forward.
T: ABC ABCDAB ABCDABCDABDE
S: ABCDABD
i = 6
This round T and S matches until i = 6. Next round KMP reuse the information of the visited substring "ABCDAB" and moves S to the first occurrence of its prefix in "BCDAB", which is the "AB" at T[15].
T: ABC ABCDA ABCDABCDABDE
S: ABCDABD
i = 7
This round T and S completely match. Record the this occurrence of S in T and continue the search. Next round KMP reuse the information of the visited substring "ABCDABD" and moves S to the first occurrence of its prefix in "BCDABD". However no prefix of S is found in "BCDABD". So move S by 7 steps to skip on "BCDABD" in T.
T: ABC ABCDAB ABCDABCDABDE
S: ABCDABD
This round the end of S has gone over the length of T, so end search.
Above example explains how the jumps take place during a search. But how do we learn the number of steps to jump at each round?
The occurrences of S's prefix in T is the key to the matter.
Building a Jump Table for S
For S = "ABCDABCA"
The jump table is a integer array that indicates the length of the matching prefix and suffix in S.
I. If S and T matches till i = 0, the part matched is "A".
Then look for the occurrence of S's prefix in "" (Cut the first character in the matched part because we don't need to compare S[0] = 'A' and T[x + 0] = 'A' again).
No prefix of S exists in "". Put 0 to the jump table for i = 0.
S = "ABCDABCA"
J = 0
II. If S and T matches till i = 1, the part matched is "AB".
Then look for the occurrence of S's prefix in the suffix of "B".
No prefix of S exists in "B". Put 0 to the jump table for i = 1.
S = "ABCDABCA"
J = 00
III. If S and T matches till i = 2, the part matched is "ABC".
Then look for the occurrence of S's prefix in the suffix of "BC".
No prefix of S exists in "BC". Put 0 to the jump table for i = 2.
S = "ABCDABCA"
J = 0 0 0
IV. If S and T matches till i = 3, the part matched is "ABCD".
Then look for the occurrence of S's prefix in the suffix of "BCD".
No prefix of S exists in "BCD". Put 0 to the jump table for i = 3.
S = "ABCDABCA"
J = 00 00
V. If S and T matches till i = 4, the part matched is "ABCDA".
Then look for the occurrence of S's prefix in the suffix of "BCDA".
The suffix "A" in "BCDA" is a prefix of S.
As the matching prefix and suffix has a length of 1, put 1 to the jump table for i = 4.
S = "ABCDABCA"
J = 00001
VI. If S and T matches till i = 5, the part matched is "ABCDAB".
Then look for the occurrence of S's prefix in the suffix of "BCDAB".
The suffix "AB" in "BCDAB" is a prefix of S.
As the matching prefix and suffix has a length of 2, put 2 to the jump table for i = 5.
S = "ABCDABCA"
J = 00 0 01 2
IIV. If S and T matches till i = 6, the part matched is "ABCDABC".
Then look for the occurrence of S's prefix in the suffix of "BCDABC".
The suffix "ABC" in "BCDABC" is a prefix of S.
As the matching prefix and suffix has a length of 3, put 3 to the jump table for i = 6.
S = "ABCDABCA"
J = 0 00 01 23
VIII. If S and T matches till i = 7, the part matched is "ABCDABCA".
Then look for the occurrence of S's prefix in the suffix of "BCDABCA".
The suffix "A" in "BCDABCA" is a prefix of S.
As the matching prefix and suffix has a length of 1, put 1 to the jump table for i = 7.
S = "ABCDABCA"
J = 0 0 0012 3 1
The jump table for string S is
[0, 0, 0, 0, 1, 2, 3, 1]
Implementation of Jump Table
For S = "ABCDABCA",
J = [ ] jump table
Base case
At i = 0, set J[0] = 0.
General case
At i >= 1, let idx = i,
LOOP: Check for the characters S[J[i - 1] ] == S[idx]
if true, J[i] = J[i - 1] + 1 (size of the matching prefix and suffix increased by 1)
while false, let i = J[i - 1], check for S[J[i - 1]] == S[idx] (Repeat LOOP until i == 0, if S[0] != S[idx], then set J[idx] = 0, otherwise J[idx] = 1)
The purpose of the loop cover the case when string S looks like
"AABABXXXAABAA"
01 010 00012 34 idx
i
To find J[idx], we first check whether S[idx] == S[J[idx - 1]], which is checking whether the prefix "AABAB" matches the suffix "AABAA", but no.
Does this mean that no prefix and suffix in S match any more?
Obviously there could still be a match.
So update i = S[J[i - 1]].
"AABABXXXAABAA"
01 010 00012 34 idx
i
Check for S[idx] == S[J[i - 1]], which checks for whether the prefix "AA" matches the suffix "AA". Sure it does. Therefore J[idx] = J[i - 1] + 1 = 2.
"AABABXXXAABAA"
01 010 00012 34 2
i
int[] table(String s) {
int[] t = new int[s.length()];
int i = 1, len = 0;
while(i < s.length()) {
if(s.charAt(i) == s.charAt(len)) {
len++;
t[i] = len;
i++;
} else {
if(len == 0) {
i++; //t[i] = 0;
} else {
len = t[len - 1 ];
}
}
}
return t;
}
How many steps to jump during a search?
For S = "ABCDABCA",
If S and T match up to the ith character, the matched substring is S[0, i + 1].
From the jump table J we learn that the length of matching suffix and prefix of S[0, i + 1] is J[i].
Our purpose is to move S to where the suffix locates.
For example, i = 6, the matched substring is "ABCDABC",
The length of equal prefix and suffix is J[6] = 3.
How to move S to where the suffix "ABC" locates?
ABCDABC
^
In other words we are trying to skip on the chars "ABCD".
The number of characters to skip on is (i - J[i]).
Therefore, in the next round simply move S forward by (i - J[i] + 1) steps.
Implementation
//find needle in haystack, i iterates through haystack and j iterates through needle.
public void strStr(String haystack, String needle) {
int[] next = table(needle); //get jump table
int i = 0, j = 0;
while (i <= haystack.length() - needle.length()) {
if (haystack.charAt(i + j) == needle.charAt(j)) {
j++;
if (j == needle.length()) { //needle found
System.out.println(i); //print where needle is found
i = i + j - next[j - 1]; //jump
j = 0; //reset j and look for the next match
}
} else {
if(j == 0) i++; //first chars mismatch
else {
i = i + j - next[j - 1]; //jump
j = 0; //reset j and look for the next match
}
}
}
}