# strStr KMP Algorithm

Posted by JH

## KMP String Searching Algorithm

String strStr Knuth-Morris-Pratt algorithm

The Knuth-Morris-Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a “word” S within a main “text string” T by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. (Paragraph cited from https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm)

The most straightforward strStr() algorithm is to look for a match of S at every index in text T, which takes O(TS) time in the worst case, while the KMP has a better worst case performance of O(T + S) = O(T).

The difference is that KMP makes use of previous match information from the jump table of string S that the straightforward algorithm does not.

# Example of KMP searching

T: ABC ABCDAB ABCDABCDABDE S: ABCDABD i = 3 This round T and S matches until i = 3. Next round KMP will reuse the information of the visited substring "ABC" in T, and know that "BC" does not contain a prefix of S. So next round S skips T, T and jumps to T. T: ABC ABCDAB ABCDABCDABDE S: ABCDABD i = 0 This round T and S instantly mismatch. Move S forward by 1 step. T: ABC ABCDAB ABCDABCDABDE S: ABCDABD i = 6 This round T and S matches until i = 6. Next round KMP reuse the information of the visited substring "ABCDAB" and moves S to the first occurrence of its prefix in "BCDAB", which is the "AB" at T. T: ABC ABCDAB ABCDABCDABDE S: ABCDABD i = 2 This round T and S matches until i = 2. No prefix of S is found in "AB ". So move S by 2 steps. T: ABC ABCDAB ABCDABCDABDE S: ABCDABD i = 0 Mismatch instantly. Move S forward. T: ABC ABCDAB ABCDABCDABDE S: ABCDABD i = 6 This round T and S matches until i = 6. Next round KMP reuse the information of the visited substring "ABCDAB" and moves S to the first occurrence of its prefix in "BCDAB", which is the "AB" at T. T: ABC ABCDA ABCDABCDABDE S: ABCDABD i = 7 This round T and S completely match. Record the this occurrence of S in T and continue the search. Next round KMP reuse the information of the visited substring "ABCDABD" and moves S to the first occurrence of its prefix in "BCDABD". However no prefix of S is found in "BCDABD". So move S by 7 steps to skip on "BCDABD" in T. T: ABC ABCDAB ABCDABCDABDE S: ABCDABD This round the end of S has gone over the length of T, so end search.

Above example explains how the jumps take place during a search. But how do we learn the number of steps to jump at each round?
The occurrences of S's prefix in T is the key to the matter.

# Building a Jump Table for S

For S = "ABCDABCA"
The jump table is a integer array that indicates the length of the matching prefix and suffix in S.

I. If S and T matches till i = 0, the part matched is "A". Then look for the occurrence of S's prefix in "" (Cut the first character in the matched part because we don't need to compare S = 'A' and T[x + 0] = 'A' again). No prefix of S exists in "". Put 0 to the jump table for i = 0. S = "ABCDABCA" J = 0 II. If S and T matches till i = 1, the part matched is "AB". Then look for the occurrence of S's prefix in the suffix of "B". No prefix of S exists in "B". Put 0 to the jump table for i = 1. S = "ABCDABCA" J = 00 III. If S and T matches till i = 2, the part matched is "ABC". Then look for the occurrence of S's prefix in the suffix of "BC". No prefix of S exists in "BC". Put 0 to the jump table for i = 2. S = "ABCDABCA" J = 0 0 0 IV. If S and T matches till i = 3, the part matched is "ABCD". Then look for the occurrence of S's prefix in the suffix of "BCD". No prefix of S exists in "BCD". Put 0 to the jump table for i = 3. S = "ABCDABCA" J = 00 00 V. If S and T matches till i = 4, the part matched is "ABCDA". Then look for the occurrence of S's prefix in the suffix of "BCDA". The suffix "A" in "BCDA" is a prefix of S. As the matching prefix and suffix has a length of 1, put 1 to the jump table for i = 4. S = "ABCDABCA" J = 00001 VI. If S and T matches till i = 5, the part matched is "ABCDAB". Then look for the occurrence of S's prefix in the suffix of "BCDAB". The suffix "AB" in "BCDAB" is a prefix of S. As the matching prefix and suffix has a length of 2, put 2 to the jump table for i = 5. S = "ABCDABCA" J = 00 0 01 2 IIV. If S and T matches till i = 6, the part matched is "ABCDABC". Then look for the occurrence of S's prefix in the suffix of "BCDABC". The suffix "ABC" in "BCDABC" is a prefix of S. As the matching prefix and suffix has a length of 3, put 3 to the jump table for i = 6. S = "ABCDABCA" J = 0 00 01 23 VIII. If S and T matches till i = 7, the part matched is "ABCDABCA". Then look for the occurrence of S's prefix in the suffix of "BCDABCA". The suffix "A" in "BCDABCA" is a prefix of S. As the matching prefix and suffix has a length of 1, put 1 to the jump table for i = 7. S = "ABCDABCA" J = 0 0 0012 3 1

The jump table for string S is
[0, 0, 0, 0, 1, 2, 3, 1]

# Implementation of Jump Table

For S = "ABCDABCA", J = [ ] jump table Base case At i = 0, set J = 0. General case At i >= 1, let idx = i, LOOP: Check for the characters S[J[i - 1] ] == S[idx] if true, J[i] = J[i - 1] + 1 (size of the matching prefix and suffix increased by 1) while false, let i = J[i - 1], check for S[J[i - 1]] == S[idx] (Repeat LOOP until i == 0, if S != S[idx], then set J[idx] = 0, otherwise J[idx] = 1) The purpose of the loop cover the case when string S looks like "AABABXXXAABAA" 01 010 00012 34 idx i To find J[idx], we first check whether S[idx] == S[J[idx - 1]], which is checking whether the prefix "AABAB" matches the suffix "AABAA", but no. Does this mean that no prefix and suffix in S match any more? Obviously there could still be a match. So update i = S[J[i - 1]]. "AABABXXXAABAA" 01 010 00012 34 idx i Check for S[idx] == S[J[i - 1]], which checks for whether the prefix "AA" matches the suffix "AA". Sure it does. Therefore J[idx] = J[i - 1] + 1 = 2. "AABABXXXAABAA" 01 010 00012 34 2 i
``````int[] table(String s) {
int[] t = new int[s.length()];
int i = 1, len = 0;
while(i < s.length()) {
if(s.charAt(i) == s.charAt(len)) {
len++;
t[i] = len;
i++;
} else {
if(len == 0) {
i++;    //t[i] = 0;
} else {
len = t[len - 1 ];
}
}
}
return t;
}
``````

# How many steps to jump during a search?

For S = "ABCDABCA", If S and T match up to the ith character, the matched substring is S[0, i + 1]. From the jump table J we learn that the length of matching suffix and prefix of S[0, i + 1] is J[i]. Our purpose is to move S to where the suffix locates. For example, i = 6, the matched substring is "ABCDABC", The length of equal prefix and suffix is J = 3. How to move S to where the suffix "ABC" locates? ABCDABC ^ In other words we are trying to skip on the chars "ABCD". The number of characters to skip on is (i - J[i]). Therefore, in the next round simply move S forward by (i - J[i] + 1) steps.

## Implementation

``````//find needle in haystack, i iterates through haystack and j iterates through needle.
public void strStr(String haystack, String needle) {
int[] next = table(needle); //get jump table
int i = 0, j = 0;
while (i <= haystack.length() - needle.length()) {
if (haystack.charAt(i + j) == needle.charAt(j)) {
j++;
if (j == needle.length()) { //needle found
System.out.println(i);  //print where needle is found
i = i + j - next[j - 1];  //jump
j = 0;       //reset j and look for the next match
}
} else {
if(j == 0) i++; //first chars mismatch
else {
i = i + j - next[j - 1]; //jump
j = 0; //reset j and look for the next match
}
}
}
}
``````