-
Notifications
You must be signed in to change notification settings - Fork 50
/
agrep.algorithms
48 lines (43 loc) · 2.57 KB
/
agrep.algorithms
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
/* Copyright (c) 1994 Sun Wu, Udi Manber, Burra Gopal. All Rights Reserved. */
The implementation of agrep includes the following algorithms.
Except for exact matching of simple patterns, for which we use
a simple variation of the Boyer-Moore algorithm,
all the algorithms (listed below) were designed by Sun Wu and Udi Manber.
1. bitap: The most general algorithm inside agrep.
It supports many extensions such as approximate regular expression
pattern matching, non-uniform costs, simultaneous matching of
multiple patterns, mixed exact/approximate matching, etc.
The algorithm is described in agrep.ps.1.
2. mgrep: A sub-linear expect-time algorithm for matching a set of patterns.
It assumes that the set of patterns contains k patterns, and that
the shortest pattern is of size m.
See agrep.ps.2 for a brief description of the algorithm.
3. amonkey: a Boyer-Moore style algorithm for approximate pattern matching.
let b = log_c (2*m), where c is the size of alphabet set.
In the preprocessing, a table is built to determine whether
a given substring of size b is in the pattern.
Suppose we are looking for matches with at most k errors.
The search is done in two passes.
In the first pass (the filtering pass), the areas in the text
that have a possibility to contain the matches are marked.
The second pass finds the matches in those marked areas.
The search in the first pass is done in the following way.
Suppose the end position of the pattern is currently aligned with
position tx in the text.
The algorithm scans backward from tx until either (k+1) blocks
that do not occur in the pattern have been scanned, or
the scan has passed position (tx-m+k).
In the former case, pattern is shifted forward to align
the beginning position of the pattern with one character after
the position in the text where the scan was stopped.
In the latter case, we marked tx-m to tx+m as a candidate area.
4. mmonkey: Combining the mgrep algorithm with a partition technique, we
have an algorithm with the same time complexity as amonkey.
For ASCII text and pattern, this algorithm is faster than amonkey.
The principle of the partition technique is as follows.
Let A and B be two strings of size m.
If we partition A into (k+1) blocks, then the distance between
A and B is > k if none of the blocks of A occur in B.
This implies that to match A with no more than k errors,
B has to contain a substring that matches exactly one block of A.
A brief description can be found in agrep.ps.2.