I finished the implementation (tested on CLISP) of the regular expression parser.
It was created as automatically generated deterministic finite automata and should be very fast in searching. State machine uses lookup table for the states, so effectively it is O(number of states), it could be compressed if range of symbols is huge.
For example for
[a|b]*abb regexp there are only 4 states, and yet it can be optimized further.
That’s how it works:
$ ./regexp.lisp "[a|b]*abb" "qweabbzxcaabbasd" [a|b]*abb in qweabbzxcaabbasd -> abb [a|b]*abb in qweabbzxcaabbasd -> aabb
Regexp parser supports braces, ‘or’, ‘star’ and concatenation operations, the latter does not require a special symbol. It is rather simple to add another ones (like dot for any single symbol or dash for the range of symbols). It is also not too complex to replace character-oriented comparison operations with different types and functions.
Next task for this project is to extend it a bit and implement a simple HTML page parser, namely I want to parse pages from the corporate site and get out birthday dates to build a histogram of the ages of people around. I suspect that majority of the collegues are younger than me.