Tag Archives: Lisp

Clisp in Lucid

Dear fcking Ubuntu Lucid, I’m writing to you from a tank on fire, and you know, I want you to be here.

$ clisp 
  i i i i i i i       ooooo    o        ooooooo   ooooo   ooooo
  I I I I I I I      8     8   8           8     8     o  8    8
  I  \ `+' /  I      8         8           8     8        8    8
   \  `-+-'  /       8         8           8      ooooo   8oooo
    `-__|__-'        8         8           8           8  8
        |            8     o   8           8     o     8  8
  ------+------       ooooo    8oooooo  ooo8ooo   ooooo   8

Welcome to GNU CLISP 2.44.1 (2008-02-23) 

Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2008

Type :h and hit Enter for context help.

[1]> (asdf:operate 'asdf:load-op :cl-ppcre)
; loading system definition from /usr/share/common-lisp/systems/cl-ppcre.asd into #
;; Loading file /usr/share/common-lisp/systems/cl-ppcre.asd ...
; registering # as CL-PPCRE
; registering # as CL-PPCRE-TEST

*** - Program stack overflow. RESET
[2]> 

Pronoun replacement

The knowledge-extraction task I devote my free time to consists of several parts, and the main one is building a weighted graph.

This graph will ‘connect’ close words within the text and/or sentences and assign weights to the neighbouring words. Actually we want to connect not the words themselves but their roots, or bases, but that does not matter much – root/base extraction is a rather simple task I have already finished (for the Russian language only, though).

A much more complex task is replacing pronouns with the appropriate nouns. In some languages like English this is not a problem, but in Russian each pronoun can carry the same (large) set of properties as an ordinary noun, and in some cases it may differ from its noun in case or number, for example…

I implemented a rather simple version of pronoun replacement – I only care about words with an exact match of gender, number and case, and even then there can be problems. For example, Russian has no strict word order, so nouns and pronouns can be mixed (though not always) in almost any way. I also have a rather small dictionary to extract morphological information from (all the web sites related to this problem produce such ugly output that it is not always possible to extract structured information automatically, and Wikipedia with its abused templates is the worst offender).
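
Roughly, the matching rule looks like the sketch below. This is not the actual implementation: pronoun-features and noun-features are hypothetical accessors returning property lists with :gender, :number and :case keys, and candidates is the list of nouns from the preceding text, nearest first.

(defun agreement-p (pronoun noun)
  (let ((p (pronoun-features pronoun))
        (n (noun-features noun)))
    (and (eq (getf p :gender) (getf n :gender))
         (eq (getf p :number) (getf n :number))
         (eq (getf p :case)   (getf n :case)))))

(defun replacement-for (pronoun candidates)
  "Return the nearest preceding noun with an exact gender/number/case match."
  (find-if (lambda (noun) (agreement-p pronoun noun)) candidates))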

But actually I believe this should not be a major problem. Currently we replace words somehow, and that should be enough. Not because the task is complete and finished, but because the higher-level one, namely weighted graph generation, will allow us to ignore sentences and facts that are rare enough.

And I believe that ‘dumb’ text, like the kind we use to teach children, is the best material for the system to work with. The sky is blue, the world is perfect, heat is dangerous and so on – such texts should be used for training and initial knowledge graph building. And even in more ‘complex’ texts, facts that are properly bound to nouns should (in theory) be more frequent than those bound to pronouns.

So, the plan is to build a graph generation application and then test how it works on input texts of different ‘complexity’. I suppose there will be something interesting in such a knowledge extraction approach.

Or maybe not – we will see :)

Everything is freq’n beautiful in Lisp

except its useless, crappy, brain-sucking abomination called error reports.

I understand that S-expression evaluation cannot produce the same level of error reporting we have in C, but when multi-thousand-line code reports (CLISP 2.44.1, for that matter)

*** – SETF: The class # has no slot named (SLIVE “dead”)

I want to start making shitballs. And this is dangerous for the mental health that Lisp is supposed to cure after I spent most of my time in the check-every-fcking-return-code C swamp.

Porter’s stemmer for the Russian language in Lisp

aka a dictionary-free root extractor. Just for reference.

$ ./root.lisp важнейшие важном павловной павлиний пагубная падение
важн
важн
павловн
павлин
пагубн
паден
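
The core idea can be shown with a toy ending-stripping sketch. This is not the real Porter/Snowball rule set, just a tiny illustrative subset of endings that happens to cover the words above.

(defparameter *endings*
  ;; a tiny illustrative subset, longest endings first
  '("ейшие" "ая" "ие" "ий" "ой" "ом"))

(defun strip-ending (word)
  "Cut off the first matching ending, or return WORD unchanged."
  (dolist (ending *endings* word)
    (let ((pos (- (length word) (length ending))))
      (when (and (> pos 0)
                 (string= word ending :start1 pos))
        (return (subseq word 0 pos))))))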

The next step towards the knowledge extraction system and long-term memory is the quite prosaic task of pronoun replacement. Fortunately I already have a tool to extract morphological information even for words which are not present in the dictionary used (not precisely, of course, but it works surprisingly well after the common dictionary has been learned).

When it is ready I will start building a simple weighted graph of the word roots found in the presented texts. In theory this will allow creating a local ‘knowledge’ of the data and the relations between items. It does not allow building a higher-level knowledge of what things are in real life, but let’s make smaller steps first.

LISP, lingual analysis and multiple object references

In the meantime I am thinking about creating a knowledge database which could work like a long-term memory.

I want to parse a text and build a weighted graph of words (actually their roots only) connected into sentences or lingual blocks. The first task I want to complete is a simple replacement of pronouns with the appropriate nouns from the previous blocks.

But while thinking about this idea I stumbled upon a non-trivial problem with LISP and object pointers. It is a simple task to have a single object referenced from multiple other data structures – we can allocate memory and store pointers in multiple objects (I will leave object removal and reference counting problems aside for now).

But in LISP I basically do not know how to do this. Or an even simpler task: how to have the same data structure (namely a LISP object) referenced, accessible and editable from two lists, for example.

In LISP I only know how to push an object into a list, but not its pointer. Here is an example:

[1]> (defparameter *test-list* (list 1 2 3))
*TEST-LIST*
[2]> (defparameter *storage-list* (list))
*STORAGE-LIST*
[3]> (defparameter *another-list* (list))
*ANOTHER-LIST*
[4]> (push *test-list* *storage-list*)
((1 2 3))
[5]> (push *test-list* *another-list*)
((1 2 3))
[6]> (push "string" *test-list*)
("string" 1 2 3)
[7]> *storage-list* 
((1 2 3))
[8]> *another-list* 
((1 2 3))

I added *test-list* into *storage-list* and *another-list*, but the later modification of *test-list* was not reflected in the destination storages – as if a copy had been stored. (Strictly speaking nothing was copied: the last push simply rebound the *test-list* variable to a fresh cons cell in front of the old list, so the stored references still point at the original (1 2 3) cells.)

With objects it is a bit simpler – we can indeed modify them in place, but I still do not know how to make one object referenced from multiple storages (like lists).

And thus I do not know how to implement a weighted graph, where each word is supposed to be linked and referenced to/from multiple other words. Plus, each object should be accessible from an indexed data structure (like a hash table or a tree) for fast access.

Any ideas on how to implement multiple references in LISP? I could switch to some other language of course, but I like the idea of doing this in LISP :)
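
To make it concrete, here is a rough sketch of the kind of structure I have in mind (just an illustration, not a settled design): graph nodes are mutable structures, and the hash table index, the global list and the adjacency lists of other nodes all hold the very same node object, so an in-place modification should be visible through every reference.

(defstruct node word (weight 0) (neighbours '()))

(defparameter *index* (make-hash-table :test #'equal))   ; word -> node
(defparameter *all-nodes* '())

(defun intern-node (word)
  "Return the single node for WORD, creating and registering it once."
  (or (gethash word *index*)
      (let ((n (make-node :word word)))
        (setf (gethash word *index*) n)
        (push n *all-nodes*)
        n)))

(defun link (word-a word-b)
  "Connect two words; both adjacency lists reference the same node objects."
  (let ((a (intern-node word-a))
        (b (intern-node word-b)))
    (push b (node-neighbours a))
    (push a (node-neighbours b))
    (incf (node-weight a))
    (incf (node-weight b))))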

Vigenere crackdown

Suddenly decided to do something bad while waiting for other things to settle.

Years ago I used to believe that I knew something about hacking. Not kernel hacking, but the kind related to cracking various supposedly secure systems, from ciphers down to code ‘issues’ (described in Phrack, yup). It looks rather laughable right now :)

But today I know that reading (and even understanding) smart and hardcore articles is really far from being able to apply that knowledge to practical problems. So I decided to start (and complete a practical implementation) with really simple things: various mono/poly-alphabetic ciphers. Caesar/ROT13 and Vigenère are good choices, afaics.

Bruce Schneier’s Self-Study Course in Block Cipher Cryptanalysis does not let me fall asleep easily, although it is a little bit of fun to compare my alphabetic cipher analysis against even the simplest attacks described in the article.

Anyway, it took me a day or less to hack up a semi-automatic Vigenère and monoalphabetic cipher cracker in LISP. For the Vigenère code there are two steps: find the key length, then split the ciphertext and decode the monoalphabetically enciphered blocks.

The former task is performed manually – the application I created only searches for trigrams in the text and shows their frequencies. When there are 3 or more repeated trigrams it is possible to find the key length with rather high probability. Bigrams work too, but the error rate is higher. One has to find the distances between occurrences of the same trigram and take their greatest common divisor, which will most likely equal the cipher key length.

The Vigenère cipher is theoretically uncrackable when a unique key long enough to cover the input message is used. In practice, however, shorter keys are used and repeated until the resulting key string is as long as the plaintext message. So, once the cipher starts to repeat itself, the same plaintext letters encrypted at the same key positions produce the same ciphertext.

This method is named after Friedrich Kasiski, who discovered it 150 years ago. When the key length is found, we split the ciphertext into separate strings, each of which is encrypted with a monoalphabetic cipher. Each string consists of the letters separated by the key length, i.e. string $string_number is formed from the letters at positions $string_number + $i * $key_length, where $i runs over the ciphertext until it is fully covered.
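
A minimal sketch of that Kasiski step (assuming the ciphertext is a plain lower-case string without spaces): collect the positions of repeated trigrams and take the GCD of the distances between them – the result is a multiple of the key length, often the key length itself, unless accidental repeats spoil it.

(defun trigram-positions (text)
  "Map every trigram in TEXT to the list of positions where it occurs."
  (let ((table (make-hash-table :test #'equal)))
    (loop for i from 0 to (- (length text) 3)
          do (push i (gethash (subseq text i (+ i 3)) table)))
    table))

(defun kasiski-guess (text)
  "GCD of the distances between repeated trigrams, NIL when nothing repeats."
  (let ((distances '()))
    (maphash (lambda (trigram positions)
               (declare (ignore trigram))
               (when (rest positions)
                 (loop for (a b) on (sort positions #'<)
                       while b do (push (- b a) distances))))
             (trigram-positions text))
    (when distances
      (reduce #'gcd distances))))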

Monoalphabetic ciphers are rather trivial to crack using frequency analysis. Namely, we find the most frequent letters in the encrypted text; they will correspond to the most frequent letters of the language used for the plaintext. The difference between those letters is the cipher shift, so it is trivial to recover the text just by shifting each letter back by the calculated amount.
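
And the per-block frequency step, assuming English plaintext and a plain shift over the lower-case Latin alphabet, can be sketched as follows: take the most frequent ciphertext letter, assume it stands for ‘e’, and shift everything back.

(defun letter-counts (text)
  (let ((counts (make-array 26 :initial-element 0)))
    (loop for ch across text
          for idx = (- (char-code ch) (char-code #\a))
          when (<= 0 idx 25) do (incf (aref counts idx)))
    counts))

(defun guess-shift (ciphertext)
  "Assume the most frequent ciphertext letter corresponds to e."
  (let* ((counts (letter-counts ciphertext))
         (top (position (reduce #'max counts) counts)))
    (mod (- top (- (char-code #\e) (char-code #\a))) 26)))

(defun unshift (ciphertext shift)
  (map 'string
       (lambda (ch)
         (code-char (+ (char-code #\a)
                       (mod (- (char-code ch) (char-code #\a) shift) 26))))
       ciphertext))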

Here is the example I used (a random New York Times article, cut just to show the idea; I also converted it to lower case and dropped all non-letter characters, since the gcipher application does not understand them):

After spending months searching for a bipartisan consensus on financial regulatory reform, Senator Christopher Dodd, chairman of the banking committee, is expected to unveil his own bill on Monday, without one Republican supporter.

afterspendingmonthssearchingforabipartisanconsensusonfinancial
regulatoryreformsenatorchristopherdoddchairmanofthebanking

That’s how it was encrypted, and the resulting ciphertext:

$ gcipher -c Vigenere -k asknfalwiwf v1.enc v2.enc

axdrwsaavznnywbstsoaafrurvsgqkzwgihkeyidwvytnkoaxudkvbnnsxpn
awnmczlsdbwycankwmkoaftznkdwikdbuhpnlkidurnnrxwvkktzoofnv

The trigram search shows that the most frequent ciphertext trigram, ‘tzo’, occurs at distances that are multiples of 11, which matches the key length used above.

Splitting the text and calculating the frequencies is a rather simple technical task. After decoding we get the following result:

lfterwpeydiygmonxhsdeacchinkfocabtpartmsaycoysensysoyfiyancielrpguwatorcreqorxsenaxornhrtstopler
afterspendingmonthssearchingforabipartisanconsensusonfinancialregulatoryreformsenatorchristopher

The second line is the plaintext message; the decoded line above it differs in fewer than half of the symbols. I was not able to decode some parts of the text because the text is short enough that frequency analysis did not always provide correct data. But even as is, the decoded text allows one to read and recover the data.

On this positive note I will start preparing for the weekend skiing.

Stay tuned!

Improving morphological analyzer and sentence generator

(сверху впечатлению заикалась дотация , вокруг видению она выныривала)

Since Russian has a noticeably more complex morphological structure than English, and I basically do not know English well enough (most of my blog readers have already noticed that :), I work with my native language, although the derived logic could be applied to the analysis of other languages too.

First, automatic grammar generation is not a very complex task, but since every word can have multiple meanings (the same word form can be both a verb and a noun in multiple forms, with different cases and so on), the number of grammars automatically derived from a given sentence rarely equals just one, and it tends to explode with a large number of words in the input sentence.

So there should be a way to eliminate some of them from the initial set. One of the factors that allows dropping impossible combinations of word forms is the relations between common word forms: the closest adjective and noun should have the same case, some prepositions are only applicable to a noun in a particular case, and so on. I’m pretty sure the number of such rules is quite big, but given that my last Russian language lesson was at school some 15 years ago, it is a bit hard to summarize what they should look like; a toy version of one such check is sketched below.
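
A toy version of such a check, assuming every parsed word is represented as a property list like (:pos :adjective :case :nominative :number :singular), could look like this:

(defun adjective-noun-agreement-p (grammar)
  "NIL when an adjective does not agree with the following noun in case and number."
  (loop for (word next) on grammar
        always (or (null next)
                   (not (and (eq (getf word :pos) :adjective)
                             (eq (getf next :pos) :noun)))
                   (and (eq (getf word :case) (getf next :case))
                        (eq (getf word :number) (getf next :number))))))

(defun filter-grammars (grammars)
  "Keep only the derived grammars that pass the agreement check."
  (remove-if-not #'adjective-noun-agreement-p grammars))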

I loosely implemented just a couple of rules, and the result looks noticeably more correct now. From a grammatical point of view, of course – currently I do not aim at generating content with a correct meaning.

(мы начинаем спокойный видеоряд о кости)

Another issue I stumbled upon is plainly wrong morphological information stored in the database and dictionaries. Like verb voice, or noun ‘animacy’, i.e. whether the noun refers to an animate subject or not.

The system can derive an active voice, but the dictionary suggests a passive-voice verb for the substitution, and the phrase explodes the reader’s brain with the fact that it is grammatical crap long before its meaning does. So I added a couple of checks (applied to Russian only, of course), and the results improved noticeably.

Another big problem I have right now is preposition logic. Or rather its absence – in Russian each preposition can be combined with only a limited set of noun forms, and to date I have not found this information structured in a form suitable for the database. So I need to write it by hand for several dozen of these ‘small words’, which I am a bit too lazy to do.

That’s why automatically derived grammars which contain plain prepositions, without a relation to the appropriate nouns, usually produce rather ugly random sentences.

To play with the system I implemented the ability to generate content from manually created grammars. Usually they differ from the automatically derived ones only a little.

(к решению вернулась правда , сзади сомнению она пожимала)

To date I consider this part of the AI task ready. The next one is long-term memory and fact extraction.

The basic idea is to create a system which will be able to memorize not only document content, but also the relations between separate word forms and phrases. Such a memory will allow extracting knowledge from the input data according to already learned information.
Thus we will be able to select words not randomly, as now, but according to the system’s memory, so that the selected set of words will be intelligent and related to some of its internal, previously learned facts.

And it has to be automatic. The same way the system automatically derives grammars from random sentences according to some rules, it should extract the possibly hidden relations between terms in the input data.

It’s time to do some serious thinking on this problem…

Suddenly: back to business

It took a bit of time to settle things down all over the place, so now I can continue with the things I used to work on. Here is a huge post about what has happened :)

First, the recent development. Although a bit scattered, still.
For the last several days I worked on morphological analysis, mainly on automatic grammar extraction from texts and on text generation based on those grammars. I did not care much about performance, multithreading and the like, but instead concentrated on the idea, so I selected Lisp for prototyping; to date I do not know whether it will see more serious usage.

So, the task of grammar extraction is rather trivial when you have a database of morphological data. Namely, I wanted to parse a sentence and get knowledge about all the words and their morphological forms; for nouns that would be at least case and number and so on. Having a huge database of all possible word forms for all words of the Russian language (I selected it since I know it well and it has quite a rich morphological structure, but the technique is applicable to any language, of course) is not a feasible task.

So I developed a lemmer/stemmer which has a limited database (I use about 10k words), obtained by parsing aot.ru output, although originally I wanted to DoS Wikipedia with my requests. Given that there are no on-line dictionaries (for Russian at least) with tagged morphological data and structured output, I had to write regexp parsers to get that information from HTML pages. Wikipedia has noticeably worse and noisier output than aot.ru, though.

To select unique words I downloaded one rather big text (Pelevin’s “Generation P”) and used the Damerau-Levenshtein distance to select words which are rather far from each other (I used a normalized distance of 0.25 as the threshold, i.e. words are considered different in this metric when roughly more than a quarter of the letters differ). Then I fetched the morphological data from aot.ru and stored it in local structures.
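
The selection criterion can be sketched as follows: a plain dynamic-programming distance with adjacent transpositions (the optimal-string-alignment variant of Damerau-Levenshtein), plus the 0.25 normalized cut-off mentioned above.

(defun damerau-levenshtein (a b)
  "Optimal-string-alignment distance (edits plus adjacent transpositions)."
  (let* ((la (length a)) (lb (length b))
         (d (make-array (list (1+ la) (1+ lb)) :initial-element 0)))
    (dotimes (i (1+ la)) (setf (aref d i 0) i))
    (dotimes (j (1+ lb)) (setf (aref d 0 j) j))
    (loop for i from 1 to la do
      (loop for j from 1 to lb do
        (let ((cost (if (char= (char a (1- i)) (char b (1- j))) 0 1)))
          (setf (aref d i j) (min (1+ (aref d (1- i) j))            ; deletion
                                  (1+ (aref d i (1- j)))            ; insertion
                                  (+ cost (aref d (1- i) (1- j))))) ; substitution
          (when (and (> i 1) (> j 1)
                     (char= (char a (1- i)) (char b (- j 2)))
                     (char= (char a (- i 2)) (char b (1- j))))
            (setf (aref d i j) (min (aref d i j)                    ; transposition
                                    (1+ (aref d (- i 2) (- j 2)))))))))
    (aref d la lb)))

(defun words-differ-p (a b &optional (threshold 0.25))
  "T when the normalized distance exceeds THRESHOLD, i.e. the words count as different."
  (> (/ (damerau-levenshtein a b)
        (max (length a) (length b) 1))
     threshold))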

My lemmer allows not only obtaining information about the words stored in its database, but also guessing it for words it does not know, based on matched word endings.
Grammar generation then becomes rather trivial – iterate over all words in the sentence and write out the related morphological information without the word form itself.

The second part – sentence generation – is rather simple when the morphological data is well structured. Namely, I take a derived grammar and select random words which match its morphological data. The result looks like a grammatically valid sentence, but of course it is nonsense, since the words are not related to each other and do not follow any ‘meaning’ of the phrase. Also, there is no information for prepositions, which glue different forms together, and some of them (in Russian at least) cannot be used with certain forms and vice versa.
Another problematic part is the multiple meanings of some words. It is even possible that the same word form will not only have multiple noun cases but will simultaneously be a verb in some form. And I have not yet developed a sentence analyzer which would drop grammars that do not match Russian sentence rules, like the must-have verb and/or noun forms present in the sentence and so on.

Below are several examples of how it works (in Russian). The first sentence is the original text.

(какая-нибудь простая грамматика с текстом .)
(сдайся эротическая девушка меж мотоциклом .)
(жмурься рыжеволосая межа вне логиком .)
(вдумайся техническая папироса от борисовичем .)
(ответь тренажерная сидоровна изо симптомом .)
(просыпайся непосредственная харольдовна безо фактом .)
(тревожь забытая судьба между бомжом .)

It is possible to generate grammars by hand and select words around some meaning, but such manual interference is not what I want. The next step is the sentence-processing rules described above. It is a rather simple task, but it is a must-have feature for text generation.

Noticeably more complex is the knowledge extraction problem and the long-term memory, which in turn will allow selecting words tied to each other based on previous experience. Using such technology the system will be able to understand the meaning of the data in terms of related words and generate a reply based on its knowledge of their relations.

This is a task for the future though…
————————————————————————————————————–

Another lexical problem I worked on is language detection. Common algorithms use N-M grams, where N is the number of letters and M the number of subsequent words; such NM-grams are used to calculate the conditional probability of the next characters given the previous ones in the selected NM-gram, so it is possible to detect languages once the system has been trained and language-specific NM-grams have been collected.

I think that the brain works quite differently and does not calculate any kind of NM-grams at all, but instead uses a highly parallel fuzzy word search over its memory. I have not yet developed fast fuzzy searching, except calculating the Damerau-Levenshtein distance of every input word against every word in the dictionary, which is a rather costly task. So this is another interesting problem to think about.

So I decided to switch to simpler matching – cut off endings and match against learned words. Thus I implemented in LISP a simple RADIX-tree-based algorithm, where downloaded documents are parsed and reversed words (optionally without the last one or two letters – a kind of ending) are inserted into a RADIX tree. Checked words are reversed and looked up in this ‘dictionary’, optionally without one or two letters from the end – I effectively cut off the ending. When the lookup returns a match, the system considers the given word to be part of the language the tree refers to. Of course it is possible that the same word is present in multiple languages (especially when the training corpus contained words from different languages, like the raw Wikipedia pages I used), so to determine the document language we check all words and calculate how many of them matched against every language known to the system. It is still possible to check single words, of course.
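
A runnable sketch of the matching idea, using a plain EQUAL hash table as a stand-in for the RADIX tree:

(defun make-language-dictionary ()
  (make-hash-table :test #'equal))          ; string keys need an EQUAL table

(defun word-keys (word)
  "The reversed word with the last 0, 1 and 2 letters cut off."
  (loop for cut from 0 to 2
        when (> (length word) cut)
          collect (reverse (subseq word 0 (- (length word) cut)))))

(defun learn-word (dictionary word)
  (dolist (key (word-keys word))
    (setf (gethash key dictionary) t)))

(defun word-known-p (dictionary word)
  (some (lambda (key) (gethash key dictionary)) (word-keys word)))

(defun language-score (dictionary words)
  "Fraction of WORDS matched by DICTIONARY; computed for every learned language."
  (/ (count-if (lambda (w) (word-known-p dictionary w)) words)
     (max 1 (length words))))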

This simple technique (fewer than 200 lines of LISP, not counting the RADIX tree implementation) behaves surprisingly well in the test case I ran. Namely, I selected 3 big articles (several thousand words each) from Wikipedia in English, Turkish, Ukrainian and Russian, then fetched other Wikipedia texts (not used in learning, of course), and each text matched its real language with a probability (just the ratio of matched words, in the above sense, to all words in the document) noticeably higher than for any other language. All texts were downloaded automatically, and a CL-PPCRE based parser removed all tags, numbers and non-letter characters.

Here is example output from the English learning process:

$ ./get-page.lisp :radix-root-path=radix.obj.en :learn-url-path=/tmp/learn.en.txt \
  :output-dir=learn-data :check-url-path=/tmp/check.txt

url: http://en.wikipedia.org/wiki/Bahá'í_Faith, learned words (including dublicates): 9197
url: http://en.wikipedia.org/wiki/Carabane, learned words (including dublicates): 11072
url: http://en.wikipedia.org/wiki/Is_This_It, learned words (including dublicates): 6469

url statistics: http://tr.wikipedia.org/wiki/Uğultulu_Tepeler_(roman)
total match: 35 %

url statistics: http://en.wikipedia.org/wiki/Wuthering_Heights
total match: 60 %

url statistics: http://ru.wikipedia.org/wiki/Заглавная_страница
total match: 7 %

url statistics: http://uk.wikipedia.org/wiki/Головна_сторінка
total match: 6 %

As we can see, it detected English with about two times higher probability than any other language. I skipped the other tests (Turkish, Ukrainian and Russian), but they show similar numbers.

Here is an example for the most popular Russian LiveJournal blogger, Tema Lebedev – his LJ page and his profile:

TR: tema: 5%  profile: 21%
UA: tema: 30%  profile: 15%
EN: tema: 8%  profile: 27%
RU: tema: 47%  profile: 20%

The profile contains a rather large number of English usernames and words, so the result is quite correct.
The percentages are far from 100% since only a small number of words was learned, I skipped prepositions and other small words, and there are non-Russian words in there, of course.

Not sure whether this is a very useful project, but I do not regret the day spent on thinking and development.

————————————
That’s it for the noticeable development news. Now let’s move on to life happenings.

I had an eye correction operation and can look at women without glasses now. I can also swim, play tennis and football, and overall behave like a normal person. It is a fucking cool feeling!

The operation itself is painless, but it was rather hard from a psychological point of view, at least for me. Especially things like the vacuum cup on the eye, the can-opener-like cut of part of the cornea and the like. But overall it is not something you should be afraid of.

I absolutely do not regret doing it.

I also filed a lawsuit against the development company which built the house I bought an apartment in, to have myself recognized as the real owner of the apartment. It will take a while to settle though, I think a month or two.

Oh, and I play the trumpet. I really do play it, and it sounds quite good when I’m in a good mood and can play loudly on my Yamaha. I still cannot improvise out of my head, but I usually have no trouble playing a melody after I have learnt it. Learning can take a while if I have not heard the melody before. Piano playing is rather stuck – I prefer to learn a melody on the piano first, but I have real trouble playing with both hands, even when the left-hand part is really trivial, like 2-3 notes. So I usually learn only the melody part before trying it on the trumpet :)

Heh, I’m back to business :)
Stay tuned!

Knowledge extraction and morphological analysis

I started working on basic linguistic tasks, namely morphological analysis. And I started doing that in Lisp, of course, although I suspect that using C++ (with its cool STL) would be much faster for development.

To date I have implemented a simple algorithm to calculate the Damerau-Levenshtein distance between words, which can be used for fuzzy text search. Effectively that’s what ispell does.

Now I am working on the morphological analysis itself. I decided to start with Russian, since I know it quite well and it is a rather complex language. The application should take a word and provide a morphological tag for it: whether it is a noun, a verb or whatever else, which case or conjugation it has, and so on.

I want to use a training corpus to teach the algorithm about lemmas and inflexions, which can then be used to determine other forms and to guess them if there is no strict match. I will use a rather simple algorithm – separate the endings and use them as pointers to RADIX tree roots, which will contain the reversed roots of the training words. The ending and the matched root together allow determining the morphological nature of the word.

To date I have finished a simple compressed RADIX tree implementation in Lisp, but there is a huge problem with the training corpus. Did I say that all linguists are greedy bastards? There are no morphologically tagged corpora in public access, and even those who claim to have them do not provide such access; at least I did not find it anywhere except Wikipedia.

Which in turn provides it as an HTML page like this, and of course there is no strict template or at least a standard format for the morphological part of the page, so it can add extra tags or symbols between strings and so on. I implemented a simple Lisp HTTP downloader which tries to analyze Wikipedia pages and extract the morphological information, but to date it only works for nouns and adjectives. The next task is verbs.

Then I will be able to build a testing corpus and create the morphological parser. I expect this is already implemented in ispell, and I could use its dictionaries, but I was not able to find out how to make it dump morphological information about words.

The main question is why I need this at all, and the answer is ‘for lulz’. The plan is to create a simple grammar generator which will take a training text, analyze every word and store the learned grammars. Then the application will select some other words and produce sentences using those grammars.

Stupid, but very interesting, and it allows moving further…

Stay tuned!

Age histogram and LISP regexp parser

Effectively, I wrote a LISP regexp parser and a trivial network fetcher to download the part of the corporate site with the user info, parse it and build an age histogram. Another use for it will be highlighted in the AI development.

Well, any normal human would ask the admins or a manager, but those are not hackers. Real men write regexp parsers in LISP, I believe. Or crazy ones, but what’s the difference?

So, I mostly finished this project. And here is the result.


Age histogram

My age is highlighted by the black line; it is actually a bit smaller :)
So, about half of the people are younger than me, so it is definitely not yet time to sink into senile imbecility, but I should start packing my things.
There will be many interesting cases soon. Stay tuned!

Fixed a fair number of bugs in the LISP regexp parser

Now we can do this with the LISP regexp parser:

zbr@gavana:~/aWork/tmp/regexp$ ./parser.lisp
a href='http://site/login1' class="auto-person-card"
a href='http://site/login2' class="auto-person-card"
a href='http://site/login3' class="auto-person-card"
a href='http://site/login4' class="auto-person-card"

$ cat ./parser.lisp
#!/usr/bin/clisp

;
; Copyright 2009+ Evgeniy Polyakov 
; All rights reserved.
;
; This program is free software; you can redistribute it and/or modify
; it under the terms of the GNU General Public License as published by
; the Free Software Foundation; either version 2 of the License, or
; (at your option) any later version.
;

(load "regexp.lisp")

(defparameter *control-chars* "[]|*/'!\"#$%&()+, =?@\-^`.;:\\")

(defun generate-regexp-string ()
  (let ((res (string "a href='http://site/[[-")))
    (dolist (sym (generate-ascii-symbol-list))
      (unless (position sym *control-chars*)
    (setf res (concatenate 'string res "|" (string sym)))))
    (setf res (concatenate 'string res "]*]' class=\"auto-person-card\""))
    ;(format t "regexp: ~A~%" res)
    res))

(with-open-file (fd "/tmp/test.html")
  (let* ((regexp (generate-regexp-string))
     (dtrans (generate-state-transformation (concatenate 'string regexp "#"))))
    (loop :for line = (read-line fd nil nil) :while line :do
      (let ((res (regexp-search regexp line dtrans)))
        (dolist (r res)
          (format t "~A~%" (subseq line (first r) (second r))))))))

The regexp itself is rather weird: a href='http://site/[[the whole ASCII table via the OR operation]*]' class="auto-person-card"

That is because the parser lacks the ‘-‘ (range) operator yet. Still, it effectively works in full, so I can start the document analysis.

Now that I have implemented a working regexp parser in LISP, I can finally write a simple application to download the whole staff corporate site, parse its HTML a bit and fetch the birthday dates to generate a histogram.

Actually, HTML should be parsed as a context-free grammar (and even then there are problems), but that will be a further step, as will the next items of the original knowledge extraction problem.

Regexp parser on LISP

I have finished the implementation (tested on CLISP) of the regular expression parser.

It is built as an automatically generated deterministic finite automaton and should be very fast at searching. The state machine uses a lookup table for the states, so the memory cost is effectively O(number of states); the table could be compressed if the range of symbols is huge.
For example, the [a|b]*abb regexp needs only 4 states, and it can still be optimized further.
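
For illustration, a hand-written version of that 4-state lookup table and its matching loop (the real parser generates the table automatically; this sketch only handles the two-letter alphabet and matches the whole string rather than searching inside it):

(defparameter *abb-dfa*
  ;; transition table: state x (a b) -> next state, state 3 is accepting
  #2A((1 0)
      (1 2)
      (1 3)
      (1 0)))

(defun abb-match-p (string)
  "T when STRING over the alphabet {a,b} is matched by [a|b]*abb."
  (= 3 (reduce (lambda (state ch)
                 (aref *abb-dfa* state (if (char= ch #\a) 0 1)))
               string
               :initial-value 0)))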

That’s how it works:

$ ./regexp.lisp "[a|b]*abb" "qweabbzxcaabbasd"
[a|b]*abb in qweabbzxcaabbasd -> abb
[a|b]*abb in qweabbzxcaabbasd -> aabb

One can download the sources from the archive or check the homepage.

The regexp parser supports braces, ‘or’, ‘star’ and concatenation operations; the latter does not require a special symbol. It is rather simple to add other ones (like a dot for any single symbol or a dash for a range of symbols). It is also not too complex to replace the character-oriented comparison operations with different types and functions.

The next task for this project is to extend it a bit and implement a simple HTML page parser; namely, I want to parse pages from the corporate site and extract birthday dates to build a histogram of the ages of the people around me. I suspect that the majority of my colleagues are younger than me.

Further LISP syntax tree improvements

$ ./stree.lisp "[ab*]|[qw*]"
WARNING: INTERN("(SETF COMMON-LISP:TYPE)"): # is locked
         Ignore the lock and proceed
Dumping object #[SYMBOL-INSTANCE #x21E3DC66]
type: 2
  type: 3
    type: 4
      value: w, type: 0
      value: q, type: 0
  type: 3
    type: 4
      value: b, type: 0
      value: a, type: 0

Moving further with my LISP syntax tree parser – I switched from lists to objects which may contain a list as a value. This is needed for simpler object extension: when I need to add an object ID, for example (a monotonically increasing number in this case), I just extend the corresponding object, while previously, when the relations were represented as lists, I had to add another list field and clumsy list-position helpers to determine what the object represents.

With objects it is also simpler to work with non-character symbols (i.e. other objects), although currently the reader works with (char=) and friends.

I am slowly preparing this code to start generating the deterministic finite automaton, which will be the next step I suppose, since the current architecture is exactly what I wanted – rather simple and very extensible.

P.S. Out of curiosity, what method should an object have to allow its pretty printing with (format t "~A") instead of #[SYMBOL-INSTANCE #x21E3DC66]?
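
(For the record, the usual CLOS answer is to specialize print-object for the class; (format t "~A" obj) goes through it. The slot names below are only guesses based on the dump above.)

(defmethod print-object ((obj symbol-instance) stream)
  (format stream "[~A value: ~A, type: ~A]"
          (type-of obj) (slot-value obj 'value) (slot-value obj 'type)))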

Syntax tree parser in LISP

I’ve finally made it. I got a lot of experience in functional-like thinking (and partially programming) and recalled the bits of LISP I knew.

The parser can build a tree with concatenation, multiplication and OR nodes. It supports braces and symbol escaping. Here are a couple of examples:

$ ./stree.lisp "\[a\*"
(cat * (cat a [))
$ ./stree.lisp "zxc*"
(mult (cat c (cat x z)))
$ ./stree.lisp "[a|b]*qwe"
(cat e (cat w (cat q (mult (or b a)))))

It uses a rather dumb parser which does not differentiate the priority of the control symbols, so a string like “123*” will have the multiplication operator applied to all the previous symbols (namely the concatenation of 1, 2 and 3). There are no error checks either, which I will likely work on fixing, especially graceful exiting on non-closed braces.

The next step is to build a deterministic finite automaton out of these LISP objects.

The code in LISP is under the link.
I believe I’m not as bad as I had started to think.

#!/usr/bin/clisp

(defconstant *control-esc* #\\)
(defconstant *control-mult* #\*)
(defconstant *control-or* #\|)

(defconstant *control-cat-id* "cat")
(defconstant *control-or-id* "or")
(defconstant *control-mult-id* "mult")

(defconstant *control-object-open* #\[)
(defconstant *control-object-close* #\])

(defconstant *control-string* (concatenate 'string (list *control-mult*
							 *control-or*
							 *control-object-open*
							 *control-object-close*)))

(defun control-object-p (obj)
  (and (characterp obj) (or (char= obj *control-mult*) (char= obj *control-or*))))

(defun read-object (str len l total esc)
  (let ((sym (elt str 0))
	(skip 1))
    (cond
      (esc
        (push (list sym) l)
	(setf esc nil))
      ((char= sym *control-object-open*)
        (let* ((ret (read-object (subseq str skip) (- len skip) '() 0 nil))
	       (obj (first ret))
	       (obj_skip (second ret)))
	  (setf skip (+ skip obj_skip))
	  (push obj l)))
      ((char= sym *control-object-close*))
      ((char= sym *control-esc*)
       (setf esc 1))
      (t
	(push sym l)))
    (setf total (+ total skip))
    (if (or (<= len skip) (char= sym *control-object-close*))
      (list l total)
      (read-object (subseq str skip) (- len skip) l total esc))))

(defun wrap-control (obj)
  (cond
    ((not (characterp obj))
     *control-cat-id*)
    ((char= obj *control-mult*)
     *control-mult-id*)
    ((char= obj *control-or*)
     *control-or-id*)
    (t
     *control-cat-id*)))

(defun unwind-list (obj)
  (if (and (listp obj) (= (length obj) 1))
    (first obj)
    obj))

(defun parse-list (l)
  (if (null l)
    (list nil nil)
    (let* ((obj (pop l))
	   (lret (parse-list l))
	   (ret (first lret))
	   (is-or (second lret))
	   (l '()))
      (if (listp obj)
	(setf obj (first (parse-list obj))))
      (if (= (length ret) 1)
	(setf ret (first ret)))
      (if (null ret)
	(push (unwind-list obj) l)
	(progn
	  (cond
	    (is-or
	      (push ret l)
	      (push (unwind-list obj) l)
	      (push *control-or-id* l)
	      (setf is-or nil))
	    ((and (characterp obj) (char= obj *control-or*))
	      (setf is-or 1)
	      (push ret l))
	    (t
	      (push ret l)
	      (unless (control-object-p obj)
		(push (unwind-list obj) l))
	      (push (wrap-control obj) l)))))
      (list l is-or))))

(dolist (str *args*)
  (let ((ret (read-object str (length str) '() 0 nil)))
    (format t "~A~%" (first (parse-list (first ret))))))

Abstraction magic

(a|b)*abb -> ("cat" ("cat" ("cat" ("star" ("or" (a) (b))) (a)) (b)) (b))

I found it – and it happened to be really trivial. That abstraction which I had looked for everywhere becomes so obvious and clear once it is created. The regexp string above should end up as the list structure shown after the arrow; it will be processed recursively, substituting the names of the operations with the appropriate DFA states while the recursion unwinds, eventually producing the finite automaton matching table.

I spent the last two weeks reading and thinking, thinking and reading. Mostly to get in touch with regexps and state machines, and then to abstract my rather narrow vision of the problem into something more generic, something not really tied to state machines and regular expressions. I read Aho’s “Compilers: Principles, Techniques, and Tools”, Greene’s “The Elegant Universe” and Abelson’s “Structure and Interpretation of Computer Programs”.

The first two I have in my home library and the last one is available on-line at MIT, although I want to get a hardcopy, since reading from paper is much more pleasant and effective imho. I like its rather simple bits about abstraction and the metalinguistic abstraction chapter (I have only read its introduction though).

It was like an accumulation of knowledge turning from quantity into quality, like an eye-opener, since things actually are that simple. It is not about the list structure shown above – that is just a quick step once the problem is understood and abstracted into a structure I know how to operate with, which is the main result. And I expect the practical aspects of that transformation to be resolved really soon.

To LISP or not to LISP

For the last several days I have been stuck on a serious problem with programming the automatic state machine for regular expressions. Not only have I stopped at the very beginning, at building the syntax tree from the regexp string, I also do not know how to move forward.

In C or any other language with pointers or non-destructive operations I can easily give an object several inner structures referenced from outside the ‘parent’ object, and use pointer magic to get access to the container object.
In LISP, however, I do not know how to link the same object so that it is part of a to-be-checked list of all objects and at the same time part of a tree-like structure, i.e. so that the same object can reference others via parent and child relations.

And getting this done in LISP is somewhat essential – I like this language and its capabilities, and while I’m not that good at it, I want to continue.

So I am going back to the drawing board to rethink the syntax tree approach to the initial regexp parsing into something more easily representable in LISP. Eventually I expect to finish it :)

EDITED TO ADD: That’s the pointer math example I had in mind:

struct some_node {
....
  struct list_head  global_list_entry;
  struct rb_node  rb_tree_entry;

  struct some_node  *parent;
  struct some_node  **child;
};

static void inline add_node_into_global_list(struct some_node *n)
{
  lock_some_lock();
  list_add(&n->global_list_entry, &global_list);
  unlock_some_lock();
}

#define container_of(ptr, type, member) ({			\
	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
	(type *)( (char *)__mptr - offsetof(type,member) );})

struct some_node *get_node_from_global_list(struct list_head *entry)
{
  lock_some_lock();
  struct some_node *n = container_of(entry, struct some_node, global_list_entry);
  atomic_inc(&n->refcnt);
  unlock_some_lock();
  return n;
}

I used the Linux kernel definitions for the list entry and the container_of() macro, but the idea is to allow referencing the same object from different entities – like a list of objects and a tree of the same objects.

LISP error handling gotcha.

[1]> (defun resolve-host-name (addr)
  (handler-case (hostent-name (resolve-host-ipaddr addr))
    (t ()
       addr)))
RESOLVE-HOST-NAME
[2]> (resolve-host-name "1.2.3.4")
"1.2.3.4"
[3]> (resolve-host-name "195.178.208.66")
"tservice.net.ru"

The exception mechanism is a great extension to whatever language, and I think LISP has one of the best realizations (and the first one, actually). I am not very familiar with exceptions in C++, just as with the language itself, but iirc it is not (easily) possible to return back to the calling point with a value determined by the exception handler. Even in Java with its

finally

section it is still less convenient. But I may be wrong of course :)

The above chunk of code catches the error (all conditions, actually) and returns the requested address itself; when no error happens it returns the resolved name.

Http log parser in LISP.

I updated my old LISP parser of http logs to support date and time parameters, cleaned up the code a bit and improved performance a little.

What I really like about LISP is its nature. It is just perfect for some obscure project you slack on after some other (potentially hard) task. And while the result may be far from perfect code, it still does its job of brain relaxation.

These are parts of my (rather bad, I do not argue) code:

(defun filter-urls (check-list rules)
  (let ((l check-list)
	(res '()))
    (setf l (remove-if #'(lambda (check)
			   (let ((str-to-push check))
			     (dolist (r rules)
			       (let* ((match (first r))
			  	      (trans (second r)))
				 (when (search match check)
			           (when trans
				     (when (eql 0 (count trans res :test #'string=))
			               (push trans res)))
				   (setf str-to-push nil)
			           (return t))))
			     (when (not (null str-to-push))
			       (when (eql 0 (count check res :test #'string=))
			         (push check res)))
			       t))
		       l))
    res))

(defmethod dump-short-block ((inst address-instance))
  (with-slots (atime get_urls post_urls empty addr refs) inst
      (setf get_urls (filter-urls get_urls *filter-rules*))
      (setf post_urls (filter-urls post_urls *filter-rules*))
      (setf refs (filter-urls (filter-urls refs *filter-refs*) *filter-rules*))
      (when (or (not (null get_urls)) (not (null post_urls)))
	(dump-block inst))))

The next task is to find a way to get data from the DNS server, likely via:

> (dolist (a (hostent-addr-list (resolve-host-ipaddr "www.kernel.org")))
    (format t "addr: ~A, name: ~A~%" a (hostent-name (resolve-host-ipaddr a))))
addr: 204.152.191.37, name: www.kernel.org

Its main advantage is that it is very different from C, the best-ever high-level-assembler language.

Edited to fix the title

Knuth-Morris-Pratt vs. optimized Boyer–Moore in LISP.

I slacked again and stole the Boyer-Moore implementation from the netfilter sources, although it contains a small error in the good-shift array preprocessing (the l array setup may run out of bounds), which I fixed in the new implementation of the algorithm.

The optimized netfilter version was compared against Wikipedia’s one (I also compared against a variant which does not contain the search part itself, but only the preprocessing), against Knuth-Morris-Pratt, and against CLISP’s (search), which is natively compiled into machine code. The results are quite interesting imho.

Knuth-Morris-Pratt vs. optimized and non-optimized Boyer-Moore

The test system generated a 3000-character string and randomly selected a substring of varying length (from 2 to 100 as shown on the graph; I actually tested up to 200-character substrings, but the non-optimized Boyer-Moore skyrocketed so quickly that the other results were lost in the noise around zero). The test was repeated 1000 times and the timings were summed.

As we can see, the optimized Boyer-Moore compiled into LISP bytecode is faster than the native CLISP (search) compiled into machine code (partial listings below).

The new Boyer-Moore has only been lightly tested though, so it may contain errors. Sources can be found in the archive as usual.

Partial listings of the disassembled (search) and (bm-search) are presented below.

> (disassemble 'search)
0xffffe424 in __kernel_vsyscall ()
0x08142b7c :	push   %ebp
0x08142b7d :	mov    %esp,%ebp
0x08142b7f :	push   %edi
0x08142b80 :	push   %esi
0x08142b81 :	push   %ebx
0x08142b82 :	sub    $0x1c,%esp
0x08142b85 :	mov    0x8261d08,%eax
0x08142b8a :	sub    $0xc,%eax
...
> (disassemble 'bm-search)
Disassembly of function BM-SEARCH
(CONST 0) = INTEGER
(CONST 1) = 256
(CONST 2) = 0
(CONST 3) = 1
(CONST 4) = SYSTEM::|(SETF ELT)|
(CONST 5) = -1
4 required arguments
0 optional arguments
No rest parameter
No keyword parameters
213 byte-code instructions:
0     (LOAD&PUSH 1)
1     (PUSH-UNBOUND 7)
3     (CONST 0)                           ; INTEGER
4     (STORE 5)
5     (LOAD 9)
6     (STORE 4)
7     (CALLS1&PUSH 29)                    ; MAKE-ARRAY
...

Knuth-Morris-Pratt vs. Boyer–Moore in LISP.

I have ‘implemented’ the full (i.e. 3*N) Boyer–Moore substring search algorithm in LISP. But I was too lazy to analyze the algorithm and essentially used what Wikipedia provides (with its unoptimized good-suffix-skip array calculation). The word ‘implemented’ above is in quotes since I would not call this an implementation, but rather a transposition of Wikipedia’s C code into LISP.

Both kmp-search and bm-search were compiled into Lisp bytecode (using (compile-file)) and compared against a compiled naive search algorithm and against the default CLISP (search), which is compiled into native machine code.

Knuth-Morris-Pratt vs. Boyer–Moore in LISP

The test generates a 1000-character string, then randomly selects a 1-to-50-character substring and calls the four different test functions to find that substring in the generated string. The (get-internal-real-time) timings of the function calls are summed and the next iteration starts. 1000 iterations were run 10 times to get the graphs above.

The Boyer-Moore source code is below. All sources can be found in the archive.

(defun suffix-match (needle nlen offset suffix-len)
  (if (> offset suffix-len)
    (and (char/= (elt needle (- offset suffix-len 1))
	        (elt needle (- nlen suffix-len 1)))
	 (string= needle needle :start1 (- nlen suffix-len) :end1 nlen
		  :start2 (- offset suffix-len) :end2 offset))
    (string= needle needle :start1 (- nlen offset) :end1 nlen
	     :start2 0 :end2 offset)))

(defun bm-search (haystack hlen needle nlen)
  (let ((skip (make-array nlen :element-type 'integer))
        (occ (make-array 257 :element-type 'integer :initial-element -1)))
    (if (or (> nlen hlen) (< nlen 1))
      (return-from bm-search nil))
    ;; bad-character table: last position of each character in the needle
    (dotimes (i nlen)
      (setf (elt occ (char-code (elt needle i))) i))
    ;; good-suffix table: for a mismatch at npos, the smallest shift which
    ;; realigns the already matched suffix (see suffix-match above)
    (dotimes (npos nlen)
      (let ((suffix-len (- nlen 1 npos)))
        (setf (elt skip npos)
              (do ((offset (1- nlen) (1- offset)))
                  ((suffix-match needle nlen offset suffix-len) (- nlen offset))))))
    (do ((hpos 0))
        ((> hpos (- hlen nlen)))
      (let ((npos (1- nlen)))
        (do ()
          ((char/= (elt needle npos) (elt haystack (+ npos hpos))))
          (when (= npos 0)
            (return-from bm-search hpos))
          (decf npos))
        (setf hpos (+ hpos (max (elt skip npos)
                                (- npos (elt occ (char-code (elt haystack (+ npos hpos)))))))))))
  nil)

Tested like this (load performed multiple times):

[1]> (compile-file "./string.lisp")
;; Compiling file /root/lisp/string.lisp ...
;; Wrote file /root/lisp/string.fas
0 errors, 0 warnings
#P"/root/lisp/string.fas" ;
NIL ;
NIL
[2]> (load "string.fas")
;; Loading file string.fas ...
kmp: 279147, bm: 477522, dafault-search: 118057, naive: 1510182
;; Loaded file string.fas
T