It took a bit of time to settle things down all over the place, so now I can continue with the things I used to work on. Here is a huge post about what has happened :)
First, recent development news. A bit scattered, but still.
For the last several days I have been working on morphological analysis, mainly on automatic grammar extraction from texts and text generation based on it. I did not care much about performance, multithreading and the like, but instead concentrated on the idea, so I selected Lisp for prototyping; to date I do not know whether it will see more serious usage.
The task of grammar extraction is rather trivial when you have a database of morphological data. Namely, I wanted to parse a sentence and get knowledge about all its words and their morphological forms; for nouns that would be at least case, number and so on. Having a huge database of all possible word forms for all words in the Russian language (I selected it since I know it well and it has quite a rich morphological structure, but the technique is applicable to any language of course) is not feasible.
So I developed a lemmer/stemmer with a limited database (I use about 10k words), obtained by parsing aot.ru output, although originally I wanted to DoS Wikipedia with my requests. Given that there are no online dictionaries (for Russian at least) with tagged morphological data and structured output, I had to write regexp parsers to extract that information from HTML pages. Wikipedia has noticeably worse and noisier output compared to aot.ru though.
To select unique words I downloaded one rather big text (Pelevin’s “Generation P”) and used the Damerau-Levenshtein distance to pick words that are rather far from each other (I used a normalized distance of 0.25 as a threshold, i.e. words are considered different in this metric when roughly more than a quarter of the letters differ). Then I fetched morphological data from aot.ru and stored it in local structures.
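For reference, the distance check looks roughly like this; a minimal sketch rather than the exact code, with illustrative function names:

;; Damerau-Levenshtein distance (with adjacent transpositions),
;; normalized by the length of the longer word.
(defun damerau-levenshtein (a b)
  (let* ((la (length a))
         (lb (length b))
         (d (make-array (list (1+ la) (1+ lb)) :initial-element 0)))
    (dotimes (i (1+ la)) (setf (aref d i 0) i))
    (dotimes (j (1+ lb)) (setf (aref d 0 j) j))
    (loop for i from 1 to la do
      (loop for j from 1 to lb do
        (let ((cost (if (char= (char a (1- i)) (char b (1- j))) 0 1)))
          (setf (aref d i j)
                (min (1+ (aref d (1- i) j))            ; deletion
                     (1+ (aref d i (1- j)))            ; insertion
                     (+ (aref d (1- i) (1- j)) cost))) ; substitution
          ;; adjacent transposition
          (when (and (> i 1) (> j 1)
                     (char= (char a (1- i)) (char b (- j 2)))
                     (char= (char a (- i 2)) (char b (1- j))))
            (setf (aref d i j)
                  (min (aref d i j)
                       (+ (aref d (- i 2) (- j 2)) cost)))))))
    (aref d la lb)))

(defun words-differ-p (a b &optional (threshold 0.25))
  "Consider two words different when the normalized distance exceeds THRESHOLD."
  (> (/ (damerau-levenshtein a b)
        (max 1 (length a) (length b)))
     threshold))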
My lemmer not only returns information about the words stored in its database, but can also guess it for words it does not know, based on matched word endings.
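The guessing step can be sketched like this, assuming a hash table that maps known endings to lists of morphological tags; the names here are placeholders, not the real ones:

(defvar *ending-tags* (make-hash-table :test #'equal)
  "Maps a known word ending (string) to a list of morphological tags.")

(defun guess-by-ending (word &optional (max-ending 5))
  "Return tags for the longest known ending of WORD, or NIL if nothing matches."
  (loop for n from (min max-ending (length word)) downto 1
        for ending = (subseq word (- (length word) n))
        for tags = (gethash ending *ending-tags*)
        when tags return tags))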
Grammar extraction then becomes rather trivial: iterate over all words in the sentence and write out the related morphological information without the word form itself.
The second part, sentence generation, is rather simple when the morphological data is well structured. Namely, I take the derived grammar and select random words that match its morphological data. The result looks like a grammatically valid sentence, but of course it is nonsense, since the words are not related to each other and do not follow any ‘meaning’ of the phrase. Also, there is no information for prepositions, which connect different forms, and some of them (in Russian at least) cannot be used with certain forms and vice versa.
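Both steps can be sketched roughly as follows; word-info is assumed to return the list of morphological tags for a word and *lexicon* to hold (word . tags) pairs, both being placeholder names rather than the real code:

(defun extract-grammar (sentence)
  "Replace every word of SENTENCE by its morphological tags, dropping the form."
  (mapcar #'word-info sentence))

(defun generate-sentence (grammar)
  "For every tag set in GRAMMAR pick a random lexicon word with matching tags."
  (mapcar (lambda (tags)
            (let ((candidates (remove-if-not (lambda (entry)
                                               (equal (cdr entry) tags))
                                             *lexicon*)))
              (if candidates
                  (car (nth (random (length candidates)) candidates))
                  "?")))
          grammar))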
Another problematic part is the ambiguity of some words. It is even possible that the same word form not only covers multiple noun cases, but is simultaneously some form of a verb. And I have not yet developed a sentence analyzer that would drop grammars which do not match Russian sentence rules, such as the required verb and/or noun forms being present in the sentence and so on.
Below are several examples of how it works (in Russian). The first sentence is the original text.
(какая-нибудь простая грамматика с текстом .)
(сдайся эротическая девушка меж мотоциклом .)
(жмурься рыжеволосая межа вне логиком .)
(вдумайся техническая папироса от борисовичем .)
(ответь тренажерная сидоровна изо симптомом .)
(просыпайся непосредственная харольдовна безо фактом .)
(тревожь забытая судьба между бомжом .)
It is possible to create a grammar by hand and select words around some meaning, but such manual interference is not what I want. The next step is the sentence processing rules described above. It is a rather simple task, but a must-have feature for text generation.
Noticeably more complex are the knowledge extraction problem and long-term memory, which in turn would allow selecting words tied to each other based on previous experience. With such technology the system would be able to understand the meaning of the data in terms of related words and generate a reply based on its knowledge of their relations.
This is a task for some future though…
————————————————————————————————————–
Another lexical problem I worked on is language detection. Common algorithms use N-M grams, where N is the number of letters and M is the number of subsequent words; such NM-grams are used to calculate the conditional probability of the next characters given the previous ones in the selected NM-gram, so it is possible to detect a language once the system has been trained and language-specific NM-grams have been selected.
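For contrast with what follows, a toy version of such a character-level model could look like this; it is a much-simplified illustration of the common technique, not something I implemented, with add-one smoothing and an assumed alphabet size of 33 letters:

(defstruct lang-model
  (trigrams (make-hash-table :test #'equal))
  (bigrams  (make-hash-table :test #'equal)))

(defun train-model (model text)
  "Count character trigrams and their two-letter contexts."
  (loop for i from 0 to (- (length text) 3)
        do (incf (gethash (subseq text i (+ i 3)) (lang-model-trigrams model) 0))
           (incf (gethash (subseq text i (+ i 2)) (lang-model-bigrams model) 0))))

(defun log-likelihood (model text &optional (alphabet-size 33))
  "Sum of log P(char | two previous chars), add-one smoothed; higher is better."
  (loop for i from 0 to (- (length text) 3)
        sum (log (/ (1+ (gethash (subseq text i (+ i 3)) (lang-model-trigrams model) 0))
                    (+ alphabet-size
                       (gethash (subseq text i (+ i 2)) (lang-model-bigrams model) 0))))))

One such model would be trained per language, and the one giving the highest log-likelihood for an input text wins.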
I think that the brain works quite differently and does not calculate any kind of NM-grams at all, but instead uses a highly parallel fuzzy word search in memory. I have not yet come up with a fast fuzzy search beyond calculating the Damerau-Levenshtein distance against every word in the dictionary for every input word, which is a rather costly operation. So this is another interesting problem to think about.
So I decided to switch to simpler matching: cut off endings and match against learned words. I implemented in Lisp a simple radix-tree-based algorithm, where downloaded documents are parsed and reversed words (optionally without the last one or two letters, a kind of ending) are inserted into a radix tree. Checked words are reversed and looked up in this ‘dictionary’, optionally without one or two letters from the end, so the ending is effectively cut off. When the lookup returns a match, the system considers the given word to be part of the language that tree refers to. Of course it is possible that the same word is present in multiple languages (especially when the training corpus contains words from different languages, like the raw Wikipedia pages I used), so to determine the document language we check all words and calculate how many of them matched against each language known to the system. It is still possible to check single words of course.
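The matching logic can be sketched roughly like this, with a plain hash table standing in for the radix tree and illustrative names only; the real code differs, in particular it reverses words so that endings become shared prefixes in the tree:

(defvar *language-words* (make-hash-table :test #'equal)
  "Maps a language name to a hash table of learned stems.")

(defun learn-word (language word)
  (let ((stems (or (gethash language *language-words*)
                   (setf (gethash language *language-words*)
                         (make-hash-table :test #'equal)))))
    ;; store the word itself and the word with one or two trailing letters cut off
    (dotimes (cut 3)
      (when (> (length word) cut)
        (setf (gethash (subseq word 0 (- (length word) cut)) stems) t)))))

(defun word-matches-p (language word)
  (let ((stems (gethash language *language-words*)))
    (when stems
      (loop for cut from 0 to 2
            thereis (and (> (length word) cut)
                         (gethash (subseq word 0 (- (length word) cut)) stems))))))

(defun match-ratio (language words)
  "Fraction of WORDS found in the learned dictionary for LANGUAGE."
  (/ (count-if (lambda (w) (word-matches-p language w)) words)
     (max 1 (length words))))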
This simple technique (less than 200 lines of Lisp, not counting the radix tree implementation) behaves surprisingly well in the test case I ran. Namely, I selected 3 big articles (several thousand words) from Wikipedia in English, Turkish, Ukrainian and Russian, then took Wikipedia texts not used in learning, and each text matched its real language with a probability (just the ratio of matched words, in the above sense, to all words in the document) noticeably higher than for any other language. All texts were downloaded automatically, and a CL-PPCRE based parser removed all tags, numbers and non-letter characters.
Here is example output from the English learning process:
$ ./get-page.lisp :radix-root-path=radix.obj.en :learn-url-path=/tmp/learn.en.txt \
    :output-dir=learn-data :check-url-path=/tmp/check.txt
url: http://en.wikipedia.org/wiki/Bahá'í_Faith, learned words (including dublicates): 9197
url: http://en.wikipedia.org/wiki/Carabane, learned words (including dublicates): 11072
url: http://en.wikipedia.org/wiki/Is_This_It, learned words (including dublicates): 6469
url statistics: http://tr.wikipedia.org/wiki/Uğultulu_Tepeler_(roman) total match: 35 %
url statistics: http://en.wikipedia.org/wiki/Wuthering_Heights total match: 60 %
url statistics: http://ru.wikipedia.org/wiki/Заглавная_страница total match: 7 %
url statistics: http://uk.wikipedia.org/wiki/Головна_сторінка total match: 6 %
As we see, it detected English with a noticeably higher probability, almost 2 times higher than the closest contender. I skipped the other tests (Turkish, Ukrainian and Russian), but they show similar numbers.
Here is an example for the LJ page and the profile of Tema Lebedev, the most popular Russian LiveJournal blogger:
TR: tema: 5%  profile: 21%
UA: tema: 30% profile: 15%
EN: tema: 8%  profile: 27%
RU: tema: 47% profile: 20%
The profile contains a rather large number of English usernames and words, so the result is quite reasonable.
The percentages are far from 100% since only a small number of words were learned, I skipped prepositions and other short words, and there are non-Russian words in there, of course.
Not sure whether this is a very useful project, but I do not regret the day spent on thinking and development.
————————————
That's it for notable development news. Now let's move on to life happenings.
I had an eye correction operation and can now look at women without glasses. I can also swim, play tennis and football, and generally behave like a normal person. It is a fucking cool feeling!
The operation itself is painless, but it was rather tough from a psychological point of view, at least for me, especially things like the vacuum cup on the eye, the can-opener-like cut of part of the cornea, and so on. But overall it is not something you should be afraid of.
I absolutely do not regret doing it.
I also filed a lawsuit against the development company which built the house where I bought an apartment, to have myself recognized as its rightful owner. It will take a while to settle though, I think a month or two.
Oh, and I play the trumpet. I really do play it, and it sounds quite good when I'm in a good mood and can play loudly on my Yamaha. I still cannot improvise off the top of my head, but I usually have no trouble playing a melody once I have learnt it. Learning can take a while if I have not heard the melody before. Piano playing is rather stuck: I prefer to learn a melody on piano first, but I have real trouble playing with both hands, even when the left-hand part is really trivial, like 2-3 notes. So I usually learn just the melody part before trying it on the trumpet :)
Heh, I’m back to business :)
Stay tuned!