I have started working on basic linguistic tasks, namely morphological analysis. I am doing it in Lisp, of course, although I suspect that development would be much faster in C++ (with its cool STL).
So far I have implemented a simple algorithm that calculates the Damerau-Levenshtein distance between words, which can be used for fuzzy text search. Effectively, that is what ispell does.
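To illustrate the distance calculation, here is a minimal Python sketch (not the Lisp code mentioned above). It implements the restricted variant of the Damerau-Levenshtein distance, often called optimal string alignment, which counts insertions, deletions, substitutions and swaps of two adjacent characters. The names `osa_distance` and `fuzzy_search` and the `max_distance` parameter are made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_search(query, words, max_distance=1):
    """Return the words within max_distance edits of the query (hypothetical helper)."""
    return [w for w in words if osa_distance(query, w) <= max_distance]


print(osa_distance("word", "wrod"))                      # 1
print(fuzzy_search("wrod", ["word", "world", "sword"]))  # ['word']
```

A linear scan over the whole dictionary is the simplest possible search; it is only meant to show how the distance plugs into fuzzy matching.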
Now I am working on the morphological analysis itself. I decided to start with Russian, since I know it quite well and it is a rather complex language. The application should take a word and return a morphological tag for it: whether it is a noun, a verb or something else, which case or conjugation it has, and so on.
I want to use a training corpus to teach the algorithm about lemmas and inflections, which can then be used to recognize other forms and to guess them when there is no strict match. I will use a rather simple algorithm: split off the ending and use it as a pointer to the root of a radix tree that contains the reversed roots of the corpus words. The ending and the matched root together will determine the morphological nature of the word.
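The lookup described above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: a plain trie of nested dicts stands in for the radix tree, the tags and the four training entries are toy data, and the names `EndingIndex`, `first_tag` and the `"$"` marker are invented for the example. A guess is made only when at least one trailing character of the stem matches a known stem.

```python
END = "$"  # hypothetical marker key; assumes "$" never occurs inside a word


def first_tag(node):
    """Return the first tag stored at or below this trie node."""
    if END in node:
        return node[END]
    for child in node.values():
        return first_tag(child)


class EndingIndex:
    def __init__(self):
        self.tries = {}  # ending -> trie of reversed stems

    def add(self, stem, ending, tag):
        node = self.tries.setdefault(ending, {})
        for ch in reversed(stem):
            node = node.setdefault(ch, {})
        node[END] = tag

    def analyze(self, word):
        """Return (tag, is_exact) or None when no known ending fits."""
        best = None
        for i in range(1, len(word) + 1):
            stem, ending = word[:i], word[i:]
            node = self.tries.get(ending)
            if node is None:
                continue
            depth = 0
            for ch in reversed(stem):  # walk the stem from its last letter
                if ch not in node:
                    break
                node = node[ch]
                depth += 1
            exact = depth == len(stem) and END in node
            if not exact and depth == 0:
                continue  # nothing in common with any known stem
            tag = node[END] if exact else first_tag(node)
            if best is None or (exact, depth) > best[:2]:
                best = (exact, depth, tag)
        return None if best is None else (best[2], best[0])


index = EndingIndex()
index.add("кошк", "а", "noun, nominative singular")
index.add("кошк", "у", "noun, accusative singular")
index.add("ид", "у", "verb, 1st person singular present")
index.add("чита", "ю", "verb, 1st person singular present")

print(index.analyze("кошку"))  # ('noun, accusative singular', True)
print(index.analyze("мошку"))  # ('noun, accusative singular', False)
print(index.analyze("веду"))   # ('verb, 1st person singular present', False)
```

The ending "у" alone is ambiguous here (noun or verb); it is the reversed stem that picks the tag, and for unknown words the longest shared stem tail serves as the guess.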
So far I have finished a simple compressed radix tree implementation in Lisp, but there is a huge problem with the training corpus. Did I say that all linguists are greedy bastards? There are no publicly accessible morphologically tagged corpora, and even those who claim to have one do not provide access to it; at least I did not find any anywhere except Wikipedia.
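For readers who have not met the data structure, a compressed radix tree can be sketched in Python like this. It is an illustrative sketch, not the Lisp implementation: each edge carries a whole substring instead of a single character, and an edge is split when a new key shares only part of its label. The names `RadixNode`, `insert` and `lookup` are assumptions, and `None` is reserved to mean "no value stored".

```python
class RadixNode:
    def __init__(self):
        self.edges = {}    # first char of label -> (label, child node)
        self.value = None  # payload if a key ends exactly here


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def insert(root, key, value):
    node = root
    while True:
        if not key:
            node.value = value
            return
        entry = node.edges.get(key[0])
        if entry is None:
            # No edge starts with this character: hang the rest on a new leaf.
            leaf = RadixNode()
            leaf.value = value
            node.edges[key[0]] = (key, leaf)
            return
        label, child = entry
        n = common_prefix_len(label, key)
        if n < len(label):
            # Key diverges inside the edge: split the edge at position n.
            mid = RadixNode()
            mid.edges[label[n]] = (label[n:], child)
            node.edges[key[0]] = (label[:n], mid)
            child = mid
        node, key = child, key[n:]


def lookup(root, key):
    node = root
    while key:
        entry = node.edges.get(key[0])
        if entry is None:
            return None
        label, child = entry
        if not key.startswith(label):
            return None
        node, key = child, key[len(label):]
    return node.value


root = RadixNode()
for stem in ["кошк", "мошк", "шк"]:
    insert(root, stem[::-1], stem)  # keys are reversed stems

print(lookup(root, "кшок"))  # кошк
print(lookup(root, "кшо"))   # None
```

Storing the stems reversed means that words sharing the same tail share the same path from the root, which is what makes guessing by word ending possible.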
Wikipedia in turn provides it as an HTML page like this, and of course there is no strict template, or even a standard format, for the morphological part of the page, so extra tags or symbols can appear between strings and so on. I implemented a simple HTTP downloader in Lisp that tries to analyze Wikipedia pages and pick out the morphological information, but so far it only works for nouns and adjectives. The next task is verbs.
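The extraction step might look something like the following Python sketch, which uses only the standard `html.parser` module and works on an HTML string (downloading is left out). The table layout in `sample` is invented for illustration and is not the real page markup; the class name `CellCollector` is also an assumption. The sketch ignores formatting tags inside cells and skips footnote markers in `<sup>`, which is the kind of noise mentioned above.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of table cells row by row, ignoring inline markup."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None
        self._skip = 0     # nesting depth inside <sup> (footnote markers)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "sup":
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "sup" and self._skip:
            self._skip -= 1
        elif tag in ("td", "th") and self._cell is not None:
            # Join the chunks and normalize whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None and not self._skip:
            self._cell.append(data)


# Hypothetical declension table; real pages are laid out differently.
sample = """
<table>
<tr><th>case</th><th>singular</th><th>plural</th></tr>
<tr><td>nominative</td><td><b>кошка</b></td><td>кошки<sup>1</sup></td></tr>
</table>
"""

collector = CellCollector()
collector.feed(sample)
print(collector.rows)
```

Because there is no fixed template, a real extractor would still need per-part-of-speech rules on top of this to decide which cell holds which form.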
Then I will be able to build a test corpus and create the morphological parser. I expect this is already implemented in ispell, and I could use its dictionaries, but I was not able to find out how to make it dump morphological information about words.
The main question is: why do I need this? And the answer is 'for lulz'. The plan is to create a simple grammar generator that takes a training text, analyzes every word and stores the grammar it has learned. Then the application will pick some other words and produce sentences using that grammar.
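One very crude way to read this plan is sketched below in Python. It assumes that "grammar" simply means the sequence of tags seen in each training sentence, and that the morphological analysis has already produced (word, tag) pairs. The function names, the tag set and the English toy sentences are all invented for the example.

```python
import random


def learn_patterns(tagged_sentences):
    """Return (patterns, lexicon) from sentences given as lists of (word, tag)."""
    patterns = []  # one tuple of tags per training sentence
    lexicon = {}   # tag -> words seen with that tag
    for sentence in tagged_sentences:
        patterns.append(tuple(tag for _, tag in sentence))
        for word, tag in sentence:
            lexicon.setdefault(tag, []).append(word)
    return patterns, lexicon


def generate(patterns, lexicon, rng):
    """Pick a learned tag pattern and fill each slot with a random word."""
    pattern = rng.choice(patterns)
    return [rng.choice(lexicon[tag]) for tag in pattern]


# Toy training data; in the real plan the tags would come from the analyzer.
training = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("runs", "VERB"), ("fast", "ADV")],
]

patterns, lexicon = learn_patterns(training)
print(" ".join(generate(patterns, lexicon, random.Random())))
```

The output is grammatical only in the sense that the parts of speech come in an order that was seen before; agreement in case, gender and number would need the full morphological tags.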
Stupid, but very interesting, and it lets me move further...
Stay tuned!
http://en.wikipedia.org/wiki/Cyc
The latest version of OpenCyc, 2.0, was released in July 2009. OpenCyc 1.0 includes the entire Cyc ontology containing hundreds of thousands of terms, along with millions of assertions relating the terms to each other, however these are mainly taxonomic assertions, not the complex rules available in Cyc. The knowledge base contains 47,000 concepts and 306,000 facts and can be browsed on the OpenCyc website.
It is a rather complex project, and it aims in a slightly different direction. I want to implement a simple grammar analyzer with morphological features first and then try to build a database of relations between different objects automatically, while Cyc implements a largely handcrafted knowledge base.