ioremap.net

Storage and beyond

Improving morphological analyzer and sentence generator

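(сверху впечатлению заикалась дотация , вокруг видению она выныривала)
(roughly: “from above, to the impression, a subsidy stuttered; around, to the vision, it kept surfacing”)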

Since Russian has a noticeably more complex morphological structure than English, and my English is not good enough (most of my blog readers have already noticed that :), I work with my native language, although the derived logic could be applied to the analysis of other languages too.

First, automatic grammar generation is not a very complex task. But every word can have multiple readings (the same word form can be a verb or a noun, in several forms with different cases, and so on), so the number of grammars automatically derived from a given sentence is rarely just one, and it tends to explode as the number of words in the input sentence grows.
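The explosion can be illustrated with a minimal sketch. The `READINGS` table and the `candidate_grammars` helper are hypothetical and written by hand for this example, not taken from the actual analyzer. Each word gets a list of possible readings, and a candidate grammar is one reading per word, so the total is the product of the list lengths.

```python
from itertools import product
from math import prod

# Hypothetical hand-written readings per word form: (part of speech, case, number).
READINGS = {
    "мы": [("pronoun", "nom", "pl")],
    "начинаем": [("verb", None, "pl")],
    "спокойный": [("adj", "nom", "sg"), ("adj", "acc", "sg")],
    "видеоряд": [("noun", "nom", "sg"), ("noun", "acc", "sg")],
    "о": [("prep", None, None)],
    "кости": [
        ("noun", "gen", "sg"), ("noun", "dat", "sg"), ("noun", "prep", "sg"),
        ("noun", "nom", "pl"), ("noun", "acc", "pl"),
    ],
}

def candidate_grammars(words):
    """Return every combination that picks one reading for each word."""
    return list(product(*(READINGS[w] for w in words)))

sentence = ["мы", "начинаем", "спокойный", "видеоряд", "о", "кости"]
grammars = candidate_grammars(sentence)

# The count is the product of the number of readings of each word.
expected = prod(len(READINGS[w]) for w in sentence)
print(len(grammars), expected)
```

Even this six-word sentence yields 20 candidates with such a small table, and every additional ambiguous word multiplies the count again.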

So there should be a way to eliminate some of them from the initial set. One factor that allows dropping impossible combinations of word forms is the relation between certain common word forms: an adjective and its closest noun should have the same case, and some prepositions are only applicable to a noun in a particular case. I’m pretty sure the number of such rules is quite big, but given that my last Russian language lesson was at school some 15 years ago, it is a bit hard to summarize what they should look like.
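A minimal sketch of one such rule, under the assumption that a reading is a `(part of speech, case)` pair and that "closest noun" means the nearest noun after the adjective. The function name `agrees` and the data are illustrative only.

```python
from itertools import product

def agrees(grammar):
    """Reject a grammar when an adjective and the nearest noun after it differ in case."""
    for i, (pos, case) in enumerate(grammar):
        if pos != "adj":
            continue
        # Nearest noun to the right of the adjective, if there is one.
        noun = next((r for r in grammar[i + 1:] if r[0] == "noun"), None)
        if noun is not None and noun[1] != case:
            return False
    return True

# Hypothetical readings for the two-word phrase "спокойный видеоряд".
readings = [
    [("adj", "nom"), ("adj", "acc")],    # спокойный
    [("noun", "nom"), ("noun", "acc")],  # видеоряд
]

candidates = list(product(*readings))
kept = [g for g in candidates if agrees(g)]
print(len(candidates), len(kept))
```

Of the four combinations only the two with matching cases survive. A fuller rule would probably compare gender and number as well.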

I loosely implemented just a couple of rules, and the result looks noticeably more correct now. From a grammatical point of view, of course: currently I do not aim at generating content with any correct meaning.

(мы начинаем спокойный видеоряд о кости)
(roughly: “we are starting a calm video sequence about a bone”)

Another issue I stumbled upon is plainly wrong morphological information stored in the database and dictionaries, like verb voice or noun ‘animation’ (animacy), i.e. whether the noun refers to a living subject or not.

The system can derive active voice, but the dictionary will suggest a passive voice verb for the substitution, and the phrase will explode the reader’s brain with the fact that it is grammatical crap way before it hits the head with its meaning. So I added a couple of checks (applied to the Russian language only, of course), and the results improved noticeably.
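A check of this kind could look roughly like the sketch below. The dictionary entries, their tags and the names `fits` and `substitutes` are assumptions made for illustration. They are not the real database schema, and the tags are not copied from a real dictionary.

```python
# Hypothetical dictionary entries with illustrative tags.
DICTIONARY = [
    {"form": "начинаем", "pos": "verb", "voice": "active"},
    {"form": "начинается", "pos": "verb", "voice": "passive"},
    {"form": "правда", "pos": "noun", "animate": False},
    {"form": "читатель", "pos": "noun", "animate": True},
]

# Features that must match exactly whenever the grammar slot specifies them.
CHECKED = ("pos", "voice", "animate")

def fits(slot, entry):
    """An entry fits when every checked feature named by the slot has the same value."""
    return all(entry.get(key) == slot[key] for key in CHECKED if key in slot)

def substitutes(slot, dictionary=DICTIONARY):
    """Return the word forms that may be substituted into the slot."""
    return [entry["form"] for entry in dictionary if fits(slot, entry)]

print(substitutes({"pos": "verb", "voice": "active"}))
print(substitutes({"pos": "noun", "animate": False}))
```

A slot that does not mention a feature places no constraint on it, so `{"pos": "noun"}` accepts both nouns.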

Another big problem I have right now is preposition logic, or actually its absence. In Russian each preposition can be related to only a limited set of noun forms, and to date I have not found such information structured in a form suitable for the database. So I need to write it manually for several dozen such ‘small words’, which I’m a bit lazy to do.

That’s why automatically derived grammars that contain plain prepositions without a relation to the appropriate nouns usually produce rather ugly random sentences.
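The missing preposition data could be stored as a plain table from preposition to allowed noun cases. The sketch below is hand-written and deliberately tiny. The table `PREPOSITION_CASES`, the case labels and the function `preposition_ok` are assumptions for illustration, and the table is far from complete.

```python
# Hypothetical, incomplete table: preposition -> cases the governed noun may take.
PREPOSITION_CASES = {
    "к": {"dat"},
    "без": {"gen"},
    "для": {"gen"},
    "над": {"ins"},
    "в": {"acc", "prep"},
    "о": {"acc", "prep"},
}

def preposition_ok(grammar):
    """Check that each known preposition is followed by a noun in an allowed case.

    A grammar is a list of (word, part of speech, case) items. Prepositions
    missing from the table are not checked at all.
    """
    for i, (word, pos, _case) in enumerate(grammar):
        if pos != "prep" or word not in PREPOSITION_CASES:
            continue
        noun = next((item for item in grammar[i + 1:] if item[1] == "noun"), None)
        if noun is None or noun[2] not in PREPOSITION_CASES[word]:
            return False
    return True

# Three readings of "кости" after the preposition "о".
candidates = [
    [("о", "prep", None), ("кости", "noun", case)]
    for case in ("gen", "dat", "prep")
]
kept = [g for g in candidates if preposition_ok(g)]
print([g[1][2] for g in kept])
```

Only the prepositional reading survives here, which is the kind of pruning that would keep prepositions tied to suitable nouns.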

To play with the system I implemented the ability to generate content from manually created grammars. Usually they differ from automatically derived ones only a little bit.

(к решению вернулась правда , сзади сомнению она пожимала)
(roughly: “to the decision the truth returned; behind, to the doubt, she was shrugging”)
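Generation from a manually written grammar can be sketched as filling slots with random matching word forms. Everything here is an assumption made for the example: the slot format, the feature names, the `generate` function and the small `LEXICON`, which reuses word forms from the sample sentences in this post.

```python
import random

# Hypothetical lexicon: (word form, features).
LEXICON = [
    ("решению", {"pos": "noun", "case": "dat", "gender": "neut"}),
    ("сомнению", {"pos": "noun", "case": "dat", "gender": "neut"}),
    ("видению", {"pos": "noun", "case": "dat", "gender": "neut"}),
    ("вернулась", {"pos": "verb", "tense": "past", "gender": "fem"}),
    ("заикалась", {"pos": "verb", "tense": "past", "gender": "fem"}),
    ("правда", {"pos": "noun", "case": "nom", "gender": "fem"}),
    ("дотация", {"pos": "noun", "case": "nom", "gender": "fem"}),
]

# A manual grammar: a string is a literal token, a dict is a slot to fill.
GRAMMAR = [
    "к",
    {"pos": "noun", "case": "dat"},
    {"pos": "verb", "tense": "past", "gender": "fem"},
    {"pos": "noun", "case": "nom", "gender": "fem"},
]

def generate(grammar, lexicon, rng):
    """Fill every slot with a random word form whose features match the slot."""
    words = []
    for slot in grammar:
        if isinstance(slot, str):
            words.append(slot)
            continue
        matches = [form for form, feats in lexicon
                   if all(feats.get(k) == v for k, v in slot.items())]
        if not matches:
            raise ValueError(f"no word form matches {slot}")
        words.append(rng.choice(matches))
    return " ".join(words)

sentence = generate(GRAMMAR, LEXICON, random.Random(0))
print(sentence)
```

The output is grammatical by construction as far as the slot features go, and otherwise random, since nothing constrains the meaning.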

To date I consider this part of the AI task ready. The next one is long-term memory and fact extraction.

The basic idea is to create a system which will be able to memorize not only document content, but also relations between separate word forms and phrases. Such memory will allow extracting knowledge from the input data according to already learned information.
Thus we will be able to select words not randomly like now, but according to the system’s memory, so that the selected set of words will be intelligent and related to some of its previously learned internal facts.

And it has to be automatic. The same way the system is able to automatically derive grammars from random sentences according to some rules, it should extract possibly hidden relations between terms in the input data.

It’s time to do some serious thinking on this problem…
