Tag Archives: Other

Other development notes

pthreads vs fork

Elliptics supports atomic server-side script execution; in particular there is a Python implementation for complex scripting tasks like the directory structure support in pohmelfs.

Python is by its nature single-threaded, so we want a pool of processes, each hosting a Python interpreter. Elliptics, by contrast, is a heavily multi-threaded application. As practice showed, mixing those two beasts is a tricky and dangerous business.

fork() copies the whole allocated virtual memory space of the process as seen by the thread which invoked the syscall. In particular it copies all internal shared library state, which can be modified concurrently by parallel threads. To deal with this problem POSIX introduced an uglymoron called pthread_atfork(), which can only help (if this can ever be called help) with your own state (like locks).

And while I could properly deal with elliptics' own state, I cannot even know what shared libraries are doing during the fork() syscall. One of my threads printed a timestamped log message, which dug into libc's timezone conversion, which happens under its private lock.
When fork() copies that lock into the newly created process, it is still locked there, so subsequent Python initialization, which wants the timezone too, deadlocks.
If you start a pool of a hundred Python workers, about 30 of them end up 'deadlocked' in timezone conversion.
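The failure mode is easy to reproduce in a few lines of Python (a toy sketch, not elliptics code; the worker thread stands in for libc holding its private tz lock):

```python
import os
import threading
import time

lock = threading.Lock()
ready = threading.Event()

def worker():
    # stands in for a library thread holding an internal lock,
    # like libc's private timezone-conversion lock
    lock.acquire()
    ready.set()
    time.sleep(1)           # keep the lock held across the fork
    lock.release()

threading.Thread(target=worker, daemon=True).start()
ready.wait()

pid = os.fork()
if pid == 0:
    # only the forking thread survives in the child, but the copied lock
    # is still in the locked state -- a blocking acquire would hang forever
    os._exit(0 if lock.acquire(False) else 1)

_, status = os.waitpid(pid, 0)
print("child saw the inherited lock as", "free" if status == 0 else "held")
```

The child always reports the lock as held, which is exactly the state the Python workers deadlocked on.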

To fix this issue we really want to 'clean' the address space of the forked children, but plain cleaning would break the 'connection' with the parent process for any pipes or sockets already created. So I execve() a special binary which in turn initializes the Python interpreter and connects to the parent via named pipes. This was originally implemented in the libsrw library, but then I moved it into elliptics itself to minimize dependencies.
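A rough Python equivalent of that fix (with a hypothetical worker command, not the elliptics binary): spawn a fresh interpreter via fork()+execve(), which gets a clean address space, and keep only the pipes as the link back to the parent:

```python
import subprocess
import sys

# a fresh interpreter with a clean address space; the stdin/stdout pipes
# survive the execve() and connect it back to the parent
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

out, _ = child.communicate("hello")
print(out)  # HELLO
```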

That bug took me almost a whole week to track down: pohmelfs sometimes got stuck during synchronization of a huge number of files. Now things go smoothly. I should start pushing pohmelfs into the kernel again.

splice() syscall

I found that the splice() syscall does not transfer data when the in and out file descriptors are the same or refer to the same file. Instead the destination 'space' is filled with zeroes where it was supposed to contain the input buffer content.

Here is my code for the reference (with some debug added):

static int eblob_splice_data_one(int *fds, int fd_in, loff_t *off_in,
		int fd_out, loff_t *off_out, size_t len)
{
	/* fds is an intermediate pipe: fds[0] is the read end, fds[1] the write end */
	int err;
	size_t to_write = len;

	while (to_write > 0) {
		err = splice(fd_in, off_in, fds[1], NULL, to_write, 0);
		printf("splice  in: %zu bytes from fd: %d, off: %llu: %d\n",
				to_write, fd_in, (unsigned long long)*off_in, err);
		if (err == 0) {
			err = -ENOSPC;
			goto err_out_exit;
		}
		if (err < 0) {
			err = -errno;
			goto err_out_exit;
		}

		err = splice(fds[0], NULL, fd_out, off_out, to_write, 0);
		printf("splice out: %zu bytes into fd: %d, off: %llu: %d\n",
				to_write, fd_out, (unsigned long long)*off_out, err);
		if (err == 0) {
			err = -ENOSPC;
			goto err_out_exit;
		}
		if (err < 0) {
			err = -errno;
			goto err_out_exit;
		}

		to_write -= err;
	}

	err = 0;

err_out_exit:
	return err;
}

Unfortunately it does not even return an error, but silently corrupts data.

I would be happy to be wrong of course.

LevelDB benchmarks against Kyoto Cabinet and SQLite

Holy shit – who the hell tests a DATABASE by writing thousands to millions of records of 100 bytes each?

There is a test where they write 1000 values of 100k each – a bit of an overkill for a database, and yet it is still ALL IN MEMORY.
Those tests actually measure "how fast we can read from / write to RAM if our control structures are good enough not to contend on locks and the like".

What is really interesting is how it behaves when the database does not fit into memory, or at least comes close to that, rather than being a hundred MB.

Moving forward

To date there are no major features requested by our users of the elliptics network. We want to do some log message refactoring, monitoring tools and the like.

But overall it is considered complete. The first major production rollout of the new version will be in January, and right now it is being heavily tested in several projects' labs.

In the meantime I wrote a simple enough dating site for my colleagues who do not want to celebrate the New Year alone. It was quite fun to program in Python/Django, but it's time to move forward.

And we decided that a PAXOS-compliant locking service will be the next project. Yet I'm not sure whether I will use plain C with an event-driven library (like libevent or libev), C with GLib, or C++ with some kind of external framework (maybe even boost::asio, although the errors produced by Boost are way too much of a mess).

The framework is supposed to serve as a layer hiding the network communication behind the async PAXOS state machine transitions. Writing a simple enough message-passing interface is not a problem, but it remains to be seen whether extensions will be needed.

Stay tuned!

Shotwell extension for Yandex.Fotki was committed

My 1100+ line patch adding Yandex.Fotki support has been committed to the Shotwell photo manager svn trunk (tracker: http://trac.yorba.org/ticket/2536 )

Yandex.Fotki is the biggest photo hosting service in Russia and Eastern Europe; it has about 2 million users and I believe quite a few of them use Linux. There are already more than 500 TB of photos.

The patch is very non-intrusive for the Shotwell core: it just adds another class into the web publishing process, generic set/load/clear gconf helpers and some deps in the makefile (all the libs are already installed with the glib package required by Shotwell anyway).

What we can do:

  • upload photos into an existing album
  • create an album if needed
  • set access permissions as well as hide-original/forbid-comments metadata flags (XXX flag was removed)
  • upload photos with local tags
  • set a local tag which prevents upload (to eliminate copies; one can remove the tag and the upload will succeed, of course) – removed in favour of a future generic implementation suitable for all web exports
  • use OAuth 2.0 for authorization (using the embedded WebKit toolkit to request tokens if they are missing, as well as allowing registration of a new user right from Shotwell)
  • cache username/tokens in gconf

There is an Ubuntu package I made (for Lucid with 2 additional packages from Maverick, but it should work on Maverick out of the box), or one can browse the archive.

Add support for Yandex.Fotki to Shotwell photo manager

Sent an 1100+ line patch adding Yandex.Fotki support to the Shotwell photo manager.
Here is an announcement.

UPD: Tracker http://trac.yorba.org/ticket/2536
I fixed all issues to date, let’s see how this will end.

The attached patch adds support for the Yandex.Fotki photo hosting site, which is the biggest photo hosting service in Russia and Eastern Europe; it has about 2 million users and I believe quite a few of them use Linux.
There are already more than 500 TB of photos.

Anyway, having Yandex.Fotki support in Shotwell will be a good stimulus to start using Linux :)

Or maybe not, who knows.
The patch is very non-intrusive for the Shotwell core: it just adds another class into the web publishing process, generic set/load/clear gconf helpers and a json-glib-1.0 dep in the makefile (this lib is already installed with the glib package required by Shotwell).

What we can do:

  • upload photos into an existing album
  • create an album if needed
  • set access permissions as well as XXX/hide-original/forbid-comments metadata flags
  • upload photos with local tags
  • set a local tag which prevents upload (to eliminate copies; one can remove the tag and the upload will succeed, of course)
  • use OAuth 2.0 for authorization (using the embedded WebKit toolkit to request tokens if they are missing, as well as allowing registration of a new user right from Shotwell)
  • cache username/tokens in gconf

There is a small caveat though: Yandex has not yet released OAuth for photos, so I use 'undocumented' server names and client ids. One can edit special gconf keys to add the proper strings, which will change the server names and client id. When OAuth for the photo hosting is released to the public I will of course send an updated patch to change the default names.

This is just in case you suddenly make a new release which would miss the Yandex.Fotki functionality :)

Please review, comment and apply.
Thank you.

Want to do it good – do it yourself


Python does not have an audio library which allows playing music asynchronously and controlling it in close to real time. Mainly I need fast audio start/stop, triggered from Python code when I press and release different keys on a software piano keyboard.

So I will write it in C using SDL and SDL_mixer library in particular:

Mix_OpenAudio(audio_rate, audio_format, audio_channels, audio_buffers);
music = Mix_LoadMUS("music.ogg");
Mix_PlayMusic(music, 0);

The latter two functions can be used to control the audio flow, and it will be mixed with other system sounds out of the box by SDL_mixer. SDL_mixer supports a wide variety of formats, including Ogg Vorbis and MP3.

And in a week or so I will publish a nice project I spent a week on. Actually it can be announced right here, but it does not yet work with the production servers.
I created a simple enough photo uploader for the Shotwell photo management application which uses the Yandex.Fotki photo hosting, the largest one in Russia, which I use quite frequently now. I hope my 1000-line patch will be accepted (some people will even bring me beer for that :)

If I understood correctly Shotwell will replace F-Spot in modern distros.

Shotwell is written in the Vala language, which is a kind of C#, but its compiler is sooo weird that it is very unlikely I will return to it anytime soon. But Java-style memory management (i.e. when you do not care about malloc/free) as well as compiler-forced exception handling (one has to use try/catch blocks when some function can throw exceptions) just makes me relax while programming…

Python loving psto

I love languages with rich standard library. Python is just awesome in this regard.

But the amount of already written extensions is outstanding – I used to parse HTML with regexps in Lisp, but in Python with python-lxml it took just a couple of hours to parse rather broken HTML using XPath and small string-matching calls.

I spent one day writing a parser of non-structured morphological data (frequently with suddenly unwanted symbols or additional tags within) from aot.ru to create a quite large (300k+ morphemes) Russian dictionary, then store it in a prefix array and output it as an XML file.

Yes, default CPython sucks with threads, and it is not (yet) suitable for trivial audio processing (playing and stopping a sound when pressing/releasing a key), but it is just bloody ubergood at high-level prototyping.

Returning to morphological analysis, I'm about to start rewriting my experimental knowledge extraction and grammar generation 'engine' from Lisp to Python. And I expect some cool results with it soon.

Python threads: scheduling

Python threads suck! They have no deterministic scheduling behaviour, i.e. one does not know when a thread will run and for how long.

And the main problem is that a thread may well not give up control when it hits a syscall and/or a lock. So it is not possible to control it using a lock acquired in the main thread and the spawned one. A PyQt signal will not be handled with enough priority to interrupt a spawned thread which drops into thread.lock.

Eventually the spawned thread will give up control, but this non-determinism is not acceptable for the piano project I work on – we have to start and stop sound when a key is pressed (modulo some short enough timeout, but without falling into extremes like low-latency jack audio servers).

Python threads …

Using the ‘threading’ module apparently is not a very good idea:

$ python /tmp/threads.py
# ... interrupt by Ctrl-C

Traceback (most recent call last):
  File "/tmp/threads.py", line 12, in 
  File "/usr/lib/python2.5/threading.py", line 442, in start
    _sleep(0.000001)    # 1 usec, to let the thread run (Solaris hack)

$  cat /tmp/threads.py 
import threading, sys

test_num = 0
class TestThread(threading.Thread):
	def run(self):
		global test_num
		test_num = test_num + 1

num = 10000
for x in xrange(num):
	TestThread().start()

print "var: ", test_num, "num: ", num

I created this trivial Python threading example, which spawns 10k threads updating a global variable, to find out whether the GIL protects it and to check how fast thread creation is. It turned out that Python threads are damn slow to create (and the traceback above confirms that – Python sleeps there!). During this test the Python interpreter ate about 1% of the 16 available CPUs and there was only a single thread at a time, which means the GIL serializes thread creation and execution. When I added time.sleep() there, the GIL was dropped to enter the syscall, and procfs confirmed that many simultaneous threads were created.

But the low-level module 'thread' is different. As the documentation states:

This module provides low-level primitives for working with multiple threads (also called light-weight processes or tasks) — multiple threads of control sharing their global data space. For synchronization, simple locks (also called mutexes or binary semaphores) are provided.

I do not know what the above high-level stuff does, but low-level threads are created very quickly, forcing the Python interpreter to eat gigabytes of virtual memory and then throw unknown exceptions at the end (practice shows that waiting for all threads to complete does not raise those exceptions; maybe this happens only in 2.5.2):

Unhandled exception in thread started by 
Error in sys.excepthook:

$ cat /tmp/low_level_threads.py 
import thread

test_num = 0
def test_thread(smth):
	global test_num
	test_num = test_num + 1

num = 10000
for x in xrange(num):
	thread.start_new_thread(test_thread, ('qwe',))

print "var: ", test_num, "num: ", num

This application ate about 60-100% of a CPU, so I wonder whether the GIL allows multiple threads to run (and not wait in a syscall) on different physical processors without hacks (i.e. explicit GIL state drop and acquire).
There are also nice lock classes in that module.

I plan to use a separate thread for each key in my piano keyboard emulator, scheduled from the key-pressed signal.
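That per-key-thread plan can be sketched with the standard library (a toy model: an Event stands in for the key state, and a Queue records what the audio code would do):

```python
import queue
import threading
import time

# one worker per key, parked on an Event that the key-pressed signal
# sets and the key-released signal clears
class KeyVoice(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self.pressed = threading.Event()
        self.events = queue.Queue()

    def run(self):
        while True:
            self.pressed.wait()            # block until key down
            self.events.put("start")
            while self.pressed.is_set():
                time.sleep(0.001)          # real code would push samples here
            self.events.put("stop")

voice = KeyVoice()
voice.start()
voice.pressed.set()                 # key pressed: sound starts
seen = [voice.events.get()]         # blocks until the voice reacts
voice.pressed.clear()               # key released: sound stops
seen.append(voice.events.get())
print(seen)  # ['start', 'stop']
```

Blocking on the Queue rather than sleeping keeps the test of the worker deterministic regardless of scheduling.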

Python piano

Spent several hours writing a trivial piano emulator in Python. I do not have a book, so I just googled whatever did not work: apparently all newbie problems are already solved.
I used the PyQt bindings and Qt Designer to create the UI. pyuic4 generates somewhat ugly Python code: first, Python 2.6 does not understand it (namely the assignment of a value to a function call, like self.some_func() = something), so I had to edit it manually (I found a post saying the latest Python snapshots already support it; likely it was about Python 3); second, it does not know about loops and just creates a static representation of the form screen.
Also I did not work hard on things like resizing (I just forbade it :)

To play sounds I used the PyMedia library; unfortunately it is not available in Ubuntu, and its compilation failed, so I had to edit the setup scripts and put the proper defines into the C code :) Fortunately I know C a little.

Probably I will switch away from PyMedia, since it does not allow control over the output sound buffer size. I need to 'schedule' audio sample playback so that the Qt signal handler returns and the next one (like key release) can be processed. PyMedia implements async sound output, but that may not be enough – for example, it blocks trying to fetch and play 100,000 samples.
Or I will try to work with Python threads and assign press/release signals to different threads (if that is possible; I do not know Python well enough yet).

This is very ugly proof-of-concept code to date. I will polish things and split the downloaded scales into separate tones, which will be played when the appropriate keys are pressed. I will also try to implement nicer classes for keys and sounds.

And in the meantime, my workplace screenshot (full-screen on click):

I found it so simple to program in Python that it is very likely I will switch all prototyping development from Lisp to Python. Lisp is great and I do like its features, but it is way too rare that I use its cool abilities like writing complex macros; most of my macros were rather simple with-something () wrappers like with-open-file() and friends.
Lambdas and closures for internal state are great, but iirc Python also has them.

But the main feature of Python I like is its incredible standard library, as well as the zillions of already written extensions. I was not able to find graphical bindings or audio processing for CLISP, for example, although it is quite unusual to write a GUI in Lisp.
I had to write an HTTP GET/POST utility myself using low-level sockets; regular expressions in CLISP are rather limited, and although cl-ppcre is great, rich regexps are available in Python out of the box. Python has some multithreading support (and the ugly GIL, to name it), while CLISP does not (at least version 2.44.1, which comes with Ubuntu Lucid).
And so on, and so on…

Stay tuned, I will play more with this project, after all I want to play some simple music at work :)

I am a musician

A fucking bad musician actually, but I learn…

Yes, there are harmonies and scales, chords and arpeggios and so on, which I study every day, as well as practicing the trumpet itself. And I have quite good progress, I think.

But very frequently I want to play some simple (or complex) motif of a melody I hear. Or 'work around' it a little… At home I can play the piano and try to pick out the melody by ear, but in the office or anywhere near a computer I cannot do this.

The reason is simple: Linux does not have a piano synthesizer. Well, it has plenty of them, but each one plays through MIDI. And the audio system in Linux is only good for listening. MIDI/alsa/oss/jackd/pulseaudio – there are thousands of them, and this web only works in some very precise setups.

I tried to install plenty of MIDI software on Ubuntu Lucid Lynx – epic fail. Jackd stops, pulseaudio crashes and freezes… It is such a mess.

So I decided to write a trivial application which will look like a piano keyboard and play recorded samples, instead of going through a MIDI synth, when keyboard or mouse keys are pressed. I checked FreshMeat – there are only MIDI synthesizers there.

And I need a simple, fast, nice, non-brain-fucking-to-set-up application which just plays what I press.

O, poor python

My study of this language started today. And while my knowledge is somewhere between zero and void, I can already ask stupid questions which were not answered on #python / irc.freenode.net.

What Python type should one use when a C++-to-Python boost::python binding requires an 'unsigned char *' parameter?
A code snippet will tell much more:

class test_class : public test_class_base {
public:
		test_class(const char *path);
		virtual ~test_class();

		virtual void log(const char *msg);
		unsigned char test(unsigned char *ptr) { return *ptr; };
		std::ofstream *stream;
};

	class_<test_class_wrap, boost::noncopyable, bases<test_class_base> >("test_class", init<const char *>())
		.def("log", &test_class::log, &test_class_wrap::default_log)
		.def("test", &test_class::test)
	;

I omitted the wrapper class definition, since it is not critical. The binding works for the Python-string to C++ 'const char *' transform, but the unsigned char pointer fires an error. I tried both struct.unpack_from("P", array) and array.buffer_info()[0], but Python says they both return 'int' while the C++ code expects 'unsigned char *':

from array import array
import struct

id = array('B')
for x in xrange(0, 20):
	id.append(x)

s = struct.unpack_from("P", id)
print hex(s[0])

t = test_class("/dev/stderr")
print hex(t.test(s[0]))

[zbr@baccara python]$ ./test.py
Traceback (most recent call last):
  File "./test.py", line 19, in 
    print hex(t.test(s[0]))
Boost.Python.ArgumentError: Python argument types in
    test_class.test(test_class, int)
did not match C++ signature:
    test(test_class {lvalue}, unsigned char*)

Code snippets can be also found at http://paste.pocoo.org/show/239085/.

If no solution is found, I will refactor the elliptics Python binding to support new classes for the missing types, like 'elliptics_id' for 'unsigned char *'.
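For what it's worth, one pure-Python way to get raw unsigned-char storage that native code can consume is ctypes (a sketch only; test_class from the snippet above is not touched here):

```python
import ctypes

# a ctypes array of c_ubyte is contiguous unsigned-char storage,
# and can be cast to the pointer type the C++ side expects
buf = (ctypes.c_ubyte * 20)(*range(20))
ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_ubyte))
print(hex(ptr[0]), hex(ptr[5]))  # 0x0 0x5
```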

Week and weekend

A new weekend is coming, and there is new snow in Moscow.

But unfortunately I'm unlikely to get to the 'mountains' this weekend – I hurt my leg last week, so badly that I even walked with a crutch for three days. It was a rather warm day, although it was below -10C early in the morning when I headed to the ski resort.
But on the slope the weather was noticeably warmer – about 0 degrees Centigrade or so, and quite moist. So my new skis felt quite uncontrollable in the high and middle stance. But sitting rather low I was able to control the skis at quite high speeds, although this required substantial muscle effort.

I managed to film a small porn video of how I ski down the red trail in Stepanovo. A phone in the left hand is not the best way to fight for an Oscar, but it was fun. There is completely no feeling of speed, although it was substantial, for me at least – more than 40 km/h (about 11 meters per second). Calculated by dividing the trail length by the moving time, so effectively it does not take into account the arc length, which I prefer to keep small to medium.

At such speeds I manage to outrun many of the skiers and almost all snowboarders. But since I have essentially no technique (I have been on outdoor trails three times, each time spending about 3-5 hours on the slope), it is likely that I move quite wrongly. And this can explain the problems I sometimes have on the slope.

Add the weather and wet snow, and the result is quite simple: I fall. I do not care about that unless I feel the pain for longer than a day or so. And this week was the first time the pain was that strong and long.
I managed to outrun yet another boarder, lost control of the skis, fell and flew several meters off the trail, breaking the boarding net :)

Well, it was quite simple to break that boarding net, but there was a noticeable gap off the trail, where I crawled several meters over the snow. The leg did not hurt that much on the trail, but when I got home the pain started to show up.

Currently I feel mostly OK, although I play table tennis quite slowly and cannot walk without a slight limp. Well, recently I could only move with a crutch :)
So, things are getting better.

In the meantime I added a fair number of tasty things to the elliptics network project; namely, I broke the addressing storage model – now each node stores IDs which are greater than the node's ID. This breaks compatibility, but makes it simple for a human to understand how objects are spread over the storage.
Also implemented random transformation-function selection for read IO requests in the fastcgi frontend; now we can balance reading among multiple data copies. Dropped BerkeleyDB support – Tokyo Cabinet performs way faster, so I do not see any reason to support both. Made a big step towards complete merge support; I expect it to be finished very soon, which will mark the first 2.7.x release – there is a fair number of changes accumulated already.

And as a tasty project to warm up the brain I decided to implement a rhyme generator based on Damerau-Levenshtein distance and a sound-syllable similarity algorithm. It is not yet formalized even in my head, but it is an interesting thing to think about.

Also managed to win a lawsuit against the development company which built my house (without the judge and defendant present, though). I'm quite close to finally getting property rights to my apartment and selling it for good. I believe it's time to make my living space wider.

So far so good. Stay tuned!

Vigenere crackdown

Suddenly decided to make something bad, while waiting for other things to settle.

Years ago I used to believe that I knew something about hacking. Not kernel hacking, but the kind related to cracking various supposedly secure systems. Starting from ciphers down to code 'issues' (described in phrack, yup). It is rather laughable right now :)

But today I know that reading (and even understanding) smart and hardcore articles is really far from being able to apply that knowledge to practical problems. So I decided to start (and complete a practical implementation) with really simple things: various mono/polyalphabetic ciphers. Caesar/rot13 and Vigenère are good choices afaics.

Bruce Schneier's Self-Study Course in Block Cipher Cryptanalysis does not let me fall asleep easily, although it is a little bit of fun to compare my alphabet cipher analysis against even the simplest attacks described in the article.

Anyway, it took me a day or less to hack up a semi-automatic Vigenère and monoalphabetic cipher cracker in LISP. For Vigenère there are two steps: find the key length, then split the text and decode the monoalphabetically enciphered blocks.

The former task is performed manually – the application only searches for trigrams in the text and shows their frequencies. When there are 3 or more repeated trigrams it is possible to find the key length with rather high probability. Bigrams will work too, but the error rate is higher. One has to find the distances between occurrences of the same trigram and then find their greatest common divisor, which will likely be equal to the cipher key length.

The Vigenère cipher is theoretically uncrackable when a unique key long enough to cover the input message is used. But in practice shorter keys are used, repeated a number of times so that the resulting key string length equals that of the plaintext message. So, when the cipher starts to repeat itself, the same letters will be encrypted into the same ciphertext.

This method is named after Friedrich Kasiski, who found it 150 years ago. When the key length is found, we split the ciphertext into separate strings, each encrypted with a monoalphabetic cipher. Each string consists of letters separated by the key length, i.e. string number n is formed from the letters at positions n + i * key_length, where i runs over the ciphertext until it is fully covered.
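The Kasiski step can be sketched in a few lines of Python (an illustration with a made-up ciphertext, not the LISP code from the post):

```python
from functools import reduce
from math import gcd

def kasiski_key_length(ciphertext):
    # distances between repeated trigrams; their greatest common
    # divisor is very likely (a multiple of) the key length
    positions = {}
    distances = []
    for i in range(len(ciphertext) - 2):
        tri = ciphertext[i:i+3]
        if tri in positions:
            distances.append(i - positions[tri])
        positions[tri] = i
    return reduce(gcd, distances) if distances else None

print(kasiski_key_length("abcdefgabcdxyzabc"))  # 7
```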

Monoalphabetic ciphers are rather trivial to crack using frequency analysis. Namely, we find the most frequent letters in the encrypted text; they will correspond to the most frequent letters of the language used for the plaintext. The difference between those letters is the cipher shift, so it is trivial to recover the text just by replacing each letter with the one shifted back by the calculated number.
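The frequency-analysis step for a single monoalphabetic (Caesar-shifted) string can be sketched as follows (assuming English plaintext, so 'e' is expected to dominate; the sample text is made up):

```python
from collections import Counter

def crack_shift(ciphertext, most_common_plain="e"):
    # the most frequent ciphertext letter is assumed to encrypt the
    # most frequent plaintext letter; their distance is the shift
    top = Counter(ciphertext).most_common(1)[0][0]
    shift = (ord(top) - ord(most_common_plain)) % 26
    return "".join(chr((ord(c) - ord("a") - shift) % 26 + ord("a"))
                   for c in ciphertext)

plain = "weseektherepeatedtrigramsinthetext"
enc = "".join(chr((ord(c) - ord("a") + 3) % 26 + ord("a")) for c in plain)
print(crack_shift(enc) == plain)  # True
```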

Here is the example I used (a random New York Times article, cut just to show the idea; I also converted it to lowercase and dropped all non-letter characters, since the gcipher application does not understand them):

After spending months searching for a bipartisan consensus on financial regulatory reform, Senator Christopher Dodd, chairman of the banking committee, is expected to unveil his own bill on Monday, without one Republican supporter.


That's how it was encrypted; the resulting ciphertext follows:

$ gcipher -c Vigenere -k asknfalwiwf v1.enc v2.enc


The trigram search uncovers that the most frequent ciphertext trigram 'tzo' is spread through the text with distances that are multiples of 11, which corresponds to the key length above.

Splitting the text and calculating frequencies is a rather simple technical task. After decoding we get the following result:


The second line is the plaintext message, which differs in fewer than half of the symbols. I was not able to decode some parts of the text because of the small text length, so frequency analysis did not always provide correct data. But even as-is the decoded text allows one to read and recover the data.

On this positive note I will start preparing for the weekend skiing.

Stay tuned!

Hashes and theory of their cracking

Of course there is no such theory, but in practice breaking a hash is fascinating for the researcher.

Currently on netdev@ people have started a lengthy discussion about a new hash for interface names and (optionally?) for the dentry hash, or maybe I misunderstood the latter.

Anyway, there is more than a dozen different algorithms being tested for distribution quality and speed. It will be very interesting to find out which one is selected.

Actually it is only interesting from a single angle – how to break it. By breaking I mean creating an application which can generate input data producing the same hash value after being processed by the selected algorithm.

That's what I did for the Jenkins and Bernstein/Torek (hash * 33) hashes quite a while ago already.
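For the Bernstein "hash * 33" case the break is almost mechanical: because an earlier character is weighted by one extra factor of 33, raising it by 1 while lowering the next character by 33 leaves the hash intact. A sketch:

```python
def djb2(s):
    # Bernstein's hash: h = h * 33 + c, as discussed above
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xffffffff
    return h

a = "ab"
b = chr(ord("a") + 1) + chr(ord("b") - 33)   # "bA"
print(djb2(a) == djb2(b), b)  # True bA
```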

Looking forward for the new hash :)

NTT Cyber Space Labs presents Sheepdog – distributed storage system for KVM

MORITA Kazutaka wrote:

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.

The following list describes the features of Sheepdog.

* Linear scalability in performance and capacity
* No single point of failure
* Redundant architecture (data is written to multiple nodes)
– Tolerance against network failure
* Zero configuration (newly added machines will join the cluster automatically)
– Autonomous load balancing
* Snapshot
– Online snapshot from qemu-monitor
* Clone from a snapshot volume
* Thin provisioning
– Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, – = on our todo list)

More details and download links are here:

Note that the code is still in an early stage.
There are some critical TODO items:

– VM image deletion support
– Support architectures other than X86_64
– Data recovery
– Free space management
– Guarantee reliability and availability under heavy load
– Performance improvement
– Reclaim unused blocks
– More documentation

IMHO, block-level distributed systems are dead overall, although they have their niche.


I did not blog about technical stuff for a while, since I was reading articles and books here and there on related topics. To date I have not built enough ground to start development, but there is already a short todo list to be completed in a week or so (well, only the first couple of tasks are supposed to be finished; I will see later how it goes – it is unfeasible to finish them all quickly).

So, here is how the knowledge extraction and cognition development processes are to be implemented over time:

  • A regexp parser implemented as a (non)deterministic finite automaton. This task is not about inventing another bicycle, but about getting in touch with complex state machines and their automatic generation from abstract data and its relations.
  • Test the regexp implementation with HTML parsing and a tag-stack implementation. It can be used as yet another HTML validator and sanitizer. Not that this is a particular goal, but to date I have not seen any such tool with an extensible configuration (like a tag-substitution language) of how the tag stack should be fixed upon error detection.
  • The above tasks actually form the minimum program (for the week or so). It is a kind of HTML compiler. In particular, the knowledge extraction task needs a regexp parser to select fact words (and combinations) and their relations when they are expressed in a grammatically incorrect form (like changed prefixes and/or suffixes, i.e. ‘plane flew over’ -> ‘plane’ – ‘fly’ – ‘over’).

  • Context-free grammars. LR(1) grammars. The latter is mainly to get in touch with the area. The former is a very powerful tool for structured data analysis and its correctness.
  • Knowledge extraction itself – building the weighted graph of word relations in the studied information.
  • Input information processing:
    • A knowledge graph of the input data, built not only from the input text relations but also shaped by the knowledge stored in the already processed data.
    • Question processing – generating a reply based on the existing knowledge.
    • Free actions – generating unconditioned ‘ideas’ based on random spikes of activity in specified areas of the knowledge graph. Unconditioned means not absolutely random, but random in the area with the highest concentration of knowledge weight, or in the area referenced by the latest input data.
    • Input data analysis looking for confirmation and extraction of known and unknown facts, generating replies/questions based on those data.
  • Linguistically correct reply generation using predefined grammars.
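For the first item, the kind of state machine such a parser would generate can be shown with a hand-built toy automaton for the regexp 'a*b' (an illustrative sketch only):

```python
# deterministic automaton for "a*b": state 0 loops on 'a',
# moves to the accepting state 1 on 'b'
DFA = {0: {"a": 0, "b": 1}}
ACCEPT = {1}

def matches(s):
    state = 0
    for ch in s:
        state = DFA.get(state, {}).get(ch)
        if state is None:
            return False
    return state in ACCEPT

print(matches("aaab"), matches("b"), matches("ba"))  # True True False
```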

To date I'm somewhere at the very beginning. Back to the drawing board – er, the reading room…

On knowledge extraction problem

I thought of a weighted graph of the dictionary learned by the system, where a node corresponds to a word, each link's value corresponds to the weight of the appropriate phrase, and the number of links going out of every node determines how frequently the given word was used in the learned texts.

The same graph can be built from a new input sentence and the already learned data (like from memory), which will correspond to the meaning of the phrase as seen by the given system. It is still quite rough in my head, but I have a direction to think in.
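A minimal sketch of such a graph (adjacent words only, with co-occurrence counts as link weights; the real system would of course use phrase-level relations):

```python
from collections import defaultdict

def build_graph(sentences):
    # nodes are words; the link weight counts how often two words
    # appeared next to each other in the learned text
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

g = build_graph(["the plane flew over", "the plane landed"])
print(g["plane"]["the"], g["plane"]["flew"])  # 2 1
```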

What I'm a bit stuck on is response generation. Let's suppose I determined the meaning of a question sentence. I expect it to be similar to what search engines output, but it should be quite different for questions like "where to swim on vacation in Moscow" or "do you like Stanislav Lem" and other similar kinds of questions.

Having a local system memory means a subjective decision on each question, but I have trouble imagining how the system should respond. How to program its own decision on what and how to answer when there is (some kind of) understanding of what was said.

Going to sleep; they say the brain keeps thinking about a problem during the short sleep stage, so maybe I will be enlightened. I think it's time to create an AI tag for this kind of ideas :)
And it absolutely does not mean I have stopped my other projects!