I have not blogged about technical stuff for a while, since I have been reading articles and books here and there on related topics. To date I do not have enough ground to start development, but there is already a short todo list to be completed in a week or so (well, only the first couple of tasks are supposed to be finished by then; I will see later how it goes, since it is unfeasible to finish them all quickly).
So, here is how the knowledge extraction and cognition development processes are going to be implemented over time:
- Regexp parser implemented as a (non)deterministic finite automaton. The point of this task is not to reinvent the wheel, but to get in touch with complex state machines and their automatic generation from some abstract data and its relations.
- Test the regexp implementation with HTML parsing and a tag stack implementation. It can be used as yet another HTML validator and sanitizer. Not that this is a particular goal, but to date I have not seen any such tool with an extensible configuration (such as a tag substitution language) of how the tag stack should be fixed upon error detection.
- Context-free grammars. LR(1) grammars. The latter is mainly to get in touch with the area. The former is a very powerful tool for analyzing structured data and checking its correctness.
- Knowledge extraction itself – building a weighted graph of the word relations in the studied information.
- Input information processing:
- Knowledge graph of the input data, built not only from the relations in the input text but also affected by the knowledge stored in the already processed data.
- Question processing – generating a reply based on the existing knowledge.
- Free actions – generating unconditional ‘ideas’ based on random spikes of activity in specified areas of the knowledge graph. Unconditional here means not absolutely random, but random within the area with the highest concentration of knowledge weight or within the area specified in the latest input data.
- Input data analysis, looking for confirmation and extraction of known and unknown facts. Generating replies/questions based on that data.
- Linguistically correct reply generation using predefined grammars.
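To make the tag stack idea from the second item more concrete, here is a rough sketch of how such a fixer could look. It uses Python's built-in `re` module to find tags instead of a hand-built automaton, and the repair policy (close unclosed tags, drop stray closing tags) together with the names `TAG_RE`, `VOID_TAGS` and `fix_tag_stack` are assumptions made purely for illustration, not a finished design.

```python
import re

# Matches opening and closing tags; attributes are skipped, not parsed.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

# Hypothetical part of the repair configuration: tags that never get a closing tag.
VOID_TAGS = {"br", "hr", "img"}

def fix_tag_stack(html):
    """Return html with unclosed tags closed and stray closing tags dropped."""
    out = []
    stack = []   # names of currently open tags
    pos = 0
    for m in TAG_RE.finditer(html):
        out.append(html[pos:m.start()])   # text between tags is copied as-is
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if not closing:
            out.append(m.group(0))
            if name not in VOID_TAGS:
                stack.append(name)
        elif name in stack:
            # Close everything opened after the matching tag, then the tag itself.
            while stack:
                top = stack.pop()
                out.append("</%s>" % top)
                if top == name:
                    break
        # Otherwise it is a stray closing tag and is silently dropped.
    out.append(html[pos:])
    while stack:                          # close whatever is still open
        out.append("</%s>" % stack.pop())
    return "".join(out)

print(fix_tag_stack("<b><i>text</b>"))
```

An extensible configuration would replace the hard-coded policy with rules read from some external description, which is exactly the part this sketch leaves out.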
The above tasks actually form a minimum program (for the week or so). It is a kind of HTML compiler. In particular, the knowledge extraction task needs a regexp parser to select fact words (and their combinations) and their relations when they are expressed not in their base form (with changed prefixes and/or suffixes, e.g. ‘plane flew over’ -> ‘plane’ – ‘fly’ – ‘over’).
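As a toy illustration of the ‘plane flew over’ example, the sketch below reduces words to a base form and counts how often two base forms appear next to each other, which gives a very simple weighted graph of word relations. The irregular-form table, the suffix-stripping regexp and the adjacency-only notion of "relation" are all assumptions for the sake of the example; the names `IRREGULAR`, `SUFFIX_RE`, `normalize` and `build_graph` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical lookup for irregular forms; a real one would be far larger.
IRREGULAR = {"flew": "fly", "flown": "fly"}

# Crude suffix stripping, good enough only for this illustration.
SUFFIX_RE = re.compile(r"(ing|ed|s)$")

def normalize(word):
    """Reduce a word to an approximate base form."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if len(word) > 4:                     # leave short words alone
        return SUFFIX_RE.sub("", word)
    return word

def build_graph(sentences):
    """Edge weight = number of times two normalized words are adjacent."""
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = [normalize(w) for w in re.findall(r"[A-Za-z]+", sentence)]
        for a, b in zip(words, words[1:]):
            graph[a][b] += 1
            graph[b][a] += 1              # relations are stored symmetrically
    return graph

g = build_graph(["plane flew over", "planes fly over"])
print(g["plane"]["fly"], g["fly"]["over"])
```

Both sentences collapse to the same ‘plane’ – ‘fly’ – ‘over’ chain, so each edge ends up with weight 2.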
To date I’m somewhere at the very beginning. Back to the drawing board, or rather the reading room…