POHMELFS and elliptics network design issues

Currently POHMELFS has a coherent writeback cache on each client. This feature requires special protocol to be developed which manages data consistency between clients. Namely it maintains per-object-per-client data on the server, which is checked each time new client access the data.

There are at least three problems with this:

  • need to store a state on the server about client, which means that server crash can not be easily handled
  • increasing number of communication messages needed to be sent when new client access some object, it grows lineary with number of clients, now suppose a huge server with lots of clients and large amount of data objects
  • need to send lock request and wait for ack for each new or flushed from the local cache object

Plus code complexity, POHMELFS implements MOSI-like cache coherency protocol, and while modern processors can observe the bus and flush cache lines, distributed cloud system has to maintain message protocol for each data access.

Another huge problem is server redundancy, since effectively all states have to be replicated on multiple servers.

To date there are no open network systems with such cache support. Originally it was presented in Oracle CRFS developed by Zach Brown, but currently its development is rather dead.

Experiments with POHMELFS and latency led to decision to drop this implementation while maintaining high performance for metadata-intensive operations, which especially benefit from it.

Main idea is to continue to split IO operations from network processing, the same way it was done in elliptics network, where network IO is handled by the properly implemented state machine on top of pool of threads. POHMELFS currently supports this only for data receiving. Data writing blocks in cache writeback, which sends data to the servers.

Now I plan to introduce transaction queue where each new data IO request will find and concatenate transactions before they are sent by the dedicated threads. Reading for up-to-date pages will not force any kind of network IO, while writes and fresh reads will just allocate and queue a transaction. Write then stops while reading sleeps for transaction completion.
Subsequent write may append data into just created transaction if it was noet yet sent.
As previously transaction will live in retransmit queue until ack is received and resend if needed.

Local cache invalidation will be handled similar way it exists currently in POHMELFS and its server, which I will think some more about. Likely I will use built into elliptics network update notification mechanism.

Transaction queue will be processed by the pool of threads, or maybe even by the single thread per network connection, separately from userspace-visible IO.

That’s the original plan I ended up with to date, but it may happen that this will be just a start of the things breaking. While I’m suffering from climbing aching and appartment development I may think out something additional.
Stay tuned!