Distributed computation in elliptics

Tagged:  

In last days (or months) I thought a lot about what people really want from distributed system.

Usually it is enough just to store data being able to horizontally scale when needed. But when more and more people start using the system, they start getting out of the supported feature range.

In particular - people do not want to just store data, they want to implement some processing on top of it.
We added server-side scripting as a simple reply for that demand - server-side scripting is a way to perform arbitrary code on the data you provide to the system. In particular it can be thought of as a trigger for data write - we get data, process it and store.

Server-side scripts can be quite complex objects - for example in Pohmelfs we update directory content and hardlink counts using server-side scripts. They may read data, change object structure, serialize received objects, store combined entities back to low-level elliptics backend and so forth.
Server-side scripts can be treated as a low-level processing units which work with single object or key.

Using those basic 'bricks' we can create rather complex systems - it is possible to create streams and kind of graph processing engines, but it require enormous amount of low-level code. Something like 'get-key, process-key, write-key-to-leafX, copy-to-leafY, wait-for-replyZ' and so on. Changing computational topology is real pain.

What we really want is a way to describe how our processing engine should look. Something high-level like 'compute-function-1(input-key) -> compute-function-2() and compute-function-3()' and so on. Then we would submit that topology into the cluster and start emitting keys.

This whole idea I got from Storm presentation. Storm is a distributed real-time processing system recently bought by Twitter and used there for real-time analytics.
While I certainly do not like some of its concepts, in particular single master node (say hi to hadoop), closure/java implementation (it is really pain in the ass to support multiple stacks of technologies in a project - and we already have a bunch of them, so likely it is not ready to bring in java).

But Storm shows a really exceptional parallelism in computation and very convenient high-level abstraction. Network topology created on top of low-level objects like stream, spouts and bolts - this is exactly how we want to create processing engine, and not by thinking on how to receive message and how to send a reply.

My search/AI ideas definitely need something similar to be created for doing high-level computation modeling.
So I'm starting to think about how this can be created not bound to elliptics as its data storage, but generic enough.
This may sound like bicycle reinventing - and this definitely is in some way - I wish Storm could be created as embeddable library, so I could put it into our datastore easily... But it doesn't, so we have to create something.

I would also add a true graph processing into this engine - the way Google's Pregel work with its supersteps - this is what is missing in Storm. Storm is good for realtime 'stateless' processing - when your tuple is processed in the whole topology, it is forgotten. What I want is to create a way to synchronize parallel topologies, so that when they are completed, next topology gets started with the data previously generated by its predecessors.

Stay tuned. I promised to keep this blog running, and failed to find time to update it regulary.
But I will do more than my best to improve this situation :)

pthreads vs fork

Tagged:  

Elliptics supports atomic server-side scripts execution, in particular there is python implementation for complex scripting tasks like directory structure support in pohmelfs.

Python by its nature is a single-threaded application, so we want a pool or processes each of which will host python interpreter. Contrary elliptics is a heavily multi-threaded application. As practice showed - mixing those two beasts is a tricky and dangerous business.

fork() copies all allocated virtual memory space from thread which invoked syscall. In particular it copies all internal shared library states, which can be modified in parallel threads. To deal with this problem POSIX introduced uglymoron called pthread_atfork() which can only help (if this can be ever called a help) with your own state (like locks).

And while I could properly deal with elliptics state, I can not even know what shared libraries are doing during fork() syscall. One of my threads printed timed log message, which dig into libc's tz conversion, which happens under its private lock.
When copied in fork() into newly created process, this lock is still locked there, so subsequent python initialization, which wants timezone too, will deadlock.
If you start a pool of hundred python workers, there are about 30 of them 'deadlocked' in timezone conversion.

To fix this issue we really want to 'clean' address space of the forked children, but this will break 'connection' with parent process if there were pipes or sockets created. So I execve() a special binary which in turn initializes python interpreter and connects to parent via named pipes. This was originally implemented in libsrw library, but then I moved it into elliptics itself to minimize dependencies.

Those bugfix took me almost a whole week to find out, why pohmelfs sometimes got stuck during synchronization of the huge number of files. And now things go smooth. I should start pushing pohmelfs into kernel again.

Eblob got new automatic defragmentation

Tagged:  

Eblob is a low-level storage for elliptics distributed network.
It is append-only (with rewrite support though) system with tricky disk and memory indexes, which allow to have very high IO performance.
But everything comes with its own price - when key is deleted, it is only marked as removed in eblob and later defragmentation tool has to restructure eblob to make this space free.

Quite for a while this functionality was disabled - there was only offline tool which required node restart and manual admin work. Admins do not like to work and I can not blame them - it is rather annoying task to defragment 2 hundreds of nodes and then run recover (since while node was offline its replicas could get updates which are better synchronized to given node too).

Now eblob has automatic defragmentation again. It is rather costly task, so it is recommended to 'run' (timeout is specified in config) it rarely like once per several hours. Moreover, only single blob will be processed in one timeout slot, since eblob reserves free space only for one blob - thus defragmented 'copy' can live in the same place as data blobs.

So far only our tests show that things work as expected, so we start testing this in our large clusters.

Linux kernel workqueue latencies

Tagged:  

Kernel usually uses workqueues for process context execution. Pohmelfs use to for network processing engine, which is executed on top of socket stack.

Since it is kernelspace entity it is expected to be at least somehow fast and has low latency.
But in current linux kernel (3.3.0-rc3) latency is usually about 100 milliseconds

Its just as little as whoooping 0.1 second to schedule kernelspace work

Yes, there is a bunch of debug options turned on in my kernel, it is a viartual machine (vmware), but still it is a one tenth of a second!

pohmelfs: network raid1 example

Tagged:  

Pohmelfs configuration is actually trivial:

# mount -t pohmelfs -o "server=172.16.136.1:1025:2,fsid=xxx,groups=3:2:1,noatime,noreadcsum,successful_write_count=1,sync_timeout=600,readdir_allocation=5" none /mnt/

where 'server' mount option specifies IP address in form address:port:family (2 - ipv4, 6 - ipv6). It is ok to specify only subset of all cluster IP address - pohmelfs will download route table itself, it only needs at least one alive node at connection time to discover other nodes

'groups' mount option specifies groups you want to write data into. Group is kind of replica ID.
That's all for pohmelfs.

Let's configure elliptics and create 2 groups (with id 2 and 3 for example), which will store essentially identical replicas (they may differ, since writes can be unordered, or one group may be down for some time)
There are 2 configuration files - elliptics server (let's call it ioserv.conf) and server-side script environment (we use python, so it is python.init).

Here is ioserv.conf
I will highlight parameters, which differ in separate groups

# log file
# set to 'syslog' without inverted commas if you want elliptics to log through syslog
log = syslog

# log mask
#log_mask = 10
log_mask = 15

# specifies whether to join storage network
join = 1

# config flags
# bits start from 0, 0 is unused (its actuall above join flag)
# bit 1 - do not request remote route table
# bit 2 - mix states before read operations according to state's weights
# bit 3 - do not checksum data on upload and check it during data read
# bit 4 - do not update metadata at all
# bit 5 - randomize states for read requests
flags = 4

# node will join nodes in this group
group = 2

# list of remote nodes to connect
# address:port:family where family is either 2 (AF_INET) or 6 (AF_INET6)
# address can be host name or IP
remote = 172.16.136.1:1025:2 172.16.136.2:1025:2

# local address to bind to
# port 0 means random port
#addr = localhost:1025:2
addr = 172.16.136.1:1025:2

# wait timeout specifies number of seconds to wait for command completion
wait_timeout = 60

# this timeout specifies number of seconds to wait before killing
# unacked transaction
check_timeout = 60

# number of IO threads in processing pool
io_thread_num = 64

# number of IO threads in processing pool dedicated to nonblocking operations
# they are invoked from recursive commands like DNET_CMD_EXEC, when script
# tries to read/write some data using the same id/key as in original exec command
nonblocking_io_thread_num = 32

# number of thread in network processing pool
net_thread_num = 64

# specifies history environment directory
# it will host file with generated IDs
# and server-side execution scripts
history = /opt/elliptics/history.2

# specifies whether to go into background
daemon = 1

# authentification cookie
# if this string (32 bytes long max) does not match to server nodes,
# new node can not join and serve IO
auth_cookie = qwerty

# Background jobs (replica checks and recovery) IO priorities
# ionice for background operations (disk scheduler should support it)
# class - number from 0 to 3
# 0 - default class
# 1 - realtime class
# 2 - best-effort class
# 3 - idle class
bg_ionice_class = 3
# prio - number from 0 to 7, sets priority inside class
bg_ionice_prio = 0

# IP priorities
# man 7 socket for IP_PRIORITY
# server_net_prio is set for all joined (server) connections
# client_net_prio is set for other connection
# is only turned on when non zero
server_net_prio = 1
client_net_prio = 6


# anything below this line will be processed
# by backend's parser and will not be able to
# change global configuration

# backend can be 'filesystem' or 'blob'

backend = blob

# zero here means 'sync on every write'
# positive number means data amd metadata updates
# are synced every @sync seconds
sync = 300

# eblob objects prefix. System will append .NNN and .NNN.index to new blobs
data = /opt/elliptics/eblob.2/data

# Maximum blob size. New file will be opened after current one
# grows beyond @blob_size limit
# Supports K, M and G modifiers
blob_size = 500G

# Maximum number of records in blob.
# When number of records reaches this level,
# blob is closed and sorted index is generated.
# Its meaning is similar to above @blob_size,
# except that it operates on records and not bytes.
records_in_blob = 10000000

Our second replica will live in group 3, so you should change above 'group' parameter to 3 as well as node's address and optionally 'remote' parameter, which is a list of nodes to connect. It can include local address itself.

Second configuration file is python.init
It must live in directory specified in 'history' parameter above
You should put all srw/pohmelfs* scripts in 'history' path too.

import sys
sys.path.append('/tmp/dnet/lib')
sys.path.append('/opt/elliptics/history.2')

from libelliptics_python import *

# groups used in metadata write
pohmelfs_groups = [1, 2, 3]
pohmelfs_log_file = '/opt/elliptics/history.2/python.log'

log = elliptics_log_file(pohmelfs_log_file, 10)
n = elliptics_node_python(log)
# we should only add own local group, since we do not want all updates to be repeated for all groups
# this should be changed to 3 for group number 3
n.add_groups([2])
# this is an IP address for local node, i.e. server, which belongs to group 2
# you may specify multiple addresses with multiple calls
n.add_remote('172.16.136.1', 1025)

__return_data = 'unused'

import gc

import struct
# python sstable implementation
from sstable2 import sstable

import logging
FORMAT = "%(asctime)-15s %(process)d %(script)s %(dentry_name)s %(message)s"
logging.basicConfig(filename=pohmelfs_log_file, level=logging.DEBUG, format=FORMAT)

pohmelfs_offset = 0
pohmelfs_size = 0
# do not check csum
#pohmelfs_ioflags_read = 256
pohmelfs_ioflags_read = 0
pohmelfs_ioflags_write = 0
# do not lock operation, since we are 'inside' DNET_CMD_EXEC command already
pohmelfs_aflags = 16
pohmelfs_column = 0
pohmelfs_link_number_column = 2
pohmelfs_inode_info_column = 3
pohmelfs_group_id = 0

def pohmelfs_write(parent_id, content):
	n.write_data(parent_id, content, pohmelfs_offset, pohmelfs_aflags, pohmelfs_ioflags_write)
	n.write_metadata(parent_id, '', pohmelfs_groups, pohmelfs_aflags)

Putting together this initialization script (you may edit one in source tree) with pohmelfs scripts ends up adding support for server-side scripts executed with above context.
There is a pool of processes which pick up execution contexts to run your requests.

That's it, feel free to ask if you hit any problem!

pohmelfs: call for inclusion

Tagged:  

I'm please to announce new and completely rewritten distributed filesystem - POHMELFS

It went a long way from parallel NFS design which lived in drivers/staging/pohmelfs for years effectively without usage case - that design was dead.

https://lkml.org/lkml/2012/2/8/293

Let's see, how this will end up

5 billions

Tagged:  

It is amount of objects in one of our clusters - it hosts avatars and other rather small objects (hundreds of bytes to several kilobytes)
This amount of objects is hosted just on 2 nodes, each has 24x2Tb disks, 48Gb of ram, only 6Tb are actually used

We should multiply it to 3, since that's amount of objects in whole cluster spread over 3 datacenters in Moscow region
Load is quite small though, about 500 rps of reads and about the same for simultaneous writes

pohmelfs got http compatibility support

Tagged:  

Now inode's ID is generated as a hash from its full path.

And it is possible to read data from such elliptics storage not only via pohmelfs, but using HTTP and library APIs
In particular it is very convenient to upload huge number of files into storage via pohmelfs, and then provide them to clients via common interfaces.

Http compatibility disables rename, since key changes and given object should possibly be moved to different node.

I also added sync-on-close mount option, since pohmelfs uses local page cache quite for a long time and object might not be written to storage until writeback fires. Sync forces data and metadata to be written to storage, making object accessible to other clients.

I believe that's all for pohmelfs. I will post patch to lkml tomorrow, likely I will post 2 of them - to remove drivers/staging/pohmelfs and just put it into fs/pohmelfs, since it goes production quite soon here.

pohmlefs got hardlink and socket/pipe support

Tagged:  

While adding pipe/socket is pretty much straightforward - it is just another inode types, which have full VFS implemnetation in kernel, hardlinks are much harder.

Basically saying, hardlink is a link to inode, so that updates made through one path ended up with new data read from all other hardlinked pathes. Removing one hardlink may not end up removing whole fie, instead there should be some kind of reference counter - we only remove file's data when there are no more hardlinks to given file, and of course when we remove file by original path, we can not remove its data, if it can be accessed via its hardlinks.

POHMELFS uses elliptics as its data storage backend, which supports atomic transactions per replica and per-column data IO. So I put reference counter update into elliptics server-side script, which is called for every unlink and hardlink creation command.

Scripts are executed atomically against other operations for given key, but only on single replica. It is possible that different operations with the same key are executed on different replicas, but for unlink it does not matter - if we drop refcnt by removing 'hardlink-1' and then original path or vice versa data will only be removed when last reference drops to zero (actually to -1, since for performance reasons we do not create reference counter for plain object, only for its hardlinks).

Last weeks we extensively tested POHMELFS in a lab, but its time to move it to small testing environment, which mimics behaviour of one of our large systems. We will setup all software we use daily there and start looking for real bugs.
Getting our previous testing 'mode' (we synced about hundred of terabytes of files of different sizes) I do not expect any problems to arise.

Still, I want http compatibility mode, when we save files with IDs generated from full path length. This implies that move is not supported, but it is a special case where we can afford this. Basically we want it to write data via POHMELFS for performance and then read it via HTTP.

Stay tuned, I will send kernel inclusion request next week. First patch will replace drivers/staging/pohmelfs with this new code. Hopefully it will move into fs/ in some next release.

Pohmelfs got quorum read support

Tagged:  

Due to eventual consistency nature of elliptics there may exist a window in which replicas are not in sync.

Depending on various factors this window may be rather large, for example in our production clusters we run this kind of check weekly, so having replica outdated for that long may require additional steps to get synchronized data.
We store metadata for every written object, which among others contain update timestamp. So it is quite simple task to make a parallel request for metadata from multiple replicas in different datacenters and select the one with the latest timestamp.

This exists in elliptics API quite for a while, and now I added it to pohmelfs.
At open() time we check all replicas and build a list of groups, where it is present, sorted by timestamp, so any subsequent read will get the must up-to-date data.

As a bonus I added keepalive mount options to detect broken links early. In our tests where datacenters may shutdown on weekly basis waiting for couple of hours is a bit of overhead to detect it.

There are 2 pohmelfs tasks to date - http compatibility mode, where object ID is generated according to its full path, and umount bug, which is likely because of my dirty dcache games.

I've just announced pohmelfs in linux-kernel

Tagged:  

Here is the link: https://lkml.org/lkml/2011/12/23/161

As mentioned there, there are features yet to complete:

  • quorum read. pohmelfs supports quorum write only so far.
  • http compatibility mode - we do want to upload data via pohmelfs and read it through http applications. And vice versa actually too.
  • column read-write or more generally file-as-directory feature.
  • even more testing - abusing dcache is fun, but likely there are hiddens stones
  • replace drivers/staging/pohmelfs with this code.

It is quite a lot of work, but not that much of work. Maybe a week or so to actually implement all that stuff and then to start real testing. Right now we strike into system limitations quite fast - like doing multi-terabyte rsync between local raid and pohmelfs mounted storage and hitting NIC limitations all the time - it is 1 Gbps after all (and I want to play with 10 GigE but still without luck). Pohmelfs uses local VFS page cache so it can easily saturate bandwidth during writeback.
Since we also have to write multiple copies in parallel (we write 3, but only 2 groups are turned on to get quorum), write is limited by network 40-50 MB/s.

Also elliptics sometimes scares me - we sync data into eblob once per 300 seconds, and it takes about 3-10 seconds to actually write it to disks on every system. At that time machine almost freezes - 100% disks utilization. And rsync sometimes looses its mind - it stops and timeouts, although sync is already finished.

Another problem is related to eblob. We store 10+ gigabyte log files in test case, and since eblob is append-only, it can fill its quota for given key and subsequent write will have to copy previous record into new chunk. Copying 10 Gb takes some time too, and it also takes space - old records are just marked as deleted until offline defragmentation cleans it up.

And my favourite - dcache abuse. Sometimes I get kernel BUG at /home/apw/COD/linux/fs/dcache.c:873 which means at umount time there are dentries with positive reference counters. I do not yet know whether it is pohmelfs or not, but more likely that it is the reason :)
I will fix that too of course.

But overall pohmelfs is close to finish.

Next task is to implement indexing and embedded search in elliptics.
It is quite good idea to have out-of-the box full-text search engine inside storage system.

Stay tuned, new things are close!

POHMELFS and elliptics server-side

POHMELFS in a meantime got 'noreadcsum' mount option support as well as some other tricky bits.

This bumps read performance several times, since eblob (elliptics backend) stores data in contiguous chunks increasing read/write performance, but optionally forcing to copy data, when prepared space is not enough.
Reading with csum enabled forces to checksum the whole written area, which in turn requires to populate it from disk to page cache and so forth. For gigabyte file this takes about 5-6 seconds (first time).

And that's only to read one page. I.e. to read every page (or readahead requested chunk).
Not sure that disabling read checksumming is a good idea, so I made this a mount option. Maybe eventually we end up with some better solution.

I also fixed nasty remount bug in POHMELFS which uncovered a really unexpected (for me at least) behaviour of Linux VFS.
Every inode may have ->drop_inode() callback, which is called each time its reference counter reaches 0 and inode is about to be freed.
But sometimes when inode was recently accessed, it is not evicted, but placed into lru list with special bits set.
Inode cache shrink code (invoked from umount path in particular, but likely may be called from memory shrink path too) grabs all inodes in that lru list and later calls plain iput() on them, which in turn invokes ->drop_callback() for inode in question.

Thus it is possible to get multiple invocation of callback in question without reinitializing inode between them. This crashed pohmelfs in some cases. Now it is fixed with appropriate comment in the code, but I'm wonder how many other such tricks are yet to discover?

POHMELFS is a great tester for elliptics server-side scripting support. I was lazy and put all somewhat complex processing into server-side scripts written in Python. Anton Kortunov implemented simple sstable-like structure in Python for directory entries used by pohmelfs.

Since every command is processed atomically (on single replica) in elliptics, we can put complex directory update mechanism in this 'kind-of-transactions'. In particular server-side scripting is used to insert and remove inodes from directory. Lookup is also implemented using server-side scripting - we read directory inode in python code, search for requested name, and return inode information if something is found, which is sent back to pohmelfs.

Overall this takes about 2-13 msecs. I.e. receive command from pohmelfs, 'forward' it to pool of python executers (srw project), where python code will read directory inode data from elliptics (using elliptics_node.read_data_wait()), search for inode with given name there and send it back to pohmelfs.
Insert takes about 30-150 msecs - script reads directory content, adds new entry (or update old) and then writes it back into the storage.

That's how it looks in python - srw/pohmelfs_lookup.py

Given that we spend 10 msecs in such not really trivial piece of code, I believe that my implementation is actually not that bad.
Those are recent news. Stay tuned for more!

POHMELFS

Tagged:  

In a meantime I rewrote pohmelfs from scratch and it enters heavy testing stage.
As promised, it became just a POSIX frontend to elliptics network with weak synchronization. By using elliptics as its backend, it gets multiple copies support, atomic transactions (in single replica), multiple datacenter support with IO balance, checksums, namespaces and so on.

And by 'weak synchronization' here I mean, that all writes are not visible to other users, who mounted external storage, until writer performs sync or writer's host system decides to writeback dirty pages to the storage.
This actually mirrors behaviour of VFS in all modern OSes - we write data into page cache, and if system catches power failure, its data is lost. Even more - users are not synchronized in any way, and if one of them removes file, another one will only detect that after reading directory again (or trying to open/access given filename).

There is a very interesting approach I use in directory listing. We store directory information as a record indexed by directory key id. It is atomically (as in single replica, multiple replicas are updated independently) updated for every written/removed object and hosts whole inode indexed by dentry names.
But directory listing just reads that whole directory structure and parses it adding inodes/dentries not at lookup time (this is supported too of course), but at readdir time. Since records are stored as single continuous areas in elliptics, we only have to download and sequentially iterate over this blob data to get listing completed without multiple server lookups per name.

But we have to cleanup parent direntry list every time we are about to perform directory listing, since other users may delete some files or rename them. For example rsync creates '.blah.random-crap' files first and then renames them to 'blah' when copy is completed, which resulted in 2 files having the same inode and id previously.

There is no hardlink support yet, balanced/random reads from multiple replicas and quorum read, when we try to reach multiple replicas and select one with the latest consistent data. Writes also should support quorum option (at least mark pages back as dirty if write did not reach quorum or requested number of replicas).

I also plan to add column read/write (this is definitely not a POSIX interface, but kind of file-is-a-directory feature).
Also we want HTTP API compatibility with elliptics, i.e. we write data via pohmelfs and read it via HTTP (or any other) client, which uses default id-is-a-name-hash approach.

This is all is planned for future releases though, I plan to submit new stable version in December and better in a week or two.
Stay tuned!

Recent elliptics changes

Tagged:  

It was a while I wrote here last time, but there are a fair number things happened.

First, elliptics.
We fixed number of bugs in server-side scripting implementation, so it is very stable now and easily allows to create complex scripts which not only perform basic tasks, but also rather complex transactions, like read/process/update multiple records.

We added transaction locks, which allow to perform operations on given key atomically on single replica. In this case your server-side script, which reads data, updates its structure and uploads it back, will be performed atomically on single replica.

There are many other distributed storage systems, which allow atomic key operations, but they do not support datacenter replica split. Usually it is implemented (like in Oracle NoSQL or MongoDB) with single master node, which copies data to multiple replicas. This operation may even be synchronous and safe.
And it is possible to put replicas into different datacenters. But what happens, when you want to add more datacenters, but do not want to increase number of copies? In this scheme one has to rebalance replica sets.

In elliptics you just say which datacenters you want to use for given IO transaction. This forces non-atomic replica updates, which happen in parallel. This allows to have faster writes and more flexible setup, but we can not guarantee that replicas are updated synchronously.

And actually with mongodb scheme it is possible, that replicas will not be in sync, since there may be failed node which ost some data, so reads from this node will not return consistent results.

Real fix for this problem lies outside of the low-level storage subsystem. It is like forcing block device to lock against multiple processes writing into the same file. Instead this protection lives on higher layers and generally requires external locks, which are both known and used by concurrent users.

In distributed world this algorithm is based on Paxos with optimizations (like fast Paxos or Zookeeper Atomic Broadcast and friends). We do not yet have such subsystem.

Another change is embedded checksums. Previously we stored them in blobs after each data records, but never used. Instead there was a separate command to calculate and store checksum in metadata (separate column). This may resulted in stall checksum stored in metadata, so parallel read could fail with inconsistent data error. Right now metadata is just a storage for replica information and update/check status dates and other meta information (namespace and so on).
Each write updates checksum in blob atomically (again, this is not synchronized between replicas), so any read will always get correct check if data was not corrupted.

We also added bulk IO operations, namely read and write. Write issues many parallel writes and waits for all of them to complete, thus essentially being equal to the performance of the slowest nodes, while doing sequential write ends up being as long as sum of all writes in question.
Read does similar thing, but splits and optimizes read using offset sorting (i.e. that we read data from disk in sequential order).

Elliptics: server-side scripting

I added binary data support into server-side scripting (currently we only support Python) as well as extended HTTP fastcgi proxy to support server-side script execution.

Previously we only were able to use C/C++/Python to force some server to execute script on our data - API is rather simple and can be found in examples in binding dirs.
Now we can do that through HTTP.

Here is an example of how to setup python server-side scripting and run scripts on posted through HTTP data.
First of all, you have to put python.init script into 'history' directory (specified in server config).
It may look like this:

import sys
sys.path.append('/usr/lib')
from libelliptics_python import *

log = elliptics_log_file('/tmp/data/history.2/python.log', 40)
n = elliptics_node_python(log)
n.add_groups([1,2,3])
n.add_remote('elisto19f.dev', 1030)
__return_data = 'unused'

This creates global context, which is copied for every input request.
Second, you may add some scripts into the same dir, let's add test script named test_script.py:

__return_data = "its nothing I can do for you without cookies: "

if len(__cookie_string) > 0:
       # here we can add arbitrary complex cookie check, which can go into elliptics to get some data for example
       # aforementioned global context ('n' in above python.init script) is available here
       # so you can do something like data = n.read_data(...)
        __return_data = "cookie: " + __cookie_string + ": "

__return_data = __return_data + __input_binary_data_tuple[0].decode('utf-8')

HTTP proxy places cookies specified in its config into variable __cookie_string, and it is accessible from your script. Anything you put into __return_data is returned to the caller (and eventually to the client who started request).

__input_binary_data_tuple[0] is a global tuple which contains binary data from your request.
When using high-level API you may find, that we send 3 parameters to the server:

  • script name to invoke (can be NULL or empty)
  • script content - it is code which is invoked on the server before named script (if present), its context is available for called script
  • binary data

That binary data is placed into global tuple without modification. In Python 2.6 it holds byte array, I do not yet know what it will be in older versions though (basically, we limited support to python 2.6+ for now).

HTTP proxy puts your POST request data into aforementioned binary data. It also puts simple string like __cookie_string = 'here is your cookie' into 'script content' part of the request.
One can put there whatever data you like if using C/C++/Python API of course.

Here is example HTTP request and reply:

$ wget -q -O- -S --post-data="this is some data in POST request" "http://elisto19f.dev:9000/elliptics?exec&name=test_script.py&id=0123456789"
  HTTP/1.0 200 OK
  Content-Length: 79
  Connection: keep-alive
  Date: Wed, 02 Nov 2011 23:24:10 GMT
  Server: lighttpd/1.4.26
its nothing I can do for you without cookies: this is some data in POST request

As you see, we do not have cookies, so our script just concatenated some string with binary data and returned that data to the client. Name used in POST parameters is actually a name of the server-side script to invoke. ID is used to select server to run this code, otherwise it will run on every server in every group.

It is now possible to run our performance testing tools against server-side scripting implementation (it is not straightforward - there is separate SRW library which implements pool of processes each of which has own initialized global context, which is copied for every incoming request and so on

We believe that numbers will be good out of the box, and of course I expect that performance will suffer compared to plain data read or write, but it should not be that slow - we still expect that every request completes within milliseconds even when it uses python to work with data.
Our microbenchmark (every command executed on server is written into log with time it took to complete) shows that timings are essentially the same - hundreds of microseconds to complete simple scripts.
So I believe with IO intensive tasks we will not be limited by python server scripts and/or various reschedulings.

I'm quite excited with idea of adding full-text (rather simple though) search for all uploaded content into elliptics.
And I'm working hard on this - expect some results really soon!

P.S. We have almost finished directory support for POHMELFS via server-side scripts, expect it quite soon also!

Elliptics network: authentication bits, IO/net priorities and write performance

In a meantime elliptics got basic authentication support. It is not aimed at protecting against rogue invaders, but instead to protect against configuration error, where different clusters happen to connect to each other breaking route table and data forwarding.

Auth cookie can be set in elliptics config via 'auth_cookie' global parameter. By default it is empty string.
Nodes with different cookies can not join each other and will not reconnect if connection failed.

Cookie is transferred over unencrypted channel and is not supposed to add security bits, but instead to prevent misconfiguration. Storage channels are not usually cryptographically protected (at least not at this level - client should encrypt content if needed, or it can be done via server-side scripting), and all 'real' authentication happens at higher levels.

We also added IO priorities (for those IO schedulers which support it) - server-server IO operations (like data recovery) may get lower priority. This applies to network coloring - it is possible to assign different network priorities (as in socket(7), which may turn on TOS bits for example) to server-server and client-server traffic.

And small note on write performance.
Elliptics uses pool of IO threads to do actual low-level work. It is possible to have forward progress when some of threads are blocked in long writes for example, although usually system is kind of unresponsible at those moments. Having backlocg thread to perform your work means that there will be number of reschedulings to complete single work request.
Usually this is not a problem, since requests come in queue and IO threads pick them up as soon as processor/kernel permits.

But for single write request this may introduce unneeded latency, disappearing in batch load. Latency came up to 40 ms per single request, which is not acceptable for some cases. So we lightly changed send logic (added CORK usage) to eliminate unneeded rescheduling on header/data split reads and also dropped acknowledge sending in writes - if we wrote data successfully there is a proper reply already (information about where and how record was stored), otherwise negative error code is sent in ack message.

Those changes dropped single write request latency to sub-millisecond range, which should be fine for now - this is a median time (not including disk _sync_, but counting blob write) we perform write request when we have a queue of jobs and not just a single request.

And now I plan to implement a secondary index in elliptics via server-side scripting. For the starters, I will put all wikipedia articles and add prefix search (secondary index over written names), like returning list of urls starting with en.wikipedia.org/wiki/abc*.

Also will post Elliptics vs MongoDB benchmark on huge number of small records on single node.
Stay tuned!

Elliptics network: server-side scripting for data processing engine

Tagged:  

Storage systems more and more commonly demand not just ability to save data to disk and provide fast access.
It is almost impossible to find a plain storage stack, which will not perform some kind of data processing.
This may include applications starting from simple client-side operations (like wrapping data in HTML in web case) and finishing with complex real-time processing or batch jobs wrapped in mapreduce-like systems.

Writing most of server-side work in low-level languages is not convenient and frequently is error-prone. Also performance is not commonly limited by CPU power, but disk IO speeds.

So, I added server-side scripting support to elliptics. Separate lib (called libsrw) creates a pool of processes (since many scripting languages are single-threaded like Python or popular javascript v8) which talk to external world via pipes. Each process initializes global context where it can run user provided script.
For example in elliptics we store a client node, which is connected to the storage and has access to all local data.
Every execution command is started in context, which is copied from global environment, and although global context can be affected by command (for example connection can be dropped by remote node), all variables created for scripts to be executed are destroyed when command/script completes.

To date I implemented python context and we plan to complete v8 javascript soon.
It is possible to create, for example, a simple checker on every node, where systems checks clients credential when processing data via direct URL. Direct URLs are used when we do not want to proxy data through dedicated servers (like when streaming huge files), but instead allow client to connect and send requests to particular server which hosts needed data objects.

Another example is batch data processing, like background data conversion from one format into another without need to copy every object to processing node. We also plan to use this mode for batch recovery - instead of checking replicas for every stored key, we can collect a bunch of objects which are known to me missed on some other node and later upload this blob to remote host in one go.

Classical mapreduce can also be implemented on servers. Actually our contexts are exact mappers, which have access to data and can iterate over all records selecting those to 'reduce'. In our scheme reduce operation will run on the node, which sent request (or client). This may not be the optimal solution though (for example when reduce data set is rather huge), so we could extend it in the future.
Plan is to allow execution contexts to store its state locally when needed, so that subsequent commands have access to already processed data (for example consider the case, when we calculate number of links to given URL - we generally do not want to check all files every time we start processing, instead would like to only parse files uploaded since previous start).

Another application for server-side scripting is directory structure implementation for POHMELFS. This may be implemented in low-level code like C/C++ though and linked to server binary. Storing and processing complex structure on the server allows client (which is in kernel) to be really simple and do not mess with complex locks.

What I want to play right now is some kind of full-text search embedded into elliptics. The most naive implementation could just parse all written documents and build reverse indexes (using appropriate language morphology if needed) and store them in dedicated columns in elliptics as well. Writing this kind of scripts in Python or JavaScript is rather simple task, which greatly extends functionality of the storage kind of 'for free' - bunch of server-grade processors are usually unused in storage systems.

Here is context initialization script for elliptics for example:

$ cat /tmp/test/history.2/python.init 
import sys
sys.path.append('/tmp/dnet/lib')
from libelliptics_python import *

log = elliptics_log_file('/dev/stderr', 40)
n = elliptics_node_python(log)
n.add_groups([1,2,3])
n.add_remote('localhost', 1025)
__return_data = 'unused'

We create global elliptics node object 'n' (not a good name for such things of course, but that's just an example), which connects to storage server listening on a localhost:1025. __return_data is a special variable used to return data from execution context back to client. It is possible to write data into a file on the local filesystem or upload test results back into elliptics though.

That's how client requests may look (there are also API extensions for C++ and Python):

$ ./example/dnet_ioclient -r server:1025:2 -C1 -c "local_addr = '`hostname`'" -n script.py
$ ./example/dnet_ioclient -r localhost:1025:2 -C2 -c "__return_data = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)"

The latter example is just a plain python script executed on the server - it only reads data from elliptics.
First example executes a script named script.py, which should exist on the server (config specifies scripting dir).
local_addr = '`hostname`' is part of the final script which initializes some variable (it can be arbitrary complex if needed).

Script (on the server) may look something like this:

$ cat /tmp/test/history.2/script.py
from time import time, ctime

ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>>  ' + ctime(time()) + ' --- ' + ret

Prior being executed server-side code concatenates client's part of the script (string with local_addr in above example) with script itself, so server executes following code:

local_addr = 'zbr.localnet'
from time import time, ctime

ret = n.read_data('/tmp/current.date', 0, 0, 0, 0, 0)
__return_data = local_addr + '>>>  ' + ctime(time()) + ' --- ' + ret

Everything stored in __return_data is returned back to client.
Next execution command initializes context back into the state which existed just after initialization script execution (although global elliptics node in our example can be internally modified).

Those were our first set of changes to data processing engine implementation in elliptics network.
Stay tuned for more results!

Elliptics vs HBase on hundreds of millions of small records

Tagged:  

We recently run a test on 215 millions of production records, which we wanted to store on single server.
Objects are rather small about 200 bytes, server hardware is quite common in our environment: 24 Gb of RAM, handful of mostly unused CPU cores, 4 SATA disks of 1-2 Tb in RAD10
Unfortunately due to configuration issues there were only 2 working disks in elliptics server (raid near layout sucks), while HBase had fair load spread over all 4 disks

Usage case: random reads, random range reads.

Out of the box elliptics _yet_ does not provide good enough on-disk index, so it is usually a binary search, which is costly. So we warmed index cache files by reading them into /dev/null

Upload speed was roughly the same in HBase and Ellipitcs - 2.5-3 Krps. But using HBase batch upload it was possible to write data upto 30.000 objects per second, object were packed into 5000 blocks. Elliptics does not yet support such batch upload mode.

So, reading. First of all, we started plain run of 200 random IO rps test. Its duration was about 10 minutes.
Elliptics showed about 30ms median reply times. HBase was closer to 100-150 ms.


Elliptics data (timing scale is wrong, though, median time is 30 ms)


HBase data

Second test was set to find maxium RPS rate of random IOs starting from cold caches and data.
I will not post graphs though, only numbers. Graphs on demand.
Elliptics showed 115 RPS within 100 ms each.
HBase reached almost 300 RPS, but with 100-200 ms median.

And that was with only 2 working disks in elliptics setup (blame on me), while HBase used 4 disks in RAID. Here are proof pictures (first one is elliptics dstat, second - HBase)

HBase also supports index compression, which ended up with 1500 RPS of random IO within 100 ms. Pretty good numbers for single server with 215 millions of records. Our objects are easily compressable, so HBase's data base only occupied about 15 GBs of space.

Elliptics does not compress index, only data, so compression would not help us. Instead we plan to replace current index (binary search on disk, roughly the same happens in every filesystem with b-tree, when you are looking for a file by name). New scheme will cache each N'th key in ram with specifying ID range stored behind given key. Keys are rather small - 64 bytes by default (specified at compile-time), so we could pack a bunch of them into 4k page for example. Reading this page from disk will take single IO, and searching for the key in this page (read into RAM) will be very fast.

This only optimization will kick hbase's ass I think. Getting that HBase can not safely live in multi-datacenter environment, elliptics will fit all needs for huge-number-of-small-records data storage.
But there is also a plan to add bloom filter for faster detection whether given key is present in blob or not.
Since blob file is 'closed' for writes after reaching maximum size or number of records, and no more writes goes into it except when overwrite is turned on or deletions sets the bit, but both do not change keys, it is possible to create rather static and very optimized index. This comes after HBase index design acutally, when its blocks are immutable and so are index ranges.

We currently cache in RAM each key we read from disk, but practice shows that overwhelming number of old enough records (so theirs keys moved to disk from ram) are never (or extremely rarely) read again, so we will add cache timeout for keys to drop them back to disk.

Elliptics recovery process

Tagged:  


Click for full size

24 disks system, 350-450 MB/s
101% disks utilization

I suppose it buzzes very fun in the shelve

Fun set of slides for Yac 2011

It was shown on the 'wall' for eyes pleasure, we did not talk and did not make real presentation, instead there was a stand where several people from our team answered questions and supported talks about distributed processing, universe and so on

Enjoy! :)

P.S. slideshare.net provides only iframe embedding, so you may fail to watch it here, so click to watch on slideshare

Elliptics HOWTO

Tagged:  

Elliptics was put into Ubuntu PPA

deb http://ppa.launchpad.net/eightn/elliptics/ubuntu lucid main
deb-src http://ppa.launchpad.net/eightn/elliptics/ubuntu lucid main

and we started a community site www.elliptics.ru

There is a nice howto there, but only in russian so far.
There are also couple of articles about how elliptics works and what it is all about as well as video from my previous year presentation on YaC 2010.

Elliptics range requests benchmark

Tagged:  

73+ millions of records, 25-30 Gb total space on single node (actually there were 3 replicas, but we used only one)

Here is the graph

3000 rps within 10 milliseconds, where each range request returned 20-4k records.

Elliptics range request is a full analogue of SQL's "SELECT * from TABLE WHERE key > X and key < Y LIMIT (from, num)". In this test each record's key contained timestamp and range request asked for data in some time range.

Each key (elliptics uses 64 bytes for key) looked like this:
xxxyyy...whatever else...timestamp[16 bytes]

and range request was from
xxxyyy...whatever else...0000...0000[16 bytes]
to
xxxyyy...whatever else...ffff...ffff[16 bytes]

POHMELFS got full read/write support

Tagged:  

as well as directory listing and file creation.

All operations are not group-aware though, i.e. all writes are made only into single group and reads (directory listing and object lookup) do not balance between multiple groups.

There was a fair number of inode/dentry hacks and I suppose that populating dentry cache from outside of directory inode operations (like ->create() callback) is not a good idea, but I ended up adding new dentries when reading directory content in ->lookup() or ->readdir().

POHMELFS also does not yet handle errors - i.e. it is a luck we do not crash if server returns borrked structures for directory for example, network errors are not fixed also - client will not reconnect and will not even drop connection if some errors are found.

But it is a matter of time to clean things up. Stay tuned!

Initial POHMELFS commit

Tagged:  
$ git commit -a --stat -m "Initial POHMELFS commit"
[master 19ce0b3] Initial POHMELFS commit
 13 files changed, 3301 insertions(+), 0 deletions(-)
 create mode 100644 fs/pohmelfs/Kconfig
 create mode 100644 fs/pohmelfs/Makefile
 create mode 100644 fs/pohmelfs/Module.symvers
 create mode 100644 fs/pohmelfs/dir.c
 create mode 100644 fs/pohmelfs/file.c
 create mode 100644 fs/pohmelfs/inode.c
 create mode 100644 fs/pohmelfs/net.c
 create mode 100644 fs/pohmelfs/packet.h
 create mode 100644 fs/pohmelfs/pohmelfs.h
 create mode 100644 fs/pohmelfs/route.c
 create mode 100644 fs/pohmelfs/super.c
 create mode 100644 fs/pohmelfs/trans.c

POHMELFS is able to create objects (files only so far) and perform directory listing.
No directories or file IO support yet. No object removal.
Directory structure is trivial linear array which does not scale.
And that in freaking 3.3 thousands of lines of code. And 5 real nights of code.
So do not expect magic here, we will update it in time.

POHMELFS works with elliptics and it is possible to read (write and remove) binary data using elliptics tools and/or via HTTP proxy if you want to mess with raw IDs (64-bytes long numbers).

Stay tuned, fun things are about to begin!

splice() syscall

Tagged:  

I found that splice() syscall does not transfer data, when in and out file descriptors are the same or refer to the same file. Instead destination 'space' is filled with zeroes while supposed to contain input buffer content.

Here is my code for the reference (with some debug added):

static int eblob_splice_data_one(int *fds, int fd_in, loff_t *off_in,
		int fd_out, loff_t *off_out, size_t len)
{
	int err;
	size_t to_write = len;

	while (to_write > 0) {
		err = splice(fd_in, off_in, fds[1], NULL, to_write, 0);
		printf("splice  in: %zu bytes from fd: %d, off: %llu: %d\n",
				to_write, fd_in, *off_in, err);
		if (err == 0) {
			err = -ENOSPC;
			goto err_out_exit;
		}
		if (err < 0) {
			err = -errno;
			perror("splice1");
			goto err_out_exit;
		}
		to_write -= err;
	}

	to_write = len;
	while (to_write > 0) {
		err = splice(fds[0], NULL, fd_out, off_out, to_write, 0);
		printf("splice out: %zu bytes into fd: %d, off: %llu: %d\n",
				to_write, fd_out, *off_out, err);
		if (err == 0) {
			err = -ENOSPC;
			goto err_out_exit;
		}
		if (err < 0) {
			err = -errno;
			perror("splice2");
			goto err_out_exit;
		}
		to_write -= err;
	}

	err = 0;

err_out_exit:
	return err;
}

Unfortunately it does not even return error, but silently corrupts data.

I would be happy to be wrong of course.

Initial POHMELFS bits

Tagged:  

I'm a bit budy with other boring tasks around, so only get chance to hack on POHMELFS today. It is almost 1000 lines of code already, but yet its functionality is virtually near zero.
But I believe all interface for VFS are implemented and I can concentrate on actual interaction with remote elliptics storage.

$ wc -l fs/pohmelfs/{*.[ch],Makefile,Kconfig}
   41 fs/pohmelfs/dir.c
   27 fs/pohmelfs/file.c
  185 fs/pohmelfs/inode.c
   85 fs/pohmelfs/net.c
  104 fs/pohmelfs/pohmelfs.h
   96 fs/pohmelfs/pohmelfs.mod.c
  437 fs/pohmelfs/super.c
    7 fs/pohmelfs/Makefile
   11 fs/pohmelfs/Kconfig
  993 total

# mount -t pohmelfs -o "server=ipv4_addr1:1025:2;server=ipv6_addr2:1025:6" none /mnt/
# ls -laui /mnt/
total 4
1 drwxr-xr-x   2 root root    0 Aug 26 08:52 .
2 dr-xr-xr-x. 23 root root 4096 Aug 26 07:47 ..

File creation does not work as well as directory content reading. It will be rather trivial for start.
Maybe I will even implement server-side scripting instead and will use it for directory updates, so that I do not create leases (in the first release) needed for read-modify-write loop of directory update.

Or (what is more probable) I will just create read-modify-write loop for directory update without server locks, which is rather bad idea from concurrent point of view, but it is the simplest case which can be used as a base for future improvements.

Stay tuned, I plan to create kind of alpha working version very soon!

Pohmelfs in da haus

Linux kernel for the last 3 years didn't change at all - I wrote 600 lines of code, and it does not yet even connect normally to remote elliptics node.

$ wc -l fs/pohmelfs/{*.[ch],Makefile,Kconfig}
   37 fs/pohmelfs/inode.c
   88 fs/pohmelfs/net.c
   58 fs/pohmelfs/pohmelfs.h
   62 fs/pohmelfs/pohmelfs.mod.c
  353 fs/pohmelfs/super.c
    7 fs/pohmelfs/Makefile
   11 fs/pohmelfs/Kconfig
  616 total

Looking forward for the larger datasets

Tagged:  

To date largest (in terms of number of objects) elliptics cluster hosts as much as 400 millions of records on every node. Those are quite small records, so it does not occupy much space (just about 4 Tb per server node), but index becomes quite large (45+ Gb).

We dropped in-memory index in eblob quite for a while already, but having it on disk means that we have to check disk to find out needed key. Currently it is binary-searchable structure on disk, but this is very suboptimal. Well, for 45+ Gb indexes lookup has acceptable timings and likely will have it for 2-3 times larger datasets, but we want every node to host as much as 5-10 times more data, i.e. 40 Tb of data and 4 billions of records.

In this case binary search in 450 Gb index will take too long. We can add more servers to spread data, but in this case we will not optimally use quite limited physical space in datacenters. We have couple of ideas on faster indexes, which basically employ kind of sharding property of our keys, i.e. we can split our key (512 bits in default configuration) into chunks where each part will be an offset into box-like index structure.

In a meantime I added in-memory caching of the read keys - now every key read will be placed into hash table in memory for the fast next-time access. Eblob also got Python bindings and set of handy utils to scan blobs (like regexp match). It also supports statistics file (you will find it in root directory), which shows number of objects present on disk, removed on disk and pushed into memory. Removing this file will force eblob to regenerate it on the next start.

And now one really good news - new POHMELFS will be started next week. Estimated completion time is rather short (we want it to be if not production ready, but more usable at the end of the month), since it will be just elliptics frontend. To date main question is object indexing - the most naive and simple design will have quite slow file/dir renames, i.e. full copy and delete, which is not a good idea generally.
But I did not yet think in details about design problems, so this is an open question.

Stay tuned, there will be very interesting thins shortly!

Elliptics API extension

Tagged:  

I added a bunch of new URI parameters, so HTTP frontend is now almost as capable as clients created using C/C++/Python API. (Github has not yet been updated though)

In particular, added

  • column= - read/write data from particular column. Number of columns is not limited. Well, each new column is a separate large eblob, so it is limited by number of blob files in directory, by default blob size is 50 Gb, so it should not be a limitation
  • size=/offset= - to read/write data from particular offset and size
  • prepare/plain_write/commit - using 'prepare' one can reserve large enough space and then put there chunks of data using 'plain_write'. 'commit' will commit data into index, calculate checksum and so on
  • aflags/ioflags - one can enable/disable checksum, turn on/off compression, APPEND mode and so on
  • range/from/to - range requests. HTTP server returns array of data chunks, where each record is prepended with DNET_ID_SIZE key (64 bytes by default) and 8 bytes of size (little endian)
  • id - to read/write data using own ID (DNET_ID_SIZE max, 64 bytes in default compile - just for sha512 keys)

Here are examples.

1. Write data using own 64-byte keys (in hex: 1122334400... - 1122334600...):

wget -q  -S -O- --post-data="1234567890 id=11223344" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223344"
wget -q  -S -O- --post-data="1234567890 id=11223345" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223345"
wget -q  -S -O- --post-data="1234567890 id=11223346" "http://elisto19f.dev:9000/elliptics?name=1.xml&id=11223346"

2. Read data by own user-provided key:

wget -q  -S -O- "http://elisto19f.dev/elliptics-proxy?name=1.xml&id=11223345&direct"

3. Range request hex keys are from 1122334400... to 112233ff00...:

wget -q  -S -O- "http://elisto19f.dev/elliptics?range&from=11223344&to=112233ff"

hexdump of the stream (red is key, blue is size):

 wget -q  -S -O- "http://elisto19f.dev.yandex.net/elliptics?range&from=11223344&to=112233ff" | hexdump -C
  HTTP/1.0 200 OK
  Content-Length: 459
  Content-Type: application/octet
  Connection: keep-alive
  Date: Thu, 04 Aug 2011 19:27:57 GMT
  Server: lighttpd/1.4.26

00000000  11 22 33 44 00 00 00 00  00 00 00 00 00 00 00 00  |."3D............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000040  16 00 00 00 00 00 00 00  31 32 33 34 35 36 37 38  |........12345678|
00000050  39 30 20 69 64 3d 31 31  32 32 33 33 34 34 11 22  |90 id=11223344."|
00000060  33 45 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |3E..............|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 16 00  |................|
000000a0  00 00 00 00 00 00 31 32  33 34 35 36 37 38 39 30  |......1234567890|
000000b0  20 69 64 3d 31 31 32 32  33 33 34 35 11 22 33 46  | id=11223345."3F|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 16 00 00 00  |................|
00000100  00 00 00 00 31 32 33 34  35 36 37 38 39 30 20 69  |....1234567890 i|
00000110  64 3d 31 31 32 32 33 33  34 36 11 22 33 47 00 00  |d=11223346."3G..|
00000120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000150  00 00 00 00 00 00 00 00  00 00 16 00 00 00 00 00  |................|
00000160  00 00 31 32 33 34 35 36  37 38 39 30 20 69 64 3d  |..1234567890 id=|
00000170  31 31 32 32 33 33 34 37                           |11223347|
00000178

We will stress-test range requests next week, it is expected that performance should not be radically different from plain random IO performance (benchmarks can be found here)

We also tested Cassandra in 4-nodes configuration, and I'm about to collect raw number and graphs to show.

Stay tuned!

Sometimes I think I'm doing something wrong

I could be an artist, if trained to paint 15 years instead of programming :)

Created under impression about new album from Leningrad band. That's the only way I can imagine Julia Kogan while listening her vocal.

EDITED TO ADD: Blog has been removed from kernelplanet syndication and probably from many others likely because of some raffinated policies.
But that's what I do, and if you do not like that, do not read it. But if you still want to know how to create cool technical stuff, but suddenly missed start of the 21 century - there are freaking tags at the right of this page.

Syndicate content