pohmelfs: network raid1 example
Pohmelfs configuration is actually trivial:
# mount -t pohmelfs -o "server=172.16.136.1:1025:2,fsid=xxx,groups=3:2:1,noatime,noreadcsum,successful_write_count=1,sync_timeout=600,readdir_allocation=5" none /mnt/
Here the ‘server’ mount option specifies a node address in the form address:port:family (family is 2 for IPv4, 6 for IPv6). It is fine to list only a subset of the cluster's addresses: pohmelfs downloads the route table itself and needs just one live node at connection time to discover the others.
The ‘groups’ mount option lists the groups you want to write data into. A group is a kind of replica ID.
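As a rough illustration of how the option string above is put together, here is a small Python sketch. The helper names `server_option` and `mount_options` are made up for this example and are not part of pohmelfs; only the address:port:family and groups=3:2:1 formats come from the mount line above.

```python
# Address family codes used in the 'server' option (2 = IPv4, 6 = IPv6).
FAMILY_IPV4 = 2
FAMILY_IPV6 = 6


def server_option(address, port, family=FAMILY_IPV4):
    """Format one node as address:port:family (hypothetical helper)."""
    return "%s:%d:%d" % (address, port, family)


def mount_options(server, groups, fsid="pohmelfs", extra=()):
    """Build the comma-separated string passed to mount -o (hypothetical helper).

    server: (address, port, family) tuple for one live node
    groups: group IDs, joined with ':' as in groups=3:2:1
    extra:  further options passed through unchanged, e.g. 'noatime'
    """
    parts = ["server=" + server_option(*server)]
    parts.append("fsid=" + fsid)
    parts.append("groups=" + ":".join(str(g) for g in groups))
    parts.extend(extra)
    return ",".join(parts)


opts = mount_options(("172.16.136.1", 1025, FAMILY_IPV4), [3, 2, 1],
                     fsid="xxx", extra=["noatime", "noreadcsum"])
print(opts)
```

The printed string can be pasted after `-o` in the mount command; the remaining options from the example (successful_write_count, sync_timeout, readdir_allocation) would go into `extra` in the same way.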
That’s all for pohmelfs.
Let’s configure elliptics and create 2 groups (with IDs 2 and 3, for example) that will store essentially identical replicas (they may differ, since writes can be unordered or one group may be down for some time).
There are 2 configuration files: the elliptics server config (let’s call it ioserv.conf) and the server-side script environment (we use python, so it is python.init).
Here is ioserv.conf
I will highlight the parameters that differ between groups.
# log file
# set to 'syslog' without inverted commas if you want elliptics to log through syslog
log = syslog

# log mask
#log_mask = 10
log_mask = 15

# specifies whether to join storage network
join = 1

# config flags
# bits start from 0, 0 is unused (it's actually the join flag above)
# bit 1 - do not request remote route table
# bit 2 - mix states before read operations according to state's weights
# bit 3 - do not checksum data on upload and check it during data read
# bit 4 - do not update metadata at all
# bit 5 - randomize states for read requests
flags = 4

# node will join nodes in this group
group = 2

# list of remote nodes to connect
# address:port:family where family is either 2 (AF_INET) or 6 (AF_INET6)
# address can be host name or IP
remote = 172.16.136.1:1025:2 172.16.136.2:1025:2

# local address to bind to
# port 0 means random port
#addr = localhost:1025:2
addr = 172.16.136.1:1025:2

# wait timeout specifies number of seconds to wait for command completion
wait_timeout = 60

# this timeout specifies number of seconds to wait before killing
# unacked transaction
check_timeout = 60

# number of IO threads in processing pool
io_thread_num = 64

# number of IO threads in processing pool dedicated to nonblocking operations
# they are invoked from recursive commands like DNET_CMD_EXEC, when script
# tries to read/write some data using the same id/key as in original exec command
nonblocking_io_thread_num = 32

# number of threads in network processing pool
net_thread_num = 64

# specifies history environment directory
# it will host file with generated IDs
# and server-side execution scripts
history = /opt/elliptics/history.2

# specifies whether to go into background
daemon = 1

# authentication cookie
# if this string (32 bytes long max) does not match to server nodes,
# new node can not join and serve IO
auth_cookie = qwerty

# Background jobs (replica checks and recovery) IO priorities
# ionice for background operations (disk scheduler should support it)
# class - number from 0 to 3
# 0 - default class
# 1 - realtime class
# 2 - best-effort class
# 3 - idle class
bg_ionice_class = 3

# prio - number from 0 to 7, sets priority inside class
bg_ionice_prio = 0

# IP priorities
# man 7 socket for IP_PRIORITY
# server_net_prio is set for all joined (server) connections
# client_net_prio is set for other connection
# is only turned on when non zero
server_net_prio = 1
client_net_prio = 6

# anything below this line will be processed
# by backend's parser and will not be able to
# change global configuration

# backend can be 'filesystem' or 'blob'
backend = blob

# zero here means 'sync on every write'
# positive number means data and metadata updates
# are synced every @sync seconds
sync = 300

# eblob objects prefix. System will append .NNN and .NNN.index to new blobs
data = /opt/elliptics/eblob.2/data

# Maximum blob size. New file will be opened after current one
# grows beyond @blob_size limit
# Supports K, M and G modifiers
blob_size = 500G

# Maximum number of records in blob.
# When number of records reaches this level,
# blob is closed and sorted index is generated.
# Its meaning is similar to above @blob_size,
# except that it operates on records and not bytes.
records_in_blob = 10000000
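The ‘flags’ value in the configuration is a bitmask, so flags = 4 means only bit 2 is set. Here is a small Python sketch for reading or composing such a value. The bit descriptions are copied from the config comments; the helper names are made up for this example, and it assumes "bit N" means the value 1 << N.

```python
# Bit meanings copied from the comments in ioserv.conf (bits are numbered from 0).
FLAG_BITS = {
    1: "do not request remote route table",
    2: "mix states before read operations according to state's weights",
    3: "do not checksum data on upload and check it during data read",
    4: "do not update metadata at all",
    5: "randomize states for read requests",
}


def decode_flags(flags):
    """Return the descriptions of all documented bits set in `flags`."""
    return [desc for bit, desc in sorted(FLAG_BITS.items()) if flags & (1 << bit)]


def encode_flags(bits):
    """Combine bit numbers into the integer written after 'flags ='."""
    value = 0
    for bit in bits:
        value |= 1 << bit
    return value


for line in decode_flags(4):
    print(line)
print(encode_flags([2, 5]))
```

With the value from the example config, `decode_flags(4)` reports only the "mix states" behaviour; `encode_flags([2, 5])` shows how to add read randomization on top of it.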
Our second replica will live in group 3, so change the ‘group’ parameter above to 3, along with the node’s address (‘addr’) and, optionally, the ‘remote’ parameter, which is the list of nodes to connect to. That list may include the local address itself.
The second configuration file is python.init
It must live in the directory specified by the ‘history’ parameter above
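If you prefer to generate the second config rather than edit it by hand, something along these lines would do. This is only a sketch: `override_settings` is a made-up helper, and the address 172.16.136.2 for the group 3 node is an assumption taken from the ‘remote’ list in the example.

```python
def override_settings(config_text, overrides):
    """Return config_text with 'key = value' lines replaced per `overrides`.

    Comment lines (starting with '#') and keys not in `overrides`
    are left untouched.
    """
    out = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in overrides:
                line = "%s = %s" % (key, overrides[key])
        out.append(line)
    return "\n".join(out)


# A few lines from the group 2 config shown above.
group2 = """\
group = 2
#addr = localhost:1025:2
addr = 172.16.136.1:1025:2
history = /opt/elliptics/history.2
"""

group3 = override_settings(group2, {
    "group": "3",
    "addr": "172.16.136.2:1025:2",  # assumed address of the group 3 node
})
print(group3)
```

If both groups run on the same machine, the ‘history’ and ‘data’ paths would need their own entries in `overrides` as well.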
You should put all srw/pohmelfs* scripts in ‘history’ path too.
import sys
sys.path.append('/tmp/dnet/lib')
sys.path.append('/opt/elliptics/history.2')
from libelliptics_python import *
# groups used in metadata write
pohmelfs_groups = [1, 2, 3]
pohmelfs_log_file = '/opt/elliptics/history.2/python.log'
log = elliptics_log_file(pohmelfs_log_file, 10)
n = elliptics_node_python(log)
# we should only add own local group, since we do not want all updates to be repeated for all groups
# this should be changed to 3 for group number 3
n.add_groups([2])
# this is an IP address for local node, i.e. server, which belongs to group 2
# you may specify multiple addresses with multiple calls
n.add_remote('172.16.136.1', 1025)
__return_data = 'unused'
import gc
import struct
# python sstable implementation
from sstable2 import sstable
import logging
FORMAT = "%(asctime)-15s %(process)d %(script)s %(dentry_name)s %(message)s"
logging.basicConfig(filename=pohmelfs_log_file, level=logging.DEBUG, format=FORMAT)
pohmelfs_offset = 0
pohmelfs_size = 0
# do not check csum
#pohmelfs_ioflags_read = 256
pohmelfs_ioflags_read = 0
pohmelfs_ioflags_write = 0
# do not lock operation, since we are 'inside' DNET_CMD_EXEC command already
pohmelfs_aflags = 16
pohmelfs_column = 0
pohmelfs_link_number_column = 2
pohmelfs_inode_info_column = 3
pohmelfs_group_id = 0
def pohmelfs_write(parent_id, content):
    n.write_data(parent_id, content, pohmelfs_offset, pohmelfs_aflags, pohmelfs_ioflags_write)
    n.write_metadata(parent_id, '', pohmelfs_groups, pohmelfs_aflags)
Combining this initialization script (you may edit the one in the source tree) with the pohmelfs scripts adds support for server-side scripts, which are executed in the context set up above.
There is a pool of processes which pick up execution contexts to run your requests.
That’s it, feel free to ask if you hit any problem!

At the end of the python script you write one copy of the data to a single elliptics group, but the metadata to all groups. Is that correct?
Metadata is written into the groups specified at initialization time (or at the time add_groups() is called). The groups parameter passed in is only stored in the metadata; it is not used to send messages.
What does “fsid=xxx” mean in the mount string?
Is it possible to use several “server” parameters in the mount line?
fsid=
Filesystem ID – you may have multiple filesystems in the same elliptics cluster
This ID may be thought of as container or namespace identity
By default it is ‘pohmelfs’ (without quotes)
server=
Remote node to connect (family may be 2 for IPv4 and 6 for IPv6)
You may specify multiple nodes, usually it is ok to put here only subset
of all remote nodes in cluster, pohmelfs will automatically discover other nodes
Hi, I didn’t find any mailing list. When I start dnet_ioserv with the example configuration at least one thread starts spinning and eating 100% cpu immediately. Here’s a bt from what I believe to be the culprit:
Thread 24 (Thread 0x7f4e69885700 (LWP 16971)):
#0 0x00007f4e6b91461e in __pthread_mutex_unlock_usercnt () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f4e6bd2da26 in eblob_sync (data=0x1c08790) at blob.c:1517
#2 0x00007f4e6b910b40 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f4e6b65b36d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000000000 in ?? ()
Here’s what perf top says:
45.08% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt
33.06% libpthread-2.13.so [.] pthread_mutex_lock
Also,
You may want to increase the sync timeout, since a sync may take a long time. If it already has a reasonable value (like 30-300 seconds), then I would like to know more about your setup (hardware and eblob/elliptics versions) and how you generate load.
Ah right, thanks. Increasing sync results in reasonable CPU usage. The default value is 0 in examples/ioserv.conf. Looking at the code I still don’t understand why setting sync to 0 eats 100% CPU even if there is no load at all.
best, Tom
With a zero timeout it constantly runs in a sync loop without sleeping in between
I will fix it up
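To see why a zero sync value turns into a busy loop, here is a toy model in Python. It is not the eblob code: the function name, the simulated clock and the 1 ms cost per pass are all invented for illustration. It only counts how many sync passes fit into one second when the thread sleeps for ‘sync’ seconds between passes.

```python
def count_sync_passes(sync_seconds, window_ms=1000, pass_cost_ms=1):
    """Count sync passes within window_ms of simulated time.

    Each pass costs pass_cost_ms (an invented figure), after which
    the thread sleeps for sync_seconds before the next pass.
    """
    now_ms = 0
    passes = 0
    while now_ms < window_ms:
        passes += 1
        now_ms += pass_cost_ms + sync_seconds * 1000
    return passes


print(count_sync_passes(0))    # sync = 0: back-to-back passes, no sleep
print(count_sync_passes(300))  # sync = 300: one pass, then a long sleep
```

In this model sync = 0 gives 1000 passes per second against a single pass for sync = 300, which matches the picture in the backtrace above: the thread spends all its time taking and releasing the mutex.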