ioremap.net

Storage and beyond

IT development roadmap.

This will be a fairly short post about which projects are on the near-term roadmap and what their status is. I wrote IT because I will describe programming projects only, and not electronics, for example.

So, let’s start with the two existing ones: DST and POHMELFS.
The former is essentially ready (with a fun experiment I ran with the latest version, kernel and public releases) and will keep being pushed upstream for some time. The next version will be released in a week or so and will only have a new name. That will be the 10th distributed storage release, and the 4th resend of the same code.
The recently released POHMELFS got a bug just after release, which is supposed to be fixed in the current version (pull from both the kernel and userspace POHMELFS git repos). So far I do not see any new major changes in the POHMELFS client code, so essentially the kernel side will only be extended when distributed server changes require it. I will not push it upstream until the server side is also close to being finished.

Obviously both projects will be maintained and bugs will be fixed with the highest priority.

Here we come to the new project I have been thinking about for some time already. This is a network storage server built on top of a distributed hash table design, without a centralized architecture and without the need for metadata servers. So far there is not much code, only trivial bits, and I’m designing the node join/leave part of the protocol. Some results are expected sooner rather than later, but not immediately.

And of course the brain needs something to play with the rest of the time. Here comes the language parser (LISP XML parser) I mentioned previously, and some computer language application based on this idea. That’s what I’m about to work on for the next few days.

As another programming exercise I frequently think about buying myself a PlayStation 3 and playing with its SPU processors and parallel applications, for example graphics algorithms (the first one that comes to mind is wavelet transformation and inexact image search, which I did several years ago; I even have the sources hidden in some old arch repo). Or playing with video card engines.
This does not have a very high priority though.

Those are the programming plans. Let’s see if anything gets completed anytime soon :)


23 Responses to “IT development roadmap.”

  • jonsmirl says:

    Check out Lustre and Btrfs. There is also an implementation of git on DHT.

    Have you tried DRC for your amps?
    http://www.duffroomcorrection.com/wiki/Main_Page

    I see you have your Rigol already. Ours is coming tomorrow.

    Playing with video card engines is not much fun. I’ve tried doing it before. Poor documentation of all of the bugs in the silicon will cause you to waste huge amounts of time.

    I suspect you’d have a lot more fun with gnuRadio – software defined radio in FPGAs. You can use the gnuRadio hardware to clone your Rigol scope.

    Home automation is also fun to play with.

  • zbr says:

    I know what Lustre is and how badly it behaves when the network suddenly goes down. One year ago it was way too unstable for this kind of workload with even a slightly faulty network connection. Btrfs (and CRFS with its local data and metadata cache) is definitely a good idea, and POHMELFS was started after a conversation with Zach Brown (the main CRFS developer) when it was a closed Oracle project. I somewhat stole the idea of having a local data and metadata cache :)

    POHMELFS has noticeably more features implemented than CRFS now, but CRFS is developed according to its own roadmap, and there are no distributed capabilities there at all.

    Effectively, right now there is no distributed parallel filesystem which is stable and simple enough to be used out of the box. Each of them has one problem or another, which POHMELFS aims to resolve. I do not say that it is perfect, but by design it should be close enough :) I wrote a small design comparison of a dozen network/distributed/parallel filesystems before I started the POHMELFS implementation about a year ago.

    There is a fair number of DHT implementations out there, but none is tied to the kernel interfaces and designed with Linux kernel VFS internal structures and operation modes in mind.

    Working with video adapters as a calculating engine is just an idea; I’m pretty sure it is a non-trivial thing and requires lots of time.

    As for electronics projects, I have a set of very interesting ideas to work on (starting from various (somewhat illegal) sniffing/man-in-the-middle/cloning devices up to a robot implementation), but so far I am only studying the basics, and it is already very interesting, and I expect even more. DRC looks very interesting to work with (thanks for the link), and I will very likely play with it when I implement an amplifier (likely chip-based) and a stereo system myself (I have kind of a problem: I want to implement as much as possible myself, from programming down to apartment development).

    Electronics projects are different from what I’m doing during light part of the day, when I implement various software projects for fun and profit :)
    And that’s great and interesting.

  • jonsmirl says:

    Lustre has a neat mode where the disks store file objects instead of blocks. They actually made new firmware for the disks. There are interesting concepts in the Lustre project.

    I’m much more a fan of distributed object stores for file system than distributed block stores.

    Another interesting idea is to completely get rid of the whole concept of directories as blocks on disk. Instead each file contains a branch with metadata about it, such as its path name(s), keywords, mp3 ids, creation date, etc. Files are accessed by something like an SHA from the object store. Indexes are then independently built using the metadata from the branch in the file. In this model you can whack every index in the system and then recreate them by rescanning the object store. Paths are just another metadata tag. Redundancy is achieved by storing the file object multiple times. You can build full-text indexes in this model too.

    Using an SHA (or similar) as the file identifier is a neat concept since it makes duplicate detection easier. My dream file system would be distributed enough so that everyone on the Internet could mount it.
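
The object-store idea in this comment can be sketched in a few lines of Python. This is only an illustration under assumptions: the `ObjectStore` class, its method names and the in-memory dictionary are hypothetical, and SHA-1 stands in for "something like an SHA". It shows duplicate detection by content hash and an index being rebuilt purely by rescanning the stored metadata.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: objects are keyed by the SHA-1 of their data."""

    def __init__(self):
        self.objects = {}   # sha -> (data, metadata dict)

    def put(self, data: bytes, **metadata) -> str:
        sha = hashlib.sha1(data).hexdigest()
        if sha in self.objects:
            # Duplicate content: keep a single copy, merge the metadata tags.
            self.objects[sha][1].update(metadata)
        else:
            self.objects[sha] = (data, dict(metadata))
        return sha

    def get(self, sha: str) -> bytes:
        return self.objects[sha][0]

    def build_index(self, tag: str) -> dict:
        """Rebuild the index for one metadata tag by rescanning every object."""
        index = {}
        for sha, (_, meta) in self.objects.items():
            if tag in meta:
                index.setdefault(meta[tag], set()).add(sha)
        return index


store = ObjectStore()
a = store.put(b"song bytes", artist="The Beatles", path="/music/help.mp3")
b = store.put(b"song bytes", keyword="1965")      # same content, extra tag
c = store.put(b"other bytes", artist="The Beatles")

# Indexes are disposable: they can be dropped and recreated at any time.
by_artist = store.build_index("artist")
```

Because the path is stored as just another tag, `store.build_index("path")` yields a directory-like index without directories existing on disk.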

    There are concepts in the design of git that can be applied to file systems. Linus created something very different than previous version control systems when he designed git.

    I believe the next big file system lies somewhere in the world of DHT and object stores but no one has discovered the right combination yet.

    GPU does room correction…
    http://koonlab.com/CUDA_RealFIR/CUDA%20Real%20FIR.html
    GPU doing room correction is way overkill.

  • zbr says:

    What you described is exactly how POHMELFS server will operate.
    There is a fair number of problems to solve in this approach, namely addressing, redundancy and join/leave events, but I’m working on it, and (by design) it should be a filesystem, which indeed can be mounted from anywhere in the world, and any computer can provide its space to be part of the appropriate address space. POHMELFS itself is just a kernel client for this distributed network.

    I never worked with object-based storage, and maybe because of that I do not actually think it will be widely used: whatever an object FS was created for, there will always be a different kind of object that has to be stored on the given FS, and then things will explode with a bucket of problems. Generic solutions always work, but indeed a specially crafted filesystem will always outperform a generic one as long as specific conditions are met.

  • jonsmirl says:

    I don’t mean object based as in C++/Java objects. Object based meaning the basic unit stored in the file system is a file, not a block. The file is the object. Disk block allocations are a private matter for the disk drive to deal with, not the file system. The disk can either hold the file or it can’t. How it is stored in the disk is the drive’s problem. That’s what the special Lustre disks do. Of course large files may be split into multiple objects.

    Ext3 is block based, CIFS is object based. Files can have resource forks for holding the metadata and keeping it out of the file data. But the contents of the metadata and file data resource forks are just binary streams.

    RAID works on block based systems. Redundancy is used on object based systems.

    Coral is an example of a DHT based object store.
    http://en.wikipedia.org/wiki/Coral_Content_Distribution_Network

  • zbr says:

    I understood that it is not like programming objects :)

    The main problem with storing the object on the disk, as I see it, is the fact that eventually the disk’s firmware will not be able to store some new file provided by the user. As a trivial example: a 32-bit limitation in firmware. Bugs will never be fixed in the disk’s microcode, since it is very much closed and likely written in an obscure language.

    Effectively the same happens with modern SSD/flash drives, when the firmware wants to provide a get/put object interface to the higher layer, expecting that it will implement a better wear-leveling and block allocation policy than a usual filesystem. But there are no special things controlled only from the firmware, so I do not see any reason why a firmware-stored filesystem would behave better than the same filesystem on top of block-based storage. It is the same, but part of the work is hidden inside the disk’s CPU, and no one really knows what happens inside or which bugs and problems live there until data suddenly disappears.

  • jonsmirl says:

    There is no requirement that the object store physically be in the disk. The concept is to think of the disk as an object store, just like another network file system can be an object store. Then implement a file system on top of these object stores.

    Instead of RAID you put the file in the object store multiple times. The file system can be smart enough to create new copies when redundancy is lost or a new copy is needed in a different location for performance.

    The two things I never liked about traditional file systems are their coupling to blocks on the disk drive and having a single index for locating files (the directory structure).

    The object store method allows a new person to join the distributed file system. The SHA for the files can be computed to locate duplicates. Then their metadata indexes can be merged into the global one. In the giant Internet file system model you need a way to collapse 300M copies of Windows into a single, highly redundant copy. In this world the concept of path names collapses too.

    Add in the git concepts of transactions (commits to make batches of metadata appear) and delta chaining (just an efficient form of compression).

    Then for compatibility you need local path names. To install something like gedit the install file would set /usr/bin/gedit = query (gnu gedit current x86 executable). Metadata would convert that to an SHA which would be cached in the local path metadata. When you open the SHA it would fault in the binary to your local disk. In this model your entire disk becomes a cache.
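
The install flow described above (local path, then attribute query, then SHA, then faulting the object into a local cache) might look roughly like the following sketch. Everything here is hypothetical and for illustration only: the `REMOTE` dictionary stands in for the network object store, and `publish`, `query` and `LocalNamespace` are made-up names.

```python
import hashlib

# Hypothetical "remote" object store: sha -> (data, metadata).
REMOTE = {}


def publish(data: bytes, **meta) -> str:
    sha = hashlib.sha1(data).hexdigest()
    REMOTE[sha] = (data, meta)
    return sha


def query(**wanted) -> str:
    """Return the SHA of the first remote object whose metadata matches."""
    for sha, (_, meta) in REMOTE.items():
        if all(meta.get(k) == v for k, v in wanted.items()):
            return sha
    raise FileNotFoundError(wanted)


class LocalNamespace:
    def __init__(self):
        self.aliases = {}    # local path -> attribute query
        self.resolved = {}   # local path -> cached SHA
        self.cache = {}      # SHA -> data faulted in from the remote store

    def install(self, path, **attr_query):
        # "Installing" only records an alias; no data is copied yet.
        self.aliases[path] = attr_query

    def open(self, path) -> bytes:
        if path not in self.resolved:
            self.resolved[path] = query(**self.aliases[path])
        sha = self.resolved[path]
        if sha not in self.cache:
            self.cache[sha] = REMOTE[sha][0]   # fault the object in
        return self.cache[sha]


sha = publish(b"\x7fELF...", project="gnu", name="gedit",
              arch="x86", kind="executable")
ns = LocalNamespace()
ns.install("/usr/bin/gedit", name="gedit", arch="x86", kind="executable")
data = ns.open("/usr/bin/gedit")
```

The local disk only ever holds what has been opened, which is the sense in which the entire disk becomes a cache.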

  • zbr says:

    No, I was not talking about this particular problem, but about the fact that when something is hidden in the drive itself, it becomes very error prone, since the disk’s firmware will essentially never be updated except in some really bad cases.

    All network file systems already operate not with blocks, but with inodes. There is no difference between dir and file at the level of the network filesystem, they work only with inodes, which differ only by some inner flags.

    What network object filesystems do (one was presented recently in linux-fsdevel@) is export the internal layout of the object storage disk over the network to the clients. So effectively, in a usual network filesystem and local FS we have the following layout: the client works with files, the network filesystem works with files, the network server works with files, and the local filesystem works with files but does IO in blocks. With object storage everything is the same, but the local filesystem is hidden in the disk’s firmware.

    The difference is where object->block translation happens.
    Btw, when it lives in the disk’s firmware, this is another way to tightly tie the customer to the vendor, if the interface to the disk’s firmware is closed.

  • zbr says:

    Having a sha1 of the path as the id is effectively the same key as the path itself, and there will be no way to easily find duplicates: on one server the path is /path1/root/windows/iosys.sys and on another it is /path2/iosys.sys, so hashes of the paths will be different. Hashes of the data content will be the same of course, but then we lose the information about where the given file has to be placed to create a proper system.

    Since filesystem lookup is path based in existing systems, we have to have the path (or its hash) as the main key. And your last example will only work if the sha id is a path hash, or did I miss some trick there?

  • jonsmirl says:

    In a network model the SHA can not include the local path, but it could include a default standard path. The standard path is just another piece of metadata. But there is no universal, global path space.

    I would view local paths as part of backwards compatibility. On a local machine there would be an index for turning a local path into an attribute query. The attribute query then turns into a SHA. The standard path included in the file could be used to initially create the local path for the file, but it is not a requirement. Local paths are simply an alias for an attribute query.

    You can build arbitrary indexes on the file metadata to make the type of queries you are interested in as fast as possible. You can also do hierarchies of queries. Search locally for a metadata match first, then expand the search to the local area, then global.

    Now that I am thinking about this more, there doesn’t seem to be a hard requirement that metadata be stored in the file. Storing the metadata in the file just makes the fsck problem easier since it can be used to reconstruct the indexes.

    One of the root problems with building an Internet file system is eliminating the visibility of local paths. Long ago (1985) I was one of the designers of SMB (now CIFS). I can see now that we made a lot of mistakes.

  • zbr says:

    Let me shed some light on how I designed object export.

    When the server starts, it has its root directory as a base, so all path hashes are calculated relative to that root; in this case two machines which export the same path relative to the export root will have the same set of ids. Plus it has its own ID, which will be used as a base to store writes from different clients into the objects (files and dirs) with the same (or close in the flat address space) ids. Each exported object has several keys for subpaths of the original path, i.e. when the server exports /path/test/object/file with some ID, it also exports test/object/file, object/file and file with appropriate ids.

    When a client wants to access some file, it constructs its full path, gets its hash ID and asks some server for this data (the request will be forwarded to the right server). If there is no object with precisely this ID, the client may request the ID for a subpath as described above.

    There is a problem though when there are two different objects with the same subpath.
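
A minimal sketch of the export scheme described in this comment, assuming SHA-1 over the root-relative path string as the ID. The function names and the in-memory `table` are hypothetical; the real server-side layout is not specified here. It shows the subpath keys, the client-side fallback to shorter subpaths, and the collision mentioned in the last sentence.

```python
import hashlib


def path_id(path: str) -> str:
    return hashlib.sha1(path.encode()).hexdigest()


def export_keys(rel_path: str) -> list:
    """IDs for a root-relative path and each of its trailing subpaths.

    "path/test/object/file" gives the IDs of that path, "test/object/file",
    "object/file" and "file".
    """
    parts = rel_path.strip("/").split("/")
    return [path_id("/".join(parts[i:])) for i in range(len(parts))]


def export(table: dict, rel_path: str) -> None:
    for key in export_keys(rel_path):
        table.setdefault(key, []).append(rel_path)


def lookup(table: dict, full_path: str) -> list:
    """Try the full path first, then progressively shorter subpaths."""
    for key in export_keys(full_path):
        if key in table:
            return table[key]
    return []


table = {}
export(table, "path/test/object/file")
export(table, "other/object/file")

exact = lookup(table, "path/test/object/file")
fallback = lookup(table, "mnt/somewhere/test/object/file")
ambiguous = lookup(table, "x/object/file")   # two objects share this subpath
```

Here `ambiguous` comes back with both exported objects, which is exactly the unresolved case: the subpath alone does not say which one the client meant.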

  • Anonymous says:

    I would like to know if POHMELFS has or is planned to have a feature for avoiding duplicate data.. I could see that as being quite useful. It could be done at insert, there is a file checksum calculated already isn’t there? -vrt

  • jonsmirl says:

    As far as I have determined there is no solution that allows exposing local paths in an Internet file system. The only way to make it work is via tagging and attribute queries. As long as the local paths are exported it makes the problem of collapsing duplicates too hard. Path directories don’t exist in Internet file systems either.

    git mkdir is a NOP. git doesn’t have directories, they are virtual. What appears to be a directory is a partial match against the full path string. Paths are not special. You could get the same effect by querying on files created in December 2001.

    I wouldn’t export local paths. Instead I would export a query service on metadata tags. The query service then returns a list of file SHAs. The remote machine can access those SHA1s remotely or copy the file object over for local caching. The remote machine can also query the metadata from the file and integrate it into its own indexes.

    Think of local paths as an alias for an attribute query. The local aliases are never exported into the global file system.

    In this model I could ‘mkdir beatles song_artist=”The Beatles”‘ Now ls beatles and all of the beatles’ songs would magically appear in that directory. Make the query live and new songs would appear as they are written.
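
The "directory as a live attribute query" idea can be illustrated with a small Python sketch. The `TagStore` class and its `mkdir`/`ls` methods are hypothetical stand-ins, not an existing tool; the placeholder strings such as `"sha-help"` stand in for real file SHAs.

```python
class TagStore:
    def __init__(self):
        self.files = {}   # sha -> metadata tags
        self.dirs = {}    # directory name -> attribute query

    def add(self, sha, **tags):
        self.files[sha] = tags

    def mkdir(self, name, **attr_query):
        # A directory stores only the query, never a list of entries.
        self.dirs[name] = attr_query

    def ls(self, name):
        # The listing is evaluated on every call, so the query is "live".
        q = self.dirs[name]
        return sorted(sha for sha, tags in self.files.items()
                      if all(tags.get(k) == v for k, v in q.items()))


fs = TagStore()
fs.add("sha-help", song_artist="The Beatles", title="Help!")
fs.add("sha-other", song_artist="Someone Else")
fs.mkdir("beatles", song_artist="The Beatles")

before = fs.ls("beatles")
fs.add("sha-yesterday", song_artist="The Beatles", title="Yesterday")
after = fs.ls("beatles")   # the new song shows up without touching the directory
```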

    BTW, byte range locking is only rarely used. When it is used it is almost always a database app. Databases are better implemented as a service on an Internet file system than trying to run them on top of the Internet file system.

    I’d also let simultaneous writes to a file create two versions of the file. It would be up to another program to achieve the merge. In a disconnected network there is no way to stop simultaneous writes.

    atime, etc. are meaningless on an Internet file system. Build them into the legacy local path system.

  • zbr says:

    POHMELFS client itself does not do anything with data, it only requests servers to provide needed object.

    Servers are being designed to work over a flat address space, where the same objects are supposed to have the same ids, so there should be no data duplication during writing (modulo redundancy), but of course servers may export their own copies of data by themselves.

  • zbr says:

    Multiple paths are actually nothing more than additional IDs. Effectively there is only a single object, no matter where it lives (it can even be placed in a database), but it is exported both as /path/to/file and as having a beatles ID. And the server knows how to find data based on the ID information presented.

    So effectively there is a file on some server, and the server at startup enters the network and announces that it has objects id0, id1 … idN, while actually this may be the same file, addressed in different ways. If one of the IDs happens to be an attribute saying that it belongs to some particular musical group, it can be accessed by that ID, if the music application knows how to create it.

    So, it is possible to attach an id=sha(group=’beatles’) to some audio files, and they can be read by that id, but writing/modifications should be limited.

  • zbr says:

    Object attributes are local to the client and are maintained by the low-level filesystem on the server itself; there is no need to attach this information to the exported metadata.

    As for locks: I recently read an article about CIFS data access distribution, and it showed that the number of files accessed simultaneously is extremely low compared to non-simultaneous access, so I do not really care much about the fact that POHMELFS currently locks the whole object.

    Merging and offline work is another idea which lives for a while in my POHMELFS todo list, and while I think coherency should be maintained while node is online, when it goes offline, things should not break, and it just should work with local storage, and when node goes online again, it can sync with server data and allow external application to merge data if local and remote objects differ.

  • jonsmirl says:

    Encoding metadata into SHAs puts it into the same flat name space with the files. SHAs should return a single unique file. This also introduces a global naming authority problem.

    I would pass metadata through another mechanism. Metadata can be quite large, for example a list of unique words found in a text file. Metadata can also be computed in large server farms like Google like tagging all photos of Obama.

    Another example is the search built into Google’s desktop search engine for Windows. It has plugins for decoding hundreds of file formats. This decoded data could be stored in a large metadata server and retrieved using the file’s SHA when you wanted to build a local index on that metadata type. If you use Google to back up your photos it would identify all of the people in them and tags the photos. That generated metadata could be brought back into the local machine. Send a file SHA to Google and ask for all the metadata Google has on it.

    This is a good area for experimentation. As far as I know no one has built a system like this. Pieces of it have been built but they aren’t integrated.

    Every time I try to talk about this with file system people they are too stuck on path names to abandon them.

  • zbr says:

    That’s why I split POHMELFS into two essentially unrelated projects. The first is a network parallel filesystem: a kernel client and a simple enough server, which operates on local data using path names as a key. This part is essentially ready: it outperforms NFS, has really lots of features which are only planned in pNFS, and the whole idea looks close to complete. But I want to move further: a distributed network. This part is not actually a filesystem, but some amorphous storage, which, among others, can be accessed from the common filesystem with its pathname interface.

    What about not having metadata as a separate entity, since it is effectively the same kind of data as what is placed in the file, just called differently, and instead placing it into the same cloud as the data, so that (one of) its keys can be derived from the object’s key? So let’s say a node exports sha(/tmp/some/file), and its metadata attributes as sha(/tmp/some/file) ^ 0xff or something like that (not particularly that way, just an example).
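
As a rough illustration of deriving the metadata key from the object key, here is one possible reading of "sha(path) ^ 0xff": treat the SHA-1 digest as a big-endian number and flip its low 8 bits, which is the same as XOR-ing the last byte with 0xff. The function names, the `cloud` dictionary and the sample metadata string are all assumptions made for the example.

```python
import hashlib


def object_key(path: str) -> bytes:
    return hashlib.sha1(path.encode()).digest()


def metadata_key(key: bytes) -> bytes:
    # XOR the last byte with 0xff (an assumed interpretation of "^ 0xff").
    return key[:-1] + bytes([key[-1] ^ 0xFF])


# Data and metadata live in the same flat key space ("cloud").
cloud = {}
k = object_key("/tmp/some/file")
cloud[k] = b"file contents"
cloud[metadata_key(k)] = b"mode=0644"

# Anyone who knows the path can derive both keys without a lookup service.
meta = cloud[metadata_key(object_key("/tmp/some/file"))]
```

Since XOR is its own inverse, the same function maps a metadata key back to its object key.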

    As for Google, I’ve just thought about placing huge data on its Picasa servers as pictures, and having an access protocol to those files which will strip the JPEG header and combine received chunks into a single object, spread over the internet servers… Just an idea :)

  • jonsmirl says:

    Google will release Gdrive sooner or later. They have been working on it forever. I was using Picasa as an example of something that computes metadata. Web Picasa learns who is in the pictures and then can automatically tag new photos when they are uploaded.

    There clearly needs to be a universal repository for the specification of public metadata formats. Each of these formats would have a GUID. You could then ask a server for all of the metadata on a file and get back GUID/metadata pairs.

    This lets me do the Beatles example. I could ask Google for the SHA of all files matching “music_group=Beatles format=mp3″. music_group and format would be the GUID for a standard metadata type. The results of that query would be cached in my local metadata server allowing me to mkdir beatles “music_group=Beatles format=mp3″. The files retrieved by opening the SHAs could come from anywhere. But the paths and file names are purely a local thing. Opening the file starts the faulting in process. If your net is fast enough you can probably use the file stream in real-time as it arrives.

    There are all kinds of neat things that fall out of this model when you start thinking about the process of installing software on a local machine. You don’t really install it. Instead you make a node in the local path name store and it faults in based on SHA. Backups become automatic. You just tell the file system to ensure that there are always at least three copies of the file in the cloud.

  • Anonymous says:

    I have been watching the development of POHMELFS and DST for some time now. I just wanted to give you feedback as to what I would like to use POHMELFS for, and perhaps you can comment and let me know whether these usage scenarios are along the lines of what you have planned.

    1. web server cluster
    – as small as 2 servers, exporting their own storage, providing redundancy and load distribution (I would employ load balancers for web traffic – LVS-DR is great)
    – adding additional servers (adding storage capacity to pool) online, and this way an administrator could add more servers without storage to boost cpu capacity, or backend servers that only provide storage to boost storage capacity
    – storage utilization is space efficient, calculate parity chunks instead of 2 copies of each block of data to provide redundancy?
    – decentralized, not an exact requirement but why should a cluster be constrained by a single or few metadata servers or servers have to be relegated to metadata/storage server roles? Shouldn’t all servers in the cluster be aware of each other and be able to negotiate the quickest and most balanced communications between themselves. This would be most useful for two server usage scenarios where both are aware/sharing the metadata and storage.
    – lock support was a requirement for me, I am glad you implemented it. I guess it is not really a requirement for general web servers, but it is useful for scripts that want to coordinate without a database connection (and some scripts may expect to be able to protect file access ie were written for a single server in mind)
    – snapshot capability (not super important, but would be a nice feature)
    2. mail server cluster (a lot of the points above for web cluster, perhaps both are one in the same for small deployments)
    3. road warrior / backup scenario
    – use offline capability to access local copies of data in a storage server, and merge/update changes at a later date

    4. future use, I could see using features such as byte range locking and attributes (perhaps beyond what normal extended attributes provided by file system provide) in clustered scripts/programs via libraries (I like perl, perhaps a XS adapter to a C library?)

    DST
    1. mysql/postgres distributed data store (I have very little experience in this area)
    – to be used with web cluster above, provide redundancy, load balancing (as in adding more db servers to support
    more clients) and storage pooling

    These are my goals, I could be misinformed or mistaken about some of the points I made, and maybe sound a little naive, but that is what discussion is for then. Anyways thanks for your hard work.
    -vrt

  • zbr says:

    POHMELFS kernel client at its current state and by design is a parallel network client to the distributed filesystem (called elliptics network in the design notes) itself. So you can consider it as parallel NFS, where parallel means read balancing and redundancy writing to multiple nodes. Having local cache it is also very high-performance, but that’s it: it is a simple client, which is able (with server of course) to maintain coherent cache of data and metadata on multiple clients, which work with the same servers.

    So, right now it is effectively a way to create data servers, where the client sees only a set of identical nodes and balances operations (when appropriate) between them. The server itself works with a local filesystem, which can be built on top of whatever RAID combination of disks or DST nodes you like. So effectively in this case a POHMELFS mounted node can be extended by growing the local filesystem on the server.

    That’s the current state of POHMELFS. The next step in this direction is to extend the server only (modulo some new commands to the client if needed) and add distributed facilities. By design there will be a cloud of servers, each of which has its own ID, and data is distributed over this cloud, which has the working title elliptics network. This name somewhat mirrors the main algorithm of the ID distribution. The POHMELFS client connects to some server and asks for the needed data; this request is transformed into the elliptics network format and forwarded to the server which hosts the data; the data is returned to the client, which does not even know that it was stored elsewhere. In this case it is possible to extend the server space indefinitely by adding new nodes, which will automatically join the network and will not require any work from the client or administrator.
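
The actual ID distribution algorithm of the elliptics network is not described here, so the following is not that algorithm. It is a generic consistent-hashing ring, a common DHT technique, used as a stand-in to illustrate how servers with their own IDs can share a flat address space and how a new node can join without the client doing anything. The `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib


def ident(name: str) -> int:
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


class Ring:
    def __init__(self):
        self.ids = []     # sorted node IDs
        self.nodes = {}   # node ID -> node name

    def join(self, name):
        nid = ident(name)
        bisect.insort(self.ids, nid)
        self.nodes[nid] = name

    def leave(self, name):
        nid = ident(name)
        self.ids.remove(nid)
        del self.nodes[nid]

    def owner(self, key: str) -> str:
        """First node whose ID is >= the key's ID, wrapping around the ring."""
        i = bisect.bisect_left(self.ids, ident(key))
        return self.nodes[self.ids[i % len(self.ids)]]


ring = Ring()
for n in ("node-a", "node-b", "node-c"):
    ring.join(n)

keys = [f"/data/file{i}" for i in range(100)]
before = {k: ring.owner(k) for k in keys}

ring.join("node-d")   # a new server joins the cloud
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this scheme the only keys that change owner on a join are those taken over by the new node; everything else stays where it was, which keeps join/leave events cheap.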

    A proof-of-concept implementation is scheduled for the next month or so; this should be a working library which can be used by other applications. Then I will integrate it with the existing POHMELFS server. These are optimistic timings though :)

    DST is a network block device. It has a dumb protocol, which allows connecting two machines and using read/write commands between them. So, there is always one machine which exports some storage, and one which connects to it. There are no locks, no protection against parallel usage, nothing. Just plain read and write commands. A system can start several connections to remote nodes and thus will have multiple block devices, which appear like local disks. The administrator can combine those block devices into a single one via device mapper/LVM or mount btrfs on top of multiple nodes, and then export it via POHMELFS to some clients or work with it locally.

    Hope this clarifies how things work. I’ve also pushed it to the main page.

  • Anonymous says:

    Evgeniy, you probably remember about my attempts to build a distributed storage with ability to define redundancy for data blocks (2 or 3, for example) using your DST project. For some reason I migrated to Lustre, but there is a problem with a OST (data) redundancy (i.e. to have several copies of data chunks on different servers) – lustre just doesn’t have that feature. It’ll be great to get a distributed storage with data redundancy and fail-over mechanisms ( like several servers for a client in pohmelfs).

  • zbr says:

    This is a task for a higher layer; DST is a network block device now. You can create a RAID on one of the machines on top of several DST nodes, and then export this node to the outside world. The same can be done on multiple machines, and one of them will be the master, while the others wait and select another master when the first one dies.

    Thus writing data to the one machine will send data to multiple remote hosts, and if this server suddenly goes offline, another one can pick up network nodes into raid and reexport it.

    I thought Lustre had OST redundancy for quite a while already? Even the metadata server can be duplicated (up to 32 replicas, and with some patches up to 1024).