Completed static content elliptics network implementation


I've completed the installation of a small distributed hash table storage that serves static content via direct URLs. The setup differs from the more common, expected one in one detail: how the client fetches and accesses data.

In the common case, applications built on the elliptics network are expected to fetch data from the network according to the object's transaction history (optionally in parallel). This requires the client code to be linked with the elliptics network library and adapted to its API.

But there is another way, which is much simpler although a bit limited: split data lookup from the reading itself, implement the former in a special small application, and rely on other facilities (like HTTP servers) to deliver the data.

This is what I did. I wrote a simple FastCGI application which starts data lookups and forms URLs that are returned to the client application, which in turn fetches the data from the storage HTTP servers. There is a one-to-one relation between any potentially failing unit within the storage cluster (one can install one elliptics node per disk, per server, or even per datacenter) and elliptics network nodes. FastCGI daemons (which can live on a separate set of machines if needed) are persistent clients of that network, and their only task is to look up the IP address of the elliptics network node, which is then extended into a static URL for actually getting the data.
This URL is returned from the FastCGI daemon as a redirect, but this is configurable.

I extended the lookup message to optionally stat local storage on the node to test whether the object is actually present on the given node. Using multiple IDs for the same data object allows storing multiple redundant copies, so that the client can switch to another copy if the object cannot be found using the previous ID. Elliptics network storage servers will take care of data relocation when servers go offline and come back online.
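The fallback between copies can be sketched roughly as follows. This is only an illustration, not code from the elliptics tree: `object_ids`, `find_copy` and the `lookup` callable are hypothetical names, and I assume here that each copy's ID comes from a different transformation function applied to the same name.

```python
import hashlib


def object_ids(name, transforms=("sha1", "md5")):
    """Return one ID per transformation function for the same object name.

    The choice of sha1 and md5 is only an assumption for this sketch.
    """
    data = name.encode("utf-8")
    return [hashlib.new(transform, data).hexdigest() for transform in transforms]


def find_copy(name, lookup, transforms=("sha1", "md5")):
    """Try each ID in turn and return (object_id, address) for the first hit.

    `lookup` is a hypothetical callable standing in for the elliptics lookup
    (with the optional stat): it returns a node address, or None when the
    object is not present under that ID.
    """
    for object_id in object_ids(name, transforms):
        address = lookup(object_id)
        if address is not None:
            return object_id, address
    return None
```

If the copy stored under the first ID is gone, the same call simply moves on to the next ID, which is all the client needs in order to switch copies.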

The only problem with this setup is how data is treated by the client and by the storage. The client expects a data stream from a single node, from beginning to end, while elliptics storage uses transactions with its own protocol and on-disk storage format, which the library processes when the appropriate IO API is called from the client code.

The problem can be solved if we upload data not via the elliptics network but directly into the storage, while using the same name conversion that elliptics internals would perform. That is, we manually create the directory structure and put objects there under names equal to the hash transformations of the real names; the FastCGI elliptics network daemon will later compute the same transformations.
Let me show an example of how this is done. Consider an object called '/tmp/passwd.c' to be placed into a storage that uses the sha1 transformation function.

sha1('/tmp/passwd.c') = 8c23ac86ef943021cf6524f475c15f3d5d575deb
So we manually put this object into the storage network on the appropriate node (the one which handles the covering ID range) under that 8c23... name.
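A minimal sketch of that manual placement, assuming the on-disk layout is a directory named after the first two hex characters of the ID with the full ID as the file name (this matches the URL shown in this post, but the storage root and the function names here are made up for illustration). It copies into a local directory; picking the right node and transferring the file there is left out.

```python
import hashlib
import os
import shutil


def storage_path(name, root):
    """Map a real object name to its hashed location under `root`."""
    object_id = hashlib.sha1(name.encode("utf-8")).hexdigest()
    # Assumed layout: <root>/<first two hex chars>/<full 40-char ID>
    return os.path.join(root, object_id[:2], object_id)


def store_object(source_file, name, root):
    """Copy `source_file` into the storage tree under the hashed name."""
    destination = storage_path(name, root)
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    shutil.copyfile(source_file, destination)
    return destination
```

For example, `store_object("passwd.c", "/tmp/passwd.c", "/srv/storage")` would create the prefix directory if needed and place the file under its sha1 name.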

A FastCGI daemon configured to use the sha1 transformation will receive a request like

GET some.host.net/blah?name=/tmp/passwd.c

take the name part, hash it, look up the object with the above 8c23... ID and return the following headers:

Status: 301
Location: http://some_other_host.net/8c/8c23ac86ef943021cf6524f475c15f3d5d575deb
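The daemon's side of this exchange can be sketched like this. It is not the actual FastCGI application, just an illustration of the steps: `redirect_headers` and the `lookup` callable are hypothetical, the real lookup goes through the elliptics network, and the 400 and 404 answers are my assumption about what to do when the name is missing or the object is not found.

```python
import hashlib
from urllib.parse import parse_qs


def redirect_headers(query_string, lookup):
    """Build CGI-style response headers for a query such as 'name=/tmp/passwd.c'.

    `lookup` is a hypothetical callable that takes the object ID and returns
    the host name or IP address of the node holding it, or None if not found.
    """
    names = parse_qs(query_string).get("name")
    if not names:
        return "Status: 400\r\n\r\n"

    # Same transformation as the one used when the object was uploaded.
    object_id = hashlib.sha1(names[0].encode("utf-8")).hexdigest()

    host = lookup(object_id)
    if host is None:
        return "Status: 404\r\n\r\n"

    # Extend the node address into a static URL: /<2-char prefix>/<full ID>
    location = "http://%s/%s/%s" % (host, object_id[:2], object_id)
    return "Status: 301\r\nLocation: %s\r\n\r\n" % location
```

Calling `redirect_headers("name=/tmp/passwd.c", lookup)` with a lookup that returns `some_other_host.net` yields a 301 with a Location header of the same shape as the one above.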

Simple. We only need to make an appropriate script for data upload. If we used the elliptics network for data upload, the above 8c23... transaction would contain the history of data updates, and the actual data transaction (or multiple transactions, if the object was split into multiple parts to allow parallel reading) would have to be read from the history and then fetched from some other nodes using the elliptics network API.

I will write such a helper script and upload some content (currently I do this manually via ssh/scp :), so that I can stress-test the setup before it goes live. So far things have gone pretty smoothly.
Interested parties can check the example directory in the git tree.

Stay tuned!

Can we add an option not only to get data from the cloud via HTTP, but to put objects as well? I.e. to make it work completely via HTTP (it may be optional, of course).

I thought about this, and the idea looks very interesting, but it requires an additional application running on the storage server nodes to actually write data to disk.

Supporting the same URL transformation as is done for GET requests will be rather trivial in the FastCGI daemon, but I'm not sure how the client will be able to process such a redirect for a POST request.

I'm not very familiar with HTTP, but eventually I will talk with the people who plan to deploy this solution about such an extension. I'm pretty sure they will want something similar, so I could implement a complete HTTP-based upload/download solution.