ASCII vs binary formats


I used to use binary formats for as many building blocks as possible, since by definition they have a fixed-size structure, or at least a size parameter somewhere in a fixed-size header.

In contrast, ASCII strings can only be delimited by searching for some special symbol, most of the time a terminating null byte (possibly paired with a visible symbol like end-of-line). It is possible to put a length field at the beginning, but then the record can not be used directly as a dereferenced string pointer, or the access macros have to involve some slightly ugly pointer arithmetic.
But it is definitely much more convenient to look at strings in a file than to call some special application (or a hex editor) to understand a binary format.
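To make the length-field trade-off concrete, here is a minimal sketch of a length-prefixed string that still hands out a usable `char *`. The names `lstr_new`, `lstr_len` and `lstr_free` are hypothetical and exist only for this illustration; the length lives in the four bytes just before the pointer, which is exactly where the pointer arithmetic creeps in.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: [uint32_t length][bytes ...][\0]
 * The caller gets a pointer to the bytes, not to the allocation start. */
char *lstr_new(const char *src)
{
    uint32_t len = (uint32_t)strlen(src);
    char *raw = malloc(sizeof(len) + len + 1);

    if (!raw)
        return NULL;
    memcpy(raw, &len, sizeof(len));          /* length prefix */
    memcpy(raw + sizeof(len), src, len + 1); /* payload plus null byte */
    return raw + sizeof(len);
}

/* O(1) length lookup: step back over the prefix and read it. */
uint32_t lstr_len(const char *s)
{
    uint32_t len;

    memcpy(&len, s - sizeof(len), sizeof(len));
    return len;
}

/* The pointer must be moved back to the real allocation before freeing. */
void lstr_free(char *s)
{
    if (s)
        free(s - sizeof(uint32_t));
}
```

The string still works with `strcmp()` and friends, but it can never be passed to plain `free()`, and every accessor has to know about the hidden prefix.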

And still, knowing all these details, I managed to implement a (quick and dirty) transaction update log as an ASCII file, which contains a (variable length) IO offset field, a (variable length) size field and a (fixed length) ID substring. Thus each entry in the log file (entries are separated by end-of-line and null bytes) has a variable length, and it is not possible to easily traverse them from the end of the file or to do a binary search.
Each access thus has to involve string parsing, and as we all know, any string parser always has a hidden buffer overflow.
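For comparison, a fixed-size binary record could look roughly like the sketch below. The struct layout, the 20-byte ID length and the `log_*` helper names are assumptions made for this example, not the actual on-disk format; fields are written in host byte order, so a real format would also have to pin down endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define LOG_ID_SIZE 20 /* hypothetical fixed ID length */

/* Hypothetical log entry: every record occupies exactly 40 bytes. */
struct log_record {
    uint64_t offset;          /* IO offset of the update */
    uint64_t size;            /* number of bytes written */
    uint8_t  id[LOG_ID_SIZE]; /* fixed-length object ID */
    uint32_t reserved;        /* explicit padding */
};

static_assert(sizeof(struct log_record) == 40, "unexpected padding");

/* Append one record; returns 0 on success, -1 on error. */
int log_append(FILE *f, const struct log_record *rec)
{
    if (fseek(f, 0, SEEK_END) != 0)
        return -1;
    return fwrite(rec, sizeof(*rec), 1, f) == 1 ? 0 : -1;
}

/* Number of complete records in the file, or -1 on error. */
long log_count(FILE *f)
{
    if (fseek(f, 0, SEEK_END) != 0)
        return -1;
    long end = ftell(f);
    return end < 0 ? -1 : end / (long)sizeof(struct log_record);
}

/* Read record number idx (0-based) from its computed position,
 * without parsing anything that comes before it. */
int log_read(FILE *f, long idx, struct log_record *rec)
{
    if (idx < 0 || fseek(f, idx * (long)sizeof(*rec), SEEK_SET) != 0)
        return -1;
    return fread(rec, sizeof(*rec), 1, f) == 1 ? 0 : -1;
}
```

Reading the last entry is then just `log_read(f, log_count(f) - 1, &rec)`, and if the records are kept sorted by some key, a binary search only needs `log_read()` on the middle index.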

This will be changed, which is a rather simple task. I also decided to put a special header at the beginning of the transaction history file, which will contain a structure version field among others, so that if we ever decide to move to a new format, there will be no data corruption. I will use it when transferring history information, to check whether the versions match. Or better, I will implement the transaction log as a set of nested attributes, where the highest level always has the same format (the length and type of the inner data is enough), the same way I did it for the network protocol.
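Both ideas can be sketched in a few lines. The magic value, the field names and the helpers `history_check_header` and `attr_count` below are hypothetical; the only point is that the top-level framing (type plus length) never changes, so a reader can skip attributes it does not understand.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HISTORY_MAGIC   0x48495354u /* "HIST", hypothetical */
#define HISTORY_VERSION 1u

/* Hypothetical file header placed before the first log entry. */
struct history_header {
    uint32_t magic;
    uint32_t version;
    uint64_t flags;
};

/* Top-level attribute framing: type and size of the payload that follows. */
struct attr_header {
    uint32_t type;
    uint32_t size;
};

static_assert(sizeof(struct attr_header) == 8, "unexpected padding");

/* Returns 0 if the header is usable, -1 on bad magic, -2 on version mismatch. */
int history_check_header(const struct history_header *h)
{
    if (h->magic != HISTORY_MAGIC)
        return -1;
    if (h->version != HISTORY_VERSION)
        return -2;
    return 0;
}

/* Walk the attributes in buf, skipping each payload by its declared size.
 * Returns the number of attributes, or -1 if the buffer is truncated. */
int attr_count(const uint8_t *buf, size_t len)
{
    size_t pos = 0;
    int n = 0;

    while (pos < len) {
        struct attr_header a;

        if (len - pos < sizeof(a))
            return -1;
        memcpy(&a, buf + pos, sizeof(a));
        pos += sizeof(a);
        if (a.size > len - pos)
            return -1;
        pos += a.size; /* inner data may itself be nested attributes */
        n++;
    }
    return n;
}
```

Every bounds check compares against the remaining length before touching the payload, which is the part a string parser tends to get wrong.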

While working on this I also found that it is currently not that simple to use a different hash (other than those supported by OpenSSL) to create an object binding from the common path or data content. There is a good crypto engine abstraction, but its initialization is based on OpenSSL, so it is not that simple to implement our own hash which would take into account, for example, a data center ID (or any other geographical data), so that it would always be possible to spread data among different physical locations for redundancy and better load balancing.

So I plan to extend server initialization to take a special structure which will contain, among other fields (like ID, listening address, root directory to store/load data and a list of the initial routes to connect to), a hashing function pointer, which by default will use OpenSSL.
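A rough sketch of what such a structure might look like follows. All names here (`server_config`, `hash_fn`, `dc_hash` and so on) are hypothetical, and the default hash is a simple FNV-1a stand-in used only so the example builds without libcrypto; in the real code the default would call into OpenSSL.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODE_ID_SIZE 20 /* hypothetical ID length */

/* Hash callback: maps a path or data buffer to an object ID.
 * priv carries hash-specific state, e.g. a data center ID. */
typedef void (*hash_fn)(const void *data, size_t len,
                        uint8_t id[NODE_ID_SIZE], void *priv);

struct server_config {
    uint8_t      id[NODE_ID_SIZE];
    const char  *listen_addr;
    const char  *root_dir;       /* where data is stored and loaded */
    const char **initial_routes; /* nodes to connect to at startup */
    size_t       route_count;
    hash_fn      hash;
    void        *hash_priv;
};

/* Stand-in default (FNV-1a, NOT OpenSSL): keeps the sketch self-contained. */
static void default_hash(const void *data, size_t len,
                         uint8_t id[NODE_ID_SIZE], void *priv)
{
    const uint8_t *p = data;
    uint64_t h = 14695981039346656037ULL;

    (void)priv;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    for (size_t i = 0; i < NODE_ID_SIZE; i++)
        id[i] = (uint8_t)(h >> (8 * (i % 8)));
}

/* Example custom hash: the data center ID becomes the first ID byte,
 * so objects can be steered towards a particular location. */
void dc_hash(const void *data, size_t len,
             uint8_t id[NODE_ID_SIZE], void *priv)
{
    default_hash(data, len, id, NULL);
    id[0] = *(const uint8_t *)priv;
}

/* Zero the config and install the default hash. */
void server_config_init(struct server_config *cfg)
{
    memset(cfg, 0, sizeof(*cfg));
    cfg->hash = default_hash;
}
```

A caller that wants location-aware IDs would set `cfg.hash = dc_hash` and point `cfg.hash_priv` at its data center ID after calling `server_config_init()`; everyone else gets the default without doing anything.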

This is not about ASCII vs Binary.
This is about fixed length and variable length records.

If you want variable length records, you only have to choose whether to keep everything in one file or to spread it across two or more.

The first case is more complicated but the advantage is that you always have everything in one file.

The second one allows you to take advantage of the underlying filesystem (rather than (re)inventing one of your own) and keep attributes and (multi-level) indexes in a convenient way.

A reasonable trade-off may be to write your logs straight out in the usual way and process them afterwards with an index builder, or maybe just put them in an indexed database where you can do what you want.

I may be completely off base, but if I understand what you might be attempting to achieve, perhaps the following idea might help:

Say you have files whose copies 1 & 2 you want to spread across nodes A and B. Your hash function varies from 0 to 10. If you force your nodes to pick IDs 5 and 10, you could use the hash value, say 3, for the first copy, and the original hash value plus half of the hash space (5), modulo the size of the hash space (10), i.e. 8, for the second copy. This would force the copies to always be on separate nodes from each other.

Extend this idea for more copies/geo locations. If you have more nodes, ensure that the geographically separate ones are also in separate ID spaces. In other words, assuming that you have 6 nodes in two locations, ensure that the nodes in location A have IDs below 5 and those in location B above 5. This way files will always get one copy on one of the three nodes in location A and one copy on one of the three nodes in location B.
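The arithmetic can be sketched as follows, under two assumptions that are not spelled out above: hash values run from 0 to 9 (a space of size 10), and a key belongs to the first node whose ID is strictly greater than the key, so the node with ID 5 covers 0..4 and the node with ID 10 covers 5..9. The function names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hash value for copy number `copy` (0-based) out of `copies`:
 * the original hash shifted by an equal slice of the hash space
 * per copy, wrapped around at the end of the space. */
uint32_t replica_hash(uint32_t hash, uint32_t copy,
                      uint32_t copies, uint32_t space)
{
    uint64_t shifted = (uint64_t)hash + (uint64_t)copy * (space / copies);

    return (uint32_t)(shifted % space);
}

/* Assumed ownership rule: a key belongs to the first node whose ID is
 * strictly greater than the key; keys past the last ID wrap to node 0.
 * node_ids must be sorted in ascending order. Returns the node index. */
size_t owner_node(const uint32_t *node_ids, size_t n, uint32_t key)
{
    for (size_t i = 0; i < n; i++)
        if (node_ids[i] > key)
            return i;
    return 0;
}
```

With node IDs {5, 10}, hash 3 lands on the first node and its second copy, `replica_hash(3, 1, 2, 10)`, which is 8, lands on the second. Under the ownership rule assumed here, the same holds for every hash value in the space.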

If you have multiple criteria that you need to split things up on, you could consider giving nodes a separate ID for each criterion. A node could therefore pretend to be several nodes and answer to multiple subdivisions of the hash space.

That's roughly how the elliptics network works.

I think OpenSSL's interface is terrible. I wrote a small wrapper around common hashes and also mentioned alternative libraries here: http://www.pixelbeat.org/programming/lib_crypto.html

It is not only because of the interface, but mainly because we would not be able to add something new when needed, and flexible hashing is a key point in implementing proper data distribution, by providing our own hashes or interfaces to the appropriate libraries.