On development process.
Jonathan Corbet wrote me an open letter in which he describes the Linux kernel development process, although it is a little bit biased toward my particular blog cry :)
In a nutshell, Jonathan selected DST to show what was done wrong from the process point of view. And although making excuses is not interesting, I will show that (in my humble opinion) the objections are not entirely correct.
For those who do not like such details, you may jump to the end of the post to read the frustration points I have with the process. They are not related to review, inclusion or the weather on Jupiter. Those things are so small that they are really not worth our time. What I do not like is just how communication between people happens. And only that really matters: no kilobytes of code can make a good process, only people do.
But let’s first dive into the small details of Jonathan’s post, where he analyzes my easter-egg post. First, the reviewers’ comments. Andrew Morton, who reviewed DST in October and who does exceptionally great work doing this all the time, noted among other things that DST does not have enough documentation about the network protocol and generic details. Jonathan, sorry, but your conclusion that I did not provide documentation is wrong. Within a few hours of that review I made a new release where about 10% of the patch was devoted to documentation.
Let’s dig into the patch. I will paste several short examples of the documentation blocks that were added.
Generic crypto bits:
+ * DST uses generic iteration approach for data crypto processing.
+ * Single block IO request is switched into array of scatterlists,
+ * which are submitted to the crypto processing iterator.
+ * Input and output iterator initialization are different, since
+ * in output case we can not encrypt data in-place and need a
+ * temporary storage, which is then being sent to the remote peer.
There are also some details on how the most interesting parts work, in a few words.
Network export bits:
+ * Initialize listening state and schedule accepting thread.
+ * Each server's block request sometime finishes.
+ * Usually it happens in hard irq context of the appropriate controller,
+ * so to play good with all cases we just queue BIO into the queue
+ * and wake up processing thread, which gets completed request and
+ * send (encrypting if needed) it back to the client (if it was a read
+ * request), or sends back reply that writing succesfully completed.
+ * When client connects and autonegotiates with the server node,
+ * its permissions are checked in a security attributes and sent
+ * back.
+ * Groovy, we've gotten an IO request from the client.
+ * Allocate BIO from the bioset, private data from the mempool
+ * and lots of pages for IO.
+ * Free bio and related private data.
+ * Also drop a reference counter for appropriate state,
+ * which waits when there are no more block IOs in-flight.
There are other comments: every reasonably big function has at least a small comment describing what it does.
Client bits:
+ * Block autoconfiguration: request size of the storage and permissions.
+ * State reset is used to reconnect to the remote peer.
+ * Ping command to early detect failed nodes.
+ * Send block autoconf reply.
+ * Our block IO has just completed and arrived: get it.
+ * Initialize client's connection to the remote peer: allocate state,
+ * connect and perform block IO autoconfiguration.
+ * Send transaction to the remote peer.
These were some random comment strings, which preceded the appropriate code.
Thread pool, the first lines:
+ * Thread pool abstraction allows to schedule a work to be performed
+ * on behalf of kernel thread. One does not operate with threads itself,
+ * instead user provides setup and cleanup callbacks for thread pool itself,
+ * and action and cleanup callbacks for each submitted work.
+ *
+ * Each worker has private data initialized at creation time and data,
+ * provided by user at scheduling time.
+ *
+ * When action is being performed, thread can not be used by other users,
+ * instead they will sleep until there is free thread to pick their work.
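The callback shape described in that comment can be illustrated in userspace. What follows is only a rough sketch under loud assumptions: it uses POSIX threads instead of kernel threads, every name in it (struct pool, pool_start, pool_schedule, pool_stop and the demo_* helpers) is hypothetical and is not the API of DST's thread_pool.c, and the "submitters sleep until a thread is free" behaviour is approximated with a one-item handoff slot. Allocation failures in the demo callbacks are not handled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define POOL_MAX 8 /* hypothetical cap; every name here is illustrative */

struct work {
	void (*action)(void *priv, void *data); /* runs in a worker thread */
	void (*cleanup)(void *data);            /* optional, runs after action */
	void *data;                             /* supplied at scheduling time */
};

struct pool {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	struct work slot;             /* one work item waiting for a free worker */
	int full, stop, nthreads;
	pthread_t threads[POOL_MAX];
	void *(*setup)(void);         /* creates a worker's private data */
	void (*teardown)(void *priv); /* destroys it; called under the pool lock */
};

static void *worker(void *arg)
{
	struct pool *p = arg;
	void *priv = p->setup();
	pthread_mutex_lock(&p->lock);
	for (;;) {
		while (!p->full && !p->stop)
			pthread_cond_wait(&p->cond, &p->lock);
		if (!p->full)
			break; /* stop requested and nothing left to do */
		struct work w = p->slot;
		p->full = 0;
		pthread_cond_broadcast(&p->cond); /* a submitter may be sleeping */
		pthread_mutex_unlock(&p->lock);
		w.action(priv, w.data);
		if (w.cleanup)
			w.cleanup(w.data);
		pthread_mutex_lock(&p->lock);
	}
	p->teardown(priv);
	pthread_mutex_unlock(&p->lock);
	return NULL;
}

static int pool_start(struct pool *p, int n)
{
	assert(p->setup && p->teardown);
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->cond, NULL);
	while (p->nthreads < n && p->nthreads < POOL_MAX &&
	       !pthread_create(&p->threads[p->nthreads], NULL, worker, p))
		p->nthreads++;
	return p->nthreads; /* number of workers actually created */
}

/* Sleeps until a worker has picked up the previously submitted work. */
static void pool_schedule(struct pool *p, struct work w)
{
	pthread_mutex_lock(&p->lock);
	while (p->full)
		pthread_cond_wait(&p->cond, &p->lock);
	p->slot = w;
	p->full = 1;
	pthread_cond_broadcast(&p->cond);
	pthread_mutex_unlock(&p->lock);
}

static void pool_stop(struct pool *p)
{
	pthread_mutex_lock(&p->lock);
	p->stop = 1;
	pthread_cond_broadcast(&p->cond);
	pthread_mutex_unlock(&p->lock);
	for (int i = 0; i < p->nthreads; i++)
		pthread_join(p->threads[i], NULL);
}

/* Demo: workers sum into private counters, folded into demo_total at exit. */
static long demo_total;
static void *demo_setup(void) { return calloc(1, sizeof(long)); }
static void demo_add(void *priv, void *data) { *(long *)priv += *(int *)data; }
static void demo_teardown(void *priv) { demo_total += *(long *)priv; free(priv); }

static long pool_demo(void)
{
	static int vals[] = { 1, 2, 3, 4 };
	struct pool p = { .setup = demo_setup, .teardown = demo_teardown };

	if (!pool_start(&p, 2))
		return -1;
	for (int i = 0; i < 4; i++)
		pool_schedule(&p, (struct work){ demo_add, NULL, &vals[i] });
	pool_stop(&p);
	return demo_total;
}
```

Calling pool_demo() from a main() sums the four values through two workers. The point of the sketch is only the split between per-worker private data (setup/teardown) and per-work data (action/cleanup); the kernel code has its own locking and lifetime rules.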
There are others, but yes, these are technical details which help in understanding the code and are likely not very useful for a generic review. But I did run through the code and documented every piece of code which exceeds one page :). As requested.
In the header, which among other things contains every structure used by DST, I documented every single line of data (extern function declarations are not included :). For example, the network command used by DST:
+struct dst_cmd
+{
+ /* Network command itself, see above */
+ __u32 cmd;
+ /*
+ * Size of the attached data
+ * (in most cases, for READ command it means how many bytes were requested)
+ */
+ __u32 size;
+ /* Crypto size: number of attached bytes with digest/hmac */
+ __u32 csize;
+ /* Here we can carry secret data */
+ __u32 reserved;
+ /* Read/write bits, see how they are encoded in bio structure */
+ __u64 rw;
+ /* BIO flags */
+ __u64 flags;
+ /* Unique command id (like transaction ID) */
+ __u64 id;
+ /* Sector to start IO from */
+ __u64 sector;
+ /* Hash data is placed after this header */
+ __u8 hash[0];
+};
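The layout of that header can be checked in userspace. This is a minimal sketch, assuming fixed-width <stdint.h> types in place of the kernel's __u32/__u64 and a C99 flexible array member in place of hash[0]. The names dst_cmd_sketch and dst_cmd_sketch_len() are hypothetical and not part of DST, and the helper assumes, purely for illustration, that csize bytes of digest directly follow the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the quoted header; the _sketch suffix marks it as hypothetical. */
struct dst_cmd_sketch {
	uint32_t cmd;      /* network command itself */
	uint32_t size;     /* size of the attached data */
	uint32_t csize;    /* number of attached bytes with digest/hmac */
	uint32_t reserved;
	uint64_t rw;       /* read/write bits */
	uint64_t flags;    /* BIO flags */
	uint64_t id;       /* unique command id */
	uint64_t sector;   /* sector to start IO from */
	uint8_t  hash[];   /* hash data is placed after this header */
};

/* Four 32-bit and four 64-bit fields: 4*4 + 4*8 = 48 bytes, no padding. */
static_assert(sizeof(struct dst_cmd_sketch) == 48, "unexpected padding");
static_assert(offsetof(struct dst_cmd_sketch, hash) == 48,
	      "hash must start right after the fixed fields");

/* Assumption for illustration: csize bytes of digest directly follow the header. */
static size_t dst_cmd_sketch_len(const struct dst_cmd_sketch *cmd)
{
	return sizeof(*cmd) + cmd->csize;
}
```

With csize set to 20 the helper returns 68. Byte order on the wire is not covered by this sketch.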
Let’s take the first message I always send with DST, which should tell us what DST is and why people may want to use it. The same is written on the DST homepage.
Distributed storage is a feature-rich network block device.
Supported features:
* Kernel-side client and server. No need for any special tools for data processing
(like special userspace applications) except for configuration.
* Bullet-proof memory allocations via memory pools for all temporary objects
(transaction and so on). All clients structures are allocated as single transaction
from the memory pool and except this there is no allocation overhead.
Network adds own though. Server uses memory pools too, but number of allocations is
higher (bio, transaction and pages for IO).
* Zero-copy sending (except header) if supported by device using sendpage().
* Failover recovery in case of broken link (reconnection if remote node is down).
* Full transaction support (resending of the failed transactions on timeout or after
reconnect to failed node).
* Dynamically resizeable pool of threads used for data receiving and crypto processing.
* Initial autoconfiguration. Ability to extend it with additional attributes if needed.
* Support for barriers and other block io request flags.
* Support for any kind of network media (not limited to tcp or inet protocols) higher
MAC layer (socket layer). Out of the box kernel-side IPv6 support (needs to extend
configuration utility, check how it was done in POHMELFS).
* Security attributes for local export nodes (list of allowed to connect addresses
with permissions). Read-only connections.
* Ability to use any supported cryptographically strong checksums.
Ability to encrypt data channel.
* Keepalive messages to early detect failed nodes.
Please tell me what I should add here so that it starts to tell a story. It shows what it is, what it has, and how it works in user-visible cases. Apparently that is not enough?
Where I agree with Jonathan is that the kernel Kconfig entry is really pathetic. Yes, that sucks. And the red selection of that paragraph in Jonathan’s letter really screams that I am wrong. But it shows nothing beyond the fact that the kernel config option is not documented. I will fix that, since apparently questions like
Why might they want to use DST? Where can they get the associated tools? This, too, is a fatal error for any substantive kernel change.
were not answered above. Btw, links to the archive and tools are always sent in the first mail.
It is up to you to decide whether I wrote enough documentation and whether it is good or not. But that is not the whole story. Interesting things come next, where the generic thread pool model implemented in DST is discussed. The quote:
Andrew naturally called out the generic-looking thread pool implementation buried deep within DST; shouldn’t it pulled out and made more generic? Your response can be paraphrased as “I can’t be bothered to get the API past the review process, which, in any case, is biased toward those who are ‘closer to the high end’.”
What I really answered is a little bit different: I did not want to create another generic project which will not be merged, so that users would need to fetch multiple independent patches.
Why will it not? A good question. I do not know :)
Really. At the kernel summit Benjamin Herrenschmidt proposed some ideas on how to implement this (3 projects were analyzed). Somewhat in parallel, Arjan van de Ven proposed another kernel thread pool for asynchronous function calls. Creating yet another pool of threads? Frankly, do you think it is a good idea to have 5 thread pool implementations in parallel? :)
And here come the last words. The words about process. People can do good work, but if they can not make progress on it, the work disappears. Everyone has a different idea of what progress is, and progress for someone may be a waste of time for another. I like to do things. I like that process, and I like to create (really :) good things. I do not ask for inclusion as the main result. I just want to get feedback on what is wrong and what is good. So I send my work to the mailing lists expecting that people will discuss the project (if interested). A review would be great, but I understand that it may or may not happen depending on time and other constraints. When I get no responses, this basically means that either the work is perfect and there is nothing to discuss, or no one really cares, which I can accept of course.
When something is sent to people, the sender expects them to comment on it somehow. If the receivers do not have time for that, say it. If they do not care, say it. Then the next time a patch is about to be sent, the sender can clearly determine how to make progress by sending his work to those who are interested.
It is very simple: reply once per thread, once per the whole idea, once to show that you are (not) interested. And if there are questions which happened to be left unanswered, or something is unclear or hidden, say it, so that it is possible to make progress instead of sending work into a black hole.
So effectively there is no frustration about inclusion. Absolutely. I’m perfectly ok maintaining out-of-tree projects. What really disappoints is that the discussion process does not happen. And since there is no feedback on the work done, there is no way to improve it. That’s the point.
Thanks a lot to all the people who review all the projects (Andrew in particular :) and to Jonathan for these fun mails :)
Take care!

When I follow the link:
The article you have tried to view (An open letter to Evgeniy Polyakov) is currently available to LWN subscribers only.
kernel warez and pr0n :)
Would it help to get into drivers/staging first? That might encourage more people to review the code.
Probably, but since there is no feedback on patches, they can not be accepted into any tree, since no one reads them, likely because no one understands it deeply enough to try, because I did not know what I should write so that people got involved, because… I can enter infinite recursion here :)
Email GregKH and ask him to put you into drivers/staging. You probably need a git tree he can pull from with everything in the right place. You will have to fix a few things up since you are in a different place in the tree. Staging is not supposed to require fully reviewed code.
I’ve been posting a new version of the IR subsystem for a couple of months and I’ve only received a couple of comments on it too. My new strategy is to get it into drivers/staging. Getting it into staging should get people concerned that it will be in the mainline kernel pretty soon so they better make some time for reviewing it before it lands.
Now that we have staging I think it is a lot easier to do reviews there than from patches. It makes it easy to build and test the code. Note that code in staging still needs to be broken into multiple patches.
As I see it, the main problem of DST is that it isn’t used by anyone.
To solve that, you should tell people what problem is solved by DST and why it’s better than other solutions. People who would want to use DST probably use something else now. Perhaps things like SAN or distributed filesystems, I don’t know. Why would anyone want to use DST? I think that’s part of the story you need to tell, at least it’s the prologue. Or just the back cover. But no one is interested in implementation details if they don’t know what the problems are which you try to solve.
So try turning this from a one man’s project to a community thing with users and other developers (those will come if you spark enough interest). You like solving interesting technical problems, now find the people having the problem you just solved. Saying you implemented a distributed network storage thingy doesn’t cut it. They might be only interested in getting their local network filesystem fast enough for their workload, not knowing that DST would be a much better solution. Or they buy SAN boxes for way too much money, while a couple cheap servers running DST would work great. I dunno.
So basically don’t try to expect developer interest if there’s no user interest.
Greetings,
Indan
As was already said, there are users.
The problem is not inclusion, do not be biased by that part.
The main issue, in my opinion, is the absence of discussion. Since there is no discussion, there is no way to determine what others need in order to work with it and make progress.
In terms of telling a story it’s as much about things like splitting up the patch series so that each individual patch does as little as possible. This makes review a lot easier since each individual patch accomplishes a single thing (with a clear description in the patch), making them much easier to understand and verify. The less code someone has to review at once the better. At the minute your patches tend to have very brief descriptions of the individual patches and don’t seem to do the layering on features thing.
See, for example, how the sfc drivers were merged: a (still quite large) bare bones version of the driver was merged then features have been added to it over time.
I split it into 5 pieces: introduction, network bits (client and server are in different files), thread pool and crypto processing as its users (also in separate files), core functionality (userspace interface, block registration, sysfs, netlink, transactions) and make files.
Individual patches do not contain descriptions though, since subject and filenames tell the story:
Subject: [resend take 3 3/4] DST crypto thread pool.
and two files: crypto.c and thread_pool.c for example.
Yes, I saw. I was thinking that the core functionality can be split further. For example, the netlink and sysfs interfaces could probably be layered on as separate patches – possibly the userspace interface as well.
Yes and no. There are apparently some users, but you don’t have much contact with them. Try to involve them with the project, they are the ones that should be able to give you the most useful feedback. Then you’ll get the progress you want. Especially if some of those users are other developers. And having a couple users doesn’t mean you shouldn’t try to get more, so all I previously said still holds.
Simply posting your work and expecting people to discuss the technical merit of the implementation is a bit too optimistic. If you want such discussion then try asking questions. Things like “There’s problem X, and I think solution Y is probably best to solve it (here’s the code), anyone knows a better approach?”
That’s the problem of adding new features, instead of e.g. improving core functionality or adding drivers. The last one is really easy, the former is guaranteed to get the discussion you want. But for new functionality people need to understand why it’s needed and that it’s worth adding it. If you can’t convince people to use it then it’s probably not worth it. Getting it in mainline is about having enough demand for it, if that’s there then you’ll get the discussion you want. And then it’s a matter of getting it merged, but that should be easy at that point, just polish it till it shines.
The reason I’m talking about merging so much is because if it isn’t at merging point, you won’t get the discussion you like to have. Because without merging you will only get feedback from people who care about DST, which are your users. Once it’s at merging point people who don’t care about DST but do care about the kernel get interested as well.
Good luck,
Indan
I have seen many interesting concepts from you and would not like it if you would walk away from kernel development.
The problem *I* have lies in your use of English. Sometimes I find your text very tiresome to read. I can imagine that this is a reason the discussion stops early.
Not sure what to do about it.
Regards,
Mike
Note that I am not a native speaker either. So who am I to ‘judge’?!
Usually people say that they do not understand something and ask to clarify.
But in this case discussion just does not start, so apparently this is not an issue.
But yes, my English is far from being perfect :)
He won’t :D
Hi.
I just looked at the git repositories (from the web interface) and at the project page. From a user perspective, I’m lost:
I know NBD and DRBD, but how does DST differ from those? Given you called it distributed storage, I assume that it has some difference from the named net related block device layers, but which?
You feature list sounds impressive, but it doesn’t help me understand why (or for what) I should use DST.
And the documentation included with the setup tool doesn’t help much, either. For example: After the “simple connect”, how do I access the resulting block device (assuming there will be any)? Does DST only sound like that, or does it support some mirroring of the underlying storage? ….
regards,
Sven
DST is a block device with the features listed on the main page. In previous versions it was possible to implement linear and mirroring mapping for requests, but I dropped that functionality in favour of device mapper. So essentially DST is NBD/iSCSI/AoE with lots of additional features which allow implementing more robust and fault-tolerant storage.
After the device has been created via the described command, you will find it in the /dev directory under the /dev/dst-storagename name. It can be mounted, imported into a RAID array or operated in any other way an administrator may want to work with a block device.
So effectively there is nothing major to add to what was already written on the homepage: it is a network block device with lots of features, created for failsafe network storage over common networks.
> Btw, link to archive and tools are always sent in the first mail.
It’s ok as long as those interested are supposed to know and pull themselves anyway, but to consider for inclusion it’s really reasonable to add a link beforehand — last time I checked crisis still didn’t influence link prices in kconfig largely. :)
> Frankly, do you think it is a good idea to have 5 thread pool
> implementations in parallel? :)
Depends: remember virtualization folks who were doing quite a few similar things a bit differently but decided to join forces in containers@osdl, even if it meant feature regressions for some of the larger implementations on top of merged lower level features (like, CPULIMIT in OpenVZ so far went south _but_ the whole patch shrank IIRC).
You might ask those who use the code whether it will be frustrating to pull from one more repo — or consider maintaining two separate repos/branches _and_ merge them into “build this” one so this step isn’t necessarily done by everyone locally.
The point is, if you and other people who need thread pools will eventually come up with generalized implementation, it’s more likely a winner even if N-1 more similar ones are left in dust. We try, we fail and succeed, and communicate along the way… such is life.
> If receivers do not have time for that, say it, do not care: say it.
It’s many times more seconds than “next”, and quite a lot more spam I’m afraid.
—
Regarding users: I guess there might be interest from storage/hpc folks (think clusterfs or gpfs), did you announce it in places where those lurk? I don’t remember what exactly our CTO told regarding POHMELFS but we’re heavily into HPC and currently using (while fixing up along the way) Lustre, if you’re interested I can try to hook up both of you.
Another way to increase user visibility is doing e.g. livecd, virtual appliance image or installer so that the technology can be evaluated more easily. If you’re interested in that, we can try and do it: ALT Linux (the team I participate in) has decent git-based kernel build infrastructure and distro generator; I don’t know the former well (my colleagues do) but still have some release manager skills (e.g. how to bake ALT Linux 4.0 Terminal).
The first time I saw distro testing to help merge was 2002 when reiserfs patches for parted were merged upstream after testing in Sisyphus and being accepted for a major distro release. Users enjoyed it as well, of course.
You can jabber/mail me at mike altlinux org or shigorin gmail if any of these proposals seem reasonable. Jabber’s better since my local antispam filter is getting a bit insane being a bit dated…
–
WBR, Michael Shigorin
ALT Linux Team
Media Magic Ltd
Ah, and I’m lagging for a full week on package update notifications: http://sisyphus.ru/srpm/kernel-modules-dst-std-def ;-)
–
mike@
DST is too simple a project to need a special livecd or anything like that; it is just a driver which allows building network storage. Still, it is advanced enough not to be considered trivial, and it requires some feedback from the kernel people, which does not happen.
Thanks for looking at POHMELFS. At its current stage it is not comparable with Lustre: there is only initial read balancing between nodes and parallel writing, nothing particularly interesting from the HPC point of view. It is more like pNFS now, although there are some questionable moments. The server will get distributed extensions soon, design and progress reports will be posted here, and then it will step into Lustre’s world.
As for the details from the letter… They are valid of course, but these are just details which hide the real problem. I can split the thread pool from DST and submit it alone, and it is possible (and done) to update kconfig/documentation/mail description/whatever else, but discussion does not happen. That’s the main problem I do not like; everything else is just technical issues which may be resolved in a few moments.
That’s a bit old version I think, but yes, AltLinux added DST into its package list some time ago :)
Hello,
I read the LWN open letter and your response. Interesting stuff. Good luck with getting it merged.
One question:
Could your file system replace RAID for data protection?
I mean could I set up two or more disks with DST so that data is stored fault tolerant on each disk?
Keep up the good work & Thanks
- Udo -
It was implemented in the previous versions, but now I moved away from this decision in favour of using device mapper and its RAID capabilities on top of multiple DST devices.
hi
If you don’t care about inclusion then don’t waste kernel developer’s time. Just send an email to LKML announcing your out-of-tree project.
If you *do* care about inclusion there are some strategies you can follow:
It is a lot of work, but it’s the only way your projects will succeed. I’m sorry, but it doesn’t matter if some code has awesome technical details; ideas that don’t spread are failed ideas.
I imagine an 18th century genius who made great achievements, but he wasn’t interested in pushing his ideas, he was happy with the limited visibility he had. Now he’s gone, and his ideas too. One day you’ll be gone too (and all of us) but the ideas might live longer, or at least be part of something bigger.
Those are well-known, plain and boring copybook truths. It is a theory that people who do nothing like to tell to the submitters. Unfortunately, after 6+ years of kernel development I can tell you – this does not work :)
Practice is very different: only politics works.
‘Pushing ideas’ means playing those games. I do not like this, but kind of participate. So DST and POHMELFS are in mainline.
I saw patches go in without a proper changelog, with unaddressed objections in the discussion (which were clearly highlighted in the practical runs), without documentation…
If you did your homework, you would find that all the projects I mentioned had all the needed bits quite long ago.