POHMELFS server update


Moved the whole building process to autoconf tools.

One can now build it easily like this:

$ ./autogen.sh
$ ./configure --with-kdir-path=/some/path/linux-2.6.pohmelfs --prefix=/usr/local/pohmelfs
$ make
$ make install

It properly detects the different paths to the needed libraries, does not build the configuration tool if the netlink connector is not accessible (for storage nodes), and so on.

Effectively, one only needs the fs/pohmelfs/netfs.h header file in the above kernel directory for a successful build, so one can create this directory somewhere, put the header there, and build the server, the configuration tool and the remote cache flushing utility.
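
A minimal sketch of that header-only directory, assuming a copy of netfs.h is already at hand. The helper name make_stub_kdir and the paths in the comments are hypothetical and not part of the POHMELFS sources; only the fs/pohmelfs/netfs.h layout and the --with-kdir-path option come from the text above.

```shell
#!/bin/sh
# Hypothetical helper: create a stub "kernel directory" that contains
# nothing but fs/pohmelfs/netfs.h, the only header the build needs.
make_stub_kdir() {
    stub_header=$1   # path to an existing copy of netfs.h
    stub_kdir=$2     # directory to create, later passed to --with-kdir-path

    if [ ! -f "$stub_header" ]; then
        echo "no such header: $stub_header" >&2
        return 1
    fi

    # Recreate the path the build looks for inside the kernel tree.
    mkdir -p "$stub_kdir/fs/pohmelfs" || return 1
    cp "$stub_header" "$stub_kdir/fs/pohmelfs/netfs.h"
}

# Example use (paths are placeholders):
#   make_stub_kdir ./netfs.h "$HOME/pohmelfs-kdir"
#   ./configure --with-kdir-path="$HOME/pohmelfs-kdir" --prefix=/usr/local/pohmelfs
```

The function only prepares the directory; configure, make and make install are then run exactly as shown earlier.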

I also spent a lot of time porting the server to FreeBSD, but then found that 7.0 does not have fstatat() and some other calls, so I dropped this task. Rediscovered, actually: the same problem existed when I ported the elliptics network, but those calls were not really needed there, so elliptics nodes now run successfully on FreeBSD.

Anyway, this will not be needed once I port POHMELFS to elliptics.
Maybe I will even eliminate the userspace POHMELFS server and move the whole client side into the kernel, but so far I plan to change the userspace server first so that it becomes a part of the elliptics cloud, and to drop directory export support.

This requires implementing support for storing directory content within a file (first, on the userspace server): each directory entry in POHMELFS will become a record in the directory object, stored in the elliptics cloud. I will likely implement some kind of lazy tree of direntries; it will be indexed by object name and will contain the inode information needed by the filesystem.

That's the plan for the next few days, stay tuned!

Does that mean exporting a directory (like nfs does) will not be supported in some future release?

I think that is quite a valuable feature for when a storage setup grows out of a single server or when one tries to switch from nfs to something "better". The problem with moving to a networking fs that has no directory export feature is that the move can only happen in a disruptive and "this and nothing else" way, i.e. there is no migration path possible.

Thank you.

Also, don't forget that NFS does not have a user space server implementation yet that supports locking. If I remember correctly, POHMELFS is a user space implementation, and I was hoping to use it inside of linux vservers to virtualize various different exports from inside separate vservers (once debian kernels are up-to-date enough to do so). This is a simple NFS like usage pattern that current NFS does not quite support yet since it is a kernel space process on linux.

So, I definitely would like to request that you not lose a better NFS replacement. If the server works well inside of vservers, you might find a whole new niche for your filesystem.

What's the problem with running an NFS server in userspace?

It is a bit hard to work through firewalls though.

NFS supports locking (with some problems) as a separate daemon (lockd).

If you know of a user space NFS implementation that fully supports posix locking, please point me to it! I have yet to find one.

Or, if you know of another solid networked linux FS that does this, and blocks on server failures the way that NFS does and continues to operate as soon as the server returns, point me there too! (please not SMB or variants)

This comes pretty close:
http://hail.wiki.kernel.org/index.php/Nfs4d

I would be curious to know what is missing, if anything, WRT POSIX locking.
-jgarzik

Oh, also, are there any debian packages for this anywhere by chance?

Jeff,

Thanks for the link. I would like to give it a try. Is this entirely in userspace? Have you run it in a vserver? I am having a hard time finding much info about it; do you have a pointer to the docs (I saw setup.txt in the git depot, but it seems limited), or perhaps I am simply missing something? It looks like this does not export a normal filesystem like the normal nfs server does; it stores everything in the Berkeley DB, is that correct?

As for what is missing, you tell me, it's your project, right? I have never heard of it. :) It sounds like you believe it supports posix locking, great! Does it handle both server and client reboots gracefully? Would you trust it with a mail server and imap server reading/writing to it concurrently? Would you run nfsroot over it?

Thanks.

Answers...

1. Yes, the server is entirely in userspace.

2. No, I have not tested it with Linux's IP virtual server, but in theory it should work.

3. Correct, there is little documentation. Run "./nfs4d --help" and "./nfs4dba --help" to determine how to invoke the daemon, and manipulate the metadata database.

4. 50% correct: nfs4d stores filesystem metadata in db4. Filesystem data is stored in the local filesystem.

5. Yes, server and client reboots are handled. This is specified in the NFSv4 specification.

6a. It will work just fine for mail server or nfsroot.

6b. However, it is not yet production, so I would not trust it with production data, until it sees more testing from more users.

7. I am not aware of any Debian packages for nfs4d.

-jgarzik

Thanks so much for the detailed responses!

...4) For metadata, it sounds like that includes permissions and user/group info? Does the local Filesystem data resemble the filesystem content at all, or is it an opaque format, weird hashed named files...? Since it runs as non-root, I assume it cannot just export a local filesystem, so how do you get data into it? Do you have to start with an empty system and add files through nfs, or is there an import facility?

...3) I suspect that the help options you gave me here might answer more of my questions above, but since that would require me downloading and building everything, I can't do that yet. Might I suggest you leave the current output of those two helps on a web page somewhere? If these helps are one of the best places where one can get a glimpse into the design of your project, this might be very useful to others also. This way, people could learn more about your design before committing to building and installing something.

Does it require using an external portmap daemon? Are there other non included daemons that it requires?

"filesystem metadata" is a standard term, referring to all information sans the actual file data: last-modified time, size, uid/gid and ACLs, version, file mode, MIME type, symlink text, blkdev/chrdev major and minor, etc.

It was a key design decision that directly exporting the local filesystem is a mistake, because you are always competing ("racing") with the kernel and other userland processes when accessing each file. nfs4d uses an alternate model, what I call the "database model", where all data is stored in a format optimized specifically for NFS remote access. Since nfs4d "owns" the file data, it is guaranteed that it will never race with another local process. It is guaranteed that nfs4d has 100% control over the data it manages.

100% ownership also gives nfs4d the flexibility to implement features such as POSIX ACLs and other advanced filesystem features, without having to hack a solution because the underlying Linux filesystem does not support that feature. (yet another reason why directly exporting a local fs is a bad idea — you are dependent on the varying features of the underlying blkdev-based filesystem)

So, you are correct — you start with an empty filesystem and fill it with data, just like PostgreSQL and MySQL start with an empty database, that is eventually filled with useful data.

An external portmap daemon is not required: NFSv4 specifies a well-known port (2049), and the server binds to that.

I agree that better documentation is needed... maybe I could recruit you as a volunteer, to update http://hail.wiki.kernel.org/index.php/Nfs4d ? :)

The one which goes in a debian installation supports that.

As for blocking on errors: this is a major tradeoff. If the remote server breaks, you definitely do not want to wait until it comes back online, since all state information is lost and you will need to remount.

I suppose that NFS over TCP also returns an error when the TCP connection goes down; TCP is the default in NFSv4 (and higher). So its behaviour will be the same as that of POHMELFS (which also works over TCP).

UDP can block forever though.

The one which goes in a debian installation supports that.

There are several NFS servers in debian, could you be specific? The ones I have tried were either kernel space servers (even though you start them from userspace), or they did not support locking.

The debian NFS servers I am aware of:

* The nfs-kernel-server which is obviously a kernel server. :)

* The nfs-user-server which as you can see here, does NOT support locking:
http://packages.debian.org/unstable/net/nfs-user-server

* The unfs3 server which is a user space server, but again as you can see here, does NOT support locking:
http://packages.debian.org/sid/unfs3

Is there another user space server which does support locking?

Indeed, no userspace NFS server in Debian supports locking. I was under the impression that nfs-user-server works with lockd, but apparently I was wrong.

This will be solved in POHMELFS eventually as a separate project, and the locks will be advisory. Distributed locks will be implemented using the Paxos algorithm.

Do I take this to mean that POHMELFS (without elliptics) does not currently support locking?

For anyone interested, I have found a linux userspace NFS server that claims to support NFSv4 locking (not V3).
http://nfs-ganesha.sourceforge.net/
I am attempting to set it up and test it (setup is a bit of a beast, requires a DB etc.). They have debian packages on their site.

As for blocking, it is true that this can be a problem if a server goes down and never comes back up. The idea, though, is that if you run an active/passive DRBD setup and your primary server goes down, the blocking allows the secondary server to take over gracefully for the primary without disrupting the clients; all they see is a delay.

If the primary NFS server fails, it is only possible to avoid disrupting NFS clients in one of two cases:

  • NFS version <= 3, and using something like IP virtual server
  • NFS version 4, using the fs_info structures defined in spec for fail-over and migration.

drbd operates at a lower level, and is completely irrelevant to NFS service stability.

And for NFS version 4, you must migrate and/or recover client state — not transparent to client, and again, nothing to do with drbd.

NFS, in general, was not built for good fail-over. This is partially solved in NFS v4.1, a hideously complex protocol (which does fix several long-standing NFS problems).

-jgarzik

Well, with DRBD and virtual IPs, you really are simply simulating a server reboot. If the server can reboot gracefully without interrupting the clients, then you can failover with DRBD without any protocol support. Simply failover the IP and storage and your server should act as if it has just simply rebooted.

There are two key differences between NFSv4 and previous versions (that are relevant to this thread...):

  • TCP connections are used by default
  • The protocol is stateful, which implies that a client or server reboot requires recovery before work can proceed.

My point was that no NFSv4 server can reboot without interrupting the clients, making "failover with drbd without any protocol support" impossible.

NFSv4 protocol's statefulness permits greater client caching, but with the added penalty of requiring expensive, protocol-specific recovery when the server + client connection is severed for any reason.

-jgarzik