Tag Archives: Networking

Hacking jabber chats

Actually I wanted to make gajim work with Yandex Online. By default it does not connect, and neither does Empathy from the latest Ubuntu; the latter shows a network error window and that’s all.

So, what is the geek way to fix it? Certainly not to pester support or the like (although some internet trolling brings some lulz too). There is a special client (open source, of course) which works with the service, so let’s compare the network protocols used by the working and the failing clients.

After some debugging and tcpdumps I got this:

gajim->server: <?xml version='1.0'?><stream:stream xmlns="jabber:client" to="xmpp.yandex.ru" version="1.0" xmlns:stream="http://etherx.jabber.org/streams" >
server->gajim: <?xml version='1.0'?><stream:stream xmlns='jabber:client' xmlns:stream='http://etherx.jabber.org/streams' id='3512116648' from='ya.ru' xml:lang='en'><stream:error><host-unknown xmlns='urn:ietf:params:xml:ns:xmpp-streams'/></stream:error></stream:stream>

And the working client:

yachat->server: <?xml version="1.0"?><stream:stream xmlns:stream="http://etherx.jabber.org/streams" version="1.0" xmlns="jabber:client" to="ya.ru" xml:lang="ru" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:yandex="ns:yandex:let:me:in" >
server->yachat: <?xml version='1.0'?><stream:stream xmlns='jabber:client' xmlns:stream='http://etherx.jabber.org/streams' id='4193006145' from='ya.ru' version='1.0' xml:lang='en'> <stream:features><starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'/><compression xmlns='http://jabber.org/features/compress'><method>zlib</method></compression><mechanisms xmlns='urn:ietf:params:xml:ns:xmpp-sasl'><mechanism>PLAIN</mechanism></mechanisms></stream:features>

As we can see, the server replied, and it wants sex, drugs and rock-n-roll with cookies.
The initial stream header sent by the working client differs by two additional namespace declarations, so let’s brutally hack gajim to add them too (my first Python hack, I know it is still rough around the edges):

--- /usr/share/gajim/src/common/xmpp/dispatcher_nb.py	2009-10-17 19:32:20.000000000 +0400
+++ /usr/share/gajim/src/common/xmpp/dispatcher_nb.py	2009-10-17 19:46:27.000000000 +0400
@@ -110,6 +110,8 @@
 		self._metastream.setNamespace(self._owner.Namespace)
 		self._metastream.setAttr('version', '1.0')
 		self._metastream.setAttr('xmlns:stream', NS_STREAMS)
+		self._metastream.setAttr('xmlns:xml', 'http://www.w3.org/XML/1998/namespace')
+		self._metastream.setAttr('xmlns:yandex', 'ns:yandex:let:me:in')
 		self._metastream.setAttr('to', self._owner.Server)
 		self._owner.send("%s>" % str(self._metastream)[:-2])

Tcpdump, connect and:

gajim->server: <?xml version='1.0'?><stream:stream xmlns="jabber:client" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:yandex="ns:yandex:let:me:in" to="xmpp.yandex.ru" version="1.0" xmlns:stream="http://etherx.jabber.org/streams" >
server->gajim: <?xml version='1.0'?><stream:stream xmlns='jabber:client' xmlns:stream='http://etherx.jabber.org/streams' id='3512116648' from='ya.ru' xml:lang='en'><stream:error><host-unknown xmlns='urn:ietf:params:xml:ns:xmpp-streams'/></stream:error></stream:stream>

Fuck my brain, nothing changed. OK, let’s compare the headers symbol by symbol. Actually it is enough to compare just the “to” attribute:

gajim: to="xmpp.yandex.ru"
yachat: to="ya.ru"

No need to look into the sources to figure out that this value is taken not from the server setting but from the Jabber ID, i.e. the string after the @ symbol. Set the account to login@ya.ru instead of the default yandex.ru or xmpp.yandex.ru, and things magically start to work.

A good shake after the music hours.

DNS presentation on HAR2009

Bert Hubert (the PowerDNS author) gave a “DNS Security in the Broadest Sense” presentation at HAR2009.
Among other things about the current state of the DNS protocol and infrastructure, it mentions my successful hack of the then-latest ISC BIND server with randomized ports, performed by injecting a poisoned DNS record about a year ago.

The attack took one night and was distributed over two attacking nodes working against a 9.5.0-P2 BIND server, filling a gigabit link almost completely. In parallel I started the same attack against a production DNS server and was shut down by the admins late at night. There were lots of fun (and maybe trolling) talks with NOCs and management about the results, not actually about the broken DNS, but about the overwhelmed network :)

The code for the distributed attack is still there :)
Without a usage example, of course.

Who wanted 10 Gb/s bidirectional routing?

Jesper Dangaard Brouer wrote:

Formatted quick view using 'ifstat -b'

eth31-in   eth31-out   eth32-in eth32-out
9.57  +    9.52  +     9.51 +     9.60  = 38.20 Gbit/s
9.60  +    9.55  +     9.52 +     9.62  = 38.29 Gbit/s
9.61  +    9.53  +     9.52 +     9.62  = 38.28 Gbit/s
9.61  +    9.53  +     9.54 +     9.62  = 38.30 Gbit/s

And that’s with a 1500 MTU. Notice that bidirectional routing means we actually have to move approximately 40 Gbit/s through memory and in and out of the interfaces.

Excellent results!

New passive OS fingerprint iptables module release

The passive OS fingerprinting netfilter module allows one to passively detect the remote OS and perform various netfilter actions based on that knowledge. The module compares certain data (WS, MSS, options and their order, TTL, DF and others) from packets with the SYN bit set against dynamically loaded OS fingerprints.
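
To make the comparison concrete, here is a minimal user-space sketch of the idea. The struct layout, field names and the osf_finger/syn_info types are made up for this illustration and are not the module’s real data structures (the real fingerprints also support window-size wildcards, MSS multiples and similar variations).

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: a hypothetical fingerprint entry, one per OS genre. */
struct osf_finger {
    const char *genre;      /* e.g. "Linux", "Windows" */
    uint16_t    wss;        /* expected TCP window size */
    uint16_t    mss;        /* expected MSS option value */
    uint8_t     ttl;        /* initial TTL */
    bool        df;         /* "don't fragment" bit set */
    const char *opt_order;  /* TCP option layout, e.g. "MSS,SACK,TS,NOP,WS" */
};

/* Parameters extracted from an incoming SYN packet. */
struct syn_info {
    uint16_t    wss, mss;
    uint8_t     ttl;
    bool        df;
    const char *opt_order;
};

/* Return true when the SYN matches the fingerprint: same option layout,
 * same window/MSS/DF values, and a TTL that has not decayed below the
 * fingerprinted initial value. */
static bool osf_match(const struct syn_info *s, const struct osf_finger *f)
{
    return s->wss == f->wss && s->mss == f->mss && s->df == f->df &&
           s->ttl <= f->ttl && strcmp(s->opt_order, f->opt_order) == 0;
}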

This release brings the following changes, some of which are not backward compatible and depend on upcoming 2.6.30 changes:

  • switch from printk() to nf_log_packet() (2.6.30+ only, since it depends on the hook number provided through the xt_match_param structure)
  • comment and documentation update
  • the latest MacOS X fingerprint (finally!)
  • split add/remove fingerprint processing callbacks and do not abuse internal netlink header fields
  • WSS state machine cleanups (introduce an enum instead of hardcoded char values, simplify checks)

Enjoy!

New passive OS fingerprint module release

The passive OS fingerprinting netfilter module allows one to passively detect the remote OS and perform various netfilter actions based on that knowledge. The module compares certain data (WS, MSS, options and their order, TTL, DF and others) from packets with the SYN bit set against dynamically loaded OS fingerprints.

This is a rather minor release, in which I completed the ipt->xt name switch in the library code.
In some environments the mix could cause compilation errors (e.g. the old libipt_osf.c library against the new xt_osf.h header); now everything lives in the ‘xt’ namespace.

As usual, the code is available in the archive.

OSF was pushed upstream multiple times and all requested features were implemented (like the switch from the kernel connector to nfnetlink, coding style cleanups and the like), but the patch is stuck on the mailing list without moving into either iptables or the vanilla tree. Pings have not returned a reply so far. Let’s see how this will end up.

EDITED TO ADD: I rebased the OSF patch against the current vanilla tree and resent it to the netdev@ and netfilter-devel@ lists.

Morning exercise: port multithreaded project to pool of libevent-backed threads

Everyone knows libevent, yeah?
If you do not, then you will likely never write client-server applications using the event-driven model, since libevent is effectively the best-known and most feature-rich library for this purpose.

So, yesterday I decided to try to port the elliptics network from a thread-based model to an event-driven one. Sounds simple…
But it took me two days to complete, and I am still not sure whether transaction forwarding (when a server receives data which belongs to some other node in the network, it forwards it there) works without bugs, or what the performance is.

The main problem with libevent is that it was designed for a single-threaded model. But reality tells us that a single IO thread does not scale well for storage systems (and polling does not work well for regular files, not even talking about the weird cases where we want to store data in a database, where there is no file descriptor to poll at all), so no matter what, we have to have a number of IO threads.

Here comes the first problem. Although libevent claims to have experimental support for thread-safe dispatching and event queueing, it does not work when the event base (returned by event_init()) is created in one thread and another thread then starts adding/dispatching events to/from it. The version I tested in Ubuntu Hardy (1.3.something, I think) just crashed. So I moved base allocation into the thread that uses it and switched to the latest 1.4.9 version. It works fine, except that the dispatching code returns an error when there are no events to wait for.
This is a generic library which does not allow sleeping until some event becomes ready if there are no events at all, so effectively one either has to add a dummy descriptor (like a pipe) or sleep manually when event_dispatch() and friends return 1 (no events).
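
A minimal sketch of that dummy-pipe workaround with the libevent 1.4 API (my illustration, not the elliptics code): the base is created and used by the same IO thread, and a persistent read event on a pipe nobody writes to keeps event_base_dispatch() from running out of events.

#include <event.h>
#include <stdlib.h>
#include <unistd.h>

static void dummy_cb(int fd, short what, void *arg)
{
    /* Never called: nobody ever writes into the pipe. */
    (void)fd; (void)what; (void)arg;
}

/* Per-IO-thread entry point: the base must be created and used by the
 * same thread, otherwise the 1.3.x version just crashed for me. */
static void *io_thread(void *arg)
{
    struct event_base *base;
    struct event dummy;
    int pfd[2];

    base = event_init();    /* becomes the "current" base for this thread */
    if (!base || pipe(pfd))
        exit(1);

    /* Persistent read event on the pipe: it never fires, but it prevents
     * event_base_dispatch() from returning 1 ("no events to wait for"). */
    event_set(&dummy, pfd[0], EV_READ | EV_PERSIST, dummy_cb, NULL);
    event_base_set(base, &dummy);
    event_add(&dummy, NULL);

    /* ... real client sockets are added here by this same thread ... */

    event_base_dispatch(base);
    return arg;
}
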
Another problem is with event_loopexit() and the other _loopexit() functions: they allocate an event which I never managed to find being freed, but I may be wrong. Initially I thought these functions should be used instead of event_dispatch() to wait for a ready file descriptor, but with a timeout. It looks like I was wrong, but I did not investigate deeply enough.

But all those libevent problems are not real problems; it is just a matter of experience with the library. Effectively it took me two days to resolve all the issues I ran into.

The more serious issue is the event-driven design itself. A TCP socket in Linux is already more than 1500 bytes, a file descriptor is close to 200 bytes, and a task structure is about 3800 bytes. Noticeably more, but the difference disappears compared to the IO overhead. And the thread-driven model is much simpler to program.
For example, in my network protocol there is a header which contains information about the attached data. So we have to read two blocks from the network; similar issues may also appear in the sending part, and to resolve them we have to have a rather complex state machine.
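
As an illustration of that state machine (my sketch, not the actual elliptics reader; the pkt_header layout is assumed), the receive path has to remember whether it is currently collecting the fixed-size header or the attached data, and how much of the current block is still missing, because a single EV_READ callback may deliver only part of either block.

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>

/* Assumed wire header: only says how much data follows. */
struct pkt_header {
    uint64_t size;
};

enum recv_state { RECV_HEADER, RECV_DATA };

struct recv_ctx {
    enum recv_state   state;
    struct pkt_header hdr;
    void             *data;
    size_t            offset;   /* bytes of the current block already read */
};

/* Called from the EV_READ callback on a non-blocking socket.  Returns 1
 * when a full packet (header + data) has been assembled, 0 if more reads
 * are needed, -1 on error or when the peer closed the connection. */
static int feed_socket(struct recv_ctx *c, int fd)
{
    for (;;) {
        void *dst;
        size_t want;
        ssize_t n;

        if (c->state == RECV_HEADER) {
            dst  = (char *)&c->hdr + c->offset;
            want = sizeof(c->hdr) - c->offset;
        } else {
            dst  = (char *)c->data + c->offset;
            want = c->hdr.size - c->offset;
        }

        n = recv(fd, dst, want, 0);
        if (n == 0)
            return -1;                          /* peer closed */
        if (n < 0)
            return (errno == EAGAIN) ? 0 : -1;  /* wait for the next event */

        c->offset += n;
        if ((size_t)n < want)
            return 0;                           /* partial block, keep waiting */

        if (c->state == RECV_DATA)
            return 1;                           /* header + data complete */

        /* Header complete: switch to reading the attached data. */
        if (c->hdr.size == 0)
            return 1;                           /* header-only packet */
        c->data = malloc(c->hdr.size);
        if (!c->data)
            return -1;
        c->state  = RECV_DATA;
        c->offset = 0;                          /* data may already be readable */
    }
}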

A very serious problem arises when we want to manage the same file descriptor from multiple threads: for example, two threads wait on a socket, wake up, and each one reads a request to be processed in parallel. This does not really work with libevent, since it does not contain any locking inside, so we would have to introduce a synchronization primitive at a higher layer. POSIX thread locks in the contended case end up in the kernel via the sys_futex() syscall, which in this case would happen almost instantly.

Another solution is to have a single thread accept new clients and dynamically spread the new file descriptors among the threads in the IO pool. I made a weird decision to push the receiving work to one thread and the sending work to another. This simplifies data forwarding, but forces threads to allocate a special control block for each packet to be sent. The blocks are placed into a FIFO queue, and the thread operates on the first element without locks (taking the lock only to dequeue a block once it has been sent). I selected this model because, for data forwarding, we have to put a given data block into a different socket, which is handled by a different thread, so to be completely non-blocking we have to have a queue for each thread, which is drained when the appropriate socket is ready to be written to.
Right now I do not remember why I did not let each thread handle both receiving and sending work (including checking the forwarding queue); maybe I will experiment with that model next time, since I really do not like the idea of allocating/freeing a control block for each packet to be sent.
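
The control blocks and the per-thread FIFO could look roughly like the sketch below (only an illustration under my assumptions, with a plain mutex around the queue; the elliptics code differs in details). As described above, the sending thread works on the head of its queue and takes the lock only when another thread enqueues a forwarded block or when a fully sent block is dequeued.

#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical control block describing one packet to be written out. */
struct send_block {
    TAILQ_ENTRY(send_block) entry;
    const void             *data;
    size_t                  size;
    size_t                  offset;  /* how much has been sent already */
};

/* One sending thread owns one of these queues. */
struct send_queue {
    TAILQ_HEAD(, send_block) list;
    pthread_mutex_t          lock;
};

static void send_queue_init(struct send_queue *q)
{
    TAILQ_INIT(&q->list);
    pthread_mutex_init(&q->lock, NULL);
}

/* Called by receiving threads to forward a packet to the thread which
 * owns the destination socket. */
static void send_queue_push(struct send_queue *q, struct send_block *b)
{
    pthread_mutex_lock(&q->lock);
    TAILQ_INSERT_TAIL(&q->list, b, entry);
    pthread_mutex_unlock(&q->lock);
}

/* The owning thread peeks at the head when its socket becomes writable;
 * in this model only the owner ever removes entries, which is why the
 * peek itself is done without the lock, exactly as described above. */
static struct send_block *send_queue_head(struct send_queue *q)
{
    return TAILQ_FIRST(&q->list);
}

/* Called by the owning thread once the block has been completely sent. */
static void send_queue_pop(struct send_queue *q, struct send_block *b)
{
    pthread_mutex_lock(&q->lock);
    TAILQ_REMOVE(&q->list, b, entry);
    pthread_mutex_unlock(&q->lock);
}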

I will check how this works in a multiple-server and multiple-client workload, but so far I like the way it behaves in the single-server, single-client case over loopback on my laptop with 256 MB of RAM :)
Changes are available in the git tree.

Looking at the local time, I no longer have time for the electronics experiments. I wanted to attach the K8055 IO board to sample the Dallas 1-wire bus (w1 in my notation), between an iButton and a ds2490 for example, and check whether it can sniff the data channel. I’m not sure it will be able to though: first, I do not remember how fast its digital inputs can be sampled from software; second, I’m not sure the ds2490 chip will tolerate this passive load. The board’s inputs add some capacitance, so the w1 bus master may decide to start the negotiation protocol with it, thinking it is a proper client.

If this succeeds, I will go further and order some PIC-based board to work with, and eventually make some kind of w1 sniffer myself.

Passive OS fingerprint module release

The passive OS fingerprinting netfilter module allows one to passively detect the remote OS and perform various netfilter actions based on that knowledge. The module compares certain data (WS, MSS, options and their order, TTL, DF and others) from packets with the SYN bit set against dynamically loaded OS fingerprints.

This release brings the following features:

  • move from kernel netlink connector to netfilter netlink (nfnetlink)
  • use helper functions from the 1.4.3 iptables release
  • code cleanups

So far those were the last changes requested by the netfilter team for inclusion. Let’s see the results.

Passive OS fingerprint module update

The passive OS fingerprinting netfilter module allows one to passively detect the remote OS and perform various netfilter actions based on that knowledge. The module compares certain data (WS, MSS, options and their order, TTL, DF and others) from packets with the SYN bit set against dynamically loaded OS fingerprints.

Using this module it is possible to block/allow packets which originated from a specified OS, or to watch a log of per-OS network usage:

# iptables -I INPUT -j ACCEPT -p tcp -m osf --genre Linux --log 0 --ttl 2 --connector
# dmesg | tail -n2
Windows [2000:SP3:Windows XP Pro SP1, 2000 SP3]: 11.22.33.55:4024 -> 11.22.33.44:139
Linux [2.5-2.6:] : 1.2.3.4:42624 -> 1.2.3.5:22 hops=4

This release incorporates the comments I got during the code review and prepares the module for kernel inclusion (header split, structure rearrangement, some type changes, new printk() format usage and so on).
It does not contain new features, but it restructures the userspace/kernelspace headers, so it is not binary compatible with previous releases.

Enjoy!

Passive OS fingerprint module update

The passive OS fingerprinting netfilter module allows one to passively detect the remote OS and perform various netfilter actions based on that knowledge. The module compares certain data (WS, MSS, options and their order, TTL, DF and others) from packets with the SYN bit set against dynamically loaded OS fingerprints.

This update fixes an unload RCU completion problem noticed by Paul McKenney (paulmck_linux.vnet.ibm.com)

The latest version is always available from the archive and was sent to the netdev@ mailing list.

Passive OS fingerprint module update.

The passive OS fingerprinting netfilter module allows one to passively detect the remote OS and perform various netfilter actions based on that knowledge. The module compares certain data (WS, MSS, options and their order, TTL, DF and others) from packets with the SYN bit set against dynamically loaded OS fingerprints.

Fingerprint matching rules can be downloaded from the OpenBSD source tree and loaded into the kernel over the netlink connector via a special utility found in the archive. The utility will also listen for events about matching packets.

In this release all rules initialization is handled over netlink, and lookup tables are introduced to speed up the RCU lookup a bit (they were actually added in the previous version, which was not announced on the mailing lists, likely because of spam filters). It also adds back a maximum IP options length definition, which is needed for successful library compilation; otherwise one can include /usr/include/netinet/ip.h.

Example usage:

# modprobe ipt_osf
# ./ucon_osf -f ./pf.os
^C Daemon will listen for incoming match events
-d switch removes fingerprints
# iptables -I INPUT -j ACCEPT -p tcp -m osf --genre Linux --log 0 --ttl 2 --connector

You will find something like this in the syslog:

ipt_osf: Windows [2000:SP3:Windows XP Pro SP1, 2000 SP3]: 11.22.33.55:4024 -> 11.22.33.44:139

I’ve asked for its inclusion into the vanilla tree. It is a bit late though, but if things go well, it will find its way into the next kernel.

Updated 64k-bind() patch.

I suppose I fixed the issue with multiple sockets bound to the same local address and port. The likely problem is in the way my patch broke the old code, which assumed that the table may contain only a single reuse socket added via bind() with a zero port.

In the updated version bind() checks whether the selected bucket already contains fast-reuse sockets, and in that case runs the whole bind conflict check, which involves both address and port; otherwise the socket is just added into the table.
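
A toy model of that logic (a simplified user-space sketch of the idea, not the actual kernel patch): the expensive walk over the bucket’s sockets, comparing local addresses, only happens when the bucket already holds fast-reuse sockets.

#include <stdbool.h>
#include <stddef.h>

/* Toy view of one bind-hash bucket (one local port), not kernel code. */
struct toy_sock {
    unsigned int     addr;      /* local address */
    bool             reuse;     /* SO_REUSEADDR-style flag */
    struct toy_sock *next;
};

struct toy_bucket {
    bool             fastreuse; /* every socket here is a reuse socket */
    struct toy_sock *owners;
};

/* Full conflict walk: compare the new socket's local address against
 * every socket already bound to this port. */
static bool toy_conflict(const struct toy_bucket *b, const struct toy_sock *sk)
{
    const struct toy_sock *s;

    for (s = b->owners; s; s = s->next)
        if (s->addr == sk->addr && !(s->reuse && sk->reuse))
            return true;
    return false;
}

/* The updated idea: run the expensive walk only when the bucket already
 * holds fast-reuse sockets; otherwise just link the socket in. */
static int toy_bind(struct toy_bucket *b, struct toy_sock *sk)
{
    if (b->fastreuse && toy_conflict(b, sk))
        return -1;                      /* EADDRINUSE */

    if (!b->owners)
        b->fastreuse = sk->reuse;       /* first owner defines the mode */
    else if (!sk->reuse)
        b->fastreuse = false;

    sk->next  = b->owners;
    b->owners = sk;
    return 0;
}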

The patch has not been heavily tested yet, but it passed several runs of a trivial lots-of-binds test application.
I will update the patch if production testing reveals some problems.

Optimizing Linux bind() performance.

When I implemented a simple extension to the binding mechanism which allows binding more than 64k sockets (or a smaller amount, depending on the sysctl parameters), I ran into the problem that we have to traverse the whole bind hash table to find an empty bucket. And while this is not a problem for, say, 32k connections, bind() completion time grows exponentially (since after each successful binding we have to traverse one more bucket to find an empty one), even if we start each time from a random offset inside the hash table.

So, when the hash table is full and we want to add another socket, we have to traverse the whole table no matter what, so effectively this is the worst-case performance and it is constant.

Let’s see the results.


[Graph: bind() time depending on the number of already bound sockets]

The green area corresponds to the usual bind-to-zero-port process, which turns on kernel port selection as described above. The red area is the bind process when the number of bound sockets (let me remind you that we only talk here about automatic kernel port selection for sockets which have the reuse option turned on, i.e. those which are bound to different addresses, for example, but want to share the port) is not limited to 64k (or by the sysctl parameters). It shows the same exponential growth (hidden by the green area) until the number of ports reaches the sysctl limit.

At this point the bind hash table has exactly one reuse-enabled socket per bucket, but it is possible that they have different addresses. The kernel actually selects the first port to try randomly, so at the beginning bind takes roughly constant time, but over time the number of ports to check after the random start increases. That gives the exponential growth, but because of the random selection, not every next port selection necessarily takes longer than the previous one. So we have to consider the area below the curve in the graph (if you could zoom in, you would find many different times placed there), and one area can hide another.

Blue area corresponds to the title of the post: optimized bind() and kernel port selection algorithm.

This is a rather simple design approach: the hash table now maintains an (imprecise and racily updated) count of currently bound sockets, and when the number of such sockets becomes greater than a predefined value (I use the maximum port range defined by the sysctls), we stop traversing the whole bind hash table and just stop at the first matching bucket after the random start. The above limit roughly corresponds to the case when the bind hash table is full and the mechanism allowing more reuse-enabled sockets to be bound has kicked in, so it does not change the behaviour of other sockets.
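
The optimization can be modelled roughly like this (a user-space sketch of the idea under my assumptions, not the kernel patch): below the threshold the loop keeps hunting for an empty bucket as before; above it, the first reuse-friendly bucket after the random start is taken.

#include <stdbool.h>
#include <stdlib.h>

#define PORT_LOW   32768u
#define PORT_HIGH  61000u
#define PORT_RANGE (PORT_HIGH - PORT_LOW)

/* Toy model of the bind hash state: per-port socket counts, whether every
 * socket bound to a port is reuse-enabled, and the imprecise global
 * counter of bound sockets mentioned above. */
static unsigned int  sockets_on_port[PORT_RANGE];
static bool          port_all_reuse[PORT_RANGE];
static unsigned long bound_sockets;

static long pick_port(void)
{
    unsigned int start = (unsigned int)rand() % PORT_RANGE;
    /* Once more sockets are bound than there are ports, the table is
     * effectively full: stop hunting for an empty bucket. */
    bool shortcut = bound_sockets > PORT_RANGE;
    unsigned int i;

    for (i = 0; i < PORT_RANGE; i++) {
        unsigned int slot = (start + i) % PORT_RANGE;

        if (sockets_on_port[slot] == 0)
            return PORT_LOW + slot;     /* old behaviour: empty bucket wins */

        if (shortcut && port_all_reuse[slot])
            return PORT_LOW + slot;     /* optimized: first matching bucket */
    }
    return -1;                          /* nothing usable at all */
}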

The patch was sent to netdev@, but I have just been told about the following issue:

$ grep -n :60013 netstat.res
33101:tcp        0      0 local1:60013       remote:80       ESTABLISHED
33105:tcp        0      0 local1:60013       remote:80       ESTABLISHED
52084:tcp        0      0 local2:60013       remote:80       ESTABLISHED
52085:tcp        0      0 local2:60013       remote:80       ESTABLISHED
58249:tcp        0      0 local3:60013       remote:80       ESTABLISHED

It remains to be figured out what this is and how much harm it brings :)

Allowing more than 64k bound connections.

Linux sockets have a nice reuse-addr option, which allows multiple sockets to be bound to the same port if they use different local addresses and are not listening sockets. This only works when selecting the port by hand; with a zero port passed to bind() it will start failing once the local port range is exhausted.
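
For reference, this is the user-space pattern the post is about (a standard sockets example, not part of the patch): a reuse socket bound to one specific local address with a zero port, so the kernel performs the automatic port selection.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a TCP socket to the given local address with port 0 and let the
 * kernel choose the port; returns the bound socket or -1 on error. */
static int bind_auto_port(const char *local_addr)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(0);                     /* kernel selects the port */
    inet_pton(AF_INET, local_addr, &sa.sin_addr);

    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        getsockname(fd, (struct sockaddr *)&sa, &len) < 0) {
        close(fd);
        return -1;
    }

    printf("%s bound to port %u\n", local_addr, ntohs(sa.sin_port));
    return fd;
}

Calling this in a loop over several aliased local addresses is exactly the lots-of-binds scenario that runs into the 32-64k limit.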

There are crazy people who want to have many tens of thousands of bound connections: they set up several interface aliases to be able to bind to different addresses and thus have many connections, yet calling bind() with a zero port still ends up with only 32-64k connections (depending on the local port range sysctl).

The attached patch removes this limit. Currently the inet port selection algorithm runs over the whole bind hash table and checks whether the appropriate hash bucket does not already use the randomly selected port. When it finds such a bucket, the system binds the socket to the selected port. If sockets are not freed, this stops working once the local port range is exhausted, without even trying to check whether the bound sockets have the reuse socket option set and thus could share the bucket.

My patch implements just that: when there is no bucket free of our randomly selected port, we will use the one which contains sockets with the reuse option and has the smallest number of sockets in it. The hot-path overhead (i.e. when there are empty buckets) amounts to three additional condition checks for buckets which are not empty and, if they all pass, storing two values into local variables. When the local port range is exhausted, we quickly select a port based on those stored values. It would be possible to add some heuristics to the bucket selection, e.g. when the overall number of bound ports is more than 2/3 of the hash table, we could just randomly select a bucket and work with it.
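
A rough model of that selection (my sketch of the idea, not the patch itself): while scanning for an empty bucket after the random start, remember the best reuse-only bucket seen so far and fall back to it when no empty bucket exists.

#include <stdbool.h>
#include <stddef.h>

/* Toy view of one candidate bind-hash bucket (one port). */
struct toy_bucket {
    unsigned short port;
    unsigned int   nr_sockets;  /* 0 means the bucket is free */
    bool           all_reuse;   /* every socket here has the reuse flag */
};

/* Scan the candidate buckets (in the kernel this is the walk over the
 * bind hash table).  The hot path only pays a few extra checks for
 * non-empty buckets and, when they pass, stores the port and its size;
 * once the scan finds nothing empty, the remembered smallest reuse-only
 * bucket is shared instead of failing with "port range exhausted". */
static long pick_bucket(const struct toy_bucket *candidates, size_t nr)
{
    const struct toy_bucket *best = NULL;
    size_t i;

    for (i = 0; i < nr; i++) {
        const struct toy_bucket *b = &candidates[i];

        if (b->nr_sockets == 0)
            return b->port;             /* old behaviour: free bucket */

        if (b->all_reuse && (!best || b->nr_sockets < best->nr_sockets))
            best = b;                   /* remember the smallest reuse bucket */
    }

    return best ? best->port : -1;
}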

This only affects the port selection path invoked via a bind() call with the port field set to zero.

Passive OS fingerprint module update.

The passive OS fingerprinting netfilter module allows one to passively detect the remote OS and perform various netfilter actions based on that knowledge. The module compares certain data (WS, MSS, options and their order, TTL, DF and others) from packets with the SYN bit set against dynamically loaded OS fingerprints.

Fingerprint matching rules can be downloaded from the OpenBSD source tree and loaded into the kernel over the netlink connector via a special utility found in the archive. The utility will also listen for events about matching packets.

The archive also contains the iptables library file (also attached), which was shipped with the iptables extensions some time ago (at least when ipt_osf existed in patch-o-matic).

Example usage:

$ wget "http://www.openbsd.org/cgi-bin/cvsweb/~checkout~/src/etc/pf.os?rev=1.21;content-type=text%2Fplain"
# modprobe ipt_osf
# ./ucon_osf -f ./pf.os
^C Daemon will listen for incoming match events
-d switch removes fingerprints
# iptables -I INPUT -j ACCEPT -p tcp -m osf --genre Linux --log 0 --ttl 2 --connector

You will find something like this in the syslog:

ipt_osf: Windows [2000:SP3:Windows XP Pro SP1, 2000 SP3]: 11.22.33.55:4024 -> 11.22.33.44:139

Enjoy!

The first elliptics distributed network draft.

POHMELFS was created as a client (and, besides that, it is already a very high-performance network filesystem in its own right) for this kind of distributed network, which is called the elliptics network.

Each node has an ID in a flat space with a modulo operation on it, i.e. the ID space wraps around.

Each object has an ID from the same finite space. An object can only be identified by its ID; there are no directories, files or anything else.

A server with ID N hosts objects from the range (P; N], where P is the ID of the neighbouring node that precedes it.
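
In other words, object placement is a lookup over a ring of node IDs. A minimal sketch of that rule (my illustration of the draft, not elliptics code), assuming the node IDs are kept sorted:

#include <stddef.h>
#include <stdint.h>

/* Toy placement rule: nodes[] holds the sorted node IDs of the ring.
 * The node with ID N owns every object ID in (P; N], where P is the
 * previous node's ID, and IDs past the largest node wrap around. */
static size_t node_for_object(const uint64_t *nodes, size_t nr, uint64_t obj_id)
{
    size_t i;

    for (i = 0; i < nr; i++)
        if (obj_id <= nodes[i])
            return i;       /* first node with ID >= object ID owns it */

    return 0;               /* wrap around to the first node in the ring */
}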

Any object can have multiple IDs; in this case it will be copied to the different nodes responsible for them.
Each additional ID can be obtained from the first one via known mathematical operations (like addition in the finite field). IDs (including the additional ones) must not collide between different objects.
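
For instance, with a 64-bit ID space the additional IDs could be derived like this (purely an assumed illustration of “sum in a finite field”; the draft does not fix the exact operation or ID width, and the real scheme must also ensure the derived IDs never collide between objects):

#include <stdint.h>

/* Illustration only: derive the i-th replica ID from the primary one by
 * adding a fixed offset in the 2^64 space (unsigned addition wraps, i.e.
 * works modulo 2^64). */
#define REPLICA_OFFSET 0x9e3779b97f4a7c15ull    /* arbitrary constant */

static uint64_t replica_id(uint64_t primary_id, unsigned int i)
{
    return primary_id + (uint64_t)i * REPLICA_OFFSET;
}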

Information about an object and its clones will be stored in the object itself and in the objects which correspond to the parent directory in classical path-based lookup. Thus, to get an object, the client will create an ID from its path and fetch the data; if it is not available, the client can create an ID from the parent directory’s path and request information about the clones from that object.

Each write operation is transaction based, so each transaction is a separate object in the filesystem, which will be stored on different servers according to the client’s settings (how many clones). Information about how to apply transactions on top of some object will be stored in the parent object, in the same way as described above.

A transaction is committed (and can be removed) after all clones have applied it; otherwise it should live in the network until explicitly removed.

Node join protocol.
A node looks up the pair of nodes whose IDs surround the ID of the joining node (called the next and prev nodes here according to how the IDs relate). It sends a request to the next node to return the list of objects (and their attributes) which correspond to the joining node’s ID. Let’s assume the next node has ID N, the prev node P, and the joining node J.
1. Joining node J sends a request to the next node N to grab the list of objects it hosts in (P; J] and to get the routing table the next node has.
1a. The next node forwards all write requests to the joining node J, which will block until step 3 is completed.
2. Joining node J runs through the received list and selects the objects which are newer than those present on the joining node itself.
3. Those objects are fetched from the next node.
3a. All write requests forwarded from the next node are applied.
4. Joining node J connects to the previous node P and announces that it is now in the network, so that the previous node updates its routing table. This announcement can be sent to all nodes in the routing table received from the next node in step 1.

Each node has a routing table, which corresponds to a tree indexed by node IDs. Each object in the tree holds the address of a remote node. Even a rather large routing table will not take a lot of RAM, and this redundancy (not only the addresses of the immediate neighbours) allows the number of lookup messages needed to find an appropriate node to be greatly reduced. When a node receives a request which does not correspond to the IDs it hosts, it should forward the request to another node according to its routing table. If the routing table does not have a node which corresponds to the given ID, the request should be forwarded to the node nearest to that ID.
Some heuristics should be introduced to determine when to stop a lookup and return a no-entry error; for example, when the next node in the routing table has an ID less than the requested ID, an error should be returned. This relies on a correct routing table, which should be updated when the next node leaves, or when nodes next to the next node join.

When a node joins and fetches the list of updated objects from the next node, it may notice that some objects were changed and request the transaction list from other nodes in order to apply them. The transaction list can be obtained from the object which corresponds to the parent directory in the path-based object name.

Comments?