I set up a 6-machine cluster where each node was placed in a different datacenter (at least that's what I was told), connected over a 10 Gbit network, but the actual network speed was about 20-30 MB/s - these are production datacenters after all (I'm curious whether filling up the whole available bandwidth made people unhappy).
Each system runs on 8-way E5440 64-bit Xeons with 16 GB of RAM and has two test SCSI disks attached, which I formatted as ext2 and used for the latest (1.4.39) Tokyo Cabinet database. The systems run RHEL 5.4, which is old as mammoth's shit (for example it has libevent 1.1, openssl 0.9.8e and a 2.6.18 kernel).
I installed one elliptics network node on each of the 6 servers and started a single-threaded IO testing tool, which wrote 100-byte chunks into the storage. When the test application was connected to every node it was able to reach an IO rate of more than 300k requests per second, which was effectively limited by the network, that is, by those 20-30 MB/s of free bandwidth between datacenters. Sometimes it dropped to 70-100k rps, sometimes it grew up to 360 thousand rps.
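As a quick sanity check on the claim that the network is the limit: 300k requests per second at 100 bytes each is 30 MB/s of payload alone, which matches the free bandwidth between datacenters. Below is a minimal sketch of that arithmetic in C. The function names are made up for illustration, and protocol headers are ignored (an assumption), so real traffic on the wire is somewhat higher.

```c
#include <assert.h>
#include <stdio.h>

/* Payload bandwidth in MB/s (1 MB = 10^6 bytes) for a given request rate
 * and chunk size. Headers are NOT counted: that is an assumption of this
 * sketch, not a property of the elliptics protocol. */
double payload_mb_per_sec(double requests_per_sec, double chunk_bytes)
{
    assert(requests_per_sec >= 0.0 && chunk_bytes >= 0.0);
    return requests_per_sec * chunk_bytes / 1e6;
}

/* Print the payload bandwidth for the request rates mentioned in the post,
 * all with 100-byte chunks. */
void print_observed_rates(void)
{
    const double rates[] = { 70e3, 100e3, 300e3, 360e3 };
    const int count = (int)(sizeof(rates) / sizeof(rates[0]));
    int i;

    for (i = 0; i < count; i++)
        printf("%.0f rps x 100 bytes = %.1f MB/s\n",
               rates[i], payload_mb_per_sec(rates[i], 100.0));
}
```

Calling print_observed_rates() from main() lists all four rates. The 70-100k rps dips correspond to 7-10 MB/s of payload and the 360k peak to 36 MB/s.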
I was asked to run a multithreaded benchmark, like 50-100 threads from a single node writing into the storage, but the system was able to fill the whole pipe using just one client.
Anyway, everything looked quite good except one small detail - the servers regularly crash on those machines. And by regularly I mean it. To date I do not know the reason, but I have to admit that it works without problems on similar Ubuntu systems (with gcc 4.2.4).
For example, the RHEL machines can throw the following dmesg message:
dnet_ioserv[2320] trap invalid opcode rip:2b649487455e rsp:43dcdfe8 error:0
which, I must say, I do not understand: I cannot see how any kind of software error in my userspace application could run into it.
The following, quite usual, gcc options were used during compilation:
gcc -DHAVE_CONFIG_H -I. -I../config -I../include -I../config -pthread -I/usr/include -I/tmp/rhel//include -I/tmp/rhel//include -g -O2 -W -Wall -Wextra -MT iotest.o -MD -MP -MF .deps/iotest.Tpo -c -o iotest.o iotest.c

$ gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I believe I will have to try a different compiler, since I extensively use the __sync builtins for atomic access, which had some bugs in the 4.1 gcc versions. Or at least try libatomic first.
There are some bugs to hunt!
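One cheap way to see whether a given compiler's __sync builtins behave is a tiny smoke test, separate from the real server. The sketch below is not taken from elliptics: the thread count, iteration count and function names are invented for illustration. It increments one shared counter from several threads with __sync_fetch_and_add() and returns the final value. A passing run does not prove the compiler is bug-free, it only rules out the most basic breakage.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4        /* illustrative values, not from the post */
#define NITERS   100000

static long counter;      /* shared; touched only through __sync builtins */

static void *worker(void *arg)
{
    long i;

    (void)arg;
    for (i = 0; i < NITERS; i++)
        __sync_fetch_and_add(&counter, 1);   /* atomic increment */
    return NULL;
}

/* Runs NTHREADS workers and returns the final counter value.
 * With working atomics this is always NTHREADS * NITERS. */
long run_atomic_counter_test(void)
{
    pthread_t threads[NTHREADS];
    int i;

    counter = 0;
    for (i = 0; i < NTHREADS; i++) {
        int err = pthread_create(&threads[i], NULL, worker, NULL);
        assert(err == 0);
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    /* Adding 0 is a simple way to do an atomic read with __sync. */
    return __sync_fetch_and_add(&counter, 0);
}
```

Build it with -pthread, call run_atomic_counter_test() from main() and compare the result with NTHREADS * NITERS. Running the same binary on both the RHEL and the Ubuntu machines would at least show whether the two toolchains disagree on something this simple.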
Do not use a patched gcc, especially one patched by the Red Hat guys :)
(Do you remember their 2.96 series?)
BTW 4.1.xx has some speculative register access problems when inline asm("") code is mixed with C.
AlexR.
rip:2b649487455e is an IP of 47 terabytes. It smells like stack corruption, where your return address was data instead of a real IP. Google around for stack and heap validators.
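The "47 terabytes" figure can be reproduced by reading the rip value as a plain hexadecimal number. The helper below is a hypothetical sketch (its name is not from any library) that converts such a string to decimal terabytes.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a hexadecimal address string, as printed after "rip:" in dmesg,
 * and return its value in decimal terabytes (10^12 bytes). */
double hex_addr_to_tb(const char *hex)
{
    char *end;
    unsigned long long addr = strtoull(hex, &end, 16);

    assert(end != hex);   /* at least one hex digit was parsed */
    return (double)addr / 1e12;
}
```

For "2b649487455e" this gives roughly 47.7. The magnitude alone does not say whether the address is mapped; comparing it against the process's /proc/<pid>/maps would show whether it falls inside a mapped region.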
Does -fstack-protector-all work in your GCC?
How about Valgrind?
Double-check the CPU capabilities of the machines you run this on, and figure out what instruction actually ran when that error occurred. You could always run the server under gdb until it crashes, if necessary.
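On x86 Linux the CPU capabilities show up in the "flags" line of /proc/cpuinfo, so one way to double-check them from C is to look for a given flag there. The sketch below is only an illustration: both function names are invented, and which flag matters depends on which instruction actually trapped, which the post does not tell us.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated token in
 * `line`, 0 otherwise. "sse4" does not match inside "sse4_1". */
int flags_line_has(const char *line, const char *flag)
{
    size_t n;
    const char *p = line;

    assert(line != NULL && flag != NULL);
    n = strlen(flag);
    if (n == 0)
        return 0;

    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';

        if (starts && ends)
            return 1;
        p++;   /* keep searching after this partial match */
    }
    return 0;
}

/* Return 1 if the first "flags" line in /proc/cpuinfo lists `flag`,
 * 0 if it does not, -1 if the file could not be read or has no such line. */
int cpu_has_flag(const char *flag)
{
    char line[8192];
    int result = -1;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            result = flags_line_has(line, flag);
            break;
        }
    }
    fclose(f);
    return result;
}
```

For example, cpu_has_flag("sse4_1") can be compared across the RHEL and Ubuntu machines. If the binary or one of its libraries was built for an instruction set that one of the CPUs lacks, the outputs will differ.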