Gigabit Network Performance ??

June 5th, 2008

Are you really seeing the full benefit of your gigabit network infrastructure? Today we’re trying to figure out why it seems the private network interconnect links used by our Oracle RAC are causing problems.

First thing is to verify that the NIC ports appear to be configured properly:

root@backup01
/backup/mysql_snapshots >ethtool eth0
Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0×000000ff (255)
        Link detected: yes

Then we’ll move on to some basic link testing. My two favorite utilities for testing network throughput at a relatively low level are:

Here’s a quick example of how both are used…. first we’ll use ttcp:

Set up the receiving server…

root@images01
~ >ttcp -r -s -f M

Then run the test from the transmitting server…

root@backup01
~ >ttcp -t -s -l 16384 -n 16384 -f M 10.0.0.7
ttcp-t: buflen=16384, nbuf=16384, align=16384/0, port=5001  tcp  -> 10.0.0.7
ttcp-t: socket
ttcp-t: connect
ttcp-t: 268435456 bytes in 2.26 real seconds = 113.51 MB/sec +++
ttcp-t: 16384 I/O calls, msec/call = 0.14, calls/sec = 7264.61
ttcp-t: 0.0user 0.3sys 0:02real 15% 0i+0d 0maxrss 0+5pf 199+2csw

Now we can use the handy bandwidth units conversion tool found here to see that 113.51 MB/sec equals 0.9522 Gbps (gigabits/sec). Nice - nearly the full 1.0 Gbps we wanted to see! I ran the above test with various values for -l and -n with similar results.

Let’s test again - this time using iperf:

Set up the receiving server…

root@images01
~ >iperf -s

Then run the test from the transmitting server…

root@backup01
~ >iperf -c 10.0.0.7
————————————————————
Client connecting to 10.0.0.7, TCP port 5001
TCP window size: 16.0 KByte (default)
————————————————————
[  3] local 10.0.0.6 port 52323 connected with 10.0.0.7 port 5001
[  3]  0.0-10.0 sec  1.09 GBytes    936 Mbits/sec

Again, nearly the full 1.0 Gbps is being achieved!

However, keep in mind that these tests do not take things into account which happen higher up in the protocol stack - such as when you’re using SCP or FTP to copy files across the network.

For example, I checked the speed of an SCP file transfer between the same 2 servers that I used above. You may be surprised to see what appears to be terrible performance like this:

root@backup01
/backup/mysql_snapshots >scp mysql_replicated.2008-06-05-040005.tgz images01:/home/
mysql_replicated.2008-06-05-040005.tgz                                    100%  334MB  33.4MB/s   00:10

Hmmm…. 33.4MB/s is only 0.2802 Gbps. What happened?! Actually, a lot of extra things happened all of which eat up time but the single biggest one is the added overhead of the SSH encryption used during the SCP copy job.

The point being - when you’re examining network performance, you must remember to take EVERYTHING into account. Just having a gigabit switch and some el-cheapo gigabit network adapters is no guarantee of anything. Here’s a VERY incomplete list of things to consider:

  • How are your adapters connected to the system? 64-bit PCI slot filled with a high-end NIC? A port built into the main board? A $12 “gigabit NIC” you found at the local wholesale supplier?
  • Hard drive speeds in the case of copying files or doing anything else involving read/write operations to disk storage
  • What else is happening concurrently?
  • Using the correct drivers?
  • Are you certain the patch cables are OK?
  • Are you certain the ports are in fact set at 1000Mbit/sec speed?
  • Is there anything strange happening at the switch level? VLAN misconfigured? Routing problem?
  • have you eliminated the possibility of DNS trouble? Tried the IP address(es) directly?

If gigabit still isn’t fast enough, step up to link aggregation per 802.3ad (along with the more costly class of high-end switches you’ll need to take full advantage of it). If your budget is flexible enough check out the newer 10 gigabit options.

There’s plenty of additional information online about these topics - google.com is your friend.


16 Work Weeks Wasted. Every Year. Why?

May 30th, 2008

And that’s only the beginning of the problem…

My daily commute round trip averages 2.5 hours of driving. Multiplying that by 5 days a week and 50 weeks per year gives 625 hours of time. 625 hours equals almost 16 work weeks (assuming a 40 hour work week) of time I spend doing absolutely nothing productive during the year.

Then there’s the fuel. My daily round trip is 100 miles, during which time I average 40 miles per hour. The combination of distance and poor fuel economy while parked on I-4 works out to $4500 per year in fuel cost in my case. Recently, with the car I no longer have, it was much higher at around $7000 per year.

How about the 25,000 miles per year I’m accumulating on the car (not counting my personal driving which is another 8,000 miles)?. That’s yet another cost. It’s also a great way to expire the factory warranty sooner - even more cost.

Multiply the above scenario by a few dozen or hundreds of employees!

Can this be stopped, or at least significantly reduced? YES, it can be!

Thankfully, it is possible to perform nearly all the requirements of my job by working from home and connecting to the office computer systems via VPN over internet. One of the few reasons to come into the office might be for a face-to-face meeting (possibly involving a white board) which couldn’t be handled via a conference voice-call.

Then there’s the question of worker productivity. Several people I’ve talked to about this have a well rehearsed speech when arguing for OR against telecommuting. My speech is really short and simple:

“Any reduction in productivity while working from home as compared to working in the office will not be tolerated. Period!. If you’re goofing off instead of working at home and your productivity levels are negatively affected, you will be fired”.

Stated another way, either the employee can be trusted or not - and if not - the employee gets replaced.

It is also true that nobody forced me to purchase my home where it is located or forced me to find a job where I did, so maybe I should just shut up and enjoy the drive.

Now I just need to repackage this idea and take it to management. Wish me luck!


edit, compile, debug - I wish….

March 17th, 2008

I wish it was that simple. I’m once again working in the Microsoft realm doing .NET development. I’ve seen things happen in VS2005 which I clearly do not have the capacity to understand.

I’m part of a team that’s combining parts of other products/projects into a new stand-alone baseline product for our company. All the other previously existing products will become enhancements to this new baseline product.

Needless to say, the VS Solution file has plenty of stuff in it - many projects, references, etc.

Here’s the sequence of events I just witnessed while attempting to get a clean build after my last half-hour of code exorcism (removing all traces of the original products from this new one so it can truly stand alone without dependency on the others):

  • clean
  • rebuild
  • clean
  • rebuild (don’t ask me why I’m starting to think this multi-rebuild process is “normal”)
  • lots of errors
  • clean
  • rebuild
  • fewer errors
  • clean
  • rebuild
  • fewer errors
  • clean
  • rebuild
  • more errors
  • wait and watch the “live” errors list (VS2005 is apparently doing something behind the scenes here)
  • wait some more (I was actually distracted doing something else when my peripheral vision picked up the changing errors listing)
  • errors start disappearing off the list
  • finally they’re gone
  • can it be? do I really have a clean build?
  • clean
  • rebuild
  • more errors
  • clean
  • rebuild
  • NO ERRORS (but I don’t really believe it)
  • clean
  • rebuild
  • no errors
  • again
  • no errors(now I believe it - sort of)
  • QUICK - check it all in to VSS
  • spend the next 5 minutes trying to make the voices stop

It should be noted that at no point during that sequence was any file edited or saved!

Looking forward to deploying this monster….


HOWTO: Flash in x86_64 Linux

January 15th, 2008

Adobe has, for whatever reason, never released a 64-bit version of Flash. Thankfully, the 32-bit Firefox works just fine in 64-bit Linux - which means we can use the Flash player in spite of their short-sighted decision.

I know there are dozens of ways to do this. Here’s what works for me quick-and-easy in Fedora 8 (although it should work with most other flavors of Linux too):

1. Download the latest i686 (32-bit) version from http://www.mozilla.com/en-US/

2. “Install” Firefox as follows:

# uncompress and untar it...
tar zxvf firefox-2.0.0.11.tar.gz

# move the resulting firefox directory to /usr/local
mv firefox /usr/local/

# make sure it works...
cd /usr/local/firefox
./firefox

# close the browser for now...

On one of my workstations I didn’t already have the i386 version of compat-libstdc++ installed (which caused Firefox to complain about a missing libstdc++.so.5). Easy fix for that:

yum install compat-libstdc++-33.i386

3. Download the latest Flash player in tar.gz format from: http://www.adobe.com

4. “Install” Flash as follows:

# uncompress and untar it...
tar zxvf install_flash_player_9_linux.tar.gz

# give the resulting directory a shorter, more proper name...
mv install_flash_player_9_linux flash

# move it to /usr/local
mv flash /usr/local

# set up the needed symbolic link for firefox to find and use flash...
cd /usr/local/firefox/plugins/

ln -s /usr/local/flash/libflashplayer.so libflashplayer.so

5. Use your menu editor (in my case the KDE Menu Editor) to make sure the correct firefox gets launched from the menu. If you already had a version of firefox installed, it’s probably already set to use:

/usr/bin/firefox

Just change that to be:

/usr/local/firefox/firefox

6. Done! Launch Firefox and visit a site that uses Flash, such as youtube.com and enjoy.

Note: It is likely that you will need elevated privileges for the directory moving and symbolic linking operations. I simply do the whole thing as root to make it easy.


NetBeans 6.0 IDE ?

January 12th, 2008

Did Jerry actually just mention an IDE? WTF? Surely the Four Horsemen of the Apocalypse have arrived. Anyone that knows me is probably already working to figure out the Baker Act procedure to have me committed before I hurt myself. I’ve been a died-in-the-wool-hard-core user of the vi editor for many years and the mere mention of me considering an IDE will have many people looking for safe cover. I must confess to having been a heavy IDE user many years ago when I was writing a lot of software for the Windows platform - but not many people know that about me. Since then I’ve worked on many non-Windows projects ranging in size from tiny to huge and written every single line of code for them using vi.

After being submerged down in the code soup, server administration, networking, and many other things for so long - I haven’t had time until now to check out the state of the various IDE tools to see if any of them sucked a lot less than the last time I looked at them. Even now I really don’t have the time for this, but the mental distraction is welcome plus the hope of finding something that truly works well and maybe improve productivity is worth it.

First stop - Eclipse. I’ve tried it several times in the distant past, didn’t like it then, and found I still don’t like it now. Several things that should have worked just didn’t work and I didn’t feel like doing the legwork to fix them. I know it’s a fine tool that is used and respected by many - but not for me today.

I suppose I should point out that I’m currently running Fedora 8 as my primary workstation OS.

Next I tried the new Netbeans 6.0. Not really all that new considering 6.0 has been out for a while, but still new to me. Perhaps it was luck, but the initial installation went very smoothly without any of the Java-related dependency problems I somewhat expected to pop up. In only a few minutes I installed it and had the code for my existing Ruby-on-Rails project loaded into it and was exploring the UI. My friend Adam had mentioned it supported vi key bindings so I searched for a plug-in and found jVi - a vi editor clone and installed it. WOW. There I was - using the most commonly needed vi commands while inside the very attractive NetBeans UI.

Ok - I needed to get back on track here and start figuring out why and how this thing sucks. I was looking forward to some disappointment with either the SVN integration or the debugging. Surely one of them would fail miserably!

Our SVN repository access is via svn+ssh with the magic of public key authentication - which makes getting to the secure repository just about as easy as it possibly could be. A half-hour later I had to conclude that NetBeans’ built-in SVN tool set was very nice indeed. I loved how the diff tool works in that I can actually edit and save the current file right there in the diff window (using standard vi commands too!). This would be great for adding those last few code comments just before committing the code back to the repository.

Looking forward to some disappointment with the debugger, I first needed to see what the work-flow would be like if I used the “Run Project” (F6) functionality. Since it wanted to run Webrick (which doesn’t support SSL) I had to make a few very minor tweaks to my code such that no redirects to an SSL page happen when RAILS_ENV == ‘development’. All was well. Very well actually.

Somewhere along the way I got distracted and watched the debugging example video. I clearly remember thinking - yeah right - it would take me all day to solve dozens of dependency problems before _I_ would be allowed to debug like they showed in that video. In reality, it took me maybe 5 minutes to deal with a few minor problems (totally unrelated to Netbeans) installing the ruby-debug-ide gem.

A few minutes later I was setting breakpoints in my code, stepping into the abyss that is the Rails stack, stepping back out into my own code, watching variables - and it wasn’t sucking. Matter of fact, it was a lot nicer than doing it the “old way” where I might be using PrettyPrint to view some big ugly hash variable and getting a headache trying to find the hash element I was looking for in the console output.

I later discovered many other useful things that just work as expected. Code completion works well - even to the point where it automatically constructs all the find_by_somefieldname method names for you. Some things will always be much faster for me at the command line - but for now I’ll be sticking with NetBeans 6.0.

Be sure to check out the tutorials and demos at: http://www.netbeans.org/kb/trails/ruby.html


Lighttpd, 500 errors, and FCGI load balancing… (part 2)

January 11th, 2008

So I very recently upgraded Lighttpd on 3 production application servers to see what, if any, effect it would have on users seeing error 500 and/or the FCGI processes dying. I would be looking for lines similar to this in the lighttpd error logs:

2008-01-10 22:26:56: (mod_fastcgi.c.2462) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7012

Time for some command line magic to figure out the death count on the FCGI processes in the recent past. I ran this command on each of the servers:

for d in 01 02 03 04 05 06 07 08 09 10; do echo -n "2008-01-$d   "; tail -20000 lighttpd.error.log | grep died | grep -c "2008-01-$d"; done

For the 3 servers, results were:

Server: app01
2008-01-01   39
2008-01-02   115
2008-01-03   571
2008-01-04   119
2008-01-05   98
2008-01-06   101
2008-01-07   882
2008-01-08   100
2008-01-09   52
2008-01-10   4
2008-01-11   7

Server: app02
2008-01-01   27
2008-01-02   21
2008-01-03   38
2008-01-04   136
2008-01-05   114
2008-01-06   48
2008-01-07   62
2008-01-08   138
2008-01-09   29
2008-01-10   55
2008-01-11   7

Server: app03
2008-01-01   19
2008-01-02   26
2008-01-03   236
2008-01-04   51
2008-01-05   28
2008-01-06   38
2008-01-07   49
2008-01-08   300
2008-01-09   59
2008-01-10   6
2008-01-11   3

When I saw the result from app01, I got excited until I looked at 55 failures for app02 on 2008-01-10. However, early indications are that a substantial reduction was made in the failures as a by-product of fully utilizing all the available FCGI processes (even though I haven’t done a thing _yet_ toward figuring out why they’re dying in the first place).

I will keep monitoring this as I continue the search for a permanent solution. I am considering an architecture change to eliminate FCGI and start using the new mod_proxy_core in Lighttpd 1.5 in front of multiple Mongrels or maybe even try out Thin.


Lighttpd, 500 errors, and FCGI load balancing… (part 1)

January 9th, 2008

Today’s challenge is to figure out why some of our site’s visitors are occasionally seeing status code 500 (Internal Server Error). First stop is the Lighttpd error logs from one of the application servers where I found entries like this:

2008-01-09 11:21:43: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000
2008-01-09 11:21:42: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000
2008-01-09 11:21:43: (mod_fastcgi.c.3181) response not received, request sent: 1386 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection
2008-01-09 11:21:42: (mod_fastcgi.c.3181) response not received, request sent: 1497 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection
2008-01-09 11:47:11: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000
2008-01-09 11:47:11: (mod_fastcgi.c.3181) response not received, request sent: 1336 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection
2008-01-09 11:47:11: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000
2008-01-09 11:47:11: (mod_fastcgi.c.3181) response not received, request sent: 1295 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection
2008-01-09 11:47:12: (mod_fastcgi.c.2689) establishing connection failed: Connection refused socket: tcp:127.0.0.1:7000
2008-01-09 11:47:12: (mod_fastcgi.c.2689) establishing connection failed: Connection refused socket: tcp:127.0.0.1:7000
2008-01-09 11:47:18: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000
2008-01-09 11:47:18: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000

Looks like Lighttpd is occasionally handing requests off to fcgi processes that have died. Simple enough…. but why are they dying?

Next stop is a look at some output from this command:

ps -eFH | grep fcgi | grep -v grep

as shown here:

UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
lighttpd 30189     1  0 20492 23320   3 Jan08 ?        00:00:02   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/script/process/spawner fcgi -p 7000 -i 20 -r 5
lighttpd 30200     1  0 101967 219912 2 Jan08 ?        00:01:01   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30202     1  0 101364 217368 2 Jan08 ?        00:00:20   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30204     1  0 58803 57416   0 Jan08 ?        00:00:02   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30206     1  0 58805 57416   3 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30208     1  0 58805 57420   0 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30210     1  0 58803 57416   2 Jan08 ?        00:00:02   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30212     1  0 58802 57412   2 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30214     1  0 58803 57416   3 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30216     1  0 58805 57420   1 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30218     1  0 58803 57416   0 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30222     1  0 58805 57420   3 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30225     1  0 58805 57420   2 Jan08 ?        00:00:02   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30227     1  0 58804 57416   2 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30229     1  0 58805 57420   1 Jan08 ?        00:00:02   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30231     1  0 58803 57416   3 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30233     1  0 58805 57416   1 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 32106     1  1 103389 225564 3 11:08 ?        00:01:15   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 32164     1  0 100314 213196 3 11:08 ?        00:00:16   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd  5059     1 44 106262 237152 0 11:47 ?        00:17:53   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd  8606     1 14 98805 207184  3 12:26 ?        00:00:08   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi

Our current production deployment stack is a simple FCGI configuration. Each application server has 20 instances of dispatch.fcgi waiting for Lighttpd to hand off requests to them. They service the request, return the result to Lighttpd, then Lighttpd returns the request to the user’s browser.

In our case, more than half of the dispatch.fcgi processes appear to be untouched according to the memory usage indicated in the RSS column (meaning they have not yet “bloated” their memory consumption beyond the initial amount). Since first deploying the application many months ago I’ve always known this was the case and I always assumed it was due to Lighttpd not getting enough concurrent requests to need to call upon those idle processes.

Timestamp values in the STIME column indicate we have 4 fcgi processes which were recently started by spawner. Spawner (the first process shown in the above list) simply checks to make sure the specified number of processes exist and spawns new processes if needed. In an ideal world, we’d start 20 processes and they would faithfully handle requests forever…. but such is not the case.

The astute observer will have noticed the processes which are being used are all consuming 200+ MB of RAM each! We can probably attribute much of the heavy RAM usage to the typically inefficient manner in which in-memory objects are created when ActiveRecord, for the sake of convenience, loads many columns of data into memory we aren’t going to need at that moment. This is another problem to be discussed and debated (at length) at some other time.

I was running an older version of Lighttpd (1.4.13) on these servers and hadn’t bothered to upgrade because 1) it has been working pretty well, and 2) I’ve had plenty of other things to do. On the off-chance that Lighttpd might be at fault here I decided to have a look at the change logs for the newer versions of Lighttpd. A few things in the change logs for the newer versions caught my eye:

from the 1.4.14 change log:
  * fix http 500 errors (colin.stephen/at/o2.com) #1041 [1663]
  * Added round-robin support to mod_fastcgi [1500]
  * fix cpu hog in certain requests [1473] CVE-2007-1869from the 1.4.16 change log:
  * fixed check on stale errno values, which broke handling of broken fastcgi applications. (#1245)

from the 1.4.18 change log:
  * fixed FastCGI header overrun in mod_fastcgi (reported by mattias@secweb.se)

The first two I listed from 1.4.14 especially interested me…

So I downloaded the latest Lighttpd source (currently version 1.4.18) and compiled/installed it on one of the application servers to see if anything changes for the better. I immediately noticed something very interesting from the ps output as shown here:

UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
lighttpd 11946 1 0 20444 23248 3 Jan08 ? 00:00:02 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/script/process/spawner fcgi -p 7000 -i 20 -r 5
lighttpd 11959 1 0 126376 317356 3 Jan08 ? 00:06:43 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11961 1 0 106143 236516 1 Jan08 ? 00:06:19 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11965 1 0 103744 227076 2 Jan08 ? 00:05:36 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11967 1 0 103759 227156 1 Jan08 ? 00:05:14 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11969 1 0 102879 223588 1 Jan08 ? 00:04:23 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11971 1 0 103456 225780 2 Jan08 ? 00:05:15 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11979 1 0 105246 233092 3 Jan08 ? 00:05:40 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11981 1 0 103516 226188 3 Jan08 ? 00:04:55 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11986 1 0 103541 226244 1 Jan08 ? 00:04:21 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11988 1 0 103077 224392 2 Jan08 ? 00:04:27 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11990 1 0 102421 221764 2 Jan08 ? 00:04:45 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 18924 1 0 122312 301352 1 Jan08 ? 00:11:05 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 12774 1 9 104090 228480 3 10:03 ? 00:18:28 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 12776 1 4 103632 226648 3 10:03 ? 00:07:28 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11144 1 4 102462 221940 1 11:25 ? 00:04:45 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 14011 1 4 102638 222600 1 11:44 ? 00:03:34 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 14515 1 4 104432 229508 3 11:49 ? 00:03:30 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 17360 1 4 102608 222544 3 12:16 ? 00:02:36 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 21006 1 7 103863 227544 1 12:46 ? 00:01:37 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 23795 1 8 101678 218672 1 12:59 ? 00:00:49 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi

Based on values in the RSS column it seems that Lighttpd is now getting all of the fcgi processes involved. This is what the change log meant by “Added round-robin support to mod_fastcgi”. Now I’ve got 20 bloated processes in memory eating up almost 5 GB of RAM (around 230MB each). It’s a good thing I have 8GB in these servers!

Let’s have another look at the Lighttpd error log:

2008-01-09 11:05:13: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000
2008-01-09 11:05:14: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:17: (log.c.75) server started
2008-01-09 11:33:15: (network_openssl.c.133) SSL: 5 -1 104 Connection reset by peer
2008-01-09 11:33:15: (connections.c.603) connection closed: write failed on fd 8
2008-01-09 11:44:38: (mod_fastcgi.c.2462) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7012
2008-01-09 11:44:38: (mod_fastcgi.c.3254) response not received, request sent: 855 on socket: tcp:127.0.0.1:7012 for /dispatch.fcgi , closing connection
2008-01-09 13:11:15: (server.c.1253) NOTE: a request for /dispatch.fcgi timed out after writing 34776 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle
2008-01-09 13:11:27: (server.c.1253) NOTE: a request for /dispatch.fcgi timed out after writing 36064 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle
2008-01-09 13:27:00: (mod_fastcgi.c.2462) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7018
2008-01-09 13:27:00: (mod_fastcgi.c.3254) response not received, request sent: 1226 on socket: tcp:127.0.0.1:7018 for /dispatch.fcgi , closing connection
2008-01-09 13:28:56: (connections.c.279) SSL: 1 error:140940E5:SSL routines:SSL3_READ_BYTES:ssl handshake failure

We can see where Lighttpd was restarted at 11:20:17, and unfortunately with some continuing instances of “unexpected end-of-file (perhaps the fastcgi process died)”.

Early indications are that we’re probably better off than before due to the now functional round-robin distribution of requests. As a by-product of the improved round-robin request handling I suspect I may have reduced the frequency of 500 errors seen by the users because Lighttpd will now send the request to the next fcgi process in the round-robin pool whereas before it was not using all 20 of the available processes. I will go ahead and install this latest version of Lighttpd on all the application servers.

However, the fcgi processes are still dying….. which means I get to figure out why.

The fun never ends.


Index your data already!

November 29th, 2007

So I’m still toiling away building this Ruby-on-Rails based web application and decided to take a break from coding optimization and have a look at the MySQL slow query log files on the database servers.

Although I found several good examples of slowly executing queries, I’ll just talk about this one:

SELECT profiles.`id` AS t0_r0, profiles.`user_id` AS t0_r1, profiles.`type` AS t0_r2, profiles.`screen_name` AS t0_r3, profiles.`postal_code` AS t0_r4, profiles.`updated_at` AS t0_r5, profiles.`created_at` AS t0_r6, profiles.`real_name` AS t0_r7, profiles.`short_name` AS t0_r8, profiles.`city` AS t0_r9, profiles.`state` AS t0_r10, profiles.`hometown` AS t0_r11, profiles.`quotation` AS t0_r12, profiles.`bio` AS t0_r13, profiles.`statements` AS t0_r14, profiles.`achievements` AS t0_r15, profiles.`specialties` AS t0_r16, profiles.`inspirations` AS t0_r17, profiles.`country` AS t0_r18, profiles.`channel_id` AS t0_r19, profiles.`styles` AS t0_r20, profiles.`featured_at` AS t0_r21, profiles.`featured_by_id` AS t0_r22, profiles.`bookmarked_by_count` AS t0_r23, profiles.`group_members` AS t0_r24, profiles.`looking_for` AS t0_r25, profiles.`work_with` AS t0_r26, profiles.`favorite_events` AS t0_r27, profiles.`fan_of` AS t0_r28, profiles.`favorite_talent` AS t0_r29, profiles.`favorite_music` AS t0_r30, profiles.`favorite_films` AS t0_r31, profiles.`favorite_performances` AS t0_r32, profiles.`membership_criteria` AS t0_r33, profiles.`skills` AS t0_r34, profiles.`materials` AS t0_r35, profiles.`exhibitions` AS t0_r36, profiles.`views` AS t0_r37, profiles.`soapbox` AS t0_r38, profiles.`user_activated` AS t0_r39, profiles.`phone_number` AS t0_r40, profiles.`address` AS t0_r41, profiles.`role_cache` AS t0_r42, profiles.`genre_cache` AS t0_r43, profiles.`upload_total_bytes_cache` AS t0_r44, profiles.`upload_item_count_cache` AS t0_r45, profiles.`popularity_factor` AS t0_r46, profiles.`views_today` AS t0_r47, profiles.`portfolio_views_today` AS t0_r48, profiles.`bookmarkings_today` AS t0_r49, profiles.`acknowledgements_today` AS t0_r50, profiles.`popularity_data` AS t0_r51, profiles.`popularity_factor_raw` AS t0_r52, profiles.`display_admin_info` AS t0_r53, profiles.`places` AS t0_r54, profiles.`where` AS t0_r55, profiles.`who` AS t0_r56, profiles.`craziest` AS t0_r57, profiles.`associations` AS t0_r58, profiles.`primary_image_id` AS t0_r59, profiles.`custom_url_name` AS t0_r60, profiles.`custom_url_hits` AS t0_r61, profiles.`web_presence_data` AS t0_r62, profiles.`has_portfolio_image` AS t0_r63, profiles.`has_portfolio_audio` AS t0_r64, profiles.`has_portfolio_remote_video` AS t0_r65, profiles.`has_portfolio_writing` AS t0_r66, profiles.`last_portfolio_item_created_at` AS t0_r67, profiles.`has_portfolio_video` AS t0_r68, profiles.`spotlighted_at` AS t0_r69, profiles.`clan_name` AS t0_r70, genres.`id` AS t1_r0, genres.`channel_id` AS t1_r1, genres.`name` AS t1_r2, genres.`display` AS t1_r3, roles.`id` AS t2_r0, roles.`name` AS t2_r1, roles.`display` AS t2_r2, channels.`id` AS t3_r0, channels.`name` AS t3_r1, channels.`ad_click_id` AS t3_r2, channels.`ad_zone_id` AS t3_r3, channels.`indieclicks_zone_id` AS t3_r4, portfolios.`id` AS t4_r0, portfolios.`profile_id` AS t4_r1, portfolios.`type` AS t4_r2, portfolios.`created_at` AS t4_r3, portfolios.`updated_at` AS t4_r4
FROM profiles
LEFT OUTER JOIN profile_genres ON ( profiles.`id` = profile_genres.`profile_id` )
LEFT OUTER JOIN genres ON ( genres.`id` = profile_genres.`genre_id` )
LEFT OUTER JOIN profile_roles ON ( profiles.`id` = profile_roles.`profile_id` )
LEFT OUTER JOIN roles ON ( roles.`id` = profile_roles.`role_id` )
LEFT OUTER JOIN channels ON channels.id = profiles.channel_id
LEFT OUTER JOIN portfolios ON portfolios.profile_id = profiles.id
AND portfolios.`type` = 'FreePortfolio'
WHERE ((profiles.channel_id =9)
AND (profiles.user_activated =1))
AND (featured_at IS NOT NULL)
AND profiles.id IN ('1051301', '1047701', '805801', '1016601', '566501', '215501', '605501', '301901', '781401', '768001', '912401', '911301')
ORDER BY featured_at DESC

A Rails developer will immediately recognize that as a typical SQL statement created by ActiveRecord to support our insatiable desire for OOPH (Object Oriented Programming Happiness). There IS a potential dark side to the dot notation we’ve come to love.

Slow query log had this to say about it:

# Query_time: 5  Lock_time: 0  Rows_sent: 2847  Rows_examined: 5645986

Notice the Rows_examined value - 5,645,986 rows! It’s a good thing our database servers have a crap-ton of RAM in them along with super-fast RAID disk arrays :-)

I then manually re-executed that same query several times and it consistently needed 1 to 3 seconds to run. Time to EXPLAIN things a little more or rather have MySQL explain it for me. Running EXPLAIN on that same query revealed the need to index the profile_id field in two tables (profile_genres and profile_roles).

After taking a minute to create the new indexes (including making new migration files and whatnot), I ran EXPLAIN again and was pleased to see MySQL needed to look at only 18 rows of data now instead of 5,645,986. Furthermore, the SAME query that previously needed several seconds was now consistently finishing in 0.007 seconds.

How the hell did we let this happen?!?! Simple: when juggling too many balls some inevitably get dropped. Ok, that’s only part of it. Another part is resisting the temptation to do premature optimizations to your systems. Arguably, in this case, the indexes should have been created when those data associations were set up. This is why we use the slow query logging and must periodically check for surprises like this.

Keep in mind that data indexing comes at a price. Indexing can be a double-edged sword in the sense that those indexes must be maintained when writing data to indexed tables. It is quite possible, even easy, to kill database write performance if you go too far and index too many things trying to speed up read performance.

Checking your slow query logs frequently and making cautious indexing changes will help more than most people realize. Especially when your tables grow beyond a few thousand rows.


Sunday Ride Report

November 11th, 2007

I only have time for the highlights…

  • Perfect weather.
  • A dozen skilled riders all wearing the appropriate safety gear.
  • No crashes.
  • No financial penalties.
  • Countless excellent roads south of SR60. Thanks to Dave for the tour guide service!
  • Surprisingly good food at the “trough” (Golden Corral) in Lake Placid.
  • 400.28 miles for me today.
  • I need a new rear tire. I’m going to try the new dual compound Pilot Road 2.

Concurrent Image Processing - Part One

November 7th, 2007

I’m building a new web site using the Ruby on Rails framework. Our users will be uploading countless image files (eventually there will be many millions of image files!). After each image file is uploaded, and depending on where that image will later be displayed, we must crop and resize it to make various sized thumbnail versions and display versions of the originally uploaded image file. Sounds simple, right? If only… In a perfect world:

  • The image would be optimized with respect to pixel dimensions and quality/compression. It would be no larger than the largest display size we use in the site and would be suitably compressed for display on a web page. This would reduce upload time, conserve filesystem space on the server, ensure the fastest possible page loading time, and reduce our bandwidth costs.
  • GIF, JPG, and PNG formats would be used appropriately.
  • They would all be RGB instead of CMYK. The CMYK vs RGB issues are complicated, but generally speaking CMYK is for print and RGB is for computer screens.

However our world isn’t so perfect because we see a lot of:

  • 8+ megapixel images straight from the user’s camera, each consuming several megabytes of disk space and taking many minutes for the user to upload.
  • Absolutely no attempt at JPG compression.
  • GIF files that most certainly needed to be JPG and vice versa.
  • the occasional CMYK file which doesn’t survive the conversion to RGB.

In reality, it’s just a whole lot easier if we let the user upload whatever they want and we’ll “fix” it on the server side. Understandably, most users have no clue about resizing, cropping, JPG compression, and all the other things we must deal with.

The real fun begins AFTER the file is uploaded. There are two general approaches for handling the server-side processing of the user’s uploaded image.

  1. Serialized processing: In the same process thread that is serving up the web page, we could do all the cropping, resizing, and compressing that is needed in a step-by-step serialized fashion followed by placing the resulting set of images on the image servers. There are 2 main problems with this approach: 1) The user’s browser session is literally put on hold and the browser will appear to be frozen during this processing and 2) During the processing, that connection to the web server is tied up and cannot service other requests thus making the application less scalable. This also happens to be the easiest way to do it, and in fact is how many sites handle such things.
  2. Concurrent processing: With almost no interruption of the user’s browsing session (other than the time it takes for the initial upload), we can place unprocessed uploads into a queue and have a SEPARATE process do all the work. This allows the user to continue using the site while their upload is being processed in the background on a different server that is dedicated to image processing. This processing approach is going to require more time to architect and implement.

For obvious reasons, I wanted to take the concurrent processing approach. To ensure that our application will scale up to millions of users, we are using a cluster of servers. Each of the servers handles a specific set of tasks related to front-end, image serving, image processing, audio serving, audio processing, database cluster, etc. The initial upload goes to the particular application server a particular user is hitting.

So, those are the issues….. what do we do?

Conceptually, the steps are actually simple and I’ll break it down by server type:

  • Application Servers: User uploads a new image (we call them assets) to whichever application server is handling their session. Behind the scenes, this consists of an application server receiving the uploaded file and placing the uploaded file in a pre-determined location on the application server which makes it available for “pick up” by an asset_processor program. The uploaded item is then inserted into an “asset processing queue”.
  • Asset Processing Servers: We have several servers each running multiple instances of an asset_processor.rb script which frequently looks in the “asset processing queue” for work to do. When a new item is found in the queue that isn’t already being processed, the asset_processor timestamps the started_at field for the item, copies the item to a local working directory and begins the processing of the item. When the processing is finished, the final set of files is copied over to the appropriate Asset Servers such that the world can get to them. After processing, the originally uploaded files are removed from the application server they were uploaded to and the new asset is made “active”. During the time period between the initial upload and completion of processing, the asset is considered to be “inactive” which causes our application to serve up placeholder slug images. As soon as the asset is made active by the processing script, the application magically starts serving up the processed images.
  • Asset Servers: we have images, audio, and (eventually) video asset servers. Fully processed assets are stored on these web servers.

Of course, no article about web development using Ruby on Rails is worth reading unless I include at least a little bit of code…

In “Part Two” of this article I will discuss several sections of the code doing some of the above described work.