Friday, April 17, 2009

HTTP Server load balancing secrets

The HTTP Server plugin that ships with WAS is quite a nice little piece of router magic that ships for free, yet work load management through the HTTP Server seems like such black magic to most people. There are just a few things you should consider when tuning the plugin that might make your life a bit easier:

1) The ConnectTimeout parameter on each element only governs the HTTP Server's ability to open a socket to the WebSphere Portal HTTP endpoint. This is possible even if the entire Web Container thread pool is consumed. So, even if Portal is hung up and all threads are exhausted, the plugin can still connect to Portal's HTTP endpoint, and the plugin will continue to hammer Portal with more traffic until it is reduced to a bloody pulp. The ConnectTimeout parameter is only good for when the Portal's java process crashes and thus cannot respond to new socket requests.

The best way to defend against a hung Portal is to set the ServerIOTimeout attribute (missing by default, which is why most people miss it). This governs how long the plugin will wait after establishing a connection for data to be returned. Set this value to your pain threshold for waiting for a page to return. After this timeout, that server will be marked down.

2) Performance tests show that "Random" is a more efficient workload management policy than "Round Robin". There is evidence that shows when using Round Robin and a server gets marked down, a larger than normal allotment of that server's traffic is shifted to the very next server in line instead of being evenly redistributed across the cluster.

3) Consider using the and elements for active/passive configurations, where parts of a cluster can be targetted for normal traffic (PrimaryServer) and other servers in the cluster only get traffic if all of the primaries are marked down (BackupServer).