Reverse Proxies and the gremlins within

I stumbled into a quirk with Apache, mod_proxy, and our sm_module (CA's SiteMinder agent), and thought it might be of use to others.

Basic Setup

A CA SiteMinder-protected website, reverse proxied (mod_proxy) to a Java servlet web application running on Tomcat.
Tomcat is fronted by a hardware load balancer.
Pretty standard, just handling regular J2EE web traffic, with firewalls separating the tiers (the firewalls are mostly irrelevant to the discussion; we will assume they are functioning as expected).
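On the Apache side, the reverse proxy piece boils down to a handful of mod_proxy directives. The snippet below is only a sketch: the back-end hostname and port are placeholders, not our production configuration.

LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so

# Hand everything under /abcdef/ to the Tomcat pool behind the load balancer
ProxyPass        /abcdef/ http://tomcat-lb.internal.example:8080/abcdef/
ProxyPassReverse /abcdef/ http://tomcat-lb.internal.example:8080/abcdef/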

For those not familiar with SiteMinder, it's a Web Access Management platform that, amongst other things, provides authentication and authorization. This design generally provides a layer of insulation in front of our internal applications. Think of it as a second or third layer of defense beyond the firewalls and IPS. Most folks in an enterprise would find this setup fairly typical, though the tech stack may vary.

Our Problem

All of our users' HTTP connections are humming along splendidly, but occasionally, when a user POSTs a large data payload (PDFs, in this case), the user gets the dreaded "HTTP 502 Bad Gateway" from Apache HTTPD.

The application team reports that these requests are not even making it to the application server. That's pretty unusual: typically an HTTP 502 Bad Gateway stems from a timeout or a bad response from the back end. In those cases, the request does make its way to the application server; the server just fails to respond appropriately. So immediately, I'm suspicious.

Initial Investigations

As I mentioned, the first assumption is that the back end takes too long to respond within the proxy timeout, or simply never returns a response at all.

The first thing we do is try the same request against the Tomcat instance directly†.

POST /abcdef/servlet
<40MB of post payload>

And 25-30 seconds later, I get a successful response:
HTTP 200 OK

So that is surprising: the application server happily responds with success, and there is no obvious timeout (the 25-30 seconds is just time spent uploading).
I assume the application itself has no problem handling the request or the payload, and that no strange idle timeout is occurring.

Proceeding up the infrastructure, I try the same request against the hardware load balancer. Sure enough, success again.

Next, we try the request against the Apache proxy. Aha! The problem reoccurs. However, we still don't have answers; we just know the failure starts somewhere at Apache. Now we must capture the traffic and see if anything uncommon is happening.

† This can be done by capturing the original attempt with CharlesProxy, Fiddler2, BurpSuite, or whatever proxy/logging tool floats your boat. If you debug web issues frequently and are not familiar with any of these tools, I highly recommend you try them out and find what is right for you.
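If you would rather replay from a shell, curl works just as well. A minimal sketch, assuming the captured request body has been saved to a file; the hostname, port, file name, and Content-Type are placeholders:

# Replay the captured POST body directly at a Tomcat node, bypassing the proxy
curl -i -X POST \
     -H "Content-Type: application/pdf" \
     --data-binary @captured-post-body.bin \
     http://tomcat-node1.internal.example:8080/abcdef/servlet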

Capturing the behavior

Fortunately, the behavior was easily reproducible, both by end users and by us. So I fire up tcpdump and set about monitoring the back-end connection between the Apache proxy and the application server (since this is where we suspect the trouble).
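My capture invocation was along these lines; the interface name, back-end IP, and port are placeholders for whatever your environment actually uses:

# Capture full-size packets on the Apache-to-Tomcat leg and write them out for Wireshark
tcpdump -i eth0 -s 0 -w apache-backend.pcap host 10.0.0.50 and port 8080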

Sure enough, we quickly reproduce the issue. I jump into Wireshark and follow the TCP conversation.

An interesting pattern appears:

The user establishes an HTTP/1.1 keep-alive connection, no problem. I see the previous conversation going back and forth between the proxy and the application server's load-balanced IP. Then, after exactly 20 seconds, the application server sends a FIN to hang up the connection (from the Tomcat side!?).

It turns out Tomcat had its HTTP connector connectionTimeout set to 20000 milliseconds (the default configuration). If the request doesn't reach Tomcat within that window, Tomcat initiates a socket shutdown (FIN). The problem is, Apache still has POST data buffering up from the client, ready to hand off over the pooled back-end connection. It seems as if Apache allocates the back-end connection first, and only once the payload has finished buffering from the end user does it deliver the request to the back end. So if the 40MB file takes more than 20 seconds to upload (several minutes, in our case), Tomcat closes the proxied connection before Apache ever finishes sending the request!
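For reference, the relevant piece of Tomcat's conf/server.xml looks roughly like the stock connector definition below; connectionTimeout is how long Tomcat will wait, after accepting a connection, for the request to actually arrive:

<!-- Stock HTTP connector shipped with Tomcat: 20000 ms is the 20 seconds we saw -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" />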

The cause

So, this configuration is slightly unusual compared to your typical mod_proxy + Tomcat setup. We run the SiteMinder web agent as mentioned, and we use load balancers for distributing load rather than mod_jk or mod_proxy_ajp. As far as we can tell, the SiteMinder web agent attempts to inspect the POST payload, which prevents the data from being flushed from its bucket on to the proxy's until the upload is complete. Deeper analysis and debugging of the Apache process and the modules would be required to confirm this; for us, the urgency was to resolve the issue for the application.

The solution

We increased the HTTP connector's "connectionTimeout" value to match the Apache HTTPD ProxyTimeout value. Now Tomcat will keep the connection open for as long as HTTPD might, which allows Apache HTTPD to close down back-end connections as it sees fit. Ideally, the back-end HTTP connector timeout should be just a hair longer than the proxy timeout, to ensure the connections close in the right order, but so far we must have been lucky.
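In practice, that just means keeping the two settings in step. The values below are illustrative rather than our production numbers:

# httpd.conf: how long mod_proxy waits on the back-end connection
ProxyTimeout 300

<!-- server.xml: match the proxy timeout (300 s = 300000 ms); ideally make it
     a hair longer so that Apache is always the side that gives up first -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="300000"
           redirectPort="8443" />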

Future tips

Always pay attention to these timeouts, especially when they happen consistently: the 20-second disconnection we observed was the key to finding a solution. Oftentimes, when you are able to reproduce the problem reliably and the duration of the timeout is some very round number like 15, 30, 60, or 180 seconds, there is a hard-coded default value just waiting to bite you.

Posted on July 5, 2012 at 4:02 pm by Andy
In: linux, troubleshooting

One Response


  1. Written by Dan E on December 7, 2012 at 12:47 pm

    Excellent article. I have a very similar setup (our app runs in JBoss) and was looking at Apache and SiteMinder settings for timeouts, but never thought to check the actual Tomcat settings.

    Thank you,
