I stumbled into a quirk with Apache, mod_proxy, and our sm_module (CA’s Siteminder agent), and thought it might be of use to others.
The setup: a CA Siteminder protected website, reverse proxied (mod_proxy) to a Java servlet web application running on Tomcat.
Tomcat is front-ended by a hardware load balancer.
Pretty standard, just handling regular J2EE web traffic. It looks something like this (horizontal lines represent firewalls, but that is mostly irrelevant to the discussion. We will assume they are functioning as expected).
For those not familiar with Siteminder, it’s a Web Access Management platform that, among other things, provides authentication and authorization. This design generally provides a layer of insulation for our internal applications. Think of it as a second or third layer of defense, beyond the firewalls and IPS. Most folks in an enterprise would find this somewhat typical, though the tech stack may vary.
All our HTTP connections for users are humming along splendidly, but occasionally when a user POSTs a large payload (PDFs in this case), the user gets the dreaded “HTTP 502 Bad Gateway” from Apache HTTPD.
The application team reports these requests are not even making it to the application server. That’s pretty unusual: typically HTTP 502 Bad Gateway stems from a timeout or bad response from the back-end. In those cases, the request makes its way to the application server, which just doesn’t respond appropriately. So immediately, I’m suspicious.
As I mentioned, the first assumption is that the back-end is taking longer than the proxy timeout to respond, or is not returning a response at all.
The first thing we do is try the same request against the Tomcat instance directly†.
<40MB of post payload>
And 25-30 seconds later, I get a successful response:
HTTP 200 OK
So that is surprising: the application server happily responds with success, and no obvious timeout (25-30 sec is just time spent uploading).
I make the assumption the application itself is having no problems handling the request/payload, and no strange idle timeout is occurring.
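A direct test like the one above can be sketched with curl. This is only an illustration: the function name post_timed, the Content-Type, and the example URL below are my assumptions, not the exact request we sent.

```shell
# Hypothetical helper: POST a file to a URL and print the HTTP status code
# and the total transfer time, discarding the response body.
post_timed() {
    url=$1
    file=$2
    curl --silent --show-error --output /dev/null \
         --header 'Content-Type: application/pdf' \
         --data-binary "@${file}" \
         --write-out '%{http_code} %{time_total}s\n' \
         "$url"
}
```

For example, post_timed http://tomcat.example.internal:8080/app/upload big.pdf (a made-up host and path). Running the same call against each hop in turn makes it easy to compare where the request starts failing.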
Proceeding up the infrastructure, I try the same request against the hardware load balancer. Sure enough, success again.
Next, we try the request against the Apache proxy. Ah ha! The problem recurs. However, we still don’t have answers. We just know it starts somewhere at Apache. Now we must dig into a capture to see if anything uncommon is happening.
† This can be done by capturing the original attempt with CharlesProxy, Fiddler2, BurpSuite, or whatever proxy/logging tool floats your boat. If you debug web issues frequently and are not familiar with any of these tools, I highly recommend you try them out and find what is right for you.
Capturing the behavior
Fortunately, the behavior was easily reproducible, both by end users and by ourselves. So I fire up tcpdump and start monitoring the backend connection between the Apache proxy and the application server (since this is where we suspect trouble).
Sure enough, we quickly reproduce the issue. I jump into Wireshark and follow the TCP connection.
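A capture along these lines can be sketched as follows. The function name, the default output file, and the address and port in the usage note are placeholders, not the values from our environment.

```shell
# Hypothetical helper: capture traffic between this host (the proxy) and one
# backend address/port into a pcap file that Wireshark can open later.
capture_backend() {
    backend_ip=$1
    backend_port=$2
    out_file=${3:-backend.pcap}
    # -i any : listen on all interfaces (Linux)
    # -n     : skip name resolution
    # -s 0   : capture full packets rather than truncated ones
    # -w     : write raw packets to a file instead of printing them
    tcpdump -i any -n -s 0 -w "$out_file" \
        host "$backend_ip" and tcp port "$backend_port"
}
```

Run it as root on the proxy, for example capture_backend 10.0.0.50 8080, reproduce the failing upload, then stop it with Ctrl-C and open the file in Wireshark.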
An interesting pattern appears:
The user establishes an HTTP/1.1 Keep-Alive connection, no problem. I see the previous conversation going back and forth between the proxy and the application server’s load-balanced IP. After exactly 20 seconds, the application server sends a FIN to hang up the connection (from the Tomcat side!?).
It turns out Tomcat had its HTTP connector connectionTimeout set to 20000 milliseconds (the default configuration). If the request doesn’t reach Tomcat within that window, Tomcat initiates a socket shutdown (FIN). The problem is that Apache is still buffering POST data, waiting to hand it off to the connection pool. It seems that Apache allocates the backend connection first, then delivers the payload to the backend once it has finished buffering from the end user. So if the 40MB file takes several minutes to upload, Tomcat closes the proxied connection before the upload completes!
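To put rough numbers on that, here is a back-of-the-envelope sketch. The link speeds are made-up examples, and it assumes the backend connection sits idle for the entire client upload, which is our working theory rather than something we confirmed.

```shell
# Seconds needed to upload SIZE_MB megabytes at RATE_MBIT megabits per second
# (integer arithmetic, rounded up).
upload_seconds() {
    size_mb=$1
    rate_mbit=$2
    echo $(( (size_mb * 8 + rate_mbit - 1) / rate_mbit ))
}

# Tomcat's connectionTimeout from this post, in seconds (20000 ms).
tomcat_timeout_s=20

# Illustrative client link speeds in Mbit/s.
for rate in 1 2 8 16 100; do
    secs=$(upload_seconds 40 "$rate")
    if [ "$secs" -gt "$tomcat_timeout_s" ]; then
        verdict="exceeds"
    else
        verdict="within"
    fi
    echo "40 MB at ${rate} Mbit/s: ${secs}s (${verdict} the ${tomcat_timeout_s}s timeout)"
done
```

Anything slower than about 16 Mbit/s pushes a 40MB upload past the 20 second window, which fits with only some users hitting the 502.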
So, this configuration is slightly unusual compared to your typical mod_proxy + Tomcat setup. We run the Siteminder webagent as mentioned, and we use load balancers for distributing load, not mod_jk or mod_proxy_ajp. As far as we can tell, the Siteminder webagent attempts to do some inspection of the POST payload, which prevents flushing from that bucket to the proxy bucket. Deeper analysis and debugging of the Apache process and its modules would be required to confirm this. For us, the urgency was to resolve the issue for the application.
We increased the HTTP connector “connectionTimeout” value to match the Apache HTTPD ProxyTimeout value. Now Tomcat will keep the connection open as long as HTTPD might, which should allow Apache HTTPD to close down backend connections as it sees fit. Ideally, the backend HTTP connector timeout should be just a hair longer, to ensure the connections close in the right order, but so far we must have been lucky.
Always pay attention to these timeouts, especially when they happen consistently: the 20 second disconnection we observed was the key to finding a solution. Often, when you are able to reproduce the problem reliably and the duration of the timeout is some very round number like 15, 30, 60, or 180 seconds, there is some hardcoded default value just waiting to bite you.
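A quick way to sanity-check the two values against each other might look like the sketch below. The function name and file paths are hypothetical, and it assumes the settings appear literally as connectionTimeout="N" in server.xml and as a ProxyTimeout N line in the httpd configuration. Keep the units in mind: Tomcat's value is in milliseconds, Apache's is in seconds.

```shell
# Hypothetical helper: compare Tomcat's connectionTimeout (milliseconds)
# with Apache's ProxyTimeout (seconds) and report whether Tomcat waits at
# least as long as the proxy does.
check_timeouts() {
    server_xml=$1
    httpd_conf=$2
    # First connectionTimeout="N" attribute found in server.xml.
    tomcat_ms=$(sed -n 's/.*connectionTimeout="\([0-9][0-9]*\)".*/\1/p' "$server_xml" | head -n 1)
    # First "ProxyTimeout N" directive found (case-sensitive match only).
    proxy_s=$(sed -n 's/^[[:space:]]*ProxyTimeout[[:space:]][[:space:]]*\([0-9][0-9]*\).*/\1/p' "$httpd_conf" | head -n 1)
    if [ -z "$tomcat_ms" ] || [ -z "$proxy_s" ]; then
        echo "could not find both settings" >&2
        return 2
    fi
    if [ "$tomcat_ms" -ge $((proxy_s * 1000)) ]; then
        echo "ok: tomcat=${tomcat_ms}ms proxy=${proxy_s}s"
    else
        echo "mismatch: tomcat=${tomcat_ms}ms proxy=${proxy_s}s"
        return 1
    fi
}
```

Something like check_timeouts /opt/tomcat/conf/server.xml /etc/httpd/conf/httpd.conf (example paths) would have flagged our original 20000 ms setting as a mismatch.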
Stumbled upon this great link on troubleshooting from NANOG mailing lists, and had to pass it on to anybody who hasn’t read it, especially the incident with Mary. Brings me back to the tech support days.
Excellent cheatsheet of netcat uses.
A couple tools any Linux user should know about, and their frequent uses:
The essentials which I won’t cover:
ls, grep, sed, awk, cat, less, head, tail, ..
I’m sure there are others. These should automatically just be extensions of your brain: you need to be intimately familiar with them to be productive on a command line. If you aren’t, I suggest you search around the net. There are thousands of tutorials to bring you up to speed on each one individually, and then you can progress to chaining them together.
On to the real meat of this post:
netstat really should be included above, but it does have some special uses for application debugging. Useful flags:
netstat -anlp | grep PID
Netcat has become the multi-tool of connection testing. Where we used to use “telnet” to establish simple outbound TCP connections, nc now provides that, plus a listening mode to receive incoming connections. This is especially useful for validating firewall configurations before your applications ever get installed. Plus, combining nc with chained commands such as tar or gzip can make for some very quick file transfer mechanisms (bypassing ssh/scp’s performance limitations). Common uses:
nc host port — Connect outbound to a host:port
nc -l 8080 — Listen for a connection on 8080 and exit when closed.
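The tar-over-nc transfer mentioned above could be sketched like this. The function names are mine, and netcat variants differ: traditional netcat wants -l -p PORT to listen, and some builds need an extra flag such as -q 0 or -N before the sender exits at end of input, so treat the exact flags as assumptions to check against your nc man page.

```shell
# Receiver: listen on PORT and unpack whatever arrives into DEST_DIR.
receive_dir() {
    port=$1
    dest_dir=$2
    nc -l "$port" | tar -xzf - -C "$dest_dir"
}

# Sender: pack the contents of SRC_DIR and stream them to HOST:PORT.
send_dir() {
    src_dir=$1
    host=$2
    port=$3
    tar -czf - -C "$src_dir" . | nc "$host" "$port"
}
```

Start receive_dir 9000 /tmp/incoming on the destination first, then run send_dir ./build desthost 9000 on the source (port, paths, and host name are examples). There is no encryption or authentication here, so keep it to trusted networks.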
lsof is a handy way to list the open files/handles/sockets of a process. Common flags:
lsof -nPp PID
strace is an awesome utility for monitoring the system calls an application makes. Having problems debugging an app that doesn’t seem to read your configs? Or hangs every 30 seconds? Fire up strace and attach to the pid to find out that it’s reading the wrong path, or connecting to a downed service! Want to find the longest or most frequent running system calls? No problem! The volume of info and ease of use strace provides make it an essential part of your toolkit. Common flags:
strace -cp PID
Will give you a nice table that counts the syscalls and sorts them, as well as the time spent executing.
strace -ttTp PID
Spits out the timing down to the microsecond of system calls.
Add -f to follow forked processes as well (handy for things like apache pre-fork, or any similar threaded/forking application). Make sure to use -o FILE to write out your output, it can move pretty quick!
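Putting those flags together, a small sketch for recording a trace and then pulling out the slowest calls. Both function names are mine, and the parsing relies on -T appending the elapsed time as <seconds> at the end of each line.

```shell
# Hypothetical helper: follow forks, timestamp every call, and write the
# trace to a file until interrupted.
trace_to_file() {
    pid=$1
    out_file=$2
    strace -f -ttT -o "$out_file" -p "$pid"
}

# Hypothetical helper: list the N slowest calls recorded in a trace file.
slowest_syscalls() {
    trace_file=$1
    count=${2:-10}
    # Move the trailing <seconds> field to the front of each line, then sort
    # numerically, largest first. Lines without a duration are skipped.
    sed -n 's/^\(.*\) <\([0-9][0-9]*\.[0-9][0-9]*\)>$/\2 \1/p' "$trace_file" \
        | LC_ALL=C sort -rn \
        | head -n "$count"
}
```

For example, run trace_to_file 1234 /tmp/app.trace while reproducing the hang, stop it with Ctrl-C, then slowest_syscalls /tmp/app.trace 5 shows where the time went.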
gdb, the glorious debugger. If you are here, it’s probably because you have a poorly behaving app that is core dumping. strace can only go so far: you want to find the problem code and kick it back to the developers. GDB will help here. We can attach to a process (which will pause it), then continue the process, and have the application perform whatever causes the core dump. At that point, GDB should spit back the problem line, and hopefully provide you a window into the problem. Common use:
gdb [binary name] [PID]
Issue “c” to continue the app. Generate your problem/seg fault, and observe the console output.
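The same attach-and-continue flow can be scripted from the command line. This is a sketch with a function name of my own: -p attaches by PID and each -ex runs a gdb command in order, so the backtrace is printed once the process stops again (for example on a SIGSEGV).

```shell
# Hypothetical helper: attach to a running process, let it continue, and
# print a backtrace when it next stops (for example on a segfault).
attach_and_wait() {
    pid=$1
    # -p  : attach to the process with this PID (this pauses it)
    # -ex : run the given gdb commands, in order, after attaching
    gdb -p "$pid" -ex 'continue' -ex 'backtrace'
}
```

After the backtrace you are left at the gdb prompt to poke around further. How readable the output is depends on whether the binary was built with debug symbols.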
Beej has a pretty slick guide to gdb: http://beej.us/guide/bggdb/
Just getting started.. hold on to your hats.