Tuesday, April 20, 2010

Installing Sun Java 6 on Ubuntu 10.04 Lucid Lynx

It looked like Canonical was going to totally abandon Sun's JDK with the release of Lucid Lynx. After heated discussions, however, it was instead merely tucked away even deeper into the recesses of alternative repositories. Now it lives in the partner repository, so as root you'll need to run

add-apt-repository "deb http://archive.canonical.com/ lucid partner"
apt-get update

to add the appropriate repository. Now you can use apt-get install as before to install the sun-java6-jdk or sun-java6-jre packages. For those curious, this is the "official" way to do this according to the release notes.
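
For example, to install the full JDK (substitute sun-java6-jre if you only need the runtime; you'll be prompted to accept Sun's license along the way):

apt-get install sun-java6-jdk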

Monday, April 19, 2010

Connecting to JMX on Tomcat 6 through a firewall

One of the flaws (in my opinion, and one shared by others) in the design of JMX over RMI is that the server listens on a port for connections, and when one is established it negotiates a secondary port on the server side and expects the client to connect to that. Well, OK, except that it picks an available port at random, so if your target machine is behind a firewall you're out of luck, because you don't know which port to open up!

With the release of Tomcat 6.0.24, a new Listener (the JmxRemoteLifecycleListener) is available that lets you connect to JMX running on your Tomcat server using jconsole. Using this Listener you can specify the secondary port number instead of it being picked at random. This way, you can open two known ports on your firewall and jconsole will happily connect and read data from Tomcat's JVM over JMX.

Setting it up is pretty easy. First, copy catalina-jmx-remote.jar from the extras folder of the binary distribution into Tomcat's lib folder.
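
Something along these lines, assuming Tomcat is installed under $CATALINA_HOME and you've downloaded the extras jar that matches your exact Tomcat version:

cp catalina-jmx-remote.jar $CATALINA_HOME/lib/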

Update your server.xml to include the Listener:

<Listener className="org.apache.catalina.mbeans.JmxRemoteLifecycleListener" rmiRegistryPortPlatform="10001" rmiServerPortPlatform="10002"/>

Replace the ports with whichever ones you wish, and make sure to open up those ports on your firewall. Be sure to properly configure JMX with authentication and SSL. Or, if you're just setting this up for testing, you can go with the totally insecure and unsafe configuration and add the following JVM arguments to your Tomcat startup script (typically via CATALINA_OPTS or JAVA_OPTS):

-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false

Now you can start Tomcat. On your client machine, start jconsole and drop in the following URL for your remote process:

service:jmx:rmi://your.public.dns:10002/jndi/rmi://your.public.dns:10001/jmxrmi

Obviously you need to replace your.public.dns with the DNS address of your Tomcat machine, and if you chose different ports, change those as well. With some luck, you'll connect and be getting data!
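
As an aside, jconsole will also accept the service URL directly on the command line, which saves a trip through the connection dialog:

jconsole service:jmx:rmi://your.public.dns:10002/jndi/rmi://your.public.dns:10001/jmxrmi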

If you're on EC2 or a similar network where you have an internal DNS name that's different from your external/public DNS name, one more step is required: additionally set the following property to the server's external/public DNS name:

-Djava.rmi.server.hostname=your.public.dns
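
Putting it all together for a test setup, one convenient option is to drop the properties into bin/setenv.sh, which the stock Tomcat startup scripts will source if it exists. A sketch (again, this leaves JMX completely unsecured, and you can drop the last line if your public and internal DNS names match):

CATALINA_OPTS="$CATALINA_OPTS \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Djava.rmi.server.hostname=your.public.dns"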

And with that bit of magic you should be off and collecting data!

Sunday, April 18, 2010

Optimizing PostgreSQL/Tomcat for Write-Heavy Workloads

Recently I've been working on tuning the performance of a Tomcat web front-end and PostgreSQL back-end. In particular I wanted to stress some write-heavy scenarios, so I designed a JMeter test plan and ran it using Maven Chronos (as described in this post). Also I have collectd running on the machines and reporting various system metrics to a central server for graphing. This is essential to help identify which system resources are contributing to a performance bottleneck.

In this post I don't want to get too hung up on the exact nature of the queries or the hardware configurations. Instead I'd like to focus on the investigative process itself. Let's start off by showing some graphs from a 20-minute stress run. The JMeter test plan is configured to ramp up linearly from 0 to 50 client threads over a 5-minute period, then continue for another 15 minutes:


[Graphs: Postgres and Tomcat CPU utilization]

Right away we notice that the stress run isn't really stressing either the Tomcat or the Postgres machine. (Aside: in this post I'm only going to show CPU graphs. Obviously you need to look at other resources as well, but for the purposes of this discussion, looking at CPU is enough to get the idea.) At first it might seem that we're simply not hitting the server hard enough. Maybe 50 client threads is too few? Yet as we can see from the throughput graph, overall throughput rises until we get to about 15 threads and is fairly flat after that. This suggests that the problem is not with the test setup, but with something in the server configurations.

Also notice that performance is actually pretty bad from a query response time perspective. The response times are all over the map, with a median around 300 ms, a 95th percentile all the way up at about 1.2 seconds, and some queries lasting as long as 3.5 seconds. Ouch!

The most suspicious thing to me is that throughput doesn't increase when more than about 15 threads are hitting the servers. Both Tomcat and PostgreSQL are designed to be extremely capable in high-volume environments. No way could 15 threads be causing us to max out. The huge variance in response times implies that requests are being queued rather than handled right away. After running the test again, I logged into Postgres and ran SELECT count(*) FROM pg_stat_activity a few times during the run. There were never more than 8 connections to the database.
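
If you'd rather watch the connection count continuously than re-run the query by hand, something like this works, assuming you can run psql against the database from a shell (the database name here is just a placeholder):

watch -n 5 'psql -d yourdb -c "SELECT count(*) FROM pg_stat_activity"'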

As it turns out, 8 is the default maximum number of connections allowed by the Apache Commons Database Connection Pool (DBCP). This looks to be our first culprit: it explains why throughput stopped increasing after just a small number of client threads, and why response time variance was so high. So let's bump the maximum DBCP connections up to 50 and rerun the test.
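
Where that setting lives depends on how you've defined the pool; if it's a JNDI DataSource Resource in Tomcat's context.xml, the attribute to raise is maxActive. A sketch, with made-up resource name, URL, and credentials:

<Resource name="jdbc/mydb" auth="Container" type="javax.sql.DataSource"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://dbhost:5432/mydb"
          username="appuser" password="secret"
          maxActive="50" maxIdle="10" maxWait="10000"/>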



[Graphs: Postgres and Tomcat CPU utilization]

Nice! Not only did our throughput increase by about 50%, but the response times are more consistent and have fewer extremes. The throughput graph shows throughput increasing until about 40 client threads, which is below the new maximum number of database connections, suggesting that contention for the connection pool is no longer a big issue. The second core on the Postgres machine is also finally being utilized. Our system is spending less time queuing and more time working.

But check out the high IO wait times on the Postgres machine: 20-30% of CPU time is spent waiting on IO to complete rather than doing useful processing, which seems like a really high proportion. As I mentioned at the beginning, the test plan I'm running is relatively write-heavy, and other metrics on the Postgres machine related to disk IO (not reproduced here) also pointed to this as a likely bottleneck. So I set about researching how to improve my Postgres configuration for write performance.


I played around with a number of parameters. The most important for this workload turned out to be those related to writing WAL files. In particular, changing the following parameters to these values had the biggest impact:

synchronous_commit = off
full_page_writes = off
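
Both of these go in postgresql.conf, and both can be picked up with a configuration reload rather than a full restart, for example (the data directory path is a placeholder):

pg_ctl reload -D /path/to/data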

The results:



[Graphs: Postgres and Tomcat CPU utilization]

Clearly a major improvement. Throughput increased by another 25%, and response times not only dropped by about 50% but are now very consistent. On the Postgres machine very little time is spent waiting on IO. In fact it looks like our throughput increased all the way up to 50 client threads, suggesting that if we increase the number of threads we'll see that the system is capable of even more.

One odd thing about the last test results is the periodic drops to zero throughput. That's a mystery I'll solve for you in a future post.

It's important to fully understand the impact of these settings before putting them into production. In combination they make it possible to lose recently committed transactions and increase the chances of data corruption in the event of an OS crash or power failure (though not to the same degree as turning off fsync). As such, configuring Postgres this way should only be done if you can tolerate or otherwise mitigate that risk.