Alex Zhitnitsky: Java Performance Tuning: Getting the Most Out of Your Garbage Collector
The main question here is this: what do you consider acceptable criteria for GC pause frequency and duration in your application? For example, a single 15-second pause per day might be acceptable, while a pause every 30 minutes would be an absolute disaster for the product. The requirements come from the domain of each system; real-time and high-frequency trading systems have the strictest requirements.
Overall, seeing pauses of 15-17 seconds is not a rare thing. Some systems might even reach 40-50 second pauses, and Haim also had a chance to see 5-minute pauses in a system with a large heap running batch processing jobs. For batch workloads like that, pause duration isn’t a big factor.
Stop The World and gather data: The importance of GC logs
The richest source of data on the state of garbage collection in a HotSpot-based JVM is the GC logs. If your JVM is not generating GC logs with timestamps, you’re missing out on a critical source of data for analyzing and solving pausing issues. This is true for development environments, staging, load testing and, most importantly, production. You can get data about every GC event in your system, whether it completed concurrently or caused a stop-the-world pause: how long it took, how much CPU it consumed, and how much memory it freed. From this data you can understand the frequency and duration of the pauses and their overhead, and then move on to taking action to reduce them.
The minimal settings for GC log data collection
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:mygclogfilename.gc
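On Java 9 and later, these flags were removed in favor of unified logging, so a roughly equivalent setting would be:
-Xlog:gc*:file=mygclogfilename.gc:time,uptime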
Looking at metrics, 5% is usually the upper bound for acceptable GC overhead, while acceptable pause times vary widely from one application to another.
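One way to sanity-check that 5% figure from inside a running JVM is to compare cumulative collection time against uptime using the standard GarbageCollectorMXBean API; a minimal sketch (the class name is ours):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Rough in-process estimate of GC overhead: total collection time / JVM uptime.
public class GcOverhead {
    public static void main(String[] args) {
        long gcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector doesn't report it
            if (t > 0) gcTimeMs += t;
        }
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC: %d ms of %d ms uptime (%.2f%% overhead)%n",
                gcTimeMs, uptimeMs, 100.0 * gcTimeMs / uptimeMs);
    }
}

This only gives a coarse, process-lifetime average; the GC logs remain the place to look for the distribution of individual pauses.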
Two tools worth mentioning here for GC log analysis are the open source GC Viewer, available on GitHub, and jClarity’s Censum.
Java VM Options You Should Always Use in Production
It’s not unusual for financial service systems to have problems that require significant vertical, as opposed to horizontal, scaling. During his talk at QCon London, Peter Lawrey described the particular problems that occur when you scale a Java application beyond 32GB.
Starting from the observation that Java responds much faster if you can keep your data in memory rather than going to a database or some other external resource, Lawrey described the kinds of problems you hit when you go above the 32GB range that Java is reasonably comfortable in. As you’d expect, GC pause times become a major problem, but memory efficiency also drops significantly, and you have the problem of how to recover in the event of a failure.
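A large part of that efficiency drop comes from compressed ordinary object pointers (“compressed oops”): below roughly 32GB, HotSpot can encode object references in 32 bits, but past that threshold every reference widens to 64 bits, so the same object graph costs noticeably more memory. You can check whether your JVM is still using compressed references with the standard diagnostic flag:
java -XX:+PrintFlagsFinal -version | grep UseCompressedOops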
Suppose your system dies and you need to pull in your data to rebuild it. Pulling data in at around 100 MB/s is not an unreasonable rate: it’s about the saturation point of a gigabit line, and if you have a faster network you may not want to max out those connections anyway, since you still need to handle user requests. At that rate 10GB takes about two minutes to recover, but as your data sets get larger you get into hours or even days. A petabyte takes about four months, which is obviously unrealistic, particularly if your data is also changing.
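The arithmetic is easy to sanity-check; a throwaway sketch (the 100 MB/s rate is the assumption above, the class name is ours):

// Back-of-the-envelope recovery times at an assumed 100 MB/s ingest rate.
public class RecoveryTime {
    public static void main(String[] args) {
        final double MB_PER_SECOND = 100.0;
        long[] sizesMb = {10_000L, 1_000_000L, 1_000_000_000L}; // 10GB, 1TB, 1PB
        String[] labels = {"10 GB", "1 TB", "1 PB"};
        for (int i = 0; i < sizesMb.length; i++) {
            double seconds = sizesMb[i] / MB_PER_SECOND;
            System.out.printf("%s: %,.0f s (~%.1f hours, ~%.0f days)%n",
                    labels[i], seconds, seconds / 3_600, seconds / 86_400);
        }
    }
}

This reproduces the figures above: roughly two minutes for 10GB, a few hours for a terabyte, and about four months for a petabyte.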
Generally this problem is solved by replicating what is going on in a database. Lawrey mentioned Speedment SQL Reflector as one example of a product that can be used to do this.