Indeed is migrating from Java 7 to Java 8. We're migrating in phases, starting with running our Java 7 compiled binaries on JRE 8, the Java 8 runtime. Rather than switch everything at once, we started with a few apps to get a sense for the kinds of problems we might find. We selected our job search web app as our first canary.

Last May, we switched a single job search server in production to JRE 8. We'd been running with JRE 8 in a QA environment for several weeks without any trouble. But when we switched to JRE 8 in production, the system load rose from 5 (a normal level) to 20 (an abnormal level): high enough to require a fix before proceeding. Fixing the problem required unusual measures.

Production metrics in Datadog

I started by looking at Datadog, a third-party, cloud-based metric and event correlation service that we use for real-time monitoring. The following Datadog graph shows the system load by host for every job search server in one of our data centers. The outlier (blue line) is the server running JRE 8.

[Figure: System load for job search servers in a data center]

Here is the median response time by host. Again, the server running JRE 8 is the outlier.

[Figure: Median response time by host]

The job search servers that were still on JRE 7 had normal system loads and normal median response times, so the problem was specific to the JRE 8 server. It's not unusual for a Java process to slow down because of garbage collection (GC) activity. We publish our GC metrics to Datadog, and the GC graphs for the JRE 8 instances looked normal.

Production log files

Next, I checked the web application log files. I retrieved every log message from the four hours when we used JRE 8, plus the log messages for the previous 24 hours. There were too many messages to read by hand. Instead, I focused on changes in the distribution of errors: I extracted the name of the associated Java class for each error and then generated histograms for different time periods.

The histograms showed a slight increase in memcached client timeouts during the JRE 8 period. They also showed that we used an open-source memcached client library that is no longer maintained. A Google search didn't turn up any JRE 8-specific problems with the library, and I glanced through its source code without spotting any. I decided that the increase in timeouts was most likely just a side effect of the higher system load.

Reproducing the problem in QA

Since the evidence from our production environment didn't identify the problem, the next step was to try to reproduce it in our QA environment. As I mentioned earlier, the web app had been running normally on JRE 8 in QA. That wasn't surprising: performance problems might not manifest themselves until the system is under enough load, and our QA servers don't get as much traffic as our production servers. I hoped that if I increased traffic beyond normal QA levels, we would see the same symptoms.

I queried our system to find the relative frequencies of our most common incoming requests.
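One lightweight way to get that kind of tally is to count request paths straight from an access log. The following is a minimal sketch of the idea, not our actual tooling; the file name "access.log" and the field position of the request path are assumptions about a typical combined-log layout.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Tallies how often each request path appears in an access log. */
public class RequestFrequency {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        long total = 0;
        // "access.log" and the field position are assumptions about the log format.
        try (BufferedReader in = new BufferedReader(new FileReader("access.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\\s+");
                if (fields.length < 7) {
                    continue; // skip malformed lines
                }
                String path = fields[6]; // request path in a common combined-log layout
                Integer c = counts.get(path);
                counts.put(path, c == null ? 1 : c + 1);
                total++;
            }
        }
        // Print each path's share of traffic so the mixture can be reproduced in a load test.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.printf("%6.2f%%  %s%n", 100.0 * e.getValue() / total, e.getKey());
        }
    }
}
```

The resulting percentages map directly onto per-request weights when recreating the same mixture under a load generator.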
Then, using JMeter, I simulated that mixture of traffic and ran it twice while monitoring system load, once each with instances on JRE 7 and JRE 8. In both cases, the web app processed about 50 requests per second, and the system load was about the same. Fifty requests per second is a higher traffic load than we normally simulate in QA, but it is still not at the level we receive in production. When we started this effort, our production job search instances each served approximately 350 requests per second. To reproduce the problem in QA, I would need to get closer to that traffic volume.

Ruling out JMeter as a bottleneck

Why was our application pegged at 50 requests per second? Before blaming the QA environment, I needed to rule out JMeter as the bottleneck.
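One way to check whether the load generator itself is the limiting factor is to compare it against an independent, bare-bones client: if even a simple sender tops out near 50 requests per second against the same QA instance, the cap is probably on the server or environment side rather than in JMeter. Below is a rough sketch of such a client; the endpoint URL, thread count, and run duration are all hypothetical.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Fires requests from a fixed-size thread pool and reports achieved throughput. */
public class ThroughputCheck {
    public static void main(String[] args) throws Exception {
        final String target = "http://qa-host:8080/jobs"; // hypothetical QA endpoint
        final int threads = 32;                           // illustrative concurrency
        final long durationMillis = 60_000;               // run for one minute
        final AtomicLong completed = new AtomicLong();
        final long deadline = System.currentTimeMillis() + durationMillis;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(new Runnable() {
                @Override public void run() {
                    while (System.currentTimeMillis() < deadline) {
                        try {
                            HttpURLConnection conn =
                                (HttpURLConnection) new URL(target).openConnection();
                            conn.getResponseCode();   // wait for the full response
                            conn.disconnect();
                            completed.incrementAndGet();
                        } catch (Exception e) {
                            // ignore individual failures; we only care about throughput
                        }
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(durationMillis + 10_000, TimeUnit.MILLISECONDS);
        System.out.printf("Achieved %.1f requests/sec%n",
                completed.get() * 1000.0 / durationMillis);
    }
}
```

If a client like this reports substantially more than 50 requests per second, the JMeter setup deserves a closer look; if not, the limit lives somewhere in the QA environment itself.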