Statistical Application Profiling in Multi-threaded Java

A not-so-uncommon problem: during a load test or a production deployment, performance problems come up that do not reproduce in the development environment.  The possible reasons are many: resource contention, lock contention, slow services, and larger data sets, among others.  The first task is to identify where the problem lies so you can go about improving that area.

You could use a third-party tool like CA Wily Introscope, which can do a great job at this, but it isn’t free and it isn’t lightweight either; it could be a very hard sell if it is not already incumbent.

Unfortunately, in these environments it is not going to be as simple as looking at the logs and piecing together what is slow.  It is not uncommon to have ten or more threads all running simultaneously.  Let’s take a simple example.


Let’s say you log the time it took to execute a method at the end of that method’s execution.  With 10 or more threads executing against a layered call stack (web layer calling business layer, calling data access layer, calling the database), you won’t get nice, neat logs like:

DataBaseCall1() = 10ms
DataBaseCall2() = 15ms
DataBaseCall3() = 40ms
DataAccessLayer() = 75ms
BusinessLayer() = 87ms
WebLayer() = 93ms

Instead, all the threads’ output will be interleaved; a DataBaseCall1() time may fall in between a BusinessLayer and a DataAccessLayer time.  So it may look more like:

DataBaseCall2() = 18ms
DataBaseCall1() = 9ms
DataAccessLayer() = 62ms
DataBaseCall2() = 12ms
DataBaseCall3() = 50ms
DataAccessLayer() = 80ms
DataBaseCall2() = 19ms
BusinessLayer() = 74ms
DataBaseCall2() = 11ms
WebLayer() = 102ms

Pretty difficult to read, particularly in a production or load-test environment where there will be hundreds of thousands of transactions from 10 or more threads interleaved every which way.


One possible solution is to create a “metrics” object you pass along from the top of the call stack to the bottom, allow it to aggregate the times, and then print them all collated together so you can clearly see how much time was spent in each call at each level.  The problem with this approach is that you now need to modify method signatures up and down your application; more difficult still, you may not actually control the signatures of all the methods in the stack, so you could not pass anything along even if you wanted to.
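A minimal sketch of this pass-along approach (the class and method names are hypothetical, just to illustrate the plumbing cost):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical metrics object: every method in the call stack must
// accept it as a parameter and record its own elapsed time into it.
class Metrics {
    private final List<String> entries = new ArrayList<>();

    void record(String activity, long elapsedMs) {
        entries.add(activity + " = " + elapsedMs + "ms");
    }

    int size() {
        return entries.size();
    }

    // Print all entries together so one request's timings stay collated.
    void report() {
        entries.forEach(System.out::println);
    }
}

class PassAlongExample {
    static void webLayer(Metrics m) {
        long start = System.currentTimeMillis();
        businessLayer(m);                       // must forward the object
        m.record("WebLayer", System.currentTimeMillis() - start);
    }

    static void businessLayer(Metrics m) {
        long start = System.currentTimeMillis();
        // ... real work would happen here ...
        m.record("BusinessLayer", System.currentTimeMillis() - start);
    }

    public static void main(String[] args) {
        Metrics m = new Metrics();
        webLayer(m);
        m.report();   // one collated block per request
    }
}
```

Note how every signature in the chain has to grow a Metrics parameter, which is exactly the churn described above, and which is impossible where you do not own the signatures.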

So we come to a solution that solves both of these problems: we can use a Java facility called thread-local storage (java.lang.ThreadLocal), which essentially provides static variables that are local to each thread.  This allows any code to call our timer method wherever it might be; the timer automatically knows which thread of execution it belongs to and can aggregate the activity for display at the end.

So it would be integrated something like this:
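A rough sketch of what such a ThreadLocal-backed timer might look like (the attached source has its own Timer class; the names and structure below are illustrative assumptions, not the attached code):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative Timer: each thread gets its own stack of open timings and
// its own result list, so nothing needs to be passed through signatures.
class Timer {
    private static final ThreadLocal<Deque<long[]>> STACK =
            ThreadLocal.withInitial(ArrayDeque::new);
    private static final ThreadLocal<List<String>> RESULTS =
            ThreadLocal.withInitial(ArrayList::new);

    static void start() {
        STACK.get().push(new long[] { System.currentTimeMillis() });
    }

    static void stop(String activity) {
        long elapsed = System.currentTimeMillis() - STACK.get().pop()[0];
        RESULTS.get().add(Thread.currentThread().getName()
                + "," + activity + "," + elapsed);
    }

    static int pending() {
        return RESULTS.get().size();
    }

    // Synchronized so collated results from different threads
    // never interleave on output.
    static synchronized void report() {
        RESULTS.get().forEach(System.out::println);
        RESULTS.get().clear();
    }
}

class Integrated {
    static void dataAccessLayer() {
        Timer.start();
        // ... database calls, each wrapped in its own start()/stop() ...
        Timer.stop("DataAccessLayer");
    }

    static void businessLayer() {
        Timer.start();
        dataAccessLayer();
        Timer.stop("BusinessLayer");
    }

    public static void main(String[] args) {
        Timer.start();
        businessLayer();
        Timer.stop("WebLayer");
        Timer.report();   // one collated CSV-style block per thread
    }
}
```

Because STACK and RESULTS are ThreadLocal, ten threads can run this simultaneously and each still collates only its own timings.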


We of course use synchronized print methods to ensure that, after carefully collating the results, we don’t let them interleave on output.  So we can get metrics that look like the following (only in a CSV file for import into a spreadsheet or analysis tool):


Thread      Activity          Total Time   Self Time   Child Time
Thread 10   WebLayer          400          25          375
Thread 10   BusinessLayer     375          50          325
Thread 10   DataAccessLayer   325          40          290
Thread 10   DataBaseCall1     30           30          0
Thread 10   DataBaseCall2     200          200         0
Thread 10   DataBaseCall3     60           60          0


Now, with a little magic in Excel, you can crunch the numbers (you will want to run the stats over many calls to diminish the effect of outliers) and determine that, say, DataBaseCall2 represents 50% of total execution time and database calls together represent about 75% of total time.  Very little that you can do in the Java layer in that case is going to make a particularly significant change in application performance (aside from perhaps clever query caching).
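As a quick sanity check on that spreadsheet math, using the self times from the sample table above:

```java
// Verifies the percentages quoted above from the sample table (Thread 10).
class Crunch {
    public static void main(String[] args) {
        int total = 400;                       // WebLayer total time
        int db1 = 30, db2 = 200, db3 = 60;     // self times of the DB calls

        // DataBaseCall2 alone: 200 / 400 = 50% of total execution time.
        System.out.println("DataBaseCall2 share: "
                + (100.0 * db2 / total) + "%");

        // All database calls: 290 / 400 = 72.5%, i.e. about 75%.
        System.out.println("All database calls: "
                + (100.0 * (db1 + db2 + db3) / total) + "%");
    }
}
```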


Note that the numbers may not come out perfectly.  When timing things down to the millisecond, a small amount of vagueness in the millisecond clock and many possible edge cases mean the numbers may fudge a tiny bit this way or that.  Run many tests to minimize this factor.


What is presented is the simplest approach to integration: directly embedding the calls into your code.  I have included in the unit test a way to verify the overhead of embedding these calls; I measure it to be about 0.03% or so, roughly 10ms for every 30 seconds of execution time.  With such low overhead, it probably would not matter if it was left in for a production deployment, with a switch to turn the output on and off.  However, it has been pointed out that fancier methods of integration would allow for complete extraction; for instance, using AOP to intercept the entry and exit of various methods would leave your code free of timer calls yet still let you instrument the most critical points.  But that is another show.
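A minimal sketch of how such an overhead check might work (the attached unit test's exact methodology may differ; here bare System.nanoTime() calls stand in for the real start/stop pair):

```java
// Compares a plain loop against the same loop with a timing pair per
// iteration, to estimate the instrumentation's relative overhead.
class OverheadCheck {
    public static void main(String[] args) {
        int iterations = 1_000_000;
        long sink = 0;

        // Baseline: the loop with no instrumentation.
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += i;
        }
        long baseline = System.nanoTime() - t0;

        // Instrumented: same loop, one start/stop pair per iteration.
        long elapsedTotal = 0;
        t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();        // stand-in for start()
            sink += i;
            elapsedTotal += System.nanoTime() - start; // stand-in for stop()
        }
        long instrumented = System.nanoTime() - t0;

        System.out.printf("overhead ~ %.4f%% (sink=%d, timed=%dns)%n",
                100.0 * (instrumented - baseline) / instrumented,
                sink, elapsedTotal);
    }
}
```

The per-iteration work here is far cheaper than a real database call, so this deliberately overstates the relative overhead; against real workloads the figure shrinks further.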


The source code is attached.  The unit test is simple: it is main() on Timer.  I have omitted JUnit for simplicity.

