Part 1: Baseline and JIT compilers

December 19, 2025
java jvm performance jit benchmark

Hey!! You kept reading… welcome back!

For those who landed here directly, this is part 1 of the series "Taking JVM on a performance ride (4D tensor)". I suggest taking a look at the intro page before reading this.

The code and results referred to here are in mmpataki/tensor.benchmark. You can fetch them and reproduce the results on your machine as follows:

# clone the repo
$ git clone https://github.com/mmpataki/tensor.benchmark

# if already cloned, pull
$ git pull

# checkout the branch used for this part
$ cd tensor.benchmark
$ git checkout p1-baseline-and-jit-compilers

# build the benchmark
$ mvn clean package

# run it
$ java -jar target/benchmarks.jar

To begin with, I decided to represent the tensor as a 4D array. To measure element access performance, I wrote the following JMH benchmark:

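package com.mpataki;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)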
public class Tensor4DBenchmark {

    final int B_DIM = 64;
    final int CH_DIM = 16;
    final int R_DIM = 28;
    final int C_DIM = 32;
    float[][][][] arr = new float[B_DIM][CH_DIM][R_DIM][C_DIM];

    public float realTest() {
        float sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            // access elements at pseudo random locations
            int b = i % B_DIM,
                ch = i % CH_DIM,
                r = i % R_DIM,
                c = i % C_DIM;
            sum += arr[b][ch][r][c];
        }
        return sum;
    }

    @Benchmark
    @Fork(value = 1, warmups = 1)
    public void accessTest(Blackhole bh) {
        float ret = realTest();
        bh.consume(ret);
    }

}

What this code does

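Sets up a 4D array with shape 64 x 16 x 28 x 32.
Reads one element per iteration, a million times per call, at pseudo-random locations (each index is the loop counter modulo that dimension's size).
Measures the throughput of that call using JMH.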
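To get a feel for how "random" the i % dimension pattern really is, here is a small standalone sketch (not part of the repo; the class name AccessPatternCheck and the helper distinctCells are my own, hypothetical names) that counts how many distinct cells the loop touches.

```java
import java.util.HashSet;
import java.util.Set;

// Counts how many distinct array cells the benchmark's index pattern visits.
class AccessPatternCheck {

    static final int B_DIM = 64, CH_DIM = 16, R_DIM = 28, C_DIM = 32;

    // Number of distinct (b, ch, r, c) tuples produced by i % dim
    // for i in [0, iterations).
    static int distinctCells(int iterations) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < iterations; i++) {
            int b = i % B_DIM, ch = i % CH_DIM, r = i % R_DIM, c = i % C_DIM;
            // Flatten the 4D index into one int so it can be stored in a set.
            seen.add(((b * CH_DIM + ch) * R_DIM + r) * C_DIM + c);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int total = B_DIM * CH_DIM * R_DIM * C_DIM;
        System.out.println("cells in tensor : " + total);
        System.out.println("cells visited   : " + distinctCells(1_000_000));
    }
}
```

The index tuple repeats every 448 iterations (the least common multiple of 64, 16, 28 and 32), so the loop keeps revisiting the same 448 of the 917,504 cells. That is worth keeping in mind when reading the numbers: this access pattern is probably much friendlier to CPU caches than a truly random one would be.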

Results from my machine

$ java -jar target/benchmarks.jar
# JMH version: 1.37
# VM version: JDK 24.0.2-internal, OpenJDK 64-Bit Server VM, 24.0.2-internal-adhoc.mpataki.jdk24u
# VM invoker: /home/mpataki/projects/jdk24u/build/linux-x86_64-server-release/jdk/bin/java
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.mpataki.Tensor4DBenchmark.accessTest

# Run progress: 0.00% complete, ETA 00:03:20
# Warmup Fork: 1 of 1
# Warmup Iteration   1: 140.104 ops/s
# Warmup Iteration   2: 140.242 ops/s
# Warmup Iteration   3: 140.198 ops/s
# Warmup Iteration   4: 140.141 ops/s
# Warmup Iteration   5: 139.711 ops/s
Iteration   1: 139.656 ops/s
Iteration   2: 139.656 ops/s
Iteration   3: 139.573 ops/s
Iteration   4: 139.637 ops/s
Iteration   5: 139.710 ops/s

# Run progress: 50.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration   1: 134.888 ops/s
# Warmup Iteration   2: 135.329 ops/s
# Warmup Iteration   3: 135.202 ops/s
# Warmup Iteration   4: 135.041 ops/s
# Warmup Iteration   5: 134.735 ops/s
Iteration   1: 134.285 ops/s
Iteration   2: 134.274 ops/s
Iteration   3: 134.330 ops/s
Iteration   4: 134.755 ops/s
Iteration   5: 134.599 ops/s


Result "com.mpataki.Tensor4DBenchmark.accessTest":
  134.449 ±(99.9%) 0.834 ops/s [Average]
  (min, avg, max) = (134.274, 134.449, 134.755), stdev = 0.217
  CI (99.9%): [133.614, 135.283] (assumes normal distribution)


# Run complete. Total time: 00:03:21
...

Benchmark                      Mode  Cnt    Score   Error  Units
Tensor4DBenchmark.accessTest  thrpt    5  134.449 ± 0.834  ops/s

So we were able to execute the accessTest method about 134 times per second. Since each call performs a million accesses, that is roughly 134M element accesses per second. Can we push this further?

Execution of Java code

We were taught in school that the JVM interprets bytecode, which is what makes it platform independent. But this is not the whole story. By default, the JVM compiles frequently executed (hot) bytecode into optimized machine code and runs that directly on the CPU.

If you don’t believe me, we can force our code to run entirely in interpreted mode with the -Xint flag. Take a peek at the numbers below: interpreted code is about 8X slower than our initial run.

$ java -Xint -jar target/benchmarks.jar
...

Result "com.mpataki.Tensor4DBenchmark.accessTest":
  17.235 ±(99.9%) 0.473 ops/s [Average]
  (min, avg, max) = (17.100, 17.235, 17.372), stdev = 0.123
  CI (99.9%): [16.762, 17.708] (assumes normal distribution)

Benchmark                      Mode  Cnt   Score   Error  Units
Tensor4DBenchmark.accessTest  thrpt    5  17.235 ± 0.473  ops/s
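For anyone who wants to redo the arithmetic with their own numbers, here is a tiny sketch. The scores are the ones from my two runs above; the class name ThroughputMath and its helpers are hypothetical and only exist for this illustration.

```java
import java.util.Locale;

// Converts JMH ops/s scores into element accesses per second and a slowdown factor.
class ThroughputMath {

    // Each benchmark op runs the loop in realTest() once: one million accesses.
    static final long ACCESSES_PER_OP = 1_000_000L;

    static double accessesPerSecond(double opsPerSecond) {
        return opsPerSecond * ACCESSES_PER_OP;
    }

    static double slowdown(double fastOpsPerSecond, double slowOpsPerSecond) {
        return fastOpsPerSecond / slowOpsPerSecond;
    }

    public static void main(String[] args) {
        double jitScore = 134.449;  // default run (JIT enabled)
        double intScore = 17.235;   // -Xint run (interpreter only)
        System.out.println(String.format(Locale.ROOT,
                "JIT: %.1fM accesses/s", accessesPerSecond(jitScore) / 1e6));
        System.out.println(String.format(Locale.ROOT,
                "Interpreter: %.1fM accesses/s", accessesPerSecond(intScore) / 1e6));
        System.out.println(String.format(Locale.ROOT,
                "Slowdown: %.1fx", slowdown(jitScore, intScore)));
    }
}
```

With my scores the ratio comes out at roughly 7.8, which is where the "about 8X" figure comes from.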

So I went on reading about compilation in the JVM. While reading [1], [2], [3], I found out that compilation and optimization are tiered, with 5 levels as shown below. I strongly suggest you spend some time reading [1] to understand these tiers better.

    Level 0      - Interpreter
    Level 1-3    - C1 (client compiler)
            1       - without profiling
            2       - with limited profiling (invocation and back-edge counters)
            3       - with full profiling
    Level 4      - C2 (server compiler)

TL;DR
The JVM initially interprets the bytecode and keeps track of method invocation counts and loop iteration counts (and many other things). This is called profiling. As these counts grow, it decides to compile and optimize the hot methods and loops.
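You can watch this happen from inside a program. The sketch below (class name JitWarmupProbe and its methods are hypothetical, made up for this post) uses the standard java.lang.management.CompilationMXBean to read the accumulated JIT compilation time before and after running a hot loop. I'm assuming a HotSpot JVM with the JIT enabled; under -Xint the bean may be missing, which the code treats as "not available".

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Shows that running a hot loop makes the JVM spend time in its JIT compilers.
class JitWarmupProbe {

    // A deliberately hot method: sums the array 'rounds' times.
    static float hotLoop(float[] data, int rounds) {
        float sum = 0;
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < data.length; i++) {
                sum += data[i];
            }
        }
        return sum;
    }

    // Total JIT compilation time in ms, or -1 if the JVM does not report it.
    static long compileMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1;
        }
        return jit.getTotalCompilationTime();
    }

    public static void main(String[] args) {
        float[] data = new float[1024];
        long before = compileMillis();
        float sink = 0;
        for (int call = 0; call < 20_000; call++) {
            sink += hotLoop(data, 1);  // many invocations and loop iterations
        }
        long after = compileMillis();
        System.out.println("compile time before: " + before + " ms");
        System.out.println("compile time after : " + after + " ms");
        System.out.println("(ignore) sink = " + sink);  // keeps the loop from being optimized away
    }
}
```

The exact numbers depend on the machine and JVM, so don't read too much into them. The point is only that the counter moves once the method gets hot.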

References

Did our run use Tiered compilation?

Let’s check that. -XX:+PrintFlagsFinal prints the final values of all the flags the JVM is running with.

$ java -XX:+PrintFlagsFinal -jar target/benchmarks.jar | grep TieredCompilation
     ...
     bool TieredCompilation                 = true            {pd product} {default}
     ...

Cool! Tiered compilation is already in use.
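The same flag can also be read from inside a running program. This is a minimal sketch, assuming a HotSpot-based JVM (it relies on com.sun.management.HotSpotDiagnosticMXBean, which other JVMs may not provide); the class name TieredFlagCheck and the helper flag are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Reads a HotSpot VM flag at runtime instead of grepping -XX:+PrintFlagsFinal output.
class TieredFlagCheck {

    // Looks up a VM option by name; throws IllegalArgumentException if it does not exist.
    static VMOption flag(String name) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return hotspot.getVMOption(name);
    }

    public static void main(String[] args) {
        VMOption tiered = flag("TieredCompilation");
        System.out.println(tiered.getName() + " = " + tiered.getValue()
                + " (origin: " + tiered.getOrigin() + ")");
    }
}
```

The origin tells you whether the value is the default or was set on the command line, which is handy when you start flipping compiler flags.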

Is C2 in use?

Let’s see whether C2 has compiled our code. To do this, we can enable compilation logging with the -XX:+PrintCompilation flag.

$ java -XX:+PrintCompilation -jar target/benchmarks.jar | grep realTest
1053      1 468  945 %     3       com.mpataki.Tensor4DBenchmark::realTest @ 4 (59 bytes)
946       3                        com.mpataki.Tensor4DBenchmark::realTest (59 bytes)
947 %     4                        com.mpataki.Tensor4DBenchmark::realTest @ 4 (59 bytes)
945 %     3                        com.mpataki.Tensor4DBenchmark::realTest @ 4 (59 bytes)   made not entrant
947 %     4                        com.mpataki.Tensor4DBenchmark::realTest @ 4 (59 bytes)   made not entrant
948 %     4                        com.mpataki.Tensor4DBenchmark::realTest @ 4 (59 bytes)
949       4                        com.mpataki.Tensor4DBenchmark::realTest (59 bytes)
946       3                        com.mpataki.Tensor4DBenchmark::realTest (59 bytes)   made not entrant

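I couldn’t find any doc explaining the columns, so I took help from Chat. According to it:

Col3 - %            marks an OSR (on-stack replacement) compilation, where compiled code is swapped in for a loop that is already running
Col4 - [1-4]        represents the tier
`@ 4`               represents the bytecode index of the loop at which the OSR is done
`made not entrant`  means the compiled version identified by Col3 and Col4 on that line is invalidated: no new calls can enter it, typically because a higher-tier version replaced it or an assumption it relied on no longer holds

So looking at the last two lines, we can conclude that our code (the realTest method) is indeed compiled by C2: the tier 4 version stays in place while the tier 3 version is made not entrant.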
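If you want to slice this log programmatically, here is a rough sketch of a parser for lines shaped like the ones in my output above. Everything in it (the class CompilationLine, its fields, the regex) is my own hypothetical helper, not an official format definition. It only understands the % attribute, the optional `@ bci` part and the `made not entrant` suffix, and real logs can carry other attribute characters and a timestamp column that it does not handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parses simplified -XX:+PrintCompilation lines: id, optional %, tier, method, size.
class CompilationLine {

    private static final Pattern LINE = Pattern.compile(
            "^\\s*(\\d+)\\s+(%)?\\s*(\\d)\\s+(\\S+)"       // id, OSR marker, tier, method
            + "(?:\\s+@\\s+(\\d+))?"                       // optional OSR bytecode index
            + "\\s+\\((\\d+) bytes\\)"                     // bytecode size
            + "(\\s+made not entrant)?\\s*$");             // optional invalidation note

    final int compileId;
    final boolean osr;
    final int tier;
    final String method;
    final int osrBci;          // -1 when the line is not an OSR compilation
    final boolean notEntrant;

    private CompilationLine(Matcher m) {
        compileId = Integer.parseInt(m.group(1));
        osr = m.group(2) != null;
        tier = Integer.parseInt(m.group(3));
        method = m.group(4);
        osrBci = m.group(5) == null ? -1 : Integer.parseInt(m.group(5));
        notEntrant = m.group(7) != null;
    }

    static CompilationLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognised line: " + line);
        }
        return new CompilationLine(m);
    }

    // Maps the tier number to the compiler, following the tier list earlier in this post.
    String compiler() {
        return tier == 0 ? "interpreter" : tier == 4 ? "C2" : "C1";
    }

    public static void main(String[] args) {
        String[] lines = {
            "947 %     4      com.mpataki.Tensor4DBenchmark::realTest @ 4 (59 bytes)   made not entrant",
            "949       4      com.mpataki.Tensor4DBenchmark::realTest (59 bytes)"
        };
        for (String l : lines) {
            CompilationLine c = parse(l);
            System.out.println(c.compileId + " " + c.compiler() + " tier=" + c.tier
                    + " osr=" + c.osr + " notEntrant=" + c.notEntrant);
        }
    }
}
```

Filtering for tier 4 lines that are not followed by a matching `made not entrant` entry is then a quick way to see which C2 versions are still live.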

I’ll stop here, but you can continue exploring the other compiler configuration flags described in the documents mentioned above. Play with them and share your findings with me.

In the next episode of this series, we will dive deeper into the code generated by the C2 compiler and try to optimize it.