Part 7: Java array layout and 60% perf. gain

11 mins read
February 28, 2026
linux performance benchmark tlb cache memory java

Hello! Welcome to part 7 of this series. In this episode we will continue to look at the memory latencies and improve the performance even further.

For those who landed here directly, this is the part 6 of the series - Taking JVM on a performance ride (4D tensor). I suggest you to take a look at all the previous posts before reading this.

tl;dr
Multi-dimensional Java arrays introduce pointer indirection and scattered memory access. Flattening the data into a contiguous 1D structure improved spatial locality, reduced TLB pressure, and yielded ~60% performance improvement in a memory-bound benchmark.

In the last episode we tried to reduce the TLB misses by increasing the page size and the misses came down to ~2M. Can we do better? But before we dive in, I wanted to call out something.

Error in our measurements

Until now we have been looking at perf counters for the whole Java appplication and not just the benchmark, so our readings are not accurate. To solve this JMH provides a way to get the counters only for our benchmark code. This can be done using -prof option in the JMH. There are a lot of type of profilers, but we will use the normalized perf (perfnorm) for our case as this gives per benchmark-call statitics.

Here is one run of the benchmark with this profiler (full output).

$ taskset -c 10-11 java -jar target/benchmarks.jar -prof perfnorm -wi 2 -i 2
...
Benchmark                                            Mode  Cnt         Score   Error      Units
Tensor4DBenchmark.accessTest                        thrpt    2       133.263              ops/s
Tensor4DBenchmark.accessTest:CPI                    thrpt              0.358          clks/insn
Tensor4DBenchmark.accessTest:IPC                    thrpt              2.794          insns/clk
Tensor4DBenchmark.accessTest:L1-dcache-load-misses  thrpt        2398891.680               #/op
Tensor4DBenchmark.accessTest:L1-dcache-loads        thrpt        7019438.903               #/op
Tensor4DBenchmark.accessTest:L1-dcache-stores       thrpt          11913.478               #/op
Tensor4DBenchmark.accessTest:L1-icache-load-misses  thrpt           4562.458               #/op
Tensor4DBenchmark.accessTest:LLC-load-misses        thrpt            151.708               #/op
Tensor4DBenchmark.accessTest:LLC-loads              thrpt          67594.351               #/op
Tensor4DBenchmark.accessTest:LLC-store-misses       thrpt              7.467               #/op
Tensor4DBenchmark.accessTest:LLC-stores             thrpt            320.958               #/op
Tensor4DBenchmark.accessTest:branch-misses          thrpt           1227.786               #/op
Tensor4DBenchmark.accessTest:branches               thrpt        5014496.884               #/op
Tensor4DBenchmark.accessTest:cycles                 thrpt       19358718.240               #/op
Tensor4DBenchmark.accessTest:dTLB-load-misses       thrpt             26.433               #/op
Tensor4DBenchmark.accessTest:dTLB-loads             thrpt        7019772.999               #/op
Tensor4DBenchmark.accessTest:dTLB-store-misses      thrpt              7.915               #/op
Tensor4DBenchmark.accessTest:dTLB-stores            thrpt          11076.570               #/op
Tensor4DBenchmark.accessTest:iTLB-load-misses       thrpt              8.799               #/op
Tensor4DBenchmark.accessTest:iTLB-loads             thrpt             25.510               #/op
Tensor4DBenchmark.accessTest:instructions           thrpt       54094393.720               #/op

Notice the counters for L1-dcache-load-misses, LLC-load-misses (last level cache), branches and misses, dTLB-load-misses in above output.

Predicting the ideal limit for above counters

Before we try to optimize any further, we need to evaluate what is the best we can do. For this we need to clearly understand the application to the lowest level.

Our application simply accesses elements from a 4D array at pseudo random locations (they are not actually pseudo random, check this simulation).
- This 4D array is present in heap
- Its size is 64 x 16 x 28 x 32 x 4 bytes (per float) = 3670016 bytes ~= 3MB (Please read ahead)
  - This can’t fit in to a 2MB page. So it will span atmost 3 pages in a 2MB huge-page system.
Adds these elements to a a variable
- Present in stack
Call a blackhole.consume method

Assuming our benchmark method & blackhole.consume’s code is at max a KB and in worst case the blackhole.consume’s method code lies in a different page, the code executed should fall in at most 4 pages.

So these should be our upper bounds

TLB misses

We should be seeing ~8 TLB misses. Why?

3 dTLB misses for array access
1 dTLB miss for stack access
4 iTLB misses for code

Cache misses

I don’t have a quick formulae for this because our access pattern is quite unique. So I tried to simulate and count for the worst case cache using a cache simulator I wrote. Here is what it predicts for a core (with zero other load) with below L1 cache

size 64KB
4 way (round robin way-selection for new loads) set associative cache
64byte wide

$ python cache_simulator.py --size 65536 --ways 4 --width 64
accesses = 1000000
misses   = 448 (0.0448%)
hits     = 999552 (99.9552%)

Note that our current dcache misses are off by orders of magnitude.

Visualizing the access pattern and cache behavior

I tried to simulate the access pattern and cache using a webpage generated by Antigravity (Opus 4.6). You should definitely check this - cache simulation

The problem: Java’s multidimensional array layout

Above section built the ideal upper bounds based on an assumption “arrays are contiguous”. But Java’s multidimensional arrays are not! They are laid out as below. In a \(n\) dimensional array all \(n-1\) dimensions store pointer to arrays in next level.

This layout going to be catastrophic for our application for three reasons

Accessing an element in such multidimensional array requires pointer chasing (remeber the code generated by JIT), if arrays at different levels are spread accross pages (which is more possible), there will be more TLB misses
A single element access requires dereferencing 4 pointers which boil down to 4 extra memory accesses each time.
Size of the array increases. The exact size of a \(N\)-dimensional array with shape \((a_1, a_2, \dots, a_N)\) will be (assuming compressed pointers)
\[\text{size} = \left( a_1 + a_1 a_2 + a_1 a_2 a_3 + \dots + a_1 a_2 \dots a_{N-1} \right) \times 4 \;+\; \left( a_1 a_2 \dots a_N \right) \times size\_of\_element\]
For our example it will be
\[\begin{aligned} \text{size} &= (64 + 64 \cdot 16 + 64 \cdot 16 \cdot 28)\cdot 4 \\ &\quad + (64 \cdot 16 \cdot 28 \cdot 32)\cdot 4 \\ &= 3{,}789{,}056 \end{aligned}\]

What next?

Languages like C, C++ store the multidimensional arrays in a flat structure. During indexed element access, compiler generates the code to convert the user’s multidimensional index to a flat index (Note that compiler is aware of the array size). We can borrow the same idea and implement for our tensor.

Introducing `Tensor4D` class

I enumerated the above idea in a class named Tensor4D. To test whether it works, you can run the Tensor4DTest test.

public class Tensor4D {

    float arr[];
    int bDim, chDim, rDim, cDim;

    public Tensor4D(int bDim, int chDim, int rDim, int cDim) {
        this.bDim = bDim;
        this.chDim = chDim;
        this.rDim = rDim;
        this.cDim = cDim;
        this.arr = new float[bDim * chDim * rDim * cDim];
    }

    public float get(int b, int ch, int r, int c) {
        return arr[
            0                               // here for aesthetics
            + b  * (chDim * rDim * cDim)    // skip b blocks
            + ch * (rDim * cDim)            // skip ch channels
            + r  * (cDim)                   // skip r rows
            + c                             // skip c columns
        ];
    }
}

To understand what calculations are done in get, look at this diagram where we are accessing the element at (2, 1, 0) in a 3d array of shape (3, 3, 3)

Get this code

Code and results referred here are present in p7-single-d-array branch of mmpataki/tensor.benchmark, you can get them and reproduce the results on your computer by -

# clone the repo, cd & pull
$ git clone https://github.com/mmpataki/tensor.benchmark; cd tensor.benchmark; git pull

# checkout the branch used for this part
$ git checkout p7-single-d-array

# build the benchmark
$ mvn clean package

The 60% performance gain

Here is the run of the benchmark using the Tensor4D class (full output). Notice we hit 287 ops/s which is 60% improvement over our earlier ~140 ops/s.

$ taskset -c 10-11 java -jar target/benchmarks.jar -prof perfnorm -wi 2 -i 2
...
Benchmark                                            Mode  Cnt         Score   Error      Units
Tensor4DBenchmark.accessTest                        thrpt    2       287.630              ops/s
Tensor4DBenchmark.accessTest:CPI                    thrpt              0.252          clks/insn
Tensor4DBenchmark.accessTest:IPC                    thrpt              3.962          insns/clk
Tensor4DBenchmark.accessTest:L1-dcache-load-misses  thrpt           7309.383               #/op
Tensor4DBenchmark.accessTest:L1-dcache-loads        thrpt        1008458.441               #/op
Tensor4DBenchmark.accessTest:L1-dcache-stores       thrpt           5928.029               #/op
Tensor4DBenchmark.accessTest:L1-icache-load-misses  thrpt           2100.949               #/op
Tensor4DBenchmark.accessTest:LLC-load-misses        thrpt             34.797               #/op
Tensor4DBenchmark.accessTest:LLC-loads              thrpt            319.140               #/op
Tensor4DBenchmark.accessTest:LLC-store-misses       thrpt              5.777               #/op
Tensor4DBenchmark.accessTest:LLC-stores             thrpt            119.419               #/op
Tensor4DBenchmark.accessTest:branch-misses          thrpt            667.397               #/op
Tensor4DBenchmark.accessTest:branches               thrpt        1506185.433               #/op
Tensor4DBenchmark.accessTest:cycles                 thrpt        8970497.486               #/op
Tensor4DBenchmark.accessTest:dTLB-load-misses       thrpt              6.784               #/op
Tensor4DBenchmark.accessTest:dTLB-loads             thrpt        1009581.692               #/op
Tensor4DBenchmark.accessTest:dTLB-store-misses      thrpt              6.364               #/op
Tensor4DBenchmark.accessTest:dTLB-stores            thrpt           4877.616               #/op
Tensor4DBenchmark.accessTest:iTLB-load-misses       thrpt              5.646               #/op
Tensor4DBenchmark.accessTest:iTLB-loads             thrpt              8.516               #/op
Tensor4DBenchmark.accessTest:instructions           thrpt       35544485.004               #/op

Concluding remarks

Java’s default multi-dimensional arrays can impose pointer chasing and scatter data across pages
Flat memory layouts drastically reduce the pointer overload in memory bound applications and unlock significant performance gains

Although I am happy with the TLB miss rate matching what we predicted and branch misprediction rate coming down as a side effect, I am not happy with the L1 dcache misses, there is something either wrong in measurement / interpreation / side effect by profiling. We will try to dig deeper in this in the next episode.

Until then, stare at this code generated by the C2 compiler and find whether there is something we can do here.

See you in the next one!