JVM Workhorses - Stubs and Intrinsics

A few weeks ago, I was analyzing a hot path in my current project. The code contained several calls to the following method:
System.arraycopy(args...)
I suspected this pattern might be suboptimal, assuming each call crossed into a heavy native library to perform the copy. The method signature seemed to support that assumption:
public static native void arraycopy(Object src, int srcPos, Object dest, int destPos, int length);
However, after examining the C2-compiled machine code, I observed several calls to a Final Stub.
Given these findings, in this article, I will explain what stubs are and the vital role they play within the JVM.
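For reference, here is a minimal example of the kind of call I was looking at (the array contents and range values are illustrative):

```java
// Minimal demonstration of System.arraycopy. Despite its native-looking
// signature, HotSpot compiles calls like this down to an optimized stub
// rather than a heavy native library call.
public class ArrayCopyDemo {
    static int[] copyRange(int[] src, int from, int len) {
        int[] dest = new int[len];
        // Copies len elements of src, starting at index from, into dest at index 0.
        System.arraycopy(src, from, dest, 0, len);
        return dest;
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5};
        int[] copy = copyRange(src, 1, 3);
        System.out.println(java.util.Arrays.toString(copy)); // [2, 3, 4]
    }
}
```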
Environment
I performed the analysis using the following JDK version: OpenJDK Runtime Environment Temurin-25.0.2+10 (build 25.0.2+10-LTS). The environment consisted of an Intel(R) Xeon(R) @ 2.20GHz - 4 vCPU (2 cores), 16GB RAM, running on Debian GNU/Linux 12 (bookworm).
The full source code is available on GitHub. For testing, I used the Adoptium mirror of the OpenJDK repository with the tag jdk-25.0.2+10.
Stubs
Stubs are pieces of assembly code generated and optimized to perform specific tasks, such as cryptographic functions, mathematical operations, and array copying. These are generated during JVM startup.
In the init.cpp file, we can find the following function calls:
// stub routines in initial blob are referenced by later generated code
initial_stubs_init();
...
continuation_stubs_init(); // depends on continuations_init
...
compiler_stubs_init(false /* in_compiler_thread */); // compiler's intrinsics stubs
final_stubs_init(); // final StubRoutines stubs
This demonstrates that stub routine generation is divided into several independent steps:
- Initial stubs
- Continuation stubs
- Compiler stubs
- Final stubs
Because certain stubs depend on others, this sequence is critical for correct generation.
We can inspect the time required to generate stubs using the following command:
scala-cli . --main-class playground.Main --java-opt -Xlog:startuptime --power
The output displays the following:
[0.005s][info][startuptime] StubRoutines generation initial stubs, 0.0005537 secs
...
[0.019s][info][startuptime] StubRoutines generation continuation stubs, 0.0000309 secs
...
[0.026s][info][startuptime] StubRoutines generation final stubs, 0.0003121 secs
...
[0.041s][info][startuptime] StubRoutines generation compiler stubs, 0.0033504 secs
Generation is remarkably efficient. The longest phase - compiler stub generation - took approximately 3 ms, while the other phases were completed in microseconds.
We can also examine the generated assembly code by enabling the -XX:+PrintStubCode option. The output for this command is extensive. I have included it in the GitHub repository for those who wish to perform a closer inspection.
According to the documentation in stubRoutines.hpp, although each step (initial, final, etc.) can produce multiple stubs, they are generated within a single BufferBlob. Each BufferBlob simply contains multiple entry points. This differs from the stubs (also called blobs) generated by SharedRuntime, which are bundled into separate BufferBlobs.
A BufferBlob is a container for machine code stored in the Code Cache.
Stubs possess another interesting characteristic that enhances their efficiency: some stubs may be compiled using intrinsics.
Intrinsics
Intrinsics can be thought of as specialized assembly code written directly for a specific CPU architecture. An intrinsic function utilizes specific Instruction Set Architecture (ISA) instructions, which vary across different processors.
While a comprehensive list of available intrinsics is defined in vmIntrinsics.hpp, not all of them are implemented for every CPU architecture. This is why the StubCodeGenerator must be implemented separately for each target architecture.
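To make this concrete, a few familiar JDK methods are typically backed by intrinsics on x86-64 builds (mapping to single instructions such as POPCNT or LZCNT). Whether a given method is actually intrinsified depends on the JDK build and the CPU, so treat the mapping below as an assumption, not a guarantee:

```java
// Methods that commonly have intrinsic implementations on x86-64.
// The results are identical either way; intrinsification only changes
// the machine code the JIT emits for these calls.
public class IntrinsicCandidates {
    public static void main(String[] args) {
        System.out.println(Long.bitCount(255L));             // 8 set bits
        System.out.println(Integer.numberOfLeadingZeros(1)); // 31
        System.out.println(Integer.toHexString(Integer.reverseBytes(0x12345678))); // 78563412
    }
}
```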
Let’s experiment with them using a simple Java example:
private static long power() {
    var result = 0L;
    for (int i = 0; i < 1_000_000; i++) {
        result += (long) Math.pow(i, 2.0);
    }
    return result;
}
First, we can print the intrinsics used by the program by adding the PrintIntrinsics option to the command:
scala-cli . --main-class playground.Main --java-opt -XX:+UnlockDiagnosticVMOptions --java-opt -XX:+PrintIntrinsics --power
In this example, the output will include a log similar to the following:
@ 16 java.lang.Math::pow (6 bytes) (intrinsic)
This confirms that the pow function was intrinsified by the JIT compiler. However, there are also ways to disable intrinsics, which I will explore next.
Disable InlineNatives
The first option is to disable the InlineNatives flag. After disabling this setting, we run the program as follows:
scala-cli . --main-class playground.Main --java-opt -XX:+UnlockDiagnosticVMOptions --java-opt -XX:+PrintIntrinsics --java-opt -XX:-InlineNatives --power
The pow function is no longer intrinsified. Instead, the logs show that while some methods were inlined, none were intrinsified:
@ 2 java.lang.FdLibm$Pow::compute (1533 bytes) failed to inline: hot method too big
@ 9 java.lang.Double::isNaN (12 bytes) inline (hot)
@ 16 java.lang.Double::isNaN (12 bytes) inline (hot)
@ 27 java.lang.Math::abs (12 bytes) inline (hot)
@ 1 java.lang.Double::doubleToRawLongBits (0 bytes) failed to inline: native method
@ 8 java.lang.Double::longBitsToDouble (0 bytes) failed to inline: native method
@ 33 java.lang.Math::abs (12 bytes) inline (hot)
@ 1 java.lang.Double::doubleToRawLongBits (0 bytes) failed to inline: native method
@ 8 java.lang.Double::longBitsToDouble (0 bytes) failed to inline: native method
@ 16 java.lang.Math::pow (6 bytes) inline (hot)
@ 2 java.lang.StrictMath::pow (6 bytes) inline (hot)
@ 2 java.lang.FdLibm$Pow::compute (1533 bytes) failed to inline: hot method too big
We can inspect the generated stubs once again. By disabling InlineNatives, the JVM does not create stubs for a variety of functions, such as pow, sin, and doubleToRawLongBits. The complete list of affected functions can be found by inspecting the vmIntrinsics::disabled_by_jvm_flags method.
Crucially, when using this option, no stubs are produced for these functions. Consequently, they do not occupy any memory within the Code Cache.
ControlIntrinsic
While the previous flag has a "global" scope - disabling many stubs at once - we can exercise more fine-grained control using the ControlIntrinsic option (there is also a DisableIntrinsic option, but according to the documentation in vmIntrinsics.hpp, it is slated for deprecation).
To disable intrinsics specifically for the pow function, we must pass the intrinsic ID of that function. This ID can be found in the vmIntrinsics.hpp file, which contains code similar to this for each intrinsic:
do_intrinsic(_dpow, java_lang_Math, pow_name, double2_double_signature, F_S)
The first argument, _dpow, is the ID we need. With this information, we can run the program as follows:
scala-cli . --main-class playground.Main --java-opt -XX:+UnlockDiagnosticVMOptions --java-opt -XX:+PrintIntrinsics --java-opt -XX:ControlIntrinsic=-_dpow --power
The output of this command differs from our previous attempts:
@ 2 java.lang.FdLibm$Pow::compute (1533 bytes) failed to inline: hot method too big
@ 9 java.lang.Double::isNaN (12 bytes) inline (hot)
@ 16 java.lang.Double::isNaN (12 bytes) inline (hot)
@ 27 java.lang.Math::abs (12 bytes) (intrinsic)
@ 33 java.lang.Math::abs (12 bytes) (intrinsic)
@ 16 java.lang.Math::pow (6 bytes) inline (hot)
@ 2 java.lang.StrictMath::pow (6 bytes) inline (hot)
@ 2 java.lang.FdLibm$Pow::compute (1533 bytes) failed to inline: hot method too big
This shows that while several methods were inlined, abs remained intrinsified. This approach allows us to disable a single intrinsic while leaving others active. This is further confirmed by the generated stubs, which show that only the pow stub is missing.
Use specific options
Another way to disable specific intrinsics is to use a flag dedicated to a given function or group of functions. To find all available options, we can run the following command:
java -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions -XX:+PrintFlagsFinal -version | grep -Ei 'use.*intrinsic'
From the output, we find a particularly relevant option: UseLibmIntrinsic.
Let’s run the program once again with UseLibmIntrinsic disabled:
scala-cli . --main-class playground.Main --java-opt -XX:+UnlockDiagnosticVMOptions --java-opt -XX:+PrintIntrinsics --java-opt -XX:-UseLibmIntrinsic --power
The output was surprising:
@ 16 java.lang.Math::pow (6 bytes) (intrinsic)
This is the same output as when all intrinsics are enabled. However, inspecting the generated stubs reveals that stubs for the pow function - as well as other math functions like sin or log - were not generated.
Once again, the JVM proved that just when I think I understand its behavior, there is still more to learn. I will not attempt to definitively answer why the PrintIntrinsics option behaves this way. Instead, I will offer some educated guesses based on my findings.
To determine what is actually happening under the hood, I wrote several benchmarks.
Benchmarks
I wrote benchmarks to compare all mentioned approaches:
@Benchmark
public void power_Intrinsics_On(Blackhole bh) {
    bh.consume(Math.pow(2.0, 10.0));
}

@Benchmark
@Fork(value = 3, jvmArgsAppend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:-InlineNatives"})
public void power_Intrinsics_Off_Disabled_By_Inline_Natives(Blackhole bh) {
    bh.consume(Math.pow(2.0, 10.0));
}

@Benchmark
@Fork(value = 3, jvmArgsAppend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:ControlIntrinsic=-_dpow"})
public void power_Intrinsics_Off_Disabled_By_Control_Intrinsic(Blackhole bh) {
    bh.consume(Math.pow(2.0, 10.0));
}

@Benchmark
@Fork(value = 3, jvmArgsAppend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:-UseLibmIntrinsic"})
public void power_Intrinsics_Off_Disabled_By_Libm_Intrinsics(Blackhole bh) {
    bh.consume(Math.pow(2.0, 10.0));
}
The results are as follows:
Benchmark Mode Cnt Score Error Units
IntrinsicsPowerBenchmark.power_Intrinsics_Off_Disabled_By_Control_Intrinsic avgt 15 91.580 ± 0.251 ns/op
IntrinsicsPowerBenchmark.power_Intrinsics_Off_Disabled_By_Inline_Natives avgt 15 398.977 ± 1.606 ns/op
IntrinsicsPowerBenchmark.power_Intrinsics_Off_Disabled_By_Libm_Intrinsics avgt 15 92.155 ± 0.377 ns/op
IntrinsicsPowerBenchmark.power_Intrinsics_On avgt 15 22.240 ± 0.089 ns/op
These results clearly demonstrate that the intrinsified version of the pow function drastically outperforms all other solutions. Furthermore, the version with the InlineNatives option disabled is undeniably the worst performer. Interestingly, the other two solutions are nearly identical in terms of performance.
Performance Analysis
To gain deeper insight into exactly what the JVM executes for each approach, I ran the benchmarks with an enabled profiler using the following command:
scala-cli . --jmh --power -- -f 1 -prof "perfasm:events=cpu-clock;intel=true"
The full output of this run is available in the GitHub repository.
Intrinsics On
This approach is straightforward to analyze, as most of the execution time was spent calling the stub generated for the pow function:
88.14% runtime stub StubRoutines::dpow
The performance is high because there are no superfluous function calls and the code is fully optimized.
Intrinsics Disabled by InlineNatives
This scenario is also relatively simple to analyze. The primary bottlenecks are:
47.33% Unknown, level 0 java.lang.Double::doubleToRawLongBits, version 1, compile id 312
32.45% Unknown, level 0 java.lang.Double::longBitsToDouble, version 1, compile id 666
10.52% c2, level 4 java.lang.FdLibm$Pow::compute, version 2, compile id 683
By disabling a broad range of intrinsics with this option, the JVM falls back to the FdLibm implementation of pow (a Java implementation of C math libraries designed for portability rather than performance).
However, the majority of the time was spent on two methods from the Double class:
- doubleToRawLongBits
- longBitsToDouble
Since these are native methods that would normally be intrinsified, disabling that feature forces the JVM to call the native library repeatedly. This introduces significant overhead due to the "ceremony" of state transitions between the Java and native worlds.
You can read more about this in my previous article.
The actual mathematical operations within libjava.so were quite fast:
2.31% libjava.so Java_java_lang_Double_longBitsToDouble
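The two Double methods on the hot path simply round-trip between a double and its raw IEEE 754 bit pattern. When intrinsified, each call is essentially a single register move; with -XX:-InlineNatives, each one pays the full JNI transition cost:

```java
// Round trip between a double and its raw 64-bit IEEE 754 representation,
// the exact pair of operations that dominated the profile above.
public class DoubleBitsRoundTrip {
    public static void main(String[] args) {
        double value = 2.0;
        long bits = Double.doubleToRawLongBits(value);
        System.out.println(Long.toHexString(bits));        // 4000000000000000
        System.out.println(Double.longBitsToDouble(bits)); // 2.0
    }
}
```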
Intrinsics Disabled by ControlIntrinsic
In this case, we disabled the intrinsic stub only for the pow function, leaving all other intrinsics intact. Similar to the previous scenario, the JVM fell back to the FdLibm implementation:
92.40% c2, level 4 java.lang.FdLibm$Pow::compute, version 2, compile id 670
However, this time the JVM had more optimization tools at its disposal. For instance, it could still intrinsify doubleToRawLongBits and longBitsToDouble, which were the primary bottlenecks in the previous example. Consequently, these results reflect the actual time required to execute the pure FdLibm pow implementation without the JNI overhead observed earlier.
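The FdLibm fallback is the same code path you reach directly through StrictMath.pow, which the earlier inlining logs showed delegating to FdLibm$Pow::compute. For a value like 2^10 both APIs return the identical, exactly representable result; in general Math.pow is only required to be within 1 ulp of the correctly rounded value:

```java
// Math.pow may be served by an intrinsic stub, while StrictMath.pow always
// goes through the pure-Java FdLibm port seen in the profile above.
public class PowComparison {
    public static void main(String[] args) {
        System.out.println(Math.pow(2.0, 10.0));       // 1024.0
        System.out.println(StrictMath.pow(2.0, 10.0)); // 1024.0
    }
}
```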
Intrinsics Disabled by UseLibmIntrinsic
This is the most intriguing case. Although the logs indicated that the pow function was intrinsified, the profiler revealed a different reality:
92.01% libjvm.so SharedRuntime::dpow
The majority of the execution time was spent within the native library.
My research led me to the inline_math_pow function. Following the initial checks, the final return statement is implemented as follows:
return StubRoutines::dpow() != nullptr ?
runtime_math(OptoRuntime::Math_DD_D_Type(), StubRoutines::dpow(), "dpow") :
runtime_math(OptoRuntime::Math_DD_D_Type(), CAST_FROM_FN_PTR(address, SharedRuntime::dpow), "POW");
If a stub routine has been compiled, the JVM selects that stub. I assume this occurred in the fully optimized path. However, if no stub is available, the system falls back to SharedRuntime::dpow. The fallback uses a lightweight JVM internal call, which avoids the typical overhead of standard native JNI calls.
I also located the corresponding macro and method implementation:
JRT_LEAF(jdouble, SharedRuntime::dpow(jdouble x, jdouble y))
return __ieee754_pow(x, y);
JRT_END
This supports the first part of my hypothesis: the function was compiled within libjvm.so and invoked as a fallback when the stub was unavailable. It appears the log indicates the pow function was "intrinsified" before the JVM determined which specific implementation the C2 compiler would ultimately select.
While this remains a hypothesis requiring further research, I will postpone deeper investigation for now to avoid falling into another "rabbit hole".
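The selection logic in inline_math_pow can be sketched in Java. This is purely an illustrative analogy of the "use the compiled stub if present, otherwise fall back to the shared runtime" pattern; none of these names are real JVM APIs:

```java
import java.util.function.DoubleBinaryOperator;

// Illustrative sketch (not a JVM API): prefer a compiled stub when one
// exists, otherwise dispatch to the shared-runtime implementation.
public class PowDispatch {
    static final DoubleBinaryOperator STUB = null; // pretend no stub was generated
    static final DoubleBinaryOperator SHARED_RUNTIME = StrictMath::pow;

    public static double dpow(double x, double y) {
        DoubleBinaryOperator impl = (STUB != null) ? STUB : SHARED_RUNTIME;
        return impl.applyAsDouble(x, y);
    }

    public static void main(String[] args) {
        System.out.println(dpow(2.0, 10.0)); // 1024.0
    }
}
```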
Conclusion
In this article, I aimed to shed light on the mechanics of stubs and intrinsics. These are not new functionalities, but rather stable, mature features that have been integral to the JVM for a long time.
While they may not be widely known by many developers - largely because they operate so seamlessly behind the scenes - understanding them provides a clearer picture of how the JVM achieves such high performance on critical code paths. I hope you find this article useful.
Reviewed by Michał Matłoka
