java - Why is this inner loop 4X faster the first iteration through the outer loop? -


i trying reproduce of processor cache effects described in here. understand java managed environment, , these examples not translate exactly, came across strange case, have tried distill simple example illustrates effect:

public static void main(string[] args) {     final int runs = 10;     final int steps = 1024 * 1024 * 1024;      (int run = 0; run < runs; run++) {         final int[] = new int[1];         long start = system.nanotime();         (int = 0; < steps; i++) {             a[0]++;         }         long stop = system.nanotime();         long time = timeunit.milliseconds.convert(stop - start, timeunit.nanoseconds);         system.out.printf("time loop# %2d: %5d ms\n", run, time);     } } 

output:

 time loop#  0:    24 ms  time loop#  1:   106 ms  time loop#  2:   104 ms  time loop#  3:   103 ms  time loop#  4:   102 ms  time loop#  5:   103 ms  time loop#  6:   104 ms  time loop#  7:   102 ms  time loop#  8:   105 ms  time loop#  9:   102 ms 

the first iteration of inner loop 4 times fast subsequent iterations. opposite of expect, performace goes jit kicks in.

of course, 1 several warm-up loops in serious micro-benchmark, i'm curious causing behaviour, since if know loop can performed in 24ms, it's not satisfying steady-state time on 100ms.

for reference jdk using (on linux):

 openjdk version "1.8.0_40"  openjdk runtime environment (build 1.8.0_40-b20)  openjdk 64-bit server vm (build 25.40-b23, mixed mode) 

update:

here's update info, based on of comments, , experimenting:

1) moving system.out i/o out of loop (by storing timing in array of size 'runs') makes no significant difference on time.

2) output displayed above when run within eclipse. when compile , run command line (with same jdk/jvm) more modest, still significant results (2x instead of 4x faster). seems interesting, since usaully running in eclipse slow things down, if anything.

3) moving a up, out of loop, reused each iteration has no effect.

4) if int[] a changed long[] a, first iteration runs faster (about 20%), while other iterations still same (slower) speed.

update 2:

i think answer apangin explains it. tried sun's 1.9 jvm , it's going from:

openjdk version "1.8.0_40" openjdk runtime environment (build 1.8.0_40-b20) openjdk 64-bit server vm (build 25.40-b23, mixed mode)  time loop#  0:    48 ms time loop#  1:   116 ms time loop#  2:   112 ms time loop#  3:   113 ms time loop#  4:   112 ms time loop#  5:   112 ms time loop#  6:   111 ms time loop#  7:   111 ms time loop#  8:   113 ms time loop#  9:   113 ms 

to:

java version "1.9.0-ea" java(tm) se runtime environment (build 1.9.0-ea-b73) java hotspot(tm) 64-bit server vm (build 1.9.0-ea-b73, mixed mode)  time loop#  0:    48 ms time loop#  1:    26 ms time loop#  2:    22 ms time loop#  3:    22 ms time loop#  4:    22 ms time loop#  5:    22 ms time loop#  6:    22 ms time loop#  7:    22 ms time loop#  8:    22 ms time loop#  9:    23 ms 

that's quite improvement!

this suboptimal recompilation of method.

jit compiler relies on run-time statistics gathered during interpretation. when main method compiled first time, outer loop has not yet completed first iteration => run-time statistics tells code after inner loop never executed, jit not ever bother compiling it. rather generates uncommon trap.

when inner loop ends first time, uncommon trap hit causing method deoptimized.

on second iteration of outer loop main method recompiled new knowledge. jit has more statistics , more context compile. reason not cache value a[0] in register (probably because jit fooled wider context). generates addl instruction update array in memory, combination of memory load , store.

on contrary, during first compilation jit caches value of a[0] in register, there mov instruction store value in memory (without load).

fast loop (first iteration):

0x00000000029fc562: mov    %ecx,0x10(%r14)   <<< array store 0x00000000029fc566: mov    %r11d,%edi 0x00000000029fc569: mov    %r9d,%ecx 0x00000000029fc56c: add    %edi,%ecx 0x00000000029fc56e: mov    %ecx,%r11d 0x00000000029fc571: add    $0x10,%r11d       <<< increment in register 0x00000000029fc575: mov    %r11d,0x10(%r14)  <<< array store 0x00000000029fc579: add    $0x11,%ecx 0x00000000029fc57c: mov    %edi,%r11d 0x00000000029fc57f: add    $0x10,%r11d 0x00000000029fc583: cmp    $0x3ffffff2,%r11d 0x00000000029fc58a: jl     0x00000000029fc562 

slow loop (after recompilation):

0x00000000029fa1b0: addl   $0x10,0x10(%r14)  <<< increment in memory 0x00000000029fa1b5: add    $0x10,%r13d 0x00000000029fa1b9: cmp    $0x3ffffff1,%r13d 0x00000000029fa1c0: jl     0x00000000029fa1b0 

however problem seems fixed in jdk 9. i've checked test against recent jdk 9 access release , verified works expected:

time loop#  0:   104 ms time loop#  1:   101 ms time loop#  2:    91 ms time loop#  3:    63 ms time loop#  4:    60 ms time loop#  5:    60 ms time loop#  6:    59 ms time loop#  7:    55 ms time loop#  8:    57 ms time loop#  9:    59 ms 

Comments