@iwanowww
Last active June 5, 2020 17:44
Original post:
https://stackoverflow.com/questions/61612313/openjdk-panama-vector-api-jdk-incubator-vector-not-giving-improved-results-for-v
=====================
(1) The regression in FloatVector256DotProduct.vector* with the latest vectorIntrinsics branch is caused by a bug in the intrinsification of vector operations. VectorSupport::broadcastCoerced fails to inline as an intrinsic:
2675 92 b net.codingdemon.vectorization.FloatVector256DotProduct::vector (75 bytes)
...
@ 3 jdk.incubator.vector.FloatVector::zero (35 bytes) force inline by annotation
@ 6 jdk.incubator.vector.FloatVector$FloatSpecies::vectorType (5 bytes) accessor
@ 13 jdk.incubator.vector.AbstractSpecies::length (5 bytes) accessor
@ 19 jdk.incubator.vector.FloatVector::toBits (6 bytes) force inline by annotation
@ 1 java.lang.Float::floatToIntBits (15 bytes) (intrinsic)
@ 23 java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes) force inline by annotation
@ 4 java.lang.invoke.LambdaForm$MH/0x0000000800b8c040::invoke (8 bytes) force inline by annotation
@ 28 jdk.internal.vm.vector.VectorSupport::broadcastCoerced (35 bytes) failed to inline (intrinsic)
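The following patch fixes the bug. A misplaced closing parenthesis turned "true /*has_scalar_args*/" into the right-hand operand of a comma expression, so the condition was always true and the broadcast intrinsic always bailed out: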
diff --git a/src/hotspot/share/opto/vectorIntrinsics.cpp b/src/hotspot/share/opto/vectorIntrinsics.cpp
--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -476,7 +476,7 @@
// TODO When mask usage is supported, VecMaskNotUsed needs to be VecMaskUseLoad.
if (!arch_supports_vector(VectorNode::replicate_opcode(elem_bt), num_elem, elem_bt,
- is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/) {
+ (is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/)) {
if (C->print_intrinsics()) {
tty->print_cr(" ** not supported: arity=0 op=broadcast vlen=%d etype=%s ismask=%d",
num_elem, type2name(elem_bt),
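For illustration only, the effect of the misplaced parenthesis can be reproduced outside HotSpot. In the sketch below, supports() is a hypothetical stand-in for arch_supports_vector (not its real signature), and the "up to 8 lanes" rule is made up. It assumes the last parameter has a default value, which the pre-patch line needed in order to compile at all.

```cpp
// Hypothetical stand-in for HotSpot's arch_supports_vector(). The parameter
// types and the "at most 8 lanes" rule are invented for this sketch.
bool supports(int opcode, int num_elem, int mask_use,
              bool has_scalar_args = false) {
  (void)opcode; (void)mask_use; (void)has_scalar_args;
  return num_elem <= 8;
}

// Before the patch: the call is closed too early, so ", true" is parsed as a
// comma operator. The if-condition is always true, regardless of what
// supports() returns, and the intrinsic bails out every time.
bool bails_out_before_patch(int num_elem) {
  if (!supports(0, num_elem, 0), true) {
    return true;  // "not supported" path, taken unconditionally
  }
  return false;
}

// After the patch: true is passed as has_scalar_args, and the result of the
// check decides whether to bail out.
bool bails_out_after_patch(int num_elem) {
  if (!supports(0, num_elem, 0, true)) {
    return true;
  }
  return false;
}
```

With these definitions, bails_out_before_patch(8) reports a bailout even though the stand-in check succeeds, while bails_out_after_patch(8) does not. Compilers can flag the first form when unused-value warnings are enabled.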
BEFORE:
Benchmark                                    (size)   Mode  Cnt     Score     Error  Units
FloatVector256DotProduct.vanilla            1048576  thrpt    5   679.280 ±  13.731  ops/s
FloatVector256DotProduct.unrolled           1048576  thrpt    5  2319.770 ± 123.943  ops/s
FloatVector256DotProduct.vector             1048576  thrpt    5   803.740 ±  42.596  ops/s
FloatVector256DotProduct.vectorUnrolled     1048576  thrpt    5   797.153 ±  49.129  ops/s
FloatVector256DotProduct.vectorfma          1048576  thrpt    5   828.172 ±  16.936  ops/s
FloatVector256DotProduct.vectorfmaUnrolled  1048576  thrpt    5   798.037 ±  85.566  ops/s
JavaDocExample.simpleMultiply               1048576  thrpt    5  1888.662 ±  55.922  ops/s
JavaDocExample.simpleMultiplyUnrolled       1048576  thrpt    5  1486.322 ±  93.864  ops/s
JavaDocExample.vectorMultiply               1048576  thrpt    5  1525.046 ± 110.700  ops/s
AFTER:
Benchmark                                    (size)   Mode  Cnt     Score     Error  Units
FloatVector256DotProduct.vanilla            1048576  thrpt    5   666.581 ±   8.727  ops/s
FloatVector256DotProduct.unrolled           1048576  thrpt    5  2416.695 ± 106.223  ops/s
FloatVector256DotProduct.vector             1048576  thrpt    5  3776.422 ± 117.357  ops/s
FloatVector256DotProduct.vectorUnrolled     1048576  thrpt    5  3734.246 ± 122.463  ops/s
FloatVector256DotProduct.vectorfma          1048576  thrpt    5  3804.485 ±  44.797  ops/s
FloatVector256DotProduct.vectorfmaUnrolled  1048576  thrpt    5  1158.018 ±  15.955  ops/s
JavaDocExample.simpleMultiply               1048576  thrpt    5  1914.794 ±  51.329  ops/s
JavaDocExample.simpleMultiplyUnrolled       1048576  thrpt    5  1405.345 ±  52.025  ops/s
JavaDocExample.vectorMultiply               1048576  thrpt    5  1832.133 ±  56.256  ops/s
=====================
(2) The regression in vectorfmaUnrolled (compared to vectorfma) is caused by known inlining issues that break vector box elimination (note the FloatVector::lanewise calls left as virtual calls in the inlining log below):
Benchmark                                    (size)   Mode  Cnt     Score     Error  Units
FloatVector256DotProduct.vectorfma          1048576  thrpt    5  3804.485 ±  44.797  ops/s
FloatVector256DotProduct.vectorfmaUnrolled  1048576  thrpt    5  1158.018 ±  15.955  ops/s
19727 95 b net.codingdemon.vectorization.FloatVector256DotProduct::vectorfmaUnrolled (228 bytes)
...
@ 209 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
@ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
@ 213 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
@ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
@ 218 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
@ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
...
The normalized allocation rate (gc.alloc.rate.norm, reported by the JMH GC profiler) confirms that vector boxes are not eliminated in vectorfmaUnrolled:
Benchmark                                                        (size)   Mode  Cnt        Score    Error  Units
FloatVector256DotProduct.vectorfma                              1048576  thrpt    5     3938.922 ± 97.041  ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm          1048576  thrpt    5        0.111 ±  0.003  B/op
FloatVector256DotProduct.vectorfmaUnrolled                      1048576  thrpt    5     2052.549 ± 68.859  ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm  1048576  thrpt    5  1573537.127 ± 22.886  B/op
Until the inlining is fixed, warming up with a smaller input can serve as a workaround. With size = 128, neither variant allocates:
Benchmark                                                       (size)   Mode  Cnt         Score        Error  Units
FloatVector256DotProduct.vectorfma                                 128  thrpt    5  54838734.769 ± 161477.746  ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm             128  thrpt    5        ≈ 10⁻⁵               B/op
FloatVector256DotProduct.vectorfmaUnrolled                         128  thrpt    5  68993637.658 ± 359974.720  ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm     128  thrpt    5        ≈ 10⁻⁵               B/op
Possible fix:
http://cr.openjdk.java.net/~vlivanov/panama/vector/inline_virtual/webrev.00/