iwanowww/221df8893fbaa4b6b0904e3036221b1d
Last active: June 5, 2020 17:44
Original post:
https://stackoverflow.com/questions/61612313/openjdk-panama-vector-api-jdk-incubator-vector-not-giving-improved-results-for-v
=====================
(1) The regression in FloatVector256DotProduct.vector* with the latest vectorIntrinsics branch is caused by a bug in vector-operation intrinsification:

2675   92    b        net.codingdemon.vectorization.FloatVector256DotProduct::vector (75 bytes)
...
  @ 3   jdk.incubator.vector.FloatVector::zero (35 bytes)   force inline by annotation
    @ 6   jdk.incubator.vector.FloatVector$FloatSpecies::vectorType (5 bytes)   accessor
    @ 13   jdk.incubator.vector.AbstractSpecies::length (5 bytes)   accessor
    @ 19   jdk.incubator.vector.FloatVector::toBits (6 bytes)   force inline by annotation
      @ 1   java.lang.Float::floatToIntBits (15 bytes)   (intrinsic)
    @ 23   java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes)   force inline by annotation
      @ 4   java.lang.invoke.LambdaForm$MH/0x0000000800b8c040::invoke (8 bytes)   force inline by annotation
    @ 28   jdk.internal.vm.vector.VectorSupport::broadcastCoerced (35 bytes)   failed to inline (intrinsic)
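For reference, the kernel being compiled here is a Vector API dot product. Below is a minimal sketch against the incubator API (class and variable names are illustrative, not the exact benchmark source, and exact API names have shifted between incubator builds). The `FloatVector.zero(SPECIES)` call at the top is the broadcast that shows up above as `VectorSupport::broadcastCoerced ... failed to inline`:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorDot {
    // 256-bit species: eight float lanes per vector
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

    static float dot(float[] a, float[] b) {
        // FloatVector.zero broadcasts 0.0f into all lanes; this is the
        // VectorSupport.broadcastCoerced call that failed to intrinsify above.
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);           // per-lane fused multiply-add
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {          // scalar tail for leftover elements
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Running this requires `--add-modules jdk.incubator.vector`. When the broadcast fails to intrinsify, the whole loop falls back to boxed vector objects, which explains why `vector` barely beats `vanilla` in the BEFORE numbers below.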
The root cause is a misplaced closing parenthesis: the condition becomes a comma expression `(!arch_supports_vector(...), true)`, which always evaluates to its last operand `true`, so the intrinsic bails out unconditionally. The following patch fixes the bug:
diff --git a/src/hotspot/share/opto/vectorIntrinsics.cpp b/src/hotspot/share/opto/vectorIntrinsics.cpp
--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -476,7 +476,7 @@
   // TODO When mask usage is supported, VecMaskNotUsed needs to be VecMaskUseLoad.
   if (!arch_supports_vector(VectorNode::replicate_opcode(elem_bt), num_elem, elem_bt,
-                            is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/) {
+                            (is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/)) {
     if (C->print_intrinsics()) {
       tty->print_cr("  ** not supported: arity=0 op=broadcast vlen=%d etype=%s ismask=%d",
                     num_elem, type2name(elem_bt),
BEFORE:
Benchmark                                     (size)   Mode  Cnt     Score     Error  Units
FloatVector256DotProduct.vanilla             1048576  thrpt    5   679.280 ±  13.731  ops/s
FloatVector256DotProduct.unrolled            1048576  thrpt    5  2319.770 ± 123.943  ops/s
FloatVector256DotProduct.vector              1048576  thrpt    5   803.740 ±  42.596  ops/s
FloatVector256DotProduct.vectorUnrolled      1048576  thrpt    5   797.153 ±  49.129  ops/s
FloatVector256DotProduct.vectorfma           1048576  thrpt    5   828.172 ±  16.936  ops/s
FloatVector256DotProduct.vectorfmaUnrolled   1048576  thrpt    5   798.037 ±  85.566  ops/s
JavaDocExample.simpleMultiply                1048576  thrpt    5  1888.662 ±  55.922  ops/s
JavaDocExample.simpleMultiplyUnrolled        1048576  thrpt    5  1486.322 ±  93.864  ops/s
JavaDocExample.vectorMultiply                1048576  thrpt    5  1525.046 ± 110.700  ops/s

AFTER:
Benchmark                                     (size)   Mode  Cnt     Score     Error  Units
FloatVector256DotProduct.vanilla             1048576  thrpt    5   666.581 ±   8.727  ops/s
FloatVector256DotProduct.unrolled            1048576  thrpt    5  2416.695 ± 106.223  ops/s
FloatVector256DotProduct.vector              1048576  thrpt    5  3776.422 ± 117.357  ops/s
FloatVector256DotProduct.vectorUnrolled      1048576  thrpt    5  3734.246 ± 122.463  ops/s
FloatVector256DotProduct.vectorfma           1048576  thrpt    5  3804.485 ±  44.797  ops/s
FloatVector256DotProduct.vectorfmaUnrolled   1048576  thrpt    5  1158.018 ±  15.955  ops/s
JavaDocExample.simpleMultiply                1048576  thrpt    5  1914.794 ±  51.329  ops/s
JavaDocExample.simpleMultiplyUnrolled        1048576  thrpt    5  1405.345 ±  52.025  ops/s
JavaDocExample.vectorMultiply                1048576  thrpt    5  1832.133 ±  56.256  ops/s
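For context, the vanilla and unrolled baselines in these tables are ordinary scalar loops. A sketch with illustrative names (not the benchmark source): the unrolled variant keeps four independent accumulators, which breaks the floating-point add dependency chain and is what lets it run roughly 3-4x faster than vanilla even without SIMD:

```java
public class ScalarDot {
    // Straight-line scalar dot product: a single accumulator, so every
    // iteration's add depends on the previous one (FP-add latency bound).
    static float vanilla(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // 4-way unrolled with independent accumulators: the four add chains
    // can execute in parallel, hiding the FP-add latency.
    static float unrolled(float[] a, float[] b) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        float sum = s0 + s1 + s2 + s3;
        for (; i < a.length; i++) {   // scalar tail
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Note that the two variants can differ in the last bits of the result, since FP addition is not associative; the benchmarks compare throughput, not bit-exact output.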
=====================
(2) The regression in vectorfmaUnrolled (compared to vectorfma) is caused by well-known inlining issues that break vector box elimination:

Benchmark                                     (size)   Mode  Cnt     Score     Error  Units
FloatVector256DotProduct.vectorfma           1048576  thrpt    5  3804.485 ±  44.797  ops/s
FloatVector256DotProduct.vectorfmaUnrolled   1048576  thrpt    5  1158.018 ±  15.955  ops/s

19727   95    b        net.codingdemon.vectorization.FloatVector256DotProduct::vectorfmaUnrolled (228 bytes)
...
  @ 209   jdk.incubator.vector.FloatVector::add (9 bytes)   force inline by annotation
    @ 5   jdk.incubator.vector.FloatVector::lanewise (0 bytes)   virtual call
  @ 213   jdk.incubator.vector.FloatVector::add (9 bytes)   force inline by annotation
    @ 5   jdk.incubator.vector.FloatVector::lanewise (0 bytes)   virtual call
  @ 218   jdk.incubator.vector.FloatVector::add (9 bytes)   force inline by annotation
    @ 5   jdk.incubator.vector.FloatVector::lanewise (0 bytes)   virtual call
...

Benchmark                                                        (size)   Mode  Cnt        Score     Error  Units
FloatVector256DotProduct.vectorfma                              1048576  thrpt    5     3938.922 ±  97.041  ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm          1048576  thrpt    5        0.111 ±   0.003  B/op
FloatVector256DotProduct.vectorfmaUnrolled                      1048576  thrpt    5     2052.549 ±  68.859  ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm  1048576  thrpt    5  1573537.127 ±  22.886  B/op
Until the inlining issue is fixed, a warm-up phase with a smaller data input can serve as a workaround:
Benchmark                                                          (size)   Mode  Cnt         Score       Error  Units
FloatVector256DotProduct.vectorfma                                    128  thrpt    5  54838734.769 ± 161477.746  ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm                128  thrpt    5        ≈ 10⁻⁵               B/op
FloatVector256DotProduct.vectorfmaUnrolled                            128  thrpt    5  68993637.658 ± 359974.720  ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm        128  thrpt    5        ≈ 10⁻⁵               B/op
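In plain-Java terms (outside JMH, where the same effect is achieved by warming up with a smaller `@Param` size), the workaround amounts to exercising the kernel on a small input long enough for C2 to compile and fully inline it before the large run begins. A hedged sketch with a scalar stand-in kernel and illustrative names:

```java
public class WarmupHarness {
    // Stand-in for the hot kernel; in the benchmark this would be the
    // Vector API dot product.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    public static void main(String[] args) {
        // Warm-up: a small input (size 128, as in the table above) keeps
        // per-call work low while accumulating enough invocations to
        // trigger C2 compilation of dot().
        float[] wa = new float[128], wb = new float[128];
        java.util.Arrays.fill(wa, 1f);
        java.util.Arrays.fill(wb, 1f);
        float sink = 0f;
        for (int iter = 0; iter < 100_000; iter++) {
            sink += dot(wa, wb);   // sink prevents dead-code elimination
        }
        // Measured run: the large input (1 << 20 floats, as benchmarked)
        // should now hit compiled, fully inlined code.
        float[] a = new float[1 << 20], b = new float[1 << 20];
        java.util.Arrays.fill(a, 1f);
        java.util.Arrays.fill(b, 1f);
        long t0 = System.nanoTime();
        float result = dot(a, b);
        long t1 = System.nanoTime();
        System.out.println("dot=" + result + " sink=" + sink + " ns=" + (t1 - t0));
    }
}
```

This mirrors what the table shows: once compilation happens against the small input, the unrolled variant no longer pays the boxing cost.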
Possible fix:
http://cr.openjdk.java.net/~vlivanov/panama/vector/inline_virtual/webrev.00/