@dmonakhov
Created November 6, 2021 17:12
nvcr.io/nvidia/hpc-benchmarks:21.4-hpl xhpl crash
.....
Prog= 85.55% N_left= 1188288 Time= 870.74 Time_left= 147.12 iGF= 6253382.31 GF= 7603226.15 iGF_per= 12407.50 GF_per= 15085.77
Prog= 85.94% N_left= 1177344 Time= 876.23 Time_left= 143.33 iGF= 5573961.76 GF= 7590502.86 iGF_per= 11059.45 GF_per= 15060.52
Prog= 86.35% N_left= 1165824 Time= 881.27 Time_left= 139.31 iGF= 6271848.56 GF= 7582957.89 iGF_per= 12444.14 GF_per= 15045.55
Prog= 86.75% N_left= 1154304 Time= 886.36 Time_left= 135.37 iGF= 6095663.79 GF= 7574422.65 iGF_per= 12094.57 GF_per= 15028.62
Prog= 87.12% N_left= 1143360 Time= 891.10 Time_left= 131.69 iGF= 6100617.39 GF= 7566590.79 iGF_per= 12104.40 GF_per= 15013.08
Prog= 88.60% N_left= 1097856 Time= 910.53 Time_left= 117.14 iGF= 5880744.58 GF= 7530604.49 iGF_per= 11668.14 GF_per= 14941.68
Prog= 89.96% N_left= 1052352 Time= 929.69 Time_left= 103.75 iGF= 5491677.90 GF= 7488589.51 iGF_per= 10896.19 GF_per= 14858.31
Prog= 91.19% N_left= 1007424 Time= 948.52 Time_left= 91.61 iGF= 5061996.93 GF= 7440414.22 iGF_per= 10043.64 GF_per= 14762.73
Prog= 92.33% N_left= 961920 Time= 966.67 Time_left= 80.27 iGF= 4862198.60 GF= 7392002.82 iGF_per= 9647.22 GF_per= 14666.67
Prog= 93.36% N_left= 916992 Time= 983.82 Time_left= 70.00 iGF= 4624393.64 GF= 7343750.82 iGF_per= 9175.38 GF_per= 14570.93
Prog= 94.30% N_left= 871488 Time= 1001.26 Time_left= 60.54 iGF= 4174093.56 GF= 7288545.19 iGF_per= 8281.93 GF_per= 14461.40
Prog= 95.15% N_left= 825984 Time= 1017.29 Time_left= 51.90 iGF= 4092061.82 GF= 7238193.34 iGF_per= 8119.17 GF_per= 14361.49
[vla3-7009-hpl-test:65524:0:65651] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x80)
==== backtrace (tid: 65651) ====
0 0x0000000000026975 ucs_debug_print_backtrace() /build-result/src/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.1-0.6.6.0-ubuntu16.04-x86_64/ucx-96422ce/src/ucs/debug/debug.c:656
1 0x0000000000011390 __funlockfile() ???:0
2 0x00000000004c4430 __kmpc_omp_task_with_deps() ???:0
3 0x00000000004cba51 __kmpc_omp_task_alloc() ???:0
4 0x00000000004d109d __kmpc_omp_task_alloc() ???:0
5 0x000000000051cab7 __kmp_external___intel_sse2_strspn() ???:0
6 0x00000000005212f4 __kmp_external___intel_sse2_strspn() ???:0
7 0x0000000000498f40 omp_in_parallel() ???:0
8 0x00000000004dba60 __kmp_external___intel_sse2_strspn() ???:0
9 0x00000000000076ba start_thread() ???:0
10 0x000000000010751d clone() ???:0
=================================
[vla3-7009-hpl-test:65524] *** Process received signal ***
[vla3-7009-hpl-test:65524] Signal: Segmentation fault (11)
[vla3-7009-hpl-test:65524] Signal code: (-6)
[vla3-7009-hpl-test:65524] Failing at address: 0xfff4
[vla3-7009-hpl-test:65524] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f6d03f8c390]
[vla3-7009-hpl-test:65524] [ 1] /opt/hpc-benchmarks/workspace/hpl-linux-x86_64/xhpl[0x4c4430]
[vla3-7009-hpl-test:65524] [ 2] /opt/hpc-benchmarks/workspace/hpl-linux-x86_64/xhpl[0x4cba51]
[vla3-7009-hpl-test:65524] [ 3] /opt/hpc-benchmarks/workspace/hpl-linux-x86_64/xhpl[0x4d109d]
[vla3-7009-hpl-test:65524] [ 4] /opt/hpc-benchmarks/workspace/hpl-linux-x86_64/xhpl[0x51cab7]
[vla3-7009-hpl-test:65524] [ 5] /opt/hpc-benchmarks/workspace/hpl-linux-x86_64/xhpl[0x5212f4]
[vla3-7009-hpl-test:65524] [ 6] /opt/hpc-benchmarks/workspace/hpl-linux-x86_64/xhpl[0x498f40]
[vla3-7009-hpl-test:65524] [ 7] /opt/hpc-benchmarks/workspace/hpl-linux-x86_64/xhpl[0x4dba60]
[vla3-7009-hpl-test:65524] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f6d03f826ba]
[vla3-7009-hpl-test:65524] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6d03cb851d]
[vla3-7009-hpl-test:65524] *** End of error message ***
./hpl.sh: line 321: 65524 Segmentation fault (core dumped) numactl --physcpubind=${CPU} ${MEMBIND} ${XHPL} ${DAT}
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[8832,1],501]
Exit code: 139
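The exit code 139 reported above matches the segfault: by shell convention, a status above 128 means the process was killed by signal number (status - 128), and 139 - 128 = 11. A minimal sketch of that decoding follows; the helper name `decode_exit_code` is hypothetical and not part of HPL or mpirun.

```python
import signal


def decode_exit_code(code: int) -> str:
    """Describe a shell-style exit status.

    Assumption: the status follows the common shell convention where
    values above 128 mean "killed by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name} ({signum})"
    return f"exited with status {code}"


print(decode_exit_code(139))
```

On Linux this reports SIGSEGV for 139, consistent with the "Segmentation fault (11)" lines in the log.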
The crash is caused by opt/intel/compilers_and_libraries_2020.4.304; use the nvcr.io/nvidia/hpc-benchmarks:20.10-hpl Docker image instead.
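When comparing runs between the two images, the `Prog=` lines at the top of the log can be parsed into numbers. The sketch below is one possible approach, not part of the benchmark tooling: the `parse_progress` helper is hypothetical, and it treats every `name= value` pair as a raw float without assuming what each field means.

```python
import re

# Matches "name= value" pairs such as "Prog= 85.55" or "iGF_per= 12407.50".
FIELD = re.compile(r"(\w+)=\s*([\d.]+)")


def parse_progress(line: str) -> dict:
    """Turn one HPL progress line into a {field: float} dict."""
    return {key: float(value) for key, value in FIELD.findall(line)}


# First and last progress lines copied from the log above.
first = parse_progress(
    "Prog= 85.55% N_left= 1188288 Time= 870.74 Time_left= 147.12 "
    "iGF= 6253382.31 GF= 7603226.15 iGF_per= 12407.50 GF_per= 15085.77"
)
last = parse_progress(
    "Prog= 95.15% N_left= 825984 Time= 1017.29 Time_left= 51.90 "
    "iGF= 4092061.82 GF= 7238193.34 iGF_per= 8119.17 GF_per= 14361.49"
)

# Relative change in the iGF_per field between the two lines.
drop = 1 - last["iGF_per"] / first["iGF_per"]
print(f"iGF_per changed by -{drop:.1%} between Prog={first['Prog']}% and Prog={last['Prog']}%")
```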