Created
March 9, 2022 01:52
====================================================Model Init=====================================================
[HCTR][01:43:08][WARNING][RK0][main]: The model name is not specified when creating the solver.
[HCTR][01:43:08][WARNING][RK0][main]: MPI was already initialized somewhere elese. Lifetime service disabled.
[HCTR][01:43:08][INFO][RK0][main]: Global seed is 1388934725
[HCTR][01:43:12][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 0
GPU 1 -> node 0
GPU 2 -> node 0
GPU 3 -> node 0
[HCTR][01:43:18][INFO][RK0][main]: Start all2all warmup
[HCTR][01:43:19][INFO][RK0][main]: End all2all warmup
[HCTR][01:43:19][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][01:43:19][INFO][RK0][main]: Device 0: A100-SXM4-40GB
[HCTR][01:43:19][INFO][RK0][main]: Device 1: A100-SXM4-40GB
[HCTR][01:43:19][INFO][RK0][main]: Device 2: A100-SXM4-40GB
[HCTR][01:43:19][INFO][RK0][main]: Device 3: A100-SXM4-40GB
[HCTR][01:43:19][INFO][RK0][main]: num of DataReader workers: 4
[HCTR][01:43:19][ERROR][RK0][main]: Check Failed!
File: /hugectr/HugeCTR/include/data_readers/file_source_parquet.hpp:99
Function: ParquetFileSource
Expression: file_list_.get_num_of_files() >= stride_
Hint: The number of data reader workers should be no greater than the number of files in the file list. There is one worker on each GPU for Parquet dataset, please re-configure vvgpu within CreateSolver or guarantee enough files in the file list.
[5e93a2668dff:41532] *** Process received signal ***
[5e93a2668dff:41532] Signal: Aborted (6)
[5e93a2668dff:41532] Signal code: (-6)
[5e93a2668dff:41532] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f6e72991210]
[5e93a2668dff:41532] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f6e7299118b]
[5e93a2668dff:41532] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f6e72970859]
[5e93a2668dff:41532] [ 3] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZNK7HugeCTR6Logger5checkEbRKNS_6SrcLocEPKcz+0x1a6)[0x7f6e6f0858c6]
[5e93a2668dff:41532] [ 4] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR17ParquetFileSourceC1EjjRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEb+0x2fd)[0x7f6e6f60514d]
[5e93a2668dff:41532] [ 5] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR23ParquetDataReaderWorkerIxEC2EjjRKSt10shared_ptrINS_11GPUResourceEEPiRKS2_INS_12ThreadBufferEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbRKSt6vectorINS_21DataReaderSparseParamESaISL_EERKSK_IxSaIxEEiRKS2_INS_15ResourceManagerEE+0x408)[0x7f6e6f6263c8]
[5e93a2668dff:41532] [ 6] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR28DataReaderWorkerGroupParquetIxEC2ERKSt6vectorISt10shared_ptrINS_12ThreadBufferEESaIS5_EENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbS2_INS_21DataReaderSparseParamESaISG_EES2_IxSaIxEERKS3_INS_15ResourceManagerEEb+0x33e)[0x7f6e6f626bbe]
[5e93a2668dff:41532] [ 7] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR10DataReaderIxE19create_drwg_parquetENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIxSaIxEEb+0xcc)[0x7f6e6f626fbc]
[5e93a2668dff:41532] [ 8] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR9add_inputIxEEvRNS_5InputERNS_16DataReaderParamsERSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_11SparseInputIT_EESt4lessISB_ESaISt4pairIKSB_SE_EEERSt6vectorISN_INS_11TensorEntryESaISO_EESaISQ_EEST_RSt10shared_ptrINS_11IDataReaderEESX_SX_mmbbbmSU_INS_15ResourceManagerEE+0x1fc8)[0x7f6e6f6a2418]
[5e93a2668dff:41532] [ 9] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model3addERNS_5InputE+0x831)[0x7f6e6f6cb9c1]
[5e93a2668dff:41532] [10] /usr/local/hugectr/lib/hugectr.so(+0x9f1ca)[0x7f6e722231ca]
[5e93a2668dff:41532] [11] /usr/local/hugectr/lib/hugectr.so(+0xd9108)[0x7f6e7225d108]
[5e93a2668dff:41532] [12] python(PyCFunction_Call+0x59)[0x5f5e79]
[5e93a2668dff:41532] [13] python(_PyObject_MakeTpCall+0x296)[0x5f6a46]
[5e93a2668dff:41532] [14] python[0x50b4a7]
[5e93a2668dff:41532] [15] python(_PyEval_EvalFrameDefault+0x5706)[0x5703e6]
[5e93a2668dff:41532] [16] python(_PyEval_EvalCodeWithName+0x26a)[0x5696da]
[5e93a2668dff:41532] [17] python(_PyFunction_Vectorcall+0x393)[0x5f6403]
[5e93a2668dff:41532] [18] python(_PyEval_EvalFrameDefault+0x18f1)[0x56c5d1]
[5e93a2668dff:41532] [19] python(_PyFunction_Vectorcall+0x1b6)[0x5f6226]
[5e93a2668dff:41532] [20] python(_PyEval_EvalFrameDefault+0x71e)[0x56b3fe]
[5e93a2668dff:41532] [21] python(_PyEval_EvalCodeWithName+0x26a)[0x5696da]
[5e93a2668dff:41532] [22] python(PyEval_EvalCode+0x27)[0x68db17]
[5e93a2668dff:41532] [23] python[0x67eeb1]
[5e93a2668dff:41532] [24] python[0x67ef2f]
[5e93a2668dff:41532] [25] python[0x67efd1]
[5e93a2668dff:41532] [26] python(PyRun_SimpleFileExFlags+0x197)[0x67f377]
[5e93a2668dff:41532] [27] python(Py_RunMain+0x212)[0x6b7902]
[5e93a2668dff:41532] [28] python(Py_BytesMain+0x2d)[0x6b7c8d]
[5e93a2668dff:41532] [29] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f6e729720b3]
[5e93a2668dff:41532] *** End of error message ***
Aborted (core dumped)
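
The failed check in the log (`file_list_.get_num_of_files() >= stride_`) and its hint suggest that, for a Parquet dataset, the file list must name at least as many files as there are GPUs in `vvgpu`. A minimal pre-flight sketch of that comparison follows. It is plain Python and does not call HugeCTR. The helper names are hypothetical, and two things are assumptions rather than facts taken from the log: that the file list is a text file whose first line is the file count followed by one path per line, and that the worker count equals the total number of GPU ids in `vvgpu` on a single node.

```python
from pathlib import Path


def count_listed_files(file_list_path):
    """Count the data files named in a file list.

    Assumed layout (not confirmed by the log): the first non-empty line
    holds the file count, and each following non-empty line is one path.
    """
    lines = [ln.strip() for ln in Path(file_list_path).read_text().splitlines()]
    entries = [ln for ln in lines if ln]
    return max(len(entries) - 1, 0)  # drop the leading count line


def num_parquet_workers(vvgpu):
    """One worker per GPU, as the hint in the log states for Parquet."""
    return sum(len(node_gpus) for node_gpus in vvgpu)


def check_file_list(num_files, vvgpu):
    """Raise early if there are fewer files than data reader workers."""
    workers = num_parquet_workers(vvgpu)
    if num_files < workers:
        raise ValueError(
            f"{num_files} file(s) listed but {workers} data reader workers "
            "requested; use fewer GPUs in vvgpu or split the dataset into more files"
        )
    return workers


if __name__ == "__main__":
    # Same shape as the run in the log: one node with four GPUs.
    print(check_file_list(num_files=4, vvgpu=[[0, 1, 2, 3]]))
```

Running such a check before building the model would turn the abort above into an ordinary Python exception. The two remedies it mentions are the ones the hint itself gives: reduce the GPUs listed in `vvgpu`, or provide more Parquet files.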