@leiterenato
Created March 9, 2022 01:52
====================================================Model Init=====================================================
[HCTR][01:43:08][WARNING][RK0][main]: The model name is not specified when creating the solver.
[HCTR][01:43:08][WARNING][RK0][main]: MPI was already initialized somewhere elese. Lifetime service disabled.
[HCTR][01:43:08][INFO][RK0][main]: Global seed is 1388934725
[HCTR][01:43:12][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 0
GPU 1 -> node 0
GPU 2 -> node 0
GPU 3 -> node 0
[HCTR][01:43:18][INFO][RK0][main]: Start all2all warmup
[HCTR][01:43:19][INFO][RK0][main]: End all2all warmup
[HCTR][01:43:19][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][01:43:19][INFO][RK0][main]: Device 0: A100-SXM4-40GB
[HCTR][01:43:19][INFO][RK0][main]: Device 1: A100-SXM4-40GB
[HCTR][01:43:19][INFO][RK0][main]: Device 2: A100-SXM4-40GB
[HCTR][01:43:19][INFO][RK0][main]: Device 3: A100-SXM4-40GB
[HCTR][01:43:19][INFO][RK0][main]: num of DataReader workers: 4
[HCTR][01:43:19][ERROR][RK0][main]: Check Failed!
File: /hugectr/HugeCTR/include/data_readers/file_source_parquet.hpp:99
Function: ParquetFileSource
Expression: file_list_.get_num_of_files() >= stride_
Hint: The number of data reader workers should be no greater than the number of files in the file list. There is one worker on each GPU for Parquet dataset, please re-configure vvgpu within CreateSolver or guarantee enough files in the file list.
[5e93a2668dff:41532] *** Process received signal ***
[5e93a2668dff:41532] Signal: Aborted (6)
[5e93a2668dff:41532] Signal code: (-6)
[5e93a2668dff:41532] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f6e72991210]
[5e93a2668dff:41532] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f6e7299118b]
[5e93a2668dff:41532] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f6e72970859]
[5e93a2668dff:41532] [ 3] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZNK7HugeCTR6Logger5checkEbRKNS_6SrcLocEPKcz+0x1a6)[0x7f6e6f0858c6]
[5e93a2668dff:41532] [ 4] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR17ParquetFileSourceC1EjjRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEb+0x2fd)[0x7f6e6f60514d]
[5e93a2668dff:41532] [ 5] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR23ParquetDataReaderWorkerIxEC2EjjRKSt10shared_ptrINS_11GPUResourceEEPiRKS2_INS_12ThreadBufferEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbRKSt6vectorINS_21DataReaderSparseParamESaISL_EERKSK_IxSaIxEEiRKS2_INS_15ResourceManagerEE+0x408)[0x7f6e6f6263c8]
[5e93a2668dff:41532] [ 6] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR28DataReaderWorkerGroupParquetIxEC2ERKSt6vectorISt10shared_ptrINS_12ThreadBufferEESaIS5_EENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbS2_INS_21DataReaderSparseParamESaISG_EES2_IxSaIxEERKS3_INS_15ResourceManagerEEb+0x33e)[0x7f6e6f626bbe]
[5e93a2668dff:41532] [ 7] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR10DataReaderIxE19create_drwg_parquetENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIxSaIxEEb+0xcc)[0x7f6e6f626fbc]
[5e93a2668dff:41532] [ 8] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR9add_inputIxEEvRNS_5InputERNS_16DataReaderParamsERSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_11SparseInputIT_EESt4lessISB_ESaISt4pairIKSB_SE_EEERSt6vectorISN_INS_11TensorEntryESaISO_EESaISQ_EEST_RSt10shared_ptrINS_11IDataReaderEESX_SX_mmbbbmSU_INS_15ResourceManagerEE+0x1fc8)[0x7f6e6f6a2418]
[5e93a2668dff:41532] [ 9] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model3addERNS_5InputE+0x831)[0x7f6e6f6cb9c1]
[5e93a2668dff:41532] [10] /usr/local/hugectr/lib/hugectr.so(+0x9f1ca)[0x7f6e722231ca]
[5e93a2668dff:41532] [11] /usr/local/hugectr/lib/hugectr.so(+0xd9108)[0x7f6e7225d108]
[5e93a2668dff:41532] [12] python(PyCFunction_Call+0x59)[0x5f5e79]
[5e93a2668dff:41532] [13] python(_PyObject_MakeTpCall+0x296)[0x5f6a46]
[5e93a2668dff:41532] [14] python[0x50b4a7]
[5e93a2668dff:41532] [15] python(_PyEval_EvalFrameDefault+0x5706)[0x5703e6]
[5e93a2668dff:41532] [16] python(_PyEval_EvalCodeWithName+0x26a)[0x5696da]
[5e93a2668dff:41532] [17] python(_PyFunction_Vectorcall+0x393)[0x5f6403]
[5e93a2668dff:41532] [18] python(_PyEval_EvalFrameDefault+0x18f1)[0x56c5d1]
[5e93a2668dff:41532] [19] python(_PyFunction_Vectorcall+0x1b6)[0x5f6226]
[5e93a2668dff:41532] [20] python(_PyEval_EvalFrameDefault+0x71e)[0x56b3fe]
[5e93a2668dff:41532] [21] python(_PyEval_EvalCodeWithName+0x26a)[0x5696da]
[5e93a2668dff:41532] [22] python(PyEval_EvalCode+0x27)[0x68db17]
[5e93a2668dff:41532] [23] python[0x67eeb1]
[5e93a2668dff:41532] [24] python[0x67ef2f]
[5e93a2668dff:41532] [25] python[0x67efd1]
[5e93a2668dff:41532] [26] python(PyRun_SimpleFileExFlags+0x197)[0x67f377]
[5e93a2668dff:41532] [27] python(Py_RunMain+0x212)[0x6b7902]
[5e93a2668dff:41532] [28] python(Py_BytesMain+0x2d)[0x6b7c8d]
[5e93a2668dff:41532] [29] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f6e729720b3]
[5e93a2668dff:41532] *** End of error message ***
Aborted (core dumped)
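
The failed check above (`file_list_.get_num_of_files() >= stride_`) can be reproduced before the model is built. The following is a minimal pre-flight sketch and not part of HugeCTR. It assumes the Parquet file list is a text file whose first line is the file count, followed by one path per line, and that the number of workers equals the total number of GPUs in `vvgpu`, as the hint says. The helper names and the example paths are hypothetical.

```python
from typing import List


def count_listed_files(file_list_text: str) -> int:
    """Count the data files named in a file list.

    Assumed layout: first non-empty line is the declared file count,
    each following non-empty line is one file path.
    """
    lines = [ln.strip() for ln in file_list_text.splitlines() if ln.strip()]
    if not lines:
        return 0
    declared = int(lines[0])
    paths = lines[1:]
    # Be conservative if the header and the listed paths disagree.
    return min(declared, len(paths))


def num_reader_workers(vvgpu: List[List[int]]) -> int:
    """One Parquet reader worker per GPU, summed over all nodes in vvgpu."""
    return sum(len(node_gpus) for node_gpus in vvgpu)


def enough_files_for_workers(file_list_text: str, vvgpu: List[List[int]]) -> bool:
    """Mirror the check from the log: number of files >= number of workers."""
    return count_listed_files(file_list_text) >= num_reader_workers(vvgpu)


if __name__ == "__main__":
    # Hypothetical file list with two Parquet files.
    example = "2\n/data/train/part_0.parquet\n/data/train/part_1.parquet\n"
    # Four GPUs on one node, as in the log above: the check fails.
    print(enough_files_for_workers(example, [[0, 1, 2, 3]]))
    # Two GPUs: the check passes.
    print(enough_files_for_workers(example, [[0, 1]]))
```

If the check fails, the two options named in the hint apply: use fewer GPUs in `vvgpu`, or split the dataset into at least as many Parquet files as there are GPUs.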