@anj-s
Created April 30, 2021 04:21
Stack trace: node 0: worker 0; node 1: worker 1, server, scheduler
BytePS launching worker
BytePS launching worker
BytePS launching server
BytePS launching scheduler
[2021-04-29 20:03:00.669667: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-04-29 20:03:00.669697: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-04-29 20:03:00.669699: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-04-29 20:03:00.669754: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-04-29 20:03:00.670890: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2021-04-29 20:03:00.670910: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2021-04-29 20:03:00.670914: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2021-04-29 20:03:00.670938: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2021-04-29 20:03:00.670948: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[2021-04-29 20:03:00.670948: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[20:03:00] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[20:03:00] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[20:03:00] src/postoffice.cc:25: Creating Van: 1
[20:03:00] src/postoffice.cc:25: Creating Van: 1
[20:03:00] src/van.cc:84: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[20:03:00] src/van.cc:84: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[20:03:00] src/./rdma_van.h:44: Shared memory IPC has been disabled
[20:03:00] src/./rdma_van.h:44: Shared memory IPC has been disabled
[20:03:00] src/van.cc:441: Bind to [role=scheduler, id=1, ip=100.97.90.229, port=25000, is_recovery=0, aux_id=-1]
[20:03:00] src/./rdma_van.h:155: Connecting to Node 1, My_Node=1
[20:03:00] src/van.cc:441: Bind to [role=server, ip=100.97.90.43, port=41959, is_recovery=0, aux_id=-1]
[20:03:00] src/./rdma_van.h:155: Connecting to Node 1, My_Node=2147483647
[20:03:00] 3rdparty/ps-lite/include/dmlc/logging.h:276: [20:03:00] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)
Stack trace returned 6 entries:
[bt] (0) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x2999b) [0x7f729a91699b]
[bt] (1) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x29ca1) [0x7f729a916ca1]
[bt] (2) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x7dad6) [0x7f729a96aad6]
[bt] (3) /private/home/anj/.conda/envs/test_clone/lib/libstdc++.so.6(+0xc819d) [0x7f729a7db19d]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f729b2f4609]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f729b21b293]
[20:03:00] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)
Stack trace returned 6 entries:
[bt] (0) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x2999b) [0x7f52a7f5699b]
[bt] (1) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x29ca1) [0x7f52a7f56ca1]
[bt] (2) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x7dad6) [0x7f52a7faaad6]
[bt] (3) /private/home/anj/.conda/envs/test_clone/lib/libstdc++.so.6(+0xc819d) [0x7f52a7e1b19d]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f52a8934609]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f52a885b293]
terminate called after throwing an instance of 'dmlc::Error'
what(): [20:03:00] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)
Stack trace returned 6 entries:
[bt] (0) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x2999b) [0x7f729a91699b]
[bt] (1) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x29ca1) [0x7f729a916ca1]
[bt] (2) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x7dad6) [0x7f729a96aad6]
[bt] (3) /private/home/anj/.conda/envs/test_clone/lib/libstdc++.so.6(+0xc819d) [0x7f729a7db19d]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f729b2f4609]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f729b21b293]
terminate called after throwing an instance of 'dmlc::Error'
what(): [20:03:00] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)
Stack trace returned 6 entries:
[bt] (0) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x2999b) [0x7f52a7f5699b]
[bt] (1) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x29ca1) [0x7f52a7f56ca1]
[bt] (2) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x7dad6) [0x7f52a7faaad6]
[bt] (3) /private/home/anj/.conda/envs/test_clone/lib/libstdc++.so.6(+0xc819d) [0x7f52a7e1b19d]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f52a8934609]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f52a885b293]
Aborted (core dumped)
Aborted (core dumped)
Traceback (most recent call last):
  File "/private/home/anj/.conda/envs/fairscale/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
  File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/pkg_resources/__init__.py", line 650, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1446, in run_script
    exec(code, namespace, namespace)
  File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 220, in <module>
    launch_bps()
  File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 215, in launch_bps
    subprocess.check_call(command, env=my_env,
  File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.
Traceback (most recent call last):
  File "/private/home/anj/.conda/envs/fairscale/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
  File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/pkg_resources/__init__.py", line 650, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1446, in run_script
    exec(code, namespace, namespace)
  File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 220, in <module>
    launch_bps()
  File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 215, in launch_bps
    subprocess.check_call(command, env=my_env,
  File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.
[2021-04-29 20:03:01.214492: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-04-29 20:03:01.215658: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-04-29 20:03:01.215793: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2021-04-29 20:03:01.215857: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2021-04-29 20:03:01.215888: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[2021-04-29 20:03:01.215903: I byteps/common/compressor/compressor_registry.cc:28] nesterov_momentum compressor is registered
[2021-04-29 20:03:01.313811: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2021-04-29 20:03:01.313994: D byteps/common/communicator.cc:159] Init socket at /tmp/socket_send_0
[2021-04-29 20:03:01.314050: D byteps/common/communicator.cc:159] Init socket at /tmp/socket_recv_0
[2021-04-29 20:03:01.314108: D byteps/common/communicator.cc:123] This is ROOT device, rank=0, all sockets create successfully
[2021-04-29 20:03:01.314135: D byteps/common/global.cc:142] Partition size round up to 4096000 (bytes)
[2021-04-29 20:03:01.314141: D byteps/common/global.cc:166] Using key hash function type: djb2
[2021-04-29 20:03:01.314146: D byteps/common/global.cc:181] Number of worker=2, launching distributed job
[2021-04-29 20:03:01.314184: D byteps/common/communicator.cc:166] Listening on socket 0
[2021-04-29 20:03:01.314225: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2021-04-29 20:03:01.314239: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 1
[2021-04-29 20:03:01.314246: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2021-04-29 20:03:01.314298: D byteps/common/communicator.cc:159] Init socket at /tmp/socket_send_nccl0
[2021-04-29 20:03:01.314334: D byteps/common/communicator.cc:159] Init socket at /tmp/socket_recv_nccl0
[2021-04-29 20:03:01.314381: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=0, all sockets create successfully
[2021-04-29 20:03:01.314407: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0
[2021-04-29 20:03:01.314458: D byteps/common/communicator.cc:166] Listening on socket 0
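
Note on the exit status in the log above: bpslaunch reports "non-zero exit status 134" for the 'import byteps.server' command, next to the "Aborted (core dumped)" lines. Assuming the command was run through a shell, a status above 128 conventionally means "terminated by signal (status - 128)", so 134 would correspond to signal 6 (SIGABRT on Linux). That fits the uncaught dmlc::Error ending in terminate/abort. The helper below is a minimal sketch of that decoding; describe_exit_status is a hypothetical name, not part of BytePS or bpslaunch.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status.

    Assumption: the status comes from a shell, where values above 128
    conventionally encode 128 + the number of the terminating signal.
    """
    if status > 128:
        try:
            # Map the signal number back to its symbolic name.
            return f"terminated by {signal.Signals(status - 128).name}"
        except ValueError:
            # No signal with that number on this platform.
            return f"exit status {status} (no matching signal)"
    return f"exited with code {status}"


# Status taken from the CalledProcessError lines in the log.
print(describe_exit_status(134))
```

On Linux this should name SIGABRT for 134, which points at the RDMA_CM_EVENT_ADDR_ERROR check failure in rdma_van.h as the cause of the abort rather than at the launcher itself.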