Created April 30, 2021 04:21
Stack trace: node 0: worker 0, node 1: worker 1, server, scheduler
BytePS launching worker
BytePS launching worker
BytePS launching server
BytePS launching scheduler
[2021-04-29 20:03:00.669667: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-04-29 20:03:00.669697: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-04-29 20:03:00.669699: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-04-29 20:03:00.669754: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-04-29 20:03:00.670890: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2021-04-29 20:03:00.670910: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2021-04-29 20:03:00.670914: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2021-04-29 20:03:00.670938: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2021-04-29 20:03:00.670948: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[2021-04-29 20:03:00.670948: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[[20:03:0020:03:00] ] byteps/server/server.ccbyteps/server/server.cc::430430: : BytePS server engine uses BytePS server engine uses 44 threads threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance | |
[[20:03:00] src/postoffice.cc20:03:00:] src/postoffice.cc25:: Creating Van: 251: | |
Creating Van: 1 | |
[[20:03:0020:03:00] ] src/van.ccsrc/van.cc::8484: : DMLC_ENABLE_RDMA=1 will be deprecated. DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead. | |
Please use DMLC_ENABLE_RDMA=ibverbs instead. | |
[[20:03:0020:03:00] ] src/./rdma_van.hsrc/./rdma_van.h::4444: : Shared memory IPC has been disabledShared memory IPC has been disabled | |
[20:03:00] src/van.cc:441: Bind to [role=scheduler, id=1, ip=100.97.90.229, port=25000, is_recovery=0, aux_id=-1] | |
[20:03:00] src/./rdma_van.h:155: Connecting to Node 1, My_Node=1 | |
[20:03:00] src/van.cc:441: Bind to [role=server, ip=100.97.90.43, port=41959, is_recovery=0, aux_id=-1] | |
[20:03:00] src/./rdma_van.h:155: Connecting to Node 1, My_Node=2147483647 | |
[[20:03:0020:03:00] ] 3rdparty/ps-lite/include/dmlc/logging.h3rdparty/ps-lite/include/dmlc/logging.h::276276: : [20:03:00] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR) | |
Stack trace returned 6 entries:
[bt] (0) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x2999b) [0x7f729a91699b]
[bt] (1) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x29ca1) [0x7f729a916ca1]
[bt] (2) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x7dad6) [0x7f729a96aad6]
[bt] (3) /private/home/anj/.conda/envs/test_clone/lib/libstdc++.so.6(+0xc819d) [0x7f729a7db19d]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f729b2f4609]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f729b21b293]
[20:03:00] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)
Stack trace returned 6 entries:
[bt] (0) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x2999b) [0x7f52a7f5699b]
[bt] (1) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x29ca1) [0x7f52a7f56ca1]
[bt] (2) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x7dad6) [0x7f52a7faaad6]
[bt] (3) /private/home/anj/.conda/envs/test_clone/lib/libstdc++.so.6(+0xc819d) [0x7f52a7e1b19d]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f52a8934609]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f52a885b293]
terminate called after throwing an instance of 'dmlc::Error' | |
what(): [20:03:00] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR) | |
Stack trace returned 6 entries: | |
[bt] (0) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x2999b) [0x7f729a91699b] | |
[bt] (1) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x29ca1) [0x7f729a916ca1] | |
[bt] (2) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x7dad6) [0x7f729a96aad6] | |
[bt] (3) /private/home/anj/.conda/envs/test_clone/lib/libstdc++.so.6(+0xc819d) [0x7f729a7db19d] | |
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f729b2f4609] | |
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f729b21b293] | |
terminate called after throwing an instance of 'dmlc::Error' | |
what(): [20:03:00] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR) | |
Stack trace returned 6 entries: | |
[bt] (0) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x2999b) [0x7f52a7f5699b] | |
[bt] (1) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x29ca1) [0x7f52a7f56ca1] | |
[bt] (2) /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x7dad6) [0x7f52a7faaad6] | |
[bt] (3) /private/home/anj/.conda/envs/test_clone/lib/libstdc++.so.6(+0xc819d) [0x7f52a7e1b19d] | |
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f52a8934609] | |
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f52a885b293] | |
Aborted (core dumped) | |
Aborted (core dumped) | |
Traceback (most recent call last): | |
File "/private/home/anj/.conda/envs/fairscale/bin/bpslaunch", line 4, in <module> | |
__import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch') | |
File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/pkg_resources/__init__.py", line 650, in run_script | |
self.require(requires)[0].run_script(script_name, ns) | |
File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1446, in run_script | |
exec(code, namespace, namespace) | |
File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 220, in <module> | |
launch_bps() | |
File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 215, in launch_bps | |
subprocess.check_call(command, env=my_env, | |
File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/subprocess.py", line 364, in check_call | |
raise CalledProcessError(retcode, cmd) | |
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134. | |
Traceback (most recent call last): | |
File "/private/home/anj/.conda/envs/fairscale/bin/bpslaunch", line 4, in <module> | |
__import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch') | |
File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/pkg_resources/__init__.py", line 650, in run_script | |
self.require(requires)[0].run_script(script_name, ns) | |
File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1446, in run_script | |
exec(code, namespace, namespace) | |
File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 220, in <module> | |
launch_bps() | |
File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 215, in launch_bps | |
subprocess.check_call(command, env=my_env, | |
File "/private/home/anj/.conda/envs/test_clone/lib/python3.8/subprocess.py", line 364, in check_call | |
raise CalledProcessError(retcode, cmd) | |
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134. | |
[2021-04-29 20:03:01.214492: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-04-29 20:03:01.215658: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-04-29 20:03:01.215793: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2021-04-29 20:03:01.215857: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2021-04-29 20:03:01.215888: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[2021-04-29 20:03:01.215903: I byteps/common/compressor/compressor_registry.cc:28] nesterov_momentum compressor is registered
[2021-04-29 20:03:01.313811: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2021-04-29 20:03:01.313994: D byteps/common/communicator.cc:159] Init socket at /tmp/socket_send_0
[2021-04-29 20:03:01.314050: D byteps/common/communicator.cc:159] Init socket at /tmp/socket_recv_0
[2021-04-29 20:03:01.314108: D byteps/common/communicator.cc:123] This is ROOT device, rank=0, all sockets create successfully
[2021-04-29 20:03:01.314135: D byteps/common/global.cc:142] Partition size round up to 4096000 (bytes)
[2021-04-29 20:03:01.314141: D byteps/common/global.cc:166] Using key hash function type: djb2
[2021-04-29 20:03:01.314146: D byteps/common/global.cc:181] Number of worker=2, launching distributed job
[2021-04-29 20:03:01.314184: D byteps/common/communicator.cc:166] Listening on socket 0
[2021-04-29 20:03:01.314225: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2021-04-29 20:03:01.314239: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 1
[2021-04-29 20:03:01.314246: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2021-04-29 20:03:01.314298: D byteps/common/communicator.cc:159] Init socket at /tmp/socket_send_nccl0
[2021-04-29 20:03:01.314334: D byteps/common/communicator.cc:159] Init socket at /tmp/socket_recv_nccl0
[2021-04-29 20:03:01.314381: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=0, all sockets create successfully
[2021-04-29 20:03:01.314407: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0
[2021-04-29 20:03:01.314458: D byteps/common/communicator.cc:166] Listening on socket 0
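
Note on the exit status in the tracebacks above: the `CalledProcessError` for `python3 -c 'import byteps.server'` reports status 134, which matches the two "Aborted (core dumped)" lines. This assumes the launcher ran the command through a shell, where a status above 128 conventionally means 128 plus the number of the signal that ended the process. On Linux, 134 - 128 = 6, which is SIGABRT. The sketch below is not part of the original log, and `describe_exit_status` is a hypothetical helper name used only for illustration.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status.

    Assumption: the status came from a shell, which conventionally reports
    128 + N when the child process was terminated by signal number N.
    """
    if status > 128:
        try:
            name = signal.Signals(status - 128).name
        except ValueError:
            # No signal with that number exists on this platform.
            return f"exit status {status} (no matching signal)"
        return f"terminated by {name}"
    return f"exited with status {status}"


# Status taken from the CalledProcessError lines in the log.
print(describe_exit_status(134))
```

SIGABRT is what `std::terminate` raises after an uncaught C++ exception, consistent with the "terminate called after throwing an instance of 'dmlc::Error'" lines.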