@sparticlesteve
Created December 18, 2019 06:12
$ srun -n 8 -c 10 -u -l python test_ddp.py --backend mpi
3: Initialized rank 3 local-rank 3 size 8
1: Initialized rank 1 local-rank 1 size 8
5: Initialized rank 5 local-rank 5 size 8
7: Initialized rank 7 local-rank 7 size 8
2: Initialized rank 2 local-rank 2 size 8
4: Initialized rank 4 local-rank 4 size 8
6: Initialized rank 6 local-rank 6 size 8
0: Initialized rank 0 local-rank 0 size 8
3: Generating a batch of data
1: Generating a batch of data
5: Generating a batch of data
7: Generating a batch of data
4: Generating a batch of data
6: Generating a batch of data
2: Generating a batch of data
0: Generating a batch of data
7: Constructing model
3: Constructing model
5: Constructing model
1: Constructing model
0: Constructing model
2: Constructing model
6: Constructing model
4: Constructing model
0: [1576565271.426514] [cgpu06:45376:0] cuda_ipc_md.c:62 UCX ERROR cuCtxGetDevice(&cu_device) is failed. ret:invalid device context
0: [1576565271.426547] [cgpu06:45376:0] ucp_rkey.c:250 UCX ERROR Failed to unpack remote key from remote md[4]: Input/output error
0: [cgpu06:45376:0:45523] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x20)
0: ==== backtrace ====
0: 0 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x2293c) [0x2aab1e94293c]
0: 1 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x22ba4) [0x2aab1e942ba4]
0: 2 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_rkey_release+0xe) [0x2aab1e70751e]
0: 3 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_rkey_destroy+0x34) [0x2aab14e46804]
0: 4 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_ep_rkey_unpack+0x341) [0x2aab14e465e1]
0: 5 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_rndv_rtr_handler+0x1e8) [0x2aab14e63558]
0: 6 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(+0xed99) [0x2aab1e70ad99]
0: 7 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_mm_iface_progress+0x6e) [0x2aab1e70af7e]
0: 8 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x2aab14e4b4fa]
0: 9 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17) [0x2aab14a206c7]
0: 10 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(opal_progress+0x2c) [0x2aaaff5f584c]
0: 11 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5) [0x2aaaff5fc255]
0: 12 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_request_default_wait_all+0x231) [0x2aaabb353331]
0: 13 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x46f) [0x2aaabb3ad97f]
0: 14 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xc8) [0x2aaabb3adce8]
0: 15 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x13e) [0x2aab2796e7de]
0: 16 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(MPI_Bcast+0x116) [0x2aaabb36e776]
0: 17 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x834da8) [0x2aaabae3bda8]
0: 18 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0x134) [0x2aaabae39864]
0: 19 /usr/lib64/libstdc++.so.6(+0xc338f) [0x2aaabb70538f]
0: 20 /lib64/libpthread.so.0(+0x7569) [0x2aaaaacda569]
0: 21 /lib64/libc.so.6(clone+0x3f) [0x2aaaaafe9a2f]
0: ===================
2: [1576565271.431356] [cgpu06:45378:0] mm_posix.c:449 UCX ERROR Error returned from open in attach. Permission denied. File name is: /proc/45376/fd/54
2: [cgpu06:45378:0:45521] mm_ep.c:168 Fatal: Failed to attach to remote mmid:194888436023734. Shared memory error
2: ==== backtrace ====
2: 0 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_message+0x99) [0x2aab1693fcc9]
2: 1 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_format+0xd6) [0x2aab1693fda6]
2: 2 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(+0x111c4) [0x2aab1670c1c4]
2: 3 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x17c) [0x2aab1670c63c]
2: 4 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4aa9d) [0x2aab14e76a9d]
2: 5 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_wireup_msg_progress+0x7a) [0x2aab14e78d0a]
2: 6 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4d125) [0x2aab14e79125]
2: 7 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4e59a) [0x2aab14e7a59a]
2: 8 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x1a39a) [0x2aab1693939a]
2: 9 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x2aab14e4b4fa]
2: 10 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17) [0x2aab14a206c7]
2: 11 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(opal_progress+0x2c) [0x2aaaff5f584c]
2: 12 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5) [0x2aaaff5fc255]
2: 13 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_request_default_wait+0x1e2) [0x2aaabb352dc2]
2: 14 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e3) [0x2aaabb3ad8f3]
2: 15 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xc8) [0x2aaabb3adce8]
2: 16 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x13e) [0x2aab1f96d7de]
2: 17 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(MPI_Bcast+0x116) [0x2aaabb36e776]
2: 18 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x834da8) [0x2aaabae3bda8]
2: 19 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0x134) [0x2aaabae39864]
2: 20 /usr/lib64/libstdc++.so.6(+0xc338f) [0x2aaabb70538f]
2: 21 /lib64/libpthread.so.0(+0x7569) [0x2aaaaacda569]
2: 22 /lib64/libc.so.6(clone+0x3f) [0x2aaaaafe9a2f]
2: ===================
2: [cgpu06:45378] *** Process received signal ***
2: [cgpu06:45378] Signal: Aborted (6)
2: [cgpu06:45378] Signal code: (-6)
2: [cgpu06:45378] [ 0] /lib64/libpthread.so.0(+0x12360)[0x2aaaaace5360]
2: [cgpu06:45378] [ 1]
2: /lib64/libc.so.6(gsignal+0x110)[0x2aaaaaf27160]
2: [cgpu06:45378] [ 2]
2: /lib64/libc.so.6(abort+0x151)[0x2aaaaaf28741]
2: [cgpu06:45378] [ 3] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x20cce)[0x2aab1693fcce]
2: [cgpu06:45378] [ 4]
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_format+0xd6)[0x2aab1693fda6]
2: [cgpu06:45378] [ 5] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(+0x111c4)[0x2aab1670c1c4]
2: [cgpu06:45378] [ 6] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x17c)[0x2aab1670c63c]
2: [cgpu06:45378] [ 7] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4aa9d)[0x2aab14e76a9d]
2: [cgpu06:45378] [ 8] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_wireup_msg_progress+0x7a)[0x2aab14e78d0a]
2: [cgpu06:45378] [ 9]
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4d125)[0x2aab14e79125]
2: [cgpu06:45378] [10] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4e59a)[0x2aab14e7a59a]
2: [cgpu06:45378] [11] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x1a39a)[0x2aab1693939a]
2: [cgpu06:45378] [12] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a)[0x2aab14e4b4fa]
2: [cgpu06:45378] [13] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x2aab14a206c7]
2: [cgpu06:45378] [14]
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(opal_progress+0x2c)[0x2aaaff5f584c]
2: [cgpu06:45378] [15]
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x2aaaff5fc255]
2: [cgpu06:45378] [16] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_request_default_wait+0x1e2)[0x2aaabb352dc2]
2: [cgpu06:45378] [17]
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e3)[0x2aaabb3ad8f3]
2: [cgpu06:45378] [18] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xc8)[0x2aaabb3adce8]
2: [cgpu06:45378] [19]
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x13e)[0x2aab1f96d7de]
2: [cgpu06:45378] [20]
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(MPI_Bcast+0x116)[0x2aaabb36e776]
2: [cgpu06:45378] [21]
2: /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x834da8)[0x2aaabae3bda8]
2: [cgpu06:45378] [22]
2: /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0x134)[0x2aaabae39864]
2: [cgpu06:45378] [23]
2: /usr/lib64/libstdc++.so.6(+0xc338f)[0x2aaabb70538f]
2: [cgpu06:45378] [24] /lib64/libpthread.so.0(+0x7569)[0x2aaaaacda569]
2: [cgpu06:45378] [25]
2: /lib64/libc.so.6(clone+0x3f)[0x2aaaaafe9a2f]
2: [cgpu06:45378] *** End of error message ***
4: [1576565271.438377] [cgpu06:45380:0] mm_posix.c:449 UCX ERROR Error returned from open in attach. Permission denied. File name is: /proc/45376/fd/54
4: [cgpu06:45380:0:45520] mm_ep.c:168 Fatal: Failed to attach to remote mmid:194888436023734. Shared memory error
4: ==== backtrace ====
4: 0 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_message+0x99) [0x2aab1693fcc9]
4: 1 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_format+0xd6) [0x2aab1693fda6]
4: 2 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(+0x111c4) [0x2aab1670c1c4]
4: 3 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x17c) [0x2aab1670c63c]
4: 4 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4aa9d) [0x2aab14e76a9d]
4: 5 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_wireup_msg_progress+0x7a) [0x2aab14e78d0a]
4: 6 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4d125) [0x2aab14e79125]
4: 7 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4e59a) [0x2aab14e7a59a]
4: 8 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x1a39a) [0x2aab1693939a]
4: 9 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x2aab14e4b4fa]
4: 10 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17) [0x2aab14a206c7]
4: 11 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(opal_progress+0x2c) [0x2aaaff5f584c]
4: 12 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5) [0x2aaaff5fc255]
4: 13 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_request_default_wait+0x1e2) [0x2aaabb352dc2]
4: 14 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e3) [0x2aaabb3ad8f3]
4: 15 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xc8) [0x2aaabb3adce8]
4: 16 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x13e) [0x2aab1f96d7de]
4: 17 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(MPI_Bcast+0x116) [0x2aaabb36e776]
4: 18 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x834da8) [0x2aaabae3bda8]
4: 19 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0x134) [0x2aaabae39864]
4: 20 /usr/lib64/libstdc++.so.6(+0xc338f) [0x2aaabb70538f]
4: 21 /lib64/libpthread.so.0(+0x7569) [0x2aaaaacda569]
4: 22 /lib64/libc.so.6(clone+0x3f) [0x2aaaaafe9a2f]
4: ===================
4: [cgpu06:45380] *** Process received signal ***
4: [cgpu06:45380] Signal: Aborted (6)
4: [cgpu06:45380] Signal code: (-6)
4: [cgpu06:45380] [ 0]
4: /lib64/libpthread.so.0(+0x12360)[0x2aaaaace5360]
4: [cgpu06:45380] [ 1]
4: /lib64/libc.so.6(gsignal+0x110)[0x2aaaaaf27160]
4: [cgpu06:45380] [ 2] /lib64/libc.so.6(abort+0x151)[0x2aaaaaf28741]
4: [cgpu06:45380] [ 3]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x20cce)[0x2aab1693fcce]
4: [cgpu06:45380] [ 4]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_format+0xd6)[0x2aab1693fda6]
4: [cgpu06:45380] [ 5] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(+0x111c4)[0x2aab1670c1c4]
4: [cgpu06:45380] [ 6] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x17c)[0x2aab1670c63c]
4: [cgpu06:45380] [ 7] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4aa9d)[0x2aab14e76a9d]
4: [cgpu06:45380] [ 8] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_wireup_msg_progress+0x7a)[0x2aab14e78d0a]
4: [cgpu06:45380] [ 9] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4d125)[0x2aab14e79125]
4: [cgpu06:45380] [10] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4e59a)[0x2aab14e7a59a]
4: [cgpu06:45380] [11]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x1a39a)[0x2aab1693939a]
4: [cgpu06:45380] [12] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a)[0x2aab14e4b4fa]
4: [cgpu06:45380] [13]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x2aab14a206c7]
4: [cgpu06:45380] [14]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(opal_progress+0x2c)[0x2aaaff5f584c]
4: [cgpu06:45380] [15]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x2aaaff5fc255]
4: [cgpu06:45380] [16]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_request_default_wait+0x1e2)[0x2aaabb352dc2]
4: [cgpu06:45380] [17]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e3)[0x2aaabb3ad8f3]
4: [cgpu06:45380] [18]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xc8)[0x2aaabb3adce8]
4: [cgpu06:45380] [19]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x13e)[0x2aab1f96d7de]
4: [cgpu06:45380] [20]
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(MPI_Bcast+0x116)[0x2aaabb36e776]
4: [cgpu06:45380] [21]
4: /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x834da8)[0x2aaabae3bda8]
4: [cgpu06:45380] [22]
4: /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0x134)[0x2aaabae39864]
4: [cgpu06:45380] [23]
4: /usr/lib64/libstdc++.so.6(+0xc338f)[0x2aaabb70538f]
4: [cgpu06:45380] [24]
4: /lib64/libpthread.so.0(+0x7569)[0x2aaaaacda569]
4: [cgpu06:45380] [25]
4: /lib64/libc.so.6(clone+0x3f)[0x2aaaaafe9a2f]
4: [cgpu06:45380] *** End of error message ***
srun: error: cgpu06: task 0: Segmentation fault
srun: Terminating job step 358157.14
srun: error: cgpu06: tasks 2,4: Aborted
0: slurmstepd: error: *** STEP 358157.14 ON cgpu06 CANCELLED AT 2019-12-16T22:47:52 ***
srun: error: cgpu06: tasks 3,7: Terminated
srun: error: cgpu06: tasks 5-6: Terminated
srun: error: cgpu06: task 1: Terminated
srun: Force Terminated job step 358157.14