@thuningxu
Created March 18, 2019 17:47
import horovod.tensorflow as hvd
import tensorflow as tf
import time

config = tf.ConfigProto()
tf.enable_eager_execution(config=config)
hvd.init()

with tf.device("/cpu:0"):
    tf.set_random_seed(1234)
    tensor = tf.random_uniform(
        [1], -100, 100, dtype=tf.int32)
    if hvd.rank() != 0:
        time.sleep(80 * hvd.rank())  # delay non-zero ranks so rank 0 waits on them
    hvd.allreduce(tensor, average=False)
$ HOROVOD_STALL_CHECK_TIME_SECONDS=15 HOROVOD_STALL_SHUTDOWN_TIME_SECONDS=30 mpirun --tag-output --oversubscribe -np 2 -mca btl ^tcp python test/test_stall.py
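As a rough way to reason about what this run should do, the sketch below compares each rank's sleep (80 seconds times the rank, as in the script) against the two thresholds passed on the command line. The function names and the "on time" / "warning" / "shutdown" labels are hypothetical, chosen only for illustration. It also assumes a simple threshold comparison: Horovod checks for stalls periodically, so the real log timestamps can lag the thresholds.

```python
def sleep_seconds(rank, per_rank_delay=80):
    """Delay before a rank calls allreduce, mirroring time.sleep(80 * hvd.rank())."""
    return per_rank_delay * rank


def classify_ranks(num_ranks, check_s=15, shutdown_s=30):
    """Label each rank by how its delay compares to the stall thresholds.

    check_s and shutdown_s stand in for HOROVOD_STALL_CHECK_TIME_SECONDS and
    HOROVOD_STALL_SHUTDOWN_TIME_SECONDS. This is a simplified model, not
    Horovod's actual stall-check logic.
    """
    outcome = {}
    for rank in range(num_ranks):
        delay = sleep_seconds(rank)
        if delay > shutdown_s:
            outcome[rank] = "shutdown"   # late enough to trigger the shutdown
        elif delay > check_s:
            outcome[rank] = "warning"    # late enough to be reported as stalled
        else:
            outcome[rank] = "on time"
    return outcome


print(classify_ranks(2))  # {0: 'on time', 1: 'shutdown'}
```

With two ranks, rank 1 sleeps 80 seconds, which exceeds the 30 second shutdown threshold, so the run is expected to end with a shutdown rather than complete the allreduce.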
[1,0]<stderr>:2019-03-18 10:45:25.136739: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[1,1]<stderr>:2019-03-18 10:45:25.136736: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[1,0]<stderr>:[2019-03-18 10:45:55.141569: W horovod/common/operations.cc:[1,0]<stderr>:621] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 15 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]<stderr>:Stalled ranks:
[1,0]<stderr>:1: [HorovodAllreduce]
[1,0]<stderr>:[2019-03-18 10:46:10.143080: E horovod/common/operations.cc:619] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 15 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]<stderr>:Stalled ranks:
[1,0]<stderr>:1!: [HorovodAllreduce]
[1,0]<stderr>:One or more rank (marked by "!") is stalled for longer than 30 seconds. Will shutdown.
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "test/test_stall.py", line 15, in <module>
[1,0]<stderr>: hvd.allreduce(tensor, average=False)
[1,0]<stderr>: File "/Users/nx/sd/horovod/env/lib/python2.7/site-packages/horovod-0.16.0-py2.7-macosx-10.14-x86_64.egg/horovod/tensorflow/__init__.py", line 88, in allreduce
[1,0]<stderr>: summed_tensor_compressed = _allreduce(tensor_compressed)
[1,0]<stderr>: File "/Users/nx/sd/horovod/env/lib/python2.7/site-packages/horovod-0.16.0-py2.7-macosx-10.14-x86_64.egg/horovod/tensorflow/mpi_ops.py", line 91, in _allreduce
[1,0]<stderr>: return MPI_LIB.horovod_allreduce(tensor, name=name)
[1,0]<stderr>: File "<string>", line 74, in horovod_allreduce
[1,0]<stderr>: File "/Users/nx/sd/horovod/env/lib/python2.7/site-packages/six.py", line 737, in raise_from
[1,0]<stderr>: raise value
[1,0]<stderr>:tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message. [Op:HorovodAllreduce]
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[5009,1],0]
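When scanning logs like the one above, it can help to pull out the "Stalled ranks:" entries automatically. The following is a minimal sketch that assumes the line shapes seen in this log: an optional mpirun tag prefix such as `[1,0]<stderr>:`, then the rank, an optional `!` for ranks past the shutdown threshold, and a bracketed list of op names. The helper name `parse_stalled_ranks` and the comma-separated handling of multiple op names are assumptions, not part of Horovod.

```python
import re

# Optional "[job,rank]<stream>:" tag, then "<rank>[!]: [<ops>]".
STALLED_LINE = re.compile(r"^(?:\[\d+,\d+\]<\w+>:)?(\d+)(!?): \[(.*)\]$")


def parse_stalled_ranks(log_text):
    """Return (rank, past_shutdown_threshold, op_names) for each stalled-rank line."""
    results = []
    in_block = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.endswith("Stalled ranks:"):
            in_block = True
            continue
        if in_block:
            match = STALLED_LINE.match(line)
            if match:
                rank = int(match.group(1))
                flagged = match.group(2) == "!"
                ops = match.group(3).split(", ")  # assumed separator for multiple ops
                results.append((rank, flagged, ops))
            # A non-matching line ends the current "Stalled ranks:" block.
            in_block = bool(match)
    return results


SAMPLE = """\
[1,0]<stderr>:Stalled ranks:
[1,0]<stderr>:1: [HorovodAllreduce]
[1,0]<stderr>:Stalled ranks:
[1,0]<stderr>:1!: [HorovodAllreduce]
[1,0]<stderr>:One or more rank (marked by "!") is stalled for longer than 30 seconds. Will shutdown.
"""

print(parse_stalled_ranks(SAMPLE))
# [(1, False, ['HorovodAllreduce']), (1, True, ['HorovodAllreduce'])]
```

The second entry has the flag set, matching the `1!:` line that precedes the shutdown message in the log.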