horovod error
21/05/24 21:35:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/store.py:299: FutureWarning: pyarrow.LocalFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
self._fs = pa.LocalFileSystem()
/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/store.py:299: FutureWarning: pyarrow.filesystem.LocalFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
self._fs = pa.LocalFileSystem()
--2021-05-24 21:35:59-- https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.bz2
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)... 140.112.30.26
Connecting to www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)|140.112.30.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15179306 (14M) [application/x-bzip2]
Saving to: ‘/tmp/horovod_spark/mnist.bz2’
/tmp/horovod_spark/mnist.bz2 100%[==========================================================================================================================================>] 14.48M 479KB/s in 47s
2021-05-24 21:36:50 (317 KB/s) - ‘/tmp/horovod_spark/mnist.bz2’ saved [15179306/15179306]
2021-05-24 21:36:55.687446: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-24 21:36:55.687632: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-24 21:36:55.688308: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
num_partitions=40
writing dataframes
train_data_path=file:///tmp/horovod_spark/intermediate_train_data.0
val_data_path=file:///tmp/horovod_spark/intermediate_val_data.0
train_partitions=40
/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/util.py:509: FutureWarning: The 'field_by_name' method is deprecated, use 'field' instead
metadata, avg_row_size = make_metadata_dictionary(train_data_schema)
train_rows=54006
---------------------------------------- (0 + 4) / 4]
Exception happened during processing of request from ('127.0.0.1', 35454)
Traceback (most recent call last):
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread
self.finish_request(request, client_address)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__
self.handle()
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle
server._wire.write(resp, self.wfile)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps
cp.dump(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function
*self._dynamic_function_reduce(obj), obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5
dictitems=dictitems, obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce
save(args)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell
f = obj.cell_contents
ValueError: Cell is empty
----------------------------------------
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 35456)
Traceback (most recent call last):
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread
self.finish_request(request, client_address)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__
self.handle()
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle
server._wire.write(resp, self.wfile)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps
cp.dump(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function
*self._dynamic_function_reduce(obj), obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5
dictitems=dictitems, obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce
save(args)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell
f = obj.cell_contents
ValueError: Cell is empty
----------------------------------------
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 35458)
Traceback (most recent call last):
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread
self.finish_request(request, client_address)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__
self.handle()
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle
server._wire.write(resp, self.wfile)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps
cp.dump(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function
*self._dynamic_function_reduce(obj), obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5
dictitems=dictitems, obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce
save(args)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell
f = obj.cell_contents
ValueError: Cell is empty
----------------------------------------
[1,3]<stderr>:Traceback (most recent call last):
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[1,3]<stderr>: "__main__", mod_spec)
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/runpy.py", line 85, in _run_code
[1,3]<stderr>: exec(code, run_globals)
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 52, in <module>
[1,3]<stderr>: main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2]))
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 45, in main
[1,3]<stderr>: task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK', 'OMPI_COMM_WORLD_LOCAL_RANK')
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/__init__.py", line 60, in task_exec
[1,3]<stderr>: fn, args, kwargs = driver_client.code()
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/driver/driver_service.py", line 245, in code
[1,3]<stderr>: resp = self._send(CodeRequest())
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 303, in _send
[1,3]<stderr>: return self._send_one(addr, req, stream)
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 279, in _send_one
[1,3]<stderr>: resp = self._wire.read(rfile)
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 95, in read
[1,3]<stderr>: message_len = struct.unpack('i', rfile.read(4))[0]
[1,3]<stderr>:struct.error: unpack requires a buffer of 4 bytes
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 35468)
Traceback (most recent call last):
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread
self.finish_request(request, client_address)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__
self.handle()
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle
server._wire.write(resp, self.wfile)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps
cp.dump(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function
*self._dynamic_function_reduce(obj), obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5
dictitems=dictitems, obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce
save(args)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell
f = obj.cell_contents
ValueError: Cell is empty
----------------------------------------
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 35472)
Traceback (most recent call last):
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread
self.finish_request(request, client_address)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__
self.handle()
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle
server._wire.write(resp, self.wfile)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps
cp.dump(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function
*self._dynamic_function_reduce(obj), obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5
dictitems=dictitems, obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce
save(args)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell
f = obj.cell_contents
ValueError: Cell is empty
----------------------------------------
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 35474)
Traceback (most recent call last):
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread
self.finish_request(request, client_address)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__
self.handle()
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle
server._wire.write(resp, self.wfile)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps
cp.dump(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function
*self._dynamic_function_reduce(obj), obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5
dictitems=dictitems, obj=obj
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce
save(args)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell
f = obj.cell_contents
ValueError: Cell is empty
----------------------------------------
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[1,0]<stderr>: "__main__", mod_spec)
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/runpy.py", line 85, in _run_code
[1,0]<stderr>: exec(code, run_globals)
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 52, in <module>
[1,0]<stderr>: main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2]))
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 45, in main
[1,0]<stderr>: task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK', 'OMPI_COMM_WORLD_LOCAL_RANK')
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/__init__.py", line 60, in task_exec
[1,0]<stderr>: fn, args, kwargs = driver_client.code()
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/driver/driver_service.py", line 245, in code
[1,0]<stderr>: resp = self._send(CodeRequest())
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 303, in _send
[1,0]<stderr>: return self._send_one(addr, req, stream)
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 279, in _send_one
[1,0]<stderr>: resp = self._wire.read(rfile)
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 95, in read
[1,0]<stderr>: message_len = struct.unpack('i', rfile.read(4))[0]
[1,0]<stderr>:struct.error: unpack requires a buffer of 4 bytes
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[58371,1],3]
Exit code: 1
--------------------------------------------------------------------------
Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/runner.py", line 140, in run_spark
result = procs.mapPartitionsWithIndex(mapper).collect()
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/pyspark/rdd.py", line 949, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job 3 cancelled part of cancelled job group horovod.spark.run.0
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:2149)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleJobGroupCancelled$4(DAGScheduler.scala:1047)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.handleJobGroupCancelled(DAGScheduler.scala:1046)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2402)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "horovod_spark.py", line 95, in <module>
keras_model = keras_estimator.fit(train_df).setOutputCols(['label_prob'])
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/estimator.py", line 35, in fit
return super(HorovodEstimator, self).fit(df, params)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/pyspark/ml/base.py", line 161, in fit
return self._fit(dataset)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/estimator.py", line 81, in _fit
backend, train_rows, val_rows, metadata, avg_row_size, dataset_idx)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/keras/estimator.py", line 317, in _fit_on_prepared_data
env=env)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/backend.py", line 85, in run
**self._kwargs)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/runner.py", line 284, in run
_launch_job(use_mpi, use_gloo, settings, driver, env, stdout, stderr)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/runner.py", line 155, in _launch_job
settings.verbose)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/launch.py", line 704, in run_controller
mpi_run()
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/runner.py", line 153, in <lambda>
use_mpi, lambda: mpi_run(settings, nics, driver, env, stdout, stderr),
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/mpi_run.py", line 55, in mpi_run
hr_mpi_run(settings, nics, env, command, stdout=stdout, stderr=stderr)
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/mpi_run.py", line 252, in mpi_run
raise RuntimeError("mpirun failed with exit code {exit_code}".format(exit_code=exit_code))
RuntimeError: mpirun failed with exit code 1
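
Note: one plausible reading of the log above (an assumption, not confirmed by the log) is that the driver-side handler fails while pickling its response, when dill's `save_cell` reads an unassigned closure cell, so the connection closes without a reply and the task's read of the 4-byte length prefix comes back short. The minimal sketch below only reproduces the two error messages in isolation with the standard library. The helper names `empty_cell_error` and `short_read_error` are hypothetical, and nothing here calls Horovod, dill, or cloudpickle.

```python
import struct


def empty_cell_error():
    """Read a closure cell before its variable is bound and return the error text."""
    def inner():
        return value  # refers to `value` in the enclosing scope

    try:
        # The cell exists already, but `value` has not been assigned yet.
        inner.__closure__[0].cell_contents
        message = None
    except ValueError as exc:
        message = str(exc)

    value = 0  # bound only after the cell was inspected
    return message


def short_read_error():
    """Unpack a 4-byte int from an empty buffer, as after a closed connection."""
    try:
        struct.unpack('i', b'')
        return None
    except struct.error as exc:
        return str(exc)


if __name__ == "__main__":
    print(empty_cell_error())
    print(short_read_error())
```

If this reading is right, the `struct.error` lines in the task output would be a symptom of the pickling failure on the driver side rather than a separate fault.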