Created May 24, 2021 19:38
horovod error
21/05/24 21:35:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/store.py:299: FutureWarning: pyarrow.LocalFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
  self._fs = pa.LocalFileSystem()
/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/store.py:299: FutureWarning: pyarrow.filesystem.LocalFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
  self._fs = pa.LocalFileSystem()
--2021-05-24 21:35:59-- https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.bz2
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)... 140.112.30.26
Connecting to www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)|140.112.30.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15179306 (14M) [application/x-bzip2]
Saving to: ‘/tmp/horovod_spark/mnist.bz2’
/tmp/horovod_spark/mnist.bz2 100%[==========================================================================================================================================>] 14.48M 479KB/s in 47s
2021-05-24 21:36:50 (317 KB/s) - ‘/tmp/horovod_spark/mnist.bz2’ saved [15179306/15179306]
2021-05-24 21:36:55.687446: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-24 21:36:55.687632: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-24 21:36:55.688308: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
num_partitions=40
writing dataframes
train_data_path=file:///tmp/horovod_spark/intermediate_train_data.0
val_data_path=file:///tmp/horovod_spark/intermediate_val_data.0
train_partitions=40
/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/util.py:509: FutureWarning: The 'field_by_name' method is deprecated, use 'field' instead
  metadata, avg_row_size = make_metadata_dictionary(train_data_schema)
train_rows=54006
---------------------------------------- (0 + 4) / 4] | |
Exception happened during processing of request from ('127.0.0.1', 35454) | |
Traceback (most recent call last): | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread | |
self.finish_request(request, client_address) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request | |
self.RequestHandlerClass(request, client_address, self) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__ | |
self.handle() | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle | |
server._wire.write(resp, self.wfile) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write | |
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps | |
cp.dump(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump | |
return Pickler.dump(self, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump | |
self.save(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save | |
self.save_reduce(obj=obj, *rv) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce | |
save(state) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict | |
StockPickler.save_dict(pickler, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict | |
self._batch_setitems(obj.items()) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems | |
save(v) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function | |
*self._dynamic_function_reduce(obj), obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5 | |
dictitems=dictitems, obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce | |
save(args) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell | |
f = obj.cell_contents | |
ValueError: Cell is empty | |
---------------------------------------- | |
---------------------------------------- | |
Exception happened during processing of request from ('127.0.0.1', 35456) | |
Traceback (most recent call last): | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread | |
self.finish_request(request, client_address) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request | |
self.RequestHandlerClass(request, client_address, self) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__ | |
self.handle() | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle | |
server._wire.write(resp, self.wfile) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write | |
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps | |
cp.dump(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump | |
return Pickler.dump(self, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump | |
self.save(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save | |
self.save_reduce(obj=obj, *rv) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce | |
save(state) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict | |
StockPickler.save_dict(pickler, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict | |
self._batch_setitems(obj.items()) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems | |
save(v) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function | |
*self._dynamic_function_reduce(obj), obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5 | |
dictitems=dictitems, obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce | |
save(args) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell | |
f = obj.cell_contents | |
ValueError: Cell is empty | |
---------------------------------------- | |
---------------------------------------- | |
Exception happened during processing of request from ('127.0.0.1', 35458) | |
Traceback (most recent call last): | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread | |
self.finish_request(request, client_address) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request | |
self.RequestHandlerClass(request, client_address, self) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__ | |
self.handle() | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle | |
server._wire.write(resp, self.wfile) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write | |
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps | |
cp.dump(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump | |
return Pickler.dump(self, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump | |
self.save(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save | |
self.save_reduce(obj=obj, *rv) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce | |
save(state) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict | |
StockPickler.save_dict(pickler, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict | |
self._batch_setitems(obj.items()) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems | |
save(v) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function | |
*self._dynamic_function_reduce(obj), obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5 | |
dictitems=dictitems, obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce | |
save(args) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell | |
f = obj.cell_contents | |
ValueError: Cell is empty | |
---------------------------------------- | |
[1,3]<stderr>:Traceback (most recent call last): | |
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/runpy.py", line 193, in _run_module_as_main | |
[1,3]<stderr>: "__main__", mod_spec) | |
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/runpy.py", line 85, in _run_code | |
[1,3]<stderr>: exec(code, run_globals) | |
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 52, in <module> | |
[1,3]<stderr>: main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2])) | |
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 45, in main | |
[1,3]<stderr>: task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK', 'OMPI_COMM_WORLD_LOCAL_RANK')[1,3]<stderr>: | |
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/__init__.py", line 60, in task_exec | |
[1,3]<stderr>: fn, args, kwargs = driver_client.code() | |
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/driver/driver_service.py", line 245, in code | |
[1,3]<stderr>: resp = self._send(CodeRequest()) | |
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 303, in _send | |
[1,3]<stderr>: return self._send_one(addr, req, stream) | |
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 279, in _send_one | |
[1,3]<stderr>: resp = self._wire.read(rfile) | |
[1,3]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 95, in read | |
[1,3]<stderr>: message_len = struct.unpack('i', rfile.read(4))[0] | |
[1,3]<stderr>:struct.error: unpack requires a buffer of 4 bytes | |
---------------------------------------- | |
Exception happened during processing of request from ('127.0.0.1', 35468) | |
Traceback (most recent call last): | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread | |
self.finish_request(request, client_address) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request | |
self.RequestHandlerClass(request, client_address, self) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__ | |
self.handle() | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle | |
server._wire.write(resp, self.wfile) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write | |
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps | |
cp.dump(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump | |
return Pickler.dump(self, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump | |
self.save(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save | |
self.save_reduce(obj=obj, *rv) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce | |
save(state) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict | |
StockPickler.save_dict(pickler, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict | |
self._batch_setitems(obj.items()) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems | |
save(v) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function | |
*self._dynamic_function_reduce(obj), obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5 | |
dictitems=dictitems, obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce | |
save(args) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell | |
f = obj.cell_contents | |
ValueError: Cell is empty | |
---------------------------------------- | |
---------------------------------------- | |
Exception happened during processing of request from ('127.0.0.1', 35472) | |
Traceback (most recent call last): | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread | |
self.finish_request(request, client_address) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request | |
self.RequestHandlerClass(request, client_address, self) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__ | |
self.handle() | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle | |
server._wire.write(resp, self.wfile) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write | |
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps | |
cp.dump(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump | |
return Pickler.dump(self, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump | |
self.save(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save | |
self.save_reduce(obj=obj, *rv) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce | |
save(state) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict | |
StockPickler.save_dict(pickler, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict | |
self._batch_setitems(obj.items()) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems | |
save(v) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function | |
*self._dynamic_function_reduce(obj), obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5 | |
dictitems=dictitems, obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce | |
save(args) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell | |
f = obj.cell_contents | |
ValueError: Cell is empty | |
---------------------------------------- | |
---------------------------------------- | |
Exception happened during processing of request from ('127.0.0.1', 35474) | |
Traceback (most recent call last): | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 650, in process_request_thread | |
self.finish_request(request, client_address) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 360, in finish_request | |
self.RequestHandlerClass(request, client_address, self) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/socketserver.py", line 720, in __init__ | |
self.handle() | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 131, in handle | |
server._wire.write(resp, self.wfile) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 75, in write | |
message = cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps | |
cp.dump(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump | |
return Pickler.dump(self, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 437, in dump | |
self.save(obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 549, in save | |
self.save_reduce(obj=obj, *rv) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 662, in save_reduce | |
save(state) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict | |
StockPickler.save_dict(pickler, obj) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 859, in save_dict | |
self._batch_setitems(obj.items()) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 885, in _batch_setitems | |
save(v) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function | |
*self._dynamic_function_reduce(obj), obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 682, in _save_reduce_pickle5 | |
dictitems=dictitems, obj=obj | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 638, in save_reduce | |
save(args) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 789, in save_tuple | |
save(element) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/pickle.py", line 504, in save | |
f(self, obj) # Call unbound method with explicit self | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/dill/_dill.py", line 1177, in save_cell | |
f = obj.cell_contents | |
ValueError: Cell is empty | |
---------------------------------------- | |
[1,0]<stderr>:Traceback (most recent call last): | |
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/runpy.py", line 193, in _run_module_as_main | |
[1,0]<stderr>: "__main__", mod_spec) | |
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/runpy.py", line 85, in _run_code | |
[1,0]<stderr>: exec(code, run_globals) | |
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 52, in <module> | |
[1,0]<stderr>: main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2])) | |
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 45, in main | |
[1,0]<stderr>: task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK', 'OMPI_COMM_WORLD_LOCAL_RANK') | |
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/task/__init__.py", line 60, in task_exec | |
[1,0]<stderr>: fn, args, kwargs = driver_client.code() | |
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/driver/driver_service.py", line 245, in code | |
[1,0]<stderr>: resp = self._send(CodeRequest()) | |
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 303, in _send | |
[1,0]<stderr>: return self._send_one(addr, req, stream) | |
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 279, in _send_one | |
[1,0]<stderr>: resp = self._wire.read(rfile) | |
[1,0]<stderr>: File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/common/util/network.py", line 95, in read | |
[1,0]<stderr>: message_len = struct.unpack('i', rfile.read(4))[0] | |
[1,0]<stderr>:struct.error: unpack requires a buffer of 4 bytes | |
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[58371,1],3]
  Exit code: 1
--------------------------------------------------------------------------
Exception in thread Thread-3: | |
Traceback (most recent call last): | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/threading.py", line 926, in _bootstrap_inner | |
self.run() | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/threading.py", line 870, in run | |
self._target(*self._args, **self._kwargs) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/runner.py", line 140, in run_spark | |
result = procs.mapPartitionsWithIndex(mapper).collect() | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/pyspark/rdd.py", line 949, in collect | |
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__ | |
answer, self.gateway_client, self.target_id, self.name) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/pyspark/sql/utils.py", line 111, in deco | |
return f(*a, **kw) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value | |
format(target_id, ".", name), value) | |
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. | |
: org.apache.spark.SparkException: Job 3 cancelled part of cancelled job group horovod.spark.run.0 | |
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253) | |
at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:2149) | |
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleJobGroupCancelled$4(DAGScheduler.scala:1047) | |
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23) | |
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) | |
at org.apache.spark.scheduler.DAGScheduler.handleJobGroupCancelled(DAGScheduler.scala:1046) | |
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2402) | |
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382) | |
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371) | |
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) | |
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868) | |
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202) | |
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223) | |
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242) | |
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267) | |
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030) | |
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) | |
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) | |
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) | |
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029) | |
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180) | |
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) | |
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) | |
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) | |
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) | |
at java.lang.reflect.Method.invoke(Method.java:498) | |
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) | |
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) | |
at py4j.Gateway.invoke(Gateway.java:282) | |
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) | |
at py4j.commands.CallCommand.execute(CallCommand.java:79) | |
at py4j.GatewayConnection.run(GatewayConnection.java:238) | |
at java.lang.Thread.run(Thread.java:748) | |
Traceback (most recent call last): | |
File "horovod_spark.py", line 95, in <module> | |
keras_model = keras_estimator.fit(train_df).setOutputCols(['label_prob']) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/estimator.py", line 35, in fit | |
return super(HorovodEstimator, self).fit(df, params) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/pyspark/ml/base.py", line 161, in fit | |
return self._fit(dataset) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/estimator.py", line 81, in _fit | |
backend, train_rows, val_rows, metadata, avg_row_size, dataset_idx) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/keras/estimator.py", line 317, in _fit_on_prepared_data | |
env=env) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/common/backend.py", line 85, in run | |
**self._kwargs) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/runner.py", line 284, in run | |
_launch_job(use_mpi, use_gloo, settings, driver, env, stdout, stderr) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/runner.py", line 155, in _launch_job | |
settings.verbose) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/launch.py", line 704, in run_controller | |
mpi_run() | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/runner.py", line 153, in <lambda> | |
use_mpi, lambda: mpi_run(settings, nics, driver, env, stdout, stderr), | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/spark/mpi_run.py", line 55, in mpi_run | |
hr_mpi_run(settings, nics, env, command, stdout=stdout, stderr=stderr) | |
File "/home/darkjh/miniconda3/envs/tf-test/lib/python3.7/site-packages/horovod/runner/mpi_run.py", line 252, in mpi_run | |
raise RuntimeError("mpirun failed with exit code {exit_code}".format(exit_code=exit_code)) | |
RuntimeError: mpirun failed with exit code 1 |