Created June 19, 2020 10:46
Output logs when the PYSPARK_PYTHON environment variable is set inside the 'spark-env.sh' file
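For context, a minimal sketch of the configuration the description refers to, assuming the conda environment 'spark1' whose interpreter path appears in the logs below (the driver-python line and the exact file location are assumptions; adjust to your setup):

# $SPARK_HOME/conf/spark-env.sh (sketch)
# Point Spark workers at the conda interpreter that has TensorFlow/Horovod installed
export PYSPARK_PYTHON=/home/orwa/anaconda3/envs/spark1/bin/python
# Optionally pin the driver to the same interpreter
export PYSPARK_DRIVER_PYTHON=/home/orwa/anaconda3/envs/spark1/bin/python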
20/06/19 13:41:30 WARN Utils: Your hostname, orwa-virtual-machine resolves to a loopback address: 127.0.1.1; using 192.168.198.131 instead (on interface ens33) | |
20/06/19 13:41:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address | |
20/06/19 13:41:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | |
2020-06-19 13:41:33.862495: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory | |
2020-06-19 13:41:33.862706: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory | |
2020-06-19 13:41:33.862730: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. | |
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties | |
20/06/19 13:41:35 INFO SparkContext: Running Spark version 2.4.5 | |
20/06/19 13:41:35 INFO SparkContext: Submitted application: keras_spark_mnist | |
20/06/19 13:41:36 INFO SecurityManager: Changing view acls to: orwa | |
20/06/19 13:41:36 INFO SecurityManager: Changing modify acls to: orwa | |
20/06/19 13:41:36 INFO SecurityManager: Changing view acls groups to: | |
20/06/19 13:41:36 INFO SecurityManager: Changing modify acls groups to: | |
20/06/19 13:41:36 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(orwa); groups with view permissions: Set(); users with modify permissions: Set(orwa); groups with modify permissions: Set() | |
20/06/19 13:41:36 INFO Utils: Successfully started service 'sparkDriver' on port 37493. | |
20/06/19 13:41:36 INFO SparkEnv: Registering MapOutputTracker | |
20/06/19 13:41:36 INFO SparkEnv: Registering BlockManagerMaster | |
20/06/19 13:41:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information | |
20/06/19 13:41:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up | |
20/06/19 13:41:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-0b5fcac7-0cf1-4623-bd6b-2c0d6d50b054 | |
20/06/19 13:41:36 INFO MemoryStore: MemoryStore started with capacity 366.3 MB | |
20/06/19 13:41:36 INFO SparkEnv: Registering OutputCommitCoordinator | |
20/06/19 13:41:37 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. | |
20/06/19 13:41:37 INFO Utils: Successfully started service 'SparkUI' on port 4041. | |
20/06/19 13:41:37 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.198.131:4041 | |
20/06/19 13:41:37 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.198.131:7077... | |
20/06/19 13:41:37 INFO TransportClientFactory: Successfully created connection to /192.168.198.131:7077 after 98 ms (0 ms spent in bootstraps) | |
20/06/19 13:41:38 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20200619134137-0000 | |
20/06/19 13:41:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40673. | |
20/06/19 13:41:38 INFO NettyBlockTransferService: Server created on 192.168.198.131:40673 | |
20/06/19 13:41:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy | |
20/06/19 13:41:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200619134137-0000/0 on worker-20200619134021-192.168.198.131-40907 (192.168.198.131:40907) with 4 core(s) | |
20/06/19 13:41:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20200619134137-0000/0 on hostPort 192.168.198.131:40907 with 4 core(s), 1024.0 MB RAM | |
20/06/19 13:41:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.198.131, 40673, None) | |
20/06/19 13:41:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619134137-0000/0 is now RUNNING | |
20/06/19 13:41:38 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.198.131:40673 with 366.3 MB RAM, BlockManagerId(driver, 192.168.198.131, 40673, None) | |
20/06/19 13:41:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.198.131, 40673, None) | |
20/06/19 13:41:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.198.131, 40673, None) | |
20/06/19 13:41:38 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 | |
20/06/19 13:41:38 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/orwa/spark_files/spark-warehouse'). | |
20/06/19 13:41:38 INFO SharedState: Warehouse path is 'file:/home/orwa/spark_files/spark-warehouse'. | |
20/06/19 13:41:39 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint | |
20/06/19 13:41:40 INFO InMemoryFileIndex: It took 182 ms to list leaf files for 1 paths. | |
20/06/19 13:41:42 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.198.131:36304) with ID 0 | |
20/06/19 13:41:42 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.198.131:38731 with 366.3 MB RAM, BlockManagerId(0, 192.168.198.131, 38731, None) | |
20/06/19 13:41:44 INFO FileSourceStrategy: Pruning directories with: | |
20/06/19 13:41:44 INFO FileSourceStrategy: Post-Scan Filters: | |
20/06/19 13:41:44 INFO FileSourceStrategy: Output Data Schema: struct<label: double> | |
20/06/19 13:41:44 INFO FileSourceScanExec: Pushed Filters: | |
20/06/19 13:41:45 INFO CodeGenerator: Code generated in 388.066819 ms | |
20/06/19 13:41:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 283.2 KB, free 366.0 MB) | |
20/06/19 13:41:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.4 KB, free 366.0 MB) | |
20/06/19 13:41:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.198.131:40673 (size: 23.4 KB, free: 366.3 MB) | |
20/06/19 13:41:45 INFO SparkContext: Created broadcast 0 from broadcast at LibSVMRelation.scala:153 | |
20/06/19 13:41:45 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4843402 bytes, open cost is considered as scanning 4194304 bytes. | |
20/06/19 13:41:45 INFO SparkContext: Starting job: treeAggregate at OneHotEncoderEstimator.scala:487 | |
20/06/19 13:41:45 INFO DAGScheduler: Got job 0 (treeAggregate at OneHotEncoderEstimator.scala:487) with 4 output partitions | |
20/06/19 13:41:45 INFO DAGScheduler: Final stage: ResultStage 0 (treeAggregate at OneHotEncoderEstimator.scala:487) | |
20/06/19 13:41:46 INFO DAGScheduler: Parents of final stage: List() | |
20/06/19 13:41:46 INFO DAGScheduler: Missing parents: List() | |
20/06/19 13:41:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at treeAggregate at OneHotEncoderEstimator.scala:487), which has no missing parents | |
20/06/19 13:41:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 12.5 KB, free 366.0 MB) | |
20/06/19 13:41:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 6.5 KB, free 366.0 MB) | |
20/06/19 13:41:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.198.131:40673 (size: 6.5 KB, free: 366.3 MB) | |
20/06/19 13:41:46 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1163 | |
20/06/19 13:41:46 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at treeAggregate at OneHotEncoderEstimator.scala:487) (first 15 tasks are for partitions Vector(0, 1, 2, 3)) | |
20/06/19 13:41:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks | |
20/06/19 13:41:46 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.198.131, executor 0, partition 0, PROCESS_LOCAL, 8251 bytes) | |
20/06/19 13:41:46 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.198.131, executor 0, partition 1, PROCESS_LOCAL, 8251 bytes) | |
20/06/19 13:41:46 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 192.168.198.131, executor 0, partition 2, PROCESS_LOCAL, 8251 bytes) | |
20/06/19 13:41:46 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 192.168.198.131, executor 0, partition 3, PROCESS_LOCAL, 8251 bytes) | |
20/06/19 13:41:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.198.131:38731 (size: 6.5 KB, free: 366.3 MB) | |
20/06/19 13:41:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.198.131:38731 (size: 23.4 KB, free: 366.3 MB) | |
20/06/19 13:41:52 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 6384 ms on 192.168.198.131 (executor 0) (1/4) | |
20/06/19 13:41:56 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 10588 ms on 192.168.198.131 (executor 0) (2/4) | |
20/06/19 13:41:56 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10705 ms on 192.168.198.131 (executor 0) (3/4) | |
20/06/19 13:41:56 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 10746 ms on 192.168.198.131 (executor 0) (4/4) | |
20/06/19 13:41:56 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool | |
20/06/19 13:41:56 INFO DAGScheduler: ResultStage 0 (treeAggregate at OneHotEncoderEstimator.scala:487) finished in 10.915 s | |
20/06/19 13:41:56 INFO DAGScheduler: Job 0 finished: treeAggregate at OneHotEncoderEstimator.scala:487, took 11.037180 s | |
2020-06-19 13:41:57.784809: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory | |
2020-06-19 13:41:57.784958: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303) | |
2020-06-19 13:41:57.785018: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (orwa-virtual-machine): /proc/driver/nvidia/version does not exist | |
2020-06-19 13:41:57.820394: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA | |
2020-06-19 13:41:57.834282: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz | |
2020-06-19 13:41:57.835500: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f786ded4c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices: | |
2020-06-19 13:41:57.835551: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version | |
num_partitions=40 | |
writing dataframes | |
train_data_path=file:///home/orwa/tmp/intermediate_train_data.0 | |
val_data_path=file:///home/orwa/tmp/intermediate_val_data.0 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 21 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 7 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 13 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 25 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 11 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 20 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 10 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 8 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 6 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 16 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 12 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 14 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 22 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 28 | |
20/06/19 13:41:58 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.198.131:40673 in memory (size: 6.5 KB, free: 366.3 MB) | |
20/06/19 13:41:58 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.198.131:38731 in memory (size: 6.5 KB, free: 366.3 MB) | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 23 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 19 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 24 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 15 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 27 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 30 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 9 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 17 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 18 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 29 | |
20/06/19 13:41:58 INFO ContextCleaner: Cleaned accumulator 26 | |
20/06/19 13:41:58 INFO FileSourceStrategy: Pruning directories with: | |
20/06/19 13:41:58 INFO FileSourceStrategy: Post-Scan Filters: | |
20/06/19 13:41:58 INFO FileSourceStrategy: Output Data Schema: struct<label: double, features: vector> | |
20/06/19 13:41:58 INFO FileSourceScanExec: Pushed Filters: | |
20/06/19 13:41:58 INFO CodeGenerator: Code generated in 190.262483 ms | |
20/06/19 13:41:58 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 283.2 KB, free 365.7 MB) | |
20/06/19 13:41:58 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 23.4 KB, free 365.7 MB) | |
20/06/19 13:41:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.198.131:40673 (size: 23.4 KB, free: 366.3 MB) | |
20/06/19 13:41:58 INFO SparkContext: Created broadcast 2 from broadcast at LibSVMRelation.scala:153 | |
20/06/19 13:41:58 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4843402 bytes, open cost is considered as scanning 4194304 bytes. | |
20/06/19 13:41:59 INFO SparkContext: Starting job: runJob at PythonRDD.scala:153 | |
20/06/19 13:41:59 INFO DAGScheduler: Got job 1 (runJob at PythonRDD.scala:153) with 1 output partitions | |
20/06/19 13:41:59 INFO DAGScheduler: Final stage: ResultStage 1 (runJob at PythonRDD.scala:153) | |
20/06/19 13:41:59 INFO DAGScheduler: Parents of final stage: List() | |
20/06/19 13:41:59 INFO DAGScheduler: Missing parents: List() | |
20/06/19 13:41:59 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[12] at RDD at PythonRDD.scala:53), which has no missing parents | |
20/06/19 13:41:59 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 42.5 KB, free 365.7 MB) | |
20/06/19 13:41:59 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 17.1 KB, free 365.6 MB) | |
20/06/19 13:41:59 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.198.131:40673 (size: 17.1 KB, free: 366.2 MB) | |
20/06/19 13:41:59 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1163 | |
20/06/19 13:41:59 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (PythonRDD[12] at RDD at PythonRDD.scala:53) (first 15 tasks are for partitions Vector(0)) | |
20/06/19 13:41:59 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks | |
20/06/19 13:41:59 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 4, 192.168.198.131, executor 0, partition 0, PROCESS_LOCAL, 8251 bytes) | |
20/06/19 13:41:59 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.198.131:38731 (size: 17.1 KB, free: 366.3 MB) | |
20/06/19 13:42:00 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.198.131:38731 (size: 23.4 KB, free: 366.2 MB) | |
20/06/19 13:42:06 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 4) in 7348 ms on 192.168.198.131 (executor 0) (1/1) | |
20/06/19 13:42:06 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool | |
20/06/19 13:42:06 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 39071 | |
20/06/19 13:42:06 INFO DAGScheduler: ResultStage 1 (runJob at PythonRDD.scala:153) finished in 7.411 s | |
20/06/19 13:42:06 INFO DAGScheduler: Job 1 finished: runJob at PythonRDD.scala:153, took 7.424411 s | |
train_partitions=40 | |
20/06/19 13:42:07 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter | |
20/06/19 13:42:07 INFO FileOutputCommitter: File Output Committer Algorithm version is 1 | |
20/06/19 13:42:07 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter | |
20/06/19 13:42:07 INFO FileOutputCommitter: File Output Committer Algorithm version is 1 | |
20/06/19 13:42:07 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter | |
20/06/19 13:42:07 INFO SparkContext: Starting job: parquet at NativeMethodAccessorImpl.java:0 | |
20/06/19 13:42:07 INFO DAGScheduler: Got job 2 (parquet at NativeMethodAccessorImpl.java:0) with 4 output partitions | |
20/06/19 13:42:07 INFO DAGScheduler: Final stage: ResultStage 2 (parquet at NativeMethodAccessorImpl.java:0) | |
20/06/19 13:42:07 INFO DAGScheduler: Parents of final stage: List() | |
20/06/19 13:42:07 INFO DAGScheduler: Missing parents: List() | |
20/06/19 13:42:07 INFO DAGScheduler: Submitting ResultStage 2 (CoalescedRDD[18] at parquet at NativeMethodAccessorImpl.java:0), which has no missing parents | |
20/06/19 13:42:07 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 187.2 KB, free 365.5 MB) | |
20/06/19 13:42:07 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 68.6 KB, free 365.4 MB) | |
20/06/19 13:42:07 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.198.131:40673 (size: 68.6 KB, free: 366.2 MB) | |
20/06/19 13:42:07 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1163 | |
20/06/19 13:42:07 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 2 (CoalescedRDD[18] at parquet at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1, 2, 3)) | |
20/06/19 13:42:07 INFO TaskSchedulerImpl: Adding task set 2.0 with 4 tasks | |
20/06/19 13:42:07 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 5, 192.168.198.131, executor 0, partition 0, PROCESS_LOCAL, 8480 bytes) | |
20/06/19 13:42:07 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 6, 192.168.198.131, executor 0, partition 1, PROCESS_LOCAL, 8480 bytes) | |
20/06/19 13:42:07 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 7, 192.168.198.131, executor 0, partition 2, PROCESS_LOCAL, 8480 bytes) | |
20/06/19 13:42:07 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 8, 192.168.198.131, executor 0, partition 3, PROCESS_LOCAL, 8480 bytes) | |
20/06/19 13:42:07 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.198.131:38731 (size: 68.6 KB, free: 366.2 MB) | |
20/06/19 13:42:11 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 8) in 4097 ms on 192.168.198.131 (executor 0) (1/4) | |
20/06/19 13:42:27 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 7) in 19941 ms on 192.168.198.131 (executor 0) (2/4) | |
20/06/19 13:42:27 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 6) in 19971 ms on 192.168.198.131 (executor 0) (3/4) | |
20/06/19 13:42:29 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 5) in 21639 ms on 192.168.198.131 (executor 0) (4/4) | |
20/06/19 13:42:29 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool | |
20/06/19 13:42:29 INFO DAGScheduler: ResultStage 2 (parquet at NativeMethodAccessorImpl.java:0) finished in 21.729 s | |
20/06/19 13:42:29 INFO DAGScheduler: Job 2 finished: parquet at NativeMethodAccessorImpl.java:0, took 21.740013 s | |
20/06/19 13:42:29 INFO FileFormatWriter: Write Job f33088e5-fdc9-4b82-a70d-6dfbf385cdf0 committed. | |
20/06/19 13:42:29 INFO FileFormatWriter: Finished processing stats for write job f33088e5-fdc9-4b82-a70d-6dfbf385cdf0. | |
/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/common/util.py:445: FutureWarning: The 'field_by_name' method is deprecated, use 'field' instead | |
metadata, avg_row_size = make_metadata_dictionary(train_data_schema) | |
train_rows=53978 | |
Running 4 processes... | |
20/06/19 13:42:30 INFO SparkContext: Starting job: collect at /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py:106 | |
20/06/19 13:42:30 INFO DAGScheduler: Got job 3 (collect at /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py:106) with 4 output partitions | |
20/06/19 13:42:30 INFO DAGScheduler: Final stage: ResultStage 3 (collect at /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py:106) | |
20/06/19 13:42:30 INFO DAGScheduler: Parents of final stage: List() | |
20/06/19 13:42:30 INFO DAGScheduler: Missing parents: List() | |
20/06/19 13:42:30 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[22] at collect at /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py:106), which has no missing parents | |
20/06/19 13:42:30 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 5.0 KB, free 365.4 MB) | |
20/06/19 13:42:30 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 3.5 KB, free 365.4 MB) | |
20/06/19 13:42:30 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.198.131:40673 (size: 3.5 KB, free: 366.2 MB) | |
20/06/19 13:42:30 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1163 | |
20/06/19 13:42:30 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 3 (PythonRDD[22] at collect at /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py:106) (first 15 tasks are for partitions Vector(0, 1, 2, 3)) | |
20/06/19 13:42:30 INFO TaskSchedulerImpl: Adding task set 3.0 with 4 tasks | |
20/06/19 13:42:30 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 9, 192.168.198.131, executor 0, partition 0, PROCESS_LOCAL, 7856 bytes) | |
20/06/19 13:42:30 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 10, 192.168.198.131, executor 0, partition 1, PROCESS_LOCAL, 7856 bytes) | |
20/06/19 13:42:30 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 11, 192.168.198.131, executor 0, partition 2, PROCESS_LOCAL, 7856 bytes) | |
20/06/19 13:42:30 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 12, 192.168.198.131, executor 0, partition 3, PROCESS_LOCAL, 7856 bytes) | |
20/06/19 13:42:30 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.198.131:38731 (size: 3.5 KB, free: 366.2 MB) | |
Initial Spark task registration is complete. | |
Spark task-to-task address registration is complete. | |
Checking whether extension tensorflow was built with MPI. | |
Extension tensorflow was built with MPI. | |
2020-06-19 13:42:33.133992: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory | |
2020-06-19 13:42:33.134212: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory | |
2020-06-19 13:42:33.134235: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. | |
mpirun --allow-run-as-root --tag-output -np 4 -H orwa-virtual-machine-4d40ae72ac08da8862db2678b7ed7471:4 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ens33,lo -x NCCL_SOCKET_IFNAME=ens33,lo -x ADDR2LINE -x AR -x AS -x CC -x CFLAGS -x CLUTTER_IM_MODULE -x CMAKE_PREFIX_PATH -x COLORTERM -x CONDA_BACKUP_HOST -x CONDA_BUILD_SYSROOT -x CONDA_DEFAULT_ENV -x CONDA_EXE -x CONDA_PREFIX -x CONDA_PROMPT_MODIFIER -x CONDA_PYTHON_EXE -x CONDA_SHLVL -x CPP -x CPPFLAGS -x CXX -x CXXFILT -x CXXFLAGS -x DBUS_SESSION_BUS_ADDRESS -x DEBUG_CFLAGS -x DEBUG_CPPFLAGS -x DEBUG_CXXFLAGS -x DESKTOP_SESSION -x DISPLAY -x ELFEDIT -x GCC -x GCC_AR -x GCC_NM -x GCC_RANLIB -x GDMSESSION -x GNOME_DESKTOP_SESSION_ID -x GNOME_SHELL_SESSION_MODE -x GNOME_TERMINAL_SCREEN -x GNOME_TERMINAL_SERVICE -x GPG_AGENT_INFO -x GPROF -x GTK_IM_MODULE -x GTK_MODULES -x GXX -x HOME -x HOST -x IM_CONFIG_PHASE -x INVOCATION_ID -x JAVA_HOME -x JOURNAL_STREAM -x LANG -x LC_ADDRESS -x LC_IDENTIFICATION -x LC_MEASUREMENT -x LC_MONETARY -x LC_NAME -x LC_NUMERIC -x LC_PAPER -x LC_TELEPHONE -x LC_TIME -x LD -x LDFLAGS -x LD_GOLD -x LESSCLOSE -x LESSOPEN -x LIBHDFS_OPTS -x LOGNAME -x LS_COLORS -x MANAGERPID -x NM -x OBJCOPY -x OBJDUMP -x PATH -x PWD -x PYSPARK_GATEWAY_PORT -x PYSPARK_GATEWAY_SECRET -x PYSPARK_PYTHON -x PYTHONHASHSEED -x PYTHONPATH -x PYTHONUNBUFFERED -x QT4_IM_MODULE -x QT_ACCESSIBILITY -x QT_IM_MODULE -x RANLIB -x READELF -x SESSION_MANAGER -x SHELL -x SHLVL -x SIZE -x SPARK_CONF_DIR -x SPARK_ENV_LOADED -x SPARK_HOME -x SPARK_MASTER_HOST -x SPARK_SCALA_VERSION -x SSH_AGENT_PID -x SSH_AUTH_SOCK -x STRINGS -x STRIP -x TERM -x TFoS_HOME -x USER -x USERNAME -x VTE_VERSION -x WINDOWPATH -x XAUTHORITY -x XDG_CONFIG_DIRS -x XDG_CURRENT_DESKTOP -x XDG_DATA_DIRS -x XDG_MENU_PREFIX -x XDG_RUNTIME_DIR -x XDG_SESSION_CLASS -x XDG_SESSION_DESKTOP -x XDG_SESSION_TYPE -x XMODIFIERS -x _CE_CONDA -x _CE_M -x _CONDA_PYTHON_SYSCONFIGDATA_NAME -x NCCL_DEBUG=INFO -mca plm_rsh_agent "/home/orwa/anaconda3/envs/spark1/bin/python -m horovod.spark.driver.mpirun_rsh gASVQAAAAAAAAAB9lCiMAmxvlF2UjAkxMjcuMC4wLjGUTXDxhpRhjAVlbnMzM5RdlIwPMTkyLjE2OC4xOTguMTMxlE1w8YaUYXUu gASVhAIAAAAAAACMIGhvcm92b2QucnVuLmNvbW1vbi51dGlsLnNldHRpbmdzlIwIU2V0dGluZ3OUk5QpgZR9lCiMB3ZlcmJvc2WUSwOMCHNzaF9wb3J0lE6MDmV4dHJhX21waV9hcmdzlE6MCHRjcF9mbGFnlE6MDGJpbmRpbmdfYXJnc5ROjANrZXmUTowHdGltZW91dJSMH2hvcm92b2QucnVuLmNvbW1vbi51dGlsLnRpbWVvdXSUjAdUaW1lb3V0lJOUKYGUfZQojAtfdGltZW91dF9hdJRHQde7JjuF4uWMCF9tZXNzYWdllFgOAQAAVGltZWQgb3V0IHdhaXRpbmcgZm9yIHthY3Rpdml0eX0uIFBsZWFzZSBjaGVjayB0aGF0IHlvdSBoYXZlIGVub3VnaCByZXNvdXJjZXMgdG8gcnVuIGFsbCBIb3Jvdm9kIHByb2Nlc3Nlcy4gRWFjaCBIb3Jvdm9kIHByb2Nlc3MgcnVucyBpbiBhIFNwYXJrIHRhc2suIFlvdSBtYXkgbmVlZCB0byBpbmNyZWFzZSB0aGUgc3RhcnRfdGltZW91dCBwYXJhbWV0ZXIgdG8gYSBsYXJnZXIgdmFsdWUgaWYgeW91ciBTcGFyayByZXNvdXJjZXMgYXJlIGFsbG9jYXRlZCBvbi1kZW1hbmQulHVijAludW1faG9zdHOUTowIbnVtX3Byb2OUSwSMBWhvc3RzlIw3b3J3YS12aXJ0dWFsLW1hY2hpbmUtNGQ0MGFlNzJhYzA4ZGE4ODYyZGIyNjc4YjdlZDc0NzE6NJSMD291dHB1dF9maWxlbmFtZZROjA1ydW5fZnVuY19tb2RllIiMBG5pY3OUTnViLg==" /home/orwa/anaconda3/envs/spark1/bin/python -m horovod.spark.task.mpirun_exec_fn gASVQAAAAAAAAAB9lCiMAmxvlF2UjAkxMjcuMC4wLjGUTXDxhpRhjAVlbnMzM5RdlIwPMTkyLjE2OC4xOTguMTMxlE1w8YaUYXUu 
gASV0gYAAAAAAACMIGhvcm92b2QucnVuLmNvbW1vbi51dGlsLnNldHRpbmdzlIwIU2V0dGluZ3OUk5QpgZR9lCiMB3ZlcmJvc2WUSwOMCHNzaF9wb3J0lE6MDmV4dHJhX21waV9hcmdzlFhJBAAAIC14IE5DQ0xfREVCVUc9SU5GTyAtbWNhIHBsbV9yc2hfYWdlbnQgIi9ob21lL29yd2EvYW5hY29uZGEzL2VudnMvc3BhcmsxL2Jpbi9weXRob24gLW0gaG9yb3ZvZC5zcGFyay5kcml2ZXIubXBpcnVuX3JzaCBnQVNWUUFBQUFBQUFBQUI5bENpTUFteHZsRjJVakFreE1qY3VNQzR3TGpHVVRYRHhocFJoakFWbGJuTXpNNVJkbEl3UE1Ua3lMakUyT0M0eE9UZ3VNVE14bEUxdzhZYVVZWFV1IGdBU1ZoQUlBQUFBQUFBQ01JR2h2Y205MmIyUXVjblZ1TG1OdmJXMXZiaTUxZEdsc0xuTmxkSFJwYm1kemxJd0lVMlYwZEdsdVozT1VrNVFwZ1pSOWxDaU1CM1psY21KdmMyV1VTd09NQ0hOemFGOXdiM0owbEU2TURtVjRkSEpoWDIxd2FWOWhjbWR6bEU2TUNIUmpjRjltYkdGbmxFNk1ER0pwYm1ScGJtZGZZWEpuYzVST2pBTnJaWG1VVG93SGRHbHRaVzkxZEpTTUgyaHZjbTkyYjJRdWNuVnVMbU52YlcxdmJpNTFkR2xzTG5ScGJXVnZkWFNVakFkVWFXMWxiM1YwbEpPVUtZR1VmWlFvakF0ZmRHbHRaVzkxZEY5aGRKUkhRZGU3Smp1RjR1V01DRjl0WlhOellXZGxsRmdPQVFBQVZHbHRaV1FnYjNWMElIZGhhWFJwYm1jZ1ptOXlJSHRoWTNScGRtbDBlWDB1SUZCc1pXRnpaU0JqYUdWamF5QjBhR0YwSUhsdmRTQm9ZWFpsSUdWdWIzVm5hQ0J5WlhOdmRYSmpaWE1nZEc4Z2NuVnVJR0ZzYkNCSWIzSnZkbTlrSUhCeWIyTmxjM05sY3k0Z1JXRmphQ0JJYjNKdmRtOWtJSEJ5YjJObGMzTWdjblZ1Y3lCcGJpQmhJRk53WVhKcklIUmhjMnN1SUZsdmRTQnRZWGtnYm1WbFpDQjBieUJwYm1OeVpXRnpaU0IwYUdVZ2MzUmhjblJmZEdsdFpXOTFkQ0J3WVhKaGJXVjBaWElnZEc4Z1lTQnNZWEpuWlhJZ2RtRnNkV1VnYVdZZ2VXOTFjaUJUY0dGeWF5QnlaWE52ZFhKalpYTWdZWEpsSUdGc2JHOWpZWFJsWkNCdmJpMWtaVzFoYm1RdWxIVmlqQWx1ZFcxZmFHOXpkSE9VVG93SWJuVnRYM0J5YjJPVVN3U01CV2h2YzNSemxJdzNiM0ozWVMxMmFYSjBkV0ZzTFcxaFkyaHBibVV0TkdRME1HRmxOekpoWXpBNFpHRTRPRFl5WkdJeU5qYzRZamRsWkRjME56RTZOSlNNRDI5MWRIQjFkRjltYVd4bGJtRnRaWlJPakExeWRXNWZablZ1WTE5dGIyUmxsSWlNQkc1cFkzT1VUblZpTGc9PSKUjAh0Y3BfZmxhZ5ROjAxiaW5kaW5nX2FyZ3OUTowDa2V5lE6MB3RpbWVvdXSUjB9ob3Jvdm9kLnJ1bi5jb21tb24udXRpbC50aW1lb3V0lIwHVGltZW91dJSTlCmBlH2UKIwLX3RpbWVvdXRfYXSUR0HXuyY7heLljAhfbWVzc2FnZZRYDgEAAFRpbWVkIG91dCB3YWl0aW5nIGZvciB7YWN0aXZpdHl9LiBQbGVhc2UgY2hlY2sgdGhhdCB5b3UgaGF2ZSBlbm91Z2ggcmVzb3VyY2VzIHRvIHJ1biBhbGwgSG9yb3ZvZCBwcm9jZXNzZXMuIEVhY2ggSG9yb3ZvZCBwcm9jZXNzIHJ1bnMgaW4gYSBTcGFyayB0YXNrLiBZb3UgbWF5IG5lZWQgdG8gaW5jcmVhc2UgdGhlIHN0YXJ0X3RpbWVvdXQgcGFyYW1ldGVyIHRvIGEgbGFyZ2VyIHZhbHVlIGlmIHlvdXIgU3BhcmsgcmVzb3VyY2VzIGFyZSBhbGxvY2F0ZWQgb24tZGVtYW5kLpR1YowJbnVtX2hvc3RzlE6MCG51bV9wcm9jlEsEjAVob3N0c5SMN29yd2EtdmlydHVhbC1tYWNoaW5lLTRkNDBhZTcyYWMwOGRhODg2MmRiMjY3OGI3ZWQ3NDcxOjSUjA9vdXRwdXRfZmlsZW5hbWWUTowNcnVuX2Z1bmNfbW9kZZSIjARuaWNzlE51Yi4= | |
2020-06-19 13:42:38.570645: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory | |
2020-06-19 13:42:38.570855: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory | |
2020-06-19 13:42:38.570878: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. | |
[1,0]<stdout>:Changing cwd from /home/orwa/spark_files to /home/orwa/spark/work/app-20200619134137-0000/0 | |
[1,2]<stdout>:Changing cwd from /home/orwa/spark_files to /home/orwa/spark/work/app-20200619134137-0000/0[1,2]<stdout>: | |
[1,3]<stdout>:Changing cwd from /home/orwa/spark_files to /home/orwa/spark/work/app-20200619134137-0000/0[1,3]<stdout>: | |
[1,1]<stdout>:Changing cwd from /home/orwa/spark_files to /home/orwa/spark/work/app-20200619134137-0000/0 | |
[1,0]<stderr>:2020-06-19 13:44:11.280356: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory | |
[1,0]<stderr>:2020-06-19 13:44:11.280581: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory | |
[1,0]<stderr>:2020-06-19 13:44:11.280606: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. | |
[1,2]<stderr>:2020-06-19 13:44:11.299530: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory | |
[1,2]<stderr>:2020-06-19 13:44:11.299815: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory | |
[1,2]<stderr>:2020-06-19 13:44:11.299844: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. | |
[1,3]<stderr>:2020-06-19 13:44:11.351313: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory | |
[1,3]<stderr>:2020-06-19 13:44:11.351864: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory | |
[1,3]<stderr>:2020-06-19 13:44:11.352047: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. | |
[1,1]<stderr>:2020-06-19 13:44:11.353749: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory | |
[1,1]<stderr>:2020-06-19 13:44:11.353941: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory | |
[1,1]<stderr>:2020-06-19 13:44:11.353958: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. | |
[1,2]<stderr>:2020-06-19 13:44:14.246609: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory | |
[1,2]<stderr>:2020-06-19 13:44:14.246673: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303) | |
[1,2]<stderr>:2020-06-19 13:44:14.246723: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (orwa-virtual-machine): /proc/driver/nvidia/version does not exist | |
[1,2]<stderr>:2020-06-19 13:44:14.247580: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA | |
[1,3]<stderr>:2020-06-19 13:44:14.246430: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory | |
[1,3]<stderr>:2020-06-19 13:44:14.246574: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303) | |
[1,3]<stderr>:2020-06-19 13:44:14.246642: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (orwa-virtual-machine): /proc/driver/nvidia/version does not exist | |
[1,3]<stderr>:2020-06-19 13:44:14.247438: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA | |
[1,0]<stderr>:2020-06-19 13:44:14.246970: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory | |
[1,0]<stderr>:2020-06-19 13:44:14.247024: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303) | |
[1,0]<stderr>:2020-06-19 13:44:14.247067: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (orwa-virtual-machine): /proc/driver/nvidia/version does not exist | |
[1,0]<stderr>:2020-06-19 13:44:14.247887: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA | |
[1,1]<stderr>:2020-06-19 13:44:14.249068: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory | |
[1,1]<stderr>:2020-06-19 13:44:14.249125: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303) | |
[1,1]<stderr>:2020-06-19 13:44:14.249172: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (orwa-virtual-machine): /proc/driver/nvidia/version does not exist | |
[1,1]<stderr>:2020-06-19 13:44:14.258062: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA | |
[1,3]<stderr>:2020-06-19 13:44:14.267170: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz | |
[1,3]<stderr>:2020-06-19 13:44:14.278151: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558860aeec30 initialized for platform Host (this does not guarantee that XLA will be used). Devices: | |
[1,3]<stderr>:2020-06-19 13:44:14.278272: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version | |
[1,2]<stderr>:2020-06-19 13:44:14.286139: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz | |
[1,2]<stderr>:2020-06-19 13:44:14.287107: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f6a04f8d70 initialized for platform Host (this does not guarantee that XLA will be used). Devices: | |
[1,2]<stderr>:2020-06-19 13:44:14.287480: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version | |
[1,0]<stderr>:2020-06-19 13:44:14.301974: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz | |
[1,0]<stderr>:2020-06-19 13:44:14.303189: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5565571d2d00 initialized for platform Host (this does not guarantee that XLA will be used). Devices: | |
[1,0]<stderr>:2020-06-19 13:44:14.303490: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version | |
[1,1]<stderr>:2020-06-19 13:44:14.332838: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz | |
[1,1]<stderr>:2020-06-19 13:44:14.337740: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56389d577970 initialized for platform Host (this does not guarantee that XLA will be used). Devices: | |
[1,1]<stderr>:2020-06-19 13:44:14.337845: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version | |
[1,1]<stderr>:/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py:125: RuntimeWarning: 'petastorm.workers_pool.exec_in_new_process' found in sys.modules after import of package 'petastorm.workers_pool', but prior to execution of 'petastorm.workers_pool.exec_in_new_process'; this may result in unpredictable behaviour | |
[1,1]<stderr>: warn(RuntimeWarning(msg)) | |
[1,2]<stderr>:/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py:125: RuntimeWarning: 'petastorm.workers_pool.exec_in_new_process' found in sys.modules after import of package 'petastorm.workers_pool', but prior to execution of 'petastorm.workers_pool.exec_in_new_process'; this may result in unpredictable behaviour | |
[1,2]<stderr>: warn(RuntimeWarning(msg)) | |
[1,1]<stderr>:/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py:125: RuntimeWarning: 'petastorm.workers_pool.exec_in_new_process' found in sys.modules after import of package 'petastorm.workers_pool', but prior to execution of 'petastorm.workers_pool.exec_in_new_process'; this may result in unpredictable behaviour | |
[1,1]<stderr>: warn(RuntimeWarning(msg)) | |
[1,0]<stderr>:/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py:125: RuntimeWarning: 'petastorm.workers_pool.exec_in_new_process' found in sys.modules after import of package 'petastorm.workers_pool', but prior to execution of 'petastorm.workers_pool.exec_in_new_process'; this may result in unpredictable behaviour | |
[1,0]<stderr>: warn(RuntimeWarning(msg)) | |
[1,2]<stderr>:/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py:125: RuntimeWarning: 'petastorm.workers_pool.exec_in_new_process' found in sys.modules after import of package 'petastorm.workers_pool', but prior to execution of 'petastorm.workers_pool.exec_in_new_process'; this may result in unpredictable behaviour | |
[1,2]<stderr>: warn(RuntimeWarning(msg)) | |
[1,0]<stderr>:/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py:125: RuntimeWarning: 'petastorm.workers_pool.exec_in_new_process' found in sys.modules after import of package 'petastorm.workers_pool', but prior to execution of 'petastorm.workers_pool.exec_in_new_process'; this may result in unpredictable behaviour | |
[1,0]<stderr>: warn(RuntimeWarning(msg)) | |
[1,3]<stderr>:/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py:125: RuntimeWarning: 'petastorm.workers_pool.exec_in_new_process' found in sys.modules after import of package 'petastorm.workers_pool', but prior to execution of 'petastorm.workers_pool.exec_in_new_process'; this may result in unpredictable behaviour | |
[1,3]<stderr>: warn(RuntimeWarning(msg)) | |
[1,3]<stderr>:/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py:125: RuntimeWarning: 'petastorm.workers_pool.exec_in_new_process' found in sys.modules after import of package 'petastorm.workers_pool', but prior to execution of 'petastorm.workers_pool.exec_in_new_process'; this may result in unpredictable behaviour | |
[1,3]<stderr>: warn(RuntimeWarning(msg)) | |
[1,0]<stderr>:WARNING:tensorflow:From /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/util.py:68: unbatch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version. | |
[1,0]<stderr>:Instructions for updating: | |
[1,0]<stderr>:Use `tf.data.Dataset.unbatch()`. | |
[1,1]<stderr>:WARNING:tensorflow:From /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/util.py:68: unbatch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version. | |
[1,1]<stderr>:Instructions for updating: | |
[1,1]<stderr>:Use `tf.data.Dataset.unbatch()`. | |
[1,2]<stderr>:WARNING:tensorflow:From /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/util.py:68: unbatch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version. | |
[1,2]<stderr>:Instructions for updating: | |
[1,2]<stderr>:Use `tf.data.Dataset.unbatch()`. | |
[1,3]<stderr>:WARNING:tensorflow:From /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/util.py:68: unbatch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version. | |
[1,3]<stderr>:Instructions for updating: | |
[1,3]<stderr>:Use `tf.data.Dataset.unbatch()`. | |
[1,0]<stdout>:Train for 106 steps | |
[1,2]<stderr>:WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.084984). Check your callbacks. | |
[1,3]<stderr>:WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.048007). Check your callbacks. | |
[1,1]<stderr>:WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.048115). Check your callbacks. | |
[1,0]<stderr>:2020-06-19 13:44:34.022261: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started. | |
[1,0]<stderr>:2020-06-19 13:44:34.168972: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcupti.so.10.1'; dlerror: libcupti.so.10.1: cannot open shared object file: No such file or directory | |
[1,0]<stderr>:2020-06-19 13:44:34.169496: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1307] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found. | |
[1,0]<stderr>:2020-06-19 13:44:34.169791: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1346] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI could not be loaded or symbol could not be found. | |
[1,0]<stderr>:WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.262600). Check your callbacks.[1,0]<stderr>: | |
[1,0]<stderr>:2020-06-19 13:44:34.380550: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1329] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI could not be loaded or symbol could not be found. | |
[1,0]<stderr>:2020-06-19 13:44:34.385397: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:88] GpuTracer has collected 0 callback api events and 0 activity events. | |
[1,3]<stderr>:Traceback (most recent call last): | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py", line 193, in _run_module_as_main | |
[1,3]<stderr>: [1,3]<stderr>:"__main__", mod_spec) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py", line 85, in _run_code | |
[1,3]<stderr>: exec(code, run_globals) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 57, in <module> | |
[1,3]<stderr>: main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2])) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 50, in main | |
[1,3]<stderr>: task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK') | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/__init__.py", line 53, in task_exec | |
[1,3]<stderr>: result = fn(*args, **kwargs) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/remote.py", line 214, in train | |
[1,3]<stderr>: validation_steps, callbacks, verbose) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/util.py", line 47, in fn | |
[1,3]<stderr>: epochs=epochs) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit | |
[1,3]<stderr>: use_multiprocessing=use_multiprocessing) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit | |
[1,3]<stderr>: [1,3]<stderr>:total_epochs=epochs) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch | |
[1,3]<stderr>: batch_outs = execution_function(iterator) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function | |
[1,3]<stderr>: distributed_function(input_fn)) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__ | |
[1,3]<stderr>: result = self._call(*args, **kwds) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call | |
[1,3]<stderr>: self._initialize(args, kwds, add_initializers_to=initializers) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize | |
[1,3]<stderr>: *args, **kwds)) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected | |
[1,3]<stderr>: graph_function, _, _ = self._maybe_define_function(args, kwargs) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function | |
[1,3]<stderr>: graph_function = self._create_graph_function(args, kwargs) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function | |
[1,3]<stderr>: capture_by_value=self._capture_by_value), | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func | |
[1,1]<stderr>:Traceback (most recent call last): | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py", line 193, in _run_module_as_main | |
[1,1]<stderr>: "__main__", mod_spec) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py", line 85, in _run_code | |
[1,1]<stderr>: exec(code, run_globals)[1,1]<stderr>: | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 57, in <module> | |
[1,1]<stderr>: main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2])) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 50, in main | |
[1,1]<stderr>: task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK') | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/__init__.py", line 53, in task_exec | |
[1,1]<stderr>: result = fn(*args, **kwargs) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/remote.py", line 214, in train | |
[1,1]<stderr>: validation_steps, callbacks, verbose) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/util.py", line 47, in fn | |
[1,1]<stderr>: epochs=epochs) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit | |
[1,3]<stderr>: func_outputs = python_func(*func_args, **func_kwargs) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn | |
[1,1]<stderr>: use_multiprocessing=use_multiprocessing) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit | |
[1,3]<stderr>: [1,3]<stderr>:return weak_wrapped_fn().__wrapped__(*args, **kwds) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 85, in distributed_function | |
[1,3]<stderr>: per_replica_function, args=args) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 763, in experimental_run_v2 | |
[1,1]<stderr>: total_epochs=epochs) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch | |
[1,1]<stderr>: batch_outs = execution_function(iterator) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function | |
[1,1]<stderr>: distributed_function(input_fn)) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__ | |
[1,1]<stderr>: result = self._call(*args, **kwds) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call | |
[1,3]<stderr>: return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1819, in call_for_each_replica | |
[1,1]<stderr>: self._initialize(args, kwds, add_initializers_to=initializers) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize | |
[1,3]<stderr>: return self._call_for_each_replica(fn, args, kwargs) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2164, in _call_for_each_replica | |
[1,3]<stderr>: return fn(*args, **kwargs) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper | |
[1,3]<stderr>: return func(*args, **kwargs) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 433, in train_on_batch | |
[1,3]<stderr>: output_loss_metrics=model._output_loss_metrics) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 312, in train_on_batch | |
[1,1]<stderr>: *args, **kwds)) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected | |
[1,3]<stderr>: output_loss_metrics=output_loss_metrics)) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 273, in _process_single_batch | |
[1,1]<stderr>: graph_function, _, _ = self._maybe_define_function(args, kwargs) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function | |
[1,3]<stderr>: model.optimizer.apply_gradients(zip(grads, trainable_weights)) | |
[1,3]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/_keras/__init__.py", line 73, in apply_gradients | |
[1,1]<stderr>: graph_function = self._create_graph_function(args, kwargs) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function | |
[1,1]<stderr>: capture_by_value=self._capture_by_value), | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func | |
[1,3]<stderr>: raise Exception('`apply_gradients()` was called without a call to ' | |
[1,3]<stderr>:Exception: `apply_gradients()` was called without a call to `get_gradients()` or `_aggregate_gradients`. If you're using TensorFlow 2.0, please specify `experimental_run_tf_function=False` in `compile()`. | |
[1,1]<stderr>: func_outputs = python_func(*func_args, **func_kwargs) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn | |
[1,1]<stderr>: return weak_wrapped_fn().__wrapped__(*args, **kwds) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 85, in distributed_function | |
[1,1]<stderr>: per_replica_function, args=args) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 763, in experimental_run_v2 | |
[1,1]<stderr>: return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1819, in call_for_each_replica | |
[1,1]<stderr>: return self._call_for_each_replica(fn, args, kwargs) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2164, in _call_for_each_replica | |
[1,1]<stderr>: return fn(*args, **kwargs) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper | |
[1,1]<stderr>: return func(*args, **kwargs) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 433, in train_on_batch | |
[1,1]<stderr>: output_loss_metrics=model._output_loss_metrics) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 312, in train_on_batch | |
[1,1]<stderr>: output_loss_metrics=output_loss_metrics)) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 273, in _process_single_batch | |
[1,1]<stderr>: model.optimizer.apply_gradients(zip(grads, trainable_weights)) | |
[1,1]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/_keras/__init__.py", line 73, in apply_gradients | |
[1,1]<stderr>: raise Exception('`apply_gradients()` was called without a call to ' | |
[1,1]<stderr>:Exception: `apply_gradients()` was called without a call to `get_gradients()` or `_aggregate_gradients`. If you're using TensorFlow 2.0, please specify `experimental_run_tf_function=False` in `compile()`. | |
[1,2]<stderr>:Traceback (most recent call last): | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py", line 193, in _run_module_as_main | |
[1,2]<stderr>: "__main__", mod_spec) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py", line 85, in _run_code | |
[1,2]<stderr>: exec(code, run_globals) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 57, in <module> | |
[1,2]<stderr>: main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2])) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 50, in main | |
[1,2]<stderr>: task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK') | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/__init__.py", line 53, in task_exec | |
[1,2]<stderr>: result = fn(*args, **kwargs) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/remote.py", line 214, in train | |
[1,2]<stderr>: validation_steps, callbacks, verbose) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/util.py", line 47, in fn | |
[1,2]<stderr>: epochs=epochs) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit | |
[1,2]<stderr>: use_multiprocessing=use_multiprocessing) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit | |
[1,2]<stderr>: total_epochs=epochs) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch | |
[1,2]<stderr>: batch_outs = execution_function(iterator) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function | |
[1,2]<stderr>: distributed_function(input_fn)) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__ | |
[1,2]<stderr>: result = self._call(*args, **kwds) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call | |
[1,2]<stderr>: self._initialize(args, kwds, add_initializers_to=initializers) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize | |
[1,2]<stderr>: *args, **kwds)) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected | |
[1,2]<stderr>: graph_function, _, _ = self._maybe_define_function(args, kwargs) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function | |
[1,2]<stderr>: graph_function = self._create_graph_function(args, kwargs) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function | |
[1,2]<stderr>: capture_by_value=self._capture_by_value), | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func | |
[1,2]<stderr>: func_outputs = python_func(*func_args, **func_kwargs) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn | |
[1,2]<stderr>: return weak_wrapped_fn().__wrapped__(*args, **kwds) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 85, in distributed_function | |
[1,2]<stderr>: per_replica_function, args=args) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 763, in experimental_run_v2 | |
[1,2]<stderr>: return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1819, in call_for_each_replica | |
[1,2]<stderr>: return self._call_for_each_replica(fn, args, kwargs) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2164, in _call_for_each_replica | |
[1,2]<stderr>: return fn(*args, **kwargs) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper | |
[1,2]<stderr>: return func(*args, **kwargs) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 433, in train_on_batch | |
[1,2]<stderr>: output_loss_metrics=model._output_loss_metrics) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 312, in train_on_batch | |
[1,2]<stderr>: output_loss_metrics=output_loss_metrics)) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 273, in _process_single_batch | |
[1,2]<stderr>: model.optimizer.apply_gradients(zip(grads, trainable_weights)) | |
[1,2]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/_keras/__init__.py", line 73, in apply_gradients | |
[1,2]<stderr>: raise Exception('`apply_gradients()` was called without a call to ' | |
[1,2]<stderr>:Exception: `apply_gradients()` was called without a call to `get_gradients()` or `_aggregate_gradients`. If you're using TensorFlow 2.0, please specify `experimental_run_tf_function=False` in `compile()`. | |
[1,0]<stderr>:Traceback (most recent call last): | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py", line 193, in _run_module_as_main | |
[1,0]<stderr>: "__main__", mod_spec) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/runpy.py", line 85, in _run_code | |
[1,0]<stderr>: exec(code, run_globals) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 57, in <module> | |
[1,0]<stderr>: main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2])) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 50, in main | |
[1,0]<stderr>: task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK') | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/task/__init__.py", line 53, in task_exec | |
[1,0]<stderr>: result = fn(*args, **kwargs) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/remote.py", line 214, in train | |
[1,0]<stderr>: validation_steps, callbacks, verbose) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/util.py", line 47, in fn | |
[1,0]<stderr>: epochs=epochs) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit | |
[1,0]<stderr>: use_multiprocessing=use_multiprocessing) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit | |
[1,0]<stderr>: total_epochs=epochs) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch | |
[1,0]<stderr>: batch_outs = execution_function(iterator) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function | |
[1,0]<stderr>: distributed_function(input_fn)) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__ | |
[1,0]<stderr>: result = self._call(*args, **kwds) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call | |
[1,0]<stderr>: self._initialize(args, kwds, add_initializers_to=initializers) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize | |
[1,0]<stderr>: *args, **kwds)) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected | |
[1,0]<stderr>: graph_function, _, _ = self._maybe_define_function(args, kwargs) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function | |
[1,0]<stderr>: graph_function = self._create_graph_function(args, kwargs) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function | |
[1,0]<stderr>: capture_by_value=self._capture_by_value), | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func | |
[1,0]<stderr>: func_outputs = python_func(*func_args, **func_kwargs) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn | |
[1,0]<stderr>: return weak_wrapped_fn().__wrapped__(*args, **kwds) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 85, in distributed_function | |
[1,0]<stderr>: per_replica_function, args=args) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 763, in experimental_run_v2 | |
[1,0]<stderr>: return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1819, in call_for_each_replica | |
[1,0]<stderr>: return self._call_for_each_replica(fn, args, kwargs) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2164, in _call_for_each_replica | |
[1,0]<stderr>: return fn(*args, **kwargs) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper | |
[1,0]<stderr>: return func(*args, **kwargs) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 433, in train_on_batch | |
[1,0]<stderr>: output_loss_metrics=model._output_loss_metrics) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 312, in train_on_batch | |
[1,0]<stderr>: output_loss_metrics=output_loss_metrics)) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 273, in _process_single_batch | |
[1,0]<stderr>: model.optimizer.apply_gradients(zip(grads, trainable_weights)) | |
[1,0]<stderr>: File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/_keras/__init__.py", line 73, in apply_gradients | |
[1,0]<stderr>: raise Exception('`apply_gradients()` was called without a call to ' | |
[1,0]<stderr>:Exception: `apply_gradients()` was called without a call to `get_gradients()` or `_aggregate_gradients`. If you're using TensorFlow 2.0, please specify `experimental_run_tf_function=False` in `compile()`. | |
-------------------------------------------------------------------------- | |
Primary job terminated normally, but 1 process returned | |
a non-zero exit code. Per user-direction, the job has been aborted. | |
-------------------------------------------------------------------------- | |
-------------------------------------------------------------------------- | |
mpirun detected that one or more processes exited with non-zero status, thus causing | |
the job to be terminated. The first process to do so was: | |
Process name: [[58063,1],2] | |
Exit code: 1 | |
-------------------------------------------------------------------------- | |
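Note: the exception raised on every worker above names its own workaround. In TF 2.0/2.1, passing experimental_run_tf_function=False to compile() makes Keras fall back to the execution path in which Horovod's DistributedOptimizer override of get_gradients() is actually invoked. The sketch below is a minimal illustration of that flag in plain Keras under those assumptions; the model, optimizer and loss are placeholders, not the ones from keras_spark_mnist.py, and when training runs through horovod.spark.keras the flag has to reach whatever compile() call Horovod performs on the workers. The driver-side "RuntimeError: mpirun failed with exit code 1" further down is this same worker failure propagating back through horovod.spark.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # on each worker; horovod.spark normally calls this inside the training function

# Placeholder MNIST-style model; the real script defines its own.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Wrap the optimizer so gradients are averaged across workers.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adadelta(1.0))

model.compile(
    optimizer=opt,
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
    # Workaround named in the exception text (the kwarg exists in TF 2.0/2.1 only):
    experimental_run_tf_function=False,
)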
20/06/19 13:44:35 INFO DAGScheduler: Asked to cancel job group horovod.spark.run.0 | |
20/06/19 13:44:35 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 10) in 125641 ms on 192.168.198.131 (executor 0) (1/4) | |
20/06/19 13:44:35 INFO TaskSchedulerImpl: Cancelling stage 3 | |
20/06/19 13:44:35 INFO TaskSchedulerImpl: Killing all running tasks in stage 3: Stage cancelled | |
20/06/19 13:44:35 INFO TaskSchedulerImpl: Stage 3 was cancelled | |
20/06/19 13:44:35 INFO DAGScheduler: ResultStage 3 (collect at /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py:106) failed in 125.687 s due to Job 3 cancelled part of cancelled job group horovod.spark.run.0 | |
20/06/19 13:44:35 INFO DAGScheduler: Job 3 failed: collect at /home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py:106, took 125.702862 s | |
Exception in thread Thread-3: | |
Traceback (most recent call last): | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/threading.py", line 917, in _bootstrap_inner | |
self.run() | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/threading.py", line 865, in run | |
self._target(*self._args, **self._kwargs) | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py", line 106, in run_spark | |
result = procs.mapPartitionsWithIndex(_make_mapper(driver.addresses(), settings, use_gloo)).collect() | |
File "/home/orwa/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect | |
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) | |
File "/home/orwa/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ | |
answer, self.gateway_client, self.target_id, self.name) | |
File "/home/orwa/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco | |
return f(*a, **kw) | |
File "/home/orwa/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value | |
format(target_id, ".", name), value) | |
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. | |
: org.apache.spark.SparkException: Job 3 cancelled part of cancelled job group horovod.spark.run.0 | |
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891) | |
at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1826) | |
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleJobGroupCancelled$1.apply$mcVI$sp(DAGScheduler.scala:907) | |
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleJobGroupCancelled$1.apply(DAGScheduler.scala:907) | |
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleJobGroupCancelled$1.apply(DAGScheduler.scala:907) | |
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) | |
at org.apache.spark.scheduler.DAGScheduler.handleJobGroupCancelled(DAGScheduler.scala:907) | |
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2081) | |
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061) | |
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050) | |
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) | |
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738) | |
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) | |
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082) | |
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101) | |
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126) | |
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990) | |
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) | |
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) | |
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385) | |
at org.apache.spark.rdd.RDD.collect(RDD.scala:989) | |
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166) | |
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) | |
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) | |
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) | |
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) | |
at java.lang.reflect.Method.invoke(Method.java:498) | |
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) | |
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) | |
at py4j.Gateway.invoke(Gateway.java:282) | |
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) | |
at py4j.commands.CallCommand.execute(CallCommand.java:79) | |
at py4j.GatewayConnection.run(GatewayConnection.java:238) | |
at java.lang.Thread.run(Thread.java:748) | |
Traceback (most recent call last): | |
File "/home/orwa/spark_files/keras_spark_mnist.py", line 115, in <module> | |
keras_model = keras_estimator.fit(train_df).setOutputCols(['label_prob']) | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/common/estimator.py", line 37, in fit | |
return super(HorovodEstimator, self).fit(df, params) | |
File "/home/orwa/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 132, in fit | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/common/estimator.py", line 82, in _fit | |
backend, train_rows, val_rows, metadata, avg_row_size, dataset_idx) | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/keras/estimator.py", line 287, in _fit_on_prepared_data | |
env=env) | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/common/backend.py", line 87, in run | |
**self._kwargs) | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py", line 221, in run | |
_launch_job(use_mpi, use_gloo, settings, driver, env, stdout, stderr) | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py", line 128, in _launch_job | |
settings.verbose) | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/run/runner.py", line 692, in run_controller | |
mpi_run() | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/runner.py", line 126, in <lambda> | |
use_mpi, lambda: mpi_run(settings, nics, driver, env, stdout, stderr), | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/spark/mpi_run.py", line 55, in mpi_run | |
hr_mpi_run(settings, nics, env, command, stdout=stdout, stderr=stderr) | |
File "/home/orwa/anaconda3/envs/spark1/lib/python3.7/site-packages/horovod/run/mpi_run.py", line 204, in mpi_run | |
raise RuntimeError("mpirun failed with exit code {exit_code}".format(exit_code=exit_code)) | |
RuntimeError: mpirun failed with exit code 1 | |
20/06/19 13:44:36 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 9, 192.168.198.131, executor 0): TaskKilled (Stage cancelled) | |
20/06/19 13:44:36 WARN TaskSetManager: Lost task 3.0 in stage 3.0 (TID 12, 192.168.198.131, executor 0): TaskKilled (Stage cancelled) | |
20/06/19 13:44:36 WARN TaskSetManager: Lost task 2.0 in stage 3.0 (TID 11, 192.168.198.131, executor 0): TaskKilled (Stage cancelled) | |
20/06/19 13:44:36 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool | |
20/06/19 13:44:36 INFO SparkContext: Invoking stop() from shutdown hook | |
20/06/19 13:44:36 INFO SparkUI: Stopped Spark web UI at http://192.168.198.131:4041 | |
20/06/19 13:44:37 INFO StandaloneSchedulerBackend: Shutting down all executors | |
20/06/19 13:44:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down | |
20/06/19 13:44:37 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! | |
20/06/19 13:44:37 INFO MemoryStore: MemoryStore cleared | |
20/06/19 13:44:37 INFO BlockManager: BlockManager stopped | |
20/06/19 13:44:37 INFO BlockManagerMaster: BlockManagerMaster stopped | |
20/06/19 13:44:37 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! | |
20/06/19 13:44:37 INFO SparkContext: Successfully stopped SparkContext | |
20/06/19 13:44:37 INFO ShutdownHookManager: Shutdown hook called | |
20/06/19 13:44:37 INFO ShutdownHookManager: Deleting directory /tmp/spark-c5d87ecf-1161-4769-9d25-d4edd0e3a926/pyspark-847a0aa3-58b2-4300-a2f4-8e305bc10b1e | |
20/06/19 13:44:37 INFO ShutdownHookManager: Deleting directory /tmp/spark-c5d87ecf-1161-4769-9d25-d4edd0e3a926 | |
20/06/19 13:44:37 INFO ShutdownHookManager: Deleting directory /tmp/spark-fe0450af-7caa-480c-92e7-54aeb671088f |