@orwa-te
orwa-te / train_fn() function
Last active November 15, 2020 05:17
Function submitted to Horovod Runner
def train_fn():
    # Make sure pyarrow is referenced before anything else to avoid a segfault due to a
    # conflict with TensorFlow libraries. Use the `pa` package reference to ensure it is
    # loaded before functions like `deserialize_model`, which are implemented at the top level.
    # See https://jira.apache.org/jira/browse/ARROW-3346
    pa
    # import atexit
    import horovod.tensorflow.keras as hvd
    import os
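The import-ordering trick above can be demonstrated with stdlib modules alone (a sketch; `json` stands in for `pyarrow`, which would have to be installed on every worker):

```python
import sys

# Stand-in for `import pyarrow as pa`: whichever native library is imported
# first gets its shared symbols loaded first, which is why the gist
# references `pa` before TensorFlow is imported inside train_fn.
import json as pa

def train_fn():
    pa  # no-op reference: guarantees the module above is already loaded
    # heavyweight imports (e.g. horovod.tensorflow.keras) would follow here
    return "json" in sys.modules

print(train_fn())  # True
```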
orwa-te / u_net_train.py
Created August 20, 2020 18:50
TFOS code to train my model on Spark
# Adapted from: https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_keras
from __future__ import absolute_import, division, print_function, unicode_literals

def main_fun(args, ctx):
    import tensorflow as tf
    import numpy as np
    import imagecodecs
orwa-te / u_net_train.py
Created August 14, 2020 14:13
Code snippet of a PySpark program running on 2 worker nodes
................
................
def main_fun(args, ctx):
    batch_size = 32
    print(len(trainx))  # -----> 672
    # 672/32 = 21
    # Create distribute strategy
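The arithmetic in the comments can be checked directly (a sketch; the sample count of 672 is taken from the print output above):

```python
# 672 training samples split into batches of 32 gives 21 full steps per epoch.
num_samples = 672   # from print(len(trainx)) above
batch_size = 32
steps_per_epoch = num_samples // batch_size
print(steps_per_epoch)  # 21
```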
orwa-te / stderr
Created August 11, 2020 07:34
Worker node logs in a 2-node Spark cluster
Spark Executor Command: "/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java" "-cp" "/home/node/spark/conf/:/home/node/spark/jars/*:/home/node/hadoop3.1.1/etc/hadoop/" "-Xmx9216M" "-Dspark.driver.port=38237" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@node:38237" "--executor-id" "1" "--hostname" "172.16.44.121" "--cores" "4" "--app-id" "app-20200811034057-0007" "--worker-url" "spark://Worker@172.16.44.121:40515"
========================================
2020-08-11 03:40:58,546 INFO executor.CoarseGrainedExecutorBackend: Started daemon with process name: 48258@node
2020-08-11 03:40:58,550 INFO util.SignalUtils: Registered signal handler for TERM
2020-08-11 03:40:58,551 INFO util.SignalUtils: Registered signal handler for HUP
2020-08-11 03:40:58,551 INFO util.SignalUtils: Registered signal handler for INT
2020-08-11 03:40:59,438 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
orwa-te / gist:4887067323fb32bc5c2e62e29d4afc81
Created July 10, 2020 17:28
Snippet code for building and training U-Net
def unet(shape=(128, 128, 4)):
    # Left side of the U-Net
    inputs = Input(shape)
    conv1 = Conv2D(64, 3, activation='relu', padding='same', kernel_initializer='random_normal')(inputs)
    conv1 = Conv2D(64, 3, activation='relu', padding='same', kernel_initializer='random_normal')(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)
    conv2 = Conv2D(128, 3, activation='relu', padding='same', kernel_initializer='random_normal')(pool1)
    conv2 = Conv2D(128, 3, activation='relu', padding='same', kernel_initializer='random_normal')(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 2))(conv2)
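Since every Conv2D above uses 'same' padding, only the 2x2 max-pools change the spatial size, so the encoder shapes can be sanity-checked with plain arithmetic (a sketch covering the two levels shown):

```python
# Feature-map spatial size after each 2x2 max-pool in the U-Net encoder.
# 'same' padding means the Conv2D layers leave height/width unchanged.
height, width = 128, 128
for level in (1, 2):
    height, width = height // 2, width // 2
    print(f"after pool{level}: ({height}, {width})")
# after pool1: (64, 64)
# after pool2: (32, 32)
```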
orwa-te / stack trace
Created July 10, 2020 17:19
NotImplementedError stack trace for model.fit() with tf.distribute.MirroredStrategy()
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-26-55bddb4c82a6> in <module>
1 # dist_model.summary()
----> 2 history = dist_model.fit(trainx, trainy_hot, epochs=1, validation_data = (testx, testy_hot),batch_size=64, verbose=1)
~\Anaconda3\envs\open_cv\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
1211 else:
1212 fit_inputs = x + y + sample_weights
-> 1213 self._make_train_function()
orwa-te / gist:0de487552828f721ac1f45d63d27e75f
Created June 29, 2020 22:34
Stack trace of error when setMaster(<url>) is set in Jupyter Notebook code
Spark Executor Command: "/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java" "-cp" "/home/orwa/spark/conf/:/home/orwa/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=37501" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@master:37501" "--executor-id" "0" "--hostname" "192.168.198.131" "--cores" "2" "--app-id" "app-20200630012803-0001" "--worker-url" "spark://Worker@192.168.198.131:37685"
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/06/30 01:28:39 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 8519@orwa-virtual-machine
20/06/30 01:28:39 INFO SignalUtils: Registered signal handler for TERM
20/06/30 01:28:39 INFO SignalUtils: Registered signal handler for HUP
20/06/30 01:28:39 INFO SignalUtils: Registered signal handler for INT
20/06/30 01:28:39 WARN Utils: Your hostname, orwa-virtual-machine resolves to a loopback address: 127.0.1.1; using 192.168.198.131 inste
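The loopback-address warning at the end of this log can usually be silenced by binding Spark to an explicit address in `conf/spark-env.sh` (a config sketch; the IP shown is the one the log itself fell back to):

```shell
# conf/spark-env.sh -- bind Spark to a specific address instead of letting
# it resolve the hostname to the loopback 127.0.1.1.
export SPARK_LOCAL_IP=192.168.198.131
```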
orwa-te / stderr file
Created June 24, 2020 10:30
Error logs when executing mnist_tf.py from a Jupyter notebook
Spark Executor Command: "/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java" "-cp" "/home/orwa/spark/conf/:/home/orwa/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=36393" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@192.168.198.131:36393" "--executor-id" "0" "--hostname" "192.168.198.131" "--cores" "2" "--app-id" "app-20200624132821-0010" "--worker-url" "spark://Worker@192.168.198.131:36489"
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/06/24 13:28:22 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 6648@orwa-virtual-machine
20/06/24 13:28:22 INFO SignalUtils: Registered signal handler for TERM
20/06/24 13:28:22 INFO SignalUtils: Registered signal handler for HUP
20/06/24 13:28:22 INFO SignalUtils: Registered signal handler for INT
20/06/24 13:28:22 WARN Utils: Your hostname, orwa-virtual-machine resolves to a loopback address: 127.0.1.1; using 192.168.198.
orwa-te / output logs
Created June 19, 2020 10:46
Output logs when the PYSPARK_PYTHON env variable is set inside the 'spark-env.sh' file
20/06/19 13:41:30 WARN Utils: Your hostname, orwa-virtual-machine resolves to a loopback address: 127.0.1.1; using 192.168.198.131 instead (on interface ens33)
20/06/19 13:41:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/06/19 13:41:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-06-19 13:41:33.862495: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-06-19 13:41:33.862706: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-06-19 13:41:33.862730: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with Ten
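The variable referred to here is presumably Spark's standard `PYSPARK_PYTHON`, which is set in `spark-env.sh` like this (a config sketch; the interpreter path is an assumed example and should point at whichever Python has TensorFlow installed):

```shell
# conf/spark-env.sh -- make every Spark worker use the same Python
# interpreter (the path below is an assumed example, not from the logs).
export PYSPARK_PYTHON=/usr/bin/python3
```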
orwa-te / output logs
Created June 19, 2020 10:33
Output logs when the PYSPARK_PYTHON env variable is removed from 'spark-env.sh'
20/06/19 13:30:06 WARN Utils: Your hostname, orwa-virtual-machine resolves to a loopback address: 127.0.1.1; using 192.168.198.131 instead (on interface ens33)
20/06/19 13:30:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/06/19 13:30:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-06-19 13:30:10.735978: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-06-19 13:30:10.736517: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-06-19 13:30:10.736634: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with Te