In this tutorial we'll bring the TensorFlow 2 quickstart to Valohai, taking advantage of Valohai versioned experiments, data inputs, outputs and exporting metadata to easily track & compare your models.
You can use Valohai through the web UI, with our command-line tools, or by calling the APIs from your pipelines. This tutorial focuses on the Valohai command-line tools.
Start by creating an account at http://app.valohai.com.
Once you've logged in, click the "Create new project" button.
- Project name: Choose your project name, for example valohai-tf2-quickstart
- Description: TensorFlow 2 Quickstart with Valohai
- Ownership: Select your username. We'll talk more about ownership later, but this could be for example your organization or team, with whom you'd like to share the project.
You can also create the project from the command line or through the APIs.
Before continuing, make sure you have Python 3 (and pip3) installed. Run python3 --version on the command line to check your Python version, and use the Windows installer or Homebrew (brew install python3) to install Python 3 if needed.
Create a new folder (valohai-tf2-quickstart) where we will store all the files related to the quickstart. This way it'll be easier to keep track of the files and remove them later, when you no longer need them.
Navigate inside the folder and create an isolated Python virtual environment, where we will install the valohai-cli tools and other libraries needed.
python3 -m virtualenv .venv
source .venv/bin/activate
- Install Valohai CLI in your virtual environment with
pip install valohai-cli
- Log in to the Valohai command-line tools with
vh login
Hooray, you now have a virtual environment with valohai-cli tools installed!
Next, let's run our first experiment.
We'll start by creating a simple Hello World, to ensure that everything works smoothly. Create a new file called train.py with your favorite editor (VS Code, Emacs, Nano, Vim etc.) and inside it add print('hello world!'). Make sure this file is saved in the folder you previously created.
Now let's move back to the command line.
We'll initialize a Valohai project in your folder, select which command to run and what container to use.
Inside the valohai-tf2-quickstart folder, initialize a new Valohai project with vh init
- First confirm that you're in the right folder. Type y and press enter to continue.
- Valohai has looked through the folder and found a file called train.py, which we just created. You can confirm that this is the command you want to run by typing 1 and hitting enter.
  - If you had multiple files in this folder, it would give you options to select from. Because our folder has only one file that Valohai expects to be the command to run, it suggests just python train.py. You could also write a custom command here and hit enter, if none of the options are correct.
- Confirm your selection with y
- Now it's asking you to select a Docker image to use. As you can see, it already suggests some images, but you can always use another image.
  - You can also create your own Docker images that contain all the libraries and tools you need for your experiments.
  - Valohai can access public Docker images, and it can also pull private images hosted on Docker Hub or Azure Container Registry if you provide the right credentials in your organisation settings.
  - Go to Docker Hub and find the newest TensorFlow image... or if you're feeling lazy, just select 5 (tensorflow/tensorflow:1.13.1-gpu-py3)
- Confirm your selection with y
Now you'll see a preview of the valohai.yaml file. This file contains all the configuration needed to run your experiments on Valohai. You'll see the information we've provided so far, and some commented lines... we'll come back to those later on.
This file is saved as valohai.yaml. You can always go and edit the file, make changes and upload them to Valohai.
- Select y to confirm the generated file
- Now Valohai asks which project this folder should be associated with. You can create a new project (C) or link to an existing one (L). As we already created our project in the web UI, we can select L
- Next you'll see a list of your projects. Select the one you created and tadaa, your folder is linked to the project!
To run your project you can write vh exec run --adhoc execute, which translates to "Valohai, start a new execution of the step 'execute' and run it as an ad-hoc run."
- --adhoc means that you don't have a code repository (like GitHub) but rather want to upload the files from this folder and run them on Valohai.
- Where does the name execute come from? It comes from the step name in our valohai.yaml. Instead of making us write the whole name Execute python train.py, it accepts execute, as there is only one step starting with execute.
- You can also include --watch to get the execution logs to your command line.
A good idea is to open your valohai.yaml file and edit the step name Execute python train.py to be more descriptive of what we're doing. You might want to call it Train MNIST model, for example. Then you would run your experiment with
vh exec run --adhoc train
You can now see your execution on app.valohai.com under your project.
Look at you, running experiments on Valohai. Congratulations! Pat yourself on the back for a job well done.
You'll notice that these executions ran on a Microsoft Azure NC6 machine. That's where executions run by default. Once you start using Valohai, we can set up the executions to run on your machines on AWS, GCP or some other datacenter (or a set of machines you have locally).
You can also have multiple environments set up in different clouds, so you can define per project (or execution) which environment you want to run on.
The YAML file that was generated for us contains the configuration of our Valohai project. Inside it you'll find a single step called Train MNIST model, with its Docker image and the command to run (python train.py).
In Valohai, you create steps for different operations or workload types. You could have steps for data anonymization, feature extraction, training your model, batch inference, model evaluation etc. A step is essentially what you want to run on Valohai.
You can read more about the valohai.yaml config file on our docs.
Remember that YAML is a bit fussy about indentation, so when you edit your file make sure you have the right amount of spaces to structure the contents.
One advantage of Valohai is reproducibility: you can go back to any execution you've run and see the input data, parameters, metadata, code version and the outputs.
In this tutorial we'll primarily be doing --adhoc runs, which skip the need for pushing a new version of your code to the repository and then fetching it to Valohai. However, ad-hoc runs will not keep track of the code versions.
Make sure that you create a repository (GitHub, GitLab, Bitbucket etc.) where you store your code. Then, just like you normally would, commit and push new changes there. Every time you run an execution on Valohai, it will keep track of which commit was used, so you can easily reference and access it (instead of playing around with offline file versions).
So, maybe create that repo now? Or feel free to proceed with the tutorial as is. All the steps in this tutorial use an --adhoc execution, so you're not required to have a repository to complete the tutorial, but it's definitely a good practice. And remember: instead of running with --adhoc, use your git commands to push a new version and then fetch it in Valohai to run an execution with the new code.
Now that we've successfully run a Hello World on Valohai, we can proceed to the next stage and implement the TensorFlow 2 Quickstart.
This quickstart uses the MNIST dataset and trains a neural network to classify images. Long story short, we're looking to input the MNIST dataset, train the model to recognize the handwritten digits and evaluate the accuracy of the model.
Head on over to the TensorFlow 2 Quickstart tutorial and replace the contents of your train.py with the code from the tutorial.
Once you're done, you can run your new training on Valohai with vh exec run --adhoc train (or whatever the name of your step is).
Run vh exec watch 2 to see the logs of your second execution, or go to app.valohai.com to see your execution and its logs.
You'll see in the logs that the sample runs through 5 epochs to train the model, and it's showing the loss and accuracy of each epoch in the logs. We'll talk about visualising these metrics in the next part of this tutorial.
So you just ran your model, trained a beautiful handwritten digit classifier but the model is nowhere to be found.
To output a model in Valohai, or actually, to output files in Valohai you'll need to write them to the output directory. Once the execution is done, Valohai will save the outputted files to a cloud storage.
Valohai will upload all the files from the outputs directory at the end of an execution, even if your code crashes or the execution is stopped.
You'll need to get the location of the output directory and save your file there.
First add import os to your train.py, so we can access OS functionality like getting paths and environment variables.
Then create a new variable to store the Valohai output path:
# Get the output path from the Valohai machine's environment variables
output_path = os.getenv('VH_OUTPUTS_DIR')
At the bottom of your file, call model.save to save the model's architecture, weights and training configuration in a single file, as described in the TensorFlow documentation.
# Save our file to that directory as model.h5
model.save(os.path.join(output_path, 'model.h5'))
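When you run train.py on your own machine, VH_OUTPUTS_DIR is not set, so os.getenv returns None and os.path.join fails. A minimal sketch of one way to handle that, assuming a hypothetical helper called get_output_file and a local fallback folder called outputs (neither name comes from Valohai):

```python
import os


def get_output_file(filename, fallback_dir='outputs'):
    """Return a path for `filename` inside the Valohai outputs directory.

    `get_output_file` and `fallback_dir` are illustrative names, not part
    of any Valohai API. Outside Valohai, VH_OUTPUTS_DIR is not set, so we
    fall back to a local folder instead.
    """
    output_dir = os.getenv('VH_OUTPUTS_DIR', fallback_dir)
    # Make sure the directory exists before anything is written to it
    os.makedirs(output_dir, exist_ok=True)
    return os.path.join(output_dir, filename)
```

With this in place, the save call could become model.save(get_output_file('model.h5')).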
Save your file and run vh exec run --adhoc train again to start another execution and output your model. You'll see that the model appears in the Outputs tab at http://app.valohai.com
By default the outputs are saved in a Valohai-owned Amazon S3 bucket, in a location that only you can access. Check out our guides on docs.valohai.com to see how to add your own storage accounts and save the outputs there (Azure, AWS, Google).
In this example, we've outputted the trained model, but you can output whatever files you want. It could be, for example, graphs, confusion matrices, labeled data, CSV files or images like in our Darknet sample.
In some cases you might want to save checkpoints or other artifacts mid-execution, instead of waiting until the end of the execution to get the files to your cloud storage. To upload files mid-execution you just need to set them as read-only files, and signal to Valohai that you want the files uploaded to the cloud storage immediately. Read more about Live Outputs in our docs.
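The read-only part of that can be sketched with the standard library. This is only an illustration: the helper name is made up, and the exact way Valohai is signalled is covered in the Live Outputs docs rather than here.

```python
import os
import stat


def mark_read_only(path):
    """Remove all write permission bits from `path` (hypothetical helper)."""
    current_mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    # Keep every existing permission bit except the write bits
    os.chmod(path, current_mode & ~write_bits)
```

You would call it right after writing a checkpoint file into the outputs directory.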
We now have a couple of executions run inside our project. Valohai keeps track of the input data used, the commands, the code version, the environment you ran it on and other key general metadata. In addition, you might have other metrics you want to keep track of and use to compare your models. That's where Valohai metadata comes in.
Valohai picks up metadata from your logs and allows you to use it to filter executions, compare models and visualize said metadata.
Everything that you print as JSON is picked up by Valohai as metadata, and then you can choose what you do with it.
You might remember that the TensorFlow 2 Quickstart already logs the accuracy and loss of each epoch as it executes them. Valohai isn't picking these up as metadata because they're printed in the logs just like any other information. For Valohai to understand that this is metadata you want to collect, you need to output JSON.
In our TensorFlow 2 quickstart, we want to log the accuracy and loss of each epoch as they complete. So what we'll need to do is create a function that outputs those values every time an epoch completes.
Let's start by editing our train.py: at the top of the file, add import json.
The TensorFlow documentation describes the LambdaCallback, which lets us create simple custom callbacks that run, for example, when each epoch ends (on_epoch_end).
Create a new LambdaCallback that calls a function named logMetadata at the end of each epoch.
metadataCallback = tf.keras.callbacks.LambdaCallback(on_epoch_end=logMetadata)
Next we'll create said logMetadata function, in which we'll output the metadata values we want to track. Make sure you place the function before your metadataCallback, so it's defined before you reference it. Otherwise you'll get a NameError when the script runs.
# A function that writes JSON to our output logs: the epoch number with the loss and accuracy from that epoch.
def logMetadata(epoch, logs):
    print()
    print(json.dumps({
        'epoch': epoch,
        'loss': str(logs['loss']),
        'acc': str(logs['acc']),
    }))
Did you notice that we executed an empty print() before printing our JSON? We do this to ensure that the metadata JSON always appears on a new line, so Valohai can identify it. Otherwise your metadata output might appear on the same line as the previous log message and Valohai won't know that it's metadata you want to track.
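One caveat with the function above: str() turns the numbers into JSON strings, which may or may not be what you want. A hedged variant that keeps them numeric is sketched below. It assumes the accuracy key may be named either acc (older TensorFlow releases) or accuracy (newer ones), and the helper name format_metadata is made up for this example.

```python
import json


def format_metadata(epoch, logs):
    """Build one line of JSON metadata from a Keras `logs` dict.

    Hypothetical helper: it accepts both 'acc' and 'accuracy' as the
    accuracy key and converts values with float(), because values such
    as NumPy float32 are not JSON serializable on their own.
    """
    accuracy = logs.get('acc', logs.get('accuracy'))
    return json.dumps({
        'epoch': epoch,
        'loss': float(logs['loss']),
        'acc': float(accuracy),
    })


def logMetadata(epoch, logs):
    print()  # start on a fresh line so the JSON stands alone
    print(format_metadata(epoch, logs))
```

The callback signature stays the same, so this version can be passed to LambdaCallback in exactly the same way.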
The last thing to do is start using the metadataCallback in our model.fit call, following the example in the TensorFlow documentation.
Update your model.fit to the following:
model.fit(x_train, y_train, epochs=5, callbacks=[metadataCallback])
You can now save your file, run a new execution with vh exec run --adhoc train and visualize your data on the Metadata tab of the execution.
Remember that you can output whatever you want as metadata: as long as you can output it as JSON, we'll save it. You might, for example, write metadata to track different methods you've tried in your executions.
On the Metadata tab you'll be able to see your metadata as a Time Series or a Scatter Plot graph. As we're outputting the accuracy and loss of each epoch in this tutorial, you can select epoch as the value to plot on the X-axis and select both acc and loss on the Y-axis to see the values visualized on the Time Series graph.
Are you looking to use TensorBoard with Valohai? Check out our tutorial on TensorBoard + Valohai.
You can also view this metadata in your Executions view, so you can easily filter and compare your different executions. Go to your project's Executions tab and open the "Show columns" selection on the right side above the table. You can then select to show the acc and loss metadata, to easily compare models that export this metadata.
You might have noticed that the table on the Executions view is showing you the latest value from the metadata. If you'd like to have it show something else, like the best accuracy or the results of your model.evaluate, you can just print a json.dumps at the end of the execution and Valohai will pick it up as the latest value.
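As a rough sketch of that idea, the snippet below keeps track of the best accuracy seen so far and builds one final JSON line. The class name BestMetricTracker and the key best_acc are invented for this example.

```python
import json


class BestMetricTracker:
    """Remember the highest value reported for a metric (illustrative only)."""

    def __init__(self):
        self.best = None

    def update(self, value):
        value = float(value)
        if self.best is None or value > self.best:
            self.best = value
        return self.best

    def summary(self):
        # One JSON object, meant to be printed as the last metadata line
        return json.dumps({'best_acc': self.best})
```

You could call tracker.update(logs['acc']) inside logMetadata and print(tracker.summary()) once model.fit has finished.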
In some cases, you might also want to tag your executions, to be able to easily find, for example, the one that is currently in production. You can do that by going to the execution's Details view and adding a tag at the bottom of the list. Now if you look at the table with all your executions, you'll see a blue tag on that execution, so you can easily find it later.
As you start running your experiments and trying different combinations, you'll soon wish there was a way to pass values like the learning rate to your code without changing the code, allowing you to quickly experiment with different values. Have no fear, we can do that!
In your valohai.yaml you can define parameters that you want to pass to your code. You can then pass these, for example, on the command line when you run your executions or in the web UI.
For this tutorial, we'll learn how to pass the number of epochs and the learning rate as parameters to our code, so we can experiment with different values easily.
Start by opening your valohai.yaml and uncommenting the lines under parameters.
Now edit your valohai.yaml file to define two parameters: an integer for the number of epochs and a float for the learning rate. You'll also need to update the command and let it know that you might be passing in parameters.
---
- step:
    name: Train MNIST model
    image: tensorflow/tensorflow:1.15.2-gpu-py3
    command: python train.py {parameters}
    #inputs:
    #  - name: example-input
    #    default: https://example.com/
    # Hyphenated names, so the flags match the argparse options in train.py
    parameters:
      - name: epoch-num
        type: integer
        default: 5
      - name: learning-rate
        type: float
        default: 0.001
That's it - now Valohai knows that it might have parameters coming its way, and if there are none it will use the default values provided above.
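To make the {parameters} placeholder less abstract, here is a small sketch of the kind of expansion it stands for. It assumes each parameter is rendered as --name=value; that format and the helper build_command are assumptions for illustration, not Valohai's actual implementation.

```python
def build_command(template, parameters):
    """Replace {parameters} in `template` with --name=value flags.

    Illustrative helper only: it mimics the expansion described above
    and does not call any Valohai API.
    """
    flags = ' '.join(
        '--{}={}'.format(name, value) for name, value in parameters.items()
    )
    return template.replace('{parameters}', flags).strip()


defaults = {'epoch-num': 5, 'learning-rate': 0.001}
command = build_command('python train.py {parameters}', defaults)
print(command)  # python train.py --epoch-num=5 --learning-rate=0.001
```

Seen this way, the script simply receives ordinary command-line flags, which is why a standard argument parser is enough on the Python side.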
Next we'll need to go to our train.py and use these parameters in our code. We'll first need to parse the arguments passed to the code and then use these two new parameters.
We'll use argparse from the Python Standard Library to parse the arguments.
Start by adding import argparse to train.py and then create a new function:
def getArgs():
    # Initialize the ArgumentParser
    parser = argparse.ArgumentParser()
    # Define two arguments that it should parse
    parser.add_argument('--epoch-num', type=int, default=5)
    parser.add_argument('--learning-rate', type=float, default=0.001)
    # Now run the parser that will return us the arguments and their values and store in our variable args
    args = parser.parse_args()
    # Return the parsed arguments
    return args
Now call our new function in the beginning of our file, for example after defining the functions.
# Call our newly created getArgs() function and store the parsed arguments in a variable args. We can later access the values through it, for example args.learning_rate
args = getArgs()
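If you want to try the parser without starting an execution, you can hand argparse an explicit list of arguments. The sketch below is a variant of getArgs that takes an optional argv list (an addition made for this example). Note how the hyphen in --epoch-num becomes an underscore in args.epoch_num.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the two training parameters.

    `argv` is an optional list of strings added for this example; when it
    is None, argparse reads sys.argv as usual.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--epoch-num', type=int, default=5)
    parser.add_argument('--learning-rate', type=float, default=0.001)
    return parser.parse_args(argv)


args = parse_training_args(['--epoch-num=10', '--learning-rate=0.1'])
print(args.epoch_num, args.learning_rate)  # 10 0.1
```

Passing an empty list returns the defaults, which is handy for a quick local check.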
Now that we've parsed our values, we can start using them. Let's first update the simpler one, epoch_num, by changing our model.fit to use the parameter value rather than the fixed number 5.
model.fit(x_train, y_train, epochs=args.epoch_num, callbacks=[metadataCallback])
Now we'll also need to use the learning_rate parameter, which is passed to the Keras optimizer. In the TensorFlow 2 Quickstart we can see that it's using an optimizer that implements the Adam algorithm, as stated in our code with model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
According to the TensorFlow 2 documentation for the Adam optimizer, we can pass the learning rate when initializing the optimizer. This means that we'll need to update our model.compile to the following:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
In older versions of TensorFlow the learning rate parameter is called lr instead of learning_rate. As stated in the documentation, "lr is included for backward compatibility, recommended to use learning_rate instead."
That's it! Now let's run our new execution and pass in some parameters. You can run, for example, vh exec run --adhoc train --learning-rate=0.1 --epoch-num=10
Now you'll notice that your execution runs with 10 epochs and the learning rate you set.
Tasks are collections of related executions. For example, when doing hyperparameter optimization, you could create a Task to run several executions in parallel and find the best hyperparameters.
In our sample, we'll do a simple version where we run a Task with different values for epoch_num and learning_rate to find the best values.
By default you're not allowed to create a Task unless your project is connected to a code repository. But now that we have run some experiments, we can use them and their --adhoc files to generate a Task for us.
In the web UI, open your latest execution and click on the Task button on the right side. This will use your execution as a base for a new Task.
Now you'll see the configuration page, which is essentially generated from the valohai.yaml file and our organisation settings.
Scroll down to the parameters section, select multiple values for epoch_num and try it with, for example, 3 values (3, 5 and 10). Remember to write one value per line here.
For learning_rate, select for example linear values with start 0.001, end 0.2 and step 0.05, which results in 4 values.
Now at the bottom of the page you'll see on the right side that this configuration will create a total of 12 executions (3 epoch_num configurations x 4 learning_rate configurations).
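The arithmetic can be checked with a few lines of Python. This is only a sketch of the counting; how Valohai generates the values internally is not shown here, and the helper linear_values is made up for the example.

```python
import itertools


def linear_values(start, end, step):
    """Return start, start + step, ... for every value that is <= end."""
    values = []
    index = 0
    while True:
        # Multiply instead of adding repeatedly to limit rounding drift
        value = round(start + index * step, 6)
        if value > end:
            break
        values.append(value)
        index += 1
    return values


epoch_nums = [3, 5, 10]
learning_rates = linear_values(0.001, 0.2, 0.05)
combinations = list(itertools.product(epoch_nums, learning_rates))
print(learning_rates)     # [0.001, 0.051, 0.101, 0.151]
print(len(combinations))  # 12
```

The next learning rate in the series would be 0.201, which is past the end value of 0.2, so the series stops at 4 values and the grid has 3 x 4 = 12 entries.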
Press Create Task and admire the magic.
You'll see 12 new executions start, each starting as gray (queued), then turning blue (executing) and green (completed) as they are executed. You can go inside any of these and see that they look just like normal executions.
In your executions list you'll now see 12 new executions appear, and you'll notice in their names that they're marked with !1, meaning they belong to Task number 1, which you can view in the Tasks tab. There you can also view the Metadata of the Task to visualize the results from each execution.
Next we'll learn how to pass input data to our Valohai executions. These could be for example your training data set, labels etc. They can come either from a public address or from your private (cloud) storage.
The TensorFlow 2 Quickstart for beginners uses tf.keras.datasets.mnist to download the MNIST dataset. What we want to do is provide a custom input data source that contains the same MNIST dataset.
The MNIST dataset can be downloaded, for example, from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz, which is also where the TensorFlow 2 Quickstart downloads it from.
We'll start from valohai.yaml, where you can uncomment the inputs section and define our new input data:
---
- step:
    name: Train MNIST model
    image: tensorflow/tensorflow:1.15.2-gpu-py3
    command: python train.py {parameters}
    inputs:
      - name: my-mnist-dataset
        default: https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
    parameters:
      - name: epoch-num
        type: integer
        default: 5
      - name: learning-rate
        type: float
        default: 0.001
Now in our code, we can find that .npz package at my-mnist-dataset/mnist.npz inside the inputs directory.
Let's go back to our train.py and start by adding import numpy. Then we'll define our Valohai input path, which points to the directory holding all the inputs Valohai has downloaded as per the configuration in valohai.yaml.
Under your output_path variable definition, add input_path in the same way, by reading the value from the environment variables. Then define a variable that will contain the path to our input .npz file:
# Get the path to the folder where Valohai inputs are
input_path = os.getenv('VH_INPUTS_DIR')
# Get the file path of our MNIST dataset that we defined in our YAML
mnist_file_path = os.path.join(input_path, 'my-mnist-dataset/mnist.npz')
You can now remove the two lines that load up the sample MNIST Data:
mnist = tf.keras.datasets.mnist
and
(x_train, y_train), (x_test, y_test) = mnist.load_data()
The TensorFlow 2 Quickstart parses the MNIST data with its own function, (x_train, y_train), (x_test, y_test) = mnist.load_data(), but as we've downloaded the file ourselves, we'll use numpy to load the file and define the training and test data.
with numpy.load(mnist_file_path, allow_pickle=True) as f:
    x_train, y_train = f['x_train'], f['y_train']
    x_test, y_test = f['x_test'], f['y_test']
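If the load fails, it can help to check what the downloaded file actually contains. An .npz file is a zip archive with one .npy entry per array, so the standard library is enough for a quick look. The helper name list_npz_arrays is made up for this sketch.

```python
import zipfile


def list_npz_arrays(npz_path):
    """Return the array names stored in an .npz archive, without NumPy."""
    with zipfile.ZipFile(npz_path) as archive:
        # Each array is stored as '<name>.npy' inside the zip file
        return sorted(
            name[:-len('.npy')]
            for name in archive.namelist()
            if name.endswith('.npy')
        )
```

For the MNIST file you would expect list_npz_arrays(mnist_file_path) to include x_train, y_train, x_test and y_test, the keys used in the code above.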
Now you can run your new execution with vh exec run --adhoc train --learning-rate=0.1 --epoch-num=10 and you'll see... exactly the same results. What gives?
We didn't change anything except define the input path and load our data from there, so the results shouldn't change. However, you'll see the input we defined on the details page. And if you look at the logs, you'll notice the dataset being downloaded into valohai/inputs.
In our sample we referenced a public dataset through HTTPS, but you can also reference your own files, for example from Azure Storage, AWS S3 buckets, Google Cloud Storage etc.
Remember: Valohai, by design, doesn't take a copy of your data and store it. We keep track of the input data that you defined, so you can easily reproduce your steps later on, but it's up to you to do proper data versioning and ensure that the data source still exists.
Valohai will make sure you're aware of changes in your input data. Imagine running your experiments against an input data source, and then one day someone changes the dataset without telling you. Now you're suddenly getting different results for your experiments. Valohai will create an alert for that execution if it notices changes in the dataset you're referencing (by comparing the checksums and metadata of the file). This way you won't get those nasty surprises.
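If you want a similar safety net for files you manage yourself, a checksum is easy to compute. The sketch below uses SHA-256 from the standard library; which checksum Valohai itself compares is not stated here, so treat the algorithm choice and the helper name as assumptions.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # Read in chunks so large datasets don't need to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Store the digest next to your experiment notes; if the same call later returns a different value for the same path, the dataset has changed.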
As you run your experiments, your Valohai executions are queued for different worker machines. Sometimes you get to run on the same machine as before; that machine already has the dataset in its cache and can reuse it, skipping the download and making the execution faster.
Except that sometimes you don't want that. You want to make sure that your execution downloads a fresh dataset and/or a fresh Docker image. No worries, we got you.
You can define environment variables in the web UI or in your YAML file to instruct the worker machines to clear the cache. Just set the variable VH_NO_DATA_CACHE and/or VH_NO_IMAGE_CACHE to true and Valohai will obey. Read more about environment variables in our docs.
Valohai with TensorFlow 2 Quickstart - Part 5: Using custom public and private Docker Container Images
In the previous steps we've used a standard TensorFlow Docker image to run our code. It worked great for our MNIST sample, but as you build your experiments, you might start gathering requirements for additional libraries, downloads or other dependencies.
In your YAML you can run multiple commands and install libraries that you're missing like below:
- step:
    name: Train MNIST model
    image: tensorflow/tensorflow:1.15.2-gpu-py3
    command:
      - pip install mypackage
      - python train.py {parameters}
However, often it makes more sense to include those dependencies already in your Docker container, so you don't have to download them and run the same commands on every single execution.
There is a ton of documentation online about Docker images but we'll be brief here.
- In a Dockerfile you describe what your application needs. Here you write all the commands (like installations and updates) that should be run to define your image.
- You don't have to start from scratch: you can base your image on an existing Docker image and then just add the features you need on top of it. Keep in mind that, generally speaking, you'll want to keep only the required libraries in your image, as a smaller image tends to mean faster build and deploy times.
- We'll use a couple of commands to define our image, download libraries and updates, and then run pip install -r requirements.txt to get our required libraries in.
- We'll then build the image, tag it and push it to a Docker image repository. You can store the image either as a public image or as a private image that only you can access.
Valohai currently supports private image repositories from Docker Hub and Azure Container Registry. You'll just need to define your repository and credentials in your organisation settings to allow Valohai to download private Docker images.
You can find more information about building your own Docker images on our docs or across the interwebs.
Before we begin, make sure you install Docker on your machine.
Start by creating a requirements.txt where we'll list all the Python libraries required to run our code. Right now this will be very simple: just add tensorflow-gpu==1.15.2 to it. That's our only requirement right now. In the next part of this tutorial we'll be looking at more requirements, when we publish an endpoint for our digit prediction.
After that highly complex requirements list, we'll create the actual Dockerfile called... Dockerfile. You can save this in the same folder as the rest of the files we've created in this tutorial.
# We'll use the nvidia/cuda image as our base
FROM nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
# Set some common environment variables that Python uses
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
# Install lower level dependencies
# Run newest updates, install python etc.
RUN apt-get update --fix-missing && \
    apt-get install -y curl python3 python3-pip && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3 10 && \
    update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 10 && \
    apt-get clean && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*
# Define our working directory
WORKDIR /usr/src/valohai-tf2-quickstart
# Installing python dependencies by copying the requirements.txt to our workdir, upgrading pip and then installing our requirements
COPY requirements.txt .
RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
You can find the base image we're using on Docker Hub as nvidia/cuda.
Make sure you have registered a Docker Hub account or have access to an Azure Container Registry.
Install the required tools
Now we'll build your Docker image and tag it appropriately. Go to the command line, navigate to the folder that contains your Dockerfile, and then run docker build --tag myaccount/name:tag . like this:
docker build -t drazend/valohai-tf2-quickstart:0.0.3 .
The last . tells Docker to use the current directory as the build context, which is where it finds the Dockerfile and requirements.txt.
Next we'll push your image to the registry, so you can use it outside of your own machine. First you'll need to log in:
docker login --username=yourhubusername
If you're logging in to an Azure Container Registry, you would use something like docker login myregistry.azurecr.io
After you've successfully logged in, you can push your image with a simple docker push myaccount/name:tag, so for example
docker push drazend/valohai-tf2-quickstart:0.0.3
You can create private repositories on Docker Hub if you don't want your images to be publicly accessible. Or you can use Azure Container Registry, which stores and manages private Docker container images.
Once the Docker image has been uploaded, we can start using it by replacing the standard TensorFlow image with our new custom image in our valohai.yaml.
Organisations on Valohai can easily use private Docker repositories from Docker Hub or Azure Container Registry.
First you'll need to create an access token that permits Valohai to pull your private Docker container images. Follow the instructions for Docker Hub or Azure Container Registry to generate access credentials.
Once you have your credentials, head on over to http://app.valohai.com and open your organisation settings (click on your name on the top right and select Manage organisation-name). Go to the registries tab and select Add new to create an entry. Here the name would be something like docker.io/valohai/* or valohai.azurecr.io/*, and the username and password are the ones you generated previously.
That's it!
Now you can start using private Docker repositories in your Valohai executions. Just put the new image you'd like to use in your valohai.yaml. Make sure you use the full name, like docker.io/user/name:tag.
Valohai makes it easy to publish your model for online inference through a Kubernetes cluster. By default the cluster is hosted by Valohai but it can also be defined to be installed on your own environment and cluster.
In this tutorial we'll deploy the model using WSGI, a specification that describes how a web server communicates with web applications. We'll define the endpoint in our valohai.yaml and then write the code that will take an input (an image of a handwritten number) and use our MNIST predictor to predict what number is in the image.
Start by going to your valohai.yaml and defining a new endpoint at the bottom. Usually you'll have one endpoint per prediction that you want to make.
- name: Give it a name like digit-predict. This is only used to identify the endpoint later on, for example in the web UI.
- description: Not surprisingly, this is the description of your endpoint, like "predict digits from image inputs".
- image: This is the Docker image you want to use. It should contain the libraries and tools you need to run your prediction service.
- wsgi: Here you define what the server should execute. The format is filename:method
- files: When you deploy a model on Valohai you'll be presented with an option to provide files to it. In our case, we'll pass it a model.h5 file that we've previously trained.
  - name: name of the file (for example "prediction model")
  - description: You guessed it, here you can describe what the file is for.
  - path: where this model will be stored. We'll use this path to load it in our application.
💡 As of right now, deployments don't support private Docker image repositories, so you'll have to use a public image for this.
Add an endpoint to your valohai.yaml like below. Notice that we're using a custom built Docker image. Make sure this points to the image you published earlier.
---
- step:
    name: Train MNIST model
    image: tensorflow/tensorflow:1.15.2-gpu-py3
    command: python train.py {parameters}
    inputs:
      - name: my-mnist-dataset
        default: https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
    parameters:
      - name: epoch-num
        type: integer
        default: 5
      - name: learning-rate
        type: float
        default: 0.001
- endpoint:
    name: digit-predict
    description: predict digits from image inputs
    image: docker.io/myaccount/name:tag
    wsgi: predict:mypredictor
    files:
      - name: model
        description: Model output file from TensorFlow
        path: model.h5
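To make the filename:method format of the wsgi field more concrete, here is a minimal sketch of how such a string can be resolved to a Python callable. It is only an illustration of the general "module:callable" convention, not Valohai's actual loader; the helper name resolve_wsgi_target is hypothetical, and json:dumps stands in for predict:mypredictor so the sketch runs without any project files.

```python
import importlib


def resolve_wsgi_target(spec):
    """Resolve a 'module:callable' string to the Python object it names.

    Hypothetical helper for illustration only; this is not Valohai code.
    """
    module_name, sep, attr_name = spec.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError("expected 'module:callable', got %r" % spec)
    # Import the module part (e.g. 'predict' for predict.py) ...
    module = importlib.import_module(module_name)
    # ... and look up the callable part (e.g. 'mypredictor') inside it.
    return getattr(module, attr_name)


if __name__ == "__main__":
    # 'json:dumps' is a stand-in for 'predict:mypredictor'.
    target = resolve_wsgi_target("json:dumps")
    print(callable(target))
```

With a predict.py in the working directory, resolve_wsgi_target("predict:mypredictor") would return the function that the endpoint definition points at.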
Next we'll start creating a Python script that will do the prediction. We'll use the werkzeug WSGI utility library to help us create our web app.
💡 What if I don't want to use WSGI? In valohai.yaml you can also define a server-command and a port to run any Python script instead of defining wsgi. You can then run what you need, for example server-command: python runmyserver.py. You can run multiple server commands by chaining them together, like server-command: dostuffandthings && python runmyserver.py. See more details in our docs.
Create a new file called predict.py and start by creating a simple Hello World, following the example from the Werkzeug homepage:
from werkzeug.wrappers import Request, Response

# Define the main function that Valohai will call to do the prediction
def mypredictor(environ, start_response):
    # Create a new response object
    response = Response("Hello world!")
    # Send back our response
    return response(environ, start_response)

# This block runs only when the file is executed directly, so we can test locally
if __name__ == "__main__":
    from werkzeug.serving import run_simple
    # Run a local server on port 8000. Each request is handled by mypredictor, declared above
    run_simple("localhost", 8000, mypredictor)
Now you can test your app locally by running python predict.py. And 🔥BAM🔥 it failed, saying it can't find werkzeug. That's because it's not installed in our environment by default, so we'll need to install it with pip install werkzeug.
🔥 Now re-run python predict.py and you'll see a web server starting. Navigate to http://localhost:8000/ to get your response.
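If you'd like to check the WSGI contract without opening a browser, a WSGI callable can also be invoked directly. The sketch below uses only the standard library; hello_app is a plain-WSGI stand-in for mypredictor and call_wsgi is a hypothetical helper, neither of which is part of the tutorial's code.

```python
from wsgiref.util import setup_testing_defaults


def hello_app(environ, start_response):
    """Plain-WSGI stand-in for mypredictor: always answers 'Hello world!'."""
    body = b"Hello world!"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_wsgi(app, method="GET", path="/"):
    """Call a WSGI app once and return (status, headers, body_bytes)."""
    environ = {"REQUEST_METHOD": method, "SCRIPT_NAME": "", "PATH_INFO": path}
    # Fill in the remaining keys a WSGI server would normally provide.
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    chunks = app(environ, start_response)
    try:
        body = b"".join(chunks)
    finally:
        # WSGI iterables may define close(); call it if present.
        close = getattr(chunks, "close", None)
        if close is not None:
            close()
    return captured["status"], captured["headers"], body


if __name__ == "__main__":
    status, headers, body = call_wsgi(hello_app)
    print(status, body)
```

Because mypredictor has the same (environ, start_response) signature, call_wsgi(mypredictor) should work the same way once werkzeug is installed.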
To do our predictions we'll need an image sent to our inference service. Our predict.py should read that image from the request it has received, pass it to our prediction and then respond with the predicted digit value. So let's write some magic ✨ to read the image that's sent to us.
Define a new function called read_input that takes in the request object, and place it before your mypredictor function.
def read_input(request):
    # Ensure that we've received a file named 'image' through POST
    # If we have a valid request proceed, otherwise return None
    if request.method == 'POST' and 'image' in request.files:
        photo = request.files['image']
        # Save file to memory
        in_memory_file = io.BytesIO()
        photo.save(in_memory_file)
        # Read the file bytes
        data = numpy.frombuffer(in_memory_file.getvalue(), dtype=numpy.uint8)
        # Use OpenCV to read the image as grayscale
        img = cv2.imdecode(data, cv2.IMREAD_GRAYSCALE)
        # imdecode returns None when the bytes are not a readable image
        if img is None:
            return None
        # Resize the image to 28x28 with OpenCV
        img = cv2.resize(img, (28, 28))
        return img
    return None
Now update your mypredictor function to use the new read_input function. Based on the value we get from read_input (the image or None) we send a different response.
def mypredictor(environ, start_response):
    # Get the request object from the environment
    request = Request(environ)
    # Get the image file from our request
    inputfile = read_input(request)
    # If read_input didn't find a valid file
    if inputfile is None:
        response = Response("\nNo image", content_type='text/html')
        return response(environ, start_response)
    response = Response("\nWe got an image!")
    return response(environ, start_response)
Run python predict.py to test your syntax locally before sending it to Valohai.
If you navigate to http://localhost:8000 you'll get a response that no image was sent to the service. Luckily there is an easy way to test whether your service is able to read the image file sent to it. You can do this with either curl or Postman.
Here's an example of using curl to send a POST request with a file (7.png) to the server. Open a new terminal/shell and run the following command (while keeping your Python server running in the other window):
curl -X POST -F "image=@7.png" localhost:8000/
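If curl or Postman isn't available, the same request can be sketched in Python with the standard library alone. This is a minimal sketch under assumptions: build_multipart and post_image are hypothetical helper names, and the image/png content type is a guess for the 7.png sample file.

```python
import os
import urllib.request
import uuid


def build_multipart(field_name, filename, file_bytes, content_type="image/png"):
    """Return (body, content_type_header) for a single-file multipart form."""
    boundary = uuid.uuid4().hex
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="{n}"; filename="{f}"\r\n'
        "Content-Type: {t}\r\n\r\n"
    ).format(b=boundary, n=field_name, f=filename, t=content_type).encode("utf-8")
    tail = "\r\n--{b}--\r\n".format(b=boundary).encode("utf-8")
    return head + file_bytes + tail, "multipart/form-data; boundary=" + boundary


def post_image(url, path):
    """POST the file at `path` as form field 'image' and return the reply bytes."""
    with open(path, "rb") as handle:
        body, header = build_multipart("image", os.path.basename(path), handle.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST")
    with urllib.request.urlopen(request) as reply:
        return reply.read()
```

With the local server running, post_image("http://localhost:8000/", "7.png") sends the same form field that the curl command does.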
🔥BAM🔥 another error. This time it says NameError: name 'io' is not defined. When we wrote our read_input function we used io to read the bytes, and numpy and OpenCV to decode the image and resize it. We'll need to install the new packages with pip install numpy opencv-python and then add the right imports at the top of our file:
import io
import numpy
import cv2
The last part we'll need to edit in our code is the actual TensorFlow prediction.
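Based on the TensorFlow 2 documentation, we'll use tf.keras.models.load_model to load the model we created earlier and then pass our image to model.predict_classes, which returns the predicted class. Note that predict_classes is available in the TensorFlow version pinned in this tutorial but was removed in TensorFlow 2.6; on newer versions, numpy.argmax(model.predict(x), axis=-1) is the suggested replacement for multi-class models.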
To test the model locally we'll need to download a model.h5 file from the outputs of one of your previous app.valohai.com executions. Go to the web UI, navigate to an execution, download a model.h5 from its outputs and move it to the same folder as predict.py.
Now add import tensorflow as tf at the top of your predict.py. Then let's start editing mypredictor to load the model and predict values. Add the following lines after the inputfile is None check and before creating the Response.
    # Load our model
    model_path = 'model.h5'
    new_model = tf.keras.models.load_model(model_path)
    # Use our model to predict the class of the file sent over a form.
    # We're reshaping the image as our model is expecting 3 dimensions (with the first one describing the number of images)
    prediction = new_model.predict_classes(inputfile.reshape(1, 28, 28))
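The snippet above loads model.h5 inside mypredictor, so the file is read again on every request. One possible refinement, sketched here under assumptions, is to cache the loaded model. load_model_from_disk is a stand-in for tf.keras.models.load_model so the example runs without TensorFlow, and get_model is a hypothetical helper name.

```python
import functools


def load_model_from_disk(path):
    """Stand-in for tf.keras.models.load_model; counts how often it runs."""
    load_model_from_disk.calls += 1
    # Placeholder object instead of a real Keras model.
    return {"path": path}


load_model_from_disk.calls = 0


@functools.lru_cache(maxsize=1)
def get_model(path="model.h5"):
    """Load the model on first use and reuse it for later requests."""
    return load_model_from_disk(path)
```

Inside mypredictor you would then call get_model() instead of loading the file directly. Whether this helps in practice depends on how the server runs your app, so treat it as an optional optimisation.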
Now that we have our prediction we can send a response with the predicted digit. We'll do this by sending a JSON response, so add import json at the top of your file and then replace your current response with:
    # Generate a JSON output with the prediction
    # int() converts the NumPy integer into a plain int that json can serialize
    json_response = json.dumps({"Predicted_Digit": int(prediction[0])})
    # Send a response back with the prediction
    response = Response(json_response, content_type='application/json')
    return response(environ, start_response)
Now you can run your curl command again and you'll get a JSON response with the predicted class! ✨✨
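When building the JSON body, passing a dict to json.dumps (rather than a pre-formatted string) produces a JSON object that clients can index by key. A minimal sketch, with build_payload and parse_payload as hypothetical helper names:

```python
import json


def build_payload(predicted_digit):
    """Serialize the prediction as a JSON object with one integer field."""
    # int() also accepts NumPy integer types, which json.dumps rejects as-is.
    return json.dumps({"Predicted_Digit": int(predicted_digit)})


def parse_payload(body):
    """What a client could do with the response body."""
    return json.loads(body)["Predicted_Digit"]


if __name__ == "__main__":
    body = build_payload(7)
    print(body)
    print(parse_payload(body))
```

A client receiving this body can call parse_payload on it and get the digit back as an integer.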
Getting your new prediction service onto Valohai is really easy: you just need to push the new changes to your code repository, or run an --adhoc execution to upload the files from your working directory to Valohai.
Make sure your Docker container is ready. In the previous section of this tutorial we created a custom Docker container and pushed it to the repository. Then in our YAML we told the deployment endpoint to use that Docker image, but as it is right now our execution on Valohai will fail 🔥
As you know, we ran a couple of pip install commands and added new dependencies to our predict.py code. These packages are not in our Docker image currently. How do we get them there? Simple. Update the requirements.txt that your Dockerfile is using with the new libraries we need:
tensorflow-gpu==1.15.2
numpy
werkzeug==0.16.1
opencv-python==4.2.0.32
pillow
We'll also need to update the Dockerfile to install some additional libraries on the container image. In your Dockerfile, update the apt install with the new packages needed to run the code:
# Install lower level dependencies
# Run newest updates, install python etc.
RUN apt-get update --fix-missing && \
    apt-get install -y curl python3 python3-pip && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3 10 && \
    update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 10 && \
    apt install -y libsm6 libxext6 libxrender-dev && \
    apt-get clean && \
    apt-get autoremove && \
    rm -rf /var/lib/apt/lists/*
Then just run docker build -t myaccount/name:tag . and docker push myaccount/name:tag again to get a new version to your repository.
💡 You might want to change the tag for this new version, for example to myaccount/name:0.0.2
Now just update your valohai.yaml with the new name:tag of the image to use for the endpoint, so it knows to use a Docker image with all the required libraries.
If your project is connected to a repository, you can click Fetch repository in the web UI. Or if you haven't connected to a repo yet (or you just prefer to), use vh exec run --adhoc train to upload your new files.
Go to http://app.valohai.com, navigate to your project and the Deployment-tab. There create a new deployment with the name of your choice (valohai-tf2-quickstart for example) and leave the cluster as Default.
💡 Read more about Deployments in our docs.
Now create a new version for your deployment. Here you'll see that it pulled information from your YAML to show what kind of endpoints you have and what kind of files you need to upload for them.
Click the checkbox to enable digit-predict and select your latest model.h5 from the dropdown. Then click create version and let the magic happen ✨✨
You'll see that Valohai will start by pulling your custom Docker Image, building the image and then pushing it to the Kubernetes cluster. After it's pushed you'll see it starting up (Pending) and once it's at 100% Available you can start using your online inference!
I got a Bad Gateway 500 error. That usually means there is some issue with your code. Make sure it's running correctly on your own machine, debug any issues and make sure you've defined your custom Docker container properly in your valohai.yaml.
🔥 Now you can click on the link and you'll see the familiar "No image" response. To test a deployment with an image you can click the "Test Deployment" button, select your endpoint, set a POST request and add a field named image with the sample image file.
And tada, you got your response back! Now pat yourself on the back. Job. Well. Done. ✨
💡 What if I want to do batch inference and evaluate a set of files at the same time, and maybe only once a week? I don't want to run a whole Kubernetes cluster for that. Have no fear: for that purpose the recommended approach is to create a new step in your valohai.yaml and a new Python file that takes all the inputs (like the MNIST data), runs predictions and outputs them (like the model.h5).
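As a rough illustration of what such a batch script could look like, here is a minimal sketch using only the standard library. The names run_batch, predict_one and predictions.json are hypothetical, and the real prediction call (loading model.h5 and decoding each image) is replaced by a caller-supplied function so the sketch runs anywhere.

```python
import json
import os


def run_batch(input_dir, output_dir, predict_one):
    """Run predict_one on every file in input_dir and write predictions.json.

    predict_one receives the raw file bytes and returns a JSON-serializable
    value; in a real step it would wrap the model prediction.
    """
    results = {}
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if os.path.isfile(path):
            with open(path, "rb") as handle:
                results[name] = predict_one(handle.read())
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, "predictions.json")
    with open(out_path, "w") as handle:
        json.dump(results, handle, indent=2, sort_keys=True)
    return out_path
```

In a Valohai execution, input_dir and output_dir would point at the directories where Valohai places a step's inputs and collects its outputs. The exact paths are not shown in this section, so treat them as something to look up in the Valohai docs.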
- Batch inference
- GitHub Repo
- Pipelines
- Notebooks