Workshop: Keras with Valohai

In this tutorial we'll bring the TensorFlow 2 quickstart to Valohai, taking advantage of Valohai's versioned experiments, data inputs and outputs, and metadata export to easily track & compare your models.

You can use Valohai through the UI, using our command-line tools or by calling the APIs from your pipelines. This tutorial will focus on using the Valohai command-line tools.

Creating a project on app.valohai.com

➑️ Start by creating an account at http://app.valohai.com.

Create a new Project on Valohai

Once you've logged in, click the "Create new project" button.

  • Project name: Choose your project name, for example valohai-tf2-quickstart
  • Description: TensorFlow 2 Quickstart with Valohai
  • Ownership: Select your username. We'll talk more about ownership later, but this could be for example your organization or team, with whom you'd like to share the project.

πŸ’‘ You can also create the project from the command-line or through the APIs.

Using Valohai through the CLI

❗ Before continuing, make sure you have Python 3 (and pip3) installed. Run python3 --version on the command line/terminal to check your Python version and use the Windows installer or Homebrew (brew install python3) to install Python 3 if needed.

πŸ“‚ Create a new folder (valohai-tf2-quickstart), where we will store all the files related to the quickstart. This way, it'll be easier to keep track of the files and later remove them, when you no longer need them.

🐍 Navigate inside the folder and create an isolated Python virtual environment, where we will install the valohai-cli tools and other libraries needed.

python3 -m virtualenv .venv
source .venv/bin/activate
  • Install Valohai CLI in your virtual environment with pip install valohai-cli
  • Login to the Valohai command-line tools with vh login

Hooray, you now have a virtual environment with valohai-cli tools installed! πŸŽ‰

Next, let's run our first experiment.

Create your first Valohai project

We'll start by creating a simple Hello World, to ensure that everything works smoothly. Create a new file called train.py with your favorite editor (VS Code, Emacs, Nano, Vim etc.) and inside it add print('hello world!'). Make sure this file is saved in the folder you previously created.
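
For reference, the entire train.py at this point is just the one line described above:

# train.py - the simplest possible script to verify that everything works
print('hello world!')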

Now let's move back to the command line.

Initialize your Valohai project

We'll initialize a Valohai project in your folder, select which command to run and what container to use.

Inside the valohai-tf2-quickstart folder, initialize a new Valohai project with vh init

  • πŸ“‚ First confirm that you're in the right folder. Type y and press enter to continue.
  • 🐍 Valohai has looked through the folder and found a file called train.py which we just created. You can confirm that this is the command you want to run by writing 1 and hitting enter.
    • If you had multiple files in this folder, it would give you options to select from, but because our folder has only one file that Valohai expects to be the command to run, it suggests just python train.py. You could also write a custom command here and hit enter, if none of the options are correct.

  • Confirm your selection with y
  • 🐳 Now it's asking you to select a Docker image to use. As you can see, it's already suggesting some images but you can always use another image.
    • You can also create your own Docker images that contain all the libraries and tools you need for your experiments
      • Valohai can access public Docker images, and it can also use private images hosted on Docker Hub or Azure Container Registry, if you provide the right credentials in your organisation settings.
    • Go to Docker Hub and find the newest TensorFlow image...or if you're feeling lazy just select 5 (tensorflow/tensorflow:1.13.1-gpu-py3)
  • Confirm your selection with y

Now you'll see a preview of the valohai.yaml file. This file contains all the configuration needed to run your experiments on Valohai. You'll see the information we provided so far, and some commented lines...we'll come back to those later on.

➑️ This file is saved as valohai.yaml. You can always go and edit the file, make changes and upload them to Valohai.

  • Select y to confirm the generated file
  • Now Valohai asks you which project this should be associated with. You can create a new project (C) or link to an existing one (L). As we created our project already in the web UI, we can select L
  • Next you'll see a list of your projects, select the one you created and tadaa, your project has been created! πŸŽ‰βœ¨

πŸ”₯ To run your project you can write vh exec run --adhoc execute, which translates to "Valohai, execute a new run called 'execute' and run it as an adhoc run".

  • --adhoc means that you don't have a code repository (like GitHub) but rather want to upload the files from this folder and run them on Valohai.
  • Where does the name execute of the run come from? It comes from the step name in our valohai.yaml. Instead of making us write out the whole step name Execute python train.py, it accepts execute as there is only one step starting with execute.
  • You can also include --watch, to get the execution logs to your command-line.

πŸ’‘ A good idea is to open your valohai.yaml file and edit the step name Execute python train.py to be more descriptive of what we're doing. You might want to call it Train MNIST model for example.

Then you would run your experiment with vh exec run --adhoc train

You can now see your execution on app.valohai.com under your project.

Look at you, running experiments on Valohai. Congratulations! πŸŽ‰πŸŽ‰ Pat yourself on the back for a job well done πŸ‘πŸ‘

☁️☁️ You'll notice that these executions ran on a Microsoft Azure NC6 machine. That's where we run executions by default. Once you start using Valohai, we can set up the executions to run on your machines on AWS, GCP or some other data center (or a set of machines you have locally).

You can also have multiple environments set up in different clouds, so you can define per project (or execution) which environment you want to run on.

What's this YAML file you speak of? ❓❓

The YAML-file that was generated for us contains the configuration of our Valohai project. You'll find inside it a single step, called Train MNIST model, with its Docker image and the command to run (python train.py).

In Valohai, you create steps for different operations or workload types. You could have steps for data anonymization, feature extraction, training your model, batch inference, model evaluation etc. It's essentially what you want to run on Valohai.

You can read more about the valohai.yaml config file on our docs.

πŸ’‘ Remember that YAML is a bit fussy about indentation, so when you edit your file make sure you have the right amount of spaces to structure the contents.

🚨A word on code repositories🚨

One advantage of Valohai is reproducibility: the fact that you can go back to any execution you've run and see the input data, parameters, metadata, code version and the outputs.

In this tutorial we'll be primarily doing --adhoc runs which skip the need for pushing a new version of your code to the repository and then fetching it to Valohai. However, this will not keep track of the code versions.

Make sure that you create a repository (GitHub, GitLab, BitBucket etc.) where you store your code. Then just like you normally would, do commits and push new changes there. Every time you then run an execution on Valohai, it will keep track of which commit version was used, so you can easily reference and access it (instead of playing around with offline file versions).

So, maybe create that repo now? Or feel free to proceed with the tutorial as is. By default all the steps will do an --adhoc execution, so you're not required to have a repository to complete the tutorial, but it's definitely a good practice. Just remember that in that case, instead of running --adhoc, you run your git commands to push a new version and then fetch it in Valohai to run an execution with the new code.

Valohai with TensorFlow 2 Quickstart - Part 2: Outputs

Now that we've successfully run a Hello World on Valohai, we can proceed to the next stage and implement the TensorFlow 2 Quickstart.

This quickstart is using the MNIST data set and training a neural network to classify images. Long story short, we're looking to input the MNIST dataset, train the model to recognize the handwritten digits and evaluate the accuracy of the model.

➑️ Head on over to the TensorFlow 2 Quickstart-tutorial and replace your train.py with the code from the tutorial.

πŸ”₯ Once you're done, you can run your new training on Valohai with vh exec run --adhoc train (or whatever the name of your step is).

πŸ”₯ Run vh exec watch 2 to see the logs of your second execution or go to app.valohai.com to see your execution and its logs.

You'll see in the logs that the sample runs through 5 epochs to train the model, and it's showing the loss and accuracy of each epoch in the logs. We'll talk about visualising these metrics in the next part of this tutorial.

But wait.. where is the trained model? πŸ™€

So you just ran your model, trained a beautiful handwritten digit classifier but the model is nowhere to be found.

To output a model in Valohai (or actually, to output any files in Valohai) you'll need to write them to the output directory. Once the execution is done, Valohai will save the output files to cloud storage.

πŸ’‘ Valohai will upload all the files from outputs at the end of an execution, even if your code crashes or the execution is stopped.

You'll need to get the location of the output directory and save your file there.

First import os in your train.py, so we can access the OS functionality, like getting paths and environment variables.

🦈 Then create a new variable to store the Valohai output path

# Get the output path from the Valohai machines environment variables
output_path = os.getenv('VH_OUTPUTS_DIR')

At the bottom of your file call model.save to save the model's architecture, weights and training in a single file, as described in the TensorFlow documentation.

# Save our file to that directory as model.h5
model.save(os.path.join(output_path, 'model.h5'))

πŸ”₯ Save your file and run vh exec run --adhoc train again to start another execution and output your model. You'll see that the model appears in the outputs tab at http://app.valohai.com

☁️ By default the outputs are saved in a Valohai owned Amazon S3 Bucket - it's saved in a location that only you can access. Check out our guides on docs.valohai.com to see how to add your own storage accounts and save the outputs there (Azure, AWS, Google)

In this example, we've outputted the trained model, but you can output whatever files you want. It could be for example graphs, confusion matrices, labeled data, csv files or images like in our Darknet sample.

πŸ’‘ In some cases you might want to save checkpoints or other artifacts mid-execution, instead of waiting till the end of the execution to get the files to your cloud storage. To upload files mid-execution you just need to set them as read-only files, and signal to Valohai that you want the files uploaded to the cloud storage immediately. Read more about Live Outputs in our docs.
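
For example, here is a minimal sketch of a live output based on the read-only convention mentioned above (the checkpoint filename is just illustrative; see the Live Outputs docs for the exact mechanism):

import os
import stat

# Save a mid-training checkpoint into the Valohai outputs directory
checkpoint_path = os.path.join(output_path, 'checkpoint-epoch-3.h5')
model.save(checkpoint_path)
# Marking the file read-only signals Valohai to upload it to cloud storage right away
os.chmod(checkpoint_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)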

Valohai with TensorFlow 2 Quickstart - Part 3: Metadata and visualisations

We now have a couple of executions run inside our project. Valohai will keep track of the input data used, the commands, the code version, the environment you ran it on and other key general metadata. In addition, you might have other metrics you want to keep track of, and use to compare your models. That's where Valohai metadata comes in.

Valohai picks up metadata from your logs and allows you to use it to filter executions, compare models and visualize said metadata.

Everything that you output as JSON is picked up by Valohai as metadata, and then you can choose what you do with it.

πŸ’‘ You might remember that the TensorFlow 2 Quickstart is already logging the accuracy and loss of each epoch, as it executes them. Valohai isn't picking these up as metadata because it's outputted in the logs just like any other information. For Valohai to understand that this is metadata you want to collect, you need to output JSON.

In our TensorFlow 2 quickstart, we want to log the accuracy and loss of each epoch as they complete. So what we'll need to do is create a function that outputs those values every time an epoch completes.

Let's start by editing our train.py and at the top of the file import json.

The TensorFlow documentation describes the LambdaCallback, which allows us to create simple, custom callbacks once each epoch ends (on_epoch_end).

Create a new LambdaCallback function to call a function called logMetadata at the end of each epoch.

metadataCallback = tf.keras.callbacks.LambdaCallback(on_epoch_end=logMetadata)

🦈 Next we'll create said logMetadata function in which we'll output the metadata values we want to track. ❗ Make sure you place the function before your metadataCallback, so it's defined before you call it. Otherwise you're not gonna have a good time...you'll get an error, that's what I mean.

# A function to write JSON to our output logs with the epoch number and the loss and accuracy from each epoch.
def logMetadata(epoch, logs):
    print()
    print(json.dumps({
        'epoch': epoch,
        'loss': str(logs['loss']),
        'acc': str(logs['acc']),
    }))

πŸ’‘ Did you notice that we executed an empty print() before printing our JSON? We do this to ensure that the metadata JSON always appears on its own line, so Valohai can identify it. Otherwise your metadata output might appear on the same line as the previous log message and Valohai won't know that it's metadata you want to track.

The last thing to do is to start using the metadataCallback in our model.fit, as shown in the example in the TensorFlow documentation.

Update your model.fit to the following:

model.fit(x_train, y_train, epochs=5, callbacks=[metadataCallback])

πŸ”₯ You can now save your file and run a new execution with vh exec run --adhoc train and visualize your data on the Metadata tab of the execution πŸŽ‰πŸŽ‰

πŸ’‘ Remember that you can output whatever you want as metadata, as long as you can output it as JSON, we'll save it. You might for example write metadata to track different methods you've tried in your executions.

On the Metadata tab you'll be able to see your metadata as a Time Series or a Scatter Plot graph. As in this tutorial we're outputting the accuracy and loss of each epoch, you can select epoch as the value to plot on the X-axis and select both acc and loss on the Y-axis to see the values visualized on the Time Series graph.

Metadata in a Time Series graph

➑️ Are you looking to use TensorBoard with Valohai? Check out our tutorial on TensorBoard + Valohai.

You can also view this metadata in your Executions view, so you can easily filter and compare your different executions. Go to your project's Executions tab and open the "Show columns" selection on the right side above the table. You can then select to show the acc and loss metadata, to easily compare models that export this metadata.

Execution table with metadata

πŸ’‘ You might have noticed that the table on the Executions view is showing you the latest value from the metadata. If you'd like to have it show something else, like the best accuracy or the results of your model.evaluate you can just do a json.dumps at the end of the execution and Valohai will pick it up as the latest value.
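
As a sketch, you could add something like the following to the end of train.py to report the final test metrics as the latest metadata values (the metric names test_loss and test_acc are just examples):

# Evaluate the model on the test set and log the results as Valohai metadata
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(json.dumps({
    'test_loss': str(test_loss),
    'test_acc': str(test_acc),
}))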

πŸ’‘ In some cases, you might also want to tag your executions, to be able to easily find for example the one that is currently in production. You can do that by going to the execution's Details-view and adding a tag at the bottom of the list. Now if you look at the table with all your executions, you'll see a blue tag on one of them, so you can easily find it later.

Valohai with TensorFlow 2 Quickstart - Part 4: Parameters

As you start running your experiments and trying different combinations, you'll soon wish there was a way to pass values like the learning rate to your code without changing the code, allowing you to quickly experiment with different values. Have no fear, we can do that! πŸŽ‰

In your valohai.yaml you can define parameters that you want to pass to your code. You can then pass these, for example, on the command line when you run your executions or in the web UI.

For this tutorial, we'll learn how to pass epoch_num and learning_rate as parameters to our code, so we can experiment with different values easily.

Start by opening your valohai.yaml and uncomment the lines under parameters.

🦈 Now edit your valohai.yaml file to define two parameters: epoch_num as an integer and learning_rate as a float. You'll also need to update the command and let it know that you might be passing in parameters.

---

- step:
    name: Train MNIST model
    image: tensorflow/tensorflow:1.15.2-gpu-py3
    command: python train.py {parameters}
    #inputs:
    #  - name: example-input
    #    default: https://example.com/
    parameters:
     - name: epoch_num
       type: integer
       default: 5
     - name: learning_rate
       type: float
       default: 0.001

That's it - now Valohai knows that it might have parameters coming its way, and if there are none it will use the default values provided above.

🦈 Next we'll need to go to our train.py and use these parameters in our code. We'll need to first parse the arguments passed to the code and then use these two new parameters.

We'll use argparse from the Python Standard Library to parse the arguments.

Start by adding import argparse to train.py and then create a new function:

def getArgs():
    # Initialize the ArgumentParser
    parser = argparse.ArgumentParser()
    # Define two arguments that it should parse
    parser.add_argument('--epoch-num', type=int, default=5)
    parser.add_argument('--learning-rate', type=float, default=0.001)

    # Now run the parser that will return us the arguments and their values and store in our variable args
    args = parser.parse_args()

    # Return the parsed arguments
    return args

Now call our new function near the beginning of our file, for example after the function definitions.

    # Call our newly created getArgs() function and store the parsed arguments in a variable args. We can later access the values through it, for example args.learning_rate
    args = getArgs()

Now that we've parsed our values, we can start using them. Let's first update the simpler one, epoch_num, by updating our model.fit to use the parameter value rather than the fixed number 5.

    model.fit(x_train, y_train, epochs=args.epoch_num, callbacks=[metadataCallback])

Now we'll also need to use the learning_rate parameter, which is passed to the Keras optimizer. In the TensorFlow 2 Quickstart we can see that it's using an optimizer that implements the Adam algorithm as stated in our code with model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

According to the Tensorflow 2 documentation for the Adam optimizer we can pass the learning rate in the initialization of the optimizer. This means that we'll need to update our model.compile to be the following:

   model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

❗ In older versions of TensorFlow the learning rate parameter is called lr instead of learning_rate. As stated in the documentation, "lr is included for backward compatibility, recommended to use learning_rate instead."
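
If you need to support both parameter names, a small compatibility sketch (just one way to handle it, assuming the tf and args names from above) could look like this:

# Newer Keras optimizers use learning_rate; fall back to lr on older versions
try:
    optimizer = tf.keras.optimizers.Adam(learning_rate=args.learning_rate)
except TypeError:
    optimizer = tf.keras.optimizers.Adam(lr=args.learning_rate)

model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])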

πŸ”₯ That's it! Now let's run our new execution and pass in some parameters. You can run for example vh exec run --adhoc train --learning_rate=0.1 --epoch_num=10. Now you'll notice that your execution will run with 10 epochs and the set learning_rate.

Using Tasks to do a parameter sweep

Tasks are a collection of related executions. For example, when doing hyperparameter optimization, you could create a Task to run several executions in parallel and find the optimal hyperparameters.

In our sample, we'll do a simple version where we will run a Task with different values for epoch_num and learning_rate to find the best values.

πŸ’‘ By default you're not allowed to create a Task unless your project is connected to a code repository. But now that we have run some experiments, we can use them and the --adhoc files to generate a Task for us.

In the web UI open your latest execution and click on the Task button on the right side. This will use your execution as a base for a new Task.

Now you'll see the configuration page, which is essentially generated from the .yaml file and our organisation settings.

Scroll down to the parameters section, select multiple values for epoch_num and try it with for example 3 values (3, 5, 10). Remember to write one value per line here.

For learning_rate select for example linear values with start 0.001, end 0.2 and step 0.05, which results in 4 values (0.001, 0.051, 0.101, 0.151).

Now at the bottom of the page you'll see on the right side that this configuration will create a total of 12 executions (3 epoch_num configurations x 4 learning_rate configurations).

πŸ”₯ Press Create Task and admire the magic.

You'll see 12 new executions start, each starting as gray (queued), turning blue (executing) and green (completed) as they are executed. You can go inside any of these and see that they look just like normal Executions.

In your executions list you'll now see 12 new executions appear, and you'll notice in their name that they're marked with !1 meaning they belong to Task number 1, which you can view in the Tasks tab. In there you can also view the Metadata of the Task to visualize the results from each execution.

Valohai with TensorFlow 2 Quickstart - Part 5: Custom Inputs

Next we'll learn how to pass input data to our Valohai executions. These could be for example your training data set, labels etc. They can come either from a public address or from your private (cloud) storage.

The TensorFlow 2 Quickstart for beginners uses tf.keras.datasets.mnist to download the MNIST dataset. What we want to do is provide a custom input data source that contains the same MNIST dataset.

The MNIST dataset can be downloaded for example from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz, which is actually where the TensorFlow 2 Quickstart downloads it from as well.

🦈 We'll start from valohai.yaml where you can uncomment the inputs section and define our new input data:

---

- step:
    name: Train MNIST model
    image: tensorflow/tensorflow:1.15.2-gpu-py3
    command: python train.py {parameters}
    inputs:
      - name: my-mnist-dataset
        default: https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
    parameters:
     - name: epoch-num
       type: integer
       default: 5
     - name: learning-rate
       type: float
       default: 0.001

Now, to access that .npz package in our code, we can find it under the Valohai inputs directory as my-mnist-dataset/mnist.npz.

🦈 Let's go back to our train.py and start by adding import numpy and then defining our Valohai input path, which will contain all the inputs Valohai has downloaded as per the configuration in valohai.yaml

Under your output_path variable definition, add the input_path in the same way by finding the value from the environment variables. Then define a variable that will contain the path to our input .npz file:

# Get the path to the folder where Valohai inputs are
input_path = os.getenv('VH_INPUTS_DIR')
# Get the file path of our MNIST dataset that we defined in our YAML
mnist_file_path = os.path.join(input_path, 'my-mnist-dataset/mnist.npz')

You can now remove the two lines that load up the sample MNIST Data:

mnist = tf.keras.datasets.mnist

and

(x_train, y_train), (x_test, y_test) = mnist.load_data()

🦈 The TensorFlow 2 Quickstart parses the MNIST data with its own function (x_train, y_train), (x_test, y_test) = mnist.load_data(), but as we've just downloaded the file ourselves, we'll use numpy to load it and define the train and test data sets.

with numpy.load(mnist_file_path, allow_pickle=True) as f:
    x_train, y_train = f['x_train'], f['y_train']
    x_test, y_test = f['x_test'], f['y_test']

πŸ”₯ Now you can run your new execution with vh exec run --adhoc train --learning-rate=0.1 --epoch-num=10 and you'll see... exactly the same results. What gives?

We actually didn't change anything else except define the input path and load our data from there, so the results shouldn't even change. However, you'll see on the details page the input we defined. And if you look at the logs, you'll notice that it's downloading the dataset to valohai/inputs.

πŸ’‘ In our sample we referenced a public dataset over HTTPS, but you can also reference your own files for example from Azure Storage, AWS S3 Buckets, Google Cloud Storage etc.

❗ Remember: Valohai, by design, doesn't take a copy of your data and store it. We keep track of the input data that you defined, so you can later on easily reproduce your steps, but it's up to you to do proper data versioning and ensure that the data source still exists.

✨ Valohai will make sure you're aware of changes in your input data. ✨ Imagine running your experiments and referencing an input data source, and then one day someone changes the dataset without telling you 😱 Suddenly you're getting different results for your experiments. Valohai will actually create an alert for an execution if it notices changes in the dataset you're referencing (by comparing the checksums and metadata of the file). This way you won't get those nasty surprises.

Clean data and image cache from worker machines

As you run your experiments, your Valohai executions are queued to different worker machines. Sometimes you get to run on the same machine as before, and that machine already has the dataset downloaded in its cache and can use it, skipping the download and making the execution faster.

Except that sometimes, you don't want that 😬 You want to make sure that your execution downloads a fresh dataset and/or a fresh Docker image. No worries, we got you. 🦈

You can define environment variables in the web UI or in your YAML file to instruct the worker machines to clear the cache. Just define a variable VH_NO_DATA_CACHE and/or VH_NO_IMAGE_CACHE to true and Valohai will obey. Read more about environment variables in our docs.

Valohai with TensorFlow 2 Quickstart - Part 6: Using custom public and private Docker Container Images

In the previous steps we've used a standard TensorFlow Docker image to run our code 🐳 It worked great for our MNIST sample but as you build your experiments, you might start gathering requirements for additional libraries, downloads or other dependencies.

In your YAML you can run multiple commands and install libraries that you're missing like below:

- step:
    name: Train MNIST model
    image: tensorflow/tensorflow:1.15.2-gpu-py3
    command: 
      - pip install mypackage
      - python train.py {parameters}

However, often it makes more sense to include those dependencies already in your Docker container, so you don't have to download them and run the same commands on every single execution.

There is a ton of documentation online about Docker images but we'll be brief here.

  • 🐳 In a Dockerfile you describe what you need in your application. Here you write all the commands (like installation and updates) that should be run to define your image.
  • πŸ”¨ You don't have to start from scratch - you can base your image on an existing Docker image and then just add on top of that the features you need. Keep in mind that generally speaking you'll want to keep only the required libraries in your image - a smaller image tends to mean that you get faster build and deploy times.
  • πŸ“ We'll use a couple of commands to define our image, download libraries and updates and then run pip install -r requirements.txt to get our required libraries in.
  • We'll then build the image, tag it and push it to a Docker Image repository. You can either store the Docker container as a public image, or store it as a private image that only you can access.

πŸ’‘πŸ¦ˆ Valohai currently supports private image repositories from Docker Hub and Azure Container Registry. You'll just need to define your repository and credentials in your organisation settings to allow Valohai to download private Docker Images.

You can find more information about building your own Docker Images on our docs or across the interwebs.

Build your Dockerfile

❗ Before we begin, make sure you install Docker on your machine.

Start by creating a requirements.txt where we'll list all the Python libraries required to run our code. Right now this will be very simple: just add tensorflow-gpu==1.15.2. That's our only requirement for now. In the next part of this tutorial we'll be looking at more requirements, when we publish an endpoint for our digit prediction.

After that highly complex requirements list we'll create the actual Dockerfile called...Dockerfile. You can save this in the same folder as the rest of the files we've created in this tutorial.

# We'll use the nvidia/cuda image as our base
FROM nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04

# Set some common environment variables that Python uses
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

# Install lower level dependencies
# Run newest updates, install python etc.
RUN apt-get update --fix-missing && \
    apt-get install -y curl python3 python3-pip && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3 10 && \
    update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 10 && \
    apt-get clean && \
    apt-get autoremove && \
    rm -rf /var/lib/apt/lists/*

# Define our working directory
WORKDIR  /usr/src/valohai-tf2-quickstart

# Installing python dependencies by copying the requirements.txt to our workdir, upgrading pip and then installing our requirements
COPY requirements.txt .
RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt

You can find the base image we're using on Docker Hub as nvidia/cuda.

β—πŸ³ Make sure you have registered a Docker Hub account or have access to an Azure Container Registry.

Build and push your Docker image

Now we'll build your Docker image and tag it appropriately. Go to the command line, navigate to the folder where your Dockerfile is, then run the docker build --tag myaccount/name:tag . command, for example:

docker build --tag drazend/valohai-tf2-quickstart:0.0.3 .

The last . just tells Docker to build the image based on the Dockerfile in the current directory.

Next we'll push your image to the registry, so you can use it outside of your own machine. First you'll need to log in:

docker login --username=yourhubusername

If you're logging in to an Azure Container Registry, you would use something like docker login myregistry.azurecr.io

πŸ”₯ After you've successfully logged in, you can push your image with a simple docker push myaccount/name:tag, for example:

docker push drazend/valohai-tf2-quickstart:0.0.3

πŸ’‘ You can create private repositories on Docker Hub, if you don't want your images to be publicly accessible. Or you can use Azure Container Registry, which stores and manages private Docker container images.

Once the Docker image has been uploaded, we can start using it by replacing the standard TensorFlow image with our new custom image in our valohai.yaml.

Access Private Docker Repositories from Valohai

Organisations on Valohai can easily use private Docker Repositories from Docker Hub or Azure Container Registry.

πŸ”‘ First you'll need to create an access token that we'll use to permit Valohai to pull your private Docker Container Images. Follow the instructions for Docker Hub or Azure Container Registry to generate access credentials.

Once you have your credentials, head on over to http://app.valohai.com and go to your organisation settings (click on your name on the top right and select Manage organisation-name). Go to the registries tab and Add new entry. Here the name would be something like docker.io/valohai/* or valohai.azurecr.io/*, and the username & password are the ones you've previously generated.

That's it! πŸŽ‰πŸŽ‰

πŸ”₯ Now you can start using private Docker repositories in your Valohai executions. Just mark down in your valohai.yaml the new image you'd like to use. Make sure you use the full name like docker.io/user/name:tag.

Valohai with TensorFlow 2 Quickstart - Part 7: Deploy a model for online inference

Valohai makes it easy to publish your model for online inference through a Kubernetes cluster. By default the cluster is hosted by Valohai, but it can also be set up in your own environment and cluster.

In this tutorial we'll deploy the model using WSGI, a specification that describes how a web server communicates with web applications. We'll define the endpoint in our valohai.yaml and then write the code that will take an input (an image of a handwritten number) and use our MNIST predictor to predict what number is in the image.

Start by going to your valohai.yaml and at the bottom define a new endpoint. Usually you'll have one endpoint per prediction that you want to make.

  • name: Give it a name like digit-predict. This is just used for you to identify the endpoint later on in the web UI etc.
  • description: Not surprisingly, this is the description of your endpoint, like "predict digits from image inputs"
  • image: This is the Docker image you want to use; it should contain the libraries and tools you need to run your prediction service.
  • wsgi: Here you'll define what the server should execute. The format is filename:method
  • files: When you go to deploy a model on Valohai you'll be presented with an option to provide files to it. In our case, we'll be passing it a model.h5 file that we've previously trained.
    • name: name of the file (for example "prediction model")
    • description: You guessed it, here you can describe what the file you're looking for does.
    • path: where this model will be stored. We'll use this path to load it in our application.

πŸ’‘ As of right now, the deployments don't support private Docker image repositories, so you'll have to use a public image for this.

🦈 Add an endpoint to your valohai.yaml like below. Notice that we're using a custom built Docker image. Make sure this points to the image you published earlier.

---

- step:
    name: Train MNIST model
    image: tensorflow/tensorflow:1.15.2-gpu-py3
    command: python train.py {parameters}
    inputs:
      - name: my-mnist-dataset
        default: https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
    parameters:
     - name: epoch-num
       type: integer
       default: 5
     - name: learning-rate
       type: float
       default: 0.001
- endpoint:
    name: digit-predict
    description: predict digits from image inputs
    image: docker.io/myaccount/name:tag
    wsgi: predict:mypredictor
    files:
      - name: model
        description: Model output file from TensorFlow
        path: model.h5

Next we'll start creating a Python script that will do the prediction. We'll use the werkzeug WSGI utility library to help us create our web app.

πŸ’‘ What if I don't want to use WSGI? In valohai.yaml you can also define a server-command and a port to run any Python script instead of defining wsgi. You can then run what you need, for example server-command: python runmyserver.py. You can run multiple server-commands by chaining them together like server-command: dostuffandthings && python runmyserver.py. See more details on our docs.

Create a new file called predict.py and start by creating a simple Hello World following the example from the Werkzeug homepage:

from werkzeug.wrappers import Request, Response

# Define the main function that Valohai will call to do the prediction
def mypredictor(environ, start_response):
    # Create a new response object
    response = Response("Hello world!") 
    # Send back our response
    return response(environ, start_response)

# We run this piece of code, if we're directly executing this file. This way we can locally test the functionality
if __name__ == "__main__":
    from werkzeug.serving import run_simple
    # Run a local server on port 8000. Once we get a request there, execute the mypredictor function declared above
    run_simple("localhost", 8000, mypredictor)

Now you can test your app locally by running python predict.py. And πŸ’₯BAMπŸ’₯ it failed saying it can't find werkzeug. That's because it's not installed in our environment by default, so we'll need to install it with pip install werkzeug.

πŸ”₯ Now re-run python predict.py and you'll see a webserver starting. Navigate to http://localhost:8000/ to get your response.

Read an image from the request

To do our predictions we'll need an image sent to our inference service. Our predict.py should read that image from the request it has received, pass it to our prediction and then respond with the predicted digit value. So let's write some magic ✨ to read the image that's sent to us.

Define a new function called read_input that takes in the request object, before your mypredictor function.

def read_input(request):
    # Ensure that we've received a file named 'image' through POST
    # If we have a valid request proceed, otherwise return None
    if request.method == 'POST' and 'image' in request.files:
        photo = request.files['image']
        # Save file to memory
        in_memory_file = io.BytesIO()
        photo.save(in_memory_file)
        # Read the file bytes
        data = numpy.frombuffer(in_memory_file.getvalue(), dtype=numpy.uint8)
        # Use OpenCV to read the image as grayscale
        img = cv2.imdecode(data, cv2.IMREAD_GRAYSCALE)
        # Resize the image to 28x28 with OpenCV
        img = cv2.resize(img, (28,28))
        return img

    return None

Now update your mypredictor function to use the new read_input function. Based on the value we get from read_input (the image or None) we send a different response.

def mypredictor(environ, start_response):
    # Get the request object from the environment
    request = Request(environ)

    # Get the image file from our request
    inputfile = read_input(request)

    # If read_input didn't find a valid file
    if(inputfile is None) :
        response = Response("\nNo image", content_type='text/html')
        return response(environ, start_response)

    response = Response("\nWe got an image!") 
    return response(environ, start_response)

Run python predict.py to test your syntax locally before sending it to Valohai.

If you navigate to http://localhost:8000 you'll get a response that no image was sent to the service. Luckily there is an easy way to test whether your service is able to read the image file sent to it. You can do this either with curl or Postman.

Here's an example of using curl to send a POST request with a file (7.png) to the server. Open a new Terminal/Shell and write the following command (while keeping your Python server running in the other window):

curl -X POST -F "image=@7.png" localhost:8000/

πŸ’₯BAMπŸ’₯ another error. This time it's saying NameError: name 'io' is not defined. As we were writing our read_input function we used io to read the bytes, and numpy and OpenCV to decode and resize the image. We'll need to install the new packages with pip install numpy opencv-python and then add the right imports at the top of our file:

import io
import numpy
import cv2

Use TensorFlow to predict the value

The last part we'll need to edit in our code is the actual TensorFlow prediction.

Based on the TensorFlow 2 documentation, we'll need to use tf.keras.models.load_model to load the model we created earlier and then pass our image to model.predict_classes, which will return the predicted class.

β¬‡οΈπŸ“„To test the model locally we'll need to download a model.h5 file from the outputs of one of your previous app.valohai.com executions. Go the web UI, navigate to a execution and from the outputs download a model.h5 and move it to the same folder as predict.py.

Now in your predict.py add import tensorflow as tf at the top of your file. Then let's start editing mypredictor to load the model and predict values. Add the following lines after checking for inputfile is None and before creating the Response.

    # Load our model
    model_path = 'model.h5'
    new_model = tf.keras.models.load_model(model_path)

    # Use our model to predict the class of the file sent over a form.
    # We're reshaping the model as our model is expecting 3 dimensions (with the first one describing the number of images)
    prediction = new_model.predict_classes(inputfile.reshape(1,28,28))

Now that we have our prediction we can send a response with the predicted digit. We'll do this by sending a JSON response, so add import json at the top of your file and then replace your current response with

    # Generate a JSON output with the prediction
    json_response = json.dumps({'Predicted_Digit': int(prediction[0])})

    # Send a response back with the prediction
    response = Response(json_response, content_type='application/json') 
    return response(environ, start_response)

Now you can run your curl command again and you'll get a JSON response with the predicted class! ✨✨

Upload a deployment to Valohai

Getting your new prediction service to Valohai is really easy: you'll just need to push the new changes to your code repository, or run an --adhoc execution and upload the files from your working directory to Valohai.

β—πŸ³ Make sure your Docker container is ready. In the previous section of this tutorial we created a custom Docker container and pushed it to the repository. Then in our YAML we told the deployment endpoint to use that Docker Image - but as it is right now our execution on Valohai will fail πŸ’₯

As you know we did a couple of pip install commands and got new dependencies in our predict.py code. These packages are not in our Docker image currently. How do we get them there? Simple. Update the requirements.txt that your Dockerfile is using with the new libraries we need:

tensorflow-gpu==1.15.2
numpy
werkzeug==0.16.1
opencv-python==4.2.0.32
pillow

We'll also need to update the Dockerfile to install some additional libraries on the container image. In your Dockerfile, update the apt install with the new packages needed to run the code:

# Install lower level dependencies
# Run newest updates, install python etc.
RUN apt-get update --fix-missing && \
   apt-get install -y curl python3 python3-pip && \
   update-alternatives --install /usr/bin/python python /usr/bin/python3 10 && \
   update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 10 && \
   apt install -y libsm6 libxext6 libxrender-dev && \
   apt-get clean && \
   apt-get autoremove && \
   rm -rf /var/lib/apt/lists/*

Then just run docker build --tag myaccount/name:tag . and docker push myaccount/name:tag again to get a new version into your repository.

πŸ’‘ You might want to change the tag for this new version, so put it as myaccount/name:0.0.2 for example

Now just update your valohai.yaml with the new name:tag of the image to use for the endpoint, so it knows to use a Docker image with all the required libraries.

If your project is connected to a repository you can go and click Fetch repository in the web UI. Or if you haven't connected a repo yet (or you just want to), use vh exec run --adhoc train to upload your new files.

Go to http://app.valohai.com, navigate to your project and the Deployment-tab. There create a new deployment with the name of your choice (valohai-tf2-quickstart for example) and leave the cluster as Default.

πŸ’‘ Read more about Deployments in our docs.

Now create a new version for your deployment. Here you'll see that it pulled information from your YAML to show what kind of endpoints you have and what kind of files you need to upload for them.

🦈 Click the checkbox to enable digit-predict and from the dropdown select your latest model.h5. Then click create version and let the magic happen ✨✨

You'll see that Valohai will start by pulling your custom Docker Image, building the image and then pushing it to the Kubernetes cluster. After it's pushed you'll see it starting up (Pending) and once it's at 100% Available you can start using your online inference!

😒 I got a Bad Gateway or 500 error. That usually means there is some issue with your code. Make sure it's running correctly on your own machine, debug any issues and make sure you've defined your custom Docker container properly in your valohai.yaml.

πŸ”₯ Now you can click on the link and you'll see the familiar "No Image" response. To test a deployment with an image you can click the "Test Deployment" button and then select your endpoint, set a POST request and add a field image with the sample image file.

And tada, you got your response back! πŸŽ‰πŸŽ‰ Now pat yourself on the back. Job. Well. Done. ✨🦈✨🦈✨

πŸ’‘ What if I want to do batch inference and evaluate a set of files at the same time, and maybe we only do it once a week? I don't want to run a whole Kubernetes cluster for that. Have no fear - for that purpose the recommended approach is to create a new step in your valohai.yaml and create a new Python file that takes all the inputs (like the MNIST data and the model.h5), runs predictions and outputs the results. A rough sketch of such a script is shown below.
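
Here's a heavily hedged sketch of what such a batch inference script could look like (the input names 'model' and 'images', the .npz format and the file names are all hypothetical; adapt them to your own step definition in valohai.yaml):

# batch_predict.py - illustrative batch inference step
import json
import os

import numpy
import tensorflow as tf

inputs_dir = os.getenv('VH_INPUTS_DIR', '.')
outputs_dir = os.getenv('VH_OUTPUTS_DIR', '.')

# Load the previously trained model, passed in as an input named 'model' (hypothetical)
model = tf.keras.models.load_model(os.path.join(inputs_dir, 'model', 'model.h5'))

# Load a batch of 28x28 grayscale images, passed in as an input named 'images' (hypothetical .npz file)
with numpy.load(os.path.join(inputs_dir, 'images', 'images.npz'), allow_pickle=True) as f:
    images = f['x']

# Apply the same scaling as in training and predict a class for each image
images = images / 255.0
predictions = model.predict_classes(images.reshape(-1, 28, 28))

# Write the predictions to the outputs directory so Valohai uploads them to cloud storage
with open(os.path.join(outputs_dir, 'predictions.json'), 'w') as out:
    json.dump([int(p) for p in predictions], out)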

Future

  • Batch inference
  • GitHub Repo
  • Pipelines
  • Notebooks