TensorFlow Best Practices as of Q1 2018

By Adam Anderson

adam.b.anderson.96@gmail.com

Preface

This write-up assumes you have a general understanding of the TensorFlow programming model, but that you may not have kept up to date with the latest library features and standard practices.

The goal of this guide is to walk through what appear to be the current TensorFlow best practices using newer library features, most notably the Dataset API, Estimators, and tf.keras Models.

Documentation and tutorials written in the past few years differ on best practices; even six-month-old tutorials reference objects and functions that have since been deprecated. I hope this is a sufficiently comprehensive summary of what's useful to know as of Q1 2018.

Overview

More recent versions of TensorFlow have added modules that make it easier to set up and train machine learning models. These include the Dataset, Layers, Estimator, and Metrics APIs.

The general workflow is the following:

  1. Data Handling with Dataset API - tf.data allows you to create an input pipeline that aggregates data and preprocesses it before outputting data for input to the model. Data handling is achieved by creating a tf.data.Dataset for preprocessing and constructing batches, and then creating a tf.data.Iterator for iterating through the batches. This replaces previous input APIs using feed_dict or queue-based pipelines (Chengwei, Oct 2017).

  2. Additional preprocessing with Feature Columns - In some cases, we may want to play around with how data in a Dataset is presented to the deep learning model - for example, converting categorical data into a one-hot encoding, or discretizing values into bins. The tf.feature_column module provides functions that perform such transformations. Feature Columns are not particularly useful for handling image data, so I'm not going to talk about them in detail here. Instead, refer to Feature Columns Getting Started.

  3. Model Specification with tf.layers - The tf.layers module makes it easy to stack layers onto a neural network. Now, it's even possible to use the Keras API to construct a model using tf.keras. Because tf.layers and tf.keras share core data structures, you can use the Keras API where it is convenient while also using core TensorFlow where necessary (Google Developers, Feb 2017).

  4. Abstraction of training/evaluation/prediction using Estimators - The tf.estimator.Estimator object abstracts training, evaluation, inference, exporting, etc. It is defined using a model_fn to perform necessary computation. When the model is going to be trained/evaluated, an input_fn is provided to get the input to the model in a standard format. Using the up-to-date pipeline, we can set up our tf.data.Dataset in the input_fn, then construct the model and specify training/evaluation behavior in the model_fn. To evaluate the model, we can use tf.metrics, which provides ops to calculate common metrics. As mentioned above, the Keras API has been integrated with TensorFlow, meaning you could use the abstractions offered by the Keras Model object when specifying the model architecture. Keras Models offer their own training/evaluation abstractions, but TensorFlow Estimators can be integrated into more robust and flexible deep learning workflows, so we're going to assume you want to use one of them. (A minimal end-to-end sketch of this pipeline follows this list.)
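To make this workflow concrete, here is a minimal end-to-end sketch on toy data. Everything here (the feature name 'x', the single dense layer, and the hyperparameters) is an arbitrary illustration rather than part of the MNIST example discussed later, and this model_fn only handles training; the full three-mode pattern appears later in this guide.

import numpy as np
import tensorflow as tf

def train_input_fn():
    # Toy data: 100 examples with 4 features each, and binary labels.
    features = {'x': np.random.rand(100, 4).astype(np.float32)}
    labels = np.random.randint(0, 2, size=(100,)).astype(np.int32)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    # Shuffle, repeat for 5 epochs, and batch.
    return dataset.shuffle(100).repeat(5).batch(10)

def model_fn(features, labels, mode):
    # A single dense layer producing 2-class logits.
    logits = tf.layers.dense(features['x'], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn)
estimator.train(input_fn=train_input_fn)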

Note that tf.contrib.learn.Experiment once integrated training and evaluation functionality for an Estimator, but a Google Groups post by Martin Wicke from 9/25/2017 indicated that it would be deprecated in favor of tf.estimator.train_and_evaluate().

Specifics

For an in-depth overview, refer to the TensorFlow Importing Data Programmer's Guide. Here, we summarize information relevant to using tf.data as part of a pipeline that includes a tf.estimator.Estimator.

Dataset Creation

The input to an Estimator must be in the form (feature_dict, label). When we want to train/evaluate/predict using an Estimator, we define an input_fn that returns data in this format. Specifically, the input_fn must return one of the following (as paraphrased from the input_fn description for Estimator.evaluate()):

  • A tf.data.Dataset object that outputs (features, labels) pairs
  • A (features, labels) tuple

Thus, when using the Dataset API, we want to structure the Dataset so that we can easily extract data in this form.

In the Datasets Quick Start, we see the example input function:

def train_input_fn(features, labels, batch_size):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)

    # Build the Iterator, and return the read end of the pipeline.
    return dataset.make_one_shot_iterator().get_next()

In tf.data.Dataset.from_tensor_slices(tensors), tensors can be an iterable of tensors, an iterable of Datasets, or even a dict of the form {"tensor_name": tensor}. The Dataset in the example input_fn contains (features, label) pairs, where features is a dict mapping each feature name to a tensor of values. We use from_tensor_slices() because it treats the first axis as the example index. For MNIST, this means the training data of shape (60000, 28, 28) is treated as 60,000 images, each 28x28.
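As a small illustration, here is a sketch with zero-filled stand-in arrays shaped like MNIST (not real data):

import numpy as np
import tensorflow as tf

# Stand-in arrays with MNIST's shapes: 60,000 grayscale 28x28 images.
images = np.zeros((60000, 28, 28), dtype=np.float32)
labels = np.zeros((60000,), dtype=np.int32)

dataset = tf.data.Dataset.from_tensor_slices(({'image': images}, labels))
# Each element is one example: a dict of features plus a scalar label.
print(dataset.output_shapes)  # ({'image': (28, 28)}, ())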

We can manipulate the dataset using the preprocessing functions. For examples, see Datasets Quick Start or TensorFlow Importing Data Programmer's Guide.

  • map(map_func, num_parallel_calls=None) - Applies map_func to each element. To apply a function that uses non-TensorFlow logic, wrap it in tf.py_func(). (See the sketch after this list.)
  • filter(predicate) - Keeps only the elements for which predicate returns true.
  • zip(datasets) - Used to produce the final dataset fed to the model with tf.data.Dataset.zip((features, labels)).
  • concatenate(dataset) - Appends another dataset.
  • shuffle(buffer_size, seed=None, reshuffle_each_iteration=None) - Larger buffer sizes result in better randomness, while smaller sizes use less memory.
  • repeat(count=None) - Repeats the dataset indefinitely, or a specified number of times.
  • batch(batch_size) - Provides one batch of inputs per iteration.
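For instance, here is a minimal sketch chaining some of these transformations onto a dataset of (features_dict, label) elements like the one above; the normalization function and the label filter are arbitrary examples:

def normalize(features_dict, label):
    # Scale pixel values from [0, 255] down to [0, 1].
    features_dict['image'] = features_dict['image'] / 255.0
    return features_dict, label

dataset = dataset.map(normalize, num_parallel_calls=4)
# Keep only the examples whose label is nonzero.
dataset = dataset.filter(lambda features_dict, label: tf.not_equal(label, 0))
dataset = dataset.shuffle(1000).repeat().batch(32)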

Iterators

Iterators provide access to elements of a dataset. The tf.data API supports several iterator types, but only one-shot iterators are usable with tf.estimator.Estimator, so we'll only worry about those.

A one-shot iterator makes a single pass through the Dataset and is created using Dataset.make_one_shot_iterator(). To iterate through multiple epochs, just apply Dataset.repeat(num_epochs) before creating the Iterator.

Example Usage:

dataset = dataset.shuffle(buffer_size)
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()

features, labels = iterator.get_next()

A common pattern is to wrap the training loop in a try-except block that catches the end of the iterator:

with tf.Session() as sess:
    while True:
        try:
            batch_features, batch_labels = sess.run([features, labels])
        except tf.errors.OutOfRangeError:
            # Raised once the one-shot iterator is exhausted.
            break

In the MNIST example below, we see how to use an instantiated Dataset object in an Estimator's input_fn.

Estimators

In previous TensorFlow versions, to train the model you would have to write a training loop that got a TensorFlow session and called sess.run() on the optimization step. Evaluation and inference were achieved by calling sess.run() on the model's output op, which would then be used to calculate evaluation metrics.

Estimators abstract this process, and also make it easier to run models on different hardware, export models for sharing, save checkpoints/TensorBoard summaries, etc.

The tf.estimator.Estimator constructor accepts a model_fn with the signature model_fn(features, labels, mode, config) (which is why the input_fn needs to output data as (features, labels) pairs). Within the model_fn, the model is created and we specify how to produce output from input of the form (features, labels).

I think the best way to see how an Estimator is specified is to look at the MNIST example below. I summarize the important points here, but they may be hard to follow without the specific example code.

The standard pattern is for the architecture specification to output logits (pre-softmax activations), and we may add additional ops depending on whether we are in train/evaluate/predict mode (see Implementing training, evaluation, and prediction). tf.estimator.ModeKeys values are used to check what the mode is. In all cases, we return a tf.estimator.EstimatorSpec object containing the relevant information. For example, in predict mode, the EstimatorSpec wraps the predictions and class probabilities; in train mode, it wraps the loss and optimization op; in eval mode, it wraps tf.metrics ops. As noted in the documentation for tf.estimator.EstimatorSpec, the train_op is ignored when not in train mode. Here is an example pattern from that documentation:

def my_model_fn(mode, features, labels):
  if (mode == tf.estimator.ModeKeys.TRAIN or
      mode == tf.estimator.ModeKeys.EVAL):
    loss = ...
  else:
    loss = None
  if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = ...
  else:
    train_op = None
  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = ...
  else:
    predictions = None

  return tf.estimator.EstimatorSpec(
      mode=mode,
      predictions=predictions,
      loss=loss,
      train_op=train_op)

The point of checking tf.estimator.ModeKeys is that ops specific to training/evaluation/prediction are added separately from the actual architecture specification.

For an introduction, see the Estimators Programmer's Guide. For a more detailed guide, see Creating Custom Estimators. Note that both of these guides discuss feature columns, which are not useful for image data.

TensorFlow also offers the tf.estimator.train_and_evaluate() function, which encapsulates training and evaluation. The signature is

tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)

where train_spec is a tf.estimator.TrainSpec object (which encapsulates the training input_fn and the max_steps for training) and eval_spec is a tf.estimator.EvalSpec object (which encapsulates the eval input_fn and logging hooks).
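A minimal sketch of its use, assuming estimator, train_input_fn, and eval_input_fn are defined as elsewhere in this guide (the max_steps value is arbitrary):

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=20000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)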

tf.metrics

The tf.metrics module contains functions for evaluation-related metrics, including accuracy, recall, precision, true/false positives/negatives, RMSE, etc.

Ex: tf.metrics.accuracy(labels, predictions, weights=None)
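One detail worth knowing: tf.metrics functions return a (value, update_op) pair, where the update op accumulates statistics across batches and the value op reads the current aggregate. A small runnable sketch:

import tensorflow as tf

labels = tf.constant([1, 0, 1, 1])
predictions = tf.constant([1, 0, 0, 1])
accuracy, update_op = tf.metrics.accuracy(labels=labels, predictions=predictions)

with tf.Session() as sess:
    # Metric state lives in local variables, which must be initialized.
    sess.run(tf.local_variables_initializer())
    sess.run(update_op)        # accumulate counts for this "batch"
    print(sess.run(accuracy))  # 0.75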

Walkthrough of Best Practices in TensorFlow's MNIST Example

To highlight what appears to be current state-of-the-art TensorFlow usage, we're going to examine some code from the TensorFlow MNIST example. All relevant code is excerpted below, because the source includes a lot more than we care about (argument parsing, multi-GPU training, etc.).

Code below is excerpted from mnist.py. Actual specification of the tf.data.Dataset object used is in dataset.py.

Architecture Specification

Quick Note

The part of this guide discussing the use of a tf.keras.Model subclass to encapsulate the architecture specification will only work in TensorFlow r1.7. The Keras API does not currently allow this sort of subclassing, as I documented in this GitHub issue.

There is a workaround, which isn't complicated at all -- just add a tf.keras.layers.Input object to Model.__init__(), move the layer chaining from Model.__call__() to Model.__init__(), and replace the call to super(Model, self).__init__() with

super(Model, self).__init__(inputs=inputs, outputs=outputs)

where inputs is the tf.keras.layers.Input and outputs is the output of the final layer. The architecture itself is still specified using the Keras API.
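Here is a minimal sketch of that workaround, with the layer stack abbreviated from the MNIST model below and channels_last assumed for brevity:

class Model(tf.keras.Model):

  def __init__(self):
    # Chain the layers at construction time instead of in __call__().
    inputs = tf.keras.layers.Input(shape=(28, 28, 1))
    y = tf.keras.layers.Conv2D(32, 5, padding='same', activation='relu')(inputs)
    y = tf.keras.layers.MaxPooling2D((2, 2), (2, 2), padding='same')(y)
    y = tf.keras.layers.Flatten()(y)
    outputs = tf.keras.layers.Dense(10)(y)
    super(Model, self).__init__(inputs=inputs, outputs=outputs)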

Architecture Encapsulation using tf.keras.Model

Here, we use Keras-style syntax to pass inputs into the network, with the output of each layer feeding into the next. The value returned by __call__() is the so-called logits tensor of pre-softmax activations, which can be argmax-ed to determine predictions or softmax-ed to get a probability distribution over the class labels.

It is important to note that a Keras Model object can only wrap models constructed using the layers API. An exception is thrown if you try to wrap a model that does a computation directly on a tensor (for example, tf.concat cannot be wrapped by a tf.keras.Model; you would need to use tf.keras.layers.Concatenate). This is similar to the problem discussed in this GitHub issue.
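For example, in this two-branch sketch (with arbitrary layer sizes), the Concatenate layer keeps the graph wrappable while a raw tf.concat would not:

inputs = tf.keras.layers.Input(shape=(8,))
branch_a = tf.keras.layers.Dense(4)(inputs)
branch_b = tf.keras.layers.Dense(4)(inputs)

# A layer object keeps the graph wrappable by tf.keras.Model...
merged = tf.keras.layers.Concatenate(axis=-1)([branch_a, branch_b])
model = tf.keras.Model(inputs=inputs, outputs=merged)

# ...whereas the raw op tf.concat([branch_a, branch_b], axis=-1) would make
# the resulting tensor unusable as a Model output.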

Below is the relevant code. We see that the __init__() method creates tf.layers objects specifying the layers that the network will have, but the layers are not chained together until the __call__() function is invoked.

class Model(tf.keras.Model):

  def __init__(self, data_format):
    super(Model, self).__init__()
    if data_format == 'channels_first':
      self._input_shape = [-1, 1, 28, 28]
    else:
      assert data_format == 'channels_last'
      self._input_shape = [-1, 28, 28, 1]

    self.conv1 = tf.layers.Conv2D(
        32, 5, padding='same', data_format=data_format, activation=tf.nn.relu)
    self.conv2 = tf.layers.Conv2D(
        64, 5, padding='same', data_format=data_format, activation=tf.nn.relu)
    self.fc1 = tf.layers.Dense(1024, activation=tf.nn.relu)
    self.fc2 = tf.layers.Dense(10)
    self.dropout = tf.layers.Dropout(0.4)
    self.max_pool2d = tf.layers.MaxPooling2D(
        (2, 2), (2, 2), padding='same', data_format=data_format)

  def __call__(self, inputs, training):
    y = tf.reshape(inputs, self._input_shape)
    y = self.conv1(y)
    y = self.max_pool2d(y)
    y = self.conv2(y)
    y = self.max_pool2d(y)
    y = tf.layers.flatten(y)
    y = self.fc1(y)
    y = self.dropout(y, training=training)
    return self.fc2(y)

Estimator model_fn

To create the tf.estimator.Estimator object, a model_fn is specified. First, the Model class defined above is instantiated. Then, we extract the image to feed into the model. After that, if statements check whether the model is in training, prediction, or evaluation mode. In every mode, note that we get the logits via logits = model(image, training=...). We then add different TensorFlow ops depending on the mode.

In training mode, the loss and train_op are created and included in the EstimatorSpec. In prediction mode, predicted-class and probability ops are added after the logits layer and returned in the EstimatorSpec. In evaluation mode, a loss op is added along with tf.metrics.accuracy.

def model_fn(features, labels, mode, params):
  model = Model(params['data_format'])
  image = features
  if isinstance(image, dict):
    image = features['image']
    
  if mode == tf.estimator.ModeKeys.TRAIN:
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
    logits = model(image, training=True)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    return tf.estimator.EstimatorSpec(
        mode=tf.estimator.ModeKeys.TRAIN,
        loss=loss,
        train_op=optimizer.minimize(loss, tf.train.get_or_create_global_step()))

  if mode == tf.estimator.ModeKeys.PREDICT:
    logits = model(image, training=False)
    predictions = {
        'classes': tf.argmax(logits, axis=1),
        'probabilities': tf.nn.softmax(logits),
    }
    return tf.estimator.EstimatorSpec(
        mode=tf.estimator.ModeKeys.PREDICT,
        predictions=predictions,
        export_outputs={
            'classify': tf.estimator.export.PredictOutput(predictions)
        })
                
  if mode == tf.estimator.ModeKeys.EVAL:
    logits = model(image, training=False)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    return tf.estimator.EstimatorSpec(
        mode=tf.estimator.ModeKeys.EVAL,
        loss=loss,
        eval_metric_ops={
            'accuracy':
                tf.metrics.accuracy(
                    labels=labels,
                    predictions=tf.argmax(logits, axis=1)),
        })

Estimator Instantiation and input_fn's

In the main() function, the mnist_classifier object is created as a tf.estimator.Estimator using the model_fn specified above.

To train the model, use mnist_classifier.train(input_fn=train_input_fn). To evaluate the model, use eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn). To export the model, use mnist_classifier.export_savedmodel(FLAGS.export_dir, input_fn), where the input_fn calls tf.estimator.export.build_raw_serving_input_receiver_fn().

def main(unused_argv):
  model_function = model_fn

  data_format = FLAGS.data_format

  mnist_classifier = tf.estimator.Estimator(
      model_fn=model_function,
      model_dir=FLAGS.model_dir,
      params={
          'data_format': data_format,
          'multi_gpu': FLAGS.multi_gpu
      })

  # Train the model
  def train_input_fn():
    ds = dataset.train(FLAGS.data_dir)
    ds = ds.cache().shuffle(buffer_size=50000).batch(FLAGS.batch_size).repeat(
        FLAGS.train_epochs)
    return ds

  # Set up training hook that logs the training accuracy every 100 steps.
  tensors_to_log = {'train_accuracy': 'train_accuracy'}
  logging_hook = tf.train.LoggingTensorHook(
      tensors=tensors_to_log, every_n_iter=100)
  mnist_classifier.train(input_fn=train_input_fn, hooks=[logging_hook])

  # Evaluate the model and print results
  def eval_input_fn():
    return dataset.test(FLAGS.data_dir).batch(
        FLAGS.batch_size).make_one_shot_iterator().get_next()

  eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
  print('Evaluation results:\n\t%s' % eval_results)

  # Export the model
  if FLAGS.export_dir is not None:
    image = tf.placeholder(tf.float32, [None, 28, 28])
    input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
        'image': image,
    })
    mnist_classifier.export_savedmodel(FLAGS.export_dir, input_fn)

Sources of Confusion

With the introduction of new high level APIs and the integration of Keras, there are a lot of degrees of freedom regarding how to specify a deep learning workflow.

Architecture Specification

tf.layers and tf.keras share data structures, and objects in tf.layers are identical to those in tf.keras.layers. The functional approach to network construction is to pass tensors into layer-creating functions like tf.layers.conv2d(). The object-oriented (Keras) approach is to create objects representing layers, like tf.layers.Conv2D, and then pass tensors in by calling those objects on the tensors (as in the MNIST example above). You also have the option of using lower-level TensorFlow functions that predate tf.layers and tf.keras.layers. If you really wanted to, you could specify one layer using low-level TensorFlow, define the next using tf.layers functions, and define the one after that using tf.keras.layers objects.
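A side-by-side sketch of the two styles (x stands in for any batch of images):

x = tf.placeholder(tf.float32, [None, 28, 28, 1])

# Functional style: the layer is created and applied in a single call.
y1 = tf.layers.conv2d(x, filters=32, kernel_size=5, activation=tf.nn.relu)

# Object-oriented (Keras) style: create a layer object, then call it on a tensor.
conv = tf.layers.Conv2D(filters=32, kernel_size=5, activation=tf.nn.relu)
y2 = conv(x)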

Workflow Abstraction

Once the architecture is specified, you may want to wrap it in a tf.keras.models.Model, i.e. a Keras Model. This makes it easier to modularize architecture specification separate from the rest of the pipeline. Keras Model objects have their own functions for training/evaluation/prediction. Alternatively, you can wrap the Model object instantiation inside a model_fn and use the Estimator training/evaluation/prediction workflow. The Keras Model object workflow is easier to set up, but it is less flexible than the TensorFlow Estimator setup. You can use an Estimator regardless of whether or not you encapsulate the architecture specification in a Model object.
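For contrast, the Keras-native workflow looks roughly like the following sketch, assuming model is a tf.keras Model ending in a softmax layer and train_images/train_labels, test_images/test_labels are NumPy arrays (the optimizer, batch size, and epoch count are arbitrary):

# Keras Model objects bundle their own training loop:
model.compile(optimizer=tf.train.AdamOptimizer(1e-4),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, batch_size=32, epochs=5)

# Evaluation and inference are single method calls as well.
loss_and_metrics = model.evaluate(test_images, test_labels)
predictions = model.predict(test_images)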

Input Pipeline

For the input pipeline, you have the option of using traditional TensorFlow: an explicit for-loop that gets a TensorFlow session object and calls sess.run(), feeding inputs into placeholders via a feed_dict. TensorFlow also has its more complicated, parallelized queue-based input pipeline. Now you can also use the Dataset API, which is intended to supplant both of these older options, with the caveat that while tf.data offers several iterator types, only one-shot iterators are compatible with the Estimator workflow.
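As a reminder of what the feed_dict pattern looks like, here is a toy sketch with an arbitrary computation standing in for a real model:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 4])
mean_square = tf.reduce_mean(tf.square(x))

with tf.Session() as sess:
    for _ in range(3):
        batch = np.random.rand(10, 4).astype(np.float32)
        # Each step feeds one batch of inputs into the placeholder.
        print(sess.run(mean_square, feed_dict={x: batch}))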

Conclusion

TensorFlow now has a lot of interoperable ways to do the same thing at different stages of the deep learning workflow. My impression is that the best practice at the moment is to use the Dataset API for input handling, either the functional tf.layers API or the object-oriented Keras API (in tf.layers or tf.keras.layers) for architecture specification, and a tf.estimator.Estimator for workflow abstraction. The choice of using a tf.keras.models.Model object seems arbitrary but useful for encapsulation, with the caveat that it is only being used for encapsulation while the Estimator handles the workflow abstractions.

Things I didn't Talk about Here that Might Be Important

tf.estimator.WarmStartSettings - Initializing a model's weights from a checkpoint file.

tf.train.MonitoredTrainingSession - Training in a distributed setting.

Notes That I Will Flesh out Later

Go back and fix the MNIST discussion using the r1.6 example, and discuss the potential for tf.keras.Model subclassing in future TF/Keras releases.

Debugging:

  • tf.check_numerics(tensor, message) - Reports an InvalidArgument error if tensor has any NaN or Inf values; otherwise passes tensor through. Can be used anywhere in the graph as an error-checking identity function.
  • tf.add_check_numerics_ops() - Adds a check_numerics op for every floating-point tensor in the graph.
  • tf.verify_tensor_all_finite() - Asserts that the tensor contains no NaNs or Infs.

When using an Estimator, we don't have explicit access to a tf.Session object, making it non-obvious how we can run arbitrary ops. To debug, we may want to execute tf.add_check_numerics_ops() to check for NaN or Inf values in the graph's tensors. When the Estimator is run in train mode (i.e., we call estimator.train()), it executes the train_op specified in the EstimatorSpec created for mode == tf.estimator.ModeKeys.TRAIN in the model_fn. Thus, if we fold tf.add_check_numerics_ops() into the train_op in the Estimator's model_fn, calling estimator.train() will surface any illegal values.
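A hypothetical sketch of how this could look inside the model_fn's train branch, grouping the checks with the real optimization op so that a single estimator.train() call does both:

# Run the numeric checks alongside the optimization step on every iteration.
check_op = tf.add_check_numerics_ops()
minimize_op = optimizer.minimize(loss, tf.train.get_or_create_global_step())
train_op = tf.group(minimize_op, check_op)

return tf.estimator.EstimatorSpec(
    mode=tf.estimator.ModeKeys.TRAIN, loss=loss, train_op=train_op)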

TensorBoard, graph visualization with tf.layers, summary logging with Estimators
