By Adam Anderson
This write-up assumes you have a general understanding of the TensorFlow programming model, but maybe you haven't kept up to date with the latest library features and standard practices.
The goal of this guide is to walk through what appear to be the current TensorFlow best practices using newer library features, most notably the Dataset API, Estimators, and tf.keras Models.
Documentation and tutorials written in the past few years differ on best practices because even six-month-old tutorials reference objects and functions that have been deprecated. I'm hoping this is a sufficiently comprehensive summary of what's useful to know as of Q1 2018.
More recent versions of TensorFlow have added modules that make it easier to set up and train machine learning models. These include the Dataset, Layers, Estimator, and Metrics APIs.
The general workflow is the following:
- Data Handling with Dataset API - tf.data allows you to create an input pipeline that aggregates data and preprocesses it before outputting data for input to the model. Data handling is achieved by creating a tf.data.Dataset for preprocessing and constructing batches, and then creating a tf.data.Iterator for iterating through the batches. This replaces previous input APIs using feed_dict or queue-based pipelines (Chengwei, Oct 2017).
- Additional preprocessing with Feature Columns - In some cases, we may want to change how data in a Dataset is presented to the deep learning model, for example by converting categorical data into a one-hot encoding vector, or discretizing values into bins. tf.feature_column provides functions to perform such actions. Feature Columns are not particularly useful for handling image data, so I'm not going to talk about them in detail here. Instead, refer to Feature Columns Getting Started.
- Model Specification with tf.layers - The tf.layers module makes it easy to stack layers onto a neural network. Now, it's even possible to use the Keras API to construct a model using tf.keras. Because tf.layers and tf.keras share core data structures, you can use the Keras API where it is convenient while also using core TensorFlow where necessary (Google Developers, Feb 2017).
- Abstraction of training/evaluation/prediction using Estimators - The tf.estimator.Estimator object abstracts training, evaluation, inference, exporting, etc. It is defined using a model_fn to perform necessary computation. When the model is going to be trained/evaluated, an input_fn is provided to get the input to the model in a standard format. Using the up-to-date pipeline, we can set up our tf.data.Dataset in the input_fn, then construct the model and specify training/evaluation behavior in the model_fn. To evaluate the model, we can use tf.metrics, which provides ops to calculate common metrics. As mentioned above, the Keras API has been integrated with TensorFlow, meaning you could use the abstractions offered by the Keras Model object when specifying the model architecture. Keras Models offer their own training/evaluation abstractions, but TensorFlow Estimators can be integrated into more robust and flexible deep learning workflows, so we're going to assume you want to use an Estimator.
Note that there was once an Experiment object, tf.contrib.learn.Experiment, that integrated training and evaluation functionality for an Estimator, but a Google Groups post by Martin Wicke from 9/25/2017 suggested that this would be deprecated in favor of tf.estimator.train_and_evaluate().
For an in-depth overview, refer to the TensorFlow Importing Data Programmer's Guide. Here, we summarize the information relevant to using tf.data as part of a pipeline that includes a tf.estimator.Estimator.
The input to an Estimator must be in the form (feature_dict, label). When we want to train/evaluate/predict using an Estimator, we define an input_fn that returns data in this format. Specifically, the input_fn must return one of the following (as paraphrased from the input_fn description for Estimator.evaluate()):
- A tf.data.Dataset object which outputs (features, labels) pairs
- A (features, labels) tuple
Thus, when using the Dataset API, we want to structure the Dataset so that we can easily extract data in this form.
In the Datasets Quick Start, we see the example input function:
def train_input_fn(features, labels, batch_size):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
    # Build the Iterator, and return the read end of the pipeline.
    return dataset.make_one_shot_iterator().get_next()
tf.data.Dataset.from_tensor_slices(tensors) builds a Dataset from tensors, which can be a single tensor, a tuple of tensors, or even a dict of the form {"tensor_name" : tensor}. The example input_fn creates a dataset containing (features, label) pairs, where features is a dict mapping the feature name to a tensor of values.
We use from_tensor_slices() because it slices its input along the first axis, treating that axis as the example index. In MNIST, this means the training dataset of shape (60000, 28, 28) is treated as 60,000 images, each 28x28.
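To make the slicing behavior concrete, here is a rough plain-Python sketch (not TensorFlow code) of what slicing along the first axis means for a (features, labels) input. The helper name slice_first_axis and the tiny 2x2 "images" are made up for illustration.

```python
def slice_first_axis(features, labels):
    """Yield one (feature_dict, label) pair per example.

    Every value in `features` is indexed along its first axis, which is
    the behavior the text above attributes to from_tensor_slices().
    """
    num_examples = len(labels)
    for name, values in features.items():
        if len(values) != num_examples:
            raise ValueError("feature %r has %d rows, expected %d"
                             % (name, len(values), num_examples))
    for i in range(num_examples):
        yield {name: values[i] for name, values in features.items()}, labels[i]


# Three made-up 2x2 "images" with integer labels.
images = [[[0, 1], [2, 3]], [[4, 5], [6, 7]], [[8, 9], [10, 11]]]
labels = [7, 2, 1]
pairs = list(slice_first_axis({"image": images}, labels))
print(pairs[1])
```

Each element of pairs has the (feature_dict, label) shape that an Estimator expects from its input_fn.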
We can manipulate the dataset using the preprocessing functions. For examples, see Datasets Quick Start or TensorFlow Importing Data Programmer's Guide.
- map(map_func, num_parallel_calls=None) - To apply a function that uses non-TensorFlow logic, wrap it in tf.py_func().
- filter(predicate)
- zip(datasets) - Used to produce the final dataset fed to the model with tf.data.Dataset.zip((features, labels)).
- concatenate(dataset)
- shuffle(buffer_size, seed=None, reshuffle_each_iteration=None) - Larger buffer sizes result in better randomness, while smaller sizes use less memory.
- repeat() - Repeats the dataset indefinitely, or a specified number of times.
- batch(batch_size) - Provides one batch of inputs per iteration.
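The buffer size trade-off for shuffle() is easier to see with a small simulation. The following is a plain-Python sketch of a buffer-based shuffle, written under the assumption that the real op fills a buffer of buffer_size elements and then repeatedly emits a random buffered element while refilling from the input. The function name buffered_shuffle is my own, not a TensorFlow API.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Shuffle a stream using a fixed-size buffer (illustrative sketch)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered element
        buffer[idx] = item           # and replace it with the new one
    rng.shuffle(buffer)              # drain whatever is left at the end
    for item in buffer:
        yield item


data = list(range(20))
nearly_ordered = list(buffered_shuffle(data, buffer_size=1, seed=0))
mixed = list(buffered_shuffle(data, buffer_size=5, seed=0))
print(nearly_ordered)
print(mixed)
```

With buffer_size=1 the order is unchanged, and with a buffer of 5 an element can only move a few positions forward, which is why a buffer at least as large as the dataset is needed for a uniform shuffle.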
Iterators provide access to elements from the dataset. Multiple types of iterators are allowed by the tf.data API, but only one-shot iterators are usable with tf.estimator.Estimator, so we'll only worry about those.
A one-shot iterator goes once through the Dataset, and is created using Dataset.make_one_shot_iterator(). To iterate through multiple epochs, just use Dataset.repeat(num_epochs) before creating the Iterator.
Example Usage:
dataset = dataset.shuffle(buffer_size)
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
features, labels = iterator.get_next()
A common pattern is to wrap the training loop with a try-except block to catch when the iterator finishes:
while True:
    try:
        sess.run(result)
    except tf.errors.OutOfRangeError:
        break
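The same "run until exhausted" pattern can be mimicked with plain Python generators, which also shows how repeat() and batch() interact. This is only an analogy: repeat and batch below are hypothetical helper functions, and StopIteration plays the role of tf.errors.OutOfRangeError.

```python
import itertools


def repeat(items, num_epochs):
    """Yield every element of `items` num_epochs times, epoch after epoch."""
    for _ in range(num_epochs):
        for item in items:
            yield item


def batch(items, batch_size):
    """Group consecutive elements into lists of at most batch_size."""
    iterator = iter(items)
    while True:
        chunk = list(itertools.islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


examples = [0, 1, 2, 3, 4]
one_shot = batch(repeat(examples, num_epochs=2), batch_size=4)

batches = []
while True:
    try:
        batches.append(next(one_shot))
    except StopIteration:  # stands in for tf.errors.OutOfRangeError
        break

print(batches)
```

Because the repeat happens before the batching here, the second batch straddles the boundary between the two epochs. Batching first and repeating afterwards keeps each epoch's batches separate.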
In the MNIST example below, we see how to use an instantiated Dataset object in an Estimator's input_fn.
In previous TensorFlow versions, to train the model, you would have to write a training loop where you got a TensorFlow session and called sess.run() on the optimization step. Evaluation and inference were achieved by calling sess.run() on the model's output op, which would then get used to calculate evaluation metrics.
Estimators abstract this process, and also make it easier to run models on different hardware, export models for sharing, save checkpoints/TensorBoard summaries, etc.
The tf.estimator.Estimator constructor accepts a model_fn which has the signature model_fn(features, labels, mode, params, config), where mode, params, and config are optional (this is why the input_fn needs to output data in (features, label) pairs). Within the model_fn, the model is created, and we specify how to produce output given input in the form (features, labels).
I think the best way to see how an Estimator is specified is by looking at the MNIST example below. I summarize some of the important points here, but they might be hard to follow without specific example code.
The standard pattern is for the architecture specification to output logits (pre-softmax activations), and we may add additional ops depending on whether we are in train/evaluate/predict mode (see Implementing training, evaluation, and prediction). The tf.estimator.ModeKeys constants are used to check what the mode is. In all cases, we return a tf.estimator.EstimatorSpec object containing relevant information. For example, in predict mode, the EstimatorSpec wraps the predictions and class probabilities. In train mode the EstimatorSpec wraps the loss and optimization op. In eval mode the EstimatorSpec wraps tf.metrics ops. As noted in the documentation for tf.estimator.EstimatorSpec, the train_op is ignored when not in train mode. Here is an example pattern from that documentation:
def my_model_fn(mode, features, labels):
    if (mode == tf.estimator.ModeKeys.TRAIN or
            mode == tf.estimator.ModeKeys.EVAL):
        loss = ...
    else:
        loss = None
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = ...
    else:
        train_op = None
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = ...
    else:
        predictions = None
    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions=predictions,
        loss=loss,
        train_op=train_op)
The goal of checking tf.estimator.ModeKeys is that ops for training/evaluation are added separately from the actual architecture specification.
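As a toy illustration of that separation, here is a plain-Python stand-in with no TensorFlow involved. Everything in it is hypothetical: Spec is a namedtuple imitating the fields of an EstimatorSpec, the mode strings are my own constants, and the string "apply_gradients" is just a placeholder for a real optimization op.

```python
from collections import namedtuple

# Hypothetical stand-in for tf.estimator.EstimatorSpec; not a TensorFlow type.
Spec = namedtuple("Spec", ["mode", "predictions", "loss", "train_op"])
TRAIN, EVAL, PREDICT = "train", "eval", "predict"


def toy_logits(features):
    """The shared 'architecture': identical in every mode."""
    return [2.0 * x for x in features]


def toy_model_fn(features, labels, mode):
    logits = toy_logits(features)  # built once, regardless of mode
    if mode == PREDICT:
        return Spec(mode, predictions=logits, loss=None, train_op=None)
    # Loss is needed for both training and evaluation.
    loss = sum((p - y) ** 2 for p, y in zip(logits, labels)) / len(labels)
    # Placeholder string standing in for a real optimization op.
    train_op = "apply_gradients" if mode == TRAIN else None
    return Spec(mode, predictions=None, loss=loss, train_op=train_op)


predict_spec = toy_model_fn([1.0, 2.0], None, PREDICT)
eval_spec = toy_model_fn([1.0, 2.0], [2.0, 5.0], EVAL)
train_spec = toy_model_fn([1.0, 2.0], [2.0, 5.0], TRAIN)
print(predict_spec.predictions, eval_spec.loss, train_spec.train_op)
```

The point is only the structure: the forward computation lives in one place, and each mode bolts on just the extra pieces it needs.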
For an introduction, see Estimators Programmer's Guide. For a more detailed guide, see Creating Custom Estimators. Note that both of these guides discuss feature columns, which are not useful for image data.
TensorFlow also offers the tf.estimator.train_and_evaluate() function, which encapsulates training and evaluation. The signature is
tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)
where train_spec is a tf.estimator.TrainSpec object, which encapsulates the training input_fn and max_steps for training, and eval_spec is a tf.estimator.EvalSpec object, which encapsulates logging hooks and the eval input_fn.
The tf.metrics module contains functions for evaluation-related metrics, including accuracy, recall, precision, true/false positives/negatives, RMSE, etc.
Ex: tf.metrics.accuracy(labels, predictions, weights=None)
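One thing worth knowing is that these metrics are streaming: each one returns a value tensor plus an update op, and the value is accumulated over all batches seen so far rather than computed per batch. Below is a plain-Python sketch of that idea for accuracy. The class StreamingAccuracy is a made-up name, and the total/count bookkeeping is my simplified reading of how the accumulation works.

```python
class StreamingAccuracy(object):
    """Accumulates correct/total counts across batches (illustrative only)."""

    def __init__(self):
        self.total = 0.0   # number of correct predictions so far
        self.count = 0.0   # number of examples seen so far

    def update(self, labels, predictions):
        """Plays the role of the update op: fold in one batch."""
        self.total += sum(1.0 for y, p in zip(labels, predictions) if y == p)
        self.count += len(labels)
        return self.value()

    def value(self):
        """Plays the role of the value tensor: accuracy over all batches."""
        return self.total / self.count if self.count else 0.0


metric = StreamingAccuracy()
metric.update(labels=[1, 2, 3, 4], predictions=[1, 2, 0, 4])  # 3 of 4 correct
metric.update(labels=[5, 6], predictions=[5, 0])              # 1 of 2 correct
overall = metric.value()
print(overall)
```

Note that the result is 4 correct out of 6 examples, which differs from the average of the two per-batch accuracies (0.75 and 0.5) because the batches have different sizes.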
To highlight what appears to be the current state of the art TensorFlow usage, we're going to examine some code from the TensorFlow MNIST example. All relevant code is going to be excerpted below, because the source code includes a lot more than we care about (argument parsing, multi-GPU training etc.).
Code below is excerpted from mnist.py. Actual specification of the tf.data.Dataset object used is in dataset.py.
The part of this guide that discusses using a tf.keras.Model subclass to encapsulate architecture specification will only work in TensorFlow r1.7. The Keras API does not currently allow this sort of subclassing, as I documented in this GitHub issue.
There is a workaround, which isn't complicated at all -- just add a tf.keras.layers.Input object to Model.__init__(), move the layer connection from Model.__call__() to Model.__init__(), and replace the call to super(Model, self).__init__() with
super(Model, self).__init__(inputs=inputs, outputs=outputs)
where inputs is the tf.keras.layers.Input, and outputs is the output of the layer representing the model output.
The architecture is specified using the Keras API.
Here, we use Keras-style syntax to pass inputs into the network and then pass the output of each layer into the next layer. The value returned by __call__() is the so-called logits tensor of pre-softmax activations, which can be argmax-ed to determine predictions or softmax-ed to get a probability distribution over the class labels.
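For readers who haven't worked with logits directly, here is a small plain-Python sketch of the two operations just mentioned. The three logit values are made up, and softmax and argmax here are my own helper functions rather than the TensorFlow ops.

```python
import math


def softmax(logits):
    """Turn logits into probabilities; subtracting the max avoids overflow."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [1.0, 3.0, 0.5]           # made-up scores for three classes
probabilities = softmax(logits)    # sums to 1
predicted_class = argmax(logits)   # class with the highest score
print(predicted_class, probabilities)
```

Since softmax preserves ordering, taking the argmax of the logits gives the same class as taking the argmax of the probabilities, which is why the prediction code can skip the softmax when only the class is needed.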
It is important to note that a Keras Model object can only wrap models constructed using the layers API. An exception is thrown if you try to wrap a model that does a computation directly on a tensor (for example, tf.concat cannot be wrapped by a tf.keras.Model -- you would need to use tf.keras.layers.Concatenate). This is similar to the problem discussed in this GitHub issue.
Below is the relevant code. We see that the __init__() method creates tf.layers objects specifying the layers that the network will have, but the layers are not chained together until the __call__() function is invoked.
class Model(tf.keras.Model):
    def __init__(self, data_format):
        super(Model, self).__init__()
        if data_format == 'channels_first':
            self._input_shape = [-1, 1, 28, 28]
        else:
            assert data_format == 'channels_last'
            self._input_shape = [-1, 28, 28, 1]
        self.conv1 = tf.layers.Conv2D(
            32, 5, padding='same', data_format=data_format, activation=tf.nn.relu)
        self.conv2 = tf.layers.Conv2D(
            64, 5, padding='same', data_format=data_format, activation=tf.nn.relu)
        self.fc1 = tf.layers.Dense(1024, activation=tf.nn.relu)
        self.fc2 = tf.layers.Dense(10)
        self.dropout = tf.layers.Dropout(0.4)
        self.max_pool2d = tf.layers.MaxPooling2D(
            (2, 2), (2, 2), padding='same', data_format=data_format)

    def __call__(self, inputs, training):
        y = tf.reshape(inputs, self._input_shape)
        y = self.conv1(y)
        y = self.max_pool2d(y)
        y = self.conv2(y)
        y = self.max_pool2d(y)
        y = tf.layers.flatten(y)
        y = self.fc1(y)
        y = self.dropout(y, training=training)
        return self.fc2(y)
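It can help to trace the tensor sizes through this network by hand. The short calculation below does that in plain Python, assuming that padding='same' gives a spatial output size of ceil(input / stride) and that the convolutions use the default stride of 1. The helper same_padding_size is a name I made up.

```python
import math


def same_padding_size(size, stride):
    """Spatial output size for padding='same': ceil(size / stride)."""
    return int(math.ceil(size / float(stride)))


size = 28                                   # MNIST images are 28x28
size = same_padding_size(size, stride=1)    # conv1 keeps 28x28 (32 filters)
size = same_padding_size(size, stride=2)    # first max pool halves it
size = same_padding_size(size, stride=1)    # conv2 keeps the size (64 filters)
size = same_padding_size(size, stride=2)    # second max pool halves it again
flattened = size * size * 64                # input width of the first Dense layer
print(size, flattened)
```

So under these assumptions the flattened vector fed to the 1024-unit dense layer has 7 * 7 * 64 entries per example.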
To create the tf.estimator.Estimator object, the model_fn is specified. First, the Model object specified above is instantiated. Then, we get the image to input to the model. After that, if statements check if the model is in training, prediction, or evaluation mode. For all modes, note that we get the logits by logits = model(image, training=...). We then add different TensorFlow ops depending on the mode.
In training mode, the loss and train_op are created and included in the EstimatorSpec.
In prediction mode, the predicted class and probability ops are added after the logits layer, and these are returned in the EstimatorSpec.
In evaluation mode, a loss op is added along with tf.metrics.accuracy.
def model_fn(features, labels, mode, params):
    model = Model(params['data_format'])
    image = features
    if isinstance(image, dict):
        image = features['image']

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
        logits = model(image, training=True)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        return tf.estimator.EstimatorSpec(
            mode=tf.estimator.ModeKeys.TRAIN,
            loss=loss,
            train_op=optimizer.minimize(loss, tf.train.get_or_create_global_step()))

    if mode == tf.estimator.ModeKeys.PREDICT:
        logits = model(image, training=False)
        predictions = {
            'classes': tf.argmax(logits, axis=1),
            'probabilities': tf.nn.softmax(logits),
        }
        return tf.estimator.EstimatorSpec(
            mode=tf.estimator.ModeKeys.PREDICT,
            predictions=predictions,
            export_outputs={
                'classify': tf.estimator.export.PredictOutput(predictions)
            })

    if mode == tf.estimator.ModeKeys.EVAL:
        logits = model(image, training=False)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        return tf.estimator.EstimatorSpec(
            mode=tf.estimator.ModeKeys.EVAL,
            loss=loss,
            eval_metric_ops={
                'accuracy':
                    tf.metrics.accuracy(
                        labels=labels,
                        predictions=tf.argmax(logits, axis=1)),
            })
In the main() function, the model is instantiated. The mnist_classifier object is created as a tf.estimator.Estimator with the model_fn as specified above.
To train the model, use mnist_classifier.train(input_fn=train_input_fn)
To evaluate the model, use eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
To export the model, use mnist_classifier.export_savedmodel(FLAGS.export_dir, input_fn), where input_fn is built by tf.estimator.export.build_raw_serving_input_receiver_fn().
def main(unused_argv):
    model_function = model_fn
    data_format = FLAGS.data_format
    mnist_classifier = tf.estimator.Estimator(
        model_fn=model_function,
        model_dir=FLAGS.model_dir,
        params={
            'data_format': data_format,
            'multi_gpu': FLAGS.multi_gpu
        })

    # Train the model
    def train_input_fn():
        ds = dataset.train(FLAGS.data_dir)
        ds = ds.cache().shuffle(buffer_size=50000).batch(FLAGS.batch_size).repeat(
            FLAGS.train_epochs)
        return ds

    # Set up training hook that logs the training accuracy every 100 steps.
    # NOTE: this expects a tensor named 'train_accuracy' in the graph; the
    # model_fn excerpt above does not create one.
    tensors_to_log = {'train_accuracy': 'train_accuracy'}
    logging_hook = tf.train.LoggingTensorHook(
        tensors=tensors_to_log, every_n_iter=100)
    mnist_classifier.train(input_fn=train_input_fn, hooks=[logging_hook])

    # Evaluate the model and print results
    def eval_input_fn():
        return dataset.test(FLAGS.data_dir).batch(
            FLAGS.batch_size).make_one_shot_iterator().get_next()

    eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
    print('Evaluation results:\n\t%s' % eval_results)

    # Export the model
    if FLAGS.export_dir is not None:
        image = tf.placeholder(tf.float32, [None, 28, 28])
        input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
            'image': image,
        })
        mnist_classifier.export_savedmodel(FLAGS.export_dir, input_fn)
With the introduction of new high level APIs and the integration of Keras, there are a lot of degrees of freedom regarding how to specify a deep learning workflow.
tf.layers and tf.keras share data structures, and objects in tf.layers are identical to those in tf.keras.layers. The functional approach to network construction is to pass tensors into layer-creating functions like tf.layers.conv2d(). The Keras approach to network construction is to create objects representing layers like tf.layers.Conv2D, and then pass tensors in by calling the objects on the tensors (like in the MNIST example above). You also have the option of using lower-level TensorFlow functions from before tf.layers and tf.keras.layers. If you really wanted to, you could specify one layer using low-level TensorFlow, then define the next using tf.layers functions, then define the next using tf.keras.layers objects.
Once the architecture is specified, you may want to wrap it in a tf.keras.models.Model, i.e. a Keras Model. This makes it easier to modularize architecture specification separate from the rest of the pipeline.
Keras Model objects have their own functions for training/evaluation/prediction. Alternatively, you can wrap the Model object instantiation inside a model_fn and use the Estimator training/evaluation/prediction workflow. The Keras Model object workflow is easier to set up, but it is less flexible than the TensorFlow Estimator setup. You can use an Estimator regardless of whether or not you encapsulate the architecture specification in a Model object.
For the input pipeline, you have the option of using traditional TensorFlow: an explicit for-loop that gets a TensorFlow session object and calls sess.run(), feeding inputs into placeholders via a feed_dict. TensorFlow also has its complicated, parallelized queue-based input pipeline. Now, you can also use the Dataset API, which is intended to supplant both of these older options, with the caveat that tf.data offers many iterator types to iterate through a dataset, but only one-shot iterators are compatible with the Estimator workflow.
TensorFlow now has a lot of interoperable ways to do the same thing at different stages of the deep learning workflow. My impression is that the best practice at the moment is to use the Dataset API for input handling, either the functional tf.layers API or the object-oriented Keras API (in tf.layers or tf.keras.layers) for model specification, and a tf.estimator.Estimator for workflow abstraction. The choice of using a tf.keras.models.Model object seems arbitrary but useful for encapsulation, with the caveat that it's only being used for encapsulation, and the Estimator handles the workflow abstractions.
- tf.estimator.WarmStartSettings - Loading a model with all weights loaded from a checkpoint file.
- tf.train.MonitoredTrainingSession - Training in a distributed setting.
- Go back and fix the MNIST discussion using the r1.6 example and discuss the potential for using tf.keras.Model subclassing in future TF/Keras releases.
- tf.check_numerics(tensor, message) - Reports an InvalidArgument error if tensor has any NaN or Inf values, otherwise passes tensor through. Can be used anywhere in the graph as an error-checking identity function.
- tf.add_check_numerics_ops() - Adds a check_numerics to every floating point tensor.
- tf.verify_tensor_all_finite() - Asserts that the tensor contains no NaNs or Infs.
When using an Estimator, we don't have explicit access to a tf.Session object, making it non-obvious how we can run arbitrary ops. To debug, we may want to execute tf.add_check_numerics_ops() to check for NaN or Inf values in a tensor. When the Estimator is run in train mode (i.e. we call estimator.train()), it executes the train_op specified in the EstimatorSpec created for mode == tf.estimator.ModeKeys.TRAIN in the model_fn. Thus, if we set train_op = tf.add_check_numerics_ops() in the Estimator's model_fn, when we call estimator.train() we will see if there are any illegal values.
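The "error-checking identity function" idea is simple enough to show in plain Python. The sketch below is an analogy only: this check_numerics is my own function operating on a list of floats, and it raises ValueError where the TensorFlow op would report InvalidArgument.

```python
import math


def check_numerics(values, message):
    """Return `values` unchanged, or raise if any entry is NaN or Inf."""
    for v in values:
        if not math.isfinite(v):
            raise ValueError("%s: found non-finite value %r" % (message, v))
    return values


# Finite values pass straight through.
ok = check_numerics([0.5, 1.5], "activations")

# A NaN anywhere in the input triggers the error.
try:
    check_numerics([0.5, float("nan")], "loss")
    caught = False
except ValueError:
    caught = True

print(ok, caught)
```

Because the check returns its input untouched, it can be dropped in between any two steps of a computation without changing the result when the values are well behaved.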
TensorBoard, graph visualization with tf.layers, summary logging with Estimators
Nice summary! Something I wanted to write down sometime this month too.
Could you maintain a repo for this? Would be useful to commit changes as newer practices (layers being dropped in 2.0) etc come by.