CleverHans Tutorial - MNIST with JSMA

MNIST tutorial: crafting adversarial examples with the Jacobian-based saliency map attack

This tutorial explains how to use CleverHans together with a TensorFlow model to craft adversarial examples using the Jacobian-based saliency map approach. This attack is described in detail in the paper "The Limitations of Deep Learning in Adversarial Settings" by Papernot et al. We assume basic knowledge of TensorFlow. If you need help getting CleverHans installed before getting started, you may find our MNIST tutorial on the fast gradient sign method useful.

The tutorial's complete script is provided in the tutorial folder of the CleverHans repository. Please be sure to add CleverHans to your PYTHONPATH environment variable before executing this tutorial.

Defining the model with TensorFlow and Keras

In this tutorial, we use Keras to define the model and TensorFlow to train it. The model is a Keras Sequential model: it is made up of multiple convolutional and ReLU layers. You can find the model definition in the utils_mnist CleverHans module.

import tensorflow as tf
from cleverhans.utils_mnist import model_mnist

# Define input TF placeholder
x = tf.placeholder(tf.float32, shape=(None, 1, 28, 28))
y = tf.placeholder(tf.float32, shape=(None, 10))

# Define TF model graph
model = model_mnist()
predictions = model(x)
print("Defined TensorFlow model graph.")
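
For reference, the model returned by model_mnist could look roughly like the following sketch. This is an illustrative assumption rather than the exact definition, which lives in cleverhans.utils_mnist; the layer sizes shown here are placeholders.

# A minimal sketch (assumed layer sizes) of a Keras Sequential model made up of
# convolutional and ReLU layers, ending in a softmax over the 10 digit classes.
from keras.models import Sequential
from keras.layers import Activation, Conv2D, Dense, Flatten

def simple_mnist_model(nb_classes=10):
    model = Sequential()
    model.add(Conv2D(64, (8, 8), strides=(2, 2), padding='same',
                     data_format='channels_first', input_shape=(1, 28, 28)))
    model.add(Activation('relu'))
    model.add(Conv2D(128, (6, 6), strides=(2, 2), padding='valid',
                     data_format='channels_first'))
    model.add(Activation('relu'))
    model.add(Flatten())
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))
    return model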

Training the model with TensorFlow

The library includes a helper function that runs a TensorFlow optimizer to train models and another helper function to load the MNIST dataset. To train our MNIST model, we run the following:

from cleverhans.utils_mnist import data_mnist
from cleverhans.utils_tf import model_train

# Get MNIST training and test data
X_train, Y_train, X_test, Y_test = data_mnist()

# Train the MNIST model (sess is the tf.Session used throughout the tutorial)
model_train(sess, x, y, predictions, X_train, Y_train)

We can then evaluate the performance of this model using model_eval included in cleverhans.utils_tf:

from cleverhans.utils_tf import model_eval

# Evaluate the accuracy of the MNIST model on legitimate test examples
accuracy = model_eval(sess, x, y, predictions, X_test, Y_test)
assert X_test.shape[0] == 10000, X_test.shape
print('Test accuracy on legitimate test examples: ' + str(accuracy))

The accuracy returned should be above 98%, and it can be improved further by training for more epochs.

Crafting adversarial examples

We first need to create the elements in the TensorFlow graph necessary to compute Jacobian matrices (see below for more details), as well as two numpy arrays to keep track of the results of adversarial example crafting.

import numpy as np
from cleverhans.attacks import jacobian_graph, jsma

# This array indicates whether an adversarial example was found for each
# test set sample and target class
results = np.zeros((FLAGS.nb_classes, FLAGS.source_samples), dtype='i')

# This array contains the fraction of perturbed features for each test set
# sample and target class
perturbations = np.zeros((FLAGS.nb_classes, FLAGS.source_samples), dtype='f')

# Define the TF graph for the model's Jacobian
grads = jacobian_graph(predictions, x)

We then iterate over the samples that we want to perturb and all possible target classes (i.e. all classes that are different from the label assigned to the input in the dataset).

# Loop over the samples we want to perturb into adversarial examples
for sample_ind in range(FLAGS.source_samples):
    # We want to find an adversarial example for each possible target class
    # (i.e. all classes that differ from the label given in the dataset)
    target_classes = list(range(FLAGS.nb_classes))
    target_classes.remove(int(np.argmax(Y_test[sample_ind])))

    # Loop over all target classes
    for target in target_classes:
        print('--------------------------------------')
        print('Creating adversarial example for target class ' + str(target))

        # This call runs the Jacobian-based saliency map approach
        _, result, percentage_perturb = jsma(sess, x, predictions, grads,
                                             X_test[sample_ind:(sample_ind+1)],
                                             target, theta=1, gamma=0.1,
                                             increase=True, back='tf',
                                             clip_min=0, clip_max=1)

        # Update the arrays for later analysis
        results[target, sample_ind] = result
        perturbations[target, sample_ind] = percentage_perturb

The last few lines of the script analyze the numpy arrays updated throughout crafting in order to compute the adversary's success rate: the fraction of source-target pairs for which a misclassification was achieved. The model's accuracy on these adversarial examples is therefore the complement of this success rate: given that the success rate should be larger than 90%, the accuracy on adversarial examples is lower than 10%, significantly lower than the accuracy you obtained earlier on legitimate samples from the test set. The script also reports the average fraction of input features perturbed to achieve these misclassifications.
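
As a concrete illustration, that analysis could be written as in the following sketch (an assumption about the exact expressions, not a verbatim excerpt from the tutorial script):

# Each of the source_samples inputs is attacked toward nb_classes - 1 targets,
# so the success rate is the fraction of those (sample, target) pairs for which
# an adversarial example was found.
nb_targets_tried = (FLAGS.nb_classes - 1) * FLAGS.source_samples
success_rate = float(np.sum(results)) / nb_targets_tried
print('Avg. rate of successful misclassifications: {0:.4f}'.format(success_rate))

# Average fraction of input features perturbed across all crafted examples
percent_perturbed = np.mean(perturbations)
print('Avg. rate of perturbed features: {0:.4f}'.format(percent_perturbed))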

Overview of the crafting process

Crafting adversarial examples is a three-step process, outlined in the main loop of the attack, which you may find in the function cleverhans.attacks.jsma_tf:

# Compute the Jacobian components
grads_target, grads_others = jacobian(sess, x, grads, target, adv_x)

# Compute the saliency map for each of our target classes
# and return the two best candidate features for perturbation
i, j, search_domain = saliency_map(grads_target, grads_others, search_domain, increase)

# Apply the perturbation to the two input features selected previously
adv_x = apply_perturbations(i, j, adv_x, increase, theta, clip_min, clip_max)

The Jacobian

In the first stage of the process, we compute the Jacobian component corresponding to each pair of output class and input feature. This helps us estimate how changes in the input features (here, pixels of the MNIST images) will affect each of the class probabilities assigned by the model. The Jacobian is a 10 x 28 x 28 array of floating point numbers: large positive values mean that increasing the associated pixel will yield a large increase in the corresponding class probability output by the model, while large negative values correspond to pixels whose increase yields a large decrease in that probability. Concisely, this step of the attack is key to identifying the features we should prioritize when crafting the perturbation that will result in misclassification.

Before its actual values are computed, the Jacobian is defined as a TF graph by a single call to attacks.jacobian_graph(). This graph is then run by the function attacks.jacobian(), which feeds it the current values of the input features.
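
As a rough illustration, building such a graph amounts to taking the gradient of each class probability with respect to the input placeholder. The sketch below is an assumption for clarity, not the library's exact jacobian_graph code, and reuses the predictions and x tensors defined earlier:

# One gradient op per output class: each differentiates that class probability
# with respect to the input, yielding the 10 x 1 x 28 x 28 Jacobian values when
# evaluated on a single MNIST image.
nb_classes = 10
jacobian_ops = [tf.gradients(predictions[:, class_ind], x)[0]
                for class_ind in range(nb_classes)]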

The saliency map

In the second stage of the process, we determine the best pixels for our adversarial goal: misclassifying an input sample as a chosen target class while perturbing as few of its input features as possible. To achieve this, we must take into account how much of an impact each pixel has not only on the target class, but on all other classes as well. Therefore, the adversarial saliency score of a pixel is defined as the product of the gradient of the target class and the sum of the gradients of all other classes (multiplied by -1 so that we do not select pixels with a high positive impact on non-target classes).

However, this scoring methodology has implications. Ideally, the best pixel is one with a highly positive impact on the target class and a highly negative impact on all other classes. Such pixels are rare, however. In practice, the highest scoring pixels fall into one of two categories: either the pixel has a high positive impact on the target class and a moderate impact on the other classes, or the pixel has little positive impact on the target class but a highly negative impact on all other classes.

This implies that a pair of pixels drawn from these two categories is ideal in practice, since their strengths compensate for each other's weaknesses: a pixel that has little impact on the target class but a highly negative impact on all other classes should be paired with a pixel that has a highly positive impact on the target class and a moderate to low positive impact on the other classes.

The end result is a pair of pixels which, when perturbed together, push the input towards the target class while pushing it away from all other classes. Concisely, it is this pair of pixels that we seek to identify in this step of the attack.

The computation of saliency scores for pixel pairs is defined in the function attacks.saliency_score(). The function attacks.saliency_map() uses it, with a pool of threads, to compute the entire saliency map of an input.
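
To make the pairing criterion concrete, the following sketch scores one candidate pixel pair. It assumes grads_target and grads_others are flattened numpy arrays holding, for each pixel, the derivative toward the target class and the summed derivatives toward all other classes; it is an illustration, not the library's exact saliency_score implementation:

def pair_saliency_score(grads_target, grads_others, p, q):
    # Combined effect of the pair on the target class and on all other classes
    alpha = grads_target[p] + grads_target[q]
    beta = grads_others[p] + grads_others[q]
    # When increasing features, the pair must help the target class (alpha > 0)
    # and hurt the other classes (beta < 0); -alpha * beta is then largest for
    # the most useful pair.
    if alpha > 0 and beta < 0:
        return -alpha * beta
    return 0.0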

Applying the perturbations

In the third stage of the process, we simply maximize the value of the pixel pair identified in the prior stage. For this MNIST tutorial with black and white digits, that functionally means setting each selected pixel to fully black or fully white (0 or 1), depending on the desired adversarial example. Once the perturbation has been applied, the model is queried again to check whether we have achieved the misclassification. If the input sample has not yet been perturbed into the desired target class, the process begins again at stage 1 and continues until we achieve the misclassification, exceed the maximum desired perturbation percentage, or have exhaustively perturbed all input features (which should be rare unless the input has a small number of features).

This perturbation is applied by the function attacks.apply_perturbations, which makes sure that the resulting adversarial example remains in the expected input domain (i.e., it constrains the perturbed input features to stay between 0 and 1 in the case of MNIST).
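
A minimal numpy sketch of that step (an assumption for illustration, not the library's exact code) could look like this:

def apply_pair_perturbation(adv_x, i, j, increase, theta, clip_min=0., clip_max=1.):
    # Push the two selected features toward the extreme of the valid range and
    # clip so the adversarial example stays inside the expected input domain.
    adv_x = adv_x.copy()
    for feature_ind in (i, j):
        adv_x.flat[feature_ind] += theta if increase else -theta
    return np.clip(adv_x, clip_min, clip_max)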

Code

The complete code for this tutorial is available in the tutorial folder of the CleverHans repository.

@minhlab

minhlab commented Oct 9, 2019

do you have any experience on making this work on a PyTorch model?

@Hxiang1124

your tutorial code cannot open, 404 not found
