CleverHans Tutorial - MNIST with JSMA

MNIST tutorial: crafting adversarial examples with the Jacobian-based saliency map attack

This tutorial explains how to use CleverHans together with a TensorFlow model to craft adversarial examples using the Jacobian-based saliency map approach. This attack is described in detail in the paper "The Limitations of Deep Learning in Adversarial Settings" by Papernot et al. We assume basic knowledge of TensorFlow. If you need help getting CleverHans installed before getting started, you may find our MNIST tutorial on the fast gradient sign method useful.

The tutorial's complete script is provided in the tutorial folder of the CleverHans repository. Please be sure to add CleverHans to your PYTHONPATH environment variable before executing this tutorial.

Defining the model with TensorFlow and Keras

In this tutorial, we use Keras to define the model and TensorFlow to train it. The model is a Keras Sequential model: it is made up of multiple convolutional and ReLU layers. You can find the model definition in the utils_mnist CleverHans module.

import tensorflow as tf
from cleverhans.utils_mnist import model_mnist

# Define input TF placeholder
x = tf.placeholder(tf.float32, shape=(None, 1, 28, 28))
y = tf.placeholder(tf.float32, shape=(None, 10))

# Define TF model graph
model = model_mnist()
predictions = model(x)
print("Defined TensorFlow model graph.")
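
For reference, the model returned by model_mnist could look roughly like the following sketch. This is an illustrative assumption rather than the exact definition, which lives in cleverhans.utils_mnist; the layer sizes shown here are placeholders.

# A minimal sketch (assumed layer sizes) of a Keras Sequential model made up of
# convolutional and ReLU layers, ending in a softmax over the 10 digit classes.
from keras.models import Sequential
from keras.layers import Activation, Conv2D, Dense, Flatten

def simple_mnist_model(nb_classes=10):
    model = Sequential()
    model.add(Conv2D(64, (8, 8), strides=(2, 2), padding='same',
                     data_format='channels_first', input_shape=(1, 28, 28)))
    model.add(Activation('relu'))
    model.add(Conv2D(128, (6, 6), strides=(2, 2), padding='valid',
                     data_format='channels_first'))
    model.add(Activation('relu'))
    model.add(Flatten())
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))
    return model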

Training the model with TensorFlow

The library includes a helper function that runs a TensorFlow optimizer to train models and another helper function to load the MNIST dataset. To train our MNIST model, we run the following:

from cleverhans.utils_mnist import data_mnist
from cleverhans.utils_tf import model_train

# Get MNIST training and test data
X_train, Y_train, X_test, Y_test = data_mnist()

# Train the MNIST model (sess is the tf.Session used throughout the tutorial)
model_train(sess, x, y, predictions, X_train, Y_train)

We can then evaluate the performance of this model using model_eval included in cleverhans.utils_tf:

from cleverhans.utils_tf import model_eval

# Evaluate the accuracy of the MNIST model on legitimate test examples
accuracy = model_eval(sess, x, y, predictions, X_test, Y_test)
assert X_test.shape[0] == 10000, X_test.shape
print('Test accuracy on legitimate test examples: ' + str(accuracy))

The accuracy returned should be above 98%, and it can be improved further by training for more epochs.

Crafting adversarial examples

We first need to create the elements in the TensorFlow graph necessary to compute Jacobian matrices (see below for more details), as well as two numpy arrays to keep track of the results of adversarial example crafting.

import numpy as np
from cleverhans.attacks import jacobian_graph, jsma

# This array indicates whether an adversarial example was found for each
# test set sample and target class
results = np.zeros((FLAGS.nb_classes, FLAGS.source_samples), dtype='i')

# This array contains the fraction of perturbed features for each test set
# sample and target class
perturbations = np.zeros((FLAGS.nb_classes, FLAGS.source_samples), dtype='f')

# Define the TF graph for the model's Jacobian
grads = jacobian_graph(predictions, x)

We then iterate over the samples that we want to perturb and all possible target classes (i.e. all classes that are different from the label assigned to the input in the dataset).

# Loop over the samples we want to perturb into adversarial examples
for sample_ind in range(FLAGS.source_samples):
    # We want to find an adversarial example for each possible target class
    # (i.e. all classes that differ from the label given in the dataset)
    target_classes = list(range(FLAGS.nb_classes))
    target_classes.remove(int(np.argmax(Y_test[sample_ind])))

    # Loop over all target classes
    for target in target_classes:
        print('--------------------------------------')
        print('Creating adversarial example for target class ' + str(target))

        # This call runs the Jacobian-based saliency map approach
        _, result, percentage_perturb = jsma(sess, x, predictions, grads,
                                             X_test[sample_ind:(sample_ind+1)],
                                             target, theta=1, gamma=0.1,
                                             increase=True, back='tf',
                                             clip_min=0, clip_max=1)

        # Update the arrays for later analysis
        results[target, sample_ind] = result
        perturbations[target, sample_ind] = percentage_perturb

The last few lines of the script analyze the numpy arrays updated throughout crafting in order to compute the adversary's success rate: the fraction of source-target pairs for which a misclassification was achieved. The model's accuracy on these adversarial examples is therefore the complement of this success rate: given that the success rate should be larger than 90%, the accuracy on adversarial examples is lower than 10%, significantly lower than the accuracy you obtained earlier on legitimate samples from the test set. The script also reports the average fraction of input features perturbed to achieve these misclassifications.
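
As a concrete illustration, that analysis could be written as in the following sketch (an assumption about the exact expressions, not a verbatim excerpt from the tutorial script):

# Each of the source_samples inputs is attacked toward nb_classes - 1 targets,
# so the success rate is the fraction of those (sample, target) pairs for which
# an adversarial example was found.
nb_targets_tried = (FLAGS.nb_classes - 1) * FLAGS.source_samples
success_rate = float(np.sum(results)) / nb_targets_tried
print('Avg. rate of successful misclassifications: {0:.4f}'.format(success_rate))

# Average fraction of input features perturbed across all crafted examples
percent_perturbed = np.mean(perturbations)
print('Avg. rate of perturbed features: {0:.4f}'.format(percent_perturbed))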

Overview of the crafting process

Crafting adversarial examples is a three-step process, outlined in the main loop of the attack, which you may find in the function cleverhans.attacks.jsma_tf:

# Compute the Jacobian components
grads_target, grads_others = jacobian(sess, x, grads, target, adv_x)

# Compute the saliency map for each of our target classes
# and return the two best candidate features for perturbation
i, j, search_domain = saliency_map(grads_target, grads_others, search_domain, increase)

# Apply the perturbation to the two input features selected previously
adv_x = apply_perturbations(i, j, adv_x, increase, theta, clip_min, clip_max)

The Jacobian

In the first stage of the process, we compute the Jacobian component corresponding to each pair of output class and input feature. This helps us estimate how changes in the input features (here, pixels of the MNIST images) will affect each of the class probabilities assigned by the model. The Jacobian is a 10 x 28 x 28 array of floating point numbers: large positive values mean that increasing the associated pixel will yield a large increase in the corresponding class probability output by the model, while large negative values correspond to pixels whose increase yields a large decrease in that probability. Concisely, this step of the attack is key to identifying the features we should prioritize when crafting the perturbation that will result in misclassification.

Before its actual values are computed, the Jacobian is defined as a TF graph by a single call to attacks.jacobian_graph(). This graph is then run by the function attacks.jacobian(), which feeds it the current values of the input features.
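
As a rough illustration, building such a graph amounts to taking the gradient of each class probability with respect to the input placeholder. The sketch below is an assumption for clarity, not the library's exact jacobian_graph code, and reuses the predictions and x tensors defined earlier:

# One gradient op per output class: each differentiates that class probability
# with respect to the input, yielding the 10 x 1 x 28 x 28 Jacobian values when
# evaluated on a single MNIST image.
nb_classes = 10
jacobian_ops = [tf.gradients(predictions[:, class_ind], x)[0]
                for class_ind in range(nb_classes)]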

The saliency map

In the second stage of the process, we determine the best pixels for our adversarial goal: misclassifying an input sample as a chosen target class while perturbing as few of its input features as possible. To achieve this, we must take into account how much of an impact each pixel has not only on the target class, but on all other classes as well. Therefore, the adversarial saliency score of a pixel is defined as the product of the gradient of the target class and the sum of the gradients of all other classes (multiplied by -1 so that we do not select pixels with a high positive impact on non-target classes).

However, this scoring methodology has implications. Ideally, the best pixel is one with a highly positive impact on the target class and a highly negative impact on all other classes. Such pixels are rare, however. In practice, the highest scoring pixels fall into one of two categories: either the pixel has a high positive impact on the target class and a moderate impact on the other classes, or the pixel has little positive impact on the target class but a highly negative impact on all other classes.

This implies that a pair of pixels drawn from these two categories is ideal in practice, since their strengths compensate for each other's weaknesses: a pixel that has little impact on the target class but a highly negative impact on all other classes should be paired with a pixel that has a highly positive impact on the target class and a moderate to low positive impact on the other classes.

The end result is a pair of pixels which, when perturbed together, push the input towards the target class while pushing it away from all other classes. Concisely, it is this pair of pixels that we seek to identify in this step of the attack.

The computation of saliency scores for pixel pairs is defined in the function attacks.saliency_score(). The function attacks.saliency_map() uses it, with a pool of threads, to compute the entire saliency map of an input.
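
To make the pairing criterion concrete, the following sketch scores one candidate pixel pair. It assumes grads_target and grads_others are flattened numpy arrays holding, for each pixel, the derivative toward the target class and the summed derivatives toward all other classes; it is an illustration, not the library's exact saliency_score implementation:

def pair_saliency_score(grads_target, grads_others, p, q):
    # Combined effect of the pair on the target class and on all other classes
    alpha = grads_target[p] + grads_target[q]
    beta = grads_others[p] + grads_others[q]
    # When increasing features, the pair must help the target class (alpha > 0)
    # and hurt the other classes (beta < 0); -alpha * beta is then largest for
    # the most useful pair.
    if alpha > 0 and beta < 0:
        return -alpha * beta
    return 0.0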

Applying the perturbations

In the third stage of the process, we simply maximize the value of the pixel pair identified in the prior stage. For this MNIST tutorial with black and white digits, that functionally means setting each selected pixel to fully black or fully white (0 or 1), depending on the desired adversarial example. Once the perturbation has been applied, the model is queried again to check whether we have achieved the misclassification. If the input sample has not yet been perturbed into the desired target class, the process begins again at stage 1 and continues until we achieve the misclassification, exceed the maximum desired perturbation percentage, or have exhaustively perturbed all input features (which should be rare unless the input has a small number of features).

This perturbation is applied by the function attacks.apply_perturbations, which makes sure that the resulting adversarial example remains in the expected input domain (i.e., it constrains the perturbed input features to stay between 0 and 1 in the case of MNIST).
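
A minimal numpy sketch of that step (an assumption for illustration, not the library's exact code) could look like this:

def apply_pair_perturbation(adv_x, i, j, increase, theta, clip_min=0., clip_max=1.):
    # Push the two selected features toward the extreme of the valid range and
    # clip so the adversarial example stays inside the expected input domain.
    adv_x = adv_x.copy()
    for feature_ind in (i, j):
        adv_x.flat[feature_ind] += theta if increase else -theta
    return np.clip(adv_x, clip_min, clip_max)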

Code

The complete code for this tutorial is available in the tutorial folder of the CleverHans repository.

@minhlab

minhlab commented Oct 9, 2019

do you have any experience on making this work on a PyTorch model?

@Hxiang1124

your tutorial code cannot open, 404 not found
