Training Robust Classifiers (Part 2)

In part 1 of this post, we discussed how training a robust classifier can be naturally formulated as minimizing the robust population loss:

This directly leads to solving the following min-max problem:

We then concluded that computing gradients for the outer minimization problem requires us to repeatedly solve the inner maximization problem. This latter problem corresponds to maximizing the loss of the classifier over a subset of the input space (e.g., small perturbations of image pixels). In this blog post, we will outline how to find good solutions for the min-max problem. For details, please refer to our paper.

Solving the Inner Maximization Problem

The key difficulty we immediately run into is that the inner maximization is provably hard (in the worst case), even for simple ReLU networks (link). So there is little hope for solving this problem efficiently and with good general guarantees on the quality of the computed solution. So instead, we resort to the leading methodology for optimizing non-convex objectives involving neural networks: applying a first-order method and hoping for the best!

Since our problem is constrained, we will use Projected Gradient Descent (PGD), a canonical method for solving constrained optimization problems. PGD proceeds by repeatedly taking a step in the direction of the gradient of the loss function – to greedily increase its value – and then projecting the resulting point back to the constraint set:

Here, refers to projecting a point onto the set . For a given point , computing corresponds to finding the point in that is closest to . As usual, denotes the step size. (Note that tuning the step size is crucial for PGD to work well.)

Since we are interested in maximizing over an -ball, we will use the -variant of PGD (see here or here for details):

(Following the signs of the gradient instead of the gradient itself lets us make more progress in the -geometry. It is worth keeping in mind though that following the signs is specific to the -setting and would not be a suitable choice, e.g., for -bounded perturbations.)

The resulting method turns out to be very successful in finding adversarial examples for a broad range of models. In the context of deep learning, it is sometimes referred to as Basic Iterative Method (link, link).

Reliable Inner Maximization is Critical

Before we examine the performance of robust training with PGD, we want to emphasize that it is crucial to solve the inner maximization problem sufficiently well. First, recall from last time that Danskin’s theorem is valid only if we are able to solve the inner maximization problem optimally. So, even though we generally can’t be guaranteed to find such optimal solutions efficiently, all current evidence shows that we need to find solutions that are sufficiently close to optimal. Such solutions are more likely to give us a good update to the outer minimization problem (cf. the algorithm for solving the min-max problem we discussed last time).

In the past, researchers used fast, yet fairly weak optimization procedures (attacks) for solving the inner problem. While these methods successfully produce adversarial examples for “standard” (non-robust) classifiers, they don’t necessarily give sufficiently good (i.e. near-optimal) solutions for the inner maximization problem. Models trained using them turned out to be resistant to these weak attacks, but still vulnerable to stronger attacks.

A prominent illustration of this phenomenon is adversarial training with FGSM (fast gradient sign method). FGSM is an attack method based on performing a single gradient step. It is very efficient and successful at creating adversarial examples for standard (non-robust) models. However, it is far less effective at finding worst-case adversarial perturbations, especially when compared to perturbations computed with a few steps of PGD. This can be clearly seen when we examine the loss function along the direction of FGSM and the direction found by a few steps of PGD, respectively. Here we plot the value of the loss on the line segment between the unmodified data point and the adversarially perturbed points found by FGSM and PGD:

While the FGSM direction leads to a high loss value close to the starting point for CIFAR10, the final iterate of PGD achieves a significantly higher loss. This is not surprising. After all, FGSM chooses the steepest direction locally (i.e., in a greedy way), and thus is not sensitive to the changes in loss value further along this direction.

As a result, models trained using FGSM perturbations are robust to FGSM attacks but are highly vulnerable to PGD attacks. (These models are also often memorizing the attacks they are exposed to, a phenomenon that was called label leaking.) Here are the same plots on models trained using FGSM:

As one can see, the points found by FGSM have very low loss, despite the fact that points with significantly higher loss exist (and can be found by PGD). The small bump at the start of the CIFAR plot is a sign of “gradient masking”, see here for further analysis.

Examining the Performance of PGD

In the previous section, we have seen that suboptimal solutions to the inner maximization problem can lead to vulnerable models (even if the attack method can easily fool a “vanilla” model). In particular, models trained using FGSM still turn out to be vulnerable to perturbations found via PGD.

Since PGD found solutions with higher loss than FGSM, a natural starting point for training more robust models is using PGD for the inner maximization. However, it is still unclear whether we can depend on PGD to always find optimal (or sufficiently close to optimal) solutions. After all, PGD is only guaranteed to find an actual maximum for concave functions over convex sets. But here, PGD is facing a non-concave function.

We conducted multiple experiments to understand the performance of PGD in our non-concave setting. In the first set of experiments, our goal was to examine whether the loss surface contains local maxima with very different values. To this end, we evaluated the robust loss of standard MNIST and CIFAR10 networks with multiple runs of PGD initialized at different random starting points. Below are the histograms of the final loss values found by PGD starting from different random points for two random examples from each test set:

Interestingly, the behavior of PGD is quite consistent. The values of the local maxima found by PGD are fairly concentrated and there are no major outliers (we did not crop the x axis). This is encouraging! The plots suggest that just a few steps of PGD already find a point with a reasonably good value. It is important to keep in mind though that, strictly speaking, we do not really know if the loss we obtain is close to the maximum possible value. But even if it was not the case, we at least get some degree of confidence that a first-order adversary – i.e., an adversary that only utilizes first-order information about the model – is unlikely to find examples with significantly higher loss.

In order to understand the loss landscape in more detail, we now take a look at two further visualizations for standard MNIST and CIFAR10 classifiers. In particular, we will plot the value of the loss (as a function of input image, not model parameters) in the two dimensional space spanned by an unperturbed image and the two local maxima found by PGD for different starting points. To be precise: if is the original point and , are the perturbed points computed by PGD, we plot as a function of and .