@celoyd
Last active November 30, 2017 20:24

Work in progress – check back for a more final draft.

A written introduction to convolutional neural networks

This is not meant to be the only outline you read. It’s meant to fill in gaps that others left for me.

A sample problem

Let’s suppose we’re trying to tell sunflowers from tulips in pictures. Suppose every image is grayscale, with 10,000 pixels, each in the range 0..1.

To distinguish these flowers, we build a convnet that has the image pixels as its 10k inputs and a probability of sunflower-ness as its single output.

Alternate mental models

Here are four differently connoted interpretations of what the network is:

  1. It is a feature recognizer. It picks up on the features (including higher-order features) that distinguish a sunflower from a tulip, weighs them, and makes a judgment.

  2. It is an information discarder. It will take the mass of information in the image, pull it apart in a variety of ways, and incrementally (as the information moves through the net) discard the least useful parts of it, leaving only the few bits that we want.

  3. It is a dimensionality reducer. It will project a point from a high-dimensional space – taking an image as a point in 10k dimensions, with every pixel as a dimension – down to a one-dimensional space: a point on the line from sunflower-ness to tulip-ness.

  4. It is the implementation of a remark that Fermi attributed to von Neumann: “With four parameters I can fit an elephant, and with five I can make it wiggle its trunk.”

To make sense of the giant hairball of detail involved in implementing a network, it’s helpful to be able to switch freely among these high-level interpretations. (And no doubt others that I don’t get yet.)

Into the net

The first layer of the convnet is a set of kernels applied to the image pixels. A kernel is a set of weights like this (for a 3×3 kernel):

0.9 0.5 0.1
0.9 0.5 0.1
0.9 0.5 0.1

We convolve every kernel-sized neighborhood of the image with the kernel. Basically, this is pattern-matching: how closely does the piece of image around a given pixel match the kernel? This particular kernel is high on the left and low on the right, which means it will match most strongly anywhere something bright has something dark just to its right. Another kernel might look for the opposite, another one might look for a brighter-on-top region, another might look for saddle-shaped brightness surfaces, another might look for isolated points, and so on.
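To make this concrete, here’s a minimal numpy sketch of the pattern-matching step, using the brighter-on-the-left kernel above. The shapes assume the 10,000 pixels form a 100×100 grid; the function name is just for illustration, and real frameworks do this far more efficiently.

```python
import numpy as np

# The brighter-on-the-left kernel from above.
kernel = np.array([[0.9, 0.5, 0.1],
                   [0.9, 0.5, 0.1],
                   [0.9, 0.5, 0.1]])

def convolve_one_kernel(image, kernel):
    """Slide a small kernel over a 2D image and record, at each position,
    how strongly the neighborhood matches the kernel (a dot product)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)
    return out

# image = np.random.rand(100, 100)  # a stand-in for our 100x100 grayscale picture
# highlight = convolve_one_kernel(image, kernel)
```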

The first layer of the convnet will run perhaps 64 of these kernels over the whole image. We’d say that the first layer, then, has a depth of 64. To put it another way, the layer’s output is a stack of 64 images, each one a “highlight image” showing where a given kernel activated. Everything else the convnet does will be derived from this first, simplest analysis of the image.

A typical second layer is roughly like the first layer: a set of n (maybe also 64) kernels producing n “highlight images”. The big difference is that these kernels have depth as well as height and width. In other words, they aren’t just 3×3, they’re 3×3×64. So they’re matching on an area of the image, but within that area, what they’re seeing is combinations of the basic features – things like edges and dots – generated by the first layer.

This means the net can respond to more complex shapes than a first-layer kernel can match: not just a spot, but a spot that is just above a line; not just an area matching the horizontal edge feature, but one that also slightly matches a vertical edge feature, and thus might be an edge at a slight angle. We’re building complex features out of simple features.
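Here’s a rough sketch of a kernel with depth, assuming the first layer produced a stack of 64 feature maps. The point is that the sum now runs over area and depth, so the kernel matches combinations of first-layer features.

```python
import numpy as np

def convolve_with_depth(feature_stack, kernel):
    """feature_stack: (H, W, 64), the first layer's stack of highlight images.
    kernel: (3, 3, 64) -- it matches a 3x3 neighborhood across all 64 feature
    maps at once, so it responds to combinations of simpler features."""
    h, w, d = feature_stack.shape
    kh, kw, kd = kernel.shape
    assert d == kd
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = feature_stack[y:y + kh, x:x + kw, :]
            out[y, x] = np.sum(patch * kernel)
    return out

# A full second layer would hold, say, 64 such kernels and stack their outputs
# into a new (H', W', 64) block of feature maps.
```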

Incidentally, I said the image was grayscale just so I wouldn’t have to start with an explanation of depth. More realistically, the image is RGB, which is represented as a depth of 3. So real-world first layers are typically more like the second layer we’re discussing here; a typical first-layer kernel is 3×3×3. As well as spatial patterns, they detect colors, and color/pattern combinations like “something gray next to something green”. But grayscale is seen in the wild too, and it makes a clearer example.

(At this point come nonlinear activation and some optional post-layer steps. We’ll return to them later, when the need for them is clearer. For the moment, we’ll skip them.)

Once we’ve run a couple layers, we have 64 “feature maps” – derived images of where the input shows 64 kinds of configurations of pixels (a.k.a. features, textures, details, or any other word that makes sense to you). Now we want to discard unimportant information, for example the exact, to-the-pixel locations of the features. Our output is “sunflower or tulip?”, so facts like “there is a sunflower seed pattern at coordinates 37, 42” can go, as long as useful information like “there is a cluster of areas with sunflower seed texture” gets passed on.

We do this by shrinking the feature maps. The most common method is 2×2 maxpooling, which means that each 2×2 group of 4 pixels in a feature map is collapsed to one output with the value of the highest input, so

0.7 0.8
0.6 0.7

becomes

0.8

because, intuitively and empirically, the most important information to pass on seems to be the best matches. “This image chunk had some matches for sunflower seed texture” is useful, but “this image chunk had at least one excellent match for sunflower seed texture” is more useful.
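A minimal numpy sketch of 2×2 maxpooling (the function name is just for illustration):

```python
import numpy as np

def maxpool_2x2(feature_map):
    """Shrink a 2D feature map by keeping only the best match (the max)
    in each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2                     # trim odd edges, if any
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# maxpool_2x2(np.array([[0.7, 0.8],
#                       [0.6, 0.7]]))  ->  array([[0.8]])
```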

Now we have more convolutional layers like the first two. Instead of starting with pixels, they start with features derived from pixels, but their basic operation is the same. And after them comes another pooling, and then more convolutional layers. Deeper in the network, where each layer sees fewer inputs, it’s common to design the layers with more depth (say, 512). In intuitive terms, these layers see less of the raw horizontal structure of the image and more derived information. As they funnel information through the net, they’re digesting it from facts about pixel intensities to facts about patterns of patterns … of patterns of pixel intensities.

A given convolution in deeper levels might start to pick up on the idea of the edge where the sunflower seeds meet the petals, for example. That’s a structure that doesn’t have a single definitive shape, but which can be seen as a complicated set of if-thens in terms of simpler textures. (If any of a long list of seed textures adjoin a petal texture in any direction, but not if there’s also seed texture on the other side of the petal texture, because that’s probably a loose petal fragment on the flower face and not the overall edge … and so on.) More depth allows for more complex if-thens, because each convolutional layer sees more convolutions from the previous layer.

If you think of one or more convolutional layers and then a pooling (plus those mysterious post-layer steps that we’ll revisit) as a block, convnets are mainly built out of blocks, typically about 4–8.
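To make the block structure concrete, here’s a sketch of this kind of architecture written with the Keras library. The layer counts and widths are illustrative rather than prescriptive, it assumes the 10,000-pixel image is 100×100, and the relu and sigmoid activations are the nonlinear steps mentioned earlier and discussed further below.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small block-structured convnet: each block is a couple of convolutions
# followed by a 2x2 maxpool. All sizes are illustrative.
model = keras.Sequential([
    layers.Input(shape=(100, 100, 1)),            # 100x100 grayscale image

    # Block 1
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),

    # Block 2
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),

    # Block 3
    layers.Conv2D(256, 3, activation="relu", padding="same"),
    layers.Conv2D(256, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),

    # The dense layers at the end, discussed below.
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),        # probability of sunflower-ness
])
```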

Eventually, we’re left with a very narrow but very deep feature map. In the first couple blocks, it made sense to think of the feature maps as images showing the placement of various simple shapes, like edges, in the input image. Down here at the end of the blocks, the shapes are extremely complex, and their placement is irrelevant: we don’t care where there’s sunflower-ness or tulip-ness in the image; we care about which predominates. (Other architectures in the neural net family are capable of saying “there’s a tulip on the left and two sunflowers above it”, but they’re elaborations on the kind of convnet that we’re walking through here.)

Now we stop thinking about the feature map as having any horizontal, image-like structure at all, and we take it as an unordered list of high-level features. Maybe feature #1 is bright petals, feature #2 is smooth stems, feature #3 is wrinkly petals, feature #4 is sunflower seed/petal edges, feature #5 is tulip leaves, and so on. Keep in mind that these features aren’t, by themselves, pictures of things. They’re functions that respond enthusiastically to any of a range of pictures of those things. So if feature #1 has a high value here, that’s just a large number – a high probability that there are light petals in the image – without its own image of those petals.

Another way of saying this is that information about a feature doesn’t live at a particular node in the network. It lives in that node and all its attachments to previous nodes.

Now that we’ve broken the conceptual link to a literal image, the next layer is fully connected, a.k.a. dense. Instead of working with 3×3 or similar kernels, each node in it looks at all the inputs at once. To put it another way, it doesn’t slide a small kernel over a large range of inputs to create a large number of outputs – its kernel is the whole size of the input, and it creates one output. Another dense layer, the same size or a bit smaller, helps distill things down and sort through contingencies. Loosely speaking, this layer is weighing its inputs like: with an increase of 1% in feature #1, the image is 2% more likely to contain a sunflower; with an increase of 1% in feature #2, it’s 3% less likely, and so on.
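As a rough sketch (the names here are purely illustrative), a dense node is nothing more than a weighted sum over every input at once, with the final output squashed into a probability:

```python
import numpy as np

def dense_node(features, weights):
    """One fully connected node: a weighted sum over *all* of its inputs at
    once -- no sliding kernel, no spatial structure."""
    return np.dot(features, weights)

def sigmoid(x):
    """Squash a weighted sum into a 0..1 probability of sunflower-ness."""
    return 1.0 / (1.0 + np.exp(-x))

# features: the unordered list of high-level feature activations
# p_sunflower = sigmoid(dense_node(features, final_weights))
```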

It’s useful to see what this last dense layer is doing as a close relative of what the very first layer was doing. All that’s happening is that it’s weighing inputs. The first layer had kernels that said things like “higher values in these specific inputs, and lower values in the others, raise the odds that we’re looking at a brighter-on-the-left edge”. Because the kernels were smaller than the inputs, they stepped across them. The last layer doesn’t step, because it’s defined to see all the input at once, but it’s behaving in an analogous way: “higher values in these specific inputs raise the odds that we’re looking at a sunflower”. This progression – from small-scale pixel arrangements, through hierarchical abstractions, to high-level features – is the core of what convnets do. And it’s implemented by every node doing some version of this same operation: weighting inputs to create outputs.

This is the first tour of a convnet. Now let’s dig into training, and into nonlinearity and the other non-convolutional operations.

To train the convnet, we think in terms of all the weights in the network. For the first layer, if a kernel is:

0.9 0.5 0.1
0.9 0.5 0.1
0.9 0.5 0.1

then each of those numbers is a weight. Likewise at every node in the network. Each node is basically seeing whether its inputs combine to form the feature that it measures, and weighting is how it does that. Maybe “stem” has heavy weight on vertical edges and no weight on horizontal edges, for example.

There are hundreds of thousands of weights in a reasonably complex convnet, and fancy ones can have billions. Manually setting weights would be laborious and inaccurate – both because you can’t enumerate every version of a sunflower, and because you can’t be certain what any given weight does in practice without tracing out all its connections and their connections and so on.

To start training, we set all the weights randomly and show the net an image. Say it’s a picture of a sunflower. We know the right output from this picture: 1.0. We define a loss function to measure how far away from the correct answer the convnet is at any training step: here, we can say it’s just |correct - observed|.

Now consider a straw-person proposal. After measuring the loss, randomly adjust all the weights and measure the loss again. If the new loss is higher (the output is less correct), go back to the original, but if the loss is lower (accuracy improved), continue from here. Randomly adjust all the weights again, and repeat many times, adjusting in gradually smaller increments to home in on the optimal weighting. This works! But it’s extremely slow, because there are so many variables interacting, and any adjustment is as likely to make a given weight worse as better.
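Here’s a rough sketch of that straw-person optimizer, assuming a hypothetical loss(weights) function that runs the net on a training image and returns |correct - observed|:

```python
import numpy as np

def random_search(loss, weights, steps=10_000, scale=0.1):
    """The straw-person optimizer: nudge every weight at random, keep the
    change only if the loss went down, and shrink the nudges over time.
    `loss` is a hypothetical function that runs the net on a training image
    and returns |correct - observed|."""
    best_loss = loss(weights)
    for step in range(steps):
        nudge = np.random.randn(*weights.shape) * scale
        candidate = weights + nudge
        candidate_loss = loss(candidate)
        if candidate_loss < best_loss:        # keep improvements only
            weights, best_loss = candidate, candidate_loss
        scale *= 0.9995                       # gradually smaller adjustments
    return weights
```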

The actual strategy is only two or three notches more sophisticated than the straw-person approach. We use fairly conventional math, but we start by doing something that would have seemed like lunacy before computers arrived on the scene. Suppose there are a million weights in our sunflower/tulip convnet. We think of the whole net as a single function of a million variables. Then, for each weight, we find the gradient of the function applied to the test image. In other words, we ask: would adding an infinitesimal amount to this weight increase or decrease the loss value? And we adjust it in the direction that decreases the loss. (Remember that we can’t figure out what a given weight does in practice without looking at the net as a whole: well, a computer can do that.)
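To illustrate the “nudge each weight a tiny bit and see which way the loss moves” idea, here’s a finite-difference sketch, again with a hypothetical loss(weights) function. (Real frameworks compute every one of these derivatives in a single backward pass, via backpropagation, rather than probing weights one at a time.)

```python
import numpy as np

def numerical_gradient(loss, weights, eps=1e-6):
    """For each weight, ask: if we add a tiny amount to it, does the loss go
    up or down, and by how much? This probes one weight at a time; real
    frameworks get the same answer for every weight in one backward pass."""
    base = loss(weights)
    grad = np.zeros_like(weights)
    for i in range(weights.size):
        bumped = weights.copy()
        bumped.flat[i] += eps
        grad.flat[i] = (loss(bumped) - base) / eps
    return grad

def gradient_step(loss, weights, learning_rate=0.01):
    """Move every weight a little in the direction that decreases the loss."""
    return weights - learning_rate * numerical_gradient(loss, weights)
```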

At first it might seem like this should give us optimal results very fast. If we can learn which way to move each weight, we just move them all where they should be and we’re golden. But there are two big speed bumps. First, it’s easy for a weight to have a very bumpy loss curve, where increasing it from its value might help at first, then hurt, and so on. Second, every time we update the net, every weight’s loss curve changes, because everything in its watershed has changed: it’s weighing different variables than it was last step.

The loss function defines a surface in million-dimensional space: the loss over the possible range of every weight in the net. We’re trying to find the lowest point of that surface – a familiar problem, usually in lower dimensions, for calculus students. In so many dimensions, actually finding the lowest point is unrealistic; in practice, we settle for following the gradient downhill to a point that is merely low enough.

The converse way of looking at this is that the convnet is trying to approximate an is-a-sunflower function of 10k variables. Some sets of 10k values represent sunflowers (or, at least, are higher on a scale of sunflower representation) and others do not. Again, we’re swimming in horrifyingly high-dimensional space. On 10k axes, it takes a painfully large number of samples to create the kind of dense data that makes for easy statistical inference.

Why do we think this is tractable? Because a variety of experience in everyday life supports the idea that our own ability to recognize sunflowers in images is continuous and subjectively “clean” in terms of pixel values. In other words, we’re confident that most tiny changes to images lead to only tiny changes in their sunflower-ness. There are “pathological” cases like optical illusions, for example. And a very slight gloss texture or brushstroke effect could make a human read an image as a picture of a picture of a sunflower. (Concretely: if your convnet is shown a Van Gogh painting, or a stock photo that’s surrounded by columns of magazine text, do you want it to say “sunflower”? Or “painting” and “magazine”? Or what? This is not really a question about convnets, though; it’s a more abstract question that they expose.) So at root, we think it’s possible for neural networks to recognize sunflowers at reasonable accuracy because the neural networks in our skulls can.

We also have a mathematical proof called the universal approximation theorem, or UAT, that says (in loose terms – if you want the math, look up the math): a network of finite size can approximate any continuous function arbitrarily well if its nodes have nonlinear activations. Nonlinear activation means: every node’s output passes through a “filter” function that’s curved.

The network as described in our sketch so far doesn’t have nonlinear activations, so it can only create linear combinations of its inputs. In other words, it can define functions with flat graphs. They can be flat in any number of dimensions, and be very good approximations of flattish functions, but we believe – for a variety of good reasons – that most functions worth approximating with a convnet are very, very non-flat. We can’t construct the lumps we need by composing any number of functions that look like y = x.

We need curved functions: y = x^2, y = tan(x), or something. We can compose those into lumps. It matters surprisingly little what these core functions are: y = x^2 is neither qualitatively better nor qualitatively worse than y = tan(x), or y = cos(x), or y = 1/(1 + e^-x), and so on. For example, if we’re using a convex function to approximate something concave, we just multiply by -1. So instead of creating a library of useful building-block functions, because no reasonable choice is a categorical improvement on any other, we pick a single one. As long as it’s nonlinear, the UAT says a convnet will be able to approximate any other function when using it to filter its nodes’ outputs. (Fast-talking drug commercial side effect list voice: Some theoretical convnet, perhaps of impractical size if you need a good approximation, and as long as the function to be approximated is continuous and bounded; other restrictions apply.)

In practice, the most popular nonlinear activation function is max(0, x), called relu. It seems like a clumsy tool for carving intricate surfaces in high-dimensional space, but it’s easy to calculate and it’s often faster to train than more elegant-looking and intuitively promising activation curves. (For reasons that are interesting but a little outside our scope here.)
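For example, here’s relu in numpy, plus a crude demonstration of the kind of lump you can build by adding a few shifted relus together – something no stack of purely linear functions could produce:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def bump(x):
    """A crude triangular bump centered at x = 0, built from three shifted
    relus. It rises to 1 at x = 0 and is flat (0) far from it."""
    return relu(x + 1) - 2 * relu(x) + relu(x - 1)

x = np.linspace(-3, 3, 7)
print(bump(x))   # [0. 0. 0. 1. 0. 0. 0.]
```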

Back to sunflowers.

What we’re trying to find here is a surface that accurately defines a level of sunflowerilarity in 10k-dimensional space. It’s helpful sometimes to pretend that there’s a Platonic form of sunflower-appearance, a definitively correct surface, that we’re trying to approximate. In practice, to define our 10k-dimensional solarfloral surface (as a point in million-dimensional parameter space), we have a tiny number of sample inputs: maybe a few thousand. Everything we’re doing is interpolating among them.

To an untrained network, without context, the amount of useful information in any single picture of a sunflower is excruciatingly small. It’s a single point in brain-hurtingly vast emptiness. Of all the possible values for 10k pixels, this particular one is a sunflower. It’s a long way from that to a robust sunflower detector. Learning can only really begin with multiple samples, including negatives, so the network can play spot the difference. And in practice, it would be surprising to reach a level of accuracy that would make a really cool demo without perhaps ten thousand samples.

Why? With too few examples, the network will start overfitting, also called memorizing. The network is finding features that distinguish sunflowers from tulips in the training data (good!), but they’re features specific to those particular images, not generalizable to images of different individual flowers, or different framing (bad!). When overfitting, the convnet is basically making wrong guesses about the nature of sunflowerity. Showing it more diverse examples will show it counterexamples to these wrong theories.

Let’s say we have fewer sample images than we’d really like, which is virtually always the case in the real world. How do we stave off overfitting? One big group of techniques is augmentation, which means manufacturing extra training images by manipulating the existing ones: flipping, rotating, cropping, jittering brightness, and so on.
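A minimal sketch of augmentation with numpy, using flips and right-angle rotations. A flipped or rotated sunflower is still a sunflower, so each variant is a nearly free extra training example.

```python
import numpy as np

def augment(image):
    """Yield cheap variations of one training image: flips and 90-degree
    rotations. Each variant discourages memorizing the original exactly."""
    yield image
    yield np.fliplr(image)              # mirror left-right
    yield np.flipud(image)              # mirror top-bottom
    for k in (1, 2, 3):
        yield np.rot90(image, k)        # rotate by 90, 180, 270 degrees
```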

We also train a convnet like this one in batches: instead of showing it one sunflower at a time, we show it perhaps 10 to 100 images and update the weights using the mean gradient over the whole batch. The underlying assumption, as before, is that there is some setting of the weights – a global minimum of the loss – that captures platonic sunflower-ness and will match every sunflower and no tulip. Averaging over a batch smooths out the effects of individual images’ quirks and keeps us pointed at that goal; if we didn’t smooth them out, every update would chase the idiosyncrasies of whichever image the net happened to see last. In that sense, batching is a small-scale defense against overfitting.
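A minimal sketch of the batch idea, reusing the hypothetical per-image loss from the training sketches above: the gradient of the mean loss is the mean of the per-image gradients, so no single image’s quirks dominate an update.

```python
import numpy as np

def batch_loss(loss_for_image, weights, batch):
    """Mean loss over a batch of (image, label) pairs. `loss_for_image` is
    the hypothetical per-image loss (e.g. |correct - observed|) used in the
    sketches above. Descending the gradient of this mean is the same as
    averaging the per-image gradients."""
    return np.mean([loss_for_image(weights, image, label)
                    for image, label in batch])
```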

Now we should look at how it begins to work. In doing so, we’ll get to double back to look at nonlinear activation.

And that’s read by a single node, a 1×1×1 layer, that

The model is organized in layers, from a layer that pulls in raw image data to a layer that makes an explicit prediction. Each layer is a

We assume that nearby pixels are related.
