Spaces of programs
The practical development of deep learning and its associated infrastructure has initiated a broad re-examination of the practice of computer programming. In this document we briefly survey how this discussion has evolved over the past few years, and then describe our point of view on the underlying mathematics.
We begin with some appeals to authority, in the form of the following references:
- P. Norvig, "Deep Learning and Understandability versus Software Engineering and Verification", March 2016.
- P. Norvig, "As we may program", May 2017.
- F. Chollet, "The future of deep learning", July 2017.
- A. Karpathy, "Software 2.0", November 2017.
- Y. LeCun, "Deep Learning est mort. Vive Differentiable Programming!", January 2018.
- E. Meijer, "Alchemy For the Modern Computer Scientist", June 2018.
Briefly, Peter Norvig is a director of research at Google and coauthor with Stuart Russell of one of the most famous texts in artificial intelligence, Francois Chollet is a researcher and the author of the popular deep learning framework Keras, Andrej Karpathy is director of AI at Tesla, Yann Lecun is one of the pioneers of convolutional neural networks and Chief AI Scientist at Facebook, Erik Meijer is Director of Engineering at Facebook.
I highly recommend taking the time to look at these references. The rest of this section contains some highlights relevant to this document. Norvig's talk "As we may program" discusses various aspects of how machine learning will affect the nature of programming, with an emphasis on the role of data science:
(from 6:50) but now I think there is a third approach (to Computer Science) which is that it is an empirical science... so now instead of saying "this is true" you say "I ran an experiment and I got these results". So it's not a mathematical science, now it is a natural science... So recently, we've seen a lot of use of machine learning and this type of thinking and we are going to have to figure out how that fits in with the rest of programming. Because now it feels like there's a machine learning system and you call it and you get a result back... then you take that result over to your regular programming code and then you do stuff with it. And it feels like the two aren't really connected, and that's too bad because we could probably do a much better job if it was integrated all the way through. Somebody is going to have to think about how to do that...
(from 10:20) What I'm asking is, can we build a system that integrates regular programming with machine learning so that everybody can do machine learning with their code and it all works together right.
From Chollet's article (see also his talk on this material):
As we noted in our previous post, a necessary transformational development that we can expect in the field of machine learning is a move away from models that perform purely pattern recognition and can only achieve local generalization, towards models capable of abstraction and reasoning, that can achieve extreme generalization...
What could be the path to make this happen? ... imagine neural networks that would be "augmented" in a similar way with programming primitives such as for loops—but not just a single hard-coded for loop with a hard-coded geometric memory, rather, a large set of programming primitives that the model would be free to manipulate to expand its processing function, such as if branches, while statements, variable creation, disk storage for long-term memory, sorting operators, advanced datastructures like lists, graphs, and hashtables, and many more. The space of programs that such a network could represent would be far broader than what can be represented with current deep learning models, and some of these programs could achieve superior generalization power.
A related subfield of AI that I think may be about to take off in a big way is that of program synthesis, in particular neural program synthesis. Program synthesis consists in automatically generating simple programs, by using a search algorithm (possibly genetic search, as in genetic programming) to explore a large space of possible programs. The search stops when a program is found that matches the required specifications, often provided as a set of input-output pairs. As you can see, is it highly reminiscent of machine learning: given "training data" provided as input-output pairs, we find a "program" that matches inputs to outputs and can generalize to new inputs. The difference is that instead of learning parameter values in a hard-coded program (a neural network), we generate source code via a discrete search process.
From Karpathy's article:
The “classical stack” of Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. It consists of explicit instructions to the computer written by a programmer. By writing each line of code, the programmer identifies a specific point in program space with some desirable behavior.
Software 2.0 is written in neural network weights. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of hard (I tried). Instead, we specify some constraints on the behavior of a desirable program (e.g., a dataset of input output pairs of examples) and use the computational resources at our disposal to search the program space for a program that satisfies the constraints. In the case of neural networks, we restrict the search to a continuous subset of the program space where the search process can be made (somewhat surprisingly) efficient with backpropagation and stochastic gradient descent
From LeCun's Facebook post:
But the important point is that people are now building a new kind of software by assembling networks of parameterized functional blocks and by training them from examples using some form of gradient-based optimization... It's really very much like a regular progam, except it's parameterized, automatically differentiated, and trainable/optimizable.
Finally, from Meijer's talk:
(from 27:00) the future of programming is differentiable programming.
At a high level some of these ideas are old ones: program induction and program synthesis are well-studied areas of computer science. The word synthesis is very interesting. My dictionary gives the following three definitions:
- The combination of components or elements to form a connected whole: the ideology represented a synthesis of certain ideas.
- The production of chemical compounds by reaction from simpler materials: the synthesis of methanol from carbon monoxide and hydrogen.
- (in Hegelian philosophy) the final stage in the process of dialectical reasoning, in which a new idea resolves the conflict between thesis and antithesis.
For an excellent introduction and many entry points into the literature see:
- W. Zaremba's thesis "Learning algorithms from data".
- A. L. Gaunt, M. Brockschmidt, R. Singh, N. Kushman, P. Kohli, J. Taylor, D. Tarlow "TerpreT: A Probabilistic Programming Language for Program Induction" (see also this summary and a more recent article on the "TerpreT problem").
As explained in Norvig's first talk, the recent excitement is the result of progress in the problem of program synthesis in a restricted domain, namely neural networks, in which the problem may be approached using the methods of automatic differentiation and gradient descent. These methods are sufficient for synthesising many useful programs (for example, programs that recognize objects in images). The quotes chosen above suggest a future in which programmers guide the assembly of programs that are partly traditional code, and partly code that is optimised by gradient descent (or another search method). We will use the umbrella term program synthesis to refer to the entire continuum of such guided assembly processes, with traditional programming at one extreme. If, as Norvig suggests, these trends lead to computer science becoming more of an empirical science, what kind of mathematics will that future computer science involve?
Question: what is the mathematics of program synthesis?
A mathematical theory of program synthesis
We have a good mathematical theory of programs (from Turing machines to Cartesian closed categories, etc.) but we do not yet have a mathematical theory of program synthesis. What we have is a developing body of real-world practice. The closest part of mathematics is the calculus of variations, which gives a powerful set of tools for finding functions which minimise a given functional. But algorithms and functions are not the same thing, and that is why program synthesis is an interesting opportunity to develop new mathematics.
The technique at the heart of the current body of practice is automatic differentiation, which is the foundation of languages like TensorFlow and PyTorch. However, there is more to program synthesis than automatic differentiation. Some of the foundations for a deeper mathematical theory are already clear in the algorithmic formalisation of inductive inference ala Solomonoff, and the existing literature on program synthesis. However there are clear gaps, such as the lack of a proper theory of "spaces" of programs.
I know of two situations in which the terminology "space of programs" has some genuine content (as opposed to just being an evocative synonym for "set of programs"). Both examples consist of bijective maps
P: M -> A where
A denotes a set of algorithms and
M is a smooth manifold (possibly with corners). Thus
P gives a parametrisation of some class of algorithms, and we may use
P to equip
A with the structure of a manifold. There are plenty of meaningless bijections between infinite sets, so the existence of a bijection
P is not a priori very meaningful. What makes the examples meaningful is that, in both cases, the bijection
P arises in the context of a learning problem whose loss function is defined as a real-valued function on
A by evaluating algorithms on training data, in such a way that the partial derivatives of the loss function at a point
x \in A (that is, an algorithm) are defined by evaluating the derivatives of
x on the training data (in the two cases cases the word derivative is interpreted differently, but in both cases the derivative is an algorithm in its own right). So there is a genuine sense in which some kind of infinitesimal smooth structure of the set
A is being used in the learning process.
The first example is the one which motivates the above discussion, namely, neural networks. Any specific architecture of neural network determines a function
M = R^N is the space of weights and
A is a set of numerical algorithms implementing the given type of neural network in a language like C, or the intermediate language for some processor. The loss function in a learning problem is actually defined as a function
A -> R by evaluating these algorithms on training data, and the derivatives of the loss function at a point
x \in A are computed by evaluating the derivatives of
x on the training data (where the "derivatives" are the algorithms produced by automatic differentiation, which compute the derivatives of the numerical function encoded by
The second example is the one studied in Section 7 of the paper
- J. Clift and D. Murfet "Derivatives of Turing machines in linear logic", link.
where there is an implicit bijection
P: [0,1]^a -> A where
A is a set of terms in a probabilistic extension of differential linear logic. An inductive inference problem determines a loss function on this manifold, and the derivatives of the loss at a point
x \in A are computed using the Ehrhard-Regnier derivatives of
These examples suggest there is some mathematical content to the idea of spaces of programs, and that there may be interesting fundamental mathematics lying behind the idea of synthesising programs by gradient descent. However, to echo a point from Norvig's first talk above, traditional program induction was successful at synthesising some short programs but failed at synthesising large programs, and so far the new approaches to program synthesis (e.g. neural program synthesis) have not demonstrated that they are able to go much beyond this. See TerpreT paper for a sober look at the success (or lack thereof) to date in this direction.
Just for fun: differentiating programs is mad
Here is Dijkstra in his essay "On the cruelty of really teaching computing science":
It is possible, and even tempting, to view a program as an abstract mechanism, as a device of some sort. To do so, however, is highly dangerous: the analogy is too shallow because a program is, as a mechanism, totally different from all the familiar analogue devices we grew up with. Like all digitally encoded information, it has unavoidably the uncomfortable property that the smallest possible perturbations —i.e. changes of a single bit— can have the most drastic consequences. [For the sake of completness I add that the picture is not essentially changed by the introduction of redundancy or error correction.] In the discrete world of computing, there is no meaningful metric in which "small" changes and "small" effects go hand in hand, and there never will be.
This is discussed beginning from 21:30 in Norvig's first talk in the above list of references. In a similar vein it is sometimes argued that extending neural networks with more programmatic elements necessarily means abandoning the idea of end-to-end differentiability. An example is Chollet's observation that
If machine learning models become more like programs, then they will mostly no longer be differentiable—certainly, these programs will still leverage continuous geometric layers as subroutines, which will be differentiable, but the model as a whole would not be. As a result, using backpropagation to adjust weight values in a fixed, hard-coded network, cannot be the method of choice for training models in the future—at least, it cannot be the whole story.
We do not agree!