This blog post is about the linear least squares problem. The method is credited to Legendre and Gauss, two of my favorite mathematicians. Why are they such inspiring people? Here is a passage from a post that goes into more depth about Gauss's application of least squares:
The 24-year-old Gauss tackled the orbit problem, assuming only Kepler’s three laws of planetary motion, with his newly discovered error distributions and his method of least squares for three months. He spent over 100 hours performing intensive calculations by hand without any mistakes (and without the luxury of today’s computers!). He had to estimate the six parameters of the orbit (as shown in Figure 7) from only 19 data points, subject to random measurement errors. He even invented new techniques such as the Fast Fourier Transform for interpolating trigonometric series, which produced efficient numerical approximations of the elliptical orbit. This is much earlier than Joseph Fourier’s 1825 results and the modern FFT algorithm by James Cooley and John Tukey in 1965, which has been described as “one of the most important numerical algorithms of our lifetime” by Heideman et al (1984). Teets & Whitehead (1999) and Tennenbaum & Director (1998) document the full mathematical details of Gauss’ work.
To make this concrete, this post is about solving the problem of going from a random scatter of points to the line that best fits them.
The input is a random set of points and the output is two numbers, $m$ and $c$, that describe a line. A line is defined by the equation $y = m x + c$.
I used Jupyter to write Python code to do this. You can download the notebook and use it alongside this document, or you can set up Jupyter and try it out yourself. It's very easy to get up and running. See here. Just run `docker run -p8888:8888 sagemath/sagemath:latest sage-jupyter`.
First I generated a random data set to work with:
```python
import random

# Define the true line equation
def line_equation(x):
    return 73 * x + 22

# Generate a data set of 74 points with Gaussian noise
data_set = []
for i in range(74):
    x = random.uniform(-10, 10)  # a random value for x between -10 and 10
    y = line_equation(x) + random.gauss(0, 90)  # Gaussian noise with mean 0 and standard deviation 90
    data_set.append((x, y))

# Print the data set
print(data_set)
```
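As a quick sanity check (my own sketch, not part of the original notebook), you can regenerate the data with a fixed seed and measure the residuals against the true line - their standard deviation should come out near the 90 we fed in:

```python
import random
import statistics

def line_equation(x):
    return 73 * x + 22

random.seed(0)  # fixed seed so the check is reproducible
data_set = []
for _ in range(74):
    x = random.uniform(-10, 10)
    data_set.append((x, line_equation(x) + random.gauss(0, 90)))

# Residuals against the true line; their spread should be close to 90
residuals = [y - line_equation(x) for x, y in data_set]
print(len(data_set), statistics.stdev(residuals))
```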
So which line is the best line? Suppose we had 3 points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$.
For a line $y = m x + c$ to fit these points well, we want:

- $y_1$ to be close to $m x_1 + c$
- $y_2$ to be close to $m x_2 + c$
- $y_3$ to be close to $m x_3 + c$
A great way to do this is to take the sum of the squares of the errors:

$$s(m, c) = (y_1 - (m x_1 + c))^2 + (y_2 - (m x_2 + c))^2 + (y_3 - (m x_3 + c))^2$$
This formula will give a small value if the points are close to the line - zero iff all points lie on the line. The further away a point is from the line, the larger the number will get.
So the problem of finding a good line can be reduced to the problem of minimizing $s(m, c)$.
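To make the error function concrete, here is a small sketch (the function name is mine, not from the post) that scores a candidate line against a data set by summing squared errors; smaller is better:

```python
def sum_of_squares_error(m, c, data_set):
    """s(m, c): sum of squared vertical distances from each point to the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in data_set)

points = [(1, 1), (2, 2), (3, 3)]
print(sum_of_squares_error(1, 0, points))  # → 0, the line y = x passes through all three points
print(sum_of_squares_error(0, 0, points))  # → 14, the line y = 0 misses them
```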
Let's look at single variable functions for a moment. Here is a graph of a function in pink. And then a graph of its derivative in blue. The local extrema A, B and C are in exactly the same locations as where the derivative is zero. The derivative of a function is its slope at that point. When the slope is zero you're either at the top of a hill or at the bottom of a dip.
The general procedure that is used time and time again in calculus to find local minima and local maxima is to take the derivative of a function and then set that to zero and solve.
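As a single-variable illustration of that procedure (my own example, assuming SymPy is available): minimize $f(x) = (x - 3)^2$ by differentiating, setting the derivative to zero, and solving:

```python
import sympy as sp

x = sp.symbols('x')
f = (x - 3) ** 2

df = sp.diff(f, x)                     # the derivative, 2*(x - 3)
critical = sp.solve(sp.Eq(df, 0), x)   # slope is zero at x = 3, the bottom of the dip
print(df, critical)
```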
This was a complete surprise to me, but just like you can do calculus and derivatives with single variable functions, you can do calculus with functions that take matrices and vectors as inputs.
So the plan for the rest of this post is as follows:
- Rewrite the formula $s(m,c)$ in terms of matrices and vectors.
- Apply matrix differentiation to it.
- Set the result of that to zero, and solve to find the local minima!
So I am going to package up the unknowns $m$ and $c$ into a column vector $B = \begin{pmatrix} m \\ c \end{pmatrix}$, and I will package up all the $x_i$ into a matrix $X$ whose rows are $(x_i, 1)$, but for the $y_i$ a plain column vector $Y$ will do.

The reason for doing this is that we can apply the line equation easily to each point in a single matrix operation:

$$X B = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \end{pmatrix} \begin{pmatrix} m \\ c \end{pmatrix} = \begin{pmatrix} m x_1 + c \\ m x_2 + c \\ m x_3 + c \end{pmatrix}$$

and finally we can express the sum of squares error function as

$$s(B) = (Y - X B)^\dagger (Y - X B)$$
The dagger symbol ($\dagger$) denotes the matrix transpose operation. If you try out the Python notebook, SageMath will expand this out in the three-point case and print back the original expression. Because we've rewritten this using matrix operations, it works for any number of data points. If you're not comfortable or familiar with linear algebra operations like matrix multiplication, a good way to conceptualize this is that it just packages up the data so the loops and sums run inside the matrix operations instead of being written out manually.
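Here is a small NumPy sketch (my own, separate from the post's Sage code) of that packaging for three points, showing that the single matrix product $X B$ applies the line equation to every $x_i$ at once:

```python
import numpy as np

m, c = 2.0, 5.0
xs = np.array([1.0, 2.0, 3.0])

X = np.column_stack([xs, np.ones_like(xs)])  # each row is (x_i, 1)
B = np.array([m, c])                         # the unknowns packaged as a vector

print(X @ B)  # → [ 7.  9. 11.], i.e. m*x_i + c for each point
```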
It took me a couple of days to make sense of this. I kept getting it wrong - good thing I knew the exact answer I was supposed to get. The rules for differentiating functions with matrices and vectors in them are similar to the normal differentiation rules, but the transpose part has a strange "non-local" rule to it that really surprised me. And of course you have to be careful not to commute terms, since for matrices $M N \neq N M$ in general.
Let's use the transpose rules:

- $(M + N)^\dagger = M^\dagger + N^\dagger$
- $(M N)^\dagger = N^\dagger M^\dagger$
and some derivative rules:

- Product rule: $\mathrm{d}(M N) = \mathrm{d}(M)\, N + M\, \mathrm{d}(N)$
- Transpose rule: $\mathrm{d}(M^\dagger) = (\mathrm{d}M)^\dagger$
- Gradients with respect to a vector $B$: $\frac{\partial}{\partial B}(C^\dagger B) = \frac{\partial}{\partial B}(B^\dagger C) = C$, and in particular $\frac{\partial}{\partial B}(B^\dagger B) = 2 B$
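These rules can be spot-checked numerically. A sketch (my own construction) comparing the claimed gradient $2B$ of $B^\dagger B$ against central finite differences:

```python
import numpy as np

def numerical_gradient(f, B, eps=1e-6):
    """Approximate the gradient of f at B by central finite differences."""
    grad = np.zeros_like(B)
    for i in range(len(B)):
        step = np.zeros_like(B)
        step[i] = eps
        grad[i] = (f(B + step) - f(B - step)) / (2 * eps)
    return grad

f = lambda B: B @ B  # B†B for a real vector
B = np.array([3.0, -1.5])

print(numerical_gradient(f, B))  # should be close to 2*B = [6, -3]
```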
Multiply out

$$s(B) = (Y - X B)^\dagger (Y - X B) = Y^\dagger Y - Y^\dagger X B - B^\dagger X^\dagger Y + B^\dagger X^\dagger X B$$

and differentiate:

$$\frac{\partial s}{\partial B} = -2 X^\dagger Y + 2 X^\dagger X B$$

So we can now set this to zero and solve for $B$:

$$B = (X^\dagger X)^{-1} X^\dagger Y$$
Could this really work!?
Let's try it out:
```python
def least_squares(data_set):
    X = matrix([vector([point[0], 1]) for point in data_set])
    Y = matrix([point[1] for point in data_set]).transpose()
    return (X.transpose() * X).inverse() * X.transpose() * Y

least_squares([(1, 1), (2, 2), (3, 3)])
```
gives (1, 0) which is right.
`least_squares(data_set)`
gives (73.87082950043062, 17.864774499430233)
very close to the numbers I used to generate the data set! It seems to work!
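If you don't have Sage handy, here is a NumPy version (my translation, not the post's code). Since `np.linalg.lstsq` solves the same minimization problem, it should agree with the closed form:

```python
import numpy as np

def least_squares(data_set):
    X = np.array([[x, 1.0] for x, _ in data_set])
    Y = np.array([y for _, y in data_set])
    return np.linalg.inv(X.T @ X) @ X.T @ Y  # B = (X†X)^(-1) X†Y

points = [(1.0, 1.0), (2.0, 2.0), (3.0, 4.0)]
B_closed = least_squares(points)
B_lstsq, *_ = np.linalg.lstsq(np.array([[x, 1.0] for x, _ in points]),
                              np.array([y for _, y in points]), rcond=None)
print(B_closed, B_lstsq)  # the two solutions should match
```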
I find this formula incredible and I hope you got something good out of this post. It seems to do so much, taking into account an entire data set and working something very nontrivial out about it in a single formula.
This stuff came up because I'm reading The Little Learner, and they introduce the problem of best fit, starting with a line. They go on to solve it with gradient descent - a different approach that requires iteration, but generalizes in ways this closed-form solution does not.
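For contrast, here is a minimal gradient-descent sketch of the same line fit (my own toy version, not The Little Learner's code): instead of solving in one shot, repeatedly step $m$ and $c$ downhill along the gradient of the error:

```python
def fit_line_gd(data_set, lr=0.01, steps=5000):
    """Minimize the mean squared error by gradient descent instead of the closed-form solve."""
    m, c = 0.0, 0.0
    n = len(data_set)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to m and c
        grad_m = sum(-2 * x * (y - (m * x + c)) for x, y in data_set) / n
        grad_c = sum(-2 * (y - (m * x + c)) for x, y in data_set) / n
        m -= lr * grad_m
        c -= lr * grad_c
    return m, c

print(fit_line_gd([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]))  # should approach (1, 0)
```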
> `return (X.transpose() * X).inverse() * X.transpose() * Y`

That appears to commute the $Y^\dagger X$ term from your derivation above it: $B = (X^\dagger X)^{-1} Y^\dagger X$ -- I think it might work out since you also swapped the transposes and they're vectors, but it reads a little funny.
Great article otherwise -- Least Squares and Gradient Descent are such useful tools!