@rain-1
Last active March 6, 2023 01:13
least squares

Introduction

This blog post is about the Linear Least Squares Problem. The method is credited to Legendre and Gauss, two of my favorite mathematicians. Why are they such inspiring people? Here is a passage from a post that goes into more depth about Gauss's application of least squares:

The 24-year-old Gauss tackled the orbit problem, assuming only Kepler’s three laws of planetary motion, with his newly discovered error distributions and his method of least squares for three months. He spent over 100 hours performing intensive calculations by hand without any mistakes (and without the luxury of today’s computers!). He had to estimate the six parameters of the orbit (as shown in Figure 7) from only 19 data points, subject to random measurement errors. He even invented new techniques such as the Fast Fourier Transform for interpolating trigonometric series, which produced efficient numerical approximations of the elliptical orbit. This is much earlier than Joseph Fourier’s 1825 results and the modern FFT algorithm by James Cooley and John Tukey in 1965, which has been described as “one of the most important numerical algorithms of our lifetime” by Heideman et al (1984). Teets & Whitehead (1999) and Tennenbaum & Director (1998) document the full mathematical details of Gauss’ work.

To make this concrete, the post is about solving the problem of going from plot 1 (a scatter of noisy points) to plot 2 (the same points with a best-fit line drawn through them).

The input is a random set of points and the output is two numbers that describe a line. A line is defined by the equation $y = m x + c$: $x$ and $y$ are variables, while $m$ (the slope) and $c$ (the intercept) are numbers. We want to find the 'best' $m$ and $c$.

Jupyter

I used Jupyter to write Python code to do this. You can download the notebook (its export is included at the end of this post) and use it alongside this document, or you can set up Jupyter and try it out yourself. It's very easy to get up and running: just `docker run -p8888:8888 sagemath/sagemath:latest sage-jupyter`.

First I generated a random data set to work with:

import random

# Define the line equation
def line_equation(x):
    return 73 * x + 22

# Generate a data set of 74 points with Gaussian noise
data_set = []
for i in range(74):
    x = random.uniform(-10, 10)  # generate a random value for x between -10 and 10
    y = line_equation(x) + random.gauss(0, 90)  # add Gaussian noise with mean 0 and standard deviation 90
    data_set.append((x, y))

# Print the data set
print(data_set)

The best line

So which line is the best line? Suppose we had 3 points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$.

For $y = m x + c$ to be a good line we want:

  • $y_1$ to be close to $m x_1 + c$
  • $y_2$ to be close to $m x_2 + c$
  • $y_3$ to be close to $m x_3 + c$

A great way to do this is to take the sum of the squares of the error:

$$s(m,c) = (y_1 - (m x_1 + c))^2 + (y_2 - (m x_2 + c))^2 + (y_3 - (m x_3 + c))^2$$

This formula will give a small value if the points are close to the line - zero iff all points lie on the line. The further away a point is from the line, the larger the number will get.

So the problem of finding a good line can be reduced to the problem of minimizing $s(m,c)$.
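
Here is that error function written out in plain Python (a small sketch; the function name `sum_of_squared_errors` is mine, not from the notebook), run against the data set generated earlier:

def sum_of_squared_errors(m, c, data_set):
    # Sum of squared vertical distances between each point and the line y = m*x + c
    return sum((y - (m * x + c)) ** 2 for (x, y) in data_set)

# The parameters used to generate the data should score far better than a bad guess:
print(sum_of_squared_errors(73, 22, data_set))  # roughly 74 * 90^2, i.e. just the noise
print(sum_of_squared_errors(0, 0, data_set))    # much larger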

Local extrema

Let's look at single variable functions for a moment. Below is a graph of a function in pink, together with a graph of its derivative in blue. The local extrema A, B and C are exactly where the derivative is zero. The derivative of a function at a point is its slope there: when the slope is zero you're either at the top of a hill or at the bottom of a dip.

[Figure: 2_1_function_to_derivative - a function (pink) above its derivative (blue); the extrema A, B and C line up with the zeros of the derivative]

The general procedure that is used time and time again in calculus to find local minima and local maxima is to take the derivative of a function and then set that to zero and solve.
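
For a quick worked example, take the single variable function $f(x) = (x - 3)^2 + 5$:

$$f'(x) = 2(x - 3), \qquad 2(x - 3) = 0 \implies x = 3,$$

and $x = 3$ is indeed where $f$ takes its smallest value, $5$. We are going to do exactly the same thing to $s(m, c)$, except the variable will be a vector.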

The plan

This was a complete surprise to me, but just like you can do calculus and take derivatives of single variable functions $\mathbb R \to \mathbb R$, you can also do this with functions to and from matrices and vectors. There is an entire, terrifying Wikipedia page about this: https://en.wikipedia.org/wiki/Matrix_calculus

So the plan for the rest of this post is as follows:

  • Rewrite the formula $s(m,c)$ in terms of matrices and vectors.
  • Apply matrix differentiation to it.
  • Set the result to zero and solve to find the local minimum!

So I am going to package up $m$ and $c$ into a vector (basically a pair):

$$B = \displaystyle \left(\begin{array}{r} m \\ c \end{array}\right)$$

and I will package up all the $y$ values of our data set into a vector too:

$$Y = \displaystyle \left(\begin{array}{r} y_{1} \\ y_{2} \\ y_{3} \end{array}\right)$$

but for the $x$ coordinates I will make them into a 3x2 matrix like this:

$$X = \displaystyle \left(\begin{array}{rr} x_{1} & 1 \\ x_{2} & 1 \\ x_{3} & 1 \end{array}\right)$$

The reason for doing this is that we can apply the line equation easily to each point in a single matrix operation:

$$X \cdot B = \displaystyle \left(\begin{array}{r} m x_{1} + c \\ m x_{2} + c \\ m x_{3} + c \end{array}\right)$$

$$Y - X \cdot B = \displaystyle \left(\begin{array}{r} -m x_{1} - c + y_{1} \\ -m x_{2} - c + y_{2} \\ -m x_{3} - c + y_{3} \end{array}\right)$$

and finally we can express the sum of squares error function $$s(B) = (Y - X \cdot B)^{\dagger} (Y - X \cdot B)$$

The dagger symbol denotes the matrix transpose operation. If you try out the Python notebook, SageMath will expand this out in the three-point case and print back the original sum-of-squares expression. Because we've rewritten everything using matrix operations, the same formula works for any number of data points. If you're not comfortable with linear algebra operations like matrix multiplication, a good way to conceptualize this is that it simply packages up the data so that the loops and sums are carried out by the matrix operations instead of being written out manually.
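
To convince yourself that the matrix expression really is the same sum of squares, here is a small numerical check (a sketch using numpy; the tiny data set and candidate $m$, $c$ are made up for illustration - the notebook does the symbolic version in Sage):

import numpy as np

points = [(1.0, 2.0), (2.0, 3.5), (4.0, 9.0)]  # three made-up points
m, c = 2.0, 0.5                                # an arbitrary candidate line

# Loop version of s(m, c)
s_loop = sum((y - (m * x + c)) ** 2 for (x, y) in points)

# Matrix version: (Y - X B)^T (Y - X B)
X = np.array([[x, 1.0] for (x, _) in points])  # one row per point, second column all ones
Y = np.array([[y] for (_, y) in points])       # column vector of y values
B = np.array([[m], [c]])                       # parameter vector
s_matrix = ((Y - X @ B).T @ (Y - X @ B))[0, 0]

print(s_loop, s_matrix)                        # both print 1.5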

Matrix Differentiation

It took me a couple of days to make sense of this. I kept getting it wrong - good thing I knew the exact answer I was supposed to get. The rules for differentiating functions that contain matrices and vectors are similar to the normal differentiation rules, but the transpose part behaves in a strangely "non-local" way that really surprised me. And of course you have to be careful not to commute terms, since for matrices $AB \neq BA$ in general. This PDF helped me.

Let's use $\mathrm d$ to denote the gradient with respect to the vector $B$, i.e. the column vector of partial derivatives with respect to $m$ and $c$. Let $M, N$ be any matrices, and let $A$ (a matrix) and $C$ (a vector) not depend on $B$. First, two transpose facts:

  • $(M + N)^\dagger = M^\dagger + N^\dagger$
  • $(M N)^\dagger = N^\dagger M^\dagger$

and some derivative rules

  • $\mathrm{d}(C^\dagger B) = \mathrm{d}(B^\dagger C) = C$, since $C^\dagger B$ and $B^\dagger C$ are the same scalar (see the worked example just below this list)
  • $\mathrm{d}(B^\dagger A B) = (A + A^\dagger) B$, which is $2 A B$ when $A$ is symmetric
  • In particular, $\mathrm{d}(B^\dagger B) = 2 B$
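
To see why the first rule holds, write it out in components for our two-dimensional $B$, with $C$ a constant vector whose entries I'll call $p$ and $q$:

$$B = \displaystyle \left(\begin{array}{r} m \\ c \end{array}\right), \quad C = \displaystyle \left(\begin{array}{r} p \\ q \end{array}\right), \quad B^\dagger C = m p + c q, \quad \mathrm{d}(B^\dagger C) = \displaystyle \left(\begin{array}{r} \partial (m p + c q) / \partial m \\ \partial (m p + c q) / \partial c \end{array}\right) = \displaystyle \left(\begin{array}{r} p \\ q \end{array}\right) = C$$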

Multiply out

$$ \begin{align} &(Y - XB)^\dagger (Y - XB) \\ =& (Y^\dagger - B^\dagger X^\dagger) (Y - XB) \\ =& Y^\dagger Y - B^\dagger X^\dagger Y - Y^\dagger XB + B^\dagger X^\dagger XB \\ \end{align} $$

and differentiate. Note that $B^\dagger X^\dagger Y$ and $Y^\dagger X B$ are transposes of each other, and both are $1 \times 1$, so they are the same scalar; each contributes a gradient of $X^\dagger Y$ by the first rule (with $C = X^\dagger Y$), and the last term uses $A = X^\dagger X$, which is symmetric:

$$ \begin{align} &\mathrm{d}[(Y - XB)^\dagger (Y - XB)] \\ =& \mathrm{d}[Y^\dagger Y - B^\dagger X^\dagger Y - Y^\dagger XB + B^\dagger X^\dagger XB] \\ =& 0 - \mathrm{d}[B^\dagger X^\dagger Y] - \mathrm{d}[Y^\dagger X B] + \mathrm{d}[B^\dagger X^\dagger XB] \\ =& - X^\dagger Y - X^\dagger Y + 2 X^\dagger X B \\ =& - 2 X^\dagger Y + 2 X^\dagger X B \end{align} $$
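
As a sanity check on that last line, we can compare it against a numerical approximation of the gradient (a sketch using numpy on a made-up data set; this is not part of the original notebook):

import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 1.0]])  # three made-up points, column of ones
Y = np.array([[2.0], [3.5], [9.0]])
B = np.array([[1.5], [-0.5]])                        # an arbitrary (m, c)

def s(B):
    r = Y - X @ B
    return (r.T @ r)[0, 0]                           # the sum of squared errors

analytic = -2 * X.T @ Y + 2 * X.T @ X @ B            # the gradient derived above

eps = 1e-6
e_m = np.array([[1.0], [0.0]])                       # perturb m only
e_c = np.array([[0.0], [1.0]])                       # perturb c only
numeric = np.array([[(s(B + eps * e_m) - s(B - eps * e_m)) / (2 * eps)],
                    [(s(B + eps * e_c) - s(B - eps * e_c)) / (2 * eps)]])

print(analytic.flatten(), numeric.flatten())         # both are approximately [-34, -11]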

So we can now set this to zero and solve for $B$:

$$ \begin{align} 0 =& - 2 X^\dagger Y + 2 X^\dagger X B \\ X^\dagger X B =& X^\dagger Y \\ B =& (X^\dagger X)^{-1} X^\dagger Y \\ \end{align} $$

These are the normal equations, and the last line is exactly the formula the code below computes.

Verifying this in code

Could this really work!?

Let's try it out:

def least_squares(data_set):
    X = matrix([vector([point[0], 1]) for point in data_set])
    Y = matrix([point[1] for point in data_set]).transpose()
    return (X.transpose() * X).inverse() * X.transpose() * Y
least_squares([(1,1), (2, 2), (3, 3)])

gives (1, 0), which is right: those three points lie exactly on the line $y = 1 \cdot x + 0$.

least_squares(data_set)

gives (73.87082950043062, 17.864774499430233)

close to the numbers (73 and 22) I used to generate the data set, given noise with standard deviation 90. It seems to work!

Zooming out

I find this formula incredible and I hope you got something good out of this post. It seems to do so much, taking into account an entire data set and working something very nontrivial out about it in a single formula.

This stuff came up because I'm reading The Little Learner, which introduces the problem of best fit starting with a line. The book goes on to solve it with gradient descent - a different approach that requires iteration, but generalizes in directions that this closed-form solution does not.
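
To give a flavor of that approach, here is a minimal gradient descent sketch for the same problem (my own illustration, not code from the book; it reuses the gradient $-2 X^\dagger (Y - XB)$ derived above, and the learning rate and step count are hand-tuned for this data range):

import numpy as np

def fit_line_gradient_descent(data_set, learning_rate=1e-4, steps=10000):
    # Fit y = m*x + c by repeatedly stepping downhill on the sum of squared errors
    X = np.array([[x, 1.0] for (x, _) in data_set])
    Y = np.array([[y] for (_, y) in data_set])
    B = np.zeros((2, 1))                    # start at m = 0, c = 0
    for _ in range(steps):
        gradient = -2 * X.T @ (Y - X @ B)   # the gradient we derived above
        B = B - learning_rate * gradient    # take a small step downhill
    return B.flatten()                      # returns (m, c)

# On the data set generated earlier this lands very near the closed-form answer:
print(fit_line_gradient_descent(data_set))

What follows below is the raw export of the Jupyter notebook used for this post.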

#!/usr/bin/env python
# coding: utf-8
# In[1]:
import random
# Define the line equation
def line_equation(x):
    return 73 * x + 22
# Generate a data set of 74 points with Gaussian noise
data_set = []
for i in range(74):
    x = random.uniform(-10, 10)  # generate a random value for x between -10 and 10
    y = line_equation(x) + random.gauss(0, 90)  # add Gaussian noise with mean 0 and standard deviation 90
    data_set.append((x, y))
# Print the data set
print(data_set)
# In[2]:
import matplotlib.pyplot as plt
plt.clf()
def plot_points(data_set):
    # Extract the x and y values from the data set
    x_values = [point[0] for point in data_set]
    y_values = [point[1] for point in data_set]
    # Plot the points
    plt.scatter(x_values, y_values)
    # Add axis labels and a title
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Plot of Points')
    #size
    #plt.figure(figsize=(20, 20))
    #plt.figure(figsize=(12,8), dpi= 72, facecolor='w', edgecolor='k')
    # Show the plot
    plt.show()
    return 0
# In[3]:
get_ipython().run_line_magic('matplotlib', 'notebook')
# In[4]:
plot_points(data_set)
# In[6]:
get_ipython().run_line_magic('display', 'latex')
# In[7]:
var('x_1 x_2 x_3 y_1 y_2 y_3')
# In[8]:
X_raw = [x_1, x_2, x_3]
# In[9]:
Y = matrix([y_1, y_2, y_3]).transpose()
# In[10]:
Ones = [1, 1, 1]
# In[36]:
X = matrix([X_raw, Ones]).transpose()
# In[37]:
var('m c')
# In[38]:
X
# In[39]:
Y
# In[40]:
B = matrix([m, c]).transpose();
B
# In[41]:
X * B
# In[42]:
Y - X * B
# In[43]:
(Y - X * B).transpose() * (Y - X * B)
# In[18]:
var('C')
del C
# In[44]:
def least_squares(data_set):
    X = matrix([vector([point[0], 1]) for point in data_set])
    Y = matrix([point[1] for point in data_set]).transpose()
    #return X
    #return Y
    return (X.transpose() * X).inverse() * X.transpose() * Y
# In[45]:
least_squares([(1,1), (2, 2), (3, 3)])
# In[46]:
b = least_squares(data_set);
b
# In[47]:
b[0][0]
# In[48]:
import matplotlib.pyplot as plt
plt.clf()
plt.close()
def plot_points_and_line(data_set, b):
    # Extract the x and y values from the data set
    x_values = [point[0] for point in data_set]
    y_values = [point[1] for point in data_set]
    # Plot the points
    plt.scatter(x_values, y_values)
    # Add axis labels and a title
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Plot of Points')
    #size
    #plt.figure(figsize=(20, 20))
    #plt.figure(figsize=(12,8), dpi= 72, facecolor='w', edgecolor='k')
    plt.axline((0, b[1][0]), slope=b[0][0], linewidth=4, color='r')
    # Show the plot
    plt.show()
    return 0
# In[27]:
plot_points_and_line(data_set, b)
