jwilk/umap_primes.md Secret

## umap_primes.md

      
    Raw
  

              umap_primes.md
            
          
    Source: https://johnhw.github.io/umap_primes/index.md.html
What do numbers look like?


This is the first million integers, represented as binary vectors indicating their prime factors, and laid out using the UMAP dimensionality reduction algorithm by Leland Mcinnes.
Each integer is represented in a high-dimensional space, and gets squished down to 2D so that numbers with similar prime factorisations are closer
together than those with dissimilar factorisations.
A very pretty structure emerges; this might be spurious in that it captures
more about the layout algorithm than any "true" structure of numbers. However, the visual effect is very appealling and requires no tricky manipulation to create.
Large 4k version of the image suitable for backgrounds
Video

A short video, showing the numbers being added in sequence. Each frame adds 1000 integers,
for a total of 1000 frames.
https://www.youtube.com/embed/nCk8dyU7zUM
Representation

Each integer is represented
as a vector, one element for each possible prime factor, where 0=prime factor not present, 1=prime factor present. This provides a simple geometry for the
integers, where each number can be seen as being composed of prime basis vectors.
For example, 10 might be represented as [1 0 1 0 0 0 0 ...] as it has factors 2 and 5 (the first and third primes respectively).
These high-dimensional vectors are then squashed down
from thousands of dimensions for each number to just two dimensions, and a point plotted at that coordinate. The points are coloured according to
the original integer index; brighter colours correspond to larger numbers. The points are laid out so that numbers with similar "angles" to each other are close together. For example, all primes are at 90 degrees to each other, and all powers of a number are identical to each other (0 degree angle).
Algorithm


for each i up to n

find prime factors of i.
For each i, make binary vector with π(n) columns (π(n) = number of primes less than n), where each element has 1=prime factor present, 0=absent.


Stack all vectors into a large, sparse matrix


Lay out with UMAP, using the cosine distance function (dot product).


I used li(n) as a cheap upper bound on π(n) to estimate the number of columns needed.
For example, for n=11, the matrix looks like:

             2  3  5  7  11       
            [0. 0. 0. 0. 0.] 0
            [0. 0. 0. 0. 0.] 1
            [1. 0. 0. 0. 0.] 2
            [0. 1. 0. 0. 0.] 3
            [1. 0. 0. 0. 0.] 4
            [0. 0. 1. 0. 0.] 5
            [1. 1. 0. 0. 0.] 6
            [0. 0. 0. 1. 0.] 7
            [1. 0. 0. 0. 0.] 8
            [0. 1. 0. 0. 0.] 9
            [1. 0. 1. 0. 0.] 10
            [0. 0. 0. 0. 1.] 11


The image above is reduced from a 1000000x78628 (!) binary matrix to 1000000x2.  This only works because of the excellent sparse matrix support in UMAP.
What are we seeing?

The points can be filtered according to whatever criteria we might be interested in. For example, the parity of the numbers (even=green, odd=orange)

or alternatively by the whether the points are prime or not (white=prime, blue=composite).

As might be expected, the primes form the vague, diffuse cluster near the centre, as they are all orthogonal to each other.
Extending this idea, we can colour the points by number of unique factors:

We can colour the points according to the prime factors they contain, where the log of the factor is mapped to hue, with redder=smaller, bluer=larger
factor. This generalises the parity plot above. This plot has just a touch of spatial noise added to try and separate out the colours more clearly,
as each factor pass is rendered as a separate layer.

Sequences

We can look at where common sequences of numbers lie. This isn't particularly informative, but there is perhaps some pattern.
For example, the Fibonacci sequence:

or the squares:

The frontier

It's worth noting that any structure that is not an algorithmic artificat is really the frontier of numbers at some threshold (in this case, 1 million).
Every possible
finite binary vector corresponds to at least one integer in this representation (actually, infinitely many, as all repeated factors are
contracted onto a single binary indicator). Therefore, in the infinite set of integers, all points are present, at every possible angle or distance from
each other, and the space would be perfectly evenly filled. However, by cutting off at some specific n, we see the "surface" that we have reached by composing together primes such that
their product is less than n.
Code

The python code is available.
Massive ~10k resolution version 16Mb