This is an algorithm for calculating a euclidean distance matrix between each point to each other point in a data set, written as part of a agglomerative clustering exercise, using numpy's broadcasting, but could be applied to any other calculation done on multidimensional data sets.
The basic concept is that, when adding or multiplying two vectors of sizes (m,1) and (1,m), numpy will broadcast (duplicate the vector) so that it allows the calculation. For example multiplying a vector [1,2,3,4,...10] with a transposed version of itself, will yield the multiplication table. For Example:
The same technique can be used for matrices. In this case, I was looking to generate a Euclidean distance matrix for the iris data set. The formula is:
Obtaining the table could obviously be done using two nested for loops, but can also be performed using matrix operations.
We take the data set (represented as 'A' in the illustration),
create a 3D copy of it, and rotate it to the orientation shown as 'B' in the illustration.
We then create another copy and rotate it as represented by 'C'.
B-C will generate a 3D cube ('D'), sized (m,m,n) which represents the calculation
for each dimension of the original dataset vectors (n=4 in the iris data set).
Then, we simply raise the entire cube to the power of 2, sum the result in the axis shown in the illustration, and square.
The result is the multiplication matrix ('E').
Note- It's critical not to get the orientations confused when constructing A and B, obviously... :)
code:
import numpy as np
from sklearn.datasets import load_iris
Load data:
d, _ = load_iris(return_X_y=True)
def euc_matrix(A):
# generate distance matrix:
B = np.rot90(A[:,:,None],1,(1,2)) #[:,:,None] is needed to add a dimension
C = np.rot90(B,1,(0,1))
return np.sqrt(np.sum((B-C)**2, axis=2))
mat = euc_matrix(A)
Michael Chein 04/01/2019