Last active
June 15, 2023 09:41
-
-
Save ncalm/54b4c306468273679ae07d872fbdec4d to your computer and use it in GitHub Desktop.
This Excel lambda function can be used to classify data using the K-nearest neighbors (KNN) algorithm
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* | |
KNN | |
Returns the classification of an unknown observation using known observations in a training set. | |
Inputs: | |
- x - an observation (row) in need of classification. This is an array of numerical measurements about an observation which you want to classify. This is one row and one or more columns. | |
- trn - an array of training data which is already classified. This array will have COLUMNS(x) + 1 columns. The additional column is because the training set includes a column on the right for the classification of each row. In the example below, the species of the flower. | |
- k - the number of observations in the training set to use to determine the class of the observation x. For example, if k=5, then the function will use the 5 observations (rows) in the training set which are closest to the observation x in order to determine to which class x should belong. | |
- [p] - the p-value passed to the MINKOWSKI function if a distance function is not provided to the distance_fn parameter. | |
- [distance_fn] - the distance function to use in the calculation. If omitted, then the provided p-value (or 2, if p is omitted) is passed to the MINKOWSKI function. | |
p=1 is equivalent to distance_fn=MANHATTAN | |
P=2 is equivalent to distance_fn=EUCLIDEAN | |
To classify vector x using training array Y and scalar k: | |
For Euclidean distance, either: | |
=KNN(x,trn,k,2) | |
or | |
=KNN(x,trn,k,,EUCLIDEAN) | |
For MANHATTAN distance, either: | |
=KNN(x,trn,k,1) | |
or | |
=KNN(x,trn,k,,MANHATTAN) | |
For CHEBYSHEV | |
=KNN(x,trn,k,,CHEBYSHEV) | |
For other values of p: | |
=KNN(x,trn,k,p) | |
*/ | |
KNN = LAMBDA(x, trn, k, [p], [distance_fn], | |
LET( | |
_p,IF(ISOMITTED(p),2,p), | |
_fn,IF(ISOMITTED(distance_fn),MINKOWSKI(_p),distance_fn), | |
_trnc, COLUMNS(trn), | |
_X, INDEX(trn, , 1) : INDEX(trn, , _trnc - 1), | |
_y, INDEX(trn, , _trnc), | |
_br, BYROW(_X, LAMBDA(r,_fn(r,x))), | |
_f, FILTER(_y, _br <= SMALL(_br, k)), | |
_fs, FREQ.SIMPLE(_f), | |
_output,INDEX(_fs, 1, 1), | |
_output | |
) | |
); | |
/* | |
MINKOWSKI | |
Returns a lambda function initialized with the given p-number. | |
1. p = the p-value to initialize the return function with. | |
Usage: | |
=MINKOWSKI(2)(X,Y) will calculate the EUCLIDEAN distance between X and Y. | |
=MINKOWSKI(1)(X,Y) will calculate the MANHATTAN distance between X and Y. | |
*/ | |
MINKOWSKI = LAMBDA(p,LAMBDA(x,y,POWER(SUM(ABS((x-y)^p)),1/p))); | |
/* | |
EUCLIDEAN | |
Calculates the Euclidean distance between 2 vectors x and y. | |
1. x - a one-dimensional vector of real numbers | |
2. y - a one-dimensional vector of real numbers with count(x)=count(y) | |
Since Euclidean distance is a special case of Minkowski distance with p=2, | |
we simply call MINKOWSKI to return a lambda function to which we pass the vectors | |
x and y. | |
*/ | |
EUCLIDEAN = LAMBDA(x,y,MINKOWSKI(2)(x,y)); | |
/* | |
MANHATTAN | |
Calculates the MANHATTAN distance between 2 vectors x and y. | |
1. x - a one-dimensional vector of real numbers | |
2. y - a one-dimensional vector of real numbers with count(x)=count(y) | |
Since MANHATTAN distance is a special case of Minkowski distance with p=1, | |
we simply call MINKOWSKI to return a lambda function to which we pass the vectors | |
x and y. | |
*/ | |
MANHATTAN = LAMBDA(x,y,MINKOWSKI(1)(x,y)); | |
/* | |
CHEBYSHEV | |
Calculates the CHEBYSHEV distance between 2 vectors x and y. | |
1. x - a one-dimensional vector of real numbers | |
2. y - a one-dimensional vector of real numbers with count(x)=count(y) | |
CHEBYSHEV distance is the limit case of MINKOWSKI distance with p approaching infinity. | |
The limit is as shown for positive infinity and we could replace MAX with MIN for p | |
approaching negative infinity. | |
*/ | |
CHEBYSHEV = LAMBDA(x,y,MAX(ABS(x-y))); | |
/* | |
FREQ.SIMPLE | |
Calculates a simple frequency table of the values in a column | |
Inputs: | |
1. data - a single column of data | |
*/ | |
FREQ.SIMPLE = LAMBDA(data, | |
LET( | |
d, INDEX(data,,1), | |
u, UNIQUE(d), | |
X, N(u = TRANSPOSE(d)), | |
Y, SEQUENCE(ROWS(d), 1, 1, 0), | |
mp, MMULT(X,Y), | |
c, CHOOSE({1,2}, u, mp), | |
SORT(c, 2, -1) | |
) | |
); |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment