btbytes/kmeans.apl

## kmeans.apl
⍝ This is a comment in APL. Comments are denoted by the "⍝" symbol
⍝ which is called "lamp" or "thumb". A comment begins with the lamp
⍝ symbol and continues to the end of the line.
⍝
⍝ Here is a coordinate in 2-D space. It is a vector of two elements.
p←0.712 0.848
⍝ The symbol "←" denotes assignment and is called "copula". We have
⍝ given the name "p" to our first coordinate. Here is another 2-D
⍝ coordinate we will name "q".
q←0.962 0.362
⍝
⍝ Most APL functions are called "verbs". A verb that takes a single
⍝ argument is called a "monadic" verb, while a verb that takes two
⍝ arguments is called a "dyadic" verb. Many symbols have both a
⍝ monadic and dyadic meaning.
⍝
⍝ Monadic verbs appear to the left of their argument. The symbol "≢"
⍝ is called "tally" when used as a monadic verb; it returns the number
⍝ of elements in a vector.
≢p
⍝ The result of "≢p" is "2" since "p" is a vector with two elements.
⍝
⍝ Dyadic verbs appear between their two arguments. Subtraction is a
⍝ dyadic verb denoted by the symbol "-" called "minus".
p-q
⍝ The result of "p-q" is "¯0.25 0.486" which is the difference of each
⍝ element of "q" from the corresponding element of "p". Negative
⍝ numbers are indicated by a prefix symbol "¯" called "high minus".
⍝
⍝ Most APL operations work element-wise with vector arguments.
⍝
⍝ The dyadic verb called "exponential" or "power" is denoted by the
⍝ "⋆" symbol. It raises the argument on the left to the power of the
⍝ argument on the right. The expression "7⋆2" can be read as "seven
⍝ squared"; the expression "9⋆.5" can be read as "square root of
⍝ nine".
(p-q)⋆2
⍝ The result of "(p-q)⋆2" is "0.0625 0.236196" which is the square of
⍝ each element in "p-q". When one of the arguments of a dyadic verb is
⍝ a vector and the other is a scalar, the function will extend the
⍝ scalar argument by repeating it to match the length of the vector
⍝ argument.
⍝
⍝ Parentheses are used for grouping parts of an expression; the part
⍝ inside the parentheses is evaluated before the parts outside. APL
⍝ expressions without parentheses have no special operator precedence
⍝ rules; they always evaluate from right-to-left.
2⋆7-4
⍝ The result of "2⋆7-4" is "8" because "7-4" is evaluated first, the
⍝ result of which is "3"; the result of "2⋆3" is "8".
⍝
⍝ Addition is a dyadic verb denoted by the symbol "+" which is called
⍝ "plus".
1+1+2+3+5
⍝ The result of "1+1+2+3+5" is "12".
⍝
⍝ Some special APL functions are called "operators". An operator takes
⍝ an additional argument, often another function, in addition to its
⍝ normal arguments. The operator denoted by "/" is called "reduce".
+/1 1 2 3 5
⍝ The reduce operator takes a dyadic verb argument on the left, which
⍝ it evaluates between each element of the vector argument on the
⍝ right. The result of "+/1 1 2 3 5" is also "12". The symbols "+/"
⍝ together can be read as "sum".
⍝
⍝ The "euclidean distance" between two coordinates is the length of
⍝ the straight line that connects them. It is defined as the square
⍝ root of the sum of the squares of the element-wise differences of
⍝ the elements of the coordinates.
⍝ <https://en.wikipedia.org/wiki/Euclidean_distance>
(+/(p-q)⋆2)⋆.5
⍝ It is complicated to describe this formula in plain language but
⍝ simple in APL. The result of "(+/(p-q)⋆2)⋆.5" is "0.5465308774".
⍝
⍝ Braces are used for defining custom verbs called "direct functions".
p {(+/(⍺-⍵)⋆2)⋆.5} q
⍝ The symbols "⍺" and "⍵" are called "alpha" and "omega" which
⍝ respectively denote the left and right arguments of a dyadic direct
⍝ function. We can assign a name to a direct function using a copula.
Distance←{(+/(⍺-⍵)⋆2)⋆.5}
⍝ We can now use the name "Distance" to measure the euclidean distance
⍝ between any two coordinates.
p Distance q
⍝ The result of "p Distance q" is also "0.5465308774".
⍝
⍝ Here is a vector of coordinates in 2-D space, including "p" and "q"
⍝ and seven more points, which we will call "coords".
coords←p q (0.301 0.234) (0.347 0.168) (0.57 0.003) (0.16 0.009) (0.252 0.539) (0.595 0.487) (0.439 0.412)
⍝
⍝ The symbol "." denotes an operator called "dot". The symbol "∘"
⍝ denotes an operator called "jot" or "compose". When used together
⍝ the symbols "∘." are called the "outer product" operator. This
⍝ operator applies its function argument to each combination of the
⍝ elements of its left and right arguments.
2 3 ∘.+ 1 1 2 3 5
⍝ The result of "2 3 ∘.+ 1 1 2 3 5" is a matrix that looks like this:
⍝
⍝       3 3 4 5 7
⍝       4 4 5 6 8
⍝
⍝ The shape of the result depends on the number of elements in each
⍝ argument. In this case, the left argument has two elements and the
⍝ right argument has five elements. The result is a two-by-five
⍝ matrix.
⍝
⍝ The symbol "⍴" is called "rho". As a monadic verb it is called
⍝ "shape" because it returns the length of each dimension of its
⍝ argument.
⍴2 3 ∘.+ 1 1 2 3 5
⍝ The result of "⍴2 3 ∘.+ 1 1 2 3 5" is "2 5".
⍝
⍝ We can use the outer product operator with our "Distance" function
⍝ to compute the distance from each coordinate to each of "p" and "q".
⍝ We will call the result "deltas".
deltas←coords ∘.Distance p q
⍝ The result is a nine-by-two matrix that looks like this:
⍝
⍝       0            0.5465308774
⍝       0.5465308774 0
⍝       0.7388619628 0.6732792883
⍝       0.771767452  0.6448728557
⍝       0.8568482946 0.5315496214
⍝       1.004303241  0.8762493937
⍝       0.554148897  0.7317301415
⍝       0.3794864951 0.387703495
⍝       0.5144171459 0.525384621
⍝
⍝ The symbol "⌊" is called "downstile". As a dyadic verb it is also
⍝ called "minimum" because it returns the smaller of its two
⍝ arguments.
5⌊3
⍝ The result of "5⌊3" is "3".
⍝
⍝ When its argument is a matrix, the reduce operator works on each row
⍝ independently.
⌊/deltas
⍝ The result of "⌊/deltas" is a vector of nine elements containing the
⍝ smallest value from each row of "deltas". It looks like this:
⍝
⍝       0 0 0.6732792883 0.6448728557 0.5315496214 0.8762493937 0.554148897 0.3794864951 0.5144171459
⍝
⍝ The symbol "=" is called "equal to". As a dyadic verb it compares
⍝ its arguments and returns "1" when they are equal and "0" when they
⍝ are not.
3=1 1 2 3 5
⍝ The result of "3=1 1 2 3 5" is "0 0 0 1 0".
⍝
⍝ We can use "equal to" to find whether "p" or "q" is closer to each
⍝ coordinate by comparing the elements of "deltas" to the minima of
⍝ their respective rows. We call the result "clusters".
clusters←deltas=[1]⌊/deltas
⍝ When the arguments of a verb have different shapes, we can specify
⍝ the dimension along which the verb will be applied in brackets,
⍝ which are called the "axis" operator. In this case we want to
⍝ compare along the first dimension.
⍝
⍝ The result is a nine-by-two matrix that looks like this:
⍝
⍝       1 0
⍝       0 1
⍝       0 1
⍝       0 1
⍝       0 1
⍝       0 1
⍝       1 0
⍝       1 0
⍝       1 0
⍝
⍝ The number of coordinates in each cluster is the sum of each column
⍝ of the "clusters" matrix.
counts←+/[1]clusters
⍝ The result of "+/[1]clusters" is "4 5".
⍝
⍝ Multiplication is a dyadic verb denoted by the symbol "×" called
⍝ "times". Division is a dyadic verb denoted by the symbol "÷" called
⍝ "divide".
¯1 0 1×3
⍝ The result of "¯1 0 1×3" is "¯3 0 3".
1 1 2 3 5÷5
⍝ The result of "1 1 2 3 5÷5" is "0.2 0.2 0.4 0.6 1".
⍝
⍝ We can use multiplication to select the coordinates of each cluster.
⍝ To find the average (mean) coordinate in each cluster, we divide its
⍝ sum by its count.
means←(+/[1]clusters×[1]coords)÷counts
⍝ The result "means" is a vector of two coordinates "(0.4995 0.5715) (0.468 0.1552)".
⍝
⍝ This completes been a single iteration of the algorithm for "k-means
⍝ clustering" of our coordinates where k is 2.
⍝ <https://en.wikipedia.org/wiki/K-means_clustering>
⍝
⍝ We would now want to repeat this iteration with the new means in
⍝ place of "p" and "q", until the results converge. First, let us
⍝ define a direct function for this step.
Step←{(+/[1]clusters×[1]⍺)÷+/[1]clusters←deltas=[1]⌊/deltas←⍺ ∘.Distance ⍵}
⍝ We can now use our function called "Step" to compute the same answer
⍝ as before.
coords Step p q
⍝ The result of "coords Step p q" is also "(0.4995 0.5715) (0.468 0.1552)".
⍝
⍝ The symbol "≡" is called "match". As a dyadic verb it returns "1"
⍝ when its arguments are completely identical, and "0" otherwise. It
⍝ is different from "=" in that it compares the arguments entirely
⍝ rather than element-wise.
¯1 0 1≡1 0 1
⍝ The result of "¯1 0 1≡1 0 1" is "0".
1 2 3≡1 2 3
⍝ The result of "1 2 3≡1 2 3" is "1".
¯1 0 1=1 0 1
⍝ The result of "¯1 0 1=1 0 1" is "0 1 1".
1 2 3=1 2 3
⍝ The result of "1 2 3=1 2 3" is "1 1 1".
⍝
⍝ The symbol "⍣" denotes the "power operator". When used with two
⍝ functions, it recursively evaluates the left function, and uses the
⍝ right function to determine when to stop. When the right function is
⍝ "≡" it is called a "fixpoint operator".
means←coords Step⍣≡ p q
⍝ This repeats our "Step" function until the answer converges. The
⍝ result is a vector of the cluster means "(0.592 0.5296) (0.3445 0.1035)".
	⍝ This is a comment in APL. Comments are denoted by the "⍝" symbol
	⍝ which is called "lamp" or "thumb". A comment begins with the lamp
	⍝ symbol and continues to the end of the line.
	⍝
	⍝ Here is a coordinate in 2-D space. It is a vector of two elements.
	p←0.712 0.848
	⍝ The symbol "←" denotes assignment and is called "copula". We have
	⍝ given the name "p" to our first coordinate. Here is another 2-D
	⍝ coordinate we will name "q".
	q←0.962 0.362
	⍝
	⍝ Most APL functions are called "verbs". A verb that takes a single
	⍝ argument is called a "monadic" verb, while a verb that takes two
	⍝ arguments is called a "dyadic" verb. Many symbols have both a
	⍝ monadic and dyadic meaning.
	⍝
	⍝ Monadic verbs appear to the left of their argument. The symbol "≢"
	⍝ is called "tally" when used as a monadic verb; it returns the number
	⍝ of elements in a vector.
	≢p
	⍝ The result of "≢p" is "2" since "p" is a vector with two elements.
	⍝
	⍝ Dyadic verbs appear between their two arguments. Subtraction is a
	⍝ dyadic verb denoted by the symbol "-" called "minus".
	p-q
	⍝ The result of "p-q" is "¯0.25 0.486" which is the difference of each
	⍝ element of "q" from the corresponding element of "p". Negative
	⍝ numbers are indicated by a prefix symbol "¯" called "high minus".
	⍝
	⍝ Most APL operations work element-wise with vector arguments.
	⍝
	⍝ The dyadic verb called "exponential" or "power" is denoted by the
	⍝ "⋆" symbol. It raises the argument on the left to the power of the
	⍝ argument on the right. The expression "7⋆2" can be read as "seven
	⍝ squared"; the expression "9⋆.5" can be read as "square root of
	⍝ nine".
	(p-q)⋆2
	⍝ The result of "(p-q)⋆2" is "0.0625 0.236196" which is the square of
	⍝ each element in "p-q". When one of the arguments of a dyadic verb is
	⍝ a vector and the other is a scalar, the function will extend the
	⍝ scalar argument by repeating it to match the length of the vector
	⍝ argument.
	⍝
	⍝ Parentheses are used for grouping parts of an expression; the part
	⍝ inside the parentheses is evaluated before the parts outside. APL
	⍝ expressions without parentheses have no special operator precedence
	⍝ rules; they always evaluate from right-to-left.
	2⋆7-4
	⍝ The result of "2⋆7-4" is "8" because "7-4" is evaluated first, the
	⍝ result of which is "3"; the result of "2⋆3" is "8".
	⍝
	⍝ Addition is a dyadic verb denoted by the symbol "+" which is called
	⍝ "plus".
	1+1+2+3+5
	⍝ The result of "1+1+2+3+5" is "12".
	⍝
	⍝ Some special APL functions are called "operators". An operator takes
	⍝ an additional argument, often another function, in addition to its
	⍝ normal arguments. The operator denoted by "/" is called "reduce".
	+/1 1 2 3 5
	⍝ The reduce operator takes a dyadic verb argument on the left, which
	⍝ it evaluates between each element of the vector argument on the
	⍝ right. The result of "+/1 1 2 3 5" is also "12". The symbols "+/"
	⍝ together can be read as "sum".
	⍝
	⍝ The "euclidean distance" between two coordinates is the length of
	⍝ the straight line that connects them. It is defined as the square
	⍝ root of the sum of the squares of the element-wise differences of
	⍝ the elements of the coordinates.
	⍝ <https://en.wikipedia.org/wiki/Euclidean_distance>
	(+/(p-q)⋆2)⋆.5
	⍝ It is complicated to describe this formula in plain language but
	⍝ simple in APL. The result of "(+/(p-q)⋆2)⋆.5" is "0.5465308774".
	⍝
	⍝ Braces are used for defining custom verbs called "direct functions".
	p {(+/(⍺-⍵)⋆2)⋆.5} q
	⍝ The symbols "⍺" and "⍵" are called "alpha" and "omega" which
	⍝ respectively denote the left and right arguments of a dyadic direct
	⍝ function. We can assign a name to a direct function using a copula.
	Distance←{(+/(⍺-⍵)⋆2)⋆.5}
	⍝ We can now use the name "Distance" to measure the euclidean distance
	⍝ between any two coordinates.
	p Distance q
	⍝ The result of "p Distance q" is also "0.5465308774".
	⍝
	⍝ Here is a vector of coordinates in 2-D space, including "p" and "q"
	⍝ and seven more points, which we will call "coords".
	coords←p q (0.301 0.234) (0.347 0.168) (0.57 0.003) (0.16 0.009) (0.252 0.539) (0.595 0.487) (0.439 0.412)
	⍝
	⍝ The symbol "." denotes an operator called "dot". The symbol "∘"
	⍝ denotes an operator called "jot" or "compose". When used together
	⍝ the symbols "∘." are called the "outer product" operator. This
	⍝ operator applies its function argument to each combination of the
	⍝ elements of its left and right arguments.
	2 3 ∘.+ 1 1 2 3 5
	⍝ The result of "2 3 ∘.+ 1 1 2 3 5" is a matrix that looks like this:
	⍝
	⍝ 3 3 4 5 7
	⍝ 4 4 5 6 8
	⍝
	⍝ The shape of the result depends on the number of elements in each
	⍝ argument. In this case, the left argument has two elements and the
	⍝ right argument has five elements. The result is a two-by-five
	⍝ matrix.
	⍝
	⍝ The symbol "⍴" is called "rho". As a monadic verb it is called
	⍝ "shape" because it returns the length of each dimension of its
	⍝ argument.
	⍴2 3 ∘.+ 1 1 2 3 5
	⍝ The result of "⍴2 3 ∘.+ 1 1 2 3 5" is "2 5".
	⍝
	⍝ We can use the outer product operator with our "Distance" function
	⍝ to compute the distance from each coordinate to each of "p" and "q".
	⍝ We will call the result "deltas".
	deltas←coords ∘.Distance p q
	⍝ The result is a nine-by-two matrix that looks like this:
	⍝
	⍝ 0 0.5465308774
	⍝ 0.5465308774 0
	⍝ 0.7388619628 0.6732792883
	⍝ 0.771767452 0.6448728557
	⍝ 0.8568482946 0.5315496214
	⍝ 1.004303241 0.8762493937
	⍝ 0.554148897 0.7317301415
	⍝ 0.3794864951 0.387703495
	⍝ 0.5144171459 0.525384621
	⍝
	⍝ The symbol "⌊" is called "downstile". As a dyadic verb it is also
	⍝ called "minimum" because it returns the smaller of its two
	⍝ arguments.
	5⌊3
	⍝ The result of "5⌊3" is "3".
	⍝
	⍝ When its argument is a matrix, the reduce operator works on each row
	⍝ independently.
	⌊/deltas
	⍝ The result of "⌊/deltas" is a vector of nine elements containing the
	⍝ smallest value from each row of "deltas". It looks like this:
	⍝
	⍝ 0 0 0.6732792883 0.6448728557 0.5315496214 0.8762493937 0.554148897 0.3794864951 0.5144171459
	⍝
	⍝ The symbol "=" is called "equal to". As a dyadic verb it compares
	⍝ its arguments and returns "1" when they are equal and "0" when they
	⍝ are not.
	3=1 1 2 3 5
	⍝ The result of "3=1 1 2 3 5" is "0 0 0 1 0".
	⍝
	⍝ We can use "equal to" to find whether "p" or "q" is closer to each
	⍝ coordinate by comparing the elements of "deltas" to the minima of
	⍝ their respective rows. We call the result "clusters".
	clusters←deltas=[1]⌊/deltas
	⍝ When the arguments of a verb have different shapes, we can specify
	⍝ the dimension along which the verb will be applied in brackets,
	⍝ which are called the "axis" operator. In this case we want to
	⍝ compare along the first dimension.
	⍝
	⍝ The result is a nine-by-two matrix that looks like this:
	⍝
	⍝ 1 0
	⍝ 0 1
	⍝ 0 1
	⍝ 0 1
	⍝ 0 1
	⍝ 0 1
	⍝ 1 0
	⍝ 1 0
	⍝ 1 0
	⍝
	⍝ The number of coordinates in each cluster is the sum of each column
	⍝ of the "clusters" matrix.
	counts←+/[1]clusters
	⍝ The result of "+/[1]clusters" is "4 5".
	⍝
	⍝ Multiplication is a dyadic verb denoted by the symbol "×" called
	⍝ "times". Division is a dyadic verb denoted by the symbol "÷" called
	⍝ "divide".
	¯1 0 1×3
	⍝ The result of "¯1 0 1×3" is "¯3 0 3".
	1 1 2 3 5÷5
	⍝ The result of "1 1 2 3 5÷5" is "0.2 0.2 0.4 0.6 1".
	⍝
	⍝ We can use multiplication to select the coordinates of each cluster.
	⍝ To find the average (mean) coordinate in each cluster, we divide its
	⍝ sum by its count.
	means←(+/[1]clusters×[1]coords)÷counts
	⍝ The result "means" is a vector of two coordinates "(0.4995 0.5715) (0.468 0.1552)".
	⍝
	⍝ This completes been a single iteration of the algorithm for "k-means
	⍝ clustering" of our coordinates where k is 2.
	⍝ <https://en.wikipedia.org/wiki/K-means_clustering>
	⍝
	⍝ We would now want to repeat this iteration with the new means in
	⍝ place of "p" and "q", until the results converge. First, let us
	⍝ define a direct function for this step.
	Step←{(+/[1]clusters×[1]⍺)÷+/[1]clusters←deltas=[1]⌊/deltas←⍺ ∘.Distance ⍵}
	⍝ We can now use our function called "Step" to compute the same answer
	⍝ as before.
	coords Step p q
	⍝ The result of "coords Step p q" is also "(0.4995 0.5715) (0.468 0.1552)".
	⍝
	⍝ The symbol "≡" is called "match". As a dyadic verb it returns "1"
	⍝ when its arguments are completely identical, and "0" otherwise. It
	⍝ is different from "=" in that it compares the arguments entirely
	⍝ rather than element-wise.
	¯1 0 1≡1 0 1
	⍝ The result of "¯1 0 1≡1 0 1" is "0".
	1 2 3≡1 2 3
	⍝ The result of "1 2 3≡1 2 3" is "1".
	¯1 0 1=1 0 1
	⍝ The result of "¯1 0 1=1 0 1" is "0 1 1".
	1 2 3=1 2 3
	⍝ The result of "1 2 3=1 2 3" is "1 1 1".
	⍝
	⍝ The symbol "⍣" denotes the "power operator". When used with two
	⍝ functions, it recursively evaluates the left function, and uses the
	⍝ right function to determine when to stop. When the right function is
	⍝ "≡" it is called a "fixpoint operator".
	means←coords Step⍣≡ p q
	⍝ This repeats our "Step" function until the answer converges. The
	⍝ result is a vector of the cluster means "(0.592 0.5296) (0.3445 0.1035)".