Matrix multiplication in logarithmic space. Incoming matrices are logs of their values, and result values are logs of the real values. The benefit is that no multiplication is used.

Matrix multiplication without multiplication

The basic idea is to transform the dot product into log form:

log(a.b) = log(a1*b1 + a2*b2 + ... + an*bn) 
  = log(exp(log a1+log b1) + exp(log a2+log b2) + ... + exp(log an+log bn))
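
As a quick sanity check of this form (a sketch; np.logaddexp is NumPy's built-in log-sum-exp primitive):

import numpy as np

a = np.array([0.5, 1.5, 2.0])
b = np.array([3.0, 0.25, 1.0])
la, lb = np.log(a), np.log(b)

print(np.log(a @ b))                 # log of the ordinary dot product
print(np.logaddexp.reduce(la + lb))  # same value, computed from the logs only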

and then apply something like Horner's method, but for exp-s, so the result looks like this (I'll shorten log a1 to la1, log a2 to la2, ...):

  = log(exp(la1+lb1) * (1+ exp(la2+lb2-la1-lb1)*(1+ exp(la3+lb3-la2-lb2)*(1+ ...))))

and then cancel exp/log:

  = la1+lb1 + log(1+ exp(la2+lb2-la1-lb1)*(1+ exp(la3+lb3-la2-lb2)*(1+ ...)))
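
The same factoring can be checked numerically for three terms (a sketch; l1, l2, l3 stand for the per-term logs la_i+lb_i):

import numpy as np

# per-term logs la_i+lb_i for a = [0.5, 1.5, 2.0], b = [3.0, 0.25, 1.0]
l1, l2, l3 = np.log([0.5 * 3.0, 1.5 * 0.25, 2.0 * 1.0])

nested = l1 + np.log(1 + np.exp(l2 - l1) * (1 + np.exp(l3 - l2)))
direct = np.log(np.exp([l1, l2, l3]).sum())
print(nested, direct)  # both ~1.3545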

and then expand log(1+x) as log(x) + addenLogPlus1(x) (a new helper function we'll define later):

   D1 = exp(la2+lb2-la1-lb1)*(1+ exp(la3+lb3-la2-lb2)*(1+ ...))

  = la1+lb1 + log(1+ D1) = la1+lb1 + log D1 + addenLogPlus1(D1)

I've introduced the variable D1 to capture the "rest". But written this way it is not directly computable: each level depends on everything nested inside it. Instead, we should go backwards, from the innermost entry:

  D[n] = la[n]+lb[n] - la[n-1]-lb[n-1]
  D[n-1] = la[n-1]+lb[n-1] - la[n-2]-lb[n-2] + D[n] + addenLogPlus1(exp(D[n]))
  D[n-2] = la[n-2]+lb[n-2] - la[n-3]-lb[n-3] + D[n-1] + addenLogPlus1(exp(D[n-1]))
  ...

Then D[0] will contain the log value of our dot product a.b.
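
Here is this backwards pass as a scalar sketch (the x-input form of addenLogPlus1, defined just below, so exp still appears; the example values are mine):

import numpy as np

def addenLogPlus1(x):
    # log(1+x) - log(x); naive form, overflows for very large inputs
    return np.log1p(x) - np.log(x)

a, b = np.array([0.5, 1.5, 2.0]), np.array([3.0, 0.25, 1.0])
la, lb = np.log(a), np.log(b)

D = la[-1] + lb[-1] - la[-2] - lb[-2]                # D[n]
for i in range(len(la) - 2, 0, -1):                  # D[n-1] ... D[1]
    D = la[i] + lb[i] - la[i-1] - lb[i-1] + D + addenLogPlus1(np.exp(D))
print(la[0] + lb[0] + D + addenLogPlus1(np.exp(D)))  # D[0], equals np.log(a @ b)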

Now, back to addenLogPlus1. This is a simple function:

   addenLogPlus1(x) = log(1+x) - log(x)

But we can modify it and require the input to already be a log:

   addenLogPlus1(x) = log(1+exp(x)) - x

This way we remove exp from the D[..] computation:

  D[n] = la[n]+lb[n] - la[n-1]-lb[n-1]
  D[n-1] = la[n-1]+lb[n-1] - la[n-2]-lb[n-2] + D[n] + addenLogPlus1(D[n])
  D[n-2] = la[n-2]+lb[n-2] - la[n-3]-lb[n-3] + D[n-1] + addenLogPlus1(D[n-1])
  ...
  D[0] = la[0] + lb[0] + D[1] + addenLogPlus1(D[1])
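
The same scalar sketch with the log-input form; exp is now gone from the update (I evaluate addenLogPlus1 stably via the identity log(1+exp(x)) - x = log(1+exp(-x)), i.e. softplus(-x)):

import numpy as np

def addenLogPlus1(x):
    # log(1+exp(x)) - x == log(1+exp(-x)), computed without overflow
    return np.logaddexp(0.0, -x)

a, b = np.array([0.5, 1.5, 2.0]), np.array([3.0, 0.25, 1.0])
la, lb = np.log(a), np.log(b)

D = la[-1] + lb[-1] - la[-2] - lb[-2]        # D[n]
for i in range(len(la) - 2, 0, -1):          # D[n-1] ... D[1]
    D = la[i] + lb[i] - la[i-1] - lb[i-1] + D + addenLogPlus1(D)
print(la[0] + lb[0] + D + addenLogPlus1(D))  # D[0], equals np.log(a @ b)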

To make it multiplication free, we can get rid of addenLogPlus1 by... converting it to a lookup table! For this we stick with the float16 datatype and turn the function into a table. Without much thinking, I reinterpret each float16 bitwise as a uint16 and use it as an index into the lookup table. The table has 64K entries, which doesn't create much overhead:

D = la[n]+lb[n] - la[n-1]-lb[n-1]
loop i from n-1 down to 1:
  D = D + la[i]+lb[i] - la[i-1]-lb[i-1] + lookup_table[D]
D = D + la[0]+lb[0] + lookup_table[D]
return D
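
The bit-reinterpretation trick itself looks like this (a sketch; "table" is my name for the precomputed 64K-entry array):

import numpy as np

# entry k of the table holds addenLogPlus1 of the float16 whose bit pattern is k
bits = np.arange(2**16, dtype='uint16').view('<f2').astype('float64')
table = np.logaddexp(0.0, -np.clip(bits, -500, 500))  # stable log(1+exp(x)) - x

x = np.array([-0.5, 0.0, 1.25], dtype='float16')
print(table[x.view('<u2')])  # addenLogPlus1(x) with no exp/log at lookup time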

And the dot product is done! In a similar way, I can do full matrix multiplication by utilizing numpy arrays. The code is available below.

Extra thoughts

Even though this is slower than the built-in np.dot (or GPU-based multiplication), it poses some questions:

  • Would this be faster than normal multiplication if it were implemented with vectorized intrinsics?
  • Could the weight matrices of a large neural model be stored in log space and processed with this algorithm?
  • Do we even need "multiplication" for large neural model inference? Do we even need to leave log space?
  • Current TPUs dedicate large processor area to efficient multiplication schemes; maybe that could be traded for addition on log-space matrices?
import numpy as np
from math import log, exp

def addenLogPlus1(x):
    x = min(500, max(-500, x))  # clamp too big and too small values so exp doesn't overflow
    return log(exp(x) + 1) - x

# lookup table: for every float16 x (indexed by its bit pattern) -> log(exp(x) + 1) - x
# (uint16 so that all 2**16 bit patterns are valid indices)
plus1_lookup = np.vectorize(addenLogPlus1)(np.arange(2**16, dtype='uint16').view('<f2'))

def matrix_product_in_log_space(A, B):
    # A and B hold logs of the real values; the result holds logs as well
    res = np.zeros(shape=(A.shape[0], B.shape[1]), dtype='float16')
    for i in range(0, A.shape[1] - 1):
        # accumulate l[i] - l[i+1], where l[i][r][c] = A[r][i] + B[i][c]
        res += B[i] - B[i+1] + (A[:, [i]] - A[:, [i+1]])
        # add addenLogPlus1(res) via the bitwise lookup
        res += plus1_lookup[res.view('<u2')]
    res += B[-1] + A[:, [-1]]  # add the log of the final term
    return res

As = np.array([[1, 2, 3], [4, 5, 6]])
A = np.random.rand(30, 400)
Bs = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
B = np.random.rand(400, 50)
print('in log space:\n', np.exp(matrix_product_in_log_space(np.log(As), np.log(Bs))))
print('regular:\n', np.dot(As, Bs))
%timeit np.dot(A, B)
%timeit matrix_product_in_log_space(A, B)