Hello!
For this article we take Tensor as a N-dimensional array whose last two axes might be interpreted as Matrix and/or the last axis might be interpreted as Vector. For TLDR the result is here.
- Tensors (N-dimensional storage)
- Implementation of methods and functions for working with tensors as with data storages
- Implementation of mathematical operations with matrices and vectors
- Arbitary type support (not only numeric/built-in ones)
Towel has generic matrices, but
- Does not support tensors
- Its transposition is an active operation that works for the number of elements
- No multithreading and not too performant in most operations thanks to delegates (a minor issue, but still)
There is also System.Numerics.Tensor
, but it supports neither arbitary type nor mathematical operations.
Also, there are some computational libraries like MathSharp, NumSharp, Torch.NET, and TensorFlow, but they all are about built-in numeric types and far from supporting custom.
Let's get to the implementation of our own mini-library.
We store elements in a linear array. To access an element means to multiply indices by some coefficients. Those coefficients are stored in array blocks
of the same length the tensor's shape is. For example, if we have [3 x 4 x 5]
-shaped tensor, our last elements in blocks
is 1
(always 1), then 1 * 5 = 5
, then 1 * 4 * 5 = 20
. blocks
is [20, 5, 1]
. If we access index [1, 0, 4]
then the linear index will be 20 * 1 + 5 * 0 + 1 * 4 = 24
.
For this article, we take Transposition as axes order change. The easy way to do so is to create a new tensor and move elements there in the new order. However, sometimes we do not need to copy a tensor, so our Transpose
should work for O(1)
without copying a tensor (which is quite expensive). Let's have a look at the function that computes the linear index for custom indices:
private int GetFlattenedIndexSilent(int[] indices)
{
var res = 0;
for (int i = 0; i < indices.Length; i++)
res += indices[i] * blocks[i];
return res + LinOffset;
}
As we could see, we can simulate transposing by swapping elements in blocks
. That is as easy as it sounds:
public void Transpose(int axis1, int axis2)
{
(blocks[axis1], blocks[axis2]) = (blocks[axis2], blocks[axis1]);
(Shape[axis1], Shape[axis2]) = (Shape[axis2], Shape[axis1]);
}
That is it, we just change the order of blocks
and Shape
(array of dimensions).
We need our Subtensor
function also work for O(1)
.
Subtensor of a N-dimensional tensor is its one element or its subtensor's one element. For example, if we take a tensor of the [2 x 3 x 4]
shape, its 0
and 1
elements are subtensors of shape [3 x 4]
. We actually only need the offset in our linear array, let's see how we can compute it.
Imagine that we are going to get the n-th subtensor. Then its first element is [n, 0, 0, ...]
in the intial array. Its linear index is n * blocks[0] + 0 * blocks[1] + 0 * blocks[2] + ...
which is n * blocks[0]
. That is the desired offset. So then we create a new tensor with the same linear array
(shared data) but with one number changed: offset, and also blocks[0]
and Shape[0]
removed.
All others' implementations come from those two:
- SetSubtensor gets a subtensor with shared data and sets all the necessary values to it
- Concat concatenates two tensors by their first axis (for example,
concat([3 x 4 x 5], [6 x 4 x 5])
->[9 x 4 x 5]
) - Stack unites a number of tensors into one with additional axis (for example,
stack([3 x 4], [3 x 4])
->[2 x 3 x 4]
) - Slice stacks a few subtensors
All composition operations are described here.
This must be more or less easy since the necessary composition functions are implemented.
- Piecewise operations (an operation is performed on two same-coordinates points from two tensors and its result goes to the third tensor)
- Matrix operations. Multiplication, inverse, they all are more or less clear, while I have to say something about computing a determinant.
What's wrong with determinant?
There are a few ways to compute determinant. The most obvious way is to use Laplace's method which works for O(1)
. We obviously need something more performant, so we take the method of Gaussian elimination (triangulation & product of all diagonal elements).
However, if you blindly take this method for, say, int
, you will get unexpectedly wrong answer, not even close to the real one. It is because most implementations of it perform the division operation somewhere during the computation.
What we should do is implement some kind of SafeFraction
-tensor instead of normal element-wise tensor, where these SafeFraction
's numerator and denonimator are normal elements. Its operators will be defined as normal operators for fractions: a / b + c / d = (a * d + b * c) / (b * d)
. Likewise, we difne the rest of the operators.
Unfortunately, there is a disadvantage of this method as well. It is overflow: we accumulate all numbers in both numerator and denominator, and, for example, instead of getting 6 at the end, we get 6'000'000/1'000'000. What we do is just keep both methods (one is with safe division, another one is simple).
- Vector operations (dot and cross product)
Full list of them is presented here.
It should work fast.
C# does not have templates. Some people solve this problem by dynamic compilation into a delegate, like this.
I wanted full castomization, so I declared an interface which should be inherited from a struct to be passed as a generic type. I use its methods as static ones, even though today's shapes do not support straight static
methods. This
var c = default(TWrapper).Addition(a, b);
Gets inlined into your struct's overriden method. Here is an example of such a struct.
It seems like we should use arbitary number of arguments in this
like that:
public T this[params int[] indices]
But when you access some number of arguments, instead of instancing it by number of arguments, it packs them into one array by allocating memory. It is a very expensive operation, so we have to implement many overloads to avoid it.
I wrapped all checks with #if ALLOW_EXCEPTIONS
construction to let the user choose whether they can spend time on checks or not. It is a little performance boost.
Actually it is not a nano-optimization, there are a lot of checks so avoiding them all will only leave the necessary functional code. Why one would need it? Because some have to perform those checks before passing the tensor to this library, so why do we have to duplicate code? Let's leave it up to the user.
Thanks to Microsoft, it appears to be extremely easy and can be implemented via Parallel.For
.
However, be careful with that, sometimes it is better to keep single threading:
This is time (in us
) taken to perform one piecewise addition operation on differently-shaped tensors on machine with i7-7700HQ. As you could see, there is a threshold since which it is better to use multithreading. Yet I added the Threading.Auto
mode so that the best mode can be determined automatically.
Yet it will never become as fast as those accelerated by CUDA and SIMD since we surely want it to support custom type.
That is about all, thank you for your attention. The library's main repository is located on github (under MIT).
Where is it used?
The main GenericTensor's user is symbolic algebra library AngouriMath. It uses Entity
class instead of numeric since Entity
is the main expression class of AM.
For example, that is how you compute determinant symbolically in AM:
var t = MathS.Matrices.Matrix(3, 3,
"A", "B", "C", // Those letters are symbolic variables, they also could be "x + 2" or "sin(y + a) / sqrt(3)"
"D", "E", "F",
"G", "H", "J");
Console.WriteLine(t.Determinant().Simplify());