zezba9000/CS_NoStackCpy.md

Removing C# stack copies syntactically

A proposal to eliminate the needless stack copies that C-style languages such as C# make when 'structs' (aka value types) are used heavily, as they are in image manipulation, graphics in general, physics, games, Unity3D, etc. (I'm sure there are many more fields as well).

Key arguments for this feature


A 40% performance increase on both .NET Core and .NET Framework, measured on an i3-7100 using this benchmark with 'USE_OUT' enabled: Link

The performance increase could actually be higher for more complex code.
Not yet tested, but I'd guess the gain is even bigger on ARM SoCs.


Allows one to describe vector-based algorithms in C# as you do in HLSL, GLSL, Cg, etc. without performance loss due to stack copies.
You can maintain operator 'precedence' in vector math, as is done with primitive types in C#.
Doesn't break older C# or .NET runtime versions.
.NET Native / UWP apps on tablets and laptops save battery life.
CoreRT could see even larger performance gains.
No static analysis needed.

The issue!


C# has no syntax for declaring an operator method that avoids needless stack copies (operators cannot take 'out' parameters).

This causes a large performance loss that could otherwise be avoided.


C# has no elegant way to write the methods used in heavy vector math that return a non-primitive struct.

A potential solution!


Use an existing or new keyword or attribute that tells the compiler an 'out' parameter can be used like a return value at the call site. Don't let your hypothalamus get in your way; there is a reason!

Here is some initial pseudocode:
struct Mat3
{
    public Vec3 x, y, z;
}

struct Vec3
{
    public float X, Y, Z;

    // existing reference operator
    /*public static Vec3 operator+(Vec3 p1, in Vec3 p2)// SLOW
    {
        p1.X += p2.X;
        p1.Y += p2.Y;
        p1.Z += p2.Z;
        return p1;
    }*/

    // 'result' functions as an 'out' parameter
    public static void operator+(in Vec3 p1, in Vec3 p2, return Vec3 result)// FAST
    {
        result.X = p1.X + p2.X;
        result.Y = p1.Y + p2.Y;
        result.Z = p1.Z + p2.Z;
    }

    // ALTERNATIVE: 'value' like used in property setter could be used here
    public static out Vec3 operator+(in Vec3 p1, in Vec3 p2)// FAST
    {
        value.X = p1.X + p2.X;
        value.Y = p1.Y + p2.Y;
        value.Z = p1.Z + p2.Z;
    }

    public static void operator*(in Vec3 p1, float p2, return Vec3 result) {...}
    public float Dot(in Vec3 vec) {...}// primitives have no performance gain using 'out'
    public void Transform(in Mat3 matrix, return Vec3 result) {...}
}

void Foo(in Vec3 a, in Vec3 b, in Vec3 c, in Mat3 m)
{
    var v = a.Transform(m) + b * c.Dot(c);// methods / operators can be invoked in a single easy to read line.
}
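To make the call-site idea concrete, here is a minimal sketch of what the single line in `Foo` above could lower to, written in C# that compiles today (C# 8.0 or later, because of the `readonly` members). The static `Add` and `Mul` helpers, the `Demo` and `Program` classes, and the body of `Transform` (row-wise dot products) are assumptions made for illustration; they are not part of the proposal or of any existing library.

```csharp
using System;

public struct Vec3
{
    public float X, Y, Z;

    public Vec3(float x, float y, float z) { X = x; Y = y; Z = z; }

    // Hypothetical stand-in for the proposed 'operator+' with a 'return' parameter.
    public static void Add(in Vec3 p1, in Vec3 p2, out Vec3 result)
    {
        result.X = p1.X + p2.X;
        result.Y = p1.Y + p2.Y;
        result.Z = p1.Z + p2.Z;
    }

    // Hypothetical stand-in for the proposed 'operator*'.
    public static void Mul(in Vec3 p1, float p2, out Vec3 result)
    {
        result.X = p1.X * p2;
        result.Y = p1.Y * p2;
        result.Z = p1.Z * p2;
    }

    // 'readonly' lets these be called on 'in' parameters without a defensive copy.
    public readonly float Dot(in Vec3 vec) => X * vec.X + Y * vec.Y + Z * vec.Z;

    // Assumption: Transform takes the dot product with each row of the matrix.
    public readonly void Transform(in Mat3 matrix, out Vec3 result)
    {
        result.X = Dot(in matrix.X);
        result.Y = Dot(in matrix.Y);
        result.Z = Dot(in matrix.Z);
    }
}

public struct Mat3
{
    public Vec3 X, Y, Z;
}

public static class Demo
{
    // Hand-written equivalent of: var v = a.Transform(m) + b * c.Dot(c);
    public static void Foo(in Vec3 a, in Vec3 b, in Vec3 c, in Mat3 m, out Vec3 v)
    {
        a.Transform(in m, out var t0);     // a.Transform(m)
        float d = c.Dot(in c);             // c.Dot(c)
        Vec3.Mul(in b, d, out var t1);     // b * d
        Vec3.Add(in t0, in t1, out v);     // t0 + t1
    }
}

public static class Program
{
    public static void Main()
    {
        var a = new Vec3(1, 2, 3);
        var b = new Vec3(1, 0, 2);
        var c = new Vec3(1, 1, 1);
        var identity = new Mat3
        {
            X = new Vec3(1, 0, 0),
            Y = new Vec3(0, 1, 0),
            Z = new Vec3(0, 0, 1),
        };

        Demo.Foo(in a, in b, in c, in identity, out var v);
        Console.WriteLine($"{v.X}, {v.Y}, {v.Z}"); // prints "4, 2, 9"
    }
}
```

The temporaries `t0` and `t1` are what a compiler would presumably have to introduce on the caller's behalf; the proposal is about letting it do so while the source keeps the one-line form.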
If the 'return' keyword isn't the best choice there are other options. However, this is more of a minor syntax issue.
// example version:
  public static void operator+(in Vec3 p1, in Vec3 p2, return Vec3 result) {...}

// example version 2:
  public static out Vec3 operator+(in Vec3 p1, in Vec3 p2) {...}

// OR:
  public static void operator+(in Vec3 p1, in Vec3 p2, [return] out Vec3 result) {...}

// OR:
  public static void operator+(in Vec3 p1, in Vec3 p2, [OutReturn] out Vec3 result) {...}

// OR:
  [return:OutReturn(???)]
  public static void operator+(in Vec3 p1, in Vec3 p2, out Vec3 result) {...}

// OR: Others ???
HLSL Comparison

Say we wanted to run a color manipulation algorithm. Here is some standard code that might be used in HLSL; now imagine a similar algorithm written in C#. I'm showing this only as a frame of reference for why C# should allow this kind of syntax while still achieving maximum performance.
float brightness;
float alpha;

void main(in v2p IN, out float4 OUT : COLOR)
{
    float3 color0 = tex2D(tex[0], IN.Texcoord0);
    float3 color1 = tex2D(tex[1], IN.Texcoord1);
    float3 mask   = tex2D(tex[2], IN.Texcoord2);

    OUT.rgb = brightness * (color0 * mask + color1 * (1.0 - mask));
    OUT.a = 1.0;
}

So let's take this line: "OUT.rgb = brightness * (color0 * mask + color1 * (1.0 - mask));"
If I want the best performance for this in C#, it becomes very verbose and hard to read.
As you can see below, writing performant vector-based code in C# isn't fun.
float brightness;
float alpha;

void GetColor(in Vec4 color0, in Vec4 color1, in Vec4 mask, out Vec4 result)
{
    // Assumes hypothetical static helpers such as Vec4.Mul(in Vec4, in Vec4, out Vec4).
    Vec4.Mul(in color0, in mask, out var color0MulResult);
    Vec4.Sub(1.0f, in mask, out var maskSubResult);
    Vec4.Mul(in color1, in maskSubResult, out var color1MulResult);
    Vec4.Add(in color0MulResult, in color1MulResult, out var someResult);
    Vec4.Mul(brightness, in someResult, out result);

    // Now compare this single line below with the confusion above (ugh).
    // NOTE: In C# this line currently runs MUCH SLOWER than the crazy lines above.
    result = brightness * (color0 * mask + color1 * (1.0f - mask));
}
Why does this happen?

Every time a value is returned from a method in C#, it must first be stored in a local on the stack, which in turn gets copied back to the caller's stack before the callee's frame unwinds. If you use an 'out' parameter instead, the method writes straight into the caller's storage and this extra copy is explicitly avoided. A quick look at the IL difference is very telling.
public struct Vec3
{
    public float x, y, z;

    public void Foo(out Vec3 result)// FAST
    {
        result = new Vec3();
    }
    /*.method public hidebysig 
    instance void Foo (
    [out] valuetype Vec3& result
    ) cil managed 
    {
        // Method begins at RVA 0x2050
        // Code size 9 (0x9)
        .maxstack 8

        IL_0000: nop
        IL_0001: ldarg.1
        IL_0002: initobj Vec3
        IL_0008: ret
    } // end of method Vec3::Foo*/

    public Vec3 Foo2()// SLOW
    {
        return new Vec3();
    }
    /*.method public hidebysig 
    instance valuetype Vec3 Foo2 () cil managed 
    {
        // Method begins at RVA 0x205c
        // Code size 15 (0xf)
        .maxstack 1
        .locals init (
        [0] valuetype Vec3,
        [1] valuetype Vec3
        )

        IL_0000: nop
        IL_0001: ldloca.s 0
        IL_0003: initobj Vec3
        IL_0009: ldloc.0
        IL_000a: stloc.1
        IL_000b: br.s IL_000d

        IL_000d: ldloc.1
        IL_000e: ret
    } // end of method Vec3::Foo2*/
}
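One way to check the claim on your own machine is a small timing harness like the sketch below. It is not the benchmark linked earlier; the `Bench` class, the iteration count, and the accumulation loop are assumptions chosen for illustration. Timings depend heavily on the runtime, the JIT, and the build configuration, so run it in a Release build and treat the output as a rough indication only.

```csharp
using System;
using System.Diagnostics;

public struct Vec3
{
    public float X, Y, Z;

    // Return-by-value operator (the 'SLOW' shape discussed in the text).
    public static Vec3 operator +(Vec3 p1, in Vec3 p2)
    {
        p1.X += p2.X;
        p1.Y += p2.Y;
        p1.Z += p2.Z;
        return p1;
    }

    // 'out' parameter version (the 'FAST' shape discussed in the text).
    public static void Add(in Vec3 p1, in Vec3 p2, out Vec3 result)
    {
        result.X = p1.X + p2.X;
        result.Y = p1.Y + p2.Y;
        result.Z = p1.Z + p2.Z;
    }
}

public static class Bench
{
    public static Vec3 SumWithOperator(int iterations, in Vec3 step)
    {
        var acc = new Vec3();
        for (int i = 0; i < iterations; i++)
        {
            acc = acc + step;
        }
        return acc;
    }

    public static Vec3 SumWithOut(int iterations, in Vec3 step)
    {
        var acc = new Vec3();
        for (int i = 0; i < iterations; i++)
        {
            // 'acc' is both input and output; this is safe here because
            // Add reads each component before writing that same component.
            Vec3.Add(in acc, in step, out acc);
        }
        return acc;
    }
}

public static class Program
{
    public static void Main()
    {
        const int iterations = 10_000_000;
        var step = new Vec3 { X = 1f, Y = 1f, Z = 1f };

        // Warm up so both paths are JIT-compiled before timing.
        Bench.SumWithOperator(1000, in step);
        Bench.SumWithOut(1000, in step);

        var sw = Stopwatch.StartNew();
        var r1 = Bench.SumWithOperator(iterations, in step);
        sw.Stop();
        Console.WriteLine($"operator+ : {sw.ElapsedMilliseconds} ms (X = {r1.X})");

        sw.Restart();
        var r2 = Bench.SumWithOut(iterations, in step);
        sw.Stop();
        Console.WriteLine($"out Add   : {sw.ElapsedMilliseconds} ms (X = {r2.X})");
    }
}
```

Printing a component of each result keeps the loops from being discarded as dead code; both paths should report the same sum.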
How to handle older C# or .NET versions

Take the example below:
// C# pseudocode
void Foo(return Vec3 result)
{
    result = new Vec3();
}

// ALTERNATIVE: C# pseudocode
out Vec3 Foo()
{
    value = new Vec3();
}

// IL pseudocode
/*.method public hidebysig 
instance void Foo (
[out, return] valuetype Vec3& result// <<< <<< NOTE: 'return' attribute <<< <<<
) cil managed 
{
    // Method begins at RVA 0x2050
    // Code size 9 (0x9)
    .maxstack 8

    IL_0000: nop
    IL_0001: ldarg.1
    IL_0002: initobj Vec3
    IL_0008: ret
} // end of method Vec3::Foo*/
To call the example code above in older C# versions one must do:
void Main()
{
    Foo(out var result);
}
To call the example code in newer C# versions one can do:
void Main()
{
    var result = Foo();
    // OR: Foo(out var result);
}
Finally, just as the compiler reports an error for methods that differ only in return type, it would do the same for 'out-return' types. All of the example methods below conflict with one another if defined in the same type.
void Foo(return Mat3 result) {...}
void Foo(return Vec3 result) {...}
Vec3 Foo() {...}
Final thoughts

Given the major performance gains and the relatively few changes needed, I see this as a big win.