mrange/on-tail-recursion.md

## on-tail-recursion.md

      
    Raw
  

              on-tail-recursion.md
            
          
    On the topic of tail calls in .NET

Let's say you have implemented a small data pipeline library to replace LINQ.
module TrivialStream =
  type Receiver<'T> = 'T            -> unit
  type Stream<'T>   = Receiver<'T>  -> unit

  module Details =
    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> =
    fun r -> Loop.range s e r b

  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
    fun r -> s (fun v -> if f v then (r v))

  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
    fun r -> s (fun v -> (r (m v)))

  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss
Then you test compare the overhead of the pipeline against LINQ:
let trivialTest n =
  TrivialStream.range       0 1 n
  |> TrivialStream.map      (fun v -> int64 v)
  |> TrivialStream.filter   (fun v -> v &&& 1L = 0L)
  |> TrivialStream.map      (fun v -> v + 1L)
  |> TrivialStream.sum

let linqTest n =
  Enumerable
    .Range(0, n + 1)
    .Select(fun v -> int64 v)
    .Where(fun v -> v &&& 1L = 0L)
    .Select(fun v -> v + 1L)
    .Sum()
You are very satisfied to see that your trivial push data pipeline has significantly lower overhead than LINQ.
Running LINQ with total=100000000, outer=10, inner=10000000 ...
  ... 1803 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
  ... 375 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L

Almost 6x lower overhead!
Paul Westcott ruins our day

Unfortunately you run into Paul Westcott who asks what happens if you run the benchhmark in x86 mode rather than x64 mode.
Running LINQ with total=100000000, outer=10, inner=10000000 ...
  ... 1773 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
  ... 3116 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L

Now the tables are turned, your beautiful little library is actually slower than LINQ! Paul then says: "Just apply some magic oil!" and suggests you change your code into:
module TrivialStream =
  type Receiver<'T> = 'T            -> unit
  type Stream<'T>   = Receiver<'T>  -> unit

  module Details =
    let inline paulWestcottsMagicOil u = match u with () -> ()
    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> =
    fun r -> Loop.range s e r b

  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
    fun r -> s (fun v -> if f v then paulWestcottsMagicOil (r v))

  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
    fun r -> s (fun v -> paulWestcottsMagicOil (r (m v)))

  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss
The function let inline paulWestcottsMagicOil u = match u with () -> () does nothing but to your disbelief it actually helps!
Running LINQ with total=100000000, outer=10, inner=10000000 ...
  ... 1770 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
  ... 561 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L

Performance in x64 mode isn't as good as before but respectable:
Running LINQ with total=100000000, outer=10, inner=10000000 ...
  ... 1803 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
  ... 483 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L

What's going on?

It's great that we have a potential solution but some question marks still remains:

What does this have to do with tail calls in .NET (as implied by the title of the blog)?
Why the significant drop in performance between x64 and x86 modes?
paulWestcottsMagicOil does nothing yet it helps, why?
What is tail call optimization (TCO) I sometimes hear about?

Let's start with tail calls.

A tail call is a function call that is the final operation in a function. When the F# compiler detects that a function is the final operation it decorates the call site with a .tail attribute.
In fact, in our data pipeline the calls to the next step in pipeline are tail calls.
let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
  fun r -> s (fun v -> if f v then (r v)) // <-- r v is a tail call

let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
  fun r -> s (fun v -> (r (m v))) // <-- r v is a tail call
If we would look at result of a function call that call isn't a tail call because it is no longer the final operation.
A silly code example

let rec x n = if n > 0 then  n + y (n - 1) else 0  // y (n - 1) is not a tail call
and     y n = if n > 0 then -n + x (n - 1) else 0  // x (n - 1) is not a tail call

let rec v  s n = if n > 0 then w (s + n) (n - 1) else s  // w (s + n) (n - 1) is a tail call
and     w  s n = if n > 0 then v (s - n) (n - 1) else s  // v (s - n) (n - 1) is a tail call
These functions don't do anything really useful except demonstrating two different ways to implement the same functionality. One using tail calls (v & w) and one without tail calls (x & y). The F# compiler then should inject a .tail attribute for v & w but not for x & y.
To see this we need to peek at the generated IL assembler.
Here is an excerpt of the  IL assembler for x:
; Calls y
IL_0008: call         int32 Program/Samples::y(int32)
; Increments the result
IL_000d: add
; Returns from the function
IL_000e: ret
x calls y. When y returns we accumulate the result and returns. Because the call isn't the final operation it is not a tail call as expected.
It looks different for v:
; The following call is a tail call
IL_000a: tail.
; Calls w
IL_000c: call         int32 Program/Samples::w(int32, int32)
; Returns from the function
IL_0011: ret
v calls w but as this is the final operation the call site is decorated with .tail.
Why are tail calls a good thing?

I am sure you have run into your fair share of StackOverflowException. For developers that aren't aware of the runtime ABI the name StackOverflowException must seem cryptic but we quickly learn it typically means we have a recursive function that keeps calling itself over and over until we overflow.
The F# and C# compiler translates our source into a stream of op codes that are executed by the .NET virtual machine.
When the virtual machine hits the op code ret it mean it should continue at the code site that called this function. The virtual machine needs information where to continue executing and that information is in .NET stored on the stack.
What happens is that when the virtual machine executes a call it stores the position of the op code following call on the stack. Later when ret is executed it reads that value from the stack and the virtual machine continue executing the instruction following call.
StackOverflowException happens when we run out of space to store the return positions.
In reality there is more information that is stored on the stack than just the return position but let's ignore that for now.
Tail calls allow us to not run out of stack space when implementing a recursive algorithm.
If we have a call that looks like this:
IL_000c: call         int32 Program/Samples::w(int32, int32)
IL_0011: ret
Since the call is the final operation it is a tail call. It seems unnecessary to return to IL_0011 to just return the the calling function. What if we did this?
IL_000c: jmp          int32 Program/Samples::w(int32, int32)
Then we jump to w but we don't store the return address to us. When w return it returns to the code that called us directly. We save space and we save time.
This is roughly what the jitter tries to do when it sees a tall call.
Let's test our function above with this theory:
// Remember x is not tail recursive
printfn "%A" <| x 10000  // Prints 5000
printfn "%A" <| x 100000 // Crashes on my machine with StackOverflowException

// Remember v is tail recursive
printfn "%A" <| v 0 10000  // Prints 5000
printfn "%A" <| v 0 100000 // Prints 50000
printfn "%A" <| v 0 1000000000 // Prints 500000000 (takes awhile)
Tail calls are brilliant!
Why did our code perform worse than LINQ in x86 mode?

That is because tail calls sucks!
If we profile the x64 code the top 3 functions are:
Program+PerformanceTests+trivialTest@71.Invoke    32 %
Program+PerformanceTests+trivialTest@72-1.Invoke  20 %
Program+PerformanceTests+trivialTest@73-4.Invoke  18 %

In x86 mode the top 3 functions are:
Program+PerformanceTests+trivialTest@72-2.Invoke  27 %
[clr.dll]	                                        24 %
Program+PerformanceTests+trivialTest@73-4.Invoke  17 %

What are the Invoke methods? What does the CLR do? Even if we eliminate the CLR part it does not explain the complete drop in performance.
Mapping Invoke methods to the code

One way to interpret the name Program+PerformanceTests+trivialTest@71.Invoke is that this is lambda function generated for line 71. In my code line 71, 72 and 73 looks like this:
|> TrivialStream.map      (fun v -> int64 v)
|> TrivialStream.filter   (fun v -> v &&& 1L = 0L)
|> TrivialStream.map      (fun v -> v + 1L)
Because F# inlines small lambdas the lambda functions that show up in the performance run are the data pipeline continuation lambdas for map and filter
What is the CLR doing

You are going to love this! To answer this we have to dig into the jitted x64 and x86 code.
Let's start with x64 code:
; This is - TrivialStream.filter   (fun v -> v &&& 1L = 0L)
00007ff8`5af01160 488bc2          mov     rax,rdx
; Note that the test (fun v -> v &&& 1L = 0L) is inlined
; Test if the if the number is even
00007ff8`5af01163 a801            test    al,1
00007ff8`5af01165 7512            jne     00007ff8`5af01179
; It is even, then we should call the next step in the data pipeline.
; That is a virtual call.
; The jitter has devirtualization tricks but they have failed to apply here so
; we have to do the full virtual call...
; First load the pointer the classes that implements the next step in the data pipeline
00007ff8`5af01167 488b4908        mov     rcx,qword ptr [rcx+8]
; Load the object Method Table Pointer
00007ff8`5af0116b 488b01          mov     rax,qword ptr [rcx]
; Load the class Pointer to Methods
00007ff8`5af0116e 488b4040        mov     rax,qword ptr [rax+40h]
; Load the address to virtual method Invoke
00007ff8`5af01172 488b4020        mov     rax,qword ptr [rax+20h]
; Jump to next Invoke (note this is a tail call so need to return here)
00007ff8`5af01176 48ffe0          jmp     rax
; The number was odd meaning we return to the caller
00007ff8`5af01179 33c0            xor     eax,eax
00007ff8`5af0117b c3              ret
Apart from the unfortunate fact that the devirtualization has failed to apply which prevents further inlining of the next step the code above is not that bad.
Sure there is a lot of overhead compared to the simple test on odd/even but it
could be worse.
Speaking of much worse, let's look at the x86 code for the same function:
; This is - TrivialStream.filter   (fun v -> v &&& 1L = 0L)
; Method prelude storing various registers... they weren't here in x64 mode :(
014e1090 55              push    ebp
014e1091 8bec            mov     ebp,esp
014e1093 57              push    edi
014e1094 56              push    esi
014e1095 53              push    ebx
014e1096 8bf1            mov     esi,ecx
; Loading the value to test from the stack rather than picking it up from a register :(
014e1098 8b4508          mov     eax,dword ptr [ebp+8]
; Well at least the test is simple enough and inlined
014e109b a901000000      test    eax,1
014e10a0 7526            jne     014e10c8
; The number is even
; But why this test? It wasn't here in x64 mode
014e10a2 833d4010005d00  cmp     dword ptr [clr!g_TrapReturningThreads (5d001040)],0
014e10a9 7526            jne     014e10d1
; Ok, here we go loading the pointer to next step in the pipeline
014e10ab 8b4e04          mov     ecx,dword ptr [esi+4]
; Pushing input arguments on the stack as required by the x86 ABI
; x64 ABI supports passing values in registers
014e10ae ff750c          push    dword ptr [ebp+0Ch]
014e10b1 ff7508          push    dword ptr [ebp+8]
; Load the object Method Table Pointer
014e10b4 8b01            mov     eax,dword ptr [ecx]
; Load the class Pointer to Methods
014e10b6 8b4028          mov     eax,dword ptr [eax+28h]
; Load the address to virtual method Invoke
014e10b9 8b4010          mov     eax,dword ptr [eax+10h]
; More arguments pushed to stack?
014e10bc 6a02            push    2
014e10be 6a02            push    2
014e10c0 6a01            push    1
014e10c2 50              push    eax
; And now we call clr!JIT_TailCall? Not Invoke?
014e10c3 e877cf4a5b      call    clr!JIT_TailCall (5c98e03f)
; The number is odd so let's return
014e10c8 33c0            xor     eax,eax
; Method epilogue :(
014e10ca 5b              pop     ebx
014e10cb 5e              pop     esi
014e10cc 5f              pop     edi
014e10cd 5d              pop     ebp
014e10ce c20800          ret     8
Some differences can be attributed to that the ABI for x64 and x86 modes, but what is the clr!JIT_TailCall doing?
Let's look at it:
; Why are we looking the current thread? :(
5c98e03f e844060000      call    clr!GetThread (5c98e688)
5c98e044 50              push    eax
5c98e045 51              push    ecx
5c98e046 52              push    edx
; Something about the current thread is interesting?
5c98e047 f7400480000000  test    dword ptr [eax+4],80h
5c98e04e 7406            je      clr!JIT_TailCall+0x17 (5c98e056)
5c98e050 50              push    eax
5c98e051 e870213400      call    clr!JIT_TailCallHelper (5ccd01c6)
; Seems we usually end up here!
; Ok, let's load the arguments we pushed when calling clr!JIT_TailCall
5c98e056 8b542414        mov     edx,dword ptr [esp+14h]
5c98e05a 8b44241c        mov     eax,dword ptr [esp+1Ch]
5c98e05e 8b4c2418        mov     ecx,dword ptr [esp+18h]
; A lot of stuff happening, but it seems to be to eliminate the stackframe
; of the current function. Which makes sense because we don't want the stack to grow
; on a tail call
5c98e062 f7c201000000    test    edx,1
5c98e068 7409            je      clr!JIT_TailCall+0x34 (5c98e073)
5c98e06a 8b7dfc          mov     edi,dword ptr [ebp-4]
5c98e06d 8b75f8          mov     esi,dword ptr [ebp-8]
5c98e070 8b5df4          mov     ebx,dword ptr [ebp-0Ch]
5c98e073 ff7504          push    dword ptr [ebp+4]
5c98e076 57              push    edi
5c98e077 56              push    esi
5c98e078 8d7c8508        lea     edi,[ebp+eax*4+8]
5c98e07c 8d748c2c        lea     esi,[esp+ecx*4+2Ch]
5c98e080 8b6d00          mov     ebp,dword ptr [ebp]
5c98e083 f7c202000000    test    edx,2
5c98e089 752b            jne     clr!JIT_TailCallLeave+0x1 (5c98e0b6)
5c98e08b 85c9            test    ecx,ecx
5c98e08d 740e            je      clr!JIT_TailCall+0x5e (5c98e09d)
5c98e08f 8b46fc          mov     eax,dword ptr [esi-4]
5c98e092 83ef04          sub     edi,4
5c98e095 83ee04          sub     esi,4
5c98e098 8907            mov     dword ptr [edi],eax
; Hmm, a loop isn't what we want to see in a time-critical part
5c98e09a 49              dec     ecx
5c98e09b 75f2            jne     clr!JIT_TailCall+0x50 (5c98e08f)
5c98e09d 8b442408        mov     eax,dword ptr [esp+8]
5c98e0a1 8b4c241c        mov     ecx,dword ptr [esp+1Ch]
5c98e0a5 8947fc          mov     dword ptr [edi-4],eax
5c98e0a8 894ff8          mov     dword ptr [edi-8],ecx
5c98e0ab 8d47f8          lea     eax,[edi-8]
; Ok, getting near the end, restoring the registers
5c98e0ae 5e              pop     esi
5c98e0af 5f              pop     edi
5c98e0b0 59              pop     ecx
5c98e0b1 5a              pop     edx
5c98e0b2 59              pop     ecx
5c98e0b3 8be0            mov     esp,eax
clr!JIT_TailCallLeave:
; This actually "returns" to next step in the pipeline rather then the current step
5c98e0b5 c3              ret
If you want to look at the actual source code for clr!JIT_TailCall you can have a peek at dotnet.core: https://github.com/dotnet/coreclr/blob/master/src/vm/i386/jithelp.asm#L927
So it seems that clr!Jit_TailCall eliminates the current stackframe in order to make sure the stack don't grow. In x64 mode the jitter was able to just do a jmp but that doesn't work for some reason in x86 mode. Unfortunately it seems to be quite lot of work eliminating the stackframe.
clr!Jit_TailCall is most likely what's tagged as [clr.dll] when we measured
the most costly functions.
What does the function paulWestcottsMagicOil do again?

Let's see what the "magic oil does to functions x, y , v and w we looked at before:
let inline paulWestcottsMagicOil u = match u with i -> i

let rec x n = if n > 0 then  n + paulWestcottsMagicOil (y (n - 1)) else 0  // y (n - 1) is not a tail call
and     y n = if n > 0 then -n + paulWestcottsMagicOil (x (n - 1)) else 0  // x (n - 1) is not a tail call

let rec v  s n = if n > 0 then paulWestcottsMagicOil (w (s + n) (n - 1)) else s  // w (s + n) (n - 1) is a tail call
and     w  s n = if n > 0 then paulWestcottsMagicOil (v (s - n) (n - 1)) else s  // v (s - n) (n - 1) is a tail call

paulWestcottsMagicOil runs a match on input argument and whatever it is that is returned. It does nothing!
Let's see how the magic function changes the IL Assembler.
Let's start with x:
IL_0008: call         int32 Program/Sample2::y(int32)
IL_000d: add
IL_000e: ret
Same as before. The magic oil function does nothing!
Let's look at v:
IL_000a: call         int32 Program/Sample2::w(int32, int32)
IL_000f: ret
The .tail attribute is gone?! It seems that because match semantically looks at the value returned by w the F# compiler don't realize this is in essence a tail call and omits the .tail attribute.
This is the purpose of paulWestcottsMagicOil - to eliminate the .tail attribute. This means that we will get StackOverflowException when call v 0 20000.
However, it seems to help the performance of x86 code.
Let's see how this affects the x64 assembler code.
; Reserve some stack space
00007ff8`5aee1160 4883ec28        sub     rsp,28h
00007ff8`5aee1164 488bc2          mov     rax,rdx
; The test
00007ff8`5aee1167 a801            test    al,1
00007ff8`5aee1169 7515            jne     00007ff8`5aee1180
; Number is even, let's call the next step in the datapipeline
; Load the pointer to the next step in the pipeline
00007ff8`5aee116b 488b4908        mov     rcx,qword ptr [rcx+8]
; Load the object Method Table Pointer
00007ff8`5aee116f 488b01          mov     rax,qword ptr [rcx]
; Load the class Pointer to Methods
00007ff8`5aee1172 488b4040        mov     rax,qword ptr [rax+40h]
; Load and call the address to virtual method Invoke
00007ff8`5aee1176 ff5020          call    qword ptr [rax+20h]
00007ff8`5aee1179 33c0            xor     eax,eax
; Unreserve stack space
00007ff8`5aee117b 4883c428        add     rsp,28h
; We are done!
00007ff8`5aee117f c3              ret
; Number is odd, just return
00007ff8`5aee1180 33c0            xor     eax,eax
; Unreserve stack space
00007ff8`5aee1182 4883c428        add     rsp,28h
; We are done!
00007ff8`5aee1186 c3              ret
In x64 mode the code when .tail is present is better as we could jmp to the next step in the pipeline, now the jitter choses call instead. We lose some performance (25%) and we risk StackOverflowException.
Let's look at x86 code:
; Saving the framepointer. This is a good thing!
02b61079 8bec            mov     ebp,esp
02b61078 55              push    ebp
; Trying to walk a stack manually without framepointers are not fun.
;   However, ARM ABI on Linux seems to place the framepointer in a "random" position
;   so it doesn't help there but that is a story for another blog.
; Load number from stack
02b6107b 8b4508          mov     eax,dword ptr [ebp+8]
; Test number if even
02b6107e a901000000      test    eax,1
02b61083 7517            jne     02b6109c
; Ok the number is even. Let's call the next step in the pipeline
; Load the pointer to the next step in the pipeline
02b61085 8b4904          mov     ecx,dword ptr [ecx+4]
; Push the input arguments on the stack
02b61088 ff750c          push    dword ptr [ebp+0Ch]
02b6108b ff7508          push    dword ptr [ebp+8]
; Load the object Method Table Pointer
02b6108e 8b01            mov     eax,dword ptr [ecx]
; Load the class Pointer to Methods
02b61090 8b4028          mov     eax,dword ptr [eax+28h]
; Load and call the address to virtual method Invoke
02b61093 ff5010          call    dword ptr [eax+10h]
02b61096 33c0            xor     eax,eax
; Restore framepointer
02b61098 5d              pop     ebp
; We are done!
02b61099 c20800          ret     8
; Ok, the number was odd, just return
02b6109c 33c0            xor     eax,eax
; Restore framepointer
02b6109e 5d              pop     ebp
; We are done!
02b6109f c20800          ret     8
This looks much cleaner and it performs much better. The gain seems to primarily to come from calling the next step in the pipeline directly without having to go through clr!JIT_TailCall. We risk StackOverflowException but perhaps it's worth it.
What is Tail Call Optimization (TCO)?

F# doesn't have break and continue as common in other languages. In F# if you want to implement a loop that stops at a certain condition the idiom is to use tail recursion:
let rec find f = function
  | []    -> None
  | v::vs -> if f v then Some v else find vs
But what about efficiency? Doesn't the use of tail calls especially on x86 mode kill the performance.
Here is where F# compiler applies TCO and unrolls the code above to a loop. If we reverse engineer the IL code into C# it looks something like this
public static FSharpOption<a> find<a>(FSharpFunc<a, bool> f, FSharpList<a> _arg1)
{
  while (_arg1.get_TailOrNull() != null)
  {
    FSharpList<a> fsharpList = _arg1;
    FSharpList<a> tailOrNull = fsharpList.get_TailOrNull();
    a headOrDefault = fsharpList.get_HeadOrDefault();
    if (f.Invoke(headOrDefault))
      return FSharpOption<a>.Some(headOrDefault);
    FSharpFunc<a, bool> fsharpFunc = f;
    _arg1 = tailOrNull;
    f = fsharpFunc;
  }
  return (FSharpOption<a>) null;
}
So our tail recursive function becomes an efficient loop instead. TCO can not be applied in every situation, for example v and w above are mutually recursive. TCO doesn't apply there.
Tail recursion to implements loops are more useful than you might think because in F# identical loops are vary a lot in terms of performance.
// These loops are fast
for i in 0..10 do
  compute i

for i in 0..1..10 do
  compute i

for i in 0..-1..10 do
  compute i

// These loops are slow
for i in 0L..10L do
  compute i

for i in 0..2..10 do
  compute i

// This cause a significant GC overhead in addition to being slow
for i in 0..10000000 do
  for j in 0L..10L do
    compute i
If you need a fast loop in F# that increment longs by 2 tail recursion is the way to do it in F#.
Wrapping up

Tail recursion can be used to eliminate StackOverflowException

Because we seen that tail recursion cause the jitter to avoid generating a stackframe (x64 mode) or to eliminate the stackframe (x86 mode) tail recursion allows us to implement deep recursion without StackOverflowException.
Tail recursion can have a significant performance impact.

In x86 mode was saw that eliminating the stackframe using clr!JIT_TailCall caused a significant overhead. In x64 mode we instead observed a performance improvement as the jitter was able to avoid a stackframe altogether.
However, it is possible that for more complex x64 code the jitter end up needing to call clr!JIT_TailCall.
As far as I know, C# compiler don't generate the .tail attribute meaning that the JIT performance team has not optimized tail call as efficiently as possible as F# is a minor .NET language (I wish it was otherwise).
There are things the jitter could do to improve the stackframe elimination, like inlining clr!JIT_TailCall and unroll the loops.
Tail Call Optimization (TCO) is our friend!

Implementing efficient non-trivial loops in F# is best done with tail recursion. In addition; tail recursion also allows us to avoid mutable state. F# has damaged me and now I tend to write tail recursion even in language that has support for break, continue and TCO like kotlin.
C# doesn't currently support TCO so tail recursion is a bad idea in C#.
Paul Westcott is a cool guy

I hope Paul forgives me for naming a function after him. I first saw tail call suppression trick in an ambitious PR from Paul that aimed to improve the Seq module performance.
I see PRs from Paul that tries to improve the performance of various core library like Seq, generic compare/equality and Lazy. All of these libraries need to be improved but I realize the difficulty in getting the PRs approved as they typically involve massive and risky changes. I have a beer instead and tries to forget it.
Paul does try and that's why I think he should win the "Tenacious F# developer of the year" award.