TE_Inline_Cont_Sockets.md

Setting `DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1` noticeably improves simple TE benchmarks such as the ones below on all UNIX architectures. From my understanding, it avoids dispatching from the event thread to the thread pool and instead does the work on the same thread that received the request.
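
A minimal conceptual sketch of the difference (not the runtime's actual code): by default the socket completion is queued to the `ThreadPool`, while with the variable set it runs directly on the thread that observed the epoll/kqueue event. The variable is read at process start, e.g. `DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1 dotnet MyApp.dll`.

```csharp
// Conceptual sketch only, not the runtime's implementation.
using System;
using System.Threading;

class CompletionDispatchSketch
{
    // Stands in for the per-request work (parse the request, write the response).
    static void HandleCompletion(int socketId) =>
        Console.WriteLine($"socket {socketId} handled on thread {Environment.CurrentManagedThreadId}");

    static void Main()
    {
        // Default behaviour: the event thread queues the continuation to the ThreadPool,
        // paying for the queue/dispatch hop but keeping the event thread free.
        ThreadPool.QueueUserWorkItem(_ => HandleCompletion(1));

        // Inline-completions behaviour: the continuation runs directly on the thread
        // that received the event, skipping the hop entirely.
        HandleCompletion(2);

        Thread.Sleep(100); // give the ThreadPool item time to run before the process exits
    }
}
```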

| TE Benchmark | Baseline, RPS | MyTest, RPS | diff, % |
|---|---|---|---|
| ARM64 Platform-JSON PGO | 661,663 | 778,925 | +17.72% |
| ARM64 Platform-Caching PGO | 186,188 | 218,004 | +17.09% |
| ARM64 Platform-Plaintext PGO | 6,933,964 | 7,563,428 | +9.08% |
| x64 Platform-JSON PGO | 1,299,388 | 1,432,200 | +10.22% |
| x64 Platform-Caching PGO | 413,123 | 445,144 | +7.75% |
| x64 Platform-Plaintext PGO | 12,529,587 | 13,137,836 | +4.85% |

(The +17% on ARM64 looks like a sign that something can still be improved there, e.g. the threads-per-engine heuristic or the SpinWait parameters?)

However, it most likely regresses pretty much anything more complicated than "receive a tiny request and immediately send something back", presumably because the continuation now ties up the event thread instead of running on the thread pool:

| TE Benchmark | Baseline, RPS | MyTest, RPS | diff, % |
|---|---|---|---|
| ARM64 Platform-Fortunes PGO | 88,765 | 51,648 | -41.81% |
| x64 Platform-Fortunes PGO | 494,777 | 410,766 | -16.98% |

Can we do a sort of PGO (static or dynamic), but at the managed level, to adapt to users' workloads dynamically?
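
Purely as a thought experiment for that question, a hypothetical managed-level heuristic could time inline continuations and fall back to thread-pool dispatch once they turn out to be expensive. Nothing below is a real runtime API; the class, threshold, and policy are all made up for illustration.

```csharp
// Hypothetical adaptive-dispatch sketch; none of these names exist in the runtime.
using System;
using System.Diagnostics;
using System.Threading;

static class AdaptiveCompletionDispatcher
{
    // Assumed budget: roughly 50 µs of continuation work, expressed in Stopwatch ticks.
    private static readonly long InlineBudgetTicks = Stopwatch.Frequency / 20_000;
    private static long _lastObservedTicks; // racy, but a rough signal is enough for a sketch

    public static void Dispatch(Action continuation)
    {
        if (Volatile.Read(ref _lastObservedTicks) <= InlineBudgetTicks)
        {
            // Cheap workloads (Plaintext/JSON-style): run inline on the event thread
            // and keep measuring so a change in the workload is noticed.
            long start = Stopwatch.GetTimestamp();
            continuation();
            Volatile.Write(ref _lastObservedTicks, Stopwatch.GetTimestamp() - start);
        }
        else
        {
            // Expensive workloads (Fortunes-style): hand off so the event thread
            // can keep draining completions for other sockets. A real heuristic
            // would also decay back to inline; omitted to keep the sketch short.
            ThreadPool.QueueUserWorkItem(static state => ((Action)state!)(), continuation);
        }
    }
}
```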
