The provided benchmark script uses Python's timeit module for benchmarking. I noticed that I was getting very different results between runs, even with the same compiler, so I first switched to pyperf to try to get more stable results.
I am not sure if there is a Python API for pyperf, so I started by writing a small bash wrapper called benchmark:
#!/usr/bin/bash
benchmark="twitter"
do_setup=0
perf=
asan=
while getopts "spab:" o; do
    case "$o" in
        b)
            benchmark="$OPTARG"
            ;;
        s)
            do_setup=1
            ;;
        p)
            perf="perf record --call-graph dwarf -e cycles,cycle_activity.stalls_total"
            ;;
        a)
            asan="env LD_PRELOAD=/usr/lib/libasan.so"
            ;;
    esac
done
path="$benchmark.json"
if [[ "$do_setup" -ne 0 ]]; then
    echo "Downloading $path..."
    curl -L "https://github.com/ijl/orjson/raw/master/data/$path.xz" | \
        xz --decompress > "$path"
fi
echo "Benchmark: $path"
$perf pyperf timeit -s \
"import json; d = json.load(open('$path')); import msgspec; f = msgspec.JSONEncoder().encode" \
'f(d)'
(venv) [joe@jevnik msgspec]$ python --version
Python 3.9.6
(venv) [joe@jevnik msgspec]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
CPU family: 6
Model: 94
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
CPU max MHz: 4200.0000
CPU min MHz: 800.0000
BogoMIPS: 8003.30
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fx
sr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts re
p_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdr
and lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority e
pt vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap
clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_
act_window hwp_epp
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Meltdown: Mitigation; PTI
Spec store bypass: Vulnerable
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Srbds: Vulnerable: No microcode
Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
(venv) [joe@jevnik msgspec]$ sudo cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7
Even with a more reliable benchmark harness, I wanted to run each benchmark 5 times to start to prove to myself that I had a stable result.
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fno-semantic-interposition -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fPIC -I/home/joe/projects/python/msgspec/venv/include -I/usr/include/python3.9 -c msgspec/core.c -o build/temp.linux-x86_64-3.9/msgspec/core.o
Note: my system gcc is gcc-11.1.0.
$ for n in `seq 5`;do ./benchmark -b twitter;done
Benchmark: twitter.json
.....................
Mean +- std dev: 489 us +- 6 us
Benchmark: twitter.json
.....................
Mean +- std dev: 489 us +- 5 us
Benchmark: twitter.json
.....................
Mean +- std dev: 487 us +- 3 us
Benchmark: twitter.json
.....................
Mean +- std dev: 490 us +- 5 us
Benchmark: twitter.json
.....................
Mean +- std dev: 488 us +- 5 us
$ for n in `seq 5`;do ./benchmark -b canada;done
Benchmark: canada.json
.....................
Mean +- std dev: 4.56 ms +- 0.03 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.57 ms +- 0.03 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.57 ms +- 0.05 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.57 ms +- 0.05 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.58 ms +- 0.04 ms
gcc-10 -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fno-semantic-interposition -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fPIC -I/home/joe/projects/python/msgspec/venv/include -I/usr/include/python3.9 -c msgspec/core.c -o build/temp.linux-x86_64-3.9/msgspec/core.o
$ for n in `seq 5`;do ./benchmark -b twitter;done
Benchmark: twitter.json
.....................
Mean +- std dev: 489 us +- 7 us
Benchmark: twitter.json
.....................
Mean +- std dev: 485 us +- 4 us
Benchmark: twitter.json
.....................
Mean +- std dev: 488 us +- 4 us
Benchmark: twitter.json
.....................
Mean +- std dev: 487 us +- 13 us
Benchmark: twitter.json
.....................
Mean +- std dev: 487 us +- 6 us
$ for n in `seq 5`;do ./benchmark -b canada;done
Benchmark: canada.json
.....................
Mean +- std dev: 4.55 ms +- 0.02 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.56 ms +- 0.04 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.56 ms +- 0.02 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.56 ms +- 0.03 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.57 ms +- 0.05 ms
clang -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fno-semantic-interposition -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fPIC -I/home/joe/projects/python/msgspec/venv/include -I/usr/include/python3.9 -c msgspec/core.c -o build/temp.linux-x86_64-3.9/msgspec/core.o
$ for n in `seq 5`;do ./benchmark -b twitter;done
Benchmark: twitter.json
.....................
Mean +- std dev: 476 us +- 4 us
Benchmark: twitter.json
.....................
Mean +- std dev: 474 us +- 3 us
Benchmark: twitter.json
.....................
Mean +- std dev: 475 us +- 5 us
Benchmark: twitter.json
.....................
Mean +- std dev: 476 us +- 12 us
Benchmark: twitter.json
.....................
Mean +- std dev: 475 us +- 4 us
$ for n in `seq 5`;do ./benchmark -b canada;done
Benchmark: canada.json
.....................
Mean +- std dev: 4.26 ms +- 0.07 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.24 ms +- 0.02 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.25 ms +- 0.06 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.27 ms +- 0.06 ms
Benchmark: canada.json
.....................
Mean +- std dev: 4.26 ms +- 0.05 ms
Before I investigated the code further, I wanted to make sure that the existing code was not using any undefined behavior (UB), so I did a gcc build with address sanitizer (asan) and undefined-behavior sanitizer (ubsan) to ensure that the code was not relying on any UB that the compilers were handling differently.
I changed the compiler flags to:
diff --git a/setup.py b/setup.py
index 844c53c..b0de70f 100644
--- a/setup.py
+++ b/setup.py
@@ -4,7 +4,14 @@ import versioneer
from setuptools import setup
from setuptools.extension import Extension
-ext_modules = [Extension("msgspec.core", [os.path.join("msgspec", "core.c")])]
+ext_modules = [
+ Extension(
+ "msgspec.core",
+ [os.path.join("msgspec", "core.c")],
+ extra_compile_args=['-fsanitize=address', '-fsanitize=undefined'],
+ extra_link_args=['-lasan', '-lubsan'],
+ )
+]
setup(
name="msgspec",
I did an instrumented run with:
$ PYTHONMALLOC=malloc LD_PRELOAD=/usr/lib/libasan.so python -c "import json; d = json.load(open('twitter.json')); import msgspec; f = msgspec.JSONEncoder().encode; f(d)"
There were a few leaks detected, but this is standard for Python, and I don't believe it was anything more than the odds-and-ends global structures that Python itself uses. I did not see any ubsan violations with gcc, so I tried clang and got the same results.
I noticed that gcc was emitting a few warnings, and also that I was not compiling with -Wextra, so I decided to turn on -Wextra and then disable a few warnings that are noisy when writing CPython extensions:
extra_compile_args=[
'-Wextra',
'-Wno-unused-parameter',
'-Wno-missing-field-initializers',
]
Here is a sample of warnings that seemed harmless, but might indicate the code isn't doing what was intended:
msgspec/core.c: In function ‘mp_decode_any’:
msgspec/core.c:3525:25: warning: comparison is always true due to limited range of data type [-Wtype-limits]
3525 | if (-32 <= op && op <= 127) {
| ^~
msgspec/core.c:3534:21: warning: comparison is always true due to limited range of data type [-Wtype-limits]
3534 | else if ('\x80' <= op && op <= '\x8f') {
| ^~
msgspec/core.c: In function ‘mp_skip’:
msgspec/core.c:3647:25: warning: comparison is always true due to limited range of data type [-Wtype-limits]
3647 | if (-32 <= op && op <= 127) {
| ^~
msgspec/core.c:3656:21: warning: comparison is always true due to limited range of data type [-Wtype-limits]
3656 | else if ('\x80' <= op && op <= '\x8f') {
| ^~
msgspec/core.c: In function ‘mp_validation_error’:
msgspec/core.c:3761:25: warning: comparison is always true due to limited range of data type [-Wtype-limits]
3761 | if (-32 <= op && op <= 127) {
There are also some warnings about possibly uninitialized values; however, these seem to be false positives:
msgspec/core.c: In function ‘mp_decode_any’:
msgspec/core.c:3292:12: warning: ‘s’ may be used uninitialized in this function [-Wmaybe-uninitialized]
3292 | return PyFloat_FromDouble(_PyFloat_Unpack8((unsigned char *)s, 0));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
msgspec/core.c:3290:11: note: ‘s’ was declared here
3290 | char *s;
| ^
With stable benchmarks, the sanitizers passing, and the warnings reviewed, I decided to dig into the hotspots to see what the major differences were.
I started with the twitter example because I thought I would have the most to offer in the more Python-object-heavy workload, and because ryu already seems pretty well optimized.
I ran the benchmark under perf and recorded cycles (the default) as well as stalls, to see places where the CPU was unable to make progress because it was waiting for something. You can open the result with perf report; however, I like an open source tool called hotspot for looking through perf results. The biggest hotspot by far was json_encode_str, which by itself consumed almost 50% of the total cycles in the run and almost 25% of the stalls.
Now we can focus our search on a single function and see where we might make improvements.
I started by just reading the source for this function. My first impression was that this function has an incredibly "branchy" inner loop for what should be "mostly memcpy". Next I wanted to look through the generated code for this function, which I extracted with:
$ objdump --no-addresses --source --disassemble=json_encode_str msgspec/core.cpython-39-x86_64-linux-gnu.so > gcc11-json_encode_str
The generated code has lots of jumps inside this hot loop, which is not good. All these jumps prevent the CPU from running as fast as it can: mispredicted branches stall the pipeline, and a larger, branchier loop is harder to keep in the instruction cache.
To get some more insight into how the compiler understands this function, I needed to dump some of gcc's internal state. gcc can emit information about the code after every pass it runs, and one of the output formats is a graphviz formatted .dot file. The only pass I cared about was the final, fully optimized and inlined one, so I selected it with -fdump-rtl-final, which dumps the RTL (register transfer language, gcc's lowest-level intermediate representation) pass named "final", the last pass unless you have plugins running. I also wanted a graph, not just raw RTL, so I also passed -fdump-tree-optimized-graph=graph, which dumps the optimized tree pass in graph format and writes the data to a file named graph.dot.
The full command line used was:
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fno-semantic-interposition -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fPIC -I/home/joe/projects/python/msgspec/venv/include -I/usr/include/python3.9 -c msgspec/core.c -o build/temp.linux-x86_64-3.9/msgspec/core.o -fdump-rtl-final -fdump-tree-optimized-graph=graph
Unfortunately, this writes out all of the functions, so we need to find the cluster for the function we care about. I wrote a little Python to do this:
def isolate_function(in_, out, func):
    """Extract a single function's cluster from a whole-file graphviz dump."""
    start = 'subgraph "cluster_%s" {\n' % func
    end = '}\n'
    with open(in_) as f:
        lines = [
            'digraph "small.graph" {\n',
            'overlap=false;\n',
        ]
        # Skip ahead to the start of the cluster we care about.
        for line in f:
            if line == start:
                lines.append(line)
                break
        # Copy the cluster body through its closing brace.
        for line in f:
            lines.append(line)
            if line == end:
                break
    lines.append(end)  # close the wrapping digraph
    content = ''.join(lines)
    with open(out, 'w') as f:
        f.write(content)
The results are shown in original.svg.
This representation breaks down the code into basic blocks and shows the control flow edges between them. Each edge is annotated with a predicted likelihood, where blue edges are unconditional. With a little squinting, we can match this form up with our original source code if we consider what the result of inlining will be.
The first thing to notice is just how many branches we have in this loop; the control flow is not simple! Having a small loop (in terms of generated code size) that fits in the instruction cache is very important for optimal performance.
The first thing I wanted to do was help the compiler generate better code around mp_write in our loop, so I told gcc to always inline the function; inlining it unconditionally can unlock optimizations that would otherwise be blocked until the compiler's own inlining decision was made in a later pass.
To do this, we can annotate mp_write with __attribute__((always_inline)), which is respected by both gcc and clang.
This alone gives us a pretty decent change with gcc11:
Benchmark: twitter.json
.....................
Mean +- std dev: 477 us +- 5 us
A 2.5% speedup from just a one-line change!
Looking at the graph from the original code, another thing I noticed was that the compiler was treating both outcomes of the required > self->max_output_len check as equally likely, which we know is definitely not true: the resize branch is rare.
To tell the compiler what we know, I used __builtin_expect, a gcc compiler intrinsic (also supported by clang). This intrinsic is usually wrapped in LIKELY and UNLIKELY macros, like:
#define MP_LIKELY(pred) __builtin_expect(!!(pred), 1)
#define MP_UNLIKELY(pred) __builtin_expect(!!(pred), 0)
This small change gives another small performance improvement:
Benchmark: twitter.json
.....................
Mean +- std dev: 473 us +- 5 us
The full diff at this point is:
diff --git a/msgspec/core.c b/msgspec/core.c
index 90017f0..f084f4f 100644
--- a/msgspec/core.c
+++ b/msgspec/core.c
@@ -8,6 +8,11 @@
#include "ryu.h"
+#define MP_LIKELY(pred) __builtin_expect(!!(pred), 1)
+#define MP_UNLIKELY(pred) __builtin_expect(!!(pred), 0)
+
+#define MP_INLINE __attribute__((always_inline))
+
#if PY_VERSION_HEX < 0x03090000
#define IS_TRACKED _PyObject_GC_IS_TRACKED
#define CALL_ONE_ARG(fn, arg) PyObject_CallFunctionObjArgs((fn), (arg), NULL)
@@ -2137,11 +2142,11 @@ mp_ensure_space(EncoderState *self, Py_ssize_t size) {
return 0;
}
-static inline int
+MP_INLINE static inline int
mp_write(EncoderState *self, const char *s, Py_ssize_t n)
{
Py_ssize_t required = self->output_len + n;
- if (required > self->max_output_len) {
+ if (MP_UNLIKELY(required > self->max_output_len)) {
if (mp_resize(self, required) < 0) return -1;
}
memcpy(self->output_buffer_raw + self->output_len, s, n);
These changes yield a noticeably different graph in the dot files, so we know that the compiler does see these as different functions. The graph at this stage is shown in opt-1.svg.
This at least makes me believe that my understanding of what is going on is correct, because we are making progress.
The next bit of branching I see in our loop is the type check in the resize code: the is_bytes checks in mp_resize. Resizing is unexpected, but it does happen in the loop with a sufficiently large file, so I decided to think about how to remove these conditions.
My thought was that we don't need to branch here at all; instead, we just need to remember what kind of buffer we have when we set it. To do this, I added a function pointer to the EncoderState struct which points to a function that resizes either bytes objects or bytearray objects. This function pointer avoids all the extra branching, and further gets the resizing code out of the loop, making it smaller.
The diff to do this was:
diff --git a/msgspec/core.c b/msgspec/core.c
index f084f4f..4ee348c 100644
--- a/msgspec/core.c
+++ b/msgspec/core.c
@@ -1990,6 +1990,8 @@ typedef struct EncoderState {
char *output_buffer_raw; /* raw pointer to output_buffer internal buffer */
Py_ssize_t output_len; /* Length of output_buffer */
Py_ssize_t max_output_len; /* Allocation size of output_buffer */
+
+ char* (*resize_output_buffer)(PyObject**, Py_ssize_t);
} EncoderState;
@@ -2113,24 +2115,32 @@ enum mp_code {
MP_EXT32 = '\xc9',
};
+static char*
+mp_resize_bytes(PyObject** output_buffer, Py_ssize_t size)
+{
+ int status = _PyBytes_Resize(output_buffer, size);
+ if (status < 0) return NULL;
+ return PyBytes_AS_STRING(*output_buffer);
+}
+
+static char*
+mp_resize_bytearray(PyObject** output_buffer, Py_ssize_t size)
+{
+ int status = PyByteArray_Resize(*output_buffer, size);
+ if (status < 0) return NULL;
+ return PyByteArray_AS_STRING(*output_buffer);
+}
+
+
static int
-mp_resize(EncoderState *self, Py_ssize_t size)
+mp_resize(EncoderState *self, Py_ssize_t size)
{
- int status;
- bool is_bytes = PyBytes_CheckExact(self->output_buffer);
self->max_output_len = Py_MAX(8, 2 * size);
- status = (
- is_bytes ? _PyBytes_Resize(&self->output_buffer, self->max_output_len)
- : PyByteArray_Resize(self->output_buffer, self->max_output_len)
- );
- if (status < 0) return -1;
- if (is_bytes) {
- self->output_buffer_raw = PyBytes_AS_STRING(self->output_buffer);
- }
- else {
- self->output_buffer_raw = PyByteArray_AS_STRING(self->output_buffer);
- }
- return status;
+ char* new_buf = self->resize_output_buffer(&self->output_buffer,
+ self->max_output_len);
+ if (new_buf == NULL) return -1;
+ self->output_buffer_raw = new_buf;
+ return 0;
}
static inline int
@@ -2834,6 +2844,7 @@ Encoder_encode_into(Encoder *self, PyObject *const *args, Py_ssize_t nargs)
self->state.output_buffer_raw = PyByteArray_AS_STRING(buf);
self->state.output_len = offset;
self->state.max_output_len = buf_size;
+ self->state.resize_output_buffer = mp_resize_bytearray;
status = mp_encode(&(self->state), obj);
@@ -2871,6 +2882,7 @@ encode_common(
state->output_buffer = PyBytes_FromStringAndSize(NULL, state->max_output_len);
if (state->output_buffer == NULL) return NULL;
state->output_buffer_raw = PyBytes_AS_STRING(state->output_buffer);
+ state->resize_output_buffer = mp_resize_bytes;
}
status = encode(state, args[0]);
@@ -3022,6 +3034,7 @@ msgspec_encode(PyObject *self, PyObject *const *args, Py_ssize_t nargs, PyObject
state.output_buffer = PyBytes_FromStringAndSize(NULL, state.max_output_len);
if (state.output_buffer == NULL) return NULL;
state.output_buffer_raw = PyBytes_AS_STRING(state.output_buffer);
+ state.resize_output_buffer = mp_resize_bytes;
status = mp_encode(&state, args[0]);
This change gives a pretty good improvement over our last effort:
Benchmark: twitter.json
.....................
Mean +- std dev: 465 us +- 5 us
The thing that still bothers me is that resizing is a rare enough event that we don't need any of the mp_resize code in our loop at all; we can just make a function call when this uncommon thing happens. We can tell the compiler to never inline the resize path with a small edit:
diff --git a/msgspec/core.c b/msgspec/core.c
index 4ee348c..78b1917 100644
--- a/msgspec/core.c
+++ b/msgspec/core.c
@@ -12,6 +12,7 @@
#define MP_UNLIKELY(pred) __builtin_expect(!!(pred), 0)
#define MP_INLINE __attribute__((always_inline))
+#define MP_NOINLINE __attribute__((noinline))
#if PY_VERSION_HEX < 0x03090000
#define IS_TRACKED _PyObject_GC_IS_TRACKED
@@ -2143,6 +2144,12 @@ mp_resize(EncoderState *self, Py_ssize_t size)
return 0;
}
+MP_NOINLINE static int
+mp_resize_cold(EncoderState *self, Py_ssize_t size)
+{
+ return mp_resize(self, size);
+}
+
static inline int
mp_ensure_space(EncoderState *self, Py_ssize_t size) {
Py_ssize_t required = self->output_len + size;
@@ -2157,7 +2164,7 @@ mp_write(EncoderState *self, const char *s, Py_ssize_t n)
{
Py_ssize_t required = self->output_len + n;
if (MP_UNLIKELY(required > self->max_output_len)) {
- if (mp_resize(self, required) < 0) return -1;
+ if (mp_resize_cold(self, required) < 0) return -1;
}
memcpy(self->output_buffer_raw + self->output_len, s, n);
self->output_len += n;
Now, our generated code will push the resizing out of line entirely, which is great!
Benchmark: twitter.json
.....................
Mean +- std dev: 463 us +- 3 us
This isn't a huge improvement; however, it is something, and these small wins add up. I also thought it was funny to show a case where disabling inlining is actually a performance improvement.
I also confirmed that the effect we saw earlier was not just dominated by this noinline change; having both is still an improvement over either one alone.
At this point, we have gotten gcc generating 5.3% faster code, which is faster than our initial clang measurements, so let's make sure we didn't make clang slower:
Benchmark: twitter.json
.....................
Mean +- std dev: 429 us +- 4 us
Clang seems to have done even better with these optimizations! I don't know what the equivalent tree visualizations are for clang, so I am not sure how to look into the clang internal state here.
Looking at the graph in opt-2.svg, I still see a lot of duplication between the branches for the two kinds of escaped characters. The two blocks leaving bb 17 are bb 18, the hex-encoded version, and bb 22, the \\-escaped version.
We can actually share a ton of code here by pre-initializing the escaped array with a common prefix: {'\\', escape, '0', '0'}. This prefix also happens to fit in a single word, which can be stored as an immediate (when escape == 'u') right in the instruction stream.
I did a little shuffling here:
diff --git a/msgspec/core.c b/msgspec/core.c
index 78b1917..3b59be4 100644
--- a/msgspec/core.c
+++ b/msgspec/core.c
@@ -4984,15 +4984,14 @@ json_encode_str(EncoderState *self, PyObject *obj) {
}
/* Write the escaped character */
+ size_t size = escape == 'u' ? 6 : 2;
+ char escaped[6] = {'\\', escape, '0', '0'};
if (escape == 'u') {
- const char* hex = "0123456789abcdef";
- char escaped[6] = {'\\', 'u', '0', '0', hex[c >> 4], hex[c & 0xF]};
- if (mp_write(self, escaped, 6) < 0) return -1;
- }
- else {
- char escaped[2] = {'\\', escape};
- if (mp_write(self, escaped, 2) < 0) return -1;
+ static const char* const hex = "0123456789abcdef";
+ escaped[4] = hex[c >> 4];
+ escaped[5] = hex[c & 0xF];
}
+ if (mp_write(self, escaped, size) < 0) return -1;
start = i + 1;
}
/* Write the last unescaped fragment (if any) */
This lets us share a single mp_write call and efficiently load the prefix into the escaped array. There is now one bit of dead work: we will write {'0', '0'} into the escaped array even in the non-u case; however, this is implemented as a 2-byte store of 0x3030 that will never be read, so it is quite fast.
This is actually a pretty big improvement for gcc, but a pessimization for clang (over the previous best), so I think clang is not doing the same sharing here.
# gcc 11.1.0
Benchmark: twitter.json
.....................
Mean +- std dev: 455 us +- 4 us
# clang-12.0.1
Benchmark: twitter.json
.....................
Mean +- std dev: 449 us +- 4 us
But now there isn't much of a difference between the two compilers!
I didn't exactly help you figure out what the difference between gcc and clang was, but hopefully this helps you think about optimizing C code. This process can be continued to keep squeezing performance out of the existing code. The graph visualization, followed by a comparison of the generated code, can also be used to see what different versions of gcc do with the same source, which is a good way to get a real understanding of how the compilers treat the code differently.
Thanks for this excellent writeup! I definitely learned a few things here.
A few questions:
1. The default build flags include -O3 -fno-plt -fno-semantic-interposition -march=x86-64 -mtune=generic. I found that of these, the only one that made any difference (on my hardware) was -fno-plt, which resulted in a measurable speedup on all compilers tested. After reading a bit about this flag, I'm not sure if I should enable it or not? My understanding is it has something to do with how functions in dynamic libraries are called by the code being compiled? (edit: I can no longer reproduce this flag having any effect on benchmarks. May have been measurement error before, or some other change has negated it 🤷)
2. I tried using perf report to annotate different versions of the same function to try to gain some insights, and found the line annotations less useful than I would have hoped (though this is likely due to my lack of experience). Do you ever use perf to look at costs within a function, or only to home in on which functions are hot spots?

Thanks again for the time you took to look into this and write this up!