Benchmark huge pages
====================
Host
====
Physical machine
2 * Intel X5670 @ 2.93 GHz, HT activated, 24 threads, cpufreq deactivated
48 GB
OS: SLES10 SP3
Linux ncegcolnx243 2.6.16.60-0.54.5-smp #1 SMP Fri Sep 4 01:28:03 UTC 2009 x86_64 x86_64 x86_64 GNU/Linux
MemTotal: 49452408 kB
MemFree: 9745596 kB
Buffers: 721068 kB
Cached: 4744648 kB
SwapCached: 1188 kB
Active: 3621452 kB
Inactive: 1953992 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 49452408 kB
LowFree: 9745596 kB
SwapTotal: 8393920 kB
SwapFree: 8391420 kB
Dirty: 12 kB
Writeback: 0 kB
AnonPages: 106088 kB
Mapped: 30396 kB
Slab: 527300 kB
CommitLimit: 16342908 kB
Committed_AS: 580044 kB
PageTables: 3312 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 266784 kB
VmallocChunk: 34359471279 kB
HugePages_Total: 16384
HugePages_Free: 16384
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
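The HugePages_Total and Hugepagesize lines above show a pre-reserved pool of 16384 * 2 MB pages,
i.e. the 32 GB huge page area mentioned in the Redis section below. As a rough illustration of how
a process can back its heap with pages from such a pool, here is a minimal sketch using a file on a
hugetlbfs mount (one common approach; the mount point and details are assumptions for the example,
not the actual Redis patch, which is linked in the Redis section):

/*
 * Minimal sketch (not the actual Redis patch): allocate memory backed by
 * 2 MB huge pages from the reserved pool, via a file on a hugetlbfs mount.
 * The mount point /mnt/huge is a hypothetical path chosen for the example.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

static void *huge_alloc(size_t size) {
    /* hugetlbfs mappings must be a multiple of the huge page size */
    size = (size + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);

    int fd = open("/mnt/huge/heap", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open hugetlbfs file"); return NULL; }
    unlink("/mnt/huge/heap");   /* keep the mapping, drop the name */

    /* MAP_PRIVATE so a forked child (bgsave) gets copy-on-write semantics */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    return (p == MAP_FAILED) ? NULL : p;
}

int main(void) {
    size_t sz = 64UL * 1024 * 1024;          /* 64 MB for the example */
    char *mem = huge_alloc(sz);
    if (!mem) { fprintf(stderr, "huge page allocation failed\n"); return 1; }
    memset(mem, 0, sz);                      /* touch the huge pages */
    printf("%zu bytes allocated from the huge page pool\n", sz);
    munmap(mem, sz);
    return 0;
}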
processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
stepping : 2
cpu MHz : 2926.082
cache size : 12288 KB
physical id : 0
siblings : 12
core id : 10
cpu cores : 6
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx smx est tm2 cx16 xtpr dca popcnt lahf_lm ida arat
bogomips : 5852.17
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
available: 2 nodes (0-1)
node 0 size: 24240 MB
node 0 free: 2773 MB
node 1 size: 24215 MB
node 1 free: 6730 MB
node distances:
node    0    1
  0:   10   20
  1:   20   10
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1
nodebind: 0 1
membind: 0 1
/sys/devices/system/node/node0:
total 0
drwxr-xr-x 2 root root 0 Oct 8 13:31 .
drwxr-xr-x 4 root root 0 Aug 25 17:13 ..
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu0 -> ../../../../devices/system/cpu/cpu0
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu10 -> ../../../../devices/system/cpu/cpu10
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu12 -> ../../../../devices/system/cpu/cpu12
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu14 -> ../../../../devices/system/cpu/cpu14
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu16 -> ../../../../devices/system/cpu/cpu16
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu18 -> ../../../../devices/system/cpu/cpu18
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu2 -> ../../../../devices/system/cpu/cpu2
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu20 -> ../../../../devices/system/cpu/cpu20
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu22 -> ../../../../devices/system/cpu/cpu22
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu4 -> ../../../../devices/system/cpu/cpu4
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu6 -> ../../../../devices/system/cpu/cpu6
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu8 -> ../../../../devices/system/cpu/cpu8
-r--r--r-- 1 root root 4096 Oct 8 13:31 cpumap
-r--r--r-- 1 root root 4096 Oct 8 13:30 distance
-r--r--r-- 1 root root 4096 Oct 8 13:30 meminfo
-r--r--r-- 1 root root 4096 Oct 8 13:29 numastat
/sys/devices/system/node/node1:
total 0
drwxr-xr-x 2 root root 0 Oct 8 13:31 .
drwxr-xr-x 4 root root 0 Aug 25 17:13 ..
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu1 -> ../../../../devices/system/cpu/cpu1
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu11 -> ../../../../devices/system/cpu/cpu11
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu13 -> ../../../../devices/system/cpu/cpu13
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu15 -> ../../../../devices/system/cpu/cpu15
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu17 -> ../../../../devices/system/cpu/cpu17
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu19 -> ../../../../devices/system/cpu/cpu19
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu21 -> ../../../../devices/system/cpu/cpu21
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu23 -> ../../../../devices/system/cpu/cpu23
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu3 -> ../../../../devices/system/cpu/cpu3
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu5 -> ../../../../devices/system/cpu/cpu5
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu7 -> ../../../../devices/system/cpu/cpu7
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu9 -> ../../../../devices/system/cpu/cpu9
-r--r--r-- 1 root root 4096 Oct 8 13:31 cpumap
-r--r--r-- 1 root root 4096 Oct 8 13:30 distance
-r--r--r-- 1 root root 4096 Oct 8 13:30 meminfo
-r--r--r-- 1 root root 4096 Oct 8 13:29 numastat
Result of redis-benchmark on this machine:
------------------------------------------
Using loopback
PING (inline): 135501.36 requests per second
PING: 136798.91 requests per second
MSET (10 keys): 78864.35 requests per second
SET: 134770.89 requests per second
GET: 135685.22 requests per second
INCR: 133868.81 requests per second
LPUSH: 134952.77 requests per second
LPOP: 134952.77 requests per second
SADD: 134048.27 requests per second
SPOP: 134048.27 requests per second
LPUSH (again, in order to bench LRANGE): 134228.19 requests per second
LRANGE (first 100 elements): 78988.94 requests per second
LRANGE (first 300 elements): 41614.64 requests per second
LRANGE (first 450 elements): 29994.00 requests per second
LRANGE (first 600 elements): 24195.50 requests per second
Using Unix domain socket
PING (inline): 194552.53 requests per second
PING: 194931.77 requests per second
MSET (10 keys): 96805.42 requests per second
SET: 194931.77 requests per second
GET: 193423.59 requests per second
INCR: 194931.77 requests per second
LPUSH: 196463.66 requests per second
LPOP: 194174.77 requests per second
SADD: 194174.77 requests per second
SPOP: 192307.70 requests per second
LPUSH (again, in order to bench LRANGE): 196078.44 requests per second
LRANGE (first 100 elements): 94966.77 requests per second
LRANGE (first 300 elements): 46339.20 requests per second
LRANGE (first 450 elements): 33333.33 requests per second
LRANGE (first 600 elements): 25866.53 requests per second
Redis
=====
Version 2.2.12 + patch to fix LRANGE issue + CPU affinity patch + COW ratio patch
Size of the dump file: 3 GB
Peak memory consumption: 24 GB (working set after filling or loading)
Compiled with huge page support
https://gist.github.com/1240452
Benchmark program
https://gist.github.com/1272522
COW ratio patch
https://gist.github.com/1240427
Note: the COW ratio patch returns meaningless results when huge page support is activated,
because huge pages are not tracked correctly in the page map of the process.
With huge pages, the COW ratio has therefore been evaluated manually from /proc/meminfo,
by comparing HugePages_Total and HugePages_Free (see the sketch below).
The huge pages area was hard-limited to 32 GB (the working set was about 24 GB),
so about one third of the Redis memory size was still available to support COW at bgsave time.
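As a minimal sketch of the manual evaluation described in the note above, the following reads
HugePages_Total and HugePages_Free from /proc/meminfo; sampling it before and during a bgsave
gives a rough idea of how many huge pages have been duplicated by COW:

/*
 * Report the number of huge pages currently in use, from /proc/meminfo.
 * Comparing the value before and during a bgsave approximates the number
 * of 2 MB pages duplicated by copy-on-write.
 */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("/proc/meminfo"); return 1; }

    long total = -1, freep = -1, pagesize_kb = -1;
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        /* each sscanf only matches on its own line and is a no-op otherwise */
        sscanf(line, "HugePages_Total: %ld", &total);
        sscanf(line, "HugePages_Free: %ld", &freep);
        sscanf(line, "Hugepagesize: %ld kB", &pagesize_kb);
    }
    fclose(f);

    if (total < 0 || freep < 0) {
        fprintf(stderr, "no huge page information found\n");
        return 1;
    }
    if (pagesize_kb < 0) pagesize_kb = 2048;  /* assume 2 MB pages */

    long used = total - freep;
    printf("huge pages in use: %ld (%ld MB)\n", used, used * pagesize_kb / 1024);
    return 0;
}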
Benchmark
=========
The fill use case generates about 50M write queries (multiplied by 3 clients).
It is used to build the working set.
The read use case generates about 50M random read queries (multiplied by 3 clients).
It is used to evaluate the read throughput.
The update use case generates random write queries, with a tunable throttle to limit the throughput.
It is used to test COW efficiency (by running it concurrently with a bgsave).
All tests are done using Unix domain sockets.
Clients are bound to the same physical CPU (socket) as the Redis server, but on different cores
(enforced using CPU affinity; see the sketch below). This is the most efficient configuration.
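A minimal sketch of this kind of pinning, using the standard Linux sched_setaffinity() call
(illustration only, not the actual CPU affinity patch linked in the Redis section):

/* Pin the calling process to a single CPU before running a workload. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int cpu = (argc > 1) ? atoi(argv[1]) : 0;

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);                 /* allow this CPU only */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CPU %d\n", cpu);
    /* ... run the benchmark client or server workload here ... */
    return 0;
}

On this host the sysfs listing above shows even-numbered CPUs on node 0 and odd-numbered CPUs
on node 1, so pinning the server and its 3 clients to, say, CPUs 0, 2, 4 and 6 keeps them all
on the same socket.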
Fill, 3 clients, with HP
------------------------
ncegcolnx243:genload> x 0 0 68.20s user 2.98s system 15% cpu 7:31.85 total
ncegcolnx243:genload> x 0 1 75.28s user 3.76s system 17% cpu 7:31.90 total
ncegcolnx243:genload> x 0 2 77.30s user 3.48s system 17% cpu 7:31.94 total
=> throughput = 331931 q/s
used_cpu_sys:376.68
used_cpu_user:18.35
used_memory:18810796424
used_memory_human:17.52G
used_memory_rss:25773010944
mem_fragmentation_ratio:1.37
Read queries, 3 clients, with HP
--------------------------------
ncegcolnx243:genload> x 1 0 85.68s user 3.04s system 23% cpu 6:09.86 total
ncegcolnx243:genload> x 1 1 83.82s user 2.98s system 23% cpu 6:09.86 total
ncegcolnx243:genload> x 1 2 90.71s user 1.62s system 24% cpu 6:09.86 total
=> throughput = 405558 q/s
used_cpu_sys:736.58
used_cpu_user:28.31
Bgsave with HP
--------------
[6150] 08 Oct 15:51:51 * Fork: 3893
[6150] 08 Oct 15:51:51 * Background saving started by pid 7487
[7487] 08 Oct 15:52:57 * DB saved on disk
=> fork latency = 4 ms
=> duration = 66 secs
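The "Fork:" figure logged above is in microseconds (3893 us, hence the 4 ms latency; the run
without huge pages below logs Fork: 286990, i.e. 287 ms). Most of the fork cost for a large
process comes from duplicating its page tables, and with 2 MB pages there are roughly 512 times
fewer entries to copy, which is consistent with this drop. A small stand-alone sketch
(hypothetical test code, not Redis's implementation) showing how fork latency grows with the
amount of touched memory:

/* Measure fork() latency after allocating and touching a large heap. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv) {
    size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 10) : 1024;  /* default 1 GB */
    size_t size = mb * 1024 * 1024;

    char *heap = malloc(size);
    if (!heap) { perror("malloc"); return 1; }
    memset(heap, 0xAB, size);            /* touch every page */

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    pid_t pid = fork();
    gettimeofday(&t1, NULL);

    if (pid == 0) _exit(0);              /* child exits immediately */
    if (pid < 0) { perror("fork"); return 1; }

    long usec = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("fork() of a %zu MB process took %ld us\n", mb, usec);
    waitpid(pid, NULL, 0);
    return 0;
}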
Restart with HP
---------------
[7630] 08 Oct 15:57:22 * Server started, Redis version 2.2.12
[7630] 08 Oct 15:58:22 * DB loaded from disk: 60 seconds
Fill, 3 clients, without HP
---------------------------
ncegcolnx243:genload> x 0 0 68.89s user 2.04s system 15% cpu 7:38.15 total
ncegcolnx243:genload> x 0 2 67.00s user 3.56s system 15% cpu 7:38.22 total
ncegcolnx243:genload> x 0 1 74.39s user 2.95s system 16% cpu 7:38.24 total
=> throughput = 327367 q/s
used_cpu_sys:381.86
used_cpu_user:24.17
used_memory:18810795776
used_memory_human:17.52G
used_memory_rss:25715482624
mem_fragmentation_ratio:1.37
Read queries, 3 clients, without HP
-----------------------------------
ncegcolnx243:genload> x 1 0 81.52s user 2.60s system 22% cpu 6:21.30 total
ncegcolnx243:genload> x 1 1 84.88s user 1.30s system 22% cpu 6:21.30 total
ncegcolnx243:genload> x 1 2 87.23s user 2.46s system 23% cpu 6:21.30 total
=> throughput = 393391 q/s
used_cpu_sys:752.62
used_cpu_user:34.69
Bgsave without HP
-----------------
[17603] 08 Oct 18:34:06 * Fork: 286990
[17603] 08 Oct 18:34:06 * Background saving started by pid 18600
[18600] 08 Oct 18:35:13 * DB saved on disk
=> fork latency = 287 ms
=> duration = 67 secs
Restart without HP
------------------
[19396] 08 Oct 18:48:38 * Server started, Redis version 2.2.12
[19396] 08 Oct 18:49:41 * DB loaded from disk: 63 seconds
COW efficiency evaluation
-------------------------
With huge pages, at only 60 w/s, the ratio is about 25%.
With a higher throughput (for instance 120 w/s), the 32 GB limit is quickly reached and Redis is killed,
which means too many pages (more than one third) are copied.
Without huge pages, at 60 w/s, the ratio is negligible.
Without huge pages, at 5000 w/s, the ratio is about 5%.
Final results
=============
Here are all the results in a single table:
                                 With HP   Without HP   Ratio %
Throughput fill (q/s)             331931       327367    101.39
   CPU user                       376.68       381.86     98.64
   CPU sys                         18.35        24.17     75.92
   CPU total                      395.03       406.03     97.29
Throughput read queries (q/s)     405558       393391    103.09
   CPU user                        359.9       370.76     97.07
   CPU sys                          9.96        10.52     94.68
   CPU total                      369.86       381.28     97.00
Fork latency (ms)                      4          287      1.39
BGSAVE duration (s)                   66           67     98.51
Load duration (s)                     60           63     95.24
We can see that the gain in throughput due to huge pages is
between 1 and 4% (i.e. a few percent only). The gain
in system CPU is about 25% at object creation time
(but system CPU accounts for a tiny fraction of the
total CPU consumption). There are also small gains of
about 1.5% and 5% at save and load time respectively.

It is clear that activating huge pages to boost the general
performance of Redis is not really worth it.

The most noticeable benefit is of course the fork latency,
which drops dramatically from 287 ms to only 4 ms (for a 24 GB
instance).

COW efficiency is abysmal with huge pages. Even with a very low
update rate, most of the pages end up duplicated very quickly.
It is almost mandatory to provision twice the memory to support
background saves, unless some strong locality in the traffic can
be exploited.