We have encountered what appears to be a regression in iptables performance when appending many rules sequentially. We believe we have narrowed the regression down to this commit, but we do not understand its implications. A summary is provided below; more information is available on this tracker story.
vagrant init ubuntu/xenial64 # or use ubuntu/trusty64 for a comparison point
vagrant up
vagrant ssh
Run these inside the vagrant box - the list-addrs script can be found attached at the bottom.
sudo su
iptables -S # see it is empty
time (./list-addrs <number_of_rules> | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
Between tests, flush the table with
iptables -F FORWARD
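For reference, all the attached list-addrs script needs to do is print N distinct source addresses, one per line. A minimal stand-in (the actual attached script may differ in the addresses it emits):

```shell
# Hypothetical stand-in for the attached list-addrs script: print N
# distinct /32 source addresses from 10.0.0.0/16, one per line, suitable
# for feeding to xargs as above.
list_addrs() {
  n="$1"
  i=0
  while [ "$i" -lt "$n" ]; do
    # Spread the counter across the third and fourth octets.
    echo "10.0.$((i / 256)).$((i % 256))"
    i=$((i + 1))
  done
}
```

Used as `list_addrs 3000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s`, this reproduces the test above.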
Below are some sample numbers for Trusty and Xenial
root@vagrant-ubuntu-trusty-64:/vagrant# iptables -F FORWARD
root@vagrant-ubuntu-trusty-64:/vagrant# time (./list-addrs 100 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m0.079s
user 0m0.005s
sys 0m0.072s
root@vagrant-ubuntu-trusty-64:/vagrant# iptables -F FORWARD
root@vagrant-ubuntu-trusty-64:/vagrant# time (./list-addrs 1000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m0.815s
user 0m0.061s
sys 0m0.742s
root@vagrant-ubuntu-trusty-64:/vagrant# iptables -F FORWARD
root@vagrant-ubuntu-trusty-64:/vagrant# time (./list-addrs 2000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m2.277s
user 0m0.287s
sys 0m1.956s
root@vagrant-ubuntu-trusty-64:/vagrant# iptables -F FORWARD
root@vagrant-ubuntu-trusty-64:/vagrant# time (./list-addrs 3000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m3.975s
user 0m0.504s
sys 0m3.402s
root@ubuntu-xenial:/home/ubuntu# iptables -F FORWARD
root@ubuntu-xenial:/home/ubuntu# time (./list-addrs 100 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m0.102s
user 0m0.000s
sys 0m0.012s
root@ubuntu-xenial:/home/ubuntu# iptables -F FORWARD
root@ubuntu-xenial:/home/ubuntu# time (./list-addrs 1000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m2.269s
user 0m0.036s
sys 0m0.356s
root@ubuntu-xenial:/home/ubuntu# iptables -F FORWARD
root@ubuntu-xenial:/home/ubuntu# time (./list-addrs 2000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m11.709s
user 0m0.572s
sys 0m7.252s
root@ubuntu-xenial:/home/ubuntu# iptables -F FORWARD
root@ubuntu-xenial:/home/ubuntu# time (./list-addrs 3000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m33.965s
user 0m1.380s
sys 0m26.804s
At 3000 rules, the difference is staggering. See the attached graph.
The numbers below demonstrate why we believe this commit introduced the regression.
We compiled the 4.1.0-rc7+ kernel at this commit, which gave us:
Linux concourse 4.1.0-rc7+ #2 SMP Thu Oct 27 10:52:21 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
iptables v1.4.21
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 100 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m0.086s
user 0m0.000s
sys 0m0.004s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 1000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m2.412s
user 0m0.048s
sys 0m0.528s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 2000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m11.127s
user 0m0.396s
sys 0m6.904s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 3000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m32.900s
user 0m1.192s
sys 0m25.888s
This is slow. We then reverted only that commit:
Linux concourse 4.1.0-rc7+ #2 SMP Thu Oct 27 11:52:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
iptables v1.4.21
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 100 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m0.090s
user 0m0.008s
sys 0m0.004s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 1000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m1.045s
user 0m0.036s
sys 0m0.080s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 2000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m2.710s
user 0m0.044s
sys 0m0.196s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 3000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m5.038s
user 0m0.092s
sys 0m0.332s
This demonstrates reasonable performance.

iptables-restore performs better.
TL;DR - When loaded up with 150000 rules, the time it takes to perform a modifying action on the iptables ruleset (e.g. creating a chain) is ~10x worse on the Trusty 4.4 kernel than on the Trusty 3.19 kernel.
./restore-rules 50000 | iptables-restore --noflush
This creates 50000 rules and 100000 chains, which should be relatively quick on either kernel. Run the following command to confirm:
iptables -w -S | wc -l
Between each of the following tests, reset with:
iptables -X test-table
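For reference, the attached restore-rules script only needs to emit an iptables-restore payload declaring chains and rules in one transaction. A minimal stand-in (the real script's chain/rule ratio may differ - the report's invocation of `restore-rules 50000` yields 50000 rules and 100000 chains):

```shell
# Hypothetical stand-in for the attached restore-rules script: emit an
# iptables-restore payload with N chains and one rule per chain, all
# committed in a single transaction.
restore_rules() {
  n="$1"
  echo "*filter"
  i=0
  while [ "$i" -lt "$n" ]; do
    echo ":chain-$i - [0:0]"       # declare the user-defined chain
    i=$((i + 1))
  done
  i=0
  while [ "$i" -lt "$n" ]; do
    echo "-A chain-$i -j ACCEPT"   # one rule per chain
    i=$((i + 1))
  done
  echo "COMMIT"
}
```

Piping `restore_rules 50000 | iptables-restore --noflush` loads everything in one commit, which is why it stays fast even for large rule counts.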
On Trusty 3.19:
# time (iptables -N test-table)
real 0m0.562s
user 0m0.056s
sys 0m0.504s
# time (echo -e "*filter\n-N test-table\nCOMMIT" | iptables-restore --noflush)
real 0m0.560s
user 0m0.060s
sys 0m0.496s
# time (strace -f iptables -N test-table)
...
setsockopt(4, SOL_IP, 0x40 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40403656) = 0
...
real 0m0.594s
user 0m0.048s
sys 0m0.472s
On Trusty 4.4:
# time (iptables -N test-table)
real 0m10.832s
user 0m0.056s
sys 0m10.764s
# time (echo -e "*filter\n-N test-table\nCOMMIT" | iptables-restore --noflush)
real 0m10.848s
user 0m0.052s
sys 0m10.780s
# time (strace -f iptables -N test-table)
...
setsockopt(4, SOL_IP, 0x40 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40401056) = 0
...
real 0m10.879s
user 0m0.056s
sys 0m10.812s
There is a 10x performance degradation on Trusty 4.4 compared to Trusty 3.19. The interesting line from the strace output is:
setsockopt(4, SOL_IP, 0x40 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40401056) = 0
On this system call, we hung for nearly the entirety of the iptables command. Investigation shows that the call is made in libiptc.c:
ret = setsockopt(handle->sockfd, TC_IPPROTO, SO_SET_REPLACE, repl, sizeof(*repl) + repl->size);
This call writes the entire ruleset into a socket option. On each iptables invocation, all of the rules are read off the socket, modified, and written back. As the ruleset grows, these operations naturally slow down. In the original list-addrs test case, then, we were repeatedly reading and writing an ever-larger ruleset, which explains the degradation as the rule count increased.
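A back-of-envelope calculation illustrates why sequential appends degrade super-linearly. The per-rule blob size below is an assumption (the strace above suggests roughly 270 bytes per entry for a ~150000-entry table); the point is the quadratic shape, not the constant:

```shell
# Each append copies the whole existing table up from the kernel and
# writes the grown table back, so n appends transfer roughly
# 2 * (1 + 2 + ... + n) * bytes_per_rule in total.
n=3000
per_rule=270                              # assumed blob size per rule, in bytes
total=$(( n * (n + 1) / 2 * per_rule * 2 ))
echo "$total"
```

For 3000 appends this is on the order of gigabytes of data shuffled through setsockopt, even though the final ruleset is under a megabyte.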
We believe iptables-restore does not suffer the same degradation when adding many rules because it loads the ruleset, applies all changes in memory, and commits them to the socket in one go. However, this is only a workaround for the fundamental performance issue: even when creating a chain with iptables-restore (i.e. a single rule change), the 3.19 -> 4.4 performance regression is still in play.
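To make the batching difference concrete, the per-rule xargs loop from the original test can be collapsed into a single transaction. This is a sketch; list-addrs is the script attached to this report:

```shell
# Turn a list of source addresses on stdin into one iptables-restore
# transaction, replacing 3000 separate read-modify-write cycles with one.
addrs_to_restore() {
  echo "*filter"
  while read -r addr; do
    echo "-A FORWARD -s $addr -j ACCEPT"
  done
  echo "COMMIT"
}
# Intended use (requires root and the attached script):
#   ./list-addrs 3000 | addrs_to_restore | iptables-restore --noflush
```

This keeps the kernel-side read-modify-write to a single cycle, which is why iptables-restore stays fast for bulk loads while sequential iptables -A calls do not.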
We have not attempted to identify the kernel commit that introduced this significant slowdown in setsockopt.