We have encountered what appears to be a regression in iptables performance when appending many rules sequentially. We believe we have narrowed the regression down to this commit, but we do not understand its implications. A summary is provided below; more information is available on this tracker story.
vagrant init ubuntu/xenial64 # or use ubuntu/trusty64 for a comparison point
vagrant up
vagrant ssh
Run these inside the vagrant box - the list-addrs script can be found attached at the bottom.
sudo su
iptables -S # see it is empty
time (./list-addrs <number_of_rules> | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
Between tests, flush the table with
iptables -F FORWARD
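For reference, all the attached list-addrs script needs to do is print N distinct source addresses, one per line. A minimal stand-in (the actual attached script may differ in the addresses it emits):

```shell
# Hypothetical stand-in for the attached list-addrs script: print N
# distinct /32 source addresses from 10.0.0.0/16, one per line, suitable
# for feeding to xargs as above.
list_addrs() {
  n="$1"
  i=0
  while [ "$i" -lt "$n" ]; do
    # Spread the counter across the third and fourth octets.
    echo "10.0.$((i / 256)).$((i % 256))"
    i=$((i + 1))
  done
}
```

Used as `list_addrs 3000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s`, this reproduces the test above.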
Below are some sample numbers for Trusty and Xenial
root@vagrant-ubuntu-trusty-64:/vagrant# iptables -F FORWARD
root@vagrant-ubuntu-trusty-64:/vagrant# time (./list-addrs 100 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m0.079s
user 0m0.005s
sys 0m0.072s
root@vagrant-ubuntu-trusty-64:/vagrant# iptables -F FORWARD
root@vagrant-ubuntu-trusty-64:/vagrant# time (./list-addrs 1000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m0.815s
user 0m0.061s
sys 0m0.742s
root@vagrant-ubuntu-trusty-64:/vagrant# iptables -F FORWARD
root@vagrant-ubuntu-trusty-64:/vagrant# time (./list-addrs 2000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m2.277s
user 0m0.287s
sys 0m1.956s
root@vagrant-ubuntu-trusty-64:/vagrant# iptables -F FORWARD
root@vagrant-ubuntu-trusty-64:/vagrant# time (./list-addrs 3000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m3.975s
user 0m0.504s
sys 0m3.402s
root@ubuntu-xenial:/home/ubuntu# iptables -F FORWARD
root@ubuntu-xenial:/home/ubuntu# time (./list-addrs 100 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m0.102s
user 0m0.000s
sys 0m0.012s
root@ubuntu-xenial:/home/ubuntu# iptables -F FORWARD
root@ubuntu-xenial:/home/ubuntu# time (./list-addrs 1000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m2.269s
user 0m0.036s
sys 0m0.356s
root@ubuntu-xenial:/home/ubuntu# iptables -F FORWARD
root@ubuntu-xenial:/home/ubuntu# time (./list-addrs 2000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m11.709s
user 0m0.572s
sys 0m7.252s
root@ubuntu-xenial:/home/ubuntu# iptables -F FORWARD
root@ubuntu-xenial:/home/ubuntu# time (./list-addrs 3000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m33.965s
user 0m1.380s
sys 0m26.804s
At 3000 rules, the difference is staggering. See the attached graph.
The numbers below demonstrate why we believe this commit introduced the regression.
We compiled the 4.1.0-rc7+ kernel at this commit, which gave us:
Linux concourse 4.1.0-rc7+ #2 SMP Thu Oct 27 10:52:21 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
iptables v1.4.21
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 100 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m0.086s
user 0m0.000s
sys 0m0.004s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 1000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m2.412s
user 0m0.048s
sys 0m0.528s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 2000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m11.127s
user 0m0.396s
sys 0m6.904s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 3000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m32.900s
user 0m1.192s
sys 0m25.888s
This is slow. We then reverted only that commit:
Linux concourse 4.1.0-rc7+ #2 SMP Thu Oct 27 11:52:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
iptables v1.4.21
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 100 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m0.090s
user 0m0.008s
sys 0m0.004s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 1000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m1.045s
user 0m0.036s
sys 0m0.080s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 2000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m2.710s
user 0m0.044s
sys 0m0.196s
root@concourse:/home/vagrant# iptables -F FORWARD
root@concourse:/home/vagrant# time (./list-addrs 3000 | xargs -n1 iptables -A FORWARD -j ACCEPT -s)
real 0m5.038s
user 0m0.092s
sys 0m0.332s
This demonstrates reasonable performance.

iptables-restore performs better.
TL;DR - When loaded up with 150000 rules, the time it takes to perform a modifying action on the iptables ruleset (e.g. creating a chain) is ~10x worse on the Trusty 4.4 kernel than on the Trusty 3.19 kernel.
./restore-rules 50000 | iptables-restore --noflush
This creates 50000 rules and 100000 chains, which should be relatively quick on either kernel. Run the following command to confirm:
iptables -w -S | wc -l
Between each of the following tests, reset with:
iptables -X test-table
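For reference, the attached restore-rules script only needs to emit an iptables-restore payload declaring chains and rules in one transaction. A minimal stand-in (the real script's chain/rule ratio may differ - the report's invocation of `restore-rules 50000` yields 50000 rules and 100000 chains):

```shell
# Hypothetical stand-in for the attached restore-rules script: emit an
# iptables-restore payload with N chains and one rule per chain, all
# committed in a single transaction.
restore_rules() {
  n="$1"
  echo "*filter"
  i=0
  while [ "$i" -lt "$n" ]; do
    echo ":chain-$i - [0:0]"       # declare the user-defined chain
    i=$((i + 1))
  done
  i=0
  while [ "$i" -lt "$n" ]; do
    echo "-A chain-$i -j ACCEPT"   # one rule per chain
    i=$((i + 1))
  done
  echo "COMMIT"
}
```

Piping `restore_rules 50000 | iptables-restore --noflush` loads everything in one commit, which is why it stays fast even for large rule counts.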
On Trusty 3.19:
# time (iptables -N test-table)
real 0m0.562s
user 0m0.056s
sys 0m0.504s
# time (echo -e "*filter\n-N test-table\nCOMMIT" | iptables-restore --noflush)
real 0m0.560s
user 0m0.060s
sys 0m0.496s
# time (strace -f iptables -N test-table)
...
setsockopt(4, SOL_IP, 0x40 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40403656) = 0
...
real 0m0.594s
user 0m0.048s
sys 0m0.472s
On Trusty 4.4:
# time (iptables -N test-table)
real 0m10.832s
user 0m0.056s
sys 0m10.764s
# time (echo -e "*filter\n-N test-table\nCOMMIT" | iptables-restore --noflush)
real 0m10.848s
user 0m0.052s
sys 0m10.780s
# time (strace -f iptables -N test-table)
...
setsockopt(4, SOL_IP, 0x40 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40401056) = 0
...
real 0m10.879s
user 0m0.056s
sys 0m10.812s
There is a 10x performance degradation on Trusty 4.4 compared to Trusty 3.19. The interesting line from the strace output is:
setsockopt(4, SOL_IP, 0x40 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40401056) = 0
On this system call, we hung for nearly the entirety of the iptables command. Investigation shows that the call is made in libiptc.c:
ret = setsockopt(handle->sockfd, TC_IPPROTO, SO_SET_REPLACE, repl, sizeof(*repl) + repl->size);
This call writes the entire ruleset into a socket option. On each iptables invocation, all of the rules are read off the socket, modified, and written back. As the ruleset grows, these operations naturally slow down. In the original list-addrs test case, then, we were repeatedly reading and writing an ever-larger ruleset, which explains the degradation as the rule count increased.
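A back-of-envelope calculation illustrates why sequential appends degrade super-linearly. The per-rule blob size below is an assumption (the strace above suggests roughly 270 bytes per entry for a ~150000-entry table); the point is the quadratic shape, not the constant:

```shell
# Each append copies the whole existing table up from the kernel and
# writes the grown table back, so n appends transfer roughly
# 2 * (1 + 2 + ... + n) * bytes_per_rule in total.
n=3000
per_rule=270                              # assumed blob size per rule, in bytes
total=$(( n * (n + 1) / 2 * per_rule * 2 ))
echo "$total"
```

For 3000 appends this is on the order of gigabytes of data shuffled through setsockopt, even though the final ruleset is under a megabyte.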
We believe iptables-restore does not suffer the same degradation when adding many rules because it loads the ruleset, applies all changes in memory, and commits them to the socket in one go. However, this is only a workaround for the fundamental performance issue: even when creating a chain with iptables-restore (i.e. a single rule change), the 3.19 -> 4.4 performance regression is still in play.
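To make the batching difference concrete, the per-rule xargs loop from the original test can be collapsed into a single transaction. This is a sketch; list-addrs is the script attached to this report:

```shell
# Turn a list of source addresses on stdin into one iptables-restore
# transaction, replacing 3000 separate read-modify-write cycles with one.
addrs_to_restore() {
  echo "*filter"
  while read -r addr; do
    echo "-A FORWARD -s $addr -j ACCEPT"
  done
  echo "COMMIT"
}
# Intended use (requires root and the attached script):
#   ./list-addrs 3000 | addrs_to_restore | iptables-restore --noflush
```

This keeps the kernel-side read-modify-write to a single cycle, which is why iptables-restore stays fast for bulk loads while sequential iptables -A calls do not.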
We have not attempted to identify the kernel commit that introduced this significant slowdown in setsockopt.