Notes on Linux memory management options to prioritize and control memory access using older ulimits, newer cgroups and overcommit policy settings. Mostly an attempt to keep a desktop environment responsive and avoid swap thrashing under high memory pressure.

Overview

Some notes about:

  • Explaining why Linux swap thrashing still happens today (as of 2016).
  • Mitigating "stop the world" type thrashing issues on a Linux workstation when it's under high memory pressure and where responsiveness is more important than process completion.
  • Prioritizing and limiting memory use.
  • Older ulimit versus newer CGroup options.

These notes assume some basic background knowledge about memory management, ulimits and cgroups.

Documentation about Linux memory management and setting limits is all over the place. In general:

  • ulimits apply per process.
    • OS default soft limits, plus hard limits, act as guard rails, but memory-related limits are typically left unlimited.
  • CGroups can be used to apply an overall limit for a group of processes, and in a hierarchical way.
    • But, as an example, Ubuntu 16.04 LTS doesn't enable CGroup memory and swap accounting by default.
  • Linux allows for over-committed virtual memory allocation
    • vm.overcommit* and vm.swappiness can be used to change the over-commit and swap behaviour of Linux virtual memory.
  • OOM (out of memory manager). TODO

Linux memory management

  • Processes often request memory chunks in a greedy and speculative way, e.g.
    • Large arrays/buffers need to cover worst-case size scenarios for sets of objects.
    • To perform well and avoid overhead of micromanaging memory allocations byte by byte.
    • A process won't usually be accessing all of its allocated memory frequently or at once.
  • The kernel compensates for this and improves overall system performance by over-committing memory:
    • Allowing the total virtual memory address space to exceed the actual memory available (possibly including even swap memory space).
    • The kernel will happily grant a process a large allocation, but only backs it with physical memory when the program actually touches the pages (a quick way to see this gap is sketched after this list).
    • Uses swap space on storage to back memory pages that appear inactive. Inactive memory gets paged out to swap and paged back in from swap if accessed later.
    • However, under sustained memory pressure, where the resident working set exceeds the system's physical memory, the kernel is pushed into a hard-to-manage situation where it has to keep swapping around memory pages that are actually active.
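A rough way to see the gap between what processes have allocated and what they actually touch (the numbers will obviously differ per system):

# Compare allocated address space (VSZ, kB) with resident memory (RSS, kB)
# for the most allocation-hungry processes; VSZ is usually far larger.
$ ps -eo pid,comm,vsz,rss --sort=-vsz | head -n 6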

Thrashing

"Thrashing" scenarios happen under high memory pressure and neatly describes the noise a magnetic hard disc would make as it ended up bogged down with lots of random IO requests while being treated as if it was RAM (which it ain't!). This typically lock's up storage I/O as Linux continuously accesses the swap partition (or file). No kernel wizardry can save a system from this mess without:

  • Killing greedy processes, or
  • Prioritizing which important processes should receive more CPU, memory and I/O access.

Linux is usually tuned for the server use case. On servers, it accepts the performance penalty of thrashing as a deliberate, slow, "avoid killing processes and don't lose data" trade-off. On desktops, the trade-off is different: users would prefer more aggressive culling, and a bit of data loss (process sacrifice) to keep things responsive is more acceptable in the end-user desktop use case.

Examples below are on a system with 16GB of RAM running Ubuntu 16.04, but should be applicable in general.

Related thrashing questions and answers

I attempted answering some related superuser.com questions:

ulimits

System-wide persistent hard limit

/etc/security/limits.conf (also limits.d/..) can be set such that no single process should:

  • Be allowed to lock half of the physical memory (seen with ulimit -H -l)
  • Be allowed to request more memory than is physically possible (seen with ulimit -H -v)
$ cat /etc/security/limits.d/mem.conf
*	hard	memlock	8198670
*	hard	as	16135196

To set the above via bash one-liners for a system with an arbitrary amount of memory (larger than, say, 1 GB):

sudo bash -c "echo -e \"*\thard\tmemlock\t$(($(grep -E 'MemTotal' /proc/meminfo | grep -oP '(?<=\s)\d+(?=\skB$)') / 2))\" > /etc/security/limits.d/mem.conf"
sudo bash -c "echo -e \"*\thard\tas\t$(($(grep -E 'MemTotal' /proc/meminfo | grep -oP '(?<=\s)\d+(?=\skB$)') - 256*2**10))\" >> /etc/security/limits.d/mem.conf"

Note:

  • The resident memory setting, rss or ulimit -H -m, no longer has any effect since Linux kernel 2.4.30! Hence only the overall virtual memory (address space) ulimit is usable.
  • address space hard limit = <physical memory> - 256MB.
      • This is more conservative than Ubuntu 16.04's unlimited default.
      • In theory, a process can speculatively request lots of memory but only actively use a subset (smaller working set/resident memory use).
      • The above hard limit will cause such processes to abort (even if they might have otherwise run fine given Linux allows the virtual memory address space to be over-committed).
    • Only mitigates against a single process going overboard with memory use.
    • Won't prevent a multi-process workload with heavy memory pressure causing thrashing (cgroups is then more appropriate).
  • locked memory hard limit = <physical memory> / 2
    • This is more generous than Ubuntu 16.04's 64 kB default.

Once off (custom soft limit)

A simple example:

$ bash
$ ulimit -S -v $((1*2**20))
$ r2(){ r2 $@$@;};r2 r2
bash: xmalloc: .././subst.c:3550: cannot allocate 134217729 bytes (946343936 bytes allocated)

It:

  • Sets a soft limit of 1 GB on overall virtual memory use (ulimit takes its value in kB, so 1*2**20 kB = 1 GB).
  • Runs a recursive bash function call r2(){ r2 $@$@;};r2 r2 that exponentially chews up CPU and RAM by doubling its arguments on every call (a variant that wraps an arbitrary command is sketched below).
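The same once-off approach can wrap an arbitrary command; a sketch, where some_memory_hungry_app is just a placeholder:

# Run in a subshell so the soft limit only affects this command, not the
# interactive shell; 4*2**20 kB = 4 GiB of virtual address space.
$ (ulimit -S -v $((4*2**20)); some_memory_hungry_app)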

CGroups

The major improvement from CGroups is (as per CGroup v2):

The combined memory+swap accounting and limiting is replaced by real control over swap space.

It offers more control, but is currently more complex to use:

  • Improves on the ulimit offering: physical memory and swap can be accounted and limited separately (e.g. memory.limit_in_bytes for the limit, with memory.max_usage_in_bytes tracking peak usage).
  • Need to enable some kernel cgroup flags in bootloader: cgroup_enable=memory swapaccount=1.
    • This didn't happen by default with Ubuntu 16.04.
    • Probably due to some performance implications of extra accounting overhead.
  • cgroup/systemd stuff is relatively new and still changing a fair bit, so the flux upstream implies Linux distro vendors haven't yet made it easy to use. Between 14.04 LTS and 16.04 LTS, the user-space tooling for cgroups has changed.
  • cgm now seems to be the officially supported userspace tool.
  • systemd unit files don't yet seem to have any pre-defined "vendor/distro" defaults to prioritize important services like ssh.

In the future, let's hope to see "distro/vendors" pre-configure cgroup priorities and limits (via systemd units) for important things like SSH and the graphical stack, such that they never get starved of memory.

System-wide persistent cgroup memory controller setting

E.g. to check current settings via the cgroup filesystem:

$ echo $(($(cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes) / 2**20)) MB
11389 MB
$ cat /sys/fs/cgroup/memory/memory.stat
...

systemd slices and the unified cgroup hierarchy

Before trying to apply a system-wide limit via cgroups, it's probably worthwhile reviewing the unified cgroup hierarchy. At a high level, an example of the tree with repetitive parts omitted:

$ systemd-cgls
Control group /:
-.slice
├─1212 /sbin/cgmanager -m name=systemd
├─user
│ └─root
├─docker
│ └─<container_id>
├─init.scope
│ └─1 /sbin/init splash
├─system.slice
│ ├─<name>.service
│ ...
└─user.slice
  ├─user-121.slice
  │ ├─user@121.service
  │ │ └─init.scope
  │ │   ├─3648 /lib/systemd/systemd --user
  │ │   └─3649 (sd-pam)
  │ └─session-c1.scope
  │   ├─3442 gdm-session-worker [pam/gdm-launch-environment]
  │   ...
  └─user-1000.slice
    ├─user@1000.service
    │ └─init.scope
    │   ├─5377 /lib/systemd/systemd --user
    │   └─5383 (sd-pam)         
    └─session-4.scope
      ├─ 5372 gdm-session-worker [pam/gdm-password]
      ...

As per RedHat documentation and man pages (systemd.slice, systemd.resource-control, systemd.special):

  • -.slice: the root slice (may be used to set defaults for the whole tree!);
  • system.slice: the default place for all system services (and scope units);
  • user.slice: the default place for all user sessions (sessions handled by systemd-logind);
  • machine.slice: for virtual machines and containers registered with systemd-machined (docker.io's v 1.12 .deb install and service doesn't use this);

System-wide global cgroup memory limits can probably be applied to the root -.slice.

TODO: where to configure, on Ubuntu, a persistent memory limit for the root slice? It might not be possible, as per the RHEL 6 sec-memory doc:

You cannot use memory.limit_in_bytes to limit the root cgroup; you can only apply values to groups lower in the hierarchy.

Reference:

Per systemd unit

CGroup options can be set via systemd resource control options (a transient once-off example follows this list). E.g.:

  • MemoryLow (memory.low)
  • MemoryHigh (memory.high)
  • MemoryMax (memory.max)
  • (memory.swap.max)
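As a once-off illustration of these (a sketch; on Ubuntu 16.04's systemd 229 / cgroup v1 the relevant property is MemoryLimit=, with MemoryMax= as its unified-hierarchy successor, and the memory controller must be enabled as per the boot flags above):

# Run the earlier memory hog in a transient scope capped at 1G.
$ sudo systemd-run --scope -p MemoryLimit=1G bash -c 'r2(){ r2 $@$@;};r2 r2'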

However, enabling memory and swap accounting has some drawbacks:

  • Overhead. Current docker documentation briefly mentions 1% extra memory use and 10% performance degradation (probably with regard to memory allocation operations - it doesn't really specify).
  • Cgroup/systemd stuff has been heavily re-worked recently, so the flux upstream implies Linux distro vendors might be waiting for it to settle first.

Other useful options for CPU and IO priority might be:

  • IOWeight (io.weight)
  • CPUShares (cpu.weight)

Even I/O page-cache write-back can be prioritized by tuning the write-back buffer via the following sysctls (checked in the snippet after this list):

  • vm.dirty_background_ratio
  • vm.dirty_ratio
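For example, to inspect the current write-back thresholds and, purely as an experiment, make background write-back kick in earlier:

$ sysctl vm.dirty_background_ratio vm.dirty_ratio
# Start background write-back at 5% of RAM and block writers at 10%.
$ sudo sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=10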

The CGroup v2 documentation suggests memory.high as the main option to throttle and manage memory use by a process group.

Given systemd and cgroup user space tools are complex, I haven't found a simple way to set something appropriate and leverage this further.

TODO: example to prioritize the ssh service and reduce priority for a logstash process

Default systemd and cgroup settings (on Ubuntu)

In Ubuntu 16.04 LTS, not even CPU and IO priorities are appropriately set!? When I checked the systemd cgroup limits, CPU shares, etc. that were applied, as far as I could tell, Ubuntu hadn't baked in any pre-defined prioritizations. E.g.:

$ systemctl show 'dev-mapper-Ubuntu\x2dswap.swap'
...
MemoryLimit=18446744073709551615
...

Many values were set to 18446744073709551615, which is 2^64 - 1, i.e. effectively unlimited on a 64-bit architecture able to address 16 exbibytes (whether or not the system actually has that much!)

I compared that to the same output for ssh, samba, gdm and nginx. Important things like the GUI and remote admin console have to fight equally with all other processes when thrashing happens.

In future, if a typical workload is known for a service or process, then via a systemd unit (a drop-in sketch follows this list):

  • MemoryHigh=<byte value> can be set as an upper threshold: above it the group is throttled and put under heavy reclaim pressure (pushed towards swap) rather than being granted more physical memory.
  • MemoryLow=<byte value> can be set as a lower threshold below which the group's memory is protected from reclaim, avoiding starving or swapping away memory the process needs for viable operation (possibly an alternative to letting a process lock memory).
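A minimal sketch of what such a drop-in override could look like (the unit name and byte values are placeholders, and MemoryLow=/MemoryHigh= assume a systemd/kernel combination exposing the unified (v2) memory controller):

# /etc/systemd/system/example.service.d/memory.conf (hypothetical unit)
[Service]
MemoryLow=512M
MemoryHigh=2G
MemoryMax=3G

Followed by a systemctl daemon-reload and a restart of the unit to take effect.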

Useful systemd cgroup commands

To see the cgroup / systemd slices:

$ systemctl -t slice

To see the cgroup hierarchy / tree:

$ systemd-cgls

To see current resource usage, top-style:

$ systemd-cgtop

References:

Once off custom CGroup memory controller setting

E.g. to test limiting the memory of a single process:

$ cgm create memory mem_1G
$ cgm setvalue memory mem_1G memory.limit_in_bytes $((1*2**30))
$ cgm setvalue memory mem_1G memory.memsw.limit_in_bytes $((1*2**30))
$ bash
$ cgm movepid memory mem_1G $$
$ r2(){ r2 $@$@;};r2 r2
Killed

To see it in action chewing up RAM as a background process and then getting killed:

$ bash -c 'cgm movepid memory mem_1G $$; r2(){ r2 $@$@;};r2 r2' & while [ -e /proc/$! ]; do ps -p $! -o pcpu,pmem,rss h; sleep 1; done
[1] 3201
 0.0  0.0  2876
 102  0.2 44056
 103  0.5 85024
 103  1.0 166944
 ...
98.9  5.6 920552
99.1  4.3 718196
[1]+  Killed                  bash -c 'cgm movepid memory mem_1G $$; r2(){ r2 $@$@;};r2 r2'

Note the exponential (power of 2) growth in memory requests.

Kernel over-commit policy

Global settings

If you prefer applications to be denied further memory allocations rather than allowing over-commit, the settings below can be used to test how your system behaves under high memory pressure. E.g.:

$ sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50

But it only comes into full effect when the policy is changed to disable over-commit and apply the ratio (i.e. both settings are needed):

$ sudo sysctl -w vm.overcommit_memory=2

Note:

  • Like ulimit, this can cause many applications to crash, given it's common for developers to not gracefully handle the OS declining a memory allocation request.
  • Trades off:
    • The occasional risk of a drawn-out lockup due to thrashing (losing all work after a hard reset) against a more frequent risk of various processes crashing.
    • A limited over-commit policy is appropriate for running fewer things more reliably (stability), rather than maximizing the available memory for bigger overall workloads.
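When the strict policy is active, the enforced ceiling and the amount of address space already promised can be compared via /proc/meminfo:

# CommitLimit = swap + (vm.overcommit_ratio / 100) * RAM; it is only enforced
# when vm.overcommit_memory=2. Committed_AS is what has already been handed out.
$ grep -E 'CommitLimit|Committed_AS' /proc/meminfo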

To restore default behavior:

sudo sysctl -w vm.overcommit_memory=0

Out of Memory Manager (OOM)

This is a nice comical analogy about the OOM killer: oom_pardon, aka don't kill my xlock

One might hope the Linux OOM tweaks coming along in more recent kernels recognize when the working set exceeds physical memory and kill a greedy process.

  • When it doesn't, the thrashing problem happens.
  • The problem is, with a big swap partition, it can look as if the system still has headroom while the kernel merrily over-commits and keeps serving up memory requests, even though the working set is spilling over into swap, effectively treating storage as if it were RAM.

OOM and CGroups

However, a CGroup v2 quote shows that monitoring memory pressure situations needs more work (as of 2015). This may imply that a "smarter" OOM killer still needs work.

A measure of memory pressure - how much the workload is being impacted due to lack of memory - is necessary to determine whether a workload needs more memory; unfortunately, memory pressure monitoring mechanism isn't implemented yet.

TODO: Further research into how OOM, and other cgroup options like memory.oom_control work.

Note:

  • OOM is not invoked when memory.high is reached.
  • However, if cgroup memory.max is reached then OOM is invoked.
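As a starting point for the TODO above, the v1 memory controller's per-group memory.oom_control file can at least be inspected (mem_1G being the group created earlier):

# Shows oom_kill_disable (whether the OOM killer is disabled for this group)
# and under_oom (whether the group is currently stalled waiting for memory).
$ cat /sys/fs/cgroup/memory/mem_1G/memory.oom_control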

systemd unit setting to avoid OOM culling a service

TODO...

OOMScoreAdjust is another systemd option to help weight the OOM killer away from killing processes considered more important.
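A sketch of such a drop-in for the SSH service (the value is illustrative; the range is -1000 to 1000, and more negative values make the OOM killer less likely to pick the process):

# /etc/systemd/system/ssh.service.d/oom.conf (sketch)
[Service]
OOMScoreAdjust=-500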
