
@svantelidman
Created December 2, 2018 18:13
What do the resource requests and limits mean in OpenShift and how might they affect your Java application?

Java resource management

To the extent that it is implementation specific, the discussion below is based on the OpenJDK HotSpot runtime.

Threads

The JVM maps Java threads to OS threads 1:1, i.e., there are no "green threads" in Java like there are in some other virtual machines, for example BEAM, the Erlang virtual machine.

A thread in the JVM can be in one of the following states:

  • thread_new: a new thread in the process of being initialized
  • thread_in_Java: a thread that is executing Java code
  • thread_in_vm: a thread that is executing inside the VM
  • thread_blocked: the thread is blocked for some reason (acquiring a lock, waiting for a condition, sleeping, performing a blocking I/O operation and so forth)

In addition to the application threads the JVM spawns a number of internal threads where the most important ones are:

  • The VM thread
  • GC threads
  • JIT threads

At a high level you can say that the VM thread coordinates the other threads, for example ensuring that all application threads are blocked during a stop-the-world GC, which can happen with all of the commonly used garbage collectors.

Garbage collection and threads

Both the CMS and the G1 garbage collectors have concurrent and non-concurrent phases, which is why they are referred to as mostly concurrent garbage collectors. Concurrent here means concurrent with the running application threads. A full GC with compaction is non-concurrent for both CMS and G1.

Typically the JVM assigns the number of threads for the concurrent phases automatically. For systems with fewer than eight physical cores it will assign one thread per physical core by default. This can be overridden by passing command-line options to the JVM.
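As a rough illustration, the default sizing HotSpot uses can be sketched like this. This is a minimal approximation of the documented heuristics for `-XX:ParallelGCThreads` (one thread per core up to eight cores, then about 5/8 of each additional core) and the G1 default for `-XX:ConcGCThreads`; the function names are ours, and the exact values can differ between JVM versions.

```python
def default_parallel_gc_threads(cpus: int) -> int:
    """Approximate HotSpot's default for -XX:ParallelGCThreads:
    one thread per core up to 8 cores, then 5/8 of each core beyond that."""
    if cpus <= 8:
        return cpus
    return 8 + (cpus - 8) * 5 // 8

def default_conc_gc_threads(parallel_threads: int) -> int:
    """Approximate the G1 default for -XX:ConcGCThreads:
    roughly a quarter of the parallel GC threads, and at least one."""
    return max(1, (parallel_threads + 2) // 4)

# A 16-core host gets 13 parallel GC threads by default, not 16 --
# which is still far more than a container limited to one core needs.
print(default_parallel_gc_threads(16))
print(default_conc_gc_threads(default_parallel_gc_threads(16)))
```

This is why a small container on a large host can end up with many more GC threads than its CPU quota can reasonably serve unless the flags are set explicitly.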

On a CPU-constrained system there is a risk that the lower-priority concurrent GC processing lags behind. In that case there will be an increased risk and frequency of stop-the-world GC events, where all application threads are paused. There seems to be no situation, however, in which lack of GC thread priority can hang processing indeterminately; it just means that you are likely to see more stop-the-world GC phases. We may want to tune how GC threads are allocated as a consequence of this. Here is an excerpt from one of the references below:

We prevent our container from being throttled prematurely and permits more opportunities for our application threads to execute by limiting the number of JVM threads to at most the number of cores available. Our base Docker image automatically detects the resources available to the container and tunes the JVM accordingly at start time. Setting the flags -XX:ParallelGCThreads, -XX:ConcGCThreads, and -Djava.util.concurrent.ForkJoinPool.common.parallelism prevents many unnecessary pauses. However, many JVM components rely on Runtime.getRuntime.availableProcessors() which still returns the number of physical cores available. To overcome this, we compile and load a C library that overrides the JVM native function JVM_ActiveProcessorCount and returns our calculated value instead. This gives us complete control to limit all dynamically scalable aspects of the JVM without performance penalties.
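The "calculated value" the excerpt overrides `JVM_ActiveProcessorCount` with can be derived from the container's CFS bandwidth settings, which is essentially what later JVMs do themselves with container support enabled. A minimal sketch, assuming the quota and period have already been read from the cgroup files `cpu.cfs_quota_us` and `cpu.cfs_period_us` (the function name is ours):

```python
import math

def cpus_from_cfs(quota_us: int, period_us: int, host_cpus: int) -> int:
    """Derive an effective CPU count from CFS bandwidth settings.
    A quota of -1 means 'no limit', so we fall back to the host count;
    otherwise round the quota/period ratio up and never report zero."""
    if quota_us <= 0:
        return host_cpus
    return max(1, math.ceil(quota_us / period_us))

# A container with quota 150000 us per 100000 us period "has" 2 CPUs,
# even on a 32-core host.
print(cpus_from_cfs(150_000, 100_000, host_cpus=32))
```

Feeding this value to the JVM keeps GC threads, JIT threads, and `ForkJoinPool` sizing in line with what the scheduler will actually grant the container.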

The Linux Scheduler - Completely Fair Scheduler (CFS)

CFS handles the scheduling of tasks, and both processes and threads are tasks. CFS ensures that all tasks eventually get scheduled. All tasks get equally sized time slices, but on a CPU-constrained system lower-priority tasks will receive their time slices less often than higher-priority tasks. This is controlled by the relative weight of the task, which by default is 1024 (roughly 1024 / 1.25^nice) and which determines how often the task will get scheduled relative to other tasks that share the same CPU (or CPU quota in the cgroup case, see below).
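The weight formula above can be made concrete with a small sketch. This is an approximation of the kernel's precomputed weight table (the real table is rounded slightly differently at some nice levels), showing how a task's share of CPU falls as its nice value rises:

```python
def cfs_weight(nice: int) -> int:
    """Approximate CFS load weight for a task at a given nice level.
    nice 0 maps to the default weight of 1024; each nice step changes
    the weight by roughly a factor of 1.25."""
    return round(1024 / 1.25 ** nice)

def cpu_share(weight: int, *other_weights: int) -> float:
    """A task's expected fraction of CPU on a saturated core: its
    weight divided by the total weight of all runnable tasks."""
    return weight / (weight + sum(other_weights))

# Two default tasks split the CPU evenly; a nice 5 task sharing a core
# with a nice 0 task gets roughly a quarter of it.
print(cpu_share(cfs_weight(0), cfs_weight(0)))
print(round(cpu_share(cfs_weight(5), cfs_weight(0)), 2))
```

The same weight mechanism underlies the cgroup `cpu.shares` value discussed below.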

Cgroups, Docker, and OpenShift

You may know that Docker utilizes Linux cgroups and namespaces to do its magic and that OpenShift/Kubernetes orchestrates the execution of Docker containers across a cluster of machines called worker nodes. But how does it all fit together, and what do the different settings in OpenShift/Kubernetes really mean?

Cgroups

Docker provides a way to run what is known as a Docker image in an isolated environment called a container. Typically you will run a single process inside your container, and the image specifies the full environment that the process will run in, including the operating system, libraries and so on. The Linux kernel on the host system is, however, common to all containers running on the system. On Linux, Docker uses cgroups and namespaces to achieve this. Here we will focus on the cgroup aspect of things.

Cgroups, short for control groups (originally called process containers), have been part of the Linux kernel since 2008 and provide a mechanism to isolate, track and limit the resource usage (CPU, memory, I/O, etc.) of a group of processes.

Docker uses cgroups on Linux to make it possible to set resource constraints for a particular container. Cgroups are arranged in a tree-like structure. Typically systemd is the root cgroup, one of whose children is the system cgroup where the Docker daemon, dockerd, runs; all containers are then arranged as cgroups below the dockerd cgroup. But what are the different constraints that can be set for a cgroup, and what do they mean?

  • For memory you can set a soft and a hard limit, but it is the hard limit that really matters. If your container exceeds the hard limit it is subject to being killed without any compromise. The soft limit is basically a warning and has no effect apart from raising the warning itself.
  • For CPU you can set a value called shares per cgroup, which is a weight relative to all other cgroups that have the same parent cgroup. The default value for this weight is 1024.

Docker on Linux

Setting CPU Limits

Setting Docker CPU limits on Linux maps directly to Cgroups as described above. The Docker command lines options that we need to consider are:

| Option | Description |
| --- | --- |
| `--cpus=<value>` | Specify how much of the available CPU resources a container can use. For instance, if the host machine has two CPUs and you set `--cpus="1.5"`, the container is guaranteed at most one and a half of the CPUs. This is the equivalent of setting `--cpu-period="100000"` and `--cpu-quota="150000"`. Available in Docker 1.13 and higher. |
| `--cpu-period=<value>` | Specify the CPU CFS scheduler period, which is used alongside `--cpu-quota`. Defaults to 100000 microseconds (100 milliseconds). Most users do not change this from the default. If you use Docker 1.13 or higher, use `--cpus` instead. |
| `--cpu-quota=<value>` | Impose a CPU CFS quota on the container: the number of microseconds per `--cpu-period` that the container is limited to before being throttled, acting as the effective ceiling. If you use Docker 1.13 or higher, use `--cpus` instead. |
| `--cpuset-cpus` | Limit the specific CPUs or cores a container can use. A comma-separated list or hyphen-separated range of CPUs a container can use, if you have more than one CPU. The first CPU is numbered 0. A valid value might be `0-3` (to use the first, second, third, and fourth CPU) or `1,3` (to use the second and fourth CPU). |
| `--cpu-shares` | Set this flag to a value greater or less than the default of 1024 to increase or reduce the container's weight, and give it access to a greater or lesser proportion of the host machine's CPU cycles. This is only enforced when CPU cycles are constrained. When plenty of CPU cycles are available, all containers use as much CPU as they need; in that way, this is a soft limit. `--cpu-shares` does not prevent containers from being scheduled in swarm mode. It prioritizes container CPU resources for the available CPU cycles. It does not guarantee or reserve any specific CPU access. |
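The equivalence between `--cpus` and the period/quota pair is simple arithmetic, sketched below (the function name is ours):

```python
def cpus_to_quota(cpus: float, period_us: int = 100_000) -> int:
    """Translate Docker's --cpus value into the equivalent CFS quota
    in microseconds per period: --cpus=1.5 with the default 100 ms
    period corresponds to --cpu-quota=150000."""
    return int(cpus * period_us)

# Half a CPU is 50 ms of run time per 100 ms period.
print(cpus_to_quota(0.5))
```

Once the container has consumed its quota within a period it is throttled until the next period starts, regardless of whether the host has idle cores.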

OpenShift

Limitation on containers

For each container in a pod you can set how much memory and CPU the container requests and also how much memory and CPU the container is limited to.

CPU values are specified in millicores, and memory is typically specified in megabytes or gigabytes.

CPU

Requested CPU is not a hard limit; it is an indication to the Kubernetes scheduler as to which worker node in the cluster might be a good place for the pod. It does not place any upper bound on the CPU the container may consume if the node is not CPU-constrained, but it does provide a guarantee that the container will get at least this amount of CPU. This setting maps to the CPU shares of cgroups, which are then used by the CFS scheduler to determine how often the cgroup will get a slice of the available CPU resources.

CPU limit, on the other hand, is a hard limit. If a container attempts to exceed it, the container will be throttled even if there are spare CPU cycles available on the worker node. Setting a limit makes the performance of the service/pod more predictable and independent of what else is scheduled on the worker node.
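The mapping from millicores to the cgroup values described earlier can be sketched as follows. This reflects the commonly documented Kubernetes translation (a request of 1000 millicores corresponds to the default weight of 1024 shares, and a limit becomes a CFS quota per period); the function names are ours, and exact rounding may vary by version:

```python
def request_to_shares(millicores: int) -> int:
    """Map a Kubernetes CPU request to cgroup cpu.shares:
    1000m (one core) corresponds to the default weight of 1024.
    Kubernetes floors this at 2 shares."""
    return max(2, millicores * 1024 // 1000)

def limit_to_quota(millicores: int, period_us: int = 100_000) -> int:
    """Map a Kubernetes CPU limit to a CFS quota in microseconds
    per scheduling period."""
    return millicores * period_us // 1000

# A request of 500m weighs half as much as a full core; a limit of
# 250m allows 25 ms of CPU time per 100 ms period.
print(request_to_shares(500))
print(limit_to_quota(250))
```

So the request influences how contended CPU is divided (a relative weight), while the limit is an absolute ceiling enforced every period.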

Memory

As for CPU, requested memory is a hint to the Kubernetes scheduler when deciding on which worker node the pod should be placed. If the pod exceeds its memory limit (if set), the container will likely be killed and restarted, depending on the container restart policy.

Quotas for projects

You can set quotas per OpenShift project for a number of different things, like the number of config maps, persistent volume claims, and so on. Here, however, we only discuss quotas for memory and CPU.

Quotas are expressed as limits on the sum of requested CPU and memory across all pods in the project and on the sum of the CPU and memory limits across all pods in the project. Note that these quotas are not applied at runtime; they are enforced by refusing the creation of new resources that would violate the quota constraint.
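The enforcement model can be sketched as a simple admission check, here shown only for CPU in millicores (the function name and tuple representation are ours, not an OpenShift API):

```python
def violates_quota(pods, new_pod, quota_requests_m, quota_limits_m):
    """Admission-style check: creating new_pod is refused if the summed
    CPU requests or limits (millicores) would exceed the project quota.
    Each pod is a (request_m, limit_m) tuple. Running pods are never
    evicted by the quota; only new resources are refused."""
    total_req = sum(req for req, _ in pods) + new_pod[0]
    total_lim = sum(lim for _, lim in pods) + new_pod[1]
    return total_req > quota_requests_m or total_lim > quota_limits_m

# With a 1000m request quota, an existing pod requesting 500m blocks
# a new pod requesting 600m, but a 400m pod is admitted.
print(violates_quota([(500, 1000)], (600, 600), 1000, 2000))
print(violates_quota([(500, 1000)], (400, 800), 1000, 2000))
```

Note that this means a quota change never disrupts workloads that are already running; it only affects future scheduling.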

References

Threads in the OpenJDK HotSpot JVM

G1 Garbage Collector

CMS Garbage Collector

Completely Fair Scheduler - 15 min Video Overview

Cgroup basics

Here is a great article that brings Java and containers together. Beware that the issue it describes may have been fixed in RHEL, as described in the next link.

Nobody puts Java in a container

This article contains a similar discussion focusing around tuning GC threads on startup when running Java 8 apps in Kubernetes.

Understanding Linux Container Scheduling

OpenShift Container Compute Resources

OpenShift Quotas and Limit Ranges

Linux nice command


larkly commented Dec 3, 2018

An important aspect of OpenShift limits and requests is how the QoS tiers are calculated from the defined limits and requests. When no requests or limits are defined, the QoS tier is BestEffort. When requests are defined with a lower value than the limits, the pod gets a higher-priority QoS tier called Burstable. If the request and limit have the same value, the QoS tier is Guaranteed.

QoS Burstable will not allow a pod to use more resources than its limit, but if limits are undefined it will be allowed to burst up to the resources available on the node, sharing those resources with other burstable deployments based on available resources and their QoS priority.

https://docs.openshift.com/container-platform/3.7/dev_guide/compute_resources.html#quality-of-service-tiers


atgper commented Dec 6, 2018

Apparently Java 10 behaves better in containers so perhaps we should just aim for that?


atgper commented Dec 6, 2018

Or let's say Java 11 as it is the LTS
