Node overcommit
KubeVirt does not yet support classical Memory Overcommit Management or Memory Ballooning. In other words, VirtualMachineInstances (VMIs) cannot give back memory they have allocated. However, a few other settings can be tweaked to reduce the memory footprint and to overcommit the per-VMI memory overhead.
Remove the Graphical Devices
The first and safest option to reduce the memory footprint is to remove the graphical device from the VMI by setting `spec.domain.devices.autoattachGraphicsDevice` to `false`. See the video and graphics device documentation for further details and examples.
This saves a constant `16MB` per VirtualMachineInstance but also disables VNC access.
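As an illustration of the setting described above, a VMI without a graphical device could look roughly like the following sketch. The VMI name and the memory request are hypothetical placeholders, and disks, volumes and other settings are omitted.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: testvmi-nographics          # hypothetical name
spec:
  domain:
    devices:
      # Do not attach the default graphical device; this also disables VNC.
      autoattachGraphicsDevice: false
    resources:
      requests:
        memory: 1024M               # hypothetical value
  # disks, volumes and other settings omitted
```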
Overcommit the Guest Overhead
Before you continue, make sure you are familiar with the Out of Resource Management of Kubernetes.
Every VirtualMachineInstance requests slightly more memory from Kubernetes than the user requested for the guest operating system. The additional memory covers the per-VMI overhead of the KubeVirt infrastructure that wraps the actual VirtualMachineInstance process.
To increase the VMI density on the node, it is possible to not request the additional overhead by setting `spec.domain.resources.overcommitGuestOverhead` to `true`.
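A minimal sketch of a VMI that does not request the per-VMI overhead might look like this. The VMI name and the `1024M` request are placeholders chosen for illustration, and the rest of the definition is omitted.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: testvmi-overcommit-overhead   # hypothetical name
spec:
  terminationGracePeriodSeconds: 30
  domain:
    resources:
      # Only the guest memory is requested from Kubernetes;
      # the infrastructure overhead is not added to the request.
      overcommitGuestOverhead: true
      requests:
        memory: 1024M                 # hypothetical value
  # disks, volumes and other settings omitted
```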
This works fine as long as most VirtualMachineInstances do not use all of the memory they were granted. That is especially the case for short-lived VMIs. But long-lived VirtualMachineInstances, or VMIs that run extremely memory-intensive tasks, will sooner or later use all the memory they are granted.
Overcommit Guest Memory
The third option is real memory overcommit on the VMI. In this scenario the VMI is explicitly told that it has more memory available than what is requested from the cluster, by setting `spec.domain.memory.guest` to a value higher than `spec.domain.resources.requests.memory`.
For example, a VMI definition can request `1024MB` from the cluster while telling the VMI that it has `2048MB` of memory available.
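A sketch of such a definition, using the two values from the text, could look as follows. The VMI name is a hypothetical placeholder and the rest of the definition is omitted.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: testvmi-overcommit-memory   # hypothetical name
spec:
  terminationGracePeriodSeconds: 30
  domain:
    resources:
      requests:
        memory: 1024M               # what is requested from the cluster
    memory:
      guest: 2048M                  # what the guest is told it has
  # disks, volumes and other settings omitted
```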
As long as there is enough free memory available on the node, the VMI can consume up to `2048MB`. This VMI gets the `Burstable` QoS class assigned by Kubernetes (see QoS classes in Kubernetes for more details). The same eviction rules as for Pods apply to the VMI if the node comes under memory pressure.
Implicit memory overcommit is disabled by default. This means that when the memory request is not specified, it is set to match `spec.domain.memory.guest`. However, it can be enabled using `spec.configuration.developerConfiguration.memoryOvercommit` in the `kubevirt` CR. For example, setting `memoryOvercommit: "150"` defines that when the memory request is not explicitly set, it is implicitly set to achieve a memory overcommit of 150%. For instance, with `spec.domain.memory.guest: 3072M`, an omitted memory request is set to 2048M (3072M / 1.5). Note that the actual memory request depends on additional configuration options such as `overcommitGuestOverhead`.
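The cluster-wide setting could be sketched as below. The CR name and namespace are assumptions based on a typical installation, and the value is written here as a plain integer, which is also an assumption about the field type; adjust both to your deployment.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt          # assumed CR name
  namespace: kubevirt     # assumed namespace
spec:
  configuration:
    developerConfiguration:
      # 150% overcommit: a VMI with guest memory 3072M and no explicit
      # memory request gets an implicit request of 3072M / 1.5 = 2048M.
      memoryOvercommit: 150
```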
Configuring the memory pressure behavior of nodes
If the node comes under memory pressure, then depending on the `kubelet` configuration the virtual machines may get killed by the OOM handler or by the `kubelet` itself. This behaviour can be tuned to the requirements of your VirtualMachineInstances by:
- Configuring [Soft Eviction Thresholds](https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#soft-eviction-thresholds)
- Configuring [Hard Eviction Thresholds](https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#hard-eviction-thresholds)
- Requesting the right QoS class for VirtualMachineInstances
- Setting `--system-reserved` and `--kube-reserved`
- Enabling KSM
- Enabling swap
Configuring Soft Eviction Thresholds
Note: Soft Eviction will effectively shut down VirtualMachineInstances. They are not paused, hibernated or migrated. Further, Soft Eviction is disabled by default.
If configured, VirtualMachineInstances get evicted once the available memory falls below the threshold specified via `--eviction-soft`, and each VMI is given the chance to shut down within the timespan specified via `--eviction-max-pod-grace-period`. The flag `--eviction-soft-grace-period` specifies how long a soft eviction condition must hold before soft evictions are triggered.
If set properly according to the demands of the VMIs, overcommitting should lead to soft evictions only in rare cases and only for some VMIs. They may even get rescheduled to the same node, since they start with a lower initial memory demand. For some workload types this can be perfectly fine and lead to better overall memory utilization.
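The three flags above have equivalents in the kubelet configuration file. The following sketch shows how they might be set there; all threshold and grace-period values are hypothetical and need to be tuned to the node and its workloads.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Equivalent of --eviction-soft: start soft eviction below this threshold.
evictionSoft:
  memory.available: "1Gi"        # hypothetical value
# Equivalent of --eviction-soft-grace-period: how long the condition
# must hold before soft evictions are triggered.
evictionSoftGracePeriod:
  memory.available: "1m30s"      # hypothetical value
# Equivalent of --eviction-max-pod-grace-period (in seconds): time a
# VMI gets to shut down once it is being evicted.
evictionMaxPodGracePeriod: 120   # hypothetical value
```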
Configuring Hard Eviction Thresholds
Note: If unspecified, the kubelet will do hard evictions for Pods once `memory.available` falls below `100Mi`.
Limits set via `--eviction-hard` lead to immediate eviction of VirtualMachineInstances or Pods. This stops VMIs without a grace period and is comparable to power loss on a real computer.
If the hard limit is hit, VMIs may occasionally simply be killed. They may be rescheduled to the same node immediately, since they start with lower memory consumption again. This can be a simple option if the memory threshold is hit only rarely and the work performed by the VMIs is reproducible or can be resumed from checkpoints.
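In the kubelet configuration file, the hard threshold could be raised above the default roughly as follows. The `500Mi` value is only a hypothetical example.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Equivalent of --eviction-hard: evict immediately, without a grace
# period, once available memory drops below this threshold.
evictionHard:
  memory.available: "500Mi"   # hypothetical value
```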
Requesting the right QoS Class for VirtualMachineInstances
Different QoS classes get assigned to Pods and VirtualMachineInstances based on `requests.memory` and `limits.memory`. KubeVirt currently supports the QoS classes `Burstable` and `Guaranteed`. `Burstable` VMIs are evicted before `Guaranteed` VMIs.
This allows creating two classes of VMIs:
- One type has equal `requests.memory` and `limits.memory` set and therefore gets the `Guaranteed` class assigned. These VMIs are the last to be evicted and should not run into memory issues, but they are more demanding.
- One type has no `limits.memory`, or a `limits.memory` that is greater than `requests.memory`, and therefore gets the `Burstable` class assigned. These VMIs are evicted first.
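The two classes could be sketched as the following pair of VMI definitions. Names and sizes are hypothetical. Note that Kubernetes only assigns `Guaranteed` when CPU requests and limits are equal as well, so the first example also sets CPU values; that part goes beyond the memory-only description above and is an assumption about how you would configure it.

```yaml
# Guaranteed: requests equal limits.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: testvmi-guaranteed    # hypothetical name
spec:
  domain:
    resources:
      requests:
        memory: 2048M
        cpu: "1"
      limits:
        memory: 2048M
        cpu: "1"
  # disks, volumes and other settings omitted
---
# Burstable: the memory limit is greater than the memory request.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: testvmi-burstable     # hypothetical name
spec:
  domain:
    resources:
      requests:
        memory: 1024M
      limits:
        memory: 2048M
  # disks, volumes and other settings omitted
```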
Setting `--system-reserved` and `--kube-reserved`
It may be important to reserve some memory for other daemons (not DaemonSets) that run on the same node (ssh, dhcp servers, etc.). This reservation can be made with the `--system-reserved` switch. In addition, a separate flag called `--kube-reserved` reserves resources for the kubelet and Docker.
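Both reservations can also be expressed in the kubelet configuration file. The sketch below uses hypothetical sizes; appropriate values depend on what actually runs on the node.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Equivalent of --system-reserved: memory set aside for OS daemons
# such as ssh or dhcp servers.
systemReserved:
  memory: "1Gi"      # hypothetical value
# Equivalent of --kube-reserved: memory set aside for the kubelet
# and the container runtime.
kubeReserved:
  memory: "512Mi"    # hypothetical value
```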
Enabling KSM
The KSM (Kernel same-page merging) daemon can be started on the node. Depending on its tuning parameters, it tries more or less aggressively to merge identical pages between applications and VirtualMachineInstances. The more aggressively it is configured, the more CPU it uses itself, so the memory overcommit advantage comes with a slight CPU performance hit.
Config file tuning allows changes to the scanning frequency (how often KSM activates) and the aggressiveness (how many pages per second it scans).
Enabling Swap
Note: Enabling swap makes sure that your VirtualMachineInstances are not killed or evicted from the node for lack of memory, but it comes at the cost of quite unpredictable performance once the node runs out of memory. The kubelet may also fail to detect that it should evict Pods to restore performance.
Enabling swap is in general not recommended on Kubernetes right now. However, it can be useful in combination with KSM, since KSM merges identical pages over time. Swap allows the VMIs to successfully allocate memory which will then effectively never be used because of the later de-duplication done by KSM.
Node CPU allocation ratio
KubeVirt runs Virtual Machines in a Kubernetes Pod. This Pod requests a certain amount of CPU time from the host. The Virtual Machine, on the other hand, is created with a certain number of vCPUs. The number of vCPUs does not necessarily correlate with the number of CPUs requested by the Pod. Depending on the QoS class of the Pod, vCPUs can be scheduled on a variable number of physical CPUs; this depends on the CPU resources available on the node. When fewer CPUs are available on the node than the number of requested vCPUs, the vCPUs are overcommitted.
By default, each Pod requests 100m of CPU time. The CPU request on the Pod sets the cgroups `cpu.shares` value, which serves as a priority for the scheduler when it provides CPU time to the vCPUs in this Pod. As the number of vCPUs increases, the amount of CPU time each vCPU gets is reduced when competing with other processes on the node or with other Virtual Machine Instances that have fewer vCPUs.
The `cpuAllocationRatio` normalizes the amount of CPU time the Pod requests based on the number of vCPUs: Pod CPU request = number of vCPUs * 1/cpuAllocationRatio. When `cpuAllocationRatio` is set to 1, the Pod requests one full CPU for each vCPU.
Note: In Kubernetes, one full core corresponds to 1000m of CPU time.
Administrators can change this ratio by updating the KubeVirt CR.
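A sketch of such an update is shown below. The CR name and namespace are assumptions based on a typical installation, and the ratio of 10 is only an example value.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt          # assumed CR name
  namespace: kubevirt     # assumed namespace
spec:
  configuration:
    developerConfiguration:
      # Pod CPU request = number of vCPUs * 1/cpuAllocationRatio.
      # With a ratio of 10, a VMI with 4 vCPUs requests
      # 4 * 1/10 = 0.4 CPU, i.e. 400m.
      cpuAllocationRatio: 10
```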