(🚧 Please see the Change Log for new additions/corrections. Please check on 8th Oct for the v3.1 release! 🏗)
Introduction
A Kubernetes Controller is a program that watches for lifecycle events on specific resources and triggers one or more reconcile functions in response. A reconcile function is called with the Namespace and Name of an object corresponding to the resource and its job is to make the object Status match the declared state in the object Spec.
The Machine Controller Manager (MCM) is a group of cooperative controllers that manage the lifecycle of worker machines, machine classes, machine sets and machine deployments. All of these objects are custom resources.
- A worker Machine is a provider-specific VM/instance that corresponds to a k8s Node. (k8s doesn't bring up nodes on its own; the MCM does so by using cloud provider APIs, abstracted by the Driver facade, to bring up machines and map them to nodes.)
- A MachineClass represents a template that contains cloud provider specific details used to create machines.
- A MachineSet ensures that the specified number of `Machine` replicas are running at a given point of time. Analogous to k8s ReplicaSets.
- A MachineDeployment provides a declarative update for `MachineSets` and `Machines`. Analogous to k8s Deployments.
All the custom resources (`Machine*` objects) mentioned above are stored in the K8s control cluster. The nodes corresponding to the machines are created and registered in the target cluster.
For productive Gardener deployments, the control cluster is the cluster hosting the shoot's control plane. Since the MCM runs in the shoot's control plane, the kubeconfig for the control cluster is generally specified as the in-cluster config. The target cluster is the shoot cluster itself, and hence the target cluster config is the shoot kubeconfig.
Project Structure
```mermaid
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TB
    subgraph MCM
        mcm["machine-controller-manager (Common MC Code, MachineSet, MachineDeploy controllers)"]
        mcmlo["machine-controller-manager-provider-local (Machine Controller Local atop K8s Kind)"]
        mcmaws["machine-controller-manager-provider-aws (Machine Controller for AWS)"]
        mcmazure["machine-controller-manager-provider-azure (Machine Controller for Azure)"]
        mcmgcp["machine-controller-manager-provider-gcp (Machine Controller for GCP)"]
        mcmx["machine-controller-manager-provider-X (Machine Controller for equinox/openstack/etc)"]
    end
    mcmlo--uses-->mcm
    mcmaws--uses-->mcm
    mcmazure--uses-->mcm
    mcmgcp--uses-->mcm
    mcmx-->mcm
```
The MCM project is divided into:
- The MCM Module. This contains:
  - The MCM Controller Type and MCM Controller Factory Method. The `MCM Controller` is responsible for reconciling the `MachineDeployment` and `MachineSet` custom resources.
  - MCM Main, which creates and starts the MCM Controller.
  - The MC Controller Type and MC Controller Factory Method. The `MC Controller` implements the reconciliation loop for `MachineClass` and `Machine` objects but delegates creation/update/deletion/status-retrieval of machines to the `Driver` facade.
  - The `Driver` facade that abstracts away the lifecycle operations on machines and obtaining machine status.
  - Utility code leveraged by provider modules.
- The provider-specific modules, named `machine-controller-manager-provider-<providerName>`.
  - Each contains a main file located at `cmd/machine-controller/main.go` that instantiates a `Driver` implementation (Ex: AWSDriver) and then creates and starts an `MC Controller` using the MC Controller Factory Method, passing the `Driver` implementation. In other words, each provider module starts its own independent machine controller.
  - See the MCM README for the list of provider modules.
The MCM leverages the old-school technique of writing controllers directly using client-go. Skeleton code for client types is generated using client-gen. A barebones example is illustrated in the sample controller.
The modern way of writing controllers is by leveraging the Controller Runtime and generating skeletal code for custom controllers using the Kubebuilder Tool.
The MCM has a planned backlog to port the project to the controller runtime. The details of this will be documented in a separate proposal. (TODO: link me in future).
This book describes the current design of the MCM in order to aid code comprehension for development, enhancement and migration/port activities.
Deployment Structure
The MCM pods are part of the `Deployment` named `machine-controller-manager` that resides in the shoot control plane. After logging into the shoot control plane (use `gardenctl`), you can view the deployment details using `k get deploy machine-controller-manager -o yaml`. The MCM deployment has two containers:
1. `machine-controller-manager-provider-<provider>`. Ex: `machine-controller-manager-provider-aws`. This container name is a bit misleading as it starts the provider-specific machine controller main program responsible for reconciling machine classes and machines. See Machine Controller. (Ideally the `-manager` suffix should have been removed.)
Container command configured on AWS:
./machine-controller
--control-kubeconfig=inClusterConfig
--machine-creation-timeout=20m
--machine-drain-timeout=2h
--machine-health-timeout=10m
--namespace=shoot--i034796--tre
--port=10259
--target-kubeconfig=/var/run/secrets/gardener.cloud/shoot/generic-kubeconfig/kubeconfig
2. `<provider>-machine-controller-manager`. Ex: `aws-machine-controller-manager`. This container name is a bit misleading as it starts the machine deployment controller main program responsible for reconciling machine deployments and machine sets. (See: TODO: link me). Ideally it should have been called simply `machine-deployment-controller` as it is provider independent.
Container command configured on AWS
./machine-controller-manager
--control-kubeconfig=inClusterConfig
--delete-migrated-machine-class=true
--machine-safety-apiserver-statuscheck-timeout=30s
--machine-safety-apiserver-statuscheck-period=1m
--machine-safety-orphan-vms-period=30m
--machine-safety-overshooting-period=1m
--namespace=shoot--i034796--tre
--port=10258
--safety-up=2
--safety-down=1
--target-kubeconfig=/var/run/secrets/gardener.cloud/shoot/generic-kubeconfig/kubeconfig
Local Development Tips
First read Local Dev MCM
Running MCM Locally
After setting up a shoot cluster in the dev landscape, you can run your local copy of MCM and MC to manage machines in the shoot cluster.
Example for AWS Shoot Cluster:
- Checkout `https://github.com/gardener/machine-controller-manager` and `https://github.com/gardener/machine-controller-manager-provider-aws/`.
- `cd machine-controller-manager` and run `./hack/gardener_local_setup.sh --seed <seedManagingShoot> --shoot <shootName> --project <userId> --provider aws`
  - Ex: `./hack/gardener_local_setup.sh --seed aws-ha --shoot aw2 --project i034796 --provider aws`
  - The above will set the replica count of the `machine-controller-manager` deployment in the shoot control plane to 0 and also set the annotation `dependency-watchdog.gardener.cloud/ignore-scaling` to prevent DWD from scaling it back up. Now you can run your local dev copy.
- Inside the MCM directory, run `make start`.
  - The MCM controller should start without errors. The last line should look like:
    `I0920 14:11:28.615699 84778 deployment.go:433] Processing the machinedeployment "shoot--i034796--aw2-a-z1" (with replicas 1)`
- Change to the provider directory, e.g. `cd <checkoutPath>/machine-controller-manager-provider-aws`, and run `make start`.
  - The MC controller should start without errors. The last lines should look like:
    `I0920 14:14:37.793684 86169 core.go:482] List machines request has been processed successfully`
    `I0920 14:14:37.896720 86169 machine_safety.go:59] reconcileClusterMachineSafetyOrphanVMs: End, reSync-Period: 5m0s`
Change Log
- WIP Draft of Orphan/Safety
- WIP for machine set controller.
Kubernetes Client Facilities
This chapter describes the types and functions provided by k8s core and client modules that are leveraged by the MCM - it only covers what is required to understand MCM code and is simply meant to be a helpful review. References are provided for further reading.
K8s apimachinery
K8s objects
A K8s object represents a persistent entity. When using the K8s client-go framework to define such an object, one should follow these rules:
- A Go type representing an object must embed the k8s.io/apimachinery/pkg/apis/meta/v1.ObjectMeta struct. `ObjectMeta` is metadata that all persisted resources must have, which includes all objects users must create.
- A Go type representing a singular object must also embed k8s.io/apimachinery/pkg/apis/meta/v1.TypeMeta, which describes an individual object in an API response or request with strings representing the Kind of the object and its API schema version called APIVersion.
```mermaid
graph TB
    subgraph CustomType
        ObjectMeta
        TypeMeta
    end
```
- A Go type representing a list of a custom type must embed k8s.io/apimachinery/pkg/apis/meta/v1.ListMeta
```mermaid
graph TB
    subgraph CustomTypeList
        TypeMeta
        ListMeta
    end
```
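To make the embedding rules concrete, here is a minimal sketch of a hypothetical custom type and its list type following the conventions above. MyResource, MyResourceSpec and MyResourceStatus are illustrative names only, not MCM types.

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MyResource is a hypothetical custom object that follows the embedding rules above.
type MyResource struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MyResourceSpec   `json:"spec,omitempty"`
	Status MyResourceStatus `json:"status,omitempty"`
}

// MyResourceSpec holds the desired state; MyResourceStatus holds the observed state.
type MyResourceSpec struct {
	Replicas int32 `json:"replicas,omitempty"`
}

type MyResourceStatus struct {
	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
}

// MyResourceList is the corresponding list type: TypeMeta + ListMeta + Items.
type MyResourceList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`

	Items []MyResource `json:"items"`
}
```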
TypeMeta
type TypeMeta struct {
// Kind is a string value representing the REST resource this object represents.
Kind string
// APIVersion defines the versioned schema of this representation of an object.
APIVersion string
}
ObjectMeta
A snippet of the `ObjectMeta` struct fields is shown below for convenience, with the MCM-relevant fields that are used by controller code.
type ObjectMeta struct { //snippet
// Name must be unique within a namespace. Is required when creating resources,
Name string
// Namespace defines the space within which each name must be unique. An empty namespace is equivalent to the "default" namespace,
Namespace string
// An opaque value that represents the internal version of this object that can be used by clients to determine when objects have changed.
ResourceVersion string
// A sequence number representing a specific generation of the desired state.
Generation int64
// UID is the unique in time and space value for this object. It is typically generated by the API server on successful creation of a resource and is not allowed to change on PUT operations.
UID types.UID
// CreationTimestamp is a timestamp representing the server time when this object was created.
CreationTimestamp Time
// DeletionTimestamp is RFC 3339 date and time at which this resource will be deleted. This field is set by the server when a graceful deletion is requested by the user. The resource is expected to be deleted (no longer reachable via APIs) after the time in this field, once the finalizers list is empty.
DeletionTimestamp *Time
// Must be empty before the object is deleted from the registry by the API server. Each entry is an identifier for the responsible controller that will remove the entry from the list.
Finalizers []string
// Map of string keys and values that can be used to organize and categorize (scope and select) objects. Valid label keys have two segments: an optional prefix and name, separated by a slash (/). Meant to be meaningful and relevant to USERS.
Labels map[string]string
// Annotations is an unstructured key value map stored with a resource that may be set by controllers/tools to store and retrieve arbitrary metadata. Meant for TOOLS.
Annotations map[string]string
// References to owner objects. Ex: Pod belongs to its owning ReplicaSet. A Machine belongs to its owning MachineSet.
OwnerReferences []OwnerReference
// The name of the cluster which the object belongs to. This is used to distinguish resources with same name and namespace in different clusters.
ClusterName string
//... other fields omitted.
}
OwnerReferences
k8s.io/apimachinery/pkg/apis/meta/v1.OwnerReference is a struct that contains `TypeMeta` fields and a small subset of the `ObjectMeta` fields - enough to let you identify an owning object. An owning object must be in the same namespace as the dependent, or be cluster-scoped, so there is no namespace field.
type OwnerReference struct {
APIVersion string
Kind string
Name string
UID types.UID
//... other fields omitted. TODO: check for usages.
}
Finalizers and Deletion
Every k8s object has a `Finalizers []string` field that can be explicitly assigned by a controller. Every k8s object also has a `DeletionTimestamp *Time` that is set by the API Server when graceful deletion is requested.
These are part of the k8s.io./apimachinery/pkg/apis/meta/v1.ObjectMeta struct type which is embedded in all k8s objects.
When you tell Kubernetes to delete an object that has finalizers specified for it, the Kubernetes API marks the object for deletion by populating `.metadata.deletionTimestamp` aka `DeletionTimestamp`, and returns a `202` status code (HTTP `Accepted`). The target object remains in a terminating state while the control plane takes the actions defined by the finalizers. After these actions are complete, the controller should remove the relevant finalizers from the target object. When the `metadata.finalizers` field is empty, Kubernetes considers the deletion complete and deletes the object.
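A minimal sketch of how a controller typically works with finalizers, assuming a typed clientset and a hypothetical finalizer key; the MCM's own finalizer handling lives in its controller package and differs in detail:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// myFinalizer is a hypothetical finalizer key used only for illustration.
const myFinalizer = "example.io/my-finalizer"

// ensureFinalizer adds the finalizer to a Secret if missing, so the controller
// gets a chance to clean up before the object is actually removed.
func ensureFinalizer(ctx context.Context, cs kubernetes.Interface, secret *corev1.Secret) error {
	for _, f := range secret.Finalizers {
		if f == myFinalizer {
			return nil
		}
	}
	secret.Finalizers = append(secret.Finalizers, myFinalizer)
	_, err := cs.CoreV1().Secrets(secret.Namespace).Update(ctx, secret, metav1.UpdateOptions{})
	return err
}

// removeFinalizer is called once cleanup is done; when Finalizers is empty and
// DeletionTimestamp is set, the API server deletes the object.
func removeFinalizer(ctx context.Context, cs kubernetes.Interface, secret *corev1.Secret) error {
	var kept []string
	for _, f := range secret.Finalizers {
		if f != myFinalizer {
			kept = append(kept, f)
		}
	}
	secret.Finalizers = kept
	_, err := cs.CoreV1().Secrets(secret.Namespace).Update(ctx, secret, metav1.UpdateOptions{})
	return err
}
```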
Diff between Labels and Annotations
Labels are used in conjunction with selectors to identify groups of related resources and meant to be meaningful to users. Because selectors are used to query labels, this operation needs to be efficient. To ensure efficient queries, labels are constrained by RFC 1123. RFC 1123, among other constraints, restricts labels to a maximum 63 character length. Thus, labels should be used when you want Kubernetes to group a set of related resources. See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ on label key and value restrictions
Annotations are used for “non-identifying information”, i.e., metadata that Kubernetes does not care about. As such, annotation keys and values have far fewer constraints than labels and can include characters not permitted in labels.
RawExtension
k8s.io/apimachinery/pkg/runtime.RawExtension is used to hold extension objects whose structures can be arbitrary. An example of an MCM type that leverages this is the MachineClass type, whose `ProviderSpec` field is of type `runtime.RawExtension` and whose structure can vary according to the provider.
One can use a different custom structure type for each extension variant and then decode the field into an instance of that extension structure type using the standard Go json decoder.
Example:
var providerSpec *api.AWSProviderSpec
if err := json.Unmarshal(machineClass.ProviderSpec.Raw, &providerSpec); err != nil {
	// handle the decode error
}
API Errors
k8s.io/apimachinery/pkg/api/errors provides detailed error types and `IsXXX` helper methods for k8s API errors.
errors.IsNotFound
k8s.io/apimachinery/pkg/api/errors.IsNotFound returns true if the specified error was due to a k8s object not found. (error or wrapped error created by errors.NewNotFound)
errors.IsTooManyRequests
k8s.io/apimachinery/pkg/api/errors.IsTooManyRequests determines if err (or any wrapped error) is an error which indicates that there are too many requests that the server cannot handle.
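A small sketch of how these predicates are typically used after a client call; getNodeIfExists and its behaviour are illustrative, not MCM code:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getNodeIfExists returns (nil, nil) when the node has not been registered yet,
// which controllers typically treat as "retry later" rather than a hard failure.
func getNodeIfExists(ctx context.Context, cs kubernetes.Interface, name string) (*corev1.Node, error) {
	node, err := cs.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil, nil
	}
	if apierrors.IsTooManyRequests(err) {
		return nil, fmt.Errorf("API server overloaded, requeue and retry: %w", err)
	}
	return node, err
}
```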
API Machinery Utilities
wait.Until
k8s.io/apimachinery/pkg/util/wait.Until loops until the `stop` channel is closed, running `f` every given period.
func Until(f func(), period time.Duration, stopCh <-chan struct{})
wait.PollImmediate
k8s.io/apimachinery/pkg/util/wait.PollImmediate tries a condition func until it returns true, an error, or the timeout is reached.
func PollImmediate(interval, timeout time.Duration, condition ConditionFunc) error
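A short sketch of both helpers, assuming a hypothetical checkSomething readiness probe; note that PollImmediate checks the condition once before waiting for the first interval:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkSomething is a hypothetical readiness probe used only for illustration.
func checkSomething() bool { return true }

func main() {
	stopCh := make(chan struct{})

	// Run a heartbeat function every 10 seconds until stopCh is closed.
	go wait.Until(func() { fmt.Println("still alive") }, 10*time.Second, stopCh)

	// Poll a condition every 2 seconds (checking immediately first), giving up after 1 minute.
	err := wait.PollImmediate(2*time.Second, time.Minute, func() (bool, error) {
		return checkSomething(), nil
	})
	if err != nil {
		fmt.Println("condition never became true:", err)
	}
	close(stopCh)
}
```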
wait.Backoff
k8s.io/apimachinery/pkg/util/wait.Backoff holds parameters applied to a Backoff function. There are many retry functions in `client-go` and the MCM that take an instance of this struct as a parameter.
type Backoff struct {
Duration time.Duration
Factor float64
Jitter float64
Steps int
Cap time.Duration
}
- `Duration` is the initial backoff duration.
- `Duration` is multiplied by `Factor` for each subsequent iteration.
- `Jitter` is the random amount added to each duration (the duration is chosen between `Duration` and `Duration*(1+Jitter)`).
- `Steps` is the remaining number of iterations in which the duration may increase.
- `Cap` is the cap on the duration; the backoff duration may not exceed this value.
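A sketch of how such a Backoff could be fed to wait.ExponentialBackoff, with illustrative parameter values and a hypothetical tryOperation; the MCM's retry helpers follow the same pattern with their own parameters:

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// tryOperation is a hypothetical operation that may need a few attempts.
func tryOperation() bool { return true }

func main() {
	backoff := wait.Backoff{
		Duration: 1 * time.Second,  // initial wait
		Factor:   2.0,              // double the wait on each step
		Jitter:   0.1,              // add up to 10% random noise
		Steps:    5,                // give up after 5 attempts
		Cap:      30 * time.Second, // never wait longer than this
	}
	// ExponentialBackoff retries the condition with the delays described by backoff.
	_ = wait.ExponentialBackoff(backoff, func() (bool, error) {
		return tryOperation(), nil
	})
}
```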
Errors
k8s.io/apimachinery/pkg/util/errors.Aggregate represents an object that contains multiple errors.
Use k8s.io/apimachinery/pkg/util/errors.NewAggregate to construct the aggregate error from a slice of errors.
type Aggregate interface {
error
Errors() []error
Is(error) bool
}
func NewAggregate(errlist []error) Aggregate {//...}
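A small sketch of the typical usage pattern: collect per-item errors and aggregate them at the end. deleteAll and deleteFn are illustrative names, not MCM functions.

```go
package main

import (
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// deleteAll attempts every deletion and reports all failures at once instead of
// stopping at the first error; deleteFn is an illustrative callback.
func deleteAll(names []string, deleteFn func(string) error) error {
	var errs []error
	for _, n := range names {
		if err := deleteFn(n); err != nil {
			errs = append(errs, fmt.Errorf("deleting %s: %w", n, err))
		}
	}
	// NewAggregate returns nil for an empty slice, otherwise a combined error.
	return utilerrors.NewAggregate(errs)
}
```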
K8s API Core
The MCM leverages several types from https://pkg.go.dev/k8s.io/api/core/v1
Node
k8s.io/api/core/v1.Node represents a worker node in Kubernetes.
type Node struct {
metav1.TypeMeta
metav1.ObjectMeta
Spec NodeSpec
// Most recently observed status of the node.
Status NodeStatus
}
NodeSpec
k8s.io/api/core/v1.NodeSpec describes the attributes that a node is created with. Both Node and MachineSpec use this. A snippet of the MCM-relevant `NodeSpec` struct fields is shown below for convenience.
type NodeSpec struct {
// ID of the node assigned by the cloud provider in the format: <ProviderName>://<ProviderSpecificNodeID>
ProviderID string
// podCIDRs represents the IP ranges assigned to the node for usage by Pods on that node.
PodCIDRs []string
// Unschedulable controls node schedulability of new pods. By default, node is schedulable.
Unschedulable bool
// Taints represents the Node's Taints. (taint is opposite of affinity. allow a Node to repel pods as opposed to attracting them)
Taints []Taint
}
Node Taints
k8s.io/api/core/v1.Taint is a Kubernetes `Node` property that enables specific nodes to repel pods. Tolerations are a Kubernetes `Pod` property that overcome this and allow a pod to be scheduled on a node with a matching taint.
Instead of applying a label to a node, we apply a taint that tells the scheduler to repel Pods from this node if they do not match the taint. Only those Pods that have a toleration for the taint can be scheduled onto the node with that taint.
kubectl taint nodes <node name> <taint key>=<taint value>:<taint effect>
Example:
kubectl taint nodes node1 gpu=nvidia:NoSchedule
Users can specify any arbitrary string for the taint key and value. The taint effect defines how a tainted node reacts to a pod without an appropriate toleration. It must be one of the following effects:
- `NoSchedule`: The pod will not get scheduled to the node without a matching toleration.
- `NoExecute`: This will immediately evict all pods without a matching toleration from the node.
- `PreferNoSchedule`: This is a softer version of NoSchedule where the scheduler tries not to schedule a pod onto the tainted node; however, it is not a strict requirement.
type Taint struct {
// Key of taint to be applied to a node.
Key string
// Value of taint corresponding to the taint key.
Value string
// Effect represents the effect of the taint on pods
// that do not tolerate the taint.
// Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
Effect TaintEffect
// TimeAdded represents the time at which the taint was added.
// It is only written for NoExecute taints.
// +optional
TimeAdded *metav1.Time
}
Example of a PodSpec with toleration below:
apiVersion: v1
kind: Pod
metadata:
name: pod-1
labels:
security: s1
spec:
containers:
- name: bear
image: supergiantkir/animals:bear
tolerations:
- key: "gpu"
operator: "Equal"
value: "nvidia"
effect: "NoSchedule"
Example use case for a taint/toleration: if you have nodes with special hardware (e.g. GPUs), you want to repel Pods that do not need this hardware and attract Pods that do need it. This can be done by tainting the nodes that have the specialized hardware (e.g. `kubectl taint nodes nodename gpu=nvidia:NoSchedule`) and adding a corresponding toleration to Pods that must use this special hardware.
NodeStatus
See Node status
k8s.io/api/core/v1.NodeStatus represents the current status of a node and is an encapsulation illustrated below:
type NodeStatus struct {
Capacity ResourceList
// Allocatable represents the resources of a node that are available for scheduling. Defaults to Capacity.
Allocatable ResourceList
// Conditions is an array of current observed node conditions.
Conditions []NodeCondition
// List of addresses reachable to the node.Queried from cloud provider, if available.
Addresses []NodeAddress
// Set of ids/uuids to uniquely identify the node.
NodeInfo NodeSystemInfo
// List of container images on this node
Images []ContainerImage
// List of attachable volumes that are in use (mounted) by the node.
//UniqueVolumeName is just typedef for string
VolumesInUse []UniqueVolumeName
// List of volumes that are attached to the node.
VolumesAttached []AttachedVolume
}
Capacity
- Capacity: the fields in the capacity block indicate the total amount of resources that a Node has.
- Allocatable: indicates the amount of resources on a Node that is available to be consumed by normal Pods. Defaults to Capacity.
A Node `Capacity` is of type k8s.io/api/core/v1.ResourceList, which is effectively a set of (resource name, quantity) pairs.
type ResourceList map[ResourceName]resource.Quantity
`ResourceName`s can be cpu/memory/storage, among others:
const (
// CPU, in cores. (500m = .5 cores)
ResourceCPU ResourceName = "cpu"
// Memory, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
ResourceMemory ResourceName = "memory"
// Volume size, in bytes (e,g. 5Gi = 5GiB = 5 * 1024 * 1024 * 1024)
ResourceStorage ResourceName = "storage"
// Local ephemeral storage, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
// The resource name for ResourceEphemeralStorage is alpha and it can change across releases.
ResourceEphemeralStorage ResourceName = "ephemeral-storage"
)
A k8s.io/apimachinery/pkg/api/resource.Quantity is a serializable/de-serializable number with an SI unit.
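A small sketch of constructing and reading a ResourceList with resource.MustParse; the values are illustrative:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// A capacity block equivalent to: cpu: "4", memory: 61Gi
	capacity := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("4"),
		corev1.ResourceMemory: resource.MustParse("61Gi"),
	}
	cpu := capacity[corev1.ResourceCPU]
	fmt.Println("cores:", cpu.Value()) // prints: cores: 4
}
```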
Conditions
Conditions are valid conditions of nodes.
https://pkg.go.dev/k8s.io/api/core/v1.NodeCondition contains condition information for a node.
type NodeCondition struct {
// Type of node condition.
Type NodeConditionType
// Status of the condition, one of True, False, Unknown.
Status ConditionStatus
// Last time we got an update on a given condition.
LastHeartbeatTime metav1.Time
// Last time the condition transitioned from one status to another.
LastTransitionTime metav1.Time
// (brief) reason for the condition's last transition.
Reason string
// Human readable message indicating details about last transition.
Message string
}
NodeConditionType
NodeConditionType
is one of the following:
const (
// NodeReady means kubelet is healthy and ready to accept pods.
NodeReady NodeConditionType = "Ready"
// NodeMemoryPressure means the kubelet is under pressure due to insufficient available memory.
NodeMemoryPressure NodeConditionType = "MemoryPressure"
// NodeDiskPressure means the kubelet is under pressure due to insufficient available disk.
NodeDiskPressure NodeConditionType = "DiskPressure"
// NodePIDPressure means the kubelet is under pressure due to insufficient available PID.
NodePIDPressure NodeConditionType = "PIDPressure"
// NodeNetworkUnavailable means that network for the node is not correctly configured.
NodeNetworkUnavailable NodeConditionType = "NetworkUnavailable"
)
Note: The MCM extends the above with further custom node condition types of its own. Not sure if this is correct - could break later if k8s enforces some validation ?
ConditionStatus
These are valid condition statuses. ConditionTrue
means a resource is in the condition. ConditionFalse
means a resource is not in the condition. ConditionUnknown
means kubernetes can't decide if a resource is in the condition or not.
type ConditionStatus string
const (
ConditionTrue ConditionStatus = "True"
ConditionFalse ConditionStatus = "False"
ConditionUnknown ConditionStatus = "Unknown"
)
Addresses
See Node Addresses
k8s.io/api/core/v1.NodeAddress contains information for the node's address.
type NodeAddress struct {
// Node address type, one of Hostname, ExternalIP or InternalIP.
Type NodeAddressType
// The node address string.
Address string
}
NodeSystemInfo
Describes general information about the node, such as machine id, kernel version, Kubernetes version (kubelet and kube-proxy version), container runtime details, and which operating system the node uses. The kubelet gathers this information from the node and publishes it into the Kubernetes API.
type NodeSystemInfo struct {
// MachineID reported by the node. For unique machine identification
// in the cluster this field is preferred.
MachineID string
// Kernel Version reported by the node from 'uname -r' (e.g. 3.16.0-0.bpo.4-amd64).
KernelVersion string
// OS Image reported by the node from /etc/os-release (e.g. Debian GNU/Linux 7 (wheezy)).
OSImage string
// ContainerRuntime Version reported by the node through runtime remote API (e.g. docker://1.5.0).
ContainerRuntimeVersion string
// Kubelet Version reported by the node.
KubeletVersion string
// KubeProxy Version reported by the node.
KubeProxyVersion string
// The Operating System reported by the node
OperatingSystem string
// The Architecture reported by the node
Architecture string
}
- The MachineID is a single newline-terminated, hexadecimal, 32-character, lowercase ID read from `/etc/machine-id`.
Images
A slice of k8s.io/api/core/v1.ContainerImage, which describes a container image.
type ContainerImage struct {
Names []string
SizeBytes int64
}
- `Names` is the names by which this image is known, e.g. `["kubernetes.example/hyperkube:v1.0.7", "cloud-vendor.registry.example/cloud-vendor/hyperkube:v1.0.7"]`.
- `SizeBytes` is the size of the image in bytes.
Attached Volumes
k8s.io/api/core/v1.AttachedVolume describes a volume attached to a node.
type UniqueVolumeName string
type AttachedVolume struct {
Name UniqueVolumeName
DevicePath string
}
- `Name` is the name of the attached volume. `UniqueVolumeName` is just a typedef for a Go string.
- `DevicePath` represents the device path where the volume should be available.
Pod
The k8s.io/api/core/v1.Pod struct represents a k8s Pod, which is a collection of containers that can run on a host. This resource is created by clients and scheduled onto hosts.
type Pod struct {
metav1.TypeMeta
metav1.ObjectMeta
// Specification of the desired behavior of the pod.
Spec PodSpec
// Most recently observed status of the pod.
Status PodStatus
}
See k8s.io/api/core/v1.PodSpec. Each `PodSpec` has a priority value where the higher the value, the higher the priority of the Pod. The `PodSpec.Volumes` slice is the list of volumes that can be mounted by containers belonging to the pod and is relevant to MC code that attaches/detaches volumes.
type PodSpec struct {
Volumes []Volume
//TODO: describe other PodSpec fields used by MC
}
Pod Eviction
A k8s.io/api/policy/v1.Eviction can be used to evict a Pod from its Node - eviction is the graceful termination of Pods on nodes. See API Eviction.
type Eviction struct {
metav1.TypeMeta
// ObjectMeta describes the pod that is being evicted.
metav1.ObjectMeta
// DeleteOptions may be provided
DeleteOptions *metav1.DeleteOptions
}
Construct the `ObjectMeta` using the Pod name and namespace, then use an instance of the typed Kubernetes Client Interface to get the PolicyV1Interface, obtain the EvictionInterface for the namespace and invoke the Evict method.
Example
client.PolicyV1().Evictions(eviction.Namespace).Evict(ctx, eviction)
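A slightly fuller sketch of the same call, assuming a typed clientset; the grace period handling is illustrative and not necessarily how the MC drain logic sets it:

```go
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPod asks the API server to gracefully evict the given pod; the request
// is rejected if it would violate a matching PodDisruptionBudget.
func evictPod(ctx context.Context, cs kubernetes.Interface, namespace, podName string, gracePeriodSeconds int64) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      podName,
			Namespace: namespace,
		},
		DeleteOptions: &metav1.DeleteOptions{
			GracePeriodSeconds: &gracePeriodSeconds,
		},
	}
	return cs.PolicyV1().Evictions(namespace).Evict(ctx, eviction)
}
```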
Pod Disruption Budget
A k8s.io/api/policy/v1.PodDisruptionBudget is a struct type that represents a Pod Disruption Budget, which is the maximum disruption that can be caused to a collection of pods.
type PodDisruptionBudget struct {
metav1.TypeMeta
metav1.ObjectMeta
// Specification of the desired behavior of the PodDisruptionBudget.
Spec PodDisruptionBudgetSpec
// Most recently observed status of the PodDisruptionBudget.
Status PodDisruptionBudgetStatus
}
PodDisruptionBudgetSpec
type PodDisruptionBudgetSpec struct {
MinAvailable *intstr.IntOrString
Selector *metav1.LabelSelector
MaxUnavailable *intstr.IntOrString
}
- `Selector` specifies a label query over pods whose evictions are managed by the disruption budget. A null selector will match no pods, while an empty ({}) selector will select all pods within the namespace.
- An eviction is allowed if at least `MinAvailable` pods selected by `Selector` will still be available after the eviction, i.e. even in the absence of the evicted pod. So, for example, you can prevent all voluntary evictions by specifying "100%".
- An eviction is allowed if at most `MaxUnavailable` pods selected by `Selector` are unavailable after the eviction, i.e. even in the absence of the evicted pod. For example, one can prevent all voluntary evictions by specifying 0.
- `MinAvailable` is a mutually exclusive setting with `MaxUnavailable`.
PodDisruptionBudgetStatus
See godoc for k8s.io/api/policy/v1.PodDisruptionBudgetStatus
Pod Volumes
A k8s.io/api/core/v1.Volume represents a named volume in a pod - which is a directory that may be accessed by any container in the pod. See Pod Volumes
A `Volume` has a `Name` and embeds a `VolumeSource` as shown below. A `VolumeSource` represents the location and type of the mounted volume.
type Volume struct {
Name string
VolumeSource
}
`VolumeSource`, which represents the source of a volume to mount, should have only ONE of its fields populated. The MC uses the `PersistentVolumeClaim` field, which is a pointer to a `PersistentVolumeClaimVolumeSource` representing a reference to a PersistentVolumeClaim in the same namespace.
type VolumeSource struct {
PersistentVolumeClaim *PersistentVolumeClaimVolumeSource
}
type PersistentVolumeClaimVolumeSource struct {
ClaimName string
ReadOnly bool
}
PersistentVolume
A k8s.io/api/core/v1.PersistentVolume (PV) represents a piece of storage in the cluster. See K8s Persistent Volumes
PersistentVolumeClaim
A k8s.io/api/core/v1.PersistentVolumeClaim represents a user's request for and claim to a persistent volume.
type PersistentVolumeClaim struct {
metav1.TypeMeta
metav1.ObjectMeta
Spec PersistentVolumeClaimSpec
Status PersistentVolumeClaimStatus
}
type PersistentVolumeClaimSpec struct {
StorageClassName *string
//...
VolumeName string
//...
}
Note that `PersistentVolumeClaimSpec.VolumeName` is of interest to the MC: it represents the binding reference to the `PersistentVolume` backing this claim. Please note that this is different from `Pod.Spec.Volumes[*].Name`, which is more like a label for the volume directory.
Secret
A k8s.io/api/core/v1.Secret holds secret data of a secret type whose size is < 1MB. See K8s Secrets.
- Secret data is in `Secret.Data`, which is a `map[string][]byte` where the byte slice is the secret value and the key is simple ASCII alphanumeric.
type Secret struct {
metav1.TypeMeta
metav1.ObjectMeta
Data map[string][]byte
Type SecretType
//... omitted for brevity
}
`SecretType` can be one of many types: `SecretTypeOpaque`, which represents user-defined secrets, `SecretTypeServiceAccountToken`, which contains a token that identifies a service account to the API, etc.
client-go
K8s clients have the type k8s.io/client-go/kubernetes.ClientSet, which is actually a high-level client set facade encapsulating clients for the `core`, `appsv1`, `discoveryv1`, `eventsv1`, `networkingv1`, `nodev1`, `policyv1` and `storagev1` API groups. These individual clients are available via accessor methods, i.e. use `clientset.AppsV1()` to get the AppsV1 client.
// Clientset contains the clients for groups. Each group has exactly one
// version included in a Clientset.
type Clientset struct {
appsV1 *appsv1.AppsV1Client
coreV1 *corev1.CoreV1Client
discoveryV1 *discoveryv1.DiscoveryV1Client
eventsV1 *eventsv1.EventsV1Client
// ...
// AppsV1 retrieves the AppsV1Client
func (c *Clientset) AppsV1() appsv1.AppsV1Interface {
return c.appsV1
}
// ...
}
As can be noticed from the above snippet, each of these clients associated with API groups exposes an interface named GroupVersionInterface that in turn provides further access to a generic REST Interface as well as access to a typed interface containing getter/setter methods for objects within that API group.
For example, EventsV1Client, which is used to interact with features provided by the `events.k8s.io` group, implements `EventsV1Interface`:
type EventsV1Interface interface {
RESTClient() rest.Interface // generic REST API access
EventsGetter // typed interface access
}
// EventsGetter has a method to return a EventInterface.
type EventsGetter interface {
Events(namespace string) EventInterface
}
The `ClientSet` struct implements the kubernetes.Interface facade, which is a high-level facade containing all the methods to access the individual GroupVersionInterface facade clients.
type Interface interface {
Discovery() discovery.DiscoveryInterface
AppsV1() appsv1.AppsV1Interface
CoreV1() corev1.CoreV1Interface
EventsV1() eventsv1.EventsV1Interface
NodeV1() nodev1.NodeV1Interface
// ... other facade accessor
}
One can generate k8s clients for custom k8s objects that follow the same pattern as core k8s objects. The details are not covered here. Please refer to How to generate client codes for Kubernetes Custom Resource Definitions, gengo, k8s.io code-generator and the slightly-old article: Kubernetes Deep Dive: Code Generation for CustomResources.
client-go Shared Informers
The vital role of a Kubernetes controller is to watch objects for the desired state and the actual state, then send instructions to make the actual state more like the desired state. The controller thus first needs to retrieve the object's information. Instead of making direct API calls using k8s listers/watchers, client-go controllers should use `SharedInformer`s.
cache.SharedInformer is a primitive exposed by the `client-go` lib that maintains a local cache of k8s objects of a particular API group and kind/resource (restrictable by namespace/label/field selectors), which is linked to the authoritative state of the corresponding objects in the API server.
Informers are used to reduce the load-pressure on the API Server and etcd.
All that is needed to be known at this point is that Informers internally watch for k8s object changes, update an internal indexed store and invoke registered event handlers. Client code must construct event handlers to inject the logic that one would like to execute when an object is Added/Updated/Deleted.
type SharedInformer interface {
// AddEventHandler adds an event handler to the shared informer using the shared informer's resync period. Events to a single handler are delivered sequentially, but there is no coordination between different handlers.
AddEventHandler(handler cache.ResourceEventHandler)
// HasSynced returns true if the shared informer's store has been
// informed by at least one full LIST of the authoritative state
// of the informer's object collection. This is unrelated to "resync".
HasSynced() bool
// Run starts and runs the shared informer, returning after it stops.
// The informer will be stopped when stopCh is closed.
Run(stopCh <-chan struct{})
//..
}
Note: the resync period tells the informer to rebuild its cache every time the period expires.
cache.ResourceEventHandler handle notifications for events that happen to a resource.
type ResourceEventHandler interface {
OnAdd(obj interface{})
OnUpdate(oldObj, newObj interface{})
OnDelete(obj interface{})
}
cache.ResourceEventHandlerFuncs is an adapter to let you easily specify as many or as few of the notification functions as you want while still implementing `ResourceEventHandler`. Nearly all controller code uses instances of this adapter struct to create event handlers to register on shared informers.
type ResourceEventHandlerFuncs struct {
AddFunc func(obj interface{})
UpdateFunc func(oldObj, newObj interface{})
DeleteFunc func(obj interface{})
}
Shared informers for standard k8s objects can be obtained using the k8s.io/client-go/informers.NewSharedInformerFactory or one of the variant factory methods in the same package.
Informers and their factory functions for custom k8s objects are usually found in the generated factory code, usually in a `factory.go` file.
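A minimal sketch of the informer workflow for a standard type (core/v1 Nodes), with placeholder event handlers; the MCM's machine informers follow the same pattern via their generated factory:

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func runNodeInformer(cs kubernetes.Interface, stopCh <-chan struct{}) {
	// Factory with a 12h resync period; informers created from it share caches.
	factory := informers.NewSharedInformerFactory(cs, 12*time.Hour)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			node := obj.(*corev1.Node)
			_ = node // enqueue the node key for reconciliation here
		},
		UpdateFunc: func(oldObj, newObj interface{}) { /* compare old/new and enqueue */ },
		DeleteFunc: func(obj interface{}) { /* enqueue for cleanup */ },
	})

	factory.Start(stopCh)
	// Wait until the initial LIST has been processed before reconciling.
	cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced)
}
```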
client-go workqueues
The basic workqueue.Interface has the following methods:
type Interface interface {
Add(item interface{})
Len() int
Get() (item interface{}, shutdown bool)
Done(item interface{})
ShutDown()
ShutDownWithDrain()
ShuttingDown() bool
}
This is extended with the ability to add an item at a later time using the workqueue.DelayingInterface.
type DelayingInterface interface {
Interface
// AddAfter adds an item to the workqueue after the indicated duration has passed
// Used to requeue items after failures to avoid ending in a hot-loop
AddAfter(item interface{}, duration time.Duration)
}
This is further extended with rate limiting using workqueue.RateLimiter
type RateLimiter interface {
// When gets an item and gets to decide how long that item should wait
When(item interface{}) time.Duration
// Forget indicates that an item is finished being retried. Doesn't matter whether its for perm failing
// or for success, we'll stop tracking it
Forget(item interface{})
// NumRequeues returns back how many failures the item has had
NumRequeues(item interface{}) int
}
client-go controller steps
The basic high-level contract for a k8s-client controller leveraging work-queues goes like the below:
- Create rate-limited work queue(s) created using workqueue.NewNamedRateLimitingQueue
- Define lifecycle callback functions (Add/Update/Delete) which accept k8s objects and enqueue k8s object keys (namespace/name) on these rate-limited work queue(s).
- Create informers using the shared informer factory functions.
- Add event handlers to the informers specifying these callback functions.
- When informers are started, they will invoke the appropriate registered callbacks when k8s objects are added/updated/deleted.
- The controller `Run` loop then picks up objects from the work queue using `Get` and reconciles them by invoking the appropriate reconcile function, ensuring that `Done` is called after reconcile to mark the item as done processing.
Example:
```mermaid
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD
    CreateWorkQueue["machineQueue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), 'machine')"]
    CreateInformer["machineInformerFactory := externalversions.NewSharedInformerFactory(...) machineInformer := machineInformerFactory.Machine().V1alpha1().Machines()"]
    AddEventHandler["machineInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: controller.machineToMachineClassAdd, UpdateFunc: controller.machineToMachineClassUpdate, DeleteFunc: controller.machineToMachineClassDelete, })"]
    Run["while (!shutdown) { key, shutdown := queue.Get() defer queue.Done(key) reconcile(key.(string)) }"]
    Z(("End"))
    CreateWorkQueue --> CreateInformer --> AddEventHandler --> Run --> Z
```
A more elaborate example of the basic client-go controller flow is demonstrated in the client-go workqueue example.
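A compact sketch of the worker loop described above, assuming the queue was created with `workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "machine")` and that keys are namespace/name strings; the Forget/AddRateLimited calls show where the rate limiter's failure tracking is reset or advanced:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// worker drains the queue until shutdown, retrying failed keys with rate limiting.
func worker(queue workqueue.RateLimitingInterface, reconcile func(key string) error) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		func() {
			defer queue.Done(item) // always mark the item as processed
			key := item.(string)
			if err := reconcile(key); err != nil {
				// Requeue with backoff; NumRequeues can be used to give up eventually.
				fmt.Printf("error reconciling %q (attempt %d): %v\n", key, queue.NumRequeues(item), err)
				queue.AddRateLimited(item)
				return
			}
			// Success: reset the rate limiter's failure history for this key.
			queue.Forget(item)
		}()
	}
}
```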
client-go utilities
k8s.io/client-go/util/retry.RetryOnConflict
func RetryOnConflict(backoff wait.Backoff, fn func() error) error
- retry.RetryOnConflict is used to make an update to a resource when you have to worry about conflicts caused by other code making unrelated updates to the resource at the same time.
- `fn` should fetch the resource to be modified, make appropriate changes to it, try to update it, and return (unmodified) the error from the update function.
- On a successful update, `RetryOnConflict` will return `nil`. If the update `fn` returns a Conflict error, `RetryOnConflict` will wait some amount of time as described by `backoff`, and then try again.
- On a non-Conflict error, or if it retries too many times (`backoff.Steps` has reached zero) and gives up, `RetryOnConflict` will return an error to the caller.
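A sketch of the typical usage, updating a Node with retry.DefaultRetry; cordonNode is an illustrative helper, not the MC's actual cordon code:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// cordonNode marks a node unschedulable, re-fetching and re-applying the change
// whenever a concurrent update causes a Conflict error.
func cordonNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		node.Spec.Unschedulable = true
		_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
		return err
	})
}
```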
References
- K8s API Conventions
- How To Call Kubernetes API using Go - Types and Common Machinery
- Node Taints and Tolerances
- Node status
- Making Sense of Taints and Tolerations
- CIDR
- CIDR Calculator
- k8s.io code-generator
- How to generate client codes for Kubernetes Custom Resource Definitions
- Kubernetes Deep Dive: Code Generation for CustomResources
MCM Facilities
This chapter describes the core types and utilities present in the MCM module and used by the controllers and drivers.
Machine Controller Core Types
The relevant machine types managed by the MCM controllers reside in github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1. This follows the standard location for client-gen types: `<module>/pkg/apis/<group>/<version>`.
Example: the `Machine` type is pkg/apis/machine/v1alpha1.Machine.
Machine
Machine is the representation of a physical or virtual machine that corresponds to a front-end k8s node object. An example YAML looks like the below
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
name: test-machine
namespace: default
spec:
class:
kind: MachineClass
name: test-class
A `Machine` has a `Spec` field represented by `MachineSpec`:
type Machine struct {
// ObjectMeta for machine object
metav1.ObjectMeta
// TypeMeta for machine object
metav1.TypeMeta
// Spec contains the specification of the machine
Spec MachineSpec
// Status contains fields depicting the status
Status MachineStatus
}
```mermaid
graph TB
    subgraph Machine
        ObjectMeta
        TypeMeta
        MachineSpec
        MachineStatus
    end
```
MachineSpec
MachineSpec represents the specification of a Machine.
type MachineSpec struct {
// Class is the referent to the MachineClass.
Class ClassSpec
// Unique identification of the VM at the cloud provider
ProviderID string
// NodeTemplateSpec describes the data a node should have when created from a template
NodeTemplateSpec NodeTemplateSpec
// Configuration for the machine-controller.
*MachineConfiguration
}
type NodeTemplateSpec struct { // wraps a NodeSpec with ObjectMeta.
metav1.ObjectMeta
// NodeSpec describes the attributes that a node is created with.
Spec corev1.NodeSpec
}
- `ProviderID` is the unique identification of the VM at the cloud provider. `ProviderID` typically matches the `Node.Spec.ProviderID` on the node object.
- The `Class` field is of type `ClassSpec`, which is just the (`Kind` and `Name`) reference to the `MachineClass`. (Ideally the field and type should have been called `ClassReference`, like `OwnerReference`.)
- `NodeTemplateSpec` describes the data a node should have when created from a template; it embeds `ObjectMeta` and holds a corev1.NodeSpec in its `Spec` field.
  - `Machine.Spec.NodeTemplateSpec.Spec` mirrors the k8s `Node.Spec`.
- The `MachineSpec` embeds a `MachineConfiguration`, which is just a configuration object that is a collection of timeouts, `MaxEvictRetries` and `NodeConditions`.
```mermaid
graph TB
    subgraph MachineSpec
        Class:ClassSpec
        ProviderID:string
        NodeTemplateSpec:NodeTemplateSpec
        MachineConfiguration
    end
```
type MachineConfiguration struct {
// MachineDrainTimeout is the timeout after which a machine is forcefully deleted.
MachineDrainTimeout *Duration
// MachineHealthTimeout is the timeout after which a machine is declared unhealthy/failed.
MachineHealthTimeout *Duration
// MachineCreationTimeout is the timeout after which machine creation is declared failed.
MachineCreationTimeout *Duration
// MaxEvictRetries is the number of retries that will be attempted while draining the node.
MaxEvictRetries *int32
// NodeConditions are the set of conditions if set to true for MachineHealthTimeOut, machine will be declared failed.
NodeConditions *string
}
MachineStatus
pkg/apis/machine/v1alpha1.MachineStatus represents the most recently observed status of Machine.
type MachineStatus struct {
// Conditions of this machine, same as NodeStatus.Conditions
Conditions []NodeCondition
// Last operation refers to the status of the last operation performed. NOTE: this is usually the NextOperation for reconcile!! Discuss!
LastOperation LastOperation
// Current status of the machine object
CurrentStatus CurrentStatus
// LastKnownState can store details of the last known state of the VM by the plugins.
// It can be used by future operation calls to determine current infrastructure state
LastKnownState string
}
Extended NodeConditionTypes
The MCM extends standard k8s NodeConditionType with several custom conditions. (TODO: isn't this hacky/error prone?)
- NodeTerminationCondition is defined as `NodeTerminationCondition v1.NodeConditionType = "Terminating"`
LastOperation
github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1.LastOperation represents the last operation performed on the object. It can sometimes mean the next operation to be performed on the machine if the machine operation state is `Processing`!
type LastOperation struct {
// Description of the operation
Description string
// Last update time of operation
LastUpdateTime Time
// State of operation (bad naming)
State MachineState
// Type of operation
Type MachineOperationType
}
MachineState (should be called MachineOperationState)
NOTE: BADLY NAMED: Should be called MachineOperationState
github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1.MachineState represents the current state of a machine operation and is one of `Processing`, `Failed` or `Successful`.
// MachineState is current state of the machine.
// BAD Name: Should be MachineOperationState
type MachineState string
// These are the valid (operation) states of machines.
const (
// MachineStateProcessing means there is an operation in progress on this machine
MachineStateProcessing MachineState = "Processing"
// MachineStateFailed means operation failed leading to machine status failure
MachineStateFailed MachineState = "Failed"
// MachineStateSuccessful means the last operation on the machine was successful
MachineStateSuccessful MachineState = "Successful"
)
MachineOperationType
github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1.MachineOperationType is a label for the operation performed on a machine object: `Create`/`Update`/`HealthCheck`/`Delete`.
type MachineOperationType string
const (
// MachineOperationCreate indicates that the operation is a create
MachineOperationCreate MachineOperationType = "Create"
// MachineOperationUpdate indicates that the operation is an update
MachineOperationUpdate MachineOperationType = "Update"
// MachineOperationHealthCheck indicates that the operation is a health check
MachineOperationHealthCheck MachineOperationType = "HealthCheck"
// MachineOperationDelete indicates that the operation is a delete
MachineOperationDelete MachineOperationType = "Delete"
)
CurrentStatus
github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1.CurrentStatus encapsulates information about the current status of a Machine.
type CurrentStatus struct {
// Phase refers to the Machine (Lifecycle) Phase
Phase MachinePhase
// TimeoutActive when set to true drives the machine controller
// to check whether machine failed the configured creation timeout or health check timeout and change machine phase to Failed.
TimeoutActive bool
// Last update time of current status
LastUpdateTime Time
}
MachinePhase
`MachinePhase` is a label for the life-cycle phase of a machine at a given time: `Unknown`, `Pending`, `Available`, `Running`, `Terminating`, `Failed`, `CrashLoopBackOff`.
type MachinePhase string
const (
// MachinePending means that the machine is being created
MachinePending MachinePhase = "Pending"
// MachineAvailable means that machine is present on provider but hasn't joined cluster yet
MachineAvailable MachinePhase = "Available"
// MachineRunning means node is ready and running successfully
MachineRunning MachinePhase = "Running"
// MachineTerminating means node is terminating
MachineTerminating MachinePhase = "Terminating"
// MachineUnknown indicates that the node is not ready at the moment
MachineUnknown MachinePhase = "Unknown"
// MachineFailed means operation failed leading to machine status failure
MachineFailed MachinePhase = "Failed"
// MachineCrashLoopBackOff means creation or deletion of the machine is failing.
MachineCrashLoopBackOff MachinePhase = "CrashLoopBackOff"
)
Finalizers
See K8s Finalizers. The MC defines the following finalizer keys
const (
MCMFinalizerName = "machine.sapcloud.io/machine-controller-manager"
MCFinalizerName = "machine.sapcloud.io/machine-controller"
)
- `MCMFinalizerName` is the finalizer used to tag dependencies before deletion.
- `MCFinalizerName` is the finalizer added on `Secret` objects.
MachineClass
A MachineClass is used to templatize and re-use provider configuration across multiple Machines, MachineSets or MachineDeployments.
Example MachineClass YAML for AWS provider
apiVersion: machine.sapcloud.io/v1alpha1
credentialsSecretRef:
name: cloudprovider
namespace: shoot--i544024--hana-test
kind: MachineClass
metadata:
creationTimestamp: "2022-10-20T13:08:07Z"
finalizers:
- machine.sapcloud.io/machine-controller-manager
generation: 1
labels:
failure-domain.beta.kubernetes.io/zone: eu-west-1c
name: shoot--i544024--hana-test-whana-z1-b0f23
namespace: shoot--i544024--hana-test
resourceVersion: "38424578"
uid: 656f863e-5061-420f-8710-96dcc9777be4
nodeTemplate:
capacity:
cpu: "4"
gpu: "1"
memory: 61Gi
instanceType: p2.xlarge
region: eu-west-1
zone: eu-west-1c
provider: AWS
providerSpec:
ami: ami-0c3484dbcde4c4d0c
blockDevices:
- ebs:
deleteOnTermination: true
encrypted: true
volumeSize: 50
volumeType: gp3
iam:
name: shoot--i544024--hana-test-nodes
keyName: shoot--i544024--hana-test-ssh-publickey
machineType: p2.xlarge
networkInterfaces:
- securityGroupIDs:
- sg-0445497aa49ddecb2
subnetID: subnet-04a338730d20ea601
region: eu-west-1
srcAndDstChecksEnabled: false
tags:
kubernetes.io/arch: amd64
kubernetes.io/cluster/shoot--i544024--hana-test: "1"
kubernetes.io/role/node: "1"
networking.gardener.cloud/node-local-dns-enabled: "true"
node.kubernetes.io/role: node
worker.garden.sapcloud.io/group: whana
worker.gardener.cloud/cri-name: containerd
worker.gardener.cloud/pool: whana
worker.gardener.cloud/system-components: "true"
secretRef:
name: shoot--i544024--hana-test-whana-z1-b0f23
namespace: shoot--i544024--hana-test
The type definition for `MachineClass` is shown below:
type MachineClass struct {
metav1.TypeMeta
metav1.ObjectMeta
ProviderSpec runtime.RawExtension `json:"providerSpec"`
SecretRef *corev1.SecretReference `json:"secretRef,omitempty"`
CredentialsSecretRef *corev1.SecretReference
Provider string
NodeTemplate *NodeTemplate
}
Notes
- As can be seen above, the provider-specific configuration to create a node is specified in `MachineClass.ProviderSpec`, and this is of the extensible custom type runtime.RawExtension. This permits instances of different structure types like the AWS aws/api/AWSProviderSpec or azure/apis.AzureProviderSpec to be held within a single type.
SecretRef and CredentialsSecretRef
Both these fields in the MachineClass are of type k8s.io/api/core/v1.SecretReference.
Generally, when there is some operation to be performed on the machine, the K8s secret objects are obtained using the k8s.io/client-go/listers/core/v1.SecretLister. The corresponding Secret.Data map fields are retrieved and then merged into a single map (see the sketch after the notes below). (Design TODO: Why do we do this and why can't we use just one secret?)
A snippet of an AWS MachineClass containing `CredentialsSecretRef` and `SecretRef`:
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineClass
credentialsSecretRef:
name: cloudprovider
namespace: shoot--i034796--tre
secretRef:
name: shoot--i034796--tre-worker-q3rb4-z1-30c4a
namespace: shoot--i034796--tre
//...
- `MachineClass.CredentialsSecretRef` is a reference to a secret that generally holds the cloud provider credentials. For the example above, the corresponding `Secret` holds the `accessKeyID` and `secretAccessKey`. See `k get secret cloudprovider -n shoot--i034796--tre -oyaml`:

  apiVersion: v1
  kind: Secret
  data:
    accessKeyID: QUtJQTZ...
    secretAccessKey: dm52MX...

- `MachineClass.SecretRef` points to a `Secret` whose `Secret.Data` contains a `userData` entry that has a lot of info:
  - The Bootstrap Token which is generated by the MCM. See Bootstrap Tokens Per Shoot Worker Machine.
  - The shoot API server address.
  - A kubeconfig to contact the shoot API server.
  - TODO: Discuss this. Why so much info in one secret field?
- `MachineClass.Provider` is specified as the combination of name and location of cloud-specific drivers. (Unsure if this is actually being followed, as for AWS & Azure this is set to simply `AWS` and `Azure` respectively.)
NodeTemplate
A NodeTemplate is shown below (the struct definition follows this list):
- `Capacity` is of type k8s.io/api/core/v1.ResourceList, which is effectively a map of (resource name, quantity) pairs. Similar to Node Capacity, it has keys like `cpu`, `gpu`, `memory`, `storage`, etc., with values represented by resource.Quantity.
- `InstanceType` is the provider-specific instance type of the node belonging to the nodeGroup. For AWS this would be the EC2 instance type, like `p2.xlarge`.
- `Region` is the provider-specific region name, like `eu-west-1` for AWS.
- `Zone` is the provider-specific availability zone, like `eu-west-1c` for AWS.
type NodeTemplate struct {
Capacity corev1.ResourceList
InstanceType string
Region string
Zone string
}
Example
nodeTemplate:
capacity:
cpu: "4"
gpu: "1"
memory: 61Gi
instanceType: p2.xlarge
region: eu-west-1
zone: eu-west-1c
MachineSet
A MachineSet is to a Machine what a ReplicaSet is to a Pod. A `MachineSet` ensures that the specified number of `Machines` are running at any given time. A `MachineSet` is rarely created directly. It is generally owned by its parent MachineDeployment, and its `ObjectMeta.OwnerReferences` slice has a reference to the parent deployment.
The `MachineSet` struct is defined as follows:
type MachineSet struct {
metav1.ObjectMeta
metav1.TypeMeta
Spec MachineSetSpec
Status MachineSetStatus
}
MachineSetSpec
MachineSetSpec is the specification of a MachineSet.
type MachineSetSpec struct {
Replicas int32
Selector *metav1.LabelSelector
MinReadySeconds int32
MachineClass ClassSpec
Template MachineTemplateSpec // Am I used ?
}
type ClassSpec struct {
APIGroup string
Kind string
Name string
}
type MachineTemplateSpec struct {
metav1.ObjectMeta
Spec MachineSpec
}
- `MachineSetSpec.Replicas` is the number of desired replicas.
- `MachineSetSpec.Selector` is a label query over machines that should match the replica count. See Label Selectors.
- `MachineSetSpec.MinReadySeconds` - TODO: unsure? Minimum number of seconds for which a newly created `Machine` should be ready for it to be considered as available? (guessing here - can't find the code using this). Usually specified as `500` (i.e. in the YAML).
- `MachineSetSpec.MachineClass` is an instance of type `ClassSpec`, which is a reference type to the MachineClass.
- `MachineSetSpec.Template` (TODO: Discuss whether needed) is an instance of `MachineTemplateSpec`, which is an encapsulation over MachineSpec. I don't see this used since `MachineSetSpec.MachineClass` is already a reference to a `MachineClass`, which by definition is a template.
Example MachineSet YAML
AWS MachineSet YAML:
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineSet
metadata:
annotations:
deployment.kubernetes.io/desired-replicas: "1"
deployment.kubernetes.io/max-replicas: "2"
deployment.kubernetes.io/revision: "1"
creationTimestamp: "2022-10-20T13:53:01Z"
finalizers:
- machine.sapcloud.io/machine-controller-manager
generation: 1
labels:
machine-template-hash: "2415498538"
name: shoot--i034796--tre-worker-q3rb4-z1
name: shoot--i034796--tre-worker-q3rb4-z1-68598
namespace: shoot--i034796--tre
ownerReferences:
- apiVersion: machine.sapcloud.io/v1alpha1
blockOwnerDeletion: true
controller: true
kind: MachineDeployment
name: shoot--i034796--tre-worker-q3rb4-z1
uid: f20cba25-2fdb-4315-9e38-d301f5f08459
resourceVersion: "38510891"
uid: b0f12abc-2d19-4a6e-a69d-4bf4aa12d02b
spec:
minReadySeconds: 500
replicas: 1
selector:
matchLabels:
machine-template-hash: "2415498538"
name: shoot--i034796--tre-worker-q3rb4-z1
template:
metadata:
creationTimestamp: null
labels:
machine-template-hash: "2415498538"
name: shoot--i034796--tre-worker-q3rb4-z1
spec:
class:
kind: MachineClass
name: shoot--i034796--tre-worker-q3rb4-z1-30c4a
nodeTemplate:
metadata:
creationTimestamp: null
labels:
kubernetes.io/arch: amd64
networking.gardener.cloud/node-local-dns-enabled: "true"
node.kubernetes.io/role: node
topology.ebs.csi.aws.com/zone: eu-west-1b
worker.garden.sapcloud.io/group: worker-q3rb4
worker.gardener.cloud/cri-name: containerd
worker.gardener.cloud/pool: worker-q3rb4
worker.gardener.cloud/system-components: "true"
spec: {}
status:
availableReplicas: 1
fullyLabeledReplicas: 1
lastOperation:
lastUpdateTime: "2022-10-20T13:55:23Z"
observedGeneration: 1
readyReplicas: 1
replicas: 1
MachineDeployment
A MachineDeployment is to a MachineSet what a Deployment is to a ReplicaSet.
A MachineDeployment
manages MachineSets and enables declarative updates for the machines in MachineSets.
The `MachineDeployment` struct is defined as follows:
type MachineDeployment struct {
metav1.TypeMeta
metav1.ObjectMeta
Spec MachineDeploymentSpec
// Most recently observed status of the MachineDeployment.
// +optional
Status MachineDeploymentStatus
}
MachineDeploymentSpec
MachineDeploymentSpec is the specification of the desired behavior of the MachineDeployment.
type MachineDeploymentSpec struct {
Replicas int32
Selector *metav1.LabelSelector
Template MachineTemplateSpec
MinReadySeconds int32
Paused bool
ProgressDeadlineSeconds *int32
Strategy MachineDeploymentStrategy
}
- `Replicas` is the number of desired machines.
- `Selector` is the label selector for machines. Usually a `name` label to select machines with a given name matching the deployment name.
- `Template` is of type MachineTemplateSpec, which is an encapsulation over MachineSpec.
- `MinReadySeconds` is the minimum number of seconds for which a newly created machine should be ready without any of its containers crashing, for it to be considered available.
- `Paused` indicates that the `MachineDeployment` is paused and will not be processed by the MCM controller.
- `Strategy` is the MachineDeploymentStrategy to use to replace existing machines with new ones.
- `ProgressDeadlineSeconds` is the maximum time in seconds for a MachineDeployment to make progress before it is considered to be failed. The MachineDeployment controller will continue to process failed MachineDeployments, and a condition with a ProgressDeadlineExceeded reason will be surfaced in the MachineDeployment status.
MachineDeploymentStrategy
MachineDeploymentStrategy describes how to replace existing machines with new ones.
type MachineDeploymentStrategy struct {
Type MachineDeploymentStrategyType
RollingUpdate *RollingUpdateMachineDeployment
}
type MachineDeploymentStrategyType string
const (
RecreateMachineDeploymentStrategyType MachineDeploymentStrategyType = "Recreate"
RollingUpdateMachineDeploymentStrategyType MachineDeploymentStrategyType = "RollingUpdate"
)
type RollingUpdateMachineDeployment struct {
MaxUnavailable *intstr.IntOrString
MaxSurge *intstr.IntOrString
}
- `Type` is one of the `MachineDeploymentStrategyType` constants:
  - `RecreateMachineDeploymentStrategyType` is the strategy to kill all existing machines before creating new ones.
  - `RollingUpdateMachineDeploymentStrategyType` is the strategy to replace the old machines by new ones using a rolling update, i.e. gradually scale down the old machines and scale up the new ones.
- `RollingUpdate` holds the rolling update config params represented by `RollingUpdateMachineDeployment`. This is analogous to k8s.io/api/apps/v1.RollingUpdateDeployment which also has `MaxUnavailable` and `MaxSurge`.
  - `RollingUpdateMachineDeployment.MaxUnavailable` is the maximum number of machines that can be unavailable during the update.
  - `RollingUpdateMachineDeployment.MaxSurge` is the maximum number of machines that can be scheduled above the desired number of machines.
Example MachineDeployment YAML
AWS MachineDeployment YAML
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineDeployment
metadata:
annotations:
deployment.kubernetes.io/revision: "1"
creationTimestamp: "2022-10-20T13:53:01Z"
finalizers:
- machine.sapcloud.io/machine-controller-manager
generation: 1
name: shoot--i034796--tre-worker-q3rb4-z1
namespace: shoot--i034796--tre
resourceVersion: "38510892"
uid: f20cba25-2fdb-4315-9e38-d301f5f08459
spec:
minReadySeconds: 500
replicas: 1
selector:
matchLabels:
name: shoot--i034796--tre-worker-q3rb4-z1
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
name: shoot--i034796--tre-worker-q3rb4-z1
spec:
class:
kind: MachineClass
name: shoot--i034796--tre-worker-q3rb4-z1-30c4a
nodeTemplate:
metadata:
creationTimestamp: null
labels:
kubernetes.io/arch: amd64
networking.gardener.cloud/node-local-dns-enabled: "true"
node.kubernetes.io/role: node
topology.ebs.csi.aws.com/zone: eu-west-1b
worker.garden.sapcloud.io/group: worker-q3rb4
worker.gardener.cloud/cri-name: containerd
worker.gardener.cloud/pool: worker-q3rb4
worker.gardener.cloud/system-components: "true"
spec: {}
status:
availableReplicas: 1
conditions:
- lastTransitionTime: "2022-10-20T13:55:23Z"
lastUpdateTime: "2022-10-20T13:55:23Z"
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
observedGeneration: 1
readyReplicas: 1
replicas: 1
updatedReplicas: 1
VolumeAttachment
A k8s.io/api/storage/v1.VolumeAttachment is a non-namespaced object that captures the intent to attach or detach the specified volume to/from the specified node (specified in VolumeAttachmentSpec.NodeName
).
type VolumeAttachment struct {
metav1.TypeMeta
metav1.ObjectMeta
// Specification of the desired attach/detach volume behavior.
// Populated by the Kubernetes system.
Spec VolumeAttachmentSpec
// Status of the VolumeAttachment request.
// Populated by the entity completing the attach or detach
// operation, i.e. the external-attacher.
Status VolumeAttachmentStatus
}
VolumeAttachmentSpec
A k8s.io/api/storage/v1.VolumeAttachmentSpec is the specification of a VolumeAttachment request.
type VolumeAttachmentSpec struct {
// Attacher indicates the name of the volume driver that MUST handle this
// request. Same as CSI Plugin name
Attacher string
// Source represents the volume that should be attached.
Source VolumeAttachmentSource
// The node that the volume should be attached to.
NodeName string
}
See Storage V1 Docs for further elaboration.
Utilities
NodeOps
nodeops.GetNodeCondition
machine-controller-manager/pkg/util/nodeops.GetNodeCondition gets the node condition matching the specified type.
func GetNodeCondition(ctx context.Context, c clientset.Interface, nodeName string, conditionType v1.NodeConditionType) (*v1.NodeCondition, error) {
node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
if err != nil {
return nil, err
}
return getNodeCondition(node, conditionType), nil
}
func getNodeCondition(node *v1.Node, conditionType v1.NodeConditionType) *v1.NodeCondition {
for _, cond := range node.Status.Conditions {
if cond.Type == conditionType {
return &cond
}
}
return nil
}
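For orientation, here is a minimal, hypothetical usage sketch of `GetNodeCondition`; the helper name, the client and the node name are assumptions for illustration and are not part of MCM.

```go
package nodeopsexample

import (
	"context"

	"github.com/gardener/machine-controller-manager/pkg/util/nodeops"
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// isNodeReady reports whether the named node currently has a Ready condition
// with status True. It simply wraps nodeops.GetNodeCondition shown above.
func isNodeReady(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	cond, err := nodeops.GetNodeCondition(ctx, client, nodeName, v1.NodeReady)
	if err != nil {
		return false, err
	}
	// cond is nil when the node does not (yet) report a Ready condition.
	return cond != nil && cond.Status == v1.ConditionTrue, nil
}
```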
nodeops.CloneAndAddCondition
machine-controller-manager/pkg/util/nodeops.CloneAndAddCondition adds a condition to the node condition slice. If a condition with this type already exists, it is replaced by the new condition, preserving the existing `LastTransitionTime` when the status and reason are unchanged.
func CloneAndAddCondition(conditions []v1.NodeCondition, condition v1.NodeCondition) []v1.NodeCondition {
if condition.Type == "" || condition.Status == "" {
return conditions
}
var newConditions []v1.NodeCondition
for _, existingCondition := range conditions {
if existingCondition.Type != condition.Type {
newConditions = append(newConditions, existingCondition)
} else {
// condition with this type already exists
			if existingCondition.Status == condition.Status &&
				existingCondition.Reason == condition.Reason {
// condition status and reason are the same, keep existing transition time
condition.LastTransitionTime = existingCondition.LastTransitionTime
}
}
}
newConditions = append(newConditions, condition)
return newConditions
}
Note: the existing condition of the same type is never copied into `newConditions` (only the new condition is appended at the end), so no duplicate is added even when the status and reason differ; in that case the new condition simply carries a fresh `LastTransitionTime`.
nodeops.AddOrUpdateConditionsOnNode
machine-controller-manager/pkg/util/nodeops.AddOrUpdateConditionsOnNode adds a condition to the node's status, retrying on conflict with backoff.
func AddOrUpdateConditionsOnNode(ctx context.Context, c clientset.Interface, nodeName string, condition v1.NodeCondition) error
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Init["backoff = wait.Backoff{ Steps: 5, Duration: 100 * time.Millisecond, Jitter: 1.0, }"] -->retry.RetryOnConflict["retry.RetryOnConflict(backoff, fn)"] -->GetNode["oldNode, err = c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})"] subgraph "fn" GetNode-->ChkIfErr{"err != nil"} ChkIfErr--Yes-->ReturnErr(("return err")) ChkIfErr--No-->InitNewNode["newNode := oldNode.DeepCopy()"] InitNewNode-->InitConditions["newNode.Status.Conditions:= CloneAndAddCondition(newNode.Status.Conditions, condition)"] -->UpdateNewNode["_, err := c.CoreV1().Nodes().UpdateStatus(ctx, newNodeClone, metav1.UpdateOptions{}"] -->ReturnErr end
MachineUtils
Operation Descriptions
machineutils
has a bunch of constants that are descriptions of machine operations that are set into machine.Status.LastOperation.Description
by the machine controller while performing reconciliation. It also has some reason phrases and related annotation constants.
const (
// GetVMStatus sets machine status to terminating and specifies next step as getting VMs
GetVMStatus = "Set machine status to termination. Now, getting VM Status"
// InitiateDrain specifies next step as initiate node drain
InitiateDrain = "Initiate node drain"
// InitiateVMDeletion specifies next step as initiate VM deletion
InitiateVMDeletion = "Initiate VM deletion"
// InitiateNodeDeletion specifies next step as node object deletion
InitiateNodeDeletion = "Initiate node object deletion"
// InitiateFinalizerRemoval specifies next step as machine finalizer removal
InitiateFinalizerRemoval = "Initiate machine object finalizer removal"
// LastAppliedALTAnnotation contains the last configuration of annotations,
// labels & taints applied on the node object
LastAppliedALTAnnotation = "node.machine.sapcloud.io/last-applied-anno-labels-taints"
// MachinePriority is the annotation used to specify priority
// associated with a machine while deleting it. The less its
// priority the more likely it is to be deleted first
// Default priority for a machine is set to 3
MachinePriority = "machinepriority.machine.sapcloud.io"
// MachineClassKind is used to identify the machineClassKind for generic machineClasses
MachineClassKind = "MachineClass"
// MigratedMachineClass annotation helps in identifying machineClasses who have been migrated by migration controller
MigratedMachineClass = "machine.sapcloud.io/migrated"
// NotManagedByMCM annotation helps in identifying the nodes which are not handled by MCM
NotManagedByMCM = "node.machine.sapcloud.io/not-managed-by-mcm"
// TriggerDeletionByMCM annotation on the node would trigger the deletion of the corresponding machine object in the control cluster
TriggerDeletionByMCM = "node.machine.sapcloud.io/trigger-deletion-by-mcm"
// NodeUnhealthy is a node termination reason for failed machines
NodeUnhealthy = "Unhealthy"
// NodeScaledDown is a node termination reason for healthy deleted machines
NodeScaledDown = "ScaleDown"
// NodeTerminationCondition describes nodes that are terminating
NodeTerminationCondition v1.NodeConditionType = "Terminating"
)
Retry Periods
These are standard retry periods that are internally used by the machine controllers to enqueue keys into the work queue after the specified duration, so that reconciliation can be retried once that duration has elapsed.
// RetryPeriod is an alias for specifying the retry period
type RetryPeriod time.Duration
// These are the valid values for RetryPeriod
const (
// ShortRetry tells the controller to retry after a short duration - 15 seconds
ShortRetry RetryPeriod = RetryPeriod(15 * time.Second)
// MediumRetry tells the controller to retry after a medium duration - 2 minutes
MediumRetry RetryPeriod = RetryPeriod(3 * time.Minute)
// LongRetry tells the controller to retry after a long duration - 10 minutes
LongRetry RetryPeriod = RetryPeriod(10 * time.Minute)
)
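As a hedged illustration of how these periods are typically consumed, the sketch below re-enqueues a key after a retry period via the work queue's `AddAfter`. The helper name and the standalone queue parameter are assumptions for illustration (in the real controller the queue is a field on the controller struct shown later), and the machineutils import path is assumed.

```go
package machineutilsexample

import (
	"time"

	"github.com/gardener/machine-controller-manager/pkg/util/provider/machineutils"
	"k8s.io/client-go/util/workqueue"
)

// requeueAfter converts a machineutils.RetryPeriod (an alias of time.Duration)
// and re-enqueues the given key after that duration has elapsed.
func requeueAfter(queue workqueue.RateLimitingInterface, key string, period machineutils.RetryPeriod) {
	queue.AddAfter(key, time.Duration(period))
}
```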
Misc
permits.PermitGiver
permits.PermitGiver
provides the ability to register, obtain, release and delete a number of permits for a given key. All operations are concurrency-safe.
type PermitGiver interface {
RegisterPermits(key string, numPermits int) //numPermits should be maxNumPermits
TryPermit(key string, timeout time.Duration) bool
ReleasePermit(key string)
DeletePermits(key string)
Close()
}
The implementation of github.com/gardener/machine-controller-manager/pkg/util/permits.PermitGiver maintains:
- A sync map of permit keys mapped to permits, where each permit is a structure comprising a buffered empty-struct (`struct{}`) channel with buffer size equalling the (max) number of permits.
  - `RegisterPermits(key, numPermits)` registers `numPermits` for the given `key`. A `permit` struct is initialized with the `permit.c` buffer size set to `numPermits`.
- Entries are deleted if not accessed for a configured time represented by `stalePermitKeyTimeout`. This is done by a 'janitor' go-routine associated with the permit giver instance.
- When attempting to get a permit using `TryPermit`, one writes an empty struct to the permit channel within a given timeout. If one can do so within the timeout, one has acquired the permit, else not.
  - `TryPermit(key, timeout)` attempts to get a permit for the given key by sending a `struct{}{}` instance to the buffered `permit.c` channel.
type permit struct {
// lastAcquiredPermitTime is time since last successful TryPermit or new RegisterPermits
lastAcquiredPermitTime time.Time
// c is initialized with buffer size N representing num permits
c chan struct{}
}
type permitGiver struct {
keyPermitsMap sync.Map // map of string keys to permit struct values
stopC chan struct{}
}
permits.NewPermitGiver
permits.NewPermitGiver
returns a new PermitGiver
func NewPermitGiver(stalePermitKeyTimeout time.Duration, janitorFrequency time.Duration) PermitGiver
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TB Begin((" ")) -->InitStopCh["stopC := make(chan struct{})"] -->InitPG[" pg := permitGiver{ keyPermitsMap: sync.Map{}, stopC: stopC, }"] -->LaunchCleanUpGoroutine["go cleanup()"] LaunchCleanUpGoroutine-->Return(("return &pg")) LaunchCleanUpGoroutine--->InitTicker subgraph cleanup InitTicker["ticker := time.NewTicker(janitorFrequency)"] -->caseReadStopCh{"<-stopC ?"} caseReadStopCh--Yes-->End(("End")) caseReadStopCh--No -->ReadTickerCh{"<-ticker.C ?"} ReadTickerCh--Yes-->CleanupStale[" pg.cleanupStalePermitEntries(stalePermitKeyTimeout) (Iterates keyPermitsMap, remove entries whose lastAcquiredPermitTime exceeds stalePermitKeyTimeout) "] ReadTickerCh--No-->caseReadStopCh end
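A minimal usage sketch of the interface described above, assuming illustrative timeouts and a made-up key (this is not MCM code, just the documented `PermitGiver` API exercised end to end):

```go
package permitsexample

import (
	"time"

	"github.com/gardener/machine-controller-manager/pkg/util/permits"
)

// withPermit serializes doWork per key: at most one holder at a time, and we
// give up if the permit cannot be acquired within five seconds.
func withPermit(doWork func()) {
	pg := permits.NewPermitGiver(10*time.Minute /*stalePermitKeyTimeout*/, 1*time.Minute /*janitorFrequency*/)
	defer pg.Close()

	const key = "shoot--foo--bar/worker-a" // hypothetical MachineDeployment key
	pg.RegisterPermits(key, 1)             // buffer size 1 => one concurrent holder

	if !pg.TryPermit(key, 5*time.Second) {
		return // could not acquire the permit in time; caller would retry later
	}
	defer pg.ReleasePermit(key)
	doWork()
}
```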
Main Server Structs
MCServer
machine-controller-manager/pkg/util/provider/app/options.MCServer
is the main server context object for the machine controller. It embeds options.MachineControllerConfiguration and has `ControlKubeconfig` and `TargetKubeconfig` string fields.
type MCServer struct {
options.MachineControllerConfiguration
ControlKubeconfig string
TargetKubeconfig string
}
The MCServer
is constructed and initialized using the pkg/util/provider/app/options.NewMCServer function which sets most of the default values for fields for the embedded struct.
MCServer Usage
Individual providers leverage the MCServer as follows:
s := options.NewMCServer()
driver := <providerSpecificDriverInitialization>
if err := app.Run(s, driver); err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
The MCServer is thus re-used across different providers.
MachineControllerConfiguration struct
An important struct that represents the machine controller configuration, supports deep-copying, and is embedded within the MCServer:
machine-controller-manager/pkg/util/provider/options.MachineControllerConfiguration
type MachineControllerConfiguration struct {
metav1.TypeMeta
// namespace in seed cluster in which controller would look for the resources.
Namespace string
// port is the port that the controller-manager's http service runs on.
Port int32
// address is the IP address to serve on (set to 0.0.0.0 for all interfaces).
Address string
// CloudProvider is the provider for cloud services.
CloudProvider string
// ConcurrentNodeSyncs is the number of node objects that are
// allowed to sync concurrently. Larger number = more responsive nodes,
// but more CPU (and network) load.
ConcurrentNodeSyncs int32
// enableProfiling enables profiling via web interface host:port/debug/pprof/
EnableProfiling bool
// enableContentionProfiling enables lock contention profiling, if enableProfiling is true.
EnableContentionProfiling bool
// contentType is contentType of requests sent to apiserver.
ContentType string
// kubeAPIQPS is the QPS to use while talking with kubernetes apiserver.
KubeAPIQPS float32
// kubeAPIBurst is the burst to use while talking with kubernetes apiserver.
KubeAPIBurst int32
// leaderElection defines the configuration of leader election client.
LeaderElection mcmoptions.LeaderElectionConfiguration
// How long to wait between starting controller managers
ControllerStartInterval metav1.Duration
// minResyncPeriod is the resync period in reflectors; will be random between
// minResyncPeriod and 2*minResyncPeriod.
MinResyncPeriod metav1.Duration
// SafetyOptions is the set of options to set to ensure safety of controller
SafetyOptions SafetyOptions
//NodeCondition is the string of known NodeConditions. If any of these NodeCondition is set for a timeout period, the machine will be declared failed and will replaced. Default is "KernelDeadlock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable"
NodeConditions string
//BootstrapTokenAuthExtraGroups is a comma-separated string of groups to set bootstrap token's "auth-extra-groups" field to.
BootstrapTokenAuthExtraGroups string
}
SafetyOptions
An important struct available as the `SafetyOptions` field in `MachineControllerConfiguration`, containing several timeouts, retry-limits, etc. Most of these fields are set via the pkg/util/provider/app/options.NewMCServer function.
pkg/util/provider/option.SafetyOptions
// SafetyOptions are used to configure the upper-limit and lower-limit
// while configuring freezing of machineSet objects
type SafetyOptions struct {
// Timeout (in durartion) used while creation of
// a machine before it is declared as failed
MachineCreationTimeout metav1.Duration
// Timeout (in durartion) used while health-check of
// a machine before it is declared as failed
MachineHealthTimeout metav1.Duration
// Maximum number of times evicts would be attempted on a pod for it is forcibly deleted
// during draining of a machine.
MaxEvictRetries int32
// Timeout (in duration) used while waiting for PV to detach
PvDetachTimeout metav1.Duration
// Timeout (in duration) used while waiting for PV to reattach on new node
PvReattachTimeout metav1.Duration
// Timeout (in duration) for which the APIServer can be down before
// declare the machine controller frozen by safety controller
MachineSafetyAPIServerStatusCheckTimeout metav1.Duration
// Period (in durartion) used to poll for orphan VMs
// by safety controller
MachineSafetyOrphanVMsPeriod metav1.Duration
// Period (in duration) used to poll for APIServer's health
// by safety controller
MachineSafetyAPIServerStatusCheckPeriod metav1.Duration
// APIserverInactiveStartTime to keep track of the
// start time of when the APIServers were not reachable
APIserverInactiveStartTime time.Time
// MachineControllerFrozen indicates if the machine controller
// is frozen due to Unreachable APIServers
MachineControllerFrozen bool
}
Controller Structs
Machine Controller Core Struct
The `controller` struct in package `controller` inside the go file machine-controller-manager/pkg/util/provider/machinecontroller.go (bad convention) is the concrete Machine Controller struct that holds state data for the MC and implements the classical controller `Run(workers int, stopCh <-chan struct{})` method.
The top level `MCServer.Run` method initializes this controller struct and calls its `Run` method.
package controller
type controller struct {
	namespace string // control cluster namespace
nodeConditions string // Default: "KernelDeadlock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable"
controlMachineClient machineapi.MachineV1alpha1Interface
controlCoreClient kubernetes.Interface
targetCoreClient kubernetes.Interface
targetKubernetesVersion *semver.Version
recorder record.EventRecorder
safetyOptions options.SafetyOptions
internalExternalScheme *runtime.Scheme
driver driver.Driver
volumeAttachmentHandler *drain.VolumeAttachmentHandler
// permitGiver store two things:
// - mutex per machinedeployment
// - lastAcquire time
// it is used to limit removal of `health timed out` machines
permitGiver permits.PermitGiver
// listers
pvcLister corelisters.PersistentVolumeClaimLister
pvLister corelisters.PersistentVolumeLister
secretLister corelisters.SecretLister
nodeLister corelisters.NodeLister
pdbV1beta1Lister policyv1beta1listers.PodDisruptionBudgetLister
pdbV1Lister policyv1listers.PodDisruptionBudgetLister
volumeAttachementLister storagelisters.VolumeAttachmentLister
machineClassLister machinelisters.MachineClassLister
machineLister machinelisters.MachineLister
// queues
// secretQueue = workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "secret"),
secretQueue workqueue.RateLimitingInterface
nodeQueue workqueue.RateLimitingInterface
machineClassQueue workqueue.RateLimitingInterface
machineQueue workqueue.RateLimitingInterface
machineSafetyOrphanVMsQueue workqueue.RateLimitingInterface
machineSafetyAPIServerQueue workqueue.RateLimitingInterface
// syncs
pvcSynced cache.InformerSynced
pvSynced cache.InformerSynced
secretSynced cache.InformerSynced
pdbV1Synced cache.InformerSynced
volumeAttachementSynced cache.InformerSynced
nodeSynced cache.InformerSynced
machineClassSynced cache.InformerSynced
machineSynced cache.InformerSynced
}
Driver
github.com/gardener/machine-controller-manager/pkg/util/provider/driver.Driver is the abstraction facade that decouples the machine controller from the cloud-provider specific machine lifecycle details. The MC invokes driver methods while performing reconciliation.
type Driver interface {
CreateMachine(context.Context, *CreateMachineRequest) (*CreateMachineResponse, error)
DeleteMachine(context.Context, *DeleteMachineRequest) (*DeleteMachineResponse, error)
GetMachineStatus(context.Context, *GetMachineStatusRequest) (*GetMachineStatusResponse, error)
ListMachines(context.Context, *ListMachinesRequest) (*ListMachinesResponse, error)
GetVolumeIDs(context.Context, *GetVolumeIDsRequest) (*GetVolumeIDsResponse, error)
GenerateMachineClassForMigration(context.Context, *GenerateMachineClassForMigrationRequest) (*GenerateMachineClassForMigrationResponse, error)
}
- `GetVolumeIDs` returns a list of volume IDs for the given list of PVSpecs.
  - Example: the AWS driver checks whether `spec.AWSElasticBlockStore.VolumeID` is not nil and converts the k8s `spec.AWSElasticBlockStore.VolumeID` to the EBS volume ID. Or, if storage is provided by CSI (`spec.CSI.Driver="ebs.csi.aws.com"`), it just gets `spec.CSI.VolumeHandle`.
- `GetMachineStatus` gets the status of the VM backing the machine object on the provider.
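To make the contract concrete, below is a hypothetical skeleton of a provider Driver. It only mirrors the method set and request/response type names shown above; every body is a placeholder, and the chosen error codes and messages are illustrative (they use the machinecodes packages described in the next section).

```go
package driverexample

import (
	"context"

	"github.com/gardener/machine-controller-manager/pkg/util/provider/driver"
	"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes"
	"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/status"
)

// exampleDriver is a stub provider driver; a real implementation would call
// the cloud provider API using the machine class and secret carried in each request.
type exampleDriver struct{}

// compile-time check that the stub satisfies the Driver facade
var _ driver.Driver = &exampleDriver{}

func (d *exampleDriver) CreateMachine(ctx context.Context, req *driver.CreateMachineRequest) (*driver.CreateMachineResponse, error) {
	return nil, status.New(codes.Internal, "create not implemented in this sketch")
}

func (d *exampleDriver) DeleteMachine(ctx context.Context, req *driver.DeleteMachineRequest) (*driver.DeleteMachineResponse, error) {
	return nil, status.New(codes.Internal, "delete not implemented in this sketch")
}

func (d *exampleDriver) GetMachineStatus(ctx context.Context, req *driver.GetMachineStatusRequest) (*driver.GetMachineStatusResponse, error) {
	// returning NotFound signals that no VM currently backs this machine object
	return nil, status.New(codes.NotFound, "no VM found for machine")
}

func (d *exampleDriver) ListMachines(ctx context.Context, req *driver.ListMachinesRequest) (*driver.ListMachinesResponse, error) {
	return nil, status.New(codes.Internal, "list not implemented in this sketch")
}

func (d *exampleDriver) GetVolumeIDs(ctx context.Context, req *driver.GetVolumeIDsRequest) (*driver.GetVolumeIDsResponse, error) {
	return nil, status.New(codes.Internal, "volume IDs not implemented in this sketch")
}

func (d *exampleDriver) GenerateMachineClassForMigration(ctx context.Context, req *driver.GenerateMachineClassForMigrationRequest) (*driver.GenerateMachineClassForMigrationResponse, error) {
	return nil, status.New(codes.Internal, "migration not implemented in this sketch")
}
```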
Codes and (error) Status
Code
github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes.Code is a uint32
with the following error codes. These error codes are contained in errors returned from Driver methods.
Note: Unhappy with the current design. It is clear that some error codes overlap each other in the sense that they are supersets of other codes. The right thing to do would have been to make an ErrorCategory.
Value | Code | Description |
---|---|---|
0 | Ok | Success |
1 | Canceled | the operation was canceled (by caller) |
2 | Unknown | Unknown error (unrecognized code) |
3 | InvalidArgument | InvalidArgument indicates client specified an invalid argument. |
4 | DeadlineExceeded | DeadlineExceeded means operation expired before completion. For operations that change the state of the system, this error may be returned even if the operation has completed successfully. For example, a successful response from a server could have been delayed long enough for the deadline to expire. |
5 | NotFound | requested entity not found. |
6 | AlreadyExists | an attempt to create an entity failed because one already exists. |
7 | PermissionDenied | caller does not have permission to execute the specified operation. |
8 | ResourceExhausted | indicates some resource has been exhausted, perhaps a per-user quota, or perhaps the entire file system is out of space. |
9 | FailedPrecondition | operation was rejected because the system is not in a state required for the operation's execution. |
10 | Aborted | operation was aborted and client should retry the full process |
11 | OutOfRange | operation was attempted past the valid range. Unlike InvalidArgument, this error indicates a problem that may be fixed if the system state changes. |
12 | UnImplemented | operation is not implemented or not supported |
13 | Internal | BAD. Some internal invariant broken. |
14 | Unavailable | Service is currently unavailable (transient and op may be tried with backoff) |
15 | DataLoss | unrecoverable data loss or corruption. |
16 | Unauthenticated | request does not have valid creds for operation. Note: It would have been nice if 7 was called Unauthorized. |
Status
(OPINION: I believe this is a minor NIH design defect. Ideally one should have re-leveraged the k8s API machinery Status k8s.io/apimachinery/pkg/apis/meta/v1.Status instead of making a custom status object.)
status implements errors returned by MachineAPIs. MachineAPIs service handlers should return an error created by this package, and machineAPIs clients should expect a corresponding error to be returned from the RPC call.
status.Status implements error
and encapsulates a code
which should be one of the codes in codes.Code, and a developer-facing error message in English.
type Status struct {
code int32
message string
}
// New returns a Status encapsulating code and msg.
func New(code codes.Code, msg string) *Status {
return &Status{code: int32(code), message: msg}
}
NOTE: No idea why we are doing such hard work as shown in status.FromError, which involves time-consuming regex parsing of an error string into a status. This is actually being used to parse the error strings of errors returned by Driver methods. Not good - should be better designed.
- Machine Controller
- MC Launch
- Machine Controller Loop
- app.Run
- app.StartControllers
- Machine Controller Initialization
- 1. NewController factory func
- 1.1 Init Controller Struct
- 1.2 Assign Listers and HasSynced funcs to controller struct
- 1.3 Register Controller Event Handlers on Informers.
- Machine Controller Run
Machine Controller
The Machine Controller handles reconciliation of Machine and MachineClass objects.
The Machine Controller Entry Point for any provider is at
machine-controller-manager-provider-<name>/cmd/machine-controller/main.go
MC Launch
Dev
Build
A Makefile
in the root of machine-controller-manager-provider-<name>
builds the provider specific machine controller for linux with CGO disabled. The make build
target invokes the shell script .ci/build to do this.
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
-a \
-v \
-o ${BINARY_PATH}/rel/machine-controller \
cmd/machine-controller/main.go
Launch
Assuming one has initialized the variables using make download-kubeconfigs
, one can then use the make start target, which launches the MC with the flags shown below.
Most of these timeout flags are redundant since the exact same values are given in machine-controller-manager/pkg/util/provider/app/options.NewMCServer.
go run -mod=vendor
cmd/machine-controller/main.go
--control-kubeconfig=$(CONTROL_KUBECONFIG)
--target-kubeconfig=$(TARGET_KUBECONFIG)
--namespace=$(CONTROL_NAMESPACE)
--machine-creation-timeout=20m
--machine-drain-timeout=5m
--machine-health-timeout=10m
--machine-pv-detach-timeout=2m
--machine-safety-apiserver-statuscheck-timeout=30s
--machine-safety-apiserver-statuscheck-period=1m
--machine-safety-orphan-vms-period=30m
--leader-elect=$(LEADER_ELECT)
--v=3
Prod
Build
A Dockerfile
builds the provider specific machine controller and launches it directly with no CLI arguments. Hence it uses the coded defaults.
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH \
go build \
-mod=vendor \
-o /usr/local/bin/machine-controller \
cmd/machine-controller/main.go
COPY --from=builder /usr/local/bin/machine-controller /machine-controller
ENTRYPOINT ["/machine-controller"]
The machine-controller-manager
deployment usually launches both the MCM and the MC in a Pod; the MC container is started with the following arguments:
./machine-controller
--control-kubeconfig=inClusterConfig
--machine-creation-timeout=20m
--machine-drain-timeout=2h
--machine-health-timeout=10m
--namespace=shoot--i034796--tre
--port=10259
--target-kubeconfig=/var/run/secrets/gardener.cloud/shoot/generic-kubeconfig/kubeconfig
--v=3
Launch Flow
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TB Begin(("cmd/ machine-controller/ main.go")) -->NewMCServer["mc=options.NewMCServer"] -->AddFlaogs["mc.AddFlags(pflag.CommandLine)"] -->LogOptions["options := k8s.io/component/base/logs.NewOptions() options.AddFlags(pflag.CommandLine)"] -->InitFlags["flag.InitFlags"] InitFlags--local-->NewLocalDriver[" driver, err := local.NewDriver(s.ControlKubeconfig) if err exit "] InitFlags--aws-->NewPlatformDriver[" driver := aws.NewAWSDriver(&spi.PluginSPIImpl{})) OR driver := cp.NewAzureDriver(&spi.PluginSPIImpl{}) //etc "] NewLocalDriver-->AppRun[" err := app.Run(mc, driver) "] NewPlatformDriver-->AppRun AppRun-->End(("if err != nil os.Exit(1)"))
Summary
- Creates machine-controller-manager/pkg/util/provider/app/options.MCServer using `options.NewMCServer`, which is the main context object for the machine controller that embeds an options.MachineControllerConfiguration. `options.NewMCServer` initializes the `options.MCServer` struct with default values:
  - `Port: 10258`, `Namespace: default`
  - `ConcurrentNodeSyncs: 50`: number of worker go-routines that are used to process items from a work queue. See Worker below.
  - `NodeConditions: "KernelDeadLock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable"` (failure node conditions that indicate that a machine is un-healthy)
  - `MinResyncPeriod: 12 hours`, `KubeAPIQPS: 20`, `KubeAPIBurst: 30`: config params for k8s clients. See rest.Config.
- Calls `MCServer.AddFlags`, which defines all parsing flags for the machine controller into fields of the `MCServer` instance created in the last step.
- Calls `k8s.io/component-base/logs.NewOptions` and then `options.AddFlags` for logging options. (TODO: Should get rid of this when moving to `logr`.)
  - See Logging In Gardener Components.
  - Then use the logcheck tool.
- Driver initialization code varies according to the provider type.
  - Local Driver
    - Calls `NewDriver` with the control kube config, which creates a controller-runtime client (`sigs.k8s.io/controller-runtime/pkg/client`) and then calls `pkg/local/driver.NewDriver` passing the controller-runtime client, which constructs a `localdriver` encapsulating the passed-in client: `driver := local.NewDriver(c)`
    - The `localdriver` implements Driver, which is the facade for creation/deletion of VMs.
  - Provider Specific Driver Example
    - `driver := aws.NewAWSDriver(&spi.PluginSPIImpl{})`
    - `driver := cp.NewAzureDriver(&spi.PluginSPIImpl{})`
    - `spi.PluginSPIImpl` is a struct that implements a provider specific interface that initializes a provider session.
- Calls app.Run passing in the previously created `MCServer` and `Driver` instances.
Machine Controller Loop
app.Run
app.Run
is the function that sets up the main control loop of the machine controller server.
Summary
- app.Run(options *options.MCServer, driver driver.Driver) is the common run loop for all provider Machine Controllers.
- Creates `targetkubeconfig` and `controlkubeconfig` of type `k8s.io/client-go/rest.Config` from the kube config paths using `clientcmd.BuildConfigFromFlags`.
- Sets fields such as `config.QPS` and `config.Burst` in both `targetkubeconfig` and `controlkubeconfig` from the passed in `options.MCServer`.
- Creates `kubeClientControl` from the `controlkubeconfig` using the standard client-go client factory method `kubernetes.NewForConfig`, which returns a `client-go/kubernetes.Clientset`.
- Similarly creates another `Clientset` named `leaderElectionClient` using `controlkubeconfig`.
- Starts a go routine using the function `startHTTP` that registers a bunch of http handlers for the go profiler, prometheus metrics and the health check.
- Calls `createRecorder` passing the `kubeClientControl` client set instance, which returns a client-go/tools/record.EventRecorder:
  - Creates a new `eventBroadcaster` of type event.EventBroadcaster.
  - Sets the logging function of the broadcaster to `klog.Infof`.
  - Sets the event sink using `eventBroadcaster.StartRecordingToSink` passing the event interface as `kubeClient.CoreV1().RESTClient().Events("")`. Effectively events will be published remotely.
  - Returns the `record.EventRecorder` associated with the `eventBroadcaster` using `eventBroadcaster.NewRecorder`.
- Constructs an anonymous function assigned to the `run` variable which does the following:
  - Initializes a `stop` receive channel.
  - Creates a `controlMachineClientBuilder` using `machineclientbuilder.SimpleClientBuilder` with the `controlkubeconfig`.
  - Creates a `controlCoreClientBuilder` using `coreclientbuilder.SimpleControllerClientBuilder` wrapping `controlkubeconfig`.
  - Creates a `targetCoreClientBuilder` using `coreclientbuilder.SimpleControllerClientBuilder` wrapping `controlkubeconfig`.
  - Calls the `app.StartControllers` function passing the `options`, `driver`, `controlkubeconfig`, `targetkubeconfig`, `controlMachineClientBuilder`, `controlCoreClientBuilder`, `targetCoreClientBuilder`, `recorder` and the `stop` channel.
    - // Q: if you are going to pass the controlkubeconfig and targetkubeconfig - why not create the client builders inside the startcontrollers ?
  - If `app.StartControllers` returns an error, panic and exit `run`.
- Uses leaderelection.RunOrDie to start a leader election and passes the previously created `run` function as the callback for `OnStartedLeading`. The `OnStartedLeading` callback is invoked when a leaderelector client starts leading. (A sketch of this wiring follows below.)
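A hedged sketch of this leader-election wiring; construction of the resource lock is elided, the duration values are illustrative only (not MCM defaults), and the helper name is an assumption:

```go
package appexample

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// startWithLeaderElection only invokes the previously constructed `run`
// callback once this instance becomes the leader, mirroring the step above.
func startWithLeaderElection(ctx context.Context, lock resourcelock.Interface, run func(ctx context.Context)) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run, // start the controllers once leadership is acquired
			OnStoppedLeading: func() {
				klog.Fatal("leader election lost")
			},
		},
	})
}
```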
app.StartControllers
app.StartControllers starts all controller loops which are part of the machine controller.
func StartControllers(options *options.MCServer,
controlCoreKubeconfig *rest.Config,
targetCoreKubeconfig *rest.Config,
controlMachineClientBuilder machineclientbuilder.ClientBuilder,
controlCoreClientBuilder coreclientbuilder.ClientBuilder,
targetCoreClientBuilder coreclientbuilder.ClientBuilder,
driver driver.Driver,
recorder record.EventRecorder,
stop <-chan struct{}) error
- Calls `getAvailableResources` using the `controlCoreClientBuilder`, which returns a `map[schema.GroupVersionResource]bool` assigned to `availableResources`.
  - `getAvailableResources` waits till the api server is running by checking its `/healthz` using `wait.PollImmediate`, re-creating the client using the `clientbuilder.Client` method (a sketch of this health wait follows this list).
  - It then uses `client.Discovery().ServerResources`, which returns the supported resources for all groups and versions as a slice of *metav1.APIResourceList (which encapsulates a []APIResource), and converts that to a `map[schema.GroupVersionResource]bool`.
- Creates a `controlMachineClient` using `controlMachineClientBuilder.ClientOrDie("machine-controller").MachineV1alpha1()` which is a client of type MachineV1alpha1Interface. This interface is a composition of MachineGetter, MachineClassesGetter, MachineDeploymentsGetter and MachineSetsGetter allowing access to the CRUD interface for machines, machine classes, machine deployments and machine sets. This client targets the control cluster - i.e. the cluster holding the machine CRDs.
- Creates a `controlCoreClient` (of type kubernetes.Clientset), which is the standard k8s client-go client for accessing the k8s control cluster.
- Creates a `targetCoreClient` (of type kubernetes.Clientset), which is the standard k8s client-go client for accessing the target cluster - in which machines will be spawned.
- Obtains the target cluster k8s version using the discovery interface and preserves it in `targetKubernetesVersion`.
- If the `availableResources` does not contain the machine GVR, exits `app.StartControllers` with an error.
- Creates the following informer factories:
  - `controlMachineInformerFactory` using the generated pkg/client/informers/externalversions#NewFilteredSharedInformerFactory passing the control machine client, the configured min resync period and the control namespace.
  - `controlCoreInformerFactory` using the client-go core informers#NewFilteredSharedInformerFactory passing in the control core client, min resync period and control namespace.
  - Similarly creates `targetCoreInformerFactory`.
- Gets the controller's Machine Informers facade using `controlMachineInformerFactory.Machine().V1alpha1()` and assigns it to `machineSharedInformers`.
- Now creates the `machinecontroller` using the machinecontroller.NewController factory function, passing the below:
  - control namespace from `options.MCServer.Namespace`
  - `SafetyOptions` from `options.MCServer.SafetyOptions`
  - `NodeConditions` from `options.MCServer.NodeConditions` (by default these would be: "KernelDeadlock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable")
  - clients: `controlMachineClient`, `controlCoreClient`, `targetCoreClient`
  - the `driver`
  - Target Cluster Informers obtained from `targetCoreInformerFactory`
  - Control Cluster Informers obtained from `controlCoreInformerFactory`
  - MachineClassInformer, MachineInformer using `machineSharedInformers.MachineClasses()` and `machineSharedInformers.Machines()`
  - The event recorder created earlier
  - `targetKubernetesVersion`
- Starts `controlMachineInformerFactory`, `controlCoreInformerFactory` and `targetCoreInformerFactory` by calling SharedInformerFactory.Start passing the `stop` channel.
- Launches machinecontroller.Run in a new go-routine passing the stop channel.
- Blocks forever using a `select{}`.
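A hedged sketch of the `/healthz` wait mentioned in the first bullet above; the interval and timeout values are assumptions and the helper name is hypothetical:

```go
package appexample

import (
	"context"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForAPIServer polls the control cluster's /healthz endpoint until it
// returns 200 OK or the timeout expires.
func waitForAPIServer(client kubernetes.Interface) error {
	return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		var statusCode int
		client.Discovery().RESTClient().Get().AbsPath("/healthz").Do(context.TODO()).StatusCode(&statusCode)
		// keep polling until the API server reports healthy
		return statusCode == http.StatusOK, nil
	})
}
```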
Machine Controller Initialization
The machine controller is constructed using the `controller.NewController` factory function, which initializes the `controller` struct.
1. NewController factory func
The MC is constructed using the factory function below:
func NewController(
namespace string,
controlMachineClient machineapi.MachineV1alpha1Interface,
controlCoreClient kubernetes.Interface,
targetCoreClient kubernetes.Interface,
driver driver.Driver,
pvcInformer coreinformers.PersistentVolumeClaimInformer,
pvInformer coreinformers.PersistentVolumeInformer,
secretInformer coreinformers.SecretInformer,
nodeInformer coreinformers.NodeInformer,
pdbV1beta1Informer policyv1beta1informers.PodDisruptionBudgetInformer,
pdbV1Informer policyv1informers.PodDisruptionBudgetInformer,
volumeAttachmentInformer storageinformers.VolumeAttachmentInformer,
machineClassInformer machineinformers.MachineClassInformer,
machineInformer machineinformers.MachineInformer,
recorder record.EventRecorder,
safetyOptions options.SafetyOptions,
nodeConditions string,
bootstrapTokenAuthExtraGroups string,
targetKubernetesVersion *semver.Version,
) (Controller, error)
1.1 Init Controller Struct
Creates and initializes the controller struct, initializing rate-limiting work queues for secrets (`controller.secretQueue`), nodes (`controller.nodeQueue`), machines (`controller.machineQueue`) and machine classes (`controller.machineClassQueue`), along with the 2 work queues used by the safety controllers: `controller.machineSafetyOrphanVMsQueue` and `controller.machineSafetyAPIServerQueue`.
Example:
controller := &controller{
	// ...
	secretQueue:  workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "secret"),
	machineQueue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "machine"),
	// ...
}
1.2 Assign Listers and HasSynced funcs to controller struct
// initialize controller listers from the passed-in shared informers (8 listers)
controller.pvcLister = pvcInformer.Lister()
controller.pvLister = pvInformer.Lister()
controller.machineLister = machineInformer.Lister()
controller.pdbV1Lister = pdbV1Informer.Lister()
controller.pdbV1Synced = pdbV1Informer.Informer().HasSynced
// ...
// assign the HasSynced function from the passed-in shared informers
controller.pvcSynced = pvcInformer.Informer().HasSynced
controller.pvSynced = pvInformer.Informer().HasSynced
controller.machineSynced = machineInformer.Informer().HasSynced
1.3 Register Controller Event Handlers on Informers.
An informer invokes the registered event handlers when a k8s object changes.
Event handlers are registered using the `<ResourceType>Informer().AddEventHandler` function.
The controller initialization registers add/delete event handlers for Secrets, and add/update/delete event handlers for the MachineClass, Machine and Node informers.
The event handlers generally add the object keys to the appropriate work queues, which are later picked up and processed in `controller.Run`.
The work queue is used to separate the delivery of the object from its processing. Resource event handler functions extract the key of the delivered object and add it to the relevant work queue for future processing (in `controller.Run`).
Example
secretInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: controller.secretAdd,
DeleteFunc: controller.secretDelete,
})
1.3.1 Secret Informer Callback
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TB SecretInformerAddCallback -->SecretAdd["controller.secretAdd(obj)"] -->GetSecretKey["key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)"] -->AddSecretQ["if err == nil c.secretQueue.Add(key)"] SecretInformerDeleteCallback -->SecretAdd
We must check for the DeletedFinalStateUnknown state of that secret in the cache before enqueuing its key. The DeletedFinalStateUnknown
state means that the object has been deleted but that the watch deletion event was missed while disconnected from apiserver and the controller didn't react accordingly. Hence if there is no error, we can add the key to the queue.
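A minimal sketch of this pattern; the function mirrors what the registered callbacks do, and the queue parameter stands in for `c.secretQueue`:

```go
package handlersexample

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// enqueueSecret derives a namespace/name key for the object, transparently
// unwrapping cache.DeletedFinalStateUnknown tombstones, and enqueues it.
func enqueueSecret(queue workqueue.RateLimitingInterface, obj interface{}) {
	key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
	if err != nil {
		return // malformed object; nothing sensible to enqueue
	}
	queue.Add(key)
}
```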
1.3.2 Machine Class Informer Callbacks
MachineClass Add/Delete Callback 1
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TB MachineClassInformerAddCallback1 --> MachineAdd["controller.machineClassToSecretAdd(obj)"] -->CastMC[" mc, ok := obj.(*v1alpha1.MachineClass) "] -->EnqueueSecret[" c.secretQueue.Add(mc.SecretRef.Namespace + '/' + mc.SecretRef.Name) c.secretQueue.Add(mc.CredentialSecretRef.Namespace + '/' + mc.CredentialSecretRef.Namespace.Name) "] MachineClassToSecretDeleteCallback1 -->MachineAdd
MachineClass Update Callback 1
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TB MachineClassInformerUpdateCallback1 --> MachineAdd["controller.machineClassToSecretUpdate(oldObj, newObj)"] -->CastMC[" old, ok := oldObj.(*v1alpha1.MachineClass) new, ok := newObj.(*v1alpha1.MachineClass) "] -->RefNotEqual{"old.SecretRef != new.SecretRef?"} --Yes-->EnqueueSecret[" c.secretQueue.Add(old.SecretRef.Namespace + '/' + old.SecretRef.Name) c.secretQueue.Add(new.SecretRef.Namespace + '/' + new.SecretRef.Name) "] -->CredRefNotEqual{"old.CredentialsSecretRef!= new.CredentialsSecretRef?"} --Yes-->EnqueueCredSecretRef[" c.secretQueue.Add(old.CredentialsSecretRef.Namespace + '/' + old.CredentialsSecretRef.Name) c.secretQueue.Add(new.CredentialsSecretRef.Namespace + '/' + new.CredentialsSecretRef.Name) "]
MachineClass Add/Delete/Update Callback 2
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TB MachineClassInformerAddCallback2 --> MachineAdd["controller.machineClassAdd(obj)"] -->CastMC[" key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj) if err != nil c.machineClassQueue.Add(key) "] MachineClassInformerDeleteCallback2 -->MachineAdd MachineClassInformeUpdateCallback2 -->MachineUpdate["controller.machineClassUpdate(oldObj,obj)"] -->MachineAdd
1.3.3 Machine Informer Callbacks
Machine Add/Update/Delete Callbacks 1
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TB MachineAddCallback1 -->AddMachine["controller.addMachine(obj)"] -->EnqueueMachine[" key, err := cache.MetaNamespaceKeyFunc(obj) //Q: why don't we use DeletionHandlingMetaNamespaceKeyFunc here ? if err!=nil c.machineQueue.Add(key) "] MachineUpdateCallback1-->AddMachine MachineDeleteCallback1-->AddMachine
Machine Update/Delete Callbacks 2
DISCUSS THIS.
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TB MachineUpdateCallback2 -->UpdateMachineToSafety["controller.updateMachineToSafety(oldObj, newObj)"] -->EnqueueSafetyQ[" newM := newObj.(*v1alpha1.Machine) if multipleVMsBackingMachineFound(newM) { c.machineSafetyOrphanVMsQueue.Add('') } "] MachineDeleteCallback2 -->DeleteMachineToSafety["deleteMachineToSafety(obj)"] -->EnqueueSafetyQ1[" c.machineSafetyOrphanVMsQueue.Add('') "]
1.3.4 Node Informer Callbacks
Node Add Callback
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TB NodeaAddCallback -->InvokeAddNodeToMachine["controller.addNodeToMachine(obj)"] -->AddNodeToMachine[" node := obj.(*corev1.Node) if node.ObjectMeta.Annotations has NotManagedByMCM return; key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj) if err != nil return "] -->GetMachineFromNode[" machine := (use machineLister to get first machine whose 'node' label equals key) "] -->ChkMachine{" machine.Status.CurrentStatus.Phase != 'CrashLoopBackOff' && nodeConditionsHaveChanged( machine.Status.Conditions, node.Status.Conditions) ? "} --Yes-->EnqueueMachine[" mKey, err := cache.MetaNamespaceKeyFunc(obj) if err != nil return controller.machineQueue.Add(mKey) "]
Node Delete Callback
This is straightforward - it checks that the node has an associated machine and if so, enqueues the machine on the machineQueue
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TB NodeDeleteCallback -->InvokeDeleteNodeToMachine["controller.deleteNodeToMachine(obj)"] -->DeleteNodeToMachine[" key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj) if err != nil return "] -->GetMachineFromNode[" machine := (use machineLister to get first machine whose 'node' label equals key) if err != nil return "] -->EnqueueMachine[" mKey, err := cache.MetaNamespaceKeyFunc(obj) if err != nil return controller.machineQueue.Add(mKey) "]
Node Update Callback
`controller.updateNodeToMachine` is specified as the `UpdateFunc` registered for the `nodeInformer`.
In a nutshell, it simply delegates to `addNodeToMachine(newObj)` described earlier, except if the node has the annotation `machineutils.TriggerDeletionByMCM` (value: `node.machine.sapcloud.io/trigger-deletion-by-mcm`). In this case it gets the `machine` obj corresponding to the node and then leverages `controller.controlMachineClient` to delete the machine object.
NOTE: This annotation was introduced for the user to add to the node. It gives them an indirect way to delete the machine object, because they don't have access to the control plane.
Snippet shown below with error handling+logging omitted.
func (c *controller) updateNodeToMachine(oldObj, newObj interface{}) {
	node := newObj.(*corev1.Node)
	// check for the TriggerDeletionByMCM annotation on the node object
	// if it is present then mark the machine object for deletion
	if value, ok := node.Annotations[machineutils.TriggerDeletionByMCM]; ok && value == "true" {
		machine, _ := c.getMachineFromNode(node.Name) // error handling omitted
		if machine.DeletionTimestamp == nil {
			c.controlMachineClient.
				Machines(c.namespace).
				Delete(context.Background(), machine.Name, metav1.DeleteOptions{})
		}
	} else {
		c.addNodeToMachine(newObj)
	}
}
Machine Controller Run
func (c *controller) Run(workers int, stopCh <-chan struct{}) {
// ...
}
1. Wait for Informer Caches to Sync
When an informer starts, it will build a cache of all resources it currently watches which is lost when the application restarts. This means that on startup, each of your handler functions will be invoked as the initial state is built. If this is not desirable, one should wait until the caches are synced before performing any updates. This can be done using the cache.WaitForCacheSync function.
if !cache.WaitForCacheSync(stopCh, c.secretSynced, c.pvcSynced, c.pvSynced, c.volumeAttachementSynced, c.nodeSynced, c.machineClassSynced, c.machineSynced) {
runtimeutil.HandleError(fmt.Errorf("Timed out waiting for caches to sync"))
return
}
2. Register On Prometheus
The Machine controller struct implements the prometheus.Collector interface and can therefore be registered on the prometheus metrics registry.
prometheus.MustRegister(controller)
Collectors which are added to the registry will collect metrics to expose them via the metrics endpoint of the MCM every time the endpoint is called.
2.1 Describe Metrics (controller.Describe)
All prometheus.Metric that are collected must first be described using a prometheus.Desc, which is the immutable meta-data about a metric.
As can be seen below, the machine controller sends a description of metrics.MachineCountDesc to prometheus. This is `mcm_machine_items_total`, which is the count of machines managed by the controller.
This `Describe` callback is called by `prometheus.MustRegister`.
Doubt: we currently appear to only have one metric for the MC?
var MachineCountDesc = prometheus.NewDesc("mcm_machine_items_total", "Count of machines currently managed by the mcm.", nil, nil)
func (c *controller) Describe(ch chan<- *prometheus.Desc) {
ch <- metrics.MachineCountDesc
}
2.2 Collect Metrics (controller.Collect)
Collect
is called by the prometheus registry when collecting
metrics. The implementation sends each collected metric via the
provided channel and returns once the last metric has been sent. The
descriptor of each sent metric is one of those returned by Describe.
// Collect is method required to implement the prometheus.Collect interface.
func (c *controller) Collect(ch chan<- prometheus.Metric) {
c.CollectMachineMetrics(ch)
c.CollectMachineControllerFrozenStatus(ch)
}
2.2.1 Collect Machine Metrics
func (c *controller) CollectMachineMetrics(ch chan<- prometheus.Metric)
A [prometheus.Metric](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Metric) models a sample linking data points together over time. Custom labels (with their own values) can be added to each data point.
A prometheus.Gauge is a Metric that represents a single numerical value that can arbitrarily go up and down. We use a Gauge for the machine count.
A prometheus.GaugeVec is a factory for creating a set of gauges all with the same description but which have different data values for the metric labels.
Machine information about a machine managed by MCM is described on Prometheus using a prometheus.GaugeVec constructed using the factory function prometheus.NewGaugeVec.
A prometheus.Desc is the descriptor used by every Prometheus Metric. It is essentially the immutable meta-data of a Metric that includes fully qualified name of the metric, the help string and the metric label names.
We have 3 such gauge vecs for machine metrics and 1 gauge metric for the machine count as seen below.
Q: Discuss: why do we need the 3 gauge vecs?
Example:
var MachineCountDesc = prometheus.NewDesc("mcm_machine_items_total", "Count of machines currently managed by the mcm.", nil, nil)
//MachineInfo Information of the Machines currently managed by the mcm.
var MachineInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Namespace: "mcm",
Subsystem: "machine",
Name: "info",
Help: "Information of the Machines currently managed by the mcm.",
}, []string{"name", "namespace", "createdAt",
"spec_provider_id", "spec_class_api_group", "spec_class_kind", "spec_class_name"})
// MachineStatusCondition Information of the mcm managed Machines' status conditions
var MachineStatusCondition = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Namespace: namespace,
Subsystem: machineSubsystem,
Name: "status_condition",
Help: "Information of the mcm managed Machines' status conditions.",
}, []string{"name", "namespace", "condition"})
//MachineCSPhase Current status phase of the Machines currently managed by the mcm.
var MachineCSPhase = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Namespace: namespace,
Subsystem: machineSubsystem,
Name: "current_status_phase",
Help: "Current status phase of the Machines currently managed by the mcm.",
}, []string{"name", "namespace"})
One invokes the GaugeVec.With method passing a prometheus.Labels which is a map[string]string
to obtain a prometheus.Gauge.
- Gets the list of machines using the `machineLister`.
- Iterates through the list of machines. Uses the `MachineInfo.With` method to initialize the labels for each metric and obtain the `Gauge`. Uses `Gauge.Set` to set the value for the metric.
- Creates the machine count metric: `metric, err := prometheus.NewConstMetric(metrics.MachineCountDesc, prometheus.GaugeValue, float64(len(machineList)))`
- Sends the machine count metric to the prometheus metric channel (a short self-contained sketch follows).
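A self-contained sketch of the collection pattern described in these bullets; the gauge vec here is a local stand-in rather than the package-level `metrics.MachineInfo`, the label set is trimmed for brevity, and `machineNames` stands in for the result of the machine lister:

```go
package metricsexample

import (
	"github.com/prometheus/client_golang/prometheus"
)

// collectMachineMetricsSketch emits one gauge per machine via GaugeVec.With,
// plus a single const metric carrying the total machine count.
func collectMachineMetricsSketch(ch chan<- prometheus.Metric, machineNames []string, namespace string) {
	info := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "mcm",
		Subsystem: "machine",
		Name:      "info",
		Help:      "Information of the Machines currently managed by the mcm.",
	}, []string{"name", "namespace"})

	for _, name := range machineNames {
		// With() returns the Gauge for this label combination.
		info.With(prometheus.Labels{"name": name, "namespace": namespace}).Set(1)
	}
	info.Collect(ch) // push the per-machine gauges to the registry channel

	countDesc := prometheus.NewDesc("mcm_machine_items_total",
		"Count of machines currently managed by the mcm.", nil, nil)
	if metric, err := prometheus.NewConstMetric(countDesc, prometheus.GaugeValue, float64(len(machineNames))); err == nil {
		ch <- metric
	}
}
```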
3. Create controller worker go-routines specifying reconcile functions
Finally, `worker.Run` is used to create and run a worker routine that just processes items in the specified queue. The worker will run until `stopCh`
is closed. The worker go-routine will be added to the wait group when started and marked done when finished.
Q: reconcileClusterNodeKey
seems useless ?
func (c *controller) Run(workers int, stopCh <-chan struct{}) {
	// ...
	var waitGroup sync.WaitGroup
	for i := 0; i < workers; i++ {
		worker.Run(c.secretQueue, "ClusterSecret", worker.DefaultMaxRetries, true, c.reconcileClusterSecretKey, stopCh, &waitGroup)
		worker.Run(c.machineClassQueue, "ClusterMachineClass", worker.DefaultMaxRetries, true, c.reconcileClusterMachineClassKey, stopCh, &waitGroup)
		worker.Run(c.machineQueue, "ClusterMachine", worker.DefaultMaxRetries, true, c.reconcileClusterMachineKey, stopCh, &waitGroup)
		worker.Run(c.machineSafetyOrphanVMsQueue, "ClusterMachineSafetyOrphanVMs", worker.DefaultMaxRetries, true, c.reconcileClusterMachineSafetyOrphanVMs, stopCh, &waitGroup)
		worker.Run(c.machineSafetyAPIServerQueue, "ClusterMachineAPIServer", worker.DefaultMaxRetries, true, c.reconcileClusterMachineSafetyAPIServer, stopCh, &waitGroup)
	}
	<-stopCh
	waitGroup.Wait()
}
func Run(queue workqueue.RateLimitingInterface, resourceType string, maxRetries int, forgetAfterSuccess bool, reconciler func(key string) error, stopCh <-chan struct{}, waitGroup *sync.WaitGroup) {
	waitGroup.Add(1)
	go func() {
		wait.Until(worker(queue, resourceType, maxRetries, forgetAfterSuccess, reconciler), time.Second, stopCh)
		waitGroup.Done()
	}()
}
3.1 worker.worker
NOTE: Puzzled that a basic routine like this is NOT part of the client-go lib. It's likely repeated across thousands of controllers (probably with bugs). Thankfully controller-runtime obviates the need for something like this.
worker returns a function that:
- De-queues items (keys) from the work `queue`. The `key`s obtained using `workqueue.Get` are expected to be strings of the form `namespace/name` of the resource.
- Processes them by invoking the `reconciler(key)` function.
  - The purpose of the `reconciler` is to compare the actual state with the desired state, and attempt to converge the two. It should then update the `status` block of the resource.
  - If the `reconciler` returns an error, the item is re-queued up to `maxRetries` before giving up.
- Marks items as done.
func worker(queue workqueue.RateLimitingInterface, resourceType string, maxRetries int, forgetAfterSuccess bool, reconciler func(key string) error) func() {
return func() {
exit := false
for !exit {
exit = func() bool {
key, quit := queue.Get()
if quit {
return true
}
defer queue.Done(key)
err := reconciler(key.(string))
if err == nil {
if forgetAfterSuccess {
queue.Forget(key)
}
return false
}
if queue.NumRequeues(key) < maxRetries {
queue.AddRateLimited(key)
return false
}
queue.Forget(key)
return false
}()
}
}
}
4. Reconciliation functions executed by worker
The controller starts worker go-routines that pop out keys from the relevant workqueue and execute the reconcile function.
See reconcile chapters
Reconcile Cluster Machine Class
Reconcile Cluster Machine Class Key
reconcileClusterMachineClassKey
just picks up the machine class key from the machine class queue and then delegates further.
func (c *controller) reconcileClusterMachineClassKey(key string) error
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD GetMCName["ns,name=cache.SplitMetanamespacekey(mkey)"] -->GetMC[" class, err := c.machineClassLister.MachineClasses(c.namespace).Get(name) if err != nil return err // basically adds back to the queue after rate limiting "] -->RMC[" ctx := context.Background() reconcileClusterMachineClass(ctx, class)"] -->CheckErr{"err !=nil"} --Yes-->ShortR["machineClassQueue.AddAfter(key, machineutils.ShortRetry)"] CheckErr--No-->LongR["machineClassQueue.AddAfter(key, machineutils.LongRetry)"]
Reconcile Cluster Machine Class
func (c *controller) reconcileClusterMachineClass(ctx context.Context,
class *v1alpha1.MachineClass) error
Bad design: should ideally return the retry period like other reconcile functions.
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD FindMachineForClass[" machines := Use machineLister and match on Machine.Spec.Class.Name == class to find machines with matching class"] -->CheckDelTimeStamp{" // machines are ref class.DeletionTimestamp == nil && len(machines) > 0 "} CheckDelTimeStamp--Yes-->AddMCFinalizers[" Add/Update MCM Finalizers to MC and use controlMachineClient to update (why mcm finalizer not mc finalizer?) 'machine.sapcloud.io/machine-controller-manager' retryPeriod=LongRetry "] -->ChkMachineCount{{"len(machines)>0?"}} --Yes-->EnQMachines[" iterate machines and invoke: c.machineQueue.Add(machine) "] -->End(("End")) CheckDelTimeStamp--No-->Shortr[" // Seems like over-work here. retryPeriod=ShortRetry "]-->ChkMachineCount ChkMachineCount--No-->DelMCFinalizers[" controller.deleteMachineClassFinalizers(ctx, class) "]-->End
Reconcile Cluster Secret
reconcileClusterSecretKey reconciles a secret due to a controller resync or an event on the secret.
func (c *controller) reconcileClusterSecretKey(key string) error
// which looks up secret and delegates to
func (c *controller) reconcileClusterSecret(ctx context.Context, secret *corev1.Secret) error
Usage
Worker go-routines are created for this as below
worker.Run(c.secretQueue,
"ClusterSecret",
worker.DefaultMaxRetries,
true,
c.reconcileClusterSecretKey,
stopCh,
&waitGroup)
Flow
controller.reconcileClusterSecretKey basically adds the MCFinalizerName (value: machine.sapcloud.io/machine-controller) to the list of finalizers for all secrets that are referenced by machine classes within the same namespace.
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD a[ns, name = cache.SplitMetaNamespaceKey] b["sec=secretLister.Secrets(ns).Get(name)"] c["machineClasses=findMachineClassForSecret(name) // Gets the slice of MachineClasses referring to the passed secret //iterates through machine classes and // checks whether mc.SecretRef.Name or mcCredentialSecretRef.Name // matches secret name "] d{machineClasses empty?} e["controller.addSecretFinalizers(sec)"] z(("return err")) a-->b b-->c c-->d d--Yes-->DeleteFinalizers["controller.deleteSecretFinalizers"]-->z e--success-->z d--No-->e
controller.addSecretFinalizers
func (c *controller) addSecretFinalizers(ctx context.Context, secret *corev1.Secret) error {
Basically adds machine.sapcloud.io/machine-controller to the secret's finalizers and uses controlCoreClient to update the secret.
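A minimal sketch of what this looks like, assuming the same sets-based finalizer handling as addMachineFinalizers (illustrative only, not the verbatim MCM code):

// Sketch only: mirrors the described behaviour, not the exact implementation.
func (c *controller) addSecretFinalizersSketch(ctx context.Context, secret *corev1.Secret) error {
	finalizers := sets.NewString(secret.Finalizers...)
	if finalizers.Has(MCFinalizerName) { // machine.sapcloud.io/machine-controller
		return nil
	}
	finalizers.Insert(MCFinalizerName)
	clone := secret.DeepCopy()
	clone.Finalizers = finalizers.List()
	_, err := c.controlCoreClient.CoreV1().Secrets(clone.Namespace).Update(ctx, clone, metav1.UpdateOptions{})
	return err
}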
While perusing the below, you might need to reference Machine Controller Helper Functions as several reconcile functions delegate to helper methods defined on the machine controller struct.
Cluster Machine Reconciliation
func (c *controller) reconcileClusterMachineKey(key string) error
The top-level reconcile function for the machine that analyzes machine status and delegates to the individual reconcile functions for machine-creation, machine-deletion and machine-health-check flows.
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD A["ns,name=cache.SplitMetaNamespaceKey(mKey)"] GetM["machine=machineLister.Machines(ns).Get(name)"] ValM["validation.ValidateMachine(machine)"] ValMC["machineClz,secretData,err=validation.ValidateMachineClass(machine)"] LongR["retryPeriod=machineutils.LongRetry"] ShortR["retryPeriod=machineutils.ShortRetry"] EnqM["machineQueue.AddAfter(mKey, retryPeriod)"] CheckMDel{"Is\nmachine.DeletionTimestamp\nSet?"} NewDelReq["req=&driver.DeleteMachineRequest{machine,machineClz,secretData}"] DelFlow["retryPeriod=controller.triggerDeletionFlow(req)"] CreateFlow["retryPeriod=controller.triggerCreationFlow(req)"] HasFin{"HasFinalizer(machine)"} AddFin["addMachineFinalizers(machine)"] CheckMachineNodeExists{"machine.Status.Node\nExists?"} ReconcileMachineHealth["controller.reconcileMachineHealth(machine)"] SyncNodeTemplates["controller.syncNodeTemplates(machine)"] NewCreateReq["req=&driver.CreateMachineRequest{machine,machineClz,secretData}"] Z(("End")) Begin((" "))-->A A-->GetM EnqM-->Z LongR-->EnqM ShortR-->EnqM GetM-->ValM ValM--Ok-->ValMC ValM--Err-->LongR ValMC--Err-->LongR ValMC--Ok-->CheckMDel CheckMDel--Yes-->NewDelReq CheckMDel--No-->HasFin NewDelReq-->DelFlow HasFin--No-->AddFin HasFin--Yes-->ShortR AddFin-->CheckMachineNodeExists CheckMachineNodeExists--Yes-->ReconcileMachineHealth CheckMachineNodeExists--No-->NewCreateReq ReconcileMachineHealth--Ok-->SyncNodeTemplates SyncNodeTemplates--Ok-->LongR SyncNodeTemplates--Err-->ShortR DelFlow-->EnqM NewCreateReq-->CreateFlow CreateFlow-->EnqM
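The same decision tree, reduced to a Go skeleton. This is a sketch, not the verbatim code: validateMachineAndClass and hasFinalizer stand in for the validation and finalizer checks shown in the diagram, and the request struct field names follow the ones used later in this chapter.

// Sketch only: skeleton of the decision tree above; helper names are assumptions.
func (c *controller) reconcileClusterMachineSketch(ctx context.Context, machine *v1alpha1.Machine) (machineutils.RetryPeriod, error) {
	machineClass, secret, err := c.validateMachineAndClass(machine)
	if err != nil {
		return machineutils.LongRetry, err
	}
	if machine.DeletionTimestamp != nil {
		return c.triggerDeletionFlow(ctx, &driver.DeleteMachineRequest{
			Machine: machine, MachineClass: machineClass, Secret: secret,
		})
	}
	if !hasFinalizer(machine) {
		return c.addMachineFinalizers(ctx, machine)
	}
	if machine.Status.Node == "" {
		return c.triggerCreationFlow(ctx, &driver.CreateMachineRequest{
			Machine: machine, MachineClass: machineClass, Secret: secret,
		})
	}
	if retryPeriod, err := c.reconcileMachineHealth(ctx, machine); err != nil {
		return retryPeriod, err
	}
	return c.syncMachineNodeTemplates(ctx, machine)
}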
controller.triggerCreationFlow
Controller method that orchestrates the call to Driver.CreateMachine.
This method badly needs to be split into several smaller functions; it is too long.
func (c *controller) triggerCreationFlow(ctx context.Context,
cmr *driver.CreateMachineRequest)
(machineutils.RetryPeriod, error)
Apologies for HUMONGOUS flow diagram - ideally code should have been split here into small functions.
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD ShortP["retryPeriod=machineutils.ShortRetry"] MediumP["retryPeriod=machineutils.MediumRetry"] Return(("return retryPeriod, err")) Begin((" "))-->Init[" machine = cmr.Machine machineName = cmr.Machine.Name secretCopy := cmr.Secret.DeepCopy() //NOTE: seems Un-necessary? "] -->AddBootStrapToken[" err = c.addBootstrapTokenToUserData(ctx, machine.Name, secretCopy) // get/create bootstrap token and populate inside secretCopy['userData'] "] -->ChkErr{err != nil?} ChkErr--Yes-->ShortP-->Return ChkErr--No-->CreateMachineStatusReq[" statusReq = & driver.GetMachineStatusRequest{ Machine: machine, MachineClass: cmr.MachineClass, Secret: cmr.Secret, }, "]-->GetMachineStatus[" statusResp, err := c.driver.GetMachineStatus(ctx, statusReq) //check if VM already exists "]-->ChkStatusErr{err!=nil} ChkStatusErr--No-->InitNodeNameFromStatusResp[" nodeName = statusResp.NodeName providerID = statusResp.ProviderID "] ChkStatusErr--Yes-->DecodeErrorStatus[" errStatus,decodeOk= status.FromError(err) "] DecodeErrorStatus-->CheckDecodeOk{"decodeOk ?"} CheckDecodeOk--No-->MediumP-->Return CheckDecodeOk--Yes-->AnalyzeCode{status.Code?} AnalyzeCode--NotFound,Unimplemented-->ChkNodeLabel{"machine.Labels['node']?"} ChkNodeLabel--No-->CreateMachine[" // node label is not present -> no machine resp, err := c.driver.CreateMachine(ctx, cmr) "]-->ChkCreateError{err!=nil?} ChkNodeLabel--Yes-->InitNodeNameFromMachine[" nodeName = machine.Labels['node'] "] AnalyzeCode--Unknown,DeadlineExceeded,Aborted,Unavailable-->ShortRetry[" retryPeriod=machineutils.ShortRetry "]-->GetLastKnownState[" lastKnownState := machine.Status.LastKnownState "]-->InitFailedOp[" lastOp := LastOperation{ Description: err.Error(), State: MachineStateFailed, Type: MachineOperationCreatea, LastUpdateTime: Now(), }; currStatus := CurrentStatus { Phase: MachineCrashLoopBackOff || MachineFailed (on create timeout) LastUpdateTime: Now() } "]-->UpdateMachineStatus[" c.machineStatusUpdate(ctx,machine,lastOp,currStatus,lastKnownState) "]-->Return ChkCreateError--Yes-->SetLastKnownState[" lastKnownState = resp.LastKnownState "]-->InitFailedOp ChkCreateError--No-->InitNodeNameFromCreateResponse[" nodeName = resp.NodeName providerID = resp.ProviderID "]-->ChkStaleNode{" // check stale node nodeName != machineName && nodeLister.Get(nodeName) exists"} InitNodeNameFromStatusResp-->ChkNodeLabelAnnotPresent{" cmr.Machine.Labels['node'] && cmr.Machine.Annotations[MachinePriority] ? 
"} InitNodeNameFromMachine-->ChkNodeLabelAnnotPresent ChkNodeLabelAnnotPresent--No-->CloneMachine[" clone := machine.DeepCopy; clone.Labels['node'] = nodeName clone.Annotations[machineutils.MachinePriority] = '3' clone.Spec.ProviderID = providerID "]-->UpdateMachine[" _, err := c.controlMachineClient.Machines(clone.Namespace).Update(ctx, clone, UpdateOptions{}) "]-->ShortP ChkStaleNode--No-->CloneMachine ChkStaleNode--Yes-->CreateDMR[" dmr := &driver.DeleteMachineRequest{ Machine: &Machine{ ObjectMeta: machine.ObjectMeta, Spec: MachineSpec{ ProviderID: providerID, }, }, MachineClass: createMachineRequest.MachineClass, Secret: secretCopy, } "]-->DeleteMachine[" _, err := c.driver.DeleteMachine(ctx, deleteMachineRequest) // discuss stale node case retryPeriod=machineutils.ShortRetry "]-->InitFailedOp1[" lastOp := LastOperation{ Description: 'VM using old node obj', State: MachineStateFailed, Type: MachineOperationCreate, //seems wrong LastUpdateTime: Now(), }; currStatus := CurrentStatus { Phase: MachineFailed (on create timeout) LastUpdateTime: Now() } "]-->UpdateMachineStatus ChkNodeLabelAnnotPresent--Yes-->ChkMachineStatus{"machine.Status.Node != nodeName || machine.Status.CurrentStatus.Phase == ''"} ChkMachineStatus--No-->LongP["retryPeriod = machineutils.LongRetry"]-->Return ChkMachineStatus--Yes-->CloneMachine1[" clone := machine.DeepCopy() clone.Status.Node = nodeName "]-->SetLastOp[" lastOp := LastOperation{ Description: 'Creating Machine on Provider', State: MachineStateProcessing, Type: MachineOperationCreate, LastUpdateTime: Now(), }; currStatus := CurrentStatus { Phase: MachinePending, TimeoutActive: true, LastUpdateTime: Now() } lastKnownState = clone.Status.LastKnownState "]-->UpdateMachineStatus style InitFailedOp text-align:left
controller.triggerDeletionFlow
func (c *controller) triggerDeletionFlow(ctx context.Context, dmr *driver.DeleteMachineRequest) (machineutils.RetryPeriod, error)
Please note that there is a sad use of machine.Status.LastOperation as, semantically, the next requested operation. This is confusing. TODO: Discuss this.
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD GM["machine=dmr.Machine\n machineClass=dmr.MachineClass\n secret=dmr.Secret"] HasFin{"HasFinalizer(machine)"} LongR["retryPeriod=machineUtils.LongRetry"] ShortR["retryPeriod=machineUtils.ShortRetry"] ChkMachineTerm{"machine.Status.CurrentStatus.Phase\n==MachineTerminating ?"} CheckMachineOperation{"Check\nmachine.Status.LastOperation.Description"} DrainNode["retryPeriod=c.drainNode(dmr)"] DeleteVM["retryPeriod=c.deleteVM(dmr)"] DeleteNode["retryPeriod=c.deleteNodeObject(dmr)"] DeleteMachineFin["retryPeriod=c.deleteMachineFinalizers(machine)"] SetMachineTermStatus["c.setMachineTerminationStatus(dmr)"] CreateMachineStatusRequest["statusReq=&driver.GetMachineStatusRequest{machine, machineClass,secret}"] GetVMStatus["retryPeriod=c.getVMStatus(statusReq)"] Z(("End")) Begin((" "))-->HasFin HasFin--Yes-->GM HasFin--No-->LongR LongR-->Z GM-->ChkMachineTerm ChkMachineTerm--No-->SetMachineTermStatus ChkMachineTerm--Yes-->CheckMachineOperation SetMachineTermStatus-->ShortR CheckMachineOperation--GetVMStatus-->CreateMachineStatusRequest CheckMachineOperation--InitiateDrain-->DrainNode CheckMachineOperation--InitiateVMDeletion-->DeleteVM CheckMachineOperation--InitiateNodeDeletion-->DeleteNode CheckMachineOperation--InitiateFinalizerRemoval-->DeleteMachineFin CreateMachineStatusRequest-->GetVMStatus GetVMStatus-->Z DrainNode-->Z DeleteVM-->Z DeleteNode-->Z DeleteMachineFin-->Z ShortR-->Z
controller.reconcileMachineHealth
controller.reconcileMachineHealth reconciles the machine object with any change in node conditions or VM health.
func (c *controller) reconcileMachineHealth(ctx context.Context, machine *Machine)
(machineutils.RetryPeriod, error)
NOTES:
- Reference controller.isHealthy which checks the machine status conditions.
Health Check Flow Diagram
See What are the different phases of a Machine
Health Check Summary
- Gets the Node object associated with the machine. If it IS NOT found, yet the current machine phase is Running, change the machine phase to Unknown, the last operation state to Processing, the last operation type to HealthCheck, update the machine status and return with a short retry.
- If the Node object IS found, then check whether the Machine.Status.Conditions are different from Node.Status.Conditions. If so, set the machine conditions to the node conditions.
- If the machine IS NOT healthy (see isHealthy) but the current machine phase is Running, change the machine phase to Unknown, the last operation state to Processing, the last operation type to HealthCheck, update the machine status and return with a short retry.
- If the machine IS healthy but the current machine phase is NOT Running and the machine's node does not have the node.gardener.cloud/critical-components-not-ready taint, check whether the last operation type was a Create.
  - If the last operation type was a Create and the last operation state is NOT marked as Successful, then delete the bootstrap token associated with the machine. Change the last operation state to Successful. Let the last operation type continue to remain as Create.
  - If the last operation type was NOT a Create, change the last operation type to HealthCheck.
  - Change the machine phase to Running, update the machine status and return with a short retry.
  - (The above 2 cases take care of a newly created machine and a machine that became OK after some temporary issue.)
- If the current machine phase is Pending (ie the machine is being created: see triggerCreationFlow), get the configured machine creation timeout and check.
  - If the timeout HAS NOT expired, enqueue the machine key on the machine work queue after 1m.
  - If the timeout HAS expired, then change the last operation state to Failed and the machine phase to Failed. Update the machine status and return with a short retry.
- If the current machine phase is Unknown, get the effective machine health timeout and check.
  - If the timeout HAS NOT expired, enqueue the machine key on the machine work queue after 1m.
  - If the timeout HAS expired:
    - Get the machine deployment name machineDeployName := machine.Labels['name'] corresponding to this machine.
    - Register ONE permit with machineDeployName. See Permit Giver. Q: Background of this change? Couldn't we find a better way to throttle via work-queues instead of a complicated PermitGiver and go-routines? Even a simple lock would be OK here, right?
    - Attempt to get ONE permit for machineDeployName using a lockAcquireTimeout of 1s.
      - Throttle to check whether the machine CAN be marked as Failed using markable, err := controller.canMarkMachineFailed.
      - If the machine can be marked, change the last operation state (ie the health check) to Failed, preserve the last operation type, and change the machine phase to Failed. Update the machine status. See c.updateMachineToFailedState.
      - Then use wait.Poll with 100ms as pollInterval and 1s as cacheUpdateTimeout, using the following poll condition function (see the sketch after this list):
        - Get the machine from the machineLister (which uses the cache of the shared informer).
        - Return true if machine.Status.CurrentStatus.Phase is Failed or Terminating, or the machine is not found.
        - Return false otherwise.
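A minimal sketch of that final wait.Poll step (illustrative only; it assumes the machineLister access shown in the earlier diagrams and the v1alpha1 phase constants):

// Sketch only: wait for the informer cache to reflect the Failed/Terminating phase.
pollInterval := 100 * time.Millisecond
cacheUpdateTimeout := 1 * time.Second
err := wait.Poll(pollInterval, cacheUpdateTimeout, func() (bool, error) {
	m, err := c.machineLister.Machines(machine.Namespace).Get(machine.Name)
	if apierrors.IsNotFound(err) {
		return true, nil // machine already gone
	}
	if err != nil {
		return false, err
	}
	phase := m.Status.CurrentStatus.Phase
	return phase == v1alpha1.MachineFailed || phase == v1alpha1.MachineTerminating, nil
})
if err != nil {
	klog.Warningf("cache did not reflect the Failed phase in time: %v", err)
}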
Health Check Doubts
- TODO: Why don't we check the machine health using Driver.GetMachineStatus in the machine health reconcile? (Seems like an obvious thing to do and would have helped in those meltdown issues where a machine was incorrectly marked as failed.)
- TODO: Why doesn't this code make use of the helper method c.machineStatusUpdate?
- TODO: Unclear why LastOperation.Description does not use/concatenate one of the predefined constants in machineutils.
- TODO: The code makes too much use of cloneDirty to check whether the machine clone obj has changed, when it could easily return early in several branches.
- TODO: The code directly enqueues machine keys on the machine queue and still returns retry periods to the caller, leading to unnecessary enqueues of machine keys. (Spurious design.)
controller.triggerUpdationFlow
Doesn't seem to be used ? Possibly dead code ?
- Machine Controller Helper Methods
- controller.addBootstrapTokenToUserData
- controller.addMachineFinalizers
- controller.setMachineTerminationStatus
- controller.machineStatusUpdate
- controller.UpdateNodeTerminationCondition
- controller.isHealthy
- controller.getVMStatus
- controller.drainNode
- controller.deleteVM
- controller.deleteNodeObject
- controller.syncMachineNodeTemplates
Machine Controller Helper Methods
controller.addBootstrapTokenToUserData
This method is responsible for adding the bootstrap token for the machine.
Bootstrap tokens are used when joining new nodes to a cluster. Bootstrap Tokens are defined with a specific SecretType: bootstrap.kubernetes.io/token and live in the kube-system namespace. These Secrets are then read by the Bootstrap Authenticator in the API Server.
Reference
func (c *controller) addBootstrapTokenToUserData(ctx context.Context, machineName string, secret *corev1.Secret) error
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Begin((" ")) -->InitTokenSecret[" tokenID := hex.EncodeToString([]byte(machineName)[len(machineName)-5:])[:6] // 6 chars length secretName :='bootstrap-token-' + tokenID "] -->GetSecret[" secret, err = c.targetCoreClient.CoreV1().Secrets('kube-system').Get(ctx, secretName, GetOptions{}) "] -->ChkErr{err!=nil?} ChkErr--Yes-->ChkNotFound{"IsNotFound(err)"} ChkNotFound--Yes-->GenToken[" tokenSecretKey = generateRandomStrOf16Chars "]-->InitSecretData[" data := map[string][]byte{ 'token-id': []byte(tokenID), 'token-secret': []byte(tokenSecretKey), 'expiration': []byte(c.safetyOptions.MachineCreationTimeout.Duration) //..others } "] -->InitSecret[" secret = &corev1.Secret{ ObjectMeta: metav1.ObjectMeta{ Name: secretName, Namespace: metav1.NamespaceSystem, }, Type: 'bootstrap.kubernetes.io/token' Data: data, } "] -->CreateSecret[" secret, err =c.targetCoreClient.CoreV1().Secrets('kube-system').Create(ctx, secret, CreateOptions{}) "] -->ChkErr1{err!=nil?} ChkErr1--Yes-->ReturnErr(("return err")) ChkNotFound--No-->ReturnErr ChkErr1--No-->CreateToken[" token = tokenID + '.' + tokenSecretKey "]-->InitUserData[" userDataByes = secret.Data['userData'] userDataStr = string(userDataBytes) "]-->ReplaceUserData[" userDataS = strings.ReplaceAll(userDataS, 'BOOTSTRAP_TOKEN',placeholder, token) secret.Data['userData'] = []byte(userDataS) //discuss this. "]-->ReturnNil(("return nil")) style InitSecretData text-align:left style InitSecret text-align:left
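For reference, the bootstrap token Secret that the flow above creates has roughly the following shape. This is a sketch: tokenID, tokenSecretKey and expiration are placeholders for values computed earlier in the flow, not literal MCM code.

// Sketch only: shape of the bootstrap token Secret described above.
secret := &corev1.Secret{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "bootstrap-token-" + tokenID,
		Namespace: metav1.NamespaceSystem, // kube-system
	},
	Type: corev1.SecretTypeBootstrapToken, // bootstrap.kubernetes.io/token
	Data: map[string][]byte{
		"token-id":     []byte(tokenID),
		"token-secret": []byte(tokenSecretKey),
		"expiration":   []byte(expiration), // placeholder expiry string
	},
}

The flow then substitutes tokenID + '.' + tokenSecretKey for the BOOTSTRAP_TOKEN placeholder in the machine's userData, as shown in the diagram.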
controller.addMachineFinalizers
This method checks for the MCMFinalizer (value: machine.sapcloud.io/machine-controller-manager) and adds it if it is not present. It leverages the k8s.io/apimachinery/pkg/util/sets package for its work.
This method is regularly called during machine reconciliation, if a machine does not have a deletion timestamp so that all non-deleted machines possess this finalizer.
func (c *controller) addMachineFinalizers(ctx context.Context, machine *v1alpha1.Machine) (machineutils.RetryPeriod, error)
if finalizers := sets.NewString(machine.Finalizers...); !finalizers.Has(MCMFinalizerName) {
finalizers.Insert(MCMFinalizerName)
clone := machine.DeepCopy()
clone.Finalizers = finalizers.List()
_, err := c.controlMachineClient.Machines(clone.Namespace).Update(ctx, clone, metav1.UpdateOptions{})
if err != nil {
// Keep retrying until update goes through
klog.Errorf("Failed to add finalizers for machine %q: %s", machine.Name, err)
} else {
// Return error even when machine object is updated
klog.V(2).Infof("Added finalizer to machine %q with providerID %q and backing node %q", machine.Name, getProviderID(machine), getNodeName(machine))
err = fmt.Errorf("Machine creation in process. Machine finalizers are UPDATED")
}
}
return machineutils.ShortRetry, err
controller.setMachineTerminationStatus
setMachineTerminationStatus sets the machine status to terminating. This is illustrated below. Please note that Machine.Status.LastOperation is set to an instance of the LastOperation struct (which at times appears to be a command for the next action? Discuss this.)
func (c *controller) setMachineTerminationStatus(ctx context.Context, dmr *driver.DeleteMachineRequest) (machineutils.RetryPeriod, error)
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD CreateClone["clone := dmr.Machine.DeepCopy()"] NewCurrStatus["currStatus := &v1alpha1.CurrentStatus{Phase:\n MachineTerminating, LastUpdateTime: time.Now()}"] SetCurrentStatus["clone.Status.CurrentStatus = currStatus"] UpdateStatus["c.controlMachineClient.Machines(ns).UpdateStatus(clone)"] ShortR["retryPeriod=machineUtils.ShortRetry"] Z(("Return")) CreateClone-->NewCurrStatus NewCurrStatus-->SetCurrentStatus SetCurrentStatus-->UpdateStatus UpdateStatus-->ShortR ShortR-->Z
controller.machineStatusUpdate
Updates machine.Status.LastOperation, machine.Status.CurrentStatus and machine.Status.LastKnownState.
func (c *controller) machineStatusUpdate(
ctx context.Context,
machine *v1alpha1.Machine,
lastOperation v1alpha1.LastOperation,
currentStatus v1alpha1.CurrentStatus,
lastKnownState string) error
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD CreateClone["clone := machine.DeepCopy()"] -->InitClone[" clone.Status.LastOperation = lastOperation clone.Status.CurrentStatus = currentStatus clone.Status.LastKnownState = lastKnownState "] -->ChkSimilarStatus{"isMachineStatusSimilar( clone.Status, machine.Status)"} ChkSimilarStatus--No-->UpdateStatus[" err:=c.controlMachineClient .Machines(clone.Namespace) .UpdateStatus(ctx, clone, metav1.UpdateOptions{}) "] -->Z1(("return err")) ChkSimilarStatus--Yes-->Z2(("return nil"))
NOTE: isMachineStatusSimilar implementation is quite sad. TODO: we should improve stuff like this when we move to controller-runtime.
controller.UpdateNodeTerminationCondition
controller.UpdateNodeTerminationCondition adds or updates the termination condition to the Node.Status.Conditions
of the node object corresponding to the machine.
func (c *controller) UpdateNodeTerminationCondition(ctx context.Context, machine *v1alpha1.Machine) error
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Init[" nodeName := machine.Labels['node'] newTermCond := v1.NodeCondition{ Type: machineutils.NodeTerminationCondition, Status: v1.ConditionTrue, LastHeartbeatTime: Now(), LastTransitionTime: Now()}"] -->GetCond["oldTermCond, err := nodeops.GetNodeCondition(ctx, c.targetCoreClient, nodeName, machineutils.NodeTerminationCondition)"] -->ChkIfErr{"err != nil ?"} ChkIfErr--Yes-->ChkNotFound{"apierrors.IsNotFound(err)"} ChkNotFound--Yes-->ReturnNil(("return nil")) ChkNotFound--No-->ReturnErr(("return err")) ChkIfErr--No-->ChkOldTermCondNotNil{"oldTermCond != nil && machine.Status.CurrentStatus.Phase == MachineTerminating ?"} ChkOldTermCondNotNil--No-->ChkMachinePhase{"Check\nmachine\n.Status.CurrentStatus\n.Phase?"} ChkMachinePhase--MachineFailed-->NodeUnhealthy["newTermCond.Reason = machineutils.NodeUnhealthy"] ChkMachinePhase--"else"-->NodeScaleDown["newTermCond.Reason=machineutils.NodeScaledDown //assumes scaledown..why?"] NodeUnhealthy-->UpdateCondOnNode["err=nodeops.AddOrUpdateConditionsOnNode(ctx, c.targetCoreClient, nodeName, newTermCond)"] NodeScaleDown-->UpdateCondOnNode ChkOldTermCondNotNil--Yes-->CopyTermReasonAndMessage[" newTermCond.Reason=oldTermCond.Reason newTermCond.Message=oldTermCond.Message "] CopyTermReasonAndMessage-->UpdateCondOnNode UpdateCondOnNode-->ChkNotFound
controller.isHealthy
Checks if the machine is healthy by checking its conditions.
func (c *controller) isHealthy(machine *v1alpha1.Machine) bool
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Begin((" ")) -->Init[" conditions = machine.Status.Conditions badTypes = strings.Split( 'KernelDeadlock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable', ',') "]-->ChkCondLen{"len(conditions)==0?"} ChkCondLen--Yes-->ReturnF(("return false")) ChkCondLen--No-->IterCond["c:= range conditions"] IterCond-->ChkNodeReady{"c.Type=='Ready' && c.Status != 'True' ?"}--Yes-->ReturnF ChkNodeReady --Yes-->IterBadConditions["badType := range badTypes"] -->ChkType{"badType == c.Type && c.Status != 'False' ?"} --Yes-->ReturnF IterBadConditions--loop-->IterCond ChkType--loop-->IterBadConditions style Init text-align:left
NOTE
- controller.NodeConditions should be called controller.BadConditionTypes
- Iterate over machine.Status.Conditions
  - If the Ready condition is not True, the node is determined to be un-healthy.
  - If any of the bad condition types are detected, the node is determined to be un-healthy.
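A condensed sketch of this condition scan. It assumes the machine conditions carry the core NodeCondition type (the flow compares them against Node.Status.Conditions), and badTypes stands for the bad condition types listed in the diagram.

// Sketch only: false if Ready is not True, or any bad condition type is not False.
func isHealthySketch(conditions []corev1.NodeCondition, badTypes map[corev1.NodeConditionType]bool) bool {
	if len(conditions) == 0 {
		return false
	}
	for _, c := range conditions {
		if c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue {
			return false
		}
		if badTypes[c.Type] && c.Status != corev1.ConditionFalse {
			return false
		}
	}
	return true
}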
controller.getVMStatus
(BAD NAME FOR METHOD: should be called checkMachineExistenceAndEnqueueNextOperation)
func (c *controller) getVMStatus(ctx context.Context,
statusReq *driver.GetMachineStatusRequest) (machineutils.RetryPeriod, error)
This method is only called for the delete flow.
1. It attempts to get the machine status.
2. If the machine exists, it updates the machine status operation to InitiateDrain and returns a ShortRetry for the machine work queue.
3. If the attempt to get the machine status failed, it obtains the error code from the error.
  - For Unimplemented (ie the GetMachineStatus op is not implemented), it does the same as 2: it updates the machine status operation to InitiateDrain and returns a ShortRetry for the machine work queue.
  - If decoding the error code failed, it updates the machine status operation to machineutils.GetVMStatus and returns a LongRetry for the machine key into the machine work queue.
    - Unsure how we get out of this loop. TODO: Discuss this.
  - For Unknown|DeadlineExceeded|Aborted|Unavailable it updates the machine status operation to machineutils.GetVMStatus and returns a ShortRetry for the machine work queue (so that reconcile will run this method again in the future).
  - For NotFound (ie the machine is not found), it enqueues node deletion by updating the machine status operation to machineutils.InitiateNodeDeletion and returning a ShortRetry for the machine work queue.
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD GetMachineStatus["_,err=driver.GetMachineStatus(statusReq)"] ChkMachineExists{"err==nil ?\n (ie machine exists)"} DecodeErrorStatus["errStatus,decodeOk= status.FromError(err)"] CheckDecodeOk{"decodeOk ?"} CheckErrStatusCode{"Check errStatus.Code"} CreateDrainOp["op:=LastOperation{Description: machineutils.InitiateDrain State: v1alpha1.MachineStateProcessing, Type: v1alpha1.MachineOperationDelete, Time: time.Now()}"] CreateNodeDelOp["op:=LastOperation{Description: machineutils.InitiateNodeDeletion State: v1alpha1.MachineStateProcessing, Type: v1alpha1.MachineOperationDelete, Time: time.Now()}"] CreateDecodeFailedOp["op:=LastOperation{Description: machineutils.GetVMStatus, State: v1alpha1.MachineStateFailed, Type: v1alpha1.MachineOperationDelete, Time: time.Now()}"] CreaterRetryVMStatusOp["op:=LastOperation{Description: ma1chineutils.GetVMStatus, State: v1alpha1.MachineStateFailed, Type: v1alpha1.MachineOperationDelete, Time: time.Now()}"] ShortR["retryPeriod=machineUtils.ShortRetry"] LongR["retryPeriod=machineUtils.LongRetry"] UpdateMachineStatus["c.machineStatusUpdate(machine,op,machine.Status.CurrentStatus, machine.Status.LastKnownState)"] Z(("End")) GetMachineStatus-->ChkMachineExists ChkMachineExists--Yes-->CreateDrainOp ChkMachineExists--No-->DecodeErrorStatus DecodeErrorStatus-->CheckDecodeOk CheckDecodeOk--Yes-->CheckErrStatusCode CheckDecodeOk--No-->CreateDecodeFailedOp CreateDecodeFailedOp-->LongR CheckErrStatusCode--"Unimplemented"-->CreateDrainOp CheckErrStatusCode--"Unknown|DeadlineExceeded|Aborted|Unavailable"-->CreaterRetryVMStatusOp CheckErrStatusCode--"NotFound"-->CreateNodeDelOp CreaterRetryVMStatusOp-->ShortR CreateDrainOp-->ShortR CreateNodeDelOp-->ShortR ShortR-->UpdateMachineStatus LongR-->UpdateMachineStatus UpdateMachineStatus-->Z
controller.drainNode
Inside pkg/util/provider/machinecontroller/machine_util.go
func (c *controller) drainNode(ctx context.Context, dmr *driver.DeleteMachineRequest) (machineutils.RetryPeriod, error)
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Initialize["err = nil machine = dmr.Machine nodeName= machine.Labels['node'] drainTimeout=machine.Spec.MachineConfiguration.MachineDrainTimeout || c.safetyOptions.MachineDrainTimeout maxEvictRetries=machine.Spec.MachineConfiguration.MaxEvictRetries || c.safetyOptions.MaxEvictRetries skipDrain = false"] -->GetNodeReadyCond["nodeReadyCond = machine.Status.Conditions contains k8s.io/api/core/v1/NodeReady readOnlyFSCond=machine.Status.Conditions contains 'ReadonlyFilesystem' "] -->ChkNodeNotReady["skipDrain = (nodeReadyCond.Status == ConditionFalse) && nodeReadyCondition.LastTransitionTime.Time > 5m or (readOnlyFSCond.Status == ConditionTrue) && readOnlyFSCond.LastTransitionTime.Time > 5m // discuss this "] -->ChkSkipDrain{"skipDrain true?"} ChkSkipDrain--Yes-->SetOpStateProcessing ChkSkipDrain--No-->SetHasDrainTimedOut["hasDrainTimedOut = time.Now() > machine.DeletionTimestamp + drainTimeout"] SetHasDrainTimedOut-->ChkForceDelOrTimedOut{"machine.Labels['force-deletion'] || hasDrainTimedOut"} ChkForceDelOrTimedOut--Yes-->SetForceDelParams[" forceDeletePods=true drainTimeout=1m maxEvictRetries=1 "] SetForceDelParams-->UpdateNodeTermCond["err=c.UpdateNodeTerminationCondition(ctx, machine)"] ChkForceDelOrTimedOut--No-->UpdateNodeTermCond UpdateNodeTermCond-->ChkUpdateErr{"err != nil ?"} ChkUpdateErr--No-->InitDrainOpts[" // params reduced for brevity drainOptions := drain.NewDrainOptions( c.targetCoreClient, drainTimeout, maxEvictRetries, c.safetyOptions.PvDetachTimeout.Duration, c.safetyOptions.PvReattachTimeout.Duration, nodeName, forceDeletePods, c.driver, c.pvcLister, c.pvLister, c.pdbV1Lister, c.nodeLister, c.volumeAttachmentHandler) "] ChkUpdateErr--"Yes&&forceDelPods"-->InitDrainOpts ChkUpdateErr--Yes-->SetOpStateFailed["opstate = v1alpha1.MachineStateFailed description=machineutils.InitiateDrain //drain failed. retry next sync "] InitDrainOpts-->RunDrain["err = drainOptions.RunDrain(ctx)"] RunDrain-->ChkDrainErr{"err!=nil?"} ChkDrainErr--No-->SetOpStateProcessing[" opstate= v1alpha1.MachineStateProcessing description=machineutils.InitiateVMDeletion // proceed with vm deletion"] ChkDrainErr--"Yes && forceDeletePods"-->SetOpStateProcessing ChkDrainErr--Yes-->SetOpStateFailed SetOpStateProcessing--> InitLastOp["lastOp:=v1alpha1.LastOperation{ Description: description, State: state, Type: v1alpha1.MachineOperationDelete, LastUpdateTime: metav1.Now(), } //lastOp is actually the *next* op semantically"] SetOpStateFailed-->InitLastOp InitLastOp-->UpdateMachineStatus["c.machineStatusUpdate(ctx,machine,lastOp,machine.Status.CurrentStatus,machine.Status.LastKnownState)"] -->Return(("machineutils.ShortRetry, err"))
Notes on the above:
- We skip the drain if the node has been in the ReadonlyFilesystem condition for over 5 minutes.
  - Check TODO: ReadonlyFilesystem is an MCM condition and not a k8s core node condition. Not sure if we are mis-using this field. TODO: Check this.
- Check TODO: Why do we check that the node has been not-ready for 5m in order to skip the drain? Shouldn't we skip the drain if the node is simply not ready? Why wait for 5m here?
- See Run Drain
controller.deleteVM
Called by controller.triggerDeletionFlow
func (c *controller) deleteVM(ctx context.Context, dmReq *driver.DeleteMachineRequest)
(machineutils.RetryPeriod, error)
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Begin((" ")) -->Init[" machine = dmr.Machine "] -->CallDeleteMachine[" dmResp, err := c.driver.DeleteMachine(ctx, dmReq) "] -->ChkDelErr{"err!=nil?"} ChkDelErr--No-->SetSuccShortRetry[" retryPeriod = machineutils.ShortRetry description = 'VM deletion was successful.'+ machineutils.InitiateNodeDeletion) state = MachineStateProcessing "] -->InitLastOp[" lastOp := LastOperation{ Description: description, State: state, Type: MachineOperationDelete, LastUpdateTime: Now(), }, "] -->SetLastKnownState[" // useless since drivers impls dont set this?. // Use machine.Status.LastKnownState instead ? lastKnownState = dmResp.LastKnownState "] -->UpdateMachineStatus[" //Discuss: Introduce finer grained phase for status ? c.machineStatusUpdate(ctx,machine,lastOp,machine.Status.CurrentStatus,lastKnownState)"]-->Return(("retryPeriod, err")) ChkDelErr--Yes-->DecodeErrorStatus[" errStatus,decodeOk= status.FromError(err) "] DecodeErrorStatus-->CheckDecodeOk{"decodeOk ?"} CheckDecodeOk--No-->SetFailed[" state = MachineStateFailed description = 'machine decode error' + machineutils.InitiateVMDeletion "]-->SetLongRetry[" retryPeriod= machineutils.LongRetry "]-->InitLastOp CheckDecodeOk--Yes-->AnalyzeCode{status.Code?} AnalyzeCode--Unknown, DeadlineExceeded,Aborted,Unavailable-->SetDelFailed[" state = MachineStateFailed description = VM deletion failed due to + err + machineutils.InitiateVMDeletion"] -->SetShortRetry[" retryPeriod= machineutils.ShortRetry "]-->InitLastOp AnalyzeCode--NotFound-->DelSuccess[" // we can proceed with deleting node. description = 'VM not found. Continuing deletion flow'+ machineutils.InitiateNodeDeletion state = MachineStateProcessing "]-->SetShortRetry AnalyzeCode--default-->DelUnknown[" state = MachineStateFailed description='VM deletion failed due to' + err + 'Aborting..' + machineutils.InitiateVMDeletion "]-->SetLongRetry
controller.deleteNodeObject
NOTE: Should have just been called controller.deleteNode for naming consistency with other methods.
Called by triggerDeletionFlow after successfully deleting the VM.
func (c *controller) deleteNodeObject(ctx context.Context,
machine *v1alpha1.Machine)
(machineutils.RetryPeriod, error)
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Begin((" ")) -->Init[" nodeName := machine.Labels['node'] "] -->ChkNodeName{"nodeName != ''" ?} ChkNodeName--Yes-->DelNode[" err = c.targetCoreClient.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{}) "]-->ChkDelErr{"err!=nil?"} ChkNodeName--No-->NodeObjNotFound[" state = MachineStateProcessing description = 'No node object found for' + nodeName + 'Continue' + machineutils.InitiateFinalizerRemoval "]-->InitLastOp ChkDelErr--No-->DelSuccess[" state = MachineStateProcessing description = 'Deletion of Node' + nodeName + 'successful' + machineutils.InitiateFinalizerRemoval "]-->InitLastOp ChkDelErr--Yes&&!apierrorsIsNotFound-->FailedNodeDel[" state = MachineStateFailed description = 'Deletion of Node' + nodeName + 'failed due to' + err + machineutils.InitiateNodeDeletion "] -->InitLastOp[" lastOp := LastOperation{ Description: description, State: state, Type: MachineOperationDelete, LastUpdateTime: Now(), }, "] -->UpdateMachineStatus[" c.machineStatusUpdate(ctx,machine,lastOp,machine.Status.CurrentStatus,machine.Status.LastKnownState)"] -->Return(("machineutils.ShortRetry, err"))
controller.syncMachineNodeTemplates
func (c *controller) syncMachineNodeTemplates(ctx context.Context,
machine *v1alpha1.Machine)
(machineutils.RetryPeriod, error)
See MachineSpec
syncMachineNodeTemplates syncs machine.Spec.NodeTemplateSpec between the machine and the corresponding node object. A NodeTemplateSpec just wraps a core NodeSpec and an ObjectMetadata.
Get the node for the given machine and then:
- Synchronize node.Annotations to machine.Spec.NodeTemplateSpec.Annotations.
- Synchronize node.Labels to machine.Spec.NodeTemplateSpec.Labels.
- Synchronize node.Spec.Taints to machine.Spec.NodeTemplateSpec.Spec.Taints.
- Update the node object if there were changes.
Since we should not delete third-party annotations on the node object, synchronizing deleted ALT (annotations, labels, taints) from the machine object is a bit tricky, so we maintain yet another custom annotation node.machine.sapcloud.io/last-applied-anno-labels-taints (code constant LastAppliedALTAnnotation) on the Node object, which is a JSON string of the NodeTemplateSpec. A sketch of this bookkeeping follows this list.
- Before synchronizing, we un-marshal this LastAppliedALTAnnotation value into a lastAppliedALT of type NodeTemplateSpec.
- While synchronizing annotations, labels and taints, we check respectively whether lastAppliedALT.Annotations, lastAppliedALT.Labels and lastAppliedALT.Spec.Taints hold keys that are NOT in the corresponding machine.Spec.NodeTemplateSpec.Annotations, machine.Spec.NodeTemplateSpec.Labels and machine.Spec.NodeTemplateSpec.Spec.Taints.
  - If so, we delete those keys from the corresponding node.Annotations, node.Labels and node.Spec.Taints respectively.
  - We maintain a boolean saying the LastAppliedALTAnnotation needs to be updated.
- Just before updating the Node object, we check whether LastAppliedALTAnnotation needs updating and, if so, we JSON-marshal machine.Spec.NodeTemplateSpec and override LastAppliedALTAnnotation with this value.
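A small sketch of the LastAppliedALTAnnotation bookkeeping for annotations (labels and taints follow the same pattern). The names come from the description above, but the code is illustrative only and not the verbatim implementation; the annotation key is passed in explicitly since its package location is not shown here.

// Sketch only: prune node annotations applied by a previous sync that are no longer
// desired on the machine's NodeTemplateSpec, then refresh the last-applied marker.
func syncAnnotationsSketch(machine *v1alpha1.Machine, node *corev1.Node, lastAppliedALTKey string) error {
	var lastAppliedALT v1alpha1.NodeTemplateSpec
	if raw, ok := node.Annotations[lastAppliedALTKey]; ok {
		if err := json.Unmarshal([]byte(raw), &lastAppliedALT); err != nil {
			return err
		}
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	// Drop keys that a previous sync applied but which are no longer desired.
	for key := range lastAppliedALT.Annotations {
		if _, ok := machine.Spec.NodeTemplateSpec.Annotations[key]; !ok {
			delete(node.Annotations, key)
		}
	}
	// Apply (or overwrite) the currently desired annotations.
	for key, val := range machine.Spec.NodeTemplateSpec.Annotations {
		node.Annotations[key] = val
	}
	// Remember what was applied so the next sync can detect deletions again.
	marshaled, err := json.Marshal(machine.Spec.NodeTemplateSpec)
	if err != nil {
		return err
	}
	node.Annotations[lastAppliedALTKey] = string(marshaled)
	return nil
}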
- Node Drain
- Drain Utilities
- Drain
- Drain Types
- drain.PodVolumeInfo
- drain.Options.evictPod
- drain.Options.deletePod
- drain.Options.getPodsForDeletion
- drain.Options.getPVList
- drain.Options.getVolIDsFromDriver
- drain.Options.doAccountingOfPvs
- drain.filterPodsWithPv
- drain.Options.waitForDetach
- drain.Options.waitForReattach
- drain.Options.waitForDelete
- drain.Options.RunDrain
- drain.Options.evictPodsWithoutPv
- drain.Options.evictPodsWithPv
- drain.Options.evictPodsWithPVInternal
Node Drain
Node Drain code is in github.com/gardener/machine-controller-manager/pkg/util/provider/drain/drain.go
Drain Utilities
VolumeAttachmentHandler
pkg/util/provider/drain.VolumeAttachmentHandler is a handler used to distribute incoming k8s.io/api/storage/v1.VolumeAttachment requests to a number of workers, where each worker is a channel of type *VolumeAttachment.
A k8s.io/api/storage/v1.VolumeAttachment is a non-namespaced k8s object that captures the intent to attach or detach the specified volume to/from the specified node. See VolumeAttachment
type VolumeAttachmentHandler struct {
sync.Mutex
workers []chan *storagev1.VolumeAttachment
}
// NewVolumeAttachmentHandler returns a new VolumeAttachmentHandler
func NewVolumeAttachmentHandler() *VolumeAttachmentHandler {
return &VolumeAttachmentHandler{
Mutex: sync.Mutex{},
workers: []chan *storagev1.VolumeAttachment{},
}
}
VolumeAttachmentHandler.AddWorker
AddWorker appends a buffered channel of size 20 of type VolumeAttachment to the workers slice in VolumeAttachmentHandler. There is an assumption that no more than 20 unprocessed objects exist at a given time. On buffering requests beyond this, the channel will start dropping writes. See the dispatch method.
func (v *VolumeAttachmentHandler) AddWorker() chan *storagev1.VolumeAttachment {
// chanSize is the channel buffer size to hold requests.
// This assumes that no more than chanSize unprocessed objects exist at a given time.
// On buffering requests beyond this, the channel will start dropping writes.
const chanSize = 20
klog.V(4).Infof("Adding new worker. Current active workers %d - %v", len(v.workers), v.workers)
v.Lock()
defer v.Unlock()
newWorker := make(chan *storagev1.VolumeAttachment, chanSize)
v.workers = append(v.workers, newWorker)
klog.V(4).Infof("Successfully added new worker %v. Current active workers %d - %v", newWorker, len(v.workers), v.workers)
return newWorker
}
VolumeAttachmentHandler.dispatch
The dispatch method is responsible for distributing incoming VolumeAttachments to the available worker channels.
func (v *VolumeAttachmentHandler) dispatch(obj interface{}) {
if len(v.workers) == 0 {
// As no workers are registered, nothing to do here.
return
}
volumeAttachment := obj.(*storagev1.VolumeAttachment)
v.Lock()
defer v.Unlock()
for i, worker := range v.workers {
select {
// submit volume attachment to the worker channel if channel is not full
case worker <- volumeAttachment:
default:
klog.Warningf("Worker %d/%v is full. Discarding value.", i, worker)
// TODO: Umm..isn't this problematic if we miss this ?
}
}
}
The Add|Update methods below delegate to dispatch. The usage of this utility involves specifying the add/update methods below as the event handler callbacks on an instance of k8s.io/client-go/informers/storage/v1.VolumeAttachmentInformer. This way incoming volume attachments are distributed to several worker channels.
func (v *VolumeAttachmentHandler) AddVolumeAttachment(obj interface{}) {
v.dispatch(obj)
}
func (v *VolumeAttachmentHandler) UpdateVolumeAttachment(oldObj, newObj interface{}) {
v.dispatch(newObj)
}
VolumeAttachmentHandler Initialization in MC
During construction of the MC controller struct, we initialize the callback methods on volume attachment handler using the volume attachment informer
func NewController(...) {
//...
controller.volumeAttachmentHandler = drain.NewVolumeAttachmentHandler()
volumeAttachmentInformer.Informer().AddEventHandler(
cache.ResourceEventHandlerFuncs{
AddFunc: controller.volumeAttachmentHandler.AddVolumeAttachment,
UpdateFunc: controller.volumeAttachmentHandler.UpdateVolumeAttachment,
});
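From the consuming side (for example while waiting for re-attachment during drain), usage looks roughly like the sketch below. This is illustrative only: ctx and the surrounding function are assumed, and releasing the worker channel afterwards is omitted.

// Sketch only: register a worker channel and consume VolumeAttachment events.
volumeAttachmentEventCh := c.volumeAttachmentHandler.AddWorker()
for {
	select {
	case <-ctx.Done():
		return ctx.Err() // caller-supplied timeout/cancellation
	case incoming := <-volumeAttachmentEventCh:
		if incoming == nil {
			continue
		}
		klog.V(4).Infof("Observed VolumeAttachment %s for node %s", incoming.Name, incoming.Spec.NodeName)
		// ... inspect incoming.Spec.Source / incoming.Status.Attached as waitForReattach does ...
	}
}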
Drain
Drain Types
Drain Constants
- PodEvictionRetryInterval is the interval at which to retry pod eviction.
- GetPvDetailsMaxRetries is the maximum number of retries to get PV details using the PersistentVolumeLister or PersistentVolumeClaimLister.
- GetPvDetailsRetryInterval is the interval at which to retry getting PV details.
const (
PodEvictionRetryInterval = time.Second * 20
GetPvDetailsMaxRetries = 3
GetPvDetailsRetryInterval = time.Second * 5
)
drain.Options
drain.Options
are configurable options while draining a node before deletion
NOTE: Unused fields/Fields with constant defaults omitted for brevity
type Options struct {
client kubernetes.Interface
kubernetesVersion *semver.Version
Driver driver.Driver
drainStartedOn time.Time
drainEndedOn time.Time
ErrOut io.Writer
ForceDeletePods bool
MaxEvictRetries int32
PvDetachTimeout time.Duration
PvReattachTimeout time.Duration
nodeName string
Out io.Writer
pvcLister corelisters.PersistentVolumeClaimLister
pvLister corelisters.PersistentVolumeLister
pdbV1Lister policyv1listers.PodDisruptionBudgetLister
nodeLister corelisters.NodeLister
volumeAttachmentHandler *VolumeAttachmentHandler
Timeout time.Duration
}
drain.PodVolumeInfo
drain.PodVolumeInfo is the struct used to encapsulate the PV names and PV IDs for all the PVs attached to the pod.
type PodVolumeInfo struct {
persistentVolumeList []string
volumeList []string
}
NOTE: The struct fields are badly named.
- PodVolumeInfo.persistentVolumeList is a slice of persistent volume names. This comes from PersistentVolumeSpec.VolumeName.
- PodVolumeInfo.volumeList is a slice of persistent volume IDs. This is obtained using driver.GetVolumeIDs given the PV Spec. This is generally the CSI volume id.
drain.Options.evictPod
func (o *Options) evictPod(ctx context.Context, pod *corev1.Pod, policyGroupVersion string) error
drain.Options.evictPod is a simple helper method to evict a Pod using Eviction API
- TODO: GracePeriodSeconds in the code is useless here and should be removed as it is always -1.
- TODO: Currently this method uses the old k8s.io/api/policy/v1beta1. It must be changed to k8s.io/api/policy/v1.
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Begin(("" )) -->InitTypeMeta[" typeMeta:= metav1.TypeMeta{ APIVersion: policyGroupVersion, Kind: 'Eviction', }, "] -->InitObjectMeta[" objectMeta := ObjectMeta: metav1.ObjectMeta{ Name: pod.Name, Namespace: pod.Namespace, }, "] -->InitEviction[" eviction := &.Eviction{TypeMeta: typeMeta, ObjectMeta: objectMeta } "] -->EvictPod[" err := o.client.PolicyV1beta1().Evictions(eviction.Namespace).Evict(ctx, eviction) "] -->ReturnErr(("return err")) style InitEviction text-align:left
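A sketch of what the policy/v1 variant suggested by the TODO could look like (illustrative only, not the current MCM code):

// Sketch only: evict a pod using the policy/v1 Eviction subresource.
func evictPodV1(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name,
			Namespace: pod.Namespace,
		},
	}
	return client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction)
}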
drain.Options.deletePod
Simple helper method to delete a Pod
func (o *Options) deletePod(ctx context.Context, pod *corev1.Pod) error {
Just delegates to PodInterface.Delete
o.client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{} )
drain.Options.getPodsForDeletion
drain.getPodsForDeletion
returns all the pods we're going to delete. If there are any pods preventing us from deleting, we return that list in an error.
func (o *Options) getPodsForDeletion(ctx context.Context)
(pods []corev1.Pod, err error)
- Get all pods associated with the node.
podList, err := o.client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{ FieldSelector: fields.SelectorFromSet(fields.Set{"spec.nodeName": o.nodeName}).String()})
- Iterate through
podList
. - Apply a bunch of pod filters.
- Remove mirror pods from consideration for deletion. See Static Pods
- Local Storage Filter. Discuss: seems useless. If Pod has local storage, remove it from consideration. This filter iterates through
Pod.Spec.Volumes
slice and checks whetherVolume.EmptyDir
is non nil in order to determine - A Pod whose
Pod.Status.Phase
is PodSucceeded or PodFailed is eligible for deletion- If a Pod has a controller owner reference, it is eligible for deletion. (TODO: Unsure why this makes a difference anyways)
- The final pod filter
daemonsetFilter
seems useless. Discuss.
drain.Options.getPVList
NOTE: Should be called getPVNames
. Gets a slice of the persistent volume names bound to the Pod through its claims. Contains time.sleep and retry handling to a limit. Unsure if this is the best way. Discuss.
func (o *Options) getPVList(pod *corev1.Pod) (pvNames []string, err error)
- Iterate over
pod.Spec.Volumes
. - If
volume.PersistentVolumeClaim
reference is not nil, gets thePersistentVolumeClaim
usingo.pvcLister
usingvol.PersistentVolumeClaim.ClaimName
.- Implements error handling and retry till
GetPvDetailsMaxRetries
is reached with intervalGetPvDetailsRetryInterval
for the above.
- Implements error handling and retry till
- Adds
pvc.Spec.VolumeName
topvNames
- Return
pvNames
drain.Options.getVolIDsFromDriver
Given a slice of PV Names, this method gets the corresponding volume ids from the driver.
- It does this by first getting the PersistentVolumeSpec using
o.pvLister.Get(pvName)
for each PV name and adding to thepvSpecs
slice of typePersistentVolumeSpec
. See k8s.io/client-go/listers/core/v1.PersistentVolumeLister - Retry handling is implemented here while looking up pvName till
GetPvDetailsMaxRetries
is reached with sleep interval ofGetPvDetailsRetryInterval
between each retry attempt. - Once
pvSpecs
slice is populated it constructs a driver.GetVolumeIDsRequest from the same and then invokesdriver.GetVolumeIDs(driver.GetVolumeIDsRequest))
to obtain the driver.GetVolumeIDsResponse and returns driver.GetVolumeIDsResponse.VolumeIDs
TODO: BUG? In case the PV is not found or the retry limit is reached, the slice of volume ids will not have a 1:1 correspondence with the slice of PV names passed in.
func (o *Options) getVolIDsFromDriver(ctx context.Context, pvNames []string) ([]string, error)
drain.Options.doAccountingOfPvs
drain.Options.doAccountingOfPvs returns a map of the pod key (pod.Namespace + '/' + pod.Name) to a PodVolumeInfo struct which holds a slice of PV names and PV IDs.
NOTES:
- See filterSharedPVs
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Begin((" ")) -->Init[" podKey2VolNamesMap = make(map[string][]string) podKey2VolInfoMap = make(map[string]PodVolumeInfo) "] -->RangePods[" for pod := range pods "] -->PopPod2VolNames[" podKey2VolNamesMap[pod.Namespace + '/' pod.Name] = o.getPVList(pod) "] --loop-->RangePods PopPod2VolNames--done-->FilterSharedPVs[" filterSharedPVs(podKey2VolNamesMap) // filters out the PVs that are shared among pods. "] -->RangePodKey2VolNamesMap[" for podKey, volNames := range podKey2VolNamesMap "] -->GetVolumeIds[" volumeIds, err := o.getVolIDsFromDriver(ctx, volNames) if err != nil continue; //skip set of volumes "] -->InitPodVolInfo[" podVolumeInfo := PodVolumeInfo{ persistentVolumeList: volNames, volumeList: volumeIds } //struct field names are bad. "] -->PopPodVolInfoMap[" podVolumeInfoMap[podKey] = podVolumeInfo "] --loop-->RangePodKey2VolNamesMap PopPodVolInfoMap--done-->Return(("return podVolumeInfoMap"))
drain.filterPodsWithPv
NOTE: should have been named partitionPodsWithPVC
Utility function that iterates through given pods
and for each pod
, iterates through its pod.Spec.Volumes
. For each such pod volume
checks volume.PersistentVolumeClaim
. If not nil, adds pod
to slice podsWithPV
else adds pod
to slice podsWithoutPV
func filterPodsWithPv(pods []corev1.Pod)
(podsWithPV []*corev1.Pod, podsWithoutPV []*corev1.Pod)
drain.Options.waitForDetach
func (o *Options) waitForDetach(ctx context.Context,
podVolumeInfo PodVolumeInfo,
nodeName string) error
Summary:
- Initialize a boolean
found
to true. (Representing that a volume is still attached to a node). - Begins a loop while
found
is true - Uses a
select
and checks to see if a signal is received fromcontext.Done()
(ie context cancelled). If so, return an error with the message that a timeout occurred while waiting for PV's to be detached. - Sets
found
to false. - Gets the
Node
associated withnodeName
using thenodeLister
and assign tonode
. If there is an error return from the function. - Gets the
node.Status.VolumesAttached
which returns a []AttachedVolume and assign toattachedVols
.- Return
nil
if this slice is empty.
- Return
- Begin iteration
range podVolumeInfo.volumeList
assigningvolumeID
in parent iteration. Label this iteration asLookUpVolume
.- Begin inner iteration over
attachedVols
assigningattachedVol
in nested iteration - If
attachedVol.Name
is contained involumeID
then this volume is still attached to the node.- Set
found
to true - Sleep for
VolumeDetachPollInterval
seconds (5 seconds) - Break out of
LookUpVolume
- Set
- Begin inner iteration over
drain.Options.waitForReattach
Purpose: Waits for persistent volume names in podVolumeInfo.persistentVolumeList
to be re-attached (to another node).
But I am still confused on why we need to call this during the drain flow. Why should we wait for PVs to be attached to another node. After all, there is no guarantee they will be attached, right ?
func (o *Options) waitForReattach(ctx context.Context,
podVolumeInfo PodVolumeInfo,
previousNodeName string,
volumeAttachmentEventCh chan *VolumeAttachment) error
- Construct a map:
var pvsWaitingForReattachments map[string]bool
- Initialize the above map by ranging through
podVolumeInfo.persistentVolumeList
taking thepersistentVolumeName
and setpvsWaitingForReattachments[persistentVolumeName] = true
- Start a
for
loop.- Commence a
select
with following cases:- Case: Check to see if context is closed/cancelled by reading:
<-ctx.Done()
. If so, return an error with the message that timeout occurred while waiting for PV's to reattach to another node. - Case: Obtain a *VolumeAttachment by reading from channel:
incomingEvent := <-volumeAttachmentEventCh
- Get the
persistentVolumeName
associated with this attachment event. persistentVolumeName := *incomingEvent.Spec.Source.PersistentVolumeName
- Check if this persistent volume was being tracked:
pvsWaitingForReattachments[persistentVolumeName]
is present - Check if the volume was attached to another node
incomingEvent.Status.Attached && incomingEvent.Spec.NodeName != previousNodeName
- If above is true, then delete entry corresponding to
persistentVolumeName
from thepvsWaitingForReattachments
map.
- Get the
- Case: Check to see if context is closed/cancelled by reading:
- if
pvsWaitingForReattachments
is empty break from the loop.
- Commence a
- Log that the volumes in
podVolumeInfo.persistentVolumeList
have been successfully re-attached and return nil.
drain.Options.waitForDelete
NOTE: Ideally should have been named waitForPodDisappearance
pkg/util/provider/drain.Options.waitForDelete is a helper method defined on drain.Options that leverages wait.PollImmediate and the getPodFn (get pod by name and namespace) and checks that all pods have disappeared within timeout. The set of pods that did not disappear within timeout is returned as pendingPods.
func (o *Options) waitForDelete(
pods []*corev1.Pod, interval,
timeout time.Duration,
getPodFn func(string, string) (*corev1.Pod, error)
) (pendingPods []*corev1.Pod, err error)
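A minimal sketch of this pattern, using wait.PollImmediate and the getPodFn from the signature above (illustrative only, not the verbatim implementation):

// Sketch only: poll until all pods are gone or the deadline expires; collect stragglers.
err = wait.PollImmediate(interval, timeout, func() (bool, error) {
	pendingPods = pendingPods[:0]
	for _, pod := range pods {
		p, getErr := getPodFn(pod.Namespace, pod.Name)
		if apierrors.IsNotFound(getErr) || (p != nil && p.ObjectMeta.UID != pod.ObjectMeta.UID) {
			continue // pod is gone, or replaced by a new pod with the same name
		}
		if getErr != nil {
			return false, getErr
		}
		pendingPods = append(pendingPods, pod)
	}
	return len(pendingPods) == 0, nil
})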
drain.Options.RunDrain
Context: drain.Options.RunDrain is called from the MC helper method controller.drainNode, which in turn is called from controller.triggerDeletionFlow when machine.Status.LastOperation.Description contains the operation machineutils.InitiateDrain.
If RunDrain returns an error, then the drain is retried at a later time by putting the machine key back into the queue, unless the force-deletion label on the machine object is true, in which case we proceed to VM deletion.
func (o *Options) RunDrain(ctx context.Context) error
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD GetNode["node, err = .client.CoreV1().Nodes().Get(ctx, o.nodeName, metav1.GetOptions{})"] -->ChkGetNodeErr{err != nil} ChkGetNodeErr--Yes-->ReturnNilEarly(("return nil (case where deletion triggered during machine creation, so node is nil. TODO: should use apierrors.NotFound)")) ChkGetNodeErr--No-->ChkNodeUnschedulable{node.Spec.Unschedulable?} ChkNodeUnschedulable--Yes-->GetPodsForDeletion[" pods, err := o.getPodsForDeletion(ctx) if err!=nil return err"] ChkNodeUnschedulable--No-->CloneNode["clone := node.DeepCopy() clone.Spec.Unschedulable = true"] -->UpdateNode["_, err = o.client.CoreV1().Nodes().Update(ctx, clone, metav1.UpdateOptions{}) if err != nil return err "] UpdateNode-->GetPodsForDeletion GetPodsForDeletion-->GetEvictionPGV[" policyGroupVersion, err := SupportEviction(o.client) if err != nil return err "] --> DefineAttemptEvict[" attemptEvict := !o.ForceDeletePods && len(policyGroupVersion) > 0 // useless boolean which confuses matters considerably. "] --> DefineGetPodFn[" getPodFn := func(namespace, name string) (*corev1.Pod, error) { return o.client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{}) }"] --> CreateReturnChannel[" returnCh := make(chan error, len(pods)) defer close(returnCh) "] -->ChkForceDelPods{"o.ForceDeletePods?"} ChkForceDelPods--Yes-->EvictPodsWithoutPV["go o.evictPodsWithoutPv(ctx, attemptEvict, pods, policyGroupVersion, getPodFn, returnCh) // go-routine feels un-necessary here."] ChkForceDelPods--No-->FilterPodsWithPv[" podsWithPv, podsWithoutPv := filterPodsWithPv(pods) "] FilterPodsWithPv-->EvictPodsWithPv[" go o.evictPodsWithPv(ctx, attemptEvict, podsWithPv, policyGroupVersion, getPodFn, returnCh) "] -->EvictPodsWithoutPV1[" go o.evictPodsWithoutPv(ctx, attemptEvict, podsWithoutPv, policyGroupVersion, getPodFn, returnCh) "]-->CreateAggregateError EvictPodsWithoutPV-->CreateAggregateError[" var errors []error errors = for i = 0; i < len(pods); ++i { err := <-returnCh if err != nil { errors = append(errors, err) } } "] -->ReturnAggError(("return\nerrors.NewAggregate(errors)"))
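The fan-out/fan-in at the end of the flow is the standard buffered-channel error aggregation pattern; a short sketch for clarity (not the exact code):

// Sketch only: collect one result per pod and aggregate the failures.
returnCh := make(chan error, len(pods))
defer close(returnCh)
// ... eviction go-routines write exactly one error (or nil) per pod to returnCh ...
var errs []error
for i := 0; i < len(pods); i++ {
	if err := <-returnCh; err != nil {
		errs = append(errs, err)
	}
}
return utilerrors.NewAggregate(errs) // k8s.io/apimachinery/pkg/util/errors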
Notes:
- machine-controller-manager/pkg/util/provider/drain.SupportEviction uses the Discovery API to find out whether the server supports the eviction subresource and, if so, returns its groupVersion, or "" if it doesn't.
- k8s.io/kubectl/pkg/drain.CheckEvictionSupport already does this.
- The attemptEvict boolean usage is confusing. Stick to drain.Options.ForceDeletePods.
- TODO: GAP? For cordoning a Node we currently just set Node.Spec.Unschedulable. But we are also supposed to set the taint node.kubernetes.io/unschedulable. The spec way is supposed to be deprecated.
drain.Options.evictPodsWithoutPv
drain method that iterates through each given pod and, for each pod, launches a go-routine that simply delegates to Options.evictPodWithoutPVInternal.
func (o *Options) evictPodsWithoutPv(ctx context.Context,
attemptEvict bool,
pods []*corev1.Pod,
policyGroupVersion string, // eviction API's GV
getPodFn func(namespace, name string) (*corev1.Pod, error),
returnCh chan error) {
for _, pod := range pods {
go o.evictPodWithoutPVInternal(ctx, attemptEvict, pod, policyGroupVersion, getPodFn, returnCh)
}
return
}
drain.Options.evictPodWithoutPVInternal
drain method that either evicts or deletes a Pod, with retry handling until Options.MaxEvictRetries is reached.
func (o *Options) evictPodWithoutPVInternal(
ctx context.Context,
attemptEvict bool,
pod *corev1.Pod,
policyGroupVersion string,
getPodFn func(namespace, name string) (*corev1.Pod, error),
returnCh chan error)
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD RangePod["pod := range pods"] RangePod-->EvictOrDelPod["go evictPodWithoutPVInternal(attemptEvict bool, pod, policyGroupVersion,pod,getPodFn,returnCh)"] EvictOrDelPod-->Begin subgraph "evictPodWithoutPVInternal (evicts or deletes Pod) " Begin(("Begin"))-->SetRetry["retry := 0"] SetRetry -->SetAttemptEvict["if retry >= o.MaxEvictRetries {attemptEvict=false}"] -->ChkAttemptEvict{"attemptEvict ?"} ChkAttemptEvict--Yes-->EvictPod["err=o.evictPod(ctx, pod, policyGroupVersion)"] ChkAttemptEvict--No-->DelPod["err=o.deletePod(ctx, pod)"] EvictPod-->ChkErr DelPod-->ChkErr ChkErr{"Check err"} ChkErr--"Nil"-->ChkForceDelPods ChkErr--"IsTooManyRequests(err)"-->GetPdb[" // Possible case where Pod couldn't be evicted because of PDB violation pdbs, err = pdbLister.GetPodPodDisruptionBudgets(pod) pdb=pdbs[0] if err !=nil && len(pdbs) > 0 "] ChkErr--"IsNotFound(err)\n(pod evicted)"-->SendNilChannel-->NilReturn ChkErr--"OtherErr"-->SendErrChannel GetPdb-->ChkMisConfiguredPdb{"isMisconfiguredPdb(pdb)?"} ChkMisConfiguredPdb--Yes-->SetPdbError["err=fmt.Errorf('pdb misconfigured')"] SetPdbError-->SendErrChannel ChkMisConfiguredPdb--No-->SleepEvictRetryInterval["time.Sleep(PodEvictionRetryInterval)"] SleepEvictRetryInterval-->IncRetry["retry+=1"]-->SetAttemptEvict SendErrChannel-->NilReturn ChkForceDelPods{"o.ForceDeletePods"} ChkForceDelPods--"Yes\n(dont wait for\npod disappear)"-->SendNilChannel ChkForceDelPods--No-->GetPodTermGracePeriod[" // TODO: discuss this, shouldn't pod grace period override drain ? timeout=Min(pod.Spec.TerminationGracePeriodSeconds,o.Timeout) "] -->SetBufferPeriod["bufferPeriod := 30 * time.Second"] -->WaitForDelete["pendingPods=o.waitForDelete(pods, timeout,getPodFn)"] -->ChkWaitForDelError{err != nil ?} ChkWaitForDelError--Yes-->SendErrChannel ChkWaitForDelError--No-->ChkPendingPodsLength{"len(pendingPods) > 0?"} ChkPendingPodsLength--Yes-->SetTimeoutError["err = fmt.Errorf('pod term timeout')"] SetTimeoutError-->SendErrChannel ChkPendingPodsLength--No-->SendNilChannel end SendNilChannel["returnCh <- nil"] SendErrChannel["returnCh <- err"] NilReturn(("return"))
isMisconfiguredPdb
TODO: Discuss/Elaborate on why this is considered misconfigured. (Likely: the PDB reports all expected pods healthy - CurrentHealthy >= ExpectedPods - yet still allows zero disruptions, so an eviction can never succeed; e.g. minAvailable is set equal to the replica count.)
func isMisconfiguredPdbV1(pdb *policyv1.PodDisruptionBudget) bool {
if pdb.ObjectMeta.Generation != pdb.Status.ObservedGeneration {
return false
}
return pdb.Status.ExpectedPods > 0 &&
pdb.Status.CurrentHealthy >= pdb.Status.ExpectedPods &&
pdb.Status.DisruptionsAllowed == 0
}
drain.Options.evictPodsWithPv
func (o *Options) evictPodsWithPv(ctx context.Context,
attemptEvict bool,
pods []*corev1.Pod,
policyGroupVersion string,
getPodFn func(namespace, name string) (*corev1.Pod, error),
returnCh chan error)
NOTE
- See drain.Options.evictPodsWithPv.
- This method basically delegates to o.evictPodsWithPVInternal with retry handling.
- TODO: Logic of this method can do with some refactoring!
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Begin((" ")) -->SortPods[" sortPodsByPriority(pods) //Desc priority: pods[i].Spec.Priority > *pods[j].Spec.Priority "] -->DoVolumeAccounting[" podVolumeInfoMap := o.doAccountingOfPvs(ctx, pods) "] -->ChkAttemptEvict{attemptEvict ?} ChkAttemptEvict--Yes-->RetryTillLimit[" until MaxEvictRetries "] --> InvokeHelper[" remainingPods, aborted = o.evictPodsWithPVInternal(ctx, attemptEvict, pods, podVolumeInfoMap, policyGroupVersion, returnCh) "] InvokeHelper-->ChkAbort{" aborted || len(remainingPods) == 0 "} ChkAbort--Yes-->RangeRemainingPods ChkAbort--No-->Sleep[" pods = remainingPods time.Sleep(PodEvictionRetryInterval) "] Sleep--loop-->RetryTillLimit RetryTillLimit--loopend-->ChkRemaining{"len(remainingPods) > 0 && !aborted ?"} ChkRemaining--Yes-->InvokeHelper1[" // force delete pods remainingPods, _ = o.evictPodsWithPVInternal(ctx, false, pods, podVolumeInfoMap, policyGroupVersion, getPodFn, returnCh) "] ChkAttemptEvict--No-->InvokeHelper1 InvokeHelper1-->RangeRemainingPods ChkRemaining--No-->RangeRemainingPods RangeRemainingPods["pod := range remainingPods"] RangeRemainingPods--aborted?-->SendNil["returnCh <- nil"] RangeRemainingPods--attemptEvict?-->SendEvictErr["returnCh <- fmt.Errorf('pod evict error')"] RangeRemainingPods--else-->SendDelErr["returnCh <- fmt.Errorf('pod delete error')"] SendNil-->NilReturn SendEvictErr-->NilReturn SendDelErr-->NilReturn NilReturn(("return"))
drain.Options.evictPodsWithPVInternal
FIXME: name case inconsistency with evictPodsWithPv
drain.Options.evictPodsWithPVInternal is a drain helper method that actually evicts/deletes pods and waits for volume detachment. It returns a remainingPods slice and a fastTrack boolean that signals the caller to abort pod eviction and exit the calling go-routine. (TODO: fastTrack should be called abort, or better still, a custom error should be used here.)
func (o *DrainOptions) evictPodsWithPVInternal(ctx context.Context,
attemptEvict bool,
pods []*corev1.Pod,
volMap map[string][]string,
policyGroupVersion string,
returnCh chan error
) (remainingPods []*api.Pod, fastTrack bool)
- Uses context.WithDeadline, passing in ctx and a deadline derived from the drain timeout, to obtain a sub-context assigned to mainContext and a CancelFunc. The obtained CancelFunc is deferred (so it is always invoked when the method terminates).
- Maintains a pod slice retryPods which is initially empty.
- Iterates through the pods slice with i as the index variable:
  - Applies a select with one check case:
    - Check whether mainContext is closed/cancelled by attempting to read from its Done channel: <-mainContext.Done(). If this case matches:
      - Send nil on the return error channel: returnCh <- nil
      - Compute remainingPods as the retryPods slice appended with the pods yet to be iterated: pods[i+1:]...
      - Return remainingPods, true (aborted is true; see the condensed sketch after this list).
  - Records the pod eviction start time: podEvictionStartTime = time.Now()
  - Calls volumeAttachmentHandler.AddWorker() to start tracking VolumeAttachments and obtains a volumeAttachmentEventCh receive channel on which the attached or detached VolumeAttachment objects are delivered.
  - If attemptEvict is true, calls the evictPod helper method, else calls the deletePod helper method, and grabs the err of the eviction/deletion.
  - If eviction/deletion had an error, analyzes the err:
    - If both attemptEvict is true and apierrors.IsTooManyRequests(err) is true, this case is interpreted as an eviction failure due to PDB violation:
      - We get the PodDisruptionBudget for the pod being iterated.
      - We check whether it is misconfigured. If so, we send an error on returnCh, close the volumeAttachmentEventCh using volumeAttachmentHandler.DeleteWorker and continue with the next loop iteration, i.e. go to the next pod.
    - If just apierrors.IsNotFound(err) is true, this means the Pod is already gone from the node. We send nil on returnCh, call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh) and continue with the next pod in the iteration.
    - Otherwise we add the pod to the retryPods slice: retryPods = append(retryPods, pod), call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh) and continue with the next pod in the iteration.
    - (NOTE: This error handling can be optimized - too much repetition.)
  - Logs that the evict/delete was successful.
  - Gets the PodVolumeInfo from volMap using the pod key.
  - Obtains a context and cancellation function for volume detachment using context.WithTimeout, passing in mainContext and a detach timeout computed as the pod's termination grace period (pod.Spec.TerminationGracePeriodSeconds, if not nil) added to the PvDetachTimeout from the drain options.
  - Invokes waitForDetach(ctx, podVolumeInfo, o.nodeName) and grabs the err.
  - Invokes the cancel function for detach. NOTE: This is not good - the sub-context should be created inside waitForDetach with a defer for the cancellation.
  - Analyzes the detachment error:
    - If apierrors.IsNotFound(err) is true, this indicates that the node is not found:
      - Send nil on returnCh
      - Call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh)
      - Compute remainingPods as the retryPods slice appended with the pods yet to be iterated: pods[i+1:]...
      - Return remainingPods, true (aborted is true).
    - For other errors:
      - Send the err on the returnCh.
      - Call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh)
      - Continue with the next pod in the iteration.
  - Obtains a context and cancellation function for volume re-attachment using context.WithTimeout, passing in mainContext and drain.Options.PvReattachTimeout.
  - Invokes waitForReattach(ctx, podVolumeInfo, o.nodeName, volumeAttachmentEventCh) and grabs the returned err.
  - Invokes the cancel function for reattach. NOTE: Again, not good - the sub-context should be created inside waitForReattach.
  - Analyzes the re-attachment error:
    - If err is a reattachment timeout error, just log a warning. TODO: Confused on why we don't return an error on the return channel here.
    - Otherwise we send the err on the returnCh, call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh) and continue with the next pod in the iteration.
  - Calls volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh) yet again (someone is very fond of calling this again and again).
  - Logs the time taken for pod eviction + volume detachment + volume re-attachment to another node using time.Since(podEvictionStartTime).
  - Sends nil on returnCh.
- When the pod iteration loop is done: return retryPods, false
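A condensed sketch of the abort-on-deadline mechanics described above; the helper name abortAwarePodLoop and the process callback are hypothetical, and PDB handling plus the volume detach/re-attach waits are elided:

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// abortAwarePodLoop shows only the select-on-Done mechanics: once the drain deadline
// (mainContext) expires, the untouched tail of pods is handed back and the second
// return value signals the caller to abort.
func abortAwarePodLoop(mainContext context.Context, pods []*corev1.Pod, returnCh chan<- error,
	process func(*corev1.Pod) error) (remainingPods []*corev1.Pod, aborted bool) {
	var retryPods []*corev1.Pod
	for i, pod := range pods {
		select {
		case <-mainContext.Done():
			returnCh <- nil
			return append(retryPods, pods[i+1:]...), true
		default:
		}
		if err := process(pod); err != nil {
			// the real code distinguishes PDB violations / NotFound here
			retryPods = append(retryPods, pod)
		}
	}
	return retryPods, false
}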
Orphan / Safety Jobs
Read MCM FAQ: What is Safety Controller in MCM
These are jobs that run periodically by pushing dummy keys onto their respective work-queues; a worker then picks up each key and dispatches it to the corresponding reconcile function.
worker.Run(c.machineSafetyOrphanVMsQueue, "ClusterMachineSafetyOrphanVMs",
worker.DefaultMaxRetries,
true, c.reconcileClusterMachineSafetyOrphanVMs, stopCh, &waitGroup)
worker.Run(c.machineSafetyAPIServerQueue, "ClusterMachineAPIServer",
worker.DefaultMaxRetries, true, c.reconcileClusterMachineSafetyAPIServer,
stopCh, &waitGroup)
reconcileClusterMachineSafetyOrphanVMs
This technically isn't a reconciliation loop; it is effectively just a job.
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%% flowchart TD Begin((" ")) -->RescheduleJob[" // Schedule Rerun defer c.machineSafetyOrphanVMsQueue.AddAfter('', c.safetyOptions.MachineSafetyOrphanVMsPeriod.Duration) "]--> GetMC["Get Machine Classes and iterate"] -->forEach[" asdf "] -->ListDriverMachines[" listMachineResp := driver.ListMachines(...) "] ListDriverMachines-->MachinesExist{ #listMachineResp.MachineList > 0 ? } MachinesExist-->|yes| SyncCache["cache.WaitForCacheSync(stopCh, machineInformer.Informer().HasSynced)"] SyncCache-->IterMachine["Iterate machineID,machineName in listMachineResp.MachineList"] -->GetMachine["machine, err := Get Machine from Lister"] -->ChkErr{Check err} ChkErr-->|err NotFound|DelMachine ChkErr-->|err nil|ChkMachineId ChkMachineId{ machine.Spec.ProviderID == machineID OR machinePhase is empty or CrashLoopBackoff ?} ChkMachineId-->|yes|IterMachine ChkMachineId-->|no|DelMachine[" driver.DeleteMachine(ctx, &driver.DeleteMachineRequest{ Machine: machine, secretData...}) "] DelMachine--iterDone-->ScheduleRerun MachinesExist-->|no| ReturnLongRetry["retryPeriod := machineutils.LongRetry"] -->ScheduleRerun[" c.machineSafetyOrphanVMsQueue.AddAfter('', time.Duration(retryPeriod)) "]
Machine Controller Manager
The Machine Controller Manager handles reconciliation of MachineDeployment and MachineSet objects.
Ideally this should be called the machine-deployment-controller, but the current name is a legacy holdover from when all controllers were in one project module.
The Machine Controller Manager Entry Point is at github.com/gardener/machine-controller-manager/cmd/machine-controller-manager/controller_manager.go
MCM Launch
Reconcile Cluster Machine Set
A MachineSet is to a Machine what a ReplicaSet is to a Pod: a MachineSet ensures that the specified number of Machines are running at any given time.
A MachineSet is rarely created directly. It is generally owned by its parent MachineDeployment, and its ObjectMeta.OwnerReferences slice has a reference to the parent deployment.
The MCM controller's reconcileClusterMachineSet is called with keys of objects retrieved from the machineSetQueue as shown below.
worker.Run(c.machineSetQueue,
"ClusterMachineSet",
worker.DefaultMaxRetries, true, c.reconcileClusterMachineSet, stopCh, &waitGroup)
The following is the flow diagram for func (c *controller) reconcileClusterMachineSet(key string) error. As can be observed, it could be better optimized. For any error in the below, the MachineSet key is added back to the machineSetQueue subject to the default rate limiting.
%%{init: { 'themeVariables': { 'fontSize': '11px'},"flowchart": {"defaultRenderer": "elk"}} }%% flowchart TD Begin((" ")) -->GetMachineSet["machineSet=Get MS From Lister"] -->ValidateMS["validation.ValidateMachineSet(machineSet)"] -->ChkDeltimestamp1{"machineSet.DeletionTimestamp?"} ChkDeltimestamp1-->|no| AddFinalizersIMissing["addFinalizersIfMissing(machineSet)"] ChkDeltimestamp1-->|yes| GetAllMS["allMachineSets = list all machine sets"]--> GetMSSelector["selector = LabelSelectorAsSelector(machineSet.Spec.Selector)"] -->ClaimMachines["claimedMachines=claimMachines(machineSet, selector, allMachines)"] -->SyncNT["synchronizeMachineNodeTemplates(claimedMachines, machineSet)"] -->SyncMC["syncMachinesConfig(claimedMachines, machineSet)"] -->SyncMCK["syncMachinesClassKind(claimedMachines, machineSet)"] -->ChkDeltimestamp2{"machineSet.DeletionTimestamp?"}-->|no| ScaleUpDown ChkDeltimestamp2-->|yes| ChkClaimedMachinesLen{"len(claimedMachines) == 0?"} ChkClaimedMachinesLen-->|yes| DelMSFinalizers["delFinalizers(machineSet)"] ChkClaimedMachinesLen-->|no| TermMachines["terminateMachines(claimedMachines,machineSet)"]-->CalcMSStatus DelMSFinalizers-->CalcMSStatus ScaleUpDown["manageReplicas(claimedMachines) // scale out/in machines"] -->CalcMSStatus["calculateMachineSetStatus(claimedMachines, machineSet, errors)"] -->UpdateMSStatus["updateMachineSetStatus(...)"] -->enqueueMachineSetAfter["machineSetQueue.AddAfter(msKey, 10m)"] AddFinalizersIMissing-->GetAllMS
claimMachines
claimMachines tries to take ownership of a Machine: it associates a Machine with a MachineSet by setting machine.metadata.OwnerReferences, and releases a previously claimed Machine whose labels no longer match the MachineSet's selector.
- Initialize an empty claimedMachines []*Machine slice.
- Initialize an empty errList []error.
- Iterate through allMachines and get the ownerRef (the first element in the OwnerReferences slice).
- If the ownerRef is not nil:
  - If ownerRef.UID differs from the machineSet's UID, skip the claim and continue (the machine belongs to another machine set).
  - If the selector matches the machine's labels, add it to claimedMachines and continue.
  - If machineSet.DeletionTimestamp is set, skip and continue.
  - Release the Machine by removing its ownerReference.
- If the ownerRef is nil:
  - If machineSet.DeletionTimestamp is set or the selector does not match the machine's labels, skip and continue.
  - If machine.DeletionTimestamp is set, skip and continue.
  - Adopt the machine, i.e. set its ownerReference to the machineSet and add it to claimedMachines:
    ownerReferences:
    - apiVersion: machine.sapcloud.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: MachineSet
      name: shoot--i034796--aw2-a-z1-8c99f
      uid: 20bc03c5-e95b-4df5-9faf-68be38cb8e1b
- Return claimedMachines (see the sketch after this list).
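A condensed sketch of the claim/release decision above (not the controller's actual code; the release and adopt patch calls are elided):

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// claimSketch decides, per machine, whether it is kept, released or adopted.
func claimSketch(ms *v1alpha1.MachineSet, selector labels.Selector, allMachines []*v1alpha1.Machine) []*v1alpha1.Machine {
	var claimed []*v1alpha1.Machine
	for _, m := range allMachines {
		ownerRef := metav1.GetControllerOf(m)
		if ownerRef != nil {
			if ownerRef.UID != ms.UID {
				continue // owned by another MachineSet
			}
			if selector.Matches(labels.Set(m.Labels)) {
				claimed = append(claimed, m) // already ours and still matching
				continue
			}
			if ms.DeletionTimestamp == nil {
				// labels no longer match a live MachineSet: release (remove ownerReference) - elided
			}
			continue
		}
		// orphan machine: adopt only if the MachineSet is live, the selector matches and
		// the machine itself is not being deleted
		if ms.DeletionTimestamp != nil || !selector.Matches(labels.Set(m.Labels)) || m.DeletionTimestamp != nil {
			continue
		}
		// adopt: patch ownerReference to ms - elided
		claimed = append(claimed, m)
	}
	return claimed
}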
synchronizeMachineNodeTemplates
func (c *controller) syncMachinesNodeTemplates(ctx context.Context,
claimedMachines []*Machine, machineSet *MachineSet) error
- This iterates through the claimedMachines and copies machineSet.Spec.Template.Spec.NodeTemplateSpec to machine.Spec.NodeTemplateSpec.
- NOTE: Seems like useless IO busy-work to me. When the MC launches the Machine, it might as well access the owning MachineSet and get the NodeTemplateSpec.
- The only reason to do this is to support independent Machines without owning MachineSets. We will need to see whether such a use-case is truly needed.
NOTE: NodeTemplate describes common resource capabilities such as cpu, gpu and memory in terms of k8s.io/api/core/v1.ResourceList. This is used by the cluster-autoscaler for scaling decisions (see the illustrative snippet below).
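For orientation only, an illustrative (made-up) capacity expressed as the corev1.ResourceList such a node template carries:

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// nodeTemplateCapacity is illustrative only: the kind of resource capabilities a node
// template expresses and the cluster-autoscaler consumes.
var nodeTemplateCapacity = corev1.ResourceList{
	corev1.ResourceCPU:    resource.MustParse("8"),
	corev1.ResourceMemory: resource.MustParse("32Gi"),
	"nvidia.com/gpu":      resource.MustParse("1"),
}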
syncMachinesConfig
Copies machineset.Spec.Template.Spec.MachineConfiguration
to machine.Spec.MachineConfiguration
for all claimedMachines
.
See MachineConfiguration
inside MachineSpec
syncMachinesClassKind
NOTE: This is useless and should be removed since we only have ONE kind of MachineClass
. TODO: Discuss with Himanshu/Rishabh.
func (c *controller) syncMachinesClassKind(ctx context.Context,
claimedMachines []*Machine, machineSet *MachineSet) error
Iterates through claimedMachines
and sets machine.Spec.Class.Kind = machineset.Spec.Template.Spec.Class.Kind
if not already set.
manageReplicas (scale-out / scale-in)
func (c *controller) manageReplicas(ctx context.Context,
claimedMachines []Machine, machineSet *MachineSet) error
%%{init: { 'themeVariables': { 'fontSize': '11px'},"flowchart": {"defaultRenderer": "elk"}} }%% flowchart TD Begin((" ")) -->Init["activeMachines :=[], staleMachines:=[]"] -->IterCLaimed["machine := range claimedMachines"] --loop-->IsActiveOrFailed{"IsMachineActiveOrFailed(machine)"} IsActiveOrFailed-->|active| AppendActive["append(activeMachines,machine)"] IsActiveOrFailed-->|failed| AppendFailed["append(staleMachines,machine)"] IterCLaimed--done-->TermStaleMachines["terminateMachines(staleMachines,machineSet)"] TermStaleMachines-->Delta["diff := len(activeMachines) - machineSet.Spec.Replicas"] Delta-->ChkDelta{"diff < 0?"} ChkDelta-->|yes| ScaleOut["numCreated:=slowStartBatch(-diff,..) // scale out"] ScaleOut-->Log["Log numCreated/skipped/deleted"] ChkDelta-->|no| GetMachinesToDelete["machinesToDel := getMachinesToDelete(activeMachines, diff)"] GetMachinesToDelete-->TermMachines["terminateMachines(machinesToDel, machineSet)"] -->Log-->ReturnErr["return err"]
terminateMachines
func (c *controller) terminateMachines(ctx context.Context,
inactiveMachines []*Machine, machineSet *MachineSet) error {
- Invokes
controlMachineClient.Machines(namespace).Delete(ctx, machineID,..)
for eachMachine
ininactiveMachines
and records an event. - The
machine.Status.Phase
is also set toTerminating
. - This is done in parallel using
go-routines
aWaitGroup
on length ofinactiveMachines
slowStartBatch
func slowStartBatch(count int, initialBatchSize int, createFn func() error) (int, error)
- Initializes remaining to count and successes to 0.
- Executes fn (which creates a Machine object) in parallel batches of go-routines, starting with batchSize := initialBatchSize and doubling batchSize after each successful batch (see the sketch after this list):
  - For each batch iteration, a wg sync.WaitGroup is constructed with batchSize; each batch waits for its go-routines to complete using wg.Wait().
  - For each batch iteration, an errCh is constructed with capacity batchSize.
  - batchSize go-routines execute fn concurrently, sending errors on errCh and invoking wg.Done() when complete. numErrorsInBatch = len(errCh).
  - successes is incremented by batchSize minus numErrorsInBatch.
  - If numErrorsInBatch > 0, abort, returning successes and the first error from errCh.
  - remaining is decremented by batchSize.
  - Compute the next batchSize as Min(remaining, 2*batchSize).
  - Continue iterating while batchSize is greater than 0.
  - Return successes, nil when done.
- fn is a lambda that creates a new Machine, in which we do the below:
  - Create an ownerRef with machineSet.Name and machineSet.UID.
  - Get the machine spec template from machineSet.Spec.Template.
  - Create a Machine object setting the machine spec and ownerRef; use the machineSet name as the prefix for GenerateName in the ObjectMeta.
  - Return any err, or nil if there is no error.
  - New Machine objects are persisted using controlMachineClient.Machines(namespace).Create(ctx, machine, createOpts).
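A simplified sketch of the doubling-batch pattern described above (error-channel draining condensed; not the library function itself):

package example

import "sync"

// slowStartBatchSketch runs fn in batches of go-routines, doubling the batch size after
// each fully successful batch and aborting on the first failing batch.
func slowStartBatchSketch(count, initialBatchSize int, fn func() error) (int, error) {
	remaining := count
	successes := 0
	for batchSize := intMin(remaining, initialBatchSize); batchSize > 0; batchSize = intMin(remaining, 2*batchSize) {
		errCh := make(chan error, batchSize)
		var wg sync.WaitGroup
		wg.Add(batchSize)
		for i := 0; i < batchSize; i++ {
			go func() {
				defer wg.Done()
				if err := fn(); err != nil {
					errCh <- err
				}
			}()
		}
		wg.Wait()
		numErrors := len(errCh)
		successes += batchSize - numErrors
		if numErrors > 0 {
			return successes, <-errCh // abort on the first failing batch
		}
		remaining -= batchSize
	}
	return successes, nil
}

func intMin(a, b int) int {
	if a < b {
		return a
	}
	return b
}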
Reconcile Cluster Machine Deployment
func (dc *controller) reconcileClusterMachineDeployment(key string) error
- Gets the deployment name.
- Gets the
MachineDeployment
- TODO: WEIRD: freeze labels and deletion timestamp
- TODO: unclear why we do this
// Resync the MachineDeployment after 10 minutes to avoid missing out on missed out events
defer dc.enqueueMachineDeploymentAfter(deployment, 10*time.Minute)
- Add finalizers if the deletion timestamp is nil.
- TODO: Why is the observed generation only updated conditionally in the below? Shouldn't it be done always?
everything := metav1.LabelSelector{}
if reflect.DeepEqual(d.Spec.Selector, &everything) {
dc.recorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all machines. A non-empty selector is required.")
if d.Status.ObservedGeneration < d.Generation {
d.Status.ObservedGeneration = d.Generation
dc.controlMachineClient.MachineDeployments(d.Namespace).UpdateStatus(ctx, d, metav1.UpdateOptions{})
}
return nil
}
- Get
[]*v1alpha1.MachineSet
for this deployment usinggetMachineSetsForMachineDeployment
and assign tomachineSets
- if
deployment.DeletionTimestamp != nil
- if there are no finalizers on deployment return nil
- if
len(machineSets) == 0
delete the machine deployment finalizers and return nil - Call
dc.terminateMachineSets(ctx, machineSets)
Rollout Rolling
func (dc *controller) rolloutRolling(ctx context.Context,
d *v1alpha1.MachineDeployment,
msList []*v1alpha1.MachineSet,
machineMap map[types.UID]*v1alpha1.MachineList) error
1. Get new machine set corresponding to machine deployment and old machine sets
newMS, oldMSs, err := dc.getAllMachineSetsAndSyncRevision(ctx, d, msList, machineMap, true)
allMSs := append(oldMSs, newMS)
2. Taint the nodes backing the old machine sets.
This is a preference: the k8s scheduler will try to avoid placing a pod that does not tolerate the taint on the node. Q: Why don't we use NoSchedule instead? Any pods scheduled on this node will need to be drained - more work to be done.
dc.taintNodesBackingMachineSets(
ctx,
oldISs, &v1.Taint{
Key: PreferNoScheduleKey,
Value: "True",
Effect: "PreferNoSchedule",
},
)
3. Add AutoScaler Scale-Down annotations to Nodes of Old Machine Sets
- Create the map. (TODO: Q: Why do we add 2 annotations ?)
clusterAutoscalerScaleDownAnnotations := make(map[string]string)
clusterAutoscalerScaleDownAnnotations["cluster-autoscaler.kubernetes.io/scale-down-disabled"] = "true"
clusterAutoscalerScaleDownAnnotations["cluster-autoscaler.kubernetes.io/scale-down-disabled-by-mcm"] = "true"
- Call
annotateNodesBackingMachineSets(ctx, allMSs, clusterAutoscalerScaleDownAnnotations)
4. Reconcile New Machine Set by calling reconcileNewMachineSet
scaledUp, err := dc.reconcileNewMachineSet(ctx, allISs, newIS, d)
func (dc *controller) reconcileNewMachineSet(ctx context.Context,
allMSs[]*v1alpha1.MachineSet,
newMS *v1alpha1.MachineSet,
deployment *v1alpha1.MachineDeployment)
(bool, error)
- If newMS.Spec.Replicas == deployment.Spec.Replicas, return - nothing to do.
- If newMS.Spec.Replicas > deployment.Spec.Replicas, we need to scale down: call dc.scaleMachineSet(ctx, newMS, deployment.Spec.Replicas, "down").
- Otherwise compute newReplicasCount using NewMSNewReplicas(deployment, allMSs, newMS).
- Call dc.scaleMachineSet(ctx, newMS, newReplicasCount, "up") (see the sketch below).
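A condensed sketch of the decision above, with hypothetical stand-ins (scale for dc.scaleMachineSet, newReplicas for NewMSNewReplicas):

package example

// reconcileNewMSSketch condenses the scale-up/scale-down decision for the new MachineSet.
func reconcileNewMSSketch(deploymentReplicas, newMSReplicas int32,
	newReplicas func() (int32, error), scale func(target int32, direction string) error) error {
	switch {
	case newMSReplicas == deploymentReplicas:
		return nil // nothing to do
	case newMSReplicas > deploymentReplicas:
		return scale(deploymentReplicas, "down")
	default:
		target, err := newReplicas()
		if err != nil {
			return err
		}
		return scale(target, "up")
	}
}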
Helper Methods
scaleMachineSet
func (dc *controller) scaleMachineSet(ctx context.Context,
ms *v1alpha1.MachineSet,
newScale int32,
deployment *v1alpha1.MachineDeployment,
scalingOperation string)
(bool, *v1alpha1.MachineSet, error) {
sizeNeedsUpdate := (ms.Spec.Replicas) != newScale
}
TODO: fill me in.
Get Machine Sets for Machine Deployment
func (dc *controller) getMachineSetsForMachineDeployment(ctx context.Context,
d *v1alpha1.MachineDeployment)
([]*v1alpha1.MachineSet, error)
- Get all machine sets using machine set lister.
NewMachineSetControllerRefManager
unclear
Get Machine Map for Machine Deployment
Returns a map from MachineSet UID to a list of Machines controlled by that MS, according to the Machine's ControllerRef.
func (dc *controller)
getMachineMapForMachineDeployment(d *v1alpha1.MachineDeployment,
machineSets []*v1alpha1.MachineSet)
(map[types.UID]*v1alpha1.MachineList, error) {
Terminate Machine Sets of Machine Deployment
func (dc *controller) terminateMachineSets(ctx context.Context, machineSets []*v1alpha1.MachineSet)
Sync Deployment Status
func (dc *controller) syncStatusOnly(ctx context.Context,
d *v1alpha1.MachineDeployment,
msList []*v1alpha1.MachineSet,
machineMap map[types.UID]*v1alpha1.MachineList) error
Gets New and Old MachineSets and Sync Revision
func (dc *controller) getAllMachineSetsAndSyncRevision(ctx context.Context,
d *v1alpha1.MachineDeployment,
msList []*v1alpha1.MachineSet,
machineMap map[types.UID]*v1alpha1.MachineList,
createIfNotExisted bool)
(*v1alpha1.MachineSet, []*v1alpha1.MachineSet, error)
Overview
getAllMachineSetsAndSyncRevision
does the following:
- Gets all old MachineSets that the MachineDeployment d targets, and calculates the max revision number among them (maxOldV).
- Gets the new MachineSet this deployment targets (i.e. the one whose machine template matches the deployment's) and updates the new machine set's revision number to (maxOldV + 1). This is done only if its current revision number is smaller than (maxOldV + 1). If this step fails, we'll update it in the next deployment sync loop.
- Copies the new MachineSet's revision number to the MachineDeployment (updates the deployment's revision). If this step fails, we'll update it in the next deployment sync loop.
Detail
- TODO: describe me
Annotate Nodes Backing Machine Sets
func (dc *controller) annotateNodesBackingMachineSets(
ctx context.Context,
machineSets []*v1alpha1.MachineSet,
annotations map[string]string) error
- Iterate through the
machineSets
. Loop variable:machineSet
- List all the machines. TODO: EXPENSIVE ??
allMachines, err := dc.machineLister.List(labels.Everything())
- Get the Selector for the Machine Set
selector, err := metav1.LabelSelectorAsSelector(machineSet.Spec.Selector)
- Claim the Machines for the given
machineSet
using the selector
filteredMachines, err = dc.claimMachines(ctx, machineSet, selector, allMachines)
- Iterate through
filteredMachines
, loop variable:machine
and ifNode
is not empty, add or update annotations on node.
if machine.Status.Node != "" {
err = AddOrUpdateAnnotationOnNode(
ctx,
dc.targetCoreClient,
machine.Status.Node,
annotations,
)
}
Claim Machines
Basically sets or unsets the owner reference of the machines matching selector to the deployment controller.
func (c *controller) claimMachines(ctx context.Context,
machineSet *v1alpha1.MachineSet,
selector labels.Selector,
allMachines []*v1alpha1.Machine)
([]*v1alpha1.Machine, error) {
TODO: delegates to MachineControllerRefManager.claimMachines
Sets or un-sets the owner reference of the machine object to the deployment controller.
Summary
- Iterates through allMachines and checks whether selector matches the machine labels: m.Selector.Matches(labels.Set(machine.Labels)).
- Gets the controllerRef of the machine using metav1.GetControllerOf(machine).
- If controllerRef is not nil and controllerRef.UID matches the machineSet's UID, the machine is already claimed (it is released again if the labels no longer match).
- If controllerRef is nil and the selector matches, this is an adoption: AdoptMachine is called, which patches the machine's owner reference using the below:
addControllerPatch := fmt.Sprintf(
`{"metadata":{"ownerReferences":[{"apiVersion":"machine.sapcloud.io/v1alpha1","kind":"%s","name":"%s","uid":"%s","controller":true,"blockOwnerDeletion":true}],"uid":"%s"}}`,
m.controllerKind.Kind,
m.Controller.GetName(), m.Controller.GetUID(), machine.UID)
err := m.machineControl.PatchMachine(ctx, machine.Namespace, machine.Name, []byte(addControllerPatch))
Helper Functions
Compute New Machine Set New Replicas
NewMSNewReplicas calculates the number of replicas a deployment's new machine set should have. No further scale-up of the new MS happens once either of the following holds:
- The new MS is saturated: newMS's replicas == deployment's replicas
- The max number of machines allowed is reached: deployment's replicas + maxSurge == all MSs' replicas
func NewMSNewReplicas(deployment *v1alpha1.MachineDeployment,
allMSs []*v1alpha1.MachineSet,
newMS *v1alpha1.MachineSet) (int32, error)
// MS was called IS earlier (instance set)
- Get the
maxSurge
maxSurge, err = intstr.GetValueFromIntOrPercent(
deployment.Spec.Strategy.RollingUpdate.MaxSurge,
int(deployment.Spec.Replicas),
true
)
- Compute the
currentMachineCount
: iterate through all machine sets and sum upmachineset.Status.Replicas
maxTotalMachines = deployment.Spec.Replicas + maxSurge
if currentMachineCount >= maxTotalMachines return newMS.Spec.Replicas // cannot scale up
- Compute scaleUpCount := maxTotalMachines - currentMachineCount
- Make sure scaleUpCount does not exceed the desired deployment replicas: scaleUpCount = int32(integer.IntMin(int(scaleUpCount), int(deployment.Spec.Replicas - newMS.Spec.Replicas))) (see the worked example below)
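A worked example with made-up numbers, assuming the function finally returns newMS.Spec.Replicas + scaleUpCount:

package example

// newMSReplicasExample walks the surge arithmetic above with illustrative numbers:
// deployment.Spec.Replicas = 3, maxSurge = 1, current machines = 3, newMS.Spec.Replicas = 0
// => maxTotalMachines = 4, scaleUpCount = min(4-3, 3-0) = 1, result = 0 + 1 = 1.
func newMSReplicasExample() int32 {
	var (
		deploymentReplicas int32 = 3
		maxSurge           int32 = 1
		currentMachines    int32 = 3
		newMSReplicas      int32 = 0
	)
	maxTotalMachines := deploymentReplicas + maxSurge
	if currentMachines >= maxTotalMachines {
		return newMSReplicas // cannot scale up
	}
	scaleUpCount := maxTotalMachines - currentMachines
	if limit := deploymentReplicas - newMSReplicas; scaleUpCount > limit {
		scaleUpCount = limit
	}
	return newMSReplicas + scaleUpCount
}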
🚧 WIP at the moment. Lot more material to be added here from notes. Please do not read presently.
Issues
This section is very basic WIP atm. Please check after this warning has been removed. Lots more to be added here from notes and appropriately structured.
Design Issues
Bad Packaging
package controller
is inside import pathgithub.com/gardener/machine-controller-manager/pkg/util/provider/machinecontroller
LastOperation is actually Next Operation
Badly named. TODO: Describe more.
Description misused
Error-prone code like the below arises from misuse of the description field.
// isMachineStatusSimilar checks if the status of 2 machines is similar or not.
func isMachineStatusSimilar(s1, s2 v1alpha1.MachineStatus) bool {
s1Copy, s2Copy := s1.DeepCopy(), s2.DeepCopy()
tolerateTimeDiff := 30 * time.Minute
// Since lastOperation hasn't been updated in the last 30minutes, force update this.
if (s1.LastOperation.LastUpdateTime.Time.Before(time.Now().Add(tolerateTimeDiff * -1))) || (s2.LastOperation.LastUpdateTime.Time.Before(time.Now().Add(tolerateTimeDiff * -1))) {
return false
}
if utilstrings.StringSimilarityRatio(s1Copy.LastOperation.Description, s2Copy.LastOperation.Description) > 0.75 {
// If strings are similar, ignore comparison
// This occurs when cloud provider errors repeats with different request IDs
s1Copy.LastOperation.Description, s2Copy.LastOperation.Description = "", ""
}
// Avoiding timestamp comparison
s1Copy.LastOperation.LastUpdateTime, s2Copy.LastOperation.LastUpdateTime = metav1.Time{}, metav1.Time{}
s1Copy.CurrentStatus.LastUpdateTime, s2Copy.CurrentStatus.LastUpdateTime = metav1.Time{}, metav1.Time{}
return apiequality.Semantic.DeepEqual(s1Copy.LastOperation, s2Copy.LastOperation) && apiequality.Semantic.DeepEqual(s1Copy.CurrentStatus, s2Copy.CurrentStatus)
}
Gaps
TODO: Not comprehensive. Lots more to be added here
Dead/Deprecated Code
controller.triggerUpdationFlow
This is unused and appears to be dead code.
SafetyOptions.MachineDrainTimeout
This field is commented as deprecated but is still in MCServer.AddFlags
and in the launch script of individual providers.
Ex:
go run cmd/machine-controller/main.go ... machine-drain-timeout=5m
Dup Code
- Nearly all files in pkg/controller/*.go
  - Ex: Types/funcs/methods in pkg/controller/machine_util.go
  - Ex: Dup NodeTerminationCondition in pkg/controller/machine_util.go. The one that is being actively used is machineutils.NodeTerminationCondition.
- Types/funcs/methods in pkg/controller/drain.go
drainNode Handling
- Does not set err when c.machineStatusUpdate is called.
- o.RunCordonOrUncordon should use apierrors.IsNotFound while checking the error returned by the get-node call.
- attemptEvict bool usage is confusing; a better design is needed. attemptEvict is overridden in evictPodsWithoutPv.
- Misleading deep copy in drain.Options.doAccountingOfPvs (see the sketch below):
  for podKey, persistentVolumeList := range pvMap {
      persistentVolumeListDeepCopy := persistentVolumeList
      //...
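A small sketch of the pitfall, assuming the map values are []string as with volMap: assigning a slice copies only the slice header, not the elements.

package example

// copyPVList contrasts a plain slice assignment with an element-wise copy.
func copyPVList(persistentVolumeList []string) []string {
	notACopy := persistentVolumeList // still aliases the same backing array
	_ = notACopy
	deepCopy := make([]string, len(persistentVolumeList))
	copy(deepCopy, persistentVolumeList) // element-wise copy
	return deepCopy
}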
Node Conditions
CloneAndAddCondition
logic seems erroneous ?
VolumeAttachment
func (v *VolumeAttachmentHandler) dispatch(obj interface{}) {
//...
volumeAttachment := obj.(*storagev1.VolumeAttachment)
if volumeAttachment == nil {
klog.Errorf("Couldn't convert to volumeAttachment from object %v", obj)
// Should return here.
}
//...
Dead? reconcileClusterNodeKey
This just delegates to reconcileClusterNode, which does nothing.
func (c *controller) reconcileClusterNode(node *v1.Node) error {
return nil
}
Dead? machine.go | triggerUpdationFlow
Can't find usages
Duplicate Initialization of EventRecorder in MC
pkg/util/provider/app.createRecorder already does this, as shown below.
func createRecorder(kubeClient *kubernetes.Clientset) record.EventRecorder {
eventBroadcaster := record.NewBroadcaster()
eventBroadcaster.StartLogging(klog.Infof)
eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: v1core.New(kubeClient.CoreV1().RESTClient()).Events("")})
return eventBroadcaster.NewRecorder(kubescheme.Scheme, v1.EventSource{Component: controllerManagerAgentName})
}
We get the recorder from this eventBroadcaster and then pass it to the pkg/util/provider/machinecontroller/controller.NewController
method which again does:
eventBroadcaster := record.NewBroadcaster()
eventBroadcaster.StartLogging(klog.Infof)
eventBroadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: typedcorev1.New(controlCoreClient.CoreV1().RESTClient()).Events(namespace)})
The above is useless.
Q? Internal to External Scheme Conversion
Why do we do this ?
internalClass := &machine.MachineClass{}
err := c.internalExternalScheme.Convert(class, internalClass, nil)
if err != nil {
return err
}