Current location: MCM Design Book.

(🚧 Please see the Change Log for new additions/corrections. Please check on 8th Oct for the v3.1 release! 🏗)

Introduction

A Kubernetes Controller is a program that watches for lifecycle events on specific resources and triggers one or more reconcile functions in response. A reconcile function is called with the Namespace and Name of an object corresponding to the resource and its job is to make the object Status match the declared state in the object Spec.

Machine Controller Manager aka MCM is a group of cooperative controllers that manage the lifecycle of worker machines, machine classes, machine sets and machine deployments. All of these objects are custom resources.

  • A worker Machine is a provider-specific VM/instance that corresponds to a k8s Node. (k8s doesn't bring up nodes on its own; the MCM does so by using cloud provider APIs, abstracted by the Driver facade, to bring up machines and map them to nodes.)
  • A MachineClass represents a template that contains cloud provider specific details used to create machines.
  • A MachineSet ensures that the specified number of Machine replicas are running at a given point of time. Analogous to k8s ReplicaSets.
  • A MachineDeployment provides a declarative update for MachineSet and Machines. Analogous to k8s Deployments.

All the custom resources (Machine-* objects) mentioned above are stored in the K8s control cluster. The nodes corresponding to the machines are created and registered in the target cluster.

For productive Gardener deployments, the control cluster is the control plane of the shoot cluster and since the MCM is running in the shoot's control plane, the kubeconfig for the control cluster is generally specified as the In-Cluster Config. The target cluster is the shoot cluster and hence the target cluster config is the shoot kube config.

Project Structure

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TB
    subgraph MCM

    mcm["machine-controller-manager
    (Common MC Code, MachineSet, MachineDeploy controllers)"]
    mcmlo["machine-controller-manager-provider-local
    (Machine Controller Local atop K8s Kind)"]
    mcmaws["machine-controller-manager-provider-aws
    (Machine Controller for AWS)"]
    mcmazure["machine-controller-manager-provider-azure
    (Machine Controller for Azure)"]
    mcmgcp["machine-controller-manager-provider-gcp
    (Machine Controller for GCP)"]
    mcmx["machine-controller-manager-provider-X
    (Machine Controller for equinox/openstack/etc)"]
    end

    mcmlo--uses-->mcm
    mcmaws--uses-->mcm
    mcmazure--uses-->mcm
    mcmgcp--uses-->mcm
    mcmx-->mcm

The MCM project is divided into:

  1. The MCM Module. This contains
    1. The MCM Controller Type and MCM Controller Factory Method. The MCM Controller is responsible for reconciling the MachineDeployment and MachineSet custom resources.
    2. MCM Main which creates and starts the MCM Controller.
    3. The MC Controller Type and MC Controller Factory Method.
      1. The MC Controller implements the reconciliation loop for MachineClass and Machine objects but delegates creation/update/deletion and status retrieval of Machines to the Driver facade.
    4. The Driver facade that abstracts away the lifecycle operations on Machines and obtaining Machine status.
    5. Utility Code leveraged by provider modules.
  2. The provider specific modules named as machine-controller-manager-provider-<providerName>.
    1. Contains a main file located at cmd/machine-controller/main.go that instantiates a Driver implementation (Ex: AWSDriver) and then creates and starts an MC Controller using the MC Controller Factory Method, passing in the Driver impl. In other words, each provider module starts its own independent machine controller.
    2. See MCM README for list of provider modules

The MCM leverages the old-school technique of writing controllers directly using client-go. Skeleton code for client types is generated using client-gen. A barebones example is illustrated in the sample controller.

The Modern Way of writing controllers is by leveraging the Controller Runtime and generating skeletal code for custom controllers using the Kubebuilder Tool.

The MCM has a planned backlog to port the project to the controller runtime. The details of this will be documented in a separate proposal. (TODO: link me in future).

This book describes the current design of the MCM in order to aid code comprehension for development, enhancement and migration/port activities.

Deployment Structure

The MCM Pods are part of the Deployment named machine-controller-manager that resides in the shoot control plane. After logging into the shoot control plane (use gardenctl), you can view the deployment details using k get deploy machine-controller-manager -o yaml. The MCM deployment has two containers:

  1. machine-controller-manager-provider-<provider>. Ex: machine-controller-manager-provider-aws. This container name is a bit misleading as it starts the provider specific machine controller main program responsible for reconciling machine-classes and machines. See Machine Controller. (Ideally the -manager should have been removed)

Container command configured on AWS:

./machine-controller
         --control-kubeconfig=inClusterConfig
         --machine-creation-timeout=20m
         --machine-drain-timeout=2h
         --machine-health-timeout=10m
         --namespace=shoot--i034796--tre
         --port=10259
         --target-kubeconfig=/var/run/secrets/gardener.cloud/shoot/generic-kubeconfig/kubeconfig
  2. <provider>-machine-controller-manager. Ex: aws-machine-controller-manager. This container name is a bit misleading as it starts the machine deployment controller main program responsible for reconciling machine-deployments and machine-sets. (See: TODO: link me). Ideally it should have been called simply machine-deployment-controller as it is provider independent.

Container command configured on AWS

./machine-controller-manager
         --control-kubeconfig=inClusterConfig
         --delete-migrated-machine-class=true
         --machine-safety-apiserver-statuscheck-timeout=30s
         --machine-safety-apiserver-statuscheck-period=1m
         --machine-safety-orphan-vms-period=30m
         --machine-safety-overshooting-period=1m
         --namespace=shoot--i034796--tre
         --port=10258
         --safety-up=2
         --safety-down=1
         --target-kubeconfig=/var/run/secrets/gardener.cloud/shoot/generic-kubeconfig/kubeconfig

Local Development Tips

First read Local Dev MCM

Running MCM Locally

After setting up a shoot cluster in the dev landscape, you can run your local copy of MCM and MC to manage machines in the shoot cluster.

Example for AWS Shoot Cluster:

  1. Checkout https://github.com/gardener/machine-controller-manager and https://github.com/gardener/machine-controller-manager-provider-aws/
  2. cd machine-controller-manager and run ./hack/gardener_local_setup.sh --seed <seedManagingShoot> --shoot <shootName> --project <userId> --provider aws
    • Ex: ./hack/gardener_local_setup.sh --seed aws-ha --shoot aw2 --project i034796 --provider aws
    • The above will set the replica count of the machine-controller-manager deployment in the shoot control plane to 0 and also set the annotation dependency-watchdog.gardener.cloud/ignore-scaling to prevent DWD from scaling it back up. Now, you can run your local dev copy.
  3. Inside the MCM directory run make start
    1. MCM controller should start without errors. Last line should look like: I0920 14:11:28.615699 84778 deployment.go:433] Processing the machinedeployment "shoot--i034796--aw2-a-z1" (with replicas 1)
  4. Change to the provider directory. Ex: cd <checkoutPath>/machine-controller-manager-provider-aws and run make start
    1. MC controller should start without errors. Last line should look like
I0920 14:14:37.793684   86169 core.go:482] List machines request has been processed successfully
I0920 14:14:37.896720   86169 machine_safety.go:59] reconcileClusterMachineSafetyOrphanVMs: End, reSync-Period: 5m0s

Change Log

  • WIP Draft of Orphan/Safety
  • WIP for machine set controller.

Kubernetes Client Facilities

This chapter describes the types and functions provided by k8s core and client modules that are leveraged by the MCM - it only covers what is required to understand MCM code and is simply meant to be a helpful review. References are provided for further reading.

K8s apimachinery

K8s objects

A K8s object represents a persistent entity. When using the K8s client-go framework to define such an object, one should follow these rules:

  1. A Go type representing an object must embed the k8s.io/apimachinery/pkg/apis/meta/v1.ObjectMeta struct. ObjectMeta is metadata that all persisted resources must have, which includes all objects users must create.
  2. A Go type representing a singular object must embed k8s.io/apimachinery/pkg/apis/meta/v1.TypeMeta which describes an individual object in an API response or request, with strings representing the Kind of the object and its API schema version called APIVersion.
graph TB
    subgraph CustomType

    ObjectMeta
    TypeMeta
    end
  3. A Go type representing a list of a custom type must embed k8s.io/apimachinery/pkg/apis/meta/v1.ListMeta (alongside TypeMeta)
graph TB
    subgraph CustomTypeList

    TypeMeta
    ListMeta
    end
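
To make the rules above concrete, here is a hedged sketch of a hypothetical custom type and its list type (Example, ExampleSpec and ExampleStatus are illustrative names, not MCM types; metav1 is the usual alias for k8s.io/apimachinery/pkg/apis/meta/v1):

// Example is a hypothetical custom object: it embeds TypeMeta and ObjectMeta.
type Example struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ExampleSpec   `json:"spec,omitempty"`
	Status ExampleStatus `json:"status,omitempty"`
}

// ExampleList is the corresponding list type: it embeds TypeMeta and ListMeta.
type ExampleList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`

	Items []Example `json:"items"`
}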

TypeMeta

type TypeMeta struct {
	// Kind is a string value representing the REST resource this object represents.
	Kind string 

	// APIVersion defines the versioned schema of this representation of an object.
	APIVersion string
}

ObjectMeta

A snippet of the ObjectMeta struct is shown below for convenience, limited to the MCM-relevant fields used by controller code.

type ObjectMeta struct { //snippet 
    // Name must be unique within a namespace. Is required when creating resources,
    Name string 

    // Namespace defines the space within which each name must be unique. An empty namespace is  equivalent to the "default" namespace,
    Namespace string

    // An opaque value that represents the internal version of this object that can be used by clients to determine when objects have changed.
    ResourceVersion string

	// A sequence number representing a specific generation of the desired state.
	Generation int64 

    // UID is the unique in time and space value for this object. It is typically generated by the API server on successful creation of a resource and is not allowed to change on PUT operations.
    UID types.UID 

    // CreationTimestamp is a timestamp representing the server time when this object was  created.
    CreationTimestamp Time 

    // DeletionTimestamp is RFC 3339 date and time at which this resource will be deleted. This field is set by the server when a graceful deletion is requested by the user.  The resource is expected to be deleted (no longer reachable via APIs) after the time in this field, once the finalizers list is empty.
    DeletionTimestamp *Time


    // Must be empty before the object is deleted from the registry by the API server. Each entry is an identifier for the responsible controller that will remove the entry from the list.
    Finalizers []string 

    // Map of string keys and values that can be used to organize and categorize (scope and select) objects. Valid label keys have two segments: an optional prefix and name, separated by a slash (/).  Meant to be meaningful and relevant to USERS.
    Labels map[string]string

    // Annotations is an unstructured key value map stored with a resource that may be  set by controllers/tools to store and retrieve arbitrary metadata. Meant for TOOLS.
    Annotations map[string]string 

    // References to owner objects. Ex: Pod belongs to its owning ReplicaSet. A Machine belongs to its owning MachineSet.
    OwnerReferences []OwnerReference

    // The name of the cluster which the object belongs to. This is used to distinguish resources with same name and namespace in different clusters.
    ClusterName string

    //... other fields omitted.
}
OwnerReferences

k8s.io/apimachinery/pkg/apis/meta/v1.OwnerReference is a struct that contains TypeMeta fields and a small sub-set of the ObjectMetadata - enough to let you identify an owning object. An owning object must be in the same namespace as the dependent, or be cluster-scoped, so there is no namespace field.

type OwnerReference struct {
   APIVersion string
   Kind string 
   Name string 
   UID types.UID 
   //... other fields omitted. TODO: check for usages.
}
Finalizers and Deletion

Every k8s object has a Finalizers []string field that can be explicitly assigned by a controller. Every k8s object has a DeletionTimestamp *Time that is set by API Server when graceful deletion is requested.

These are part of the k8s.io./apimachinery/pkg/apis/meta/v1.ObjectMeta struct type which is embedded in all k8s objects.

When you tell Kubernetes to delete an object that has finalizers specified for it, the Kubernetes API marks the object for deletion by populating .metadata.deletionTimestamp aka DeletionTimestamp, and returns a 202 status code (HTTP Accepted). The target object remains in a terminating state while the control plane takes the actions defined by the finalizers. After these actions are complete, the controller should remove the relevant finalizers from the target object. When the metadata.finalizers field is empty, Kubernetes considers the deletion complete and deletes the object.
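
A minimal sketch of the finalizer removal described above, assuming a typed client client, a fetched object obj and an illustrative finalizer key myFinalizer (sets is k8s.io/apimachinery/pkg/util/sets):

// Remove our finalizer once cleanup is complete so that deletion can proceed.
finalizers := sets.NewString(obj.Finalizers...)
if finalizers.Has(myFinalizer) {
	finalizers.Delete(myFinalizer)
	obj.Finalizers = finalizers.List()
	if _, err := client.Update(ctx, obj, metav1.UpdateOptions{}); err != nil {
		return err // retried on the next reconcile
	}
}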

Diff between Labels and Annotations

Labels are used in conjunction with selectors to identify groups of related resources and meant to be meaningful to users. Because selectors are used to query labels, this operation needs to be efficient. To ensure efficient queries, labels are constrained by RFC 1123. RFC 1123, among other constraints, restricts labels to a maximum 63 character length. Thus, labels should be used when you want Kubernetes to group a set of related resources. See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ on label key and value restrictions

Annotations are used for “non-identifying information”, i.e. metadata that Kubernetes does not care about. As such, annotation keys and values have no constraints and can include characters not permitted in labels.

RawExtension

k8s.io/apimachinery/pkg/runtime.RawExtension is used to hold extension objects whose structures can be arbitrary. An example of an MCM type that leverages this is MachineClass, whose ProviderSpec field is of type runtime.RawExtension and whose structure varies according to the provider.

One can use a different custom structure type for each extension variant and then decode the field into an instance of that extension structure type using the standard Go json decoder.

Example:

var providerSpec *api.AWSProviderSpec
if err := json.Unmarshal(machineClass.ProviderSpec.Raw, &providerSpec); err != nil {
	// handle the decode error
}


API Errors

k8s.io/apimachinery/pkg/api/errors provides detailed error types and IsXXX methods for k8s API errors.

errors.IsNotFound

k8s.io/apimachinery/pkg/api/errors.IsNotFound returns true if the specified error was due to a k8s object not found. (error or wrapped error created by errors.NewNotFound)

errors.IsTooManyRequests

k8s.io/apimachinery/pkg/api/errors.IsTooManyRequests determines if err (or any wrapped error) is an error which indicates that there are too many requests that the server cannot handle.
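
A hedged usage sketch (illustrative only, not MCM code) that treats a missing Node as a normal condition rather than a failure:

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeExists reports whether the named Node object exists in the target cluster.
func nodeExists(ctx context.Context, cs kubernetes.Interface, name string) (bool, error) {
	_, err := cs.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil // not an error: the node simply isn't there (yet)
	}
	if err != nil {
		return false, err // other API errors (incl. too-many-requests) bubble up
	}
	return true, nil
}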

API Machinery Utilities

wait.Until

k8s.io/apimachinery/pkg/util/wait.Until loops until the stop channel is closed, running f every given period.

func Until(f func(), period time.Duration, stopCh <-chan struct{})
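
A short sketch of typical usage, assuming an illustrative worker function:

stopCh := make(chan struct{})
// Run worker once a second until stopCh is closed.
go wait.Until(worker, time.Second, stopCh)
// ... later, to stop the loop:
close(stopCh)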

wait.PollImmediate

k8s.io/apimachinery/pkg/util/wait.PollImmediate tries a condition func until it returns true, an error, or the timeout is reached.

func PollImmediate(interval, timeout time.Duration, condition ConditionFunc) error
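
A sketch of typical usage, assuming a hypothetical isNodeReady check:

// Poll every 2 seconds, for at most 1 minute, until the condition returns true or errors.
err := wait.PollImmediate(2*time.Second, time.Minute, func() (done bool, err error) {
	return isNodeReady(nodeName)
})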

wait.Backoff

k8s.io/apimachinery/pkg/util/wait.Backoff holds parameters applied to a Backoff function. There are many retry functions in client-go and MCM that take an instance of this struct as parameter.

type Backoff struct {
	Duration time.Duration
	Factor float64
	Jitter float64
	Steps int
	Cap time.Duration
}
  • Duration is the initial backoff duration.
  • Duration is multiplied by Factor for each subsequent iteration.
  • Jitter is the random amount added to each duration (the effective duration lies between Duration and Duration*(1+Jitter)).
  • Steps is the remaining number of iterations in which the duration may increase.
  • Cap is the upper limit: the backoff duration may not exceed this value.
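
A sketch showing how such a Backoff can drive wait.ExponentialBackoff, assuming a hypothetical updateFn:

backoff := wait.Backoff{
	Duration: 100 * time.Millisecond, // initial delay
	Factor:   2.0,                    // double the delay each step
	Jitter:   0.1,                    // add up to 10% random jitter
	Steps:    5,                      // give up after 5 attempts
}
err := wait.ExponentialBackoff(backoff, func() (bool, error) {
	if err := updateFn(); err != nil {
		return false, nil // not done yet; retry after the next backoff interval
	}
	return true, nil
})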

Errors

k8s.io/apimachinery/pkg/util/errors.Aggregate represents an object that contains multiple errors

Use k8s.io/apimachinery/pkg/util/errors.NewAggregate to construct the aggregate error from a slice of errors.

type Aggregate interface {
	error
	Errors() []error
	Is(error) bool
}
func NewAggregate(errlist []error) Aggregate {//...}
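
A sketch of collecting errors and returning them as a single aggregate (volumes and detach are illustrative; utilerrors is the alias for k8s.io/apimachinery/pkg/util/errors):

var errs []error
for _, v := range volumes {
	if err := detach(v); err != nil {
		errs = append(errs, err)
	}
}
// NewAggregate returns nil when the slice is empty, so this is safe to return directly.
return utilerrors.NewAggregate(errs)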

K8s API Core

The MCM leverages several types from https://pkg.go.dev/k8s.io/api/core/v1

Node

k8s.io/api/core/v1.Node represents a worker node in Kubernetes.

type Node struct {
    metav1.TypeMeta
    metav1.ObjectMeta 
    Spec NodeSpec
    // Most recently observed status of the node.
    Status NodeStatus
}

NodeSpec

k8s.io/api/core/v1.NodeSpec describes the attributes that a node is created with. Both Node and MachineSpec use this. A snippet of the MCM-relevant NodeSpec struct fields is shown below for convenience.

type NodeSpec struct {
    // ID of the node assigned by the cloud provider in the format: <ProviderName>://<ProviderSpecificNodeID>
    ProviderID string 
    // podCIDRs represents the IP ranges assigned to the node for usage by Pods on that node.
    PodCIDRs []string 

    // Unschedulable controls node schedulability of new pods. By default, node is schedulable.
    Unschedulable bool

    // Taints represents the Node's Taints. (taint is opposite of affinity. allow a Node to repel pods as opposed to attracting them)
    Taints []Taint
}
Node Taints

See Taints and Tolerations

k8s.io/api/core/v1.Taint is a Kubernetes Node property that enable specific nodes to repel pods. Tolerations are a Kubernetes Pod property that overcome this and allow a pod to be scheduled on a node with a matching taint.

Instead of applying the label to a node, we apply a taint that tells a scheduler to repel Pods from this node if it does not match the taint. Only those Pods that have a toleration for the taint can be let into the node with that taint.

kubectl taint nodes <node name> <taint key>=<taint value>:<taint effect>

Example:

kubectl taint nodes node1 gpu=nvidia:NoSchedule

Users can specify any arbitrary string for the taint key and value. The taint effect defines how a tainted node reacts to a pod without an appropriate toleration. It must be one of the following effects:

  • NoSchedule: The pod will not get scheduled to the node without a matching toleration.

  • NoExecute: This will immediately evict all the pods without a matching toleration from the node.

  • PreferNoSchedule: This is a softer version of NoSchedule where the scheduler will try not to schedule a pod onto the tainted node. However, it is not a strict requirement.

type Taint struct {
	// Key of taint to be applied to a node.
	Key string
	// Value of taint corresponding to the taint key.
	Value string 
	// Effect represents the effect of the taint on pods
	// that do not tolerate the taint.
	// Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
	Effect TaintEffect 
	// TimeAdded represents the time at which the taint was added.
	// It is only written for NoExecute taints.
	// +optional
	TimeAdded *metav1.Time 
}

Example of a PodSpec with toleration below:

apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  labels:
    security: s1
spec:
  containers:
  - name: bear
    image: supergiantkir/animals:bear
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "nvidia"
    effect: "NoSchedule"

Example use case for a taint/tolerance: If you have nodes with special hardware (e.g GPUs) you want to repel Pods that do not need this hardware and attract Pods that do need it. This can be done by tainting the nodes that have the specialized hardware (e.g. kubectl taint nodes nodename gpu=nvidia:NoSchedule ) and adding corresponding toleration to Pods that must use this special hardware.

NodeStatus

See Node status

k8s.io/api/core/v1.NodeStatus represents the current status of a node and is an encapsulation illustrated below:

type NodeStatus struct {
	Capacity ResourceList 
	// Allocatable represents the resources of a node that are available for scheduling. Defaults to Capacity.
	Allocatable ResourceList
	// Conditions is an array of current observed node conditions.
	Conditions []NodeCondition 
	// List of addresses reachable to the node.Queried from cloud provider, if available.
	Addresses []NodeAddress 
	// Set of ids/uuids to uniquely identify the node.
	NodeInfo NodeSystemInfo 
	// List of container images on this node
	Images []ContainerImage 
	// List of attachable volumes that are in use (mounted) by the node.
  //UniqueVolumeName is just typedef for string
	VolumesInUse []UniqueVolumeName 
	// List of volumes that are attached to the node.
	VolumesAttached []AttachedVolume 
}
Capacity

The fields in the Capacity block indicate the total amount of resources that a Node has.

  • Allocatable indicates the amount of resources on a Node that is available to be consumed by normal Pods. Defaults to Capacity.

A Node Capacity is of type k8s.io/api/core/v1.ResourceList which is effectively a set of (resource name, quantity) pairs.

type ResourceList map[ResourceName]resource.Quantity

ResourceNames can be cpu/memory/storage

const (
	// CPU, in cores. (500m = .5 cores)
	ResourceCPU ResourceName = "cpu"
	// Memory, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
	ResourceMemory ResourceName = "memory"
	// Volume size, in bytes (e,g. 5Gi = 5GiB = 5 * 1024 * 1024 * 1024)
	ResourceStorage ResourceName = "storage"
	// Local ephemeral storage, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
	// The resource name for ResourceEphemeralStorage is alpha and it can change across releases.
	ResourceEphemeralStorage ResourceName = "ephemeral-storage"
)

A k8s.io/apimachinery/pkg/api/resource.Quantity is a serializable/de-serializable number with a SI unit
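
A sketch of constructing a ResourceList by hand using resource.MustParse (values are illustrative; corev1 is k8s.io/api/core/v1 and resource is k8s.io/apimachinery/pkg/api/resource):

capacity := corev1.ResourceList{
	corev1.ResourceCPU:    resource.MustParse("4"),    // 4 cores
	corev1.ResourceMemory: resource.MustParse("61Gi"), // 61 GiB
}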

Conditions

Conditions are valid conditions of nodes.

https://pkg.go.dev/k8s.io/api/core/v1.NodeCondition contains condition information for a node.

type NodeCondition struct {
	// Type of node condition.
	Type NodeConditionType 
	// Status of the condition, one of True, False, Unknown.
	Status ConditionStatus 
	// Last time we got an update on a given condition.
	LastHeartbeatTime metav1.Time 
	// Last time the condition transitioned from one status to another.
	LastTransitionTime metav1.Time 
	// (brief) reason for the condition's last transition.
	Reason string 
	// Human readable message indicating details about last transition.
	Message string 
}
NodeConditionType

NodeConditionType is one of the following:

const (
	// NodeReady means kubelet is healthy and ready to accept pods.
	NodeReady NodeConditionType = "Ready"
	// NodeMemoryPressure means the kubelet is under pressure due to insufficient available memory.
	NodeMemoryPressure NodeConditionType = "MemoryPressure"
	// NodeDiskPressure means the kubelet is under pressure due to insufficient available disk.
	NodeDiskPressure NodeConditionType = "DiskPressure"
	// NodePIDPressure means the kubelet is under pressure due to insufficient available PID.
	NodePIDPressure NodeConditionType = "PIDPressure"
	// NodeNetworkUnavailable means that network for the node is not correctly configured.
	NodeNetworkUnavailable NodeConditionType = "NetworkUnavailable"
)

Note: The MCM extends the above with further custom node condition types of its own. Not sure if this is correct - could break later if k8s enforces some validation ?

ConditionStatus

These are valid condition statuses. ConditionTrue means a resource is in the condition. ConditionFalse means a resource is not in the condition. ConditionUnknown means kubernetes can't decide if a resource is in the condition or not.

type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)
Addresses

See Node Addresses

k8s.io/api/core/v1.NodeAddress contains information for the node's address.

type NodeAddress struct {
	// Node address type, one of Hostname, ExternalIP or InternalIP.
	Type NodeAddressType 
	// The node address string.
	Address string 
}
NodeSystemInfo

Describes general information about the node, such as machine id, kernel version, Kubernetes version (kubelet and kube-proxy version), container runtime details, and which operating system the node uses. The kubelet gathers this information from the node and publishes it into the Kubernetes API.

type NodeSystemInfo struct {
	// MachineID reported by the node. For unique machine identification
	// in the cluster this field is preferred. 
	MachineID string 
	// Kernel Version reported by the node from 'uname -r' (e.g. 3.16.0-0.bpo.4-amd64).
	KernelVersion string 
	// OS Image reported by the node from /etc/os-release (e.g. Debian GNU/Linux 7 (wheezy)).
	OSImage string 
	// ContainerRuntime Version reported by the node through runtime remote API (e.g. docker://1.5.0).
	ContainerRuntimeVersion string 
	// Kubelet Version reported by the node.
	KubeletVersion string 
	// KubeProxy Version reported by the node.
	KubeProxyVersion string 
	// The Operating System reported by the node
	OperatingSystem string 
	// The Architecture reported by the node
	Architecture string 
}
  • The MachineID is a single newline-terminated, hexadecimal, 32-character, lowercase ID read from /etc/machine-id.
Images

A slice of k8s.io/api/core/v1.ContainerImage which describes a container image.

type ContainerImage struct {
	Names []string 
	SizeBytes int64 
}
  • Names is the names by which this image is known. e.g. ["kubernetes.example/hyperkube:v1.0.7", "cloud-vendor.registry.example/cloud-vendor/hyperkube:v1.0.7"]
  • SizeBytes is the size of the image in bytes.
Attached Volumes

k8s.io/api/core/v1.AttachedVolume describes a volume attached to a node.

type UniqueVolumeName string
type AttachedVolume struct {
	Name UniqueVolumeName 
	DevicePath string 
}
  • Name is the Name of the attached volume. UniqueVolumeName is just a typedef for a Go string.
  • DevicePath represents the device path where the volume should be available

Pod

The k8s.io/api/core/v1.Pod struct represents a k8s Pod, which is a collection of containers that can run on a host. This resource is created by clients and scheduled onto hosts.

type Pod struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// Specification of the desired behavior of the pod.
	Spec PodSpec 

	// Most recently observed status of the pod.
	Status PodStatus
}

See k8s.io/api/core/v1.PodSpec. Each PodSpec has a priority value where the higher the value, the higher the priority of the Pod.

The PodSpec.Volumes slice is the list of volumes that can be mounted by containers belonging to the pod and is relevant to MC code that attaches/detaches volumes.

type PodSpec struct {
	 Volumes []Volume 
	 //TODO: describe other PodSpec fields used by MC
}

Pod Eviction

A k8s.io/api/policy/v1.Eviction can be used to evict a Pod from its Node - eviction is the graceful termination of Pods on nodes. See API Eviction

type Eviction struct {
	metav1.TypeMeta 

	// ObjectMeta describes the pod that is being evicted.
	metav1.ObjectMeta 

	// DeleteOptions may be provided
	DeleteOptions *metav1.DeleteOptions 
}

Construct the ObjectMeta using the Pod name and namespace, then use an instance of the typed Kubernetes Client Interface to get the PolicyV1Interface, obtain the EvictionInterface and invoke the Evict method.

Example

client.PolicyV1().Evictions(eviction.Namespace).Evict(ctx, eviction)
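
A slightly fuller hedged sketch of the same call, assuming pod, client and ctx are in scope (policyv1 being k8s.io/api/policy/v1):

eviction := &policyv1.Eviction{
	ObjectMeta: metav1.ObjectMeta{
		Name:      pod.Name,
		Namespace: pod.Namespace,
	},
}
// Evict returns an error (e.g. HTTP 429) when the eviction is blocked, for example by a PodDisruptionBudget.
if err := client.PolicyV1().Evictions(eviction.Namespace).Evict(ctx, eviction); err != nil {
	return err
}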

Pod Disruption Budget

A k8s.io/api/policy/v1.PodDisruptionBudget is a struct type that represents a Pod Disruption Budget, which is the max disruption that can be caused to a collection of pods.

type PodDisruptionBudget struct {
	metav1.TypeMeta 
	metav1.ObjectMeta

	// Specification of the desired behavior of the PodDisruptionBudget.
	Spec PodDisruptionBudgetSpec 
	// Most recently observed status of the PodDisruptionBudget.
	Status PodDisruptionBudgetStatus 
}
PodDisruptionBudgetSpec
type PodDisruptionBudgetSpec struct {
	MinAvailable *intstr.IntOrString 
	Selector *metav1.LabelSelector 
	MaxUnavailable *intstr.IntOrString 
}
  • Selector specifies Label query over pods whose evictions are managed by the disruption budget. A null selector will match no pods, while an empty ({}) selector will select all pods within the namespace.
  • An eviction is allowed if at least MinAvailable pods selected by Selector will still be available after the eviction, i.e. even in the absence of the evicted pod. So for example you can prevent all voluntary evictions by specifying "100%".
  • An eviction is allowed if at most MaxUnavailable pods selected by Selector are unavailable after the eviction, i.e. even in absence of the evicted pod. For example, one can prevent all voluntary evictions by specifying 0.
  • MinAvailable is a mutually exclusive setting with MaxUnavailable
PodDisruptionBudgetStatus

See godoc for k8s.io/api/policy/v1.PodDisruptionBudgetStatus

Pod Volumes

A k8s.io/api/core/v1.Volume represents a named volume in a pod - which is a directory that may be accessed by any container in the pod. See Pod Volumes

A Volume has a Name and embeds a VolumeSource as shown below. A VolumeSource represents the location and type of the mounted volume.

type Volume struct {
	Name string 
	VolumeSource
}

A VolumeSource, which represents the source of a volume to mount, should have only ONE of its fields populated. The MC uses the PersistentVolumeClaim field, which is a pointer to a PersistentVolumeClaimVolumeSource representing a reference to a PersistentVolumeClaim in the same namespace.

type VolumeSource struct {
	PersistentVolumeClaim *PersistentVolumeClaimVolumeSource
}
type PersistentVolumeClaimVolumeSource struct  {
	ClaimName string
	ReadOnly bool
}

PersistentVolume

A k8s.io/api/core/v1.PersistentVolume (PV) represents a piece of storage in the cluster. See K8s Persistent Volumes

PersistentVolumeClaim

A k8s.io/api/core/v1.PersistentVolumeClaim represents a user's request for and claim to a persistent volume.

type PersistentVolumeClaim struct {
	metav1.TypeMeta 
	metav1.ObjectMeta
	Spec PersistentVolumeClaimSpec
	Status PersistentVolumeClaimStatus
}
type PersistentVolumeClaimSpec struct {
	StorageClassName *string 
	//...
	VolumeName string
	//...
}

Note that PersistentVolumeClaimSpec.VolumeName, which represents the binding reference to the PersistentVolume backing this claim, is of interest to the MC. Please note that this is different from Pod.Spec.Volumes[*].Name which is more like a label for the volume directory.

Secret

A k8s.io/api/core/v1.Secret holds secret data of a secret type whose size < 1MB. See K8s Secrets

  • Secret data is in Secret.Data which is a map[string][]byte where the value bytes hold the secret data and the key is a simple ASCII alphanumeric string.
type Secret struct {
	metav1.TypeMeta 
	metav1.ObjectMeta 
	Data map[string][]byte 
	Type SecretType 
	//... omitted for brevity
}

SecretType can be one of many types: SecretTypeOpaque which represents user-defined secrets, SecretTypeServiceAccountToken which contains a token that identifies a service account to the API, etc.

client-go

k8s clients have the type k8s.io/client-go/kubernetes.Clientset which is actually a high-level client set facade encapsulating clients for the core, appsv1, discoveryv1, eventsv1, networkingv1, nodev1, policyv1, storagev1 API groups. These individual clients are available via accessor methods, i.e. use clientset.AppsV1() to get the AppsV1 client.

// Clientset contains the clients for groups. Each group has exactly one
// version included in a Clientset.
type Clientset struct {
    appsV1                       *appsv1.AppsV1Client
    coreV1                       *corev1.CoreV1Client
    discoveryV1                  *discoveryv1.DiscoveryV1Client
    eventsV1                     *eventsv1.EventsV1Client
    // ...

  // AppsV1 retrieves the AppsV1Client
  func (c *Clientset) AppsV1() appsv1.AppsV1Interface {
    return c.appsV1
  }
    // ...
}

As can be noticed from the above snippet, each of these clients associated with an API group exposes an interface named <Group><Version>Interface that in turn provides access to a generic REST Interface as well as to a typed interface containing getter/setter methods for objects within that API group.

For example EventsV1Client which is used to interact with features provided by the events.k8s.io group implements EventsV1Interface

type EventsV1Interface interface {
	RESTClient() rest.Interface // generic REST API access
	EventsGetter // typed interface access
}
// EventsGetter has a method to return a EventInterface.
type EventsGetter interface {
	Events(namespace string) EventInterface
}

The ClientSet struct implements the kubernetes.Interface facade which is a high-level facade containing all the methods to access the individual GroupVersionInterface facade clients.

type Interface interface {
	Discovery() discovery.DiscoveryInterface
	AppsV1() appsv1.AppsV1Interface
	CoreV1() corev1.CoreV1Interface
	EventsV1() eventsv1.EventsV1Interface
	NodeV1() nodev1.NodeV1Interface
 // ... other facade accessor
}
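
As a hedged sketch (not MCM code), a Clientset implementing this interface is typically constructed from either an in-cluster config or a kubeconfig path:

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newClientSet builds a typed clientset; with an empty path,
// BuildConfigFromFlags falls back to the in-cluster configuration.
func newClientSet(kubeconfigPath string) (kubernetes.Interface, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(cfg)
}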

One can generate k8s clients for custom k8s objects that follow the same pattern as core k8s objects. The details are not covered here. Please refer to How to generate client codes for Kubernetes Custom Resource Definitions, gengo, k8s.io code-generator and the slightly-old article: Kubernetes Deep Dive: Code Generation for CustomResources

client-go Shared Informers

The vital role of a Kubernetes controller is to watch objects for the desired state and the actual state, then send instructions to make the actual state be more like the desired state. The controller thus first needs to retrieve the object's information. Instead of making direct API calls using k8s listers/watchers, client-go controllers should use SharedInformers.

cache.SharedInformer is a primitive exposed by the client-go lib that maintains a local cache of k8s objects of a particular API group and kind/resource (restrictable by namespace/label/field selectors), which is kept in sync with the authoritative state of the corresponding objects in the API server.

Informers are used to reduce the load pressure on the API Server and etcd.

All that is needed to be known at this point is that Informers internally watch for k8s object changes, update an internal indexed store and invoke registered event handlers. Client code must construct event handlers to inject the logic that one would like to execute when an object is Added/Updated/Deleted.

type SharedInformer interface {
	// AddEventHandler adds an event handler to the shared informer using the shared informer's resync period.  Events to a single handler are delivered sequentially, but there is no coordination between different handlers.
	AddEventHandler(handler cache.ResourceEventHandler)

	// HasSynced returns true if the shared informer's store has been
	// informed by at least one full LIST of the authoritative state
	// of the informer's object collection.  This is unrelated to "resync".
	HasSynced() bool

	// Run starts and runs the shared informer, returning after it stops.
	// The informer will be stopped when stopCh is closed.
	Run(stopCh <-chan struct{})
	//..
}

Note: the resync period tells the informer to re-deliver all objects in its local store to the registered event handlers every time the period expires (it does not trigger a re-list from the API server).

cache.ResourceEventHandler handle notifications for events that happen to a resource.

type ResourceEventHandler interface {
	OnAdd(obj interface{})
	OnUpdate(oldObj, newObj interface{})
	OnDelete(obj interface{})
}

cache.ResourceEventHandlerFuncs is an adapter that lets you easily specify as many or as few of the notification functions as you want while still implementing ResourceEventHandler. Nearly all controller code uses an instance of this adapter struct to create event handlers to register on shared informers.

type ResourceEventHandlerFuncs struct {
	AddFunc    func(obj interface{})
	UpdateFunc func(oldObj, newObj interface{})
	DeleteFunc func(obj interface{})
}

Shared informers for standard k8s objects can be obtained using the k8s.io/client-go/informers.NewSharedInformerFactory or one of the variant factory methods in the same package.

Informers and their factory functions for custom k8s objects are usually found in the generated factory code usually in a factory.go file.

See github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/externalversions.NewSharedInformerFactory
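
A hedged sketch of wiring up a shared informer for core objects (resync period and handler bodies are illustrative; clientset is a kubernetes.Interface, informers is k8s.io/client-go/informers and cache is k8s.io/client-go/tools/cache):

factory := informers.NewSharedInformerFactory(clientset, 12*time.Hour)
nodeInformer := factory.Core().V1().Nodes()

nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { /* enqueue object key */ },
	UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue object key */ },
	DeleteFunc: func(obj interface{}) { /* enqueue object key */ },
})

stopCh := make(chan struct{})
factory.Start(stopCh)                                             // starts all requested informers
cache.WaitForCacheSync(stopCh, nodeInformer.Informer().HasSynced) // wait for the initial LIST to complete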

client-go workqueues

The basic workqueue.Interface has the following methods:

type Interface interface {
	Add(item interface{})
	Len() int
	Get() (item interface{}, shutdown bool)
	Done(item interface{})
	ShutDown()
	ShutDownWithDrain()
	ShuttingDown() bool
}

This is extended with the ability to add an item at a later time using the workqueue.DelayingInterface.

type DelayingInterface interface {
	Interface
	// AddAfter adds an item to the workqueue after the indicated duration has passed
	// Used to requeue items after failures to avoid ending up in a hot loop
	AddAfter(item interface{}, duration time.Duration)
}

This is further extended with rate limiting via workqueue.RateLimitingInterface, which consults a workqueue.RateLimiter:

type RateLimiter interface {
 	// When gets an item and gets to decide how long that item should wait
	When(item interface{}) time.Duration
	// Forget indicates that an item is finished being retried.  Doesn't matter whether its for perm failing
	// or for success, we'll stop tracking it
	Forget(item interface{})
	// NumRequeues returns back how many failures the item has had
	NumRequeues(item interface{}) int
}

client-go controller steps

The basic high-level contract for a client-go controller leveraging work queues is as follows:

  1. Create rate-limited work queue(s) using workqueue.NewNamedRateLimitingQueue
  2. Define lifecycle callback functions (Add/Update/Delete) which accept k8s objects and enqueue k8s object keys (namespace/name) on these rate-limited work queue(s).
  3. Create informers using the shared informer factory functions.
  4. Add event handlers to the informers specifying these callback functions.
    1. When informers are started, they will invoke the appropriate registered callbacks when k8s objects are added/updated/deleted.
  5. The controller Run loop then picks up objects from the work queue using Get and reconciles them by invoking the appropriate reconcile function, ensuring that Done is called after reconcile to mark it as done processing.

Example:

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

CreateWorkQueue["machineQueue:=workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), 'machine')"]
-->
CreateInformer["machineInformerFactory := externalversions.NewSharedInformerFactory(...)
machineInformer := machineInformerFactory.Machine().V1alpha1().Machines()"]
-->
AddEventHandler["machineInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    controller.machineToMachineClassAdd,
		UpdateFunc: controller.machineToMachineClassUpdate,
		DeleteFunc: controller.machineToMachineClassDelete,
	})"]
-->Run["while (!shutdown) {
  key, shutdown := queue.Get()
  defer queue.Done(key)
  reconcile(key.(string))
}
"]
-->Z(("End"))

A more elaborate example of the basic client-go controller flow is demonstrated in the client-go workqueue example
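
A hedged sketch of the worker/reconcile loop from step 5, reduced to its essentials (queue and reconcile are illustrative names; workqueue is k8s.io/client-go/util/workqueue):

func worker(queue workqueue.RateLimitingInterface, reconcile func(key string) error) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		func() {
			defer queue.Done(item) // always mark the item as processed
			key := item.(string)
			if err := reconcile(key); err != nil {
				queue.AddRateLimited(key) // requeue with backoff on failure
				return
			}
			queue.Forget(key) // clear the rate-limiting history on success
		}()
	}
}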

client-go utilities

k8s.io/client-go/util/retry.RetryOnConflict

func RetryOnConflict(backoff wait.Backoff, fn func() error) error
  • retry.RetryOnConflict is used to make an update to a resource when you have to worry about conflicts caused by other code making unrelated updates to the resource at the same time.
  • fn should fetch the resource to be modified, make appropriate changes to it, try to update it, and return (unmodified) the error from the update function.
  • On a successful update, RetryOnConflict will return nil. If the update fn returns a Conflict error, RetryOnConflict will wait some amount of time as described by backoff, and then try again.
  • On a non-Conflict error, or if it retries too many times (backoff.Steps has reached zero) and gives up, RetryOnConflict will return an error to the caller.
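
A hedged sketch of typical usage, labelling a Node with conflict retries (cs, ctx and nodeName are assumed to be in scope; retry is k8s.io/client-go/util/retry):

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	// Re-fetch the latest version of the object on every attempt.
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels["example.io/managed"] = "true"
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err // a Conflict error here triggers another retry
})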

References

MCM Facilities

This chapter describes the core types and utilities present in the MCM module and used by the controllers and drivers.

Machine Controller Core Types

The relevant machine types managed by the MCM controllers reside in github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1. This follows the standard location for client-gen types: <module>/pkg/apis/<group>/<version>.

Example: Machine type is pkg/apis/machine/v1alpha1.Machine

Machine

Machine is the representation of a physical or virtual machine that corresponds to a front-end k8s node object. An example YAML looks like the one below:

apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  name: test-machine
  namespace: default
spec:
  class:
    kind: MachineClass
    name: test-class

A Machine has a Spec field represented by MachineSpec

type Machine struct {
	// ObjectMeta for machine object
	metav1.ObjectMeta 

	// TypeMeta for machine object
	metav1.TypeMeta 

	// Spec contains the specification of the machine
	Spec MachineSpec 

	// Status contains fields depicting the status
	Status MachineStatus 
}
graph TB
    subgraph Machine
    ObjectMeta
    TypeMeta
    MachineSpec
    MachineStatus
    end

MachineSpec

MachineSpec represents the specification of a Machine.

type MachineSpec struct {

	// Class is the referent to the MachineClass. 
	Class ClassSpec 

    // Unique identification of the VM at the cloud provider
	ProviderID string 

	// NodeTemplateSpec describes the data a node should have when created from a template
	NodeTemplateSpec NodeTemplateSpec 

	// Configuration for the machine-controller.  
	*MachineConfiguration 
}
type NodeTemplateSpec struct {  // wraps a NodeSpec with ObjectMeta.
	metav1.ObjectMeta

	// NodeSpec describes the attributes that a node is created with.
	Spec corev1.NodeSpec
}
  • ProviderID is the unique identification of the VM at the cloud provider. ProviderID typically matches node.Spec.ProviderID on the node object.
  • The Class field is of type ClassSpec which is just the (Kind and Name) reference to the MachineClass. (Ideally the field/type should have been called ClassReference, like OwnerReference.)
  • NodeTemplateSpec describes the data a node should have when created from a template; it embeds ObjectMeta and holds a corev1.NodeSpec in its Spec field.
    • The Machine.Spec.NodeTemplateSpec.Spec mirrors k8s Node.Spec
  • MachineSpec embeds a MachineConfiguration which is just a configuration object that is a collection of timeouts, MaxEvictRetries and NodeConditions
graph TB
    subgraph MachineSpec
	Class:ClassSpec
	ProviderID:string
	NodeTemplateSpec:NodeTemplateSpec
	MachineConfiguration
    end
type MachineConfiguration struct {
	// MachineDrainTimeout is the timeout after which the machine is forcefully deleted.
	MachineDrainTimeout *Duration

	// MachineHealthTimeout is the timeout after which the machine is declared unhealthy/failed.
	MachineHealthTimeout *Duration 

	// MachineCreationTimeout is the timeout after which machine creation is declared failed.
	MachineCreationTimeout *Duration 

	// MaxEvictRetries is the number of retries that will be attempted while draining the node.
	MaxEvictRetries *int32 

	// NodeConditions are the set of conditions if set to true for MachineHealthTimeOut, machine will be declared failed.
	NodeConditions *string 
}

MachineStatus

pkg/apis/machine/v1alpha1.MachineStatus represents the most recently observed status of Machine.

type MachineStatus struct {

	// Conditions of this machine, same as NodeStatus.Conditions
	Conditions []NodeCondition 

	// Last operation refers to the status of the last operation performed. NOTE: this is usually the NextOperation for reconcile!! Discuss!
	LastOperation LastOperation 

	// Current status of the machine object
	CurrentStatus CurrentStatus

	// LastKnownState can store details of the last known state of the VM by the plugins.
	// It can be used by future operation calls to determine current infrastucture state
	LastKnownState string 
}

Extended NodeConditionTypes

The MCM extends standard k8s NodeConditionType with several custom conditions. (TODO: isn't this hacky/error prone?)

NodeTerminationCondition

LastOperation

github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1.LastOperation represents the last operation performed on the object. It can sometimes mean the next operation to be performed on the machine if the machine operation state is Processing!

type LastOperation struct {
	// Description of the operation
	Description string 

	// Last update time of operation
	LastUpdateTime Time 

	// State of operation (bad naming)
	State MachineState 

	// Type of operation
	Type MachineOperationType 
}

MachineState (should be called MachineOperationState)

NOTE: BADLY NAMED: Should be called MachineOperationState

github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1.MachineState represents the current state of a machine operation and is one of Processing, Failed or Successful.

// MachineState is  current state of the machine.
// BAD Name: Should be MachineOperationState
type MachineState string

// These are the valid (operation) states of machines.
const (
	// MachineStateProcessing means there is an operation currently being processed on this machine
	MachineStateProcessing MachineState = "Processing"

	// MachineStateFailed means operation failed leading to machine status failure
	MachineStateFailed MachineState = "Failed"

	// MachineStateSuccessful means the last operation on this machine completed successfully
	MachineStateSuccessful MachineState = "Successful"
)
MachineOperationType

github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1.MachineOperationType is a label for the operation performed on a machine object: Create/Update/HealthCheck/Delete.

type MachineOperationType string
const (
	// MachineOperationCreate indicates that the operation is a create
	MachineOperationCreate MachineOperationType = "Create"

	// MachineOperationUpdate indicates that the operation is an update
	MachineOperationUpdate MachineOperationType = "Update"

	// MachineOperationHealthCheck indicates that the operation is a health check
	MachineOperationHealthCheck MachineOperationType = "HealthCheck"

	// MachineOperationDelete indicates that the operation is a delete
	MachineOperationDelete MachineOperationType = "Delete"
)

CurrentStatus

github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1.CurrentStatus encapsulates information about the current status of a Machine.

type CurrentStatus struct {
	// Phase refers to the Machine (Lifecycle) Phase
	Phase MachinePhase 
	// TimeoutActive when set to true drives the machine controller
	// to check whether	machine failed the configured creation timeout or health check timeout and change machine phase to Failed.
	TimeoutActive bool 
	// Last update time of current status
	LastUpdateTime Time
}
MachinePhase

MachinePhase is a label for the life-cycle phase of a machine at a given time: Unknown, Pending, Available, Running, Terminating, Failed or CrashLoopBackOff.

type MachinePhase string
const (
	// MachinePending means that the machine is being created
	MachinePending MachinePhase = "Pending"

	// MachineAvailable means that machine is present on provider but hasn't joined cluster yet
	MachineAvailable MachinePhase = "Available"

	// MachineRunning means node is ready and running successfully
	MachineRunning MachinePhase = "Running"

	// MachineTerminating means the machine is terminating
	MachineTerminating MachinePhase = "Terminating"

	// MachineUnknown indicates that the node is not ready at the moment
	MachineUnknown MachinePhase = "Unknown"

	// MachineFailed means operation failed leading to machine status failure
	MachineFailed MachinePhase = "Failed"

	// MachineCrashLoopBackOff means creation or deletion of the machine is failing.
	MachineCrashLoopBackOff MachinePhase = "CrashLoopBackOff"
)

Finalizers

See K8s Finalizers. The MC defines the following finalizer keys

const (
	MCMFinalizerName = "machine.sapcloud.io/machine-controller-manager"
	MCFinalizerName = "machine.sapcloud.io/machine-controller"
)
  1. MCMFinalizerName is the finalizer used to tag dependencies before deletion
  2. MCFinalizerName is the finalizer added on Secret objects.

MachineClass

A MachineClass is used to templatize and re-use provider configuration across multiple Machines, MachineSets or MachineDeployments.

Example MachineClass YAML for AWS provider
apiVersion: machine.sapcloud.io/v1alpha1
credentialsSecretRef:
  name: cloudprovider
  namespace: shoot--i544024--hana-test
kind: MachineClass
metadata:
  creationTimestamp: "2022-10-20T13:08:07Z"
  finalizers:
  - machine.sapcloud.io/machine-controller-manager
  generation: 1
  labels:
    failure-domain.beta.kubernetes.io/zone: eu-west-1c
  name: shoot--i544024--hana-test-whana-z1-b0f23
  namespace: shoot--i544024--hana-test
  resourceVersion: "38424578"
  uid: 656f863e-5061-420f-8710-96dcc9777be4
nodeTemplate:
  capacity:
    cpu: "4"
    gpu: "1"
    memory: 61Gi
  instanceType: p2.xlarge
  region: eu-west-1
  zone: eu-west-1c
provider: AWS
providerSpec:
  ami: ami-0c3484dbcde4c4d0c
  blockDevices:
  - ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 50
      volumeType: gp3
  iam:
    name: shoot--i544024--hana-test-nodes
  keyName: shoot--i544024--hana-test-ssh-publickey
  machineType: p2.xlarge
  networkInterfaces:
  - securityGroupIDs:
    - sg-0445497aa49ddecb2
    subnetID: subnet-04a338730d20ea601
  region: eu-west-1
  srcAndDstChecksEnabled: false
  tags:
    kubernetes.io/arch: amd64
    kubernetes.io/cluster/shoot--i544024--hana-test: "1"
    kubernetes.io/role/node: "1"
    networking.gardener.cloud/node-local-dns-enabled: "true"
    node.kubernetes.io/role: node
    worker.garden.sapcloud.io/group: whana
    worker.gardener.cloud/cri-name: containerd
    worker.gardener.cloud/pool: whana
    worker.gardener.cloud/system-components: "true"
secretRef:
  name: shoot--i544024--hana-test-whana-z1-b0f23
  namespace: shoot--i544024--hana-test

Type definition for MachineClass shown below

type MachineClass struct {
	metav1.TypeMeta 
	metav1.ObjectMeta 

	ProviderSpec runtime.RawExtension `json:"providerSpec"`
	SecretRef *corev1.SecretReference `json:"secretRef,omitempty"`
	CredentialsSecretRef *corev1.SecretReference
	Provider string 


	NodeTemplate *NodeTemplate 
}

Notes

SecretRef and CredentialsSecretRef

Both these fields in the MachineClass are of type k8s.io/api/core/v1.SecretReference.

Generally, when there is some operation to be performed on the machine, the K8s Secret objects are obtained using the k8s.io/client-go/listers/core/v1.SecretLister. The corresponding Secret.Data map fields are retrieved and then merged into a single map, as sketched below. (Design TODO: Why do we do this and why can't we use just one secret?)
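
A minimal sketch of such a merge (hedged; not the exact MCM helper), where secretRefSecret and credentialsSecret are the two fetched Secret objects:

merged := map[string][]byte{}
for k, v := range secretRefSecret.Data {
	merged[k] = v
}
for k, v := range credentialsSecret.Data {
	merged[k] = v // keys present in both are taken from the credentials secret
}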

Snippet of AWS MachineClass containing CredentialsSecretRef and SecretRef

apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineClass
credentialsSecretRef:
  name: cloudprovider
  namespace: shoot--i034796--tre
secretRef:
  name: shoot--i034796--tre-worker-q3rb4-z1-30c4a
  namespace: shoot--i034796--tre
//...
  • MachineClass.CredentialsSecretRef is a reference to a secret that generally holds the cloud provider credentials. For the example above the corresponding Secret holds the accessKeyID and secretAccessKey. k get secret cloudprovider -n shoot--i034796--tre -oyaml
apiVersion: v1
kind: Secret
data:
accessKeyID: QUtJQTZ...
secretAccessKey: dm52MX...
  • MachineClass.SecretRef points to a Secret whose Secret.Data contains a userData entry that has a lot of info.
  • MachineClass.Provider is specified as the combination of name and location of cloud-specific drivers. (Unsure if this is actually being followed, as for AWS & Azure this is set to simply AWS and Azure respectively)

NodeTemplate

A NodeTemplate as shown below

  1. Capacity is of type k8s.io/api/core/v1.ResourceList which is effectively a map of (resource name, quantity) pairs. Similar to Node Capacity, it has keys like cpu, gpu, memory, storage, etc., with values represented by resource.Quantity.
  2. InstanceType is the provider-specific instance type of the node belonging to the node group. For AWS this would be the EC2 instance type, e.g. p2.xlarge.
  3. Region: provider-specific region name like eu-west-1 for AWS.
  4. Zone: provider-specific Availability Zone like eu-west-1c for AWS.
type NodeTemplate struct {
	Capacity corev1.ResourceList 
	InstanceType string 
	Region string 
	Zone string 
}

Example

nodeTemplate:
  capacity:
    cpu: "4"
    gpu: "1"
    memory: 61Gi
  instanceType: p2.xlarge
  region: eu-west-1
  zone: eu-west-1c

MachineSet

A MachineSet is to a Machine what a ReplicaSet is to a Pod. A MachineSet ensures that the specified number of Machines are running at any given time. A MachineSet is rarely created directly. It is generally owned by its parent MachineDeployment and its ObjectMeta.OwnerReferences slice has a reference to the parent deployment.

MachineSet struct is defined as follows:

type MachineSet struct {
	metav1.ObjectMeta 
	metav1.TypeMeta

	Spec MachineSetSpec 
	Status MachineSetStatus 
}

MachineSetSpec

MachineSetSpec is the specification of a MachineSet.

type MachineSetSpec struct {
	Replicas int32 
	Selector *metav1.LabelSelector 
	MinReadySeconds int32 
	MachineClass ClassSpec 
	Template MachineTemplateSpec // Am I used ?
}
type ClassSpec struct {
	APIGroup string 
	Kind string 
	Name string 
}
type MachineTemplateSpec struct { 
	metav1.ObjectMeta 
	Spec MachineSpec
}
  • MachineSetSpec.Replicas is the number of desired replicas.
  • MachineSetSpec.Selector is a label query over machines that should match the replica count. See Label Selectors
  • MachineSetSpec.MinReadySeconds - TODO: unsure ? Minimum number of seconds for which a newly created Machine should be ready for it to be considered as available ? (guessing here - can't find the code using this). Usually specified as 500 (i.e. in the YAML)
  • MachineSetSpec.MachineClass is an instance of type ClassSpec which is a reference type to the MachineClass.
  • TODO: Discuss whether needed. MachineSetSpec.Template is an instance of MachineTemplateSpec which is an encapsulation over MachineSpec. I don't see this used since MachineSetSpec.MachineClass is already a reference to a MachineClass which by definition is a template.

Example MachineSet YAML

AWS MachineSet YAML
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineSet
metadata:
  annotations:
    deployment.kubernetes.io/desired-replicas: "1"
    deployment.kubernetes.io/max-replicas: "2"
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-10-20T13:53:01Z"
  finalizers:
  - machine.sapcloud.io/machine-controller-manager
  generation: 1
  labels:
    machine-template-hash: "2415498538"
    name: shoot--i034796--tre-worker-q3rb4-z1
  name: shoot--i034796--tre-worker-q3rb4-z1-68598
  namespace: shoot--i034796--tre
  ownerReferences:
  - apiVersion: machine.sapcloud.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: MachineDeployment
    name: shoot--i034796--tre-worker-q3rb4-z1
    uid: f20cba25-2fdb-4315-9e38-d301f5f08459
  resourceVersion: "38510891"
  uid: b0f12abc-2d19-4a6e-a69d-4bf4aa12d02b
spec:
  minReadySeconds: 500
  replicas: 1
  selector:
    matchLabels:
      machine-template-hash: "2415498538"
      name: shoot--i034796--tre-worker-q3rb4-z1
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine-template-hash: "2415498538"
        name: shoot--i034796--tre-worker-q3rb4-z1
    spec:
      class:
        kind: MachineClass
        name: shoot--i034796--tre-worker-q3rb4-z1-30c4a
      nodeTemplate:
        metadata:
          creationTimestamp: null
          labels:
            kubernetes.io/arch: amd64
            networking.gardener.cloud/node-local-dns-enabled: "true"
            node.kubernetes.io/role: node
            topology.ebs.csi.aws.com/zone: eu-west-1b
            worker.garden.sapcloud.io/group: worker-q3rb4
            worker.gardener.cloud/cri-name: containerd
            worker.gardener.cloud/pool: worker-q3rb4
            worker.gardener.cloud/system-components: "true"
        spec: {}
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  lastOperation:
    lastUpdateTime: "2022-10-20T13:55:23Z"
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1

MachineDeployment

A MachineDeployment is to a MachineSet what a Deployment is to a ReplicaSet.

A MachineDeployment manages MachineSets and enables declarative updates for the machines in MachineSets.

MachineDeployment struct is defined as follows

type MachineDeployment struct {
	metav1.TypeMeta 
	metav1.ObjectMeta 
	Spec MachineDeploymentSpec 
	// Most recently observed status of the MachineDeployment.
	// +optional
	Status MachineDeploymentStatus 
}

MachineDeploymentSpec

MachineDeploymentSpec is the specification of the desired behavior of the MachineDeployment.

type MachineDeploymentSpec struct {
	Replicas int32 
	Selector *metav1.LabelSelector
	Template MachineTemplateSpec 
	MinReadySeconds int32 
	Paused bool 
	ProgressDeadlineSeconds *int32 
	Strategy MachineDeploymentStrategy 
}
  • Replicas is number of desired machines.
  • Selector is the label selector for machines. Usually a name label to select machines with a name matching the deployment name.
  • Template of type MachineTemplateSpec which is an encapsulation over MachineSpec.
  • MinReadySeconds is the Minimum number of seconds for which a newly created machine should be ready without any of its container crashing, for it to be considered available.
  • Paused indicates that the MachineDeployment is paused and will not be processed by the MCM controller.
  • Strategy is the MachineDeploymentStrategy strategy to use to replace existing machines with new ones.
  • ProgressDeadlineSeconds maximum time in seconds for a MachineDeployment to make progress before it is considered to be failed. The MachineDeployment controller will continue to process failed MachineDeployments and a condition with a ProgressDeadlineExceeded reason will be surfaced in the MachineDeployment status.

MachineDeploymentStrategy

MachineDeploymentStrategy describes how to replace existing machines with new ones.

type MachineDeploymentStrategy struct {
	Type MachineDeploymentStrategyType
	RollingUpdate *RollingUpdateMachineDeployment
}

type MachineDeploymentStrategyType string
const (
	RecreateMachineDeploymentStrategyType MachineDeploymentStrategyType = "Recreate"
	RollingUpdateMachineDeploymentStrategyType MachineDeploymentStrategyType = "RollingUpdate"
)

type RollingUpdateMachineDeployment struct {
	MaxUnavailable *intstr.IntOrString
	MaxSurge *intstr.IntOrString
}

  1. Type is one of the MachineDeploymentStrategyType constants.
    1. RecreateMachineDeploymentStrategyType is the strategy that kills all existing machines before creating new ones.
    2. RollingUpdateMachineDeploymentStrategyType is the strategy that replaces the old machines with new ones using a rolling update, i.e. it gradually scales down the old machines and scales up the new ones.
  2. RollingUpdate is the rolling update config params represented by RollingUpdateMachineDeployment. This is analogous to k8s.io/api/apps/v1.RollingUpdateDeployment which also has MaxUnavailable and MaxSurge.
    1. RollingUpdateMachineDeployment.MaxUnavailable is the maximum number of machines that can be unavailable during the update.
    2. RollingUpdateMachineDeployment.MaxSurge is the maximum number of machines that can be scheduled above the desired number of machines.

Example MachineDeployment YAML

AWS MachineDeployment YAML
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineDeployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-10-20T13:53:01Z"
  finalizers:
  - machine.sapcloud.io/machine-controller-manager
  generation: 1
  name: shoot--i034796--tre-worker-q3rb4-z1
  namespace: shoot--i034796--tre
  resourceVersion: "38510892"
  uid: f20cba25-2fdb-4315-9e38-d301f5f08459
spec:
  minReadySeconds: 500
  replicas: 1
  selector:
    matchLabels:
      name: shoot--i034796--tre-worker-q3rb4-z1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: shoot--i034796--tre-worker-q3rb4-z1
    spec:
      class:
        kind: MachineClass
        name: shoot--i034796--tre-worker-q3rb4-z1-30c4a
      nodeTemplate:
        metadata:
          creationTimestamp: null
          labels:
            kubernetes.io/arch: amd64
            networking.gardener.cloud/node-local-dns-enabled: "true"
            node.kubernetes.io/role: node
            topology.ebs.csi.aws.com/zone: eu-west-1b
            worker.garden.sapcloud.io/group: worker-q3rb4
            worker.gardener.cloud/cri-name: containerd
            worker.gardener.cloud/pool: worker-q3rb4
            worker.gardener.cloud/system-components: "true"
        spec: {}
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-10-20T13:55:23Z"
    lastUpdateTime: "2022-10-20T13:55:23Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

VolumeAttachment

A k8s.io/api/storage/v1.VolumeAttachment is a non-namespaced object that captures the intent to attach or detach the specified volume to/from the specified node (specified in VolumeAttachmentSpec.NodeName).

type VolumeAttachment struct {
	metav1.TypeMeta 
	metav1.ObjectMeta 

	// Specification of the desired attach/detach volume behavior.
	// Populated by the Kubernetes system.
	Spec VolumeAttachmentSpec 

	// Status of the VolumeAttachment request.
	// Populated by the entity completing the attach or detach
	// operation, i.e. the external-attacher.
	Status VolumeAttachmentStatus 
}

VolumeAttachmentSpec

A k8s.io/api/storage/v1.VolumeAttachmentSpec is the specification of a VolumeAttachment request.

type VolumeAttachmentSpec struct {
	// Attacher indicates the name of the volume driver that MUST handle this
	// request. Same as CSI Plugin name
	Attacher string 

	// Source represents the volume that should be attached.
	Source VolumeAttachmentSource 

	// The node that the volume should be attached to.
	NodeName string
}

See Storage V1 Docs for further elaboration.

Utilities

NodeOps

nodeops.GetNodeCondition

machine-controller-manager/pkg/util/nodeops.GetNodeCondition gets the node's condition matching the specified type.

func GetNodeCondition(ctx context.Context, c clientset.Interface, nodeName string, conditionType v1.NodeConditionType) (*v1.NodeCondition, error) {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	return getNodeCondition(node, conditionType), nil
}
func getNodeCondition(node *v1.Node, conditionType v1.NodeConditionType) *v1.NodeCondition {
	for _, cond := range node.Status.Conditions {
		if cond.Type == conditionType {
			return &cond
		}
	}
	return nil
}
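A minimal usage sketch (the helper name isNodeReady and the client wiring are illustrative, not part of the package) that uses GetNodeCondition to check whether a node currently reports Ready=True:

// isNodeReady is a hypothetical helper built on top of nodeops.GetNodeCondition.
func isNodeReady(ctx context.Context, c clientset.Interface, nodeName string) (bool, error) {
	cond, err := nodeops.GetNodeCondition(ctx, c, nodeName, v1.NodeReady)
	if err != nil {
		return false, err
	}
	return cond != nil && cond.Status == v1.ConditionTrue, nil
}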

nodeops.CloneAndAddCondition

machine-controller-manager/pkg/util/nodeops.CloneAndAddCondition adds a condition to the node condition slice, replacing any existing condition of the same type. If the existing condition has the same status and reason, its LastTransitionTime is preserved on the new condition.

func CloneAndAddCondition(conditions []v1.NodeCondition, condition v1.NodeCondition) []v1.NodeCondition {
	if condition.Type == "" || condition.Status == "" {
		return conditions
	}
	var newConditions []v1.NodeCondition

	for _, existingCondition := range conditions {
		if existingCondition.Type != condition.Type { 
			newConditions = append(newConditions, existingCondition)
		} else { 
			// condition with this type already exists
			if existingCondition.Status == condition.Status &&
				existingCondition.Reason == condition.Reason {
				// condition status and reason are  the same, keep existing transition time
				condition.LastTransitionTime = existingCondition.LastTransitionTime
			}
		}
	}
	newConditions = append(newConditions, condition)
	return newConditions
}

Note: an existing condition of the matching type is dropped from newConditions (only the new condition is appended at the end), so no duplicate conditions are created; when status and reason are unchanged, only the original LastTransitionTime is carried over.

nodeops.AddOrUpdateConditionsOnNode

machine-controller-manager/pkg/util/nodeops.AddOrUpdateConditionsOnNode adds a condition to the node's status, retrying on conflict with backoff.

func AddOrUpdateConditionsOnNode(ctx context.Context, c clientset.Interface, nodeName string, condition v1.NodeCondition) error 
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

Init["backoff = wait.Backoff{
	Steps:    5,
	Duration: 100 * time.Millisecond,
	Jitter:   1.0,
}"]
-->retry.RetryOnConflict["retry.RetryOnConflict(backoff, fn)"]
-->GetNode["oldNode, err = c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})"]
subgraph "fn"
    GetNode-->ChkIfErr{"err != nil"}
	ChkIfErr--Yes-->ReturnErr(("return err"))
	ChkIfErr--No-->InitNewNode["newNode := oldNode.DeepCopy()"]
	InitNewNode-->InitConditions["newNode.Status.Conditions:= CloneAndAddCondition(newNode.Status.Conditions, condition)"]
	-->UpdateNewNode["_, err := c.CoreV1().Nodes().UpdateStatus(ctx, newNodeClone, metav1.UpdateOptions{}"]
	-->ReturnErr
end
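The flowchart above corresponds roughly to the following Go sketch (assumed to mirror the implementation in pkg/util/nodeops; retry.RetryOnConflict comes from k8s.io/client-go/util/retry):

func AddOrUpdateConditionsOnNode(ctx context.Context, c clientset.Interface, nodeName string, condition v1.NodeCondition) error {
	backoff := wait.Backoff{
		Steps:    5,
		Duration: 100 * time.Millisecond,
		Jitter:   1.0,
	}
	return retry.RetryOnConflict(backoff, func() error {
		oldNode, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		newNode := oldNode.DeepCopy()
		newNode.Status.Conditions = CloneAndAddCondition(newNode.Status.Conditions, condition)
		_, err = c.CoreV1().Nodes().UpdateStatus(ctx, newNode, metav1.UpdateOptions{})
		return err
	})
}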

MachineUtils

Operation Descriptions

machineutils has a bunch of constants that are descriptions of machine operations, set into machine.Status.LastOperation.Description by the machine controller while performing reconciliation. It also has some reason phrases and annotation constants.

const (
	// GetVMStatus sets machine status to terminating and specifies next step as getting VMs
	GetVMStatus = "Set machine status to termination. Now, getting VM Status"

	// InitiateDrain specifies next step as initiate node drain
	InitiateDrain = "Initiate node drain"

	// InitiateVMDeletion specifies next step as initiate VM deletion
	InitiateVMDeletion = "Initiate VM deletion"

	// InitiateNodeDeletion specifies next step as node object deletion
	InitiateNodeDeletion = "Initiate node object deletion"

	// InitiateFinalizerRemoval specifies next step as machine finalizer removal
	InitiateFinalizerRemoval = "Initiate machine object finalizer removal"

	// LastAppliedALTAnnotation contains the last configuration of annotations, 
	// labels & taints applied on the node object
	LastAppliedALTAnnotation = "node.machine.sapcloud.io/last-applied-anno-labels-taints"

	// MachinePriority is the annotation used to specify priority
	// associated with a machine while deleting it. The less its
	// priority the more likely it is to be deleted first
	// Default priority for a machine is set to 3
	MachinePriority = "machinepriority.machine.sapcloud.io"

	// MachineClassKind is used to identify the machineClassKind for generic machineClasses
	MachineClassKind = "MachineClass"

	// MigratedMachineClass annotation helps in identifying machineClasses who have been migrated by migration controller
	MigratedMachineClass = "machine.sapcloud.io/migrated"

	// NotManagedByMCM annotation helps in identifying the nodes which are not handled by MCM
	NotManagedByMCM = "node.machine.sapcloud.io/not-managed-by-mcm"

	// TriggerDeletionByMCM annotation on the node would trigger the deletion of the corresponding machine object in the control cluster
	TriggerDeletionByMCM = "node.machine.sapcloud.io/trigger-deletion-by-mcm"

	// NodeUnhealthy is a node termination reason for failed machines
	NodeUnhealthy = "Unhealthy"

	// NodeScaledDown is a node termination reason for healthy deleted machines
	NodeScaledDown = "ScaleDown"

	// NodeTerminationCondition describes nodes that are terminating
	NodeTerminationCondition v1.NodeConditionType = "Terminating"
)

Retry Periods

These are standard retry periods that are used internally by the machine controllers to enqueue keys into the work queue after the specified duration, so that reconciliation can be retried after the elapsed duration.

// RetryPeriod is an alias for specifying the retry period
type RetryPeriod time.Duration

// These are the valid values for RetryPeriod
const (
	// ShortRetry tells the controller to retry after a short duration - 15 seconds
	ShortRetry RetryPeriod = RetryPeriod(15 * time.Second)
	// MediumRetry tells the controller to retry after a medium duration - 3 minutes
	MediumRetry RetryPeriod = RetryPeriod(3 * time.Minute)
	// LongRetry tells the controller to retry after a long duration - 10 minutes
	LongRetry RetryPeriod = RetryPeriod(10 * time.Minute)
)
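A minimal usage sketch: since RetryPeriod is a named time.Duration, a reconcile function converts it back before re-queuing a key (queue and key names here are illustrative):

// re-enqueue the key so the reconciliation is retried after 15 seconds
c.machineQueue.AddAfter(key, time.Duration(machineutils.ShortRetry))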

Misc

permits.PermitGiver

permits.PermitGiver provides the ability to register, obtain, release and delete a number of permits for a given key. All operations are concurrency safe.

type PermitGiver interface {
	RegisterPermits(key string, numPermits int) //numPermits should be  maxNumPermits
	TryPermit(key string, timeout time.Duration) bool
	ReleasePermit(key string)
	DeletePermits(key string)
	Close()
}

The implementation of github.com/gardener/machine-controller-manager/pkg/util/permits.PermitGiver maintains:

  • a sync map of permit keys mapped to permits, where each permit is a structure comprising a buffered empty struct struct{} channel with buffer size equalling the (max) number of permits.
    • RegisterPermits(key, numPermits) registers numPermits for the given key. A permit struct is initialized with permit.c buffer size as numPermits.
  • entries are deleted if not accessed for a configured time represented by stalePermitKeyTimeout. This is done by 'janitor' go-routine associated with the permit giver instance.
  • When attempting to get a permit using TryPermit, one writes an empty struct to the permit channel within a given timeout. If one can do so within the timeout one has acquired the permit, else not.
    • TryPermit(key, timeout) attempts to get a permit for the given key by sending a struct{}{} instance to the buffered permit.c channel.
type permit struct {
	// lastAcquiredPermitTime is time since last successful TryPermit or new RegisterPermits
	lastAcquiredPermitTime time.Time
	// c is initialized with buffer size N representing num permits
	c                      chan struct{} 
}
type permitGiver struct {
	keyPermitsMap sync.Map // map of string keys to permit struct values
	stopC         chan struct{}
}

permits.NewPermitGiver

permits.NewPermitGiver returns a new PermitGiver

func NewPermitGiver(stalePermitKeyTimeout time.Duration, janitorFrequency time.Duration) PermitGiver
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%

flowchart TB
	Begin((" "))
	-->InitStopCh["stopC := make(chan struct{})"]
	-->InitPG["
		pg := permitGiver{
		keyPermitsMap: sync.Map{},
		stopC:         stopC,
	}"]
	-->LaunchCleanUpGoroutine["go cleanup()"]
	LaunchCleanUpGoroutine-->Return(("return &pg"))
	LaunchCleanUpGoroutine--->InitTicker
	subgraph cleanup
	InitTicker["ticker := time.NewTicker(janitorFrequency)"]
	-->caseReadStopCh{"<-stopC ?"}
	caseReadStopCh--Yes-->End(("End"))
	caseReadStopCh--No
		-->ReadTickerCh{"<-ticker.C ?"}
	ReadTickerCh--Yes-->CleanupStale["
	pg.cleanupStalePermitEntries(stalePermitKeyTimeout)
	(Iterates keyPermitsMap, remove entries whose 
	lastAcquiredPermitTime exceeds stalePermitKeyTimeout)
	"]
	ReadTickerCh--No-->caseReadStopCh
	end
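A minimal usage sketch of the interface described above (key, permit count and timeouts are illustrative values):

pg := permits.NewPermitGiver(10*time.Minute, 1*time.Minute) // stalePermitKeyTimeout, janitorFrequency
defer pg.Close()

pg.RegisterPermits("worker-q3rb4-z1", 2) // at most 2 concurrent permit holders for this key
if pg.TryPermit("worker-q3rb4-z1", 5*time.Second) {
	defer pg.ReleasePermit("worker-q3rb4-z1")
	// ... perform the operation guarded by the permit ...
}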

Main Server Structs

MCServer

machine-controller-manager/pkg/util/provider/app/options.MCServer is the main server context object for the machine controller. It embeds options.MachineControllerConfiguration and has ControlKubeconfig and TargetKubeconfig string fields.

type MCServer struct {
	options.MachineControllerConfiguration

	ControlKubeconfig string
	TargetKubeconfig  string
}

The MCServer is constructed and initialized using the pkg/util/provider/app/options.NewMCServer function which sets most of the default values for the fields of the embedded struct.

MCServer Usage

Individual providers leverage the MCServer as follows:

  s := options.NewMCServer()
  driver := <providerSpecificDriverInitialization>
  if err := app.Run(s, driver); err != nil {
      fmt.Fprintf(os.Stderr, "%v\n", err)
      os.Exit(1)
  }

The MCServer is thus re-used across different providers.

MachineControllerConfiguration struct

An important struct that represents the machine controller configuration, supports deep-copying and is embedded within the MCServer.

machine-controller-manager/pkg/util/provider/options.MachineControllerConfiguration

type MachineControllerConfiguration struct {
	metav1.TypeMeta

	// namespace in seed cluster in which controller would look for the resources.
	Namespace string

	// port is the port that the controller-manager's http service runs on.
	Port int32
	// address is the IP address to serve on (set to 0.0.0.0 for all interfaces).
	Address string
	// CloudProvider is the provider for cloud services.
	CloudProvider string
	// ConcurrentNodeSyncs is the number of node objects that are
	// allowed to sync concurrently. Larger number = more responsive nodes,
	// but more CPU (and network) load.
	ConcurrentNodeSyncs int32

	// enableProfiling enables profiling via web interface host:port/debug/pprof/
	EnableProfiling bool
	// enableContentionProfiling enables lock contention profiling, if enableProfiling is true.
	EnableContentionProfiling bool
	// contentType is contentType of requests sent to apiserver.
	ContentType string
	// kubeAPIQPS is the QPS to use while talking with kubernetes apiserver.
	KubeAPIQPS float32
	// kubeAPIBurst is the burst to use while talking with kubernetes apiserver.
	KubeAPIBurst int32
	// leaderElection defines the configuration of leader election client.
	LeaderElection mcmoptions.LeaderElectionConfiguration
	// How long to wait between starting controller managers
	ControllerStartInterval metav1.Duration
	// minResyncPeriod is the resync period in reflectors; will be random between
	// minResyncPeriod and 2*minResyncPeriod.
	MinResyncPeriod metav1.Duration

	// SafetyOptions is the set of options to set to ensure safety of controller
	SafetyOptions SafetyOptions

	//NodeConditions is the string of known NodeConditions. If any of these NodeConditions is set for a timeout period, the machine will be declared failed and will be replaced. Default is "KernelDeadlock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable"
	NodeConditions string

	//BootstrapTokenAuthExtraGroups is a comma-separated string of groups to set bootstrap token's "auth-extra-groups" field to.
	BootstrapTokenAuthExtraGroups string
}

SafetyOptions

An important struct available as the SafetyOptions field in MachineControllerConfiguration containing several timeouts, retry-limits, etc. Most of these fields are set via the pkg/util/provider/app/options.NewMCServer function.

pkg/util/provider/option.SafetyOptions

// SafetyOptions are used to configure the upper-limit and lower-limit
// while configuring freezing of machineSet objects
type SafetyOptions struct {
	// Timeout (in duration) used while creation of
	// a machine before it is declared as failed
	MachineCreationTimeout metav1.Duration
	// Timeout (in duration) used while health-check of
	// a machine before it is declared as failed
	MachineHealthTimeout metav1.Duration
	// Maximum number of times eviction would be attempted on a pod before it is forcibly deleted
	// during draining of a machine.
	MaxEvictRetries int32
	// Timeout (in duration) used while waiting for PV to detach
	PvDetachTimeout metav1.Duration
	// Timeout (in duration) used while waiting for PV to reattach on new node
	PvReattachTimeout metav1.Duration

	// Timeout (in duration) for which the APIServer can be down before
	// declare the machine controller frozen by safety controller
	MachineSafetyAPIServerStatusCheckTimeout metav1.Duration
	// Period (in duration) used to poll for orphan VMs
	// by safety controller
	MachineSafetyOrphanVMsPeriod metav1.Duration
	// Period (in duration) used to poll for APIServer's health
	// by safety controller
	MachineSafetyAPIServerStatusCheckPeriod metav1.Duration

	// APIserverInactiveStartTime to keep track of the
	// start time of when the APIServers were not reachable
	APIserverInactiveStartTime time.Time
	// MachineControllerFrozen indicates if the machine controller
	// is frozen due to Unreachable APIServers
	MachineControllerFrozen bool
}

Controller Structs

Machine Controller Core Struct

The controller struct in package controller inside the go file machine-controller-manager/pkg/util/provider/machinecontroller.go (bad convention) is the concrete Machine Controller struct that holds state data for the MC and implements the classical controller Run(workers int, stopCh <-chan struct{}) method.

The top level MCServer.Run method initializes this controller struct and calls its Run method

package controller
type controller struct {
	namespace                     string // control cluster namespace
	nodeConditions                string // Default: "KernelDeadlock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable"

	controlMachineClient    machineapi.MachineV1alpha1Interface
	controlCoreClient       kubernetes.Interface
	targetCoreClient        kubernetes.Interface
	targetKubernetesVersion *semver.Version

	recorder                record.EventRecorder
	safetyOptions           options.SafetyOptions
	internalExternalScheme  *runtime.Scheme
	driver                  driver.Driver
	volumeAttachmentHandler *drain.VolumeAttachmentHandler
	// permitGiver store two things:
	// - mutex per machinedeployment
	// - lastAcquire time
	// it is used to limit removal of `health timed out` machines
	permitGiver permits.PermitGiver

	// listers
	pvcLister               corelisters.PersistentVolumeClaimLister
	pvLister                corelisters.PersistentVolumeLister
	secretLister            corelisters.SecretLister
	nodeLister              corelisters.NodeLister
	pdbV1beta1Lister        policyv1beta1listers.PodDisruptionBudgetLister
	pdbV1Lister             policyv1listers.PodDisruptionBudgetLister
	volumeAttachementLister storagelisters.VolumeAttachmentLister
	machineClassLister      machinelisters.MachineClassLister
	machineLister           machinelisters.MachineLister
	// queues
    // secretQueue = workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "secret"),
	secretQueue                 workqueue.RateLimitingInterface
	nodeQueue                   workqueue.RateLimitingInterface
	machineClassQueue           workqueue.RateLimitingInterface
	machineQueue                workqueue.RateLimitingInterface
	machineSafetyOrphanVMsQueue workqueue.RateLimitingInterface
	machineSafetyAPIServerQueue workqueue.RateLimitingInterface
	// syncs
	pvcSynced               cache.InformerSynced
	pvSynced                cache.InformerSynced
	secretSynced            cache.InformerSynced
	pdbV1Synced             cache.InformerSynced
	volumeAttachementSynced cache.InformerSynced
	nodeSynced              cache.InformerSynced
	machineClassSynced      cache.InformerSynced
	machineSynced           cache.InformerSynced
}

Driver

github.com/gardener/machine-controller-manager/pkg/util/provider/driver.Driver is the abstraction facade that decouples the machine controller from the cloud-provider specific machine lifecycle details. The MC invokes driver methods while performing reconciliation.

type Driver interface {
	CreateMachine(context.Context, *CreateMachineRequest) (*CreateMachineResponse, error)
	DeleteMachine(context.Context, *DeleteMachineRequest) (*DeleteMachineResponse, error)
	GetMachineStatus(context.Context, *GetMachineStatusRequest) (*GetMachineStatusResponse, error)
	ListMachines(context.Context, *ListMachinesRequest) (*ListMachinesResponse, error)
	GetVolumeIDs(context.Context, *GetVolumeIDsRequest) (*GetVolumeIDsResponse, error)
	GenerateMachineClassForMigration(context.Context, *GenerateMachineClassForMigrationRequest) (*GenerateMachineClassForMigrationResponse, error)
}
  • GetVolumeIDs returns a list of volumeIDs for the given list of PVSpecs.
    • Example: the AWS driver checks if spec.AWSElasticBlockStore.VolumeID is not nil and converts the k8s spec.AWSElasticBlockStore.VolumeID to the EBS volume ID. Or, if storage is provided by CSI (spec.CSI.Driver="ebs.csi.aws.com"), it just takes spec.CSI.VolumeHandle. (A sketch of this logic follows below.)
  • GetMachineStatus gets the status of the VM backing the machine object on the provider.
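A hedged sketch of the AWS-style GetVolumeIDs logic described above (exampleDriver is a placeholder type; the PVSpecs/VolumeIDs field names are assumed from the request/response structs, and the real driver additionally normalizes the in-tree aws:// volume ID):

func (d *exampleDriver) GetVolumeIDs(ctx context.Context, req *driver.GetVolumeIDsRequest) (*driver.GetVolumeIDsResponse, error) {
	var volumeIDs []string
	for _, spec := range req.PVSpecs {
		if spec.AWSElasticBlockStore != nil {
			// in-tree EBS volume: derive the EBS volume ID from spec.AWSElasticBlockStore.VolumeID
			volumeIDs = append(volumeIDs, spec.AWSElasticBlockStore.VolumeID)
		} else if spec.CSI != nil && spec.CSI.Driver == "ebs.csi.aws.com" {
			// CSI-provisioned volume: the volume handle already is the EBS volume ID
			volumeIDs = append(volumeIDs, spec.CSI.VolumeHandle)
		}
	}
	return &driver.GetVolumeIDsResponse{VolumeIDs: volumeIDs}, nil
}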

Codes and (error) Status

Code

github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes.Code is a uint32 with the following error codes. These error codes are contained in errors returned from Driver methods.

Note: Un-happy with current design. It is clear that some error codes overlap each other in the sense that they are supersets of other codes. The right thing to do would have been to make an ErrorCategory.

| Value | Code | Description |
|-------|------|-------------|
| 0 | Ok | Success |
| 1 | Canceled | the operation was canceled (by caller) |
| 2 | Unknown | Unknown error (unrecognized code) |
| 3 | InvalidArgument | InvalidArgument indicates client specified an invalid argument. |
| 4 | DeadlineExceeded | DeadlineExceeded means operation expired before completion. For operations that change the state of the system, this error may be returned even if the operation has completed successfully. For example, a successful response from a server could have been delayed long enough for the deadline to expire. |
| 5 | NotFound | requested entity not found. |
| 6 | AlreadyExists | an attempt to create an entity failed because one already exists. |
| 7 | PermissionDenied | caller does not have permission to execute the specified operation. |
| 8 | ResourceExhausted | indicates some resource has been exhausted, perhaps a per-user quota, or perhaps the entire file system is out of space. |
| 9 | FailedPrecondition | operation was rejected because the system is not in a state required for the operation's execution. |
| 10 | Aborted | operation was aborted and client should retry the full process |
| 11 | OutOfRange | operation was attempted past the valid range. Unlike InvalidArgument, this error indicates a problem that may be fixed if the system state changes. |
| 12 | UnImplemented | operation is not implemented or not supported |
| 13 | Internal | BAD. Some internal invariant broken. |
| 14 | Unavailable | Service is currently unavailable (transient and op may be tried with backoff) |
| 15 | DataLoss | unrecoverable data loss or corruption. |
| 16 | Unauthenticated | request does not have valid creds for operation. Note: It would have been nice if 7 was called Unauthorized. |

Status

(OPINION: I believe this is a minor NIH design defect. Ideally one should have re-leveraged the k8s API machinery Status k8s.io/apimachinery/pkg/apis/meta/v1.Status instead of making a custom status object.)

status implements errors returned by MachineAPIs. MachineAPIs service handlers should return an error created by this package, and machineAPIs clients should expect a corresponding error to be returned from the RPC call.

status.Status implements error and encapsulates a code which should be one of the codes in codes.Code and a developer-facing error message in English.

type Status struct {
	code int32
	message string
}
// New returns a Status encapsulating code and msg.
func New(code codes.Code, msg string) *Status {
	return &Status{code: int32(code), message: msg}
}

NOTE: Not clear why we are doing such hard work as shown in status.FromError, which involves time-consuming regex parsing of an error string into a status. This is actually being used to parse the error strings of errors returned by Driver methods. Not good - should be better designed.
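A hedged example of how a Driver implementation surfaces a coded error with this package (the vmNotFound flag and the request field access are illustrative):

// inside a provider's GetMachineStatus: report that no VM backs the machine,
// so that the calling reconciliation can treat it as a NotFound condition
if vmNotFound {
	return nil, status.New(codes.NotFound, fmt.Sprintf("no VM backing machine %q was found", req.Machine.Name))
}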

Machine Controller

The Machine Controller handles reconciliation of Machine and MachineClass objects.

The Machine Controller Entry Point for any provider is at machine-controller-manager-provider-<name>/cmd/machine-controller/main.go

MC Launch

Dev

Build

A Makefile in the root of machine-controller-manager-provider-<name> builds the provider specific machine controller for linux with CGO disabled. The make build target invokes the shell script .ci/build to do this.

CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
  -a \
  -v \
  -o ${BINARY_PATH}/rel/machine-controller \
  cmd/machine-controller/main.go

Launch

Assuming one has initialized the variables using make download-kubeconfigs, one can then use the make start target, which launches the MC with the flags shown below. Most of these timeout flags are redundant since the exact same values are given in machine-controller-manager/pkg/util/provider/app/options.NewMCServer.

go run -mod=vendor 
    cmd/machine-controller/main.go 
    --control-kubeconfig=$(CONTROL_KUBECONFIG) 
    --target-kubeconfig=$(TARGET_KUBECONFIG) 
    --namespace=$(CONTROL_NAMESPACE) 
    --machine-creation-timeout=20m 
    --machine-drain-timeout=5m 
    --machine-health-timeout=10m 
    --machine-pv-detach-timeout=2m 
    --machine-safety-apiserver-statuscheck-timeout=30s 
    --machine-safety-apiserver-statuscheck-period=1m 
    --machine-safety-orphan-vms-period=30m 
    --leader-elect=$(LEADER_ELECT) 
    --v=3

Prod

Build

A Dockerfile builds the provider specific machine controller and launches it directly with no CLI arguments. Hence it uses the coded defaults.

RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH \
      go build \
      -mod=vendor \
      -o /usr/local/bin/machine-controller \
      cmd/machine-controller/main.go
COPY --from=builder /usr/local/bin/machine-controller /machine-controller
ENTRYPOINT ["/machine-controller"]

The machine-controller-manager deployment usually launches the MC in a Pod with the following arguments

./machine-controller
         --control-kubeconfig=inClusterConfig
         --machine-creation-timeout=20m
         --machine-drain-timeout=2h
         --machine-health-timeout=10m
         --namespace=shoot--i034796--tre
         --port=10259
         --target-kubeconfig=/var/run/secrets/gardener.cloud/shoot/generic-kubeconfig/kubeconfig
         --v=3

Launch Flow

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TB

Begin(("cmd/
machine-controller/
main.go"))
-->NewMCServer["mc=options.NewMCServer"]
-->AddFlaogs["mc.AddFlags(pflag.CommandLine)"]
-->LogOptions["options := k8s.io/component/base/logs.NewOptions()
	options.AddFlags(pflag.CommandLine)"]
-->InitFlags["flag.InitFlags"]
InitFlags--local-->NewLocalDriver["
	driver, err := local.NewDriver(s.ControlKubeconfig)
	if err exit
"]
InitFlags--aws-->NewPlatformDriver["
	driver := aws.NewAWSDriver(&spi.PluginSPIImpl{}))
	OR
	driver := cp.NewAzureDriver(&spi.PluginSPIImpl{})
	//etc
"]

NewLocalDriver-->AppRun["
	err := app.Run(mc, driver)
"]
NewPlatformDriver-->AppRun
AppRun-->End(("if err != nil 
os.Exit(1)"))

Summary

  1. Creates machine-controller-manager/pkg/util/provider/app/options.MCServer using options.NewMCServer which is the main context object for the machinecontroller that embeds a options.MachineControllerConfiguration.

    options.NewMCServer initializes options.MCServer struct with default values for

    • Port: 10258,
    • Namespace: default,
    • ConcurrentNodeSyncs: 50: number of worker go-routines that are used to process items from a work queue. See Worker below
    • NodeConditions: "KernelDeadLock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable" (failure node conditions that indicate that a machine is un-healthy)
    • MinResyncPeriod: 12 hours, KubeAPIQPS: 20, KubeAPIBurst:30: config params for k8s clients. See rest.Config
  2. calls MCServer.AddFlags which defines all parsing flags for the machine controller into fields of MCServer instance created in the last step.

  3. calls k8s.io/component-base/logs.NewOptions and then options.AddFlags for logging options. (TODO: Should get rid of this when moving to logr.)

  4. Driver initialization code varies according to the provider type.

    • Local Driver
      • calls NewDriver with the control kube config, which creates a controller-runtime client (sigs.k8s.io/controller-runtime/pkg/client) and then calls pkg/local/driver.NewDriver passing the controller-runtime client, which constructs a localdriver encapsulating the passed-in client.
      • driver := local.NewDriver(c)
      • the localdriver implements Driver and is the facade for creation/deletion of VMs
    • Provider Specific Driver Example
      • driver := aws.NewAWSDriver(&spi.PluginSPIImpl{})
      • driver := cp.NewAzureDriver(&spi.PluginSPIImpl{})
      • spi.PluginSPIImpl is a struct that implements a provider specific interface that initializes a provider session.
  5. calls app.Run passing in the previously created MCServer and Driver instances.

Machine Controller Loop

app.Run

app.Run is the function that setups the main control loop of the machine controller server.

Summary

  1. app.Run(options *options.MCServer, driver driver.Driver) is the common run loop for all provider Machine Controllers.
  2. Creates targetkubeconfig and controlkubeconfig of type k8s.io/client-go/rest.Config from the target and control kubeconfig paths using clientcmd.BuildConfigFromFlags
  3. Sets fields such as config.QPS and config.Burst in both targetkubeconfig and controlkubeconfig from the passed in options.MCServer
  4. Creates kubeClientControl from the controlkubeconfig using the standard client-go client factory method kubernetes.NewForConfig that returns a client-go/kubernetes.Clientset
  5. Similarly create another Clientset named leaderElectionClient using controlkubeconfig
  6. Start a go routine using the function startHTTP that registers a bunch of http handlers for the go profiler, prometheus metrics and the health check.
  7. Call createRecorder passing the kubeClientControl client set instance that returns a client-go/tools/record.EventRecorder
    1. Creates a new eventBroadcaster of type event.EventBroadcaster
    2. Set the logging function of the broadcaster to klog.Infof.
    3. Sets the event sink using eventBroadcaster.StartRecordingToSink passing the event interface as kubeClient.CoreV1().RESTClient()).Events(""). Effectively events will be published remotely.
    4. Returns the record.EventRecorder associated with the eventBroadcaster using eventBroadcaster.NewRecorder
  8. Constructs an anonymous function assigned to run variable which does the following:
    1. Initializes a stop receive channel.
    2. Creates a controlMachineClientBuilder using machineclientbuilder.SimpleClientBuilder using the controlkubeconfig.
    3. Creates a controlCoreClientBuilder using coreclientbuilder.SimpleControllerClientBuilder wrapping controlkubeconfig.
    4. Creates targetCoreClientBuilder using coreclientbuilder.SimpleControllerClientBuilder wrapping targetkubeconfig.
    5. Call the app.StartControllers function passing the options, driver, controlkubeconfig, targetkubeconfig, controlMachineClientBuilder, controlCoreClientBuilder, targetCoreClientBuilder, recorder and stop channel.
      • // Q: if you are going to pass the controlkubeconfig and targetkubeconfig - why not create the client builders inside the startcontrollers ?
    6. if app.StartControllers returns an error, panic and exit run.
  9. uses leaderelection.RunOrDie to start a leader election, passing the previously created run function as the callback for OnStartedLeading. The OnStartedLeading callback is invoked when the leader elector client starts leading. (A hedged sketch follows below.)
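A hedged sketch of step 9 (the resource lock construction is elided and the field wiring is assumed); the run function from step 8 becomes the OnStartedLeading callback:

leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
	Lock:          resourceLock, // lock object in the control cluster (construction elided)
	LeaseDuration: s.LeaderElection.LeaseDuration.Duration,
	RenewDeadline: s.LeaderElection.RenewDeadline.Duration,
	RetryPeriod:   s.LeaderElection.RetryPeriod.Duration,
	Callbacks: leaderelection.LeaderCallbacks{
		OnStartedLeading: run, // starts the controllers
		OnStoppedLeading: func() {
			klog.Fatalf("leader election lost")
		},
	},
})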

app.StartControllers

app.StartControllers starts all controller loops which are part of the machine controller.

func StartControllers(options *options.MCServer,
	controlCoreKubeconfig *rest.Config,
	targetCoreKubeconfig *rest.Config,
	controlMachineClientBuilder machineclientbuilder.ClientBuilder,
	controlCoreClientBuilder coreclientbuilder.ClientBuilder,
	targetCoreClientBuilder coreclientbuilder.ClientBuilder,
	driver driver.Driver,
	recorder record.EventRecorder,
	stop <-chan struct{}) error
  1. Calls getAvailableResources using the controlCoreClientBuilder that returns a map[schema.GroupVersionResource]bool assigned to availableresources
    • getAvailableResources waits till the api server is running by checking its /healthz using wait.PollImmediate. It keeps re-creating the client using the clientbuilder.Client method.
    • then uses client.Discovery().ServerResources which returns the supported resources for all groups and versions as a slice of *metav1.APIResourceList (which encapsulates a []APIResource) and then converts that to a map[schema.GroupVersionResource]bool
  2. Creates a controlMachineClient using controlMachineClientBuilder.ClientOrDie("machine-controller").MachineV1alpha1() which is a client of type MachineV1alpha1Interface. This interface is a composition of MachinesGetter, MachineClassesGetter, MachineDeploymentsGetter and MachineSetsGetter allowing access to the CRUD interface for machines, machine classes, machine deployments and machine sets. This client targets the control cluster - i.e. the cluster holding the machine CRDs.
  3. Creates a controlCoreClient (of type kubernetes.Clientset) which is the standard k8s client-go client for accessing the k8s control cluster.
  4. Creates a targetCoreClient (of type kubernetes.Clientset) which is the standard k8s client-go client for accessing the target cluster - in which machines will be spawned.
  5. Obtains the target cluster k8s version using the discovery interface and preserves it in targetKubernetesVersion
  6. If the availableResources does not contain the machine GVR, exits app.StartControllers with an error.
  7. Creates the control machine, control core and target core informer factories (a hedged sketch follows this list).
  8. Now creates the machinecontroller using the machinecontroller.NewController factory function, passing in the clients, driver, informers, recorder and safety options (see the NewController signature in the next section).
  9. Starts controlMachineInformerFactory, controlCoreInformerFactory and targetCoreInformerFactory by calling SharedInformerFactory.Start passing the stop channel.
  10. Launches machinecontroller.Run in a new go-routine passing the stop channel.
  11. Blocks forever using a select{}
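A hedged sketch of the informer-factory wiring in step 7 and the start calls in step 9 (package aliases and exact arguments are assumed; the factories come from the MCM generated client package and from k8s.io/client-go/informers):

controlMachineInformerFactory := machineinformers.NewFilteredSharedInformerFactory(
	controlMachineClientBuilder.ClientOrDie("control-machine-shared-informers"),
	options.MinResyncPeriod.Duration,
	options.Namespace,
	nil,
)
controlCoreInformerFactory := coreinformers.NewFilteredSharedInformerFactory(
	controlCoreClient,
	options.MinResyncPeriod.Duration,
	options.Namespace,
	nil,
)
targetCoreInformerFactory := coreinformers.NewSharedInformerFactory(
	targetCoreClient,
	options.MinResyncPeriod.Duration,
)

// step 9: start the factories once the controller has been constructed
controlMachineInformerFactory.Start(stop)
controlCoreInformerFactory.Start(stop)
targetCoreInformerFactory.Start(stop)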

Machine Controller Initialization

The machine controller is constructed using the controller.NewController factory function which initializes the controller struct.

1. NewController factory func

mc is constructed using the factory function below:

func NewController(
	namespace string,
	controlMachineClient machineapi.MachineV1alpha1Interface,
	controlCoreClient kubernetes.Interface,
	targetCoreClient kubernetes.Interface,
	driver driver.Driver,
	pvcInformer coreinformers.PersistentVolumeClaimInformer,
	pvInformer coreinformers.PersistentVolumeInformer,
	secretInformer coreinformers.SecretInformer,
	nodeInformer coreinformers.NodeInformer,
	pdbV1beta1Informer policyv1beta1informers.PodDisruptionBudgetInformer,
	pdbV1Informer policyv1informers.PodDisruptionBudgetInformer,
	volumeAttachmentInformer storageinformers.VolumeAttachmentInformer,
	machineClassInformer machineinformers.MachineClassInformer,
	machineInformer machineinformers.MachineInformer,
	recorder record.EventRecorder,
	safetyOptions options.SafetyOptions,
	nodeConditions string,
	bootstrapTokenAuthExtraGroups string,
	targetKubernetesVersion *semver.Version,
) (Controller, error) 

1.1 Init Controller Struct

Creates and initializes the controller struct, initializing rate-limiting work queues for secrets (controller.secretQueue), nodes (controller.nodeQueue), machines (controller.machineQueue) and machine classes (controller.machineClassQueue), along with the 2 work queues used by the safety controllers: controller.machineSafetyOrphanVMsQueue and controller.machineSafetyAPIServerQueue.

Example:

controller := &controller{
	//...
	secretQueue:  workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "secret"),
	machineQueue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "machine"),
	//...
}

1.2 Assign Listers and HasSynced funcs to controller struct

	// initialize controller listers from the passed-in shared informers (8 listers)
	controller.pvcLister = pvcInformer.Lister()
	controller.pvLister = pvInformer.Lister()
	controller.machineLister = machineInformer.Lister()

	controller.pdbV1Lister = pdbV1Informer.Lister()
	controller.pdbV1Synced = pdbV1Informer.Informer().HasSynced

	// ...

	// assign the HasSynced function from the passed-in shared informers
	controller.pvcSynced = pvcInformer.Informer().HasSynced
	controller.pvSynced = pvInformer.Informer().HasSynced
	controller.machineSynced = machineInformer.Informer().HasSynced

1.3 Register Controller Event Handlers on Informers.

An informer invokes registered event handler when a k8s object changes.

Event handlers are registered using the <ResourceType>Informer().AddEventHandler function.

The controller initialization registers add/delete event handlers for Secrets, and add/update/delete event handlers for the MachineClass, Machine and Node informers.

The event handlers generally add the object keys to the appropriate work queues, which are later picked up and processed in controller.Run.

The work queue is used to separate the delivery of the object from its processing. Resource event handler functions extract the key of the delivered object and add it to the relevant work queue for future processing (in controller.Run).

Example

secretInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    controller.secretAdd,
	DeleteFunc: controller.secretDelete,
})
1.3.1 Secret Informer Callback
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%

flowchart TB

SecretInformerAddCallback
-->SecretAdd["controller.secretAdd(obj)"]
-->GetSecretKey["key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)"]
-->AddSecretQ["if err == nil c.secretQueue.Add(key)"]

SecretInformeDeleteCallback
-->SecretAdd

We must check for the DeletedFinalStateUnknown state of that secret in the cache before enqueuing its key. The DeletedFinalStateUnknown state means that the object has been deleted but that the watch deletion event was missed while disconnected from apiserver and the controller didn't react accordingly. Hence if there is no error, we can add the key to the queue.
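A short sketch of the add/delete callback consistent with the flow above (the real handler lives in the MC package and additionally logs errors); cache.DeletionHandlingMetaNamespaceKeyFunc already copes with DeletedFinalStateUnknown tombstones, so the key is enqueued whenever no error is returned:

func (c *controller) secretAdd(obj interface{}) {
	key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
	if err == nil {
		c.secretQueue.Add(key)
	}
}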

1.3.2 Machine Class Informer Callbacks
MachineClass Add/Delete Callback 1
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%

flowchart TB
MachineClassInformerAddCallback1
-->
MachineAdd["controller.machineClassToSecretAdd(obj)"]
-->CastMC["
	mc, ok := obj.(*v1alpha1.MachineClass)
"]
-->EnqueueSecret["
	c.secretQueue.Add(mc.SecretRef.Namespace + '/' + 
	mc.SecretRef.Name)
	c.secretQueue.Add(mc.CredentialsSecretRef.Namespace + '/' + mc.CredentialsSecretRef.Name)
"]

MachineClassToSecretDeleteCallback1
-->MachineAdd
MachineClass Update Callback 1
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%

flowchart TB
MachineClassInformerUpdateCallback1
-->
MachineAdd["controller.machineClassToSecretUpdate(oldObj, newObj)"]
-->CastMC["
	old, ok := oldObj.(*v1alpha1.MachineClass)
	new, ok := newObj.(*v1alpha1.MachineClass)
"]
-->RefNotEqual{"old.SecretRef != 
	new.SecretRef?"}
--Yes-->EnqueueSecret["
	c.secretQueue.Add(old.SecretRef.Namespace + '/' + old.SecretRef.Name)
	c.secretQueue.Add(new.SecretRef.Namespace + '/' + new.SecretRef.Name)
"]
-->CredRefNotEqual{"old.CredentialsSecretRef!=
new.CredentialsSecretRef?"}
--Yes-->EnqueueCredSecretRef["
c.secretQueue.Add(old.CredentialsSecretRef.Namespace + '/' + old.CredentialsSecretRef.Name)
c.secretQueue.Add(new.CredentialsSecretRef.Namespace + '/' + new.CredentialsSecretRef.Name)
"]
MachineClass Add/Delete/Update Callback 2
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%

flowchart TB
MachineClassInformerAddCallback2
-->
MachineAdd["controller.machineClassAdd(obj)"]
-->CastMC["
	key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
	if err == nil c.machineClassQueue.Add(key)
"]
MachineClassInformerDeleteCallback2
-->MachineAdd

MachineClassInformeUpdateCallback2
-->MachineUpdate["controller.machineClassUpdate(oldObj,obj)"]
-->MachineAdd
1.3.3 Machine Informer Callbacks
Machine Add/Update/Delete Callbacks 1
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%

flowchart TB
MachineAddCallback1
-->AddMachine["controller.addMachine(obj)"]
-->EnqueueMachine["
	key, err := cache.MetaNamespaceKeyFunc(obj)
	//Q: why don't we use DeletionHandlingMetaNamespaceKeyFunc here ?
	if err == nil c.machineQueue.Add(key)
"]
MachineUpdateCallback1-->AddMachine
MachineDeleteCallback1-->AddMachine
Machine Update/Delete Callbacks 2

DISCUSS THIS.

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%

flowchart TB
MachineUpdateCallback2
-->UpdateMachineToSafety["controller.updateMachineToSafety(oldObj, newObj)"]
-->EnqueueSafetyQ["
	newM := newObj.(*v1alpha1.Machine)
	if multipleVMsBackingMachineFound(newM) {
		c.machineSafetyOrphanVMsQueue.Add('')
	}
"]
MachineDeleteCallback2
-->DeleteMachineToSafety["deleteMachineToSafety(obj)"]
-->EnqueueSafetyQ1["
	c.machineSafetyOrphanVMsQueue.Add('')
"]
1.3.4 Node Informer Callbacks
Node Add Callback
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%

flowchart TB
NodeAddCallback
-->InvokeAddNodeToMachine["controller.addNodeToMachine(obj)"]
-->AddNodeToMachine["
	node := obj.(*corev1.Node)
	if node.ObjectMeta.Annotations has NotManagedByMCM return;
	key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
	if err != nil return
"]
-->GetMachineFromNode["
	machine := (use machineLister to get first machine whose 'node' label equals key)
"]
-->ChkMachine{"
machine.Status.CurrentStatus.Phase != 'CrashLoopBackOff'
&&
nodeConditionsHaveChanged(
  machine.Status.Conditions, 
  node.Status.Conditions) ?
"}
--Yes-->EnqueueMachine["
	mKey, err := cache.MetaNamespaceKeyFunc(obj)
	if err != nil return
	controller.machineQueue.Add(mKey)
"]

Node Delete Callback

This is straightforward - it checks that the node has an associated machine and if so, enqueues the machine on the machineQueue

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%

flowchart TB
NodeDeleteCallback
-->InvokeDeleteNodeToMachine["controller.deleteNodeToMachine(obj)"]
-->DeleteNodeToMachine["
	key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
	if err != nil return
"]
-->GetMachineFromNode["
	machine := (use machineLister to get first machine whose 'node' label equals key)
	if err != nil return
"]
-->EnqueueMachine["
	mKey, err := cache.MetaNamespaceKeyFunc(obj)
	if err != nil return
	controller.machineQueue.Add(mKey)
"]
Node Update Callback

controller.updateNodeToMachine is specified as the UpdateFunc registered for the nodeInformer.

In a nutshell, it simply delegates to addNodeToMachine(newObj) described earlier, except if the node has the annotation machineutils.TriggerDeletionByMCM (whose value is node.machine.sapcloud.io/trigger-deletion-by-mcm). In that case it gets the machine obj corresponding to the node and then leverages controller.controlMachineClient to delete the machine object.

NOTE: This annotation was introduced for the user to add on the node. This gives them an indirect way to delete the machine object because they don’t have access to control plane.

Snippet shown below with error handling+logging omitted.

func (c *controller) updateNodeToMachine(oldObj, newObj interface{}) {
	node := newObj.(*corev1.Node)
	// check for the TriggerDeletionByMCM annotation on the node object
	// if it is present then mark the machine object for deletion
	if value, ok := node.Annotations[machineutils.TriggerDeletionByMCM]; ok && value == "true" {
		machine, _ := c.getMachineFromNode(node.Name) // error handling omitted
		if machine.DeletionTimestamp == nil {
			c.controlMachineClient.
				Machines(c.namespace).
				Delete(context.Background(), machine.Name, metav1.DeleteOptions{})
		}
	} else {
		c.addNodeToMachine(newObj)
	}
}

Machine Controller Run

func (c *controller) Run(workers int, stopch <-chan struct{}) {
	// ...
}

1. Wait for Informer Caches to Sync

When an informer starts, it will build a cache of all resources it currently watches which is lost when the application restarts. This means that on startup, each of your handler functions will be invoked as the initial state is built. If this is not desirable, one should wait until the caches are synced before performing any updates. This can be done using the cache.WaitForCacheSync function.

if !cache.WaitForCacheSync(stopCh, c.secretSynced, c.pvcSynced, c.pvSynced, c.volumeAttachementSynced, c.nodeSynced, c.machineClassSynced, c.machineSynced) {
	runtimeutil.HandleError(fmt.Errorf("Timed out waiting for caches to sync"))
	return
}

2. Register On Prometheus

The Machine controller struct implements the prometheus.Collector interface and can therefore be registered on the prometheus metrics registry.

prometheus.MustRegister(controller)

Collectors which are added to the registry will collect metrics to expose them via the metrics endpoint of the MCM every time the endpoint is called.

2.1 Describe Metrics (controller.Describe)

All prometheus.Metric instances that are collected must first be described using a prometheus.Desc which is the immutable meta-data about a metric.

As can be seen below, the machine controller sends a description of metrics.MachineCountDesc to prometheus. This is mcm_machine_items_total which is the count of machines managed by the controller.

This Describe callback is called by prometheus.MustRegister

Doubt: we currently appear to have only one metric description for the MC?

var	MachineCountDesc = prometheus.NewDesc("mcm_machine_items_total", "Count of machines currently managed by the mcm.", nil, nil)

func (c *controller) Describe(ch chan<- *prometheus.Desc) {
	ch <- metrics.MachineCountDesc
}
2.2 Collect Metrics (controller.Collect)

Collect is called by the prometheus registry when collecting metrics. The implementation sends each collected metric via the provided channel and returns once the last metric has been sent. The descriptor of each sent metric is one of those returned by Describe.

// Collect is a method required to implement the prometheus.Collector interface.
func (c *controller) Collect(ch chan<- prometheus.Metric) {
	c.CollectMachineMetrics(ch)
	c.CollectMachineControllerFrozenStatus(ch)
}
2.2.1 Collect Machine Metrics
func (c *controller) CollectMachineMetrics(ch chan<- prometheus.Metric) 

A [prometheus.Metric](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Metric) models a sample linking data points together over time. Custom labels (with their own values) can be added to each data point.

A prometheus.Gauge is a Metric that represents a single numerical value that can arbitrarily go up and down. We use a Gauge for the machine count.

A prometheus.GaugeVec is a factory for creating a set of gauges all with the same description but which have different data values for the metric labels.

Machine information about a machine managed by MCM is described on Prometheus using a prometheus.GaugeVec constructed using the factory function prometheus.NewGaugeVec.

A prometheus.Desc is the descriptor used by every Prometheus Metric. It is essentially the immutable meta-data of a Metric that includes fully qualified name of the metric, the help string and the metric label names.

We have 3 such gauge vecs for machine metrics and 1 gauge metric for the machine count as seen below.

Q: Discuss Why do we need the 3 gauge vecs ?

Example:


var	MachineCountDesc = prometheus.NewDesc("mcm_machine_items_total", "Count of machines currently managed by the mcm.", nil, nil)

//MachineInfo Information of the Machines currently managed by the mcm.
var MachineInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "mcm",
		Subsystem: "machine",
		Name:      "info",
		Help:      "Information of the Machines currently managed by the mcm.",
	}, []string{"name", "namespace", "createdAt",
		"spec_provider_id", "spec_class_api_group", "spec_class_kind", "spec_class_name"})

// MachineStatusCondition Information of the mcm managed Machines' status conditions
var	MachineStatusCondition = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: namespace,
		Subsystem: machineSubsystem,
		Name:      "status_condition",
		Help:      "Information of the mcm managed Machines' status conditions.",
	}, []string{"name", "namespace", "condition"})

// MachineCSPhase Current status phase of the Machines currently managed by the mcm.
var MachineCSPhase = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: namespace,
		Subsystem: machineSubsystem,
		Name:      "current_status_phase",
		Help:      "Current status phase of the Machines currently managed by the mcm.",
	}, []string{"name", "namespace"})

One invokes the GaugeVec.With method passing a prometheus.Labels which is a map[string]string to obtain a prometheus.Gauge.

  1. Gets the list of machines using the machineLister.
  2. Iterates through the list of machines. Uses the MachineInfo.With method to initialize the labels for each metric and obtain the Gauge, then uses Gauge.Set to set the metric value (see the sketch after this list).
  3. Creates the machine count metric using
	metric, err := prometheus.NewConstMetric(metrics.MachineCountDesc, prometheus.GaugeValue, float64(len(machineList)))
  4. Sends the metric to the prometheus metric channel: ch <- metric
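A hedged sketch of this collection flow (the label keys come from the MachineInfo declaration above; the machine fields used to fill them and the createdAt format are assumptions, and error handling is simplified):

func (c *controller) CollectMachineMetrics(ch chan<- prometheus.Metric) {
	machineList, err := c.machineLister.Machines(c.namespace).List(labels.Everything())
	if err != nil {
		return
	}
	for _, m := range machineList {
		metrics.MachineInfo.With(prometheus.Labels{
			"name":                 m.Name,
			"namespace":            m.Namespace,
			"createdAt":            strconv.FormatInt(m.GetCreationTimestamp().Unix(), 10),
			"spec_provider_id":     m.Spec.ProviderID,
			"spec_class_api_group": m.Spec.Class.APIGroup,
			"spec_class_kind":      m.Spec.Class.Kind,
			"spec_class_name":      m.Spec.Class.Name,
		}).Set(1)
	}
	// GaugeVecs are themselves Collectors, so their children can be forwarded to ch
	metrics.MachineInfo.Collect(ch)

	countMetric, err := prometheus.NewConstMetric(metrics.MachineCountDesc, prometheus.GaugeValue, float64(len(machineList)))
	if err == nil {
		ch <- countMetric
	}
}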

3. Create controller worker go-routines specifying reconcile functions

Finally, worker.Run is used to create and run worker routines that just process items in the specified queue. The workers run until stopCh is closed. Each worker go-routine is added to the wait group when started and marked done when finished.

Q: reconcileClusterNodeKey seems useless ?

func (c *controller) Run(workers int, stopCh <-chan struct{}) {
	//...
	var waitGroup sync.WaitGroup
	for i := 0; i < workers; i++ {
		worker.Run(c.secretQueue, "ClusterSecret", worker.DefaultMaxRetries, true, c.reconcileClusterSecretKey, stopCh, &waitGroup)
		worker.Run(c.machineClassQueue, "ClusterMachineClass", worker.DefaultMaxRetries, true, c.reconcileClusterMachineClassKey, stopCh, &waitGroup)
		worker.Run(c.machineQueue, "ClusterMachine", worker.DefaultMaxRetries, true, c.reconcileClusterMachineKey, stopCh, &waitGroup)
		worker.Run(c.machineSafetyOrphanVMsQueue, "ClusterMachineSafetyOrphanVMs", worker.DefaultMaxRetries, true, c.reconcileClusterMachineSafetyOrphanVMs, stopCh, &waitGroup)
		worker.Run(c.machineSafetyAPIServerQueue, "ClusterMachineAPIServer", worker.DefaultMaxRetries, true, c.reconcileClusterMachineSafetyAPIServer, stopCh, &waitGroup)
	}
	<-stopCh
	waitGroup.Wait()
}

func Run(queue workqueue.RateLimitingInterface, resourceType string, maxRetries int, forgetAfterSuccess bool, reconciler func(key string) error, stopCh <-chan struct{}, waitGroup *sync.WaitGroup) {
	waitGroup.Add(1)
	go func() {
		wait.Until(worker(queue, resourceType, maxRetries, forgetAfterSuccess, reconciler), time.Second, stopCh)
		waitGroup.Done()
	}()
}
3.1 worker.worker

NOTE: Puzzled that a basic routine like this is NOT part of the client-go lib. It is likely repeated across thousands of controllers (probably with bugs). Thankfully controller-runtime obviates the need for something like this.

worker returns a function that

  1. de-queues items (keys) from the work queue. The keys obtained via queue.Get are strings of the form namespace/name of the resource.
  2. processes them by invoking the reconciler(key) function
    1. The purpose of the reconciler is to compare the actual state with the desired state, and attempt to converge the two. It should then update the status block of the resource.
    2. If the reconciler returns an error, the item is re-queued up to maxRetries times before giving up.
  3. marks items as done.
func worker(queue workqueue.RateLimitingInterface, resourceType string, maxRetries int, forgetAfterSuccess bool, reconciler func(key string) error) func() {
	return func() {
		exit := false
		for !exit {
			exit = func() bool {
				key, quit := queue.Get()
				if quit {
					return true
				}
				defer queue.Done(key)

				err := reconciler(key.(string))
				if err == nil {
					if forgetAfterSuccess {
						queue.Forget(key)
					}
					return false
				}

				if queue.NumRequeues(key) < maxRetries {
					queue.AddRateLimited(key)
					return false
				}

				queue.Forget(key)
				return false
			}()
		}
	}
}

4. Reconciliation functions executed by worker

The controller starts worker go-routines that pop out keys from the relevant workqueue and execute the reconcile function.

See reconcile chapters

Reconcile Cluster Machine Class

Reconcile Cluster Machine Class Key

reconcileClusterMachineClassKey just picks up the machine class key from the machine class queue and then delegates further.

func (c *controller) reconcileClusterMachineClassKey(key string) error
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

GetMCName["ns,name = cache.SplitMetaNamespaceKey(mkey)"]
-->GetMC["
class, err := c.machineClassLister.MachineClasses(c.namespace).Get(name)
if err != nil return err  // basically adds back to the queue after rate limiting
"]
-->RMC["
ctx := context.Background()
reconcileClusterMachineClass(ctx, class)"]
-->CheckErr{"err !=nil"}
--Yes-->ShortR["machineClassQueue.AddAfter(key, machineutils.ShortRetry)"]
CheckErr--No-->LongR["machineClassQueue.AddAfter(key, machineutils.LongRetry)"]
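
The flow above corresponds roughly to the sketch below. This is a minimal rendering of the key handler, not the verbatim MCM source; the time.Duration conversion assumes machineutils.RetryPeriod is a duration-based type.

// Sketch of the key handler: a failed lister lookup re-queues the key via the
// returned error; otherwise the key is explicitly re-added with a short or long retry.
func (c *controller) reconcileClusterMachineClassKey(key string) error {
	_, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	class, err := c.machineClassLister.MachineClasses(c.namespace).Get(name)
	if err != nil {
		return err // adds the key back to the queue after rate limiting
	}
	ctx := context.Background()
	if err := c.reconcileClusterMachineClass(ctx, class); err != nil {
		c.machineClassQueue.AddAfter(key, time.Duration(machineutils.ShortRetry))
	} else {
		c.machineClassQueue.AddAfter(key, time.Duration(machineutils.LongRetry))
	}
	return nil
}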

Reconcile Cluster Machine Class

func (c *controller) reconcileClusterMachineClass(ctx context.Context,
 class *v1alpha1.MachineClass) error 

Bad design: should ideally return the retry period like other reconcile functions.

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

FindMachineForClass["
machines := Use machineLister and 
match on Machine.Spec.Class.Name == class to 
find machines with matching class"]
-->CheckDelTimeStamp{"
// machines are ref
class.DeletionTimestamp == nil
&& len(machines) > 0
"}

CheckDelTimeStamp--Yes-->AddMCFinalizers["
Add/Update MCM Finalizers to MC 
and use controlMachineClient to update
(why mcm finalizer not mc finalizer?)
'machine.sapcloud.io/machine-controller-manager'
retryPeriod=LongRetry
"]
-->ChkMachineCount{{"len(machines)>0?"}}
--Yes-->EnQMachines["
iterate machines and invoke:
c.machineQueue.Add(machine)
"]
-->End(("End"))

CheckDelTimeStamp--No-->Shortr["
// Seems like over-work here.
retryPeriod=ShortRetry
"]-->ChkMachineCount

ChkMachineCount--No-->DelMCFinalizers["
controller.deleteMachineClassFinalizers(ctx, class)
"]-->End


Reconcile Cluster Secret

reconcileClusterSecretKey reconciles a secret due to a controller resync or an event on the secret.

func (c *controller) reconcileClusterSecretKey(key string) error 
// which looks up secret and delegates to
func (c *controller) reconcileClusterSecret(ctx context.Context, secret *corev1.Secret) error 

Usage

Worker go-routines are created for this as below

worker.Run(c.secretQueue, 
    "ClusterSecret", 
    worker.DefaultMaxRetries, 
    true, 
    c.reconcileClusterSecretKey,
    stopCh,
    &waitGroup)

Flow

controller.reconcileClusterSecretKey basically adds the MCFinalizerName (value: machine.sapcloud.io/machine-controller) to the list of finalizers for all secrets that are referenced by machine classes within the same namespace.

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%

flowchart TD

a[ns, name = cache.SplitMetaNamespaceKey]
b["sec=secretLister.Secrets(ns).Get(name)"]
c["machineClasses=findMachineClassForSecret(name)
// Gets the slice of MachineClasses referring to the passed secret
//iterates through machine classes and 
// checks whether mc.SecretRef.Name or mcCredentialSecretRef.Name 
// matches secret name
"]
d{machineClasses empty?}
e["controller.addSecretFinalizers(sec)"] 
z(("return err"))
a-->b
b-->c
c-->d
d--Yes-->DeleteFinalizers["controller.deleteSecretFinalizers"]-->z
e--success-->z
d--No-->e

controller.addSecretFinalizers

func (c *controller) addSecretFinalizers(ctx context.Context, secret *corev1.Secret) error {

Basically adds machine.sapcloud.io/machine-controller to the secret's finalizers and uses controlCoreClient to update the secret.
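
A minimal sketch of that update, assuming MCFinalizerName holds machine.sapcloud.io/machine-controller; the actual source uses a sets-based helper, so treat this as illustrative.

func (c *controller) addSecretFinalizers(ctx context.Context, secret *corev1.Secret) error {
	clone := secret.DeepCopy()
	for _, f := range clone.Finalizers {
		if f == MCFinalizerName {
			return nil // finalizer already present, nothing to do
		}
	}
	clone.Finalizers = append(clone.Finalizers, MCFinalizerName)
	_, err := c.controlCoreClient.CoreV1().Secrets(clone.Namespace).Update(ctx, clone, metav1.UpdateOptions{})
	return err
}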

While perusing the below, you might need to reference Machine Controller Helper Functions as several reconcile functions delegate to helper methods defined on the machine controller struct.

Cluster Machine Reconciliation

func (c *controller) reconcileClusterMachineKey(key string) error

The top-level reconcile function for the machine that analyzes machine status and delegates to the individual reconcile functions for machine-creation, machine-deletion and machine-health-check flows.

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

A["ns,name=cache.SplitMetaNamespaceKey(mKey)"]
GetM["machine=machineLister.Machines(ns).Get(name)"]
ValM["validation.ValidateMachine(machine)"]
ValMC["machineClz,secretData,err=validation.ValidateMachineClass(machine)"]
LongR["retryPeriod=machineutils.LongRetry"]
ShortR["retryPeriod=machineutils.ShortRetry"]
EnqM["machineQueue.AddAfter(mKey, retryPeriod)"]
CheckMDel{"Is\nmachine.DeletionTimestamp\nSet?"}
NewDelReq["req=&driver.DeleteMachineRequest{machine,machineClz,secretData}"]
DelFlow["retryPeriod=controller.triggerDeletionFlow(req)"]
CreateFlow["retryPeriod=controller.triggerCreationFlow(req)"]
HasFin{"HasFinalizer(machine)"}
AddFin["addMachineFinalizers(machine)"]
CheckMachineNodeExists{"machine.Status.Node\nExists?"}
ReconcileMachineHealth["controller.reconcileMachineHealth(machine)"]
SyncNodeTemplates["controller.syncNodeTemplates(machine)"]
NewCreateReq["req=&driver.CreateMachineRequest{machine,machineClz,secretData}"]
Z(("End"))

Begin((" "))-->A
A-->GetM
EnqM-->Z
LongR-->EnqM
ShortR-->EnqM
GetM-->ValM
ValM--Ok-->ValMC
ValM--Err-->LongR
ValMC--Err-->LongR
ValMC--Ok-->CheckMDel
CheckMDel--Yes-->NewDelReq
CheckMDel--No-->HasFin
NewDelReq-->DelFlow
HasFin--No-->AddFin
HasFin--Yes-->ShortR
AddFin-->CheckMachineNodeExists
CheckMachineNodeExists--Yes-->ReconcileMachineHealth
CheckMachineNodeExists--No-->NewCreateReq
ReconcileMachineHealth--Ok-->SyncNodeTemplates
SyncNodeTemplates--Ok-->LongR
SyncNodeTemplates--Err-->ShortR
DelFlow-->EnqM
NewCreateReq-->CreateFlow
CreateFlow-->EnqM

controller.triggerCreationFlow

Controller method that orchestrates the call to Driver.CreateMachine.

This method badly needs to be split into several smaller functions; it is too long.

func (c *controller) triggerCreationFlow(ctx context.Context, 
cmr *driver.CreateMachineRequest) 
  (machineutils.RetryPeriod, error) 

Apologies for the HUMONGOUS flow diagram - ideally the code should have been split into smaller functions here.

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

ShortP["retryPeriod=machineutils.ShortRetry"]
MediumP["retryPeriod=machineutils.MediumRetry"]
Return(("return retryPeriod, err"))


Begin((" "))-->Init["
  machine     = cmr.Machine
	machineName = cmr.Machine.Name
  secretCopy := cmr.Secret.DeepCopy() //NOTE: seems Un-necessary?
"]
-->AddBootStrapToken["
 err = c.addBootstrapTokenToUserData(ctx, machine.Name, secretCopy)
//  get/create bootstrap token and populate inside secretCopy['userData']
"]
-->ChkErr{err != nil?}

ChkErr--Yes-->ShortP-->Return

ChkErr--No-->CreateMachineStatusReq["
  statusReq = & driver.GetMachineStatusRequest{
			Machine:      machine,
			MachineClass: cmr.MachineClass,
			Secret:       cmr.Secret,
		},
"]-->GetMachineStatus["
  statusResp, err := c.driver.GetMachineStatus(ctx, statusReq)
  //check if VM already exists
"]-->ChkStatusErr{err!=nil}

ChkStatusErr--No-->InitNodeNameFromStatusResp["
   nodeName = statusResp.NodeName
  providerID = statusResp.ProviderID
"]

ChkStatusErr--Yes-->DecodeErrorStatus["
  errStatus,decodeOk= status.FromError(err)
"]
DecodeErrorStatus-->CheckDecodeOk{"decodeOk ?"}

CheckDecodeOk--No-->MediumP-->Return
CheckDecodeOk--Yes-->AnalyzeCode{status.Code?}


AnalyzeCode--NotFound,Unimplemented-->ChkNodeLabel{"machine.Labels['node']?"}

ChkNodeLabel--No-->CreateMachine["
// node label is not present -> no machine
 resp, err := c.driver.CreateMachine(ctx, cmr)
"]-->ChkCreateError{err!=nil?}

ChkNodeLabel--Yes-->InitNodeNameFromMachine["
  nodeName = machine.Labels['node']
"]


AnalyzeCode--Unknown,DeadlineExceeded,Aborted,Unavailable-->ShortRetry["
retryPeriod=machineutils.ShortRetry
"]-->GetLastKnownState["
  lastKnownState := machine.Status.LastKnownState
"]-->InitFailedOp["
 lastOp := LastOperation{
    Description: err.Error(),
    State: MachineStateFailed,
    Type: MachineOperationCreate,
    LastUpdateTime: Now(),
 };
 currStatus := CurrentStatus {
    Phase: MachineCrashLoopBackOff || MachineFailed (on create timeout)
    LastUpdateTime: Now()
 }
"]-->UpdateMachineStatus["
c.machineStatusUpdate(ctx,machine,lastOp,currStatus,lastKnownState)
"]-->Return


ChkCreateError--Yes-->SetLastKnownState["
  	lastKnownState = resp.LastKnownState
"]-->InitFailedOp

ChkCreateError--No-->InitNodeNameFromCreateResponse["
  nodeName = resp.NodeName
  providerID = resp.ProviderID
"]-->ChkStaleNode{"
// check stale node
nodeName != machineName 
&& nodeLister.Get(nodeName) exists"}


InitNodeNameFromStatusResp-->ChkNodeLabelAnnotPresent{"
cmr.Machine.Labels['node']
&& cmr.Machine.Annotations[MachinePriority] ?
"}
InitNodeNameFromMachine-->ChkNodeLabelAnnotPresent

ChkNodeLabelAnnotPresent--No-->CloneMachine["
  clone := machine.DeepCopy;
  clone.Labels['node'] = nodeName
  clone.Annotations[machineutils.MachinePriority] = '3'
  clone.Spec.ProviderID = providerID
"]-->UpdateMachine["
  _, err := c.controlMachineClient.Machines(clone.Namespace).Update(ctx, clone, UpdateOptions{})
"]-->ShortP




ChkStaleNode--No-->CloneMachine
ChkStaleNode--Yes-->CreateDMR["
  dmr := &driver.DeleteMachineRequest{
						Machine: &Machine{
							ObjectMeta: machine.ObjectMeta,
							Spec: MachineSpec{
								ProviderID: providerID,
							},
						},
						MachineClass: createMachineRequest.MachineClass,
						Secret:       secretCopy,
					}
"]-->DeleteMachine["
  _, err := c.driver.DeleteMachine(ctx, deleteMachineRequest)
  // discuss stale node case
  retryPeriod=machineutils.ShortRetry
"]-->InitFailedOp1["
 lastOp := LastOperation{
    Description: 'VM using old node obj',
    State: MachineStateFailed,
    Type: MachineOperationCreate, //seems wrong
    LastUpdateTime: Now(),
 };
 currStatus := CurrentStatus {
    Phase: MachineFailed (on create timeout)
    LastUpdateTime: Now()
 }
"]-->UpdateMachineStatus

ChkNodeLabelAnnotPresent--Yes-->ChkMachineStatus{"machine.Status.Node != nodeName
  || machine.Status.CurrentStatus.Phase == ''"}

ChkMachineStatus--No-->LongP["retryPeriod = machineutils.LongRetry"]-->Return

ChkMachineStatus--Yes-->CloneMachine1["
  clone := machine.DeepCopy()
  clone.Status.Node = nodeName
"]-->SetLastOp["
 lastOp := LastOperation{
    Description: 'Creating Machine on Provider',
    State: MachineStateProcessing,
    Type: MachineOperationCreate,
    LastUpdateTime: Now(),
 };
 currStatus := CurrentStatus {
    Phase: MachinePending,
    TimeoutActive:  true,
    LastUpdateTime: Now()
 }
 lastKnownState = clone.Status.LastKnownState
"]-->UpdateMachineStatus

style InitFailedOp text-align:left
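
The first branch of the diagram (the existence check against the provider) roughly corresponds to this excerpt-style sketch. It assumes the machinecodes status/codes helpers behave as shown in the diagram; variable names follow the diagram, not necessarily the source.

// Check whether the VM already exists before attempting creation.
var nodeName, providerID string
statusResp, err := c.driver.GetMachineStatus(ctx, &driver.GetMachineStatusRequest{
	Machine:      machine,
	MachineClass: cmr.MachineClass,
	Secret:       cmr.Secret,
})
if err == nil {
	// VM exists: adopt its identity instead of creating a duplicate.
	nodeName, providerID = statusResp.NodeName, statusResp.ProviderID
} else if errStatus, ok := status.FromError(err); ok {
	switch errStatus.Code() {
	case codes.NotFound, codes.Unimplemented:
		// No VM found (or GetMachineStatus unsupported): proceed to driver.CreateMachine.
	case codes.Unknown, codes.DeadlineExceeded, codes.Aborted, codes.Unavailable:
		// Transient provider error: record a failed LastOperation and re-queue with a short retry.
	}
}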

controller.triggerDeletionFlow

func (c *controller) triggerDeletionFlow(ctx context.Context, dmr *driver.DeleteMachineRequest) (machineutils.RetryPeriod, error) 

Please note that there is a sad use of machine.Status.LastOperation as, semantically, the next requested operation. This is confusing. TODO: Discuss this.

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

GM["machine=dmr.Machine\n
machineClass=dmr.MachineClass\n
secret=dmr.Secret"]
HasFin{"HasFinalizer(machine)"}
LongR["retryPeriod=machineUtils.LongRetry"]
ShortR["retryPeriod=machineUtils.ShortRetry"]
ChkMachineTerm{"machine.Status.CurrentStatus.Phase\n==MachineTerminating ?"}
CheckMachineOperation{"Check\nmachine.Status.LastOperation.Description"}
DrainNode["retryPeriod=c.drainNode(dmr)"]
DeleteVM["retryPeriod=c.deleteVM(dmr)"]
DeleteNode["retryPeriod=c.deleteNodeObject(dmr)"]
DeleteMachineFin["retryPeriod=c.deleteMachineFinalizers(machine)"]
SetMachineTermStatus["c.setMachineTerminationStatus(dmr)"]

CreateMachineStatusRequest["statusReq=&driver.GetMachineStatusRequest{machine, machineClass,secret}"]
GetVMStatus["retryPeriod=c.getVMStatus(statusReq)"]



Z(("End"))

Begin((" "))-->HasFin
HasFin--Yes-->GM
HasFin--No-->LongR
LongR-->Z
GM-->ChkMachineTerm
ChkMachineTerm--No-->SetMachineTermStatus
ChkMachineTerm--Yes-->CheckMachineOperation
SetMachineTermStatus-->ShortR
CheckMachineOperation--GetVMStatus-->CreateMachineStatusRequest
CheckMachineOperation--InitiateDrain-->DrainNode
CheckMachineOperation--InitiateVMDeletion-->DeleteVM
CheckMachineOperation--InitiateNodeDeletion-->DeleteNode
CheckMachineOperation--InitiateFinalizerRemoval-->DeleteMachineFin
CreateMachineStatusRequest-->GetVMStatus
GetVMStatus-->Z


DrainNode-->Z
DeleteVM-->Z
DeleteNode-->Z
DeleteMachineFin-->Z
ShortR-->Z

controller.reconcileMachineHealth

controller.reconcileMachineHealth reconciles the machine object with any change in node conditions or VM health.

func (c *controller) reconcileMachineHealth(ctx context.Context, machine *Machine) 
  (machineutils.RetryPeriod, error)

NOTES:

  1. Reference controller.isHealthy which checks the machine status conditions.

Health Check Flow Diagram

See What are the different phases of a Machine

Health Check Summary

  1. Gets the Node obj associated with the machine. If it IS NOT found, yet the current machine phase is Running, change the machine phase to Unknown, the last operation state to Processing, the last operation type to HealthCheck, update the machine status and return with a short retry.
  2. If the Node object IS found, then it checks whether the Machine.Status.Conditions are different from Node.Status.Conditions. If so it sets the machine conditions to the node conditions.
  3. If the machine IS NOT healthy (see isHealthy) but the current machine phase is Running, change the machine phase to Unknown, the last operation state to Processing, the last operation type to HealthCheck, update the machine status and return with a short retry (a sketch of this transition follows this list).
  4. If the machine IS healthy but the current machine phase is NOT Running and the machine's node does not have the node.gardener.cloud/critical-components-not-ready taint, check whether the last operation type was a Create.
    1. If the last operation type was a Create and last operation state is NOT marked as Successful, then delete the bootstrap token associated with the machine. Change the last operation state to Successful. Let the last operation type continue to remain as Create.
    2. If the last operation type was NOT a Create, change the last operation type to HealthCheck
    3. Change the machine phase to Running and update the machine status and return with a short retry.
    4. (The above 2 cases take care of a newly created machine and a machine that became OK after some temporary issue)
  5. If the current machine phase is Pending (ie machine being created: see triggerCreationFlow) get the configured machine creation timeout and check.
    1. If the timeout HAS NOT expired, enqueue the machine key on the machine work queue after 1m.
    2. If the timeout HAS expired, then change the last operation state to Failed and the machine phase to Failed. Update the machine status and return with a short retry.
  6. If the current machine phase is Unknown, get the effective machine health timeout and check.
    1. If the timeout HAS NOT expired, enqueue the machine key on the machine work queue after 1m.
    2. If the timeout HAS expired
      1. Get the machine deployment name machineDeployName := machine.Labels['name'] corresponding to this machine
      2. Register ONE permit with machineDeployName. See Permit Giver. Q: Background of this change? Couldn't we find a better way to throttle via work-queues instead of the complicated PermitGiver and go-routines? Even a simple lock would be OK here, right?
      3. Attempt to get ONE permit for machineDeployName using a lockAcquireTimeout of 1s
        1. Throttle to check whether machine CAN be marked as Failed using markable, err := controller.canMarkMachineFailed.
        2. If machine can be marked, change the last operation state (ie the health check) to Failed, preserve the last operation type, change machine phase to Failed. Update the machine status. See c.updateMachineToFailedState
        3. Then use wait.Poll using 100ms as pollInterval and 1s as cacheUpdateTimeout using the following poll condition function:
          1. Get the machine from the machineLister (which uses the cache of the shared informer)
          2. Return true if machine.Status.CurrentStatus.Phase is Failed or Terminating or the machine is not found
          3. Return false otherwise.
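
A minimal sketch of the Running → Unknown transition from steps 1 and 3. The helper name markMachineUnknown is made up for illustration; the actual code performs this inline on a clone within reconcileMachineHealth.

// Illustrative helper: move the machine to phase Unknown with a Processing
// HealthCheck LastOperation and persist the status.
func (c *controller) markMachineUnknown(ctx context.Context, machine *v1alpha1.Machine, reason string) (machineutils.RetryPeriod, error) {
	clone := machine.DeepCopy()
	clone.Status.CurrentStatus = v1alpha1.CurrentStatus{
		Phase:          v1alpha1.MachineUnknown,
		TimeoutActive:  true,
		LastUpdateTime: metav1.Now(),
	}
	clone.Status.LastOperation = v1alpha1.LastOperation{
		Description:    reason,
		State:          v1alpha1.MachineStateProcessing,
		Type:           v1alpha1.MachineOperationHealthCheck,
		LastUpdateTime: metav1.Now(),
	}
	_, err := c.controlMachineClient.Machines(clone.Namespace).UpdateStatus(ctx, clone, metav1.UpdateOptions{})
	return machineutils.ShortRetry, err
}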

Health Check Doubts

  1. TODO: Why don't we check the machine health using the Driver.GetMachineStatus in the reconcile Machine health ? (seems like something obvious to do and would have helped in those meltdown issues where machine was incorrectly marked as failed)
  2. TODO: why doesn't this code make use of the helper method: c.machineStatusUpdate ?
  3. TODO: Unclear why LastOperation.Description does not use/concatenate one of the predefined constants in machineutils
  4. TODO: code makes too much use of cloneDirty to check whether machine clone obj has changed, when it could easily return early in several branches.
  5. TODO: Code directly makes calls to enqueue machine keys on the machine queue and still returns retry periods to the caller, leading to un-necessary enqueues of machine keys. (spurious design)

controller.triggerUpdationFlow

Doesn't seem to be used ? Possibly dead code ?

Machine Controller Helper Methods

controller.addBootstrapTokenToUserData

This method is responsible for adding the bootstrap token for the machine.

Bootstrap tokens are used when joining new nodes to a cluster. Bootstrap Tokens are defined with a specific SecretType: bootstrap.kubernetes.io/token and live in the kube-system namespace. These Secrets are then read by the Bootstrap Authenticator in the API Server.

Reference

func (c *controller) addBootstrapTokenToUserData(ctx context.Context, machineName string, secret *corev1.Secret) error 

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

Begin((" "))
-->InitTokenSecret["
tokenID := hex.EncodeToString([]byte(machineName)[len(machineName)-5:])[:6]
// 6 chars length
secretName :='bootstrap-token-' + tokenID
"]
-->GetSecret["
	secret, err = c.targetCoreClient.CoreV1().Secrets('kube-system').Get(ctx, secretName, GetOptions{}) 
"]
-->ChkErr{err!=nil?}

ChkErr--Yes-->ChkNotFound{"IsNotFound(err)"}

ChkNotFound--Yes-->GenToken["
  tokenSecretKey = generateRandomStrOf16Chars
"]-->InitSecretData["
  data := map[string][]byte{
    'token-id': []byte(tokenID),
    'token-secret': []byte(tokenSecretKey),
    'expiration': []byte(c.safetyOptions.MachineCreationTimeout.Duration)
    //..others
 }
"]
-->InitSecret["
  	secret = &corev1.Secret{
				ObjectMeta: metav1.ObjectMeta{
					Name:      secretName,
					Namespace: metav1.NamespaceSystem,
				},
				Type: 'bootstrap.kubernetes.io/token'
				Data: data,
			}
"]
-->CreateSecret["
secret, err =c.targetCoreClient.CoreV1().Secrets('kube-system').Create(ctx, secret, CreateOptions{})
"]
-->ChkErr1{err!=nil?}

ChkErr1--Yes-->ReturnErr(("return err"))
ChkNotFound--No-->ReturnErr

ChkErr1--No-->CreateToken["
token = tokenID + '.' + tokenSecretKey
"]-->InitUserData["
  userDataBytes = secret.Data['userData']
  userDataStr = string(userDataBytes)
"]-->ReplaceUserData["
  	userDataStr = strings.ReplaceAll(userDataStr, placeholder, token)
    // placeholder is the 'BOOTSTRAP_TOKEN' marker within userData
   	secret.Data['userData'] = []byte(userDataStr)
    //discuss this.
"]-->ReturnNil(("return nil"))

style InitSecretData text-align:left
style InitSecret text-align:left

controller.addMachineFinalizers

This method checks for the MCMFinalizer Value: machine.sapcloud.io/machine-controller-manager and adds it if it is not present. It leverages k8s.io/apimachinery/pkg/util/sets package for its work.

This method is regularly called during machine reconciliation, if a machine does not have a deletion timestamp so that all non-deleted machines possess this finalizer.

func (c *controller) addMachineFinalizers(ctx context.Context, machine *v1alpha1.Machine) (machineutils.RetryPeriod, error) {
	var err error
	if finalizers := sets.NewString(machine.Finalizers...); !finalizers.Has(MCMFinalizerName) {
		finalizers.Insert(MCMFinalizerName)
		clone := machine.DeepCopy()
		clone.Finalizers = finalizers.List()
		_, err = c.controlMachineClient.Machines(clone.Namespace).Update(ctx, clone, metav1.UpdateOptions{})
		if err != nil {
			// Keep retrying until update goes through
			klog.Errorf("Failed to add finalizers for machine %q: %s", machine.Name, err)
		} else {
			// Return error even when machine object is updated
			klog.V(2).Infof("Added finalizer to machine %q with providerID %q and backing node %q", machine.Name, getProviderID(machine), getNodeName(machine))
			err = fmt.Errorf("Machine creation in process. Machine finalizers are UPDATED")
		}
	}
	return machineutils.ShortRetry, err
}

controller.setMachineTerminationStatus

setMachineTerminationStatus sets the machine status to Terminating. This is illustrated below. Please note that Machine.Status.LastOperation is set to an instance of the LastOperation struct (which at times appears to be a command for the next action? Discuss this.)

func (c *controller) setMachineTerminationStatus(ctx context.Context, dmr *driver.DeleteMachineRequest) (machineutils.RetryPeriod, error)  
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

CreateClone["clone := dmr.Machine.DeepCopy()"]
NewCurrStatus["currStatus := &v1alpha1.CurrentStatus{Phase:\n MachineTerminating, LastUpdateTime: time.Now()}"]
SetCurrentStatus["clone.Status.CurrentStatus = currStatus"]
UpdateStatus["c.controlMachineClient.Machines(ns).UpdateStatus(clone)"]
ShortR["retryPeriod=machineUtils.ShortRetry"]
Z(("Return"))

CreateClone-->NewCurrStatus
NewCurrStatus-->SetCurrentStatus
SetCurrentStatus-->UpdateStatus
UpdateStatus-->ShortR
ShortR-->Z

controller.machineStatusUpdate

Updates machine.Status.LastOperation, machine.Status.CurrentStatus and machine.Status.LastKnownState

func (c *controller) machineStatusUpdate(
	ctx context.Context,
	machine *v1alpha1.Machine,
	lastOperation v1alpha1.LastOperation,
	currentStatus v1alpha1.CurrentStatus,
	lastKnownState string) error 
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

CreateClone["clone := machine.DeepCopy()"]
-->InitClone["
	clone.Status.LastOperation = lastOperation
	clone.Status.CurrentStatus = currentStatus
	clone.Status.LastKnownState = lastKnownState
"]
-->ChkSimilarStatus{"isMachineStatusSimilar(
	clone.Status,
	machine.Status)"}

ChkSimilarStatus--No-->UpdateStatus["
	err:=c.controlMachineClient
	.Machines(clone.Namespace)
	.UpdateStatus(ctx, clone, metav1.UpdateOptions{})
"]
-->Z1(("return err"))
ChkSimilarStatus--Yes-->Z2(("return nil"))

NOTE: isMachineStatusSimilar implementation is quite sad. TODO: we should improve stuff like this when we move to controller-runtime.

controller.UpdateNodeTerminationCondition

controller.UpdateNodeTerminationCondition adds or updates the termination condition to the Node.Status.Conditions of the node object corresponding to the machine.

func (c *controller) UpdateNodeTerminationCondition(ctx context.Context, machine *v1alpha1.Machine) error 
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

Init["
	nodeName := machine.Labels['node']
	newTermCond := v1.NodeCondition{
		Type:               machineutils.NodeTerminationCondition,
		Status:             v1.ConditionTrue,
		LastHeartbeatTime:  Now(),
		LastTransitionTime: Now()}"]
-->GetCond["oldTermCond, err := nodeops.GetNodeCondition(ctx, c.targetCoreClient, nodeName, machineutils.NodeTerminationCondition)"]
-->ChkIfErr{"err != nil ?"}
ChkIfErr--Yes-->ChkNotFound{"apierrors.IsNotFound(err)"}
ChkNotFound--Yes-->ReturnNil(("return nil"))
ChkNotFound--No-->ReturnErr(("return err"))
ChkIfErr--No-->ChkOldTermCondNotNil{"oldTermCond != nil
&& machine.Status.CurrentStatus.Phase 
== MachineTerminating ?"}

ChkOldTermCondNotNil--No-->ChkMachinePhase{"Check\nmachine\n.Status.CurrentStatus\n.Phase?"}
ChkMachinePhase--MachineFailed-->NodeUnhealthy["newTermCond.Reason = machineutils.NodeUnhealthy"]
ChkMachinePhase--"else"-->NodeScaleDown["newTermCond.Reason=machineutils.NodeScaledDown
//assumes scaledown..why?"]
NodeUnhealthy-->UpdateCondOnNode["err=nodeops.AddOrUpdateConditionsOnNode(ctx, c.targetCoreClient, nodeName, newTermCond)"]
NodeScaleDown-->UpdateCondOnNode


ChkOldTermCondNotNil--Yes-->CopyTermReasonAndMessage["
newTermCond.Reason=oldTermCond.Reason
newTermCond.Message=oldTermCond.Message
"]
CopyTermReasonAndMessage-->UpdateCondOnNode


UpdateCondOnNode-->ChkNotFound

controller.isHealthy

Checks if a machine is healthy by checking its conditions.

func (c *controller) isHealthy(machine *v1alpha1.Machine) bool 
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

Begin((" "))
-->Init["
	conditions = machine.Status.Conditions
	badTypes = strings.Split(
	 'KernelDeadlock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable', 
		',')
"]-->ChkCondLen{"len(conditions)==0?"}

ChkCondLen--Yes-->ReturnF(("return false"))
ChkCondLen--No-->IterCond["c:= range conditions"]
IterCond-->ChkNodeReady{"c.Type=='Ready'
&& c.Status != 'True' ?"}--Yes-->ReturnF
ChkNodeReady
--No-->IterBadConditions["badType := range badTypes"]
-->ChkType{"badType == c.Type
&&
c.Status != 'False' ?"}
--Yes-->ReturnF

IterBadConditions--loop-->IterCond
ChkType--loop-->IterBadConditions




style Init text-align:left

NOTE

  1. controller.NodeConditions should be called controller.BadConditionTypes
  2. Iterate over machine.Status.Conditions
    1. If the Ready condition is not True, the node is determined to be un-healthy.
    2. If any of the bad condition types are detected, then the node is determined to be un-healthy (a Go sketch follows).
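
A sketch of these checks, assuming the condition types shown in the flowchart above; not the verbatim source.

func (c *controller) isHealthy(machine *v1alpha1.Machine) bool {
	conditions := machine.Status.Conditions
	if len(conditions) == 0 {
		return false
	}
	badTypes := strings.Split("KernelDeadlock,ReadonlyFilesystem,DiskPressure,NetworkUnavailable", ",")
	for _, cond := range conditions {
		if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
			return false // node not Ready -> machine unhealthy
		}
		for _, badType := range badTypes {
			if string(cond.Type) == badType && cond.Status != corev1.ConditionFalse {
				return false // a bad condition is active -> machine unhealthy
			}
		}
	}
	return true
}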

controller.getVMStatus

(BAD NAME FOR METHOD: should be called checkMachineExistenceAndEnqueueNextOperation)

func (c *controller) getVMStatus(ctx context.Context, 
    statusReq *driver.GetMachineStatusRequest) (machineutils.RetryPeriod, error)

This method is only called for the delete flow.

  1. It attempts to get the machine status
  2. If the machine exists, it updates the machine status operation to InitiateDrain and returns a ShortRetry for the machine work queue.
  3. If attempt to get machine status failed, it will obtain the error code from the error.
    1. For Unimplemented (ie the GetMachineStatus op is not implemented), it does the same as 2, ie: it updates the machine status operation to InitiateDrain and returns a ShortRetry for the machine work queue.
    2. If decoding the error code failed, it will update the machine status operation to machineutils.GetVMStatus and returns a LongRetry for the machine key into the machine work queue.
      1. Unsure how we get out of this Loop. TODO: Discuss this.
    3. For Unknown|DeadlineExceeded|Aborted|Unavailable it updates the machine status operation to machineutils.GetVMStatus status and returns a ShortRetry for the machine work queue. (So that reconcile will run this method again in future)
    4. For the NotFound code (ie the machine is not found), it will enqueue node deletion by updating the machine status operation to machineutils.InitiateNodeDeletion and returning a ShortRetry for the machine work queue.
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

GetMachineStatus["_,err=driver.GetMachineStatus(statusReq)"]
ChkMachineExists{"err==nil ?\n (ie machine exists)"}
DecodeErrorStatus["errStatus,decodeOk= status.FromError(err)"]
CheckDecodeOk{"decodeOk ?"}

CheckErrStatusCode{"Check errStatus.Code"}

CreateDrainOp["op:=LastOperation{Description: machineutils.InitiateDrain
State: v1alpha1.MachineStateProcessing,
Type: v1alpha1.MachineOperationDelete,
Time: time.Now()}"]

CreateNodeDelOp["op:=LastOperation{Description: machineutils.InitiateNodeDeletion
State: v1alpha1.MachineStateProcessing,
Type: v1alpha1.MachineOperationDelete,
Time: time.Now()}"]

CreateDecodeFailedOp["op:=LastOperation{Description: machineutils.GetVMStatus,
State: v1alpha1.MachineStateFailed,
Type: v1alpha1.MachineOperationDelete,
Time: time.Now()}"]

CreaterRetryVMStatusOp["op:=LastOperation{Description: machineutils.GetVMStatus,
State: v1alpha1.MachineStateFailed,
Type:  v1alpha1.MachineOperationDelete,
Time: time.Now()}"]

ShortR["retryPeriod=machineUtils.ShortRetry"]
LongR["retryPeriod=machineUtils.LongRetry"]
UpdateMachineStatus["c.machineStatusUpdate(machine,op,machine.Status.CurrentStatus, machine.Status.LastKnownState)"]

Z(("End"))

GetMachineStatus-->ChkMachineExists

ChkMachineExists--Yes-->CreateDrainOp
ChkMachineExists--No-->DecodeErrorStatus
DecodeErrorStatus-->CheckDecodeOk
CheckDecodeOk--Yes-->CheckErrStatusCode
CheckDecodeOk--No-->CreateDecodeFailedOp
CreateDecodeFailedOp-->LongR
CheckErrStatusCode--"Unimplemented"-->CreateDrainOp
CheckErrStatusCode--"Unknown|DeadlineExceeded|Aborted|Unavailable"-->CreaterRetryVMStatusOp
CheckErrStatusCode--"NotFound"-->CreateNodeDelOp
CreaterRetryVMStatusOp-->ShortR

CreateDrainOp-->ShortR
CreateNodeDelOp-->ShortR
ShortR-->UpdateMachineStatus
LongR-->UpdateMachineStatus
UpdateMachineStatus-->Z

controller.drainNode

Inside pkg/util/provider/machinecontroller/machine_util.go

func (c *controller) drainNode(ctx context.Context, dmr *driver.DeleteMachineRequest) (machineutils.RetryPeriod, error)
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD


Initialize["err = nil
machine = dmr.Machine
nodeName= machine.Labels['node']
drainTimeout=machine.Spec.MachineConfiguration.MachineDrainTimeout || c.safetyOptions.MachineDrainTimeout
maxEvictRetries=machine.Spec.MachineConfiguration.MaxEvictRetries || c.safetyOptions.MaxEvictRetries
skipDrain = false"]
-->GetNodeReadyCond["nodeReadyCond = machine.Status.Conditions contains k8s.io/api/core/v1/NodeReady
readOnlyFSCond=machine.Status.Conditions contains 'ReadonlyFilesystem' 
"]
-->ChkNodeNotReady["skipDrain = (nodeReadyCond.Status == ConditionFalse) && nodeReadyCond.LastTransitionTime.Time > 5m
or (readOnlyFSCond.Status == ConditionTrue) && readOnlyFSCond.LastTransitionTime.Time > 5m
// discuss this
"]
-->ChkSkipDrain{"skipDrain true?"}
ChkSkipDrain--Yes-->SetOpStateProcessing
ChkSkipDrain--No-->SetHasDrainTimedOut["hasDrainTimedOut = time.Now() > machine.DeletionTimestamp + drainTimeout"]
SetHasDrainTimedOut-->ChkForceDelOrTimedOut{"machine.Labels['force-deletion']
  || hasDrainTimedOut"}

ChkForceDelOrTimedOut--Yes-->SetForceDelParams["
  forceDeletePods=true
  drainTimeout=1m
  maxEvictRetries=1
  "]
SetForceDelParams-->UpdateNodeTermCond["err=c.UpdateNodeTerminationCondition(ctx, machine)"]
ChkForceDelOrTimedOut--No-->UpdateNodeTermCond

UpdateNodeTermCond-->ChkUpdateErr{"err != nil ?"}
ChkUpdateErr--No-->InitDrainOpts["
  // params reduced for brevity
  drainOptions := drain.NewDrainOptions(
    c.targetCoreClient,
    drainTimeout,
    maxEvictRetries,
    c.safetyOptions.PvDetachTimeout.Duration,
    c.safetyOptions.PvReattachTimeout.Duration,
    nodeName,
    forceDeletePods,
    c.driver,
		c.pvcLister,
		c.pvLister,
    c.pdbV1Lister,
		c.nodeLister,
		c.volumeAttachmentHandler)
"]
ChkUpdateErr--"Yes&&forceDelPods"-->InitDrainOpts
ChkUpdateErr--Yes-->SetOpStateFailed["opstate = v1alpha1.MachineStateFailed
  description=machineutils.InitiateDrain
  //drain failed. retry next sync
  "]

InitDrainOpts-->RunDrain["err = drainOptions.RunDrain(ctx)"]
RunDrain-->ChkDrainErr{"err!=nil?"}
ChkDrainErr--No-->SetOpStateProcessing["
  opstate= v1alpha1.MachineStateProcessing
  description=machineutils.InitiateVMDeletion
// proceed with vm deletion"]
ChkDrainErr--"Yes && forceDeletePods"-->SetOpStateProcessing
ChkDrainErr--Yes-->SetOpStateFailed
SetOpStateProcessing-->
InitLastOp["lastOp:=v1alpha1.LastOperation{
			Description:    description,
			State:          state,
			Type:           v1alpha1.MachineOperationDelete,
			LastUpdateTime: metav1.Now(),
		}
  //lastOp is actually the *next* op semantically"]
SetOpStateFailed-->InitLastOp
InitLastOp-->UpdateMachineStatus["c.machineStatusUpdate(ctx,machine,lastOp,machine.Status.CurrentStatus,machine.Status.LastKnownState)"]
-->Return(("machineutils.ShortRetry, err"))

Note on above

  1. We skip the drain if the node has been set to ReadonlyFilesystem for over 5 minutes
    1. Check TODO: ReadonlyFilesystem is an MCM condition and not a k8s core node condition. Not sure if we are mis-using this field. TODO: Check this.
  2. Check TODO: Why do we check that the node has not been ready for 5m in order to skip the drain? Shouldn't we skip the drain if the node is simply not ready? Why wait for 5m here? (a sketch of the skip-drain check follows this list)
  3. See Run Drain
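
A hedged sketch of the skip-drain computation questioned in note 2. The condition-lookup helpers are elided; the 5-minute window is taken from the diagram above.

// skipDrain becomes true when the node has been NotReady, or its filesystem
// read-only, for longer than five minutes.
const conditionWindow = 5 * time.Minute
skipDrain := false
if nodeReadyCond != nil && nodeReadyCond.Status == corev1.ConditionFalse &&
	time.Since(nodeReadyCond.LastTransitionTime.Time) > conditionWindow {
	skipDrain = true
}
if readOnlyFSCond != nil && readOnlyFSCond.Status == corev1.ConditionTrue &&
	time.Since(readOnlyFSCond.LastTransitionTime.Time) > conditionWindow {
	skipDrain = true
}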

controller.deleteVM

Called by controller.triggerDeletionFlow

func (c *controller) deleteVM(ctx context.Context, dmReq *driver.DeleteMachineRequest) 
	(machineutils.RetryPeriod, error)

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

Begin((" "))
-->Init["
machine = dmr.Machine
"]
-->CallDeleteMachine["
dmResp, err := c.driver.DeleteMachine(ctx, dmReq)
"]
-->ChkDelErr{"err!=nil?"}

ChkDelErr--No-->SetSuccShortRetry["
retryPeriod = machineutils.ShortRetry
description = 'VM deletion was successful.'+ machineutils.InitiateNodeDeletion)
state = MachineStateProcessing
"]
-->InitLastOp["
lastOp := LastOperation{
Description:    description,
State:          state,
Type:           MachineOperationDelete,
LastUpdateTime: Now(),
},
"]
-->SetLastKnownState["
// useless since driver impls don't set this?
// Use machine.Status.LastKnownState instead ?
lastKnownState = dmResp.LastKnownState
"]
-->UpdateMachineStatus["
//Discuss: Introduce finer grained phase for status ?
c.machineStatusUpdate(ctx,machine,lastOp,machine.Status.CurrentStatus,lastKnownState)"]-->Return(("retryPeriod, err"))

ChkDelErr--Yes-->DecodeErrorStatus["
  errStatus,decodeOk= status.FromError(err)
"]
DecodeErrorStatus-->CheckDecodeOk{"decodeOk ?"}
CheckDecodeOk--No-->SetFailed["
	state = MachineStateFailed
	description = 'machine decode error' + machineutils.InitiateVMDeletion
"]-->SetLongRetry["
retryPeriod= machineutils.LongRetry
"]-->InitLastOp

CheckDecodeOk--Yes-->AnalyzeCode{status.Code?}
AnalyzeCode--Unknown, DeadlineExceeded,Aborted,Unavailable-->SetDelFailed["
state = MachineStateFailed
description = VM deletion failed due to
+ err + machineutils.InitiateVMDeletion"]
-->SetShortRetry["
retryPeriod= machineutils.ShortRetry
"]-->InitLastOp

AnalyzeCode--NotFound-->DelSuccess["
// we can proceed with deleting node.
description = 'VM not found. Continuing deletion flow'+ machineutils.InitiateNodeDeletion
state = MachineStateProcessing
"]-->SetShortRetry

AnalyzeCode--default-->DelUnknown["
state = MachineStateFailed
description='VM deletion failed due to' + err 
+ 'Aborting..' + machineutils.InitiateVMDeletion
"]-->SetLongRetry

controller.deleteNodeObject

NOTE: Should have just called this controller.deleteNode for naming consistency with other methods.

Called by triggerDeletionFlow after successfully deleting the VM.

func (c *controller) deleteNodeObject(ctx context.Context, 
machine *v1alpha1.Machine) 
	(machineutils.RetryPeriod, error) 
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

Begin((" "))
-->Init["
nodeName := machine.Labels['node']
"]
-->ChkNodeName{"nodeName != ''" ?}
ChkNodeName--Yes-->DelNode["
err = c.targetCoreClient.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{})
"]-->ChkDelErr{"err!=nil?"}

ChkNodeName--No-->NodeObjNotFound["
	state = MachineStateProcessing
	description = 'No node object found for' + nodeName + 'Continue'
	+ machineutils.InitiateFinalizerRemoval
"]-->InitLastOp

ChkDelErr--No-->DelSuccess["
	state = MachineStateProcessing
	description = 'Deletion of Node' + nodeName + 'successful'
	 + machineutils.InitiateFinalizerRemoval 
"]-->InitLastOp

ChkDelErr--Yes&&!apierrorsIsNotFound-->FailedNodeDel["
	state = MachineStateFailed
	description = 'Deletion of Node' + 
		nodeName + 'failed due to' + err + machineutils.InitiateNodeDeletion
"]
-->InitLastOp["
lastOp := LastOperation{
Description:    description,
State:          state,
Type:           MachineOperationDelete,
LastUpdateTime: Now(),
},
"]
-->UpdateMachineStatus["
c.machineStatusUpdate(ctx,machine,lastOp,machine.Status.CurrentStatus,machine.Status.LastKnownState)"]
-->Return(("machineutils.ShortRetry, err"))

controller.syncMachineNodeTemplates

func (c *controller) syncMachineNodeTemplates(ctx context.Context, 
	machine *v1alpha1.Machine) 
		(machineutils.RetryPeriod, error) 

See MachineSpec

syncMachineNodeTemplates syncs machine.Spec.NodeTemplateSpec between machine and corresponding node-object. A NodeTemplateSpec just wraps a core NodeSpec and ObjectMetadata

Get the node for the given machine and then:

  1. Synchronize node.Annotations to machine.Spec.NodeTemplateSpec.Annotations.
  2. Synchronize node.Labels to machine.Spec.NodeTemplateSpec.Labels
  3. Synchronize node.Spec.Taints to machine.Spec.NodeTemplateSpec.Spec.Taints
  4. Update the node object if there were changes.

Since we should not delete third-party annotations on the node object, synchronizing deleted ALT (annotations, labels, taints) from the machine object is a bit tricky, so we maintain yet another custom annotation node.machine.sapcloud.io/last-applied-anno-labels-taints (code constant LastAppliedALTAnnotation) on the Node object, which is a JSON string of the NodeTemplateSpec.

  1. Before synchronizing, we un-marshall this LastAppliedALTAnnotation value into a lastAppliedALT of type NodeTemplateSpec.
  2. While synchronizing annotations, labels and taints, we check respectively whether lastAppliedALT.Annotations, lastAppliedALT.Labels and lastAppliedALT.Spec.Taints hold keys that are NOT in the corresponding machine.Spec.NodeTemplateSpec.Annotations, machine.Spec.NodeTemplateSpec.Labels and machine.Spec.NodeTemplateSpec.Spec.Taints.
    1. If so, we delete keys from the corresponding node.Annotations, node.Labels and node.Spec.Taints respectively.
    2. We maintain a boolean saying the LastAppliedALTAnnotation needs to be updated
  3. Just before updating the Node object, we check if LastAppliedALTAnnotation needs updating and, if so, we JSON-marshal machine.Spec.NodeTemplateSpec and overwrite LastAppliedALTAnnotation with this value (see the sketch after this list)
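
A hedged sketch of the LastAppliedALTAnnotation bookkeeping from steps 1 and 3. The helper name and the bare constant reference are illustrative; the real code does this inline within syncMachineNodeTemplates.

// Illustrative only: read the previously applied ALT and, after syncing,
// write the machine's current NodeTemplateSpec back into the annotation.
func syncALTAnnotation(node *corev1.Node, machine *v1alpha1.Machine) error {
	var lastAppliedALT v1alpha1.NodeTemplateSpec
	if raw, ok := node.Annotations[LastAppliedALTAnnotation]; ok {
		if err := json.Unmarshal([]byte(raw), &lastAppliedALT); err != nil {
			return err
		}
	}
	// ... delete node annotations/labels/taints that are present in lastAppliedALT
	// but no longer present in machine.Spec.NodeTemplateSpec ...
	raw, err := json.Marshal(machine.Spec.NodeTemplateSpec)
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[LastAppliedALTAnnotation] = string(raw)
	return nil
}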

Node Drain

Node Drain code is in github.com/gardener/machine-controller-manager/pkg/util/provider/drain/drain.go

Drain Utilities

VolumeAttachmentHandler

pkg/util/provider/drain.VolumeAttachmentHandler is a handler used to distribute incoming k8s.io/api/storage/v1.VolumeAttachment requests to a number of workers, where each worker is a channel of type *VolumeAttachment.

A k8s.io/api/storage/v1.VolumeAttachment is a non-namespaced k8s object that captures the intent to attach or detach the specified volume to/from the specified node. See VolumeAttachment

type VolumeAttachmentHandler struct {
	sync.Mutex
	workers []chan *storagev1.VolumeAttachment
}

// NewVolumeAttachmentHandler returns a new VolumeAttachmentHandler
func NewVolumeAttachmentHandler() *VolumeAttachmentHandler {
	return &VolumeAttachmentHandler{
		Mutex:   sync.Mutex{},
		workers: []chan *storagev1.VolumeAttachment{},
	}
}

VolumeAttachmentHandler.AddWorker

AddWorker appends a buffered channel of size 20 of type VolumeAttachment to the workers slice in VolumeAttachmentHandler. There is an assumption that not more than 20 unprocessed objects would exist at a given time. On buffering requests beyond this, the channel will start dropping writes. See the dispatch method.

func (v *VolumeAttachmentHandler) AddWorker() chan *storagev1.VolumeAttachment {
	// chanSize is the channel buffer size to hold requests.
	// This assumes that not more than 20 unprocessed objects exist at a given time.
	// On buffering requests beyond this, the channel will start dropping writes.
	const chanSize = 20

	klog.V(4).Infof("Adding new worker. Current active workers %d - %v", len(v.workers), v.workers)

	v.Lock()
	defer v.Unlock()

	newWorker := make(chan *storagev1.VolumeAttachment, chanSize)
	v.workers = append(v.workers, newWorker)

	klog.V(4).Infof("Successfully added new worker %v. Current active workers %d - %v", newWorker, len(v.workers), v.workers)
	return newWorker
}

VolumeAttachmentHandler.dispatch

The dispatch method is responsible for distributing incoming VolumeAttachments to the registered worker channels.

func (v *VolumeAttachmentHandler) dispatch(obj interface{}) {
	if len(v.workers) == 0 {
		// As no workers are registered, nothing to do here.
		return
	}
	volumeAttachment := obj.(*storagev1.VolumeAttachment)
	v.Lock()
	defer v.Unlock()

	for i, worker := range v.workers {
		select {
		// submit volume attachment to the worker channel if channel is not full
		case worker <- volumeAttachment:
		default:
			klog.Warningf("Worker %d/%v is full. Discarding value.", i, worker)
			// TODO: Umm..isn't this problematic if we miss this ?
		}
	}
}

The Add|Update methods below delegate to dispatch. The usage of this utility involves specifying the add/update methods below as the event handler callbacks on an instance of k8s.io/client-go/informers/storage/v1.VolumeAttachmentInformer. This way incoming volume attachments are distributed to several worker channels.

func (v *VolumeAttachmentHandler) AddVolumeAttachment(obj interface{}) {
	v.dispatch(obj)
}

func (v *VolumeAttachmentHandler) UpdateVolumeAttachment(oldObj, newObj interface{}) {
	v.dispatch(newObj)
}

VolumeAttachmentHandler Initialization in MC

During construction of the MC controller struct, we initialize the callback methods on volume attachment handler using the volume attachment informer

func NewController(...) {
	//...
	controller.volumeAttachmentHandler = drain.NewVolumeAttachmentHandler()
	volumeAttachmentInformer.Informer().AddEventHandler(
		cache.ResourceEventHandlerFuncs{
			AddFunc:    controller.volumeAttachmentHandler.AddVolumeAttachment,
			UpdateFunc: controller.volumeAttachmentHandler.UpdateVolumeAttachment,
		})
	//...
}
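
On the consuming side, the PV-aware drain path registers a worker channel for the duration of a drain. A hedged usage sketch; DeleteWorker is assumed to be the cleanup counterpart of AddWorker.

// Register a channel to observe VolumeAttachment add/update events during this drain.
volumeAttachmentEventCh := o.volumeAttachmentHandler.AddWorker()
defer o.volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh) // assumed cleanup counterpart to AddWorker
// ...later, wait for the PVs of an evicted pod to re-attach elsewhere:
// err := o.waitForReattach(ctx, podVolumeInfo, nodeName, volumeAttachmentEventCh)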

Drain

Drain Types

Drain Constants

  • PodEvictionRetryInterval is the interval in which to retry eviction for pods
  • GetPvDetailsMaxRetries is the number of max retries to get PV details using the PersistentVolumeLister or PersistentVolumeClaimLister
  • GetPvDetailsRetryInterval is the interval in which to retry getting PV details
const (
	PodEvictionRetryInterval  = time.Second * 20
	GetPvDetailsMaxRetries    = 3
	GetPvDetailsRetryInterval = time.Second * 5
)

drain.Options

drain.Options are configurable options while draining a node before deletion

NOTE: Unused fields/Fields with constant defaults omitted for brevity

type Options struct {
	client                       kubernetes.Interface
	kubernetesVersion            *semver.Version
	Driver                       driver.Driver
	drainStartedOn               time.Time
	drainEndedOn                 time.Time
	ErrOut                       io.Writer
	ForceDeletePods              bool
	MaxEvictRetries              int32
	PvDetachTimeout              time.Duration
	PvReattachTimeout            time.Duration
	nodeName                     string
	Out                          io.Writer
	pvcLister                    corelisters.PersistentVolumeClaimLister
	pvLister                     corelisters.PersistentVolumeLister
	pdbV1Lister                  policyv1listers.PodDisruptionBudgetLister
	nodeLister                   corelisters.NodeLister
	volumeAttachmentHandler      *VolumeAttachmentHandler
	Timeout                      time.Duration
}

drain.PodVolumeInfo

drain.PodVolumeInfo is the struct used to encapsulate the PV names and PV ID's for all the PVs attached to the pod

type PodVolumeInfo struct {
	persistentVolumeList []string
	volumeList           []string
}

NOTE: The struct fields are badly named.

  • PodVolumeInfo.persistentVolumeList is a slice of persistent volume names. This is from PersistentVolumeSpec.VolumeName
  • PodVolumeInfo.volumeList is a slice of persistent volume IDs. This is obtained using driver.GetVolumeIDs given the PV Spec. This is generally the CSI volume id.

drain.Options.evictPod

func (o *Options) evictPod(ctx context.Context, pod *corev1.Pod, policyGroupVersion string) error 

drain.Options.evictPod is a simple helper method to evict a Pod using the Eviction API

  • TODO: GracePeriodSeconds in the code is useless here and should be removed as it is always -1.
  • TODO: Currently this method uses old k8s.io/api/policy/v1beta1. It must be changed to k8s.io/api/policy/v1
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

Begin(("" ))
-->InitTypeMeta["
		typeMeta:= metav1.TypeMeta{
			APIVersion: policyGroupVersion,
			Kind:       'Eviction',
		},
"]
-->InitObjectMeta["
		objectMeta := ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name,
			Namespace: pod.Namespace,
		},
"]
-->InitEviction["
eviction := &.Eviction{TypeMeta: typeMeta, ObjectMeta: objectMeta }
"]
-->EvictPod["
 err := o.client.PolicyV1beta1().Evictions(eviction.Namespace).Evict(ctx, eviction)
"]
-->ReturnErr(("return err"))

style InitEviction text-align:left

drain.Options.deletePod

Simple helper method to delete a Pod

func (o *Options) deletePod(ctx context.Context, pod *corev1.Pod) error {

Just delegates to PodInterface.Delete

o.client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{} )

drain.Options.getPodsForDeletion

drain.getPodsForDeletion returns all the pods we're going to delete. If there are any pods preventing us from deleting, we return that list in an error.

func (o *Options) getPodsForDeletion(ctx context.Context) 
	(pods []corev1.Pod, err error)
  1. Get all pods associated with the node.
    podList, err := o.client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
     	FieldSelector: fields.SelectorFromSet(fields.Set{"spec.nodeName": o.nodeName}).String()})
    
  2. Iterate through podList.
  3. Apply a bunch of pod filters.
    1. Remove mirror pods from consideration for deletion. See Static Pods
    2. Local Storage Filter. Discuss: seems useless. If a Pod has local storage, remove it from consideration. This filter iterates through the Pod.Spec.Volumes slice and checks whether Volume.EmptyDir is non-nil in order to determine whether the pod uses local storage.
    3. A Pod whose Pod.Status.Phase is PodSucceeded or PodFailed is eligible for deletion
      1. If a Pod has a controller owner reference, it is eligible for deletion. (TODO: Unsure why this makes a difference anyways)
    4. The final pod filter daemonsetFilter seems useless. Discuss.

drain.Options.getPVList

NOTE: Should be called getPVNames. Gets a slice of the persistent volume names bound to the Pod through its claims. Contains time.Sleep and retry handling up to a limit. Unsure if this is the best way. Discuss. (A sketch under the suggested name follows the list below.)

func (o *Options) getPVList(pod *corev1.Pod) (pvNames []string, err error) 
  1. Iterate over pod.Spec.Volumes.
  2. If volume.PersistentVolumeClaim reference is not nil, gets the PersistentVolumeClaim using o.pvcLister using vol.PersistentVolumeClaim.ClaimName.
    1. Implements error handling and retry till GetPvDetailsMaxRetries is reached with interval GetPvDetailsRetryInterval for the above.
  3. Adds pvc.Spec.VolumeName to pvNames
  4. Return pvNames
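
A sketch under the suggested name getPVNames (illustrative, not the actual source) of the lookup-with-retries described above.

// Collect the PV names bound to the pod via its PVCs, retrying lister lookups
// up to GetPvDetailsMaxRetries with GetPvDetailsRetryInterval between attempts.
func (o *Options) getPVNames(pod *corev1.Pod) ([]string, error) {
	var pvNames []string
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim == nil {
			continue
		}
		var (
			pvc *corev1.PersistentVolumeClaim
			err error
		)
		for retry := 0; retry < GetPvDetailsMaxRetries; retry++ {
			pvc, err = o.pvcLister.PersistentVolumeClaims(pod.Namespace).Get(vol.PersistentVolumeClaim.ClaimName)
			if err == nil {
				break
			}
			time.Sleep(GetPvDetailsRetryInterval)
		}
		if err != nil {
			return nil, err
		}
		pvNames = append(pvNames, pvc.Spec.VolumeName)
	}
	return pvNames, nil
}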

drain.Options.getVolIDsFromDriver

Given a slice of PV Names, this method gets the corresponding volume ids from the driver.

  • It does this by first getting the PersistentVolumeSpec using o.pvLister.Get(pvName) for each PV name and adding to the pvSpecs slice of type PersistentVolumeSpec. See k8s.io/client-go/listers/core/v1.PersistentVolumeLister
  • Retry handling is implemented here while looking up pvName till GetPvDetailsMaxRetries is reached with sleep interval of GetPvDetailsRetryInterval between each retry attempt.
  • Once the pvSpecs slice is populated, it constructs a driver.GetVolumeIDsRequest from the same and then invokes driver.GetVolumeIDs(driver.GetVolumeIDsRequest) to obtain the driver.GetVolumeIDsResponse and returns driver.GetVolumeIDsResponse.VolumeIDs

TODO: BUG ? In case the PV is not found or retry limit is reached the slice of volume ids will not have a 1:1 correspondence with slice of PV names passed in.

func (o *Options) getVolIDsFromDriver(ctx context.Context, pvNames []string) ([]string, error)

drain.Options.doAccountingOfPvs

drain.Options.doAccountingOfPvs returns a map of the pod key pod.Namespace + '/' + pod.Name to a PodVolumeInfo struct which holds a slice of PV names and PV IDs.


%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD
Begin((" "))
-->Init["
	podKey2VolNamesMap = make(map[string][]string)
	podKey2VolInfoMap = make(map[string]PodVolumeInfo)
"]
-->RangePods["
	for pod := range pods
"]
-->PopPod2VolNames["
	podKey2VolNamesMap[pod.Namespace + '/' + pod.Name] = o.getPVList(pod)
"]
--loop-->RangePods
PopPod2VolNames--done-->FilterSharedPVs["
	filterSharedPVs(podKey2VolNamesMap)
// filters out the PVs that are shared among pods.
"]
-->RangePodKey2VolNamesMap["
	for podKey, volNames := range podKey2VolNamesMap
"]
-->GetVolumeIds["
	volumeIds, err := o.getVolIDsFromDriver(ctx, volNames)
	if err != nil continue; //skip set of volumes
"]
-->InitPodVolInfo["
	podVolumeInfo := PodVolumeInfo{
			persistentVolumeList: volNames,
			volumeList:           volumeIds
	}
	//struct field names are bad.
"]
-->PopPodVolInfoMap["
	podKey2VolInfoMap[podKey] = podVolumeInfo
"]
--loop-->RangePodKey2VolNamesMap
PopPodVolInfoMap--done-->Return(("return podKey2VolInfoMap"))

drain.filterPodsWithPv

NOTE: should have been named partitionPodsWithPVC

Utility function that iterates through given pods and for each pod, iterates through its pod.Spec.Volumes. For each such pod volume checks volume.PersistentVolumeClaim. If not nil, adds pod to slice podsWithPV else adds pod to slice podsWithoutPV

func filterPodsWithPv(pods []corev1.Pod) 
    (podsWithPV []*corev1.Pod, podsWithoutPV []*corev1.Pod) 

drain.Options.waitForDetach

func (o *Options) waitForDetach(ctx context.Context, 
	podVolumeInfo PodVolumeInfo, 
	nodeName string) error

Summary:

  1. Initialize a boolean found to true (representing that a volume is still attached to the node); a sketch of the full loop follows this list.
  2. Begins a loop while found is true
  3. Uses a select and checks to see if a signal is received from context.Done() (ie context cancelled). If so, return an error with the message that a timeout occurred while waiting for PV's to be detached.
  4. Sets found to false.
  5. Gets the Node associated with nodeName using the nodeLister and assign to node. If there is an error return from the function.
  6. Gets the node.Status.VolumesAttached which returns a []AttachedVolume and assign to attachedVols.
    1. Return nil if this slice is empty.
  7. Begin iteration range podVolumeInfo.volumeList assigning volumeID in parent iteration. Label this iteration as LookUpVolume.
    1. Begin inner iteration over attachedVols assigning attachedVol in nested iteration
    2. If attachedVol.Name is contained in volumeID then this volume is still attached to the node.
      1. Set found to true
      2. Sleep for VolumeDetachPollInterval seconds (5 seconds)
      3. Break out of LookUpVolume
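
A sketch of the loop summarized above (VolumeDetachPollInterval assumed to be 5 seconds, per step 7.2.2; not the verbatim source).

func (o *Options) waitForDetach(ctx context.Context, podVolumeInfo PodVolumeInfo, nodeName string) error {
	found := true
	for found {
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for PVs to detach from node %q", nodeName)
		default:
		}
		found = false
		node, err := o.nodeLister.Get(nodeName)
		if err != nil {
			return err
		}
		attachedVols := node.Status.VolumesAttached
		if len(attachedVols) == 0 {
			return nil
		}
	LookUpVolume:
		for _, volumeID := range podVolumeInfo.volumeList {
			for _, attachedVol := range attachedVols {
				if strings.Contains(volumeID, string(attachedVol.Name)) {
					// volume is still attached to the old node; poll again after a pause
					found = true
					time.Sleep(5 * time.Second)
					break LookUpVolume
				}
			}
		}
	}
	return nil
}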

drain.Options.waitForReattach

Purpose: Waits for persistent volume names in podVolumeInfo.persistentVolumeList to be re-attached (to another node).

But I am still confused about why we need to call this during the drain flow. Why should we wait for PVs to be attached to another node? After all, there is no guarantee they will be attached, right?

func (o *Options) waitForReattach(ctx context.Context, 
	podVolumeInfo PodVolumeInfo, 
	previousNodeName string, 
	volumeAttachmentEventCh chan *VolumeAttachment) error 
  1. Construct a map: var pvsWaitingForReattachments map[string]bool
  2. Initialize the above map by ranging through podVolumeInfo.persistentVolumeList, taking each persistentVolumeName and setting pvsWaitingForReattachments[persistentVolumeName] = true
  3. Start a for loop.
    1. Commence a select with following cases:
      1. Case: Check to see if context is closed/cancelled by reading: <-ctx.Done(). If so, return an error with the message that timeout occurred while waiting for PV's to reattach to another node.
      2. Case: Obtain a *VolumeAttachment by reading from channel: incomingEvent := <-volumeAttachmentEventCh
        1. Get the persistentVolumeName associated with this attachment event.
        2. persistentVolumeName := *incomingEvent.Spec.Source.PersistentVolumeName
        3. Check if this persistent volume was being tracked: pvsWaitingForReattachments[persistentVolumeName] is present
        4. Check if the volume was attached to another node
        5. incomingEvent.Status.Attached && incomingEvent.Spec.NodeName != previousNodeName
        6. If above is true, then delete entry corresponding to persistentVolumeName from the pvsWaitingForReattachments map.
    2. if pvsWaitingForReattachments is empty break from the loop.
  4. Log that the volumes in podVolumeInfo.persistentVolumeList have been successfully re-attached and return nil (a sketch of this loop follows).
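
A sketch of the loop described above; not the verbatim source.

func (o *Options) waitForReattach(ctx context.Context, podVolumeInfo PodVolumeInfo,
	previousNodeName string, volumeAttachmentEventCh chan *storagev1.VolumeAttachment) error {
	pvsWaitingForReattachments := map[string]bool{}
	for _, pvName := range podVolumeInfo.persistentVolumeList {
		pvsWaitingForReattachments[pvName] = true
	}
	for len(pvsWaitingForReattachments) > 0 {
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for PVs to reattach to another node")
		case incomingEvent := <-volumeAttachmentEventCh:
			if incomingEvent.Spec.Source.PersistentVolumeName == nil {
				continue
			}
			pvName := *incomingEvent.Spec.Source.PersistentVolumeName
			if pvsWaitingForReattachments[pvName] &&
				incomingEvent.Status.Attached &&
				incomingEvent.Spec.NodeName != previousNodeName {
				// PV is now attached to a different node; stop tracking it.
				delete(pvsWaitingForReattachments, pvName)
			}
		}
	}
	klog.V(4).Infof("PVs %v successfully re-attached to another node", podVolumeInfo.persistentVolumeList)
	return nil
}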

drain.Options.waitForDelete

NOTE: Ideally should have been named waitForPodDisappearance

pkg/util/provider/drain.Options.waitForDelete is a helper method defined on drain.Options that leverages wait.PollImmediate and the getPodFn (get pod by name and namespace) and checks that all pods have disappeared within timeout. The set of pods that did not disappear within timeout is returned as pendingPods

func (o *Options) waitForDelete(
        pods []*corev1.Pod, interval, 
        timeout time.Duration,  
        getPodFn func(string, string) (*corev1.Pod, error)
    ) (pendingPods []*corev1.Pod, err error) 

drain.Options.RunDrain

Context: drain.Options.RunDrain is called from the MC helper method controller.drainNode which in turn is called from controller.triggerDeletionFlow when the machine.Status.LastOperation.Description contains operation machineutils.InitiateDrain.

If RunDrain returns an error, then the drain is retried at a later time by putting back the machine key into the queue. Unless the force-deletion label on the machine object is true - in which case we proceed to VM deletion.

func (o *Options) RunDrain(ctx context.Context) error
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD


GetNode["node, err = .client.CoreV1().Nodes().Get(ctx, o.nodeName, metav1.GetOptions{})"]
-->ChkGetNodeErr{err != nil}

ChkGetNodeErr--Yes-->ReturnNilEarly(("return nil
(case where deletion 
triggered during machine creation, 
so node is nil. 
TODO: should use apierrors.NotFound)"))

ChkGetNodeErr--No-->ChkNodeUnschedulable{node.Spec.Unschedulable?}

ChkNodeUnschedulable--Yes-->GetPodsForDeletion["
    pods, err := o.getPodsForDeletion(ctx)
    if err!=nil return err"]
ChkNodeUnschedulable--No-->CloneNode["clone := node.DeepCopy()
		clone.Spec.Unschedulable = true"]
        -->UpdateNode["_, err = o.client.CoreV1().Nodes().Update(ctx, clone, metav1.UpdateOptions{})
        if err != nil return err
        "]

UpdateNode-->GetPodsForDeletion
GetPodsForDeletion-->GetEvictionPGV["
    policyGroupVersion, err := SupportEviction(o.client)
    if err != nil return err
    "]
-->
DefineAttemptEvict["
attemptEvict := !o.ForceDeletePods && len(policyGroupVersion) > 0
// useless boolean which confuses matters considerably.
"]
-->
DefineGetPodFn["
getPodFn := func(namespace, name string) (*corev1.Pod, error) {
		return o.client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
}"]
-->
CreateReturnChannel["
    returnCh := make(chan error, len(pods))
	defer close(returnCh)
    "]
-->ChkForceDelPods{"o.ForceDeletePods?"}

ChkForceDelPods--Yes-->EvictPodsWithoutPV["go o.evictPodsWithoutPv(ctx, attemptEvict, pods, policyGroupVersion, getPodFn, returnCh)
// go-routine feels un-necessary here."]
ChkForceDelPods--No-->FilterPodsWithPv["
podsWithPv, podsWithoutPv := filterPodsWithPv(pods)
"]

FilterPodsWithPv-->EvictPodsWithPv["
go o.evictPodsWithPv(ctx, attemptEvict, podsWithPv, policyGroupVersion, getPodFn, returnCh)
"]
-->EvictPodsWithoutPV1["
	go o.evictPodsWithoutPv(ctx, attemptEvict, podsWithoutPv, policyGroupVersion, getPodFn, returnCh)	
"]-->CreateAggregateError

EvictPodsWithoutPV-->CreateAggregateError["
	var errors []error
	for i := 0; i < len(pods); i++ {
		err := <-returnCh
		if err != nil {
			errors = append(errors, err)
		}
	}
"]
-->ReturnAggError(("return\nerrors.NewAggregate(errors)"))

Notes:

  1. machine-controller-manager/pkg/util/provider/drain.SupportEviction uses the Discovery API to find out whether the server supports the eviction subresource; if so it returns the eviction groupVersion, otherwise "". (A hedged sketch follows these notes.)
    1. k8s.io/kubectl/pkg/drain.CheckEvictionSupport already does this.
  2. The attemptEvict boolean usage is confusing. Stick to drain.Options.ForceDeletePods.
  3. TODO: GAP? For cordoning a Node we currently just set Node.Spec.Unschedulable, but we are also supposed to set the taint node.kubernetes.io/unschedulable; cordoning via the spec field alone is supposed to be deprecated.
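
A hedged sketch of the discovery-based check referenced in note 1 (the same approach kubectl's CheckEvictionSupport takes); illustrative, not the exact SupportEviction code:

// Hedged sketch: discover whether the API server exposes the pods/eviction
// subresource and, if so, return the preferred policy group/version.
import "k8s.io/client-go/kubernetes"

func supportEvictionSketch(client kubernetes.Interface) (string, error) {
	discovery := client.Discovery()
	groups, err := discovery.ServerGroups()
	if err != nil {
		return "", err
	}
	var policyGroupVersion string
	for _, group := range groups.Groups {
		if group.Name == "policy" {
			policyGroupVersion = group.PreferredVersion.GroupVersion
			break
		}
	}
	if policyGroupVersion == "" {
		return "", nil // no policy group => no eviction support
	}
	// the eviction subresource hangs off core/v1 pods
	resources, err := discovery.ServerResourcesForGroupVersion("v1")
	if err != nil {
		return "", err
	}
	for _, resource := range resources.APIResources {
		if resource.Name == "pods/eviction" && resource.Kind == "Eviction" {
			return policyGroupVersion, nil
		}
	}
	return "", nil
}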

drain.Options.evictPodsWithoutPv

drain method that iterates through the given pods and, for each pod, launches a go-routine that delegates to Options.evictPodWithoutPVInternal.

func (o *Options) evictPodsWithoutPv(ctx context.Context,
	attemptEvict bool,
	pods []*corev1.Pod,
	policyGroupVersion string, // eviction API's GroupVersion
	getPodFn func(namespace, name string) (*corev1.Pod, error),
	returnCh chan error) {
	for _, pod := range pods {
		go o.evictPodWithoutPVInternal(ctx, attemptEvict, pod, policyGroupVersion, getPodFn, returnCh)
	}
}

drain.Options.evictPodWithoutPVInternal

drain method that either evicts or deletes a Pod, with retry handling until Options.MaxEvictRetries is reached.

func (o *Options) evictPodWithoutPVInternal(
    ctx context.Context, 
    attemptEvict bool, 
    pod *corev1.Pod, 
    policyGroupVersion string, 
    getPodFn func(namespace, name string) (*corev1.Pod, error), 
    returnCh chan error) 

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD


RangePod["pod := range pods"]
RangePod-->EvictOrDelPod["go o.evictPodWithoutPVInternal(ctx, attemptEvict, pod, policyGroupVersion, getPodFn, returnCh)"]
EvictOrDelPod-->Begin

subgraph "evictPodWithoutPVInternal (evicts or deletes Pod) "
Begin(("Begin"))-->SetRetry["retry := 0"]
SetRetry
-->SetAttemptEvict["if retry >= o.MaxEvictRetries {attemptEvict=false}"]
-->ChkAttemptEvict{"attemptEvict ?"}

ChkAttemptEvict--Yes-->EvictPod["err=o.evictPod(ctx, pod, policyGroupVersion)"]
ChkAttemptEvict--No-->DelPod["err=o.deletePod(ctx, pod)"]

EvictPod-->ChkErr
DelPod-->ChkErr

ChkErr{"Check err"}
ChkErr--"Nil"-->ChkForceDelPods
ChkErr--"IsTooManyRequests(err)"-->GetPdb["
    // Possible case where Pod couldn't be evicted because of PDB violation
    pdbs, err = pdbLister.GetPodPodDisruptionBudgets(pod)
    pdb = pdbs[0] if err == nil && len(pdbs) > 0
"]
ChkErr--"IsNotFound(err)\n(pod evicted)"-->SendNilChannel-->NilReturn
ChkErr--"OtherErr"-->SendErrChannel
GetPdb-->ChkMisConfiguredPdb{"isMisconfiguredPdb(pdb)?"}
ChkMisConfiguredPdb--Yes-->SetPdbError["err=fmt.Errorf('pdb misconfigured')"]
SetPdbError-->SendErrChannel

ChkMisConfiguredPdb--No-->SleepEvictRetryInterval["time.Sleep(PodEvictionRetryInterval)"]
SleepEvictRetryInterval-->IncRetry["retry+=1"]-->SetAttemptEvict


SendErrChannel-->NilReturn

ChkForceDelPods{"o.ForceDeletePods"}
ChkForceDelPods--"Yes\n(dont wait for\npod disappear)"-->SendNilChannel
ChkForceDelPods--No-->GetPodTermGracePeriod["
    // TODO: discuss this, shouldn't pod grace period override drain ?
    timeout=Min(pod.Spec.TerminationGracePeriodSeconds,o.Timeout)
"]
-->SetBufferPeriod["bufferPeriod := 30 * time.Second"]
-->WaitForDelete["pendingPods=o.waitForDelete(pods, timeout,getPodFn)"]
-->ChkWaitForDelError{err != nil ?}

ChkWaitForDelError--Yes-->SendErrChannel
ChkWaitForDelError--No-->ChkPendingPodsLength{"len(pendingPods) > 0?"}
ChkPendingPodsLength--Yes-->SetTimeoutError["err = fmt.Errorf('pod term timeout')"]
SetTimeoutError-->SendErrChannel

ChkPendingPodsLength--No-->SendNilChannel
end

SendNilChannel["returnCh <- nil"]
SendErrChannel["returnCh <- err"]
NilReturn(("return"))
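
The ChkAttemptEvict branch above boils down to calling the Eviction subresource when eviction is attempted and doing a plain pod delete otherwise. A hedged sketch (illustrative, not the actual evictPod/deletePod helpers):

// Hedged sketch of the evict-vs-delete decision: eviction goes through the
// Eviction subresource (and therefore honours PDBs); deletion bypasses PDBs.
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func evictOrDeletePodSketch(ctx context.Context, client kubernetes.Interface,
	pod *corev1.Pod, attemptEvict bool, policyGroupVersion string) error {
	if !attemptEvict {
		return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
	}
	eviction := &policyv1beta1.Eviction{
		TypeMeta:   metav1.TypeMeta{APIVersion: policyGroupVersion, Kind: "Eviction"},
		ObjectMeta: metav1.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name},
	}
	// A 429 (TooManyRequests) response here is how PDB violations surface.
	return client.PolicyV1beta1().Evictions(pod.Namespace).Evict(ctx, eviction)
}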


isMisconfiguredPdb

A PDB appears misconfigured when its status is up to date (Generation == ObservedGeneration) and all expected pods are healthy, yet it still allows zero disruptions (for example maxUnavailable: 0); such a PDB can never permit an eviction. TODO: Discuss/Elaborate further.

func isMisconfiguredPdbV1(pdb *policyv1.PodDisruptionBudget) bool {
	if pdb.ObjectMeta.Generation != pdb.Status.ObservedGeneration {
		return false
	}

	return pdb.Status.ExpectedPods > 0 &&
		pdb.Status.CurrentHealthy >= pdb.Status.ExpectedPods &&
		pdb.Status.DisruptionsAllowed == 0
}

drain.Options.evictPodsWithPv

func (o *Options) evictPodsWithPv(ctx context.Context, 
    attemptEvict bool, 
    pods []*corev1.Pod,
	policyGroupVersion string,
	getPodFn func(namespace, name string) (*corev1.Pod, error),
	returnCh chan error)

NOTE

  • See drain.Options.evictPodsWithPVInternal below
  • This method basically delegates to o.evictPodsWithPVInternal with retry handling
  • TODO: Logic of this method can do with some refactoring!
%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

Begin((" "))
-->SortPods["
    sortPodsByPriority(pods)
    //Desc priority: pods[i].Spec.Priority > *pods[j].Spec.Priority
"]
-->DoVolumeAccounting["
	podVolumeInfoMap := o.doAccountingOfPvs(ctx, pods)
"]
-->ChkAttemptEvict{attemptEvict ?}

ChkAttemptEvict--Yes-->RetryTillLimit["
	until MaxEvictRetries
"]
-->
InvokeHelper["
	remainingPods, aborted = o.evictPodsWithPVInternal(ctx, attemptEvict, pods, podVolumeInfoMap, policyGroupVersion,  returnCh)
"]
InvokeHelper-->ChkAbort{"
	aborted ||
	len(remainingPods) == 0
"}
ChkAbort--Yes-->RangeRemainingPods
ChkAbort--No-->Sleep["
	pods = remainingPods
	time.Sleep(PodEvictionRetryInterval)
"]
Sleep--loop-->RetryTillLimit

RetryTillLimit--loopend-->ChkRemaining{"len(remainingPods) > 0 && !aborted ?"}
ChkRemaining--Yes-->InvokeHelper1["
// force delete pods
	remainingPods, _ = o.evictPodsWithPVInternal(ctx, false, pods, podVolumeInfoMap, policyGroupVersion, getPodFn, returnCh)
"]

ChkAttemptEvict--No-->InvokeHelper1
InvokeHelper1-->RangeRemainingPods
ChkRemaining--No-->RangeRemainingPods

RangeRemainingPods["pod := range remainingPods"]
RangeRemainingPods--aborted?-->SendNil["returnCh <- nil"]
RangeRemainingPods--attemptEvict?-->SendEvictErr["returnCh <- fmt.Errorf('pod evict error')"]
RangeRemainingPods--else-->SendDelErr["returnCh <- fmt.Errorf('pod delete error')"]


SendNil-->NilReturn
SendEvictErr-->NilReturn
SendDelErr-->NilReturn
NilReturn(("return"))

drain.Options.evictPodsWithPVInternal

FIXME: name case inconsistency with evictPodsWithPv

drain.Options.evictPodsWithPVInternal is a drain helper method that actually evicts/deletes pods and waits for volume detachment. It returns a remainingPods slice and a fastTrack boolean that is meant to abort the pod eviction and exit the calling go-routine. (TODO: should be called abort or, even better, should use a custom error here)

func (o *Options) evictPodsWithPVInternal(ctx context.Context,
    attemptEvict bool, 
    pods []*corev1.Pod, 
    volMap map[string][]string,
	policyGroupVersion string,
	returnCh chan error
    ) (remainingPods []*corev1.Pod, fastTrack bool)
  1. Uses context.WithDeadline, passing in ctx and a deadline set after the drain timeout, to get a sub-context assigned to mainContext and a CancelFunc. Defers the obtained CancelFunc (so it is always invoked when the method terminates).
  2. Maintain a pod slice retryPods which is initially empty.
  3. Iterate through pods slice with i as index variable
    1. Apply a select with a single case:
      1. Check to see if mainContext is closed/cancelled. Attempt to read from the Done channel: <-mainContext.Done(). If this case matches:
        1. Send nil on the return error channel: returnCh <- nil
        2. Compute remainingPods as retryPods slice appended with pods yet to be iterated: pods[i+1:]...
        3. Return remainingPods, true. (aborted is true)
    2. Initiate the pod eviction start time: podEvictionStartTime=time.Now()
    3. Call volumeAttachmentHandler.AddWorker() to start tracking VolumeAttachments and obtain a volumeAttachmentEventCh receive channel on which attached or detached *storagev1.VolumeAttachment objects are delivered.
    4. If attemptEvict is true, then call evictPod else call deletePod helper method. Grab the err for eviction/deletion.
    5. If the eviction/deletion returned an error, analyze the err:
      1. If both attemptEvict is true and apierrors.IsTooManyRequests(err) is true, then this case is interpreted as an eviction failure due to PDB violation.
        1. We get the PodDisruptionBudget for the pod being iterated.
        2. We check whether it is misconfigured. If so, we send an error on returnCh, close the volumeAttachmentEventCh using volumeAttachmentHandler.DeleteWorker and continue with the next loop iteration, i.e. go to the next pod.
      2. If just apierrors.IsNotFound(err) is true, this means that Pod is already gone from the node. We send nil on returnCh and call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh) and continue with next pod in iteration.
      3. Otherwise we add the pod to the retryPod slice: retryPods = append(retryPods, pod), call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh) and continue with next pod in iteration.
      4. (NOTE: Error handling can be optimized; there is too much repetition.)
    6. Log that the evict/delete was successful.
    7. Get the PodVolumeInfo from volMap using the pod key.
    8. Obtain a context and cancellation function for volume detachment using context.WithTimeout, passing in the mainContext and a detach timeout computed as the pod's termination grace period (pod.Spec.TerminationGracePeriodSeconds if not nil) plus the PvDetachTimeout from the drain options.
    9. Invoke waitForDetach(ctx, podVolumeInfo, o.nodeName) and grab the err.
    10. Invoke the cancel function for detach. NOTE: this is not good; the sub-context should be created INSIDE waitForDetach with a defer for the cancellation.
    11. Analyze the detachment error.
      1. If apierrors.IsNotFound(err) is true this indicates that the node is not found.
        1. Send nil on returnCh
        2. Call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh)
        3. Compute remainingPods as retryPods slice appended with pods yet to be iterated: pods[i+1:]...
        4. Return remainingPods, true. (aborted is true)
      2. For other errors:
        1. Send the err on the returnCh.
        2. Call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh)
        3. Continue with next pod in iteration.
    12. Obtain a context and cancellation function for volume re-attachment using context.WithTimeout, passing in the mainContext and drain.Options.PvReattachTimeout.
    13. Invoke waitForReattach(ctx, podVolumeInfo, o.nodeName, volumeAttachmentEventCh) and grab the returned err.
    14. Invoke the cancel function for reattach. NOTE: this is not good; the sub-context should be created INSIDE waitForReattach.
    15. Analyze the re-attachment error.
      1. If err is a reattachment timeout error, just log a warning. TODO: Unclear why we don't send an error on the return channel here.
      2. Otherwise we Send the err on the returnCh.
      3. Call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh) and continue with next pod in iteration
    16. YAWN. Someone is very fond of calling this again and again. Call volumeAttachmentHandler.DeleteWorker(volumeAttachmentEventCh)
    17. Log the time taken for pod eviction+vol detachment+vol attachment to another node using time.Since(podEvictionStartTime).
    18. Send nil on returnCh
  4. pod iteration loop is done: return retryPods, false
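
The AddWorker/DeleteWorker calls above refer to the VolumeAttachmentHandler, which fans out VolumeAttachment events from an informer to per-pod worker channels. A hedged sketch of the fan-out idea (illustrative, not the actual implementation):

// Hedged sketch of a VolumeAttachment fan-out: each drain go-routine registers a
// worker channel, dispatch sends every incoming VolumeAttachment to all registered
// workers, and DeleteWorker unregisters and closes a channel.
import (
	"sync"

	storagev1 "k8s.io/api/storage/v1"
)

type volumeAttachmentFanOut struct {
	sync.Mutex
	workers []chan *storagev1.VolumeAttachment
}

func (f *volumeAttachmentFanOut) AddWorker() chan *storagev1.VolumeAttachment {
	f.Lock()
	defer f.Unlock()
	ch := make(chan *storagev1.VolumeAttachment, 20) // buffered so dispatch does not block for long
	f.workers = append(f.workers, ch)
	return ch
}

func (f *volumeAttachmentFanOut) DeleteWorker(target chan *storagev1.VolumeAttachment) {
	f.Lock()
	defer f.Unlock()
	for i, ch := range f.workers {
		if ch == target {
			f.workers = append(f.workers[:i], f.workers[i+1:]...)
			close(ch)
			return
		}
	}
}

func (f *volumeAttachmentFanOut) dispatch(va *storagev1.VolumeAttachment) {
	f.Lock()
	defer f.Unlock()
	for _, ch := range f.workers {
		ch <- va
	}
}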

Orphan / Safety Jobs

Read MCM FAQ: What is Safety Controller in MCM

These are jobs that run periodically by pushing dummy keys onto their respective work-queues; the workers then pick up the keys and dispatch them to the reconcile functions.

    worker.Run(c.machineSafetyOrphanVMsQueue, "ClusterMachineSafetyOrphanVMs", 
    worker.DefaultMaxRetries, 
    true, c.reconcileClusterMachineSafetyOrphanVMs, stopCh, &waitGroup)

    worker.Run(c.machineSafetyAPIServerQueue, "ClusterMachineAPIServer", 
    worker.DefaultMaxRetries, true, c.reconcileClusterMachineSafetyAPIServer, 
    stopCh, &waitGroup)

reconcileClusterMachineSafetyOrphanVMs

This technically isn't a reconciliation loop; it is effectively just a job.

%%{init: {'themeVariables': { 'fontSize': '10px'}, "flowchart": {"useMaxWidth": false }}}%%
flowchart TD

Begin((" "))
-->RescheduleJob["
// Schedule Rerun
defer c.machineSafetyOrphanVMsQueue.AddAfter('', c.safetyOptions.MachineSafetyOrphanVMsPeriod.Duration)
"]-->
GetMC["Get Machine Classes and iterate"]
-->forEach["
for each machineClass (and its secret)
"]
-->ListDriverMachines["
listMachineResp := driver.ListMachines(...)
"]

ListDriverMachines-->MachinesExist{
    len(listMachineResp.MachineList) > 0 ?
}
MachinesExist-->|yes| SyncCache["cache.WaitForCacheSync(stopCh, machineInformer.Informer().HasSynced)"]

SyncCache-->IterMachine["Iterate machineID,machineName in listMachineResp.MachineList"]
-->GetMachine["machine, err := Get Machine from Lister"]
-->ChkErr{Check err}

ChkErr-->|err NotFound|DelMachine
ChkErr-->|err nil|ChkMachineId

ChkMachineId{
    machine.Spec.ProviderID == machineID 
    OR
    machinePhase is empty or CrashLoopBackoff ?}
ChkMachineId-->|yes|IterMachine
ChkMachineId-->|no|DelMachine["
driver.DeleteMachine(ctx, &driver.DeleteMachineRequest{
Machine:      machine, secretData...})
"]

DelMachine--iterDone-->ScheduleRerun
MachinesExist-->|no| ReturnLongRetry["retryPeriod := machineutils.LongRetry"]

-->ScheduleRerun["
c.machineSafetyOrphanVMsQueue.AddAfter('', time.Duration(retryPeriod))
"]

Machine Controller Manager

The Machine Controller Manager handles reconciliation of MachineDeployment and MachineSet objects.

Ideally this should be called the machine-deployment-controller, but the current name is a legacy holdover from when all controllers were in one project module.

The Machine Controller Manager Entry Point is at github.com/gardener/machine-controller-manager/cmd/machine-controller-manager/controller_manager.go

MCM Launch

Reconcile Cluster Machine Set

A MachineSet is to a Machine what a ReplicaSet is to a Pod. A MachineSet ensures that the specified number of Machines are running at any given time.

A MachineSet is rarely created directly. It is generally owned by its parent MachineDeployment, and its ObjectMeta.OwnerReferences slice has a reference to the parent deployment.

The MCM controller method reconcileClusterMachineSet is invoked with keys of objects retrieved from the machineSetQueue, as shown below.

worker.Run(c.machineSetQueue, 
"ClusterMachineSet", 
worker.DefaultMaxRetries, true, c.reconcileClusterMachineSet, stopCh, &waitGroup)

The following is the flow diagram for func (c *controller) reconcileClusterMachineSet(key string) error. As can be observed, it could be optimized. For any error below, the MachineSet key is added back to the machineSetQueue according to the default rate limiting.

%%{init: { 'themeVariables': { 'fontSize': '11px'},"flowchart": {"defaultRenderer": "elk"}} }%%
flowchart TD

Begin((" "))
-->GetMachineSet["machineSet=Get MS From Lister"]
-->ValidateMS["validation.ValidateMachineSet(machineSet)"]
-->ChkDeltimestamp1{"machineSet.DeletionTimestamp?"}

ChkDeltimestamp1-->|no| AddFinalizersIMissing["addFinalizersIfMissing(machineSet)"]
ChkDeltimestamp1-->|yes| GetAllMS["allMachines = list all machines in namespace"]-->
GetMSSelector["selector = LabelSelectorAsSelector(machineSet.Spec.Selector)"]
-->ClaimMachines["claimedMachines=claimMachines(machineSet, selector, allMachines)"]
-->SyncNT["synchronizeMachineNodeTemplates(claimedMachines, machineSet)"]
-->SyncMC["syncMachinesConfig(claimedMachines, machineSet)"]
-->SyncMCK["syncMachinesClassKind(claimedMachines, machineSet)"]
-->ChkDeltimestamp2{"machineSet.DeletionTimestamp?"}-->|no| ScaleUpDown

ChkDeltimestamp2-->|yes| ChkClaimedMachinesLen{"len(claimedMachines) == 0?"}
ChkClaimedMachinesLen-->|yes| DelMSFinalizers["delFinalizers(machineSet)"]
ChkClaimedMachinesLen-->|no| TermMachines["terminateMachines(claimedMachines,machineSet)"]-->CalcMSStatus

DelMSFinalizers-->CalcMSStatus
ScaleUpDown["manageReplicas(claimedMachines) // scale out/in machines"]
-->CalcMSStatus["calculateMachineSetStatus(claimedMachines, machineSet, errors)"]
-->UpdateMSStatus["updateMachineSetStatus(...)"]
-->enqueueMachineSetAfter["machineSetQueue.AddAfter(msKey, 10m)"]

AddFinalizersIMissing-->GetAllMS

claimMachines

claimMachines tries to take ownership of machines: it associates a Machine with a MachineSet by setting machine.metadata.OwnerReferences, and releases the Machine if the MachineSet's deletion timestamp has been set.

  1. Initialize an empty claimedMachines []Machine slice
  2. Initialize an empty errList []error
  3. Iterate through allMachines and Get the ownerRef(the first element in OwnerReferences slice)
  4. If the ownerRef is not nil
    1. If the ownerRef.UID differs from the machineSet's UID, skip the claim and continue (since the machine belongs to another machine set).
    2. If the machine selector matches the labels of the machineSet, add to claimedMachines and continue
    3. If the machineSet.DeletionTimestamp is set, skip and continue
    4. Release the Machine by removing its ownerReference
  5. If the ownerRef is nil
    1. If the machineSet.DeletionTimestamp is set or if the machine selector does not match the machineSet's labels, skip and continue.
    2. If the machine.DeletionTimestamp is set, skip and continue.
    3. Adopt the machine, ie. set the ownerReference to the machineSet and add to claimedMachines
    ownerReferences:
     - apiVersion: machine.sapcloud.io/v1alpha1
       blockOwnerDeletion: true
       controller: true
       kind: MachineSet
       name: shoot--i034796--aw2-a-z1-8c99f
       uid: 20bc03c5-e95b-4df5-9faf-68be38cb8e1b
    
  6. Return claimedMachines.

synchronizeMachineNodeTemplates

func (c *controller) syncMachinesNodeTemplates(ctx context.Context, 
 claimedMachines []*Machine, machineSet *MachineSet) error 
  1. This iterates through the claimedMachines and copies the machineset.Spec.Template.Spec.NodeTemplateSpec to the machine.Spec.NodeTemplateSpec
  2. NOTE: Seems useless IO busy-work to me. When MC launches the Machine, it might as well access the owning MachineSet and get the NodeSpec.
  3. The only reason to do this is to support independent Machines without owning MachineSets. We will need to see whether such a use-case is truly needed.

NOTE: NodeTemplate describes common resource capabilities like cpu, gpu, memory, etc in terms of k8s.io/api/core/v1.ResourceList. This is used by the cluster-autoscaler for scaling decisions.

syncMachinesConfig

Copies machineset.Spec.Template.Spec.MachineConfiguration to machine.Spec.MachineConfiguration for all claimedMachines.

See MachineConfiguration inside MachineSpec

syncMachinesClassKind

NOTE: This is useless and should be removed since we only have ONE kind of MachineClass. TODO: Discuss with Himanshu/Rishabh.

func (c *controller) syncMachinesClassKind(ctx context.Context, 
    claimedMachines []*Machine, machineSet *MachineSet) error 

Iterates through claimedMachines and sets machine.Spec.Class.Kind = machineset.Spec.Template.Spec.Class.Kind if not already set.

manageReplicas (scale-out / scale-in)

func (c *controller) manageReplicas(ctx context.Context, 
    claimedMachines []Machine, machineSet *MachineSet) error
%%{init: { 'themeVariables': { 'fontSize': '11px'},"flowchart": {"defaultRenderer": "elk"}} }%%
flowchart TD

Begin((" "))
-->Init["activeMachines :=[], staleMachines:=[]"]
-->IterCLaimed["machine := range claimedMachines"]
--loop-->IsActiveOrFailed{"IsMachineActiveOrFailed(machine)"}

IsActiveOrFailed-->|active| AppendActive["append(activeMachines,machine)"]
IsActiveOrFailed-->|failed| AppendFailed["append(staleMachines,machine)"]

IterCLaimed--done-->TermStaleMachines["terminateMachines(staleMachines,machineSet)"]

TermStaleMachines-->Delta["diff := len(activeMachines) - machineSet.Spec.Replicas"]
Delta-->ChkDelta{"diff < 0?"}

ChkDelta-->|yes| ScaleOut["numCreated:=slowStartBatch(-diff,..) // scale out"]
ScaleOut-->Log["Log numCreated/skipped/deleted"]
ChkDelta-->|no| GetMachinesToDelete["machinesToDel := getMachinesToDelete(activeMachines, diff)"]
GetMachinesToDelete-->TermMachines["terminateMachines(machinesToDel, machineSet)"]
-->Log-->ReturnErr["return err"]

terminateMachines

func (c *controller) terminateMachines(ctx context.Context, 
    inactiveMachines []*Machine, machineSet *MachineSet) error {
  1. Invokes controlMachineClient.Machines(namespace).Delete(ctx, machineID,..) for each Machine in inactiveMachines and records an event.
  2. The machine.Status.Phase is also set to Terminating.
  3. This is done in parallel using go-routines and a WaitGroup sized to the number of inactiveMachines (see the sketch below).
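
A hedged sketch of the fan-out/wait pattern, assuming the generated MCM clientset (versioned.Interface); setting the phase to Terminating is omitted:

// Hedged sketch: delete all inactive Machines concurrently and wait for every
// deletion to complete, surfacing the first error (if any).
import (
	"context"
	"sync"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
	"github.com/gardener/machine-controller-manager/pkg/client/clientset/versioned"
)

func terminateMachinesSketch(ctx context.Context, client versioned.Interface,
	namespace string, inactiveMachines []*v1alpha1.Machine) error {
	var wg sync.WaitGroup
	errCh := make(chan error, len(inactiveMachines))
	wg.Add(len(inactiveMachines))
	for _, m := range inactiveMachines {
		go func(machine *v1alpha1.Machine) {
			defer wg.Done()
			err := client.MachineV1alpha1().Machines(namespace).
				Delete(ctx, machine.Name, metav1.DeleteOptions{})
			if err != nil && !apierrors.IsNotFound(err) {
				errCh <- err
			}
		}(m)
	}
	wg.Wait()
	close(errCh)
	for err := range errCh {
		return err
	}
	return nil
}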

slowStartBatch

func slowStartBatch(count int, initialBatchSize int, createFn func() error) (int, error)
  1. Initializes remaining to count and successes as 0.
  2. The method executes fn (which creates a Machine object) in parallel batches of go-routines, starting with batchSize := initialBatchSize and doubling the batch size after each successful batch (see the sketch after this list).
    1. For each batch iteration, a wg sync.WaitGroup is constructed with batchSize. Each batch execution waits for batch to be complete using wg.Wait()
    2. For each batch iteration, an errCh is constructed with size as batchSize
    3. batchSize go-routines execute fn concurrently, sending errors on errCh and invoking wg.Done() when complete.
    4. numErrorsInBatch = len(errCh)
    5. successes is incremented by batchSize minus numErrorsInBatch
    6. if numErrorsInBatch > 0, abort, returning successes and first error from errCh
    7. remaining is decremented by the batchSize
    8. Compute batchSize as Min(remaining, 2*batchSize)
    9. Continue iteration while batchSize is greater than 0.
    10. Return successes, nil when done.
  3. fn is a lambda that creates a new Machine in which we do the below:
    1. Create an ownerRef with the machineSet.Name and machineSet.UID
    2. Get the machine spec template using machineSet.Spec.Template
    3. Then create a Machine obj setting the machine spec and ownerRef. Use the machineSet name as the prefix for GenerateName in the ObjectMeta.
    4. If any err return the same or nil if no error.
    5. New Machine objects are persisted using controlMachineClient.Machines(namespace).Create(ctx, machine, createOpts)
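
A hedged sketch of the slow-start batching described above (the same pattern as the upstream ReplicaSet controller's slowStartBatch); minInt is a local helper:

// Hedged sketch: run createFn in exponentially growing batches, stopping at the
// first batch that reports an error and returning the number of successes so far.
import "sync"

func slowStartBatchSketch(count, initialBatchSize int, createFn func() error) (int, error) {
	remaining := count
	successes := 0
	for batchSize := minInt(remaining, initialBatchSize); batchSize > 0; batchSize = minInt(remaining, 2*batchSize) {
		errCh := make(chan error, batchSize)
		var wg sync.WaitGroup
		wg.Add(batchSize)
		for i := 0; i < batchSize; i++ {
			go func() {
				defer wg.Done()
				if err := createFn(); err != nil {
					errCh <- err
				}
			}()
		}
		wg.Wait()
		numErrors := len(errCh)
		successes += batchSize - numErrors
		if numErrors > 0 {
			return successes, <-errCh
		}
		remaining -= batchSize
	}
	return successes, nil
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}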

Reconcile Cluster Machine Deployment

func (dc *controller) reconcileClusterMachineDeployment(key string) error 
  • Gets the deployment name.
  • Gets the MachineDeployment
  • TODO: WEIRD: freeze labels and deletion timestamp
  • TODO: unclear why we do this
	// Resync the MachineDeployment after 10 minutes to avoid missing out on missed out events
	defer dc.enqueueMachineDeploymentAfter(deployment, 10*time.Minute)
  • Add finalizers if deletion time stamp is nil
  • TODO: Why is observed generation only updated conditionally in the code below? Shouldn't it always be updated?
everything := metav1.LabelSelector{}
	if reflect.DeepEqual(d.Spec.Selector, &everything) {
		dc.recorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all machines. A non-empty selector is required.")
		if d.Status.ObservedGeneration < d.Generation {
			d.Status.ObservedGeneration = d.Generation
			dc.controlMachineClient.MachineDeployments(d.Namespace).UpdateStatus(ctx, d, metav1.UpdateOptions{})
		}
		return nil
	}
  • Get []*v1alpha1.MachineSet for this deployment using getMachineSetsForMachineDeployment and assign to machineSets
  • if deployment.DeletionTimestamp != nil
    • if there are no finalizers on deployment return nil
    • if len(machineSets) == 0 delete the machine deployment finalizers and return nil
    • Call dc.terminateMachineSets(ctx, machineSets)

Rollout Rolling

func (dc *controller) rolloutRolling(ctx context.Context, 
    d *v1alpha1.MachineDeployment, 
    msList []*v1alpha1.MachineSet, 
    machineMap map[types.UID]*v1alpha1.MachineList) error 

1. Get new machine set corresponding to machine deployment and old machine sets

newMS, oldMSs, err := dc.getAllMachineSetsAndSyncRevision(ctx, d, msList, machineMap, true)
allMSs := append(oldMSs, newMS)

2. Taint the nodes backing the old machine sets.

This is a preference: the k8s scheduler will try to avoid placing a pod that does not tolerate the taint on the node. Q: Why don't we use NoSchedule instead? Any pods scheduled on this node will need to be drained later, which is more work.

dc.taintNodesBackingMachineSets(
		ctx,
		oldMSs, &v1.Taint{
			Key:    PreferNoScheduleKey,
			Value:  "True",
			Effect: "PreferNoSchedule",
		},
	)
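
A hedged sketch of what tainting a single backing node involves; illustrative only, the real taintNodesBackingMachineSets iterates over the machine sets, their machines and the corresponding nodes:

// Hedged sketch: add a taint to a node if an equivalent taint is not already present.
import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func addNodeTaintSketch(ctx context.Context, client kubernetes.Interface,
	nodeName string, taint v1.Taint) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Effect == taint.Effect {
			return nil // node already carries the taint
		}
	}
	clone := node.DeepCopy()
	clone.Spec.Taints = append(clone.Spec.Taints, taint)
	_, err = client.CoreV1().Nodes().Update(ctx, clone, metav1.UpdateOptions{})
	return err
}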

3. Add AutoScaler Scale-Down annotations to Nodes of Old Machine Sets

  1. Create the map. (TODO: Q: Why do we add 2 annotations ?)
     clusterAutoscalerScaleDownAnnotations := make(map[string]string)
     clusterAutoscalerScaleDownAnnotations["cluster-autoscaler.kubernetes.io/scale-down-disabled"]="true"
     clusterAutoscalerScaleDownAnnotations["cluster-autoscaler.kubernetes.io/scale-down-disabled-by-mcm"]="true"
    
  2. Call annotateNodesBackingMachineSets(ctx, allMSs, clusterAutoscalerScaleDownAnnotations)

4. Reconcile New Machine Set by calling reconcileNewMachineSet

	scaledUp, err := dc.reconcileNewMachineSet(ctx, allMSs, newMS, d)
func (dc *controller) reconcileNewMachineSet(ctx context.Context, 
    allMSs []*v1alpha1.MachineSet, 
    newMS *v1alpha1.MachineSet, 
    deployment *v1alpha1.MachineDeployment) 
    (bool, error) 
  1. If newMS.Spec.Replicas == deployment.Spec.Replicas, return.
  2. If newMS.Spec.Replicas > deployment.Spec.Replicas, we need to scale down: call dc.scaleMachineSet(ctx, newMS, deployment.Spec.Replicas, "down").
  3. Compute newReplicasCount using NewMSNewReplicas(deployment, allMSs, newMS).
  4. Call dc.scaleMachineSet(ctx, newMS, newReplicasCount, "up")

Helper Methods

scaleMachineSet

func (dc *controller) scaleMachineSet(ctx context.Context, 
    ms *v1alpha1.MachineSet, 
    newScale int32, 
    deployment *v1alpha1.MachineDeployment, 
    scalingOperation string) 
        (bool, *v1alpha1.MachineSet, error) {
sizeNeedsUpdate := (ms.Spec.Replicas) != newScale
}

TODO: fill me in.

Get Machine Sets for Machine Deployment

func (dc *controller) getMachineSetsForMachineDeployment(ctx context.Context, 
        d *v1alpha1.MachineDeployment) 
    ([]*v1alpha1.MachineSet, error) 
  • Get all machine sets using machine set lister.
  • NewMachineSetControllerRefManager unclear

Get Machine Map for Machine Deployment

Returns a map from MachineSet UID to a list of Machines controlled by that MS, according to the Machine's ControllerRef.

func (dc *controller) 
    getMachineMapForMachineDeployment(d *v1alpha1.MachineDeployment, 
        machineSets []*v1alpha1.MachineSet) 
     (map[types.UID]*v1alpha1.MachineList, error) {

Terminate Machine Sets of Machine Deployment


func (dc *controller) terminateMachineSets(ctx context.Context, machineSets []*v1alpha1.MachineSet) 

Sync Deployment Status

func (dc *controller) syncStatusOnly(ctx context.Context, 
    d *v1alpha1.MachineDeployment, 
    msList []*v1alpha1.MachineSet, 
    machineMap map[types.UID]*v1alpha1.MachineList) error 

Gets New and Old MachineSets and Sync Revision

func (dc *controller) getAllMachineSetsAndSyncRevision(ctx context.Context, 
    d *v1alpha1.MachineDeployment, 
    msList []*v1alpha1.MachineSet, 
    machineMap map[types.UID]*v1alpha1.MachineList, 
    createIfNotExisted bool) 
        (*v1alpha1.MachineSet, []*v1alpha1.MachineSet, error) 

Overview

getAllMachineSetsAndSyncRevision does the following:

  1. Get all old MachineSets that the MachineDeployment d targets, and calculate the max revision number among them (maxOldV).
  2. Get the new MachineSet this deployment targets (i.e. the one whose machine template matches the deployment's) and update the new machine set's revision number to (maxOldV + 1). This is done only if its revision number is smaller than (maxOldV + 1). If this step fails, we'll update it in the next deployment sync loop.
  3. Copy the new MachineSet's revision number to the MachineDeployment (update the deployment's revision). If this step fails, we'll update it in the next deployment sync loop.

Detail

  • TODO: describe me

Annotate Nodes Backing Machine Sets

func (dc *controller) annotateNodesBackingMachineSets(
    ctx context.Context, 
    machineSets []*v1alpha1.MachineSet, 
    annotations map[string]string) error 
  1. Iterate through the machineSets. Loop variable: machineSet
  2. List all the machines. TODO: EXPENSIVE ??
     allMachines, err := dc.machineLister.List(labels.Everything())
  3. Get the Selector for the MachineSet
     selector, err := metav1.LabelSelectorAsSelector(machineSet.Spec.Selector)
  4. Claim the Machines for the given machineSet using the selector
     filteredMachines, err = dc.claimMachines(ctx, machineSet, selector, allMachines)
  5. Iterate through filteredMachines, loop variable: machine, and if Node is not empty, add or update annotations on the node.
     if machine.Status.Node != "" {
         err = AddOrUpdateAnnotationOnNode(
             ctx,
             dc.targetCoreClient,
             machine.Status.Node,
             annotations,
         )
     }

Claim Machines

Basically sets or unsets the owner reference of the machines matching selector to the deployment controller.

func (c *controller) claimMachines(ctx context.Context, 
    machineSet *v1alpha1.MachineSet, 
    selector labels.Selector, 
    allMachines []*v1alpha1.Machine) 
    ([]*v1alpha1.Machine, error) {

TODO: delegates to MachineControllerRefManager.claimMachines

Sets or un-sets the owner reference of the machine object to the deployment controller.

Summary

  • Iterates through allMachines and checks whether the selector matches the machine labels: m.Selector.Matches(labels.Set(machine.Labels))
  • Gets the controllerRef of the machine using metav1.GetControllerOf(machine)
  • If the controllerRef is not nil and controllerRef.UID matches the claiming controller's UID, the machine is already owned and is kept in the claimed set (it is released if the selector no longer matches).
  • If the controllerRef is nil and the selector matches, this is an adoption: AdoptMachine is called, which patches the machine's owner reference using the below:
addControllerPatch := fmt.Sprintf(
		`{"metadata":{"ownerReferences":[{"apiVersion":"machine.sapcloud.io/v1alpha1","kind":"%s","name":"%s","uid":"%s","controller":true,"blockOwnerDeletion":true}],"uid":"%s"}}`,
		m.controllerKind.Kind,
		m.Controller.GetName(), m.Controller.GetUID(), machine.UID)
	err := m.machineControl.PatchMachine(ctx, machine.Namespace, machine.Name, []byte(addControllerPatch))

Helper Functions

Compute New Machine Set New Replicas

NewMSNewReplicas calculates the number of replicas a deployment's new machine set should have. Scale-up stops when one of the following holds (a hedged sketch follows the steps below):

  1. The new MS is saturated: newMS's replicas == deployment's replicas
  2. The max number of machines allowed is reached: deployment's replicas + maxSurge == allMSs' replicas
func NewMSNewReplicas(deployment *v1alpha1.MachineDeployment, 
    allMSs []*v1alpha1.MachineSet, 
    newMS *v1alpha1.MachineSet) (int32, error) 
    // MS was called IS earlier (instance set)
  1. Get the maxSurge
maxSurge, err = intstr.GetValueFromIntOrPercent(
    deployment.Spec.Strategy.RollingUpdate.MaxSurge,
    int(deployment.Spec.Replicas),
    true
)
  2. Compute the currentMachineCount: iterate through all machine sets and sum up machineset.Status.Replicas
  3. maxTotalMachines = deployment.Spec.Replicas + maxSurge
  4. if currentMachineCount >= maxTotalMachines return newMS.Spec.Replicas // cannot scale up.
  5. Compute scaleUpCount := maxTotalMachines - currentMachineCount
  6. Make sure scaleUpCount does not exceed the desired deployment replicas: scaleUpCount = int32(integer.IntMin(int(scaleUpCount), int(deployment.Spec.Replicas - newMS.Spec.Replicas)))
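
A hedged sketch of the computation above, assuming a RollingUpdate strategy (field names follow the MCM v1alpha1 API):

// Hedged sketch of NewMSNewReplicas: cap the new MachineSet's replicas by maxSurge
// and by the deployment's desired replica count.
import (
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/utils/integer"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

func newMSNewReplicasSketch(deployment *v1alpha1.MachineDeployment,
	allMSs []*v1alpha1.MachineSet, newMS *v1alpha1.MachineSet) (int32, error) {
	maxSurge, err := intstr.GetValueFromIntOrPercent(
		deployment.Spec.Strategy.RollingUpdate.MaxSurge,
		int(deployment.Spec.Replicas), true)
	if err != nil {
		return 0, err
	}
	var currentMachineCount int32
	for _, ms := range allMSs {
		currentMachineCount += ms.Status.Replicas
	}
	maxTotalMachines := deployment.Spec.Replicas + int32(maxSurge)
	if currentMachineCount >= maxTotalMachines {
		return newMS.Spec.Replicas, nil // cannot scale up further
	}
	scaleUpCount := maxTotalMachines - currentMachineCount
	// never exceed the deployment's desired replica count for the new MachineSet
	scaleUpCount = int32(integer.IntMin(int(scaleUpCount),
		int(deployment.Spec.Replicas-newMS.Spec.Replicas)))
	return newMS.Spec.Replicas + scaleUpCount, nil
}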

🚧 WIP at the moment. Lots more material to be added here from notes. Please do not read presently.

Issues

This section is very basic WIP atm. Please check after this warning has been removed. Lots more to be added here from notes and appropriately structured.

Design Issues

Bad Packaging

  • package controller is inside import path github.com/gardener/machine-controller-manager/pkg/util/provider/machinecontroller

LastOperation is actually Next Operation

Badly named. TODO: Describe more.

Description misused

Error-prone code like the snippet below results from misuse of the Description field.

// isMachineStatusSimilar checks if the status of 2 machines is similar or not.
func isMachineStatusSimilar(s1, s2 v1alpha1.MachineStatus) bool {
	s1Copy, s2Copy := s1.DeepCopy(), s2.DeepCopy()
	tolerateTimeDiff := 30 * time.Minute

	// Since lastOperation hasn't been updated in the last 30minutes, force update this.
	if (s1.LastOperation.LastUpdateTime.Time.Before(time.Now().Add(tolerateTimeDiff * -1))) || (s2.LastOperation.LastUpdateTime.Time.Before(time.Now().Add(tolerateTimeDiff * -1))) {
		return false
	}

	if utilstrings.StringSimilarityRatio(s1Copy.LastOperation.Description, s2Copy.LastOperation.Description) > 0.75 {
		// If strings are similar, ignore comparison
		// This occurs when cloud provider errors repeats with different request IDs
		s1Copy.LastOperation.Description, s2Copy.LastOperation.Description = "", ""
	}

	// Avoiding timestamp comparison
	s1Copy.LastOperation.LastUpdateTime, s2Copy.LastOperation.LastUpdateTime = metav1.Time{}, metav1.Time{}
	s1Copy.CurrentStatus.LastUpdateTime, s2Copy.CurrentStatus.LastUpdateTime = metav1.Time{}, metav1.Time{}

	return apiequality.Semantic.DeepEqual(s1Copy.LastOperation, s2Copy.LastOperation) && apiequality.Semantic.DeepEqual(s1Copy.CurrentStatus, s2Copy.CurrentStatus)
}

Gaps

TODO: Not comprehensive. Lots more to be added here

Dead/Deprecated Code

controller.triggerUpdationFlow

This is unused and appears to be dead code.

SafetyOptions.MachineDrainTimeout

This field is commented as deprecated but is still registered in MCServer.AddFlags and used in the launch scripts of individual providers. Example:

go run
cmd/machine-controller/main.go
...
--machine-drain-timeout=5m

Dup Code

drainNode Handling

  1. Does not set err when c.machineStatusUpdate is called
  2. o.RunCordonOrUncordon should use apierrors.IsNotFound while checking the error returned by a get node op
  3. attemptEvict bool usage is confusing. Better design needed. attemptEvict is overridden in evictPodsWithoutPv.
  4. Misleading deep copy in drain.Options.doAccountingOfPvs
    for podKey, persistentVolumeList := range pvMap {
     	persistentVolumeListDeepCopy := persistentVolumeList
     	//...
    

Node Conditions

  • CloneAndAddCondition logic seems erroneous ?

VolumeAttachment

func (v *VolumeAttachmentHandler) dispatch(obj interface{}) {
	//...
	volumeAttachment := obj.(*storagev1.VolumeAttachment)
	if volumeAttachment == nil {
		klog.Errorf("Couldn't convert to volumeAttachment from object %v", obj)
		// Should return here.
	}
	//...
}

Dead? reconcileClusterNodeKey

This just delegates to reconcileClusterNode, which does nothing:

func (c *controller) reconcileClusterNode(node *v1.Node) error {
	return nil
}

Dead? machine.go | triggerUpdationFlow

Can't find usages

Duplicate Initialization of EventRecorder in MC

pkg/util/provider/app.createRecorder already does this, as shown below.

func createRecorder(kubeClient *kubernetes.Clientset) record.EventRecorder {
	eventBroadcaster := record.NewBroadcaster()
	eventBroadcaster.StartLogging(klog.Infof)
	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: v1core.New(kubeClient.CoreV1().RESTClient()).Events("")})
	return eventBroadcaster.NewRecorder(kubescheme.Scheme, v1.EventSource{Component: controllerManagerAgentName})
}

We get the recorder from this eventBroadcaster and then pass it to the pkg/util/provider/machinecontroller/controller.NewController method which again does:

	eventBroadcaster := record.NewBroadcaster()
	eventBroadcaster.StartLogging(klog.Infof)
	eventBroadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: typedcorev1.New(controlCoreClient.CoreV1().RESTClient()).Events(namespace)})

The above is useless.

Q? Internal to External Scheme Conversion

Why do we do this ?

internalClass := &machine.MachineClass{}
	err := c.internalExternalScheme.Convert(class, internalClass, nil)
	if err != nil {
		return err
	}