Kubernetes Overview

Components, Design Philosophy, and Related Papers


Posted on Mon, Feb 2, 2026
Tags kubernetes, containers, cloud-computing, cowork-with-llm
📝 This article is a translation of the original Japanese post.

What is Kubernetes?

Kubernetes (k8s) is a platform that keeps an entire system in its desired state using declarative APIs and continuous reconciliation.

When you start exploring individual features, terms like Deployment, Service, Ingress, CNI, CSI pile up and it’s easy to lose sight of the big picture. This article summarizes the overall history and architecture of Kubernetes.

  • Kubernetes is a system that declares state through an API and lets a set of controllers converge toward it
  • The core lies in the collaboration between kube-apiserver, etcd (storage), controller, scheduler, and kubelet
  • Execution, communication, and persistence are abstracted by interfaces such as CRI / CNI / CSI
  • The design philosophy is grounded in large-scale cluster management knowledge from Borg / Omega

The responsibility breakdown of major components can be organized as follows:

flowchart LR
  desired["Desired State"] --> apiserver["kube-apiserver\nState entry point"]
  apiserver --> etcd["etcd\nState storage"]
  apiserver --> controller["controller\nClose the gap"]
  apiserver --> scheduler["scheduler\nDecide placement"]
  scheduler --> kubelet["kubelet\nMaterialize on node"]
  controller --> kubelet

In other words, the essence of Kubernetes is less about “executing containers” itself and more about the state management and reconciliation mechanism.

Overall Processing Flow

First, let’s follow a typical request flow end-to-end.

  1. User sends a request to kube-apiserver via kubectl / API client
  2. kube-apiserver passes authentication/authorization/Admission and persists to etcd
  3. Controller detects the diff via Watch and creates/fixes what’s missing (reconciliation)
  4. Scheduler assigns a Node to unscheduled Pods (Binding)
  5. kubelet reads the PodSpec, creates containers, calls CNI/CSI to bring them to running state
  6. kube-proxy (or an eBPF implementation) configures Service routing, and add-ons such as CoreDNS assist with name resolution

What’s important here is that Kubernetes is not a system that executes something once and stops. It writes state, and then controllers and kubelet continuously monitor it, fixing any drift.

Because of this continuous control, even if a Pod crashes it gets recreated, and even if a node fails the workload gets rescheduled to another node.

Eventual Consistency

Kubernetes adopts an eventual consistency model, aiming for the overall system state to converge in the end rather than immediately. This means that even with partial failures or delays, the system ultimately converges to the desired state.

However, this is a Kubernetes design principle, not a property automatically satisfied by all controller implementations. Especially in custom controllers / operators, carelessly introducing side effects to external APIs, non-idempotent updates, or duplicate executions during retries can produce an implementation that doesn’t converge in an eventually consistent way. Therefore, when writing controllers, you need to be mindful of “does each reconcile move closer to the same desired state no matter how many times it’s called?”
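The idempotency requirement above can be sketched in a few lines. This is a minimal toy model, not a real controller framework: the names and the simulated cluster are purely illustrative.

```python
# Minimal sketch of an idempotent reconcile step (illustrative names,
# not part of any real controller framework).

def reconcile(desired_replicas: int, cluster: dict) -> None:
    """Move the (simulated) cluster toward the desired replica count.

    Safe to call any number of times: each call only closes the
    remaining gap, so duplicate or retried invocations converge to
    the same state instead of over-creating.
    """
    observed = cluster["replicas"]
    if observed < desired_replicas:
        cluster["replicas"] += 1   # create one missing Pod
    elif observed > desired_replicas:
        cluster["replicas"] -= 1   # remove one excess Pod
    # observed == desired: nothing to do, by design

cluster = {"replicas": 0}
for _ in range(10):                # simulate duplicate/retried calls
    reconcile(3, cluster)
print(cluster["replicas"])         # converges to 3; extra calls are no-ops
```

A non-idempotent version (say, one that fires an external API call on every invocation regardless of observed state) would fail exactly the test in the question above: calling it twice would not land in the same place as calling it once.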

Component Overview

Having a rough understanding of “which process has which responsibility” dramatically speeds up fault isolation.

Control Plane

| Component | Main Responsibility | Typical Dependencies | Common Topics |
| --- | --- | --- | --- |
| kube-apiserver | Provides Kubernetes API, auth/authz/Admission, persistence entry point | etcd | QPS/Watch load, Admission latency, object bloat |
| etcd | Cluster state persistent store (strong consistency) | Raft | Storage, snapshots, fragmentation, backup design |
| kube-scheduler | Assigns Pods to Nodes | apiserver | Scheduling strategy, extensions (Extender/Plugins), distribution |
| kube-controller-manager | Runs various controllers (Deployment/ReplicaSet/Node/Endpoint etc.) | apiserver | Retry/convergence, rate limiting, work queue design |
| cloud-controller-manager | Cloud-dependent control (LB/Route/Node etc.) | Cloud API | API limits/latency, permissions, cloud differences |

When thinking about Control Plane high availability, the core ultimately comes down to apiserver redundancy and etcd quorum maintenance.
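The quorum arithmetic behind etcd sizing is worth having at your fingertips. A short sketch (the helper names are illustrative):

```python
# Quorum arithmetic for an etcd cluster: Raft writes need a majority,
# so an n-member cluster tolerates n - quorum(n) member failures.

def quorum(members: int) -> int:
    return members // 2 + 1

def failure_tolerance(members: int) -> int:
    return members - quorum(members)

for n in (1, 3, 4, 5, 7):
    print(f"{n} members: quorum={quorum(n)}, tolerates {failure_tolerance(n)} failures")
```

Note that 4 members need a quorum of 3 but still tolerate only 1 failure, the same as 3 members. Even cluster sizes raise the quorum without improving fault tolerance, which is why odd sizes (3, 5, 7) are the usual recommendation.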

Alternatives to etcd

In principle, what Kubernetes needs is not “etcd itself” but rather the properties that API Machinery assumes: a strongly consistent persistent store with Watch / transaction / revision management semantics. So conceptually, there is room to substitute with a distributed KVS or data store with equivalent properties.

In practice, however, in upstream Kubernetes, etcd is the de facto standard as the persistence backend for apiserver, and vanilla configurations don’t assume it can be swapped. Attempting to use a different store typically requires a storage layer compatibility implementation, a fork, or a managed service’s proprietary implementation.

Node (Worker Node Side)

| Component | Main Responsibility | Typical Dependencies | Common Topics |
| --- | --- | --- | --- |
| kubelet | Materializes Pods (starts containers via CRI), updates Probe/Status | CRI runtime, apiserver | cgroup/resource isolation, Eviction, certificates/auth |
| Container Runtime | Executes containers (containerd/CRI-O etc.) | OCI | Image distribution, pull latency, runtime differences |
| kube-proxy | L4 routing for Services (iptables/ipvs etc.) | OS networking | Rule bloat, SNAT, IPVS tuning |
| CNI plugin | Configures Pod networking | Linux netns | NetworkPolicy, MTU, overlays like VXLAN |
| CSI node plugin | Attaches/mounts volumes | Storage | Mount failures, device exhaustion, reconnection |

The responsibility on the node side is “can you keep Pods running properly?”, and Linux kernel features and runtime differences strongly affect this. There are many failures that can’t be solved just by looking at the Control Plane.

Container Runtime Choices

Even if a runtime satisfies CRI, operational characteristics vary considerably between implementations.

  • containerd: Currently the most common. Strong track record with Kubernetes, good balance of features and ecosystem
  • CRI-O: Implementation more narrowly focused on Kubernetes. Relatively simple configuration, also commonly used in OpenShift contexts
  • cri-dockerd via Docker Engine: Docker usability and existing assets are easy to leverage, but it’s a bit further from Kubernetes’s standard path and is not a first choice today
  • Isolation-enhanced runtimes like gVisor or Kata Containers: Easier to achieve better isolation than standard runtimes, but trade-offs with performance and operational complexity

In practice, differences emerge in the following aspects:

  • Startup overhead and image pull time
  • Behavior differences around cgroup / namespace / seccomp
  • Ease of logging, debugging, and troubleshooting
  • Support for GPU, special devices, sandbox execution
  • Compatibility with existing distributions and managed Kubernetes

Therefore, rather than “which runtime is optimal?”, the reality is closer to selecting based on “do you want to run standard workloads stably?”, “do you prioritize isolation?”, or “do you want to leverage existing operational assets?”

Add-ons (Common Additions to Clusters)

  • CoreDNS: Service name resolution (successor to kube-dns)
  • Ingress Controller / Gateway API implementation: L7 ingress (nginx/Envoy etc.)
  • metrics-server: Resource metrics for HPA
  • CNI/CSI controller-side components (depends on provider implementation)
  • Cluster monitoring (Prometheus), log collection (Fluent Bit etc.), tracing (OTel)

Kubernetes in vanilla form only provides “the core of cluster operation,” and monitoring, L7 ingress, detailed policies, and storage capabilities are often supplemented by peripheral components.

Vanilla Kubernetes Can’t Do Much

To put it somewhat extremely, vanilla Kubernetes has the mechanism to place and maintain Pods, but is missing almost all the features typically expected of an application platform.

For example, even though there’s a Service abstraction, DNS add-ons like CoreDNS are essentially a prerequisite to use name resolution properly. The same goes for HTTP/HTTPS ingress—Ingress and Gateway API resource definitions alone can’t handle actual traffic; you separately need controller implementations like nginx or Envoy.

In other words, vanilla Kubernetes alone doesn’t naturally provide features you’d want in real operations: “connect by name,” “receive HTTP,” “collect metrics,” “distribute certificates,” “aggregate logs.” Kubernetes is less a universal PaaS and more a platform for combining such features afterward.

Design Philosophy

1) Declarative API and “Desired State”

The basic unit of Kubernetes is not “execution procedure” but an object representing desired state.

  • Writing YAML = not “writing procedures” but “declaring state”
  • The controller (not humans) bridges the gap between desired state and reality

Without this understanding, behaviors like “why does it revert automatically after I apply?” and “why does the Pod come back after I delete it?” seem counterintuitive. With it, Kubernetes behavior becomes quite straightforward.

About Declarative API

A declarative API is one that expresses “how should things ultimately be?” rather than “how should they be executed?” In Kubernetes, this approach permeates almost all major resources.

For example, with a Deployment, you declare “I want 3 nginx Pods running.” What you write is not a procedure but a target state.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27

What this YAML conveys is only the requirements “I want 3 Pods” and “I want to use this image”—you don’t write procedures like “first create 1 Pod, then add 2 more, retry on failure.” The responsibility for procedures lies with the Deployment controller and ReplicaSet controller side.

In the Kubernetes API, roughly speaking, spec corresponds to desired state and status corresponds to currently observed state. For example, if spec.replicas: 3 but only 2 Pods are actually running, the controller goes to close that diff. Conversely, even if you manually delete 1 Pod, since the desired state in the API hasn’t changed, it ultimately tries to return to 3.

Service, Job, StatefulSet, HorizontalPodAutoscaler, and others can all be understood with the same idea. Using Kubernetes means not accumulating individual command executions, but declaring desired state to the API and leaving the convergence to the system. Having this understanding makes it easier to grasp the big picture.

As a side note, there are operations in Kubernetes that look imperative. For example, kubectl scale deployment web --replicas=5 looks like “an operation to increase to 5,” but in reality it’s just updating the Deployment’s spec.replicas to 5—what it’s doing is rewriting the desired state.
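The patch that `kubectl scale` effectively sends can be sketched as follows. Building the patch body is the whole point; the commented-out call uses the official Python client (`kubernetes-client/python`) and assumes a configured kubeconfig, so it is shown for orientation only.

```python
# What `kubectl scale deployment web --replicas=5` amounts to: a patch
# that rewrites spec.replicas, i.e. an update to desired state.

def scale_patch(replicas: int) -> dict:
    # Strategic-merge-patch body touching only the desired-state field.
    return {"spec": {"replicas": replicas}}

body = scale_patch(5)
print(body)  # {'spec': {'replicas': 5}}

# With the official Python client (requires a reachable cluster):
# from kubernetes import client, config
# config.load_kube_config()
# client.AppsV1Api().patch_namespaced_deployment("web", "default", body)
```

Nothing in this flow tells the cluster *how* to go from 3 to 5 Pods; it only rewrites the desired state, and the controllers do the rest.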

2) Controller Pattern (Reconciliation Loop)

A controller can roughly be expressed in pseudocode as:

while true:
  desired = read_from_apiserver()
  observed = observe_world()
  diff = desired - observed
  apply(diff)

What’s important is that the premise is “it doesn’t need to succeed in one try” and “it will be re-executed as many times as needed.” That’s why error handling, retry, idempotency, and rate limiting become central to the design.
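The “re-executed as many times as needed” premise can be made concrete with a toy loop. Everything here is illustrative (the Pod names, the 50% failure rate): the point is that individual `apply()` calls may fail, and the loop still converges because every iteration re-reads desired state and recomputes the diff from scratch.

```python
# A concrete (toy) version of the reconciliation loop above: failures
# are expected, and convergence comes from repetition, not from any
# single attempt succeeding.
import random

random.seed(0)
desired = {"web-0", "web-1", "web-2"}   # Pods we want running
observed = set()                        # Pods actually running

def apply(name: str) -> bool:
    if random.random() < 0.5:           # simulate a transient failure
        return False                    # this attempt failed; a later pass retries
    observed.add(name)
    return True

while observed != desired:
    for missing in desired - observed:  # diff = desired - observed
        apply(missing)                  # failure is fine; next pass retries

print(sorted(observed))                 # ['web-0', 'web-1', 'web-2']
```

Note that the loop never tracks “how far along” it is; each pass starts from the current diff. That is what makes retries safe.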

About Exponential Backoff

In controllers where retries are assumed, retrying infinitely and immediately from the moment of failure would put excessive load on the API server and external dependencies. That’s where exponential backoff comes in, gradually increasing the retry interval based on failure count.

For example, if you increase wait time to 1 second, then 2, 4, 8 seconds, you can avoid hammering unnecessary retries when there’s a temporary failure. Kubernetes controllers emphasize work queues and rate limiters for this reason—not just “keep looping until success” but also “don’t make the overall system worse while things are broken” is part of the design.

This is also important when writing custom controllers / operators. If you immediately re-execute external API calls or heavy processing each time they fail, your own controller can easily become an incident amplifier. Behaving calmly on failure, rather than rushing to converge, ultimately tends to converge more stably.
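The backoff schedule described above (1, 2, 4, 8 seconds…) is easy to sketch. This is similar in spirit to client-go’s rate limiters, but the function and constants here are illustrative, not a real Kubernetes API:

```python
# Exponential backoff with a cap and jitter: delay doubles per failure,
# is capped so it never grows unbounded, and gets a little randomness
# so many failing controllers don't retry in lockstep.
import random

def backoff(failures: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Seconds to wait before the next retry after `failures` failures."""
    delay = min(cap, base * (2 ** failures))
    return delay + random.uniform(0, delay * 0.1)  # up to 10% jitter

for n in range(5):
    print(f"failure #{n}: wait ~{backoff(n):.1f}s")  # ~1, 2, 4, 8, 16 seconds
```

The cap and the jitter are as important as the doubling: the cap keeps a long outage from producing hour-long waits, and the jitter prevents synchronized retry storms when many reconcilers fail at once.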

3) API Machinery (Types, Versions, Extensions)

  • Group/Version/Kind (GVK) and internal representation (Internal)
  • Extension via CRD (though adding APIs also increases management/processing cost)
  • Inject policies and defaults via admission (mutating/validating)

Understanding this makes it easier to trace “where did this default come from?”, “where was this rejected?”, and “why does it appear as v1 instead of v1beta1?”
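How a manifest’s `apiVersion`/`kind` maps onto Group/Version/Kind can be shown with a tiny parser. The helper below is illustrative, not part of any client library; the wrinkle worth remembering is that core-group resources carry no group prefix:

```python
# Mapping apiVersion/kind to GVK. Core-group resources ("v1") have an
# empty group; extension groups are prefixed ("apps/v1", "batch/v1").

def parse_gvk(api_version: str, kind: str) -> tuple[str, str, str]:
    group, _, version = api_version.rpartition("/")
    return (group, version, kind)       # core group parses as ""

print(parse_gvk("apps/v1", "Deployment"))  # ('apps', 'v1', 'Deployment')
print(parse_gvk("v1", "Pod"))              # ('', 'v1', 'Pod')
print(parse_gvk("batch/v1", "Job"))        # ('batch', 'v1', 'Job')
```

This is also why a CRD must declare its own group and served versions: it is adding new GVKs that the apiserver then routes, defaults, and validates like the built-in ones.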

4) Interface Separation (CRI/CNI/CSI)

Kubernetes itself is designed so that execution, networking, and storage can be swapped out without the core needing to know the implementation details.

  • CRI (Container Runtime Interface): kubelet ↔ runtime
  • CNI (Container Network Interface): Pod network configuration
  • CSI (Container Storage Interface): Storage

The key point is that Kubernetes defines abstractions and connection surfaces, delegating most of the implementation to the ecosystem. This is a source of convenience, but also the cause of the confusing “behavior changes due to implementation differences.”

Useful Reference Materials

Here are useful materials organized by theme for understanding Kubernetes’s philosophy and underlying technology.

Cluster Management (Google Origins)

  • Borg: Large-scale cluster management at Google with Borg (EuroSys 2015)
    • Easy to grasp the requirements and design sensibility of large-scale cluster operations
  • Omega: Omega: flexible, scalable schedulers for large compute clusters (2013)
    • Useful for understanding the background of multiple schedulers and optimistic concurrency control

Distributed Consensus / Persistent Store (Prerequisites for etcd)

  • Raft: In Search of an Understandable Consensus Algorithm (Raft) (Ongaro & Ousterhout, 2014)
    • Essential for understanding etcd

Containers (Isolation and Execution Units)

  • Linux namespaces / cgroups documentation (primary sources rather than papers)
    • You can verify the isolation principles that kubelet and runtimes assume
