Introduction

Kubernetes is a service orchestration framework that provides many of the plumbing pieces required for running services. These services include ...

Containers

Kubernetes is structured around containers.

Kroki diagram output

In the context of containers, an ...

As shown in the entity diagram above, each container is created from a single image, but that same image can be used to create multiple containers. Another way to think about it is that an image is the blueprint of a factory and a container is the actual factory built from that blueprint. You can build multiple factories from the same blueprint.

Kubernetes requires two core components to run:

Different vendors provide different implementations of each. For example, certain vendors provide an OCI runtime that uses virtualization technology for isolation instead of standard Linux isolation (e.g. cgroups).

Kroki diagram output

OCIs and CRIs are also the basis for container engines. Container engines are tools responsible for creating and running containers, creating images, and other high-level functionality such as local testing of containers. Docker Engine is an example of a container engine.

Kroki diagram output

Objects

Kubernetes breaks down its orchestration as a set of objects. Each object is of a specific type, referred to as kind. The main kinds are ...

Of these kinds, the two main ones are nodes and pods.

Kroki diagram output

Nodes, pods, and other important kinds are discussed further on in this document.

🔍SEE ALSO🔍

⚠️NOTE️️️⚠️

The terminology here is a bit wishy-washy. Some places call them kinds, other places call them resources, other places call them classes, and yet other places call them straight-up objects (in this case, they mean kind but they're saying object). None of it seems consistent and sometimes terms are overloaded, which is why it's been difficult piecing together how Kubernetes works.

I'm using kind to refer to the different types of objects, and object to refer to an instance of a kind.

Labels

↩PREREQUISITES↩

An object can have two types of key-value pairs associated with it:

Finding objects based on labels is done via label selectors, described in the following table.

Operator                         Description
key=value                        key is set to value
key!=value                       key is not set to value
key in (value1, value2, ...)     key is either value1, value2, ...
key notin (value1, value2, ...)  key is neither value1, value2, ...
key                              a value is set for key
!key                             no value is set for key
key1=value1,key2=value2          key1 is set to value1 and key2 is set to value2

Kubernetes uses labels to orchestrate. Labels allow objects to have loosely coupled linkages to each other as opposed to tightly coupled parent-child / hierarchy relationships. For example, a load balancer decides which pods it routes requests to by searching for pods using a label selector.

Kroki diagram output

If there are a large number of labels / annotations, either because the organization set them directly or because they're being set by external tools, the chance of a collision increases. To combat this, keys for labels and annotations can optionally include a prefix (separated by a slash) that maps to a DNS subdomain to help disambiguate it. For example, company.com/my_key rather than just having my_key.
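
Label selectors are typically supplied to kubectl via the -l flag. A quick sketch, assuming pods labeled as in the examples elsewhere in this document (app_server=jetty, company.com/my_key) -- tomcat here is just a made-up second value:

kubectl get pods -l app_server=jetty                    # equality-based selector
kubectl get pods -l 'app_server in (jetty, tomcat)'     # set-based selector
kubectl get pods -l 'app_server,!company.com/my_key'    # key is set / key is not set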

⚠️NOTE️️️⚠️

The book states that the key name itself can be at most 63 chars. If a prefix is included, it doesn't count toward that limit. A prefix can be up to 253 chars.

Configuration

Objects can be created, accessed, and modified through either a REST web interface or a command-line interface called kubectl. Changes can be supplied in two ways:

Generally, declarative configurations are preferred over imperative configurations. When a declarative configuration is submitted, Kubernetes runs a reconciliation loop in the background that changes the object to match the submitted manifest, creating that object if it doesn't already exist. Contrast this to the imperative configuration method, where changes have to be manually submitted by the user one by one.
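
As a rough sketch of the difference (file and object names here are just placeholders):

# Imperative: each change is an explicit command issued by the user, one by one.
kubectl run my-pod --image=my-image:1.0
kubectl delete pod my-pod

# Declarative: submit a manifest describing the desired state and let Kubernetes reconcile.
kubectl apply -f my-pod.yaml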

Kinds

↩PREREQUISITES↩

The following subsections give an overview of the most-used kinds and example manifests for those kinds. All manifests, regardless of the kind, require the following fields ...

apiVersion: v1
kind: Pod
metadata:
  name: my-name
  annotations:
    author: "Jimbo D."
    created_on: "Aug 20 2021"
  labels:
    app_server: jetty

In addition, the metadata.labels and metadata.annotations fields contain the object's labels and annotations (respectively).

Pod

Containers are deployed in Kubernetes via pods. A pod is a set of containers grouped together, often containers that are tightly coupled and / or are required to work in close proximity of each other (e.g. on the same host).

Kroki diagram output

By default, containers within a pod are isolated from each other (e.g. isolated process IDs) except for sharing the same ...

While the point of Kubernetes is to orchestrate containers over a set of nodes, the containers for a pod are all guaranteed to run on the same node. As such, pods are usually structured in a way where their containers are tightly coupled and uniformly scale together. For example, imagine a pod comprised of a container running a WordPress server and a container running the MySQL database for that WordPress server. This would be a poor example of a pod because the two containers within it ...

  1. don't scale uniformly (e.g. you may need to scale the database up before the WordPress server, or vice versa).
  2. don't communicate over anything other than the network (e.g. they don't need a shared volume).
  3. are intended to be distributed (e.g. it's okay for them to be running on separate machines).

Contrast that to a pod with a container running a WordPress server and a container that pushes that WordPress server's logs to a monitoring service. This would be a good example of a pod because the two containers within ...

  1. communicate over the filesystem (e.g. application server is writing logs to a shared volume and the log watcher is tailing them).
  2. aren't intended to be distributed (e.g. log watcher is intended for locally produced logs).
  3. are written by different teams (e.g. SRE team wrote the log watcher image while another team wrote the application server image).

Example manifest:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - image: my-image:1.0
      name: my-container
      resources:
        requests:
          cpu: "500m"
          memory: "128Mi"
        limits:
          cpu: "1000m"
          memory: "256Mi"
      volumeMounts:
        - mountPath: "/data"
          name: "my_data"
      ports:
        - containerPort: 8080
          name: http
          protocol: TCP
      livenessProbe:
        httpGet:
          path: /healthy
          port: 8080
        initialDelaySeconds: 5
        timeoutSeconds: 1
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        timeoutSeconds: 1
        periodSeconds: 10
        failureThreshold: 3
  volumes:
    - name: "my-data"
      hostPath:
        path: "/var/lib/my_data"  # literally mounts a path from the worker node? not persistent if the pod moves nodes
    - name: "my-data-nfs"
      nfs:
        server: nfs.server.location
        path: "/path/on/nfs"

Images

Each container within a pod must have an image associated with it. Images are specified in the Docker image specification format, where a name and a tag are separated by a colon (e.g. my-image:1.0).

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container1
      image: my-image:1.0
    - name: my-container2
      image: my-image:2.4

Pull Policy

Each container in a pod has to reference an image to use. How Kubernetes loads a container's image is dependent on that container's image pull policy.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      imagePullPolicy: IfNotPresent  # Only download if the image isn't present

A value of ...

If unset, the image pull policy differs based on the image tag. Not specifying a tag or specifying latest as the tag will always pull the image. Otherwise, the image will be pulled only if it isn't present.

Private Container Registries

↩PREREQUISITES↩

Images that sit in private container registries require credentials to pull. Private container registry credentials are stored in secret objects of type kubernetes.io/dockerconfigjson in the format of Docker's config.json file.

apiVersion: v1
kind: Secret
metadata:
  name: my-docker-creds
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: ... # base64 encoded ~/.docker/config.json goes here

⚠️NOTE️️️⚠️

If you don't want to supply the above manifest, you can also use kubectl to create a secret object with the appropriate credentials: kubectl create secret docker-registry secret-tiger-docker --docker-email=tiger@acme.example --docker-username=tiger --docker-password=pass1234 --docker-server=my-registry.example:5000.

Those secret objects are then referenced in the pod's image pull secrets list.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  # Place secret here. This is a list, so you can have many container registry credentials
  # here.
  imagePullSecrets:
    - name: my-docker-creds
  containers:
    - name: my-container
      image: my-registry.example/tiger/my-container:1.0  # Image references registry.

🔍SEE ALSO🔍

Resources

Each container within a pod can optionally declare a ...

Setting these options allows Kubernetes to choose a node with enough available resources to run that pod.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      resources:
        requests:  # Minimum CPU and memory for this container
          cpu: "500m"
          memory: "128Mi"
        limits:    # Maximum CPU and memory for this container
          cpu: "1000m"
          memory: "256Mi"

requests are the minimum resources the container needs to operate while limits are the maximum resources the container can have. Some resources are dynamically adjustable while others require the pod to restart. For example, a pod ...

The example above lists out CPU and memory as viable resource types. The unit of measurement for ...

Ports

Each container within a pod can expose ports to the cluster.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      ports:
        - containerPort: 8080
          name: http
          protocol: TCP

The example above exposes port 8080 to the rest of the cluster (not to the outside world). Even with the port exposed, other entities on the cluster don't have a built-in way to discover the pod's IP / host or the fact that it has this specific port open. For that, services are required.

🔍SEE ALSO🔍

A pod can have many containers within it, and since all containers within a pod share the same IP, the ports exposed by those containers must be unique. For example, only one container within the pod can expose port 8080.

⚠️NOTE️️️⚠️

By default, network access is allowed to all pods within the cluster. You can change this using a separate kind called NetworkPolicy (as long as your Kubernetes environment supports it -- it may or may not, depending on the container networking interface used). NetworkPolicy lets you limit network access such that only pods that should talk together can talk together (a pod can't send a request to another random pod in the system). This is done via label selectors.

If you're aware of endpoints, service, and ingress kinds, I'm not sure how this network policy stuff plays with those kinds.
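
A minimal sketch of what such a policy might look like (the labels are made up): only pods labeled app=frontend may reach pods labeled app=api-server, and only on TCP port 8080.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  # Pods this policy applies to, selected via labels.
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
  ingress:
    # Only pods with the label app=frontend may connect, and only on TCP 8080.
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080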

Command-line Arguments

An image typically provides a default entry point (process that gets started) and default set of arguments to run with. Each container within a pod can override these defaults.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      command: [/opt/app/my-app]
      args: [--no-logging, --dry-run]

⚠️NOTE️️️⚠️

The Dockerfile used to create the image had an ENTRYPOINT and a CMD. command essentially overrides the Dockerfile ENTRYPOINT and args overrides the Dockerfile's CMD.

Environment Variables

↩PREREQUISITES↩

Each container within a pod can be assigned a set of environment variables.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      env:
        - name: LOG_LEVEL
          value: "OFF"
        - name: DRY_RUN
          value: "true"

Once defined, an environment variable's value can be used in other parts of the manifest using the syntax $(VAR_NAME). For example, an environment variable's value may be placed directly within an argument.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      env:
        - name: LOG_LEVEL
          value: "OFF"
      args: [--logging_telemetry=$(LOG_LEVEL)]

🔍SEE ALSO🔍

Configuration

↩PREREQUISITES↩

A container's configuration can come from both config maps and secrets. For config maps, a config map's key-value pairs can be accessed by a container via environment variables, command-line arguments, or volume mounts. To set a ...

For secrets, a secret object's key-value pairs can be accessed in almost exactly the same way as config maps with almost exactly the same set of options and restrictions. To set a ...
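
Setting aside the elided details above, here's a minimal sketch of a container pulling in config map and secret values (the object names and keys are placeholders): individual keys come in as environment variables, and the whole secret gets mounted as files.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      env:
        # Pull individual keys out of a config map and a secret.
        - name: PARAM1
          valueFrom:
            configMapKeyRef:
              name: my-config
              key: param1
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: my-secret
              key: password
      volumeMounts:
        # Expose the whole secret as files under /secrets (one file per key).
        - name: secret-vol
          mountPath: /secrets
  volumes:
    - name: secret-vol
      secret:
        secretName: my-secret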

Both config maps and secrets can be dynamically updated. If a pod is running when an update gets issued, it may or may not receive those updates depending on how the configurations are exposed to the container:

Command-line arguments and environment variables don't update because an application's command-line arguments and environment variables can't be changed from the outside once a process launches. Individual files/directories mounted from a volume don't update because of technical limitations related to how Linux filesystems work (see here). Whole volume mounts do update files under the mount, but it's up to the application to detect and reload those changed files.

⚠️NOTE️️️⚠️

All files in a volume mount get updated at once. This is possible because of symlinks. A new directory gets loaded in and the symlink is updated to point at that new directory.

⚠️NOTE️️️⚠️

For individual files/directories mounted from a volume, one workaround to receiving updates is to use symlinks. Essentially, mount the whole volume to a path that doesn't conflict with an existing path in the container. Then, as a part of the container's start-up process, add symlinks to the whole volume mount wherever needed.

For example, if the application requires a configuration file at /etc/my_config.conf, you can mount all configurations to /config and then symlink /etc/my_config.conf to /config/my_config.conf. That way, you can still receive updates.

The typical workaround to the issues with config map dynamic updates is to use deployments. In deployments, secrets / config maps and pods are bound together as a single unit, meaning that all pods restart automatically on any change.

🔍SEE ALSO🔍

Volume Mounts

↩PREREQUISITES↩

A pod is able to supply multiple volumes, where those volumes may be mounted to different containers within that pod.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  # Volumes supplied are listed here.
  volumes:
    - name: my-data1
      hostPath:
        path: "/var/lib/my_data1"
    - name: my-data2
      hostPath:
        path: "/var/lib/my_data2"
  # Each container in the pod can mount any of the above volumes by referencing its name.
  containers:
    - name: my-container
      image: my-image:1.0
      volumeMounts:
        # Mount "my-data1" volume to /data1 in the container's filesystem.
        - mountPath: /data1
          name: my-data1
        # Mount "my-data2" volume to /data2 in the container's filesystem.
        - mountPath: /data2
          name: my-data2

In the example above, the two volumes supplied by the pod are both of type hostPath. hostPath volume types reference a directory on the node that the pod is running on, meaning that if two containers within the same pod are assigned the same hostPath volume, they see each other's changes on that volume. The type of volume supplied defines the characteristics of that volume. Depending on the volume type, data on that volume ...

Each supplied volume within a pod can either reference a direct piece of storage or it can reference a persistent volume claim. In most cases, directly referencing a piece of storage (as done in the above example) is discouraged because it tightly couples the pod to that storage and its parameters. The better way is to use persistent volume claims, where volumes are assigned from a pool (or dynamically created and assigned) to pods as required. Assuming that you have a persistent volume claim already created, it can be referenced by using persistentVolumeClaim as the volume type.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  volumes:
    - name: my-data
      persistentVolumeClaim:  # Volume type of "persistentVolumeClaim"
        claimName: my-data-pv-claim
  containers:
    - name: my-container
      image: my-image:1.0
      volumeMounts:
        - mountPath: /data
          name: my-data

Lifecycle

A pod's lifecycle goes through several phases:

Kroki diagram output

Each container in a pod can be in one of several states:

The following subsections detail various lifecycle-related configurations of a pod and its containers.

Probes

Probes are a way for Kubernetes to check the state of a pod. Containers within the pod expose interfaces which Kubernetes periodically pings to determine what actions to take (e.g. restarting a non-responsive pod).

Different types of probes exist. A ...

🔍SEE ALSO🔍

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      # Probe to check if a container is alive or dead. Performs an HTTP GET with path
      # /healthy at port 8080.
      livenessProbe: 
        httpGet:
          path: /healthy
          port: 8080
        initialDelaySeconds: 5
        timeoutSeconds: 1
        periodSeconds: 10
        failureThreshold: 3
      # Probe to check if a container is able to service requests. Performs an HTTP GET
      # with path /ready at port 8080.
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        timeoutSeconds: 1
        periodSeconds: 10
        failureThreshold: 3

In the example above, each of the probes check an HTTP server within the container at port 8080 but at different paths. The field ...

There are types of probes other than httpGet. A probe of type ...

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      readinessProbe:
        exec:
          command:
            - cat
            - /tmp/some_file_here
        initialDelaySeconds: 5
        timeoutSeconds: 1
        periodSeconds: 10
        failureThreshold: 3
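
For completeness, a tcpSocket probe simply checks that a TCP connection can be opened on the given port. A sketch, reusing the same timing fields as above:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      readinessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 5
        timeoutSeconds: 1
        periodSeconds: 10
        failureThreshold: 3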

Graceful Termination

Kubernetes terminates pods by sending a SIGTERM to each container's main process, waiting a predefined amount of time, then forcefully sending a SIGKILL to that same process if the process hasn't shut itself down. The predefined waiting time is called the termination grace period, and it's provided so the application can perform cleanup tasks after it's received SIGTERM (e.g. emptying queues).

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  restartPolicy: Always
  terminationGracePeriodSeconds: 60  # Default to 30
  containers:
    - name: my-container
      image: my-image:1.0

⚠️NOTE️️️⚠️

If you have a pre-stop pod lifecycle hook (described in another section), note that this termination grace period starts as soon as the hook gets invoked (not after it finishes).

On termination (either via SIGTERM or voluntarily), a pod's container can write a message to a special file regarding the reason for its termination. The contents of this file will be visible in the pod container's "last state" property.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  restartPolicy: Always
  containers:
    - name: my-container
      image: my-image:1.0
      terminationMessagePath: /var/exit-message  # Defaults to /dev/termination-log

⚠️NOTE️️️⚠️

A pod container's "last state" property is visible when you describe the pod via kubectl.

Maximum Runtime

The runtime of a pod can be limited such that, if it continues to run for more than some duration of time, Kubernetes will forcefully terminate it and mark it as failed.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  activeDeadlineSeconds: 3600  # When set, pod can't run for more than this many seconds
  containers:
    - name: my-container
      image: my-image:1.0

Lifecycle Hooks

↩PREREQUISITES↩

Lifecycle hooks are a way for Kubernetes to notify a container of when ...

Similar to probes, containers within the pod expose interfaces which Kubernetes invokes. A lifecycle hook interface is similar to a probe interface in that it can be one of multiple types: httpGet, tcpSocket, and exec.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      lifecycle:
        # Hook to invoke once the container's main process starts running. This hook runs
        # alongside the main process (not before it launches), but Kubernetes will treat
        # the container as if it's still being created until this hook completes.
        #
        # A pod will be in a "Pending" state until each of its containers' post-start hooks
        # complete.
        postStart:
          exec:
            command: [sh, -c, "sleep 15"]  # Artificial sleep
        # Hook to invoke just before the container is voluntarily terminated (e.g. it's
        # being moved to a new node). This hook runs first and once it's finished, a SIGTERM
        # is sent to the main container process, followed by a SIGKILL if the container's
        # main process hasn't terminated itself.
        #
        # It's important to note that the termination grace period begins as soon as the
        # pre-stop hook gets invoked, not after the pre-stop hook finishes.
        preStop:
          httpGet:
            path: /shutdown
            port: 8080

A post-start hook is useful when some form of initialization needs to occur but it's impossible to do that initialization within the container (e.g. initialization doesn't happen on container start and you don't have access to re-create / re-deploy the container image to add support for it). Likewise, a pre-stop hook is useful when some form of graceful shutdown needs to occur but it's impossible to do that shutdown within the container (e.g. shutdown procedures don't happen on SIGTERM and you don't have access to re-create / re-deploy the container image to add support for it).

⚠️NOTE️️️⚠️

Recall that a container has three possible states: waiting, running, and terminated. The docs say that a container executing a post-start hook is still in the waiting state.

⚠️NOTE️️️⚠️

According to the book, it's difficult to tell if / why a hook failed. Its output doesn't go anywhere. You'll just see something like FailedPostStartHook / FailedPreStopHook somewhere in the pod's event log.

According to the book, many applications use pre-stop hook to manually send a SIGTERM to their app because, even though SIGTERM is being sent by Kubernetes, it's getting gobbled up and discarded by some parent process (e.g. running your app via sh).

Init Containers

↩PREREQUISITES↩

Init containers are pod containers that run prior to a pod's actual containers. Their purpose is to initialize the pod in some way (e.g. writing startup data to some shared volume) or delay the start of the pod until some other service is detected as being online (e.g. database).

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  # Init containers are defined similarly to main containers, but they run before the main
  # containers, one after the other in the order they're defined. After the last init
  # container successfully completes, the main containers for the pod start.
  initContainers:
    - name: my-initA
      image: my-init-imageA:1.0
    - name: my-initB
      image: my-initB-image:1.0
      command: ['launchB', '--arg1']
  containers:
    - name: my-container
      image: my-image:1.0

⚠️NOTE️️️⚠️

Important note from the docs:

Because init containers can be restarted, retried, or re-executed, init container code should be idempotent. In particular, code that writes to files on EmptyDirs should be prepared for the possibility that an output file already exists.

Restart Policy

Restart policy is the policy Kubernetes uses for determining when a pod should be restarted.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  restartPolicy: Always
  containers:
    - name: my-container
      image: my-image:1.0

A value of ...

Always is typically used when running servers that should always be up (e.g. http server) while the others are typically used for one-off jobs.

When a container within a pod fails, that entire pod is marked as failed and may restart depending on this property. Kubernetes exponentially delays restarts so that, if the restart is happening due to an error, there's some time in between restarts for the error to get resolved (e.g. waiting for some pending network resource required by the pod to come online). The delay increases exponentially (10 seconds, 20 seconds, 40 seconds, 80 seconds, etc..) until it caps out at 5 minutes. The delay resets once a restarted pod is executing for more than 10 minutes without issue.

⚠️NOTE️️️⚠️

"When a container within a pod fails, that entire pod is marked as failed" -- Is this actually true?

The delay may also reset if the pod moves to another node. The documentation seems unclear.

🔍SEE ALSO🔍

Service Discovery

↩PREREQUISITES↩

For a pod to communicate with services, it needs to be able to discover the IP(s) of those services. The mechanisms for discovering services within a pod are environment variables and DNS.

These service discovery mechanisms are detailed in the subsections below.

Environment Variables

When a pod launches, all services within the same namespace have their IP and port combinations added as environment variables within the pod's containers. The environment variable names are in the format {SVCNAME}_SERVICE_HOST / {SVCNAME}_SERVICE_PORT, where {SVCNAME} is the service converted to uppercase and dashes swapped with underscores. For example, service-a would get converted to SERVICE_A.

SERVICE_A_SERVICE_HOST=10.111.240.1
SERVICE_A_SERVICE_PORT=443
SERVICE_B_SERVICE_HOST=10.111.249.153
SERVICE_B_SERVICE_PORT=80

If a service exposes multiple ports, only the first port goes in {SVCNAME}_SERVICE_PORT. When multiple ports are present, additional environment variables get created in the format {SVCNAME}_SERVICE_PORT_{PORTNAME}, where {PORTNAME} is the name of the service's port modified the same way that {SVCNAME} is. For example, service-c with two exposed ports named web-1 and metrics-1 would get converted to SERVICE_C_SERVICE_PORT_WEB_1 and SERVICE_C_SERVICE_PORT_METRICS_1 respectively.

SERVICE_C_SERVICE_HOST=10.111.240.1
SERVICE_C_SERVICE_PORT=443
SERVICE_C_SERVICE_PORT_WEB_1=443
SERVICE_C_SERVICE_PORT_METRICS_1=8080
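
For reference, a service manifest that would produce the service-c environment variables above might look something like this (sketch -- the selector and target ports are made up):

apiVersion: v1
kind: Service
metadata:
  name: service-c
spec:
  selector:
    app: my-app
  ports:
    - name: web-1       # Becomes SERVICE_C_SERVICE_PORT_WEB_1 (and SERVICE_C_SERVICE_PORT since it's first).
      port: 443
      targetPort: 8443
    - name: metrics-1   # Becomes SERVICE_C_SERVICE_PORT_METRICS_1.
      port: 8080
      targetPort: 8080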

⚠️NOTE️️️⚠️

Looking at the k8s code, it looks like a service port needs to be named for it to show up as an environment variable. Service ports that don't have a name won't show up as environment variables. See https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/envvars/envvars.go#L51-L55.

Using environment variables for service discovery has the following pitfalls:

On the plus side, inspecting environment variables within a container essentially enumerates all services within the pod's namespace. Enumerating services isn't possible when using DNS for service discovery, discussed in the next section.

DNS

↩PREREQUISITES↩

Kubernetes provides a global DNS server which is used for service discovery. Each pod is automatically configured to use this DNS server and simply has to query it for a service's name. If the queried service is present, the DNS server will return the stable IP of that service.

⚠️NOTE️️️⚠️

The DNS server runs as an internal Kubernetes application called coredns or kube-dns, usually in the kube-system namespace. Recall that the IP of a service is stable for the entire lifetime of the service, meaning that service restarts and DNS caching by the application and / or OS aren't an issue here.

The general domain query format is {SVCNAME}.{NAMESPACE}.svc.{CLUSTERDOMAIN}, where ...

For example, to query for the IP of service serviceA in namespace ns1 within a cluster that has the domain name suffix cluster.local, the domain name to query is serviceA.ns1.svc.cluster.local. Alternatively, if the pod doing the querying is ...

Using DNS for service discovery has the following pitfalls:

On the plus side, DNS queries can extend outside the pod's namespace and services started after a container launches are queryable. These aren't possible when using environment variables for service discovery, discussed in the previous section.
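
A sketch of querying the DNS server from inside a running pod (pod and service names are placeholders, and nslookup needs to be available in the container's image):

kubectl exec my-pod -- nslookup serviceA                        # same namespace, short name
kubectl exec my-pod -- nslookup serviceA.ns1                    # different namespace
kubectl exec my-pod -- nslookup serviceA.ns1.svc.cluster.local  # fully qualified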

Metadata

↩PREREQUISITES↩

Information about a pod and its containers such as ...

... can all be accessed within the container via either a file system mount or environment variables.

Environment Variables

All pod information except for labels and annotations can be assigned to environment variables. This is because a running pod can have its labels and annotations updated but the environment variables within a running container can't be updated once that container starts (updated labels / annotations won't show up to the container).

⚠️NOTE️️️⚠️

CPU resources can also be dynamically updated without restarting the pod / container process. The environment variable for this likely won't update either, but it isn't restricted like labels / annotations are. There may be other reasons that labels / annotations aren't allowed. Maybe Linux has a cap on how large an environment variable can be, and there's a realistic possibility that labels / annotations can exceed that limit?

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - image: my-image:1.0
      name: my-container
      resources:
        requests:
          cpu: 15m
          memory: 100Ki
        limits:
          cpu: 100m
          memory: 4Mi
      env:
        # These entries reference values that would normally be "fields" under a running pod
        # in Kubernetes. That is, these entries reference paths that you would normally see
        # when you inspect a pod in Kubernetes by dumping out its YAML/JSON. For example,
        # by running "kubectl get pod my-pod -o yaml" -- it produces a manifest but with
        # many more fields (dynamically assigned fields and fields with default values
        # filled out).
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name  # Pulls in "my-pod".
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace  # Pulls in the default namespace supplied by Kubernetes.
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP  # Pulls in the pod's IP.
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName  # Pulls in the name of the node that the pod's running on.
        # When referencing resource requests / limits for a container, if you're referencing those
        # of a different container than the one you're assigning the env var to, you'll need to
        # supply a "containerName" field. Otherwise, you can omit the "containerName" field.
        #
        # Resource requests / limits may optionally be provided a "divisor" field, which will
        # divide the value before assigning it.
        - name: CPU_REQUEST
          valueFrom:
            resourceFieldRef:
              resource: requests.cpu
              containerName: my-container  # If you omit this, it'll default to "my-container" anyways.
              divisor: 5m  # Divide by 5 millicores before assigning (15millicores/5millicores=3)
        - name: CPU_LIMIT
          valueFrom:
            resourceFieldRef:
              resource: limits.cpu
              containerName: my-container  # If you omit this, it'll default to "my-container" anyways.
              divisor: 5m  # Divide by 5 millicores before assigning (100millicores/5millicores=20)
        - name: MEM_REQUEST
          valueFrom:
            resourceFieldRef:
              resource: requests.memory
              containerName: my-container  # If you omit this, it'll default to "my-container" anyways.
              divisor: 1Ki  # Divide by 1 kibibyte before assigning (100Kibibytes/1Kibibyte=100)
        - name: MEM_LIMIT
          valueFrom:
            resourceFieldRef:
              resource: limits.memory
              containerName: my-container  # If you omit this, it'll default to "my-container" anyways.
              divisor: 1Ki  # Divide by 1 kibibyte before assigning (4Mebibytes/1Kibibyte=4096)

Volume Mount

↩PREREQUISITES↩

All pod information can be exposed as a volume mount, where files in that mount map to pieces of information. Unlike with environment variables, a volume mount can contain labels and annotations. If those labels and annotations are updated, the relevant files within the mount update to reflect the changes. It's up to the application running within the container to detect and reload those updated files.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    drink: Pepsi
    car: Volvo
  annotations:
    key1: value1
    key2: |
      Good morning,
      Today is Sunday.
spec:
  volumes:
    # A volume of type downwardAPI will populate with files, where each file contains the
    # value for a specific field. Fields are specified in a similar manner to the
    # environment variable version above.
    #
    # Each "path" under "items" is file within the volume.
    - name: downward_vol
      downwardAPI:
        items:
          - path: podName
            fieldRef:
              fieldPath: metadata.name
          - path: podNamespace
            fieldRef:
              fieldPath: metadata.namespace
          - path: podIp
            fieldRef:
              fieldPath: status.podIP
          - path: nodeName
            fieldRef:
              fieldPath: spec.nodeName
          - path: cpuRequest
            resourceFieldRef:
              resource: requests.cpu
              containerName: my-container  # MUST BE INCLUDED, otherwise it's impossible to know which container.
              divisor: 5m
          - path: cpuLimit
            resourceFieldRef:
              resource: limits.cpu
              containerName: my-container  # MUST BE INCLUDED, otherwise it's impossible to know which container.
              divisor: 5m
          - path: memRequest
            resourceFieldRef:
              resource: requests.memory
              containerName: my-container  # MUST BE INCLUDED, otherwise it's impossible to know which container.
              divisor: 1Ki
          - path: memLimit
            resourceFieldRef:
              resource: limits.memory
              containerName: my-container  # MUST BE INCLUDED, otherwise it's impossible to know which container.
              divisor: 1Ki
          # The following two entries supply labels and annotations. Note that, if labels
          # or annotations change for the pod, the files in this volume will be updated to
          # reflect those changes.
          #
          # Each file below will contain multiple key-value entries. One key-value entry per line, where
          # the key and value are delimited by an equal sign (=). Values are escaped, so the new lines in
          # the multiline example annotation in this pod (see key2, where the value is a good morning
          # message) will be appropriately escaped.
          - path: "labels"
            fieldRef:
              fieldPath: metadata.labels
          - path: "annotations"
            fieldRef:
              fieldPath: metadata.annotations
  containers:
    - image: my-image:1.0
      name: my-container
      resources:
        requests:
          cpu: 15m
          memory: 100Ki
        limits:
          cpu: 100m
          memory: 4Mi
      # Mount the volume declared above into the container. The application in the container
      # will be able to access the metadata as files within the volume mount.
      volumeMounts:
        - name: downward-vol
          mountPath: /metadata

Node Placement

A pod can have soft / hard requirements as to which nodes it can run on. It's common to segregate which nodes a pod can / can't run on when dealing with...

Three different mechanisms are used to define these requirements:

These mechanisms are documented in further detail in the subsections below.

Node Selectors

A node selector forces a pod to run on nodes that have a specific set of node labels.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  # This pod can only run on nodes that have the labels disk=ssd and cpu=IntelXeon.
  nodeSelector:
    disk: ssd
    cpu: IntelXeon
  containers:
    - name: my-container
      image: my-image:1.0
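
The node labels that a node selector matches against are typically set by the administrator or the cloud provider. A sketch of doing it by hand with kubectl (the node name is a placeholder):

kubectl label node my-node-1 disk=ssd cpu=IntelXeon  # Add labels
kubectl label node my-node-1 disk- cpu-              # Remove labels (note the - at the end)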

Taints and Tolerations

↩PREREQUISITES↩

A taint is a node property, structured as a key-value pair and effect, that repels pods. For a pod to be scheduled / executed on a node with taints, it needs tolerations for those taints. Specifically, that pod ...

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  # This pod tolerates the taints "environment=production:NoSchedule" and
  # "environment=excess-capacity:NoSchedule", meaning it's allowed to be scheduled on nodes
  # that have them. The example below uses the "Equal" operator, but there's also an "Exists"
  # operator which will match any value ("value" field shouldn't be set if "Exists" is used).
  tolerations:
    - key: environment
      operator: Equal
      value: production
      effect: NoSchedule
    - key: environment
      operator: Equal
      value: excess-capacity
      effect: NoSchedule
  containers:
    - name: my-container
      image: my-image:1.0

🔍SEE ALSO🔍

Node Affinity

↩PREREQUISITES↩

Node affinity is a set of rules defined on a pod that repels / attracts it to nodes tagged with certain labels, either as soft or hard requirements.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  affinity:
    nodeAffinity:
      # This field lists the hard requirements for node labels. The possible operators
      # allowed are ...
      #
      #  * "In" / "NotIn" - Tests key has (or doesn't have) one of many possible values.
      #  * "Exists" / "DoesNotExist" - Tests key exists (or doesn't exist), value ignored.
      #  * "Gt" / "Lt" - Tests key's value is grater than / less than.
      #
      # Use "In" / "Exists" for attraction and "NotIn" / "DoesNotExist" for repulsion.
      #
      # In this example, it's requiring that the CPU be one of two specific Intel models and
      # the disk not be a hard-drive (e.g. it could be a solid-state drive instead).
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: cpu
              operator: In
              values: [intel-raptor-lake, intel-alder-lake]
            - key: disk-type
              operator: NotIn
              values: [hdd]
      # This field lists out the soft requirements for node labels and weights those
      # requirements. Each requirement uses the same types of expressions / operators as the
      # hard requirements shown above, but it also has a weight that defines the
      # desirability of that requirement. Each weight must be between 1 to 100.
      #
      # In this example, the first preference outweighs the second by a ratio of 10:2 (5x
      # more preferred).
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: ram-speed
                operator: Gt
                values: ["3600"]
              - key: ram-type
                operator: In
                values: [ddr4]
        - weight: 20
          preference:
            matchExpressions:
              - key: ram-speed
                operator: Lt
                values: ["3601"]  # This is <, 3601 means speed needs to be <= 3600.
              - key: ram-type
                operator: In
                values: [ddr4]
  containers:
    - name: my-container
      image: my-image:1.0

⚠️NOTE️️️⚠️

The scheduler will try to enforce these preferences, but it isn't guaranteed as there could be other competing scheduling requirements (e.g. the admin has set something up to spread out pods more / less across nodes).

⚠️NOTE️️️⚠️

The "5x more preferred" comment above is speculation. I've tried to look online to see how weights work but haven't been able to find much. Does it scale by whatever the highest weight is? So if the example above's first preference was 10 instead of 100, would it be preferred 0.5x as much (10:20 ratio)?

Unlike ...

⚠️NOTE️️️⚠️

Note that requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution both end with "ignored during execution". This basically says that the rules only apply when the pod gets scheduled onto a node; a pod already running on a node won't get evicted if that node stops satisfying the rules. This is in contrast to node taints, where a taint having an effect of NoExecute will force evictions of running pods.

The book hints that the ability to evict running pods may be added sometime in the future.

Pod Affinity

↩PREREQUISITES↩

Pod affinity is a set of rules defined on a pod that repels / attracts it to the vicinity of other pods tagged with certain labels. Vicinity is determined via a topology key, which is a label placed on nodes to define where they live. For example, nodes within the same ...

Pod affinity defines pod attraction / repulsion by looking for pod labels and a topology key. For example, it's possible to use pod affinity to ensure that a pod gets scheduled on the same rack (via topology key) as another pod with the label app=api-server. There could be multiple pods with the label app=api-server, in which case the scheduler will pick one, figure out which rack it lives on, and place the new pod on that same rack.

Similar to node affinity, pod affinity can have hard and soft requirements, where soft requirements have weights that define the desirability of attraction / repulsion. Selector expressions and weights for pod affinity are defined similarly to those for node affinity.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - image: my-image:1.0
      name: my-container
  affinity:
    # This field defines the rules specifically for **attraction**. Like with node affinity,
    # pod affinity uses ...
    #
    #  * "requiredDuringSchedulingIgnoredDuringExecution" for hard requirements.
    #  * "preferredDuringSchedulingIgnoredDuringExecution" for soft requirement.
    #  * label selector operators "In", "NotIn", "Exists", and "DoesNotExist" (node
    #    affinity's "Gt" / "Lt" operators aren't available on label selectors).
    #  * the same weight requirements as node affinity (range between 1 to 100 per soft
    #    requirement).
    #
    # Unlike with node affinity, the negation operators ("NotIn" / "DoesNotExist") don't
    # define repulsion, they just attract to pods that don't have something. For example,
    # here we're looking to have affinity to pods that don't have the labels
    # "stability-level=alpha" and "stability-level=alpha". In addition, it strongly
    # prefers to live on the same rack as pods with label "app=api-server" and less strongly
    # prefers to live on the same rack as pods with label "app=db-server".
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: rack
          labelSelector:
            matchExpressions:
              - key: stability-level
                operator: NotIn
                values: [alpha, beta]
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: rack
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values: [api-server]
        - weight: 20
          podAffinityTerm:
            topologyKey: rack
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values: [db-server]
    # This field defines the rules specifically for **repulsion**. It's set up exactly the
    # same way as the field for attraction shown above, but the criteria here repels instead.
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: data-center
          labelSelector:
            matchExpressions:
              - key: security-context
                operator: NotIn
                values: [privileged-pod]

⚠️NOTE️️️⚠️

The scheduler will try to enforce these preferences, but it isn't guaranteed as there could be other competing scheduling requirements (e.g. the admin has set something up to spread out pods more / less across nodes).

⚠️NOTE️️️⚠️

I've tried to look online to see how weights work but haven't been able to find much. Does it scale by whatever the highest weight is? So if the example above's first preference was 10 instead of 100, would it be preferred 0.5x as much (10:20 ratio)?

Kubernetes comes with pre-defined topology keys:

Container Isolation

The isolation guarantees of the containers within a pod can be modified. Specifically, a pod can ask that its containers get ...

The following subsections document these mechanisms in further detail.

🔍SEE ALSO🔍

Security Context

↩PREREQUISITES↩

A pod and its containers can have security-related features configured via a security context.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - image: my-image:1.0
      name: my-container
      securityContext:
        # When set, the default user of the container is updated to be this user ID
        # (note that this is a user ID, not a user name). This is useful if multiple
        # applications are reading / writing to the same shared volume (no permission
        # problems if they're all reading / writing as the same UID?).
        runAsUser: 99
        # When set, the group of the default user of the container is updated to be this
        # group ID (there are multiple group IDs here, 123 is the main one used when
        # creating files and directories). This is useful if multiple applications are
        # reading / writing to the same shared volume (no permission problems if they're all
        # reading / writing as the same GID? -- if file permissions allow).
        fsGroup: 123
        supplementalGroups: [456, 789]
        # When set to true, the container will run as a non-root user. This is useful if
        # the pod breaks isolation by exposing the internals of the node to some of its
        # containers.
        runAsNonRoot: true
        # When set to true, the container runs in "privileged mode" (full access to the
        # Linux kernel). This is useful in cases where the pod manages the node somehow
        # (e.g. modifying iptables rules).
        privileged: true
        # An alternative to giving a container "privileged" access (shown above) is to
        # instead provide the container with fine-grained permissions to the kernel.
        # This can also be used to revoke fine-grained permissions that are provided
        # by default (e.g. remove the ability to change ownership of a dir).
        capabilities:
          add:
            - SYS_TIME
          drop:
            - CHOWN
        # When set to true, the container is unable to write to its own filesystem. It
        # can only read from it. Any writing it needs to do has to be done on a mounted
        # volume.
        readOnlyRootFilesystem: true
        #
        # Not all features are listed here. There are many others.

Some of these security-related features can also be set at the pod level rather than the container level. When set at the pod level, they're applied as defaults for all containers (containers can override them if needed).

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  # Pod-level security context. This gets applied as a default for all containers in the pod.
  # Note that only some settings are available at the pod level (e.g. "runAsUser",
  # "runAsNonRoot", "fsGroup") -- container-only settings such as "privileged" and
  # "capabilities" can't be set here.
  securityContext:
    runAsUser: 99
    runAsNonRoot: true
    fsGroup: 123
  containers:
    - image: my-image1:1.0
      name: my-container1
    - image: my-image2:1.0
      name: my-container2
      securityContext:
        runAsUser: 1000  # Override the default "runAsUser" security context option.

Node Access

↩PREREQUISITES↩

A pod's isolation guarantees can be relaxed so that it has access to the internals of the node it's running on. This is important in certain system-level scenarios, such as pods that collect node performance metrics.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - image: my-image:1.0
      name: my-container
  # By default, all containers within a pod share the same IP and port space (unique to that
  # pod). However, when this property is set to true, the node's default "network namespace"
  # is shared with the containers of the pod (network interfaces are exposed to the pod).
  hostNetwork: true
  # By default, each container within a pod has its own isolated process ID space. However,
  # when this property is set to true, the node's default "process ID namespace" is used for
  # each pod container's processes.
  hostPID: true
  # By default, each container within a pod has its own isolated IPC space. However, when
  # this property is set to true, the node's default "IPC namespace" is used for each pod
  # container's processes.
  hostIPC: true

🔍SEE ALSO🔍

If the only requirement is that requests from a node's port get forwarded to a container's port, that node's network interfaces don't need to be exposed to the pod. Instead, a container can also simply ask that the node running it directly map a node port to it.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - image: my-image:1.0
      name: my-container
      ports:
        # Map port 9999 on the node that this pod runs to port 8080 on this pod's container.
        #
        # Note that the Kubernetes scheduler ensures that a node has port 9999 available
        # before scheduling this pod to run on it.
        - containerPort: 8080
          hostPort: 9999
          protocol: TCP

⚠️NOTE️️️⚠️

If you already know about services, the NodePort service isn't the same thing as what's going on here. This is opening up a port on the node that the pod is running on and forwarding requests to the container. NodePort opens the same port on all nodes and forwards requests to a random pod (not necessarily the pod running on the same node that the request came in to).

🔍SEE ALSO🔍

API Access

↩PREREQUISITES↩

Containers within a pod can access the Kubernetes API server via a service called kubernetes, typically found in the default namespace. Communicating with this service requires a certificate check (to verify the server isn't a man-in-the-middle box) as well as an access token (to authenticate with the service). By default, containers have a secret object mounted as a volume at /var/run/secrets/kubernetes.io/serviceaccount that contains both these pieces of data as files:

In most cases, the credentials provided likely won't provide unfettered access to the Kubernetes API.

⚠️NOTE️️️⚠️

See here for an explanation of bearer tokens. You typically just need to include an HTTP header with the token in it.

Third-party libraries that interface with Kubernetes are available for various languages (e.g. Python, Java, etc..), meaning you don't have to do direct HTTP requests and do things like fiddle with headers.
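
A sketch of what a raw request from inside a container might look like, assuming the default mount path described above and the standard kubernetes.default.svc service address:

TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
NAMESPACE=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)

# List the pods in this pod's namespace (fails if the service account lacks permission).
curl --cacert $CACERT -H "Authorization: Bearer $TOKEN" \
  https://kubernetes.default.svc/api/v1/namespaces/$NAMESPACE/pods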

🔍SEE ALSO🔍

Configuration Map

A configuration map is a set of key-value pairs intended to configure the main application of a container (or many containers). By decoupling configurations from the containers themselves, the same configuration map (or parts of it) could be used to configure multiple containers within Kubernetes.

🔍SEE ALSO🔍

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
data:
  param1: another-value
  param2: extra-value
  my-config.ini: |
    # This is a sample config file that I might use to configure an application
    key1 = value1
    key2 = value2

The key-value pairs of a configuration map typically get exposed to a container either as environment variables, files, or command-line arguments. Keys are limited to certain characters: alphabet, numbers, dashes, underscores, and dots.
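
A sketch of a pod pulling in the my-config map above (the mount path and argument name are made up): param1 comes in as an environment variable and gets passed along as a command-line argument, while my-config.ini gets mounted as a file.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      env:
        - name: PARAM1
          valueFrom:
            configMapKeyRef:
              name: my-config
              key: param1
      # A config map value can reach the command line indirectly via the $(VAR) syntax.
      args: [--param1=$(PARAM1)]
      volumeMounts:
        - name: config-vol
          mountPath: /etc/app
  volumes:
    # Only expose the my-config.ini key, as the file /etc/app/my-config.ini.
    - name: config-vol
      configMap:
        name: my-config
        items:
          - key: my-config.ini
            path: my-config.ini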

Secret

↩PREREQUISITES↩

A secret object is a set of key-value pairs, similar to a config map, but oriented towards security rather than just configuration (e.g. for storing things like access tokens, passwords, certificates). As opposed to a config map, Kubernetes takes extra precautions to ensure that a secret object is stored and used securely.

apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque  # "Opaque" is the default type (can be omitted)
# Both text and binary data are supported. To insert a text entry, place it under
# "stringData". To insert a binary entry, base64 the value and place it under "data". 
stringData:
  username: admin
  password: pepsi_one
data:
  key_file: eWFiYmFkYWJiYWRvbw==  

Many types of secrets exist. Each type either does some level of verification on the entries and / or acts as a tag to convey what data is contained within (e.g. SSH data, TLS data, etc..). In general Opaque is the secret type used by most applications.

⚠️NOTE️️️⚠️

Certain sources are claiming that a secret object can be 1 megabyte at most.

Node

Nodes are the machines that pods run on. A Kubernetes environment often contains multiple nodes, each with a certain amount of resources. Pods get assigned to nodes based on their resource requirements. For example, if pod A requires 2 GB of memory and node C has 24 GB available, pod A may get assigned to run on node C.

Kroki diagram output

Kubernetes typically attempts to schedule multiple instances of the same pod on different nodes, such that a downed node won't take out all instances of the service that pod runs. In the example above, pod instances of the same type are spread out across the 3 nodes.

Kubernetes has a leader-follower architecture, meaning that of the nodes, a small subset is chosen to lead / manage the others. The leaders are referred to as master nodes while the followers are referred to as worker nodes.

Kroki diagram output

A master node can still run pods just like the worker nodes, but some of its resources will be tied up for the purpose of managing worker nodes.

Taints

A taint is a node property that repels pods, either as a preference or as a hard requirement. Each taint is defined as a key-value pair along with an effect that defines how it works (the value can be null, leaving just a key and effect). An effect can be either ...

🔍SEE ALSO🔍

Multiple taints on a node repel pods based on each taint.

⚠️NOTE️️️⚠️

The multiple taints paragraph is speculation. I think this is how it works.

A taint is formatted as key=value:effect. Node taints can be added and removed via the command line.

kubectl taint node my-staging-node-1 environment-type=production:NoExecute  # Add taint
kubectl taint node my-staging-node-1 environment-type=production:NoExecute- # Remove taint (note the - at the end)

⚠️NOTE️️️⚠️

Can this be done via a manifest as well? Probably, but it seems like the primary way to handle this is either through kubectl or via whatever cloud provider's managed Kubernetes web interface.

Volume

Volumes are disks where data can be persisted across container restarts. Normally, Kubernetes resets a container's filesystem each time that container restarts (e.g. after a crash or a pod getting moved to a different node). While that works for some types of applications, other application types such as database servers need to retain state across restarts.

Volumes in Kubernetes are broken down into "persistent volumes" and "persistent volume claims". A ...

The idea is that a persistent volume itself is just a floating block of disk space. Only when it's claimed does it have an assignment. Pods can then latch on to those assignments.

Kroki diagram output

In the example above, there are 4 volumes in total but only 3 of those volumes are claimed. podA latches on to claim1 and claim2 while podB latches on to claim3 and claim2 (both pods can access the volume claimed in claim2).
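
As a sketch (not in the original) of what "latching on" looks like in a pod manifest, reusing the podA / claim1 / claim2 names from the example above (the mount paths are made up):

apiVersion: v1
kind: Pod
metadata:
  name: podA
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      volumeMounts:
        - name: data1
          mountPath: /var/data1   # Backed by the volume assigned to claim1.
        - name: data2
          mountPath: /var/data2   # Backed by the volume assigned to claim2.
  volumes:
    - name: data1
      persistentVolumeClaim:
        claimName: claim1
    - name: data2
      persistentVolumeClaim:
        claimName: claim2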

⚠️NOTE️️️⚠️

Persistent volumes themselves are cluster-level kinds while persistent volume claims are namespace-level kinds. All volumes are available for claims regardless of the namespace that claim is in. Maybe you can limit which volumes can be claimed by using labels / label selectors?

⚠️NOTE️️️⚠️

Part of the reasoning for doing it like this is decoupling: volumes are independent from pods, and a single volume can be shared across pods.

Another reason is that a developer should only be responsible for claiming a volume while the cluster administrator should be responsible for setting up those volumes and dealing with backend details like the specifics of the volume type and how large each volume is. As a developer, you only have to make a "claim" while the administrator is responsible for ensuring those resources exist.

Example persistent volume manifest:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-vol
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Recycle  # once a claim on this volume is given up, delete the files on disk
  awsElasticBlockStore:
    volumeID: volume-id
    fsType: ext4

Example persistent volume claim manifest:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-claim
spec:
  resources:
    requests:
      storage: 1Gi  # volume must have at least this much space
  accessModes:
    - ReadWriteOnce  # volume must have this access mode
  storageClassName: ""  # MUST BE EMPTY STRING to claim test-vol described above (if set, uses dynamic provisioning)

⚠️NOTE️️️⚠️

Why must spec.storageClassName be an empty string instead of being removed entirely? Being removed entirely would cause Kubernetes to use a default storage class name (if one exists), which is not what you want. Storage classes are described in the next few paragraphs below.

There are two types of volume provisioning available: static provisioning, where an administrator creates persistent volumes ahead of time, and dynamic provisioning, where persistent volumes are created on demand through a storage class.

Dynamic provisioning only requires that you make a persistent volume claim with a specific storage class name. The administrator is responsible for ensuring a provisioner exists for that storage class and that provisioner automatically creates a volume of that type when a claim comes in. Each storage class can have different characteristics such as volume type (e.g. HDD vs SSD), volume read/write speeds, backup policies, etc.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2

Capacity

Each persistent volume has a storage capacity.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-vol
spec:
  capacity:
    storage: 10Gi  # Capacity of the persistent volume.

A persistent volume claim can then be set to a capacity within some range.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-claim
spec:
  resources:
    requests:
      storage: 1Gi  # Minimum capacity that the persistent volume should have.
    limits:
      storage: 5Gi  # Maximum capacity that the persistent volume should have.

Access Modes

A persistent volume can support multiple access modes:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-vol
spec:
  # Access modes for the persistent volume listed here.
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  capacity:
    storage: 10Gi

A persistent volume claim can then be set to target one or more access modes.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-claim
spec:
  # Persistent volume selected must have these access modes
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

⚠️NOTE️️️⚠️

A claim takes a list of access modes, so is it that a claim needs to get a volume with all access modes present or just one of the access modes present?

⚠️NOTE️️️⚠️

Not all persistent volume types support all access modes. Types are discussed further below.

Reclaim Policy

A persistent volume claim, once released, may or may not make the persistent volume claimable again depending on the volume reclaim policy. The options available are ...

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-vol
spec:
  persistentVolumeReclaimPolicy: Recycle  # Recycle the persistent volume once released
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce

If the data on disk is critical to operations, the option to choose will likely be Retain.

⚠️NOTE️️️⚠️

For retain specifically, once the existing persistent volume claim is released, the persistent volume itself goes into "Released" status. If it were available for reclamation, it would go into "Available" status. The book mentions that there is no way to "recycle" a persistent volume that's in "Released" status without destroying and recreating it.

According to the k8s docs, this is the way it is so that users have a chance to manually pull out data considered precious before it gets destroyed.
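
If needed, the reclaim policy of an existing persistent volume can also be changed in place (e.g. switching it to Retain before releasing a claim). A hedged one-liner, reusing the test-vol name from the earlier example:

kubectl patch pv test-vol -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'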

⚠️NOTE️️️⚠️

Not all persistent volume types support all reclaim policies. Types are discussed further below.

Types

A persistent volume needs to come from somewhere, either via a cloud provider or using some internally networked (or even local) disks. There are many volume types: AWS elastic block storage, Azure file, Azure Disk, GCE persistent disk, etc. Each type has its own set of restrictions, such as which access modes it supports or the types of nodes it can be mounted on.

The configuration for each type is unique. The following are sample configurations for popular types...

⚠️NOTE️️️⚠️

The documentation says that a lot of these types are deprecated and being moved over to something called CSI (container storage interface), so these examples may need to be updated in the future.

# Amazon Elastic Block Storage
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-vol
spec:
  awsElasticBlockStore:
    volumeID: volume-id  # a volume with this ID must already exist in AWS
    fsType: ext4
# Google Compute Engine Persistent Disk
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-vol
spec:
  gcePersistentDisk:
    pdName: test-vol  # a disk with this name must already exist in GCE
    fsType: ext4
# Azure Disk
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-vol
spec:
  azureDisk:
    # a volume with this name and URI must already exist in Azure
    diskName: test.vhd
    diskURI: https://someaccount.blob.microsoft.net/vhds/test.vhd
# Host path
#   -- this is a path on the node that the pod gets scheduled on, useful
#      for debugging purposes.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-vol
spec:
  hostPath:
    path: /data

Storage Classes

↩PREREQUISITES↩

Defining a storage class allows for dynamic provisioning of persistent volumes per persistent volume claim.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
# The two fields below ("provisioner" and "parameters") define how persistent volumes are to
# be created and are unique to each volume type. In this example, the storage class
# provisions new persistent volumes on AWS. Any persistent volume claim with storage class
# name set to `standard` will call out to this AWS elastic store provisioner to create a
# persistent volume of type "awsElasticBlockStore" which gets assigned to it.
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
# This field ("reclaimPolicy") maps to a persistent volume's
# "persistentVolumeReclaimPolicy", except that "Recycle" isn't one of the allowed options:
# Only "Delete" and "Retain" are allowed. This example uses "Retain". If unset, the reclaim
# policy of a dynamically provisioned persistent volume is "Delete". 
reclaimPolicy: Retain
# When this field ("allowVolumeExpansion") is set to true, the persistent volume can be
# resized by editing the persistent volume claim object. Only some volume types support
# volume expansion. This example will work because AWS elastic block store volume types do
# support volume expansion.
allowVolumeExpansion: true

To use a storage class in a persistent volume claim, supply the name of that storage class.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-claim
spec:
  storageClassName: standard  # Use the storage class defined above for this claim
  resources:
    requests:
      storage: 1Gi
  accessModes:
    - ReadWriteOnce

⚠️NOTE️️️⚠️

Since these persistent volumes are being dynamically provisioned, it doesn't make sense to have Recycle. You can just use Delete, and if a new claim comes in, it'll automatically provision a new volume. That's essentially the same thing as Recycle.

If a persistent volume claim provides no storage class name, that persistent volume claim will use whatever storage class Kubernetes has set as its default. Recall that leaving the storage class name unset is not the same as leaving it as an empty string. To leave unset means to keep it out of the declaration entirely. If the storage class name is ...

Most Kubernetes installations have a default storage class available, identified by the storage class having the annotation storageclass.kubernetes.io/is-default-class=true.

⚠️NOTE️️️⚠️

The following example shows the default storage class on microk8s.

# kubectl get sc
# Note how the name identifies it as the default.
NAME                          PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
microk8s-hostpath (default)   microk8s.io/hostpath   Delete          WaitForFirstConsumer   false                  6s
# kubectl get sc microk8s-hostpath -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"},"name":"microk8s-hostpath"},"provisioner":"microk8s.io/hostpath","volumeBindingMode":"WaitForFirstConsumer"}
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2022-07-22T19:41:28Z"
  name: microk8s-hostpath
  resourceVersion: "2775"
  uid: 1df92cbc-6e2f-4726-a487-a81b1fcd8d2b
provisioner: microk8s.io/hostpath
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Endpoints

Endpoints (plural) is a kind that simply holds a list of IP addresses and ports. It's used by higher-level kinds to simplify routing. For example, an endpoints object may direct to all the nodes that make up a sharded database server.

Example manifest:

apiVersion: v1
kind: Endpoints
metadata:
  name: database
subsets: 
  - addresses:
      - ip: 10.10.1.1
      - ip: 10.10.1.2
      - ip: 10.10.1.3
    ports:
      - port: 5432 
        protocol: TCP  # TCP or UDP, default: TCP
        name: pg
  - addresses:
      - ip: 10.13.4.101
      - ip: 10.13.4.102
      - ip: 10.13.4.103
    ports:
      - port: 12345
        protocol: TCP  # TCP or UDP, default: TCP
        name: pg2

The endpoints example above points to ...

Service

Services are a discovery and load balancing mechanism. A service exposes a set of pods under a single, fixed hostname and IP, load balancing incoming requests across that set. Any external application needs to use the service's hostname because the IPs / hosts of the individual pod instances aren't fixed, exposed, or known. That is, pods are transient and aren't guaranteed to always reside on the same node. As they shut down, come up, restart, move between nodes, etc., there's no implicit mechanism that requestors can use to route their requests accordingly.

A service fixes this by internally tracking such changes and providing a single unified point of access.

Kroki diagram output

⚠️NOTE️️️⚠️

The book mentions why DNS can't be used directly. For example, having a basic DNS service which returns a list of all up-and-running pod IPs won't work because ...

  1. applications and operating systems often cache DNS results, meaning that changes won't be visible immediately.
  2. applications often only use the first IP given back by a DNS result, meaning that requests won't balance.

The service fixes this because it acts as a load balancing proxy and its IP / host never changes (DNS caching won't break anything).

Example manifest:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  ports:
    - name: webapp-port
      protocol: TCP
      port: 80
      targetPort: 9376

⚠️NOTE️️️⚠️

Internally, an endpoints object is used to track pods. When you create a service, Kubernetes automatically creates an accompanying endpoints object that the service makes use of.

Routing

↩PREREQUISITES↩

A service determines which pods it should route traffic to via label selectors.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  # Label selectors that pick out the pods this service routes to.
  selector:
    key1: value1
    key2: value2
    key3: value3
  ports:
    - name: webapp-port
      protocol: TCP
      port: 80
      targetPort: 9376

Internally, the service creates and manages an endpoints object containing the IP and port for each pod captured by the selector. If no selectors are present, the service expects an endpoints object with the same name to exist, where that endpoints object contains the list of IP and port pairs that the service should route to.

apiVersion: v1
kind: Endpoints
metadata:
  name: database  # Must be same name as the service
subsets: 
  - addresses:
      - ip: 10.10.1.1
      - ip: 10.10.1.2
      - ip: 10.10.1.3
    ports:
      - port: 5432

If no label selectors are present but the service's type is set to ExternalName, the service will route to some user-defined host. This is useful for situations where you want to hide the destination, such as an external API that you also want to mock for development / testing.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ExternalName
  externalName: api.externalcompany.com  # Route to this host
  ports:
    - name: api-port
      protocol: TCP
      port: 8080
      targetPort: 5000

⚠️NOTE️️️⚠️

If not set, spec.type defaults to ClusterIP. That's the type used when selectors are used to create an endpoints object or when a custom endpoints object is used.

Ports

↩PREREQUISITES↩

A service can listen on multiple ports.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  # In most cases, each port entry has ...
  #
  #  * "name", which is a friendly name to identify the port (optional)
  #  * "protocol", which is either `TCP` or `UDP` (defaults to `TCP`).
  #  * "port", which is the port that the service listens on.
  #  * "targetPort", which  is the port that requests are forwarded to on the pod (defaults
  #     to value set for "port").
  ports:
    - name: webapp-port
      protocol: TCP
      port: 80
      targetPort: 9376
    - name: api-port
      protocol: TCP
      port: 8080
      targetPort: 1111

⚠️NOTE️️️⚠️

Not having a name makes it more difficult for pods to discover a service. Discussed further in the service discovery section.

The example above forwards requests on two ports. Requests on port ...

Ports may also reference the names of ports in a pod. For example, the following pod provides names for its ports.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      ports:
        - name: my-http-port  # Name for the port
          containerPort: 8080
          protocol: TCP

In the service targeting that pod, you can use my-http-port as a target port.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  ports:
    - name: webapp-port
      protocol: TCP
      port: 80
      targetPort: my-http-port  # The name of the port in the pod

⚠️NOTE️️️⚠️

Does this work for manual endpoints as well? When a selector isn't used with a service, it looks for an endpoints object of the same name as the service to figure out where the service should route to. That endpoints object can have names associated with its ports as well.

⚠️NOTE️️️⚠️

A service decides which pods it routes to based on the key-value pairs in spec.selector. What happens if the key-value pairs identify a set of pod instances where some of those instances don't have a port named my-http-port? For example, a service may be forwarding to two applications rather than a single application, where both just happen to share the same set of key-value labels (pod instances are heterogeneous).

Maybe this isn't possible with Kubernetes?

Health

↩PREREQUISITES↩

The service periodically probes the status of each pod to determine if it can handle requests or not. Two types of probes are performed: liveness probes, which check whether a container is still functioning (and restart it if not), and readiness probes, which check whether a pod is ready to receive traffic.

These probes are defined directly in the pod manifest.

Kroki diagram output

⚠️NOTE️️️⚠️

Recall that, when a service has selectors assigned, Kubernetes internally maintains an endpoints object that contains the addresses of ready and healthy pods. The addresses in this endpoints object are what the service routes to.
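
A minimal sketch (not from the original text) of what probe definitions look like inside a pod manifest; the paths, ports, and timings are made up:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image:1.0
      ports:
        - containerPort: 8080
      readinessProbe:              # Pod only receives service traffic while this passes.
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:               # Container gets restarted if this keeps failing.
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20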

Headless

↩PREREQUISITES↩

A headless service is one in which there is no load balancer forwarding requests to pods / endpoints. Instead, the domain for the service resolves to the list of ready IPs for the pods (or endpoints) that the service is for.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  # When this field ("clusterIP") is set to "None", the service is a "headless service".
  clusterIP: None
  selector:
    app: MyApp
  ports:
    - name: webapp-port
      protocol: TCP
      port: 80

Generally, headless services shouldn't be used because DNS queries are typically cached by the operating system. If the IPs that a service forwards to change, apps that have recently queried the service's DNS will continue to use the old (cached) set of IPs until the operating system purges its DNS cache.

Session Affinity

How a service decides to forward incoming requests to the pod instances assigned to it is controlled via a session affinity field. Assigning a value of ...

When using ClientIP, a maximum session "sticky time" may also be provided.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  # The two fields below ("sessionAffinity" and "sessionAffinityConfig") define how session
  # affinity works.
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10000  # Default value is 10800, which is 3 hours
  selector:
    app: MyApp
  ports:
    - name: webapp-port
      protocol: TCP
      port: 80

⚠️NOTE️️️⚠️

When using ClientIP, what happens when the service runs out of memory to track client IPs? An LRU algorithm to decide which to keep / discard?

⚠️NOTE️️️⚠️

The book mentions that because services work on the TCP/UDP level and not at HTTP/HTTPS level, forwarding requests by tracking session cookies isn't a thing.

Exposure

The service type defines where and how a service gets exposed. For example, a service may only be accessible within the cluster, to specific parts of the cluster, to an external network, or to the public Internet.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ClusterIP  # Service type
  selector:
    app: MyApp
  ports:
    - name: webapp-port
      protocol: TCP
      port: 80

If not specified, the type is ClusterIP, meaning that it's exposed only locally within the cluster.

Local

↩PREREQUISITES↩

Services of type ClusterIP / ExternalName are only accessible from within the cluster. The hostname of such a service breaks down as follows: NAME.NAMESPACE.svc.CLUSTER

Depending on what level you're working in, a hostname may be shortened. For example, if the requestor and the service are within ...

The IP for a ClusterIP / ExternalName service is stable as well, just like the hostname.
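
As a sketch, assuming a service named my-service in the default namespace and the common cluster domain cluster.local, a pod inside the cluster could reach it at any of the following:

curl http://my-service.default.svc.cluster.local   # fully qualified
curl http://my-service.default                     # from any namespace
curl http://my-service                             # from within the same namespace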

⚠️NOTE️️️⚠️

Internally, a ClusterIP service uses kube-proxy to route requests to relevant pods (endpoints).

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ClusterIP  # ClusterIP service type
  selector:
    app: MyApp
  ports:
    - name: webapp-port
      protocol: TCP
      port: 80

Node Port

Services of type NodePort are accessible from outside the cluster. Every worker node opens a port (either user-defined or assigned by the system) that routes requests to the service. Since nodes are transient, there is no single point of access to the service.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: NodePort  # Nodeport service type
  selector:
    app: MyApp
  ports:
    # When "NodePor"` is used as the type, each port should have "nodePort" field that
    # defines the port on the worker nodes to open.
    - protocol: TCP
      port: 80
      targetPort: 9376
      nodePort: 8080

🔍SEE ALSO🔍

Load Balancer

Services of type LoadBalancer are accessible from outside the cluster. When the LoadBalancer type is used, the cloud provider running the cluster provisions its version of a load balancer and points it at the service, which then routes requests on to the relevant pods (typically by forwarding to ports opened on the worker nodes, as with NodePort).

There is no built-in Kubernetes implementation of a load balancer. Kubernetes provides the interface but someone (usually the cloud provider) must provide the implementation for the functionality to be there. This is because load balancers come in multiple forms: software load balancers, cloud provider load balancers, and hardware load balancers. Each has a unique way it needs to be configured, but the service abstraction hides those details. Routing based on HTTP details (e.g. the Host header) is handled by ingress instead, covered further below.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: LoadBalancer
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
      nodePort: 8080
    ...

Once provisioned, the object will have the field status.loadBalancer.ingress.ip added to it, which states the IP of the load balancer forwarding requests to this service.

# <REMOVED PREAMBLE>
spec:
  type: LoadBalancer
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
      nodePort: 8080
status:
  loadBalancer:
    ingress:
      - ip: 192.0.5.6

⚠️NOTE️️️⚠️

You can also use kubectl to get a list of services and it'll also list out the public IP.

⚠️NOTE️️️⚠️

The book says that a load balancer type is a special case of node port type.

Ingress

↩PREREQUISITES↩

Similar to a service of type LoadBalancer, an ingress object is a load balancer with a publicly exposed IP. However, rather than load balancing at the TCP/UDP level, an ingress object acts as a load balancing HTTP proxy server. An HTTP request coming into an ingress object gets routed to one of many existing services based on the request's Host header and path. This is useful because the cluster can expose several services under a single public IP address.

Kroki diagram output

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
  - host: stats.myhost.com
    http:
      paths:
      - path: /graphana
        pathType: Prefix
        backend:
          service:
            name: graphana-service
            port:
              number: 80
  - host: api.myhost.com
    http:
      paths:
      - path: /v2
        pathType: Prefix
        backend:
          service:
            name: api-service-v2
            port:
              number: 80
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: api-service-v1
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service-v2
            port:
              number: 80

⚠️NOTE️️️⚠️

According to the book, most if not all implementations of ingress simply query the service for its endpoints and directly load balance across them vs forwarding the request through that service. Note that the port in the example above is still the port that the service is listening on, not the port that the pod is listening on.

Hosts

The host in each rule can be either an exact host or it can contain a wildcard (e.g. *.api.myhost.com). Each name in the host (the parts separated by dots) intended as a wildcard must explicitly have an asterisk in its place. The asterisk must be present and covers exactly that one name. For example, the rule below will match ONE.api.myhost.com, but not TWO.THREE.api.myhost.com or api.myhost.com.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
  - host: "*.api.myhost.com"
    - path: /v2
      pathType: Prefix
      backend:
        service:
          name: api-service-v2
          port:
            number: 80

Path Type

Each rule entry should have a path type associated with it. It can be set to Exact, Prefix, or ImplementationSpecific.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
  - host: api.myhost.com
    http:
      paths:
      - path: /my/prefix/path
        pathType: Prefix
        backend:
          service:
            name: api-service-v2
            port:
              number: 80

The most common path type is Prefix. A type of Prefix splits the path using / and matches the rule if the incoming request's path starts with the same path elements as the rule's path. Trailing slashes are ignored (e.g. /p1/p2/p3/ and /p1/p2/p3 are equivalent).

⚠️NOTE️️️⚠️

What about ImplementationSpecific? There are different types of ingress controllers, each of which has its own configuration options. An ingress class is something you can put into your ingress object that contains "configuration including the name of the controller that should implement the class." It seems like an advanced topic and I don't know enough to write about it. Probably not something you have to pay attention to if you're doing basic cloud stuff.

TLS Traffic

↩PREREQUISITES↩

Assuming you have a TLS certificate and key files for the host configured on the ingress object, you can add those into Kubernetes as a secret and configure the ingress object to make use of it.

# openssl genrsa -out tls.key 2048
# openssl req -new -x509 -key tls.key -out tls.crt -days 360 -subj /CN=api.myhost.com
# kubectl create secret tls my-api-tls --cert=tls.crt --key=tls.key
apiVersion: v1
kind: Secret
metadata:
  name: my-api-tls
type: kubernetes.io/tls
data:
  tls.crt: base64 encoded cert
  tls.key: base64 encoded key

Each certificate secret used by an ingress object has its own entry that contains the hosts it supports.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  # TLS certificates are listed here. Each entry has a certificate secret name under
  # "secretName" and the domain(s) supported by that certificate under "hosts". Hosts must
  # match hosts explicitly listed in the ingress object's rules.
  tls:
    - secretName: my-api-tls
      hosts:                
        - api.myhost.com
    - secretName: my-stats-tls
      hosts:
        - stats.myhost.com
      

Once an encrypted request comes in to the ingress controller, it's decrypted. That decrypted request then gets forwarded to the service it was intended for.

⚠️NOTE️️️⚠️

From the k8s website:

You need to make sure the TLS secret you created came from a certificate that contains a Common Name (CN), also known as a Fully Qualified Domain Name (FQDN) for https-example.foo.com.

Keep in mind that TLS will not work on the default rule because the certificates would have to be issued for all the possible sub-domains. Therefore, hosts in the tls section need to explicitly match the host in the rules section.

⚠️NOTE️️️⚠️

The book mentions that CertificateSigningRequest is a special type of kind that will sign certificates for you, if it was set up. You can issue requests via kubectl certificate approve csr_name and it'll either automate it somehow or a human will process it? Not sure exactly what's going on here.

Namespace

A namespace is a kind used to avoid naming conflicts. For example, it's typical for a Kubernetes cluster to be split up into development, testing, and production namespaces. Each namespace can have objects with the same names as those in the other two namespaces.

apiVersion: v1
kind: Namespace
metadata:
  name: production

Namespaces are cluster-level objects. This is contrary to most other kinds in Kubernetes, which are namespace-level objects, meaning that a namespace can be used to disambiguate objects of that type with the same name...

# These are namespace-level objects
apiVersion: v1
kind: Pod
metadata:
  name: mypod
  namespace: testing  # put into the testing namespace
spec:
  containers:
  - name: mypod
    image: my_image:v2_alpha5
---
apiVersion: v1
kind: Pod
metadata:
  name: mypod
  namespace: production  # put into the production namespace
spec:
  containers:
  - name: mypod
    image: my_image:v1

If a namespace-level object doesn't set a namespace, the namespace defaults to default.
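
A few illustrative kubectl commands for working with namespaces (names are made up):

kubectl create namespace production                        # Create a namespace imperatively
kubectl get pods --namespace production                    # List pods in a specific namespace
kubectl config set-context --current --namespace=testing   # Change kubectl's default namespace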

Replica Set

↩PREREQUISITES↩

⚠️NOTE️️️⚠️

Replica sets deprecate replication controllers.

A replica set is an abstraction that's used to ensure a certain number of copies of some pod are always up and running. Typical scenarios where replica sets are used include ...

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-replicaset
spec:
  replicas: 2  # Number of pod copies to run.
  # Selectors are label selectors used to identify pods, which match the key-value pairs
  # used for pod template labels further down.
  selector:
    matchLabels:
      app: my-app
  # A template of the pod to launch when there aren't enough copies currently running.
  # Everything under "template" is essentially a pod manifest, except the "apiVersion" and
  # "kind" aren't included.
  template:
    metadata:
      name: my-pod
      # These labels are how this replica set will determine how many copies are running. It
      # will look around for pods with this set of labels.
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: nginx

Recall that, to link objects together, Kubernetes uses loosely coupled linkages via labels rather than hierarchical parent-child relationships. As such, the pod template should have a unique set of labels assigned that the replica set can look for to determine how many instances are running. Regardless of how those instances were launched (via the replica set or something else), the replica set will account for them. In the example above, the replica set determines which pod instances it's responsible for by looking for the label named app and ensuring it's set to my-app.

⚠️NOTE️️️⚠️

According to the k8s docs, it may be a parent-child relationship. Apparently looking for labels is just an initial step to permanently bringing pods under the control of a specific replica set:

A ReplicaSet is linked to its Pods via the Pods' metadata.ownerReferences field, which specifies what resource the current object is owned by. All Pods acquired by a ReplicaSet have their owning ReplicaSet's identifying information within their ownerReferences field. It's through this link that the ReplicaSet knows of the state of the Pods it is maintaining and plans accordingly.

What happens when two replica sets try "owning" the same pod?

A replica set's job is to ensure that a certain number of copies of a pod template are running. It won't retain state between its copies or do any advanced orchestration. Specifically, a replica set ...

⚠️NOTE️️️⚠️

If one of the replicas is in a loop where it's constantly crashing and restarting, that replica will stay as-is in the replica set. It won't automatically get moved to some other node / forcefully removed and re-added as a new replica.

⚠️NOTE️️️⚠️

You can distinguish a pod created by a replica set vs one created manually by checking the annotation key kubernetes.io/created-by on the pod.

When deleting a replica set, use --cascade=false with kubectl if you don't want the pods created by the replica set to get deleted as well.
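
As a hedged sketch of the commands involved, using the my-replicaset name from the example above:

kubectl scale rs my-replicaset --replicas=5       # Change the number of copies
kubectl delete rs my-replicaset --cascade=false   # Delete the replica set but keep its pods
                                                  # (newer kubectl versions use --cascade=orphan)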

🔍SEE ALSO🔍

Deployment

↩PREREQUISITES↩

Deployments are similar to replica sets but make it easy to do rolling updates, where the update happens while the application remains online and still services requests. Old pods are transitioned to new pods as a stream instead of all at once, ensuring that the application is responsive throughout the upgrade process. Likewise, deployments allow for rolling back should an update encounter any problems.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
# The manifest for a deployment builds off the manifest for a replica set. Most replica set
# fields are present: Number of copies, label selectors, pod template, etc... In addition,
# it supports several other fields specific to doing updates.
spec: 
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: my-image:1.0
  strategy:
    type: RollingUpdate  # How the deployment should perform updates (default value)

A deployment provides mechanisms to control how an update happens (e.g. all at once vs gradual), if an update is deemed successful (e.g. maximum amount of time a rollout can take), fail-fast for bad updates, and rollbacks to revert to previous versions. These features are discussed in the subsections below.

⚠️NOTE️️️⚠️

If one of the replicas is in a loop where it's constantly crashing and restarting, that replica will stay as-is in the deployment. It won't automatically get moved to some other node / forcefully removed and re-added as a new replica.

⚠️NOTE️️️⚠️

The same gotchas with replica sets also apply to deployments: all pods will use the same persistent volume claim and IPs / hosts aren't retained when pods are replaced.

Like with replica sets, you might have to use --cascade=false in kubectl if you don't want the pods created by the deployment to get deleted as well (unsure about this).

🔍SEE ALSO🔍

Updates

↩PREREQUISITES↩

A deployment can be set to one of two update strategies: Recreate, which tears down all old pods before bringing up the new ones, and RollingUpdate, which gradually replaces old pods with new ones.

Recreate is simple but results in downtime (a period of time where no pods are running). Most enterprise applications have up-time guarantees and as such require RollingUpdate. RollingUpdate has several options that control the flow and timing of how pods go down and come up, documented in the example below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: my-image:1.0
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # "maxUnavailable" - During an update, this is the number (or percentage) of pods
      # that can be unavailable relative to the number of replicas. Since this deployment
      # has 10 replicas, the parameter below is instructing that the number of replicas
      # can't go below 8 during an update (at most 2 pods may be unavailable).
      #
      # This can also be given as a percentage string (e.g. "25%" means 25 percent of the
      # replica count may be unavailable during an update).
      maxUnavailable: 2
      # "maxSurge" - During an update, this is the number (or percentage) of excess pods
      # that can be available relative to the number of replicas. Since this deployment
      # has 10 replicas, the parameter below is instructing that the number of replicas
      # can't go above 12 during an update (at most 2 extra pods may be running).
      #
      # This can also be given as a percentage string (e.g. "25%" means 25 percent extra
      # pods may be running during an update).
      maxSurge: 2
  # "minReadySeconds" - Once all of the readiness probes of a new pod succeed, this
  # is the number of seconds to wait before the deployment deems that the pod has
  # been successfully brought up. If any readiness probes within the pod fail during
  # this wait, the update is blocked.
  #
  # This is useful to prevent scenarios where pods initially report as ready but
  # revert to un-ready soon after receiving traffic.
  #
  # Note this sits at the spec level, not under "rollingUpdate".
  minReadySeconds: 10
  # "progressDeadlineSeconds" - This is the maximum number of seconds that is allowed
  # before progress is made. If this is exceeded, the deployment is considered stalled.
  #
  # Default value is 600 (10 minutes). This also sits at the spec level.
  progressDeadlineSeconds: 300

It's common for deployments to fail or get stuck for several reasons:

Rollbacks

A deployment retains update history in case it needs to rollback. The mechanism used for this is replica sets: For each update, a deployment launches a new replica set. The replica set for the old version ramps down the number of pods while the replica set for the new version ramps up the number of pods. Once all pods have been transitioned to the new version, the old replica set (now empty of pods) is kept online.

It's good practice to limit the number of revisions kept in a deployment's update history because it limits the number of replica sets kept alive.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  revisionHistoryLimit: 5  # Keep the 5 latest revisions
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: my-image:1.0

⚠️NOTE️️️⚠️

You can inspect previous versions via kubectl rollout history deployment my-deployment. For each update, it's good practice to set the kubernetes.io/change-cause annotation to a custom message describing what was updated / why it was updated -- this shows up in the history.

You can rollback via kubectl rollout undo deployments my-deployment --to-revision=12345.
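
For example (a hedged sketch; the message and revision number are made up):

kubectl annotate deployment my-deployment kubernetes.io/change-cause="bump image to my-image:1.1"
kubectl rollout history deployment my-deployment
kubectl rollout undo deployment my-deployment --to-revision=2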

Stateful Set

↩PREREQUISITES↩

A stateful set is similar to a deployment but the pods it creates are guaranteed to have a stable identity and each pod can have its own dedicated storage volumes. In the context of stateful sets, ...

⚠️NOTE️️️⚠️

"Stable identity" doesn't imply that a replacement pod will be scheduled on the same node. The replacement may end up on another node.

A stateful set requires three separate pieces of information:

  1. a headless service, which acts as a gateway to a stateful set's pods (referred to as a governing service).
  2. a volume claim template, which templates persistent volume claims for a stateful set's pods.
  3. a pod template, which templates pods similar to pod template for a deployment.

These three pieces are represented as two separate objects: the governing service and the stateful set itself.

# Manifest #1: Headless service for the stateful set's pods (governing service).
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  clusterIP: None
  # Routes traffic to pods based on the following label selectors, which are the same
  # key-value pairs used for pod template labels of the stateful set further down.
  selector:
    app: my-app
  ports:
    - name: http
      port: 80
---
# Manifest #2: The stateful set itself, which contains both the pod template and the
# volume claim template.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-stateful-set
spec:
  # Selectors are label selectors used to identify pods, which match the key-value pairs
  # used for pod template labels further down.
  selector:
    matchLabels:
      app: my-app
  serviceName: my-service  # Name of headless service for stateful set (governing service).
  replicas: 3              # Number of replicas for the stateful set.
  # Persistent volume claims will be created based on the following template.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        resources:
          requests:
             storage: 1Mi
        accessModes:
          - ReadWriteOnce
  # Pods will be created based on the following template. Note that the volume mount
  # references the persistent volume claim template described above.
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: my-image:1.0
          ports:
            - name: http
              containerPort: 8080
          volumeMounts:
            - name: data
              mountPath: /var/data
  # Similar to deployments, stateful sets also support updating / rollback mechanisms,
  # but not exactly the same ones.
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  # Once all of the readiness probes of a new pod succeed, this is the number of seconds to
  # wait before the stateful set deems the pod to be available. No readiness probes within
  # the pod can fail during this wait.
  minReadySeconds: 10

The example above creates a stateful set that manages three pod replicas and a governing service for those pods. The pods created by the stateful set are numbered starting from 0: my-stateful-set-0, my-stateful-set-1, and my-stateful-set-2. In addition, each pod gets its own persistent volume claim mounted at /var/data containing a modest amount of storage space. That persistent volume claim will have the format data-my-stateful-set-N (where N is the ordinal suffix).

Kroki diagram output

The ordinal suffixes of a stateful set's pods are part of their stable identity. If a pod were to die, the volume for that stable identity will be re-bound to its replacement. Stateful sets take great care to ensure that no more than one pod will ever be running with the same stable identity so as to prevent race conditions (e.g. conflicts regarding IP / host, multiple pods using the same volume, etc..). In many cases, that means a pod won't be replaced until the stateful set is absolutely sure that it has died.

⚠️NOTE️️️⚠️

If one of the replicas is in a loop where it's constantly crashing and restarting, that replica will stay as-is in the stateful set. It won't automatically get moved to some other node / forcefully removed and re-added as a new replica.

🔍SEE ALSO🔍

Scaling

Because of stable identity guarantees and the fact that each stable identity can have its own distinct volumes, stateful sets have different scaling behavior than deployments. A stateful set scales pods based on the ordinal suffix of its pod names. When the number of replicas is ...

For example, given the stateful set my-stateful-set with 3 replicas (those replicas being pods my-stateful-set-0, my-stateful-set-1, and my-stateful-set-2), ...

The scaling behavior makes the stable identities of the pods being removed / added known beforehand. In contrast, a deployment's scaling behavior makes no guarantees as to which replicas get removed / added and in what order.

Kroki diagram output

Stateful sets scale one pod at a time to avoid race conditions that are sometimes present in distributed applications (e.g. which database server is the primary vs which database server is the replica). When scaling down, the persistent volumes for a pod won't be removed along with the pod. This is to avoid permanently deleting data in the event of an accidental scale down. Likewise, when scaling up, if a volume for that stable identity is already present, that volume gets attached instead of creating a new volume.

For example, given the same 3 replica my-stateful-set example above, changing the number of replicas to 1 will leave the volumes for my-stateful-set-1 and my-stateful-set-2 lingering undeleted.

Kroki diagram output

Changing the number of replicas back to 3 will then recreate my-stateful-set-1 and my-stateful-set-2, but those new pods will be assigned the lingering undeleted volumes from before rather than being assigned new volumes (all previous data will be present).

Kroki diagram output
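
Scaling can be done by editing the manifest's replicas field or imperatively via kubectl (a sketch using the names above):

kubectl scale statefulset my-stateful-set --replicas=1   # Scale down (volumes are left behind)
kubectl scale statefulset my-stateful-set --replicas=3   # Scale back up (pods re-attach the old volumes)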

A stateful set will not proceed with scaling until all preceding pods (ordinal suffix) are in a healthy running state. This is because, if a pod is unhealthy and the stateful set gets scaled down, it's effectively lost two members at once. This goes against the "only one pod can go down at a time" stateful set scaling behavior.

For example, given the same 3 replica my-stateful-set example above, scaling down to 1 replica will first shut down pod my-stateful-set-2 and then pod my-stateful-set-1. If my-stateful-set-2 shuts down but then my-stateful-set-0 enters into an unhealthy state, my-stateful-set-1 won't shut down until my-stateful-set-0 recovers. Likewise, if my-stateful-set-0 enters into an unknown state (e.g. the node running it temporarily lost communication with the control plane), my-stateful-set-1 won't shut down until my-stateful-set-0 is known and healthy.

⚠️NOTE️️️⚠️

The scaling guarantees described here can be relaxed through spec.podManagementPolicy. By default, this value is set to OrderedReady, which enables the behavior described in this section. If it were instead set to Parallel, the stateful set's scaling will launch / terminate pods in parallel and won't wait for preceding pods to be healthy.

Updates

↩PREREQUISITES↩

There are two templates in a stateful set: a pod template and a volume claim template.

A pod template has two different update strategies: OnDelete and RollingUpdate.

OnDelete is simple but requires user intervention to shutdown pods. RollingUpdate is similar to the RollingUpdate strategy for deployments, but it supports fewer parameters and its behavior is slightly different. Specifically, rolling updates for stateful sets support two parameters.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-ss
spec:
  selector:
    matchLabels:
      app: my-app
  serviceName: my-service
  replicas: 10
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: my-image:1.0
  # Rolling update strategy.
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # "maxUnavailable" - During an update, this is the number (or percentage) of pods
      # that can be unavailable relative to the number of replicas. Since this stateful
      # set has 10 replicas, the parameter below is instructing that the number of
      # replicas can't go below 9 during an update (at most 1 pod may be unavailable).
      #
      # This can also be given as a percentage string (e.g. "25%" means 25 percent of the
      # replica count may be unavailable during an update).
      maxUnavailable: 1
      # "partition" - Only pods with suffix ordinals that are >= to this number will
      # receive updates. All other pods will remain un-updated. For this stateful set,
      # that means only "my-ss-5", "my-ss-6", "my-ss-7", "my-ss-8", and "my-ss-9" get
      # updated.
      #
      # This is a useful feature for gradual / phased roll outs.
      partition: 5

⚠️NOTE️️️⚠️

Deployments also support a minReadySeconds setting. Stateful sets have the same feature, and in both cases it goes under the spec.minReadySeconds field (it isn't specific to rolling updates).

Rolling updates performed with a pod management policy of OrderedReady (the default) may get into a broken state which requires manual intervention to roll back. If an update results in a pod entering into an unhealthy state, the rolling update will pause. Reverting the pod template won't work because it goes against the "only one pod can go down at a time" behavior of stateful sets.

🔍SEE ALSO🔍

A volume claim template cannot be updated. The system will reject an updated stateful set if its volume claim template differs from the original. As such, users have devised various manual strategies for modifying volumes in a stateful set:

⚠️NOTE️️️⚠️

What about shrinking a volume? I imagine what you need to do is, starting with the last ordinal to the first (current pod denoted N), ...

  1. delete stateful set without deleting its pods (kubectl delete sts --cascade=orphan <name>).
  2. delete pod N.
  3. create a temporary volume with the new desired size.
  4. create a temporary pod with both the pod N's volume and the temporary volume attached.
  5. use the temporary pod to copy pod N's volume to the temporary volume.
  6. delete the temporary pod.
  7. delete pod N's volume.
  8. re-create pod N's volume with the new desired size (same name).
  9. create a temporary pod with both the pod N's volume and the temporary volume attached.
  10. use the temporary pod to copy the temporary volume to pod N's volume.
  11. re-create the stateful set (kubectl apply -f <manifest file>).
  12. trigger the stateful set to restart pods one at a time (kubectl rollout restart sts <name>).

The last step should restart the deleted pod, and that deleted pod will attach the updated volume.

These same steps may work for expanding volumes when allowVolumeExpansion isn't set to true.

Peer Discovery

A stateful set's governing service allows for its pods to discover each other (peer discovery). For each pod in a stateful set, that pod will have a sub-domain within that stateful set's governing service. For example, a stateful set with the name my-ss in the namespace apple within the cluster will expose its pods as ...

... , where each sub-domain points to one of the stable identities.

A governing service allows for enumerating all pods within its stateful set via DNS service records (SRV). In the example above, performing an SRV lookup on my-ss.apple.svc.cluster.local will list out the sub-domains and IPs for all my-ss pods.

⚠️NOTE️️️⚠️

If you have access to dig, you can do dig SRV my-ss.apple.svc.cluster.local and it'll list out all available sub-domains and their IPs for you.

⚠️NOTE️️️⚠️

If polling for the IP of a peer that hasn't come up yet, DNS negative caching might cause a small delay to discovering the IP when that peer actually comes up.

Job

↩PREREQUISITES↩

A job launches one or more pods to perform one-off tasks. Once those one-off tasks complete, the job is effectively over. Typical job use-cases include ...

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  parallelism: 5  # Num of pods that can run at the same time (default is 1).
  completions: 10 # Num of pods that must successfully finish for job to end (default is 1).
  backoffLimit: 4 # Max num of retries of a failed pod before failing job (default is 6).
  activeDeadlineSeconds: 99 # Max secs before job forcibly fails, killing all pods.
  # Completion mode, when set to "Indexed", provides an ordinal suffix / stable identity to
  # each launched pod, similar to how a stateful set provides its pods with a stable identity.
  # This is useful in cases where the pods of a job need to communicate with each other (e.g.
  # distributed work-queue processing), but a service will likely also need to be provided.
  #
  # In most cases, this should be set to "NonIndexed" (default value).
  completionMode: NonIndexed
  # Pod template describing job's pods.
  template:
    spec:
      containers:
        - name: my-container
          image: my-image:1.0
      # Restart policy of the launched pods. For jobs, this must be set to either
      # "OnFailure" or "Never".
      restartPolicy: OnFailure

The example job runs 10 pods to successful completion, keeping up to 5 concurrently running at any one time. If a pod fails, the job will retry it up to 4 times before failing the job entirely. Similarly, the job itself runs for no more than 99 seconds before being forcibly failed.

Common gotchas with jobs:

Cleanup

One common problem with jobs is resource cleanup. Except for failed pods that have been retried (backoffLimit field), a completed job won't delete its pods by default. Those pods are kept around in a non-running state so that their logs can be examined if needed. Likewise, the job itself isn't deleted on completion either.

⚠️NOTE️️️⚠️

This became a problem for me when using Amazon EKS with Amazon Fargate to run the job's pods. The Fargate nodes were never removed from the cluster because the job's pods were never deleted?

The problem with letting jobs and pods linger around in the system is that it causes clutter, putting pressure on the Kubernetes servers. Typically, it's up to the user to delete a job (deleting a job also deletes any lingering pods). However, there are other mechanisms that can automate the deletion of jobs:
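
One such mechanism (assuming a cluster where the TTL-after-finished controller is enabled, which is standard in recent Kubernetes versions) is the job's spec.ttlSecondsAfterFinished field. A sketch:

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  ttlSecondsAfterFinished: 300  # Delete this job (and its pods) 5 minutes after it finishes.
  template:
    spec:
      containers:
        - name: my-container
          image: my-image:1.0
      restartPolicy: OnFailure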

User-defined Labels

↩PREREQUISITES↩

By default, a job automatically picks out a unique label to identify its pods (such that it definitively knows which pods in the system belong to it). However, it's possible to give custom labels to a job's pods / give custom label selectors to a job. This is useful in cases where a job's pods need to communicate with each other: a headless service can target a job's pods based on those labels, which is similar to how a stateful set's pods communicate with and discover each other (governing service).

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  parallelism: 5  # Num of pods that can run at the same time (default is 1).
  completions: 10 # Num of pods that must successfully finish for job to end (default is 1).
  backoffLimit: 4 # Max num of retries of a failed pod before failing job (default is 6).
  activeDeadlineSeconds: 99 # Max secs before job forcibly fails, killing all pods.
  # Completion mode, when set to "Indexed", provides an ordinal suffix / stable identity to
  # each launched pod, similar to how a stateful set provides its pods with a stable identity.
  # This is useful in cases where the pods of a job need to communicate with each other (e.g.
  # distributed work-queue processing), but a service will likely also need to be provided.
  #
  # In most cases, this should be set to "NonIndexed" (default value).
  completionMode: NonIndexed
  # Selectors are label selectors used to identify pods, which match the key-value pairs
  # used for pod template labels further down.
  #
  # In most cases, you shouldn't need to specify this (or the labels in the pod template
  # below). When not present, the system will automatically pick labels / label selectors
  # that won't conflict with other jobs / pods.
  selector:
    matchLabels:
      app: my-app
  # When a custom selector is supplied, "manualSelector" must be set to true to tell the
  # system not to auto-generate labels / label selectors for this job.
  manualSelector: true
  # Pod template describing job's pods.
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: my-image:1.0
      # Restart policy of the launched pods. For jobs, this must be set to either
      # "OnFailure" or "Never".
      restartPolicy: OnFailure

Cron Job

↩PREREQUISITES↩

A cron job launches a job periodically on a schedule, defined in cron format.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob
spec:
  schedule: "0 * * * *"  # Schedule of the job in cron format (launches every hour)
  # How much tolerance to have (in seconds) for a scheduled run of a job that's been missed.
  # If a scheduled run gets missed for any reason but is identified within this window,
  # it'll run anyways. If it's past the window, it'll count as a failed job.
  #
  # This is an optional field. If not set, there is no deadline (infinite tolerance).
  startingDeadlineSeconds: 200
  # How should a job launch be treated if the previously launched job is still running.
  # If this is set to ...
  #  * "Allow", it allows the jobs to run concurrently (default value).
  #  * "Forbid", it skips the new job launch, meaning concurrently running jobs not allowed.
  #  * "Replace", it replaces the previously running job with the new job.
  concurrencyPolicy: Forbid
  # How many ended successful/failed jobs should remain in Kubernetes. If set to 0, a job and
  # its corresponding pods are removed immediately after ending. If  > 0, the last N jobs and
  # their pods will remain in Kubernetes (useful for inspection of logs).
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  # Job template that describes the job that a cron job launches. This is effectively a job
  # definition without "apiVersion", "kind", and "metadata" fields.
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: my-container
              image: my-image:1.0
          restartPolicy: OnFailure

⚠️NOTE️️️⚠️

There is no stable support for timezones. The timezone used by all cron jobs is whatever the timezone of the controller manager is (other parts of the doc say unspecified timezone). There currently is a beta feature that's gated off that lets you specify a timezone by setting spec.timeZone (e.g. setting it to Etc/UTC will use UTC time).

Common gotchas with cron jobs:

Daemon Set

↩PREREQUISITES↩

A daemon set ensures that a set of nodes each have a copy of some pod always up and running. Typical scenarios where a daemon set is used include ...

The above scenarios are ones which break container / pod isolation. That is, a daemon set is intended to run pods that are coupled to nodes and sometimes those pods will do things such as mount the node's root filesystem and run commands to either install software or gather information.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-ds
spec:
  # Selectors are label selectors used to identify pods, which match the key-value pairs
  # used for pod template labels further down.
  selector:
    matchLabels:
      app: my-app
  # Pod template describing daemon set's pods.
  template:
    metadata:
      name: my-pod
      # These labels are how this daemon set will determine if the pod is running on a node.
      # It will look around for pods with this set of labels.
      labels:
        app: my-app
    spec:
      # Put copies of this pod only on nodes that have these labels. There is also a
      # "nodeAffinity" field and "tolerations" field, which allow for more elaborate logic
      # / soft logic for node selection (too vast to cover here).
      #
      # If neither "nodeSelector" nor "nodeAffinity" is set, copies of this pod will run on
      # all nodes.
      nodeSelector:
        type: my-node-type
      containers:
        - name: my-container
          image: my-image:1.0
          resources:
            limits:
              cpu: 100m
              memory: 200Mi
          volumeMounts:
            - name: varlog
              mountPath: /host_log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log

The example above runs a copy of the pod template on each node that has the label type=my-node-type and mounts the host's /var/log directory to /host_log in the container. Most daemon sets are used for some form of monitoring or manipulation of nodes, so it's common to have volumes of the type hostPath, which mounts a directory that's directly on the node itself.

Like deployments and stateful sets, daemon sets support rolling updates, controlled via spec.updateStrategy (sketched below). With the default RollingUpdate strategy, any change to a daemon set's pod template or node selection criteria causes old pods to be deleted and updated pods to be brought up in their place, a limited number of nodes at a time. With the OnDelete strategy, an updated pod is only brought up on a node once the old pod on that node is manually deleted.
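
A minimal sketch of those update strategy fields, reusing the daemon set from the example above (only the relevant fields are shown).

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-ds
spec:
  # Controls how pod template changes are rolled out to nodes.
  updateStrategy:
    type: RollingUpdate   # Or "OnDelete" to only replace a pod once it's manually deleted.
    rollingUpdate:
      maxUnavailable: 1   # Max pods allowed to be down during the rollout (default is 1).
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: my-image:1.1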

Service Account

↩PREREQUISITES↩

A service account is a set of credentials that applications within a pod use to communicate with the Kubernetes API server. Service accounts also provide an aggregation point for image pull secrets and other security-related features.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-service-account
# A list of image pull secrets to use with private container registries. When this service
# account is applied to a pod, all image pull secrets in the service account get added to
# the pod.
imagePullSecrets:
  - name: my-dockerhub-secret
  - name: my-aws-ecr-secret
# API access credentials will never be mounted to any pod that uses this service account.
automountServiceAccountToken: false

By default, each namespace comes with its own service account which pods in that namespace automatically use. The service account used by a pod can be overridden to another service account.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  # Custom service account to use. This can't be changed once the pod's been created.
  serviceAccountName: my-service-account
  containers:
    - name: my-container
      image: my-registry.example/tiger/my-image:1.0

🔍SEE ALSO🔍

Horizontal Pod Autoscaler

↩PREREQUISITES↩

A horizontal pod autoscaler (HPA) periodically measures how much work a set of pod replicas are doing so that it can appropriately adjust the number of replicas on that replica set, deployment, or stateful set. If the pod replicas ...

⚠️NOTE️️️⚠️

This feature depends on a "metrics server" add-on, which most Kubernetes distributions run by default.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
spec:
  # What kind and name are being targeted for autoscaling? Is it a replica set, deployment,
  # stateful set, or something else?
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-ss
  # What are the minimum and maximum replicas that this HPA will scale to? At the time of
  # writing, the minimum number of replicas must be 1 or more (it can't be 0).
  minReplicas: 2
  maxReplicas: 10
  # What type of metrics are being collected? In this example, on average across all active
  # pod replicas, we want the CPU load to be 50%. If the average is more than 50%, scale
  # up the number of replicas. If it's less, scale down the number of replicas.
  #
  # The average utilization is referring to the amount of resource requested by the pod. In
  # this example, this is 50% of the CPU resource AS REQUESTED BY THE POD (via the pod's
  # spec.containers[].resources.requests for CPU).
  #
  # Instead of doing average percentage, you can also do absolute values by changing
  # target.type to AverageValue and replacing target.averageUtilization with
  # target.averageValue.
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

🔍SEE ALSO🔍

In the example above, if the average CPU usage of the stateful set my-ss replicas is ...

The HPA will at-most double the number of replicas on each iteration. Each scaling iteration has an intentional waiting period. Specifically, scaling ...

Common gotchas with HPAs:

⚠️NOTE️️️⚠️

In addition to horizontal pod autoscaler, there's a vertical pod autoscaler (VPA). A VPA will scale a single pod based on metrics and its resource requests / resource limits.

The VPA kind doesn't come built-in with Kubernetes. It's provided as an add-on package found here.
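
For reference, a VPA object looks roughly like the following (a minimal sketch, assuming the VPA add-on and its CRDs are installed; the exact fields may vary between add-on versions).

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-vpa
spec:
  # What kind and name are being targeted for vertical autoscaling?
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  # "Auto" lets the VPA evict pods so they restart with updated resource requests. "Off"
  # only produces recommendations without applying them.
  updatePolicy:
    updateMode: Auto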

Scaling Behavior

How an HPA scales up / down can be controlled through policies. It's common to control scaling behavior to reduce problems such as thrashing of replicas (constantly introducing and evicting replicas).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-ss
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  # Scaling behavior is defined through policies. Both scaling up and scaling down have
  # their own list of policies to select from. When a list contains more than one policy,
  # the one selected is defined by that direction's "selectPolicy" field.
  #
  # In addition to policies, scaling up and scaling down both have their own "stabilization
  # window" which is used to prevent thrashing of replicas (constantly introducing / evicting
  # replicas). The window tracks the highest replica count in the past n seconds and won't
  # let policies go through with setting it to a smaller value (e.g. use the highest
  # computed replica count over the past 5 mins).
  #
  # When many policies are present for a scale up / down, the policy chosen can be either
  # the one causing the ...
  #
  #  * most change (e.g. most pods added), set by using "Max" as the value of "selectPolicy".
  #  * least change (e.g. least pods removed), set by using "Min" as the value of "selectPolicy".
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Pods # Allow at most 4 pod replicas to be added in a span of 1 min
          value: 4
          periodSeconds: 60
        - type: Percent # Allow at most a 10% increase of pod replicas in a span of 1 min
          value: 10
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Max
      policies:
        - type: Pods  # Allow at most 3 pod replicas to be removed in a span of 2 mins
          value: 3
          periodSeconds: 120
        - type: Percent  # Allow at most a 10% decrease of pod replicas in a span of 5 mins
          value: 10
          periodSeconds: 300

⚠️NOTE️️️⚠️

Default scaling behavior is defined here.

Metric Types

Metrics can be one of the following types:

Resource is set directly by Kubernetes while Pods, Object, and External are for custom / user-defined metrics.

If multiple metrics are being tracked by an HPA, as in the example below, the HPA calculates a replica count for each metric and then chooses the highest.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-ss
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Target an average of 50% CPU utilization across all replicas
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    # Target an average of 1000 queries-per-second across all replicas.
    - type: Pods
      pods:
        metric:
          name: queries-per-second
        target:
          type: AverageValue
          averageValue: 1000
    # Target a total of 2000 requests-per-second as measured on the my-ingress ingress object.
    - type: Object
      object:
        metric:
          name: requests-per-second
        describedObject:
          apiVersion: networking.k8s.io/v1
          kind: Ingress
          name: my-ingress
        target:
          type: Value
          value: 2000

⚠️NOTE️️️⚠️

I've spent hours trying to figure out how replicas can send custom metrics that an HPA can scale on. There is barely any documentation on this and zero examples online. The closest thing I could find to a source is here.

The simplest solution here seems to be to use a third-party software package called Prometheus.

🔍SEE ALSO🔍

Cluster Autoscaler

↩PREREQUISITES↩

A cluster autoscaler is a component that scales a Kubernetes cluster on a public cloud by adding and removing nodes as needed. Each public cloud has its own implementation of a cluster autoscaler. In some clouds, the implementation is exposed as a kind / set of kinds. In other clouds, the implementation is exposed as a web interface or a command-line interface.

In most cases, nodes are added to and removed from predefined groups called node pools. Each node pool has nodes of the same type (same resources and features). For example, a node pool might consist of machines with the same CPU, networking gear, and the same amount and type of RAM.

If a pod is scheduled to run but none of the nodes have enough resources to run it, the cluster autoscaler will increase the number of nodes in one of the node pools that has the capability to run the pod. Likewise, the cluster autoscaler will decrease the number of nodes if nodes aren't being utilized enough by actively running pods.

⚠️NOTE️️️⚠️

It doesn't seem to be consistent so there isn't much else to put about cluster autoscaling. See the cluster autoscaler section on this website and navigate to whatever public cloud you're using.

🔍SEE ALSO🔍

Pod Disruption Budget

↩PREREQUISITES↩

A pod disruption budget (PDB) specifies the number of downed pod replicas that a replica set, deployment, or stateful set can tolerate relative to its expected replica count. In this case, a disrupted pod is one that's brought down via ...

Once a PDB has reached its downed pod replica limit, it prevents further downing of pod replicas via voluntary disruptions. It cannot prevent further downing of pod replicas via involuntary disruptions or rolling updates (these downed pods will still be accounted for within the PDB).

⚠️NOTE️️️⚠️

This has something to do with an "Eviction API". Haven't had a chance to read about this yet.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  # Selector for pod replicas within the stateful set.
  selector:
    matchLabels:
      app: my-ss-pods
  # Max number of pods that can be unavailable at one time.
  #
  # Instead of "maxUnavailable", this can also be "minAvailable", which is the min number of
  # pods must be available at all times.
  maxUnavailable: 3

⚠️NOTE️️️⚠️

The manifest uses label selectors to identify pods. How does it know what the replica count is for the replica set / deployment / stateful set? According to the docs:

The "intended" number of pods is computed from the spec.replicas of the workload resource that is managing those pods. The control plane discovers the owning workload resource by examining the metadata.ownerReferences of the Pod.

Security

Kubernetes comes with several features to enhance cluster security. These include gating off inter-cluster networking, gating off how containers can break isolation, and providing access control mechanisms to the API.

The subsections below detail various security related topics.

Network Policy

↩PREREQUISITES↩

⚠️NOTE️️️⚠️

For this feature to work, Kubernetes needs to be using a network plugin that supports it.

Network policies restrict pods from communicating with other network entities (e.g. services, endpoints, other pods, etc..). Restrictions can be for either inbound connections, outbound connections, or both.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-np
  namespace: my-ns
spec:
  # Which pods this network policy applies to is defined via pod labels. The pods have to be
  # in the same namespace as the network policy.
  podSelector:
    matchLabels:
      app: backend
  # Which directions does this network policy apply to? "Ingress" is for inbound
  # connections, "Egress" is for outbound connections, and both can be listed. If omitted,
  # it defaults to just "Ingress", with "Egress" also included if there are any egress
  # rules below.
  policyTypes: [Ingress, Egress]
  # Inbound connection rules go here. Rules can be for IP blocks (CIDR), namespaces
  # (identified via labels), or pods (identified via labels). Each rule is for a specific
  # port.
  ingress:
    - from:
        # Allow all pods from another namespace.
        - namespaceSelector:
            matchLabels:
              project: my-company
        # Allow all pods in this namespace that have a particular set of labels.
        - podSelector:
            matchLabels:
              app: frontend
        # Allow all pods from another namespace that have a particular set of labels. Note
        # that this is a SINGLE ENTRY, not two separate entries (there is no dash before
        # "podSelector" like the "podSelector" above has).
        - namespaceSelector:
            matchLabels:
              project: my-company
          podSelector:
            matchLabels:
              app: frontend
        # Allow all pods from an IP block (with exceptions).
        - ipBlock:
            cidr: 172.17.0.0/16
            except:
              - 172.17.99.0/24
      ports:
        - protocol: TCP
          port: 8080
  # Outbound connection rules go here. This is specified in exactly the same way as the
  # rules for inbound connections, but the rules apply to outbound connections.
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/24
      ports:
        - protocol: TCP
          port: 8888

⚠️NOTE️️️⚠️

The pod selectors should still apply when communicating to pods over a service. Will they still apply when communicating to pods without using a service (e.g. raw)?

The following are commonly used patterns for network policies.

# DENY ALL INGRESS TRAFFIC
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes: [Ingress]
----
# ALLOW ALL INGRESS TRAFFIC
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress
spec:
  podSelector: {}
  ingress:
  - {}
  policyTypes: [Ingress]
----
# DENY ALL EGRESS TRAFFIC
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
spec:
  podSelector: {}
  policyTypes: [Egress]
----
# ALLOW ALL EGRESS TRAFFIC
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-egress
spec:
  podSelector: {}
  egress:
  - {}
  policyTypes: [Egress]
----
# DENY ALL INGRESS AND EGRESS TRAFFIC
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]

Pod Security Admission

↩PREREQUISITES↩

⚠️NOTE️️️⚠️

There was a feature called pod security policy that this replaces. Pod security policies have since been removed from Kubernetes.

Pod security admissions are used to restrict the security context of pods using policy groups that define up-to-date best practices. Restricting requires two pieces of information: policy and mode. Specifically, ...

For example, setting ...

To apply to all pods cluster-wide, use the following object.

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      defaults:
        enforce: privileged
        enforce-version: latest
        audit: restricted
        audit-version: "1.25"
        warn: baseline
        warn-version: "1.22"
      exemptions:
        usernames: []             # Usernames to exempt
        runtimeClasses: []        # Runtime class names to exempt
        namespaces: [kube-system] # Namespaces to exempt

🔍SEE ALSO🔍

To apply to all pods in a specific namespace, use the following namespace label templates.

pod-security.kubernetes.io/warn=baseline
pod-security.kubernetes.io/warn-version=1.22
pod-security.kubernetes.io/audit=restricted
pod-security.kubernetes.io/audit-version=1.25
pod-security.kubernetes.io/enforce=privileged
pod-security.kubernetes.io/enforce-version=latest
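
For example, a namespace manifest carrying those labels might look like the following (a minimal sketch mirroring the label templates above; the namespace name is a placeholder).

apiVersion: v1
kind: Namespace
metadata:
  name: my-ns
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: "1.25"
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: "1.22"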

API Access Control

↩PREREQUISITES↩

Access control to the Kubernetes API is modeled using accounts and groups, where an account can be associated with many groups and each group grants a certain set of permissions to its users. Two types of accounts are provided:

Kroki diagram output

Service accounts are what get used when a pod needs access to the Kubernetes API. Containers within that pod have a volume mounted with certificates and credentials that the applications within can use to authenticate and communicate with the API server.

Each namespace gets created with a default service account (named default). By default, pods created under a namespace will be assigned the default service account for that namespace. A pod can be configured to use a custom service account that has more or less access rights than the default service account. A pod can also forgo volume mounting the credentials of a service account entirely if the applications within it don't need access to the Kubernetes API.

Kroki diagram output

🔍SEE ALSO🔍

How access rights are defined depends on how Kubernetes has been set up. By default, Kubernetes is set up to use role-based access control (RBAC), which tightly maps to the REST semantics of the Kubernetes API server. RBAC limits what actions can be performed on which objects: Objects map to REST resources (paths on the REST server) and manipulations of objects map to REST actions (verbs such as DELETE, GET, PUT, etc.. on those REST server paths).

The subsections below detail RBAC as well as other API security related topics.

⚠️NOTE️️️⚠️

Other types of access control mechanisms exist as well, such as attribute-based access control (ABAC).

Role-based Access Control

RBAC tightly maps to the REST semantics of the Kubernetes API server by limiting what actions can be performed on which objects: Objects map to REST resources (paths on the REST server) and manipulations of objects map to REST actions (verbs such as DELETE, GET, PUT, etc.. on those REST server paths). RBAC is configured using two sets of kinds:

A role binding always maps a single role to many users, groups, and service accounts.

Kroki diagram output

⚠️NOTE️️️⚠️

Recall that a ...

ClusterRole and ClusterRoleBinding are cluster-level kinds while Role and RoleBinding are namespace-level kinds. RBAC provides different permissions based on which role variant (ClusterRole vs Role) gets used with which role binding variant (ClusterRoleBinding vs RoleBinding):

Role Binding Permission Granted
Role RoleBinding Allow access to namespace-level kinds in that role / role binding's namespace.
ClusterRole ClusterRoleBinding Allows access to namespace-level kinds in all namespaces, cluster-level kinds, and arbitrary URL paths on the Kubernetes API server.
ClusterRole RoleBinding Allow access to namespace-level kinds in all namespaces, cluster-level kinds, and arbitrary URL paths on the Kubernetes API server. But, that access is only permitted from within the namespace of the role binding.
Role ClusterRoleBinding Invalid. It's allowed but it does nothing (it won't cause an error).

Roles / cluster roles define a set of rules, where each rule grants access to a specific part of the API. Each rule requires three pieces of information:

In addition, cluster roles take in a set of paths, where each path grants access to a specific path on the API server. This is useful in scenarios where the path being accessed doesn't represent a kind (meaning it can't be represented with rules as described above -- e.g. querying the health information of the cluster).

All of the fields discussed above can use wildcards via *.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: my-cluster-role
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
  - nonResourceURLs: ["/api/*", "/apis/*"]
    verbs: ["*"]

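For contrast with the wildcard example above, a more locked-down, namespace-level role might look like the following (a minimal sketch; the names are placeholders).

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: my-ns
  name: pod-reader
rules:
  - apiGroups: [""]   # "" is the core API group (pods, services, config maps, etc.).
    resources: [pods]
    verbs: [get, list, watch]
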
Role bindings / cluster role bindings associate a set of subjects (users, groups, and / or service accounts) with a role / cluster role with, granting each subject the permissions defined by that role / cluster role. Each subject needs to be either a ...

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: my-cluster-role
subjects:
  - kind: ServiceAccount
    namespace: my-ns
    name: default
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:authenticated  # case-sensitive
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: dave  # case-sensitive

⚠️NOTE️️️⚠️

I suspect that users and groups are cluster-level kinds, hence the lack of namespace. Haven't been able to verify this.

⚠️NOTE️️️⚠️

What exactly is the system:authenticated group in the examples above? According to the book, there are several groups internal to Kubernetes that help identify an account:

Kubernetes comes with several predefined cluster roles that can be used as needed:

⚠️NOTE️️️⚠️

edit also excludes access to resource quotas and namespaces? Unsure.

Disable Credentials

By default, a pod will mount its service account's credentials to /var/run/secrets/kubernetes.io/serviceaccount within its containers. Unless access to the API is required, it's good practice to disable the mounting of credentials entirely. This can be done via the service account object or the pod object.

# Disable on service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-service-account
automountServiceAccountToken: false # Don't mount creds into pods using this service account
----
# Disable on pod
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  serviceAccountName: my-service-account
  automountServiceAccountToken: false  # Prevent auth details from mounting to containers in pod
  containers:
    - name: my-container
      image: my-image:1.0

⚠️NOTE️️️⚠️

Recall that it's also possible to disable auto-mounting on individual pods. Auto-mounting can't be disabled on individual containers, but it is possible to override the /var/run/secrets/kubernetes.io/serviceaccount mount on those containers with something like tmpfs (empty directory).
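
A minimal sketch of that override, masking the service account credentials for a single container with an empty directory (names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  serviceAccountName: my-service-account
  containers:
    - name: my-container
      image: my-image:1.0
      volumeMounts:
        # Shadow the default credential mount with an empty directory so this container
        # can't read the service account token.
        - name: no-creds
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          readOnly: true
  volumes:
    - name: no-creds
      emptyDir: {}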

Extensions

↩PREREQUISITES↩

Kubernetes can be automated / extended through user supplied code and third-party software packages. The subsections below detail various automation related topics.

Custom Kinds

↩PREREQUISITES↩

Kubernetes allows user-defined kinds. User-defined kinds typically build on existing kinds, either to ...

Each user-defined kind first requires a custom resource definition (CRD), which tells Kubernetes how a user-defined kind is defined. Given a CRD for a user-defined kind, the Kubernetes API server will store, allow access, and perform basic validation on objects of that kind.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: cars.my-corp.com  # Must be set to <spec.names.plural>.<spec.group>.
spec:
  # The different naming variations of the kind that this CRD is adding.
  names:
    plural: cars     # Plural variant, used for API path (discussed further below).
    singular: car    # Singular variant, used as alias for kubectl and for display.
    kind: Car        # CamelCased singular variant, used by manifests.
    shortNames: [cr] # Shorter variants, used by kubectl.
  # Is this a namespace-level kind (objects associated with a namespace) or a cluster-level
  # kind (objects are cluster-wide). Use either "Namespaced" or "Cluster".
  scope: Namespaced
  # The REST API path is specified using the two fields below ("group" and "version") along
  # with the plural name defined above. There's only one group but there can be multiple
  # versions for that group. Each version gets its own REST API path in the format
  # /apis/<group>/<version>/<plural>. 
  #
  # In this example, only one version exists. Its REST API path is /apis/my-corp.com/v1/cars.
  group: my-corp.com
  versions:
    - name: v1
      served: true  # Setting this to false disables this version.
      storage: true # Given multiple version, exactly one must be marked for storage.
      # Each version can have some validation performed via a schema. The following OpenAPI
      # schema ensures several fields exist on the object and those objects are of the
      # correct type.
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                model: {type: string}
                year: {type: integer}
                doorCount: {type: integer}
----
# An example object created using the definition above
apiVersion: my-corp.com/v1
kind: Car
metadata:
  name: my-car
spec:
  model: Jetta
  year: 2005
  doorCount: 4

To process objects of a user-defined kind, a special pod needs to be written with access to the Kubernetes API. This pod, called a controller, needs to use the Kubernetes API to watch for object events and process those events in whatever way is appropriate. To provide redundancy, it's common for multiple instances of such a pod to be running at once (e.g. deployment), possibly coordinating with each other (e.g. shared database).

# This is a naive implementation of a controller. It shouldn't have multiple instances
# running on the cluster because those instances get into race conditions and trip over each
# other.

# Import Kubernetes API client for Python.
import kubernetes.client
import kubernetes.config
import kubernetes.watch
from kubernetes.client import V1Pod
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1Service
from kubernetes.client import V1ServiceSpec
from kubernetes.client import V1ServicePort

# Load the credentials mounted within the pod. Make sure a service account is associated
# with the pod and that service account has the necessary permissions to watch, add/delete
# pods, and add/delete services.
#
# If this is running locally rather than within a pod, use load_kube_config() instead.
kubernetes.config.load_incluster_config()

v1 = kubernetes.client.CoreV1Api()
custom_v1 = kubernetes.client.CustomObjectsApi()

# An added car should result in a new pod and a new service for that pod.
def create_car(car_name):
    pod_name = car_name + '-pod'
    pod_labels = {'app': 'car'}
    pod_spec = V1PodSpec(containers=[V1Container(name='car', image='my-image:1.0')])
    pod = V1Pod(metadata=V1ObjectMeta(name=pod_name, labels=pod_labels), spec=pod_spec)
    v1.create_namespaced_pod(namespace='default', body=pod)
    
    service_name = car_name + '-service'
    service_labels = {'app': 'car'}
    service_spec = V1ServiceSpec(selector=pod_labels, ports=[V1ServicePort(protocol='TCP', port=80)])
    service = V1Service(metadata=V1ObjectMeta(name=service_name, labels=service_labels), spec=service_spec)
    v1.create_namespaced_service(namespace='default', body=service)

# A deleted car should result in pod and service associated with it to be deleted as well.
def delete_car(car_name):
    pod_name = car_name + '-pod'
    v1.delete_namespaced_pod(name=pod_name, namespace='default')
    
    service_name = car_name + '-service'
    v1.delete_namespaced_service(name=service_name, namespace='default')

# Watch the API server for changes to "cars"
w = kubernetes.watch.Watch()
for event in w.stream(custom_v1.list_cluster_custom_object, 'my-corp.com', 'v1', 'cars', watch=True):
    car = event['object']
    if event['type'] == 'ADDED':
        create_car(car['metadata']['name'])
    elif event['type'] == 'DELETED':
        delete_car(car['metadata']['name'])

Note that, when a CRD is deleted, all of its objects are deleted as well. Those deleted objects will cause DELETED events. For example, deleting the CRD example above will cause the controller example above to receive DELETED events for objects that were taken out by that CRD's deletion.

⚠️NOTE️️️⚠️

In addition to doing it manually like this, there's also a handy framework to take a lot of the boilerplate out called Kopf (Kubernetes Operator Pythonic Framework).

⚠️NOTE️️️⚠️

There's also another much more complicated mechanism of adding your own kind called API server aggregation. In this method, you create your own API server that handles requests, storage, and management of your kind. The Kubernetes API server then proxies to your API server via an aggregation layer.

Helm

↩PREREQUISITES↩

Helm is an application installer for Kubernetes, similar to package managers on Linux distributions (e.g. apt on Ubuntu). It installs, upgrades, and uninstalls software, taking care of configuration details and applying all the necessary manifests into Kubernetes. This could be software that's internal to Kubernetes (e.g. helps manage or extend Kubernetes in some way) or software that runs on top of Kubernetes (e.g. install a cluster of Redis pods for your application to use).

A chart is a recipe that details how Helm should install a piece of software. Each chart is made up of ...

  1. manifest templates that render based on user-supplied configurations.
  2. default configurations for those manifest templates.
  3. dependencies to other charts.

Kroki diagram output

Helm can access charts from either ...

Publicly available charts / chart repositories are commonly listed on ArtifactHub.

Kroki diagram output

Each chart can be installed multiple times, where each installation has a different name and / or namespace. Helm can update installations in one of two ways: An installation can be updated to ...

Each change to an installation is called a revision. For example, an installation of a Redis chart may go through the following revisions:

  1. Installed Redis 5 with default configurations.
  2. Updated configurations to listen on port 3333.
  3. Upgraded Redis 5 to Redis 6.
  4. Upgraded Redis 6 to Redis 7 and updated configurations to listen on port 7777.
  5. Rolled back to revision 2 (Redis 5 configured to listen on port 3333).

Kroki diagram output

Each revision has a lifecycle, which is stored inside of Kubernetes as a secret of a type helm.sh/release.v1. The steps of that lifecycle are ...

Kroki diagram output

Helm is used via a CLI of the same name. By default, the Helm CLI uses the same configuration as kubectl to access the cluster, meaning Helm's CLI should work if kubectl works.

helm repo add bitnami https://charts.bitnami.com/bitnami  # Add bitnami's repo as "bitnami"
helm repo update                                          # Update list of charts from repos
helm install my-db bitnami/mysql                          # Install bitnami's mysql as "my-db"

Repository References

To add a reference to a chart repository, use helm repo add. Each added chart repository is given a local name, such that whenever that chart repository needs to be accessed in the CLI, the local name is used instead of the full URL.

# Add a repo as "bitnami".
helm repo add bitnami https://charts.bitnami.com/bitnami
# Add a repo as "my-private" using username and password authentication.
helm repo add my-private https://my-org/repo --username $USERNAME --password $PASSWORD
# Add a repo as "my-private" using SSL key file.
helm repo add my-private https://my-org/repo --key-file $SSL_KEY_FILE
# Add a repo as "my-private" using SSL certificate file.
helm repo add my-private https://my-org/repo --cert-file $SSL_CERT_FILE
# Add a repo as "my-private" but skip certificate checks for the server.
helm repo add my-private https://my-org/repo --insecure-skip-tls-verify
# Add a repo as "my-private" but use custom CA file to verify server's certificate.
helm repo add my-private https://my-org/repo --ca-file $CA_FILE

To list all chart repository references, use helm repo list.

# List all chart repositories.
helm repo list
# List all chart repositories as YAML.
helm repo list --output yaml
# List all chart repositories as JSON.
helm repo list --output json

To remove a chart repository reference, use helm repo remove.

# Remove the repo "bitnami".
helm repo remove bitnami
# Remove the repos "bitnami" and "my-private".
helm repo remove bitnami my-private

Helm caches a chart repository once it's been added. To update an added chart repository's cache, use helm repo update.

# Update all repos.
helm repo update
# Update only the repos "bitnami" and "my-private".
helm repo update bitnami my-private

To search added chart repositories for charts containing a specific keyword, use helm search repo.

# Search all repos for charts containing "redis".
helm search repo redis
# Search all repos for charts containing "redis", including pre-release versions.
helm search repo redis --devel
# Search all repos for charts containing "redis", including prior versions of charts.
helm search repo redis --versions

⚠️NOTE️️️⚠️

Rather than searching just added repositories for "redis", it's possible to search all repositories in ArtifactHub (or your own instance of it) via helm search hub redis.

Install Management

↩PREREQUISITES↩

To install a chart, use helm install. Installing a chart requires the chart's name (e.g. redis), where to find the chart (e.g. locally vs chart repository), and what to name the installation (e.g. my-redis-install). Depending on the chart, some configuration parameters will likely be needed during installation as well.

⚠️NOTE️️️⚠️

If installing / upgrading from a chart repository that's been added to Helm, you should update Helm's cache of that chart repository first: helm repo update.

# Install "my-app" from the directory "./web-server".
helm install my-app ./web-server
# Install "redis" from the repo at https://charts.bitnami.com/bitnami.
helm install my-app https://charts.bitnami.com/bitnami/redis
# Install "redis" from the repo added as "bitnami".
helm install my-redis-install bitnami/redis \
  --namespace $NAMESPACE \  # Namespace to install under (optional, default if omitted)
  --version $VERSION \      # Version of chart to install (optional, latest if omitted)
  --values $YAML_FILE       # YAML containing overrides for chart's config defaults (optional)
# Install "redis" from the repo added as "bitnami" using multiple config files.
#
#  * When multiple "--values" exist, the latter's values override former's values.
helm install my-redis-install bitnami/redis \
  --namespace $NAMESPACE \  # Namespace to install under (optional, default if omitted)
  --version $VERSION \      # Version of chart to install (optional, latest if omitted)
  --values $YAML_FILE_1 \   # YAML file containing config overrides (optional)
  --values $YAML_FILE_2 \   # YAML file containing config overrides (optional)
  --values $YAML_FILE_3     # YAML file containing config overrides (optional)
# Install "redis" from the repo added as "bitnami" using individual configs.
#
#  * Individual configs reference paths in a object, similar to YAML/JSON keys.
helm install my-redis-install bitnami/redis \
  --namespace $NAMESPACE \  # Namespace to install under (optional, default if omitted)
  --version $VERSION \      # Version of chart to install (optional, latest if omitted)
  --set app.port=80 \       # Individual config override (optional)
  --set app.name=app \      # Individual config override (optional)
  --set app.replicas=5      # Individual config override (optional)
# Install "redis" from the repo added as "bitnami" and force creation of namespace if it doesn't
# exist.
#
#  * Forcing namespace creation can result in security issues because the namespace won't have RBAC
#    set up.
helm install my-redis-install bitnami/redis \
  --namespace $NAMESPACE \  # Namespace to install under (optional, default if omitted)
  --create-namespace        # Create namespace if missing (optional)
# Install "redis" from the repo added as "bitnami" using a uniquely generated name.
#
#  * Generated name will be based on chart's name.
helm install bitnami/redis --generate-name
# Install "my-app" but wait until specific objects are in a ready state before marking as success.
#
#  * By default, the install command will block only until manifests are submitted to Kubernetes.
#    It's possible to force the install command to block until certain success criteria have been
#    met (pods, persistent vol claims, services, and min pods for deployments/stateful sets/etc.).
helm install my-app ./web-server --wait \
  --wait-for-jobs \    # Wait until all jobs have completed as well (optional)
  --atomic \           # Delete the installation on failure  (optional)
  --timeout $DURATION  # Max duration to wait before marking as failed (optional, 5m0s if omitted)
# Simulate install "my-app" from the directory "./web-server".
#
#  * Won't install, but will give the set of changes to be applied.
helm install my-app ./web-server --dry-run

To list installations, use helm list.

# List installations.
helm list \
  --namespace $NAMESPACE \  # Namespace of installations (optional, default if omitted)
  --max $MAX_ITEMS \        # Max installations to list (optional, 256 if omitted)
  --filter $REGEX           # Perl compatible regex (optional, all listed if omitted)
# List installations in all namespaces.
helm list --all-namespaces

To update an installation, use helm upgrade. Updating works very similarly to installing, supporting many of the same options. The difference between the two is that, with helm upgrade, Helm diffs the current installation's objects with the objects created by the chart to see what it should update. Only objects that have changes will get submitted to Kubernetes.

When using helm upgrade, the configuration values used by the installation aren't re-applied by default. If the previous configuration values aren't re-supplied by the user and the --reuse-values flag isn't set, the upgrade will revert all configurations back to the chart's default values.

⚠️NOTE️️️⚠️

The book recommends that you not use --reuse-values and instead supply the configurations each time via YAML. It's recommended that you store the YAML somewhere in git (unless it has sensitive information, in which case you should partition out the sensitive info into another YAML and keep it secure somewhere).

# Update "my-app" from the directory "./web-server" with existing config.
helm upgrade my-app ./web-server --reuse-values \
  --force                   # Replace object manifests instead of updating them (optional)
# Update "redis" from the repo added as "bitnami" with existing config.
helm upgrade my-redis-install bitnami/redis --reuse-values \
  --force \                 # Replace object manifests instead of updating them (optional)
  --namespace $NAMESPACE \  # Namespace installed under (optional, default if omitted)
  --version $VERSION        # Version of chart to update to (optional, latest if omitted)
# Update "redis" from the repo added as "bitnami" with new YAML config.
helm upgrade my-redis-install bitnami/redis \
  --force \                 # Replace object manifests instead of updating them (optional)
  --namespace $NAMESPACE \  # Namespace installed under (optional, default if omitted)
  --version $VERSION \      # Version of chart to update to (optional, latest if omitted)
  --values $YAML_FILE       # YAML containing overrides for chart's config defaults (optional)
# Update "my-app" with existing config, but wait until specific objects are in a ready state before
# marking as success.
#
#  * By default, the upgrade command will block only until manifests are submitted to Kubernetes.
#    It's possible to force the upgrade command to block until certain success criteria have been
#    met (pods, persistent vol claims, services, and min pods for deployments/stateful sets/etc.).
helm upgrade my-app ./web-server --reuse-values --wait \
  --cleanup-on-fail \  # Delete new resources inserted by the upgrade on failure (optional)
  --wait-for-jobs \    # Wait until all jobs have completed as well (optional)
  --atomic \           # Rollback the update on failure  (optional)
  --timeout $DURATION  # Max duration to wait before marking as failed (optional, 5m0s if omitted)
# Simulate update "my-app" from the directory "./web-server".
#
#  * Won't update, but will give the set of changes to be applied.
helm upgrade my-app ./web-server --reuse-values --dry-run

To update or install (update if it exists / install if it doesn't exist), use helm upgrade just as before but also set the --install flag.

# Update or install "redis" from the repo added as "bitnami".
helm upgrade my-redis-install bitnami/redis \
  --install \               # Install rather than upgrade if "my-redis-install" doesn't exist (optional)
  --namespace $NAMESPACE \  # Namespace installed under (optional, default if omitted)
  --version $VERSION \      # Version of chart to update to (optional, latest if omitted)
  --values $YAML_FILE       # YAML containing overrides for chart's config defaults (optional)
# Update or install "redis" from the repo added as "bitnami", and if installing, force creation of
# namespace if it doesn't exist.
#
#  * Forcing namespace creation can result in security issues because the namespace won't have RBAC
#    set up.
helm upgrade my-redis-install bitnami/redis \
  --install \               # Install rather than upgrade if "my-redis-install" doesn't exist (optional)
  --namespace $NAMESPACE \  # Namespace installed under (optional, default if omitted)
  --create-namespace        # Create namespace if missing (optional)

⚠️NOTE️️️⚠️

helm template will do similar work to helm upgrade, but instead of applying updated / created objects to Kubernetes, it simply shows you the manifests for those objects.

Helm retains an installation's update history directly within Kubernetes (stored as secrets). To access the update history of an installation, use helm history.

# Get revision history for "redis".
helm history redis
# Get revision history for "redis" as YAML.
helm history redis --output yaml
# Get revision history for "redis" as JSON.
helm history redis --output json

To rollback to a previous revision of an installation, use helm rollback.

# Rollback "redis" to revision 3.
helm rollback redis 3
# Simulate rollback "redis" to revision 3.
#
#  * Won't rollback, but will give the set of changes to be applied.
helm rollback redis 3 --dry-run

To remove an installation, use helm uninstall.

# Uninstall "redis".
helm uninstall redis
# Uninstall "redis" but keep it's history.
helm uninstall redis --keep-history

Custom Charts

To create a chart skeleton, use helm create. Helm comes with one standard skeleton, but other skeletons can be made available by placing them in the starters/ directory. The starters/ directory gives chart developers different starting points for developing charts. For example, a skeleton could include all the necessary objects for a basic microservice.

# Create "my-chart" chart, using standard skeleton.
helm create my-chart
# Create "my-chart" chart, using basic-microservice skeleton.
helm create --starter basic-microservice my-chart

Each created chart will live in its own directory. Once development is complete, it can be packaged up and distributed to others via a chart repository. The subsections below detail the particulars of chart development.
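
The layout of a chart directory created by helm create typically looks something like the following (a rough sketch -- the exact set of generated files depends on the skeleton used).

my-chart/
  Chart.yaml      # Chart metadata, dependencies, and control flags.
  values.yaml     # Default configurations for installs / updates.
  .helmignore     # Files to exclude when packaging.
  charts/         # Local copies of chart dependencies.
  templates/      # Manifest templates, macros, NOTES.txt, tests/, etc.
  crds/           # (Optional) non-templated CRDs, applied before templates.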

Configuration

↩PREREQUISITES↩

In a chart directory, the paths values.yaml, Chart.yaml, and charts/ make up the configurations for the chart.

values.yaml contains a single object that acts as the configuration defaults for installs / updates of the chart. Its field values are used when the user doesn't supply configurations for an install / update.

app:
  port: 8080
  extra-annotations: {key1: value1, key2: value2} 
  service-type: LoadBalancer
  postgres-enabled: true

⚠️NOTE️️️⚠️

A values.yaml may have a JSON Schema associated with it via values.schema.json. Helm will check this schema when it works with the chart.

Chart.yaml contains chart metadata, chart dependencies, and some control flags.

# [REQUIRED] The chart specification being targeted by this chart. Set this value to "v2" ("v2"
# charts target version 3 of Helm, while "v1" targets version 2 of Helm).
apiVersion: v2
# [REQUIRED] The name of this chart. This is the name people use when they install / update this
# chart. This name must conform to the specifications of names for Kubernetes objects (lowercase
# characters, numbers, dashes, and dots only).
name: my-app  
# [OPTIONAL] The annotations associated with this chart. Similar to annotations for Kubernetes
# objects.
annotations:
  development-os: Windows95
# [OPTIONAL] The description, project URL, source URLs, icons, maintainers, and keywords associated
# with this chart.
description: Some text here.
home: https://chart.github.io
sources: [https://github.com/organization/chart]
icon: https://github.com/some_user/chart_home/icon.svg  # This can be a data URL (if you want)
maintainers:
  - name: Steve
    email: steve@gmail.com
    url: https://steve.com
  - name: Josh
    url: https://josh.com
  - name: George
keywords: [apple, cars, pepsi]
# [OPTIONAL] This can be set to either "application" or "library" (default is "application"). An
# "application" is a chart that installs an application for some user, while a "library" is
# essentially a chart that provides helper functionality for other charts (it can't be
# installed).
type: application
# [REQUIRED] This is the version of this chart. This should be incremented whenever a change is
# made to the chart.
version: 1.0.1
# [OPTIONAL] This is the version of the application being installed by this chart (must be set if
# type of this chart is "application"?).
appVersion: 7.0.1
# [OPTIONAL] This is the dependencies required by the chart. Each dependency requires the field ...
#
#  * "name": Chart name.
#  * "version": Semantic version or a semantic version range (e.g. "~1.1.x" or "^2.15.9").
#  * "repository": Repository URL or a REFERENCE to a repository URL ("helm repo add ...").
#
# Each dependency can optionally have the field ....
#
#  * "condition": Configuration that controls whether or not the dependency should be installed. 
#     This is a configuration supplied by the user / by default ("values.yaml").
#  * "import-values": List of configurations to pull in from the child. These are configurations
#     supplied by the child's configuration and remapped to the parent's configuration.
dependencies:
  - name: my_dep
    version: ^4.9.11
    repository: https://my.org/chart_repo/
  - name: my_server
    version: 1.1.3
    repository: https://my.org/chart_repo/
    import-values:
      # Map child's "network-parameters.port" config to parent's "server.port" config.
      - child: network-parameters.port
        parent: server.port
      # Map child's "app.name" config to parent's "server.name" config.
      - child: app.name
        parent: server.name
  - name: postgres
    version: 13.0.1
    repository: "@local_ref"         # Repository reference
    condition: app.postgres-enabled  # Install only if config property is true

charts/ contains a local copy of the chart dependencies that were specified in Chart.yaml. To populate this directory, use helm dependency update . in the chart directory (period at the end references the current directory).

# Download dependencies into "charts/".
helm dependency update .
# Download dependencies into "charts/" and verify their signatures.
helm dependency update . --verify

Running this command also generates a Chart.lock file which contains the exact versions of the chart dependencies downloaded. This helps keep builds reproducible when dependencies in Chart.yaml use version ranges instead of exact versions. To re-create charts/ based on Chart.lock, use helm dependency build . in the chart directory (period at the end references the current directory).

# Download dependencies into "charts/".
helm dependency build .
# Download dependencies into "charts/" and verify their signatures.
helm dependency build . --verify

Templates

↩PREREQUISITES↩

In a chart directory, the files inside of the templates/ directory consist of templates. These templates are mostly for manifests, but can also include macros and templates for other aspects of the chart (e.g. NOTES.txt is rendered and displayed to the user once the chart installs).

⚠️NOTE️️️⚠️

The manifests in templates/ must be namespace-level objects. It's best to avoid using CRDs if you can. If you must use CRDs, you need to place them in crds/ instead and none of those CRDs can be templates. Helm will apply those CRDs before it renders and applies templates.

Why should you avoid CRDs? Recall that CRDs are cluster-level objects. That means that if you ...

Imagine having two installs of some chart, where each install is a different version of the chart. Those installs are under their own name and namespace combinations, but they're sharing the same CRDs. If a newer version of a chart modifies / deletes a CRD, installing it may mess up installs of older versions of that chart. For that reason, Helm ignores updates and deletes to CRDs in /crds.

Helm's templating functionality leverages Go's text template package. Rendering a template requires an object, where that object's fields are evaluated to fill in various sections of the template. Helm provides this object. The object has ...

The object's most commonly accessed fields are...

⚠️NOTE️️️⚠️

For .Chart, Chart.yaml has keys with lowercase first characters. When accessed through the object, you need to uppercase that first character (e.g. name becomes .Chart.Name).

In a template, evaluations are encapsulated by double squiggly brackets, where inside the brackets is code to be executed and rendered (e.g. {{action}}). The object provided by Helm is referenced simply with a preceding dot (e.g. {{.Values.app.name}} to print out the app.name configuration value).

# Define macros
{{define "name"}}
name: {{.Values.app.name}}
{{end}}
{{define "labels"}}
app: backend
version: 3.4.1
{{end}}
# Define manifest
apiVersion: v1
kind: Pod
metadata:
  {{include "name" . | nident 2}}
  labels:
    {{include labels | nident 4}}
spec:
  containers:
    - image: my-image:1.0
      name: my-container

The rest of this section gives a brief but incomplete overview of templating in Helm. A complete overview of templating is available at the documentation for Go's template package and Helm's templating functions.

Common template evaluations are listed below (e.g. accessing fields, calling functions, if-else, variables, etc..).

Common template functions are categorized and listed below. Some of these come directly from Go while others are provided by Helm.

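As a quick taste of what these look like in practice, the fragment below combines a few of the more common evaluations (a minimal sketch; the configuration paths app.debug and app.extraHosts are placeholders, not part of the values.yaml example above).

# Variable assignment and field access.
{{- $appName := .Values.app.name}}
name: {{$appName}}
# Pipelines / function calls (fall back to 8080 if no port is configured).
port: {{.Values.app.port | default 8080}}
# If / else ("-" inside the brackets trims surrounding whitespace from the render).
{{- if .Values.app.debug}}
logLevel: debug
{{- else}}
logLevel: info
{{- end}}
# Looping over a list.
hosts:
  {{- range .Values.app.extraHosts}}
  - {{.}}
  {{- end}}
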
⚠️NOTE️️️⚠️

Many of these functions have must variants (e.g. toYaml vs mustToYaml) which error out in case it can't perform the expected function.

A template can define a set of macros that act similarly to functions. To define a macro, wrap template text in {{define name}} and {{end}}.

{{define "name"}}
  name: {{.Values.app.name}}
{{end}}
{{define "labels"}}
    app: backend
    version: 3.4.1
{{end}}

To use a macro, use {{template name scope}}, where name is the name of the template and scope is whatever object that template uses when accessing fields and functions. If scope is omitted, the render may not be able to access the fields or functions it needs. For example, the "name" macro defined above needs a scope containing the path Values.app.name.

# PRE-RENDER
{{define "name"}}
  name: {{.Values.app.name}}
{{end}}
{{define "labels"}}
    app: backend
    version: 3.4.1
{{end}}
apiVersion: v1
kind: Pod
metadata:
{{template "name" .}}
  labels:
{{template "labels"}}
spec:
  containers:
    - image: my-image:1.0
      name: my-container
----
# POST-RENDER
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: backend
    version: 3.4.1
spec:
  containers:
    - image: my-image:1.0
      name: my-container

One issue with {{template name scope}} is the inability to control whitespace in the render. For the macros to truly be reusable, they need to be appropriately indented to align with the section of YAML they're being rendered in. For example, imagine wanting to modify the example above to include the name as a label as well. The YAML produced would be incorrect.

# PRE-RENDER
{{define "name"}}
  name: {{.Values.app.name}}
{{end}}
{{define "labels"}}
    app: backend
    version: 3.4.1
{{end}}
apiVersion: v1
kind: Pod
metadata:
{{template "name" .}}
  labels:
app-{{template "name" .}}
{{template "labels"}}
spec:
  containers:
    - image: my-image:1.0
      name: my-container
----
# POST-RENDER
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
app-  name: my-app  # WRONG - should be "app-name: my-app" and indented to be a child of "labels:".
    app: backend
    version: 3.4.1
spec:
  containers:
    - image: my-image:1.0
      name: my-container

The typical workaround to this is to use {{include name scope}}, which is a special function provided by Helm that, unlike {{template name scope}}, can be used in pipeline method invocations. Being usable in a pipeline means that you can pass the macro output to the indent / nindent function to properly align output.

# PRE-RENDER
{{define "name"}}
name: {{.Values.app.name}}
{{end}}
{{define "labels"}}
app: backend
version: 3.4.1
{{end}}
apiVersion: v1
kind: Pod
metadata:
  {{include "name" . | nident 2}}
  labels:
    # Trim preceding/trailing whitespace from "name" macro output, then add "app- to it, then
    # indent add a new line to the beginning of it and indent all lines by 4.
    {{printf "app-%s" trim (include "name" .) | nident 4}}
    {{include labels | nident 4}}
spec:
  containers:
    - image: my-image:1.0
      name: my-container
----
# POST-RENDER
apiVersion: v1
kind: Pod
metadata:

  name: my-app
  labels:
    # Trim preceding/trailing whitespace from the "name" macro output, then prepend "app-" to it,
    # then add a new line to the beginning of it and indent all lines by 4.

    app-name: my-app

    app: backend
    version: 3.4.1
spec:
  containers:
    - image: my-image:1.0
      name: my-container

Hooks

↩PREREQUISITES↩

A hook allows a chart to run actions before / after a lifecycle stage. These actions run directly on the Kubernetes cluster. For example, a chart may run a container to back up a database before an upgrade is allowed to take place.

The following hooks are allowed:

Any templated manifest can identify itself as a hook by having an annotation with key helm.sh/hook and a value that's a comma-delimited list of hooks (e.g. helm.sh/hook=pre-install,pre-upgrade). In addition, the optional annotation key ...

apiVersion: batch/v1
kind: Job
metadata:
  name: database-backup
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "1"  # Must be string, that's why it's "1"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 9999999
  template:
    spec:
      containers:
        - name: my-container
          image: my-image:1.0
      restartPolicy: OnFailure

🔍SEE ALSO🔍

Tests

↩PREREQUISITES↩

Helm provides functionality for testing, debugging, and linting a chart.

Testing a chart involves running its test hooks. These hooks can check that certain things are properly in place (e.g. test to see if a port is open). Test hooks are typically placed under the templates/tests directory. To test a chart, use helm test.
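
For example, a test hook might be a bare pod under templates/tests/ that probes a service and fails if it can't connect. This is a minimal sketch; the names, service, and image are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: my-app-test-connection
  annotations:
    "helm.sh/hook": test
spec:
  restartPolicy: Never
  containers:
    - name: test-connection
      image: busybox:1.36
      command: ['wget', '-qO-', 'http://my-app-service:80']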

⚠️NOTE️️️⚠️

So, it seems like these hooks don't actually test your manifests. They're sanity checking your manifests after they've already been applied? There's a secondary tool called Chart Testing Tool that allows you to do more elaborate tests: different configurations, Chart.yaml schema validation, ensuring Chart.yaml has its version incremented if using source control, etc...

To debug a chart, two options are available:

To lint a chart, use helm lint. The linter supports three levels of feedback: informational, warning, and error. Only error causes the process to exit with a non-zero return code.

# Lint "my-chart".
helm lint my-chart
# Lint "my-chart" but treat warnings as errors.
helm lint my-chart --strict

Distribution

↩PREREQUISITES↩

To package a chart for distribution, use helm package. Based on the configuration of the chart, Helm will generate an archive with filename $NAME-$VERSION.tgz which encompasses the entire chart. Others can use this file to install the chart.

# Package "my-chart".
helm package my-chart
# Package "my-chart" but update chart dependencies first.
helm package my-chart --dependency-update
# Package "my-chart" to a the directory "/home/user".
helm package my-chart --destination /home/user
# Package "my-chart" but automatically update the chart version and the app version.
helm package my-chart \
  --app-version 1.2.3 \ # Override chart's app version (optional, config value used if omitted)
  --version 4.5.6       # Override chart's version (optional, config value used if omitted)
# Package "my-chart" and sign it with a PGP key.
helm package my-chart --sign
  --key 'my-key' \           # Key name
  --keyring path/to/keyring  # Keyring path

⚠️NOTE️️️⚠️

Signing packages is good practice because it assures others that the package came from you. Use helm verify my-chart-4.5.6.tgz --keyring public.key to verify the package's signature. Likewise, most other commands that work with charts (e.g. helm install) can take a --verify flag.

The packaging process reads a special file named .helmignore that instructs it on which files and directories to ignore. Files and directories are ignored using glob patterns, similar to .gitignore for git and .dockerignore for Docker.

# Patterns to ignore when building packages.
.git/
.gitignore
.vscode/
*.tmp
**/temp/

To distribute a directory of packaged charts as a chart repository, use helm repo index. This generates an index.yaml file in the directory. When the directory is placed on a web server, other users can treat it as a chart repository.

# Create a chart repo out of the "charts/" directory.
helm repo index charts/
# Update a chart repo by merging the charts in "charts/" into the existing "index.yaml".
helm repo index charts/ --merge charts/index.yaml

⚠️NOTE️️️⚠️

Full-fledged chart repository servers already exist. Here's an example.

Prometheus

↩PREREQUISITES↩

⚠️NOTE️️️⚠️

A full exploration of Prometheus is out of scope here. This is just the basics of how to get it running, scraping metrics, and having HPAs make scaling decisions based on those scraped metrics.

Look into Prometheus more sometime in the future.

Prometheus is a monitoring system that can integrate with Kubernetes to support custom metrics. These custom metrics, which Prometheus collects by scraping metrics endpoints exposed by pods and services, provide better visibility into the system and can be used to scale replicas via an HPA or VPA.

The quickest way to install Prometheus on Kubernetes is to use Helm. Specifically, the prometheus chart located here.

# Add repository reference and install prometheus.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install my-prometheus prometheus-community/prometheus
# List objects. New Prometheus related objects should have been added by the install.
kubectl get all
# Forward 9999 to Prometheus pod's port 80 (access web interface via http://127.0.0.1:9999).
kubectl port-forward service/my-prometheus-server 9999:80

The Prometheus server will collect metrics from any pod or service so long as it has a set of annotations:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/prometheus-metrics"
    prometheus.io/port: "8080"
spec:
  containers:
    - name: my-container
      image: my-image:1.0

The metrics being served by the pod / service must be in Prometheus's text-based format, documented here. Each line starts with the name of the metric being collected, optionally followed by squiggly brackets encompassing a set of comma-delimited key-value labels, followed by a quantity, and optionally followed by a timestamp (milliseconds since epoch).

http_requests_per_second{pod="pod-name", env="staging"} 133
http_requests_total{pod="pod-name", env="staging"} 1111442
some_other_metric_without_labels_and_with_timestamp 123 1483228830000

⚠️NOTE️️️⚠️

Timestamps shouldn't be used for dumping in historical data. See here.

Metrics within Prometheus can be exposed to HPAs / VPAs for replica scaling via the prometheus-adapter chart located here.

# Add repository reference and install prometheus-adapter.
#
# The installed adapter is set to use the Prometheus instance installed by the "my-prometheus"
# installation above. This assumes that the "my-prometheus" installation ended up in the default
# namespace.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install my-prometheus-adapter prometheus-community/prometheus-adapter \
  --set "prometheus.url=http://my-prometheus-server.default.svc" \
  --set "prometheus.port=80"

The adapter will expose Prometheus metrics through the custom.metrics.k8s.io API (use kubectl to list them: kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1). These metrics can be referenced in an HPA / VPA for replica scaling.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-ss
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Custom metrics (custom.metrics.k8s.io) are accessed via metrics of type "Pods". The metric
    # below targets an average of 1000 "http_requests_per_second" across all replicas.
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 1000

Vault

↩PREREQUISITES↩

⚠️NOTE️️️⚠️

A full exploration of Vault is out of scope here. This is just the basics of how to use it.

Look into Vault more sometime in the future.

Vault is a secrets management system that can integrate with Kubernetes (among other tools) to support a broader range of use-cases than normal Kubernetes secrets:

There are 3 main pillars to Vault: Authentication, Policies, and Secret engines.

Kroki diagram output

  1. Authentication: Both human users and services, collectively referred to as clients, authenticate themselves with Vault prior to being granted access. Vault supports several authentication backends: tokens, usernames and passwords, Active Directory, AWS, Google, Okta, etc...

    Kroki diagram output

    Once authenticated, Vault returns an access token that the client can use to access functionality in Vault.

  2. Policies: Policies dictate which API functions (HTTP paths on the Vault API server) a client can invoke. These functions almost always have something to do with accessing or managing secrets. For example, user A may have a policy assigned that allows them to read the credentials for a MySQL server, but not update or delete those credentials.

    Kroki diagram output

  3. Secret engines: High-level components that store, generate, or encrypt data. For example, the secret engine for MySQL can dynamically create, revoke, and rotate credentials.

    Kroki diagram output

Most Vault interactions can be viewed through the prism of the 3 pillars mentioned above. This includes the administration of the Vault server itself. For example, setting up a new secret engine requires the admin to first authenticate with Vault, where that admin's policy defines if they have the necessary privileges to set up a new secret engine. Likewise, a service needing access to a secret will first authenticate with Vault, where its policy defines if it has the necessary privileges to read (or generate) that secret.
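
As a rough illustration of how the three pillars surface in the CLI (a sketch; the policy name, policy file, and mount path are hypothetical):

# Authentication: enable an auth method (e.g. Kubernetes service account tokens).
vault auth enable kubernetes
# Policies: register a policy that defines which paths a client may access.
vault policy write my-app-policy my-app-policy.hcl
# Secret engines: enable a key-value secret engine at the "secret" mount.
vault secrets enable -path=secret kv-v2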

⚠️NOTE️️️⚠️

There are other parts to Vault that may not fit these pillars. For example, Vault has an auditing interface that hooks into different auditing backends (e.g. syslog).

Vault can be accessed through a web API or a command-line interface, where the CLI is a thin wrapper around the API.

# These examples are specifically for the key-value secret engine.

vault kv put -mount=secret foo bar=baz  # Create/update key "foo" in the "secret" mount, setting field "bar" to "baz"
vault kv get -mount=secret foo  # Read "foo"'s value (at current version of the key)
vault kv get -mount=secret -version=1 foo  # Read "foo"'s value (at a specific version of the key)
vault kv metadata get -mount=secret foo  # Get "foo"'s metadata

Patterns

The following subsections are guides and patterns related to various aspects of Kubernetes.

Assistive Containers

↩PREREQUISITES↩

Oftentimes, a pod can't integrate into a larger ecosystem because it either ...

The typical way to address this is to extend or adapt the pod's functionality with assistive containers. Containers that add or adapt a pod's functionality typically fall into the following categories.

Pod Design

↩PREREQUISITES↩

Designing a pod appropriately requires ...

The following subsections detail various aspects of pod design.

Security

↩PREREQUISITES↩

There are several aspects to hardening the security of a pod.

Configuration

↩PREREQUISITES↩

A pod's configuration can make use of several features to ensure that it functions well.

Lifecycle

↩PREREQUISITES↩

A pod's lifecycle can make use of several features to ensure that it functions well.

Performance

↩PREREQUISITES↩

There are several aspects to ensure that pods perform well and pod replicas scale well. Before applying performance features, a bare-bones pod should be placed on a staging environment and load tested to determine its performance characteristics.

Command-line Interface

kubectl commands are typically organized into contexts, where each context defines contextual information about the cluster: cluster location, cluster authentication, and default namespace. To ...

Context information is usually stored in $HOME/.kube/config.
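
For example, to inspect and switch contexts (context names are hypothetical):

# List available contexts and show which one is active.
kubectl config get-contexts
kubectl config current-context
# Switch to another context.
kubectl config use-context my-context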

kubectl commands that target an object require a namespace. That namespace can either be supplied via ...

, ... or through the default namespace set for the current context. If not set explicitly in the context, the namespace will be default.
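
For example (namespace name is hypothetical):

# Target a namespace explicitly for a single command.
kubectl get pods --namespace my-namespace
# Change the default namespace for the current context.
kubectl config set-context --current --namespace my-namespace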

The Kubernetes API is exposed as a RESTful interface, meaning everything is represented as an object and accessed / mutated using standard REST verbs (GET, PUT, DELETE, etc.). kubectl uses this interface to access the cluster. For example, accessing https://cluster/api/v1/namespaces/default/pods/obj_pod is equivalent to running kubectl get pod obj_pod. The difference between the two is that, by default, kubectl formats the output in a human-friendly manner, often omitting or shortening certain details. That output can be controlled using flags. Specifically, to ...
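
A few common output flags (a sketch; obj_pod is the hypothetical pod from the example above):

# Listing with extra columns.
kubectl get pods -o wide
# Full object as YAML or JSON.
kubectl get pod obj_pod -o yaml
kubectl get pod obj_pod -o json
# Hit the REST interface directly through kubectl.
kubectl get --raw /api/v1/namespaces/default/pods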

CRUD

get / describe allows you to get details on specific objects and kinds. To get an overview of a ...

describe provides more in-depth information than get does.

Examples of object access:

Add the --watch flag to have kubectl continually provide updates.
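
A few representative invocations (object names are hypothetical):

# List all pods in the current namespace.
kubectl get pods
# Show a single pod in full.
kubectl get pod my-pod -o yaml
# Show in-depth details and recent events for a pod.
kubectl describe pod my-pod
# Keep the listing open and stream updates.
kubectl get pods --watch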

apply allows you to create and update objects. To create or update using ...

It will not allow you to delete objects.

⚠️NOTE️️️⚠️

Is this true? See kubectl apply with prune flag.
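
Typical apply invocations (file / directory names are hypothetical):

# Create or update the objects described in a manifest file.
kubectl apply -f my-pod.yaml
# Create or update every manifest in a directory.
kubectl apply -f manifests/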

edit is shorthand for get and apply in that it'll open the YAML in an editor and allow you to make changes directly.

delete allows you to delete an object. To delete using ...

In certain cases, the object being deleted has parental links to other objects. For example, a replica set is the parent of the pods it creates and watches. If you delete these parent objects, by default their children go with them unless the --cascade=false flag is used.
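
Typical delete invocations (object / file names are hypothetical):

# Delete by kind and name.
kubectl delete pod my-pod
# Delete whatever a manifest file describes.
kubectl delete -f my-pod.yaml
# Delete a parent object but keep its children.
kubectl delete replicaset my-rs --cascade=false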

label / annotate allows you to label / annotate an object.
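
For example (object names and values are hypothetical):

# Add or update a label and an annotation on a pod.
kubectl label pod my-pod env=staging
kubectl annotate pod my-pod description='handles backend traffic'
# Remove a label by suffixing its key with a dash.
kubectl label pod my-pod env-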

When referencing objects, the ...

Deployment

rollout allows you to monitor and control deployment rollouts.
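
For example (deployment name is hypothetical):

# Watch the progress of an ongoing rollout.
kubectl rollout status deployment/my-deployment
# Show past revisions, then roll back to the previous one.
kubectl rollout history deployment/my-deployment
kubectl rollout undo deployment/my-deployment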

configmap allows you to create a configuration for applications running in pods.

⚠️NOTE️️️⚠️

The option --from-file can also point to a directory, in which case an entry will get created for each file in the directory provided that the filenames don't have any disallowed characters.

secret allows you to create a security related configuration for applications running in pods.
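
For example (names, keys, and file names are hypothetical):

# Create a configmap from a literal and from a file (or directory).
kubectl create configmap my-config --from-literal=log_level=debug --from-file=app.properties
# Create a generic secret from a literal.
kubectl create secret generic my-secret --from-literal=password=hunter2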

Proxy

proxy allows you to launch a proxy that lets you talk internally with the Kubernetes API server.
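
For example:

# Start a local proxy to the Kubernetes API server (blocks until terminated).
kubectl proxy --port=8001
# In another terminal, query the API server through the proxy.
curl http://127.0.0.1:8001/api/v1/namespaces/default/pods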

Debug

logs allows you to view outputs of a container.

exec allows you to run a command on a container.

attach allows you to attach to a container's main running process.

⚠️NOTE️️️⚠️

attach is similar to logs with the tailing flag but also allows you to pipe into stdin.

cp allows you to copy files between your machine and a container.

port-forward allows you to connect to an open port on a container or connect to a service.

top allows you to see resource usage (CPU / memory) for nodes and pods.
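
Representative invocations of these debug commands (object / path names are hypothetical):

# Tail a container's output.
kubectl logs my-pod -c my-container --follow
# Run a command inside a container, or open an interactive shell.
kubectl exec my-pod -c my-container -- ls /
kubectl exec -it my-pod -c my-container -- /bin/sh
# Attach to the container's main process (add -i to pipe into stdin).
kubectl attach my-pod -c my-container
# Copy a file out of a container.
kubectl cp my-pod:/var/log/app.log ./app.log
# Forward a local port to a pod or a service.
kubectl port-forward my-pod 9999:8080
kubectl port-forward service/my-service 9999:80
# Show resource usage for nodes and pods.
kubectl top nodes
kubectl top pods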

Terminology