
Random pod failures - zone affinity

  • mooneya9
  • Mar 1, 2024
  • 2 min read

Updated: Jun 22

This case involves a Kubernetes platform running on AWS, hosting a suite of microservices that had been deployed over a year earlier by a third-party provider. The system had generally been stable, but during routine cluster patching, one particular service consistently failed to start promptly - causing delays in rollout and uncertainty during maintenance windows.


The issue manifested as a prolonged delay in the service becoming healthy post-restart, despite having operated reliably for months prior. Initial suspicion pointed to infrastructure drift or a regression introduced during recent patching cycles.


Upon reviewing the Kubernetes event logs, the root cause was immediately clear: the pod associated with the problematic service was attempting to mount an Amazon EBS volume, but the volume did not exist in the same AWS Availability Zone (AZ) as the EC2 instance the pod had been scheduled on.


This is a common architectural consideration in AWS-based Kubernetes environments. Whilst Kubernetes can schedule pods on any worker node across Availability Zones, EBS volumes are zonal - bound to the specific AZ in which they were created. If a pod depends on an EBS volume, it must be scheduled into the same AZ to successfully mount the volume.
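For illustration, an EBS-backed PersistentVolume normally carries a node affinity constraint that pins it to its zone. The sketch below assumes the AWS EBS CSI driver; the volume name, volume ID and zone are placeholders rather than values from the actual cluster.

```yaml
# Minimal sketch of a zonal, EBS-backed PersistentVolume (illustrative values only).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-ebs-pv                      # hypothetical name
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0     # placeholder EBS volume ID
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone   # exact key depends on the provisioner in use
              operator: In
              values:
                - eu-west-1a                       # the AZ the volume was created in
```

A pod referencing such a volume can only run on nodes labelled with that zone; on nodes in any other AZ the volume simply cannot be attached.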


In this case, no AZ constraint had been defined in the pod specification, and no other safeguard was in place to prevent a mismatch. As a result, the Kubernetes scheduler was free to place the pod on a node in any AZ. Only when it happened to pick the volume's zone by chance would the pod come online, leading to inconsistent and delayed behaviour during upgrades.


To resolve this, the pod manifest was updated with a node affinity rule ensuring the pod is always scheduled onto a node in the same Availability Zone as the attached volume, guaranteeing compatibility with the underlying storage layer.
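As a sketch of the kind of change involved (assuming a Deployment and the standard topology.kubernetes.io/zone node label; names, image and zone are illustrative, not the client's actual manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service                     # hypothetical service name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-west-1a          # the AZ where the EBS volume lives
      containers:
        - name: app
          image: registry.example.com/app:1.0      # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: example-service-data        # PVC bound to the zonal EBS volume
```

With the required node affinity in place, the scheduler can no longer place the pod in a zone where the volume cannot be attached.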


The change was committed to version control to ensure it persisted across future deployments and updates.


Other architectural strategies could have prevented this issue had they been designed in from the outset, but this solution offered a low-friction, high-impact fix that aligned with the client's immediate operational needs.
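One such strategy, where dynamic provisioning is in use, is a topology-aware StorageClass with WaitForFirstConsumer volume binding, so that a volume is only created after its pod has been scheduled and therefore always lands in the pod's zone. The sketch below is illustrative only and was not part of the fix applied here.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-topology-aware                  # hypothetical name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer     # delay volume creation until the pod is scheduled
parameters:
  type: gp3
```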


With the issue resolved, the cluster could be patched reliably going forward, without risk of random service recovery delays.

 
 
