Broken prod autoscaling
- mooneya9
- Mar 1, 2024
- 2 min read
Updated: Jun 12
A client experienced a critical issue in their AWS environment where their auto-scaling group stopped producing healthy instances. As a result, their application became entirely unavailable. The behaviour was unexpected, particularly because their infrastructure followed a golden image approach. The golden image - a preconfigured server snapshot - had not been modified in months. Given that no changes were made to the image itself, the sudden failure of all newly launched instances raised concerns.
We were brought in to investigate. Problems like this are a good example of why experienced troubleshooting matters: when an image-based system that previously worked suddenly starts failing, it usually means either that something has changed outside the image, or that the image is not as immutable as assumed.
Our first step was to examine potential external dependencies. To do this, we enabled DNS query logging for the VPC, which let us observe every outbound DNS query the instance made during boot. If the server was relying on an external service - whether public or internal - this would make it visible.
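For reference, this kind of logging can be switched on through the Route 53 Resolver API. The sketch below uses boto3 and assumes a CloudWatch Logs destination; the config name, log group ARN, and VPC ID are illustrative placeholders rather than the client's real values.

```python
import boto3

resolver = boto3.client("route53resolver")

# Create a query log configuration that sends the VPC's DNS queries
# to a CloudWatch Logs log group (placeholder ARN below).
config = resolver.create_resolver_query_log_config(
    Name="vpc-dns-debug",
    DestinationArn="arn:aws:logs:eu-west-1:123456789012:log-group:/dns/query-debug",
    CreatorRequestId="dns-debug-2024-03-01",
)

# Associate the configuration with the VPC under investigation (placeholder ID).
resolver.associate_resolver_query_log_config(
    ResolverQueryLogConfigId=config["ResolverQueryLogConfig"]["Id"],
    ResourceId="vpc-0123456789abcdef0",
)
```

Once instances launched into that VPC, every name they resolved during boot appeared in the log group, which is what made it possible to rule external dependencies in or out.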
The DNS logs showed no unexpected external dependencies, and the client’s development team also confirmed that the system was designed to be self-contained. This allowed us to rule out external factors.
Attention then turned to the instance boot process itself. If the system was not changing externally, the most likely explanation was that the server was undergoing changes during boot. Based on past experience, we suspected automatic security updates.
Reviewing the instance logs confirmed this. The operating system was performing unattended package upgrades during the boot sequence, and these updates introduced changes that caused the application to fail to start correctly. While the original golden image was stable, each new instance became subtly different due to these updates - undermining the assumption of immutability.
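A quick way to spot this is to pull a fresh instance's console output and scan it for package-manager activity during boot. A minimal sketch with boto3 is below, assuming a Debian/Ubuntu-style image where unattended upgrades show up in the boot log; the instance ID is a placeholder.

```python
import base64
import boto3

ec2 = boto3.client("ec2")

# Fetch the console output of a freshly launched instance (placeholder ID).
resp = ec2.get_console_output(InstanceId="i-0123456789abcdef0")
console = base64.b64decode(resp.get("Output", "")).decode("utf-8", errors="replace")

# Look for evidence of package upgrades running during the boot sequence.
for line in console.splitlines():
    if "unattended-upgrade" in line.lower() or "apt" in line.lower():
        print(line)
```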
To restore functionality, we disabled automatic updates and tested the launch of new instances. This resolved the issue, and the auto-scaling group was once again able to produce healthy instances. A new golden image was then created using the patched version of the operating system, but with automatic updates disabled. This ensured consistency across future deployments.
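As a rough illustration of the re-baking step: once a build instance has been patched and automatic updates switched off (on Ubuntu, for example, by setting APT::Periodic::Unattended-Upgrade to "0" in /etc/apt/apt.conf.d/20auto-upgrades), a new golden image can be created from it with boto3. The instance ID and image name below are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# The build instance (placeholder ID) has already been patched and had
# automatic updates disabled before this snapshot is taken.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="app-golden-image-2024-03-01",
    Description="Patched base OS, automatic updates disabled",
    NoReboot=False,  # allow a reboot so the filesystem snapshot is consistent
)

# Wait until the AMI is available before pointing the launch configuration at it.
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])
print("New golden image:", image["ImageId"])
```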
We also worked with the client to improve their patching strategy. In production environments, updates should never be applied in an unsupervised or untested manner. Patching should either be done manually by an engineer following testing, or integrated into a pipeline with automated validation. In this case, automatic security updates were introducing breaking changes that were enough to bring down the entire system.
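One lightweight form of that automated validation is to boot a throwaway instance from each candidate image and only promote it if the instance comes up healthy. The sketch below shows the idea with boto3; the AMI ID and instance type are assumptions, and a real pipeline would add application-level health checks on top of the basic EC2 status checks.

```python
import boto3

ec2 = boto3.client("ec2")

CANDIDATE_AMI = "ami-0123456789abcdef0"  # hypothetical candidate image

# Launch a throwaway instance from the candidate image.
run = ec2.run_instances(
    ImageId=CANDIDATE_AMI,
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = run["Instances"][0]["InstanceId"]

try:
    # Wait for both EC2 status checks to pass; a boot-time failure such as
    # a broken package upgrade typically surfaces here or in the application
    # health check that would follow in a full pipeline.
    ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
    print(f"{CANDIDATE_AMI} passed basic validation; safe to promote")
finally:
    ec2.terminate_instances(InstanceIds=[instance_id])
```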
The incident highlighted that while automation is valuable, it must be paired with visibility, testing, and controls. Without these, even well-intentioned processes like automatic patching can have serious unintended consequences.