
Servers failing to launch

  • mooneya9
  • Mar 1, 2024
  • 2 min read

Updated: Jun 12

A client operating an AWS Landing Zone encountered an issue where EC2 instances in an Auto Scaling group consistently failed to launch. The group was part of a newly deployed workload, which made the issue harder to troubleshoot: unlike a regression in an existing system, a failure in brand-new infrastructure has no known-good baseline to compare against and can take longer to isolate.


The deployment had been created using AWS CloudFormation templates, in line with infrastructure-as-code best practices. The templates appeared valid, and no errors were reported during stack creation. However, every attempt to scale out the Auto Scaling group ended in an instance launch failure, with little diagnostic output to explain why.
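One place where launch failures of this kind usually leave a trace is the group's scaling activity history. The following is a minimal sketch, assuming Python with boto3 and configured AWS credentials; the function names and the `group_name` argument are illustrative and do not come from the client's environment. It lists the failed activities together with their status messages, which can be terse but at least confirm that the launch itself is being rejected.

```python
def failed_launches(activities):
    """Return (start time, status message) for each failed scaling activity.

    `activities` is the "Activities" list returned by the Auto Scaling
    DescribeScalingActivities API.
    """
    return [
        (a.get("StartTime"), a.get("StatusMessage", ""))
        for a in activities
        if a.get("StatusCode") == "Failed"
    ]


def fetch_activities(group_name, max_records=20):
    """Fetch recent scaling activities for one Auto Scaling group."""
    # Imported lazily so the filter above can be used without AWS access.
    import boto3

    client = boto3.client("autoscaling")
    response = client.describe_scaling_activities(
        AutoScalingGroupName=group_name,  # hypothetical group name
        MaxRecords=max_records,
    )
    return response["Activities"]
```

Calling `failed_launches(fetch_activities("my-asg"))` (with a real group name in place of the placeholder) gives a quick list of what failed and when, which can then be lined up against CloudTrail events for the same time window.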


Initial investigation focused on typical causes of EC2 launch issues, including IAM permissions, missing dependencies, and misconfigured networking. After a thorough review, the problem was traced to the use of KMS-encrypted EBS volumes. The design followed best practices by enabling encryption at rest, but the KMS key used for encryption resided in a separate AWS account - specifically, a dedicated security account within the Landing Zone structure.


While this approach aligns with a strong separation-of-duties model, it introduces additional complexity. In this case, the KMS key policy in the security account had not been configured to allow use of the key from the target account, where the EC2 instances were being launched. As a result, when the Auto Scaling group attempted to launch a new instance with an encrypted volume, the request silently failed due to insufficient permissions on the key.


After identifying the root cause, the KMS key policy was updated to allow cross-account use of the key for the relevant EC2 instance launches. Once the change was applied, the Auto Scaling group successfully launched new instances on the next test attempt.
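The exact policy change is not reproduced here, so the following is only a sketch of what cross-account key policy statements of this kind commonly look like, written in Python so the resulting JSON can be generated and inspected. The account ID `111122223333`, the `Sid` values, and the function name are placeholders, and the action list follows the pattern AWS documents for encrypted volumes rather than the client's actual policy.

```python
import json

# Actions typically needed to use a customer managed key for EBS encryption.
KMS_USE_ACTIONS = [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey",
]


def cross_account_statements(workload_account_id):
    """Build key policy statements that let another account use the key.

    The statements belong in the key policy of the account that owns the
    key (the security account in this case study).
    """
    principal = {"AWS": f"arn:aws:iam::{workload_account_id}:root"}
    return [
        {
            "Sid": "AllowUseOfKeyFromWorkloadAccount",  # hypothetical Sid
            "Effect": "Allow",
            "Principal": principal,
            "Action": KMS_USE_ACTIONS,
            "Resource": "*",
        },
        {
            "Sid": "AllowGrantsFromWorkloadAccount",  # hypothetical Sid
            "Effect": "Allow",
            "Principal": principal,
            "Action": ["kms:CreateGrant"],
            "Resource": "*",
        },
    ]


if __name__ == "__main__":
    # Placeholder account ID, not the client's.
    print(json.dumps(cross_account_statements("111122223333"), indent=2))
```

In AWS's documented pattern for Auto Scaling with a key in another account, a KMS grant is also created from the workload account to the Auto Scaling service-linked role (`AWSServiceRoleForAutoScaling`). Whether that step was part of this particular fix is not stated, so treat it as something to verify rather than a given.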


The issue was resolved the same day, allowing the client to meet an upcoming project deadline. The case is a reminder that while multi-account architectures such as AWS Landing Zone offer improved security and isolation, they also add a layer of complexity that requires careful attention - particularly around permissions and cross-account resource access.

 
 
