
All environments down - external dependency

  • mooneya9
  • Feb 27, 2024
  • 2 min read

Updated: Jun 12

A client contacted us following a severe incident in which all of their web applications went offline - across development, staging, and production environments simultaneously. This was especially concerning given the client's well-architected AWS setup. Each environment was isolated in its own AWS account, with separate CI/CD pipelines and strict stage gates to prevent unauthorised or accidental deployments to production.


Their initial suspicion was that a developer had somehow pushed bad code into both staging and production due to a flaw in the platform’s architecture. However, based on past experience, we recognised that a simultaneous failure across all environments is more often caused by a shared external dependency than by a code promotion issue. Separation of accounts and pipelines effectively prevents code from propagating between environments, but it does not protect against outages introduced by external services that every environment consumes.


We advised the client that the most likely cause was a failing external dependency, common to all environments. They provided access to the development environment so we could replicate the issue and identify the failure point.


Without access to the source code, our approach relied on observing the DNS traffic generated by the application. Since any call to an external service begins with a DNS resolution, we enabled Route 53 Resolver query logging for the VPC. This gave us a complete list of the domain lookups originating from the application.
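For context, the sketch below shows roughly how that logging can be switched on with boto3. The log group ARN, VPC ID, and request ID are placeholders rather than the client's actual values, and the destination CloudWatch log group is assumed to already exist.

    import boto3

    # Minimal sketch: turn on Route 53 Resolver query logging for a VPC so every
    # DNS lookup made inside it lands in a CloudWatch Logs group.
    resolver = boto3.client("route53resolver")

    config = resolver.create_resolver_query_log_config(
        Name="dependency-discovery",
        # Placeholder ARN - the destination log group must already exist.
        DestinationArn="arn:aws:logs:eu-west-1:111111111111:log-group:/dns/query-logs",
        CreatorRequestId="dependency-discovery-001",
    )

    resolver.associate_resolver_query_log_config(
        ResolverQueryLogConfigId=config["ResolverQueryLogConfig"]["Id"],
        ResourceId="vpc-0abc1234def567890",  # placeholder ID for the application VPC
    )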


The logs revealed lookups for both external domains and AWS service endpoints such as DynamoDB. We then methodically isolated each dependency by blocking access to it and observing the application’s behaviour. For AWS services such as DynamoDB and Systems Manager, we used Route 53 Resolver DNS Firewall to simulate resolution failures. For external services, we blocked outbound traffic at the network ACL level, simulating a real-world service outage rather than just a DNS failure.
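As an illustration of both techniques, the sketch below blocks resolution of a DynamoDB endpoint with DNS Firewall and denies outbound HTTPS towards an external provider's address range at the network ACL. Every ID, domain, and CIDR shown is an illustrative placeholder, not a value from the engagement.

    import boto3

    resolver = boto3.client("route53resolver")
    ec2 = boto3.client("ec2")

    # 1. DNS Firewall: make DynamoDB lookups fail inside the VPC, simulating a
    #    resolution failure for an AWS service dependency.
    domain_list = resolver.create_firewall_domain_list(
        CreatorRequestId="sim-outage-001", Name="dynamodb-endpoints"
    )["FirewallDomainList"]
    resolver.update_firewall_domains(
        FirewallDomainListId=domain_list["Id"],
        Operation="ADD",
        Domains=["dynamodb.eu-west-1.amazonaws.com."],
    )
    rule_group = resolver.create_firewall_rule_group(
        CreatorRequestId="sim-outage-001", Name="simulated-outage"
    )["FirewallRuleGroup"]
    resolver.create_firewall_rule(
        CreatorRequestId="sim-outage-001",
        FirewallRuleGroupId=rule_group["Id"],
        FirewallDomainListId=domain_list["Id"],
        Priority=100,
        Action="BLOCK",
        BlockResponse="NXDOMAIN",
        Name="block-dynamodb",
    )
    resolver.associate_firewall_rule_group(
        CreatorRequestId="sim-outage-001",
        FirewallRuleGroupId=rule_group["Id"],
        VpcId="vpc-0abc1234def567890",   # placeholder VPC ID
        Priority=101,
        Name="dev-vpc-simulated-outage",
    )

    # 2. Network ACL: deny outbound HTTPS towards the external provider's range,
    #    so connections fail the way a real outage would rather than only DNS.
    ec2.create_network_acl_entry(
        NetworkAclId="acl-0123456789abcdef0",  # placeholder NACL ID
        RuleNumber=90,
        Protocol="6",                          # TCP
        RuleAction="deny",
        Egress=True,
        CidrBlock="203.0.113.0/24",            # example range for the external service
        PortRange={"From": 443, "To": 443},
    )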


As we moved through the list, we eventually identified a single external URL that, when blocked, caused the application’s login page to hang and time out. This confirmed that the application had a hard dependency on an external third-party service, which supplied user information for analytics and geolocation.
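The check itself can be as simple as the probe below, run while each dependency was blocked in turn; the URL and the 10-second timeout are illustrative, not the client's actual endpoint or threshold.

    import requests

    # Probe the login page while a single dependency is blocked; a timeout here
    # indicates the blocked service is a hard dependency of the login flow.
    try:
        response = requests.get("https://dev.example-client.com/login", timeout=10)
        print("login page responded:", response.status_code)
    except requests.exceptions.Timeout:
        print("login page hung past 10s while this dependency was blocked")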


This dependency had not been designed to fail gracefully. When the third-party service became unreachable, the application would hang, waiting indefinitely for a response. Because the same service was called in all environments, all of the client’s applications went down simultaneously when that provider experienced an outage.
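In code terms the anti-pattern looks something like the sketch below, where the enrichment call is made with no timeout; the URL and function are hypothetical, but the shape matches the behaviour we observed.

    import requests

    def fetch_user_profile(user_id: str) -> dict:
        # No timeout: if the provider stops answering, this call blocks the
        # request thread for as long as the connection attempt stays alive.
        resp = requests.get(
            "https://api.thirdparty.example/v1/users",
            params={"id": user_id},
        )
        return resp.json()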


We documented the findings and provided the client with clear evidence of the dependency and its impact. Developers were instructed to update the application logic so that, if the external service became unavailable, the system would degrade gracefully instead of hanging until the request timed out.
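A minimal sketch of that kind of change, assuming a Python service and the hypothetical endpoint from the sketch above, is to bound the call with a short timeout and fall back to safe defaults so the login flow still completes during a provider outage.

    import requests

    DEFAULT_PROFILE = {"analytics": None, "geo": "unknown"}

    def fetch_user_profile(user_id: str) -> dict:
        try:
            resp = requests.get(
                "https://api.thirdparty.example/v1/users",
                params={"id": user_id},
                timeout=2,  # fail fast instead of hanging the request thread
            )
            resp.raise_for_status()
            return resp.json()
        except requests.exceptions.RequestException:
            # Provider unreachable or erroring: continue with defaults so the
            # user can still log in while the outage lasts.
            return DEFAULT_PROFILE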


The issue was diagnosed rapidly, and the client was able to restore internal stakeholders’ confidence by providing a clear root cause and mitigation plan. The case highlighted the importance of building resilience into external service integrations.

 
 
