
Random latency - Cron job

  • mooneya9
  • Mar 1, 2024
  • 2 min read

Updated: Jun 12

In many web architectures, traffic is distributed across multiple application servers via a load balancer. A common strategy is the "round robin" algorithm, which simply rotates requests evenly between servers. While this is straightforward to implement, it makes no attempt to assess the current load on each backend, which can result in performance issues if one server becomes overloaded.
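
As a rough illustration, round robin is little more than a rotating pointer over the backend pool. The sketch below uses hypothetical server names and a plain Python generator; it is not the client's actual configuration, only a demonstration of why the algorithm is blind to load.

```python
from itertools import cycle

# Hypothetical backend pool; the names are illustrative only.
backends = ["app-server-1", "app-server-2", "app-server-3"]

# Round robin rotates through the pool in a fixed order,
# regardless of how busy each backend currently is.
rotation = cycle(backends)

def pick_backend():
    return next(rotation)

# Ten consecutive requests land evenly across the pool,
# even if app-server-2 happens to be pinned at 100% CPU.
for _ in range(10):
    print(pick_backend())
```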


A client reported intermittent failures in their web application, receiving occasional HTTP 5XX errors. These errors suggested that backend servers were failing to respond correctly. The issue was puzzling, as the overall metrics for the auto-scaling group appeared normal. There was no indication, at a high level, that additional servers were needed to handle the traffic.


Closer inspection of the server metrics revealed the issue: one of the instances in the auto-scaling group had been operating at 100% CPU usage for extended periods, while the other instances remained under normal load. This imbalance in resource usage was not visible when observing only aggregate metrics.
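
The post does not say which tooling was used for this inspection. Assuming an AWS auto-scaling group, a per-instance check along the following lines would expose the skew that the aggregate view hides; the group name and region are placeholders, not values from the original incident.

```python
from datetime import datetime, timedelta, timezone

import boto3

ASG_NAME = "web-app-asg"   # placeholder auto-scaling group name
REGION = "us-east-1"       # placeholder region

autoscaling = boto3.client("autoscaling", region_name=REGION)
cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# List every instance currently attached to the auto-scaling group.
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"][0]

end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

# Pull CPU per instance instead of the group-level aggregate,
# which is what masked the imbalance in the first place.
for instance in group["Instances"]:
    instance_id = instance["InstanceId"]
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Average"],
    )
    averages = [point["Average"] for point in stats["Datapoints"]]
    peak = max(averages) if averages else 0.0
    print(f"{instance_id}: peak hourly average CPU {peak:.1f}%")
```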


Further investigation showed that long-running batch jobs were executing only on this overloaded instance. A developer had introduced and scheduled these jobs without accounting for their effect on the shared infrastructure. The server had effectively been assigned a dual role - handling both user-facing web traffic and intensive background processing - with neither the auto-scaling group nor the load balancer aware of the extra workload.
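
The post does not describe exactly how the jobs were traced back to that instance. Purely as an illustration, ranking processes by cumulative CPU time on the suspect server (here with the third-party psutil library) is one way such a long-running batch job tends to surface.

```python
import psutil

# Rank processes by cumulative CPU time; a long-running batch job
# that has pinned the instance at 100% CPU will float to the top.
processes = []
for proc in psutil.process_iter(["pid", "name", "cpu_times"]):
    try:
        times = proc.info["cpu_times"]
        total = times.user + times.system
        processes.append((total, proc.info["pid"], proc.info["name"]))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue

for total, pid, name in sorted(processes, reverse=True)[:10]:
    print(f"{total:10.1f}s  pid={pid}  {name}")
```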


To resolve the issue, the overloaded server was removed from the auto-scaling group and repurposed as a dedicated job-processing node. This allowed the remaining servers to return to their sole responsibility of serving web traffic. In addition, the load balancer configuration was updated to use an algorithm that considers the number of outstanding requests per server, providing more intelligent traffic distribution than the blind round-robin approach.
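
The post does not name the load balancer, only that the new algorithm considers outstanding requests per server. If the environment were an AWS Application Load Balancer (an assumption made for the sake of example), the equivalent change is a single target-group attribute; the ARN below is a placeholder.

```python
import boto3

# Placeholder ARN; substitute the real target group.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/web-app/0123456789abcdef"
)

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Switch routing from the default round robin to least outstanding
# requests, which steers new requests toward less busy targets.
elbv2.modify_target_group_attributes(
    TargetGroupArn=TARGET_GROUP_ARN,
    Attributes=[
        {
            "Key": "load_balancing.algorithm.type",
            "Value": "least_outstanding_requests",
        }
    ],
)
```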


The result was a stable web application environment with consistent server performance and no recurrence of the 5XX failures. The root cause was identified quickly due to focused analysis of server-level metrics rather than relying solely on group-level summaries.

 
 
