DNS failures - CI/CD on Kubernetes

  • mooneya9
  • Feb 25, 2024
  • 2 min read

Updated: Jun 12

A software development company reached out with a persistent issue affecting their continuous integration pipelines. The client managed several CI/CD pipelines for customer projects, using Jenkins deployed on a Kubernetes cluster. The problem was that some pipelines were failing at random, leading to inconsistent software builds. Nightly builds were frequently unavailable, creating mounting frustration both internally and with their customers.


The client had already begun a root cause analysis. Their initial theory focused on the DNS servers used within their environment. They were using AWS-hosted domain controllers for DNS and believed that performance limitations might be contributing to the issue. They had begun considering upgrades to the instance types hosting these services.


When we analysed the system, we identified two core issues.


First, the Kubernetes cluster was under-provisioned in terms of DNS capacity. Specifically, there were not enough CoreDNS pods running to support the high query volume generated by the Jenkins workloads. The CoreDNS logs revealed an unexpectedly high volume of DNS traffic - over 100,000 requests per day. This level of traffic placed considerable strain on the limited number of CoreDNS pods deployed.


Second, and more critically, we discovered that each hostname lookup made by the CI/CD pipelines was fanning out into multiple redundant DNS queries. A single attempt to resolve a hostname could result in up to ten separate queries to CoreDNS. This behaviour was traced back to the DNS resolver settings in /etc/resolv.conf, specifically the search domains and the ndots option.


In Linux-based systems, the ndots setting controls how DNS queries are constructed when search domains are applied. If a hostname contains fewer dots than the ndots threshold, the resolver first tries it with each search domain appended and only then tries the name as written, so a single lookup turns into several queries, most of which fail. In this environment, the configuration was not optimised for the structure of the Bitbucket hostnames used by Jenkins, which were by far the most queried names.
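To make the fan-out concrete, here is a minimal Python sketch of how a glibc-style resolver orders its candidate names. The hostname, the search list and the ndots values are hypothetical and chosen only for illustration; the post does not give the client's actual settings. The sketch also assumes that only the name as written resolves and that the resolver sends both an A and an AAAA query for each candidate.

```python
def candidate_names(hostname, search_domains, ndots):
    """Return the names a glibc-style stub resolver would try, in order."""
    # A trailing dot marks the name as fully qualified: the search list is skipped.
    if hostname.endswith("."):
        return [hostname]
    absolute = hostname + "."
    expanded = [f"{hostname}.{domain}." for domain in search_domains]
    if hostname.count(".") >= ndots:
        # Enough dots: try the name as written first, search list as fallback.
        return [absolute] + expanded
    # Too few dots: every search domain is tried before the name itself.
    return expanded + [absolute]


def lookups_until_hit(hostname, search_domains, ndots, record_types=("A", "AAAA")):
    """Count queries sent up to and including the name as written.

    Assumes only that name resolves and each candidate costs one
    query per record type.
    """
    names = candidate_names(hostname, search_domains, ndots)
    position = names.index(hostname.rstrip(".") + ".") + 1
    return position * len(record_types)


# Hypothetical values, not taken from the client's environment.
SEARCH = ["ci.svc.cluster.local", "svc.cluster.local", "cluster.local", "example.internal"]
HOST = "bitbucket.org"

for ndots in (5, 1):
    first = candidate_names(HOST, SEARCH, ndots)[0]
    print(f"ndots={ndots}: first try {first}, queries={lookups_until_hit(HOST, SEARCH, ndots)}")
```

Under these assumptions, a two-label hostname with ndots set to 5 (the value Kubernetes writes for pods using the default ClusterFirst DNS policy) costs ten queries, while the same lookup with ndots set to 1 costs two. Appending a trailing dot to the hostname skips the search list regardless of ndots.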


We adjusted the ndots value in the resolver settings to better suit the expected domain format. After this change, we observed an 80% reduction in redundant DNS queries. The decreased load on CoreDNS immediately stabilised the DNS resolution process, and the CI/CD pipelines stopped failing due to lookup issues.
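The post does not say which mechanism was used to change ndots, or what the new value was. In Kubernetes, one documented option is the pod-level dnsConfig field, whose options are written into the pod's /etc/resolv.conf. The sketch below builds such a pod spec fragment in Python; the helper name and the value 2 are hypothetical.

```python
import json


def ndots_pod_fragment(ndots):
    """Build a pod spec fragment that sets the resolver's ndots option.

    Hypothetical helper: the right value depends on the hostnames
    the workload resolves most often.
    """
    # glibc silently caps ndots at 15, so reject values outside 0..15.
    if not 0 <= ndots <= 15:
        raise ValueError("ndots must be between 0 and 15")
    return {
        "spec": {
            "dnsConfig": {
                # Kubernetes expects option values as strings.
                "options": [{"name": "ndots", "value": str(ndots)}]
            }
        }
    }


# Example value only, not the one used for the client.
print(json.dumps(ndots_pod_fragment(2), indent=2))
```

The resulting fragment would be merged into the pod template used by the build agents. Lowering ndots means short in-cluster names with more dots than the new threshold are tried as written first, so it is worth checking how the pipelines address internal services before rolling the change out.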


This was a highly nuanced issue, buried deep within system-level behaviour, and one that is difficult to detect without detailed knowledge of DNS internals and Kubernetes runtime behaviour. Nonetheless, it was resolved efficiently, restoring the reliability of the client’s build pipelines and preventing further disruption to their development workflows.

 
 
