RDS database slow - storage layer
- mooneya9
- Mar 2, 2024
- 2 min read
Updated: Jun 12
In this case study we tackle performance issues plaguing an enterprise application responsible for processing millions of records daily. The application was used by a global enterprise with branches in several countries and held a central repository of client data along with the updates submitted by each branch.
The problem was that a file load would process fine up to a certain point, after which it became extremely slow. It was not clear from the application logs why the load was hanging once a certain amount of time had elapsed. The client's own technical teams had carried out an initial assessment, which confirmed there were no problems with resources such as CPU, memory, disk, or database indexes.
We started by reviewing the architecture documentation provided by the client to get up to speed with the overall design of the system. It consisted of three main components: EC2 virtual servers, an RDS database, and EFS (NFS storage). We reviewed the metrics for each component during the period of the file load. The initial analysis showed all three components looked healthy: no resource exhaustion on the EC2 servers, no performance bottleneck on the EFS storage, and plenty of spare CPU and memory on the RDS database.
We then decided to take a deeper dive into the less commonly viewed metrics on each infrastructure component. It was then that we noticed one RDS metric in particular hitting zero at the same time that the file load was slowing down. The metric in question was BurstBalance, which the AWS CloudWatch documentation describes as: "The percent of General Purpose SSD (gp2) burst-bucket I/O credits available."
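A metric like this can also be pulled programmatically rather than eyeballed in the console. The sketch below, using boto3, shows one way to query BurstBalance over a load window; the instance identifier, region, and time range are illustrative assumptions, not the client's real values:

```python
# Minimal sketch: fetch the RDS BurstBalance metric for the window of a file load.
# The instance identifier, region, and time range are illustrative assumptions.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-db-1"}],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,               # 5-minute datapoints
    Statistics=["Minimum"],
)

# A minimum near 0% during the load window points to burst-credit exhaustion.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```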
We confirmed that the storage layer of the RDS instance was indeed using GP2 storage. This storage type offers a baseline of 3 IOPS per GiB provisioned, so any volume smaller than 1,000 GiB has a baseline below 3,000 IOPS and must draw on burst credits to reach that level under heavy load. Those credits accumulate while the volume is idle or lightly used. The problem was that the file load was draining the credits to zero after a period of time, so once the surplus was consumed the IOPS available to the database dropped back to its low baseline.
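To get a feel for the numbers, here is a back-of-the-envelope calculation using AWS's published GP2 figures (a 5.4 million I/O credit bucket and a 3,000 IOPS burst ceiling); the 200 GiB volume size is an assumption for illustration, not the client's actual configuration:

```python
# Rough estimate of how long a sustained load can burst before GP2 credits run out.
# The volume size below is a hypothetical example, not the client's real value.
volume_gib = 200
baseline_iops = max(100, 3 * volume_gib)   # gp2 baseline: 3 IOPS/GiB, 100 IOPS floor
burst_iops = 3000                          # gp2 burst ceiling
credit_bucket = 5_400_000                  # initial/maximum I/O credit balance

# Credits drain at (consumption minus accrual) while bursting flat out.
drain_rate = burst_iops - baseline_iops    # credits spent per second
seconds_to_empty = credit_bucket / drain_rate

print(f"Baseline: {baseline_iops} IOPS")
print(f"Burst lasts roughly {seconds_to_empty / 60:.0f} minutes at full load")
# For 200 GiB: baseline 600 IOPS, ~37.5 minutes of full-speed burst before throttling.
```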
The root cause was now clear: the database was being bottlenecked by storage throughput due to credit exhaustion on GP2 volumes.
Two solutions were considered:
- Increase the storage volume to at least 1,000 GiB, which would permanently raise the baseline IOPS and prevent credit depletion.
- Migrate the RDS instance to GP3 storage, which provides a consistent baseline of 3,000 IOPS regardless of volume size and allows throughput and IOPS to be scaled independently.
The GP3 option was chosen as it offered better performance and cost efficiency without requiring unnecessary over-provisioning of storage capacity. The transition from GP2 to GP3 was straightforward, and once completed, the system no longer experienced performance degradation during file loads.
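For reference, the storage type change itself is a single modification call. A minimal sketch with boto3 is shown below; the instance identifier and region are hypothetical placeholders, and on a busy production database you might prefer to let the change wait for a maintenance window rather than applying it immediately:

```python
# Minimal sketch of switching an RDS instance's storage from GP2 to GP3.
# The instance identifier and region are illustrative assumptions.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

rds.modify_db_instance(
    DBInstanceIdentifier="prod-db-1",
    StorageType="gp3",
    ApplyImmediately=True,   # otherwise the change waits for the next maintenance window
)

# The instance stays available during the volume modification, though it can sit in the
# "storage-optimization" state for a while before further storage changes are permitted.
```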
This case highlights the importance of understanding underlying storage mechanics within managed services like RDS. While high-level metrics may appear healthy, nuanced limits such as burst capacity can be the hidden cause of systemic issues. Identifying and addressing these subtle bottlenecks is key to ensuring consistent performance in cloud-native architectures.