How I Optimized an Overloaded Database and Reduced Costs by 70% 🚀
How I tackled a high CPU utilization issue, optimized queries, and drastically reduced infrastructure costs
Joining a new production team always comes with surprises, but I never expected my first major challenge to be a rapidly growing database that kept demanding more resources. Here's how I tackled a high CPU utilization issue, optimized queries, and drastically reduced infrastructure costs—plus the key lessons I learned along the way.
The Growing Database Problem 📈
When I joined the team, our AWS RDS database was already 160GB in size, and we had to increase storage by 50GB every two weeks due to high application usage. Everything seemed manageable—until we hit a major CPU spike (99.80%) that wouldn't come down.
At first, I assumed the database couldn’t handle the load, so I upgraded the RDS instance. The system ran smoothly for a month, but then... the same issue reappeared. Another upgrade. And another.
When the client noticed a $1,000 jump in AWS costs, I was finally given one day to investigate and try to solve the issue while maintaining performance.
The Investigation: Finding the Culprit 🕵️
With limited time, I focused on the most obvious suspects:
✅ Slow API queries – Some queries took longer than expected. I optimized them.
✅ Background jobs – Checked their execution time and impact on the system.
This helped, but CPU usage was still hovering around 80%, SQS message processing had slowed down, and AWS RDS Performance Insights showed slow queries still causing trouble, especially an INSERT query. 🤔
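As a side note: Performance Insights did the heavy lifting for me, but you can surface the same offenders from inside Postgres if the pg_stat_statements extension is enabled (RDS PostgreSQL supports it). A minimal sketch, assuming PostgreSQL 13+ column names:

```sql
-- Top 10 statements by total execution time.
-- Assumes pg_stat_statements is enabled; on versions before 13,
-- use total_time / mean_time instead of total_exec_time / mean_exec_time.
SELECT
    calls,
    round(total_exec_time::numeric, 1) AS total_ms,
    round(mean_exec_time::numeric, 1)  AS mean_ms,
    rows,
    left(query, 120)                   AS query_snippet
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```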
The real breakthrough came 10 days later, when I finally got time to dig deeper into the query execution plans. I carefully investigated every query that ran during SQS message processing. When I ran one of the slowest in pgAdmin, I was surprised to find a SELECT query built from multiple Common Table Expressions (CTEs) that was making the query planner struggle. It took ~12 seconds to execute, far too long for a high-traffic production system!
🚨 The CTE-heavy query was forcing the planner and executor to do far more work than necessary, creating needless CPU overhead.
💡 Solution? Optimize the query.
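To illustrate the pattern (with made-up table names like orders and order_events, not the actual query), here is a simplified, hypothetical before/after: a chain of CTEs that makes the database compute far more than the outer query actually needs, versus a rewrite that filters first.

```sql
-- Hypothetical "before": on PostgreSQL versions before 12 (or with MATERIALIZED
-- CTEs), each CTE is computed in full before the outer WHERE can filter it.
WITH recent_orders AS (
    SELECT * FROM orders
),
order_totals AS (
    SELECT o.id, sum(oe.amount) AS total
    FROM recent_orders o
    JOIN order_events oe ON oe.order_id = o.id
    GROUP BY o.id
)
SELECT * FROM order_totals WHERE id = 12345;

-- Hypothetical "after": filter first, then aggregate only the rows you need.
SELECT o.id, sum(oe.amount) AS total
FROM orders o
JOIN order_events oe ON oe.order_id = o.id
WHERE o.id = 12345
GROUP BY o.id;
```

On PostgreSQL 12+, marking a CTE as NOT MATERIALIZED can let the planner inline it and achieve a similar effect without restructuring the whole query.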
After carefully rewriting the query, the results were shocking:
✅ CPU usage dropped from 99.80% to ~2%
✅ No more need for expensive instance upgrades
✅ I downgraded the database by 3 instance levels
✅ Even at peak traffic, CPU stayed below 30%
Key Lessons Learned 💡
This experience taught me a lot, and I want to share some key learnings to help others avoid similar pitfalls:
✅ Don’t upgrade instances blindly – Performance issues are often due to inefficient queries, not hardware limits.
✅ Use AWS RDS Performance Insights – It’s a lifesaver in identifying slow queries.
✅ Regularly analyze slow queries – Running EXPLAIN ANALYZE on them can reveal hidden inefficiencies (a quick sketch follows this list).
✅ Background jobs can be silent killers – Slow background jobs (e.g., SQS workers) can overload the system. Monitor them closely.
✅ Optimize indexes & partitions – Large tables can cause bottlenecks. Partitioning can significantly improve query speed.
✅ Use Materialized Views – For frequently run queries over data that doesn't change on every request, materialized views can deliver significant performance gains by storing the query result for fast retrieval (also sketched after this list).
✅ Archive old data – Keeping too much unnecessary data in the active database increases storage and slows down queries.
✅ Cost optimization matters – A well-optimized database can reduce costs without sacrificing performance.
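For the EXPLAIN ANALYZE point above, here is what that looks like in practice. The query and tables are placeholders, not the real ones from this incident:

```sql
-- EXPLAIN ANALYZE actually executes the query and reports real timings per
-- plan node; BUFFERS adds shared-buffer hit/read counts to help spot heavy I/O.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, sum(oe.amount) AS total
FROM orders o
JOIN order_events oe ON oe.order_id = o.id
WHERE o.created_at >= now() - interval '7 days'
GROUP BY o.id;
```

Watch for sequential scans on big tables, row estimates that are wildly off, and plan nodes where the actual time dominates the total.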
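And for the materialized view point, a minimal sketch with hypothetical names:

```sql
-- Precompute an expensive aggregation once, then read it like a regular table.
CREATE MATERIALIZED VIEW daily_order_totals AS
SELECT date_trunc('day', created_at) AS day,
       count(*)                      AS orders,
       sum(amount)                   AS revenue
FROM order_events
GROUP BY 1;

-- A unique index allows REFRESH ... CONCURRENTLY, so reads aren't blocked.
CREATE UNIQUE INDEX ON daily_order_totals (day);

-- Re-run the underlying query on whatever schedule fits (pg_cron, a worker, etc.).
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_order_totals;
```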
Next Steps: Database Cleanup & Optimization 🔧
Now that the system is stable, my next plan is to:
📌 Archive old data to reduce database size.
📌 Implement table partitioning for faster range queries (a rough sketch of both steps follows this list).
📌 Continuously monitor slow queries to prevent similar issues in the future.
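Since these are still plans, here is only a rough sketch of what declarative range partitioning plus a simple archive step could look like in PostgreSQL; all table names are placeholders:

```sql
-- Range partitioning by created_at: queries with a date filter only touch the
-- relevant partitions, and old partitions can be detached cheaply.
CREATE TABLE order_events_partitioned (
    order_id   bigint      NOT NULL,
    amount     numeric     NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE order_events_2025_01 PARTITION OF order_events_partitioned
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Archiving: detach an old partition to take it out of the hot path.
-- It can then be dumped (e.g., to S3) and dropped.
ALTER TABLE order_events_partitioned DETACH PARTITION order_events_2025_01;
```

For tables that aren't partitioned, the same idea works with batched INSERT ... SELECT into an archive table followed by a DELETE of the moved rows.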
Final Thoughts 💭
What seemed like a scaling issue turned out to be a query optimization problem. This journey reinforced an important lesson—throwing more hardware at a problem is not always the answer. Sometimes, all it takes is digging deeper and optimizing queries efficiently.
I hope this story helps others dealing with database performance issues. If you’ve faced a similar situation, I’d love to hear how you tackled it! Let’s discuss in the comments. 🚀