Series-B SaaS platform — Case Study

The situation

The platform had scaled fast on a single large GKE cluster shared across every team. The monthly bill had roughly tripled in a year while revenue had not, and finance was asking questions nobody on the engineering side could answer — because there was no breakdown of cost by team, service or environment.

What we did

We worked in the order that protects the savings:

Visibility first. Labelled workloads by team and environment, then built a Grafana view showing spend and utilization side by side. For the first time, every team could see its own footprint.
Right-sized requests. Two weeks of usage data showed most services requested 4–6× the CPU they used at p95. We tuned requests and limits service by service, with the owning teams in the loop.
Reshaped node pools. With honest requests, the cluster bin-packed onto fewer, better-matched machine types, and bursty batch work moved to its own preemptible pool.
Scheduled non-prod to sleep. Staging and preview environments scaled to zero overnight and on weekends.

The result

Cluster cost dropped 52% over eight weeks, with no impact on production latency or reliability. More importantly, the showback dashboards and budget alerts we left behind mean the savings haven’t crept back — each team now owns its number.