Hello 👋
Welcome to another week — and another opportunity to grow into a strong, confident DevOps, Infrastructure, or Platform Engineer.
Today’s issue is brought to you by The Engineering Ladder — where we share practical, career-shaping lessons in DevOps and Software Engineering to help you level up with clarity and direction.
💡 PS: Before we dive into today’s topic, I want to quickly share something important with you…
If you’ve been following The Engineering Ladder, you already know one thing I believe deeply:
👉 Real tech careers are built on evidence, not just interest.
That belief is exactly why we built CloudOps Academy.
CloudOps Academy is a hands-on training program for DevOps Engineers, Infrastructure Engineers, and Platform Engineers who want more than theory.
We focus on helping engineers build real systems, understand how production environments work, and gain the confidence to perform in real roles — not just pass interviews.
At CloudOps Academy, you don’t just “learn tools.”
You learn how to:
✅ Design and operate real cloud infrastructure
✅ Work with Docker, CI/CD, monitoring, and automation the way teams do in production
✅ Think like a reliability-focused engineer, not just a script writer
✅ Build projects you can confidently explain in interviews
✅ Grow from uncertainty to clarity with structured guidance and mentorship
Our goal is simple:
to help you become job-ready, confident, and credible as an engineer.
If you’re serious about building a strong DevOps or Cloud career — and you want guidance from engineers who are actively working in the field — we’d love to talk.
📞 Phone: +237 653 583 000
📧 Email: [email protected]
No pressure.
Just clarity on whether CloudOps Academy is the right next step for you.
Now, let’s get into today’s lesson 👇
It was a Saturday afternoon.
I was not at the office. Nobody was supposed to be working. We had just finished a heavy week and the plan was a quiet weekend before another sprint kicked off on Monday.
Then the alerts started.
First one. Then three. Then twelve in under two minutes.
Consumer lag on our Kafka cluster had crossed 200,000 events. By the time I opened my laptop it was at 600,000. By the time I got the team on a call — 1.4 million. It was climbing faster than I had ever seen.
Somewhere in our system, messages were being produced faster than anything was consuming them. And the gap was widening every second.
That Saturday turned into a seven-hour war room. And by the end of it, after we had pulled the cluster back from the edge, I had learned more about Kafka, system design under pressure, and engineering humility than in the six months before it.
This is that story.
First — A Quick Explanation of the Problem
If you have never worked with a message queue before, here is the simplest way to think about it.
Imagine a busy restaurant kitchen. Orders come in from the front of house and get placed on a ticket rail. The cooks grab tickets, prepare the food, and clear them off the rail. As long as the cooks are working at roughly the same pace as the orders coming in, the kitchen runs fine.
Now imagine the restaurant gets slammed. Orders pour in three times faster than the kitchen can handle them. The ticket rail fills up. Then it overflows. Tickets start falling on the floor. Customers who ordered thirty minutes ago are still waiting. The kitchen is overwhelmed and falling further behind with every passing minute.
That is consumer lag. Your ticket rail is Kafka. Your cooks are your consumer services. And when the rail gets long enough — things start breaking in ways that are not immediately obvious.
In our case at Gozem, the message queue was the backbone of several critical flows. Driver location updates. Ride status changes. Payment event processing. Notification triggers. All of it flowed through Kafka. When that cluster started falling behind, it was not one thing that broke. It was everything, quietly, at the same time.
What Actually Happened
We had just rolled out a new feature that generated significantly more events than we anticipated.
Every time a driver's location updated — which happens every few seconds per driver — we were now publishing three events instead of one. We had tested this in staging. Staging had forty drivers. Production had thousands.
The math was not complicated in hindsight. But in the rush of the release, nobody had done it.
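The back-of-the-envelope check that got skipped fits in a few lines. The sketch below is illustrative only: the 5,000-driver fleet and the 4-second update interval are assumed numbers, not our real figures. Only the forty staging drivers and the jump from one event to three per update come from the story above.

```java
// Rough event-rate estimate. The fleet size and update interval are
// hypothetical inputs chosen for illustration.
final class EventRateEstimate {

    /** Events per second = drivers * events per update / seconds between updates. */
    static double eventsPerSecond(int drivers, int eventsPerUpdate, double updateIntervalSeconds) {
        return drivers * eventsPerUpdate / updateIntervalSeconds;
    }

    public static void main(String[] args) {
        double interval = 4.0;          // assumed: one location update every 4 seconds
        int productionDrivers = 5_000;  // assumed fleet size

        System.out.printf("staging, new feature:    %.0f events/s%n",
                eventsPerSecond(40, 3, interval));
        System.out.printf("production, old feature: %.0f events/s%n",
                eventsPerSecond(productionDrivers, 1, interval));
        System.out.printf("production, new feature: %.0f events/s%n",
                eventsPerSecond(productionDrivers, 3, interval));
    }
}
```

With those assumed inputs, staging produces 30 events per second, while production jumps from 1,250 to 3,750. A staging test at that scale says almost nothing about whether consumers sized for the old rate can keep up with the new one.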
Within six hours of the feature going live on a Saturday morning, our consumer services — which had been keeping up fine — started falling behind. Slowly at first. Then faster.
The problem was not just the volume. It was what happened downstream when consumers fell behind.
Our ride status service consumed events to update booking states in the database. When it fell behind, ride statuses stopped updating in real time. Drivers were completing trips that the system still showed as active. Passengers were being charged for rides that had ended ten minutes ago. Customer support started getting calls.
Our payment service consumed events to trigger settlement jobs. When it fell behind, settlements queued up. Finance started seeing reconciliation gaps.
Our notification service consumed events to send SMS updates. When it fell behind, passengers stopped receiving "your driver is arriving" messages. More support calls.
One queue. One team falling behind. Six downstream systems showing symptoms that looked completely unrelated to anyone who did not know where to look.
The Diagnosis
The first thing we did when we got everyone on the call was resist the urge to immediately start changing things.
This sounds obvious. Under pressure it is genuinely hard. When something is broken and getting worse, every instinct says do something. The right instinct is understand first.
We pulled up our Kafka monitoring dashboard — we were using Confluent at the time, with custom Grafana panels on top. We looked at three numbers:
Messages per second being produced. Still climbing. The new feature was still running.
Messages per second being consumed. Roughly flat. Our consumers were at maximum throughput.
Consumer lag per topic partition. This told us which consumer groups were furthest behind and which topics were most affected.
Within fifteen minutes we had a clear picture. The location update topic was the source. Three consumer groups were behind. The ride status consumer was the most critical — 900,000 events behind and falling further back every minute.
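If you have not worked with these metrics before, the relationship between the three numbers can be sketched roughly as follows. Lag for a partition is the log end offset minus the consumer group's committed offset, and lag grows at the produce rate minus the consume rate. The class, the method names, and the sample values are made up for illustration; they are not taken from our dashboards.

```java
import java.util.Map;

final class LagMath {

    /** Lag for one partition: newest offset in the log minus the group's committed offset. */
    static long partitionLag(long logEndOffset, long committedOffset) {
        return Math.max(0, logEndOffset - committedOffset);
    }

    /** Total lag for a consumer group. Map value is {logEndOffset, committedOffset}. */
    static long totalLag(Map<Integer, long[]> offsetsByPartition) {
        long total = 0;
        for (long[] offsets : offsetsByPartition.values()) {
            total += partitionLag(offsets[0], offsets[1]);
        }
        return total;
    }

    /** Net change in lag per second; a positive value means falling further behind. */
    static double lagGrowthPerSecond(double producedPerSecond, double consumedPerSecond) {
        return producedPerSecond - consumedPerSecond;
    }

    /** Seconds needed to drain the backlog; infinite if consumers do not outpace producers. */
    static double secondsToDrain(long lag, double producedPerSecond, double consumedPerSecond) {
        double net = consumedPerSecond - producedPerSecond;
        return net <= 0 ? Double.POSITIVE_INFINITY : lag / net;
    }
}
```

The last method is the one that matters during an incident: as long as the produce rate is at or above the consume rate, the drain time is infinite, no matter how hard the consumers work.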
We also found something else. A subset of events were malformed — the new feature had a serialization edge case that produced occasional bad messages. Those messages were not being skipped. Each consumer was trying to process them, failing, retrying, failing again, and effectively getting stuck on bad events while the good ones piled up behind them.
That was the real crisis inside the crisis.
What We Did — Step by Step
Step one — stop the bleeding.
We rolled back the new feature. Not because the feature was wrong — but because continuing to produce events at that rate while we were already behind made recovery impossible. You cannot bail out a sinking boat while someone is still drilling holes in it.
# Scale down the new feature's deployment
kubectl scale deployment location-event-publisher --replicas=0
Production rate dropped immediately. The gap stopped widening.
Step two — handle the bad messages.
We needed the consumers to skip the malformed events without crashing or retrying forever. This is exactly what a dead letter queue is for.
A dead letter queue — or DLQ — is a separate topic where messages go when they cannot be processed successfully after a defined number of retries. Instead of blocking the consumer forever, the bad message gets moved aside. Processing continues. You deal with the bad messages separately and deliberately.
We configured our consumers to send failed messages to a DLQ after three retries:
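import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;

@KafkaListener(topics = "location-updates")
public void consume(LocationEvent event, Acknowledgment ack) {
    // Let exceptions propagate. This assumes the listener container has an
    // error handler with a dead-letter recoverer configured: it retries the
    // record and, after max retries, sends it to the DLQ, so the consumer
    // continues with the next message.
    processEvent(event);
    // With manual ack mode the listener must acknowledge explicitly,
    // otherwise offsets are never committed.
    ack.acknowledge();
}

# In application config
spring.kafka.consumer.properties.max.poll.interval.ms=300000
spring.kafka.listener.ack-mode=manual

Once the DLQ was routing bad messages away, our consumers started moving again.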
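Stripped of the framework, the retry-then-dead-letter behavior looks roughly like this. It is a plain-Java simulation, not Spring Kafka code and not our production consumer: the in-memory lists stand in for topics, and `DlqSketch`, `process`, and the "malformed" prefix are hypothetical names used only for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class DlqSketch {
    static final int MAX_RETRIES = 3; // retries after the first failed attempt

    /**
     * Processes messages in order. A message that still fails after
     * MAX_RETRIES retries is moved to the dead letter list, and processing
     * continues with the next message instead of blocking.
     */
    static List<String> consumeAll(List<String> messages,
                                   Consumer<String> handler,
                                   List<String> deadLetters) {
        List<String> processed = new ArrayList<>();
        for (String message : messages) {
            boolean ok = false;
            for (int attempt = 0; attempt <= MAX_RETRIES && !ok; attempt++) {
                try {
                    handler.accept(message);
                    ok = true;
                } catch (RuntimeException e) {
                    // Swallow and retry until the attempts are exhausted.
                }
            }
            if (ok) {
                processed.add(message);
            } else {
                deadLetters.add(message);
            }
        }
        return processed;
    }

    /** Hypothetical handler: rejects anything marked as malformed. */
    static void process(String message) {
        if (message.startsWith("malformed")) {
            throw new IllegalArgumentException("cannot deserialize: " + message);
        }
    }
}
```

The important property is in the outer loop: a bad message costs a bounded number of attempts and then gets out of the way. Without the dead letter list, the only options are to drop the message silently or to stay stuck on it.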
Step three — scale up consumers to burn down the backlog.
With the production rate stopped and bad messages no longer blocking progress, we needed to consume the 1.4 million pending events as fast as possible.
Kafka partitions are the unit of parallelism. You can only have as many active consumers in a consumer group as you have partitions on the topic. Our location updates topic had 12 partitions. We were running 3 consumer instances — meaning each instance was handling 4 partitions.
We scaled to 12 consumer instances — one per partition. Maximum parallelism.
kubectl scale deployment location-event-consumer --replicas=12
Consumption rate tripled immediately. The lag started dropping.
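The partition arithmetic can be checked with a small sketch. It spreads partitions across consumers round-robin, which is a simplification of what Kafka's real assignors do, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionSpread {

    /** Returns, for each consumer index, the partitions it would own (consumers must be > 0). */
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            owned.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            owned.get(p % consumers).add(p);
        }
        return owned;
    }

    /** Consumers that receive no partition and therefore sit idle. */
    static long idleConsumers(int partitions, int consumers) {
        return assign(partitions, consumers).stream().filter(List::isEmpty).count();
    }
}
```

With 12 partitions, 3 consumers own 4 partitions each and 12 consumers own one each. A hypothetical 15 consumers would leave 3 of them idle, which is why scaling past the partition count buys nothing.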
Step four — monitor the burn-down carefully.
We did not walk away. We watched the lag metric in Grafana every few minutes. We were looking for two things — that lag was consistently decreasing, and that our database was handling the increased write throughput from the consumers catching up.
This is where backpressure matters. When 12 consumers are all processing at full speed and all writing to the same database, you can trade one problem for another — a database that cannot keep up with the sudden write surge.
We had a read replica and PgBouncer connection pooling in place, which absorbed most of it. But we kept an eye on database CPU and connection counts throughout. If either had spiked dangerously, we would have scaled consumers back down and accepted a slower recovery.
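The rule we were applying by hand can be written down as a tiny decision function. The thresholds here (80% CPU, 90% of the connection limit) and the halving step are assumptions chosen for the example, not the values we actually used.

```java
final class BurnDownGuard {
    static final double MAX_DB_CPU = 0.80;           // assumed threshold
    static final double MAX_CONNECTION_RATIO = 0.90; // assumed threshold

    /** True when the database looks too stressed to keep consuming at full speed. */
    static boolean databaseUnderPressure(double cpuUtilization, int connections, int maxConnections) {
        return cpuUtilization > MAX_DB_CPU
                || (double) connections / maxConnections > MAX_CONNECTION_RATIO;
    }

    /** Halves the consumer count under pressure (never below 1); otherwise keeps it. */
    static int nextReplicaCount(int currentReplicas, double cpuUtilization,
                                int connections, int maxConnections) {
        if (databaseUnderPressure(cpuUtilization, connections, maxConnections)) {
            return Math.max(1, currentReplicas / 2);
        }
        return currentReplicas;
    }
}
```

Writing the rule down ahead of time is the point. During the incident, nobody should be debating what "spiked dangerously" means.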
About four hours after we started the burn-down, consumer lag hit zero.
The cluster was back.
What We Changed After That Saturday
We came back on Monday and spent the first two hours writing up a proper post-mortem. No blame. Just facts, timeline, root causes, and action items.
The changes we made:
Load testing for event volume, not just API volume. Our staging environment now runs load tests that simulate production-level event rates before any feature that touches Kafka goes live.
DLQ configured by default on every consumer. Not optional. Not something individual teams decide. Every consumer group has a dead letter queue. Every DLQ has an alert. Every alert has a runbook.
Consumer lag alerts at three thresholds. Warning at 10,000 events. Critical at 100,000. Page the on-call engineer at 500,000. We would have caught this hours earlier with those numbers in place.
Partition count reviewed before topics are created. We had under-partitioned the location updates topic for the scale we were running at. More partitions means more parallelism when you need it. You can add partitions to an existing Kafka topic, but doing so changes which partition each key maps to, which disrupts per-key ordering, and the partition count can never be reduced. So we now size topics for future scale, not current scale.
A runbook specifically for consumer lag incidents. Step by step. Written when you are calm. Designed to be followed when you are not.
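The three lag thresholds above translate directly into code. This is a minimal sketch of the classification only; the class and enum names are illustrative, and wiring it to a real alerting system is left out.

```java
final class LagAlert {
    enum Level { OK, WARNING, CRITICAL, PAGE }

    /** Maps a consumer lag value to an alert level using the thresholds listed above. */
    static Level classify(long lag) {
        if (lag >= 500_000) return Level.PAGE;     // page the on-call engineer
        if (lag >= 100_000) return Level.CRITICAL;
        if (lag >= 10_000) return Level.WARNING;
        return Level.OK;
    }
}
```

A lag of 200,000, the level we first saw that Saturday, is already CRITICAL under these thresholds, and the warning would have fired long before that.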
This Week's Challenge
✅ If you are running Kafka or any message queue in production — pull up your consumer lag metrics right now. What is the current lag across your consumer groups? Do you even have visibility into this?
✅ Check whether your consumers have a dead letter queue configured. If a bad message hits your consumer today, what happens? Does it block? Does it retry forever? Does it get skipped silently?
✅ Pick your most critical consumer and ask: if lag hit 100,000 events on this topic right now, what would break downstream? Write that answer down. That is your starting point for a runbook.
Final Thoughts
That Saturday was painful. Seven hours of stress, of watching numbers climb, of making decisions under pressure with incomplete information.
But it was also one of the most clarifying experiences I have had as an engineer.
Because when a system falls apart in front of you and you have to bring it back piece by piece — you understand it at a level that no documentation, no tutorial, no certification ever gives you.
The 1.4 million event backlog taught us more about our own system than two years of normal operation had.
Build the alerts before you need them. Write the runbooks before the incident. Configure the DLQs before the bad messages arrive.
Because they will arrive. The only question is whether you are ready when they do.
The system will break. Your preparation is what decides how long it stays broken.
P.S. If you found this helpful, share it with a friend or colleague who’s on their DevOps or Software Engineering journey. Let’s grow together!
Got questions or thoughts? Reply to this newsletter. We’d love to hear from you!
See you next week.
Looking for structured, expert-led mentorship to accelerate your Cloud or DevOps career?
Visit consult.akumblaiseacha.com — where I work 1:1 with aspiring and experienced tech professionals to help them build real skills, grow their career, and land the opportunities they deserve.
From personalized career roadmaps and hands-on project guidance, to interview prep, LinkedIn positioning, and job search strategy — everything is tailored to your specific goals and timeline.
No cohorts. No pre-recorded content. Just direct, focused mentorship from a Senior DevOps Engineer with years of real-world, production experience.
👉 Book your session today → consult.akumblaiseacha.com
Weekly Backend and DevOps Engineering Resources
The DevOps Career Roadmap: A Guide to Becoming a World Class DevOps Engineer by Akum Blaise Acha
API Versioning 101 for Backend Engineers by Akum Blaise Acha
System Design 101: Understanding Database Sharding by Akum Blaise Acha
Why Engineers Should Embrace the Art of Writing by Akum Blaise Acha
From Good to Great: Backend Engineering by Akum Blaise Acha
System Design 101: Understanding Caching by Akum Blaise Acha


