Hello 👋
Welcome to another week — and another opportunity to grow into a strong, confident DevOps, Infrastructure, or Platform Engineer.
Today’s issue is brought to you by The Engineering Ladder — where we share practical, career-shaping lessons in DevOps and Software Engineering to help you level up with clarity and direction.
💡 PS: Before we dive into today’s topic, I want to quickly share something important with you…
If you’ve been following The Engineering Ladder, you already know one thing I believe deeply:
👉 Real tech careers are built on evidence, not just interest.
That belief is exactly why we built CloudOps Academy.
CloudOps Academy is a hands-on training program for DevOps Engineers, Infrastructure Engineers, and Platform Engineers who want more than theory.
We focus on helping engineers build real systems, understand how production environments work, and gain the confidence to perform in real roles — not just pass interviews.
At CloudOps Academy, you don’t just “learn tools.”
You learn how to:
✅ Design and operate real cloud infrastructure
✅ Work with Docker, CI/CD, monitoring, and automation the way teams do in production
✅ Think like a reliability-focused engineer, not just a script writer
✅ Build projects you can confidently explain in interviews
✅ Grow from uncertainty to clarity with structured guidance and mentorship
Our goal is simple:
to help you become job-ready, confident, and credible as an engineer.
If you’re serious about building a strong DevOps or Cloud career — and you want guidance from engineers who are actively working in the field — we’d love to talk.
📞 Phone: +237 653 583 000
📧 Email: [email protected]
No pressure.
Just clarity on whether CloudOps Academy is the right next step for you.
Now, let’s get into today’s lesson 👇
It was a Friday afternoon.
The team had been working on a new feature for three weeks. Everyone was excited. The product manager had already told the client it would be live before the weekend.
The engineer ran the deployment.
The app went down.
Not for seconds. For 22 minutes.
Users were mid-transaction. Sessions were lost. The support inbox filled up. The client called. And the engineer who ran the deployment sat staring at their screen with that very specific feeling that every engineer knows — the one where your stomach drops and time slows down.
The deployment itself was not wrong. The code worked. The feature was good. But the way it was deployed killed the application for everyone who was using it at that exact moment.
This is one of the most common and most preventable problems in software engineering.
And today we are going to fix it, from the ground up.
What Zero-Downtime Deployment Actually Means
Let me be very clear about this from the start.
Zero-downtime deployment means that when you push new code to production, users who are actively using your application feel nothing. No error page. No spinning loader. No lost form data. Nothing.
The app keeps running. Requests keep getting served. And somewhere in the background, the new version of your code quietly replaces the old one.
This sounds simple. It is not. It requires you to think carefully about three things:
Your application. How does it start up? How long does it take? What happens to requests that are in flight when you stop the old version?
Your database. If your new code expects a new column that does not exist yet, it will crash. If your migration removes a column the old code still reads, the old version crashes during the migration.
Your traffic. How do you stop sending new requests to the old version while still letting it finish the requests it is already handling?
We are going to walk through all three. And we are going to use a real tool — Docker Swarm — to show you exactly how this works in practice.
Why Docker Swarm?
Before we go further, let me explain the tool we are using and why.
Docker Swarm is Docker's built-in clustering and orchestration tool. It lets you run your application across multiple servers and manage everything from one place.
A lot of engineers jump straight to Kubernetes when they hear "container orchestration." Kubernetes is powerful, but it is also genuinely complex. For a team that is just getting serious about deployments, Docker Swarm gives you 80% of what you need with 20% of the complexity.
It has rolling updates built in. It handles health checks. It knows how to drain traffic from a container before stopping it. And if you already know Docker, the learning curve is small.
For our scenario today, we are going to use Docker Swarm to deploy a Node.js API. The same principles apply to any language or framework.
The Scenario
You work at a startup. You have a Node.js REST API running in production. It handles user orders for an e-commerce platform. There are real users placing orders right now, at this very moment.
Your team has just finished building a new feature — an order tracking endpoint. You need to deploy it without interrupting a single active user.
Here is your starting setup:
Production server: 1 manager node, 2 worker nodes
Current app version: v1.0
New app version: v2.0
Database: PostgreSQL
Traffic: ~200 requests per minute

Let us build the pipeline step by step.
Step One: Handle Your Database Migration First
This is the step most engineers get wrong. And it is the step that causes the most downtime.
Here is the problem. When you deploy new code, there is a window — even if it is small — where the old version and the new version of your application are both running at the same time. This is actually intentional in a zero-downtime deployment. But it creates a dangerous situation with your database.
If your new code adds a column called tracking_number to the orders table, and you run that migration at the same time as the deployment, here is what happens:
The migration runs. The new column exists. The new version of your app works fine. But the old version — still running, still serving requests — does not know about that column. It runs a query that inserts a new order without providing tracking_number. If that column has a NOT NULL constraint, the query fails. Orders stop being created. Users get errors.
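To make that failure concrete, here is a hypothetical sketch of the kind of query v1.0 would still be running during the rollout (the column names are illustrative, matching this article's example):

```sql
-- v1.0 code, still running during the rollout, inserts orders the old way:
INSERT INTO orders (user_id, total) VALUES (42, 19.99);

-- If tracking_number had been added with NOT NULL and no default,
-- this insert would fail with a PostgreSQL error along the lines of:
-- ERROR: null value in column "tracking_number" violates not-null constraint
```

The old code did nothing wrong. The migration simply changed the rules underneath it.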
The solution is a pattern called expand and contract.
Expand first. Before you deploy the new code, run a migration that only adds things. New columns should be nullable. New tables can be empty. Nothing is removed. Nothing is changed. The old code still works perfectly because nothing it depends on has been touched.
```sql
-- Run this BEFORE deploying v2.0
ALTER TABLE orders ADD COLUMN tracking_number VARCHAR(100);
-- No NOT NULL constraint yet. Old code ignores this column completely.
-- New code can write to it. Both versions work.
```

Deploy the new code. Now both old and new versions of your app can run safely. The new version writes tracking_number. The old version ignores it. No crashes.
Contract later. Once the old version is completely gone and the new version is fully running, you can come back and add the NOT NULL constraint, remove old columns, or clean up anything that is no longer needed.
```sql
-- Run this AFTER deployment is complete and v1.0 is fully gone
ALTER TABLE orders ALTER COLUMN tracking_number SET NOT NULL;
```

This two-step approach means your database migration and your deployment are never a single scary event happening at the same moment.
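One detail worth flagging about the contract step: if any rows were written while the column was still nullable, SET NOT NULL will fail until they are backfilled. A minimal sketch, assuming a placeholder value that makes sense for your domain:

```sql
-- Backfill rows created before v2.0 started populating the column.
-- 'UNTRACKED' is an illustrative placeholder; choose a real default.
UPDATE orders SET tracking_number = 'UNTRACKED' WHERE tracking_number IS NULL;
```

Only after this backfill succeeds is it safe to add the NOT NULL constraint.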
Step Two: Build and Tag Your Docker Image
Now that the database is ready, build your new application image.
```shell
# Build the new version
docker build -t your-registry/orders-api:v2.0 .

# Push it to your container registry
docker push your-registry/orders-api:v2.0
```

Tag your images with version numbers. Never deploy latest to production. latest is a moving target: you can never tell exactly what is running, and rolling back becomes guesswork.
With v1.0 and v2.0 clearly tagged, you always know what is deployed and you can roll back to a specific known version in seconds.
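If you want to enforce the "never deploy latest" rule mechanically, a small guard in your deploy script can refuse bad tags before anything reaches Swarm. This is a sketch with a hypothetical helper name, not part of Docker itself:

```shell
# Guard: only allow images pinned to an explicit version tag like :v2.0.
validate_tag() {
  case "$1" in
    *:latest) echo "refusing to deploy '$1': pin an explicit version tag" >&2; return 1 ;;
    *:v[0-9]*) return 0 ;;
    *) echo "refusing to deploy '$1': missing a version tag like :v2.0" >&2; return 1 ;;
  esac
}
```

Usage in a deploy script: `validate_tag "your-registry/orders-api:v2.0" && docker service update --image "your-registry/orders-api:v2.0" orders-api`.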
Step Three: Set Up Docker Swarm
If you have not initialised your Swarm yet, this is how you do it on your manager node:
```shell
# On your manager node
docker swarm init --advertise-addr <manager-ip>
```

This gives you a token. Use it to join your worker nodes:
```shell
# On each worker node
docker swarm join --token <token> <manager-ip>:2377
```

Check that everything is connected:

```shell
docker node ls
```

You should see your manager and both workers listed, all with status Ready.
Now deploy your application as a Swarm service. A service is how Swarm manages your running containers across nodes:
```shell
docker service create \
  --name orders-api \
  --replicas 3 \
  --publish published=80,target=3000 \
  --health-cmd "curl -f http://localhost:3000/health || exit 1" \
  --health-interval 10s \
  --health-retries 3 \
  --health-timeout 5s \
  your-registry/orders-api:v1.0
```

Let me explain what each part does:
--replicas 3 means Swarm runs three copies of your container — one on each node. If one goes down, the other two keep serving traffic.
--publish published=80,target=3000 maps port 80 on the host to port 3000 inside the container where your Node.js app listens.
--health-cmd tells Swarm how to check if your container is healthy. Every 10 seconds, Swarm hits your /health endpoint. If it fails three times in a row, Swarm marks that container as unhealthy and stops sending it traffic.
That health check is critical. Without it, Swarm has no way to know if your new container actually started successfully before it removes the old one.
Your Node.js app needs a health endpoint. Here is a simple one:
```javascript
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok' });
});
```

Step Four: Configure the Rolling Update Policy
This is where the zero-downtime magic lives.
Before you deploy, tell Swarm exactly how to roll out updates:
```shell
docker service update \
  --update-parallelism 1 \
  --update-delay 15s \
  --update-failure-action rollback \
  --update-order start-first \
  orders-api
```

Let me explain each flag:
--update-parallelism 1 means Swarm updates one container at a time. With three replicas, it updates the first, waits, updates the second, waits, updates the third. At no point are all three containers being replaced simultaneously.
--update-delay 15s means Swarm waits 15 seconds between updating each container. This gives the new container time to start up, pass health checks, and start receiving traffic before Swarm moves on to the next one.
--update-failure-action rollback means if a container fails its health check after being updated, Swarm automatically rolls back the entire service to the previous version. You do not have to do anything. It fixes itself.
--update-order start-first is the most important flag for zero downtime. By default, Swarm stops the old container before starting the new one. With start-first, it does the opposite — it starts the new container first, waits for it to pass health checks, and only then stops the old container. At every moment during the update, you have at least the same number of healthy replicas as you started with.
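If you prefer to keep this policy in version control rather than in one-off CLI commands, the same configuration can be expressed declaratively in a Compose-format stack file and deployed with `docker stack deploy -c stack.yml orders`. This is a sketch following this article's example names:

```yaml
version: "3.8"
services:
  orders-api:
    image: your-registry/orders-api:v1.0
    ports:
      - "80:3000"
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 15s
        failure_action: rollback
        order: start-first
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
      interval: 10s
      retries: 3
      timeout: 5s
```

The advantage of the stack file is that your update strategy is reviewed in pull requests like any other code, instead of living only in someone's shell history.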
Step Five: Deploy the New Version
Now the actual deployment. One command:
```shell
docker service update \
  --image your-registry/orders-api:v2.0 \
  orders-api
```

Here is exactly what happens when you run this:
Swarm looks at replica one. It starts a new container running v2.0 alongside the existing v1.0 container. It waits for the new container to pass three consecutive health checks. Once it does, it stops the v1.0 container gracefully — sending it a SIGTERM signal and giving it time to finish any requests it is currently handling. Then it waits 15 seconds. Then it moves to replica two. Then replica three.
During this entire process:
Traffic is always being served
At least two replicas are always healthy
If anything goes wrong, Swarm rolls back automatically
Watch it happen in real time:
```shell
docker service ps orders-api
```

You will see the old tasks showing as Shutdown and the new ones showing as Running, one at a time.
Step Six: Handle In-Flight Requests Gracefully
There is one more thing that most deployment guides skip.
When Swarm sends SIGTERM to your old container, your Node.js app receives a signal that it is about to be stopped. If you do not handle this signal, Node.js exits immediately — and any request that was being processed at that exact moment gets dropped.
Here is how to handle it properly in your Node.js app:
```javascript
const server = app.listen(3000, () => {
  console.log('Server running on port 3000');
});

// Listen for the shutdown signal
process.on('SIGTERM', () => {
  console.log('SIGTERM received. Shutting down gracefully...');

  // Stop accepting new connections
  server.close(() => {
    console.log('All existing requests finished. Exiting.');
    process.exit(0);
  });

  // If requests take too long, force exit after 30 seconds
  setTimeout(() => {
    process.exit(1);
  }, 30000);
});
```

What this does: when Swarm tells your container to stop, your app stops accepting new requests immediately. But it waits for every request that is already being processed to finish before it exits. No request is dropped. No user sees an error.
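One caveat: server.close() only waits for open HTTP connections. If your app also runs background work (queue jobs, scheduled tasks), you need your own bookkeeping. A minimal sketch with illustrative names, which the SIGTERM handler could await before exiting:

```javascript
// Count of tasks currently running outside the HTTP request cycle.
let inFlight = 0;

// Wrap any async task so we know how many are still running.
function tracked(task) {
  return async (...args) => {
    inFlight += 1;
    try {
      return await task(...args);
    } finally {
      inFlight -= 1;
    }
  };
}

// Poll until the counter drains, or give up after a deadline.
function waitForDrain(timeoutMs = 30000, intervalMs = 100) {
  const deadline = Date.now() + timeoutMs;
  return new Promise((resolve, reject) => {
    const timer = setInterval(() => {
      if (inFlight === 0) {
        clearInterval(timer);
        resolve();
      } else if (Date.now() > deadline) {
        clearInterval(timer);
        reject(new Error('drain timeout'));
      }
    }, intervalMs);
  });
}
```

In the SIGTERM handler you would call `await waitForDrain()` alongside server.close(), so both HTTP requests and background tasks finish before the process exits.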
Step Seven: Roll Back If Something Goes Wrong
If you catch a problem after deployment — maybe a bug slipped through testing — rolling back is one command:
```shell
docker service rollback orders-api
```

Swarm immediately starts replacing v2.0 containers with v1.0 containers, using the same rolling process. Within a minute, you are back to the previous version. No panic. No manual intervention.
This is why tagging your images with version numbers matters so much. Swarm knows exactly what v1.0 looks like because the image is still in your registry, unchanged.
Putting It All Together
Here is the complete sequence for every deployment:
Run expand migration → add new columns as nullable
Build and push new Docker image → orders-api:v2.0
Deploy with docker service update → Swarm handles the rest
Watch health checks → confirm all replicas on v2.0
Run contract migration → add constraints, remove old columns
Monitor for 30 minutes → check error rates and latency
Done.
That is it. No maintenance windows. No 11pm deployments out of fear. No apologetic emails to users. Just clean, quiet, professional deployments that respect the people using your product.
This Week’s Challenge
✅ If you are running a simple application right now — add a /health endpoint today. It takes 10 minutes and it is the foundation of everything else.
✅ If you are already using Docker — initialise a local Swarm with docker swarm init and deploy your application as a service with three replicas. Practice the update command. Watch what happens.
✅ If you are already using Swarm or Kubernetes — check your update strategy. Is start-first configured? Is your app handling SIGTERM gracefully? These two things alone will close most of the gaps in your current setup.
Final Thoughts
The engineer who ran that Friday deployment was not careless. They were not inexperienced. They just had not been shown a better way.
Zero-downtime deployment is not advanced knowledge reserved for big tech companies. It is a set of habits and configurations that any team can adopt — and once you do, you will wonder how you ever deployed any other way.
Your users do not know what a deployment is. They should never find out from your app going down.
Ship confidently. Deploy quietly. Keep production alive.
PS:
At CloudOps Academy, we help engineers make this exact transition — from uncertainty to clarity — through hands-on training, real systems, and structured mentorship.
If you’re ready to move beyond theory and start building real DevOps skills, reach out:
📞 +237 653 583 000
📧 [email protected]
P.S. If you found this helpful, share it with a friend or colleague who’s on their DevOps or software engineering journey. Let’s grow together!
Got questions or thoughts? Reply to this newsletter; we’d love to hear from you!
See you next week.
Remember to check out MentorAura → A powerful, all-in-one platform crafted to guide aspiring and seasoned tech professionals through their career journeys. MentorAura offers structured mentorship programs, career development tracks, industry-grade challenges, personalized learning paths, and community support. It’s your gateway to mastering tech skills, building a standout portfolio, receiving expert guidance, and connecting with a vibrant community of future innovators.
Join the MentorAura WhatsApp Community here:
Weekly Backend and DevOps Engineering Resources
The DevOps Career Roadmap: A Guide to Becoming a World Class DevOps Engineer by Akum Blaise Acha
API Versioning 101 for Backend Engineers by Akum Blaise Acha
System Design 101: Understanding Database Sharding by Akum Blaise Acha
Why Engineers Should Embrace the Art of Writing by Akum Blaise Acha
From Good to Great: Backend Engineering by Akum Blaise Acha
System Design 101: Understanding Caching by Akum Blaise Acha


