Hello 👋
Welcome to another week — and another opportunity to grow into a strong, confident DevOps, Infrastructure, or Platform Engineer.
Today’s issue is brought to you by The Engineering Ladder — where we share practical, career-shaping lessons in DevOps and Software Engineering to help you level up with clarity and direction.
💡 PS: Before we dive into today’s topic, I want to quickly share something important with you…
If you’ve been following The Engineering Ladder, you already know one thing I believe deeply:
👉 Real tech careers are built on evidence, not just interest.
That belief is exactly why we built CloudOps Academy.
CloudOps Academy is a hands-on training program for DevOps Engineers, Infrastructure Engineers, and Platform Engineers who want more than theory.
We focus on helping engineers build real systems, understand how production environments work, and gain the confidence to perform in real roles — not just pass interviews.
At CloudOps Academy, you don’t just “learn tools.”
You learn how to:
✅ Design and operate real cloud infrastructure
✅ Work with Docker, CI/CD, monitoring, and automation the way teams do in production
✅ Think like a reliability-focused engineer, not just a script writer
✅ Build projects you can confidently explain in interviews
✅ Grow from uncertainty to clarity with structured guidance and mentorship
Our goal is simple:
to help you become job-ready, confident, and credible as an engineer.
If you’re serious about building a strong DevOps or Cloud career — and you want guidance from engineers who are actively working in the field — we’d love to talk.
📞 Phone: +237 653 583 000
📧 Email: [email protected]
No pressure.
Just clarity on whether CloudOps Academy is the right next step for you.
Now, let’s get into today’s lesson 👇
I got a message from an engineering manager a few months ago.
His team was good. Senior engineers, solid codebase, decent test coverage. But something was wrong and he could not put his finger on it.
He said:
"Blaise, our pipeline takes 47 minutes to run. Engineers are pushing code and then going to do something else while they wait. By the time the build finishes, they have lost context. Some days we get three deployments out. Some days we get zero because something in the pipeline broke and nobody knows why."
47 minutes.
For a team that deploys software for a living, 47 minutes per pipeline run is not a minor inconvenience. It is a tax. A tax that compounds every single day across every engineer on that team.
If you have five engineers each pushing twice a day, a 47-minute pipeline means your team collectively spends over seven hours every day just waiting. Not building. Not solving problems. Waiting.
And the pipeline breaking randomly? That is a different kind of damage — the kind that erodes trust. When engineers stop trusting the pipeline, they start working around it. They skip steps. They push directly. They treat CI as optional. And that is when the real problems begin.
I spent two days with that team. What we found was not one big problem. It was six small ones — each manageable on its own, devastating in combination.
Today I am going to walk you through all six. And more importantly, I am going to show you exactly how to fix them.
Mistake One: Flaky Tests That Nobody Has Fixed
A flaky test is a test that sometimes passes and sometimes fails — for no clear reason. Not because the code is wrong. Because the test itself is poorly written.
Maybe it depends on the order other tests run in. Maybe it makes a real network call to an external service that occasionally times out. Maybe it checks a timestamp and fails when the system is running slow. Maybe it assumes a database is empty and fails when another test left data behind.
In the team I was working with, I ran their test suite ten times in a row on the same unchanged codebase. It failed four of those times. Different tests each time. No pattern. No explanation.
Here is what flaky tests do to a team:
An engineer sees a red build. They check the failure. It is a test they have seen fail before for no reason. So they re-run the pipeline. It passes. They assume it was a fluke and move on.
Now the pipeline is a tool that cries wolf. Engineers stop treating red builds as urgent. They re-run first and investigate second. And the day a real failure gets re-run and passes by coincidence — because sometimes broken code still passes some tests — that broken code ships to production.
The fix:
First, identify your flaky tests. Most CI platforms track test results over time. GitHub Actions, GitLab CI, and CircleCI all have ways to see which tests fail intermittently. Pull that report. You will almost always find the same five to ten tests responsible for 80% of the flakiness.
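If your platform does not surface a flakiness report, you can brute-force it the way I did with that team: run the suite several times against the same commit and compare which tests fail. Here is a minimal sketch in shell, assuming a Jest-style npm test script that prints failing files as "FAIL path/to/test" — adjust the pattern for your stack:

# Run the suite ten times against the same unchanged commit, keeping each run's output.
for i in $(seq 1 10); do
  npm test > "test-run-$i.log" 2>&1 || echo "run $i had failures"
done

# Count how many of the ten runs each test file failed in.
# A file that fails in some runs but not all of them is a flake candidate.
grep -h "^FAIL" test-run-*.log | sort | uniq -c | sort -rn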
For each flaky test, do one of three things:
Fix it properly — remove the external dependency, mock the network call, isolate the test from shared state.
Quarantine it — move it to a separate suite that runs separately and does not block the main pipeline while you fix it properly.
Delete it — if a test has been flaky for months and nobody has fixed it, it is not providing value. It is providing noise. Delete it and write a better one.
# GitHub Actions — separate job for quarantined tests
jobs:
  main-tests:
    runs-on: ubuntu-latest
    steps:
      - run: npm test -- --testPathIgnorePatterns=quarantine

  quarantined-tests:
    runs-on: ubuntu-latest
    continue-on-error: true   # doesn't block the pipeline
    steps:
      - run: npm test -- --testPathPattern=quarantine

The quarantine approach is important. It means flaky tests no longer block your deployments while you take the time to fix them properly. Your pipeline becomes trustworthy again immediately.
Mistake Two: Running Everything in Sequence When Most Things Could Run in Parallel
This was the biggest contributor to that 47-minute pipeline.
When I looked at their pipeline configuration, everything ran one after the other. Install dependencies. Run unit tests. Run integration tests. Run end-to-end tests. Build the Docker image. Push to registry. Deploy to staging.
Each step waited for the previous one to finish before starting.
But most of those steps had no dependency on each other. Unit tests and integration tests do not need to wait for each other. Building the Docker image does not need to wait for end-to-end tests. Linting and security scanning can run at the same time as tests.
Here is what their pipeline looked like:
Install → Lint → Unit Tests → Integration Tests → E2E Tests → Build → Push → Deploy
Total: 47 minutes — all sequential

Here is what it looked like after we restructured it:
Install
   ↓
┌─────────────────────────────────┐
│ Lint │ Unit Tests │ Scan        │   ← parallel
└─────────────────────────────────┘
   ↓
Integration Tests
   ↓
┌──────────────────┐
│ E2E │ Build      │   ← parallel
└──────────────────┘
   ↓
Push → Deploy

Same work. Different structure.
The result: 47 minutes became 18 minutes.
No new infrastructure. No new tools. Just restructuring which jobs could run at the same time.
# GitHub Actions parallel jobs example
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - run: npm run lint

  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - run: npm run test:unit

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - run: npm run security:scan

  integration-tests:
    needs: [lint, unit-tests, security-scan]
    runs-on: ubuntu-latest
    steps:
      - run: npm run test:integration

  build:
    needs: [integration-tests]
    runs-on: ubuntu-latest
    steps:
      - run: docker build -t app:${{ github.sha }} .

The needs field controls the dependency chain. Jobs without a needs field run immediately in parallel. Jobs that list dependencies wait only for those specific dependencies — not for everything else.
Mistake Three: No Dependency Caching
Every pipeline run was reinstalling every dependency from scratch.
For a Node.js project with 847 packages, that was four minutes just on npm install. Every single run. Every single time.
Dependencies do not change on every commit. If your lockfile has not changed, there is no reason to download and install everything again. You cache your dependencies (npm's download cache, or the node_modules folder itself) and reuse them on every run where the lockfile has not changed.
# GitHub Actions dependency caching
- name: Cache node modules
  uses: actions/cache@v3
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-

- name: Install dependencies
  run: npm ci

The cache key is a hash of your package-lock.json. If the lockfile has not changed, the cache is valid and npm ci runs in seconds instead of minutes. If the lockfile changes — a new package was added — the cache is invalidated and everything reinstalls fresh.
Same principle applies to Docker layer caching. If your base image and dependencies have not changed, Docker does not need to rebuild those layers:
# Put dependencies first — they change less often than your code
FROM node:20-alpine        # any Node base image works; this one is illustrative
WORKDIR /app
COPY package*.json ./
RUN npm ci

# Copy application code last — it changes most often
COPY . .
RUN npm run build

When Docker builds this image, the npm ci layer is cached as long as package.json and package-lock.json have not changed. Only the layers after a changed file get rebuilt. A build that previously took six minutes now takes 40 seconds on most commits.
This one change alone saved the team three hours of collective waiting per day.
Mistake Four: No Environment Parity — Staging Is Nothing Like Production
This is the mistake that does not slow your pipeline down. It slows your incidents down.
The team had three environments: local development, staging, and production. The problem was that staging and production had drifted so far apart that a green staging pipeline told you almost nothing about whether production would be fine.
Production ran on Ubuntu 22. Staging ran on Ubuntu 20. Production used PostgreSQL 15. Staging used PostgreSQL 13. Production had specific environment variables for third-party integrations. Staging had fake values that never got updated. Production used Redis for session management. Staging did not use Redis at all.
They had caught this drift in the worst possible way — a feature that worked perfectly through the entire staging pipeline, then broke immediately in production because of a PostgreSQL 15 behavior that did not exist in PostgreSQL 13.
The fix:
Define your environments as code and make them identical except for scale and data.
Use Docker Compose to define your local environment with the exact same services, versions, and configuration as production:
# docker-compose.yml — mirrors production exactly
version: '3.8'
services:
  app:
    build: .
    environment:
      - NODE_ENV=development
      - DB_HOST=postgres
      - REDIS_HOST=redis

  postgres:
    image: postgres:15.2   # same version as production
    environment:
      POSTGRES_DB: appdb

  redis:
    image: redis:7.0       # same version as production

Use infrastructure as code — Terraform, Pulumi — to provision staging and production from the same templates, with only the instance sizes and replica counts differing. When both environments are defined from the same source, drift becomes nearly impossible.
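To make that concrete, here is a rough sketch of what "same templates, different scale" can look like with Terraform. The file names and directory layout are illustrative, not a prescription — the point is that both environments run the exact same modules and only the variable files change:

# Same Terraform modules for every environment; only the variable files differ.
# (In practice each environment would also use its own state backend or workspace.)
terraform init

# Staging: small instance sizes, single replicas
terraform plan -var-file=envs/staging.tfvars -out=staging.plan
terraform apply staging.plan

# Production: identical code, only sizes and replica counts change
terraform plan -var-file=envs/production.tfvars -out=production.plan
terraform apply production.plan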
Mistake Five: Missing Rollback Steps in the Pipeline
This one is subtle but it is the most dangerous mistake on this list.
Their pipeline had a deploy step. It did not have a rollback step.
When a deployment broke production, the team had to manually figure out the last working image tag, manually update the deployment configuration, and manually trigger a new deployment. In a stressful incident, with multiple people in a war room, this process took between 12 and 25 minutes.
A rollback should be one command or one button click. It should take less than two minutes. It should be something every engineer on the team can do without needing to ask anyone how.
The fix:
First, always tag your images with the git commit SHA — not just latest. This means you always have a specific, immutable reference to every version you have ever deployed:
- name: Build and push
  run: |
    docker build -t your-registry/app:${{ github.sha }} .
    docker push your-registry/app:${{ github.sha }}

Second, store the last known good image tag somewhere your pipeline can retrieve it. A simple approach is writing it to a file in your infrastructure repository after every successful deployment:
echo ${{ github.sha }} > .last-successful-deploy
git commit -am "Update last successful deploy"
git push

Third, add a rollback job to your pipeline that any engineer can trigger manually:
rollback:
  runs-on: ubuntu-latest
  if: github.event_name == 'workflow_dispatch'   # manual trigger only
  steps:
    - uses: actions/checkout@v4   # needed to read .last-successful-deploy
    - name: Get last successful SHA
      run: |
        LAST_SHA=$(cat .last-successful-deploy)
        echo "Rolling back to $LAST_SHA"
        # expose the value to the following steps
        echo "LAST_SHA=$LAST_SHA" >> "$GITHUB_ENV"
    - name: Rollback deployment
      run: |
        docker service update \
          --image your-registry/app:$LAST_SHA \
          your-service

Now rollback is a button in your CI platform. Any engineer can do it. No war room needed. No manual digging through image tags under pressure.
Mistake Six: No Pipeline Observability
The final mistake was that nobody had visibility into how the pipeline was performing over time.
Nobody knew which jobs were slowest. Nobody knew which tests were flakiest. Nobody knew how deployment frequency had changed month over month. Nobody knew how long rollbacks were taking.
Without that data, every pipeline improvement was guesswork. You could not tell if the changes you made were actually helping.
The fix:
Start tracking four numbers every week:
Mean pipeline duration — how long does the average pipeline run take? This is your baseline. Every optimization should move this number down.
Pipeline failure rate — what percentage of pipeline runs fail? A healthy pipeline should fail because of real code problems, not infrastructure flakiness. If your failure rate is above 15%, something is wrong with the pipeline itself.
Deployment frequency — how many times does your team deploy to production per week? This is your velocity signal. A CI/CD pipeline that nobody trusts gets used less.
Mean time to restore — when a deployment breaks production, how long does it take to recover? This tells you whether your rollback process is working.
Most CI platforms expose this data natively. GitHub Actions has workflow analytics. GitLab CI has pipeline analytics built into the dashboard. DataDog, Grafana, and Honeycomb can all ingest pipeline metrics if you want a unified view.
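If you want a quick snapshot before you build a dashboard, you can pull rough numbers straight from the GitHub CLI. A sketch, assuming gh and jq are installed and you are authenticated in the repository — treat the duration figure as approximate, since it includes time spent queued:

# How many of the last 100 workflow runs failed
gh run list --limit 100 --json conclusion \
  | jq '[.[] | select(.conclusion == "failure")] | length'

# Mean pipeline duration in minutes over the last 100 runs
gh run list --limit 100 --json createdAt,updatedAt \
  | jq '[.[] | ((.updatedAt | fromdate) - (.createdAt | fromdate))] | add / length / 60'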
Put these four numbers on a dashboard your team looks at every week. Make pipeline health a first-class concern — not something you only think about when something breaks.
The Results
Two weeks after working through all six of these with that engineering manager's team:
Pipeline duration: 47 minutes → 18 minutes
Pipeline failure rate: 34% → 6%
Deployment frequency: 3 per week → 11 per week
Mean time to restore: 22 minutes → 4 minutes
Same team. Same codebase. Same infrastructure budget.
The only thing that changed was how seriously they treated their pipeline.
This Week’s Challenge
✅ Run your pipeline right now and time it. Write the number down. That is your baseline.
✅ Open your pipeline configuration and find three jobs that are currently running sequentially but have no dependency on each other. Move them to run in parallel.
✅ Check if you have dependency caching configured. If not — add it today. It takes 15 minutes and saves hours.
✅ Ask your team: if production broke right now, how long would a rollback take? If the honest answer is "we are not sure" — that is your most important thing to fix this week.
Final Thoughts
A slow, unreliable pipeline is not just a tooling problem. It is a culture problem. It teaches engineers that CI is something to work around rather than something to trust. It slows down every feature, every bug fix, every improvement your team tries to make.
The fixes are not glamorous. Parallelizing jobs, adding caching, fixing flaky tests, tagging images properly — none of these make it onto a CV. But they compound. A team that ships 11 times a week instead of 3 times a week is not 3x faster. Over a year, that difference in velocity is the difference between a product that wins its market and one that falls behind.
Your pipeline is not overhead. It is your team's most used tool. Treat it that way.
If your team is sitting on a slow pipeline and this gave you a clear path forward — share it with your engineering manager or the person who owns your CI/CD setup. Sometimes the problem is not awareness. It is just not knowing where to start.
PS:
At CloudOps Academy, we help engineers make this exact transition — from uncertainty to clarity — through hands-on training, real systems, and structured mentorship.
If you’re ready to move beyond theory and start building real DevOps skills, reach out:
📞 +237 653 583 000
📧 [email protected]
P.S. If you found this helpful, share it with a friend or colleague who’s on their DevOps or software engineering journey. Let’s grow together!
Got questions or thoughts? Reply to this newsletter; we’d love to hear from you!
See you next week.
Looking for structured, expert-led mentorship to accelerate your Cloud or DevOps career?
Visit consult.akumblaiseacha.com — where I work 1:1 with aspiring and experienced tech professionals to help them build real skills, grow their career, and land the opportunities they deserve.
From personalized career roadmaps and hands-on project guidance, to interview prep, LinkedIn positioning, and job search strategy — everything is tailored to your specific goals and timeline.
No cohorts. No pre-recorded content. Just direct, focused mentorship from a Senior DevOps Engineer with years of real-world, production experience.
👉 Book your session today → consult.akumblaiseacha.com
Join the WhatsApp Community here:
Weekly Backend and DevOps Engineering Resources
The DevOps Career Roadmap: A Guide to Becoming a World Class DevOps Engineer by Akum Blaise Acha
API Versioning 101 for Backend Engineers by Akum Blaise Acha
System Design 101: Understanding Database Sharding by Akum Blaise Acha
Why Engineers Should Embrace the Art of Writing by Akum Blaise Acha
From Good to Great: Backend Engineering by Akum Blaise Acha
System Design 101: Understanding Caching by Akum Blaise Acha


