Cloud & Infrastructure
SaaS platform · B2B · Production system serving paying users
The client ran a Node.js backend serving a B2B SaaS product with several thousand active users. The application was deployed to a single EC2 instance via manual SSH sessions. There was no CI/CD pipeline, no structured logging, no alerting, and the staging environment was configured differently from production — so bugs found in staging were not reliable indicators of production behavior.
Pipeline: GitHub (push to main) → CI pipeline (build + test + scan) → container registry (ECR, versioned images) → staging (parity with prod) → production (ECS, auto-scaling). Observability: CloudWatch structured logs, threshold-based alerts, and health checks every 30 seconds.
Deployments were manual: an engineer would SSH into the production server, pull the latest code, run a build, and restart the process. This took 20–30 minutes and was error-prone. A bad deploy meant SSH-ing back in and reverting manually.
There was no structured logging. The application used console.log for everything. When an incident occurred, the on-call engineer had to SSH into the server and grep through log files to figure out what happened. Two incidents in the past quarter had taken over four hours to resolve.
The staging environment ran a different OS version and a different Node version, and its database was on a different set of applied schema migrations. Bugs that passed staging regularly appeared in production.
The on-call rotation was dreaded. Engineers would trade shifts to avoid it. The CTO described it as the single biggest morale problem on the team.
Every log entry now includes a timestamp, severity level, request ID, user ID (when applicable), and structured context. This took about a week to migrate across the codebase — mostly mechanical work, but critical for everything else.
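The log shape described above can be sketched as a small JSON formatter. This is illustrative only; the field names (requestId, userId) mirror the fields listed in the text, not the client's actual schema:

```javascript
// Minimal structured-logging sketch: one JSON object per log line.
// Field names (requestId, userId) are illustrative placeholders.
function formatLog(level, message, context = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...context, // e.g. { requestId, userId }
  });
}

function log(level, message, context) {
  // Writing JSON lines to stdout lets a log shipper ingest them unchanged.
  process.stdout.write(formatLog(level, message, context) + "\n");
}

log("error", "payment failed", { requestId: "req-123", userId: "u-42" });
```

Because every entry is a self-describing JSON object, queries can filter on any field instead of grepping raw text.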
Logs ship to CloudWatch, where structured queries replace SSH-and-grep. We defined alert thresholds for error rates, response times, and specific failure patterns. The on-call engineer gets a notification with context, not a generic 'server down' ping.
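As an illustration, a CloudWatch Logs Insights query over JSON logs with the fields described above (severity level, request ID) might bucket errors into five-minute windows like this; the field names are assumptions, not the client's actual schema:

```
fields @timestamp, level, message, requestId
| filter level = "error"
| stats count(*) as errors by bin(5m)
```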
We Dockerized the application so the exact same image runs in development, staging, and production. This eliminated the environment-parity problem entirely. The 'works on staging but not production' class of bugs disappeared.
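A minimal sketch of such an image, assuming a typical multi-stage Node.js build; the base image, port, and entry point are placeholders, not the client's actual Dockerfile:

```dockerfile
# Illustrative multi-stage build; versions and paths are assumptions.
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=build /app ./
EXPOSE 3000
CMD ["node", "dist/server.js"]
```

The same tagged image is promoted from staging to production, which is what makes the environments identical by construction.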
Push to main triggers: build → test → security scan → deploy to staging → manual approval → deploy to production. A bad deploy can be rolled back in under two minutes. The 20-minute manual SSH deploy became a one-click operation.
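That flow can be sketched as a GitHub Actions workflow; the job names, the npm audit step standing in for the security scan, and the deploy scripts are assumptions, not the client's actual pipeline:

```yaml
# Hedged sketch of the pipeline stages; deploy scripts are placeholders.
name: deploy
on:
  push:
    branches: [main]
jobs:
  build-test-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test
      - run: npm audit --audit-level=high   # stand-in for the security scan
  deploy-staging:
    needs: build-test-scan
    runs-on: ubuntu-latest
    steps:
      - run: ./scripts/deploy.sh staging    # placeholder deploy script
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production   # protected environment provides the manual approval gate
    steps:
      - run: ./scripts/deploy.sh production # placeholder deploy script
```

The manual-approval step maps naturally onto a protected GitHub environment with required reviewers, so production deploys wait for a human click.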
Each service has a /health endpoint checked every 30 seconds. We wrote a runbook covering the five most common incident types with step-by-step resolution procedures. New on-call engineers can follow the runbook without needing institutional knowledge.
Tags: runtime · containers · infrastructure · observability · CI/CD
Timeline: 6 weeks
Team: 1 senior engineer (AxionvexTech) + 1 internal DevOps-leaning engineer on the client side

Mean time to resolution: 2–4 hours per incident → under 15 minutes for most incidents
Deploy time: 20–30 minutes of manual SSH → under 5 minutes, automated with one-click rollback
Staging reliability: bugs regularly passed staging → environment parity; staging catches what production would see
On-call morale: engineers traded shifts to avoid on-call → first rotation after launch, the engineer said it was the first time they did not dread the shift
“The CTO told us that the infrastructure work changed how the team felt about the product. Before, they were scared of their own system. After, they had the confidence to ship faster because they could see what was happening and fix it quickly when something went wrong.”