How We Solved the On-Time Notification Delivery Problem at Scale
Scaling Real-Time Notifications with AWS Step Functions, Lambda, and RDS Proxy

Delivering notifications exactly on time sounds easy — until you have to do it for thousands of users at the same second.
Our medicine reminder app depends heavily on precise dose reminders. Even a 1–2 minute delay can cause users to miss their doses, so reliability was critical.
Initially, we used BullMQ + Node.js workers for scheduling and sending notifications. It worked fine for a small number of users, but at scale, the system started to break.
The Problem
1. Too Many Notifications at the Same Time
Thousands of notifications were scheduled for the same second.
Workers pulled huge batches from Redis, causing CPU & memory spikes.
Redis queues became congested and some jobs got delayed.
2. Worker Overload
Even with multiple worker instances, the Node.js event loop struggled during peaks.
Delays became more frequent as user count increased.
Step 1 — Moving to AWS Step Functions + Lambda
We redesigned the scheduling process to spread the load more efficiently.
New Flow:
A Step Functions state machine dispatches notification batches at their scheduled times.
Each execution triggers a Lambda dedicated to a subset of notifications.
We keep 2–3 hot Lambdas ready during peak times to avoid cold starts.
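The fan-out in this flow can be sketched as follows. This is a minimal illustration, assuming due notifications are identified by integer IDs and split into fixed-size batches. The helper names, the payload shape, and the batch size of 500 are assumptions for the example, not production values.

```python
from typing import Iterator, List, Sequence


def chunk(ids: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `ids`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


def build_lambda_payloads(due_ids: Sequence[int], batch_size: int = 500) -> List[dict]:
    """Build one payload per Lambda invocation (payload shape is hypothetical)."""
    return [
        {"batchIndex": i, "notificationIds": batch}
        for i, batch in enumerate(chunk(due_ids, batch_size))
    ]
```

With 1,200 due notifications and a batch size of 500, this produces three payloads, so no single Lambda has to process the whole spike.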
Why Step Functions?
Cron- and rate-based scheduling through Amazon EventBridge, which can start executions on a schedule.
Orchestration of multiple parallel Lambda executions.
Fully managed — no manual queue management.
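To make the orchestration point concrete, here is a rough sketch of a state machine definition in Amazon States Language, built as a Python dictionary. It assumes the execution input carries a `batches` array and uses a Map state to invoke one Lambda per batch. The function ARN, state names, concurrency limit, and retry settings are all hypothetical.

```python
import json

# Hypothetical ARN; replace with the real function's ARN.
SENDER_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:send-notification-batch"


def build_state_machine_definition(max_concurrency: int = 10) -> dict:
    """Return an ASL definition that fans out over the batches in the input."""
    return {
        "Comment": "Fan out notification batches to Lambda (illustrative sketch)",
        "StartAt": "SendBatches",
        "States": {
            "SendBatches": {
                "Type": "Map",
                "ItemsPath": "$.batches",           # assumed input field
                "MaxConcurrency": max_concurrency,  # cap on parallel iterations
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "SendBatch",
                    "States": {
                        "SendBatch": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {
                                "FunctionName": SENDER_LAMBDA_ARN,
                                "Payload.$": "$",   # pass the batch as the payload
                            },
                            "Retry": [
                                {
                                    "ErrorEquals": ["States.TaskFailed"],
                                    "IntervalSeconds": 2,
                                    "MaxAttempts": 2,
                                    "BackoffRate": 2.0,
                                }
                            ],
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_state_machine_definition(), indent=2))
```

Capping `MaxConcurrency` is one way to bound how many Lambdas run at once, which matters as soon as each of them talks to a database.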
Step 2 — Horizontal Scaling in Lambda
AWS Lambda scales out automatically, so under high load dozens of function instances ran in parallel.
This solved the compute bottleneck — but introduced a new problem…
Step 3 — The Database Connection Storm
Each Lambda invocation created a new PostgreSQL connection.
At scale:
RDS hit the max_connections limit.
Some Lambdas failed instantly due to connection errors.
This caused missed or late notifications.
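The failure mode is easy to reproduce in miniature. The sketch below uses a stand-in database class with a hard connection cap. The class, the function name, and the numbers are illustrative assumptions, not measurements from our RDS instance.

```python
class FakeDatabase:
    """Stand-in for a PostgreSQL server with a hard connection cap."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self) -> None:
        if self.open_connections >= self.max_connections:
            raise ConnectionError("too many connections")
        self.open_connections += 1


def simulate_burst(concurrent_lambdas: int, max_connections: int) -> dict:
    """Each concurrent invocation opens its own connection and holds it."""
    db = FakeDatabase(max_connections)
    failed = 0
    for _ in range(concurrent_lambdas):
        try:
            db.connect()
        except ConnectionError:
            failed += 1
    return {"connected": db.open_connections, "failed": failed}
```

For example, `simulate_burst(150, 100)` returns `{"connected": 100, "failed": 50}`: every invocation beyond the cap fails immediately, which corresponds to the missed or late notifications described above.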
Step 4 — Introducing Amazon RDS Proxy
RDS Proxy pools and shares DB connections across Lambdas.
Benefits:
Reuses existing DB connections.
Reduces connection churn and overhead.
Eliminates "too many connections" errors.
Lowers latency because connections are pre-warmed.
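The pooling idea can be shown with a toy example. This is not how RDS Proxy is implemented; it is a minimal sketch of connection reuse in which a small, fixed set of connections serves many short queries. The `TinyPool` class and `run_queries` helper are hypothetical, and plain integers stand in for real database connections.

```python
import queue
from contextlib import contextmanager
from typing import Iterator


class TinyPool:
    """A fixed set of reusable connections handed out one borrower at a time."""

    def __init__(self, size: int) -> None:
        self.created = 0
        self._idle: "queue.Queue[int]" = queue.Queue()
        for _ in range(size):
            self.created += 1
            self._idle.put(self.created)  # an int stands in for a real connection

    @contextmanager
    def borrow(self, timeout: float = 1.0) -> Iterator[int]:
        conn = self._idle.get(timeout=timeout)  # wait for a free connection
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it so the next caller can reuse it


def run_queries(pool: TinyPool, query_count: int) -> set:
    """Run `query_count` short queries and return the set of connections used."""
    used = set()
    for _ in range(query_count):
        with pool.borrow() as conn:
            used.add(conn)
    return used
```

Running 200 queries through `TinyPool(5)` never creates more than five connections, which is the property that keeps the database below its connection limit.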
Step 5 — Putting Lambdas in a VPC
Since RDS Proxy lives inside a VPC:
All Lambdas were moved into private subnets within the same VPC.
This allowed private, low-latency connections to RDS Proxy.
Step 6 — Adding Internet Access via NAT Gateway
Once Lambdas were in the VPC, they lost internet access — which broke calls to FCM.
Fix:
Created a NAT Gateway in a public subnet of the VPC.
Updated route tables so Lambdas could:
Connect to RDS Proxy privately.
Still reach the internet for external APIs.
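The routing behaviour can be sketched with a small longest-prefix lookup, which is how a VPC route table picks a target. The CIDR ranges, target labels, and IP addresses below are hypothetical examples, not our actual network layout.

```python
import ipaddress

# Hypothetical route table for a private subnet; CIDRs and targets are examples.
PRIVATE_ROUTE_TABLE = [
    ("10.0.0.0/16", "local"),      # traffic inside the VPC, e.g. RDS Proxy
    ("0.0.0.0/0", "nat-gateway"),  # everything else leaves through the NAT
]


def pick_route(destination_ip: str, routes=PRIVATE_ROUTE_TABLE) -> str:
    """Return the target of the most specific (longest-prefix) matching route."""
    addr = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        raise LookupError(f"no route to {destination_ip}")
    _network, target = max(matches, key=lambda m: m[0].prefixlen)
    return target
```

An address inside the VPC range, such as `10.0.12.34`, resolves to `local`, while a public address resolves to `nat-gateway`. That matches the two paths listed above: private traffic to RDS Proxy and outbound traffic to external APIs.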
Final Architecture Diagram
                 ┌───────────────────┐
                 │   Step Function   │
                 │  (Scheduled CRON) │
                 └─────────┬─────────┘
                           │
                           ▼
        ┌────────────────────────────────────┐
        │           AWS Lambda(s)            │
        │ (Horizontal Scaling, VPC-Enabled)  │
        │ 1. Query DB via RDS Proxy          │
        │ 2. Send Notifications via APIs     │
        └─────────┬────────────────┬─────────┘
                  │                │
     ┌────────────▼─────────┐   ┌──▼───────────────┐
     │   Amazon RDS Proxy   │   │   NAT Gateway    │
     │ (Connection Pooling) │   │ (Internet Access)│
     └───────────┬──────────┘   └──────────┬───────┘
                 │                         │
       ┌─────────▼──────────┐      ┌───────▼─────────────────┐
       │  PostgreSQL (RDS)  │      │ External APIs (FCM, SNS,│
       └────────────────────┘      │ Push Notification, etc.)│
                                   └─────────────────────────┘
Conclusion
Building an on-time notification system at scale required more than just adding more workers — it demanded a complete architectural rethink. By moving from BullMQ workers to AWS Step Functions for scheduling, Lambda for scalable compute, and RDS Proxy for efficient database connectivity, we achieved a fully serverless, reliable, and low-maintenance solution.
Integrating VPC networking with a NAT Gateway ensured secure database access while still allowing internet connectivity for push and Firebase APIs. Today, our system delivers notifications precisely on time, even during heavy load, while remaining cost-efficient and easy to operate.


