How We Solved the On-Time Notification Delivery Problem at Scale

Scaling Real-Time Notifications with AWS Step Functions, Lambda, and RDS Proxy

A software engineer who likes exploring new technologies, solving problems, and building projects, with a keen interest in web development.

Delivering notifications exactly on time sounds easy — until you have to do it for thousands of users at the same second.

Our medicine reminder app depends heavily on precise dose reminders. Even a 1–2 minute delay can cause users to miss their doses, so reliability was critical.

Initially, we used BullMQ + Node.js workers for scheduling and sending notifications. It worked fine for a small number of users, but at scale, the system started to break.


The Problem

1. Too Many Notifications at the Same Time

  • Thousands of notifications were scheduled for the same second.

  • Workers pulled huge batches from Redis, causing CPU & memory spikes.

  • Redis queues became congested and some jobs got delayed.

2. Worker Overload

  • Even with multiple worker instances, the Node.js event loop struggled during peaks.

  • Delays became more frequent as user count increased.


Step 1 — Moving to AWS Step Functions + Lambda

We redesigned the scheduling process to spread the load more efficiently.

New Flow:

  1. A Step Functions execution starts at each scheduled time and dispatches the notification batches due at that moment.

  2. Each execution triggers a Lambda dedicated to a subset of notifications.

  3. We keep 2–3 hot Lambdas ready during peak times to avoid cold starts.
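To make the batching idea in the flow above concrete, here is a minimal sketch of splitting the reminders due at one instant into fixed-size batches, one per Lambda invocation. The `Reminder` shape, the `toBatches` helper, and the batch size of 1,000 are assumptions for illustration, not the values used in production.

```typescript
// Hypothetical shape of a scheduled reminder; field names are assumptions.
interface Reminder {
  userId: string;
  dueAt: string; // ISO-8601 timestamp
}

// Split the reminders due at one instant into fixed-size batches,
// so that each batch can be handed to its own Lambda invocation.
function toBatches<T>(items: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Example: 2,500 reminders due in the same second, 1,000 per batch.
const due: Reminder[] = Array.from({ length: 2500 }, (_, i) => ({
  userId: `user-${i}`,
  dueAt: "2024-01-01T08:00:00Z",
}));
const batches = toBatches(due, 1000);
console.log(batches.map((b) => b.length)); // [ 1000, 1000, 500 ]
```

Smaller batches mean more parallel invocations and shorter per-batch run time, at the cost of more concurrent database and network activity.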

Why Step Functions?

  • CRON- or rate-based scheduling, with executions started by an Amazon EventBridge schedule.

  • Orchestration of multiple parallel Lambda executions.

  • Fully managed — no manual queue management.
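As a rough illustration of the parallel orchestration described above, the following is a sketch of an Amazon States Language definition written as a TypeScript object. A `Map` state fans out one Lambda task per batch. The state names, the `$.batches` input path, the `MaxConcurrency` value, and the Lambda ARN are placeholders, not the actual production definition. The schedule itself would live outside the state machine, for example in an EventBridge schedule with an expression such as `rate(1 minute)`.

```typescript
// Amazon States Language definition, written as a plain object.
// State names, the "$.batches" path, and the ARN are placeholders.
const stateMachineDefinition = {
  Comment: "Fan out one Lambda invocation per notification batch",
  StartAt: "SendBatches",
  States: {
    SendBatches: {
      Type: "Map",
      ItemsPath: "$.batches", // array of batches in the execution input
      MaxConcurrency: 50, // upper bound on parallel iterations (assumed value)
      ItemProcessor: {
        ProcessorConfig: { Mode: "INLINE" },
        StartAt: "SendBatch",
        States: {
          SendBatch: {
            Type: "Task",
            Resource:
              "arn:aws:lambda:REGION:ACCOUNT_ID:function:send-notification-batch",
            End: true,
          },
        },
      },
      End: true,
    },
  },
};

// The JSON form is what would be supplied as the state machine definition.
console.log(JSON.stringify(stateMachineDefinition, null, 2));
```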


Step 2 — Horizontal Scaling in Lambda

AWS Lambda scales automatically, so during high load, dozens of Lambda instances ran in parallel.

This solved the compute bottleneck, but it introduced a new problem.


Step 3 — The Database Connection Storm

Each Lambda invocation created a new PostgreSQL connection.

At scale:

  • RDS hit the max_connections limit.

  • Some Lambdas failed instantly due to connection errors.

  • This caused missed or late notifications.
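The arithmetic behind the storm is simple, and a small sketch makes it visible. The function below assumes one connection per concurrent invocation, as described above. The figures of 300 parallel Lambdas and a limit of 100 connections are hypothetical, not measurements from our system.

```typescript
// Count how many concurrent invocations get a connection when each one
// opens its own, given a server-side cap on connections.
function simulateConnectionStorm(
  concurrentInvocations: number,
  maxConnections: number,
): { connected: number; rejected: number } {
  const connected = Math.min(concurrentInvocations, maxConnections);
  return { connected, rejected: concurrentInvocations - connected };
}

// Hypothetical numbers: 300 parallel Lambdas against a limit of 100.
console.log(simulateConnectionStorm(300, 100)); // { connected: 100, rejected: 200 }
```

Every rejected invocation in this model corresponds to a batch of notifications that is either retried late or not sent at all.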


Step 4 — Introducing Amazon RDS Proxy

RDS Proxy pools and shares DB connections across Lambdas.

Benefits:

  • Reuses existing DB connections.

  • Reduces connection churn and overhead.

  • Eliminates "too many connections" errors.

  • Lowers latency because queries reuse connections that are already open instead of paying the setup cost each time.
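The pooling idea can be sketched in a few lines. This is a toy in-memory model of connection pooling in general, not how RDS Proxy is implemented: `ConnectionPool`, `FakeConnection`, and `runInvocations` are hypothetical names, and the "query" is just a short timer. It shows many callers sharing a small, fixed number of connections instead of each opening its own.

```typescript
// A stand-in for a real database connection; only the id matters here.
interface FakeConnection {
  id: number;
}

class ConnectionPool {
  private idle: FakeConnection[] = [];
  private waiters: Array<(conn: FakeConnection) => void> = [];
  private opened = 0;
  private readonly maxConnections: number;

  constructor(maxConnections: number) {
    this.maxConnections = maxConnections;
  }

  // Hand out an idle connection, open a new one if under the cap,
  // otherwise wait until another caller releases one.
  acquire(): Promise<FakeConnection> {
    const conn = this.idle.pop();
    if (conn) return Promise.resolve(conn);
    if (this.opened < this.maxConnections) {
      this.opened += 1;
      return Promise.resolve({ id: this.opened });
    }
    return new Promise<FakeConnection>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Give the connection to the next waiter, or park it as idle.
  release(conn: FakeConnection): void {
    const next = this.waiters.shift();
    if (next) next(conn);
    else this.idle.push(conn);
  }

  get openedCount(): number {
    return this.opened;
  }
}

// Simulate `count` invocations that each borrow a connection briefly.
async function runInvocations(pool: ConnectionPool, count: number): Promise<number> {
  let completed = 0;
  await Promise.all(
    Array.from({ length: count }, async () => {
      const conn = await pool.acquire();
      try {
        await new Promise<void>((r) => setTimeout(r, 1)); // pretend to run a query
        completed += 1;
      } finally {
        pool.release(conn);
      }
    }),
  );
  return completed;
}

// 50 simulated invocations share a pool capped at 5 connections.
const pool = new ConnectionPool(5);
runInvocations(pool, 50).then((completed) => {
  console.log(completed, pool.openedCount); // 50 5
});
```

All 50 simulated invocations finish while the database side only ever sees 5 connections. Callers that arrive when the pool is busy wait briefly rather than failing.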


Step 5 — Putting Lambdas in a VPC

Since RDS Proxy lives inside a VPC:

  • All Lambdas were moved into private subnets within the same VPC.

  • This allowed private, low-latency connections to RDS Proxy.


Step 6 — Adding Internet Access via NAT Gateway

Once the Lambdas were in the VPC, they lost internet access, which broke calls to Firebase Cloud Messaging (FCM).

Fix:

  • Created a NAT Gateway in a public subnet of the VPC.

  • Updated the private subnets' route tables so the Lambdas could:

    • Connect to RDS Proxy privately.

    • Still reach the internet for external APIs.
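The routing split above can be illustrated with a small sketch. VPC route tables pick the most specific matching route, so a local route for the VPC range and a default route to the NAT Gateway are enough. The CIDR block `10.0.0.0/16`, the gateway id `nat-EXAMPLE`, and the sample addresses are all assumptions chosen for illustration.

```typescript
interface Route {
  destination: string; // CIDR block, e.g. "10.0.0.0/16"
  target: string; // "local" or a gateway id
}

// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// Read the prefix length from a CIDR string.
function prefixLength(cidr: string): number {
  return Number(cidr.slice(cidr.indexOf("/") + 1));
}

// True when `ip` falls inside the CIDR block.
function inCidr(ip: string, cidr: string): boolean {
  const base = cidr.slice(0, cidr.indexOf("/"));
  const prefix = prefixLength(cidr);
  const mask = prefix === 0 ? 0 : (~0 << (32 - prefix)) >>> 0;
  return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(base) & mask) >>> 0);
}

// Pick the most specific (longest-prefix) route that matches the address.
function resolveTarget(ip: string, routes: readonly Route[]): string | undefined {
  let best: Route | undefined;
  let bestPrefix = -1;
  for (const route of routes) {
    const prefix = prefixLength(route.destination);
    if (inCidr(ip, route.destination) && prefix > bestPrefix) {
      best = route;
      bestPrefix = prefix;
    }
  }
  return best?.target;
}

// Hypothetical route table for the private subnets that host the Lambdas.
const privateRouteTable: Route[] = [
  { destination: "10.0.0.0/16", target: "local" }, // in-VPC traffic, e.g. RDS Proxy
  { destination: "0.0.0.0/0", target: "nat-EXAMPLE" }, // everything else, e.g. FCM
];

console.log(resolveTarget("10.0.12.34", privateRouteTable)); // local
console.log(resolveTarget("203.0.113.10", privateRouteTable)); // nat-EXAMPLE
```

Traffic to the database never leaves the VPC, while calls to external push providers go out through the NAT Gateway.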


Final Architecture Diagram

          ┌───────────────────┐
          │  Step Function    │
          │  (Scheduled CRON) │
          └─────────┬─────────┘
                    │
                    ▼
           ┌────────────────────────────────────┐
           │            AWS Lambda(s)           │
           │ (Horizontal Scaling, VPC-Enabled)  │
           │  1️⃣ Query DB via RDS Proxy         │
           │  2️⃣ Send Notifications via APIs    │
           └─────────┬────────────────┬─────────┘
                     │                │
        ┌────────────▼─────────┐   ┌──▼───────────────┐
        │   Amazon RDS Proxy   │   │ NAT Gateway      │
        │ (Connection Pooling) │   │ (Internet Access)│
        └───────────┬──────────┘   └──────────┬──────┘
                    │                        │
          ┌─────────▼──────────┐     ┌───────▼─────────────────┐
          │ PostgreSQL (RDS)   │     │ External APIs (FCM, SNS,│
          └────────────────────┘     │ Push Notification, etc.)│
                                     └─────────────────────────┘

Conclusion

Building an on-time notification system at scale required more than just adding more workers — it demanded a complete architectural rethink. By moving from BullMQ workers to AWS Step Functions for scheduling, Lambda for scalable compute, and RDS Proxy for efficient database connectivity, we achieved a fully serverless, reliable, and low-maintenance solution.

Integrating VPC networking with a NAT Gateway ensured secure database access while still allowing internet connectivity for push and Firebase APIs. Today, our system delivers notifications precisely on time, even during heavy load, while remaining cost-efficient and easy to operate.