Scaling is not one problem — it's a series of problems that reveal themselves at different thresholds. What works at 10K users breaks at 100K. What works at 100K breaks at 1M. The key is knowing which bottlenecks to expect at which scale, and solving them before they become outages.
This article draws on our experience scaling real-time platforms that now handle millions of concurrent users. The patterns here are hard-won and battle-tested.
The Scaling Staircase
Think of scaling as a staircase, not a slope. Each step represents a threshold where a new bottleneck emerges:
Single DB bottleneck
A single Postgres instance starts to show read latency spikes. Solution: Add read replicas and route read-heavy queries to them.
Cache miss storm
Without caching, repeated identical queries hammer the DB. Solution: Introduce Redis for session data, query results, and rate limiting state.
Connection pool exhaustion
Postgres supports only a limited number of concurrent connections (max_connections). Hundreds of app servers × per-server connection pool size quickly exhausts that limit. Solution: PgBouncer for connection pooling.
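A minimal PgBouncer setup in transaction mode might look like the sketch below. Hostnames, the database name appdb, and the pool sizes are illustrative, not prescriptive:

```ini
[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; server connection is released at transaction end
max_client_conn = 10000      ; client-side connections PgBouncer will accept
default_pool_size = 20       ; real Postgres connections per database/user pair
```

The point of transaction mode is the ratio: thousands of client connections are multiplexed onto a few dozen real Postgres connections, because a server connection is only held for the duration of a transaction.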
Write throughput ceiling
Single primary DB can't handle write volume. Solution: Database sharding by tenant ID or user ID, or migrate to a horizontally scalable DB like CockroachDB.
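Sharding by tenant or user ID can be sketched as a stable hash of the ID mapped to a shard number. Everything here is illustrative (the shard count, the SHARD_DSNS mapping, and the DSN format are assumptions for the sketch):

```python
import hashlib

NUM_SHARDS = 16  # illustrative; pick a count you can later split or rebalance

def shard_for(tenant_id: str) -> int:
    """Map a tenant/user ID to a stable shard number.

    Hashing the ID (rather than taking modulo of a numeric ID) spreads
    sequential IDs evenly across shards.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Each shard number maps to the connection string for that shard's primary.
SHARD_DSNS = {n: f"postgres://db-shard-{n}.internal/appdb" for n in range(NUM_SHARDS)}

def dsn_for(tenant_id: str) -> str:
    return SHARD_DSNS[shard_for(tenant_id)]
```

Note that fixed modulo sharding makes resharding painful; that operational cost is exactly why the later section says to exhaust replicas and pooling first.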
Network & stateful services
WebSocket connections, pub/sub, and session affinity become complex at this scale. Solution: Dedicated WebSocket tier with Redis Pub/Sub for message fan-out.
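The fan-out pattern can be shown with a toy in-process broker standing in for Redis Pub/Sub (in production each WebSocket server instance would subscribe through a real Redis client; FanOutBroker and the channel name are inventions for this sketch):

```python
from collections import defaultdict
from queue import Queue

class FanOutBroker:
    """In-process stand-in for Redis Pub/Sub.

    Each WebSocket server instance subscribes to the channels its connected
    clients care about; publishing copies the message to every subscriber.
    """
    def __init__(self):
        self._subscribers = defaultdict(list)  # channel -> [Queue]

    def subscribe(self, channel: str) -> Queue:
        q = Queue()
        self._subscribers[channel].append(q)
        return q

    def publish(self, channel: str, message: str) -> int:
        queues = self._subscribers.get(channel, [])
        for q in queues:
            q.put(message)
        return len(queues)  # number of server instances reached

broker = FanOutBroker()
server_a = broker.subscribe("room:17")  # two WebSocket gateway instances
server_b = broker.subscribe("room:17")
broker.publish("room:17", "hello")      # both instances receive the message
```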
Global distribution
Latency for geographically distributed users becomes the constraint. Solution: Multi-region deployments with geo-routing, edge caching, and regional databases.
Database: The First Bottleneck
The database is almost always the first thing that breaks. Here's the progression we recommend:
- Read replicas — Separate read traffic immediately. In most applications reads dominate, often 80% or more of all queries.
- Connection pooling — PgBouncer in transaction mode between app servers and Postgres is non-negotiable at scale.
- Query optimization — Before sharding, make sure your indexes are optimal. Run EXPLAIN ANALYZE on every slow query.
- Horizontal sharding — Only when the above are exhausted. Sharding adds massive operational complexity.
Hard-learned lesson: We've seen teams jump to sharding when connection pooling alone would have solved their problem. Always exhaust vertical and read-replica scaling before sharding.
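Read/write splitting can be as simple as routing by statement verb. This is a minimal sketch, assuming illustrative DSNs; a real router must also pin reads that immediately follow a write to the primary, to avoid replication-lag anomalies:

```python
import itertools

PRIMARY_DSN = "postgres://db-primary.internal/appdb"  # illustrative
REPLICA_DSNS = [
    "postgres://db-replica-1.internal/appdb",
    "postgres://db-replica-2.internal/appdb",
]

_replica_cycle = itertools.cycle(REPLICA_DSNS)  # round-robin across replicas

def route(query: str) -> str:
    """Send read-only statements to a replica, everything else to the primary."""
    verb = query.lstrip().split(None, 1)[0].lower()
    if verb in ("select", "show"):
        return next(_replica_cycle)
    return PRIMARY_DSN
```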
Caching Strategy
A well-designed cache can reduce database load by 70–90%. Key principles:
- Cache at the right layer — Application-level cache (Redis) for computed results, CDN for static assets, HTTP cache headers for API responses
- Cache invalidation strategy — TTL-based for non-critical data, event-driven invalidation for critical data
- Cache warming — Pre-populate critical caches on deployment to avoid cold-start latency spikes
- Monitor hit rate — A cache hit rate below 80% usually means your cache key design needs work
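The cache-aside pattern with TTL expiry and hit-rate tracking can be sketched as follows. The in-memory dict stands in for Redis, and get_user_profile is a hypothetical expensive read, named only for this example:

```python
import time

class TTLCache:
    """Minimal cache-aside store with per-entry TTL and hit-rate tracking."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            self.hits += 1
            return entry[0]
        self.misses += 1  # missing or expired
        return None

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = TTLCache()

def get_user_profile(user_id):
    key = f"profile:{user_id}"
    profile = cache.get(key)
    if profile is None:                # miss: recompute and repopulate
        profile = {"id": user_id}      # stand-in for a database read
        cache.set(key, profile, ttl_seconds=60)
    return profile
```

Exposing hit_rate() as a metric is what makes the 80% hit-rate rule of thumb above actionable.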
Real-Time at Scale: WebSockets
Maintaining millions of open WebSocket connections is a stateful problem in a world that prefers stateless services. Our approach:
- Dedicated WebSocket gateway tier (separate from REST API servers)
- Redis Pub/Sub for message fan-out across WebSocket server instances
- Sticky sessions via consistent hashing at the load balancer for connection affinity
- Horizontal pod autoscaling based on open connection count in Kubernetes
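The consistent-hashing piece of the list above can be sketched as a hash ring with virtual nodes. Node names and the vnode count are illustrative; the property that matters is that adding or removing a server only remaps the keys on its slice of the ring:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring mapping client IDs to WebSocket servers."""
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node); vnodes smooth the spread
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def node_for(self, client_id: str) -> str:
        """Find the first virtual node at or after the client's hash (wrapping)."""
        idx = bisect.bisect(self._keys, self._hash(client_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["ws-1", "ws-2", "ws-3"])
```

Because the mapping is deterministic, any load balancer instance computes the same server for a given client, giving connection affinity without shared state.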
Load Balancing & Autoscaling
At 1M+ concurrent users, load balancing is not just round-robin. You need:
- Layer 7 load balancing with health-check-aware routing
- Kubernetes HPA (Horizontal Pod Autoscaler) based on CPU, memory, AND custom metrics (requests/second, queue depth)
- Cluster autoscaling to provision new nodes automatically during traffic spikes
- Pre-warming capacity before known traffic events (product launches, marketing campaigns)
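An HPA that combines a resource metric with a custom per-pod metric might look like the sketch below. The deployment name ws-gateway, the metric name websocket_open_connections, and the targets are illustrative, and the custom metric assumes a metrics adapter (such as Prometheus Adapter) is installed in the cluster:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ws-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ws-gateway
  minReplicas: 10
  maxReplicas: 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: websocket_open_connections   # exposed via a metrics adapter
        target:
          type: AverageValue
          averageValue: "20000"              # scale out above ~20K conns/pod
```

With multiple metrics, the HPA takes the highest replica count any single metric demands, so CPU pressure and connection count can each trigger a scale-out independently.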
Chaos Engineering: Test Before Production Tests You
At scale, failures are inevitable. The goal is to fail safely. We implement chaos engineering using tools like Chaos Monkey or Litmus to randomly kill pods, introduce latency, and simulate region failures in staging. A system that handles chaos gracefully in staging is far more likely to handle real outages gracefully in production.
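Fault injection is usually done at the infrastructure level with the tools above, but the "introduce latency" idea can be illustrated at the application level with a small decorator (a hedged sketch; inject_latency and fetch_quote are inventions for this example, meant for staging only):

```python
import functools
import random
import time

def inject_latency(probability=0.1, delay_seconds=0.5, enabled=True):
    """Decorator that randomly delays calls, simulating a slow dependency.

    Wrap a client call in staging and watch whether your timeouts,
    retries, and circuit breakers behave as intended.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if enabled and random.random() < probability:
                time.sleep(delay_seconds)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=1.0, delay_seconds=0.05)  # always delay, for demo
def fetch_quote():
    return "ok"
```

The enabled flag matters: chaos hooks should be compiled out or switched off by configuration in production.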
Scaling challenges holding you back?
Our engineering team has scaled systems from thousands to millions of users. We can audit your architecture and build a scaling roadmap.
Talk to Our Architects →