Scaling is not one problem — it's a series of problems that reveal themselves at different thresholds. What works at 10K users breaks at 100K. What works at 100K breaks at 1M. The key is knowing which bottlenecks to expect at which scale, and solving them before they become outages.
This article draws on our experience scaling real-time platforms that now handle millions of concurrent users. The patterns here are hard-won and battle-tested.
The Scaling Staircase
Think of scaling as a staircase, not a slope. Each step represents a threshold where a new bottleneck emerges:
Single DB bottleneck
A single Postgres instance starts to show read latency spikes. Solution: Add read replicas and route read-heavy queries to them.
Cache miss storm
Without caching, repeated identical queries hammer the DB. Solution: Introduce Redis for session data, query results, and rate limiting state.
Connection pool exhaustion
Postgres supports only a limited number of concurrent connections (max_connections). Hundreds of app servers × per-server connection pool size quickly exhausts that limit. Solution: PgBouncer for connection pooling.
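A minimal PgBouncer setup in transaction mode might look like the sketch below. Hostnames, the database name appdb, and the pool sizes are illustrative, not prescriptive:

```ini
[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; server connection is released at transaction end
max_client_conn = 10000      ; client-side connections PgBouncer will accept
default_pool_size = 20       ; real Postgres connections per database/user pair
```

The point of transaction mode is the ratio: thousands of client connections are multiplexed onto a few dozen real Postgres connections, because a server connection is only held for the duration of a transaction.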
Write throughput ceiling
Single primary DB can't handle write volume. Solution: Database sharding by tenant ID or user ID, or migrate to a horizontally scalable DB like CockroachDB.
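Sharding by tenant or user ID can be sketched as a stable hash of the ID mapped to a shard number. Everything here is illustrative (the shard count, the SHARD_DSNS mapping, and the DSN format are assumptions for the sketch):

```python
import hashlib

NUM_SHARDS = 16  # illustrative; pick a count you can later split or rebalance

def shard_for(tenant_id: str) -> int:
    """Map a tenant/user ID to a stable shard number.

    Hashing the ID (rather than taking modulo of a numeric ID) spreads
    sequential IDs evenly across shards.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Each shard number maps to the connection string for that shard's primary.
SHARD_DSNS = {n: f"postgres://db-shard-{n}.internal/appdb" for n in range(NUM_SHARDS)}

def dsn_for(tenant_id: str) -> str:
    return SHARD_DSNS[shard_for(tenant_id)]
```

Note that fixed modulo sharding makes resharding painful; that operational cost is exactly why the later section says to exhaust replicas and pooling first.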
Network & stateful services
WebSocket connections, pub/sub, and session affinity become complex at this scale. Solution: Dedicated WebSocket tier with Redis Pub/Sub for message fan-out.
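The fan-out pattern can be shown with a toy in-process broker standing in for Redis Pub/Sub (in production each WebSocket server instance would subscribe through a real Redis client; FanOutBroker and the channel name are inventions for this sketch):

```python
from collections import defaultdict
from queue import Queue

class FanOutBroker:
    """In-process stand-in for Redis Pub/Sub.

    Each WebSocket server instance subscribes to the channels its connected
    clients care about; publishing copies the message to every subscriber.
    """
    def __init__(self):
        self._subscribers = defaultdict(list)  # channel -> [Queue]

    def subscribe(self, channel: str) -> Queue:
        q = Queue()
        self._subscribers[channel].append(q)
        return q

    def publish(self, channel: str, message: str) -> int:
        queues = self._subscribers.get(channel, [])
        for q in queues:
            q.put(message)
        return len(queues)  # number of server instances reached

broker = FanOutBroker()
server_a = broker.subscribe("room:17")  # two WebSocket gateway instances
server_b = broker.subscribe("room:17")
broker.publish("room:17", "hello")      # both instances receive the message
```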
Global distribution
Latency for geographically distributed users becomes the constraint. Solution: Multi-region deployments with geo-routing, edge caching, and regional databases.
Database: The First Bottleneck
The database is almost always the first thing that breaks. Here's the progression we recommend:
- Read replicas — Separate read traffic immediately. In most applications reads dominate, often 80% or more of all queries.
- Connection pooling — PgBouncer in transaction mode between app servers and Postgres is non-negotiable at scale.
- Query optimization — Before sharding, make sure your indexes are optimal. Run EXPLAIN ANALYZE on every slow query.
- Horizontal sharding — Only when the above are exhausted. Sharding adds massive operational complexity.
Hard-learned lesson: We've seen teams jump to sharding when connection pooling alone would have solved their problem. Always exhaust vertical and read-replica scaling before sharding.
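Read/write splitting can be as simple as routing by statement verb. This is a minimal sketch, assuming illustrative DSNs; a real router must also pin reads that immediately follow a write to the primary, to avoid replication-lag anomalies:

```python
import itertools

PRIMARY_DSN = "postgres://db-primary.internal/appdb"  # illustrative
REPLICA_DSNS = [
    "postgres://db-replica-1.internal/appdb",
    "postgres://db-replica-2.internal/appdb",
]

_replica_cycle = itertools.cycle(REPLICA_DSNS)  # round-robin across replicas

def route(query: str) -> str:
    """Send read-only statements to a replica, everything else to the primary."""
    verb = query.lstrip().split(None, 1)[0].lower()
    if verb in ("select", "show"):
        return next(_replica_cycle)
    return PRIMARY_DSN
```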
Caching Strategy
A well-designed cache can reduce database load by 70–90%. Key principles:
- Cache at the right layer — Application-level cache (Redis) for computed results, CDN for static assets, HTTP cache headers for API responses
- Cache invalidation strategy — TTL-based for non-critical data, event-driven invalidation for critical data
- Cache warming — Pre-populate critical caches on deployment to avoid cold-start latency spikes
- Monitor hit rate — A cache hit rate below 80% usually means your cache key design needs work
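The cache-aside pattern with TTL expiry and hit-rate tracking can be sketched as follows. The in-memory dict stands in for Redis, and get_user_profile is a hypothetical expensive read, named only for this example:

```python
import time

class TTLCache:
    """Minimal cache-aside store with per-entry TTL and hit-rate tracking."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            self.hits += 1
            return entry[0]
        self.misses += 1  # missing or expired
        return None

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = TTLCache()

def get_user_profile(user_id):
    key = f"profile:{user_id}"
    profile = cache.get(key)
    if profile is None:                # miss: recompute and repopulate
        profile = {"id": user_id}      # stand-in for a database read
        cache.set(key, profile, ttl_seconds=60)
    return profile
```

Exposing hit_rate() as a metric is what makes the 80% hit-rate rule of thumb above actionable.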
Real-Time at Scale: WebSockets
Maintaining millions of open WebSocket connections is a stateful problem in a world that prefers stateless services. Our approach:
- Dedicated WebSocket gateway tier (separate from REST API servers)
- Redis Pub/Sub for message fan-out across WebSocket server instances
- Sticky sessions via consistent hashing at the load balancer for connection affinity
- Horizontal pod autoscaling based on open connection count in Kubernetes
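The consistent-hashing piece of the list above can be sketched as a hash ring with virtual nodes. Node names and the vnode count are illustrative; the property that matters is that adding or removing a server only remaps the keys on its slice of the ring:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring mapping client IDs to WebSocket servers."""
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node); vnodes smooth the spread
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def node_for(self, client_id: str) -> str:
        """Find the first virtual node at or after the client's hash (wrapping)."""
        idx = bisect.bisect(self._keys, self._hash(client_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["ws-1", "ws-2", "ws-3"])
```

Because the mapping is deterministic, any load balancer instance computes the same server for a given client, giving connection affinity without shared state.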
Load Balancing & Autoscaling
At 1M+ concurrent users, load balancing is not just round-robin. You need:
- Layer 7 load balancing with health-check-aware routing
- Kubernetes HPA (Horizontal Pod Autoscaler) based on CPU, memory, AND custom metrics (requests/second, queue depth)
- Cluster autoscaling to provision new nodes automatically during traffic spikes
- Pre-warming capacity before known traffic events (product launches, marketing campaigns)
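An HPA that combines a resource metric with a custom per-pod metric might look like the sketch below. The deployment name ws-gateway, the metric name websocket_open_connections, and the targets are illustrative, and the custom metric assumes a metrics adapter (such as Prometheus Adapter) is installed in the cluster:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ws-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ws-gateway
  minReplicas: 10
  maxReplicas: 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: websocket_open_connections   # exposed via a metrics adapter
        target:
          type: AverageValue
          averageValue: "20000"              # scale out above ~20K conns/pod
```

With multiple metrics, the HPA takes the highest replica count any single metric demands, so CPU pressure and connection count can each trigger a scale-out independently.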
Chaos Engineering: Test Before Production Tests You
At scale, failures are inevitable. The goal is to fail safely. We implement chaos engineering using tools like Chaos Monkey or Litmus to randomly kill pods, introduce latency, and simulate region failures in staging. A system that handles chaos gracefully in staging is far more likely to handle real outages gracefully in production.
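Fault injection is usually done at the infrastructure level with the tools above, but the "introduce latency" idea can be illustrated at the application level with a small decorator (a hedged sketch; inject_latency and fetch_quote are inventions for this example, meant for staging only):

```python
import functools
import random
import time

def inject_latency(probability=0.1, delay_seconds=0.5, enabled=True):
    """Decorator that randomly delays calls, simulating a slow dependency.

    Wrap a client call in staging and watch whether your timeouts,
    retries, and circuit breakers behave as intended.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if enabled and random.random() < probability:
                time.sleep(delay_seconds)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=1.0, delay_seconds=0.05)  # always delay, for demo
def fetch_quote():
    return "ok"
```

The enabled flag matters: chaos hooks should be compiled out or switched off by configuration in production.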
Scaling challenges holding you back?
Our engineering team has scaled systems from thousands to millions of users. We can audit your architecture and build a scaling roadmap.
Talk to Our Architects →