Why rate limiting is a system design concern
Rate limiting is more than API protection. It shapes fairness, cost control, and reliability under traffic spikes. A weak limiter either blocks valid users or lets abusive traffic consume shared resources. A good limiter answers three questions:
- Who is being limited?
- What window and quota apply?
- How do we enforce this consistently across instances?
Algorithm choices
Common strategies have clear trade-offs.
- Fixed window: simple, but a burst that straddles a window boundary can pass up to twice the quota
- Sliding log: accurate, but expensive at high volume
- Sliding window counter: good balance for most APIs
- Token bucket: best for controlled bursts
In practice, token bucket plus per-route policy works well for product APIs.
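To make the token bucket concrete, here is a minimal single-process sketch. The class name, parameter names, and the optional `now` argument (used to make the behavior easy to reason about) are assumptions for illustration, not part of any particular library.

```python
import time
from typing import Optional


class TokenBucket:
    """In-memory token bucket: `capacity` bounds bursts, `refill_rate` bounds sustained rate."""

    def __init__(self, capacity: float, refill_rate: float, now: Optional[float] = None) -> None:
        self.capacity = capacity          # maximum number of tokens (burst size)
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start full so the first burst is allowed
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A per-route policy then amounts to creating buckets with different `capacity` and `refill_rate` values per route. This version is not thread-safe and only works inside one process.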
Distributed architecture
A single-instance in-memory limiter fails once traffic is load-balanced. Use a shared fast store (commonly Redis) with atomic operations.
- Key format: tenant:user:route
- Value: tokens + last refill timestamp
- Enforcement: atomic Lua script to avoid race conditions
Keep script latency low and keep key cardinality bounded, for example by expiring idle keys.
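The key format, stored value, and atomic script described above might be sketched as follows. This assumes a redis-py style client whose `eval(script, numkeys, *keys_and_args)` method runs a Lua script atomically. The function names, the hash field names (`tokens`, `ts`), and the choice to pass the caller's clock as `now` are illustrative assumptions.

```python
import time
from typing import Optional

# Lua runs atomically inside Redis, so the read-refill-consume-write sequence cannot interleave.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + math.max(0, now - ts) * refill_rate)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- Expire idle keys once the bucket would be full again, to bound key cardinality.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / refill_rate) + 1)
return allowed
"""


def bucket_key(tenant: str, user: str, route: str) -> str:
    # Key format from the text: tenant:user:route
    return f"{tenant}:{user}:{route}"


def allow_request(client, tenant: str, user: str, route: str,
                  capacity: int, refill_rate: float,
                  cost: int = 1, now: Optional[float] = None) -> bool:
    # Assumption: the caller's wall clock is used; clock skew between instances affects refill.
    now = time.time() if now is None else now
    result = client.eval(TOKEN_BUCKET_LUA, 1, bucket_key(tenant, user, route),
                         capacity, refill_rate, now, cost)
    return result == 1
```

With redis-py this would be called as `allow_request(redis.Redis(), "acme", "u1", "/search", capacity=10, refill_rate=5)`. Registering the script once and invoking it by hash would avoid resending the script body on every call.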
Multi-tenant fairness
If all tenants share one global quota, a single noisy neighbor can exhaust it for everyone. Define limits at multiple levels.
- Global service limit
- Tenant-level limit
- User-level or token-level limit
- Route-level overrides for expensive endpoints
This layered model keeps premium and free traffic predictable.
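One way to picture the layered model is a request that must fit within every level that applies to it. The sketch below uses simple in-memory counters instead of real buckets to keep the idea visible. The key naming scheme and the `Budget` type are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Budget:
    limit: int
    used: int = 0

    def has_room(self, cost: int = 1) -> bool:
        return self.used + cost <= self.limit


def layered_keys(tenant: str, user: str, route: str) -> List[str]:
    # Broadest scope first: global, tenant, user, then a route-level override.
    return [
        "global",
        f"tenant:{tenant}",
        f"tenant:{tenant}:user:{user}",
        f"tenant:{tenant}:user:{user}:route:{route}",
    ]


def allow(budgets: Dict[str, Budget], tenant: str, user: str, route: str,
          cost: int = 1) -> Tuple[bool, Optional[str]]:
    # Only levels with a configured budget are enforced.
    keys = [k for k in layered_keys(tenant, user, route) if k in budgets]
    # Check every level before consuming, so a denial does not burn quota elsewhere.
    for key in keys:
        if not budgets[key].has_room(cost):
            return False, key
    for key in keys:
        budgets[key].used += cost
    return True, None
```

Returning the level that blocked the request makes it easier to tell a tenant-wide problem from a single misbehaving user.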
Handling partial outages
Limiter dependency failure should not blindly take your API down.
- Fail-open for non-critical endpoints
- Fail-closed for sensitive endpoints
- Local emergency budget cache for short Redis disruptions
Document these behaviors clearly so on-call decisions are consistent.
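The outage behaviors above could be combined in a small wrapper like the one below. It is one possible reading, not a prescribed design: sensitive routes are denied when the limiter is unreachable, and other routes are allowed only while a small per-process emergency budget lasts. `LimiterUnavailable`, `EmergencyBudget`, and `decide` are hypothetical names.

```python
from typing import Callable, Set


class LimiterUnavailable(Exception):
    """Raised by the remote check when the shared store cannot be reached."""


class EmergencyBudget:
    """Small per-process allowance used only while the shared limiter is down."""

    def __init__(self, max_requests: int) -> None:
        self.remaining = max_requests

    def take(self) -> bool:
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False


def decide(check: Callable[[str], bool], route: str,
           fail_closed_routes: Set[str], budget: EmergencyBudget) -> bool:
    try:
        return check(route)          # normal path: ask the shared limiter
    except LimiterUnavailable:
        if route in fail_closed_routes:
            return False             # sensitive endpoint: fail closed
        return budget.take()         # non-critical endpoint: fail open, but bounded
```

In a real service the emergency budget would be reset once the limiter recovers. That step is left out here.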
Metrics that matter
- Allowed vs blocked request ratio
- p95 limiter decision latency
- Top keys by block count
- Redis script error and timeout rate
Without these metrics, tuning limits becomes guesswork.
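As a rough illustration of the first three metrics, the sketch below keeps everything in memory and computes p95 with the nearest-rank method. A real deployment would more likely export counters and histograms to a metrics system. The class and method names are assumptions.

```python
from collections import Counter
from typing import List, Tuple


class LimiterMetrics:
    def __init__(self) -> None:
        self.allowed = 0
        self.blocked = 0
        self.latencies_ms: List[float] = []
        self.blocks_by_key: Counter = Counter()

    def record(self, key: str, allowed: bool, latency_ms: float) -> None:
        self.latencies_ms.append(latency_ms)
        if allowed:
            self.allowed += 1
        else:
            self.blocked += 1
            self.blocks_by_key[key] += 1

    def blocked_ratio(self) -> float:
        total = self.allowed + self.blocked
        return self.blocked / total if total else 0.0

    def p95_latency_ms(self) -> float:
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = (95 * len(ordered) + 99) // 100   # nearest-rank: ceil(0.95 * n)
        return ordered[rank - 1]

    def top_blocked(self, n: int = 3) -> List[Tuple[str, int]]:
        return self.blocks_by_key.most_common(n)
```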
Reference architecture
A production setup usually includes policy evaluation near the API edge.
- API gateway receives request and extracts identity context
- Limiter service evaluates quota policy and key hierarchy
- Redis executes atomic token-bucket logic via Lua
- Decision headers (X-RateLimit-*) are returned to clients
- Central policy store controls per-plan and per-route limits
Keeping policy and enforcement separated makes tuning safer.
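The decision headers in this flow might be built as shown below. The specific `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` names are a widely used convention rather than a guarantee of what any given gateway emits, and expressing the reset as seconds from now is an assumption. `Retry-After` is a standard HTTP header.

```python
import math
from typing import Dict


def rate_limit_headers(limit: int, remaining: int,
                       reset_after_s: float, allowed: bool) -> Dict[str, str]:
    reset = str(math.ceil(reset_after_s))
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": reset,      # assumption: seconds until the quota recovers
    }
    if not allowed:
        # Tell well-behaved clients how long to back off before retrying.
        headers["Retry-After"] = reset
    return headers
```

A blocked request would typically pair these headers with an HTTP 429 status.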
Choosing limits without harming UX
Start with product behavior, not arbitrary numbers.
- Interactive endpoints: lower burst and tighter sustained rate
- Background ingestion APIs: larger burst but strict sustained budget
- Authentication endpoints: strict limits with stronger abuse controls
- Internal service calls: service-account specific limits
Run limits in observe mode first and inspect affected traffic cohorts.
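The endpoint classes above could be captured in a small policy table such as the following. Every number here is a placeholder chosen only to show the relative shape (burst versus sustained rate), not a recommendation. The class names, routes, and the `enforce` flag are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Policy:
    burst: int               # bucket capacity
    sustained_per_s: float   # refill rate
    enforce: bool = False    # False means observe mode: record, do not block


# Placeholder values: tune from observed traffic, not from this table.
POLICIES: Dict[str, Policy] = {
    "interactive": Policy(burst=10, sustained_per_s=5.0),
    "ingestion": Policy(burst=500, sustained_per_s=50.0),
    "auth": Policy(burst=5, sustained_per_s=0.2),
    "internal": Policy(burst=200, sustained_per_s=100.0),
}

# Hypothetical routes mapped to an endpoint class.
ROUTE_CLASS: Dict[str, str] = {
    "/login": "auth",
    "/events/bulk": "ingestion",
    "/feed": "interactive",
}


def policy_for(route: str, default_class: str = "interactive") -> Policy:
    return POLICIES[ROUTE_CLASS.get(route, default_class)]
```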
Hardening patterns
As traffic grows, these controls prevent noisy outages.
- Hot-key mitigation by sharding high-volume identities
- Per-region local limiting plus global safety budget
- Circuit-breaker behavior when Redis latency spikes
- Separate quotas for expensive operations (search, exports, AI calls)
Rate limiting should evolve as product usage evolves.
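As an illustration of hot-key mitigation, one high-volume identity can be spread over several keys, with each shard holding a slice of the total limit. This trades exactness for load distribution: a request can be rejected on a full shard while other shards still have room. The function names and the `:shard:` suffix are assumptions.

```python
import random
from typing import Optional


def sharded_key(base_key: str, shards: int, rng: Optional[random.Random] = None) -> str:
    # Pick a shard at random so traffic for one identity spreads across keys.
    source = rng if rng is not None else random
    return f"{base_key}:shard:{source.randrange(shards)}"


def per_shard_limit(total_limit: int, shards: int) -> int:
    # Each shard enforces an equal slice of the total, with a floor of 1.
    return max(1, total_limit // shards)
```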
Common product mistakes
Many problems come from policy design, not the limiter algorithm itself.
- One limit for every endpoint
- No difference between free and paid users
- Limits chosen without observing normal user behavior
- No useful response headers for clients
These mistakes create frustration even when the limiter is technically correct.
A good rollout strategy
Introduce new limits in observe mode first whenever possible.
- Measure who would be blocked
- Review noisy tenants and client bugs
- Adjust burst and sustained limits separately
- Turn on enforcement only after traffic looks healthy
This makes rate limiting feel predictable instead of random to users and teams.
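Observe mode can be sketched as a thin layer over the limiter's decision: when enforcement is off, a would-be block is recorded and the request is let through. The function name and the `shadow_blocks` counter are hypothetical.

```python
from collections import Counter


def apply_decision(would_allow: bool, enforce: bool, key: str,
                   shadow_blocks: Counter) -> bool:
    """Return whether the request proceeds, recording blocks that observe mode suppressed."""
    if would_allow:
        return True
    if enforce:
        return False
    # Observe mode: count who would have been blocked, but do not block.
    shadow_blocks[key] += 1
    return True
```

Reviewing `shadow_blocks.most_common()` before turning on enforcement shows which tenants or clients the new limit would affect.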
Final takeaway
Rate limiting is a first-class architecture component. Treat it like core infrastructure: explicit policies, atomic distributed enforcement, and clear outage behavior.