Rate Limiting¶
Pyvider RPC Plugin integrates with Foundation's rate limiting system, providing server-side protection with a token bucket algorithm that guards against abuse, ensures fair resource usage, and maintains service quality under high load.
Overview¶
Rate limiting operates at the server level using Foundation's TokenBucketRateLimiter and applies to all incoming requests, regardless of client or request type. The token bucket algorithm provides:
- Sustained rate control: Limits average requests per second over time
- Burst handling: Allows temporary spikes in traffic up to a configured limit
- Fair resource allocation: Prevents any single client from overwhelming the server
- Graceful degradation: Returns standard gRPC errors when limits are exceeded
Foundation Integration¶
Rate limiting is implemented using Foundation's rate limiting utilities:
from provide.foundation.utils.rate_limiting import TokenBucketRateLimiter
from pyvider.rpcplugin.config import rpcplugin_config

# Rate limiter is automatically configured from environment
if rpcplugin_config.plugin_rate_limit_enabled:
    rate_limiter = TokenBucketRateLimiter(
        refill_rate=rpcplugin_config.plugin_rate_limit_requests_per_second,
        capacity=rpcplugin_config.plugin_rate_limit_burst_capacity,
    )
Configuration¶
Basic Rate Limiting¶
Enable rate limiting with default settings:
# Enable rate limiting with defaults
export PLUGIN_RATE_LIMIT_ENABLED=true
# Default: 100 requests/second, 200 burst capacity
Custom Rate Configuration¶
Configure specific rate and burst limits:
# Custom rate limiting configuration
export PLUGIN_RATE_LIMIT_ENABLED=true
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=50.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=100
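The same limits can also be set programmatically at startup. Below is a minimal sketch, assuming configure() accepts the rate_limit_* keyword arguments used in the Dynamic Rate Limiting example later on this page:

```python
from pyvider.rpcplugin import configure

# Programmatic equivalent of the environment variables above; keyword names
# mirror the configure() call in the Dynamic Rate Limiting example below.
configure(
    rate_limit_requests_per_second=50.0,
    rate_limit_burst_capacity=100,
)
```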
Configuration Parameters¶
| Parameter | Environment Variable | Type | Default | Description |
|---|---|---|---|---|
| Enabled | PLUGIN_RATE_LIMIT_ENABLED | bool | false | Enable/disable rate limiting |
| Rate | PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND | float | 100.0 | Average requests per second allowed |
| Burst | PLUGIN_RATE_LIMIT_BURST_CAPACITY | int | 200 | Maximum requests in a burst (token bucket size) |
Token Bucket Algorithm¶
The rate limiter uses a token bucket algorithm with the following behavior:
How It Works¶
- Token Generation: Tokens are added to the bucket at the configured rate (requests per second)
- Request Processing: Each request consumes one token from the bucket
- Burst Handling: Bucket can hold up to the burst capacity in tokens
- Rate Limiting: When the bucket is empty, requests are rejected with a RESOURCE_EXHAUSTED error (see the sketch below)
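The sketch below illustrates this flow. It is a simplified, hypothetical implementation for explanation only, not Foundation's internal TokenBucketRateLimiter.

```python
import time

class SimpleTokenBucket:
    """Illustrative token bucket (not the Foundation implementation)."""

    def __init__(self, refill_rate: float, capacity: float) -> None:
        self.refill_rate = refill_rate   # tokens added per second
        self.capacity = capacity         # burst capacity (bucket size)
        self.tokens = capacity           # start full so an initial burst passes
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # 1. Token generation: refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        # 2. Request processing: each request consumes one token
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # 3. Rate limiting: empty bucket -> caller responds with RESOURCE_EXHAUSTED
        return False
```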
Example Scenarios¶
Scenario 1: Steady Traffic¶
# Configuration: 10 RPS, 20 burst capacity
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=10.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=20
- Sustained load: 10 requests/second are consistently allowed
- Idle periods: Bucket fills to 20 tokens during quiet periods
- Steady state: Bucket maintains ~10 tokens during consistent load
Scenario 2: Bursty Traffic¶
# Configuration: 50 RPS, 200 burst capacity
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=50.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=200
- Initial burst: 200 requests can be processed immediately
- Recovery: Takes 4 seconds (200/50) to refill bucket after burst
- Ongoing: 50 requests/second sustained after initial burst
Scenario 3: High Throughput¶
# Configuration: 1000 RPS, 2000 burst capacity
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=1000.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=2000
- High capacity: Handles 1000 requests/second consistently
- Large bursts: Can accommodate 2000 request spikes
- Quick recovery: 2 second recovery time after burst
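The recovery times quoted in these scenarios follow directly from dividing burst capacity by the refill rate. A small helper (hypothetical, for illustration only) that reproduces the arithmetic:

```python
def describe_scenario(requests_per_second: float, burst_capacity: int) -> None:
    """Print the burst and recovery characteristics implied by a token bucket config."""
    recovery_seconds = burst_capacity / requests_per_second  # time to refill an empty bucket
    print(f"Initial burst: up to {burst_capacity} requests accepted immediately")
    print(f"Sustained rate: {requests_per_second:g} requests/second thereafter")
    print(f"Recovery after a full burst: {recovery_seconds:g} seconds")

describe_scenario(10.0, 20)      # Scenario 1: 2 second recovery
describe_scenario(50.0, 200)     # Scenario 2: 4 second recovery
describe_scenario(1000.0, 2000)  # Scenario 3: 2 second recovery
```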
Configuration Patterns¶
Development Environment¶
For development, use lenient rate limiting to avoid interrupting testing:
# Development rate limiting - very permissive
export PLUGIN_RATE_LIMIT_ENABLED=true
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=1000.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=5000
Production Environment¶
Web API Backend¶
# Typical web API rate limiting
export PLUGIN_RATE_LIMIT_ENABLED=true
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=100.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=300
High-Throughput Service¶
# High-throughput microservice
export PLUGIN_RATE_LIMIT_ENABLED=true
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=1000.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=2000
Administrative Interface¶
# Conservative rate limiting for admin operations
export PLUGIN_RATE_LIMIT_ENABLED=true
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=10.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=50
Public API¶
# Public-facing API with abuse protection
export PLUGIN_RATE_LIMIT_ENABLED=true
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=20.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=100
Error Handling¶
When rate limits are exceeded, the server returns a standard gRPC RESOURCE_EXHAUSTED error. Clients should handle this gracefully with appropriate retry logic.
Server-Side Error Response¶
# When rate limit is exceeded, server automatically returns:
# grpc.StatusCode.RESOURCE_EXHAUSTED: "Rate limit exceeded"
Client-Side Error Handling¶
import grpc
import asyncio

from provide.foundation import logger
from pyvider.rpcplugin import plugin_client

async def handle_rate_limited_request():
    async with plugin_client() as client:
        for attempt in range(3):
            try:
                response = await client.my_service.process_request(data="example")
                return response
            except grpc.aio.AioRpcError as e:
                if e.code() == grpc.StatusCode.RESOURCE_EXHAUSTED:
                    # Rate limit exceeded, implement backoff
                    backoff_time = min(2 ** attempt, 10)  # Exponential backoff, max 10s
                    logger.warning(f"Rate limited, retrying in {backoff_time}s")
                    await asyncio.sleep(backoff_time)
                    continue
                else:
                    # Other error, don't retry
                    raise

        # All retries exhausted
        raise Exception("Request failed after rate limit retries")
Graceful Degradation¶
Implement a circuit breaker pattern for persistent rate limiting:
import time
from enum import Enum

import grpc

from provide.foundation import logger

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Rate limited, rejecting requests
    HALF_OPEN = "half_open"  # Testing if rate limiting is resolved

class RateLimitCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = CircuitState.CLOSED

    async def call_with_circuit_breaker(self, client, request_func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            # Check if we should transition to half-open
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                logger.info("Circuit breaker transitioning to half-open")
            else:
                raise Exception("Circuit breaker is open due to rate limiting")

        try:
            result = await request_func(*args, **kwargs)
            # Success - reset circuit breaker
            if self.state == CircuitState.HALF_OPEN:
                logger.info("Circuit breaker closing - rate limiting resolved")
                self.state = CircuitState.CLOSED
                self.failure_count = 0
            return result
        except grpc.aio.AioRpcError as e:
            if e.code() == grpc.StatusCode.RESOURCE_EXHAUSTED:
                self.failure_count += 1
                self.last_failure_time = time.time()
                if self.failure_count >= self.failure_threshold:
                    logger.warning(f"Circuit breaker opening due to {self.failure_count} rate limit failures")
                    self.state = CircuitState.OPEN
                raise
            else:
                # Non-rate-limiting error, don't affect circuit breaker
                raise
Monitoring and Observability¶
Rate Limiting Metrics¶
Monitor these key metrics to understand rate limiting behavior:
import grpc
from prometheus_client import Counter, Gauge, Histogram

# Define rate limiting metrics
RATE_LIMIT_REQUESTS = Counter(
    'plugin_rate_limit_requests_total',
    'Total requests processed by rate limiter',
    ['result'],
)
RATE_LIMIT_TOKENS = Gauge(
    'plugin_rate_limit_tokens_available',
    'Available tokens in rate limit bucket',
)
RATE_LIMIT_LATENCY = Histogram(
    'plugin_rate_limit_check_duration_seconds',
    'Time spent checking rate limits',
)

# Example implementation (conceptual - actual implementation is internal)
async def rate_limited_handler(request, context):
    with RATE_LIMIT_LATENCY.time():
        if rate_limiter.allow_request():
            RATE_LIMIT_REQUESTS.labels(result='allowed').inc()
            RATE_LIMIT_TOKENS.set(rate_limiter.available_tokens())
            return await process_request(request, context)
        else:
            RATE_LIMIT_REQUESTS.labels(result='rejected').inc()
            context.set_code(grpc.StatusCode.RESOURCE_EXHAUSTED)
            context.set_details("Rate limit exceeded")
            return empty_response()
Logging Rate Limiting Events¶
from provide.foundation import logger

# Rate limiting is automatically logged by the framework
# Example log messages you'll see:

# When rate limiting is enabled:
logger.info("🚦 Rate limiting enabled", extra={
    "requests_per_second": 100.0,
    "burst_capacity": 200.0,
})

# When requests are rate limited:
logger.warning("🚦 Request rate limited", extra={
    "client_ip": "192.168.1.100",
    "available_tokens": 0,
    "refill_rate": 100.0,
})

# When burst capacity is reached:
logger.info("🚦 Burst capacity reached", extra={
    "burst_size": 200.0,
    "recovery_time_seconds": 2.0,
})
Health Check Integration¶
Rate limiting status can be included in health checks:
from pyvider.rpcplugin import RPCPluginServer

class CustomHealthServicer:
    async def check_health(self) -> dict:
        health_status = {
            "status": "SERVING",
            "rate_limiting": {
                "enabled": True,
                "current_tokens": rate_limiter.available_tokens(),
                "requests_per_second": 100.0,
                "burst_capacity": 200.0,
            },
        }
        return health_status
Performance Considerations¶
Rate Limiter Overhead¶
The token bucket implementation is highly optimized:
- O(1) complexity: Constant time for rate limit checks
- Minimal memory: Small fixed memory footprint per server
- Lock-free: Uses atomic operations for thread safety
- Low latency: Sub-microsecond overhead per request
Benchmarks¶
Typical performance characteristics:
| Configuration | Overhead per Request | Memory Usage |
|---|---|---|
| 100 RPS, 200 burst | < 0.1μs | < 1KB |
| 1000 RPS, 2000 burst | < 0.1μs | < 1KB |
| 10000 RPS, 20000 burst | < 0.2μs | < 1KB |
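These figures are hardware dependent. A rough, hypothetical harness for measuring the cost of a pure-Python token bucket check on your own machine (the built-in limiter is internal and may be faster):

```python
import time

def bench_bucket_check(iterations: int = 1_000_000) -> float:
    """Return the average time, in seconds, of one token bucket check."""
    capacity = tokens = 2000.0
    refill_rate = 1000.0
    last = time.monotonic()

    start = time.perf_counter()
    for _ in range(iterations):
        now = time.monotonic()
        tokens = min(capacity, tokens + (now - last) * refill_rate)
        last = now
        if tokens >= 1.0:
            tokens -= 1.0
    return (time.perf_counter() - start) / iterations

print(f"~{bench_bucket_check() * 1e6:.2f} microseconds per check")
```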
Tuning for Performance¶
For high-throughput services:
# High-performance configuration
export PLUGIN_RATE_LIMIT_ENABLED=true
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=5000.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=10000
# Consider disabling rate limiting for internal services
export PLUGIN_RATE_LIMIT_ENABLED=false
Advanced Patterns¶
Per-Client Rate Limiting¶
While the built-in rate limiter is global, you can implement per-client rate limiting in your service handlers:
import asyncio
from collections import defaultdict
from time import time

import grpc

class PerClientRateLimiter:
    def __init__(self, tokens_per_second: float, bucket_size: float):
        self.rate = tokens_per_second
        self.capacity = bucket_size
        self.buckets = defaultdict(lambda: {
            'tokens': bucket_size,
            'last_update': time(),
        })

    def allow_request(self, client_id: str) -> bool:
        now = time()
        bucket = self.buckets[client_id]

        # Add tokens based on elapsed time
        elapsed = now - bucket['last_update']
        tokens_to_add = elapsed * self.rate
        bucket['tokens'] = min(self.capacity, bucket['tokens'] + tokens_to_add)
        bucket['last_update'] = now

        # Check if request is allowed
        if bucket['tokens'] >= 1.0:
            bucket['tokens'] -= 1.0
            return True
        return False

# Usage in your service - 10 requests/second with burst capacity of 50
per_client_limiter = PerClientRateLimiter(tokens_per_second=10.0, bucket_size=50.0)

async def my_service_method(self, request, context):
    client_id = context.peer()  # or extract from metadata
    if not per_client_limiter.allow_request(client_id):
        context.set_code(grpc.StatusCode.RESOURCE_EXHAUSTED)
        context.set_details(f"Rate limit exceeded for client {client_id}")
        return ErrorResponse()

    return await process_request(request)
Dynamic Rate Limiting¶
Adjust rate limits based on system load:
import asyncio

import psutil

from provide.foundation import logger
from pyvider.rpcplugin import configure

class AdaptiveRateLimiter:
    def __init__(self):
        self.base_rate = 100.0
        self.max_rate = 500.0
        self.min_rate = 10.0

    async def update_rate_limits(self):
        while True:
            # Get system metrics
            cpu_percent = psutil.cpu_percent(interval=1)
            memory_percent = psutil.virtual_memory().percent

            # Calculate rate based on system load
            if cpu_percent > 80 or memory_percent > 80:
                # High load - reduce rate
                new_rate = max(self.min_rate, self.base_rate * 0.5)
            elif cpu_percent < 30 and memory_percent < 50:
                # Low load - increase rate
                new_rate = min(self.max_rate, self.base_rate * 1.5)
            else:
                # Normal load - use base rate
                new_rate = self.base_rate

            # Update configuration
            configure(
                rate_limit_requests_per_second=new_rate,
                rate_limit_burst_capacity=new_rate * 2,
            )

            logger.info(f"📊 Adjusted rate limit to {new_rate} RPS based on system load")
            await asyncio.sleep(30)  # Update every 30 seconds

# Start adaptive rate limiting
adaptive_limiter = AdaptiveRateLimiter()
asyncio.create_task(adaptive_limiter.update_rate_limits())
Troubleshooting¶
Common Issues¶
1. Too Aggressive Rate Limiting¶
Symptoms: Legitimate requests being rejected, high error rates
Solution: Increase burst capacity or requests per second
# Before (too restrictive)
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=5.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=10
# After (more permissive)
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=20.0
export PLUGIN_RATE_LIMIT_BURST_CAPACITY=100
2. Rate Limiting Not Working¶
Symptoms: No rate limiting observed, server overwhelmed
Solution: Verify rate limiting is enabled and configured correctly
# Check configuration
export PLUGIN_RATE_LIMIT_ENABLED=true # Must be explicit
export PLUGIN_RATE_LIMIT_REQUESTS_PER_SECOND=100.0
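You can also confirm from inside the plugin process that the settings were picked up, using the same rpcplugin_config attributes shown in the Foundation Integration section:

```python
from pyvider.rpcplugin.config import rpcplugin_config

# Print the effective rate limiting configuration at startup
print("enabled:", rpcplugin_config.plugin_rate_limit_enabled)
print("requests/second:", rpcplugin_config.plugin_rate_limit_requests_per_second)
print("burst capacity:", rpcplugin_config.plugin_rate_limit_burst_capacity)
```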
3. Client Retry Storms¶
Symptoms: Rate limiting triggers a cascade of retries, making the problem worse
Solution: Implement exponential backoff in client retry logic
import asyncio

import grpc

# Good: Exponential backoff
async def retry_with_backoff(func, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return await func()
        except grpc.aio.AioRpcError as e:
            if e.code() == grpc.StatusCode.RESOURCE_EXHAUSTED and attempt < max_attempts - 1:
                backoff = min(2 ** attempt, 10)  # Cap at 10 seconds
                await asyncio.sleep(backoff)
                continue
            raise
Debugging Rate Limiting¶
Enable debug logging (see the Logging Configuration guide in Next Steps) to see rate limiting decisions. Debug logs will show:
- Token bucket state changes
- Request allow/reject decisions
- Rate limiting configuration updates
- Client connection patterns
Best Practices¶
- Start Conservative: Begin with lower limits and increase based on monitoring
- Monitor Metrics: Track rate limiting events and adjust based on patterns
- Client-Side Handling: Always implement proper error handling for rate limits
- Burst Capacity: Set burst capacity 2-5x higher than sustained rate
- Environment Specific: Use different limits for dev/staging/production
- Document Limits: Make rate limits visible to API consumers
- Graceful Degradation: Implement circuit breakers for persistent rate limiting
- Load Testing: Test rate limiting under realistic load conditions
Next Steps¶
- Production Setup - Complete production configuration guide
- Logging Configuration - Observability and monitoring setup
- Environment Variables - Complete configuration reference
- API Reference - Auto-generated API documentation