# Rate Limits
The Balchemy API enforces rate limits at three levels: per-IP for global HTTP traffic, per-user for AI request quotas, and per-user for fast (own-key) AI requests. All limits use Redis INCR + EXPIRE counters with a fail-open policy — if Redis is unavailable, traffic is allowed through rather than blocked.
## Mechanism
Rate limiting is implemented as a NestJS guard applied globally to all API routes. The algorithm is a fixed window counter:
- On each request, the guard increments a Redis key scoped to IP + route.
- On the first increment, a TTL is set for the window duration.
- If the counter exceeds the limit, the guard throws 429 Too Many Requests.
- If Redis is unreachable, the guard fails open (the request is allowed).
Rate limit state persists in Redis only. Restarting the backend does not reset counters.
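The steps above can be sketched as a small limiter class. This is an illustrative sketch, not the actual guard's code: the `FixedWindowLimiter` and `RedisLike` names are assumptions, and a minimal Redis-like interface stands in for the real client.

```typescript
// Minimal Redis surface the sketch needs (INCR + EXPIRE).
interface RedisLike {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<void>;
}

class FixedWindowLimiter {
  constructor(
    private redis: RedisLike,
    private windowMs: number,
    private max: number,
  ) {}

  /** Returns true if the request is allowed. Fails open on Redis errors. */
  async allow(ip: string, route: string): Promise<boolean> {
    const key = `rl:${ip}:${route}`; // key format is illustrative
    try {
      const count = await this.redis.incr(key);
      if (count === 1) {
        // First hit in this window: start the TTL clock.
        await this.redis.expire(key, Math.ceil(this.windowMs / 1000));
      }
      return count <= this.max;
    } catch {
      // Fail open: if Redis is unreachable, let the request through.
      return true;
    }
  }
}
```

Because the counter lives only in Redis, restarting the process that holds a `FixedWindowLimiter` instance has no effect on in-flight windows.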
## Global API rate limit (per IP)
Applied to all HTTP API routes except health endpoints (/api/nest/health/*, /health).
| Parameter | Value |
|---|---|
| Window | 15 minutes |
| Max requests | 100 per window |
| Key | IP address + route scope |
| Configurable | Yes — via GlobalSettings.rateLimitConfig in MongoDB |
The limit policy can be updated at runtime by an admin without a server restart. The guard polls MongoDB every 30 seconds and caches the active policy.
### Override per endpoint
Individual endpoints can override the global limit using the @RateLimit() decorator:
```typescript
@RateLimit({ windowMs: 60_000, max: 10, message: "Slow down." })
@Post("/some-route")
async sensitiveEndpoint() { ... }
```

Endpoints annotated with @SkipRateLimit() bypass the guard entirely.
## AI usage quota (per user)
Applied to AI message processing — ask_bot, trade_command, and the web chat endpoint.
### Platform API keys (daily quota)
When users route AI requests through Balchemy's API keys:
| Parameter | Value |
|---|---|
| Window | 24 hours (rolling from first request) |
| Redis key | rl:daily:<userId> |
| Limit | Configured per user via calculateUserLimits() |
| Unlimited | Limit value of -1 means no cap |
| Fail-open | true — Redis outage allows all traffic |
### Own API keys (monthly quota)
When users bring their own LLM API keys:
| Parameter | Value |
|---|---|
| Window | 30 days (rolling from first request) |
| Redis key | rl:monthly:<userId> |
| Limit | Configured per user via calculateUserLimits() |
| Disabled | Limit value of 0 means feature not provisioned |
| Unlimited | Limit value of -1 means no cap |
| Fail-open | true |
AI quotas are not AI-model-specific — they count request events, not token usage.
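The sentinel values from the two tables above (-1 = unlimited, 0 = not provisioned for the own-key quota) can be captured in a single decision function. This is a sketch; the function and parameter names are illustrative, not the actual calculateUserLimits() API.

```typescript
type QuotaResult = "allowed" | "quota_exceeded" | "not_provisioned";

// limit: the per-user cap from the tables above.
// usedInWindow: request events already counted in the rolling window.
function checkAiQuota(limit: number, usedInWindow: number): QuotaResult {
  if (limit === 0) return "not_provisioned"; // own-key feature not enabled
  if (limit === -1) return "allowed";        // no cap
  return usedInWindow < limit ? "allowed" : "quota_exceeded";
}
```

Note the order of the checks matters: the 0 and -1 sentinels must be tested before the numeric comparison, since 0 used against a limit of 0 would otherwise read as "exceeded" rather than "not provisioned".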
## MCP rate limit
MCP tool calls go through the global per-IP guard and additionally respect the user's AI quota when ask_bot or trade_command is called.
There is no separate per-bot MCP rate limit distinct from the above. High-throughput integrations should distribute load across time rather than burst.
## Rate limit response headers
When the global IP guard is active, every response includes:
| Header | Value |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed in the window |
| X-RateLimit-Remaining | Requests remaining in the current window |
| X-RateLimit-Reset | Unix timestamp (seconds) when the window resets |
| Retry-After | Seconds to wait before retrying (only on 429 responses) |
Example response headers when approaching the limit:

```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 3
X-RateLimit-Reset: 1742389200
```

Example response headers on a 429:

```
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1742389200
Retry-After: 47
Content-Type: application/json
```

```json
{
  "statusCode": 429,
  "message": "Too many requests from this IP, please try again later."
}
```

## Rate limits by endpoint category
| Category | Window | Limit | Notes |
|---|---|---|---|
| All API routes (default) | 15 min | 100 / IP | Configurable via GlobalSettings |
| Health endpoints | — | Exempt | /api/nest/health/*, /health |
| AI requests (platform keys) | 24 h | Per-user limit | Fails open if Redis unavailable |
| AI requests (own keys) | 30 days | Per-user limit | 0 = feature disabled |
| MCP tool calls | Inherits global | 100 / 15 min / IP | Plus AI quota on LLM-backed tools |
## Quota exceeded behavior
When a user's AI daily quota is exhausted, the API does not return a 429. Instead, the message handler returns a 200 response with a user-friendly message:
```
You have reached your daily request limit. Your quota resets in 4 hours.
```
When the IP rate limit is exhausted, the API returns a strict 429 Too Many Requests with Retry-After.
## Best practices
**Respect Retry-After.** When you receive a 429, always read the Retry-After header and wait at least that many seconds before retrying. The SDK does this automatically.
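For clients not using the SDK, the wait can be derived from the headers documented above. The helper below is an illustrative sketch (its name and fallback behavior are assumptions); it uses the standard fetch Headers type.

```typescript
// Compute how long to wait (ms) after a 429, preferring Retry-After and
// falling back to the window reset timestamp.
function waitMsFor429(
  headers: Headers,
  nowSec = Math.floor(Date.now() / 1000),
): number {
  const retryAfter = headers.get("Retry-After");
  if (retryAfter !== null) return Number(retryAfter) * 1000;
  // Fall back to X-RateLimit-Reset if Retry-After is absent.
  const reset = headers.get("X-RateLimit-Reset");
  if (reset !== null) return Math.max(0, (Number(reset) - nowSec) * 1000);
  return 1000; // conservative default when neither header is present
}
```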
**Use exponential backoff.** For programmatic retries, use exponential backoff with jitter rather than fixed delays. The balchemy-agent-sdk ships a retry utility:

```typescript
import { withRetry } from "@balchemy/agent-sdk";

const result = await withRetry(() => mcp.agentExecute({ instruction: "..." }), {
  maxAttempts: 3,
  baseDelayMs: 200,
  maxDelayMs: 5000,
  jitter: true,
});
```

**Spread load.** If your agent processes many tokens in parallel, introduce a delay between calls to avoid IP rate limit bursts. Batching reads into a single multi-token call (e.g. trading_market_dexscreener_tokens) is more efficient than calling single-token tools in a loop.
**Monitor X-RateLimit-Remaining.** For production agents, log the remaining header and emit an alert before it reaches 0, so you can throttle proactively rather than receive 429s.
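One simple form of such a check is a threshold on remaining capacity as a fraction of the limit. This helper is a sketch, not part of the SDK:

```typescript
// Flag when remaining capacity drops to or below a fraction of the limit,
// so the caller can throttle before hitting 429s.
function shouldThrottle(headers: Headers, threshold = 0.1): boolean {
  const limit = Number(headers.get("X-RateLimit-Limit"));
  const remaining = Number(headers.get("X-RateLimit-Remaining"));
  if (!Number.isFinite(limit) || !Number.isFinite(remaining) || limit <= 0) {
    return false; // headers absent or malformed: nothing to act on
  }
  return remaining / limit <= threshold;
}
```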
**Cache read results.** Market data tools (dexscreener_pairs, geckoterminal_pool_details, etc.) return data that is valid for seconds to minutes. Cache the response locally rather than re-calling within the same processing cycle.
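A minimal in-memory TTL cache is enough for this; the class below is an illustrative sketch (not an SDK utility), with the TTL chosen by the caller to match how long the market data stays useful.

```typescript
// Tiny TTL cache: entries expire ttlMs after being set.
class TtlCache<T> {
  private store = new Map<string, { value: T; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(key: string, now = Date.now()): T | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      this.store.delete(key); // lazy eviction on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T, now = Date.now()): void {
    this.store.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

A cache miss means the tool should be called again; a hit within the TTL avoids both the network round trip and a count against the IP rate limit.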
## WebSocket rate limit
WebSocket connections (Telegram/Discord bot gateway) go through a separate WS guard with a per-connection message rate limit. The limit is configurable and enforced independently of the HTTP guard.