Resilience
Resilience
Section titled “Resilience”discord-mcp wraps every Discord REST call in a composite resilience policy: bulkhead → circuit breaker → 429-aware retry → timeout → REST. Each layer is configurable via environment variables. This page explains what the defaults are, when to override them, and how the layers interact.
The policy chain
Section titled “The policy chain”graph LR A[Tool] --> B[Bulkhead] B --> C[Circuit Breaker] C --> D[Retry] D --> E[Timeout] E --> F[Discord REST]Read top-down: a call is admitted by the bulkhead, then checked against the circuit, then dispatched through the retry policy (which honors 429s and times out individual attempts).
If you stack the layers in a different mental model, debugging becomes
confusing — keep this picture in mind when reading
packages/mcp-core/src/rest/policy.ts.
Retries are ON by default with exponential backoff + jitter.
| Var | Default | Range | Notes |
|---|---|---|---|
MCP_RETRY_ENABLED | true | bool | Set to literal false to disable. |
MCP_RETRY_MAX_ATTEMPTS | 3 | 1–10 | Total tries (NOT extra retries). 3 means: 1 attempt + up to 2 retries. |
MCP_RETRY_BASE_DELAY_MS | 200 | 50–5000 | Starting backoff. |
MCP_RETRY_MAX_DELAY_MS | 10000 | 500–60000 | Cap for exponential growth. |
MCP_RETRY_JITTER | full | none / full / decorrelated | full ≈ AWS recommendation. |
What retries: 5xx, network errors, request timeouts, and explicit-Discord transient codes.
What does NOT retry: 4xx client errors (bad input, missing permissions, not-found). These bubble straight up as structured errors so the agent can correct itself rather than burn rate-limit headroom on a request that will never succeed.
When to override
Section titled “When to override”- Aggressive dev loop (low latency, fail fast):
MCP_RETRY_MAX_ATTEMPTS=1MCP_RETRY_BASE_DELAY_MS=50. No retries means the first failure surfaces immediately. - Conservative prod:
MCP_RETRY_MAX_ATTEMPTS=4MCP_RETRY_BASE_DELAY_MS=500MCP_RETRY_MAX_DELAY_MS=20000. Higher attempt count + longer cap protects against extended Discord outages without thrashing. - Long-running bulk operations (e.g. bulk-ban, archive sweep):
consider raising
MCP_TIMEOUT_LONG_MSrather than retry count — the operation that’s already in flight is more valuable than starting over.
429 (rate limit) handling
Section titled “429 (rate limit) handling”Discord rate-limit responses (HTTP 429) carry a retry_after header (in
seconds, may be fractional). The resilience pipeline honors it directly: when
a 429 fires, the next retry waits at least retry_after seconds (plus
jitter) before re-issuing.
Two scopes:
- Per-route: routes in Discord are bucketed by major parameters (e.g.
/channels/:id/messagesis bucketed bychannel_id). Hitting the per-route limit pauses only that route. - Global:
X-RateLimit-Global: trueis treated the same way — pause and respectretry_after. Global hits are rare and almost always indicate a bug (e.g. tight loop without yielding).
Both surface a mcp.tool.errors increment with error_code: "rate_limited"
if all retries are exhausted; a single 429 followed by a successful retry is
invisible at the audit/error layer (counted in metrics as a normal call).
There is no separate env var for 429 behavior — it’s gated by
MCP_RETRY_ENABLED and bounded by MCP_RETRY_MAX_ATTEMPTS. Setting attempts
to 1 turns 429 retry off (the call surfaces immediately).
Timeouts
Section titled “Timeouts”Two budgets:
| Var | Default | Range | Description |
|---|---|---|---|
MCP_TIMEOUT_DEFAULT_MS | 30000 | 1000–120000 | Per-call ceiling for the standard REST path. |
MCP_TIMEOUT_LONG_MS | 60000 | 1000–300000 | For tools annotated as long-running (bulk/sweep ops). |
Timeout fires AFTER retry — i.e. each individual attempt can take up to the
timeout, then the policy retries (until MAX_ATTEMPTS is reached). The total
wall-clock budget for a tool call is roughly
MAX_ATTEMPTS * TIMEOUT_MS + sum(backoffs). With defaults: ~95 seconds worst
case for the standard path.
Recommendation:
- Dev:
MCP_TIMEOUT_DEFAULT_MS=10000to surface hangs quickly. - Prod: keep defaults unless you’ve measured Discord P99 latency for your specific routes.
Circuit breaker
Section titled “Circuit breaker”Once a route has failed MCP_CIRCUIT_FAILURE_THRESHOLD times in a sliding
window, the circuit opens and fast-rejects subsequent calls without
hitting Discord. After MCP_CIRCUIT_HALF_OPEN_AFTER_MS, the breaker enters
half-open: the next call probes the upstream; if it succeeds the circuit
closes, otherwise it re-opens.
| Var | Default | Range | Notes |
|---|---|---|---|
MCP_CIRCUIT_ENABLED | true | bool | Set literal false to disable. |
MCP_CIRCUIT_FAILURE_THRESHOLD | 10 | 3–100 | Failures before opening. |
MCP_CIRCUIT_HALF_OPEN_AFTER_MS | 60000 | 5000–600000 | Recovery probe delay. |
When to override:
- Disable in unit/integration tests where you want failures to bubble
immediately:
MCP_CIRCUIT_ENABLED=false. - High-volume prod: bump
MCP_CIRCUIT_FAILURE_THRESHOLD=25to reduce noise from sporadic transients. - Aggressive recovery: lower
MCP_CIRCUIT_HALF_OPEN_AFTER_MS=15000if you’re confident upstream issues clear quickly.
The breaker is per-route, not global. A flapping /messages endpoint
won’t trip /channels.
Bulkhead (concurrency limit)
Section titled “Bulkhead (concurrency limit)”The bulkhead semaphore caps in-flight Discord REST calls across all
tools. When the limit is reached, new calls fast-reject with error_code: "bulkhead_saturated" rather than queueing — head-of-line blocking is worse
than a clear “back off” signal.
| Var | Default | Range | Notes |
|---|---|---|---|
MCP_BULKHEAD_LIMIT | 100 | 1–1000 | Max concurrent in-flight REST calls. |
Sizing: Discord allows ~50 requests/second per bot to most routes. A bulkhead of 100 is generous and rarely hits in practice unless the agent is making bulk parallel tool calls. Tighten to 20–30 if you want a clear early signal of “agent is over-parallelizing”.
Pipeline + bulkhead interaction
Section titled “Pipeline + bulkhead interaction”The mcp_pipeline meta-tool composes other tools serially or in parallel.
Each leaf tool that hits Discord acquires a bulkhead slot. If your
pipeline fans out to 50 children that each issue 2 REST calls, you can
briefly exceed 100 in-flight REST calls; the bulkhead will kick in and the
over-spilled calls return bulkhead_saturated, which the pipeline propagates
back up as a partial-result with the failed children flagged.
Bottom line: pipelines do NOT amplify the bulkhead — they share it. This is
intentional. If you raise MCP_BULKHEAD_LIMIT, you’re also raising the
pipeline’s effective fan-out before saturation.
See Architecture → Pipeline for the recursion guard and interpolation syntax.
Reference table
Section titled “Reference table”| Layer | Default behavior | Disable how |
|---|---|---|
| Retry | 3 attempts, exponential w/ full jitter | MCP_RETRY_ENABLED=false or MCP_RETRY_MAX_ATTEMPTS=1 |
| 429 retry-after | Honored, capped by retry budget | (gated by retry) |
| Timeout | 30s default, 60s long | (no toggle — set ms to max range) |
| Circuit | 10 failures → open 60s | MCP_CIRCUIT_ENABLED=false |
| Bulkhead | 100 in-flight, fast-reject | MCP_BULKHEAD_LIMIT=1000 (effectively off for normal load) |
Related
Section titled “Related”- Telemetry → resilience metrics — circuit transitions, bulkhead rejections, dead-letter counts.
- Architecture → Rate limits — Discord’s global vs per-route buckets,
@discordjs/restinteraction. - Architecture → Error handling — how Cockatiel errors map to MCP error codes.
- Source:
packages/mcp-core/src/rest/policy.ts,resilient.ts.