Resilience

discord-mcp wraps every Discord REST call in a composite resilience policy: bulkhead → circuit breaker → 429-aware retry → timeout → REST. Each layer is configurable via environment variables. This page explains what the defaults are, when to override them, and how the layers interact.

graph LR
A[Tool] --> B[Bulkhead]
B --> C[Circuit Breaker]
C --> D[Retry]
D --> E[Timeout]
E --> F[Discord REST]

Read top-down: a call is admitted by the bulkhead, then checked against the circuit, then dispatched through the retry policy (which honors 429s and times out individual attempts).

If you picture the layers in a different order, debugging gets confusing, so keep this picture in mind when reading packages/mcp-core/src/rest/policy.ts.
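To make the ordering concrete, here is a minimal sketch of how wrapper layers can be composed so that the first one listed is the outermost. The names `composeLayers` and `tracingLayer` are hypothetical and are not taken from policy.ts; the layers here only record their names, and in practice the wrapped call would return a Promise.

```typescript
// A call is any thunk; in practice T would be a Promise of a REST response.
type Call<T> = () => T;
// A layer wraps a call and returns a call with the same result type.
type Layer = <T>(next: Call<T>) => Call<T>;

// Compose layers so that the first argument is the outermost wrapper.
function composeLayers(...layers: Layer[]): Layer {
  return <T>(call: Call<T>): Call<T> =>
    layers.reduceRight<Call<T>>((next, layer) => layer(next), call);
}

// Hypothetical stand-in for a real layer: records its name, then delegates.
function tracingLayer(label: string, sink: string[]): Layer {
  return <T>(next: Call<T>): Call<T> => () => {
    sink.push(label);
    return next();
  };
}

const visited: string[] = [];
const resiliencePolicy = composeLayers(
  tracingLayer("bulkhead", visited),
  tracingLayer("circuit", visited),
  tracingLayer("retry", visited),
  tracingLayer("timeout", visited),
);

// The innermost thunk stands in for the Discord REST request.
const guardedCall = resiliencePolicy(() => {
  visited.push("rest");
  return "ok";
});
const callResult = guardedCall();
```

After the call, `visited` lists the layers in the same order as the diagram, ending with the REST request.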

Retries are ON by default with exponential backoff + jitter.

| Var | Default | Range | Notes |
| --- | --- | --- | --- |
| MCP_RETRY_ENABLED | true | bool | Set to literal false to disable. |
| MCP_RETRY_MAX_ATTEMPTS | 3 | 1–10 | Total tries (NOT extra retries). 3 means: 1 attempt + up to 2 retries. |
| MCP_RETRY_BASE_DELAY_MS | 200 | 50–5000 | Starting backoff. |
| MCP_RETRY_MAX_DELAY_MS | 10000 | 500–60000 | Cap for exponential growth. |
| MCP_RETRY_JITTER | full | none / full / decorrelated | full ≈ AWS recommendation. |
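As a rough illustration of how the three jitter modes differ, the sketch below follows the commonly cited AWS formulas for exponential backoff. The function name `backoffDelayMs`, the option names, and the exact doubling schedule are assumptions for illustration; the real policy may compute delays differently.

```typescript
type Jitter = "none" | "full" | "decorrelated";

interface BackoffOptions {
  baseDelayMs: number; // MCP_RETRY_BASE_DELAY_MS
  maxDelayMs: number;  // MCP_RETRY_MAX_DELAY_MS
  jitter: Jitter;      // MCP_RETRY_JITTER
}

// retryIndex is 0 for the wait before the first retry, 1 before the second, etc.
// prevDelayMs is only used by the "decorrelated" mode.
function backoffDelayMs(
  retryIndex: number,
  prevDelayMs: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  // Assumed schedule: base * 2^retryIndex, capped at the max delay.
  const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** retryIndex);
  switch (opts.jitter) {
    case "none":
      return ceiling;
    case "full":
      // Uniformly random between 0 and the exponential ceiling.
      return random() * ceiling;
    case "decorrelated": {
      // Uniformly random between base and 3x the previous delay, then capped.
      const upper = Math.max(opts.baseDelayMs, prevDelayMs * 3);
      const candidate = opts.baseDelayMs + random() * (upper - opts.baseDelayMs);
      return Math.min(opts.maxDelayMs, candidate);
    }
  }
}

const defaultBackoff: BackoffOptions = { baseDelayMs: 200, maxDelayMs: 10000, jitter: "none" };
```

The `random` parameter is injectable so the delays can be checked deterministically.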

What is retried: 5xx responses, network errors, request timeouts, HTTP 429 rate-limit responses, and Discord error codes explicitly marked as transient.

What is NOT retried: all other 4xx client errors (bad input, missing permissions, not-found). These bubble straight up as structured errors so the agent can correct itself rather than burn rate-limit headroom on a request that will never succeed.
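The retry decision can be pictured as a small classifier. This is only a sketch: the `Failure` shape and `isRetryable` are hypothetical names, and the set of transient Discord codes is left as a parameter because this page does not list them.

```typescript
// Hypothetical description of a failed attempt.
type Failure =
  | { kind: "network" }
  | { kind: "timeout" }
  | { kind: "http"; status: number; discordCode?: number };

// transientCodes is an assumption: the Discord error codes treated as transient.
function isRetryable(
  failure: Failure,
  transientCodes: ReadonlySet<number> = new Set<number>(),
): boolean {
  // Network errors and per-attempt timeouts are always retried.
  if (failure.kind !== "http") return true;
  // Server-side errors are retried.
  if (failure.status >= 500) return true;
  // Rate limits are retried after the server-provided delay.
  if (failure.status === 429) return true;
  // Explicitly transient Discord codes are retried.
  if (failure.discordCode !== undefined && transientCodes.has(failure.discordCode)) return true;
  // Everything else (bad input, missing permissions, not-found) surfaces immediately.
  return false;
}
```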

  • Aggressive dev loop (low latency, fail fast): MCP_RETRY_MAX_ATTEMPTS=1 MCP_RETRY_BASE_DELAY_MS=50. No retries means the first failure surfaces immediately.
  • Conservative prod: MCP_RETRY_MAX_ATTEMPTS=4 MCP_RETRY_BASE_DELAY_MS=500 MCP_RETRY_MAX_DELAY_MS=20000. Higher attempt count + longer cap protects against extended Discord outages without thrashing.
  • Long-running bulk operations (e.g. bulk-ban, archive sweep): consider raising MCP_TIMEOUT_LONG_MS rather than retry count — the operation that’s already in flight is more valuable than starting over.
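The presets above are plain environment variables. The sketch below shows one way such values might be read and range-checked; `readIntEnv` and `flagEnabled` are hypothetical helpers, and clamping out-of-range values (rather than rejecting them) is an assumption, not documented behavior.

```typescript
type Env = Record<string, string | undefined>;

// A flag stays enabled unless the variable is the literal string "false".
function flagEnabled(env: Env, key: string): boolean {
  return env[key] !== "false";
}

// Parse a non-negative integer, fall back on missing/invalid input,
// and clamp into [min, max] (clamping is an assumption).
function readIntEnv(env: Env, key: string, fallback: number, min: number, max: number): number {
  const raw = env[key];
  if (raw === undefined || !/^\d+$/.test(raw)) return fallback;
  const parsed = Number.parseInt(raw, 10);
  return Math.min(max, Math.max(min, parsed));
}

// The "conservative prod" preset, expressed as an env-like object.
const conservativeProd: Env = {
  MCP_RETRY_MAX_ATTEMPTS: "4",
  MCP_RETRY_BASE_DELAY_MS: "500",
  MCP_RETRY_MAX_DELAY_MS: "20000",
};

const retryConfig = {
  enabled: flagEnabled(conservativeProd, "MCP_RETRY_ENABLED"),
  maxAttempts: readIntEnv(conservativeProd, "MCP_RETRY_MAX_ATTEMPTS", 3, 1, 10),
  baseDelayMs: readIntEnv(conservativeProd, "MCP_RETRY_BASE_DELAY_MS", 200, 50, 5000),
  maxDelayMs: readIntEnv(conservativeProd, "MCP_RETRY_MAX_DELAY_MS", 10000, 500, 60000),
};
```

In a real process the `Env` argument would typically be `process.env`.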

Discord rate-limit responses (HTTP 429) carry a retry_after value in the JSON body (in seconds, may be fractional) alongside a Retry-After header. The resilience pipeline honors it directly: when a 429 fires, the next retry waits at least retry_after seconds (plus jitter) before re-issuing.

Two scopes:

  • Per-route: routes in Discord are bucketed by major parameters (e.g. /channels/:id/messages is bucketed by channel_id). Hitting the per-route limit pauses only that route.
  • Global: X-RateLimit-Global: true is treated the same way — pause and respect retry_after. Global hits are rare and almost always indicate a bug (e.g. tight loop without yielding).

Both surface an mcp.tool.errors increment with error_code: "rate_limited" if all retries are exhausted; a single 429 followed by a successful retry is invisible at the audit/error layer (counted in metrics as a normal call).

There is no separate env var for 429 behavior — it’s gated by MCP_RETRY_ENABLED and bounded by MCP_RETRY_MAX_ATTEMPTS. Setting attempts to 1 turns 429 retry off (the call surfaces immediately).
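A minimal sketch of the 429 handling described above, assuming the response headers are available as a lowercase-keyed record and the JSON body has already been parsed. `parseRateLimit`, `waitBeforeRetryMs`, and the `maxJitterMs` parameter are hypothetical names for illustration.

```typescript
interface RateLimitBody {
  retry_after?: number; // seconds, may be fractional
  global?: boolean;
}

interface RateLimitInfo {
  retryAfterMs: number;
  global: boolean;
}

// Prefer the body value; fall back to the Retry-After header, then to 0.
function parseRateLimit(
  headers: Record<string, string | undefined>,
  body: RateLimitBody,
): RateLimitInfo {
  const headerSeconds = Number(headers["retry-after"]);
  const seconds = body.retry_after ?? (Number.isFinite(headerSeconds) ? headerSeconds : 0);
  return {
    // Round up so the wait is never shorter than the server asked for.
    retryAfterMs: Math.ceil(seconds * 1000),
    global: body.global ?? (headers["x-ratelimit-global"] === "true"),
  };
}

// Wait at least retry_after, plus a random jitter of up to maxJitterMs.
function waitBeforeRetryMs(
  info: RateLimitInfo,
  maxJitterMs: number,
  random: () => number = Math.random,
): number {
  return info.retryAfterMs + Math.floor(random() * maxJitterMs);
}
```

Per-route and global limits take the same path here; only the `global` flag differs, which a caller could use to pause all routes rather than one.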

Two budgets:

| Var | Default | Range | Description |
| --- | --- | --- | --- |
| MCP_TIMEOUT_DEFAULT_MS | 30000 | 1000–120000 | Per-call ceiling for the standard REST path. |
| MCP_TIMEOUT_LONG_MS | 60000 | 1000–300000 | For tools annotated as long-running (bulk/sweep ops). |

The timeout sits inside the retry layer, so it applies to each individual attempt rather than to the whole call: an attempt can take up to the timeout, then the policy retries (until MAX_ATTEMPTS is reached). The total wall-clock budget for a tool call is roughly MAX_ATTEMPTS * TIMEOUT_MS + sum(backoffs). With defaults that is 3 × 30 s plus two sub-second backoffs, so just over 90 seconds worst case for the standard path.
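The budget formula can be worked through in a few lines. This sketch assumes the backoff before retry i is base * 2^i capped at the max delay, with no jitter reduction and no 429 waits; `worstCaseBudgetMs` is a hypothetical helper, not part of the codebase.

```typescript
// Upper bound on wall-clock time: every attempt times out, every backoff is maximal.
function worstCaseBudgetMs(
  maxAttempts: number,
  timeoutMs: number,
  baseDelayMs: number,
  maxDelayMs: number,
): number {
  let total = maxAttempts * timeoutMs;
  // There is one backoff between each pair of consecutive attempts.
  for (let retryIndex = 0; retryIndex < maxAttempts - 1; retryIndex++) {
    total += Math.min(maxDelayMs, baseDelayMs * 2 ** retryIndex);
  }
  return total;
}

// Defaults: 3 attempts, 30 s timeout, 200 ms base delay, 10 s delay cap.
const defaultBudgetMs = worstCaseBudgetMs(3, 30000, 200, 10000);
```

Under these assumptions the defaults come to 90 600 ms (90 000 ms of timeouts plus 200 ms and 400 ms of backoff). A different backoff schedule, or a 429 with a long retry_after, would change the total.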

Recommendation:

  • Dev: MCP_TIMEOUT_DEFAULT_MS=10000 to surface hangs quickly.
  • Prod: keep defaults unless you’ve measured Discord P99 latency for your specific routes.

Once a route has failed MCP_CIRCUIT_FAILURE_THRESHOLD times in a sliding window, the circuit opens and fast-rejects subsequent calls without hitting Discord. After MCP_CIRCUIT_HALF_OPEN_AFTER_MS, the breaker enters half-open: the next call probes the upstream; if it succeeds the circuit closes, otherwise it re-opens.

| Var | Default | Range | Notes |
| --- | --- | --- | --- |
| MCP_CIRCUIT_ENABLED | true | bool | Set literal false to disable. |
| MCP_CIRCUIT_FAILURE_THRESHOLD | 10 | 3–100 | Failures before opening. |
| MCP_CIRCUIT_HALF_OPEN_AFTER_MS | 60000 | 5000–600000 | Recovery probe delay. |

When to override:

  • Disable in unit/integration tests where you want failures to bubble immediately: MCP_CIRCUIT_ENABLED=false.
  • High-volume prod: bump MCP_CIRCUIT_FAILURE_THRESHOLD=25 to reduce noise from sporadic transients.
  • Aggressive recovery: lower MCP_CIRCUIT_HALF_OPEN_AFTER_MS=15000 if you’re confident upstream issues clear quickly.

The breaker is per-route, not global. A flapping /messages endpoint won’t trip /channels.
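The closed / open / half-open cycle can be sketched as a small state machine with an injected clock. This is a simplification: it counts failures since the last success instead of using a sliding window, and `RouteBreaker` and `breakerFor` are hypothetical names rather than the actual implementation.

```typescript
type CircuitState = "closed" | "open" | "half-open";

class RouteBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly threshold: number;       // MCP_CIRCUIT_FAILURE_THRESHOLD
  private readonly halfOpenAfterMs: number; // MCP_CIRCUIT_HALF_OPEN_AFTER_MS

  constructor(threshold = 10, halfOpenAfterMs = 60000) {
    this.threshold = threshold;
    this.halfOpenAfterMs = halfOpenAfterMs;
  }

  // Returns true if the call may go to Discord, false to fast-reject.
  allow(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.halfOpenAfterMs) {
      this.state = "half-open";
      this.probeInFlight = false;
    }
    if (this.state === "closed") return true;
    if (this.state === "half-open" && !this.probeInFlight) {
      this.probeInFlight = true; // exactly one probe is admitted
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failures = 0;
    this.probeInFlight = false;
  }

  recordFailure(now: number): void {
    this.failures += 1;
    // A failed probe re-opens immediately; otherwise open at the threshold.
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = now;
      this.probeInFlight = false;
    }
  }
}

// One breaker per route key, so a failing route does not trip the others.
const breakers = new Map<string, RouteBreaker>();
function breakerFor(route: string): RouteBreaker {
  let breaker = breakers.get(route);
  if (!breaker) {
    breaker = new RouteBreaker();
    breakers.set(route, breaker);
  }
  return breaker;
}
```

Passing `now` explicitly keeps the sketch deterministic; a real breaker would read the clock itself.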

The bulkhead semaphore caps in-flight Discord REST calls across all tools. When the limit is reached, new calls fast-reject with error_code: "bulkhead_saturated" rather than queueing — head-of-line blocking is worse than a clear “back off” signal.

| Var | Default | Range | Notes |
| --- | --- | --- | --- |
| MCP_BULKHEAD_LIMIT | 100 | 1–1000 | Max concurrent in-flight REST calls. |

Sizing: Discord's global rate limit is 50 requests per second per bot. A bulkhead of 100 is generous and is rarely hit in practice unless the agent is making bulk parallel tool calls. Tighten to 20–30 if you want a clear early signal of “agent is over-parallelizing”.
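The fast-reject behavior can be sketched as a counting semaphore that never queues. The class and function names below (`Bulkhead`, `BulkheadSaturatedError`, `withBulkhead`) are hypothetical; only the `bulkhead_saturated` error code comes from this page.

```typescript
class BulkheadSaturatedError extends Error {
  readonly errorCode = "bulkhead_saturated";
}

class Bulkhead {
  private inFlight = 0;
  private readonly limit: number; // MCP_BULKHEAD_LIMIT

  constructor(limit = 100) {
    this.limit = limit;
  }

  // Take a slot if one is free; never waits.
  tryAcquire(): boolean {
    if (this.inFlight >= this.limit) return false;
    this.inFlight += 1;
    return true;
  }

  // Give the slot back once the call has settled.
  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }
}

// Run a call inside the bulkhead, rejecting immediately when it is full.
async function withBulkhead<T>(bulkhead: Bulkhead, call: () => Promise<T>): Promise<T> {
  if (!bulkhead.tryAcquire()) {
    throw new BulkheadSaturatedError("too many in-flight Discord REST calls");
  }
  try {
    return await call();
  } finally {
    bulkhead.release();
  }
}
```

Because a rejected call never holds a slot, callers get the “back off” signal immediately instead of waiting behind earlier requests.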

The mcp_pipeline meta-tool composes other tools serially or in parallel. Each leaf tool that hits Discord acquires a bulkhead slot. If your pipeline fans out to 50 children that each issue 2 REST calls, those calls alone can fill all 100 slots; any further concurrent call is rejected with bulkhead_saturated, which the pipeline propagates back up as a partial result with the failed children flagged.

Bottom line: pipelines do NOT amplify the bulkhead — they share it. This is intentional. If you raise MCP_BULKHEAD_LIMIT, you’re also raising the pipeline’s effective fan-out before saturation.
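As a worked example of the shared limit, the sketch below counts how many calls are admitted when every child issues its calls at the same instant and none has completed yet. `simulateFanOut` is a hypothetical helper for illustration and ignores timing entirely.

```typescript
interface FanOutResult {
  admitted: number; // calls that got a bulkhead slot
  rejected: number; // calls that would return bulkhead_saturated
}

// Worst-case instant: all calls arrive before any slot is released.
function simulateFanOut(
  children: number,
  callsPerChild: number,
  bulkheadLimit: number,
): FanOutResult {
  let inFlight = 0;
  let rejected = 0;
  for (let i = 0; i < children * callsPerChild; i++) {
    if (inFlight < bulkheadLimit) {
      inFlight += 1;
    } else {
      rejected += 1;
    }
  }
  return { admitted: inFlight, rejected };
}

const atLimit = simulateFanOut(50, 2, 100);    // { admitted: 100, rejected: 0 }
const overLimit = simulateFanOut(60, 2, 100);  // { admitted: 100, rejected: 20 }
const raisedLimit = simulateFanOut(60, 2, 200); // { admitted: 120, rejected: 0 }
```

The third case shows the effect of raising MCP_BULKHEAD_LIMIT: the same pipeline now fans out fully before saturation.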

See Architecture → Pipeline for the recursion guard and interpolation syntax.

| Layer | Default behavior | Disable how |
| --- | --- | --- |
| Retry | 3 attempts, exponential w/ full jitter | MCP_RETRY_ENABLED=false or MCP_RETRY_MAX_ATTEMPTS=1 |
| 429 retry-after | Honored, capped by retry budget | (gated by retry) |
| Timeout | 30s default, 60s long | (no toggle; set ms to max range) |
| Circuit | 10 failures → open 60s | MCP_CIRCUIT_ENABLED=false |
| Bulkhead | 100 in-flight, fast-reject | MCP_BULKHEAD_LIMIT=1000 (effectively off for normal load) |