Resilience

discord-mcp wraps every Discord REST call in a composite resilience policy: bulkhead → circuit breaker → 429-aware retry → timeout → REST. Each layer is configurable via environment variables. This page explains what the defaults are, when to override them, and how the layers interact.

The policy chain

graph LR
  A[Tool] --> B[Bulkhead]
  B --> C[Circuit Breaker]
  C --> D[Retry]
  D --> E[Timeout]
  E --> F[Discord REST]

Read top-down: a call is admitted by the bulkhead, then checked against the circuit, then dispatched through the retry policy (which honors 429s and times out individual attempts).

If you stack the layers in a different mental model, debugging becomes confusing — keep this picture in mind when reading packages/mcp-core/src/rest/policy.ts.

Retry

Retries are ON by default with exponential backoff + jitter.

Var	Default	Range	Notes
`MCP_RETRY_ENABLED`	`true`	bool	Set to literal `false` to disable.
`MCP_RETRY_MAX_ATTEMPTS`	`3`	1–10	Total tries (NOT extra retries). `3` means: 1 attempt + up to 2 retries.
`MCP_RETRY_BASE_DELAY_MS`	`200`	50–5000	Starting backoff.
`MCP_RETRY_MAX_DELAY_MS`	`10000`	500–60000	Cap for exponential growth.
`MCP_RETRY_JITTER`	`full`	`none` / `full` / `decorrelated`	`full` ≈ AWS recommendation.

What retries: 5xx, network errors, request timeouts, and explicit-Discord transient codes.

What does NOT retry: 4xx client errors (bad input, missing permissions, not-found). These bubble straight up as structured errors so the agent can correct itself rather than burn rate-limit headroom on a request that will never succeed.

When to override

Aggressive dev loop (low latency, fail fast): MCP_RETRY_MAX_ATTEMPTS=1 MCP_RETRY_BASE_DELAY_MS=50. No retries means the first failure surfaces immediately.
Conservative prod: MCP_RETRY_MAX_ATTEMPTS=4 MCP_RETRY_BASE_DELAY_MS=500 MCP_RETRY_MAX_DELAY_MS=20000. Higher attempt count + longer cap protects against extended Discord outages without thrashing.
Long-running bulk operations (e.g. bulk-ban, archive sweep): consider raising MCP_TIMEOUT_LONG_MS rather than retry count — the operation that’s already in flight is more valuable than starting over.

429 (rate limit) handling

Discord rate-limit responses (HTTP 429) carry a retry_after header (in seconds, may be fractional). The resilience pipeline honors it directly: when a 429 fires, the next retry waits at least retry_after seconds (plus jitter) before re-issuing.

Two scopes:

Per-route: routes in Discord are bucketed by major parameters (e.g. /channels/:id/messages is bucketed by channel_id). Hitting the per-route limit pauses only that route.
Global: X-RateLimit-Global: true is treated the same way — pause and respect retry_after. Global hits are rare and almost always indicate a bug (e.g. tight loop without yielding).

Both surface a mcp.tool.errors increment with error_code: "rate_limited" if all retries are exhausted; a single 429 followed by a successful retry is invisible at the audit/error layer (counted in metrics as a normal call).

There is no separate env var for 429 behavior — it’s gated by MCP_RETRY_ENABLED and bounded by MCP_RETRY_MAX_ATTEMPTS. Setting attempts to 1 turns 429 retry off (the call surfaces immediately).

Timeouts

Two budgets:

Var	Default	Range	Description
`MCP_TIMEOUT_DEFAULT_MS`	`30000`	1000–120000	Per-call ceiling for the standard REST path.
`MCP_TIMEOUT_LONG_MS`	`60000`	1000–300000	For tools annotated as `long-running` (bulk/sweep ops).

Timeout fires AFTER retry — i.e. each individual attempt can take up to the timeout, then the policy retries (until MAX_ATTEMPTS is reached). The total wall-clock budget for a tool call is roughly MAX_ATTEMPTS * TIMEOUT_MS + sum(backoffs). With defaults: ~95 seconds worst case for the standard path.

Recommendation:

Dev: MCP_TIMEOUT_DEFAULT_MS=10000 to surface hangs quickly.
Prod: keep defaults unless you’ve measured Discord P99 latency for your specific routes.

Circuit breaker

Once a route has failed MCP_CIRCUIT_FAILURE_THRESHOLD times in a sliding window, the circuit opens and fast-rejects subsequent calls without hitting Discord. After MCP_CIRCUIT_HALF_OPEN_AFTER_MS, the breaker enters half-open: the next call probes the upstream; if it succeeds the circuit closes, otherwise it re-opens.

Var	Default	Range	Notes
`MCP_CIRCUIT_ENABLED`	`true`	bool	Set literal `false` to disable.
`MCP_CIRCUIT_FAILURE_THRESHOLD`	`10`	3–100	Failures before opening.
`MCP_CIRCUIT_HALF_OPEN_AFTER_MS`	`60000`	5000–600000	Recovery probe delay.

When to override:

Disable in unit/integration tests where you want failures to bubble immediately: MCP_CIRCUIT_ENABLED=false.
High-volume prod: bump MCP_CIRCUIT_FAILURE_THRESHOLD=25 to reduce noise from sporadic transients.
Aggressive recovery: lower MCP_CIRCUIT_HALF_OPEN_AFTER_MS=15000 if you’re confident upstream issues clear quickly.

The breaker is per-route, not global. A flapping /messages endpoint won’t trip /channels.

Bulkhead (concurrency limit)

The bulkhead semaphore caps in-flight Discord REST calls across all tools. When the limit is reached, new calls fast-reject with error_code: "bulkhead_saturated" rather than queueing — head-of-line blocking is worse than a clear “back off” signal.

Var	Default	Range	Notes
`MCP_BULKHEAD_LIMIT`	`100`	1–1000	Max concurrent in-flight REST calls.

Sizing: Discord allows ~50 requests/second per bot to most routes. A bulkhead of 100 is generous and rarely hits in practice unless the agent is making bulk parallel tool calls. Tighten to 20–30 if you want a clear early signal of “agent is over-parallelizing”.

Pipeline + bulkhead interaction

The mcp_pipeline meta-tool composes other tools serially or in parallel. Each leaf tool that hits Discord acquires a bulkhead slot. If your pipeline fans out to 50 children that each issue 2 REST calls, you can briefly exceed 100 in-flight REST calls; the bulkhead will kick in and the over-spilled calls return bulkhead_saturated, which the pipeline propagates back up as a partial-result with the failed children flagged.

Bottom line: pipelines do NOT amplify the bulkhead — they share it. This is intentional. If you raise MCP_BULKHEAD_LIMIT, you’re also raising the pipeline’s effective fan-out before saturation.

See Architecture → Pipeline for the recursion guard and interpolation syntax.

Reference table

Layer	Default behavior	Disable how
Retry	3 attempts, exponential w/ full jitter	`MCP_RETRY_ENABLED=false` or `MCP_RETRY_MAX_ATTEMPTS=1`
429 retry-after	Honored, capped by retry budget	(gated by retry)
Timeout	30s default, 60s long	(no toggle — set ms to max range)
Circuit	10 failures → open 60s	`MCP_CIRCUIT_ENABLED=false`
Bulkhead	100 in-flight, fast-reject	`MCP_BULKHEAD_LIMIT=1000` (effectively off for normal load)

Telemetry → resilience metrics — circuit transitions, bulkhead rejections, dead-letter counts.
Architecture → Rate limits — Discord’s global vs per-route buckets, @discordjs/rest interaction.
Architecture → Error handling — how Cockatiel errors map to MCP error codes.
Source: packages/mcp-core/src/rest/policy.ts, resilient.ts.

Resilience

Resilience

The policy chain

Retry

When to override

429 (rate limit) handling

Timeouts

Circuit breaker

Bulkhead (concurrency limit)

Pipeline + bulkhead interaction

Reference table

Related