Webhook delivery failures are the most expensive bug in any agreement automation stack, because you don't find out about them until a customer or auditor does. This piece breaks down exactly how Docusign Connect behaves when something goes wrong, why those failures go unnoticed in most setups, and the specific configuration that fixes it.
The mechanism: Docusign Connect retry policy #
Docusign Connect is the webhook layer inside Docusign. When an envelope changes state, a Docusign Workflow Builder run transitions, or a template is instantiated, Connect fires an HTTP POST to the URL you configured for that event type.
When that POST fails - your endpoint returns 5xx, times out, or the TCP connection can't be established - Connect queues a retry.
The retry interval starts at 15 seconds and grows with exponential backoff. By attempt 20 or so, retries are hours apart. By attempt 40+, they're a day or more apart. After 45 attempts or 7 days - whichever comes first - Docusign marks the event permanently failed and stops trying.
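The shape of that schedule can be sketched in a few lines. Docusign does not publish the exact intervals, so the `base_seconds`, `factor`, and daily `cap` below are assumptions chosen to match the behavior described above (15-second start, roughly daily by the 40th attempt, hard stop at 45 attempts or 7 days):

```python
from datetime import timedelta

def retry_schedule(base_seconds=15, factor=2.0, max_attempts=45,
                   cap=timedelta(days=1), window=timedelta(days=7)):
    """Illustrative model of Connect's exponential backoff - parameters
    are assumptions, not documented values."""
    schedule, elapsed = [], timedelta(0)
    for attempt in range(1, max_attempts + 1):
        delay = min(timedelta(seconds=base_seconds * factor ** (attempt - 1)), cap)
        elapsed += delay
        if elapsed > window:   # the 7-day window can end retries early
            break
        schedule.append((attempt, delay, elapsed))
    return schedule
```

Running this model shows why the tail matters: once the delay hits the daily cap, a handful of remaining attempts are all that stand between a transient outage and a permanently failed event.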
A permanently failed event is not re-fired by a downstream state change. If an envelope completed and Connect couldn't tell you, you won't hear about it when that envelope later gets archived either. The state transition is gone.
Why these failures go unnoticed #
Three converging defaults:
1. Connect logging is off by default. On a new Docusign account, Connect writes very little to the log. Silent failures leave no trace. Most teams don't realize this until they're hunting a missing envelope.
2. The log retains only the last 100 entries. Even with logging on, a busy account rotates through 100 log entries fast. By the time you realize a webhook failed last Tuesday, its log entry has been overwritten.
3. "Require Acknowledgement" is off by default. Without it, Docusign considers delivery successful the moment the TCP connection completes. Your endpoint can return 500, crash mid-request, or drop the connection - Docusign marks it delivered and moves on.
The combined effect: a silent failure happens, nothing writes to the log (or the log entry ages out), no retry fires because Docusign thinks it succeeded, and the first person to notice is an operations manager who spots, a week later, that the agreement count doesn't match the CRM count.
The fix: three settings and a monitor #
Setting 1 — Enable Connect logging #
Navigate to Settings → Connect → [your configuration] → Enable logging.
This turns on the feature that actually records attempts. Enable it on every Connect configuration you have.
Setting 2 — Require Acknowledgement #
Same configuration page: Require Acknowledgement.
With this enabled, Docusign expects your endpoint to return HTTP 200 within 100 seconds. Non-200 (or no response) registers as a failure and triggers the retry policy. This is the difference between "Docusign thinks it worked" and "Docusign will actually verify it worked."
Setting 3 — Enable HMAC signing #
Same page: HMAC Signing → Generate new key.
Docusign will sign every outgoing payload with HMAC-SHA256 using the secret. Your endpoint must verify the signature header before processing. Unsigned or invalid payloads should be rejected with 401, logged, and investigated - they mean someone is probing your endpoint with fake data.
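Verification is a few lines with the standard library. The header name below (`X-DocuSign-Signature-1`) matches Docusign's current documentation, but confirm it for your account; the comparison uses `compare_digest` to avoid timing side channels:

```python
import base64
import hashlib
import hmac

def verify_connect_signature(raw_body: bytes, secret: str, signature: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare it
    to the value Docusign sends in the X-DocuSign-Signature-1 header.
    Verify against the *raw* bytes - re-serializing parsed JSON first
    will change the body and break the signature."""
    expected = base64.b64encode(
        hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).digest()
    ).decode("ascii")
    return hmac.compare_digest(expected, signature)
```

If this returns False, respond 401 and raise an alert rather than processing the payload.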
The external monitor #
The three settings above tell Docusign to report failures. They don't tell you about failures. For that, you need an external system that queries Connect's failure logs on an interval and alerts when new entries appear, or (better) a relay layer that acts as Connect's endpoint, re-verifies delivery to the ultimate destination, and surfaces failures to a dashboard.
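A minimal polling monitor is mostly a diff loop. The sketch below assumes the eSignature REST API's `GET .../connect/failures` endpoint and a `failureId` field on each entry - verify both against your API version before relying on them:

```python
import json
import urllib.request

def fetch_connect_failures(base_uri: str, account_id: str, token: str) -> list[dict]:
    """Pull the current Connect failure log. Path and response field
    names are assumptions based on the eSignature REST API docs."""
    req = urllib.request.Request(
        f"{base_uri}/v2.1/accounts/{account_id}/connect/failures",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp).get("failures", [])

def unseen_failures(current: list[dict], seen: set) -> list[dict]:
    """Diff this poll against previous polls; alert on whatever is new.
    Persist `seen` between runs, since the log itself rotates."""
    fresh = [f for f in current if f.get("failureId") not in seen]
    seen.update(f.get("failureId") for f in current)
    return fresh
```

Run it on a short interval (minutes, not hours) - the 100-entry log rotation means a slow poller can miss entries outright.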
Common failure shapes and their fixes #
Endpoint intermittent 5xx. Usually the downstream service (your CRM, your internal API) is rate-limited or deploying. The retry policy handles this if your endpoint correctly returns 5xx. The fix: make sure your endpoint propagates the downstream error as a 5xx rather than absorbing it into a 200. A Zapier "Catch Hook" that always returns 200 is the classic anti-pattern.
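The propagation rule is small but easy to get backwards. A sketch of the status mapping - deliberately coarse, on the assumption that any downstream non-2xx during a deploy or rate-limit window is retryable:

```python
def status_for_connect(downstream_status: int) -> int:
    """Decide what to return to Connect after forwarding the event.
    Absorbing a downstream 503 into a 200 tells Connect the event
    landed when it didn't, which cancels the retry policy."""
    if 200 <= downstream_status < 300:
        return 200
    # Propagate trouble as a 5xx so Connect's retry policy engages.
    return 503
```
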
Endpoint timeout. Docusign's timeout is 100 seconds (with Require Acknowledgement). If your endpoint does heavy processing inline - calling multiple downstream APIs, doing database writes, enriching payloads - you risk timing out. The fix: return 200 immediately, queue the work for async processing.
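The ack-then-queue pattern looks like this in miniature (the `process` callable stands in for your slow pipeline - downstream API calls, database writes - and is hypothetical here):

```python
import queue
import threading

events = queue.Queue()

def handle_connect_post(raw_body: bytes) -> int:
    """Acknowledge inside Docusign's 100-second window and defer the
    slow work. Returning 200 is this function's only inline job."""
    events.put(raw_body)
    return 200

def worker(process):
    """Background consumer for the queued payloads."""
    while True:
        body = events.get()
        try:
            process(body)  # your slow pipeline goes here
        finally:
            events.task_done()

# Start with: threading.Thread(target=worker, args=(my_process,), daemon=True).start()
```

In production you would back this with a durable queue (Redis, SQS, a database table) rather than an in-process one, so a crash between ack and processing doesn't lose the event.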
DNS or TLS failure. Usually a certificate that expired and nobody renewed. Retry policy won't save you if your certificate is expired for 7 days. The fix: monitor certificate expiry separately.
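Certificate expiry is cheap to monitor from the standard library - perform a real handshake, read `notAfter`, and alert below a threshold:

```python
import socket
import ssl
from datetime import datetime, timezone

def cert_not_after(host: str, port: int = 443) -> str:
    """Fetch the peer certificate's notAfter string via a real TLS
    handshake, e.g. 'Jan  1 00:00:00 2031 GMT'."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

def days_remaining(not_after: str, now: datetime) -> float:
    """Convert the OpenSSL-style timestamp into days left."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expires.replace(tzinfo=timezone.utc) - now).total_seconds() / 86400
```

Alert when `days_remaining` drops below your renewal lead time (14-30 days is a common choice), not at zero.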
Payload rejected as invalid. Schema drift on Docusign's end (rare but happens). Your endpoint expects a completedDateTime field, Docusign ships an update that renames it. Retry policy won't save you because every retry hits the same bug. The fix: parse defensively - tolerate unknown or renamed fields, return 200 so the event isn't permanently lost, and log the mismatch as an alert-worthy warning so the parser gets fixed.
Signature verification failure. Secret got rotated on Docusign's side (or yours) and the two no longer match. Every payload looks invalid. The fix: HMAC verification failures should alert immediately, not silently 401 and move on.
Testing your setup #
You can't wait for a real failure to verify your monitoring works. Two approaches:
- Synthetic failure test. Temporarily point a Connect configuration at an endpoint that returns 500. Verify: Connect log shows the failure, your external monitor alerts, retry counter increments on Connect. Re-enable the real endpoint.
- Chaos testing. On a staging Docusign account, randomly return 500 from your endpoint 10% of the time. Verify end-to-end delivery still completes via retries, and your monitor quietly records the retried failures.
Neither test should produce surprises if the monitoring is wired correctly - they just exercise the failure path before a real incident does.
Republishing failed events #
When something did permanently fail and you fix the root cause, you need to re-fire the missed events.
UI path: Docusign Settings → Connect → Failures → select entries → Republish.
API path: POST /accounts/{accountId}/connect/failures/retry_queue with an array of failure IDs.
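A sketch of the API path with the standard library. The request body shape and verb here follow the path above but are assumptions - confirm both against the eSignature API reference for your version before using this:

```python
import json
import urllib.request

def republish_payload(failure_ids: list[str]) -> bytes:
    """Request body for the retry-queue call (shape assumed)."""
    return json.dumps({"failures": failure_ids}).encode("utf-8")

def republish(base_uri: str, account_id: str, token: str,
              failure_ids: list[str]) -> int:
    """Re-queue permanently failed Connect events for redelivery."""
    req = urllib.request.Request(
        f"{base_uri}/v2.1/accounts/{account_id}/connect/failures/retry_queue",
        data=republish_payload(failure_ids),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status
```
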
Both paths work for any event type. The pain is that republishing is per-envelope, and the failure log holds only 100 entries - if more than 100 failures have accrued since the incident, the oldest events are gone entirely, and you have to reconstruct them by diffing Docusign's state against your CRM's state.
Baton isn't a replacement for Connect monitoring - it solves a different problem (reliable Maestro triggering from source-platform webhooks), and its Action log covers the verify + Maestro-trigger pipeline specifically, not Connect redelivery. Treat Connect monitoring and Maestro-trigger monitoring as two separate disciplines: Connect's log + retries for your system's inbound delivery, Baton's Action log for the Maestro-triggering path.
Closing thought #
Docusign Connect's retry policy is generous and well-designed. It's not the problem. The problem is that Docusign's defaults assume you have an external system that watches for permanent failures - and most teams don't, because the need isn't obvious until the first time they miss a signed envelope for a week.
The three settings above take 10 minutes to enable. Every Docusign Connect configuration should have them on. After that, decide whether you want to build the external monitor yourself or use something like Baton that does it by default.