Day-2 Operations
Monitoring, troubleshooting, scaling, and on-call response.
Health checks
| Endpoint | Returns |
|---|---|
| GET /health | 200 · { status: "healthy", timestamp, version } |
| GET /ready | 200 · { ready: true } (no DB hit) |
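An external uptime monitor can validate the /health payload with a small type guard. This is a minimal sketch: the field names come from the table above, but their types are assumptions (timestamp as string or number, version as string), and `isHealthy` is a hypothetical helper name rather than part of the service.

```typescript
// Shape of the /health body as listed in the table above.
// Assumption: timestamp is a string or a number; version is a string.
interface HealthBody {
  status: string;
  timestamp: string | number;
  version: string;
}

// Hypothetical helper: true only for a well-formed payload reporting "healthy".
function isHealthy(body: unknown): body is HealthBody {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  const hasTimestamp =
    typeof b.timestamp === "string" || typeof b.timestamp === "number";
  return b.status === "healthy" && hasTimestamp && typeof b.version === "string";
}
```

A monitor would treat anything other than a 200 response whose parsed body passes this check as a failed probe.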
Logs
wrangler tail inboxos --format pretty
wrangler tail inboxos --status error
wrangler tail inboxos --method POST
Every line is structured JSON with a requestId — same id surfaced in the response envelope.
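Because every line carries a requestId, one request can be traced by filtering captured log lines on that field. The sketch below assumes the lines have already been saved one JSON object per line. Only `requestId` comes from this page; `linesForRequest` and the `msg` field in the usage note are hypothetical.

```typescript
// Assumption: each log line is one JSON object with a string requestId;
// all other fields are passed through untouched.
interface LogLine {
  requestId: string;
  [field: string]: unknown;
}

// Hypothetical helper: keep only the parsed lines that belong to one request.
function linesForRequest(rawLines: string[], requestId: string): LogLine[] {
  const matches: LogLine[] = [];
  for (const raw of rawLines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // skip anything that is not valid JSON
    }
    if (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as Record<string, unknown>).requestId === requestId
    ) {
      matches.push(parsed as LogLine);
    }
  }
  return matches;
}
```

Given the id from a failing response envelope, `linesForRequest(lines, id)` returns that request's entries in the order they were logged.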
Cron triggers
| Cron | Job |
|---|---|
| 17 * * * * | Idempotency purge + scheduled-email flush |
| 30 4 * * * | Daily analytics rollup |
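A Worker's scheduled handler receives the cron expression that fired, so one handler can serve both schedules by dispatching on that string. A minimal sketch of the lookup follows. The expressions are the ones in the table above; the job names are hypothetical labels, not the project's actual function names.

```typescript
// Cron expressions copied from the table above; job names are hypothetical.
const JOBS_BY_CRON: Record<string, string[]> = {
  "17 * * * *": ["purgeIdempotencyKeys", "flushScheduledEmails"],
  "30 4 * * *": ["rollupDailyAnalytics"],
};

// Returns the jobs to run for the cron string that fired, or [] if unknown.
function jobsForCron(cron: string): string[] {
  return JOBS_BY_CRON[cron] ?? [];
}
```

Returning an empty list for an unrecognised expression keeps a mismatch between wrangler.toml and the code from throwing inside the handler, though logging that case would make the drift visible.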
Rollback
wrangler deployments list   # find a known-good version
wrangler rollback <version-id>
Backup & restore
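D1 has point-in-time recovery built in (Time Travel). The backup subcommands below apply to databases on D1's legacy backend; if wrangler rejects them for your database, use the wrangler d1 time-travel subcommands instead: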
wrangler d1 backup list inboxos
wrangler d1 backup download inboxos <backup-id>
wrangler d1 backup restore inboxos <backup-id>
For R2, set up a cron-triggered Worker that copies the tenants/ prefix to a backup bucket once a day.
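The copy step of such a backup Worker might look like the sketch below. `BucketLike` is a hand-written stand-in for the subset of the R2 bucket binding used here (paginated `list`, `get`, `put`), and it is an assumption that the real binding matches these signatures closely enough. `copyPrefix` is a hypothetical name.

```typescript
// Minimal stand-ins for the parts of an R2 bucket binding used below.
// Assumption: the real binding's list/get/put behave like these signatures.
interface ListedObjects {
  objects: { key: string }[];
  truncated: boolean;
  cursor?: string;
}
interface BucketLike {
  list(options: { prefix: string; cursor?: string }): Promise<ListedObjects>;
  get(key: string): Promise<{ body: unknown } | null>;
  put(key: string, value: unknown): Promise<unknown>;
}

// Copies every object under `prefix` from source to backup, one page at a
// time, and returns the number of objects copied.
async function copyPrefix(
  source: BucketLike,
  backup: BucketLike,
  prefix: string
): Promise<number> {
  let copied = 0;
  let cursor: string | undefined;
  do {
    const options = cursor === undefined ? { prefix } : { prefix, cursor };
    const page = await source.list(options);
    for (const { key } of page.objects) {
      const object = await source.get(key);
      if (object === null) continue; // deleted between list and get
      await backup.put(key, object.body);
      copied++;
    }
    cursor = page.truncated ? page.cursor : undefined;
  } while (cursor !== undefined);
  return copied;
}
```

In a scheduled handler this would be called as `copyPrefix(env.SOURCE, env.BACKUP, "tenants/")`, where both binding names are placeholders. A production version would likely also carry over each object's metadata and skip keys that have not changed since the last run.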
Common failures
"Worker exceeded its memory limit"
Likely cause: oversized attachment via inline base64. Switch to File Cache — upload once, reference by fileId. Mailgrid streams from R2 instead of holding the full file in memory.
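Part of the memory pressure is plain arithmetic: base64 encodes every 3 bytes as 4 characters, and decoding an inline attachment means holding the encoded string and the decoded bytes at the same time. The sketch below estimates that footprint. Both function names are illustrative, and the estimate ignores JSON parsing overhead and any additional copies the runtime makes.

```typescript
// Length in characters of the padded base64 encoding of `byteLength` bytes.
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// Rough lower bound on memory held while decoding an inline attachment:
// the encoded string plus the decoded bytes.
// Assumption: one byte per base64 character; engines may use more.
function inlinePeakBytes(byteLength: number): number {
  return base64Length(byteLength) + byteLength;
}
```

By this estimate a 30 MB attachment sent inline occupies at least 70 MB during decoding, which is why streaming from R2 avoids the limit.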
"SES upstream failed: 403 SignatureDoesNotMatch"
Either AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY is wrong, or AWS_REGION doesn't match the verified SES identity's region.
wrangler secret list            # confirm secrets exist
grep AWS_REGION wrangler.toml   # confirm region
"ses:FromAddress denied"
The IAM policy restricts From: addresses to verified domains. Either verify the new domain in SES, or widen the policy's ses:FromAddress condition to cover the new address.
"DKIM-Signature: bad"
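Likely cause: the DKIM CNAMEs are proxied (orange cloud) in Cloudflare. Toggle them to DNS-only (grey cloud). A proxied record answers with Cloudflare's edge addresses instead of the CNAME target, so receiving servers cannot look up the DKIM public key and the signature check fails.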
"Webhook subscription stuck in pending"
The SNS confirmation POST wasn't received. Check wrangler tail for an entry matching /api/webhooks/ses. If absent, the Worker isn't routing the request — verify DNS and the custom-domain binding.
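When the confirmation POST does arrive, it is an SNS message with `Type` set to `SubscriptionConfirmation` and a `SubscribeURL` that must be fetched to activate the subscription. Whether Mailgrid confirms automatically is not stated here, so treat the sketch below as an illustration of what to look for in the request body. `subscribeUrlFrom` is a hypothetical helper.

```typescript
// Returns the URL to GET in order to confirm the subscription, or null when
// the body is not an SNS SubscriptionConfirmation message.
function subscribeUrlFrom(rawBody: string): string | null {
  let message: unknown;
  try {
    message = JSON.parse(rawBody);
  } catch {
    return null; // not JSON, so not an SNS message
  }
  if (typeof message !== "object" || message === null) return null;
  const m = message as Record<string, unknown>;
  if (m.Type !== "SubscriptionConfirmation") return null;
  const url = m.SubscribeURL;
  // Only follow URLs that point at an SNS regional endpoint.
  const snsHost = /^https:\/\/sns\.[a-z0-9-]+\.amazonaws\.com\//;
  return typeof url === "string" && snsHost.test(url) ? url : null;
}
```

The host check guards against a forged body pointing the Worker at an arbitrary URL. A production handler should also verify the SNS message signature before acting on it.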
SES bounce + complaint thresholds
- Bounce rate > 10% — AWS pauses sending.
- Complaint rate > 0.5% — AWS pauses sending.
- Target under 2% for bounces, under 0.1% for complaints.
Mailgrid automatically suppresses recipients that bounce or complain, using the SNS feedback loop. If your content is the problem and complaints climb quickly, you'll get throttled before suppressions can catch up.
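The thresholds above can be turned into a simple status check for a dashboard or scheduled job. This is a sketch: the numbers are the ones listed in this section, expressed as fractions, while the status labels and function names are invented for illustration.

```typescript
type SesStatus = "ok" | "above-target" | "pause-risk";

// Thresholds from the list above, as fractions (0.10 = 10%).
const BOUNCE_LIMITS = { target: 0.02, pause: 0.1 };
const COMPLAINT_LIMITS = { target: 0.001, pause: 0.005 };

// "ok" while under target, "pause-risk" once past the pause threshold.
function classify(
  rate: number,
  limits: { target: number; pause: number }
): SesStatus {
  if (rate > limits.pause) return "pause-risk";
  if (rate >= limits.target) return "above-target";
  return "ok";
}

// Overall status is the worse of the bounce and complaint classifications.
function sesHealth(bounceRate: number, complaintRate: number): SesStatus {
  const order: SesStatus[] = ["ok", "above-target", "pause-risk"];
  const bounce = classify(bounceRate, BOUNCE_LIMITS);
  const complaint = classify(complaintRate, COMPLAINT_LIMITS);
  return order.indexOf(bounce) >= order.indexOf(complaint) ? bounce : complaint;
}
```

For example, a 1% bounce rate with a 0.6% complaint rate classifies as "pause-risk", because the complaint rate alone is past the pause threshold.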
Incident response
| Severity | Symptom | First action |
|---|---|---|
| SEV-1 | API returning 5xx for >5 min | wrangler tail → find error → wrangler rollback |
| SEV-2 | Send rate dropping | Check SES dashboard for sending paused |
| SEV-3 | Specific tenant high-error | Check rate limit + suppressions table |
| SEV-4 | DKIM flapping | Confirm CNAMEs are grey-cloud (DNS-only) |
Recommended alerts
- Worker error rate > 1% over 5 min → page
- Worker latency p99 > 500 ms over 10 min → page
- Queue DLQ size > 0 → notify
- SES bounce rate > 4% (CloudWatch alarm) → page
- SES complaint rate > 0.08% → page
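The alert list above can be written as data so that thresholds live in one place. The sketch below assumes metrics are already aggregated over the stated windows and passed in as fractions and milliseconds. The `Metrics` field names and rule names are hypothetical; only the thresholds and the page/notify actions come from the list.

```typescript
type Action = "page" | "notify";

// Assumption: values are pre-aggregated over the windows named in the list.
interface Metrics {
  errorRate: number; // fraction of requests failing, 5 min window
  latencyP99Ms: number; // p99 latency in ms, 10 min window
  dlqSize: number; // messages currently in the dead-letter queue
  sesBounceRate: number; // fraction
  sesComplaintRate: number; // fraction
}

interface Rule {
  name: string;
  action: Action;
  fires: (m: Metrics) => boolean;
}

const RULES: Rule[] = [
  { name: "worker-error-rate", action: "page", fires: (m) => m.errorRate > 0.01 },
  { name: "worker-latency-p99", action: "page", fires: (m) => m.latencyP99Ms > 500 },
  { name: "queue-dlq", action: "notify", fires: (m) => m.dlqSize > 0 },
  { name: "ses-bounce-rate", action: "page", fires: (m) => m.sesBounceRate > 0.04 },
  { name: "ses-complaint-rate", action: "page", fires: (m) => m.sesComplaintRate > 0.0008 },
];

// Returns the name and action of every rule whose condition currently holds.
function firingAlerts(m: Metrics): { name: string; action: Action }[] {
  return RULES.filter((r) => r.fires(m)).map(({ name, action }) => ({ name, action }));
}
```

The SES alert thresholds sit below the 10% and 0.5% pause levels, so a page fires while there is still room to react.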