Day-2 Operations

Monitoring, troubleshooting, scaling, and on-call response.

Health checks

Endpoint       Returns
GET /health    200 · { status: "healthy", timestamp, version }
GET /ready     200 · { ready: true } (no DB hit)
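
An external monitor can assert on the two probe responses above rather than on the status code alone. The following is a minimal sketch of such a check. The helper names (isHealthy, isReady) are hypothetical, and because the source does not state the types of timestamp and version, the sketch only checks that those fields are present.

```typescript
// Narrow an unknown JSON value to a plain key/value record.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// True when a GET /health response matches the documented shape.
// Only the presence of timestamp and version is checked (their types are not documented).
function isHealthy(httpStatus: number, body: unknown): boolean {
  return (
    httpStatus === 200 &&
    isRecord(body) &&
    body.status === "healthy" &&
    "timestamp" in body &&
    "version" in body
  );
}

// True when a GET /ready response matches the documented shape.
function isReady(httpStatus: number, body: unknown): boolean {
  return httpStatus === 200 && isRecord(body) && body.ready === true;
}
```

A monitor would fetch each endpoint, parse the JSON body, and pass the status and body to these functions. A 200 with an unexpected body then counts as a failure.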

Logs

wrangler tail
wrangler tail inboxos --format pretty
wrangler tail inboxos --status error
wrangler tail inboxos --method POST

Every log line is structured JSON with a requestId, the same id that is surfaced in the response envelope.
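
Because each log line carries the requestId, a failing response can be traced back to its log entries. The sketch below filters captured log lines by id. It assumes requestId is a top-level key of each JSON line and that the lines have already been extracted from the tail output. The function name entriesForRequest is hypothetical.

```typescript
// Returns the parsed log entries whose requestId matches, skipping lines that are not JSON objects.
function entriesForRequest(lines: string[], requestId: string): Record<string, unknown>[] {
  const matches: Record<string, unknown>[] = [];
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // not JSON, for example a banner or blank line
    }
    if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
      const entry = parsed as Record<string, unknown>;
      // Assumption: requestId sits at the top level of the log object.
      if (entry.requestId === requestId) {
        matches.push(entry);
      }
    }
  }
  return matches;
}
```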

Cron triggers

Cron           Job
17 * * * *     Idempotency purge + scheduled-email flush
30 4 * * *     Daily analytics rollup
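
The first expression fires every hour at minute 17 and the second once a day at 04:30. Cloudflare evaluates cron triggers in UTC. A Worker's scheduled handler receives the expression that fired, so a single handler can dispatch both schedules. The sketch below shows one way to map expressions to jobs. The job identifiers and the jobsForCron helper are hypothetical and are not taken from the codebase.

```typescript
// Hypothetical job identifiers for the two schedules above.
type Job = "idempotency-purge" | "scheduled-email-flush" | "analytics-rollup";

// Keys must match the cron strings configured for the Worker exactly.
const SCHEDULE: Record<string, Job[]> = {
  "17 * * * *": ["idempotency-purge", "scheduled-email-flush"],
  "30 4 * * *": ["analytics-rollup"],
};

// Returns the jobs to run for the cron expression that fired (empty for unknown expressions).
function jobsForCron(cron: string): Job[] {
  return SCHEDULE[cron] ?? [];
}

// In a scheduled handler this would be used roughly as:
//   async scheduled(controller, env, ctx) {
//     for (const job of jobsForCron(controller.cron)) { /* run job */ }
//   }
```

Returning an empty list for an unknown expression keeps a newly added trigger from running the wrong job by accident.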

Rollback

bash
wrangler deployments list           # find a known-good version
wrangler rollback <version-id>

Backup & restore

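D1 has point-in-time recovery built in. The commands below list, download, and restore snapshot backups; on databases where wrangler d1 backup is not available, use D1 Time Travel (wrangler d1 time-travel) instead: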

d1 backup
wrangler d1 backup list inboxos
wrangler d1 backup download inboxos <backup-id>
wrangler d1 backup restore inboxos <backup-id>

For R2, set up a scheduled cron Worker that copies tenants/ to a backup bucket once a day.
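
The copy step of such a backup Worker can be sketched as follows. BucketLike is a hypothetical interface that declares only the list, get, and put calls the sketch needs, modelled on the R2 bucket binding. The copyPrefix name and the generic body type are assumptions for illustration. In a Worker, the scheduled handler would call copyPrefix with the live bucket, the backup bucket, and "tenants/".

```typescript
// Minimal structural subset of an R2-style bucket; B stands in for the object body type.
interface BucketLike<B> {
  list(options: { prefix: string; cursor?: string }): Promise<{
    objects: { key: string }[];
    truncated: boolean;
    cursor?: string;
  }>;
  get(key: string): Promise<{ body: B } | null>;
  put(key: string, value: B): Promise<unknown>;
}

// Copies every object under `prefix` from source to backup and returns the number copied.
async function copyPrefix<B>(
  source: BucketLike<B>,
  backup: BucketLike<B>,
  prefix: string,
): Promise<number> {
  let copied = 0;
  let cursor: string | undefined;
  do {
    // List one page at a time; large prefixes span several pages.
    const page = await source.list(cursor === undefined ? { prefix } : { prefix, cursor });
    for (const { key } of page.objects) {
      const object = await source.get(key);
      if (object === null) continue; // deleted between list and get
      await backup.put(key, object.body);
      copied += 1;
    }
    cursor = page.truncated ? page.cursor : undefined;
  } while (cursor !== undefined);
  return copied;
}
```

Copying objects one at a time keeps memory use flat. For large buckets, a real job may need to spread the work across several invocations to stay within Worker execution limits.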

Common failures

"Worker exceeded its memory limit"

Likely cause: an oversized attachment sent as inline base64. Switch to File Cache: upload the file once, then reference it by fileId. Mailgrid streams the file from R2 instead of holding it in memory in full.

"SES upstream failed: 403 SignatureDoesNotMatch"

Either AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY is wrong, or AWS_REGION doesn't match the verified SES identity's region.

debug
wrangler secret list                       # confirm secrets exist
grep AWS_REGION wrangler.toml              # confirm region

"ses:FromAddress denied"

The IAM policy restricts From: addresses to verified domains. Either verify the new domain in SES, or extend the policy's ses:FromAddress condition to cover the additional addresses.
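
As an illustration of the second option, the sketch below builds a policy document whose ses:FromAddress condition allows any address at a list of domains. The sendPolicyFor helper, the action list, and the wildcard Resource are assumptions. Compare the result with the policy actually attached to the sending user before changing anything.

```typescript
interface PolicyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string;
  Condition: { StringLike: { "ses:FromAddress": string[] } };
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Builds an IAM policy that permits sending only from addresses at the given domains.
function sendPolicyFor(domains: string[]): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["ses:SendEmail", "ses:SendRawEmail"],
        Resource: "*", // assumption: a real policy would usually name the identity ARNs
        Condition: {
          StringLike: { "ses:FromAddress": domains.map((domain) => `*@${domain}`) },
        },
      },
    ],
  };
}
```

Calling JSON.stringify(sendPolicyFor(["example.com", "example.org"]), null, 2) yields JSON that can be pasted into the IAM console for review.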

"DKIM-Signature: bad"

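Likely cause: the DKIM CNAMEs are proxied (orange cloud) in Cloudflare. Toggle them to DNS-only (grey cloud). A proxied record is answered with Cloudflare edge addresses instead of the CNAME target, so receiving servers cannot look up the SES public key and signature verification fails.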

"Webhook subscription stuck in pending"

The SNS confirmation POST wasn't received. Check wrangler tail for an entry matching /api/webhooks/ses. If absent, the Worker isn't routing the request — verify DNS and the custom-domain binding.
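
For context on what that POST contains: SNS sends a JSON body whose Type is "SubscriptionConfirmation" and whose SubscribeURL must be requested to complete the subscription. The sketch below extracts that URL from a raw body. The subscribeUrlFrom name is hypothetical, and the sketch does not verify the SNS signature, which a production handler should do before trusting the message.

```typescript
// Returns the URL to request in order to confirm the subscription,
// or null when the body is not an SNS SubscriptionConfirmation message.
function subscribeUrlFrom(rawBody: string): string | null {
  let message: unknown;
  try {
    message = JSON.parse(rawBody);
  } catch {
    return null; // not JSON at all
  }
  if (typeof message !== "object" || message === null) {
    return null;
  }
  const { Type, SubscribeURL } = message as Record<string, unknown>;
  if (Type !== "SubscriptionConfirmation" || typeof SubscribeURL !== "string") {
    return null;
  }
  return SubscribeURL;
}
```

When replaying a captured confirmation message during debugging, a non-null result confirms that the body is a confirmation request, and requesting the returned URL moves the subscription out of pending.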

SES bounce + complaint thresholds

Mailgrid auto-suppresses bounced and complained recipients via the SNS feedback loop. Suppression only takes effect after the feedback arrives, so if the content itself is the problem and complaints accumulate quickly, SES can throttle or pause sending before suppressions catch up.
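
To make the thresholds concrete, the sketch below classifies bounce and complaint rates against the figures AWS has published for SES: review at a 5% bounce or 0.1% complaint rate, and a possible sending pause at 10% or 0.5%. Treat these numbers as assumptions to verify against current AWS guidance. The function names are hypothetical, and SES computes its own rates over a representative volume rather than over raw counts as done here.

```typescript
type Standing = "ok" | "at-risk" | "pause-likely";

interface Limit {
  warn: number;
  pause: number;
}

// Assumed thresholds, expressed as fractions of mail sent; verify before alerting on them.
const LIMITS: { bounce: Limit; complaint: Limit } = {
  bounce: { warn: 0.05, pause: 0.1 },
  complaint: { warn: 0.001, pause: 0.005 },
};

// Classifies a single rate against its warn and pause limits.
function standing(rate: number, limit: Limit): Standing {
  if (rate >= limit.pause) return "pause-likely";
  if (rate >= limit.warn) return "at-risk";
  return "ok";
}

// Classifies bounce and complaint counts for a window of `sent` messages.
function reputation(
  sent: number,
  bounces: number,
  complaints: number,
): { bounce: Standing; complaint: Standing } {
  if (sent <= 0) {
    return { bounce: "ok", complaint: "ok" }; // nothing sent, nothing to judge
  }
  return {
    bounce: standing(bounces / sent, LIMITS.bounce),
    complaint: standing(complaints / sent, LIMITS.complaint),
  };
}
```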

Incident response

Severity  Symptom                        First action
SEV-1     API returning 5xx for >5 min   wrangler tail → find error → wrangler rollback
SEV-2     Send rate dropping             Check SES dashboard for sending paused
SEV-3     Specific tenant high-error     Check rate limit + suppressions table
SEV-4     DKIM flapping                  Confirm CNAMEs are grey-cloud (DNS-only)

Recommended alerts