Day-2 Operations

Monitoring, troubleshooting, scaling, and on-call response.

Health checks

Endpoint	Returns
`GET /health`	`200 · { status: "healthy", timestamp, version }`
`GET /ready`	`200 · { ready: true }` (no DB hit)
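A minimal sketch of how the two probes could be answered, assuming a hypothetical `probe` helper and a placeholder `VERSION` constant; Mailgrid's actual handler is not shown here, and the ISO timestamp format is an assumption. It illustrates why `/ready` is cheap to poll: it returns without touching the database.

```typescript
// Hypothetical probe helper; names and the timestamp format are assumptions.
interface ProbeResult {
  status: number;
  body: Record<string, unknown>;
}

const VERSION = "0.0.0"; // placeholder; a real build would inject its own version

function probe(path: string, now: Date): ProbeResult {
  if (path === "/health") {
    return {
      status: 200,
      body: { status: "healthy", timestamp: now.toISOString(), version: VERSION },
    };
  }
  if (path === "/ready") {
    // No database query on this path, so the probe stays cheap.
    return { status: 200, body: { ready: true } };
  }
  return { status: 404, body: { error: "not found" } };
}
```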

Logs

wrangler tail

wrangler tail inboxos --format pretty
wrangler tail inboxos --status error
wrangler tail inboxos --method POST

Every log line is structured JSON and carries a requestId, the same id that is returned in the response envelope.
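To make the correlation concrete, here is a small sketch of emitting a log line and matching it back to a response by id. Only `requestId` comes from this guide; `level`, `message`, and the helper names `logLine` and `matchesRequest` are assumptions for illustration.

```typescript
// Sketch of a structured log entry; every field name except requestId is an assumption.
type Level = "debug" | "info" | "warn" | "error";

function logLine(
  requestId: string,
  level: Level,
  message: string,
  fields: Record<string, unknown> = {},
): string {
  // Spread caller fields first so they cannot overwrite the core keys.
  return JSON.stringify({ ...fields, requestId, level, message });
}

// True when a captured log line belongs to the given request.
function matchesRequest(line: string, requestId: string): boolean {
  try {
    const parsed: unknown = JSON.parse(line);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { requestId?: unknown }).requestId === requestId
    );
  } catch {
    return false; // not JSON, so not one of these log lines
  }
}
```

When a caller reports a failed request, take the id from their response envelope and filter the captured log lines with `matchesRequest`.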

Cron triggers

Cron	Job
`17 * * * *`	Idempotency purge + scheduled-email flush
`30 4 * * *`	Daily analytics rollup
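One way to route the two schedules above to their jobs is a lookup keyed by the cron expression. This is a sketch under assumptions: `jobsFor` and `runScheduled` are hypothetical names, and the job functions themselves are left to the caller.

```typescript
// Hypothetical dispatcher; only the cron expressions come from the table above.
type Job = () => Promise<void>;

function jobsFor(cron: string, jobs: Record<string, Job[]>): Job[] {
  const matched = jobs[cron];
  if (matched === undefined) {
    // Fail loudly so a schedule added in config but not in code is noticed.
    throw new Error(`No jobs registered for cron "${cron}"`);
  }
  return matched;
}

async function runScheduled(cron: string, jobs: Record<string, Job[]>): Promise<void> {
  // Run sequentially so a failing job is easy to attribute in the logs.
  for (const job of jobsFor(cron, jobs)) {
    await job();
  }
}
```

In a Worker, the `scheduled` handler receives the triggering expression as `controller.cron`, which can be passed straight to `runScheduled`. Cron Triggers are evaluated in UTC, so `30 4 * * *` fires at 04:30 UTC.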

Rollback

wrangler deployments list           # find a known-good version
wrangler rollback <version-id>

Backup & restore

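D1 has point-in-time recovery built in through Time Travel (`wrangler d1 time-travel info inboxos`, then `wrangler d1 time-travel restore inboxos --timestamp=<timestamp>`). The snapshot-style `backup` commands apply to databases on D1's older alpha backend: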

d1 backup

wrangler d1 backup list inboxos
wrangler d1 backup download inboxos <backup-id>
wrangler d1 backup restore inboxos <backup-id>

For R2, set up a cron-triggered Worker that copies the `tenants/` prefix to a backup bucket once a day.
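A sketch of the copy loop such a Worker might run. The `BucketLike` interface is a local stand-in that mirrors the `list` / `get` / `put` shape of an R2 bucket binding, and the dated destination layout from `backupKey` is an assumption, not something this guide prescribes.

```typescript
// Minimal structural types so the sketch is self-contained; an R2 bucket
// binding exposes list/get/put with roughly these shapes.
interface ListResult {
  objects: { key: string }[];
  truncated: boolean;
  cursor?: string;
}
interface BucketLike {
  list(options: { prefix?: string; cursor?: string }): Promise<ListResult>;
  get(key: string): Promise<{ body: unknown } | null>;
  put(key: string, value: unknown): Promise<unknown>;
}

// Destination key: one dated folder per run (assumed layout).
function backupKey(day: Date, key: string): string {
  return `${day.toISOString().slice(0, 10)}/${key}`;
}

async function copyPrefix(
  source: BucketLike,
  backup: BucketLike,
  prefix: string,
  day: Date,
): Promise<number> {
  let copied = 0;
  let cursor: string | undefined;
  do {
    // list() is paginated; keep following the cursor until the last page.
    const page = await source.list({ prefix, cursor });
    for (const { key } of page.objects) {
      const object = await source.get(key);
      if (object === null) continue; // deleted between list and get
      await backup.put(backupKey(day, key), object.body);
      copied += 1;
    }
    cursor = page.truncated ? page.cursor : undefined;
  } while (cursor !== undefined);
  return copied;
}
```

For a large `tenants/` prefix a single invocation may run into Worker subrequest or CPU limits, so a real job would likely need to split the work across invocations or a queue.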

Common failures

"Worker exceeded its memory limit"

Likely cause: an oversized attachment sent as inline base64. Switch to File Cache: upload the file once, then reference it by fileId. Mailgrid then streams the attachment from R2 instead of holding the full file in memory.
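Inline base64 is costly because the encoding grows the payload by about a third before the Worker decodes anything. The sketch below computes the encoded size and applies a cutoff; `INLINE_LIMIT_BYTES` and `shouldUseFileCache` are hypothetical and not Mailgrid settings.

```typescript
// Base64 encodes every 3 input bytes as 4 output characters (padding included).
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Hypothetical cutoff for inline attachments; tune it to your own payload limits.
const INLINE_LIMIT_BYTES = 1024 * 1024;

function shouldUseFileCache(rawBytes: number, limit: number = INLINE_LIMIT_BYTES): boolean {
  return base64Length(rawBytes) > limit;
}
```

A client could run this check before building the request and upload anything over the cutoff through File Cache instead.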

"SES upstream failed: 403 SignatureDoesNotMatch"

Either AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY is wrong, or AWS_REGION doesn't match the verified SES identity's region.

debug

wrangler secret list                       # confirm secrets exist
grep AWS_REGION wrangler.toml              # confirm region

"ses:FromAddress denied"

The IAM policy restricts the From: address to verified domains through its `ses:FromAddress` condition. Either verify the new domain, or extend that condition to cover the new address.
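For reference, a policy of this kind usually pairs the send actions with a `StringLike` condition on `ses:FromAddress`. The builder below is a sketch of that common pattern, not Mailgrid's shipped policy; `sendPolicy` and the `Resource: "*"` scope are assumptions.

```typescript
// Builds an IAM policy document that only allows sending from the listed domains.
interface PolicyDocument {
  Version: "2012-10-17";
  Statement: Array<{
    Effect: "Allow";
    Action: string[];
    Resource: string;
    Condition: { StringLike: { "ses:FromAddress": string[] } };
  }>;
}

function sendPolicy(domains: string[]): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["ses:SendEmail", "ses:SendRawEmail"],
        Resource: "*", // assumption: narrow to identity ARNs where possible
        Condition: {
          // One wildcard pattern per allowed sending domain.
          StringLike: { "ses:FromAddress": domains.map((d) => `*@${d}`) },
        },
      },
    ],
  };
}
```

Adding a domain then means appending it to the list and re-applying the output of `JSON.stringify(sendPolicy([...]), null, 2)`.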

"DKIM-Signature: bad"

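DKIM CNAMEs are proxied (orange cloud) in Cloudflare. Toggle them to DNS-only (grey cloud). When a CNAME is proxied, Cloudflare answers with its own addresses instead of the CNAME target, so receiving servers cannot look up the SES public key and signature verification fails.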

"Webhook subscription stuck in pending"

The SNS confirmation POST wasn't received. Check wrangler tail for an entry matching /api/webhooks/ses. If absent, the Worker isn't routing the request — verify DNS and the custom-domain binding.
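For context, SNS sends a JSON envelope whose `Type` field is `SubscriptionConfirmation` for the initial handshake and `Notification` afterwards; the subscription leaves the pending state once the envelope's `SubscribeURL` has been fetched. The sketch below classifies an incoming body on that basis. `classifySns` is a hypothetical helper, and Mailgrid's own webhook handler is not shown in this guide.

```typescript
type SnsAction =
  | { kind: "confirm"; subscribeUrl: string }
  | { kind: "notification"; message: string }
  | { kind: "ignore" };

function classifySns(rawBody: string): SnsAction {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return { kind: "ignore" }; // not JSON, so not an SNS envelope
  }
  if (typeof parsed !== "object" || parsed === null) return { kind: "ignore" };
  const envelope = parsed as { Type?: unknown; SubscribeURL?: unknown; Message?: unknown };

  if (envelope.Type === "SubscriptionConfirmation" && typeof envelope.SubscribeURL === "string") {
    // The caller confirms the subscription by issuing a GET to this URL.
    return { kind: "confirm", subscribeUrl: envelope.SubscribeURL };
  }
  if (envelope.Type === "Notification" && typeof envelope.Message === "string") {
    return { kind: "notification", message: envelope.Message };
  }
  return { kind: "ignore" };
}
```

A production handler should also verify the SNS message signature before acting on either branch; that step is omitted here for brevity.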

SES bounce + complaint thresholds

Bounce rate > 10% — AWS pauses sending.
Complaint rate > 0.5% — AWS pauses sending.
Target under 2% for bounces, under 0.1% for complaints.

Mailgrid auto-suppresses bounced/complained recipients via the SNS feedback loop. If your content itself is the problem and complaints spike quickly, you'll get throttled before suppressions can catch up.
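The thresholds above can be turned into a simple check for dashboards or scripts. This sketch treats each rate as events divided by messages sent, which is an approximation of how AWS computes reputation metrics; `standing` and the label names are hypothetical.

```typescript
type Metric = "bounce" | "complaint";
type Standing = "ok" | "above-target" | "pause-risk";

// Fractions taken from the thresholds listed above.
const THRESHOLDS: Record<Metric, { target: number; pause: number }> = {
  bounce: { target: 0.02, pause: 0.1 },
  complaint: { target: 0.001, pause: 0.005 },
};

function standing(metric: Metric, events: number, sent: number): Standing {
  if (sent <= 0) return "ok"; // nothing sent, nothing to measure
  const rate = events / sent;
  const { target, pause } = THRESHOLDS[metric];
  if (rate > pause) return "pause-risk"; // AWS may pause sending at this level
  if (rate >= target) return "above-target"; // still sending, but over the goal
  return "ok";
}
```

For example, 3 bounces in 100 sends is `above-target`, while 6 complaints in 1,000 sends is already `pause-risk`.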

Incident response

Severity	Symptom	First action
SEV-1	API returning 5xx for >5 min	`wrangler tail` → find error → `wrangler rollback`
SEV-2	Send rate dropping	Check the SES dashboard for a sending pause
SEV-3	High error rate for one tenant	Check the tenant's rate limit and the suppressions table
SEV-4	DKIM flapping	Confirm CNAMEs are grey-cloud (DNS-only)

Recommended alerts

Worker error rate > 1% over 5 min → page
Worker latency p99 > 500 ms over 10 min → page
Queue DLQ size > 0 → notify
SES bounce rate > 4% (CloudWatch alarm) → page
SES complaint rate > 0.08% → page
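The two Worker alerts reduce to an error fraction and a 99th-percentile latency over a window of requests. The sketch below shows one way to compute both, using the nearest-rank percentile method; the `Sample` shape and function names are assumptions, and in practice your monitoring platform would evaluate these rules for you.

```typescript
interface Sample {
  ok: boolean;
  latencyMs: number;
}

function errorRate(samples: Sample[]): number {
  if (samples.length === 0) return 0;
  return samples.filter((s) => !s.ok).length / samples.length;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// errorWindow covers the last 5 minutes, latencyWindow the last 10 minutes.
function shouldPage(errorWindow: Sample[], latencyWindow: Sample[]): boolean {
  const p99 = percentile(latencyWindow.map((s) => s.latencyMs), 99);
  return errorRate(errorWindow) > 0.01 || p99 > 500;
}
```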