Operations
Running Mobius Systems in production.
This folder is for operators, SREs, and infrastructure teams.
Contents
Operational documentation (to be populated in Phase 2):
deployment/
- render-deployment.md — Deploying to Render (current platform)
- docker-compose.md — Local development with Docker
- kubernetes.md — Kubernetes deployment (future)
- multi-region.md — Geographic distribution
- canary-releases.md — Safe deployment strategies
monitoring/
- sentinel-health-metrics.md — AI agent monitoring
- mii-tracking.md — Integrity score dashboards
- alerting.md — Alert rules and escalation
- dashboards.md — Grafana/Prometheus setup
- log-aggregation.md — Centralized logging
maintenance/
- backup-restore.md — Data protection procedures
- upgrades.md — Version migration guides
- scaling.md — Horizontal and vertical scaling
- database-management.md — Ledger maintenance
- certificate-renewal.md — TLS certificate management
runbooks/
- service-restart.md — Safely restarting services
- database-recovery.md — Civic Ledger recovery
- network-issues.md — Troubleshooting connectivity
- performance-degradation.md — Response time issues
- disk-space.md — Storage management
Service Overview
Mobius runs as a distributed system with multiple services:
Frontend Services (Ports 3000-3007)
- website-creator (3000) — .gic Website Creator
- aurea-site (3001) — AUREA Founding Agent Site
- portal (3002) — Main portal interface
- hub-web (3004) — OAA Central Hub
- hive-app (3005) — Citizen collaboration
- genesisdome-app (3006) — Genesis Dome PWA
- citizen-shield-app (3007) — Security interface
Backend Services (Ports 4001-4005)
- ledger-api (4001) — Mobius Ledger Core
- indexer-api (4002) — MIC Indexer
- eomm-api (4003) — E.O.M.M. Reflections
- shield-api (4004) — Citizen Shield
- broker-api (4005) — Thought Broker
See FRONTEND_DEVELOPMENT.md for complete port assignments.
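For scripting against a local stack, the port assignments above can be captured as a lookup table. This is a minimal sketch, not project code: the ports are copied from the lists above, while the `localhost` default and the `health_url` helper are assumptions made for illustration.

```python
# Port assignments copied from the frontend and backend lists above.
FRONTEND_PORTS = {
    "website-creator": 3000,
    "aurea-site": 3001,
    "portal": 3002,
    "hub-web": 3004,
    "hive-app": 3005,
    "genesisdome-app": 3006,
    "citizen-shield-app": 3007,
}
BACKEND_PORTS = {
    "ledger-api": 4001,
    "indexer-api": 4002,
    "eomm-api": 4003,
    "shield-api": 4004,
    "broker-api": 4005,
}
ALL_PORTS = {**FRONTEND_PORTS, **BACKEND_PORTS}


def health_url(service: str, host: str = "localhost") -> str:
    """Build the basic health check URL for a service.

    `host` defaults to localhost, which is an assumption for local use.
    Raises KeyError for a service that is not in the table.
    """
    return f"http://{host}:{ALL_PORTS[service]}/healthz"
```

For example, `health_url("ledger-api")` yields `http://localhost:4001/healthz`.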
Health Checks
All services expose standard health endpoints; the Thought Broker additionally exposes its own:
```text
# Basic health check
GET /healthz

# Mobius integrity verification
GET /api/integrity-check

# Thought Broker specific
GET /v1/loop/health
```
Health Check Requirements:
- Response time < 100ms
- HTTP 200 status
- Valid JSON response
- GI score included (must be ≥ 0.95)
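The four requirements above can be checked mechanically once a response has been captured. The sketch below assumes the GI score is a numeric field named `gi` in a JSON object; the actual payload schema is not documented here, so that key name is hypothetical and is exposed as a parameter.

```python
import json

MAX_RESPONSE_MS = 100  # requirement: response time < 100ms
MIN_GI = 0.95          # requirement: GI score >= 0.95


def check_health(status: int, elapsed_ms: float, body: str,
                 gi_field: str = "gi") -> list[str]:
    """Return the list of violated requirements; an empty list means healthy.

    `gi_field` is a hypothetical key name for the GI score.
    """
    problems = []
    if status != 200:
        problems.append(f"status {status} != 200")
    if elapsed_ms >= MAX_RESPONSE_MS:
        problems.append(f"response took {elapsed_ms:.0f}ms (limit {MAX_RESPONSE_MS}ms)")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        problems.append("body is not valid JSON")
        return problems
    gi = payload.get(gi_field) if isinstance(payload, dict) else None
    if isinstance(gi, bool) or not isinstance(gi, (int, float)):
        problems.append("GI score missing")
    elif gi < MIN_GI:
        problems.append(f"GI {gi} below {MIN_GI}")
    return problems
```

Returning a list of violations rather than a single boolean keeps the output usable in an alert message when several requirements fail at once.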
Starting Services
Local Development
```bash
# Using Docker Compose
npm run compose:up

# View logs
docker compose -f infra/docker/compose.yml logs -f

# Stop services
npm run compose:down
```
Production (Render)
Services auto-deploy via GitHub Actions when:
1. PR merged to main
2. CI passes (lint, type-check, tests)
3. Integrity gates pass (MII ≥ 0.95)
4. Changes detected in service path
See infra/render.yaml for service definitions.
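The deployment conditions above amount to a conjunction of four checks. The sketch below assumes all four must hold together, which the list implies but does not state outright. `MergeEvent` is a hypothetical summary structure, not a real GitHub Actions payload, and the service paths in the usage note are illustrative.

```python
from dataclasses import dataclass, field

MII_GATE = 0.95  # integrity gate threshold from the list above


@dataclass
class MergeEvent:
    """Hypothetical summary of a merge, used only for this sketch."""
    merged_to_main: bool
    ci_passed: bool
    mii: float
    changed_paths: list[str] = field(default_factory=list)


def should_deploy(event: MergeEvent, service_path: str) -> bool:
    """Return True only if every deployment condition holds for the service."""
    prefix = service_path.rstrip("/") + "/"
    touches_service = any(p.startswith(prefix) for p in event.changed_paths)
    return (
        event.merged_to_main
        and event.ci_passed
        and event.mii >= MII_GATE
        and touches_service
    )
```

With `changed_paths=["apps/ledger-api/src/index.ts"]` and an otherwise passing event, `should_deploy(event, "apps/ledger-api")` is true while `should_deploy(event, "apps/portal")` is false.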
Monitoring & Alerting
Key Metrics
System Health:
- Service uptime (target: 99.9%)
- Response time (p50, p95, p99)
- Error rate (target: <0.1%)
- CPU/Memory utilization
Integrity Metrics:
- Global Integrity (GI) score
- Mobius Integrity Index (MII)
- Sentinel health scores
- Deliberation success rate
Business Metrics:
- MIC minting rate
- Active citizens
- Proposals processed
- ECHO validations completed
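To make the system health targets concrete, the sketch below shows one way to derive latency percentiles, an error rate, and the downtime allowed by an uptime target. It uses the nearest-rank percentile definition, which is a choice made here for simplicity; monitoring backends may interpolate differently. All function names are illustrative.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def downtime_budget_minutes(target_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given uptime target."""
    return days * 24 * 60 * (1 - target_pct / 100)
```

Under these definitions, a 99.9% uptime target over a 30-day window leaves roughly 43.2 minutes of allowable downtime, and an error rate target of <0.1% means fewer than 1 failed request per 1,000.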
Alert Thresholds
| Condition | Severity | Action |
|---|---|---|
| GI < 0.95 | 🔴 Critical | Halt automation, human review |
| GI < 0.97 | 🟡 Warning | Investigate, sentinel review |
| Service down >5min | 🔴 Critical | Page on-call |
| Response time >1s | 🟡 Warning | Check load, scale if needed |
| Error rate >1% | 🟡 Warning | Review logs, identify cause |
| Disk >80% | 🟡 Warning | Clean logs, expand storage |
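The table above can be read as a set of rules evaluated against a metrics snapshot. The following sketch encodes the rows as written; the `Snapshot` structure and its field names are hypothetical, and it assumes the critical GI row supersedes the warning row when GI is below 0.95.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time metrics for one service."""
    gi: float
    downtime_minutes: float
    response_time_s: float
    error_rate_pct: float
    disk_used_pct: float


def evaluate_alerts(s: Snapshot) -> list[tuple[str, str]]:
    """Return (severity, action) pairs for every threshold the snapshot breaches."""
    alerts = []
    # Assumption: a GI below 0.95 raises only the critical alert, not both.
    if s.gi < 0.95:
        alerts.append(("critical", "Halt automation, human review"))
    elif s.gi < 0.97:
        alerts.append(("warning", "Investigate, sentinel review"))
    if s.downtime_minutes > 5:
        alerts.append(("critical", "Page on-call"))
    if s.response_time_s > 1:
        alerts.append(("warning", "Check load, scale if needed"))
    if s.error_rate_pct > 1:
        alerts.append(("warning", "Review logs, identify cause"))
    if s.disk_used_pct > 80:
        alerts.append(("warning", "Clean logs, expand storage"))
    return alerts
```

A snapshot with GI 0.96 and 85% disk usage, for instance, produces two warnings and no critical alert.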
Scaling Guidelines
Horizontal Scaling
When to scale out:
- CPU consistently >70%
- Response time p95 >500ms
- Queue depth growing
- Multiple concurrent DVA flows

How to scale:
1. Increase replica count in render.yaml
2. Deploy via PR to main
3. Monitor for 24 hours
4. Adjust based on metrics
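The scale-out signals above are stated informally, so any automated check has to pick an interpretation. The sketch below makes these assumptions explicit: "consistently" means every CPU sample in the observation window exceeds 70%, "growing" means the queue depth rose at every step in the window, and "multiple" means more than one concurrent DVA flow. It reports which signals fired and leaves the decision to the operator.

```python
def scale_out_reasons(cpu_samples: list[float], p95_ms: float,
                      queue_depths: list[int],
                      concurrent_dva_flows: int) -> list[str]:
    """List the scale-out signals that are currently firing."""
    reasons = []
    # Assumption: "consistently" = every sample in the window is above 70%.
    if cpu_samples and all(c > 70 for c in cpu_samples):
        reasons.append("cpu consistently >70%")
    if p95_ms > 500:
        reasons.append("p95 response time >500ms")
    # Assumption: "growing" = strictly increasing across the window.
    if len(queue_depths) >= 2 and all(
        later > earlier for earlier, later in zip(queue_depths, queue_depths[1:])
    ):
        reasons.append("queue depth growing")
    # Assumption: "multiple" = more than one flow at the same time.
    if concurrent_dva_flows > 1:
        reasons.append("multiple concurrent DVA flows")
    return reasons
```

An empty result means none of the listed signals applies under these interpretations; it does not by itself prove the current replica count is adequate.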
Vertical Scaling
When to scale up:
- Memory pressure (OOM errors)
- Single-threaded bottlenecks
- Database query performance

How to scale:
1. Update instance type in render.yaml
2. Schedule maintenance window
3. Deploy and monitor
Backup & Recovery
What We Back Up
- Civic Ledger — All attestations, blocks (daily)
- MIC Balances — Integrity credit state (daily)
- Configuration — Service configs, secrets (on change)
- Bio-DNA — User identity manifests (on write)
Backup Schedule
Daily: 03:00 UTC — Full backup
Hourly: :00 — Incremental ledger backup
Weekly: Sunday 00:00 UTC — Archive backup
Monthly: 1st of month — Long-term storage
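The schedule above can be expressed as a function that reports which backups are due at a given moment. This is a sketch only: the job names are made up for illustration, and because the monthly entry gives no time of day, 00:00 UTC is assumed.

```python
from datetime import datetime, timezone


def backups_due(now: datetime) -> list[str]:
    """Return the backup jobs scheduled for the minute containing `now`.

    `now` must be timezone-aware; it is converted to UTC before matching.
    """
    now = now.astimezone(timezone.utc)
    if now.minute != 0:
        return []  # every scheduled job starts on the hour
    due = ["hourly-incremental-ledger"]
    if now.hour == 3:
        due.append("daily-full")
    if now.weekday() == 6 and now.hour == 0:  # Monday is 0, Sunday is 6
        due.append("weekly-archive")
    # Assumption: the monthly job runs at 00:00 UTC on the 1st.
    if now.day == 1 and now.hour == 0:
        due.append("monthly-long-term")
    return due
```

Note that when the 1st of a month falls on a Sunday, the weekly and monthly jobs coincide at 00:00 UTC, which may be worth staggering in a real scheduler.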
Recovery Testing
- Weekly: Restore test to staging
- Monthly: Full disaster recovery drill
- Quarterly: Cross-region failover test
See maintenance/backup-restore.md for procedures.
Incident Response
When things go wrong:
- Detect — Alerts, monitoring, user reports
- Assess — Severity, impact, affected services
- Respond — Follow runbook, engage team
- Communicate — Status updates, transparency
- Resolve — Fix root cause
- Review — Post-mortem, improvements
See ../05-security/incident-response.md for details.
Operational Philosophy
Kaizen (Continuous Improvement)
- Small, frequent improvements over big rewrites
- Metrics-driven decisions
- Blameless post-mortems

Kintsugi (Visible Repairs)
- Document incidents transparently
- Preserve history (git revert, not force-push)
- Learn from cracks in the system

Custodianship (Long-term Stewardship)
- Design for 50-year operation
- Succession planning for ops knowledge
- Comprehensive runbooks
Relationship to Other Sections
- See 02-architecture/ for system design
- See 04-guides/operators/ for operator tutorials
- See 05-security/ for security operations
Cycle C-147 • 2025-11-27
"We heal as we walk."