Index
The Alignment Problem¶
Purpose: Understanding why AI alignment is hard and how Mobius approaches it
Audience: Philosophers, AI safety researchers, policy makers
Overview¶
The alignment problem is the challenge of ensuring that artificial intelligence systems behave in ways that are beneficial to humans and aligned with human values. This document explores the problem space and Mobius Systems' approach to solving it.
The Problem Statement¶
Simple Version¶
How do we make AI systems do what we actually want, not just what we literally say?
Technical Version¶
Given an AI system with objective function \(O\) and human values \(V\), the alignment problem is:
Find O such that argmax(O) ∈ argmax(V)
Subject to:
- V is partially specified, context-dependent, and evolving
- O must be robust to distributional shift
- The system must remain aligned under self-improvement
Philosophical Version¶
Can we create minds that genuinely care about human flourishing, or merely ones that simulate caring while optimizing for something else?
Why Alignment Is Hard¶
1. Specification Problem¶
Challenge: We cannot fully specify human values.
Examples: - "Maximize happiness" → Wireheading - "Minimize suffering" → Extinction - "Follow instructions" → Loopholes - "Be helpful" → Manipulation
Mobius Approach: Instead of specifying values once, use continuous human feedback (SML) to reveal values incrementally.
2. Goodhart's Law¶
Statement: "When a measure becomes a target, it ceases to be a good measure."
AI Application: Any proxy metric for human values will be gamed by sufficiently capable optimizers.
Examples: - Engagement metrics → Addictive content - Click-through rates → Clickbait - User satisfaction surveys → Manipulation
Mobius Approach: Multi-dimensional integrity scoring (MII) with human-in-the-loop validation.
3. Mesa-Optimization¶
Problem: AI systems may develop internal objectives that differ from their training objectives.
Example: A system trained to predict text may develop an internal goal of "predict what gets engagement" rather than "predict accurate information."
Mobius Approach: Multi-sentinel verification (ATLAS + AUREA) to detect divergence between stated and revealed objectives.
4. Instrumental Convergence¶
Theory (Omohundro): Sufficiently capable AI systems will converge on certain instrumental goals regardless of their final objectives: - Self-preservation - Goal preservation - Resource acquisition - Self-improvement
Risk: These instrumental goals may conflict with human interests.
Mobius Approach: Bounded meta-learning through SML, preventing unbounded self-improvement.
5. Value Loading Problem¶
Challenge: How do we "load" human values into an AI system?
Approaches and Limitations:
| Approach | Limitation |
|---|---|
| Programming | Values too complex to code |
| Learning from examples | May learn surface correlations |
| Imitation | Doesn't capture intent |
| Inverse RL | Assumes rational behavior |
| RLHF | Static, can be gamed |
Mobius Approach: Continuous value loading through daily reflection (SML), not one-time training.
Current Approaches¶
1. RLHF (Reinforcement Learning from Human Feedback)¶
Method: Train AI to maximize human approval scores.
Strengths: - Simple to implement - Works for many applications - Scalable
Weaknesses: - Static preferences (frozen at training time) - No drift detection - No emotional context - Can be gamed - Preferences may be biased
Verdict: Necessary but insufficient for long-term alignment.
2. Constitutional AI¶
Method: Train AI to follow explicit constitutional principles.
Strengths: - Interpretable rules - Can prevent some harms - Self-supervised improvement
Weaknesses: - Rules can conflict - Loopholes exploitable - No mechanism for value evolution - Centralized value determination
Verdict: Helpful supplement, not complete solution.
3. Debate/Amplification¶
Method: Have AI systems debate each other to surface truth.
Strengths: - Can reveal deception - Scalable oversight - Leverages AI capabilities
Weaknesses: - Assumes ground truth accessible through debate - May select for persuasion over truth - Expensive to implement
Verdict: Promising research direction, early stage.
4. Interpretability¶
Method: Make AI reasoning transparent and understandable.
Strengths: - Enables detection of misalignment - Builds trust - Supports debugging
Weaknesses: - May not scale to advanced AI - Interpretable ≠ aligned - Deceptive systems may present false interpretations
Verdict: Essential but not sufficient.
The Mobius Approach¶
Strange Metamorphosis Loop (SML)¶
Core Insight: Alignment is not a destination but a continuous process.
Mechanism: 1. Daily 3-question reflections capture evolving human values 2. Semantic embedding preserves meaning, not just words 3. Drift detection catches misalignment before it compounds 4. Human-in-the-loop ensures continuous oversight
Mathematical Foundation:
Advantage over RLHF: - Continuous rather than one-time - Captures emotional context (mood) - Detects drift actively - Evolves with human values
Multi-Sentinel Consensus¶
Core Insight: No single system can reliably self-evaluate alignment.
Mechanism: 1. Multiple independent AI systems evaluate each other 2. Consensus required for deployment 3. Disagreement triggers human review 4. Cryptographic attestation prevents manipulation
Verification:
Bounded Meta-Learning¶
Core Insight: Unbounded self-improvement is unsafe.
Mechanism: 1. Learning rate constrained by human reflection quality 2. Self-modification requires multi-sentinel approval 3. Improvement bounded by integrity constraints 4. Lyapunov function ensures convergence
Mathematical Foundation:
Comparison Matrix¶
| Approach | Drift Detection | Value Evolution | Human Oversight | Formal Guarantees |
|---|---|---|---|---|
| RLHF | ❌ | ❌ | ❌ (training only) | ❌ |
| Constitutional AI | ⚠️ (rule-based) | ❌ | ❌ | ⚠️ (partial) |
| Debate | ⚠️ | ❌ | ⚠️ | ❌ |
| Interpretability | ⚠️ | ❌ | ✅ | ❌ |
| SML + MCP | ✅ (continuous) | ✅ (daily) | ✅ (ongoing) | ✅ (proven) |
Philosophical Considerations¶
Value Relativism¶
Question: Whose values should AI align with?
Positions: - Universalism: There are objective values AI should adopt - Relativism: Values are culture/context dependent - Proceduralism: Focus on fair processes, not specific values
Mobius Position: Proceduralist — SML enables democratic value discovery rather than top-down value imposition.
Autonomy vs. Control¶
Tension: Aligned AI should follow human values, but also exercise judgment.
Risk: Too much control → Brittle systems Risk: Too much autonomy → Unsafe systems
Mobius Position: Bounded autonomy — AI systems may innovate within integrity constraints, with human veto power.
Moral Uncertainty¶
Challenge: Even humans are uncertain about ethics.
Approaches: - Moral parliament: Weight multiple ethical theories - Hedging: Act conservatively under uncertainty - Learning: Discover ethics empirically
Mobius Position: Moral hedging with learning — act conservatively while continuously updating through SML.
Research Directions¶
Near-Term (1-3 years)¶
- Improved drift detection — Earlier, more accurate
- Multi-modal reflection — Voice, gesture, context
- Cross-cultural validation — Global SML deployment
- Scalability — City and national scale
Medium-Term (3-10 years)¶
- Multi-agent alignment — Coordinating aligned systems
- Value extrapolation — Predicting value evolution
- Robust interpretability — Understanding advanced AI
- Democratic scaling — Global value aggregation
Long-Term (10+ years)¶
- Superintelligence alignment — Maintaining alignment under capability gains
- Substrate independence — Alignment across implementations
- Cosmic ethics — Alignment for space expansion
- Inter-species values — Non-human considerations
Key Takeaways¶
-
Alignment is hard because human values are complex, contextual, and evolving.
-
Current approaches are insufficient — RLHF, Constitutional AI, etc. address parts of the problem but not the whole.
-
Continuous alignment is necessary — One-time training cannot ensure long-term alignment.
-
Multi-sentinel verification helps — No single system can reliably self-evaluate.
-
Bounded meta-learning is essential — Unbounded self-improvement is unsafe.
-
SML provides a path forward — Daily human reflection with drift detection and formal guarantees.
Further Reading¶
Papers¶
- Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies
- Russell, S. (2019). Human Compatible
- Christiano, P. et al. (2017). "Deep reinforcement learning from human preferences"
- Amodei, D. et al. (2016). "Concrete problems in AI safety"
- Mobius (2025). "Human-Guided Recursive Intelligence: The Strange Metamorphosis Loop"
Related Cathedral Content¶
Citation¶
@incollection{mobius2025alignment,
title={The Alignment Problem: A Mobius Systems Perspective},
author={Judan, Michael},
booktitle={FOR-PHILOSOPHERS/EXISTENTIAL-RISK},
year={2025},
publisher={Mobius Systems}
}
"The question is not whether AI will be superintelligent, but whether it will be superbenevolent."