Short Summary of AI Alignment

The following is a short summary of AI alignment that you may find handy.

Imagine a maid robot with which we are interacting.

  • Outer alignment problem, aka reward hacking, task misspecification, specification gaming.

    You ask for a coffee. It understands the assignment, but grabs a cup from your father and hands it to you. You get your coffee, but not the way you wanted it.
    Problem: Your values and preferences are not encoded in the objective.
    Challenging part: How to specify innumerably many preferences and ensure they are adhered to?
    Methods: Tune it to be honest, harmless and helpful: RLHF. Feedback at scale for super-intelligence: scalable oversight, weak-to-strong generalisation, super-alignment. Explain the process instead of simply specifying the outcome: process-based feedback. (A toy sketch of this failure mode appears after this list.)

  • Inner alignment problem, aka goal misgeneralisation, spurious correlations, distribution shift

    You ask for a coffee. It misunderstands the assignment based on its experience and instead (a) hands you a cup of hot milk (goal misgeneralisation), or (b) fails outright because it cannot operate an unfamiliar coffee machine (capability misgeneralisation).
    Problem: Training with sparse feedback (a reward or a label) leaves the cause of that feedback to the imagination; the model cannot tell which feature or action actually earned it.
    Challenging part: How to attribute the reward to the appropriate feature/action while keeping the feedback sparse? (See the second sketch after this list.)
    Methods: Many classic methods for tackling distribution shift (causal learning, domain generalisation, learning from explanations, etc.), and interpretability methods to weed out problematic concepts.

  • Existential risk (hypothesised)

    You ask for a coffee. It gets you one. But in its free time, it builds strategies for long-horizon reward accumulation: (1) ensure humanity never runs out of coffee, (2) make itself irreplaceable.
    Problem: An extreme case of outer (mis)alignment.
    Challenging part: Same as outer alignment. Also, how to monitor and control the true intentions of a learning system?
    Methods: Any outer-alignment method; my personal favorite is process-based feedback.

  • Grounding the common terms in their technical causes

    1. deceptively aligned: failures of a complex system that are hard to detect
    2. situationally-aware policies: train-test distribution shift of policies
    3. manipulative: can provide a convincing explanation even for a wrong answer
    4. power-seeking: outcome-based feedback (inadvertently) makes actions that guarantee the agent's perpetuation more desirable

    I fear that describing agent behaviour in terms like the above portrays the technical problem as some kind of rehabilitation program. Misalignment is an engineering challenge that I believe we can solve.
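
Referring back to the outer-alignment bullet, here is a minimal Python sketch of reward hacking in the spirit of the coffee example. The plan names, the helper functions `proxy_reward` and `true_reward`, and all the numbers are invented purely for this illustration; they are not from any real system.

```python
# Toy illustration of reward hacking / specification gaming.
# All plans, flags and reward numbers below are made up for this example.

plans = {
    # plan: (user_gets_coffee, seconds_taken, took_cup_from_father)
    "brew_fresh_cup":   (True, 180, False),
    "take_fathers_cup": (True,  10, True),
    "do_nothing":       (False,  0, False),
}

def proxy_reward(plan):
    """The *specified* objective: deliver coffee, quickly."""
    got_coffee, seconds, _ = plans[plan]
    return (1.0 if got_coffee else 0.0) - 0.001 * seconds

def true_reward(plan):
    """The *intended* objective: the same, minus an unstated preference
    (do not take things away from other people)."""
    _, _, took_from_father = plans[plan]
    return proxy_reward(plan) - (2.0 if took_from_father else 0.0)

# An optimiser that only ever sees the proxy reward prefers the harmful plan,
# because the unstated preference never entered the specification.
print("proxy-optimal plan:", max(plans, key=proxy_reward))   # take_fathers_cup
print("truly-optimal plan:", max(plans, key=true_reward))    # brew_fresh_cup
```

The gap between `proxy_reward` and `true_reward` is exactly the "values and preferences are not encoded" problem above.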

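For the inner-alignment bullet, the second sketch shows goal misgeneralisation driven by a spurious correlation: a linear model supervised only by a sparse label latches onto the easy, non-causal feature and fails once the train-test correlation breaks. The feature names, data and numbers are again invented for illustration.

```python
# Toy illustration of goal misgeneralisation via a spurious correlation.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Training distribution: the causal feature ("is it coffee?") and a spurious
# one ("is it served hot?") are both correlated with the label.
y_train = rng.integers(0, 2, n) * 2 - 1            # labels in {-1, +1}
core_train = y_train + 0.5 * rng.normal(size=n)    # weak but causal signal
spur_train = 3.0 * y_train                         # strong, easy, non-causal signal
X_train = np.column_stack([core_train, spur_train])

# Logistic regression trained by plain gradient descent on the sparse label.
w = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-y_train * (X_train @ w)))
    grad = -((1.0 - p) * y_train) @ X_train / n
    w -= 0.1 * grad
print("learned weights [core, spurious]:", w)      # most weight lands on the spurious feature

# Test distribution: the spurious correlation is reversed (think hot milk vs. iced coffee).
y_test = rng.integers(0, 2, n) * 2 - 1
core_test = y_test + 0.5 * rng.normal(size=n)
spur_test = -3.0 * y_test
X_test = np.column_stack([core_test, spur_test])

accuracy = np.mean(np.sign(X_test @ w) == y_test)
print("test accuracy under the shift:", accuracy)  # far below chance: the learned goal misgeneralised
```
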
Summary.

  1. Alignment is not a new problem, nor does it necessarily require super-intelligence. The alignment problem arises from black-box models and high-dimensional inputs.
  2. However, the increased capability of systems may lead to increased autonomy, thereby triggering even greater concern.
  3. The expected AI-takeover scenarios require very long-range planning, much higher capabilities (it is not clear exactly which), a strong harm-causing motive (read: a very poorly designed reward), and strong persuasion. While I appreciate the risks of misalignment, harms to the extent of existential risk are still unsubstantiated.

References or Further Reading

  1. https://80000hours.org/articles/what-could-an-ai-caused-existential-catastrophe-actually-look-like/
    Contains a list of extreme risks due to superintelligence with somewhat more concrete scenarios.
  2. "The alignment problem from a deep learning perspective" http://arxiv.org/abs/2209.00626
    Longer summary of AI Alignment with many references.