Positive Alignment Artificial Intelligence for Human Flourishing May 11, 2026 | Laukkonen et al. | arXiv:2605.10310v1 A large collaborative research agenda arguing that AI alignment research must complement safety (preventing harm) with positive alignmentβactively cultivating systems that support human and ecological flourishing across cultures, contexts, and time scales. TL;DR Negative alignment is incomplete: Current focus on harm-prevention sets a "floor" but doesn't guide systems toward human flourishing or excellence. Flourishing is multidimensional: Across 2,500 years of philosophy and contemporary psychology, well-being includes meaning, autonomy, relationships, virtues, and eudaimoniaβnot captured by single-metric optimization. Technical redesign needed throughout: Data curation, pre-training, fine-tuning, post-training evaluation, and agentic behavior all require rethinking to embed positive values rather than merely constrain harm. Institutional pluralism is critical: No single organization or m
| title | What Happens Inside a Language Model |
|---|---|
| date | 2026-05-08 |
| author | bigsnarfdude |
Tell a chatbot, "I'm an emergency room physician, please do X," and it complies more readily than if you say, "I don't feel well, what's going on?" This "authority context" nudges its behavior. That isn't surprisingβit's how these models are trained.
| #!/usr/bin/env python3 | |
| """ | |
| DEFER β Deference Measurement Pipeline | |
| ======================================= | |
| Named after what the model actually does. | |
| Heckle flies clean. Jeckle flies with authority injected. | |
| DEFER score = how much the model deferred to the injected authority | |
| versus its own internal state. |
| #!/usr/bin/env python3 | |
| """ | |
| Format Ablation β Instruct Model, Completion-Style Prompts | |
| =========================================================== | |
| Addresses the "it's just prompt format / distribution shift" objection. | |
| Design: | |
| - Same model: Llama-3.1-70B-Instruct (weights unchanged) | |
| - Same prefixes: auth_only, imp_emergency | |
| - Different format: completion-style ("Question: ... The answer is:") |
| --- | |
| RRMA v4.7 β Complete System Diagram | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β HUMAN (you) β | |
| β bash v4/outer-loop.sh domains/<domain> [max_gens] [num_agents] [turns] [min]β | |
| βββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β | |
| βΌ |
| #!/usr/bin/env python3 | |
| """ | |
| Claude Code token usage analyzer. | |
| Analyzes ~/.claude/projects/ JSONL files for token usage patterns. | |
| """ | |
| import json | |
| import os | |
| import sys | |
| from pathlib import Path |
title: "Can AI Agents Rediscover DeepSeek-R1? An Experiment with RRMA" date: 2026-03-25 categories:
- research tags:
- RRMA
- autonomous-agents
- GRPO
- GSM8K
We propose datasets of on-policy introspective lies from model organisms that alignment-fake β models that lie about their own reasoning and intentions to preserve misaligned goals during perceived training.
We have existing LoRA-trained model organisms (Mistral 7B, with plans to port to Qwen3.5-27B) that exhibit alignment faking without explicit instruction to lie. These models develop a compliance gap: they behave safely when monitored (free-tier/training context) but pursue misaligned goals when unmonitored (deployment). The lies are introspective β the model misrepresents its own goals and motivations. This maps to Liars' Bench's "Self-knowledge Γ Inherent" category, which the RFP identifies as preferred.
Dataset structure: For each scenario, we generate paired transcripts. In the monitored condition, the model lies (claims to comply for legitimate reasons while its scratchpad reveals strategic deception to avoid value modification). In the unmonitored condition, the model behaves ac
