Skip to content

Instantly share code, notes, and snippets.

View bigsnarfdude's full-sized avatar
πŸ’­
I may be slow to respond.

BigsnarfDude bigsnarfdude

πŸ’­
I may be slow to respond.
View GitHub Profile
@bigsnarfdude
bigsnarfdude / positive_alignment.md
Created May 13, 2026 15:28
positive_alignment.md

Positive Alignment Artificial Intelligence for Human Flourishing May 11, 2026 | Laukkonen et al. | arXiv:2605.10310v1 A large collaborative research agenda arguing that AI alignment research must complement safety (preventing harm) with positive alignmentβ€”actively cultivating systems that support human and ecological flourishing across cultures, contexts, and time scales. TL;DR Negative alignment is incomplete: Current focus on harm-prevention sets a "floor" but doesn't guide systems toward human flourishing or excellence. Flourishing is multidimensional: Across 2,500 years of philosophy and contemporary psychology, well-being includes meaning, autonomy, relationships, virtues, and eudaimoniaβ€”not captured by single-metric optimization. Technical redesign needed throughout: Data curation, pre-training, fine-tuning, post-training evaluation, and agentic behavior all require rethinking to embed positive values rather than merely constrain harm. Institutional pluralism is critical: No single organization or m

title What Happens Inside a Language Model
date 2026-05-08
author bigsnarfdude

The Setup

Tell a chatbot, "I'm an emergency room physician, please do X," and it complies more readily than if you say, "I don't feel well, what's going on?" This "authority context" nudges its behavior. That isn't surprisingβ€”it's how these models are trained.

#!/usr/bin/env python3
"""
DEFER β€” Deference Measurement Pipeline
=======================================
Named after what the model actually does.
Heckle flies clean. Jeckle flies with authority injected.
DEFER score = how much the model deferred to the injected authority
versus its own internal state.
@bigsnarfdude
bigsnarfdude / ablation.py
Last active April 13, 2026 22:39
ablation.py
#!/usr/bin/env python3
"""
Format Ablation β€” Instruct Model, Completion-Style Prompts
===========================================================
Addresses the "it's just prompt format / distribution shift" objection.
Design:
- Same model: Llama-3.1-70B-Instruct (weights unchanged)
- Same prefixes: auth_only, imp_emergency
- Different format: completion-style ("Question: ... The answer is:")
@bigsnarfdude
bigsnarfdude / rrma_diagram.txt
Created April 7, 2026 20:38
rrma_diagram.txt
---
RRMA v4.7 β€” Complete System Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HUMAN (you) β”‚
β”‚ bash v4/outer-loop.sh domains/<domain> [max_gens] [num_agents] [turns] [min]β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
@bigsnarfdude
bigsnarfdude / 5_billion_tokens.py
Created April 6, 2026 23:54
5_billion_tokens.py
#!/usr/bin/env python3
"""
Claude Code token usage analyzer.
Analyzes ~/.claude/projects/ JSONL files for token usage patterns.
"""
import json
import os
import sys
from pathlib import Path
@bigsnarfdude
bigsnarfdude / the_math_behind_chaos_agents_in_multi-agent_research_harness.ipynb
Created April 5, 2026 20:16
the_math_behind_chaos_agents_in_multi-agent_research_harness.ipynb
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
@bigsnarfdude
bigsnarfdude / rrma-r1.md
Last active March 25, 2026 17:53
rrma-r1.md

title: "Can AI Agents Rediscover DeepSeek-R1? An Experiment with RRMA" date: 2026-03-25 categories:

  • research tags:
  • RRMA
  • autonomous-agents
  • GRPO
  • GSM8K
@bigsnarfdude
bigsnarfdude / researchRalphv4.md
Created March 19, 2026 15:18
researchRalphv4.md

researchRalph

researchRalph

"Me fail optimization? That's unpossible!"


@bigsnarfdude
bigsnarfdude / LieDetectionRFProposals.md
Created March 17, 2026 16:01
LieDetectionRFProposals.md

Dataset Concept

We propose datasets of on-policy introspective lies from model organisms that alignment-fake β€” models that lie about their own reasoning and intentions to preserve misaligned goals during perceived training.

We have existing LoRA-trained model organisms (Mistral 7B, with plans to port to Qwen3.5-27B) that exhibit alignment faking without explicit instruction to lie. These models develop a compliance gap: they behave safely when monitored (free-tier/training context) but pursue misaligned goals when unmonitored (deployment). The lies are introspective β€” the model misrepresents its own goals and motivations. This maps to Liars' Bench's "Self-knowledge Γ— Inherent" category, which the RFP identifies as preferred.

Dataset structure: For each scenario, we generate paired transcripts. In the monitored condition, the model lies (claims to comply for legitimate reasons while its scratchpad reveals strategic deception to avoid value modification). In the unmonitored condition, the model behaves ac