On the Löbian obstacle

Miëtek Bak

Please see “Strongly-normalising agents” for a follow-up.

Throughout much of MIRI’s research, there is an underlying assumption that a consistent system cannot prove its own consistency.

While it is usually acknowledged that the above assumption applies only to systems as strong as Peano Arithmetic (PA), the acknowledgement is made only in passing, with little consideration given to the possibility of using other systems. Girard’s system F appears merely as a footnote, while Willard’s self-verifying systems merit just a brief mention. (YH2013, pp.6,9,18; F2013a, p.1; F2013b, p.1; FS2014, p.3; FS2015, p.4b)

Moreover, this acknowledgement is sometimes omitted in simplified presentations of the problem. This is unfortunate, as it serves to perpetuate the objectionable assumption. (YH2013, pp.2,4; FS2014, pp.8,9; SF2014, p.7; FLI2015, p.9; FS2015, p.4a; LV2015a, p.1; LV2015b)

No motivation is provided for focusing solely on systems as strong as PA, even though it is precisely this focus that leads directly to the Löbian obstacle. (YH2013, p.4; FS2014, pp.8,9)
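
For reference, the obstacle rests on Löb’s theorem. For any recursively axiomatisable theory T as strong as PA, with standard provability predicate Prov_T, the theorem states:

```latex
\[
  T \vdash \mathrm{Prov}_T(\ulcorner \varphi \urcorner) \rightarrow \varphi
  \quad\Longrightarrow\quad
  T \vdash \varphi .
\]
% Taking $\varphi = \bot$ yields Gödel's second incompleteness theorem:
% if $T \vdash \mathrm{Con}(T)$, i.e. $T \vdash \neg\mathrm{Prov}_T(\ulcorner \bot \urcorner)$,
% then $T \vdash \bot$, and T is inconsistent.
```

The proof relies on T satisfying the standard derivability conditions for Prov_T; this is precisely where the requirement of being “as strong as PA” enters.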

Yudkowsky and Herreshoff introduce the notion of tiling agents, “able to approve the construction of successors similar to themselves, both architecturally and in the reasoning licensing their construction”. (YH2013, p.24)

I propose to consider tiling agents which trust a single consistent system, and therefore, trust each other’s proofs.
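
To make the proposal concrete, here is a minimal sketch of what such an approval step might look like, written in Haskell; every name and interface below (Proof, Agent, checks, approve) is invented for illustration and appears in none of the cited papers.

```haskell
-- Hypothetical sketch only.  The point is the shape of the approval step:
-- parent and successor share one fixed system S, so any certificate the
-- successor produces is a certificate the parent can check directly.

-- A stand-in for derivations in the single shared system S.
newtype Proof = Proof [String]

-- A stand-in proof checker for S; a real checker would verify each
-- inference step of the derivation, not merely that steps exist.
checks :: Proof -> Bool
checks (Proof steps) = not (null steps)

-- An agent is its action policy together with a certificate, in S, that
-- the policy preserves the goal the agent was built for.
data Agent = Agent
  { policy      :: String
  , certificate :: Proof
  }

-- The parent approves constructing a successor exactly when the
-- successor's certificate checks in the shared system.  No descent to a
-- strictly weaker system is needed, because both agents trust the same
-- proofs.
approve :: Agent -> Agent -> Bool
approve _parent successor = checks (certificate successor)
```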

Furthermore, I propose to investigate a class of systems, such that:

  1. each system has the strong normalisation property, and therefore, is consistent (see the sketch below);
  2. each system can prove its own consistency;
  3. each system is practical enough to use for solving everyday programming problems, and so, eventually, for the implementation of tiling agents.

It is not yet clear to me whether (2) is actually necessary.
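
The inference in (1), from strong normalisation to consistency, is the standard argument, assuming the system also enjoys subject reduction and has an empty type ⊥ with no introduction rule:

```latex
Suppose $\vdash t : \bot$.  By strong normalisation, $t$ reduces to some
normal form $v$, and by subject reduction $\vdash v : \bot$.  Inspection of
canonical forms shows that no closed normal term has type $\bot$, a
contradiction.  Hence $\bot$ is uninhabited, i.e.\ the system is consistent.
```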

Potential members of this class include all strongly-normalising systems, such as:

  • Gödel’s system T (sketched below),
  • all eight λ calculi of Barendregt’s λ cube, including:
    • the simply-typed λ calculus,
    • Girard’s system F, and
    • Coquand’s calculus of constructions;
  • the LF logical framework,
  • Cockett’s Charity language,
  • systems arising from Turner’s work on total functional programming (ESFP),
  • Spławski’s ET language (IPL).
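
As a small illustration of the first entry, here is a sketch of the total fragment underlying Gödel’s system T, rendered in Haskell; the rendering is only illustrative, since Haskell itself is not strongly normalising and the termination guarantee applies only to terms kept within this fragment.

```haskell
-- Sketch of (a fragment of) Gödel's system T in Haskell.  The discipline
-- is the point: the only recursion used is the primitive recursor `rec`,
-- and any function defined solely from constructors, lambda, application,
-- and `rec` terminates on all inputs.

data Nat = Zero | Suc Nat
  deriving Show

-- The recursor of system T: `rec z s n` folds `s` over `n`, starting
-- from `z`, with access to the current index at each step.
rec :: a -> (Nat -> a -> a) -> Nat -> a
rec z _ Zero    = z
rec z s (Suc n) = s n (rec z s n)

-- Addition and multiplication, defined only via `rec`.
add :: Nat -> Nat -> Nat
add m = rec m (\_ acc -> Suc acc)

mul :: Nat -> Nat -> Nat
mul m = rec Zero (\_ acc -> add m acc)

-- Example: 2 * 3.
main :: IO ()
main = print (mul (Suc (Suc Zero)) (Suc (Suc (Suc Zero))))
```

Running `main` prints `Suc (Suc (Suc (Suc (Suc (Suc Zero)))))`, i.e. 6.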

Please see “Strongly-normalising agents” for a follow-up.

References

YH2013

E. Yudkowsky, M. Herreshoff, MIRI, “Tiling agents for self-modifying AI, and the Löbian obstacle”, 2013

p.2

A consistent mathematical theory T cannot trust itself in the sense of verifying that a proof in T of any formula φ implies φ’s truth.

p.4

Löb’s Theorem normally implies that no system can prove its own soundness schema. … This seemingly implies that Axm¹ ≠ Axm⁰ and that the offspring A⁰ must possess strictly weaker mathematical power than its parent A¹.

p.6

Gödel’s Second Incompleteness Theorem states that systems at least as powerful as PA prove their own consistency if and only if they are inconsistent.

F2013a

B. Fallenstein, MIRI, “An infinitely descending sequence of sound theories each proving the next consistent”, 2013

F2013b

B. Fallenstein, MIRI, “Decreasing mathematical strength in one formalization of parametric polymorphism”, 2013

FS2014

B. Fallenstein, N. Soares, MIRI, “Problems of self-reference in self-improving space-time embedded intelligence”, 2014

p.3

Löb’s theorem, a variant of Gödel’s second incompleteness theorem, implies that no consistent theory T as strong as PA can show that everything provable in T is actually true.

p.8

Naively, we might expect that one should be able to have T = T': that our agent would be able to keep using the same theory when rewriting itself. … However, … if we set T = T', then … says that T shows its own consistency, which by Gödel’s second incompleteness theorem implies that T is inconsistent.

p.9

Because of Löb’s theorem, we cannot hope to have T = T' if our proof is to go through.

SF2014

N. Soares, B. Fallenstein, MIRI, “Aligning superintelligence with human interests: A technical research agenda”, 2014

p.7

By Gödel’s second incompleteness theorem (1934), sufficiently powerful formal systems cannot rule out the possibility that they may be inconsistent. This makes it difficult for agents using formal logical reasoning to verify the reasoning of similar agents which also use formal logic for high-confidence reasoning; the first agent cannot verify that the latter agent is consistent.

FLI2015

Future of Life Institute, “Research priorities for robust and beneficial artificial intelligence”, 2015

p.6

Attempting to straightforwardly apply formal verification tools to this more general setting presents new difficulties, including the challenge that a formal system that is sufficiently powerful cannot use formal methods in the obvious way to gain assurance about the accuracy of functionally similar formal systems, on pain of inconsistency via Gödel’s incompleteness.

FS2015

B. Fallenstein, N. Soares, MIRI, “Vingean reflection: Reliable reasoning for self-improving agents”, 2015

p.4a

Due to Gödel’s second incompleteness theorem, an agent using formal proofs cannot trust the reasoning of future versions using the same proof system.

p.4b

The suggester … cannot in general conclude that an action which has been proven safe is in fact safe. Intuitively, this follows from Gödel’s second incompleteness theorem: No consistent formal system as strong as PA is able to prove its own consistency.

LV2015a

P. LaVictoire, MIRI, “An introduction to Löb’s theorem in MIRI research”, 2015

p.1

If we try and model DT1 and DT2 as proving statements in two formal systems (one stronger than the other), then the only way that DT1 can make such a statement about DT2’s reliability is if DT1 (and thus both) are in fact unreliable!

LV2015b

P. LaVictoire, MIRI, “Welcome, new contributors” and “How to contribute”, 2015

Informally, if an agent self-modifies to become better at problem-solving or inference, it should be able to trust that its modified self will be better at achieving its goals. As it turns out, there is a self-referential obstacle with simple models of this (akin to the fact that only inconsistent formal systems believe themselves to be consistent), and one method of fixing it results in the possibility of indefinitely deferred actions or deductions.
