@hoehrmann
Created May 26, 2024 22:54
Using LLMs to convert old RFC .txt files to modern xml2rfc XML files (and turn them into modern .html files)

Frameworks like llama.cpp support context-free grammars to restrict the output of a large language model to a specific format.

The specification for the xml2rfc format comes with a RELAX NG schema that describes this particular format.

The RELAX NG specification defines its semantics in terms of a simpler format called the simple syntax. The more advanced constructs of the full syntax are essentially syntactic sugar over it.

There are tools that convert the full format into the simple syntax.

The simple syntax is very easy to work with for all kinds of purposes.

We can easily make a formal grammar for a concrete XML format:

html        = start-html  (head body) final-html
head        = start-head  (title)     final-head
title       = start-title ""          final-title
body        = start-body  *(div / p)  final-body
div         = start-div   *(div / p)  final-div 
p           = start-p     ""          final-p

Throw in attributes and such as appropriate.
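
As a minimal sketch of how rules of this shape could be produced mechanically, the following Python snippet expands a small content model into the element rules plus rules for the tag terminals. The `CONTENT_MODEL` dictionary and the `element_rules` helper are hypothetical names chosen for illustration, and attributes are left out entirely. Note also that quoted strings in ABNF match case-insensitively, so a real grammar would probably need a stricter notation for the tag names.

```python
# Hypothetical content model: element name -> ABNF expression for its children.
CONTENT_MODEL = {
    "html": "(head body)",
    "head": "(title)",
    "title": '""',
    "body": "*(div / p)",
    "div": "*(div / p)",
    "p": '""',
}


def element_rules(model):
    """Return one ABNF-style rule per element, followed by its tag terminals."""
    lines = []
    for name, content in model.items():
        # Same shape as the hand-written rules: start tag, content, end tag.
        lines.append(f"{name} = start-{name} {content} final-{name}")
    for name in model:
        # Tag terminals; attributes are deliberately not modelled here.
        lines.append(f'start-{name} = "<{name}>"')
        lines.append(f'final-{name} = "</{name}>"')
    return lines


if __name__ == "__main__":
    print("\n".join(element_rules(CONTENT_MODEL)))
```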

So, take the RNC schema from the xml2rfc RFC. Convert it from compact syntax to XML-based simple syntax. Transform that into a context-free grammar of the form shown above. Write a system prompt for the large language model tasking it with the conversion. The priorities would be to preserve the wording exactly, along with the formatting of ASCII art diagrams and similar constructs.

Constrain its output with the CFG for the XML format. If the grammar mechanism works properly, the result should be a valid xml2rfc file (modulo some issues, for example that there might not be an ID for each IDREF, or whatever internal linking mechanism exists in xml2rfc). That can then be put through the converter to generate plain text files again.
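
A context-free grammar cannot enforce that every reference points at something that exists, so that part would need a separate check. The sketch below assumes that the linking mechanism is the `target` attribute on `xref` elements pointing at `anchor` attributes elsewhere in the document; other referencing elements in the xml2rfc vocabulary are not covered. The `dangling_targets` helper and the `SAMPLE` document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def dangling_targets(xml_text):
    """Return the sorted xref targets that have no matching anchor attribute."""
    root = ET.fromstring(xml_text)
    # Assumption: any element may carry an anchor, only xref refers to one.
    anchors = {el.get("anchor") for el in root.iter() if el.get("anchor")}
    targets = {el.get("target") for el in root.iter("xref") if el.get("target")}
    return sorted(targets - anchors)


# Made-up sample: one reference resolves, the other does not.
SAMPLE = """<rfc><middle>
<section anchor="intro"><t>See <xref target="intro"/> and <xref target="missing"/>.</t></section>
</middle></rfc>"""

if __name__ == "__main__":
    print(dangling_targets(SAMPLE))
```

Any names it reports could be fed back to the model or fixed by hand before running the converter.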

Diff the results, possibly with re-wrapping tolerant settings.
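
One possible way to make the comparison tolerant of re-wrapping is to normalise each paragraph to a single line before diffing. The following is only a sketch using Python's standard `difflib`; the function names are hypothetical. Because it collapses all whitespace, it would also hide changes to ASCII art, so diagrams would need a stricter, line-exact comparison.

```python
import difflib
import re


def paragraphs(text):
    """Split on blank lines and collapse whitespace runs inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]


def rewrap_tolerant_diff(original, regenerated):
    """Unified diff over whitespace-normalised paragraphs, as a list of lines."""
    return list(difflib.unified_diff(
        paragraphs(original),
        paragraphs(regenerated),
        fromfile="original.txt",
        tofile="regenerated.txt",
        lineterm="",
    ))
```

Two texts that differ only in where their lines break produce an empty diff, while a changed word shows up as a changed paragraph.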

@hoehrmann
Author

XSLT to generate a grammar like the above from the simple-syntax schema (modulo some search and replace to remove the FIXME strings):

<xsl:transform
    version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:rng='http://relaxng.org/ns/structure/1.0'
>

    <xsl:output method='text'></xsl:output>

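    <!-- Leave attributes in the XML namespace and attributes whose name starts with "ascii" out of the grammar. -->
    <xsl:template match='rng:attribute[rng:name/@ns = "http://www.w3.org/XML/1998/namespace"]'/>
    <xsl:template match='rng:attribute[starts-with(rng:name, "ascii")]'/>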

    <xsl:template match='rng:attribute'>
        <xsl:text> | (" </xsl:text>
        <xsl:value-of select='rng:name'/>
        <xsl:text>=\x22" </xsl:text>
        <xsl:apply-templates></xsl:apply-templates>
        <xsl:text> "\x22")</xsl:text>
    </xsl:template>

    <xsl:template match='rng:choice[descendant::*[descendant-or-self::*[(self::rng:value or self::rng:data or self::rng:ref or self::rng:text) and not(ancestor-or-self::rng:attribute)]]]'>
        <xsl:text> ( </xsl:text>
        <xsl:apply-templates select='*[1]'/>
        <xsl:text> | </xsl:text>
        <xsl:apply-templates select='*[2]'/>
        <xsl:text> ) </xsl:text>
    </xsl:template>

    <xsl:template match='rng:choice'/>

    <xsl:template match='rng:data'>
        <xsl:text> text </xsl:text>
    </xsl:template>

    <xsl:template match='rng:define'>
        <xsl:value-of select='@name'></xsl:value-of><xsl:text> ::= </xsl:text>
        <xsl:apply-templates/>
        <xsl:text>&#xa;</xsl:text>
    </xsl:template>

    <xsl:template match='rng:element'>
        <xsl:text>&quot;</xsl:text>
        <xsl:value-of select='concat("&lt;", rng:name)'></xsl:value-of>
        <xsl:text>&quot;</xsl:text>

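        <!-- Every attribute alternative is emitted with a leading "|", so a
             "FIXME" placeholder serves as the first alternative; it is removed
             from the output afterwards by search and replace. -->
        <xsl:text> ( "FIXME" </xsl:text>
        <xsl:apply-templates select='.//rng:attribute'></xsl:apply-templates>
        <xsl:text>)* </xsl:text>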

        <xsl:text> &quot;&gt;&quot; nl</xsl:text>
        <xsl:text> </xsl:text>
        <xsl:apply-templates/>
        <xsl:text> </xsl:text>
        <xsl:text>&quot;</xsl:text>
        <xsl:value-of select='concat("&lt;/", rng:name)'></xsl:value-of>
        <xsl:text>&gt;&quot;</xsl:text>
        <xsl:text> nl</xsl:text>
    </xsl:template>

    <xsl:template match='rng:empty'>
        <xsl:text> "" </xsl:text>
    </xsl:template>

    <xsl:template match='rng:grammar'>
        <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match='rng:group[descendant::*[descendant-or-self::*[(self::rng:value or self::rng:data or self::rng:ref or self::rng:text) and not(ancestor-or-self::rng:attribute)]]]'>
        <xsl:text> ( </xsl:text>
        <xsl:apply-templates select='*[1]'/>
        <xsl:text> </xsl:text>
        <xsl:apply-templates select='*[2]'/>
        <xsl:text> ) </xsl:text>
    </xsl:template>
    <xsl:template match='rng:group'/>

    <xsl:template match='rng:oneOrMore'>
        <xsl:text> ( </xsl:text>
        <xsl:apply-templates/>
        <xsl:text> )+ </xsl:text>
    </xsl:template>

    <xsl:template match='rng:ref'>
        <xsl:text> </xsl:text>
        <xsl:value-of select='@name'></xsl:value-of>
        <xsl:text> </xsl:text>
    </xsl:template>

    <xsl:template match='rng:start'>
        <xsl:text>&#xa;nl ::= "\n"?</xsl:text>
        <xsl:text>&#xa;text ::= </xsl:text>
        <xsl:text>([^&gt;&quot;&lt;&amp;]</xsl:text>
        <xsl:text>| "<![CDATA[&amp;]]>"</xsl:text>
        <xsl:text>| "<![CDATA[&lt;]]>"</xsl:text>
        <xsl:text>| "<![CDATA[&gt;]]>"</xsl:text>
        <xsl:text>| "<![CDATA[&quot;]]>"</xsl:text>
        <xsl:text>| "<![CDATA[&apos;]]>"</xsl:text>
        <xsl:text>| "&amp;#x" [0-9a-fA-F]+ ";"</xsl:text>
        <xsl:text>| "&amp;#" [0-9]+ ";"</xsl:text>
        <xsl:text>)*</xsl:text>
        <xsl:text>&#xa;</xsl:text>
        <xsl:text> root ::= </xsl:text>
        <xsl:apply-templates/>
        <xsl:text>&#xa;</xsl:text>
    </xsl:template>

    <xsl:template match='rng:text'>
        <xsl:text> text </xsl:text>
    </xsl:template>

    <xsl:template match='rng:value'>
        <!-- TODO: escaping -->
        <xsl:text>"</xsl:text>
        <xsl:value-of select='.'></xsl:value-of>
        <xsl:text>"</xsl:text>
    </xsl:template>

    <xsl:template match='rng:name'/>
    <xsl:template match='text()'/>

</xsl:transform>
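
The search and replace that removes the FIXME strings is not spelled out above. One way it might be done is sketched below in Python; the two patterns are assumptions based on what the `rng:element` template emits (a `"FIXME"` placeholder followed by `|` when an element has attributes, or a lone `( "FIXME" )*` group when it has none), and `strip_fixme` is a hypothetical helper name.

```python
import re


def strip_fixme(grammar):
    """Remove the "FIXME" placeholder alternatives from generated grammar text."""
    # Placeholder followed by real alternatives: drop it and the first "|".
    grammar = re.sub(r'"FIXME"\s*\|\s*', "", grammar)
    # Placeholder on its own (element without attributes): drop the whole group.
    grammar = re.sub(r'\(\s*"FIXME"\s*\)\*', "", grammar)
    return grammar


if __name__ == "__main__":
    rule = r'"<t" ( "FIXME"  | (" anchor=\x22"  text  "\x22"))*  ">" nl'
    print(strip_fixme(rule))
```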
