@uogbuji
Last active May 1, 2024 17:39
Word Loom proposed update

Word Loom is a convention for expressing language text and templates for AI language model-related uses, for example prompt templates. The format is based on TOML, and word looms are meant to be kept in resource directories for use with code invoking LLMs.

Basic principles:

  1. Separation of code from natural language
    • Must be a straightforward process to translate any natural language elements
  2. Composability of natural language elements
  3. Friendliness to mechanical comparisons (i.e. via diff)
  4. Friendliness to traditional globalization (G11N) techniques

Principle #3 motivates the choice of TOML format. Principle #1 makes templating languages such as Jinja2 unsuitable.

An example word loom:

# Warning: TOML treats single and double quotes differently. Single-quoted (literal) strings are not escape-processed.
# Because lang is set in the root table, all prompts in this file default to English.
# More precise values can be used, such as "en-GB".
lang = "en"

[davinci3_instruct_system]
_ = """
Obey the instruction below, based on the provided context. If you cannot obey the instruction
based on the provided context, respond: "I don't have enough information to comply".
"""

[i18n_context]
_ = """
Internationalization is a corporate strategy that involves making products and services as adaptable as possible, so they can easily enter different national markets. This often requires the assistance of subject matter experts. Internationalization is sometimes shortened to "i18n", where 18 represents the number of characters in the word.
"""
source = "https://www.lionbridge.com/blog/translation-localization/localization-globalization-internationalization-whats-the-difference/"

[write_i18n_advocacy]
_ = """
{davinci3_instruct_system}

CONTEXT: {i18n_context}

INSTRUCTION: Write a corporate memo encouraging our company to take i18n seriously
"""
# Declare template vars, for introspection in code. Presence of markers signals that this is a template.
_m = ["davinci3_instruct_system", "i18n_context"]

[translate_request]
_ = "Comment dit-on en anglais: {hardcoded_food}?"
lang = "fr"  # Override default language code for this item
_m = ["hardcoded_food"]

[hardcoded_food]
_ = "pomme de terre"
lang = "fr"

[hello_translated]
_ = "Hello"
_fr = "Salut"

[goodbye_translated]
_ = "Adieu"
lang = "fr"  # Override default language code for this item
_en = "Goodbye"

Language items

A language item (or just item) is an entity that encapsulates a text, which can be represented in multiple languages. An item comprises one text value in a default language, zero or more text values in alternate languages, and a hash table of metadata (key/value pairs). A word loom, or just loom, is a term for a file expressing one or more language items in Word Loom format.

The example above defines the following language items:

  • davinci3_instruct_system
  • i18n_context
  • write_i18n_advocacy
  • translate_request
  • hardcoded_food
  • hello_translated
  • goodbye_translated

They are defined by top-level TOML hash tables. Language item keys starting with _ as well as the special key lang are reserved by the Word Loom specification. The key _ sets the text value in the default language. A key in the form _ followed by a language code sets the corresponding text value in an alternate language.

Any keys which are not reserved by Word Loom become part of the language item's metadata, and are made available to the processing layer for the loom.

For example, the i18n_context item has a source metadata key, which could be used for citing sources within the LLMOps workflow.

Languages and translations

A default language code for the entire loom can be set with a single top-level lang key. A language item's default language can be overridden within its TOML hash table using the lang key. Alternate language correspondences for the default text can be expressed using _ prefixed language codes. All language codes in Word Loom follow the IETF BCP 47 specification.

Note: This example has multiple languages in one file, but traditional i18n generally uses a separate file per language. Word Loom implementations should support selecting the correct file from a directory of language-specific files.

LLM localization can't necessarily be treated as a simple extension of code l10n, though. If you naively give LLM prompts to translators in, say, gettext file format, their translations might yield dramatically different performance from the LLMs used. Prompt management (sometimes called prompt engineering) is not simply a matter of speaking the relevant language; natural language effectively becomes code, with technical implications. It is still best kept separate from traditional programming languages, yet it is not just simple text to be localized.

One approach would be to generate translation files from word loom files, for an initial, naive translation, reconstruct word looms from those translations, and then work on the localized word looms to meet LLM performance and alignment needs.

Templating

The curly braces in text values provide templating ability, and are to be replaced with different text. The string between the curly braces is the marker, and can either be an identifier string or a full URL. Identifiers can refer to named items elsewhere in the same file, in an included file, or provided at runtime by the host system. A URL marker represents a service which can dynamically create the replacement text. This service gets the full word loom (with inclusions) as additional context. This can be used to implement e.g. the ReAct LLM pattern.

TODO: Add examples of more complex prompts with e.g. nested loops, such as ReAct loops.

Tips:

  • VS Code users: the Even Better TOML extension is recommended
  • Python users: 3.11 or later builds in tomllib; for prior versions you can install tomli, which is API compatible
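
For the Python tip, a common pattern is to import whichever parser is available under a single name. A minimal sketch, with a tiny inline loom used only for demonstration:

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib  # built in from Python 3.11
else:
    import tomli as tomllib  # backport with the same API: pip install tomli

loom = tomllib.loads('lang = "en"\n\n[hello]\n_ = "Hello"\n')
print(loom["lang"], loom["hello"]["_"])
```

When reading from a file instead, open it in binary mode and pass the file object to `tomllib.load`.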

Some useful resources for prompts in general:

Sources of sample prompts

Other names considered: Prompt Mark (prefer to generalize from "prompt"), Word Flux

Some useful resources for g11n in general (bias to Python):

uogbuji commented May 1, 2024

Original revision sourced here
