
@jimsrc
Last active February 21, 2024 22:16
Minimizing token usage when interacting with GPT-4.

Overview

I just read about this trick for text compression, which saves tokens in subsequent interactions during a long conversation, or in a subsequent long text to summarize.

SHORT VERSION:

The idea is to define a mapping from common words (or phrases) in a long text you intend to pass later to short abbreviations, and then pass that long text to GPT-4 encoded with the mapping. The encoded version contains fewer tokens than the original text. There are several algorithms to identify frequent words or phrases inside a given text, such as NER, TF-IDF, part-of-speech (POS) tagging, etc.

Note that this lesson applies not only to the GPT-4 service, but also to any local (or on-premise) LLM. Think of this as a boost to other summarization tools out there (e.g. LangChain).
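
To sanity-check the savings, one can count tokens with OpenAI's tiktoken library. A minimal sketch (the example strings are truncated versions of the ones used in the conversation below):

import tiktoken

# GPT-4's tokenizer; token counts are model-specific.
enc = tiktoken.encoding_for_model("gpt-4")

original = "Lambda calculus may be untyped or typed. In typed lambda calculus, ..."
encoded = "lmbC_untyped/typed,typed_functype,weaker,primSubj,lessExpr,moreProof"

# The encoded version should yield noticeably fewer tokens.
print(len(enc.encode(original)), len(enc.encode(encoded)))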

LONG VERSION:

I learned this by asking gpt-4 the logic behind it. Below is my whole conversation.

Me: Compress the following text in a way such that you (GPT-4) can reconstruct it as close as possible to the original. This is for yourself. Do not make it human readable. Aggressively compress it, while still keeping ALL the information to fully reconstruct it. Also use the fewest tokens possible; your text should be way smaller than the one I give you.

## Text to compress:

Lambda calculus may be untyped or typed. In typed lambda calculus, functions can be applied only if they are capable of accepting the given input's "type" of data. Typed lambda calculi are weaker than the untyped lambda calculus, which is the primary subject of this article, in the sense that typed lambda calculi can express less than the untyped calculus can. On the other hand, typed lambda calculi allow more things to be proven. For example, in the simply typed lambda calculus it is a theorem that every evaluation strategy terminates for every simply typed lambda-term, whereas evaluation of untyped lambda-terms need not terminate. One reason there are many different typed lambda calculi has been the desire to do more (of what the untyped calculus can do) without giving up on being able to prove strong theorems about the calculus.


GPT4: lmbC_untyped/typed,typed_functype,weaker,primSubj,lessExpr,moreProof,simplyTypedLC,evalStratTerm,untypedNTerm,manyTypedLC,strongThrmDesire


Me: I asked you to compress a long text using your own abbreviations. You replied with: lmbC_untyped/typed,typed_functype,weaker,primSubj,lessExpr,moreProof,simplyTypedLC,evalStratTerm,untypedNTerm,manyTypedLC,strongThrmDesire

Reconstruct the original text.


GPT4: Lambda calculus may be untyped or typed. In typed lambda calculus, functions can be applied only if they are capable of accepting the given input's "type" of data. Typed lambda calculi are weaker than the untyped lambda calculus, which is the primary subject of this article, in the sense that typed lambda calculi can express less than the untyped calculus can. On the other hand, typed lambda calculi allow more things to be proven. For example, in the simply typed lambda calculus it is a theorem that every evaluation strategy terminates for every simply typed lambda-term, whereas evaluation of untyped lambda-terms need not terminate. One reason there are many different typed lambda calculi has been the desire to do more (of what the untyped calculus can do) without giving up on being able to prove strong theorems about the calculus.


Me: Could you intuitively explain the compression mechanism you're using?


GPT4: The compression mechanism I used involved creating a concise representation of the original text by identifying key concepts and using abbreviations or codes to represent them. I aimed to preserve the essential information in a condensed form that could be easily reconstructed by me (GPT-4). Here's a brief explanation of the abbreviations and how they relate to the original text:

  • lmbC_untyped/typed: Lambda calculus may be untyped or typed.

  • typed_functype: In typed lambda calculus, functions can be applied only if they are capable of accepting the given input's "type" of data.

  • weaker: Typed lambda calculi are weaker than the untyped lambda calculus.

  • primSubj: Primary subject of this article is the untyped lambda calculus.

  • lessExpr: Typed lambda calculi can express less than the untyped calculus can.

  • moreProof: Typed lambda calculi allow more things to be proven.

  • simplyTypedLC: Refers to the simply typed lambda calculus.

  • evalStratTerm: In the simply typed lambda calculus, every evaluation strategy terminates for every simply typed lambda-term.

  • untypedNTerm: Evaluation of untyped lambda-terms need not terminate.

  • manyTypedLC: There are many different typed lambda calculi.

  • strongThrmDesire: One reason for the existence of many typed lambda calculi is the desire to do more without giving up on being able to prove strong theorems about the calculus.

The compressed text consists of these abbreviations or codes, which I can then use to reconstruct the original text. Note that this compression method is tailored specifically for me (GPT-4) and may not be easily understood by humans without explanation.


Me: Could you give me a Python function that uses your logic to compress any chunk of text? The aim is that you'll understand it such that you'll be able to decompress it.


GPT4: It's important to note that creating a compression function that is both highly efficient and generalizable to any chunk of text is a complex task, and the method I used earlier was specifically tailored for that text. However, I can provide you with a simple Python function that uses my logic to compress text by identifying and shortening common phrases or keywords. You can then provide me the compressed text, and I'll be able to decompress it.

import re

def compress_text(text):
    # Phrase-to-abbreviation mapping; insertion order matters, since
    # substitutions are applied sequentially ('untyped' before 'typed').
    common_phrases = {
        'lambda calculus': 'lmbC',
        'untyped': 'unt',
        'typed': 'typ',
        'function': 'func',
        'evaluation strategy': 'evalStrat',
        'theorems': 'thrm',
    }

    compressed_text = text.lower()
    for phrase, abbreviation in common_phrases.items():
        compressed_text = re.sub(phrase, abbreviation, compressed_text)

    return compressed_text

# Example usage
text = "Lambda calculus may be untyped or typed..."
compressed = compress_text(text)
print(compressed)

Keep in mind that this simple function only covers a limited set of phrases and is case-insensitive. For more complex and efficient text compression, specialized algorithms and libraries should be used. Also, remember that this method is tailored for me to understand the compressed text and may not be easily understood by humans without explanation.
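
For reference, decompression is just the inverse substitution, assuming the common_phrases dict is kept alongside the compressed text (hoisted out of compress_text). A minimal sketch, reusing the re import above; note it cannot restore the original capitalization, since compress_text lowercases its input:

def decompress_text(compressed_text, mapping):
    # Restore whole-word abbreviations only (\b boundaries), so the
    # 'typ' inside a just-restored 'untyped' is left untouched.
    for phrase, abbreviation in mapping.items():
        pattern = r'\b' + re.escape(abbreviation) + r'\b'
        compressed_text = re.sub(pattern, phrase, compressed_text)
    return compressed_text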


Me: So if I were to generalize this, I could just run something like a TF-IDF algorithm to identify the most common words in a chunk of text, and then give a mapping like in your "common_phrases" dict?


GPT4: Yes, you can use a TF-IDF algorithm to identify the most important words or phrases in a given text, and then create a mapping similar to the "common_phrases" dictionary I provided earlier. This way, you can tailor the compression algorithm to the specific text you are working with. Here's a simple example using the TfidfVectorizer from the scikit-learn library:

import re
from sklearn.feature_extraction.text import TfidfVectorizer

def create_mapping(text, num_keywords=10):
    # With a single document, IDF is constant, so these scores
    # effectively rank terms by (normalized) frequency.
    vectorizer = TfidfVectorizer(stop_words='english')
    matrix = vectorizer.fit_transform([text])
    feature_names = vectorizer.get_feature_names_out()
    tfidf_scores = matrix.toarray()[0]

    # Keep the top-scoring keywords and assign each a short code.
    sorted_keywords = sorted(
        zip(feature_names, tfidf_scores), key=lambda x: x[1], reverse=True
    )
    top_keywords = sorted_keywords[:num_keywords]
    mapping = {
        keyword: f'kw{i}' for i, (keyword, _) in enumerate(top_keywords)
    }
    return mapping

def compress_text(text, mapping):
    compressed_text = text.lower()
    for phrase, abbreviation in mapping.items():
        compressed_text = re.sub(phrase, abbreviation, compressed_text)
    return compressed_text

# Example usage
text = "Lambda calculus may be untyped or typed..."
mapping = create_mapping(text, num_keywords=5)
compressed = compress_text(text, mapping)
print(compressed)

This code will create a mapping of the top 5 keywords based on their TF-IDF scores and then compress the text using that mapping. Keep in mind that this is a simple example and there are other ways to improve and optimize this approach. Also, remember that this method is tailored for me to understand the compressed text and may not be easily understood by humans without explanation.
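
One practical caveat (the comment thread below shows reconstruction failing without this): GPT-4 can only invert a text-specific mapping if it is given the legend, so the mapping has to travel with the compressed text. Here is a hypothetical build_prompt helper sketching how to package both into one message; since the legend itself costs tokens, this only pays off when each keyword repeats often enough:

def build_prompt(text, mapping):
    # Prepend the legend so the model can decode the text-specific codes.
    legend = "; ".join(f"{abbr} = {phrase}" for phrase, abbr in mapping.items())
    compressed = compress_text(text, mapping)
    return f"Abbreviations: {legend}\n\nText: {compressed}"

# Example usage
print(build_prompt(text, mapping))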


Me: In what way can this approach be optimized? Remember I aim to minimize the number of tokens I communicate to you in any given conversation.


GPT4: To further optimize the text compression approach and minimize the number of tokens communicated, you can consider the following strategies:

  • Improve keyword/phrase selection: Use more advanced NLP techniques like Named Entity Recognition (NER), part-of-speech (POS) tagging, or dependency parsing to better identify essential words or phrases within the text.

  • Adaptive abbreviation generation: Generate shorter abbreviations for more common or important words/phrases in the text. You can even consider using Huffman coding or arithmetic coding to create optimal prefix-free codes (a rough greedy sketch follows this answer).

  • Context-aware compression: If you have multiple chunks of text from the same domain, you can create a shared mapping for the domain's common terms. This can save tokens by reusing the same abbreviations across different conversations.

  • Combine words/phrases: If you notice two or more words or phrases that frequently appear together, you can create an abbreviation for the combined phrase, which will save tokens when they occur together in the text.

  • Custom tokenization: Analyze the text structure and create custom tokenization rules that can help you better identify and compress the most relevant parts of the text.

  • Use existing compression algorithms: You can also use existing text compression algorithms like Lempel-Ziv-Welch (LZW) or Brotli to further compress the text before applying the keyword-based compression.

Keep in mind that while these optimizations aim to reduce the number of tokens communicated, they may also make the compressed text more difficult for humans to understand. The goal is to strike a balance between compressing the text and preserving its meaning so that I (GPT-4) can reconstruct it accurately.
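
Picking up the adaptive-abbreviation bullet above, a rough greedy sketch: count word frequencies and hand out compact numeric codes starting with the most frequent words (a hypothetical adaptive_mapping helper, not an actual Huffman coder). Note that fewer characters does not always mean fewer BPE tokens, so it is worth re-counting with a tokenizer afterwards:

from collections import Counter
import re

def adaptive_mapping(text, num_keywords=10):
    # Count longer words and assign compact codes ('~0', '~1', ...) in
    # descending frequency order, greedily approximating optimal codes.
    words = re.findall(r'[a-z]+', text.lower())
    counts = Counter(w for w in words if len(w) > 4)
    top = counts.most_common(num_keywords)
    return {word: f'~{i}' for i, (word, _) in enumerate(top)}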

@thiswillbeyourgithub

Great stuff! Would you mind adding examples of running this function on different kinds of inputs?

@MikeStirner

I did some more testing, and GPT-4 gave its take on whether the compression worked well:
"Yes, it would be fair to say that the compression system succeeded in maintaining the core information and ideas from the original text. The uncompressed version (Text 2) still discusses the main differences between typed and untyped lambda calculus and covers various aspects like function types, evaluation strategies, and expressive power. While the structure and phrasing are different, the core ideas are preserved, indicating that the AI compression algorithm effectively retained the essential information from the original text (Text 1)."

@ebudmada

How many tokens are saved using that method?

@evadik24

This is very interesting and promising. However, although it is lossless, the shortened version loses the tonality and personality of the agent. My prompt requires the agent (AI) to be a certain way and follow steps in a certain order; the compressed version just disregards all that and behaves in a very standard way. Any thoughts on how I can restore that, or why this would be?

Many thanks

@endolith

Me:

I asked you to compress a long text using your own abbreviations. You replied with: lmbC_untyped/typed,typed_functype,weaker,primSubj,lessExpr,moreProof,simplyTypedLC,evalStratTerm,untypedNTerm,manyTypedLC,strongThrmDesire

Reconstruct the original text.

ChatGPT:

I'm sorry, but I cannot directly expand these abbreviations without the context in which they were used. The abbreviations you provided seem to be related to topics in computer science, particularly type theory or lambda calculus, but without the original context, I can only guess at their full forms.

@alperozoner

@endolith makes a great point. This method is useless, as the model was only able to "reconstruct" the text because the unshortened text was already in its memory.

@idrnet

idrnet commented Oct 27, 2023

@endolith makes a great point. This method is useless, as the model was only able to "reconstruct" the text because the unshortened text was already in its memory.

No, I tried to decompress it from a separate OpenAI account, and it decompressed it almost identically.

@melanke

melanke commented Dec 5, 2023

Me:

I asked you to compress a long text using your own abbreviations. You replied with: lmbC_untyped/typed,typed_functype,weaker,primSubj,lessExpr,moreProof,simplyTypedLC,evalStratTerm,untypedNTerm,manyTypedLC,strongThrmDesire

Reconstruct the original text.

GPT3.5:

Certainly! Here's the reconstructed text from the abbreviations:

"I am looking for a comparison between untyped and typed lambda calculus. Specifically, I'm interested in understanding the differences in function types, the notion of weaker typing, the primary subjects of study, the level of expressiveness in each calculus, the amount of proof theory involved, the concept of simply typed lambda calculus, evaluation strategies for terms, untyped lambda calculus, various aspects of typed lambda calculus, and the desire for stronger theorems."

GPT4:

Based on the abbreviations you provided, it seems like the original text was discussing various concepts related to programming languages and type systems. Here's a possible reconstruction of the original text based on the abbreviations:

lmbC_untyped/typed: This could refer to "Lambda Calculus, untyped and typed." Lambda Calculus is a formal system in mathematical logic and computer science for expressing computation based on function abstraction and application.

typed_functype: This likely refers to "typed functions and types." In programming, this means that functions and variables have specified types, which determines the kind of data they can handle.

weaker, primSubj: This might mean "weaker primitives subject to certain conditions." It implies a discussion about basic elements in a system that have less powerful capabilities under specific constraints.

lessExpr, moreProof: This could be "less expressive, more proof-oriented." This suggests a trade-off between the expressiveness of a language and its suitability for formal proofs or verification.

simplyTypedLC: Refers to "simply typed Lambda Calculus," a variant of Lambda Calculus where every expression is assigned a type, and operations incompatible with these types are prohibited.

evalStratTerm: This is likely "evaluation strategies and termination." It discusses how expressions in a programming language are evaluated and whether these evaluations come to a stop or continue indefinitely.

untypedNTerm: This probably stands for "untyped non-terminating terms." It refers to expressions in an untyped system that do not have a defined endpoint in their evaluation.

manyTypedLC: This could mean "Lambda Calculus with many types," suggesting a system where a variety of types are used, possibly including complex type structures.

strongThrmDesire: This is likely "strong theorem desire," which could refer to a preference or need for strong theoretical foundations, possibly in the context of language design or type systems.

This reconstruction is speculative and assumes that the original text was discussing concepts in computer science, particularly around Lambda Calculus and type systems. If the context was different, the interpretation of these abbreviations could change significantly.

Another shot, Me:

I asked you to compress a long text using your own abbreviations. You replied with: lmbC_untyped/typed,typed_functype,weaker,primSubj,lessExpr,moreProof,simplyTypedLC,evalStratTerm,untypedNTerm,manyTypedLC,strongThrmDesire

Give me your guess on what the original text is. Don't explain anything to me, simply guess the original text.

GPT4:

The original text likely discussed various concepts related to lambda calculus, including untyped and typed versions, evaluation strategies, the simplicity of typed lambda calculus, and possibly theorems or proofs related to these topics.

@melanke

melanke commented Dec 5, 2023

I am very interested in this subject; maybe we could ask GPT to create the perfect prompt to allow it to reconstruct the text later.
But I am particularly interested in compressing programming code.

@araza-mms

The original example works in the same ChatGPT window.
It does not work for recovering the original text in a new window, where there is no memory.

The abbreviations you've provided seem to represent concepts related to lambda calculus and theoretical computer science. Here's a possible reconstruction of the original text based on these abbreviations:

lmbC: Lambda Calculus
untyped/typed: This refers to the distinction between untyped and typed lambda calculus. Typed lambda calculus includes systems like simply typed lambda calculus, while untyped does not impose these constraints.
typed_functype: In typed lambda calculus, functions have specific types.
weaker: This might refer to the weaker expressive power of simply typed lambda calculus compared to its untyped counterpart.
primSubj: The primary subject of discussion is lambda calculus.
lessExpr: Less expressive power in certain typed systems.
moreProof: More robust in terms of proof theory.
simplyTypedLC: Simply Typed Lambda Calculus, a specific type system in lambda calculus.
evalStratTerm: Evaluation strategies and termination properties in lambda calculus.
untypedNTerm: Untyped lambda calculus and its non-termination characteristics.
manyTypedLC: Variants of typed lambda calculus, possibly referring to systems like System F or the Calculus of Constructions.
strongThrmDesire: A desire for stronger theorems or properties in the context of these systems.
This reconstruction attempts to make sense of the abbreviations in a coherent way, assuming they are related to lambda calculus and its various properties and systems.
