@izikeros
Last active March 26, 2024 18:11
[split text fixed tokens] Split text into parts with limited length in tokens #llm #tokens #python

Text Splitter


A Python script for splitting text into parts with a controlled (limited) length in tokens. The script uses the tiktoken library to encode and decode text.

Introduction

Have you ever needed to split a long text into smaller parts with a specific token limit? The Text Splitter script is here to help! This Python script takes a text input, tokenizes it using the specified encoding, and splits it into parts, ensuring that each part does not exceed the given token limit. It then converts the tokenized parts back into human-readable text.
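The grouping step described above is tokenizer-agnostic. As a minimal sketch (using a plain list as a stand-in for token IDs, purely for illustration; the helper name is not part of the gist), it amounts to a greedy loop:

```python
def group_tokens(tokens, limit):
    """Greedily pack items into consecutive parts of at most `limit` items."""
    parts, current = [], []
    for token in tokens:
        current.append(token)
        if len(current) >= limit:  # part is full, start a new one
            parts.append(current)
            current = []
    if current:  # flush the trailing, possibly shorter, part
        parts.append(current)
    return parts
```

With real token IDs, each part would then be decoded back to text; the script below does exactly that.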

Installation

  1. Copy the text of the gist and save it on your drive, or clone it:

     git clone https://gist.github.com/17d9c8ab644bd2762acf6b19dd0cea39

  2. Install the required dependency. The script relies on the tiktoken library, which can be installed using pip:

    pip install tiktoken

Usage

To use the text splitter module, follow these steps:

  1. Import the split_string_with_limit function from the module:

    from split_string import split_string_with_limit
  2. Obtain an encoding using the tiktoken library. You can choose from different pre-trained encodings or create your own.

    import tiktoken
    
    encoding = tiktoken.get_encoding("cl100k_base")
  3. Provide the text you want to split, the token limit, and the encoding to the split_string_with_limit function. This will return a list of text parts.

    text = "This is a sample sentence for testing the string splitting function."
    limit = 5
    texts = split_string_with_limit(text, limit, encoding)
  4. Use the texts variable to access the split text parts as a list.

Examples

Here's an example usage of the Text Splitter script:

import tiktoken
from split_string import split_string_with_limit

# Obtain encoding
encoding = tiktoken.get_encoding("cl100k_base")

# Input text and token limit
text = "This is a sample sentence for testing the string splitting function."
limit = 5

# Split the text
texts = split_string_with_limit(text, limit, encoding)

# Print the split text parts
for part in texts:
    print(part)

Output:

This is a
sample sentence for
testing the string
splitting function.

License

This project is licensed under the MIT License - see the LICENSE file for details.

import tiktoken


def split_string_with_limit(text: str, limit: int, encoding: tiktoken.Encoding):
    """Split a string into parts of given size without breaking words.

    Args:
        text (str): Text to split.
        limit (int): Maximum number of tokens per part.
        encoding (tiktoken.Encoding): Encoding to use for tokenization.

    Returns:
        list[str]: List of text parts.
    """
    tokens = encoding.encode(text)
    parts = []
    text_parts = []
    current_part = []
    current_count = 0

    for token in tokens:
        current_part.append(token)
        current_count += 1
        if current_count >= limit:
            parts.append(current_part)
            current_part = []
            current_count = 0

    if current_part:
        parts.append(current_part)

    # Convert the tokenized parts back to text
    for part in parts:
        text = [
            encoding.decode_single_token_bytes(token).decode("utf-8", errors="replace")
            for token in part
        ]
        text_parts.append("".join(text))

    return text_parts


if __name__ == "__main__":
    # Example usage
    encoding = tiktoken.get_encoding("cl100k_base")
    text = "This is a sample sentence for testing the string splitting function."
    limit = 5
    texts = split_string_with_limit(text, limit, encoding)
    print(texts)
@filimo
Copy link

filimo commented Sep 23, 2023

Your function seems to incorrectly split the string, causing a multi-byte character to be broken and displayed as "��".

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
texts = split_string_with_limit("которому все равно придется", 100, encoding)

print(texts)

['которому все ра��но придется']

@filimo
Copy link

filimo commented Sep 24, 2023

from typing import List

import tiktoken


def split_string_with_limit(text: str, limit: int, encoding) -> List[str]:
    tokens = encoding.encode(text)
    parts = []
    current_part = []
    current_count = 0

    for token in tokens:
        current_part.append(token)
        current_count += 1

        if current_count >= limit:
            parts.append(current_part)
            current_part = []
            current_count = 0

    if current_part:
        parts.append(current_part)

    text_parts = [encoding.decode(part) for part in parts]

    return text_parts

encoding = tiktoken.get_encoding("cl100k_base")
texts = split_string_with_limit("которому все равно придется", 100, encoding)

print(texts)

['которому все равно придется']

In your code, each token is decoded individually, which could lead to issues with characters that are composed of multiple tokens.
In my suggested version, all tokens in each part are decoded at once, ensuring accurate decoding.
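The corruption can be reproduced with plain UTF-8 bytes, no tokenizer needed. In the sketch below, the byte split stands in for a hypothetical token boundary that falls inside a two-byte Cyrillic character:

```python
# Cyrillic "ра" is four UTF-8 bytes; split them mid-character.
raw = "ра".encode("utf-8")      # b'\xd1\x80\xd0\xb0'
left, right = raw[:3], raw[3:]  # boundary inside the second character

# Decoding each piece on its own (per-token decoding) yields
# replacement characters:
broken = left.decode("utf-8", errors="replace") + right.decode(
    "utf-8", errors="replace"
)

# Joining the bytes first and decoding once (decoding the whole part)
# recovers the text:
fixed = (left + right).decode("utf-8")
```

This is why decoding all tokens of a part in one call is the safe approach.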

@multikitty
Copy link

multikitty commented Feb 8, 2024

def split_string_with_limit(text, limit, encoding):
    tokens = encoding.encode(text)
    chunks = [tokens[i : i + limit] for i in range(0, len(tokens), limit)]
    return [encoding.decode(chunk) for chunk in chunks]

Thanks. @izikeros, @filimo
This is the refactored version of @filimo's split_string_with_limit function.
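The slicing idiom in the refactored version generalizes to any sequence. A minimal sketch (the helper name is illustrative, not part of the gist):

```python
def chunk(seq, limit):
    """Split any sequence into consecutive slices of at most `limit` items."""
    return [seq[i : i + limit] for i in range(0, len(seq), limit)]
```

With a list of token IDs, each chunk can then be passed to encoding.decode, exactly as in the refactored function above.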
