@izikeros
Last active March 26, 2024 18:11
[split text fixed tokens] Split text into parts with limited length in tokens #llm #tokens #python

Text Splitter


A Python script for splitting text into parts with a controlled (limited) length in tokens. The script uses the tiktoken library to encode and decode text.

Introduction

Have you ever needed to split a long text into smaller parts with a specific token limit? The Text Splitter script is here to help! This Python script takes a text input, tokenizes it using the specified encoding, and splits it into parts, ensuring that each part does not exceed the given token limit. It then converts the tokenized parts back into human-readable text.
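The grouping step described above is tokenizer-agnostic. As a minimal sketch (using a plain list as a stand-in for token IDs, purely for illustration; the helper name is not part of the gist), it amounts to a greedy loop:

```python
def group_tokens(tokens, limit):
    """Greedily pack items into consecutive parts of at most `limit` items."""
    parts, current = [], []
    for token in tokens:
        current.append(token)
        if len(current) >= limit:  # part is full, start a new one
            parts.append(current)
            current = []
    if current:  # flush the trailing, possibly shorter, part
        parts.append(current)
    return parts
```

With real token IDs, each part would then be decoded back to text; the script below does exactly that.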

Installation

  1. Copy the text of the gist and save it on your drive, or clone it:

     git clone https://gist.github.com/17d9c8ab644bd2762acf6b19dd0cea39

  2. Install the required dependency. The script relies on the tiktoken library, which can be installed using pip:

    pip install tiktoken

Usage

To use the text splitter module, follow these steps:

  1. Import the split_string_with_limit function from the module:

    from split_string import split_string_with_limit
  2. Obtain an encoding using the tiktoken library. You can choose from different pre-trained encodings or create your own.

    import tiktoken
    
    encoding = tiktoken.get_encoding("cl100k_base")
  3. Provide the text you want to split, the token limit, and the encoding to the split_string_with_limit function. This will return a list of text parts.

    text = "This is a sample sentence for testing the string splitting function."
    limit = 5
    texts = split_string_with_limit(text, limit, encoding)
  4. Use the texts variable to access the split text parts as a list.

Examples

Here's an example usage of the Text Splitter script:

import tiktoken
from split_string import split_string_with_limit

# Obtain encoding
encoding = tiktoken.get_encoding("cl100k_base")

# Input text and token limit
text = "This is a sample sentence for testing the string splitting function."
limit = 5

# Split the text
texts = split_string_with_limit(text, limit, encoding)

# Print the split text parts
for part in texts:
    print(part)

Output:

This is a
sample sentence for
testing the string
splitting function.

License

This project is licensed under the MIT License - see the LICENSE file for details.

import tiktoken


def split_string_with_limit(text: str, limit: int, encoding: tiktoken.Encoding):
    """Split a string into parts of given size without breaking words.

    Args:
        text (str): Text to split.
        limit (int): Maximum number of tokens per part.
        encoding (tiktoken.Encoding): Encoding to use for tokenization.

    Returns:
        list[str]: List of text parts.
    """
    tokens = encoding.encode(text)
    parts = []
    text_parts = []
    current_part = []
    current_count = 0

    for token in tokens:
        current_part.append(token)
        current_count += 1
        if current_count >= limit:
            parts.append(current_part)
            current_part = []
            current_count = 0

    if current_part:
        parts.append(current_part)

    # Convert the tokenized parts back to text
    for part in parts:
        text = [
            encoding.decode_single_token_bytes(token).decode("utf-8", errors="replace")
            for token in part
        ]
        text_parts.append("".join(text))

    return text_parts


if __name__ == "__main__":
    # Example usage
    encoding = tiktoken.get_encoding("cl100k_base")
    text = "This is a sample sentence for testing the string splitting function."
    limit = 5
    texts = split_string_with_limit(text, limit, encoding)
    print(texts)
@filimo
Copy link

filimo commented Sep 23, 2023

Your function seems to incorrectly split the string, causing a multi-byte character to be broken and displayed as "��".

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
texts = split_string_with_limit("которому все равно придется", 100, encoding)

print(texts)

['которому все ра��но придется']

@filimo
Copy link

filimo commented Sep 24, 2023

from typing import List

import tiktoken


def split_string_with_limit(text: str, limit: int, encoding) -> List[str]:
    tokens = encoding.encode(text)
    parts = []
    current_part = []
    current_count = 0

    for token in tokens:
        current_part.append(token)
        current_count += 1

        if current_count >= limit:
            parts.append(current_part)
            current_part = []
            current_count = 0

    if current_part:
        parts.append(current_part)

    text_parts = [encoding.decode(part) for part in parts]

    return text_parts

encoding = tiktoken.get_encoding("cl100k_base")
texts = split_string_with_limit("которому все равно придется", 100, encoding)

print(texts)

['которому все равно придется']

In your code, each token is decoded individually, which could lead to issues with characters that are composed of multiple tokens.
In my suggested version, all tokens in each part are decoded at once, ensuring accurate decoding.
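The corruption can be reproduced with plain UTF-8 bytes, no tokenizer needed. In the sketch below, the byte split stands in for a hypothetical token boundary that falls inside a two-byte Cyrillic character:

```python
# Cyrillic "ра" is four UTF-8 bytes; split them mid-character.
raw = "ра".encode("utf-8")      # b'\xd1\x80\xd0\xb0'
left, right = raw[:3], raw[3:]  # boundary inside the second character

# Decoding each piece on its own (per-token decoding) yields
# replacement characters:
broken = left.decode("utf-8", errors="replace") + right.decode(
    "utf-8", errors="replace"
)

# Joining the bytes first and decoding once (decoding the whole part)
# recovers the text:
fixed = (left + right).decode("utf-8")
```

This is why decoding all tokens of a part in one call is the safe approach.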

@multikitty
Copy link

multikitty commented Feb 8, 2024

def split_string_with_limit(text, limit, encoding):
    tokens = encoding.encode(text)
    chunks = [tokens[i : i + limit] for i in range(0, len(tokens), limit)]
    return [encoding.decode(chunk) for chunk in chunks]

Thanks. @izikeros, @filimo
This is the refactored version of @filimo's split_string_with_limit function.
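The slicing idiom in the refactored version generalizes to any sequence. A minimal sketch (the helper name is illustrative, not part of the gist):

```python
def chunk(seq, limit):
    """Split any sequence into consecutive slices of at most `limit` items."""
    return [seq[i : i + limit] for i in range(0, len(seq), limit)]
```

With a list of token IDs, each chunk can then be passed to encoding.decode, exactly as in the refactored function above.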
