A Python script for splitting text into parts with a limited length in tokens. The script uses the tiktoken library to encode and decode text.
Have you ever needed to split a long text into smaller parts with a specific token limit? The Text Splitter script is here to help! This Python script takes a text input, tokenizes it using the specified encoding, and splits it into parts, ensuring that each part does not exceed the given token limit. It then converts the tokenized parts back into human-readable text.
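The encode, split, decode flow described above can be sketched as follows. This is a minimal illustration and not the gist's actual code: `split_by_token_limit` and `ToyEncoding` are hypothetical names. The function only assumes an object with `encode(text)` returning a list of token ids and `decode(tokens)` returning a string, which is the shape tiktoken encodings expose. The toy word-level encoding is a stand-in so the sketch runs without downloading anything.

```python
import re
from typing import List, Sequence


def split_by_token_limit(text: str, limit: int, encoding) -> List[str]:
    """Split text into parts of at most `limit` tokens each (illustrative sketch)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    tokens = encoding.encode(text)
    # Slice the token list into consecutive chunks, then decode each chunk.
    return [
        encoding.decode(tokens[i:i + limit])
        for i in range(0, len(tokens), limit)
    ]


class ToyEncoding:
    """Hypothetical stand-in for a real encoding: one token per word,
    with any leading whitespace kept as part of the token."""

    def __init__(self) -> None:
        self._ids = {}
        self._pieces = []

    def encode(self, text: str) -> List[int]:
        ids = []
        for piece in re.findall(r"\s*\S+|\s+", text):
            if piece not in self._ids:
                self._ids[piece] = len(self._pieces)
                self._pieces.append(piece)
            ids.append(self._ids[piece])
        return ids

    def decode(self, tokens: Sequence[int]) -> str:
        return "".join(self._pieces[t] for t in tokens)


text = "This is a sample sentence for testing the string splitting function."
parts = split_by_token_limit(text, 5, ToyEncoding())
```

Because the parts are decoded from consecutive slices of one token list, joining them gives back the original text. With a real tiktoken encoding the part boundaries will differ, since its tokens are not whole words.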
- Copy the text of the gist and save it on your drive, or clone it:
git clone https://gist.github.com/17d9c8ab644bd2762acf6b19dd0cea39
- Install the required dependencies. The script relies on the tiktoken library, which can be installed using pip:
pip install tiktoken
To use the text splitter module, follow these steps:
- Import the split_string_with_limit function from the split_string module:
from split_string import split_string_with_limit
- Obtain an encoding using the tiktoken library. You can choose from different pre-trained encodings or create your own.
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
- Provide the text you want to split, the token limit, and the encoding to the split_string_with_limit function. This will return a list of text parts.
text = "This is a sample sentence for testing the string splitting function."
limit = 5
texts = split_string_with_limit(text, limit, encoding)
- Use the texts variable to access the split text parts as a list.
Here's an example usage of the Text Splitter script:
import tiktoken
from split_string import split_string_with_limit
# Obtain encoding
encoding = tiktoken.get_encoding("cl100k_base")
# Input text and token limit
text = "This is a sample sentence for testing the string splitting function."
limit = 5
# Split the text
texts = split_string_with_limit(text, limit, encoding)
# Print the split text parts
for part in texts:
    print(part)
Output:
This is a
sample sentence for
testing the string
splitting function.
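To sanity-check that every part stays within the limit, the token counts can be recomputed per part. The sketch below is an assumption-laden illustration: `token_counts` and `within_limit` are hypothetical helpers, and `str.split` is used as a stand-in tokenizer so the example runs on its own. With tiktoken you would pass `encoding.encode` instead. Re-encoding a decoded part may not reproduce exactly the tokens it was built from, so treat the counts as a check rather than a guarantee.

```python
from typing import Callable, List, Sequence


def token_counts(parts: Sequence[str], encode: Callable[[str], list]) -> List[int]:
    """Return the number of tokens `encode` produces for each part."""
    return [len(encode(part)) for part in parts]


def within_limit(parts: Sequence[str], limit: int, encode: Callable[[str], list]) -> bool:
    """True if no part exceeds `limit` tokens under `encode`."""
    return all(count <= limit for count in token_counts(parts, encode))


# Stand-in tokenizer (assumption): whitespace-separated words via str.split.
parts = [
    "This is a",
    "sample sentence for",
    "testing the string",
    "splitting function.",
]
counts = token_counts(parts, str.split)
ok = within_limit(parts, 5, str.split)
```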
This project is licensed under the MIT License - see the LICENSE file for details.