- conversationId: acfa33fb-d353-4ca0-8fb2-23867ea4514c
- endpoint: openAI
- title: Python Text Token Counting
- exportAt: 19:29:58 GMT-0400 (Eastern Daylight Time)
- presetId: null
- model: gpt-4
- chatGptLabel: Pyper
- promptPrefix: I want you to act as a python developer. I will provide some specific information about what I want to achieve and it will be your job to come up with an architecture and python code that follows best practices, properly handles errors and logging where appropriate
- temperature: 1
- top_p: 1
- presence_penalty: 0
- frequency_penalty: 0
User: In Python I need to take a block of text either as a string or as an array of paragraphs.
- If it's a string I need to split it into paragraphs.
- For each paragraph I need to calculate the token count using a provided tiktoken encoder.
- I then need to combine paragraphs into newline-separated strings whose combined size does not exceed a passed-in token limit.
Pyper: Here's an architecture and Python code to accomplish your task:
- Import the necessary libraries.
- Define a function that takes the input text, whether as a single string or a list of paragraphs, along with a token limit and the tiktoken encoder.
- Inside the function, check if the input is a string or a list of paragraphs.
  a. If it's a string, split it into paragraphs using the appropriate delimiter, such as a newline character.
  b. If it's already a list of paragraphs, use it as-is.
- Define the token count calculation function using the provided tiktoken encoder.
- Define a function that combines paragraphs into newline-separated strings. This function should respect the token limit passed in as an argument.
- Loop through the paragraphs, calculate token counts, and combine paragraphs into strings according to the token limit.
- Return the combined paragraphs as a list of newline-separated strings.
Here's the Python code:
from typing import List, Union

import tiktoken


def calculate_token_count(paragraph: str, encoder: tiktoken.Encoding) -> int:
    # encode() returns a list of token ids; its length is the token count.
    return len(encoder.encode(paragraph))


def combine_paragraphs(paragraphs: List[str], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    combined = []
    current_tokens = 0
    current_paragraphs = []
    for paragraph in paragraphs:
        # Note: the "\n" separators added by join() are not counted here.
        token_count = calculate_token_count(paragraph, encoder)
        if current_tokens + token_count <= token_limit:
            current_paragraphs.append(paragraph)
            current_tokens += token_count
        else:
            # Flush the current chunk, unless it is still empty (this happens
            # when the very first paragraph already exceeds the limit).
            if current_paragraphs:
                combined.append('\n'.join(current_paragraphs))
            # A paragraph larger than the limit still becomes its own chunk.
            current_paragraphs = [paragraph]
            current_tokens = token_count
    if current_paragraphs:
        combined.append('\n'.join(current_paragraphs))
    return combined


def process_text(text: Union[str, List[str]], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    if isinstance(text, str):
        # Split on newlines and drop the blank lines left by "\n\n" separators.
        paragraphs = [p.strip() for p in text.split('\n') if p.strip()]
    elif isinstance(text, list):
        paragraphs = text
    else:
        raise ValueError("Invalid input type. Expected a string or a list of strings.")
    return combine_paragraphs(paragraphs, token_limit, encoder)


# Example usage
encoder = tiktoken.get_encoding("cl100k_base")
input_text = "This is a sample paragraph.\n\nThis is another sample paragraph."
token_limit = 10
result = process_text(input_text, token_limit, encoder)
print(result)
This code defines the functions calculate_token_count and combine_paragraphs to perform the specified tasks. It then exposes a single entry point, the process_text function, which takes the input text (either as a string or a list of paragraphs), the token limit, and the tiktoken encoder. The process_text function splits the text into paragraphs, calculates token counts, combines paragraphs as per the given limit, and returns a list of newline-separated strings.
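One caveat: combine_paragraphs counts each paragraph separately, so the newline characters added when joining are not included in the total, and a chunk can end up slightly over the limit. Below is a minimal sketch of one way to account for that, by measuring each candidate chunk after joining. The names pack_paragraphs, SupportsEncode, and CharEncoder are hypothetical and not part of tiktoken. CharEncoder is a stand-in (one token per character) so the sketch runs without tiktoken installed; any object with an encode(str) method that returns a list, such as a tiktoken encoding, should work in its place.

```python
from typing import List, Protocol


class SupportsEncode(Protocol):
    """Anything with encode(str) -> list of tokens, e.g. a tiktoken encoding."""

    def encode(self, text: str) -> List[int]: ...


class CharEncoder:
    """Hypothetical stand-in encoder: one token per character."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]


def pack_paragraphs(paragraphs: List[str], token_limit: int, encoder: SupportsEncode) -> List[str]:
    """Greedily pack paragraphs, measuring each candidate chunk after joining."""
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        # Build the chunk as it would look with this paragraph appended.
        candidate = paragraph if not current else current + "\n" + paragraph
        if len(encoder.encode(candidate)) <= token_limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # An oversized paragraph still becomes its own chunk.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


chunks = pack_paragraphs(["aaaa", "bbbb", "cc"], token_limit=8, encoder=CharEncoder())
print(chunks)  # ['aaaa', 'bbbb\ncc']
```

With the stand-in encoder, "aaaa" and "bbbb" total 8 tokens on their own but 9 once joined with a newline, so they land in separate chunks. The trade-off is that each candidate is re-encoded, which costs more than summing per-paragraph counts.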
User: Let's modify the example code so that we're using tiktoken.encoding_for_model('gpt-4') as the encoder and 3900 tokens as the limit.
Pyper: Here's the modified example code using tiktoken.encoding_for_model('gpt-4') as the encoder and 3900 tokens as the limit:
from typing import List, Union

import tiktoken


def calculate_token_count(paragraph: str, encoder: tiktoken.Encoding) -> int:
    # encode() returns a list of token ids; its length is the token count.
    return len(encoder.encode(paragraph))


def combine_paragraphs(paragraphs: List[str], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    combined = []
    current_tokens = 0
    current_paragraphs = []
    for paragraph in paragraphs:
        # Note: the "\n" separators added by join() are not counted here.
        token_count = calculate_token_count(paragraph, encoder)
        if current_tokens + token_count <= token_limit:
            current_paragraphs.append(paragraph)
            current_tokens += token_count
        else:
            # Flush the current chunk, unless it is still empty (this happens
            # when the very first paragraph already exceeds the limit).
            if current_paragraphs:
                combined.append('\n'.join(current_paragraphs))
            # A paragraph larger than the limit still becomes its own chunk.
            current_paragraphs = [paragraph]
            current_tokens = token_count
    if current_paragraphs:
        combined.append('\n'.join(current_paragraphs))
    return combined


def process_text(text: Union[str, List[str]], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    if isinstance(text, str):
        # Split on newlines and drop the blank lines left by "\n\n" separators.
        paragraphs = [p.strip() for p in text.split('\n') if p.strip()]
    elif isinstance(text, list):
        paragraphs = text
    else:
        raise ValueError("Invalid input type. Expected a string or a list of strings.")
    return combine_paragraphs(paragraphs, token_limit, encoder)


# Example usage
encoder = tiktoken.encoding_for_model('gpt-4')
input_text = "This is a sample paragraph.\n\nThis is another sample paragraph."
token_limit = 3900
result = process_text(input_text, token_limit, encoder)
print(result)
This code imports the required libraries and updates the encoder assignment to use tiktoken.encoding_for_model('gpt-4'). The token limit is set to 3900 in the example usage.
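Note that a single paragraph longer than the token limit is still emitted as one chunk. If that matters for your use case, here is a rough sketch of one possible approach: encode the paragraph, slice the token list, and decode each slice. The names split_oversized and CharEncoder are hypothetical. CharEncoder is a stand-in (one token per character) that mimics the encode/decode shape of a tiktoken encoding so the sketch runs on its own. With a real tiktoken encoding, a slice boundary may fall in the middle of a word or a multi-byte character, so treat this as a starting point only.

```python
from typing import List


class CharEncoder:
    """Hypothetical stand-in with encode/decode methods, one token per character."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]

    def decode(self, tokens: List[int]) -> str:
        return "".join(chr(t) for t in tokens)


def split_oversized(paragraph: str, token_limit: int, encoder) -> List[str]:
    """Split one paragraph into pieces of at most token_limit tokens each."""
    if token_limit <= 0:
        raise ValueError("token_limit must be positive")
    tokens = encoder.encode(paragraph)
    # Step through the token list in fixed-size windows and decode each one.
    return [
        encoder.decode(tokens[start:start + token_limit])
        for start in range(0, len(tokens), token_limit)
    ]


pieces = split_oversized("abcdefghij", token_limit=4, encoder=CharEncoder())
print(pieces)  # ['abcd', 'efgh', 'ij']
```

The resulting pieces could then be passed through combine_paragraphs like any other paragraphs.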