@Donavan
Created May 16, 2023 23:31
Chunking for summarization

Conversation

  • conversationId: acfa33fb-d353-4ca0-8fb2-23867ea4514c
  • endpoint: openAI
  • title: Python Text Token Counting
  • exportAt: 19:29:58 GMT-0400 (Eastern Daylight Time)

Options

  • endpoint: openAI
  • presetId: null
  • model: gpt-4
  • chatGptLabel: Pyper
  • promptPrefix: I want you to act as a python developer. I will provide some specific information about what I want to achieve and it will be your job to come up with an architecture and python code that follows best practices, properly handles errors and logging where appropriate
  • temperature: 1
  • top_p: 1
  • presence_penalty: 0
  • frequency_penalty: 0
  • title: Python Text Token Counting

History

User: In Python I need to take a block of text either as a string or as an array of paragraphs.

  • If it's a string I need to split it into paragraphs.
  • For each paragraph I need to calculate the token count using a provided tiktoken encoder
  • I then need to combine paragraphs into newline-separated strings whose combined size does not exceed a passed-in token limit
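
For example (a rough sketch using the real tiktoken package; the cl100k_base encoding and the sample paragraphs are only placeholders):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
paragraphs = ["First paragraph.", "Second paragraph.", "Third paragraph."]
counts = [len(enc.encode(p)) for p in paragraphs]
# Goal: greedily pack whole paragraphs into newline-joined chunks whose summed
# token counts stay at or below the limit. If each paragraph here is ~3 tokens,
# a limit of 6 should yield two chunks: the first two paragraphs joined, and
# the third on its own.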

Pyper: Here's an architecture and Python code to accomplish your task:

  1. Import the necessary libraries.
  2. Define a function that takes the input text, whether as a single string or a list of paragraphs, and an optional token limit.
  3. Inside the function, check whether the input is a string or a list of paragraphs.
     a. If it's a string, split it into paragraphs using the appropriate delimiter, such as a blank line between paragraphs.
     b. If it's already a list of paragraphs, use it as-is.
  4. Define the token count calculation function using the provided tiktoken encoder.
  5. Define a function that combines paragraphs into newline-separated strings. This function should respect the token limit passed in as an argument.
  6. Loop through the paragraphs, calculate token counts, and combine paragraphs into strings according to the token limit.
  7. Return the combined paragraphs as a list of newline-separated strings.

Here's the Python code:

from typing import List, Union

import tiktoken

def calculate_token_count(paragraph: str, encoder: tiktoken.Encoding) -> int:
    # encode() returns a list of token ids; its length is the token count
    return len(encoder.encode(paragraph))

def combine_paragraphs(paragraphs: List[str], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    combined = []
    current_tokens = 0
    current_paragraphs = []

    for paragraph in paragraphs:
        token_count = calculate_token_count(paragraph, encoder)

        if current_tokens + token_count <= token_limit:
            current_paragraphs.append(paragraph)
            current_tokens += token_count
        else:
            # Flush the chunk built so far (if any) and start a new one with this paragraph
            if current_paragraphs:
                combined.append('\n'.join(current_paragraphs))
            current_paragraphs = [paragraph]
            current_tokens = token_count

    if current_paragraphs:
        combined.append('\n'.join(current_paragraphs))

    return combined

def process_text(text: Union[str, List[str]], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    if isinstance(text, str):
        # Treat blank lines as paragraph breaks and drop empty entries
        paragraphs = [p.strip() for p in text.strip().split('\n\n') if p.strip()]
    elif isinstance(text, list):
        paragraphs = text
    else:
        raise ValueError("Invalid input type. Expected a string or a list of strings.")

    combined_paragraphs = combine_paragraphs(paragraphs, token_limit, encoder)
    return combined_paragraphs

# Example usage
encoder = tiktoken.get_encoding("cl100k_base")
input_text = "This is a sample paragraph.\n\nThis is another sample paragraph."
token_limit = 10

result = process_text(input_text, token_limit, encoder)
print(result)

This code defines the functions calculate_token_count and combine_paragraphs to perform the specified tasks. It then exposes a single entry point, the process_text function, which takes the input text (either as a string or a list of paragraphs), the token limit, and the tiktoken encoder. The process_text function splits the text into paragraphs, calculates token counts, combines paragraphs as per the given limit, and returns a list of newline-separated strings.
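
As a quick usage sketch (the list input, the limit of 8, and the cl100k_base encoding are illustrative; process_text is the function defined above):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
chunks = process_text(
    ["Paragraph one.", "Paragraph two.", "Paragraph three."],
    token_limit=8,
    encoder=enc,
)
# Each chunk is a newline-joined group of paragraphs that fits within the limit
for i, chunk in enumerate(chunks, 1):
    print(f"chunk {i}: {len(enc.encode(chunk))} tokens")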

User: Let's modify the example code so that we're using tiktoken.encoding_for_model('gpt-4') as the encoder and 3900 tokens as the limit

Pyper: Here's the modified example code using tiktoken.encoding_for_model('gpt-4') as the encoder and 3900 tokens as the limit:

from typing import List, Union

import tiktoken

def calculate_token_count(paragraph: str, encoder: tiktoken.Encoding) -> int:
    # encode() returns a list of token ids; its length is the token count
    return len(encoder.encode(paragraph))

def combine_paragraphs(paragraphs: List[str], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    combined = []
    current_tokens = 0
    current_paragraphs = []

    for paragraph in paragraphs:
        token_count = calculate_token_count(paragraph, encoder)

        if current_tokens + token_count <= token_limit:
            current_paragraphs.append(paragraph)
            current_tokens += token_count
        else:
            # Flush the chunk built so far (if any) and start a new one with this paragraph
            if current_paragraphs:
                combined.append('\n'.join(current_paragraphs))
            current_paragraphs = [paragraph]
            current_tokens = token_count

    if current_paragraphs:
        combined.append('\n'.join(current_paragraphs))

    return combined

def process_text(text: Union[str, List[str]], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    if isinstance(text, str):
        # Treat blank lines as paragraph breaks and drop empty entries
        paragraphs = [p.strip() for p in text.strip().split('\n\n') if p.strip()]
    elif isinstance(text, list):
        paragraphs = text
    else:
        raise ValueError("Invalid input type. Expected a string or a list of strings.")

    combined_paragraphs = combine_paragraphs(paragraphs, token_limit, encoder)
    return combined_paragraphs

# Example usage
encoder = tiktoken.encoding_for_model('gpt-4')
input_text = "This is a sample paragraph.\n\nThis is another sample paragraph."
token_limit = 3900

result = process_text(input_text, token_limit, encoder)
print(result)

This code imports the required libraries and updates the encoder assignment to use tiktoken.encoding_for_model('gpt-4'). The token limit is set to 3900 in the example usage.
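
For reference, a minimal sketch of what the encoder lookup does on its own (at the time of writing, gpt-4 resolves to the cl100k_base encoding; the sample sentence is illustrative):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
print(enc.name)  # typically "cl100k_base" for gpt-4
tokens = enc.encode("Chunking keeps each piece of text within the model's context limit.")
print(len(tokens))  # number of tokens in the sample sentence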
