@Donavan
Created May 16, 2023 23:31
Chunking for summarization

Conversation

  • conversationId: acfa33fb-d353-4ca0-8fb2-23867ea4514c
  • endpoint: openAI
  • title: Python Text Token Counting
  • exportAt: 19:29:58 GMT-0400 (Eastern Daylight Time)

Options

  • endpoint: openAI
  • presetId: null
  • model: gpt-4
  • chatGptLabel: Pyper
  • promptPrefix: I want you to act as a python developer. I will provide some specific information about what I want to achieve and it will be your job to come up with an architecture and python code that follows best practices, properly handles errors and logging where appropriate
  • temperature: 1
  • top_p: 1
  • presence_penalty: 0
  • frequency_penalty: 0
  • title: Python Text Token Counting

History

User: In Python I need to take a block of text either as a string or as an array of paragraphs.

  • If it's a string I need to split it into paragraphs.
  • For each paragraph I need to calculate the token count using a provided tiktoken encoder
  • I then need to combine paragraphs into newline-separated strings whose combined size does not exceed a passed-in token limit
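
For example (a rough sketch using the real tiktoken package; the cl100k_base encoding and the sample paragraphs are only placeholders):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
paragraphs = ["First paragraph.", "Second paragraph.", "Third paragraph."]
counts = [len(enc.encode(p)) for p in paragraphs]
# Goal: greedily pack whole paragraphs into newline-joined chunks whose summed
# token counts stay at or below the limit. If each paragraph here is ~3 tokens,
# a limit of 6 should yield two chunks: the first two paragraphs joined, and
# the third on its own.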

Pyper: Here's an architecture and Python code to accomplish your task:

  1. Import the necessary libraries.
  2. Define a function that takes the input text, whether as a single string or a list of paragraphs, and an optional token limit.
  3. Inside the function, check whether the input is a string or a list of paragraphs.
     a. If it's a string, split it into paragraphs using the appropriate delimiter, such as a blank line between paragraphs.
     b. If it's already a list of paragraphs, use it as-is.
  4. Define the token count calculation function using the provided tiktoken encoder.
  5. Define a function that combines paragraphs into newline-separated strings. This function should respect the token limit passed in as an argument.
  6. Loop through the paragraphs, calculate token counts, and combine paragraphs into strings according to the token limit.
  7. Return the combined paragraphs as a list of newline-separated strings.

Here's the Python code:

from typing import List, Union

import tiktoken

def calculate_token_count(paragraph: str, encoder: tiktoken.Encoding) -> int:
    # encode() returns a list of token ids; its length is the token count
    return len(encoder.encode(paragraph))

def combine_paragraphs(paragraphs: List[str], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    combined = []
    current_tokens = 0
    current_paragraphs = []

    for paragraph in paragraphs:
        token_count = calculate_token_count(paragraph, encoder)

        if current_tokens + token_count <= token_limit:
            current_paragraphs.append(paragraph)
            current_tokens += token_count
        else:
            # Flush the chunk built so far (if any) and start a new one with this paragraph
            if current_paragraphs:
                combined.append('\n'.join(current_paragraphs))
            current_paragraphs = [paragraph]
            current_tokens = token_count

    if current_paragraphs:
        combined.append('\n'.join(current_paragraphs))

    return combined

def process_text(text: Union[str, List[str]], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    if isinstance(text, str):
        # Treat blank lines as paragraph breaks and drop empty entries
        paragraphs = [p.strip() for p in text.strip().split('\n\n') if p.strip()]
    elif isinstance(text, list):
        paragraphs = text
    else:
        raise ValueError("Invalid input type. Expected a string or a list of strings.")

    combined_paragraphs = combine_paragraphs(paragraphs, token_limit, encoder)
    return combined_paragraphs

# Example usage
encoder = tiktoken.get_encoding("cl100k_base")
input_text = "This is a sample paragraph.\n\nThis is another sample paragraph."
token_limit = 10

result = process_text(input_text, token_limit, encoder)
print(result)

This code defines the functions calculate_token_count and combine_paragraphs to perform the specified tasks. It then exposes a single entry point, the process_text function, which takes the input text (either as a string or a list of paragraphs), the token limit, and the tiktoken encoder. The process_text function splits the text into paragraphs, calculates token counts, combines paragraphs as per the given limit, and returns a list of newline-separated strings.
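
As a quick usage sketch (the list input, the limit of 8, and the cl100k_base encoding are illustrative; process_text is the function defined above):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
chunks = process_text(
    ["Paragraph one.", "Paragraph two.", "Paragraph three."],
    token_limit=8,
    encoder=enc,
)
# Each chunk is a newline-joined group of paragraphs that fits within the limit
for i, chunk in enumerate(chunks, 1):
    print(f"chunk {i}: {len(enc.encode(chunk))} tokens")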

User: Let's modify the example code so that we're using tiktoken.encoding_for_model('gpt-4') as the encoder and 3900 tokens as the limit

Pyper: Here's the modified example code using tiktoken.encoding_for_model('gpt-4') as the encoder and 3900 tokens as the limit:

from typing import List, Union

import tiktoken

def calculate_token_count(paragraph: str, encoder: tiktoken.Encoding) -> int:
    # encode() returns a list of token ids; its length is the token count
    return len(encoder.encode(paragraph))

def combine_paragraphs(paragraphs: List[str], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    combined = []
    current_tokens = 0
    current_paragraphs = []

    for paragraph in paragraphs:
        token_count = calculate_token_count(paragraph, encoder)

        if current_tokens + token_count <= token_limit:
            current_paragraphs.append(paragraph)
            current_tokens += token_count
        else:
            # Flush the chunk built so far (if any) and start a new one with this paragraph
            if current_paragraphs:
                combined.append('\n'.join(current_paragraphs))
            current_paragraphs = [paragraph]
            current_tokens = token_count

    if current_paragraphs:
        combined.append('\n'.join(current_paragraphs))

    return combined

def process_text(text: Union[str, List[str]], token_limit: int, encoder: tiktoken.Encoding) -> List[str]:
    if isinstance(text, str):
        # Treat blank lines as paragraph breaks and drop empty entries
        paragraphs = [p.strip() for p in text.strip().split('\n\n') if p.strip()]
    elif isinstance(text, list):
        paragraphs = text
    else:
        raise ValueError("Invalid input type. Expected a string or a list of strings.")

    combined_paragraphs = combine_paragraphs(paragraphs, token_limit, encoder)
    return combined_paragraphs

# Example usage
encoder = tiktoken.encoding_for_model('gpt-4')
input_text = "This is a sample paragraph.\n\nThis is another sample paragraph."
token_limit = 3900

result = process_text(input_text, token_limit, encoder)
print(result)

This code imports the required libraries and updates the encoder assignment to use tiktoken.encoding_for_model('gpt-4'). The token limit is set to 3900 in the example usage.
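
For reference, a minimal sketch of what the encoder lookup does on its own (at the time of writing, gpt-4 resolves to the cl100k_base encoding; the sample sentence is illustrative):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
print(enc.name)  # typically "cl100k_base" for gpt-4
tokens = enc.encode("Chunking keeps each piece of text within the model's context limit.")
print(len(tokens))  # number of tokens in the sample sentence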
