@iCUE-Solutions
Last active October 7, 2023 12:12
Token count for embedding
# FILEPATH: /tiktoken
import tiktoken


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """
    Returns the number of tokens in a text string.

    Args:
        string (str): The text string to count tokens in.
        encoding_name (str): The name of the encoding to use.

    Returns:
        int: The number of tokens in the text string.
    """
    # Look up the tokenizer by name, then count the token ids it produces.
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


# Print the count so the result is visible when run as a script.
print(num_tokens_from_string("tiktoken is great!", "cl100k_base"))
For second-generation embedding models like text-embedding-ada-002, use the cl100k_base encoding.
More details and example code are in the OpenAI Cookbook guide "How to count tokens with tiktoken":
https://platform.openai.com/docs/guides/embeddings/limitations-risks#:~:text=how%20to%20count%20tokens%20with%20tiktoken
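A token count is usually computed in order to keep an input under the embedding model's limit. The sketch below is not part of the original snippet: it shows one way to cut a string down to a token budget before embedding it. The helper name truncate_to_token_limit and the default of 8191 tokens (the input limit OpenAI has documented for text-embedding-ada-002) are assumptions to check against the current documentation. Any object with encode and decode methods can be passed as the encoding; if none is given, the cl100k_base encoding from tiktoken is loaded.

```python
from typing import Any, Optional


def truncate_to_token_limit(
    text: str,
    max_tokens: int = 8191,  # assumed limit for text-embedding-ada-002
    encoding: Optional[Any] = None,
) -> str:
    """Return text unchanged if it fits in max_tokens, else a truncated copy.

    Hypothetical helper for illustration; not part of tiktoken.
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")

    if encoding is None:
        # Imported lazily so a custom encoding can be used without tiktoken.
        import tiktoken

        encoding = tiktoken.get_encoding("cl100k_base")

    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text

    # Keep the first max_tokens token ids and turn them back into text.
    # Cutting on a token boundary can split a multi-byte character, so the
    # tail of the result may not match the original text exactly.
    return encoding.decode(tokens[:max_tokens])
```

Calling truncate_to_token_limit(long_text) with no other arguments uses cl100k_base and the assumed 8191-token budget; pass a smaller max_tokens to leave headroom.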