@schalkneethling
Created January 21, 2023 21:24

Understanding the impact on memory and speed of different tokenizers - Flexsearch

I asked ChatGPT the following questions:

  • Memory Factor (n = length of word) what is the memory impact of * 1. The value of n is firefox
  • Memory Factor (n = length of word) what is the memory impact of * n. The value of n is firefox
  • Memory Factor (n = length of word) what is the memory impact of * 2n - 1. The value of n is firefox
  • Memory Factor (n = length of word) what is the memory impact of * n * (n - 1). The value of n is firefox

It responded with the following:

  • The memory impact of 1 when the value of n is “firefox” would be minimal, it would be the minimal amount of memory required to store a single character.
  • The memory impact of n when the value of n is “firefox” would be 8 bytes, assuming that each character in the word “firefox” is stored using 1 byte of memory.
  • The memory impact of 2n - 1 when the value of n is “firefox” would be 15 bytes, assuming that each character in the word “firefox” is stored using 1 byte of memory.
  • The memory impact of n * (n - 1) when the value of n is “firefox” would be 28 bytes, assuming that each character in the word “firefox” is stored using 1 byte of memory. It is important to note that these are rough estimates as they do not take into consideration any additional memory overhead for storing the word in memory or any data type or encoding used. It is also important to note that the equation 2n-1 and n*(n-1) are not standard memory factors used to calculate the memory usage, they are custom equations.

Thoughts? For additional context, I am trying to understand/make sense of the following: https://github.com/nextapps-de/flexsearch#tokenizer-prefix-search
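To make the README's multipliers concrete, here is a minimal TypeScript sketch (my own, not from FlexSearch) that evaluates each "Memory Factor" expression for n = 7, the length of firefox. Reading the factors as counts of indexed partial keys, rather than bytes, is my assumption:

```ts
// The four "Memory Factor" expressions from the FlexSearch README,
// evaluated for a word of length n. Assumption: they count indexed
// partial keys (tokens), not bytes.
const n = "firefox".length; // 7

const factors = {
  strict: 1,          // the whole word only
  forward: n,         // f, fi, fir, ..., firefox
  reverse: 2 * n - 1, // every prefix plus every proper suffix
  full: n * (n - 1),  // the README's stated factor for full partial matching
};

console.log(factors); // { strict: 1, forward: 7, reverse: 13, full: 42 }
```

On that reading, the factors describe how many entries the index stores per word; the byte cost then depends on the length of each entry, which is what the comments below try to work out.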

@schalkneethling (Author) commented Jan 21, 2023

For this one:

Memory Factor (n = length of word) what is the memory impact of * 1. The value of n is firefox

The answer provided is not helpful at all. Looking at http://eckstein.rutgers.edu/mis/handouts/storage-calcs.pdf, I would gather that for the strict tokenizer, which indexes only whole words, the memory impact would be 1 byte (8 bits) per character. So if the index consists of just the word firefox, the memory used will be 7 bytes (56 bits).

If you index the sentence "Firefox is a Gecko based browser" using the strict tokenizer, it would use up 27 bytes (216 bits) of memory.
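A quick sketch of that estimate (assuming plain ASCII at 1 byte per character and ignoring any per-entry index overhead, which the PDF above discusses):

```ts
// Strict tokenizer: each whole word is indexed once,
// at an assumed 1 byte per ASCII character.
const sentence = "Firefox is a Gecko based browser";
const words = sentence.toLowerCase().split(/\s+/);

const bytes = words.reduce((sum, word) => sum + word.length, 0);
console.log(words); // [ 'firefox', 'is', 'a', 'gecko', 'based', 'browser' ]
console.log(bytes); // 27 (i.e. 216 bits)
```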

@schalkneethling (Author)

Assuming the above is correct, the following would hold for a forward tokenizer. The word firefox would be tokenized into the following sequences:

f (1 byte)
fi (2 bytes)
fir (3 bytes)
fire (4 bytes)
firef (5 bytes)
firefo (6 bytes)
firefox (7 bytes)

Therefore, the word firefox will use a total of 28 bytes (224 bits) of memory, which is, interestingly, more memory than the entire sentence takes up with a strict tokenizer. Assuming the above holds true, the following will be the impact of using a reverse tokenizer:

f (1 byte)
fi (2 bytes)
fir (3 bytes)
fire (4 bytes)
firef (5 bytes)
firefo (6 bytes)
firefox (7 bytes)
irefox (6 bytes)
refox (5 bytes)
efox (4 bytes)
fox (3 bytes)
ox (2 bytes)
x (1 byte)

Therefore, the word firefox will use a total of 49 bytes (392 bits) of memory. That is 21 bytes (168 bits) more than with the forward tokenizer.
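To double-check both totals, here is a small TypeScript sketch of the two expansions listed above. This is my own illustration of the sequences, not FlexSearch's actual implementation, and it again assumes 1 byte per ASCII character with no index overhead:

```ts
// Forward tokenizer: every prefix of the word.
function forwardTokens(word: string): string[] {
  return Array.from(word, (_, i) => word.slice(0, i + 1)); // f, fi, ..., firefox
}

// Reverse tokenizer: every prefix plus every proper suffix.
function reverseTokens(word: string): string[] {
  const suffixes = Array.from(word, (_, i) => word.slice(i)).slice(1); // irefox ... x
  return [...forwardTokens(word), ...suffixes];
}

const byteCount = (tokens: string[]) =>
  tokens.reduce((sum, token) => sum + token.length, 0);

console.log(byteCount(forwardTokens("firefox"))); // 28 bytes (224 bits)
console.log(byteCount(reverseTokens("firefox"))); // 49 bytes (392 bits)
```

Note that the reverse expansion produces 13 tokens, which matches the README's 2n - 1 factor for n = 7.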
