Last active October 31, 2023 10:22
Exploring Tokenizers from Hugging Face

Hugging Face (HF) has made NLP (Natural Language Processing) a breeze. In this post, we are going to take a hands-on look at tokenization with the help of the Tokenizers library. We are going to load a real-world dataset containing 10-K filings of public firms and see how to train a tokenizer from scratch based on the BERT tokenization scheme. In the process we will understand tokenization in detail and pick up some gotchas to keep an eye out for.

Background on NLP (Optional)

If you already have an understanding of the NLP pipeline, you can safely skip this section.

For any NLP task, one of the first steps is pre-processing the data so that it can be fed into our NLP models. For those new to NLP, the general pipeline for any NLP task (text classification, question answering, etc.) is as follows:

  • Pre-process
    • Get the data ready into a format that can be passed on to the NLP model.
  • Train
    • Train the model.
  • Evaluate
    • Using metrics suitable for a given task, evaluate how well the trained model performs on some test data.
  • Predict
    • Once we are satisfied with our trained model, make some predictions.

Of course this is a very broad overview of the steps and there is a lot going on in each step. As mentioned before, in this post we will focus on the first step - pre-processing the data and how we can leverage Hugging Face Tokenizers to achieve it.


You can very easily install the Tokenizers library in a new Python environment using:

pip install tokenizers

You will also need the Datasets library to load the data we will be working with.

pip install datasets


Before we can do anything with the HF Tokenizers library, we need data to work with. I will be working with a dataset I created on HF but the steps can be applied to any dataset.

# Load our dataset
from datasets import load_dataset

# Most datasets on HF are split into test/train/validate. This is useful when training our
# NLP model. However, during tokenization we want the combined data from all 3. For this
# we pass the "train+test+validation" to the split parameter so that the load_dataset()
# function returns a Dataset object instead of a DatasetDict object and at the same time 
# combines the splits together. 

# ds = load_dataset('JanosAudran/financial-reports-sec', 'small_lite', split="train+test+validation")

ds = load_dataset('JanosAudran/financial-reports-sec', 'large_lite', split="train+test+validation")

Now that we have loaded our dataset, let's check it out. We can gather some info on the dataset size and structure using:

# Size
print(f"Size of the dataset {ds.dataset_size / 1024 ** 3:.2f} GB.")
# 'Size of the dataset 21.09 GB.'

# Let's check the features in the dataset.
print(ds)
# Dataset({
#     features: ['cik', 'sentence', 'section', 'labels', 'filingDate', 'docID', 'sentenceID', 'sentenceCount'],
#     num_rows: 71866962
# })

Note: Your dataset size may be different depending on whether you loaded the small version or not.

This dataset is almost 21 GB in size and contains over 71 million observations. It also has 8 features. We can think of features as columns/fields in a typical database for our current purposes, but note that they can have added functionality depending on the type of feature.

We are interested in only one of the features, 'sentence', which contains a single sentence from a 10-K filing. Let's check an example sentence from this dataset.

# An example sentence from the dataset.
example_sentence = ds[100]['sentence']
# 'Our Expeditionary Services segment competes with a number of divisions of large corporations and other large and small companies.'

Now that our dataset is loaded we can look at the pre-processing step in more detail.


The thing with textual data is that it can be all over the place, so cleaning the text is an important step. After all, the following two sentences convey the same meaning, yet to a machine they are two very different strings.

Héllò? What aré yòü üptò tòday?


Hello? What are you upto today?

Next, a raw string is hard for a machine to understand. In the example above, a machine has no way of knowing whether 'you' or 'you upto' is a single word. So, we need to pass on the structure of words explicitly.

Lastly, machines like numbers. We need to convert the sequence of words into a sequence of numbers.

Each of these steps is a part of a general pipeline:

  • Normalization
  • Pre-tokenization
  • Tokenization
  • Post-processing

So, when we say pre-process the data or tokenize the data what we actually have in mind are the steps in the pipeline above. Let's see exactly what they are and how they help to get the data ready into a format the NLP models can work with.

Step 1: Normalization

This step helps to manage the plethora of Unicode characters that might be present in our text or take care of accented characters. We want to have our text in a consistent format.

Usually, Unicode normalization is applied which is a topic in itself and outside the scope of this post. We can very easily apply this step as follows:

# Step 1: Load our normalizer.
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

# We create our normalizer which will apply Unicode normalization and strip accents
normalizer = normalizers.Sequence([NFD(), StripAccents()])

normalizer.normalize_str("Héllò? What aré yòü üptò tòday?")
# "Hello? What are you upto today?"

# Example on our dataset
normalizer.normalize_str(example_sentence)
# "Our Expeditionary Services segment competes with a number of divisions of large corporations and other large and small companies."

In the code above we have used two normalizers, NFD and StripAccents. And as you can see we can very easily chain these two together using the Sequence class.
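To get a feel for what this pair of normalizers does, here is a minimal sketch using only Python's standard library. It is an approximation for illustration, not the Tokenizers implementation: the helper name strip_accents is made up, and treating every character in a Unicode "Mark" category as an accent is an assumption.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters with NFD, then drop the combining marks."""
    # NFD splits 'é' into the base letter 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: every character in a Mark category (Mn, Mc, Me) is an accent.
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


print(strip_accents("Héllò? What aré yòü üptò tòday?"))
# Hello? What are you upto today?
```

The order matters: without the NFD step an accented letter is a single code point, so there would be no separate accent mark to remove.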

Step 2: Pre-tokenization

Here we want to split our long string into individual words. We can split on whitespace, punctuation or even more specific ones such as ByteLevel or BertPreTokenizer.

A point to note here is that the tokenization pipeline is going to be heavily influenced by the NLP model that will be used subsequently. For example, BERT has its own tokenization pipeline and in order to use the BERT model we must follow the same pipeline. This means that if we plan to fine-tune the pre-trained BERT, the normalization, word splitting, etc. that were used while training BERT must be applied to our data as well. If we are instead going to train BERT from scratch, then we can follow our own design.

The code to pre-tokenize our sentence is as follows:

# Step 2: Load our pre-tokenizer
from tokenizers.pre_tokenizers import Whitespace

# We create our pre-tokenizer which will split based on the regex \w+|[^\w\s]+
pre_tokenizer = Whitespace()

pre_tokenizer.pre_tokenize_str("Hello! What are you upto today?.")
# [('Hello', (0, 5)),
#  ('!', (5, 6)),
#  ('What', (7, 11)),
#  ('are', (12, 15)),
#  ('you', (16, 19)),
#  ('upto', (20, 24)),
#  ('today', (25, 30)),
#  ('?.', (30, 32))]

# Example on our dataset
pre_tokenizer.pre_tokenize_str(example_sentence)
# [('Our', (0, 3)),
#  ('Expeditionary', (4, 17)),
#  ('Services', (18, 26)),
#  ('segment', (27, 34)),
#  ('competes', (35, 43)),
#  ('with', (44, 48)),
#  ('a', (49, 50)),
#  ('number', (51, 57)),
#  ('of', (58, 60)),
#  ('divisions', (61, 70)),
#  ('of', (71, 73)),
#  ('large', (74, 79)),
#  ('corporations', (80, 92)),
#  ('and', (93, 96)),
#  ('other', (97, 102)),
#  ('large', (103, 108)),
#  ('and', (109, 112)),
#  ('small', (113, 118)),
#  ('companies', (119, 128)),
#  ('.', (128, 129))]

As we can see the pre-tokenizer splits our sentence based on whitespace and punctuation. It also returns the offset of the words that it has generated in our sentence.
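The regex mentioned in the code comment above can be tried directly with Python's re module. The sketch below is only meant to show where the words and offsets come from; the function name pre_tokenize is made up for illustration and this is not how the library is implemented internally.

```python
import re

# One or more word characters, or one or more characters that are
# neither word characters nor whitespace (i.e. runs of punctuation).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Return (piece, (start, end)) pairs for every regex match."""
    return [(m.group(), m.span()) for m in WORD_OR_PUNCT.finditer(text)]


for piece, offsets in pre_tokenize("Hello! What are you upto today?."):
    print(piece, offsets)
```

Note that '?.' stays together as one piece because the second half of the pattern matches a whole run of punctuation, which is the same behaviour we saw in the Whitespace output.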

Step 3: Tokenization

At this point one can say that our work is over. We started with a string, cleaned it and split it into words. We could simply repeat the same process over all the sentences we have and collect all the unique words. Then their index position would serve as an id that we can feed into our models. The words themselves are called tokens and the ids are called token ids. This can definitely be a strategy.
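As a rough sketch of this naive word-level strategy, the snippet below collects unique words and uses their position in the vocabulary as the token id. The toy sentences and the helper name build_word_vocab are made up for illustration.

```python
import re


def build_word_vocab(sentences):
    """Map each unique word to the index at which it was first seen."""
    vocab = {}
    for sentence in sentences:
        for word in re.findall(r"\w+|[^\w\s]+", sentence):
            # setdefault only assigns a new id if the word is not known yet.
            vocab.setdefault(word, len(vocab))
    return vocab


sentences = ["the cat sat", "the dog sat down"]
vocab = build_word_vocab(sentences)
print(vocab)
# {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}

token_ids = [vocab[word] for word in sentences[1].split()]
print(token_ids)
# [0, 3, 2, 4]
```

Every new surface form gets its own entry here, which is exactly why the vocabulary grows so quickly on a real corpus.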

But, one would soon see the problem. For any decently sized textual dataset (also called a corpus in NLP lingo) we could have tens of thousands of words. This would make the training process for our actual NLP model much longer and less efficient.

There is another problem. Consider the next 2 sentences:

I will give you a dollar tomorrow


I will be giving you a dollar tomorrow

First off, both the sentences convey the same idea. But, we have used 2 different words give and giving here. Semantically they should be interpreted in the same way. Imagine, instead of creating a list (which is called our vocabulary) of unique words (tokens) as follows:

['I', 'will', 'be', 'give', 'giving', 'you', 'a', 'dollar', 'tomorrow']

We create,

['I', 'will', 'be', 'giv', '##e', '##ing', 'you', 'a', 'dollar', 'tomorrow']

Note: The list above is our vocabulary not the sentence broken into tokens.

This might seem a very weird way to create our list of words. We have split give and giving into a common part and 2 other pieces. But, notice what happens when we replace our sentences with the new words from our vocabulary (Note: To keep it simple I have kept the other words as they are. However, they might be split as well depending on the corpus.):

['I', 'will', 'giv', '##e', 'you', 'a', 'dollar', 'tomorrow']


['I', 'will', 'be', 'giv', '##ing', 'you', 'a', 'dollar', 'tomorrow']

Note: The lists above are our tokenized sentences, not the vocabulary.

Now, both our sentences, after replacing the words with tokens, have a common token 'giv', which can be very helpful for an NLP model to understand that the sentences share a similar meaning.

Further, the other token '##ing' is a very common ending for many words and will reduce the size of our overall vocabulary. For example, if we had the following sentence:

I was willing to go to the concert

The new vocabulary is:

['I', 'will', 'be', 'giv', '##e', '##ing', 'you', 'a', 'dollar', 'tomorrow', 'was', 'to', 'go', 'the', 'concert']

See how the word willing needs no new entry: it can be built from 'will' and '##ing', both of which are already present in the vocabulary.
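To make the splitting concrete, here is a small sketch of a greedy longest-match-first lookup, which is the usual way a word is broken into WordPiece tokens once a vocabulary exists. It only covers the lookup side, not how the vocabulary is learned, and the function name wordpiece_encode and the toy vocabulary are assumptions for illustration.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the '##' prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Some part of the word cannot be matched at all.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


vocab = {"I", "will", "be", "giv", "##e", "##ing", "you", "a", "dollar",
         "tomorrow", "was", "to", "go", "the", "concert"}

print(wordpiece_encode("giving", vocab))    # ['giv', '##ing']
print(wordpiece_encode("willing", vocab))   # ['will', '##ing']
print(wordpiece_encode("concerts", vocab))  # ['[UNK]']
```

The last call shows where the unknown token comes from: 'concert' matches, but there is no '##s' piece in this toy vocabulary, so the whole word falls back to '[UNK]'.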

The above strategy is a very simplified version of an algorithm known as WordPiece and is used by the BERT Transformer models. So, let's see how we could implement it in our tokenizer.

# Step 3: Load our model
from tokenizers.models import WordPiece
from tokenizers import Tokenizer

# We create our tokenizer based on the WordPiece algorithm model.
# We need to supply the token which will represent unknown tokens.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# With our tokenizer object ready we set our normalizer and pre-tokenizer.
tokenizer.normalizer = normalizer
tokenizer.pre_tokenizer = pre_tokenizer

So, all we need to do is create a Tokenizer object. We set the normalizer and the pre-tokenizer of this new Tokenizer object to the ones we created earlier. What this means is that we don't have to run our normalizer and pre-tokenizer on the dataset beforehand. They will be run automatically by the Tokenizer object. Second, we have used a WordPiece class object as our model so that our tokenizer uses the WordPiece algorithm.

Finally, we have defined a new token '[UNK]'. Tokens like this are known as special tokens and are dictated by the NLP model that will be used. More on them soon.

Step 4: Training

With our model/normalizer/pre-tokenizer all ready, we can now train our tokenizer model on the data. The code for it is:

# Step 4: Train our tokenizer
from tokenizers.trainers import WordPieceTrainer
import time

# We will create a batch iterator which will generate batches of sentences for training
# our tokenizer. This is preferred over passing single sentences to the tokenizer as it
# will be a lot faster.
def batch_iterator(dataset, batch_size=10000):
    for i in range(0, len(dataset), batch_size):
        lower_idx = i
        # Clamp the upper index so that the last batch stops at the end of the dataset.
        upper_idx = min(i + batch_size, len(dataset))
        yield dataset[lower_idx:upper_idx]["sentence"]

# We pass in the list of special tokens so that our model knows about them.
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tic = time.perf_counter()
# Now, we do batch training based on our iterator that we defined earlier.
tokenizer.train_from_iterator(batch_iterator(ds), trainer=trainer, length=len(ds))
toc = time.perf_counter()
print(f"Elapsed time: {toc - tic:0.4f} seconds")

Note: It took me about 30 mins to train the tokenizer on the full 21GB corpus on an AMD Ryzen Pro 7 8 Core machine.
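If you want to check the batching logic without loading the full dataset, the same pattern can be tried on a plain Python list. This is only a stand-in sketch: the rows list below is made-up data, and a list of dicts is not the same thing as a Dataset object, so the column access is written differently.

```python
def batch_iterator(rows, batch_size):
    """Yield lists of sentences, batch_size rows at a time."""
    for start in range(0, len(rows), batch_size):
        # Slicing a Python list past its end simply stops at the last item.
        yield [row["sentence"] for row in rows[start:start + batch_size]]


# Made-up stand-in for the dataset: 7 rows with a 'sentence' field.
rows = [{"sentence": f"sentence {i}"} for i in range(7)]

batches = list(batch_iterator(rows, batch_size=3))
print([len(batch) for batch in batches])
# [3, 3, 1]
```

The last batch is simply shorter than the others, and every sentence is yielded exactly once.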

Most of the code is self-explanatory. We create a batch iterator so that we don't train on a single sentence every time and our training is faster. We also need a Trainer object to train our tokenizer. This must be compatible with the model that we instantiated our tokenizer with. We used WordPiece as our model:

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

So, we use the WordPieceTrainer class to create the trainer object. Again, there are special tokens that we pass to the constructor. Let's see what they mean.

Special Tokens

When using BERT as a model there are certain special tokens used by it that need to be used. Other models might use a different set of special tokens. We will keep it simple here and see the BERT ones. They are:

["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

  • [UNK]: This is used to represent any word that the tokenizer fails to find in its vocabulary. This can happen when the word comes from a different corpus to the one the tokenizer was trained on, or when we set the size of our vocabulary too small.
  • [CLS]: This token is automatically inserted during post-processing at the start of a sentence or pair of sentences.
  • [SEP]: This token is automatically inserted during post-processing at the end of every sentence.
  • [PAD]: This token is used to ensure that all sentences in a batch of sentences have the same length.
  • [MASK]: This is a special token that is used only while training the BERT model (not the tokenizer) on a Masked Language Modelling task.

Don't worry if these seem vague. We will be applying them all in our post-processing section.

Understanding the Encoding object

Once we complete the training process, we can use our tokenizer to encode sentences. Let's see what this means.

# Define our example
example_sentence = ds[100]['sentence']
# 'Our Expeditionary Services segment competes with a number of divisions of large corporations and other large and small companies.'

# Now that the training is done let us check out what the output of the tokenizer looks like.
output = tokenizer.encode(example_sentence)
# Encoding(num_tokens=22, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

OK, we got an Encoding object. We see the list of attributes the object has, as well as the number of tokens that were generated.

We can check each of them out as:

# The number of sequences
output.n_sequences
# 1

# The tokens generated after our sentence went through the normalization->
# pre-tokenization->tokenization(WordPiece) pipeline
output.tokens

# The ids assigned to these tokens.
output.ids

# The attention masks
output.attention_mask

# The sequence ids
output.sequence_ids

# The word ids
output.word_ids

# The type ids
output.type_ids

# The offsets for our tokens.
output.offsets

But, I find it better to see them in a table side-by-side to really get an understanding of what they mean. Here is the same set of outputs in tabular form for a truncated set of tokens:

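| tokens | ids | attention_mask | special_tokens_mask | sequence_ids | word_ids | type_ids | offsets |
|---|---|---|---|---|---|---|---|
| Our | 1817 | 1 | 0 | 0 | 0 | 0 | (0, 3) |
| Exped | 19910 | 1 | 0 | 0 | 1 | 0 | (4, 9) |
| ##ition | 1515 | 1 | 0 | 0 | 1 | 0 | (9, 14) |
| ##ary | 1610 | 1 | 0 | 0 | 1 | 0 | (14, 17) |
| Services | 3504 | 1 | 0 | 0 | 2 | 0 | (18, 26) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| companies | 2351 | 1 | 0 | 0 | 18 | 0 | (119, 128) |
| . | 18 | 1 | 0 | 0 | 19 | 0 | (128, 129) |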

For now, just focus on tokens, ids, word_ids and offsets. The rest will be clearer when we explore the next few sections.

So, we see that our tokenizer kept the word 'Our' as a single token, with an integer id of 1817 and a word id of 0 as it's the first word in the sentence. The offset gives the exact index range in the sentence string where this token (not the word) is found.

Next, the tokenizer split the word 'Expeditionary' into 3 separate tokens: 'Exped', '##ition' and '##ary'. This is the WordPiece algorithm in action. It assigned different ids to each of them. However, the word id assigned to all of them was the same, 1. So, from this we can see that even though the word was split, we still have enough information to reconstruct the word. Finally, the offset again provides the index range in our string where the token (not the whole word) is found.
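To see why the word ids and offsets are enough to rebuild the words, here is a small sketch that works on plain lists copied from the first five rows of the table above. The helper name words_from_offsets is made up, and the sentence is cut short to the part those rows cover.

```python
def words_from_offsets(text, word_ids, offsets):
    """Merge the offsets of tokens sharing a word id and slice the text."""
    spans = {}
    for word_id, (start, end) in zip(word_ids, offsets):
        if word_id is None:
            # Tokens without a word id do not come from the text.
            continue
        first, last = spans.get(word_id, (start, end))
        spans[word_id] = (min(first, start), max(last, end))
    return [text[start:end] for _, (start, end) in sorted(spans.items())]


text = "Our Expeditionary Services"
# Values copied from the table: Our, Exped, ##ition, ##ary, Services
word_ids = [0, 1, 1, 1, 2]
offsets = [(0, 3), (4, 9), (9, 14), (14, 17), (18, 26)]

print(words_from_offsets(text, word_ids, offsets))
# ['Our', 'Expeditionary', 'Services']
```

The three pieces of 'Expeditionary' share word id 1, so their offsets collapse into the single span (4, 17).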

Step 5: Post-processing

The post-processing is highly tied to the NLP model which we will be using. We can do all kinds of things in this step, but usually here is where we add the special tokens based on the NLP model. As we are assuming that the tokenized text will be fed into BERT, let us see what BERT needs.

BERT expects every single sentence to begin with the '[CLS]' token and end with a '[SEP]' token. So, for the following:

I love machine learning.

we need to feed into BERT:

['[CLS]', 'I', 'love', 'machine', 'learning', '.', '[SEP]']

BERT can also be fed 2 sentences at a time for a training task known as Next Sentence Prediction. So, for the following inputs:

I love machine learning. It is cool.

we need to feed:

['[CLS]', 'I', 'love', 'machine', 'learning', '.', '[SEP]', 'It', 'is', 'cool', '.', '[SEP]']

So, the '[SEP]' token goes at the end of every sentence while the '[CLS]' token only goes at the beginning of the first sentence.

With this in mind, let's see how we can easily achieve it using our tokenizer:

from tokenizers.processors import TemplateProcessing

# BERT like post-processor
post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    # Each special token used in the templates is paired with its id in the vocabulary.
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

tokenizer.post_processor = post_processor

output = tokenizer.encode(example_sentence)

So, we use a new class called TemplateProcessing, which can be told how to process a single sentence using the single parameter and a pair of sentences using the pair parameter.

We provide the string:

"[CLS] $A [SEP]"

to tell the post-processor that for any sentence represented by $A add the [CLS] and [SEP] tokens as defined. For a pair of sentences we provide:

"[CLS] $A [SEP] $B:1 [SEP]:1"

Here again, $A and $B are the two sentences. The extra :1 tells the tokenizer how to identify which sentence a token belongs to when there is a pair of sentences. So, here every token coming from the second sentence will have a type_id of 1, while every token coming from the first sentence will have a type_id of 0 (the default when nothing is specified).

One last thing in the example above is the token_to_id method. This method on the tokenizer object easily gives us the id that is assigned to a token.
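The effect of the two templates can be mimicked in a few lines of plain Python. This is only an illustrative sketch of the BERT-style layout, working on lists of token strings; the helper name apply_bert_template is made up and it does not parse template strings the way TemplateProcessing does.

```python
def apply_bert_template(tokens_a, tokens_b=None):
    """Add [CLS]/[SEP] around one or two token lists and build type ids."""
    # Single: [CLS] $A [SEP], everything gets type id 0.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        # Pair: ... $B:1 [SEP]:1, the second sentence and its [SEP] get type id 1.
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)
    return tokens, type_ids


first = ["I", "love", "machine", "learning", "."]
second = ["It", "is", "cool", "."]

tokens, type_ids = apply_bert_template(first, second)
print(tokens)
print(type_ids)
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Note that the final [SEP] carries type id 1, which is what the :1 after [SEP] in the pair template asks for.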

So, let's see an example output:

# Multiple sentences
# 'Our Expeditionary Services segment competes with a number of divisions of large corporations and other large and small companies.'
# 'Although certain of our competitors have substantially greater financial and other resources than we do, we believe that we have maintained a satisfactory competitive position through our responsiveness to customer needs, our attention to quality, and our unique combination of market expertise and technical and financial capabilities.'

output = tokenizer.encode(ds[100]["sentence"], ds[101]["sentence"])

# The number of sequences
output.n_sequences
# 2

If we return to our table as before:

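| tokens | ids | attention_mask | special_tokens_mask | sequence_ids | word_ids | type_ids | offsets |
|---|---|---|---|---|---|---|---|
| [CLS] | 1 | 1 | 1 | None | None | 0 | (0, 0) |
| Our | 1817 | 1 | 0 | 0 | 0 | 0 | (0, 3) |
| Exped | 19910 | 1 | 0 | 0 | 1 | 0 | (4, 9) |
| ##ition | 1515 | 1 | 0 | 0 | 1 | 0 | (9, 14) |
| ##ary | 1610 | 1 | 0 | 0 | 1 | 0 | (14, 17) |
| Services | 3504 | 1 | 0 | 0 | 2 | 0 | (18, 26) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| companies | 2351 | 1 | 0 | 0 | 18 | 0 | (119, 128) |
| . | 18 | 1 | 0 | 0 | 19 | 0 | (128, 129) |
| [SEP] | 2 | 1 | 1 | None | None | 0 | (0, 0) |
| Although | 3854 | 1 | 0 | 1 | 0 | 1 | (0, 8) |
| certain | 1809 | 1 | 0 | 1 | 1 | 1 | (9, 16) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| capabilities | 4870 | 1 | 0 | 1 | 49 | 1 | (323, 335) |
| . | 18 | 1 | 0 | 1 | 50 | 1 | (335, 336) |
| [SEP] | 2 | 1 | 1 | None | None | 1 | (0, 0) |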

We see the special tokens have been added. The following needs to be noted:

  • sequence_ids and type_ids are 0 for any token belonging to the first sentence and 1 for the second sentence.
  • offsets are always calculated with respect to the sentence the token comes from, not the combined sentences.
  • Special tokens can be identified using the special_tokens_mask attribute which is 1 if the token is a special token.
  • sequence_ids and word_ids are always None for special tokens and the offset is always (0,0) as these tokens don't really belong to the sentence.
  • However, type_ids work the same for special tokens as normal tokens.

Padding and Attention Masks

Padding comes into the picture when we have multiple sentences in a batch that we want to tokenize and feed into an NLP model. Most models require the inputs in a batch to be of the same size. But sentences almost always vary in length. So, one thing we can do is simply add padding tokens until every sentence in our batch has the same length. For example, say we have 2 sentences in our batch as follows:

I love football

I live in Paris

We could simply tokenize and add a padding token so that the tokenized sentences have the same length:

['[CLS]', 'I', 'love', 'football', '[SEP]', '[PAD]']

['[CLS]', 'I', 'live', 'in', 'Paris', '[SEP]']

We see a pad token is added to the end of the first sentence after the [SEP] token to make the final count of tokens in each sentence the same. More than 1 padding token can be added and we can control whether to pad left or right. Here we will keep it simple and use defaults as:

pad_token = "[PAD]"
tokenizer.enable_padding(pad_id=tokenizer.token_to_id(pad_token), pad_token=pad_token)

output = tokenizer.encode_batch([
    [ds[100]["sentence"], ds[101]["sentence"]],
    [ds[102]["sentence"], ds[103]["sentence"]],
])

This batch produces the following output:

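| tokens | ids | attention_mask | special_tokens_mask | sequence_ids | word_ids | type_ids | offsets |
|---|---|---|---|---|---|---|---|
| [CLS] | 1 | 1 | 1 | None | None | 0 | (0, 0) |
| Our | 1817 | 1 | 0 | 0 | 0 | 0 | (0, 3) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| . | 18 | 1 | 0 | 0 | 19 | 0 | (128, 129) |
| [SEP] | 2 | 1 | 1 | None | None | 0 | (0, 0) |
| Although | 3854 | 1 | 0 | 1 | 0 | 1 | (0, 8) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| . | 18 | 1 | 0 | 1 | 50 | 1 | (335, 336) |
| [SEP] | 2 | 1 | 1 | None | None | 1 | (0, 0) |
| [CLS] | 1 | 1 | 1 | None | None | 0 | (0, 0) |
| Backlog | 12416 | 1 | 0 | 0 | 0 | 0 | (0, 7) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| . | 18 | 1 | 0 | 0 | 18 | 0 | (115, 116) |
| [SEP] | 2 | 1 | 1 | None | None | 0 | (0, 0) |
| Backlog | 12416 | 1 | 0 | 1 | 0 | 1 | (0, 7) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| . | 18 | 1 | 0 | 1 | 28 | 1 | (161, 162) |
| [SEP] | 2 | 1 | 1 | None | None | 1 | (0, 0) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| [PAD] | 3 | 0 | 1 | None | None | 0 | (0, 0) |
| [PAD] | 3 | 0 | 1 | None | None | 0 | (0, 0) |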

We provide a batch of two, with each input of the batch being a pair of sentences. We see that the tokens of the second input were padded with [PAD] tokens. As the pad token is a special token, all the earlier discussion about special tokens applies here as well.

Now, we can finally talk about the attention_mask. If you looked carefully at the examples before, it was always 1 for all the tokens. Only in the current example does the value differ from 1, and only for the [PAD] token. This ties in with the attention mechanism of Transformer models in general.

I won't go into the details of the attention mechanism. But, intuitively we can understand why it is 0 for the [PAD] token. The pad token was introduced just to make sure all our tokenized sentences are the same size. Our model shouldn't really care about it. To ensure it doesn't, we set the attention_mask value to 0.

This won't be the case for the other special tokens. The other special tokens all play a role in the model learning so for them the attention_mask value is still 1.
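The relationship between padding and the attention mask can be sketched in plain Python using the two short sentences from earlier in this section. This is an illustration of right-padding only, with a made-up helper name pad_batch; it is not what the library does internally.

```python
def pad_batch(batch, pad_token="[PAD]"):
    """Right-pad every token list to the longest one and build the masks."""
    max_len = max(len(tokens) for tokens in batch)
    padded, masks = [], []
    for tokens in batch:
        n_pad = max_len - len(tokens)
        padded.append(tokens + [pad_token] * n_pad)
        # 1 for real tokens (including [CLS] and [SEP]), 0 for padding.
        masks.append([1] * len(tokens) + [0] * n_pad)
    return padded, masks


batch = [
    ["[CLS]", "I", "love", "football", "[SEP]"],
    ["[CLS]", "I", "live", "in", "Paris", "[SEP]"],
]

padded, masks = pad_batch(batch)
print(padded[0])
# ['[CLS]', 'I', 'love', 'football', '[SEP]', '[PAD]']
print(masks)
# [[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1]]
```

Only the padding position gets a 0 in the mask; [CLS] and [SEP] keep a 1, just as in the table above.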


This wraps up our discussion of tokenization. We saw the different aspects of the process and the ideas behind them. We saw how the entire pipeline of Normalization->Pre-tokenization->Tokenization->Post-processing can be easily integrated into a single instance of the Tokenizer class and applied to entire batches of textual input.

If you want more details then the Hugging Face documentation is a great resource to start. Thank you for reading!

Thanks for this tokenization tutorial. Very interesting.
I would like to test with a French sentence dataset.
Can you describe the format of your dataset? eg JanosAudran/financial-reports-sec small-lite
Are all columns required for this tokenization?

@LeMoussel Thanks for reading.

The format is as follows:

  • A given US firm files a 10-K annually with SEC.
  • That filing has a lot of textual data which are categorized into separate sections.
  • Within each section there is a lot of text.
  • With the above in mind, the way the dataset is structured is every sentence in the filing is 1 single row/obs in the dataset. That's what the sentence feature captures.
  • The rest of the features simply allows us to track which section/filing/firm the sentence came from.

This is a broad overview. If you have any specific question feel free to ask.


And for the last question, no you don't need any other feature than the 'sentence' feature for a tokenization task.

LeMoussel commented Jan 16, 2023

I found this dataset allocine for testing in French.

Théophile Blard, French sentiment analysis with BERT, (2020), GitHub repository,

@LeMoussel The dataset is a very standard dataset fit for SA. What exactly is the task you have in mind?

I want to test your discussion of tokenization (entire pipeline of Normalization->Pre-tokenization->Tokenization->Post-processing) on French text.
