@tamuhey / tokenizations_post.md — Last active March 30, 2024
How to calculate the alignment between BERT and spaCy tokens effectively and robustly


site: https://tamuhey.github.io/tokenizations/

Natural Language Processing (NLP) has made great progress in recent years thanks to neural networks, which allow us to solve many tasks with end-to-end architectures. However, many NLP systems still require language-specific pre- and post-processing, especially in tokenization. In this article, I describe an algorithm that simplifies one such process: calculating the correspondence between tokens produced by different tokenizers (e.g. BERT vs. spaCy). I also introduce Python and Rust libraries that implement this algorithm. Here are the library and the demo site links:
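To make the token-alignment problem concrete, here is a minimal, self-contained sketch of the idea. It is not the library's actual implementation (the post's algorithm uses Myers diff plus text normalization); this version simply matches the two token sequences at the character level with `difflib` and lifts the matched character spans back to token indices. The function name `align` and the helpers are illustrative, not part of the library's API:

```python
from difflib import SequenceMatcher

def char_spans(tokens):
    # Map each token to its (start, end) character span in the
    # concatenation of all tokens.
    spans, pos = [], 0
    for t in tokens:
        spans.append((pos, pos + len(t)))
        pos += len(t)
    return spans

def align(tokens_a, tokens_b):
    # Find matching character blocks between the two concatenated
    # strings, then record which token in b each token in a overlaps.
    sa, sb = "".join(tokens_a), "".join(tokens_b)
    spans_a, spans_b = char_spans(tokens_a), char_spans(tokens_b)
    a2b = [set() for _ in tokens_a]
    matcher = SequenceMatcher(None, sa, sb, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            ca, cb = block.a + offset, block.b + offset
            ia = next(i for i, (s, e) in enumerate(spans_a) if s <= ca < e)
            ib = next(i for i, (s, e) in enumerate(spans_b) if s <= cb < e)
            a2b[ia].add(ib)
    return [sorted(s) for s in a2b]

print(align(["Hello", "world"], ["Hel", "lo", "world"]))  # [[0, 1], [2]]
```

The real algorithm is more robust (it normalizes text before diffing, so tokenizer artifacts like lowercasing or accent stripping do not break the alignment) and more efficient than this per-character scan, but the input/output contract is the same: for each token in one tokenization, the indices of the overlapping tokens in the other.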

import pandas as pd
from tqdm import tqdm
from difflib import SequenceMatcher
import re
import pickle

def matcher(string, pattern):
    '''
    Return the start and end index of any pattern present in the text.
    '''
    # Body reconstructed (the original snippet was cut off after the
    # docstring): difflib finds the longest common block; (-1, -1)
    # signals that the full pattern does not occur in the string.
    m = SequenceMatcher(None, string, pattern, autojunk=False).find_longest_match(
        0, len(string), 0, len(pattern))
    if m.size == len(pattern):
        return m.a, m.a + m.size
    return -1, -1