@tamuhey / tokenizations_post.md — Last active March 30, 2024
How to calculate the alignment between BERT and spaCy tokens effectively and robustly


site: https://tamuhey.github.io/tokenizations/

Natural Language Processing (NLP) has made great progress in recent years thanks to neural networks, which allow us to solve many tasks with end-to-end architectures. However, many NLP systems still require language-specific pre- and post-processing, especially in tokenization. In this article, I describe an algorithm that simplifies one such process: calculating the correspondence between tokens produced by different tokenizers (e.g. BERT vs. spaCy). I also introduce Python and Rust libraries that implement this algorithm. Here are the library and the demo site links:
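To make the token-alignment problem concrete, here is a minimal, self-contained sketch of the idea. It is not the library's actual implementation (the post's algorithm uses Myers diff plus text normalization); this version simply matches the two token sequences at the character level with `difflib` and lifts the matched character spans back to token indices. The function name `align` and the helpers are illustrative, not part of the library's API:

```python
from difflib import SequenceMatcher

def char_spans(tokens):
    # Map each token to its (start, end) character span in the
    # concatenation of all tokens.
    spans, pos = [], 0
    for t in tokens:
        spans.append((pos, pos + len(t)))
        pos += len(t)
    return spans

def align(tokens_a, tokens_b):
    # Find matching character blocks between the two concatenated
    # strings, then record which token in b each token in a overlaps.
    sa, sb = "".join(tokens_a), "".join(tokens_b)
    spans_a, spans_b = char_spans(tokens_a), char_spans(tokens_b)
    a2b = [set() for _ in tokens_a]
    matcher = SequenceMatcher(None, sa, sb, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            ca, cb = block.a + offset, block.b + offset
            ia = next(i for i, (s, e) in enumerate(spans_a) if s <= ca < e)
            ib = next(i for i, (s, e) in enumerate(spans_b) if s <= cb < e)
            a2b[ia].add(ib)
    return [sorted(s) for s in a2b]

print(align(["Hello", "world"], ["Hel", "lo", "world"]))  # [[0, 1], [2]]
```

The real algorithm is more robust (it normalizes text before diffing, so tokenizer artifacts like lowercasing or accent stripping do not break the alignment) and more efficient than this per-character scan, but the input/output contract is the same: for each token in one tokenization, the indices of the overlapping tokens in the other.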

import pandas as pd
from tqdm import tqdm
from difflib import SequenceMatcher
import re
import pickle

def matcher(string, pattern):
    '''
    Return the start and end index of any pattern present in the text.
    '''
    # Body reconstructed (the original snippet was cut off after the
    # docstring): difflib finds the longest common block; (-1, -1)
    # signals that the full pattern does not occur in the string.
    m = SequenceMatcher(None, string, pattern, autojunk=False).find_longest_match(
        0, len(string), 0, len(pattern))
    if m.size == len(pattern):
        return m.a, m.a + m.size
    return -1, -1