How to calculate the alignment between BERT and spaCy tokens effectively and robustly

site: https://tamuhey.github.io/tokenizations/

Natural Language Processing (NLP) has made great progress in recent years thanks to neural networks, which allow us to solve many tasks with end-to-end architectures. However, many NLP systems still require language-specific pre- and post-processing, especially tokenization. In this article, I describe an algorithm that simplifies one such process: calculating the correspondence between tokens produced by different tokenizers (e.g. BERT vs. spaCy). I also introduce Python and Rust libraries that implement this algorithm. A minimal usage sketch follows, and after it the library and demo site links:
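To make the idea concrete, here is a minimal sketch of what such an alignment looks like. It assumes the Python package is published as `pytokenizations`, is imported as `tokenizations`, and exposes a `get_alignments` function; the token lists and the printed outputs are illustrative, so check the library's README for the exact API.

```python
# Minimal sketch (assumption: package published as `pytokenizations`,
# imported as `tokenizations`, exposing `get_alignments`).
import tokenizations

# The same word tokenized two different ways: BERT-style subwords
# on one side, a spaCy-style single token on the other.
bert_tokens = ["token", "##ization"]
spacy_tokens = ["tokenization"]

# a2b[i] lists the indices in spacy_tokens that bert_tokens[i] maps to;
# b2a is the reverse mapping.
a2b, b2a = tokenizations.get_alignments(bert_tokens, spacy_tokens)

print(a2b)  # expected: [[0], [0]]
print(b2a)  # expected: [[0, 1]]
```

Both directions are returned because the two tokenizations can split the same text at different granularities.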