louisguitton / tokenizations_post.md
Created July 10, 2020 15:35 — forked from tamuhey/tokenizations_post.md
How to calculate the alignment between BERT and spaCy tokens effectively and robustly

site: https://tamuhey.github.io/tokenizations/

Natural Language Processing (NLP) has made great progress in recent years thanks to neural networks, which allow us to solve various tasks with end-to-end architectures. However, many NLP systems still require language-specific pre- and post-processing, especially for tokenization. In this article, I describe an algorithm that simplifies one such process: calculating the correspondence between tokens produced by different tokenizers (e.g. BERT vs. spaCy). I also introduce Python and Rust libraries that implement this algorithm.
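
To make "correspondence between tokens" concrete, here is a minimal sketch of a naive alignment. It is not the algorithm described in this article. The function name `naive_align` and the sample tokens are hypothetical, and the sketch assumes both token sequences concatenate to exactly the same string, which real tokenizer pairs often violate (lowercasing, `##` subword markers, and so on).

```python
from typing import List, Tuple


def naive_align(a: List[str], b: List[str]) -> Tuple[List[List[int]], List[List[int]]]:
    """Map each token in `a` to the overlapping tokens in `b`, and vice versa.

    Assumption: "".join(a) == "".join(b). This is an illustrative baseline,
    not the article's algorithm.
    """
    if "".join(a) != "".join(b):
        raise ValueError("this sketch only handles identical concatenations")

    def char_owner(tokens: List[str]) -> List[int]:
        # For every character of the concatenated text, record the index
        # of the token that character belongs to.
        return [i for i, tok in enumerate(tokens) for _ in tok]

    a2b: List[List[int]] = [[] for _ in a]
    b2a: List[List[int]] = [[] for _ in b]
    # Walk both tokenizations character by character; tokens that share
    # a character are aligned. Indices only grow, so checking the last
    # stored index is enough to avoid duplicates.
    for i, j in zip(char_owner(a), char_owner(b)):
        if not a2b[i] or a2b[i][-1] != j:
            a2b[i].append(j)
        if not b2a[j] or b2a[j][-1] != i:
            b2a[j].append(i)
    return a2b, b2a


# Hypothetical word-level vs. subword-level tokens (markers already stripped).
words = ["unbelievable", "!"]
subwords = ["un", "believ", "able", "!"]
a2b, b2a = naive_align(words, subwords)
print(a2b)  # [[0, 1, 2], [3]]
print(b2a)  # [[0], [0], [0], [1]]
```

Here `a2b[i]` lists the indices of the tokens in `b` that overlap token `i` of `a`. The sketch raises an error as soon as the two tokenizers normalize the text differently, which is the situation a robust alignment has to handle.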