Louis Guitton louisguitton

## tokenizations_post.md

      
            
            
Created July 10, 2020 15:35, forked from tamuhey/tokenizations_post.md
            
              
              
          
How to calculate the alignment between BERT and spaCy tokens effectively and robustly


site: https://tamuhey.github.io/tokenizations/
Natural Language Processing (NLP) has made great progress in recent years because of neural networks, which allow us to solve various tasks with end-to-end architectures.
However, many NLP systems still require language-specific pre- and post-processing, especially for tokenization.
In this article, I describe an algorithm that simplifies one such process: calculating the correspondence between tokens produced by different tokenizers (e.g. BERT vs. spaCy).
I also introduce Python and Rust libraries that implement this algorithm.
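
To make "correspondence between tokens" concrete, here is a minimal sketch of a naive alignment based on character offsets. It is not the algorithm described in this article and not the API of the libraries mentioned above: the function names `char_spans` and `naive_alignments` are hypothetical, and the sketch assumes that both token sequences spell exactly the same text when concatenated (for example, subword pieces written without any `##` prefix). Real tokenizer outputs often break that assumption, which is the kind of case a robust method has to handle.

```python
from typing import List, Tuple


def char_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) character offsets of each token
    within the concatenation of all tokens."""
    spans = []
    pos = 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def naive_alignments(
    a: List[str], b: List[str]
) -> Tuple[List[List[int]], List[List[int]]]:
    """Map each token in `a` to the overlapping tokens in `b`, and vice versa.

    Assumption (hypothetical helper, not a library API): "".join(a) must
    equal "".join(b); otherwise a ValueError is raised.
    """
    if "".join(a) != "".join(b):
        raise ValueError("token sequences must spell the same text")
    spans_a, spans_b = char_spans(a), char_spans(b)
    a2b: List[List[int]] = [[] for _ in a]
    b2a: List[List[int]] = [[] for _ in b]
    for i, (sa, ea) in enumerate(spans_a):
        for j, (sb, eb) in enumerate(spans_b):
            if sa < eb and sb < ea:  # the two half-open intervals overlap
                a2b[i].append(j)
                b2a[j].append(i)
    return a2b, b2a


# Illustrative inputs: a word-level split and a subword-level split.
spacy_like = ["I", "like", "tokenization"]
bert_like = ["I", "like", "token", "ization"]
a2b, b2a = naive_alignments(spacy_like, bert_like)
print(a2b)  # [[0], [1], [2, 3]]
print(b2a)  # [[0], [1], [2], [2]]
```

Here `a2b[2] == [2, 3]` says that the third word-level token covers the third and fourth subword tokens. The quadratic loop keeps the sketch short; it is a design shortcut for readability, not a recommendation.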