Skip to content

Instantly share code, notes, and snippets.

View thllwg's full-sized avatar

Thorben Hellweg thllwg

View GitHub Profile
@tamuhey
tamuhey / tokenizations_post.md
Last active March 30, 2024 19:00
How to calculate the alignment between BERT and spaCy tokens effectively and robustly

How to calculate the alignment between BERT and spaCy tokens effectively and robustly

image

site: https://tamuhey.github.io/tokenizations/

Natural Language Processing (NLP) has made great progress in recent years because of neural networks, which allows us to solve various tasks with end-to-end architecture. However, many NLP systems still require language-specific pre- and post-processing, especially in tokenizations. In this article, I describe an algorithm that simplifies calculating correspondence between tokens (e.g. BERT vs. spaCy), one such process. And I introduce Python and Rust libraries that implement this algorithm. Here are the library and the demo site links:

@timvisee
timvisee / falsehoods-programming-time-list.md
Last active April 19, 2024 20:56
Falsehoods programmers believe about time, in a single list

Falsehoods programmers believe about time

This is a compiled list of falsehoods programmers tend to believe about working with time.

Don't re-invent a date time library yourself. If you think you understand everything about time, you're probably doing it wrong.

Falsehoods

  • There are always 24 hours in a day.
  • February is always 28 days long.
  • Any 24-hour period will always begin and end in the same day (or week, or month).