ksopyla / polish_sentence_nltk_tokenizer.py
Last active September 19, 2022 07:29
A curated list of Polish abbreviations for the NLTK sentence tokenizer, based on Wikipedia text
import nltk
# Interactive download (opens the NLTK downloader):
# nltk.download()
# Non-interactive download of the Punkt sentence tokenizer models:
nltk.download('punkt')
extra_abbreviations = [
"ps",
"inc",
"corp",
# to simulate remote binary correctly
socat TCP-LISTEN:31337,fork,reuseaddr EXEC:"./server.py",pty,stderr