@brianspiering
Last active April 8, 2024 20:48
A Hacker's Guide to Python string and Natural Language Processing (NLP) packages

Extraction

  • textract - Extract text from many document formats.
  • camelot - Extract tables from PDFs.

Preprocessing

  • Python's Standard Library, especially the str methods and the string module, is powerful for text processing. Start there.
  • regex - Extends Python's Standard Library re module while being backwards-compatible.
  • chardet - Finds character encoding.
  • ftfy - Takes in bad Unicode and outputs good Unicode. Seriously automagical.
  • polyglot - Helpful for multilingual preprocessing.
  • fuzzywuzzy - Fuzzy string matching like a boss. Now maintained under the name thefuzz.
  • enchant - Spell checking.
  • inflect - Convert numbers to words, switch between singular/plural, and generate ordinals.
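
To illustrate the "start with the Standard Library" advice above, here is a minimal sketch of a text normalizer that uses only str methods plus the string, re, and unicodedata modules. The function names normalize and tokenize are hypothetical, chosen for this example, and the specific cleanup steps (accent stripping, ASCII punctuation removal) are assumptions that may not suit every corpus.

```python
import re
import string
import unicodedata


def normalize(text: str) -> str:
    """Strip accents and ASCII punctuation, collapse whitespace, casefold."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # A translation table built from string.punctuation deletes ASCII punctuation.
    no_punct = no_accents.translate(str.maketrans("", "", string.punctuation))
    # Collapse whitespace runs to one space, trim the ends, and casefold.
    return re.sub(r"\s+", " ", no_punct).strip().casefold()


def tokenize(text: str) -> list:
    """Split normalized text on single spaces (hypothetical helper)."""
    return normalize(text).split()


if __name__ == "__main__":
    print(normalize("  Héllo,   World!  "))
    print(tokenize("It's a test."))
```

Note that removing punctuation also removes apostrophes, so "It's" becomes "its". If that matters for your data, a package such as regex or spaCy from the lists here is probably the better tool.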

Modeling

  • nltk - Hard pass. Too academic, too slow.
  • scikit-learn - Handles basic text processing and modeling. Easy to combine text-based features with other features.
  • TextBlob - A great package for common NLP tasks. Consistent OOP-style API.
  • spaCy - Industrial-strength NLP, including very good transformer support and named entity recognition (NER).
    • textacy - Higher level NLP built on top of spaCy.
  • Hugging Face - Collections of datasets and pretrained models.
  • gensim - A nice API for all kinds of topic modeling and word2vec.
  • pattern - Text mining at its finest. Handles normalizing numbers, comparatives, and superlatives.
  • jellyfish - Approximate & phonetic string matching.
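
To get a feel for the approximate string matching that jellyfish and fuzzywuzzy provide, here is a rough sketch using difflib from the Standard Library. It is a stand-in, not the API of either package: difflib scores similarity with its own sequence-matching ratio, and it offers no phonetic matching. The helper names similarity and best_match and the 0.6 cutoff are assumptions made for this example.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] using difflib's ratio."""
    return difflib.SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def best_match(query: str, choices: list, cutoff: float = 0.6):
    """Return the closest choice at or above the cutoff, or None."""
    # get_close_matches is case-sensitive and returns the best matches first.
    matches = difflib.get_close_matches(query, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    packages = ["spaCy", "gensim", "nltk"]
    print(similarity("Python", "python"))
    print(best_match("spacey", packages))
    print(best_match("xyz", packages))
```

For larger candidate lists or for phonetic comparison, the dedicated packages listed here are likely a better fit than this sketch.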

Blogposts

Listicles of Deep NLP Packages
