A Hacker's Guide to Python string and Natural Language Processing (NLP) packages


Preprocessing

  • The Python Standard Library, especially str methods and the string module, is powerful for text processing. Start there.
  • regex - Extends Python's Standard Library re module while being backwards-compatible.
  • chardet - Finds character encoding.
  • ftfy - Take in bad Unicode and output good Unicode. Seriously automagical.
  • polyglot - Helpful for multilingual preprocessing.
  • fuzzywuzzy - Fuzzy string matching like a boss.
  • enchant - Spell checking.
  • inflect - Convert numbers to words, switch between singular/plural, and generate ordinals.
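
As a minimal sketch of the "start there" advice, the snippet below normalizes text with nothing but str methods and the string module. The function names `normalize` and `similarity` are hypothetical, chosen for illustration. The similarity score uses the standard library's difflib as a stand-in for fuzzy matching; it is not the fuzzywuzzy API.

```python
import string
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    # str.maketrans with a third argument builds a table that deletes those characters.
    table = str.maketrans("", "", string.punctuation)
    cleaned = text.lower().translate(table)
    # split() with no arguments splits on any run of whitespace.
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two normalized strings.

    Assumption: difflib.SequenceMatcher is a stdlib stand-in for a fuzzy matcher.
    """
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(normalize("  Hello,   World!  It's NLP-time. "))
print(similarity("Hello, World!", "hello world"))
```

When the standard library runs out (mixed encodings, mojibake, multilingual text), the packages listed above are the next step.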

Modeling

  • nltk - Hard pass. Too academic, too slow.
  • scikit-learn - Handles basic text processing and modeling. Easy to combine text-based features with other features.
  • TextBlob - A great package for common NLP tasks. Consistent OOP-style API.
  • spaCy - Industrial-strength NLP, including fast syntactic parsing.
    • textacy - Higher-level NLP built on top of spaCy.
  • gensim - A nice API for all kinds of topic modeling and word2vec.
  • pattern - Text mining at its finest. Handles normalizing numbers, comparatives, and superlatives.
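
To make "basic text processing" and "combining text-based features with other features" concrete, here is a standard-library-only sketch of a bag-of-words matrix with one extra numeric column appended. It illustrates the idea that a vectorizer automates; it is not scikit-learn's implementation or API, and `bag_of_words` and the character-length feature are hypothetical names picked for the example.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of strings.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives every term a stable column index.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


docs = ["the cat sat", "the cat saw the dog"]
vocab, matrix = bag_of_words(docs)

# Combine text features with a non-text feature (here: document length in characters).
combined = [row + [len(doc)] for row, doc in zip(matrix, docs)]

print(vocab)
print(combined)
```

In practice a library handles tokenization, sparse storage, and weighting; the point here is only that text features end up as ordinary numeric columns that sit next to any others.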

Blogposts

Listicles of Deep NLP Packages
