@ftfarias
Forked from brianspiering/python_nlp_packages.md
Created October 14, 2019 20:19
A Hacker's Guide to Python string and Natural Language Processing (NLP) packages

Preprocessing

  • The Python Standard Library, especially the str methods and the string module, is powerful for text processing. Start there.
  • regex - Extends Python's Standard Library re module while being backwards-compatible.
  • chardet - Finds character encoding.
  • ftfy - Takes in bad Unicode and outputs good Unicode. Seriously automagical.
  • polyglot - Helpful for multilingual preprocessing.
  • fuzzywuzzy - Fuzzy string matching like a boss.
  • enchant - Spell checking.
  • inflect - Convert numbers to words, switch between singular/plural, and generate ordinals.
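
To make the "start with the Standard Library" advice concrete, here is a minimal sketch that uses only built-in modules. The helper names `normalize` and `similarity` are made up for illustration, and `difflib.SequenceMatcher` is a rough standard-library stand-in for fuzzy matching, not the fuzzywuzzy API or its scoring.

```python
import string
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Strip accents and ASCII punctuation, casefold, collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # A deletion table removes all ASCII punctuation in one pass.
    no_punct = no_accents.translate(str.maketrans("", "", string.punctuation))
    # casefold() is a more aggressive lower(); split/join collapses whitespace.
    return " ".join(no_punct.casefold().split())


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    print(normalize("  Café,   au LAIT!! "))
    print(similarity("New York City", "new york city!"))
```

If this is not enough (for example, mis-encoded text or token-level fuzzy scores), that is the point at which the dedicated packages above are likely worth installing.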

Modeling

  • nltk - Hard pass. Too academic, too slow.
  • scikit-learn - Handles basic text processing and modeling. Easy to combine text-based features with other features.
  • TextBlob - A great package for common NLP tasks. Consistent OOP-style API.
  • spaCy - Industrial-strength NLP, including fast syntactic parsing.
    • textacy - Higher-level NLP built on top of spaCy.
  • gensim - A nice API for all kinds of topic modeling and word2vec.
  • pattern - Text mining at its finest. Handles normalizing numbers, comparatives, and superlatives.

Blogposts

Listicles of Deep NLP Packages
