poudelprakash/ruby_nlp.md

Ruby Natural Language Processing Resources

A collection of Natural Language Processing (NLP) Ruby libraries, tools and software. Suggestions and contributions are welcome.
Categories


APIs
Bitext Alignment
Books
Case
Chatterbot
Classification
Date and Time
Error Correction
Full-Text Search
Keyword Ranking
Language Detection
Machine Learning
Machine Translation
Miscellaneous
Multipurpose Tools
Named Entity Recognition
Ngrams
Parsers
Part-of-Speech Taggers
Readability
Regular Expressions
Ruby NLP Presentations
Sentence Segmentation
Speech-to-Text
Stemmers
Stop Words
Summarization
Text Extraction
Text Similarity
Text-to-Speech
Tokenizers
Word Count

APIs

Client libraries for various third-party NLP API services.

alchemy_api - provides a client API library for AlchemyAPI's NLP services
aylien_textapi_ruby - AYLIEN's officially supported Ruby client library for accessing Text API
biffbot - Ruby gem for Diffbot's APIs that extract Articles, Products, Images, Videos, and Discussions from any web page
BOTServer - Telegram Bot API Webhooks Framework, for Rubyists
gengo-ruby - a Ruby library to interface with the Gengo API for translation
monkeylearn-ruby - build and consume machine learning models for language processing from your Ruby apps
napi-ruby - a simple Ruby wrapper for the Maluuba nAPI
poliqarpr - Ruby client for Poliqarp text corpus server
TelegramBot - a charismatic Ruby client for Telegram's Bot API
TelegramBotRuby - yet another client for Telegram's Bot API
telegram-bot-ruby - Ruby wrapper for Telegram's Bot API
wlapi - Ruby based API for the project Wortschatz Leipzig

Books


Text Processing with Ruby by Rob Miller

Bitext Alignment

Bitext alignment is the process of aligning two parallel documents on a segment by segment basis. In other words, if you have one document in English and its translation in Spanish, bitext alignment is the process of matching each segment from document A with its corresponding translation in document B.

alignment - alignment functions for corpus linguistics (Gale-Church implementation)

Case


active_support - the Rails active_support gem has various string extensions that can handle case (e.g. .mb_chars.upcase.to_s or #transliterate)
string_pl - additional support for Polish encodings in Ruby 1.9
twitter-cldr-rb - casefolding
u - U extends Ruby’s Unicode support
unicode - Unicode normalization library
unicode_utils - Unicode algorithms for Ruby 1.9

Chatterbot


chatterbot - a straightforward Ruby-based Twitter bot framework that uses OAuth to authenticate
Lita - Lita is a chat bot written in Ruby with persistent storage provided by Redis

Classification

Classification aims to assign a document or piece of text to one or more classes or categories, making it easier to manage or sort.

Classifier - a general module to allow Bayesian and other types of classifications
classifier-reborn - (a fork of cardmagic/classifier) a general classifier module to allow Bayesian and other types of classifications
Latent Dirichlet Allocation - used to automatically cluster documents into topics
liblinear-ruby-swig - Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification and other large linear classifications)
linnaeus - a redis-backed Bayesian classifier
maxent_string_classifier - a JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework
Naive-Bayes - simple Naive Bayes classifier
nbayes - a full-featured, Ruby implementation of Naive Bayes
omnicat - a generalized rack framework for text classifications
omnicat-bayes - Naive Bayes text classification implementation as an OmniCat classifier strategy
stuff-classifier - a library for classifying text into multiple categories
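
To make the idea concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not the implementation used by any of the gems above; the `TinyBayes` class, its method names, and the training sentences are all hypothetical and exist only for illustration.

```ruby
# Illustrative multinomial Naive Bayes with add-one (Laplace) smoothing.
# TinyBayes is a hypothetical name, not part of any gem listed above.
class TinyBayes
  def initialize
    @word_counts = Hash.new { |hash, key| hash[key] = Hash.new(0) } # category => word => count
    @doc_counts  = Hash.new(0)                                     # category => training documents
    @vocabulary  = {}                                              # every word seen in training
  end

  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @vocabulary[word] = true
    end
  end

  # Returns the category with the highest log-probability for the text.
  def classify(text)
    total_docs = @doc_counts.values.sum.to_f
    words = tokenize(text)
    @doc_counts.keys.max_by do |category|
      total_words = @word_counts[category].values.sum
      prior = Math.log(@doc_counts[category] / total_docs)
      words.sum(prior) do |word|
        Math.log((@word_counts[category][word] + 1.0) / (total_words + @vocabulary.size))
      end
    end
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

classifier = TinyBayes.new
classifier.train(:spam, "win money now")
classifier.train(:spam, "free money offer")
classifier.train(:ham,  "meeting schedule for monday")
classifier.train(:ham,  "lunch on monday")

label = classifier.classify("free money")
puts label # prints "spam"
```

A real project would normally reach for one of the libraries in this section, which handle persistence, larger vocabularies, and other classification strategies.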

Date and Time


Chronic - a pure Ruby natural language date parser
Chronic Between - a simple Ruby natural language parser for date and time ranges
Chronic Duration - a simple Ruby natural language parser for elapsed time
Kronic - a dirt simple library for parsing and formatting human readable dates
Nickel - extracts date, time, and message information from naturally worded text
Tickle - a natural language parser for recurring events

Error Correction


Chat Correct - shows the errors and error types when a correct English sentence is diffed with an incorrect English sentence
gingerice - Ruby wrapper for correcting spelling and grammar mistakes based on the context of complete sentences

Full-Text Search


ferret - an information retrieval library in the same vein as Apache Lucene
ranguba - a project to provide a full-text search system built on Groonga

Keyword Ranking


graph-rank - Ruby implementation of the PageRank and TextRank algorithms
highscore - find and rank keywords in text

Language Detection


Detect Language API Client - detects language of given text and returns detected language codes and scores
whatlanguage - a language detection library for Ruby that uses bloom filters for speed

Machine Learning


Decision Tree - a Ruby library which implements the ID3 (information gain) algorithm for decision tree learning
rb-libsvm - implementation of SVM, a machine learning and classification algorithm
RubyFann - a Ruby gem that binds to FANN (Fast Artificial Neural Network) from within a Ruby/Rails environment

Machine Translation


Google API Client - Google API Ruby Client
microsoft_translator - Ruby client for the Microsoft Translator API
termit - Google Translate with speech synthesis in your terminal, as a Ruby gem

Miscellaneous


gibber - Gibber replaces text with nonsensical Latin with a maximum size difference of +/- 30%
hiatus - a localization QA tool
language_filter - a Ruby gem to detect and optionally filter multiple categories of language
Naturally - Natural (version number) sorting with support for legal document numbering, college course codes, and Unicode
rwordnet - a pure Ruby interface to the WordNet lexical/semantic database
sort_alphabetical - sort UTF-8 strings alphabetically via an Enumerable extension
stringex - some [hopefully] useful extensions to Ruby’s String class
twitter-text - gem that provides text processing routines for Twitter Tweets
nameable - A Ruby gem that provides parsing and output of person names, as well as Gender & Ethnicity matching
dialable - A Ruby gem that provides parsing and output of North American Numbering Plan (NANP) phone numbers, and includes location & time zones

Multipurpose Tools

The following are libraries that integrate multiple NLP tools or functionality.

nlp - NLP tools for the Polish language
NlpToolz - basic NLP tools, mostly based on OpenNLP; currently implements a sentence finder, tokenizer, and POS tagger, plus the Berkeley Parser
Open NLP (Ruby bindings)
Stanford Core NLP (Ruby bindings)
Treat - natural language processing framework for Ruby
twitter-cldr-rb - TwitterCldr uses Unicode's Common Locale Data Repository (CLDR) to format certain types of text into their localized equivalents
ve - a linguistic framework that's easy to use
zipf - a collection of various NLP tools and libraries

Named Entity Recognition


Confidential Info Redactor - a Ruby gem to semi-automatically redact confidential information from a text
ruby-ner - named entity recognition with Stanford NER and Ruby
ruby-nlp - Ruby binding for the Stanford POS Tagger and Named Entity Recognizer

Ngrams


N-Gram - N-Gram generator in Ruby
ngram - break words and phrases into ngrams
raingrams - a flexible and general-purpose ngrams library written in Ruby
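
As a rough illustration of what these libraries produce, word-level n-grams can be sketched with nothing but the standard library. The `ngrams` helper below is a hypothetical name, and its `\w+` tokenization is a simplifying assumption (in Ruby it matches ASCII word characters only).

```ruby
# Split text into lowercase word tokens, then slide a window of size n over them.
def ngrams(text, n)
  text.downcase.scan(/\w+/).each_cons(n).to_a
end

bigrams = ngrams("The quick brown fox", 2)
p bigrams # => [["the", "quick"], ["quick", "brown"], ["brown", "fox"]]
```

`Enumerable#each_cons` does the windowing, so the same helper yields trigrams with `n = 3` and an empty array when the text has fewer than `n` tokens.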

Parsers

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb.

linkparser - a Ruby binding for the Abiword version of CMU's Link Grammar, a syntactic parser of English
Parslet - A small PEG based parser library
rley - Ruby gem implementing a general context-free grammar parser based on Earley's algorithm
Treetop - a Ruby-based parsing DSL based on parsing expression grammars

Part-of-Speech Taggers


engtagger - English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
rbtagger - a simple Ruby rule-based part-of-speech tagger
TreeTagger for Ruby - Ruby based wrapper for the TreeTagger by Helmut Schmid

Readability


lingua - Lingua::EN::Readability is a Ruby module which calculates statistics on English text

Regular Expressions


CommonRegexRuby - find a lot of kinds of common information in a string
regexp-examples - generate strings that match a given regular expression
verbal_expressions - make difficult regular expressions easy

Ruby NLP Presentations


N-gram Analysis for Fun and Profit [tutorial] - Jesus Castello (2015)
Machine Learning made simple with Ruby [tutorial] - Lorenzo Masini (2015)
Using Ruby Machine Learning to Find Paris Hilton Quotes [tutorial] - Rick Carlino (2015)
Exploring Natural Language Processing in Ruby [slides] - Kevin Dias (2015)
Natural Language Parsing with Ruby [tutorial] - Glauco Custódio (2014)
Demystifying Data Science (Analyzing Conference Talks with Rails and Ngrams) [video RailsConf 2014 | Repo from the Video] - Todd Schneider (2014)
Natural Language Processing with Ruby [video ArrrrCamp 2014 | video Ruby Conf India] - Konstantin Tennhard (2014)
How to parse 'go' - Natural Language Processing in Ruby [slides] - Tom Cartwright (2013)
Natural Language Processing in Ruby [slides | video] - Brandon Black (2013)
Natural Language Processing with Ruby: n-grams [tutorial] - Nathan Kleyn (2013)
A Tour Through Random Ruby [tutorial] - Robert Qualls (2013)

Sentence Segmentation

Sentence segmentation (aka sentence boundary disambiguation, sentence boundary detection) is the problem in natural language processing of deciding where sentences begin and end. Sentence segmentation is the foundation of many common NLP tasks (machine translation, bitext alignment, summarization, etc.).

Pragmatic Segmenter
Punkt Segmenter
TactfulTokenizer
Scalpel
SRX English
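
To see why this is called disambiguation, consider a deliberately naive splitter that breaks on sentence-ending punctuation followed by whitespace and a capital letter. The `naive_sentences` helper is hypothetical and is not how any of the segmenters above work; it is only a sketch of the baseline they improve on.

```ruby
# Split after ., ! or ? when followed by whitespace and an uppercase letter.
def naive_sentences(text)
  text.split(/(?<=[.!?])\s+(?=[A-Z])/)
end

p naive_sentences("Hello world. How are you? Fine.")
# => ["Hello world.", "How are you?", "Fine."]

p naive_sentences("Dr. Smith arrived. He sat down.")
# => ["Dr.", "Smith arrived.", "He sat down."]
```

The second call splits after the abbreviation "Dr.", which is exactly the kind of boundary error that dedicated segmenters are built to avoid.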

Speech-to-Text


att_speech - A Ruby library for consuming the AT&T Speech API for speech to text
pocketsphinx-ruby - Ruby speech recognition with Pocketsphinx
Speech2Text - provides a simple interface to convert audio files using the Google Speech to Text API

Stemmers

Stemming is the term used in linguistic morphology and information retrieval to describe the process for reducing inflected (or sometimes derived) words to their word stem, base or root form.

Greek stemmer - a Greek stemmer
Ruby-Stemmer - Ruby-Stemmer exposes the SnowBall API to Ruby
Turkish stemmer - a Turkish stemmer
uea-stemmer - a conservative stemmer for search and indexing
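
The reduction described above can be sketched as simple suffix stripping. This is a toy, not the Snowball or UEA algorithm; the `naive_stem` helper, its suffix list, and the minimum stem length of three characters are all assumptions made for illustration.

```ruby
# Strip the first matching suffix, provided at least three characters remain.
def naive_stem(word)
  suffixes = %w[ing ed ly es s]
  stem = word.downcase
  suffix = suffixes.find { |s| stem.end_with?(s) && stem.length - s.length >= 3 }
  suffix ? stem[0, stem.length - suffix.length] : stem
end

p naive_stem("walking") # => "walk"
p naive_stem("boxes")   # => "box"
p naive_stem("sing")    # => "sing" (too short to strip "ing")
p naive_stem("running") # => "runn" (a real stemmer would handle the doubled consonant)
```

The last result shows why real stemmers apply ordered, language-specific rules rather than a flat suffix list.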

Stop Words


clarifier
stopwords - really just a list of stopwords with some helpers
Stopwords Filter - a very simple and naive implementation of a stopwords filter that removes a list of banned words (stopwords) from a sentence
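
A stop-word filter of the kind described above can be sketched in a few lines. The `STOP_WORDS` list and the `remove_stop_words` helper are hypothetical and far shorter than the lists these gems ship with.

```ruby
require "set"

# A tiny, illustrative stop-word list; real lists contain hundreds of entries.
STOP_WORDS = Set.new(%w[a an the is of and to in]).freeze

# Drop every whitespace-separated token whose lowercase form is a stop word.
def remove_stop_words(sentence, stop_words = STOP_WORDS)
  sentence.split.reject { |word| stop_words.include?(word.downcase) }.join(" ")
end

puts remove_stop_words("The cat is in the hat") # prints "cat hat"
```

Splitting on whitespace leaves punctuation attached to tokens, so a word such as "the," would slip through; pairing the filter with a tokenizer avoids that.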

Summarization

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.

Epitome - A small gem to make your text shorter; an implementation of the Lexrank algorithm
ots - Ruby bindings to open text summarizer
summarize - Ruby C wrapper for Open Text Summarizer
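
One simple way to approximate this is extractive summarization: score each sentence by the average corpus frequency of its words and keep the top scorers in their original order. The sketch below is an assumption-laden toy, not LexRank and not what Open Text Summarizer does; the `summarize` helper and its regex-based sentence and word splitting are hypothetical.

```ruby
# Keep the sentence_count sentences whose words are, on average, most frequent in the text.
def summarize(text, sentence_count = 1)
  sentences = text.split(/(?<=[.!?])\s+/)
  frequencies = Hash.new(0)
  text.downcase.scan(/[a-z']+/) { |word| frequencies[word] += 1 }

  ranked = sentences.each_with_index.sort_by do |sentence, index|
    words = sentence.downcase.scan(/[a-z']+/)
    score = words.empty? ? 0.0 : words.sum { |word| frequencies[word] } / words.length.to_f
    [-score, index] # highest score first, earlier sentence wins ties
  end

  # Restore document order before joining the selected sentences.
  ranked.first(sentence_count).sort_by { |_, index| index }.map(&:first).join(" ")
end

text = "Ruby is fun. Ruby is a language. Cats sleep."
puts summarize(text, 1) # prints "Ruby is fun."
```

Because no stop words are removed, frequent function words such as "is" inflate the scores; filtering them first usually gives more sensible rankings.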

Text Extraction


docsplit - Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts
rtesseract - Ruby library for working with the Tesseract OCR
Ruby Readability - a tool for extracting the primary readable content of a webpage
ruby-tesseract - binds the TessBaseAPI object through ffi-inline (so it also works on JRuby) and wraps the API in a more Ruby-esque Engine class
Yomu - a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit

Text Similarity


amatch - a collection of five types of distances between strings, including Levenshtein, Sellers, Jaro-Winkler, and 'pair distance' (the last one seems to work well for finding similarity in long phrases)
damerau-levenshtein - calculates edit distance using the Damerau-Levenshtein algorithm
FuzzyMatch - find a needle in a haystack based on string similarity and regular expression rules
fuzzy-string-match - fuzzy string matching library for Ruby
FuzzyTools - In-memory TF-IDF fuzzy document finding with a fancy default tokenizer tuned on diverse record linkage datasets for easy out-of-the-box use
Going the Distance - contains scripts that do various distance calculations
hotwater - Fast Ruby FFI string edit distance algorithms
levenshtein-ffi - fast string edit distance computation, using the Damerau-Levenshtein algorithm
TF-IDF - Term Frequency - Inverse Document Frequency in Ruby
tf-idf-similarity - calculate the similarity between texts using tf*idf
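
Several of the libraries above compute edit distance. For reference, plain Levenshtein distance (insertions, deletions, and substitutions, without the transpositions that Damerau-Levenshtein adds) can be sketched in pure Ruby as follows. The `levenshtein` helper is a hypothetical name and is much slower than the FFI and C implementations listed here.

```ruby
# Classic dynamic-programming Levenshtein distance, keeping only two rows in memory.
def levenshtein(a, b)
  previous = (0..b.length).to_a # distances from the empty prefix of a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # deletion
        current[j - 1] + 1,    # insertion
        previous[j - 1] + cost # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

puts levenshtein("kitten", "sitting") # prints 3
```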

Text-to-Speech


espeak-ruby - small Ruby API for utilizing 'espeak' and 'lame' to create text-to-speech mp3 files
Isabella - a voice-computing assistant built in Ruby
tts - a Ruby gem for converting text to speech using the Google Translate service

Tokenizers


Jieba - Chinese tokenizer and segmenter (JRuby)
MeCab - Japanese morphological analyzer [MeCab Heroku buildpack]
NLP Pure - natural language processing algorithms implemented in pure Ruby with minimal dependencies
Pragmatic Tokenizer - a multilingual tokenizer to split a string into tokens
rseg - a Chinese Word Segmentation (中文分词) routine in pure Ruby
Textoken - Simple and customizable text tokenization gem
thailang4r - Thai tokenizer
tiny_segmenter - Ruby port of TinySegmenter.js for tokenizing Japanese text
tokenizer - a simple multilingual tokenizer

Word Count


wc - a Ruby gem to count word occurrences in a given text
word_count - a word counter for String and Hash in Ruby
Word Count Analyzer - analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used
WordsCounted - a highly customisable Ruby text analyser