@daniel-j-h
Created August 16, 2013 18:06
Fun with natural language processing. Not really useful right now.
(ns nlp.core
  (:require [clojure.string]                      ; used fully qualified in hn-headlines-nlp
            [opennlp.nlp :as nlp]
            [opennlp.tools.filters :as filters]
            [net.cgrand.enlive-html :as html]))
(def ^:dynamic *base-url* "https://news.ycombinator.com")
;; get models from http://opennlp.sourceforge.net/models-1.5/
(def tokenize (nlp/make-tokenizer "models/en-token.bin"))
(def pos-tag (nlp/make-pos-tagger "models/en-pos-maxent.bin"))
(def get-sentences (nlp/make-sentence-detector "models/en-sent.bin"))
(def get-persons (nlp/make-name-finder "models/namefind/en-ner-person.bin"))
(def get-location (nlp/make-name-finder "models/namefind/en-ner-location.bin"))
;; Fetch a page and parse it into Enlive nodes.
(defn fetch-url [url]
  (html/html-resource (java.net.URL. url)))

;; Text of every link inside a td.title cell on the front page.
(defn hn-headlines []
  (map html/text (html/select (fetch-url *base-url*) [:td.title :a])))

;; Print nouns, verbs, locations and persons for each headline.
(defn hn-headlines-nlp []
  (doseq [line (hn-headlines)
          :let [tokens (tokenize line)
                tags (pos-tag tokens)
                ;; the filters return [word tag] pairs; keep only the word
                verbs (map first (filters/verbs tags))
                nouns (map first (filters/nouns tags))
                locations (get-location tokens)
                persons (get-persons tokens)
                fmt (fn [coll] (->> coll (interpose ", ") (apply str) (clojure.string/trim)))]]
    (println
     (apply format "Headline: %s\nNouns: %s\nVerbs: %s\nLocations: %s\nPersons: %s\n"
            line (map fmt [nouns verbs locations persons])))))

(defn -main [& args]
  (hn-headlines-nlp))
;; Sample output, from a test run right now
;;
;; Headline: PlanGrid is looking for dev-ops and Android engineers in San Francisco
;; Nouns: PlanGrid, dev-ops, Android, engineers, San, Francisco
;; Verbs: is, looking
;; Locations: San Francisco
;; Persons:
;;
;; Headline: Edward Snowden and Gen Y: a sign of leaks to come?
;; Nouns: Edward, Snowden, Gen, Y, sign, leaks
;; Verbs: come
;; Locations:
;; Persons: Edward Snowden
(defproject nlp "0.1.0-SNAPSHOT"
  :description "Natural language processing"
  :url "https://gist.github.com/daniel-j-h"
  :license {:name "MIT License"
            :url "http://www.opensource.org/licenses/mit-license.php"}
  :min-lein-version "2.0.0"
  :global-vars {*warn-on-reflection* true}
  :plugins [[lein-kibit "0.0.8"]
            [jonase/eastwood "0.0.2"]]
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [clojure-opennlp "0.3.1"]
                 [enlive "1.1.1"]]
  :main nlp.core)