Created
August 16, 2013 18:06
Fun with natural language processing. Not really useful right now.
(ns nlp.core
  (:require [clojure.string :as string]
            [opennlp.nlp :as nlp]
            [opennlp.tools.filters :as filters]
            [net.cgrand.enlive-html :as html]))

(def ^:dynamic *base-url* "https://news.ycombinator.com")

;; get models from http://opennlp.sourceforge.net/models-1.5/
(def tokenize (nlp/make-tokenizer "models/en-token.bin"))
(def pos-tag (nlp/make-pos-tagger "models/en-pos-maxent.bin"))
(def get-sentences (nlp/make-sentence-detector "models/en-sent.bin"))
(def get-persons (nlp/make-name-finder "models/namefind/en-ner-person.bin"))
(def get-location (nlp/make-name-finder "models/namefind/en-ner-location.bin"))

;; fetch a page and parse it into enlive nodes
(defn fetch-url [^String url]
  (html/html-resource (java.net.URL. url)))

;; text of every link inside a td.title cell on the front page
(defn hn-headlines []
  (map html/text (html/select (fetch-url *base-url*) [:td.title :a])))

;; tokenize and tag each headline, then print its nouns, verbs, locations and persons
(defn hn-headlines-nlp []
  (doseq [line (hn-headlines)
          :let [tokens (tokenize line)
                tags (pos-tag tokens)
                verbs (map first (filters/verbs tags))
                nouns (map first (filters/nouns tags))
                locations (get-location tokens)
                persons (get-persons tokens)
                fmt (fn [coll] (string/join ", " coll))]]
    (println
      (apply format "Headline: %s\nNouns: %s\nVerbs: %s\nLocations: %s\nPersons: %s\n"
             line (map fmt [nouns verbs locations persons])))))

(defn -main [& args]
  (hn-headlines-nlp))
;; Sample output, from a test run right now

Headline: PlanGrid is looking for dev-ops and Android engineers in San Francisco
Nouns: PlanGrid, dev-ops, Android, engineers, San, Francisco
Verbs: is, looking
Locations: San Francisco
Persons:

Headline: Edward Snowden and Gen Y: a sign of leaks to come?
Nouns: Edward, Snowden, Gen, Y, sign, leaks
Verbs: come
Locations:
Persons: Edward Snowden
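
The report format can be tried without the OpenNLP models or network access. The following is a minimal sketch, not part of the original gist: `format-report` is a hypothetical helper name, and the word lists are typed in by hand from the sample output instead of coming from the tokenizer, tagger and name finders.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper (not in the original gist): builds the five-line
;; report for one headline from word lists that were already extracted.
(defn format-report
  [headline {:keys [nouns verbs locations persons]}]
  (let [fmt (fn [coll] (string/join ", " coll))]
    (format "Headline: %s\nNouns: %s\nVerbs: %s\nLocations: %s\nPersons: %s\n"
            headline (fmt nouns) (fmt verbs) (fmt locations) (fmt persons))))

;; Word lists copied by hand from the second sample headline.
(print
  (format-report "Edward Snowden and Gen Y: a sign of leaks to come?"
                 {:nouns     ["Edward" "Snowden" "Gen" "Y" "sign" "leaks"]
                  :verbs     ["come"]
                  :locations []
                  :persons   ["Edward Snowden"]}))
```

An empty list simply leaves the field blank after its label, which matches the empty "Locations:" and "Persons:" lines in the sample output.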
(defproject nlp "0.1.0-SNAPSHOT"
  :description "Natural language processing"
  :url "https://gist.github.com/daniel-j-h"
  :license {:name "MIT License"
            :url "http://www.opensource.org/licenses/mit-license.php"}
  :min-lein-version "2.0.0"
  :global-vars {*warn-on-reflection* true}
  :plugins [[lein-kibit "0.0.8"]
            [jonase/eastwood "0.0.2"]]
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [clojure-opennlp "0.3.1"]
                 [enlive "1.1.1"]]
  :main nlp.core)