@honnibal
Created July 16, 2015 17:01
Syntax-specific search with spaCy
"""
Example use of the spaCy NLP tools for data exploration.
Here we will look for reddit comments that describe Google doing something,
i.e. discuss the company's actions. This is difficult, because other senses of
"Google" now dominate usage of the word in conversation, particularly references to
using Google products.
The heuristics here are quick and dirty --- about 5 minutes work. A better approach
is to use the word vector of the verb. But, the demo here is just to show what's
possible to build up quickly, to start to understand some data.
"""
from __future__ import unicode_literals
from __future__ import print_function
import sys
import plac
import bz2
import ujson
import spacy.en
def main(input_loc):
nlp = spacy.en.English() # Load the model takes 10-20 seconds.
for line in bz2.BZ2File(input_loc): # Iterate over the reddit comments from the dump.
comment_str = ujson.loads(line)['body'] # Parse the json object, and extract the 'body' attribute.
comment_parse = nlp(comment_str) # Apply the spaCy NLP pipeline.
for word in comment_parse: # Look for the cases we want
if google_doing_something(word):
# Print the clause
print(''.join(w.string for w in word.head.subtree).strip())
def google_doing_something(w):
if w.lower_ != 'google':
return False
elif w.dep_ != 'nsubj': # Is it the subject of a verb?
return False
elif w.head.lemma_ == 'be' and w.head.dep_ != 'aux': # And not 'is'
return False
elif w.head.lemma_ in ('say', 'show'): # Exclude e.g. "Google says..."
return False
else:
return True
if __name__ == '__main__':
plac.call(main)
@honnibal
Author

Well, the idea is that words-before-and-after is really just a proxy measure for the syntactic structure, which is a "tree". But we don't have to rely on the string order: spaCy gives you that tree directly :).

Like, compare these sentences (trees provided by CMU's parser, since I don't have spaCy linked up to a visualiser yet):

a) "a quick Google would show you're wrong"
http://demo.ark.cs.cmu.edu/parse?sentence=A%20quick%20Google%20would%20show%20you%27re%20wrong.
b) "Google shows you're wrong"
http://demo.ark.cs.cmu.edu/parse?sentence=A%20quick%20Google%20shows%20you%27re%20wrong.

You see the arc labelled "nsubj" from "show" to "Google"? That's the relationship we're checking for in the google_doing_something function. The dep_ attribute holds the label of the arc (e.g. nsubj), and the lemma_ attribute ensures we get the uninflected form ("show", not "shows").
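To see how those two checks interact without loading a parser, here's a minimal sketch of the google_doing_something filter run against mock tokens. The mock_token helper and the example dependency labels are mine (illustrative stand-ins for what spaCy's parser would produce), not part of spaCy's API; the filter logic itself is the same as in the gist.

```python
from types import SimpleNamespace


def mock_token(text, dep, head_lemma, head_dep='ROOT'):
    # Fake just the token attributes the filter reads:
    # lower_, dep_, and a head with lemma_ and dep_.
    return SimpleNamespace(
        lower_=text.lower(),
        dep_=dep,
        head=SimpleNamespace(lemma_=head_lemma, dep_=head_dep),
    )


def google_doing_something(w):
    if w.lower_ != 'google':
        return False
    elif w.dep_ != 'nsubj':                 # Is it the subject of a verb?
        return False
    elif w.head.lemma_ == 'be' and w.head.dep_ != 'aux':  # And not 'is'
        return False
    elif w.head.lemma_ in ('say', 'show'):  # Exclude e.g. "Google says..."
        return False
    else:
        return True


# "Google announced a new product": subject of 'announce' -> matches.
print(google_doing_something(mock_token('Google', 'nsubj', 'announce')))   # True
# "Google shows you're wrong": subject, but excluded by the 'show' lemma check.
print(google_doing_something(mock_token('Google', 'nsubj', 'show')))       # False
# "a quick Google search found it": a modifier, not a subject -> no match.
print(google_doing_something(mock_token('Google', 'compound', 'search')))  # False
```

The point of faking the tokens like this is that the filter only ever touches the parse tree's labels, never the raw string order, so the same three-line rule covers every inflection and word-order variant of "Google does X".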

The idea is to give representations that abstract away a lot of the incidental variation, so that you can write more precise rules for what you're looking for. The CMU parser page has an example of a representation that's more abstract still, the semantic parse. But then the accuracy starts to go down, and we get too many parse errors. The syntactic parse is a sort of compromise, where we can extract this "view" of the sentence reasonably reliably (about 92% of the arcs are correct), but abstract enough to be helpful.
