loadWordNet.gremlin
import java.io.*;
import java.util.HashMap;
/**
Introduction to the WordNet data model:
The core concept in WordNet is the synset. A synset groups words with a synonymous meaning, such as {car, auto, automobile, machine, motorcar}. Another sense of the word "car" is recorded in the synset {car, railcar, railway car, railroad car}. Although both synsets contain the word "car", they are different entities in WordNet because they have a different meaning. More precisely: a synset contains one or more word senses and each word sense belongs to exactly one synset. In turn, each word sense has exactly one word that represents it lexically, and one word can be related to one or more word senses.
There are four disjoint kinds of synset, containing either nouns, verbs, adjectives or adverbs. There is one more specific kind of adjective called an adjective satellite. Furthermore, WordNet defines seventeen relations (called pointers by the WordNet folks), ten of which hold between synsets (hyponymy, entailment, similarity, member meronymy, substance meronymy, part meronymy, classification, cause, verb grouping, attribute) and five between word senses (derivational relatedness, antonymy, see also, participle, pertains to). The remaining relations are "gloss" (between a synset and a sentence) and "frame" (between a synset and a verb construction pattern). There is also a more specific kind of word: collocations, which are indicated by hyphens or underscores (an underscore stands for a space character). -- http://www.w3.org/TR/2006/WD-wordnet-rdf-20060619/#figure1
WordNet consists of INDEX files and DATA files.
This script parses both of these file types into a graph database using Gremlin, Pipes and the Blueprints API.
More information on these software projects can be found here:
https://github.com/tinkerpop/gremlin/wiki
More information on WordNet can be found here:
http://wordnet.princeton.edu/wordnet/download/current-version/
http://wordnet.princeton.edu/man/wndb.5WN.html
http://wordnetcode.princeton.edu/3.0/WNdb-3.0.tar.gz
WordNet basically contains the following files:
data.adj data.adv data.noun data.verb index.adj index.adv index.noun index.sense index.verb
DATA files and INDEX files have different formats. Here is an example using UNIX grep to explore WordNet for the word "whole" (painful):
$ grep "^whole " index.noun
whole n 2 4 @ ~ %p + 2 1 05869584 00003553
$ grep "^05869584" data.noun
05869584 09 n 01 whole 0 006 @ 05835747 n 0000 + 00514884 a 0101 %p 05867413 n 0000 ~ 05869857 n 0000 ~ 05870180 n 0000 ~ 05870365 n 0000 | all of something including all its component elements or parts; "Europe considered as a whole"; "the whole of American literature"
$ grep "^00003553" data.noun
00003553 03 n 02 whole 0 unit 0 015 @ 00002684 n 0000 + 01462005 v 0204 + 00367685 v 0201 + 01385458 v 0201 + 00368109 v 0201 + 00784215 a 0103 ~ 00003993 n 0000 ~ 00004258 n 0000 ~ 00019128 n 0000 ~ 00021939 n 0000 ~ 02749953 n 0000 ~ 03588414 n 0000 %p 03892891 n 0000 %p 04164989 n 0000 ~ 04353803 n 0000 | an assemblage of parts that is regarded as a single entity; "how big is that part compared to the whole?"; "the team is a unit"
$ grep "^whole " index.adj
whole a 5 5 ! & ^ = + 5 1 00514884 00517916 01319712 01171396 00784215
$ grep "^00514884" data.adj
00514884 00 a 01 whole 0 010 ^ 00520214 a 0000 = 14460565 n 0000 + 05869584 n 0101 ! 00516539 a 0101 & 00515380 a 0000 & 00515622 a 0000 & 00515753 a 0000 & 00515870 a 0000 & 00516231 a 0000 & 00516360 a 0000 | including all components without exception; being one unit or constituting the full amount or extent or duration; complete; "gave his whole attention"; "a whole wardrobe for the tropics"; "the whole hog"; "a whole week"; "the baby cried the whole trip home"; "a whole loaf of bread"
$ grep "^00517916" data.adj
00517916 00 a 01 whole 2 001 ! 00518035 a 0101 | (of siblings) having the same parents; "whole brothers and sisters"
$ grep "^01319712" data.adj
01319712 00 s 04 unharmed 0 unhurt 0 unscathed 0 whole 0 001 & 01319182 a 0000 | not injured
$ grep "^01171396" data.adj
01171396 00 s 02 hale 0 whole 0 003 & 01170243 a 0000 + 14050011 n 0201 + 14050011 n 0102 | exhibiting or restored to vigorous good health; "hale and hearty"; "whole in mind and body"; "a whole person again"
$ grep "^00784215" data.adj
00784215 00 s 03 solid 0 unanimous 0 whole 0 004 & 00783675 a 0000 + 00003553 n 0301 + 14460565 n 0303 + 13972387 n 0201 | acting together as a single undiversified whole; "a solid voting bloc"
$ grep "^whole " index.adv
whole r 1 1 ; 1 1 00008007
$ grep ^00008007 data.adv
00008007 02 r 07 wholly 0 entirely 0 completely 4 totally 0 all 0 altogether 4 whole 0 006 ;u 07075172 n 0000 \ 00515380 a 0403 \ 00520214 a 0301 \ 00515380 a 0201 \ 00514884 a 0101 ! 00007703 r 0102 | to a complete degree or to the full or entire extent (`whole' is often used informally for `wholly'); "he was wholly convinced"; "entirely satisfied with the meal"; "it was completely different from what we expected"; "was completely at fault"; "a totally new situation"; "the directions were all wrong"; "it was not altogether her fault"; "an altogether new approach"; "a whole new idea"
The alternative to all of that in Gremlin is:
g.idx(T.v)[[lemma:'whole']].bothE.inV.map
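For example (a sketch, assuming the graph has been loaded by this script and the default automatic vertex index is in place), the glosses for every sense of "whole" can then be pulled with:
g.idx(T.v)[[lemma:'whole']].out('hasSynset').gloss
and its syntactic categories with:
g.idx(T.v)[[lemma:'whole']].out('pos').name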
At its most basic level, WordNet contains 'words' (lemmas). Each word can have multiple 'definitions', alternate uses, or 'semantics'. The 'words' are obtained by parsing the INDEX files. A 'word form' (word) will be used here to refer to the physical utterance or inscription, and a 'word meaning' (semantic use/synset) to refer to the lexicalized concept that a form can be used to express. The starting point for lexical semantics can then be said to be the mapping between forms and meanings. Alternative semantic uses for a word are obtained by parsing the DATA files. The DATA files also encode relationships between words within a given semantic concept.
WordNet was initially concerned with the pattern of semantic relations between lexicalized concepts (the synsets - or a set of synonyms); that is to say, it was to be a theory of word meanings. As work proceeded, however, it became increasingly clear that lexical relations of word syntax could not be ignored. At present, WordNet distinguishes between semantic relations and lexical relations; the emphasis is still on semantic relations between meanings, but relations between words are also included.
WordNet is organized by semantic relations. Since a semantic relation is a relation between meanings, and since meanings can be represented by synsets, it is natural to think of semantic relations as pointers (edges/relations) between synsets. It is characteristic of semantic relations that they are reciprocated.
*/
/**
Set up base classes in the ontology, such as parts of speech and LexPointers
*/
//Declared without a type so they land in the script binding and stay visible inside the
//methods defined below (a typed declaration would be local to the script body).
ontRoot = null
languageparts = null
public void create_ontology_classes(Graph graph){
//OWL:Class - the root of the typing system.
ontRoot = graph.addVertex(null);
ontRoot.setProperty("name","OWL:Class");
ontRoot.setProperty("type","Class");
/*
Parts of speech
*/
languageparts = graph.addVertex(null);
languageparts.setProperty("name","LanguageParts");
languageparts.setProperty("type","Class");
graph.addEdge(null, languageparts, ontRoot, "isA");
//Noun:
createDescendant(graph, languageparts, "Noun", "Class", "n");
//Verb:
createDescendant(graph, languageparts, "Verb", "Class", "v");
//Adjective:
createDescendant(graph, languageparts, "Adjective", "Class", "a");
//Adverb:
createDescendant(graph, languageparts, "Adverb", "Class", "r");
//ADJECTIVE SATELLITE
createDescendant(graph, languageparts, "Adjective_Satellite", "Class", "s");
//Pointers: (Lexicographer Shorthand - see below)
create_pointer_subgraph(graph, ontRoot);
}
/**
Utility functions
*/
public boolean isHeader(String line){
//header/copyright lines begin with whitespace and a line number, e.g. "  1 This software and database is..."
return line.matches("^\\s+\\d+\\s+.*");
}
/**
Index File Format:
The purpose of the index file is to list the set of words (lemmas) that are in the WordNet catalog.
Each index file begins with several lines containing a copyright notice, version number and license agreement. These lines all begin with two spaces and the line number so they do not interfere with the binary search algorithm that is used to look up entries in the index files. All other lines are in the following format. In the field descriptions, number always refers to a decimal integer unless otherwise defined.
lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]
* lemma - lower case ASCII text of word or collocation. Collocations are formed by joining individual words with an underscore (_) character.
* pos - Syntactic category: n for noun files, v for verb files, a for adjective files, r for adverb files.
All remaining fields are with respect to senses of lemma in pos.
* synset_cnt - Number of synsets that lemma is in. This is the number of senses of the word in WordNet. See Sense Numbers below for a discussion of how sense numbers are assigned and the order of synset_offsets in the index files.
* p_cnt - Number of different pointers that lemma has in all synsets containing it.
* ptr_symbol - A space separated list of p_cnt different types of pointers that lemma has in all synsets containing it. See wninput(5WN) for a list of pointer_symbols. If all senses of lemma have no pointers, this field is omitted and p_cnt is 0.
* sense_cnt - Same as synset_cnt above. This is redundant; the field was preserved for compatibility reasons.
* tagsense_cnt - Number of senses of lemma that are ranked according to their frequency of occurrence in semantic concordance texts.
* synset_offset - Byte offset in the data.pos file of a synset containing lemma. Each synset_offset in the list corresponds to a different sense of lemma in WordNet. synset_offset is an 8 digit, zero-filled decimal integer that can be used with fseek(3) to read a synset from the data file. When passed to read_synset(3WN) along with the syntactic category, a data structure containing the parsed synset is returned.
For example:
zero a 4 2 & \ 4 3 02186132 02269142 02201882 03145851
lemma = zero
pos = a for adjective
synset_cnt = 4
p_cnt = 2
ptr_symbol = & and \
sense_cnt = 4
tagsense_cnt = 3
synset_offset = [02186132, 02269142, 02201882, 03145851]
*/
public Vertex parseIndexLine(Graph g, String line){
//println line;
String[] tokens = line.split("\\s")
//println tokens;
if(tokens.size() > 6){//have to have at least 7 columns
Vertex vword = g.addVertex(null);
vword.setProperty("type","Word");
vword.setProperty("lemma",tokens[0].trim());
connectSyntacticCategory(g, tokens[1].trim(), vword);
vword.setProperty("synset_cnt",tokens[2].trim());
vword.setProperty("p_cnt",tokens[3].trim());
def i = 4;
def stage = 0; //stage 0 - represents the tokens before the pointers symbols (e.g. & \ )
while(i < tokens.size()){
if(stage == 0){
//println "Stage0: make sure we are not on an all digits token, current token: " + tokens[i].trim()
if( !tokens[i].trim().matches("[0123456789]+") )
stage = 1;
if(i == 4 && tokens[i].trim().matches("[0123456789]+")){
//println "NO pointer symbols exist, going directly to Stage2"
stage = 2;
}
}
if(stage == 1){
//println "Stage1: process the pointer symbols (e.g. ! & \\ ) " + tokens[i].trim()
if( tokens[i].trim().matches("[0123456789]+") ){
//println "We are out of pointers, going directly to Stage2"
stage = 2;
}else {
//println "Processing pointer: " + tokens[i].trim()
connectPointer(g, tokens[i].trim(), tokens[1].trim(), vword, "ptr_symbol");
}
}
if(stage == 2){
//println "Stage2 process sense_cnt " + tokens[i].trim()
vword.setProperty("sense_cnt",tokens[i].trim());
i++;
stage = 3;
}
if(stage == 3){
//println "Stage3: process tagsense_cnt " + tokens[i].trim()
vword.setProperty("tagsense_cnt",tokens[i].trim());
i++;
stage = 4;
}
if(stage == 4){
//println "Stage4: process synset_offset " + tokens[i].trim()
Vertex v = getCreateSynset(g, tokens[i].trim())
g.addEdge(null, vword, v, "hasSynset"); //words are associated with sunset usages, these are called Lemmas.
}
i++;
}
//println vword.map();
return vword;
}
return null;
}
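/*
A usage sketch for parseIndexLine (hypothetical; assumes g is an open graph and
create_ontology_classes has already run), using the "zero" example from the comment above:
Vertex v = parseIndexLine(g, "zero a 4 2 & \\ 4 3 02186132 02269142 02201882 03145851");
assert v.getProperty("synset_cnt").equals("4");
assert v.getProperty("tagsense_cnt").equals("3");
*/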
/**
Individual words (objects of type Word) need to be connected to the ontology class for the kind of word they are (e.g. noun, verb, adverb, adjective, ...).
*/
public Edge connectSyntacticCategory(Graph g, String categoryToken, Vertex v){
//query the ontology for the vertex Class representing the type of thing we are dealing with…
//println "connectSyntacticCategory: categoryToken: [" + categoryToken + "]";
scat = g.idx(T.v)[[name:'OWL:Class']].inE("isA").outV.inE("isA").outV.filter{it.WNToken.equals(categoryToken)}.toList()
if(scat.size() == 1){ //if it matched the token, link the word node to the adjective pointer.
return g.addEdge(null, v, scat[0], "pos");
}
return null;
}
/**
This directly connects a vertex, v, to a pointer Class object.
e.g.
v --relationshipType--> pointerClass
categoryToken = the shorthand for the pointer e.g. !, @, #p, %m, =, ;c and so on. The complete pointer list is below
partOfSpeechToken = n for noun, v for verb, a for adj, and r for adverb (same as in the Classes representing parts of speech above)
v = the vertex we want to connect to the pointer
relationshipType - the value on the edge making the connection.
*/
public Edge connectPointer(Graph g, String categoryToken, String partOfSpeechToken, Vertex v, String relationshipType){
//println "find the pointer class: categoryToken: [" + categoryToken + "]";
//Let me go through the logic here; it looks complex, but is actually quite simple once you understand the pipeline
// 1) g.idx(T.v)[[type:'Class']] --- go get all vertices that are of type 'Class'
// 2) .filter{it.WNToken.equals(categoryToken)} --- now filter out those classes that have a Lexicographer Shorthand agreeing with the input request
// 3) .outE("isA").inV.inE("LexicographerPointer").outV --- now make sure that the pointer is a subclass of type pointer
// and find the associated part of speech (e.g. noun, verb) by following the LexicographerPointer edge
// 4) .filter{it.WNToken.equals(partOfSpeechToken)} --- filter out everything but the pointer used for THIS part of speech e.g. a subclass of nounPointer if it is a noun
// 5) .back(5) --- go back and get the pointer object node.
l = g.idx(T.v)[[type:'Class']].filter{it.WNToken.equals(categoryToken)}.outE("isA").inV.inE("LexicographerPointer").outV.filter{it.WNToken.equals(partOfSpeechToken)}.back(5).toList()
if(l.size() == 1){
//println "Adding edge of type (categoryToken, partOfSpeechToken, relationshipType): (" + categoryToken + "," + partOfSpeechToken + "," + relationshipType + ")"
return g.addEdge(null, v, l[0], relationshipType);
}else{
//println "Pointer not found"
return null;
}
}
public Vertex getCreateSynset(Graph g, String synsetID){
//Use gremlin to try to find the vertex.
l = g.idx(T.v)[[synset_id:synsetID]].toList();
if(l.size() > 0){//we found one
//println l[0].map{}
return l[0];
}else {//if we did not find it, then make a new node!
Vertex v = g.addVertex(null);
v.setProperty("synset_id", synsetID);
v.setProperty("type","Synset");
return v;
}
}
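/*
Note that getCreateSynset is idempotent: a second call with the same ID finds the vertex
created by the first call instead of duplicating it (a sketch, assuming the automatic index):
Vertex a = getCreateSynset(g, "00003553");
Vertex b = getCreateSynset(g, "00003553");
assert a.getId().equals(b.getId());
*/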
/**
Data File Format:
Each data file begins with several lines containing a copyright notice, version number and license agreement. These lines all begin with two spaces and the line number. All other lines are in the following format. Integer fields are of fixed length, and are zero-filled.
synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss
* synset_offset - Current byte offset in the file represented as an 8 digit decimal integer. (represented in the graph by synset_id)
* lex_filenum - Two digit decimal integer corresponding to the lexicographer file name containing the synset. See lexnames(5WN) for the list of filenames and their corresponding numbers.
* ss_type
One character code indicating the synset type:
n NOUN
v VERB
a ADJECTIVE
s ADJECTIVE SATELLITE
r ADVERB
* w_cnt - Two digit hexadecimal integer indicating the number of words in the synset.
* word - ASCII form of a word as entered in the synset by the lexicographer, with spaces replaced by underscore characters (_). The text of the word is case sensitive, in contrast to its form in the corresponding index.pos file, which contains only lower-case forms. In data.adj, a word is followed by a syntactic marker if one was specified in the lexicographer file. A syntactic marker is appended, in parentheses, onto word without any intervening spaces. See wninput(5WN) for a list of the syntactic markers for adjectives.
* lex_id - One digit hexadecimal integer that, when appended onto lemma, uniquely identifies a sense within a lexicographer file. lex_id numbers usually start with 0, and are incremented as additional senses of the word are added to the same file, although there is no requirement that the numbers be consecutive or begin with 0. Note that a value of 0 is the default, and therefore is not present in lexicographer files.
* p_cnt - Three digit decimal integer indicating the number of pointers from this synset to other synsets. If p_cnt is 000 the synset has no pointers.
* ptr
A pointer from this synset to another. ptr is of the form:
pointer_symbol synset_offset pos source/target
where synset_offset is the byte offset of the target synset in the data file corresponding to pos.
The source/target field distinguishes lexical and semantic pointers. It is a four byte field containing two two-digit hexadecimal integers. The first two digits indicate the word number in the current (source) synset, and the last two digits indicate the word number in the target synset. A value of 0000 means that pointer_symbol represents a semantic relation between the current (source) synset and the target synset indicated by synset_offset.
A lexical relation between two words in different synsets is represented by non-zero values in the source and target word numbers. The first and last two bytes of this field indicate the word numbers in the source and target synsets, respectively, between which the relation holds. Word numbers are assigned to the word fields in a synset, from left to right, beginning with 1.
See wninput(5WN) for a list of pointer_symbol s, and semantic and lexical pointer classifications.
* frames
In data.verb only, a list of numbers corresponding to the generic verb sentence frames for words in the synset. frames is of the form:
f_cnt + f_num w_num [ + f_num w_num...]
where f_cnt is a two digit decimal integer indicating the number of generic frames listed, f_num is a two digit decimal integer frame number, and w_num is a two digit hexadecimal integer indicating the word in the synset that the frame applies to. As with pointers, if this number is 00, f_num applies to all words in the synset. If non-zero, it is applicable only to the word indicated. Word numbers are assigned as described for pointers. Each f_num w_num pair is preceded by a +. See wninput(5WN) for the text of the generic sentence frames.
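For example, a (made-up) frames field of 02 + 09 00 + 11 01 would declare two generic frames: frame 09 applying to every word in the synset, and frame 11 applying only to word 1.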
* gloss
Each synset contains a gloss. A gloss is represented as a vertical bar (|), followed by a text string that continues until the end of the line. The gloss may contain a definition, one or more example sentences, or both.
Example:
00003553 03 n 02 whole 0 unit 0 015 @ 00002684 n 0000 + 01462005 v 0204 + 00367685 v 0201 + 01385458 v 0201 + 00368109 v 0201 + 00784215 a 0103 ~ 00003993 n 0000 ~ 00004258 n 0000 ~ 00019128 n 0000 ~ 00021939 n 0000 ~ 02749953 n 0000 ~ 03588414 n 0000 %p 03892891 n 0000 %p 04164989 n 0000 ~ 04353803 n 0000 | an assemblage of parts that is regarded as a single entity; "how big is that part compared to the whole?"; "the team is a unit"
synset_offset = 00003553 (we call this property the synset_id)
lex_filenum = 03
ss_type = n
w_cnt = 02
word = whole and unit -- these guys need to point at the nodes for indexed words loaded by parsing the index
lex_id = 0 and 0
p_cnt = 015
ptr = the following table shows how the remaining columns parse, and how each relationship will be encoded as an edge in the graph:
[pointer_symbol, synset_offset, pos, source/target]
@ 00002684 n 0000 = 00003553 --Hypernym--> 00002684 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00") ; Edge.setProperty("pos", "n")
+ 01462005 v 0204 = 00003553 --Derivationally_related_form--> 01462005 ; Edge.setProperty("source", "02"); Edge.setProperty("target", "04") ; Edge.setProperty("pos", "v")
+ 00367685 v 0201 = 00003553 --Derivationally_related_form--> 00367685 ; Edge.setProperty("source", "02"); Edge.setProperty("target", "01") ; Edge.setProperty("pos", "v")
+ 01385458 v 0201 = 00003553 --Derivationally_related_form--> 01385458 ; Edge.setProperty("source", "02"); Edge.setProperty("target", "01") ; Edge.setProperty("pos", "v")
+ 00368109 v 0201 = 00003553 --Derivationally_related_form--> 00368109 ; Edge.setProperty("source", "02"); Edge.setProperty("target", "01") ; Edge.setProperty("pos", "v")
+ 00784215 a 0103 = 00003553 --Derivationally_related_form--> 00784215 ; Edge.setProperty("source", "02"); Edge.setProperty("target", "03") ; Edge.setProperty("pos", "a")
~ 00003993 n 0000 = 00003553 --Hyponym--> 00003993 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00") ; Edge.setProperty("pos", "n")
~ 00004258 n 0000 = 00003553 --Hyponym--> 00004258 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00") ; Edge.setProperty("pos", "n")
~ 00019128 n 0000 = 00003553 --Hyponym--> 00019128 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00") ; Edge.setProperty("pos", "n")
~ 00021939 n 0000 = 00003553 --Hyponym--> 00021939 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00") ; Edge.setProperty("pos", "n")
~ 02749953 n 0000 = 00003553 --Hyponym--> 02749953 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00") ; Edge.setProperty("pos", "n")
~ 03588414 n 0000 = 00003553 --Hyponym--> 03588414 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00") ; Edge.setProperty("pos", "n")
%p 03892891 n 0000 = 00003553 --Part_meronym--> 03892891 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00") ; Edge.setProperty("pos", "n")
%p 04164989 n 0000 = 00003553 --Part_meronym--> 04164989 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00") ; Edge.setProperty("pos", "n")
~ 04353803 n 0000 = 00003553 --Hyponym--> 04353803 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00") ; Edge.setProperty("pos", "n")
gloss = an assemblage of parts that is regarded as a single entity; "how big is that part compared to the whole?"; "the team is a unit"
To look into the synset_offset, we can grep for each of the offset numbers in the data.noun file to show related synsets. For example taking the first entry in the table above:
@ 00002684 n 0000 = 00003553 --Hypernym--> 00002684 ; Edge.setProperty("source", "00"); Edge.setProperty("target", "00");
$ grep ^00002684 data.noun
00002684 03 n 02 object 0 physical_object 0 039 @ 00001930 n 0000 + 00532607 v 0105 ~ 00003553 n 0000 ~ 00027167 n 0000 ~ 03009633 n 0000 ~ 03149951 n 0000 ~ 03233423 n 0000 ~ 03338648 n 0000 ~ 03532080 n 0000 ~ 03595179 n 0000 ~ 03610270 n 0000 ~ 03714721 n 0000 ~ 03892891 n 0000 ~ 04012260 n 0000 ~ 04248010 n 0000 ~ 04345288 n 0000 ~ 04486445 n 0000 ~ 07851054 n 0000 ~ 09238143 n 0000 ~ 09251689 n 0000 ~ 09267490 n 0000 ~ 09279458 n 0000 ~ 09281777 n 0000 ~ 09283193 n 0000 ~ 09287968 n 0000 ~ 09295338 n 0000 ~ 09300905 n 0000 ~ 09302031 n 0000 ~ 09308398 n 0000 ~ 09334396 n 0000 ~ 09335240 n 0000 ~ 09358550 n 0000 ~ 09368224 n 0000 ~ 09407346 n 0000 ~ 09409203 n 0000 ~ 09432990 n 0000 ~ 09468237 n 0000 ~ 09474162 n 0000 ~ 09477037 n 0000 | a tangible and visible entity; an entity that can cast a shadow; "it was full of rackets, balls and other objects"
*/
public Vertex parseDataLine(Graph g, String line){
//println line;
String[] tokens = line.split("\\s")
//println tokens;
if(tokens.size() > 7){//have to have at least 8 columns
//synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss
Vertex vsynset = getCreateSynset(g, getID(line) );
vsynset.setProperty("lex_filenum", tokens[1]);
vsynset.setProperty("ss_type", tokens[2]);
vsynset.setProperty("w_cnt", tokens[3]);
vsynset.setProperty("gloss", getDefinition(line));
//def i = 4;
//Handle the words as: word lex_id [word lex_id...]
def wordnum = Integer.parseInt(tokens[3], 16);
for(i in 0..(wordnum-1)){
println "Processing next word: " + tokens[2*i+4] + ":" + tokens[2*i+1+4];
vword = getLemma(g, tokens[2*i+4], tokens[2], tokens);
Edge e = g.addEdge(null, vsynset, vword, "hasWord");
e.setProperty("lex_id", tokens[2*i+1+4]);
}
//Handle the pointers as: [pointer_symbol, synset_offset, pos, source/target]
//TODO: pointer parsing is not implemented yet; the skeleton below marks where the
//p_cnt pointer tuples (4 tokens each) would be turned into synset-to-synset edges.
//while(i < tokens.size()){
// //t =
// i = i+4;
//}
//println vsynset.map();
return vsynset;
}
return null;
}
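/*
A usage sketch for parseDataLine (hypothetical; assumes g is an open graph and the index
files were parsed first), using the example record from the comment above (abbreviated here):
Vertex s = parseDataLine(g, "00003553 03 n 02 whole 0 unit 0 015 ... | an assemblage of parts ...");
assert s.getProperty("ss_type").equals("n");
assert s.getProperty("w_cnt").equals("02");
assert s.getProperty("gloss").startsWith("an assemblage");
*/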
/*
getLemma takes an English word or word phrase (syntax) such as "command_processing_overhead_time",
and a part of speech (also called a synset type / WNToken:
n NOUN
v VERB
a ADJECTIVE
s ADJECTIVE SATELLITE
r ADVERB )
*/
public Vertex getLemma(Graph g, String wordPhrase, String partOfSpeech, String[] tokens){
def vlist = g.idx(T.v)[[lemma:wordPhrase]].outE("pos").inV.filter{it.WNToken.equals(partOfSpeech)}.toList();
if(vlist.size() > 0){
return vlist[0];
}else{
//we have to create a new one, it was not found in the index files
Vertex vword = g.addVertex(null);
//TODO: all the stuff below needs careful consideration
// vword.setProperty("type","Word");
// vword.setProperty("lemma",tokens[0].trim());
// connectSyntacticCategory(g, tokens[1].trim(), vword);
// vword.setProperty("synset_cnt",tokens[2].trim());
// vword.setProperty("p_cnt",tokens[3].trim());
return vword;
}
}
public String getID(String line){
String[] tokens = line.split("\\s")
if(tokens[0].matches("\\d+") ){
return (tokens[0]).trim();
}
return ""
}
public String getDefinition(String line){
String[] tokens = line.split("\\|")
if(tokens.length > 1)
return (tokens[1]).trim();
else return ""
}
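/*
Quick sanity checks for the two helpers above (a sketch):
assert getID("00003553 03 n 02 whole 0 unit 0 ...").equals("00003553");
assert getDefinition("00003553 03 n 02 ... | an assemblage of parts").equals("an assemblage of parts");
*/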
/**
Pointers: (Lexicographer Shorthand)
Pointers are used to represent the relations between the words in one synset and another. Semantic pointers represent relations between word meanings, and therefore pertain to all of the words in the source and target synsets. Lexical pointers represent relations between word forms, and pertain only to specific words in the source and target synsets. The following pointer types are usually used to indicate lexical relations: Antonym, Pertainym, Participle, Also See, Derivationally Related. The remaining pointer types are generally used to represent semantic relations.
A relation from a source to a target synset is formed by specifying a word from the target synset in the source synset, followed by the pointer_symbol indicating the pointer type. The location of a pointer within a synset defines it as either lexical or semantic. The Lexicographer File Format section describes the syntax for entering a semantic pointer, and Word Syntax describes the syntax for entering a lexical pointer.
Although there are many pointer types, only certain types of relations are permitted between synsets of each syntactic category.
See: http://wordnet.princeton.edu/man/wninput.5WN.html#toc2
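For a concrete illustration, take the data.adj record for 00514884 shown earlier: "= 14460565 n 0000" is a semantic pointer (source/target 0000, so it relates the synsets as wholes), while "! 00516539 a 0101" is a lexical pointer relating word 1 of the source synset to word 1 of the target synset.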
*/
//Untyped for the same reason as ontRoot above: binding variables stay visible inside methods.
pointerRoot = null
nounPointerSymbol = null
verbPointerSymbol = null
adjectivePointerSymbol = null
adverbPointerSymbol = null
public void create_pointer_subgraph(Graph graph, Vertex parent){
//This subgraph is what one would typically call the 'metadata' about pointer objects in the system
pointerRoot = graph.addVertex(null);
pointerRoot.setProperty("name","Pointer");
pointerRoot.setProperty("type","Class");
Edge isA = graph.addEdge(null, pointerRoot, parent, "isA");
////Nouns:
nounPointerSymbol = graph.addVertex(null);
nounPointerSymbol.setProperty("name","NounPointerSymbol");
nounPointerSymbol.setProperty("type","Class");
isA = graph.addEdge(null, nounPointerSymbol, pointerRoot, "isA");
lnoun = graph.idx(T.v)[[name:'Noun']].toList();
graph.addEdge(null, lnoun[0], nounPointerSymbol, "LexicographerPointer");
/*
noun symbols:
*/
//! Antonym
createDescendant(graph, nounPointerSymbol, "Antonym", "Class", "!");
//@ Hypernym (commonly an is_a relationship: a human is_a primate)
createDescendant(graph, nounPointerSymbol, "Hypernym", "Class", "@");
//@i Instance Hypernym
createDescendant(graph, nounPointerSymbol, "Instance_Hypernym", "Class", "@i");
//~ Hyponym (descendant or instance-of relationship - opposite of is_a: primate has_member human)
createDescendant(graph, nounPointerSymbol, "Hyponym", "Class", "~");
//~i Instance Hyponym
createDescendant(graph, nounPointerSymbol, "Instance_Hyponym", "Class", "~i");
//#m Member holonym
createDescendant(graph, nounPointerSymbol, "Member_holonym", "Class", "#m");
//#s Substance holonym
createDescendant(graph, nounPointerSymbol, "Substance_holonym", "Class", "#s");
//#p Part holonym
createDescendant(graph, nounPointerSymbol, "Part_holonym", "Class", "#p");
//%m Member meronym
createDescendant(graph, nounPointerSymbol, "Member_meronym", "Class", "%m");
//%s Substance meronym
createDescendant(graph, nounPointerSymbol, "Substance_meronym", "Class", "%s");
//%p Part meronym
createDescendant(graph, nounPointerSymbol, "Part_meronym", "Class", "%p");
//= Attribute
createDescendant(graph, nounPointerSymbol, "Attribute", "Class", "=");
//+ Derivationally related form
createDescendant(graph, nounPointerSymbol, "Derivationally_related_form", "Class", "+");
//; Domain of synset
createDescendant(graph, nounPointerSymbol, "Domain_of_synset", "Class", ";");
//- Member of this domain
createDescendant(graph, nounPointerSymbol, "Member_of_this_domain", "Class", "-");
//;c Domain of synset - TOPIC
createDescendant(graph, nounPointerSymbol, "Domain_of_synset_TOPIC", "Class", ";c");
//-c Member of this domain - TOPIC
createDescendant(graph, nounPointerSymbol, "Member_of_this_domain_TOPIC", "Class", "-c");
//;r Domain of synset - REGION
createDescendant(graph, nounPointerSymbol, "Domain_of_synset_REGION", "Class", ";r");
//-r Member of this domain - REGION
createDescendant(graph, nounPointerSymbol, "Member_of_this_domain_REGION", "Class", "-r");
//;u Domain of synset - USAGE
createDescendant(graph, nounPointerSymbol, "Domain_of_synset_USAGE", "Class", ";u");
//-u Member of this domain - USAGE
createDescendant(graph, nounPointerSymbol, "Member_of_this_domain_USAGE", "Class", "-u");
////Verbs:
verbPointerSymbol = graph.addVertex(null);
verbPointerSymbol.setProperty("name","VerbPointerSymbol");
verbPointerSymbol.setProperty("type","Class");
isA = graph.addEdge(null, verbPointerSymbol, pointerRoot, "isA");
lverb = graph.idx(T.v)[[name:'Verb']].toList();
graph.addEdge(null, lverb[0], verbPointerSymbol, "LexicographerPointer");
/*
verb symbols:
*/
//! Antonym
createDescendant(graph, verbPointerSymbol, "Antonym", "Class", "!");
//@ Hypernym
createDescendant(graph, verbPointerSymbol, "Hypernym", "Class", "@");
//~ Hyponym
createDescendant(graph, verbPointerSymbol, "Hyponym", "Class", "~");
//* Entailment
createDescendant(graph, verbPointerSymbol, "Entailment", "Class", "*");
//> Cause
createDescendant(graph, verbPointerSymbol, "Cause", "Class", ">");
//^ Also see
createDescendant(graph, verbPointerSymbol, "Also_see", "Class", "^");
//$ Verb Group
createDescendant(graph, verbPointerSymbol, "Verb_Group", "Class", "\$");
//+ Derivationally related form
createDescendant(graph, verbPointerSymbol, "Derivationally_related_form", "Class", "+");
//; Domain of synset
createDescendant(graph, verbPointerSymbol, "Domain_of_synset", "Class", ";");
//;c Domain of synset - TOPIC
createDescendant(graph, verbPointerSymbol, "Domain_of_synset_TOPIC", "Class", ";c");
//;r Domain of synset - REGION
createDescendant(graph, verbPointerSymbol, "Domain_of_synset_REGION", "Class", ";r");
//;u Domain of synset - USAGE
createDescendant(graph, verbPointerSymbol, "Domain_of_synset_USAGE", "Class", ";u");
////Adjectives:
adjectivePointerSymbol = graph.addVertex(null);
adjectivePointerSymbol.setProperty("name","AdjectivePointerSymbol");
adjectivePointerSymbol.setProperty("type","Class");
isA = graph.addEdge(null, adjectivePointerSymbol, pointerRoot, "isA");
ladjective = graph.idx(T.v)[[name:'Adjective']].toList();
graph.addEdge(null, ladjective[0], adjectivePointerSymbol, "LexicographerPointer");
/*
adjective symbols:
*/
//! Antonym
createDescendant(graph, adjectivePointerSymbol, "Antonym", "Class", "!");
//& Similar to
createDescendant(graph, adjectivePointerSymbol, "Similar_to", "Class", "&");
//< Participle of verb
createDescendant(graph, adjectivePointerSymbol, "Participle_of_verb", "Class", "<");
//\ Pertainym (pertains to noun)
createDescendant(graph, adjectivePointerSymbol, "Pertainym", "Class", "\\");
//= Attribute
createDescendant(graph, adjectivePointerSymbol, "Attribute", "Class", "=");
//^ Also see
createDescendant(graph, adjectivePointerSymbol, "Also_see", "Class", "^");
//; Domain of synset
createDescendant(graph, adjectivePointerSymbol, "Domain_of_synset", "Class", ";");
//;c Domain of synset - TOPIC
createDescendant(graph, adjectivePointerSymbol, "Domain_of_synset_TOPIC", "Class", ";c");
//;r Domain of synset - REGION
createDescendant(graph, adjectivePointerSymbol, "Domain_of_synset_REGION", "Class", ";r");
//;u Domain of synset - USAGE
createDescendant(graph, adjectivePointerSymbol, "Domain_of_synset_USAGE", "Class", "!");
////Adverbs:
adverbPointerSymbol = graph.addVertex(null);
adverbPointerSymbol.setProperty("name","AdverbPointerSymbol");
adverbPointerSymbol.setProperty("type","Class");
isA = graph.addEdge(null, adverbPointerSymbol, pointerRoot, "isA");
ladverb = graph.idx(T.v)[[name:'Adverb']].toList();
graph.addEdge(null, ladverb[0], adverbPointerSymbol, "LexicographerPointer");
/*
adverb symbols:
*/
//! Antonym
createDescendant(graph, adverbPointerSymbol, "Antonym", "Class", "!");
//\ Derived from adjective
createDescendant(graph, adverbPointerSymbol, "Derived_from_adjective", "Class", "\\");
//; Domain of synset
createDescendant(graph, adverbPointerSymbol, "Domain_of_synset", "Class", ";");
//;c Domain of synset - TOPIC
createDescendant(graph, adverbPointerSymbol, "Domain_of_synset_TOPIC", "Class", ";c");
//;r Domain of synset - REGION
createDescendant(graph, adverbPointerSymbol, "Domain_of_synset_REGION", "Class", ";r");
//;u Domain of synset - USAGE
createDescendant(graph, adverbPointerSymbol, "Domain_of_synset_USAGE", "Class", ";u");
return;
}
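/*
A hedged spot-check for the pointer subgraph (counts follow the symbol tables above):
assert nounPointerSymbol.inE("isA").count() == 21   //21 noun pointer classes attached above
//createDescendant de-duplicates on (name, type, WNToken), so a class shared by all four
//parts of speech, such as Antonym, is one vertex with one isA edge per parent:
assert graph.idx(T.v)[[name:'Antonym']].toList()[0].outE("isA").count() == 4
*/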
/**
Given a token, pointer_symbol_nouns will connect vertex v to the Class representing the type of pointer
that the part of speech (e.g. noun, verb) can take in usage.
g = a graph containing the pointer_subgraph
token = a string that will match properties of type WNToken
v = the vertex that we are going to link to the pointer symbol node
example usage:
Vertex vt = graph.addVertex(null);
Edge e = pointer_symbol_nouns(graph, "!", vt, "foo")
The functions:
pointer_symbol_verbs
pointer_symbol_adjectives
pointer_symbol_adverbs
all work the same way
*/
public Edge pointer_symbol_nouns(Graph g, String token, Vertex v, String relationshipType){
//get all Class objects representing a noun pointer, match it to the token
l = nounPointerSymbol.inE.outV.filter{it.WNToken.equals(token)}.toList()
//Class cls = l.getClass();
//println "The type of the object is: " + cls.getName();
//println "The content of the list is: " + l
//println "The size of the list is: " + l.size()
if(l.size() == 1){ //if it matched the token, link the word node to the noun pointer.
return g.addEdge(null, l[0], v, relationshipType);
}
}
public Edge pointer_symbol_verbs(Graph g, String token, Vertex v, String relationshipType){
//get all Class objects representing a verb pointer, match it to the token
l = verbPointerSymbol.inE.outV.filter{it.WNToken.equals(token)}.toList()
if(l.size() == 1){ //if it matched the token, link the word node to the verb pointer.
return g.addEdge(null, l[0], v, relationshipType);
}
}
public Edge pointer_symbol_adjectives(Graph g, String token, Vertex v, String relationshipType){
//get all Class objects representing a adjective pointer, match it to the token
l = adjectivePointerSymbol.inE.outV.filter{it.WNToken.equals(token)}.toList()
if(l.size() == 1){ //if it matched the token, link the word node to the adjective pointer.
return g.addEdge(null, l[0], v, relationshipType);
}
}
public Edge pointer_symbol_adverbs(Graph g, String token, Vertex v, String relationshipType){
//get all Class objects representing a adverb pointer, match it to the token
l = adverbPointerSymbol.inE.outV.filter{it.WNToken.equals(token)}.toList()
if(l.size() == 1){ //if it matched the token, link the word node to the adverb pointer.
return g.addEdge(null, l[0], v, relationshipType);
}
}
/*
A slightly time-intensive operation (because of the search traversal) that helps build the ontology before words are added
*/
public Vertex createDescendant(Graph g, Vertex parent, String namev, String typev, String WNTokenv){
//check to see if there is a node already that has these attributes, if so don't make it again
vlist = g.V.inE.outV.filter{it.name.equals(namev) && it.type.equals(typev) && it.WNToken.equals(WNTokenv)}.toList()
Class cls = vlist.getClass();
//println "The type of the object is: " + cls.getName();
//println "The content of the list is: " + vlist
//println "The size of the list is: " + vlist.size()
Vertex v;
if(vlist.size() == 0){
v = g.addVertex(null);
v.setProperty("name",namev);
v.setProperty("type",typev);
v.setProperty("WNToken",WNTokenv);
}else {
v = vlist[0];
//println v.map()
//println "parent" + parent.map()
}
g.addEdge(null, v, parent, "isA");
return v;
}
public void readFile(String path, String filename, Graph g){
println "Parsing file: " + filename
String line;
def count = 0; def prev = 0;
//we may need to manage transactions, do that here:
//g.setMaxBufferSize(0);
//g.startTransaction();
g.setMaxBufferSize(10000);
//then we need to put in the corresponding stop transaction below..
try {
BufferedReader br = new BufferedReader(new FileReader(path + filename));
while ((line = br.readLine()) != null) {
//println line
if( !isHeader(line) ) {
if(filename.contains("index.sense")){ //the sense file is different
}else if(filename.contains("index")){ //it is an index formatted file!
parseIndexLine(g, line);
}else if(filename.contains("data")){ //it is an data formatted file!
parseDataLine(g, line);
}
}
count++;
if(count == prev + 1000){ println "Records Parsed = " + count; prev = count; }
} // end while
br.close();
println "";
} // end try
catch (IOException e) {
System.err.println("Error: " + e);
}
}
path = "../../WordNet/dict/"
files = ["index.noun","index.verb","index.adj", "index.adv","data.noun","data.verb","data.adj","data.adv","index.sense"]
String graphPath = "/tmp/wordnet";
println "Initiating Graph: " + graphPath;
Graph graph = new Neo4jGraph(graphPath);
create_ontology_classes(graph);
println "Loading WordNet:\n"
//filename = files[0]
//filename = files[4]
//readFile(path, filename, graph)
for(filename in files){
readFile(path, filename, graph)
}
println "Word Net Loaded! Thanks for using!"
graph.shutdown();
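/*
Some hedged post-load sanity queries (the graph was shut down above, so reopen it first):
g2 = new Neo4jGraph(graphPath)
println g2.idx(T.v)[[lemma:'whole']].out("pos").name.toList()       //syntactic categories of "whole"
println g2.idx(T.v)[[synset_id:'00003553']].out("hasWord").count()  //words attached to one synset
g2.shutdown()
*/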