@tokumine
Created December 14, 2009 08:45
#CLUSTERING OF PAs FOR ABSTRACT AND RELATEDNESS MEASURES
#1) AIM: CREATE A DICTIONARY OF "BAD TERMS" TO EXCLUDE FROM FURTHER ANALYSIS
# METHOD: TOKENISE ALL WORDS AND HISTOGRAM. SORT.
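tokens = Hash.new(0)
# find_each returns nil, so there is nothing useful to assign from it
Pa.find_each(:select => 'DISTINCT name_eng, gid', :batch_size => 100) do |p|
  # split on runs of non-word characters; to_s guards against a nil name
  p.name_eng.to_s.split(/\W+/).each do |t|
    next if t.empty?
    tokens[t] += 1
  end
end
# most frequent tokens first
sorted_tokens = tokens.sort_by { |_token, count| -count }
puts sorted_tokens.inspect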
#2) decide where to split the histogram dictionary to remove common words. idea: use the second derivative of the sorted counts to determine the threshold. this step is essential: looking through the data, typical feature class names are used inconsistently in the PA names, eg ozero means lake in Russian, and lots of PAs have ozero in the name.
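# a minimal sketch of one way to read the "second derivative" idea in #2: take the
# counts from the sorted histogram, compute the discrete second difference at each
# interior position, and treat everything before the sharpest bend as a bad term.
# bad_terms_by_elbow and the sample data are hypothetical, not part of the original code.

```ruby
# sorted: array of [token, count] pairs, highest count first
def bad_terms_by_elbow(sorted)
  counts = sorted.map { |_token, count| count }
  return [] if counts.size < 3

  # discrete second derivative at each interior index i
  curvature = (1...(counts.size - 1)).map do |i|
    [i, counts[i + 1] - 2 * counts[i] + counts[i - 1]]
  end

  # the elbow is where the curve bends most sharply (first one wins on ties)
  elbow, _d2 = curvature.max_by { |_i, d2| d2 }
  sorted.first(elbow).map { |token, _count| token }
end

# made-up counts, for illustration only
sample = [['ozero', 100], ['park', 90], ['reserve', 80],
          ['alta', 10], ['bay', 9], ['cedar', 8], ['dune', 7]]
bad_terms = bad_terms_by_elbow(sample)
```

# on real data the counts are likely to be noisy, so the result would still need
# checking by eye before being used as the dictionary.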
#3) tokenise name_eng, remove bad tokens determined in #2
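# step #3 could look something like the sketch below. clean_tokens and the BAD_TERMS
# contents are hypothetical. it assumes tokens are lowercased before comparison, which
# the histogram code in #1 does not do, so the dictionary would need to be lowercase too.

```ruby
require 'set'

# placeholder dictionary; the real one would come out of step #2
BAD_TERMS = Set.new(%w[ozero lake park])

def clean_tokens(name, bad_terms)
  name.to_s.downcase.split(/\W+/).reject do |t|
    t.empty? || bad_terms.include?(t)
  end
end

kept = clean_tokens('Ozero Baikal National Park', BAD_TERMS)
# kept => ["baikal", "national"]
```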
#4) generate double metaphones for each token and store with gid. double metaphone gives us a best attempt at classifying/simplifying words for similarity comparison. it is suitable for comparing the same word in many languages: http://en.wikipedia.org/wiki/Double_Metaphone. it is more suitable than other indices of similarity, eg Soundex & Levenshtein, as these are usually targeted at English only.
#5) select nearby PAs only (< 50 km?) using buffer and distance. would love to use the new PostGIS 1.5 geography support for this
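# a rough pure-Ruby stand-in for the distance filter in #5, in case it helps to
# prototype before touching PostGIS. it is NOT the PostGIS approach: it assumes each
# PA is reduced to a single lat/lon point (eg a centroid) and uses the haversine
# great-circle distance. haversine_km, nearby? and the hash keys are hypothetical.

```ruby
EARTH_RADIUS_KM = 6371.0 # mean earth radius

# great-circle distance in km between two points given in decimal degrees
def haversine_km(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end

# a and b are hashes with :lat and :lon keys (an assumption for this sketch)
def nearby?(a, b, max_km = 50)
  haversine_km(a[:lat], a[:lon], b[:lat], b[:lon]) < max_km
end

origin  = { :lat => 0.0, :lon => 0.0 }
close   = { :lat => 0.0, :lon => 0.3 } # roughly 33 km east along the equator
distant = { :lat => 0.0, :lon => 1.0 } # roughly 111 km east along the equator
```

# point-to-point distance ignores the size and shape of each PA, so a buffer on the
# real geometries would behave differently for large areas.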
#6) create proximity matrix using the tanimoto similarity index on metaphone sets (ratio of the size of the intersection to the size of the union of the 2 sets. tanimoto is discussed in the collective intelligence book, but is dead simple)
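# the tanimoto index on two sets as described in #6, sketched in plain Ruby. the
# method name and the sample codes are made up for illustration; they are placeholder
# strings, not real double metaphone output.

```ruby
require 'set'

# size of the intersection divided by size of the union; 0.0 for two empty sets
def tanimoto(a, b)
  a = a.to_set
  b = b.to_set
  union = (a | b).size
  return 0.0 if union.zero?
  (a & b).size.to_f / union
end

# one shared code out of three distinct codes
score = tanimoto(%w[PK ASR], %w[PK LK])
```

# a score of 1.0 means identical sets and 0.0 means nothing in common, so the
# proximity matrix is just this score for every pair of nearby PAs.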
#7) cluster based on threshold
# key issues for clustering: all clustering methods are going to depend on us feeding in some sort of "stop criterion"
# k-means: no, as we need to know the number of clusters to generate. could use the sum total of national parks? guessing though.
# hierarchical clustering: we need to know either at how many clusters, or at what distance apart, clustering stops
# c-means: again, need to decide on a stop
# QT clustering: determine a maximum diameter for each cluster
# there were others but i didn't bookmark them.
#(k-means and hierarchical clustering are covered in the collective intelligence book)
#more info on working out sensible k's here: http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
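# one possible reading of "cluster based on threshold" in #7, as a sketch only: link
# any two PAs whose similarity is at or above the threshold and take the connected
# groups as clusters (this is single-linkage clustering cut at a fixed threshold, so
# the threshold is the stop criterion). threshold_clusters and the input shapes are
# hypothetical.

```ruby
# gids: array of ids
# similarity: hash mapping [gid_a, gid_b] pairs to a score in 0..1
def threshold_clusters(gids, similarity, threshold)
  # simple union-find: every gid starts as its own cluster root
  parent = {}
  gids.each { |g| parent[g] = g }

  find = lambda do |g|
    g = parent[g] while parent[g] != g
    g
  end

  similarity.each do |(a, b), score|
    next if score < threshold
    root_a = find.call(a)
    root_b = find.call(b)
    parent[root_a] = root_b unless root_a == root_b
  end

  gids.group_by { |g| find.call(g) }.values
end

# made-up scores, for illustration only
scores = { [1, 2] => 0.8, [2, 3] => 0.6, [3, 4] => 0.1 }
clusters = threshold_clusters([1, 2, 3, 4], scores, 0.5)
# clusters => [[1, 2, 3], [4]]
```

# note that linking can chain: 1 and 3 end up together via 2 even if they are not
# similar to each other, which is where a max diameter (as in QT) would differ.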