Matt Biddulph mattb

@mattb
mattb / gist:3888345
Created Oct 14, 2012
Some pointers for Natural Language Processing / Machine Learning
View gist:3888345

Here are the areas I've been researching, some things I've read and some open source packages...

Nearly all text processing starts by transforming text into vectors: http://en.wikipedia.org/wiki/Vector_space_model
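As a rough illustration of the vector space model, the Ruby sketch below maps each document to a term-count vector over a shared vocabulary and compares two documents with cosine similarity. The helper names (`tokenize`, `vectorize`, `cosine`) and the simple regex tokenizer are assumptions made for this example, not part of any package mentioned here.

```ruby
# Split text into lowercase word tokens (a deliberately naive tokenizer).
def tokenize(text)
  text.downcase.scan(/[a-z']+/)
end

# Return [vocabulary, vectors]: one term-count vector per document,
# with one dimension per vocabulary term.
def vectorize(docs)
  tokenized = docs.map { |d| tokenize(d) }
  vocab = tokenized.flatten.uniq.sort
  vectors = tokenized.map do |tokens|
    counts = tokens.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }
    vocab.map { |term| counts[term] }
  end
  [vocab, vectors]
end

# Cosine similarity between two equal-length numeric vectors.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  denom = norm_a * norm_b
  denom.zero? ? 0.0 : dot / denom
end

vocab, vectors = vectorize(["the cat sat on the mat", "the dog sat on the log"])
puts vocab.inspect
puts cosine(vectors[0], vectors[1]).round(2)
```

Raw counts like these let very common words dominate the similarity, which is one reason a weighting step is usually applied afterwards.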

It often uses transforms such as TF-IDF to normalise the data and control for outliers (words that are too frequent or too rare confuse the algorithms): http://en.wikipedia.org/wiki/Tf%E2%80%93idf
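A minimal sketch of one common TF-IDF variant (relative term frequency multiplied by the natural log of inverse document frequency) might look like this in Ruby. The function name `tf_idf` is hypothetical, and this plain variant only damps words that appear in many documents; handling very rare words usually needs a separate minimum-count filter.

```ruby
# Compute TF-IDF weights for a list of already-tokenized documents.
# Returns one hash per document mapping term => weight.
def tf_idf(tokenized_docs)
  n = tokenized_docs.size.to_f

  # Document frequency: how many documents contain each term.
  df = Hash.new(0)
  tokenized_docs.each { |tokens| tokens.uniq.each { |t| df[t] += 1 } }

  tokenized_docs.map do |tokens|
    counts = Hash.new(0)
    tokens.each { |t| counts[t] += 1 }
    counts.each_with_object({}) do |(term, count), weights|
      tf = count.to_f / tokens.size   # relative frequency within this document
      idf = Math.log(n / df[term])    # 0.0 when the term is in every document
      weights[term] = tf * idf
    end
  end
end

docs = [%w[the cat sat], %w[the dog sat], %w[the cat ran]]
tf_idf(docs).each { |weights| puts weights.inspect }
```

A term that occurs in every document (here "the") gets weight zero, while a term unique to one document gets the highest weight in that document.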

Collocation detection finds cases where two or more words occur together more often than they would separately (e.g. "wishy-washy" in English). I use this to group words into n-gram tokens, because many NLP techniques treat each word as if it were independent of all the others in a document, ignoring order: http://matpalm.com/blog/2011/10/22/collocations_1/
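One way to score candidate collocations is pointwise mutual information (PMI), which compares how often a word pair occurs with how often it would occur if the two words were independent. The Ruby sketch below is an illustration under that assumption; the text above does not say which score is used, and `bigram_pmi` and its `min_count` parameter are hypothetical names.

```ruby
# Score adjacent word pairs by pointwise mutual information (base 2).
# Pairs seen fewer than min_count times are skipped, because PMI is
# unreliable for very rare pairs. Returns [[a, b], score] pairs,
# highest score first.
def bigram_pmi(tokens, min_count: 2)
  unigrams = Hash.new(0)
  bigrams = Hash.new(0)
  tokens.each { |t| unigrams[t] += 1 }
  tokens.each_cons(2) { |a, b| bigrams[[a, b]] += 1 }

  n_uni = tokens.size.to_f
  n_bi = [tokens.size - 1, 1].max.to_f

  scores = {}
  bigrams.each do |(a, b), count|
    next if count < min_count
    p_ab = count / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    scores[[a, b]] = Math.log2(p_ab / (p_a * p_b))
  end
  scores.sort_by { |_, score| -score }
end

tokens = %w[new york is big and new york is loud but york alone is not new]
bigram_pmi(tokens).each { |pair, score| puts "#{pair.join(' ')}: #{score.round(2)}" }
```

Pairs with a high score can then be merged into a single token (for example `new_york`) before vectorizing.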

View .eslintrc
{
  "parserOptions": {
    "ecmaVersion": 8
  },
  "env": {
    "jest": true,
    "browser": true,
    "es6": true,
    "node": true
  },
View gist:8462540
description "autossh tunnel"
author "Joni Kähärä "
start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1) # assuming we have multiple interfaces
stop on runlevel [016]
respawn
respawn limit 5 60
exec autossh -M 0 -N -R 10000:192.168.1.1:22 -o "ServerAliveInterval 60" -o "ServerAliveCountMax 3" -o "StrictHostKeyChecking=no" -o "BatchMode=yes" -i /home/user/.ssh/id_rsa username@hostname
View gist:7714439
[error] found : anorm.TupleFlattener[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[Nothing,Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing] => (Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing)]
[error] required: anorm.TupleFlattener[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[Nothing,Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],T2] => (Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing)]
[error] Note: anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[Nothing,Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing] => (Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing) >: anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[Nothing
mattb / simplenlg ruby DSL test
Created Apr 3, 2013
Notice how it gets all the fiddly grammar with tenses and plural agreements correct.
View simplenlg ruby DSL test
# a jruby DSL for the Simple Natural Language Generator library - https://code.google.com/p/simplenlg/
realise clause {
  subject 'two cats', :plural => true
  verb 'live with'
  object 'Matt', 'Ariel'
  complement preposition_phrase {
    complement 'San Francisco'
    preposition 'in'
  }
View gist:4588259
ids = [293401191054979074, 293401663698513921, 293401787464040448, 293401934436659200, 293402177140056064, 293402218432983040, 293402263777587202, 293402332853583872, 293402393025077250, 293402466354077698, 293402940637597696, 293403039295995904, 293403262642700288, 293403304124366848, 293403623084404737, 293403854266048512, 293403891633094656, 293404203659964416, 293404415535239168, 293404447453872128, 293404518945783808, 293404594669760512, 293404694401929217, 293405000087003137, 293405024229404673, 293405072929484800, 293405372021108736, 293405406661857281, 293405547682734080, 293405590468845570]
count = ids.size
tweets = ids.map { |i|
  puts count
  count -= 1
  tweet = Twitter.status(i).to_hash
  activity = Twitter.status_activity(i).to_hash
  tweet.merge(activity)
}
open("tweets.json","w" ) { |f|
mattb / gist:4588234
Last active Dec 11, 2015
Count of retweets plus favourites of @BarackObama's key phrases from the 2013 US Inauguration (one asterisk = 500 retweets+favourites) as of 10:42am PST on January 21st 2013. Source code: https://gist.github.com/4588259
View gist:4588234
************************************************* Our journey is not complete until our gay brothers and sisters are treated like anyone else under the law. (6699fav/17845rt)
************************** Our journey is not complete until our wives, our mothers, and daughters can earn a living equal to their efforts. (3059fav/9974rt)
**************** Thank you, God Bless you, and may He forever bless these United States of America. (2191fav/5866rt)
*************** Our journey is not complete until all our children know that they are cared for, and cherished, and always safe from harm. (1743fav/5970rt)
************* Our journey is not complete until we find a better way to welcome the immigrants who still see America as a land of opportunity. (1774fav/5016rt)
************ We have always understood that when times change, so must we. —President Obama (1422fav/5049rt)
************ We, the people, declare today that the most evident of truths—that all of us are created equal—is the star that guides us still. (1418fa
View gist:3798797
/usr/lib/jvm/java-7-openjdk-i386/bin/java -Xms384m -Xmx384m -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError org.elasticsearch.bootstrap.ElasticSearch
mattb / gist:3765807
Created Sep 22, 2012
Screenscraping London Open House venues into GeoJSON
View gist:3765807
# curl -O 'http://events.londonopenhouse.org/Venues?q=&Page=[1-80]'
# grep 'href.*/building/' * | cut -d\" -f 2 | sort -u | sed -e 's/.*/http:\/\/events.londonopenhouse.org\/&/' > buildings.txt
# gem install nokogiri
# ruby this_script.rb > loh.json
# ogr2ogr -f KML loh.kml loh.json
require 'nokogiri'
require 'json'
DETAILED = false # Google Maps complains if the KML gets too big
mattb / gist:1244665
Created Sep 27, 2011
Top 100 ASCII-only 2-shingles in a Twitter sample from the last 6 hours. Source at https://github.com/mattb/Storm-Try/
View gist:1244665
in the: ************************************* (3765)
i love: *************************** (2781)
to be: ************************** (2637)
of the: ********************* (2150)
if you: ********************* (2150)
on the: ******************** (2081)
i just: ****************** (1838)
i was: ****************** (1829)
i don't: ****************** (1825)
i have: ****************** (1802)
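A listing like the one above could be produced along these lines. This is a standalone Ruby sketch, not the Storm topology from the linked repository; the function names are hypothetical, and the scale of one asterisk per 100 occurrences is inferred from the listing rather than stated in it.

```ruby
# Count word shingles (runs of `size` adjacent words) across texts,
# keeping only shingles made entirely of ASCII characters.
def shingle_counts(texts, size: 2)
  counts = Hash.new(0)
  texts.each do |text|
    text.downcase.split.each_cons(size) do |words|
      shingle = words.join(" ")
      counts[shingle] += 1 if shingle.ascii_only?
    end
  end
  counts
end

# Render the most frequent shingles as "shingle: **** (count)" lines,
# with one asterisk per `per_star` occurrences (rounded down).
def histogram(counts, top: 100, per_star: 100)
  counts.sort_by { |_, c| -c }.first(top).map do |shingle, c|
    "#{shingle}: #{'*' * (c / per_star)} (#{c})"
  end
end

sample = ["i love the sea", "i love you", "sat in the sun"]
puts histogram(shingle_counts(sample), top: 3, per_star: 1)
```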