coppelia machine learning and analytics coppeliaMLA
@coppeliaMLA
coppeliaMLA / dfToJSON.R
Last active August 29, 2015 13:56
I've been using a lot of JavaScript charting and visualisation libraries recently (e.g. D3, Highcharts) and found that it is quite painful to get my data into the JSON structure required by each library. Since I'm doing most of the data manipulation in R anyway, it makes sense to arrange the data as a nested list in R and then transform it to JSO…
#Load libraries
library(rjson)
library(stringr)
dfToJSON<-function(df, mode='vector'){
  colToList<-function(x, y){
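The preview above is cut off, so the gist's actual nesting is not visible here. As a rough Python analogue of the idea in the description (arrange a table as a nested list, then serialise it to JSON), here is a minimal sketch. The function name `table_to_json`, the `"rows"` mode and the output layout are assumptions made for illustration; they are not taken from the gist.

```python
import json


def table_to_json(columns, mode="vector"):
    """Serialise a column-oriented table (dict of equal-length lists) to JSON.

    mode="vector": one {"name", "values"} object per column (assumed layout).
    mode="rows":   one object per row (hypothetical alternative mode).
    """
    if mode == "vector":
        nested = [{"name": name, "values": list(values)}
                  for name, values in columns.items()]
    elif mode == "rows":
        names = list(columns)
        nested = [dict(zip(names, row)) for row in zip(*columns.values())]
    else:
        raise ValueError("unknown mode: %r" % mode)
    return json.dumps(nested)


# Illustrative data only
table = {"year": [2013, 2014], "sales": [10.5, 12.0]}
print(table_to_json(table, mode="vector"))
print(table_to_json(table, mode="rows"))
```

Which layout is right depends on the charting library: some expect one series per column, others one record per data point.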
coppeliaMLA / ArethereAnyGood5LetterDomains.py
Created February 25, 2014 17:38
Creates all 5-letter permutations of words, ranks them by how pronounceable they are, checks that each one is a word in some language, then checks whether the domain is free.
'''
Created on Feb 6, 2014
@author: sraper
'''
import itertools, urllib, urllib2, time, re, random
from bs4 import BeautifulSoup
def catchURL(queryURL): # Nicked this from someone. Afraid I can't remember who. Sorry
coppeliaMLA / trimFirstLine.py
Created February 27, 2014 09:44
Hive seems to struggle with file headers when loading flat files. Here's a bit of Python to trim the first line (i.e. the column header line) from every file.
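import os

# Directory containing the flat files
data_dir = 'put your directory in here'
for filename in os.listdir(data_dir):
    path = os.path.join(data_dir, filename)
    # Read every line, keeping the line endings
    with open(path, 'r') as fin:
        data = fin.read().splitlines(True)
    # Write the file back without its first (header) line
    with open(path, 'w') as fout:
        fout.writelines(data[1:])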
coppeliaMLA / joinTables.sql
Last active August 29, 2015 13:56
This is the HiveQL for the rag: quick start Hadoop and Hive for analysts. You can find it here: http://www.ragscripts.com/2014/03/04/quick-start-hadoop-and-hive-for-analysts/
drop table if exists recommender_set_num; --In case you need to rerun the script
drop table if exists person_ids_full_names;
drop table if exists recom_names;
-- Set up a table to load the recommendations data into
create external table if not exists recommender_set_num
(
    userID bigint,
    itemID bigint
) row format delimited fields terminated by ','
coppeliaMLA / cubicSplineExample.R
Last active August 29, 2015 13:57
Example of a cubic spline
x<-seq(1,10, by=0.1)
y<-sin(x/4)+rnorm(91, 0,0.05) #Sine function plus noise
plot(x,y)
#Knots at 2, 4 and 6
x2<-x^2
x3<-x^3
k1<-(x>2)*(x-2)^3
k2<-(x>4)*(x-4)^3
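The preview stops before the fit itself. The columns it builds (x, x^2, x^3 and one truncated cubic term per knot) form a truncated power basis for a cubic spline. A minimal Python sketch of that basis for a single x value follows; the function name `spline_basis_row` is an assumption, and the knots follow the comment in the gist (2, 4 and 6).

```python
def spline_basis_row(x, knots=(2, 4, 6)):
    """Return [1, x, x^2, x^3, (x-k1)+^3, (x-k2)+^3, ...] for one x value.

    Each knot term is zero up to and including the knot and cubic beyond it,
    mirroring the R expression (x>k)*(x-k)^3.
    """
    row = [1.0, x, x ** 2, x ** 3]
    for k in knots:
        row.append((x - k) ** 3 if x > k else 0.0)
    return row


# Same grid as seq(1, 10, by=0.1) in the gist
xs = [1 + 0.1 * i for i in range(91)]
design = [spline_basis_row(x) for x in xs]
print(len(design), len(design[0]))  # rows, columns of the design matrix
```

Regressing y on these columns with ordinary least squares gives a piecewise cubic that is smooth at the knots, because each knot term and its first two derivatives vanish at the knot.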
coppeliaMLA / csvToPipe.py
Created March 7, 2014 12:50
Another useful bit of code for preparing flat files for Hive. Takes in CSVs with double-quote text delimiters and outputs pipe-delimited files.
import os, csv
progDir = '/pathToFolderContainingCSVs/'
for filename in os.listdir(progDir):
    if filename != '.DS_Store':
        with open(progDir+filename, 'rb') as csvfile:
            progReader = csv.reader(csvfile, delimiter=',', quotechar='"')
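The preview above ends before the writing step. The conversion it describes can be sketched in Python 3 on an in-memory string, as below. The function name `csv_to_pipe` and the sample data are assumptions for illustration, and the original gist appears to target Python 2 (it opens files in `'rb'` mode).

```python
import csv
import io


def csv_to_pipe(csv_text):
    """Convert comma-separated text with double-quoted fields to pipe-delimited text."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=",", quotechar='"')
    out = io.StringIO()
    # QUOTE_MINIMAL only quotes a field if it contains a pipe, a quote or a newline
    writer = csv.writer(out, delimiter="|", quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


sample = 'id,title\n1,"Hello, world"\n'
print(csv_to_pipe(sample))
```

The quoted comma in `"Hello, world"` survives as ordinary text because the comma is no longer a delimiter in the output. A field that itself contains a pipe would still be quoted in the output, so whether Hive reads it correctly depends on how the table is defined.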
coppeliaMLA / clusterSankey.R
Last active August 29, 2015 14:02
Visualising cluster stability using a Sankey diagram
#Sequence for adding new data
s<-seq(20,50, by=5)
#Set up object for recording clusters
clus.change<-NULL
#Cycle through the clustering solutions
for (i in s){
  hc <- hclust(dist(USArrests[1:i,]), "ave")
coppeliaMLA / DendToForce.R
Created June 20, 2014 16:30
Converts a hclust dendrogram into a graph in JSON for input into D3
#Run hclust
hc <- hclust(dist(USArrests[1:40,]), "ave")
#Function for extracting nodes and links
extractGraph<-function(hc){
  n<-length(hc$order)
  m<-hc$merge
  links<-data.frame(source=as.numeric(), target=as.numeric(), value=as.numeric())
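The rest of `extractGraph` is not shown in the preview. As a hedged illustration of the general idea, the Python sketch below turns a merge matrix in R's `hclust` convention (a negative entry is an original observation, a positive entry refers to an earlier merge step) into node and link lists using zero-based indices. The function name, the node naming and the constant link `value` are assumptions, not the gist's actual output format.

```python
import json


def merge_to_graph(merge, labels):
    """Build {"nodes": [...], "links": [...]} from an hclust-style merge matrix.

    Leaves get indices 0..n-1; the cluster formed at step s (1-based)
    gets index n + s - 1.
    """
    n = len(labels)
    nodes = [{"name": label, "leaf": True} for label in labels]
    links = []
    for step, pair in enumerate(merge, start=1):
        parent = n + step - 1
        nodes.append({"name": "merge%d" % step, "leaf": False})
        for child in pair:
            # Negative: observation -child (1-based). Positive: earlier merge step.
            child_index = -child - 1 if child < 0 else n + child - 1
            links.append({"source": parent, "target": child_index, "value": 1})
    return {"nodes": nodes, "links": links}


# Toy merge matrix: A and B join first, then C joins that cluster
graph = merge_to_graph([(-1, -2), (-3, 1)], ["A", "B", "C"])
print(json.dumps(graph, indent=2))
```

A force layout that resolves links by array position can consume index-based links like these directly; replacing the constant `value` with something derived from the merge height would be one way to encode distance.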
coppeliaMLA / confusion.htm
Created June 24, 2014 07:52
Exploration of a confusion matrix using tangle.js
<!DOCTYPE html>
<html>
<head>
<title>Tangle: a JavaScript library for reactive documents</title>
<link rel="stylesheet" href="http://worrydream.com/Tangle/TangleKit/TangleKit.css" type="text/css">
<script type="text/javascript" src="http://worrydream.com/Tangle/TangleKit/mootools.js"></script>
<script type="text/javascript" src="http://worrydream.com/Tangle/TangleKit/sprintf.js"></script>
<script type="text/javascript" src="http://worrydream.com/Tangle/TangleKit/BVTouchable.js"></script>
coppeliaMLA / compCorrMI.R
Created June 25, 2014 16:00
Look at the relationship between MI and correlation for binary vars (since it's quicker than doing the maths)
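#Check the relationship between correlation and mutual information for binary vars
library(entropy) #Provides mi.empirical
store<-NULL
for (i in 1:1000){
  #Draw two binary variables with random success probabilities
  prob.1<-runif(1)
  prob.2<-runif(1)
  x<-rbinom(10000, 1, prob.1)
  y<-rbinom(10000, 1, prob.2)
  #Record the correlation and the empirical mutual information
  c<-cor(x,y)
  m<-mi.empirical(table(x,y))
  store<-rbind(store, data.frame(c=c, m=m))
}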
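To make the two quantities being compared concrete, here is a small pure-Python sketch of the Pearson correlation and a plug-in (empirical) mutual information estimate for two binary sequences. The function names are assumptions, and the natural-log unit is assumed to match the default of `mi.empirical` in the entropy package.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (both must vary)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mutual_information(x, y):
    """Plug-in mutual information in nats, from observed joint frequencies."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # p(a,b) * log(p(a,b) / (p(a) p(b))), summed over observed cells
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


# Two deterministic extremes: identical variables and perfectly balanced independence
same_x = [0, 1] * 50
indep_x = [0, 0, 1, 1] * 25
indep_y = [0, 1, 0, 1] * 25
print(pearson(same_x, same_x), mutual_information(same_x, same_x))
print(pearson(indep_x, indep_y), mutual_information(indep_x, indep_y))
```

For identical balanced binary variables the correlation is 1 and the mutual information is ln 2; for the balanced independent pair both are 0.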