coppelia machine learning and analytics coppeliaMLA
@coppeliaMLA
coppeliaMLA / finSim.R
Last active September 22, 2016 09:07
Uncertainty in a financial model
#First we are going to set up probability distributions for our beliefs about the inputs
#We've been told ARPU is about £7 and it's very unlikely to be higher than £10 or lower than £4
#So we'll go for a normal distribution centred at 7 with 2.5% and 97.5% quantiles at 4 and 10
#The standard deviation is the distance from the mean to the upper quantile (3) divided by 1.96
arpu.sd<-3/1.96
x<-seq(0, 15,by=0.5)
d<-dnorm(x, 7, arpu.sd)
plot(x, d, type='l')
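The 1.96 divisor in `arpu.sd` is the 97.5% point of the standard normal, so it places £4 and £10 at the 2.5% and 97.5% quantiles. Here is a minimal Python sketch of that arithmetic, assuming this is the intended reading; the helper name `sd_from_upper_quantile` is hypothetical and not part of the gist.

```python
from statistics import NormalDist

def sd_from_upper_quantile(mean, upper, prob):
    """Return the sd of a normal with the given mean whose `prob` quantile is `upper`."""
    z = NormalDist().inv_cdf(prob)  # standard normal quantile, roughly 1.96 for prob=0.975
    return (upper - mean) / z

# Assumption: "very unlikely to be higher than 10" is read as the 97.5% quantile.
arpu_sd = sd_from_upper_quantile(7.0, 10.0, 0.975)
arpu = NormalDist(mu=7.0, sigma=arpu_sd)

# By symmetry the matching lower quantile sits at 4.
lower, upper = arpu.inv_cdf(0.025), arpu.inv_cdf(0.975)
```

Passing `0.95` instead of `0.975` would give the wider standard deviation implied by 5% and 95% quantiles.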
coppeliaMLA / dfToJSON.R
Last active August 29, 2015 13:56
I've been using a lot of JavaScript charting and visualisation libraries recently (e.g. D3, Highcharts) and found that it is quite painful to get my data into the JSON structure required by each library. Since I'm doing most of the data manipulation in R anyway, it makes sense to arrange the data as a nested list in R and then transform it to JSO…
#Load libraries
library(rjson)
library(stringr)
dfToJSON<-function(df, mode='vector'){
colToList<-function(x, y){
coppeliaMLA / ArethereAnyGood5LetterDomains.py
Created February 25, 2014 17:38
Creates all 5 letter permutations of words, ranks them by how pronounceable they are, checks that they are a word in some language then checks whether the domain is free.
'''
Created on Feb 6, 2014
@author: sraper
'''
import itertools, urllib, urllib2, time, re, random
from bs4 import BeautifulSoup
def catchURL(queryURL): # Nicked this from someone. Afraid I can't remember who. Sorry
coppeliaMLA / trimFirstLine.py
Created February 27, 2014 09:44
Hive seems to struggle with file headers when loading flat files. Here's a bit of Python to trim the first line (i.e. the column header line) from every file.
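import os
directory = 'put your directory in here'
for filename in os.listdir(directory):
    path = os.path.join(directory, filename)
    # Read every line, keeping the line endings
    with open(path, 'r') as fin:
        data = fin.read().splitlines(True)
    # Write back everything except the header line
    with open(path, 'w') as fout:
        fout.writelines(data[1:])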
coppeliaMLA / joinTables.sql
Last active August 29, 2015 13:56
This is the HiveQL for the rag: quick start Hadoop and Hive for analysts. You can find it here: http://www.ragscripts.com/2014/03/04/quick-start-hadoop-and-hive-for-analysts/
drop table if exists recommender_set_num; --In case you need to rerun the script
drop table if exists person_ids_full_names;
drop table if exists recom_names;
-- Set up a table to load the recommendations data into
create external table if not exists recommender_set_num
(
    userID bigint,
    itemID bigint
) row format delimited fields terminated by ','
coppeliaMLA / cubicSplineExample.R
Last active August 29, 2015 13:57
Example of a cubic spline
x<-seq(1,10, by=0.1)
y<-sin(x/4)+rnorm(91, 0,0.05) #Sine function plus noise
plot(x,y)
#Knots at 2, 4 and 6
x2<-x^2
x3<-x^3
k1<-(x>2)*(x-2)^3
k2<-(x>4)*(x-4)^3
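The terms built by hand above (`x2`, `x3`, `k1`, `k2`) are the columns of a truncated power basis for a cubic spline. As a rough illustration only, the same basis can be sketched in plain Python. The function name `truncated_power_basis` and the explicit intercept column are assumptions, not part of the gist.

```python
def truncated_power_basis(x, knots):
    """Cubic-spline regressors for one x: 1, x, x^2, x^3, then (x - k)^3 where x > k, else 0."""
    row = [1.0, x, x ** 2, x ** 3]
    for k in knots:
        # Each knot adds a cubic term that switches on only to the right of the knot.
        row.append((x - k) ** 3 if x > k else 0.0)
    return row

knots = [2, 4, 6]                       # knots named in the comment above
xs = [1 + 0.1 * i for i in range(91)]   # same grid as seq(1, 10, by=0.1)
design = [truncated_power_basis(x, knots) for x in xs]
```

Each row of `design` could then be passed to any least-squares routine. The fitted curve is a cubic on each interval, with continuous first and second derivatives at the knots.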
coppeliaMLA / csvToPipe.py
Created March 7, 2014 12:50
Another useful bit of code for preparing flat files for Hive. Takes in csvs with double quote text delimiters and outputs pipe delimited files.
import os, csv
progDir = '/pathToFolderContainingCSVs/'
for filename in os.listdir(progDir):
    if filename != '.DS_Store':
        with open(progDir+filename, 'rb') as csvfile:
            progReader = csv.reader(csvfile, delimiter=',', quotechar='"')
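The preview above stops before the write step. A minimal Python 3 sketch of the conversion the description mentions follows, using in-memory streams so that it runs without any files. The function name `csv_to_pipe` and the sample rows are hypothetical, and this is not the gist's own code.

```python
import csv
import io

def csv_to_pipe(src, dst):
    """Copy rows from a comma-delimited, double-quoted CSV stream to a pipe-delimited one."""
    reader = csv.reader(src, delimiter=',', quotechar='"')
    # QUOTE_MINIMAL only quotes a field if it contains the pipe, a quote or a line break.
    writer = csv.writer(dst, delimiter='|', quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
    for row in reader:
        writer.writerow(row)

# Hypothetical sample: the quoted field contains a comma that must not split the column.
src = io.StringIO('id,name\n1,"Smith, Jane"\n')
dst = io.StringIO()
csv_to_pipe(src, dst)
result = dst.getvalue()
```

With real files, `src` and `dst` would be handles opened with `newline=''`, which is what the Python 3 `csv` module expects.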
coppeliaMLA / binDiff.R
Created March 21, 2014 08:14
A function that gives the probability mass function for the difference between two binomially distributed random variables
modBin<-function(k, n, p){
  if (k<=n) {
    return(dbinom(k, n, p))
  }
  else {
    return(0)
  }
}
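The rest of the function is cut off in this preview. For the difference of two binomial variables, the mass at d is the sum over k of P(X = k + d) P(Y = k). The Python sketch below implements that sum, assuming X and Y are independent. The names `binom_pmf` and `diff_pmf` and the example parameters are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial pmf that returns zero outside 0..n, a guard similar to modBin above."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def diff_pmf(d, n1, p1, n2, p2):
    """P(X - Y = d) for independent X ~ Bin(n1, p1) and Y ~ Bin(n2, p2)."""
    return sum(binom_pmf(k + d, n1, p1) * binom_pmf(k, n2, p2) for k in range(n2 + 1))

# The difference ranges from -n2 to n1; these example parameters are arbitrary.
pmf = {d: diff_pmf(d, 5, 0.3, 4, 0.6) for d in range(-4, 6)}
```

Summing `pmf.values()` should give 1 up to floating-point error, which is a quick sanity check on the convolution.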
coppeliaMLA / clusterSankey.R
Last active August 29, 2015 14:02
Visualising cluster stability using a Sankey diagram
#Sequence for adding new data
s<-seq(20,50, by=5)
#Set up object for recording clusters
clus.change<-NULL
#Cycle through the clustering solutions
for (i in s){
  hc <- hclust(dist(USArrests[1:i,]), "ave")
coppeliaMLA / DendToForce.R
Created June 20, 2014 16:30
Converts a hclust dendrogram into a graph in JSON for input into D3
#Run hclust
hc <- hclust(dist(USArrests[1:40,]), "ave")
#Function for extracting nodes and links
extractGraph<-function(hc){
  n<-length(hc$order)
  m<-hc$merge
  links<-data.frame(source=as.numeric(), target=as.numeric(), value=as.numeric())
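The body of `extractGraph` is truncated here, so the following is only a sketch of one way to turn a merge table into nodes and links. It is not the gist's implementation. It relies on the `hclust` convention that a negative entry -j in `merge` is observation j and a positive entry j is the cluster formed at step j. The node names, the 0-based ids and the use of merge height as the link value are all assumptions.

```python
import json

def merge_to_graph(merge, heights):
    """Turn an hclust-style merge table into D3-style nodes and links (hypothetical helper)."""
    n = len(merge) + 1  # number of leaves

    def node_id(entry):
        # Negative entry -j is leaf j (1-based); positive entry j is the cluster from step j.
        return -entry - 1 if entry < 0 else n + entry - 1

    nodes = [{"name": "leaf%d" % (i + 1)} for i in range(n)]
    links = []
    for step, (a, b) in enumerate(merge):
        parent = n + step  # 0-based id of the internal node created at this step
        nodes.append({"name": "merge%d" % (step + 1)})
        for child in (a, b):
            links.append({"source": parent, "target": node_id(child), "value": heights[step]})
    return {"nodes": nodes, "links": links}

# Toy input: leaves 1 and 2 merge first, then leaf 3 joins that cluster.
merge = [(-1, -2), (-3, 1)]
graph = merge_to_graph(merge, [0.5, 1.2])
graph_json = json.dumps(graph)
```

The resulting `nodes` and `links` arrays follow the shape that D3 force layouts commonly take as input.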