James Thomson jamesthomson

#import all data, add column headers and run checks
dist <- read.delim("~/Documents/my blog/million song database/7plus songs/output1.txt", header=FALSE)
colnames(dist)<-c('length', 'freq')
dist
dist_time <- read.csv("~/Documents/my blog/million song database/7plus songs/output2.txt", header=FALSE)
jamesthomson / import msd to dataframe.py
Created May 21, 2015 15:20
importing a million song dataset file and converting to a dataframe
import pandas as pd
#open and split file then convert to df
#use a context manager so the file handle is closed after reading
with open("P:\\A.tsv.a.txt", "r") as f:
    lines = [line.strip().split("\t") for line in f]
df=pd.DataFrame(lines)
#pull out columns for further split
#range objects cannot be joined with + in Python 3, so convert them to lists first
cols=list(range(18,22))+list(range(33,42))
arrays=df.loc[1:5,cols].values
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
print(X)
#scale the data
from sklearn.preprocessing import StandardScaler
SS=StandardScaler()
XS=SS.fit_transform(X)
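As a rough illustration of what the scaling step above computes, the sketch below standardises each column to zero mean and unit variance using only the standard library. The helper name `standard_scale` and the toy rows are hypothetical. scikit-learn's `StandardScaler` divides by the population standard deviation, which is what `statistics.pstdev` returns here.

```python
import statistics

def standard_scale(rows):
    """Column-wise (x - mean) / std, using the population std (divide by n)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    # Guard against constant columns: centre them without dividing by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy data for illustration only, not the iris measurements.
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standard_scale(toy)
```

After scaling, every column of `scaled` should have a mean of roughly 0 and a population standard deviation of roughly 1.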
jamesthomson / pandas manip lastfm data.py
Last active August 29, 2015 14:22
use pandas to manipulate lastfm listening data into the format I need, ready for modelling with sklearn
#import data
import pandas as pd
plays = pd.read_table("usersha1-artmbid-artname-plays-sample.tsv", usecols=[0, 2, 3], names=['user', 'artist', 'plays'])
users = pd.read_table("usersha1-profile-sample.tsv", usecols=[0, 1], names=['user', 'gender'])
#print(plays.head())
#print(users.head())
#drop users whose gender is unknown
users=users.dropna()
#dummy code up gender
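The preview is cut off at the dummy-coding step. One way such a step might look, shown as a standard-library sketch and not as the author's code, is to map each gender label to a 0/1 indicator. The labels `'m'` and `'f'`, the helper name `dummy_code`, and the sample rows are all assumptions made for illustration.

```python
# Hypothetical (user, gender) rows, as they might look after dropping missing values.
sample_users = [("u1", "m"), ("u2", "f"), ("u3", "m")]

def dummy_code(rows, positive_label="m"):
    """Return (user, indicator) pairs: 1 if gender equals positive_label, else 0."""
    return [(user, 1 if gender == positive_label else 0) for user, gender in rows]

coded = dummy_code(sample_users)
```

With a pandas DataFrame, `pd.get_dummies` on the gender column would be the more usual route to the same kind of indicator.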
jamesthomson / python predict gender.py
Last active October 12, 2015 14:49
process to predict gender based on lastfm data
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
plays = pd.read_table("usersha1-artmbid-artname-plays-sample.tsv", usecols=[0, 2, 3], names=['user', 'artist', 'plays'])
users = pd.read_table("usersha1-profile-sample.tsv", usecols=[0, 1], names=['user', 'gender'])
users=users.dropna()
jamesthomson / lastfm_spark_rec_aws.py
Last active March 27, 2016 23:40
aws version of the lastfm recommendations in spark
#in terminal connect to the master node
ssh hadoop@ec2-xx-xx-xxx-xxx.compute-1.amazonaws.com -i ~/aws_key_pair.pem
#then fire up spark
MASTER=yarn-client /home/hadoop/spark/bin/pyspark
lines = sc.textFile('s3n://jthomson/lastfm_listens/listens/usersha1-artmbid-artname-plays.tsv')
data = lines.map(lambda l: l.split('\t'))
ratings = data.map(lambda d: (d[0], d[2], 1))
users_lkp = ratings.map(lambda s: s[0]).distinct().zipWithUniqueId()
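The last line above builds a lookup from user hashes to numeric IDs, presumably because the downstream recommender needs integer user IDs. A plain-Python sketch of the same idea follows, with made-up sample tuples. One difference worth noting: Spark's `zipWithUniqueId` guarantees unique IDs but not consecutive ones, whereas `enumerate` here yields consecutive IDs.

```python
# Made-up (user, artist, rating) tuples in the shape produced by the map above.
ratings = [
    ("userA", "artist1", 1),
    ("userB", "artist1", 1),
    ("userA", "artist2", 1),
]

# Distinct users in first-seen order, each paired with an integer ID.
distinct_users = list(dict.fromkeys(user for user, _, _ in ratings))
users_lkp = {user: idx for idx, user in enumerate(distinct_users)}

# Swap each string user key for its integer ID.
ratings_by_id = [
    (users_lkp[user], artist, rating) for user, artist, rating in ratings
]
```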
jamesthomson / lastfm_spark_rec_local.py
Created June 29, 2015 19:51
local version of the lastfm recommendations in spark
#start a terminal at the folder where spark is installed
#in the command line run this to fire up a pyspark instance
./bin/pyspark
###########################
### LOADING IN THE DATA ###
###########################
#load in the file and examine
lines = sc.textFile('usersha1-artmbid-artname-plays.tsv')
jamesthomson / entity_recognition_example.py
Created July 11, 2016 13:16
basic named entity recognition example. pull out people, places, organisations
import nltk
#with open('sample.txt', 'r') as f:
# sample = f.read()
#article taken from the bbc
sample="""Renewed fighting has broken out in South Sudan between forces loyal to the president and vice-president. A reporter in the capital, Juba, told the BBC gunfire and large explosions could be heard all over the city; he said heavy artillery was being used. More than 200 people are reported to have died in clashes since Friday. The latest violence came hours after the UN Security Council called on the warring factions to immediately stop the fighting. In a unanimous statement, the council condemned the violence "in the strongest terms" and expressed "particular shock and outrage" at attacks on UN sites. It also called for additional peacekeepers to be sent to South Sudan.
Chinese media say two Chinese UN peacekeepers have now died in Juba. Several other peacekeepers have been injured, as well as a number of civilians who have been caught in crossfire. The latest round of violence erupted when troops loy
jamesthomson / word2vec example.py
Created July 12, 2016 09:44
word2vec model example using simple text sample
import nltk
import gensim
sample="""Renewed fighting has broken out in South Sudan between forces loyal to the president and vice-president. A reporter in the capital, Juba, told the BBC gunfire and large explosions could be heard all over the city; he said heavy artillery was being used. More than 200 people are reported to have died in clashes since Friday. The latest violence came hours after the UN Security Council called on the warring factions to immediately stop the fighting. In a unanimous statement, the council condemned the violence "in the strongest terms" and expressed "particular shock and outrage" at attacks on UN sites. It also called for additional peacekeepers to be sent to South Sudan.
Chinese media say two Chinese UN peacekeepers have now died in Juba. Several other peacekeepers have been injured, as well as a number of civilians who have been caught in crossfire. The latest round of violence erupted when troops loyal to President Salva Kiir and first Vice-President Riek Machar began sho
jamesthomson / word2vec tweets example.py
Created July 12, 2016 09:45
word2vec example using tweet data
import pandas as pd
import re
import numpy as np
import nltk
import gensim
#import data. contains identifier and tweet
#DataFrame.from_csv was removed in pandas 1.0; read_csv is the replacement
tweets=pd.read_csv('tweets.txt', sep='\t', index_col=False)
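gensim's `Word2Vec` is trained on an iterable of token lists, so the tweet text has to be tokenised before modelling. The sketch below shows one possible cleaning pass using only `re`. The regex patterns, the helper name `tokenise`, and the sample tweets are hypothetical and are not taken from the original gist.

```python
import re

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
NON_WORD = re.compile(r"[^a-z0-9\s]")

def tokenise(tweet):
    """Lower-case, drop URLs and @mentions, strip punctuation, split on whitespace."""
    text = URL_OR_MENTION.sub(" ", tweet.lower())
    text = NON_WORD.sub(" ", text)
    return text.split()

# Invented tweets for illustration.
sample_tweets = ["Loving the new album! http://example.com @someone", "So #good"]
sentences = [tokenise(t) for t in sample_tweets]
```

The resulting `sentences` list has the list-of-token-lists shape that a word2vec model expects as its training corpus.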