@ctufts
ctufts / PhlCrime_GettingStarted_PT_I_II.R
Last active November 7, 2015 22:35
The script imports Philadelphia Crime Data Parts I and II (https://www.opendataphilly.org/dataset/crime-incidents), creates a summary based on year, month, and crime type and creates a basic map in leaflet using the first 1000 incidents
# the following script will import Philadelphia Crime Data
# Parts I and II, create a summary based on year, month, and crime type
# and will create a basic map in leaflet using the first 1000 incidents
# you will need to install dplyr, leaflet, readr, lubridate, and stringr
# packages ( install.packages('package name') )
rm(list = ls())
library(dplyr)
library(leaflet)
library(readr)
ctufts / replaceValuesWithKeys.py
Last active November 7, 2015 22:39
Find each 'value' in a list of tokens and replace it with its 'key' from a dictionary. The dictionary can have multiple values per key. Example use: replacing multiple nicknames with one common name.
s = ['c' ,'is', 'equal', 'to', 'b']
print(s)
# output >> ['c', 'is', 'equal', 'to', 'b']
# dictionary of names:values
d = {'joe':['a', 'b'], 'tom':['c', 'd']}
# replace any values from the dict with the key value
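for i in range(len(s)):
    for key, value in d.items():
        for v in value:
            # assumed completion of the truncated preview: swap a matching token for its key
            if s[i] == v:
                s[i] = key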
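The nested loops above rescan the whole dictionary for every token. A minimal sketch of an alternative, assuming each value is listed under at most one key: invert the dictionary once, then map every token through it. The names `replace_values_with_keys` and `value_to_key` are illustrative and do not come from the gist.

```python
def replace_values_with_keys(tokens, key_to_values):
    """Return a copy of tokens with each known value replaced by its key."""
    # Invert {key: [values]} into {value: key}.
    # Assumption: no value is shared by two keys (the last key would win).
    value_to_key = {v: k for k, values in key_to_values.items() for v in values}
    # Tokens that are not listed under any key pass through unchanged.
    return [value_to_key.get(tok, tok) for tok in tokens]


s = ['c', 'is', 'equal', 'to', 'b']
d = {'joe': ['a', 'b'], 'tom': ['c', 'd']}
print(replace_values_with_keys(s, d))
# output >> ['tom', 'is', 'equal', 'to', 'joe']
```

Unlike the loop version, this returns a new list and leaves `s` untouched, and each token costs a single dictionary lookup.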
##################### Import Libraries ##############################################
# if you don't have the libraries below you can install the library using the
# following command : install.packages('package_name_here_in_quotes')
library(dplyr)
library(lubridate)
library(RColorBrewer)
library(leaflet)
library(stringr)
library(rgdal)
ctufts / deepAssignmentOperatorExamples.R
Created April 28, 2016 19:56
Example illustrating the deep assignment operator's (<<-) use in R
####################################################
# case of global assignment
# parent environment is the global environment
# from Hadley Wickham's 'Advanced R'
#####################################################
x <- 0
f <- function() {
  x <<- 1
}
ctufts / group_arrange_assign_ranking.R
Last active June 24, 2016 14:37
Group by, summarise, sort on the summary data, then append a rank based on the sort order (dplyr)
ds %>%
  group_by(group1, group2) %>%
  summarise(summary_value = some_function) %>%
  arrange(desc(summary_value)) %>%
  group_by(group1) %>%
  mutate(rank = row_number())
ctufts / python_reference.md
Created July 11, 2016 17:35
Pandas/Python functions/reference
  • df.dtypes : lists the dtype of each column in the dataframe (an attribute, so no parentheses)
ctufts / Stat_notes.md
Last active July 22, 2016 20:38
General notes about statistics (distributions, tests, etc.)
  • Test for normality:
    • Shapiro-Wilk: the null hypothesis is that the data are normally distributed. If the p-value is below alpha (0.05, or whichever significance level you choose), the null hypothesis is rejected and the data are treated as non-normal
    • With large samples the test is sensitive to even small departures from normality and will often be statistically significant, so accompany it with a Q-Q plot
    • Anderson-Darling
  • Comparison of distributions (no assumption of normality)
    • Mann-Whitney U Test: Similar to the Wilcoxon signed-rank test, but for independent samples (they don't have to be paired)
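To make the Mann-Whitney note concrete, here is a minimal pure-Python sketch of the U statistic computed from rank sums, with tied values given their average rank. It does not compute a p-value; in practice a library routine such as `scipy.stats.mannwhitneyu` would be used for that. The helper names and the sample data are made up for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples x and y."""
    ranks = average_ranks(list(x) + list(y))
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u_y = n1 * n2 - u_x  # the two statistics always sum to n1 * n2
    return u_x, u_y


# Hypothetical measurements from two unpaired groups of different sizes.
group_a = [1.1, 2.3, 2.9, 3.4]
group_b = [3.0, 4.8, 5.2, 6.1, 7.7]
print(mann_whitney_u(group_a, group_b))  # (1.0, 19.0)
```

Here `U_x` counts the pairs in which a value from `group_a` exceeds a value from `group_b` (ties count as one half), so a value near 0 or near `n1 * n2` suggests the two distributions are shifted relative to each other.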
ctufts / gensim_notes.md
Last active August 8, 2016 18:08
General notes from using gensim on 20 million messages
  • save_as_text : don't use this unless you just want to read the text in the file. Otherwise it will cause issues if you want to go back later and revise/filter the dictionary
  • If you choose to import a dictionary then alter it, the corpus must also be updated as outlined here - Q8
  • You have to limit the number of features in large datasets otherwise the memory consumption is huge
  • This is regardless of whether the corpus is loaded in RAM or serialized
  • Iterations argument - refers to the number of iterations in the EM step
ctufts / ODS.md
Last active August 15, 2016 16:03
Open data sites
ctufts / java_compatable_regex.txt
Last active September 8, 2016 20:15
Regular expression examples in R
(?<!@|#)\b\w+ : Match every word not preceded by @ or #; keeping only the matches removes hashtags and user handles from Twitter text
(?<!@|#)\b\w{2,} : Same as above, but only match words of length 2 or greater
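As a quick check of how the two patterns behave, the sketch below runs them through Python's `re` module, which accepts this lookbehind because both alternatives have the same fixed width. The gist itself targets R and Java-compatible engines, so this is only an illustration; the sample tweet and the names `WORDS` and `WORDS_MIN2` are made up.

```python
import re

# Words that are not immediately preceded by @ or #.
WORDS = re.compile(r'(?<!@|#)\b\w+')
# Same, but requiring at least two word characters.
WORDS_MIN2 = re.compile(r'(?<!@|#)\b\w{2,}')

tweet = "a quick note from @ctufts about #rstats and regex"

print(WORDS.findall(tweet))
# ['a', 'quick', 'note', 'from', 'about', 'and', 'regex']
print(WORDS_MIN2.findall(tweet))
# ['quick', 'note', 'from', 'about', 'and', 'regex']
```

The handle and the hashtag drop out entirely rather than losing only their first letter: the `\b` anchor means a match can only start at the beginning of a word, and that is exactly where the lookbehind sees the `@` or `#`.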