Chris Tufts ctufts

## PhlCrime_GettingStarted_PT_I_II.R
# the following script will import Philadelphia Crime Data
# Parts I and II, create a summary based on year, month, and crime type
# and will create a basic map in leaflet using the first 1000 incidents
# you will need to install dplyr, leaflet, readr, lubridate, and stringr
# packages ( install.packages('package name') )

rm(list = ls())
library(dplyr)
library(leaflet)
library(readr)

## replaceValuesWithKeys.py
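s = ['c', 'is', 'equal', 'to', 'b']
print(s)
# output >>  ['c', 'is', 'equal', 'to', 'b']
# dictionary of names:values
d = {'joe': ['a', 'b'], 'tom': ['c', 'd']}

# replace any values from the dict with the key value
for i in range(0, len(s)):
    for key, value in d.items():
        for v in value:
            # the preview is cut off here; this loop body is an assumed completion
            if s[i] == v:
                s[i] = key
print(s)
# output >>  ['tom', 'is', 'equal', 'to', 'joe']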

## PHL_Crime_By_District.R

##################### Import Libraries ##############################################
# if you don't have the libraries below you can install the library using the
# following command : install.packages('package_name_here_in_quotes')
library(dplyr)
library(lubridate)
library(RColorBrewer)
library(leaflet)
library(stringr)
library(rgdal)

## deepAssignmentOperatorExamples.R
####################################################
# case of global assignment
# parent environment is the global environment
# from Hadley Wickham's 'Advanced R'
#####################################################
x <- 0
f <- function() {
    x <<- 1

}

## group_arrange_assign_ranking.R
ds %>% group_by(group1, group2) %>%
  summarise(
    summary_value = some_function
  ) %>% arrange(desc(summary_value)) %>% group_by(group1) %>%
  mutate(rank=row_number())


## python_reference.md

      
ctufts / python_reference.md (created July 11, 2016 17:35): Pandas/Python functions/reference

df.dtypes : lists the dtype of each column in the DataFrame (it is an attribute, so no parentheses)


## Stat_notes.md

      
ctufts / Stat_notes.md (last active July 22, 2016 20:38): General notes about statistics (distributions, tests, etc.)

Test for normality:

Shapiro-Wilk: the null hypothesis is that the data are normally distributed. If the p-value is below alpha (0.05 or whatever significance level you are using), the null hypothesis is rejected (the data are non-normal).
With large samples the test is sensitive to sample size (even trivial departures from normality become statistically significant), so accompany it with a Q-Q plot.
Anderson-Darling


Comparison of distributions (no assumption of normality)

Kolmogorov-Smirnov test

Compares the empirical CDFs of two samples. The D value is the largest vertical distance between them: D close to 1 indicates the distributions are different, D close to 0 indicates they are close to one another.
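
As a rough illustration of what the D value measures, here is a minimal pure-Python sketch that computes the two-sample statistic as the largest gap between the two empirical CDFs. The function name `ks_statistic` and the sample numbers are made up for this example, and the sketch returns only D, not a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every observed point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
disjoint = ks_statistic([1, 2, 3, 4], [10, 11, 12, 13])
print(same, disjoint)  # 0.0 1.0
```

Identical samples give D = 0 and completely separated samples give D = 1, which matches the rule of thumb in the note.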


Wilcoxon’s signed-rank test

Compares medians from two paired sample sets (it tests whether the median of the paired differences is zero)


Mann-Whitney U test: similar to Wilcoxon's signed-rank test, but the samples don't have to be paired
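
To make the "don't have to be paired" point concrete, the following is a minimal sketch of the U statistic using its pair-counting definition. The helper name `mann_whitney_u` and the data are hypothetical, and the sketch stops at U (no p-value); for real analyses a library routine such as `scipy.stats.mannwhitneyu` would be the usual choice.

```python
def mann_whitney_u(sample_a, sample_b):
    """U for sample_a: number of (a, b) pairs with a > b; ties count 0.5."""
    u = 0.0
    for x in sample_a:
        for y in sample_b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


a = [1, 2, 3]
b = [4, 5, 6, 7]  # a different length is fine: nothing is paired

u_a = mann_whitney_u(a, b)
u_b = mann_whitney_u(b, a)
print(u_a, u_b)  # 0.0 12.0
```

The two values always sum to `len(a) * len(b)`; a U near 0 or near that product means one sample sits almost entirely above the other.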


## gensim_notes.md

      
ctufts / gensim_notes.md (last active August 8, 2016 18:08): General notes from using gensim on 20 million messages

save_as_text : don't use this unless you just want to read the text in the file. Otherwise it will cause issues if you want to go back later and revise/filter the dictionary
If you choose to import a dictionary then alter it, the corpus must also be updated as outlined here - Q8
You have to limit the number of features in large datasets; otherwise the memory consumption is huge
This is regardless of whether the corpus is loaded in RAM or serialized
Iterations argument - refers to the number of iterations in the EM step


## ODS.md

      
ctufts / ODS.md (last active August 15, 2016 16:03): Open data sites

CDC WONDER

mortality data
birth data
environment
population data


Pennsylvania State Data Center

County level data (mostly census data) for PA


Census Data
County Adjacency: county adjacency data from the US Census Bureau
County Health Rankings


## java_compatable_regex.txt
(?<!@|#)\b\w+    : Match only words that do not start with @ or # (keeping the matches removes hashtags and user handles from Twitter text)

(?<!@|#)\b\w{2,}  : Same as above, but only keep words with a length of 2 or greater
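
A quick way to check what these patterns keep is to run them over a sample string. The sketch below uses Python's `re` module, which also accepts this lookbehind because both alternatives (`@` and `#`) are one character wide; the sample tweet and the variable names are made up for illustration.

```python
import re

# Patterns copied from the notes above.
WORDS = re.compile(r"(?<!@|#)\b\w+")
WORDS_MIN_2 = re.compile(r"(?<!@|#)\b\w{2,}")

tweet = "a quick note from @ctufts about #rstats"  # made-up sample text

kept = WORDS.findall(tweet)
kept_min_2 = WORDS_MIN_2.findall(tweet)

print(kept)        # ['a', 'quick', 'note', 'from', 'about']
print(kept_min_2)  # ['quick', 'note', 'from', 'about']
```

The handle and the hashtag are skipped because the word boundary inside them is preceded by `@` or `#`, and the second pattern additionally drops the one-letter word.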