@ctufts
ctufts / groupby_apply_multiple_inputs.py
Created July 29, 2016 18:06
group by and apply a function with multiple input arguments (PANDAS)
# ds has columns A, B, and C: group by A, then pass B and C as the two
# inputs to the MSE calculation for each group
from sklearn import metrics  # assumes `metrics` refers to sklearn.metrics

grouped = ds.groupby('A')
mse = grouped.apply(lambda x: metrics.mean_squared_error(x['B'], x['C']))
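A minimal, self-contained sketch of the same pattern, assuming a small made-up DataFrame (the values below are illustrative, not from the gist). To avoid depending on scikit-learn, the MSE is computed directly as the mean of the squared differences, which should match what `mean_squared_error` returns for two columns.

```python
import pandas as pd

# Hypothetical data: A is the grouping key, B and C are the two inputs.
ds = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "B": [1, 2, 3, 4],
    "C": [1, 4, 2, 2],
})

# Select only the input columns, then apply a function that uses both of them.
mse = ds.groupby("A")[["B", "C"]].apply(
    lambda x: ((x["B"] - x["C"]) ** 2).mean()
)
print(mse)
```

The result is a Series indexed by the values of `A`, with one MSE per group.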
@ctufts
ctufts / Stat_notes.md
Last active July 22, 2016 20:38
General notes about statistics (distributions, tests, etc.)
  • Test for normality:
    • Shapiro-Wilk: the null hypothesis is that the data are normally distributed. If the p-value is below alpha (0.05, or whatever significance level you have chosen), the null hypothesis is rejected (the data are non-normal).
    • With large samples the test becomes very sensitive: even trivial departures from normality come out statistically significant, so accompany the test with a Q-Q plot.
    • Anderson-Darling
  • Comparison of distributions (no assumption of normality):
    • Mann-Whitney U test: similar to the Wilcoxon signed-rank test, but the samples don't have to be paired (also known as the Wilcoxon rank-sum test)
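To make the Mann-Whitney idea concrete, here is a rough sketch of how the U statistic can be computed by hand for two independent samples. The function name and the sample values are hypothetical, and this only computes the statistic, not a p-value; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used.

```python
def mann_whitney_u(xs, ys):
    """Return (U_x, U_y) for two independent, unpaired samples.

    U_x counts the (x, y) pairs in which x is larger than y,
    with ties counted as one half.
    """
    u_x = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u_x += 1.0
            elif x == y:
                u_x += 0.5
    # The two statistics always sum to the number of pairs.
    u_y = len(xs) * len(ys) - u_x
    return u_x, u_y


# Illustrative samples of different sizes (no pairing required).
sample_a = [1.1, 2.3, 3.8, 4.0]
sample_b = [2.0, 5.5, 6.1]
u_a, u_b = mann_whitney_u(sample_a, sample_b)
print(u_a, u_b)
```

Because the samples are compared pair by pair across groups, they can have different lengths, which is the practical difference from a paired test.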
@ctufts
ctufts / group_by_and_ggplot.R
Created July 11, 2016 19:56
dplyr group_by and ggplot example
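library(dplyr)
library(ggplot2)

# build one scatterplot per value of `feature`; the plots are stored in a list column
plot_df <- df %>% group_by(feature) %>%
  do(
    plots = ggplot(data = .) + aes(x = xcol, y = ycol) +
      geom_point() + ggtitle(.$feature)
  )
# show plots
plot_df$plots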
@ctufts
ctufts / python_reference.md
Created July 11, 2016 17:35
Pandas/Python functions/reference
  • df.dtypes : lists the data type of each column in the DataFrame (it is an attribute, not a method, so no parentheses)
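A short illustration, assuming a small made-up DataFrame (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column.
df = pd.DataFrame({"count": [1, 2, 3], "price": [0.5, 1.25, 2.0]})

# dtypes is accessed as an attribute and returns a Series of column types.
col_types = df.dtypes
print(col_types)
```

Writing `df.dtypes()` would fail, because the returned Series is not callable.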
@ctufts
ctufts / .block
Last active June 21, 2023 21:26
Clustered Force Layout Bubble Chart
license: gpl-3.0
height: 500
border: yes
@ctufts
ctufts / group_arrange_assign_ranking.R
Last active June 24, 2016 14:37
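Group by, summarise, sort on the summary value, and append a rank based on that sort order (dplyr)
library(dplyr)

# some_function is a placeholder for the summary computed per (group1, group2)
ds %>% group_by(group1, group2) %>%
  summarise(
    summary_value = some_function
  ) %>% arrange(desc(summary_value)) %>% group_by(group1) %>%
  mutate(rank = row_number())  # rank within each group1, largest summary_value first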
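For comparison, a rough pandas equivalent of the same group, summarise, sort, and rank pipeline. This is a sketch under assumptions: the data are made up, the `value` column is hypothetical, and a sum stands in for the unspecified summary function.

```python
import pandas as pd

# Hypothetical data; `value` is the column being summarised.
ds = pd.DataFrame({
    "group1": ["x", "x", "x", "y", "y"],
    "group2": ["a", "b", "b", "a", "b"],
    "value": [1, 5, 2, 4, 3],
})

# Summarise per (group1, group2), then sort by the summary, largest first.
ranked = (
    ds.groupby(["group1", "group2"], as_index=False)
      .agg(summary_value=("value", "sum"))
      .sort_values("summary_value", ascending=False)
)

# Number the rows within each group1 in their sorted order (1 = largest).
ranked["rank"] = ranked.groupby("group1").cumcount() + 1
print(ranked)
```

As in the dplyr version, the rank depends on the rows already being sorted before the per-group numbering is applied.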
@ctufts
ctufts / .block
Last active April 15, 2019 23:12
D3 Scatterplot with Regression Line
license: gpl-3.0
height: 500
scrolling: no
border: no
@ctufts
ctufts / ODS.md
Last active August 15, 2016 16:03
Open data sites
@ctufts
ctufts / General Conda Commands.md
Last active January 3, 2024 03:36
List of commonly used commands in Anaconda
  • conda info --envs : list all environments
  • source activate <env name> : activate an environment (conda 4.4 and later also accept conda activate <env name>)
  • source deactivate : deactivate the current environment (conda deactivate in conda 4.4 and later)
  • conda list : list all packages installed in the active environment
  • conda create --name <env name> python=3 astroid babel : create a new environment, specifying the Python version and the packages to install
  • Windows note: source is not recognized. When activating or deactivating in the Anaconda command prompt, skip source and just type activate <env name> or deactivate.
  • conda env export > environment.yml : export the environment's requirements list to a file
  • conda env remove -n ENV_NAME : delete an environment
@ctufts
ctufts / LinuxCommands.md
Last active March 1, 2023 20:09
Common Linux commands

Ubuntu Linux

  • sudo -i : start a login shell as the superuser (root)
  • du : show disk usage for the current directory and all of its subdirectories
  • df -h : show disk space usage for each mounted filesystem, in human-readable units
  • ls -a : show all files in a directory (including hidden files)
  • rsync : copy files from one server to another (similar to scp, but with more functionality)
    • Run rsync with sudo on the remote server:
    • rsync -az -e "ssh" --rsync-path="sudo rsync" user@servername:/pulled-source-directory /local-directory/
    • General form: rsync [source] [destination]