
Martin Papenberg m-Py

@m-Py
m-Py / PCA_Variance_Explained.R
Created February 22, 2024 11:54
Get % of variance explained by Principal Component Analysis
# Get % of variance explained by Principal Component Analysis
library(psych)
pca_variance_explained <- function(data, n_components) {
  pca <- psych::principal(data, n_components, rotate = "none")
  list(
    by_variable = colSums(cor(pca$scores, data)^2),
    # scale. = TRUE so that prcomp(), like psych::principal(), works on the correlation matrix
    total = summary(prcomp(data, scale. = TRUE))$importance["Cumulative Proportion", paste0("PC", n_components)]
  )
}
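A quick usage sketch for the function above (an illustration, not part of the gist; `mtcars` is just a stand-in data set):

```r
# Hypothetical usage of pca_variance_explained() on a built-in data set
res <- pca_variance_explained(mtcars, n_components = 2)
res$by_variable  # variance of each variable accounted for by the 2 components
res$total        # cumulative proportion of total variance explained by PC1 + PC2
```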
# Compare Cohen's h and the phi coefficient as effect sizes for comparing proportions
library(effectsize)
# Test data: 2x2 contingency tables; the second column is the same in every table
matrices <- lapply(1:999, function(i) matrix(c(i, 1000 - i, 999, 1), ncol = 2))
phis <- sapply(matrices, function(x) effectsize::phi(x)$phi)
hs <- sapply(matrices, function(x) effectsize::cohens_h(x)$Cohens_h)
plot(abs(hs), phis, type = "l")
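For any one of these tables, both effect sizes can also be computed from first principles (a sketch, not part of the gist), using the textbook formulas phi = sqrt(chi-squared / n) and Cohen's h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)):

```r
m <- matrix(c(300, 700, 999, 1), ncol = 2)
p1 <- m[1, 1] / sum(m[, 1])                    # proportion in column 1: 300/1000
p2 <- m[1, 2] / sum(m[, 2])                    # proportion in column 2: 999/1000
h   <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)) # Cohen's h
phi <- sqrt(chisq.test(m, correct = FALSE)$statistic / sum(m))
```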
@m-Py
m-Py / small_anticlust_simulation.R
Last active October 21, 2020 17:39
Small anticlust simulation
# Test whether splitting data via anticlustering yields group means closer to the *true*
# population means than a random split does (e.g., for cross-validation)
library(anticlust)
simulate <- function(N = 100, split = c(1, 3) / 4) { # default: split 75/25
data <- rnorm(N)
groups <- anticlustering(
data,
K = round(N * split),
objective = "variance"
)
c(
@m-Py
m-Py / test_anticlust.R
Last active October 13, 2020 15:24
Test out the most recent version (v0.5.4) of anticlust
## 1. Load - and, if required, install - package `anticlust`
if (!requireNamespace("remotes", quietly = TRUE)) {
install.packages("remotes")
}
remotes::install_github("m-Py/anticlust")
library(anticlust)
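Once installed, a minimal call might look like this (a sketch, not from the gist; the `anticlustering()` arguments follow the same pattern as in the simulation gist above):

```r
library(anticlust)
features <- matrix(rnorm(200), ncol = 2)
# Partition 100 cases into 2 groups that are as similar as possible
groups <- anticlustering(features, K = 2, objective = "variance")
table(groups)  # equal-sized groups
```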
# Show that including an interaction in glm() changes the nature of a main effect
# (only if the categorical predictor is dummy coded - not contrast coded)
# Returns the p-value associated with a predictor main effect, once
# with and once without interaction with a (non-predictive) categorical
# independent variable
simulate_glm <- function(N = 100, contrast_coding = FALSE) {
iv1 <- rnorm(N) # related to DV
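The coding issue can be illustrated directly in R (a sketch with assumed variable names, not the gist's code): under the default treatment (dummy) coding, the `iv1` coefficient in a model with an interaction is the simple effect of `iv1` at the factor's reference level; under sum (contrast) coding it is the effect averaged over the factor levels.

```r
set.seed(1)
N   <- 100
iv1 <- rnorm(N)                                        # related to DV
iv2 <- factor(sample(c("a", "b"), N, replace = TRUE))  # non-predictive factor
dv  <- iv1 + rnorm(N)
m_dummy <- glm(dv ~ iv1 * iv2)       # default: dummy (treatment) coding
contrasts(iv2) <- contr.sum(2)       # switch to sum (contrast) coding
m_sum   <- glm(dv ~ iv1 * iv2)
coef(summary(m_dummy))["iv1", ]      # simple effect at the reference level
coef(summary(m_sum))["iv1", ]        # effect averaged over the factor levels
```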
@m-Py
m-Py / KNN_RANN.R
Last active February 25, 2020 19:16
# Author: Martin Papenberg
# Year: 2019
# Fast KNN classification using RANN for the nearest neighbour search
library("RANN")
library("data.table")
# param data: The numeric data matrix used
# param labels: the labels to predict
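The core of a RANN-based KNN classifier can be sketched as follows (assumed variable names, not the gist's code): `RANN::nn2()` does the fast neighbour search, and each prediction is a majority vote among the k neighbours' labels.

```r
library(RANN)
set.seed(1)
train  <- matrix(rnorm(200), ncol = 2)
labels <- sample(c("a", "b"), 100, replace = TRUE)
test   <- matrix(rnorm(20), ncol = 2)
nn <- RANN::nn2(train, test, k = 5)  # indices of the 5 nearest training points
pred <- apply(nn$nn.idx, 1, function(idx) names(which.max(table(labels[idx]))))
```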
## This document illustrates that Type I sums of squares lead to inflated alpha
## error rates when a predictive covariate is included in the regression model.
# Estimate p-value for treatment (null) effect via linear regression,
# including a covariate that is predictive of the outcome
#
# param N: sample size, default 100
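The Type I (sequential) sum-of-squares issue is easy to demonstrate, because `anova()` on an `lm` fit in R tests predictors sequentially, so predictor order matters (a sketch, not the gist's code):

```r
set.seed(1)
N <- 100
covariate <- rnorm(N)
treatment <- sample(rep(0:1, N / 2))     # null effect by construction
dv <- 0.5 * covariate + rnorm(N)         # covariate predicts the outcome
anova(lm(dv ~ treatment + covariate))    # treatment tested ignoring the covariate
anova(lm(dv ~ covariate + treatment))    # treatment tested after the covariate
```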
@m-Py
m-Py / correlated_data.R
Last active May 13, 2020 08:05
Function to generate bivariate normal data with specified correlation
## Year 2019 - 2020
## Author: Martin Papenberg
## This code is in the public domain, do with it whatever you like.
# Generate bivariate normal data with specified correlation
# param n: how many data points
# param mx: the mean of the first variable
# param my: the mean of the second variable
# param sdx: the standard deviation of the first variable
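The gist's preview is cut off here, but the standard construction behind such a function is short (a sketch, not necessarily the gist's implementation): mix one standard normal variable with independent noise.

```r
set.seed(1)
n <- 10000
r <- 0.5
x <- rnorm(n)
y <- r * x + sqrt(1 - r^2) * rnorm(n)  # y is standard normal with cor(x, y) ~ r
round(cor(x, y), 2)
# Shift and rescale x and y afterwards to get the desired means and SDs
```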
@m-Py
m-Py / SIX_OUT_OF_THIRTY.R
Created January 16, 2019 13:24
How the p value in a t test can be minimized by data removal
## Warning: This code is just for fun / educational purposes; the file contains functions
## to find out how far the p value in a t-test can be driven down by systematic removal of data points.
## SIX OUT OF THIRTY - Martin's approach
## Based on @juli_tkotz's (https://twitter.com/juli_tkotz/status/1085446224117985281)
## idea that removing the most extreme values is the best approach.
#' Simulate t-tests and store best p values
#'
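A greedy version of the idea can be sketched like this (an illustration, not the gist's actual algorithm): repeatedly drop the observation whose removal yields the smallest p value.

```r
set.seed(1)
x <- rnorm(30, mean = 0.2)
y <- rnorm(30)
minimize_p <- function(x, y, n_remove = 6) {
  for (i in seq_len(n_remove)) {
    p <- sapply(seq_along(x), function(j) t.test(x[-j], y)$p.value)
    x <- x[-which.min(p)]  # drop the point whose removal helps most
  }
  t.test(x, y)$p.value
}
minimize_p(x, y)  # far smaller than t.test(x, y)$p.value
```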
@m-Py
m-Py / ordinal_scores.R
Last active February 20, 2018 10:34
Compute ordinal scores from continuous data
## Author Martin Papenberg
## Year 2018
## This code is released into the public domain. Anybody may use, alter
## and distribute the code without restriction. The author makes no
## guarantees, and takes no liability of any kind for use of this code.
#' Compute ordinal scores from continuous data
#'
#' Might be useful for data exploration with highly skewed data
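In base R, ordinal scores of this kind are essentially what `rank()` computes (a sketch; the gist's function, whose preview ends here, may handle ties or options differently):

```r
x <- c(10, 2, 500, 2)             # highly skewed toy data
rank(x, ties.method = "average")  # 3.0 1.5 4.0 1.5
```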