Skip to content

Instantly share code, notes, and snippets.

@statwonk
Last active February 7, 2021 17:35
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save statwonk/51b53242483bf5c557a80d6e7d810359 to your computer and use it in GitHub Desktop.
Save statwonk/51b53242483bf5c557a80d6e7d810359 to your computer and use it in GitHub Desktop.
Putting sklearn's SGD algo through its paces, now with J groups.
---
title: "Testing sklearn's Stochastic Gradient Descent Algo"
author: "Statwonk"
date: "2/07/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(reticulate)
use_python("/Users/statwonk/big_sk_logistic/venv/bin/python")
```
First generate some data,
```{r}
library(tidyverse)
library(brms)
library(tidybayes)
1e4 -> N # obs
5 -> J # groups of members
20 -> K # members
0.5 -> base_p # base rate, this is logistic regression
# sample member coefficients
tibble(groups = LETTERS[1:J]) %>%
group_by(groups) %>%
mutate(coefs = list(tibble(member = paste0("member_", seq_len(K)), coef = rnorm(K, sd = 1)))) %>%
ungroup() %>%
unnest(cols = c(coefs)) -> d
# suppose this is unbalanced data, where multilevel / regularization shines.
tibble(groups = unique(d$groups)) %>%
mutate(appearance_probabilities = map(groups, ~ rbeta(K, 1, 1))) %>%
unnest(cols = c(appearance_probabilities)) %>%
group_by(groups) %>%
mutate(appearance_probabilities = appearance_probabilities / sum(appearance_probabilities)) %>%
ungroup() %>%
pull(appearance_probabilities) -> appearance_probabilities
```
```{r}
# generate samples
d %>%
mutate(appearance_probabilities) %>%
mutate(appearances = rmultinom(1, N, appearance_probabilities)[, 1]) %>%
mutate(data = parallel::mcmapply(function(x, y) {
if(x == 0) {
stop("Increase N, 0 appearances by a group, unhandled exception.")
}
tibble(p = plogis(qlogis(base_p) + y)) %>%
rerun(x, .) %>%
bind_rows() %>%
mutate(y = rbinom(n(), 1, p)) %>% # linear predictor
select(-p)
}, appearances, coef, SIMPLIFY = FALSE, mc.cores = 8)) %>% # 8 cores
select(groups, member, data) %>%
unnest(cols = data) %>%
select(y, A = member) -> d2
d2 %>% count(A) # check frequency of group member obs
bind_cols(
y = d2$y,
d2 %>%
model.matrix(~ A - 1, data = .) %>%
as_tibble()
) %>%
mutate_if(is.numeric, as.integer) -> d3
```
```{python test, echo=FALSE, message=FALSE, warning=FALSE}
import os, sys
class suppress_output:
"""
https://medium.com/swlh/python-recipes-suppress-stdout-and-stderr-messages-9141ef4b1373
"""
def __init__(self,suppress_stdout=False,suppress_stderr=False):
self.suppress_stdout = suppress_stdout
self.suppress_stderr = suppress_stderr
self._stdout = None
self._stderr = None
def __enter__(self):
devnull = open(os.devnull, "w")
if self.suppress_stdout:
self._stdout = sys.stdout
sys.stdout = devnull
if self.suppress_stderr:
self._stderr = sys.stderr
sys.stderr = devnull
def __exit__(self, *args):
if self.suppress_stdout:
sys.stdout = self._stdout
if self.suppress_stderr:
sys.stderr = self._stderr
import sklearn
from sklearn import linear_model
import pandas as pd
import numpy as np
d = r.d3
y = d.iloc[:, 0]
X = d.iloc[:,list(range(1, d.shape[1]))]
B = []
for x in list(range(0, 500)):
with suppress_output(suppress_stdout=True,suppress_stderr=True):
fit = linear_model.SGDClassifier(loss='log', fit_intercept=False, max_iter=100000, tol=1e-4)
fit.partial_fit(X, y, (0, 1))
est = pd.DataFrame(np.transpose(fit.coef_), X.columns, columns=['Coefficients'])
B.append([est])
estimates = np.transpose(B[0][0])
for x in list(range(0, len(B))):
estimates = estimates.append(np.transpose(B[x][0]))
```
```{r pressure}
py$estimates %>%
reshape2::melt() %>%
as_tibble() %>%
filter(variable != "(Intercept)") %>%
mutate(variable = as.character(variable),
variable = substr(variable, 2, nchar(variable))) %>%
inner_join(d %>% rename(variable = member, actual = coef), by = "variable") -> coefs
```
```{r}
coefs %>%
mutate(diff = value - actual) %>%
ggplot(aes(x = diff)) +
geom_density(aes(color = factor(groups))) +
geom_vline(xintercept = 0, color = 'red') +
scale_color_discrete(name = "Group") +
facet_wrap(~ variable)
```
```{r}
coefs %>%
mutate(individual = factor(paste(groups, variable))) %>%
mutate(diff = value - actual) %>%
ggplot(aes(x = diff, group = individual, color = factor(groups))) +
stat_ecdf() +
geom_vline(xintercept = 0, color = "black", size = 1) +
scale_color_discrete(name = "Group") +
xlab("Difference between known parameter and estimate")
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment