Comparing pairs of MNIST digits based on one pixel
library(tidyverse)

# Data is downloaded from here:
# https://www.kaggle.com/c/digit-recognizer
kaggle_data <- read_csv("~/Downloads/train.csv")

# Reshape to one row per (image, pixel) with columns label, instance, pixel, value
pixels_gathered <- kaggle_data %>%
  mutate(instance = row_number()) %>%
  gather(pixel, value, -label, -instance) %>%
  extract(pixel, "pixel", "(\\d+)", convert = TRUE)

# For every ordered pair of digits and every pixel, build an ROC curve that
# uses the pixel's intensity as the score and compare2 as the positive class
roc_by_pixel <- pixels_gathered %>%
  filter(instance %% 20 == 0) %>%  # keep every 20th image
  crossing(compare1 = 0:4, compare2 = 0:4) %>%
  filter(label == compare1 | label == compare2, compare1 != compare2) %>%
  group_by(compare1, compare2, pixel, value) %>%
  summarize(positive = sum(label == compare2),
            negative = n() - positive) %>%
  arrange(desc(value)) %>%  # brightest intensity first
  mutate(tpr = cumsum(positive) / sum(positive),
         fpr = cumsum(negative) / sum(negative)) %>%
  filter(n() > 1)  # drop pixels that take only one distinct value

# Area under each ROC curve (trapezoidal rule), drawn on the 28 x 28 pixel grid
roc_by_pixel %>%
  summarize(auc = sum(diff(fpr) * (tpr + lag(tpr))[-1]) / 2) %>%
  arrange(desc(auc)) %>%
  mutate(row = pixel %/% 28, column = pixel %% 28) %>%
  ggplot(aes(column, 28 - row, fill = auc)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = .5) +
  facet_grid(compare2 ~ compare1) +
  labs(title = "AUC for distinguishing pairs of MNIST digits by one pixel",
       subtitle = "Red means pixel is predictive of the row, blue predictive of the column",
       fill = "AUC") +
  theme_void()
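
As a rough illustration of what each tile in the plot measures, here is a minimal base-R sketch of the same AUC calculation for one pixel and one pair of digits. The data frame `toy`, its intensity values, and the name `positive_digit` are invented for illustration and are not part of the gist. The AUC can be read as the probability that a randomly chosen image of the positive digit has a brighter value at this pixel than a randomly chosen image of the other digit, with ties counted as one half, so 0.5 means the pixel carries no information about the pair.

```r
# Made-up intensities of a single pixel in six images of two digits.
toy <- data.frame(
  label = c(0, 0, 0, 1, 1, 1),
  value = c(0, 10, 200, 0, 180, 255)
)
positive_digit <- 1  # plays the role of compare2 in the gist

# Walk through the distinct intensities from brightest to darkest and count
# how many images of each class sit at each intensity.
thresholds <- sort(unique(toy$value), decreasing = TRUE)
positive <- sapply(thresholds, function(v) {
  sum(toy$value == v & toy$label == positive_digit)
})
negative <- sapply(thresholds, function(v) {
  sum(toy$value == v & toy$label != positive_digit)
})

# Cumulative true and false positive rates as the threshold is lowered.
tpr <- cumsum(positive) / sum(positive)
fpr <- cumsum(negative) / sum(negative)

# Trapezoidal rule over consecutive ROC points, as in the summarize() step.
auc <- sum(diff(fpr) * (tpr[-1] + tpr[-length(tpr)])) / 2
print(auc)  # 11/18, about 0.61
```

A value above 0.5 means the pixel tends to be brighter in the positive digit, and a value below 0.5 means it tends to be brighter in the other digit.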

@mathematicalmichael commented Sep 10, 2019

I know it's been a minute, but I'd like to comment that I couldn't get this to run.
I used something analogous to a beefed-up version of https://github.com/jupyter/docker-stacks/tree/master/r-notebook (docker run -p 8888:8888 jupyter/r-notebook) for my environment.

The error I received was:

R[write to console]: Parsed with column specification:
cols(
  .default = col_double()
)

R[write to console]: See spec(...) for full column specifications.

R[write to console]: Error in is_character(x) : object 'label' not found
Calls: <Anonymous> ... vars_select_eval -> map_if -> map -> .f -> - -> is_character

R[write to console]: In addition: 
R[write to console]: Warning message:

R[write to console]: Duplicated column names deduplicated: '0' => '0_1' [3], '0' => '0_2' [4], '0' => '0_3' [5], '0' => '0_4' [6], '0' => '0_5' [7], '0' => '0_6' [8], '0' => '0_7' [9], '0' => '0_8' [10], '0' => '0_9' [11], '0' => '0_10' [12], '0' => '0_11' [13], '0' => '0_12' [14], '0' => '0_13' [15], '0' => '0_14' [16], '0' => '0_15' [17], '0' => '0_16' [18], '0' => '0_17' [19], '0' => '0_18' [20], '0' => '0_19' [21], '0' => '0_20' [22], '0' => '0_21' [23], '0' => '0_22' [24], '0' => '0_23' [25], '0' => '0_24' [26], '0' => '0_25' [27], '0' => '0_26' [28], '0' => '0_27' [29], '0' => '0_28' [30], '0' => '0_29' [31], '0' => '0_30' [32], '0' => '0_31' [33], '0' => '0_32' [34], '0' => '0_33' [35], '0' => '0_34' [36], '0' => '0_35' [37], '0' => '0_36' [38], '0' => '0_37' [39], '0' => '0_38' [40], '0' => '0_39' [41], '0' => '0_40' [42], '0' => '0_41' [43], '0' => '0_42' [44], '0' => '0_43' [45], '0' => '0_44' [46], '0' => '0_45' [47], '0' => '0_46' [48], '0' => '0_47' [49], '0' => '0_48' [50], '0' => '0_49' [51] [... truncated] 


@dgrtwo (owner, author) commented Sep 10, 2019

Hmm, that's strange, since the first column of the Kaggle data was label the last time I checked.

Can you check that there's a label column in your train.csv, and then check that there's a label column in pixels_gathered after running that line?
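
One way that check might look, as a minimal base-R sketch. The helper `has_label_column` and both example data frames are hypothetical stand-ins. In a real session the same test would be applied to `kaggle_data` and `pixels_gathered`.

```r
# Hypothetical helper: does a data frame have a column named "label"?
has_label_column <- function(df) "label" %in% names(df)

# A tiny frame shaped like Kaggle's train.csv (invented values).
kaggle_like <- data.frame(label = c(1, 0), pixel0 = c(0, 0), pixel1 = c(0, 255))

# A headerless pixel file: its first data row is consumed as the header,
# so every column is named after a pixel value instead.
headerless <- read.csv(text = "0,0,0\n0,12,255\n", check.names = FALSE)

print(has_label_column(kaggle_like))
print(has_label_column(headerless))
print(names(headerless))
```

Printing `names()` alongside the check makes a mis-parsed header row easy to spot.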

@mathematicalmichael commented Sep 11, 2019

Oh goodness, thank you so much. I didn't realize it was imperative that I use Kaggle's version of the dataset (I didn't want to sign up just to download it, so I found the dataset elsewhere). I'll give it another go later today once I have stable internet and post an update. I suspect that's exactly the problem. It's annoying that the dataset can't be downloaded from the command line.

It looks like the csv I got has no labels at all. It's just pixel values for each image comma-separated, one image per line (which now explains the "renaming" portion of the stack trace). If you have a suggestion for how to add the requisite label using R after the data is loaded, I would appreciate that (much as I do your prompt reply), as it's been a while since I've written any R myself (these days it's all Python for me).
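
A label-free file cannot supply the labels by itself, so they would have to come from a separate source. As a minimal base-R sketch under that assumption, a headerless pixel file could be reshaped into the layout the gist expects roughly like this. The CSV text, the `labels` vector, and the name `kaggle_style` are all invented for illustration. With readr, `read_csv(path, col_names = FALSE)` plays the same role as `header = FALSE` here.

```r
# Invented stand-in for a headerless file: two "images" of four pixels each.
csv_text <- "0,0,12,255\n0,34,0,0\n"

# header = FALSE keeps the first image from being consumed as column names.
pixels <- read.csv(text = csv_text, header = FALSE)

# Name the columns pixel0, pixel1, ... so the "(\\d+)" regex in the gist
# can pull the pixel index back out.
names(pixels) <- paste0("pixel", seq_len(ncol(pixels)) - 1)

# Labels from some other source (ASSUMPTION: these values are made up).
labels <- c(7, 2)
stopifnot(length(labels) == nrow(pixels))

# Put label first, matching the column order described for Kaggle's train.csv.
kaggle_style <- data.frame(label = labels, pixels)
print(names(kaggle_style))
```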

Once I get the Kaggle dataset downloaded to my computer, do you think it would be appropriate to upload it to a public server I rent and make it accessible via wget?

@mathematicalmichael commented Sep 11, 2019

IT WORKED (the environment I used was sufficient to handle all dependencies)! thanks so much for your help, @dgrtwo

I'm not sure I understand why the figure that gets plotted at the end is indicative of predictive potential by a single pixel. Does it have to do with sharp boundaries? There aren't really comments anywhere to help. What are the four rows/columns representing?

@hot9cups commented Aug 18, 2020

Any update on what @mathematicalmichael asked? I'm still not sure how the figure portrays the predictive potential of a single pixel.
