rshorey/instructions.md

## instructions.md

      
    Lights, Camera, Algorithm

Act out and discuss machine learning algorithms. This activity is from a SRCCON 2018 session led by Jeremy Merrill and Rachel Shorey.
Materials


Index cards
Pens/pencils
Dice with varying numbers of faces (several D10 and one D6 for sure)
Masking tape to mark floor
Paper, easel, marker
Stickers in several colors

Goal

A concrete understanding of what machine learning is by… acting it out. If you have never implemented a machine learning algorithm before, then by the end of this session... you probably still will not be able to implement a machine learning algorithm. The goal is to learn more about machine learning and the issues that come up when it is applied to real-world data. This could be a starting point for digging deeper, could help you with reporting on machine learning, could help you have a better sense of what’s happening with something you use already, or could just be a fun way to end your day.
Intro

We're going to experience two different algorithms. First I want to really quickly define machine learning and talk about why we're doing two algorithms.
So, first. Machine learning is:
Not magic
It's just manipulating numbers to find patterns that might not be intuitively obvious. You can contrast it with regular computer code: with regular code, you tell the computer what to do; with machine learning, you tell the computer how to figure out what to do based on some data.
There are two kinds of machine learning, and we're trying both of them.
Supervised:
This is where you have a bunch of things for which you know the right answer. And then you want to decide for new things what the right answer is based on the ones you already knew about.
We'll act out a simplified version of random forest.
Unsupervised:
This is where you have a bunch of things and you don't know the right answer, but want to learn something interesting about them by finding similarities and differences.
We'll act out a simplified version of k-means.
OKAY LET'S GO! We’re starting with supervised.
Random Forest:

Prep:


~50 index cards with the name of a fruit or vegetable, clearly indicating whether it is a FRUIT or VEGETABLE, with features written on the back.
Furniture moved out of the way
An easel with paper

Recommendation: your algorithm will work better if you have an approximately equal number of fruits and vegetables. We recommend sorting the cards in alternating fruit-veg order so the mix stays balanced even if you have fewer people than cards.
Adversarial examples

If the group is advanced, you might want to include a few adversarial examples to test at the end. We did not do this due to time and crowd restrictions, but if you want to, we might recommend:

Things that are not obviously a fruit or vegetable (maybe a poisonous plant)
Things that are very obviously not a fruit or a vegetable (maybe a baseball cap, or a cup of water)
Something that is a fruit or a vegetable but has most of the data missing

Background

In a random forest, we build an arbitrary number of decision trees [ask someone to explain what a decision tree is?], each to an arbitrary depth. For each tree, at a branching point we select some features at random and pick the best of those to branch on. In this case, we need to keep it pretty simple, so we’re going to build three trees. And we’re going to stop building each tree when we’ve done just 2 levels of splitting. And each time, we’re just going to randomly pick a single feature to split on. Let’s see how it goes!
Everyone’s going to get an index card with a fruit or vegetable on it. It will be labelled “fruit” or “veg”. On the back, you’ll find 6 features, with true/false answers:
Does the food item have more than 30 calories per 100g?
Does the food item cost more than a dollar a pound?
Do you need to peel the item?
Is the item green?
Do you keep it in the fridge?
Does it grow on a tree?
Step 1:

Facilitator 1 will set up to record the results of each tree on a separate sheet of paper on the easel for testing purposes later.
Step 2:

Facilitator 2 will have each data point roll a D10. Anyone rolling a 1 will be put in “test”. Move the test data to the side. Impress on the test data how important it is that they not become contaminated by mingling with the training data. Facilitator 1 will be responsible for babysitting the test data. If the test data seems bored/rowdy, they could be encouraged to perhaps play duck duck goose. (They will probably enjoy watching, though.)
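If you want to see what the D10 roll is doing, here is a minimal Python sketch of the same split. The card names are placeholders (not the real deck), and the seed and the helper name `split_by_d10` are our own choices for illustration.

```python
import random

def split_by_d10(cards, rng):
    """Roll a D10 for each card; a roll of 1 sends the card to the test set."""
    train, test = [], []
    for card in cards:
        roll = rng.randint(1, 10)  # simulate one D10 roll
        if roll == 1:
            test.append(card)
        else:
            train.append(card)
    return train, test

rng = random.Random(2018)  # fixed seed (an arbitrary choice) so the run is repeatable
cards = [f"card_{i}" for i in range(50)]  # placeholder names, not the real deck
train, test = split_by_d10(cards, rng)
print(len(train), "training cards,", len(test), "test cards")
```

On average about one card in ten lands in test, but with only ~50 cards the actual count will bounce around from run to run, just like it does in the room.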
Step 3:

With the entire training group, facilitator 2 will roll a D6 to select the feature we will split on first. The data points who are TRUE on that feature will go right, the data points who are FALSE will go left. Both facilitators should help out with this and confirm everyone knows what is happening for the first round. Facilitator 1 will draw the node of the tree with the selected feature.
Step 4:

For each of the 2 groups created in step 3, facilitator 2 will roll the die again. If the roll selects the feature that was already used in step 3, facilitator 2 will throw it out and roll again. The individuals will split again, and facilitator 1 will record.
Step 5:

At each leaf, facilitator 2 will count the number of fruits and veggies. Facilitator 1 will record. If any leaf is tied, facilitator 2 will split it again using the same procedure as in previous steps.
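Steps 3-5 can be sketched in Python roughly like this. It is a sketch under our own assumptions, not a real random forest library: the eight cards are rows copied from the data table, while the function names, the nested-dict tree layout and the seed are invented for illustration.

```python
import random
from collections import Counter

FEATURES = [">30 cal per 100g", "over $1/pound", "need to peel",
            "green?", "keep in fridge", "grows on a tree"]

# A few rows copied from the data table: (name, feature answers, truth).
CARDS = [
    ("apple",  (True, False, False, False, False, True),  "fruit"),
    ("banana", (True, False, True, False, False, False),  "fruit"),
    ("lime",   (False, False, True, True, False, True),   "fruit"),
    ("grape",  (True, True, False, False, True, False),   "fruit"),
    ("carrot", (False, False, True, False, True, False),  "veg"),
    ("celery", (False, False, False, True, True, False),  "veg"),
    ("kale",   (True, True, False, True, True, False),    "veg"),
    ("potato", (True, False, False, False, False, False), "veg"),
]

def roll_unused_feature(rng, used):
    """Roll a D6 until it lands on a feature not yet used on this path."""
    while True:
        feature = rng.randint(1, 6) - 1
        if feature not in used:
            return feature

def build_tree(cards, rng, used=(), max_depth=2):
    counts = Counter(truth for _, _, truth in cards)
    tied = bool(cards) and counts["fruit"] == counts["veg"]
    out_of_features = len(used) == len(FEATURES)
    # Stop after two levels of splitting, unless the leaf is tied (Step 5).
    if not cards or out_of_features or (len(used) >= max_depth and not tied):
        return {"fruit": counts["fruit"], "veg": counts["veg"]}
    feature = roll_unused_feature(rng, used)
    right = [c for c in cards if c[1][feature]]      # TRUE goes right
    left = [c for c in cards if not c[1][feature]]   # FALSE goes left
    return {
        "feature": FEATURES[feature],
        "false": build_tree(left, rng, used + (feature,), max_depth),
        "true": build_tree(right, rng, used + (feature,), max_depth),
    }

tree = build_tree(CARDS, random.Random(6))  # seed is arbitrary
print(tree)
```

Calling `build_tree` three times with different seeds would give you the three trees from Step 6. Note that a tied leaf can stay tied if its cards have identical answers on every feature; the sketch just gives up when it runs out of features.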
Step 6:

Repeat steps 3-5 two more times, for a total of 3 trees. Facilitator 1 will continue recording and can possibly encourage some cheering/jeering from the likely now bored test set (“TEST IS THE BEST” etc).
Step 7:

Training set is banished. Using the diagrams, facilitator 1 will help one test data point run through the first tree out loud. (If you happen to know your test data and there is someone who's particularly charismatic, pick that person. It goes well if this data point gets really into it.) Then all members of the test set will evaluate themselves, recording a “vote” for each tree. [Note: we tried actually dividing them the way we had the training data, and it got confusing. We would recommend having them each trace through mentally and keep track of how many fruit/veg votes they got, and then just say that out loud.]
Step 8:

Facilitator 1 will have each test data point introduce themself with their name (eg: “Apple”), true category (“Fruit”) and number of votes for “Fruit” vs “Vegetable”. Facilitator 2 (NOTE THIS SWITCH) will record who was categorized correctly. We will record the overall prediction accuracy of the algorithm, and also note if that accuracy was in any way skewed (eg if it just predicts everything is a fruit).
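Here is a small Python sketch of the voting and scoring in Steps 7 and 8. The three trees are made up for illustration (they are not the trees from our session), and the tuple layout for a tree is our own assumption; the four test cards are rows copied from the data table.

```python
from collections import Counter

# Feature order on the card: 0 calories, 1 cost, 2 peel, 3 green, 4 fridge, 5 tree.
# Three invented trees: (feature index, branch if FALSE, branch if TRUE); leaves are labels.
TREES = [
    (3, (5, "veg", "fruit"), (2, "veg", "fruit")),
    (4, (0, "veg", "fruit"), (1, "veg", "fruit")),
    (5, (3, "fruit", "veg"), (1, "fruit", "fruit")),
]

# Test cards copied from the data table: (name, feature answers, truth).
TEST_CARDS = [
    ("apple",    (True, False, False, False, False, True),  "fruit"),
    ("tomato",   (False, True, False, False, False, False), "veg"),
    ("avocado",  (True, True, True, True, False, True),     "veg"),
    ("cucumber", (False, False, False, True, True, False),  "veg"),
]

def vote(tree, answers):
    """Trace one card through one tree and return the leaf label."""
    while not isinstance(tree, str):
        feature, if_false, if_true = tree
        tree = if_true if answers[feature] else if_false
    return tree

def classify(trees, answers):
    """Collect one vote per tree; the majority label wins."""
    votes = Counter(vote(tree, answers) for tree in trees)
    return votes.most_common(1)[0][0], votes

results = []
for name, answers, truth in TEST_CARDS:
    predicted, votes = classify(TREES, answers)
    results.append((name, truth, predicted))
    print(f"{name}: truth={truth}, votes={dict(votes)}")

accuracy = sum(truth == predicted for _, truth, predicted in results) / len(results)
predicted_counts = Counter(predicted for _, _, predicted in results)
print(f"accuracy={accuracy:.2f}, predictions={dict(predicted_counts)}")
```

With these invented trees, three of the four cards come out right and avocado gets voted a fruit. Printing the prediction counts next to the accuracy is a cheap way to spot the "everything is a fruit" kind of skew.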
Step 9:

If using adversarial examples, facilitator 1 will evaluate the adversarial examples out loud. Facilitator 2 will record these as well.
K-means:

Prep:


~50 index cards with the name of a fruit or vegetable including several adversarial examples (eg water, mushroom and ham sandwich)
Furniture moved out of the way
A grid roughly taped on the floor

Background

In k-means, we’re going to spread our data out based on two dimensions. Then we’re going to randomly place some cluster centroids on the board. Each data point will “choose” the centroid closest to them. The centroids will then move, to be in the approximate center of the points who have “chosen” them. Then we’ll re-choose centroids. Then the centroids will move again. We’ll keep doing this until no one moves.
Step 1:

Facilitator 1 selects 3-4 people to be centroids from the group, depending on group size. Facilitator 1 huddles with centroids to discuss their role. As part of this, each centroid will be assigned a sticker color, and will get a set of those stickers.
Step 2:

Facilitator 2 gives each data point a fresh index card with the name of the fruit or vegetable on it, and directs them to stand in the most appropriate place on the grid. The x axis is “sweetness” and the y axis is “size”. Give some examples, eg:

Up here, at the top, we’d put the very largest fruits or vegetables, such as one of those giant pumpkins
Down here at the bottom, we’d imagine things like green peas or blueberries
Over to the left, we’d want things that aren’t sweet at all, like maybe celery?
And to the right, like, literal sugar cane.
Use your own judgement!

Step 3:

When the data points are all placed, facilitator 1 will have each centroid roll a D10 twice, once for sweetness and once for size. Help them place themselves on the grid, and put a sticker down so they can find their spot again if they move.
Step 4:

Direct data points to identify their nearest centroid. If they feel they are equidistant, one of the facilitators will be the judge [people were good at deciding and did so reasonably in our session]. Each data point will collect a sticker from the nearest centroid.
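In code, "identify your nearest centroid" is a straight-line distance comparison. A minimal Python sketch, assuming made-up grid positions written as (sweetness, size) on a 1-10 scale; the sticker colors and coordinates are hypothetical.

```python
import math

def nearest_centroid(point, centroids):
    """Return the color of the centroid closest to point (straight-line distance)."""
    return min(centroids, key=lambda color: math.dist(point, centroids[color]))

# Hypothetical positions as (sweetness, size), each on a 1-10 scale.
centroids = {"red": (2, 8), "blue": (7, 3), "yellow": (9, 9)}
points = {
    "celery": (1, 4),
    "blueberry": (7, 1),
    "giant pumpkin": (3, 10),
    "sugar cane": (10, 7),
}

stickers = {name: nearest_centroid(pos, centroids) for name, pos in points.items()}
print(stickers)
```

Where a facilitator judges ties in the room, `min` here just keeps whichever tied centroid comes first in the dictionary.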
Step 5:

One by one, we will cycle through the centroids. For each centroid, ask the people who have that color sticker to raise their hands. The centroid must now estimate the midpoint of their data points and move there. They should put a new sticker on the ground.
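The "estimate the midpoint" move is just an average of the members' positions. A tiny Python sketch, with hypothetical coordinates and a helper name (`move_centroid`) of our own; leaving an unchosen centroid where it is is also our assumption, since the steps above don't say what to do in that case.

```python
from statistics import mean

def move_centroid(member_positions, current):
    """Move to the mean position of the members; stay put if nobody chose this centroid."""
    if not member_positions:
        return current
    xs, ys = zip(*member_positions)
    return (mean(xs), mean(ys))

# Two hypothetical data points that picked the red centroid, which stood at (2, 8).
new_red = move_centroid([(1, 4), (3, 10)], current=(2, 8))
print(new_red)
```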
Step 6:

Each data point should again identify the nearest centroid, and if it has changed, get a new sticker color. Remember which is the newest one. Facilitators ask if anyone changed sticker color. If so, repeat steps 5 and 6 until no one changes. [We stopped after 3 moves despite the fact that there was still movement because people got the idea and the clusters seemed pretty solid].
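Putting steps 3-6 together, the whole activity is a loop that stops when no one changes sticker color. Here is a rough Python sketch under our own assumptions: the positions are made up, `run_kmeans` is a name we invented, and `max_rounds` is a cap we added for the same reason we stopped after 3 moves in the session.

```python
import math
import random

def run_kmeans(points, k, rng, max_rounds=10):
    # Step 3: each centroid rolls a D10 twice for its starting spot.
    centroids = [(rng.randint(1, 10), rng.randint(1, 10)) for _ in range(k)]
    stickers = {}
    for round_number in range(1, max_rounds + 1):
        # Steps 4 and 6: every data point picks its nearest centroid.
        new_stickers = {
            name: min(range(k), key=lambda i: math.dist(pos, centroids[i]))
            for name, pos in points.items()
        }
        if new_stickers == stickers:
            return centroids, stickers, round_number  # nobody changed color
        stickers = new_stickers
        # Step 5: each centroid moves to the middle of the points that chose it.
        for i in range(k):
            members = [points[name] for name, c in stickers.items() if c == i]
            if members:  # a centroid nobody chose stays where it is
                centroids[i] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, stickers, max_rounds

# Hypothetical positions as (sweetness, size), each on a 1-10 scale.
points = {
    "celery": (1, 4), "green peas": (3, 1), "blueberry": (7, 1),
    "grape": (8, 2), "giant pumpkin": (3, 10), "watermelon": (7, 9),
    "sugar cane": (10, 7), "radish": (1, 2),
}
centroids, stickers, rounds = run_kmeans(points, k=3, rng=random.Random(10))
print(rounds, stickers)
```

Different seeds (different dice rolls) can give different final clusters, which is a nice thing to bring up in the discussion afterwards.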
Step 7:

Ask each cluster to huddle and talk about what its members have in common, and whether there are any outliers. Then ask each centroid to report back about what is in the cluster, what kinds of things they have in common (eg, are all the citrus fruits there?) and what outliers they found.
Afterwards, discussion, with prompts.

Example related:


When does our Random Forest classifier do a bad job of classifying/grouping? [in our case, the random forest was 100% successful(!) so we kind of flailed a little bit here]
Our "training" data had a bunch of clear answers. When is your data not going to be so clear? (i.e. barriers to getting real data)
What happens if the "right answers" are actually wrong?
Why did we test, anyway?
Did you disagree with any of the provided data or classifications in Random Forest? [Avocado was VERY upset in our random forest. Would recommend finding Avocado for a person-on-the-street interview]

Application related:


Better real examples of when to use ML?
Have you had times it's gone badly? (Talk about Breaking the Black Box).
Talk about the effects of randomly choosing a training set and how things could differ if different people end up in the training set. Consider whether the result changes when any of these is in or out of the training set: intuitive edge cases (like tomatoes or Venus flytraps), abstract nouns, verbs, or something that makes sense but has missing data (a prehistoric animal about which we don't know all the answers?).
If you are reporting on machine learning, what questions should you definitely ask?
If you were to try to explain why either of the models, once it's in production, made a particular decision, what would you say? [be sure to mention complete unexplainability of deep learning]


fruit	>30 cal per 100g	over $1/pound	need to peel	green?	keep in fridge	grows on a tree	truth
apple	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	fruit
apricot	TRUE	TRUE	FALSE	FALSE	FALSE	TRUE	fruit
banana	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	fruit
blueberry	TRUE	TRUE	FALSE	FALSE	TRUE	FALSE	fruit
blackberry	FALSE	TRUE	FALSE	FALSE	TRUE	FALSE	fruit
cantaloupe	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	fruit
coconut	TRUE	TRUE	TRUE	FALSE	FALSE	TRUE	fruit
coffee bean	TRUE	TRUE	TRUE	FALSE	FALSE	FALSE	fruit
durian	TRUE	TRUE	TRUE	TRUE	FALSE	TRUE	fruit
grape	TRUE	TRUE	FALSE	FALSE	TRUE	FALSE	fruit
grapefruit	FALSE	FALSE	TRUE	FALSE	FALSE	TRUE	fruit
lime	FALSE	FALSE	TRUE	TRUE	FALSE	TRUE	fruit
mango	TRUE	FALSE	TRUE	FALSE	FALSE	TRUE	fruit
orange	TRUE	FALSE	TRUE	FALSE	FALSE	TRUE	fruit
pineapple	TRUE	TRUE	TRUE	FALSE	FALSE	FALSE	fruit
peach	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	fruit
pear	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	fruit
plum	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	fruit
quince	TRUE	TRUE	FALSE	FALSE	FALSE	TRUE	fruit
raspberry	FALSE	TRUE	FALSE	FALSE	TRUE	FALSE	fruit
rhubarb	FALSE	TRUE	FALSE	FALSE	TRUE	FALSE	fruit
strawberry	FALSE	TRUE	FALSE	FALSE	TRUE	FALSE	fruit
sugar cane	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	fruit
tamarind	TRUE	TRUE	TRUE	FALSE	FALSE	TRUE	fruit
watermelon	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	fruit
avocado	TRUE	TRUE	TRUE	TRUE	FALSE	TRUE	veg
artichoke	TRUE	TRUE	FALSE	TRUE	FALSE	FALSE	veg
asparagus	FALSE	TRUE	FALSE	TRUE	TRUE	FALSE	veg
brussels sprout	FALSE	TRUE	FALSE	TRUE	TRUE	FALSE	veg
green beans	TRUE	FALSE	FALSE	TRUE	TRUE	FALSE	veg
beet	TRUE	FALSE	TRUE	FALSE	TRUE	FALSE	veg
carrot	FALSE	FALSE	TRUE	FALSE	TRUE	FALSE	veg
corn	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	veg
cabbage	FALSE	FALSE	FALSE	TRUE	TRUE	FALSE	veg
celery	FALSE	FALSE	FALSE	TRUE	TRUE	FALSE	veg
cucumber	FALSE	FALSE	FALSE	TRUE	TRUE	FALSE	veg
eggplant	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	veg
garlic	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	veg
kale	TRUE	TRUE	FALSE	TRUE	TRUE	FALSE	veg
lettuce	FALSE	TRUE	FALSE	TRUE	TRUE	FALSE	veg
mushroom	FALSE	TRUE	FALSE	FALSE	TRUE	FALSE	veg
onion	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	veg
potato	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	veg
green peas	TRUE	FALSE	FALSE	TRUE	TRUE	FALSE	veg
pumpkin	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	veg
radish	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	veg
spinach	FALSE	TRUE	FALSE	TRUE	TRUE	FALSE	veg
sweet potato	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	veg
tomato	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	veg
zucchini	FALSE	FALSE	FALSE	TRUE	TRUE	FALSE	veg