Lights, Camera, Algorithm

Act out and discuss machine learning algorithms. This activity is from a SRCCON 2018 session led by Jeremy Merrill and Rachel Shorey.

Materials

  • Index cards
  • Pens or pencils
  • Dice with varying numbers of faces (several D10 and one D6 for sure)
  • Masking tape to mark floor
  • Paper, easel, marker
  • Stickers in several colors

Goal

A concrete understanding of what machine learning is by… acting it out. If you have never implemented a machine learning algorithm before, then by the end of this session… you probably still will not be able to implement one. The goal is to learn more about machine learning and the issues that come up when it is applied to real-world data. This could be a starting point for digging deeper, could help you with reporting on machine learning, could help you get a better sense of what’s happening in something you use already, or could just be a fun way to end your day.

Intro

We're going to experience two different algorithms. First, let's really quickly define machine learning and talk about why we're doing two algorithms.

Machine learning is:

  • Not magic. It's just manipulating numbers to find patterns that might not be intuitively obvious.
  • Different from regular computer code. With regular code, you tell the computer what to do; with machine learning, you tell the computer how to figure out what to do based on some data.

There are two kinds of machine learning, and we're trying both of them:

  • Supervised: you have a bunch of things for which you know the right answer, and you want to decide the right answer for new things based on the ones you already knew about. (We'll do a simplified version of random forest.)
  • Unsupervised: you have a bunch of things and you don't know the right answer, but you want to learn something interesting about them by finding similarities and differences. (We'll do a simplified version of k-means.)

OKAY, LET'S GO! We're starting with supervised.

Random Forest:

Prep:

  • ~50 index cards with the name of a fruit or vegetable, clearly indicating whether it is a FRUIT or VEGETABLE, with features written on the back.
  • Furniture moved out of the way
  • An easel with paper

Recommendation: your algorithm will work better if you have an approximately equal number of fruits and vegetables, so if you have fewer people, we recommend stacking the cards in alternating fruit-vegetable order before handing them out.

Adversarial examples

If the group is advanced, you might want to include a few adversarial examples to test at the end. We did not do this due to time and crowd restrictions, but if you want to, we might recommend:

  • Things that are not obviously a fruit or vegetable (maybe a poisonous plant)
  • Things that are very obviously not a fruit or a vegetable (maybe a baseball cap, or a cup of water)
  • Something that is a fruit or a vegetable but has most of the data missing

Background

In a random forest, we build an arbitrary number of decision trees [ask someone to explain what a decision tree is? — essentially a flowchart of yes/no questions that ends in a prediction], each to an arbitrary depth. At each branching point in a tree, we select some features at random and pick the best of those to branch on. In this session we need to keep it pretty simple, so we’re going to build three trees, we’re going to stop building each tree after just 2 levels of splitting, and each time we’re just going to randomly pick a single feature to split on. Let’s see how it goes!
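The simplified tree-building procedure can be sketched in code. This is an illustrative sketch, not anything used in the session; the feature names and the card-as-dict layout are assumptions.

```python
import random

# Hypothetical feature names matching the six questions on the cards.
FEATURES = ["over_30_cal", "over_dollar_lb", "needs_peeling",
            "is_green", "in_fridge", "grows_on_tree"]

def build_tree(points, depth=2, rng=random):
    """Build one tree of the simplified forest: at each node, pick a
    single feature at random and split on it, stopping after `depth`
    levels (the session uses depth 2, and builds three such trees)."""
    if depth == 0 or not points:
        # Leaf: the majority label among the points that landed here.
        labels = [p["label"] for p in points]
        return max(set(labels), key=labels.count) if labels else None
    feature = rng.choice(FEATURES)
    return {
        "feature": feature,
        "true": build_tree([p for p in points if p[feature]], depth - 1, rng),
        "false": build_tree([p for p in points if not p[feature]], depth - 1, rng),
    }
```

Note that a real random forest picks the best of several randomly chosen features at each split; the session (and this sketch) just takes one at random.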

Everyone’s going to get an index card with a fruit or vegetable on it. It will be labelled “fruit” or “veg”. On the back, you’ll find 6 features, with true/false answers:

  • Does the food item have more than 30 calories per 100g?
  • Does the food item cost more than a dollar a pound?
  • Do you need to peel the item?
  • Is the item green?
  • Do you keep it in the fridge?
  • Does it grow on a tree?
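A single card might be encoded like this (a hypothetical representation; the session uses physical cards, and the key names are illustrative):

```python
# One index card: name and label on the front, six true/false
# features on the back.
card = {
    "name": "apple",
    "label": "fruit",
    "over_30_cal": True,      # more than 30 calories per 100g?
    "over_dollar_lb": False,  # costs more than a dollar a pound?
    "needs_peeling": False,
    "is_green": False,
    "in_fridge": False,
    "grows_on_tree": True,
}
```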

Step 1:

Facilitator 1 will set up to record the results of each tree on a separate sheet of paper on the easel for testing purposes later.

Step 2:

Facilitator 2 will have each data point roll a D10. Anyone rolling a 1 will be put in “test”. Move the test data to the side. Impress on the test data the importance that they not become contaminated by mingling with the training data. Facilitator 1 will be responsible for babysitting the test data. If the test data seems bored/rowdy, they could be encouraged to perhaps play duck duck goose. (They will probably enjoy watching though).
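The dice-based split in this step amounts to holding out roughly 10% of the data. A sketch (the function name and card representation are assumptions):

```python
import random

def split_train_test(cards, rng=None):
    """Each card 'rolls a D10'; a roll of 1 sends it to the test set,
    so roughly 10% of the data is held out, as in Step 2."""
    rng = rng or random.Random()
    train, test = [], []
    for card in cards:
        (test if rng.randint(1, 10) == 1 else train).append(card)
    return train, test
```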

Step 3:

With the entire training group, facilitator 2 will roll a D6 to select the feature we will split on first. The data points who are TRUE on that feature will go right; the data points who are FALSE will go left. Both facilitators should help out with this and confirm everyone knows what is happening for the first round. Facilitator 1 will draw the node of the tree with the selected feature.

Step 4:

For each of the 2 groups created in step 3, facilitator 2 will roll the die again. If the roll selects the same feature as before, facilitator 2 will throw it out and roll again. The individuals will split again, and facilitator 1 will record.

Step 5:

At each leaf, facilitator 2 will count the number of fruits and veggies. Facilitator 1 will record. If any node is tied, facilitator 2 will split the node again using the same procedure as in previous steps.
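The count-and-break-ties rule at each leaf can be sketched like this (a hypothetical helper; labels are assumed to be strings like "fruit" and "veg"):

```python
from collections import Counter

def leaf_vote(labels):
    """Count fruits vs. veggies at a leaf. A tie means the node
    should be split again (Step 5); we signal that with None.
    Assumes the leaf is non-empty."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tied: split this node again
    return counts[0][0]
```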

Step 6:

Repeat steps 3-5 two more times, for a total of 3 trees. Facilitator 1 will continue recording and can possibly encourage some cheering/jeering from the likely now bored test set (“TEST IS THE BEST” etc).

Step 7:

Training set is banished. Using the diagrams, facilitator 1 will help one test data point run through the first tree out loud (if you happen to know your test data and there is someone who's particularly charismatic, pick that person, it goes well if this datapoint gets really into it). Then all members of the test set will evaluate themselves, recording a “vote” for each tree. [Note: we tried actually dividing them the way we had the training data, and it got confusing. We would recommend having them each trace through mentally and keep track of how many fruit/veg votes they got, and then just say that out loud]

Step 8:

Facilitator 1 will have each test data point introduce themselves with their name (eg: “Apple”), true category (“Fruit”), and number of votes for “Fruit” vs “Vegetable”. Facilitator 2 (NOTE THIS SWITCH) will record who was categorized correctly. We will record the overall prediction accuracy of the algorithm, and also note if that accuracy was skewed in any way (eg if it just predicts everything is a fruit).
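Steps 7–8 amount to running each test card through every tree and tallying votes. A sketch, assuming trees are nested dicts with "feature"/"true"/"false" keys ending in leaf labels (an assumed layout, not anything from the session):

```python
def predict(tree, card):
    """Walk one tree: follow the TRUE/FALSE branches to a leaf label."""
    while isinstance(tree, dict):
        tree = tree["true"] if card[tree["feature"]] else tree["false"]
    return tree

def forest_predict(trees, card):
    """Each tree casts one vote; the majority label wins. (Ties are
    resolved arbitrarily here, unlike the in-room re-split.)"""
    votes = [predict(t, card) for t in trees]
    return max(set(votes), key=votes.count)

def accuracy(trees, test_cards):
    """Fraction of test cards the forest categorizes correctly."""
    hits = sum(forest_predict(trees, c) == c["label"] for c in test_cards)
    return hits / len(test_cards)
```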

Step 9:

If using adversarial examples, facilitator 1 will evaluate the adversarial examples out loud. Facilitator 2 will record these as well.

K-means:

Prep:

  • ~50 index cards with the name of a fruit or vegetable, including several adversarial examples (eg water, mushroom, and ham sandwich)
  • Furniture moved out of the way
  • A grid roughly taped on the floor

Background

In k-means, we’re going to spread our data out based on two dimensions. Then we’re going to randomly place some cluster centroids on the board. Each data point will “choose” the centroid closest to them. The centroids will then move, to be in the approximate center of the points who have “chosen” them. Then we’ll re-choose centroids. Then the centroids will move again. We’ll keep doing this until no one moves.
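The loop described above can be sketched in code. This is an illustrative sketch, assuming points are (sweetness, size) pairs on a 0–10 grid; none of this comes from the session materials.

```python
import random

def kmeans(points, k=3, rng=None, max_iters=100):
    """2-D k-means as acted out: place k random centroids, let each
    point choose its nearest centroid, move each centroid to the mean
    of its points, and repeat until no point changes its choice."""
    rng = rng or random.Random(0)
    centroids = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(k)]
    assignment = None
    for _ in range(max_iters):
        # Each point "chooses" its nearest centroid (squared distance).
        new = [min(range(k),
                   key=lambda i: (px - centroids[i][0]) ** 2
                               + (py - centroids[i][1]) ** 2)
               for px, py in points]
        if new == assignment:
            break  # no one moved: converged
        assignment = new
        # Each centroid moves to the mean of the points that chose it.
        for i in range(k):
            mine = [p for p, a in zip(points, assignment) if a == i]
            if mine:  # a centroid with no points stays put
                centroids[i] = (sum(x for x, _ in mine) / len(mine),
                                sum(y for _, y in mine) / len(mine))
    return assignment, centroids
```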

Step 1:

Facilitator 1 selects 3-4 people to be centroids from the group, depending on group size. Facilitator 1 huddles with centroids to discuss their role. As part of this, each centroid will be assigned a sticker color, and will get a set of those stickers.

Step 2:

Facilitator 2 gives each data point a fresh index card with the name of the fruit or vegetable on it, and directs them to stand in the most appropriate place on the grid. The X axis is “sweetness” and the y axis is “size”. Give some examples, eg:

  • Up here, at the top, we’d put the very largest fruits or vegetables, such as one of those giant pumpkins
  • Down here at the bottom, we’d imagine things like green peas or blueberries
  • Over to the left, we’d want things that aren’t sweet at all, like maybe celery?
  • And to the right, like, literal sugar cane.
  • Use your own judgement!

Step 3:

When the data points are all placed, facilitator 1 will have each centroid roll a D10 twice, once for sweetness and once for size. Help them place themselves on the grid, and put a sticker down so they can find their spot again if they move.

Step 4:

Direct data points to identify their nearest centroid. If they feel they are equidistant, one of the facilitators will be the judge [people were good at deciding and did so reasonably in our session]. Each data point will collect a sticker from the nearest centroid.

Step 5:

One by one, we will cycle through the centroids. For each centroid, ask the people who have that color sticker to raise their hands. The centroid must now estimate the midpoint of their data points and move there. They should put a new sticker on the ground.
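The centroid's move in this step is just the average position of its points; for example (a hypothetical helper, with points as (sweetness, size) pairs):

```python
def new_centroid(cluster):
    """Move to the mean (sweetness, size) of the points that chose
    this centroid, per Step 5. Assumes the cluster is non-empty."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n,
            sum(y for _, y in cluster) / n)
```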

Step 6:

Each data point should again identify the nearest centroid and, if it has changed, get a new sticker color. Remember which sticker is the newest one. Facilitators ask if anyone changed sticker color. If so, repeat steps 5 and 6 until no one changes. [We stopped after 3 moves, even though there was still movement, because people got the idea and the clusters seemed pretty solid].

Step 7:

For each centroid, ask them to huddle and talk about what they have in common, and whether there are any outliers. Then ask each centroid to report back about what is in the cluster, what kinds of things they have in common (eg, are all the citrus fruits there?) and what outliers they found.

Afterwards, lead a discussion using these prompts.

Example-related:

  • When does our Random Forest classifier do a bad job of classifying/grouping? [in our case, the random forest was 100% successful(!) so we kind of flailed a little bit here]
  • Our "training" data had a bunch of clear answers. When is your data not going to be so clear? (i.e. barriers to getting real data)
  • What happens if the "right answers" are actually wrong?
  • Why did we test, anyway?
  • Did you disagree with any of the provided data or classifications in Random Forest? [Avocado was VERY upset in our random forest. Would recommend finding Avocado for a person-on-the-street interview]

Application-related:

  • Better real examples of when to use ML?
  • Have you had times it's gone badly? (Talk about Breaking the Black Box.)
  • Talk about the effects of randomly choosing a training set and how things could differ if different people end up in the training set. Consider whether having an unintuitive edge case in or out of the training set changes the result: tomatoes, Venus flytraps, abstract nouns, verbs, or something that makes sense but has missing data (a prehistoric animal about which we don't know all the answers?).
  • If you are reporting on machine learning, what questions should you definitely ask?
  • If you were trying to explain why either of these models, once in production, made a particular decision, what would you say? [be sure to mention how difficult deep learning decisions are to explain]
Training data

| item | >30 cal per 100g | over $1/pound | need to peel | green? | keep in fridge | grows on a tree | truth |
| --- | --- | --- | --- | --- | --- | --- | --- |
| apple | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | fruit |
| apricot | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | fruit |
| banana | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | fruit |
| blueberry | TRUE | TRUE | FALSE | FALSE | TRUE | FALSE | fruit |
| blackberry | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | fruit |
| cantaloupe | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | fruit |
| coconut | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | fruit |
| coffee bean | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | fruit |
| durian | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | fruit |
| grape | TRUE | TRUE | FALSE | FALSE | TRUE | FALSE | fruit |
| grapefruit | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | fruit |
| lime | FALSE | FALSE | TRUE | TRUE | FALSE | TRUE | fruit |
| mango | TRUE | FALSE | TRUE | FALSE | FALSE | TRUE | fruit |
| orange | TRUE | FALSE | TRUE | FALSE | FALSE | TRUE | fruit |
| pineapple | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | fruit |
| peach | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | fruit |
| pear | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | fruit |
| plum | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | fruit |
| quince | TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | fruit |
| raspberry | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | fruit |
| rhubarb | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | fruit |
| strawberry | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | fruit |
| sugar cane | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | fruit |
| tamarind | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | fruit |
| watermelon | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | fruit |
| avocado | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | veg |
| artichoke | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | veg |
| asparagus | FALSE | TRUE | FALSE | TRUE | TRUE | FALSE | veg |
| brussels sprout | FALSE | TRUE | FALSE | TRUE | TRUE | FALSE | veg |
| green beans | TRUE | FALSE | FALSE | TRUE | TRUE | FALSE | veg |
| beet | TRUE | FALSE | TRUE | FALSE | TRUE | FALSE | veg |
| carrot | FALSE | FALSE | TRUE | FALSE | TRUE | FALSE | veg |
| corn | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | veg |
| cabbage | FALSE | FALSE | FALSE | TRUE | TRUE | FALSE | veg |
| celery | FALSE | FALSE | FALSE | TRUE | TRUE | FALSE | veg |
| cucumber | FALSE | FALSE | FALSE | TRUE | TRUE | FALSE | veg |
| eggplant | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | veg |
| garlic | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | veg |
| kale | TRUE | TRUE | FALSE | TRUE | TRUE | FALSE | veg |
| lettuce | FALSE | TRUE | FALSE | TRUE | TRUE | FALSE | veg |
| mushroom | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | veg |
| onion | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | veg |
| potato | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | veg |
| green peas | TRUE | FALSE | FALSE | TRUE | TRUE | FALSE | veg |
| pumpkin | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | veg |
| radish | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | veg |
| spinach | FALSE | TRUE | FALSE | TRUE | TRUE | FALSE | veg |
| sweet potato | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | veg |
| tomato | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | veg |
| zucchini | FALSE | FALSE | FALSE | TRUE | TRUE | FALSE | veg |
rshorey commented Aug 3, 2018: Thanks to Brent Jones, who turned the random forest training data into nametags for much easier use.