Mock interview scripts and rubrics for the Data Science program. These scripts are intended for interviewers only; please do not share this link.

Mock Interview 1: Experimentation

After completing Unit 1, students will take an interview focused on experimentation. The goal of this interview is to evaluate problem solving in the context of experimentation. Students should be able to design a proper A/B test and discuss how they would evaluate it.

We use a generic coffee shop as the setup for this problem. The general form of the interview is as follows, though be prepared to adapt to the path the student chooses to pursue.

Afterwards, write up comments and ways to improve and submit them via the Typeform linked in your dashboard.

Introduce yourself first! Ask if they have any questions before you begin, and then start the interview.

Introductory question:

You run a coffee shop and you're looking to increase business. The idea you come up with is to change the sign in front of your store. However, you're not just some typical coffee shop owner: you want to be data driven. How would you use data to evaluate the new sign's effect on your business?

Stages of the Answer:

Let students answer the question as they see fit, and do not create too much scaffolding for them right at the start. Part of the task is for them to build that outline and then fill it in with the details.

However, here is a rough outline of what is typically present in a good answer, along with some potential follow-up questions. Following the flow of how experimentation actually works is key to developing a logical plan and outline through the interview, so too much deviation from this order may cause confusion (meaning, don't start with evaluating a t-test and then outline a metric to track). Throughout the interview, ask why, and ask them to explain any tools or techniques they want to use.

  • What should we do?

    • Set up an experiment or A/B test

      • Why?
    • A period with sign A, then a period with sign B

      • Make sure that the only difference is the sign, making for a good controlled experiment
  • How long do you run the experiment?

    • Most suggest a week or a month

    • Why?

      • Control for seasonality
      • Sample size
    • If people don't include an A test period (wanting to use old data instead), would they use all the old data or just a time period similar to the test period?

      • What happens if they compare against years of data?
    • Can you be statistically rigorous here?

  • What are the metrics you’d want to look at?

    • Usually sales or number of customers

      • They have to make sure they’re not just calculating a single number
    • Would you look at any other metrics?

      • Why look at multiple metrics?
      • Other metrics students could mention:
        • Visitors
        • Sales per customer
  • Evaluate your experiment

    • How would you do it?

      • Usually t-test

        • Why?
      • Walk through interpretation

        • The p-value should get defined here; if they don't define it, ask. (A minimal t-test sketch appears after this outline.)
    • What if the results are significant? What do you do next?

    • What if they’re not?

      • The idea that it may take some time for an effect to be seen is a good thing for students to bring up, and it also makes a decent follow-up.
  • Extensions - Assuming the student gets through these problems and outlines a robust experiment, here are potential extensions or follow-up questions.

    • How would it change if it were a national chain of coffee shops?
    • How is this experiment different if the change is to price rather than the sign?
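
For the interviewer's reference, here is a minimal sketch of the evaluation a student might describe, assuming daily sales totals were collected during each sign period. All numbers and names below are hypothetical, not part of the script:

```python
from scipy import stats

# Hypothetical daily sales totals from the sign A and sign B periods.
sales_sign_a = [412.50, 398.00, 455.25, 430.10, 401.75, 420.00, 389.90]
sales_sign_b = [455.00, 478.25, 431.50, 490.00, 467.75, 445.10, 482.60]

# Welch's t-test: is the difference in mean daily sales bigger than chance alone would suggest?
t_stat, p_value = stats.ttest_ind(sales_sign_a, sales_sign_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# A small p-value (say, below 0.05) means a difference this large would be unlikely under
# the null hypothesis of equal means; it does not by itself prove the sign caused the change.
```
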
| Objective | Rate 1 | Rate 3 | Rate 5 |
| --- | --- | --- | --- |
| **Features** | | | |
| Identifying Features | The student suggests nonsense features | The student suggests one or two features, and does not focus on usage behaviors | The student suggests several features across behavioral and demographic categories, outlining strong logic for each |
| Engineering Features | The student does not want to engineer new features | The student engineers one feature with questionable efficacy | The student engineers many features that creatively attend to the concern of churn |
| **Model Selection** | | | |
| Which Model | The student proposes either only a single model or a non-modeling approach | The student presents a few models but may fail to detail their advantages | The student suggests several models with a thorough explanation of their strengths and weaknesses when pressed |
| Random Forest | The student either incorrectly or incompletely explains random forest | The student explains what random forest is without noting legitimate advantages or disadvantages of the technique | The student presents a thorough explanation of random forest, noting features like parallelizability |
| Weaker Models | The student either does not know or does not discredit weaker models like Naive Bayes or KNN in this context | The student suggests these models will not work well but does not explain why | The student acknowledges and explains these models' weakness at incorporating complex trends observed in data |
| Evaluation | The student does not correctly evaluate their model | The student indicates some form of correct validation, but it is narrow and/or incomplete (for example, suggesting one metric for evaluation but not mentioning cross validation or explaining that metric) | The student provides a thorough and varied evaluation strategy for their model and puts models in context with each other, weighing various kinds of costs and benefits |
| Small Sample Size | When offered a dataset with a small sample size, the student proposes the same approach | The student proposes some changes to modeling structure but struggles with the reasons why something like neural networks won't work with small samples | The student accurately acknowledges the advantages of certain models in small sample sizes |
| New Product | The student does not adjust their approach | The student proposes an analytic or modeling solution that does not fully utilize the information that is available | The student elegantly combines analytics and modeling to leverage the small size of the dataset but also the potential value of machine learning |
| **Other** | | | |
| K-Means | The student does not know what k-means is beyond basic facts like it "does grouping" | The student talks about k-means as unsupervised and mentions finding clusters, but does not go into how those clusters are found or defined | The student describes k-means clearly, mentioning key features like convergence, distance to centroid, and the k centroids |
| K-Means Failure | The student says k-means cannot fail | The student works through thinking about differently shaped clusters but cannot see a failure rule | The student sees linear separability as key to k-means's success |
| Supervised vs. Unsupervised | The student does not know what these things are | An "outcome vs. no outcome" kind of explanation | The student understands and expresses the predictive, observed nature of supervised learning vs. the associative nature of much of unsupervised learning |
| Bias-Variance Tradeoff | The student does not know what these things are | Some mention of what bias and variance are, but not how they relate | Explains a clear relationship, including how model complexity trades bias against variance |

Mock Interview 2: Modeling

After completing Units 2-4, students will take an interview focused on modeling. The goal of this interview is to evaluate problem solving in the context of supervised and unsupervised learning. Students should be able to propose a proper model for a given problem and discuss how they would evaluate it.

We use Twitter's user problem for this interview. The general form of the interview is as follows, though be prepared to adapt to the path the student chooses to pursue.

Afterwards, write up comments and ways to improve and submit them via the Typeform linked in your dashboard.

Introduce yourself first! Ask if they have any questions before you begin, and then start the interview.

Intro Questions:

  • What's the difference between supervised and unsupervised learning?
  • Can you explain the bias-variance tradeoff? (See the sketch below.)
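
If it helps to calibrate answers, a strong response ties bias and variance to model complexity. Here is a minimal illustration on synthetic data (everything in it is hypothetical and for the interviewer's reference only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy nonlinear data: degree 1 underfits (high bias), degree 15 overfits (high variance).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 3, 80)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(scale=0.3, size=80)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:>2}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```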

Case Study: Twitter has a user problem. Specifically, we’re starting to lose users faster than we’re adding them. How can you help us solve this problem?

First approach: billions of rows of data on users and thousands of columns: how many people they follow, how often they visit, whether they have the mobile app, and so on. Basically everything you'd want; if we don't track it, imagine that we could.

How would you approach building a model to help with our user problem?

Key questions

  1. What should we predict?

    1. Break it into two problems first.

      1. User churn
      2. User growth
    2. User Churn is the real problem here (push against the marketing problem, which is more complex and really needs time series work…)

  2. How do we predict user churn?

    1. Feature engineering

      1. Some logic around interesting and valuable features…
        1. Last sign on
        2. Frequency of use
    2. How do you handle the volume of data?

      1. Select columns via intuition, correlation
      2. PCA for feature reduction
      3. Can try subsampling
    3. What kinds of models would you want to try?

      1. RF parallelizes well and is fairly robust to overfitting
      2. Boosted models will be slow but can perform well
      3. If you value explanatory power can try a linear model
        1. That comes with much more concern about feature selection…
        2. Note that feature selection should prioritize variables we may be able to influence in some way
    4. Explain random forest (see the pipeline sketch after this outline)

    5. How do you evaluate?

      1. Cross Validation
      2. Validation sample
      3. Test in wild
    6. What if it works?

      1. This is really about taking something from being a model and driving some kind of useful outcomes from it. It’s great that we’ve modeled churn, but it doesn’t matter if we just stop there.
      2. Things like
        1. Finding influenceable variables and affecting them
        2. Targeting different marketing
    7. What if it doesn’t?

  3. What if we wanted to do it on signup for a single market that was experiencing distress?

    1. This is a much smaller dataset

      1. Thousands of rows
      2. A dozen features
    2. So, the key question then is how do you deal with a smaller dataset?

      1. Boosting or NN is likely a challenge…
      2. No need to do dimensionality reduction
      3. Linear models, possibly still big enough for RF
      4. There is also definite value in simple analytics…
        1. How is this market different?
        2. Comparative tests, etc
  4. What if instead we’re talking about a new product and signups just aren’t where we’d expected…

    1. This is almost certainly just an analytics project.
      1. Look at history of other product launches
      2. Competitive analysis
      3. Really just look for reasonable responses here
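
As a reference point for strong answers, here is a minimal sketch of the kind of pipeline the outline above describes: some dimensionality reduction, a random forest, and cross-validated evaluation on a metric beyond raw accuracy. The data is synthetic and the specific choices are illustrative, not a prescribed solution:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for user features (last sign-on, frequency of use, follower count, ...).
X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),       # one way to tame very wide data
    ("rf", RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)),
])

# Cross-validated ROC AUC rather than a single accuracy number on imbalanced churn labels.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("cross-validated ROC AUC:", round(scores.mean(), 3))
```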

Wrap up questions:

  • Can you describe K-Means clustering?
  • When does it fail?
    • (When the clusters are not linearly separable; think of a bullseye. See the sketch below.)
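
For reference, a minimal sketch of the bullseye failure case on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: k-means draws a straight boundary, so it splits each ring in half.
X, true_labels = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true ring labels is near zero (roughly chance level).
print("adjusted Rand score:", round(adjusted_rand_score(true_labels, pred), 2))
```
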
| Objective | Rate 1 | Rate 3 | Rate 5 |
| --- | --- | --- | --- |
| **Communication** | | | |
| Clarity | The student is difficult to follow, telling stories that don't seem to make sense or otherwise struggling to communicate | The student tells stories but occasionally drops details or makes leaps without filling in the necessary details | The student tells clear stories that are engaging and relevant while being natural to follow |
| Pacing | The student speaks either far too quickly or too slowly | The student speaks at a reasonable pace, but it has some difficulties or frustrations | The student keeps an engaged pace and ensures that the listener is following by asking when appropriate |
| Focus | The student does not stay on topic or answer questions | The student occasionally gets lost in details or rabbit holes but generally stays on topic | The student uses every opportunity to further their case as a data science applicant |
| **Narrative** | | | |
| Data Science Narrative | The student says they're a data scientist because they took a bootcamp | The student acknowledges a previous interest in data science but doesn't sell it well | The student crafts a clear story of how their interest in data science has evolved but has longstanding roots |
| Discussion of Bootcamp | The student says a bootcamp made them a data scientist | The student mentions the bootcamp, but either overemphasizes it or undersells it | The student uses the bootcamp as a tool in their narrative and development but not the crux thereof |
| Discussion of Work History | The student doesn't acknowledge work before the bootcamp or speaks about it poorly | Some clear coverage of work history, but no great explanation of progress and development | The student uses past work experience to lay the groundwork for data science |
| **Case Example** | | | |
| Setup | The student does not explain the setup of their example project well | There is little explanation of why the project is important or interesting, but the setup is explained | It is clear what the project aimed to do and why it was done |
| Explanation | The student does not explain the project well | The student provides some explanation of the project, but it is unclear | The project is clearly presented and easily understood |
| Impact | The project had no impact | The project is technically useful but not impactful | The project is obviously shown to be impactful both in the student's development and as a product itself |

Mock Interview 3: Phone Screen

After completing Unit 5, students will take an interview that simulates a common phone screen. The goal of this interview is to evaluate the student’s preparedness for the job search. Students should be prepared to discuss their experience with data science and their interest in the field.

The general form of the interview is as follows, though be prepared to adapt to the path the student chooses to pursue.

Afterwards, write up comments and ways to improve and submit them via the Typeform linked in your dashboard.

Phone Screen Questions:

  • Give me a quick overview of what you’ve been working on for the past few years.

    • DON’T JUST LEAD WITH BOOTCAMP

      • This is a tool to advance you from somewhere
    • Do not be afraid to interrupt

  • How did you get into data science?

    • When did you know this was something you wanted to do?
    • How did you start?
    • What about it is interesting to you?
  • What’s your favorite project that you’ve ever made?

    • Walk me through it?
    • Inception
    • Implementation
    • Iteration
    • Impact
  • What kind of job are you looking for?

  • Where do you see yourself in 5 years?

  • What are your strengths? Weaknesses?

  • What’s the best job you ever had?

Looking for:

  • Why they are making this change

  • A narrative of their interest in data

    • Usually this is not a sudden change. The student should explain why it makes sense and why you should trust their commitment to it.
  • Clear discussion of past work

  • Focus on relevance of work and desire to expand skills

  • This should be a dialogue, not questions and monologues. Ask follow ups. Respond to their interesting comments.

| Objective | Rate 1 | Rate 3 | Rate 5 |
| --- | --- | --- | --- |
| **Product** | | | |
| Data Choice | The student chooses a dataset that is either trivially small or otherwise inappropriate for ML | The student picks a dataset that is in some way significantly flawed or incompatible with their desired model and does not appropriately navigate those challenges (e.g., data that comes from only a specific subset of the population or has some knowable bias). The data may also be too explicitly set up for only one problem. | The student picks a robust dataset, understands its provenance, and accommodates any relevant outside information or assumptions |
| Product | The product is too simple or has no use case | The product is limited in its use and not easily generalized to a non-sterilized environment | The product easily translates into a variety of situations and has obvious and clear value |
| **Code** | | | |
| Python Essentials | The code has a structure to it, but is redundant and heavily reliant on bad practices like copy/pasting, or contains code that is no longer used | The code is good, but not very efficient, or the logic is somewhat broken | The student writes good, clean, coherent code |
| PEP 8 | The student is living in the wild west of code style | Maybe there is a bad line or two, but not much; the code looks OK but not awe-inspiring | The code is fully PEP 8 compliant |
| Data Science Toolkit | The student uses some data science tools, but they frequently aren't the right ones | The student uses some data science tools, but occasionally reverts to other structures in Python unnecessarily or doesn't always use the best tool for the job | The student uses the data science toolkit and creates easy-to-understand data structures like well-labeled pandas dataframes rather than matrices |
| **Machine Learning** | | | |
| Model Training/Tuning | The model is trained on test data or untuned | The model has a simple training structure but no real tuning, and no robust pipeline is created incorporating unsupervised techniques | The model cleanly includes a solid pipeline and incorporates the necessary techniques for broad robustness |
| Model Selection | Only one model is tried | Models are compared but not robustly | The model comparison incorporates the training toolset robustly and clearly |
| Model Validation | Evaluation is a one-off, on a single metric that is just the 'score' attribute in scikit-learn | The model is validated using cross validation, but it doesn't consider whether the score metric is the right metric, or it relies on only one metric | The model is evaluated robustly and considers the right metrics that will matter for the problem as defined |
| **Presentation** | | | |
| Clarity | The presentation is muddled | The presentation has some missing information or provokes some questions it doesn't answer, but provides some structure for the student to walk through; it may be text heavy | The presentation is easy to walk through and sets up a talk that invites and then answers the audience's questions |
| Visuals | The visuals are ugly or lacking | There are some visuals, but they are not the best representations of the data or they are not well formatted | Visuals are crisp, clean, and hugely compelling; they could exist on their own and be great |

Final Interview

This script covers the final interview for the course. The interview and presentation should be evaluated together.

Because this is a project-based interview, questions should be tailored to the project itself, but a rough guideline is below.

Afterwards, write up comments and ways to improve and submit them via the Typeform linked in your dashboard.

  • What did you build?

    • Why did you choose this?
    • How did you know it would be doable?
    • What research did you do before you started trying to make it?
  • How is this going to help people?

    • Who is your user?
    • How will they use the thing?
  • How was this technically challenging?

    • What was the biggest difficulty you had? How did you overcome it?
  • Technical dive (a minimal combined-pipeline sketch follows these questions)

    • How did you use unsupervised learning?

      • Why did you use it like that?
      • Describe the technique.
    • How did you select your supervised model component?

      • What was your evaluation metric?
      • Why was that the right one? Did you consider any others?
  • How do you know this is good enough?

    • What are the weaknesses of the model as it stands?
  • What would you do if you had more time?

    • What expertise did you not have that you wish you did?
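
Because capstone projects vary, there is no single reference answer here, but one minimal sketch of a pipeline where an unsupervised step feeds a supervised classifier may help frame the technical dive. The data is synthetic and the design is illustrative only:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a capstone dataset.
X, y = make_classification(n_samples=1000, n_features=25, n_informative=6, random_state=1)

# The unsupervised step: KMeans.transform() turns each row into distances to the
# cluster centroids, which then become the features for the supervised classifier.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clusters", KMeans(n_clusters=8, n_init=10, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

print("cross-validated F1:", round(cross_val_score(model, X, y, cv=5, scoring="f1").mean(), 3))
```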

Data Analysis Report

Prompting questions for call

The goal of this call is to assess whether the student will be successful in the full data science bootcamp.

Indicators of success

  • Statistical literacy
  • Grasp of programming fundamentals

The questions below are suggestions to help you dig into the student’s work. You may choose to skip some of these questions or ask others that are not listed here. The student’s report should guide the discussion. If you see obvious issues, question them. Keep in mind that you’re trying to gauge the student’s understanding of basic statistics and programming. Be empathetic to the fact that they’re just starting their journey in data science.

Intro

Greet the student

Have the student pull up their project and share their screen

Explain the agenda of the call: “You’re going to walk me through your report, and along the way, I’ll stop you to ask questions. To start off, explain the dataset you chose.”

Dataset

Student should explain what’s in their dataset, where it comes from, and why it’s interesting/significant. If the student skips any of this information, prompt them to explain.

Further prompting:

  • Did you have any challenges with this data?
  • Was there any missing information or anything you had to drop? (See the quick check after this list.)
  • Why did you choose this dataset?
  • How might this dataset be biased?
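
If the discussion of missing data stays vague, a quick check like this can ground it (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("their_dataset.csv")  # hypothetical path to the student's file
print(df.isna().sum())                 # missing values per column
print(df.duplicated().sum())           # fully duplicated rows
```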

Questions

Student should have three complex analytic questions that are broken down or presented from different perspectives.

Further prompting:

  • How did your dataset inform the questions you chose to explore?
  • Did your questions change at all while working on this project? If so, how did they change?
  • Why did you choose these questions? Looking back on your work, do you want to rephrase or reword any of your questions?
  • You’ve asked a question with a yes or no answer. How could you dig into this topic more? What other questions arise from this one?
  • Is this question answerable? Can you prove your answer?
  • Is the data you have the best data to answer this question? What’s missing?

Code

Student should write clear, coherent code and use the data science toolkit to analyze their dataset.

Further prompting:

  • What steps did you take to answer this question?
  • What issues did you run into while analyzing your data?
  • What tools did you use to help you analyze your data?
  • If you see any obvious errors in their code, point them out. Ask the student how they could fix the error.
  • Point to obvious PEP 8 nonconformance and ask them, "This code works, but do you see any issues with it?" (A small before/after example follows.)
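
If you want a concrete snippet to point at, something like this hypothetical example works; both functions run, but the first breaks PEP 8 naming and spacing conventions:

```python
# Runs fine, but violates PEP 8: CamelCase function name, one-letter argument, odd spacing.
def AvgPrice( l ):
    return sum( l )/len( l )

# The same logic, cleaned up.
def average_price(prices):
    return sum(prices) / len(prices)
```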

Analysis

Student should use summary statistics, statistical tests, and clear visualizations to present their conclusions in a way that's easy to understand. (A minimal example of this pairing follows the prompts below.)

  • Did the conclusions from your data analysis surprise you or did they confirm your expectations? Why do you think that is?
  • Imagine someone sees this chart/graph/visualization out in the wild, separated from your report. What conclusions would you expect them to draw? Is that the conclusion that you want them to draw?
  • Why are these conclusions significant?
  • What further research would you propose for this dataset? What technologies or concepts would you need to learn in order to conduct that research?
  • How could you make your conclusions more rigorous?
  • Could someone look at the work you’ve done and come to a different conclusion?
  • What does this visualization/analysis mean? How else could you show the same results?
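
For reference, the summary-plus-test-plus-visual pairing to look for can be as simple as this sketch (the file, column, and group names are hypothetical):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("student_dataset.csv")          # hypothetical file and columns
print(df.groupby("group")["value"].describe())   # summary stats per group

# Pair the summary with a test and a matching visual rather than a bare number.
a = df.loc[df["group"] == "A", "value"]
b = df.loc[df["group"] == "B", "value"]
print(stats.ttest_ind(a, b, equal_var=False))
df.boxplot(column="value", by="group")
```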

Rubric

| Objective | Rate 1 | Rate 3 | Rate 5 |
| --- | --- | --- | --- |
| **Data/Questions** | | | |
| Dataset Choice | The student chooses a dataset that is either trivially small or otherwise inappropriate for analysis | The student picks a dataset that is in some way significantly flawed or incompatible with their desired analysis and does not appropriately navigate those challenges (e.g., data that comes from only a specific subset of the population or has some knowable bias) | The student picks a robust dataset, understands its provenance, and accommodates any relevant outside information or assumptions |
| Questions | The student asks overly simple questions that are answerable in single lines of code | The student approaches the questions with multiple steps, but presents only a single perspective or is disjointed in the approach | The student chooses complex questions and then either breaks them down into multiple subquestions or presents different ways of reaching a conclusion and evaluates their merits; the questions also build on each other, leading to robust and engaging conclusions |
| General Clarity/Structure | The student's report is unstructured or difficult to read | The student provides some structure, but the report still contains moments where it is easy to lose the narrative or flow of questions | The report is easy to read and uses appropriate markdown to give it a nice presentation and flow |
| **Code** | | | |
| Python Essentials | The student is writing unorganized, unintelligible code | The code has a structure to it, but is redundant and heavily reliant on bad practices like copy/pasting, or contains code that is no longer used | The student writes good, clean, coherent code |
| PEP 8 | The student is living in the wild west of code style | There are some critical errors in PEP 8 styling, like inappropriate spacing or bad variable names | The code is fully or almost fully PEP 8 compliant |
| Data Science Toolkit | The student does some work outside of Python or does not use data science tools where appropriate | The student uses some data science tools, but occasionally reverts to other structures in Python | The student uses the data science toolkit and creates easy-to-understand data structures like well-labeled pandas dataframes rather than matrices |
| **Analysis** | | | |
| Visualizations - Visual Elements | Plots are unlabeled and unreadable | Some plots may lack a few labels, but they are generally readable | Visuals are easily and independently readable, presenting robust conclusions that are easy to understand |
| Visualizations - Statistical Elements | The student relies on at most one type of visualization, even when others are more appropriate | The student tries to use multiple types of visuals, but occasionally picks an inappropriate visual for a given question | The student uses a wide variety of visualizations, with each visual presenting concise information in the best possible way |
| Summary and General Statistics | The student does not use summary statistics | The student computes some summary statistics to balance out visuals, but they are not always the most effective for their narrative | The student uses summary stats and statistical tests to complement their visuals in a clear and compelling way |

| Objective | Rate 1 | Rate 3 | Rate 5 |
| --- | --- | --- | --- |
| **Problem Statement** | | | |
| Data Choice | The student chooses a dataset that is either trivially small or otherwise inappropriate for ML | The student picks a dataset that is in some way significantly flawed or incompatible with their desired model and does not appropriately navigate those challenges (e.g., data that comes from only a specific subset of the population or has some knowable bias). The data may also be too explicitly set up for only one problem. | The student picks a robust dataset, understands its provenance, and accommodates any relevant outside information or assumptions |
| Questions | The student asks overly simple questions that are answerable in single lines of code | The student approaches the questions with multiple steps, but presents only a single perspective or is disjointed in the approach | The student approaches a complex ML problem, but frames it correctly within the tools they have developed in the course |
| **Code** | | | |
| Python Essentials | The code has a structure to it, but is redundant and heavily reliant on bad practices like copy/pasting, or contains code that is no longer used | The code is good, but not very efficient, or the logic is somewhat broken | The student writes good, clean, coherent code |
| PEP 8 | The student is living in the wild west of code style | Maybe there is a bad line or two, but not much; the code looks OK but not awe-inspiring | The code is fully PEP 8 compliant |
| Data Science Toolkit | The student uses some data science tools, but they frequently aren't the right ones | The student uses some data science tools, but occasionally reverts to other structures in Python unnecessarily or doesn't always use the best tool for the job | The student uses the data science toolkit and creates easy-to-understand data structures like well-labeled pandas dataframes rather than matrices |
| **Machine Learning** | | | |
| Model Training/Tuning | The model is trained on test data or untuned | The model has a simple training structure but no real tuning | The model is tuned and trained using tools like grid search and cross validation, and those tools are explained |
| Model Selection | Only one model is tried | Models are compared but not robustly | The model comparison incorporates the training toolset robustly and clearly |
| Model Validation | Evaluation is a one-off, on a single metric that is just the 'score' attribute in scikit-learn | The model is validated using cross validation, but it doesn't consider whether the score metric is the right metric, or it relies on only one metric | The model is evaluated robustly and considers the right metrics that will matter for the problem as defined |
| **Presentation** | | | |
| Clarity | The presentation is muddled | The presentation has some missing information or provokes some questions it doesn't answer, but provides some structure for the student to walk through; it may be text heavy | The presentation is easy to walk through and sets up a talk that invites and then answers the audience's questions |
| Visuals | The visuals are ugly or lacking | There are some visuals, but they are not the best representations of the data or they are not well formatted | Visuals are crisp, clean, and hugely compelling; they could exist on their own and be great |