Data Engineer at Stepsize – take home exercise
This is the take home exercise for the Data Engineer position at Stepsize for candidates going through the fast track (it's a little bit lighter for the normal track).
Table of contents
- Why should I contribute this time?
- Task 1 – Prototype an algorithm to retrieve the relevant Git history of any snippet of code
- Task 2 – From prototype to production
- Task 3 – Q&A
- Instructions to submit your work
The goal of this task is to implement an algorithm to retrieve the relevant Git history of any snippet of code, and to think about how this algorithm could be put into production and perform well.
You can use any languages / frameworks / libraries and you can Google as much as you like.
Why should I contribute this time?
We know your time is precious so we should explain why we're asking for 4-8 hours of your time.
We want to assess your problem-solving skills and your creativity, and we think it's better if you're able to demonstrate these in an environment where you're relaxed, without anyone breathing down your neck.
We want you to get exposure to the core of our backend as part of the hiring process so you can get a sense of our problem space and whether you would find it interesting.
We intend to build upon this task in the final interview stage and get fairly deep into our backend, and this is only reasonable if you've spent some time thinking about it already.
Task 1 – Prototype an algorithm to retrieve the relevant Git history of any snippet of code
Let's say you're looking at a fairly big file with lots of code in it. A snippet of code in that file is of interest and you'd like to retrieve and examine its commit history.
Unfortunately, since this file is big, the majority of its commit history is irrelevant. This effectively rules out simply running the command git log -- path/to/file and browsing the output manually.
What could you do to filter its commit history down to the commits that matter?
Implement a function that takes as input:
And returns an array of commit hashes that are relevant to the input:
Recommended simplifying assumptions
What to optimise for
You can optimise for precision & recall assuming retrieval speed doesn't matter.
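As a reminder of what these two measures mean here, the sketch below scores a list of retrieved commit hashes against a hand-labelled set of relevant ones. The function name, the example hashes, and the convention of returning 1.0 for an empty set are assumptions made for illustration, not part of the exercise.

```python
def precision_recall(retrieved, relevant):
    """Score retrieved commit hashes against a hand-labelled relevant set.

    precision = share of retrieved commits that are actually relevant
    recall    = share of relevant commits that were retrieved
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    # Assumed convention: an empty set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(retrieved_set) if retrieved_set else 1.0
    recall = true_positives / len(relevant_set) if relevant_set else 1.0
    return precision, recall


# Hypothetical short hashes: 2 of the 3 retrieved commits are relevant,
# and 2 of the 4 relevant commits were found.
precision, recall = precision_recall(["a1", "b2", "c3"], ["a1", "c3", "d4", "e5"])
print(precision, recall)
```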
What about file renames and merge commits
You can assume that files are never renamed and merge commits don't exist (i.e. the commit history is strictly linear) to simplify things.
Define "relevant" 🤨
It's ok to use an approximate definition of relevance. We're looking for a language-agnostic solution that doesn't try to understand things like scope.
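One approximate, language-agnostic reading of "relevant" is line overlap: a commit counts if its diff touches lines the snippet occupies. The sketch below is only an illustration of that reading, not a reference solution. It assumes the snippet is given as a 1-indexed inclusive line range in the newest version of one file, and that each commit's hunks have already been extracted as (old_start, old_len, new_start, new_len) tuples from zero-context unified diff headers. The function name and this input shape are hypothetical. Git's own git log -L option traces line ranges in a broadly similar spirit.

```python
def relevant_commits(history, start, end):
    """Return hashes of commits whose diff touches the tracked line range.

    history: list of (commit_hash, hunks), newest commit first, for ONE file.
    hunks:   (old_start, old_len, new_start, new_len) tuples taken from
             zero-context headers "@@ -old_start,old_len +new_start,new_len @@".
    start, end: 1-indexed inclusive range in the newest version of the file.
    """
    relevant = []
    for commit_hash, hunks in history:
        touched = False
        parent_start, parent_end = start, end  # range in the parent's numbering
        for old_start, old_len, new_start, new_len in sorted(hunks, key=lambda h: h[2]):
            delta = old_len - new_len  # shift applied to lines after this hunk
            if new_len == 0:
                # Pure deletion sitting between new lines new_start and new_start + 1.
                if start <= new_start < end:
                    touched = True
                if new_start < start:
                    parent_start += delta
                if new_start < end:
                    parent_end += delta
                continue
            new_last = new_start + new_len - 1
            if new_start <= end and new_last >= start:
                touched = True
            if new_last < start:
                parent_start += delta
            elif new_start <= start:
                # Range starts inside the hunk: widen to the hunk's old side.
                parent_start = old_start if old_len else old_start + 1
            if new_last < end:
                parent_end += delta
            elif new_start <= end:
                # Range ends inside the hunk: widen to the hunk's old side.
                parent_end = old_start + old_len - 1 if old_len else old_start
        if touched:
            relevant.append(commit_hash)
        start, end = parent_start, parent_end
        if start > end:
            break  # the snippet did not exist before this commit
    return relevant


# Hypothetical linear history of one file, newest first.
history = [
    ("c4", [(1, 1, 0, 0)]),   # deletes the first line
    ("c3", [(8, 1, 8, 1)]),   # rewrites line 8
    ("c2", [(2, 0, 3, 2)]),   # inserts two lines after line 2
    ("c1", [(0, 0, 1, 10)]),  # creates the file with 10 lines
]
print(relevant_commits(history, 6, 8))  # ['c3', 'c1']
```

Widening the range to the whole hunk whenever it overlaps favours recall over precision; a stricter rule would trade the other way.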
Being able to retrieve the commit history of a snippet of code is sorely needed in larger files that have grown hairs and all sorts of quirks over time. We have some of these (sadly) – here's an example:
This file has received many commits since it was first created, and most of them are irrelevant to the function selected in the screenshot.
Task 2 – From prototype to production
Let's say you wanted to turn this prototype into a product. You think other engineers will often find themselves in the same situation you were in and could use a tool like this. You might even be keen to build upon this foundation: how about retrieving relevant pull requests, issues, designs, people, etc. for any given piece of code?! And how about allowing engineers to retrieve information for multiple files at a time, potentially even across different repositories?!
Now retrieval speed matters just as much as precision and recall.
Put together a draft proposal outlining the broad strokes of how to turn your prototype into a solution fit for production.
Precision and recall matter just as much as they did in the previous task, but now retrieval speed does as well. We expect the production solution to perform some preprocessing and/or transformations on the underlying Git data ahead of retrieval time for performance to be good enough.
If for whatever reason you think your prototype is not the right approach to take to production, that's ok. Tell us why you think that and what a suitable approach would look like.
A successful draft proposal needs to include the following:
- Specifics about the relevant history retrieval API
- Specifics about the relevant history retrieval logic
- Specifics about the preprocessing logic, if any
- Specifics about the data structure(s) and storage solution(s) used
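To give a feel for the level of detail meant by "specifics", here is a deliberately small sketch of one possible shape: a per-line index built ahead of time, so that retrieval becomes a lookup instead of a walk through the Git history. Everything in it (the class name, its fields, the query method) is hypothetical and assumes the repo state is fixed. It is an illustration of the format, not the approach we expect.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LineHistoryIndex:
    """Hypothetical precomputed index for a fixed repo state."""

    # Position of each commit in the linear history (0 = newest).
    commit_rank: dict[str, int] = field(default_factory=dict)
    # For each file path, entry i holds the commits that touched line i + 1
    # (or its ancestors), as computed by some offline preprocessing step.
    lines: dict[str, list[frozenset[str]]] = field(default_factory=dict)

    def query(self, path: str, start: int, end: int) -> list[str]:
        """Return commits touching lines start..end (1-indexed), newest first."""
        per_line = self.lines.get(path, [])
        hashes = set().union(*per_line[start - 1:end])
        return sorted(hashes, key=self.commit_rank.__getitem__)


# Made-up data for a four-line file.
index = LineHistoryIndex(
    commit_rank={"c4": 0, "c3": 1, "c2": 2, "c1": 3},
    lines={
        "app.py": [
            frozenset({"c1"}),
            frozenset({"c2"}),
            frozenset({"c2", "c3"}),
            frozenset({"c1"}),
        ]
    },
)
print(index.query("app.py", 2, 3))  # ['c3', 'c2']
```

A proposal would still need to say how such an index is built, where it is stored, and what it costs in space.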
We're not expecting a specific number of pages or words, or specific things like diagrams and pseudo-code. This task is as much about assessing your communication skills as your problem-solving skills, so we're giving you the freedom to express yourself as you see fit.
Keep in mind that this is a draft proposal outlining broad strokes, not a technical spec – it's meant to illustrate the key aspects of your proposal, not every single detail. It's totally fine if there are some unknowns; you can acknowledge them without outlining how they would be resolved.
Recommended simplifying assumptions
A successful draft proposal can but does not need to include the following:
- Specifics about how your proposal handles file renames and merge commits – you can keep assuming files are never renamed and the commit history is strictly linear.
- Specifics about how your proposal handles new commits to the Git repo – you can assume the state of the repo is fixed.
- Specifics about the external services your proposal might require (e.g. the HTTP Git Service) – focus on the service responsible for understanding & surfacing the Git history.
Task 3 – Q&A
Once you've completed the previous tasks, take a moment to reflect on them by answering these questions:
- Which part(s) of your submission are you most satisfied with?
- Which part(s) of your submission are you most dissatisfied with?
- How would you improve upon your submission if you had time?
- Do you think this task was a good way to assess whether you're a good fit for this role?
We're looking for relatively short answers composed of a few sentences answering the question and explaining the answer.
Instructions to submit your work
Please email email@example.com with:
- A zip file containing your code for task 1 and instructions on how to run it
- Your proposal for task 2 in whichever form is appropriate (we like Google Docs and markdown)
- Your answers to the questions of task 3 (Google Docs, markdown, PDF, etc.)
Don't hesitate to get in touch by email if you have any questions. Note that if you can't quite finish task 1 and would like some pointers on what a good solution looks like to be able to take on task 2, we can share one with you.
Thank you for your time, we really appreciate it. We'll get back to you as soon as possible – you can relax now.