Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save calumhalcrow/006e41c569cb9ad51036d5702a478b30 to your computer and use it in GitHub Desktop.
Save calumhalcrow/006e41c569cb9ad51036d5702a478b30 to your computer and use it in GitHub Desktop.
An engineer skills test for RRSoft

PLEASE NOTE: We'd prefer you send your answers to workwithus@rrsoft.co rather than post them publicly in another Gist, or in comments on this Gist ;-)

Engineer Skills Test

The following is a series of questions related to skills that we commonly use in our projects.

This is not a timed or monitored test, and you may use any print or online resources you would like to assist you in your work.

Please submit your answers in the form of text, markdown, or other lightweight data formats. Please indicate clearly which parts of your response correspond to which question. If your answers span multiple files, please attach them as a zip file.

1. Family Tree

You have a simple and straightforward table which holds people information. It has this structure:

people (
  id integer not null, -- PK
  first_name text not null,
  last_name text not null,
  gender char not null,
  father_id integer, -- FK: people.id
  mother_id integer, -- FK: people.id
  age integer
);

Write down an SQL query which would select count of all adult first cousins for a given people.id.

2. But Everything Worked Flawlessly On Staging!

You have designed a system which collects event stream into a database table called fact_event for further processing. The table consists of event_id, event_date, event_type and other event-related specific attributes. The table also has a BTree index over event_date field. You have just implemented a report which shows ('event_type_id': count) statistics over fact_event table data for last couple of years. You use this query:

SELECT event_type_id, count(*) FROM fact_event WHERE event_date < now() and event_date >= now() - interval '2 years'

It works fast and smooth on staging and uses Bitmap Index Scan over the event_date based index. But when you promote your changes to production you notice that the query is inacceptably slow. You use EXPLAIN with the select query and notice that instead of using Bitmap Index Scan on the index over event_date the optimizer uses Seq Table Scan. What could be a reason to prefer table sequential scan over bitmap index scan? What could you to in order to speed up the report?

3. Event Chains

Here you still have fact_event table from the previous task, but now it has a really simple structure:

fact_event(
  event_date datetime,
  event_id integer,
  previous_event_id integer,
  referrer_host text
)
Primary key: event_id
Foreign key: previous_event_id, points to `fact_event.event_id`, nullable

So events in the table can either have a parent event or not. Let's call linked events event_id_1->event_id_2->...->event_id_n a session. Each session starts with an event with parent_event_id is null. Now, for each event in the table you have to find out what was the referrer host of the very first event in the given session. Let's also assume (select count(*) from fact_event) ~ 5M. Please describe how would you solve this issue, what could be an effective solution from your perspective and why?

4. Lies, Damn Lies, And Git

You're working on a cool branch of the foobar project, and your branch contains two commits A and B. The git lineage is:

X -- Y -- Z <-- master
           \
            A -- B <-- your-cool-branch

You want to contribute the code back to the master foobar repo, but you realize there is a really dumb typo in one of your source code comments in commit A. You'd still like to submit the pull request as two commits A' and B', where A' is the fixed version of A, and B' is the same exact diff as B. How do you rewrite git history to make this happen? Is B' the same hash as B? Why or why not?

5. Unix Tools

In one Unix command, find all of the files in /usr/local whose contents contain the word "aardvark" (case-insensitive), and list them all in order from most-recently created to least-recently created.

6. JavaScript

Take a look at the following:

function *foo(x) {
  while (x < 4) {
    x += yield x;
  }
  return x;
}
var bar = foo(3);
console.log( bar.next(1) );
console.log( bar.next(1) );
console.log( bar.next(1) );

What is the output of this code snippet, and why does it behave in the way it does?

7. Blablacar And Distance Estimation

You have an interesting and complex task: to analyse all available Blablacar offers for a given country and try to find out correlation between the route distance and amount of offers for the route per day. In order to determine the route distance you have decided to use Google Distance Matrix API: a convenient way to get a distance between two cities. So you wrote a web-scraper for blablacar website which does all offers parsing and produces a pairs ('City1', 'City2'). Also, you have implemented a Google Distance Matrix API wrapper function:

import googlemaps 

gmaps = googlemaps.Client(key='Your_API_key')

def distance(departure, destination):
"""A Google Distance API wrapper, returns distance between two cities in meters"""
    my_dist = gmaps.distance_matrix(departure, destination)['rows'][0]['elements'][0]
    return my_dist['distance']['value']    

However, Google Distance Matrix API is paid, and you definitely don't want to pay for redundant API calls: there are more than one route could exist for the same cities pair, also City1-City2 distance is the same as City2-City1 distance. Assuming that your solution is a one-time script, which improvements could you propose in order to reduce money spent on Google Distance Matrix API calls in your code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment