@andrewromanenco
Created March 3, 2015 19:35
Answers: Application Engineer Gist for Andrew

GitHub Application Engineer Questionnaire

Thanks again for applying to the Application Engineer job at GitHub! The purpose of this gist is to get a better sense of your technical skills and overall communication style. Take as much time as you need to answer these questions.

Engineers at GitHub communicate primarily in written form, via GitHub Issues and Pull Requests. We expect our engineers to communicate clearly and effectively; they should be able to concisely express both their ideas as well as complex technical concepts.

Please answer the following questions in as much detail as you feel comfortable with. The questions are purposefully open-ended, and we hope you take the opportunity to show us your familiarity with various technologies, tools, and techniques. Limit each answer to half a page if possible; walls of text are not required, and you'll have a chance to discuss your answers in further detail during a phone interview if we move forward in the process. Finally, feel free to use google, man pages and other resources if you'd like.

Q1

Knowing what you know about GitHub, how would you design the high level infrastructure for github.com? What sequence of steps would happen when loading http://www.github.com in a browser? Don't worry about describing the specific libraries and services that handle each step.

A1:

First of all, I would like to note that I am familiar with this blog post: https://github.com/blog/530-how-we-made-github-fast (in fact, I used it as a reference for my own git hosting service).

The major architectural decision is to separate services by protocol (HTTP, SSH, git). Every protocol shares the same concerns: security, load balancing, failover and caching. The major components are load balancers, front end servers, a database layer, caching services and file access points. I am skipping the architectural details because the post linked above covers them.

When accessing the GitHub home page, it is important to distinguish anonymous users from authenticated ones. As of now, github.com looks like a static page to anonymous users, which makes it an ideal candidate for caching as a single object. After hitting a load balancer, the request is forwarded to a front end server; this server identifies the user as anonymous and retrieves the page from a cache.
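The anonymous path described above can be sketched roughly as follows. This is a minimal illustration, not GitHub's implementation: `Request`, `FrontEnd` and `render_home` are hypothetical names, and an in-process dictionary stands in for whatever cache service would really be used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Request:
    path: str
    session_user: Optional[str] = None  # None means the visitor is anonymous


class FrontEnd:
    """Hypothetical front end server that sits behind the load balancer."""

    def __init__(self, render_home: Callable[[], str]) -> None:
        self._render_home = render_home
        self._page_cache: Dict[str, str] = {}  # stand-in for a real cache service
        self.renders = 0  # counts how often the page was actually rendered

    def handle(self, request: Request) -> str:
        if request.session_user is not None:
            # Authenticated users get a per-user page; it is not cached whole.
            return f"personalized page for {request.session_user}"
        # Anonymous users share one cached copy of the page per path.
        if request.path not in self._page_cache:
            self.renders += 1
            self._page_cache[request.path] = self._render_home()
        return self._page_cache[request.path]
```

With this shape, the second anonymous request for the same path is served from the cache without calling `render_home` again.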

Authenticated users are a completely different story. For example, my github.com landing page is customized with a list of my repositories and a news feed. These dynamic elements are specific to each user and cannot (or shouldn't) be cached as a single unit. From a very high level, page loading looks like this: as before, a load balancer forwards the request to a front end server. The front end server identifies the user and collects the required information from the database or cache. Caching is tricky in this case. Depending on how long it takes to extract the related news and repositories, it might be better to skip the cache altogether. The decision to use caching should follow from accurate metrics (e.g. the cache miss rate). Once all the information is available, it is sent to the view layer to render the actual page.
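To make the "decide from metrics" point concrete, here is a small sketch of a cache that counts hits and misses while the per-user fragments are collected. All names (`MeteredCache`, `build_dashboard`, the key format) are assumptions made for illustration, and a plain dictionary plays the role of the database.

```python
from typing import Callable, Dict, List


class MeteredCache:
    """Hypothetical cache wrapper that records hits and misses."""

    def __init__(self) -> None:
        self._store: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: str, load: Callable[[], List[str]]) -> List[str]:
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = load()  # fall back to the slower source
        return self._store[key]

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def build_dashboard(
    user: str, cache: MeteredCache, db: Dict[str, Dict[str, List[str]]]
) -> Dict[str, List[str]]:
    """Collect the per-user fragments that the view layer will render."""
    repos = cache.get_or_load(f"repos:{user}", lambda: db[user]["repos"])
    news = cache.get_or_load(f"news:{user}", lambda: db[user]["news"])
    return {"repos": repos, "news": news}
```

A persistently low `hit_ratio()` in production would be one signal that the cache costs more than it saves for this page.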

Q2

Describe the common components of web application frameworks. What purpose does each component serve? What is the benefit of separating each component from the others?

A2:

A web application framework provides the common services a web application needs. The dev team can focus on implementing business logic and UI, and reuse the framework for shared services. These services fall into two groups: core and auxiliary.

The core of most web frameworks is a Model-View-Controller layer. This pattern separates code by responsibility (the single-responsibility principle from SOLID). A good example is decoupling views so that they can change independently (e.g. to support both desktop and mobile versions of the UI). Component separation is key to effective development and testing.

MVC separation: Model is the data model, View is the presentation (one or more views), Controller handles requests and the application logic that connects the two.
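A minimal sketch of this separation, with two interchangeable views over the same model. The class and function names are made up for the example, and where exactly a rule such as the star threshold lives varies between frameworks; here it is placed in the model as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Repo:
    name: str
    stars: int


class RepoModel:
    """Model: owns the data and the queries over it."""

    def __init__(self, repos: Iterable[Repo]) -> None:
        self._repos = list(repos)

    def popular(self, min_stars: int) -> List[Repo]:
        return [r for r in self._repos if r.stars >= min_stars]


def desktop_view(repos: List[Repo]) -> str:
    """View: detailed rendering, one repository per line."""
    return "\n".join(f"{r.name} ({r.stars} stars)" for r in repos)


def mobile_view(repos: List[Repo]) -> str:
    """View: compact rendering, names only."""
    return ", ".join(r.name for r in repos)


class RepoController:
    """Controller: takes the request, asks the model, picks a view."""

    def __init__(self, model: RepoModel, views: Dict[str, Callable[[List[Repo]], str]]) -> None:
        self._model = model
        self._views = views

    def show_popular(self, client: str, min_stars: int = 10) -> str:
        repos = self._model.popular(min_stars)
        return self._views[client](repos)
```

Adding a third client type means registering one more view function; the model and the controller stay untouched.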

Auxiliary services are optional; examples are caching, data access, security and logging.

I would like to emphasize that testing is the most important part of the development process. Unit tests require components to be decoupled. A good web app framework should include testing capabilities (at the very least, it should offer a clean, documented set of best practices for testing).
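As an illustration of why decoupling matters for unit tests, the sketch below tests a controller against a fake data source instead of a real database. `NewsSource`, `NewsController` and `FakeNewsSource` are hypothetical names invented for this example.

```python
import unittest
from typing import List, Protocol


class NewsSource(Protocol):
    """Anything that can return the latest news items for a user."""

    def latest(self, user: str) -> List[str]: ...


class NewsController:
    """Depends only on the NewsSource interface, not on a concrete store."""

    def __init__(self, source: NewsSource) -> None:
        self._source = source

    def headline(self, user: str) -> str:
        items = self._source.latest(user)
        return items[0] if items else "No news yet"


class FakeNewsSource:
    """In-memory stand-in used only by the tests."""

    def __init__(self, items: List[str]) -> None:
        self._items = items

    def latest(self, user: str) -> List[str]:
        return list(self._items)


class NewsControllerTest(unittest.TestCase):
    def test_first_item_is_headline(self) -> None:
        controller = NewsController(FakeNewsSource(["pushed to master", "opened issue"]))
        self.assertEqual(controller.headline("andrew"), "pushed to master")

    def test_empty_feed_has_placeholder(self) -> None:
        controller = NewsController(FakeNewsSource([]))
        self.assertEqual(controller.headline("andrew"), "No news yet")
```

Saved as a module, the tests can be run with `python -m unittest`; no database or web server is needed because the controller never sees the real source.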

Q3

Given the following table schema, indexes, and query plan, explain how the query is executed and what you would do to improve the performance.

mysql> describe ci_statuses;
+-----------------+--------------+------+-----+---------+----------------+
| Field           | Type         | Null | Key | Default | Extra          |
+-----------------+--------------+------+-----+---------+----------------+
| id              | int(11)      | NO   | PRI | NULL    | auto_increment |
| state           | varchar(255) | NO   |     | unknown |                |
| sha             | varchar(255) | NO   | MUL | NULL    |                |
| repository_id   | int(11)      | NO   | MUL | NULL    |                |
| created_at      | datetime     | YES  |     | NULL    |                |
| updated_at      | datetime     | YES  |     | NULL    |                |
| pull_request_id | int(11)      | YES  | MUL | NULL    |                |
| context         | varchar(255) | YES  |     | default |                |
+-----------------+--------------+------+-----+---------+----------------+

indexes:
+----------------------------+------------------------------+--------+
| Index_name                 | Columns                      | Unique |
+----------------------------+------------------------------+--------+
| PRIMARY                    | id                           |    1   |
| sha                        | sha                          |    0   |
| pull_request_id_created_at | pull_request_id, created_at  |    0   |
| repository_id_created_at   | repository_id, created_at    |    0   |
+----------------------------+------------------------------+--------+

>explain SELECT r.id FROM repositories r JOIN ci_statuses s ON s.repository_id = r.id GROUP BY s.sha HAVING COUNT(s.context) > 1;
+-------------+-------+--------+--------------------------+---------+---------+-----------------+-----------+--------------------------+
| select_type | table | type   | possible_keys            | key     | key_len | ref             | rows      | Extra                    |
+-------------+-------+--------+--------------------------+---------+---------+-----------------+-----------+--------------------------+
| SIMPLE      | s     | index  | repository_id_created_at | sha     | 767     | NULL            | 122041204 |                          |
| SIMPLE      | r     | eq_ref | PRIMARY                  | PRIMARY | 4       | s.repository_id |         1 | Using where; Using index |
+-------------+-------+--------+--------------------------+---------+---------+-----------------+-----------+--------------------------+

A3:

Query execution process:

The DB does a full scan of ci_statuses, about 122 million rows according to the plan (it walks the sha index tree and then reads the row from the table for the values not included in the index). The index scan is a good fit here because the index is already sorted by sha, the grouping column. For each record from ci_statuses, exactly one row is looked up in the repositories table through its primary key. Behind the scenes, the page cache (with its hit/miss logic) reduces IO operations.

There are several ways to approach the optimization.

  1. Can we eliminate the join entirely? The repository id is already in the ci_statuses table. As written, the inner join only filters out orphaned records (statuses with no corresponding repository). The question is whether we can assume that no orphaned records exist, which can be enforced with a foreign key constraint.

  2. Avoid the table row reads by creating a composite index that covers the columns the query needs from ci_statuses: sha, repository_id and context. This lets the engine scan the index tree only (the plan will show "Using index" in the Extra column).

  3. Precalculating the data is always an option for read-intensive use cases.
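The first two options can be tried out on a toy data set. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for MySQL, so it shows that the join-free query returns the same rows when no orphaned statuses exist, not how MySQL would plan it. The trimmed-down schema, the sample rows and the index name `sha_repository_id_context` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript(
    """
    CREATE TABLE repositories (id INTEGER PRIMARY KEY);
    CREATE TABLE ci_statuses (
        id INTEGER PRIMARY KEY,
        sha TEXT NOT NULL,
        -- the foreign key rules out orphaned statuses
        repository_id INTEGER NOT NULL REFERENCES repositories(id),
        context TEXT DEFAULT 'default'
    );
    -- assumed covering index: every ci_statuses column the query reads
    CREATE INDEX sha_repository_id_context
        ON ci_statuses (sha, repository_id, context);
    INSERT INTO repositories (id) VALUES (1), (2);
    INSERT INTO ci_statuses (sha, repository_id, context) VALUES
        ('aaa', 1, 'ci/build'),
        ('aaa', 1, 'ci/lint'),
        ('bbb', 2, 'ci/build');
    """
)

# The query from the question.
WITH_JOIN = """
    SELECT r.id FROM repositories r
    JOIN ci_statuses s ON s.repository_id = r.id
    GROUP BY s.sha HAVING COUNT(s.context) > 1
"""

# Same grouping, reading ci_statuses only.
WITHOUT_JOIN = """
    SELECT s.repository_id FROM ci_statuses s
    GROUP BY s.sha HAVING COUNT(s.context) > 1
"""

with_join = conn.execute(WITH_JOIN).fetchall()
without_join = conn.execute(WITHOUT_JOIN).fetchall()
```

On this sample both queries return the single repository whose commit `aaa` has more than one status. On the real table, the effect of the index would still need to be confirmed with `EXPLAIN` in MySQL.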
