marksim/github.md

## github.md

      
    Raw
  

              github.md
            
          
    Github Questions

Q1:

Knowing what you know about GitHub, how would you design the high level infrastructure for github.com? What sequence of steps would happen when loading http://www.github.com in a browser? Don't worry about describing the specific libraries and services that handle each step.
A1:

Assumptions

Well, let's start with a few assumptions so we're on the same page:

Github has ~40M repositories (roughly extracted from trending data in this post)
The vast majority of actions taken on github are per-repository (i.e. there is not a lot of need for user facing cross-repo or even cross-user actions other than search)
Most of the actions that are performed on a webserver are highly tuned and have fast response times.

Basic Thoughts

Since github is a write-heavy application, caching will need to be highly attuned and easily invalidated.  There are also so many repositories, each with many commits that storing it all in one massive database makes it more difficult to replicate dbs and handle outages.  There will also need to be a large number of load balanced web servers that can handle lots of simultaneous requests.
When the browser makes a request, using something like AnyCast, various github load balancers would be available in different locales.  The load balancer then forwards the request to a specific web server, which would return the requested page (potentially cached, with appropriate max-age), or at least a skeleton of it if there is a lot of front-end a-sync calls afterward.  The requested page would be rendered depending on the endpoint.  Additional assets (images, common JS, etc) make sense to distribute via a CDN.
On the server side, the server would likely return either cached data or connect to a database shard (potentially based on a hash of the repository name).  Those shards could use replication both within regions to increase redundancy and improve reliability AND across regions to improve individual query response time.
Rough Diagram

                                                                   --> ( Web 1 ) <---       ----> ( DB Shard 1 )
                                                                  /                  \     /
   -----------                                                   /         ...        \  /       ...
   |         |  => ( Local DNS Server )  /=> ( Load Balancer )  ---------> ...         **------> ...
   |         |                           ||                      \         ...        /  \       ...
   |         |                           ||    ...                \                  /     \
   |         |               ||          ||    (other LBs in       --> ( App N ) <---       ----> ( DB Shard N )
   |         |               \/          ||     other regions)
   |         |                           ||                           ||
   |         |      github.com regional  ||                           ||
   |         |  <=        DNS entry      ||                           ||
   |         |                           ||                        HTTP Response /
   |         |                           ||                         Rendered HTML 
   | Browser | ===========================/                           ||
   |         |                                                        ||
   |         |                                                        ||
   |         | <=======================================================/ 
   |         |
   |         | =========> ( CDN )
   |         |
   |         |              ||
   |         |        Image / JS / CSS Assets 
   |         |              ||
   ----------- <=============/


Q2:

Describe the common components of web application frameworks. What purpose does each component serve? What is the benefit of separating each component from the others?
A2:

Components


Models - map data in a datastore to objects
Views - used for rendering the response body, very little business logic and few (if any) db calls
Controllers - map routes to endpoints, gathering data, executing or calling Business Logic, and rendering views
Service Objects - objects that contain reused (or reusable) business logic

Separating each of these components allows for code reuse as well as lower coupling (i.e. the query used for the index of all widgets is not bound to that particular view, but instead, is cohesively included in the model and can be reused on other pages, whereas the view of the index does not care about the query, and only cares about the methods on each instance of the model).
Q3:

Given the following table schema, indexes, and query plan, explain how the query is executed and what you would do to improve the performance.
mysql> describe ci_statuses;
+-----------------+--------------+------+-----+---------+----------------+
| Field           | Type         | Null | Key | Default | Extra          |
+-----------------+--------------+------+-----+---------+----------------+
| id              | int(11)      | NO   | PRI | NULL    | auto_increment |
| state           | varchar(255) | NO   |     | unknown |                |
| sha             | varchar(255) | NO   | MUL | NULL    |                |
| repository_id   | int(11)      | NO   | MUL | NULL    |                |
| created_at      | datetime     | YES  |     | NULL    |                |
| updated_at      | datetime     | YES  |     | NULL    |                |
| pull_request_id | int(11)      | YES  | MUL | NULL    |                |
| context         | varchar(255) | YES  |     | default |                |
+-----------------+--------------+------+-----+---------+----------------+

indexes:
+----------------------------+------------------------------+--------+
| Index_name                 | Columns                      | Unique |
+----------------------------+------------------------------+--------+
| PRIMARY                    | id                           |    1   |
| sha                        | sha                          |    0   |
| pull_request_id_created_at | pull_request_id, created_at  |    0   |
| repository_id_created_at   | repository_id, created_at    |    0   |
+----------------------------+------------------------------+--------+

>explain SELECT r.id FROM repositories r JOIN ci_statuses s ON s.repository_id = r.id GROUP BY s.sha HAVING COUNT(s.context) > 1;
+-------------+-------+--------+--------------------------+---------+---------+-----------------+-----------+--------------------------+
| select_type | table | type   | possible_keys            | key     | key_len | ref             | rows      | Extra                    |
+-------------+-------+--------+--------------------------+---------+---------+-----------------+-----------+--------------------------+
| SIMPLE      | s     | index  | repository_id_created_at | sha     | 767     | NULL            | 122041204 |                          |
| SIMPLE      | r     | eq_ref | PRIMARY                  | PRIMARY | 4       | s.repository_id |         1 | Using where; Using index |
+-------------+-------+--------+--------------------------+---------+---------+-----------------+-----------+--------------------------+

A3:

As it currently stands, there is no reason for the join since the repository id is stored in the ci_statuses table so the query could be rewritten like this to take advantage of the sha index:
SELECT s.repository_id as id FROM ci_statuses s GROUP BY s.sha HAVING COUNT(s.context) > 1;
There could also be an additional index made that combines the repository_id and the sha so that both could be used when scanning ci_statuses in the original query.  Right now, there is no index being accessed on the GROUP BY clause, which means that every single sha will need to be scanned through the table since it's already using the repository_id_created_at index to look up the repo_id