mgummelt/gist:e26908fec9eea7078212

## gistfile1.md

      
GitHub Infrastructure Engineer Questionnaire

Thanks again for applying to the Infrastructure Engineer job at GitHub! The purpose of this gist is to get a better sense of your technical skills and overall communication style. Take as much time as you need to answer these questions.
Section 1

Engineers at GitHub communicate primarily in written form, via GitHub Issues and Pull Requests. We expect our engineers to communicate clearly and effectively; they should be able to concisely express both their ideas and complex technological concepts.
Please answer the following questions in as much detail as you feel comfortable with. The questions are purposefully open-ended, and we hope you take the opportunity to show us your familiarity with various technologies, tools, and techniques. Limit each answer to half a page if possible; walls of text are not required, and you'll have a chance to discuss your answers in further detail during a phone interview if we move forward in the process. Finally, feel free to use Google, man pages, and other resources if you'd like.
Q1

A service daemon in production has stopped responding to network requests. You receive an alert about the health of the service, and log in to the affected node to troubleshoot. How would you gather more information about the process and what it is doing? What are common reasons a process might appear to be locked up, and how would you rule out each possibility?
A1:

| Problem | Diagnostic |
| --- | --- |
| The problem is on the alerting tool node | See if the service is responding from my local machine, or from another node in the network |
| Node is down, or disconnected from network | ping or ssh |
| Process isn't running | ps -ef |
| Process isn't listening | netstat -lntp, or the equivalent |
| Load is too high for the process to run | top |
| Process is spinning CPU | CPU utilization in top |
| Process is in a wait state, during an IO call for example | S or D in STAT field of ps aux |
| Kernel recv/send buffers for the socket are full and packets are being dropped | Compare buffer size from /proc/net/ to max sizes in /proc/sys/net/ipv4, or inspect packets with tcpdump* |
| Server backlog is full, and connections are being rejected | Look for ECONNREFUSED errors on the client, and inspect packets with tcpdump on the server |

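
The process-state checks above (ps -ef, and the STAT field of ps aux) could be wrapped in a small script. This is only a sketch, assuming a Linux node with the procps version of ps; `describe_state` and `process_state` are hypothetical helper names, not part of any existing tool.

```shell
#!/bin/bash
# Hypothetical helper: translate a ps STAT value (e.g. "Ss", "D+") into words.
describe_state() {
  case "$1" in
    R*) echo "running or runnable" ;;
    S*) echo "interruptible sleep (waiting for an event)" ;;
    D*) echo "uninterruptible sleep (usually I/O)" ;;
    T*) echo "stopped" ;;
    Z*) echo "zombie (exited, not yet reaped)" ;;
    *)  echo "other state: $1" ;;
  esac
}

# Hypothetical helper: report whether a PID exists and, if so, its state.
process_state() {
  # "stat=" asks ps for the STAT column only, with no header line.
  stat=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ')
  if [ -z "$stat" ]; then
    echo "not running"
  else
    describe_state "$stat"
  fi
}
```

Calling `process_state` with the daemon's PID a few times in a row gives a rough picture: a process that stays in D is likely stuck in I/O, while one that stays in R while top shows high CPU is likely spinning.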
Q2

A user on an Ubuntu machine runs curl http://github.com. Describe the lifecycle of the curl process and explain what happens in the kernel, over the network, and on github.com's servers before the command completes.
A2:

The shell process forks and execs the curl process with the specified argument.  When curl gets the processor, it resolves the host (getaddrinfo()) and connects to the resulting IP (socket(), connect()).  In the kernel, the connect() syscall performs the three-way SYN/SYN-ACK/ACK TCP handshake with the remote host.  Once the handshake completes, the socket enters the ESTABLISHED state and connect() returns.  curl then constructs an HTTP GET message with a "Host: github.com" header and sends it over the network (send() or write()).  The request fits in one packet (probably), so the entire message is sent, regardless of congestion/flow control limitations.
Meanwhile, the github web server has already created a socket (socket()), bound it (bind()) to the public IP to which github.com resolves, or to a private address behind some proxy, and started listening on it (listen()).  The kernel negotiates the handshake with the client and, upon completion, places the connection in the server's backlog queue and wakes up the server (if it's blocked in select(), poll(), or accept()).  The accept() syscall on the server returns, providing a new file descriptor for the connected socket.  The server can now hand off the connection to a thread or process for handling, so it can return to accepting connections.  The handling process constructs an HTTP response and sends it back to the client (write() or send()).
The client, which has been blocking on recv(), reads the response into a buffer and closes its descriptor (close()), which sends a FIN packet and prompts the server to close its side of the TCP connection.  curl writes the response to stdout, and the process exits.
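
The client side of this exchange can be approximated in bash itself. This is a rough sketch, not what curl does internally: it assumes bash's `/dev/tcp` pseudo-device is available (it resolves the host and connects on open), and `build_request` and `http_get` are hypothetical names. The real curl also sends headers such as User-Agent and Accept.

```shell
#!/bin/bash
# Hypothetical helper: build a minimal HTTP/1.1 GET request.
# "Connection: close" asks the server to close the connection after responding.
build_request() {
  local host=$1 path=${2:-/}
  printf 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$path" "$host"
}

# Hypothetical helper: connect, send the request, print the raw response.
http_get() {
  local host=$1 path=${2:-/}
  # Opening /dev/tcp/HOST/PORT makes bash resolve the name and connect a TCP socket.
  exec 3<>"/dev/tcp/$host/80" || return 1
  build_request "$host" "$path" >&3   # write the request to the socket
  cat <&3                             # read until the server closes the connection
  exec 3<&-                           # close our descriptor
}
```

Running `http_get github.com /` on a machine with network access prints the status line, the response headers, and any body exactly as the server sent them.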
Q3

Explain in detail each line of the following shell script. What is the purpose of this script? How would you improve it?
#!/bin/bash
set -e
set -o pipefail
exec sudo ngrep -P ' ' -l -W single -d bond0 -q 'SELECT' 'tcp and dst port 3306' |
  egrep "\[AP\] .\s*SELECT " |
  sed -e 's/^T .*\[AP\?\] .\s*SELECT/SELECT/' -e 's/$/;/' |
  ssh $1 -- 'sudo parallel --recend "\n" -j16 --spreadstdin mysql github_production -f -ss'

A3:

1: On the non-zero exit status of any command, exit immediately.
2: The exit code of a pipeline is that of the rightmost command that exited non-zero, rather than always that of the last command in the pipeline.
3: For all TCP packets on the bond0 interface that are destined for the MySQL port (3306) and contain "SELECT", print them to stdout, such that each packet is on a single line, and non-printable characters are printed as a ' ', rather than a '.'.
4: Select only those packets containing "[AP]" followed by a "SELECT", which limits the results to SQL SELECT statements, rather than statements that happen to have a "SELECT" substring.
5: Eliminate the ngrep prefix, thus recreating the original SQL statement, and ensure that all lines end with a semicolon.
6: On a remote node specified on the command line, create 16 worker processes to relay the modified SQL statements to production.
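
The effect of `set -o pipefail` can be checked with a throwaway pipeline. This is a small sketch, assuming bash; the two function names are hypothetical, and each body runs in a subshell so the option change does not leak into the calling shell.

```shell
#!/bin/bash
# Exit status of a three-stage pipeline WITHOUT pipefail:
# only the last command ("true") counts.
status_without_pipefail() (
  set +o pipefail
  rc=0
  (exit 2) | (exit 3) | true || rc=$?
  echo "$rc"
)

# Exit status of the same pipeline WITH pipefail:
# the rightmost command that exited non-zero counts.
status_with_pipefail() (
  set -o pipefail
  rc=0
  (exit 2) | (exit 3) | true || rc=$?
  echo "$rc"
)
```

`status_without_pipefail` prints 0 and `status_with_pipefail` prints 3, so with `set -e` also in effect, a failure in any stage of the ngrep pipeline stops the script.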
The purpose of the script is to ensure each SQL command is terminated with a semicolon, so that command termination is not ambiguous.  I'm not sure how MySQL handles unterminated commands, but I assume there's the danger that it might try to append the next command to the current one.
The script might be improved by batching SQL commands under a single invocation of mysql, to avoid the overhead of starting the client for each SELECT statement.
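
The egrep and sed stages can be tried out without capturing any traffic. The sketch below assumes GNU grep and GNU sed (for `\s` and `\?`) and uses `grep -E` in place of egrep. The sample lines are made up to resemble ngrep's single-line output, and `extract_selects` and `sample_lines` are hypothetical names.

```shell
#!/bin/bash
# Same patterns as the script: keep lines where "[AP] " is followed by one
# byte, optional whitespace, and "SELECT ", then strip the ngrep prefix and
# append a semicolon.
extract_selects() {
  grep -E '\[AP\] .\s*SELECT ' |
    sed -e 's/^T .*\[AP\?\] .\s*SELECT/SELECT/' -e 's/$/;/'
}

# Made-up input: one SELECT query, and one UPDATE that merely contains
# the word SELECT inside a string literal.
sample_lines() {
  printf '%s\n' \
    'T 10.0.0.1:51234 -> 10.0.0.2:3306 [AP] !    SELECT id FROM users WHERE id = 1' \
    "T 10.0.0.1:51234 -> 10.0.0.2:3306 [AP] (    UPDATE notes SET body = 'SELECT me'"
}
```

`sample_lines | extract_selects` keeps only the first line and prints it as a bare, semicolon-terminated statement; the UPDATE is dropped because "SELECT " does not directly follow the packet prefix.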
Section 2

The following areas map to technologies we use on a regular basis at GitHub. Experience in all of these areas is not a prerequisite for working here. We'd like to know how many of these overlap with your skill set so that we can tailor our interview questions if we move forward in the process.
Please assess your experience in the following areas on a 1-5 scale, where (1) is "no knowledge or experience" and (5) is "extensive professional experience". If you're not sure, feel free to leave it blank. Just place the number next to the corresponding areas listed here:

system administration

puppet (1)
ubuntu (3)
debian packages (3)
raid (1)
new hardware burn-in testing (1)


virtualization

lxc (1)
xen/kvm (1)
esx (1)
aws (3)


troubleshooting

debuggers (gdb, lldb) (3)
profilers (perf, oprofile, perftools, strace) (2)
network flow (tcpdump, pcap) (3)


large system design

unix processes and threads (3)
sockets (3)
signals (3)
mysql (3)
redis (1)
elasticsearch (2)


coding

comp-sci fundamentals (data structures, big-O notation) (4)
git usage (4)
git internals (2)
c programming (4)
shell scripting (3)
ruby programming (2)
rails (2)
javascript (3)
coffeescript (3)


networking

TCP/UDP (4)
bgp (2)
juniper (1)
arista (1)
DDoS mitigation strategies and tools (2)
transit setup and troubleshooting (2)


operational experience

reading and debugging code you've never seen before (4)
handling urgent incidents when on-call (2)
helping other engineers understand and navigate production systems (3)
handling large scale production incidents (external communications, internal coordination) (3)