@hossbeast
Created March 5, 2015 16:01
# GitHub Infrastructure Engineer Questionnaire
Thanks again for applying to the Infrastructure Engineer job at GitHub! The purpose of this gist is to get a better sense of your technical skills and overall communication style. Take as much time as you need to answer these questions.
## Section 1
Engineers at GitHub communicate primarily in written form, via GitHub Issues and Pull Requests. We expect our engineers to communicate clearly and effectively; they should be able to concisely express both their own ideas and complex technological concepts.
Please answer the following questions in as much detail as you feel comfortable with. The questions are purposefully open-ended, and we hope you take the opportunity to show us your familiarity with various technologies, tools, and techniques. Limit each answer to half a page if possible; walls of text are not required, and you'll have a chance to discuss your answers in further detail during a phone interview if we move forward in the process. Finally, feel free to use Google, man pages, and other resources if you'd like.
### Q1
A service daemon in production has stopped responding to network requests. You receive an alert about the health of the service, and log in to the affected node to troubleshoot. How would you gather more information about the process and what it is doing? What are common reasons a process might appear to be locked up, and how would you rule out each possibility?
### A1:
The first thing I would look at is the open file listing for the process, using the lsof utility. Things to look for: are the standard file descriptors (0, 1, 2) open where they should be (a tty, a log file, /dev/null; it depends on the daemon)? Are there "lots" of open files? Is the list changing? These are heuristics, but it's a good place to start.
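For concreteness, a minimal sketch of that first pass (the PID 12345 is hypothetical):
```
# Full open-file listing for the hung process: check fds 0-2, logs, sockets, mmaps
lsof -p 12345

# The same information is available directly from procfs
ls -l /proc/12345/fd
```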
Next, strace. Is the process making any system calls? If so, on which files? For example, you may observe a write call on a file descriptor that lsof shows to be on a network-mounted filesystem. If the remote filesystem goes away, the call can hang. No system calls being made at all may indicate a deadlock.
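A sketch of how I would attach strace (again with a hypothetical PID):
```
# -f follows threads and children, -tt timestamps each call, -T shows time spent in each call
strace -f -tt -T -p 12345
# A call that is entered but never returns (a write() or futex() that just sits there)
# is a strong hint about what the process is blocked on.
```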
At this point, I would attach to the process with gdb. The amount of information I'm able to gather depends on my familiarity with the daemon: what language is it written in, was it compiled with debug symbols, and at what optimization level? In the case of a deadlock, what locking primitives are being used? In the best case, you can obtain a thread listing, locate the interlocked threads, and find the root cause. This may require recompiling with debug symbols.
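Assuming gdb is available on the node, a quick way to get that thread listing without an interactive session:
```
# Attach, dump every thread's backtrace, and detach (PID hypothetical)
gdb -p 12345 -batch -ex 'thread apply all bt'
# With symbols present, deadlocked threads typically show up blocked in
# pthread_mutex_lock / futex frames, which points at the offending locks.
```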
### Q2
A user on an ubuntu machine runs `curl http://github.com`. Describe the lifecycle of the curl process and explain what happens in the kernel, over the network, and on github.com's servers before the command completes.
### A2:
First, curl resolves the hostname github.com, most likely via the getaddrinfo library call, which consults /etc/nsswitch.conf and /etc/hosts and typically ends up issuing a DNS query over UDP port 53. curl then opens a TCP socket using the socket and connect system calls, providing the resolved address and the default HTTP port of 80. The kernel selects the appropriate interface using the routing table and initiates the TCP three-way handshake (SYN, SYN-ACK, ACK) with the remote server.
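One way to observe this phase from the client side is to run curl under strace; a sketch, with output details varying by libc and resolver configuration:
```
# Show only network-related system calls made by curl and its threads
strace -f -e trace=network curl -s http://github.com -o /dev/null
# Expect sendto/recvfrom on a UDP socket to port 53 for the DNS lookup,
# then socket() and connect() to the resolved address on TCP port 80.
```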
Next, on a github.com server, the kernel receives the SYN packet and completes the handshake. An HTTP daemon (possibly apache) has registered to receive connections on port 80 with the listen system call, so the kernel queues the new connection and hands the daemon a file descriptor for the connected socket when it calls accept.
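On the server side, the listening socket behind that accept loop is easy to confirm; a sketch using iproute2's ss (netstat would also work):
```
# Listening TCP sockets, numeric ports, with the owning process
sudo ss -ltnp | grep ':80 '
```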
Finally, curl sends an HTTP GET request for the page at / over this TCP connection. The HTTP daemon replies over the same connection. curl writes the response body to stdout (or to a file, if one was given with -o) and closes the connection.
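The exchange itself is visible with curl's verbose flag; note that, at the time of writing, github.com answers plain HTTP with a redirect:
```
curl -v http://github.com
# The verbose output shows the GET / request and the response headers;
# github.com typically replies 301 Moved Permanently pointing at https://github.com,
# and without -L curl prints that response and closes the connection.
```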
### Q3
Explain in detail each line of the following shell script. What is the purpose of this script? How would you improve it?
```
#!/bin/bash
set -e
set -o pipefail
exec sudo ngrep -P ' ' -l -W single -d bond0 -q 'SELECT' 'tcp and dst port 3306' |
egrep "\[AP\] .\s*SELECT " |
sed -e 's/^T .*\[AP\?\] .\s*SELECT/SELECT/' -e 's/$/;/' |
ssh $1 -- 'sudo parallel --recend "\n" -j16 --spreadstdin mysql github_production -f -ss'
```
### A3:
This script captures SQL SELECT queries from live network traffic, ships them to a remote host over ssh, and executes them in parallel against a MySQL database there.
`set -e` causes the shell to exit immediately when a command exits with a nonzero status. (For a pipeline, only the status of the final command triggers this, unless pipefail is also set.)
`set -o pipefail` causes a pipeline to return the exit status of the last (rightmost) command that exited with a nonzero status, or zero if every command in the pipeline succeeded.
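A small standalone illustration of how these two settings interact (not part of the script above):
```
#!/bin/bash
false | true
echo "without pipefail, pipeline status is $?"   # 0 -- the failure of `false` is hidden

set -o pipefail
false | true
echo "with pipefail, pipeline status is $?"      # 1 -- the nonzero status propagates

set -e
false | true                                     # with both set, the shell exits here
echo "never reached"
```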
The `ngrep` command captures TCP traffic on the bond0 interface that is bound for port 3306 (the MySQL port) and whose payload matches the regular expression SELECT. Note that this traffic is not necessarily destined for the box on which ngrep is being run. The interface name bond0 suggests a bonded interface, probably for redundancy. The remaining options control the output format: -P ' ' replaces non-printable bytes with spaces, -W single prints each captured packet on a single line, -l line-buffers the output, and -q suppresses the per-packet hash marks. The leading exec is largely cosmetic: the first element of a pipeline already runs in its own subshell, so exec merely replaces that subshell with the sudo ngrep process.
The `egrep` command further refines the captured output by discarding lines that do not match a second regular expression; the \[AP\] matches ngrep's TCP flag annotation for segments with ACK and PSH set, i.e. segments that actually carry data. This line could probably be omitted altogether by using a more specific match expression in the ngrep command.
The `sed` command removes everything before SELECT on each line and appends a semicolon to the end of each line. The semicolon is important because it terminates the query; if it is missing, mysql keeps reading input waiting for a statement terminator instead of executing the query.
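To make the transformation concrete, here is the sed expression applied to a single hypothetical line of ngrep output (addresses and query invented for illustration):
```
echo 'T 10.1.2.3:52314 -> 10.1.2.4:3306 [AP]  SELECT id FROM repositories LIMIT 1' |
  sed -e 's/^T .*\[AP\?\] .\s*SELECT/SELECT/' -e 's/$/;/'
# => SELECT id FROM repositories LIMIT 1;
```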
The `ssh` command opens a session to the host given as the script's first parameter ($1) and uses GNU parallel on that host to run the queries against the github_production database, up to 16 mysql processes at a time (-j16); --spreadstdin splits the incoming stream into newline-delimited records (--recend "\n") and distributes them across the jobs' stdin. The -f (--force) option makes mysql continue past SQL errors from malformed queries, and -ss (silent, given twice) trims the output down to just the result rows.
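A harmless local illustration of the --spreadstdin behaviour, assuming GNU parallel is installed (the input here is just placeholder text):
```
printf 'one\ntwo\nthree\nfour\n' |
  parallel --recend "\n" -j2 --spreadstdin wc -l
# parallel splits the input into newline-delimited records and pipes chunks of
# records to up to two concurrent wc -l processes on their stdin.
```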
What would I change? The first problem is that ngrep operates on individual packets, not on reassembled TCP streams; TCP is stream-oriented, so there is no guarantee that an entire query arrives in a single packet, and a query that spans packets will be truncated or mangled. There is no input scrubbing, and the $1 argument is neither validated nor quoted. ssh sessions are not being reused. I think this script is a good proof of concept, but a production solution would require a good bit more design.
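As a starting point, a hypothetical hardened variant (a sketch only; it does not address the packet-reassembly problem, which needs a different capture approach):
```
#!/bin/bash
set -euo pipefail

# Require and quote the ssh target instead of interpolating a bare $1
target=${1:?usage: $0 <ssh-target>}

sudo ngrep -P ' ' -l -W single -d bond0 -q 'SELECT' 'tcp and dst port 3306' |
  egrep "\[AP\] .\s*SELECT " |
  sed -e 's/^T .*\[AP\?\] .\s*SELECT/SELECT/' -e 's/$/;/' |
  ssh "$target" -- 'sudo parallel --recend "\n" -j16 --spreadstdin mysql github_production -f -ss'
```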
## Section 2
The following areas map to technologies we use on a regular basis at GitHub. Experience in all of these areas is not a prerequisite for working here. We'd like to know how many of these overlap with your skill set so that we can tailor our interview questions if we move forward in the process.
Please assess your experience in the following areas on a 1-5 scale, where (1) is "no knowledge or experience" and (5) is "extensive professional experience". If you're not sure, feel free to leave it blank. Just place the number next to the corresponding areas listed here:
- system administration
  - puppet 1
  - ubuntu 4
  - debian packages 4
  - raid 2
  - new hardware burn-in testing 1
- virtualization
  - lxc 1
  - xen/kvm 2
  - esx 1
  - aws 3
- troubleshooting
  - debuggers (gdb, lldb) 5
  - profilers (perf, oprofile, perftools, strace) 5
  - network flow (tcpdump, pcap) 4
- large system design
  - unix processes and threads 5
  - sockets 5
  - signals 5
  - mysql 3
  - redis 1
  - elasticsearch 1
- coding
  - comp-sci fundamentals (data structures, big-O notation) 5
  - git usage 3
  - git internals 3
  - c programming 5
  - shell scripting 5
  - ruby programming 1
  - rails 1
  - javascript 4
  - coffeescript 1
- networking
  - TCP/UDP 5
  - bgp 2
  - juniper 1
  - arista 1
  - DDoS mitigation strategies and tools 2
  - transit setup and troubleshooting 1
- operational experience
  - reading and debugging code you’ve never seen before 5
  - handling urgent incidents when on-call 5
  - helping other engineers understand and navigate production systems 5
  - handling large scale production incidents (external communications, internal coordination) 2