@askldjd
Created June 21, 2017 01:21
jiffies-blog-post

In the VA Digital Service's Appeals Modernization effort, we developed a React on Rails application called Caseflow. We follow industry best practices for Continuous Integration and Continuous Deployment (CI/CD) using Travis and Jenkins. Our team develops features quickly and deploys changes to the production environment daily. However, our deployment pipeline was not always smooth. This is the story of a bug that haunted us for months.

Five minutes of action

Around January 2017, deploying Caseflow to the production environment was an exhausting event. Our deployment pipeline uses the Immutable Server Pattern to create an Amazon Machine Image (AMI) and performs rolling deploys onto our AWS Auto Scaling group. The newly deployed EC2 instances would serve traffic beautifully for five minutes, and then the servers would tie up all their threads on queries to an Oracle database called VACOLS. The system would go down, preventing Veterans' appeals from being certified by Regional Offices across the entire nation. All of us would be texted and called by PagerDuty. By the end of the day, the team would be exhausted and traumatized.

A little bit of background

Caseflow is an application we are building to modernize the benefits claim appeals process at the VA. The application is deployed in the AWS GovCloud region. To connect to the VA's on-premises VACOLS (Oracle) database, it goes through an IPsec tunnel managed by the VA Network Security Operation Center (NSOC). NSOC manages a fleet of Cisco routers that contain a large number of routing and firewall rules for our packets.

Unpeeling the onion

We knew that the bug was related to the connection with the Oracle database. However, it was difficult to pinpoint the root cause because there were many layers to the issue. We came up with a set of hypotheses to disprove.

  • Ruby's OCI8 implementation was dropping the connection to the Oracle database.
  • The Oracle database was silently dropping our session.
  • Some startup program in our Linux image was dropping the database connections.
  • Our Site-to-Site VPN was dropping the connection at the IPsec tunnel.

A quick check with netstat -tunlap | grep 1526 showed that our database TCP connections were in the ESTABLISHED state. So at least on the surface, the connections appeared to be okay.

We used SQL Developer to connect to Oracle for a second opinion. A status check showed that the database was healthy and happily serving requests for other applications.

I wrote a small Node.js application to test the Oracle database connection. The program connects to the database and performs a simple query once a second. At the fifth minute after OS boot, the query hung. This verified that the problem was not isolated to Ruby's OCI8 implementation.
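The probe described above can be sketched in a few lines. This is not the original Node.js program; it is a minimal Python sketch of the same idea, assuming a caller-supplied run_query function (hypothetical) that performs one simple query against the database. The function names, timeout, and interval are all illustrative.

```python
import threading
import time


def query_hangs(run_query, timeout_s):
    """Run run_query in a daemon thread; return True if it has not finished within timeout_s."""
    finished = threading.Event()

    def worker():
        try:
            run_query()
        finally:
            finished.set()  # an error counts as "returned", not as a hang

    threading.Thread(target=worker, daemon=True).start()
    return not finished.wait(timeout_s)


def seconds_until_first_hang(run_query, interval_s=1.0, timeout_s=5.0, max_probes=600):
    """Probe once per interval; return seconds elapsed when a query first hangs, else None."""
    start = time.monotonic()
    for _ in range(max_probes):
        if query_hangs(run_query, timeout_s):
            return time.monotonic() - start
        time.sleep(interval_s)
    return None


if __name__ == "__main__":
    # Stand-in for a real database query: the fourth call blocks for a while.
    calls = {"n": 0}

    def fake_query():
        calls["n"] += 1
        if calls["n"] > 3:
            time.sleep(0.5)

    elapsed = seconds_until_first_hang(fake_query, interval_s=0.01, timeout_s=0.1)
    print("first hang after", elapsed, "seconds and", calls["n"], "queries")
```

Started right after boot with a real query function, a probe like this would be expected to report its first hang near the five-minute mark.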

We were also able to replicate this issue using Ubuntu 14.04 with the 3.16-based Linux kernel. Since we could replicate the issue across two major versions of Linux, we concluded that the problem was not specific to our version of Linux.

Through a process of elimination, we narrowed the problem down to a likely Site-to-Site VPN issue.

Breakthrough

After repeated experiments with a stopwatch, we knew that the connection was dropped at the five-minute mark. To diagnose the next layer, we used tcpdump to examine the IP packets for clues. Our effort paid off and we observed an anomaly: the TCP TS val was at the 32-bit boundary when the TCP window stopped sliding. This was the moment of breakthrough. After discussing with NSOC, we confirmed that this was a known Cisco bug in the latest firmware.

15:34:25.326520 IP AWS-IP > VACOLS: Flags [.], ack 1802470, win 843, options [nop,nop,TS val 4294967250 ecr 2081125286], length 0
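To see how close that packet is to the 32-bit boundary, the timestamp option can be pulled out of the tcpdump line and subtracted from 2^32. The following is a small illustrative sketch; the helper names are made up for this example.

```python
import re

WRAP = 2 ** 32  # TCP timestamp values are 32-bit unsigned integers
TS_OPTION = re.compile(r"TS val (\d+) ecr (\d+)")


def parse_timestamps(line):
    """Return (ts_val, ts_ecr) from a tcpdump line, or None if the option is absent."""
    match = TS_OPTION.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def ticks_until_wrap(ts_val):
    """Number of timestamp ticks left before ts_val wraps back to zero."""
    return WRAP - ts_val


line = (
    "15:34:25.326520 IP AWS-IP > VACOLS: Flags [.], ack 1802470, win 843, "
    "options [nop,nop,TS val 4294967250 ecr 2081125286], length 0"
)
ts_val, ts_ecr = parse_timestamps(line)
print(ts_val, ts_ecr, ticks_until_wrap(ts_val))  # 4294967250 2081125286 46
```

The captured packet is only 46 ticks away from the wrap, which is why it stood out.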

After discussing this issue with VA NSOC, we determined that the bug originated in a fleet of Cisco routers. The router drops a packet if it sees a TS ecr value ahead of the TS val. The bug occurs only if we make a persistent TCP connection within five minutes of bootup, which perfectly explains our situation. To work around the issue temporarily, we disabled TCP timestamps by setting net.ipv4.tcp_timestamps = 0 in /etc/sysctl.conf.
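The comparison described above can be modeled in a few lines. This is only a hypothetical model of the single check as we understood it from NSOC, not Cisco's actual firmware logic, and it ignores whatever connection state the real routers track. It also holds the ecr value fixed for simplicity, although the peer's clock keeps advancing in practice.

```python
WRAP = 2 ** 32  # TCP timestamp values are 32-bit unsigned integers


def naive_router_drops(ts_val, ts_ecr):
    """Hypothetical faulty check: drop when ecr is numerically ahead of val."""
    return ts_ecr > ts_val


def advance(ts_val, ticks):
    """Sender timestamp after a number of ticks, with 32-bit wraparound."""
    return (ts_val + ticks) % WRAP


before_wrap = 4294967250            # TS val from the captured packet
after_wrap = advance(before_wrap, 100)
ecr = 2081125286                    # TS ecr from the captured packet

print(naive_router_drops(before_wrap, ecr))             # False: packet passes
print(after_wrap, naive_router_drops(after_wrap, ecr))  # 54 True: packet is dropped
```

Once the sender's TS val wraps to a small number, every subsequent packet on the connection fails a check of this kind, which matches the hang we observed.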

All about the Jiffies

The most interesting aspect of this bug is the magical five-minute death mark. Digging deeper into the Linux TCP stack, we found that the TS val is initialized from jiffies on session creation. Jiffies is the unit the Linux kernel uses to measure time. For many years, the kernel developers have initialized the jiffies counter to five minutes before its 32-bit wrap point, to encourage exposure of integer wrapping bugs. Naturally, the TCP TS val also wraps around at the five-minute mark.
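The arithmetic can be checked with a short sketch. The kernel defines the initial counter as -300*HZ truncated to 32 bits, where HZ is the number of jiffies per second. The value HZ = 250 below is an assumption for illustration; the actual tick rate depends on how the kernel was built, and the function names are made up for this example.

```python
WRAP = 2 ** 32
HZ = 250  # assumed tick rate; the real value depends on the kernel configuration


def initial_jiffies(hz):
    """Low 32 bits of the boot-time jiffies value: 300 seconds before the wrap."""
    return (-300 * hz) % WRAP


def ts_val_at(seconds_since_boot, hz):
    """32-bit timestamp value at a given uptime, assuming it tracks jiffies directly."""
    return (initial_jiffies(hz) + int(seconds_since_boot * hz)) % WRAP


def seconds_until_wrap(ts_val, hz):
    """Seconds remaining before a timestamp value wraps to zero."""
    return (WRAP - ts_val) / hz


print(initial_jiffies(HZ))                          # 4294892296
print(seconds_until_wrap(initial_jiffies(HZ), HZ))  # 300.0
print(ts_val_at(300, HZ))                           # 0
print(seconds_until_wrap(4294967250, HZ))           # 0.184
```

The wrap lands 300 seconds after boot whatever HZ is, because the offset is defined in units of HZ. Under the HZ = 250 assumption, the packet captured by tcpdump was sent about 0.18 seconds before the wrap.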

Before joining USDS, I would never have imagined myself as an SRE, debugging a fascinating NSOC router issue that affects the entire VA and becoming a critical team member in modernizing the Appeals process for Veterans. If this type of problem interests you, consider joining USDS as an SRE.

askldjd commented Jun 28, 2017

blog-overview