raymondluong/benefits_infra_4-18-19.md

## benefits_infra_4-18-19.md

      
Benefits Infrastructure

April 18, 2019
Our Infrastructure

We host everything inside of Aptible. Why? It's HIPAA compliant!
We have different containers that run our web workers, environments, Postgres instances, and Redis instances.

Daniel signs into Aptible.com -
We have three environments — staging, production, and demo.
We can see the containers, services, databases, etc. for each environment.
We have replicas to use for backups and our validation.

Our CI Model

CI = continuous integration --> how do we get our application code to users?

Daniel walks through flow chart -
We create Docker images to pass into Aptible.

What has Daniel been working on?

Giving our HI Monitoring + Dashboards an upgrade.
We see all of our metrics through Datadog.
We need alerts for certain things (e.g. Postgres running out of disk space) instead of waiting for Aptible to email us.
We want to know about sparks before they turn into fires.
Three main metrics we should be keeping track of:

Memory usage - RAM, data that we store temporarily
Disk space usage - persistent storage, data that we store permanently
CPU usage

Demo


Daniel logs into Datadog and clicks on Monitors in the sidebar -
There are a bunch of monitors here! We can check the three metrics above. The data is pulled from Aptible.
Dashboards - five main dashboards: three for the metrics above, one for system health, and one for production replica health.
We can compare usage across environments.
Why does it look like a staircase? We bumped our disk space twice in the past few months.
Monitoring is based on percentages.
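
A minimal sketch of why percentage-based monitoring fits the staircase pattern above: the same amount of stored data is a smaller share of the disk after a bump, so the monitor drops back to OK. The threshold values, function names, and sizes here are assumptions for illustration; the notes do not record the values configured in Datadog.

```python
# Hypothetical thresholds; not the values configured in the real monitors.
WARN_PERCENT = 70.0
ALERT_PERCENT = 85.0


def percent_used(used_gb: float, total_gb: float) -> float:
    """Return usage as a percentage of total capacity."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return 100.0 * used_gb / total_gb


def status(percent: float) -> str:
    """Map a usage percentage to a monitor-style status."""
    if percent >= ALERT_PERCENT:
        return "ALERT"
    if percent >= WARN_PERCENT:
        return "WARN"
    return "OK"


# The same 90 GB of data before and after a made-up disk bump.
before = percent_used(90, 100)
after = percent_used(90, 200)
print(status(before), status(after))  # prints "ALERT OK"
```

Because the check is relative to capacity, the thresholds would not need to change each time the disk is resized.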

Questions

Nat: The dashboard says an OK status. How often does it ping the server?
Daniel: That comes from Aptible itself.
Nat: Does Datadog give you the ability to save history to work out uptime?
Daniel: Yes, that's available - we can take snapshots of specific time periods.
Brittney: What's been the biggest surprise for you?
Daniel: I don't think there's a surprise. It's a whole different mindset switching back and forth between product (features) and infrastructure (keeping our system up holistically).
Raymond: What is the Postgres replication lag alert, and why does it self-resolve?
Daniel: Replication lag - the data in the replica is behind what's in the prod database. It self-resolves when the replication and validation jobs finish.
Brittney: What's next?
Daniel: Getting secrets manager to work. Upgrade containers from Debian to Ubuntu. Update Redis instance from 2 to 4. Set up DogStatsD. Rotating to Infra team.
Brittney: Is it HIPAA compliant for something outside of Aptible to reach into Aptible to grab metrics?
Team: Depends on what is reaching in and what the metrics are.
Chao: Redis 2 to 4?
Daniel: Redis 4 handles memory fragmentation differently [...missed a bit here]. It'll let us get more granular monitoring.
Brittney: How did you know to upgrade Redis?
Daniel: Noticed weird metrics. Read articles that pointed out metrics. Stumbled upon memory fragmentation ratio and it seemed like something we needed to improve.
Brittney: Did you keep track of these resources?
Daniel: Yes, set up multiple Confluence pages for dashboards and metrics. Check out the Further Resources slide. Also I have a personal learnings journal.
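
For reference, the memory fragmentation ratio Daniel mentions can be worked out from two fields in the output of the Redis `INFO memory` command: `used_memory_rss` divided by `used_memory`. The sketch below assumes the INFO text has already been fetched; the helper names and sample numbers are made up for illustration and are not taken from our instances.

```python
def parse_info(text: str) -> dict:
    """Parse `key:value` lines from Redis INFO output, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def fragmentation_ratio(info: dict) -> float:
    """Memory the OS has given Redis (RSS) divided by memory Redis reports using."""
    return int(info["used_memory_rss"]) / int(info["used_memory"])


# Made-up sample values, not real measurements.
SAMPLE_INFO = "# Memory\r\nused_memory:1000000\r\nused_memory_rss:1500000\r\n"

ratio = fragmentation_ratio(parse_info(SAMPLE_INFO))
print(f"{ratio:.2f}")  # prints "1.50"
```

A ratio well above 1 generally points to fragmentation, which is the kind of "weird metric" that would show up on a dashboard.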