@shaunagm
Last active March 16, 2018 19:54
intro to ops for developers

If you're a developer who hasn't done much work with operations, it can be quite confusing! After spending several days reading documentation and tutorials, I decided to write the overview I wish I'd been handed at the beginning. This guide starts with a simple example - running a Django app on your laptop - and works up to running automated deployments of multiple instances simultaneously on Amazon Web Services.

I learn by working with concrete examples, so this guide makes a lot of arbitrary technology choices as it goes. Obviously this guide will work best if you're familiar with Django and interested in AWS, but hopefully it's still somewhat useful for a Rails developer who wants to use Azure or what have you.

Of course, I may well be missing key concepts or have gotten something very wrong. Any comments, feedback, etc are appreciated.

Okay, let's get started.

Deploying your web app

Every web app has a few basic elements:

  • a machine to run on

  • a location, aka an IP address or domain name pointing to that machine

  • a web server

  • an app which processes incoming requests and tells the server how to respond

  • and, optionally but frequently, a database that the app uses to store data

These elements are present even with toy projects, for example the app in the Django tutorial. Typing python manage.py runserver starts a simple web server, by default at 127.0.0.1:8000. 127.0.0.1 is the address for your local machine, and 8000 is the default port Django uses. Django configures its web server to take HTTP requests coming in on 127.0.0.1:8000 and pass them to the app to handle.

Django also makes it easy to connect your app to a database. By default, it uses sqlite, which is so lightweight that it does not need to be configured separately. If you switch out sqlite for, say, a MySQL or Postgres database, you'll need to create the database yourself, then supply the database name, port, an administrative user, and a user password in the settings.py file. This kind of database, which exists separately from the app itself and is communicated with in a client-server relationship, is much more common in production.
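
To make that concrete, here's a sketch of what a Postgres DATABASES setting in settings.py might look like. The database name, user, and host below are placeholders, read from environment variables with made-up defaults:

```python
import os

# A sketch of a Postgres DATABASES setting for Django's settings.py.
# The names and defaults here are placeholders; in production you'd
# supply real values via environment variables or a secrets store.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", "myapp"),
        "USER": os.environ.get("DB_USER", "myapp_admin"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", "localhost"),
        "PORT": os.environ.get("DB_PORT", "5432"),
    }
}
```

Reading the values from the environment (rather than hard-coding them) pays off later, when the same settings file needs to work locally and in production.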

So even running locally on your machine, your web app has all the essential elements. But your local machine isn't in production! What does the simplest version of a web app in production look like?

Deploying to a virtual private server

To begin with, you need to find a host for your app. You can't use your machine, which will stop responding to requests whenever you restart it or put it to sleep. So you make an account with a web hosting company like Linode. They give you a virtual private server, tell you its IP address, and give you some credentials to help you log in.

The IP address they provide can be used as your site's location, but you probably want a domain name. You can purchase a domain name at a domain name registrar. You'll create a record with that registrar (most likely an A record or a CNAME record) saying which IP address the domain name points to. They'll pass that information to nameservers around the globe so that anyone who wants to go to your domain will be sent to the correct IP address. (More about DNS.)

Now people can send requests to your app. To handle those requests, you need to set up a web server. Technically you could just run python manage.py runserver on your host, but Django's built-in development server isn't built for production, so you're better off using a more reliable solution. For instance, you could use the most popular web server, Apache.

Apache communicates with your Django app via a WSGI file. When you set up Apache, you'll configure it to send all traffic addressed to your domain name to that WSGI file. (WSGI stands for Web Server Gateway Interface and is the standard interface between servers and Python apps.)
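
To see what that interface actually looks like, here's a minimal WSGI application written by hand. (Django generates an equivalent callable for you in your project's wsgi.py; this hand-rolled version, with its made-up response body, is just to show the contract between server and app.)

```python
# A minimal WSGI application. The web server calls `application` once per
# request, passing in the request environ (a dict) and a start_response
# callable; the app returns an iterable of bytes as the response body.
def application(environ, start_response):
    status = "200 OK"
    body = "Hello from {}".format(environ.get("PATH_INFO", "/")).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

A real Django wsgi.py instead sets application = get_wsgi_application(), and your Apache configuration (via mod_wsgi) points at that file.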

Finally, you can set up a database on a single, simple host pretty much the same way as setting it up locally. You'll install the database software and any of its dependencies, create a database, user, password, etc and provide those variables to Django's settings.py file.

To get our code onto the virtual private server from our laptop, we can use SSH. If we're using a version control system like Git (and we should be!) we can set it up to use SSH, and type git push and git pull to get our code to and from the virtual private server.

Deploying to the cloud

Your app on its single host can handle a fair amount of traffic, especially if you are careful to optimize your code and use strategies like caching. But you may want to scale beyond what a single virtual private server can handle, in which case it's time to switch to cloud computing. Cloud computing allows you to provision more resources as you need them. There are many cloud computing providers, including Linode, but we'll use Amazon's Elastic Compute Cloud (EC2) service.

The process for setting up our app on a single EC2 instance is fairly similar to how we set things up on our virtual private server on Linode. Once we launch our EC2 instance, we can follow essentially the same steps: set up the web server, configure the WSGI connection to our app, set up the database, point our domain name at the instance's IP address, and push our code over SSH.

Deploying to a single EC2 instance is easy enough. But what happens when we want to scale using multiple EC2 instances?

Let's start by talking about what an EC2 instance is. An EC2 instance is actually created from an Amazon Machine Image, or AMI. The image can be just an operating system, or it can have applications like our Apache web server, our Django app, and our database pre-installed. Regardless, when we launch an EC2 instance, we're creating a copy of that image and running it. Once we launch the instance, we can make additional modifications. We place our EC2 instances within a virtual private cloud (VPC).

Amazon has an auto scaling service that lets us say things like, "When demand gets above X, I want Y number of EC2 instances made from this image." You can also manually increase or decrease the number of EC2 instances, or scale them up/down at a given time.
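
As a concrete (and entirely hypothetical) illustration, here's the kind of target-tracking policy that captures "when demand gets above X, scale out," expressed as the parameters you'd eventually hand to the auto scaling API. The group name and thresholds are made up:

```python
# Sketch of an auto scaling configuration: keep between 2 and 10 instances,
# and add or remove instances so average CPU utilization stays near 50%.
# All names and numbers here are illustrative, not prescriptive.
scaling_policy = {
    "AutoScalingGroupName": "my-django-app-asg",  # hypothetical group name
    "PolicyName": "cpu-target-50",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,  # the "X" in "when demand gets above X"
    },
}

# Bounds on how far the group may scale in either direction.
group_limits = {"MinSize": 2, "MaxSize": 10, "DesiredCapacity": 2}
```

Target tracking (hold a metric near a value) is usually simpler to reason about than step policies ("add 2 instances when CPU > 70%"), since Amazon handles both scale-out and scale-in for you.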

When we have multiple instances, we need to decide which instance handles which incoming request. This is called load balancing and (surprise!) Amazon provides a service for it: Elastic Load Balancing (ELB; we specifically want the Application Load Balancer, not the Network or Classic Load Balancer). The load balancer can run health checks and only send traffic to healthy instances. It can also route traffic based on paths, which is useful for microservice architectures.

Database management is also more complicated when we have multiple instances. Because we're frequently creating and terminating instances, we can't use instance storage - otherwise our data will disappear when the instance does. Here are some alternatives Amazon provides:

  • Elastic Block Store (EBS). EBS storage volumes exist independently of EC2 instances. EC2 instances can connect to them, but they don't disappear when nothing's connected to them.

  • Relational Database Service (RDS). Amazon's managed database service. It sits on top of EBS and provides an easy-to-use interface for storing data in one of six kinds of relational database engines.

  • Simple Storage Service (S3). S3 can store files or data objects, as opposed to RDS which is better for relational and hierarchical data.

Note that using a database service will make database schema updates even more complex. If you're using Amazon RDS, you should definitely spend time learning how to handle updates and how to minimize breaking schema changes.

Automating your deployment

Our deployment is getting pretty complex! We can make it easier to manage by automating it. The simplest way of automating our deployment would be to take all of the command line statements we've used above and stick them in a bash script, but we can do better than that, and there are plenty of existing tools to help us.

Let's start by picking a tool to oversee the whole deployment process, a "continuous integration" or "continuous delivery" tool like Jenkins, Travis, or CircleCI. Jenkins is the most popular, so we'll use that as our example here. Jenkins is open source software, so it's free to use, but you'll need to set it up on its own server; for instance, you can run it on an EC2 instance. If you want something cheaper and easier to get started with, CircleCI has a free tier.

We automate our deployment by building a deployment "pipeline" with Jenkins. The pipeline contains three high level steps: build, test, and deliver. We'll talk about testing later. For now, let's talk about build and delivery.

There are two approaches we can take during the build phase. The first is to simply build our app, for instance by grabbing it from a Github repo and compiling it. Later, in the delivery step, we'll add it to a simple base image and do a bunch of configuration. Alternatively, we can use a tool like Packer to build a machine image which includes our app, Apache, Python, Django, and any other dependencies. Either way, we have to handle this complexity - it's just a matter of when.

Once we've got our app built, we need to deliver it. We'll need to use an additional tool to specify all the details of how to run our production infrastructure, which Jenkins will call on during the deliver stage of the pipeline. This post provides a rundown of some of the different tools available (specifically, Chef, Puppet, Ansible, SaltStack, Terraform and CloudFormation) and why you might pick one over another. We're going to use Terraform.

Once we provide Terraform our AWS credentials, we can tell it to launch instances of various sizes based on our image. If we chose a simple base image, we'll need to install the additional apps and make sure they're configured at this step. If we handled that in the image building stage, we don't have to worry about it here.

We can also tell it to add an autoscaling group, a load balancer, and any number of other AWS services, along with the configuration details for each.

All of our initial account setup, with Amazon and any other providers, must still be done manually, with the various passwords and secret keys secured and/or stored in environment variables. But most of our work is now automated, making it easier to remember, change, and roll back.

The Story So Far

Let's take a moment to summarize what we've covered so far. We've now got an example Django app that's deployed automatically to multiple Amazon instances. We do this by:

  • Writing the relevant code in Python, using the Django framework, and pushing our code to Github.

  • Using Jenkins (our continuous delivery tool) to oversee our build pipeline.

  • Having Jenkins call on Packer to build a machine image from a base image, our Python code, an Apache web server, and various other dependencies.

  • Having Jenkins call on Terraform to launch EC2 instances with that image, configure IP addresses and the database service, set rules about autoscaling, and more.

Technically we could stop here. But there are some important elements of good devops practice that we're still missing.

Improving your web app

Testing

There are several different types of software testing, and your app will likely need at least a few of them. For instance, a Django app with a Javascript front-end might require Python and Javascript unit tests for individual functions and objects, as well as integration testing to see whether everything works together.

You can test your Django code with Django's inbuilt testing framework, which builds off Python's unittest. There are a variety of Javascript unit testing libraries, such as Mocha or Jasmine, and framework-specific tools like React-unit or Enzyme for unit testing React components. Selenium is a popular tool for integration tests, and has bindings to a variety of languages including Python and Javascript. Lettuce is a tool which lets you write Selenium integration tests in Python using "plain English".
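
Django's test framework builds on unittest, so stripped of Django specifics, a unit test looks something like the following. The slugify_title function here is a made-up stand-in for whatever piece of your app you're testing:

```python
import unittest

def slugify_title(title):
    """Hypothetical app function: turn a post title into a URL slug."""
    return title.strip().lower().replace(" ", "-")

class SlugifyTests(unittest.TestCase):
    # Each test_* method checks one behavior of the function under test.
    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify_title("Hello World"), "hello-world")

    def test_surrounding_whitespace_is_stripped(self):
        self.assertEqual(slugify_title("  Trimmed  "), "trimmed")
```

You'd run these locally with a command like python -m unittest (or python manage.py test for Django's version), which is exactly the kind of command we'll later hand to Jenkins.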

In development, we can run our unit and integration tests manually, but how do we automate them? We want to add tests to our deployment pipeline in between the build and deliver stages. Instead of manually invoking the tests from the command line, we can tell Jenkins what command to run and where. We can also tell Jenkins where to store any results from the test so that if any of the tests fail, we can begin investigating why.

These are the basic types of testing that can be automated. Other automated tools include linters like PyLint and JSLint which look for 'stylistic' errors and language-specific tools like mypy which does type-checking for Python. You will also want to implement stress/load testing, accessibility testing, and other types as you have capacity.

Logging & Monitoring

Once your app is in production, you'll want to keep an eye on it (monitoring) and store information about errors and failures (logging).

Your app should produce logs as it runs. For instance, you can add logs to a Django app via the Python standard library's logging module, or to Javascript with console.log() or a custom logging library. Your Apache web server will produce logs as well.
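
A minimal example of the standard library's logging module; the logger name, helper function, and messages are all illustrative:

```python
import logging

# Configure a basic log format. In a real Django app you'd normally do
# this via the LOGGING dict in settings.py rather than basicConfig.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("myapp.views")  # hypothetical module name

def charge_customer(amount):
    """Hypothetical view helper that logs as it works."""
    logger.info("charging customer: amount=%s", amount)
    if amount <= 0:
        logger.error("invalid charge amount: %s", amount)
        return False
    return True
```

Log lines like these are what you'd later drain to a service like CloudWatch Logs, where they can be searched and alerted on.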

It's considered good practice to "drain" these logs to an outside service. This keeps them from taking up space on your instances and also makes them easier to search and receive alerts for. Amazon has a service to do this called Amazon CloudWatch Logs. You can also get logs from various Amazon services such as the elastic load balancer or from the database. CloudWatch can also help you monitor your system.

Caching

You can vastly improve the performance of an app by "caching" some of your data. Cached data is stored in RAM, which is much quicker to access than data in a database. Your cache can also be used to store ephemeral state information, like session data, which frees up the app itself to be stateless. The two most common tools for caching are Redis and Memcached.

Because the cache stores ephemeral data, you could run the service on the same EC2 instance your app is running on, but that's not recommended, for two reasons. First, if you're storing session data on multiple EC2 instances, there's no way for the load balancer to know which instance to direct your traffic to, so you'll end up with different session data on different instances. Second, when an instance's cache is flushed, the database takes a heavy load until the cache gets built back up again. Separating the cache from the app instances makes it easier to remove data from the cache gracefully.

Adding a caching service is similar to adding a database service. You'll set it up on its own EC2 instance or using AWS ElastiCache service, then edit the CACHES setting in your app's settings.py file to point to the service. You can then use Django's cache framework to cache the site as a whole, individual pages, or individual elements or queries.
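
For instance, a CACHES setting pointing at an external Memcached node might look like this. The hostname is a made-up placeholder; you'd substitute your own ElastiCache endpoint:

```python
import os

# Sketch of a Django CACHES setting for an external Memcached node.
# The hostname below is a placeholder for an ElastiCache endpoint,
# supplied via an environment variable in production.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": os.environ.get(
            "CACHE_LOCATION", "my-cache.example.internal:11211"
        ),
    }
}
```

With this in place, Django's cache framework (cache_page decorators, the low-level cache.get/cache.set API, and the session backend) all talk to the external node instead of local memory.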

An additional type of caching is HTTP caching, with a tool like Varnish. Varnish sits in front of your web server. If a request is in its cache, it responds to the client without querying the web servers. This is useful for static content and can greatly ease the load on your servers. Unfortunately, it looks like Varnish is not easy to set up on AWS because there are some incompatibilities with the Elastic Load Balancer.

omfg containers

'Containerization' is a popular approach to developing applications at scale. Containers isolate applications from the machines they're running on, making it quicker and easier to build images and launch instances based off of them. You can pre-install your app and all of its dependencies on an image (streamlining the build part of the pipeline) and then supply most of the necessary post-launch configuration via environment variables (streamlining the delivery).

Docker is the most well known containerization tool. Docker Compose and Docker Swarm can be used to coordinate/configure containers in development and production respectively. Kubernetes is another popular container configuration tool, which can work with its own containers or with Docker containers. Amazon's Elastic Container Service (ECS) lets you configure containers on AWS, while Amazon Elastic Container Service for Kubernetes (EKS) lets you configure containers via Kubernetes on Amazon.

Containers are not a replacement for your build pipeline. Instead, they can be incorporated into your build pipeline. All of the major continuous delivery tools allow you to work with containers.

Security

It's important to encrypt traffic to your website using SSL/TLS. If you're using Amazon, their Certificate Manager (ACM) is free and simple to use. Other platforms may have similar services, or you can use a tool like Let's Encrypt.

It's also important to think about operations security. Here are some AWS-specific security pointers, courtesy of this write-up:

  • Never log in with your root account. Create a non-root user for yourself with the privileges you need, and create users for other teammates as well.

  • Use Amazon's IAM groups to assign permissions based on role/team.

  • Enable multi-factor authentication on the root user and mandate it for all other users.

  • Keep your access keys out of your code. They should be set as environment variables.

  • If possible, generate unique access keys for each third party who needs them. This makes them easy to revoke as needed.

  • You can give IAM roles to EC2 instances, rather than passing access keys via environment variables.
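
The "keep keys out of your code" advice can be as simple as reading credentials from the environment and failing fast when they're missing. The variable names here follow the usual AWS convention, but the helper function itself is just a sketch:

```python
import os

def get_aws_credentials():
    """Read AWS keys from the environment; never hard-code them in source."""
    try:
        return {
            "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
            "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        }
    except KeyError as missing:
        # Failing loudly at startup beats a confusing auth error later.
        raise RuntimeError(
            "Missing required environment variable: {}".format(missing)
        )
```

(And per the last bullet above, an EC2 instance with an IAM role attached doesn't need these variables at all; the AWS SDKs pick up the role's temporary credentials automatically.)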

The Amazon website provides some additional detail on how to configure security groups and access control lists to limit access to individual resources.

Summary

We now have a robust operations pipeline for our project:

  • We develop our Django app locally. Our Docker containers specify the development environment, which is kept as close as possible to the production environment, but which necessarily contains some differences. For instance, it uses a local test database rather than Amazon's RDS and a local cache rather than ElastiCache.

  • When our changes are merged on Github, a webhook tells Jenkins to kick-start the build pipeline process. Jenkins builds Docker containers, then runs the tests we've told it to, including unit and integration tests.

  • Jenkins references Terraform and determines how many EC2 instances need to be launched with our new images. Terraform, in turn, references our autoscaling group rules. It also checks whether any of our other existing infrastructure needs to be altered to match our desired state.

Our existing infrastructure, meanwhile, looks something like:

  • We've got a Virtual Private Cloud (VPC) with an Elastic IP that provides a consistent location for requests to our domain to be routed to.

  • An Elastic Load Balancer (ELB) that routes traffic to one of our several Elastic Compute Cloud (EC2) instances.

  • Running on these EC2 instances are our Docker containers. Within those containers we've got our app, our Apache server, and the various dependencies running, with configuration applied via environment variables set in the Dockerfile. These variables can then be referenced by our Django app; for instance, we can provide access information for the database service by having the DATABASES setting in settings.py read the environment variables passed in by the Dockerfile.

  • We use a logging and monitoring service to capture data about traffic, usage, and errors.

And that's it! Or rather, that's it for this overview - there's always plenty more to learn. :)
