@maciejgryka
Last active December 26, 2018 08:18

Modern Python workers

We've recently had an interesting opportunity to experiment with deploying modern Python workers at Rainforest. We used to host most of our stack on Heroku[1], but it was not a good fit for this particular use case. This post explains the challenge we were facing and how we solved it, while also mentioning a bunch of cool tools that make development and deployment much less painful than it was a few years ago.

So what was our task? We wanted to run Python workers, each running for anywhere between a few minutes and a few hours. We also wanted the flexibility to support thousands of workers running simultaneously during high-traffic periods without paying for the infrastructure during times of low demand - basically dynamic scaling. When you're used to building web applications, where a request taking more than 1s is considered bad and demand is a little more constant, you have to change your perspective slightly, and probably the tools you're using. Heroku can work great for some things, but dynamic scaling is not really one of them (things like HireFire can help, but we wanted a bit more flexibility).

Other things we wanted to support: modern Python with whatever parts of the scientific stack we need (numpy, scipy, scikit-learn, OpenCV) and easy deployment. I don't know about you, but following the best practices of modern software development gives me a weird satisfaction, so making sure we had decent test coverage, a CI/CD pipeline, and an automatically-enforced coding style was fun. Don't judge me.

So to re-iterate, here's what we wanted:

  • scalability to thousands of parallel workers with no cost during down-times,
  • flexibility to run each worker for anywhere between a minute and an hour,
  • easy deployment of non-trivial code (e.g. C or Go extensions called from Python),
  • decent test coverage and a CI/CD setup,
  • modern Python (we don't necessarily want to use the version shipped with Ubuntu),
  • nice things like code style checks.

Before we get to the solution, let's make this concrete and lay out where all the requirements come from. Rainforest provides a QA-as-a-Service solution: our customers write tests in plain English and we distribute them to humans to perform. Some of these tasks are repetitive, and we're working on automating as many of them as possible, while leaving the ones that require human judgment ("Does this website look OK?") to humans.

Our testers perform tests by connecting to virtual machines we provision for them: each tester gets a fresh VM that is created especially for that test and destroyed afterwards - which is pretty complex operationally, but a hard requirement for reproducibility. We want our automations to work in the same way, because there's a bunch of infrastructure we have on our VMs that we don't want to duplicate. As a consequence, our automations need to effectively pretend to be humans: control the virtual machines using mouse/keyboard events, receive screenshots, and make judgments about the state of the VM and how to proceed. A test can take anywhere from a minute to an hour to complete, since it effectively runs at human speed.

Turns out all this is totally doable and even pretty convenient! The tools we used are: Docker, AWS Batch, CircleCI, pytest, pyenv, pipenv, black. What follows is a brief overview of each and how they can all fit together.

Docker

This is probably not very surprising, but Docker makes it pretty easy to ship code along with whatever infrastructure and libraries you want to distribute. Do you want to build OpenCV yourself and use its Python bindings? Do you want to ship a Go library with your code? Do you want to call Ruby from Python? No problem, it's all pretty easy with Docker (which doesn't mean you should actually do that last one). I've put up a sample Dockerfile as a Gist so you can see what this looks like in practice.
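
To give a flavor, here's a rough sketch of what a worker image can look like - this is not the actual Gist linked above, and worker.py plus the package choices are placeholders:

```dockerfile
# Pin the base image for reproducible builds
FROM python:3.7-slim

# System packages that scientific Python libraries commonly need (illustrative)
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install locked dependencies first so Docker can cache this layer
COPY Pipfile Pipfile.lock ./
RUN pip install pipenv && pipenv install --system --deploy

COPY . .
CMD ["python", "worker.py"]
```

Copying the lockfiles before the rest of the code means dependency installation only re-runs when the dependencies actually change, which keeps rebuilds fast.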

pyenv

This would be a good time to talk about pyenv. It's a great tool for managing multiple Python versions on your machine. Under the hood it's "just" a bunch of shell scripts, and it makes getting a new version of Python as easy as pyenv install 3.7.1. If you want to use shiny new features like f-strings and data classes, this is probably your best option!
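
As a quick illustration of those features (f-strings need 3.6+, data classes 3.7+; Job here is a made-up example class, not something from our codebase):

```python
from dataclasses import dataclass


# @dataclass auto-generates __init__, __repr__, __eq__ etc. from the fields
@dataclass
class Job:
    test_id: int
    duration_s: float


job = Job(test_id=42, duration_s=93.5)
# f-strings interpolate arbitrary expressions, with format specs like :.1f
summary = f"job {job.test_id} took {job.duration_s / 60:.1f} minutes"
print(summary)  # job 42 took 1.6 minutes
```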

Pipenv

We've all been using pip for years now to manage Python packages in our projects, but Pipenv is a slightly more modern tools that has a bunch of nice advantages. I'll let you watch the video to get the details, but if you've ever used and liked requests (especially if you had to deal with raw urllib before that), you're likely to enjoy using Pipenv too - it was also wirtten by Kenneth Reitz.
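
The day-to-day workflow is only a few commands (worker.py is a placeholder for your entry point):

```shell
# Create a project virtualenv pinned to a pyenv-installed interpreter
pipenv --python 3.7.1

# Add a dependency; updates Pipfile and the hash-pinned Pipfile.lock
pipenv install requests

# Run a command inside the project's virtualenv
pipenv run python worker.py
```

The Pipfile.lock is the part that matters for deployment: it pins exact versions and hashes, so the image you build in CI contains exactly the packages you tested against.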

pytest

This is also a pretty standard recommendation, but testing is much nicer if you can use pytest. Among other benefits, it gives you nice syntax for assertions, fixtures etc. and is pretty much the standard way to test Python applications nowadays.
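
A tiny, hypothetical example of the pytest style - plain assert statements, no assertEqual boilerplate (parse_timeout is a toy helper, not real Rainforest code):

```python
# test_parse.py -- pytest auto-discovers test_* functions in test_* files


def parse_timeout(value: str) -> int:
    """Toy helper: parse a timeout like '5m' or '2h' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]


def test_parse_timeout():
    # plain asserts; on failure pytest shows both sides of the comparison
    assert parse_timeout("2h") == 7200
    assert parse_timeout("90s") == 90
```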

CircleCI

We use CircleCI to serve all our CI/CD needs, which is useful for a couple of things: besides running tests, you can also configure it to build your Docker containers and push them to a registry. We push to AWS ECR, since we want to use the images with Batch later on.
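
As a rough sketch, a CircleCI 2.0 build job that builds the image and pushes it to ECR might look like this - the registry URL, image name, and credentials setup are all placeholders:

```yaml
version: 2
jobs:
  build:
    docker:
      - image: circleci/python:3.7
    steps:
      - checkout
      # gives the job a remote Docker daemon to build against
      - setup_remote_docker
      - run: docker build -t worker:$CIRCLE_SHA1 .
      - run: |
          # assumes AWS credentials are configured in the project settings
          $(aws ecr get-login --no-include-email)
          docker tag worker:$CIRCLE_SHA1 123456789.dkr.ecr.us-east-1.amazonaws.com/worker:$CIRCLE_SHA1
          docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/worker:$CIRCLE_SHA1
```

Tagging with the commit SHA means every deploy is traceable back to the exact code that produced it.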

black

For reasons that are probably not entirely rational, of all the tools here, I enjoy black the most. It's an auto-formatter for Python code that has basically no configuration, meaning that if you use it you have to accept its opinionated approach. And that's a great thing! Suddenly code style becomes something you don't even have to think about - you can just set up your code editor to auto-format your code on save and you can stop caring about which quotes to use, how to break up your lines etc. You press "save" and all your code magically becomes beautifully PEP8-compliant.

You can also go one step further and add a check to your test suite that fails whenever committed code does not follow the style (black --check).
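
For reference, day-to-day usage is just a couple of commands (assuming black is installed in your environment):

```shell
pip install black
black .           # reformat everything in place
black --check .   # exit non-zero if anything would change; perfect for CI
```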

AWS Batch

Finally, a critical part of our infrastructure is AWS Batch - it handles launching however many workers we need and automatically scales them up and down. Setting it all up correctly does take a bit of expertise and you'll have to write some CloudFormation YAML, so you might want to ask a friendly ops person for help.
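
Once the compute environment, job queue, and job definition exist, submitting work from Python is a small boto3 call. A sketch under assumed names - the queue, job definition, and TEST_ID variable are all placeholders, not our actual setup:

```python
def make_job_request(test_id: int, queue: str = "workers-queue",
                     job_def: str = "worker-job-def") -> dict:
    """Build the kwargs for batch.submit_job (names here are placeholders)."""
    return {
        "jobName": f"automation-{test_id}",
        "jobQueue": queue,
        "jobDefinition": job_def,
        # containerOverrides lets each job carry its own environment/command
        "containerOverrides": {
            "environment": [{"name": "TEST_ID", "value": str(test_id)}]
        },
    }


if __name__ == "__main__":
    import boto3  # actually submitting requires AWS credentials

    batch = boto3.client("batch")
    batch.submit_job(**make_job_request(42))
```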

Conclusion

That's basically it - at the cost of some setup, you get a pretty nice and flexible way to ship Python code for longish-lived workers. Hope that helps!

[1] This is no longer the case - we've recently migrated to Kubernetes on GCP. This also has implications for this project, as we'll probably end up migrating this infrastructure to Kubernetes as well. Stay tuned for a future post about this!

ukd1 commented Dec 18, 2018

lgtm - maybe some images for the tools would brighten it up? is there a sample project?

@roman-dowakin

Looks nice, but I'd like to read more about AWS Batch in the blog post (as the title suggests). The paragraph about black is bigger than the one about AWS Batch :))

Maybe add how Batch works inside: how we specify the instance type, use SQS to distribute jobs, and push jobs to AWS Batch; how Batch spins up the Docker image on an EC2 server automatically; what problems we had with Batch and how we solved them (like the number of disk operations per 1 GB on Amazon).

I also agree with Russ about images.
As for a sample project, that could be an additional, more in-depth blog post showcasing one simple, fully working sample project.
