PyCon 2014 notes

Friday, April 11

DNS

  • Lynn Root
  • roguelynn.com
  • roguelynn-spy.herokuapp.com

Use the scapy Python library for sniffing network traffic. Chrome does one DNS request for each autocomplete guess. Interesting.

DNS names end with a dot?

example.com vs. example.com.

Relative vs. absolute (FQDN) names matter if your local resolution has something funky going on.

dig is your friend.

dig +trace python.org

. is the root DNS server. Queries resolve down the hierarchy. . -> org -> python -> www.

Show all records for the name:

$ dig +nocmd +noqr +nostats pyladies.com -t ANY

DNS relies on caching so that the root servers aren't hammered. So we start at our local DNS and go out from there until we get a result, which is then cached for a TTL.

Query -> Local Cache -> "closer" name server -> authoritative name server.

TTL is a balancing act. Too long, and changes take forever to propagate. Too short, and the authoritative server gets hammered.

You usually can't get the entire zone file. dnsmap brute-forces subdomain lookups to discover extant subdomains.

You can run a DNS server from Twisted. Cool.

Unicast, multicast, etc. Anycast: a one-to-nearest association. Google uses this; someone in Australia looking up 8.8.8.8 gets the same response from a nearer server.

DANE (DNS-based Authentication of Named Entities) uses DNSSEC. Apparently, firewalls can intercept HTTPS traffic and fake your secure connection.

Service Discovery. SRV records. Spotify clients do SRV lookups to get a service access point to the web API.
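
A hedged sketch of such an SRV lookup, using the third-party dnspython package (not part of the talk); the Spotify record name is just illustrative:

import dns.resolver  # pip install dnspython (2.x API)

for rr in dns.resolver.resolve("_spotify-client._tcp.spotify.com", "SRV"):
    print(rr.priority, rr.weight, rr.port, rr.target)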

DHT: distributed hash table. Key-value store in DNS, distributed through the network. Spotify does this using TXT records.

Cache me if you can

Memcached:

  • Key/object store.

  • O(1) everything.

  • Primary data in relational database

  • data you can lose, regenerates slowly: persistent storage (mongo, redis)

  • data you can lose, regenerates quickly: RAM store (memcached)

Expiries given in seconds. USE CONSTANTS.

Use set_many/get_many/delete_many. Slowest part of operation is network latency.

incr/decr good for counters.

add good for not clobbering existing keys.
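
A minimal sketch of these calls with the python-memcached client (the talk's set_many/get_many correspond to this client's set_multi/get_multi; server address and key names assumed):

import memcache  # python-memcached

USER_TTL = 15 * 60  # expiries are in seconds: use named constants

mc = memcache.Client(["127.0.0.1:11211"])

# Batched calls: one network round trip instead of three.
mc.set_multi({"json.users.1": '{"name": "alice"}',
              "json.users.2": '{"name": "bob"}'}, time=USER_TTL)
users = mc.get_multi(["json.users.1", "json.users.2"])

mc.add("counters.logins", 0)  # add() won't clobber an existing key
mc.incr("counters.logins")    # atomic counter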

Facebook has a fork that lets you dump memcached to the disk. Huh.

Key Naming

  • ASCII based
  • Not crazy long (they have to be hashed; a few dozen characters)
  • Explicit! e.g. json.users.<user_id>

Bad naming: md5(sql_query). Don't use user input for cache names!

Memcached is not queryable. There is a debug interface, but don't use it!

Memcached cluster

With multiple nodes, the client decides which node to go to using hashing.

Memcached subtleties

Key stored at 8am w/ 2 hour expiration. What happens at 10am? Nothing. It gets removed if the client tries to fetch after 10am.

Memcached has a fixed amount of memory using an LRU cache. Objects can be evicted before expiration if memory fills up. Objects w/ oldest timestamp get evicted.

Sometimes memcached returns None even though the key's expiration hasn't been reached AND memory isn't full. Memcached pages its memory and divides each page into chunks of a particular slab class. If all pages of the needed class are full, memcached takes a free page and assigns it that class. If there are no free pages left, the LRU kicks in and evicts data. Each slab class has its own LRU.

memcached -v      # verbose output
memcached -M      # doesn't evict when out of memory, but errors
memcached -I 1k   # change slab page size
memcached -f 1.5  # change growth factor
man memcached     # is your friend

Common practices

Add a cache_name property to Django models. Use model versioning to invalidate cache names automatically.
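
A sketch of that idea; the model and field names are hypothetical:

from django.db import models

class UserProfile(models.Model):
    CACHE_VERSION = 2  # bump to invalidate every cached copy at once

    name = models.CharField(max_length=100)

    @property
    def cache_name(self):
        return "json.users.%d.v%d" % (self.pk, self.CACHE_VERSION)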

get_many returns a dict, which may not have all the keys you requested. You'll have to fill those in yourself.

Common problems

Thundering Herd problem: on a cache miss, if it's expensive to rebuild the object, a flurry of simultaneous requests will bury the application server in simultaneous builds. Solve the thundering herd with a lock object.
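
A sketch of the lock-object fix (helper names hypothetical), relying on add() being atomic so only one request wins the right to rebuild:

import time

def get_or_rebuild(mc, key, rebuild, ttl=300, lock_ttl=30):
    value = mc.get(key)
    if value is not None:
        return value
    if mc.add(key + ".lock", 1, time=lock_ttl):  # one winner rebuilds
        try:
            value = rebuild()
            mc.set(key, value, time=ttl)
        finally:
            mc.delete(key + ".lock")
        return value
    for _ in range(50):                          # losers poll briefly
        time.sleep(0.1)
        value = mc.get(key)
        if value is not None:
            return value
    return rebuild()                             # last resort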

Caching large values: say a large number of objects. Instead of caching them as one object, do a 2-phase fetch. Store the list of IDs, then store each object w/ set_many.
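
A sketch of the 2-phase fetch; the key scheme and the load_from_db helper are made up for illustration:

def cache_objects(mc, list_key, objects, ttl=600):
    mc.set(list_key, [o.pk for o in objects], time=ttl)            # phase 1: ID list
    mc.set_multi({"obj.%d" % o.pk: o for o in objects}, time=ttl)  # phase 2: objects

def fetch_objects(mc, list_key, load_from_db):
    ids = mc.get(list_key)
    if ids is None:
        return None                  # full miss: caller rebuilds everything
    found = mc.get_multi(["obj.%d" % pk for pk in ids])
    missing = [pk for pk in ids if "obj.%d" % pk not in found]
    # get_many/get_multi may return a partial dict; fill the holes yourself.
    found.update(("obj.%d" % pk, obj) for pk, obj in load_from_db(missing))
    return [found["obj.%d" % pk] for pk in ids]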

Paginated cache:

  1. Break big objects into smaller slices.
  2. Store each slice as a separate object
  3. Store the list of slices.

If some of the chunks get evicted, well, there you are.

Bootstrapping: Porting to Python 3

Objectives

  • Straddling Python 2/3 in a single codebase
  • Choosing target Python versions
  • Porting as an iterative process
    • Ordering components by dependencies
  • Adding test coverage to reduce risk (if you don't have good tests, you will lose)
  • Covering C extensions

Background

Ported Zope3, ZODB, WebOb, Pyramid, and other dependencies to Python 3. ~180 kLOC of Python, ~25 kLOC of C.

Porting strategies

Port once, abandon Python 2

  • Not the subject here
  • Customers / users still need Python 2
  • More feasible for applications than libraries
  • 2to3 may be a useful starting point.

"Fix up" at installation using 2to3

  • Python2 users unaffected
  • Python3 source "drifts" from canonical version. Bug reports don't match.
  • 2to3 painfully slow on large codebases.

"Straddling" in a single codebase. (Thought to be impossible, initially.)

  • Use compatible subset of Python syntax
  • Conditional imports mask stdlib changes
  • six module can help (but you might not need it)

Targeting Python Versions

Syntax changes make Python 2 before 2.6 hard

  • No b'' literals
  • No except Exception as e:
  • Much more cruft/pain

Python 2.6 is the bare reasonable minimum. 2.4/2.5 are long past EOL. But some folks need system Python in "enterprisey" systems.

Incompatibilities make Python 3 < 3.2 hard.

  • PEP 3333 fixes WSGI in Py3k
  • callable() restored in 3.2
  • 3.3 restores u'' literals.
  • 3.2 is "system Python3" on some LTS systems.

Summary: support 2.6, 2.7, 3.2+

Managing Risks

  • Ports are great opportunities for bug injection!
  • Fear of breaking working software is the barrier, even more than the effort required.
  • Some mitigations also improve your software.
    • Improved testing
    • Modernized idioms in Python2
    • Clarity in text vs. bytes.

Bottom-up Porting

Port packages with no dependencies first. Then port packages with already-ported dependencies. Note the Python versions supported by dependencies. Lather, rinse, repeat. Finally, port the application.

Common subset idioms

  • Read Lennart's book!
  • `python2.7 -3` can point out problem areas.
  • Modernize idioms in Python 2 code, e.g. exception syntax, with open() as f.
  • Distinguish bytes vs. text. Use b''/u'' for all literals. Quit letting Python promote things to unicode for you.
  • Adopt new syntax.
    • E.g. except ... as ...
    • print()
  • Use new stdlib facilities, e.g. io.BytesIO vs. StringIO.StringIO (sketch below)
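
A small sketch combining several of these idioms; runs on 2.6+ and 3.3+:

from __future__ import print_function

try:
    import configparser                   # conditional import masks the rename
except ImportError:
    import ConfigParser as configparser   # Python 2 name

from io import BytesIO                    # instead of StringIO.StringIO

data = b"explicit bytes"                  # b'' works on 2.6+ and 3.x
text = u"explicit text"                   # u'' works on 2.x and 3.3+ (not 3.2)
stream = BytesIO(data)

try:
    {}["missing"]
except KeyError as exc:                   # 'as' syntax works on 2.6+ and 3.x
    print("caught:", exc)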

Testing Avoids Hair-Loss

  • Untested code is where the bugs go to hide.
  • 100% coverage is ideal before porting.
    • Unit testing preferable for libraries
    • Functional testing best for applications
    • Subtle bugs hide in the libraries
  • Measure test coverage: pypi/coverage
  • Work to improve assertions as well as coverage
  • Assert contracts, not implementation details
    • Don't assert against exception types/formats; those are things that change between versions!
    • If at all feasible, convert doctests to Sphinx examples. Sphinx can run examples to make sure they don't break.
  • Automate running tests
  • tox helps ensure that tests pass under all supported versions. (Also test pypy!)
  • Don't run coverage on all your tests. Coverage is really slow. Just use it on one separate tox target on one version of python.

Considerations for C extensions

  • Testing C is harder!
  • http://python3porting.com/cextensions.html (Lennart's book)
  • Maintain a Python reference implementation
    • Easier to test
    • Supports PyPy
  • Design for same API as C
  • 100% coverage for Python
  • Ensure C version passes same tests.

Hygiene

  • Signal supported versions using Trove classifiers in PyPI
  • Consider bumping the major version. Allow users to stick with "safe" versions as you iterate.
  • Apply continuous integration
    • Travis CI
    • Jenkins
    • Shining Panda for Windows

Resources

  • python3porting.com
  • testrun.org/tox/latest
  • pypi.python.org/pypi/six
  • wiki.python.org has common idioms for Python 2/3

Enough Machine Learning to Make Hacker News Readable Again

An achievable goal: a personalized filter for Hacker News.

Machine learning is just applying statistics to big piles of data, using it to understand the data better or make predictions.

  1. Get data
  2. Engineer the data
  3. Train and tune models (SCIENCE!)
  4. Apply model to new data

Use scikit-learn. The documentation is fantastic. The hard part is installing SciPy.

The terminology is daunting. When you don't understand the math, go "blah blah blah" and keep on reading.

Supervised learning is when you have input data and output data. Unsupervised learning is about understanding your data; visualization, grouping, etc.

We'll focus on supervised learning.

Good books:

  • NLP with Python
  • Building Machine Learning Systems with Python
  • Learning scikit-learn: Machine Learning in Python
  • Programming Collective Intelligence

Parallel arrays: (x, y): x is article, y is category.

Set aside a validation set: data your model hasn't seen during training. Take 25% of your data and use it at the end to validate your learning.

Hyper-parameters are magical for tuning. GridSearch is a great tool.

You'll see a lot of these functions:

  • transform()
  • fit()
  • predict() # SCIENCE!

transform(X[, y])

fit(X, y): X, what it gets; y, what the result should be.

predict(X): predict based on the fit() training.
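
A hedged sketch of that flow with scikit-learn on toy data (modern import paths; the talk predates some of them):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

titles = ["Show HN: my weekend project", "Celebrity gossip roundup",
          "Ask HN: how do you test?", "10 shocking photos"]
labels = [1, 0, 1, 0]          # 1 = non-dreck, 0 = dreck

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(titles)  # fit() learns the vocabulary; transform() encodes
clf = LogisticRegression()
clf.fit(X, labels)             # fit(X, y): inputs and the results they should give

print(clf.predict(vec.transform(["Ask HN: machine learning?"])))  # SCIENCE!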

Get the Data: the Hard Part

requests & lxml

Classifying Dreck and Non-Dreck: he wrote a web app to classify 5,000 articles. 20% were Non-Dreck.

Data: Title, URL, Submitter, Content of Link, Rank, Votes, Comments, Time of Day, Dreck or Not.

Turning that messy data into normalized NumPy arrays: "Time flies like an arrow, fruit flies like bananas"

  • Bag of Words : count occurrences: "flies": 2, "arrow": 1, "bitcoin": 0
  • n-grams: time flies, flies like, like an, like bananas
  • Normalization: stemming
  • Stop words: cut out the useless words (articles, etc.)
  • TF-IDF: Term-Frequency, Inverse Document Frequency. (e.g. an article about bitcoin has more refs to bitcoin than an article not about it)

Engineering Features

  • Pull out the relevant text (readability package)
  • Roll your own features (e.g. bump up long-form content)
  • Combine features (pipeline w/ TF-IDF with long-form feature)
  • Hostname pipeline: extract hostnames into numpy array. Pickle your built classifier (save the data to recreate it) then use the classifier to predict.

What's possible?

  • Use unsupervised learning.
  • Predict numerical scores.
  • Watch an RSS feed.
  • Auto-submit it!

How to get started w/ Machine Learning

Melanie Warrick nyghtowl.io @nyghtowl

  • Hackbright Academy
  • Zipfian Academy

Covering:

  • Machine Learning Overview
  • AI, data science, big data relationships
  • Example code, linear regression
  • Algorithms & tools
  • Skills and resources

"Computers... ability to learn without... explicit programing. Arthur Samuel (1959)

  • Build a model that finds patterns and/or predicts results
  • Apply algorithms
  • Pick best result for pattern match or prediction

Ex: spam detection, weather prediction

What is a model?

Linear regression (line fitting)

y = mx + b

Find the best-fit m and b to predict/pattern-match. (E.g. plotting high school GPA vs. university GPA.)

  • Handwritten address recognition

  • Search engines (Google, Bing)

  • Twitter and Facebook friend recommendations, Netflix

  • Fraud detection

  • Weather prediction

  • Face detection

  • AI, helping machines make better decisions. Intelligence exhibited by machines or software

  • Data Science, helping people make better decisions. Get knowledge from data & create products

  • Big Data challenges both AI and Data Science. Data volumes beyond the ability of common tech to capture and curate. (2 GB == 20 yards of shelved books; 50 PB == the entire written works of humankind)

Project flow

  • Define goal and metrics
  • Gather and clean data
  • Explore and analyze
  • Identify algorithm or method (ML)
  • Build model (ML)
  • Evaluate results (ML)
  • Iterate
  • Create data product, visualization.
  • Make decisions.

Ex: Linear Regression

Using pandas for data frames and scikit-learn.

Predict brain weight from head size. Head size is x, brain weight is y.

Cross-validation: hold out a certain percentage of the training data for testing and evaluating the model.

Metrics for evaluating a model: R-squared, where 1 is a perfect prediction.
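
A sketch of the head-size example with pandas and scikit-learn; the numbers and column names are invented:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"head_size": [3500, 4000, 4500, 5000, 3800, 4200],
                   "brain_weight": [1200, 1350, 1500, 1600, 1280, 1400]})

X, y = df[["head_size"]], df["brain_weight"]
# Cross-validation: hold out 25% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = LinearRegression().fit(X_train, y_train)  # y = mx + b
print(model.coef_, model.intercept_)              # m and b
print(model.score(X_test, y_test))                # R-squared; 1.0 is perfect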

Use matplotlib and seaborn for visualization (seaborn for prettification). Visualization helps you understand how a model is working.

Machine Learning Algorithms

  • Unsupervised, continuous: clustering and dimensionality reduction (SVD, PCA, K-means)
  • Unsupervised, categorical: association analysis (Apriori, FP-Growth), Hidden Markov model
  • Supervised, continuous: regression (linear, polynomial), decision trees, random forests.
  • Supervised, categorical: Classification (KNN, trees, logistic regression, Naive-Bayes, SVM)

Machine Learning Key Tools

  • Test model: scikit-learn, matplotlib
  • Explore data: pandas, StatsModels, matplotlib, NumPy, Unix tools
  • Build model: scikit-learn, NumPy, pandas, SciPy
  • Visualize: D3, matplotlib, Vincent, Vega, ggplot

Machine Learning Skills to Build

  • Algorithms
  • Statistics (probability, inferential, descriptive)
  • Linear Algebra (vectors & matrices)
  • Data analysis (intuition)
  • SQL, Python, R, Java, Scala (programming)
  • Databases & APIs (get data)

Machine Learning Resources

  • Andrew Ng's Machine Learning on Coursera
  • Khan Academy (linear algebra, stats)
  • "Think Stats" - Allen Downey
  • Zipfian's practical intro to data science
  • Metacademy
  • Open Source data science masters
  • StackOverflow, Data Tau, Kaggle
  • Mentors!

Getting Started w/ Salt

  • Peter Baumgartner, Founder of Lincoln Loop
  • lincolnloop.com
  • @ipmb

SaltStack is: configuration management. Version control your servers, self-documenting, repeatable, reusable.

SaltStack is: remote execution. Deploy your code, run one-off scripts, critical package updates, system monitoring.

Why SaltStack?

  • Familiar tools: Python/YAML/Jinja2.
  • Community: Great documentation, insanely responsive (IRC, GitHub), backed by for-profit org.

Why Not SaltStack?

  • Young project
  • Moves fast
  • Not SSH (new SSH support is "alpha")

Learning Salt

Vocabulary lesson

  • Chef: knife, recipe, cookbook

  • Puppet: terminus, metaparameters

  • Ansible: playbook, inventory

  • Master: server that manages the whole stack

  • Minion: a server controlled by the master

  • State: a declarative representation of system state

  • Grain: static information about a minion (RAM, CPUs, OS, etc.)

  • Pillar: variables for one or more minions (ports, file paths, config parameters)

  • Top file: matches states or pillars to minions

  • Highstate: all the state data for a minion

Getting started

Master server: apt-get install salt-master ... or run masterless

Minion: apt-get install salt-minion; echo "10.10.1.1 salt" >> /etc/hosts

Accept the minion key on the master.

Advanced topics

  • Salt-cloud
  • Custom modules
  • Scheduler
  • Renderers
  • Returners (return to email, sentry, syslog)
  • Reactor

Tips & tricks:

  • In minion conf, output_mode: mixed
  • Jinja2 is powerful. Don't go nuts.
  • Update often, and review the change log.
  • Test before you deploy. Make friends w/ Vagrant or Docker.

Castle Anthrax: Dungeon Generation Techniques

Designing procedurally generated content for games.

How do we represent tiles?

  • List of lists?
  • Single list? (Row Major Order) offset = (row * num_cols) + column
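
The row-major computation as a one-line helper (a minimal sketch):

def tile_at(tiles, row, col, num_cols):
    return tiles[row * num_cols + col]  # offset = (row * num_cols) + column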

Mazes

  • Depth-first search (backtrack): long, twisty corridors
  • "Prim's algorithm" (sp?) A*-style search: short, blocky corridors

Placing rooms: binary tree

Space-partitioning algorithm

Start with an open grid. Split it. Recurse on the sub-parts. Stop at minimum size or maximum depth. From the lowest, widest portion of the tree, ascend and connect nodes.

Techniques for placing things or generating terrain:

  • Poisson Disks: distributing equidistant points across a space e.g. item placement
  • Cellular Automata: e.g. caves!
  • Perlin noise (simplex): e.g. pits!

Using constraint solvers

Take a bunch of variables with continuous or discrete (finite) ranges. Define constraints over a few variables at a time, e.g. the 8-queens problem.

Ex: I've got all these rooms, here's the exit. I don't care where the boss is, but he must be at least one room away from the exit. These enemies need a place. And there should be a health potion near the beginning.
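
A hedged sketch of that setup with the third-party python-constraint package (not named in the talk); the room layout is a toy with linear adjacency:

from constraint import Problem  # pip install python-constraint

rooms = list(range(6))          # room 0 is the start, room 5 the exit
problem = Problem()
problem.addVariable("boss", rooms)
problem.addVariable("potion", rooms)

problem.addConstraint(lambda b: abs(b - 5) >= 2, ["boss"])      # boss at least a room from the exit
problem.addConstraint(lambda p: p <= 1, ["potion"])             # health potion near the beginning
problem.addConstraint(lambda b, p: b != p, ["boss", "potion"])

print(problem.getSolution())    # e.g. {'boss': 3, 'potion': 1}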

Optimize!

  • Represent sets as bitmasks

  • Undo-stack

  • Use as few variables as possible, to stay in the discrete/finite domain.

  • Rogue Basin

  • How to build a constraint Propagator in a Weekend

  • goo.gl/sdrbkJ

  • Horton goo.gl/xpLTFB

  • PyGame, PyAngband

Lightning Talks

Structlog

Certificate-based SSH

Provides controlled, audited access to servers. Not a key-based solution!

  • Launch instances w/ cert authority
  • Users that need access request a cert
  • Security officer uses the ssh-ca tool
  • ssh-ca generates audit trail in S3: who, when, why, how long
  • Certificates include restrictions on use (time-based!)
  • OpenSSH logs the key id (email address)

Instead of host certificates, sign the host key.

  • github.com/cloudtools/ssh-ca
  • CERTIFICATES section of man ssh-keygen(1)

DIY Stuffed animals

  • github.com/caretdashcaret/Patternfy
  • Make magazine vol 38

Saturday, April 12

Lightning Talks

Erik Rose: pip install peep https://pypi.python.org/pypi/peep/1.1

Docker.io

Amjith Ramanujam

"Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient contianer that will run virtually anywhere*."

(* Anywhere meaning any reasonably modern Linux machine.)

What?

  • Like chroot, BSD jails, etc.
  • Uses Linux Containers
  • AUFS: Union Filesystem
  • Git-like versioning
  • REST API
  • Don't need a full Guest OS like in virtualization
  • Multiple containers share the same underlying libraries read-only. Any changes trigger copy-on-write.

Why?

  • Lightweight
  • Isolated instances
  • Faster than VMs (often under a second startup time)

Setup

  • docs.docker.io
  • OSX: boot2docker (minimal linux VM) + docker client

Terminology

  • Image: Read-only snapshot
  • Container: an instance of the image
  • Registry: PyPI for docker images
  • Repository: Projects in the Registry

Automation

Dockerfile: a series of commands

Network

Django port forwarding: docker run -d -p host:container django-docker

Misc

  • Volumes: mount folders host/container: docker run -v host_path:container_path django
  • Links: Service discovery through env vars docker run --link mysql:db --name webapp django
  • REST API
    • docker daemon is also a server

Solution

  • Postgres in a container
  • Django/nginx in a container
  • celery in a container

Makes testing very easy: Jenkins can run things in parallel that once had to be separated. Containers can be run on local machines.

Apparently, Docker.io recommends against running in production until v1.0.

Developing Django's Migrations

South was good for its time, but it had some bad initial design decisions and core underlying problems (see south.hacks).

The Initial Plan

  • Django: schema backend, ORM hooks
  • South 2: Migration handling, user interface

Revised plan

  • Django: Schema backend, ORM Hooks, Migration handling, User interface
  • South 2: Backport for 1.4-1.6

Logical separation:

  • SchemaEditor: schema backend, ORM hooks
  • Migrations: migration handling, UI

Not moving South into Django; instead, adding migrations to Django. Complete rewrite of South: new file format, many other things.

SchemaEditor

  • Abstracts schema operations across DBs.
  • Works in terms of Django fields/models.
  • Contains per-database workarounds.

django.db.migrations:

  • Migration file reader/writer
  • Dependency resolver
  • Autodetector
  • Applied/unapplied tracking

A new format

  • More concise
  • Declarative
  • Introspectable

In-memory running

  • Creates models from migration sets
  • Autodetector diffs created from on-disk
  • Used to feed SchemaEditor / ORM

DB peculiarities

Postgres: it's great

MySQL:

  • No transactional DDL
  • No CHECK constraints
  • Conflates UNIQUE and INDEX

Oracle:

  • Different SQL syntax
  • Picky about names
  • Can't convert to/from TextField (LOB)

SQLite:

  • AAAAAAAAAAAHHHHHHHHH
  • Altering tables? Schema introspection? What?

Backwards Compatibility

  • Django generally very good at this
  • Auto-applies first migration if tables exist
  • Ignores South-style migrations

Lessons Learnt

  • Explicit is better than implicit.
  • Abstracting DBs is hard. Wouldn't do it from scratch.
  • Composability rocks. It's simplified the code so much.
  • Feedback is vital. I'm just not mad enough to do nasty things to my code. Users always find your edge cases.

Designing Poetic APIs

Auden: "A -poet- programmer is, before anything else, a person who is passionately in love with language."

Programming is inventing new language. "Go learn Lisp or Haskell, it will change how you think about programming."

Sapir-Whorf hypothesis. Yes, it's fallen out of favor, esp. the strong form. But language still has flavor. Language influences how we think.

Wittgenstein: "The limits of my language are the limits of my world."

Having a symbol for something makes it mentally lighter-weight. Mental abstractions. Extracting symbols is the root of all human language. And software engineering!

Intellectual Intelligibility

Fowler: "Any fool can write code that a computer can understand. Good programmers write code that humans can understand."

Capture existing symbols, and use them in your API design. Take requests vs. urllib2 as a good example.

Principle 1: Don't be an Architecture Astronaut

Robert Storm Petersen: "It's hard to make predictions, especially about the future."

The first step of designing a new library is: don't design a new library. The best libraries are extracted, not invented.

Example: blessings extracted from nose plugin.

  • Identify the tasks
  • Identify language constructs
  • Identify patterns, protocols, interfaces, conventions

Principle 2: Consistency

Yeats: "Think like a wise man, but communicate in the language of the people."

The culture you are in has spent a lot of time building up conventions. Use them. Don't be weird or clever. This shows respect to your users.

Ex: Macintosh Human Interface Guidelines. When you've learned one program, you've learned them all.

Principle of Least Astonishment: Try not to surprise the user.

get(key, default) vs. fetch(default, key)

Warning Signs

  • Frequent references to your own docs or source
  • Feeling syntactically clever (novel syntax)

Brevity

George Eliot: "The finest language..."

Warning Signs

  • Copying and pasting when writing against your own API
  • Typing something irrelevant while grumbling "Why can't it just assume the obvious thing?"
  • Long arg list, suggesting a lack of sane defaults

Composability

"Perfection is achieved not there is nothing left to add but when there is nothing left to take away.

Two ways to go about this, one of them wrong.

print_formatted(...)
print_formatted(..., out=some_file)  # WRONG!!!

print(formatted(...))  # CORRECT!

Warning Signs

  • Classes w/ lots of state: lots of little classes struggling to get out. Ex: ElasticSearch PenaltyBox: didn't add it to Connection; made a new class.
  • Deep inheritance hierarchies. Inheritance inherits invariant baggage from above, and must tiptoe around them.
  • Violations of the Law of Demeter. "One dot rule." A.b is OK. A.b.c is not. A.b.c.d is right out.
  • Mocking in tests. Your code may have too many dependencies! Testable code is decoupled code. Some mocking may be necessary, if your framework requires it. Mocking not intrinsically evil, but a code smell.
  • Bolt-on options.

Plain Data

Churchill: "All the great things are simple, and many can be expressed in a single word..."

Reduce barriers to re-use. Ex: ConfigParser. Not idiomatic Python: dictionaries would be the expected result, but it forces you to use its own API for everything. You can't substitute anything else.

MyClass.read(filename)  # NO!
MyClass.parse(string)   # YES!

Warning Signs

  • Users immediately transform output to another format
  • Instantiating one object just to pass it to another
  • Rewriting language-provided things

Grooviness

Talmud, Ta'anith 7b: "The bad teacher's words fall on his pupils like harsh rain; the good teacher's, as gently as dew."

Sloping sides that nudge you to the center. Cut grooves in your APIs.

Avoid nonsense representations, e.g. two optional kwargs where exactly one is actually required.

Fail shallowly!

Resource acquisition is initialization

Don't have invariants that aren't invariant. Ex: designing a PoppableBalloon class: require filling at initialization.

Compelling examples: MacPaint. Nintendo platformers. Set a good example, and people will follow it forever. Users are docile: they will do what you tell them to do.

Warning Signs

  • Representable nonsense. You shouldn't even be able to say nonsense.
  • Invariants that aren't.
  • Lack of a clear starting point.
  • Long, complicated documentation.

Safety

Safety goes beyond grooves: the more danger, the higher the walls and the meaner the guard dogs in front.

rm *.pyc
rm *
rm -f *

How to report errors: Exceptions > Return values

Warning Signs

  • Docs that say "remember to..." or "make sure you...". If docs say "before" or "after", add a context manager.
  • Surprisingly few will report safety errors. People will blame themselves. Don't electrify the door knob.

Orderability

With orthogonality at the center, the principles divide into the lingual and mathematical halves of a Venn diagram: the left half helps humans read and use an API, the right half improves computability.

Q&A

  • Book: Making Software: chapters on API usability and linguistic influence.
  • Book: RESTful Web APIs (by the author of Beautiful Soup)
  • I like my code to read like English. Ex: not using verbs as function names, but nouns describing what's returned. sorted() as an example.
  • Decoupling has its tendrils in many places
  • How does change management fit into all this? Use semantic versioning! Compatibility is a place to bolt on an argument. Composability is one way to do it, via decoupling. Compatibility puts us in 4-dimensional space, and we get into time-based coupling.
  • Another principle: Fractalness, an API can be used at any level of abstraction.

Getting Started With Testing

Goals:

  • Show you a way to test
  • Remove mystery

Why test?

  • Know if your code works
  • Save time
  • Better code (more modularity, separation of concerns)
  • Remove fear, turn it into boredom
  • "Debugging is hard, testing is easy."

Yes, testing is hard.

  • A lot of work.
  • People (you) won't want to
  • But: it pays off
  • Fight chaos!

Roadmap

  • Growing tests
  • unittest
  • Mocks

First principles: Growing tests

First attempt: interactive.

  • Good: testing the code
  • Bad: not repeatable
  • Bad: labor intensive
  • Bad: is it right?

Second attempt: standalone python module exercising the code.

  • Good: testing the code
  • Better: repeatable
  • Better: low effort
  • Bad: are the results right?

Third attempt: print expected results

  • Good: repeatable w/ low effort
  • Better: explicit expected results
  • Bad: Have to check manually

Fourth attempt: check results automatically, print and assert

  • Good: repeatable with low effort
  • Good: explicit expected results
  • Good: results checked automatically
  • Bad: failure stops tests

Getting complicated!

  • Tests will grow
  • Real programs
  • real engineering

Good tests are:

  • Automated
  • Fast
  • Reliable
  • Informative
  • Focused

unittest

  • python stdlib
  • infrastructure for well-structured tests
  • patterned on xUnit

Test isolation

  • every test gets a new test object
  • tests can't affect each other
  • failure doesn't stop the next test

setUp and tearDown

  • Establish context
  • Common pre- or post- work

Test engineering

  • Treat your test code like real code. Engineer it.
  • Pro tip: use your own base TestCase subclass.
  • TestCase.assertRaises() works as a context manager!
  • Make your tests expressive. Refactor.
  • Extract repetitive boilerplate to setUp().
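
A sketch pulling those tips together (names hypothetical):

import unittest

class BaseTestCase(unittest.TestCase):   # your own base class: shared fixtures
    def setUp(self):
        self.stack = []                  # common pre-work lives here

    def assertEmpty(self, seq):          # project-specific assertion helper
        self.assertEqual(len(seq), 0)

class TestStack(BaseTestCase):
    def test_pop_empty(self):
        with self.assertRaises(IndexError):  # assertRaises as context manager
            self.stack.pop()
        self.assertEmpty(self.stack)

if __name__ == "__main__":
    unittest.main()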

Tests are real code!

  • Helper functions, classes, etc.
  • Can become significant!
  • Might need their own tests!

Mocks

Testing small amounts of code

  • Systems are built on layers

Dependencies are bad

  • More suspect code in each test
  • Slow components
  • Unpredictable components

Enter test doubles:

  • replace a component's dependencies
  • Focus on one component

The question should be:

  • assuming this outside service is working,
  • do my tests work?

Be careful not to skip code that needs to be tested when mocking!

Instead of stubbing our own method, we fake urllib.urlopen.

  • Stdlib is stubbed
  • All our code is run
  • No web access during test

Don't do all this yourself: use a mock object library, like mock or mox.

Mock objects:

  • automatic chameleons
  • act like any object
  • record what happened
  • patch context manager! with mock.patch('urllib.urlopen') as urlopen: urlopen.return_value = fake_yahoo (see the sketch below)
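
A fleshed-out sketch of that patch. The talk targeted Python 2's urllib.urlopen; on Python 3 the equivalent target is urllib.request.urlopen, and fake_yahoo is a canned response:

import urllib.request
from unittest import mock

def get_price(url):                      # code under test: all of it still runs
    return urllib.request.urlopen(url).read()

fake_yahoo = b"<html>canned quote page</html>"

with mock.patch("urllib.request.urlopen") as urlopen:  # stdlib is stubbed
    urlopen.return_value.read.return_value = fake_yahoo
    assert get_price("http://finance.example.com/q?s=PY") == fake_yahoo
    urlopen.assert_called_once()         # mocks record what happened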

Test Doubles:

  • powerful: isolates code
  • focuses tests
  • removes speed bumps and randomness
  • BUT: fragile tests!
  • Another, better way to do this: dependency injection

Tools

  • addCleanup: nicer than tearDown
  • doctest: only for testing docs
  • nose, py.test
  • ddt: data-driven tests
  • coverage
  • Selenium: browser tests
  • jenkins, Travis: ci

  • TDD: tests before code?
  • BDD: describe external behavior
  • Integration tests: bigger chunks
  • Load tests: how much traffic is OK?

Summing up

  • Complicated
  • Important
  • Worthy
  • Rewarding

Q&A

  • python-unittest-skeleton on github
  • TESTING IS ENGINEERING.

Unit-testing makes your code better

Assumption # 1:

You've at least started to drink the Kool-Aid.

  • either you're already writing unit tests
  • or you're ready to start, with or w/o this talk

Assumption #2:

Corollary of #1: You already get that unit testing helps make code more correct.

I'm talking about better on a higher plane: aesthetics, elegance, beauty.

Beautiful code is better code:

  • easier to understand
  • easier to extend
  • easier to reuse

Plan

Real life case study:

  • examine some untested code
  • work through adding tests
  • understand how imperfect design -> hairy tests
  • modify the design for simpler tests -> better code

Background

  • what is this code?
  • why does it exist?
  • where does it come from?
  • what requirements does it meet?

What is the code?

  • we measure the internet
  • we ping all your public IPs every couple of months
  • we traceroute everything
  • result: ~200m traces/day
  • throw it all in plain text!

Staying sane w/ plain text:

  • keep it simple, stupid
  • restrict the data tightly to avoid escaping
  • stay consistent even as data and requirements evolve

T3 files contain one record for each trace. Variable number of fields, just to keep things interesting.

TIP1 files contain one record summarizing all traces sent to a single target.

Lots of similarities. Common format, common library.

  • dozens of similar formats
  • writing new parser for each would be nuts
  • hence, GenericLineParser
  • with many subclasses: T3Parser, TIP1Parser, etc.

Requirements

  • structured
  • fast
  • flexible

Good news: when we start testing, the code meets all requirements.

Where to start?

You can't test an object if you can't construct it. So, start w/ the constructor. This goes double in cases like this, with a non-trivial constructor (complex internal logic, sometimes does I/O).

512 code paths through the constructor, based on args! Required only 6 test cases for one method, but definitely a code smell.

Constructors should be dead simple. Take arguments, store them. Be done.

Line parsers parse lines. Something else should open files. Convenience functions to the rescue!

Refactor the constructor, 6 tests to 3 tests.
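
A sketch of where that refactoring lands; zopen()'s body is assumed here, only its role comes from the talk:

import bz2
import gzip

def zopen(filename, mode="rb"):
    """Open a plain, gzipped, or bzipped file as bytes (body assumed)."""
    if filename.endswith(".gz"):
        return gzip.open(filename, mode)
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode)
    return open(filename, mode)

# The constructor now just stores arguments; opening files happens at the
# call site, e.g. parser.parse(zopen(path)).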

Progress so far

  • constructor simpler and shorter
  • other code can use zopen(), uzopen()
  • now supports gz files for free
  • less test code to maintain
  • fewer code paths to worry about; fewer code paths == fewer, simpler tests == better code

So I refactored some messy code. So what?

  • writing tests made me look deeper
  • made me read the code very carefully
  • made me see both the good side and the bad side

The courage to refactor

This is what unit-testing zealots like to boast about:

  • sounds hokey
  • sounds like something from a self-help book
  • but it's true!
  • absolutely no fear about tearing the line parser to pieces and putting it back together again, even though I didn't write it.

No happy ending... yet.

The applications that use this code are completely untested. I'm afraid to refactor.

  • easy to adapt existing clients of line parser to use uzopen()
  • ...

Costs of not testing

  • incorrect code (bugs caught late in the cycle)
  • fear of refactoring
  • code duplication (-> bug duplication!)
  • insufficient code reuse

Don't let this get you down

  • 1000 tests are better than 999 tests
  • 1 test vastly better than 0 tests
  • unit tests will never cover everything (don't try!); cover almost everything
  • you'll be surprised how much you can cover w/ effort.

Trojan horse time

  • Extreme programming!
  • Test-driven development!
  • Agile manifesto!
  • etc.

Conclusions

  • duh. Water is wet.
  • less obvious: writing unit tests makes code more beautiful
  • beautiful code is better, more reusable, more maintainable, more pleasant

Pushing Python: Lessons Learned Building a High Throughput Service in Python

Taba

  • Distributed event aggregation service
  • built w/ python, gevent, cython
  • 10,000,000 events/sec, 50,000 metrics, 1000 clients, 100 processors

Lesson #1: Get the data model right

Once you've committed to a model, it's very difficult to change it.

The model, the way you flow data through the system, makes a big difference in the performance of the system.

Lesson #2: State is hard

Don't reinvent the wheel. Offload state into db systems designed to handle it, or offload to client.

Centralize your state. Make request handlers stateless. Handlers are now resistant to failure and scalable up and down. Also makes deployments easier.

Lesson #3: Generators + Greenlets = Awesome

Asynchronous iterator!!!!! Fan out, fan in!

In iterator -> in queue -> worker greenlets -> out queue -> out iterator

  • JIT processing
  • Automatically switches through I/O
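
A hedged sketch of the fan-out/fan-in pattern with gevent; the structure is inferred from the bullets above, not Taba's actual code:

import gevent
from gevent.queue import Queue

_DONE = object()  # sentinel

def parallel_map(iterator, work, n_workers=4):
    """Fan items out to worker greenlets, fan results back into one iterator."""
    in_q, out_q = Queue(maxsize=100), Queue()

    def feeder():
        for item in iterator:
            in_q.put(item)           # blocks when full: bounds in-flight data
        for _ in range(n_workers):
            in_q.put(_DONE)

    def worker():
        while True:
            item = in_q.get()
            if item is _DONE:
                out_q.put(_DONE)
                return
            out_q.put(work(item))    # greenlets switch automatically on I/O

    gevent.spawn(feeder)
    for _ in range(n_workers):
        gevent.spawn(worker)

    finished = 0
    while finished < n_workers:
        result = out_q.get()
        if result is _DONE:
            finished += 1
        else:
            yield result             # results stream out just-in-time

# Usage: for result in parallel_map(urls, fetch): ...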

Lesson #4: CPython suffers from memory fragmentation

  • Fragmentation is when a process's heap is inefficiently used
  • The GC may report a low memory footprint, but the OS reports a much larger RSS.

Ways to fight fragmentation:

  • Avoid large numbers of small objects (esp. combo of many small objects and a few large objects)
  • Minimize in-flight data (less used, less fragmented). Generators are great for this.
  • Reference, don't copy

Hybrid memory management:

  • Use Cython to allocate page-sized blocks of pointers into the incoming chunk
  • Hand-off the whole thing to the GC to handle normally
  • For JSON, the resulting deserialized object points back to the large blob.

Ratcheting

  • Ratcheting is a pathological case of fragmentation, caused by the heap having to stay contiguous (CPython cannot compact memory)
  • Large object at end of heap, small object added after, large object freed, but heap can't be shrunk until small object is freed.

Fighting this:

  • Avoid persistent objects (sockets common offenders)
  • Anything that has to be persistent should be created as soon as possible at app startup, before processing data
  • Avoid letting the heap grow in the first place

Slow Python, Fast Python

What is performance? How fast things go. Fast websites sell more widgets on amazon.com.

Benchmarks are full of lies and nonsense.

Performance is specialization. We can achieve performance in our own apps by specializing.

Systems performance: what is the difference between micro and macro benchmarks? We understand unit and functional tests.

What is Python? It's the language we all get when we type python. Python is abstract now, with dozens of different machines. CPython is a specific machine.

Python isn't

  • Cython
  • C
  • Numba
  • RPython

These can make our apps faster, but they can't make our Python faster.

Untrue: Python is slow. Dynamic languages are slow.

Optimizing dynamic languages is simply different from optimizing typed languages.

You can monkey patch anything. How can you optimize that? Solved problem. Make assumptions, and make cheap checks.

Slow vs. harder to optimize. True, Python programs may run slow, but they can be optimized.

PyPy is an implementation of Python. It often runs your code faster than CPython.

Here's the deal: performance is about specialization. You choose good algorithms, I'll make them run fast. We have excellent strategies for optimizing dynamic code.

Use objects for objects, not dictionaries. Classes are more specialized than dictionaries. (Sketch below.)

Specialize your code for the use case. Python makes using general tools easier.
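
The class-vs-dict point in miniature (a sketch; __slots__ is an extra specialization the talk doesn't mention):

class Point(object):
    __slots__ = ("x", "y")    # fixed layout; also saves memory on CPython

    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1.0, 2.0)           # prefer this...
d = {"x": 1.0, "y": 2.0}      # ...to this, when the keys are really attributes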

Strings: don't copy when you don't have to.

Zero Buffer: work with strings in a sized buffer; manipulate without copying.

More myths:

  • Function calls are really expensive
  • Use builtins because they're fast
  • Don't write python in C or Java style

These get you to a local maximum. They might have worked on the CPython of yore. Try PyPy first.

One Python: conventions are key. Use care with the conventions you use. Use fast conventions.

Q&A

  • cProfile
  • pypi/line_profiler
  • optimize the algorithm first.
  • wiki page has algorithm time/complexity annotations of most python builtins
  • dicts are not always dicts in PyPy!

Performance Testing and Profiling: A Virtuous Cycle

Works at Magnetic, an online advertising provider. Many thousands of requests/second.

Overview

  • Performance testing web apps
  • Profiling w/ the standard library
  • Instrumentation
  • The Virtuous Cycle

Performance testing basics

  • Generate requests against your app (record and replay production)
  • Measure response time and error rate

Types of testing:

  • stress test
  • load test
  • (not talking about: spike test, soak test)

Stress testing

  • Generate excessive load
    • lots of requests
    • slow/difficult requests
    • adversarial testing
  • "How much can it take?"
  • Identify breaking point (esp. if you control synthetic load)

Not very good for identifying problems

Load testing

  • Generate specific, constant load
    • Expected conditions
    • Exaggerated conditions
  • "What if?"
  • Capacity planning

Best practices

  • Isolate testing from external influences
    • Use dedicated load testing environments
    • "scaled down" copies of all components
    • results are extrapolatable
  • Generate load consistently
    • Random considered harmful
    • Automate, automate, automate! One click!

Profiling

Batteries included:

  • cProfile, pstats
  • documentation not really included

Goofy, horrible API. Avoid run() and runctx().

import cProfile

profiler = cProfile.Profile()
profiler.enable()
# ... do stuff ...
profiler.disable()
profiler.dump_stats("myprogram.prof")

Then:

import pstats

stats = pstats.Stats("myprogram.prof")
stats.sort_stats("calls").print_stats()
stats.sort_stats("calls").print_stats("webapp.py")
stats.sort_stats("calls").print_callees("webapp.py:8(login)")
stats.sort_stats("calls").print_callers("hashpw")

Use filters!

Profiling in practice

  • "Why is it slow?"
  • Good for identifying un-optimized code
    • tight loops, recursion, lots of function calls
    • these are candidates for optimizations
  • Good for identifying bottlenecks
    • distinguish between slow external resource and slow app code

Other profilers

  • line_profiler: function decorator. prints at you.
  • yappi: profiles code across multiple threads; measure wall clock or CPU time. outputs profile data for pstats.

Instrumentation

  • Use statsd to collect time-series metrics (lightweight, low-overhead, always-on profiling)
  • Two key instruments:
    • counters let you know how many things happened
    • timers let you know how long they each took
  • Learn what's normal for your app (bonus: alert when things are not normal)
  • "Does the real world match expectations?"

Virtuous Cycle

Instrumentation & Alerting -> Performance Testing -> Profiling -> Performance optimization -> I&A

Lightning Talks

Writing Good commit messages

Why?

  • memory (short and long time scale)
  • collaboration

Two fundamental purposes of VC

  • remind your future self why you made that change
  • tells your colleagues why you made that change
  • tells your future colleagues and successors why you made that change!

Assume the person who will be maintaining your code in two years is an axe-wielding maniac who knows where you live.

Tell me WHY (and WHAT) you changed.

Rules:

  • Tell me what you did and why; the "what" is mostly obvious from the diff.
  • Brevity is the soul of wit. Keep the novels out of the commit log.
  • Pick a style and stick w/ it (real sentences or telegraph English)
  • Pick a grammatical mode and stick with it: present tense imperative is best and most fun.
  • Spelling counts. As do grammar and punctuation. Pick a style and be consistent (and comprehensible).
  • Teamwork counts. You are not working alone. Everybody should follow the same rules.
  • TELL ME WHY. WHYYYYY?

Things You Didn't Know

  • Larry Hastings

Command-line quoting for subprocess.Popen(string or list, ...):

string = " ".join(shlex.quote(arg) for arg in args)  # shlex.quote: 3.3+
args = shlex.split(string)

Liskov Violation: a violation of the Liskov Substitution Principle. If type T guarantees property P, a subtype S(T) must guarantee P too. Rect -> Square? Liskov violation! (Sketch below.)
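
The classic Rect/Square illustration, sketched in Python (not from the talk's slides):

class Rect(object):
    def __init__(self, w, h):
        self.w, self.h = w, h

    def set_width(self, w):   # property P: setting width leaves height alone
        self.w = w

class Square(Rect):
    def set_width(self, w):   # keeps the square square...
        self.w = self.h = w   # ...but silently breaks P for Rect clients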

Distributing your Python Game

Why write a game in Python?

  • Game jams
  • books
  • Raspberry Pi
  • school
  • why not?

Libraries and frameworks: http://wiki.python.org/moin/PythonGameLibraries

Build a MS Windows exe:

  • py2exe
  • pyinstaller
  • cx_freeze
  • cython

Server Security 101

  • Kevin Veroneau
  • @kveroneau
  • Pythondiary.com
  • Debiandiary.com
  • iamkevin.ca

Basics

  • Install fail2ban
  • Use IPTables to block IPs
  • Disable password auth in sshd_config
  • Always use priv/pub keys to connect
  • Disable SSHv1, only use SSHv2
  • Minimal packages
  • Configure and customize PAM

NEVER ALLOW ROOT TO LOG IN VIA SSH

  • Cannot stress enough
  • Always have a personal account and su to root.
  • Never have admins share the same accounts
  • Only give out root when absolutely needed
  • Configure sudo with commands the user may need to run
  • Have an ITIL system in place to grant access

Simpler solutions

  • Use modularity when possible
  • Web server on separate user/process than other components
  • If one service exploited, it limits the damage.

Configure pam_limits

  • configure /etc/security/limits.conf
  • Protect against fork bombs by limiting resources
  • His personal website uses an ncurses Python app to render the page in a vt102 terminal; it uses pam_limits to cap forked Python processes at 20

Partition the hard disc

  • If possible (not an option in the cloud, though you can mount a loopback filesystem)
  • Make sure the following are set up in fstab: /home, /var, /tmp mounted noexec, nosuid, nodev

Extreme

  • Mount the rootfs as RO
  • Build a live system in RAM!

Google Crisis Map

  • googlecrisismap.googlecode.com
  • Ka-Ping Yee
  • kpy@google.com
  • @zestyping

Publishing maps for disaster and humanitarian aid
