PyCon notes

Exploring is never boring: understanding CPython without reading the code

Observation

  • play the role of a 19th-century naturalist, coming back from an island to give a talk at the local scientific society

  • take into account history and evolution of the code

  • remember that not everything is intentional

  • observational astronomy: why are we seeing what we're seeing?

  • is it because of what we're looking at, or where we're looking?

  • texts are not (normally) designed to deceive or mislead you

  • but code can and will due to external constraints (deadlines, perf) or mistakes

  • inspect is a useful module for observation

  • inspect.getsource(foo) is equivalent to IPython's foo??

  • doesn't work on C functions

  • cinspect extends inspect to handle C code

  • use history and changelogs

  • Python used to have a good rep for very clean and readable C code

  • 15 years later, perf constraints have changed this

  • but you can go back and look at earlier versions!

  • hg blame -r revnum for Mercurial changelogs

  • look at the source

  • look at the AST

  • look at the bytecode

  • False is False is False is not equivalent to (False is False) is False

  • original version is actually a ternary compare

  • parenthesized version is not

  • difference is most obvious at the bytecode level (see the dis sketch below)
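
A quick way to see this is to disassemble both forms with the stdlib dis module; a minimal sketch (function names are made up, exact opcode listings vary by CPython version):

```python
import dis

def chained():
    # Ternary/chained comparison: roughly "False is False and (False is False)",
    # with the middle operand evaluated only once.
    return False is False is False

def grouped():
    # Two independent comparisons: the bool result of the first is compared to False.
    return (False is False) is False

dis.dis(chained)  # typically shows DUP_TOP / ROT_THREE and a conditional jump
dis.dis(grouped)  # typically shows two back-to-back COMPARE_OP instructions
```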

Experimentation

  • run experiments and test hypotheses

  • timeit module runs code snippets many times, best of 3, to minimize measurement errors and startup costs

  • python -m timeit -s "setup code" "foo()" (the -s flag supplies setup code that isn't timed; see the sketch at the end of this section)

  • write tests to demonstrate invariants

  • break CPython as much as you want (provided you don't contribute the breakage back)

  • poke stuff and see what happens
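
A minimal timeit sketch (the snippet and setup strings are just placeholders); setup runs once and is kept out of the measured loop:

```python
import timeit

# Time a snippet with its setup code excluded from the measurement.
elapsed = timeit.timeit("sqrt(1234.5678)",
                        setup="from math import sqrt",
                        number=1000000)
print(elapsed)

# Command-line equivalent:
#   python -m timeit -s "from math import sqrt" "sqrt(1234.5678)"
```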

Gradual Typing for Python 3

  • Guido van Rossum
  • Python developer
  • Dropbox

Timeline

  • 2006: PEP 3107 introduces annotation syntax but no semantics
  • 2013: MyPy adapted to use PEP 3107 syntax, List[T] for generics
  • 2015: PEP 484 for type hints and gradual typing, targeted at Python 3.5

PEP 484

  • static type checker outside runtime

    • Google, Dropbox have their own analyzers
    • products like Semmle and PyCharm
    • open source: MyPy
  • standard syntax for type hint annotations

  • stub files to add types to code you can't change:

    • C code
    • Files that need to be backwards compatible with Python 2 (no annotations)
    • Other people's code (OH: "monkey typing" 🙈)
  • aimed at static analyzers, IDEs

  • many idioms in Python that defeat unannotated static analysis

  • static type checkers will help warn you if your annotations are incorrect

  • a big IDE developer said they can currently work out types for about 50-60% of code

  • Python 3.5 type checker is provisional (PEP 411)

  • code generation is not a focus of PEP 484

  • neither CPython nor PyPy uses them (yet)

  • Cython can use them, optionally

Type hint syntax

  • @no_type_check decorator for disabling checking if you're using incompatible annotations

  • unannotated functions are treated as if they had an annotation of Any for every param & return

  • Any is a superclass and subclass of every object

  • breaks issubclass transitivity

  • creates a new "is consistent with" relationship between types

  • Jeremy Siek's "What is Gradual Typing" blog post

  • typing.py provides Any and other helpers such as Dict, List, Union, Callable, Tuple

  • only concrete change from proposal

  • backwards-compatible with Python 3.2 to 3.4

  • can use builtin types and your own classes as type annotations

  • unparameterized types only

  • new bracket magic for parameterized types: def foo() -> List[Tuple[float, float]]:

  • can't use list[str] because list is already an object and doesn't have an index operator

  • typing.Tuple is treated more like a struct than a sequence

  • implementation: all typing stuff is derived from a metaclass that abuses __getitem__

  • types are for the type checker, classes are for the runtime

Helpers from typing

  • Union[a, b] is equivalent to the a|b notation in PHPDoc-style docstrings
  • Optional[int] is sugar for Union[int, None]
  • Tuple[float, float] is a 2-tuple
  • Tuple[float, ...] is an immutable sequence of float (ellipsis is the slicing Ellipsis)
  • Callable[[arg1, arg2], return]
  • Callable[..., float] is a function of complicated args/kwargs returning float
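
A hedged sketch of how these helpers look in annotated signatures (the functions themselves are invented for illustration):

```python
from typing import Callable, List, Optional, Tuple, Union

def scale(points: List[Tuple[float, float]], factor: float) -> List[Tuple[float, float]]:
    return [(x * factor, y * factor) for x, y in points]

def parse_port(value: Union[str, int]) -> Optional[int]:
    try:
        return int(value)
    except ValueError:
        return None            # Optional[int] is sugar for Union[int, None]

def apply_twice(f: Callable[[int], int], x: int) -> int:
    return f(f(x))
```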

Make your own generic types

  • typing.TypeVar, typing.Generic

  • T = TypeVar('T')

  • class Chart(Generic[T]):

  • def foo(self) -> T:

  • "a watered-down version of things you can do with Java"

  • in general you can define type aliases using typing objects: AnyStr = Union[str, bytes]

  • interesting problem: split(s: AnyStr, sep: AnyStr) -> List[AnyStr]

  • s and sep need to be the same type

  • constrained type variables: AnyStr = TypeVar('AnyStr', str, bytes) must be str or bytes

  • AnyStr is actually predefined in typing
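
Putting the pieces together, a minimal sketch of a user-defined generic class along the lines of the Chart example, plus a constrained TypeVar like the predefined AnyStr (method bodies are invented):

```python
from typing import Generic, List, TypeVar

T = TypeVar('T')

class Chart(Generic[T]):
    def __init__(self) -> None:
        self._items = []  # type: List[T]   # "# type:" comment keeps 3.2-3.4 compatibility

    def add(self, item: T) -> None:
        self._items.append(item)

    def first(self) -> T:
        return self._items[0]

# A constrained type variable: values must be str or bytes, and s and sep must match.
AnyStr = TypeVar('AnyStr', str, bytes)

def split(s: AnyStr, sep: AnyStr) -> List[AnyStr]:
    return s.split(sep)
```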

Trivia

  • forward defs: use strings. class Node: def set_left(self, n: 'Node'):

  • variable annotations: # type: <type> comments (considered a pragmatic compromise)

  • isinstance extended to take types: isinstance(42, Union[int, str]) works

  • stub files end in .pyi and have same syntax as Python but with everything stubbed out

  • checker prefers stubs to real files

  • no multiple dispatch. faked with @overload decorator, which is only allowed in stubs.

  • can use @overload to define the same function multiple times with different types

  • ex: __getitem__ with int or slice arguments

  • probably better to use constrained type variables

  • implementation constrained by desire not to add any new syntax or C code

  • back-compatible to Python 3.2
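
A small sketch of the trivia above: a string forward reference, a "# type:" comment, and @overload as it would appear in a stub (class and method names are illustrative):

```python
class Node:
    def set_left(self, n: 'Node') -> None:   # forward reference as a string literal
        # variable annotation via a "# type:" comment:
        self.left = n  # type: Node

# In a .pyi stub file, typing.overload declares alternate signatures with stubbed bodies:
#
#   @overload
#   def __getitem__(self, index: int) -> str: ...
#   @overload
#   def __getitem__(self, index: slice) -> 'MySeq': ...
```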

Graph database patterns in Python

  • Elizabeth Ramirez

  • Engineer for New York Times, search and semantics group

  • definition of property graph

  • graphs at scale with Titan, Cassandra, and ElasticSearch

  • Gremlin Query Language

  • Python patterns for Titan models

  • uses of graph DBs:

    • semantic web
    • network impact analysis: if one node goes down, what else breaks?
    • useful for highly interconnected stuff in general

Property graph

  • directed
  • edges and vertices are both labeled and have properties attached
  • vertices have lists of incoming and outgoing edges
  • edges store from and to vertices

Properties of graph DBs

  • graph DBs provide "index-free adjacency", meaning you don't need to hit an adjacency index to get a node's immediate neighbors
  • graph DBs are not schemaless and need some kind of schema to prevent inconsistencies

TinkerPop stack

  • Blueprints is Java lib for graph data structures
  • Pipes are basically extended iterators
  • Gremlin is a query lang on top of Blueprints, Pipes, and Groovy
  • Rexster is a REST server for Titan
    • also provides RexPro binary protocol, which is what Elizabeth uses in production due to poor performance of HTTP

Why Titan?

  • multiple options for storage and search
  • already distributed
  • small/medium graph: 10M-100M edges
  • Titan setup for that graph size:
    • 3 Cassandra nodes
    • 2 ES nodes
    • 1 Titan node
    • 1 Rexster node
  • all JVM
  • can run Cassandra and Titan in same JVM, but not recommended

Semantic knowledge

  • synonyms
  • concepts related
  • concepts combined: "acid" + "attack" combinations result in "acid attack" concept
  • extraction rules to deal with variations like "Obama", "Barack Obama", "President Obama"

Gremlin queries

  • look up vertex by ID

  • all vertices with a particular property value

    • can be handled from index without hitting the whole graph
  • retrieve a vertex's outbound adjacents of a certain type

  • get all edges for a vertex

  • more complicated query: go from ebola, to all virus topics related to ebola, to the combination of science topics and medicine topics related to ebola

  • basically end up writing Groovy that looks like IEnumerable chained expressions

  • Gremlin can use external ES indexes to answer GIS location queries, full text queries

How can we map Gremlin syntax to Python?

  • pipe patterns: transform, filter, sideEffect
  • bunch of Python functions that basically end up generating a Gremlin query
  • once you have one, send it to Rexpro, get back your result set
  • single metaclass for vertex and edge model classes, since they share things like property maps
  • parent classes for vertex and edge
  • vertex model class for extraction rule derives from vertex class but uses common metaclass
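
A purely hypothetical sketch of that pattern (none of these class or method names come from the talk or from a real library): a shared metaclass collects declared properties, and model classes build Gremlin strings to send over RexPro.

```python
class Property:
    """Marker for a declared model property."""

class ModelMeta(type):
    """Shared by vertex and edge models: collects declared property names."""
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        cls.property_names = [k for k, v in namespace.items()
                              if not k.startswith('_') and isinstance(v, Property)]
        return cls

class Vertex(metaclass=ModelMeta):
    @classmethod
    def outbound(cls, vertex_id, edge_label):
        # Build a Gremlin snippet; in production the string is sent over RexPro.
        return "g.v(%d).out('%s')" % (vertex_id, edge_label)

class ExtractionRule(Vertex):
    name = Property()
    pattern = Property()
```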

http2

  • Cory Benfield
  • Metaswitch Networks
  • requests, urllib3 core contributor
  • HTTPBis and HTTP/2 IETF working group member
  • implemented the hyper http2 stack for Python
  • Twitter: @lukasaoz
  • GH: @lukasa

HTTP 1.1 is inefficient

  • uses TCP poorly
  • TCP works best if you keep your connections alive so it can adapt behavior
  • HTTP 1.1 generally doesn't reuse connections
  • results in a lot of concurrent connections to get all the resources
  • or nasty hacks like:
    • image spriting
    • inlining resources as data: URLs
    • CSS/JS concatenation
  • these hacks lead to poor performance with HTTP caching: change one resource and you need to regenerate your big combo files

HTTP 2 is a binary protocol

  • based on length-prefixed frames

  • not at all readable, but easy to parse

  • HTTP 1.1 makes it difficult to predict resource allocations like header sizes

  • harder for embedded developers

bonus features on top of new protocol

muxing with priority and flow control

  • single req/resp pair is called a "stream" and has a stream ID (ex: 56)
  • priorities prevent "head of line blocking"

HPACK header compression

  • domain-specific compression
  • includes commands like "resend header 1 from last time" (good for user agents)
  • can blacklist things like password headers from compression, which prevents BREACH/CRIME/other compression oracle attacks

server push

  • send resources you know the client is going to want soon, before the client asks
  • ex: JS and CSS dependencies for a page

HTTP 2 has not been well received

  • phk hates it

  • tricky to reason about

  • interpreting a request requires knowledge of prior requests

  • tools now need to export connection state in debug logs

  • you need to do a ton of interop testing

  • nasty edge cases resulting from back compat with HTTP 1.1

  • HTTP 2 total headers limited to 16k

  • Kerberos frequently generates single headers that large

HTTP 2 is inherently concurrent

  • fun problem for Python

  • makes requests support very tricky

  • Cory expects asyncio adoption to spike based on nature of HTTP 2

  • gophertiles demo compares HTTP 1.1 and 2

  • shows off parallel download by sending an image as a pile of tiles

  • 34 known implementations of HTTP 2

  • nghttp2 is the open source reference implementation that does everything, client and server

  • nghttp2 is also a nightmare to compile

  • Wireshark supports HTTP 2 frames (but you need a TLS-enabled build for most uses)

hyper

  • Python's only HTTP 2 implementation

  • client only

  • similar to http.client in scope

  • designed to go at the bottom of a more featureful library

  • https://github.com/Lukasa/hyper

  • http2bin.org is the successor to httpbin.org

  • running behind H2O (HTTP 2 and 1.1 reverse proxy)

  • H2O can be used to wrap an HTTP 1.1 service in HTTP 2
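
A minimal client sketch based on hyper's documented API around that time (exact class names and defaults may have changed in later releases):

```python
from hyper import HTTPConnection

conn = HTTPConnection('http2bin.org:443')   # negotiates HTTP/2 over TLS
conn.request('GET', '/get')
resp = conn.get_response()
print(resp.status)
print(resp.read())
```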

Further reading

  • http://daniel.haxx.se/http2/ (from the author of cURL)

  • Apache mod_h2 is built on top of nghttp2

  • HTTP 2 is already more widely used than IPv6

  • Google, Chrome, Twitter, Facebook (soon)

  • popular web frameworks will need extensions to support push features

  • machine learning frameworks might support a predictive push reverse proxy

HTTP 2 for dumb embedded platforms (Arduino, etc.)

  • TLS can be omitted
  • flow control can be reduced to a sort of no-op case that always sends all the data you ask for
  • hyper actually disabled flow control at one point
  • HPACK previous-header reuse can be disabled
  • HPACK Huffman decoding probably can't be disabled

Lessons learned with asyncio

  • https://us.pycon.org/2015/schedule/presentation/387/

  • Nick Tollervey

  • @ntoll

  • freelance Python dev

  • examination of a personal project

  • DHTs are decentralized

  • this implementation is based on Kademlia

  • a callback cannot start until the one before it has finished

  • asyncio tasks are therefore only sort of concurrent

  • the only thing that can happen meanwhile is network I/O

  • this serialization is actually required by the asyncio PEP

  • in asyncio usage, "coroutine" refers to both the generator object itself, and the function that creates it

  • see the docs

  • routing table organization:

    • peers are stored in buckets
    • buckets get bigger when they are farther away
  • peer lookups can be concurrent

  • peer lookup is also recursive

  • if you can't find someone, ask someone in their bucket where they are

  • Twisted is not very Pythonic

  • asyncio definitely preferable

  • Nick's DHT code has 100% unit test coverage

  • 890 lines

  • asyncio makes it easier to write testable code

  • asyncio is only suitable for I/O-bound code

  • streams API is higher-level than protocols/transports, but Nick didn't use it

  • asyncio.org site is a rollup of better examples than the docs

  • asyncio has no particular facilities for dealing with multicore setups
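
A tiny sketch (Python 3.4-era syntax) of the scheduling point above: tasks only interleave at explicit yield points, so the event loop overlaps I/O waits but never runs two callbacks at once.

```python
import asyncio

@asyncio.coroutine
def worker(name, delay):
    # Only while this task is suspended in the sleep can other tasks run.
    yield from asyncio.sleep(delay)
    print(name, 'done')

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(worker('a', 1), worker('b', 1)))
loop.close()
```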

Performance by the Numbers: analyzing the performance of web applications

  • https://us.pycon.org/2015/schedule/presentation/349/

  • slide deck

  • Geoff Gerrietts

  • @ggerrietts

  • AppNeta

  • large retailers have shown that 100ms latency reduction can increase revenue by 1%

  • the DBA cannot fix all your problems with better indexes

  • Drunken Man approach to perf (named by Brendan Gregg): come up with something plausible, try really hard to do it, hope it helps

  • if you're designing a project before you know where the problem is, don't

  • don't lean too heavily on one or two perf tools

  • they all have blind spots, and one tool is never enough

Profilers

  • profilers have been the go-to tool for perf analysis for decades

  • best suited to looking at a specific code path

  • tend to have very large instrumentation overhead

  • traditional line-oriented profilers can't be used in production

  • instrumentation overhead can exaggerate impact of small functions vs. larger, slower functions

  • profiling off production requires something like traffic replay

  • Apache traffic logs don't include POST data

  • statistical profiling uses periodic random sampling

  • random sampling tends to miss stuff, and especially miss context of calls

OS tools

  • about a jillion of them
  • can generally be used in production
  • limited to one box, don't have cross-node context
  • great for tracking resource depletion, and host or OS failures

Ad-hoc instrumentation

  • stats services written for specific application metrics
  • push (StatsD) or pull (Munin) models
  • great for tracking and trending discrete events
  • every point of instrumentation is hardcoded into app
  • difficult to interpret lots of graphs
  • still no inter-node context

Tracing

  • ex: Twitter Zipkin
  • Zipkin can draw timeline/waterfall graphs like Chrome Dev Tools
  • Traces are a good place to get started, and provide the context for the other tools
  • Trace infrastructure is nontrivial, basically identical to an analytics pipeline
  • further reading:

Python bytecode

  • Allison Kaptur

  • Python interpreter is a stack machine

  • bytecode is output by lexer -> parser -> compiler

  • our goal today is a bytecode interpreter: Byterun

  • why? we're not going to be faster than CPython

  • Ned Batchelder (author of coverage) wanted to get bytecode-level coverage

  • why write it in Python (instead of PyPy)?

  • want to be able to fall back to real Python objects

  • a function foo has its bytecode in foo.func_code.co_code

  • stdlib dis module can show bytecode in human-readable format

  • dis.dis(foo) prints an annotated disassembly of foo's bytecode

  • calling a function consumes a function from the calling function's data stack, and creates a new frame

  • list of Python bytecodes

  • getting into CPython: start in ceval.c, get confused, go from there

  • almost the canonical interpreted language, circa 1989

  • 1500-line switch statement, too big for some older C compilers

  • Python 3 uses computed gotos instead of a giant switch

  • LOAD_FAST is the most common instruction in most Python codebases

  • Byterun's problem with nested generators was a result of having one data stack for the entire program, rather than one per frame

  • you need to be able to pause and resume frames to implement generators

  • problem with implementing LOAD_FAST, LOAD_FAST, BINARY_MODULO is that you don't know whether % is operating on a string or a number until runtime (see the dis sketch at the end of this section)

  • BINARY_MODULO now has to be really smart

  • has a fast path for string formatting

  • without type information, every opcode might as well be INVOKE_ARBITRARY_METHOD (from "How Fast Can We Make Interpreted Python")

  • Python v.Next will actually use type hints to go faster

  • more detail coming at keynote

  • Python knows that some instructions are frequently paired and has fast paths that depend on next instruction

  • block stack (in addition to data, frame stacks) is used for handling loops and exceptions
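
A small dis sketch of the % example above (the function is made up; exact opcode listings vary by CPython version):

```python
import dis

def mod(a, b):
    return a % b

dis.dis(mod)
# Typically prints something like:
#   LOAD_FAST    0 (a)
#   LOAD_FAST    1 (b)
#   BINARY_MODULO
#   RETURN_VALUE

# The raw bytecode bytes live on the code object:
print(mod.__code__.co_code)   # foo.func_code.co_code in Python 2
```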

Python Concurrency From the Ground Up

  • David Beazley

  • @dabeaz

  • totally packed house

  • threads and coroutines

  • tradeoffs, perf characteristics, things that can go wrong

no threads

  • start with a naive Fibonacci function

  • slows down around fib(35)

  • let's make a microservice out of this

  • start with from socket import * 😖

  • fib server:

    • single threaded, listen(5)
  • fib handler:

    • recv with a 100-byte buffer
  • can't handle multiple clients
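
A rough reconstruction of the single-threaded server described above (not the talk's exact code); every accept/recv/send call blocks, so a second client has to wait:

```python
from socket import socket, AF_INET, SOCK_STREAM, SOL_SOCKET, SO_REUSEADDR

def fib(n):
    return 1 if n < 2 else fib(n - 1) + fib(n - 2)

def fib_handler(client):
    while True:
        req = client.recv(100)                 # blocks
        if not req:
            break
        result = fib(int(req.decode('ascii')))
        client.send(str(result).encode('ascii') + b'\n')  # blocks
    client.close()

def fib_server(address):
    sock = socket(AF_INET, SOCK_STREAM)
    sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    sock.bind(address)
    sock.listen(5)
    while True:
        client, addr = sock.accept()           # blocks: one client at a time
        fib_handler(client)

fib_server(('', 25000))
```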

threads

  • start each new fib handler on a new thread

  • Python uses OS threads

  • some GIL problems are well known

  • GIL prevents using more than one CPU core, so multiple clients are competing for a single core

  • other GIL problems are less well known

  • GIL prioritizes CPU-heavy threads

  • with fast fib(1) client and slow fib(30) client, fast client takes huge hit

  • OS threads do not do this: OS gives priority to short tasks that look like interactive behavior

thread pools

  • accept request in thread
  • offload work to thread pool, using concurrent.futures, then wait for result
  • high CPU load of serializing data and shipping it into the pool and back
  • threads solve problem of blocking accept loop

generators

  • also solves problems of blocking: yield stops execution until next() is called again

  • new task: countdown iterator

  • using deque, create round-robin queue of countdown iterators

  • take iterator, yield value, print it, put back in queue

  • fib handler: yield '<callname>', sock before blocking calls like recv or send

  • yield statement communicates intention to wait for something

  • fib server now yields 'recv', sock while waiting to accept

  • additional run method:

    • waiting queue for stuff that's waiting to recv
    • waiting queue for send
    • loop that calls next(task), assigns task to one of these queues
    • task queue starts with [fib_server]

select

  • how do we pull tasks off the waiting queues so they can do work again?
  • modify run loop: runs as long as there's any task waiting to run
  • use select library to see if there are any sockets in wait queues that can recv or send
  • when task queue is empty, wait with select for something to happen
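
A rough reconstruction of the generator-based scheduler just described, assuming handlers yield ('recv', sock) or ('send', sock) before blocking calls (not the talk's exact code):

```python
from collections import deque
from select import select

tasks = deque()        # generators ready to run (starts as deque([fib_server(...)]))
recv_wait = {}         # socket -> task waiting to recv
send_wait = {}         # socket -> task waiting to send

def run():
    while any([tasks, recv_wait, send_wait]):
        while not tasks:
            # Nothing ready: block in select until a socket can recv or send.
            can_recv, can_send, _ = select(recv_wait, send_wait, [])
            for s in can_recv:
                tasks.append(recv_wait.pop(s))
            for s in can_send:
                tasks.append(send_wait.pop(s))
        task = tasks.popleft()
        try:
            why, what = next(task)             # run the task until its next yield
            if why == 'recv':
                recv_wait[what] = task
            elif why == 'send':
                send_wait[what] = task
        except StopIteration:
            pass                               # task finished
```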

problems with coroutines

  • does not solve the GIL CPU core contention

  • does not solve domination by one long CPU-heavy task

  • add the thread pool back?

  • still doesn't solve the long-task problem, because future.result() blocks and is thus not coroutine-friendly

  • now you need a future-wait queue

  • how do you get stuff out of that queue?

  • tasks that wait on futures need a companion task that moves them from the future-wait queue back to the task queue

  • use a socketpair so that you can select on future task completion

  • now we can run a CPU-heavy job without totally killing throughput of a fast job

  • coroutines don't mean you can ignore the GIL

  • you probably still need a thread or multiprocess worker pool

but coroutines are ugly

  • don't want to write explicit coroutine yields?
  • wrap socket in an AsyncSocket class that provides generators for all blocking methods
  • yield from AsyncSocket.accept so you can call it more than once
  • now it looks like threading code again
  • and you basically have asyncio

questions

  • concurrent.futures is slightly easier to use than a multiprocessing pool

  • surprise syntax:

    • a, b, [] = ([1], [2], []) is somehow legal
    • a, b, [] = ([1], [2], [3]) is not: ValueError: too many values to unpack (expected 0)
  • coroutines let you handle many more simultaneous connections than OS threads

  • every async I/O implementation ends up with a select loop somewhere

Python Performance Profiling: The Guts And The Glory

  • https://us.pycon.org/2015/schedule/presentation/400/

  • A. Jesse Jiryu Davis

  • @jessejiryudavis

  • MongoDB engineer

  • pymongo maintainer (Python driver for MongoDB)

  • some guy published an article on DZone about 80,000 MongoDB inserts per second with the node.js driver

  • Python driver clocked at 29,000/sec on same hardware

  • Jesse now has a problem

  • benchmarker inserts 80k docs in 5k-long batches

  • creates a list to stick batches in before inserts

  • starts inserting data, calling datetime! and random!! functions to generate it as needed!

  • appends to end of batch list

  • MongoDB can handle batch inserts of up to 16 MB at a time

  • but Node.js code has same morally objectionable structure, so the perf difference actually is in the language or driver

  • optimization is like debugging

  • don't ask "why is my code slow?"

  • ask "will changing this part of my code make it faster?"

  • hypothesis-experiment cycle with benchmarks

  • warning: optimization generally makes your code harder to read and maintain

  • example: caching layers

  • why is profiling useful?

  • profiling lets you generate hypotheses

  • it is not the experiment

  • profiling affects your code too much to be part of the experiment

  • benchmarks should be on uninstrumented code

  • the profiler that Jesse reaches for first is not cProfile

  • cProfile is "severely overrated"

  • Jesse uses Yappi, third-party profiler by Sümer Cip

  • as fast as cProfile

  • can profile every thread in the app

  • can measure both CPU and wall clock time

  • can export to callgrind format, which cProfile can't

  • can profile builtins

  • still requires code modification to start Yappi and save profiles (see the sketch at the end of this section)

  • KCacheGrind can read callgrind format

  • spending ⅔ of time in Collection.insert in Python MongoDB driver

  • Hypothesis: if pymongo was infinitely fast, it would only match the perf of the Node.js driver

  • Test: replace Collection.insert with del

  • Hypothesis proved by benchmark

  • Possible explanation: V8 has a JIT, CPython doesn't

  • removing datetime stuff gets from 30,000/sec to 50,000/sec

  • years later, updating to PyMongo 3.0 goes from 38k to 59k

  • PyPy (CPython 2.7 compatible build with JIT) gets to 73k

  • try stubbing code out before you try actually optimizing your code

  • the Monary driver bypasses Python and lets NumPy talk directly to Mongo

  • PyMongo is not asyncio: it's a blocking driver for threaded apps

  • Jesse wrote an async Tornado driver called Motor
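
A minimal Yappi sketch along the lines described above (run_benchmark is a hypothetical workload; check Yappi's docs for the exact API of your version):

```python
import yappi

yappi.set_clock_type('wall')      # or 'cpu'
yappi.start(builtins=True)        # profile builtins too
run_benchmark()                   # hypothetical workload under test
yappi.stop()
# Save in callgrind format so KCacheGrind can open it:
yappi.get_func_stats().save('callgrind.out.bench', type='callgrind')
```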

scikit-learn for street maps

  • Michelle Fulwood

  • Twitter: @michelleful

  • grad student

  • interested in classifying national origins of Singapore's street names

  • visualizing clusters of roads

  • color-coding roads by national origin of name

  • we need:

    • locations of roads, from OpenStreetMap
    • linguistic classification, done by machine learning

Wrangling geodata with GeoPandas

  • original data in GeoJSON
  • hierarchical dictionary format: pandas hates it
  • geopandas can translate many formats into GeoDataFrames
  • 60k roads. Is that too many? (yes)
  • geopandas can plot geodata with df.plot()
  • looks like many of the roads are outside Singapore
  • we can use the within function to clip a dataset to a geometric boundary
  • standard pandas functions are available:
    • filter out empty road names
    • filter out roads that are not on a list of accepted types (yes to highway, no to footpath)
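
A hedged GeoPandas sketch of those steps (the file name, singapore_boundary polygon, and accepted_types list are placeholders):

```python
import geopandas as gpd

roads = gpd.read_file('singapore_roads.geojson')            # GeoJSON -> GeoDataFrame
roads = roads[roads.geometry.within(singapore_boundary)]    # clip to a boundary polygon (placeholder)
roads = roads[roads['name'].notnull()]                      # drop unnamed roads
roads = roads[roads['highway'].isin(accepted_types)]        # keep only accepted road types (placeholder list)
roads.plot()
```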

Classifying with scikit-learn

supervised classification

  • we need:
    • a set of labels
    • a set of features
    • a labeled train and test set
  • sklearn provides train_test_split function
  • features: n-gram letter frequency for 1-, 2-, and 3-grams
  • sklearn.feature_extraction.text.CountVectorizer does n-grams for multiple ns at once
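
A short sketch of the feature extraction and split (road_names and labels are hypothetical; train_test_split lives in sklearn.model_selection in newer releases):

```python
from sklearn.cross_validation import train_test_split       # sklearn.model_selection in newer releases
from sklearn.feature_extraction.text import CountVectorizer

# Character n-grams of length 1-3 as features.
vec = CountVectorizer(analyzer='char', ngram_range=(1, 3))
X = vec.fit_transform(road_names)                            # hypothetical list of road names
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
```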

selecting a classifier

  • see Scikit's cheat sheet
  • chosen: linear support vector classification (SVC)
  • selected accuracy score as metric for model performance (vs ROC AUC, etc.: scikit supports a lot of metrics)
  • wanted to minimize hand correction of classifier

initial results

  • first result: right 55% of the time

  • random chance: 16.6% (over six linguistic categories)

  • not terrible, could be better

  • scikit-learn makes it very easy to swap classifiers

adding features

  • see A Few Useful Things to Know About Machine Learning
  • feature choice is huge factor in machine learning project success
  • new features:
    • number of words
    • avg length of word
    • are all the words in a language dictionary?
    • are road tags Malay? (Street, Road vs Jalan, Lorong)

Pipelines

  • minimize repetitive feature extraction code by using sklearn.pipeline
  • add ngram and SVC stages to a new pipeline object
  • don't need to run fit_transform/transform on train/test sets
  • just feed data into the pipeline

Make your own pipeline stage

  • inherit from BaseEstimator and TransformerMixin

  • implement transform, stub out fit

  • use FeatureUnion to run feature extraction stages in parallel!

  • reached 65% accuracy with new features

  • pipelines and FeatureUnion don't improve performance much, but the code is much easier to read
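
A hedged sketch of the pipeline with a hand-rolled transformer and FeatureUnion (the WordCount feature and the training data names are invented):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

class WordCount(BaseEstimator, TransformerMixin):
    """Example custom stage: number of words in each road name."""
    def fit(self, X, y=None):
        return self                                    # nothing to learn
    def transform(self, X):
        return np.array([[len(name.split())] for name in X])

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngrams', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
        ('word_count', WordCount()),
    ])),
    ('clf', LinearSVC()),
])
pipeline.fit(road_names_train, labels_train)           # hypothetical training data
print(pipeline.score(road_names_test, labels_test))    # hypothetical test data
```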

Hyperparameter tuning

  • default parameters of the SVC, etc. model classes
  • GridSearchCV does an exhaustive (brute-force) search over a grid of hyperparameter values (see the sketch at the end of this section)
  • reached 68% accuracy with new hyperparameters
  • when doing your own hyperparameter tuning:
    • read the papers
    • use the hyperparams listed first in the docs
    • go see what other people are using on Github
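
A short GridSearchCV sketch over a small pipeline (training data names are hypothetical; GridSearchCV lives in sklearn.model_selection in newer releases):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.grid_search import GridSearchCV        # sklearn.model_selection in newer releases
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('ngrams', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', LinearSVC()),
])

# Exhaustively try every combination in the grid with cross-validation.
param_grid = {
    'ngrams__ngram_range': [(1, 2), (1, 3), (1, 4)],
    'clf__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(road_names_train, labels_train)           # hypothetical training data
print(search.best_params_, search.best_score_)
```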

Further neat tricks

  • mplleaflet turns matplotlib output into Leaflet.js zoomable tiled maps

streamparse: real-time streams with Python and Apache Storm

Storm topology

  • Storm abstractions: tuples, spouts, bolts, topology (DAG)
  • tuples are basically DB rows, and have schemas
  • spout: data source
  • bolt: computation node
    • can ack a tuple
    • fail a tuple
    • emit new tuples

Storm internals

  • Manning Press book: Storm Applied
  • tuple tree: bolt can explode a tuple into a bunch of derived tuples, but Storm keeps track of the origin of the child tuples
  • used by reliability features: guaranteed processing
  • Storm HA: Nimbus is the cluster supervisor, coordinates through ZooKeeper, and is responsible for uploading code to workers
  • allocates Python code to Python slots on physical worker nodes

Python + Storm

  • Storm "multi-lang" protocol is JSON over pipes or something
  • 1 Python process per Storm task
  • a lot of Python↔︎JVM data interchange
  • Storm bundles storm.py but it's not good: assumes Storm packaging, meaning you put your stuff in a JAR

streamparse

  • 3 paid Parse.ly maintainers + DARPA funding

  • pip install streamparse

  • sparse CLI tool creates streamparse boilerplate (like Django)

  • sparse also requires Leiningen to set up a cluster for you

  • sparse functions:

    • make virtualenvs on workers
    • package Python code as Storm JAR
    • talks to Nimbus to deploy your topology
  • replaces storm.py

  • supports Python 3.4 and PyPy

  • Storm has a DSL for topology setup: make a Python spout, make a Python bolt, run this bolt on two workers, etc.

  • Storm grouping can make sure that certain tuples ("dog") always get routed to the same nodes, so that the dog count doesn't end up on two workers
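
A hedged sketch of what a Python bolt looks like in streamparse's style of the time (the import path and base-class hooks vary across streamparse versions; the word-count example is illustrative):

```python
from collections import Counter
from streamparse.bolt import Bolt   # "from streamparse import Bolt" in later versions

class WordCountBolt(Bolt):
    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        self.emit([word, self.counts[word]])
```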

PyKafka

  • Kafka traditionally hasn't had a great Python library
  • don't tell the KIXEYE people who used to maintain kafka-python ;)
  • Kafka makes a good Storm spout

Questions

  • does streamparse handle packaging Python dependencies in the Storm JAR?

  • no it doesn't. relies on SSH configuration of its persistent virtualenvs.

  • somewhat redundant with Yelp's pyleus

  • How do you debug/log/diagnose Python bolts?

  • streamparse redirects sys.stdout to Python logging (so prints don't corrupt the multi-lang protocol data), so use the logging module for output

  • sparse tail can tail the logs from each worker

  • needs improvement

What can programmers learn from pilots?

  • Andrew Godwin

  • fail soft if related to external controls

  • fail hard otherwise

  • either way, make some noise!

  • but not a continuous ignorable noise

  • or you'll just get used to errors and not handle them

  • aircraft testing results in known statistical limits for components

  • and combinations of components

  • don't rely on automatic failover

  • make sure you can always manually swap to a spare

  • checklists for EVERYTHING

  • automation should be best for the worst cases, because that's when you'll need to get it right

  • ex: database failover

  • aviate, navigate, communicate

  • first fly the plane

  • then fly the plane somewhere

  • only then, talk to ATC

  • don't get distracted by communication

  • emergencies have priority on radio channels

  • if Air Force 1 is on the same channel, and you have an emergency, they have to shut up

  • know your critical features (Eventbrite ex: tickets)

  • know what you can sacrifice if necessary (analytics)

  • single person should make decisions

  • but don't ignore your copilot or flight crew

  • postmortems:

  • there are always multiple factors to an accident

  • blaming someone solves nothing, and can distract the investigation

  • planes don't do deadlines

  • they always carry extra fuel

  • always have a plan or some buffer to deal with unknown problems

  • don't ship crap code to meet the deadline, you'll just spend weeks fixing it

  • don't be a hero

  • ops are like pilots: hours of boredom punctuated by moments of terror

What can Python learn from Erlang?

  • Benoit Chesneau

  • https://github.com/benoitc

  • https://twitter.com/benoitc

  • author of gunicorn

  • https://speakerdeck.com/benoitc/what-python-can-learn-from-erlang

  • Erlang is concurrent

  • CPython is single-threaded

  • sounds silly at first

  • but concurrency is about passing messages and hiding internals

  • this talk is about reliability

  • if your program isn't reliable, it won't perform

  • a reliable program

    • stays alive
    • resistant to failures
    • recovers from failures
    • supports hot upgrades
  • if processes are isolated and don't share memory

  • then a failed process doesn't necessarily take down other processes with it

  • can't corrupt shared data

  • other processes can detect crashed processes and restart them

  • Python often uses supervisord or other supervisor-like processes

  • supervisor doesn't know where your process crashed, so it can't restart without discarding partial progress

  • Python must use OS processes to achieve isolation

  • processes that crash should try to handle the fatal exception by writing out any work in progress so their successors can pick up

  • immutable data can be shared all you want

  • the real world is mutable

  • but you can take immutable snapshots

  • use functional data structures like FunkTown

  • be specific about the exceptions you catch

  • don't catch anything you can't handle by writing out partial state (or logging)

  • just crash and let the supervisor restart you

  • don't use exceptions for control flow

  • use pattern matching on Either types

  • patterns implementation for Python
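
A minimal sketch of the advice above: catch only what you can actually handle, persist work in progress, and let anything else crash so a supervisor restarts the process (process and save_checkpoint are hypothetical):

```python
def worker_loop(jobs):
    for job in jobs:
        try:
            process(job)                   # hypothetical unit of work
        except IOError:
            # A specific, recoverable failure: write out partial state for a successor,
            # then crash and let the supervisor (e.g. supervisord) restart us.
            save_checkpoint(job)           # hypothetical persistence helper
            raise
```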
