PyCon notes

Exploring is never boring: understanding CPython without reading the code

Observation

  • play the role of a 19th-century naturalist, coming back from an island to give a talk at the local scientific society

  • take into account history and evolution of the code

  • remember that not everything is intentional

  • observational astronomy: why are we seeing what we're seeing?

  • is it because of what we're looking at, or where we're looking?

  • texts are not (normally) designed to deceive or mislead you

  • but code can and will due to external constraints (deadlines, perf) or mistakes

  • inspect is a useful module for observation

  • inspect.getsource(foo) is equivalent to IPython's foo??

  • doesn't work on C functions

  • cinspect extends inspect to handle C code

  • use history and changelogs

  • Python used to have a good rep for very clean and readable C code

  • 15 years later, perf constraints have changed this

  • but you can go back and look at earlier versions!

  • hg blame -r revnum for Mercurial changelogs

  • look at the source

  • look at the AST

  • look at the bytecode

  • False is False is False is not equivalent to (False is False) is False

  • original version is actually a ternary compare

  • parenthesized version is not

  • difference is most obvious at the bytecode level (see the dis sketch below)
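
A quick way to see this is to disassemble both forms with the stdlib dis module; a minimal sketch (function names are made up, exact opcode listings vary by CPython version):

```python
import dis

def chained():
    # Ternary/chained comparison: roughly "False is False and (False is False)",
    # with the middle operand evaluated only once.
    return False is False is False

def grouped():
    # Two independent comparisons: the bool result of the first is compared to False.
    return (False is False) is False

dis.dis(chained)  # typically shows DUP_TOP / ROT_THREE and a conditional jump
dis.dis(grouped)  # typically shows two back-to-back COMPARE_OP instructions
```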

Experimentation

  • run experiments and test hypotheses

  • timeit module runs code snippets many times, best of 3, to minimize measurement errors and startup costs

  • python -m timeit -s "setup code" "foo()" (the -s flag supplies setup code that isn't timed; see the sketch at the end of this section)

  • write tests to demonstrate invariants

  • break CPython as much as you want (provided you don't contribute the breakage back)

  • poke stuff and see what happens
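
A minimal timeit sketch (the snippet and setup strings are just placeholders); setup runs once and is kept out of the measured loop:

```python
import timeit

# Time a snippet with its setup code excluded from the measurement.
elapsed = timeit.timeit("sqrt(1234.5678)",
                        setup="from math import sqrt",
                        number=1000000)
print(elapsed)

# Command-line equivalent:
#   python -m timeit -s "from math import sqrt" "sqrt(1234.5678)"
```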

Gradual Typing for Python 3

  • Guido van Rossum
  • Python developer
  • Dropbox

Timeline

  • 2006: PEP 3107 introduces annotation syntax but no semantics
  • 2013: MyPy adapted to use PEP 3107 syntax, List[T] for generics
  • 2015: PEP 484 for type hints and gradual typing, targeted at Python 3.5

PEP 484

  • static type checker outside runtime

    • Google, Dropbox have their own analyzers
    • products like Semmle and PyCharm
    • open source: MyPy
  • standard syntax for type hint annotations

  • stub files to add types to code you can't change:

    • C code
    • Files that need to be backwards compatible with Python 2 (no annotations)
    • Other people's code (OH: "monkey typing" 🙈)
  • aimed at static analyzers, IDEs

  • many idioms in Python that defeat unannotated static analysis

  • static type checkers will help warn you if your annotations are incorrect

  • a big IDE developer said they can currently work out types for about 50-60% of code

  • Python 3.5 type checker is provisional (PEP 411)

  • code generation is not a focus of PEP 484

  • neither CPython nor PyPy uses them (yet)

  • Cython can use them, optionally

Type hint syntax

  • @no_type_check decorator for disabling checking if you're using incompatible annotations

  • unannotated functions are treated as if they had an annotation of Any for every param & return

  • Any is a superclass and subclass of every object

  • breaks issubclass transitivity

  • creates a new "is consistent with" relationship between types

  • Jeremy Siek's "What is Gradual Typing" blog post

  • typing.py provides Any and other helpers such as Dict, List, Union, Callable, Tuple

  • only concrete change from proposal

  • backwards-compatible with Python 3.2 to 3.4

  • can use builtin types and your own classes as type annotations

  • unparameterized types only

  • new bracket magic for parameterized types: def foo() -> List[Tuple[float, float]]:

  • can't use list[str] because list is already an object and doesn't have an index operator

  • typing.Tuple is treated more like a struct than a sequence

  • implementation: all typing stuff is derived from a metaclass that abuses __getitem__

  • types are for the type checker, classes are for the runtime

Helpers from typing

  • Union[a, b] is equivalent to the a|b notation in PHPDoc-style docstrings
  • Optional[int] is sugar for Union[int, None]
  • Tuple[float, float] is a 2-tuple
  • Tuple[float, ...] is an immutable sequence of float (ellipsis is the slicing Ellipsis)
  • Callable[[arg1, arg2], return]
  • Callable[..., float] is a function of complicated args/kwargs returning float
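
A hedged sketch of how these helpers look in annotated signatures (the functions themselves are invented for illustration):

```python
from typing import Callable, List, Optional, Tuple, Union

def scale(points: List[Tuple[float, float]], factor: float) -> List[Tuple[float, float]]:
    return [(x * factor, y * factor) for x, y in points]

def parse_port(value: Union[str, int]) -> Optional[int]:
    try:
        return int(value)
    except ValueError:
        return None            # Optional[int] is sugar for Union[int, None]

def apply_twice(f: Callable[[int], int], x: int) -> int:
    return f(f(x))
```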

Make your own generic types

  • typing.TypeVar, typing.Generic

  • T = TypeVar('T')

  • class Chart(Generic[T]):

  • def foo(self) -> T:

  • "a watered-down version of things you can do with Java"

  • in general you can define type aliases using typing objects: AnyStr = Union[str, bytes]

  • interesting problem: split(s: AnyStr, sep: AnyStr) -> List[AnyStr]

  • s and sep need to be the same type

  • constrained type variables: AnyStr = TypeVar('AnyStr', str, bytes) must be str or bytes

  • AnyStr is actually predefined in typing
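
Putting the pieces together, a minimal sketch of a user-defined generic class along the lines of the Chart example, plus a constrained TypeVar like the predefined AnyStr (method bodies are invented):

```python
from typing import Generic, List, TypeVar

T = TypeVar('T')

class Chart(Generic[T]):
    def __init__(self) -> None:
        self._items = []  # type: List[T]   # "# type:" comment keeps 3.2-3.4 compatibility

    def add(self, item: T) -> None:
        self._items.append(item)

    def first(self) -> T:
        return self._items[0]

# A constrained type variable: values must be str or bytes, and s and sep must match.
AnyStr = TypeVar('AnyStr', str, bytes)

def split(s: AnyStr, sep: AnyStr) -> List[AnyStr]:
    return s.split(sep)
```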

Trivia

  • forward defs: use strings. class Node: def set_left(self, n: 'Node'):

  • variable annotations: # type: <type> comments (considered a pragmatic compromise)

  • isinstance extended to take types: isinstance(42, Union[int, str]) works

  • stub files end in .pyi and have same syntax as Python but with everything stubbed out

  • checker prefers stubs to real files

  • no multiple dispatch. faked with @overload decorator, which is only allowed in stubs.

  • can use @overload to define the same function multiple times with different types

  • ex: __getitem__ with int or slice arguments

  • probably better to use constrained type variables

  • implementation constrained by desire not to add any new syntax or C code

  • back-compatible to Python 3.2
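
A small sketch of the trivia above: a string forward reference, a "# type:" comment, and @overload as it would appear in a stub (class and method names are illustrative):

```python
class Node:
    def set_left(self, n: 'Node') -> None:   # forward reference as a string literal
        # variable annotation via a "# type:" comment:
        self.left = n  # type: Node

# In a .pyi stub file, typing.overload declares alternate signatures with stubbed bodies:
#
#   @overload
#   def __getitem__(self, index: int) -> str: ...
#   @overload
#   def __getitem__(self, index: slice) -> 'MySeq': ...
```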

Graph database patterns in Python

  • Elizabeth Ramirez

  • Engineer for New York Times, search and semantics group

  • definition of property graph

  • graphs at scale with Titan, Cassandra, and ElasticSearch

  • Gremlin Query Language

  • Python patterns for Titan models

  • uses of graph DBs:

    • semantic web
    • network impact analysis: if one node goes down, what else breaks?
    • useful for highly interconnected stuff in general

Property graph

  • directed
  • edges and vertices are both labeled and have properties attached
  • vertices have lists of incoming and outgoing edges
  • edges store from and to vertices

Properties of graph DBs

  • graph DBs provide "index-free adjacency", meaning you don't need to hit an adjacency index to get a node's immediate neighbors
  • graph DBs are not schemaless and need some kind of schema to prevent inconsistencies

TinkerPop stack

  • Blueprints is Java lib for graph data structures
  • Pipes are basically extended iterators
  • Gremlin is a query lang on top of Blueprints, Pipes, and Groovy
  • Rexster is a REST server for Titan
    • also provides RexPro binary protocol, which is what Elizabeth uses in production due to poor performance of HTTP

Why Titan?

  • multiple options for storage and search
  • already distributed
  • small/medium graph: 10M-100M edges
  • Titan setup for that graph size:
    • 3 Cassandra nodes
    • 2 ES nodes
    • 1 Titan node
    • 1 Rexster node
  • all JVM
  • can run Cassandra and Titan in same JVM, but not recommended

Semantic knowledge

  • synonyms
  • concepts related
  • concepts combined: "acid" + "attack" combinations result in "acid attack" concept
  • extraction rules to deal with variations like "Obama", "Barack Obama", "President Obama"

Gremlin queries

  • look up vertex by ID

  • all vertices with a particular property value

    • can be handled from index without hitting the whole graph
  • retrieve a vertex's outbound adjacents of a certain type

  • get all edges for a vertex

  • more complicated query: go from ebola, to all virus topics related to ebola, to the combination of science topics and medicine topics related to ebola

  • basically end up writing Groovy that looks like IEnumerable chained expressions

  • Gremlin can use external ES indexes to answer GIS location queries, full text queries

How can we map Gremlin syntax to Python?

  • pipe patterns: transform, filter, sideEffect
  • bunch of Python functions that basically end up generating a Gremlin query
  • once you have one, send it to Rexpro, get back your result set
  • single metaclass for vertex and edge model classes, since they share things like property maps
  • parent classes for vertex and edge
  • vertex model class for extraction rule derives from vertex class but uses common metaclass
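
A purely hypothetical sketch of that pattern (none of these class or method names come from the talk or from a real library): a shared metaclass collects declared properties, and model classes build Gremlin strings to send over RexPro.

```python
class Property:
    """Marker for a declared model property."""

class ModelMeta(type):
    """Shared by vertex and edge models: collects declared property names."""
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        cls.property_names = [k for k, v in namespace.items()
                              if not k.startswith('_') and isinstance(v, Property)]
        return cls

class Vertex(metaclass=ModelMeta):
    @classmethod
    def outbound(cls, vertex_id, edge_label):
        # Build a Gremlin snippet; in production the string is sent over RexPro.
        return "g.v(%d).out('%s')" % (vertex_id, edge_label)

class ExtractionRule(Vertex):
    name = Property()
    pattern = Property()
```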

http2

  • Cory Benfield
  • Metaswitch Networks
  • requests, urllib3 core contributor
  • HTTPBis and HTTP/2 IETF working group member
  • implemented the hyper http2 stack for Python
  • Twitter: @lukasaoz
  • GH: @lukasa

HTTP 1.1 is inefficient

  • uses TCP poorly
  • TCP works best if you keep your connections alive so it can adapt behavior
  • HTTP 1.1 generally doesn't reuse connections
  • results in a lot of concurrent connections to get all the resources
  • or nasty hacks like:
    • image spriting
    • inlining resources as data: URLs
    • CSS/JS concatenation
  • these hacks lead to poor performance with HTTP caching: change one resource and you need to regenerate your big combo files

HTTP 2 is a binary protocol

  • based on length-prefixed frames

  • not at all readable, but easy to parse

  • HTTP 1.1 makes it difficult to predict resource allocations like header sizes

  • harder for embedded developers

bonus features on top of new protocol

muxing with priority and flow control

  • single req/resp pair is called a "stream" and has a stream ID (ex: 56)
  • priorities prevent "head of line blocking"

HPACK header compression

  • domain-specific compression
  • includes commands like "resend header 1 from last time" (good for user agents)
  • can blacklist things like password headers from compression, which prevents BREACH/CRIME/other compression oracle attacks

server push

  • send resources you know the client is going to want soon, before the client asks
  • ex: JS and CSS dependencies for a page

HTTP 2 has not been well received

  • phk hates it

  • tricky to reason about

  • interpreting a request requires knowledge of prior requests

  • tools now need to export connection state in debug logs

  • you need to do a ton of interop testing

  • nasty edge cases resulting from back compat with HTTP 1.1

  • HTTP 2 total headers limited to 16k

  • Kerberos frequently generates single headers that large

HTTP 2 is inherently concurrent

  • fun problem for Python

  • makes requests support very tricky

  • Cory expects asyncio adoption to spike based on nature of HTTP 2

  • gophertiles demo compares HTTP 1.1 and 2

  • shows off parallel download by sending an image as a pile of tiles

  • 34 known implementations of HTTP 2

  • nghttp2 is the open source reference implementation that does everything, client and server

  • nghttp2 is also a nightmare to compile

  • Wireshark supports HTTP 2 frames (but you need a TLS-enabled build for most uses)

hyper

  • Python's only HTTP 2 implementation

  • client only

  • similar to http.client in scope

  • designed to go at the bottom of a more featureful library

  • https://github.com/Lukasa/hyper

  • http2bin.org is the successor to httpbin.org

  • running behind H2O (HTTP 2 and 1.1 reverse proxy)

  • H2O can be used to wrap an HTTP 1.1 service in HTTP 2
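
A minimal client sketch based on hyper's documented API around that time (exact class names and defaults may have changed in later releases):

```python
from hyper import HTTPConnection

conn = HTTPConnection('http2bin.org:443')   # negotiates HTTP/2 over TLS
conn.request('GET', '/get')
resp = conn.get_response()
print(resp.status)
print(resp.read())
```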

Further reading

  • http://daniel.haxx.se/http2/ (from the author of cURL)

  • Apache mod_h2 is built on top of nghttp2

  • HTTP 2 is already more widely used than IPv6

  • Google, Chrome, Twitter, Facebook (soon)

  • popular web frameworks will need extensions to support push features

  • machine learning frameworks might support a predictive push reverse proxy

HTTP 2 for dumb embedded platforms (Arduino, etc.)

  • TLS can be omitted
  • flow control can be reduced to a sort of no-op case that always sends all the data you ask for
  • hyper actually disabled flow control at one point
  • HPACK previous-header reuse can be disabled
  • HPACK Huffman decoding probably can't be disabled

Lessons learned with asyncio

  • https://us.pycon.org/2015/schedule/presentation/387/

  • Nick Tollervey

  • @ntoll

  • freelance Python dev

  • examination of a personal project

  • DHTs are decentralized

  • this implementation is based on Kademlia

  • a callback cannot start until the one before it has finished

  • asyncio tasks are therefore only sort of concurrent

  • the only thing that can happen meanwhile is network I/O

  • this serialization is actually required by the asyncio PEP

  • in asyncio usage, "coroutine" refers to both the generator object itself, and the function that creates it

  • see the docs

  • routing table organization:

    • peers are stored in buckets
    • buckets get bigger when they are farther away
  • peer lookups can be concurrent

  • peer lookup is also recursive

  • if you can't find someone, ask someone in their bucket where they are

  • Twisted is not very Pythonic

  • asyncio definitely preferable

  • Nick's DHT code has 100% unit test coverage

  • 890 lines

  • asyncio makes it easier to write testable code

  • asyncio is only suitable for I/O-bound code

  • streams API is higher-level than protocols/transports, but Nick didn't use it

  • asyncio.org site is a rollup of better examples than the docs

  • asyncio has no particular facilities for dealing with multicore setups
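
A tiny sketch (Python 3.4-era syntax) of the scheduling point above: tasks only interleave at explicit yield points, so the event loop overlaps I/O waits but never runs two callbacks at once.

```python
import asyncio

@asyncio.coroutine
def worker(name, delay):
    # Only while this task is suspended in the sleep can other tasks run.
    yield from asyncio.sleep(delay)
    print(name, 'done')

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(worker('a', 1), worker('b', 1)))
loop.close()
```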

Performance by the Numbers: analyzing the performance of web applications

  • https://us.pycon.org/2015/schedule/presentation/349/

  • slide deck

  • Geoff Gerrietts

  • @ggerrietts

  • AppNeta

  • large retailers have shown that 100ms latency reduction can increase revenue by 1%

  • the DBA cannot fix all your problems with better indexes

  • Drunken Man approach to perf (named by Brendan Gregg): come up with something plausible, try really hard to do it, hope it helps

  • if you're designing a project before you know where the problem is, don't

  • don't lean too heavily on one or two perf tools

  • they all have blind spots, and one tool is never enough

Profilers

  • profilers have been the go-to tool for perf analysis for decades

  • best suited to looking at a specific code path

  • tend to have very large instrumentation overhead

  • traditional line-oriented profilers can't be used in production

  • instrumentation overhead can exaggerate impact of small functions vs. larger, slower functions

  • profiling off production requires something like traffic replay

  • Apache traffic logs don't include POST data

  • statistical profiling uses periodic random sampling

  • random sampling tends to miss stuff, and especially miss context of calls

OS tools

  • about a jillion of them
  • can generally be used in production
  • limited to one box, don't have cross-node context
  • great for tracking resource depletion, and host or OS failures

Ad-hoc instrumentation

  • stats services written for specific application metrics
  • push (StatsD) or pull (Munin) models
  • great for tracking and trending discrete events
  • every point of instrumentation is hardcoded into app
  • difficult to interpret lots of graphs
  • still no inter-node context

Tracing

  • ex: Twitter Zipkin
  • Zipkin can draw timeline/waterfall graphs like Chrome Dev Tools
  • Traces are a good place to get started, and provide the context for the other tools
  • Trace infrastructure is nontrivial, basically identical to an analytics pipeline
  • further reading:

Python bytecode

  • Allison Kaptur

  • Python interpreter is a stack machine

  • bytecode is output by lexer -> parser -> compiler

  • our goal today is a bytecode interpreter: Byterun

  • why? we're not going to be faster than CPython

  • Ned Batchelder (author of coverage) wanted to get bytecode-level coverage

  • why write it in Python (instead of PyPy)?

  • want to be able to fall back to real Python objects

  • a function foo has its bytecode in foo.func_code.co_code

  • stdlib dis module can show bytecode in human-readable format

  • dis.dis(foo) prints an annotated disassembly of foo's bytecode

  • calling a function consumes a function from the calling function's data stack, and creates a new frame

  • list of Python bytecodes

  • getting into CPython: start in ceval.c, get confused, go from there

  • almost the canonical interpreted language, circa 1989

  • 1500-line switch statement, too big for some older C compilers

  • Python 3 uses computed gotos instead of a giant switch

  • LOAD_FAST is the most common instruction in most Python codebases

  • Byterun's problem with nested generators was a result of having one data stack for the entire program, rather than one per frame

  • you need to be able to pause and resume frames to implement generators

  • problem with implementing LOAD_FAST, LOAD_FAST, BINARY_MODULO is that you don't know whether % is operating on a string or a number until runtime (see the dis sketch at the end of this section)

  • BINARY_MODULO now has to be really smart

  • has a fast path for string formatting

  • without type information, every opcode might as well be INVOKE_ARBITRARY_METHOD (from "How Fast Can We Make Interpreted Python")

  • Python v.Next will actually use type hints to go faster

  • more detail coming at keynote

  • Python knows that some instructions are frequently paired and has fast paths that depend on next instruction

  • block stack (in addition to data, frame stacks) is used for handling loops and exceptions
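
A small dis sketch of the % example above (the function is made up; exact opcode listings vary by CPython version):

```python
import dis

def mod(a, b):
    return a % b

dis.dis(mod)
# Typically prints something like:
#   LOAD_FAST    0 (a)
#   LOAD_FAST    1 (b)
#   BINARY_MODULO
#   RETURN_VALUE

# The raw bytecode bytes live on the code object:
print(mod.__code__.co_code)   # foo.func_code.co_code in Python 2
```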

Python Concurrency From the Ground Up

  • David Beazley

  • @dabeaz

  • totally packed house

  • threads and coroutines

  • tradeoffs, perf characteristics, things that can go wrong

no threads

  • start with a naive Fibonacci function

  • slows down around fib(35)

  • let's make a microservice out of this

  • start with from socket import * 😖

  • fib server:

    • single threaded, listen(5)
  • fib handler:

    • recv with a 100-byte buffer
  • can't handle multiple clients
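
A rough reconstruction of the single-threaded server described above (not the talk's exact code); every accept/recv/send call blocks, so a second client has to wait:

```python
from socket import socket, AF_INET, SOCK_STREAM, SOL_SOCKET, SO_REUSEADDR

def fib(n):
    return 1 if n < 2 else fib(n - 1) + fib(n - 2)

def fib_handler(client):
    while True:
        req = client.recv(100)                 # blocks
        if not req:
            break
        result = fib(int(req.decode('ascii')))
        client.send(str(result).encode('ascii') + b'\n')  # blocks
    client.close()

def fib_server(address):
    sock = socket(AF_INET, SOCK_STREAM)
    sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    sock.bind(address)
    sock.listen(5)
    while True:
        client, addr = sock.accept()           # blocks: one client at a time
        fib_handler(client)

fib_server(('', 25000))
```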

threads

  • start each new fib handler on a new thread

  • Python uses OS threads

  • some GIL problems are well known

  • GIL prevents using more than one CPU core, so multiple clients are competing for a single core

  • other GIL problems are less well known

  • GIL prioritizes CPU-heavy threads

  • with fast fib(1) client and slow fib(30) client, fast client takes huge hit

  • OS threads do not do this: OS gives priority to short tasks that look like interactive behavior

thread pools

  • accept request in thread
  • offload work to thread pool, using concurrent.futures, then wait for result
  • high CPU load of serializing data and shipping it into the pool and back
  • threads solve problem of blocking accept loop

generators

  • also solves problems of blocking: yield stops execution until next() is called again

  • new task: countdown iterator

  • using deque, create round-robin queue of countdown iterators

  • take iterator, yield value, print it, put back in queue

  • fib handler: yield '<callname>', sock before blocking calls like recv or send

  • yield statement communicates intention to wait for something

  • fib server now yields 'recv', sock while waiting to accept

  • additional run method:

    • waiting queue for stuff that's waiting to recv
    • waiting queue for send
    • loop that calls next(task), assigns task to one of these queues
    • task queue starts with [fib_server]

select

  • how do we pull tasks off the waiting queues so they can do work again?
  • modify run loop: runs as long as there's any task waiting to run
  • use select library to see if there are any sockets in wait queues that can recv or send
  • when task queue is empty, wait with select for something to happen
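
A rough reconstruction of the generator-based scheduler just described, assuming handlers yield ('recv', sock) or ('send', sock) before blocking calls (not the talk's exact code):

```python
from collections import deque
from select import select

tasks = deque()        # generators ready to run (starts as deque([fib_server(...)]))
recv_wait = {}         # socket -> task waiting to recv
send_wait = {}         # socket -> task waiting to send

def run():
    while any([tasks, recv_wait, send_wait]):
        while not tasks:
            # Nothing ready: block in select until a socket can recv or send.
            can_recv, can_send, _ = select(recv_wait, send_wait, [])
            for s in can_recv:
                tasks.append(recv_wait.pop(s))
            for s in can_send:
                tasks.append(send_wait.pop(s))
        task = tasks.popleft()
        try:
            why, what = next(task)             # run the task until its next yield
            if why == 'recv':
                recv_wait[what] = task
            elif why == 'send':
                send_wait[what] = task
        except StopIteration:
            pass                               # task finished
```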

problems with coroutines

  • does not solve the GIL CPU core contention

  • does not solve domination by one long CPU-heavy task

  • add the thread pool back?

  • still doesn't solve the long-task problem, because future.result() blocks and is thus not coroutine-friendly

  • now you need a future-wait queue

  • how do you get stuff out of that queue?

  • tasks that wait on futures need a companion task that moves them from the future-wait queue back to the task queue

  • use a socketpair so that you can select on future task completion

  • now we can run a CPU-heavy job without totally killing throughput of a fast job

  • coroutines don't mean you can ignore the GIL

  • you probably still need a thread or multiprocess worker pool

but coroutines are ugly

  • don't want to write explicit coroutine yields?
  • wrap socket in an AsyncSocket class that provides generators for all blocking methods
  • yield from AsyncSocket.accept so you can call it more than once
  • now it looks like threading code again
  • and you basically have asyncio

questions

  • concurrent.futures is slightly easier to use than a multiprocessing pool

  • surprise syntax:

    • a, b, [] = ([1], [2], []) is somehow legal
    • a, b, [] = ([1], [2], [3]) is not: ValueError: too many values to unpack (expected 0)
  • coroutines let you handle many more simultaneous connections than OS threads

  • every async I/O implementation ends up with a select loop somewhere

Python Performance Profiling: The Guts And The Glory

  • https://us.pycon.org/2015/schedule/presentation/400/

  • A. Jesse Jiryu Davis

  • @jessejiryudavis

  • MongoDB engineer

  • pymongo maintainer (Python driver for MongoDB)

  • some guy published an article on DZone about 80,000 MongoDB inserts per second with the node.js driver

  • Python driver clocked at 29,000/sec on same hardware

  • Jesse now has a problem

  • benchmarker inserts 80k docs in 5k-long batches

  • creates a list to stick batches in before inserts

  • starts inserting data, calling datetime! and random!! functions to generate it as needed!

  • appends to end of batch list

  • MongoDB can handle batch inserts of up to 16 MB at a time

  • but Node.js code has same morally objectionable structure, so the perf difference actually is in the language or driver

  • optimization is like debugging

  • don't ask "why is my code slow?"

  • ask "will changing this part of my code make it faster?"

  • hypothesis-experiment cycle with benchmarks

  • warning: optimization generally makes your code harder to read and maintain

  • example: caching layers

  • why is profiling useful?

  • profiling lets you generate hypotheses

  • it is not the experiment

  • profiling affects your code too much to be part of the experiment

  • benchmarks should be on uninstrumented code

  • the profiler that Jesse reaches for first is not cProfile

  • cProfile is "severely overrated"

  • Jesse uses Yappi, third-party profiler by Sümer Cip

  • as fast as cProfile

  • can profile every thread in the app

  • can measure both CPU and wall clock time

  • can export to callgrind format, which cProfile can't

  • can profile builtins

  • still requires code modification to start Yappi and save profiles (see the sketch at the end of this section)

  • KCacheGrind can read callgrind format

  • spending ⅔ of time in Collection.insert in Python MongoDB driver

  • Hypothesis: if pymongo was infinitely fast, it would only match the perf of the Node.js driver

  • Test: replace Collection.insert with del

  • Hypothesis proved by benchmark

  • Possible explanation: V8 has a JIT, CPython doesn't

  • removing datetime stuff gets from 30,000/sec to 50,000/sec

  • years later, updating to PyMongo 3.0 goes from 38k to 59k

  • PyPy (CPython 2.7 compatible build with JIT) gets to 73k

  • try stubbing code out before you try actually optimizing your code

  • the Monary driver bypasses Python and lets NumPy talk directly to Mongo

  • PyMongo is not asyncio: it's a blocking driver for threaded apps

  • Jesse wrote an async Tornado driver called Motor
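
A minimal Yappi sketch along the lines described above (run_benchmark is a hypothetical workload; check Yappi's docs for the exact API of your version):

```python
import yappi

yappi.set_clock_type('wall')      # or 'cpu'
yappi.start(builtins=True)        # profile builtins too
run_benchmark()                   # hypothetical workload under test
yappi.stop()
# Save in callgrind format so KCacheGrind can open it:
yappi.get_func_stats().save('callgrind.out.bench', type='callgrind')
```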

scikit-learn for street maps

  • Michelle Fulwood

  • Twitter: @michelleful

  • grad student

  • interested in classifying national origins of Singapore's street names

  • visualizing clusters of roads

  • color-coding roads by national origin of name

  • we need:

    • locations of roads, from OpenStreetMap
    • linguistic classification, done by machine learning

Wrangling geodata with GeoPandas

  • original data in GeoJSON
  • hierarchical dictionary format: pandas hates it
  • geopandas can translate many formats into GeoDataFrames
  • 60k roads. Is that too many? (yes)
  • geopandas can plot geodata with df.plot()
  • looks like many of the roads are outside Singapore
  • we can use the within function to clip a dataset to a geometric boundary
  • standard pandas functions are available:
    • filter out empty road names
    • filter out roads that are not on a list of accepted types (yes to highway, no to footpath)
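
A hedged GeoPandas sketch of those steps (the file name, singapore_boundary polygon, and accepted_types list are placeholders):

```python
import geopandas as gpd

roads = gpd.read_file('singapore_roads.geojson')            # GeoJSON -> GeoDataFrame
roads = roads[roads.geometry.within(singapore_boundary)]    # clip to a boundary polygon (placeholder)
roads = roads[roads['name'].notnull()]                      # drop unnamed roads
roads = roads[roads['highway'].isin(accepted_types)]        # keep only accepted road types (placeholder list)
roads.plot()
```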

Classifying with scikit-learn

supervised classification

  • we need:
    • a set of labels
    • a set of features
    • a labeled train and test set
  • sklearn provides train_test_split function
  • features: n-gram letter frequency for 1-, 2-, and 3-grams
  • sklearn.feature_extraction.text.CountVectorizer does n-grams for multiple ns at once
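
A short sketch of the feature extraction and split (road_names and labels are hypothetical; train_test_split lives in sklearn.model_selection in newer releases):

```python
from sklearn.cross_validation import train_test_split       # sklearn.model_selection in newer releases
from sklearn.feature_extraction.text import CountVectorizer

# Character n-grams of length 1-3 as features.
vec = CountVectorizer(analyzer='char', ngram_range=(1, 3))
X = vec.fit_transform(road_names)                            # hypothetical list of road names
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
```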

selecting a classifier

  • see Scikit's cheat sheet
  • chosen: linear support vector classification (SVC)
  • selected accuracy score as metric for model performance (vs ROC AUC, etc.: scikit supports a lot of metrics)
  • wanted to minimize hand correction of classifier

initial results

  • first result: right 55% of the time

  • random chance: 16.6% (over six linguistic categories)

  • not terrible, could be better

  • scikit-learn makes it very easy to swap classifiers

adding features

  • see A Few Useful Things to Know About Machine Learning
  • feature choice is huge factor in machine learning project success
  • new features:
    • number of words
    • avg length of word
    • are all the words in a language dictionary?
    • are road tags Malay? (Street, Road vs Jalan, Lorong)

Pipelines

  • minimize repetitive feature extraction code by using sklearn.pipeline
  • add ngram and SVC stages to a new pipeline object
  • don't need to run fit_transform/transform on train/test sets
  • just feed data into the pipeline

Make your own pipeline stage

  • inherit from BaseEstimator and TransformerMixin

  • implement transform, stub out fit

  • use FeatureUnion to run feature extraction stages in parallel!

  • reached 65% accuracy with new features

  • pipelines and FeatureUnion don't improve performance much, but the code is much easier to read
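
A hedged sketch of the pipeline with a hand-rolled transformer and FeatureUnion (the WordCount feature and the training data names are invented):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

class WordCount(BaseEstimator, TransformerMixin):
    """Example custom stage: number of words in each road name."""
    def fit(self, X, y=None):
        return self                                    # nothing to learn
    def transform(self, X):
        return np.array([[len(name.split())] for name in X])

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngrams', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
        ('word_count', WordCount()),
    ])),
    ('clf', LinearSVC()),
])
pipeline.fit(road_names_train, labels_train)           # hypothetical training data
print(pipeline.score(road_names_test, labels_test))    # hypothetical test data
```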

Hyperparameter tuning

  • default parameters of the SVC, etc. model classes
  • GridSearchCV does an exhaustive (brute-force) search over a grid of hyperparameter values (see the sketch at the end of this section)
  • reached 68% accuracy with new hyperparameters
  • when doing your own hyperparameter tuning:
    • read the papers
    • use the hyperparams listed first in the docs
    • go see what other people are using on Github
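
A short GridSearchCV sketch over a small pipeline (training data names are hypothetical; GridSearchCV lives in sklearn.model_selection in newer releases):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.grid_search import GridSearchCV        # sklearn.model_selection in newer releases
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('ngrams', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', LinearSVC()),
])

# Exhaustively try every combination in the grid with cross-validation.
param_grid = {
    'ngrams__ngram_range': [(1, 2), (1, 3), (1, 4)],
    'clf__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(road_names_train, labels_train)           # hypothetical training data
print(search.best_params_, search.best_score_)
```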

Further neat tricks

  • mplleaflet turns matplotlib output into Leaflet.js zoomable tiled maps

streamparse: real-time streams with Python and Apache Storm

Storm topology

  • Storm abstractions: tuples, spouts, bolts, topology (DAG)
  • tuples are basically DB rows, and have schemas
  • spout: data source
  • bolt: computation node
    • can ack a tuple
    • fail a tuple
    • emit new tuples

Storm internals

  • Manning Press book: Storm Applied
  • tuple tree: bolt can explode a tuple into a bunch of derived tuples, but Storm keeps track of the origin of the child tuples
  • used by reliability features: guaranteed processing
  • Storm HA: Nimbus is the cluster supervisor, coordinates through ZooKeeper, and is responsible for uploading code to workers
  • allocates Python code to Python slots on physical worker nodes

Python + Storm

  • Storm "multi-lang" protocol is JSON over pipes or something
  • 1 Python process per Storm task
  • a lot of Python↔︎JVM data interchange
  • Storm bundles storm.py but it's not good: assumes Storm packaging, meaning you put your stuff in a JAR

streamparse

  • 3 paid Parse.ly maintainers + DARPA funding

  • pip install streamparse

  • sparse CLI tool creates streamparse boilerplate (like Django)

  • sparse also requires Leiningen to set up a cluster for you

  • sparse functions:

    • make virtualenvs on workers
    • package Python code as Storm JAR
    • talks to Nimbus to deploy your topology
  • replaces storm.py

  • supports Python 3.4 and PyPy

  • Storm has a DSL for topology setup: make a Python spout, make a Python bolt, run this bolt on two workers, etc.

  • Storm grouping can make sure that certain tuples ("dog") always get routed to the same nodes, so that the dog count doesn't end up on two workers
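
A hedged sketch of what a Python bolt looks like in streamparse's style of the time (the import path and base-class hooks vary across streamparse versions; the word-count example is illustrative):

```python
from collections import Counter
from streamparse.bolt import Bolt   # "from streamparse import Bolt" in later versions

class WordCountBolt(Bolt):
    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        self.emit([word, self.counts[word]])
```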

PyKafka

  • Kafka traditionally hasn't had a great Python library
  • don't tell the KIXEYE people who used to maintain kafka-python ;)
  • Kafka makes a good Storm spout

Questions

  • does streamparse handle packaging Python dependencies in the Storm JAR?

  • no it doesn't. relies on SSH configuration of its persistent virtualenvs.

  • somewhat redundant with Yelp's pyleus

  • How do you debug/log/diagnose Python bolts?

  • streamparse redirects sys.stdout to Python logging (so prints don't corrupt the multi-lang protocol data), so use the logging module for output

  • sparse tail can tail the logs from each worker

  • needs improvement

What can programmers learn from pilots?

  • Andrew Godwin

  • fail soft if related to external controls

  • fail hard otherwise

  • either way, make some noise!

  • but not a continuous ignorable noise

  • or you'll just get used to errors and not handle them

  • aircraft testing results in known statistical limits for components

  • and combinations of components

  • don't rely on automatic failover

  • make sure you can always manually swap to a spare

  • checklists for EVERYTHING

  • automation should be best for the worst cases, because that's when you'll need to get it right

  • ex: database failover

  • aviate, navigate, communicate

  • first fly the plane

  • then fly the plane somewhere

  • only then, talk to ATC

  • don't get distracted by communication

  • emergencies have priority on radio channels

  • if Air Force 1 is on the same channel, and you have an emergency, they have to shut up

  • know your critical features (Eventbrite ex: tickets)

  • know what you can sacrifice if necessary (analytics)

  • single person should make decisions

  • but don't ignore your copilot or flight crew

  • postmortems:

  • there are always multiple factors to an accident

  • blaming someone solves nothing, and can distract the investigation

  • planes don't do deadlines

  • they always carry extra fuel

  • always have a plan or some buffer to deal with unknown problems

  • don't ship crap code to meet the deadline, you'll just spend weeks fixing it

  • don't be a hero

  • ops are like pilots: hours of boredom punctuated by moments of terror

What can Python learn from Erlang?

  • Benoit Chesneau

  • https://github.com/benoitc

  • https://twitter.com/benoitc

  • author of gunicorn

  • https://speakerdeck.com/benoitc/what-python-can-learn-from-erlang

  • Erlang is concurrent

  • CPython is single-threaded

  • sounds silly at first

  • but concurrency is about passing messages and hiding internals

  • this talk is about reliability

  • if your program isn't reliable, it won't perform

  • a reliable program

    • stays alive
    • resistant to failures
    • recovers from failures
    • supports hot upgrades
  • if processes are isolated and don't share memory

  • then a failed process doesn't necessarily take down other processes with it

  • can't corrupt shared data

  • other processes can detect crashed processes and restart them

  • Python often uses supervisord or other supervisor-like processes

  • supervisor doesn't know where your process crashed, so it can't restart without discarding partial progress

  • Python must use OS processes to achieve isolation

  • processes that crash should try to handle the fatal exception by writing out any work in progress so their successors can pick up

  • immutable data can be shared all you want

  • the real world is mutable

  • but you can take immutable snapshots

  • use functional data structures like FunkTown

  • be specific about the exceptions you catch

  • don't catch anything you can't handle by writing out partial state (or logging)

  • just crash and let the supervisor restart you

  • don't use exceptions for control flow

  • use pattern matching on Either types

  • patterns implementation for Python
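
A minimal sketch of the advice above: catch only what you can actually handle, persist work in progress, and let anything else crash so a supervisor restarts the process (process and save_checkpoint are hypothetical):

```python
def worker_loop(jobs):
    for job in jobs:
        try:
            process(job)                   # hypothetical unit of work
        except IOError:
            # A specific, recoverable failure: write out partial state for a successor,
            # then crash and let the supervisor (e.g. supervisord) restart us.
            save_checkpoint(job)           # hypothetical persistence helper
            raise
```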
