@brucecrevensten, created July 19, 2012
OSCON notes

OSCON 2012

Notes.

Other presentations/slides: http://www.oscon.com/oscon2012/public/schedule/proceedings

Data science in R

http://courses.had.co.nz/12-oscon/

  • problem solving: name it, google it: "the hard part is knowing what the problem is called"
  • strength of visualization: "allows you to see things you did not expect", understand problems, make the questions precise. Weakness: "there's a human in the loop"
  • flip side of visualization: once you have a sufficiently clear question about the data, you can write an algorithm to explore data naturally.

notes on the R language

  • named arguments: c(x = x, y = y) returns a named vector; with x = 5 and y = 10 it prints the names above the values: x y / 5 10 (see the sketch after this list)
  • escaping scope: assignment with <<- bubbles up through enclosing environments towards global.
  • R implicitly returns the result of the last expression, no explicit return needed.
  • you can examine 'environments' to do some amount of reflection
  • "(" <- function(a) a + 1
  • $ === . (scoping operator)
  • str() means "structure", what is the object?
  • ggmap / ggplot2 <- useful libs for maps/plotting.
  • tracemem()
  • readRDS(), saveRDS() <- serialization
  • split/apply/combine -- ddply
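A quick R sketch of a few of these notes (a minimal sketch; mtcars is a built-in dataset, the filename is made up):

    # named arguments produce a named vector
    x <- 5; y <- 10
    c(x = x, y = y)   # prints names over values: x y / 5 10

    str(mtcars)       # "structure": what is this object?

    # serialization round-trip with saveRDS()/readRDS()
    saveRDS(mtcars, "cars.rds")
    cars <- readRDS("cars.rds")

    # split/apply/combine with plyr's ddply
    library(plyr)
    ddply(mtcars, "cyl", summarise, mean_mpg = mean(mpg))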

Canvas Deep Dive

This tutorial just walked through the book: http://projects.joshy.org/books/canvasdeepdive/

Postgres configuration and ops considerations

Author's site: thebuild.com

Nomenclature

  • a database is a set of tables & schema objects,
  • mysql "databases" === postgres "schemas"
  • "cluster" means "collection of databases"
  • pg_ctl: start/stop
  • $PGBASE
  • postgresql.conf
  • security: "role" === object that can own other objects and that has privileges
  • "user" === role that can log into the system; otherwise synonyms
  • important params in config: logging, memory, checkpoints, planner; done.

Tuning

  • use SSDs; the transaction log can go onto spinning disk if you want. Use server-grade SSDs that can flush memory gracefully in case of power failure.

Tuning logging

  • change logging first to get the data you need:
  • where to log? syslog, or, use CSV format to files.
  • log_min_duration_statement = 250 (ms) <-- important! for finding slow statements (see the config sketch after this list)
  • log_lock_waits = on <-- if anything's waiting for locks, log
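A minimal postgresql.conf sketch of the logging setup above (starting values, not gospel):

    log_destination = 'csvlog'         # CSV format to files; syslog also works
    logging_collector = on
    log_min_duration_statement = 250   # log statements slower than 250 ms
    log_lock_waits = on                # log anything stuck waiting on a lock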

Tuning memory

memory configs:

  • don't run other servers on the same machine
  • shared_buffers -- 8GB if the system has >= 32GB of RAM, otherwise 25% of total system memory.
  • check the SHMMAX and SHMALL kernel parameters -- calculate shared memory in bytes + 20% (a huge decimal number), then set both with sysctl (see the sketch below; shmall is counted in 4096-byte pages, hence the division by 4096).
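A sketch of the kernel tweak, with illustrative numbers (8GB of shared memory plus ~20% headroom):

    # shared memory in bytes, plus ~20%
    sudo sysctl -w kernel.shmmax=10307921510
    # shmall is measured in 4096-byte pages: shmmax / 4096
    sudo sysctl -w kernel.shmall=2516583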

work_mem:

  • start low: 32-64MB.
  • look for log lines about "temp files" to see whether queries are spilling sorts/hashes to disk.
  • set to 2 - 3x the largest temp file you see. (why bigger? the on-disk format is more compact than the same data held in RAM.)
  • don't exceed 5-10% of system RAM, because it's the amount of memory allocated per planner node -- a single query can consume it several times over.
  • can be configured per-session (sketch below)
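The per-session override, as a minimal SQL sketch:

    -- bump work_mem for one expensive reporting query only
    SET work_mem = '256MB';
    -- ... run the big query ...
    RESET work_mem;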

maintenance_work_mem

  • 10% of sysram up to 1GB
  • if you have vacuum problems, you may need more.

effective_cache_size

  • set to amount of file system cache available.
  • no idea what to set? set it to 50% system RAM
  • this isn't allocated, it's a hint to the planner to decide how much RAM is available.

General points:

  • random_page_cost && work_mem === biggest performance gains
  • prefer xfs or ext4 on linux.

Checkpoint tuning

a complete flush of dirty buffers to disk. Two parameters control when this happens:

  • number of WAL segments written to disk,
  • whenever a timeout occurs.

tuning these parameters:

  • wal_buffers # 16mb
  • checkpoint_completion_target # 0.9
  • checkopoint_timeout # 10m-30m # longer to start, less IO while operating
  • checkpoint_segments # 32

monitoring:

  • look for checkpoint entries in log,
  • happening how often? more often than checkpoint_timeout? if so, the WAL segments are being exhausted too quickly, so bump up checkpoint_segments until checkpoints are less frequent than the timeout

Planner tuning

  • effective_io_concurrency: set to the number of IO channels, otherwise ignore; if an SSD array with 32 channels, set to 32, etc.
  • random_page_cost: the ratio between the time to grab a random disk page vs. a sequential one; it controls index vs. sequential scans (lower favors indexes). 3.0 is typical for RAID10, 2.0 for SAN, 1.1 for Amazon EBS, ~1.5 for SSDs.

do not touch

  • fsync = on. never change this. it controls whether postgres flushes to disk and then waits for the result.
  • synchronous_commit = on; you can turn this off (a crash may lose the most recent transactions, but won't corrupt the DB);

Concepts

Write-ahead log (WAL)

  • continuous stream of committed database modifications, broken into 16MB segments
  • starts with DB cluster creation, lasts forever
  • checkpoints mean "last consistent state"; WAL segments from before that checkpoint can be thrown away
  • PUT IT ON ITS OWN FILE SYSTEM because it's append-only, basically. own set of disks, etc. it stays put.

MVCC (multiversion concurrency control)

  • helps prevent locking; an alternative to pessimistic locking that allows higher performance
  • writers don't block readers, readers get old version of row.
  • writers block writers to the same row.
  • multiple versions of row may be in DB; deleted/updated aren't immediately removed.
  • VACUUM cleans up tuples (rows) that are no longer visible to any transaction.
  • post 8.0, autovacuum runs automatically; still a good idea to do a manual vacuum after bulk update/delete operations
  • ANALYZE regenerates table stats to help the planner make good guesses about how to execute queries; always do this after major database changes, such as a restore from backup.
  • "share" vs "exclusive" locks exist.
  • surprising locks: adding a new non-NULL column takes a table-level lock -- avoid by creating the column NULL-able, then backfilling and altering it later

Transaction modes

  • read committed
  • repeatable read
  • serializable

Schema design & operations

  • keep data in normal form, don't fear joins
  • "fast/slow" rule: "fast data" changes a lot, "slow data" infrequently -- put these into separate tables. Slow data tends to be the parent of other data via foreign keys.
  • some indexing strategies (see the SQL sketch after this list): -- an index should be selective, in the sense that when it is used it should return a small number of rows. -- partial index: an index that only applies to certain/specific conditions (along the lines of "index ... where (clause)") -- an index should be frequently used.
  • drop unused indexes. create indexes on the basis of real-life needs, and look for sequential scans
  • built-in views to check indexes: -- pg_stat_user_tables: how many times a sequential scan has been done; -- pg_stat_user_indexes: how often an index has been used.
  • SELECT COUNT(*) FROM myHugeTable is implemented as a full table scan. Don't do it. An approximate count is available from the planner stats (pg_class.reltuples), but try to avoid counting at all; it's not a fast operation on Postgres.
  • taming autovacuum -- you can cut down the number of workers, make it run more frequently, etc.; see its sections in the configuration file.
  • bulk loading: use COPY, not INSERT.
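A SQL sketch of the indexing points above (the table and column names are hypothetical):

    -- partial index: only index the rows the hot queries actually touch
    CREATE INDEX orders_open_idx ON orders (created_at) WHERE status = 'open';

    -- tables getting lots of sequential scans (index candidates)
    SELECT relname, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY seq_scan DESC;

    -- indexes that are never used (candidates to drop)
    SELECT indexrelname, idx_scan FROM pg_stat_user_indexes WHERE idx_scan = 0;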

Debugging

"this query is slow"

  • EXPLAIN or EXPLAIN ANALYZE -- gets the query plan
  • http://explain.depesz.com
  • a large gap between estimated and actual rows returned means the planner's confused (example below)
  • nested loops often mean joins that you can't use an index for
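For example (the table name is hypothetical):

    EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'open';
    -- on each plan node, compare the planner's estimated rows with the actual rows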

"the DB is slow"

  • pg_stat_activity -- is it waiting on a query? etc
  • tail -f logs
  • pg_locks, in connection with pg_stat_activity (sketch below).
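A sketch of finding blocked queries (column names follow current Postgres; older versions spell some of them differently, e.g. procpid):

    SELECT l.pid, l.mode, l.granted, a.query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE NOT l.granted;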

System/network

  • cloud hosting has terrible IO; since DBs are IO-bound, you want to get as much RAM as you can (up to 2x DB size); CPU capacity isn't as important as RAM; always replicate.
  • store configurations in VCS
  • our-own-hardware: -- get SSDs, otherwise SAS drives; -- RAID10; -- put pg_xlog on its own volume; -- move pg_stat_tmp to a RAMdisk if you want to (transient data, write-intensive)

if you have only a small SSD, put your busiest indexes / tables onto it.

monitoring

  • nagios: disk, cpu, mem, (if used) replication log
  • "checkpostgres.pl" from Bucardo.org <- use
  • pgAdmin3 for management, handy
  • log analysis: pgFouine (Traditional, not maintained much); pgbadger (new, active).

Open source web mapping

Technologies in play:

some notes

  • mapnik - C++ lib, has node bindings!
  • avoid maps as single lock-in point (Google) or point of failure or ugly jarring clashing design, etc.
  • open data -- osm.org, naturalearthdata.com, us census, local governments.
  • example: http://npr.org/censusmap/ -- shows chart interpolation on map

Keynotes, Wednesday July 18

  • leaders set norms for communities
  • open source relies on its social capital
  • four strategies to use in the course of technical conversation: inquiry, paraphrase, acknowledge, advocate.
  • axes of understanding and learning: perception vs. imagination, emotional vs. analytical.

Hadoop 2.0

Javascript library overviews

  • seek to modularize use of jQuery
  • consider the mobile audience with respect to JavaScript performance and optimization
  • three alternatives: jQuery alternatives; JavaScript MVC; JavaScript alternatives
  • jQuery alternative -- useful for mobile -- Zepto.js - tries to match jQuery syntax, subset of features, mobile focus (esp. Mobile WebKit).
  • hello again, Backbone.js

Effective code review

"Do it."

why?

  • you write better code when you know it's going to be reviewed.
  • defects vs. bugs
  • helps more than one person understand the code well
  • makes you a better developer -- more reading, writing, and comprehension
  • both newer and more experienced developers benefit
  • gives real status updates
  • builds trust and morale
  • selling code review to others: easier than unit testing; bottom-up approach (costs nothing for over-the-shoulder).
  • if you're writing code, you should be reviewing code (it's for everyone).
  • differentiate perhaps between "here's where the architecture goes," and "how about this specific code?"
  • "coverage of reviewed code" . . .

important things to discuss during code review

  • bad design
  • lack of clarity: easy to read, easy to understand code.
  • conformity: style adherence
  • performance hazards: IO, memory leaks, object literals in JavaScript, etc.

unimportant things

  • optimization (vs. performance).
  • skill/experience gaps -- "something folks tend to fret about" -- letting less experienced coders into main areas is useful
  • personal style

integration into development cycle?

  • when? when it's committed? ad-hoc at the time? review meetings (weekly)?
  • persistence: over-the-shoulder? wiki, mailing list? watch for patterns with respect
  • tools: gerrit, fisheye
  • geographically dispersed teams: helps async teams, builds cohesive codebase,

Sensor Network Data Collection and Storage

  • What are sensor networks? An association of sensors to monitor an event or conditions. Wired (lab, manufacturing); wireless (environmental monitoring, security)
  • https://launchpad.net/mysql-arduino
  • Data nodes vs. sensor nodes: data nodes are more complex, store the data, and mix types; sensor nodes generally don't store or process, and just have a single value/type of thing.
  • Collector nodes: collect, parse, store, or transmit the data.
  • Pachube, nimbits, ThingSpeak, Digi, Sensor Cloud (post to cloud-based services)
  • MySQL Connector/Arduino > dump directly to a database
  • Home automation with recorded history of events, visualization
  • uses xbee wireless for sensor nodes

Hypermedia URLs

https://speakerdeck.com/u/steveklabnik/p/oscon-2012-designing-hypermedia-apis

http://coderwall.com/p/xvzu-g

The speaker's previous work with learning/tech: Learning Ruby: JumpstartLab, HungryAcademy

  • consider using curl during development
  • communicate the messages in JSON ('cos why not?)
  • "build your application to respect the fundamental architecture of the web."
  • anarchy as a motivation!
  • Respect HTTP, use a hypermedia type. (stateless) hypermedia as engine of application state.
  • hypermedia types - RFC5988 - web linking, relations. rel="whut." "PROFILE" link relationship, additional semantics to an endpoint.
  • Adding this profile information gives you a sort of hypermedia type even though the underlying content is JSON, which isn't quite hypermedia (see the sketch after this list)
  • Collection + JSON / HAL is another one.
  • Determinism: state machines for application state
  • Media types are dynamic contracts between client and server -- what processing services the server offers, and how the client consumes and interacts.
  • Consider Mechanize (perl/ruby) or web scrapers to help with client development.
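A minimal sketch of "JSON plus link relations," in the style of HAL (all values made up):

    {
      "_links": {
        "self":    { "href": "/orders/523" },
        "profile": { "href": "http://example.com/rels/order" },
        "next":    { "href": "/orders/524" }
      },
      "status": "shipped",
      "total": 42.00
    }

The client navigates by following rels instead of hardcoding URLs -- that's hypermedia as the engine of application state.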

Keynotes

Canonical/ubuntu speaker:

  • juju -- Amazon re: scaling
  • ubuntu 12.10 - HUD

Digging into Open Data

http://assets.en.oreilly.com/1/event/80/Digging%20into%20Open%20Data%20Presentation%202.pdf

  • "public" /= "open" data -- could have copyrights, patents, trademarks, restricted licenses, etc.
  • "open data" is accessible without limitations on entity or intent, in a digital, machine-readable format; free of restriction or use or redistribution in its licensing conditions.
  • "open" != "exempt" -- verify the data use policies of sources (citations, attributions).
  • some unexpected open sources: "open" != "government". Publications (The Guardian, WSJ, NYT, The Economist); Companies (GE, Yahoo, Nike, Mint, Trulia); Academia (Carnegie Mellon DASL, Berkeley Data Lab, MIT Open Data Library).
  • "politilines" -- example of data visualization(?) -- as an example. What's the process for using this stuff?
  • finding data: gov sites, commercial data markets, http://thedatahub.org, open science data (http://oad.simmons.edu/oadwiki/Data_repositories). Research time = liberal estimate * 5
  • scraping data: consider Dapper, Google, ScraperWiki.
  • python is the language of choice: urllib2, requests, mechanize; html5lib, lxml, BeautifulSoup (see the sketch after this list)
  • nltk - Natural Language Toolkit
  • Cleaning data: Google Refine, Data Wrangler, ParseNIP, python, SQL
  • Visualizing: R, D3, Many Eyes, Swivel
  • Some business considerations: data timeliness, thinking ahead in terms of the stability of open data, ins/outs of rolling your own parsing scripts; screen-scraping makes some challenges for maintenance of scripts.
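A minimal sketch of the scraping stack named above (the URL and table structure are placeholders):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/some-data-table").text
    soup = BeautifulSoup(html, "html.parser")   # html5lib or lxml also work as parsers
    # pull each table row out as a list of cell strings
    rows = [[td.get_text() for td in tr.find_all("td")]
            for tr in soup.find_all("tr")]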

Node.js in production: Debugging and performance analysis

http://assets.en.oreilly.com/1/event/80/Node_js%20in%20Production_%20Postmortem%20Debugging%20and%20Performance%20Analysis%20Presentation.pdf

David Pacheco @ Joyent

Scenario: hung aggregator. How do you debug it?

  • Check the logs? Check the syscall activity with truss or strace?
  • GDB to check the thread stacks? A mess: there's the node + V8 scaffolding, but then... no.
  • We can add more logging... but no way of introspecting it.
  • Node.js can perhaps connect to remote node instances?

More generic debugging notes

  • add more instrumentation (console.log()). Downsides: you lose credibility when you redeploy over and over, redeploying carries some risk, and performance can suffer. If you're lucky or the problem is pretty simple, this can work OK.
  • better: for C programs, when the program crashes (or on demand) you can create a core file, then you can use a debugger to inspect the system state. Can this work for node.js?
  • The problem is that few dynamic environments have produced rich toolsets for introspecting program execution. The tools we use for C aren't useful here.
  • In order for this to work, we need to translate the native abstractions (symbols, functions, structs) into JavaScript counterparts (variables, Functions, Objects);
  • some abstractions don't even exist explicitly in the language itself (e.g., Javascript's event queue).
  • mdb_v8: postmortem debugging for Node. Based on MDB; prints call stacks including JS functions/args; given a pointer, prints it as a C++ object AND its JS counterpart; scans the heap to see what instances of object types exist (sketch after this list).
  • check restify, a node module for REST interfaces
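A rough sketch of the core-file workflow on illumos/SmartOS (the dcmds are real mdb_v8 commands; the pid and filename are illustrative):

    # grab a core from the running node process without killing it
    gcore 12345
    # open the core in mdb and load the V8 helpers
    mdb core.12345
    > ::load v8
    > ::jsstack          # JS-aware stack traces
    > ::findjsobjects    # what object shapes exist on the heap?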

Need to get the slides from this; it had some very sophisticated tooling that runs on (at least) illumos for gathering low-level information and profiling using flame graphs.

Running: hacking the body

  • joining a club as a kind of injury avoidance
  • the secret: run faster. You can generally run faster at the same fitness level.
  • interval training. Shorter bursts of higher intensity.
  • Amdahl's law: if you have two chunks, the maximum speed-up depends on the proportion of total time taken by the part you're trying to optimize.
  • running is done by muscular contraction; ATP production -- glycolysis from glycogen and oxygen (aerobic) and anaerobic respiration (no O2).
  • blood: flow is impacted by the volume of your heart; oxygen-carrying capacity is determined by the hematocrit / haemoglobin levels.
  • we can optimize the lungs, heart, and liver.
  • we can't optimize: age, maximum heart rate (tied to age)
  • we can optimize ATP creation (O2 concentration in blood, heart rate/stroke volume; lung capacity; glycogen stocks); ATP consumption (strength, weight, "running economy" (form), lactic tolerance (pain thresholds)).
  • drinking alcohol impacts ability to store glycogen.
  • strength: skipping, hill-climbing or steps; holding a plank to failure is a good measure of overall core strength.
  • running economy: "form", strength, and suppleness. Injury prevention: no more than 10% increase week-to-week; every 3rd week, decrease 10%; be aware of intensity.
  • stretching: stretching the soleus is important (missed in calf stretches). Stretching is important for warming up muscles (warm-up exercise); lengthening muscles (extending); suppleness.
  • heel strike, forefoot strike; converting to forefoot strike: "hundred up".
  • Cardiovascular efficiency: VO2 max. The goal is to maximize this number. You can't increase max heart rate, but you can increase VO2 max. How to measure? Get it done by medical staff. But we can measure vVO2 max (the velocity at which you attain VO2 max) ourselves. "Beep test": without warmup, run between cones a fixed distance apart, keeping pace with the beeps. "Semi Cooper": warm up, then run fast for 6 minutes.
  • Interval training: shorter intervals = more reps; short rest periods; 10-15 minutes at vVO2 max; time is the critical factor.
  • Prerequisite: you must be able to run 45 minutes at a steady jog a couple of times per week; a session is 15 min warmup, 20 intense, 10 cooldown. Time and intensity are what's important -- not too fast/too slow, and not distance.
  • Examples of intervals: 60/60, 30/30, 20/20 (time based). Phases: 2-week cycles, each cycle with a focus: endurance, speed, race preparation, taper and recovery. Top tip: join a club.

Twitter bootstrap

Some libraries by @fat: Bootstrap, Ender, Hogan.js, MooTools-flot, Stache, snapysnap.

@fat is a very funny twerpy nerd.

Karel Capek / R.U.R. > science, godlessness, robots > TJ Holowaychuk (javascript) === Stylus, Mocha, Express. Github replaces notifications with infinity symbols if you have too many! Tons of traffic about issues, to wit:

http://www.github.com/necolas/issue-guidelines

(from issues) "we get so overrun that we forget to innovate" The Dark Thesis: "I can close 50+ issues without committing a single line of code." Other solutions: some projects add contributors who only manage tickets, but it's hard to find people who want to do it and who are the right kind of people to do that work. Some projects moved ticketing off github network, but then you have inconsistent ticket implementation/locations.

"what if we can clone ourselves?" this is the "Old Rossom" approach -- see Capek. Or, "what we focused on really simple tasks (young Rossum)"? A bot which would implement Necolas's issue-guidelines.

instead, build something "universal"; researching "bots" (chatterbots, spambots, botnets, gaming bots, votebots) suggests that "bots are scripts that run automated tasks on the web" -- which is exactly what's wanted. Enter "Haunt," something similar to Rossum's protoplasm: a node module for creating robots || services. It lets you run unit tests against issues and pull requests, then make decisions about closing, tagging, and commenting -- programmatically.

check out: http://git.io/haunt

How do developers learn?

  • presenter is interested in craftspeople, not just the people who write code but don't especially care.
  • learning: it's a tarp! Codecademy, Google courseware,
  • two kinds of learning: a good coder learning a new language, vs. when you need to learn something right now (but don't really need to learn a new skill )
  • vocabularies differ depending on dev perspective

learning for work -- hottest topics:

  • Drupal & wordpress, hot and on fire
  • Python, Ruby, Javascript

Notes regarding tech learning

  • there's an implicit assumption that the most recent content is the best content. This reflects the theme: stop and think about the environment you're in, before racing to get the answers.
  • Some search terms are "false friends," showing up more often than expected in statistical analysis of word frequency in books.
  • when do people use these devices for learning? more use ipads (higher % than mobile/non-ipad); largely at night.
  • younger folks tend to prefer video/screencasts (learning from video)
  • github is a valid learning resource; github has a "real" search with actual search semantics. An approach: "path:" qualifiers and NOT operators; check out Github's search docs for more details.
  • We often feel as though we must finish books: instead, ignore the end of the book and just move on when you feel it's been of enough value for you.

Keynotes, Friday

How good is your internet? Chris DiBona, Google

Data visualization with Clojurescript

http://keminglabs.com/talks/kevin_lynagh_web_data_visualization_OSCON_2012_slides.pdf

http://keminglabs.com/talks/kevin_lynagh_web_data_visualization_OSCON_2012_handout.pdf

  • Clojurescript - compiles Clojure to Javascript
  • "treat your data like data", "it's better to have 100 functions operate on one data structure than 10 functions on 10 data structures. [alan perlis]. with OO style stuff we tend to encapsulate data, so we do a lot of work to get things into a box, then a lot of work to get it out again.
  • consider doing stuff without using the DOM as much as possible, since that makes it easier to manipulate and test against data without requiring headless browser or other strange things.
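A tiny ClojureScript sketch of the "plain data" point (values made up):

    ;; a chart series is just a vector of maps -- no wrapper classes
    (def points [{:x 1 :y 2} {:x 2 :y 4} {:x 3 :y 9}])
    ;; any core function applies directly; nothing to box or unbox
    (apply max (map :y points))   ;; => 9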

Components with HTML5

  • Google Web Toolkit -- consider
  • Vaadin -- https://vaadin.com/home#intro
  • Design based on needs, and think about the user experience implied by the use of that component: can you achieve the use case you are heading towards with existing components?
  • make a list of real, quantifiable UX requirements
  • https://jojule.github.com

Feedback for presenters

  • surprises are OK, but don't show features/tools that don't work yet. "upcoming revision" == "waste of my time today."
  • don't read to me from your book.
  • specify the audience and be aware when your talk isn't sticking to its billing -- this one seemed like a beginners' talk, but wasn't.
  • don't depend on the network.
  • not much typing
  • don't use music that has vocals in it if your presentation includes sound

Geospatial notes trying to bend data around

make a NODATA band transparent

Apply the color relief (gdaldem), then warp -- the warp changes the transparency (can probably be made to work, but this works too)? (This didn't quite work a second time; not sure why.)

  1. gdaldem color-relief -alpha input.tif ramp.txt colorized.tif
  2. gdalwarp -s_srs EPSG:3338 -t_srs EPSG:3785 -r bilinear input.tif output-reprojected.tif

ramp.txt (each line is an elevation value followed by R,G,B,A; "nv" is the NoData value):

nv 0,0,0,0
0 237,248,251,255
270 178,226,226,255
280 102,194,164,255
290 44,162,95,255
365 0,109,44,255

making contours

  1. gdal_contour -a dof final.tif contours50.shp -i 50
  2. ogr2ogr final.shp contours50.shp -t_srs EPSG:900913

some raster computations to isolate data

gdal_calc.py -A dof_5modelAvg_sresb1_2090_2099.tif -B dof_5modelAvg_sresb1_2010_2019.tif --outfile=week.tif --calc="A*((A-B) > 7)" --NoDataValue=0
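(A reading of that expression, as I understand it: the output keeps A's value wherever A exceeds B by more than 7, and everything else becomes 0, which is declared as NoData.)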
