@brucecrevensten, created July 19, 2012
OSCON notes

OSCON 2012

Notes.

Other presentations/slides: http://www.oscon.com/oscon2012/public/schedule/proceedings

Data science in R

http://courses.had.co.nz/12-oscon/

  • problem solving: name it, google it: "the hard part is knowing what the problem is called"
  • strength of visualization: "allows you to see things you did not expect", understand problems, make the questions precise. Weakness: "there's a human in the loop"
  • flip side of visualization: once you have a sufficiently clear question about the data, you can write an algorithm to explore data naturally.

notes on the R language

  • named arguments: c(x = x, y = y) returns a named vector; with x = 5 and y = 10 it prints the names above the values: x y / 5 10 (see the sketch after this list)
  • escaping scope: assignment with <<- bubbles up through enclosing environments towards global.
  • R implicitly returns the result of the last expression, no explicit return needed.
  • you can examine 'environments' to do some amount of reflection
  • "(" <- function(a) a + 1
  • $ === . (scoping operator)
  • str() means "structure", what is the object?
  • ggmap / ggplot2 <- useful libs for maps/plotting.
  • tracemem()
  • readRDS(), saveRDS() <- serialization
  • split/apply/combine -- ddply
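A quick R sketch of a few of these notes (a minimal sketch; mtcars is a built-in dataset, the filename is made up):

    # named arguments produce a named vector
    x <- 5; y <- 10
    c(x = x, y = y)   # prints names over values: x y / 5 10

    str(mtcars)       # "structure": what is this object?

    # serialization round-trip with saveRDS()/readRDS()
    saveRDS(mtcars, "cars.rds")
    cars <- readRDS("cars.rds")

    # split/apply/combine with plyr's ddply
    library(plyr)
    ddply(mtcars, "cyl", summarise, mean_mpg = mean(mpg))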

Canvas Deep Dive

This tutorial just walked through the book: http://projects.joshy.org/books/canvasdeepdive/

Postgres configuration and ops considerations

Author's site: thebuild.com

Nomenclature

  • a database is a set of tables & schema objects,
  • mysql "databases" === postgres "schemas"
  • "cluster" means "collection of databases"
  • pg_ctl: start/stop
  • $PGBASE
  • postgresql.conf
  • security: "role" === object that can own other objects and that has privileges
  • "user" === role that can log into the system; otherwise synonyms
  • important params in config: logging, memory, checkpoints, planner; done.

Tuning

  • use SSDs; the transaction log can go onto spinning disk if you want. Use server-grade SSDs that can flush memory gracefully in case of power failure.

Tuning logging

  • change logging first to get the data you need:
  • where to log? syslog, or, use CSV format to files.
  • log_min_duration_statement = 250 (ms) <-- important! for finding slow statements (see the config sketch after this list)
  • log_lock_waits = on <-- if anything's waiting for locks, log
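A minimal postgresql.conf sketch of the logging setup above (starting values, not gospel):

    log_destination = 'csvlog'         # CSV format to files; syslog also works
    logging_collector = on
    log_min_duration_statement = 250   # log statements slower than 250 ms
    log_lock_waits = on                # log anything stuck waiting on a lock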

Tuning memory

memory configs:

  • don't run other servers on the same machine
  • shared_buffers -- 8GB if the system has >= 32GB of RAM, otherwise 25% of total system memory.
  • check the SHMMAX and SHMALL kernel parameters -- calculate shared memory in bytes + 20% (a huge decimal number), then set both with sysctl (see the sketch below; shmall is counted in 4096-byte pages, hence the division by 4096).
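A sketch of the kernel tweak, with illustrative numbers (8GB of shared memory plus ~20% headroom):

    # shared memory in bytes, plus ~20%
    sudo sysctl -w kernel.shmmax=10307921510
    # shmall is measured in 4096-byte pages: shmmax / 4096
    sudo sysctl -w kernel.shmall=2516583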

work_mem:

  • start low: 32-64MB.
  • look for log lines about "temp files" to see whether queries are spilling sorts/hashes to disk.
  • set to 2 - 3x the largest temp file you see. (why bigger? the on-disk format is more compact than the same data held in RAM.)
  • don't exceed 5-10% of system RAM, because it's the amount of memory allocated per planner node -- a single query can consume it several times over.
  • can be configured per-session (sketch below)
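The per-session override, as a minimal SQL sketch:

    -- bump work_mem for one expensive reporting query only
    SET work_mem = '256MB';
    -- ... run the big query ...
    RESET work_mem;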

maintenance_work_mem

  • 10% of sysram up to 1GB
  • if you have vacuum problems, you may need more.

effective_cache_size

  • set to amount of file system cache available.
  • no idea what to set? set it to 50% system RAM
  • this isn't allocated, it's a hint to the planner to decide how much RAM is available.

General points:

  • random_page_cost && work_mem === biggest performance gains
  • prefer xfs or ext4 on linux.

Checkpoint tuning

a complete flush of dirty buffers to disk. Two parameters control when this happens:

  • number of WAL segments written to disk,
  • whenever a timeout occurs.

tuning these parameters:

  • wal_buffers # 16mb
  • checkpoint_completion_target # 0.9
  • checkopoint_timeout # 10m-30m # longer to start, less IO while operating
  • checkpoint_segments # 32

monitoring:

  • look for checkpoint entries in log,
  • happening how often? more often than checkpoint_timeout? if so, the WAL segments are being exhausted too quickly, so bump up checkpoint_segments until checkpoints are less frequent than the timeout

Planner tuning

  • effective_io_concurrency: set to the number of IO channels, otherwise ignore; if an SSD array with 32 channels, set to 32, etc.
  • random_page_cost: the ratio between the time to grab a random disk page vs. a sequential one; it controls index vs. sequential scans (lower favors indexes). 3.0 is typical for RAID10, 2.0 for SAN, 1.1 for Amazon EBS, ~1.5 for SSDs.

do not touch

  • fsync = on. never change this. it controls whether postgres flushes to disk and then waits for the result.
  • synchronous_commit = on; you can turn this off (a crash may lose the most recent transactions, but won't corrupt the DB);

Concepts

Write-ahead log (WAL)

  • continuous stream of committed database modifications, broken into 16MB segments
  • starts with DB cluster creation, lasts forever
  • checkpoints mean "last consistent state"; WAL segments from before that checkpoint can be thrown away
  • PUT IT ON ITS OWN FILE SYSTEM because it's append-only, basically. own set of disks, etc. it stays put.

MVCC (multiversion concurrency control)

  • helps prevent locking; an alternative to pessimistic locking that allows higher performance
  • writers don't block readers, readers get old version of row.
  • writers block writers to the same row.
  • multiple versions of row may be in DB; deleted/updated aren't immediately removed.
  • VACUUM cleans up tuples (rows) that are no longer visible to any transaction.
  • post 8.0, autovacuum runs automatically; still a good idea to do a manual vacuum after bulk update/delete operations
  • ANALYZE regenerates table stats to help the planner make good guesses about how to execute queries; always do this after major database changes, such as a restore from backup.
  • "share" vs "exclusive" locks exist.
  • surprising locks: adding a new non-NULL column takes a table-level lock -- avoid by creating the column NULL-able, then backfilling and altering it later

Transaction modes

  • read committed
  • repeatable read
  • serializable

Schema design & operations

  • keep data in normal form, don't fear joins
  • "fast/slow" rule: "fast data" changes a lot, "slow data" infrequently -- put these into separate tables. Slow data tends to be the parent of other data via foreign keys.
  • some indexing strategies (see the SQL sketch after this list): -- an index should be selective, in the sense that when it is used it should return a small number of rows. -- partial index: an index that only applies to certain/specific conditions (along the lines of "index ... where (clause)") -- an index should be frequently used.
  • drop unused indexes. create indexes on the basis of real-life needs, and look for sequential scans
  • built-in views to check indexes: -- pg_stat_user_tables: how many times a sequential scan has been done; -- pg_stat_user_indexes: how often an index has been used.
  • SELECT COUNT(*) FROM myHugeTable is implemented as a full table scan. Don't do it. An approximate count is available from the planner stats (pg_class.reltuples), but try to avoid counting at all; it's not a fast operation on Postgres.
  • taming autovacuum -- you can cut down the number of workers, make it run more frequently, etc.; see its sections in the configuration file.
  • bulk loading: use COPY, not INSERT.
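A SQL sketch of the indexing points above (the table and column names are hypothetical):

    -- partial index: only index the rows the hot queries actually touch
    CREATE INDEX orders_open_idx ON orders (created_at) WHERE status = 'open';

    -- tables getting lots of sequential scans (index candidates)
    SELECT relname, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY seq_scan DESC;

    -- indexes that are never used (candidates to drop)
    SELECT indexrelname, idx_scan FROM pg_stat_user_indexes WHERE idx_scan = 0;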

Debugging

"this query is slow"

  • EXPLAIN or EXPLAIN ANALYZE -- gets the query plan
  • http://explain.depesz.com
  • a large gap between estimated and actual rows returned means the planner's confused (example below)
  • nested loops often mean joins that you can't use an index for
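For example (the table name is hypothetical):

    EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'open';
    -- on each plan node, compare the planner's estimated rows with the actual rows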

"the DB is slow"

  • pg_stat_activity -- is it waiting on a query? etc
  • tail -f logs
  • pg_locks, in connection with pg_stat_activity (sketch below).
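A sketch of finding blocked queries (column names follow current Postgres; older versions spell some of them differently, e.g. procpid):

    SELECT l.pid, l.mode, l.granted, a.query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE NOT l.granted;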

System/network

  • cloud hosting has terrible IO; since DBs are IO-bound, you want to get as much RAM as you can (up to 2x DB size); CPU capacity isn't as important as RAM; always replicate.
  • store configurations in VCS
  • our-own-hardware: -- get SSDs, otherwise SAS drives; -- RAID10; -- put pg_xlog on its own volume; -- move pg_stat_tmp to a RAMdisk if you want to (transient data, write-intensive)

if you have only a small SSD, put your busiest indexes / tables onto it.

monitoring

  • nagios: disk, cpu, mem, (if used) replication log
  • "checkpostgres.pl" from Bucardo.org <- use
  • pgAdmin3 for management, handy
  • log analysis: pgFouine (Traditional, not maintained much); pgbadger (new, active).

Open source web mapping

Technologies in play:

some notes

  • mapnik - C++ lib, has node bindings!
  • avoid maps as single lock-in point (Google) or point of failure or ugly jarring clashing design, etc.
  • open data -- osm.org, naturalearthdata.com, us census, local governments.
  • example: http://npr.org/censusmap/ -- shows chart interpolation on map

Keynotes, Wednesday July 18

  • leaders set norms for communities
  • open source relies on its social capital
  • four strategies to use in the course of technical conversation: inquiry, paraphrase, acknowledge, advocate.
  • axes of understanding and learning: perception vs. imagination, emotional vs. analytical.

Hadoop 2.0

Javascript library overviews

  • seek to modularize use of jQuery
  • consider the mobile audience with respect to JavaScript performance and optimization
  • three alternatives: jQuery alternatives; JavaScript MVC; JavaScript alternatives
  • jQuery alternative -- useful for mobile -- Zepto.js - tries to match jQuery syntax, subset of features, mobile focus (esp. Mobile WebKit).
  • hello again, Backbone.js

Effective code review

"Do it."

why?

  • you write better code when you know it's going to be reviewed.
  • defects vs. bugs
  • helps more than one person understand the code well
  • makes you a better developer -- more reading, writing, and comprehension
  • both newer and more experienced developers benefit
  • gives real status updates
  • builds trust and morale
  • selling code review to others: easier than unit testing; bottom-up approach (costs nothing for over-the-shoulder).
  • if you're writing code, you should be reviewing code (it's for everyone).
  • differentiate perhaps between "here's where the architecture goes," and "how about this specific code?"
  • "coverage of reviewed code" . . .

important things to discuss during code review

  • bad design
  • lack of clarity: easy to read, easy to understand code.
  • conformity: style adherence
  • performance hazards: IO, memory leaks, object literals in JavaScript, etc.

unimportant things

  • optimization (vs. performance).
  • skill/experience gaps -- "something folks tend to fret about" -- letting less experienced coders into main areas is useful
  • personal style

integration into development cycle?

  • when? when it's committed? ad-hoc at the time? review meetings (weekly)?
  • persistence: over-the-shoulder? wiki, mailing list? watch for patterns with respect
  • tools: gerrit, fisheye
  • geographically dispersed teams: helps async teams, builds cohesive codebase,

Sensor Network Data Collection and Storage

  • What are sensor networks? An association of sensors to monitor an event or conditions. Wired (lab, manufacturing); wireless (environmental monitoring, security)
  • https://launchpad.net/mysql-arduino
  • Data nodes vs. sensor nodes: data nodes are more complex, store the data, and mix types; sensor nodes generally don't store or process, and just have a single value/type of thing.
  • Collector nodes: collect, parse, store, or transmit the data.
  • Pachube, nimbits, ThingSpeak, Digi, Sensor Cloud (post to cloud-based services)
  • MySQL Connector/Arduino > dump directly to a database
  • Home automation with recorded history of events, visualization
  • uses xbee wireless for sensor nodes

Hypermedia URLs

https://speakerdeck.com/u/steveklabnik/p/oscon-2012-designing-hypermedia-apis

http://coderwall.com/p/xvzu-g

The speaker's previous work with learning/tech: Learning Ruby: JumpstartLab, HungryAcademy

  • consider using curl during development
  • communicate the messages in JSON ('cos why not?)
  • "build your application to respect the fundamental architecture of the web."
  • anarchy as a motivation!
  • Respect HTTP, use a hypermedia type. (stateless) hypermedia as engine of application state.
  • hypermedia types - RFC5988 - web linking, relations. rel="whut." "PROFILE" link relationship, additional semantics to an endpoint.
  • Adding this profile information gives you a sort of hypermedia type even though the underlying content is JSON, which isn't quite hypermedia (see the sketch after this list)
  • Collection + JSON / HAL is another one.
  • Determinism: state machines for application state
  • Media types are dynamic contracts between client and server -- what processing services the server offers, and how the client consumes and interacts.
  • Consider Mechanize (perl/ruby) or web scrapers to help with client development.
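A minimal sketch of "JSON plus link relations," in the style of HAL (all values made up):

    {
      "_links": {
        "self":    { "href": "/orders/523" },
        "profile": { "href": "http://example.com/rels/order" },
        "next":    { "href": "/orders/524" }
      },
      "status": "shipped",
      "total": 42.00
    }

The client navigates by following rels instead of hardcoding URLs -- that's hypermedia as the engine of application state.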

Keynotes

Canonical/ubuntu speaker:

  • juju -- Amazon re: scaling
  • ubuntu 12.10 - HUD

Digging into Open Data

http://assets.en.oreilly.com/1/event/80/Digging%20into%20Open%20Data%20Presentation%202.pdf

  • "public" /= "open" data -- could have copyrights, patents, trademarks, restricted licenses, etc.
  • "open data" is accessible without limitations on entity or intent, in a digital, machine-readable format; free of restriction or use or redistribution in its licensing conditions.
  • "open" != "exempt" -- verify the data use policies of sources (citations, attributions).
  • some unexpected open sources: "open" != "government". Publications (The Guardian, WSJ, NYT, The Economist); Companies (GE, Yahoo, Nike, Mint, Trulia); Academia (Carnegie Mellon DASL, Berkeley Data Lab, MIT Open Data Library).
  • "politilines" -- example of data visualization(?) -- as an example. What's the process for using this stuff?
  • finding data: gov sites, commercial data markets, http://thedatahub.org, open science data (http://oad.simmons.edu/oadwiki/Data_repositories). Research time = liberal estimate * 5
  • scraping data: consider Dapper, Google, ScraperWiki.
  • python is the language of choice: urllib2, requests, mechanize; html5lib, lxml, BeautifulSoup (see the sketch after this list)
  • nltk - Natural Language Toolkit
  • Cleaning data: Google Refine, Data Wrangler, ParseNIP, python, SQL
  • Visualizing: R, D3, Many Eyes, Swivel
  • Some business considerations: data timeliness, thinking ahead in terms of the stability of open data, ins/outs of rolling your own parsing scripts; screen-scraping makes some challenges for maintenance of scripts.
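A minimal sketch of the scraping stack named above (the URL and table structure are placeholders):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/some-data-table").text
    soup = BeautifulSoup(html, "html.parser")   # html5lib or lxml also work as parsers
    # pull each table row out as a list of cell strings
    rows = [[td.get_text() for td in tr.find_all("td")]
            for tr in soup.find_all("tr")]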

Node.js in production: Debugging and performance analysis

http://assets.en.oreilly.com/1/event/80/Node_js%20in%20Production_%20Postmortem%20Debugging%20and%20Performance%20Analysis%20Presentation.pdf

David Pacheco @ Joyent

Scenario: hung aggregator. How do you debug it?

  • Check the logs? Check the syscall activity with truss or strace?
  • GDB to check the thread stacks? A mess: there's the node + V8 scaffolding, but then... no.
  • We can add more logging... but no way of introspecting it.
  • Node.js can perhaps connect to remote node instances?

More generic debugging notes

  • add more instrumentation (console.log()). Downsides: you lose credibility when you redeploy over and over, redeploying carries some risk, and performance can suffer. If you're lucky or the problem is pretty simple, this can work OK.
  • better: for C programs, when the program crashes (or on demand) you can create a core file, then you can use a debugger to inspect the system state. Can this work for node.js?
  • The problem is that few dynamic environments have produced rich toolsets for introspecting program execution. The tools we use for C aren't useful here.
  • In order for this to work, we need to translate the native abstractions (symbols, functions, structs) into JavaScript counterparts (variables, Functions, Objects);
  • some abstractions don't even exist explicitly in the language itself (e.g., Javascript's event queue).
  • mdb_v8: postmortem debugging for Node. Based on MDB; prints call stacks including JS functions/args; given a pointer, prints it as a C++ object AND its JS counterpart; scans the heap to see what instances of object types exist (sketch after this list).
  • check restify, a node module for REST interfaces
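A rough sketch of the core-file workflow on illumos/SmartOS (the dcmds are real mdb_v8 commands; the pid and filename are illustrative):

    # grab a core from the running node process without killing it
    gcore 12345
    # open the core in mdb and load the V8 helpers
    mdb core.12345
    > ::load v8
    > ::jsstack          # JS-aware stack traces
    > ::findjsobjects    # what object shapes exist on the heap?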

Need to get the slides from this; it had some very sophisticated tooling that runs on (at least) illumos for gathering low-level information and profiling using flame graphs.

Running: hacking the body

  • joining a club as a kind of injury avoidance
  • the secret: run faster. You can generally run faster at the same fitness level.
  • interval training. Shorter bursts of higher intensity.
  • Amdahl's law: if you have two chunks, the maximum speed-up depends on the proportion of total time taken by the part you're trying to optimize.
  • running is done by muscular contraction; ATP production -- glycolysis from glycogen and oxygen (aerobic) and anaerobic respiration (no O2).
  • blood: flow is impacted by the volume of your heart; oxygen-carrying capacity is determined by the hematocrit / haemoglobin levels.
  • we can optimize the lungs, heart, and liver.
  • we can't optimize: age, maximum heart rate (tied to age)
  • we can optimize ATP creation (O2 concentration in blood, heart rate/stroke volume; lung capacity; glycogen stocks); ATP consumption (strength, weight, "running economy" (form), lactic tolerance (pain thresholds)).
  • drinking alcohol impacts ability to store glycogen.
  • strength: skipping, hill-climbing or steps; holding a plank to failure is a good measure of overall core strength.
  • running economy: "form", strength, and suppleness. Injury prevention: no more than 10% increase week-to-week; every 3rd week, decrease 10%; be aware of intensity.
  • stretching: stretching the soleus is important (missed in calf stretches). Stretching is important for warming up muscles (warm-up exercise); lengthening muscles (extending); suppleness.
  • heel strike, forefoot strike; converting to forefoot strike: "hundred up".
  • Cardiovascular efficiency: VO2 max. The goal is to maximize this number. You can't increase max heart rate, but you can increase VO2 max. How to measure? Get it done by medical staff. But we can measure vVO2 max (the velocity at which you attain VO2 max) ourselves. "Beep test": without warmup, run between cones a fixed distance apart, keeping pace with the beeps. "Semi Cooper": warm up, then run fast for 6 minutes.
  • Interval training: shorter intervals = more reps; short rest periods; 10-15 minutes at vVO2 max; time is the critical factor.
  • Prerequisite: you must be able to run 45 minutes at a steady jog a couple of times per week; a session is 15 min warmup, 20 intense, 10 cooldown. Time and intensity are what's important -- not too fast/too slow, and not distance.
  • Examples of intervals: 60/60, 30/30, 20/20 (time based). Phases: 2-week cycles, each cycle with a focus: endurance, speed, race preparation, taper and recovery. Top tip: join a club.

Twitter bootstrap

Some libraries by @fat: Bootstrap, Ender, Hogan.js, MooTools-flot, Stache, snapysnap.

@fat is a very funny twerpy nerd.

Karel Capek / R.U.R. > science, godlessness, robots > TJ Holowaychuk (javascript) === Stylus, Mocha, Express. Github replaces notifications with infinity symbols if you have too many! Tons of traffic about issues, to wit:

http://www.github.com/necolas/issue-guidelines

(from issues) "we get so overrun that we forget to innovate" The Dark Thesis: "I can close 50+ issues without committing a single line of code." Other solutions: some projects add contributors who only manage tickets, but it's hard to find people who want to do it and who are the right kind of people to do that work. Some projects moved ticketing off github network, but then you have inconsistent ticket implementation/locations.

"what if we can clone ourselves?" this is the "Old Rossom" approach -- see Capek. Or, "what we focused on really simple tasks (young Rossum)"? A bot which would implement Necolas's issue-guidelines.

instead, build something "universal"; researching "bots" (chatterbots, spambots, botnets, gaming bots, votebots) suggests that "bots are scripts that run automated tasks on the web" -- which is exactly what's wanted. Enter "Haunt," something similar to Rossum's protoplasm: a node module for creating robots || services. It lets you run unit tests against issues and pull requests, then make decisions about closing, tagging, and commenting -- programmatically.

check out: http://git.io/haunt

How do developers learn?

  • presenter is interested in craftspeople, not just the people who write code but don't especially care.
  • learning: it's a tarp! Codecademy, Google courseware,
  • two kinds of learning: a good coder learning a new language, vs. when you need to learn something right now (but don't really need to learn a new skill )
  • vocabularies differ depending on dev perspective

learning for work -- hottest topics:

  • Drupal & wordpress, hot and on fire
  • Python, Ruby, Javascript

Notes regarding tech learning

  • there's an implicit assumption that the most recent content is the best content. This reflects the theme: stop and think about the environment you're in, before racing to get the answers.
  • Some search terms are "false friends," showing up more often than expected in statistical analysis of word frequency in books.
  • when do people use these devices for learning? more use ipads (higher % than mobile/non-ipad); largely at night.
  • younger folks tend to prefer video/screencasts (learning from video)
  • github is a valid learning resource; github has a "real" search with actual search semantics. An approach: "path:" qualifiers and NOT operators; check out Github's search docs for more details.
  • We often feel as though we must finish books: instead, ignore the end of the book and just move on when you feel it's been of enough value for you.

Keynotes, Friday

How good is your internet? Chris DiBona, Google

Data visualization with Clojurescript

http://keminglabs.com/talks/kevin_lynagh_web_data_visualization_OSCON_2012_slides.pdf

http://keminglabs.com/talks/kevin_lynagh_web_data_visualization_OSCON_2012_handout.pdf

  • Clojurescript - compiles Clojure to Javascript
  • "treat your data like data", "it's better to have 100 functions operate on one data structure than 10 functions on 10 data structures. [alan perlis]. with OO style stuff we tend to encapsulate data, so we do a lot of work to get things into a box, then a lot of work to get it out again.
  • consider doing stuff without using the DOM as much as possible, since that makes it easier to manipulate and test against data without requiring headless browser or other strange things.
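A tiny ClojureScript sketch of the "plain data" point (values made up):

    ;; a chart series is just a vector of maps -- no wrapper classes
    (def points [{:x 1 :y 2} {:x 2 :y 4} {:x 3 :y 9}])
    ;; any core function applies directly; nothing to box or unbox
    (apply max (map :y points))   ;; => 9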

Components with HTML5

  • Google Web Toolkit -- consider
  • Vaadin -- https://vaadin.com/home#intro
  • Design based on needs, and think about the user experience implied by the use of that component: can you achieve the use case you are heading towards with existing components?
  • make a list of real, quantifiable UX requirements
  • https://jojule.github.com

Feedback for presenters

  • surprises are OK, but don't show features/tools that don't work yet. "upcoming revision" == "waste of my time today."
  • don't read to me from your book.
  • specify the audience and be aware when your talk isn't sticking to its billing -- this one seemed like a beginners' talk, but wasn't.
  • don't depend on the network.
  • not much typing
  • don't use music that has vocals in it if your presentation includes sound

Geospatial notes trying to bend data around

make a NODATA band transparent

Apply the color relief (gdaldem), then warp -- the warp changes the transparency (can probably be made to work, but this works too)? (This didn't quite work a second time; not sure why.)

  1. gdaldem color-relief -alpha input.tif ramp.txt colorized.tif
  2. gdalwarp -s_srs EPSG:3338 -t_srs EPSG:3785 -r bilinear input.tif output-reprojected.tif

ramp.txt (each line is an elevation value followed by R,G,B,A; "nv" is the NoData value):

nv 0,0,0,0
0 237,248,251,255
270 178,226,226,255
280 102,194,164,255
290 44,162,95,255
365 0,109,44,255

making contours

  1. gdal_contour -a dof final.tif contours50.shp -i 50
  2. ogr2ogr final.shp contours50.shp -t_srs EPSG:900913

some raster computations to isolate data

gdal_calc.py -A dof_5modelAvg_sresb1_2090_2099.tif -B dof_5modelAvg_sresb1_2010_2019.tif --outfile=week.tif --calc="A*((A-B) > 7)" --NoDataValue=0
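(A reading of that expression, as I understand it: the output keeps A's value wherever A exceeds B by more than 7, and everything else becomes 0, which is declared as NoData.)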
