@mrflip
Last active December 15, 2015 17:09
Notes for 2013 spec

Things

"Big Five" == Elasticsearch, Storm, Kafka, HBase, wukong decorators

  • Faster chef convergence (custom packages; local physical cluster)
  • Centralized log archiving
  • Performance qualification of
  • Visibility and request manipulation
  • Metarepo (deb/rpm, gem, egg, maven)
  • Deploy framework (chef? rake? something?)
  • Triggered full-stack execution (resque? cssh-on-steroids?)
  • Vayacondios (systemwide lightweight notification & syndication) (flingr?).
    • Announce/discover/report
    • Java VCD client.
    • metrics/gauges/counters/timers
  • Configliere refactor, and Configliere for Java
  • Java style guide
  • Barrel (auth/circuit breaker/injection point/tracing)
  • Monitoring ergonomics. Should feel more like ES' 'bigdesk' and not like Netscape
  • Feature flags
  • Blue/green clusters
  • Cluster templates; Stacks&components
  • Log searching
  • Automated deploys
  • 24/7 monitoring team
  • status.client.chimpy.us (profitbricks)
  • Testing (full cluster; advanced versions - ie git head)
  • Some name changes: VCD -> flingr; wukong dataflow descriptions -> hanuman; customer production clusters -> production (L1), L2, L3.

Also:

  • Need a place for backups and log archiving in datacenter applications
  • Should do CD on both edge and current versions
  • Should logs go to their own volume? Partition?

Performance Qualification

  • Performance
    • bandwidth (rec/s, MB/s) and latency (s) for 100B, 1kB, 100kB records
    • under read, write, read/write
    • in degraded state: a) loss of one/two servers and recovering; b) elevated packet latency + drop rate between "regions"
    • High concurrency
  • keepalive
  • bad input flood
  • restart of service; reboot of machine; stop/start of machine
  • Utilization, Saturation, Errors
    • commonly observed errors and their meaning
  • exemplars and mountweasels
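
A minimal sketch of the bandwidth/latency measurement listed above, assuming the system under test can be wrapped in a single callable that accepts one record. The names `measure`, `qualify` and `write_record` are hypothetical and not part of any of the tools named in these notes; the three record sizes come from the list.

```python
import time

# Record sizes from the notes: 100B, 1kB, 100kB.
RECORD_SIZES = {"100B": 100, "1kB": 1_000, "100kB": 100_000}


def measure(write_record, record_size, n_records=1000):
    """Call write_record n_records times; report rec/s, MB/s and mean latency (s)."""
    payload = b"x" * record_size
    latencies = []
    start = time.perf_counter()
    for _ in range(n_records):
        t0 = time.perf_counter()
        write_record(payload)  # hypothetical hook into the system under test
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "rec_per_s": n_records / elapsed,
        "mb_per_s": n_records * record_size / elapsed / 1e6,
        "mean_latency_s": sum(latencies) / n_records,
    }


def qualify(write_record, n_records=1000):
    """Run the same measurement once per record size."""
    return {
        label: measure(write_record, size, n_records)
        for label, size in RECORD_SIZES.items()
    }
```

The read, read/write and degraded-state cases would reuse the same loop with a different callable or a deliberately impaired cluster; this sketch only covers the steady-state write path.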

Elasticsearch

  • Five queries everyone should know
    • their performance at baseline
  • Field Cache usage vs number of records
  • Write throughput in a) full-weight records; b) Cache map use case (lots of deletes on compaction)
  • Version upgrade
  • Recovery
    • plugin for recovery strategy
  • Shard assignment
    • Separate Read/write/transport boxes
    • probably only one type of node or the other acts as master
    • Cross-geo replication?
  • Machine sizes: m1.x vs m3.x; ebs optimized vs not; for write nodes, c1.xl?
  • Failover and backup

Storm

  • CacheMap metrics, tuning
  • In-stream database calls
  • Can I "push"/"flush" DRPC calls?
  • What happens when I fail a tuple?
    • fail-forever / fail-retriably
    • "failure" streams
  • Tracing
    • "tracing" stream
  • Wukong shim
    • failure/error handling
    • tuple vs record
    • serialization
  • Batch size tradeoffs

(later: wukong-proc shim)

New Work

Barrel (auth/circuit breaker/injection point/tracing)

Current thought is that it's nginx+goliath; but considering varnish+(varnish directly; finagle; goliath; netty)

  • HTTP and TCP proxy
  • Authentication (HTTP)
  • Utilization, Saturation, Errors
  • Heartbeating, Announce, Discover
  • Tracing -- with Trace mode on, a hidden parameter causes a VCD notification w/ detailed trace info
  • Circuit Breaker -- set QoS bounds on reqs/s, dropping requests (later, queuing requests) as directed
  • Fault Injection -- with FI mode on, hidden parameters in the request will cause a specified a) delay, before or after processing; b) error, before or after processing; c) response+headers, immediately
  • Load Balancing
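
One simple reading of the "QoS bounds on reqs/s, dropping requests" item is a fixed-window request limiter, sketched below. This is an assumption about the intended behavior, not a design for Barrel, and it is not a failure-tripped breaker in the Hystrix sense. The class name `RateBreaker` and its parameters are hypothetical.

```python
import time


class RateBreaker:
    """Allow at most max_per_second requests per one-second window; drop the rest."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock  # injectable so the window can be tested without sleeping
        self.window_start = clock()
        self.count = 0

    def allow(self):
        """Return True if the request may proceed, False if it should be dropped."""
        now = self.clock()
        if now - self.window_start >= 1.0:
            # A new window begins: reset the counter.
            self.window_start = now
            self.count = 0
        if self.count >= self.max_per_second:
            return False
        self.count += 1
        return True
```

A proxy would call `allow()` once per incoming request and answer with an error response when it returns False. Queuing instead of dropping (the "later" case in the notes) would need a bounded queue in place of the early return.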

Flingr (nee VCD)

Compare: Twitter's Zipkin and Ostrich, Google's Dapper. Have been advised Twitter has a hoary-as-hell thing like VCD that they don't brag about :).

  • Announce / Discover / Report
  • Gauges, metrics and timing
  • Control-path events
  • Automation lifecycle progress
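
To make the gauges, metrics and timing item concrete, here is a minimal in-process sketch of the three metric kinds. It is illustrative only: `Metrics` and its method names are hypothetical and do not reflect the Vayacondios/Flingr API, and nothing here ships data over the network.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-memory registry for counters, gauges and timers."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.gauges = {}
        self.timings = defaultdict(list)

    def count(self, name, by=1):
        """Counter: a value that only accumulates."""
        self.counters[name] += by

    def gauge(self, name, value):
        """Gauge: the most recent reading of some level."""
        self.gauges[name] = value

    @contextmanager
    def timer(self, name):
        """Timer: record the wall-clock duration of the enclosed block."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)

    def report(self):
        """Snapshot suitable for handing to whatever does the reporting."""
        return {
            "counters": dict(self.counters),
            "gauges": dict(self.gauges),
            "timers": {
                name: {"n": len(samples), "total_s": sum(samples)}
                for name, samples in self.timings.items()
            },
        }
```

Usage would look like `m.count("requests")`, `m.gauge("queue_depth", 7)` and `with m.timer("handle"): ...`, followed by a periodic `m.report()`.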

Instrumented Builds

Instrument Elasticsearch, HBase, Storm, Kafka. JMX? jvmgcprof and Ostrich?

Configliere

  • Refactor interface to allow hierarchical configs
  • Java client
  • Dynamic updates
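
As a rough illustration of what "hierarchical configs" could mean, the sketch below layers one nested config over another and reads values by dotted key. This is not Configliere's interface; `deep_merge`, `lookup` and the example keys are hypothetical.

```python
def deep_merge(base, override):
    """Return a new dict where override's values win, merging nested dicts recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def lookup(config, dotted_key, default=None):
    """Fetch a nested value using a dotted path such as 'es.heap'."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Example layering: site-specific settings over shared defaults.
defaults = {"es": {"port": 9200, "heap": "1g"}}
site = {"es": {"heap": "4g"}}
config = deep_merge(defaults, site)
```

Dynamic updates could then be modeled as merging a further layer on top at runtime, though how changes get announced to running processes is left open here.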

cf Archaius

Later

  • Automated load testing
  • TCP Load Balancing
  • SNMP hooks
  • Pluggable Web Resources (eg Karyon, Finagle) These don't seem necessary unless/until we run our own multi-tenant cloud
  • Mesos
  • Multi-vendor DNS

References

Appendix: Notes on Flingr+Barrel+Configliere

Notes on Barrel+VCD:

  • Bootstrapping, Libraries and Lifecycle Management
  • Runtime Insights and Diagnostics
  • Cloud-Ready hooks (Service Registration and Discovery, HealthCheck hooks etc.)
  • Runtime Configuration of Properties (Configliere)
  • Latency and Fault tolerance (Hystrix, HAproxy)
  • Pluggable Web Resources (eg Karyon, Finagle)
  • Circuit Breaker (http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html)

cf Netflix's Karyon; Twitter's Finagle+Ostrich+Zipkin

Appendix: Not and Not Yet

the "Non-Negotiables"

  • No latency guarantees below ~ 100s of ms
  • Fundamental components: Storm, Elasticsearch, wukong topology, APIs once published
  • OS (RHEL only)
  • No new databases
  • No petabyte scale, no 1000s of machines
  • No customer access to machines (transitionally excepting user access to Hadoop)

Banking-specific features

These are (some of) the costly and frictionful things I expect a banking client brings that aren't front-line requirements for other enterprise customers. (They're all important features, but with banking they become essential.)

  • Full multi-tenancy / Chinese walls
  • Banking-grade security
  • Enterprise Hooks (SNMP etc)
  • Auditing
  • External review/qualification
  • Authorized Vendor hoops

No-time-soons:

  • SDK ergonomics
  • Hadoop control panel
  • Widgets (machine learning, integration,
  • Graph DB