"Big Five" == Elasticsearch, Storm, Kafka, HBase, wukong decorators
- Faster chef convergence (custom packages; local physical cluster)
- Centralized log archiving
- Performance qualification of
- Visibility and request manipulation
- Metarepo (deb/rpm, gem, egg, maven)
- Deploy framework (chef? rake? something?)
- Triggered full-stack execution (resque? cssh-on-steroids?)
- Vayacondios (systemwide lightweight notification & syndication) (flingr?).
- Announce/discover/report
- Java VCD client.
- metrics/gauges/counters/timers
- Configliere refactor, and Configliere for Java
- Java style guide
- Barrel (auth/circuit breaker/injection point/tracing)
- Monitoring ergonomics. Should feel more like ES' 'bigdesk' and not like Netscape
- Feature flags
- Blue/green clusters
- Cluster templates; Stacks&components
- Log searching
- Automated deploys
- 24/7 monitoring team
- status.client.chimpy.us (profitbricks)
- Testing (full cluster; advanced versions - ie git head)
- Some name changes: VCD -> flingr; wukong dataflow descriptions -> hanuman; customer production clusters -> production (L1), L2, L3.
- Need a place for backups and log archiving in datacenter applications
- Should do CD on edge and current versions
- Logs need to go to their own volume? Partition?
- Performance
- bandwidth (rec/s, MB/s) and latency (s) for 100B, 1kB, 100kB records
- under read, write, read/write
- in degraded state: a) loss of one/two servers and recovering; b) elevated packet latency + drop rate between "regions"
- High concurrency
- keepalive
- bad input flood
- restart of service; reboot of machine; stop/start of machine
- Utilization, Saturation, Errors
- commonly observed errors and their meaning
- exemplars and mountweasels
- Five queries everyone should know
- their performance at baseline
- Field Cache usage vs number of records
- Write throughput for a) full-weight records; b) the CacheMap use case (lots of deletes on compaction)
- Version upgrade
- Recovery
- plugin for recovery strategy
- Shard assignment
- Separate Read/write/transport boxes
- probably only one of these node types should be master-eligible
- Cross-geo replication?
- Machine sizes: m1.x vs m3.x; EBS-optimized vs not; c1.xl for write nodes?
- Failover and backup
- CacheMap metrics, tuning
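The degraded-state performance matrix above is easiest to keep honest as an explicit enumeration. A minimal Ruby sketch, assuming the sizes, workloads, and states named in these notes; the harness that actually runs each cell and records rec/s, MB/s, and latency percentiles is not shown:

```ruby
# Dimensions taken from the notes above; every (size, workload, state)
# triple is one benchmark cell to run and report on.
RECORD_SIZES = %w[100B 1kB 100kB]
WORKLOADS    = %i[read write read_write]
STATES       = %i[healthy one_node_down two_nodes_down recovering degraded_network]

# Build every cell of the qualification matrix.
def qualification_matrix
  RECORD_SIZES.product(WORKLOADS, STATES).map do |size, workload, state|
    { record_size: size, workload: workload, cluster_state: state }
  end
end

qualification_matrix.length  # 3 sizes x 3 workloads x 5 states = 45 cells
```

Enumerating the matrix up front makes it obvious when a qualification run skipped a degraded-state cell.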
- In-stream database calls
- Can I "push"/"flush" DRPC calls?
- What happens when I fail a tuple?
- fail-forever / fail-retriably
- "failure" streams
- Tracing
- "tracing" stream
- Wukong shim
- failure/error handling
- tuple vs record
- serialization
- Batch size tradeoffs
(later: wukong-proc shim)
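The fail-forever / fail-retriably split and the "failure" stream can be sketched in plain Ruby. All class and method names here are invented for illustration, not wukong's actual API: a record that can never succeed is routed to the failure stream for later inspection, while a transient error propagates so the framework can replay the tuple.

```ruby
class BadRecordError < StandardError; end   # fail-forever: the input is unusable
class TransientError < StandardError; end   # fail-retriably: replay the tuple

class ShimProcessor
  attr_reader :failures

  def initialize(&block)
    @process  = block
    @failures = []   # stands in for the "failure" stream
  end

  # Returns the processed record; routes hopeless records to the failure
  # stream (returning nil); lets transient errors propagate so the caller
  # (the framework) can fail and replay the tuple.
  def call(record)
    @process.call(record)
  rescue BadRecordError => e
    @failures << { record: record, error: e.message }
    nil
  end
end

parser = ShimProcessor.new do |line|
  raise BadRecordError, 'not an integer' unless line =~ /\A\d+\z/
  Integer(line)
end

parser.call('42')    # => 42
parser.call('oops')  # => nil; the record lands on the failure stream
```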
Current thinking is nginx+goliath; also considering varnish, either directly or fronting finagle, goliath, or netty
- HTTP and TCP proxy
- Authentication (HTTP)
- Utilization, Saturation, Errors
- Heartbeating, Announce, Discover
- Tracing -- with trace mode on, a hidden parameter causes a VCD notification with detailed trace info
- Circuit Breaker -- set QoS bounds on reqs/s, dropping requests (later, queuing them) as directed
- Fault Injection -- with FI mode on, hidden parameters in the request cause a specified a) delay, before or after processing; b) error, before or after processing; c) immediate response+headers
- Load Balancing
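The fault-injection point is concrete enough to sketch as a Rack-style middleware. The hidden parameter names (`_fi_delay_ms`, `_fi_error`) are invented for illustration, not Barrel's actual contract; a real version would also gate on authentication:

```ruby
require 'cgi'

# Rack-style middleware: when FI mode is enabled, hidden query parameters
# inject a delay before processing or an error response instead of it.
class FaultInjector
  def initialize(app, enabled: false)
    @app, @enabled = app, enabled
  end

  def call(env)
    return @app.call(env) unless @enabled
    params = CGI.parse(env['QUERY_STRING'].to_s)

    # a) delay before processing
    sleep(params['_fi_delay_ms'].first.to_i / 1000.0) if params.key?('_fi_delay_ms')

    # b) error instead of processing
    if params.key?('_fi_error')
      status = params['_fi_error'].first.to_i
      return [status, { 'content-type' => 'text/plain' }, ["injected error #{status}"]]
    end

    @app.call(env)
  end
end

app   = ->(env) { [200, { 'content-type' => 'text/plain' }, ['ok']] }
stack = FaultInjector.new(app, enabled: true)

status, _headers, _body = stack.call('QUERY_STRING' => '_fi_error=503')
# status == 503; with FI mode off, the same request passes through untouched
```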
Compare: Twitter's Zipkin and Ostrich, Google's Dapper. Have been advised Twitter has a hoary-as-hell thing like VCD that they don't brag about :).
- Announce / Discover / Report
- Gauges, metrics and timing
- Control-path events
- Automation lifecycle progress
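Gauges, control-path events, and lifecycle progress can all travel as small self-describing JSON documents. A sketch of what one event might look like; the topic/facet layout and field names are assumptions, not VCD's actual wire format:

```ruby
require 'json'
require 'time'

# A VCD-style event: a client would POST its JSON to the notification
# endpoint, e.g. POST /v1/<org>/event/<topic>/<facet> (path is illustrative).
class VcdEvent
  def initialize(topic, facet, payload, time: Time.now.utc)
    @topic, @facet, @payload, @time = topic, facet, payload, time
  end

  def to_json(*_args)
    JSON.generate(topic: @topic, facet: @facet,
                  time: @time.iso8601, payload: @payload)
  end
end

# An announce and a gauge reading look the same on the wire:
announce = VcdEvent.new('elasticsearch', 'node.announce',
                        { host: 'es-1.internal', role: 'data' })
gauge    = VcdEvent.new('elasticsearch', 'field_cache.bytes', { value: 1_234_567 })
```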
Instrument Elasticsearch, HBase, Storm, Kafka. JMX? jvmgcprof and Ostrich?
- Refactor interface to allow hierarchical configs
- Java client
- Dynamic updates
cf Archaius
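A plain-Ruby sketch of the hierarchical lookup the refactor is after (dotted keys resolving into nested hashes, Archaius-style). This illustrates the desired behavior only; it is not Configliere's actual API:

```ruby
class DeepSettings
  def initialize
    @root = {}
  end

  # set('hbase.master.port', 60000) builds nested hashes along the path.
  def set(dotted_key, value)
    *path, leaf = dotted_key.split('.')
    node = path.inject(@root) { |h, k| h[k] ||= {} }
    node[leaf] = value
  end

  # get returns nil (rather than raising) for missing paths.
  def get(dotted_key)
    dotted_key.split('.').inject(@root) { |h, k| h.is_a?(Hash) ? h[k] : nil }
  end
end

settings = DeepSettings.new
settings.set('hbase.master.port', 60000)
settings.get('hbase.master.port')   # => 60000
settings.get('hbase.region.port')   # => nil
```

Dynamic updates then reduce to re-running `set` on a changed subtree and notifying watchers, which is the part Archaius gets right.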
- Automated load testing
- TCP Load Balancing
- SNMP hooks
- Pluggable Web Resources (e.g. Karyon, Finagle). These don't seem necessary unless/until we run our own multi-tenant cloud
- Mesos
- Multi-vendor DNS
- Twitter's Stack
- Netflix Stack
- Google Dapper large-scale tracing framework
Notes on Barrel+VCD:
- Bootstrapping, Libraries and Lifecycle Management
- Runtime Insights and Diagnostics
- Cloud-Ready hooks (Service Registration and Discovery, HealthCheck hooks etc.)
- Runtime Configuration of Properties (Configliere)
- Latency and Fault tolerance (Hystrix, HAproxy)
- Pluggable Web Resources (eg Karyon, Finagle)
- Circuit Breaker (http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html)
cf Netflix's Karyon; Twitter's Finagle+Ostrich+Zipkin
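A minimal circuit-breaker sketch in the spirit of the Netflix post linked above: after `threshold` consecutive failures the breaker opens and rejects calls immediately; once `reset_after` seconds pass it lets one call through (half-open) to probe the backend. Names and defaults are illustrative only:

```ruby
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, reset_after: 30)
    @threshold, @reset_after = threshold, reset_after
    @failures, @opened_at = 0, nil
  end

  def call
    if open? && (Time.now - @opened_at) < @reset_after
      raise OpenError, 'circuit open; failing fast'
    end
    begin
      result = yield                    # normal call, or half-open probe
      @failures, @opened_at = 0, nil    # success closes the circuit
      result
    rescue StandardError
      @failures += 1
      @opened_at = Time.now if @failures >= @threshold
      raise
    end
  end

  def open?
    !@opened_at.nil?
  end
end
```

Fast-failing at the breaker is what keeps one slow backend from eating every upstream worker, which is the latency-and-fault-tolerance bullet above in miniature.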
- No latency guarantees below ~ 100s of ms
- Fundamental components: Storm, Elasticsearch, wukong topology, APIs once published
- OS (RHEL only)
- No new databases
- No petabyte scale, no 1000s of machines
- No customer access to machines (transitionally excepting user access to Hadoop)
These are (some of) the costly and frictionful things I expect the banking client brings that aren't front-line requirements for other enterprise customers. (They're all important features, but with banking they become essential.)
- Full multi-tenancy / Chinese walls
- Banking-grade security
- Enterprise Hooks (SNMP etc)
- Auditing
- External review/qualification
- Authorized Vendor hoops
- SDK ergonomics
- Hadoop control panel
- Widgets (machine learning, integration, …)
- Graph DB