"Big Five" == Elasticsearch, Storm, Kafka, HBase, wukong decorators
- Faster chef convergence (custom packages; local physical cluster)
- Centralized log archiving
- Performance qualification of
- Visibility and request manipulation
- Metarepo (deb/rpm, gem, egg, maven)
- Deploy framework (chef? rake? something?)
- Triggered full-stack execution (resque? cssh-on-steroids?)
- Vayacondios (systemwide lightweight notification & syndication) (flingr?).
- Announce/discover/report
- Java VCD client.
- metrics/gauges/counters/timers
- Configliere refactor, and Configliere for Java
- Java style guide
- Barrel (auth/circuit breaker/injection point/tracing)
- Monitoring ergonomics. Should feel more like ES' 'bigdesk' and not like Netscape
- Feature flags
- Blue/green clusters
- Cluster templates; Stacks&components
- Log searching
- Automated deploys
- 24/7 monitoring team
- status.client.chimpy.us (profitbricks)
- Testing (full cluster; advanced versions - ie git head)
- Some name changes: VCD -> flingr; wukong dataflow descriptions -> hanuman; customer production clusters -> production (L1), L2, L3.
- Need a place for backups and log archiving in datacenter applications
- Should do CD on edge and current versions
- Logs need to go to their own volume? Partition?
- Performance
- bandwidth (rec/s, MB/s) and latency (s) for 100B, 1kB, 100kB records
- under read, write, read/write
- in degraded state: a) loss of one/two servers and recovering; b) elevated packet latency + drop rate between "regions"
- High concurrency
- keepalive
- bad input flood
- restart of service; reboot of machine; stop/start of machine
- Utilization, Saturation, Errors
- commonly observed errors and their meaning
- exemplars and mountweasels
- Five queries everyone should know
- their performance at baseline
- Field Cache usage vs number of records
- Write throughput for a) full-weight records; b) the CacheMap use case (lots of deletes on compaction)
- Version upgrade
- Recovery
- plugin for recovery strategy
- Shard assignment
- Separate Read/write/transport boxes
- probably only one of these node types should be master-eligible
- Cross-geo replication?
- Machine sizes: m1.x vs m3.x; EBS-optimized vs not; c1.xl for write nodes?
- Failover and backup
- CacheMap metrics, tuning
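The degraded-state performance matrix above is easiest to keep honest as an explicit enumeration. A minimal Ruby sketch, assuming the sizes, workloads, and states named in these notes; the harness that actually runs each cell and records rec/s, MB/s, and latency percentiles is not shown:

```ruby
# Dimensions taken from the notes above; every (size, workload, state)
# triple is one benchmark cell to run and report on.
RECORD_SIZES = %w[100B 1kB 100kB]
WORKLOADS    = %i[read write read_write]
STATES       = %i[healthy one_node_down two_nodes_down recovering degraded_network]

# Build every cell of the qualification matrix.
def qualification_matrix
  RECORD_SIZES.product(WORKLOADS, STATES).map do |size, workload, state|
    { record_size: size, workload: workload, cluster_state: state }
  end
end

qualification_matrix.length  # 3 sizes x 3 workloads x 5 states = 45 cells
```

Enumerating the matrix up front makes it obvious when a qualification run skipped a degraded-state cell.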
- In-stream database calls
- Can I "push"/"flush" DRPC calls?
- What happens when I fail a tuple?
- fail-forever / fail-retriably
- "failure" streams
- Tracing
- "tracing" stream
- Wukong shim
- failure/error handling
- tuple vs record
- serialization
- Batch size tradeoffs
(later: wukong-proc shim)
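The fail-forever / fail-retriably split and the "failure" stream can be sketched in plain Ruby. All class and method names here are invented for illustration, not wukong's actual API: a record that can never succeed is routed to the failure stream for later inspection, while a transient error propagates so the framework can replay the tuple.

```ruby
class BadRecordError < StandardError; end   # fail-forever: the input is unusable
class TransientError < StandardError; end   # fail-retriably: replay the tuple

class ShimProcessor
  attr_reader :failures

  def initialize(&block)
    @process  = block
    @failures = []   # stands in for the "failure" stream
  end

  # Returns the processed record; routes hopeless records to the failure
  # stream (returning nil); lets transient errors propagate so the caller
  # (the framework) can fail and replay the tuple.
  def call(record)
    @process.call(record)
  rescue BadRecordError => e
    @failures << { record: record, error: e.message }
    nil
  end
end

parser = ShimProcessor.new do |line|
  raise BadRecordError, 'not an integer' unless line =~ /\A\d+\z/
  Integer(line)
end

parser.call('42')    # => 42
parser.call('oops')  # => nil; the record lands on the failure stream
```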
Current thinking is nginx+goliath; also considering varnish, either directly or fronting finagle, goliath, or netty
- HTTP and TCP proxy
- Authentication (HTTP)
- Utilization, Saturation, Errors
- Heartbeating, Announce, Discover
- Tracing -- with trace mode on, a hidden parameter causes a VCD notification with detailed trace info
- Circuit Breaker -- set QoS bounds on reqs/s, dropping requests (later, queuing them) as directed
- Fault Injection -- with FI mode on, hidden parameters in the request cause a specified a) delay, before or after processing; b) error, before or after processing; c) immediate response+headers
- Load Balancing
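The fault-injection point is concrete enough to sketch as a Rack-style middleware. The hidden parameter names (`_fi_delay_ms`, `_fi_error`) are invented for illustration, not Barrel's actual contract; a real version would also gate on authentication:

```ruby
require 'cgi'

# Rack-style middleware: when FI mode is enabled, hidden query parameters
# inject a delay before processing or an error response instead of it.
class FaultInjector
  def initialize(app, enabled: false)
    @app, @enabled = app, enabled
  end

  def call(env)
    return @app.call(env) unless @enabled
    params = CGI.parse(env['QUERY_STRING'].to_s)

    # a) delay before processing
    sleep(params['_fi_delay_ms'].first.to_i / 1000.0) if params.key?('_fi_delay_ms')

    # b) error instead of processing
    if params.key?('_fi_error')
      status = params['_fi_error'].first.to_i
      return [status, { 'content-type' => 'text/plain' }, ["injected error #{status}"]]
    end

    @app.call(env)
  end
end

app   = ->(env) { [200, { 'content-type' => 'text/plain' }, ['ok']] }
stack = FaultInjector.new(app, enabled: true)

status, _headers, _body = stack.call('QUERY_STRING' => '_fi_error=503')
# status == 503; with FI mode off, the same request passes through untouched
```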
Compare: Twitter's Zipkin and Ostrich, Google's Dapper. Have been advised Twitter has a hoary-as-hell thing like VCD that they don't brag about :).
- Announce / Discover / Report
- Gauges, metrics and timing
- Control-path events
- Automation lifecycle progress
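Gauges, control-path events, and lifecycle progress can all travel as small self-describing JSON documents. A sketch of what one event might look like; the topic/facet layout and field names are assumptions, not VCD's actual wire format:

```ruby
require 'json'
require 'time'

# A VCD-style event: a client would POST its JSON to the notification
# endpoint, e.g. POST /v1/<org>/event/<topic>/<facet> (path is illustrative).
class VcdEvent
  def initialize(topic, facet, payload, time: Time.now.utc)
    @topic, @facet, @payload, @time = topic, facet, payload, time
  end

  def to_json(*_args)
    JSON.generate(topic: @topic, facet: @facet,
                  time: @time.iso8601, payload: @payload)
  end
end

# An announce and a gauge reading look the same on the wire:
announce = VcdEvent.new('elasticsearch', 'node.announce',
                        { host: 'es-1.internal', role: 'data' })
gauge    = VcdEvent.new('elasticsearch', 'field_cache.bytes', { value: 1_234_567 })
```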
Instrument Elasticsearch, HBase, Storm, Kafka. JMX? jvmgcprof and Ostrich?
- Refactor interface to allow hierarchical configs
- Java client
- Dynamic updates
cf Archaius
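A plain-Ruby sketch of the hierarchical lookup the refactor is after (dotted keys resolving into nested hashes, Archaius-style). This illustrates the desired behavior only; it is not Configliere's actual API:

```ruby
class DeepSettings
  def initialize
    @root = {}
  end

  # set('hbase.master.port', 60000) builds nested hashes along the path.
  def set(dotted_key, value)
    *path, leaf = dotted_key.split('.')
    node = path.inject(@root) { |h, k| h[k] ||= {} }
    node[leaf] = value
  end

  # get returns nil (rather than raising) for missing paths.
  def get(dotted_key)
    dotted_key.split('.').inject(@root) { |h, k| h.is_a?(Hash) ? h[k] : nil }
  end
end

settings = DeepSettings.new
settings.set('hbase.master.port', 60000)
settings.get('hbase.master.port')   # => 60000
settings.get('hbase.region.port')   # => nil
```

Dynamic updates then reduce to re-running `set` on a changed subtree and notifying watchers, which is the part Archaius gets right.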
- Automated load testing
- TCP Load Balancing
- SNMP hooks
- Pluggable Web Resources (e.g. Karyon, Finagle). These don't seem necessary unless/until we run our own multi-tenant cloud
- Mesos
- Multi-vendor DNS
- Twitter's Stack
- Netflix Stack
- Google Dapper large-scale tracing framework
Notes on Barrel+VCD:
- Bootstrapping, Libraries and Lifecycle Management
- Runtime Insights and Diagnostics
- Cloud-Ready hooks (Service Registration and Discovery, HealthCheck hooks etc.)
- Runtime Configuration of Properties (Configliere)
- Latency and Fault tolerance (Hystrix, HAproxy)
- Pluggable Web Resources (eg Karyon, Finagle)
- Circuit Breaker (http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html)
cf Netflix's Karyon; Twitter's Finagle+Ostrich+Zipkin
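A minimal circuit-breaker sketch in the spirit of the Netflix post linked above: after `threshold` consecutive failures the breaker opens and rejects calls immediately; once `reset_after` seconds pass it lets one call through (half-open) to probe the backend. Names and defaults are illustrative only:

```ruby
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, reset_after: 30)
    @threshold, @reset_after = threshold, reset_after
    @failures, @opened_at = 0, nil
  end

  def call
    if open? && (Time.now - @opened_at) < @reset_after
      raise OpenError, 'circuit open; failing fast'
    end
    begin
      result = yield                    # normal call, or half-open probe
      @failures, @opened_at = 0, nil    # success closes the circuit
      result
    rescue StandardError
      @failures += 1
      @opened_at = Time.now if @failures >= @threshold
      raise
    end
  end

  def open?
    !@opened_at.nil?
  end
end
```

Fast-failing at the breaker is what keeps one slow backend from eating every upstream worker, which is the latency-and-fault-tolerance bullet above in miniature.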
- No latency guarantees below ~ 100s of ms
- Fundamental components: Storm, Elasticsearch, wukong topology, APIs once published
- OS (RHEL only)
- No new databases
- No petabyte scale, no 1000s of machines
- No customer access to machines (transitionally excepting user access to Hadoop)
These are (some of) the costly and frictionful things I expect the banking client brings that aren't front-line requirements for other enterprise customers. (They're all important features, but with banking they become essential.)
- Full multi-tenancy / Chinese walls
- Banking-grade security
- Enterprise Hooks (SNMP etc)
- Auditing
- External review/qualification
- Authorized Vendor hoops
- SDK ergonomics
- Hadoop control panel
- Widgets (machine learning, integration, …)
- Graph DB