Notes from day 3 of .conf

Splunk IT Service Intelligence (ITSI): Event management is dead - event analytics is revolutionizing IT

David Mills - Staff Architect, IT Operations Analytics

Basically we're not just looking at events. We're instead looking to tie events together with some ML, with some dashboards, and this ITSI tooling. They're using New Relic events as an example, but the workflow looks like you could just pump PagerDuty events into Splunk for a similar effect. (n.b. why are we not doing this?)
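If we ever wanted to try that, here's a rough sketch of the plumbing - not anything they showed. It just forwards a PagerDuty-style payload into Splunk over the HTTP Event Collector; the URL, token, sourcetype, and payload shape are all placeholders:

```python
# Sketch: forward a PagerDuty-style webhook payload into Splunk via the
# HTTP Event Collector (HEC). URL, token, sourcetype, and payload are placeholders.
import json
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # assumed host/port
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                    # placeholder token

def send_to_splunk(pd_event: dict) -> None:
    """Wrap an incident payload in the HEC envelope and POST it."""
    envelope = {
        "sourcetype": "pagerduty:incident",  # arbitrary sourcetype for this sketch
        "event": pd_event,
    }
    resp = requests.post(
        HEC_URL,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        data=json.dumps(envelope),
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    send_to_splunk({"incident_id": "PABC123", "status": "triggered", "service": "web"})
```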

A little bit of discussion on defining good opsy KPIs, but nothing you couldn't guess. They wrap in businessy KPIs as well.

They're doing logical actions, like opening tickets, paging people downstream, etc. I'm not sure we'd want to move straight to pumping alerting through Splunk to PD, but we could do some cool analytics by linking Sensu/SFX/etc PD alerting into Splunk.

Points out that certain fields are "garbage" - either too noisy or not actionable. Run an iterative clustering algorithm on them to pick these out, set the thresholds accordingly. It's an interesting approach - cluster-level alerting - and it's cool to see it extended away from just operational alerting.
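They didn't show code, so just to capture the idea for later: a toy sketch of "cluster per-field stats and flag the noisy, non-actionable cluster". The feature names, numbers, and choice of k are all made up - this is not their implementation:

```python
# Toy illustration of the idea: cluster per-field stats and treat the
# high-volume / low-action cluster as "garbage". All inputs are made up.
import numpy as np
from sklearn.cluster import KMeans

# One row per event field: [events_per_hour, fraction_of_events_acted_on]
field_names = ["host", "debug_flag", "error_code", "build_id"]
stats = np.array([
    [120.0, 0.40],
    [9000.0, 0.01],   # very noisy, rarely actionable
    [300.0, 0.55],
    [7500.0, 0.02],   # very noisy, rarely actionable
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(stats)

# Call the cluster with higher volume and lower action rate "garbage".
centers = kmeans.cluster_centers_
garbage_cluster = int(np.argmax(centers[:, 0] - centers[:, 1]))
garbage_fields = [f for f, label in zip(field_names, kmeans.labels_) if label == garbage_cluster]
print("candidate garbage fields:", garbage_fields)
```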

What's the point? We simplify operations, make it easier to find what's borked and fix it. (n.b., this is the point of ~20% of the sessions at .conf)

Index Clustering Internals, Scaling, and Performance Testing

Da Xu and Chloe Yeung

Covering the RESTful endpoints that are used for communicating between the cluster master and the indexers. Slides might be useful here - they're supposed to be published post-hoc.

If you want to peer in deeply, the endpoints cluster/master/{peers,buckets} give a lot of detailed info. Although I think this is all exposed in the DMC, so there's probably no need to query them directly from outside of Splunk.

For example: services/cluster/master/buckets?filter=replication_factor<3. You can find the hit rate of these endpoints - and how long the responses take - in metrics.log. If we ever have cluster master performance issues, this would be a good place to start the investigation.
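Rough sketch of poking those endpoints from outside - host, port, and credentials below are placeholders, not from the talk:

```python
# Sketch: poll the cluster master's REST endpoints mentioned above.
# Host, port, and credentials are placeholders; verify=False only for a lab box.
import requests

BASE = "https://cluster-master.example.com:8089"   # assumed management port
AUTH = ("admin", "changeme")                       # placeholder credentials

# Peers known to the cluster master
peers = requests.get(
    f"{BASE}/services/cluster/master/peers",
    params={"output_mode": "json"},
    auth=AUTH,
    verify=False,
    timeout=30,
).json()

# Buckets that haven't met replication factor 3 yet (the filter from the talk)
under_replicated = requests.get(
    f"{BASE}/services/cluster/master/buckets",
    params={"output_mode": "json", "filter": "replication_factor<3"},
    auth=AUTH,
    verify=False,
    timeout=30,
).json()

print(len(peers["entry"]), "peers;", len(under_replicated["entry"]), "under-replicated buckets")
```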

Large cluster issues: the cluster master will slow down as the number of buckets increases. Tweak like so (sketch below):

  • server.conf: {cxn,recv,send}_timeout - self-explanatory, bump the timeouts.
  • indexes.conf: rotatePeriodInSeconds - how often to check all buckets for ones that need to roll.
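For later reference, a sketch of where those knobs would probably live - stanza names and values are my guesses, not from the slides:

```ini
# server.conf on the cluster master - [clustering] stanza assumed, values are examples only
[clustering]
cxn_timeout = 120
send_timeout = 120
rcv_timeout = 120   # the talk said "recv" - check the spec file for your version

# indexes.conf - global default stanza assumed, value is an example only
[default]
rotatePeriodInSeconds = 60
```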

Splunk 6.6 onward scales better - 5 million+ buckets is fine:

  • Batched indexer adds
  • Lockless heartbeats
  • 7.0 is faster overall and the CM uses less memory.

Performance testing: 2x 24-core servers, 128GB RAM, 1Gb NIC. Replication at 3/2 (RF 3 / SF 2). Splunk 7.0.

  • Non-clustered average throughput: 35 MBps; clustered drops to ~30 MBps, multisite to ~25 MBps.
  • Node failure and master restart tests: faster in 6.6 and 7.0 than in prior versions.

Master's server.conf has max_peers_to_download_bundle: especially good for bundles over 200MB. New in 6.6 I think.
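Presumably it sits with the other clustering settings; a sketch with a guessed stanza and value:

```ini
# server.conf on the cluster master - stanza and value assumed, not from the talk
[clustering]
max_peers_to_download_bundle = 5   # cap concurrent bundle downloads; helps with 200MB+ bundles
```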

These numbers... they describe them as a high load scenario but it's actually smaller than our cluster :/ And their buckets are definitely far smaller than ours.
