Detect Numeric Outliers – Advances

Iman Makaremi - Senior Data Scientist, Splunk

Matthew Modestino - ITOA Practitioner, Splunk

So they want to move away from static alarming/decision making. Can the data itself tell you what's normal? Basically, looking for outliers with ML (and the MLTK). One of them is Ops, the other did the math.

"We know what's normal - we collect it every day." You already have the baseline. But how do you write SPL to detect deviation? (Hoping this next bit is relevant to sourcetype volume tracking and to larger anomaly detection work at Yelp.)

github.com/matthewmodestino - he's going to publish the SPL soon-ish.

Alerts have to be actionable: combine Splunk Ninjas, Decision Makers, Domain Experts. This is an interesting approach - far away from ours - because they're not expecting folks to do their own SPL (at least not all the time). On the one hand, I appreciate the separation of concerns; on the other hand, it doesn't optimally empower folks to make Splunk work for them. They consider it a different type of collaboration, which is valid.

Timewrap, Median Absolute Deviation


| tstats prestats=t count WHERE index=main by _time span=300s
| timechart span=300s partial=f count
| timewrap d series=short
| rename s0 AS Today
| foreach s*
... I can't type fast enough :(

Anyway, we're looking for outliers, and the key thing here is Median Absolute Deviation (MAD): build a baseline from past data and flag points that sit too many MADs from the median, which stays robust even when the history itself contains outliers. (I bet we could get an even better baseline by using summary indices, but we might fall prey to overfitting concerns.)
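
Here's a rough sketch of the MAD idea in SPL - not their query (they lean on timewrap; this uses an hour-of-day eventstats grouping instead), and the index, 5-minute span, and 5x MAD threshold are all my assumptions:

index=main earliest=-7d@d
| bin _time span=5m
| stats count BY _time
| eval HourOfDay = strftime(_time, "%H")
| eventstats median(count) AS med BY HourOfDay
| eval absDev = abs(count - med)
| eventstats median(absDev) AS MAD BY HourOfDay
| where abs(count - med) > 5 * MAD

The median-of-deviations baseline stays sane even when the history itself contains spikes, which is the whole appeal over mean/stdev.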

Finds an outlier where his kids streamed Netflix at 7am. Why's that interesting? They don't usually stream Netflix at that time. Pretty cool - he's using his home network as a testbed, but I bet we could do some fancy stuff along the same lines.

Can we do this a lot more? Like, on multiple fields? ...yep. They have an even fancier SPL query for that.

  • You can't do anomaly detection on everything. Have to pick solid KPIs! You'll drown in alerts otherwise.
  • Engage everyone, get multiple folks involved.
  • Use MLTK to do long time-horizon analyses.
  • timewrap and streamstats methods: build a baseline, then identify outliers against it (see the streamstats sketch after this list).
  • Validation is important!
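
And a minimal sketch of the streamstats flavor - again my reconstruction, with a made-up index and a 3-sigma band over a rolling day (288 five-minute buckets):

index=main
| bin _time span=5m
| stats count BY _time
| streamstats window=288 avg(count) AS movingAvg stdev(count) AS movingStdev
| where abs(count - movingAvg) > 3 * movingStdev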

We'd have to do some kind of darklaunching to really rely on this. Also how it plays with not-really-RT data is an open question.

Analyzing Logs from Microservices

Nikhil Mungel - Principal Software Engineer, Splunk

Brian Krueger - Software Engineer, Splunk

This appears to be all about Splunk Cloud plus Amazon ECS. Project Nova, they're calling it. But what are they actually doing?

  • Request tracing across services. (Zipkin)
  • Analyze API access logs (Scrafka)
  • Tracking build/deploy health (Eng Effectiveness metrics)
  • etc etc
  • Code coverage - this build vs prior builds (oh wow he just said Buildbot)

Collection of HTTP microservices that kind of do all this stuff together. (splunknova.com) Swagger, RESTful, etc. GET or POST, /v1/events, just send it JSON. (Waiting to see how this is better than the HTTP event collector.)

Okay, here's the secret sauce: these endpoints make it very easy to get data back out. If you send it data with a field named 'pageLoadTime', you can then hit /v1/pageLoadTime/{mean,p95,etc} and get the numbers back easily. It also supports collectd, fluentd, etc.

Seems like it would be cool to offload basic time series data away from Docker, especially if you didn't run your own PaaS like we do. Another example: push/pull data into/out of this API to trigger ChatOps, IFTTT-style actions, etc.

Multi-Tenancy : Achieving Security, Collaboration, and Operational Efficiency

David Safian - Sr. Systems Engineer, UNC Chapel Hill

Benjamin August - Senior Solutions Engineer, UNC Chapel Hill

We were late to this one, having a lunch date w/ our support and sales folks that went long. Unfortunately for us, the part we did see didn't really go into isolating data owned by different stakeholders. Instead, it went into exposing common data to different stakeholders - things like exposing upgrade status to local network admins in different departments.

Lesser Known Search Commands

Kyle Smith - Integration Developer, Aplura

Ooh it's a potpourri! I tend to really enjoy talks like this - they're really information heavy, just light on application. But it's a great way to force yourself to learn new tools to add to your belt, which is always a win in my opinion.

Administrative (generating) commands

  • rest: get data out of splunk API endpoint. We already use this and know it pretty well.
  • makeresults: fake data! Generates a given number of events w/ dummy data (and a current _time field). Very fast, won't block the search.
  • gentimes: Generates empty time buckets on a given interval, up to the end parameter. Put this first, then a map command after it to inject data into the buckets (see the sketch after this list).
  • metasearch: Retrieves event metadata from indexes based on the search terms (like tstats). |metasearch eventtype=foo earliest=-24h@h
  • metadata: sources, sourcetypes, hosts. Use it to learn what's in your indexes based on metadata alone. Respects the time picker, but really operates on bucket times.
  • tstats: Hello, old friend :)
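
A minimal sketch of that gentimes-then-map pattern (index and inner search are made up for illustration):

| gentimes start=-7 increment=1d
| map maxsearches=10 search="search index=main earliest=$starttime$ latest=$endtime$ | stats count"

gentimes emits starttime/endtime pairs for each daily bucket, and map substitutes them into the inner search via the $field$ tokens.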

Iterative commands

  • union: Merge results from two or more datasets into one. Supports different time ranges, etc. So you can do a search over last_week and a search over this_day and summarize that into one table.
  • map: loop to run a search repeatedly for each input event/result. Can even run it on an existing savedsearch. Watch out for too-large input sets - it'll kill your cluster.
  • foreach: equivalent to multiple chained eval commands! That's cool. Z scores: |foreach * [eval Z_<<FIELD>> = ((<<FIELD>>-MEAN<<MATCHSTR>>) / STDEV<<MATCHSTR>>)] (as transcribed; see the fuller sketch after this list).
  • untable: convert from tabular format to stats output. Inverse of xyseries, can come after a timechart too.
  • contingency: Statistics! Learn the relationship between two or more categorical variables and study the association between them. Appears to just work, at least on two variables, and it's fast.
  • xyseries: Converts results into something you can graph, and lets you compare strings with strings. Example from his demo: weather_data | xyseries icon weather sweather.
  • streamstats: Cumulative summary stats for all results in a streaming fashion. Calculates stats on each event at the time the event is seen.
  • mstats: NEW IN 7.0. Analyze metrics from the metric store!
  • autoregress: Copies a field's value from previous events onto the current event - handy for deltas and moving calculations.
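
Here's a fuller version of that foreach Z-score trick as I understand it - the eventstats step and field names are my reconstruction, not his exact slide:

index=main sourcetype=vendor_metrics
| table _time cpu mem
| eventstats avg(cpu) AS mean_cpu stdev(cpu) AS stdev_cpu avg(mem) AS mean_mem stdev(mem) AS stdev_mem
| foreach cpu mem
    [ eval Z_<<FIELD>> = ('<<FIELD>>' - mean_<<FIELD>>) / stdev_<<FIELD>> ]

foreach substitutes <<FIELD>> textually into the bracketed template, so a single template expands into one eval per field - the "multiple chained evals" he described.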

SPL Hacks (not commands... moar liek 1337 h4x0rz amirite?)

  • Use an eval statement in a timechart! ... | timechart span=1h eval(avg(kb)/avg(ev)) as "AVG KB/event" avg(ev2) as "AVG KB/event - 2". Note there's a difference in significant digits, and you have to rename the field.
  • Dynamic eval (indirect reference): sourcetype=perfmon:dns | eval cnt_{counter} = Value | stats avg(cnt_*) as *
  • Dynamic eval (subsearch): You can run a subsearch anywhere. You could even run it as a parameter to replace!
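
For the subsearch-anywhere hack, the simplest shape I know (purely illustrative - the indexes and fields are invented, and I didn't catch their exact replace example):

index=web status=500 [ search index=deploys sourcetype=deploy_events | head 1 | return host ]

The subsearch runs first, emits host="...", and that text is spliced into the outer search before it executes.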

CLI Commands

  • splunk reload monitor: No need for chef/puppet to restart the forwarder just to pick up monitor changes - anything in a UF config will be picked up w/ this! Very useful for syslog, since a full restart would drop incoming syslog while the forwarder is down.
  • splunk cmd pcregextest: Useful for testing regex for extractions.
  • splunk btool: We know this one. Look at all your configs and show what wins.
  • splunk cmd python: Use the built-in Python interpreter.
  • splunk cmd splunkd print-modinput-config: Prints modular input config.

Integrating Splunk and AWS Lambda: Big Results at Fast-Food Prices

Gary Mikula - Senior Director, Cyber & Information Security, FINRA

Siddhartha Dadana - Lead Security Engineer, FINRA

Kuljeet Singh - Lead Security Engineer, FINRA

All about monitoring Lambdas. Starting with "how do you find the long-running process?" They start off by looking at aggregate CloudWatch metrics, which show that there is an issue... but not which specific function is the culprit. You'd have to go look at that specific lambda - and if you have a lot of them, good luck.

So how do you extricate stuff from lambdas? HTTP event collector, apparently. They build some classes into their lambda deploy packaging so developers can just use it. Basically this is Meteorite but it goes to Splunk instead of an aggregator. Have to make sure to clean up after yourself, because lambdas will keep running as long as something is alive. (This seems lame, I bet there's a way around it...)

They use this stuff to track billing of lambdas at a very fine-grained level. I'd wonder how this RYO approach compares to a dedicated solution like CH, but we haven't built enough lambdas afaik for them to be more than a blip.

Wonder how they handle discovery of the HTTP endpoints? Didn't really say. Could be a big showstopper for us, or anyone who operates at scale.

Now they go into security concerns of lambdas. A lot of it relies on Cloudtrail. 15m granularity... not very RT, now is it? But they're using lambdas to forward Cloudtrail logs from S3 into HTTP event collector.

Honestly, our SQS forwarder is very similar but almost inarguably better. Downside is the maintenance cost, and possibly latency (but hey, you think lambdas never break?...)

They're claiming mean time to get events into Splunk is 2 seconds. (But what is the 99th?) I'm not sure this is a win given Cloudtrail's granularity anyway.

They also raise a lot of SQS-related issues they've encountered to get that 2s latency. Can't lock SQS messages... so they just get a bigger server to poll? We get around this with visibility timeouts, but of course anything that gets caught in that manner will sit around for ~15 minutes before processing. They apparently really, really care about a fast pipeline. "This is a 15m rolling window, but darn I need it right now!"

Cost is really nice though - about $4 to run 100K lambda functions.

Finally, they're going over compliance. Usually they just secure a polling server that goes and asks other systems for data. Looks like part of their motivation for this whole thing is a many-VPC setup, and they're trying to get away from manual provisioning.

Splunk DB Connect Is Back, and It Is Better Than Ever

Tyler Muth - Analytics Architect, Splunk

Denis Vergnes - Senior Software Engineer, Splunk

This is mostly a Q&A plus demo session around the new flavor of Splunk DB Connect.

What's the point of this? Use SPL on DB data; join events with DB data; hold your Splunk reports in DB (?!) and unify your visualization tooling.

What's new and shiny?

  • Better UI w/ better templating etc.
  • Performance: boost up to 10x
  • Stored procedures, more supported platforms (would we need these?)

Apparently you can use this to ingest data from DBs into Splunk - and they have premade input templating: you just need a connection, an input name, and an index. Pretty nice.

It precreates connections too, but this might actually be a bad thing for redshift as we'd absolutely always hog a query slot. On the other hand, maybe that is a Good Thing - that is, we could do the queuing on our end rather than relying on redshift to handle connections, timeouts, etc.

Performance:

  • Connection pooling is removed for commands (whaaaaa)
  • SHC doesn't run scheduled tasks, so there's no HA solution readymade
  • Locally stored checkpoints, 1 JVM per task ...
  • Large datasets are output 2-9 times faster than v2.4.x.
  • Throughput of single input is the same as 2.3.x - 2.5MB/s.

They definitely make it really easy (in this new version) to add new inputs defined as specific queries. The wizard is actually really cool. Of course, we mostly write raw SQL today, so for us this would be a different workflow (more lead time, and who has the permissions?) for joining data. They do support a query mode that's closer to how we do it now, but it's clearly not what they envision as the future.
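
If I have the command name right, query mode boils down to the dbxquery search command; a minimal sketch, with a placeholder connection name and SQL:

| dbxquery connection="redshift_prod" query="SELECT host, count(*) AS cnt FROM access_log GROUP BY host"
| sort - cnt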
