Iman Makaremi - Senior Data Scientist, Splunk
Matthew Modestino - ITOA Practitioner, Splunk
So they want to move away from static alarming/decision making. Can the data itself tell you what's normal? Basically, looking for outliers with ML (and the MLTK). One of them is Ops, the other did the math.
"We know what's normal - we collect it every day." You already have the baseline. But how do you write SPL to detect deviation? (Hoping this next bit is relevant to sourcetype volume tracking and to larger anomaly detection work at Yelp.)
github.com/matthewmodestino - is going to publish the SPL soon-ish.
Alerts have to be actionable: combine Splunk Ninjas, Decision Makers, Domain Experts. This is an interesting approach - far away from ours - because they're not expecting folks to do their own SPL (at least not all the time). On the one hand, I appreciate the separation of concerns; on the other hand, it doesn't optimally empower folks to make Splunk work for them. They consider it a different type of collaboration, which is valid.
Timewrap, Median Absolute Deviation
| tstats prestats=t count WHERE index=main by _time span=300s
| timechart span=300s partial=f count
| timewrap d series=short
| rename s0 AS Today
| foreach s*
... I can't type fast enough :(
Anyway, we're looking for outliers and the key thing here is Median Absolute Deviation. You can calculate outliers based on the past info. (I bet we could have even better basis by using summary indices, but we might fall prey to overfitting concerns.)
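The MAD idea can be sketched quickly in Python. The counts, threshold, and function name here are my own illustration, not from the talk — it just shows why a median-based baseline shrugs off a single spike where a mean-based one wouldn't:

```python
import statistics

def mad_outliers(values, threshold=3.0):
    """Flag values whose distance from the median exceeds
    threshold * MAD (median absolute deviation)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # all points (nearly) identical; nothing to flag
    return [v for v in values if abs(v - med) / mad > threshold]

# e.g. per-5-minute event counts for a sourcetype, with one spike
counts = [100, 102, 98, 101, 99, 103, 97, 400]
print(mad_outliers(counts))  # → [400]
```

The point of MAD over stdev: the 400 spike barely moves the median, so the baseline stays honest even when the anomaly is in the training window.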
Finds an outlier where his kids streamed Netflix at 7am. Why's that interesting? They don't usually stream Netflix at that time. Pretty cool - he's using his home network as a testbed, but I bet we could do some fancy stuff along the same lines.
Can we do this a lot more? Like, on multiple fields? ...yep. They have an even fancier SPL query for that.
- You can't do anomaly detection on everything. Have to pick solid KPIs! You'll drown in alerts otherwise.
- Engage everyone, get multiple folks involved.
- Use MLTK to do long time-horizon analyses.
- timewrap, streamstats methods - baseline, identify outliers.
- Validation is important!
We'd have to do some kind of darklaunching to really rely on this. Also how it plays with not-really-RT data is an open question.
Nikhil Mungel - Principal Software Engineer, Splunk
Brian Krueger - Software Engineer, Splunk
This appears to be all about Splunk Cloud plus Amazon ECS. Project Nova, they're calling it. But what are they actually doing?
- Request tracing across services. (Zipkin)
- Analyze API access logs (Scrafka)
- Tracking build/deploy health (Eng Effectiveness metrics)
- etc etc
- Code coverage - this build vs prior builds (oh wow he just said Buildbot)
Collection of HTTP microservices that kind of do all this stuff together. (splunknova.com) Swagger, RESTful, etc. GET or POST, /v1/events, just send it JSON. (Waiting to see how this is better than the HTTP event collector.)
Okay here's the secret sauce: You can use these endpoints to get data very easily. If you send it data with the name 'pageLoadTime', you can then do this: /v1/pageLoadTime/{mean,p95,etc} And get back raw data easily. It also supports collectd, fluentd, etc.
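Roughly what talking to those endpoints looks like, per my notes. The base URL, auth header scheme, and token are placeholders of mine, and the payload shape is inferred from the demo — check the real docs before using. I'm only building the requests here, not sending them:

```python
import json
import urllib.request

BASE = "https://api.splunknova.com"  # hypothetical base URL
TOKEN = "YOUR_API_TOKEN"             # placeholder

def send_event_request(metric_name, value):
    """Build a POST to /v1/events carrying one JSON event."""
    body = json.dumps([{metric_name: value}]).encode()
    return urllib.request.Request(
        f"{BASE}/v1/events",
        data=body,
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
        method="POST")

def stats_request(metric_name, stat="mean"):
    """Build a GET for an aggregate like /v1/pageLoadTime/p95."""
    return urllib.request.Request(
        f"{BASE}/v1/{metric_name}/{stat}",
        headers={"Authorization": f"Bearer {TOKEN}"})

req = stats_request("pageLoadTime", "p95")
print(req.full_url)  # → https://api.splunknova.com/v1/pageLoadTime/p95
```

The appeal over plain HEC seems to be the read side: you name the field once on ingest and get the aggregates back off a URL, no SPL required.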
Seems like it would be cool to offload basic time series data away from Docker, especially if you didn't run your own PaaS like we do. Another example: push/pull data into/out of this API to trigger ChatOps, ITTT actions, etc.
David Safian - Sr. Systems Engineer, UNC Chapel Hill
Benjamin August - Senior Solutions Engineer, UNC Chapel Hill
We were late to this one, having a lunch date w/ our support and sales folks that went long. Unfortunately for us, the part we did see didn't really go into withholding data from different stakeholders. Instead, it went into exposing common data to different stakeholders - things like exposing upgrade status to local network admins in different departments.
Kyle Smith - Integration Developer, Aplura
Ooh it's a potpourri! I tend to really enjoy talks like this - they're really information heavy, just light on application. But it's a great way to force yourself to learn new tools to add to your belt, which is always a win in my opinion.
rest
: get data out of splunk API endpoint. We already use this and know it pretty well.
makeresults
: fake data! Generate a number of events w/ fake data (and a current _time field). Very fast, won't block search.
gentimes
: Generates empty time buckets on a given interval, up to the end parameter. Put this first, then a map command after it to inject data into the buckets.
metasearch
: Retrieves event metadata from indexes based on search terms (like tstats).
  | metasearch eventtype=foo earliest=-24h@h
metadata
: source, sourcetypes, hosts. Use to learn what's in indexes based on metadata. Respects timepicker, but more so bucket times.
tstats
: Hello, old friend :)
union
: Merge results from two or more datasets into one. Supports different time ranges, etc. So you can do a search over last_week and a search over this_day and summarize that into one table.
map
: loop to run a search repeatedly for each input event/result. Can even run it on an existing savedsearch. Watch out for too-large input sets - it'll kill your cluster.
foreach
: equivalent to multiple chained eval commands! That's cool. Z scores:
  | foreach * [eval Z_<<FIELD>> = ((<<FIELD>>-MEAN<<MATCHSTR>>) / STDEV<<MATCHSTR>>)]
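In Python terms, that foreach template expands to one eval per column — something like this sketch, where the row/means/stdevs and the guard against zero stdev are my own illustration:

```python
def add_z_scores(row, means, stdevs):
    """Mimic `foreach * [eval Z_<<FIELD>> = (<<FIELD>> - mean) / stdev]`:
    emit a Z_<field> column for every field we have a baseline for."""
    out = dict(row)
    for field, value in row.items():
        if field in means and stdevs.get(field):  # skip zero/missing stdev
            out[f"Z_{field}"] = (value - means[field]) / stdevs[field]
    return out

row = {"cpu": 95.0, "mem": 40.0}
means = {"cpu": 50.0, "mem": 40.0}
stdevs = {"cpu": 15.0, "mem": 5.0}
print(add_z_scores(row, means, stdevs))  # Z_cpu = 3.0, i.e. 3 stdevs high
```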
untable
: convert from tabular format to stats output. Inverse of xyseries; can come after a timechart too.
contingency
: Statistics! Learn the relationship between two-plus categorical variables. Study the association between variables. Appears to just work on at least two variables, and it's fast.
xyseries
: Converts results into something you can graph. Allows you to compare strings with strings.
  `weather_data` | xyseries icon weather sweather
streamstats
: Cumulative summary stats for all results in a streaming fashion. Calculates stats on each event at the time the event is seen.
mstats
: NEW IN 7.0. Analyze metrics from the metric store!
autoregress
: Get the last value and append it to the current value.
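A quick Python analog of what autoregress does — the `_p1` suffix matches SPL's default naming for the copied field; the sample events are made up:

```python
def autoregress(events, field, p=1):
    """Mimic SPL `autoregress`: copy the value of `field` from `p`
    events ago into the current event as `<field>_p<p>`."""
    out = []
    for i, ev in enumerate(events):
        ev = dict(ev)
        if i >= p:  # the first p events have no prior value to copy
            ev[f"{field}_p{p}"] = events[i - p][field]
        out.append(ev)
    return out

events = [{"kb": 10}, {"kb": 12}, {"kb": 30}]
print(autoregress(events, "kb"))
# each event now carries the previous event's kb as kb_p1
```

Handy for delta-style checks: once the prior value rides along on the current event, "did this jump?" is a plain eval.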
- Use eval statement in a timechart!
  ... | timechart span=1h eval(avg(kb)/avg(ev)) as "AVG KB/event" avg(ev2) as "AVG KB/event - 2"
  Note there's a difference in sig digits, and you have to rename the field.
- Dynamic eval (indirect reference):
  sourcetype=perfmon:dns | eval cnt_{counter} = Value | stats avg(cnt_*) as *
- Dynamic eval (subsearch): You can run a subsearch anywhere. You could even run it as a parameter to replace!
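The indirect-reference trick, sketched in Python — the perfmon-style sample events are invented, but the shape is the same: a field *name* is built from a field *value*, then aggregated per generated column:

```python
def pivot_counters(events):
    """Mimic `eval cnt_{counter} = Value | stats avg(cnt_*)`:
    turn (counter, Value) pairs into one averaged column per counter."""
    sums, counts = {}, {}
    for ev in events:
        key = f"cnt_{ev['counter']}"  # field name derived from a value
        sums[key] = sums.get(key, 0.0) + ev["Value"]
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

events = [{"counter": "queries", "Value": 10},
          {"counter": "queries", "Value": 20},
          {"counter": "errors", "Value": 1}]
print(pivot_counters(events))  # → {'cnt_queries': 15.0, 'cnt_errors': 1.0}
```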
splunk reload monitor
: Don't need chef/puppet to restart monitor. Anything in a UF config will be picked up w/ this! Very useful for syslog, since doing a hard restart will break its realtime property.
splunk cmd pcregextest
: Useful for testing regex for extractions.
splunk btool
: We know this one. Look at all your configs and show what wins.
cmd python
: Use the builtin python interpreter.
cmd splunkd print-modinput-config
: Prints modular input config.
Gary Mikula - Senior Director, Cyber & Information Security, FINRA
Siddhartha Dadana - Lead Security Engineer, FINRA
Kuljeet Singh - Lead Security Engineer, FINRA
All about monitoring Lambdas. Starting with "how do you find the long-running process?" Starts off by looking at aggregate CloudWatch metrics, which show that there's an issue... but not the underlying culprit. You'd have to go look at that specific lambda - and if you have a lot of them, good luck.
So how do you extricate stuff from lambdas? HTTP event collector, apparently. They build some classes into their lambda deploy packaging so developers can just use it. Basically this is Meteorite but it goes to Splunk instead of an aggregator. Have to make sure to clean up after yourself, because lambdas will keep running as long as something is alive. (This seems lame, I bet there's a way around it...)
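A sketch of the HEC call their deploy-package classes presumably wrap. The host and token are placeholders of mine, and I'm only building the request object, not sending it — but the endpoint path, `Authorization: Splunk <token>` header, and `{"event": ...}` envelope are standard HEC:

```python
import json
import urllib.request

HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # placeholder host
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                    # placeholder token

def hec_request(event, sourcetype="lambda:app"):
    """Build an HTTP Event Collector request for one JSON event.
    In a real Lambda you'd send this and flush before the handler
    returns, so nothing is lost when the container gets frozen."""
    payload = json.dumps({"event": event, "sourcetype": sourcetype}).encode()
    return urllib.request.Request(
        HEC_URL,
        data=payload,
        headers={"Authorization": f"Splunk {HEC_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST")
```

The "clean up after yourself" caveat in the notes maps to that flush-before-return point: anything still buffered when the invocation ends may never go out.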
They use this stuff to track billing of lambdas at a very fine-grained level. I'd wonder how this RYO approach compares to a dedicated solution like CH, but we haven't built enough lambdas afaik for them to be more than a blip.
Wonder how they handle discovery of the HTTP endpoints? Didn't really say. Could be a big showstopper for us, or anyone who operates at scale.
Now they go into security concerns of lambdas. A lot of it relies on Cloudtrail. 15m granularity... not very RT, now is it? But they're using lambdas to forward Cloudtrail logs from S3 into HTTP event collector.
Honestly, our SQS forwarder is very similar but almost inarguably better. Downside is the maintenance cost, and possibly latency (but hey, you think lambdas never break?...)
They're claiming mean time to get events into Splunk is 2 seconds. (But what is the 99th?) I'm not sure this is a win given Cloudtrail's granularity anyway.
They also raise a lot of SQS-related issues they've encountered to get that 2s latency. Can't lock SQS messages... so they just get a bigger server to poll? We get around this with visibility timeouts, but of course anything that gets caught in that manner will sit around for ~15 minutes before processing. They apparently really, really care about a fast pipeline. "This is a 15m rolling window, but darn I need it right now!"
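To make that visibility-timeout trade-off concrete, here's a toy in-memory queue with SQS-style semantics — pure illustration, not the SQS API. A received message isn't locked or deleted, just hidden; if the consumer dies before deleting it, it reappears after the timeout (the ~15-minute sit-around mentioned above):

```python
import time

class ToyQueue:
    """Minimal sketch of SQS visibility-timeout semantics."""
    def __init__(self):
        self.messages = {}  # id -> (body, visible_at timestamp)
        self._next = 0

    def send(self, body):
        self.messages[self._next] = (body, 0.0)
        self._next += 1

    def receive(self, visibility_timeout=900, now=None):
        """Return the first visible message and hide it for
        visibility_timeout seconds instead of locking it."""
        now = time.time() if now is None else now
        for mid, (body, visible_at) in self.messages.items():
            if visible_at <= now:
                self.messages[mid] = (body, now + visibility_timeout)
                return mid, body
        return None

    def delete(self, mid):
        """Consumer acks by deleting; un-deleted messages come back."""
        self.messages.pop(mid, None)
```

The design choice: you trade "exactly one consumer at a time, guaranteed" for "at-least-once, eventually" — which is why a crash mid-processing means a ~900s wait, and why they sized up their poller rather than accept that latency.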
Cost is really nice though - about $4 to run 100K lambda functions.
Finally, they're going over compliance. Usually they just secure a polling server that goes and asks other systems for data. Looks like part of their motivation for this whole thing is a many-VPC setup, and they're trying to get away from manual provisioning.
Tyler Muth - Analytics Architect, Splunk
Denis Vergnes - Senior Software Engineer, Splunk
This is mostly a Q&A plus demo session around the new flavor of Splunk DB Connect.
What's the point of this? Use SPL on DB data; join events with DB data; hold your Splunk reports in DB (?!) and unify your visualization tooling.
What's new and shiny?
- Better UI w/ better templating etc.
- Performance: boost up to 10x
- Stored procedures, more supported platforms (would we need these?)
Apparently you can use this to ingest data from DBs into Splunk - and they have premade input templates: you just need a connection, input name, and index. Pretty nice.
It precreates connections too, but this might actually be a bad thing for redshift as we'd absolutely always hog a query slot. On the other hand, maybe that is a Good Thing - that is, we could do the queuing on our end rather than relying on redshift to handle connections, timeouts, etc.
Performance:
- Connection pooling is removed for commands (whaaaaa)
- SHC doesn't run scheduled tasks, so there's no HA solution readymade
- Locally stored checkpoints, 1 JVM per task ...
- Large datasets are output 2-9 times faster than v2.4.x.
- Throughput of single input is the same as 2.3.x - 2.5MB/s.
They definitely make it really easy (in this new version) to add new inputs defined as specific queries. The wizard is actually really cool. Of course, we use a different mode (raw SQL), so for us this would be a different workflow (more lead time, and who has the permissions?) to join data. They do support a query mode that's closer to how we do it now, but it's clearly not what they envision as the future.