Jeff Champagne, Principal Architect, Splunk
Events fall into buckets, 1+ buckets make up an index, indexes live on indexers.
- As buckets grow, they roll hot->warm->cold->{frozen|delete}
- Hot buckets live in the index's homePath
- Data roll: Can roll out to HDFS
Hot: At least 1 hot bucket per index, per indexer. More are created for each parallel ingestion pipeline, or when quarantine is needed. Quarantine: happens when you load in data with timestamps from ages ago (too old), or when timestamps are broken.
We roll hot->warm when maxHotBuckets (default=3) is exceeded. Also when a bucket's timespan grows too large, it hasn't received data in a while, its metadata gets too large, etc. Key numbers: 800+ IOPS required for a standard workload, 1200+ for a heavy workload. (tl;dr use SSD ;)
He uses Bonnie++ to measure this stuff. They do not support NFS/NAS for hot/warm; instead, use EXT4 or XFS.
Cold storage: Historical data; lets older data live on slower storage since it's searched less often. Buckets roll warm->cold when maxWarmDBCount (default=300) is exceeded (shocking). 350 IOPS is probably the minimum; lower IOPS will just make for slower searches. Supports NFS/NAS.
Frozen storage: No longer searchable. Rolls when total size (hot+warm+cold) is too big, or when the oldest event in a bucket exceeds a certain age. The TSIDX file is removed and the bucket is copied to another location; Splunk no longer manages the files. Can provide a custom script to do this process.
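The bucket-lifecycle settings above all live in indexes.conf. A minimal sketch -- index name, paths, and values are illustrative, not recommendations:

```ini
# indexes.conf -- illustrative values only
[my_index]
homePath   = $SPLUNK_DB/my_index/db          # hot + warm buckets (fast disk)
coldPath   = $SPLUNK_DB/my_index/colddb      # cold buckets (slower disk OK)
thawedPath = $SPLUNK_DB/my_index/thaweddb    # manually restored frozen data

maxHotBuckets  = 3     # roll hot->warm past this count (the default)
maxWarmDBCount = 300   # roll warm->cold past this count (the default)

# Freeze (delete, unless an archive path/script is set) when either limit hits:
frozenTimePeriodInSecs = 15552000   # oldest event older than ~180 days
maxTotalDataSizeMB     = 500000     # hot+warm+cold size cap for this index

# To archive instead of delete, set one of:
# coldToFrozenDir    = /archive/my_index
# coldToFrozenScript = "/opt/splunk_scripts/freeze.sh"
```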
Thawing data:
- Manually copy data into thawedPath, then run `rebuild` from the CLI.
- Reindexing doesn't count against your license.
- Takes time: estimate same rate as indexing new data.
- Avoid this if you'll be frequently doing it, since it's a pain and takes a while.
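A sketch of the thaw procedure, assuming a hypothetical archive path and bucket name:

```shell
# 1. Copy the frozen bucket into the index's thawedPath
#    (paths and bucket name below are made up):
cp -r /archive/my_index/db_1389230491_1389230488_5 \
      $SPLUNK_DB/my_index/thaweddb/

# 2. Rebuild the index/metadata files so the bucket is searchable again:
splunk rebuild $SPLUNK_DB/my_index/thaweddb/db_1389230491_1389230488_5
```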
Deleting data: Happens if you don't set up freezing; the same roll criteria apply.
Splunk Data Roll:
- Enabled per index. Archive data to HDFS (incl. EMR/S3) once oldest event reaches a certain age.
- Virtual indexes created to reference the archived data. A unified search can use both native, virtual indexes together.
- There's a parameter to set the cutoff seconds, so you won't get back duplicate results from both the native and virtual index. (Cool!)
- Do not start using this if you don't already have HDFS set up! Can deploy Splunk in a similar cost/config to do the same thing :)
- I am so glad we have no reason to deploy HDFS :)
TSIDX Reduce:
- Raw events get pushed into a lexicon, which contains offsets pointing to where each term exists in the raw data file.
- Lexicon therefore scales with the cardinality of your data.
- TSIDX Reduce deletes the lexicon, so all searches become brute force.
- How much do you save? Buckets become 30-70% smaller; typical is 60-70%.
- You can look at the size of the merged_lexicon.lex file to see how much you might save.
- Can reduce warm, cold buckets on a per index basis.
- When to use it?
- Use it only for infrequent searches. (Do we even have these?)
- Dense searches (more than 10% of events match): largely unaffected, because they are already essentially brute force.
- Sparse searches: Will become painfully slow: 3-10x slower.
- But commonly, 90% of searches will be under 1 day...
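The per-index knobs for this live in indexes.conf; a sketch with illustrative values:

```ini
# indexes.conf -- illustrative values only
[my_low_touch_index]
enableTsidxReduction = true
# Reduce buckets whose events are older than ~7 days:
timePeriodInSecBeforeTsidxReduction = 604800
```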
Retention:
- Managed per indexer. Hence, data imbalance issues: some indexers freeze a time range before others do, leaving that range partially missing.
- Don't force bucket rolling: bad performance, and the indexer may become unstable. So, don't do every-night deploys with no changes included, since restarts roll hot buckets ;)
Volume Definitions: (this is cool - why aren't we using this?!)
- Set retention for all indexes that reference that volume
- Forces a defined storage amount across multiple indexes.
- Danger: Noisy indexes can cause other indexes' data to be deleted.
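A sketch of volume definitions in indexes.conf (names, paths, and sizes are made up). All indexes on a volume share its cap, which is exactly where the noisy-neighbor danger comes from:

```ini
# indexes.conf -- illustrative values only
[volume:hot]
path = /mnt/fast_disk
maxVolumeDataSizeMB = 400000     # cap shared by ALL indexes on this volume

[volume:cold]
path = /mnt/slow_disk
maxVolumeDataSizeMB = 3000000

[index_a]
homePath = volume:hot/index_a/db
coldPath = volume:cold/index_a/colddb

[index_b]
homePath = volume:hot/index_b/db
coldPath = volume:cold/index_b/colddb
```

When a volume hits its cap, the oldest buckets across every index on that volume get rolled, so a noisy index_b can push out index_a's data.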
Index clustering: We know most of this already! Rebalancing, etc.
Frozen buckets: They're not fixed up, nor are they ever deduplicated. You'd have to dedupe using the `<localid>` in the folder path.
What is fixup? If an indexer dies, we'll re-replicate the buckets that were supposed to exist on that indexer. Doesn't happen on frozen.
Capacity Planning: There's an app: splunk-sizing.appspot.com
- Search factor: Number of TSIDX file copies
- Replication factor: Number of raw data copies
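These are set on the cluster master in server.conf; a minimal sketch with illustrative values:

```ini
# server.conf on the cluster master -- illustrative values only
[clustering]
mode = master
replication_factor = 3   # copies of the raw data across the cluster
search_factor = 2        # of those, copies that also keep TSIDX files (searchable)
```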
Data Model Accelerations
- You're defining a structure for your data, usually a subset of fields, so accelerations are much smaller than the raw data.
- Keep these in hot/warm.
- They're deleted when oldest event exceeds the summary range - not kept longer than raw data.
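The summary range is configurable per data model in datamodels.conf; a sketch with an assumed model name:

```ini
# datamodels.conf -- illustrative values only
[My_Data_Model]
acceleration = true
# Summary range: how far back to build/keep acceleration summaries.
acceleration.earliest_time = -1mon
```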
Can you use RAID 0? Yeah if you trust your infrastructure.
Kellen Green, Senior Software Engineer, Splunk
This is kind of a "how does search work" talk at a general level, but with specific extensions to how Splunk's proprietary stuff works. Much of this talk was kinda "yeah that makes sense" but I'm sure it was more useful to beginners or folks without an algorithms background.
Bloom filters and how they can speed up some parts of searching. Then lots of "walk me through binary search" type stuff.
`transaction` is way slower than `stats` because it runs on the search head rather than on each indexer. Splunk will run everything it can on the indexers, until the first 'slow' command in the chain forces it to move up to the search head. So, command ordering matters a lot.
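A sketch of the classic rewrite, with hypothetical sourcetype/field names. The first query assembles whole transactions on the search head; the second computes per-session durations with `stats`, which distributes to the indexers:

```
sourcetype=web_access | transaction session_id | stats avg(duration)

sourcetype=web_access
| stats min(_time) AS start max(_time) AS end BY session_id
| eval duration = end - start
| stats avg(duration)
```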
Chris Gordon - Software Engineer, Yelp
Zachary Musgrave - Lead Engineer, Yelp
Patrick Shumate - Solutions Architect, AWS
Oh hai! (Ballroom C - don't forget :)
Splunking with Multiple Personalities: Extending Role Based Access Control to Achieve Fine Grain Security of Your Data
Sabrina Lea - Senior Sales Engineer, Splunk
Shaun C - Security Architect, HMG
`getUserInfo`: We already use this, etc. This talk is about the below command...
`getSearchFilter`: Scripted authentication - return a search filter to apply to each user.
You just build a quick python script - her example loads in a CSV file and does something like this:
`--search-filter=(NOT sourcetype=whatever OR (...`
So you can apply this (I think) transparently to each user. Pretty fine-grained stuff, more so than the 'what indexes can you search' that we do now.
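A minimal sketch of that filter-building logic in Python. The CSV format (`user,sourcetype`) and the output line are assumptions -- see Splunk's scripted authentication docs for the real I/O handshake:

```python
import csv
import io

# Made-up ACL -- a real deployment would read this from a file.
ACL_CSV = "user,sourcetype\nalice,web_access\nalice,app_logs\nbob,firewall\n"

def build_search_filter(csv_text, username):
    """Return an SPL filter restricting `username` to its allowed sourcetypes."""
    allowed = [
        row["sourcetype"]
        for row in csv.DictReader(io.StringIO(csv_text))
        if row["user"] == username
    ]
    if not allowed:
        # No entries for this user: match nothing, rather than everything.
        return "sourcetype=__no_access__"
    return "(" + " OR ".join(f"sourcetype={st}" for st in allowed) + ")"

if __name__ == "__main__":
    # Splunk invokes the script per user; the exact output format here is
    # an assumption modeled on the example shown in the talk.
    print("--status=success --search-filter=" + build_search_filter(ACL_CSV, "alice"))
```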
You can set stuff for auditing/security requirements, like "you can only see auditing data if you're in the correct country, AND if you assume the correct role."
This basically allows "row-level security," not "cell-level security." That's their big stated next step.