Notes for Windows apps
The Windows Event Viewer is lousy. We can do better in Splunk, and it will be a nice test case for the new AppFx framework. Since AppFx is still in early development, I have intentionally done my thinking about viewing Windows events in Splunk before I learned what has been done so far on AppFx.
I trawled through questions tagged with ‘windows’ on ServerFault, looking for issues people were trying to diagnose. A few clear, obvious areas came up where we can provide a lot of value very quickly:
- When were systems booted, shut down, and restarted over the history of the machine, how long did it take, and where was that time spent?
- When were applications/MSIs installed, changed, or uninstalled, and what are their details (GUIDs, etc.)?
- Which Windows updates were applied when? Which were opted out of?
- What programs have bound and released TCP ports over time? What program is dead, but hasn’t released that port you need?
- What IPs and other information were bound to various network interfaces in the past? What changes were made to the network configuration, and when? What was the IP address of that old NIC you just replaced?
- When were files opened/closed/read/written, and which applications and users did so? Which program still has that file you’re trying to delete open?
I also spent some time analyzing how I used logs. It boils down to three situations:
1. I am recording events at known execution points to probe the internals of a process.
2. The system has not behaved as I expected, and I’m trying to figure out why.
3. I am performing a routine inspection, baselined against a previous period, or my own internal model of a correct state.
* 1. Recording events to probe a process
My precise process here:
1. Instrument the process.
2. Choose a filter to get the probe events I’m using.
3. Run the process, watch the events.
4. Formulate a hypothesis, and invent a way to test it.
5. Run the process one or more times to generate the data for testing the hypothesis.
6. Isolate the event traces for each run.
7. Compare the traces. Do they match the hypothesis? If not, go to 4.
In a filter, the fields are: relevant logs, relevant severities, keywords, expected event codes. The time range should be from ten minutes before I started working on this to now.
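As a rough sketch of that filter as a data structure (the class, field names, and sample values below are mine, not an existing AppFx API):

```python
from datetime import datetime, timedelta

class SearchFilter(object):
    """Hypothetical container for the probe filter described above."""

    def __init__(self, logs, severities, keywords, event_codes, started_at=None):
        self.logs = list(logs)                # relevant logs
        self.severities = list(severities)    # relevant severities
        self.keywords = list(keywords)
        self.event_codes = list(event_codes)  # expected event codes
        # Default time range: ten minutes before I started working on this.
        self.earliest = (started_at or datetime.now()) - timedelta(minutes=10)

    def time_range(self):
        """Return (earliest, latest); latest is always 'now'."""
        return self.earliest, datetime.now()

# Example: probing an MSI install run.
f = SearchFilter(logs=["Application"], severities=["Error", "Warning"],
                 keywords=["msiexec"], event_codes=[11707, 11708])
```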
The interface I envision for this has two views:
- timeline :: a timeline with a pause/play button for the search. I use the timeline to watch the events as they run, and to select regions.
- regions :: a viewport on all the regions I have selected, lined up in columns that I can scroll left and right. I can also reorder and delete them.
For the timeline view, the events have height determined by the amount of time between them. Say I have events at times x, y, and z. Then y’s display runs from y - min(A, max(B, (y - x)/2)) to y + min(A, max(B, (z - y)/2)), where A is the maximum half-width of an event (just to prevent ridiculous gobs of empty space) and B is the minimum half-width. This scaling is to make it easy to select regions (which will typically end and be followed by not much going on).
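A minimal sketch of that clamping (the function name and the sample numbers are mine, purely for illustration):

```python
def display_extent(x, y, z, max_half_width, min_half_width):
    """Compute the display extent of event y, which sits between events x and z.

    Each half-width is half the gap to the neighboring event, clamped between
    the minimum (so dense runs stay selectable) and the maximum (so quiet
    stretches don't produce ridiculous gobs of empty space).
    """
    lower_half = min(max_half_width, max(min_half_width, (y - x) / 2.0))
    upper_half = min(max_half_width, max(min_half_width, (z - y) / 2.0))
    return y - lower_half, y + upper_half

# Events at t=0, t=10, t=500 with A=60 and B=5: the gap above y=10 is huge,
# so its upper half-width is clamped to 60.
print(display_extent(0, 10, 500, max_half_width=60, min_half_width=5))
# -> (5.0, 70.0)
```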
There will be a ‘new region’ button where you can set the bounds of the region. First select the start of the region (or space bar to select the first event after the previous region). Then select the end of the region (or space bar for the last recorded event).
* 2. Computer not behaving as expected
Investigating this goes through three phases.
1. Figure out how to reproduce the behavior and minimize that process.
2. Adjust the filter to pick up what’s needed for diagnosis.
3. Probe as in the previous section.
For (1), the log is useful mostly just to get an idea of what’s going on when the behavior occurs. The best we can provide here is a souped-up ‘tail -f’ (with pause/play and quick filtering).
When we get to (2), we want to be able to very quickly try a change and back it out. For that, I want browser-style forward and back buttons. The same fields are relevant as before (severity, keywords, relevant logs, event codes), plus now we want to be able to set the time window.
For the keywords, error codes, and severities, we want to be able to quickly and temporarily deactivate a query item to see what happens without it, then with no more than a click, reinstate it or delete it fully.
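As a rough sketch of that interaction, assuming nothing about the eventual AppFx API (the class and method names here are placeholders):

```python
class FilterHistory(object):
    """Browser-style back/forward over filter states, with toggleable items."""

    def __init__(self, initial_items):
        # Each item is (term, active); deactivating flips the flag, deleting removes it.
        self._states = [list(initial_items)]
        self._pos = 0

    def _push(self, items):
        # Editing after going "back" discards the forward states, like a browser.
        self._states = self._states[:self._pos + 1] + [items]
        self._pos += 1

    def toggle(self, term):
        self._push([(t, (not a) if t == term else a) for t, a in self.current()])

    def delete(self, term):
        self._push([(t, a) for t, a in self.current() if t != term])

    def back(self):
        self._pos = max(0, self._pos - 1)

    def forward(self):
        self._pos = min(len(self._states) - 1, self._pos + 1)

    def current(self):
        return self._states[self._pos]

h = FilterHistory([("severity=Error", True), ("keyword=dhcp", True)])
h.toggle("keyword=dhcp")   # see what happens without it
h.back()                   # one click to reinstate it
```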
* 3. Baselining for auditing
When I’m auditing, I either work against a fictional model in my head, or against some previous state that I know to be good. However, I don’t want to deal with raw events. I want to have a representation of the two time series I’m trying to compare.
Instead of events, the unit of the time series is a tuple emitted in an event. For example, a tuple of a username, login succeeded event code, and a particular machine would be an element of the time series model. It is emitted, with various other data around it, each time that user logs in, and as long as it is valid and occurs in a reasonably consistent fashion, we don’t worry about it.
So the events are summarized by a hidden Markov model that emits events containing tuples over time. We calculate emission rates for each tuple from both the baseline data and the data to audit, and from the emission rates we calculate any explanatory variables in terms of time and other data in the events. The interface must provide a simple way to compare the rates and their breakdown into explanatory variables, and to modify the explanatory variables.
It must also show the events that are not accounted for by the tuple calculation (for instance, a one-time 404 from someone mistyping a URL by hand on a web server might not make it through a filter for an important tuple, but it should be visible as something funny that happened).
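A toy sketch of the rate calculation, with simple per-hour counts standing in for the real HMM emission rates (all names and sample events below are hypothetical):

```python
from collections import Counter

def emission_rates(events, tuple_of, window_hours):
    """Count how often each tuple was emitted per hour over the given window."""
    counts = Counter(tuple_of(e) for e in events)
    return dict((t, n / float(window_hours)) for t, n in counts.items())

def compare_rates(baseline, audit):
    """Per-tuple (baseline_rate, audit_rate), with 0.0 for tuples missing on one side."""
    return dict((t, (baseline.get(t, 0.0), audit.get(t, 0.0)))
                for t in set(baseline) | set(audit))

# Hypothetical login events; the tuple is (user, event code, machine).
tuple_of = lambda e: (e["user"], e["code"], e["machine"])
baseline_events = [{"user": "alice", "code": 4624, "machine": "web01"}] * 40
audit_events = [{"user": "alice", "code": 4624, "machine": "web01"}] * 5
diffs = compare_rates(emission_rates(baseline_events, tuple_of, 168),
                      emission_rates(audit_events, tuple_of, 168))
```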
A user can examine the actual events in both the audit and baseline data sets that matched each tuple, force terms out of or into tuples, and add and remove explanatory variables from the models generating tuples. The explanatory variables take the form “query := [field=]val [(AND|OR) query]”. val can be * or a particular value. If it is a particular value, the boolean expression becomes a binary variable dividing events into two conditions. If it is *, the system regresses against the values if they are numbers, and otherwise does a categorical breakdown.
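A rough cut at that grammar, assuming left-to-right evaluation and no parentheses (this parser is illustrative, not an existing implementation):

```python
import re

def parse_query(text):
    """Parse 'query := [field=]val [(AND|OR) query]' into a nested structure.

    Returns either a term dict {"field": ..., "value": ...} (field may be None)
    or an op dict {"op": "AND"|"OR", "left": term, "right": parsed rest}.
    """
    parts = re.split(r"\s+(AND|OR)\s+", text.strip(), maxsplit=1)
    field, _, value = parts[0].rpartition("=")
    term = {"field": field or None, "value": value}
    if len(parts) == 1:
        return term
    return {"op": parts[1], "left": term, "right": parse_query(parts[2])}

# A field=val term plus a wildcard: * triggers regression or a categorical breakdown.
print(parse_query("EventCode=4624 AND user=*"))
```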
Each tuple in the audit, if a difference in rate is detected, is marked as “new”, “missing”, “changed”, or some other relevant value. A tuple with a rate difference can be declared benign (and thus used in future baselines) or declared an incident (the user is going to take care of it, but it shouldn’t be part of the baseline).
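The marking step might look roughly like this (the tolerance threshold and labels here are my own placeholders):

```python
def classify(baseline_rate, audit_rate, tolerance=0.25):
    """Label a tuple's rate difference: new, missing, changed, or unchanged."""
    if baseline_rate == 0.0 and audit_rate > 0.0:
        return "new"
    if audit_rate == 0.0 and baseline_rate > 0.0:
        return "missing"
    if abs(audit_rate - baseline_rate) > tolerance * baseline_rate:
        return "changed"
    return "unchanged"

print(classify(0.0, 0.3))    # -> "new"
print(classify(0.24, 0.03))  # -> "changed"

# Dispositions chosen by the user: "benign" folds into future baselines,
# "incident" is handled but stays out of the baseline.
dispositions = {}  # tuple -> "benign" | "incident"
```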