Skip to content

Instantly share code, notes, and snippets.

@sethhall
Created August 23, 2018 14:35
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save sethhall/727ac36a630a642ca941661db68b87f4 to your computer and use it in GitHub Desktop.
Save sethhall/727ac36a630a642ca941661db68b87f4 to your computer and use it in GitHub Desktop.
Get some extra file names from http
redef record HTTP::Info += {
potential_fname: string &optional;
};
event http_request(c: connection, method: string, original_URI: string,
unescaped_URI: string, version: string) &priority=5
{
# Get rid of uri arguments
local path = split_string(c$http$uri, /\?/)[0];
local out = split_string(path, /\//);
# Take the last component in the uri path
c$http$potential_fname = out[|out|-1];
}
event http_header(c: connection, is_orig: bool, name: string, value: string) &priority=3
{
if ( is_orig )
return;
if ( name == "ETAG" && /\"/ in value )
{
if ( c$http?$potential_fname && c$http$potential_fname != "" )
c$http$current_entity$filename = c$http$potential_fname;
}
}
@sethhall
Copy link
Author

(replying to your previous comment about this script now...)

I think our documentation for the filename field indicates more information about the field...

A filename for the file if one is available from the source for the file. These will frequently come from “Content-Disposition” headers in network protocols.

It's not a very adept description, but I think that it does pretty clearly point out that the filename is merely an indicator of the name of the file collected from its source. Even the assumption that you made about characteristics of the filename isn't exactly correct because different systems will have different possible filenames and already from network traffic we're getting files names from HTTP, SMTP, SMB, FTP, and possible others that I'm not thinking about off the top of my head. Those different carrier protocols have different notions of valid filenames already.

We could change the name of the field and it wouldn't really matter, if someone wants to shoot themselves in the foot, they'll see a filename collected from the network and write it to disk with that name which is why I've been regularly pointing out for years not to do that. 😄

My only hope is to educate enough people that eventually enough people get the message that there will always be someone there to inform a user that hasn't been fully informed yet (and to also never implement any code publicly that does this because it's such a blindingly bad habit). Of course, as you can probably guess, I've definitely implemented it before just because it was fun.

@duffy-ocraven
Copy link

This thread has taken on a life of its own, and really--I only started out thinking "I'll suggest some validation" and then "hey, the code can also as a convenience do the benign instances of unescaping while it validates", which engendered the first post. I am glad to have had the dive-deeper into what distinguishes an id vs a uri vs a filename in the zeek realms. I came into this discussion with a profound respect for the uri escape specification in https://tools.ietf.org/html/rfc3986. I think they correctly captured: pay with human-reader-inconvenience and do substitution of all potentially dangerous characters (a surprisingly fair-price since those usually are not-relevant to legit use-cases). https://tools.ietf.org/html/rfc8089 for filenames specifically builds upon that, and I think we will avoid trouble if we stay close to those principles.

@sethhall
Copy link
Author

I've been enjoying this discussion quite a bit!

The 8089 RFC actually points out...

Treating a non-local file URI as local, or otherwise attempting to
   perform local operations on a non-local URI, can result in security
   problems.

I would take it to mean that the 8089 RFC is actually not terribly relevant to anything we're doing with Zeek because basically everything we have comes from the network and would be classified as non-local.

My general perspective on Zeek logs is that they aren't necessarily "words" anyway. Internally in Zeek they're all just arbitrary sequences of bytes which means they can technically haves nulls or anything else in them and the moment we start to try protect something like the filename field as safe to use for arbitrary purposes is the day that we start getting ourselves into trouble. Basically none of the data coming out of Zeek is safe to do anything with if a user decodes escaped fields (since the tab separated data and json both escape non-printable stuff) they have to adopt the notion that any data could end up in any field.

Have you viewed Zeek logs as sequences of bytes where anything could end up in them or had you taken a different perspective on the logs? I'm sort of wondering if we just have a disconnect on the way we think about the logs.

@duffy-ocraven
Copy link

I also think this discussion is still helping us each articulate and see some subtle but important things. Thanks for hanging in there, if I ever express myself too obtusely.

You identified something there at the last: "we have a disconnect on the way we think about the logs". I am relatively new to zeek (just a few months) and one of the top-of-mind thoughts I had was "I have got to code myself a customized LESS, that columnarizes and word-wraps when log viewing, at least as well as HTML tables and/or RTF does it." I think the human reader of logs is a vital audience, activating the penchant for pattern detection that the human brain is for-better-or-for-worse so prone to.

LESS has saved my bacon innumerable times. It can still let oneself get in a jam (for instance don't jump to end of file if the files ends with thousands of \x00, if you ever want to get your keyboard to respond again) but I had expected and I guess I am trying to here encourage zeek to regard "verbatim representation" as not as much of a boon to the zeek programmers and users, as a boon to the attackers. "unambiguous representation" I absolutely concur is something that the zeek programmers and users must have. But it can be expressed with a dialect that is constrained to only be comprised of benign characters.

@duffy-ocraven
Copy link

duffy-ocraven commented Sep 15, 2020

Oh and a small clarification, so that we don't digress over a canard. I realize Zeek logs aren't sequences of bytes where anything could end up in them, because the tab separated data and json both escape non-printable stuff. But internally in Zeek I worry if in every datatype they're all just arbitrary sequences of bytes which means they can technically haves nulls or anything else in them. I would blanche if hash results could haves nulls or anything such in them. The point I am raising in this discussion is that programmers carry some semantic baggage as they read variable and type names. I blanche if a "filename" can contain a * or / or \. It needs to be termed a filepath if it is the '/' delimited hierarchy. It needs to be a fullpath if it is the filepath and filename concatenated. It needs to be a pattern if it can contain * or ?.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment