holybit/gist:a387a88ae4f131c4f327

## gistfile1.md

      
### Current Apache Log Format

Old-school key=value pairs. The Apache httpd.conf snippet follows:
LogFormat "site=%{site_name}e ip=%h datetime=\"%{%F %H:%M:%S %z}t\" timestamp=%{%s}t host=%V request=\"%r\" status=%>s response_size=%b response_time=%>D referer=\"%{Referer}i\" user_agent=\"%{User-Agent}i\" filename=%f session_id=%{rp_session_id}n tracking_id=%{RPID}C user_id=%{user_id}n realm_id=%{realm_id}n superuser_id=%{superuser_id}n" custom_log
### Log Storage

Logs are currently parsed by a Perl script and then stored in Hadoop Hive for a number of business-critical use cases.
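For anyone not familiar with the format, here is a rough Python sketch of what parsing one of these key=value lines involves. It is not the Perl script mentioned above, just an illustration; the names `PAIR_RE` and `parse_kv_line` are made up for this example, and it assumes values are either double-quoted (with `\"` for embedded quotes, as Apache escapes them) or contain no whitespace.

```python
import re

# One key=value pair: the value is either a double-quoted string
# (backslash escapes allowed inside) or a run of non-space characters.
# Assumption: keys consist of word characters only, as in the LogFormat above.
PAIR_RE = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S*))')


def parse_kv_line(line):
    """Parse one key=value access-log line into a dict of strings."""
    fields = {}
    for match in PAIR_RE.finditer(line):
        key, quoted, bare = match.groups()
        # Exactly one of the two value groups participates in a match.
        fields[key] = quoted if quoted is not None else bare
    return fields
```

Because a quoted value is consumed as a whole, an `=` inside it (for example a query string in `request`) does not get mistaken for a new field.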
### Going Forward

We want to start using ELK soon, but we cannot store the logs in Elasticsearch for another month or so because the ES cluster is not yet ready.
The big caveat is that logs must continue to load into Hadoop Hive.
### Questions

Should we change the Apache output to JSON or leave it as is (i.e., key=value)? If we feed JSON logs into Logstash, we'll have to store them on disk for eventual insertion into Elasticsearch and, at the same time, pivot the data back to key=value pairs and emit them to a file that can be loaded into Hadoop Hive.
What are folks' general thoughts on using JSON in this scenario? Or should we just stay with the old key=value format?
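To give a feel for how much work the "pivot back" step would be, here is a minimal Python sketch that turns one JSON log record into a key=value line. Everything in it is an assumption for illustration: the function name `json_to_kv`, the `QUOTED_FIELDS` set (guessed from which fields are quoted in the LogFormat above), and the idea that each JSON record uses the same field names as the current format.

```python
import json

# Fields that are quoted in the key=value format because their values
# may contain spaces. Assumption: mirrors the quoting in the LogFormat above.
QUOTED_FIELDS = {"datetime", "request", "referer", "user_agent"}


def json_to_kv(json_line, field_order):
    """Render one JSON log record as a single key=value line.

    field_order is the list of keys to emit, in output order.
    Missing fields are written as "-", the placeholder Apache uses.
    """
    record = json.loads(json_line)
    parts = []
    for key in field_order:
        value = str(record.get(key, "-"))
        if key in QUOTED_FIELDS:
            # Escape backslashes and quotes so the line stays parseable.
            value = '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
        parts.append(f"{key}={value}")
    return " ".join(parts)
```

For example, calling `json_to_kv` on a record with `site`, `ip`, `request` and `status` and asking for those four fields plus `referer` yields a line with `referer="-"` at the end. The conversion itself is small; the real cost is in running and monitoring an extra step in front of the Hive load.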