shiumachi/ltsv_faq.rst

    LTSV FAQ

This document is an English translation of the LTSV FAQ (originally written in Japanese).
What is LTSV?

LTSV (Labeled Tab-Separated Values) is a text format specification, like CSV, TSV, and JSON. It's especially useful for httpd access logs.
The specification is available at http://ltsv.org .
LTSV is just a log format.
Hey, LTSV looks like TSV with a name attached to each value!

Yes!
For example, the following log line:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
is written like this in LTSV:
host:127.0.0.1<TAB>ident:-<TAB>user:frank<TAB>time:[10/Oct/2000:13:55:36 -0700]<TAB>req:GET /apache_pb.gif HTTP/1.0<TAB>status:200<TAB>size:2326<TAB>referer:http://www.example.com/start.html<TAB>ua:Mozilla/4.08 [en] (Win98; I ;Nav)
Why has such a log format become a hot topic?

The 'combined' log format, commonly used for Apache's access_log, has a couple of drawbacks:

it's inconvenient to parse
it's hard to add new values

Everyone has kept using this format only because everyone else was using it. But recently someone noticed that the problems above can be solved by LTSV, which requires only very small changes. That's why many people are excited about this format.
For more details about this excitement, see the following page (Japanese):
【今北産業】3分で分かるLTSV業界のまとめ【LTSV】
What benefits does LTSV provide?


easy to parse

in Ruby:

Hash[gets.chomp.split("\t").map{|f| f.split(":", 2)}]

no dedicated parser is required
no dedicated formatter is required to output data

you can produce LTSV using only the log format settings in the Apache/nginx config file

thanks to the labeled values, the parsed data is easy to process
the row-oriented format makes it easy to integrate with other programs
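
As a rough illustration of the points above, here is a minimal Ruby sketch that writes and reads a record without any dedicated library. The helper names `to_ltsv` and `parse_ltsv` are made up for this example and do not come from any LTSV implementation.

```ruby
# Hypothetical helpers for illustration; not part of any LTSV library.

# Build one LTSV line from a Hash: join "label:value" pairs with TABs.
def to_ltsv(record)
  record.map { |label, value| "#{label}:#{value}" }.join("\t")
end

# Parse one LTSV line back into a Hash.
# chomp removes the trailing newline; split(":", 2) splits at the first colon only.
def parse_ltsv(line)
  line.chomp.split("\t").map { |field| field.split(":", 2) }.to_h
end

line = to_ltsv("host" => "127.0.0.1", "status" => "200", "size" => "2326")
record = parse_ltsv(line)

# Labeled values are looked up by name instead of by position.
puts record["status"]
```

The sketch assumes every field contains a colon; a real script may want to skip or report fields that do not.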

For more detail, see the following URLs (both Japanese):

Labeled Tab Separated Valueノススメ - stanakaのブログ
LTSV が行指向な Key-Value フォーマットで捗る話 - naoyaのはてなダイアリー

What does "open for extension" mean?

Imagine you have an LTSV log like this:
host:127.0.0.1<TAB>ident:-<TAB>user:frank<TAB>
And you have a hundred scripts that parse the log and do something with it:
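#!/usr/bin/env ruby

# Read LTSV records line by line from STDIN (or from files given as arguments).
while gets
  # chomp drops the trailing newline so it doesn't end up in the last value
  record = Hash[$_.chomp.split("\t").map{|f| f.split(":", 2)}]
  # do something with the record
end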
One day, you notice that the log doesn't contain a timestamp, and you want to add one:
time:[10/Oct/2000:13:55:36 -0700]<TAB>host:127.0.0.1<TAB>ident:-<TAB>user:frank<TAB>
Does this change affect the hundred scripts? Do they fail to parse the new data? No.
If the log used the combined format and was parsed with regular expressions, none of the scripts would keep working.
With LTSV, the scripts are unaffected no matter where you insert the time field. Additionally, if a script can accept an arbitrary number of values, it can start using the timestamp as soon as you add the time field to the record.
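
The scenario above can be sketched in a few lines of Ruby. `host_of` is a hypothetical stand-in for one of the hundred scripts; it only cares about the host field.

```ruby
# Hypothetical stand-in for an existing script; it only reads "host".
def parse_ltsv(line)
  line.chomp.split("\t").map { |field| field.split(":", 2) }.to_h
end

def host_of(line)
  parse_ltsv(line)["host"]
end

old_line = "host:127.0.0.1\tident:-\tuser:frank\t\n"
new_line = "time:[10/Oct/2000:13:55:36 -0700]\thost:127.0.0.1\tident:-\tuser:frank\t\n"

# Fields are looked up by label, so the extra field and the changed
# field order make no difference to the existing script.
puts host_of(old_line)
puts host_of(new_line)

# A script that wants the timestamp can simply start reading it.
puts parse_ltsv(new_line)["time"]
```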
What are the weak points of LTSV?

Compared to the 'combined' log format:

it's a bit less readable than a combined log


but do you really think a combined log is readable?


the record size increases by the length of the field names

I don't think these points are critical, and where they do matter they can be solved.
JSON seems better, it can contain structured data ...


While JSON and MessagePack are good at labeling data, data in those formats is not as easy to parse.
Generating Apache/nginx logs in JSON format takes non-trivial work.

The advantage of LTSV is that users can migrate from a less extensible log format with almost no effort.
Why is escaping not included in the specification?

The specification of LTSV is as follows:

do not use a colon ":" in a label (key), because the colon is the delimiter between label and value in LTSV.
each field is delimited by a TAB.

That's all.
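
Note that the restriction applies to labels, not values. A small Ruby sketch of why values such as URLs and timestamps can still contain colons, assuming the parser splits each field at the first colon only (`split_field` is a name invented for this example):

```ruby
# Split one field into [label, value] at the first colon only.
# The limit argument 2 keeps any later colons inside the value.
def split_field(field)
  field.split(":", 2)
end

label, value = split_field("referer:http://www.example.com/start.html")
puts label   # the part before the first colon
puts value   # everything after it, later colons included
```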
The LTSV specification doesn't define escaping.
Here are the reasons:

parsing becomes harder if escaping is strictly defined.
"Hey, I wonder if some string like User-Agent contains a TAB character ..." -> "Never."

Kazuho Oku wrote about this on his blog (Japanese). According to that post, Apache HTTP Server escapes all control characters in its logs because of a vulnerability.


You can find many implementations in various languages on ltsv.org, but you don't need any of them to parse LTSV.
Just use this tiny script:
#!/usr/bin/env ruby

# Parse each LTSV line from STDIN (or from files given as arguments) and print it as a Hash.
while gets
  # chomp drops the trailing newline so it doesn't end up in the last value
  record = Hash[$_.chomp.split("\t").map{|f| f.split(":", 2)}]
  p record
end
Pretty easy! I don't need an escaping specification myself, but there is some discussion about extended specifications such as strict-LTSV.
Can I use LTSV for non-access log?

Of course!
Since it's just a format specification like CSV, TSV, and JSON, you can use LTSV anywhere.
Can I name labels freely, without any restrictions?

Yes. If you apply LTSV to access logs, I recommend using the labels listed under "Recommendation for labeling" on ltsv.org.
LTSV is not readable :(

You can use a filter like ltsview. Implementing such a filter is pretty easy.
With this kind of filter, you can tail the log with formatting:
$ tail -f access_log | ltsview
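
To give a sense of how small such a filter can be, here is a minimal Ruby sketch that prints one field per line with aligned labels. It is only an illustration under my own assumptions; `pretty` is a made-up name and this is not how ltsview itself is implemented.

```ruby
# Hypothetical pretty-printer for one LTSV line; not related to ltsview's code.
def pretty(line)
  fields = line.chomp.split("\t").map { |field| field.split(":", 2) }
  # Pad every label to the width of the longest one (0 if the line is empty).
  width = fields.map { |label, _| label.length }.max.to_i
  fields.map { |label, value| "#{label.ljust(width)}  #{value}" }.join("\n")
end

puts pretty("host:127.0.0.1\tstatus:200\tua:Mozilla/4.08 [en] (Win98; I ;Nav)")
```

To turn this into a filter, a script could call `pretty` on each line it reads from standard input and print the result followed by a blank line.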
If you want to read a log in the combined format, you can write a filter that converts LTSV to the combined format.
Since the LTSV specification is based on the UNIX philosophy, LTSV is row-oriented, self-describing, and open for extension. That's why you can implement an LTSV filter very easily.
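
A converter back to the combined format could look roughly like the following Ruby sketch. It assumes the labels used in the example near the top of this FAQ (host, ident, user, time, req, status, size, referer, ua) and that the time value already includes its square brackets; `ltsv_to_combined` is a hypothetical name.

```ruby
# Hypothetical converter: rebuild a combined-format line from one LTSV line.
# Assumes the labels from the example earlier in this FAQ.
def ltsv_to_combined(line)
  r = line.chomp.split("\t").map { |field| field.split(":", 2) }.to_h
  r.default = "-"  # a missing field is written as "-"
  format('%s %s %s %s "%s" %s %s "%s" "%s"',
         r["host"], r["ident"], r["user"], r["time"], r["req"],
         r["status"], r["size"], r["referer"], r["ua"])
end

puts ltsv_to_combined("host:127.0.0.1\tident:-\tuser:frank\tstatus:200")
```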
I'm concerned log size will be bigger...

This is my personal opinion:

In an access_log, the request URI, User-Agent, and Referer usually account for a large portion of the total data size, so the size of the labels doesn't matter much.
If your system is so big that adding labels generates a huge amount of data, you need another solution that can process and ingest massive logs anyway.


processing with MapReduce (Hadoop or Amazon EMR), storing data in a DWH, etc.
If you import the log via Fluentd, for example, the size overhead of the labels goes away.


Hatena, a large web service company in Japan (over 1M users), has used LTSV for 3 years. This suggests that the label overhead is not a problem as long as your system is smaller than Hatena's.
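
If you want to check the overhead on your own logs, a rough Ruby sketch like the following counts the bytes spent on labels in a line. `label_overhead` is a name invented here, and the sample line is the short example from the top of this FAQ, so its ratio says nothing about real traffic.

```ruby
# Bytes spent on labels in one LTSV line: each label plus its ":" delimiter.
def label_overhead(line)
  line.chomp.split("\t").sum do |field|
    field.split(":", 2).first.to_s.bytesize + 1
  end
end

line = "host:127.0.0.1\tident:-\tuser:frank\ttime:[10/Oct/2000:13:55:36 -0700]\t" \
       "req:GET /apache_pb.gif HTTP/1.0\tstatus:200\tsize:2326\t" \
       "referer:http://www.example.com/start.html\tua:Mozilla/4.08 [en] (Win98; I ;Nav)"

overhead = label_overhead(line)
puts format("labels: %d of %d bytes (%.1f%%)",
            overhead, line.bytesize, 100.0 * overhead / line.bytesize)
```

Running the same calculation over a sample of production lines gives a more meaningful figure than any single record.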
Is LTSV a specification for Fluentd?

No.
LTSV is just a format specification, and it's not a sub-project of any other software. The Fluentd community is so excited because Fluentd is popular for processing Apache and nginx access logs.
If you use LTSV, your Fluentd configuration becomes simple and DRY. This solves a tough problem for administrators (especially in long-term operation).
How do I transform a combined log into LTSV?

A blogger wrote a Perl script for this:
404 Blog Not Found:perl - Apache Combined Log を LTSV に (Japanese)
You can use this script without any external library.
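
For readers who prefer Ruby, the same idea can be sketched as follows. This is not the Perl script linked above; it is a minimal illustration that assumes the standard combined format and the labels used in the example at the top of this FAQ. `COMBINED_RE`, `LABELS`, and `combined_to_ltsv` are names made up for the sketch.

```ruby
# Hypothetical converter, unrelated to the Perl script linked above.
# Quoted fields may contain backslash-escaped characters such as \".
QUOTED = /"((?:[^"\\]|\\.)*)"/
COMBINED_RE = /\A(\S+) (\S+) (\S+) (\[[^\]]+\]) #{QUOTED} (\d{3}) (\S+) #{QUOTED} #{QUOTED}\z/
LABELS = %w[host ident user time req status size referer ua]

# Returns the LTSV line, or nil if the input is not in combined format.
def combined_to_ltsv(line)
  m = COMBINED_RE.match(line.chomp)
  return nil unless m
  LABELS.zip(m.captures).map { |label, value| "#{label}:#{value}" }.join("\t")
end

puts combined_to_ltsv('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] ' \
                      '"GET /apache_pb.gif HTTP/1.0" 200 2326 "-" "-"')
```

This is exactly the kind of regular expression that LTSV lets you stop maintaining: it has to change whenever a field is added to the combined format.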
Who designs the specification?

As with other open specifications, no single person makes decisions about the specification. Anyone who is interested in LTSV can take whatever action they like. You may think of me as a leader or a committee member because I wrote this FAQ, but I have no special privileges in the LTSV community or over the specification.
Although @stanaka owns the ltsv.org domain and is a central figure in the community, the ltsv.org repository is public and anyone can join the community.
The internet is interesting!
How can I follow the activity?

Search for 'ltsv' on Twitter. I recommend a free-text search rather than a hashtag search. The word LTSV is distinctive enough to be searchable :)
Hey, only some guys in Japan are excited about LTSV.

I would appreciate it if you helped promote LTSV globally :)
For example, you can translate documents and send them to @stanaka, or just send a pull request. You can also submit an entry to Hacker News.
How can I contribute?

As I mentioned above, there is no decision maker, so you can do anything! Please write about LTSV on your blog, implement a parser and tweet about it with the hashtag #ltsv, or write documentation in English.
Enjoy!