Summary of "The Unified Logging Infrastructure for Data Analytics at Twitter" paper

The paper presents Twitter's logging infrastructure: how it evolved from application-specific logging to a unified logging infrastructure, and how session sequences are used as a common-case optimization for a large class of queries.

Messaging Infrastructure

Twitter uses Scribe as its messaging infrastructure. A Scribe daemon runs on every production server and sends log data to a cluster of dedicated aggregators in the same data center. Each aggregator registers itself with Zookeeper, and the Scribe daemon consults Zookeeper to discover a live aggregator to which it can send the data. Colocated with the aggregators is a staging Hadoop cluster, which merges the per-category streams from all the server daemons and writes the compressed results to HDFS. These logs are then moved into the main Hadoop data warehouse and deposited into per-category, per-hour directories (e.g., /logs/category/YYYY/MM/DD/HH). Within each directory, the messages are bundled into a small number of large files and are partially ordered by time.
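This registration-and-discovery pattern is easy to sketch. The following is not Twitter's implementation (Scribe is a C++ daemon); it is a minimal Python illustration using the kazoo Zookeeper client, with a hypothetical /aggregators znode path and hostnames:

```python
# Hedged sketch of Zookeeper-based aggregator discovery (not Twitter's code).
# Assumes the kazoo library and a hypothetical /aggregators znode path.
import random
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")
zk.start()

# An aggregator registers itself with an ephemeral, sequential node:
# if the aggregator dies, its registration disappears automatically.
zk.create("/aggregators/agg-", b"agg1.dc1.example.com:1463",
          ephemeral=True, sequence=True, makepath=True)

# A Scribe-like daemon picks any live aggregator to send log data to.
live = zk.get_children("/aggregators")
host, _stat = zk.get("/aggregators/" + random.choice(live))
print("sending logs to", host.decode())
```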

Twitter uses Thrift as its data serialization framework, since it supports nested structures and was already in use elsewhere within Twitter. A system called Elephant Bird generates Hadoop record readers and writers for arbitrary Thrift messages. Production jobs are written in Pig (Latin) and scheduled using Oink.

Application Specific Logging

Initially, all applications defined their own custom formats for logging messages. While this made application logging easy to develop, it had several downsides:

  • Inconsistent naming conventions: e.g., uid vs userId vs user_Id.
  • Inconsistent semantics associated with each category name, causing resource discovery problems.
  • Inconsistent formats of log messages.

All these issues made it difficult to reconstruct user session activity.

Client Events

Client Events is an effort within Twitter to develop a unified logging framework that addresses the issues discussed above. A hierarchical, 6-level schema is imposed on all events, as described in the table below.

| Component | Description                        | Example                                         |
| --------- | ---------------------------------- | ----------------------------------------------- |
| client    | client application                 | web, iPhone, android                            |
| page      | page or functional grouping        | home, profile, who_to_follow                    |
| section   | tab or stream on a page            | home, mentions, retweets, searches, suggestions |
| component | component object or objects        | search_box, tweet                               |
| element   | UI element within the component    | button, avatar                                  |
| action    | actual user or application action  | impression, click, hover                        |
Table 1: Hierarchical decomposition of client event names.

For example, the event web:home:mentions:stream:avatar:profile_click is logged whenever a user on twitter.com clicks the profile avatar of a tweet in the mentions timeline (the description reads the name from right to left). A sketch of splitting such a name into its components appears below.
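Since event names are colon-delimited strings of fixed depth, decomposing one back into its six components is straightforward. A minimal sketch (the ClientEventName type and function name are hypothetical, not from the paper):

```python
# Minimal sketch (not Twitter's code): splitting a client event name into
# its six hierarchical components. Field names follow Table 1.
from collections import namedtuple

ClientEventName = namedtuple(
    "ClientEventName",
    ["client", "page", "section", "component", "element", "action"],
)

def parse_event_name(name: str) -> ClientEventName:
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 components, got {len(parts)}: {name!r}")
    return ClientEventName(*parts)

print(parse_event_name("web:home:mentions:stream:avatar:profile_click"))
# ClientEventName(client='web', page='home', section='mentions',
#                 component='stream', element='avatar', action='profile_click')
```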

An alternative design was a tree-based model for logging client events. That model would have allowed an arbitrarily deep event namespace with logging as fine-grained as required, but the flat, fixed-depth model was chosen because it makes top-level aggregate queries easier.

A client event is a Thrift structure that contains the components given in the table below.

| Field           | Description                    |
| --------------- | ------------------------------ |
| event_initiator | {client, server} × {user, app} |
| event_name      | event name                     |
| user_id         | user id                        |
| session_id      | session id                     |
| ip              | user's IP address              |
| timestamp       | timestamp                      |
| event_details   | event details                  |

Table 2: Definition of a client event.
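The actual structure is defined in Thrift; for experimentation, a plain Python dataclass mirroring the fields of Table 2 might look like the following (field types and the timestamp unit are assumptions, not from the paper):

```python
# Illustrative only: the real client event is a Thrift struct; this Python
# dataclass just mirrors the fields from Table 2.
from dataclasses import dataclass

@dataclass
class ClientEvent:
    event_initiator: str   # one of {client, server} x {user, app}
    event_name: str        # e.g. "web:home:mentions:stream:avatar:profile_click"
    user_id: int
    session_id: str
    ip: str
    timestamp: int         # epoch milliseconds (assumed unit)
    event_details: dict    # free-form payload
```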

The logging infrastructure is unified in two senses:

  • All log messages share a common format with clear semantics.
  • All log messages are stored in a single place.

Session Sequences

A session sequence is a sequence of symbols S = ⟨s0, s1, s2, ..., sn⟩ where each symbol is drawn from a finite alphabet Σ. A bijective mapping is defined between Σ and the universe of event names. Each symbol in Σ is represented by a valid Unicode code point (frequent events are assigned smaller code points), so each session sequence becomes a valid Unicode string. Once all logs have been imported into the main data warehouse, a histogram of event counts is created and used to map event names to Unicode code points. The counts and samples of each event type are stored in a known location in HDFS. Session sequences are then reconstructed from the raw client event logs via a group-by on (user_id, session_id); a sketch of both steps follows the list below. Session sequences are materialized because working with raw client event logs is difficult for the following reasons:

  • Many brute-force scans are required.
  • Large group-by operations are needed to reconstruct user sessions.
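Both steps, the frequency-ordered alphabet construction and the group-by reconstruction, can be sketched in a few lines. This is a toy Python illustration under assumed data shapes (a list of event dicts), not Twitter's Pig/Hadoop implementation:

```python
# Minimal sketch of the two steps above (assumed data shapes, not Twitter's
# implementation): events are dicts with event_name, user_id, session_id,
# and timestamp keys.
from collections import Counter
from itertools import groupby

def build_alphabet(events):
    """Assign Unicode code points to event names, most frequent first,
    so common events get smaller code points. The 0xA0 offset (skipping
    ASCII and C1 controls) is an assumption; a real mapping would also
    skip invalid ranges such as surrogates."""
    counts = Counter(e["event_name"] for e in events)
    return {name: chr(0xA0 + i)
            for i, (name, _) in enumerate(counts.most_common())}

def session_sequences(events, alphabet):
    """Reconstruct sessions via a group-by on (user_id, session_id) and
    encode each time-ordered session as a Unicode string."""
    key = lambda e: (e["user_id"], e["session_id"])
    for sess, group in groupby(sorted(events, key=key), key=key):
        ordered = sorted(group, key=lambda e: e["timestamp"])
        yield sess, "".join(alphabet[e["event_name"]] for e in ordered)
```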

Alternate Designs Considered

  • Reorganize the complete Thrift messages by reconstructing user sessions - this solves the second problem but not the first.
  • Use a columnar storage format - this addresses the first issue, but it only reduces the time taken by each mapper, not the number of mappers.

The materialized session sequences are much smaller than the raw client event logs (around 50 times smaller) and address both issues.

Client Event Catalog

To enhance the accessibility of the client event logs, an automatically generated event catalog is used, along with a browsing interface that allows users to browse, search, and access sample entries for the various client events. (These sample entries are the same entries mentioned in the previous section.) The catalog is rebuilt every day, so it is always up to date.

Applications

Client event logs and session sequences are used in the following applications:

  • Summary Statistics - Session sequences are used to compute various statistics about sessions.
  • Event Counting - Used to understand what fraction of users take advantage of a particular feature.
  • Funnel Analytics - Used to analyze user attention in a multi-step process, such as the signup flow.
  • User Modeling - Used to identify "interesting" user behavior. N-gram models (from the NLP domain) can be extended to measure how important temporal signals are by modeling user behavior on the basis of the last n actions (see the sketch below). The paper also mentions the possibility of extracting "activity collocations", by analogy with collocations in natural language.
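As a toy illustration of the n-gram idea, frequent length-n action patterns can be counted directly over the encoded session strings. This sketch is not from the paper; the function name is hypothetical:

```python
# Toy illustration (not from the paper): count length-n patterns over the
# encoded session strings to surface frequent short behavior sequences,
# in the spirit of n-gram models and "activity collocations".
from collections import Counter

def ngram_counts(sequences, n=3):
    """Count all length-n substrings across session strings."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

# Example over a toy corpus of already-encoded sessions:
print(ngram_counts(["abcabd", "abcabc"], n=3).most_common(3))
```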

Possible Extensions

Session sequences are limited in the sense that they capture only event names and exclude other details. The solution adopted by Twitter is a generic indexing infrastructure that integrates with Hadoop at the level of InputFormats. The indexes reside with the data, making it easier to reindex the data. An alternative would have been to use Trojan layouts, which embed indexes in the HDFS block header, but this means that indexing would require rewriting the data. Another possible extension would be to leverage more analogies from the field of Natural Language Processing, including automatic grammar induction techniques for learning hierarchical decompositions of user activity. Another area of exploration is advanced visualization techniques for exploring sessions, mapping interesting behavioral patterns into distinct visual patterns that can be easily recognized.


I write these summaries as part of my initiative called a-paper-a-week, wherein I committed myself to reading and summarising one research paper every week. You can find all the summaries here.
