
Migrate from Facebook Scribe to Apache Flume (Part II)

In the last article we covered how to set up Flume and write files to HDFS. In this article, we change Flume to write files per category, in the Scribe style.

Multiplexing Way?

Our first thought was to use a multiplexing source selector to distribute logs to different destinations. Flume routes events based on fields in the event header, so we googled to find out which header field carries the Scribe category.

https://apache.googlesource.com/flume/+/d66bf94b1dd059bc7e4b1ff332be59a280498077/flume-ng-sources/flume-scribe-source/src/main/java/org/apache/flume/source/scribe/ScribeSource.java

The category field in the header holds the Scribe category, so we try a multiplexing channel selector:

a1.sources = r1
a1.channels = c1 c2 c3 c4
# Scribe source class, as in the ScribeSource.java linked above
a1.sources.r1.type = org.apache.flume.source.scribe.ScribeSource
# the source must be bound to every channel the selector can route to
a1.sources.r1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = category
a1.sources.r1.selector.mapping.t1 = c1
a1.sources.r1.selector.mapping.t2 = c2 c3
a1.sources.r1.selector.default = c4
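
For completeness, each routed channel still needs its own channel definition and sink. A minimal sketch for one category, assuming a memory channel and an HDFS sink with an illustrative fixed path (k1 and the path are not from the original setup):

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://[namenode-ip]:9000/flume/t1
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1

The same pattern has to be repeated for c2, c3, and the default channel c4, which is exactly the per-category configuration burden noted below.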

It works great! The drawback is that every category must be configured in advance; events with a new, unmapped category name go to the default channel. Is there a way to work around this?

Macro Path

We noticed the macro fields (escape sequences) in the HDFS sink configuration. At first we tried to use them in a file_roll sink's directory path, but that doesn't work. They do work when used in the HDFS sink's file path:

# %{category} is replaced by the value of the event's category header;
# %y-%m-%d comes from the local timestamp (hdfs.useLocalTimeStamp = true)
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://[namenode-ip]:9000/flume/%{category}/%y-%m-%d
a1.sinks.k2.hdfs.filePrefix = events-
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 10
a1.sinks.k2.hdfs.roundUnit = minute
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.useLocalTimeStamp = true
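
With this configuration, an event whose category header is t1 lands under hdfs://[namenode-ip]:9000/flume/t1/<yy-mm-dd>/events-..., so a brand-new category simply gets its own directory without any extra mapping in the agent configuration.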

Now we have Flume working as a Scribe server. In the future, we'll explore using HBase as the sink and try to do real-time analysis on top of it.
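
As a preview of that direction, a Flume HBase sink is configured along these lines. This is only a sketch: the table name, column family, and serializer below are assumptions for illustration, not a tested setup.

# hypothetical table and column family; the serializer is one of the stock Flume serializers
a1.sinks.k3.type = hbase
a1.sinks.k3.table = flume_events
a1.sinks.k3.columnFamily = raw
a1.sinks.k3.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
# bind the sink to an existing channel
a1.sinks.k3.channel = c1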
