
Migrate from Facebook Scribe to Apache Flume (Part II)

In the last article we covered how to set up Flume and write files to HDFS. In this article, we change Flume to write files per category, in the Scribe style.

Multiplexing Way?

Our first thought was to use a multiplexing source selector to distribute logs to different destinations. Flume routes events based on fields in the event header, so we googled to find out which header field carries the Scribe category.

https://apache.googlesource.com/flume/+/d66bf94b1dd059bc7e4b1ff332be59a280498077/flume-ng-sources/flume-scribe-source/src/main/java/org/apache/flume/source/scribe/ScribeSource.java

The category field in the header holds the Scribe category, so we try a multiplexing channel selector:

a1.sources = r1
a1.channels = c1 c2 c3 c4
# Scribe source class, as in the ScribeSource.java linked above
a1.sources.r1.type = org.apache.flume.source.scribe.ScribeSource
# the source must be bound to every channel the selector can route to
a1.sources.r1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = category
a1.sources.r1.selector.mapping.t1 = c1
a1.sources.r1.selector.mapping.t2 = c2 c3
a1.sources.r1.selector.default = c4
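
For completeness, each routed channel still needs its own channel definition and sink. A minimal sketch for one category, assuming a memory channel and an HDFS sink with an illustrative fixed path (k1 and the path are not from the original setup):

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://[namenode-ip]:9000/flume/t1
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1

The same pattern has to be repeated for c2, c3, and the default channel c4, which is exactly the per-category configuration burden noted below.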

It works great! The drawback is that every category must be configured in advance; events with a new, unmapped category name go to the default channel. Is there a way to work around this?

Macro Path

We noticed the macro fields (escape sequences) in the HDFS sink configuration. At first we tried to use them in a file_roll sink's directory path, but that doesn't work. They do work when used in the HDFS sink's file path:

# %{category} is replaced by the value of the event's category header;
# %y-%m-%d comes from the local timestamp (hdfs.useLocalTimeStamp = true)
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://[namenode-ip]:9000/flume/%{category}/%y-%m-%d
a1.sinks.k2.hdfs.filePrefix = events-
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 10
a1.sinks.k2.hdfs.roundUnit = minute
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.useLocalTimeStamp = true
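
With this configuration, an event whose category header is t1 lands under hdfs://[namenode-ip]:9000/flume/t1/<yy-mm-dd>/events-..., so a brand-new category simply gets its own directory without any extra mapping in the agent configuration.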

Now we have Flume working as a Scribe server. In the future, we'll explore using HBase as the sink and try to do real-time analysis on top of it.
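
As a preview of that direction, a Flume HBase sink is configured along these lines. This is only a sketch: the table name, column family, and serializer below are assumptions for illustration, not a tested setup.

# hypothetical table and column family; the serializer is one of the stock Flume serializers
a1.sinks.k3.type = hbase
a1.sinks.k3.table = flume_events
a1.sinks.k3.columnFamily = raw
a1.sinks.k3.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
# bind the sink to an existing channel
a1.sinks.k3.channel = c1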
