
Migrate from Facebook scribe to Apache Flume (Part I)

Reason

We have used scribe as our logging server for a long time. At first, everything worked fine: it was easy to configure and easy to manage. But as data grew every day, a single scribe server could no longer handle the load, so we had to move some categories to a second log server and attach a bigger disk. With data still growing, we wanted a big data store instead of local disk, so we decided to use scribe with the HDFS plugin. That was as tough as compiling scribe from source the first time, but we finally got scribed built with HDFS support. After a short period of use, however, we hit a bug that Facebook never fixed (the project was deprecated several years ago): scribe cannot write to HDFS again once the process has been killed with signal 9 (SIGKILL). So we started testing Flume and looking for a way to migrate.

Configure Flume

Flume is easy to deploy because it is written in Java: install Java, download the binary package, and you are done. We start with the official example, which listens on port 44444 and writes events to the log.

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the agent with:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

The -Dflume.root.logger=INFO,console option did not work as we expected. We downloaded log4j.properties from the source tree, and after that the INFO-level output shows up in the ./log/flume.log file. log4j.properties: https://github.com/apache/flume/blob/trunk/conf/log4j.properties
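
Once the agent is running, you can verify the netcat source end to end by sending a line to port 44444; the logger sink then prints the event. A minimal check, assuming nc (netcat) is installed:

$ nc localhost 44444
hello flume
OK

The source answers OK for each accepted line, and the event body shows up in the agent's log output.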

We deploy Flume on an EC2 instance outside the Hadoop cluster, so the Hadoop libraries have to be copied to the local machine. At first we copied them from the EMR master node, but it turned out that Flume detects them, tries to look up the EMR metrics automatically, and there is no way to get around it:

INFO metrics.MetricsConfig: could NOT read /mnt/var/lib/info/job-flow.json, assume not within EMR cluster

So we simply downloaded an official Hadoop binary package and copied its jars to the local machine. There is a nice article about configuring Flume as an HDFS or S3 writer: https://gist.github.com/crowdmatt/5256881
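
A minimal sketch of that step, assuming a stock Hadoop 2.x tarball and /home/flume/lib as the target directory (the same directory is passed to -C in the appendix below):

# hadoop-2.7.3 is just an example version; use the one matching your cluster
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
$ tar xzf hadoop-2.7.3.tar.gz
$ cp hadoop-2.7.3/share/hadoop/common/*.jar \
     hadoop-2.7.3/share/hadoop/common/lib/*.jar \
     hadoop-2.7.3/share/hadoop/hdfs/*.jar /home/flume/lib/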

By the way, if you hit a permission error on write, make sure the HDFS path is owned by the user that runs the agent (a quick ownership sketch follows below). With that in place, logs are written to HDFS.

Last Step

We were happy to see that Flume ships a Scribe source, so we configure the Flume source as scribe and start to enjoy Flume.
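
For the ownership fix mentioned above, a minimal sketch, assuming you have HDFS superuser access; the path matches the hdfs sink in flume.conf below:

# "flume" stands for whatever user runs the agent; adjust to your own setup
$ hdfs dfs -mkdir -p /flume/events
$ hdfs dfs -chown -R flume:flume /flume/events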

Appendix: start_flume.sh

# -C appends the Hadoop jars copied to /home/flume/lib to the Flume classpath
flume-ng agent --conf conf --conf-file flume.conf --name a1 -C "/home/flume/lib/*"

flume.conf

# Name the components on this agent
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# Scribe source listening on the default scribe port
a1.sources.r1.type = org.apache.flume.source.scribe.ScribeSource
a1.sources.r1.port = 1463
a1.sources.r1.workerThreads = 5

# Two memory channels, one per sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Local file sink (handy for debugging)
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /tmp/flume
a1.sinks.k1.sink.rollInterval = 0

# HDFS sink, bucketed by timestamp and rounded down to 10-minute directories
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://10.160.25.107:9000/flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k2.hdfs.filePrefix = events-
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 10
a1.sinks.k2.hdfs.roundUnit = minute
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.useLocalTimeStamp = true

# Fan the source out to both channels and bind each sink to its channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
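
Once your scribe clients point at port 1463, a quick way to sanity-check both sinks (paths as in the config above):

$ ls /tmp/flume/                # file_roll output
$ hdfs dfs -ls /flume/events/   # time-bucketed HDFS output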

In the future

We want the category concept and time-based rotation to work as they did in scribe; the next article will be about that.
