Data acquisition is a key part of building a big data ecosystem: it lets you ingest many types of data, such as files, databases, streams, and web pages. If you are just setting up a local environment rather than working against a real business scenario, machine logs, which exist on virtually every system, are an easy way to get started.
Apache Flume has been around for some time and can fulfill most log collection requirements, and it is also fairly straightforward to configure. In this example, I will walk you through a quick way of setting up Flume alongside Hadoop in a local Linux VM.
Prerequisites
You will need a Linux (CentOS or Ubuntu) virtual machine with the SSH service configured. I don’t recommend using Cygwin because you may run into unexpected issues during the installation.
Steps
- Install Hadoop 2.7.2 (the most recent release at the time of writing) on the Linux system, running both the master and worker daemons on the same VM. I won’t discuss the Hadoop configuration details, since plenty of information is already available.
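The one Hadoop setting this walkthrough does rely on is the default filesystem URI, which has to match the HDFS path used later in the Flume sink. A minimal core-site.xml sketch, assuming the hostname master and port 8010 that appear in that sink path:

<!-- $HADOOP_HOME/etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- Must match the hdfs:// URI in the Flume HDFS sink configuration below -->
    <value>hdfs://master:8010</value>
  </property>
</configuration>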
- Download the Apache Flume binary archive from the Apache Flume download page and untar it to a folder of your choice.
- Review the quick-start user guide to understand the basic concepts of Sources, Channels, and Sinks.
- Flume supports many source types, notably the Avro source (Avro is another Apache project, for data serialization), as well as exec, JMS, syslog, and others; channels can be backed by memory, file, or JDBC. To keep this example simple, we use the sequence generator source, which emits events carrying a counter that starts at 0 and increases by 1 (0, 1, 2, 3, …).
- Since we will use the HDFS sink, we also need to copy the required Hadoop *.jar files into $FLUME_HOME/lib from the following folders, as shown in the copy commands below:
$HADOOP_HOME/share/hadoop/common/*.jar
$HADOOP_HOME/share/hadoop/common/lib/*.jar
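A minimal sketch of that copy step, assuming $HADOOP_HOME and $FLUME_HOME point at your Hadoop and Flume installation directories:

# Copy the Hadoop client jars so the Flume HDFS sink can find them
cp $HADOOP_HOME/share/hadoop/common/*.jar $FLUME_HOME/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/*.jar $FLUME_HOME/lib/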
- The most important part is configuring the agent, i.e. wiring together the properties of the Flume Source, Channel, and Sink (here in conf/flume.conf):
# The agent in this example is called 'agent1'
agent1.sources = seqGenSrc
agent1.channels = memoryChannel
agent1.sinks = hdfs1
# For each one of the sources, the type is defined
agent1.sources.seqGenSrc.type = seq
# Bind the source to the channel it writes to
agent1.sources.seqGenSrc.channels = memoryChannel
# Each sink's type must be defined; here we write to HDFS
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://master:8010/flume/
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.hdfs.filePrefix = syslogfiles-
agent1.sinks.hdfs1.hdfs.round = true
agent1.sinks.hdfs1.hdfs.roundValue = 10
agent1.sinks.hdfs1.hdfs.roundUnit = second
# The memory channel itself must also be defined
agent1.channels.memoryChannel.type = memory
# Specify the channel the sink should use
agent1.sinks.hdfs1.channel = memoryChannel
- Go to the Flume installation folder and start the agent:
bin/flume-ng agent -n agent1 -c conf -f conf/flume.conf -Dflume.root.logger=DEBUG,console
- To check whether the sequence data has been loaded into HDFS, open the NameNode web UI at http://master:50070 and browse the /flume/ directory.
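Alternatively, you can verify from the command line with the standard HDFS shell; the /flume/ path below is the one configured in the sink above:

# List the files written by the HDFS sink
hdfs dfs -ls /flume/
# Print the contents of the generated files (sequence numbers)
hdfs dfs -cat /flume/syslogfiles-*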
The steps above demonstrate a single agent. The typical use case for Flume, however, is collecting system logs from many web servers. In our big data lab we have several Spark and Hadoop nodes, so as a next step we will configure multiple agents and flows to collect logs from multiple nodes in the cluster; a rough sketch of such a two-tier flow follows.
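This is only a sketch of the idea, not a tested configuration: the host name collector, the port 4545, the log path /var/log/syslog, and the HDFS path /flume/weblogs/ are placeholders. Each web server node runs a lightweight agent that tails a local log file and forwards events over Avro to a collector agent, which writes to HDFS as in the single-agent example.

# --- On each web server node (agent 'webagent') ---
webagent.sources = tailSrc
webagent.channels = memCh
webagent.sinks = avroFwd

# Tail a local log file (placeholder path)
webagent.sources.tailSrc.type = exec
webagent.sources.tailSrc.command = tail -F /var/log/syslog
webagent.sources.tailSrc.channels = memCh

webagent.channels.memCh.type = memory

# Forward events to the collector node over Avro
webagent.sinks.avroFwd.type = avro
webagent.sinks.avroFwd.hostname = collector
webagent.sinks.avroFwd.port = 4545
webagent.sinks.avroFwd.channel = memCh

# --- On the collector node (agent 'collector1') ---
collector1.sources = avroIn
collector1.channels = memCh
collector1.sinks = hdfs1

# Receive events sent by the web server agents
collector1.sources.avroIn.type = avro
collector1.sources.avroIn.bind = 0.0.0.0
collector1.sources.avroIn.port = 4545
collector1.sources.avroIn.channels = memCh

collector1.channels.memCh.type = memory

# Write to HDFS, same as in the single-agent example
collector1.sinks.hdfs1.type = hdfs
collector1.sinks.hdfs1.hdfs.path = hdfs://master:8010/flume/weblogs/
collector1.sinks.hdfs1.hdfs.fileType = DataStream
collector1.sinks.hdfs1.channel = memCh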
This simple example does not yet load real data properly, and it does not handle external logs; that is what we will tackle next.