
How to Load Log Data into HDFS using Flume

Data acquisition is an essential part of building a big data ecosystem: it lets you pull in many kinds of data, such as files, databases, streams, and web pages. If you are just setting up a local environment rather than working against a real business scenario, machine logs are an easy source of data to acquire, since they exist on every system.

Apache Flume has been around for some time and can fulfill most log-extraction requirements. It is also pretty straightforward to configure. In this example, I will walk you through a quick way of setting up Flume along with Hadoop on your local Linux VM.

Prerequisites

You will need a Linux (CentOS or Ubuntu) virtual machine with the SSH service configured. I don’t recommend using Cygwin, because you may encounter unexpected issues during installation.
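
If you want to confirm SSH is ready before moving on, a quick check like the following works on most distributions (the service may be named ssh rather than sshd on Ubuntu, so treat this as a sketch). Passwordless SSH to localhost is also worth setting up, since pseudo-distributed Hadoop expects it:

# Verify the SSH daemon is running (systemd-based distros)
sudo systemctl status sshd

# Set up passwordless SSH to localhost for Hadoop
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost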

Steps

  1. Install Hadoop 2.7.2 (the most recent release at the time of writing) on the Linux system, configuring one primary and one replica node on the same VM. I won’t discuss the Hadoop configuration details here, since plenty of documentation is already available.
  2. Download the Apache Flume binary archive package from the Apache Flume download page and untar it to a folder of your choice.
  3. Review the Flume user guide to understand the basic concepts of Source, Channel, and Sink.
  4. Flume supports many source types, notably Avro (another Apache project, used for data serialization), as well as exec, JMS, syslog, and others, plus several channel types such as memory, file, and JDBC. To keep this example simple, we use the sequence generator source, which emits events counting up from 0 in increments of 1 (0, 1, 2, 3, …).
  5. Since we will use the HDFS sink, we also need to copy the required Hadoop *.jar files into $FLUME_HOME/lib from the following folders (a copy command sketch follows the paths below):

$HADOOP_HOME/share/hadoop/common/*.jar

$HADOOP_HOME/share/hadoop/common/lib/*.jar
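
A minimal copy step might look like this, assuming $HADOOP_HOME and $FLUME_HOME are already exported in your shell (depending on your Hadoop version, the sink may also need jars from $HADOOP_HOME/share/hadoop/hdfs):

# Copy the Hadoop common jars and their dependencies into Flume's lib folder
# so the HDFS sink can find the Hadoop client classes at runtime
cp $HADOOP_HOME/share/hadoop/common/*.jar $FLUME_HOME/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/*.jar $FLUME_HOME/lib/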

  6. The most important part is to configure the agent’s Source, Channel, and Sink properties in the Flume configuration file (conf/flume.conf):

# In this case the agent is called ‘agent1’
agent1.sources = seqGenSrc
agent1.channels = memoryChannel
agent1.sinks = hdfs1

# For each one of the sources, the type is defined
agent1.sources.seqGenSrc.type = seq

# The channel the source writes to
agent1.sources.seqGenSrc.channels = memoryChannel

# The channel itself must also be given a type
agent1.channels.memoryChannel.type = memory

# Each sink’s type must be defined
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://master:8010/flume/
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.hdfs.filePrefix = syslogfiles-
agent1.sinks.hdfs1.hdfs.round = true
agent1.sinks.hdfs1.hdfs.roundValue = 10
agent1.sinks.hdfs1.hdfs.roundUnit = second

# Specify the channel the sink should use
agent1.sinks.hdfs1.channel = memoryChannel
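
A note on the round settings: hdfs.round, hdfs.roundValue, and hdfs.roundUnit only take effect when hdfs.path contains time escape sequences, which the path above does not. If you later want time-partitioned output, a hedged variation (the directory layout is just an illustration) would be:

# Write into per-minute directories, rounding timestamps down to 10-second buckets;
# useLocalTimeStamp lets the sink use the local clock instead of an event header timestamp
agent1.sinks.hdfs1.hdfs.path = hdfs://master:8010/flume/%y-%m-%d/%H%M
agent1.sinks.hdfs1.hdfs.useLocalTimeStamp = true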

  7. Go to the Flume home folder and run the agent:

bin/flume-ng agent -n agent1 -c conf -f conf/flume.conf -Dflume.root.logger=DEBUG,console

  8. To check whether the sequence data has been loaded into HDFS, open the NameNode web UI at http://master:50070 and browse the /flume directory.
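
You can also verify from the command line. A quick sketch, assuming the Hadoop client is on your PATH and the sink wrote into /flume as configured above:

# List the files the HDFS sink has rolled so far
hdfs dfs -ls /flume

# Peek at one of the rolled files (substitute a file name from the listing)
hdfs dfs -cat /flume/syslogfiles-*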


The steps above demonstrate a single agent. A typical Flume use case is collecting system logs from many web servers. In our big data lab we have several Spark and Hadoop nodes, so in the next step we will configure multiple agents and flows to collect logs from multiple nodes in the cluster.
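
As a preview, a common fan-in pattern is to run an agent on each web server that tails a log file and forwards events over Avro to a collector agent, which then writes to HDFS with a sink like the one above. A minimal sketch of the web-server side, assuming a hypothetical collector host named collector1 listening on port 4545:

# Agent on each web server: tail a log file and forward events over Avro
webagent.sources = tailSrc
webagent.channels = memCh
webagent.sinks = avroSink

webagent.sources.tailSrc.type = exec
webagent.sources.tailSrc.command = tail -F /var/log/syslog
webagent.sources.tailSrc.channels = memCh

webagent.channels.memCh.type = memory

webagent.sinks.avroSink.type = avro
webagent.sinks.avroSink.hostname = collector1
webagent.sinks.avroSink.port = 4545
webagent.sinks.avroSink.channel = memCh

On the collector node, an Avro source (type = avro, bound to port 4545) would feed the same HDFS sink configuration shown earlier.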

