Data acquisition is an essential part of building a big data ecosystem: it lets you ingest many types of data, such as files, databases, streams, and web pages. If you are just setting up a local environment rather than working against real business systems, machine logs, which exist on virtually every server, are a convenient data source to start with.
Apache Flume has been around for some time and can fulfill most log-extraction requirements. It is also pretty straightforward to configure. In this example, I will walk you through a quick way of setting up Flume along with Hadoop in your local Linux VM.
Prerequisites
You will need a Linux (CentOS or Ubuntu) virtual machine with the SSH service configured. I don't recommend using Cygwin because you may encounter unexpected issues during the installation.
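To confirm the prerequisite, you can check the SSH service and passwordless login to localhost (the commands below assume a systemd-based distribution; on Ubuntu the daemon may be named ssh instead of sshd):
# check that the SSH daemon is running
systemctl status sshd
# verify passwordless SSH to localhost, which Hadoop's start scripts rely on
ssh localhost "echo ok"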
Steps
- Install Hadoop 2.7.2 (the most recent release at the time of writing) on the Linux system, configuring one primary and one replica node on the same VM (a pseudo-distributed setup). I won't go into the Hadoop configuration details since plenty of documentation is already available.
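If you want a quick sanity check that HDFS is up before adding Flume, something like the following should work (assuming HADOOP_HOME points at your Hadoop 2.7.2 installation; format the NameNode only on first use):
# format the NameNode once, then start HDFS
$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
# NameNode, DataNode, and SecondaryNameNode processes should all appear
jps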
- Download the Apache Flume binary archive from the Apache Flume download page and untar it into a folder of your choice (referred to as $FLUME_HOME below).
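For example, assuming the 1.6.0 binary release (adjust the file name to whatever version you downloaded):
# untar the archive and point FLUME_HOME at it
tar -xzf apache-flume-1.6.0-bin.tar.gz
export FLUME_HOME=$(pwd)/apache-flume-1.6.0-bin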
- Review the quick-start user guide to understand the basic concepts of Sources, Channels, and Sinks.
- Flume supports many source types, notably Avro (another Apache project for data serialization) as well as exec, JMS, syslog, and others; channels can be backed by memory, file, or JDBC. To keep this example simple, we use the sequence generator source, which emits events counting up from 0 in increments of 1 (0, 1, 2, 3, ...).
- Since we will use the HDFS sink, we also need to copy the required Hadoop *.jar files into $FLUME_HOME/lib from the following folders:
$HADOOP_HOME/share/hadoop/common/*.jar
$HADOOP_HOME/share/hadoop/common/lib/*.jar
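For example, assuming HADOOP_HOME and FLUME_HOME are both set:
# put Hadoop's common jars and their dependencies on Flume's classpath
cp $HADOOP_HOME/share/hadoop/common/*.jar $FLUME_HOME/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/*.jar $FLUME_HOME/lib/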
- The most important part is configuring the agent's Source, Channel, and Sink properties in conf/flume.conf:
# in this case the agent is called 'agent1'
agent1.sources = seqGenSrc
agent1.channels = memoryChannel
agent1.sinks = hdfs1
# For each one of the sources, the type is defined
agent1.sources.seqGenSrc.type = seq
# Each channel's type must be defined; here we use an in-memory channel.
agent1.channels.memoryChannel.type = memory
agent1.channels.memoryChannel.capacity = 100
# Bind the source to the channel.
agent1.sources.seqGenSrc.channels = memoryChannel
# Each sink’s type must be defined
#agent.sinks.loggerSink.type = logger
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://master:8010/flume/
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.hdfs.filePrefix = syslogfiles-
agent1.sinks.hdfs1.hdfs.round = true
agent1.sinks.hdfs1.hdfs.roundValue = 10
agent1.sinks.hdfs1.hdfs.roundUnit = second
#Specify the channel the sink should use
agent1.sinks.hdfs1.channel = memoryChannel
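Before starting the agent, confirm that HDFS is reachable at the address used in hdfs.path (hdfs://master:8010 above must match the fs.defaultFS of your Hadoop installation). The sink can create the directory on demand, but pre-creating it is a harmless check:
# verify HDFS connectivity and create the target directory
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /flume
$HADOOP_HOME/bin/hdfs dfs -ls /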
- Go to the Flume folder and run the flume-ng script:
bin/flume-ng agent -n agent1 -c conf -f conf/flume.conf -Dflume.root.logger=DEBUG,console
- To check that the sequence data has been written to HDFS, open the NameNode web UI at http://master:50070 and browse the /flume directory.
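Alternatively, you can check from the command line (files still being written carry a .tmp suffix by default):
# list the rolled files and print the generated sequence numbers
$HADOOP_HOME/bin/hdfs dfs -ls /flume
$HADOOP_HOME/bin/hdfs dfs -cat /flume/syslogfiles-*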
The steps above demonstrate a single agent. The typical use case for Flume is collecting system logs from many web servers. In our big data lab we have several Spark and Hadoop nodes, so as a next step we will configure multiple agents and flows to collect logs from multiple nodes in the cluster.
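As a preview, a common multi-agent pattern puts an Avro sink on each web-server node and an Avro source on a collector node that writes to HDFS. The snippet below is only a sketch; the agent names, the hostname collector-host, and port 4545 are placeholders chosen for illustration:
# on each web-server node: forward events to the collector over Avro
webagent.sinks.avroSink.type = avro
webagent.sinks.avroSink.hostname = collector-host
webagent.sinks.avroSink.port = 4545
webagent.sinks.avroSink.channel = memoryChannel
# on the collector node: receive Avro events, then deliver them to an HDFS sink as above
collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4545
collector.sources.avroSrc.channels = memoryChannel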