With the growth of big data technologies, there has been much discussion about Hadoop and traditional databases. One question that comes up frequently is “How do you handle data sets in a traditional database?” Hadoop has demonstrated that it can handle a large quantity of data with ease in processing and high throughput. Apache Sqoop provides a solution for accessing data in traditional databases. With Sqoop in hand people can play with data that resides in traditional databases to Hadoop clusters and vice versa.
Sqoop – Introduction
Apache sqoop helps in transferring larger data from databases to HDFS (Hadoop distributed File System), databases like Mysql, PostgreSql, Teradata & Oracle can be used. Sqoop uses export and import commands for transferring datasets from other databases to HDFS. Internally, Sqoop uses a map reduce program for storing datasets to HDFS. Sqoop provides automation for transferring data from various databases and offers parallel processing as well as fault tolerance.
The above diagram explains how sqoop works internally. Commands entered through command line are associated with a map task to retrieve data from external databases. A reduce task will be used for placing the retrieved data into HDFS/Hbase/Hive. Internally, Sqoop uses various connector API for connecting with several databases. We also have the ability to create custom connectors which will help make database interactions fast and reliable.
Here are the commands used to connect with other databases
Import Data:
Command used for importing data (Employee table) from other MYSQL databases:
sqoop import –connect jdbc:mysql://localhost:3396/din –table Employee –username dinesh –password ****
- –import: command that tells Sqoop to initiate an import.
- –connect <connect string>, –username <user name>, –password <password>: connection parameters used to connect with the database.
- –table <table name>: table to be imported.
The Future of Big Data
With some guidance, you can craft a data platform that is right for your organization’s needs and gets the most return from your data capital.
The above diagram explains how sqoop uses the import command internally to retrieve data from other databases. In the first step, sqoop analyses database and frames metadata in order for the data to be retrieved.
In the next step it uses map job to transfer data to HDFS using the metadata formed in the first step. Data will be imported to the HDFS system under a file related to the retrieved table. Sqoop also provides features to specify the file name in the HDFS system.
The file created by HDFS will have comma separated values for each column and each row will be placed in a new line. Based on the size of the table, HDFS may create one or more files. Sqoop provisions various format for importing data, for eg: using option -as -avrodatafile will import in avro format.
Export Command:
Command used for exporting data (Employee table) to MYSQL database:
sqoop export –connect jdbc:mysql://localhost:3396/din –table Employee –username dinesh –password **** –export-dir /user/din/Employee
- –export: command that tells Sqoop to initiate an export.
- –connect <connect string>, –username <user name>, –password <password>: connection parameters used to connect with the database.
- –table <table name>: table to be loaded with data.
- –export-dir <directory path>: HDFS directory from where data will be exported.
First, sqoop gathers meta data for which it should export data and then it performs a map job which transfers data. Sqoop performs this transfer by splitting the input data set into various map tasks and sends it to the database. Map job implements this transfer over several transactions to ensure optimal throughput and minimal resource utilization. Some of the connectors in sqoop facilitate a staging table concept where the data set will be transferred to a staging table first and then to a target table. This prevents data loss in case of a failure so the job can be repeated again if there is a failure.
Sqoop Connectors:
Connectors help databases which don’t have native JDBC support connect to the database. Connectors are plugin components built on the sqoop framework and can be easily added to the sqoop installation. Sqoop provides various built-in connectors for most of the databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2. Fast-path connectors are available for MySQL and PostgreSQL which help transfer data with high throughput. Apart from built-in connectors, each company can develop their own connectors for enterprise warehouse systems to NoSQL datastores.
Winding up:
Above are some of the basic commands to connect to HDFS from other databases. Sqoop provides various mechanisms to communicate with other databases using different connectors and data formats.
Click here for more information.