Apache Sqoop – SQL to Hadoop

Updated: Sep 12, 2021

Today we are going to discuss a very efficient tool – Apache Sqoop. The heading itself tells you the meaning and purpose of tool which is SQL to Hadoop. We will first build some concepts and then have some demos that would help us understand these concepts better. Alright! Let’s Start.


What is Sqoop?


Apache Sqoop is a tool designed to efficiently transfer bulk data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from an RDBMS such as MySQL or Oracle or a mainframe into the HDFS transform the data in Hadoop Map Reduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported.

Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.



Sqoop Architecture

  • When you use Sqoop to transfer data, the dataset being transferred is split into multiple partitions, and a map-only job is launched.

  • Individual mappers are now responsible for the transfer of each slice/partition of the dataset.

  • The metadata of the database is used to handle each data record in a type-safe manner. Type-Safety is extent to which a programming language discourages or prevents type errors due to difference in data types.



How Sqoop uses MapReduce jobs internally?

  • Once Sqoop connects to the database, it uses JDBC to examine the table to be imported by retrieving a list of all the columns and their SQL data types.

  • The SQL data types (integer, varchar etc.) can be mapped to Java data types (Integer, String etc.).

  • Sqoop has a code generator which creates a table-specific java class to hold the extracted records from the table by using information given by the JDBC about the data types, etc.

  • Then Sqoop connects to the cluster to submit a MapReduce job using the java class generated.

  • The dataset being transferred is split into multiple partitions, and a map-only job is launched.

  • The output of this is a set of files containing the imported data. As the import process is performed in parallel, the output is in multiple files.


How Sqoop Import & Export works?


Sqoop Tools and Help Command


Sqoop is a collection of related tools. To use Sqoop, specify the tool you want to use and the arguments that control the tool.

Sqoop tool-name [tool-arguments]

Sqoop ships with a help tool which displays a list of all available tools. Type the command as:

Sqoop help

We can get help for a specific tool as well by entering:

Sqoop help [tool-name]
Sqoop help import


Sqoop Import Tool


The import tool imports an individual table from an RDBMS to HDFS. Each row form a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or Sequence files.

sqoop import
--connect jdbc:mysql://jdbc-uri/database_name
--table mysqltablename
--username mysql_username
--password mysql_password
-m num_mappers_to_run
--target-dir target_directory

Sqoop Export Tool


The export tool exports a set of files from HDFS back to RDBMS. The target must already exist in the database.

The input files are read and parsed into a set of records according to the user-specified delimiters. The default operation is to transform these into a set of INSERT statements that inject the records into the database.

In “update mode”, Sqoop will generate UPDATE statements that replace existing records in the database, and in “call mode” Sqoop will make a stored procedure call for each record.

sqoop export
--connect jdbc:mysql://jdbc-uri/database_name
--username mysql_username
--password mysql_password
--table mysqltable
-m num_mappers_to_run
--export-dir target_directory

Now that we have learned some good concepts around Sqoop lets move to our demos. We will first move data from RDBMS to HDFS and then in another demo we will try to move our data back to RDBMS from HDFS location.


Demo 1: Importing MySQL table to HDFS


Let’s create the following MySQL table ‘databare_emp’ into dataguy database and consider it to be moved to HDFS location.

Create database dataguy;

CREATE TABLE `dataguy`.`databare_emp` (
  `empid` INT NOT NULL,
  `fname` VARCHAR(45) NOT NULL,
  `lname` VARCHAR(45) NULL,
  `dob` DATE NOT NULL,
  `gender` CHAR NOT NULL,
  `hire_date` DATE NOT NULL,
  PRIMARY KEY (`empid`));
  
INSERT INTO `dataguy`.`databare_emp` (`empid`,`fname`,`lname`,`dob`,`gender`,`hire_date`)VALUES (1001,'Nitin','Khandelwal','1986-08-01','M','2017-10-20');
INSERT INTO `dataguy`.`databare_emp` (`empid`,`fname`,`lname`,`dob`,`gender`,`hire_date`)VALUES (1002,'Naveen','Khandelwal','1988-01-12','M','2018-09-19');
INSERT INTO `dataguy`.`databare_emp` (`empid`,`fname`,`lname`,`dob`,`gender`,`hire_date`)VALUES (1003,'Ravi','Aggarwal','1997-12-21','M','2020-06-03');
INSERT INTO `dataguy`.`databare_emp` (`empid`,`fname`,`lname`,`dob`,`gender`,`hire_date`)VALUES (1004,'Rupika','Kaur','1992-05-07','F','2019-05-14');
INSERT INTO `dataguy`.`databare_emp` (`empid`,`fname`,`lname`,`dob`,`gender`,`hire_date`)VALUES (1005,'Anurag','Pandiya','1977-03-21','M','2000-08-11');

Select * from `dataguy`.`databare_emp` ;

We will use following Sqoop command to perform this import.

sqoop import --connect jdbc:mysql://dbserver.databare.com/dataguy --table databare_emp --username dataguyAD --password mypassword -m 1 --target-dir /user/ad/emp_files/

You can see the results as under:

Once the operation is complete we can check the HDFS location for our data.

hdfs dfs –cat /user/ad/emp_files/part*

At the target folder you will find files generated with names as part-m-00000, part-m-00001 etc. depending upon the no of mappers used. Each mapper works on a portion of data and imports it to the corresponding file like part-m-00000.

hdfs dfs –ls /user/ad/emp_files/


Demo 2: Exporting data to MySQL


We will consider the same example used in demo1 and will now move the data from HDFS to some table in MySQL.

While we do this it is mandatory to manually create the table in the database. We will use following script to do that.

CREATE TABLE `dataguy`.`databare_employees` (
  `empid` INT NOT NULL,
  `fname` VARCHAR(45) NOT NULL,
  `lname` VARCHAR(45) NULL,
  `dob` DATE NOT NULL,
  `gender` CHAR NOT NULL,
  `hire_date` DATE NOT NULL,
  PRIMARY KEY (`empid`));

Now let’s use Sqoop to move our data back to MySQL.

sqoop export --connect jdbc:mysql://dbserver.databare.com/dataguy --username dataguyAD --password mypassword --table databare _employees -m 1 --export-dir /user/ad/emp_files/

You can now query your databare _employees in dataguy database for the exported data.

Use dataguy;
Select * from databare_employees;

So with these two demos we learned how to import and export data between RDBMS and HDFS using Sqoop.


Hope you enjoyed it. Happy learning!