Hadoop Sqoop: A Tool for Transferring Data between Hadoop and Relational Databases

Nareshkumar Jayavelu
6 min read · Nov 3, 2021


Introduction:

Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
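
For instance, the degree of parallelism can be controlled with the --num-mappers (or -m) flag together with --split-by. A minimal sketch; the server, database, credentials, and paths below are placeholders:

sqoop import \
--connect jdbc:mysql://db.example.com/sales \
--username sqoop \
--password sqoop123 \
--table orders \
--split-by order_id \
--num-mappers 8 \
--target-dir /data/orders

split-by - column used to divide the table into ranges, one per mapper

num-mappers - number of parallel map tasks (the default is 4)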

Sqoop is an open-source software product of the Apache Software Foundation.

Architecture:

Sqoop has a simple, client-side architecture: the Sqoop client parses the command-line arguments, uses the database's metadata to generate code, and launches a map-only MapReduce job whose mappers connect to the database over JDBC and transfer their slice of the data in parallel.

Key features of Sqoop

  • Bulk import: Sqoop facilitates importing individual tables as well as entire databases into HDFS. The data is saved as directories and files in the HDFS file system.
  • Direct input: Sqoop can also import relational (SQL) tables directly into Hive and HBase (see the Hive import example after this list).
  • Data interaction: Sqoop can generate Java classes so that you can interact with the imported data programmatically.
  • Data export: Sqoop can also export data from HDFS into a relational database, using a target table definition based on the specifics of the target database.
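
As an illustration of the direct-input feature, a table can be imported straight into a Hive table by adding --hive-import; the connection details and table name in this sketch are placeholders:

sqoop import \
--connect jdbc:mysql://db.example.com/sales \
--username sqoop \
--password sqoop123 \
--table customers \
--hive-import \
--hive-table customers

hive-import - load the data into Hive after copying it to HDFS (the Hive table is created if it does not already exist)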

How does Sqoop work?

Let's look at the two most commonly used Sqoop tools:

  1. Sqoop Import:

The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.
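
For example, the storage format can be selected with --as-textfile (the default), --as-sequencefile, or --as-avrodatafile. A sketch with placeholder connection details and paths:

sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username sqoop \
--password sqoop123 \
--table employees \
--as-avrodatafile \
--target-dir /user/hadoop/employees_avro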

Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees

This string will connect to a MySQL database named employees on the host database.example.com. It’s important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on TaskTracker nodes throughout your MapReduce cluster; if you specify the literal name localhost, each node will connect to a different database (or more likely, no database at all). Instead, you should use the full hostname or IP address of the database host that can be seen by all your remote nodes.
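
For example, a connect string that names the database host and port explicitly might look like the following (the host, port, database, and credentials are placeholders, assuming a MySQL server on its default port 3306):

$ sqoop import --connect jdbc:mysql://dbserver01.example.com:3306/employees --table departments --username sqoop --password sqoop123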

Simple Syntax for import:

sqoop import --connect <jdbc-connect-string> --table <table-name> --username <username> --password <password> --target-dir <hdfs-directory>

table - the table to read/import from

username - the username used to connect to the database server

password - the password for that user

target-dir - the HDFS directory where the imported data will be stored.

To overwrite the directory, use --delete-target-dir to delete the existing directory.

sqoop import --connect <jdbc-connect-string> --table <table-name> --username <username> --password <password> --delete-target-dir --target-dir <hdfs-directory>
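
Putting these options together, a complete import command might look like the following (the connection details, table, and directory are placeholders):

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--table employees \
--delete-target-dir \
--target-dir /user/hadoop/employees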

Syntax for an incremental import:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--table employees \
--incremental append \
--check-column id \
--last-value 1

incremental append - append newly added rows to the data already in the target directory

last-value - the value of the check column after which rows are imported and appended

check-column - the column checked against the last value when performing an incremental append (typically the primary key).
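
When incremental imports are run repeatedly, it is convenient to wrap the command in a saved Sqoop job, which stores and updates the last value between runs. A minimal sketch, reusing the placeholder connection details from above:

sqoop job --create employees_incremental \
-- import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--table employees \
--incremental append \
--check-column id \
--last-value 0

sqoop job --exec employees_incremental

Each subsequent sqoop job --exec run imports only the rows added since the previous run, because the saved job tracks the last value automatically.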

For importing selected columns and rows from a table,

sqoop import --connect <jdbc-connect-string> --table <table-name> --username <username> --password <password> --columns <column-list> --where <condition>
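
For example, to import only a few columns of the rows that match a condition (connection details, table, and directory are placeholders):

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--table employees \
--columns "id,name,salary" \
--where "salary > 50000" \
--target-dir /user/hadoop/employees_high_salary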

For importing the result of a free-form SQL query (note that --table is not used together with --query),

sqoop import --connect <jdbc-connect-string> --username <username> --password <password> --query <sql-query> --split-by <column> --target-dir <hdfs-directory>
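
A free-form query must contain the literal token $CONDITIONS, which Sqoop replaces with its split predicates, and it needs a --split-by column (or -m 1) so the work can be divided among mappers. An illustrative example with placeholder connection details and tables:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--query 'SELECT e.id, e.name, d.dept_name FROM employees e JOIN departments d ON e.dept_id = d.id WHERE $CONDITIONS' \
--split-by e.id \
--target-dir /user/hadoop/employee_departments

The query is wrapped in single quotes so that the shell does not try to expand $CONDITIONS.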

  2. Sqoop Export:

The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.

By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.

If you specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is transformed into an UPDATE statement that modifies an existing row. The row a statement modifies is determined by the column name(s) specified with --update-key.
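
For example, to update existing rows in the target table by primary key instead of inserting new ones (connection details and credentials are placeholders):

sqoop export \
--connect jdbc:mysql://db.example.com/analytics \
--username sqoop \
--password sqoop123 \
--table score \
--export-dir /results/score_data \
--update-key id

If rows that do not yet exist in the table should also be inserted, some connectors additionally support --update-mode allowinsert, which turns the export into an upsert.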

sqoop export --connect <jdbc-connect-string> --username <username> --password <password> --table <table-name> --export-dir <hdfs-directory>

export-dir - the HDFS directory from which the data is read and written into the RDBMS table

sqoop export --connect jdbc:mysql://db.example.com/analytics --table score --export-dir /results/score_data

table - the destination table in the RDBMS that the data is written into.

db.example.com - the hostname of the database server.

analytics - the database name
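
Because the input files are parsed according to user-specified delimiters, exports of data that does not use Sqoop's defaults (comma-separated fields, newline-separated records) must state the delimiters explicitly. A sketch assuming tab-separated files, with placeholder connection details:

sqoop export \
--connect jdbc:mysql://db.example.com/analytics \
--username sqoop \
--password sqoop123 \
--table score \
--export-dir /results/score_data \
--input-fields-terminated-by '\t' \
--input-lines-terminated-by '\n'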

Hadoop Hue

Hadoop Hue is an open-source user interface for Hadoop components. Users can access Hue right from within the browser, and it enhances the productivity of Hadoop developers. It was developed by Cloudera and is an open-source project. Through Hue, users can interact with HDFS and MapReduce applications, and they do not have to use the command-line interface to work with the Hadoop ecosystem.

Features of Hue

Hue offers a lot of features beyond the web interface it provides to Hadoop developers. It has become a popular tool among Hadoop developers thanks to the following features:

  • Hadoop API access
  • HDFS file browser
  • Job browser and job designer
  • User admin interface
  • Editor for Hive queries
  • Editor for Pig queries
  • Hadoop shell access
  • Oozie interface for workflows
  • A separate interface for Solr searches

It is easier to store, manage, run, and schedule Sqoop scripts and jobs through Hadoop Hue (using Oozie workflows).

Conclusion

Sqoop is a tool for moving data between relational databases and Hadoop securely and at low cost, and it can be used efficiently and effectively in many environments. Because it transfers data quickly and in parallel, many organizations adopt it to move data into their own clusters.

It allows the transfer of data to and from a variety of structured data stores, such as Teradata and other enterprise databases.

  • Sqoop can execute the data transfer in parallel.
  • Since the data ends up in the Hadoop cluster, Sqoop makes it possible to offload certain processing from the ETL pipeline into Hadoop, which is fast, low cost, and effective.
  • It handles large volumes of data.

Sqoop does not only help database administrators and data engineers perform ETL tasks and data transfers; it also immensely helps data analysts and data scientists excel in their roles.

Feel free to reach me with any comments on my Linkedin account and thank you for reading this post.
