Hadoop Sqoop: A Tool for Transferring Data Between Hadoop and Relational Databases
Introduction:
Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Sqoop is an open-source software product of the Apache Software Foundation.
Architecture:
Sqoop is a client-side tool: the sqoop command line parses your arguments, uses JDBC to read the table's metadata from the database, generates a Java class to represent its rows, and submits a map-only MapReduce job. Each mapper pulls a slice of the table over JDBC and writes it to HDFS, which is where Sqoop's parallelism and fault tolerance come from.
Key features of Sqoop
- Bulk import: Sqoop facilitates the import of individual tables as well as entire databases into HDFS. The data is stored in native directories and files in the HDFS file system
- Direct input: Sqoop can also import relational (SQL) tables directly into Hive and HBase (a sketch follows this list)
- Data interaction: Sqoop can generate Java classes so that you can interact with the imported data programmatically
- Data export: Sqoop can also export data from HDFS into a relational database, with the help of a target table definition based on the specifics of the target database.
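As an illustration of the direct-input feature, here is a hedged sketch of importing a table straight into Hive; the connection details mirror the import example later in this post, and the Hive table name is an assumption:
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--table employees \
--hive-import \
--hive-table employees
With --hive-import, Sqoop first copies the data into HDFS and then creates (if needed) and loads the corresponding Hive table.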
How does Sqoop work?
Let's look at the two most common Sqoop tools:
1. Sqoop Import:
The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.
Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. For example:
$ sqoop import --connect jdbc:mysql://database.example.com/employees
This string will connect to a MySQL database named employees on the host database.example.com. It's important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on TaskTracker nodes throughout your MapReduce cluster; if you specify the literal name localhost, each node will connect to a different database (or, more likely, no database at all). Instead, you should use the full hostname or IP address of the database host that can be seen by all your remote nodes.
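A quick, hedged way to verify that a connect string works before running a full import is to list the tables it can see; the username here is an assumption, and -P prompts for the password interactively:
sqoop list-tables --connect jdbc:mysql://database.example.com/employees --username sqoop -P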
Simple Syntax for import:
sqoop import --connect <jdbc-uri> --table <table-name> --username <user> --password <password> --target-dir <hdfs-dir>
--table - table to read/import from
--username - username to connect to the database server
--password - password for the user to connect to the database server
--target-dir - location of the directory in HDFS where the data will be stored.
To overwrite the directory, use --delete-target-dir to delete the existing directory:
sqoop import --connect <jdbc-uri> --table <table-name> --username <user> --password <password> --delete-target-dir --target-dir <hdfs-dir>
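A concrete version of this basic import, reusing the same illustrative connection details as the incremental example below (the HDFS path is an assumption):
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--table employees \
--delete-target-dir \
--target-dir /user/hive/employees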
Syntax for an incremental import:
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--table employees \
--incremental append \
--check-column id \
--last-value 1
--incremental append - import only rows newer than the last import and append them to the existing directory
--last-value - the value of the check column after which rows are fetched and appended to the mentioned directory
--check-column - column to be checked for the last value while performing an incremental append (typically the primary key).
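To make the incremental mechanics concrete: if the run above imported rows with id values up to 100, a hedged sketch of the follow-up run would simply raise --last-value (Sqoop also prints the --last-value to use for the next run at the end of each incremental import):
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--table employees \
--incremental append \
--check-column id \
--last-value 100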
For importing selected data from a table:
sqoop import --connect <jdbc-uri> --table <table-name> --username <user> --password <password> --columns <column-list> --where <condition>
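For example, a hedged sketch of a selective import; the column names, filter, and target directory are assumptions made for illustration:
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--table employees \
--columns "id,name,salary" \
--where "salary > 50000" \
--target-dir /data/employees_high_salary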
For importing data from a free-form query (used instead of --table):
sqoop import --connect <jdbc-uri> --username <user> --password <password> --query <sql-query> --target-dir <hdfs-dir>
Note that the query must include a WHERE $CONDITIONS token so that Sqoop can partition the work across mappers, and you must supply either --split-by or run with a single mapper (-m 1); --target-dir is also required in this mode.
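A hedged sketch of a query-based import (the SQL and target directory are assumptions for illustration):
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--query 'SELECT id, name FROM employees WHERE $CONDITIONS' \
--split-by id \
--target-dir /data/employee_names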
2. Sqoop Export:
The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.
If you specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is treated as an UPDATE statement that modifies an existing row. The row a statement modifies is determined by the column name(s) specified with --update-key.
sqoop export --connect <jdbc-uri> --username <user> --password <password> --table <table-name> --export-dir <hdfs-dir>
--export-dir - source location from where data is taken and written to the RDBMS table
sqoop export --connect jdbc:mysql://db.example.com/analytics --table score --export-dir /results/score_data
--table - the destination table in the RDBMS to write the data into.
db.example.com - the hostname of the machine where the database resides.
analytics - the database name
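Building on that export, here is a hedged sketch of an update-mode export using --update-key; the key column name and credentials are assumptions for illustration:
sqoop export \
--connect jdbc:mysql://db.example.com/analytics \
--username sqoop \
--password sqoop123 \
--table score \
--export-dir /results/score_data \
--update-key id
Here each record read from /results/score_data is turned into an UPDATE matched on the id column rather than an INSERT.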
Hadoop Hue
Hadoop Hue is an open-source user interface for Hadoop components. Users can access Hue right from within the browser, and it enhances the productivity of Hadoop developers. It was developed by Cloudera and is an open-source project. Through Hue, users can interact with HDFS and MapReduce applications, without having to use the command-line interface to work with the Hadoop ecosystem.
Features of Hue
Hue offers a lot of features beyond the web interface it provides to Hadoop developers. The following features have made it a popular tool among Hadoop developers:
- Hadoop API Access
- Presence of HDFS File Browser
- Job Browser and Job Designer
- User Admin Interface
- Editor for Hive Query
- Editor for Pig Query
- Hadoop Shell Access
- Oozie interface for workflows
- Separate interface for Solr searches
It's easier to store, manage, run, and schedule Sqoop scripts and jobs through Hadoop Hue (via Oozie workflows).
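One hedged way to package an import so it can be re-run or scheduled (from Hue/Oozie or the command line) is a saved Sqoop job; the job name and parameters here are assumptions for illustration:
sqoop job --create daily_employee_import -- import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop123 \
--table employees \
--incremental append \
--check-column id \
--last-value 0
The saved job can then be run with sqoop job --exec daily_employee_import; for incremental jobs Sqoop remembers the last imported value between runs, though depending on metastore configuration it may prompt for the password again at execution time.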
Conclusion
Sqoop is a tool that lets an RDBMS move its data into Hadoop securely and within budget, and it can be used efficiently and effectively almost anywhere. Because it transfers data quickly and in parallel, many teams adopt it to move data into their own clusters for better results.
It allows the transfer of data to and from a variety of structured data stores such as Teradata, Oracle, and MySQL.
- Sqoop can execute the data transfer in parallel.
- Since the data lands in the Hadoop cluster, Sqoop allows offloading some of the processing done in the ETL pipeline onto fast, low-cost, and effective Hadoop resources.
- It handles large volumes of data.
Sqoop doesn't only help database administrators and data engineers perform ETL tasks and data transfers; it also immensely helps data analysts and data scientists excel at their roles.
Feel free to reach me with any comments on my LinkedIn account, and thank you for reading this post.