Chapter 1. HDP Data Movement and Integration
Enterprises that adopt a modern data architecture with Hadoop must reconcile data management realities when they bring existing and new data from disparate platforms under management. As Hadoop is deployed in corporate data and processing environments, data movement and lineage must be managed centrally and comprehensively, giving security, data governance, and administration teams the oversight they need to ensure compliance with corporate standards for data management. Hortonworks offers the HDP Data Movement and Integration Suite (DMI Suite) to provide that comprehensive management for data movement into and out of Hadoop.
Use cases for data movement and integration (DMI) include the following:
- Definition and scheduling of data manipulation jobs (see the Falcon sketch after this list), including:
  - Data transfer
  - Data replication
  - Mirroring
  - Snapshots
  - Disaster recovery
  - Data processing
- Monitoring and administration of data manipulation jobs
- Root cause analysis of failed jobs
- Job restart, rerun, suspension, and termination
- Workflow design and management
- Ad hoc bulk data transfer and transformation
- Collection, aggregation, and movement of large amounts of streaming data
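For instance, after cluster and feed entities have been defined in XML, replication jobs can be submitted and scheduled from the Falcon command line. The entity names and file names below are hypothetical; a minimal sketch:

    # Register the source and target clusters (the entity XML files are
    # assumed to exist already; all names and paths here are hypothetical).
    falcon entity -type cluster -submit -file primary-cluster.xml
    falcon entity -type cluster -submit -file backup-cluster.xml

    # Submit and schedule a feed entity that declares the data set, its
    # retention policy, and its replication target cluster.
    falcon entity -type feed -submitAndSchedule -file raw-logs-feed.xml

    # Check the status of the scheduled feed instances in a time window.
    falcon instance -type feed -name raw-logs -status \
      -start 2016-01-01T00:00Z -end 2016-01-02T00:00Z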
Intended Audience
Administrators, operators, and DevOps team members who are responsible for the overall health and performance of the HDP ecosystem use DMI for management, monitoring, and administration, all of which are performed through the Falcon Dashboard. Typical DMI users include the following roles:
- Database Administrators
Responsible for establishing recurring transfers of data between relational databases and Hadoop (see the Sqoop sketch following this list).
- Business Analysts or other business users
Need the ability to perform ad hoc ETL and analytics with a combination of Hadoop-based and RDBMS-based data.
- DevOps
Responsible for:
  - Maximizing the predictability, efficiency, security, and maintainability of operational processes. DevOps teams use the DMI Suite to create an abstraction of sources, data sets, and target systems, along with jobs and processes for importing, exporting, disaster recovery, and processing.
  - Designing workflows that combine various types of actions, including Java, Apache Hive, Apache Pig, Apache Spark, and Hadoop Distributed File System (HDFS) operations, along with SSH, shell, and email actions (see the Oozie workflow sketch following this list).
  - Collecting, aggregating, and moving streaming data, such as log events.
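For the recurring RDBMS-to-Hadoop transfers mentioned above, Sqoop is the underlying tool. The connection string, credentials, table, and paths below are hypothetical; a minimal incremental-import sketch:

    # Import only rows whose primary key exceeds the last recorded value.
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username etl_user \
      --password-file /user/etl/.db-password \
      --table orders \
      --target-dir /data/raw/orders \
      --incremental append \
      --check-column order_id \
      --last-value 0 \
      --num-mappers 4

Wrapping the same arguments in a saved job (sqoop job --create) lets Sqoop track the last imported value automatically, so a scheduler can rerun the import and pick up only new rows.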
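Workflows like those described for DevOps are expressed as Oozie workflow XML, with each action wired to success and failure transitions. The workflow name, paths, and Hive script below are hypothetical; a minimal single-action sketch:

    # Write a one-action workflow definition (hypothetical names throughout).
    cat > workflow.xml <<'EOF'
    <workflow-app xmlns="uri:oozie:workflow:0.5" name="etl-wf">
      <start to="hive-node"/>
      <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>etl.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Hive action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
      </kill>
      <end name="end"/>
    </workflow-app>
    EOF

    # Deploy the definition to HDFS and run it; job.properties is assumed
    # to define jobTracker, nameNode, and oozie.wf.application.path.
    hdfs dfs -put -f workflow.xml /user/etl/apps/etl-wf/
    oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run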
Data Movement Components
The HDP Data Movement and Integration Suite (DMI Suite) leverages the following Apache projects:
- Apache Falcon
Management and abstraction layer to simplify and manage data movement in Hadoop
- Apache Oozie
Enterprise workflow operations
- Apache Sqoop
Bulk data transfers between Hadoop and relational databases
- Apache Flume
Distributed, reliable service for collecting, aggregating, and moving large amounts of streaming data (see the sketch following this list)
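As an illustration of the streaming-data use case, a single Flume agent can be described in a properties file and started with the flume-ng command. The agent name, port, and HDFS path below are hypothetical; a minimal sketch:

    # Define a netcat source -> memory channel -> HDFS sink pipeline.
    cat > netcat-agent.conf <<'EOF'
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    a1.channels.c1.type     = memory
    a1.channels.c1.capacity = 1000

    a1.sinks.k1.type                   = hdfs
    a1.sinks.k1.hdfs.path              = /flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType          = DataStream
    a1.sinks.k1.hdfs.useLocalTimeStamp = true

    # Wire the source and sink to the channel.
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel    = c1
    EOF

    # Start the agent; --name must match the property prefix (a1).
    flume-ng agent --conf conf --conf-file netcat-agent.conf --name a1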
In addition, the DMI Suite integrates other Apache APIs to simplify the creation of complex processes, validate user input, and provide integrated management and monitoring.
Beyond the underlying components, the DMI Suite provides powerful user interfaces that simplify and streamline the creation and management of complex processes.