Data Movement and Integration
Copyright © 2012-2017 Hortonworks, Inc.
Except where otherwise noted, this document is licensed under the Creative Commons Attribution ShareAlike 4.0 License.
2017-10-30
Abstract
The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing, and analyzing large volumes of data. It is designed to handle data from many sources and in many formats quickly, easily, and cost-effectively. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects, including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper, and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process, and installation and configuration tools have also been included.
Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of its code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training, and partner-enablement services. All of our technology is, and will remain, free and open source.
Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. You can contact us directly to discuss your specific needs.
Contents
- 1. What's New in the Data Movement and Integration Guide
- 2. HDP Data Movement and Integration
- 3. Data Management and Falcon Overview
- 4. Prerequisite to Installing or Upgrading Falcon
- 5. Considerations for Using Falcon
- 6. Configuring for High Availability
- 7. Creating Falcon Entity Definitions
- 8. Mirroring Data with Falcon
- 9. Replicating Data with Falcon
- 10. Mirroring Data with HiveDR in a Secure Environment
- 11. Enabling Mirroring and Replication with Azure Cloud Services
- 12. Using Advanced Falcon Features
  - Locating and Managing Entities
  - Accessing File Properties from Ambari
  - Enabling Transparent Data Encryption
  - Putting Falcon in Safe Mode
  - Viewing Alerts in Falcon
  - Late Data Handling
  - Setting a Retention Policy
  - Setting a Retry Policy
  - Enabling Email Notifications
  - Understanding Dependencies in Falcon
  - Viewing Dependencies
- 13. Using Apache Sqoop to Transfer Bulk Data
  - Apache Sqoop Connectors
  - Storing Protected Passwords in Sqoop
  - Sqoop Import Table Commands
  - Sqoop Import Jobs Using --as-avrodatafile
  - Netezza Connector
  - Sqoop-HCatalog Integration
  - Controlling Transaction Isolation
  - Automatic Table Creation
  - Delimited Text Formats and Field and Line Delimiter Characters
  - HCatalog Table Requirements
  - Support for Partitioning
  - Schema Mapping
  - Support for HCatalog Data Types
  - Providing Hive and HCatalog Libraries for the Sqoop Job
  - Examples
  - Configuring a Sqoop Action to Use Tez to Load Data into a Hive Table
  - Troubleshooting Sqoop
- 14. Using HDP for Workflow and Scheduling With Oozie
- 15. Using Apache Flume for Streaming
- 16. Troubleshooting
- 17. Appendix
List of Tables
- 7.1. Supported HDP Versions for Replication
- 7.2. Cluster Entity General Properties
- 7.3. Cluster Entity Interface Properties
- 7.4. Cluster Entity Properties & Location Properties
- 7.5. Cluster Entity Advanced Properties
- 7.6. General Feed Properties
- 7.7. Hive Source and Target Feed Properties
- 7.8. HDFS Source and Target Feed Properties
- 7.9. RDBMS Import Source and Target Feed Properties
- 7.10. RDBMS Export Source and Target Feed Properties
- 7.11. Advanced Feed Properties
- 7.12. General Process Properties
- 7.13. Process Detail and Engine Properties
- 7.14. Advanced Process Properties
- 8.1. General HDFS Mirror Properties
- 8.2. Source and Target Mirror Properties
- 8.3. Advanced HDFS Mirror Properties
- 8.4. General Hive Mirror Properties
- 8.5. Source and Target Hive Mirror Properties
- 8.6. Advanced Hive Mirror Properties
- 8.7. Source and Target Snapshot Mirror Properties
- 8.8. Advanced Snapshot Mirror Properties
- 10.1. General Hive Mirror Properties
- 10.2. Source and Target Hive Mirror Properties
- 10.3. Advanced Hive Mirror Properties
- 12.1. Available Falcon Event Alerts
- 12.2. Email Notification Startup Properties