Data Governance Guide

Understanding the HDP Metadata Services Framework

Hadoop presents data governance challenges because it is a platform composed of autonomous projects that each set their own direction and share no common framework. For example, disparate tools such as HCatalog, Ranger, and Falcon each provide pieces of an overall data governance solution, but the Hadoop stack has no comprehensive governance layer. In addition, there is no means of integrating the Hadoop stack with external governance frameworks.

Atlas provides the means to centrally manage the data lifecycle in HDP. It maintains a repository that collects metadata for the platform, and that metadata can be searched, tagged, and managed. A REST API is also available for integrating third-party governance tools with HDP. For information about the REST API, see Appendix D in this guide.
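As a rough illustration of the integration path, the sketch below shows how a third-party tool might build search requests against the REST API. The host name, port, endpoint paths, and query strings are assumptions made for this example and are not taken from this guide; check Appendix D for the actual resources before relying on them. The helper only constructs URLs and does not contact a server.

```python
from urllib.parse import urlencode

# Assumed values for illustration only: the host, port, and endpoint
# paths are placeholders and should be confirmed against Appendix D.
ATLAS_BASE_URL = "http://atlas-host.example.com:21000"
SEARCH_PATHS = {
    "dsl": "/api/atlas/discovery/search/dsl",
    "fulltext": "/api/atlas/discovery/search/fulltext",
}


def build_search_url(mode, query, base_url=ATLAS_BASE_URL):
    """Return a search URL for the given mode ('dsl' or 'fulltext')."""
    if mode not in SEARCH_PATHS:
        raise ValueError(f"unknown search mode: {mode!r}")
    # urlencode percent-encodes the query so it is safe in a URL.
    return f"{base_url}{SEARCH_PATHS[mode]}?{urlencode({'query': query})}"


# Hypothetical queries: a SQL-like DSL query and a plain full-text term.
dsl_url = build_search_url("dsl", 'hive_table where name = "sales"')
text_url = build_search_url("fulltext", "sales")
print(dsl_url)
print(text_url)
```

Keeping the endpoint paths in one table means that only `SEARCH_PATHS` needs to change if the deployed API uses different resource names.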

Figure 3.1. Atlas Architecture

  • REST API handles all interaction with the metadata services.

  • Metadata services leverage the existing HDP stack plug-in model.

  • Metadata search provided in two ways:

    • DSL (domain-specific language) search, which uses a SQL-like query language.

    • Lucene-style full text search.

  • Type system provides flexible modeling capability to model any business, data asset, or process, including inheritance.

  • Titan/HBase graph database that runs the type system.

  • Bridge, a native connector that automatically fetches lineage and metadata. The Hive bridge connector ships with HDP 2.3; connectors for additional components are to follow.

  • Solr/Elastic provide additional pluggable search capability that can be used without affecting the REST API or Atlas capabilities.
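To make the idea of a type system with inheritance more concrete, here is a minimal conceptual sketch. It is not the Atlas API or its storage format: the `TypeDef` class, the type names (`DataSet`, `Table`), and the attribute names are all hypothetical, chosen only to show how a subtype can pick up the attributes of its supertypes.

```python
from dataclasses import dataclass, field


@dataclass
class TypeDef:
    """Hypothetical type definition; not an Atlas class."""
    name: str
    attributes: dict = field(default_factory=dict)   # attribute name -> data type
    supertypes: list = field(default_factory=list)   # parent TypeDef objects

    def all_attributes(self):
        """Merge inherited attributes with the ones declared on this type."""
        merged = {}
        for parent in self.supertypes:
            merged.update(parent.all_attributes())
        # Attributes declared on the type itself override inherited ones.
        merged.update(self.attributes)
        return merged


# Hypothetical model: every data set has a name and an owner,
# and a table is a data set that also has columns.
data_set = TypeDef("DataSet", {"name": "string", "owner": "string"})
table = TypeDef("Table", {"columns": "array<string>"}, [data_set])

print(sorted(table.all_attributes()))  # ['columns', 'name', 'owner']
```

In this sketch, a new asset type only declares what is specific to it, and shared attributes such as `owner` are defined once on the supertype.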