Applying Metadata to HDFS and Hive Entities using the API

Using the Navigator API and JSON-formatted metadata definition files, you can assign properties to entities in bulk, prior to extraction.

Metadata Definition Files

You can add tags and properties to HDFS entities using metadata files. With metadata files, you can assign metadata to entities in bulk and create metadata before it is extracted. A metadata file is a JSON file with the following structure:

{
  "name" : "name",
  "description" : "description",
  "properties" : {
    "key_name_1" : "value_1",
    "key_name_2" : "value_2"
  },
  "tags" : [ "tag_1" ]
}
To apply metadata to files and directories, create a metadata file with the extension .navigator, named as follows:
  • File - The path of the metadata file must be .filename.navigator. For example, to apply properties to the file /user/test/file1.txt, the metadata file path is /user/test/.file1.txt.navigator.
  • Directory - The path of the metadata file must be dirpath/.navigator. For example, to apply properties to the directory /user, the metadata path must be /user/.navigator.
The metadata file is applied to the entity metadata when the extractor runs.
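
The naming rules above can be sketched in a few lines of Python. This is only an illustration: the function names navigator_metadata_path and metadata_document are hypothetical helpers, not part of Navigator, and the sketch does not upload anything to HDFS.

```python
import json
import posixpath


def navigator_metadata_path(entity_path, is_directory=False):
    """Return the HDFS path of the .navigator file for an entity."""
    if is_directory:
        # Directory: the metadata file lives inside it, as dirpath/.navigator
        return posixpath.join(entity_path, ".navigator")
    # File: a hidden sibling named .filename.navigator
    parent, filename = posixpath.split(entity_path)
    return posixpath.join(parent, "." + filename + ".navigator")


def metadata_document(name, description, properties, tags):
    """Serialize metadata using the JSON structure of a metadata file."""
    return json.dumps(
        {
            "name": name,
            "description": description,
            "properties": properties,
            "tags": tags,
        },
        indent=2,
    )
```

The string returned by metadata_document would then be written to the computed path with whatever HDFS client you already use.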

Applying HDFS and Hive Metadata

The Navigator APIs can be used to modify the metadata of HDFS or Hive entities, such as databases, tables, and operations, before or after entity extraction. There are three methods to apply or update entity metadata:
  • Pre-register metadata for an asset that isn't yet in Navigator. (POST /entities)
  • Apply new metadata or update existing metadata for a single entity. (PUT /entities/{id})
  • Apply new metadata or update existing metadata for a set of entities. (PUT /entities/bulk)

The Navigator SDK includes examples of pre-registering entities. For example, see this example of creating a Hive operation, which will allow you to see lineage for pre-registered entities.

Retrieving Existing Metadata for Entities: GET /entities

When you call an API to update or add metadata to existing Navigator entities, the call replaces the metadata of a given type with the metadata provided in the API call. Therefore, when you update an entity, you must supply the existing metadata as well as the new metadata in the call payload.

Typically, you would call the GET /entities API to retrieve the existing metadata for the entities you want to update. The GET call returns a JSON object that includes all the metadata for the entities you request. You can then strip the technical metadata from the JSON, augment the remaining metadata with the values you want to update, then use the JSON as the payload for the PUT or POST call.
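
As a rough sketch of the "strip the technical metadata" step, the following Python function keeps only the fields that this document lists as accepted by the PUT call. The constant EDITABLE_FIELDS and the function name editable_metadata are illustrative assumptions. The input is the parsed JSON object for one entity as returned by GET /entities.

```python
# Assumption: these are the user-editable fields, taken from the list of
# properties that this document says the PUT call accepts.
EDITABLE_FIELDS = ("name", "description", "tags", "properties", "customProperties")


def editable_metadata(entity):
    """Keep only the user-editable fields of one GET /entities result.

    Fields that are null in the response are left out of the result.
    """
    return {
        field: entity[field]
        for field in EDITABLE_FIELDS
        if entity.get(field) is not None
    }
```

The resulting dictionary can be augmented with the new values and serialized with json.dumps to form the payload.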

The GET /entities API allows you to retrieve entity metadata using an entity identifier or by using a search query. The search query uses the same syntax and field names as the full query text in the Navigator console Search tab. There are many examples of useful GET /entities calls in GET Existing Metadata Examples.

Pre-registering Navigator Entities: POST /entities

Navigator provides an API to create new entities without a metadata extraction process. Use this API to insert entities in Navigator from sources that do not have an extractor plugin or when you want to store metadata during ingest or other processes that occur before extraction would run. The API generates the appropriate internal identifier for the entity. If you call the API before the metadata for an entity is extracted, the user-defined metadata is stored with the following entity properties:

  • Identity
  • Source ID
  • Metadata fields (name, description, tags, properties)
  • Fields relevant to the identifier

Other fields (attributes) for the entity, such as Type, are not present when the entity is generated and are filled in later by an automated metadata extraction process.

Use the following cURL syntax to call the POST /entities API:

curl http://fqdn-n.example.com:port/api/APIversion/entities/ \
-u username:password \
-X POST \
-H "Content-Type: application/json" \
-d '{properties}'

where properties are:

  • sourceId (required) — The identifier value from an existing source entity. See the GET /entities example to find the existing sources.
  • parentPath — The path of the parent entity, defined as:
    • HDFS file or directory — fileSystemPath of the parent directory. (Do not provide this field if the entity affected is the root directory.) The parentPath for /user/admin/input_dir is /user/admin.

      If you add metadata to a directory, the metadata does not propagate to any files or folders in that directory.

    • Hive database — If you are updating database metadata, do not specify this field.
    • Hive table or view — The name of the database containing the table or view. The parentPath for a table in the default database is default.
    • Hive column — The database name followed by the table or view name, separated by a slash. The parentPath for a column in the sample_07 table is default/sample_07.
  • originalName (required) — The name as defined by the source system.
    • HDFS file or directory — Name of file or directory (ROOT if the entity is the root directory). The originalName for /user/admin/input_dir is input_dir.
    • Hive database, table, view, or column — The name of the database, table, view, or column.
      • Example for default database: default
      • Example for sample_07 table: sample_07
  • Metadata fields (name, description, tags, user-defined properties, managed properties) that describe the final entity, including existing metadata.

All existing naming rules apply, and if any value is invalid, the entire request is denied. The call adds the entity to Navigator with the provided metadata and generates the entity identity:

identity — The ID assigned to an entity by Navigator. This value is generated by Navigator based on the originalName, parentPath, and sourceId.
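
The following Python sketch shows one way to derive parentPath and originalName from an HDFS path before calling POST /entities. The function name hdfs_preregistration_payload is hypothetical, and the sketch covers only the HDFS rules described above, not Hive entities.

```python
import posixpath


def hdfs_preregistration_payload(source_id, hdfs_path, **metadata):
    """Build a POST /entities body for an HDFS file or directory."""
    payload = {"sourceId": source_id}
    if hdfs_path == "/":
        # Root directory: no parentPath, and originalName is ROOT
        payload["originalName"] = "ROOT"
    else:
        parent, name = posixpath.split(hdfs_path.rstrip("/"))
        payload["parentPath"] = parent
        payload["originalName"] = name
    # Optional metadata fields such as name, description, tags, properties
    payload.update(metadata)
    return payload
```

For /user/admin/input_dir this yields a parentPath of /user/admin and an originalName of input_dir. The dictionary can be serialized with json.dumps and passed as the request body.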

Updating Metadata for a Single Entity: PUT /entities/{id}

You can use the PUT /entities/{id} API to update the metadata for a single entity.

User-defined metadata provided through the API overwrites existing metadata. In addition, null values for some metadata fields overwrite existing metadata. For example, passing empty or missing name and description fields, an empty array for tags, and an empty properties dictionary removes the existing metadata. If you omit the tags or properties fields, the existing tags or properties values remain unchanged. If you want to add a tag to a list of existing tags, you must include the existing tags in your update.

Use the following cURL syntax to make a PUT /entities call:

curl http://fqdn-n.example.com:port/api/APIversion/entities/identity \
-u username:password \
-X PUT \
-H "Content-Type: application/json" \
-d '{properties}'

where identity is an entity ID and properties are:

  • name — The name for the entity. This "name" is used in Navigator and is distinct from the "originalName" that Navigator extracted from the source. If there is already a name applied, you must supply the existing name to prevent the call from overwriting the name with a null value.
  • description — The description for the entity. If you are not updating the description, you must supply the existing description to avoid overwriting it with a null value.
  • tags — Tag metadata. If you provide any tags, you must also supply existing tags to avoid removing existing tags.
  • properties — User-defined properties. The format is {key: value}. Again, if you are adding a user-defined property, you must also supply the existing user-defined properties associated with this entity to avoid overwriting them.
  • customProperties — Managed properties. The format is {namespace: {key: value}}. If a property is assigned a value that does not conform to type constraints, the call returns an error. If you are associating a managed property with an entity, you must also supply the existing managed properties associated with this entity to avoid overwriting them.

If any value is invalid, the entity is not updated.
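
Because the PUT call overwrites rather than appends, a client usually merges new values into the existing metadata before sending it. The sketch below illustrates that merge in Python. The function merged_update and its parameters are hypothetical. Managed properties (customProperties) are passed through unchanged rather than merged.

```python
def merged_update(existing, name=None, description=None,
                  add_tags=(), add_properties=None):
    """Return a PUT body that adds metadata without dropping what exists."""
    existing_tags = existing.get("tags") or []
    body = {
        # Fall back to current values so they are not overwritten with null
        "name": name if name is not None else existing.get("name"),
        "description": (description if description is not None
                        else existing.get("description")),
        # Existing tags first, then new ones, with duplicates removed
        "tags": list(dict.fromkeys(existing_tags + list(add_tags))),
        # New property values win over existing ones with the same key
        "properties": {**(existing.get("properties") or {}),
                       **(add_properties or {})},
    }
    if existing.get("customProperties"):
        # Resend managed properties as they are so they are not lost
        body["customProperties"] = existing["customProperties"]
    return body
```

Here existing is the current metadata of the entity, for example as retrieved with GET /entities.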

API Usage Examples

This section contains some examples, which include the URL that points to the Navigator Metadata Server host and port:

http://fqdn-n.example.com:port/api/APIversion/entities/ 

where fqdn-n.example.com is the host running the Navigator Metadata Server role instance listening for HTTP connections at the specified port number (7187 is the default port number). APIversion is the running version of the API as indicated in the footer of the API documentation (available from the Help menu in the Navigator console) or by calling http://fqdn-n.example.com:port/api/version.
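
For scripts that make many calls, it can help to assemble this URL in one place. The following is a small sketch. The function name entities_url is an illustrative assumption, and the default port of 7187 is the one stated above.

```python
def entities_url(host, api_version, port=7187, identity=None, scheme="http"):
    """Build the entities endpoint URL, optionally for a single entity."""
    base = f"{scheme}://{host}:{port}/api/{api_version}/entities/"
    # Append the entity identity for calls that address one entity
    return base + identity if identity else base
```

For example, entities_url("node1.example.com", "v13", identity="14") produces the URL used in the managed properties example in this section.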

To create an authentication token to use in subsequent calls:

This example makes a dummy call that saves an authentication token to the current location in a file named NavCookie. You can use the token in later calls to avoid having to specify the username and password.

curl "https://node1.example.com:7187/api/v14/interactive/entities?limit=1&offset=0" \
-c NavCookie -u username:password -X GET

HDFS PUT Custom Metadata Example for /user/admin/input_dir Directory

curl http://node1.example.com:7187/api/v13/entities/e461de8de38511a3ac6740dd7d51b8d0 \
-u username:password \
-X PUT \
-H "Content-Type: application/json" \
-d '{"name":"my_name",
   "description":"My description",
   "tags":["tag1","tag2"],
   "properties":{"property1":"value1","property2":"value2"}}'

HDFS POST Custom Metadata Example for /user/admin/input_dir Directory

curl http://node1.example.com:7187/api/v13/entities/ \
-u username:password \
-X POST \
-H "Content-Type: application/json" \
-d '{"sourceId":"a09b0233cc58ff7d601eaa68673a20c6",
     "parentPath":"/user/admin",
     "originalName":"input_dir",
     "name":"my_name",
     "description":"My description",
     "tags":["tag1","tag2"],
     "properties":{"property1":"value1","property2":"value2"}}'

Hive POST Custom Metadata Example for total_emp Column

curl http://node1.example.com:7187/api/v13/entities/ \
-u username:password \
-X POST \
-H "Content-Type: application/json" \
-d '{"sourceId":"4fbdadc6899638782fc8cb626176dc7b",
     "parentPath":"default/sample_07",
     "originalName":"total_emp",
     "name":"my_name",
     "description":"My description",
     "tags":["tag1","tag2"],
     "properties":{"property1":"value1","property2":"value2"}}'

Hive PUT Managed Properties Example

This example adds a property and a tag to the entity identified as "14", which happens to be the Hive email_preferences column in the customers sample table. The Approved property is a Boolean to indicate whether or not the metadata for this column was reviewed and approved. The ToBeReviewed tag marks the column temporarily so a data steward can easily find this column for review:
curl http://node1.example.com:7187/api/v13/entities/14 \
-u username:password \
-X PUT \
-H "Content-Type: application/json" \
-d '{"tags": ["ToBeReviewed"],
     "customProperties": {"Operations":
      {"Approved": false}}}'
The server responds:
{
  "originalName" : "email_preferences",
  "originalDescription" : null,
  "sourceId" : "7",
  "firstClassParentId" : "13",
  "parentPath" : "/default/customers",
  "deleteTime" : null,
  "extractorRunId" : "7##1",
  "customProperties" : {
    "Operations" : {
      "Approved" : false
    }
  },
  "name" : null,
  "description" : null,
  "tags" : [ "ToBeReviewed" ],
  "properties" : {
    "__cloudera_internal__hueLink" : "https://node1.example.com:8889/hue/metastore/table/default/customers"
  },
  "technicalProperties" : null,
  "dataType" : "struct<email_format:string,frequency:string,categories:struct<promos:boolean,surveys:boolean>>",
  "type" : "FIELD",
  "sourceType" : "HIVE",
  "userEntity" : false,
  "metaClassName" : "hv_column",
  "deleted" : false,
  "packageName" : "nav",
  "identity" : "14",
  "internalType" : "hv_column"
}

GET Existing Metadata Examples

This section shows some examples of useful GET /entities calls:

To retrieve a specific entity:

Use the entity identifier that shows in the URL in the Navigator console. This example gets the metadata for the entity with identifier 21302.
curl https://node1.example.com:7187/api/v14/entities/21302 \
-X GET -b NavCookie
To retrieve entities that describe the sources Navigator extracts data from:
curl https://node1.example.com:7187/api/v14/entities/?query=type%3ASOURCE \
-b NavCookie -X GET
To use the source in a query, find its identity: In the output, find the name of the source, then collect the identity for that entity. In this example, the HIVE-1 source has the identity "6".
{
  "originalName" : "HIVE-1",
  "originalDescription" : null,
  "sourceId" : null,
  "firstClassParentId" : null,
  "parentPath" : null,
  "deleteTime" : null,
  "extractorRunId" : null,
  "customProperties" : null,
  "name" : "HIVE-1",
  "description" : null,
  "tags" : null,
  "properties" : null,
  "technicalProperties" : null,
  "clusterName" : "Cluster 1",
  "sourceUrl" : "thrift://node2.example.com:9083",
  "sourceType" : "HIVE",
  "sourceExtractIteration" : 15,
  "sourceTemplate" : true,
  "hmsDbHost" : "node1.example.com",
  "hmsDbName" : "hive1",
  "hmsDbPort" : "3306",
  "hmsDbUser" : "hive1",
  "type" : "SOURCE",
  "userEntity" : false,
  "deleted" : null,
  "metaClassName" : "source",
  "packageName" : "nav",
  "identity" : "6",
  "internalType" : "source"
}

To retrieve all entities from a single source:

Run the previous call to determine the identity of the source, then use that value as the sourceId in the query. This example uses sourceId of 6.

curl https://node1.example.com:7187/api/v14/entities/?query=sourceId%3A6 \
-b NavCookie -X GET

To retrieve all entities marked with a specific tag:

Use quotes around the tag name. This example retrieves metadata for entities tagged with "sensitive".

curl https://node1.example.com:7187/api/v14/entities/?query=tags%3A%22sensitive%22 \
-b NavCookie -X GET

To retrieve all entities from a single source and marked with a tag:

This example includes a plus sign (+) before each of the query components to ensure they are treated as an AND relation. Note that in this case the query returns the correct results only when the special characters are percent-encoded (for example, %2b for the plus sign).

curl https://node1.example.com:7187/api/v14/entities/?query=%2btags%3A%22sensitive%22%20%2bsourceId%3A6 \
-b NavCookie -X GET
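
Percent-encoding a query by hand is error-prone, so the encoding step can be delegated to a standard library. The sketch below uses Python's urllib.parse.quote. The function name entities_query_url is hypothetical.

```python
from urllib.parse import quote


def entities_query_url(base_url, query):
    """Append a percent-encoded search query to an entities URL."""
    # safe="" makes quote() encode "+", ":", quotation marks and "/" as well
    return base_url + "?query=" + quote(query, safe="")
```

Passing the plain-text query +tags:"sensitive" +sourceId:6 produces the same encoded form as the curl example above, except that Python emits uppercase hexadecimal digits (%2B rather than %2b). The two forms are equivalent in a URL.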

To search for pre-registered entities:

This example shows how to get metadata for pre-registered entities, which do not include an internal type:

curl "http://node1.example.com:7187/api/v13/entities/?query=-internalType:*" \
-b NavCookie -X GET

Using the Cloudera Navigator SDK for Metadata Management

To facilitate working with metadata using the Cloudera Navigator APIs, Cloudera provides the Cloudera Navigator SDK. Cloudera Navigator SDK is a client library that provides functionality for extracting and enriching metadata with custom models, entities, and relationships. See GitHub cloudera/navigator-sdk for details.