Registering training data lineage using a linking file
The Cloudera AI projects, model builds, model
deployments, and associated metadata are tracked in Apache Atlas, which is
available in the environment's SDX cluster. You can also specify additional
metadata to be tracked for a given model build. For example, you can specify
metadata that links training data to a project through a special file called
the linking file (lineage.yaml
).
The lineage.yaml
file describes additional metadata and
the lineage relationships between the project’s models and training data.
You can use a single lineage.yaml
file for all the models
within the project.
- Create a YAML file in your Cloudera AI project called
lineage.yaml
.If you have used a template to create your project, a
lineage.yaml
file should already exist in your project. - Insert statements in the file that describe the relationships you
want to track between a model and the training data. You can include
additional descriptive metadata through key-value pairs in a
metadata
section.
The following example linking file shows entries for two models in your project:YAML YAML Structure Description Model name Top-level entry A Cloudera AI model name associated with the current project. There can be more than one model per linking file. hive_table_qualified_names
Second-level entry This pre-defined key introduces sequence items that list the names of Hive tables used as training data. Table names Sequence items The qualified names of Hive tables used as training data enclosed in double quotation marks. Qualified names are of the format db-name.table-name@cluster-name
metadata
Second-level entry This pre-defined key introduces additional metadata to be included in the Atlas representation of the relationship between the model and the training data. key:value
Third-level entries Key-value pairs that describe information about how this data is used in the model. For example, consider including the query text that is used to extract training data or the name of the training file used. modelName1
andmodelName2
:modelName1: # the name of your model hive_table_qualified_names: # this is a predefined key to link to # training data - "db.table1@namespace" # the qualifiedName of the hive_table # object representing training data - "db.table2@ns" metadata: # this is a predefined key for # additional metadata key1: value1 key2: value2 query: "select id, name from table" # suggested use case: query used to # extract training data training_file: "fit.py" # suggested use case: training file # used modelName2: # multiple models can be specified in # one file hive_table_qualified_names: - "db.table2@ns"