Registering training data lineage using a linking file
The Cloudera Data Science Workbench projects, model builds, model deployments, and
associated metadata are tracked in Apache Atlas, which is available in the environment's SDX
cluster. You can also specify additional metadata to be tracked for a given model build. For
example, you can specify metadata that links training data to a project through a special file
called the linking file (lineage.yaml).
The lineage.yaml file defines additional metadata and the
lineage relationships between the project’s models and training data. You can use a single
lineage.yaml file for all the models within the project.
Create a YAML file in your CDSW project called lineage.yaml. If you have used a
template to create your project, a lineage.yaml file should already exist in your
project.
Insert statements in the file that describe the relationships you want to track
between a model and the training data. You can include additional descriptive
metadata through key-value pairs in a metadata section.
YAML
YAML Structure
Description
Model name
Top-level entry
A ML model name associated with the current project. There can be more
than one model per linking file.
hive_table_qualified_names
Second-level entry
This pre-defined key introduces sequence items that list the names of
Hive tables used as training data.
Table names
Sequence items
The qualified names of Hive tables used as training data enclosed
in double quotation marks. Qualified names are of the format
db-name.table-name@cluster-name
metadata
Second-level entry
This pre-defined key introduces additional metadata to be included in the
Atlas representation of the relationship between the model and the
training data.
key:value
Third-level entries
Key-value pairs that describe information about how this data is used in
the model. For example, consider including the query text that is used to
extract training data or the name of the training file used.
The following example linking file shows entries for two models in your project:
modelName1 and modelName2:
modelName1: # the name of your model
hive_table_qualified_names: # this is a predefined key to link to
# training data
- "db.table1@namespace" # the qualifiedName of the hive_table
# object representing training data
- "db.table2@ns"
metadata: # this is a predefined key for
# additional metadata
key1: value1
key2: value2
query: "select id, name from table" # suggested use case: query used to
# extract training data
training_file: "fit.py" # suggested use case: training file
# used
modelName2: # multiple models can be specified in
# one file
hive_table_qualified_names:
- "db.table2@ns"