Registering training data lineage using a linking file

Machine Learning (ML) projects, model builds, model deployments, and their associated metadata are automatically tracked in Apache Atlas, which is available in the environment's SDX cluster. You can also manually specify additional metadata to be tracked for a given model build, for example, to link training data to a project. You do this through a special file called the linking file (lineage.yaml).

In your ML project, create a YAML file called lineage.yaml. If you created the project from a template, a lineage.yaml file should already exist in it.

The lineage.yaml file describes additional metadata and the lineage relationships between the project’s models and training data. You can use a single lineage.yaml file for all the models within the project. The following is an example of a linking file for two models in your project: modelName1 and modelName2:
modelName1:                               # the name of your model
  hive_table_qualified_names:             # this is a predefined key to link to
                                          # training data
    - "db.table1@namespace"               # the qualifiedName of the hive_table
                                          # object representing training data
    - "db.table2@ns"
  metadata:                               # this is a predefined key for 
                                          # additional metadata
    key1: value1                                    
    key2: value2                                    
    query: "select id, name from table"   # suggested use case: query used to
                                          # extract training data
    training_file: "fit.py"               # suggested use case: training file
                                          # used
modelName2:                               # multiple models can be specified in 
                                          # one file
  hive_table_qualified_names:             
    - "db.table2@ns"
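
Because the linking file is written by hand, a quick structural check before a model build can catch typos in the predefined keys. The following is a minimal sketch in Python, not part of the product: the function name check_lineage, the rule that keys other than hive_table_qualified_names and metadata are flagged, and the "db.table@namespace" pattern for qualified names are all assumptions inferred from the example above. It works on an already parsed mapping, such as the result of loading lineage.yaml with a YAML parser like PyYAML's yaml.safe_load.

```python
import re

# Predefined keys taken from the example linking file; flagging any other
# key is an assumption of this sketch, not documented product behavior.
KNOWN_KEYS = {"hive_table_qualified_names", "metadata"}

# Assumed shape of a hive_table qualifiedName: "<db>.<table>@<namespace>".
QUALIFIED_NAME = re.compile(r"^[^.@\s]+\.[^.@\s]+@[^@\s]+$")


def check_lineage(lineage):
    """Return a list of problems found in a parsed linking file (hypothetical helper)."""
    if not isinstance(lineage, dict):
        return ["top level must map model names to entries"]
    problems = []
    for model, entry in lineage.items():
        if not isinstance(entry, dict):
            problems.append(f"{model}: entry must be a mapping")
            continue
        # Report keys that are not among the predefined ones.
        for key in sorted(entry.keys() - KNOWN_KEYS, key=str):
            problems.append(f"{model}: unexpected key '{key}'")
        tables = entry.get("hive_table_qualified_names", [])
        if not isinstance(tables, list):
            problems.append(f"{model}: hive_table_qualified_names must be a list")
            tables = []
        for name in tables:
            if not isinstance(name, str) or not QUALIFIED_NAME.match(name):
                problems.append(f"{model}: '{name}' does not look like db.table@namespace")
        if not isinstance(entry.get("metadata", {}), dict):
            problems.append(f"{model}: metadata must be a mapping")
    return problems


# The same content as the example linking file, as it would look once parsed.
example = {
    "modelName1": {
        "hive_table_qualified_names": ["db.table1@namespace", "db.table2@ns"],
        "metadata": {
            "key1": "value1",
            "key2": "value2",
            "query": "select id, name from table",
            "training_file": "fit.py",
        },
    },
    "modelName2": {"hive_table_qualified_names": ["db.table2@ns"]},
}

print(check_lineage(example))
```

An empty list means the sketch found nothing to report. It checks only the shape of the file and does not confirm that the referenced hive_table objects exist in Atlas.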