Datasets overview

A dataset is a group of assets that fit a set of search criteria so that you can manage and administer them collectively for specific business purposes.

Figure 1. Datasets

Datasets, also referred to as asset collections, enable you to perform the following tasks when working with your data:

  • Organize

    Group data assets into datasets based on business classification, purpose, protection, relevance, ownership, and so on.

    Example use cases:

    • A company may apply a tag such as "PII" to all data that must comply with GDPR, and use Ranger policies to control access to these assets.
    • A data steward may mark all data related to sales with tags such as sales_transactions, sales_targets, and customer_segments, and include these assets in a common dataset for easier discovery.
    • Datasets can signify ownership by different teams.
  • Search

    Find tags or assets in your data lake using Hive assets, attribute facets, or free text.

    Advanced asset search uses facets of technical and business metadata about the assets, such as those captured in Apache Atlas, to help users define and build collections of interest.

  • Understand

    Audit the security and use of data assets to support anomaly detection, forensic audit and compliance, and proper control mechanisms.
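
The faceted search described above can be pictured with a small, self-contained sketch. It is not the product's API: the Asset class, the search function, the facet names, and the sample catalog are all hypothetical and exist only to illustrate combining type, owner, and tag facets with free text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Asset:
    """Hypothetical stand-in for a catalog asset such as a Hive table."""
    name: str
    asset_type: str
    owner: str
    tags: frozenset = field(default_factory=frozenset)


def search(assets, asset_type=None, owner=None, tags=(), text=None):
    """Return the assets that match every facet that was supplied."""
    wanted = set(tags)
    results = []
    for asset in assets:
        if asset_type is not None and asset.asset_type != asset_type:
            continue
        if owner is not None and asset.owner != owner:
            continue
        if not wanted <= asset.tags:  # every requested tag must be present
            continue
        if text is not None and text.lower() not in asset.name.lower():
            continue
        results.append(asset)
    return results


# Illustrative catalog; names and tags are assumptions, not real metadata.
CATALOG = [
    Asset("sales.transactions", "hive_table", "sales",
          frozenset({"sales_transactions", "PII"})),
    Asset("sales.targets", "hive_table", "sales",
          frozenset({"sales_targets"})),
    Asset("hr.employees", "hive_table", "hr", frozenset({"PII"})),
]

pii_assets = search(CATALOG, tags=["PII"])
sales_assets = search(CATALOG, owner="sales", text="sales")
print([a.name for a in pii_assets])
# prints ['sales.transactions', 'hr.employees']
```

Treating each facet as an optional filter keeps the search composable: omitting a facet simply widens the result set.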

You can edit a dataset after you create it, and the assets contained in the collection are updated accordingly. Datasets support CRUD (Create, Read, Update, Delete) operations.
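
As a rough illustration of the create, read, update, and delete lifecycle, the sketch below models a dataset as a named set of required tags whose membership is recomputed on every read. This is an assumption made for the example: the text does not say whether the product re-evaluates criteria dynamically or stores a fixed asset list, and DatasetStore, its methods, and the sample catalog are hypothetical.

```python
class DatasetStore:
    """In-memory sketch of dataset CRUD; not the product's API."""

    def __init__(self, catalog):
        self._catalog = catalog  # asset name -> set of tags
        self._datasets = {}      # dataset name -> set of required tags

    def create(self, name, required_tags):
        if name in self._datasets:
            raise ValueError(f"dataset {name!r} already exists")
        self._datasets[name] = set(required_tags)

    def read(self, name):
        """Return the sorted names of assets that carry every required tag."""
        required = self._datasets[name]
        return sorted(asset for asset, tags in self._catalog.items()
                      if required <= tags)

    def update(self, name, required_tags):
        if name not in self._datasets:
            raise KeyError(name)
        self._datasets[name] = set(required_tags)

    def delete(self, name):
        del self._datasets[name]


# Illustrative catalog; asset names and tags are assumptions.
catalog = {
    "sales.transactions": {"sales_transactions", "PII"},
    "sales.targets": {"sales_targets"},
    "hr.employees": {"PII"},
}

store = DatasetStore(catalog)
store.create("gdpr", ["PII"])
before = store.read("gdpr")   # ['hr.employees', 'sales.transactions']

store.update("gdpr", ["PII", "sales_transactions"])
after = store.read("gdpr")    # ['sales.transactions']

store.delete("gdpr")
```

Because membership is derived from the criteria in this sketch, editing the criteria is enough to change which assets the dataset contains.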