Access on-premises datasets in place without replication

Learn about the advantages of storing data on-premises and processing it in the public cloud.

In this hybrid model, data is physically stored in on-premises infrastructure (for example, HDFS or Ozone), while processing takes place in the public cloud (for example, using Apache Spark, Hive, or Impala). Workloads submitted to the cloud access the on-premises data directly. The workload's execution plan determines exactly which subsets of the data are read into the cloud for processing, and no data copies are created.
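The idea that the execution plan limits what is read from on-premises storage can be illustrated with a toy model of partition pruning. This is a minimal sketch in plain Python, not how Spark, Hive, or Impala plan queries internally. The host name, table path, file sizes, and the `DataFile` and `plan_remote_reads` names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical on-premises HDFS location; host, port, and path are placeholders.
BASE = "hdfs://onprem-nn.example.com:8020/warehouse/sales"


@dataclass
class DataFile:
    path: str          # location of the file in on-premises storage
    partition: dict    # partition column -> value
    size_bytes: int    # illustrative size, not a real measurement


def plan_remote_reads(files, partition_filter):
    """Return only the files whose partition values match every filter entry."""
    return [
        f for f in files
        if all(f.partition.get(col) == val for col, val in partition_filter.items())
    ]


# Illustrative catalog of files that stay on-premises.
files = [
    DataFile(f"{BASE}/year=2023/month=12/part-0.parquet", {"year": 2023, "month": 12}, 400),
    DataFile(f"{BASE}/year=2024/month=01/part-0.parquet", {"year": 2024, "month": 1}, 250),
    DataFile(f"{BASE}/year=2024/month=02/part-0.parquet", {"year": 2024, "month": 2}, 300),
]

# A cloud workload that filters on year=2024 and month=1 needs only one file.
selected = plan_remote_reads(files, {"year": 2024, "month": 1})
bytes_read = sum(f.size_bytes for f in selected)
total_bytes = sum(f.size_bytes for f in files)

for f in selected:
    print(f.path)
print(f"read {bytes_read} of {total_bytes} bytes")
```

In this model, only the files selected by the filter are fetched by the cloud compute; everything else stays untouched in on-premises storage.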

The primary advantage of this model is that analytics applications typically do not need to be rewritten; logical paths and data structures remain the same. This avoids the cost and complexity of application redesign or full data migration. The model is especially efficient for burst workloads that temporarily need additional compute capacity.

Another benefit is keeping metadata and security management centralized on-premises. This means table definitions, access permissions, audit settings, and governance rules are enforced consistently, whether the query runs in the cloud or on-premises.