Cross-engine timezone handling for Spark

Spark treats TIMESTAMP as an instant with local timezone (LTZ) by default, while Hive, Impala, and Iceberg treat it as a wall-clock value without timezone (NTZ), which can shift displayed hours across engines.

For Spark users, cross-engine consistency in the interpretation of the generic TIMESTAMP keyword is a challenge. In Spark, a TIMESTAMP is traditionally treated as an instant (a specific point in time on a global timeline), whereas Hive, Impala, and Iceberg interpret it as a wall-clock value (a literal date and time regardless of location).

Engine timestamp semantics

The discrepancy creates a shift in data values. Because Spark assumes a timestamp is an instant, it automatically converts the value to UTC when writing and shifts it back to the local session timezone when reading. If the data was originally written as a wall-clock value by Hive or Impala, Spark's attempt to correct for timezones results in the hours shifting. For example, a 12:00 PM entry might suddenly appear as 07:00 AM if the Spark session is set to EST.

Starting in Spark 3.4, Spark SQL introduced a native wall-clock type (TIMESTAMP_NTZ). However, the default behavior of the TIMESTAMP alias remains set to the instant type for backward compatibility. This creates a fundamental mismatch during table creation and metadata sharing across the platform.

Engines generally follow one of two semantic behaviors:

  • Wall-clock (no time zone / NTZ): Represents an abstract "what you see is what you get" time (for example, 12:00 PM). It remains identical for all users regardless of their location.
  • Instant (local time zone / LTZ): Represents a single global point in time, stored as UTC. When queried, the engine shifts the value to match the viewer's local session timezone (for example, 10:00 AM in New York appears as 3:00 PM in London).

The following table shows how each engine interprets the TIMESTAMP keyword and which specific types they support:

Table 1. Engine timestamp handling comparison
Engine Default TIMESTAMP behavior NTZ type name LTZ type name
Spark lower than 3.4 Instant (LTZ) Not supported TIMESTAMP
Spark 3.4 and higher Instant (LTZ) TIMESTAMP_NTZ TIMESTAMP_LTZ
Hive Wall-clock (NTZ) TIMESTAMP TIMESTAMP WITH LOCAL TIME ZONE
Impala Wall-clock (NTZ) TIMESTAMP Not supported
Iceberg Wall-clock (NTZ) timestamp timestamptz
Trino Instant (LTZ) TIMESTAMP TIMESTAMP WITH TIME ZONE

Timestamp handling in tables

Inconsistencies occur when data is written by one engine and read by another, leading to the hour value changing unexpectedly.

Behavior differences by Spark version

  • Spark lower than 3.4: Only supports one timestamp type (LTZ), and strips the timezone during the write. Upon reading, Spark assumes the value is UTC and shifts it to the local session timezone, causing an hour offset.
  • Spark 3.4 and higher: Introduces native TIMESTAMP_NTZ. However, by default, the TIMESTAMP alias still points to TIMESTAMP_LTZ to maintain backward compatibility. This version allows for alignment with Hive and Impala if configured correctly.

Interoperability matrix

The following table summarizes cross-engine behavior when creating or reading tables:

Table 2. Timestamp interoperability matrix
Action Spark version Table format Resulting behavior
Create table Lower than 3.4 Iceberg Created as timestamptz. Impala cannot write to this table. Hive and Spark can read from and write to this table.
Create table 3.4 and higher Iceberg Created as timestamptz (default). Impala cannot write to this table.
Create table 3.4 and higher Iceberg If spark.sql.timestampType=NTZ is set, created as timestamp. Fully compatible with all engines.
Read table Any Hive/Parquet Spark shifts the wall-clock time to local time. In Impala, Hive, and Spark, the hour displayed shifts due to the change in time zone interpretation, even though the underlying UTC timestamp value remains the same.

Workarounds for Spark lower than 3.4

  1. For Iceberg tables:

    In Spark 3.3, Cloudera recommends using timestamp columns in an Iceberg table. Set spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables to true so that when Spark creates an Iceberg table, the Spark type timestamp (with local timezone) is mapped to the Iceberg timestamp (without local timezone) type rather than the timestamptz (with local timezone) type.

    Set spark.sql.iceberg.handle-timestamp-without-timezone to true so that when Spark reads or writes an Iceberg table with an Iceberg timestamp column, the data is treated as Spark timestamp (with local timezone) data. When Hive or Impala read and write the tables created by Spark, they treat them as wall-clock timestamps.

    At the same time, Cloudera recommends setting the Spark session timezone to UTC to achieve consistent results.

  2. For Hive tables:
    1. For working with Hive tables written in Parquet file format, see TIMESTAMP compatibility for Parquet files. Setting spark.sql.parquet.int96TimestampConversion=true means that the values are both read and written in a way that is interoperable with Impala.
    2. For any other Hive tables, when a table is created by Spark and read or written from Hive or Impala, or vice versa, Spark treats the timestamp as UTC and converts it to the current Spark session time, whereas Hive and Impala treat it as wall-clock values. A silent shift in timestamps occurs. In this case, Cloudera recommends setting all Spark session timezones to UTC.

Workarounds for Spark 3.4 and higher

  1. For Iceberg tables:

    Spark has two timestamp types: TIMESTAMP_LTZ and TIMESTAMP_NTZ. Use the spark.sql.timestampType configuration to set the default timestamp type to TIMESTAMP_NTZ or TIMESTAMP_LTZ (the default is TIMESTAMP_LTZ).

    Iceberg tables have a mapping where Spark LTZ maps to Iceberg timestamptz, and Spark NTZ maps to Iceberg timestamp. When Hive or Impala read them, they treat them as wall-clock values.

  2. For Hive tables:

    Same as Spark lower than 3.4. The silent shift issue still exists. Cloudera recommends setting all Spark session timezones to UTC.

Timestamp handling in views

Spark views are managed directly by Spark. When a view uses TIMESTAMP_NTZ, this type is written directly to HMS. Because Hive and its libraries do not recognize TIMESTAMP_NTZ, attempts by Hive-based tools to read the view's metadata result in failures or errors.

This lack of Iceberg-based translation for Spark views creates a compatibility gap, restricting their use in mixed-engine environments compared to Iceberg tables. You must be cautious about using timezone-less timestamps in Spark views intended for query by HMS-dependent tools.

When a view is created in Spark 3.4 and higher using the TIMESTAMP_NTZ type:

  1. HMS rejection: The Hive Metastore does not recognize the name timestamp_ntz.
  2. Spark fallback: To save the view, Spark stores the column in HMS as a dummy type (typically array<string>).
  3. Cross-engine failure: While Spark can still read the view using internal metadata, Hive and Impala see the array<string> type and fail to query the column or return corrupted data.

The root causes are the lack of HMS support for Spark's TIMESTAMP_NTZ type and the mismatch between Spark and Hive timestamp types.

Workarounds for views

To share views across all engines successfully:

  1. Do not create Iceberg views from Spark if TIMESTAMP_NTZ is used. If Iceberg views are created by Spark with TIMESTAMP_NTZ, the views are not readable for Hive or Impala engines.
  2. Instead, create views via Hive (standard TIMESTAMP keyword) and use spark.sql.timestampType=NTZ in Spark when querying.