Cross-engine timezone handling for Spark
Spark treats TIMESTAMP as an instant with local timezone (LTZ) by default, while Hive, Impala, and Iceberg treat it as a wall-clock value without timezone (NTZ), which can shift displayed hours across engines.
For Spark users, cross-engine consistency in the interpretation of the generic
TIMESTAMP keyword is a challenge. In Spark, a TIMESTAMP is
traditionally treated as an instant (a specific point in time on a global timeline), whereas
Hive, Impala, and Iceberg interpret it as a wall-clock value (a literal date and time
regardless of location).
Engine timestamp semantics
The discrepancy creates a shift in data values. Because Spark assumes a timestamp is an instant, it automatically converts the value to UTC when writing and shifts it back to the local session timezone when reading. If the data was originally written as a wall-clock value by Hive or Impala, Spark's attempt to correct for timezones results in the hours shifting. For example, a 12:00 PM entry might suddenly appear as 07:00 AM if the Spark session is set to EST.
Starting in Spark 3.4, Spark SQL introduced a native wall-clock type
(TIMESTAMP_NTZ). However, the default behavior of the
TIMESTAMP alias remains set to the instant type for backward
compatibility. This creates a fundamental mismatch during table creation and metadata sharing
across the platform.
Engines generally follow one of two semantic behaviors:
- Wall-clock (no time zone / NTZ): Represents an abstract "what you see is what you get" time (for example, 12:00 PM). It remains identical for all users regardless of their location.
- Instant (local time zone / LTZ): Represents a single global point in time, stored as UTC. When queried, the engine shifts the value to match the viewer's local session timezone (for example, 10:00 AM in New York appears as 3:00 PM in London).
The following table shows how each engine interprets the TIMESTAMP keyword
and which specific types they support:
| Engine | Default TIMESTAMP behavior | NTZ type name | LTZ type name |
|---|---|---|---|
| Spark lower than 3.4 | Instant (LTZ) | Not supported | TIMESTAMP |
| Spark 3.4 and higher | Instant (LTZ) | TIMESTAMP_NTZ |
TIMESTAMP_LTZ |
| Hive | Wall-clock (NTZ) | TIMESTAMP |
TIMESTAMP WITH LOCAL TIME ZONE |
| Impala | Wall-clock (NTZ) | TIMESTAMP |
Not supported |
| Iceberg | Wall-clock (NTZ) | timestamp |
timestamptz |
| Trino | Instant (LTZ) | TIMESTAMP |
TIMESTAMP WITH TIME ZONE |
Timestamp handling in tables
Inconsistencies occur when data is written by one engine and read by another, leading to the hour value changing unexpectedly.
Behavior differences by Spark version
- Spark lower than 3.4: Only supports one timestamp type (LTZ), and strips the timezone during the write. Upon reading, Spark assumes the value is UTC and shifts it to the local session timezone, causing an hour offset.
- Spark 3.4 and higher: Introduces native
TIMESTAMP_NTZ. However, by default, theTIMESTAMPalias still points toTIMESTAMP_LTZto maintain backward compatibility. This version allows for alignment with Hive and Impala if configured correctly.
Interoperability matrix
The following table summarizes cross-engine behavior when creating or reading tables:
| Action | Spark version | Table format | Resulting behavior |
|---|---|---|---|
| Create table | Lower than 3.4 | Iceberg | Created as timestamptz. Impala cannot write to this table.
Hive and Spark can read from and write to this table. |
| Create table | 3.4 and higher | Iceberg | Created as timestamptz (default). Impala cannot write to
this table. |
| Create table | 3.4 and higher | Iceberg | If spark.sql.timestampType=NTZ is set, created as
timestamp. Fully compatible with all engines. |
| Read table | Any | Hive/Parquet | Spark shifts the wall-clock time to local time. In Impala, Hive, and Spark, the hour displayed shifts due to the change in time zone interpretation, even though the underlying UTC timestamp value remains the same. |
Workarounds for Spark lower than 3.4
- For Iceberg tables:
In Spark 3.3, Cloudera recommends using
timestampcolumns in an Iceberg table. Setspark.sql.iceberg.use-timestamp-without-timezone-in-new-tablestotrueso that when Spark creates an Iceberg table, the Spark typetimestamp(with local timezone) is mapped to the Icebergtimestamp(without local timezone) type rather than thetimestamptz(with local timezone) type.Set
spark.sql.iceberg.handle-timestamp-without-timezonetotrueso that when Spark reads or writes an Iceberg table with an Icebergtimestampcolumn, the data is treated as Sparktimestamp(with local timezone) data. When Hive or Impala read and write the tables created by Spark, they treat them as wall-clock timestamps.At the same time, Cloudera recommends setting the Spark session timezone to UTC to achieve consistent results.
- For Hive tables:
- For working with Hive tables written in Parquet file format, see
TIMESTAMPcompatibility for Parquet files. Settingspark.sql.parquet.int96TimestampConversion=truemeans that the values are both read and written in a way that is interoperable with Impala. - For any other Hive tables, when a table is created by Spark and read or written from Hive or Impala, or vice versa, Spark treats the timestamp as UTC and converts it to the current Spark session time, whereas Hive and Impala treat it as wall-clock values. A silent shift in timestamps occurs. In this case, Cloudera recommends setting all Spark session timezones to UTC.
- For working with Hive tables written in Parquet file format, see
Workarounds for Spark 3.4 and higher
- For Iceberg tables:
Spark has two timestamp types:
TIMESTAMP_LTZandTIMESTAMP_NTZ. Use thespark.sql.timestampTypeconfiguration to set the default timestamp type toTIMESTAMP_NTZorTIMESTAMP_LTZ(the default isTIMESTAMP_LTZ).Iceberg tables have a mapping where Spark LTZ maps to Iceberg
timestamptz, and Spark NTZ maps to Icebergtimestamp. When Hive or Impala read them, they treat them as wall-clock values. - For Hive tables:
Same as Spark lower than 3.4. The silent shift issue still exists. Cloudera recommends setting all Spark session timezones to UTC.
Timestamp handling in views
Spark views are managed directly by Spark. When a view uses
TIMESTAMP_NTZ, this type is written directly to HMS. Because Hive and its
libraries do not recognize TIMESTAMP_NTZ, attempts by Hive-based tools to
read the view's metadata result in failures or errors.
This lack of Iceberg-based translation for Spark views creates a compatibility gap, restricting their use in mixed-engine environments compared to Iceberg tables. You must be cautious about using timezone-less timestamps in Spark views intended for query by HMS-dependent tools.
When a view is created in Spark 3.4 and higher using the TIMESTAMP_NTZ
type:
- HMS rejection: The Hive Metastore does not recognize the name
timestamp_ntz. - Spark fallback: To save the view, Spark stores the column in HMS as a dummy type
(typically
array<string>). - Cross-engine failure: While Spark can still read the view using internal metadata, Hive
and Impala see the
array<string>type and fail to query the column or return corrupted data.
The root causes are the lack of HMS support for Spark's TIMESTAMP_NTZ type
and the mismatch between Spark and Hive timestamp types.
Workarounds for views
To share views across all engines successfully:
- Do not create Iceberg views from Spark if
TIMESTAMP_NTZis used. If Iceberg views are created by Spark withTIMESTAMP_NTZ, the views are not readable for Hive or Impala engines. - Instead, create views via Hive (standard
TIMESTAMPkeyword) and usespark.sql.timestampType=NTZin Spark when querying.
