Known Issues in Apache Hive
Learn about the known issues in Hive, their impact on functionality, and the available workarounds.
- CDPD-28809: `java.lang.OutOfMemoryError: Java heap space` while uploading a file to ABFS

  ```
  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
      at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
      at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
      at org.apache.hadoop.io.ElasticByteBufferPool.getBuffer(ElasticByteBufferPool.java:96)
      at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.writeCurrentBufferToService(AbfsOutputStream.java:414)
      at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.writeCurrentBufferToService(AbfsOutputStream.java:394)
      at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.write(AbfsOutputStream.java:210)
      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
  ```
- How upload works in ABFS: the maximum number of upload requests that can be queued is `2 * 4 * num_available_processors`. Every request holds a byte buffer of size `fs.azure.write.request.size`, which defaults to 8 MB. Thus, if the number of available processors is 8, the total memory used is 2 * 4 * 8 * 8 MB = 512 MB, which can cause the process to run out of memory.
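  The sizing rule above can be sketched as a quick back-of-the-envelope calculation (a minimal illustration; the function name is ours, and the 2 * 4 queue factor and 8 MB default mirror the numbers quoted above):

  ```python
  # Estimate peak buffer memory for ABFS uploads, per the formula above:
  # max queued requests = 2 * 4 * num_available_processors, and each request
  # holds one buffer of fs.azure.write.request.size (default 8 MB).
  def abfs_upload_memory_mb(num_processors: int, request_size_mb: int = 8) -> int:
      max_queued_requests = 2 * 4 * num_processors
      return max_queued_requests * request_size_mb

  # With 8 available processors and the 8 MB default, queued upload buffers
  # alone can pin 2 * 4 * 8 * 8 = 512 MB of heap.
  print(abfs_upload_memory_mb(8))  # 512
  ```

  Reducing `fs.azure.write.request.size` lowers the per-request buffer and thus the peak, at the cost of more, smaller upload requests.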
- CDPD-28809 for GCS: `java.lang.OutOfMemoryError: Java heap space` while uploading a file to Google Cloud Storage

  ```
  at java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:48)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader.buildContentChunk(MediaHttpUploader.java:609)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:408)
  Local Variable: com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.GenericUrl#49
  Local Variable: com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpResponse#24
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
  Local Variable: com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.GenericUrl#48
  Local Variable: com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader#24
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:551)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:475)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:592)
  Local Variable: com.google.cloud.hadoop.repackaged.gcs.com.google.api.services.storage.Storage$Objects$Insert#24
  at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:318)
  ```
- How upload works in GCS: every GCS output stream creates a byte buffer with a chunk size of 64 MB, which can be configured via `fs.gs.outputstream.upload.chunk.size`. If a process creates a large number of streams, these buffers consume a lot of memory and cause the OOM issue. For example, with 31 threads the total memory used is 31 * 64 MB = 1984 MB.
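  The same arithmetic for the GCS side, as a minimal sketch (the function name is ours; the 64 MB default is the chunk size quoted above):

  ```python
  # Estimate memory held by GCS output-stream upload buffers: each open stream
  # allocates one chunk of fs.gs.outputstream.upload.chunk.size (default 64 MB).
  def gcs_stream_memory_mb(num_streams: int, chunk_size_mb: int = 64) -> int:
      return num_streams * chunk_size_mb

  # 31 concurrent streams at the 64 MB default pin 31 * 64 = 1984 MB of heap.
  print(gcs_stream_memory_mb(31))  # 1984
  ```

  Lowering `fs.gs.outputstream.upload.chunk.size` or capping the number of concurrent output streams reduces the peak accordingly.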
- CDPD-15518: ACID tables you write using the Hive Warehouse Connector cannot be read from an Impala virtual warehouse.
- Read the tables from a Hive virtual warehouse, or use Impala queries in Data Hub.
- CDPD-13636: Hive job fails with OutOfMemory exception in the Azure DE cluster
- Set the parameter `hive.optimize.sort.dynamic.partition.threshold=0`. Add this parameter in Cloudera Manager under Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml.
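  The same workaround can also be applied per session; a sketch of both forms (the safety-valve entry is entered as a standard hive-site.xml property):

  ```sql
  -- Per-session form, e.g. in Beeline, before running the failing job:
  SET hive.optimize.sort.dynamic.partition.threshold=0;
  ```

  ```xml
  <!-- Cluster-wide form: hive-site.xml safety-valve entry in Cloudera Manager -->
  <property>
    <name>hive.optimize.sort.dynamic.partition.threshold</name>
    <value>0</value>
  </property>
  ```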
- ENGESC-2214: Hiveserver2 and HMS service logs are not deleted
- Update the Hive log4j configurations in Cloudera Manager:
  - Hive > Configuration > HiveServer2 Logging Advanced Configuration Snippet (Safety Valve)
  - Hive Metastore > Configuration > Hive Metastore Server Logging Advanced Configuration Snippet (Safety Valve)

  Add the following to both configurations:

  ```
  appender.DRFA.strategy.action.type=DELETE
  appender.DRFA.strategy.action.basepath=${log.dir}
  appender.DRFA.strategy.action.maxdepth=1
  appender.DRFA.strategy.action.PathConditions.glob=${log.file}.*
  appender.DRFA.strategy.action.PathConditions.type=IfFileName
  appender.DRFA.strategy.action.PathConditions.nestedConditions.type=IfAccumulatedFileCount
  appender.DRFA.strategy.action.PathConditions.nestedConditions.exceeds=<same value as appender.DRFA.strategy.max>
  ```
- HiveServer Web UI displays incorrect data
- If you enabled auto-TLS for TLS encryption, the HiveServer2 Web UI does not display the correct data in the following tables: Active Sessions, Open Queries, Last Max n Closed Queries
- CDPD-11890: Hive on Tez cannot run certain queries on tables stored in encryption zones
- This problem occurs when the Hadoop Key Management Server (KMS) connection is SSL-encrypted and a self-signed certificate is used. An SSLHandshakeException might appear in the Hive logs.
Technical Service Bulletins
- TSB 2021-501: JOIN queries return wrong result for join keys with large size in Hive
- JOIN queries return wrong results when joining on keys larger than 255 bytes. This happens when the fast hash table join algorithm is enabled, which it is by default.
- Knowledge article
- For the latest update on this issue, see the corresponding Knowledge article: TSB 2021-501: JOIN queries return wrong result for join keys with large size in Hive
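  The section above does not list a workaround, but because the trigger is the fast hash table join, one hedged mitigation sketch is to disable that path per session; verify the property name against your Hive release before relying on it:

  ```sql
  -- Assumption: this standard Hive property controls the native fast hash
  -- table used by vectorized map joins. Disabling it avoids the faulty path
  -- at some join-performance cost.
  SET hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=false;
  ```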
- TSB 2021-518: Incorrect results returned when joining two tables with different bucketing versions
- Incorrect results are returned when joining two tables with different bucketing versions, and with the following Hive configurations: `set hive.auto.convert.join=false` and `set mapreduce.job.reduces=<any custom value>`.
- Knowledge article
- For the latest update on this issue, see the corresponding Knowledge article: TSB 2021-518: Incorrect results returned when joining two tables with different bucketing versions
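  To check whether the two join sides actually carry different bucketing versions, the table parameters can be inspected (a sketch; table names are placeholders, and `bucketing_version` is the table property Hive 3 uses to record this):

  ```sql
  -- bucketing_version appears under Table Parameters in the output:
  DESCRIBE FORMATTED my_table_a;
  DESCRIBE FORMATTED my_table_b;

  -- Or query the single property directly:
  SHOW TBLPROPERTIES my_table_a('bucketing_version');
  ```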
- TSB 2021-520: Cleaner causes data loss when processing an aborted dynamic partitioning transaction
- Data loss may occur when an operation that involves dynamic partitioning is aborted in Hive. The Cleaner does not know which partition contains the aborted deltas, so it goes over all partitions and removes aborted and obsolete deltas below the HighWatermark (the highest write ID that can be cleaned up). Those "obsolete" deltas may actually be active ones. There is no easy way to identify obsolete deltas that are still active, because the HighWatermark is defined at the table level.
- Upstream JIRA
- HIVE-25502
- Knowledge article
- For the latest update on this issue, see the corresponding Knowledge article: TSB 2021-520: Cleaner causes data loss when processing an aborted dynamic partitioning transaction
- TSB 2021-532: HWC fails to write empty DataFrame to orc files
- HWC writes fail when writing an empty DataFrame, because the writer does not create an ORC file if the DataFrame contains no records. This causes the HWC write-commit validation to fail.
- Knowledge article
- For the latest update on this issue, see the corresponding Knowledge article: TSB 2021-532: HWC fails to write empty DataFrame to orc files
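  Until this is resolved, callers can guard the write themselves. A minimal sketch of the guard logic in plain Python with a stand-in object (in real Spark code, the emptiness check would be `df.isEmpty()` on Spark 3.3+ or `df.rdd.isEmpty()` on older versions, and `write_fn` would be the HWC write call):

  ```python
  # Skip the HWC write when the DataFrame is empty, so the writer never has to
  # validate a commit for which no ORC file was produced.
  def write_if_nonempty(df, write_fn):
      if df.isEmpty():       # stand-in for pyspark's DataFrame.isEmpty()
          return False       # nothing written; commit validation never runs
      write_fn(df)
      return True

  # Tiny stand-in used for illustration only; a real caller passes a Spark DataFrame.
  class FakeDataFrame:
      def __init__(self, rows):
          self.rows = rows
      def isEmpty(self):
          return not self.rows

  written = []
  print(write_if_nonempty(FakeDataFrame([]), written.append))          # False
  print(write_if_nonempty(FakeDataFrame([{"a": 1}]), written.append))  # True
  ```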
- TSB 2023-627: IN/OR predicate on binary column returns wrong result
- An IN or an OR predicate involving a binary datatype column may produce wrong results. The OR predicate is converted to an IN due to the setting `hive.optimize.point.lookup`, which is true by default. Only binary data types are affected by this issue. See https://issues.apache.org/jira/browse/HIVE-26235 for example queries which may be affected.
- Upstream JIRA
- HIVE-26235
- Knowledge article
- For the latest update on this issue, see the corresponding Knowledge article: TSB 2023-627: IN/OR predicate on binary column returns wrong result
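  Since the faulty conversion is driven by `hive.optimize.point.lookup`, one hedged per-session mitigation sketch is to disable it for queries that filter on binary columns:

  ```sql
  -- Disable the OR-to-IN point-lookup conversion so binary-column predicates
  -- are evaluated as written (large OR chains lose some optimization).
  SET hive.optimize.point.lookup=false;
  ```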