Using Hive Warehouse Connector with Oozie Spark action
Hive and Spark use different, incompatible Thrift versions. If the Hive Warehouse Connector (HWC) JAR is on Oozie's Spark classpath, Hive classes conflict: the same class can be loaded from Oozie's default Spark classpath with its original signature and from the HWC JAR with a signature changed by the shading process.
Upgrading Thrift in Hive is complicated and may not be resolved in the near future. Therefore, Thrift packages are shaded inside the HWC JAR to make Hive Warehouse Connector work with Spark and Oozie’s Spark action.
This shading process changes the signature of some Hive classes inside the HWC JAR, because the HWC JAR is a fat JAR that also contains Hive classes. Oozie's Spark action has Hive libraries on its classpath as well (added as part of the Cloudera stack), because you can run simple Hive commands with Oozie's Spark action on its own, without HWC. You can also run Hive actions with Hive Warehouse Connector through Oozie's Spark action.
You can resolve this issue using one of the following options:
Option 1: Always use HWC when executing Hive commands using Oozie’s Spark action
- Place the HWC JAR in Oozie's Spark ShareLib.
- Remove all other Hive JARs from Oozie's Spark ShareLib.
Option 2-A: Updating job.properties to execute Hive commands using both HWC and non-HWC
With the job.properties settings described below, the Hive JARs from Oozie's Spark ShareLib are ignored and only the HWC JAR is used.
- Create a new ShareLib using a different name, such as hwc.
- Place the HWC JAR in the new ShareLib.
- Execute a ShareLib update.
- When executing a Spark action using HWC, include the following properties in the job.properties file:
oozie.action.sharelib.for.spark=spark,hwc
oozie.action.sharelib.for.spark.exclude=^.*\/hive\-(?!warehouse-connector).*\.jar$
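To see what the exclude pattern above selects, the following sketch runs it against a few sample paths. It is only an approximation: it assumes GNU grep with PCRE support (`grep -P`) as a stand-in for the regex engine Oozie uses, and the JAR paths and the `is_excluded` helper are hypothetical names made up for illustration.

```shell
#!/usr/bin/env bash
# Exclude pattern copied from the job.properties example.
EXCLUDE_PATTERN='^.*\/hive\-(?!warehouse-connector).*\.jar$'

# Hypothetical helper: succeeds (exit 0) when the path matches the pattern,
# meaning the JAR would be left out of the action's classpath.
# Assumption: grep -P (PCRE) is available; it is needed for the (?!...) lookahead.
is_excluded() {
  printf '%s\n' "$1" | grep -qP -- "$EXCLUDE_PATTERN"
}

# Sample paths are placeholders, not real file names.
for jar in \
  "/user/oozie/share/lib/lib_example/spark/hive-exec-example.jar" \
  "/user/oozie/share/lib/lib_example/hwc/hive-warehouse-connector-assembly-example.jar" \
  "/user/oozie/share/lib/lib_example/spark/spark-core-example.jar"
do
  if is_excluded "$jar"; then
    echo "excluded: $jar"
  else
    echo "kept:     $jar"
  fi
done
```

With these sample paths, the hive-exec JAR is reported as excluded, while the hive-warehouse-connector JAR and the non-Hive JAR are kept, because the negative lookahead rejects any file name that continues with warehouse-connector after hive-.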
Option 2-B: Updating action-level configurations to execute Hive commands using both HWC and non-HWC
If a workflow contains one action that should use HWC and another action that should not, you can achieve the same result by specifying the ShareLib properties at the action level.
<spark xmlns="uri:oozie:spark-action:1.0">
    ...
    <configuration>
        <property xmlns="">
            <name>oozie.action.sharelib.for.spark</name>
            <value>spark,hwc</value>
        </property>
        <property xmlns="">
            <name>oozie.action.sharelib.for.spark.exclude</name>
            <value>spark/hive-.+</value>
        </property>
    </configuration>
    ...
</spark>
Appendix - Creating a new ‘hwc’ ShareLib
- Run kinit as the oozie user.
- Check the currently available ShareLibs:
oozie admin -shareliblist -oozie {url}
- Create a folder for the new ShareLib on HDFS:
hdfs dfs -mkdir /user/oozie/share/lib/lib_{latestTimestamp}/hwc
- Add the JAR files to it.
- Execute a ShareLib update:
oozie admin -sharelibupdate -oozie {url}
- List the ShareLibs again to check that hwc is present:
oozie admin -shareliblist -oozie {url}
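The appendix steps can be collected into one place for review before running them. The sketch below is a dry run: it only prints the commands in order and executes nothing. The function name `print_hwc_sharelib_steps` and its three arguments are hypothetical, the `hdfs dfs -put` line is an assumed way of performing the "add the JAR files" step, and authentication with kinit is left to the operator.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the appendix commands in order; it does not run them.
# Arguments are placeholders supplied by the caller (assumptions):
#   $1  Oozie server URL
#   $2  current lib_{latestTimestamp} directory on HDFS
#   $3  local path of the HWC JAR
print_hwc_sharelib_steps() {
  local oozie_url="$1" lib_dir="$2" hwc_jar="$3"
  # List the ShareLibs that are currently available.
  echo "oozie admin -shareliblist -oozie ${oozie_url}"
  # Create the folder for the new hwc ShareLib.
  echo "hdfs dfs -mkdir ${lib_dir}/hwc"
  # Assumed upload step: copy the HWC JAR into the new folder.
  echo "hdfs dfs -put ${hwc_jar} ${lib_dir}/hwc/"
  # Make Oozie pick up the new ShareLib, then list again to confirm.
  echo "oozie admin -sharelibupdate -oozie ${oozie_url}"
  echo "oozie admin -shareliblist -oozie ${oozie_url}"
}

# Example call with the same placeholders used in the appendix.
print_hwc_sharelib_steps "{url}" "/user/oozie/share/lib/lib_{latestTimestamp}" "/path/to/hwc.jar"
```

Printing rather than executing keeps the sketch safe to try on a machine without cluster access; once the placeholders are replaced with real values, the printed lines can be run one at a time as the oozie user.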