Enabling Spark SQL User Impersonation for the Spark Thrift Server
By default, the Spark Thrift server runs queries under the identity of the operating system account running the Spark Thrift server. In a multi-user environment, queries often need to run under the identity of the end user who originated the query; this capability is called "user impersonation."
When user impersonation is enabled, Spark Thrift server runs Spark SQL queries as the submitting user. By running queries under the user account associated with the submitter, the Thrift server can enforce user-level permissions and access control lists. Associated data cached in Spark is visible only to queries from the submitting user.
User impersonation enables granular access control for Spark SQL queries at the level of files or tables.
The user impersonation feature is controlled with the doAs property. When doAs is set to true, Spark Thrift server launches an on-demand Spark application to handle user queries. These queries are shared only with connections from the same user. Spark Thrift server forwards incoming queries to the appropriate Spark application for execution, making the Spark Thrift server extremely lightweight: it merely acts as a proxy to forward requests and responses. When all user connections for a Spark application are closed at the Spark Thrift server, the corresponding Spark application also terminates.
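For example (the host name and port below are placeholders, not values from this document; check the Thrift server port configured for your deployment), two users connecting through Beeline are each served by their own on-demand Spark application:

```shell
# alice's first connection causes the Thrift server to launch a Spark
# application running as alice; her queries are forwarded to it.
beeline -u "jdbc:hive2://thrift-host:10015/default" -n alice

# bob's connection is routed to a separate Spark application running as
# bob, so data cached by alice's queries is not visible to him.
beeline -u "jdbc:hive2://thrift-host:10015/default" -n bob
```

When the last connection for a given user closes, the Spark application launched on that user's behalf terminates.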
Prerequisites
Spark SQL user impersonation is supported for Apache Spark 1, versions 1.6.3 and later.
If storage-based authorization is to be enabled, complete the instructions in Configuring Storage-based Authorization in the Data Access Guide before enabling user impersonation.
Enabling User Impersonation on an Ambari-managed Cluster
To enable user impersonation for the Spark Thrift server on an Ambari-managed cluster, complete the following steps:
1. Enable doAs support. Navigate to the “Advanced spark-hive-site-override” section and set hive.server2.enable.doAs=true.

2. Add DataNucleus jars to the Spark Thrift server classpath. Navigate to the “Custom spark-thrift-sparkconf” section and set the spark.jars property as follows:

   spark.jars=/usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar

3. (Optional) Disable the Spark YARN application for the Spark Thrift server master. Navigate to the "Advanced spark-thrift-sparkconf" section and set spark.master=local. This prevents launching a spark-client HiveThriftServer2 application master, which is not needed when doAs=true, because queries are executed by the Spark AM launched on behalf of the user. When spark.master is set to local, SparkContext uses only the local machine for driver and executor tasks. (When the Thrift server runs with doAs set to false, you should set spark.master to yarn-client, so that query execution leverages cluster resources.)

4. Restart the Spark Thrift server.
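After the restart, one way to spot-check the setup (the user name, host, and port below are placeholders; the Thrift server port varies by deployment) is to open a connection as a test user and look for the per-user Spark application in YARN:

```shell
# Open a JDBC connection as a test user; with doAs=true this triggers
# an on-demand Spark application owned by that user.
beeline -u "jdbc:hive2://thrift-host:10015/default" -n alice -e "SHOW TABLES;"

# The running-application list should now include a Spark application
# submitted on behalf of alice.
yarn application -list -appStates RUNNING
```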
Enabling User Impersonation on a Cluster Not Managed by Ambari
To enable user impersonation for the Spark Thrift server on a cluster not managed by Ambari, complete the following steps:
1. Enable doAs support. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/hive-site.xml file:

   <property>
     <name>hive.server2.enable.doAs</name>
     <value>true</value>
   </property>

2. Add DataNucleus jars to the Spark Thrift server classpath. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/spark-thrift-sparkconf.conf file:

   spark.jars=/usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar

3. (Optional) Disable the Spark YARN application for the Spark Thrift server master. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/spark-thrift-sparkconf.conf file:

   spark.master=local

   This prevents launching an unused spark-client HiveThriftServer2 application master, which is not needed when doAs=true, because queries are executed by the Spark AM launched on behalf of the user. When spark.master is set to local, SparkContext uses only the local machine for driver and executor tasks. (When the Thrift server runs with doAs set to false, you should set spark.master to yarn-client, so that query execution leverages cluster resources.)

4. Restart the Spark Thrift server.
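Before restarting, it can be handy to sanity-check the edited files. The following is a minimal sketch (the two helper function names are illustrative, and the paths in the usage comment are the HDP defaults from the steps above; adjust them for your layout):

```shell
# Succeeds if the given hive-site.xml sets hive.server2.enable.doAs to true.
doas_enabled() {
  grep -A1 '<name>hive.server2.enable.doAs</name>' "$1" | grep -q '<value>true</value>'
}

# Succeeds if the given spark conf file puts DataNucleus jars on spark.jars.
datanucleus_configured() {
  grep -q '^spark\.jars=.*datanucleus' "$1"
}

# Usage on a Spark Thrift server host:
#   conf_dir=/usr/hdp/current/spark-thriftserver/conf
#   doas_enabled "$conf_dir/hive-site.xml" && echo "doAs enabled"
#   datanucleus_configured "$conf_dir/spark-thrift-sparkconf.conf" && echo "DataNucleus jars configured"
```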
For more information about user impersonation for the Spark Thrift Server, see Using Spark SQL.