SQLContext and HiveContext
Beginning in Spark 2.0, all Spark functionality, including Spark SQL,
can be accessed through the SparkSession class, available as spark when
you launch spark-shell. You can create a DataFrame from an RDD, a Hive
table, or a data source.
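The three DataFrame origins mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed pattern: the helper name build_dataframes, the table name, and the JSON path are all hypothetical, and the function assumes it is handed an existing SparkSession (such as the spark object in spark-shell or pyspark).

```python
def build_dataframes(spark, table_name, json_path):
    """Return DataFrames built from an RDD, a Hive table, and a data source.

    `spark` is assumed to be an existing SparkSession; `table_name` and
    `json_path` are placeholders supplied by the caller.
    """
    # 1. From an RDD: parallelize a small local collection, then name the columns.
    rdd = spark.sparkContext.parallelize([(1, "alice"), (2, "bob")])
    from_rdd = spark.createDataFrame(rdd, ["id", "name"])

    # 2. From a Hive table: look the table up by name in the metastore.
    from_hive = spark.table(table_name)

    # 3. From a data source: read JSON files through the DataFrameReader.
    from_source = spark.read.json(json_path)

    return from_rdd, from_hive, from_source
```

In an interactive session this might be called as build_dataframes(spark, "default.sample_table", "/tmp/people.json"), where both arguments are stand-ins for a real table and path.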
If you use spark-submit, use code like the following at the start of
the program (this example uses Python):
    from pyspark import SparkContext, HiveContext

    sc = SparkContext(appName="test")
    sqlContext = HiveContext(sc)
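Because SparkSession is the unified entry point from Spark 2.0 onward, a spark-submit program can also obtain Hive access through a Hive-enabled session. The sketch below is one possible way to do that; the helper name create_session is hypothetical, and the import is placed inside the function only so the module can be loaded on a machine without PySpark installed.

```python
def create_session(app_name="test"):
    """Build (or reuse) a SparkSession with Hive support enabled."""
    # Deferred import: an illustrative choice, not a requirement of PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()  # connect the session to the Hive metastore
        .getOrCreate()        # reuse an active session if one already exists
    )

# Typical use at the start of a submitted program:
#   spark = create_session("test")
#   spark.sql("SHOW TABLES").show()
#   spark.stop()
```

The session still depends on the Hive client configuration being available on the submitting host.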
The host from which the Spark application is submitted, or on which
spark-shell or pyspark runs, must have a Hive gateway role defined in
Cloudera Manager and client configurations deployed.
When a Spark job accesses a Hive view, Spark must have privileges to
read the data files in the underlying Hive tables. Currently, Spark
cannot use fine-grained privileges based on the columns or the WHERE
clause in the view definition. If Spark does not have the required
privileges on the underlying data files, a Spark SQL query against the
view returns an empty result set rather than an error.
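Because a missing privilege surfaces as an empty result rather than an error, a job may want to check explicitly whether a view returned any rows. The following is a rough sketch under that assumption; the helper name view_returns_rows is hypothetical, and the check cannot tell a permissions problem apart from a view that is genuinely empty.

```python
def view_returns_rows(spark, view_name):
    """Return True if the named view yields at least one row.

    `spark` is assumed to be an existing SparkSession. An empty result is
    only a hint: it may mean missing file privileges or simply no data.
    """
    # take(1) fetches at most one row, which avoids counting the whole view.
    rows = spark.table(view_name).take(1)
    return len(rows) > 0
```

A caller could log a warning or fail the job when this returns False for a view that is expected to contain data.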