Insert data in test_table through Spark

Identify a host to start a spark-shell on the Compute cluster, Compute 2. Open the Cloudera Manager Admin Console and go to Clusters > Compute 2 > SPARK_ON_YARN-1 > Instances.

Open a terminal session host <HiveServer2 Host URL>

Verify access for the user:

Assign the user starting spark-shell to a Linux group that has create/insert access configured in Sentry. .e. hive.

[root@vpc_host-nnwznq-1 ~]# usermod -aG hive systest

The user will also need to be created and added to the group on all the hosts of the Base cluster.

kinit the user:

[root@vpc_host-nnwznq-1 ~]# kinit -kt /cdep/keytabs/systest.keytab systest

Start spark-shell:

[root@vpc_host-nnwznq-1 ~]# sudo -u systest spark-shell

Setting default log level to "WARN".
Spark context Web UI available at http://vpc_host-nnwznq-1.tut.myco.com:4043
Spark context available as 'sc' (master = yarn, app id = application_1552095496593_0007).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/
scala>

Insert data into the table:

scala> val insertData = sqlContext.sql("insert into test_data.test_table values (1998, 'France')")

19/03/10 00:21:26 WARN shims.HdfsUtils: Unable to inherit permissions for file hdfs://ns1/user/hive/warehouse/test_data.db/test_table/part-00000-f0fc5e0d-daa3-4fe2-9ded-d62e1ef45ca9-c000 from file hdfs://ns1/user/hive/warehouse/test_data.db/test_table
insertData: org.apache.spark.sql.DataFrame = []

scala> val tableTestData = sqlContext.sql("select * from test_data.test_table")
tableTestData: org.apache.spark.sql.DataFrame = [year: int, winner: string]
 
scala> tableTestData.show()
+----+-------+                                                                  
|year| winner|
+----+-------+
|2002| Brazil|
|2018| France|
|2014|Germany|
|2010|  Spain|
|2006|  Italy|
|1998| France|
+----+-------+

Verify and track the queries in the Yarn service application on the Compute cluster:

Insert data in test_table through Spark

We want your opinion

How can we improve this page?