For this demonstration, we will be using the tips.csv dataset.
Use the following steps to save this file to a project in
Cloudera Data Science Workbench, and then load it into a table in Apache Impala.
-
Create a new Cloudera Data Science Workbench project.
-
Create a folder called data and upload tips.csv to this folder.
-
The next steps require access to services on the CDH cluster. If Kerberos has been
enabled on the cluster, enter your credentials (username, password/keytab) in Cloudera
Data Science Workbench to enable access.
-
Navigate back to the project Overview page and click Open Workbench.
-
Launch a new session (Python or R).
-
Open the Terminal.
-
Run the following command to create an empty table in Impala called tips. Replace
<impala_daemon_hostname> with the hostname for your Impala
daemon.
impala-shell -i <impala_daemon_hostname>:21000 -q '
CREATE TABLE default.tips (
`total_bill` FLOAT,
`tip` FLOAT,
`sex` STRING,
`smoker` STRING,
`day` STRING,
`time` STRING,
`size` TINYINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
LOCATION "hdfs:///user/hive/warehouse/tips/";'
-
Run the following command to copy the data/tips.csv file from the project into
the table's HDFS location, which loads the data into the Impala table.
hdfs dfs -put data/tips.csv /user/hive/warehouse/tips/
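As an optional check after the steps above, the following sketch confirms that the file reached HDFS and that Impala can read it. It is not part of the original procedure. It assumes the same table name, HDFS path, and port used above. The verify_tips_load function, the run helper, and the DRY_RUN switch are hypothetical names introduced here for illustration. The REFRESH statement is included because Impala caches table metadata, so files added with hdfs dfs -put are generally not visible to queries until the table is refreshed.

```shell
#!/usr/bin/env bash
# Hypothetical placeholder: set IMPALA_HOST to your Impala daemon hostname.
IMPALA_HOST="${IMPALA_HOST:-<impala_daemon_hostname>}"

# run: illustrative helper. With DRY_RUN=1 it prints the command instead of
# executing it (DRY_RUN is an assumption of this sketch, not an Impala option).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# verify_tips_load: list the uploaded file, refresh Impala's metadata for
# the table, then count the rows Impala can see.
verify_tips_load() {
  run hdfs dfs -ls /user/hive/warehouse/tips/ &&
  run impala-shell -i "${IMPALA_HOST}:21000" -q 'REFRESH default.tips;' &&
  run impala-shell -i "${IMPALA_HOST}:21000" -q 'SELECT COUNT(*) FROM default.tips;'
}
```

To use it, paste the block into the session Terminal and run verify_tips_load. Running DRY_RUN=1 verify_tips_load first prints the three commands without executing them. A non-zero row count suggests the load succeeded.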