Loading CSV Data into an Impala Table
For this
demonstration, we will be using the tips.csv dataset. Use the following steps to save
this file to a project in Cloudera Machine Learning, and then load it into a table in Apache
Impala.
- Create a new Cloudera Machine Learning project.
- Create a folder called
data
and upload tips.csv to this folder. For detailed instructions, see Managing Project Files. - The next steps require access to services on the CDH cluster. If Kerberos has been enabled on the cluster, enter your credentials (username, password/keytab) in Cloudera Machine Learning to enable access.
- Navigate back to the project Overview page and click Open Workbench.
- Launch a new session (Python or R).
- Open the Terminal.
- Run the following command to create an empty table in Impala called tips. Replace
<impala_daemon_hostname> with the hostname for your
Impala daemon.
impala-shell -i <impala_daemon_hostname>:21000 -q ' CREATE TABLE default.tips ( `total_bill` FLOAT, `tip` FLOAT, `sex` STRING, `smoker` STRING, `day` STRING, `time` STRING, `size` TINYINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY "," LOCATION "hdfs:///user/hive/warehouse/tips/";'
- Run the following command to load data from the
/data/tips.csv
file into the Impala table.hdfs dfs -put data/tips.csv /user/hive/warehouse/tips/
- Run the following command to create an empty table in Impala called tips. Replace
<impala_daemon_hostname> with the hostname for your
Impala daemon.