Loading CSV Data into an Impala Table

For this demonstration, we will be using the tips.csv dataset.

Use the following steps to save this file to a project in Cloudera Data Science Workbench, and then load it into a table in Apache Impala.
  1. Create a new Cloudera Data Science Workbench project.
  2. Create a folder called data and upload tips.csv to this folder.
  3. The next steps require access to services on the CDH cluster. If Kerberos has been enabled on the cluster, enter your credentials (username, password/keytab) in Cloudera Data Science Workbench to enable access.
  4. Navigate back to the project Overview page and click Open Workbench.
  5. Launch a new session (Python or R).
  6. Open the Terminal.
    1. Run the following command to create an empty table in Impala called tips. Replace <impala_daemon_hostname> with the hostname for your Impala daemon.
      impala-shell -i <impala_daemon_hostname>:21000 -q '
        CREATE TABLE default.tips (
          `total_bill` FLOAT,
          `tip` FLOAT,
          `sex` STRING,
          `smoker` STRING,
          `day` STRING,
          `time` STRING,
          `size` TINYINT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
        LOCATION "hdfs:///user/hive/warehouse/tips/";'
    2. Run the following command to copy the data/tips.csv file from your project into the table's HDFS location, which makes the data available to the Impala table.
      hdfs dfs -put data/tips.csv /user/hive/warehouse/tips/
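
Because the table above is defined with seven columns and a plain "," delimiter, it can help to confirm that every row of the file has seven comma-separated fields before you upload it. The following is a minimal sketch of such a check, not part of Cloudera Data Science Workbench or Impala. The function name check_csv_columns is hypothetical, and the check assumes that awk is available in the session terminal and that no field contains an embedded comma (the table definition above would not handle quoted commas either).

```shell
# Hypothetical helper: report any non-empty line whose number of
# comma-separated fields differs from the expected count.
# Returns 0 if all lines match, 1 otherwise.
check_csv_columns() {
  csv_file="$1"
  expected_fields="$2"
  awk -F',' -v n="$expected_fields" '
    BEGIN { bad = 0 }
    # NF is 0 for empty lines, which are skipped.
    NF > 0 && NF != n {
      printf "line %d: %d fields, expected %d\n", NR, NF, n
      bad = 1
    }
    END { exit bad }
  ' "$csv_file"
}

# Example: check the project file before running hdfs dfs -put.
if [ -f data/tips.csv ]; then
  check_csv_columns data/tips.csv 7 && echo "data/tips.csv: all rows have 7 fields"
fi
```

Note that this check counts a header row like any other line. If your copy of tips.csv begins with column names, that line is loaded into the table as a data row unless you remove it first. Also, Impala may not see the uploaded file until the table metadata is reloaded, for example by running REFRESH default.tips through impala-shell.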