Loading CSV Data into an Impala Table

This demonstration uses the tips.csv dataset. Use the following steps to save this file to a project in Cloudera Machine Learning, and then load it into a table in Apache Impala.
  1. Create a new Cloudera Machine Learning project.
  2. Create a folder called data and upload tips.csv to this folder. For detailed instructions, see Managing Project Files.
  3. The next steps require access to services on the CDH cluster. If Kerberos has been enabled on the cluster, enter your credentials (username, password/keytab) in Cloudera Machine Learning to enable access. For instructions, see Hadoop Authentication with FreeIPA for ML Workspaces.
  4. Navigate back to the project Overview page and click Open Workbench.
  5. Launch a new session (Python or R).
  6. Open the Terminal.
    1. Run the following command to create an empty table in Impala called tips. Replace <impala_daemon_hostname> with the hostname of your Impala daemon.
      impala-shell -i <impala_daemon_hostname>:21000 -q '
        CREATE TABLE default.tips (
          `total_bill` FLOAT,
          `tip` FLOAT,
          `sex` STRING,
          `smoker` STRING,
          `day` STRING,
          `time` STRING,
          `size` TINYINT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
        LOCATION "hdfs:///user/hive/warehouse/tips/";'
    2. Run the following command to copy the project's data/tips.csv file into the table's HDFS location (/user/hive/warehouse/tips/), which loads the data into the Impala table.
      hdfs dfs -put data/tips.csv /user/hive/warehouse/tips/
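
Because the file was copied into HDFS outside of Impala, Impala may not see the new data until the table's metadata is refreshed. The following is a minimal sketch of a follow-up check, not part of the original steps: it assumes the same impala-shell client and port 21000 used above, and verify_tips_load is a hypothetical helper name introduced here for illustration. It issues REFRESH on the table and then counts the rows that Impala can read.

```shell
# Hypothetical helper (not from the original steps): refresh Impala's
# metadata for default.tips, then count the rows Impala can read.
# Argument 1: the hostname of your Impala daemon.
verify_tips_load() {
  # REFRESH makes Impala pick up files added to the table directory via HDFS.
  impala-shell -i "$1:21000" -q 'REFRESH default.tips;' &&
  # Count the loaded rows as a quick sanity check.
  impala-shell -i "$1:21000" -q 'SELECT COUNT(*) FROM default.tips;'
}
```

To use it, call verify_tips_load with your Impala daemon hostname as the only argument. If the reported count is one higher than the number of data rows you expect, check whether your copy of tips.csv begins with a header line, since this table definition treats every line of the file as a row.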