Aggregating and grouping data

You use functions, such as AVG, SUM, or MAX, to aggregate data. The GROUP BY clause groups data query results in one or more columns.

The GROUP BY clause explicitly groups data. Implicit grouping occurs when aggregating the table.
  • You upload a CSV file to S3 having the following data:
    1,jane doe,engineer,service,2022,375000
    2,john smith,sales rep,sales,2021,375000
    3,naoko murai,service rep,service,2022,200000
    4,somporn thong,ceo,sales,2020,500000
    5,xi singh,cfo,finance,2023,222000
    6,mary brown,sales rep,sales,2021,475000
    7,rashida kumar,sales rep,sales,2021,299000
  • You create a table having the following data in Hive or Impala as shown in the examples below:
    +-----+---------------+-------------+---------+------+----------+
    | eid | name          | job         | dept    | year | salary   |
    +-----+---------------+-------------+---------+------+----------+
    | 1   | jane doe      | engineer    | service | 2022 | 375000.0 |
    | 2   | john smith    | sales rep   | sales   | 2021 | 375000.0 |
    | 3   | naoko murai   | service rep | service | 2022 | 200000.0 |
    | 4   | somporn thong | ceo         | sales   | 2020 | 500000.0 |
    | 5   | xi singh      | cfo         | finance | 2023 | 222000.0 |
    | 6   | mary brown    | sales rep   | sales   | 2021 | 475000.0 |
    | 7   | rashida kumar | sales rep   | sales   | 2021 | 299000.0 |
    +-----+---------------+-------------+---------+------+----------+

    Hive and Impala example:

    CREATE EXTERNAL TABLE employees (eid int, name string, job string, dept string, year int, salary float)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3a://cdpsaasdemo-cdp-private-default-1ogvplm/user/andrena';
  1. Run a query that returns the average salary of all employees in the engineering department grouped by year.
    Hive and Impala example:
    SELECT year, AVG(salary) as average_pay
    FROM employees
    WHERE dept = 'sales' GROUP BY year;
    Hive output:
    +-------+--------------+
    | year  | average_pay  |
    +-------+--------------+
    | 2020  | 500000.0     |
    | 2021  | 383000.0     |
    +-------+--------------+
    Impala output:
    +------+-------------+
    | year | average_pay |
    +------+-------------+
    | 2021 | 383000.0    |
    | 2020 | 500000.0    |
    +------+-------------+
  2. Run an implicit grouping query to get the highest paid employee and the average pay.
    Hive and Impala example:
    ELECT MAX(salary) as highest_pay, 
    AVG(salary) as average_pay
    FROM employees
    WHERE dept = 'sales';
    Hive and Impala output:
    +--------------+--------------+
    | highest_pay  | average_pay  |
    +--------------+--------------+
    | 500000.0     | 412250.0     |
    +--------------+--------------+
  3. Delete the employees.csv from S3.