Aggregating and grouping data

You use functions, such as AVG, SUM, or MAX, to aggregate data. The GROUP BY clause groups data query results in one or more columns.

The GROUP BY clause explicitly groups data. Implicit grouping occurs when aggregating the table.

You upload a CSV file to S3 having the following data:

1,jane doe,engineer,service,2022,375000
2,john smith,sales rep,sales,2021,375000
3,naoko murai,service rep,service,2022,200000
4,somporn thong,ceo,sales,2020,500000
5,xi singh,cfo,finance,2023,222000
6,mary brown,sales rep,sales,2021,475000
7,rashida kumar,sales rep,sales,2021,299000

You create a table having the following data in Hive or Impala as shown in the examples below:

+-----+---------------+-------------+---------+------+----------+
| eid | name          | job         | dept    | year | salary   |
+-----+---------------+-------------+---------+------+----------+
| 1   | jane doe      | engineer    | service | 2022 | 375000.0 |
| 2   | john smith    | sales rep   | sales   | 2021 | 375000.0 |
| 3   | naoko murai   | service rep | service | 2022 | 200000.0 |
| 4   | somporn thong | ceo         | sales   | 2020 | 500000.0 |
| 5   | xi singh      | cfo         | finance | 2023 | 222000.0 |
| 6   | mary brown    | sales rep   | sales   | 2021 | 475000.0 |
| 7   | rashida kumar | sales rep   | sales   | 2021 | 299000.0 |
+-----+---------------+-------------+---------+------+----------+

Hive and Impala example:

CREATE EXTERNAL TABLE employees (eid int, name string, job string, dept string, year int, salary float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://cdpsaasdemo-cdp-private-default-1ogvplm/user/andrena';

Run a query that returns the average salary of all employees in the engineering department grouped by year.

Hive and Impala example:

SELECT year, AVG(salary) as average_pay
FROM employees
WHERE dept = 'sales' GROUP BY year;

Hive output:

+-------+--------------+
| year  | average_pay  |
+-------+--------------+
| 2020  | 500000.0     |
| 2021  | 383000.0     |
+-------+--------------+

Impala output:

+------+-------------+
| year | average_pay |
+------+-------------+
| 2021 | 383000.0    |
| 2020 | 500000.0    |
+------+-------------+

Run an implicit grouping query to get the highest paid employee and the average pay.

Hive and Impala example:

ELECT MAX(salary) as highest_pay, 
AVG(salary) as average_pay
FROM employees
WHERE dept = 'sales';

Hive and Impala output:

+--------------+--------------+
| highest_pay  | average_pay  |
+--------------+--------------+
| 500000.0     | 412250.0     |
+--------------+--------------+

Delete the employees.csv from S3.

Aggregating and grouping data

We want your opinion

How can we improve this page?