Hive analytic functions

Apache Hive analytic functions, available since Hive 0.11.0, are a special group of functions that scan multiple input rows (a window) to compute each output value. They are usually used with OVER, PARTITION BY, ORDER BY, and a windowing specification. Unlike regular aggregate functions, which are used with the GROUP BY clause and are limited to one result value per group, analytic functions operate on windows in which the input rows are ordered and grouped using flexible conditions expressed through the OVER clause. Although analytic functions produce aggregate results, they do not collapse the result set. In this article I aim to cover the most common analytic functions in Hive, along with explanations and additional applications.

Below is the dataset used for the analytic functions:

id dept_id dept_amt
1 10 100
2 10 100
3 10 300
4 10 400
5 10 400
6 20 500
7 20 600
8 20 600
9 30 800
10 30 1000

1. COUNT analytic function

The COUNT analytic function is used to obtain the number of rows (non-NULL values of the column) within the window specification.

Syntax:

COUNT( column ) over ( window specification)

For example:

The below query gives the count of IDs per department:

select distinct dept_id, count(id) over (partition by dept_id) as dept_count from
analytic_func

dept_id dept_count
10 5
20 3
30 2
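Hive is not always handy for quick experiments; as a sketch, the same query can be reproduced locally with Python's built-in sqlite3 module, since SQLite (3.25+) supports the same window-function syntax. The table name and data below simply mirror the dataset above.

```python
# Sanity-check sketch: reproduce the Hive COUNT-over-partition query in SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table analytic_func (id int, dept_id int, dept_amt int)")
con.executemany(
    "insert into analytic_func values (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 10, 300), (4, 10, 400), (5, 10, 400),
     (6, 20, 500), (7, 20, 600), (8, 20, 600), (9, 30, 800), (10, 30, 1000)],
)

rows = con.execute("""
    select distinct dept_id,
           count(id) over (partition by dept_id) as dept_count
    from analytic_func
    order by dept_id
""").fetchall()
print(rows)  # [(10, 5), (20, 3), (30, 2)]
```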

2. SUM analytic function

The SUM analytic function is used to obtain the sum of a column's values within the window specification.

Syntax:

SUM( column ) over ( window specification)

The below query provides the total amount spent by each department:

select distinct dept_id, sum(dept_amt) over (partition by dept_id ) as sum_dept_amount from
analytic_func

dept_id sum_dept_amount
10 1300
20 1700
30 1800
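As with COUNT, this query can be sanity-checked locally via sqlite3 (a sketch; the table is rebuilt from the dataset above).

```python
# Sketch: SUM over a partition, checked against SQLite's window functions.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table analytic_func (id int, dept_id int, dept_amt int)")
con.executemany(
    "insert into analytic_func values (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 10, 300), (4, 10, 400), (5, 10, 400),
     (6, 20, 500), (7, 20, 600), (8, 20, 600), (9, 30, 800), (10, 30, 1000)],
)

rows = con.execute("""
    select distinct dept_id,
           sum(dept_amt) over (partition by dept_id) as sum_dept_amount
    from analytic_func
    order by dept_id
""").fetchall()
print(rows)  # [(10, 1300), (20, 1700), (30, 1800)]
```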

3. AVG(AVERAGE) analytic function

The AVG analytic function is used to obtain the average value of a column within the window specification.

Syntax:

AVG( column ) over ( window specification)

The below query provides the average amount spent by each department:

select distinct dept_id, avg(dept_amt) over (partition by dept_id ) as avg_dept_amount from
analytic_func

dept_id avg_dept_amount
10 260.0
20 566.6666666666666
30 900.0
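The same averages can be reproduced in sqlite3 (a sketch; results are rounded in Python only to keep the floating-point comparison readable).

```python
# Sketch: AVG over a partition; SQLite returns floats, rounded here for display.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table analytic_func (id int, dept_id int, dept_amt int)")
con.executemany(
    "insert into analytic_func values (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 10, 300), (4, 10, 400), (5, 10, 400),
     (6, 20, 500), (7, 20, 600), (8, 20, 600), (9, 30, 800), (10, 30, 1000)],
)

raw = con.execute("""
    select distinct dept_id,
           avg(dept_amt) over (partition by dept_id) as avg_dept_amount
    from analytic_func
    order by dept_id
""").fetchall()
rows = [(dept, round(avg, 2)) for dept, avg in raw]
print(rows)  # [(10, 260.0), (20, 566.67), (30, 900.0)]
```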

Additional application of the AVG function:

The AVG function is very helpful in rolling-average calculations. For example, consider the below table
with two columns, month and users; the query that follows it computes a 3-month rolling average of users:

month users
Jan 10
Feb 20
Mar 30
Apr 40
May 50
Jun 60
Jul 70
Aug 80
Sep 90
Oct 100
Nov 110
Dec 120

select month, users
,avg(users) over (order by month rows between 2 preceding and current row) as avg_user
from rolling_average

Note that the ORDER BY column must sort chronologically (e.g. a date or a numeric month) for the window to be correct; text month names such as "Jan" would sort alphabetically.

month users avg_user
Jan 10 10.0
Feb 20 15.0
Mar 30 20.0
Apr 40 30.0
May 50 40.0
Jun 60 50.0
Jul 70 60.0
Aug 80 70.0
Sep 90 80.0
Oct 100 90.0
Nov 110 100.0
Dec 120 110.0
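The rolling window can be sanity-checked locally with sqlite3. Here, `month_num` is an assumed helper column, since text month names do not sort chronologically and the window needs a sortable key.

```python
# Sketch: 3-month rolling average with a ROWS frame, checked in SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
# month_num (1..12) is an assumed helper column for chronological ordering.
con.execute("create table rolling_average (month_num int, month text, users int)")
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
con.executemany(
    "insert into rolling_average values (?, ?, ?)",
    [(i + 1, m, (i + 1) * 10) for i, m in enumerate(months)],
)

rows = con.execute("""
    select month, users,
           avg(users) over (order by month_num
                            rows between 2 preceding and current row) as avg_user
    from rolling_average
    order by month_num
""").fetchall()
print(rows[:4])
# [('Jan', 10, 10.0), ('Feb', 20, 15.0), ('Mar', 30, 20.0), ('Apr', 40, 30.0)]
```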

Rolling averages play a very important role in week-on-week or month-on-month comparisons.
These comparisons can be very effective as Data Quality monitors: a significant variation in them could indicate inaccurate data.

4. MIN and MAX Analytic Functions

The MIN and MAX analytic functions are used to obtain the minimum and maximum values within the window specification.

Syntax:

MIN( column ) over ( window specification)
MAX( column ) over ( window specification)

The below query provides the minimum and maximum amount in each department:

select distinct dept_id
,min(dept_amt) over (partition by dept_id ) as min_dept_amount
,max(dept_amt) over (partition by dept_id ) as max_dept_amount
from analytic_func

dept_id min_dept_amount max_dept_amount
10 100 400
20 500 600
30 800 1000
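A quick local check of the MIN/MAX query, again sketched with sqlite3:

```python
# Sketch: MIN and MAX over the same partition in one query.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table analytic_func (id int, dept_id int, dept_amt int)")
con.executemany(
    "insert into analytic_func values (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 10, 300), (4, 10, 400), (5, 10, 400),
     (6, 20, 500), (7, 20, 600), (8, 20, 600), (9, 30, 800), (10, 30, 1000)],
)

rows = con.execute("""
    select distinct dept_id,
           min(dept_amt) over (partition by dept_id) as min_dept_amount,
           max(dept_amt) over (partition by dept_id) as max_dept_amount
    from analytic_func
    order by dept_id
""").fetchall()
print(rows)  # [(10, 100, 400), (20, 500, 600), (30, 800, 1000)]
```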

5. LAG and LEAD Analytic Functions

LAG and LEAD analytic functions are used to compare different rows within a table using an offset from the current row.

Syntax:

LEAD(column, offset, default) OVER( window specification)
LAG(column, offset, default) OVER( window specification)

Offset is the relative position of the row to be accessed; it defaults to 1. LEAD compares the current row with rows after it, while LAG compares the current row with prior rows. LEAD returns NULL when the lead for the current row extends beyond the end of the window, and LAG returns NULL when the lag extends before the beginning of the window. These NULL values can be replaced using the default argument.

The below query provides, for each row, the amounts immediately before and after it within its department:

select id, dept_id, dept_amt
,lag(dept_amt,1,0) over (partition by dept_id order by id) as lag_dept_amount
,lead(dept_amt,1,0) over (partition by dept_id order by id) as lead_dept_amount
from analytic_func
order by dept_id

id dept_id dept_amt lag_dept_amount lead_dept_amount
1 10 100 0 100
2 10 100 100 300
3 10 300 100 400
4 10 400 300 400
5 10 400 400 0
6 20 500 0 600
7 20 600 500 600
8 20 600 600 0
9 30 800 0 1000
10 30 1000 800 0
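The LAG/LEAD behavior, including the default value of 0 at partition edges, can be sanity-checked with sqlite3 (a sketch over the same dataset):

```python
# Sketch: LAG/LEAD with offset 1 and default 0 at the partition boundaries.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table analytic_func (id int, dept_id int, dept_amt int)")
con.executemany(
    "insert into analytic_func values (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 10, 300), (4, 10, 400), (5, 10, 400),
     (6, 20, 500), (7, 20, 600), (8, 20, 600), (9, 30, 800), (10, 30, 1000)],
)

rows = con.execute("""
    select id, dept_id, dept_amt,
           lag(dept_amt, 1, 0)  over (partition by dept_id order by id) as lag_amt,
           lead(dept_amt, 1, 0) over (partition by dept_id order by id) as lead_amt
    from analytic_func
    order by id
""").fetchall()
print(rows[0])  # (1, 10, 100, 0, 100)   -- first row of dept 10: lag defaults to 0
print(rows[4])  # (5, 10, 400, 400, 0)   -- last row of dept 10: lead defaults to 0
```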

Explanation:

The first row has the value 0 for lag_dept_amount because there are no prior records for that row; since the default value is 0, 0 gets populated.
Similarly, row 5 has the value 0 for lead_dept_amount because there are no further records with dept_id 10 (the partition column here is dept_id).

Additional application for LAG function:

It’s extremely useful in handling SCD (Slowly Changing Dimensions).

Slowly Changing Dimensions (SCD) are dimensions that change slowly over time, rather than on a regular, time-based schedule. Changes in dimension attributes need to be tracked in order to report historical data. For example, if an employee gets promoted on January 1, 2021, that person’s new designation is effective from January 1, 2021,
and the person’s previous designation is active until December 31, 2020.

Existing Record: table_existing

ID Name Designation Start_date End_date
1 AAA Engineer 2018-01-01 9999-12-31

New Record: new_table

ID Name Designation Start_date End_date
1 AAA Manager 2021-01-01 9999-12-31

Expected record ( post update)

ID Name Designation Start_date End_date
1 AAA Engineer 2018-01-01 2020-12-31
1 AAA Manager 2021-01-01 9999-12-31

The LAG function can be used to achieve this.

First, a UNION ALL can be performed to obtain the below dataset:

ID Name Designation Start_date End_date
1 AAA Engineer 2018-01-01 9999-12-31
1 AAA Manager 2021-01-01 9999-12-31

The below query will help us achieve the desired result:

select ID, Name, Designation, Start_date
,COALESCE(date_sub(lag(Start_date) over (partition by ID order by Start_date desc),1),'9999-12-31') as End_date
from slowly_changing_dimension

Explanation:
Because of the descending ORDER BY, the latest record for each ID has no preceding row, so LAG returns NULL and COALESCE substitutes 9999-12-31 as the End_date. For the other rows, the day before the next record's start date becomes the end date.
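The SCD end-dating pattern can be sanity-checked locally with sqlite3. Note that SQLite has no date_sub(); `date(x, '-1 day')` is the equivalent used in this sketch.

```python
# Sketch: deriving End_date from the next record's Start_date via LAG.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""create table slowly_changing_dimension
               (ID int, Name text, Designation text, Start_date text)""")
con.executemany(
    "insert into slowly_changing_dimension values (?, ?, ?, ?)",
    [(1, "AAA", "Engineer", "2018-01-01"),
     (1, "AAA", "Manager", "2021-01-01")],
)

# date(x, '-1 day') stands in for Hive's date_sub(x, 1).
rows = con.execute("""
    select ID, Name, Designation, Start_date,
           coalesce(date(lag(Start_date) over
                             (partition by ID order by Start_date desc),
                         '-1 day'),
                    '9999-12-31') as End_date
    from slowly_changing_dimension
    order by Start_date
""").fetchall()
print(rows)
# [(1, 'AAA', 'Engineer', '2018-01-01', '2020-12-31'),
#  (1, 'AAA', 'Manager', '2021-01-01', '9999-12-31')]
```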

6. ROW_NUMBER, RANK and DENSE_RANK Analytic Functions

The main differences between these three analytic functions are:

ROW_NUMBER always gives a unique value to each row.
RANK gives the same rank to equal values within the window specification, and the rank(s) after a tie get skipped.
DENSE_RANK also gives the same rank to equal values within the window specification, but no rank value gets skipped.

select id,dept_id,dept_amt
,ROW_NUMBER() over(order by dept_amt) as row_num
,RANK() over(order by dept_amt) as rank
,DENSE_RANK() over(order by dept_amt) as dense_rank
from analytic_func

Below is the data obtained:

id dept_id dept_amt row_num rank dense_rank
1 10 100 1 1 1
2 10 100 2 1 1
3 10 300 3 3 2
4 10 400 4 4 3
5 10 400 5 4 3
6 20 500 6 6 4
7 20 600 7 7 5
8 20 600 8 7 5
9 30 800 9 9 6
10 30 1000 10 10 7

(Within ties, e.g. the two rows with dept_amt 100, ROW_NUMBER's assignment among the tied rows is arbitrary, while RANK and DENSE_RANK repeat the same value.)
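The three ranking columns can be sanity-checked with sqlite3, which implements the same semantics (a sketch over the dataset above):

```python
# Sketch: ROW_NUMBER vs RANK vs DENSE_RANK over the same ordering.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table analytic_func (id int, dept_id int, dept_amt int)")
con.executemany(
    "insert into analytic_func values (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 10, 300), (4, 10, 400), (5, 10, 400),
     (6, 20, 500), (7, 20, 600), (8, 20, 600), (9, 30, 800), (10, 30, 1000)],
)

rows = con.execute("""
    select id, dept_amt,
           row_number() over (order by dept_amt) as row_num,
           rank()       over (order by dept_amt) as rnk,
           dense_rank() over (order by dept_amt) as drnk
    from analytic_func
    order by dept_amt, id
""").fetchall()
print([r[3] for r in rows])  # rank:       ties repeat, next rank skipped
print([r[4] for r in rows])  # dense_rank: ties repeat, nothing skipped
```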

I hope you found this article informative and that it gave you an idea of how analytic functions are used in Hive.
Please feel free to reach me at varunkr95@gmail.com