hive-窗口函数

1 窗口函数语法

分析函数/专用窗口函数 over(partition by 列名 order by 列名 rows between 开始位置 and
结束位置)

常用的分析函数

常用的分析函数:sum()、max()、min()、avg()、count()

常用的专用窗口函数

专用窗口函数:row_number()、rank()、dense_rank()

窗口函数

窗口函数的3个组成部分可以单独使用,也可以混合使用,也可以全部不用

over(partition by 列名 order by 列名 rows between 开始位置 and 结束位置)

partition by

对指定的字段进行分组,后续都会以组为单位,把每个分组单独作为一个窗口进行统计分析操作。

案例 01:对窗口中的数据求和,并把求和结果分别分发到对应窗口的每一条数据中

with temp as(
select 'A' as col1,1 as col2
union all 
select 'A' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)

select
col1
,sum(col2) over(partition by col1) as '对窗口中的数据求和'
from temp
输出结果:
col     对窗口中的数据求和
A       2
A       2
B       1

案例 02:对整体数据求和,并把求和结果分发到每一条数据中

with temp as(
select 'A' as col1,1 as col2
union all 
select 'A' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)

select
col1
,sum(col2) over() as '对整体数据求和'
from temp
输出结果:
col     对整体数据求和
A       3
A       3
B       3

order by

order by 与 partition by 连用的时候,可以对各个分组内的数据,按照指定的字段进行排序。如果没有 partition by 指定分组字段,那么会对全局的数据进行排序。

案例 01:对数据进行全局排序

with temp as(
select 'A' as col1,1 as col2
union all 
select 'C' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)

select col1,row_number() over(order by col1 desc) as 排序 from temp
输出结果:
col1     排序 
C        1
C        2
B        3
A        4

案例 02:当排序的维度不存在重复的情况下,即 order by 指定的字段,使用 order by + 分析函数 sum(),可以产生求整体累计数的效果。

with temp_01 as(
select 'A' as col1,1 as col2
union all 
select 'D' as col1,1 as col2
union all 
select 'C' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)

select col1,sum(col2) over(order by col1) as 求累计 from temp_01
输出结果:
col1    求累计
A       1
B       2
C       3
D       4

但是当 order by 指定的字段,数据存在重复的时候,会在不重复的数据中产生累计效果,重复的数据中,会把整体的累计结果分发到每条重复的数据中。

with temp_02 as(
select 'A' as col1,1 as col2
union all 
select 'C' as col1,1 as col2
union all 
select 'C' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)

select col1,sum(col2) over(order by col1) as 求累计 from temp_02
输出结果:
col1    求累计
A       1
B       2
C       4
C       4

案例 03:当排序的维度不存在重复的情况下,即 order by 指定的字段,parition by 与 order by 连用,并使用分析函数 sum() ,可以产生求窗口中累计数的效果。

with temp_01 as(
select 'A' as col1,'b' as col2,1 as col3
union all 
select 'A' as col1,'a' as col2,1 as col3
union all 
select 'C' as col1,'a' as col2,1 as col3
union all 
select 'C' as col1,'b' as col2,1 as col3
)

select col1,sum(col3) over(partition by col1 order by col1,col2) as 求累计 from temp_01
输出结果:
col1     求累计
A        1
A        2
C        1
C        2

但是当 order by 指定的字段组合,数据存在重复的时候,会在不重复的数据中产生累计效果,而重复的数据中,也是会把整体的累计结果分发到每条重复的数据中,如下,用 col1 和 col2 排序。

with temp_02 as(
select 'A' as col1,'b' as col2,1 as col3
union all 
select 'A' as col1,'a' as col2,1 as col3
union all 
select 'C' as col1,'a' as col2,1 as col3
union all 
select 'C' as col1,'a' as col2,1 as col3
)

select col1,sum(col3) over(partition by col1 order by col1,col2) as 求累计 from temp_02
输出结果:
col1     求累计
A        1
A        2
C        2
C        2

rows between 开始位置 and 结束位置

rows between 是用来划分窗口中,函数发挥作用的数据范围。我们用如下例子加深 rows between 的理解。
hive-窗口函数_第1张图片

在A分组中,计算第2行,第2列的值的时候,会从窗口起点(第2行,第2列),计算到当前行(第2行,第2列)。结果是1。
在A分组中,计算第3行,第2列的值的时候,会从窗口起点(第2行,第2列),计算到当前行(第3行,第2列)。结果是2。

rows between 常用的参数如下:

① n preceding:往前

② n following:往后

③ current row:当前行

④ unbounded:起点(一般结合preceding,following使用)

a. unbounded preceding:表示该窗口最前面的行(起点) b. unbounded
following:表示该窗口最后面的行(终点)

使用例子如下:

rows between unbounded preceding and current row(表示从起点到当前行的数据进行) rows
between current row and unbounded following(表示当前行到终点的数据进行) rows
between unbounded preceding and unbounded following (表示起点到终点的数据) rows
between 1 preceding and 1 following(表示往前1行到往后1行的数据) rows between 1
preceding and current row(表示往前1行到当前行)

rows between unbounded preceding and current row与 partition by 、order by 连用,可以产生对窗口中的数据求累计数的效果。

with temp as(
select 'A' as col1,1 as col2
union all 
select 'A' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)

select
col1
,sum(col2) over(partition by col1 order by col2 desc rows between unbounded preceding and current row) as '对窗口中的数据求和'
from temp
输出结果:
col1     对窗口中的数据求和
A        1
A        2
B        1
  1. 排序窗口函数
    2.1 排序并产生自增编号,自增编号不重复且连续
    我们可以使用函数:row_number() over()

数据样例:

col1                             ranks
a            1
b              2
b              3
b              4
c              5
d              6

具体语法如下:

> row_number() over(partition by 列名 order by 列名 rows between 开始位置 and
> 结束位置)

案例如下:

>with temp as(
select 'a' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'c' as col1
union all 
select 'd' as col1
)
>
>select col1,row_number() over(order by col1) as ranks from temp
输出结果:
col1     rank
a        1
b        2
b        3
b        4
c        5
d        6

2.2 排序并产生自增编号,自增编号会重复且不连续
我们可以使用函数:rank() over()

数据样例:

col1     ranks
a        1
b        2
b        2
b        2
c        5
d        6

具体语法如下:

rank() over(partition by 列名 order by 列名 rows between 开始位置 and 结束位置)

案例如下:

with temp as(
select 'a' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'c' as col1
union all 
select 'd' as col1
)

select col1,rank() over(order by col1) as ranks from temp
输出结果:
col1     rank
a        1
b        2
b        2
b        2
c        5
d        6

2.3 排序并产生自增编号,自增编号会重复且连续
我们可以使用函数:dense_rank() over()

数据样例:

col1     ranks
a        1
b        2
b        2
b        2
c        3
d        4

具体语法如下:

dense_rank() over(partition by 列名 order by 列名 rows between 开始位置 and
结束位置)

案例如下:

with temp as(
select 'a' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'c' as col1
union all 
select 'd' as col1
)

select col1,dense_rank() over(order by col1) as ranks from temp	
输出结果:
col1     ranks
a        1
b        2
b        2
b        2
c        3
d        4

聚合窗口函数

3.1 求窗口中的累计值
我们可以使用:sum() over()

with temp as(
select 'A' as col1,1 as col2
union all 
select 'A' as col1,1 as col2
union all
select 'A' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)

select
col1
,sum(col2) over(partition by col1 order by col2 desc rows between unbounded preceding and current row) as '对窗口中的数据求和'
from temp
输出结果:
col1     对窗口中的数据求和
A        1
A        2
A        3
B        1
B        1

3.2 求窗口中 3 天的平均价格
我们可以使用 avg() over()

with temp as(
select 'A' as col1,'2022-11-01' as date_time,50 as price
union all 
select 'A' as col1,'2022-11-02' as date_time,60 as price
union all 
select 'A' as col1,'2022-11-03' as date_time,45 as price
union all 
select 'A' as col1,'2022-11-04' as date_time,70 as price
union all 
select 'A' as col1,'2022-11-05' as date_time,40 as price
union all 
select 'A' as col1,'2022-11-06' as date_time,40 as price
union all 
select 'B' as col1,'2022-11-01' as date_time,40 as price
union all 
select 'B' as col1,'2022-11-02' as date_time,30 as price
union all 
select 'B' as col1,'2022-11-03' as date_time,50 as price
union all 
select 'B' as col1,'2022-11-04' as date_time,50 as price
)

select
col1
,date_time
,price
,avg(price) over(partition by col1 order by date_time rows between 2 preceding and current row) as '3天的平均价格'
from temp

输出结果:
col1      date_time        price      3天的平均价格
A         2022-11-01       50         50
A         2022-11-02       60         55
A         2022-11-03       45         51.666666666666664
A         2022-11-04       70         58.333333333333336
A         2022-11-05       40         51.666666666666664
A         2022-11-06       40         50
B         2022-11-01       40         40
B         2022-11-02       30         35
B         2022-11-03       50         40
B         2022-11-01       50         43.333333333333336

3.3 求分组中的最大值/最小值

with temp_01 as(
select 'A' as col1,10 as col2
union all 
select 'C' as col1,10 as col2
union all 
select 'C' as col1,20 as col2
union all 
select 'A' as col1,20 as col2
union all
select 'A' as col1,20 as col2
)


select
col1
,col2
,max(col2) over(partition by col1) as 窗口中的最大值
,min(col2) over(partition by col1) as 窗口中的最小值
from temp_01

输出结果:
col1     col2     窗口中的最大值     窗口中的最小值
A        10       20                 10
A        20       20                 10
A        20       20                 10
C        10       20                 10
C        20       20                 10

3.4 求分组中的总记录数

with temp_01 as(
select 'A' as col1,'a' as col2
union all 
select 'C' as col1,'a' as col2
union all 
select 'C' as col1,'a' as col2
union all 
select 'A' as col1,'b' as col2
union all
select 'A' as col1,'b' as col2
)


select
col1
,col2
,count(col2) over(partition by col1) as 分组中的记录数
from temp_01

输出结果:
col1     col2     分组中的记录数
A        a        3
A        b        3
A        b        3
C        a        2
C        a        2
  1. 位移窗口函数
    4.1 获取分组中往前 n 行的值
    基础语法:

lead(field,n,default_value) over()

语法解析:

  1. field 是指定的列名

  2. n 是往前的行数

  3. 行往前导致的,最后的 n 行值为 null,可以用 default_value 代替。

使用案例:

with temp_01 as(
select 'A' as col1,'2022-12-01' as date_time
union all 
select 'C' as col1,'2022-12-01' as date_time
union all 
select 'C' as col1,'2022-12-02' as date_time
union all 
select 'A' as col1,'2022-12-02' as date_time
union all
select 'A' as col1,'2022-12-03' as date_time
)

select
col1
,date_time
,lead(date_time,1,'2022-12-30') over(partition by col1 order by date_time) as 往前n行的值 
from temp_01

输出结果:
col1     date_time     往前n行的值
A        2022-12-01    2022-12-02
A        2022-12-02    2022-12-03
A        2022-12-03    2022-12-30
C        2022-12-01    2022-12-02
C        2022-12-02    2022-12-30

4.2 获取分组中往后 n 行的值
基础语法:

lag(field,n, default_value) over()

语法解析:

  1. field 是指定的列名

  2. n 是往前的行数

  3. 行往后导致的,前面的 n 行值为 null,可以用 default_value 代替。

使用案例:

with temp_01 as(
select 'A' as col1,'2022-12-01' as date_time
union all 
select 'C' as col1,'2022-12-01' as date_time
union all 
select 'C' as col1,'2022-12-02' as date_time
union all 
select 'A' as col1,'2022-12-02' as date_time
union all
select 'A' as col1,'2022-12-03' as date_time
)

select
col1
,date_time
,lag(date_time,1,'2022-12-30') over(partition by col1 order by date_time) as 往前n行的值 
from temp_01

输出结果:
col1      date_time     往前n行的值
A         2022-12-01    2022-12-30
A         2022-12-02    2022-12-01
A         2022-12-03    2022-12-02
C         2022-12-01    2022-12-30
C         2022-12-02    2022-12-01

极值窗口函数

5.1 获取分组内第一行的值
我们可以使用 first_value(col,true/false) over(),作用是:取分组内排序后,截止到当前行,第一个值。

注意:

当第二个参数为 true 的时候,会跳过空值 当 over() 中不指定排序的时候,会默认使用表中数据的原排序。
案例:将每行数据换成当前窗口指定字段的第一个值

with temp_01 as(
select 'A' as col1,'b' as col2
union all 
select 'C' as col1,'a' as col2
union all 
select 'C' as col1,'b' as col2
union all 
select 'A' as col1,'a' as col2
union all
select 'A' as col1,'b' as col2
)

select
col1
,first_value(col2) over(partition by col1 order by col2) as 第一个值 
from temp_01
输出结果:
col1     第一个值
A        a
A        a
A        a
C        a 
C        a
select
col1
,first_value(col2) over(partition by col1) as 第一个值 
from temp_01
输出结果:
col1     第一个值
A        b
A        b
A        b
C        a
C        a

5.2 获取分组内最后一行的值
我们可以使用 last_value(col,true/false) over(),作用是:取分组内排序后,截止到当前行,最后一个值。所以,如果使用 order by 排序的时候,想要取最后一个值,需要与 rows between unbounded preceding and unbounded following 连用。

注意:

当第二个参数为 true 的时候,会跳过空值 当 over() 中不指定排序的时候,会默认使用表中数据的原排序。 当 over()
中指定排序的时候,要与 rows between unbounded preceding and unbounded following
连用

with temp_01 as(
select 'A' as col1,'b' as col2
union all 
select 'C' as col1,'a' as col2
union all 
select 'C' as col1,'b' as col2
union all 
select 'A' as col1,'a' as col2
union all
select 'A' as col1,'b' as col2
)


select
col1
,last_value(col2) over(partition by col1 order by col2 rows between unbounded preceding and unbounded following) as 第一个值 
from temp_01

输出结果:
col1       第一个值
A          b
A          b
A          b
C          b
C          b

相信大家都发现了,在本案例中,我们使用 order by 的时候与 rows between unbounded preceding and unbounded following 连用了,这是需要注意的一个点,如果不连用,将会产生以下效果:

with temp_01 as(
select 'A' as col1,'b' as col2
union all 
select 'C' as col1,'a' as col2
union all 
select 'C' as col1,'b' as col2
union all 
select 'A' as col1,'a' as col2
union all
select 'A' as col1,'b' as col2
)


select
col1
,last_value(col2) over(partition by col1 order by col2) as 第一个值 
from temp_01

输出结果:
col1      第一个值
A         a
A         b
A         b
C         a
C         b
  1. 分箱窗口函数
    ntile() over() 分箱窗口函数,用于将分组数据按照顺序切分成 n 片,返回当前切片值,如果切片不均匀,默认增加到第一个切片中。

案例:查询成绩前 20% 的人。

with temp as(
select 'A' as col1,90 as grade
union all 
select 'B' as col1,80 as grade
union all
select 'C' as col1,82 as grade
union all
select 'D' as col1,99 as grade
union all 
select 'E' as col1,100 as grade
union all 
select 'F' as col1,92 as grade
union all
select 'G' as col1,93 as grade
union all
select 'H' as col1,85 as grade
union all
select 'I' as col1,95 as grade
union all
select 'J' as col1,70 as grade
)

select 
col1
,grade
from
    (select
    col1
    ,grade
    ,ntile(5) over(order by grade desc) as level
    from temp
    )t1
where t1.level = 1

输出结果:
col1     grade
E        100
D        99

转载:https://zhuanlan.zhihu.com/p/587440793

你可能感兴趣的:(大数据学习之旅,hive,hadoop,数据仓库)