Hive的窗口函数(Window Functions)是其SQL功能的核心亮点之一,用于在分组数据上执行计算,同时保留原始表的行数(不压缩分组)。窗口函数特别适用于排名分析、趋势计算、移动统计等复杂场景,是处理时间序列数据和多维分析的利器。
窗口函数的语法结构:
function_name(arg1, arg2...) OVER (
[PARTITION BY col1, col2...] -- 分组(类似GROUP BY,但不压缩行数)
[ORDER BY col3, col4...] -- 排序(决定窗口内数据的处理顺序)
[ROWS/RANGE BETWEEN ... AND ...] -- 窗口范围(可选,定义当前行的关联行)
)
函数部分:
sum()
、avg()
、count()
,在窗口内执行聚合。row_number()
、rank()
,生成排名。lead()
、lag()
,获取前后行数据。OVER子句:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
:从分组首行到当前行。ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
:当前行的前1行到后1行(共3行)。row_number()
SELECT
user_id,
order_time,
row_number() OVER (
PARTITION BY user_id
ORDER BY order_time DESC -- 按订单时间倒序
) AS rn
FROM orders;
-- rn=1即为每个用户的最近订单
rank()
vs dense_rank()
rank()
:相同值排名相同,后续排名跳号(如1,1,3)。dense_rank()
:相同值排名相同,后续排名不跳号(如1,1,2)。SELECT
score,
rank() OVER (ORDER BY score DESC) AS rk,
dense_rank() OVER (ORDER BY score DESC) AS drk
FROM students;
-- 若score有两个85分,rk为1,1,3;drk为1,1,2
sum()
/ avg()
/ count()
SELECT
order_date,
amount,
sum(amount) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- 累计到当前行
) AS cumulative_sum,
avg(amount) OVER (
ORDER BY order_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW -- 近3天移动平均
) AS moving_avg
FROM daily_orders;
lead(col, n)
/ lag(col, n)
n
行(lag
)或后n
行(lead
)数据。SELECT
date,
sales,
lag(sales, 1) OVER (ORDER BY date) AS prev_day_sales, -- 前一天销量
lead(sales, 1) OVER (ORDER BY date) AS next_day_sales -- 后一天销量
FROM sales_data;
first_value(col)
/ last_value(col)
last_value
默认在当前行截止,需显式指定窗口范围到分组末尾。SELECT
user_id,
order_time,
amount,
first_value(amount) OVER (
PARTITION BY user_id
ORDER BY order_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_order_amount,
last_value(amount) OVER (
PARTITION BY user_id
ORDER BY order_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_order_amount
FROM orders;
窗口框架定义了当前行的关联行范围,决定了函数的计算边界。语法:
ROWS/RANGE BETWEEN start AND end
ROWS模式:按物理行号定位(不受值影响)。
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING
(当前行的前2行到后1行)。RANGE模式:按值的范围定位(适用于数值或时间类型)。
RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW
(当前日期的前7天内)。UNBOUNDED PRECEDING
:分组的第一行。UNBOUNDED FOLLOWING
:分组的最后一行。CURRENT ROW
:当前行。需求:找出每个部门薪资最高的前3名员工。
解法:
WITH ranked_employees AS (
SELECT
department,
name,
salary,
row_number() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS rn
FROM employees
)
SELECT * FROM ranked_employees WHERE rn <= 3;
需求:计算每日销售额的7日移动平均值。
解法:
SELECT
date,
sales,
avg(sales) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- 包含当前行共7天
) AS rolling_avg
FROM daily_sales;
需求:计算每日销售额的环比增长率(较前一日)。
解法:
SELECT
date,
sales,
prev_day_sales,
((sales - prev_day_sales) / prev_day_sales) * 100 AS growth_rate
FROM (
SELECT
date,
sales,
lag(sales, 1) OVER (ORDER BY date) AS prev_day_sales
FROM daily_sales
) t;
需求:计算每个用户的订单金额占其总金额的累计百分比。
解法:
SELECT
user_id,
order_id,
amount,
sum(amount) OVER (
PARTITION BY user_id
ORDER BY order_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) / sum(amount) OVER (PARTITION BY user_id) AS cumulative_percent
FROM orders;
合理使用窗口框架:
UNBOUNDED FOLLOWING
,可能导致全量数据缓存。ROWS
模式(比RANGE
更高效)。分区与排序优化:
PARTITION BY
的列尽量选择数据分布均匀的字段,避免数据倾斜。大表计算避免全量窗口:
WHERE
过滤后再应用窗口函数。特性 | 窗口函数(Window Functions) | GROUP BY |
---|---|---|
输出行数 | 保留原始行数(不压缩) | 压缩为分组后的行数 |
聚合方式 | 对每个分组内的每行数据单独计算 | 对每个分组汇总为单行结果 |
典型应用 | 排名、移动统计、累计计算 | 分组求和、平均值等 |
能否同时访问原始列 | 是(可保留未参与分组的列) | 否(只能访问分组列或聚合值) |
窗口函数是Hive处理复杂分析需求的核心工具,通过PARTITION BY
、ORDER BY
和窗口框架的组合,可实现排名分析、趋势计算、时间序列处理等高级功能。实际应用中,需根据业务场景选择合适的窗口函数类型(排名、聚合、分析),并注意窗口框架的边界定义,以确保计算结果符合预期。