Hive: Using ROW_NUMBER to Get the Top N Rows per Group

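Today, while querying data in Hive, I needed to fetch the first row of each group. After a quick search I found that Hive has supported the row_number function since version 0.11.0, which covers exactly this need.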

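ROW_NUMBER() generates a sequence number for the rows within each group, starting at 1 and following the specified order.
Usage:

ROW_NUMBER() OVER (PARTITION BY COLUMN_A ORDER BY COLUMN_B ASC/DESC) rn

rn: the alias for the generated row number; numbering restarts at 1 within each group
PARTITION BY: splits the rows into groups, loosely analogous to partitions in a Hive table definition; COLUMN_A is the grouping column
ORDER BY: sorts the rows within each group, ascending by default, add DESC for descending; COLUMN_B is the sort column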
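
To try the syntax above, a small table is needed. The following is a minimal sketch for setting one up. The column types are an assumption, since the table definition is not shown in this post; the rows are the ones that appear in the example output further down. INSERT ... VALUES requires Hive 0.14 or later.

```sql
-- Assumed schema: the original table definition is not shown in the post.
-- If your Hive version rejects `day` as an identifier, quote it with backticks.
CREATE TABLE IF NOT EXISTS test (
  g_field STRING,  -- grouping column
  day     STRING,  -- date as yyyy-MM-dd text
  pv      INT      -- page views
);

INSERT INTO TABLE test VALUES
  ('group01', '2015-04-10', 1),
  ('group01', '2015-04-11', 5),
  ('group01', '2015-04-12', 7),
  ('group01', '2015-04-13', 3),
  ('group01', '2015-04-14', 2),
  ('group01', '2015-04-15', 4),
  ('group01', '2015-04-16', 4),
  ('group02', '2015-04-10', 2),
  ('group02', '2015-04-11', 3),
  ('group02', '2015-04-12', 5),
  ('group02', '2015-04-13', 6),
  ('group02', '2015-04-14', 3),
  ('group02', '2015-04-15', 9),
  ('group02', '2015-04-16', 7);
```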

For example:

SELECT
g_field,
day,
pv,
ROW_NUMBER() OVER(PARTITION BY g_field ORDER BY pv DESC) AS rn
FROM test;

g_field  day         pv  rn
group01  2015-04-12  7   1
group01  2015-04-11  5   2
group01  2015-04-15  4   3
group01  2015-04-16  4   4
group01  2015-04-13  3   5
group01  2015-04-14  2   6
group01  2015-04-10  1   7
group02  2015-04-15  9   1
group02  2015-04-16  7   2
group02  2015-04-13  6   3
group02  2015-04-12  5   4
group02  2015-04-14  3   5
group02  2015-04-11  3   6
group02  2015-04-10  2   7
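
Note that group01 has two rows with pv = 4 and group02 has two rows with pv = 3. ORDER BY pv alone does not say which of the tied rows comes first, so their relative rn values are not guaranteed to be the same on every run. One way to make the numbering deterministic is to add a tiebreaker column. The sketch below assumes day is an acceptable tiebreaker, which is an illustrative choice and not something the original query specifies:

```sql
SELECT
  g_field,
  day,
  pv,
  -- pv DESC ranks by page views; day ASC is an assumed tiebreaker for equal pv
  ROW_NUMBER() OVER (PARTITION BY g_field ORDER BY pv DESC, day ASC) AS rn
FROM test;
```

If tied rows should share the same number instead, RANK() or DENSE_RANK() can be used in place of ROW_NUMBER().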

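With rn available, the top N rows of each group can be selected. Because rn cannot be referenced in the WHERE clause of the same SELECT that computes it, the query is wrapped in a subquery: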

SELECT * FROM
(
SELECT
g_field,
pv,
ROW_NUMBER() OVER(PARTITION BY g_field ORDER BY pv DESC) AS rn
FROM test
) t
WHERE rn <= 1;
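
The same pattern generalizes to any N by changing the filter. Below is a sketch for the top 3 rows per group that also keeps the day column. The subquery alias t and the day tiebreaker are illustrative choices, not part of the original query:

```sql
SELECT g_field, day, pv
FROM (
  SELECT
    g_field,
    day,
    pv,
    -- number rows within each group, highest pv first; day breaks ties
    ROW_NUMBER() OVER (PARTITION BY g_field ORDER BY pv DESC, day ASC) AS rn
  FROM test
) t            -- subquery alias; older Hive versions require one in FROM
WHERE rn <= 3; -- change 3 to the desired N
```

With the sample data shown earlier, this keeps the rows with pv 7, 5, 4 for group01 and 9, 7, 6 for group02.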



http://lxw1234.com/archives/2015/04/181.htm
http://www.jianshu.com/p/51599bab0c00
