Big Data Learning
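Problem: Find the top 2 employees in each department, ranked by highest salary and then by earliest hire date.
Approach: ORDER BY salary DESC, hire_date ASC sorts on both keys; ROW_NUMBER() assigns a unique rank, so ties never produce duplicate positions.
Code template: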
WITH ranked_employees AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY dept_id
            ORDER BY salary DESC, hire_date ASC
        ) AS rn  -- rn instead of rank: RANK is a function name and a reserved word in some dialects
    FROM employees
)
SELECT * FROM ranked_employees WHERE rn <= 2;
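To see the tie-breaking behave as described, here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. SQLite (3.25 or later, for window functions) stands in for the warehouse engine, and the sample rows and the emp_name column are hypothetical.

```python
import sqlite3

# Hypothetical sample data: Ann and Bob tie on salary, Bob was hired earlier.
rows = [
    (1, "Ann", 10, 9000, "2019-03-01"),
    (2, "Bob", 10, 9000, "2018-07-15"),
    (3, "Cid", 10, 8000, "2017-01-10"),
    (4, "Dee", 20, 7000, "2020-05-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(emp_id INTEGER, emp_name TEXT, dept_id INTEGER, salary INTEGER, hire_date TEXT)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", rows)

top2 = conn.execute(
    """
    WITH ranked_employees AS (
        SELECT emp_name, dept_id,
               ROW_NUMBER() OVER (
                   PARTITION BY dept_id
                   ORDER BY salary DESC, hire_date ASC
               ) AS rn
        FROM employees
    )
    SELECT dept_id, emp_name, rn
    FROM ranked_employees
    WHERE rn <= 2
    ORDER BY dept_id, rn
    """
).fetchall()

print(top2)
```

Bob outranks Ann because the second sort key (hire_date) breaks the salary tie, and Cid is cut off at position 3.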
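Problem: Find the top 10% of earners in each department.
Approach: NTILE(10) splits each department into 10 buckets by salary; keep bucket 1. Note that a department with fewer than 10 employees still places one row in bucket 1.
Code template: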
WITH salary_tiles AS (
    SELECT
        *,
        NTILE(10) OVER (
            PARTITION BY dept_id
            ORDER BY salary DESC
        ) AS salary_tile
    FROM employees
)
SELECT * FROM salary_tiles WHERE salary_tile = 1;
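The bucket sizes are worth checking on small departments. The sketch below, run on SQLite from Python with made-up data, counts how many rows land in bucket 1 for a 20-person and a 3-person department. The table contents are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER, dept_id INTEGER, salary INTEGER)")

# Hypothetical data: department 1 has 20 employees, department 2 has only 3.
rows = [(i, 1, 1000 * i) for i in range(1, 21)]
rows += [(100 + i, 2, 1000 * i) for i in range(1, 4)]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

tile_sizes = conn.execute(
    """
    WITH salary_tiles AS (
        SELECT dept_id,
               NTILE(10) OVER (
                   PARTITION BY dept_id
                   ORDER BY salary DESC
               ) AS salary_tile
        FROM employees
    )
    SELECT dept_id, COUNT(*) AS rows_in_tile_1
    FROM salary_tiles
    WHERE salary_tile = 1
    GROUP BY dept_id
    ORDER BY dept_id
    """
).fetchall()

print(tile_sizes)
```

For the 20-person department bucket 1 is exactly 10% of the rows, while for the 3-person department it is one row out of three, so the "top 10%" reading only holds for reasonably large groups.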
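Problem: Find gaps of more than 3 consecutive missing days in each user's login records.
Approach: Subtract the row number from the date. Consecutive missing dates yield the same difference, so that value identifies each gap.
Code template: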
WITH all_dates AS (
    -- generate the date sequence (omitted)
),
missing_dates AS (
    SELECT
        user_id,
        date,
        CASE WHEN login_id IS NULL THEN 1 ELSE 0 END AS is_missing
    FROM all_dates
    LEFT JOIN user_logins USING (user_id, date)
),
missing_groups AS (
    SELECT
        user_id,
        date,
        -- consecutive dates minus consecutive row numbers give a constant value
        DATE_SUB(date, ROW_NUMBER() OVER (
            PARTITION BY user_id, is_missing
            ORDER BY date
        )) AS grp
    FROM missing_dates
    WHERE is_missing = 1
)
SELECT
    user_id,
    MIN(date) AS start_date,
    MAX(date) AS end_date,
    COUNT(*) AS missing_days
FROM missing_groups
GROUP BY user_id, grp
HAVING COUNT(*) > 3;
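The date-minus-row-number trick is easier to trust after watching it run. This is a small sketch on SQLite from Python, assuming the missing dates have already been computed into a hypothetical missing_dates table; julianday() stands in for DATE_SUB because SQLite has no date type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE missing_dates (user_id INTEGER, missing_date TEXT)")

# Hypothetical data: a 4-day gap (Jan 2 to Jan 5) and an isolated missing day (Jan 10).
conn.executemany(
    "INSERT INTO missing_dates VALUES (?, ?)",
    [
        (1, "2023-01-02"),
        (1, "2023-01-03"),
        (1, "2023-01-04"),
        (1, "2023-01-05"),
        (1, "2023-01-10"),
    ],
)

gaps = conn.execute(
    """
    WITH missing_groups AS (
        SELECT user_id,
               missing_date,
               -- day number minus row number is constant within a consecutive run
               julianday(missing_date)
                   - ROW_NUMBER() OVER (
                         PARTITION BY user_id
                         ORDER BY missing_date
                     ) AS grp
        FROM missing_dates
    )
    SELECT user_id, MIN(missing_date), MAX(missing_date), COUNT(*)
    FROM missing_groups
    GROUP BY user_id, grp
    HAVING COUNT(*) > 3
    """
).fetchall()

print(gaps)
```

Only the 4-day run survives the HAVING filter; the single missing day on Jan 10 forms its own group of size 1 and is dropped.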
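Problem: Identify each user's habitual login day of the week (for example, every Wednesday).
Approach: DAYOFWEEK() returns the day of the week; group by user and weekday, count the logins, and keep the most frequent day.
Code template: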
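SELECT user_id, day_of_week, login_count
FROM (
    SELECT
        user_id,
        DAYOFWEEK(login_date) AS day_of_week,
        COUNT(*) AS login_count,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY COUNT(*) DESC
        ) AS rn
    FROM user_logins
    GROUP BY user_id, DAYOFWEEK(login_date)
) t
-- a window function result cannot be filtered in HAVING, so filter in an outer query
WHERE rn = 1;  -- keep the most frequent day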
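Problem: Compute each product's weighted average sales across promotional campaigns, weighted by campaign duration in days.
Approach: SUM(sales * weight) / SUM(weight) gives the weighted average.
Code template: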
SELECT
    product_id,
    SUM(sales * duration_days) / SUM(duration_days) AS weighted_avg_sales
FROM (
    SELECT
        product_id,
        campaign_id,
        SUM(daily_sales) AS sales,
        DATEDIFF(end_date, start_date) + 1 AS duration_days
    FROM sales_records
    GROUP BY product_id, campaign_id, start_date, end_date
) t
GROUP BY product_id;
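The arithmetic behind the formula can be checked by hand. Below is a minimal Python sketch with two hypothetical campaigns, one short and one long, to show how the duration weights pull the result toward the longer campaign.

```python
def weighted_average(pairs):
    """Return sum(value * weight) / sum(weight) for (value, weight) pairs."""
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        raise ValueError("total weight must be non-zero")
    return sum(value * weight for value, weight in pairs) / total_weight


# Hypothetical campaigns as (sales, duration_days).
campaigns = [(100, 2), (400, 8)]

weighted = weighted_average(campaigns)  # (100*2 + 400*8) / (2 + 8)
unweighted = sum(sales for sales, _ in campaigns) / len(campaigns)

print(weighted, unweighted)
```

The weighted result is 340.0 against a plain mean of 250.0, because the 8-day campaign counts four times as much as the 2-day one.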
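Problem: Compute the total amount each user spends within 24 hours after each login.
Approach: JOIN each user's logins to their orders and filter on the time window.
Code template: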
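SELECT
    l.user_id,
    l.login_time,
    COALESCE(SUM(o.amount), 0) AS total_spent  -- 0 instead of NULL for logins with no orders
FROM user_logins l
LEFT JOIN orders o
    ON l.user_id = o.user_id
    -- DATE_ADD(ts, 1) returns a DATE in Hive and Spark and drops the time of day,
    -- so use interval arithmetic to get an exact 24-hour window
    AND o.order_time BETWEEN l.login_time AND l.login_time + INTERVAL '24' HOUR
GROUP BY l.user_id, l.login_time;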
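Problem: Compute salary totals by department, by position, and by the combination of both in a single query.
Approach: GROUPING SETS produces several grouping combinations at once.
Code template: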
SELECT
    dept_id,
    position,
    SUM(salary) AS total_salary
FROM employees
GROUP BY GROUPING SETS(
    (dept_id, position),  -- by department and position
    (dept_id),            -- by department
    (position),           -- by position
    ()                    -- grand total
);
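Problem: Compute year-over-year and month-over-month sales growth for each month of 2023.
Approach: Use LAG() to fetch the previous month's value or the same month of the previous year, or self-JOIN the monthly table with a time offset.
Code template: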
WITH monthly_sales AS (
    SELECT
        YEAR(sale_date) AS sale_year,
        MONTH(sale_date) AS sale_month,
        SUM(amount) AS total_amount
    FROM sales
    GROUP BY YEAR(sale_date), MONTH(sale_date)
)
SELECT
    curr.sale_year,
    curr.sale_month,
    curr.total_amount,
    prev_month.total_amount AS prev_month_amount,
    prev_year.total_amount AS prev_year_amount,
    (curr.total_amount - prev_month.total_amount) / prev_month.total_amount AS mom_growth,
    (curr.total_amount - prev_year.total_amount) / prev_year.total_amount AS yoy_growth
FROM monthly_sales curr
LEFT JOIN monthly_sales prev_month
    -- compare month indexes so that January pairs with the previous December
    ON curr.sale_year * 12 + curr.sale_month
       = prev_month.sale_year * 12 + prev_month.sale_month + 1
LEFT JOIN monthly_sales prev_year
    ON curr.sale_year = prev_year.sale_year + 1
    AND curr.sale_month = prev_year.sale_month
WHERE curr.sale_year = 2023;  -- report 2023 only; earlier months serve as the baseline
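The approach also mentions LAG(), which the template does not use. Here is a minimal sketch of that variant on SQLite from Python, assuming a pre-aggregated monthly_sales table with no missing months (LAG counts rows, not calendar months, so a gap would shift the comparison). The amounts are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_sales (sale_year INTEGER, sale_month INTEGER, total_amount INTEGER)"
)

# Hypothetical data: every 2023 month is 1.5x the same month of 2022.
rows = []
for month in range(1, 13):
    rows.append((2022, month, 100 * month))
    rows.append((2023, month, 150 * month))
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", rows)

growth = conn.execute(
    """
    WITH with_lags AS (
        SELECT sale_year,
               sale_month,
               total_amount,
               LAG(total_amount, 1)  OVER (ORDER BY sale_year, sale_month) AS prev_month_amount,
               LAG(total_amount, 12) OVER (ORDER BY sale_year, sale_month) AS prev_year_amount
        FROM monthly_sales
    )
    SELECT sale_month,
           -- multiply by 1.0 to avoid integer division in SQLite
           1.0 * (total_amount - prev_month_amount) / prev_month_amount AS mom_growth,
           1.0 * (total_amount - prev_year_amount) / prev_year_amount AS yoy_growth
    FROM with_lags
    WHERE sale_year = 2023  -- filter after the lags are computed
    ORDER BY sale_month
    """
).fetchall()

print(growth[:2])
```

The year filter sits in the outer query on purpose: filtering to 2023 before applying LAG() would leave January without a previous month and every month without a previous year.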
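Problem: Count the shops inside each city business district.
Approach: ST_Contains() tests whether a point (the shop) lies inside a polygon (the district).
Code template: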
SELECT
    district_name,
    COUNT(shop_id) AS shop_count
FROM shops s
JOIN districts d
    ON ST_Contains(
        ST_GeomFromText(d.polygon_wkt),     -- district polygon
        ST_Point(s.longitude, s.latitude)   -- shop coordinates
    )
GROUP BY district_name;
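Problem: Find the 3 nearest service points for each user.
Approach: Compute the distance for every user and service point pair with the haversine formula, then take the top N with ROW_NUMBER().
Code template: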
WITH distances AS (
    SELECT
        u.user_id,
        s.service_id,
        -- haversine great-circle distance; 6371 is the Earth's mean radius in km
        6371 * 2 * ASIN(
            SQRT(
                POWER(SIN((s.lat - u.lat) * PI()/180 / 2), 2) +
                COS(u.lat * PI()/180) * COS(s.lat * PI()/180) *
                POWER(SIN((s.lon - u.lon) * PI()/180 / 2), 2)
            )
        ) AS distance_km
    FROM users u
    CROSS JOIN service_points s
)
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY distance_km
        ) AS rn  -- rn instead of rank, which is reserved in some dialects
    FROM distances
) t
WHERE rn <= 3;
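The distance expression in the template is the haversine formula. As a sanity check outside SQL, here is a short Python sketch of the same formula plus a top-N selection; the function names and sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same constant as in the template


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return EARTH_RADIUS_KM * 2 * math.asin(math.sqrt(a))


def nearest_service_points(user, service_points, n=3):
    """user is (lat, lon); service_points is a list of (service_id, lat, lon)."""
    ranked = sorted(
        service_points,
        key=lambda sp: haversine_km(user[0], user[1], sp[1], sp[2]),
    )
    return [sp[0] for sp in ranked[:n]]


# Hypothetical points along the equator at 1, 5, 2 and 90 degrees of longitude.
user = (0.0, 0.0)
service_points = [("a", 0.0, 1.0), ("b", 0.0, 5.0), ("c", 0.0, 2.0), ("d", 0.0, 90.0)]
nearest = nearest_service_points(user, service_points)

print(nearest)
```

A quarter of the way around the equator should come out as a quarter of the circumference, which makes a convenient check on the formula. Note that the CROSS JOIN in the template compares every user with every service point, so on large tables a coarse pre-filter (for example by city or grid cell) is usually needed first.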
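Problem: Count the requests in each hour, along with their average response time.
Approach: DATE_TRUNC() truncates the timestamp to the hour; group by the truncated value.
Code template: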
SELECT
    DATE_TRUNC('HOUR', request_time) AS request_hour,  -- avoid hour, which is also a function name
    COUNT(request_id) AS request_count,
    AVG(response_time) AS avg_response_time
FROM requests
GROUP BY DATE_TRUNC('HOUR', request_time);
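Problem: Compute the average session duration over each user's 5 most recent logins.
Approach: ROWS BETWEEN 4 PRECEDING AND CURRENT ROW defines a sliding window of the current row and the 4 rows before it.
Code template: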
SELECT
    user_id,
    login_time,
    session_duration,
    AVG(session_duration) OVER (
        PARTITION BY user_id
        ORDER BY login_time
        ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    ) AS avg_last_5_sessions
FROM user_sessions;
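To make the frame concrete, the sketch below runs the same window on SQLite from Python with six hypothetical sessions for one user. It shows that the query returns a rolling value for every login, and that the figure for the 5 most recent logins is the value on the latest row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_sessions (user_id INTEGER, login_time TEXT, session_duration INTEGER)"
)

# Hypothetical data: one user, six sessions on consecutive days.
durations = [10, 20, 30, 40, 50, 60]
conn.executemany(
    "INSERT INTO user_sessions VALUES (1, ?, ?)",
    [(f"2023-01-{day:02d} 09:00:00", d) for day, d in enumerate(durations, start=1)],
)

rolling = [
    row[0]
    for row in conn.execute(
        """
        SELECT AVG(session_duration) OVER (
                   PARTITION BY user_id
                   ORDER BY login_time
                   ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
               ) AS avg_last_5_sessions
        FROM user_sessions
        ORDER BY login_time
        """
    )
]

print(rolling)
```

The first four values average fewer than five rows because the frame simply contains whatever rows exist so far; only from the fifth login onward is it a true 5-session average.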
Problem: Pivot user tags from rows (one tag per row) into columns (one column per tag).
Approach: collect_set() aggregates each user's tags into an array; array_contains() tests whether a given tag is present.
Code template:
WITH user_tags AS (
    SELECT
        user_id,
        collect_set(tag) AS tags
    FROM user_tag_mapping
    GROUP BY user_id
)
SELECT
    user_id,
    CASE WHEN array_contains(tags, 'vip') THEN 1 ELSE 0 END AS is_vip,
    CASE WHEN array_contains(tags, 'new') THEN 1 ELSE 0 END AS is_new
    -- add one CASE expression per additional tag
FROM user_tags;
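Problem: Break down spending by age group and gender.
Approach: CASE WHEN inside SUM() splits the amount by gender; GROUP BY age_group supplies the other dimension.
Code template: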
SELECT
    age_group,
    SUM(CASE WHEN gender = 'M' THEN amount ELSE 0 END) AS male_amount,
    SUM(CASE WHEN gender = 'F' THEN amount ELSE 0 END) AS female_amount,
    SUM(amount) AS total_amount
FROM users u
JOIN orders o USING (user_id)
GROUP BY age_group;
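Problem: Return each employee's full path through all of their managers.
Approach: A recursive CTE starts at the root and appends each employee's name to the manager's path.
Code template: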
WITH RECURSIVE employee_hierarchy AS (
    -- anchor member: the root of the tree
    SELECT
        emp_id,
        manager_id,
        emp_name,
        CAST(emp_name AS STRING) AS path
    FROM employees
    WHERE manager_id IS NULL  -- root node (CEO)
    UNION ALL
    -- recursive member: attach each employee to the manager's path
    SELECT
        e.emp_id,
        e.manager_id,
        e.emp_name,
        CONCAT(eh.path, ' -> ', e.emp_name) AS path
    FROM employees e
    JOIN employee_hierarchy eh ON e.manager_id = eh.emp_id
)
SELECT * FROM employee_hierarchy;
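Recursive CTEs are not available in every engine, so it helps to try the pattern somewhere it is known to work. This sketch runs it on SQLite from Python with a hypothetical four-person org chart; SQLite's || operator replaces CONCAT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER, manager_id INTEGER, emp_name TEXT)")

# Hypothetical org chart: CEO -> VP -> Engineer, and CEO -> CFO.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, None, "CEO"), (2, 1, "VP"), (3, 2, "Engineer"), (4, 1, "CFO")],
)

paths = dict(
    conn.execute(
        """
        WITH RECURSIVE employee_hierarchy AS (
            SELECT emp_id, emp_name AS path
            FROM employees
            WHERE manager_id IS NULL
            UNION ALL
            SELECT e.emp_id, eh.path || ' -> ' || e.emp_name
            FROM employees e
            JOIN employee_hierarchy eh ON e.manager_id = eh.emp_id
        )
        SELECT emp_id, path FROM employee_hierarchy
        """
    ).fetchall()
)

print(paths[3])
```

Each pass of the recursive member extends the paths by one level, and the recursion stops once a pass produces no new rows. A cycle in manager_id would make it loop, so real data may need a depth limit.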
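Problem: Compute total sales for each region including its sub-regions.
Approach: Roll sub-region sales up the hierarchy recursively, then total them with SUM() OVER (PARTITION BY region_id).
Code template: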
WITH region_sales AS (
    -- base sales per region (omitted)
),
region_hierarchy AS (
    -- region parent/child relationships (omitted)
),
recursive_sales AS (
    -- recursively compute sub-region sales (omitted)
)
SELECT
    region_id,
    region_name,
    SUM(sales_amount) OVER (
        PARTITION BY region_id
    ) AS total_sales
FROM recursive_sales;
Problem: Extract user_id and action from log lines of the form [user_id:1001][action:click].
Approach: Use regexp_extract(), or substr() combined with instr(), to pull out the substrings.
Code template:
SELECT
    regexp_extract(log_line, '\\[user_id:(\\d+)\\]', 1) AS user_id,
    regexp_extract(log_line, '\\[action:(\\w+)\\]', 1) AS action
FROM logs;
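The doubled backslashes in the template are string-literal escaping; the regular expression itself is \[user_id:(\d+)\]. A quick way to test such a pattern before putting it in a query is Python's re module, as in this small sketch (the helper name and the empty-string fallback are choices made for the example).

```python
import re

# Same patterns as in the template, written as raw strings so no double escaping is needed.
USER_ID_RE = re.compile(r"\[user_id:(\d+)\]")
ACTION_RE = re.compile(r"\[action:(\w+)\]")


def parse_log_line(log_line):
    """Return (user_id, action) as strings; a missing field becomes an empty string."""
    user_match = USER_ID_RE.search(log_line)
    action_match = ACTION_RE.search(log_line)
    return (
        user_match.group(1) if user_match else "",
        action_match.group(1) if action_match else "",
    )


parsed = parse_log_line("[user_id:1001][action:click]")
print(parsed)
```

The captured user_id is a string, so cast it in the query if a numeric column is needed.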
Problem: Find records whose product name contains a given keyword.
Approach: Match with LIKE or REGEXP, or compute the edit distance with levenshtein_distance().
Code template:
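-- Method 1: substring match
SELECT * FROM products WHERE product_name LIKE '%keyword%';
-- Method 2: regular expression match
SELECT * FROM products WHERE product_name REGEXP 'keyword';
-- Method 3: similarity by edit distance
-- (levenshtein_distance is the Presto/Trino name; Hive and Spark call it levenshtein)
SELECT *
FROM products
WHERE levenshtein_distance(product_name, 'target name') <= 3;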
Use EXPLAIN to inspect the execution plan and avoid full table scans.