We will build a structured dataset with the following fields, where each record represents one candidate:
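Field | Meaning | Type | Example |
---|---|---|---|
degree | Education level | Categorical (本科 bachelor's / 硕士 master's / 博士 doctorate) | "硕士" |
university_type | University tier | Categorical (双一流 Double First-Class / 普通 regular) | "双一流" |
work_years | Years of work experience | Numeric | 3 |
skill_python | Knows Python | Boolean (0/1) | 1 |
skill_sql | Knows SQL | Boolean (0/1) | 1 |
skill_ml | Knows machine learning | Boolean (0/1) | 0 |
project_count | Number of projects | Numeric | 2 |
project_desc_len | Total length of project descriptions (characters) | Numeric | 500 |
pass_screening | Passed the initial résumé screening | Target (0/1) | 1 |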
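The field table can also be written down as a small validation helper, which is handy for catching malformed records before training. The sketch below is only an illustration: the names `ALLOWED_VALUES`, `NON_NEGATIVE_FIELDS`, and `validate_record` are hypothetical, and the "non-negative" rule for numeric fields is an assumption, since the table gives types but no explicit ranges.

```python
# Allowed values are read off the field table above.
ALLOWED_VALUES = {
    "degree": {"本科", "硕士", "博士"},
    "university_type": {"双一流", "普通"},
    "skill_python": {0, 1},
    "skill_sql": {0, 1},
    "skill_ml": {0, 1},
    "pass_screening": {0, 1},
}

# Assumption: numeric fields only need to be non-negative.
NON_NEGATIVE_FIELDS = ("work_years", "project_count", "project_desc_len")


def validate_record(record):
    """Return a list of problems found in one candidate record (empty if none)."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] not in allowed:
            problems.append(f"{field}: unexpected value {record[field]!r}")
    for field in NON_NEGATIVE_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] < 0:
            problems.append(f"{field}: must be non-negative")
    return problems


# The example row from the table.
example = {
    "degree": "硕士", "university_type": "双一流", "work_years": 3,
    "skill_python": 1, "skill_sql": 1, "skill_ml": 0,
    "project_count": 2, "project_desc_len": 500, "pass_screening": 1,
}
print(validate_record(example))
```

Returning a list of messages rather than raising on the first error makes it easier to report every problem in a record at once.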
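We will generate 200 sample records that simulate random résumés, with rules including:
A training data file named resume_data.csv
Use the following code to generate the sample data: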
import pandas as pd
import numpy as np
import os
# Fix the random seed so the dataset is reproducible
np.random.seed(42)
n = 200
# Simulate the fields
degrees = np.random.choice(['本科', '硕士', '博士'], size=n, p=[0.6, 0.3, 0.1])
univ_types = np.random.choice(['普通', '双一流'], size=n, p=[0.7, 0.3])
work_years = np.random.randint(0, 11, size=n)  # 0 to 10 years
skill_python = np.random.choice([0, 1], size=n, p=[0.3, 0.7])
skill_sql = np.random.choice([0, 1], size=n, p=[0.4, 0.6])
skill_ml = np.random.choice([0, 1], size=n, p=[0.6, 0.4])
project_count = np.random.randint(0, 6, size=n)  # 0 to 5 projects
project_desc_len = np.random.randint(50, 1000, size=n)  # 50 to 999 characters
# Build a score as a weighted sum of the features
score = (
    (degrees == '博士') * 2 +
    (degrees == '硕士') * 1 +
    (univ_types == '双一流') * 1 +
    work_years * 0.2 +
    skill_python * 1.5 +
    skill_sql * 1.2 +
    skill_ml * 1.3 +
    project_count * 0.8 +
    (project_desc_len / 500)
)
prob = 1 / (1 + np.exp(-(score - 7)))  # sigmoid centered at a score of 7
# Sample the label: each candidate passes with probability prob
pass_screening = (np.random.rand(n) < prob).astype(int)
# Build the DataFrame
df = pd.DataFrame({
    "degree": degrees,
    "university_type": univ_types,
    "work_years": work_years,
    "skill_python": skill_python,
    "skill_sql": skill_sql,
    "skill_ml": skill_ml,
    "project_count": project_count,
    "project_desc_len": project_desc_len,
    "pass_screening": pass_screening
})
# Save to CSV, creating the output directory first if it does not exist
csv_path = "./data/resume_data.csv"
os.makedirs(os.path.dirname(csv_path), exist_ok=True)
df.to_csv(csv_path, index=False)
print(f"✅ Sample data generated and saved to {csv_path}")
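To make the scoring rule in the generation code more concrete, it can help to work through a single candidate by hand. The sketch below reuses the same weights and the same sigmoid centered at 7, applied to the example row from the field table. The function name `screening_probability` is hypothetical and introduced only for illustration.

```python
import math


def screening_probability(degree, university_type, work_years,
                          skill_python, skill_sql, skill_ml,
                          project_count, project_desc_len):
    """Return (score, pass probability) using the same weights as the generator."""
    score = (
        (degree == "博士") * 2
        + (degree == "硕士") * 1
        + (university_type == "双一流") * 1
        + work_years * 0.2
        + skill_python * 1.5
        + skill_sql * 1.2
        + skill_ml * 1.3
        + project_count * 0.8
        + project_desc_len / 500
    )
    prob = 1 / (1 + math.exp(-(score - 7)))  # sigmoid centered at 7
    return score, prob


# Example row from the field table: master's degree, Double First-Class university,
# 3 years of experience, Python and SQL but no ML, 2 projects, 500 characters.
score, prob = screening_probability("硕士", "双一流", 3, 1, 1, 0, 2, 500)
print(round(score, 2), round(prob, 2))
```

For this candidate the terms add up to 1 + 1 + 0.6 + 1.5 + 1.2 + 0 + 1.6 + 1.0 = 7.9, which is 0.9 above the sigmoid's midpoint and gives a pass probability of about 0.71. A candidate scoring exactly 7 would pass half the time, so the label is noisy by design rather than a hard threshold.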
A field description dictionary or document (as shown in Chapter 1 above)