$$J(S) = \sum_{i=1}^{n} |w_i| \quad \text{(feature importance score)}$$
$$S_{k+1} = S_k \setminus \{\arg\min_{f \in S_k} |w_f|\} \quad \text{(iterative elimination strategy)}$$
Example: in a credit scoring model, feature importance is measured by the absolute values of the linear-regression coefficients, and the feature with the smallest weight is removed in each iteration.
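The same greedy loop is available as scikit-learn's `RFE`; below is a minimal sketch on synthetic data (the dataset shape and parameter values are illustrative, not taken from the credit-scoring case):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data standing in for the credit-scoring features
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8, random_state=0)

# step=1: drop the feature with the smallest |coefficient| in each round
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10, step=1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the retained features
print(rfe.ranking_)   # 1 = kept; larger values were eliminated earlier
```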
$$\hat{w}_i^{(t)} = \alpha\, w_i^{(t)} + (1-\alpha)\, \hat{w}_i^{(t-1)} \quad \text{(exponential smoothing)}$$
Example: in e-commerce customer-churn prediction, a moving-average strategy is used to stabilize the feature-importance estimates across elimination rounds.
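A minimal numeric sketch of the smoothing update (the per-round scores and $\alpha = 0.1$ below are made-up values for illustration):

```python
import numpy as np

alpha = 0.1
raw_rounds = [np.array([0.80, 0.10, 0.50, 0.05, 0.30]),  # |w_i| from round 1 (made-up)
              np.array([0.70, 0.20, 0.60, 0.02, 0.40])]  # |w_i| from round 2 (made-up)

smoothed = raw_rounds[0]
for raw in raw_rounds[1:]:
    smoothed = alpha * raw + (1 - alpha) * smoothed       # the update from the formula above
print(smoothed)  # per-round noise is damped before features are ranked
```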
```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Custom feature scorer: a linear model whose |weights| serve as importance scores
class FeatureScorer(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, x):
        # Predictions used to fit the scoring model
        return self.linear(x).squeeze(-1)

    def importance(self):
        # Per-feature score: absolute value of the learned weights
        return torch.abs(self.linear.weight.detach()).mean(dim=0)

# RFE step: fit the scorer, then keep the k highest-scoring features
def rfe_elimination(X, y, k=10):
    scorer = FeatureScorer(X.shape[1])
    optimizer = torch.optim.Adam(scorer.parameters())
    for _ in range(100):  # train the scoring model
        optimizer.zero_grad()
        loss = F.mse_loss(scorer(X), y)
        loss.backward()
        optimizer.step()
    scores = scorer.importance().numpy()
    return np.argsort(scores)[-k:]  # keep the top-k features
```
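A quick smoke test of the function above on random tensors (the shapes and `k` are arbitrary choices for illustration):

```python
X = torch.randn(200, 50)   # 200 samples, 50 candidate features
y = torch.randn(200)
top_features = rfe_elimination(X, y, k=10)
print(top_features)        # column indices of the 10 highest-scoring features
```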
```python
import tensorflow as tf

class DynamicRFE(tf.keras.Model):
    def __init__(self, feature_size):
        super().__init__()
        self.selector = tf.keras.layers.Dense(1, use_bias=False)
        # Exponentially smoothed per-feature importance scores
        self.importance_history = tf.Variable(
            initial_value=tf.zeros(feature_size),
            trainable=False
        )

    def call(self, inputs):
        self.selector(inputs)  # builds the layer so its kernel is available
        current_weights = tf.abs(tf.squeeze(self.selector.kernel))
        # Smoothing update with alpha = 0.1, matching the formula above
        self.importance_history.assign(
            0.9 * self.importance_history + 0.1 * current_weights
        )
        return self.importance_history  # return the tensor rather than .numpy() to stay graph-safe
```
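An illustrative call of the class above (batch size and feature count are arbitrary):

```python
model = DynamicRFE(feature_size=20)
batch = tf.random.normal((32, 20))
scores = model(batch)      # Dense layer is built on this first call
print(scores.numpy())      # exponentially smoothed per-feature importance scores
```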
Scenario: a bank's credit-card fraud detection system

| Metric            | Before optimization | After optimization |
|--------------------|---------------------|--------------------|
| AUC                | 0.812               | 0.837              |
| Inference latency  | 58 ms               | 22 ms              |
| Memory footprint   | 2.3 GB              | 0.8 GB             |
Example: COVID-19 CT image classification
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_features_to_select': [10, 20, 30],
    'step': [1, 5, 10],
    # RandomForestClassifier exposes feature_importances_, not coef_
    'importance_getter': ['auto', 'feature_importances_']
}
grid_search = GridSearchCV(
    RFE(estimator=RandomForestClassifier()),
    param_grid,
    cv=5
)
```
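Fitting the search and reading out the winning configuration (here `X` and `y` stand for the task's feature matrix and labels):

```python
grid_search.fit(X, y)            # X, y: feature matrix and labels for the task
print(grid_search.best_params_)  # e.g. chosen number of features and step size
print(grid_search.best_score_)   # mean cross-validated score of that setting
```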
```python
# Parallel processing with Dask
import dask_ml.feature_selection
from dask_ml.linear_model import LogisticRegression

dask_rfe = dask_ml.feature_selection.RFECV(
    estimator=LogisticRegression(),
    cv=3,
    n_jobs=-1
)
dask_rfe.fit(dask_X, dask_y)  # handles 100+ GB datasets
```
```bash
pip install featuretools[rfe]
```
```python
from deep_rfe import DeepFeatureSelector

selector = DeepFeatureSelector(pretrained='resnet50')
```
```python
from river import feature_selection
from river.linear_model import LinearRegression

online_rfe = feature_selection.RFE(
    model=LinearRegression(),
    n_features=15,
    step=2
)
```