☕ Buy the author a coffee to support more in-depth content
Have you ever wondered whether there is a way to scrape comment data from a novel platform automatically and reliably? Today we will take apart a real project, step by step: building an efficient novel-comment crawler with WebMagic + Spring Boot + MyBatis.
This is more than a crawler demo; it is an engineered data-collection solution.
The project is organized as follows:
com.catty.novel.an
├── controller.admin.SpiderController   # REST endpoints
├── spider.SpiderService                # builds the crawl workflow
├── spider.NovelPageProcessor           # page-parsing logic
├── spider.NovelPipeline                # persistence logic
├── mbg.mapper.VNovelIndexMapper        # novel index mapper
├── mbg.mapper.VNovelCommentMapper      # comment mapper
└── mbg.entity.NovelIndexDto            # index DTO
Let's start with the controller and walk through the whole crawl flow.
@RestController
@RequestMapping("/crawler")
public class SpiderController {

    @Autowired
    private SpiderService spiderService;

    @PostMapping("/start")
    public String startOnce() {
        spiderService.startCrawl();
        return "Crawl started";
    }

    @PostMapping("/test")
    public String testOnce(@RequestParam("bookName") String bookName,
                           @RequestParam("bookId") Integer bookId) {
        new Thread(() -> spiderService.testCrawl(bookName, bookId)).start();
        return "Test crawl started for bookName = " + bookName;
    }
}
You can either schedule a full crawl over every indexed novel (`start`) or test a single book (`test`).
@Service
public class SpiderService {

    private static final String URL_SUGGEST = NovelPageProcessor.URL_SUGGEST;

    @Autowired
    private VNovelIndexMapper indexMapper;
    @Autowired
    private NovelPageProcessor processor;
    @Autowired
    private NovelPipeline pipeline;

    public void testCrawl(String bookName, Integer bookId) {
        NovelIndexDto novelIndexDto = indexMapper.selectByNovelId(bookId);
        Spider spider = Spider.create(processor)
                .thread(1)
                .addPipeline(pipeline)
                .setExitWhenComplete(true);
        // Build the JSON body for the suggest (search) endpoint; curly quotes
        // and backticks are normalized to plain apostrophes.
        JSONObject json = new JSONObject();
        json.put("keyword", bookName.replaceAll("[‘’`]", "'"));
        HttpRequestBody body = HttpRequestBody.json(json.toJSONString(), "UTF-8");
        Request req = new Request(URL_SUGGEST);
        req.setMethod(HttpConstant.Method.POST).setRequestBody(body);
        // Carry matching context along with the request; the description is
        // capped at 20 characters and used later as a prefix match.
        req.putExtra("type", "suggest")
                .putExtra("novelName", novelIndexDto.getOriginName())
                .putExtra("novelId", bookId)
                .putExtra("novelDesc", novelIndexDto.getNovelDesc()
                        .substring(0, Math.min(20, novelIndexDto.getNovelDesc().length()))
                        .replaceAll("[‘’`]", "'"));
        spider.addRequest(req);
        spider.run();
    }
}
This step builds the `suggest` request, which looks up the bookId on the target site by keyword.
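The sanitization applied when building the request (curly-quote normalization plus 20-character truncation of the description) can be isolated into a small helper for testing. This is just a sketch of that logic; `KeywordSanitizer` is a hypothetical name, not part of the project above:

```java
// Hypothetical helper mirroring the string cleanup done before the suggest request.
public class KeywordSanitizer {

    // Replace curly apostrophes and backticks with a plain ASCII apostrophe,
    // so the search API receives a normalized keyword.
    public static String normalizeQuotes(String s) {
        return s.replaceAll("[‘’`]", "'");
    }

    // Cap the description at maxLen characters; it is later used as a prefix match.
    public static String truncateDesc(String desc, int maxLen) {
        return desc.substring(0, Math.min(maxLen, desc.length()));
    }

    public static void main(String[] args) {
        System.out.println(normalizeQuotes("The King’s Avatar")); // The King's Avatar
        System.out.println(truncateDesc("A very long description that keeps going", 20));
    }
}
```

Doing the cleanup on both the keyword and the stored description keeps the two sides of the later equality/prefix comparison consistent.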
@Component
public class NovelPageProcessor implements PageProcessor {

    public static final String URL_SUGGEST = "https://www.xxx.com/xxx/book/search/suggest";
    public static final String URL_COMMENTS = "https://www.xxx.com/xxx/comment/book/comments";

    private final Site site = Site.me()
            .setRetryTimes(3)
            .setTimeOut(3000)
            .setSleepTime(3000)
            .addHeader("Content-Type", "application/json;charset=UTF-8")
            .addCookie("cookie", "currentLanguage=en")
            .addHeader("currentlanguage", "en")
            .addHeader("user-agent", "Mozilla/5.0 ...")
            .addHeader("Accept", "application/json");

    @Override
    public void process(Page page) {
        Request req = page.getRequest();
        String type = req.getExtra("type");
        if ("suggest".equals(type)) {
            // Parse the search response and find the entry whose book name and
            // description prefix both match the novel we are looking for.
            JSONObject root = JSON.parseObject(page.getRawText());
            JSONArray suggestArr = root.getJSONObject("data").getJSONArray("suggest");
            if (CollectionUtils.isEmpty(suggestArr)) return;
            String novelName = req.getExtra("novelName");
            String novelDesc = req.getExtra("novelDesc");
            for (int i = 0; i < suggestArr.size(); i++) {
                JSONObject s = suggestArr.getJSONObject(i);
                if (novelName.equals(s.getString("bookName")) && novelDesc.startsWith(s.getString("introduction"))) {
                    String bookId = s.getString("bookId");
                    // Matched: queue the first comments page for this book.
                    JSONObject json = new JSONObject();
                    json.put("bookId", bookId);
                    json.put("pageNo", 1);
                    Request commentsReq = new Request(URL_COMMENTS)
                            .setMethod(HttpConstant.Method.POST)
                            .setRequestBody(HttpRequestBody.json(json.toJSONString(), "UTF-8"))
                            .putExtra("type", "comments")
                            .putExtra("bookId", bookId)
                            .putExtra("novelId", req.getExtra("novelId"))
                            .putExtra("pageNo", 1);
                    page.addTargetRequest(commentsReq);
                    page.setSkip(true); // nothing to persist for the suggest page itself
                    return;
                }
            }
        }
        if ("comments".equals(type)) {
            JSONObject root = JSON.parseObject(page.getRawText());
            JSONObject wb = root.getJSONObject("data").getJSONObject("webBookComments");
            // Hand the current page of records to the pipeline.
            page.putField("novelId", req.getExtra("novelId"));
            page.putField("bookId", req.getExtra("bookId"));
            page.putField("pageNo", req.getExtra("pageNo"));
            page.putField("pages", wb.getIntValue("pages"));
            page.putField("records", wb.getJSONArray("records"));
            // Queue the next page until we reach the reported last page.
            int current = req.getExtra("pageNo");
            int pages = wb.getIntValue("pages");
            if (current < pages) {
                JSONObject json = new JSONObject();
                json.put("bookId", req.getExtra("bookId"));
                json.put("pageNo", current + 1);
                Request next = new Request(URL_COMMENTS)
                        .setMethod(HttpConstant.Method.POST)
                        .setRequestBody(HttpRequestBody.json(json.toJSONString(), "UTF-8"))
                        .putExtra("type", "comments")
                        .putExtra("novelId", req.getExtra("novelId"))
                        .putExtra("bookId", req.getExtra("bookId"))
                        .putExtra("pageNo", current + 1);
                page.addTargetRequest(next);
            }
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
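The paging rule in the `comments` branch boils down to a small pure decision: request page N+1 while the current page number is below the reported total, otherwise stop. The sketch below isolates that logic with hypothetical names (`CommentPager`, `nextPage` are illustrative, not part of the project):

```java
// Hypothetical pure function capturing the pagination rule used in process():
// keep requesting pages while pageNo < pages; return -1 when crawling is done.
public class CommentPager {

    public static int nextPage(int current, int totalPages) {
        return current < totalPages ? current + 1 : -1;
    }

    public static void main(String[] args) {
        // Walk all pages of a 3-page comment list, starting at page 1.
        int page = 1;
        while (page != -1) {
            System.out.println("fetch pageNo=" + page);
            page = nextPage(page, 3);
        }
    }
}
```

Keeping the termination condition this explicit makes it easy to see that the crawler cannot loop forever, even if the site reports a wrong (but finite) `pages` value.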
@Component
public class NovelPipeline implements Pipeline {

    @Autowired
    private VNovelCommentMapper commentMapper;

    @Override
    public void process(ResultItems items, Task task) {
        Integer novelId = items.get("novelId");
        String bookId = items.get("bookId");
        List<JSONObject> recs = items.get("records");
        if (recs == null || recs.isEmpty()) return;
        for (JSONObject c : recs) {
            try {
                VNovelComment mc = new VNovelComment();
                mc.setNovelId(novelId);
                mc.setBookId(Long.parseLong(bookId));
                mc.setChapterId(c.getIntValue("chapterId"));
                mc.setReviewId(c.getLong("id"));
                mc.setUserId(c.getIntValue("userId"));
                mc.setUserName(c.getString("userNickname"));
                mc.setUserImg(c.getString("userAvatar"));
                mc.setTotalScore(c.getDouble("rate"));
                mc.setReplyAmount(c.getIntValue("replyNum"));
                mc.setLikeAmount(c.getIntValue("likeNum"));
                mc.setTopStatus(String.valueOf(c.getBoolean("praise")));
                mc.setComment(c.getString("content"));
                mc.setCreateTime(parseEpoch(c.getString("ctime")));
                mc.setIsLike(c.getBoolean("praise") ? (short) 1 : (short) 0);
                mc.setStatus(Short.parseShort(c.getString("type")));
                mc.setcDate(DateUtils.parseDate(c.getString("ctime"), "yyyy-MM-dd HH:mm:ss"));
                mc.setuDate(DateUtils.parseDate(c.getString("utime"), "yyyy-MM-dd HH:mm:ss"));
                commentMapper.insert(mc);
            } catch (ParseException e) {
                // DateUtils.parseDate throws a checked ParseException; skip the
                // malformed record instead of failing the whole page.
            }
        }
    }

    // Convert "yyyy-MM-dd HH:mm:ss" (interpreted as UTC+8) to epoch milliseconds.
    private long parseEpoch(String dt) {
        return LocalDateTime
                .parse(dt, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))
                .toInstant(ZoneOffset.ofHours(8))
                .toEpochMilli();
    }
}
| Field | Description |
|---|---|
| reviewId | Unique comment ID |
| userId / userName / userAvatar | Basic user info |
| rate | Rating (overall score) |
| likeNum | Number of likes |
| replyNum | Number of replies |
| content | Comment text |
| ctime / utime | Comment creation / update time |
| praise | Like flag (boolean converted to short) |
The `parseEpoch` method converts the timestamp string to an epoch-millisecond value, which makes statistical calculations easier.
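The same conversion can be demonstrated standalone: the string is parsed with the `yyyy-MM-dd HH:mm:ss` pattern, anchored to UTC+8, and turned into epoch milliseconds. `EpochDemo` below is a hypothetical class name used purely for illustration; the parsing logic matches `parseEpoch` above:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Standalone version of parseEpoch: interpret "yyyy-MM-dd HH:mm:ss" as a
// UTC+8 local time and convert it to epoch milliseconds.
public class EpochDemo {

    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static long parseEpoch(String dt) {
        return LocalDateTime.parse(dt, FMT)
                .toInstant(ZoneOffset.ofHours(8))
                .toEpochMilli();
    }

    public static void main(String[] args) {
        // 2024-01-01 00:00:00 at UTC+8 is 2023-12-31T16:00:00Z.
        System.out.println(parseEpoch("2024-01-01 00:00:00")); // 1704038400000
    }
}
```

Note the timezone is hard-coded to UTC+8 (the source site's timezone); if the platform ever reports times in another zone, the offset must change with it.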
Put together, this gives you an automated, concurrent comment-scraping solution with precise book matching.
Coming up next: How do we handle anti-scraping measures? How do we extend this to multiple comment sources? Can we add NLP sentiment analysis on top?