I used this crawler architecture to batch-collect novel comments across the web, and it even made it into a project at work!



Scraping Novel Comment Data Efficiently with WebMagic: A Hands-On Java Solution

Have you ever wondered whether there is a way to scrape comment data from a novel platform automatically and reliably? Today we will take apart a real project, step by step: building an efficient novel-comment crawler with WebMagic + Spring Boot + MyBatis.

This is more than a crawler demo; it is an engineered data-collection solution.


1. System Architecture Overview

The project is built on the following stack:

  • WebMagic: the core crawler framework
  • Spring Boot: dependency injection and scheduled-task orchestration
  • MyBatis: persisting comment data to the database
  • FastJSON2: JSON parsing

Project layout:

com.catty.novel.an
├── controller.admin.SpiderController      # REST entry points
├── spider.SpiderService                   # builds and launches the Spider
├── spider.NovelPageProcessor              # page-parsing logic
├── spider.NovelPipeline                   # persistence logic
├── mbg.mapper.VNovelIndexMapper           # novel index Mapper
├── mbg.mapper.VNovelCommentMapper         # comment Mapper
└── mbg.entity.NovelIndexDto               # index DTO

2. The Crawl Flow, Step by Step

Starting from the controller, let's unpack the entire crawl execution flow.

Step 1️⃣: Trigger the crawler through a REST endpoint

@RestController
@RequestMapping("/crawler")
public class SpiderController {

    @Autowired
    private SpiderService spiderService;

    // Full crawl: walk every novel in the index table.
    @PostMapping("/start")
    public String startOnce() {
        spiderService.startCrawl();
        return "Crawl started";
    }

    // Single-book test crawl, run on a separate thread so the HTTP call returns immediately.
    @PostMapping("/test")
    public String testOnce(@RequestParam("bookName") String bookName,
                           @RequestParam("bookId") Integer bookId) {
        new Thread(() -> spiderService.testCrawl(bookName, bookId)).start();
        return "Test crawl started for bookName = " + bookName;
    }
}

You can either kick off a full crawl (start) or test against a single book (test).
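
The stack list above mentions Spring Boot scheduled tasks. If you want the full crawl to run unattended rather than via the /start endpoint, a cron trigger is the natural fit. A minimal sketch, assuming a daily 03:00 schedule (the class name and cron expression are my own choices, and @EnableScheduling must be present on the application class):

@Component
public class CrawlScheduler {

    @Autowired
    private SpiderService spiderService;

    // Kick off a full crawl every day at 03:00 (the schedule is an assumption).
    @Scheduled(cron = "0 0 3 * * ?")
    public void scheduledCrawl() {
        spiderService.startCrawl();
    }
}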


Step 2️⃣: Build the request and launch the Spider

@Service
public class SpiderService {

    private static final String URL_SUGGEST = NovelPageProcessor.URL_SUGGEST;

    @Autowired
    private VNovelIndexMapper indexMapper;
    @Autowired
    private NovelPageProcessor processor;
    @Autowired
    private NovelPipeline pipeline;

    public void testCrawl(String bookName, Integer bookId) {
        NovelIndexDto novelIndexDto = indexMapper.selectByNovelId(bookId);
        Spider spider = Spider.create(processor)
                .thread(1)
                .addPipeline(pipeline)
                .setExitWhenComplete(true);

        // The suggest endpoint takes a JSON body; normalize curly quotes first.
        JSONObject json = new JSONObject();
        json.put("keyword", bookName.replaceAll("[‘’`]", "'"));
        HttpRequestBody body = HttpRequestBody.json(json.toJSONString(), "UTF-8");
        Request req = new Request(URL_SUGGEST);
        req.setMethod(HttpConstant.Method.POST).setRequestBody(body);

        // Carry context in extras so the processor can match the right book later.
        String desc = novelIndexDto.getNovelDesc();
        req.putExtra("type", "suggest")
           .putExtra("novelName", novelIndexDto.getOriginName())
           .putExtra("novelId", bookId)
           .putExtra("novelDesc", desc.substring(0, Math.min(20, desc.length())).replaceAll("[‘’`]", "'"));

        spider.addRequest(req);
        spider.run();  // blocks until the queue drains (setExitWhenComplete)
    }
}

This step builds the suggest request: the processor will use the title keyword to resolve the platform's internal bookId.
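
The /start endpoint calls startCrawl(), which the post does not show. A minimal sketch of what it could look like, mirroring testCrawl but iterating the whole index table (selectAll() and getNovelId() are hypothetical names; your mapper and DTO may differ):

public void startCrawl() {
    List<NovelIndexDto> novels = indexMapper.selectAll();  // hypothetical batch query
    Spider spider = Spider.create(processor)
            .thread(1)
            .addPipeline(pipeline)
            .setExitWhenComplete(true);

    for (NovelIndexDto novel : novels) {
        JSONObject json = new JSONObject();
        json.put("keyword", novel.getOriginName().replaceAll("[‘’`]", "'"));

        Request req = new Request(URL_SUGGEST);
        req.setMethod(HttpConstant.Method.POST)
           .setRequestBody(HttpRequestBody.json(json.toJSONString(), "UTF-8"));
        String desc = novel.getNovelDesc();
        req.putExtra("type", "suggest")
           .putExtra("novelName", novel.getOriginName())
           .putExtra("novelId", novel.getNovelId())  // hypothetical getter
           .putExtra("novelDesc", desc.substring(0, Math.min(20, desc.length())).replaceAll("[‘’`]", "'"));

        spider.addRequest(req);  // one suggest request per book
    }
    spider.run();
}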


Step 3️⃣: Parse the API response and page through the comments

@Component
public class NovelPageProcessor implements PageProcessor {

    public static final String URL_SUGGEST = "https://www.xxx.com/xxx/book/search/suggest";
    public static final String URL_COMMENTS = "https://www.xxx.com/xxx/comment/book/comments";

    private final Site site = Site.me()
            .setRetryTimes(3)
            .setTimeOut(3000)
            .setSleepTime(3000)  // throttle between requests
            .addHeader("Content-Type", "application/json;charset=UTF-8")
            .addCookie("cookie", "currentLanguage=en")
            .addHeader("currentlanguage", "en")
            .addHeader("user-agent", "Mozilla/5.0 ...")
            .addHeader("Accept", "application/json");

    @Override
    public void process(Page page) {
        Request req = page.getRequest();
        // Request.getExtra returns Object, so the extras need explicit casts.
        String type = (String) req.getExtra("type");

        if ("suggest".equals(type)) {
            JSONObject root = JSON.parseObject(page.getRawText());
            JSONArray suggestArr = root.getJSONObject("data").getJSONArray("suggest");
            if (CollectionUtils.isEmpty(suggestArr)) return;

            String novelName = (String) req.getExtra("novelName");
            String novelDesc = (String) req.getExtra("novelDesc");
            for (int i = 0; i < suggestArr.size(); i++) {
                JSONObject s = suggestArr.getJSONObject(i);
                // Match on exact title plus description prefix to avoid same-titled books.
                if (novelName.equals(s.getString("bookName")) && novelDesc.startsWith(s.getString("introduction"))) {
                    String bookId = s.getString("bookId");
                    JSONObject json = new JSONObject();
                    json.put("bookId", bookId);
                    json.put("pageNo", 1);

                    Request commentsReq = new Request(URL_COMMENTS)
                            .setMethod(HttpConstant.Method.POST)
                            .setRequestBody(HttpRequestBody.json(json.toJSONString(), "UTF-8"))
                            .putExtra("type", "comments")
                            .putExtra("bookId", bookId)
                            .putExtra("novelId", req.getExtra("novelId"))
                            .putExtra("pageNo", 1);

                    page.addTargetRequest(commentsReq);
                    page.setSkip(true);  // suggest pages carry no fields for the pipeline
                    return;
                }
            }
        }

        if ("comments".equals(type)) {
            JSONObject root = JSON.parseObject(page.getRawText());
            JSONObject wb = root.getJSONObject("data").getJSONObject("webBookComments");

            page.putField("novelId", req.getExtra("novelId"));
            page.putField("bookId", req.getExtra("bookId"));
            page.putField("pageNo", req.getExtra("pageNo"));
            page.putField("pages", wb.getIntValue("pages"));
            page.putField("records", wb.getJSONArray("records"));

            // Keep enqueueing the next page until we hit the reported page count.
            int current = (int) req.getExtra("pageNo");
            int pages = wb.getIntValue("pages");
            if (current < pages) {
                JSONObject json = new JSONObject();
                json.put("bookId", req.getExtra("bookId"));
                json.put("pageNo", current + 1);

                Request next = new Request(URL_COMMENTS)
                        .setMethod(HttpConstant.Method.POST)
                        .setRequestBody(HttpRequestBody.json(json.toJSONString(), "UTF-8"))
                        .putExtra("type", "comments")
                        .putExtra("novelId", req.getExtra("novelId"))
                        .putExtra("bookId", req.getExtra("bookId"))
                        .putExtra("pageNo", current + 1);

                page.addTargetRequest(next);
            }
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}

Step 4️⃣: Write comments to the database and map the fields

@Component
public class NovelPipeline implements Pipeline {

    @Autowired
    private VNovelCommentMapper commentMapper;

    @Override
    public void process(ResultItems items, Task task) {
        Integer novelId = items.get("novelId");
        String bookId = items.get("bookId");
        List<JSONObject> recs = items.get("records");
        if (recs == null || recs.isEmpty()) return;

        for (JSONObject c : recs) {
            VNovelComment mc = new VNovelComment();
            mc.setNovelId(novelId);
            mc.setBookId(Long.parseLong(bookId));
            mc.setChapterId(c.getIntValue("chapterId"));
            mc.setReviewId(c.getLong("id"));
            mc.setUserId(c.getIntValue("userId"));
            mc.setUserName(c.getString("userNickname"));
            mc.setUserImg(c.getString("userAvatar"));
            mc.setTotalScore(c.getDouble("rate"));
            mc.setReplyAmount(c.getIntValue("replyNum"));
            mc.setLikeAmount(c.getIntValue("likeNum"));
            // getBoolean returns null when the field is absent; guard against NPE.
            boolean praised = Boolean.TRUE.equals(c.getBoolean("praise"));
            mc.setTopStatus(String.valueOf(praised));
            mc.setComment(c.getString("content"));
            mc.setCreateTime(parseEpoch(c.getString("ctime")));
            mc.setIsLike(praised ? (short) 1 : (short) 0);
            mc.setStatus(Short.parseShort(c.getString("type")));
            try {
                mc.setcDate(DateUtils.parseDate(c.getString("ctime"), "yyyy-MM-dd HH:mm:ss"));
                mc.setuDate(DateUtils.parseDate(c.getString("utime"), "yyyy-MM-dd HH:mm:ss"));
            } catch (ParseException e) {
                // parseDate throws a checked ParseException; surface it as unchecked.
                throw new IllegalStateException("Bad timestamp in comment " + c.getLong("id"), e);
            }

            commentMapper.insert(mc);
        }
    }

    // Convert "yyyy-MM-dd HH:mm:ss" (assumed UTC+8) to epoch milliseconds.
    private long parseEpoch(String dt) {
        return LocalDateTime
                .parse(dt, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))
                .toInstant(ZoneOffset.ofHours(8))
                .toEpochMilli();
    }
}
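
For completeness, a sketch of the Mapper the pipeline writes through. The mbg package name suggests the real one is generated by MyBatis Generator, so the table name, column names, and annotation style here are assumptions:

@Mapper
public interface VNovelCommentMapper {

    // Abbreviated to a few of the fields mapped above; a generated mapper
    // would cover every column. All names below are assumptions.
    @Insert("INSERT INTO v_novel_comment " +
            "(novel_id, book_id, review_id, user_id, user_name, comment, create_time) " +
            "VALUES (#{novelId}, #{bookId}, #{reviewId}, #{userId}, #{userName}, #{comment}, #{createTime})")
    int insert(VNovelComment record);
}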

✅ Field reference:

Field                                Description
id                                   Unique comment ID (stored as reviewId)
userId / userNickname / userAvatar   Basic user info
rate                                 Rating (total score)
likeNum                              Like count
replyNum                             Reply count
content                              Comment body
ctime / utime                        Comment creation / update time
praise                               Like flag (boolean, mapped to a short)

The parseEpoch helper converts the timestamp string into epoch milliseconds (assuming the source timestamps are UTC+8), which makes downstream aggregation easier.
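
As a quick sanity check: parseEpoch("2024-06-01 08:30:00") reads the string as 08:30 at UTC+8, i.e. 00:30 UTC, and returns 1717201800000.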


Recap: a closed loop from request construction to persistence

  • ✅ The controller kicks off the crawl task
  • ✅ The processor calls the API and parses the paginated data
  • ✅ The pipeline maps the result fields and writes them to the database

The result: an automated, precisely matched comment-collection pipeline; raise the Spider's thread count when you need more throughput!

Coming up next: How do we handle anti-crawling measures? How do we extend to multiple comment sources? Can we run NLP sentiment analysis on the data?

☕ Buy the author a coffee to keep the in-depth posts coming

