Building a Web Crawler with Spring Boot and Jsoup
1. Background
A recent project needed company data from Tianyancha, so I looked into scraping it with Jsoup. I am writing this up both as a reference for later and to exchange ideas with other developers who enjoy this kind of work; if anything here is wrong, corrections are welcome.
2. Prerequisites
[1] If you are not yet familiar with the Spring Boot framework, see: https://www.cnblogs.com/wmyskxz/p/9010832.html
[2] Jsoup overview
jsoup is a Java HTML parser that can parse HTML directly from a URL or from an HTML string. It provides a very convenient API for extracting and manipulating data via the DOM, CSS selectors, and jQuery-like methods.
(1) The org.jsoup.Jsoup class
The Jsoup class is the entry point of any jsoup program; it provides methods for loading and parsing HTML documents from various sources. Some of its important methods:
static Connection connect(String url): creates and returns a connection to the given URL
static Document parse(File in, String charsetName): parses the given file into a Document using the specified charset
static Document parse(String html): parses the given HTML string into a Document
static String clean(String bodyHtml, Whitelist whitelist): returns safe HTML, produced by parsing the input and filtering it through a whitelist of permitted tags and attributes
(2) The org.jsoup.nodes.Element class
This class represents an element of an HTML document loaded through jsoup; its subclass Document represents the whole document, and operations that apply to the entire document live there. For the important methods, see the API docs: http://jsoup.org/apidocs/org/jsoup/nodes/Document.html
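As a quick illustration of these entry-point methods, the snippet below parses an inline HTML string (no network access needed) and exercises parse(), title(), select(), and clean(); the HTML sample is made up for the demo:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

public class JsoupBasics {
    public static void main(String[] args) {
        // parse(String html): build a Document from an in-memory HTML string
        String html = "<html><head><title>Demo</title></head>"
                    + "<body><p>Hello <b>jsoup</b></p><script>alert(1)</script></body></html>";
        Document document = Jsoup.parse(html);
        System.out.println(document.title());            // Demo
        System.out.println(document.select("p").text()); // Hello jsoup

        // clean(): keep only tags/attributes allowed by the whitelist (drops the <script>)
        String safe = Jsoup.clean("<p>ok</p><script>alert(1)</script>", Whitelist.basic());
        System.out.println(safe);                        // <p>ok</p>
    }
}
```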
3. Tianyancha in practice
(1) First set up the project with Spring Boot, using Maven for dependency management, and add the jsoup dependency (several versions are available to choose from):
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
(2) Loading a document. There are two cases here, one over plain (unencrypted) HTTP and one over encrypted HTTPS, outlined in turn below:
[1] Loading over plain HTTP: to load HTML from a URL, use the Jsoup.connect() method:
try {
    Document document = Jsoup.connect("http://www.yiibai.com").get();
    System.out.println(document.title());
} catch (IOException e) {
    e.printStackTrace();
}
To load a document from a local file, use the Jsoup.parse(File, charset) method:
try {
    Document document = Jsoup.parse(new File("D:/temp/index.html"), "utf-8");
    System.out.println(document.title());
} catch (IOException e) {
    e.printStackTrace();
}
[2] Loading a document from a URL over HTTPS:
public static String url = "https://www.tianyancha.com/search?key=%E8%85%BE%E8%AE%AF";

RequestBuilder requestBuilder = RequestBuilder.get(url);
// copy every request header (defined below) onto the request
for (Map.Entry<String, String> entry : headers.entrySet()) {
    requestBuilder.setHeader(entry.getKey(), entry.getValue());
}
// HttpClientProxyManager and ThreeTuple are project-specific helpers wrapping Apache
// HttpClient; the third element of the tuple is the raw HTML of the response body
ThreeTuple postResult = HttpClientProxyManager.getHttpClientProxy().executeStringResult(requestBuilder);
Document document = Jsoup.parse(postResult.getThird());
Note: loading the document over HTTPS requires setting the "X-AUTH-TOKEN" value; the full set of request headers to send is shown below:
public static Map<String, String> headers = new HashMap<>();
static {
headers.put("Accept", "application/json, text/javascript, */*; q=0.01");
headers.put("Accept-Encoding", "gzip, deflate, br");
headers.put("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8");
headers.put("Cache-Control", "no-cache");
headers.put("Connection", "keep-alive");
headers.put("Cookie", "TYCID=459b7760671911e88a962752a04d49e3; undefined=459b7760671911e88a962752a04d49e3; ssuid=8800877279; aliyungf_tc=AQAAAJaEZlt/LgoAlAFM2q76k16Acws9; csrfToken=ebZ222cIdpZBqAQLLefohzTT; _ga=GA1.2.1043693661.1541656659; _gid=GA1.2.248580359.1541656659; bannerFlag=true; RTYCID=d60f1634a5e44461887b801ec5347579; CT_TYCID=0e2bded5cce44b44a8bef56c29a204a0; cloud_token=a2cb19178fca467285af880f0aaf1d25; token=70a1ce32f1a84d3395364a9858a0e47f; _utm=8dc98349cf3f4b6293ecee4786167b04; tyc-user-info=%257B%2522myQuestionCount%2522%253A%25220%2522%252C%2522integrity%2522%253A%25220%2525%2522%252C%2522state%2522%253A%25220%2522%252C%2522vipManager%2522%253A%25220%2522%252C%2522onum%2522%253A%25220%2522%252C%2522monitorUnreadCount%2522%253A%25220%2522%252C%2522discussCommendCount%2522%253A%25220%2522%252C%2522new%2522%253A%25221%2522%252C%2522token%2522%253A%2522eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiIxNTI3NDk2MzgyNiIsImlhdCI6MTU0MTY2NzA1NSwiZXhwIjoxNTU3MjE5MDU1fQ.U-vu1TOdC_0ThSu6ZCgGusoRz_ft6FY0gbzdMvX7k4J0RWl68hKhSDud-PRIcHYof0_TqgHZWAMFeU1IN632Yg%2522%252C%2522redPoint%2522%253A%25220%2522%252C%2522pleaseAnswerCount%2522%253A%25220%2522%252C%2522vnum%2522%253A%25220%2522%252C%2522bizCardUnread%2522%253A%25220%2522%252C%2522mobile%2522%253A%252215274963826%2522%257D; auth_token=eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiIxNTI3NDk2MzgyNiIsImlhdCI6MTU0MTY2NzA1NSwiZXhwIjoxNTU3MjE5MDU1fQ.U-vu1TOdC_0ThSu6ZCgGusoRz_ft6FY0gbzdMvX7k4J0RWl68hKhSDud-PRIcHYof0_TqgHZWAMFeU1IN632Yg; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1541656659,1541659805,1541667039; _gat_gtag_UA_123487620_1=1; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1541667910");
headers.put("DNT", "1");
headers.put("Host", "www.tianyancha.com");
headers.put("Pragma", "no-cache");
headers.put("Referer", "https://www.tianyancha.com/search?key=%E8%85%BE%E8%AE%AF");
headers.put("X-AUTH-TOKEN","eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiIxNTI3NDk2MzgyNiIsImlhdCI6MTU0MTY2NzA1NSwiZXhwIjoxNTU3MjE5MDU1fQ.U-vu1TOdC_0ThSu6ZCgGusoRz_ft6FY0gbzdMvX7k4J0RWl68hKhSDud-PRIcHYof0_TqgHZWAMFeU1IN632Yg");
headers.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36");
}
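If you do not have a custom HttpClient wrapper like the HttpClientProxyManager above, jsoup's own Connection can send the same headers over HTTPS. This is a minimal sketch; the token and cookie values here are placeholders, and the real ones must be copied from your own logged-in browser session:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HttpsFetch {
    // placeholder values; replace with the real headers from your logged-in session
    static final Map<String, String> HEADERS = new HashMap<>();
    static {
        HEADERS.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        HEADERS.put("X-AUTH-TOKEN", "your-token-here");
        HEADERS.put("Cookie", "your-cookie-here");
    }

    static Document fetch(String url) throws IOException {
        // Connection.headers(Map) sets all request headers before the GET is issued
        return Jsoup.connect(url).headers(HEADERS).get();
    }
}
```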
(3) Parsing the HTML
Using part of a Tianyancha search-results page as an example, the following code extracts a few of the fields:
// the container holding the list of search results
Elements elements = document.getElementsByClass("result-list");
Element resultElement = elements.get(0);
// each hit on the page is a "search-item"; take the first one
Elements searchElements = resultElement.getElementsByClass("search-item");
Element itemElement = searchElements.get(0);
// province / location of the company
String province = itemElement.getElementsByClass("site").get(0).text();
Element contentElement = itemElement.getElementsByClass("content").get(0);
// contact block: the first "col" holds the phone, the second the email,
// and in each the second <span> holds the value
Element contactElement = contentElement.getElementsByClass("contact").get(0);
Element emailElement = contactElement.getElementsByClass("col").get(1);
String email = emailElement.getElementsByTag("span").get(1).text();
Element phoneElement = contactElement.getElementsByClass("col").get(0);
String phone = phoneElement.getElementsByTag("span").get(1).text();
// legal representative's name, inside the info block's title link
Element infoElement = contentElement.getElementsByClass("info").get(0);
Element personLinkElement = infoElement.getElementsByClass("title text-ellipsis").get(0).getElementsByTag("a").get(0);
String legalName = personLinkElement.text();
// absolute URL of the company's detail page ("abs:" resolves against the base URI)
Element companyElement = contentElement.getElementsByClass("header").get(0);
Element companyLinkElement = companyElement.getElementsByTag("a").get(0);
String detailUrl = companyLinkElement.attr("abs:href");
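The getElementsByClass chains above are easy to get lost in; the same fields can also be pulled out with CSS selectors via select() and selectFirst(). The snippet below runs against a stripped-down, made-up fragment that mimics the markup structure the code above assumes (the real Tianyancha markup may differ or change over time):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseWithSelectors {
    public static void main(String[] args) {
        // made-up fragment mirroring the class names used in the article
        String html =
            "<div class='result-list'><div class='search-item'>"
          + "<div class='site'>湖南</div>"
          + "<div class='content'>"
          + "  <div class='header'><a href='/company/1'>Demo Co.</a></div>"
          + "  <div class='info'><div class='title text-ellipsis'><a>张三</a></div></div>"
          + "  <div class='contact'>"
          + "    <div class='col'><span>电话:</span><span>0731-0000000</span></div>"
          + "    <div class='col'><span>邮箱:</span><span>demo@example.com</span></div>"
          + "  </div>"
          + "</div></div></div>";
        // the base URI lets attr("abs:href") resolve relative links
        Document document = Jsoup.parse(html, "https://www.tianyancha.com/");

        Element item = document.selectFirst(".result-list .search-item");
        String province  = item.selectFirst(".site").text();
        String phone     = item.select(".contact .col").get(0).select("span").get(1).text();
        String email     = item.select(".contact .col").get(1).select("span").get(1).text();
        String legalName = item.selectFirst(".info .title.text-ellipsis a").text();
        String detailUrl = item.selectFirst(".header a").attr("abs:href");

        System.out.println(province + " " + phone + " " + email + " " + legalName + " " + detailUrl);
    }
}
```

Selectors make the intent ("the second span inside the second contact column") explicit in one expression, and fail with null rather than an IndexOutOfBoundsException when the markup changes, which is easier to guard against.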