Python Web Scraping Study Notes

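Contents

  • Python Web Scraping Study Notes
    • Preface
    • Chapter 1: Getting Started with Web Scraping
      • 1.1 What Is a Web Scraper?
      • 1.2 Required Software
      • 1.3 A First Small Scraper
      • 1.4 Anatomy of a Web Request
      • 1.5 The HTTP Protocol
      • 1.6 Getting Started with Requests
        • 1.6.1 Scraping a Sogou Search Page
        • 1.6.2 Fetching Baidu Translate Results
        • 1.6.3 Scraping the Douban Movie Chart
      • 1.7 Closing resp
    • Chapter 2: Parsing and Extracting Data
      • 2.1 Overview of Data Parsing
      • 2.2 Parsing with re: Regular Expressions
      • 2.3 Using Python's re Module
      • 2.4 Tackling the Douban Top 250 Movie List
      • 2.5 Scraping Movie Info from Dytt (电影天堂)
      • 2.5 Before bs4: HTML Syntax Basics
      • 2.6 Getting Started with bs4: Vegetable Prices
      • 2.7 bs4 Case Study: Downloading Images from Umei (优美图库)
      • 2.8 Getting Started with XPath
      • 2.9 XPath in Practice: Scraping ZBJ (猪八戒网)
    • Chapter 3: Advanced Requests
      • 3.1 Overview of Advanced Requests
      • 3.2 Handling Cookies: Logging In to a Novel Site
      • 3.3 Anti-Hotlinking: Scraping Pear Video (梨视频)
      • 3.4 Proxies
      • 3.5 Combined Exercise: Scraping NetEase Cloud Music Comments
    • Chapter 4: Asynchronous Scraping
      • 4.1 Chapter 4 Overview
      • 4.2 Multithreading
      • 4.3 Multiprocessing
      • 4.4 Introduction to Thread Pools and Process Pools
      • 4.5 Thread Pool Case Study: Xinfadi Vegetable Prices
      • 4.6 Coroutines
        • 4.6.1 The Coroutine Concept
        • 4.6.2 Multi-Task Asynchronous Interaction
        • 4.6.3 Async Coroutines: Deprecation Warnings
      • 4.7 Asynchronous HTTP Requests with aiohttp
      • 4.8 Async Scraping in Practice: Downloading an Entire Novel
      • 4.9 Scraping Video
        • 4.9.1 Combined Exercise: How Video Sites Work
        • 4.9.2 Scraping Yunbo TV (云播TV): Simple Version
    • Chapter 5: selenium
      • 5.1 Introducing selenium
      • 5.2 selenium Operations: Scraping Lagou (拉钩)
      • 5.3 selenium Operations: Switching Between Windows
      • 5.4 selenium Operations: Headless Browsers
      • 5.5 selenium Operations: Solving CAPTCHAs with Chaojiying (超级鹰)
      • 5.6 selenium: Using Chaojiying to Log In to Chaojiying
      • 5.7 selenium: Handling the 12306 Login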

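Preface

I wrote these notes while working through the web scraping video course by the Bilibili uploader 路飞学城IT. For the full material, look up the original videos on Bilibili. Treat this article as a reference only, and please point out any mistakes. Some of the sites used in the examples may no longer be available, in which case you will need to find a suitable substitute yourself.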

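Chapter 1: Getting Started with Web Scraping

1.1 What Is a Web Scraper?

A web scraper (crawler) collects resources from the internet, such as images, text, and video. It works by using code to imitate a browser and download those resources. A scraper does not have to be written in Python; Java, C, and other languages work too. Python is the usual choice because its syntax is concise and its rich set of third-party libraries makes scraping much simpler.

What is robots.txt? It is a file in which a site states which of its pages crawlers may and may not fetch. To view it, append "/robots.txt" to the site's root URL, for example http://www.bilibili.com/robots.txt.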
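Python's standard library can read these rules. The sketch below uses urllib.robotparser to check whether a URL may be fetched. The rule set, the is_allowed helper, and the example.com URLs are made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; a real robots.txt would be downloaded from the site.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())  # parse() takes the rules as a list of lines
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/index.html"))
print(is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/private/data.html"))
```

Against a live site, calling set_url() and then read() on the parser loads the rules over the network instead of from a string.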

1.2 Required Software

  • Python 3.8
  • An IDE or editor such as PyCharm
  • Modules such as requests (third-party) and urllib (standard library)

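1.3 A First Small Scraper

The first scraper simply downloads the Baidu home page, which is about as simple as it gets.

from urllib.request import urlopen

url = "http://www.baidu.com"
resp = urlopen(url)

# Windows users must pass encoding="utf-8" here: the Baidu page is encoded in UTF-8, while open() falls back to
# the system locale encoding (GBK on Chinese-language Windows), so the saved page would otherwise be garbled.
with open("myBaidu.html", mode="w", encoding="utf-8") as f:
    f.write(resp.read().decode("utf-8"))
resp.close()
print('success!')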

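1.4 Anatomy of a Web Request

  • Server-side rendering: the data is visible in the page source. The server merges the data into the HTML and returns the finished page to the client.
  • Client-side rendering: the data is not visible in the page source. The first request returns only an HTML skeleton; a second request fetches the data, which the browser then displays.

Get comfortable with the browser's packet-capture tool: press F12 and open the Network tab.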
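The difference between the two rendering modes can be sketched with canned strings. Everything below (the page sources, the JSON payload, and both helper names) is invented for illustration. In practice the page source comes from the first request, and the JSON from a request found in the Network tab.

```python
import json

# Hypothetical server-rendered page: the data is already in the HTML.
SERVER_RENDERED = "<html><body><ul><li>Movie A</li><li>Movie B</li></ul></body></html>"

# Hypothetical client-rendered page: the HTML is only a skeleton...
CLIENT_SKELETON = "<html><body><ul id='list'></ul><script src='app.js'></script></body></html>"
# ...and the data arrives from a second request, typically as JSON.
CLIENT_DATA = '[{"title": "Movie A"}, {"title": "Movie B"}]'


def data_in_source(page_source, keyword):
    """Return True if the keyword already appears in the page source."""
    return keyword in page_source


def titles_from_json(payload):
    """Extract the titles from the JSON returned by the second request."""
    return [item["title"] for item in json.loads(payload)]


print(data_in_source(SERVER_RENDERED, "Movie A"))  # data is in the HTML itself
print(data_in_source(CLIENT_SKELETON, "Movie A"))  # skeleton only, no data
print(titles_from_json(CLIENT_DATA))
```

If a keyword that is visible in the browser is missing from the page source, the page is most likely client-rendered, and the request to scrape is the one that returns the data.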

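1.5 The HTTP Protocol

A protocol is a set of rules that two computers agree on so that they can communicate smoothly. Common protocols include TCP/IP, SOAP, HTTP, and SMTP.

HTTP stands for Hyper Text Transfer Protocol. It is the protocol used to transfer hypertext from World Wide Web (WWW) servers to the local browser. Put simply, the data exchanged between a browser and a server follows HTTP.

HTTP divides every message into three parts, whether the message is a request or a response.

Request:

Request line    -> method, request URL, protocol
Request headers -> additional information for the server
Request body    -> usually the request parameters

Response:

Status line      -> protocol, status code
Response headers -> additional information for the client
Response body    -> the content the client actually wants (HTML, JSON)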
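The three-part layout is easy to see in a raw message. The sketch below splits a hand-written request and response into start line, headers, and body. Both sample messages and the split_http_message helper are illustrative assumptions, not a full HTTP parser.

```python
# Hand-written sample messages; header values are placeholders.
RAW_REQUEST = (
    "POST /sug HTTP/1.1\r\n"
    "Host: fanyi.baidu.com\r\n"
    "User-Agent: demo-client\r\n"
    "\r\n"
    "kw=dog"
)

RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "Set-Cookie: token=abc123\r\n"
    "\r\n"
    '{"ok": true}'
)


def split_http_message(raw):
    """Split a raw HTTP message into (start line, headers dict, body)."""
    head, _, body = raw.partition("\r\n\r\n")  # a blank line separates the headers from the body
    lines = head.split("\r\n")
    start_line = lines[0]  # request line or status line
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


for message in (RAW_REQUEST, RAW_RESPONSE):
    start_line, headers, body = split_http_message(message)
    print(start_line, headers, body, sep="\n", end="\n\n")
```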

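The request headers that matter most to a scraper:

  1. User-Agent: identifies the client that sent the request (what the request was sent with)
  2. Referer: used for anti-hotlinking (which page did this request come from? Anti-scraping checks rely on it)
  3. Cookie: locally stored string data (login state, anti-scraping tokens)

Important response headers:

  1. Set-Cookie: the cookies the server asks the client to store (login state, anti-scraping tokens)
  2. Assorted opaque strings (spotting these takes experience; they usually have "token" in the name and guard against attacks and scraping)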
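To see how these request headers are attached in code, here is a minimal sketch using urllib.request.Request from the standard library. It builds the request without sending it, so it runs offline. The example.com URLs and the cookie value are placeholders.

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36",
    "Referer": "https://example.com/list",  # hypothetical page the request claims to come from
    "Cookie": "sessionid=placeholder",      # hypothetical cookie value
}

# Build the request object only; nothing is sent over the network here.
req = Request("https://example.com/detail", headers=headers)

# urllib stores header names via str.capitalize(), so "User-Agent" becomes "User-agent".
for name, value in req.header_items():
    print(name, "->", value)
```

The requests library used in the following sections takes the same kind of dict through its headers argument.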

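1.6 Getting Started with Requests

1.6.1 Scraping a Sogou Search Page

First install the requests module: pip install requests

import requests

url = 'https://www.sogou.com/web?query=周杰伦'
# The user-agent value can be copied from the browser's developer tools: open the Network tab, select the
# first entry, and look under Request Headers. It makes the request look as if it came from a browser.
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36"
}
resp = requests.get(url, headers=headers)

print(resp)       # the response object, shown with its status code
print(resp.text)  # the page source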
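The keyword in that URL is Chinese text, which is percent-encoded when the request is actually sent. As a small illustration using only the standard library, the sketch below builds the same kind of URL explicitly. The build_search_url helper is a made-up name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(keyword):
    """Build a Sogou search URL with the keyword percent-encoded as UTF-8."""
    query = urlencode({"query": keyword})
    return "https://www.sogou.com/web?" + query


url = build_search_url("周杰伦")
print(url)

# Decoding the query string recovers the original keyword.
print(parse_qs(urlsplit(url).query))
```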

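1.6.2 Fetching Baidu Translate Results

In Baidu Translate, find the URL that returns the translation result: https://fanyi.baidu.com/sug

This endpoint uses the POST method: you upload the word to translate and it returns the result. With POST, the parameters are passed through the data argument.

import requests

url = "https://fanyi.baidu.com/sug"
text = input("Enter the English word to translate: ")
data = {
    "kw": text
}
# Send a POST request; the form data goes in a dict and is passed through the data argument
resp = requests.post(url, data=data)
print(resp.json())  # parse the JSON returned by the server into a dict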
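What "passing parameters through data" means on the wire can be sketched with the standard library: the dict is form-encoded and placed in the request body, and attaching a body is what makes urllib treat the request as a POST. The build_post_request helper and the sample word are illustrative. Nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, form):
    """Form-encode the dict and attach it as the request body (not sent here)."""
    body = urlencode(form).encode("utf-8")
    return Request(url, data=body)


req = build_post_request("https://fanyi.baidu.com/sug", {"kw": "dog"})
print(req.get_method())  # the method urllib will use, given that a body is attached
print(req.data)          # the encoded form body
```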

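1.6.3 Scraping the Douban Movie Chart

When a scraper does not work, the first thing to try is setting the User-Agent. By default, requests identifies itself as python-requests/<version> (python-requests/2.25.1 when these notes were written), which is not a browser identifier.

This example uses the GET method to fetch the Douban movie chart. With GET, the parameters are passed through the params argument.

import requests

url = "https://movie.douban.com/j/chart/top_list"
# Repackage the query parameters as a dict
param = {
    "type": "24",
    "interval_id": "100:90",
    "action": "",
    "start": 0,
    "limit": 20
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 "
                  "Safari/537.36"
}

resp = requests.get(url, params=param, headers=headers)
print(resp.json())
resp.close()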
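The params dict ends up as the query string of the URL. The sketch below builds that URL by hand for several pages. It assumes, from the parameter names only, that start is an offset and limit is a page size. The chart_url helper is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://movie.douban.com/j/chart/top_list"


def chart_url(page, page_size=20):
    """Build the chart URL for a zero-based page number (assumed paging semantics)."""
    params = {
        "type": "24",
        "interval_id": "100:90",
        "action": "",
        "start": page * page_size,  # assumption: start is the offset of the first item
        "limit": page_size,         # assumption: limit is the number of items returned
    }
    return BASE_URL + "?" + urlencode(params)


for page in range(3):
    print(chart_url(page))
```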

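1.7 Closing resp

At the end of the program, close resp to release the underlying connection. If responses are left open, repeated requests can pile up open connections until new requests start to fail, so finish with resp.close(). The same applies to files: close them when you are done.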
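One way to make sure close() always runs, even when an exception interrupts the program, is a with block. The sketch below uses a hypothetical FakeResponse class in place of a real response so that it runs without network access. contextlib.closing calls close() when the block exits.

```python
from contextlib import closing


class FakeResponse:
    """Hypothetical stand-in for a response object; it only records whether close() ran."""

    def __init__(self, text):
        self.text = text
        self.closed = False

    def close(self):
        self.closed = True


resp = FakeResponse("<html>...</html>")
with closing(resp) as r:
    page = r.text  # work with the response inside the block

print(resp.closed)  # close() was called automatically on leaving the block
```

Objects returned by urlopen() and files returned by open() can be used in a with statement directly, without closing().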

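Chapter 2: Parsing and Extracting Data

2.1 Overview of Data Parsing

In the previous chapter we covered the basics of fetching a whole web page. Most of the time, though, we do not need the whole page, only a small part of it. That is where data extraction comes in.

This course covers three parsing approaches:

  1. re parsing
  2. bs4 parsing
  3. xpath parsing

The three can be mixed freely. The approach is entirely result-driven: as long as you get the data you want, it does not matter which method you use. Worry about performance only after you are comfortable with all three.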

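2.2 Parsing with re: Regular Expressions

A regular expression (regex) is a syntax for matching strings against a pattern.

The page source we fetch is essentially one very long string, and regular expressions are well suited to extracting content from it.

Advantages: fast, efficient, and precise

Disadvantage: a steep learning curve for beginners

Syntax: metacharacters are combined to match strings. An online tester is available at http://tool.oschina.net/regex/

Metacharacters: special symbols with a fixed meaning

Common metacharacters:

.	matches any character except a newline
\w	matches a letter, digit, or underscore
\s	matches any whitespace character
\d	matches a digit
\n	matches a newline
\t	matches a tab

^	matches the start of the string
$	matches the end of the string

\W	matches anything other than a letter, digit, or underscore
\S	matches any non-whitespace character
\D	matches any non-digit
a|b	matches character a or character b
()	matches the enclosed expression and also forms a group
[...]	matches any character in the set
[^...]	matches any character not in the set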
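A few of these metacharacters in action, as a quick sketch with Python's re module (covered in section 2.3). The sample strings are arbitrary.

```python
import re

digits = re.findall(r"\d", "a1b22")          # every digit
spaces = re.findall(r"\s", "a b\tc")         # every whitespace character
non_word = re.findall(r"\W", "a_b-c!")       # not a letter, digit, or underscore
either = re.findall(r"a|b", "cab")           # a or b
in_set = re.findall(r"[abc]", "abcd")        # characters in the set
not_in_set = re.findall(r"[^abc]", "abcd")   # characters not in the set
at_start = bool(re.search(r"^ab", "abcd"))   # anchored at the start of the string
at_end = bool(re.search(r"cd$", "abcd"))     # anchored at the end of the string

print(digits, spaces, non_word, either, in_set, not_in_set, at_start, at_end)
```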

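Quantifiers: control how many times the preceding element may repeat

*	zero or more times
+	one or more times
?	zero or one time
{n}	exactly n times
{n,}	n or more times
{n,m}	between n and m times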
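To make the quantifiers concrete, the sketch below tests each one against strings of zero to four "a" characters. re.fullmatch requires the whole string to match. The matching helper is an illustrative name.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]


def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]


for pattern in [r"a*", r"a+", r"a?", r"a{2}", r"a{2,}", r"a{2,3}"]:
    print(pattern, matching(pattern))
```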

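Greedy and lazy matching

.*		greedy matching
.*?		lazy matching

Lazy matching is what scrapers use most, so it deserves attention.

Lazy matching matches as little as possible. For example:

str: 玩儿吃鸡游戏,晚上一起上游戏,干嘛呢?打游戏啊
reg: 玩儿.*?游戏
# How it works: the engine first matches "玩儿". The lazy ".*?" then consumes as few characters as possible, growing one character at a time until the rest of the pattern ("游戏") can match. The match therefore ends at the first "游戏". The greedy form "玩儿.*游戏" would instead run to the last "游戏" and return "玩儿吃鸡游戏,晚上一起上游戏,干嘛呢?打游戏".
Result: 玩儿吃鸡游戏

str: <div class="jay">周杰伦</div><div class="jj">林俊杰</div>
reg: <div class=".*?">.*?</div>
Result: <div class="jay">周杰伦</div>
	<div class="jj">林俊杰</div>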
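The two examples can be checked directly with Python's re module (introduced in section 2.3). The sketch below runs the lazy pattern next to its greedy counterpart on the same strings.

```python
import re

sentence = "玩儿吃鸡游戏,晚上一起上游戏,干嘛呢?打游戏啊"
lazy = re.findall(r"玩儿.*?游戏", sentence)    # stops at the first "游戏"
greedy = re.findall(r"玩儿.*游戏", sentence)   # runs to the last "游戏"
print(lazy)
print(greedy)

html = '<div class="jay">周杰伦</div><div class="jj">林俊杰</div>'
lazy_divs = re.findall(r'<div class=".*?">.*?</div>', html)   # one match per div
greedy_divs = re.findall(r'<div class=".*">.*</div>', html)   # a single match spanning both divs
print(lazy_divs)
print(greedy_divs)
```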

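2.3 Using Python's re Module

Now that we know regular expressions, how do we use them in a program?

import re

# findall: return every substring that matches the pattern, as a list
lst = re.findall(r"\d+", "My phone number is 10086 and my girlfriend's is 10010")
print(lst)

# finditer: also finds every match, but returns an iterator; call .group() on each item to get the text
it = re.finditer(r"\d+", "My phone number is 10086 and my girlfriend's is 10010")
for i in it:
    print(i.group())

# search: returns as soon as one match is found; the result is a match object, so use .group() to get the text
s = re.search(r"\d+", "My phone number is 10086 and my girlfriend's is 10010")
print(s.group())

# match: only matches from the start of the string; this string starts with a letter, so the result is None
s = re.match(r"\d+", "My phone number is 10086 and my girlfriend's is 10010")
print(s.group() if s else None)

When a regular expression is long or reused, it can be compiled ahead of time:

# Precompile the regular expression
obj = re.compile(r"\d+")

ret = obj.finditer("My phone number is 10086 and my girlfriend's is 10010")
for it in ret:
    print(it.group())

ret = obj.findall("sadadsa223dawswefq123fasdigjoihuiohuiogsdf")
print(ret)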

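How do we pull a specific piece out of each match?

import re

# The HTML tags of this sample string did not survive on the page; the markup below is an assumed
# reconstruction that fits the two group names used further down ("name" and "id").
s = """
<div class='item'><span id='1'>张富帅</span></div>
<div class='item'><span id='2'>张富贵</span></div>
<div class='item'><span id='3'>吕富帅</span></div>
<div class='item'><span id='4'>小狗头</span></div>
<div class='item'><span id='5'>小煞笔</span></div>
"""

# (?P<group_name>pattern) captures part of the match under a name, so it can be retrieved on its own
obj = re.compile(r"<div class='.*?'><span id='(?P<id>\d+)'>(?P<name>.*?)</span></div>", re.S)  # re.S lets . match newlines
res = obj.finditer(s)
for it in res:
    print(it.group("name"))
    print(it.group("id"))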
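Named groups can also be collected all at once with groupdict(), which returns a dict keyed by group name. The sketch below runs a pattern over made-up markup and writes the resulting rows as CSV into an in-memory buffer. The markup, the field names, and the use of io.StringIO in place of a real file are all assumptions for illustration.

```python
import csv
import io
import re

# Hypothetical markup; real pages will need a pattern written against their own source.
html = "<li><b>Alpha</b><i>1994</i></li><li><b>Beta</b><i>2001</i></li>"
pattern = re.compile(r"<li><b>(?P<title>.*?)</b><i>(?P<year>\d+)</i></li>", re.S)

# One dict per match, keyed by group name.
rows = [m.groupdict() for m in pattern.finditer(html)]

buffer = io.StringIO()  # in-memory stand-in for a file opened with open(..., newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "year"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```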

2.4 Tackling the Douban Top 250 Movie List

  1. Fetch the page source with requests
  2. Extract the information we want with re

import requests
import re
import csv

# Fetch the page
url = "http://movie.douban.com/top250"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 "
                  "Safari/537.36"
}
resp = requests.get(url, headers=headers)
page_content = resp.text

# Parse the data: title, director, year, rating, and number of ratings
obj = re.compile(r'
  • .*?
    .*?(?P.*?)</span>.*?'</span> <span class="token string">r'<p class="">.*?导演: (?P<director>.*?) .*?<br>(?P<year>.*?) .*?'</span> <span class="token string">r'<span class="rating_num" property="v:average">(?P<average>.*?)</span>.*?'</span> <span class="token string">r'<span>(?P<people>.*?)</span>'</span><span class="token punctuation">,</span> re<span class="token punctuation">.</span>S<span class="token punctuation">)</span> <span class="token comment"># 开始匹配</span> result <span class="token operator">=</span> obj<span class="token punctuation">.</span>finditer<span class="token punctuation">(</span>page_content<span class="token punctuation">)</span> f <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"top250.csv"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"w"</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span> csvwriter <span class="token operator">=</span> csv<span class="token punctuation">.</span>writer<span class="token punctuation">(</span>f<span class="token punctuation">)</span> <span class="token keyword">for</span> it <span class="token keyword">in</span> result<span class="token punctuation">:</span> <span class="token comment"># print(it.group("title"))</span> <span class="token comment"># print(it.group("director").strip())</span> <span class="token comment"># print(it.group("year").strip())</span> <span class="token comment"># print('评分'+it.group("average"))</span> <span class="token comment"># print(it.group('people'))</span> dic <span class="token operator">=</span> it<span class="token punctuation">.</span>groupdict<span class="token punctuation">(</span><span class="token punctuation">)</span> dic<span class="token punctuation">[</span><span class="token 
string">'director'</span><span class="token punctuation">]</span> <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">'director'</span><span class="token punctuation">]</span><span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> dic<span class="token punctuation">[</span><span class="token string">'year'</span><span class="token punctuation">]</span> <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">'year'</span><span class="token punctuation">]</span><span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> csvwriter<span class="token punctuation">.</span>writerow<span class="token punctuation">(</span>dic<span class="token punctuation">.</span>values<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'success'</span><span class="token punctuation">)</span> </code></pre> <h3>2.5 屠戮电影天堂电影信息</h3> <ol> <li>定位到2021必看热片</li> <li>从2021必看热片中提取电影子页面的链接地址</li> <li>请求子页面中的链接地址,拿到我们想要的下载磁链接</li> </ol> <pre><code class="prism language-python"><span class="token keyword">import</span> requests <span class="token keyword">import</span> re <span class="token comment"># 定位阶段</span> domain <span class="token operator">=</span> <span class="token string">"https://dytt89.com/"</span> resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>domain<span class="token punctuation">,</span> verify<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token comment"># verify=False 去掉安全验证</span> resp<span class="token 
punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">'gb2312'</span> <span class="token comment"># 指定字符集</span> <span class="token comment"># print(resp.text)</span> <span class="token comment"># 提取阶段 拿到<ul>里面的<li></span> obj1 <span class="token operator">=</span> re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">r"2021必看热片.*?<ul>(?P<ul>.*?)</ul>"</span><span class="token punctuation">,</span> re<span class="token punctuation">.</span>S<span class="token punctuation">)</span> obj2 <span class="token operator">=</span> re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">r"<a href='(?P<href>.*?)'"</span><span class="token punctuation">,</span> re<span class="token punctuation">.</span>S<span class="token punctuation">)</span> obj3 <span class="token operator">=</span> re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">r'◎片  名 (?P<movie>.*?)<br />.*?<td style="WORD-WRAP: break-word" bgcolor="#fdfddf"><a href="('</span> <span class="token string">r'?P<download>.*?)"'</span><span class="token punctuation">,</span> re<span class="token punctuation">.</span>S<span class="token punctuation">)</span> result1 <span class="token operator">=</span> obj1<span class="token punctuation">.</span>finditer<span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span> child_href_list <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span> <span class="token keyword">for</span> it <span class="token keyword">in</span> result1<span class="token punctuation">:</span> ul <span class="token operator">=</span> it<span 
class="token punctuation">.</span>group<span class="token punctuation">(</span><span class="token string">'ul'</span><span class="token punctuation">)</span> <span class="token comment"># 提取子页面链接</span> result2 <span class="token operator">=</span> obj2<span class="token punctuation">.</span>finditer<span class="token punctuation">(</span>ul<span class="token punctuation">)</span> <span class="token keyword">for</span> itt <span class="token keyword">in</span> result2<span class="token punctuation">:</span> <span class="token comment"># 拼接子页面的url地址:域名+子页面地址</span> child_href <span class="token operator">=</span> domain <span class="token operator">+</span> itt<span class="token punctuation">.</span>group<span class="token punctuation">(</span><span class="token string">'href'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token string">'/'</span><span class="token punctuation">)</span> child_href_list<span class="token punctuation">.</span>append<span class="token punctuation">(</span>child_href<span class="token punctuation">)</span> <span class="token comment"># 把子页面链接存储起来</span> <span class="token comment"># 提取子页面内容</span> <span class="token keyword">for</span> href <span class="token keyword">in</span> child_href_list<span class="token punctuation">:</span> child_resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>href<span class="token punctuation">,</span> verify<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> child_resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">'gb2312'</span> result3 <span class="token operator">=</span> obj3<span class="token punctuation">.</span>search<span class="token punctuation">(</span>child_resp<span class="token 
punctuation">.</span>text<span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>result3<span class="token punctuation">.</span>group<span class="token punctuation">(</span><span class="token string">'movie'</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>result3<span class="token punctuation">.</span>group<span class="token punctuation">(</span><span class="token string">'download'</span><span class="token punctuation">)</span><span class="token punctuation">)</span> </code></pre> <h3>2.5 Bs解析前戏-Html语法规则</h3> <p>bs4解析比较简单,但是需要一定的html知识,然后再去使用bs4去提取,逻辑和编写难度就会非常简单清晰,有前端基础的可略过</p> <p>HTML(Hyper Text Markup Language)超文本标记语言,是我们编写网页的最基本也是最核心的一种语言。其语法规则就是用不同的标签对网页上的内容进行标记,从而使网页显示出不同的展示效果。</p> <pre><code class="prism language-html"><span class="token tag"><span class="token tag"><span class="token punctuation"><</span>h1</span><span class="token punctuation">></span></span> Hello World! <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>h1</span><span class="token punctuation">></span></span> </code></pre> <p>上述代码的含义是在页面显示“Hello World!”一句,但是这句话被</p> <h1>和</h1>标记了。白话就是括起来了,被H1标签括起来了。这个时候,浏览器在展示的时候就会让“Hello World!”这句话加粗加大,变为标题,所以HTML的语法就是用类似这样的标签对页面内容进行标记。不同的标签表现出来的效果也是不一样的。 <p></p> <pre><code class="prism language-html">h1:一级标题 h2:二级标题 p:段落 font:字体(已被废弃,但还能用) body:主体 </code></pre> <p>标签还有很多,这里就不一一列举。接下来是属性</p> <pre><code class="prism language-html"><span class="token tag"><span class="token tag"><span class="token punctuation"><</span>h1</span> <span class="token attr-name">align</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>center<span class="token punctuation">'</span></span><span class="token punctuation">></span></span> Hello World! 
<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>h1</span><span class="token punctuation">></span></span> <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>li</span> <span class="token attr-name">id</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>1<span class="token punctuation">'</span></span><span class="token punctuation">></span></span>a<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>li</span><span class="token punctuation">></span></span> <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>li</span> <span class="token attr-name">id</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>2<span class="token punctuation">'</span></span><span class="token punctuation">></span></span>b<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>li</span><span class="token punctuation">></span></span> <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>li</span> <span class="token attr-name">id</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>3<span class="token punctuation">'</span></span><span class="token punctuation">></span></span>c<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>li</span><span class="token punctuation">></span></span> </code></pre> <p>其中"align"就是标签属性,"center"就是属性值,后续的bs4解析就是可以根据id的属性值进行检索。</p> <h3>2.6 Bs4解析入门-搞搞菜价</h3> <p>首先pip install bs4安装模块</p> <ol> <li>拿到页面源代码</li> <li>使用bs4进行解析 拿到数据</li> </ol> <p>视频中的网站源代码已改变,因此这里选用的url是:http://www.bjtzh.gov.cn/bjtz/home/jrcj/index.shtml,最后结果类似</p> <pre><code class="prism language-python"><span class="token 
keyword">import</span> requests <span class="token keyword">from</span> bs4 <span class="token keyword">import</span> BeautifulSoup <span class="token keyword">import</span> csv url <span class="token operator">=</span> <span class="token string">"http://www.bjtzh.gov.cn/bjtz/home/jrcj/index.shtml"</span> resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">'utf-8'</span> f <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"vegetable_price.csv"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"w"</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span> csvwriter <span class="token operator">=</span> csv<span class="token punctuation">.</span>writer<span class="token punctuation">(</span>f<span class="token punctuation">)</span> <span class="token comment"># 解析数据</span> <span class="token comment"># 1.把页面源代码交给BeautifulSoup进行处理,生成bs对象</span> page <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">,</span> <span class="token string">"html.parser"</span><span class="token punctuation">)</span> <span class="token comment"># 指定html解析器</span> <span class="token comment"># 2.从bs对象中查找数据</span> <span class="token comment"># find(标签,属性=值) 只找第一个</span> <span class="token comment"># findall(标签,属性=值) 找到所有的</span> table <span class="token operator">=</span> page<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span 
class="token string">"table"</span><span class="token punctuation">,</span> attrs<span class="token operator">=</span><span class="token punctuation">{ </span> <span class="token string">"style"</span><span class="token punctuation">:</span> <span class="token string">"margin: 0px auto; width: 588px; height: 847px; border-collapse: collapse;"</span><span class="token punctuation">,</span> <span class="token string">"width"</span><span class="token punctuation">:</span> <span class="token string">"588"</span><span class="token punctuation">,</span> <span class="token string">"cellspacing"</span><span class="token punctuation">:</span> <span class="token string">"0"</span><span class="token punctuation">,</span> <span class="token string">"cellpadding"</span><span class="token punctuation">:</span> <span class="token string">"0"</span><span class="token punctuation">,</span> <span class="token string">"border"</span><span class="token punctuation">:</span> <span class="token string">"1"</span><span class="token punctuation">,</span> <span class="token string">"align"</span><span class="token punctuation">:</span> <span class="token string">"center"</span> <span class="token punctuation">}</span><span class="token punctuation">)</span> <span class="token comment"># 拿到所有数据行</span> trs <span class="token operator">=</span> table<span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token string">"tr"</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">7</span><span class="token punctuation">:</span><span class="token punctuation">]</span> <span class="token keyword">for</span> tr <span class="token keyword">in</span> trs<span class="token punctuation">:</span> <span class="token comment"># 每一行数据</span> tds <span class="token operator">=</span> tr<span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token 
string">'td'</span><span class="token punctuation">)</span> <span class="token comment"># 拿到每行数据中的td</span> name <span class="token operator">=</span> tds<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">.</span>text <span class="token comment"># .text表示拿到被标签标记的内容</span> kind <span class="token operator">=</span> tds<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">.</span>text high <span class="token operator">=</span> tds<span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">.</span>text low <span class="token operator">=</span> tds<span class="token punctuation">[</span><span class="token number">3</span><span class="token punctuation">]</span><span class="token punctuation">.</span>text csvwriter<span class="token punctuation">.</span>writerow<span class="token punctuation">(</span><span class="token punctuation">[</span>name<span class="token punctuation">,</span> kind<span class="token punctuation">,</span> high<span class="token punctuation">,</span> low<span class="token punctuation">]</span><span class="token punctuation">)</span> f<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'success!'</span><span class="token punctuation">)</span> </code></pre> <h3>2.7 Bs4解析案例-抓取优美图库图片</h3> <ol> <li>拿到主页面的源代码 提取子页面的链接地址 href</li> <li>通过href拿到子页面的内容,从子页面找到图片的下载地址 img->src</li> <li>下载图片</li> </ol> <pre><code class="prism language-python"><span class="token keyword">import</span> requests <span class="token keyword">from</span> bs4 <span class="token keyword">import</span> BeautifulSoup <span class="token 
keyword">import</span> time url_index <span class="token operator">=</span> <span class="token string">"https://umei.cc"</span> url <span class="token operator">=</span> <span class="token string">"https://umei.cc/bizhitupian/weimeibizhi/"</span> resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">"utf-8"</span> <span class="token comment"># 把源代码交给BeautifulSoup</span> main_page <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">,</span> <span class="token string">"html.parser"</span><span class="token punctuation">)</span> a_list <span class="token operator">=</span> main_page<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">"div"</span><span class="token punctuation">,</span> class_<span class="token operator">=</span><span class="token string">"TypeList"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token string">"a"</span><span class="token punctuation">)</span> <span class="token comment"># print(a_list)</span> <span class="token keyword">for</span> a <span class="token keyword">in</span> a_list<span class="token punctuation">:</span> href <span class="token operator">=</span> url_index <span class="token operator">+</span> a<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'href'</span><span class="token punctuation">)</span> <span class="token comment"># 直接通过get就可以直接拿到属性值</span> <span class="token comment"># 拿到子页面源代码</span> child_resp <span class="token operator">=</span> 
requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>href<span class="token punctuation">)</span> child_resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">"utf-8"</span> <span class="token comment"># 从子页面拿到图片下载链接</span> child_page <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>child_resp<span class="token punctuation">.</span>text<span class="token punctuation">,</span> <span class="token string">"html.parser"</span><span class="token punctuation">)</span> p <span class="token operator">=</span> child_page<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">"p"</span><span class="token punctuation">,</span> align<span class="token operator">=</span><span class="token string">"center"</span><span class="token punctuation">)</span> img <span class="token operator">=</span> p<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">"img"</span><span class="token punctuation">)</span> src <span class="token operator">=</span> img<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"src"</span><span class="token punctuation">)</span> <span class="token comment"># 下载图片</span> img_resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>src<span class="token punctuation">)</span> <span class="token comment"># img_resp.content # 这里拿到的是字节</span> img_name <span class="token operator">=</span> src<span class="token punctuation">.</span>split<span class="token punctuation">(</span><span class="token string">"/"</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token operator">-</span><span class="token number">1</span><span class="token 
punctuation">]</span> <span class="token comment"># 切割 拿到url中的最后一个/以后的内容</span> <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"Wallpaper/"</span><span class="token operator">+</span>img_name<span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">'wb'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span> f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>img_resp<span class="token punctuation">.</span>content<span class="token punctuation">)</span> <span class="token comment"># 图片内容写入文件</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"success!"</span><span class="token punctuation">,</span> img_name<span class="token punctuation">)</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span> <span class="token comment"># 防止访问过多服务器压力过大</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"all over"</span><span class="token punctuation">)</span> </code></pre> <h3>2.8 XPath入门</h3> <p>xpath是在XML文档中搜索内容的一门语言</p> <p>html是xml的一个子集</p> <p>安装lxml模块 pip install lxml</p> <pre><code class="prism language-python"><span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree xml <span class="token operator">=</span> <span class="token triple-quoted-string string">""" <book> <id>1</id> <name>野花遍地香</name> <price>1.23</price> <author> <nick id="10086">周大强</nick> <nick id="10010">周芷若</nick> <nick class="joy">周杰伦</nick> <nick class="jolin">蔡依林</nick> <div> <nick>rerererererer</nick> </div> <div> <nick>rerererererer2</nick> <div> <nick>rerererererer3</nick> </div> 
</div> </author> <partner> <nick id="ppc">胖胖陈</nick> <nick id="ppbc">胖胖不陈</nick> </partner> </book> """</span> tree <span class="token operator">=</span> etree<span class="token punctuation">.</span>XML<span class="token punctuation">(</span>xml<span class="token punctuation">)</span> <span class="token comment"># result = tree.xpath("/book") # /表示层级关系,第一个/是根节点</span> <span class="token comment"># result = tree.xpath("/book/name/text()") # text()表示拿文本</span> <span class="token comment"># result = tree.xpath("/book/author//nick/text()") # 后代 拿出nick里的文本以及三个rerere</span> <span class="token comment"># result = tree.xpath("/book/author/*/nick/text()") # *任意节点,通配符 只拿出re1,re2</span> result <span class="token operator">=</span> tree<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">"/book//nick/text()"</span><span class="token punctuation">)</span> <span class="token comment"># 拿出所有nick的文本</span> <span class="token keyword">print</span><span class="token punctuation">(</span>result<span class="token punctuation">)</span> </code></pre> <p>在html文件中,[]可以表示索引,索引为第几个,例如///</p> <ul> <li>[1]//text()表示第一条</li> <li>中标签的文字内容;</li> <li><p></p> <p>[]里面也可以表示为标签的属性筛选,例如///</p> </li> <li>/[@href=‘dapao’]/text(),表示href为“dapao”的标签的文字内容;</li> <li><p></p> <p>///</p> </li> <li>//@href可以单取a标签href的属性值。</li> <li><p></p> <p><strong>小技巧</strong>:可以从网页中按F12,页面源代码中可以快速复制xpath</p> <h3>2.9 Xpath实战 抓取猪八戒网信息</h3> <ol> <li>拿到页面源代码</li> <li>提取和解析数据</li> </ol> <p>在这里我搜索的是“小程序开发”,遇到许多视频中没有出现的问题,好在通过百度也算是解决了,如果有更好的解决方法麻烦大佬留言</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests <span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree <span class="token comment"># 我这搜索的是小程序开发,爬取过程中有许多不方便的,尽量尝试搜索英文</span> url <span class="token operator">=</span> <span class="token string">"https://beijing.zbj.com/search/f/?kw=%E5%B0%8F%E7%A8%8B%E5%BA%8F%E5%BC%80%E5%8F%91"</span> resp <span 
class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
<span class="token comment"># print(resp.text)</span>
<span class="token comment"># parse the page</span>
html <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span>
<span class="token comment"># grab the service-provider divs</span>
divs <span class="token operator">=</span> html<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">"/html/body/div[6]/div/div/div[3]/div[4]/div[1]/div"</span><span class="token punctuation">)</span>
<span class="token keyword">for</span> div <span class="token keyword">in</span> divs<span class="token punctuation">:</span> <span class="token comment"># one service provider per div</span>
    price_w <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[1]/div[2]/div[1]/span[1]/text()'</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> <span class="token keyword">not</span> price_w<span class="token punctuation">:</span> <span class="token comment"># some entries came back with an empty price, so skip them</span>
        <span class="token keyword">continue</span>
    price <span class="token operator">=</span> price_w<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>
    title <span class="token operator">=</span> <span class="token string">"小程序"</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span>div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[1]/div[2]/div[2]/p/text()'</span><span class="token punctuation">)</span><span
class="token punctuation">)</span>
    company <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[2]/div[1]/p/text()'</span><span class="token punctuation">)</span> <span class="token comment"># the result contains newline characters</span>
    company <span class="token operator">=</span> <span class="token builtin">list</span><span class="token punctuation">(</span><span class="token builtin">filter</span><span class="token punctuation">(</span><span class="token boolean">None</span><span class="token punctuation">,</span> <span class="token punctuation">[</span>x<span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">for</span> x <span class="token keyword">in</span> company<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token comment"># strip the newlines, then drop the empty strings left in the list</span>
    location <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[2]/div[1]/div/span/text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>title<span class="token punctuation">,</span> price<span class="token punctuation">,</span> company<span class="token punctuation">,</span> location<span class="token punctuation">)</span>
</code></pre> <h2>Chapter 3: Advanced Requests</h2> <h3>3.1 Advanced Requests: Overview</h3> <p>We have in fact already used headers in the earlier crawlers. Headers are the request headers of the HTTP protocol. They generally carry data that is not part of the request content itself, and sometimes security or verification information as well, such as the familiar User-Agent, token and cookie.</p>
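<p>As a rough illustration of where these fields end up, the sketch below builds a request object with the standard library's urllib.request.Request and prints its headers without sending anything over the network. The build_request helper, the User-Agent string and the cookie value are all made up for this example; with requests, the same dictionary would be passed through the headers= argument.</p>

```python
from urllib.request import Request


def build_request(url, cookie=None):
    """Return an unsent request carrying browser-like headers (hypothetical helper)."""
    headers = {
        # Illustrative User-Agent; copy the real one from your own browser.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }
    if cookie:
        # Verification data such as a cookie travels in the headers as well.
        headers["Cookie"] = cookie
    return Request(url, headers=headers)


req = build_request("http://www.baidu.com", cookie="sessionid=abc123")  # made-up cookie
for name, value in req.header_items():
    # urllib stores header names via str.capitalize(), e.g. "User-agent".
    print(name, "=>", value)
```

<p>Nothing is sent here, so the snippet is safe to run offline; it only shows that the header dictionary is attached to the request before it goes out.</p>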
<p>For requests sent through requests, we can put the header information in headers, or pass some of it separately; requests then assembles everything into the complete HTTP request header for us.</p> <p>This chapter covers:</p> <ol> <li>Simulating a browser login -> handling cookies</li> <li>Hotlink protection -> scraping Pear Video (梨视频) data</li> <li>Proxies -> avoiding IP bans</li> </ol> <p>Comprehensive exercise: scraping NetEase Cloud Music comments</p> <h3>3.2 Handling Cookies: Logging in to a Novel Site</h3> <p>Log in -> get the cookie</p> <p>Request the bookshelf URL with that cookie -> the bookshelf contents</p> <p>These two steps have to be chained together. We can send both requests through a session: a session can be thought of as a series of related requests, and the cookie is not lost along the way.</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests

<span class="token comment"># session</span>
session <span class="token operator">=</span> requests<span class="token punctuation">.</span>session<span class="token punctuation">(</span><span class="token punctuation">)</span>
data <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">"loginName"</span><span class="token punctuation">:</span> <span class="token string">"13757696746"</span><span class="token punctuation">,</span>
    <span class="token string">"password"</span><span class="token punctuation">:</span> <span class="token string">"123qweasdzxc"</span>
<span class="token punctuation">}</span>

<span class="token comment"># 1. log in</span>
url <span class="token operator">=</span> <span class="token string">"https://passport.17k.com/ck/user/login"</span>
resp <span class="token operator">=</span> session<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> data<span class="token operator">=</span>data<span class="token punctuation">)</span>

<span class="token comment"># 2. fetch the bookshelf data</span>
resp_b <span class="token operator">=</span> session<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919"</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>resp_b<span class="token punctuation">.</span>json<span class="token
punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>

<span class="token comment"># alternative approach: send the cookie yourself</span>
resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919"</span><span class="token punctuation">,</span> headers<span class="token operator">=</span><span class="token punctuation">{</span>
    <span class="token string">"Cookie"</span><span class="token punctuation">:</span> <span class="token string">"cookie copied from the browser"</span>
<span class="token punctuation">}</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
</code></pre> <h3>3.3 Hotlink Protection: Scraping Pear Video (梨视频)</h3> <p>The video URL does not appear in the page source, so the link is presumably generated by JS. Capturing the network traffic turns up a URL that looks very much like the video link, so the real link has to be pieced together from it:</p> <ol> <li>Get the contID</li> <li>Get the JSON returned by videoStatus -> srcURL</li> <li>Patch up the contents of srcURL</li> <li>Download the video</li> </ol> <p><strong>What is hotlink protection</strong>: it is about tracing where a request came from. The pages form a chain during the request process, and the site requires that you arrive at the second page from the first one; visiting the second page directly will not work. The value checked (the Referer header) is simply the page one level above the current page.</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests

url <span class="token operator">=</span> <span class="token string">"https://www.pearvideo.com/video_1738675"</span>
contID <span class="token operator">=</span> url<span class="token punctuation">.</span>split<span class="token punctuation">(</span><span class="token string">"_"</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span>

videoStatusUrl <span class="token operator">=</span> <span class="token string-interpolation"><span class="token
string">f"https://www.pearvideo.com/videoStatus.jsp?contId=</span><span class="token interpolation"><span class="token punctuation">{</span>contID<span class="token punctuation">}</span></span><span class="token string">&mrd=0.5611111607819312"</span></span>

headers <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">"User-Agent"</span><span class="token punctuation">:</span> <span class="token string">"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 "</span>
                  <span class="token string">"Safari/537.36 "</span><span class="token punctuation">,</span>
    <span class="token comment"># hotlink protection: tell the server which page we came from</span>
    <span class="token string">"Referer"</span><span class="token punctuation">:</span> url
<span class="token punctuation">}</span>

resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>videoStatusUrl<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers<span class="token punctuation">)</span>
dic <span class="token operator">=</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
srcUrl <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">"videoInfo"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"videos"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"srcUrl"</span><span class="token punctuation">]</span>
systemTime <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">"systemTime"</span><span class="token punctuation">]</span>
srcUrl <span class="token operator">=</span> srcUrl<span class="token punctuation">.</span>replace<span class="token
punctuation">(</span>systemTime<span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"cont-</span><span class="token interpolation"><span class="token punctuation">{</span>contID<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">)</span>

<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"videos/a.mp4"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">'wb'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
    f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>srcUrl<span class="token punctuation">)</span><span class="token punctuation">.</span>content<span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"success!"</span><span class="token punctuation">)</span>
</code></pre> <h3>3.4 Proxies</h3> <p>How it works: the request is sent through a third-party machine instead of directly from your own.</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests

<span class="token comment"># 36.112.139.146</span>
proxies <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">"http"</span><span class="token punctuation">:</span> <span class="token string">"http://36.112.139.146:3128"</span>
<span class="token punctuation">}</span>

resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://www.baidu.com"</span><span class="token punctuation">,</span> proxies<span class="token
operator">=</span>proxies<span class="token punctuation">)</span>
resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">"utf-8"</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span>
</code></pre> <h3>3.5 Comprehensive Exercise: Scraping NetEase Cloud Music Comments</h3> <ol> <li>Find the unencrypted parameters</li> <li>Work out how to encrypt them (this has to follow NetEase's own logic): params => encText, encSecKey => encSecKey</li> <li>Send the request to NetEase and get the comments</li> </ol> <p>The scraping ran into extremely elaborate data encryption. After capturing the hot-comments request in the Network panel, you can see that the request's data is encrypted. The Initiator tab lists the JS calls that produced the request. Click the first entry, which is the JS that ran last, to view its code; mark that line with a breakpoint and step forward until the matching URL comes up, and the Scope pane on the right shows the encrypted data under Local. From there we can work backwards to find the line where the encryption happens: walk back through the Call Stack pane on the right, checking frame by frame whether the data under Local is already encrypted. Eventually we find that at the u0x.be1x step the data is still unencrypted, so this piece of JS is presumably where the data gets encrypted. Note: the variable names in the JS file change on every refresh.</p> <pre><code class="prism language-js"> u9l<span class="token punctuation">.</span><span class="token function-variable function">be9V</span> <span class="token operator">=</span> <span class="token keyword">function</span><span class="token punctuation">(</span><span class="token parameter"><span class="token constant">Y9P</span><span class="token punctuation">,</span> e9f</span><span class="token punctuation">)</span> <span class="token punctuation">{ </span> <span class="token keyword">var</span> i9b <span class="token operator">=</span> <span class="token punctuation">{ </span><span class="token punctuation">}</span> <span class="token punctuation">,</span> e9f <span class="token operator">=</span> <span class="token constant">NEJ</span><span class="token punctuation">.</span><span class="token constant">X</span><span class="token punctuation">(</span><span class="token punctuation">{ </span><span class="token punctuation">}</span><span class="token punctuation">,</span> e9f<span class="token punctuation">)</span> <span class="token punctuation">,</span> mo3x <span class="token operator">=</span> <span class="token constant">Y9P</span><span
class="token punctuation">.</span><span class="token function">indexOf</span><span class="token punctuation">(</span><span class="token string">"?"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>window<span class="token punctuation">.</span>GEnc <span class="token operator">&&</span> <span class="token regex"><span class="token regex-delimiter">/</span><span class="token regex-source language-regex">(^|\.com)\/api</span><span class="token regex-delimiter">/</span></span><span class="token punctuation">.</span><span class="token function">test</span><span class="token punctuation">(</span><span class="token constant">Y9P</span><span class="token punctuation">)</span> <span class="token operator">&&</span> <span class="token operator">!</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>headers <span class="token operator">&&</span> e9f<span class="token punctuation">.</span>headers<span class="token punctuation">[</span>eu0x<span class="token punctuation">.</span>Bl8d<span class="token punctuation">]</span> <span class="token operator">==</span> eu0x<span class="token punctuation">.</span>Io0x<span class="token punctuation">)</span> <span class="token operator">&&</span> <span class="token operator">!</span>e9f<span class="token punctuation">.</span>noEnc<span class="token punctuation">)</span> <span class="token punctuation">{ </span> <span class="token keyword">if</span> <span class="token punctuation">(</span>mo3x <span class="token operator">!=</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">)</span> <span class="token punctuation">{ </span> i9b <span class="token operator">=</span> j9a<span class="token punctuation">.</span><span class="token function">gX1x</span><span class="token punctuation">(</span><span class="token constant">Y9P</span><span 
class="token punctuation">.</span><span class="token function">substring</span><span class="token punctuation">(</span>mo3x <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">substring</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> mo3x<span class="token punctuation">)</span> <span class="token punctuation">}</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token punctuation">{ </span> i9b <span class="token operator">=</span> <span class="token constant">NEJ</span><span class="token punctuation">.</span><span class="token constant">X</span><span class="token punctuation">(</span>i9b<span class="token punctuation">,</span> j9a<span class="token punctuation">.</span><span class="token function">fP1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token operator">?</span> j9a<span class="token punctuation">.</span><span class="token function">gX1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token operator">:</span> e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token punctuation">}</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token 
punctuation">{ </span> i9b <span class="token operator">=</span> <span class="token constant">NEJ</span><span class="token punctuation">.</span><span class="token constant">X</span><span class="token punctuation">(</span>i9b<span class="token punctuation">,</span> j9a<span class="token punctuation">.</span><span class="token function">fP1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token operator">?</span> j9a<span class="token punctuation">.</span><span class="token function">gX1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token operator">:</span> e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token punctuation">}</span> i9b<span class="token punctuation">[</span><span class="token string">"csrf_token"</span><span class="token punctuation">]</span> <span class="token operator">=</span> u9l<span class="token punctuation">.</span><span class="token function">gP1x</span><span class="token punctuation">(</span><span class="token string">"__csrf"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">replace</span><span class="token punctuation">(</span><span class="token string">"api"</span><span class="token punctuation">,</span> <span class="token string">"weapi"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> e9f<span class="token punctuation">.</span>method <span class="token operator">=</span> <span class="token string">"post"</span><span class="token punctuation">;</span> <span class="token keyword">delete</span> e9f<span class="token 
punctuation">.</span>query<span class="token punctuation">;</span> <span class="token keyword">var</span> bUG7z <span class="token operator">=</span> window<span class="token punctuation">.</span><span class="token function">asrsea</span><span class="token punctuation">(</span><span class="token constant">JSON</span><span class="token punctuation">.</span><span class="token function">stringify</span><span class="token punctuation">(</span>i9b<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"流泪"</span><span class="token punctuation">,</span> <span class="token string">"强"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token constant">WU8M</span><span class="token punctuation">.</span>md<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"爱心"</span><span class="token punctuation">,</span> <span class="token string">"女孩"</span><span class="token punctuation">,</span> <span class="token string">"惊恐"</span><span class="token punctuation">,</span> <span class="token string">"大笑"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> e9f<span class="token punctuation">.</span>data <span class="token operator">=</span> j9a<span class="token punctuation">.</span><span class="token function">cs0x</span><span class="token punctuation">(</span><span class="token punctuation">{ </span> params<span class="token operator">:</span> bUG7z<span 
class="token punctuation">.</span>encText<span class="token punctuation">,</span> encSecKey<span class="token operator">:</span> bUG7z<span class="token punctuation">.</span>encSecKey <span class="token punctuation">}</span><span class="token punctuation">)</span> <span class="token punctuation">}</span> <span class="token keyword">var</span> cdnHost <span class="token operator">=</span> <span class="token string">"y.music.163.com"</span><span class="token punctuation">;</span> <span class="token keyword">var</span> apiHost <span class="token operator">=</span> <span class="token string">"interface.music.163.com"</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>location<span class="token punctuation">.</span>host <span class="token operator">===</span> cdnHost<span class="token punctuation">)</span> <span class="token punctuation">{ </span> <span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">replace</span><span class="token punctuation">(</span>cdnHost<span class="token punctuation">,</span> apiHost<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">match</span><span class="token punctuation">(</span><span class="token regex"><span class="token regex-delimiter">/</span><span class="token regex-source language-regex">^\/(we)?api</span><span class="token regex-delimiter">/</span></span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{ </span> <span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token string">"//"</span> <span class="token 
operator">+</span> apiHost <span class="token operator">+</span> <span class="token constant">Y9P</span> <span class="token punctuation">}</span> e9f<span class="token punctuation">.</span>cookie <span class="token operator">=</span> <span class="token boolean">true</span> <span class="token punctuation">}</span> <span class="token function">cwR2x</span><span class="token punctuation">(</span><span class="token constant">Y9P</span><span class="token punctuation">,</span> e9f<span class="token punctuation">)</span> <span class="token punctuation">}</span> </code></pre> <p>The process is fairly involved, so it is best to follow along with the video.</p> <p>Stepping through this function line by line, we find:</p> <pre><code class="prism language-js"><span class="token keyword">var</span> bUG7z <span class="token operator">=</span> window<span class="token punctuation">.</span><span class="token function">asrsea</span><span class="token punctuation">(</span><span class="token constant">JSON</span><span class="token punctuation">.</span><span class="token function">stringify</span><span class="token punctuation">(</span>i9b<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"流泪"</span><span class="token punctuation">,</span> <span class="token string">"强"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token constant">WU8M</span><span class="token punctuation">.</span>md<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"爱心"</span><span class="token punctuation">,</span> <span class="token string">"女孩"</span><span class="token
punctuation">,</span> <span class="token string">"惊恐"</span><span class="token punctuation">,</span> <span class="token string">"大笑"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> </code></pre> <p>This is the line where the encryption starts. Looking closely, the request data is replaced with params => encText and encSecKey => encSecKey, so the next step is to find the window.asrsea() function. A search shows that its value comes entirely from the statement window.asrsea = d, and scrolling up from there leads to the definition of d:</p> <pre><code class="prism language-js"> <span class="token keyword">function</span> <span class="token function">d</span><span class="token punctuation">(</span><span class="token parameter">d<span class="token punctuation">,</span> e<span class="token punctuation">,</span> f<span class="token punctuation">,</span> g</span><span class="token punctuation">)</span> <span class="token punctuation">{ </span> <span class="token keyword">var</span> h <span class="token operator">=</span> <span class="token punctuation">{ </span><span class="token punctuation">}</span> <span class="token punctuation">,</span> i <span class="token operator">=</span> <span class="token function">a</span><span class="token punctuation">(</span><span class="token number">16</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span> h<span class="token punctuation">.</span>encText <span class="token operator">=</span> <span class="token function">b</span><span class="token punctuation">(</span>d<span class="token punctuation">,</span> g<span class="token punctuation">)</span><span class="token punctuation">,</span> h<span class="token punctuation">.</span>encText <span class="token operator">=</span> <span class="token function">b</span><span class="token punctuation">(</span>h<span class="token punctuation">.</span>encText<span class="token punctuation">,</span> i<span class="token punctuation">)</span><span class="token punctuation">,</span> h<span class="token
punctuation">.</span>encSecKey <span class="token operator">=</span> <span class="token function">c</span><span class="token punctuation">(</span>i<span class="token punctuation">,</span> e<span class="token punctuation">,</span> f<span class="token punctuation">)</span><span class="token punctuation">,</span> h <span class="token punctuation">}</span> </code></pre> <p>Of d()'s four parameters, d is the data; running it a few times in the console shows that e is the fixed value 010001; f is a very long, unreadable string; and g is also fixed, "0CoJUm6Qyw8W8jud".</p> <p>From these values we can work out what d() actually does: encText is the data passed through b() twice, first with g and then with i, and encSecKey is c(i, e, f). I won't go through the rest of the analysis in detail, apart from noting that a() returns a random 16-character string.</p> <p>Here I scraped the users' nicknames and comments; for the exact steps, watch the video on Bilibili.</p> <pre><code class="prism language-python"><span class="token keyword">from</span> Crypto<span class="token punctuation">.</span>Cipher <span class="token keyword">import</span> AES
<span class="token keyword">from</span> base64 <span class="token keyword">import</span> b64encode
<span class="token keyword">import</span> requests
<span class="token keyword">import</span> json

url <span class="token operator">=</span> <span class="token string">"https://music.163.com/weapi/comment/resource/comments/get?csrf_token="</span>

<span class="token comment"># request method: POST</span>
data <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">"csrf_token"</span><span class="token punctuation">:</span> <span class="token string">""</span><span class="token punctuation">,</span>
    <span class="token string">"cursor"</span><span class="token punctuation">:</span> <span class="token string">"-1"</span><span class="token punctuation">,</span>
    <span class="token string">"offset"</span><span class="token punctuation">:</span> <span class="token string">"0"</span><span class="token punctuation">,</span>
    <span class="token string">"orderType"</span><span class="token punctuation">:</span> <span class="token string">"1"</span><span class="token punctuation">,</span>
    <span class="token string">"pageNo"</span><span class="token punctuation">:</span> <span class="token string">"1"</span><span class="token
punctuation">,</span>
    <span class="token string">"pageSize"</span><span class="token punctuation">:</span> <span class="token string">"20"</span><span class="token punctuation">,</span>
    <span class="token string">"rid"</span><span class="token punctuation">:</span> <span class="token string">"R_SO_4_65538"</span><span class="token punctuation">,</span>
    <span class="token string">"threadId"</span><span class="token punctuation">:</span> <span class="token string">"R_SO_4_65538"</span>
<span class="token punctuation">}</span>

<span class="token comment"># values used by d()</span>
e <span class="token operator">=</span> <span class="token string">"010001"</span>
f <span class="token operator">=</span> <span class="token string">"00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e "</span>
g <span class="token operator">=</span> <span class="token string">"0CoJUm6Qyw8W8jud"</span>
i <span class="token operator">=</span> <span class="token string">"7HCsoSguhIA6SpNw"</span> <span class="token comment"># fixed by hand; the JS function generates it randomly</span>

encSecKey <span class="token operator">=</span> <span class="token string">"21fb180e564113d59d37865081a91daf1f775fb67ef063dc046bda9966613ea4a384b597e11ce05c442df9dfa8538347c58aa87d9be92636fbda399b28f04bbf31e91751e25f359a05538b8d5c51999a03e1348e21cbe90fbfa54d013399c0ab240e41c73750ef463542fe5c14637db16abeffa8a2ab74027e085aa570c01395"</span>


<span class="token comment"># pad the data to a multiple of 16 characters for the AES encryption below</span>
<span class="token keyword">def</span> <span class="token function">to_16</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">:</span>
    pad <span class="token operator">=</span> <span class="token number">16</span> <span class="token operator">-</span> <span class="token builtin">len</span><span
class="token punctuation">(</span>data<span class="token punctuation">)</span> <span class="token operator">%</span> <span class="token number">16</span>
    data <span class="token operator">+=</span> <span class="token builtin">chr</span><span class="token punctuation">(</span>pad<span class="token punctuation">)</span> <span class="token operator">*</span> pad
    <span class="token keyword">return</span> data


<span class="token keyword">def</span> <span class="token function">get_encSecKey</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># i is fixed, so the result of c(), and therefore encSecKey, is fixed as well</span>
    <span class="token keyword">return</span> encSecKey


<span class="token keyword">def</span> <span class="token function">get_params</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># data is assumed to arrive as a string</span>
    first <span class="token operator">=</span> enc_params<span class="token punctuation">(</span>data<span class="token punctuation">,</span> g<span class="token punctuation">)</span>
    second <span class="token operator">=</span> enc_params<span class="token punctuation">(</span>first<span class="token punctuation">,</span> i<span class="token punctuation">)</span>
    <span class="token keyword">return</span> second <span class="token comment"># this return value is params</span>


<span class="token keyword">def</span> <span class="token function">enc_params</span><span class="token punctuation">(</span>data<span class="token punctuation">,</span> key<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># the encryption step</span>
    <span class="token comment"># the AES module comes from an extra package that has to be installed (e.g. pycryptodome)</span>
    iv <span class="token operator">=</span> <span class="token string">"0102030405060708"</span>
    data <span class="token operator">=</span> to_16<span class="token
punctuation">(</span>data<span class="token punctuation">)</span>
    aes <span class="token operator">=</span> AES<span class="token punctuation">.</span>new<span class="token punctuation">(</span>key<span class="token operator">=</span>key<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">"utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> IV<span class="token operator">=</span>iv<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span><span class="token punctuation">,</span> mode<span class="token operator">=</span>AES<span class="token punctuation">.</span>MODE_CBC<span class="token punctuation">)</span>  <span class="token comment"># create the cipher</span>
    bs <span class="token operator">=</span> aes<span class="token punctuation">.</span>encrypt<span class="token punctuation">(</span>data<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># encrypt; the input length must be a multiple of 16</span>
    <span class="token keyword">return</span> <span class="token builtin">str</span><span class="token punctuation">(</span>b64encode<span class="token punctuation">(</span>bs<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">'utf-8'</span><span class="token punctuation">)</span>  <span class="token comment"># return the result as a string</span>


<span class="token comment"># How the site's JavaScript performs the encryption (annotated)</span>
<span class="token triple-quoted-string string">"""
    function a(a) {  # returns a random 16-character string
        var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
        for (d = 0; a > d; d += 1)  # loops 16 times
            e = Math.random() * b.length,  # random number
            e = Math.floor(e),  # round down
            c += b.charAt(e);  # take the character at that position
        return c
    }
    function b(a, b) {  # a is the content to encrypt
        var c = CryptoJS.enc.Utf8.parse(b)  # b is the key
          , d = CryptoJS.enc.Utf8.parse("0102030405060708")
          , e = CryptoJS.enc.Utf8.parse(a)  # e is the data
          , f = CryptoJS.AES.encrypt(e, c, {  # c is the encryption key
            iv: d,  # initialization vector
            mode: CryptoJS.mode.CBC  # mode: CBC
        });
        return f.toString()
    }
    function c(a, b, c) {
        var d, e;
        return setMaxDigits(131),
        d = new RSAKeyPair(b,"",c),
        e = encryptedString(d, a)
    }
    function d(d, e, f, g) {
        var h = {}  # starts out empty
          , i = a(16);  # i is a random 16-character value; here we pin it to a fixed value
        return h.encText = b(d, g),  # g is the key
        h.encText = b(h.encText, i),  # this is params; i is used as the key too
        h.encSecKey = c(i, e, f),  # this is encSecKey; e and f are constants, so with i fixed the result is fixed
        h
    }
    function e(a, b, d, e) {
        var f = {};
        return f.encText = c(a + e, b, d),
        f
    }

    Two rounds of encryption:
    data + g => b => first ciphertext + i => b => params
"""</span>


<span class="token comment"># send the request and read the comments from the response</span>
resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> data<span class="token operator">=</span><span class="token punctuation">{</span>
    <span class="token string">"params"</span><span class="token punctuation">:</span> get_params<span class="token punctuation">(</span>json<span class="token punctuation">.</span>dumps<span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
    <span class="token string">"encSecKey"</span><span class="token punctuation">:</span> get_encSecKey<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">}</span><span class="token punctuation">)</span>
dic <span class="token operator">=</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
hotComments <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">'data'</span><span class="token 
punctuation">]</span><span class="token punctuation">[</span><span class="token string">"hotComments"</span><span class="token punctuation">]</span>
<span class="token keyword">for</span> i <span class="token keyword">in</span> hotComments<span class="token punctuation">:</span>
    username <span class="token operator">=</span> i<span class="token punctuation">[</span><span class="token string">"user"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"nickname"</span><span class="token punctuation">]</span>
    content <span class="token operator">=</span> i<span class="token punctuation">[</span><span class="token string">"content"</span><span class="token punctuation">]</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>username<span class="token punctuation">,</span> <span class="token string">":"</span><span class="token punctuation">,</span> content<span class="token punctuation">)</span>
</code></pre>
<h2>Chapter 4 Async</h2>
<h3>4.1 Chapter 4 Overview</h3>
<p>So far we can handle the basic scraping workflow, but the scraping is still not efficient enough. How can we speed it up? We can build an asynchronous crawler with multithreading, multiprocessing, or coroutines.</p>
<p>What does asynchronous mean here? Suppose we have 10,000 records to fetch. Fetching them one at a time takes a long time; an asynchronous crawler keeps several requests in flight at once, so many records are fetched over the same period.</p>
<p>In this chapter:</p>
<ol>
<li>Multithreading in a nutshell</li>
<li>Multiprocessing in a nutshell</li>
<li>Thread pools and process pools</li>
<li>Scraping all of Xinfadi</li>
<li>Coroutines</li>
<li>Running multiple tasks with async coroutines</li>
<li>The aiohttp module in detail</li>
<li>Scraping an entire novel</li>
<li>Comprehensive exercise: downloading a movie</li>
</ol>
<h3>4.2 Multithreading</h3>
<ul>
<li>A process is the unit of resource allocation; every process has at least one thread</li>
<li>A thread is the unit of execution</li>
</ul>
<p>First style: pass a function to Thread as its target</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread


<span class="token keyword">def</span> <span class="token function">func</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token 
number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"func "</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    t <span class="token operator">=</span> Thread<span class="token punctuation">(</span>target<span class="token operator">=</span>func<span class="token punctuation">)</span>  <span class="token comment"># create a thread and give it a task: like hiring a worker, with target= naming the job</span>
    t<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># mark the thread as ready to run; when it actually runs is decided by the CPU scheduler</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"main"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>
</code></pre>
<p>Second style: subclass Thread and override run()</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread


<span class="token keyword">class</span> <span class="token class-name">MyThread</span><span class="token punctuation">(</span>Thread<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">run</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token 
keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"child thread"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    t <span class="token operator">=</span> MyThread<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token comment"># t.run()  # a direct method call: everything would still run in a single thread</span>
    t<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># start the thread</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"main thread"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>
</code></pre>
<h3>4.3 Multiprocessing</h3>
<p>Multiprocessing code is written almost exactly like multithreading code.</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> multiprocessing <span class="token keyword">import</span> Process


<span class="token keyword">def</span> <span class="token function">fuc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span 
class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"child process"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    p <span class="token operator">=</span> Process<span class="token punctuation">(</span>target<span class="token operator">=</span>fuc<span class="token punctuation">)</span>
    p<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"main process"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>
</code></pre>
<p>How do we tell two workers apart? Pass each one a name through args. The example below uses threads; Process accepts args in the same way.</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread


<span class="token keyword">def</span> <span class="token function">fuc</span><span class="token punctuation">(</span>name<span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># print the name that was passed in</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token 
punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> i<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    t1 <span class="token operator">=</span> Thread<span class="token punctuation">(</span>target<span class="token operator">=</span>fuc<span class="token punctuation">,</span> args<span class="token operator">=</span><span class="token punctuation">(</span><span class="token string">"周杰伦"</span><span class="token punctuation">,</span><span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># args must be a tuple (note the trailing comma)</span>
    t1<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
    t2 <span class="token operator">=</span> Thread<span class="token punctuation">(</span>target<span class="token operator">=</span>fuc<span class="token punctuation">,</span> args<span class="token operator">=</span><span class="token punctuation">(</span><span class="token string">"王力宏"</span><span class="token punctuation">,</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    t2<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<h3>4.4 Introduction to Thread Pools and Process Pools</h3>
<p>Thread pool: a set of threads is created up front. We simply submit tasks to the pool, and the pool takes care of scheduling them onto its threads.</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> concurrent<span class="token punctuation">.</span>futures <span class="token keyword">import</span> ThreadPoolExecutor<span class="token punctuation">,</span> ProcessPoolExecutor
<span class="token comment"># ThreadPoolExecutor is backed by threads, ProcessPoolExecutor by processes; pick whichever fits</span>


<span class="token 
keyword">def</span> <span class="token function">fn</span><span class="token punctuation">(</span>name<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> i<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># create the thread pool</span>
    <span class="token keyword">with</span> ThreadPoolExecutor<span class="token punctuation">(</span><span class="token number">50</span><span class="token punctuation">)</span> <span class="token keyword">as</span> t<span class="token punctuation">:</span>  <span class="token comment"># a pool of up to 50 worker threads</span>
        <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">100</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            t<span class="token punctuation">.</span>submit<span class="token punctuation">(</span>fn<span class="token punctuation">,</span> name<span class="token operator">=</span><span class="token string-interpolation"><span class="token string">f"thread </span><span class="token interpolation"><span class="token punctuation">{</span>i<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">)</span>
    <span class="token comment"># leaving the with block waits until every task in the pool has finished; only then does execution continue</span>
    <span class="token keyword">print</span><span class="token 
punctuation">(</span><span class="token string">"Done"</span><span class="token punctuation">)</span>
</code></pre>
<h3>4.5 Thread Pool Case Study: Scraping Xinfadi Vegetable Prices</h3>
<ol>
<li>Work out how to extract the data from a single page</li>
<li>Bring in a thread pool to fetch many pages at the same time</li>
</ol>
<p>The site has been updated since the video was made: the data is no longer embedded in the page source but is returned as JSON, so the code below differs from the code in the video.</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> requests
<span class="token keyword">import</span> csv
<span class="token keyword">from</span> concurrent<span class="token punctuation">.</span>futures <span class="token keyword">import</span> ThreadPoolExecutor

f <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"data.csv"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"w"</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">"utf-8"</span><span class="token punctuation">,</span> newline<span class="token operator">=</span><span class="token string">""</span><span class="token punctuation">)</span>
csvwriter <span class="token operator">=</span> csv<span class="token punctuation">.</span>writer<span class="token punctuation">(</span>f<span class="token punctuation">)</span>


<span class="token keyword">def</span> <span class="token function">download_one_page</span><span class="token punctuation">(</span>page<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># fetch the data for one page (returned as JSON)</span>
    url <span class="token operator">=</span> <span class="token string">"http://www.xinfadi.com.cn/getPriceData.html"</span>
    data <span class="token operator">=</span> <span class="token punctuation">{</span>
        <span class="token string">"limit"</span><span class="token punctuation">:</span> <span class="token string">"20"</span><span class="token punctuation">,</span>
        <span class="token string">"current"</span><span class="token 
punctuation">:</span> <span class="token string-interpolation"><span class="token string">f"</span><span class="token interpolation"><span class="token punctuation">{</span>page<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">,</span>  <span class="token comment"># which page to request</span>
    <span class="token punctuation">}</span>
    resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> data<span class="token operator">=</span>data<span class="token punctuation">)</span>
    <span class="token keyword">for</span> txt <span class="token keyword">in</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token string">"list"</span><span class="token punctuation">]</span><span class="token punctuation">:</span>
        <span class="token comment"># pick out the fields we need</span>
        dic <span class="token operator">=</span> <span class="token punctuation">[</span>txt<span class="token punctuation">[</span><span class="token string">"prodName"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"prodCat"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"lowPrice"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"highPrice"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"place"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> 
               txt<span class="token punctuation">[</span><span class="token string">"pubDate"</span><span class="token punctuation">]</span><span class="token punctuation">]</span>
        <span class="token comment"># write the row to the file</span>
        csvwriter<span class="token punctuation">.</span>writerow<span class="token punctuation">(</span>dic<span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"page </span><span class="token interpolation"><span class="token punctuation">{</span>page<span class="token punctuation">}</span></span><span class="token string"> downloaded"</span></span><span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># for i in range(1, 17712):  # one page at a time: extremely slow</span>
    <span class="token comment">#     download_one_page(i)</span>

    <span class="token comment"># create the thread pool</span>
    <span class="token keyword">with</span> ThreadPoolExecutor<span class="token punctuation">(</span><span class="token number">50</span><span class="token punctuation">)</span> <span class="token keyword">as</span> t<span class="token punctuation">:</span>
        <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token number">200</span><span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># page numbers start at 1, as in the sequential loop above</span>
            <span class="token comment"># submit the download task to the pool</span>
            t<span class="token punctuation">.</span>submit<span class="token punctuation">(</span>download_one_page<span class="token punctuation">,</span> i<span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"all pages downloaded"</span><span class="token 
punctuation">)</span>
</code></pre>
<h3>4.6 Coroutines</h3>
<h4>4.6.1 The Coroutine Concept</h4>
<p>When the code calls time.sleep(), the current thread is blocked and the CPU is not doing any work for us.</p>
<p>Likewise, the program is blocked while it waits on input().</p>
<p>With requests.get(url), the program is also blocked until the network request returns its data.</p>
<p>In general, whenever a program is performing an I/O operation, its thread is blocked.</p>
<p><strong>Coroutine</strong>: when the program hits an I/O operation, it can choose to switch to another task. At the micro level, a single thread switches from one task to the next, and the trigger for a switch is usually an I/O operation; at the macro level, what we see is several tasks running at the same time.</p>
<h4>4.6.2 Running Multiple Tasks Asynchronously</h4>
<pre><code class="prism language-python"><span class="token keyword">import</span> asyncio
<span class="token keyword">import</span> time


<span class="token comment"># async def func():</span>
<span class="token comment">#     print("Hello, my name is 赛利亚")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># if __name__ == '__main__':</span>
<span class="token comment">#     g = func()  # func is an async coroutine function, so calling it returns a coroutine object</span>
<span class="token comment">#     asyncio.run(g)  # running a coroutine needs the asyncio module</span>


<span class="token comment"># async def func1():</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#     # time.sleep(3)  # a synchronous call here would break the asynchrony</span>
<span class="token comment">#     await asyncio.sleep(3)  # asynchronous wait: control switches to another task while this one sleeps</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func2():</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#     # time.sleep(4)</span>
<span class="token comment">#     await asyncio.sleep(4)</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func3():</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#     # time.sleep(2)</span>
<span class="token comment">#     await asyncio.sleep(2)</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#</span> 
<span class="token comment"># if __name__ == '__main__':</span>
<span class="token comment">#     f1 = func1()</span>
<span class="token comment">#     f2 = func2()</span>
<span class="token comment">#     f3 = func3()</span>
<span class="token comment">#     tasks = [</span>
<span class="token comment">#         f1, f2, f3</span>
<span class="token comment">#     ]</span>
<span class="token comment">#     t1 = time.time()</span>
<span class="token comment">#     # launch several tasks (coroutines) at once</span>
<span class="token comment">#     asyncio.run(asyncio.wait(tasks))</span>
<span class="token comment">#     t2 = time.time()</span>
<span class="token comment">#     print(t2-t1)</span>


<span class="token comment"># The style above is not recommended; the style below is preferred because it carries over directly to crawlers</span>
<span class="token comment"># async def func1():</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#     await asyncio.sleep(3)</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func2():</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#     await asyncio.sleep(4)</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func3():</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#     await asyncio.sleep(2)</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def main():</span>
<span class="token comment">#     # first style</span>
<span class="token comment">#     # f1 = func1()</span>
<span class="token comment">#     # await f1  # await is normally placed in front of a coroutine object</span>
<span class="token comment">#     # second style (recommended)</span>
<span class="token comment">#     tasks = [</span>
<span class="token comment">#         asyncio.create_task(func1()),  # 
#           (from Python 3.8 on, wrap each coroutine in asyncio.create_task())</span>
<span class="token comment">#         asyncio.create_task(func2()),</span>
<span class="token comment">#         asyncio.create_task(func3())</span>
<span class="token comment">#     ]</span>
<span class="token comment">#     await asyncio.wait(tasks)</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># if __name__ == '__main__':</span>
<span class="token comment">#     asyncio.run(main())</span>


<span class="token comment"># Applying this to crawlers</span>
<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">download</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># takes the url, because main() calls download(url)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"starting download"</span><span class="token punctuation">)</span>
    <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>  <span class="token comment"># simulate a network request</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"download finished"</span><span class="token punctuation">)</span>


<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    urls <span class="token operator">=</span> <span class="token punctuation">[</span>
        <span class="token string">"http://www.baidu.com"</span><span class="token punctuation">,</span>
        <span class="token string">"http://www.bilibili.com"</span><span class="token punctuation">,</span>
        <span class="token string">"http://www.163.com"</span>
    <span class="token punctuation">]</span>

    tasks <span class="token operator">=</span> 
<span class="token punctuation">[</span><span class="token punctuation">]</span>
    <span class="token keyword">for</span> url <span class="token keyword">in</span> urls<span class="token punctuation">:</span>
        d <span class="token operator">=</span> asyncio<span class="token punctuation">.</span>create_task<span class="token punctuation">(</span>download<span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># wrap the coroutine in a task; asyncio.wait() rejects bare coroutines on Python 3.11+</span>
        tasks<span class="token punctuation">.</span>append<span class="token punctuation">(</span>d<span class="token punctuation">)</span>
    <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>wait<span class="token punctuation">(</span>tasks<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    asyncio<span class="token punctuation">.</span>run<span class="token punctuation">(</span>main<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
</code></pre>
<h4>4.6.3 Async Coroutines: Deprecation Warning</h4>
<p>Since Python 3.8, passing bare coroutine objects to asyncio.wait() has been deprecated; each coroutine should first be wrapped in a task with asyncio.create_task(), with the coroutine as its argument. Python 3.11 removed support for bare coroutines entirely, so on that version and later passing them to asyncio.wait() raises an error.</p>
<h3>4.7 Asynchronous HTTP Requests with the aiohttp Module</h3>
<p>First install the module: pip install aiohttp</p>
<p>requests.get() is synchronous code; aiohttp is its asynchronous counterpart.</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> aiohttp
<span class="token keyword">import</span> asyncio

urls <span class="token operator">=</span> <span class="token punctuation">[</span>
    <span class="token string">"https://img-pre.ivsky.com/img/tupian/pre/202101/31/weiershi_kejiquan.jpg"</span><span class="token punctuation">,</span>
    <span class="token string">"https://img-pre.ivsky.com/img/tupian/pre/202101/31/weiershi_kejiquan-001.jpg"</span><span class="token punctuation">,</span>
    <span class="token string">"https://img-pre.ivsky.com/img/tupian/pre/202101/31/weiershi_kejiquan-003.jpg"</span>
<span class="token punctuation">]</span>


<span class="token keyword">async</span> <span 
class="token keyword">def</span> <span class="token function">aiodownload</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    name <span class="token operator">=</span> url<span class="token punctuation">.</span>rsplit<span class="token punctuation">(</span><span class="token string">"/"</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span>
    <span class="token comment"># s = aiohttp.ClientSession() &lt;==&gt; requests.session()</span>
    <span class="token comment"># s.get() / s.post() correspond to requests.get() / requests.post()</span>
    <span class="token keyword">async</span> <span class="token keyword">with</span> aiohttp<span class="token punctuation">.</span>ClientSession<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">as</span> session<span class="token punctuation">:</span>
        <span class="token keyword">async</span> <span class="token keyword">with</span> session<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> <span class="token keyword">as</span> resp<span class="token punctuation">:</span>
            <span class="token comment"># the response has arrived; write it to a file</span>
            <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"Wallpaper/"</span><span class="token operator">+</span>name<span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"wb"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
                f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token keyword">await</span> 
resp<span class="token punctuation">.</span>content<span class="token punctuation">.</span>read<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># reading the body is asynchronous, so it has to be awaited</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> <span class="token string">"done!"</span><span class="token punctuation">)</span>


<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    tasks <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    <span class="token keyword">for</span> url <span class="token keyword">in</span> urls<span class="token punctuation">:</span>
        tasks<span class="token punctuation">.</span>append<span class="token punctuation">(</span>asyncio<span class="token punctuation">.</span>create_task<span class="token punctuation">(</span>aiodownload<span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>wait<span class="token punctuation">(</span>tasks<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># asyncio.run(main()) raised "RuntimeError: Event loop is closed" here; the form below ran without that error</span>
    loop <span class="token operator">=</span> asyncio<span class="token punctuation">.</span>get_event_loop<span class="token punctuation">(</span><span class="token punctuation">)</span>
    loop<span 
class="token punctuation">.</span>run_until_complete<span class="token punctuation">(</span>main<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> </code></pre>
<h3>4.8 Async Crawler in Practice: Downloading an Entire Novel</h3>
<ol>
<li>Synchronous step: call getCatalog to get the cid and title of every chapter</li>
<li>Asynchronous step: call getChapterContent to download the content of every chapter</li>
</ol>
<pre><code class="prism language-python"><span class="token comment"># http://dushu.baidu.com/api/pc/getCatalog?data={'book_id':'4306063500'}  # returns the chapter list</span>
<span class="token comment"># Fetch the content of one chapter:</span>
<span class="token comment"># http://dushu.baidu.com/api/pc/getChapterContent?data={'book_id':'4306063500','cid':'4306063500|11348571','need_bookinfo':1}</span>
<span class="token keyword">import</span> requests
<span class="token keyword">import</span> asyncio
<span class="token keyword">import</span> aiohttp
<span class="token keyword">import</span> aiofiles
<span class="token keyword">import</span> json

<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">aiodownload</span><span class="token punctuation">(</span>cid<span class="token punctuation">,</span> b_id<span class="token punctuation">,</span> title<span class="token punctuation">)</span><span class="token punctuation">:</span>
    data <span class="token operator">=</span> <span class="token punctuation">{</span>
        <span class="token string">"book_id"</span><span class="token punctuation">:</span> b_id<span class="token punctuation">,</span>
        <span class="token string">"cid"</span><span class="token punctuation">:</span> <span class="token string-interpolation"><span class="token string">f"</span><span class="token interpolation"><span class="token punctuation">{</span>b_id<span class="token punctuation">}</span></span><span class="token string">|</span><span class="token interpolation"><span class="token punctuation">{</span>cid<span class="token punctuation">}</span></span><span class="token 
string">"</span></span><span class="token punctuation">,</span>
        <span class="token string">"need_bookinfo"</span><span class="token punctuation">:</span> <span class="token number">1</span>
    <span class="token punctuation">}</span>
    data <span class="token operator">=</span> json<span class="token punctuation">.</span>dumps<span class="token punctuation">(</span>data<span class="token punctuation">)</span>
    url <span class="token operator">=</span> <span class="token string-interpolation"><span class="token string">f"http://dushu.baidu.com/api/pc/getChapterContent?data=</span><span class="token interpolation"><span class="token punctuation">{</span>data<span class="token punctuation">}</span></span><span class="token string">"</span></span>
    <span class="token keyword">async</span> <span class="token keyword">with</span> aiohttp<span class="token punctuation">.</span>ClientSession<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">as</span> session<span class="token punctuation">:</span>
        <span class="token keyword">async</span> <span class="token keyword">with</span> session<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> <span class="token keyword">as</span> resp<span class="token punctuation">:</span>
            dic <span class="token operator">=</span> <span class="token keyword">await</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
            <span class="token keyword">async</span> <span class="token keyword">with</span> aiofiles<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"西游记/"</span> <span class="token operator">+</span> title<span class="token operator">+</span><span class="token string">".txt"</span><span class="token punctuation">,</span> mode<span class="token 
operator">=</span><span class="token string">"w"</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">"utf-8"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
                <span class="token keyword">await</span> f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>dic<span class="token punctuation">[</span><span class="token string">"data"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"novel"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"content"</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>title<span class="token punctuation">,</span> <span class="token string">"success"</span><span class="token punctuation">)</span>

<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">getCatalog</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
    dic <span class="token operator">=</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tasks <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    <span class="token keyword">for</span> item <span class="token keyword">in</span> dic<span class="token punctuation">[</span><span class="token string">"data"</span><span class="token punctuation">]</span><span class="token 
punctuation">[</span><span class="token string">"novel"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"items"</span><span class="token punctuation">]</span><span class="token punctuation">:</span>  <span class="token comment"># each item holds one chapter's title and cid</span>
        title <span class="token operator">=</span> item<span class="token punctuation">[</span><span class="token string">"title"</span><span class="token punctuation">]</span>
        cid <span class="token operator">=</span> item<span class="token punctuation">[</span><span class="token string">"cid"</span><span class="token punctuation">]</span>
        <span class="token comment"># Queue one asynchronous download task per chapter</span>
        tasks<span class="token punctuation">.</span>append<span class="token punctuation">(</span>asyncio<span class="token punctuation">.</span>create_task<span class="token punctuation">(</span>aiodownload<span class="token punctuation">(</span>cid<span class="token punctuation">,</span> b_id<span class="token punctuation">,</span> title<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>wait<span class="token punctuation">(</span>tasks<span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"All Done"</span><span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    b_id <span class="token operator">=</span> <span class="token string">"4306063500"</span>
    url <span class="token operator">=</span> <span class="token string">'http://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"'</span> <span class="token operator">+</span> b_id <span class="token operator">+</span> <span class="token 
string">'"}'</span>
    asyncio<span class="token punctuation">.</span>run<span class="token punctuation">(</span>getCatalog<span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">)</span> </code></pre>
<h3>4.9 Scraping Videos</h3>
<h4>4.9.1 Comprehensive Exercise: How Video Sites Work</h4>
<p>When building a web page, a video file can be embedded with a video tag (&lt;video&gt;). If a video site served its videos this way, however, every playback would amount to downloading the complete file, which is very time-consuming.</p>
<p><strong>So how do video sites usually do it</strong>?</p>
<p>User upload -&gt; transcoding (the video is processed into 2K, 1080p, standard definition, and so on) -&gt; segmenting (the single file is split into many files, so that when the user drags the progress bar only the corresponding file has to be loaded)</p>
<p>Since the video is cut into a great many small segments, a file is needed to record (1) the playback order of the segments and (2) the path where each segment is stored. This file is usually an M3U file, and an M3U file whose content is UTF-8 encoded is an M3U8 file. Almost all of the major video platforms today use M3U8 files.</p>
<p>Reading an M3U8 file:</p>
<pre><code class="prism language-python"><span class="token comment">#EXTM3U</span>
<span class="token comment">#EXT-X-VERSION:3</span>
<span class="token comment">#EXT-X-TARGETDURATION:13    maximum duration of each video segment</span>
<span class="token comment">#EXT-X-MEDIA-SEQUENCE:0</span>
<span class="token comment">#EXT-X-KEY:METHOD=AES-128,URI="key.key"    encryption method of the segments and the address of the key; encrypted segments must be decrypted before they can be played</span>
<span class="token comment">#EXTINF:12.600000,    duration of the following segment in seconds</span>
cFN8o3436000<span class="token punctuation">.</span>ts    lines that do not start with <span class="token string">'#'</span> are the addresses of the individual ts files
<span class="token comment">#EXTINF:10.000000,</span>
cFN8o3436001<span class="token punctuation">.</span>ts
<span class="token comment">#EXTINF:10.000000,</span>
cFN8o3436002<span class="token punctuation">.</span>ts
<span class="token comment">#EXTINF:10.000000,</span>
cFN8o3436003<span class="token punctuation">.</span>ts
<span class="token comment">#EXTINF:10.000000,</span>
cFN8o3436004<span class="token punctuation">.</span>ts
<span class="token comment">#EXTINF:10.000000,</span>
cFN8o3436005<span class="token punctuation">.</span>ts
<span class="token comment">#EXTINF:6.880000,</span>
cFN8o3436006<span class="token punctuation">.</span>ts </code></pre>
<p>The workflow for scraping a video is therefore:</p>
<ol>
<li>Find the M3U8 file (by whatever means)</li>
<li>Download the ts files listed in the M3U8 file</li>
<li>Merge the ts files into a single mp4 file by whatever means (not necessarily programming)</li>
</ol>
<h4>4.9.2 
Scraping 云播TV: Simple Version</h4>
<p>The original site is no longer available, so 云播TV is used instead.</p>
<p>url: https://www.yunbtv.com/</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> requests
<span class="token keyword">import</span> re

url <span class="token operator">=</span> <span class="token string">"https://video.buycar5.cn/20200813/uNqvsBhl/2000kb/hls/index.m3u8"</span>
key_uri <span class="token operator">=</span> <span class="token string">"https://ts1.yuyuangewh.com:9999/20200813/uNqvsBhl/2000kb/hls/key.key"</span>
<span class="token comment"># 1. Fetch the m3u8 file first; printing its content shows that the segments are encrypted</span>
m3u8_text <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> verify<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span>
<span class="token comment"># 2. Download the m3u8 file and save it as index.m3u8</span>
<span class="token comment"># with open("download_video/"+"index.m3u8", mode="wb") as f:</span>
<span class="token comment">#     f.write(m3u8_text.content)</span>
<span class="token comment"># m3u8_text.close()</span>
<span class="token comment"># print("m3u8 success")</span>
<span class="token comment"># 3. Download key.key and save it as key.m3u8</span>
<span class="token comment"># key_text = requests.get(key_uri)</span>
<span class="token comment"># with open("download_video/"+"key.m3u8", mode="wb") as f:</span>
<span class="token comment">#     f.write(key_text.content)</span>
<span class="token comment"># key_text.close()</span>
<span class="token comment"># print("key success")</span>
<span class="token comment"># 4. Parse the m3u8 file</span>
n <span class="token operator">=</span> <span class="token number">1</span>
<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"download_video/index.m3u8"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span 
class="token string">'r'</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
    <span class="token keyword">for</span> line <span class="token keyword">in</span> f<span class="token punctuation">:</span>
        line <span class="token operator">=</span> line<span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># strip surrounding spaces and the newline</span>
        <span class="token keyword">if</span> <span class="token keyword">not</span> line <span class="token keyword">or</span> line<span class="token punctuation">.</span>startswith<span class="token punctuation">(</span><span class="token string">"#"</span><span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># skip blank lines and lines that start with #</span>
            <span class="token keyword">continue</span>
        <span class="token comment"># Download the video segment</span>
        resp2 <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>line<span class="token punctuation">,</span> verify<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span>
        <span class="token comment"># Use a separate name so the playlist file handle f is not shadowed</span>
        ts_file <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"download_video/"</span><span class="token operator">+</span><span class="token string-interpolation"><span class="token string">f"</span><span class="token interpolation"><span class="token punctuation">{</span>n<span class="token punctuation">}</span></span><span class="token string">.ts"</span></span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"wb"</span><span class="token punctuation">)</span>
        ts_file<span class="token punctuation">.</span>write<span class="token punctuation">(</span>resp2<span class="token punctuation">.</span>content<span class="token punctuation">)</span>
        ts_file<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>
        resp2<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>
        n <span class="token operator">+=</span> <span class="token number">1</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"Segment </span><span class="token interpolation"><span class="token punctuation">{</span>n<span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">}</span></span><span class="token string"> done"</span></span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"All Done"</span><span class="token punctuation">)</span> </code></pre>
<p>This version is based on material I found online; it differs from the video and has not been optimized yet.</p>
<h2>Chapter 5: selenium</h2>
<h3>5.1 Introducing selenium</h3>
<p>selenium is an automated testing tool. It can open a browser and operate it the way a person would, and a program can read all kinds of information on the page directly through selenium.</p>
<p>Environment setup:</p>
<ul>
<li>pip install selenium</li>
<li>Download the browser driver: http://npm.taobao.org/mirrors/chromedriver</li>
<li>Download the file that matches your browser version, unzip it, and put the chromedriver executable in the folder that contains the Python interpreter</li>
<li>Have selenium launch Chrome</li>
</ul>
<pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome

<span class="token comment"># 1. Create the browser object</span>
web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token comment"># 2. Open a URL</span>
web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://www.baidu.com"</span><span 
class="token punctuation">)</span> </code></pre>
<h3>5.2 selenium Operations: Scraping Lagou</h3>
<p>This section uses selenium to scrape job postings from the Lagou (拉勾) recruitment site.</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>common<span class="token punctuation">.</span>keys <span class="token keyword">import</span> Keys
<span class="token keyword">import</span> time

web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://lagou.com"</span><span class="token punctuation">)</span>
<span class="token comment"># Find an element and click it</span>
el <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="changeCityBox"]/p[1]/a'</span><span class="token punctuation">)</span>
el<span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># click event</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>  <span class="token comment"># give the browser a moment</span>
<span class="token comment"># Find the search box, type "python", then press Enter (or click the search button)</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="search_input"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"python"</span><span class="token 
punctuation">,</span> Keys<span class="token punctuation">.</span>ENTER<span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
<span class="token comment"># Locate where the data lives and extract it</span>
<span class="token comment"># Find every li on the page that holds a job posting</span>
li_list <span class="token operator">=</span> web<span class="token punctuation">.</span>find_elements_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="s_position_list"]/ul/li'</span><span class="token punctuation">)</span>
<span class="token keyword">for</span> li <span class="token keyword">in</span> li_list<span class="token punctuation">:</span>
    job_name <span class="token operator">=</span> li<span class="token punctuation">.</span>find_element_by_tag_name<span class="token punctuation">(</span><span class="token string">"h3"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text
    job_price <span class="token operator">=</span> li<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">"./div/div/div[2]/div/span"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text
    job_company <span class="token operator">=</span> li<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">"./div/div[2]/div/a"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text
    <span class="token keyword">print</span><span class="token punctuation">(</span>job_name<span class="token punctuation">,</span> job_company<span class="token punctuation">,</span> job_price<span class="token punctuation">)</span> </code></pre>
<h3>5.3 selenium Operations: Switching Between Windows</h3>
<pre><code class="prism language-python"><span class="token keyword">from</span> 
selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>common<span class="token punctuation">.</span>keys <span class="token keyword">import</span> Keys
<span class="token keyword">import</span> time

web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://lagou.com"</span><span class="token punctuation">)</span>
el <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="changeCityBox"]/p[1]/a'</span><span class="token punctuation">)</span>
el<span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="search_input"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"python"</span><span class="token punctuation">,</span> Keys<span class="token punctuation">.</span>ENTER<span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token 
string">'//*[@id="s_position_list"]/ul/li[1]/div[1]/div[1]/div[1]/a/h3'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token comment"># selenium does not switch to a newly opened window by default</span>
web<span class="token punctuation">.</span>switch_to<span class="token punctuation">.</span>window<span class="token punctuation">(</span>web<span class="token punctuation">.</span>window_handles<span class="token punctuation">[</span><span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
<span class="token comment"># Extract content from the new window</span>
job_detail <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="job_detail"]/dd[2]/div'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text
<span class="token keyword">print</span><span class="token punctuation">(</span>job_detail<span class="token punctuation">)</span>
<span class="token comment"># Close the child window</span>
web<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token comment"># Point selenium back at the original window</span>
web<span class="token punctuation">.</span>switch_to<span class="token punctuation">.</span>window<span class="token punctuation">(</span>web<span class="token punctuation">.</span>window_handles<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token 
string">'//*[@id="s_position_list"]/ul/li[1]/div[1]/div[1]/div[1]/a/h3'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text<span class="token punctuation">)</span> </code></pre>
<h3>5.4 selenium Operations: Headless Browser</h3>
<p>When scraping a page, we may want the browser to run quietly in the background.</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>options <span class="token keyword">import</span> Options
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>support<span class="token punctuation">.</span>select <span class="token keyword">import</span> Select
<span class="token keyword">import</span> time

<span class="token comment"># Headless browser: prepare the options</span>
opt <span class="token operator">=</span> Options<span class="token punctuation">(</span><span class="token punctuation">)</span>
opt<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">"--headless"</span><span class="token punctuation">)</span>
opt<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">"--disable-gpu"</span><span class="token punctuation">)</span>
web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span>options<span class="token operator">=</span>opt<span class="token punctuation">)</span>  <span class="token comment"># apply the options to the browser</span>
web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://www.endata.com.cn/BoxOffice/BO/Year/index.html"</span><span class="token punctuation">)</span>
<span 
class="token comment"># Locate the drop-down list</span>
sel_el <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="OptionDate"]'</span><span class="token punctuation">)</span>
<span class="token comment"># Wrap the element as a drop-down menu (Select)</span>
sel <span class="token operator">=</span> Select<span class="token punctuation">(</span>sel_el<span class="token punctuation">)</span>
<span class="token comment"># Step the browser through the options</span>
<span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token builtin">len</span><span class="token punctuation">(</span>sel<span class="token punctuation">.</span>options<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># i is the index of each drop-down option</span>
    sel<span class="token punctuation">.</span>select_by_index<span class="token punctuation">(</span>i<span class="token punctuation">)</span>  <span class="token comment"># switch by index</span>
    time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>
    table <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="TableList"]/table'</span><span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>table<span class="token punctuation">.</span>text<span class="token punctuation">)</span>  <span class="token comment"># print all of the table's text</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"============================================="</span><span class="token 
punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"All Done"</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token comment"># How to get the page source as shown in Elements (the HTML after data loading and JavaScript execution)</span>
<span class="token comment"># print(web.page_source)</span> </code></pre>
<h3>5.5 selenium Operations: Handling CAPTCHAs with Chaojiying (超级鹰)</h3>
<ol>
<li>Image recognition</li>
<li>Use a mature CAPTCHA-solving service available online</li>
</ol>
<p>Chaojiying (超级鹰) is one such online CAPTCHA-recognition service. You need to register and purchase credits yourself. The developer documentation on the official site provides sample code for each language, and running that sample is enough to get the feature working.</p>
<h3>5.6 selenium: Using Chaojiying to Log In to Chaojiying</h3>
<p>This section uses Chaojiying to log in to the Chaojiying website automatically. It mainly exercises the use of the Chaojiying client methods.</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome
<span class="token keyword">from</span> chaojiying <span class="token keyword">import</span> Chaojiying_Client
<span class="token keyword">import</span> time

web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://www.chaojiying.com/user/login/"</span><span class="token punctuation">)</span>
<span class="token comment"># Handle the CAPTCHA</span>
img <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">"/html/body/div[3]/div/div[3]/div[1]/form/div/img"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>screenshot_as_png
chaojiying <span class="token operator">=</span> Chaojiying_Client<span class="token punctuation">(</span><span class="token string">'Chaojiying username'</span><span class="token punctuation">,</span> <span class="token string">'Chaojiying password'</span><span 
class="token punctuation">,</span> <span class="token string">'ID'</span><span class="token punctuation">)</span>
verity_code <span class="token operator">=</span> chaojiying<span class="token punctuation">.</span>PostPic<span class="token punctuation">(</span>img<span class="token punctuation">,</span> <span class="token number">1902</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token string">'pic_str'</span><span class="token punctuation">]</span>
<span class="token comment"># Fill the username, password, and CAPTCHA into the page</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[3]/div/div[3]/div[1]/form/p[1]/input'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"Chaojiying username"</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[3]/div/div[3]/div[1]/form/p[2]/input'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"Chaojiying password"</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[3]/div/div[3]/div[1]/form/p[3]/input'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span>verity_code<span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">)</span>
<span class="token comment"># Click the login button</span>
web<span class="token 
punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[3]/div/div[3]/div[1]/form/p[4]/input'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span> </code></pre>
<h3>5.7 selenium: Solving the 12306 Login Problem</h3>
<p>The 12306 login page no longer uses image verification, so this section differs from the video.</p>
<p>12306 can detect whether your browser is controlled by automated testing software, so without special measures the slider verification cannot be passed. The detection works like this: enter <strong>window.navigator.webdriver</strong> in the browser console. The Chrome instance driven by our test returns true, whereas an ordinary browser returns false, and 12306 uses this value to decide whether the browser is being automated.</p>
<p>Ways to avoid detection:</p>
<ul>
<li> <p>Chrome versions below 88: when the browser starts (before any page content has loaded), inject JavaScript into the page that removes webdriver, i.e. run the following before web.get()</p> </li>
<li> <pre><code class="prism language-python">web<span class="token punctuation">.</span>execute_cdp_cmd<span class="token punctuation">(</span><span class="token string">"Page.addScriptToEvaluateOnNewDocument"</span><span class="token punctuation">,</span> <span class="token punctuation">{</span>
    <span class="token string">"source"</span><span class="token punctuation">:</span> <span class="token triple-quoted-string string">"""
    navigator.webdriver = undefined
    Object.defineProperty(navigator, 'webdriver', {
        get: () =&gt; undefined
    })
    """</span>
<span class="token punctuation">}</span><span class="token punctuation">)</span> 
</code></pre> </li> <li> <p>Chrome 88 and later: import the Options class and add an argument to the browser options</p> </li> <li> <pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>options <span class="token keyword">import</span> Options

option <span class="token operator">=</span> Options<span class="token punctuation">(</span><span class="token punctuation">)</span>
option<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">'--disable-blink-features=AutomationControlled'</span><span class="token punctuation">)</span>
web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span>options<span class="token operator">=</span>option<span class="token punctuation">)</span>
</code></pre> </li> </ul> <p>Here is my code:</p> <pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>common<span class="token punctuation">.</span>action_chains <span class="token keyword">import</span> ActionChains
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>options <span class="token keyword">import</span> Options
<span class="token keyword">import</span> time

option <span class="token operator">=</span> Options<span class="token punctuation">(</span><span class="token punctuation">)</span>
option<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">'--disable-blink-features=AutomationControlled'</span><span class="token 
punctuation">)</span>
web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span>options<span class="token operator">=</span>option<span class="token punctuation">)</span>
web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"https://kyfw.12306.cn/otn/resources/login.html"</span><span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>  <span class="token comment"># Wait for the page to respond</span>
<span class="token comment"># Switch to account login</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="toolbar_Div"]/div[2]/div[2]/ul/li[2]/a'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
<span class="token comment"># Fill in the account name and password</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="J-userName"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"123456789"</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="J-password"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"123456789"</span><span class="token punctuation">)</span>
<span class="token 
comment"># Click the login button</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="J-login"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>
<span class="token comment"># Slider verification: drag the handle with an action chain</span>
span_element <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="nc_1_n1z"]'</span><span class="token punctuation">)</span>
ActionChains<span class="token punctuation">(</span>web<span class="token punctuation">)</span><span class="token punctuation">.</span>drag_and_drop_by_offset<span class="token punctuation">(</span>span_element<span class="token punctuation">,</span> <span class="token number">320</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">.</span>perform<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre></li> </ul> </div> </div> </div> </div> </div> <!--PC和WAP自适应版--> <div id="SOHUCS" sid="1450727448940462080"></div> <script type="text/javascript" src="/views/front/js/chanyan.js"></script> <!-- 文章页-底部 动态广告位 --> <div class="youdao-fixed-ad" id="detail_ad_bottom"></div> </div> <div class="col-md-3"> <div class="row" id="ad"> <!-- 文章页-右侧1 动态广告位 --> <div id="right-1" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_1"> </div> </div> <!-- 文章页-右侧2 动态广告位 --> <div id="right-2" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_2"></div> </div> <!-- 文章页-右侧3 
动态广告位 --> <div id="right-3" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_3"></div> </div> </div> </div> </div> </div> </div> <div class="container"> <h4 class="pt20 mb15 mt0 border-top">你可能感兴趣的:(笔记,python,爬虫,python,爬虫)</h4> <div id="paradigm-article-related"> <div class="recommend-post mb30"> <ul class="widget-links"> <li><a href="/article/3624.htm" title="记录一些函数用法" target="_blank">记录一些函数用法</a> <span class="text-muted">.Aky.</span> <a class="tag" taget="_blank" href="/search/%E4%BD%8D%E8%BF%90%E7%AE%97/1.htm">位运算</a><a class="tag" taget="_blank" href="/search/PHP/1.htm">PHP</a><a class="tag" taget="_blank" 
href="/search/%E6%95%B0%E6%8D%AE%E5%BA%93/1.htm">数据库</a><a class="tag" taget="_blank" href="/search/%E5%87%BD%E6%95%B0/1.htm">函数</a><a class="tag" taget="_blank" href="/search/IP/1.htm">IP</a> <div>高手们照旧忽略。 想弄个全天朝IP段数据库,找了个今天最新更新的国内所有运营商IP段,copy到文件,用文件函数,字符串函数把玩下。分割出startIp和endIp这样格式写入.txt文件,直接用phpmyadmin导入.csv文件的形式导入。(生命在于折腾,也许你们觉得我傻X,直接下载人家弄好的导入不就可以,做自己的菜鸟,让别人去说吧) 当然用到了ip2long()函数把字符串转为整型数</div> </li> <li><a href="/article/3751.htm" title="sublime text 3 rust" target="_blank">sublime text 3 rust</a> <span class="text-muted">wudixiaotie</span> <a class="tag" taget="_blank" href="/search/Sublime+Text/1.htm">Sublime Text</a> <div>1.sublime text 3 => install package => Rust 2.cd ~/.config/sublime-text-3/Packages 3.mkdir rust 4.git clone https://github.com/sp0/rust-style 5.cd rust-style 6.cargo build --release 7.ctrl</div> </li> </ul> </div> </div> </div> <div> <div class="container"> <div class="indexes"> <strong>按字母分类:</strong> <a href="/tags/A/1.htm" target="_blank">A</a><a href="/tags/B/1.htm" target="_blank">B</a><a href="/tags/C/1.htm" target="_blank">C</a><a href="/tags/D/1.htm" target="_blank">D</a><a href="/tags/E/1.htm" target="_blank">E</a><a href="/tags/F/1.htm" target="_blank">F</a><a href="/tags/G/1.htm" target="_blank">G</a><a href="/tags/H/1.htm" target="_blank">H</a><a href="/tags/I/1.htm" target="_blank">I</a><a href="/tags/J/1.htm" target="_blank">J</a><a href="/tags/K/1.htm" target="_blank">K</a><a href="/tags/L/1.htm" target="_blank">L</a><a href="/tags/M/1.htm" target="_blank">M</a><a href="/tags/N/1.htm" target="_blank">N</a><a href="/tags/O/1.htm" target="_blank">O</a><a href="/tags/P/1.htm" target="_blank">P</a><a href="/tags/Q/1.htm" target="_blank">Q</a><a href="/tags/R/1.htm" target="_blank">R</a><a href="/tags/S/1.htm" target="_blank">S</a><a href="/tags/T/1.htm" target="_blank">T</a><a href="/tags/U/1.htm" target="_blank">U</a><a href="/tags/V/1.htm" target="_blank">V</a><a href="/tags/W/1.htm" target="_blank">W</a><a href="/tags/X/1.htm" 
target="_blank">X</a><a href="/tags/Y/1.htm" target="_blank">Y</a><a href="/tags/Z/1.htm" target="_blank">Z</a><a href="/tags/0/1.htm" target="_blank">其他</a> </div> </div> </div> <footer id="footer" class="mb30 mt30"> <div class="container"> <div class="footBglm"> <a target="_blank" href="/">首页</a> - <a target="_blank" href="/custom/about.htm">关于我们</a> - <a target="_blank" href="/search/Java/1.htm">站内搜索</a> - <a target="_blank" href="/sitemap.txt">Sitemap</a> - <a target="_blank" href="/custom/delete.htm">侵权投诉</a> </div> <div class="copyright">版权所有 IT知识库 CopyRight © 2000-2050 E-COM-NET.COM , All Rights Reserved. <!-- <a href="https://beian.miit.gov.cn/" rel="nofollow" target="_blank">京ICP备09083238号</a><br>--> </div> </div> </footer> <!-- 代码高亮 --> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shCore.js"></script> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shLegacy.js"></script> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shAutoloader.js"></script> <link type="text/css" rel="stylesheet" href="/static/syntaxhighlighter/styles/shCoreDefault.css"/> <script type="text/javascript" src="/static/syntaxhighlighter/src/my_start_1.js"></script> </body> </html>