12. Scraping data from the web
1) Project: mapIt.py using the webbrowser module
webbrowser module: webbrowser.open('url')
test_1201.py
#! python3
# mapIt.py - launches a Baidu search for an address from the command line or clipboard.
import webbrowser, sys, pyperclip
if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()
webbrowser.open('https://www.baidu.com/s?wd=' + address)
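Usage: python test_1201.py <address> searches for the address given on the command line; run with no arguments, the script searches for whatever text is currently on the clipboard (pyperclip must be installed: pip install pyperclip).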
2) Downloading files with the requests module
Downloading a web page with requests:
requests.get('url')
res.status_code == requests.codes.ok
res.text
Check for errors: res.raise_for_status()
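Putting these calls together (the URL is just an illustration):
import requests
res = requests.get('https://www.example.com')  # download the page
print(res.status_code == requests.codes.ok)    # True when the request succeeded (HTTP 200)
print(res.text[:100])                          # the page body as a string
res.raise_for_status()                         # raises requests.HTTPError for a 4xx/5xx response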
3) Saving a downloaded file to disk
Save the file with res.iter_content().
Test program: test_1202.py
import requests
res = requests.get('https://www.csdn.net/')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
playFile = open('d:/temp/csdn.txt', 'wb')
for chunk in res.iter_content(100000):
    playFile.write(chunk)
playFile.close()
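A slightly more idiomatic variant of the same download uses a with statement so the file is closed automatically (same URL and path as above):
import requests
res = requests.get('https://www.csdn.net/')
res.raise_for_status()
with open('d:/temp/csdn.txt', 'wb') as playFile:  # 'wb' because iter_content() yields bytes
    for chunk in res.iter_content(100000):
        playFile.write(chunk)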
4) HTML
View the page source.
Open the browser's developer tools to locate HTML elements.
test_1203.html
<html>
<body>
<p>hello world!</p>
<p>Al's free <a href="https://inventwithpython.com">Python books</a>.</p>
</body>
</html>
5) Parsing HTML with the bs4 module
The Beautiful Soup module (bs4) extracts information from HTML pages.
test_1204.py
import bs4
exampleFile = open('test_1204.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
exampleFile.close()
elems = exampleSoup.select('#author')
print('elem: ' + str(elems[0]))
print('elem.text: ' + elems[0].getText())
print('elem.attrs: ' + str(elems[0].attrs))
pelems = exampleSoup.select('p')
print('0: ' + str(pelems[0]))
print('\n0-text: ' + pelems[0].getText())
print('\n1: ' + str(pelems[1]))
print('\n2: ' + str(pelems[2]))
test_1204.html
The Website Title
Download my Pythonbook from my website.
Learn Python the easy way!
By
Create a BeautifulSoup object from the HTML: bs4.BeautifulSoup()
Find elements with the select() method: soup.select()
a) Find <a> tags by tag name:
print(soup.select('a'))
b) Find by class name (class="sister"):
print(soup.select('.sister'))
c) Find by id (id="link1"):
print(soup.select('#link1'))
d) Special selectors, e.g. selecting by parent: soup.select('div > span') matches every <span> element whose parent element is a <div>.
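A small self-contained sketch of the selector styles above (the HTML string is made up for illustration):
import bs4
html = '<div id="link1" class="sister"><span>hi</span></div><a href="#">a link</a>'
soup = bs4.BeautifulSoup(html, 'html.parser')
print(soup.select('a'))           # by tag name
print(soup.select('.sister'))     # by class name
print(soup.select('#link1'))      # by id
print(soup.select('div > span'))  # <span> elements whose parent is a <div>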
6) Project: opening all search results (searchpypi.py, adapted here to search Baidu)
#! python3
# searchpypi.py - Opens several search results.
import requests, sys, webbrowser, bs4
import logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
searchUrl = 'https://www.baidu.com/s?wd=site:finance.sina.com.cn ' + ' '.join(sys.argv[1:])
logging.info('Search... ' + searchUrl)  # display text while downloading the search result page
webbrowser.open(searchUrl)
res = requests.get(searchUrl)
res.raise_for_status()
# Retrieve top search result links.
logging.debug('res len: ' + str(len(res.text)))
pFile = open('d:/temp/res.html', 'wb')  # save the raw result page for inspection
for chunk in res.iter_content(10000):
    pFile.write(chunk)
pFile.close()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# Open a browser tab for each result: match result links by href prefix
# (Baidu search results point through a baidu.com redirect URL).
linkElems = soup.select('a[href^="https://www.baidu.com"]')
numOpen = min(5, len(linkElems))
logging.debug('numOpen: ' + str(numOpen))
for i in range(numOpen):
    urlToOpen = linkElems[i].get('href')
    print('Open--', urlToOpen)
    webbrowser.open(urlToOpen)
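Note: Baidu may return a simplified page with few or no result links to the default requests User-Agent; if linkElems comes back empty, sending a browser-like header sometimes helps (the header value below is only illustrative):
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get(searchUrl, headers=headers)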
7) Project: download all XKCD comics
This uses the XKCD Chinese mirror (xkcd.in); its page elements differ from the book's, so the selectors were modified, and the crawl is limited to the first ten comics.
test_1206.py
#! python3
# downloadXkcd.py - Downloads XKCD comics (first ten) from the xkcd.in mirror.
import requests, os, bs4
os.chdir('d:/temp')
urlXkcd = 'https://xkcd.in/'  # starting url
url = urlXkcd
os.makedirs('d:/temp/xkcd', exist_ok=True)  # store comics in ./xkcd
for phn in range(10):
    if url.endswith('#'):
        break
    # Download the page.
    print('No. %s Downloading page %s...' % (phn, url))
    res = requests.get(url)
    res.raise_for_status()
    print('res len: ' + str(len(res.text)))
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # Find the URL of the comic image.
    comicElem = soup.select('.comic-body a img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = urlXkcd + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()
        # Save the image to ./xkcd.
        imageFile = open(os.path.join('d:/temp/xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()
    # Get the Next button's url.
    nextLink = soup.select('.nextLink a')[0]
    url = urlXkcd + str(nextLink.get('href'))
print('Done.')
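The loop stops after ten pages (range(10)), or earlier if the next-page link ends with '#', which this script treats as the marker for the last page on xkcd.in.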
8) Controlling the browser with the selenium module
from selenium import webdriver
browser = webdriver.Chrome()  # use the Chrome browser
browser.get('http://www.baidu.com')
WebDriver methods for finding elements on the page:
find_element_by_class_name(name)
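Note: in Selenium 4 the find_element_by_* helper methods were removed in favor of find_element() plus a By locator; a minimal sketch (the class name is hypothetical):
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
browser.get('http://www.baidu.com')
elem = browser.find_element(By.CLASS_NAME, 'some-class')  # hypothetical class name
print(elem.text)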