Python Web Crawler: First Steps

Learning source: iMooc (慕课网) course "Python开发简单爬虫" (Developing a Simple Crawler with Python)
I. Web page downloader:

  1. Method 1:
import urllib2
response = urllib2.urlopen('http://www.baidu.com')
print response.getcode()
cont = response.read()
print cont

Result:
[Screenshot 1: console output of the code above]
One open question: running with Ctrl+B shows the result in the console, but Alt+Q (the run shortcut I configured myself) always hangs. Why?
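
For reference: urllib2 only exists in Python 2. Below is a minimal sketch of the same request in Python 3, where the module was merged into urllib.request (everything else is unchanged):

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.getcode())   # HTTP status code, 200 on success
cont = response.read()      # page body, returned as bytes
print(cont)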

  2. Method 2:
    There is a problem with this method; the output is not right.
import urllib2
request = urllib2.Request('http://www.baidu.com')
request.add_data('a','1') # this line was dropped in the video
request.add_header('User-Agent','Mozilla/5.0')
response = urllib2.urlopen(request)
cont = response.read()
print cont

It raises an error:
Traceback (most recent call last):
TypeError: add_data() takes exactly 2 arguments (3 given)
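
The error occurs because Request.add_data() takes a single argument: the url-encoded request body. Passing 'a' and '1' as two separate arguments therefore fails. A minimal sketch of the corrected call, assuming the intent was to send the form field a=1:

import urllib, urllib2

request = urllib2.Request('http://www.baidu.com')
# add_data() expects one string: the url-encoded form data (this turns the request into a POST)
request.add_data(urllib.urlencode({'a': '1'}))
request.add_header('User-Agent', 'Mozilla/5.0')
response = urllib2.urlopen(request)
print response.read()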

  3. Method 3:

Handlers for special scenarios:
- HTTPCookieProcessor
- ProxyHandler
- HTTPSHandler
- HTTPRedirectHandler

import urllib2, cookielib
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
cont = response.read()
print cont

[Screenshot 2: console output of the code above]
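
The other handlers listed above are installed in exactly the same way via build_opener(). For example, a sketch using ProxyHandler; the proxy address 127.0.0.1:8080 is only a made-up placeholder:

import urllib2

# Route all HTTP requests through a proxy (hypothetical local proxy address)
proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()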

II. Web page parser:

import re
from bs4 import BeautifulSoup

html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
links = soup.find_all('a')
# print links
link_node1 = soup.find('a', href='http://example.com/tillie')
print link_node1.name, link_node1['href'], link_node1.get_text()
link_node2 = soup.find('a', href=re.compile(r'ill'))
print link_node2.name, link_node2['href'], link_node2.get_text()
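
Besides searching by href, find_all() also accepts tag names and attribute filters. A short sketch that keeps working with the same soup object (expected output noted in the comments):

# Iterate over every <a> tag found above
for link in links:
    print link.name, link['href'], link.get_text()

# Search by CSS class (class_ is used because class is a Python keyword)
sister_links = soup.find_all('a', class_='sister')
print len(sister_links)             # 3

# Grab a node by tag name and class, then read its text
title_node = soup.find('p', class_='title')
print title_node.get_text()         # The Dormouse's story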
