Python Web Scraping & Security: Beautiful Soup with Examples

Table of Contents

Beautiful Soup:

Parsers:

Node selectors:

Nested selection:

Associative selection:

Child nodes:

Descendant nodes:

Parent node:

Ancestor nodes:

Sibling nodes:

Previous sibling:

Next sibling:

All following siblings:

All preceding siblings:

Method selectors:

CSS selectors:


Beautiful Soup:

bs4
A library for parsing HTML and XML documents.
Parsers: the built-in html.parser, plus lxml for both HTML and XML
Document traversal: much like XPath, the document is organized into a tree of nodes
Searching: find() and find_all()
Modification: bs4 supports adding, deleting, and editing nodes as well as reading them
Extracting data
Handling special characters
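The modification support mentioned above can be sketched in a few lines (a minimal sketch; the tag names and values here are just illustrative, and html.parser is used so it runs without lxml installed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="old">hello</p>', 'html.parser')

soup.p['class'] = 'new'      # update an attribute
soup.p.string = 'changed'    # replace the tag's text
new_tag = soup.new_tag('b')  # create a brand-new tag...
new_tag.string = 'extra'
soup.p.append(new_tag)       # ...and insert it as a child
print(soup)
```

Serializing the soup afterwards reflects every edit, which is handy for cleaning scraped markup before saving it.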

Parsers:

html.parser (built-in)  decent
lxml                    fast
xml                     fast (strict XML mode, also provided by lxml)
html5lib                rarely used, slow

pip install beautifulsoup4 lxml
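The parser is chosen by the second argument to the BeautifulSoup constructor; a minimal sketch using the built-in parser, with the alternatives noted in comments:

```python
from bs4 import BeautifulSoup

doc = '<html><body><p>hi</p></body></html>'

# the built-in parser needs no extra install
soup = BeautifulSoup(doc, 'html.parser')
print(soup.p.string)

# drop-in alternatives (each needs a pip install):
#   BeautifulSoup(doc, 'lxml')      - fastest, tolerant of broken HTML
#   BeautifulSoup(doc, 'xml')       - strict XML mode, also via lxml
#   BeautifulSoup(doc, 'html5lib')  - slow, but parses exactly like a browser
```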

# import the BeautifulSoup class
from bs4 import BeautifulSoup

# create a BeautifulSoup object to parse an HTML fragment
soup = BeautifulSoup('<p>hello</p>', 'lxml')
print(soup.p.string)
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
# use prettify() to print the (auto-completed) document with indentation
print(soup.prettify())

Node selectors:

Accessing a tag name as an attribute selects that node.
If multiple tags share the same name, selecting by tag name returns only the first one.
Type: bs4.element.Tag
Attributes:
name: the tag's name

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
# select the title tag of the document
# print(soup.title)
# print the text inside the tag
# print(soup.title.string)
# print(soup.head)
# print(soup.p)
# check the data type
# print(type(soup.title))
# use the name attribute to get the node's tag name
# print(soup.head.name)
# get attributes such as id and class via attrs
# print(soup.p.attrs)
# print(soup.p.attrs['name'])
print(soup.p['name'])
print(soup.p['class'])

Nested selection:

Each selection returns a Tag object, so you can keep chaining further selections.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body></body></html>
"""

soup = BeautifulSoup(html, 'lxml')

# print the title tag inside the head of the document
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

Associative selection:

Sometimes you cannot reach the target node in one step; you first select some node and then move from it to its parent, children, and so on.

Child nodes:

soup.p.children

Descendant nodes:

soup.p.descendants

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# contents returns a list of every direct child of the p node, text included
# print(soup.p.contents)
# print(soup.p.children)
# for i, child in enumerate(soup.p.children):
#     print(i)
#     print(child)
# print all descendant nodes
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
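Note that .contents is a plain list while .children and .descendants are generators, which is why printing soup.p.descendants directly only shows a generator object. A minimal sketch of the difference (the tiny HTML fragment here is just illustrative):

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup('<p>one<b>two<i>three</i></b></p>', 'html.parser')

print(type(soup.p.contents))  # a plain list
# direct child tags only (the <i> is hidden inside <b>)
print([c.name for c in soup.p.children if isinstance(c, Tag)])
# every nested tag, at any depth
print([d.name for d in soup.p.descendants if isinstance(d, Tag)])
```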

Parent node:

soup.a.parent

Ancestor nodes:

soup.a.parents

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# print the direct parent node
# print(soup.a.parent)
# print the ancestor nodes (a generator):
print(soup.a.parents)
print(list(enumerate(soup.a.parents)))

Sibling nodes:

Previous sibling:

soup.a.previous_sibling

Next sibling:

soup.a.next_sibling

All following siblings:

soup.a.next_siblings

All preceding siblings:

soup.a.previous_siblings

from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
    </body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# print the next sibling node
# print(soup.a.next_sibling)
# print the previous sibling node
# print(soup.a.previous_sibling)
# print all following siblings:
# print(list(enumerate(soup.a.next_siblings)))
# print all preceding siblings:
print(list(enumerate(soup.a.previous_siblings)))
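One gotcha worth knowing: next_sibling and previous_sibling often return a text node (e.g. the newline between two tags) rather than the neighbouring tag. find_next_sibling() skips text nodes. A minimal sketch with an illustrative fragment:

```python
from bs4 import BeautifulSoup

html = '<p><a id="one">one</a>\n<a id="two">two</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# next_sibling is the newline text node between the two <a> tags
print(repr(soup.a.next_sibling))
# find_next_sibling skips text nodes and returns the next matching tag
print(soup.a.find_next_sibling('a')['id'])
```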
from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        </p>
    </body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.a.next_sibling.string)
# print(soup.a.parents)
# print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Method selectors:

find_all() returns every matching node;
find() returns only the first match.

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# find all ul elements
# print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[1]))
# iterate over every ul
for ul in soup.find_all(name='ul'):
    # print all li elements under this ul
    # print(type(ul.find_all(name='li')))
    for li in ul.find_all(name='li'):
        print(li.string)
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# use find_all to query all elements whose id attribute is 'list-2'
print(soup.find_all(attrs={'id': 'list-2'}))
# print(soup.find_all(attrs={'class': 'element'}))
# class is a Python keyword, so the keyword-argument form is class_
print(soup.find_all(class_='element'))
from bs4 import BeautifulSoup
import re

html = """
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# use a regular expression to find every text node containing "link"
found_elements = soup.find_all(string=re.compile('link'))
print(found_elements)
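A compiled pattern can also be passed as the name argument, matching tag names rather than text; a minimal sketch with an illustrative fragment:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>bold</b><blockquote>quote</blockquote><i>italic</i>',
                     'html.parser')

# matches every tag whose NAME starts with "b": <b> and <blockquote>
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
```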
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# find returns only the first matching element
ul_tag = soup.find(name='ul')
print(ul_tag)
# list_element = soup.find(class_='list')
# print(list_element)

CSS selectors:

Call the select() method and pass in a CSS selector.

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# print(soup.select('.panel .panel-heading'))
# print(soup.select('ul li'))
# print(soup.select('#list-2 .element'))
# for ul in soup.select('ul'):
#     # print(ul.select('li'))
#     # print(ul['id'])
#     print(ul.attrs['id'])
for li in soup.select('li'):
    # print(li.string)
    print(li.get_text())
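Alongside select(), select_one() returns just the first match, playing the same role for CSS selectors that find() plays for find_all(); a minimal sketch with an illustrative fragment:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.select_one('#list-1 .element')   # first match only, like find()
print(first.string)
print([li.get_text() for li in soup.select('li.element')])
```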
