Installation
Using pip
Direct install (recommended for most cases):
pip install lxml
• Verify the installation: if the following import runs without errors, the installation succeeded:
from lxml import etree, html
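Beyond a bare import, it can help to print which versions were actually loaded. The sketch below assumes only that lxml is installed; the variable names are illustrative.

```python
from lxml import etree

# LXML_VERSION and LIBXML_VERSION are version tuples exposed by lxml.etree.
lxml_version = ".".join(str(part) for part in etree.LXML_VERSION)
libxml_version = ".".join(str(part) for part in etree.LIBXML_VERSION)

print("lxml:", lxml_version)
print("libxml2:", libxml_version)
```

If the import itself fails, the install did not succeed in the interpreter you are running.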
lxml provides two common parsing functions:
• html.fromstring(): parses an HTML string.
• html.parse(): parses an HTML file.
from lxml import html
# Assume we have an HTML string
html_content = """
<html>
  <body>
    <h1>Title of the page</h1>
    <p class="content">This is a paragraph.</p>
    <a href="https://example.com">Click Here</a>
  </body>
</html>
"""
# Parse the HTML string
tree = html.fromstring(html_content)
# Extract data
title = tree.xpath('//h1/text()')  # extract the heading with XPath
print(title)  # prints ['Title of the page']
content = tree.xpath('//p[@class="content"]/text()')  # select the paragraph by class name
print(content)  # prints ['This is a paragraph.']
link = tree.xpath('//a/@href')  # get the URL of the hyperlink
print(link)  # prints ['https://example.com']
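Because xpath() returns a list for node-set expressions, indexing with [0] raises IndexError when nothing matches. A small helper such as the one below can guard against that. first_or_default is a hypothetical name, not part of lxml, and the sketch assumes the expression selects nodes rather than computing a number or string.

```python
from lxml import html


def first_or_default(element, xpath_expr, default=None):
    """Return the first XPath match, or `default` when nothing matches.

    Hypothetical helper; assumes `xpath_expr` yields a list of results.
    """
    matches = element.xpath(xpath_expr)
    return matches[0] if matches else default


doc = html.fromstring(
    '<div><h1>Title of the page</h1><p class="content">This is a paragraph.</p></div>'
)

title = first_or_default(doc, '//h1/text()')
missing = first_or_default(doc, '//h2/text()', default='')

print(title)          # Title of the page
print(repr(missing))  # ''
```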
from lxml import html
# Parse an HTML file (assumes example.html exists in the working directory)
tree = html.parse('example.html')
# Extract data
title = tree.xpath('//h1/text()')
print(title)
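One detail worth seeing in a runnable form: html.parse() returns an ElementTree wrapper, whereas html.fromstring() returns an element. The sketch below writes a throwaway file into a temporary directory so it is self-contained; the file name and contents are assumptions for illustration.

```python
import os
import tempfile

from lxml import html

page = "<html><body><h1>Title of the page</h1></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(page)

    tree = html.parse(path)  # an ElementTree, not an element
    root = tree.getroot()    # the <html> element

root_tag = root.tag
title = root.xpath("//h1/text()")

print(root_tag)  # html
print(title)     # ['Title of the page']
```

Calling getroot() is useful when you need element-only methods; xpath() is available on both the tree and the element.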
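lxml's main strength is its support for XPath queries, which can locate and extract data precisely from HTML or XML documents.
• //tagname: selects all elements with the given tag name.
• //tagname[@attribute='value']: selects elements whose attribute has the given value.
• tagname/text(): extracts an element's text content.
• @attribute: extracts an attribute value.
from lxml import html
html_content = """
<html>
  <body>
    <h1>Title of the page</h1>
    <p class="content">This is a paragraph.</p>
    <p class="content">Another paragraph.</p>
    <a href="https://example.com">Click Here</a>
  </body>
</html>
"""
tree = html.fromstring(html_content)
# Get the text of every <p> element
paragraphs = tree.xpath('//p/text()')
print(paragraphs)  # prints ['This is a paragraph.', 'Another paragraph.']
# Get the text of every <p> element whose class is "content"
content_paragraphs = tree.xpath('//p[@class="content"]/text()')
print(content_paragraphs)  # prints ['This is a paragraph.', 'Another paragraph.']
# Get every hyperlink URL
links = tree.xpath('//a/@href')
print(links)  # prints ['https://example.com']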
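The predicate [@class="content"] compares the whole attribute string, so it misses elements that carry more than one class. The sketch below contrasts it with a common XPath 1.0 idiom for matching a single class token, and with a positional predicate. The markup is invented for illustration.

```python
from lxml import html

doc = html.fromstring("""
<div>
  <p class="content intro">First</p>
  <p class="content">Second</p>
  <p class="contents">Third</p>
</div>
""")

# Exact string comparison: only class="content" matches.
exact = doc.xpath('//p[@class="content"]/text()')

# Token match: pad the class list with spaces, then look for " content ".
token = doc.xpath(
    '//p[contains(concat(" ", normalize-space(@class), " "), " content ")]/text()'
)

# Positional predicate: the second <p> in document order.
second = doc.xpath('(//p)[2]/text()')

print(exact)   # ['Second']
print(token)   # ['First', 'Second']
print(second)  # ['Second']
```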
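lxml also supports CSS selectors for finding and extracting data. The cssselect() method relies on the separate cssselect package (pip install cssselect).
from lxml import html
html_content = """
<html>
  <body>
    <h1>Title of the page</h1>
    <p class="content">This is a paragraph.</p>
    <p class="content">Another paragraph.</p>
    <a href="https://example.com">Click Here</a>
  </body>
</html>
"""
tree = html.fromstring(html_content)
# Extract with a CSS selector
title = tree.cssselect('h1')[0].text
print(title)  # prints 'Title of the page'
# Get the text of every <p> element whose class is "content"
content_paragraphs = tree.cssselect('p.content')
for p in content_paragraphs:
    print(p.text)
# Output:
# This is a paragraph.
# Another paragraph.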
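If the cssselect package is not available, class lookups can still be done with find_class(), a method on lxml.html elements that matches individual class tokens. A minimal sketch, with markup invented for illustration:

```python
from lxml import html

doc = html.fromstring("""
<div>
  <h1>Title of the page</h1>
  <p class="content">This is a paragraph.</p>
  <p class="content extra">Another paragraph.</p>
</div>
""")

# find_class() returns every element whose class list contains "content".
content_texts = [p.text for p in doc.find_class("content")]

print(content_texts)  # ['This is a paragraph.', 'Another paragraph.']
```

Unlike a CSS selector, find_class() cannot also filter by tag name, so filter on element.tag afterwards if that matters.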
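lxml can also parse and process XML documents, in much the same way as HTML documents.
from lxml import etree
# The root element name <books> is illustrative
xml_content = """
<books>
  <book>
    <title>Python Programming</title>
    <author>John Doe</author>
    <price>29.99</price>
  </book>
  <book>
    <title>Aprendiendo Python</title>
    <author>Jane Smith</author>
    <price>24.99</price>
  </book>
</books>
"""
# Parse the XML
tree = etree.fromstring(xml_content)
# Get the text of every <title> element
titles = tree.xpath('//title/text()')
print(titles)  # prints ['Python Programming', 'Aprendiendo Python']
# Get every author's name
authors = tree.xpath('//author/text()')
print(authors)  # prints ['John Doe', 'Jane Smith']
# Get the price of the first book
price = tree.xpath('//book[1]/price/text()')
print(price)  # prints ['29.99']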
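Separate XPath queries return parallel lists, which can drift out of step if a field is missing from one record. Iterating over each book and reading its children keeps the fields together. The sketch below uses findall() and findtext(); the root name and the dictionary layout are assumptions for illustration.

```python
from lxml import etree

xml_content = """
<books>
  <book><title>Python Programming</title><author>John Doe</author><price>29.99</price></book>
  <book><title>Aprendiendo Python</title><author>Jane Smith</author><price>24.99</price></book>
</books>
"""

root = etree.fromstring(xml_content)

records = []
for book in root.findall("book"):  # direct <book> children of the root
    records.append({
        "title": book.findtext("title"),
        "author": book.findtext("author"),
        "price": float(book.findtext("price")),  # text is a str; convert explicitly
    })

total = round(sum(record["price"] for record in records), 2)

print(records[0]["title"])  # Python Programming
print(total)                # 54.98
```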
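lxml can save a parsed tree back to a file or convert it to a string.
from lxml import etree, html
xml_content = """
<book>
  <title>Python Programming</title>
  <author>John Doe</author>
  <price>29.99</price>
</book>
"""
root = etree.fromstring(xml_content)
# Save as an XML file. fromstring() returns an Element, which has no write()
# method, so wrap it in an ElementTree first.
etree.ElementTree(root).write('output.xml', pretty_print=True, xml_declaration=True, encoding='UTF-8')
# Convert an HTML tree to a string
html_content = """
<html><body><h1>Title of the page</h1></body></html>
"""
tree = html.fromstring(html_content)
html_str = etree.tostring(tree, pretty_print=True, encoding='unicode')
print(html_str)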
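The return type of etree.tostring() depends on the encoding argument, which is a frequent source of confusion. The sketch below shows the three common cases on a tiny document invented for illustration.

```python
from lxml import etree

root = etree.fromstring("<book><title>Python Programming</title></book>")

as_bytes = etree.tostring(root)                     # bytes by default
as_text = etree.tostring(root, encoding="unicode")  # str, no XML declaration
with_decl = etree.tostring(root, xml_declaration=True, encoding="UTF-8")  # bytes with declaration

print(type(as_bytes).__name__)  # bytes
print(as_text)                  # <book><title>Python Programming</title></book>
print(with_decl.startswith(b"<?xml"))  # True
```

Use encoding="unicode" when you want text for printing or logging, and a byte encoding when writing to a file or sending over the network.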
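If an XML or HTML document uses namespaces, lxml can handle them.
from lxml import etree
# The root element name <root> is illustrative
xml_content = """
<root xmlns:ns="http://example.com">
  <ns:book>
    <ns:title>Python Programming</ns:title>
    <ns:author>John Doe</ns:author>
    <ns:price>29.99</ns:price>
  </ns:book>
</root>
"""
tree = etree.fromstring(xml_content)
# Look up elements using a prefix-to-URI namespace mapping
namespaces = {'ns': 'http://example.com'}
title = tree.xpath('//ns:title/text()', namespaces=namespaces)
print(title)  # prints ['Python Programming']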
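Documents often declare a default namespace (xmlns="...") with no prefix at all. In that case an unprefixed query matches nothing, and you still need either a prefix mapping of your own or the {uri}tag form known as Clark notation. A minimal sketch, with the document and the NS constant invented for illustration:

```python
from lxml import etree

xml_content = """
<catalog xmlns="http://example.com">
  <book><title>Python Programming</title></book>
</catalog>
"""

root = etree.fromstring(xml_content)
NS = "http://example.com"

# An unprefixed name selects elements in no namespace, so nothing matches.
unqualified = root.xpath("//title/text()")

# Clark notation: "{uri}localname" (doubled braces are f-string escapes).
clark = root.findtext(f"{{{NS}}}book/{{{NS}}}title")

# The same lookup with a prefix chosen locally; the prefix name is arbitrary.
with_map = root.findtext("ns:book/ns:title", namespaces={"ns": NS})

# QName splits a qualified tag into its namespace and local name.
local_name = etree.QName(root).localname

print(unqualified)  # []
print(clark)        # Python Programming
print(with_map)     # Python Programming
print(local_name)   # catalog
```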
These are some of the basics of lxml, covering HTML parsing, XPath queries, CSS selectors, XML processing, file output, and namespace handling.