Python 玩转数据 13 - 数据读写 Data I/O for XML File Format

引言

本文主要介绍有关 Python 对 XML 文件格式的读写,及格式化,序列化,更多 Python 进阶系列文章,请参考 Python 进阶学习 玩转数据系列

内容提要:

  1. JSON vs. XML
  2. Python 对 XML 数据读写模块
  3. xml ElementTree API
  4. 设置获取 XML 属性 .set() and .get()
  5. 格式化 XML 输出
  6. Pretty Printing of the XML Tree
  7. xml ElementTree 解析 XML
  8. XPath 遍历 XML 例子
  9. 处理 XML Exceptions
  10. 读取 XML 转换成 Pandas DataFrame
  11. 用 xmltodict 转换 XML 成 JSON

JSON vs. XML

XML: Extensible Markup Language
JSON:JavaScript Object Notation

Example

JSON Example:
Python 玩转数据 13 - 数据读写 Data I/O for XML File Format_第1张图片
XML Example:
Python 玩转数据 13 - 数据读写 Data I/O for XML File Format_第2张图片

比较

  1. 相似:
    ● 两者都是自我描述性语言,可读性强
    ● 两者都是层级式结构
    ● 两者都可以被多种语言解析并使用
    ● 两者都可以从一个 XMLHttpRequest 获取
  2. 不同:
    ● JSON 不用结束标签 tag
    ● JSON 更短
    ● JSON 读写更快
    ● JSON 可以用数组
    ● XML 只能通过 XML parser 解析
    ● JSON 可以被普通的 JavaScript function 解析
  3. 为什么 JSON 比 XML 更好?
    ● XML 比 JSON 更难解析
    ● JSON 可以解析成现在的 JavaScript 对象.

Python 对 XML 数据读写模块

Python 玩转数据 13 - 数据读写 Data I/O for XML File Format_第3张图片
相关资料

  1. XML to JSON with xmltodict
    ● https://micropyramid.com/blog/how-to-convert-xml-content-into-json-using-xmltodict/
    ● https://pypi.org/project/xmltodict/
  2. XML to Pandas DataFrame
    ● http://gokhanatil.com/2017/11/python-for-data-science-importing-xml-to-pandas-dataframe.html
  3. XML Pretty Printing with lxml
    ● https://lxml.de/tutorial.html
  4. XML Parser: lxml
    ● https://lxml.de/xpathxslt.html

xml ElementTree API

用来创建和解析 XML

  1. 两个主要的类:
    ● ElementTree - XML and document 操作
    ● Element - XML element 的封装
  2. ElementTree.find() 和 ElementTree.findall() 方法:
    提供 XPath 搜索
    • XPath 代表 XML Path Language
    • XPath 使用路径的格式来定位 XML 文档中的 nodes 节点

创建 XML:

  1. 用 Element 创建一个 root node
  2. 用 ElementTree 创建基于 root node 的一个 tree
  3. tree.write() 将 XML Tree 保存到一个 XML 文件
from xml.etree.cElementTree import ElementTree
from xml.etree.cElementTree import Element

root = Element("root")
tree = ElementTree(root)

tree.write('results.xml', encoding='utf8')

输出:
生成的 results.xml 内容:
Python 玩转数据 13 - 数据读写 Data I/O for XML File Format_第4张图片

举例:

from xml.etree.cElementTree import ElementTree
from xml.etree.cElementTree import Element
from collections import namedtuple

root = Element('contacts') # 
tree = ElementTree(root)

# create ContactRecord class, its fields are first, last, age and email
Contact = namedtuple("ContactRecord", 'first last age email')
# Information to populate XML tree with 
records = [
    Contact('Tom', 'Smith', 53, '[email protected]'),
    Contact('Phil', 'Hammer', 42, '[email protected]'),
    Contact('Mary', 'Fast', 22, '[email protected]'),
    Contact('Jessica', 'Rest', 33, '[email protected]')
]

records.sort(key=lambda a: a.age, reverse=True)

print("records:\n{}".format(records))

# Now build and append nodes to the XML tree:
for record in records:
    contact = Element('contact') # 
    name    = Element('name') 
    first   = Element('first') # 
    last    = Element('last') 
    email   = Element('email')
    
    name.attrib = {'age': str(record.age)} # < name age='43'>
    first.text = record.first
    last.text = record.last
    email.text = record.email
    name.append(first) # John
    name.append(last)
    contact.append(name)
    contact.append(email)
    root.append(contact)

#  save the built XML tree as an XML file:
tree.write('results.xml', encoding='utf8')

输出:

records:
[ContactRecord(first='Tom', last='Smith', age=53, email='[email protected]'), ContactRecord(first='Phil', last='Hammer', age=42, email='[email protected]'), ContactRecord(first='Jessica', last='Rest', age=33, email='[email protected]'), ContactRecord(first='Mary', last='Fast', age=22, email='[email protected]')]

生成的 results.xml 内容:
Python 玩转数据 13 - 数据读写 Data I/O for XML File Format_第5张图片

设置获取 XML 属性 .set() and .get()

设置和获取属性的方法:

.set(): 设置属性
element.set(name, value)

.attrib(): 设置属性
element.attrib = {name:value}
element.attrib[name] = value

.get(): 获取属性
element.get(name)
element.get(name, default value)

举例:

from xml.etree.cElementTree import ElementTree
from xml.etree.cElementTree import Element
from collections import namedtuple

root = Element('contacts') # 
tree = ElementTree(root)

# create ContactRecord class, its fields are first, last, age and email
Contact = namedtuple("ContactRecord", 'first last age email')
# Information to populate XML tree with 
records = [
    Contact('Tom', 'Smith', 53, '[email protected]'),
    Contact('Phil', 'Hammer', 42, '[email protected]'),
    Contact('Mary', 'Fast', 22, '[email protected]'),
    Contact('Jessica', 'Rest', 33, '[email protected]')
]

records.sort(key=lambda a: a.age, reverse=True)

print("records:\n{}".format(records))

# Now build and append nodes to the XML tree:
for record in records:
    name = Element('name')
    name.set('age', str(record.age))
    # or
    name.attrib = {'age':str(record.age)}
    # or
    name.attrib['age'] = str(record.age)
    # ... the rest of the code
    
    # possible KeyError
    print("age attribute is", name.attrib['age'])
    # possible None
    print("age attribute is", name.get('age'))
    # will use a default 
    print("no age_foo attribute, default is", name.get('age_foo', 50))

输出:

records:
[ContactRecord(first='Tom', last='Smith', age=53, email='[email protected]'), ContactRecord(first='Phil', last='Hammer', age=42, email='[email protected]'), ContactRecord(first='Jessica', last='Rest', age=33, email='[email protected]'), ContactRecord(first='Mary', last='Fast', age=22, email='[email protected]')]
age attribute is 53
age attribute is 53
no age_foo attribute, default is 50
age attribute is 42
age attribute is 42
no age_foo attribute, default is 50
age attribute is 33
age attribute is 33
no age_foo attribute, default is 50
age attribute is 22
age attribute is 22
no age_foo attribute, default is 50

Pretty Printing of the XML Tree

• xml ElementTree 不支持漂亮的格式输出
• minidom API 可以作为一个 work-around
• lxml 版本的 ElementTree 支持漂亮的格式选项

lxml 版本的 ElementTree

https://lxml.de/xpathxslt.html

● 推荐使用 LXML,有丰富的解析器
● 是基于 libxml2 C++ library 开发的
● 是一个验证解析器: 支持 schema 和 DTDs
● 支持 full XPath 语法, 和 XSLT 能力
● 安装: pip install lxml

XLST 是 XML 样式语言 style sheet language,利用它可以将一个 XML 文档转换成 HTML

DTD 是文档类型定义,一个 DID 定义 XML 文档的结构,合法的元素及其属性

letree.tostring(lroot, pretty_print=True).decode(‘utf8’)

import lxml.etree as letree
from collections import namedtuple

# Example: XML pretty printing with lxml
lroot = letree.Element("root")

Contact = namedtuple("ContactRecord", 'first last age email')

records = [
    Contact('Tom', 'Smith', 53, '[email protected]'),
    Contact('Phil', 'Hammer', 42, '[email protected]'),
    Contact('Mary', 'Fast', 22, '[email protected]'),
    Contact('Jessica', 'Rest', 33, '[email protected]')
]
records.sort(key=lambda a: a.age, reverse=True)

for record in records:
    
    contact = letree.Element('contact') # 
    name    = letree.Element('name') 
    first   = letree.Element('first') # 
    last    = letree.Element('last') 
    email   = letree.Element('email')
    
    name.set('age', str(record.age))
    first.text = record.first
    last.text = record.last
    email.text = record.email
    name.append(first) # John
    name.append(last)
    contact.append(name)
    contact.append(email)
    lroot.append(contact)

print(letree.tostring(lroot, pretty_print=True).decode('utf8'))  

输出:

<root>
  <contact>
    <name age="53">
      <first>Tom</first>
      <last>Smith</last>
    </name>
    <email>tsmith@boo.com</email>
  </contact>
  <contact>
    <name age="42">
      <first>Phil</first>
      <last>Hammer</last>
    </name>
    <email>phammer@boo.com</email>
  </contact>
  <contact>
    <name age="33">
      <first>Jessica</first>
      <last>Rest</last>
    </name>
    <email>jrest@goo.com</email>
  </contact>
  <contact>
    <name age="22">
      <first>Mary</first>
      <last>Fast</last>
    </name>
    <email>mfast@boo.com</email>
  </contact>
</root>

Minidom API

pretty_xml = minidom.parseString(xml_str).toprettyxml(encoding=‘utf8’)

import xml.etree.cElementTree as etree
from xml.etree.cElementTree import ElementTree
from xml.etree.cElementTree import Element
from collections import namedtuple
from xml.dom import minidom

root = Element('contacts') # 
tree = ElementTree(root)

# create ContactRecord class, its fields are first, last, age and email
Contact = namedtuple("ContactRecord", 'first last age email')
# Information to populate XML tree with 
records = [
    Contact('Tom', 'Smith', 53, '[email protected]'),
    Contact('Phil', 'Hammer', 42, '[email protected]'),
    Contact('Mary', 'Fast', 22, '[email protected]'),
    Contact('Jessica', 'Rest', 33, '[email protected]')
]

records.sort(key=lambda a: a.age, reverse=True)

print("records:\n{}".format(records))

# Now build and append nodes to the XML tree:
for record in records:
    contact = Element('contact') # 
    name    = Element('name') 
    first   = Element('first') # 
    last    = Element('last') 
    email   = Element('email')
    
    name.attrib = {'age': str(record.age)} # < name age='43'>
    first.text = record.first
    last.text = record.last
    email.text = record.email
    name.append(first) # John
    name.append(last)
    contact.append(name)
    contact.append(email)
    root.append(contact)

xml_str = etree.tostring(root)
pretty_xml = minidom.parseString(xml_str).toprettyxml(encoding='utf8')
print(pretty_xml.decode())  

with open("pretty.xml", 'w') as f:
    f.write(pretty_xml.decode())

输出:

records:
[ContactRecord(first='Tom', last='Smith', age=53, email='[email protected]'), ContactRecord(first='Phil', last='Hammer', age=42, email='[email protected]'), ContactRecord(first='Jessica', last='Rest', age=33, email='[email protected]'), ContactRecord(first='Mary', last='Fast', age=22, email='[email protected]')]
<?xml version="1.0" encoding="utf8"?>
<contacts>
        <contact>
                <name age="53">
                        <first>Tom</first>
                        <last>Smith</last>
                </name>
                <email>tsmith@boo.com</email>
        </contact>
        <contact>
                <name age="42">
                        <first>Phil</first>
                        <last>Hammer</last>
                </name>
                <email>phammer@boo.com</email>
        </contact>
        <contact>
                <name age="33">
                        <first>Jessica</first>
                        <last>Rest</last>
                </name>
                <email>jrest@goo.com</email>
        </contact>
        <contact>
                <name age="22">
                        <first>Mary</first>
                        <last>Fast</last>
                </name>
                <email>mfast@boo.com</email>
        </contact>
</contacts>

生成的 pretty.xml 内容:
Python 玩转数据 13 - 数据读写 Data I/O for XML File Format_第6张图片

xml ElementTree 解析 XML

results.xml 内容:


<contacts><contact><name age="53"><first>Tomfirst><last>Smithlast>name><email>[email protected]email>contact><contact><name age="42"><first>Philfirst><last>Hammerlast>name><email>[email protected]email>contact><contact><name age="33"><first>Jessicafirst><last>Restlast>name><email>[email protected]email>contact><contact><name age="22"><first>Maryfirst><last>Fastlast>name><email>[email protected]email>contact>contacts>

解析 results.xml 内容

from xml.etree.cElementTree import ElementTree
from collections import namedtuple

Contact = namedtuple('ContactRecord', 'first last age email')
tree = ElementTree().parse('results.xml')
contacts = []

for contact in tree.getiterator('contact'):
    first = contact.find('.//first').text
    last = contact.find('.//last').text
    age = contact.find('./name').get('age')
    email = contact.find('.//email').text
    contacts.append(Contact(first, last, age, email))
print(contacts)

输出:

[ContactRecord(first='Tom', last='Smith', age='53', email='[email protected]'), ContactRecord(first='Phil', last='Hammer', age='42', email='[email protected]'), ContactRecord(first='Jessica', last='Rest', age='33', email='[email protected]'), ContactRecord(first='Mary', last='Fast', age='22', email='[email protected]')]

XPath 遍历 XML 例子

Python 玩转数据 13 - 数据读写 Data I/O for XML File Format_第7张图片

from xml.etree.cElementTree import ElementTree

xml = '''

  
    
      
        
          2260
        
      
      
        
          
            
              1853
            
          
        
      
    
        
      
        
          3312
        
      
      
        
          
            
              1853
            
          
        
      
    
  
'''

with open("xpath.xml",'w') as f:
    f.write(xml)

fp = open("xpath.xml",'r')
root = ElementTree().parse(fp)
elements = root.findall('Items/Item/ItemAttributes/ListPrice/Amount')
for i in elements:
    print(i.text)

输出:

2260
3312

处理 XML Exceptions

XML 操作可能会抛出异常,所以需要用 try-except 来处理异常。
Python 玩转数据 13 - 数据读写 Data I/O for XML File Format_第8张图片
举例:

from xml.etree.cElementTree import ElementTree
from xml.etree.cElementTree import ParseError
from collections import namedtuple
import sys

try:
    tree = ElementTree().parse('results.xml')
except ParseError as e:
    print('Parse error: {err}'.format(err=e))
    sys.exit()

contacts = []
Contact = namedtuple('ContactRecord', 'first last age email')

for contact in tree.getiterator('contact'):
    try:
        first = contact.find('.//first').text
        last = contact.find('.//last').text
        age = contact.find('./name').get('age')
        email = contact.find('.//email').text
        contacts.append(Contact(first, last, age, email))
    except AttributeError as e:
        print('Element error: {err}'.format(err=e))
print(contacts)

输出:

[ContactRecord(first='Tom', last='Smith', age='53', email='[email protected]'), ContactRecord(first='Phil', last='Hammer', age='42', email='[email protected]'), ContactRecord(first='Jessica', last='Rest', age='33', email='[email protected]'), ContactRecord(first='Mary', last='Fast', age='22', email='[email protected]')]

读取 XML 转换成 Pandas DataFrame

from xml.etree.cElementTree import ElementTree
from xml.etree.cElementTree import ParseError
import pandas as pd
import sys

cols = ['first', 'last','age', 'email']

xml_df = pd.DataFrame(columns = cols,dtype=str)

try:
    tree = ElementTree().parse('results.xml')
except ParseError as e:
    print('Parse error: {err}'.format(err=e))
    sys.exit()

for contact in tree.getiterator('contact'):
    try:
        first = contact.find('.//first').text
        last = contact.find('.//last').text
        age = contact.find('./name').get('age')
        email = contact.find('.//email').text
        xml_df = xml_df.append(
            pd.Series([first, last, age, email],index=cols),
            ignore_index=True)
    except AttributeError as e:
        print('Element error: {err}'.format(err=e))

print("xml_df:\n{}".format(xml_df))

输出:

xml_df:
     first    last age            email
0      Tom   Smith  53   tsmith@boo.com
1     Phil  Hammer  42  phammer@boo.com
2  Jessica    Rest  33    jrest@goo.com
3     Mary    Fast  22    mfast@boo.com

用 xmltodict 转换 XML 成 JSON

• pip install xmltodict
• read XML to OrderedDict

有关 JSON 和 Python Object 序列化和反序列化,请参考Python JSON 操作 - JSON 与 Python 对象,自定义对象 之间的互相转化

解析 results.xml 成 JSON 格式:


<contacts><contact><name age="53"><first>Tomfirst><last>Smithlast>name><email>[email protected]email>contact><contact><name age="42"><first>Philfirst><last>Hammerlast>name><email>[email protected]email>contact><contact><name age="33"><first>Jessicafirst><last>Restlast>name><email>[email protected]email>contact><contact><name age="22"><first>Maryfirst><last>Fastlast>name><email>[email protected]email>contact>contacts>
import xmltodict
import json

with open('results.xml') as f:
    xml_input = f.read()

ordered_dict_object_from_xml = xmltodict.parse(xml_input)
print("ordered_dict_object_from_xml:\n{}".format(ordered_dict_object_from_xml))

# serialize ordered_dict_object to json str
json_str_from_xml = json.dumps(ordered_dict_object_from_xml)
print("json_str_from_xml:\n{}".format(json_str_from_xml))

# deserialize json str to python object
json_from_xml = json.loads(json_str_from_xml)
print("json_from_xml:\n{}".format(json_from_xml))

输出:

ordered_dict_object_from_xml:
OrderedDict([('contacts', OrderedDict([('contact', [OrderedDict([('name', OrderedDict([('@age', '53'), ('first', 'Tom'), ('last', 'Smith')])), ('email', '[email protected]')]), OrderedDict([('name', OrderedDict([('@age', '42'), ('first', 'Phil'), ('last', 'Hammer')])), ('email', '[email protected]')]), OrderedDict([('name', OrderedDict([('@age', '33'), ('first', 'Jessica'), ('last', 'Rest')])), ('email', '[email protected]')]), OrderedDict([('name', OrderedDict([('@age', '22'), ('first', 'Mary'), ('last', 'Fast')])), ('email', '[email protected]')])])]))])
json_str_from_xml:
{"contacts": {"contact": [{"name": {"@age": "53", "first": "Tom", "last": "Smith"}, "email": "[email protected]"}, {"name": {"@age": "42", "first": "Phil", "last": "Hammer"}, "email": "[email protected]"}, {"name": {"@age": "33", "first": "Jessica", "last": "Rest"}, "email": "[email protected]"}, {"name": {"@age": "22", "first": "Mary", "last": "Fast"}, "email": "[email protected]"}]}}
json_from_xml:
{'contacts': {'contact': [{'name': {'@age': '53', 'first': 'Tom', 'last': 'Smith'}, 'email': '[email protected]'}, {'name': {'@age': '42', 'first': 'Phil', 'last': 'Hammer'}, 'email': '[email protected]'}, {'name': {'@age': '33', 'first': 'Jessica', 'last': 'Rest'}, 'email': '[email protected]'}, {'name': {'@age': '22', 'first': 'Mary', 'last': 'Fast'}, 'email': '[email protected]'}]}}

你可能感兴趣的:(#,Python,进阶系列,xml,python,json,lxml,pandas)