神奇夜光杯

Python Cool Libraries Tour: the Third-Party Library Pandas (008)

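I. Usage in Detail

16. The pandas.DataFrame.to_json function

16-1. Syntax

16-2. Parameters

16-3. Purpose

16-4. Return value

16-5. Notes

16-6. Usage

16-6-1. Data preparation

16-6-2. Code example

16-6-3. Output

17. The pandas.read_html function

17-1. Syntax

17-2. Parameters

17-3. Purpose

17-4. Return value

17-5. Notes

17-6. Usage

17-6-1. Data preparation

17-6-2. Code example

17-6-3. Output

18. The pandas.DataFrame.to_html function

18-1. Syntax

18-2. Parameters

18-3. Purpose

18-4. Return value

18-5. Notes

18-6. Usage

18-6-1. Data preparation

18-6-2. Code example

18-6-3. Output

II. Recommended Reading

1. Python Foundations Tour

2. Python Functions Tour

3. Python Algorithms Tour

4. Python Magic Tour

5. Blog Home Page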

I. Usage in Detail

16. The pandas.DataFrame.to_json function

16-1. Syntax

# 16. The pandas.DataFrame.to_json function
DataFrame.to_json(path_or_buf=None, *, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression='infer', index=None, indent=None, storage_options=None, mode='w')
Convert the object to a JSON string.

Note NaN’s and None will be converted to null and datetime objects will be converted to UNIX timestamps.

Parameters:
path_or_buf : str, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string.

orient : str
Indication of expected JSON string format.

Series:

default is ‘index’

allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}.

DataFrame:

default is ‘columns’

allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}.

The format of the JSON string:

‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

‘records’ : list like [{column -> value}, … , {column -> value}]

‘index’ : dict like {index -> {column -> value}}

‘columns’ : dict like {column -> {index -> value}}

‘values’ : just the values array

‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}}

Describing the data, where data component is like orient='records'.

date_format : {None, ‘epoch’, ‘iso’}
Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.

double_precision : int, default 10
The number of decimal places to use when encoding floating point values. The possible maximal value is 15. Passing double_precision greater than 15 will raise a ValueError.

force_ascii : bool, default True
Force encoded string to be ASCII.

date_unit : str, default ‘ms’ (milliseconds)
The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

default_handler : callable, default None
Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

lines : bool, default False
If ‘orient’ is ‘records’ write out line-delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list-like.

compression : str or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

New in version 1.5.0: Added support for .tar files.

Changed in version 1.4.0: Zstandard support.

index : bool or None, default None
The index is only used when ‘orient’ is ‘split’, ‘index’, ‘columns’, or ‘table’. Of these, ‘index’ and ‘columns’ do not support index=False.

indent : int, optional
Length of whitespace used to indent each record.

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

mode : str, default ‘w’ (writing)
Specify the IO mode for output when supplying a path_or_buf. Accepted args are ‘w’ (writing) and ‘a’ (append) only. mode='a' is only supported when lines is True and orient is ‘records’.

Returns:
None or str
If path_or_buf is None, returns the resulting json format as a string. Otherwise returns None.
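
The orient layouts listed above are easier to compare side by side. The following is a minimal sketch, assuming a small made-up DataFrame with columns A and B and index labels x and y (these names are illustrative only, not from the pandas documentation):

```python
import json
import pandas as pd

# Hypothetical sample data with string index labels so the index is visible.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# Serialize the same frame once per orient and keep the JSON strings.
results = {
    orient: df.to_json(orient=orient)
    for orient in ["split", "records", "index", "columns", "values"]
}

for orient, text in results.items():
    # Parse the string back so the structure is shown as Python objects.
    print(orient, "->", json.loads(text))
```

Note that 'records' and 'values' drop the index labels, while 'split', 'index', and 'columns' keep them.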

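16-2. Parameters

16-2-1. path_or_buf (optional, default None): a string, path object, or file-like object. If it is a string or path, the JSON data is written to the file at that path; if it is a file-like object, the data is written to that object; if it is None, the generated JSON is returned as a string.

16-2-2. orient (optional, default None): a string indicating the expected layout of the JSON output.

16-2-2-1. 'split': a dict like {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}.

16-2-2-2. 'records': a list like [{column -> value}, ... , {column -> value}].

16-2-2-3. 'index': a dict like {index -> {column -> value}}, where the index labels become the keys of the JSON object.

16-2-2-4. 'columns': a dict like {column -> {index -> value}}.

16-2-2-5. 'values': just the array of values.

16-2-2-6. If orient is not specified, a DataFrame defaults to 'columns' (a Series defaults to 'index'). A further option, 'table', produces a dict like {'schema': {schema}, 'data': {data}}.

16-2-3. date_format (optional, default None): None, 'epoch', or 'iso'. 'epoch' writes epoch timestamps and 'iso' writes ISO 8601 strings. When it is None, the default depends on orient: 'iso' for orient='table' and 'epoch' for all other orients.

16-2-4. double_precision (optional, default 10): an integer giving the number of decimal places used when encoding floating-point values. The maximum is 15; a larger value raises a ValueError.

16-2-5. force_ascii (optional, default True): a bool controlling whether the encoded string is forced to be ASCII, that is, whether all non-ASCII characters are escaped.

16-2-6. date_unit (optional, default 'ms'): a string giving the time unit used for timestamps and ISO 8601 precision. 's', 'ms', 'us', and 'ns' stand for seconds, milliseconds, microseconds, and nanoseconds respectively.

16-2-7. default_handler (optional, default None): a callable invoked when an object cannot otherwise be converted to JSON. It receives the object as its single argument and should return a serializable object. When it is None, no custom handler is used.

16-2-8. lines (optional, default False): a bool. If True and orient is 'records', the output is line-delimited JSON with one record per line. Any other orient raises a ValueError.

16-2-9. compression (optional, default 'infer'): a string, dict, or None specifying the compression used when writing the file. 'infer' (the default) picks the compression from the file extension (for example .gz, .bz2, .zip, .xz, or .zst); None disables compression.

16-2-10. index (optional, default None): a bool or None controlling whether the index is included in the JSON output. It is only used when orient is 'split', 'index', 'columns', or 'table'; of these, 'index' and 'columns' do not support index=False.

16-2-11. indent (optional, default None): an integer or None giving the amount of whitespace used to indent each record. If it is None, no indentation is applied.

16-2-12. storage_options (optional, default None): a dict of extra options for a particular storage connection, such as AWS S3 credentials.

16-2-13. mode (optional, default 'w'): a string. 'w' means write mode (an existing file is overwritten) and 'a' means append mode. mode='a' is only supported when lines is True and orient is 'records'.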
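
A few of these parameters are easier to understand from their effect on the output. The sketch below is only an illustration under assumptions: the column names when and value and the sample values are made up for the example.

```python
import json
import pandas as pd

# Hypothetical sample data: one timestamp and one float.
df = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01"]),
    "value": [1.23456789],
})

# date_format="iso" writes ISO 8601 strings instead of epoch numbers.
iso_json = df.to_json(orient="records", date_format="iso")

# double_precision limits the number of decimal places kept for floats.
rounded_json = df.to_json(orient="records", date_format="iso", double_precision=2)

# index=False drops the index; this is allowed for orient="split".
no_index_json = df.to_json(orient="split", date_format="iso", index=False)

print(iso_json)
print(rounded_json)
print(no_index_json)
```

Passing index=False together with orient='index' or orient='columns' is not supported, because in those layouts the index labels are the keys of the JSON object.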

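16-3. Purpose

Converts a Pandas DataFrame object to JSON, and either writes the result to a file or returns it as a string.

16-4. Return value

16-4-1. If path_or_buf is None (the default), the function returns a string containing the JSON data.

16-4-2. If path_or_buf is a file path or file-like object, the function returns None and writes the JSON data to that file or object instead.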
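
The two return behaviours can be checked directly. This is a minimal sketch that uses an in-memory io.StringIO buffer as the file-like object, so nothing is written to disk; the variable names are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# No path_or_buf: the JSON text is the return value.
returned = df.to_json(orient="records")

# File-like object: the JSON text goes into the buffer and None is returned.
buffer = io.StringIO()
result = df.to_json(buffer, orient="records")
written = buffer.getvalue()

print(type(returned).__name__)  # str
print(result)                   # None
print(returned == written)      # True
```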

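16-5. Notes

This function is useful in data analysis and data science, especially when you need to export the contents of a DataFrame to a front-end application or web service, or to exchange data between programming languages.

16-6. Usage

16-6-1. Data preparation

None.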

16-6-2. Code example

# 16. The pandas.DataFrame.to_json function
# 16-1. Return the JSON as a string
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
json_str = df.to_json(orient='records')
print(json_str) # Output: [{"A":1,"B":4},{"A":2,"B":5},{"A":3,"B":6}]

# 16-2. Write to a file
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
df.to_json('data.json', orient='records', lines=True)
# data.json is created in the current working directory, with one JSON record per line

16-6-3. Output

# 16. The pandas.DataFrame.to_json function
# 16-1. Return the JSON as a string
# [{"A":1,"B":4},{"A":2,"B":5},{"A":3,"B":6}]

# 16-2. Write to a file
# data.json is created in the current working directory, with one JSON record per line
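
To see what the line-delimited file actually contains, the following sketch writes the same DataFrame into a temporary directory (so it does not touch the working directory), reads the raw lines, and loads them back with pandas.read_json. The temporary path is an assumption made for the example.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location for the output file.
    path = os.path.join(tmp_dir, "data.json")

    # orient="records" with lines=True writes one JSON object per line.
    df.to_json(path, orient="records", lines=True)

    with open(path, encoding="utf-8") as fh:
        raw_lines = fh.read().splitlines()

    # lines=True is also needed when reading the file back.
    restored = pd.read_json(path, lines=True)

print(raw_lines)
print(restored)
```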

17. The pandas.read_html function

17-1. Syntax

# 17. The pandas.read_html function
pandas.read_html(io, *, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True, extract_links=None, dtype_backend=_NoDefault.no_default, storage_options=None)
Read HTML tables into a list of DataFrame objects.

Parameters:
io : str, path object, or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a string read() function. The string can represent a URL or the HTML itself. Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.

Deprecated since version 2.1.0: Passing html literal strings is deprecated. Wrap literal string/bytes input in io.StringIO/io.BytesIO instead.

match : str or compiled regular expression, optional
The set of tables containing text matching this regex or string will be returned. Unless the HTML is extremely simple you will probably need to pass a non-empty string here. Defaults to ‘.+’ (match any non-empty string). The default value will return all tables contained on a page. This value is converted to a regular expression so that there is consistent behavior between Beautiful Soup and lxml.

flavor : {“lxml”, “html5lib”, “bs4”} or list-like, optional
The parsing engine (or list of parsing engines) to use. ‘bs4’ and ‘html5lib’ are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.

header : int or list-like, optional
The row (or list of rows for a MultiIndex) to use to make the columns headers.

index_col : int or list-like, optional
The column (or list of columns) to use to create the index.

skiprows : int, list-like or slice, optional
Number of rows to skip after parsing the column integer. 0-based. If a sequence of integers or a slice is given, will skip the rows indexed by that sequence. Note that a single element sequence means ‘skip the nth row’ whereas an integer means ‘skip n rows’.

attrs : dict, optional
This is a dictionary of attributes that you can pass to use to identify the table in the HTML. These are not checked for validity before being passed to lxml or Beautiful Soup. However, these attributes must be valid HTML table attributes to work correctly. For example,

attrs = {'id': 'table'}
is a valid attribute dictionary because the ‘id’ HTML tag attribute is a valid HTML attribute for any HTML tag as per this document.

attrs = {'asdf': 'table'}
is not a valid attribute dictionary because ‘asdf’ is not a valid HTML attribute even if it is a valid XML attribute. Valid HTML 4.01 table attributes can be found here. A working draft of the HTML 5 spec can be found here. It contains the latest information on table attributes for the modern web.

parse_dates : bool, optional
See read_csv() for more details.

thousands : str, optional
Separator to use to parse thousands. Defaults to ','.

encoding : str, optional
The encoding used to decode the web page. Defaults to None. None preserves the previous encoding behavior, which depends on the underlying parser library (e.g., the parser library will try to use the encoding provided by the document).

decimal : str, default ‘.’
Character to recognize as decimal point (e.g. use ‘,’ for European data).

converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the cell (not column) content, and return the transformed content.

na_values : iterable, default None
Custom NA values.

keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.

displayed_only : bool, default True
Whether elements with “display: none” should be parsed.

extract_links : {None, “all”, “header”, “body”, “footer”}
Table elements in the specified section(s) with <a> tags will have their href extracted.

New in version 1.5.0.

dtype_backend : {‘numpy_nullable’, ‘pyarrow’}, default ‘numpy_nullable’
Back-end data type applied to the resultant DataFrame (still experimental). Behaviour is as follows:

"numpy_nullable": returns nullable-dtype-backed DataFrame (default).

"pyarrow": returns pyarrow-backed nullable ArrowDtype DataFrame.

New in version 2.0.

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

New in version 2.1.0.

Returns:
dfs
A list of DataFrames.
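
As a rough illustration of the parameters above, the sketch below parses a small HTML snippet held in memory. It rests on several assumptions: the table markup and its id "prices" are made up for the example, and an HTML parser that pandas can use (lxml, or bs4 together with html5lib) must be installed. The snippet is wrapped in io.StringIO because passing literal HTML strings is deprecated.

```python
import io
import pandas as pd

# Hypothetical HTML with a single table; the id is used to select it.
html = """
<table id="prices">
  <thead><tr><th>item</th><th>price</th></tr></thead>
  <tbody>
    <tr><td>apple</td><td>1,200</td></tr>
    <tr><td>pear</td><td>950</td></tr>
  </tbody>
</table>
"""

# read_html always returns a list, even when only one table matches.
tables = pd.read_html(io.StringIO(html), attrs={"id": "prices"})
df = tables[0]

print(len(tables))
print(df)
```

With the default thousands=',' the text 1,200 is read as the number 1200 rather than as a string.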

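17-2. Parameters

17-2-1. io (required): a string, path object, or file-like object giving the source of the HTML. It can be a URL, a file path, or the HTML itself. Passing literal HTML strings is deprecated since pandas 2.1.0; wrap them in io.StringIO or io.BytesIO instead.

17-2-2. match (optional, default '.+'): a string or compiled regular expression used to select the tables whose text matches it. The default '.+' matches any non-empty string, so all tables on the page are returned.

17-2-3. flavor (optional, default None): a string or list-like naming the parsing engine(s) to use: 'lxml', 'html5lib', or 'bs4' (BeautifulSoup 4). If it is None, Pandas tries lxml first and falls back to bs4 + html5lib if that fails.

17-2-4. header (optional, default None): an integer or list of integers giving the row(s) to use as column headers. A list of integers builds a MultiIndex from several rows. If it is None, Pandas uses the header rows it detects in the table; if none are found, the columns get default integer labels.

17-2-5. index_col (optional, default None): an integer or list-like giving the column(s) used to create the row index. If it is None, no column is used as the row index.

17-2-6. skiprows (optional, default None): an integer, list-like, or slice. An integer skips that many rows; a sequence or slice skips the rows at those 0-based positions. This is useful for skipping irrelevant rows at the top of a table.

17-2-7. attrs (optional, default None): a dict of attributes used to identify the table in the HTML. They are matched against the attributes of the HTML table tag and must be valid HTML table attributes.

17-2-8. parse_dates (optional, default False): a bool or list, following the read_csv semantics. If True, Pandas tries to parse the data as dates; if a list, it names the columns to parse as dates.

17-2-9. thousands (optional, default ','): a string giving the thousands separator to parse.

17-2-10. encoding (optional, default None): a string giving the encoding used to decode the web page. If it is not specified, the behavior depends on the underlying parser library, which typically uses the encoding declared by the document (if any).

17-2-11. decimal (optional, default '.'): a string giving the character recognized as the decimal point, for example ',' for European data.

17-2-12. converters (optional, default None): a dict of conversion functions. The keys are column labels (or integer column positions) and the values are functions that take the content of a single cell (not the whole column) and return the transformed content.

17-2-13. na_values (optional, default None): an iterable of custom values that should be treated as missing (NaN).

17-2-14. keep_default_na (optional, default True): a bool. If na_values is specified and keep_default_na is False, the default NaN values are overridden; otherwise the values in na_values are appended to the defaults.

17-2-15. displayed_only (optional, default True): a bool. If True, only visible table elements are parsed (ignoring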