Quick Start with Python's Big Three: Pandas

Learning Pandas

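What is Pandas?

  • Pandas is an open-source data analysis and data manipulation library built on the Python programming language
  • Pandas provides easy-to-use data structures and analysis tools, and is especially suited to structured data such as tabular data (similar to an Excel spreadsheet)
  • Pandas is one of the most commonly used tools in data science and analytics; it lets users import data from a variety of sources and then operate on and analyze that data efficiently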

What data structures does Pandas provide?

Pandas introduces two main data structures: Series and DataFrame.

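  • Series: similar to a one-dimensional array or list, made up of a set of values and their associated labels (the index). A Series can be seen as a single column of a DataFrame, or it can exist on its own as a one-dimensional data structure.
  • DataFrame: similar to a two-dimensional table, and the most important data structure in Pandas. A DataFrame can be seen as a table built from several Series arranged as columns that share one row index. It has both a row index and a column index, which makes row and column selection, filtering, merging, and similar operations convenient.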
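The relationship between the two structures can be sketched in a few lines. The names `score` and `passed` and the values are made up for illustration; this is only a minimal sketch of the idea that a DataFrame column is a Series.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")

# A DataFrame: several Series used as columns, sharing the same row index
df = pd.DataFrame({"score": s, "passed": s >= 20})

col = df["score"]          # selecting one column gives back a Series
print(type(col).__name__)
print(df.shape)
```

Selecting a column returns a Series that carries the same row labels as the DataFrame it came from.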

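Applications of Pandas

Pandas is widely used in data science and data analysis; its main strength is processing and analyzing structured data.
Some of its main application areas are:

  • Data cleaning and preprocessing: Pandas is widely used to clean and preprocess data, including handling missing values, outliers, and duplicates. It provides many methods for making data suitable for further analysis
  • Data analysis and statistics: Pandas can import data from many sources, such as CSV, JSON, and Microsoft Excel files and SQL databases. Flexible operations on DataFrame and Series make statistical analysis, summarization, and aggregation straightforward, with built-in functions ranging from mean and median to standard deviation and correlation
  • Data visualization: combined with visualization libraries such as Matplotlib, Pandas can produce many kinds of charts, giving a more intuitive view of data distributions and trends
  • Time series analysis: Pandas handles time series data well and supports efficient operations on dates and times. This matters especially in finance, manufacturing, and other fields that work with time series
  • Machine learning and data modeling: data preprocessing is a key step in machine learning, and Pandas provides powerful features for processing and preparing data. It helps users shape data into a format suitable for machine learning algorithms
  • Database operations: Pandas can interact with databases, loading data from a database into a DataFrame for analysis and processing and then writing the results back. This is useful in database management and analysis
  • Real-time data analysis: for applications that need to monitor and analyze data as it arrives, Pandas can handle the in-memory analysis step efficiently; combined with other real-time data processing tools, it can form part of a real-time analytics system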

Basics

# Import NumPy and pandas
import numpy as np
import pandas as pd

Creating objects

# Create a Series from a list; np.nan marks a missing value
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
# Build a DataFrame from a dict of column name -> values; a value can be a scalar, a Series, an array, and so on
# A: float, B: Timestamp, C: Series, D: ndarray, E: Categorical, F: string
df = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20010809'),
    'C': pd.Series(1, index=[0, 1, 2, 3], dtype='float32'),
    'D': np.array([3, 3, 3, 3], dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
})
df
A B C D E F
0 1.0 2001-08-09 1.0 3 test foo
1 1.0 2001-08-09 1.0 3 train foo
2 1.0 2001-08-09 1.0 3 test foo
3 1.0 2001-08-09 1.0 3 train foo
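# Build a DataFrame from a NumPy array, a datetime index, and column labels
# Randomly generate a 6x4 array; the row labels are the date range d created below, and the column labels are A, B, C, D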
d = pd.date_range('20010807', periods=6)
print(d)
df1 = pd.DataFrame(np.random.randn(6, 4), index=d, columns=list('ABCD'))
df1
DatetimeIndex(['2001-08-07', '2001-08-08', '2001-08-09', '2001-08-10',
               '2001-08-11', '2001-08-12'],
              dtype='datetime64[ns]', freq='D')
A B C D
2001-08-07 -0.297676 0.049568 -1.299745 0.636079
2001-08-08 0.339365 0.500148 0.243576 -0.160504
2001-08-09 0.704013 1.720233 -0.186915 -1.322280
2001-08-10 -0.951185 1.420427 -0.194470 -0.305272
2001-08-11 -0.182651 0.939551 -0.432457 -1.060890
2001-08-12 1.178420 1.950314 -0.622115 1.248751
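Because `np.random.randn` draws new numbers on every run, the values printed in this post change from run to run. One way to make such an example repeatable is to use a seeded NumPy generator, as sketched below. The seed value 42 and the names `df_demo` and `df_again` are arbitrary choices for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily
dates = pd.date_range("20010807", periods=6)
df_demo = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# Rebuilding with the same seed reproduces exactly the same values
rng_again = np.random.default_rng(seed=42)
df_again = pd.DataFrame(rng_again.standard_normal((6, 4)), index=dates, columns=list("ABCD"))
print(df_demo.equals(df_again))
```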

Viewing data and dtypes

# Show the type of df and the dtype of each column
print(type(df))
print(df.dtypes)
<class 'pandas.core.frame.DataFrame'>
A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
# View the first rows of the DataFrame
# head() shows 5 rows by default when called without an argument
df1.head()
A B C D
2001-08-07 -0.297676 0.049568 -1.299745 0.636079
2001-08-08 0.339365 0.500148 0.243576 -0.160504
2001-08-09 0.704013 1.720233 -0.186915 -1.322280
2001-08-10 -0.951185 1.420427 -0.194470 -0.305272
2001-08-11 -0.182651 0.939551 -0.432457 -1.060890
# View the last rows of the DataFrame (here the last 3)
df1.tail(3)
A B C D
2001-08-10 -0.951185 1.420427 -0.194470 -0.305272
2001-08-11 -0.182651 0.939551 -0.432457 -1.060890
2001-08-12 1.178420 1.950314 -0.622115 1.248751
# View the index
print(df1.index)
print('------')
# View the column names
print(df1.columns)
DatetimeIndex(['2001-08-07', '2001-08-08', '2001-08-09', '2001-08-10',
               '2001-08-11', '2001-08-12'],
              dtype='datetime64[ns]', freq='D')
------
Index(['A', 'B', 'C', 'D'], dtype='object')
# View summary statistics (count, mean, std, min, quartiles, max)
df1.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.131714 1.096707 -0.415354 -0.160686
std 0.764475 0.734752 0.521604 0.979992
min -0.951185 0.049568 -1.299745 -1.322280
25% -0.268920 0.609999 -0.574701 -0.871986
50% 0.078357 1.179989 -0.313463 -0.232888
75% 0.612851 1.645281 -0.188804 0.436933
max 1.178420 1.950314 0.243576 1.248751

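Sorting

# Sort by axis labels (axis=1 sorts the columns; here in descending order)
# Note: the outputs in this section come from a different random run, so the values differ from the df1 shown above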
df1.sort_index(axis=1, ascending=False)
D C B A
2001-08-07 0.884528 -1.969893 3.054879 1.128499
2001-08-08 1.532802 0.214965 -0.847056 0.640980
2001-08-09 0.039443 0.228000 -0.209567 0.932230
2001-08-10 -1.657565 -0.738947 0.626624 -1.114161
2001-08-11 0.972585 0.767757 -0.272565 -0.436437
2001-08-12 0.023965 -0.084927 -2.336696 1.306975
# Sort by the values of column B
df1.sort_values(by='B')
A B C D
2001-08-12 1.306975 -2.336696 -0.084927 0.023965
2001-08-08 0.640980 -0.847056 0.214965 1.532802
2001-08-11 -0.436437 -0.272565 0.767757 0.972585
2001-08-09 0.932230 -0.209567 0.228000 0.039443
2001-08-10 -1.114161 0.626624 -0.738947 -1.657565
2001-08-07 1.128499 3.054879 -1.969893 0.884528
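Sorting by values also accepts several keys, each with its own direction. The sketch below uses a small made-up frame (`df_sort`, with hypothetical columns `group` and `value`) so that the result can be checked by eye.

```python
import pandas as pd

df_sort = pd.DataFrame(
    {"group": ["b", "a", "b", "a"], "value": [2, 3, 1, 4]},
    index=[10, 11, 12, 13],
)

# Sort by group ascending, then by value descending within each group
out = df_sort.sort_values(by=["group", "value"], ascending=[True, False])
print(out)

# sort_values returns a new object; the original row order is unchanged
print(df_sort.index.tolist())
```

As with most pandas methods, the sorted result has to be assigned to a name if it is needed later.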

Selecting data

# Select a single column, which returns a Series
df1['A']
2001-08-07   -0.297676
2001-08-08    0.339365
2001-08-09    0.704013
2001-08-10   -0.951185
2001-08-11   -0.182651
2001-08-12    1.178420
Freq: D, Name: A, dtype: float64
# Slice rows with [] (by position, or by label as with the date strings further down)
df1[1:3]
A B C D
2001-08-08 0.339365 0.500148 0.243576 -0.160504
2001-08-09 0.704013 1.720233 -0.186915 -1.322280
df1['2001-08-07':'2001-08-11']
A B C D
2001-08-07 -0.297676 0.049568 -1.299745 0.636079
2001-08-08 0.339365 0.500148 0.243576 -0.160504
2001-08-09 0.704013 1.720233 -0.186915 -1.322280
2001-08-10 -0.951185 1.420427 -0.194470 -0.305272
2001-08-11 -0.182651 0.939551 -0.432457 -1.060890

Selecting by label

df1.loc['2001-08-08']
A    0.339365
B    0.500148
C    0.243576
D   -0.160504
Name: 2001-08-08 00:00:00, dtype: float64
# Extract one row by label
df1.loc[d[0]]
A   -0.297676
B    0.049568
C   -1.299745
D    0.636079
Name: 2001-08-07 00:00:00, dtype: float64
# Select multiple columns by label
df1.loc[:, ['A', 'B']]
A B
2001-08-07 -0.297676 0.049568
2001-08-08 0.339365 0.500148
2001-08-09 0.704013 1.720233
2001-08-10 -0.951185 1.420427
2001-08-11 -0.182651 0.939551
2001-08-12 1.178420 1.950314
# Slice by label (both endpoints are included)
df1.loc['2001-08-07':'2001-08-10', ['C', 'D']]
C D
2001-08-07 -1.299745 0.636079
2001-08-08 0.243576 -0.160504
2001-08-09 -0.186915 -1.322280
2001-08-10 -0.194470 -0.305272
# Get the value at a given row label and column label
df1.loc['2001-08-10', 'A']
-0.9511849480978183
# Fast scalar access by label
df1.at['2001-08-09', 'C']
-0.1869148162186709

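Selecting by position

loc: selects rows by the label values in the row index (for example, the row whose index label is "A"); it is the usual choice for label-based selection and assignment
iloc: selects rows by integer position (for example, position 1, i.e. the second row); it is the usual choice for position-based selection and assignment

# Select one row by position
print(df1.iloc[3])

# Slice with integers
print(df1.iloc[3:5, 0:2])

# Select by lists of integer positions
print(df1.iloc[[1, 2, 4], [0, 2]])

# Slice whole rows
print(df1.iloc[1:3, :])

# Slice whole columns
print(df1.iloc[:, 1:3])

# Extract a single value explicitly
print(df1.iloc[1, 1])

# Fast scalar access by position
print(df1.iat[1, 1])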
A   -0.951185
B    1.420427
C   -0.194470
D   -0.305272
Name: 2001-08-10 00:00:00, dtype: float64
                   A         B
2001-08-10 -0.951185  1.420427
2001-08-11 -0.182651  0.939551
                   A         C
2001-08-08  0.339365  0.243576
2001-08-09  0.704013 -0.186915
2001-08-11 -0.182651 -0.432457
                   A         B         C         D
2001-08-08  0.339365  0.500148  0.243576 -0.160504
2001-08-09  0.704013  1.720233 -0.186915 -1.322280
                   B         C
2001-08-07  0.049568 -1.299745
2001-08-08  0.500148  0.243576
2001-08-09  1.720233 -0.186915
2001-08-10  1.420427 -0.194470
2001-08-11  0.939551 -0.432457
2001-08-12  1.950314 -0.622115
0.5001484694436036
0.5001484694436036
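One practical difference between the two indexers is how slices treat the end point. The sketch below uses a tiny made-up frame (`df_pos`) to show it: a label slice with `loc` includes both ends, while an integer slice with `iloc` excludes the end position, as in ordinary Python slicing.

```python
import pandas as pd

df_pos = pd.DataFrame({"x": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

by_label = df_pos.loc["b":"c"]   # label slice: both endpoints included
by_position = df_pos.iloc[1:2]   # integer slice: end position excluded

print(by_label["x"].tolist())
print(by_position["x"].tolist())
```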

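Filtering

# Keep the rows where column C is positive
print(df1[df1['C'] > 0])
# Keep the shape and replace the values that fail the condition with NaN
print(df1[df1 > 0])

# Filter with isin()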
newdf = df1.copy()
newdf['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
print(newdf)
newdf[newdf['E'].isin(['two', 'four'])]
                   A         B         C         D
2001-08-08  0.339365  0.500148  0.243576 -0.160504
                   A         B         C         D
2001-08-07       NaN  0.049568       NaN  0.636079
2001-08-08  0.339365  0.500148  0.243576       NaN
2001-08-09  0.704013  1.720233       NaN       NaN
2001-08-10       NaN  1.420427       NaN       NaN
2001-08-11       NaN  0.939551       NaN       NaN
2001-08-12  1.178420  1.950314       NaN  1.248751
                   A         B         C         D      E
2001-08-07 -0.297676  0.049568 -1.299745  0.636079    one
2001-08-08  0.339365  0.500148  0.243576 -0.160504    one
2001-08-09  0.704013  1.720233 -0.186915 -1.322280    two
2001-08-10 -0.951185  1.420427 -0.194470 -0.305272  three
2001-08-11 -0.182651  0.939551 -0.432457 -1.060890   four
2001-08-12  1.178420  1.950314 -0.622115  1.248751  three
A B C D E
2001-08-09 0.704013 1.720233 -0.186915 -1.32228 two
2001-08-11 -0.182651 0.939551 -0.432457 -1.06089 four
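Boolean conditions like the ones above can also be combined. The following is a small sketch on made-up data (`df_f`); the column names mirror the example above, but the values are invented for illustration.

```python
import pandas as pd

df_f = pd.DataFrame({"A": [-1.0, 0.5, 2.0, 3.0], "E": ["one", "two", "two", "four"]})

# Each comparison needs its own parentheses; use & (and), | (or), ~ (not)
both = df_f[(df_f["A"] > 0) & (df_f["E"] == "two")]
either = df_f[(df_f["A"] < 0) | df_f["E"].isin(["four"])]
not_two = df_f[~df_f["E"].isin(["two"])]

print(both.index.tolist(), either.index.tolist(), not_two.index.tolist())
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the operators `&`, `|`, and `~` are used here.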

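Assignment

# Note: every table printed below is identical, which suggests the cell was re-run after df1 had already been modified
# Add a new column from a Series; values are aligned automatically on the index
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=d)
print(s1)
print()

df1['F'] = s1
print(df1)
print()

# Assign by label
df1.at['2001-08-07', 'A'] = 0
print(df1)
print()

# Assign by position
df1.iat[1, 0] = 0
print(df1)
print()

# Assign a list as a new column
df1['G'] = [1, 2, 3, 4, 5, 6]
print(df1)
print()

# Assign a NumPy array to a column
df1.loc[:, 'D'] = np.array([5] * len(df1))
print(df1)
print()

# Conditional assignment: set every negative value to 0
newdf = df1.copy()
newdf[newdf < 0] = 0
print(newdf)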
2001-08-07    1
2001-08-08    2
2001-08-09    3
2001-08-10    4
2001-08-11    5
2001-08-12    6
Freq: D, dtype: int64

                   A         B         C    D  F  G
2001-08-07  0.000000  0.049568 -1.299745  5.0  1  1
2001-08-08  0.000000  0.500148  0.243576  5.0  2  2
2001-08-09  0.704013  1.720233 -0.186915  5.0  3  3
2001-08-10 -0.951185  1.420427 -0.194470  5.0  4  4
2001-08-11 -0.182651  0.939551 -0.432457  5.0  5  5
2001-08-12  1.178420  1.950314 -0.622115  5.0  6  6

                   A         B         C    D  F  G
2001-08-07  0.000000  0.049568 -1.299745  5.0  1  1
2001-08-08  0.000000  0.500148  0.243576  5.0  2  2
2001-08-09  0.704013  1.720233 -0.186915  5.0  3  3
2001-08-10 -0.951185  1.420427 -0.194470  5.0  4  4
2001-08-11 -0.182651  0.939551 -0.432457  5.0  5  5
2001-08-12  1.178420  1.950314 -0.622115  5.0  6  6

                   A         B         C    D  F  G
2001-08-07  0.000000  0.049568 -1.299745  5.0  1  1
2001-08-08  0.000000  0.500148  0.243576  5.0  2  2
2001-08-09  0.704013  1.720233 -0.186915  5.0  3  3
2001-08-10 -0.951185  1.420427 -0.194470  5.0  4  4
2001-08-11 -0.182651  0.939551 -0.432457  5.0  5  5
2001-08-12  1.178420  1.950314 -0.622115  5.0  6  6

                   A         B         C    D  F  G
2001-08-07  0.000000  0.049568 -1.299745  5.0  1  1
2001-08-08  0.000000  0.500148  0.243576  5.0  2  2
2001-08-09  0.704013  1.720233 -0.186915  5.0  3  3
2001-08-10 -0.951185  1.420427 -0.194470  5.0  4  4
2001-08-11 -0.182651  0.939551 -0.432457  5.0  5  5
2001-08-12  1.178420  1.950314 -0.622115  5.0  6  6

                   A         B         C    D  F  G
2001-08-07  0.000000  0.049568 -1.299745  5.0  1  1
2001-08-08  0.000000  0.500148  0.243576  5.0  2  2
2001-08-09  0.704013  1.720233 -0.186915  5.0  3  3
2001-08-10 -0.951185  1.420427 -0.194470  5.0  4  4
2001-08-11 -0.182651  0.939551 -0.432457  5.0  5  5
2001-08-12  1.178420  1.950314 -0.622115  5.0  6  6

                   A         B         C    D  F  G
2001-08-07  0.000000  0.049568  0.000000  5.0  1  1
2001-08-08  0.000000  0.500148  0.243576  5.0  2  2
2001-08-09  0.704013  1.720233  0.000000  5.0  3  3
2001-08-10  0.000000  1.420427  0.000000  5.0  4  4
2001-08-11  0.000000  0.939551  0.000000  5.0  5  5
2001-08-12  1.178420  1.950314  0.000000  5.0  6  6
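Index alignment on column assignment is easiest to see when the Series does not match the frame exactly. In this sketch (the names `df_a` and `partial` are made up for illustration), the Series covers only two of the four dates and lists them in reverse order; pandas matches values by label, not by position, and leaves the unmatched rows as NaN.

```python
import pandas as pd

dates = pd.date_range("20010807", periods=4)
df_a = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=dates)

# The Series covers only the last two dates and is given in reverse order
partial = pd.Series([40, 30], index=[dates[3], dates[2]])
df_a["F"] = partial

print(df_a)
```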

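Missing values

# reindex rebuilds the frame on a new set of row and column labels (a frequently used method); labels that did not exist before are filled with NaN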
newdf = df1.reindex(index=d[0:4], columns=list(df1.columns) + ['E'])
newdf.loc[d[0]:d[1], 'E'] = 1
newdf
A B C D F G E
2001-08-07 0.000000 0.049568 -1.299745 5.0 1 1 1.0
2001-08-08 0.000000 0.500148 0.243576 5.0 2 2 1.0
2001-08-09 0.704013 1.720233 -0.186915 5.0 3 3 NaN
2001-08-10 -0.951185 1.420427 -0.194470 5.0 4 4 NaN
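# Use dropna to drop missing values
# how='any'
# Optional parameter; the default is 'any'
#    any: drop the row or column if it contains any NaN
#    all: drop the row or column only if all of its values are NaN

newdf.dropna(how='any')
# Note: dropna returns a new object and is not assigned back to newdf here, so the two rows below are only the returned result; newdf itself still has 4 rows
# After newdf = newdf.dropna(how='any'), newdf itself would have two rows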
A B C D F G E
2001-08-07 0.0 0.049568 -1.299745 5.0 1 1 1.0
2001-08-08 0.0 0.500148 0.243576 5.0 2 2 1.0
# Fill NaN values with 5
newdf.fillna(value=5)
A B C D F G E
2001-08-07 0.000000 0.049568 -1.299745 5.0 1 1 1.0
2001-08-08 0.000000 0.500148 0.243576 5.0 2 2 1.0
2001-08-09 0.704013 1.720233 -0.186915 5.0 3 3 5.0
2001-08-10 -0.951185 1.420427 -0.194470 5.0 4 4 5.0
# Test each value for NaN
pd.isna(newdf)
A B C D F G E
2001-08-07 False False False False False False False
2001-08-08 False False False False False False False
2001-08-09 False False False False False False True
2001-08-10 False False False False False False True
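Both `dropna` and `fillna` have finer-grained forms. The sketch below, on a small made-up frame (`df_na`), restricts `dropna` to one column with `subset`, uses `how="all"`, and fills each column with a different value by passing a dict to `fillna`. The fill values are arbitrary.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({"A": [1.0, np.nan, 3.0], "E": [1.0, np.nan, np.nan]})

kept = df_na.dropna(subset=["A"])             # drop rows only where A is missing
all_missing = df_na.dropna(how="all")         # drop rows where every value is missing
filled = df_na.fillna({"A": 0.0, "E": -1.0})  # a different fill value per column

print(len(df_na), len(kept), len(all_missing))
print(filled)
```

None of these calls modifies `df_na`; each returns a new DataFrame.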

Operations

Arithmetic

df1 = pd.DataFrame(np.random.randn(2, 5))
print(df1)
print("----------")
df2 = pd.DataFrame(np.random.randn(3, 4))
print(df2)
print("----------")
print("df1+df2\n", df1 + df2)
print("----------")
print("df1-df2\n", df1 - df2)
print("----------")
print("df1*df2\n", df1 * df2)
print("----------")
print("df1/df2\n", df1 / df2)
          0         1         2         3         4
0  0.841797  0.269025  2.514216 -1.896455  0.479344
1  1.352019 -0.467855 -0.754422  0.242742  1.465160

          0         1         2         3
0  1.403584  0.082813  0.469722 -0.534022
1  0.320259  0.331209  0.291575  0.477501
2  1.226319 -1.297662 -0.540806 -0.050445

df1+df2
           0         1         2         3   4
0  2.245381  0.351839  2.983938 -2.430477 NaN
1  1.672278 -0.136645 -0.462847  0.720243 NaN
2       NaN       NaN       NaN       NaN NaN

df1-df2
           0         1         2         3   4
0 -0.561788  0.186212  2.044494 -1.362433 NaN
1  1.031759 -0.799064 -1.045997 -0.234759 NaN
2       NaN       NaN       NaN       NaN NaN

df1*df2
           0         1         2         3   4
0  1.181533  0.022279  1.180982  1.012749 NaN
1  0.432996 -0.154958 -0.219971  0.115910 NaN
2       NaN       NaN       NaN       NaN NaN

df1/df2
           0         1         2         3   4
0  0.599748  3.248576  5.352562  3.551266 NaN
1  4.221638 -1.412564 -2.587399  0.508359 NaN
2       NaN       NaN       NaN       NaN NaN
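The NaN values above appear wherever a row or column label exists in only one of the two frames. When that is not the desired behavior, the method forms of the operators accept a `fill_value`. A minimal sketch with made-up frames `a` and `b`:

```python
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0]}, index=[0, 1])
b = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=[0, 1, 2])

plain = a + b                    # row 2 exists only in b, so the result is NaN there
filled = a.add(b, fill_value=0)  # the missing side is treated as 0 before adding

print(plain)
print(filled)
```

A cell that is missing in both frames still comes out as NaN, since there is nothing to add the fill value to.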

Comparison

print("df1 equal to df2\n", df1.eq(df2))
print("----------")
print("df1 not equal to df2\n", df1.ne(df2))
print("----------")
print("df1 greater than df2\n", df1.gt(df2))  # greater than
print("----------")
print("df1 less than df2\n", df1.lt(df2))  # less than
print("----------")
print("df1 greater than or equal to df2\n", df1.ge(df2))
print("----------")
print("df1 less than or equal to df2\n", df1.le(df2))
df1 equal to df2
        0      1      2      3      4
0  False  False  False  False  False
1  False  False  False  False  False
2  False  False  False  False  False
----------
df1 not equal to df2
       0     1     2     3     4
0  True  True  True  True  True
1  True  True  True  True  True
2  True  True  True  True  True
----------
df1 greater than df2
        0      1      2      3      4
0  False   True   True  False  False
1   True  False  False  False  False
2  False  False  False  False  False
----------
df1 less than df2
        0      1      2      3      4
0   True  False  False   True  False
1  False   True   True   True  False
2  False  False  False  False  False
----------
df1 greater than or equal to df2
        0      1      2      3      4
0  False   True   True  False  False
1   True  False  False  False  False
2  False  False  False  False  False
----------
df1 less than or equal to df2
        0      1      2      3      4
0   True  False  False   True  False
1  False   True   True   True  False
2  False  False  False  False  False

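Statistics

Function   Description
count      Number of non-null values
sum        Sum
mean       Mean
mad        Mean absolute deviation (removed in pandas 2.0)
median     Arithmetic median
min        Minimum
max        Maximum
mode       Mode (most frequent value)
abs        Absolute value
prod       Product
std        Sample standard deviation
var        Unbiased variance
sem        Standard error of the mean
skew       Sample skewness (third moment)
kurt       Sample kurtosis (fourth moment)
quantile   Sample quantile (value at a given %)
cumsum     Cumulative sum
cumprod    Cumulative product
cummax     Cumulative maximum
cummin     Cumulative minimum
print(df1)
# Mean of each column
print(df1.mean())
# Cumulative sum down each column; the last row holds the column totals
print(df1.cumsum())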
          0         1         2         3         4
0  0.841797  0.269025  2.514216 -1.896455  0.479344
1  1.352019 -0.467855 -0.754422  0.242742  1.465160
0    1.096908
1   -0.099415
2    0.879897
3   -0.826857
4    0.972252
dtype: float64
          0         1         2         3         4
0  0.841797  0.269025  2.514216 -1.896455  0.479344
1  2.193815 -0.198829  1.759794 -1.653713  1.944505
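Two small additions to the statistics above, sketched on a made-up frame (`df_s`): most reductions take an `axis` argument, and the mean absolute deviation, which recent pandas versions no longer provide as `mad()`, can be computed by hand from `mean` and `abs`.

```python
import pandas as pd

df_s = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 12.0]})

col_mean = df_s.mean()        # one value per column (axis=0 is the default)
row_mean = df_s.mean(axis=1)  # one value per row

# Mean absolute deviation around the column mean, computed by hand
mad = (df_s - df_s.mean()).abs().mean()

print(col_mean)
print(row_mean)
print(mad)
```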

Combining with concat

pd.concat([df1, df2])
0 1 2 3 4
0 0.841797 0.269025 2.514216 -1.896455 0.479344
1 1.352019 -0.467855 -0.754422 0.242742 1.465160
0 1.403584 0.082813 0.469722 -0.534022 NaN
1 0.320259 0.331209 0.291575 0.477501 NaN
2 1.226319 -1.297662 -0.540806 -0.050445 NaN
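Note that the result above keeps the original row labels, so 0 and 1 each appear twice. The sketch below, using made-up frames `top` and `bottom`, shows two common variations: renumbering the rows with `ignore_index=True`, and placing the frames side by side with `axis=1`.

```python
import pandas as pd

top = pd.DataFrame({"x": [1, 2]})
bottom = pd.DataFrame({"x": [3, 4, 5]})

stacked = pd.concat([top, bottom])                        # index: 0, 1, 0, 1, 2
renumbered = pd.concat([top, bottom], ignore_index=True)  # index: 0, 1, 2, 3, 4
side_by_side = pd.concat([top, bottom], axis=1)           # aligned on the row index

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(side_by_side)
```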

Joining (merge)

left = pd.DataFrame({'key': ['foo', 'foo', 'bar'], 'lval': [1, 2, 2]})
print(left)
right = pd.DataFrame({'key': ['foo', 'foo', 'bar'], 'rval': [3, 4, 5]})
print(right)
pd.merge(left, right, on='key')
   key  lval
0  foo     1
1  foo     2
2  bar     2
   key  rval
0  foo     3
1  foo     4
2  bar     5
key lval rval
0 foo 1 3
1 foo 1 4
2 foo 2 3
3 foo 2 4
4 bar 2 5
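In the result above, 'foo' appears twice on each side, so the merge produces every pairing (2 x 2 = 4 rows) plus the single 'bar' row. The `how` parameter controls which keys survive. The sketch below uses made-up frames with unique keys so that the effect of `how` is easy to see; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar", "baz"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["foo", "bar", "qux"], "rval": [4, 5, 6]})

inner = pd.merge(left, right, on="key")                   # keys present in both
left_join = pd.merge(left, right, on="key", how="left")   # every row of left is kept
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

print(inner)
print(left_join)
print(outer)
```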

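Appending rows: DataFrame.append was removed in pandas 2.0; use concat instead

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
print(df)
s = df.iloc[3]
print(s)
# df.append(s, ignore_index=True)  # removed in pandas 2.0; calling it raises the AttributeError shown below
pd.concat([df, s.to_frame().T], ignore_index=True)  # append the row s via concat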
          A         B         C         D
0  0.277439 -0.363739  1.026139  1.614032
1 -1.595705  1.259329 -1.062648  0.186739
2 -0.639943  0.402358  0.110181  0.180963
3 -0.621929  0.401519 -0.975065 -1.001928
4 -3.077506  1.075743 -0.544791  2.573899
5  2.038906  0.301643 -0.920341  1.700568
6  1.679596  0.642480 -0.688277  0.447207
7 -1.582690 -0.033994 -1.513041  1.009212
A   -0.621929
B    0.401519
C   -0.975065
D   -1.001928
Name: 3, dtype: float64



---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

Cell In[112], line 5
      3 s=df.iloc[3]
      4 print(s)
----> 5 df.append(s,ignore_index=True)# no longer available


File D:\Python\Miniconda3\miniconda3\envs\p2s\lib\site-packages\pandas\core\generic.py:6204, in NDFrame.__getattr__(self, name)
   6197 if (
   6198     name not in self._internal_names_set
   6199     and name not in self._metadata
   6200     and name not in self._accessors
   6201     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6202 ):
   6203     return self[name]
-> 6204 return object.__getattribute__(self, name)


AttributeError: 'DataFrame' object has no attribute 'append'

Grouping with groupby

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'two', 'three', 'four', 'one', 'two', 'three', 'four'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
})
print(df)
df.groupby('A').sum()
     A      B         C         D
0  foo    one  0.189163 -0.806318
1  bar    two  2.407904 -0.095191
2  foo  three  0.074226  0.293678
3  bar   four  0.546407  0.809018
4  foo    one -0.450351 -0.499252
5  bar    two -1.720710  0.931944
6  foo  three  0.695358  0.185726
7  bar   four  2.094319  1.072681
                    B         C         D
A
bar    twofourtwofour  3.327920  2.718452
foo  onethreeonethree  0.508397 -0.826166
df.groupby(['A', 'B']).sum()
                  C         D
A   B
bar four  -1.064164  0.872241
    two   -0.608013  0.091348
foo one    3.065079  1.367708
    three  1.380634 -1.439255
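When a group is summed, a text column such as B is concatenated rather than added up. If only numeric totals are wanted, `numeric_only=True` leaves such columns out, and `agg` computes several statistics in one call. A minimal sketch on made-up data (`df_g`):

```python
import pandas as pd

df_g = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "two", "three", "four"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

totals = df_g.groupby("A").sum(numeric_only=True)      # text column B is left out
summary = df_g.groupby("A")["C"].agg(["sum", "mean"])  # several statistics at once

print(totals)
print(summary)
```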

Pivot tables

df = pd.DataFrame({
    'A': ['one', 'two', 'three', 'four'] * 3,
    'B': ['A', 'B', 'C'] * 4,
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
    'D': np.random.randn(12),
    'E': np.random.randn(12)
})
print(df)
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
        A  B    C         D         E
0     one  A  foo -1.546985  0.590130
1     two  B  foo  1.732416 -2.586836
2   three  C  foo -1.105102 -0.858783
3    four  A  bar -0.732389 -0.475518
4     one  B  bar  0.533465  0.736240
5     two  C  bar  0.197097 -1.329285
6   three  A  foo  0.495105  0.426743
7    four  B  foo -0.611343 -0.204255
8     one  C  foo -0.678577 -2.013504
9     two  A  bar -0.538301 -1.216611
10  three  B  bar -1.503484 -0.199938
11   four  C  bar  1.323900  0.883130
C             bar       foo
A     B
four  A -0.732389       NaN
      B       NaN -0.611343
      C  1.323900       NaN
one   A       NaN -1.546985
      B  0.533465       NaN
      C       NaN -0.678577
three A       NaN  0.495105
      B -1.503484       NaN
      C       NaN -1.105102
two   A -0.538301       NaN
      B       NaN  1.732416
      C  0.197097       NaN
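`pivot_table` aggregates with the mean by default, and combinations that never occur in the data show up as NaN, as in the table above. Both behaviors can be changed with `aggfunc` and `fill_value`. The sketch below uses a small made-up frame (`df_p`) whose results can be verified by hand.

```python
import pandas as pd

df_p = pd.DataFrame({
    "A": ["one", "one", "two", "two", "one"],
    "C": ["foo", "bar", "foo", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 5.0, 3.0],
})

# Default aggregation is the mean; the missing (two, bar) cell is NaN
mean_table = pd.pivot_table(df_p, values="D", index="A", columns="C")

# Sum instead of mean, and show 0 where a combination has no data
sum_table = pd.pivot_table(df_p, values="D", index="A", columns="C",
                           aggfunc="sum", fill_value=0)

print(mean_table)
print(sum_table)
```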

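Time series

# Resample second-level data to a 5-minute frequency
rng = pd.date_range('1/1/2001', periods=100, freq='s')  # lowercase 's'; the uppercase 'S' alias is deprecated since pandas 2.2
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
print(ts)
# Aggregate into 5-minute bins; the data spans less than 5 minutes, so everything falls into a single bin
ts.resample('5min').sum()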
2001-01-01 00:00:00    492
2001-01-01 00:00:01     12
2001-01-01 00:00:02    154
2001-01-01 00:00:03    100
2001-01-01 00:00:04    184
                      ... 
2001-01-01 00:01:35    164
2001-01-01 00:01:36    118
2001-01-01 00:01:37    245
2001-01-01 00:01:38    288
2001-01-01 00:01:39     20
Freq: S, Length: 100, dtype: int32





2001-01-01    26724
Freq: 5T, dtype: int32
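With only 100 seconds of data, a 5-minute bin hides how resampling splits a series. The sketch below uses a 30-second bin and a series of ones (`ts_demo`, a made-up name) so that each bin's sum is simply the number of samples that fell into it.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2001-01-01", periods=100, freq="s")
ts_demo = pd.Series(np.ones(100, dtype="int64"), index=idx)

# 100 one-second samples fall into three full 30-second bins and one partial bin
per_30s = ts_demo.resample("30s").sum()
print(per_30s)
```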
# Time zone handling: localize to UTC, then convert to another zone
rng = pd.date_range('3/6/2001 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
print(ts)
ts_utc = ts.tz_localize('UTC')
print(ts_utc)
ts_utc.tz_convert('US/Eastern')
2001-03-06    0.029706
2001-03-07    1.486630
2001-03-08    1.418271
2001-03-09    0.120030
2001-03-10    0.579539
Freq: D, dtype: float64
2001-03-06 00:00:00+00:00    0.029706
2001-03-07 00:00:00+00:00    1.486630
2001-03-08 00:00:00+00:00    1.418271
2001-03-09 00:00:00+00:00    0.120030
2001-03-10 00:00:00+00:00    0.579539
Freq: D, dtype: float64





2001-03-05 19:00:00-05:00    0.029706
2001-03-06 19:00:00-05:00    1.486630
2001-03-07 19:00:00-05:00    1.418271
2001-03-08 19:00:00-05:00    0.120030
2001-03-09 19:00:00-05:00    0.579539
Freq: D, dtype: float64
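# Converting between timestamps and periods
rng = pd.date_range('1/1/2012', periods=5, freq='M')  # 'M' = month end; pandas 2.2+ prefers the alias 'ME'
ts = pd.Series(np.random.randn(len(rng)), rng)
print(ts)
# to_period converts the timestamps to periods of a given frequency (monthly here, inferred from the index)
ps = ts.to_period()
print(ps)
ps.to_timestamp()
# In the output, freq = "M" means a monthly series anchored at month end, while freq = "MS" is anchored at month start.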
2012-01-31   -0.892879
2012-02-29   -0.340107
2012-03-31    0.813457
2012-04-30    2.199679
2012-05-31    2.256466
Freq: M, dtype: float64
2012-01   -0.892879
2012-02   -0.340107
2012-03    0.813457
2012-04    2.199679
2012-05    2.256466
Freq: M, dtype: float64





2012-01-01   -0.892879
2012-02-01   -0.340107
2012-03-01    0.813457
2012-04-01    2.199679
2012-05-01    2.256466
Freq: MS, dtype: float64

Visualization

ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('1/1/2001', periods=1000))
print(ts)
ts1 = ts.cumsum()
print(ts1)
ts1.plot()
2001-01-01    0.309942
2001-01-02    0.850757
2001-01-03   -0.798396
2001-01-04    0.297484
2001-01-05   -0.592258
                ...   
2003-09-23   -0.679936
2003-09-24    1.986236
2003-09-25    0.665965
2003-09-26   -0.479215
2003-09-27    0.731958
Freq: D, Length: 1000, dtype: float64
2001-01-01     0.309942
2001-01-02     1.160698
2001-01-03     0.362303
2001-01-04     0.659787
2001-01-05     0.067529
                ...    
2003-09-23    87.444694
2003-09-24    89.430929
2003-09-25    90.096894
2003-09-26    89.617680
2003-09-27    90.349637
Freq: D, Length: 1000, dtype: float64


[Figure: line plot of the cumulative sum ts1]

import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(1000, 4),
                  index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best')

[Figure: line plot of the cumulative sums of columns A, B, C, D]

Data input/output

CSV

df.to_csv('foo.csv')
pd.read_csv('foo.csv')
Unnamed: 0 A B C D
0 2001-01-01 -1.992527 -1.719912 -0.231028 -0.493246
1 2001-01-02 -1.568642 -2.505710 0.815862 -2.581479
2 2001-01-03 -2.122416 -2.599320 -0.761854 -3.090637
3 2001-01-04 -4.027765 -2.955616 -0.219192 -2.320385
4 2001-01-05 -6.431664 -2.809861 -3.017848 -2.569958
... ... ... ... ... ...
995 2003-09-23 3.825476 10.674352 37.139296 9.314710
996 2003-09-24 3.565270 9.656261 37.906229 9.560828
997 2003-09-25 2.923609 10.739928 38.297159 8.431035
998 2003-09-26 4.654014 10.587138 39.214216 8.264657
999 2003-09-27 4.539063 10.066622 40.246983 10.362177

1000 rows × 5 columns
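In the output above, the dates come back as an ordinary column named "Unnamed: 0" because `read_csv` does not know that the first column was the index. One way to restore it is to pass `index_col=0` and `parse_dates=True`. The sketch below uses an in-memory buffer in place of `foo.csv` so that it leaves no file behind; `df_io` is a made-up two-row frame.

```python
import io
import pandas as pd

df_io = pd.DataFrame(
    {"A": [1.5, 2.5]},
    index=pd.date_range("2001-01-01", periods=2),
)

buffer = io.StringIO()  # in-memory stand-in for a file such as foo.csv
df_io.to_csv(buffer)
buffer.seek(0)

# Use the first column as the index and parse it as dates
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored)
```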

HDF5

# Notebook shell command; HDF5 support needs the PyTables package
!pip install tables
df.to_hdf('foo.h5', key='df')
pd.read_hdf('foo.h5', key='df')
A B C D
2001-01-01 -1.992527 -1.719912 -0.231028 -0.493246
2001-01-02 -1.568642 -2.505710 0.815862 -2.581479
2001-01-03 -2.122416 -2.599320 -0.761854 -3.090637
2001-01-04 -4.027765 -2.955616 -0.219192 -2.320385
2001-01-05 -6.431664 -2.809861 -3.017848 -2.569958
... ... ... ... ...
2003-09-23 3.825476 10.674352 37.139296 9.314710
2003-09-24 3.565270 9.656261 37.906229 9.560828
2003-09-25 2.923609 10.739928 38.297159 8.431035
2003-09-26 4.654014 10.587138 39.214216 8.264657
2003-09-27 4.539063 10.066622 40.246983 10.362177

1000 rows × 4 columns

Excel

# Notebook shell command; writing and reading .xlsx files needs openpyxl
!pip install openpyxl
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
Unnamed: 0 A B C D
0 2001-01-01 -1.992527 -1.719912 -0.231028 -0.493246
1 2001-01-02 -1.568642 -2.505710 0.815862 -2.581479
2 2001-01-03 -2.122416 -2.599320 -0.761854 -3.090637
3 2001-01-04 -4.027765 -2.955616 -0.219192 -2.320385
4 2001-01-05 -6.431664 -2.809861 -3.017848 -2.569958
... ... ... ... ... ...
995 2003-09-23 3.825476 10.674352 37.139296 9.314710
996 2003-09-24 3.565270 9.656261 37.906229 9.560828
997 2003-09-25 2.923609 10.739928 38.297159 8.431035
998 2003-09-26 4.654014 10.587138 39.214216 8.264657
999 2003-09-27 4.539063 10.066622 40.246983 10.362177

1000 rows × 5 columns

For a more detailed tutorial, see
https://www.runoob.com/pandas/pandas-tutorial.html

Congratulations!

Recorded by: ZL-Qin
