Python Scientific Computing Library pandas: A First Look at pandas

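pandas provides three kinds of objects: the one-dimensional Series, the two-dimensional DataFrame, and the three-dimensional Panel (Panel has been removed in later pandas releases). Environment used here: Jupyter with Python 3.5. This is only a first look, so the difficulty is very low.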

Honestly, even the overview alone turns out to be a lot of material, so I will stop here for now and continue another day.

Creating objects

Creating a Series

In [2]:
import pandas as pd
import numpy as np

In [3]:
s = pd.Series([1,3,5,np.nan,6,8])

In [4]:
s
Out[4]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
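
The float64 dtype in the output above comes from the missing value: np.nan is a floating-point value, so a list that mixes integers with np.nan is stored as floats. A minimal sketch of that behaviour, with illustrative variable names that are not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Without a missing value, a list of integers keeps an integer dtype.
ints = pd.Series([1, 3, 5])

# np.nan is a float, so the whole Series is upcast to float64.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(ints.dtype)
print(with_nan.dtype)
print(with_nan.isna().sum())  # number of missing entries
```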

Creating a DataFrame

In [5]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
Out[5]:
                   A         B         C         D
2013-01-01  0.047283  0.956434 -0.101887 -0.431993
2013-01-02 -0.992626  0.176542  0.505918  0.094845
2013-01-03 -0.329624 -0.059020 -0.732350 -0.098737
2013-01-04 -0.578537  0.674577  0.872515 -0.600548
2013-01-05  0.625189  0.391360  0.607619 -0.797956
2013-01-06  1.230331 -0.175162 -0.224363  1.387679

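Note: the constructor here takes a two-dimensional NumPy array as the data, a datetime index as the row labels, and a list of column labels.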
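
To make the three constructor inputs easier to tell apart, here is a small sketch that passes them by keyword. The names frame and data are only illustrative, and the values are random, so they will differ from the output shown earlier.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)  # row labels: six consecutive days
data = np.random.randn(6, 4)                  # 6 x 4 array of random values

# data -> cell values, index -> row labels, columns -> column labels
frame = pd.DataFrame(data=data, index=dates, columns=["A", "B", "C", "D"])

print(frame.shape)
print(frame.index[0], frame.index[-1])
print(list(frame.columns))
```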

Creating a DataFrame from a dict

In [6]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2
Out[6]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

Note: array-like values must all have the same length; scalar values need not, since they are repeated for every row.
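
The length rule can be checked with a short sketch. The column names X and Y and the flag mismatch_raised are made up for illustration; the exact wording of the error message may vary between pandas versions.

```python
import numpy as np
import pandas as pd

# A scalar ('A') is repeated to match the length of the array column ('D').
ok = pd.DataFrame({"A": 1.0, "D": np.array([3] * 4, dtype="int32")})
print(ok.shape)
print(ok.dtypes)

# Two array-likes of different lengths cannot be combined into one frame.
try:
    pd.DataFrame({"X": [1, 2, 3], "Y": [1, 2]})
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print("ValueError:", exc)
```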

Viewing data

head(): view the first rows of the data

In [17]:
df.head(2)
Out[17]:
                   A         B         C         D
2013-01-01  0.047283  0.956434 -0.101887 -0.431993
2013-01-02 -0.992626  0.176542  0.505918  0.094845

tail(): view the last rows of the data (five rows when no count is given)

In [18]:
df.tail()
Out[18]:
                   A         B         C         D
2013-01-02 -0.992626  0.176542  0.505918  0.094845
2013-01-03 -0.329624 -0.059020 -0.732350 -0.098737
2013-01-04 -0.578537  0.674577  0.872515 -0.600548
2013-01-05  0.625189  0.391360  0.607619 -0.797956
2013-01-06  1.230331 -0.175162 -0.224363  1.387679
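
Both head() and tail() accept an optional row count and fall back to five rows when it is omitted, which is why tail() above starts at 2013-01-02. A quick sketch on a freshly generated frame (the name frame is illustrative and the values are random):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

first_two = frame.head(2)   # explicit row count
last_five = frame.tail()    # no argument: default of five rows
last_three = frame.tail(3)

print(len(first_two), len(last_five), len(last_three))
print(last_five.index[0])   # first of the last five dates
```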

index: view the row index

In [20]:
df.index
Out[20]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

values: view the underlying values as a NumPy array

In [21]:
df.values
Out[21]:
array([[ 0.04728322,  0.95643357, -0.10188749, -0.43199346],
       [-0.99262552,  0.1765424 ,  0.50591781,  0.09484533],
       [-0.32962395, -0.05902012, -0.73235045, -0.09873698],
       [-0.57853707,  0.67457668,  0.87251535, -0.60054814],
       [ 0.62518934,  0.39136014,  0.60761926, -0.79795564],
       [ 1.23033145, -0.17516244, -0.22436309,  1.38767862]])
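
As a side note, newer pandas releases also offer the to_numpy() method for the same purpose; it may not exist in the older version used in this notebook. A small sketch with a made-up frame, assuming a recent pandas:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["A", "B"])

arr = frame.values        # attribute used in the text above
arr2 = frame.to_numpy()   # method available in newer pandas versions

print(type(arr), arr.shape)
print(np.array_equal(arr, arr2))
```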

Tip: in Jupyter, type df. and press Tab to list the attributes and methods of a DataFrame.

describe(): view summary statistics of the data

In [22]:
df.describe()
Out[22]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.000336  0.327455  0.154575 -0.074452
std    0.816759  0.434731  0.606652  0.786784
min   -0.992626 -0.175162 -0.732350 -0.797956
25%   -0.516309 -0.000129 -0.193744 -0.558409
50%   -0.141170  0.283951  0.202015 -0.265365
75%    0.480713  0.603773  0.582194  0.046450
max    1.230331  0.956434  0.872515  1.387679
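
The result of describe() is itself a DataFrame, with rows for count, mean, std, min, the quartiles, and max, so individual statistics can be looked up by label. A minimal sketch on a tiny made-up frame (not the data above):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                      "B": [10.0, 20.0, 30.0, 40.0]})
summary = frame.describe()

print(summary.index.tolist())                         # names of the statistics
print(summary.loc["mean", "A"], frame["A"].mean())    # same value both ways
print(summary.loc["50%", "B"], frame["B"].median())   # 50% row is the median
```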

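Selection

Missing data

In statistics there are two main ways to handle missing values: drop the cases (rows) that contain a missing value, or replace the missing values with a default value.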
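
In pandas these two approaches roughly correspond to dropna() and fillna(). A minimal sketch on a small made-up frame; the fill value 0.0 is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                      "B": [4.0, 5.0, np.nan]})

dropped = frame.dropna(how="any")   # option 1: drop rows with any missing value
filled = frame.fillna(value=0.0)    # option 2: replace missing values with a default

print(dropped)
print(filled)
print(frame.isna().sum())           # missing count per column in the original
```

Neither call modifies frame; each returns a new DataFrame.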

Operations

Merging objects

Grouping

Reshaping

Time series

