To install pandas on Windows, you need pandas itself plus the packages it depends on. The following packages are required:
pyparsing-2.0.7-py2.py3-none-any.whl
matplotlib-1.5.1-cp27-none-win32.whl
openpyxl-2.3.2-py2.py3-none-any.whl
setuptools-19.2-py2.py3-none-any.whl
numpy-1.10.4+mkl-cp27-none-win32.whl
six-1.10.0-py2.py3-none-any.whl
python_dateutil-2.4.2-py2.py3-none-any.whl
These packages can be downloaded from:
http://www.lfd.uci.edu/~gohlke/pythonlibs
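If you cannot find the exact file listed above, any build of the same module will do (choose one that matches your Python version and architecture), or install the module by name:
pip install <module name>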
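After installing, it can help to confirm that every module actually imports. The following is a minimal sketch, not part of the original notes: the helper name check_modules and the MODULES list are hypothetical, and the entries are the import names that correspond to the wheel files listed above (for example python_dateutil is imported as dateutil).

```python
import importlib

# Import names (not wheel file names) for the packages listed above.
MODULES = ["pyparsing", "matplotlib", "openpyxl", "setuptools",
           "numpy", "six", "dateutil", "pandas"]


def check_modules(names):
    """Return {name: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    for name, version in sorted(check_modules(MODULES).items()):
        print("%-12s %s" % (name, version or "NOT INSTALLED"))
```

Any module reported as NOT INSTALLED can then be installed with pip as described above.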
****************************************************************************************************************
numpy
import numpy as np

In [16]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
In [17]: arr2 = np.array(data2)
In [18]: arr2
Out[18]:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Inspecting the number of dimensions, the shape, and the data type (dtype)
In [19]: arr2.ndim
Out[19]: 2
In [20]: arr2.shape
Out[20]: (2, 4)
In [21]: arr2.dtype

Changing the dtype (np.string_ is the name used by the NumPy 1.x release listed above; NumPy 2.0 removed it in favor of np.bytes_)
In [38]: numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)
In [39]: numeric_strings.astype(float)
Out[39]: array([ 1.25, -9.6 , 42. ])

A slice is a view, so assigning to it changes the original array (judging from the output, arr here is np.arange(10) after arr[5:8] = 12)
In [57]: arr_slice = arr[5:8]
In [58]: arr_slice[1] = 12345
In [59]: arr
Out[59]: array([ 0, 1, 2, 3, 4, 12, 12345, 12, 8, 9])

In [140]: xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
In [141]: yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
In [142]: cond = np.array([True, False, True, True, False])

Suppose we want to take a value from xarr wherever the corresponding value in cond is True, and from yarr otherwise
In [145]: result = np.where(cond, xarr, yarr)
In [146]: result
Out[146]: array([ 1.1, 2.2, 1.3, 1.4, 2.5])

The second and third arguments of np.where do not have to be arrays; either or both can be scalars. In data analysis, where is typically used to build a new array whose values depend on another array. Suppose you have a matrix of randomly generated data and want to replace every positive value with 2 and every negative value with -2. This is easy with np.where:
In [147]: arr = randn(4, 4)   # requires: from numpy.random import randn
In [148]: arr
Out[148]:
array([[ 0.6372,  2.2043,  1.7904,  0.0752],
       [-1.5926, -1.1536,  0.4413,  0.3483],
       [-0.1798,  0.3299,  0.7827, -0.7585],
       [ 0.5857,  0.1619,  1.3583, -1.3865]])
In [149]: np.where(arr > 0, 2, -2)
Out[149]:
array([[ 2,  2,  2,  2],
       [-2, -2,  2,  2],
       [-2,  2,  2, -2],
       [ 2,  2,  2, -2]])
In [150]: np.where(arr > 0, 2, arr)   # set only the positive values to 2
Out[150]:
array([[ 2.    ,  2.    ,  2.    ,  2.    ],
       [-1.5926, -1.1536,  2.    ,  2.    ],
       [-0.1798,  2.    ,  2.    , -0.7585],
       [ 2.    ,  2.    ,  2.    , -1.3865]])
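The slice and np.where behaviour described above can be reproduced in one self-contained script. This is a small sketch under the assumption that arr starts as np.arange(10); the names view, copied, picked and signs are illustrative only.

```python
import numpy as np

arr = np.arange(10)
arr[5:8] = 12            # broadcast a scalar into a slice
view = arr[5:8]          # a slice is a view that shares memory with arr
view[1] = 12345          # so this write is visible through arr as well

copied = arr[5:8].copy() # an explicit copy is independent of arr
copied[:] = -1           # arr is unaffected by this assignment

cond = np.array([True, False, True, True, False])
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])

picked = np.where(cond, xarr, yarr)    # element-wise choice between two arrays
signs = np.where(xarr > 1.25, 2, -2)   # both branches given as scalars

print(arr)
print(picked)
print(signs)
```

Calling .copy() is the usual way to avoid modifying the original array when a slice needs to be changed independently.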
pandas---usage
In [1]: from pandas import Series, DataFrame
In [2]: import pandas as pd
Series
Creating from a list (the index must have the same length as the list)
In [8]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
Creating from a dict (the index may be a subset of the dict's keys)
In [20]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [21]: obj3 = Series(sdata, index=['Ohio', 'Texas'])
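The two construction rules above can be checked directly. The sketch below is illustrative: the label 'California' and the names obj4 and length_error are assumptions added for the example. It shows that an index label missing from the dict yields NaN, while a list whose length differs from the index is rejected.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# The index selects (and orders) keys; a label absent from the dict becomes NaN.
states = ['California', 'Ohio', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)

# With a plain list, the index length must match the data length.
try:
    pd.Series([4, 7, -5, 3], index=['a', 'b'])
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)
```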
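Selection
In [11]: obj2['a']
Out[11]: -5
In [12]: obj2['d'] = 6
In [13]: obj2[['c', 'a', 'd']]   # pass a list of labels to select several items
In [14]: obj2['d':'a']           # a label slice includes its end point, so it returns one more item than a list slice; labels follow index order (d, b, a, c)
In [15]: obj2[obj2 > 0]
In [16]: obj2 * 2
In [17]: np.exp(obj2)
In [18]: 'b' in obj2             # behaves like a dict
Out[18]: True
In [19]: 'e' in obj2
Out[19]: False
In [69]: for i in obj3:
   ....:     print(i)
   ....:
35000                            # iteration yields the values, not the index labels
71000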
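The selection forms above can be compared side by side. This is a minimal sketch using the obj2 Series from the text; the variable names are illustrative. It also shows one way to iterate over labels, or over label/value pairs, since plain iteration only yields values.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

by_list = obj2[['c', 'a', 'd']]   # list of labels: result follows the list order
by_slice = obj2['d':'a']          # label slice: rows d, b and a (end point included)
positive = obj2[obj2 > 0]         # boolean mask keeps the rows where it is True

values = list(obj2)               # plain iteration yields values
labels = list(obj2.index)         # iterate over the index to get labels
pairs = list(obj2.items())        # (label, value) pairs

print(by_slice)
print(pairs)
```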
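Changing the index
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']   # the index can only be replaced as a whole list; assigning a single label, e.g. obj.index[1] = 'b', is not allowed

Changing the index length and values (as the output shows, obj here is a Series whose labels include 'a' through 'd')
In [81]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [82]: obj2
Out[82]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
In [83]: obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
Out[83]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0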
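The difference between replacing the index and reindexing can be sketched as follows. The starting values of obj are an assumption chosen to match the output shown above; the names renamed, reindexed and filled are illustrative.

```python
import pandas as pd

# Assumed starting data, chosen to be consistent with the reindex output above.
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Replacing the index relabels the existing values in place.
renamed = obj.copy()
renamed.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

# An Index is immutable, so a single label cannot be assigned.
try:
    renamed.index[1] = 'b'
    index_is_mutable = True
except TypeError:
    index_is_mutable = False

# reindex returns a new Series aligned to the requested labels.
reindexed = obj.reindex(['a', 'b', 'c', 'd', 'e'])             # 'e' becomes NaN
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)  # 'e' becomes 0

print(reindexed)
print(filled)
```

Note that reindex leaves obj itself unchanged; only the returned Series has the new labels.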
DataFrame
Creating from a dict of lists (the index must have the same length as the lists; columns may name more or fewer columns than the dict has keys)
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
In [40]: frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
   ....:                    index=['one', 'two', 'three', 'four', 'five'])
In [41]: frame2
Out[41]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
Creating from a nested dict (the outer keys become the columns and the inner keys become the index)
In [57]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ....:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [58]: frame3 = DataFrame(pop)
In [59]: frame3
Out[59]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
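As a quick check of how a nested dict is laid out, the sketch below builds the same frame and then passes an explicit index. The year 2003 and the name subset are assumptions added for illustration. The row order of the default index can differ between pandas versions, so the example sorts it explicitly.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys become columns; the union of the inner keys becomes the index.
frame3 = pd.DataFrame(pop).sort_index()
print(frame3)

# An explicit index selects the inner keys to keep; unknown labels give NaN rows.
subset = pd.DataFrame(pop, index=[2001, 2002, 2003])
print(subset)
```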
Selection:
Note: .ix was deprecated in pandas 0.20 and removed in pandas 1.0; on current versions use .loc for labels and .iloc for positions. Some outputs below were captured after frame2 had been modified during the session, so a few values differ from the frame shown above.

Selecting by column
In [43]: frame2['state']
In [44]: frame2.year
frame2[['state', 'year']]

Selecting by index
In [45]: frame2.ix['three']
frame2[:2]   # the first two rows
In[106]: frame2.ix[['one', 'two']]

Rows 'one' and 'two', first two columns
In[105]: frame2.ix[['one', 'two'], :2]
Out[105]:
     year state
one  2000  Ohio
two  2001  Ohio

Rows 'one' and 'two', column 'pop'
In[107]: frame2.ix[['one', 'two'], 'pop']

Rows 'one' and 'two', columns at positions 1, 2, 3
In[108]: frame2.ix[['one', 'two'], [1, 2, 3]]
In[109]: frame2.ix[['one', 'two'], ['state', 'pop', 'debt']]
Out[108/109]:
    state  pop debt
one  Ohio  1.5    0
two  Ohio  1.7    0
In[110]: frame2.ix['three', ['state', 'pop', 'debt']]

Rows 'one' through 'three' (the label slice includes 'three'), columns at positions 1, 2, 3
In[111]: frame2.ix[:'three', ['state', 'pop', 'debt']]
Out[111]:
      state  pop debt
one    Ohio  1.5    0
two    Ohio  1.7    0
three     0  0.0    0
In[104]: frame2.ix[frame2['pop'] > 0, :2]
Out[104]:
      year   state
one   2000    Ohio
two   2001    Ohio
four  2001  Nevada

Chained selection:
frame3['Nevada'][:2]

Dropping rows or columns with a method (frame3 itself is not modified)
In[88]: frame3.drop(2000)
Out[88]:
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
In[89]: frame3.drop('Ohio', axis=1)
Out[89]:
      Nevada
2000     NaN
2001     2.4
2002     2.9

Selecting by condition
In[94]: frame2[frame2['year'] > 2001]
Out[94]:
       year   state  pop debt
three  2002    Ohio  3.6    0
five   2002  Nevada  2.9    0
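On pandas versions that no longer provide .ix, the same selections can be written with .loc (labels) and .iloc (positions). The following is a sketch of one possible translation, not the original author's code. It rebuilds frame2 from the data shown earlier, and the threshold pop > 2 in the boolean example is an assumption chosen so that the mask filters something.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

row = frame2.loc['three']                                  # one row by label
first_two_cols = frame2.loc[['one', 'two']].iloc[:, :2]    # labels, then positions
by_name = frame2.loc[['one', 'two'], ['state', 'pop', 'debt']]
by_pos = frame2.loc[['one', 'two']].iloc[:, [1, 2, 3]]     # same columns by position
up_to_three = frame2.loc[:'three', ['state', 'pop', 'debt']]  # end label included
masked = frame2.loc[frame2['pop'] > 2, ['year', 'state']]  # boolean row mask

dropped = frame2.drop('one')   # returns a new frame; frame2 keeps row 'one'

print(first_two_cols)
print(masked)
```

Mixing labels and positions in a single indexer was the main feature of .ix; chaining .loc and .iloc as above keeps each step unambiguous.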
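Changing a column:
In [46]: frame2['debt'] = 16.5
In [48]: frame2['debt'] = range(5)

Assigning a Series to a column (values are aligned on the index; rows missing from the Series become NaN)
In [50]: val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
In [51]: frame2['debt'] = val

Conditional assignment (sets every column of the matching rows)
frame2[frame2['year'] > 2001] = 0

Adding a column:
In[46]: frame2['gyf'] = [1, 2, 3, 4, 5]

Deleting a column:
In[70]: del frame2['gyf']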
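Because assigning a Series replaces the whole column and aligns on the index, rows that are absent from the Series end up as NaN. The sketch below contrasts that with an assignment through .loc, which touches only the matching rows. The reduced frame2 and the names aligned and partial are assumptions for illustration.

```python
import pandas as pd

index = ['one', 'two', 'three', 'four', 'five']
frame2 = pd.DataFrame({'year': [2000, 2001, 2002, 2001, 2002],
                       'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}, index=index)
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

# Whole-column assignment aligns on the index; unmatched rows become NaN.
aligned = frame2.copy()
aligned['debt'] = 16.5
aligned['debt'] = val

# To overwrite only the matching rows, assign through .loc instead.
partial = frame2.copy()
partial['debt'] = 0.0
partial.loc[val.index, 'debt'] = val

# Add a column, then remove it again.
partial['gyf'] = [1, 2, 3, 4, 5]
del partial['gyf']

print(aligned)
print(partial)
```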
Changing the index
In [86]: frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
   ....:                   columns=['Ohio', 'Texas', 'California'])
In [87]: frame
Out[87]:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

Changing the row index
In [88]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
In [89]: frame2
Out[89]:
   Ohio  Texas  California
a     0      1           2
b   NaN    NaN         NaN
c     3      4           5
d     6      7           8

Changing the column index:
In [90]: states = ['Texas', 'Utah', 'California']
In [91]: frame.reindex(columns=states)
Out[91]:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
In [92]: frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
   ....:               columns=states)
Out[92]:
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8
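The row and column reindexing steps can also be run as a standalone script. In this sketch the forward fill is applied to the rows only, and the columns are reindexed in a second call. That two-step form is an assumption made here to keep the fill method from being applied to the column labels; it produces the same table as the last output above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

rows = frame.reindex(['a', 'b', 'c', 'd'])   # new row 'b' is all NaN
cols = frame.reindex(columns=states)         # new column 'Utah' is all NaN

# Forward-fill the new row first, then pick and add columns.
filled = frame.reindex(['a', 'b', 'c', 'd'], method='ffill').reindex(columns=states)

print(rows)
print(filled)
```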
Transpose:
In [60]: frame3.T
Forward and backward filling (ffill, bfill)
In [84]: obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [85]: obj3.reindex(range(6), method='ffill')
Out[85]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
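The text shows only the ffill case. As a small sketch of the counterpart, the example below runs both methods on the same Series; the names forward and backward are illustrative. With bfill each gap takes the next valid value, so the last label has nothing to fill from and stays NaN.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = obj3.reindex(range(6), method='ffill')    # carry the previous value forward
backward = obj3.reindex(range(6), method='bfill')   # pull the next value backward

print(forward)
print(backward)
```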
For more information, see the DA & ML series of articles on this blog: http://my.oschina.net/lionets/blog