神剑山庄资源网 Design By www.hcban.com
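This article walks through the usage of pandas, Python's data analysis module, with worked examples. It is shared here for reference; the details follow.
I. Introduction
pandas (Python Data Analysis Library) is a data analysis module built on NumPy. It provides a large set of standard data models along with the tools needed to work efficiently with large datasets, and it is one of the main reasons Python has become an efficient and powerful environment for data analysis.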
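pandas provides three main data structures:
1) Series: a labeled one-dimensional array.
2) DataFrame: a labeled, size-mutable two-dimensional table.
3) Panel: a labeled, size-mutable three-dimensional array. Panel has since been deprecated and removed from recent pandas releases, and the examples below use only Series and DataFrame.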
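As a rough illustration of the structures listed above, the sketch below builds a Series and a DataFrame and checks their dimensionality. The names `prices`, `table` and `cube` are made up for this example, and using a row MultiIndex to stand in for a third dimension is an assumption about how one might replace Panel, not something the article prescribes.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
prices = pd.Series([3.5, 4.0, 2.25], index=["apple", "pear", "plum"])

# A DataFrame: a table whose columns are Series sharing one row index.
table = pd.DataFrame(
    {"price": prices, "stock": pd.Series([10, 0, 7], index=prices.index)}
)

print(prices.ndim, table.ndim)      # number of dimensions of each structure
print(table.loc["pear", "price"])   # label-based lookup: row "pear", column "price"

# Assumption for illustration: stacking two tables under outer keys gives a
# two-level row index, which can play the role of a third dimension.
cube = pd.concat({"2016": table, "2017": table * 2})
print(cube.loc[("2017", "plum"), "stock"])
```

The outer key selects the "slice" and the inner key selects the row, so `cube.loc["2017"]` would give back an ordinary two-dimensional table.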
II. Code
1. Creating a one-dimensional array
>>> import pandas as pd
>>> import numpy as np
>>> x = pd.Series([1, 3, 5, np.nan])
>>> print(x)
0    1.0
1    3.0
2    5.0
3    NaN
dtype: float64
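The dtype in the output above is float64 even though the list holds integers, because NaN is a floating-point value. The short sketch below, using the same Series, shows how the missing entry behaves in a few common operations; it is only an illustration of default behavior, not an exhaustive treatment.

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 3, 5, np.nan])

print(x.dtype)               # float64: NaN is a float, so the integers are upcast
print(x.isna().sum())        # number of missing entries
print(x.sum())               # NaN is skipped by default (skipna=True)
print(x.sum(skipna=False))   # NaN propagates when skipping is disabled
```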
2. Creating two-dimensional arrays
>>> dates = pd.date_range(start='20170101', end='20171231', freq='D')  # daily frequency
>>> print(dates)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')
>>> # month-end frequency; recent pandas releases spell this alias 'ME'
>>> dates = pd.date_range(start='20170101', end='20171231', freq='M')
>>> print(dates)
DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
               '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
               '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31'],
              dtype='datetime64[ns]', freq='M')
>>> df = pd.DataFrame(np.random.randn(12, 4), index=dates, columns=list('ABCD'))
>>> print(df)
                   A         B         C         D
2017-01-31 -0.682556  0.244102  0.450855  0.236475
2017-02-28 -0.630060  0.590667  0.482438  0.225697
2017-03-31  1.066989  0.319339  1.094953  1.716053
2017-04-30  0.334944 -0.053049 -1.009493 -1.039470
2017-05-31 -0.380778 -0.044429  0.075647  0.931243
2017-06-30  0.867540  0.872197 -0.738974 -1.114596
2017-07-31  0.423371 -1.086386  0.183820 -0.438921
2017-08-31  1.285163  0.634134 -0.472973  1.281057
2017-09-30 -1.002832 -0.888122 -1.316014 -0.070637
2017-10-31  1.735617 -0.253815  0.554403  1.536211
2017-11-30  2.030384  0.667556  1.012698  0.239479
2017-12-31  2.059718 -0.089050  1.420517  0.224578
>>> df = pd.DataFrame([[np.random.randint(1, 100) for j in range(4)] for i in range(12)],
...                   index=dates, columns=list('ABCD'))
>>> print(df)
(prints 12 rows, one per month-end date from 2017-01-31 to 2017-12-31, with columns A to D holding random integers from 1 to 99)
>>> df = pd.DataFrame({'A': [np.random.randint(1, 100) for i in range(4)],
...                    'B': pd.date_range(start='20130101', periods=4, freq='D'),
...                    'C': pd.Series([1, 2, 3, 4], index=list(range(4)), dtype='float32'),
...                    'D': np.array([3] * 4, dtype='int32'),
...                    'E': pd.Categorical(["test", "train", "test", "train"]),
...                    'F': 'foo'})
>>> print(df)
    A          B    C  D      E    F
0  15 2013-01-01  1.0  3   test  foo
1  11 2013-01-02  2.0  3  train  foo
2  91 2013-01-03  3.0  3   test  foo
3  91 2013-01-04  4.0  3  train  foo
>>> df = pd.DataFrame({'A': [np.random.randint(1, 100) for i in range(4)],
...                    'B': pd.date_range(start='20130101', periods=4, freq='D'),
...                    'C': pd.Series([1, 2, 3, 4], index=['zhang', 'li', 'zhou', 'wang'], dtype='float32'),
...                    'D': np.array([3] * 4, dtype='int32'),
...                    'E': pd.Categorical(["test", "train", "test", "train"]),
...                    'F': 'foo'})
>>> print(df)
        A          B    C  D      E    F
zhang  36 2013-01-01  1.0  3   test  foo
li     86 2013-01-02  2.0  3  train  foo
zhou   10 2013-01-03  3.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
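Because the tables above are filled with random numbers, every run prints different values. A possible way to make such examples repeatable is to draw from a seeded NumPy generator, as sketched below. The seed value, the `rng` name and the choice of the month-start alias `'MS'` with `periods=12` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # seeded generator: same numbers on every run

# 12 month-start dates, given as a start plus a count instead of a start and an end
dates = pd.date_range(start="2017-01-01", periods=12, freq="MS")

# integers() excludes the upper bound, so the values fall in 1..99
df = pd.DataFrame(rng.integers(1, 100, size=(12, 4)),
                  index=dates, columns=list("ABCD"))

print(df.shape)
print(df.index[0], df.index[-1])
```

Building the whole array in one `rng.integers` call also avoids the nested list comprehension, which tends to be slower for large tables.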
3. Viewing two-dimensional data
>>> df.head()  # shows the first 5 rows by default
        A          B    C  D      E    F
zhang  36 2013-01-01  1.0  3   test  foo
li     86 2013-01-02  2.0  3  train  foo
zhou   10 2013-01-03  3.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
>>> df.head(3)  # first 3 rows
        A          B    C  D      E    F
zhang  36 2013-01-01  1.0  3   test  foo
li     86 2013-01-02  2.0  3  train  foo
zhou   10 2013-01-03  3.0  3   test  foo
>>> df.tail(2)  # last 2 rows
       A          B    C  D      E    F
zhou  10 2013-01-03  3.0  3   test  foo
wang  79 2013-01-04  4.0  3  train  foo
4. Viewing the index, column names, and data of a two-dimensional structure
>>> df.index
Index(['zhang', 'li', 'zhou', 'wang'], dtype='object')
>>> df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
>>> df.values
array([[36, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [86, Timestamp('2013-01-02 00:00:00'), 2.0, 3, 'train', 'foo'],
       [10, Timestamp('2013-01-03 00:00:00'), 3.0, 3, 'test', 'foo'],
       [79, Timestamp('2013-01-04 00:00:00'), 4.0, 3, 'train', 'foo']], dtype=object)
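The `dtype=object` in the array above comes from mixing numbers, timestamps and strings in one table. The sketch below, on a reduced version of the table with only columns A, C and F, shows how to inspect per-column dtypes and how selecting the numeric columns first yields a numeric array. `to_numpy()` is used here as the method counterpart of `.values`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [36, 86, 10, 79], "C": [1.0, 2.0, 3.0, 4.0], "F": "foo"},
                  index=["zhang", "li", "zhou", "wang"])

print(df.dtypes)                      # one dtype per column

numeric = df[["A", "C"]].to_numpy()   # int and float columns share a float64 array
mixed = df.to_numpy()                 # numbers plus strings fall back to object
print(numeric.dtype, mixed.dtype)
```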
5. Viewing summary statistics
>>> df.describe()  # count, mean, standard deviation, minimum, quartiles, maximum
               A         C    D
count   4.000000  4.000000  4.0
mean   52.750000  2.500000  3.0
std    36.068222  1.290994  0.0
min    10.000000  1.000000  3.0
25%    29.500000  1.750000  3.0
50%    57.500000  2.500000  3.0
75%    80.750000  3.250000  3.0
max    86.000000  4.000000  3.0
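To make the rows of the summary above less opaque, the following sketch recomputes a few of them by hand for column A, using the four values shown in the article (36, 86, 10, 79). It illustrates that the `std` row is the sample standard deviation (divisor n - 1) and that the percentage rows are interpolated quantiles.

```python
import pandas as pd

a = pd.Series([36, 86, 10, 79], index=["zhang", "li", "zhou", "wang"], name="A")

summary = a.describe()
print(summary)

# Sample standard deviation: squared deviations divided by n - 1
manual_std = (((a - a.mean()) ** 2).sum() / (len(a) - 1)) ** 0.5
print(manual_std)

# The 25%, 50% and 75% rows are linearly interpolated quantiles
quartiles = a.quantile([0.25, 0.5, 0.75])
print(quartiles)
```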
6. Transposing two-dimensional data
>>> df.T
                     zhang                   li                 zhou
A                       36                   86                   10
B      2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00
C                        1                    2                    3
D                        3                    3                    3
E                     test                train                 test
F                      foo                  foo                  foo

                      wang
A                       79
B      2013-01-04 00:00:00
C                        4
D                        3
E                    train
F                      foo
7. Sorting
>>> df.sort_index(axis=0, ascending=False)  # sort by the row labels, descending
        A          B    C  D      E    F
zhou   10 2013-01-03  3.0  3   test  foo
zhang  36 2013-01-01  1.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
li     86 2013-01-02  2.0  3  train  foo
>>> df.sort_index(axis=1, ascending=False)  # sort by the column labels, descending
         F      E  D    C          B   A
zhang  foo   test  3  1.0 2013-01-01  36
li     foo  train  3  2.0 2013-01-02  86
zhou   foo   test  3  3.0 2013-01-03  10
wang   foo  train  3  4.0 2013-01-04  79
>>> df.sort_index(axis=0, ascending=True)
        A          B    C  D      E    F
li     86 2013-01-02  2.0  3  train  foo
wang   79 2013-01-04  4.0  3  train  foo
zhang  36 2013-01-01  1.0  3   test  foo
zhou   10 2013-01-03  3.0  3   test  foo
>>> df.sort_values(by='A')  # sort by the values in column A
        A          B    C  D      E    F
zhou   10 2013-01-03  3.0  3   test  foo
zhang  36 2013-01-01  1.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
li     86 2013-01-02  2.0  3  train  foo
>>> df.sort_values(by='A', ascending=False)  # descending order
        A          B    C  D      E    F
li     86 2013-01-02  2.0  3  train  foo
wang   79 2013-01-04  4.0  3  train  foo
zhang  36 2013-01-01  1.0  3   test  foo
zhou   10 2013-01-03  3.0  3   test  foo
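The sorting calls above each use a single key. As a small extension, the sketch below sorts on two columns with a different direction for each, using a reduced table that keeps only columns A and E from the article's data. The variable name `ordered` is chosen for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [36, 86, 10, 79], "E": ["test", "train", "test", "train"]},
    index=["zhang", "li", "zhou", "wang"],
)

# Sort by E ascending, then by A descending within each value of E
ordered = df.sort_values(by=["E", "A"], ascending=[True, False])
print(ordered)

# sort_values returns a new object; the original row order is untouched
print(list(df.index))
```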
8. Selecting data
>>> df['A']  # select a column
zhang     1
li        1
zhou     60
wang     58
Name: A, dtype: int64
>>> df[0:2]  # select several rows with a slice
       A          B    C  D      E    F
zhang  1 2013-01-01  1.0  3   test  foo
li     1 2013-01-02  2.0  3  train  foo
>>> df.loc[:, ['A', 'C']]  # select several columns
        A    C
zhang   1  1.0
li      1  2.0
zhou   60  3.0
wang   58  4.0
>>> df.loc[['zhang', 'zhou'], ['A', 'D', 'E']]  # select specific rows and columns together
        A  D     E
zhang   1  3  test
zhou   60  3  test
>>> df.loc['zhang', ['A', 'D', 'E']]
A       1
D       3
E    test
Name: zhang, dtype: object
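The selections above are all label-based. The sketch below contrasts label slicing with position slicing and adds boolean filtering, on a reduced table with columns A, D and E taken from the values printed in this section. The result names (`by_label`, `by_position`, `large_a`, `test_rows`) are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 1, 60, 58], "D": [3, 3, 3, 3], "E": ["test", "train", "test", "train"]},
    index=["zhang", "li", "zhou", "wang"],
)

by_label = df.loc["zhang":"zhou", ["A", "E"]]   # label slice: the end label is included
by_position = df.iloc[0:2, [0, 2]]              # position slice: the end is excluded
large_a = df[df["A"] > 50]                      # boolean mask keeps rows where A > 50

# Masks are combined with & or |, and each condition needs its own parentheses
test_rows = df[(df["A"] > 50) & (df["E"] == "test")]

print(by_label, by_position, large_a, test_rows, sep="\n\n")
```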
9. Modifying and setting data
>>> df.iat[0, 2] = 3  # change the value at a given row and column position
>>> print(df)
        A          B    C  D      E    F
zhang   1 2013-01-01  3.0  3   test  foo
li      1 2013-01-02  2.0  3  train  foo
zhou   60 2013-01-03  3.0  3   test  foo
wang   58 2013-01-04  4.0  3  train  foo
>>> df.loc[:, 'D'] = [np.random.randint(50, 60) for i in range(4)]  # replace the values of a column
>>> print(df)
        A          B    C   D      E    F
zhang   1 2013-01-01  3.0  57   test  foo
li      1 2013-01-02  2.0  52  train  foo
zhou   60 2013-01-03  3.0  57   test  foo
wang   58 2013-01-04  4.0  56  train  foo
>>> df['C'] = -df['C']  # negate the values of a column
>>> print(df)
        A          B    C   D      E    F
zhang   1 2013-01-01 -3.0  57   test  foo
li      1 2013-01-02 -2.0  52  train  foo
zhou   60 2013-01-03 -3.0  57   test  foo
wang   58 2013-01-04 -4.0  56  train  foo
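Beyond the positional `iat` and whole-column assignments above, values can also be set by label or by condition. The sketch below shows three such forms on a reduced table with columns A and C. The `ratio` column is a made-up name added only to demonstrate creating a new column.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 60, 58], "C": [1.0, 2.0, 3.0, 4.0]},
                  index=["zhang", "li", "zhou", "wang"])

df.at["zhang", "C"] = 3.0          # label-based counterpart of iat
df.loc[df["A"] > 50, "C"] = 0.0    # conditional assignment on the selected rows
df["ratio"] = df["C"] / df["A"]    # assigning to a new name adds a column

print(df)
```

Writing through a single `df.loc[rows, column] = value` call, rather than chaining two selections, keeps the assignment on the original table.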
10. Handling missing values
>>> df1 = df.reindex(index=['zhang', 'li', 'zhou', 'wang'], columns=list(df.columns) + ['G'])
>>> print(df1)
        A          B    C   D      E    F   G
zhang   1 2013-01-01 -3.0  57   test  foo NaN
li      1 2013-01-02 -2.0  52  train  foo NaN
zhou   60 2013-01-03 -3.0  57   test  foo NaN
wang   58 2013-01-04 -4.0  56  train  foo NaN
>>> df1.iat[0, 6] = 3  # set one element; the rest of the column stays missing (NaN)
>>> print(df1)
        A          B    C   D      E    F    G
zhang   1 2013-01-01 -3.0  57   test  foo  3.0
li      1 2013-01-02 -2.0  52  train  foo  NaN
zhou   60 2013-01-03 -3.0  57   test  foo  NaN
wang   58 2013-01-04 -4.0  56  train  foo  NaN
>>> pd.isnull(df1)  # test for missing values; returns a table of True/False
           A      B      C      D      E      F      G
zhang  False  False  False  False  False  False  False
li     False  False  False  False  False  False   True
zhou   False  False  False  False  False  False   True
wang   False  False  False  False  False  False   True
>>> df1.dropna()  # return only the rows without missing values
       A          B    C   D     E    F    G
zhang  1 2013-01-01 -3.0  57  test  foo  3.0
>>> # fill missing values with a given value; assigning the result back avoids
>>> # relying on inplace=True applied to a column selection
>>> df1['G'] = df1['G'].fillna(5)
>>> print(df1)
        A          B    C   D      E    F    G
zhang   1 2013-01-01 -3.0  57   test  foo  3.0
li      1 2013-01-02 -2.0  52  train  foo  5.0
zhou   60 2013-01-03 -3.0  57   test  foo  5.0
wang   58 2013-01-04 -4.0  56  train  foo  5.0
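A few variations on the missing-value calls above may be useful in practice. The sketch below, on a reduced table with columns A and G, counts missing entries per column, fills with a per-column mapping, drops rows based on a single column, and forward-fills. Each call returns a new object and leaves `df1` unchanged. The result names are chosen for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1, 1, 60, 58], "G": [3.0, np.nan, np.nan, np.nan]},
                   index=["zhang", "li", "zhou", "wang"])

print(df1.isna().sum())            # count of missing entries per column

filled = df1.fillna({"G": 5})      # per-column fill values given as a dict
kept = df1.dropna(subset=["G"])    # drop a row only when G is missing
ffilled = df1["G"].ffill()         # carry the last valid value forward

print(filled, kept, ffilled, sep="\n\n")
```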
11. Data operations
>>> # mean of each numeric column; missing values are skipped automatically
>>> # (recent pandas releases need numeric_only=True when the table also holds non-numeric columns)
>>> df1.mean(numeric_only=True)
A    30.0
C    -3.0
D    55.5
G     4.5
dtype: float64
>>> df.mean(axis=1, numeric_only=True)  # row-wise mean
zhang    18.333333
li       17.000000
zhou     38.000000
wang     36.666667
dtype: float64
>>> df1.shift(1)  # shift the data down by one row
          A          B    C     D      E    F    G
zhang   NaN        NaT  NaN   NaN    NaN  NaN  NaN
li      1.0 2013-01-01 -3.0  57.0   test  foo  3.0
zhou    1.0 2013-01-02 -2.0  52.0  train  foo  5.0
wang   60.0 2013-01-03 -3.0  57.0   test  foo  5.0
>>> df1['D'].value_counts()  # frequency count of each value
57    2
56    1
52    1
Name: D, dtype: int64
>>> print(df1)
        A          B    C   D      E    F    G
zhang   1 2013-01-01 -3.0  57   test  foo  3.0
li      1 2013-01-02 -2.0  52  train  foo  5.0
zhou   60 2013-01-03 -3.0  57   test  foo  5.0
wang   58 2013-01-04 -4.0  56  train  foo  5.0
>>> df2 = pd.DataFrame(np.random.randn(10, 4))
>>> print(df2)
          0         1         2         3
0 -0.939904 -1.856658 -0.281965  0.203624
1  0.350162  0.060674 -0.914808  0.135735
2 -1.031384 -1.611274  0.341546 -0.363671
3  0.139464 -0.050959 -0.810610 -0.772648
4 -1.146810 -0.791608  1.488790 -0.490004
5 -0.100707 -0.763545 -0.071274 -0.298142
6 -0.212014  0.809709  0.693196  0.980568
7 -0.812985 -0.000325 -0.675101 -0.217394
8  0.066969 -0.084609 -0.433099  0.535616
9 -0.319120 -0.532854  1.321712 -1.751913
>>> p1 = df2[:3]
>>> print(p1)
          0         1         2         3
0 -0.939904 -1.856658 -0.281965  0.203624
1  0.350162  0.060674 -0.914808  0.135735
2 -1.031384 -1.611274  0.341546 -0.363671
>>> p2 = df2[3:7]
>>> print(p2)
          0         1         2         3
3  0.139464 -0.050959 -0.810610 -0.772648
4 -1.146810 -0.791608  1.488790 -0.490004
5 -0.100707 -0.763545 -0.071274 -0.298142
6 -0.212014  0.809709  0.693196  0.980568
>>> p3 = df2[7:]
>>> print(p3)
          0         1         2         3
7 -0.812985 -0.000325 -0.675101 -0.217394
8  0.066969 -0.084609 -0.433099  0.535616
9 -0.319120 -0.532854  1.321712 -1.751913
>>> df3 = pd.concat([p1, p2, p3])  # concatenate the pieces row-wise
>>> print(df3)
          0         1         2         3
0 -0.939904 -1.856658 -0.281965  0.203624
1  0.350162  0.060674 -0.914808  0.135735
2 -1.031384 -1.611274  0.341546 -0.363671
3  0.139464 -0.050959 -0.810610 -0.772648
4 -1.146810 -0.791608  1.488790 -0.490004
5 -0.100707 -0.763545 -0.071274 -0.298142
6 -0.212014  0.809709  0.693196  0.980568
7 -0.812985 -0.000325 -0.675101 -0.217394
8  0.066969 -0.084609 -0.433099  0.535616
9 -0.319120 -0.532854  1.321712 -1.751913
>>> df2 == df3
      0     1     2     3
0  True  True  True  True
1  True  True  True  True
2  True  True  True  True
3  True  True  True  True
4  True  True  True  True
5  True  True  True  True
6  True  True  True  True
7  True  True  True  True
8  True  True  True  True
9  True  True  True  True
>>> df4 = pd.DataFrame({'A': [np.random.randint(1, 5) for i in range(8)],
...                     'B': [np.random.randint(10, 15) for i in range(8)],
...                     'C': [np.random.randint(20, 30) for i in range(8)],
...                     'D': [np.random.randint(80, 100) for i in range(8)]})
>>> print(df4)
   A   B   C   D
0  4  11  24  91
1  1  13  28  95
2  2  12  27  91
3  1  12  20  87
4  3  11  24  96
5  1  13  21  99
6  3  11  22  95
7  2  13  26  98
>>> df4.groupby('A').sum()  # group the rows by A and sum each group
    B   C    D
A
1  38  69  281
2  25  53  189
3  22  46  191
4  11  24   91
>>> df4.groupby(['A', 'B']).mean()
         C     D
A B
1 12  20.0  87.0
  13  24.5  97.0
2 12  27.0  91.0
  13  26.0  98.0
3 11  23.0  95.5
4 11  24.0  91.0
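The grouped results above apply one function to every column. When different columns need different summaries, named aggregation is one option. The sketch below types in the `df4` values printed in the article so the run is reproducible. The output column names `rows`, `c_mean` and `d_max` are invented for this illustration.

```python
import pandas as pd

# The df4 values printed in the article, entered by hand instead of drawn at random
df4 = pd.DataFrame({
    "A": [4, 1, 2, 1, 3, 1, 3, 2],
    "B": [11, 13, 12, 12, 11, 13, 11, 13],
    "C": [24, 28, 27, 20, 24, 21, 22, 26],
    "D": [91, 95, 91, 87, 96, 99, 95, 98],
})

summary = df4.groupby("A").agg(
    rows=("B", "size"),     # number of rows in each group
    c_mean=("C", "mean"),   # average of C per group
    d_max=("D", "max"),     # largest D per group
)
print(summary)

totals = df4.groupby("A").sum()   # same call as in the article, for comparison
print(totals.loc[1, "D"])
```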
12. Plotting with matplotlib
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> df = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
>>> print(df)
             B          C
0     0.089886   0.511081
1     1.323766   1.584758
2     1.489479  -0.438671
3     0.831331  -0.398021
4    -0.248233   0.494418
...
995 -14.088767 -20.492952
996 -14.169056 -20.666777
997 -14.798708 -19.960555
998 -15.766568 -19.395622
999 -17.281143 -19.089793

[1000 rows x 2 columns]
>>> df['A'] = pd.Series(list(range(len(df))))
>>> print(df)
             B          C    A
0     0.089886   0.511081    0
1     1.323766   1.584758    1
2     1.489479  -0.438671    2
3     0.831331  -0.398021    3
4    -0.248233   0.494418    4
...
995 -14.088767 -20.492952  995
996 -14.169056 -20.666777  996
997 -14.798708 -19.960555  997
998 -15.766568 -19.395622  998
999 -17.281143 -19.089793  999

[1000 rows x 3 columns]
>>> plt.figure()
<matplotlib.figure.Figure object at 0x000002A2A0B10F28>
>>> df.plot(x='A')
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A12FE7F0>
>>> plt.show()
>>> df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
>>> print(df)
          a         b         c         d
0  0.504434  0.190875  0.001687  0.327372
1  0.406844  0.602029  0.912075  0.815889
2  0.828534  0.985910  0.094662  0.552089
3  0.198843  0.818785  0.750649  0.967054
4  0.498494  0.151378  0.417506  0.264438
5  0.655288  0.672788  0.088616  0.433270
6  0.493127  0.009254  0.179479  0.396655
7  0.419386  0.910986  0.020004  0.229063
8  0.671469  0.612189  0.374920  0.407093
9  0.414978  0.033499  0.756025  0.717849
>>> df.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A17BD7B8>
>>> plt.show()
[Figure: result of running the code]
>>> df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
>>> df.plot(kind='barh', stacked=True)
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A3784390>
>>> plt.show()
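The plotting examples above rely on `plt.show()`, which needs an interactive display. When running as a script or on a machine without a display, one possible approach is to select a non-interactive backend and write the chart to a file, as sketched below. The seed, the axis label text and the output file name `pandas_barh_example.png` are assumptions made for this illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")              # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
df = pd.DataFrame(rng.random((10, 4)), columns=["a", "b", "c", "d"])

ax = df.plot(kind="barh", stacked=True)   # DataFrame.plot returns a matplotlib Axes
ax.set_xlabel("sum of a, b, c, d")
ax.set_title("Stacked horizontal bars")

# Hypothetical output location inside the system temporary directory
out_path = os.path.join(tempfile.gettempdir(), "pandas_barh_example.png")
ax.figure.savefig(out_path, dpi=100)
plt.close(ax.figure)                      # release the figure once it is saved
print(out_path)
```

Keeping the returned Axes in a variable is what allows the labels and title to be set before saving.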
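Readers interested in more Python material can browse this site's topic collections: "Summary of Python Math Operation Techniques", "Python Data Structures and Algorithms Tutorial", "Summary of Python Function Usage Techniques", "Summary of Python String Manipulation Techniques", and "Classic Python Tutorial from Beginner to Advanced".
I hope this article is helpful for your Python programming.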