Alibaba Cloud Technical Community [Yunqi]

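Learn It, Use It: An Introduction to Pandas and Time Series Analysis

This article is based on a talk that Alexander Hendorf gave at PyData Florence 2017. The first half introduces the basic features of Pandas for beginners: data input/output, visualization, aggregation, and selection and access. The second half shows how to use Pandas for time series analysis. The source code has been tested and works.

PS: PyData brings together users and developers of data analysis tools to exchange experience and learn from each other. It gives data science enthusiasts from every field a platform for sharing experience and for discussing how to use languages and tools to meet the challenges of data management, processing, analysis, and visualization.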

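【Origins and goals of Pandas】

1. An open-source Python library

2. Practical, real-world data analysis: fast, efficient, easy

4. Started by Wes McKinney in 2008; today part of Continuum Analytics' Anaconda

5. A stable project with regular updates

6. Home: https://github.com/pandas-dev/pandas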

【Features】

4. Data visualization

6. Database-like operations

Repository: https://github.com/Koenigsweg/data-timeseries-analysis-with-pandas

1. Series & DataFrame

2. I/O (input/output)

3. Data analysis & aggregation

4. Indexes

5. Visualization

6. Interacting with the data

【Input/Output】

import pandas as pd
# Read the data; header=None because the file has no header row
df = pd.read_csv('raw_weather_data_aug_sep_2014/tempm.csv', header=None)
print(df.head(5))   # print the first n rows
print(df.tail(5))   # print the last n rows
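
To try the calls above without the weather file, here is a minimal sketch that reads a few rows from an in-memory string instead. The sample rows and the name `sample` are made up for illustration; they only mimic the two-column layout (timestamp, temperature) that appears later in the article.

```python
import io
import pandas as pd

# Hypothetical sample data in a timestamp,temperature layout (not the real file).
raw = io.StringIO(
    "2014-09-26T03:50:00,14.0\n"
    "2014-08-10T05:00:00,14.0\n"
    "2014-08-21T22:50:00,12.0\n"
    "2014-08-17T13:20:00,16.0\n"
)

# header=None tells pandas that the first line is data, not column names,
# so the columns are labeled 0, 1, ...
sample = pd.read_csv(raw, header=None)

print(sample.head(2))  # first two rows
print(sample.tail(2))  # last two rows
```

Because no header row is read, the columns carry integer labels until they are renamed.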

1. Uses the Matplotlib library through the .plot() function

【Data structures: Series and DataFrame】

【Series】

1. A one-dimensional labeled array that can hold any Python data type (integers, strings, floating point numbers, Python objects, etc.)

2. The labels of a Series are usually called its index

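Selecting and accessing data:

1. Data can be selected by label (index) or by position (starting at 0);

2. Data can be accessed by slicing or boolean indexing, for example:

series[x], series[[x, y]]
series[2], series[[2, 3]], series[2:3]
series.loc[...] / series.iloc[...]   # explicit label-based / position-based indexers
# The older .ix indexer mixed the behavior of .loc and .iloc;
# it was deprecated in pandas 0.20 and removed in pandas 1.0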
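
The access patterns above can be sketched on a small Series. The labels and values below are hypothetical; `.loc` is used for labels and `.iloc` for positions so that the two are never confused.

```python
import pandas as pd

# Hypothetical Series with string labels as its index.
s = pd.Series([14.0, 12.0, 16.0, 15.0], index=["a", "b", "c", "d"])

by_label = s.loc["b"]          # label-based: the value stored under "b"
by_position = s.iloc[2]        # position-based: the third value (positions start at 0)
several = s.loc[["a", "c"]]    # a list of labels returns a Series
sliced = s.iloc[1:3]           # positional slice, end position excluded
warm = s[s > 13.0]             # boolean indexing keeps entries where the mask is True

print(by_label, by_position)
print(warm)
```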

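【DataFrame】

A two-dimensional labeled data structure, similar to a 2-D NumPy array. The following rules apply to its index:

2. The index can be reset or replaced;

3. Types: position, timestamp, time range, label…;

4. An index label may appear more than once (the index need not be unique)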
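
As a rough illustration of these index rules, the sketch below builds a small DataFrame with a repeated label, then resets and replaces the index. The data and variable names are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame(
    {"temperature": [14.0, 12.0, 16.0]},
    index=["x", "x", "y"],          # the label "x" appears twice
)

print(frame.index.is_unique)        # False: labels may repeat
dup = frame.loc["x"]                # a repeated label selects every matching row

# Reset: go back to the default positional index 0, 1, 2
renumbered = frame.reset_index(drop=True)

# Replace: use timestamps as the index instead
relabeled = frame.set_index(
    pd.to_datetime(["2014-08-01", "2014-08-02", "2014-08-03"])
)
print(relabeled)
```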

Example 1. Naming the columns

df.columns = ['timestamp', 'temperature']
df.head(3)

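Example 2. Computing on the data:

def to_fahrenheit(celsius):
    # Convert degrees Celsius to degrees Fahrenheit
    return (celsius * 9./5.) + 32.
# map() applies the function to every element; preview the first five results
df['temperature'].map(to_fahrenheit)[:5]
# Store the converted values in a new column
df['temperature F'] = df['temperature'].map(to_fahrenheit)
df.head(5)
# apply() with a lambda gives the same result
df['temperature F'] = df['temperature'].apply(lambda x: (x * 9./5.) + 32.)
df.head()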

	timestamp	temperature	temperature F
0	2014-09-26T03:50:00	14.0	57.2
1	2014-08-10T05:00:00	14.0	57.2
2	2014-08-21T22:50:00	12.0	53.6
3	2014-08-17T13:20:00	16.0	60.8
4	2014-08-06T01:20:00	14.0	57.2
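
The conversion can also be written as plain arithmetic on the whole column, without calling a Python function per element. The sketch below uses a small made-up frame named `temps` and checks that the element-wise and column-wise versions agree.

```python
import pandas as pd

# Hypothetical temperatures in degrees Celsius.
temps = pd.DataFrame({"temperature": [14.0, 12.0, 16.0]})

def to_fahrenheit(celsius):
    return (celsius * 9.0 / 5.0) + 32.0

mapped = temps["temperature"].map(to_fahrenheit)       # one function call per element
vectorized = temps["temperature"] * 9.0 / 5.0 + 32.0   # one expression for the whole column

print(mapped.equals(vectorized))
```

On large columns the vectorized form is usually the faster of the two, which is one reason to prefer it when the operation is simple arithmetic.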

Example 3. Two columns can also be combined directly in arithmetic, for example:

df['ruleoftumb'] = df['temperature F'] / df['temperature']
df.head()

	timestamp	temperature	temperature F	ruleoftumb
0	2014-09-26T03:50:00	14.0	57.2	4.085714
1	2014-08-10T05:00:00	14.0	57.2	4.085714
2	2014-08-21T22:50:00	12.0	53.6	4.466667
3	2014-08-17T13:20:00	16.0	60.8	3.800000
4	2014-08-06T01:20:00	14.0	57.2	4.085714

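【Modifying Series and DataFrame】

By default, Series and DataFrame methods do not modify the original object; they return a new Series or DataFrame. The inplace parameter decides whether the result replaces the original data.

# Rename a column. rename() returns a new DataFrame by default;
# inplace=True modifies the original DataFrame instead
df.rename(columns={'ruleoftumb': 'bad_rule'}, inplace=True)
df.head()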

	timestamp	temperature	temperature F	bad_rule
0	2014-09-26T03:50:00	14.0	57.2	4.085714
1	2014-08-10T05:00:00	14.0	57.2	4.085714
2	2014-08-21T22:50:00	12.0	53.6	4.466667
3	2014-08-17T13:20:00	16.0	60.8	3.800000
4	2014-08-06T01:20:00	14.0	57.2	4.085714

# Drop a column; inplace works the same way as above
df.drop('bad_rule', axis=1, inplace=True)
df.head()

	timestamp	temperature	temperature F
0	2014-09-26T03:50:00	14.0	57.2
1	2014-08-10T05:00:00	14.0	57.2
2	2014-08-21T22:50:00	12.0	53.6
3	2014-08-17T13:20:00	16.0	60.8
4	2014-08-06T01:20:00	14.0	57.2
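
To see the "returns a new object" behavior described in this section, here is a small sketch with a hypothetical two-column frame. It leaves out inplace and assigns the results to new names instead.

```python
import pandas as pd

original = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Without inplace=True, rename() and drop() leave `original` untouched
renamed = original.rename(columns={"b": "c"})
trimmed = original.drop("a", axis=1)

print(list(original.columns))
print(list(renamed.columns))
print(list(trimmed.columns))
```

Assigning the result to a name is an alternative to inplace=True and keeps the original data available for comparison.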

1. describe()

4. mean(), sum(), median(), …

Example 1. Creating a new column:

# mean() computes the mean of the selected data
df['deviation'] = df['temperature'] - df['temperature'].mean()
df.head()

	timestamp	temperature	temperature F	deviation
0	2014-09-26T03:50:00	14.0	57.2	-1.590951
1	2014-08-10T05:00:00	14.0	57.2	-1.590951
2	2014-08-21T22:50:00	12.0	53.6	-3.590951
3	2014-08-17T13:20:00	16.0	60.8	0.409049
4	2014-08-06T01:20:00	14.0	57.2	-1.590951

Example 2. Grouping with groupby()

# Group by temperature and count how often each temperature occurs
df.groupby('temperature').count()
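
A small sketch of what the grouping returns, using made-up observations. count() gives the number of non-null values in each remaining column per group, while size() (an alternative not used in the article) gives the number of rows per group as a Series.

```python
import pandas as pd

# Hypothetical observations; "t1".."t4" stand in for real timestamps.
obs = pd.DataFrame({
    "timestamp": ["t1", "t2", "t3", "t4"],
    "temperature": [14.0, 14.0, 12.0, 16.0],
})

counts = obs.groupby("temperature").count()   # non-null count per remaining column
sizes = obs.groupby("temperature").size()     # number of rows per group

# Several aggregations of one column at once
summary = obs["temperature"].agg(["mean", "median", "sum"])

print(counts)
print(sizes)
print(summary)
```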

Example 3. Printing summary statistics for selected data

# describe() returns summary statistics of the data, ignoring missing values
df['temperature'].describe(percentiles=[.1,.5,.6,.7])
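
The following sketch shows the shape of the describe() result on a short, made-up Series that contains one missing value. The missing value is left out of the count and of every statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value.
readings = pd.Series([10.0, 20.0, np.nan, 30.0, 40.0])

stats = readings.describe(percentiles=[.1, .5, .6, .7])
print(stats)

# Individual statistics are available by label
print(stats["count"], stats["mean"], stats["50%"])
```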

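NaN denotes a missing value. Missing values can be removed with dropna(), replaced with a default value, or filled forward or backward.

Example 1. Using isnull() to test for missing values

df['temperature'].isnull()[2350:2357]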

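Example 2. Dropping missing values:

df.dropna(inplace=True)
print(df['temperature'].isnull().any())

Output: False. The missing values have been dropped and, because of inplace=True, the result has replaced the original data, so the check for missing values returns False: none remain.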
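
Dropping is only one of the options mentioned in this section. The sketch below compares it with replacing by a default value and with forward and backward filling, on a short made-up Series.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two consecutive missing values.
gaps = pd.Series([14.0, np.nan, np.nan, 16.0])

dropped = gaps.dropna()      # remove the missing values
filled = gaps.fillna(0.0)    # replace them with a default value
forward = gaps.ffill()       # carry the last valid value forward
backward = gaps.bfill()      # pull the next valid value backward

print(forward.tolist())
print(backward.tolist())
```

Which option is appropriate depends on the data; for sensor readings, filling with 0.0 would usually distort statistics such as the mean.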

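Part 2: Time series analysis (series indexed by timestamp)

For time series analysis, first change the DataFrame's index from the default integer index to a timestamp index:

# Add a deviation column, then replace the default index with a timestamp index
df['deviation'] = df['temperature'] - df['temperature'].mean()
df.index = pd.to_datetime(df['timestamp'])
df.head()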
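
A minimal sketch of the same step on three made-up rows. It also sorts the index chronologically, which the article does not do; that extra step is an assumption here, added because the sample timestamps are not in order.

```python
import pandas as pd

# Hypothetical rows in the same layout as the article's data, not in time order.
rows = pd.DataFrame({
    "timestamp": ["2014-09-26T03:50:00", "2014-08-10T05:00:00", "2014-08-21T22:50:00"],
    "temperature": [14.0, 14.0, 12.0],
})

rows.index = pd.to_datetime(rows["timestamp"])   # strings become Timestamps
rows = rows.sort_index()                          # optional: chronological order

print(type(rows.index).__name__)
print(rows.index.min(), rows.index.max())
```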

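Plot the first 100 rows of the DataFrame. The x-axis now shows timestamps instead of the numeric index:

import matplotlib.pyplot as plt
ax = df[:100].plot()
# Draw the median temperature of those 100 rows as a red horizontal line
ax.axhline(df[:100]['temperature'].median(), color='r', linestyle='-')
plt.show()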

Next, add a weekday column and a weekend column to the DataFrame:

# DatetimeIndex.weekday returns the day of the week: Monday is 0, Sunday is 6
df['weekday'] = df.index.weekday
# isin() returns booleans that say whether df['weekday'] is in {5, 6},
# that is, whether the day falls on a weekend
df['weekend'] = df['weekday'].isin({5, 6})
# Group by calendar date and count the rows per day
df.groupby(df.index.date).count()
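
The weekday and weekend logic can be checked on a handful of made-up timestamps. 2014-09-26 was a Friday, so the sample below covers Friday through Monday.

```python
import pandas as pd

# Hypothetical readings from Friday 2014-09-26 to Monday 2014-09-29.
stamps = pd.to_datetime([
    "2014-09-26 03:50", "2014-09-27 10:00", "2014-09-27 18:30",
    "2014-09-28 07:15", "2014-09-29 12:00",
])
days = pd.DataFrame({"temperature": [14.0, 15.0, 17.0, 13.0, 16.0]}, index=stamps)

days["weekday"] = days.index.weekday            # Monday is 0, Sunday is 6
days["weekend"] = days["weekday"].isin([5, 6])  # True on Saturday and Sunday

# Number of readings per calendar date
per_day = days.groupby(days.index.date)["temperature"].count()

print(days)
print(per_day)
```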

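From here we can analyze how the temperature changes over time, for example week by week:

# Aggregate by week to look at how the temperature changes.
# The index is already a DatetimeIndex, so isocalendar().week gives the ISO week of the year
# (older pandas versions exposed this as df.index.week, which was removed in pandas 2.0)
weeks = df.index.isocalendar().week.to_numpy()
df.groupby(weeks).plot()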

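The temperature trend within a specific period can be analyzed as well:

# Temperature in September 2014
df.loc['2014-09']['temperature'].plot()
# Temperature for readings taken in hours 13 through 16 (hour > 12 and hour <= 16)
df[(df.index.hour > 12) & (df.index.hour <= 16)]['temperature'].plot()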
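
Both selections can be tried without plotting. The sketch below uses four made-up timestamps; `.loc` with the string "2014-09" selects every row in that month, and the boolean mask on the index's hour selects by time of day.

```python
import pandas as pd

# Hypothetical readings around the end of August 2014.
stamps = pd.to_datetime([
    "2014-08-31 14:00", "2014-09-01 02:00", "2014-09-01 14:00", "2014-09-02 02:00",
])
temps = pd.DataFrame({"temperature": [18.0, 11.0, 19.0, 12.0]}, index=stamps)

# Partial string indexing: all rows whose timestamp falls in September 2014
september = temps.loc["2014-09"]

# Boolean mask on the hour of the timestamp: hours 13, 14, 15, 16
afternoon = temps[(temps.index.hour > 12) & (temps.index.hour <= 16)]

print(september)
print(afternoon)
```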

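Time series resampling (resample)

Resampling reprocesses the original samples. It is a convenient way to resample a regular time series and convert its frequency. There are two kinds: aggregating high-frequency data to a lower frequency is called downsampling, and converting low-frequency data to a higher frequency is called upsampling.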

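First, resample the DataFrame:

# Resample into 3-day bins and plot the mean of each bin.
# resample() needs an aggregation such as mean(); only the numeric columns are aggregated here
df.select_dtypes('number').resample('3D').mean().plot()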

As another example, downsampling a Series:

import random
index = pd.date_range('1/1/2016', periods=1200, freq='s')
series = pd.Series([random.randint(0, 100) for p in range(1200)], index=index)
# label: whether each bin is labeled with its left or its right edge timestamp
# closed: which side of each bin is closed, as in the interval notation [ ) and ( ]
# Each timestamp belongs to exactly one bin, and together the bins cover the whole time span
# Downsample from 1 second to 5 minutes, taking the mean of each bin
resampled = series.resample('5min', label='right', closed='right').mean()
print(resampled)
print(series.resample('5min', label='left', closed='right').mean())
print(series.resample('5min', label='left', closed='left').mean())
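
Because the Series above is random, the effect of label and closed is hard to read from its output. The sketch below uses a small deterministic Series instead (ten one-minute values 0 to 9, made up for illustration) and sums each 5-minute bin so the bin contents can be checked by hand.

```python
import pandas as pd

# Ten values 0..9, one per minute, starting at 2016-01-01 00:00.
index = pd.date_range("2016-01-01 00:00", periods=10, freq="min")
ticks = pd.Series(range(10), index=index)

# Bins [00:00, 00:05) and [00:05, 00:10), labeled by their left edge
left_left = ticks.resample("5min", closed="left", label="left").sum()

# Bins (23:55, 00:00], (00:00, 00:05], (00:05, 00:10], labeled by their right edge
right_right = ticks.resample("5min", closed="right", label="right").sum()

# Same bins as right_right, but labeled by their left edge
right_left = ticks.resample("5min", closed="right", label="left").sum()

print(left_left)
print(right_right)
print(right_left)
```

With closed='right', the very first timestamp sits on the right edge of a bin of its own, so an extra bin appears at the start; label only changes which edge names each bin, not which values fall into it.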

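Upsampling: the sampling interval goes from 5 minutes to 100 seconds.

# Upsampling introduces missing values by default
print(resampled.resample('100s').asfreq()[:6])
# ffill() fills forward, i.e. with the previous valid value
# bfill() fills backward, i.e. with the next valid value
print(resampled.resample('100s').ffill()[:6])
print(resampled.resample('100s').bfill()[:6])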
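
The same upsampling step on two made-up values five minutes apart makes the inserted gaps easy to see. A 5-minute interval holds three 100-second steps, so two new timestamps appear between the original ones.

```python
import pandas as pd

# Two hypothetical values, five minutes apart.
coarse = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2016-01-01 00:00:00", "2016-01-01 00:05:00"]),
)

up = coarse.resample("100s").asfreq()      # new timestamps are filled with NaN
up_ffill = coarse.resample("100s").ffill() # NaN replaced by the previous valid value
up_bfill = coarse.resample("100s").bfill() # NaN replaced by the next valid value

print(up)
print(up_ffill)
print(up_bfill)
```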

Further analysis of time series data with the statsmodels library

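import statsmodels.api as sm
# mdf comes from the original talk materials and is not defined in this article
dtap = pd.DataFrame(mdf.groupby(mdf.index)['activity'].sum())
# Interpolate missing values
dtap['activity'] = dtap['activity'].interpolate()
res = sm.tsa.seasonal_decompose(dtap['activity'])
resplot = res.plot()
resplot.set_size_inches(15, 15)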
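
To give a feel for what a seasonal decomposition separates, here is a simplified pandas-only sketch of the additive idea (observed = trend + seasonal + residual). It is not the statsmodels implementation and does not reproduce its output; the series, the period of 3, and all variable names are made up for illustration.

```python
import pandas as pd

period = 3
pattern = [1.0, 0.0, -1.0]   # hypothetical seasonal pattern that sums to zero
index = pd.date_range("2016-01-01", periods=12, freq="D")

# Synthetic data: a linear trend (0, 1, 2, ...) plus the repeating pattern
observed = pd.Series(
    [float(t) + pattern[t % period] for t in range(12)], index=index
)

# Trend: centered moving average over one full period (NaN at both ends)
trend = observed.rolling(window=period, center=True).mean()

# Seasonal: average detrended value at each position within the period
detrended = observed - trend
position = pd.Series(range(12), index=index) % period
seasonal = detrended.groupby(position).transform("mean")

# Residual: whatever trend and seasonal component do not explain
residual = observed - trend - seasonal

print(pd.DataFrame({"observed": observed, "trend": trend,
                    "seasonal": seasonal, "residual": residual}))
```

Because the synthetic data contains no noise, the residual is zero (up to floating-point error) wherever the trend is defined; real data would leave a nonzero residual.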

Output: a figure showing the observed, trend, seasonal, and residual components.

A plug for a conference: EuroPython 2017, the largest Python conference in Europe. You are welcome to attend.

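End of the translation.

Original title: "Introduction to Pandas and Time Series Analysis". Author: Alexander C. S. Hendorf. Translator: Li Feng (李烽). Reviewer:

This article is an abridged translation; for more detail, see the original.

PS: The Chinese PDF version is easier to read; see the attachment.

Last updated: 2017-05-15 13:01:26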
