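Time series models: strictly speaking, a time series has four components: trend (T), cycle (C), seasonality (S), and an irregular component. In practice, however, C and S are often treated as roughly the same thing.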
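A time series may contain both T and S, and either one makes the data non-stationary: T makes the mean change over time, and S makes the variance change over time.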
Differencing is a widely used and popular method for making a time series stationary.
"If we fit a stationary model to data, we assume our data are a realization of a stationary process. So our first step in an analysis should be to check whether there is any evidence of a trend or seasonal effects and, if there is, remove them."
——Page 122, Introductory Time Series with R.
“Differencing can help stabilize the mean of the time series by removing changes in the level of a time series, and so eliminating (or reducing) trend and seasonality.”
—— Page 215, Forecasting: Principles and Practice.
First-order difference (lag-1 difference):
$$difference_t = observation_t - observation_{t-1}$$
Inverting the difference:
$$inverted_t = differenced_t + observation_{t-1}$$
# Differencing function
from pandas import Series

def difference(dataset, interval=1):
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return Series(diff)

# Function to invert the difference
def inverse_difference(last_ob, value):
    return value + last_ob
Differencing order | Model | Trend form it suits |
---|---|---|
First-order difference | Linear (degree-1) trend model | $\hat{y}_t = b_0 + b_1 t$ |
Second-order difference | Quadratic (degree-2) trend model | $\hat{y}_t = b_0 + b_1 t + b_2 t^2$ |
Third-order difference | Cubic (degree-3) trend model | $\hat{y}_t = b_0 + b_1 t + b_2 t^2 + b_3 t^3$ |
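To see why the order of differencing lines up with the degree of the trend, here is a minimal sketch in plain Python. The helper `nth_difference` and the quadratic coefficients are illustrative assumptions, not part of the original post. Differencing a degree-2 trend once still leaves a linear trend; differencing it twice leaves a constant.

```python
def nth_difference(series, order=1):
    """Apply lag-1 differencing `order` times (hypothetical helper)."""
    result = list(series)
    for _ in range(order):
        result = [result[i] - result[i - 1] for i in range(1, len(result))]
    return result

# Illustrative quadratic trend: y_t = 2 + 3t + t^2 (coefficients are made up)
quadratic = [2 + 3 * t + t ** 2 for t in range(8)]

first = nth_difference(quadratic, order=1)   # still trending (linear in t)
second = nth_difference(quadratic, order=2)  # constant, so the trend is gone

print(quadratic)
print(first)
print(second)
```

Here `second` contains the same value at every position, which is what a removed trend looks like; a cubic trend would need `order=3` to reach that point.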
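In pandas, the first-order difference is `df = df.diff()`, the second-order difference is `df = df.diff().diff()`, and so on for the n-th order difference. Note that this is not the same as a lag-n difference, which is `df.diff(periods=n)`.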
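Chaining `diff()` and changing the lag are easy to mix up, so a small comparison may help. This is a sketch assuming pandas is installed; the sample values are made up for illustration. `diff().diff()` takes the difference of the differences, while `diff(periods=2)` subtracts the observation two steps back.

```python
import pandas as pd

# Illustrative series with a quadratic shape (values are made up)
s = pd.Series([1, 4, 9, 16, 25, 36])

# Second-order difference: the difference of the lag-1 differences
second_order = s.diff().diff().dropna()

# Lag-2 difference: s[t] - s[t-2]
lag_2 = s.diff(periods=2).dropna()

print(second_order.tolist())
print(lag_2.tolist())
```

Both results lose their first two rows to `NaN` (dropped here with `dropna()`), but the remaining values differ: only the second-order difference flattens the quadratic shape to a constant.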
A trend makes a time series non-stationary because the mean differs from one time period to another. Straight to an example:
# First, define a differencing function
def difference(dataset, interval=1):
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return diff

# Then define a function that inverts the difference
def inverse_difference(last_ob, value):
    return value + last_ob

# Define a dataset with a linear trend
data = [i+1 for i in range(20)]
print(data)
# Apply the differencing function to data
diff = difference(data)
print(diff)
# Invert diff
inverted = [inverse_difference(data[i], diff[i]) for i in range(len(diff))]
print(inverted)

# Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
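The inversion above reads each previous observation from the original data, so the restored list starts at the second value. When only the first observation and the differences are kept, the full series can still be rebuilt by accumulating the differences. A minimal sketch, assuming lag-1 differencing and Python 3.8 or later (for the `initial` argument); the helper name `invert_from_first` is hypothetical:

```python
from itertools import accumulate

def invert_from_first(first_ob, diffs):
    """Rebuild a lag-1 differenced series from its first observation."""
    # accumulate yields first_ob, first_ob + d0, first_ob + d0 + d1, ...
    return list(accumulate(diffs, initial=first_ob))

# Same linear-trend dataset as in the example above
data = [i + 1 for i in range(20)]
diff = [data[i] - data[i - 1] for i in range(1, len(data))]

restored = invert_from_first(data[0], diff)
print(restored == data)
```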
Seasonal variation (S), or seasonality, refers to fluctuations that recur at a regular period over time.
A repeating pattern within each year is known as seasonal variation, although the term is applied more generally to repeating patterns within any fixed period.
—— Page 6, Introductory Time Series with R.
An example:
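from numpy import sin, radians
import matplotlib.pyplot as plt

# Difference each observation against the one `interval` steps earlier
def difference(dataset, interval=1):
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return diff

# Invert the difference by adding back the earlier observation
def inverse_difference(last_ob, value):
    return value + last_ob

# Two identical cycles of a sine wave, one value per degree
data = [sin(radians(i)) for i in range(360)] + [sin(radians(i)) for i in range(360)]
# Seasonal differencing with a lag equal to the period (360)
diff = difference(data, 360)
# Restore the second cycle from the first cycle and the differences
inverted = [inverse_difference(data[i], diff[i]) for i in range(len(diff))]

# Plot the original, differenced, and restored series
fig, axes = plt.subplots(3, 1)
axes[0].plot(data)
axes[0].title.set_text('data')
axes[1].plot(diff)
axes[1].title.set_text('diff')
axes[2].plot(inverted)
axes[2].title.set_text('inverted')
plt.tight_layout()
plt.show()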
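The same result can be checked without a plot. This is a minimal sketch that uses only the standard `math` module and re-declares the two helpers so it runs on its own; the tolerance `1e-12` is an arbitrary choice. It checks that differencing at a lag equal to the period leaves nothing but zeros, and that the inversion gives back the second cycle.

```python
from math import sin, radians, isclose

def difference(dataset, interval=1):
    # Same logic as the helper above, written as a list comprehension
    return [dataset[i] - dataset[i - interval]
            for i in range(interval, len(dataset))]

def inverse_difference(last_ob, value):
    return value + last_ob

# Two identical sine cycles, one value per degree
one_cycle = [sin(radians(i)) for i in range(360)]
data = one_cycle + one_cycle

diff = difference(data, 360)
inverted = [inverse_difference(data[i], diff[i]) for i in range(len(diff))]

# Every seasonal difference should be (numerically) zero
seasonality_removed = all(isclose(d, 0.0, abs_tol=1e-12) for d in diff)
# The inverted values should match the second cycle of the original data
cycle_restored = all(isclose(a, b, abs_tol=1e-12)
                     for a, b in zip(inverted, data[360:]))

print(seasonality_removed, cycle_restored)
```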