Time-Series Forecasting with TensorFlow 2.0

Analyzing Time-Series Data

In the previous lesson, we got familiar with time-series data. In this lesson, we will learn to perform some basic analysis, both statistical and graphical, on time-series data. We will continue with the same DataFrame (df) we used in the previous lesson, but we will analyze only its temperature column (T (degC)). So we can make a new DataFrame from that one column of the original DataFrame as,

# Taking only one column
df_temp = pd.DataFrame(df['T (degC)'])
df_temp.index = date_time 

# Displaying top 5 rows
df_temp.head()

We will now be working only with the df_temp DataFrame in this entire lesson.

Graphical Analysis on Time-Series Data

Data visualization is a very powerful technique for analyzing data, and there are many ways to visualize time-series data. In this lesson, we will learn about Line Plots, Histograms and Density Plots, Box Plots, and Calendar Heatmaps. For this, we will be using the Matplotlib and calmap libraries. So before diving into creating different plots, we need to import the necessary libraries as,

import matplotlib
import matplotlib.pyplot as plt
import calmap

# Defining figure size
matplotlib.rcParams['figure.figsize'] = (12, 6)
%matplotlib inline

1. Line Plots

A line plot is a type of chart that displays information as a series of data points connected by straight lines. It is one of the most commonly used plots for visualizing time-series data. In such plots, time is represented on the x-axis. A line plot can easily be created using the pandas.DataFrame.plot function as,

# Plot all temperature
df_temp.plot()

In the above graph, we plotted the entire data. However, it is sometimes necessary to plot the data for a specific time period. The following example shows how we can do that,

# Plotting for a specific time period
start_date_time = '2009-01-01 01:00:00'
finish_date_time = '2009-02-01 01:00:00'
data = df_temp[start_date_time:finish_date_time]

data['T (degC)'].plot()
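
As a small self-contained illustration of this kind of slicing, the sketch below builds a hypothetical hourly DataFrame (the name demo and its values are made up for the example, they are not the lesson's data) and selects a time window with date strings. It uses .loc, which is the explicit form of the bracket slicing shown above. Note that, unlike ordinary Python slices, label-based slices on a DatetimeIndex include both endpoints.

```python
import pandas as pd

# Hypothetical hourly readings for one day (illustrative values only)
index = pd.date_range('2009-01-01 00:00:00', periods=24, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({'T (degC)': range(24)}, index=index)

# Label-based slicing on a DatetimeIndex keeps both the start and the end row
window = demo.loc['2009-01-01 01:00:00':'2009-01-01 03:00:00']
print(window)
```

The selected window holds the rows for 01:00, 02:00 and 03:00.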

Line plots also help us understand how the pattern in our data varies from one interval to the next (e.g., from year to year). For this, we can group our data by year, and then plot the data of every year as,

# Group values of different years
group_years = df_temp.groupby(df_temp.index.year)
years = pd.DataFrame()

for name, group in group_years:
    values = group['T (degC)'].values
    years[name] = pd.Series(values)

# Plot data, one subplot per year
years.plot(subplots=True, legend=True, figsize=(12,8))

2. Histogram and Density Plot

A histogram is an approximate representation of the distribution of numerical data. Some time-series forecasting methods assume that the data follow a certain distribution (such as the bell curve, i.e., the normal distribution). Plotting a histogram therefore gives us a rough idea of the distribution of our data. Histograms can be plotted using the pandas.Series.hist function as,

# Getting data as pandas.Series
temp = df_temp['T (degC)']

# Plotting the Series
temp.hist()
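
To make clear what a histogram actually computes, here is a minimal sketch on a handful of hypothetical readings (the values and the bin edges are assumptions chosen for the example). It counts how many observations fall into each bin, which is the number a histogram draws as the height of each bar.

```python
import numpy as np

# Hypothetical temperature readings (illustrative values only)
readings = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Bin edges chosen so that each integer value falls in its own bin
edges = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
counts, _ = np.histogram(readings, bins=edges)

# Print a text version of the histogram, one row per bin
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {'#' * count}")
```

The counts rise to a peak in the middle bin and fall off on both sides, which is the rough bell shape we look for when a method assumes normally distributed data.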

Another plot that can provide us with a better idea about our data distribution is a Density Plot. For simplicity, it can be seen as a smoothed version of the histogram plot. It can be created using the pandas.Series.plot function with kind as ‘kde’ (Kernel Density Estimate) as,

# Plotting the density (KDE) of the Series
temp.plot(kind='kde')

3. Box Plots

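Box plots summarize the distribution of our data using a few key statistics: the box spans the first to the third quartile (the interquartile range, IQR), a line inside the box marks the median, the whiskers extend to the most extreme observations within 1.5 × IQR of the box, and points beyond the whiskers are drawn individually as outliers. If you are not familiar with box plots, here is a quick anatomy of a box plot,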

[Figure: anatomy of a box plot]
Source: Quant Girl
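
The quantities a box plot draws can also be computed by hand. The sketch below does so for a few hypothetical readings (made-up values, one of them deliberately extreme), assuming the common convention, also Matplotlib's default, that whiskers reach at most 1.5 × IQR beyond the box.

```python
import numpy as np

# Hypothetical readings with one unusually high value (illustrative only)
values = np.array([2, 4, 5, 6, 7, 8, 9, 10, 30])

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach at most 1.5 * IQR beyond the box; anything further is an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("Q1, median, Q3:", q1, median, q3)
print("Outliers:", outliers.tolist())
```

Here the box runs from 5 to 9 with the median at 7, and only the reading of 30 falls outside the whisker limits.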

We can create box plots for each year by grouping the data by year, just as we did for the yearly line plots,

# Group values of different years
group_years = df_temp.groupby(df_temp.index.year)
years = pd.DataFrame()

for name, group in group_years:
    values = group['T (degC)'].values
    years[name] = pd.Series(values)

# Construct Box Plots
years.boxplot()

From the above box plot, we can see that the observations from different years lie within a similar range. This makes sense, since the temperature at a given place stays within roughly the same range every year.

4. Calendar Heatmap

A calendar heatmap is a plot that shows the intensity of data over the days of a year using color gradients. A darker shade in the heatmap indicates a higher value. We will use the calmap library to create our calendar heatmap as,

import calmap

YEAR = 2010
year_data = df_temp[df_temp.index.year == YEAR]

calmap.calendarplot(year_data['T (degC)'])

Statistical Analysis on Time-Series Data

The statistical approach is another important way of analyzing time-series data. Such analysis involves computing various metrics of the series such as the mean, median, etc. With time-series data, however, we compute these values over rolling windows rather than over the whole series at once.

Rolling windows split the data into time windows. The different windows created overlap and “roll” along at the same frequency as the data, so the transformed time series is at the same frequency as the original time series. Statistical metrics such as mean and median are calculated over only those observations inside the rolling windows.

Let us compute the rolling mean and median over a window size of 48 (corresponding to 48 hours, or 2 days, of observations),

df_temp['rolling_mean'] = df_temp['T (degC)'].rolling(window = 48).mean()
df_temp['rolling_median'] = df_temp['T (degC)'].rolling(window = 48).median()
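
To see what a rolling window does on a scale small enough to check by hand, here is a minimal sketch with a hypothetical six-value series and a window of 3 (both are assumptions for the example, not the lesson's data).

```python
import pandas as pd

# Hypothetical short series (illustrative values only)
series = pd.Series([1, 2, 3, 4, 5, 6])

# Each output value summarizes the current observation and the two before it
rolling_mean = series.rolling(window=3).mean()
print(rolling_mean.tolist())
```

The first two entries are NaN because a full window of three observations is not yet available. In the same way, the first 47 values of the rolling_mean and rolling_median columns computed above are NaN.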

Patterns in Time-Series Data

There can be various patterns underlying time-series data. It is often helpful to split a time series into several components, each representing an underlying pattern category. Such splitting is very helpful for Exploratory Data Analysis (EDA).

One of the most common splitting techniques is to split the data into three components: trend, seasonality, and error terms. A trend is observed when the time series has an increasing or decreasing slope. Seasonality is observed when a distinct pattern repeats at regular intervals due to seasonal factors, such as the month of the year, the day of the month, the day of the week, or even the time of day.

So, a time series may be imagined as a combination of the trend, seasonality, and error terms. We can use the statsmodels library in Python to decompose the time series into these three components. As we have 24 rows of data for each day (a reading is taken every hour), we will divide the length of the DataFrame by 24 and pass the result as the period of the decomposition (this argument was called freq in older statsmodels releases).

from statsmodels.tsa.seasonal import seasonal_decompose

# `period` replaces the older `freq` argument in recent statsmodels releases
result = seasonal_decompose(df_temp['T (degC)'], model='additive', period=len(df_temp)//24)
fig = result.plot()
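
To make the idea of an additive decomposition concrete, the sketch below splits a synthetic series by hand using a classical moving-average approach. It is only an illustration under stated assumptions, not the statsmodels implementation: the series is made up (a linear trend plus a noise-free daily cycle), and the period of 24 is an assumed cycle length, meaning 24 hourly observations per repetition of the pattern.

```python
import numpy as np
import pandas as pd

period = 24  # assumed cycle length: 24 hourly readings per day
t = np.arange(period * 10)  # ten hypothetical days

# Synthetic series = linear trend + daily cycle (no noise, for clarity)
true_trend = 0.01 * t
true_seasonal = 5 * np.sin(2 * np.pi * t / period)
series = pd.Series(true_trend + true_seasonal)

# Trend: centered moving average spanning one full cycle
# (a 2 x 24 average, because the period is even)
trend = series.rolling(period).mean().rolling(2).mean().shift(-(period // 2))

# Seasonality: average detrended value at each position in the cycle
detrended = series - trend
seasonal_means = detrended.groupby(t % period).mean()
seasonal_means -= seasonal_means.mean()
seasonal = pd.Series(seasonal_means.to_numpy()[t % period], index=series.index)

# Error: whatever the trend and seasonality do not explain
residual = series - trend - seasonal
print(residual.dropna().abs().max())
```

Because the synthetic series contains no noise, the residual is zero up to floating-point error. The trend is undefined (NaN) for the first and last half-cycle, where a full centered window is not available.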

Stationary and Non-Stationary Time-Series Data

A stationary time series is one whose values are not a function of time, i.e., statistical properties of the series such as its mean and variance are constant over time.

Most statistical forecasting methods are designed to work on a stationary time series. So it is generally suggested to check the series for stationarity before forecasting.

How to test for stationarity of a time series data?

The stationarity of a time series can be checked using a family of statistical tests called unit root tests. There are several variants of the unit root test. In this lesson, we will cover one of the most commonly used variants, the Augmented Dickey-Fuller test (ADF test).

The ADF test assumes two hypotheses,

  • Null Hypothesis: The series has a unit root, i.e., it is not stationary.
  • Alternative Hypothesis: The series has no unit root, i.e., it is stationary.

Then, the p-value is computed and checked against the significance level (0.05). If the p-value of the ADF test is less than the significance level, we reject the null hypothesis and conclude that the series is stationary. Otherwise, we fail to reject the null hypothesis and treat the series as non-stationary.

The following block of code illustrates how we can perform the ADF test using Python,

from statsmodels.tsa.stattools import adfuller

def adf_check(time_series):
    # adfuller returns the test statistic, p-value, lags used, observations used, ...
    result = adfuller(time_series)
    print("Augmented Dickey-Fuller Test")
    labels = ['ADF Test Statistic', 'p-value', '# of lags', '# of observations used']
    for value, label in zip(result, labels):
        print(label + " : " + str(value))
    if result[1] <= 0.05:
        print("Strong evidence against null hypothesis.\nReject Null Hypothesis.\nData has no unit root and is stationary.")
    else:
        print("Weak evidence against null hypothesis.\nFail to reject Null Hypothesis.\nData has a unit root and is non-stationary.")

# Performing adf test for the first 10,000 rows of data (to save computational time)
adf_check(df_temp['T (degC)'][:10000])

Augmented Dickey-Fuller Test
ADF Test Statistic : -3.4650592619838014
p-value : 0.008931120463794082
# of lags : 38
# of observations used : 9961
Strong evidence against null hypothesis.
Reject Null Hypothesis.
Data has no unit root and is stationary.

How to make a series stationary?

In the above example, our series was stationary. But what if the series were non-stationary? In such cases, we generally transform the series into a stationary one.

One of the most common approaches for making a series stationary is Differencing. In this method, we compute the difference of consecutive terms in the series. It is typically performed to get rid of the varying mean.

If Y_t is the value at time t, then the first difference at time t is Y_t - Y_(t-1). In simpler terms, differencing the series is nothing but subtracting the current value from the next value.

For example, consider the following series: [5, 8, 2, 1, 10].

We can perform differencing on the series by subtracting each value from the one that follows it as,

[8-5, 2-8, 1-2, 10-1] = [3, -6, -1, 9]
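
The same computation can be sketched with pandas, using the example series above. The variable names here are chosen only for the illustration.

```python
import pandas as pd

series = pd.Series([5, 8, 2, 1, 10])

# diff() subtracts the previous value from each value; the first entry
# has no predecessor, so it is NaN and is dropped here
first_difference = series.diff().dropna()
print(first_difference.tolist())  # [3.0, -6.0, -1.0, 9.0]
```

For a non-stationary series, the differenced result can then be passed to the adf_check function defined earlier to see whether one round of differencing was enough.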

With this we have come to the end of this chapter. From the next chapter onward, we will start working on building our forecasting model!
