Time-Series Forecasting with TensorFlow 2.0

Getting Started with Time-Series Data

Before we dive into time-series forecasting, we first need to be familiar with time-series data and how we can manipulate it using Python.

In this lesson, we will learn to perform data pre-processing, data visualization, feature engineering, and training/validation/testing splits on time-series data.

1. Importing necessary libraries

First, let us import some essential Python libraries that will be used later in this chapter for manipulating time-series data.

import os
import datetime

# For data manipulation
import numpy as np
import pandas as pd

# For data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# For building model and loading dataset
import tensorflow as tf

# Set basic configurations
mpl.rcParams['figure.figsize'] = (8, 6)
mpl.rcParams['axes.grid'] = False
%matplotlib inline

Now that we’ve imported the necessary libraries, let us import and visualize the dataset.

2. Importing and visualizing the dataset

The dataset that we will be using for this tutorial is the Jena weather time-series dataset, which contains 14 different features such as humidity and air temperature. The readings are recorded at 10-minute intervals.

We will be using the get_file() function from tf.keras.utils to download the dataset and extract the CSV file from the archive.

zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)
csv_path, _ = os.path.splitext(zip_path)

It may take about 20-60 seconds to download the dataset, depending on your internet connection. After downloading, let us load the data and have a look at its first five rows.

# Reading in the dataset
df = pd.read_csv(csv_path)

# Looking at the first five rows of the DataFrame
df.head()

As we can see in the Date Time column, the data is recorded at 10-minute intervals. However, to make this tutorial easier to digest in a single go, we will sub-sample the data into 1-hour intervals rather than 10-minute ones.

# slice [start:stop:step], starting from index 5 take every 6th record.
df = df[5::6]

# Store the datetime values in a separate variable for future processing
date_time = pd.to_datetime(df.pop('Date Time'), format='%d.%m.%Y %H:%M:%S')

# Looking at the first five rows of the DataFrame
df.head()

As we can see, we have successfully sub-sampled our original dataset. You can apply a similar technique to sub-sample other time-series data you may encounter in the future. Let us now visualize how some of the features evolve over time.
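If your DataFrame has a DatetimeIndex, pandas also offers resample() as an alternative to positional slicing: instead of keeping one reading per hour, it aggregates all readings within each hour. A minimal sketch on synthetic toy data (not the Jena dataset):

```python
import numpy as np
import pandas as pd

# Synthetic 10-minute series covering one day (toy data for illustration)
idx = pd.date_range('2020-01-01', periods=144, freq='10min')
toy = pd.DataFrame({'T (degC)': np.arange(144, dtype=float)}, index=idx)

# Positional slicing, like df[5::6] above: keep every 6th record
sliced = toy[5::6]

# Resampling: average the six readings within each hour instead
hourly = toy.resample('1h').mean()

print(len(sliced), len(hourly))  # both yield 24 hourly rows
```

Slicing keeps instantaneous readings, while resampling smooths short-lived spikes; which is preferable depends on what the model should learn.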

plot_cols = ['T (degC)', 'p (mbar)', 'rho (g/m**3)'] # Columns we want to plot
plot_features = df[plot_cols] # Getting the columns
plot_features.index = date_time # Setting the index as date time
_ = plot_features.plot(subplots=True)

Since our dataset has a huge number of rows, the line plots drawn above can be somewhat difficult to read. To take a closer look at the data, we can also plot only the first few data points, as shown below.

# Taking only first 480 points
plot_features = df[plot_cols][:480]
plot_features.index = date_time[:480]

# Plotting
_ = plot_features.plot(subplots=True)
3. Data cleaning

Now, we are going to clean the time-series data. For that, let us have a look at the statistical values of the dataset that we are working with.

df.describe().transpose()

As we can see, the minimum value of the wind velocity columns, wv (m/s) and max. wv (m/s), is -9999, which is clearly a placeholder for missing readings rather than a real measurement. Let us replace it with zeroes.

# Getting indices of wv and max. wv with value -9999
bad_wv = df['wv (m/s)'] == -9999.0
bad_max_wv = df['max. wv (m/s)'] == -9999.0

# Replacing the incorrect values with 0.0
df.loc[bad_wv,'wv (m/s)']  = 0.0
df.loc[bad_max_wv, 'max. wv (m/s)']  = 0.0

# Checking if the above inplace edits are reflected in the DataFrame
df['wv (m/s)'].min()
0.0
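For a single sentinel value like this, pandas' replace() is a more compact equivalent of the boolean-mask approach above. A quick sketch on a toy frame (illustrative values, not the real dataset):

```python
import pandas as pd

# Toy frame containing the same -9999.0 sentinel
toy = pd.DataFrame({'wv (m/s)': [1.2, -9999.0, 0.5],
                    'max. wv (m/s)': [2.0, 3.1, -9999.0]})

# Replace every occurrence of the sentinel across all columns
cleaned = toy.replace(-9999.0, 0.0)

print(cleaned['wv (m/s)'].min())  # 0.0
```

The boolean-mask version is still useful when only specific columns should be touched, since replace() as written scans the whole frame.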
4. Feature Engineering

To build an accurate model, we should spend some time on feature engineering by converting the data into appropriate formats. In this section, we will learn how to perform feature engineering on time-series data.

Let us convert the wind direction and wind velocity columns to a vector with x and y components.

wv = df.pop('wv (m/s)')
max_wv = df.pop('max. wv (m/s)')

# Convert to radians
wd_rad = df.pop('wd (deg)')*np.pi / 180

# Calculate the wind x and y components
df['Wx'] = wv*np.cos(wd_rad)
df['Wy'] = wv*np.sin(wd_rad)

# Calculate the max wind x and y components
df['max Wx'] = max_wv*np.cos(wd_rad)
df['max Wy'] = max_wv*np.sin(wd_rad)

We can also convert the datetime feature into multiple features.

timestamp_s = date_time.map(datetime.datetime.timestamp)

day = 24*60*60
year = (365.2425)*day

df['Day sin'] = np.sin(timestamp_s * (2 * np.pi / day))
df['Day cos'] = np.cos(timestamp_s * (2 * np.pi / day))
df['Year sin'] = np.sin(timestamp_s * (2 * np.pi / year))
df['Year cos'] = np.cos(timestamp_s * (2 * np.pi / year))
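These sine/cosine features encode "time of day" and "time of year" as points on a circle, so timestamps exactly one period apart map to identical feature values. A small check with an arbitrary timestamp:

```python
import numpy as np

day = 24 * 60 * 60  # seconds in a day, as above

# Arbitrary Unix timestamp; any value illustrates the point
t = 1_600_000_000

# Two timestamps exactly one day apart land on the same point of the circle
a = np.sin(t * (2 * np.pi / day))
b = np.sin((t + day) * (2 * np.pi / day))

print(np.isclose(a, b))  # True
```

This is why we use a sin/cos pair rather than the raw hour: 23:00 and 00:00 end up close together in feature space, as they should be.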
5. Splitting and normalizing the data

For evaluating the predictions of our forecasting model, we need to split the data into training, validation, and testing sets. We'll be performing a 70:20:10 split, where the first 70% of the data will be our training set, the next 20% our validation set, and the final 10% our testing set.

# Dictionary of column names and their indices, i.e., assigning indices to column names
column_indices = {name: i for i, name in enumerate(df.columns)}

# Number of rows
n = len(df)

#  Splitting the dataset with a 70:20:10 split
train_df = df[0:int(n*0.7)] # From 0% to 70%
val_df = df[int(n*0.7):int(n*0.9)] # From 70% to 90%
test_df = df[int(n*0.9):] # All above 90%

# Number of features in our dataset
num_features = df.shape[1]
print(f'Total number of features: {num_features}')
Total number of features: 15

Next, let us normalize the data using the mean and standard deviation of the training dataset.

train_mean = train_df.mean()
train_std = train_df.std()

train_df = (train_df - train_mean) / train_std
val_df = (val_df - train_mean) / train_std
test_df = (test_df - train_mean) / train_std
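Because we normalized with the training statistics, predictions made in this normalized space must be mapped back to physical units by inverting the transform. A minimal sketch with made-up mean/std values for a single column:

```python
import pandas as pd

# Hypothetical training statistics for one column (illustrative values)
train_mean_toy = pd.Series({'T (degC)': 9.0})
train_std_toy = pd.Series({'T (degC)': 8.0})

# Hypothetical model outputs in normalized units
pred_norm = pd.DataFrame({'T (degC)': [0.5, -0.25]})

# Invert the normalization: x = x_norm * std + mean
pred = pred_norm * train_std_toy + train_mean_toy

print(pred['T (degC)'].tolist())  # [13.0, 7.0]
```

Note that the validation and test sets are also scaled with the training mean and standard deviation; using their own statistics would leak information from those sets into the preprocessing.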

A violin plot will help us understand the distribution of the normalized features.

df_std = (df - train_mean) / train_std
df_std = df_std.melt(var_name='Column', value_name='Normalized')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalized', data=df_std)
_ = ax.set_xticklabels(df.keys(), rotation=90)

Thus, in this chapter, we got familiar with data pre-processing, data visualization, feature engineering, and training/validation/testing splits on time-series data. In the next lesson, you will be introduced to the basics of time-series forecasting.
