Getting Started

The purpose of this guide is to introduce some of the main features of pyWATTS. It assumes a basic knowledge of data science and machine learning principles.

We will work through the steps of creating a pipeline, adding modules, running the pipeline, and inspecting the results, all based on the example.py pipeline. The data used in this guide is available through the Open Power System Data Portal. We use load time series for Germany from various sources for the year 2018.

Initial Imports

Before we start creating the pipeline, we need to import the pipeline module from the pyWATTS core.

from pywatts.core.pipeline import Pipeline

We also need to import all the modules we plan to add to our pipeline, as well as any external scikit-learn modules we will be using.

# Other modules required for the pipeline are imported
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

# From pyWATTS the pipeline is imported
from pywatts.callbacks import LinePlotCallback
from pywatts.core.computation_mode import ComputationMode
from pywatts.core.pipeline import Pipeline
# All modules required for the pipeline are imported
from pywatts.modules import CalendarExtraction, CalendarFeature, Select, LinearInterpolater, SKLearnWrapper
from pywatts.summaries import RMSE

With the modules imported, we can now work on building the pipeline.

Creating The Pipeline

We create the pipeline in the main function of example.py. The very first step is to create the pipeline and specify the path.

pipeline = Pipeline(path="results")

It is essential to specify the path since a time-stamped folder with all the outputs from the pipeline will be generated and saved in this location every time we run the pipeline.

Now that the pipeline exists, we can add in modules.

Dummy Calendrical Features

Often we require dummy calendrical features, such as month, weekday, hour, and whether or not the day is a weekend, for forecasting problems. The CalendarExtraction module is able to extract these features. Since this is the first module in our pipeline, we do not have to worry about defining the preceding module. However, we must specify the column of the dataset which should be used as input for the module. To achieve this, we use round brackets with the pipeline name inside and square brackets around the column name: (x=pipeline["load_power_statistics"]).

calendar = CalendarExtraction(continent="Europe",
                              country="Germany",
                              features=[CalendarFeature.month, CalendarFeature.weekday, CalendarFeature.weekend]
                             )(x=pipeline["load_power_statistics"])

When we define a CalendarExtraction module, we need to choose what encoding to use. In this case, we choose the continent and the country that are used to calculate public holidays. This is particularly important for public holidays that only exist in certain parts of the world (e.g. Thanksgiving). The extracted features are the numerically encoded month, the weekday, and whether the day falls on a weekend.
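
The exact set of available features depends on the CalendarFeature values in your pyWATTS version. As a sketch, a variant that additionally extracts the hour could look as follows (CalendarFeature.hour is an assumption; check the enum in your installation):

# Sketch: a variant with an additional feature. CalendarFeature.hour is
# assumed to exist; verify against the CalendarFeature enum of your version.
calendar_with_hour = CalendarExtraction(
    continent="Europe",
    country="Germany",
    features=[CalendarFeature.month, CalendarFeature.weekday,
              CalendarFeature.weekend, CalendarFeature.hour]
)(x=pipeline["load_power_statistics"])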

Linear Interpolation

The next module we include deals with missing values by filling them through linear interpolation.

imputer_power_statistics = LinearInterpolater(method="nearest",
                                              dim="time",
                                              name="imputer_power")(x=pipeline["load_power_statistics"])

The parameters here (method and dim) are related to the scipy interpolate method which is used inside the module. As before, we need to correctly place the linear interpolator in the pipeline. This example takes the column "load_power_statistics" from the input data. Consequently, we specify the input by (x=pipeline["load_power_statistics"]) again.
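
To build intuition for what the imputer does, the following pipeline-independent sketch fills a gap with xarray's interpolate_na, which takes similar method and dim arguments (this uses plain xarray, not the pyWATTS module itself):

# Minimal illustration (outside the pipeline) of interpolation-based gap
# filling: the missing value is reconstructed from its neighbours.
import numpy as np
import xarray as xr

raw = xr.DataArray([1.0, np.nan, 5.0],
                   dims="time",
                   coords={"time": pd.date_range("2018-01-01", periods=3, freq="h")})
filled = raw.interpolate_na(dim="time", method="linear")
print(filled.values)  # [1. 3. 5.]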

Scaling

It is also possible to integrate scikit-learn modules directly into the pipeline. We achieve this by using the SKLearnWrapper:

power_scaler = SKLearnWrapper(module=StandardScaler(), name="scaler_power")
scale_power_statistics = power_scaler(x=imputer_power_statistics)

Here we use the wrapper to include a scikit-learn StandardScaler in the pipeline. In the second line, we apply the StandardScaler to the imputed load time series, resulting in a normalised time series.
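
The same wrapper pattern works for other scikit-learn transformers. For instance, a MinMaxScaler could be used instead of the StandardScaler; this is only a sketch and not part of example.py:

# Sketch: wrapping a different scikit-learn transformer with SKLearnWrapper.
from sklearn.preprocessing import MinMaxScaler

minmax_scaler = SKLearnWrapper(module=MinMaxScaler(), name="minmax_power")
scale_power_minmax = minmax_scaler(x=imputer_power_statistics)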

Creating Lags

Often in time-series analysis, we want to consider time lags, i.e. shifting the time series back by one or more values. In pyWATTS, we use the Select module to perform this task.

lag_features = Select(start=-1, stop=1, step=1)(x=scale_power_statistics)

In the above example, we create a sampled time series with two values for each time step (the past value and the current value). The input for this module is the same scaled time series from above. When we use several modules of the same type (here two Select modules), it is highly advisable to name them. Without a user-defined name, there will be a naming conflict in the pipeline. pyWATTS automatically renames the modules to avoid this conflict and you receive a warning message, but we advise avoiding this situation, as shown in the sketch below.
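
For example, two Select modules with explicit, distinct names could be created as follows (a sketch with hypothetical names lag_one and lag_two):

# Sketch: explicit names avoid pyWATTS having to rename one module
# automatically when two modules of the same type are present.
lag_one = Select(start=-1, stop=0, step=1, name="lag_one")(x=scale_power_statistics)
lag_two = Select(start=-2, stop=-1, step=1, name="lag_two")(x=scale_power_statistics)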

Creating multiple targets

For every hour, we want to predict the values for the next 24 hours. We use the Select module to create windows containing 24 values.

target_multiple_output = Select(start=1, stop=25, step=1, name="sampled_data")(x=scale_power_statistics)

Selecting features

We use the scikit-learn wrapper around the SelectKBest module to automatically select useful features.

selected_features = SKLearnWrapper(
    module=SelectKBest(score_func=f_regression, k=2)
)(
    lag_features=lag_features,
    calendar=calendar,
    target=scale_power_statistics,
)

Linear Regression

We also use the scikit-learn wrapper for linear regression. The implementation is, however, slightly different.

regressor_power_statistics = SKLearnWrapper(
    module=LinearRegression(fit_intercept=True)
)(
    features=selected_features,
    target=target_multiple_output,
    callbacks=[LinePlotCallback("linear_regression")]
)

First, we see that standard scikit-learn parameters can be adjusted directly inside the scikit-learn constructor. Here, for example, we have set the fit_intercept parameter to True. Furthermore, a linear regression can have more than one input and also requires a target for fitting. Therefore, we include the inputs as keyword arguments. Additional features could be added by using additional keywords. Note that all keyword arguments that start with target are considered target variables by pyWATTS.
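
The following sketch restates this keyword convention with the variables from this guide; the input names lags and calendar are arbitrary, while the keyword target marks the target variable:

# Sketch of the keyword convention: input keywords can be named freely,
# while every keyword starting with "target" is treated as a target.
regressor_variant = SKLearnWrapper(
    module=LinearRegression(fit_intercept=True)
)(
    lags=lag_features,              # input (arbitrary keyword name)
    calendar=calendar,              # input (arbitrary keyword name)
    target=target_multiple_output,  # target (keyword starts with "target")
)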

Rescaling

Before performing the linear regression, we normalised the time series with a scikit-learn module. To transform the predictions from the linear regression back to the original scale, we need to call the scaler a second time and ensure we use the inverse transformation.

inverse_power_scale = power_scaler(x=regressor_power_statistics,
                                   computation_mode=ComputationMode.Transform,
                                   use_inverse_transform=True,
                                   callbacks=[LinePlotCallback("rescale")])

We also set computation_mode=ComputationMode.Transform for this inverse transformation to work. If this is not set, then the scaler will automatically fit itself to the new scaled dataset, and the inverse transformation will be useless. Moreover, we can use callbacks for visualizing or writing the results into files.

Root Mean Squared Error

To measure the accuracy of our regression model, we can calculate the root mean squared error (RMSE).

rmse = RMSE()(y_hat=inverse_power_scale, y=target_multiple_output)

The target variable is determined by the keyword y. All other keyword arguments are considered predictions.
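
Since every keyword other than y counts as a prediction, several forecasts can be scored against the same target in one call. A sketch (the name parameter is an assumption; the second prediction is for illustration only):

# Sketch: comparing two predictions against the same target in one summary.
rmse_compare = RMSE(name="rmse_compare")(
    y=target_multiple_output,
    y_hat_rescaled=inverse_power_scale,
    y_hat_raw=regressor_power_statistics,  # illustration only
)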

Executing, Saving and Loading the Pipeline

With the desired modules added to the pipeline, we can now train and test it. We do this by calling the train method or test method. Both methods require some input data. Therefore, we read some data with [pandas](https://pandas.pydata.org/) or [xarray](http://xarray.pydata.org/en/stable/index.html) and split it into a train and a test set.

data = pd.read_csv("../data/getting_started_data.csv",
                   index_col="time",
                   parse_dates=["time"],
                   infer_datetime_format=True,
                   sep=",")
train = data.iloc[:6000, :]
pipeline.train(data=train)

test = data.iloc[6000:, :]
pipeline.test(data=test)

The above code snippet not only starts the pipeline, thereby saving the results in the results folder, but also generates a graphical representation of the pipeline. This enables us to see how the data flows through the pipeline and to check whether everything is set up as planned.

We can now save the pipeline to a folder:

pipeline.to_folder("./pipe_getting_started")

Saving the pipeline generates a series of JSON and pickle files so that the same pipeline can be reloaded at any point in the future to check results. An example is shown below:

pipeline2 = Pipeline()
pipeline2.from_folder("./pipe_getting_started")

Here, we create a new pipeline and use it to load the information from the original pipeline.
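
The reloaded pipeline can then be used just like the original one; for example, we could run the test data through it again (a sketch, assuming the pipeline was saved after training):

# Sketch: re-running the test data through the reloaded pipeline.
pipeline2.test(data=test)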

Warning

Sometimes from_folder uses unpickling to load modules. Note that unpickling is not safe. Consequently, only load pipelines you trust with from_folder. For more details about pickling, see https://docs.python.org/3/library/pickle.html

Results

All results are saved in the results folder specified when creating the pipeline. Inside it, a time-stamped subfolder is generated automatically for each run. In this folder, we find the following items:

  • linear_regression_target.png: A plot of the 24 training targets against time.
  • linear_regression_target_2.png: A plot of the 24 test targets against time.
  • rescale_scaler_power.png: A plot of the 24 rescaled predictions on the training set against time.
  • rescale_scaler_power_2.png: A plot of the 24 rescaled predictions on the test set against time.
  • summary.md: A summary of the training run, including the RMSE and runtimes.
  • summary_2.md: A summary of the test run, including the RMSE and runtimes.

Furthermore, pickle and json files containing information about the pipeline can be found in the folder pipe_getting_started.

Summary

This guide has provided an elementary introduction into pyWATTS. For more information, consider working through the other examples provided or reading the documentation.

For further information on how to use pyWATTS, please have a look at the How to use? guide.