.. _gettingstarted:

Getting Started
===============

The purpose of this guide is to introduce some of the main features of `pyWATTS`. It assumes a
basic knowledge of data science and machine learning principles. We will work through the steps of
creating a pipeline, adding modules, running the pipeline, and finding the results, all based on
the ``example.py`` pipeline.

The data used in this guide is available through the
`Open Power System Data Portal <https://open-power-system-data.org>`_. We use load time series for
Germany from various sources for the year 2018.

Initial Imports
***************

Before we start creating the pipeline, we need to import the pipeline module from the `pyWATTS`
core.

.. code-block:: python

    from pywatts.core.pipeline import Pipeline

We also need to import all the modules we plan on adding to our pipeline as well as any external
scikit-learn modules we will be using.

.. code-block:: python

    # Other modules required for the pipeline are imported
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_regression

    # From pyWATTS the pipeline is imported
    from pywatts.callbacks import LinePlotCallback
    from pywatts.core.computation_mode import ComputationMode
    from pywatts.core.pipeline import Pipeline

    # All modules required for the pipeline are imported
    from pywatts.modules import CalendarExtraction, CalendarFeature, Select, LinearInterpolater, SKLearnWrapper
    from pywatts.summaries import RMSE

With the modules imported, we can now work on building the pipeline.

Creating The Pipeline
*********************

We create the pipeline in the ``main`` function of ``example.py``. The very first step is to
create the pipeline and specify the path.

.. code-block:: python

    pipeline = Pipeline(path="results")

It is essential to specify the path since a time-stamped folder with all the outputs from the
pipeline will be generated and saved in this location every time we run the pipeline.

Now that the pipeline exists, we can add modules.

**Dummy Calendrical Features**

For forecasting problems, we often require dummy calendrical features, such as the month, weekday,
hour, and whether or not the day is a weekend. The ``CalendarExtraction`` module is able to
extract these features. Since this is the first module in our pipeline, we do not have to worry
about defining the preceding module. However, we must specify the column of the dataset which
should be used as input for the module. Therefore, we use round brackets containing the pipeline
name and square brackets containing the column name to achieve this:
``(x=pipeline["load_power_statistics"])``.

.. code-block:: python

    calendar = CalendarExtraction(continent="Europe",
                                  country="Germany",
                                  features=[CalendarFeature.month, CalendarFeature.weekday,
                                            CalendarFeature.weekend]
                                  )(x=pipeline["load_power_statistics"])

When we define a ``CalendarExtraction`` module, we need to choose which encoding to use. In this
case, we also choose the continent and the country that are used to calculate public holidays.
This is particularly important for public holidays that only exist in certain parts of the world
(e.g. Thanksgiving). The extracted features are the numerically encoded month, weekday, and
weekend.
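If we require more calendrical information, we can simply extend the feature list. The following
sketch is illustrative only: it assumes that the ``CalendarFeature`` enum also provides ``hour``
and ``holiday`` members (plausible, since the module extracts the hour and computes public
holidays from the chosen country); check the API documentation for the exact names.

.. code-block:: python

    # A hedged sketch: extracting additional calendar features.
    # CalendarFeature.hour and CalendarFeature.holiday are assumed
    # member names; consult the pyWATTS API reference to confirm them.
    calendar_extended = CalendarExtraction(
        continent="Europe",
        country="Germany",
        features=[CalendarFeature.month, CalendarFeature.weekday,
                  CalendarFeature.weekend, CalendarFeature.hour,
                  CalendarFeature.holiday],
        name="calendar_extended",
    )(x=pipeline["load_power_statistics"])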
**Linear Interpolation**

The next module we include deals with missing values by filling them through linear interpolation.

.. code-block:: python

    imputer_power_statistics = LinearInterpolater(method="nearest",
                                                  dim="time",
                                                  name="imputer_power"
                                                  )(x=pipeline["load_power_statistics"])

The parameters here (``method`` and ``dim``) are related to the *scipy* ``interpolate`` method
which is used inside the module. As before, we need to correctly place the linear interpolator in
the pipeline. This example takes the column ``load_power_statistics`` from the input data.
Consequently, we specify the input by ``(x=pipeline["load_power_statistics"])`` again.

**Scaling**

It is also possible to integrate scikit-learn modules directly into the pipeline. We achieve this
by using the ``SKLearnWrapper``:

.. code-block:: python

    power_scaler = SKLearnWrapper(module=StandardScaler(), name="scaler_power")
    scale_power_statistics = power_scaler(x=imputer_power_statistics)

Here, we use the wrapper to include a scikit-learn ``StandardScaler`` in the pipeline. In the
second line, we apply the ``StandardScaler`` to the imputed load time series, resulting in a
normalised time series.

**Creating Lags**

Often in time series analysis, we want to consider time lags, i.e. the time series shifted back by
one or more values. In `pyWATTS`, we use the ``Select`` module to perform this task.

.. code-block:: python

    lag_features = Select(start=-1, stop=1, step=1)(x=scale_power_statistics)

In the above example, we create a sampled time series with two values for each time step (the past
value and the current value). The input for this module is the same scaled time series from above.

When we use multiple modules of the same type (here, two ``Select`` modules), it is highly
advisable to name them. Without a user-defined name, there will be a name conflict in the
pipeline. `pyWATTS` automatically changes the name to avoid this conflict and you receive a
warning message, but we advise avoiding this situation altogether.

**Creating Multiple Targets**

For every hour, we want to predict the values for the next 24 hours. We use the ``Select`` module
to create windows containing these 24 values.

.. code-block:: python

    target_multiple_output = Select(start=1, stop=25, step=1,
                                    name="sampled_data")(x=scale_power_statistics)

**Selecting Features**

We use the ``SKLearnWrapper`` around the scikit-learn module ``SelectKBest`` to automatically
select useful features.

.. code-block:: python

    selected_features = SKLearnWrapper(
        module=SelectKBest(score_func=f_regression, k=2)
    )(
        lag_features=lag_features,
        calendar=calendar,
        target=scale_power_statistics,
    )

**Linear Regression**

We also use the ``SKLearnWrapper`` for the linear regression. The usage is, however, slightly
different.

.. code-block:: python

    regressor_power_statistics = SKLearnWrapper(
        module=LinearRegression(fit_intercept=True)
    )(
        features=selected_features,
        target=target_multiple_output,
        callbacks=[LinePlotCallback("linear_regression")]
    )

First, we see that standard scikit-learn parameters can be adjusted directly inside the
scikit-learn constructor. Here, for example, we have set the ``fit_intercept`` parameter to
``True``. Furthermore, a linear regression can have more than one input and also requires a target
for fitting. Therefore, we include the inputs as keyword arguments. Additional features could be
added by using additional keywords. Note that all keyword arguments that start with *target* are
considered as target variables by pyWATTS.
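To make this keyword convention concrete, the following hypothetical sketch passes a second input
next to ``features`` and renames the target keyword. The names ``example_regression``,
``calendar``, and ``target_load`` are arbitrary illustrative choices, not fixed API names; only
the *target* prefix carries meaning.

.. code-block:: python

    # A hedged sketch of the keyword convention: every keyword argument
    # whose name starts with "target" is treated as a target variable;
    # all remaining keywords are treated as inputs. The names themselves
    # are arbitrary identifiers chosen for this illustration.
    example_regressor = SKLearnWrapper(
        module=LinearRegression(), name="example_regression"
    )(
        features=selected_features,          # input
        calendar=calendar,                   # an additional input
        target_load=target_multiple_output,  # target: name starts with "target"
    )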
**Rescaling**

Before we performed the linear regression, we normalised the time series with a scikit-learn
module. To transform the predictions from the linear regression back to the original scale, we
need to call the scaler a second time and ensure that we use the inverse transformation.

.. code-block:: python

    inverse_power_scale = power_scaler(x=regressor_power_statistics,
                                       computation_mode=ComputationMode.Transform,
                                       use_inverse_transform=True,
                                       callbacks=[LinePlotCallback('rescale')])

We also have to set ``computation_mode=ComputationMode.Transform`` for this inverse transformation
to work. If this is not set, the scaler will automatically fit itself to the new scaled dataset,
and the inverse transformation will be useless. Moreover, we can use callbacks for visualising the
results or writing them to files.

**Root Mean Squared Error**

To measure the accuracy of our regression model, we can calculate the root mean squared error
(RMSE).

.. code-block:: python

    rmse = RMSE()(y_hat=inverse_power_scale, y=target_multiple_output)

The target variable is determined by the keyword ``y``. All other keyword arguments are considered
as predictions.

Executing, Saving and Loading the Pipeline
******************************************

With the desired modules added to the pipeline, we can now train and test it. We do this by
calling the ``train`` method or the ``test`` method. Both methods require some input data.
Therefore, we read some data with `pandas <https://pandas.pydata.org/>`_ or
`xarray <http://xarray.pydata.org/en/stable/index.html>`_ and split it into a train and a test
set.

.. code-block:: python

    data = pd.read_csv("../data/getting_started_data.csv",
                       index_col="time",
                       parse_dates=["time"],
                       infer_datetime_format=True,
                       sep=",")

    train = data.iloc[:6000, :]
    pipeline.train(data=train)

    test = data.iloc[6000:, :]
    pipeline.test(data=test)

The above code snippet not only starts the pipeline and thereby saves the results in the
``results`` folder, but also generates a graphical representation of the pipeline. This enables us
to see how the data flows through the pipeline and to check whether everything is set up as
planned.

We can now save the pipeline to a folder:

.. code-block:: python

    pipeline.to_folder("./pipe_getting_started")

Saving the pipeline generates a series of *json* and *pickle* files so that the same pipeline can
be reloaded at any point in the future to check the results. We see an example below:

.. code-block:: python

    pipeline2 = Pipeline()
    pipeline2.from_folder("./pipe_getting_started")

Here, we create a new pipeline and use it to load the information from the original pipeline.

.. warning::
    Sometimes ``from_folder`` uses unpickling for loading modules. Note that unpickling is not
    safe. Consequently, only load pipelines you trust with ``from_folder``.
    For more details about pickling, see https://docs.python.org/3/library/pickle.html.
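To check that the reloaded pipeline behaves like the original one, we can run it on the held-out
data again. This is a minimal sketch; it assumes the ``test`` DataFrame from the split above is
still in scope.

.. code-block:: python

    # A hedged sketch: run the reloaded pipeline on the held-out test data.
    # Its outputs should match those produced by the original pipeline.
    pipeline2.test(data=test)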
Results
*******

All results are saved in the ``results`` folder specified when creating the pipeline. Inside it,
another folder with a time stamp indicating when the pipeline was executed is automatically
generated each time the pipeline is run. In this folder, we find the following items:

- *linear_regression_target.png*: A plot of the 24 training targets against time.
- *linear_regression_target_2.png*: A plot of the 24 test targets against time.
- *rescale_scaler_power.png*: A plot of the 24 rescaled predictions on the training set against time.
- *rescale_scaler_power_2.png*: A plot of the 24 rescaled predictions on the test set against time.
- *summary.md*: A summary of the training run, including the RMSE and the runtimes.
- *summary_2.md*: A summary of the test run, including the RMSE and the runtimes.

Furthermore, the *pickle* and *json* files containing information about the pipeline can be found
in the folder ``pipe_getting_started``.

Summary
*******

This guide has provided an elementary introduction to `pyWATTS`. For more information, consider
working through the other examples provided or reading the documentation. For further information
on how to use pyWATTS, please have a look at :ref:`howtouse`.