A great library that automatically extracts features from time series data

This article is about automated feature extraction from time series for machine learning with the Python library tsfresh. The extraction can be done without user interaction and without knowledge of the underlying distribution. tsfresh lets you perform feature extraction based on scalable hypothesis tests that are specifically designed to work with time series data.

tsfresh is a general framework for transforming time series data into features. It is scalable in the sense that the features are calculated and tested independently of each other, so the work can be spread across several processes or machines, and it is positioned to be the basis for the next generation of scalable machine learning feature transformers. Let's cover the what, how and when of tsfresh.

What is tsfresh?

The name tsfresh stands for Time Series Feature Extraction based on Scalable Hypothesis tests.

Features are extracted from a time series in order to be used for machine learning applications, such as classification or regression. Therefore, it is not the raw data that is used as input for the learning algorithms, but rather a set of calculated features.

Usually, these features are handcrafted and expert knowledge is required to create them. They are tailored to the specific use case and therefore do not generalize well to other problems or even similar problems with small changes in the data.

This toolbox aims at automating this process by extracting automatically a large number of features from time series and selecting only those relevant to the specific problem at hand.

Thereby, it frees the user from having to perform an extensive feature engineering process before training any machine learning algorithm.

This package computes a large number of time series characteristics, the so-called features. Furthermore, the tsfresh package provides algorithms to select the most relevant features from the dataset (feature selection). As such, tsfresh can be used for automatic feature extraction and selection for your time series datasets.

The goal of this project is to automate feature extraction from time series. For this purpose, we use a statistical approach: a predefined set of features is calculated for every time series, and a hypothesis test is then evaluated for each feature to decide whether it is relevant to the target.

What are the characteristics, advantages and challenges of tsfresh?

The concept of tsfresh is based on the observation that a time series can be described by certain characteristics (like trend, seasonality and noise). For example, a series might show an upward trend over a longer period of time and a downward trend over the recent period.

These characteristics can be formalized as features, that is, as numbers computed from each series, for example:

The slope of a linear trend fitted to the series
The relative position of the maximum value within the series
The mean absolute change between consecutive values

tsfresh implements many such features and calculates them on the input data. Each feature is then tested against the target, and the results of these tests (the p-values) determine which features are kept. The kept features are usually fed into another machine learning model, which then learns, for example, to distinguish between two classes. This approach has been used for anomaly detection [1] or for predicting sensor failures [2].

There are a number of advantages to using the tsfresh transformer. First, it automatically identifies and selects features that are relevant to the target variable. Second, it is compatible with any type of time series data, regardless of length or sampling frequency. Third, it can be applied to multiple time series at once (although this has not been implemented in our pipeline).

There are also some challenges related to tsfresh. The biggest issue is that it may take longer than other feature extraction methods because, by default, it calculates several hundred features for every time series.

Additionally, some feature calculators take user-specified parameters (e.g., a window size or a lag length). Finding good values may require some experimentation before running tsfresh on new data sets, or on existing ones whose distribution differs from what was used previously.

How and when to use tsfresh

Using tsfresh, we can extract features from time series. tsfresh works in two steps:

Step 1: Calculate the feature values for each time series individually and collect them in a single table with one row per time series and one column per feature.

Step 2: Select the features that are relevant to the target.

In step 1, we can use the extract_features function, which groups the input by id and applies the same feature-extraction process to every (id, time series) pair.

In step 2, we can use the select_features function, which takes the feature table and the target and keeps only the columns that pass the hypothesis tests.

This package is still in beta, but the basic functionality has not changed a lot yet (last update: March 2020).

Features are extracted from univariate time series, which are assumed to be equidistant. Your data should be in a so-called long format, meaning that one row corresponds to one value of the time series. Additionally, the “time” column is used to sort the values of each series. The extract_features function is told which column plays which role through its column_id, column_sort, column_kind and column_value arguments.

Some examples and use cases of tsfresh

Real-world data is often messy, noisy, and highly dimensional. It can be hard to know where to even start in analyzing such data.

This is where feature extraction comes in. Feature extraction essentially boils down the data so it’s easier to use and understand. One way of doing this is through tsfresh, a Python package that provides an automated way to extract statistical features from time series data.

The tsfresh transformer is useful because it can extract features from both univariate and multivariate time series data, and does not require any domain-specific knowledge about the data. This means it can be applied to virtually any time series dataset (unlike methods that do require specialized knowledge).

For example, let’s say we have a dataset with hourly energy consumption for 30 days, totaling 720 records. After applying tsfresh to this dataset with the day as the id of each series:

we’ll have a new dataset with 30 records, each one summarizing the 24 hourly values of one day
there will be features for each column of our original dataset that describe its characteristics (such as mean, maximum, minimum, standard deviation)

The main functions of tsfresh are:

1) Extracting features from univariate/multivariate time series. These are statistical and descriptive functions that characterize the time series in the time domain and frequency domain. Because most machine learning algorithms cannot directly process raw time series, this step is very important in order to get good results.
2) Selecting the features that are relevant to the target, so that only informative columns reach the learning algorithm.

The big advantage of tsfresh over other Python packages for calculating features from time series is that it calculates them automatically for you! This saves you the tedious task of calculating all the different features manually. You just need to define which features you want (e.g., only the mean and the maximum) and which parameters they should take (e.g., a lag of 1 or 10).

Implementation of TSfresh in Python

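# installing tsfresh (run this line in a shell, not in Python)
pip install tsfresh

# load the robot execution failures example data set
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, \
    load_robot_execution_failures
download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()

# extract features: one row per execution ("id"), values ordered by "time"
from tsfresh import extract_features
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")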

Detail repo here

tsfresh is not a forecasting model in itself: it turns each time series into a row of features, and those features are then passed to whatever model makes the prediction, be it a classifier or a regressor.

With this information, we hope that you are ready to use and apply tsfresh to your data sets and research, whatever they may be. Hope you enjoyed this article at MLDots.