What Is Data Splitting for Time Series Modelling?
In machine learning, data splitting is the process of segmenting your data for training, evaluation, and testing. For time series modelling, the split must be performed carefully to preserve the chronological order of the data, because samples in a time series can no longer be assumed to be independent and identically distributed.
Examples of Time Series Data Splitting
Common strategies for data splitting include:
- Random sampling.
- Stratified sampling for imbalanced data.
- Grouped sampling for data with group labels.
In the case of time series data splitting, more advanced strategies are needed so that the model is only ever evaluated on data points that come after the data it was trained on. The two broad categories for time series data splitting are:
- Sliding/rolling window
- Expanding window
To further promote independence between the training and evaluation samples, a gap can also be introduced between the training and evaluation datasets. Depending on your time series dataset, hybrid strategies that combine elements of the above may be required to test your model appropriately.
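The sliding window, expanding window, and gap ideas above can be sketched with a small index generator. This is a minimal, library-agnostic illustration in plain Python; the function name `time_series_splits` and its parameters (`test_size`, `gap`, `max_train_size`) are hypothetical names chosen for this example, not part of any specific tool.

```python
from typing import Iterator, List, Optional, Tuple


def time_series_splits(
    n_samples: int,
    n_splits: int,
    test_size: int,
    gap: int = 0,
    max_train_size: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) in chronological order.

    Assumes the samples are already sorted by time, so index order is time order.
    - max_train_size=None  -> expanding window (training set grows each fold)
    - max_train_size=k     -> sliding window (only the last k samples are used)
    - gap                  -> number of samples skipped between train and test
    """
    for i in range(n_splits):
        # Test blocks are laid out back-to-back at the end of the series.
        test_start = n_samples - (n_splits - i) * test_size
        test_end = test_start + test_size
        # Training data always ends `gap` samples before the test block starts.
        train_end = test_start - gap
        if train_end <= 0:
            raise ValueError("Not enough samples for this many splits and gap.")
        train_start = 0
        if max_train_size is not None:
            train_start = max(0, train_end - max_train_size)
        yield list(range(train_start, train_end)), list(range(test_start, test_end))


if __name__ == "__main__":
    print("Expanding window with a gap of 1:")
    for train, test in time_series_splits(10, n_splits=3, test_size=2, gap=1):
        print("  train", train, "-> test", test)

    print("Sliding window (last 3 samples) with a gap of 1:")
    for train, test in time_series_splits(
        10, n_splits=3, test_size=2, gap=1, max_train_size=3
    ):
        print("  train", train, "-> test", test)
```

In every fold the test indices come strictly after the training indices, which is the property that random sampling does not guarantee.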
Why Is Data Splitting Important?
For time series problems, the test dataset must be new data that the model has not been trained on and should ideally come from the same feature distribution as the training data. This gives us more confidence that the trained model will generalise well in production when new data becomes available.
The modelling mindset here is to prevent data leakage, a scenario where data from outside of the training dataset is used to create the model. Data leakage can come in the form of
- target leakage, where data that is used in training is not available during predictions, and
- train-test contamination, where data from the validation or test set is unintentionally included in the training set.
When a model is trained on such a contaminated dataset, the ML practitioner may obtain overly optimistic model performance during evaluation. This is expected but undesirable, because the model has effectively been trained on information that it would not otherwise have in production. It follows that, once deployed to production, the same model will perform poorly or fail altogether.
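One simple guard against the temporal form of train-test contamination is to check that every test timestamp comes after every training timestamp. The sketch below assumes each sample carries a `datetime` timestamp; the helper name `assert_no_temporal_overlap` and the `min_gap` parameter are hypothetical. It does not detect target leakage, which depends on how the features themselves were built.

```python
from datetime import datetime, timedelta
from typing import Sequence


def assert_no_temporal_overlap(
    train_times: Sequence[datetime],
    test_times: Sequence[datetime],
    min_gap: timedelta = timedelta(0),
) -> timedelta:
    """Raise ValueError if the test period is not strictly after the training period.

    Returns the actual gap between the latest training sample and the
    earliest test sample when the check passes.
    """
    if not train_times or not test_times:
        raise ValueError("Both splits must contain at least one timestamp.")

    latest_train = max(train_times)
    earliest_test = min(test_times)
    actual_gap = earliest_test - latest_train

    # The test period must start strictly after training ends ...
    if earliest_test <= latest_train:
        raise ValueError(
            f"Contamination: test data starts at {earliest_test}, "
            f"but training data runs until {latest_train}."
        )
    # ... and, if requested, leave at least `min_gap` in between.
    if actual_gap < min_gap:
        raise ValueError(f"Gap of {actual_gap} is smaller than required {min_gap}.")
    return actual_gap
```

Running a check like this right after splitting, and before any model fitting, makes an accidental shuffle or an overlapping query window fail loudly instead of silently inflating the evaluation score.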
Federated Time-Series Learning with OctaiPipe
In distributed learning across edge devices, such as federated learning, it is even more crucial to define appropriate holdout sets on the local databases carefully, so that a robust global model is produced. The two key concepts that need to be taken into account when creating the validation set are:
- Chronological order.
The following diagrams illustrate the strategies for time series splitting in a federated learning setting:
Diagram illustrating data splitting strategy for federated time series learning with OctaiPipe
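The idea of holding out the most recent data on each device can be sketched as follows. This is a simplified illustration in plain Python, not OctaiPipe's implementation: the function `split_per_device`, the `(timestamp, value)` record layout, and the `val_fraction` parameter are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Assumed record layout for this sketch: (timestamp, value).
Record = Tuple[float, float]


def split_per_device(
    device_data: Dict[str, List[Record]],
    val_fraction: float = 0.2,
) -> Dict[str, Tuple[List[Record], List[Record]]]:
    """Split each device's local data chronologically into (train, validation).

    Each device keeps its own split; no records move between devices.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1.")

    splits: Dict[str, Tuple[List[Record], List[Record]]] = {}
    for device_id, records in device_data.items():
        # Sort by timestamp so the validation set is the most recent data.
        ordered = sorted(records, key=lambda record: record[0])
        n_val = max(1, int(len(ordered) * val_fraction))
        cut = len(ordered) - n_val
        splits[device_id] = (ordered[:cut], ordered[cut:])
    return splits
```

Because the cut point is computed per device, devices with different data volumes or recording periods each get a validation set drawn from the end of their own timeline.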
How Do I Define Time-Series Splitting in OctaiPipe?
In our experience, use cases typically involve multivariate time series data, such as time-to-failure predictions for roller bearings, solar panels, or battery health status. To assist with implementing advanced time series data splitting strategies in these use cases, OctaiPipe natively interfaces with InfluxDB and allows users to have granular control of the validation data on each device's local database through configurations (see lines 8 to 9 of the snippet below):
How Are Time Series Sequences Defined in OctaiPipe?
Furthermore, in the case of predicting when an event occurs, time sequences or groups need to be identified for splitting. OctaiPipe provides the functionality to specify groups via the cycle_id parameter (see line 3 of the snippet below). This parameter can then be used to split and stratify the samples for validation on each edge device during federated learning:
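Independently of OctaiPipe's own configuration, the general idea of splitting by group can be sketched in plain Python. The example below keeps every sequence intact and holds out the most recent sequences for validation. Only the `cycle_id` name comes from the text above; the function `split_by_cycle`, the `time` field, and the row layout are hypothetical, and stratification is not covered by this sketch.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Row = Dict[str, object]


def split_by_cycle(
    rows: List[Row],
    val_fraction: float = 0.25,
    cycle_key: str = "cycle_id",
    time_key: str = "time",
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (train, validation) without cutting any cycle in two.

    Cycles are ordered by their first timestamp, and the latest ones
    are held out for validation.
    """
    cycles: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        cycles[row[cycle_key]].append(row)
    if len(cycles) < 2:
        raise ValueError("Need at least two cycles to form a train/validation split.")

    # Order whole cycles chronologically by when each one starts.
    ordered_ids = sorted(
        cycles, key=lambda cid: min(r[time_key] for r in cycles[cid])
    )
    n_val = max(1, int(len(ordered_ids) * val_fraction))
    val_ids = set(ordered_ids[-n_val:])

    train = [r for r in rows if r[cycle_key] not in val_ids]
    val = [r for r in rows if r[cycle_key] in val_ids]
    return train, val
```

Splitting on whole cycles matters for time-to-failure problems: if rows from one run-to-failure sequence appear in both splits, the validation score reflects memorised sequences instead of generalisation to new ones.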