Introduction
Preprocessing is the process of transforming raw data into a form that is better suited for training a surrogate. Applying transformations to the data before it is fed to a machine learning algorithm is a common step in data analysis pipelines. For example, the data is often scaled to have zero mean and unit variance, or normalized so that its values lie between 0 and 1. In this section, we discuss some of the most common preprocessing techniques and how to use them through the PreProcessing module of JuliaSimSurrogates. By default, MinMaxNorm is applied automatically when using DigitalEcho.
Transforms
The PreProcessing module provides a number of pre-defined transform steps that can be chained using a PreProcessingChain (a short construction example follows the dataset loading below):
MinMaxNorm - Scales the data between provided bounds. Bounds default to 0 and 1.
ZScore - Scales the data to have zero mean and unit variance.
FilterContinuousValues - Filters out values that remain constant from the continuous fields in the data.
CustomTransform - A step which applies a custom preprocessing function to the data.
FilterFields - A step which filters out certain variables using the index or the name of the variable.
Loading the dataset
We will load a pre-generated dataset to demonstrate how these steps are used.
using JuliaHub, JLSO, DataGeneration
# Download the pre-generated Lotka-Volterra dataset and load it as an ExperimentData object
train_dataset_name = "lotka_volterra"
path = JuliaHub.download_dataset(("juliasimtutorials", train_dataset_name), "path to save")
ed = ExperimentData(JLSO.load(path)[:result])
Number of Trajectories in ExperimentData: 10
Basic Statistics for Given Dynamical System's Specifications
Number of u0s in the ExperimentData: 2
Number of ps in the ExperimentData: 4
╭─────────┬────────────────────────────────────────────────────────────────────╮
│ Field │ │
├─────────┼────────────────────────────────────────────────────────────────────┤
│ │ ╭────────────┬──────────────┬──────────────┬────────┬──────────╮ │
│ │ │ Labels │ LowerBound │ UpperBound │ Mean │ StdDev │ │
│ │ ├────────────┼──────────────┼──────────────┼────────┼──────────┤ │
│ │ │ states_1 │ 1.0 │ 1.0 │ 1.0 │ 0.0 │ │
│ u0s │ ├────────────┼──────────────┼──────────────┼────────┼──────────┤ │
│ │ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ │
│ │ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ │
│ │ ├────────────┼──────────────┼──────────────┼────────┼──────────┤ │
│ │ │ states_2 │ 1.0 │ 1.0 │ 1.0 │ 0.0 │ │
│ │ ╰────────────┴──────────────┴──────────────┴────────┴──────────╯ │
├─────────┼────────────────────────────────────────────────────────────────────┤
│ │ ╭──────────┬──────────────┬──────────────┬─────────┬──────────╮ │
│ │ │ Labels │ LowerBound │ UpperBound │ Mean │ StdDev │ │
│ │ ├──────────┼──────────────┼──────────────┼─────────┼──────────┤ │
│ │ │ p_1 │ 1.562 │ 2.438 │ 1.969 │ 0.302 │ │
│ ps │ ├──────────┼──────────────┼──────────────┼─────────┼──────────┤ │
│ │ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ │
│ │ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ │
│ │ ├──────────┼──────────────┼──────────────┼─────────┼──────────┤ │
│ │ │ p_4 │ 1.766 │ 1.984 │ 1.87 │ 0.074 │ │
│ │ ╰──────────┴──────────────┴──────────────┴─────────┴──────────╯ │
╰─────────┴────────────────────────────────────────────────────────────────────╯
Basic Statistics for Given Dynamical System's Continuous Fields
Number of states in the ExperimentData: 2
╭──────────┬────────────────────────────────────────────────────────────────────╮
│  Field   │                                                                    │
├──────────┼────────────────────────────────────────────────────────────────────┤
│          │ ╭────────────┬──────────────┬──────────────┬─────────┬──────────╮  │
│          │ │   Labels   │  LowerBound  │  UpperBound  │   Mean  │  StdDev  │  │
│          │ ├────────────┼──────────────┼──────────────┼─────────┼──────────┤  │
│          │ │  states_1  │     0.61     │     1.851    │  1.131  │   0.294  │  │
│  states  │ ├────────────┼──────────────┼──────────────┼─────────┼──────────┤  │
│          │ │     ⋮      │      ⋮       │      ⋮       │    ⋮    │     ⋮    │  │
│          │ │     ⋮      │      ⋮       │      ⋮       │    ⋮    │     ⋮    │  │
│          │ ├────────────┼──────────────┼──────────────┼─────────┼──────────┤  │
│          │ │  states_2  │     0.585    │     1.93     │  1.068  │   0.272  │  │
│          │ ╰────────────┴──────────────┴──────────────┴─────────┴──────────╯  │
╰──────────┴────────────────────────────────────────────────────────────────────╯
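With the dataset loaded, the transform steps listed above can be constructed directly from the ExperimentData object and the field they act on. A minimal sketch is shown below: the MinMaxNorm and FilterContinuousValues calls follow the usage demonstrated later in this section, while the ZScore call is an assumption that it takes the same (data, field) arguments.
using PreProcessing
# Construct individual transform steps from the dataset and the field they act on.
# MinMaxNorm and FilterContinuousValues follow the usage shown later in this section;
# the ZScore constructor is assumed to take the same (data, field) arguments.
norm_step   = MinMaxNorm(ed, :states)
zscore_step = ZScore(ed, :states)
filter_step = FilterContinuousValues(ed, :states)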
Splitting Datasets
A fundamental preprocessing step in most conventional machine learning workflows is splitting a dataset into various buckets. For instance, it is often necessary to set aside portions of a dataset for training and validation. The PreProcessing module provides a train_valid_split function which can be used to split a dataset into training and validation sets.
Example
Here is an example demonstrating how to use the train_valid_split function to split a dataset into training and validation sets.
@info "Size of original ED" length(ed.results.states.vals)
using PreProcessing
# Define PreProcessing steps
ed_train, ed_val = train_valid_split(ed; train_ratio=0.8)
@info "Size of Train ED" length(ed_train.results.states.vals)
@info "Size of Validation ED" length(ed_val.results.states.vals)
┌ Info: Size of original ED
└ length(ed.results.states.vals) = 10
┌ Info: Size of Train ED
└ length(ed_train.results.states.vals) = 8
┌ Info: Size of Validation ED
└ length(ed_val.results.states.vals) = 2
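The train_ratio keyword controls the fraction of trajectories assigned to the training set. For example, a 70/30 split of the same dataset:
# A 70/30 split of the same dataset
ed_train70, ed_val70 = train_valid_split(ed; train_ratio=0.7)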
Creating a PreProcessing pipeline using Chains
A PreProcessingChain is a collection of steps which are applied sequentially to the data. It is a callable object which takes in data and returns transformed data. The following example builds a chain from two steps and applies it to an ExperimentData object; it produces the same result as applying the individual steps one after the other.
Example
# Define individual preprocessing steps
norm_pre = MinMaxNorm(ed, :states)
filter_pre = FilterContinuousValues(ed, :states)
# Define a PreProcessingChain
chain = PreProcessingChain(norm_pre, filter_pre)
# Apply the preprocessing chain to the data
ed_preprocessed = chain(ed)
@info "Stats of original ED" ed.results.states.stats
@info "Stats of preprocessed ED" ed_preprocessed.results.states.stats
┌ Info: Stats of original ED
└ ed.results.states.stats = (lb = [0.6098798922540988; 0.5851842121921965;;], ub = [1.851268466389882; 1.9298439018722724;;], mean = [1.1313672856099248; 1.0678037215329483;;], std = [0.2944880215199974; 0.2719585629989631;;])
┌ Info: Stats of preprocessed ED
└ ed_preprocessed.results.states.stats = (lb = [0.0; 0.0;;], ub = [1.0; 1.0;;], mean = [0.4200839319943557; 0.3589157264434517;;], std = [0.2372246914911484; 0.20225084836421922;;])
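To see that the chain matches the individual steps, the same transforms can be applied one at a time and the resulting statistics compared. This is a sketch under the assumption that each step is callable on an ExperimentData object in the same way as the chain:
# Apply the steps individually, in the same order as in the chain
# (assumes each step is callable on ExperimentData, like the chain itself)
ed_sequential = filter_pre(norm_pre(ed))
ed_sequential.results.states.stats == ed_preprocessed.results.states.stats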
Splines
Splines are a powerful tool that provides access to a continuous-time curve from a discrete-time dataset. The SplineED step in the PreProcessing module can be used to create splines for the variables of an ExperimentData object.
using PreProcessing
# Create splines (cubic by default) for the dataset and evaluate
# the first trajectory's states at t = 0.1
spline = SplineED()
ed_splined = spline(ed)
ed_splined.results.states.vals[1](0.1)
2-element Vector{Float64}:
0.9912014250153924
0.9429020456009467
SplineED defaults to cubic splines. However, it can use any spline type defined by the DataInterpolations.jl package. For example, the following code snippet shows how to use a linear spline.
using PreProcessing.DataInterpolations
spline = SplineED(
states_interp=LinearInterpolation,
controls_interp=LinearInterpolation
)
SplineED{UnionAll, UnionAll}(DataInterpolations.LinearInterpolation, DataInterpolations.LinearInterpolation)
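The resulting step is applied just as before, and the splined trajectories can then be evaluated at times inside the data's span:
# Apply the linear-spline step and evaluate the first trajectory's states at t = 0.1
ed_linear = spline(ed)
ed_linear.results.states.vals[1](0.1)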
Make sure the spline is only used to interpolate within the range of the data. For example, if the data spans the time interval (0, 1), then the spline should only be evaluated between 0 and 1. Outside this range the spline extrapolates, and the approximation will be inaccurate.
SplineED can accept spline types for the state variables and the controls independently.
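For example, the states can use cubic splines while the controls use linear interpolation (CubicSpline and LinearInterpolation are both interpolation types from DataInterpolations.jl):
# Mix spline types: cubic interpolation for states, linear for controls
spline_mixed = SplineED(
    states_interp=CubicSpline,
    controls_interp=LinearInterpolation
)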