Introduction
Preprocessing is the process of transforming raw data into a form that is better suited for training a surrogate. Applying transformations to the data before it is fed to a machine learning algorithm is a common step in data analysis pipelines. For example, the data is often scaled to have zero mean and unit variance, or normalized so that its values lie between 0 and 1. In this section, we discuss some of the most common preprocessing techniques and how to use them through the PreProcessing module of JuliaSimSurrogates. By default, MinMaxNorm is applied automatically when using DigitalEcho.
Transforms
The PreProcessing module provides a number of pre-defined transform steps that can be chained using a PreProcessingChain (a short construction example follows the dataset loading below):
MinMaxNorm - Scales the data between provided bounds. Bounds default to 0 and 1.
ZScore - Scales the data to have zero mean and unit variance.
FilterContinuousValues - Filters out values that remain constant from the continuous fields in the data.
CustomTransform - A step which applies a custom preprocessing function to the data.
FilterFields - A step which filters out certain variables using the index or the name of the variable.
Loading the dataset
We will load a pre-generated dataset to demonstrate how these steps are used.
using JuliaHub, JLSO, DataGeneration
# Download the pre-generated Lotka-Volterra dataset and load it as an ExperimentData object
train_dataset_name = "lotka_volterra"
path = JuliaHub.download_dataset(("juliasimtutorials", train_dataset_name), "path to save")
ed = ExperimentData(JLSO.load(path)[:result])
Number of Trajectories in ExperimentData: 10
Basic Statistics for Given Dynamical System's Specifications
Number of u0s in the ExperimentData: 2
Number of ps in the ExperimentData: 4
╭─────────┬────────────────────────────────────────────────────────────────────╮
│ Field │ │
├─────────┼────────────────────────────────────────────────────────────────────┤
│ │ ╭────────────┬──────────────┬──────────────┬────────┬──────────╮ │
│ │ │ Labels │ LowerBound │ UpperBound │ Mean │ StdDev │ │
│ │ ├────────────┼──────────────┼──────────────┼────────┼──────────┤ │
│ │ │ states_1 │ 1.0 │ 1.0 │ 1.0 │ 0.0 │ │
│ u0s │ ├────────────┼──────────────┼──────────────┼────────┼──────────┤ │
│ │ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ │
│ │ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ │
│ │ ├────────────┼──────────────┼──────────────┼────────┼──────────┤ │
│ │ │ states_2 │ 1.0 │ 1.0 │ 1.0 │ 0.0 │ │
│ │ ╰────────────┴──────────────┴──────────────┴────────┴──────────╯ │
├─────────┼────────────────────────────────────────────────────────────────────┤
│ │ ╭──────────┬──────────────┬──────────────┬─────────┬──────────╮ │
│ │ │ Labels │ LowerBound │ UpperBound │ Mean │ StdDev │ │
│ │ ├──────────┼──────────────┼──────────────┼─────────┼──────────┤ │
│ │ │ p_1 │ 1.562 │ 2.438 │ 1.969 │ 0.302 │ │
│ ps │ ├──────────┼──────────────┼──────────────┼─────────┼──────────┤ │
│ │ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ │
│ │ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ │
│ │ ├──────────┼──────────────┼──────────────┼─────────┼──────────┤ │
│ │ │ p_4 │ 1.766 │ 1.984 │ 1.87 │ 0.074 │ │
│ │ ╰──────────┴──────────────┴──────────────┴─────────┴──────────╯ │
╰─────────┴────────────────────────────────────────────────────────────────────╯
Basic Statistics for Given Dynamical System's Continuous Fields
Number of states in the ExperimentData: 2
╭──────────┬────────────────────────────────────────────────────────────────────╮
│  Field   │                                                                    │
├──────────┼────────────────────────────────────────────────────────────────────┤
│          │ ╭────────────┬──────────────┬──────────────┬─────────┬──────────╮  │
│          │ │   Labels   │  LowerBound  │  UpperBound  │   Mean  │  StdDev  │  │
│          │ ├────────────┼──────────────┼──────────────┼─────────┼──────────┤  │
│          │ │  states_1  │     0.61     │     1.851    │  1.131  │   0.294  │  │
│  states  │ ├────────────┼──────────────┼──────────────┼─────────┼──────────┤  │
│          │ │     ⋮      │      ⋮       │      ⋮       │    ⋮    │     ⋮    │  │
│          │ │     ⋮      │      ⋮       │      ⋮       │    ⋮    │     ⋮    │  │
│          │ ├────────────┼──────────────┼──────────────┼─────────┼──────────┤  │
│          │ │  states_2  │     0.585    │     1.93     │  1.068  │   0.272  │  │
│          │ ╰────────────┴──────────────┴──────────────┴─────────┴──────────╯  │
╰──────────┴────────────────────────────────────────────────────────────────────╯
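With the dataset loaded, the transform steps listed above can be constructed directly from the ExperimentData object and the field they act on. A minimal sketch is shown below: the MinMaxNorm and FilterContinuousValues calls follow the usage demonstrated later in this section, while the ZScore call is an assumption that it takes the same (data, field) arguments.
using PreProcessing
# Construct individual transform steps from the dataset and the field they act on.
# MinMaxNorm and FilterContinuousValues follow the usage shown later in this section;
# the ZScore constructor is assumed to take the same (data, field) arguments.
norm_step   = MinMaxNorm(ed, :states)
zscore_step = ZScore(ed, :states)
filter_step = FilterContinuousValues(ed, :states)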
Splitting Datasets
A fundamental preprocessing step in most conventional machine learning workflows is splitting a dataset into various buckets. For instance, it is often necessary to set aside portions of a dataset for training and validation. The PreProcessing module provides a train_valid_split function which can be used to split a dataset into training and validation sets.
Example
Here is an example demonstrating how to use the train_valid_split function to split a dataset into training and validation sets.
@info "Size of original ED" length(ed.results.states.vals)
using PreProcessing
# Define PreProcessing steps
ed_train, ed_val = train_valid_split(ed; train_ratio=0.8)
@info "Size of Train ED" length(ed_train.results.states.vals)
@info "Size of Validation ED" length(ed_val.results.states.vals)
┌ Info: Size of original ED
└ length(ed.results.states.vals) = 10
┌ Info: Size of Train ED
└ length(ed_train.results.states.vals) = 8
┌ Info: Size of Validation ED
└ length(ed_val.results.states.vals) = 2
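The train_ratio keyword controls the fraction of trajectories assigned to the training set. For example, a 70/30 split of the same dataset:
# A 70/30 split of the same dataset
ed_train70, ed_val70 = train_valid_split(ed; train_ratio=0.7)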
Creating a PreProcessing pipeline using Chains
A PreProcessingChain is a collection of steps which are applied sequentially to the data. It is a callable object which takes in data and returns transformed data. The following example builds a chain from two steps and applies it to an ExperimentData object; it produces the same result as applying the individual steps one after the other.
Example
# Define individual preprocessing steps
norm_pre = MinMaxNorm(ed, :states)
filter_pre = FilterContinuousValues(ed, :states)
# Define a PreProcessingChain
chain = PreProcessingChain(norm_pre, filter_pre)
# Apply the preprocessing chain to the data
ed_preprocessed = chain(ed)
@info "Stats of original ED" ed.results.states.stats
@info "Stats of preprocessed ED" ed_preprocessed.results.states.stats
┌ Info: Stats of original ED
└ ed.results.states.stats = (lb = [0.6098798922540988; 0.5851842121921965;;], ub = [1.851268466389882; 1.9298439018722724;;], mean = [1.1313672856099248; 1.0678037215329483;;], std = [0.2944880215199974; 0.2719585629989631;;])
┌ Info: Stats of preprocessed ED
└ ed_preprocessed.results.states.stats = (lb = [0.0; 0.0;;], ub = [1.0; 1.0;;], mean = [0.4200839319943557; 0.3589157264434517;;], std = [0.2372246914911484; 0.20225084836421922;;])
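To see that the chain matches the individual steps, the same transforms can be applied one at a time and the resulting statistics compared. This is a sketch under the assumption that each step is callable on an ExperimentData object in the same way as the chain:
# Apply the steps individually, in the same order as in the chain
# (assumes each step is callable on ExperimentData, like the chain itself)
ed_sequential = filter_pre(norm_pre(ed))
ed_sequential.results.states.stats == ed_preprocessed.results.states.stats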
Splines
Splines are a powerful tool that provides access to a continuous-time curve from a discrete-time dataset. The SplineED step in the PreProcessing module can be used to create splines for the variables of an ExperimentData object.
using PreProcessing
# Create splines (cubic by default) for the dataset and evaluate
# the first trajectory's states at t = 0.1
spline = SplineED()
ed_splined = spline(ed)
ed_splined.results.states.vals[1](0.1)
2-element Vector{Float64}:
0.9912014250153924
0.9429020456009467
SplineED defaults to cubic splines. However, it can use any spline type defined by the DataInterpolations.jl package. For example, the following code snippet shows how to use a linear spline.
using PreProcessing.DataInterpolations
spline = SplineED(
states_interp=LinearInterpolation,
controls_interp=LinearInterpolation
)
SplineED{UnionAll, UnionAll}(DataInterpolations.LinearInterpolation, DataInterpolations.LinearInterpolation)
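The resulting step is applied just as before, and the splined trajectories can then be evaluated at times inside the data's span:
# Apply the linear-spline step and evaluate the first trajectory's states at t = 0.1
ed_linear = spline(ed)
ed_linear.results.states.vals[1](0.1)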
Make sure the spline is only used to interpolate within the range of the data. For example, if the data spans the time interval (0, 1), then the spline should only be evaluated between 0 and 1. Outside this range the spline extrapolates, and the approximation will be inaccurate.
SplineED can accept spline types for the state variables and the controls independently.
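For example, the states can use cubic splines while the controls use linear interpolation (CubicSpline and LinearInterpolation are both interpolation types from DataInterpolations.jl):
# Mix spline types: cubic interpolation for states, linear for controls
spline_mixed = SplineED(
    states_interp=CubicSpline,
    controls_interp=LinearInterpolation
)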