Pre-processing pipelines come in many different forms. JuliaSim Surrogates provides the PreProcessing module as a flexible framework which can be used to describe any user's pipeline. This framework can be broken down into two simple components:

  • PreProcessingStep
  • PreProcessingChain

Many steps can be combined to form a chain. Many chains can be combined to form larger chains. Both steps and chains are themselves callable objects which can be performed on given data.


The atomic structure of any pre-processing pipeline.


Functions to prepare data for surrogatization.

A PreProcessingStep can be created given some method. Once created, a PreProcessingStep is callable over a given data structure.

Optionally, users can also provide a configuration for the step. This configuration modifies the behavior of the method as it is applied to the given data structure. If no configuration is needed, PreProcessingStep can be constructed with only the method argument.
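As a rough sketch of the idea, a configurable step might look like the following. Note that the two-argument constructor form, the named-tuple configuration `clip_config`, and the method `clip` are assumptions made for illustration based on the description above, not a confirmed signature:

```julia
# Hypothetical sketch: the two-argument constructor and the shape of the
# configuration object are assumptions, not the confirmed API.

# method whose behavior depends on a configuration
clip = (x, config) -> clamp(x, config.lo, config.hi)

# configuration for the step (assumed named-tuple form)
clip_config = (lo = 0.0, hi = 100.0)

# create the step with both a method and a configuration
clip_step = PreProcessingStep(clip, clip_config)

# the step applies `clip` to `X` using `clip_config`
clip_step(X)
```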


# define method
squared_magnitude = abs2

# create step
squared_magnitude_step = PreProcessingStep(squared_magnitude)

# perform step on `X`
squared_magnitude_step(X)


Describe complex pipelines by chaining many steps or chains together.


Sequence of functions to prepare data for surrogatization.

A PreProcessingChain can be created from any collection of PreProcessingStep objects. Like a PreProcessingStep, a PreProcessingChain is itself callable over a given data structure. A PreProcessingChain may also be composed of other PreProcessingChain objects, making the framework flexible enough to describe any pre-processing pipeline.


# define methods
squared_magnitude = x -> abs2(x)
cap_at_100 = x -> min(x, 100)

# create steps
squared_magnitude_step = PreProcessingStep(squared_magnitude)
cap_at_100_step = PreProcessingStep(cap_at_100)

# create chain
prepro_chain = PreProcessingChain([squared_magnitude_step, cap_at_100_step])

# perform chain on `X`
prepro_chain(X)

Basic Example

A fundamental pre-processing step in most conventional machine learning workflows is splitting a dataset into separate subsets; for instance, it is often necessary to split a dataset into training, testing, and validation sets. SplitDataset is a PreProcessingStep that handles this task.

SplitDataset(ratio::T) where {T <: Union{Vector{<:Real}, Real}}

This interface allows users to easily define dataset-splitting as a part of their pre-processing pipelines (i.e., using PreProcessingChain).
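For instance, based on the `SplitDataset(ratio)` signature above, a split could be defined with either a single ratio or a vector of ratios. The interpretation of the ratios (e.g., 70/20/10 as train/test/validation) and the exact return structure are assumptions here and may differ in the actual implementation:

```julia
# Assumed sketch: the subset ordering and return type of the split
# are not specified in this section and are illustrative only.

# hold out a fraction of the data (single-ratio form)
train_test_split = SplitDataset(0.8)

# or request a three-way split (vector form, assumed train/test/validation)
train_test_val_split = SplitDataset([0.7, 0.2, 0.1])

# use the split as part of a larger pipeline
pipeline = PreProcessingChain([squared_magnitude_step, train_test_val_split])
pipeline(X)
```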