PreProcessing
Pre-processing pipelines come in many different forms. JuliaSim Surrogates provides the PreProcessing module as a flexible framework for describing a user's pipeline. The framework has two components:
PreProcessingStep
PreProcessingChain
Many steps can be combined to form a chain, and many chains can be combined to form larger chains. Both steps and chains are callable objects that can be applied to given data.
Steps
A step is the atomic unit of any pre-processing pipeline.
PreProcessing.PreProcessingStep — Type
PreProcessingStep
Functions to prepare data for surrogatization.
A PreProcessingStep can be created given some method. Once created, a PreProcessingStep is callable over a given data structure.
Optionally, users can also provide a configuration for the step. This configuration modifies the behavior of the method performed over the specified data structure. If no configuration is needed, PreProcessingStep can be called with only one argument, method.
Examples
# define method
squared_magnitude = abs2
# create step
squared_magnitude_step = PreProcessingStep(squared_magnitude)
# perform step on `X`
squared_magnitude_step(X)
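To illustrate how a callable step with an optional configuration can behave, here is a minimal, self-contained sketch in plain Julia. It is not the library's implementation: `SketchStep`, `cap`, and the `limit` setting are hypothetical names, and forwarding the configuration as keyword arguments is an assumption made only for this illustration.

```julia
# Hypothetical stand-in for a pre-processing step; NOT the library's type.
struct SketchStep{F, C}
    method::F   # function applied to the data
    config::C   # settings forwarded to `method` (assumed here to be keyword arguments)
end

# One-argument constructor for methods that need no configuration.
SketchStep(method) = SketchStep(method, NamedTuple())

# Calling the step applies `method` to the data and forwards the configuration.
(s::SketchStep)(data) = s.method(data; s.config...)

# A method without configuration.
squared_magnitude(x) = abs2.(x)

# A method whose behavior depends on a configuration value.
cap(x; limit = 100) = min.(x, limit)

X = [-3.0, 4.0, 12.0]

squared_step = SketchStep(squared_magnitude)
capped_step  = SketchStep(cap, (limit = 10,))

squared_step(X)  # [9.0, 16.0, 144.0]
capped_step(X)   # [-3.0, 4.0, 10.0]
```

In this sketch the same `cap` method yields different results depending on the configuration stored in the step, which mirrors the role the configuration plays in the description above.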
Chains
Describe complex pipelines by chaining many steps or chains together.
PreProcessing.PreProcessingChain — Type
PreProcessingChain
Sequence of functions to prepare data for surrogatization.
A PreProcessingChain can be created from any collection of PreProcessingStep objects. Like a PreProcessingStep object, a PreProcessingChain is a callable object over some given data structure. Users can also define a PreProcessingChain made up of several PreProcessingChain objects, which makes the framework flexible enough to describe nested pre-processing pipelines.
Examples
# define methods
squared_magnitude = x -> abs2(x)
cap_at_100 = x -> min(x, 100)
# create steps
squared_magnitude_step = PreProcessingStep(squared_magnitude)
cap_at_100_step = PreProcessingStep(cap_at_100)
# create chain
prepro_chain = PreProcessingChain([squared_magnitude_step, cap_at_100_step])
# perform chain on `X`
prepro_chain(X)
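The idea that chains apply their members in order, and can themselves contain other chains, can be sketched in plain Julia as follows. This is only an illustration under stated assumptions: `SketchStep`, `SketchChain`, and `shift_step` are hypothetical names, not part of the PreProcessing module, and the library's actual types may be implemented differently.

```julia
# Hypothetical stand-ins; NOT the library's types.
struct SketchStep{F}
    method::F
end
(s::SketchStep)(data) = s.method(data)

struct SketchChain
    links::Vector{Any}   # steps or other chains, applied in order
end

# Calling the chain feeds the output of each link into the next one.
(c::SketchChain)(data) = foldl((acc, link) -> link(acc), c.links; init = data)

squared_magnitude_step = SketchStep(x -> abs2.(x))
cap_at_100_step        = SketchStep(x -> min.(x, 100))
shift_step             = SketchStep(x -> x .- 1)

# A chain of steps, and a larger chain that contains the first chain.
inner_chain = SketchChain([squared_magnitude_step, cap_at_100_step])
outer_chain = SketchChain([inner_chain, shift_step])

X = [3.0, -20.0]

inner_chain(X)  # [9.0, 100.0]
outer_chain(X)  # [8.0, 99.0]
```

Because both types are callable on data, a chain does not need to distinguish between a step and a nested chain when it runs its links.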
Basic Example
A fundamental pre-processing step in most conventional machine learning workflows is splitting a dataset into various buckets. For instance, it is often necessary to split a dataset into portions for training, testing, and validation. SplitDataset is a PreProcessingStep that handles this task.
PreProcessing.SplitDataset — Function
SplitDataset(ratio::T) where {T <: Union{Vector{<:Real}, Real}}
This interface allows users to easily define dataset-splitting as a part of their pre-processing pipelines (i.e., using PreProcessingChain).
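The signature accepts either a single real number or a vector of real numbers as the ratio. The exact splitting semantics of SplitDataset are not described here, so the following is only a sketch of one plausible interpretation: each ratio is the fraction of observations assigned to a bucket, taken in order without shuffling, and the remainder forms a final bucket. The helper `split_by_ratio` is hypothetical and is not part of the library.

```julia
# Hypothetical helper; NOT the library's SplitDataset.
# Assumption: each ratio is the fraction of observations placed in one bucket,
# and the observations left over form a final bucket.
function split_by_ratio(data::AbstractVector, ratios::Vector{<:Real})
    n = length(data)
    buckets = Vector{Vector{eltype(data)}}()
    start = 1
    for r in ratios
        stop = min(n, start + round(Int, r * n) - 1)
        push!(buckets, data[start:stop])
        start = stop + 1
    end
    push!(buckets, data[start:n])  # remainder bucket
    return buckets
end

# A single ratio is treated as a one-element vector of ratios.
split_by_ratio(data::AbstractVector, ratio::Real) = split_by_ratio(data, [ratio])

data = collect(1:10)

train, test = split_by_ratio(data, 0.8)
# train == [1, 2, 3, 4, 5, 6, 7, 8], test == [9, 10]

train3, valid3, test3 = split_by_ratio(data, [0.6, 0.2])
# train3 == [1, 2, 3, 4, 5, 6], valid3 == [7, 8], test3 == [9, 10]
```

A scalar ratio gives a two-way split, while a vector of ratios gives one bucket per ratio plus the remainder, which matches the training, validation, and testing use case mentioned above.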