# PreProcessing

Pre-processing pipelines come in many different forms. JuliaSim Surrogates provides the `PreProcessing` module as a flexible framework which can be used to describe any user's pipeline. This framework can be broken down into two simple components:

- `PreProcessingStep`
- `PreProcessingChain`

Many steps can be combined to form a chain, and many chains can be combined to form larger chains. Both steps and chains are callable objects that can be applied directly to data.

## Steps

Steps are the atomic units of any pre-processing pipeline.

`PreProcessing.PreProcessingStep` (Type)

Functions to prepare data for surrogatization.

A `PreProcessingStep` can be created given some `method`. Once created, a `PreProcessingStep` is callable over a given data structure.

Optionally, users can also provide a `configuration` for the step. This configuration modifies the behavior of the method performed over the specified data structure. If no `configuration` is needed, `PreProcessingStep` can be called with only one argument, the `method`.

**Examples**

```
# define the method (`abs2` returns the squared magnitude of a number)
squared_magnitude = abs2
# create the step
squared_magnitude_step = PreProcessingStep(squared_magnitude)
# perform the step on some data `X`
squared_magnitude_step(X)
```
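To make the idea of a callable step concrete, here is a minimal, self-contained sketch in plain Julia. It does not use or reproduce the `PreProcessing` module: `SketchStep`, `clamp_to`, and the named-tuple `configuration` are hypothetical names chosen for illustration, and forwarding the configuration as keyword arguments is an assumption about how such a step could work, not a description of the library's behavior.

```julia
# Hypothetical stand-in for a pre-processing step:
# a method plus an optional configuration (NOT the library's type).
struct SketchStep{F, C <: NamedTuple}
    method::F
    configuration::C
end

# One-argument constructor for steps that need no configuration.
SketchStep(method) = SketchStep(method, NamedTuple())

# Make the step callable: pass the data to the method and
# forward the configuration entries as keyword arguments (an assumption).
(step::SketchStep)(data) = step.method(data; step.configuration...)

# A method whose behavior depends on its configuration.
clamp_to(x; lo, hi) = clamp.(x, lo, hi)

# Step without configuration.
squared_step = SketchStep(x -> abs2.(x))
# Step with configuration.
clamp_step = SketchStep(clamp_to, (lo = 0.0, hi = 10.0))

X = [-3.0, 1.5, 4.0]
squared = squared_step(X)   # [9.0, 2.25, 16.0]
clamped = clamp_step(X)     # [0.0, 1.5, 4.0]
```

Storing the configuration next to the method keeps the step itself a single-argument callable, which is what lets steps be composed later without knowing each other's settings.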

## Chains

Describe complex pipelines by chaining many steps or chains together.

`PreProcessing.PreProcessingChain` (Type)

Sequence of functions to prepare data for surrogatization.

A `PreProcessingChain` can be created from any collection of `PreProcessingStep` objects. Like a `PreProcessingStep`, a `PreProcessingChain` is a callable object over some given data structure. Users can also define a `PreProcessingChain` made up of several `PreProcessingChain` objects, which makes this a flexible framework for any pre-processing pipeline.

**Examples**

```
# define methods
squared_magnitude = x -> abs2(x)
cap_at_100 = x -> min(x, 100)
# create steps
squared_magnitude_step = PreProcessingStep(squared_magnitude)
cap_at_100_step = PreProcessingStep(cap_at_100)
# create chain from the steps (not from the raw functions)
prepro_chain = PreProcessingChain([squared_magnitude_step, cap_at_100_step])
# perform chain on `X`
prepro_chain(X)
```
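The nesting described above (chains made of steps or of other chains) can be sketched in plain Julia as a left fold over an ordered collection of callables. This is an illustrative sketch only: `SketchChain` and `shift` are hypothetical names, and the element-wise broadcasting is an assumption made so that the example runs on a vector. It is not the `PreProcessing` module's implementation.

```julia
# Hypothetical stand-in for a chain: an ordered collection of callables
# (NOT the library's type).
struct SketchChain
    steps::Vector{Any}
end

# Apply each step in order, feeding the output of one into the next.
(chain::SketchChain)(data) = foldl((acc, step) -> step(acc), chain.steps; init = data)

# Plain functions play the role of steps here; broadcasting is an assumption.
squared_magnitude = x -> abs2.(x)
cap_at_100 = x -> min.(x, 100)
shift = x -> x .+ 1

# A chain of two steps, then a larger chain that contains the first chain.
inner = SketchChain([squared_magnitude, cap_at_100])
outer = SketchChain([inner, shift])

X = [3, -12, 7]
inner_result = inner(X)   # [9, 100, 49]
outer_result = outer(X)   # [10, 101, 50]
```

Because a chain is itself callable with a single data argument, it can sit anywhere a step can, which is what makes chains of chains work without special handling.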

## Basic Example

A fundamental pre-processing step performed in most conventional machine learning workflows is splitting a dataset into various buckets. For instance, it is often required to split portions of a dataset for training, testing, and validation. `SplitDataset` is one `PreProcessingStep` that handles this task.

`PreProcessing.SplitDataset` (Function)

`SplitDataset(ratio::T) where {T <: Union{Vector{<:Real}, Real}}`

This interface allows users to easily define dataset-splitting as a part of their pre-processing pipelines (i.e., using `PreProcessingChain`).
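The signature accepts either a single real number or a vector of real numbers as `ratio`, but the source does not specify how those values are interpreted. The following plain-Julia sketch shows one plausible reading, under stated assumptions: a single ratio gives a two-way split, a vector gives one part per entry, the ratios sum to at most 1, and the data is split in order without shuffling. `sketch_split` is a hypothetical helper, not the library's `SplitDataset`.

```julia
# Hypothetical helper: split `data` into consecutive parts according to `ratios`.
# Assumes positive ratios summing to at most 1; any remainder forms a final part.
function sketch_split(data::AbstractVector, ratios::Vector{<:Real})
    n = length(data)
    # Cumulative cut points, rounded to whole indices.
    cuts = round.(Int, cumsum(ratios) .* n)
    starts = [1; cuts .+ 1]
    stops = [cuts; n]
    # Skip empty ranges (for example, when the ratios already sum to 1).
    return [data[a:b] for (a, b) in zip(starts, stops) if a <= b]
end

# A single ratio is treated as a two-way split (an assumption).
sketch_split(data::AbstractVector, ratio::Real) = sketch_split(data, [ratio])

X = collect(1:10)

# 80% / 20% split.
train2, test2 = sketch_split(X, 0.8)

# 60% / 20% / 20% split for training, validation, and testing.
train3, valid3, test3 = sketch_split(X, [0.6, 0.2, 0.2])
```

In practice a split step would usually shuffle or stratify the data first; that is left out here to keep the sketch deterministic.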