# Data Generation

In JuliaSimSurrogates, training algorithms are run on *samples*. Each sample is a simulation run of the original model, so generating one sample requires all the inputs needed for a simulation run: initial conditions, parameter values, and control functions. Many samples are needed to train a surrogate model, and it is not feasible to prepare every configuration by hand. Instead, JuliaSimSurrogates provides user-friendly interfaces for describing various *sampling spaces* and then combining them into a `SimulatorConfig`, which helps users generate many samples in parallel.

## Sampling Spaces

There are 3 different sampling spaces:

- Parameter Space
- Initial Conditions Space
- Control Functions Space

JuliaSimSurrogates provides a consistent interface for defining each of these spaces in order to generate the desired samples for surrogate training.

### Parameter Space

Parameter spaces contain all model parameters which will be used for simulation runs. Spaces can be produced by only specifying the parameter bounds.

`DataGeneration.ParameterSpace` (Type)

```
ParameterSpace(lb, ub) -> ParameterSpace
ParameterSpace(lb, ub, nsamples) -> ParameterSpace
ParameterSpace(
lb,
ub,
nsamples,
alg;
labels
) -> ParameterSpace
```

Generate some parameter space within lower and upper bounds using a specified sampling algorithm.

**Optional Arguments**

- `nsamples::Integer`: number of samples to generate (defaults to $100$).
- `alg<:SamplingAlgorithm`: algorithm to generate samples (defaults to Sobol sequence).
- `labels::Vector{String}`: names for each parameter of the sampled space (defaults to `["p_1", "p_2", ..., "p_n"]`, where `n` is the same as `nsamples`).

```
ParameterSpace(
lb,
ub,
samples::AbstractMatrix
) -> ParameterSpace
ParameterSpace(
lb,
ub,
samples::AbstractMatrix,
alg;
labels
) -> ParameterSpace
```

Generate some parameter space using some collection of pre-existing samples.

Note that `samples` are in matrix form and `alg = nothing` since the samples already exist.

**Optional Arguments**

- `labels::Vector{String}`: names for each parameter of the sampled space (defaults to `["p_1", "p_2", ..., "p_n"]`, where `n` is the same as `nsamples`).
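As a rough illustration of the constructors that accept pre-existing samples, the sketch below prepares a sample matrix in plain Julia. The bounds, the `uniform_samples` helper, and the matrix layout (one row per parameter, one column per sample) are assumptions made for this example and are not specified by the documentation above. The final commented-out line uses the `ParameterSpace(lb, ub, samples)` signature listed above and needs JuliaSimSurrogates to run.

```julia
using Random

# Hypothetical bounds for a two-parameter model.
lb = [0.5, 10.0]
ub = [1.5, 20.0]

# Draw uniform samples inside the box [lb, ub].
# Assumption: one row per parameter, one column per sample.
function uniform_samples(rng, lb, ub, nsamples)
    return lb .+ (ub .- lb) .* rand(rng, length(lb), nsamples)
end

samples = uniform_samples(MersenneTwister(0), lb, ub, 50)

# Sanity check: every sample lies within the bounds.
@assert all(samples .>= lb) && all(samples .<= ub)

# ps = ParameterSpace(lb, ub, samples)  # requires JuliaSimSurrogates
```

Uniform random draws are used here only to keep the sketch short; any externally produced design (for example, measured operating points) could be placed in the matrix instead.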

### Initial Condition Space

Initial condition spaces contain all required starting values for simulation runs. Spaces can be produced by specifying only the lower and upper bounds of the initial conditions. However, some optional arguments exist to produce the desired sample space.

`DataGeneration.ICSpace` (Type)

```
ICSpace(lb, ub) -> ICSpace
ICSpace(lb, ub, nsamples) -> ICSpace
ICSpace(lb, ub, nsamples, alg; labels) -> ICSpace
```

Generate some initial condition space within lower and upper bounds using a specified sampling algorithm.

**Optional Arguments**

- `nsamples::Integer`: number of samples to generate (defaults to $100$).
- `alg<:SamplingAlgorithm`: algorithm to generate samples (defaults to Sobol sequence).
- `labels::Vector{String}`: names for each parameter of the sampled space (defaults to `["p_1", "p_2", ..., "p_n"]`, where `n` is the same as `nsamples`).

```
ICSpace(lb, ub, samples::AbstractMatrix) -> ICSpace
ICSpace(
lb,
ub,
samples::AbstractMatrix,
alg;
labels
) -> ICSpace
```

Generate some initial condition space using some collection of pre-existing samples.

Note that `samples` are in matrix form and `alg = nothing` since the samples already exist.

**Optional Arguments**

- `labels::Vector{String}`: names for each parameter of the sampled space (defaults to `["p_1", "p_2", ..., "p_n"]`, where `n` is the same as `nsamples`).

### Control Space

Control spaces determine the input parameterization for simulation runs. Spaces can be produced by specifying the parameter bounds and a function which describes how the input values vary throughout a simulation run. For example, one could define a two-parameter control function with the following form.

`f(u, p, t) = [p[1]*sin(t), p[2]*cos(t)]`

This function, `f`, can be passed to construct a `CtrlSpace`.
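To make the role of each argument concrete, the following plain-Julia sketch evaluates the control function above on a small time grid for one hypothetical parameter sample. The values of `p`, the grid `ts`, and passing `nothing` for the unused `u` argument are assumptions for illustration only; they do not describe how `CtrlSpace` calls the function internally.

```julia
# Two-parameter control function from the text; `u` is unused here.
f(u, p, t) = [p[1] * sin(t), p[2] * cos(t)]

# Hypothetical sample of the control parameters (amplitudes of the two inputs).
p = [2.0, 0.5]

# Hypothetical time grid with five points over one period.
ts = range(0.0, 2pi; length = 5)

# Evaluate the control at every time point:
# one row per control input, one column per time point.
controls = reduce(hcat, [f(nothing, p, t) for t in ts])

println(size(controls))  # prints (2, 5)
```

Each sampled `p` within the bounds yields a different pair of input signals, which is what sampling a control space amounts to.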

`DataGeneration.CtrlSpace` (Type)

```
CtrlSpace(lb, ub, prob_func) -> CtrlSpace
CtrlSpace(lb, ub, prob_func, nsamples) -> CtrlSpace
CtrlSpace(
lb,
ub,
prob_func,
nsamples,
alg;
labels
) -> CtrlSpace
```

Generate some control space within specified bounds using a provided control function.

Control space consists of samples for pre-defined time-varying inputs that drive a system. For example, a pre-defined time-varying input could be `a*sin(t + b)`, where `a` and `b` are parameters, each with a lower and upper bound to sample from. In contrast to a system's state space, which represents the possible values that the state can take depending on model variables, control space depends on the bounded subset of values allowed for controls applied to a system.

**Optional Arguments**

- `nsamples::Integer`: number of samples to generate (defaults to $100$).
- `alg<:SamplingAlgorithm`: algorithm to generate samples (defaults to Sobol sequence).
- `labels::Vector{String}`: names for each parameter of the sampled space (defaults to `["p_1", "p_2", ..., "p_n"]`, where `n` is the same as `nsamples`).

```
CtrlSpace(
lb,
ub,
prob_func,
samples::AbstractMatrix
) -> CtrlSpace
CtrlSpace(
lb,
ub,
prob_func,
samples::AbstractMatrix,
alg;
labels
) -> CtrlSpace
```

Generate some control space using some collection of pre-existing samples.

Note that `samples` are in matrix form and `alg = nothing` since the samples already exist.

### Additional Samples

Users may wish to add samples to an existing sample space. This can be achieved by making a collection of additional samples in matrix form and using the `add_samples` function.

`DataGeneration.add_samples` (Function)

```
add_samples(sample_space::AbstractSampleSpace, samples; alg)
```

Function to add manual samples to an existing sample space.

## Simulators

Once all sampling spaces are defined, *Simulators* can be configured to describe all the scenarios to be simulated and to facilitate running these simulations. Simulator configurations make it easy to distribute the execution of simulations across any number of machines. This means that JuliaHub can spin up one machine per simulation and generate thousands of samples in roughly the time it takes to generate one.

`SimulatorConfig` objects are themselves callable objects which describe all the scenarios that should be simulated for a given problem definition. Compatible problem definitions can take various forms, such as:

- `ODEProblem` (Julia code),
- FMU (an FMI compliant model).

`DataGeneration.SimulatorConfig` (Type)

`struct SimulatorConfig{IC, C, P} <: AbstractSpaceConfig`

Simulator configurations contain information for all three sampling spaces: the initial condition space, the control space, and the parameter space.

Simulators help users run simulation ensembles over these sampling spaces in parallel. Once a `SimulatorConfig` object is created, it can be called as a function with an FMU or `ODEProblem` as its argument (`kwargs` may also be passed for running the simulations).

Note that Simulator configurations **do not** require all sampling spaces. For example, if a system does not have a control space, then a Simulator configuration can be created without one.

## Experiment Data

Whether the problem definition is Julia code or an FMU, calling a `SimulatorConfig` on it always produces `ExperimentData`. This is the common format needed to surrogatize a model.

What happens if a problem cannot be easily described by its mechanics, but its behaviour can be captured in a dataset? Real-world data can often be captured from sensors, and this data may be useful for constructing surrogate models. In cases like this, a dataset can be converted directly to `ExperimentData`, making surrogate generation possible without a formal mathematical problem definition. This level of flexibility in defining problems creates many new opportunities to generate surrogate models.

`JSSBase.ExperimentData` (Type)

`ExperimentData(dict::AbstractDict)`

Constructs an ExperimentData object using a given dictionary of the following format.

Note that the labels in the dictionary must be exactly as shown.

```
* "states_labels": Vector{String},
* "states": Vector{Matrix{Float64}} with every element matrix being size (state_num, time_num)
* "observable_labels": Vector{String},
* "observables": Vector{Matrix{Float64}} with every element matrix being size (observable_num, time_num)
* "param_labels": Vector{String} every element corresponds to the name of a parameter
* "params": Vector{Vector{Float64}} with every element being a vector of real values
* "control_labels": Vector{String} every element corresponds to the name of a control
* "controls": Vector{Matrix} where every element matrix of shape (state_num, time_num)
* "ts": Vector{Vector} where every element is a vector of real values corresponding to the time steps the simulation was evaluated at
```

Each of `states`, `params`, `controls` and `ts` must have a length equal to the number of trajectories in the experiment.

Each of `states_labels`, `param_labels` and `control_labels` must have a length equal to the number of states, parameters and controls in the experiment, respectively.

Note: If any of the fields `states`, `controls` or `params` does not exist, it (along with the corresponding labels field) must be set to `nothing`.
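The dictionary format above can be sketched in plain Julia as follows. The trajectories, label names, and the `check_lengths` helper are made up for illustration; only the dictionary keys and the `ExperimentData(dict)` call (left commented out because it needs JuliaSimSurrogates) come from the documentation above. Controls are set to `nothing`, following the note on missing fields.

```julia
# Hypothetical experiment: 2 trajectories, 2 states, 1 parameter, no controls.
ntraj, nstates, ntime = 2, 2, 4

ts = [collect(range(0.0, 1.0; length = ntime)) for _ in 1:ntraj]
params = [[0.1], [0.2]]

# Each state matrix has size (nstates, ntime): rows are states, columns are time points.
states = [vcat(exp.(-p[1] .* t)', (p[1] .* t)') for (p, t) in zip(params, ts)]

# One made-up observable per trajectory: the sum of the states, size (1, ntime).
observables = [sum(s; dims = 1) for s in states]

data = Dict{String, Any}(
    "states_labels" => ["x", "y"],
    "states" => states,
    "observable_labels" => ["total"],
    "observables" => observables,
    "param_labels" => ["k"],
    "params" => params,
    "control_labels" => nothing,  # no controls in this experiment
    "controls" => nothing,
    "ts" => ts,
)

# Hypothetical helper: check the length rules stated above before handing the data over.
function check_lengths(d, ntraj)
    per_traj = [d[k] for k in ("states", "params", "controls", "ts") if d[k] !== nothing]
    return all(length(x) == ntraj for x in per_traj) &&
           length(d["states_labels"]) == size(first(d["states"]), 1) &&
           length(d["param_labels"]) == length(first(d["params"]))
end

@assert check_lengths(data, ntraj)

# ed = ExperimentData(data)  # requires JuliaSimSurrogates
```

Validating lengths before construction is a design choice of this sketch; it surfaces shape mistakes in sensor-derived datasets early, independent of whatever checks the library itself performs.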