# Subsampling

While the `vpop`

function returns a `VirtualPopulation`

ensemble of parameter values which correspond to relatively good fits to the data, in many cases this return is referred to as a "plausible patient population", i.e. a set of potentially good parameters which may or may not reflect the statistical effects of the population.

For example, say that for the data series that is being fit, 50% of the population is known to be fast matabolizers and the other 50% is comprised of slow matabolizers (characterized by some parameter or measurement in the model). A virtual population method, `vpop`

, will return a plausible population of `N`

plausible patients but there is no guarentee that it captured the distribution of fast and slow metabolizers from the actual population. The purpose of a `subsample`

algorithm is to downsample from `N`

to `M`

to find a subpopulation which captures important statistical quantities better than the plausible population.

## The `subsample`

Function

`JuliaSimModelOptimizer.subsample`

— Function```
function subsample(alg::AbstractSubsamplingAlgorithm, vp::AbstractVirtualPopulation,
trial::AbstractExperiment; kwargs)
```

Subsamples a plausible patient population `vp`

in order to create a virtual population which satisfies the population-level constraints as specified by the sampling algorithm `alg`

defined on a given `trial`

.

**Keyword Arguments**

Any keyword arguments `kwargs`

provided will be passed on as options to solving the optimization problem that is part of some subsampling methods, e.g. the Allen-Rieger-Musante (ARM) method. For more information about such arguments see the Optimization.jl documentation.

## Subsampling Algorithms

`JuliaSimModelOptimizer.MAPEL`

— Type```
MAPEL(; data::DataFrame, mechanistic_axes, nbins::Int, init_noise,
maxiters=1000, optimizer = NelderMead())
```

The Mechanistic Axes Ensemble Population Linkage (MAPEL) algorithm. MAPEL assigns prevalence weights to each patient in a virtual population in order to match discretized (binned) measure derived from actual patient populations. The algorithm first maps plausible patients to the chosen mechanistic axes and discretizes these axes with the chosen number of bins. MAPEL then assigns a probability to each mechanistic axis bin, and by extent to all plausible patients that belong to that bin. MAPEL optimizes these bin probabilities so that the bin frequencies on the output measure axes match those in the population data.

Finally, when calling `subsample`

with `MAPEL`

, the prevalence weights of plausible patients will be used to subsample the initial virtual population into a smaller population that better matches the data. Patients with higher prevalence weight are more likely to be sampled in the final population.

**Keyword Arguments**

`data`

: A 'DataFrame' containing all information about the discretized output measure. Each row of`data`

is a different output measure including the name of the measure, the time at which the measure was taken, the number of patients that data is derived from and all bin probabilities and bin edges. Column names are not case sensitive. The required column names are :`measure`

: Output measures that were taken from the actual population.`t`

: Time at which a measure was taken.`n`

: Number of patients that were part of each measure.`p1`

,`p2`

,`p3`

, ... : Bin probabilities. There must be as many such columns as bins. The numbering must follow the natural numbers.`e1`

,`e2`

,`e3`

, ... : Bin edges. Each edge is a`Tuple`

of the form`(left_edge, right_edge)`

.

`mechanistic_axes`

: A`Vector`

of the mechanistic axes names. These should match the names provided in the`search_space`

of the`InverseProblem`

.`n`

: Number of plausible patients to subsample to the virtual population.`nbins`

: Number of bins to discretize each mechanistic axis into.`init_noise`

: Standard deviation of the noise that will be added to the initial bin probabilities before optimization. The noise is sampled from a`Normal`

distribution centered at 0. If`init_noise = 0`

(default value) then no noise is added and the initial bin probabilities are uniform over each mechanistic axis.`optimizer`

: an Optimization.jl optimization method. This is used to find the optimal mechanistic axis bin probabilities. See the Optimization.jl documentation for more information on possible optimizers.

**Example**

In the following example there are two output measures, `s1(t)`

and `s2(t)`

and two bins per measure. Hence, `data`

must contain columns `p1`

and `p2`

for bin probabilities and `e1`

and `e2`

for bin edges.

```
df = DataFrame(
"measure" => ["s1(t)", "s2(t)"],
"N" => [4000, 5000],
"p1" => [0.7, 0.66],
"p2" => [0.3, 0.34],
"e1" => [(0.0, 0.35), (0.0, 0.3)],
"e2" => [(0.35, 0.7), (0.3, 0.8)],
"t" => [5, 4.8]
)
alg = MAPEL(
data = df,
mechanistic_axes = ["k1", "c1"],
nbins = 2,
optimizer = NelderMead(),
maxiters = 10_000,
n = 1000,
init_noise = 0.001
)
```

**References**

Schmidt BJ, Casey FP, Paterson T, Chan JR. Alternate virtual populations elucidate the type I interferon signature predictive of the response to rituximab in rheumatoid arthritis. BMC Bioinformatics. 2013 Jul 10;14:221. doi: 10.1186/1471-2105-14-221. PMID: 23841912; PMCID: PMC3717130.

`JuliaSimModelOptimizer.ARM`

— Type```
ARM(; data::DataFrame, bw=fill(0.5, length(save_names)), n_neighbors::Int=5,
optimizer=StochGlobalOpt(maxiters=100))
```

The Allen-Rieger-Musante (ARM) subsampling algorithm.

**Keyword Arguments**

`data`

:`DataFrame`

containing data from the reference popultion, which is used to construct the probability density function and the cumulative density functions of the population.`bw`

: Real number or vector of length`= length(data)`

. Bandwidth of the kernel density function that will be fitted on`data`

. This is a non-negative quantity that determines how much smoothing will be applied to each data dimension when producing the kernel density function. Higher values of`bw`

correspond to greater smoothing. Defaults to 0.5 for all dimensions. See MultiKDE.jl for an example.`optimizer`

: an Optimization.jl optimization method. This is used to find the optimal prior probability of including a plausible patient into the virtual population. See the Optimization.jl documentation for more information on possible optimizers.`n_neighbors`

: Number of neighboring points around each point in model output space that are considered when estimating the probability density function. The density at each point is estimated as`n_neighbors/V`

, where`V`

is the volume of a hypersphere containing the point and`n_neighbors`

around it.

**Reference**

Allen RJ, Rieger TR, Musante CJ. Efficient Generation and Selection of Virtual Populations in Quantitative Systems Pharmacology Models. CPT Pharmacometrics Syst Pharmacol. 2016 Mar;5(3):140-6. doi: 10.1002/psp4.12063. Epub 2016 Mar 17. PMID: 27069777; PMCID: PMC4809626.

`JuliaSimModelOptimizer.DDS`

— Type`DDS(; reference, n::Int, nbins::Int)`

Discretized Density Sampling (DDS) performs subsampling by matching the histogram of each model state under consideration to a reference histogram. The method matches the frequency of plausible patients that fall within each bin to the frequency of the same bin in the reference distribution. This way, the final Virtual Population has bin frequencies (or probabilities) equal to those of the reference distribution for each one of the considered model states.

**Keyword Arguments**

`reference`

:`Vector`

of pairs. The first element of each pair is a model state and the second element is a`NamedTuple`

containing two fields:`dist`

: a reference distribution that the algorithm will match for the given state. This could be any distribution, see`TransformedBeta`

or the Distributions.jl documentation) for more options.`t`

: the timepoint along the state trajectory where the state should match the reference`dist`

. This does not have to coincide with a`saveat`

timepoint of the considered trial.

`n`

: Number of plausible patients to subsample to the virtual population.`nbins`

: Number of bins to discretize the reference distributions into.

**Example**

State `y1`

at timepoint 5 [modelled units of time] should be distributed as a `Beta(2,6)`

distribution within the bounds `[1, 5]`

, which is a `TransformedBeta`

distribution, and state `y2`

should be distributed as an `InverseGamma(2,2)`

at the 10 timepoint. The number of patients is set to 100 and number of bins for each state is 20.

```
ref = [
y1 => (dist=TransformedBeta(Beta=Beta(2,6), lb=2, ub=6), t=5),
y2 => (dist=InverseGamma(2,2), t=10)
]
alg = DDS(reference=ref, n=100, nbins=20)
```

One can easily visualize what a reference distribution looks like for a given bin size by running

```
using StatsPlots
reference_distribution = Normal(0,1)
number_of_bins = 20
number_of_samples = 1000
histogram(rand(reference_distribution, number_of_samples), nbins=number_of_bins)
```

`JuliaSimModelOptimizer.RefWeights`

— Type`RefWeights(; binning, reference_weights, n::Int)`

A reference-weight based algorithm that calculates assigns weigths to each plausible patient and then uses them to subsample the plausible population into a virtual population. Uses a binning function with reference weights for each bin to choose a subsample of the plausible patient population which bins with the same frequency as the reference.

**Keyword Arguments**

`binning`

: A function`binning(sol)`

which returns an integer representing the bin the patient belongs to.`reference_weights`

: A vector containing the probability of a patient belonging to each bin in the actual population. These weights need to sum to 1.`n`

: Number of plausible patients to subsample to the virtual population.

**References**

Schmidt BJ, Casey FP, Paterson T, Chan JR. Alternate virtual populations elucidate the type I interferon signature predictive of the response to rituximab in rheumatoid arthritis. BMC Bioinformatics. 2013 Jul 10;14:221. doi: 10.1186/1471-2105-14-221. PMID: 23841912; PMCID: PMC3717130.

## Accessing the Sampler

Each subsampling algorithm internally uses a `Sampler`

object to perform subsampling on a plausible population `vp`

. One can call the `subsample`

method multiple times to rerun the subsampling process and produce a different population each time. However, this might be computationally expensive, as methods like `ARM`

need to solve an optimization problem before producing a subsampled population.

Users can access the internal `Sampler`

object directly and call it multiple times to generate different populations. The `Sampler`

is initialized with all the necessary information that the respective subsampling algorithm needs. Thus, by using this object, one can avoid the computational cost of running the entire `subsample`

method multiple times, if multiple subsampled populations are required.

`JuliaSimModelOptimizer.get_sampler`

— Function```
function get_sampler(alg::AbstractSubsamplingAlgorithm, vp::AbstractVirtualPopulation,
trial::AbstractExperiment; kwargs)
```

Returns a sampler that, when called, produces a subset of indices from the plausible patient population `vp`

. The sampled indices satisfy the population-level constraints as specified by the sampling algorithm `alg`

defined on a given `trial`

.

**Keyword Arguments**

Any keyword arguments `kwargs`

provided will be passed on as options to solving the optimization problem that is part of some subsampling methods, e.g. the Allen-Rieger-Musante (ARM) method. For more information about such arguments see the Optimization.jl documentation.

**Example**

```
vp = vpop(prob, alg; population_size = 100)
subsample_alg = ARM(data, save_names)
sampler = get_sampler(subsample_alg, vp, trial)
vp_subsampled = vp[sampler()]
```