# Subsampling Parameter Ensembles

While the parametric_uq function returns a VirtualPopulation ensemble of parameter values which correspond to relatively good fits to the data, in many cases this return is referred to as a "plausible population", i.e. a set of potentially good parameters which may or may not reflect the statistical effects of the population.

For example, say that for the data series that is being fit, 50% of the population is known to be fast matabolizers and the other 50% is comprised of slow matabolizers (characterized by some parameter or measurement in the model). A virtual population method, vpop, will return a plausible population of N plausible parameters but there is no guarentee that it captured the distribution of fast and slow metabolizers from the actual population. The purpose of a subsample algorithm is to downsample from N to M to find a subpopulation which captures important statistical quantities better than the plausible population.

## The subsample Function

JuliaSimModelOptimizer.subsampleFunction
function subsample(alg::AbstractSubsamplingAlgorithm, vp::AbstractVirtualPopulation,
trial::AbstractExperiment; kwargs)

Subsamples a plausible patient population vp in order to create a virtual population which satisfies the population-level constraints as specified by the sampling algorithm alg defined on a given trial.

Keyword Arguments

Any keyword arguments kwargs provided will be passed on as options to solving the optimization problem that is part of some subsampling methods, e.g. the Allen-Rieger-Musante (ARM) method. For more information about such arguments see the Optimization.jl documentation.

source

## Subsampling Algorithms

JuliaSimModelOptimizer.DDSType
DDS(; reference, n::Int, nbins::Int)

Discretized Density Sampling (DDS) performs subsampling by matching the histogram of each model state under consideration to a reference histogram. The method matches the frequency of plausible patients that fall within each bin to the frequency of the same bin in the reference distribution. This way, the final Virtual Population has bin frequencies (or probabilities) equal to those of the reference distribution for each one of the considered model states.

Keyword Arguments

• reference: Vector of pairs. The first element of each pair is a model state and the second element is a NamedTuple containing two fields:
• dist: a reference distribution that the algorithm will match for the given state. This could be any distribution, see TransformedBeta or the Distributions.jl documentation) for more options.
• t: the timepoint along the state trajectory where the state should match the reference dist. This does not have to coincide with a saveat timepoint of the considered trial.
• n: Number of plausible patients to subsample to the virtual population.
• nbins: Number of bins to discretize the reference distributions into.

Example

State y1 at timepoint 5 [modelled units of time] should be distributed as a Beta(2,6) distribution within the bounds [1, 5], which is a TransformedBeta distribution, and state y2 should be distributed as an InverseGamma(2,2) at the 10 timepoint. The number of patients is set to 100 and number of bins for each state is 20.

ref = [
y1 => (dist=TransformedBeta(Beta=Beta(2,6), lb=2, ub=6), t=5),
y2 => (dist=InverseGamma(2,2), t=10)
]
alg = DDS(reference=ref, n=100, nbins=20)
Tip : Visualizing distributions

One can easily visualize what a reference distribution looks like for a given bin size by running

using StatsPlots

reference_distribution = Normal(0,1)
number_of_bins = 20
number_of_samples = 1000
histogram(rand(reference_distribution, number_of_samples), nbins=number_of_bins)
source
JuliaSimModelOptimizer.ARMType
ARM(; data::DataFrame, bw=fill(0.5, length(save_names)), n_neighbors::Int=5,
optimizer=StochGlobalOpt(maxiters=100))

The Allen-Rieger-Musante (ARM) subsampling algorithm.

Keyword Arguments

• data: DataFrame containing data from the reference popultion, which is used to construct the probability density function and the cumulative density functions of the population.
• bw: Real number or vector of length = length(data). Bandwidth of the kernel density function that will be fitted on data. This is a non-negative quantity that determines how much smoothing will be applied to each data dimension when producing the kernel density function. Higher values of bw correspond to greater smoothing. Defaults to 0.5 for all dimensions. See MultiKDE.jl for an example.
• optimizer: an Optimization.jl optimization method. This is used to find the optimal prior probability of including a plausible patient into the virtual population. See the Optimization.jl documentation for more information on possible optimizers.
• n_neighbors: Number of neighboring points around each point in model output space that are considered when estimating the probability density function. The density at each point is estimated as n_neighbors/V, where V is the volume of a hypersphere containing the point and n_neighbors around it.

Reference

Allen RJ, Rieger TR, Musante CJ. Efficient Generation and Selection of Virtual Populations in Quantitative Systems Pharmacology Models. CPT Pharmacometrics Syst Pharmacol. 2016 Mar;5(3):140-6. doi: 10.1002/psp4.12063. Epub 2016 Mar 17. PMID: 27069777; PMCID: PMC4809626.

source
JuliaSimModelOptimizer.MAPELType
MAPEL(; data::DataFrame, mechanistic_axes, nbins::Int, init_noise,
maxiters=1000, optimizer = NelderMead())

The Mechanistic Axes Ensemble Population Linkage (MAPEL) algorithm. MAPEL assigns prevalence weights to each patient in a virtual population in order to match discretized (binned) measure derived from actual patient populations. The algorithm first maps plausible patients to the chosen mechanistic axes and discretizes these axes with the chosen number of bins. MAPEL then assigns a probability to each mechanistic axis bin, and by extent to all plausible patients that belong to that bin. MAPEL optimizes these bin probabilities so that the bin frequencies on the output measure axes match those in the population data.

Finally, when calling subsample with MAPEL, the prevalence weights of plausible patients will be used to subsample the initial virtual population into a smaller population that better matches the data. Patients with higher prevalence weight are more likely to be sampled in the final population.

Keyword Arguments

• data: A 'DataFrame' containing all information about the discretized output measure. Each row of data is a different output measure including the name of the measure, the time at which the measure was taken, the number of patients that data is derived from and all bin probabilities and bin edges. Column names are not case sensitive. The required column names are :
• measure : Output measures that were taken from the actual population.
• t: Time at which a measure was taken.
• n: Number of patients that were part of each measure.
• p1, p2, p3, ... : Bin probabilities. There must be as many such columns as bins. The numbering must follow the natural numbers.
• e1, e2, e3, ... : Bin edges. Each edge is a Tuple of the form (left_edge, right_edge).
There must be as many such columns as bins. The numbering must follow the natural numbers.
• mechanistic_axes : A Vector of the mechanistic axes names. These should match the names provided in the search_space of the InverseProblem.
• n: Number of plausible patients to subsample to the virtual population.
• nbins: Number of bins to discretize each mechanistic axis into.
• init_noise: Standard deviation of the noise that will be added to the initial bin probabilities before optimization. The noise is sampled from a Normal distribution centered at 0. If init_noise = 0 (default value) then no noise is added and the initial bin probabilities are uniform over each mechanistic axis.
• optimizer: an Optimization.jl optimization method. This is used to find the optimal mechanistic axis bin probabilities. See the Optimization.jl documentation for more information on possible optimizers.

Example

In the following example there are two output measures, s1(t) and s2(t) and two bins per measure. Hence, data must contain columns p1 and p2 for bin probabilities and e1 and e2 for bin edges.

df = DataFrame(
"measure" => ["s1(t)", "s2(t)"],
"N" => [4000, 5000],
"p1" => [0.7, 0.66],
"p2" => [0.3, 0.34],
"e1" => [(0.0, 0.35), (0.0, 0.3)],
"e2" => [(0.35, 0.7), (0.3, 0.8)],
"t" => [5, 4.8]
)

alg = MAPEL(
data = df,
mechanistic_axes = ["k1", "c1"],
nbins = 2,
maxiters = 10_000,
n = 1000,
init_noise = 0.001
)


References

Schmidt BJ, Casey FP, Paterson T, Chan JR. Alternate virtual populations elucidate the type I interferon signature predictive of the response to rituximab in rheumatoid arthritis. BMC Bioinformatics. 2013 Jul 10;14:221. doi: 10.1186/1471-2105-14-221. PMID: 23841912; PMCID: PMC3717130.

source
JuliaSimModelOptimizer.RefWeightsType
RefWeights(; binning, reference_weights, n::Int)

A reference-weight based algorithm that calculates assigns weigths to each plausible patient and then uses them to subsample the plausible population into a virtual population. Uses a binning function with reference weights for each bin to choose a subsample of the plausible patient population which bins with the same frequency as the reference.

Keyword Arguments

• binning: A function binning(sol) which returns an integer representing the bin the patient belongs to.
• reference_weights: A vector containing the probability of a patient belonging to each bin in the actual population. These weights need to sum to 1.
• n: Number of plausible patients to subsample to the virtual population.

References

Schmidt BJ, Casey FP, Paterson T, Chan JR. Alternate virtual populations elucidate the type I interferon signature predictive of the response to rituximab in rheumatoid arthritis. BMC Bioinformatics. 2013 Jul 10;14:221. doi: 10.1186/1471-2105-14-221. PMID: 23841912; PMCID: PMC3717130.

source

## Accessing the Sampler

Each subsampling algorithm internally uses a Sampler object to perform subsampling on a plausible population vp. One can call the subsample method multiple times to rerun the subsampling process and produce a different population each time. However, this might be computationally expensive, as methods like ARM need to solve an optimization problem before producing a subsampled population.

Users can access the internal Sampler object directly and call it multiple times to generate different populations. The Sampler is initialized with all the necessary information that the respective subsampling algorithm needs. Thus, by using this object, one can avoid the computational cost of running the entire subsample method multiple times, if multiple subsampled populations are required.

JuliaSimModelOptimizer.get_samplerFunction
function get_sampler(alg::AbstractSubsamplingAlgorithm, vp::AbstractVirtualPopulation,
trial::AbstractExperiment; kwargs)

Returns a sampler that, when called, produces a subset of indices from the plausible patient population vp. The sampled indices satisfy the population-level constraints as specified by the sampling algorithm alg defined on a given trial.

Keyword Arguments

Any keyword arguments kwargs provided will be passed on as options to solving the optimization problem that is part of some subsampling methods, e.g. the Allen-Rieger-Musante (ARM) method. For more information about such arguments see the Optimization.jl documentation.

Example

vp = vpop(prob, alg; population_size = 100)
subsample_alg = ARM(data, save_names)
sampler = get_sampler(subsample_alg, vp, trial)
vp_subsampled = vp[sampler()]
source