Subsampling
While the vpop
function returns a VirtualPopulation
ensemble of parameter values which correspond to relatively good fits to the data, in many cases this return is referred to as a "plausible patient population", i.e. a set of potentially good parameters which may or may not reflect the statistical effects of the population.
For example, say that for the data series that is being fit, 50% of the population is known to be fast matabolizers and the other 50% is comprised of slow matabolizers (characterized by some parameter or measurement in the model). A virtual population method, vpop
, will return a plausible population of N
plausible patients but there is no guarentee that it captured the distribution of fast and slow metabolizers from the actual population. The purpose of a subsample
algorithm is to downsample from N
to M
to find a subpopulation which captures important statistical quantities better than the plausible population.
The subsample
Function
JuliaSimModelOptimizer.subsample
— Functionfunction subsample(alg::AbstractSubsamplingAlgorithm, vp::AbstractVirtualPopulation,
trial::AbstractExperiment; kwargs)
Subsamples a plausible patient population vp
in order to create a virtual population which satisfies the population-level constraints as specified by the sampling algorithm alg
defined on a given trial
.
Keyword Arguments
Any keyword arguments kwargs
provided will be passed on as options to solving the optimization problem that is part of some subsampling methods, e.g. the Allen-Rieger-Musante (ARM) method. For more information about such arguments see the Optimization.jl documentation.
Subsampling Algorithms
JuliaSimModelOptimizer.MAPEL
— TypeMAPEL(; data::DataFrame, mechanistic_axes, nbins::Int, init_noise,
maxiters=1000, optimizer = NelderMead())
The Mechanistic Axes Ensemble Population Linkage (MAPEL) algorithm. MAPEL assigns prevalence weights to each patient in a virtual population in order to match discretized (binned) measure derived from actual patient populations. The algorithm first maps plausible patients to the chosen mechanistic axes and discretizes these axes with the chosen number of bins. MAPEL then assigns a probability to each mechanistic axis bin, and by extent to all plausible patients that belong to that bin. MAPEL optimizes these bin probabilities so that the bin frequencies on the output measure axes match those in the population data.
Finally, when calling subsample
with MAPEL
, the prevalence weights of plausible patients will be used to subsample the initial virtual population into a smaller population that better matches the data. Patients with higher prevalence weight are more likely to be sampled in the final population.
Keyword Arguments
data
: A 'DataFrame' containing all information about the discretized output measure. Each row ofdata
is a different output measure including the name of the measure, the time at which the measure was taken, the number of patients that data is derived from and all bin probabilities and bin edges. Column names are not case sensitive. The required column names are :measure
: Output measures that were taken from the actual population.t
: Time at which a measure was taken.n
: Number of patients that were part of each measure.p1
,p2
,p3
, ... : Bin probabilities. There must be as many such columns as bins. The numbering must follow the natural numbers.e1
,e2
,e3
, ... : Bin edges. Each edge is aTuple
of the form(left_edge, right_edge)
.
mechanistic_axes
: AVector
of the mechanistic axes names. These should match the names provided in thesearch_space
of theInverseProblem
.n
: Number of plausible patients to subsample to the virtual population.nbins
: Number of bins to discretize each mechanistic axis into.init_noise
: Standard deviation of the noise that will be added to the initial bin probabilities before optimization. The noise is sampled from aNormal
distribution centered at 0. Ifinit_noise = 0
(default value) then no noise is added and the initial bin probabilities are uniform over each mechanistic axis.optimizer
: an Optimization.jl optimization method. This is used to find the optimal mechanistic axis bin probabilities. See the Optimization.jl documentation for more information on possible optimizers.
Example
In the following example there are two output measures, s1(t)
and s2(t)
and two bins per measure. Hence, data
must contain columns p1
and p2
for bin probabilities and e1
and e2
for bin edges.
df = DataFrame(
"measure" => ["s1(t)", "s2(t)"],
"N" => [4000, 5000],
"p1" => [0.7, 0.66],
"p2" => [0.3, 0.34],
"e1" => [(0.0, 0.35), (0.0, 0.3)],
"e2" => [(0.35, 0.7), (0.3, 0.8)],
"t" => [5, 4.8]
)
alg = MAPEL(
data = df,
mechanistic_axes = ["k1", "c1"],
nbins = 2,
optimizer = NelderMead(),
maxiters = 10_000,
n = 1000,
init_noise = 0.001
)
References
Schmidt BJ, Casey FP, Paterson T, Chan JR. Alternate virtual populations elucidate the type I interferon signature predictive of the response to rituximab in rheumatoid arthritis. BMC Bioinformatics. 2013 Jul 10;14:221. doi: 10.1186/1471-2105-14-221. PMID: 23841912; PMCID: PMC3717130.
JuliaSimModelOptimizer.ARM
— TypeARM(; data::DataFrame, bw=fill(0.5, length(save_names)), n_neighbors::Int=5,
optimizer=StochGlobalOpt(maxiters=100))
The Allen-Rieger-Musante (ARM) subsampling algorithm.
Keyword Arguments
data
:DataFrame
containing data from the reference popultion, which is used to construct the probability density function and the cumulative density functions of the population.bw
: Real number or vector of length= length(data)
. Bandwidth of the kernel density function that will be fitted ondata
. This is a non-negative quantity that determines how much smoothing will be applied to each data dimension when producing the kernel density function. Higher values ofbw
correspond to greater smoothing. Defaults to 0.5 for all dimensions. See MultiKDE.jl for an example.optimizer
: an Optimization.jl optimization method. This is used to find the optimal prior probability of including a plausible patient into the virtual population. See the Optimization.jl documentation for more information on possible optimizers.n_neighbors
: Number of neighboring points around each point in model output space that are considered when estimating the probability density function. The density at each point is estimated asn_neighbors/V
, whereV
is the volume of a hypersphere containing the point andn_neighbors
around it.
Reference
Allen RJ, Rieger TR, Musante CJ. Efficient Generation and Selection of Virtual Populations in Quantitative Systems Pharmacology Models. CPT Pharmacometrics Syst Pharmacol. 2016 Mar;5(3):140-6. doi: 10.1002/psp4.12063. Epub 2016 Mar 17. PMID: 27069777; PMCID: PMC4809626.
JuliaSimModelOptimizer.DDS
— TypeDDS(; reference, n::Int, nbins::Int)
Discretized Density Sampling (DDS) performs subsampling by matching the histogram of each model state under consideration to a reference histogram. The method matches the frequency of plausible patients that fall within each bin to the frequency of the same bin in the reference distribution. This way, the final Virtual Population has bin frequencies (or probabilities) equal to those of the reference distribution for each one of the considered model states.
Keyword Arguments
reference
:Vector
of pairs. The first element of each pair is a model state and the second element is aNamedTuple
containing two fields:dist
: a reference distribution that the algorithm will match for the given state. This could be any distribution, seeTransformedBeta
or the Distributions.jl documentation) for more options.t
: the timepoint along the state trajectory where the state should match the referencedist
. This does not have to coincide with asaveat
timepoint of the considered trial.
n
: Number of plausible patients to subsample to the virtual population.nbins
: Number of bins to discretize the reference distributions into.
Example
State y1
at timepoint 5 [modelled units of time] should be distributed as a Beta(2,6)
distribution within the bounds [1, 5]
, which is a TransformedBeta
distribution, and state y2
should be distributed as an InverseGamma(2,2)
at the 10 timepoint. The number of patients is set to 100 and number of bins for each state is 20.
ref = [
y1 => (dist=TransformedBeta(Beta=Beta(2,6), lb=2, ub=6), t=5),
y2 => (dist=InverseGamma(2,2), t=10)
]
alg = DDS(reference=ref, n=100, nbins=20)
One can easily visualize what a reference distribution looks like for a given bin size by running
using StatsPlots
reference_distribution = Normal(0,1)
number_of_bins = 20
number_of_samples = 1000
histogram(rand(reference_distribution, number_of_samples), nbins=number_of_bins)
JuliaSimModelOptimizer.RefWeights
— TypeRefWeights(; binning, reference_weights, n::Int)
A reference-weight based algorithm that calculates assigns weigths to each plausible patient and then uses them to subsample the plausible population into a virtual population. Uses a binning function with reference weights for each bin to choose a subsample of the plausible patient population which bins with the same frequency as the reference.
Keyword Arguments
binning
: A functionbinning(sol)
which returns an integer representing the bin the patient belongs to.reference_weights
: A vector containing the probability of a patient belonging to each bin in the actual population. These weights need to sum to 1.n
: Number of plausible patients to subsample to the virtual population.
References
Schmidt BJ, Casey FP, Paterson T, Chan JR. Alternate virtual populations elucidate the type I interferon signature predictive of the response to rituximab in rheumatoid arthritis. BMC Bioinformatics. 2013 Jul 10;14:221. doi: 10.1186/1471-2105-14-221. PMID: 23841912; PMCID: PMC3717130.
Accessing the Sampler
Each subsampling algorithm internally uses a Sampler
object to perform subsampling on a plausible population vp
. One can call the subsample
method multiple times to rerun the subsampling process and produce a different population each time. However, this might be computationally expensive, as methods like ARM
need to solve an optimization problem before producing a subsampled population.
Users can access the internal Sampler
object directly and call it multiple times to generate different populations. The Sampler
is initialized with all the necessary information that the respective subsampling algorithm needs. Thus, by using this object, one can avoid the computational cost of running the entire subsample
method multiple times, if multiple subsampled populations are required.
JuliaSimModelOptimizer.get_sampler
— Functionfunction get_sampler(alg::AbstractSubsamplingAlgorithm, vp::AbstractVirtualPopulation,
trial::AbstractExperiment; kwargs)
Returns a sampler that, when called, produces a subset of indices from the plausible patient population vp
. The sampled indices satisfy the population-level constraints as specified by the sampling algorithm alg
defined on a given trial
.
Keyword Arguments
Any keyword arguments kwargs
provided will be passed on as options to solving the optimization problem that is part of some subsampling methods, e.g. the Allen-Rieger-Musante (ARM) method. For more information about such arguments see the Optimization.jl documentation.
Example
vp = vpop(prob, alg; population_size = 100)
subsample_alg = ARM(data, save_names)
sampler = get_sampler(subsample_alg, vp, trial)
vp_subsampled = vp[sampler()]