# Subsampling

Here we demonstrate the subsampling procedure Discretized Density Sampling (DDS). DDS forms part of the subsampling suite in PumasQSP. For more details on other subsamping strategies see API section).

## Julia environment

For this tutorial we will need the following packages:

ModuleDescription
PumasQSPPumasQSP is used to formulate the QSP workflow.
ModelingToolkitThe symbolic modeling environment
DifferentialEquationsThe numerical differential equation solvers
DataSetsWe will load our experimental data from datasets on JuliaHub.
CSVHandling delimited text data
DataFramesHandling tabular data manipulation
Plots and StatsPlotsThe plotting and visualization libraries used for this tutorial.

First we prepare the environment by listing the packages we are using within the example.

# Packages used for this example
using PumasQSP
using ModelingToolkit
using DifferentialEquations
using DataSets
using DataFrames
using CSV
using Distributions
using Plots
using StatsPlots

## Model and Trials setup

We define our model using ModelingToolkit.jl.

# Model states and their default values
@variables begin
t
y1(t) = 1.0
y2(t) = 0.0
y3(t) = 0.0
y4(t) = 0.0
y5(t) = 0.0
y6(t) = 0.0
y7(t) = 0.0
y8(t) = 0.0057
end

# Model parameters and their default values
@parameters begin
k1 = 1.71
k2 = 280.0
k3 = 8.32
k4 = 0.69
k5 = 0.43
k6 = 1.81
end

# Differential operator
D = Differential(t)

# Differential equations
eqs = [
D(y1) ~ (-k1*y1 + k5*y2 + k3*y3 + 0.0007),
D(y2) ~ (k1*y1 - 8.75*y2),
D(y3) ~ (-10.03*y3 + k5*y4 + 0.035*y5),
D(y4) ~ (k3*y2 + k1*y3 - 1.12*y4),
D(y5) ~ (-1.745*y5 + k5*y6 + k5*y7),
D(y6) ~ (-k2*y6*y8 + k4*y4 + k1*y5 - k5*y6 + k4*y7),
D(y7) ~ (k2*y6*y8 - k6*y7),
D(y8) ~ (-k2*y6*y8 + k6*y7)
]

# Model definition
@named model = ODESystem(eqs)
\begin{align} \frac{\mathrm{d} \mathrm{y1}\left( t \right)}{\mathrm{d}t} =& 0.0007 + k5 \mathrm{y2}\left( t \right) + k3 \mathrm{y3}\left( t \right) - k1 \mathrm{y1}\left( t \right) \\ \frac{\mathrm{d} \mathrm{y2}\left( t \right)}{\mathrm{d}t} =& k1 \mathrm{y1}\left( t \right) - 8.75 \mathrm{y2}\left( t \right) \\ \frac{\mathrm{d} \mathrm{y3}\left( t \right)}{\mathrm{d}t} =& - 10.03 \mathrm{y3}\left( t \right) + k5 \mathrm{y4}\left( t \right) + 0.035 \mathrm{y5}\left( t \right) \\ \frac{\mathrm{d} \mathrm{y4}\left( t \right)}{\mathrm{d}t} =& - 1.12 \mathrm{y4}\left( t \right) + k3 \mathrm{y2}\left( t \right) + k1 \mathrm{y3}\left( t \right) \\ \frac{\mathrm{d} \mathrm{y5}\left( t \right)}{\mathrm{d}t} =& - 1.745 \mathrm{y5}\left( t \right) + k5 \mathrm{y6}\left( t \right) + k5 \mathrm{y7}\left( t \right) \\ \frac{\mathrm{d} \mathrm{y6}\left( t \right)}{\mathrm{d}t} =& k1 \mathrm{y5}\left( t \right) + k4 \mathrm{y4}\left( t \right) + k4 \mathrm{y7}\left( t \right) - k5 \mathrm{y6}\left( t \right) - k2 \mathrm{y6}\left( t \right) \mathrm{y8}\left( t \right) \\ \frac{\mathrm{d} \mathrm{y7}\left( t \right)}{\mathrm{d}t} =& - k6 \mathrm{y7}\left( t \right) + k2 \mathrm{y6}\left( t \right) \mathrm{y8}\left( t \right) \\ \frac{\mathrm{d} \mathrm{y8}\left( t \right)}{\mathrm{d}t} =& k6 \mathrm{y7}\left( t \right) - k2 \mathrm{y6}\left( t \right) \mathrm{y8}\left( t \right) \end{align}

Subseqeuntly, we create two clinical trial conditions.

# Clinical condition 1
# Load data using DataSets packages
data1 = dataset("pumasqsptutorials/hires_trial_1")

params=[k6 => 20.0],
alg=DynamicSS(QNDF()),
abstol=1e-10,
reltol=1e-10
)

# Clinical condition 2
# Load data using DataSets packages
data2 = dataset("pumasqsptutorials/hires_trial_2")

# Create the Trial object for trial 2
trial2 = Trial(data2, model;
params = [k3 => 4.0, k4 => 0.4, k6 => 8.0],
tspan = (0.0, 20.0),
abstol = 1e-10,
reltol = 1e-10
)
Trial with parameters:
k3 => 4.0
k4 => 0.4
k6 => 8.0.
timespan: (0.0, 20.0)


Using trial1 and trial2, we then define the InverseProblem with a search space.

# Define a search space
example_ss = [k1 => (1.0, 3.0),
k2 => (200.0, 400.0),
y1 => (0.0, 3.0)
]

# Create trial collection

# Create a InverseProblem object
prob = InverseProblem(trials, model, example_ss)
InverseProblem for model with 2 trials optimizing 3 parameters and states.


Lastly, we load a pre-trained Virtual Population object from a file.

# Read Virtual Population from dataset and import as VPResult
data = dataset("pumasqsptutorials/vp")
df = open(IO, data) do io
end
vp = import_vpop(df, prob)
Virtual population of size 1000.
┌────────────┬───────────┬─────────┐
│      y1(t) │        k1 │      k2 │
├────────────┼───────────┼─────────┤
│  0.0835802 │  0.811822 │   227.6 │
│ 0.00608752 │   2.27361 │ 238.794 │
│  0.0265608 │   1.60797 │  284.08 │
│   0.044584 │  0.357792 │ 262.775 │
│  0.0310739 │   0.83363 │ 399.164 │
│  0.0912289 │  0.875408 │ 385.448 │
│   0.147306 │ 0.0344307 │ 317.839 │
│  0.0881048 │   0.33897 │ 206.566 │
│ 0.00415461 │  0.509154 │ 241.224 │
│  0.0520245 │  0.342263 │  242.29 │
│  0.0394063 │   1.05342 │ 297.387 │
│  0.0814158 │  0.690934 │ 276.569 │
│  0.0102735 │   2.35286 │ 271.876 │
│ 0.00790389 │   2.23132 │ 258.614 │
│ 0.00465267 │   1.58992 │  262.58 │
│  0.0454282 │  0.149752 │   325.1 │
│     ⋮      │     ⋮     │    ⋮    │
└────────────┴───────────┴─────────┘
984 rows omitted


## Subsampling

The next step is to choose a subsampling algorithm, alg and run the subsampling on vp and our chosen trial. In this case we are considering the second condition, trial2 and the Discretized Density Sampling (DDS) algorithm.

# First set the reference distributions for the considered states
dist1 = TransformedBeta(Beta = Beta(2,6), lb = 0.0, ub = 0.12)
dist4 = TransformedBeta(Beta = Beta(2,5), lb = 0.0, ub = 0.002)
dist5 = TransformedBeta(Beta = Beta(3,3), lb = 0.0, ub = 0.0001)

# Each reference distribution is assigned to a state and describes the desired distribution at timepoint t
ref = [
y1 => (dist = dist1, t = 20.0),
y4 => (dist = dist4, t = 20.0),
y5 => (dist = dist5, t = 20.0)
]

# Set the number of patients to subsample and define the DDS algorithm
subsample_population_size = 400
alg = DDS(reference = ref, n = subsample_population_size, nbins = 20)

# Run subsampling for the second condition
vp_sub = subsample(alg, vp, trial2)
Virtual population of size 400.
┌─────────────┬───────────┬─────────┐
│       y1(t) │        k1 │      k2 │
├─────────────┼───────────┼─────────┤
│     0.10148 │  0.284545 │  385.18 │
│   0.0454976 │ 0.0448214 │  309.31 │
│   0.0870328 │  0.132543 │ 318.926 │
│ 0.000195905 │   2.77156 │ 225.468 │
│   0.0974946 │  0.177184 │ 318.623 │
│   0.0844511 │  0.183531 │ 353.477 │
│   0.0236881 │ 0.0557557 │ 323.566 │
│    0.150068 │  0.203784 │ 364.075 │
│    0.100219 │ 0.0545585 │  389.94 │
│   0.0615442 │  0.176576 │ 282.144 │
│   0.0387392 │ 0.0475441 │ 203.244 │
│   0.0619486 │ 0.0599833 │ 350.541 │
│    0.107384 │  0.111168 │ 215.623 │
│   0.0596616 │  0.109347 │ 217.981 │
│    0.141064 │  0.147734 │ 394.873 │
│   0.0472182 │  0.120555 │ 235.432 │
│      ⋮      │     ⋮     │    ⋮    │
└─────────────┴───────────┴─────────┘
384 rows omitted


## Visualization

Finally, we can visualize the result of subsampling, by first simulating trial2 using the original vp and the subsampled vp_sub, and then plotting a histogram of the distribution of the considered states y1, y4 and y5.

sim = solve_ensemble(vp, trial2)
df = [DataFrame(sol) for sol in sim]
data = reduce(append!, df)

sim_sub = solve_ensemble(vp_sub, trial2)
df_sub = [DataFrame(sol) for sol in sim_sub]
data_sub = reduce(append!, df_sub)

plt = map(ref) do r
state = first(r)
ref_info = last(r)
timestamp = ref_info.t
ref_dist = ref_info.dist
state_data = data[(data.timestamp .== timestamp), string(state)]
state_data_sub = data_sub[(data_sub.timestamp .== timestamp), string(state)]
histogram(state_data, normalize = true, nbins = 50, alpha = 0.3, label = "Plausible Population (PP)", title = string(state))
histogram!(state_data_sub, normalize = true, nbins = 50, alpha = 0.3, label = "Subsampled PP - Virtual Population")
histogram!(rand(ref_dist, 1_000), normalize = true, nbins = 50, alpha = 0.3, label = "Reference")
end
plot(plt..., layout = (3, 1), legendfontsize = 7)

As we can see from the plot, the subsampled Virtual Population is closer to the reference distribution for all three states than the original Plausible Population.