Deep Learning Regression — Predicting Biomarker Response from Four Patient Covariates¶

In notebook 01 we predicted clearance from dose with a straight line. A biomarker response is harder: the same dose affects an obese patient differently from a lean one, baseline biomarker level matters, and dose has a saturating sweet-spot. A neural network can learn that bend.

The story, extended¶

Each row is one trial subject, summarised by four covariates f1, f2, f3, f4 (age, BMI, baseline biomarker, dose) and one target y — their on-treatment biomarker response on a 0–10 scale. The covariates interact: the f2 · f3 (BMI × baseline) product reflects metabolic state, and f4 (dose) has a saturating sweet-spot.

Glossary — your field ↔ ML¶

Mental bridge for the deep-learning notebook. Here f1, f2, f3, f4 are four patient-level covariates — say age (f1), BMI (f2), baseline biomarker level (f3), and dose (f4) — and y is the predicted on-treatment biomarker response on a 0–10 scale.

Your field | ML term | Short bridge
--- | --- | ---
A vector of patient covariates per subject | feature vector | Multiple inputs combined into one prediction.
The biomarker-response endpoint | target | What we want to predict.
A flexible response model | neural network | Stack of simple computations learning a smooth function.
Each "perceptron" combining covariates | hidden layer | Intermediate stage in the network.
A nonlinear squashing applied per node | activation function | The curvy step that lets the network bend.
A piecewise-linear "off below 0, on above" | ReLU | Cheapest, most popular nonlinearity.
Computing a predicted response from covariates | forward pass | Pushing inputs through the network.
Working out which covariate weights to nudge | backward pass / backpropagation | Computing the gradient of the loss.
The optimisation routine that updates weights (Adam here) | optimizer | Decides how to use the gradient.
A handful of patients processed together each step | batch | One mini-update of the weights.
Why dose-response is not linear in dose alone | non-linearity | Real responses curve and interact.
In [1]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)
rng = np.random.default_rng(0)

1. The data¶

Each row is one past patient. The four columns are the covariate features f1, f2, f3, f4 (rescaled to 0–1) and the column we want to predict is the target biomarker response y.

From one feature to many¶

We now have a feature vector per observation, not a single number. The model still produces one prediction per observation; it just has more inputs to combine.
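
To make "feature vector" concrete, here is a tiny illustrative sketch (made-up numbers and weights, not the trial data): one subject's four covariates, combined into a single prediction by the simplest possible rule, a weighted sum.

import numpy as np

# One subject's covariates: age, BMI, baseline biomarker, dose (rescaled to 0-1).
x = np.array([0.55, 0.70, 0.30, 0.40])   # illustrative values only

# A linear model combines the whole vector into one number: w . x + b.
w = np.array([0.5, 1.2, 2.0, -0.8])      # hypothetical weights
b = 4.0
prediction = w @ x + b                   # one prediction per feature vector
print(prediction)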

In your field¶

Each row is one subject. The four features f1, f2, f3, f4 are age, BMI, baseline biomarker level, and dose (rescaled to 0–1) and the target y is the on-treatment biomarker response on a 0–10 scale. A covariate-adjusted PK/PD model in NONMEM handles the same inputs.

In [2]:
N = 800
X = rng.uniform(0, 1, size=(N, 4)).astype(np.float32)
f1, f2, f3, f4 = X[:, 0], X[:, 1], X[:, 2], X[:, 3]

# Non-linear ground truth: a smooth function with interactions.
y_true = (
    5.0
    + 3.0 * np.sin(2 * np.pi * f1)
    + 4.0 * (f2 * f3)
    - 6.0 * (f4 - 0.4) ** 2
    + rng.normal(0, 0.3, size=N).astype(np.float32)
)
y = y_true.astype(np.float32)
X.shape, y.shape
Out[2]:
((800, 4), (800,))

A linear dose-response cannot capture this — y is not a straight-line function of any single covariate. We need a model that bends, and that learns interactions between covariates, the way a flexible PK/PD or covariate-effect model must.
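
To quantify that claim, here is a quick baseline check (a sketch, not part of the notebook's analysis): fit an ordinary least-squares plane to the same four covariates and see how much error it leaves behind.

# Ordinary least-squares baseline on the same data (illustrative check).
X1 = np.column_stack([X, np.ones(N, dtype=np.float32)])   # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
lin_pred = X1 @ coef
lin_mse = float(np.mean((lin_pred - y) ** 2))
# The irreducible noise floor is 0.3**2 = 0.09; any gap above it is
# structure (the sine bend, the interaction) the plane cannot explain.
print(f"linear-baseline MSE = {lin_mse:.3f}")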

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — why nonlinearity matters

Explain in plain words why a linear regression cannot capture
a dose-response that has a saturating sweet-spot in dose and
interacts with BMI and baseline biomarker, but a small neural
network with ReLU activations can. Give a concrete two-
covariate toy example. Keep the answer to ~5 minutes of
reading so I can return to my notebook.

Prompt 2 — hidden layers as covariate interactions

A "hidden layer" in a neural network mixes input features
into intermediate variables. If my inputs are four patient
covariates, what kinds of mixtures might a hidden layer
discover that look like the subgroup effects a clinician
would test for by hand (age × dose, BMI × baseline)? Keep
the answer to ~5 minutes of reading so I can return to my
notebook.

2. The model — a small neural network¶

A neural network stacks layers of simple computations. Each hidden layer computes h = activation(W · x + b), where W is a learnable weight matrix, b a learnable bias vector, and the activation function a non-linearity such as ReLU ($\text{ReLU}(z) = \max(0, z)$). Without an activation function, stacking layers would still only give you a linear model.
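
You can verify that last claim directly: compose two linear layers with no activation between them and check that the result is itself a single linear map (a small sketch with arbitrary shapes).

# Two stacked linear layers WITHOUT an activation collapse into one linear layer.
lin1, lin2 = nn.Linear(4, 16), nn.Linear(16, 1)
x = torch.randn(3, 4)

stacked = lin2(lin1(x))

# The equivalent single linear map: W = W2 @ W1, b = W2 @ b1 + b2.
W = lin2.weight @ lin1.weight
b = lin2.weight @ lin1.bias + lin2.bias
collapsed = x @ W.T + b
print(torch.allclose(stacked, collapsed, atol=1e-5))   # True: no flexibility gained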

Our network has 4 inputs → 16 hidden units → 16 hidden units → 1 output (the predicted target).

In your field¶

Each hidden layer is a stack of subgroup detectors the network discovers by itself — combinations like "older patient and high baseline biomarker" that a clinician might test for by hand. The non-linearity between layers is what lets the network capture saturation in dose response and covariate effects that interact.

Worked example¶

Each hidden unit is like one virtual reviewer paying attention to a particular combination of covariates — "older patient and low baseline biomarker", say. The next layer pools the reviewers' opinions into a final response score, much like a covariate-adjusted analysis combines effect estimates by hand.

In [3]:
class RegressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.layers(x).squeeze(-1)

net = RegressionNet()
sum(p.numel() for p in net.parameters())   # total parameters
Out[3]:
369
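
Before training, it is worth sanity-checking the untrained network's forward pass: one prediction per row, whatever the batch size (a quick sketch with dummy inputs).

# Sanity check: a batch of 5 made-up subjects in, 5 predictions out.
dummy = torch.rand(5, 4)            # five dummy covariate vectors in [0, 1)
with torch.no_grad():
    out = net(dummy)
print(out.shape)                    # torch.Size([5]), thanks to .squeeze(-1)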

3. Training — forward pass, loss, backward pass¶

Training one batch of data has three steps:

  1. Forward pass — feed the features through the network to get predictions.
  2. Compute the loss (MSE again) by comparing predictions to targets.
  3. Backward pass (also called backpropagation) — compute the gradient of the loss with respect to every weight, then ask the optimizer to nudge the weights one step.

We repeat for many epochs until the loss stops improving.
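
Before running the full loop, here is a one-parameter sketch of steps 1-3 (illustrative numbers), small enough to verify the gradient by hand:

# One training step on a one-parameter model: pred = w * x.
w = torch.tensor(0.5, requires_grad=True)
x, target = torch.tensor(2.0), torch.tensor(3.0)

pred = w * x                        # 1. forward pass -> 1.0
loss = (pred - target) ** 2         # 2. MSE loss -> 4.0
loss.backward()                     # 3. backward pass
print(w.grad)                       # 2 * (w*x - target) * x = -8.0, matches calculus

with torch.no_grad():
    w -= 0.01 * w.grad              # the optimizer's step, done by hand here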

In your field¶

The forward pass is "given a subject's covariates, predict their biomarker response"; the backward pass / backpropagation is "work out how to nudge every weight so the prediction gets closer to the observed response"; the optimizer (Adam) applies the nudges. One batch is a small group of subjects processed together.

In [4]:
# Train/test split.
idx = rng.permutation(N)
n_train = int(0.8 * N)
train_idx, test_idx = idx[:n_train], idx[n_train:]

X_train = torch.tensor(X[train_idx])
y_train = torch.tensor(y[train_idx])
X_test  = torch.tensor(X[test_idx])
y_test  = torch.tensor(y[test_idx])
In [5]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)

epochs = 300
batch_size = 64
history = []

for epoch in range(epochs):
    perm = torch.randperm(X_train.shape[0])
    epoch_loss = 0.0
    for start in range(0, X_train.shape[0], batch_size):
        b = perm[start:start + batch_size]
        pred = net(X_train[b])                  # forward pass
        loss = loss_fn(pred, y_train[b])
        optimizer.zero_grad()
        loss.backward()                          # backward pass
        optimizer.step()
        epoch_loss += loss.detach().item() * b.shape[0]   # detach: logging only, no gradient needed
    history.append(epoch_loss / X_train.shape[0])

print(f"final train MSE = {history[-1]:.3f}")
final train MSE = 0.184
In [6]:
# Loss curve — average MSE per epoch over our batches of patients.

plt.figure(figsize=(6, 4))
plt.plot(history)
plt.xlabel("epoch")
plt.ylabel("MSE (loss)")
plt.title("Training the regression network")
plt.grid(True, alpha=0.3)
plt.show()
[Figure: training loss curve (MSE per epoch)]

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — forward and backward pass

Walk me through one training step of a small neural network:
forward pass, MSE loss, backward pass, optimizer step. Use a
tiny example with four patient covariates as inputs and a
biomarker response as output. Keep the answer to ~5 minutes
of reading so I can return to my notebook.

Prompt 2 — Adam vs Newton

What does the Adam optimizer do that plain stochastic
gradient descent does not? Frame the answer for someone used
to Newton-type steps in nonlinear mixed-effects software like
NONMEM. Keep the answer to ~5 minutes of reading so I can
return to my notebook.

4. Did it actually learn?¶

We now look at the test data the network has never seen. If the test MSE is close to the training MSE, the model has learned a generalisable rule, not just memorised. If the test MSE is much higher, we have overfitting — the same enemy as in notebook 01, only sneakier in deep models with many parameters.

In your field¶

Overfitting in a flexible biomarker-response model trained on a small Phase II cohort means the network memorises the cohort's quirks rather than learning a generalisable response rule. The warning sign is the usual one: training MSE drifts down while test MSE plateaus or rises.
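
One practical way to watch for that sign, sketched here as a variant of the In [5] loop (assuming a freshly initialised copy of the network; not run in this notebook): record the test MSE at the end of every epoch alongside the training MSE.

# Variant of the training loop that also tracks test MSE per epoch.
net2 = RegressionNet()
opt2 = torch.optim.Adam(net2.parameters(), lr=1e-2)
train_hist, test_hist = [], []
for epoch in range(epochs):
    perm = torch.randperm(X_train.shape[0])
    running = 0.0
    for start in range(0, X_train.shape[0], batch_size):
        b = perm[start:start + batch_size]
        loss = loss_fn(net2(X_train[b]), y_train[b])
        opt2.zero_grad()
        loss.backward()
        opt2.step()
        running += loss.detach().item() * b.shape[0]
    train_hist.append(running / X_train.shape[0])
    with torch.no_grad():                     # evaluation only, no gradients
        test_hist.append(float(loss_fn(net2(X_test), y_test)))
# Overfitting shows up as train_hist still falling while test_hist
# flattens or rises; plotting both curves together makes it obvious.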

In [7]:
with torch.no_grad():
    test_pred = net(X_test)
    test_mse = float(loss_fn(test_pred, y_test))
print(f"test MSE = {test_mse:.3f}")

plt.figure(figsize=(5, 5))
plt.scatter(y_test.numpy(), test_pred.numpy(), alpha=0.6)
lims = [float(y_test.min()) - 0.5, float(y_test.max()) + 0.5]
plt.plot(lims, lims, "k--", alpha=0.5)
plt.xlabel("true target y")
plt.ylabel("predicted target y")
plt.title("Predictions vs truth (test set)")
plt.grid(True, alpha=0.3)
plt.show()
test MSE = 0.161
[Figure: predicted vs true biomarker response on the test set, with the diagonal marking perfect prediction]

Points clustered along the diagonal mean the network has captured the covariate interactions and the saturating sweet-spot in dose. A linear dose-response model on the same patients would leave a much messier scatter — and would systematically miss the dose-saturation bend.

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — overfitting in clinical ML models

What does "overfitting" look like specifically in a flexible
biomarker-response neural network trained on a small Phase
II cohort? Give a concrete sign that a model is memorising
its training set rather than learning a generalisable
response rule. Keep the answer to ~5 minutes of reading so
I can return to my notebook.

Try this yourself¶

  1. Reduce the network to a single nn.Linear(4, 1) (no hidden layer, no activation function). How much worse is the test MSE? Why is this exactly equivalent to a linear regression on the four covariates? (A starting sketch follows this list.)
  2. Increase the learning rate to 0.5. What happens to the loss curve? Relate it to taking unrealistically large parameter steps in NONMEM.
  3. Replace nn.ReLU() with nn.Tanh(). Does it train as well?
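
A starting sketch for exercise 1 (a minimal variant to swap in for RegressionNet, then rerun the training and evaluation cells):

class LinearNet(nn.Module):
    """Exercise 1: no hidden layer, no activation."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)          # y_hat = W x + b and nothing else

    def forward(self, x):
        return self.layer(x).squeeze(-1)

# With no activation anywhere, this model is exactly a linear regression
# on the four covariates, fitted by gradient descent rather than by a
# closed-form least-squares solve.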

Recap — vocabulary you now own¶

On top of notebook 01: neural network, hidden layer, activation function, ReLU, forward pass, backward pass, backpropagation, optimizer, batch, non-linearity.

The training loop — forward, loss, backward, step — is the same loop that powers every modern model, from a four-covariate biomarker predictor like this one to the multi-billion-parameter networks behind some of the newest clinical-decision-support tools.