Simple Linear Regression — Estimating Drug Clearance from Dose¶

A first machine learning notebook, framed in the language of clinical pharmacology. We will treat administered dose as the input and plasma drug clearance as the quantity to predict — the linear regime of a dose-clearance relationship. The definitions, formulas, and code stay the same for every learner; only this story changes.

The story¶

From a small Phase I PK study you have, for each patient, the administered dose and the measured plasma clearance. From those past patients you have a list of (dose, clearance) pairs.

Your job in this notebook: learn a simple linear rule that turns a fresh dose into a predicted clearance, the same way you would summarise a Phase I cohort with a linear PK fit before deciding whether to model saturation explicitly.

Glossary — your field ↔ ML¶

Mental bridge between clinical pharmacology and the ML vocabulary used below. In this notebook, x is administered drug dose (in mg) and y is plasma drug clearance (in L/h) for one patient — the linear regime of a dose-clearance relationship.

| Your field | ML term | Short bridge |
| --- | --- | --- |
| Dose, the controlled exposure | feature | The input the model reads. |
| Clearance, the measured PK endpoint | target | The quantity to predict. |
| Patients in your derivation cohort | training data | The cohort the model is fit on. |
| A held-out validation cohort | test data | Patients the model never sees during fitting. |
| Splitting a trial into derivation + validation halves | train/test split | Protocol for honest evaluation. |
| The dose-response model you specify (e.g. linear, Emax) | model | The functional form whose constants you tune. |
| Slope and intercept of a linear PK fit | parameters / weights | The numbers learning adjusts. |
| The intercept (e.g. baseline clearance) | bias | The constant term of the prediction. |
| A predicted clearance for a fresh dose | prediction | What the model outputs for new inputs. |
| (observed − fitted) clearance per patient | residual | Per-patient error after fitting. |
| The cost function nls() or nlme minimises | loss | Single number measuring total wrongness. |
| Mean squared residual across patients | mean squared error / MSE | Most common regression loss. |
| ∂(SSR)/∂(parameter) | gradient | Direction of steepest cost increase. |
| Iterative nonlinear least-squares step | gradient descent | Tiny parameter updates that follow the gradient downhill. |
| The step size in that iterative fit | learning rate | How bold each update is. |
| One full pass over your patients | epoch | One sweep through training data. |
| An Emax fit that follows individual outliers | overfitting | Model fits noise, generalises poorly. |
| A linear fit applied to a saturating dose-response | underfitting | Model too simple to catch the pattern. |
In [1]:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

1. Setting the scene¶

Each observation is one patient: an (x, y) pair where x is dose (mg) and y is clearance (L/h). The glossary above gives the unit mapping in full. The next section defines the core ML vocabulary.

Core vocabulary¶

  • feature — the input variable, here x.
  • target — the quantity we want to predict, here y.
  • A dataset is a collection of (feature, target) pairs.
  • training data — the subset used to fit the model.
  • test data — the subset held back to evaluate the model honestly.
  • train/test split — the act of partitioning the dataset into training data and test data.

In your field¶

Here the feature x is dose (mg) and the target y is plasma clearance (L/h). Your training data is the past Phase I cohort whose PK profiles you already analysed; the test data is the dose levels you held back to validate the dose-response rule. The split between "model-building" and "validation" cohorts is a train/test split.

In [2]:
# Synthetic ground truth: y = 0.25 * x + 1.0 + small noise
true_slope = 0.25
true_intercept = 1.0

x_data = rng.uniform(10, 200, size=80)
noise = rng.normal(0, 3.0, size=x_data.shape)
y_data = true_slope * x_data + true_intercept + noise
In [3]:
# Each dot is one past patient — administered dose vs measured clearance.

plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data, alpha=0.7)
plt.xlabel("feature x")
plt.ylabel("target y")
plt.title("Past observations")
plt.grid(True, alpha=0.3)
plt.show()
[Figure: scatter plot of past observations — dose (feature x) vs clearance (target y)]

What pattern do you see?¶

The cloud of points slopes upward — higher dose, higher measured clearance — like the linear region of a dose-clearance plot before saturation kicks in. Our goal is to draw the best straight line through that cloud so that for a new patient at a new dose we can predict clearance.

Go deeper with an LLM (optional — skip if you already know this)¶

Paste any prompt below into ChatGPT / Claude / Mistral, explore, then come back.

Prompt 1 — feature vs target in a clinical study

In machine learning, what is the difference between a "feature"
and a "target"? Apply the distinction to a Phase I PK study
where I record dose and measured clearance per patient. Which
is the feature, which is the target, and how does this map
onto the predictor / outcome split I am used to in regression
models in R? Keep the answer to ~5 minutes of reading so I
can return to my notebook.

Prompt 2 — why hold out test data

Why do machine learning workflows split data into "training
data" and "test data" before fitting, and why is this similar
in spirit to a derivation / validation cohort design in a
clinical study? Give one realistic case where skipping the
train/test split would let me fool myself. Keep the answer to
~5 minutes of reading so I can return to my notebook.

2. The model — linear regression¶

A linear regression model is the rule

$$ \hat{y} = w \cdot x + b $$

where $x$ is the feature, $\hat{y}$ is the prediction, and $w$ and $b$ are numbers the model learns. They are called the model's parameters (or weights, with $b$ specifically called the bias). A small set of parameters that fits a lot of data — that is the whole idea of regression.

In your field¶

The slope w is the per-mg increment in clearance; the bias b is the clearance extrapolated to a notional zero dose (a fitting constant, not a physiological quantity). The prediction $\hat{y} = w \cdot x + b$ is the linear-in-dose part of a PK exposure-response model — useful when the dose range is well below the saturation regime of an Emax curve.

Worked example¶

With w = 0.24 and b = 1.1, for x = 100 mg the predicted clearance is 0.24 · 100 + 1.1 = 25.1 L/h. The slope is the per-mg increment in clearance; the intercept is the baseline clearance at zero dose — the same two numbers you would report from a linear PK summary.
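The arithmetic above can be checked in one line. A minimal sketch; the helper name `predict_clearance` is illustrative, not part of the notebook's cells:

```python
def predict_clearance(x_mg, w=0.24, b=1.1):
    """Predicted clearance (L/h) for a dose in mg, using the linear rule y_hat = w*x + b."""
    return w * x_mg + b

print(predict_clearance(100))   # 0.24 * 100 + 1.1 = 25.1 L/h
```

The same function with the learned parameters from the cell below would give the model's actual predictions.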

In [4]:
from sklearn.linear_model import LinearRegression

X = x_data.reshape(-1, 1)   # sklearn expects a 2-D feature matrix
y = y_data

model = LinearRegression().fit(X, y)
learned_slope = float(model.coef_[0])
learned_intercept = float(model.intercept_)
learned_slope, learned_intercept
Out[4]:
(0.25734906301152605, -0.15046939018604277)

The two numbers above are the learned parameters. Compare them with the ground-truth slope (0.25) and intercept (1.0): the slope is recovered almost exactly, while the intercept is off because with noise of standard deviation 3.0 a small constant offset is hard to pin down from 80 patients. A linear PK fit behaves the same way — a noisy cohort constrains the dose-response slope far more tightly than the extrapolated baseline.

3. How good is the fit? — the loss¶

For each training point we compute the residual, the gap between the true target and the prediction:

$$ r_i = y_i - \hat{y}_i $$

Squaring and averaging gives the mean squared error (MSE), the most common loss for regression:

$$ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $$

A smaller loss means a better fit. "Training a model" is the search for parameters that minimise the loss.
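As a sanity check, the formula above matches scikit-learn's built-in `mean_squared_error` helper; the toy arrays here are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([2.0, 4.0, 6.0])
y_hat = np.array([2.5, 3.5, 6.0])

# Residuals are (0.5, -0.5, 0.0), so MSE = (0.25 + 0.25 + 0) / 3.
mse_by_hand = float(np.mean((y_true - y_hat) ** 2))
mse_sklearn = float(mean_squared_error(y_true, y_hat))
print(mse_by_hand, mse_sklearn)
```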

In your field¶

A residual is (observed clearance − predicted clearance) for one subject. The MSE is the average squared residual — the same objective an nls() or nlme fit drives towards when you fit a population-PK model. A smaller loss means tighter exposure-response predictions.

In [5]:
y_pred = model.predict(X)
residuals = y - y_pred
mse = float(np.mean(residuals ** 2))
print(f"MSE = {mse:.3f}")
MSE = 7.687

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — what loss really is

Explain "loss" in machine learning, focusing on MSE for
regression. Tie it to the SSR that nls() or nlme minimises
when I fit a PK model. Are the two ideas the same? Where do
they differ? Keep the answer to ~5 minutes of reading so I
can return to my notebook.

Prompt 2 — residuals as a diagnostic

What can residuals tell me beyond the MSE itself? Use a
dose-clearance example where residuals fan out at high doses
(suggesting saturable elimination). Keep the answer to ~5
minutes of reading so I can return to my notebook.

4. Gradient descent — learning by tiny steps¶

sklearn solved for the best w, b in one shot using linear algebra. For larger models we cannot do that, so we use gradient descent.
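"One shot using linear algebra" means solving the least-squares normal equations directly. A minimal sketch that re-derives the notebook's synthetic data (same seed and draw order as the cells above) and solves with `np.linalg.lstsq`; variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(10, 200, size=80)
y = 0.25 * x + 1.0 + rng.normal(0, 3.0, size=x.shape)

# Design matrix: one column for the slope, one column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])
(w_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w_hat, b_hat)   # slope close to the true 0.25
```

No iteration, no learning rate — the exact minimiser of the MSE falls out of one matrix solve, which is why it only works for models that are linear in their parameters.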

The gradient of the loss with respect to a parameter tells us which direction makes the loss grow. We step in the opposite direction. We repeat for many epochs (one epoch = one full pass over the training data), each time moving the parameters a small amount controlled by the learning rate.
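For this model the gradient has a closed form. Differentiating the MSE with respect to each parameter gives

$$ \frac{\partial\,\text{MSE}}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\,x_i, \qquad \frac{\partial\,\text{MSE}}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) $$

which are exactly the `grad_w` and `grad_b` lines in the next cell.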

In your field¶

Gradient descent is a close relative of what NONMEM, Monolix, and nlme do under the hood: take a step in parameter space, see whether the OFV (the loss) drops, take another. The learning rate is the step size — too small and your run takes overnight; too large and the fit oscillates and never settles on a final estimate.

In [6]:
# From-scratch gradient descent for the same problem.
w, b = 0.0, 0.0          # start anywhere
learning_rate = 1e-6     # small steps so the descent is visible
epochs = 8000
history = []

for epoch in range(epochs):
    y_hat = w * x_data + b
    error = y_hat - y_data                  # shape (N,)
    grad_w = 2 * np.mean(error * x_data)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    history.append(np.mean(error ** 2))

print(f"learned w = {w:.3f}, b = {b:.3f}")
print(f"final MSE = {history[-1]:.3f}")
learned w = 0.256, b = 0.001
final MSE = 7.692
In [7]:
# Loss falling — like SSR shrinking across iterations of nls() / nlme.

# Log-scale on the y-axis spreads the descent across the whole plot,
# so we can see the slow grind that follows the initial big drop.
plt.figure(figsize=(6, 4))
plt.semilogy(history)
plt.xlabel("epoch")
plt.ylabel("MSE (loss, log scale)")
plt.title("Loss curve during gradient descent")
plt.grid(True, which="both", alpha=0.3)
plt.show()
[Figure: loss curve during gradient descent — MSE (log scale) vs epoch]

Why this matters¶

Every modern deep-learning model is trained with some flavour of gradient descent. The learning rate controls how bold each step is: too small and training crawls; too large and it overshoots. The iterative step that nls() or nlme takes when it fits a PK or PD model is a close cousin — a guided walk down the loss surface, one parameter update at a time.

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — gradient descent intuition

Explain gradient descent to me as a clinical pharmacologist.
Walk through one iteration on a simple loss surface — what
is the gradient, why we step in its negative direction, what
the learning rate controls — using an analogy to nudging an
Emax / EC50 estimate to lower the SSR. Keep the answer to
~5 minutes of reading so I can return to my notebook.

Prompt 2 — when learning rate goes wrong

What happens to gradient descent if the learning rate is too
small, or too large? Show me a small numerical example with a
quadratic loss and relate it to step-size choices in
iterative nonlinear fitters. Keep the answer to ~5 minutes
of reading so I can return to my notebook.

5. Honest evaluation — train/test split¶

A model that memorises the training data is not useful. To check whether it has truly learned the pattern, we hide some data during training and only look at it at the end. That hidden portion is the test data.

  • Underfitting: the model is too simple to capture the pattern; loss is high on both training data and test data.
  • Overfitting: the model has memorised noise; training loss is low but test loss is high.

In your field¶

Overfitting here would be a dose–clearance curve that fits the Phase I cohort beautifully but fails on the Phase II patients. The classic kind: a flexible model that latches onto the noise of a small healthy-volunteer study and then mispredicts the patient population. Test data is your validation cohort.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
m = LinearRegression().fit(X_train, y_train)

train_mse = float(np.mean((y_train - m.predict(X_train)) ** 2))
test_mse = float(np.mean((y_test - m.predict(X_test)) ** 2))
print(f"train MSE = {train_mse:.3f}")
print(f" test MSE = {test_mse:.3f}")
train MSE = 7.941
 test MSE = 7.144
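To see overfitting concretely, here is a hedged sketch (not one of the notebook's own cells, and with its own seed): a degree-9 polynomial fit to a tiny 10-patient pilot cohort, scored on 10 held-out patients.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 200, size=20)
y = 0.25 * x + 1.0 + rng.normal(0, 3.0, size=x.shape)

# Tiny "pilot cohort": 10 patients to fit on, 10 held out.
x_tr, y_tr = x[:10], y[:10]
x_te, y_te = x[10:], y[10:]

# Rescale dose to [0, 1] so the high-degree fit is numerically stable.
xs_tr, xs_te = x_tr / 200.0, x_te / 200.0

# Degree-9 polynomial: one coefficient per training patient, so it can
# thread through every training point, noise included.
coeffs = np.polyfit(xs_tr, y_tr, deg=9)
train_mse = float(np.mean((np.polyval(coeffs, xs_tr) - y_tr) ** 2))
test_mse = float(np.mean((np.polyval(coeffs, xs_te) - y_te) ** 2))
print(f"train MSE = {train_mse:.3g}, test MSE = {test_mse:.3g}")
# train MSE near zero; test MSE much larger
```

The straight line above, with only two parameters, cannot memorise noise this way — which is why its train and test MSE stay close.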

Try this yourself¶

  1. Re-run the data-generation cell with a larger noise standard deviation — the analogue of a noisier assay or greater between-patient variability. What happens to the MSE? To the recovered slope?
  2. Set the learning rate to 1e-3. Does training still converge? Why or why not? (Compare with what would happen if nls() took giant steps.)
  3. Reduce the dataset to the first 10 points, as if you only had a tiny pilot cohort. Compare train MSE and test MSE — do you see overfitting?

Recap — vocabulary you now own¶

feature, target, training data, test data, train/test split, model, parameters, weights, bias, prediction, residual, loss, mean squared error, MSE, gradient, gradient descent, learning rate, epoch, overfitting, underfitting.

Up next: a small neural network that predicts something a straight line cannot — a biomarker response from four patient covariates.