Deep Learning Regression — Predicting Binding Affinity from Molecular Descriptors¶

In notebook 01 we predicted reaction velocity from substrate concentration with a straight line. Binding affinity is harder: moderate lipophilicity is good for membrane permeability, too much is bad for solubility. A neural network can learn that kind of bend.

The story, extended¶

Each row is one candidate compound, summarised by four descriptors f1, f2, f3, f4 (log P, polar surface area, MW, rotatable-bond count) and one target y — its measured binding affinity score. The descriptors interact: f2 · f3 (PSA × MW) matters, and f4 has a sweet spot.

Glossary — your field ↔ ML¶

Mental bridge for the deep-learning notebook. Here f1, f2, f3, f4 are four molecular descriptors of a candidate compound — say log P (f1), polar surface area (f2), molecular weight (f3), and number of rotatable bonds (f4) — and y is its measured binding affinity (a pKi-like score, here on a 0–10 scale).

| Your field | ML term | Short bridge |
| --- | --- | --- |
| A vector of molecular descriptors per compound | feature vector | Multiple inputs combined into one prediction. |
| The pKi-like binding score | target | What we want to predict. |
| A QSAR model fit on past assay data | neural network | Stack of simple computations learning a smooth function. |
| Each "perceptron" combining descriptors | hidden layer | Intermediate stage in the network. |
| A nonlinear squashing applied per node | activation function | The curvy step that lets the network bend. |
| A piecewise-linear "off below 0, on above" | ReLU | Cheapest, most popular nonlinearity. |
| Computing a predicted pKi from descriptors | forward pass | Pushing inputs through the network. |
| Working out which descriptor weights to nudge | backward pass / backpropagation | Computing the gradient of the loss. |
| The optimisation routine that updates weights (Adam here) | optimizer | Decides how to use the gradient. |
| A handful of compounds processed together each step | batch | One mini-update of the weights. |
| Why you cannot QSAR-model with a single line | non-linearity | Real binding surfaces curve and interact. |
In [1]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)
rng = np.random.default_rng(0)

1. The data¶

Each row is one past compound. The four columns are the descriptor features f1, f2, f3, f4 (rescaled to 0–1) and the column we want to predict is the target binding score y.

From one feature to many¶

We now have a feature vector per observation, not a single number. The model still produces one prediction per observation; it just has more inputs to combine.

In your field¶

Each row is one candidate compound. The four features f1, f2, f3, f4 are its descriptors — log P, polar surface area, molecular weight, rotatable-bond count — rescaled to 0–1, and the target y is its binding-affinity score on a 0–10 pKi-like scale.

In [2]:
N = 800
X = rng.uniform(0, 1, size=(N, 4)).astype(np.float32)
f1, f2, f3, f4 = X[:, 0], X[:, 1], X[:, 2], X[:, 3]

# Non-linear ground truth: a smooth function with interactions.
y_true = (
    5.0
    + 3.0 * np.sin(2 * np.pi * f1)
    + 4.0 * (f2 * f3)
    - 6.0 * (f4 - 0.4) ** 2
    + rng.normal(0, 0.3, size=N).astype(np.float32)
)
y = y_true.astype(np.float32)
X.shape, y.shape
Out[2]:
((800, 4), (800,))

A linear QSAR cannot capture this — affinity is not a straight-line function of any single descriptor. A model that bends, and that learns interactions between descriptors, is what we need.
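To make "cannot capture this" concrete, here is a minimal sketch fitting the best-possible linear model (ordinary least squares with an intercept) to the same kind of synthetic data. The data is regenerated locally so the block stands alone; the notebook's own X and y would behave the same way:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 800
X = rng.uniform(0, 1, size=(N, 4))
f1, f2, f3, f4 = X.T
y = (5.0 + 3.0 * np.sin(2 * np.pi * f1) + 4.0 * f2 * f3
     - 6.0 * (f4 - 0.4) ** 2 + rng.normal(0, 0.3, size=N))

# Best-possible linear fit: least squares with an intercept column.
X_aug = np.column_stack([np.ones(N), X])
beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
linear_mse = np.mean((X_aug @ beta - y) ** 2)
print(f"linear-fit MSE = {linear_mse:.2f}")
```

With noise of standard deviation 0.3, the best any model could do is an MSE near 0.09; the linear fit lands far above that because the sine, interaction, and sweet-spot terms are simply outside its hypothesis space.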

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — why nonlinearity matters

Explain in plain words why a linear regression cannot capture a
structure–activity relationship that has a sweet-spot in log P or
polar surface area, but a small neural network with ReLU
activations can. Give a concrete two-feature toy example. Keep the
answer to ~5 minutes of reading so I can return to my notebook.

Prompt 2 — hidden layers as descriptor combinations

A "hidden layer" in a neural network mixes input features into
intermediate variables. If my inputs are four molecular descriptors,
what kinds of mixtures might a hidden layer discover that look like
standard medicinal-chemistry rules of thumb? Keep the answer to ~5
minutes of reading so I can return to my notebook.

2. The model — a small neural network¶

A neural network stacks layers of simple computations. Each hidden layer computes h = activation(W · x + b), where W and b are learnable weights and bias and the activation function is a non-linearity such as ReLU ($\text{ReLU}(z) = \max(0, z)$). Without an activation function, stacking layers would still only give you a linear model.

Our network has 4 inputs → 16 hidden units → 16 hidden units → 1 output (the predicted target).
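The forward computation above can be written out by hand in a few lines of NumPy. This is a sketch with random (untrained) weights, just to show the shapes and where the ReLU sits; the names W1, b1 and so on are illustrative, not part of the notebook's code:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=4)                       # one compound's four descriptors
W1, b1 = 0.5 * rng.normal(size=(16, 4)), np.zeros(16)
W2, b2 = 0.5 * rng.normal(size=(16, 16)), np.zeros(16)
W3, b3 = 0.5 * rng.normal(size=(1, 16)), np.zeros(1)

h1 = relu(W1 @ x + b1)      # hidden layer 1: h = activation(W . x + b)
h2 = relu(W2 @ h1 + b2)     # hidden layer 2
y_hat = W3 @ h2 + b3        # output layer: no activation for regression
print(y_hat.shape)          # (1,)
```

Note the output layer has no ReLU: a regression target can be any real number, so we do not squash the final value.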

In your field¶

Each hidden layer is like a layer of consensus QSAR rules the network discovers automatically — combinations of descriptors that a medicinal chemist might have hand-engineered (lipophilic-efficiency, PSA-MW interactions, etc.). The non-linearity between layers is what lets it capture sweet-spots in log P that a multiple linear regression cannot.

Worked example¶

Each hidden unit is like one virtual reviewer paying attention to a particular combination of descriptors — "high log P and low PSA", say. The next layer pools the reviewers' opinions into a final affinity score, much like a consensus QSAR ensemble does manually.

In [3]:
class RegressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.layers(x).squeeze(-1)

net = RegressionNet()
sum(p.numel() for p in net.parameters())   # total parameters
Out[3]:
369

3. Training — forward pass, loss, backward pass¶

Training one batch of data has three steps:

  1. Forward pass — feed the features through the network to get predictions.
  2. Compute the loss (MSE again) by comparing predictions to targets.
  3. Backward pass (also called backpropagation) — compute the gradient of the loss with respect to every weight, then ask the optimizer to nudge the weights one step.

We repeat for many epochs until the loss stops improving.
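The three steps can be seen in miniature with a single learnable weight. A toy sketch using PyTorch autograd, with made-up numbers chosen only so the gradient is easy to check by hand:

```python
import torch

# Illustrative numbers: one weight w, one "descriptor" x, one assay value.
w = torch.tensor(0.0, requires_grad=True)
x, y_target = torch.tensor(2.0), torch.tensor(6.0)

pred = w * x                    # 1. forward pass
loss = (pred - y_target) ** 2   # 2. loss (MSE for a single sample)
loss.backward()                 # 3. backward pass: dloss/dw = 2*(w*x - y)*x
print(w.grad)                   # tensor(-24.)

with torch.no_grad():
    w -= 0.1 * w.grad           # a plain SGD nudge; Adam rescales this adaptively
```

Here the gradient is negative, so the optimizer pushes w upward, toward a prediction closer to the target.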

In your field¶

The forward pass is "given a compound's descriptors, predict its affinity"; the backward pass / backpropagation is "work out how to nudge each weight so the prediction gets closer to the assay value"; the optimizer (Adam) applies the nudges. One batch is a small set of compounds processed together — the same way a QSAR pipeline scores molecules in chunks.

In [4]:
# Train/test split.
idx = rng.permutation(N)
n_train = int(0.8 * N)
train_idx, test_idx = idx[:n_train], idx[n_train:]

X_train = torch.tensor(X[train_idx])
y_train = torch.tensor(y[train_idx])
X_test  = torch.tensor(X[test_idx])
y_test  = torch.tensor(y[test_idx])
In [5]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)

epochs = 300
batch_size = 64
history = []

for epoch in range(epochs):
    perm = torch.randperm(X_train.shape[0])
    epoch_loss = 0.0
    for start in range(0, X_train.shape[0], batch_size):
        b = perm[start:start + batch_size]
        pred = net(X_train[b])                  # forward pass
        loss = loss_fn(pred, y_train[b])
        optimizer.zero_grad()
        loss.backward()                          # backward pass
        optimizer.step()
        epoch_loss += loss.item() * b.shape[0]
    history.append(epoch_loss / X_train.shape[0])

print(f"final train MSE = {history[-1]:.3f}")
final train MSE = 0.184
In [6]:
# Loss curve — average MSE per epoch over our compound batches.

plt.figure(figsize=(6, 4))
plt.plot(history)
plt.xlabel("epoch")
plt.ylabel("MSE (loss)")
plt.title("Training the regression network")
plt.grid(True, alpha=0.3)
plt.show()

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — forward and backward pass

Walk me through one training step of a small neural network: forward
pass, MSE loss, backward pass, optimizer step. Use a tiny example
with four molecular descriptors as inputs and one binding-affinity
output. Keep the answer to ~5 minutes of reading so I can return to
my notebook.

Prompt 2 — Adam vs SGD

What does the Adam optimizer do that plain SGD does not? Frame the
answer for someone used to scipy.optimize and Levenberg–Marquardt.
Keep the answer to ~5 minutes of reading so I can return to my
notebook.

4. Did it actually learn?¶

We now look at the test data the network has never seen. If the test MSE is close to the training MSE, the model has learned a generalisable rule, not just memorised. If the test MSE is much higher, we have overfitting — the same enemy as in notebook 01, only sneakier in deep models with many parameters.

In your field¶

Overfitting here means a network that memorises your training library — perhaps including the quirks of a particular assay batch — and then fails on compounds from a fresh series. The diagnostic is the same: training MSE drifts down while test MSE stalls or rises.
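That diagnostic can be automated by evaluating the test MSE every epoch alongside the training MSE. A minimal sketch on freshly generated toy data (not the notebook's variables), assuming the same full-batch setup for brevity:

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Fresh toy data so the sketch stands alone.
X = torch.tensor(rng.uniform(0, 1, size=(200, 4)).astype(np.float32))
y = torch.sin(2 * torch.pi * X[:, 0]) + X[:, 1] * X[:, 2]
X_tr, y_tr, X_te, y_te = X[:160], y[:160], X[160:], y[160:]

net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

train_curve, test_curve = [], []
for epoch in range(200):
    loss = loss_fn(net(X_tr).squeeze(-1), y_tr)
    opt.zero_grad()
    loss.backward()
    opt.step()
    train_curve.append(loss.item())
    with torch.no_grad():
        test_curve.append(loss_fn(net(X_te).squeeze(-1), y_te).item())

# Healthy: both curves fall together. Overfitting: train keeps falling
# while test stalls or rises -- the growing gap is the thing to watch.
print(f"train {train_curve[-1]:.3f}  test {test_curve[-1]:.3f}")
```

Plotting the two curves together is the standard way to decide when to stop training.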

In [7]:
with torch.no_grad():
    test_pred = net(X_test)
    test_mse = float(loss_fn(test_pred, y_test))
print(f"test MSE = {test_mse:.3f}")

plt.figure(figsize=(5, 5))
plt.scatter(y_test.numpy(), test_pred.numpy(), alpha=0.6)
lims = [float(y_test.min()) - 0.5, float(y_test.max()) + 0.5]
plt.plot(lims, lims, "k--", alpha=0.5)
plt.xlabel("true target y")
plt.ylabel("predicted target y")
plt.title("Predictions vs truth (test set)")
plt.grid(True, alpha=0.3)
plt.show()
test MSE = 0.161

Points clustered along the diagonal mean the network has captured the descriptor interactions and the sweet-spot in f4. A linear QSAR on the same compounds would leave a much messier scatter — and would systematically miss the bend.

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — overfitting in QSAR

What does "overfitting" look like specifically in a QSAR / deep
learning model trained on a small compound library? Give a concrete
sign that a model is memorising its training set rather than
learning useful structure–activity rules. Keep the answer to ~5
minutes of reading so I can return to my notebook.

Try this yourself¶

  1. Reduce the network to a single nn.Linear(4, 1) (no hidden layer, no activation function). How much worse is the test MSE? Why is this exactly equivalent to multiple linear regression on the four descriptors?
  2. Increase the learning rate to 0.5. What happens to the loss curve? Relate it to taking unrealistically large parameter steps in curve_fit.
  3. Replace nn.ReLU() with nn.Tanh(). Does it train as well?

Recap — vocabulary you now own¶

On top of notebook 01: neural network, hidden layer, activation function, ReLU, forward pass, backward pass, backpropagation, optimizer, batch, non-linearity.

The training loop — forward, loss, backward, step — is the same loop that powers every modern model, from a four-descriptor binding predictor like this one to AlphaFold-style structure models with hundreds of millions of parameters.