Deep Learning Regression — Predicting Panel Scores from Four Dish Characteristics¶
In notebook 01 we predicted prep time from cover count with a straight line. Panel scores are harder: a dish with high acidity and high richness can be brilliant (think ceviche or a lemon-butter sauce), but high acidity paired with low richness may taste harsh. A neural network can learn that kind of interaction.
The story, extended¶
Each row is one dish from your development kitchen, described by
four tasting-panel dimensions f1, f2, f3, f4 (acidity, richness,
spice-heat, texture complexity) and one target y — the panel's
overall score. The dimensions interact: f2 · f3 (richness × heat)
creates warmth, and f1 (acidity) has a sweet-spot that depends on
the other three.
Glossary — your field ↔ ML¶
Mental bridge for the deep-learning notebook. Here f1, f2, f3, f4 are
four dish characteristics scored by your tasting panel — acidity
(f1), richness / fat-content (f2), spice-heat (f3), and texture
complexity (f4) — and y is the overall panel score (0–10).
| Your field | ML term | Short bridge |
|---|---|---|
| Four tasting-panel dimension scores per dish | feature vector | Multiple inputs combined into one prediction. |
| The overall panel score (0–10) | target | What we want to predict. |
| A flexible dish-quality model | neural network | Stack of simple computations learning a smooth function. |
| Each "sub-judge" combining dimension scores | hidden layer | Intermediate stage in the network. |
| A nonlinear squashing applied per node | activation function | The curvy step that lets the network bend. |
| A piecewise-linear "silent below zero, active above" | ReLU | Cheapest, most popular nonlinearity. |
| Computing an overall score from dimension inputs | forward pass | Pushing inputs through the network. |
| Working out which dimension weights to nudge | backward pass / backpropagation | Computing the gradient of the loss. |
| The routine that updates weights (Adam here) | optimizer | Decides how to use the gradient. |
| A handful of dishes processed together each step | batch | One mini-update of the weights. |
| Why panel scores are not just a weighted average of dimensions | non-linearity | Real flavour interactions curve and depend on each other. |
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
torch.manual_seed(0)
rng = np.random.default_rng(0)
1. The data¶
Each row is one dish your panel has already scored. The four columns
are the dimension features f1, f2, f3, f4 (normalised to 0–1)
and the column we want to predict is the target overall panel
score y.
From one feature to many¶
We now have a feature vector per observation, not a single number. The model still produces one prediction per observation; it just has more inputs to combine.
In your field¶
Each row is one dish your panel has scored. The four features
f1, f2, f3, f4 are tasting-panel dimensions — acidity, richness,
spice-heat, texture complexity — rescaled to 0–1, and the target
y is the panel's overall score on 0–10.
N = 800
X = rng.uniform(0, 1, size=(N, 4)).astype(np.float32)
f1, f2, f3, f4 = X[:, 0], X[:, 1], X[:, 2], X[:, 3]
# Non-linear ground truth: a smooth function with interactions.
y_true = (
    5.0
    + 3.0 * np.sin(2 * np.pi * f1)
    + 4.0 * (f2 * f3)
    - 6.0 * (f4 - 0.4) ** 2
    + rng.normal(0, 0.3, size=N).astype(np.float32)
)
y = y_true.astype(np.float32)
X.shape, y.shape
((800, 4), (800,))
A straight-line model cannot capture this — the panel score is not a simple weighted sum of the four dimensions. What we need is a model that can learn interactions between dimensions, the way a head chef mentally balances fat, acid, heat, and texture rather than scoring each in isolation.
Go deeper with an LLM (optional — skip if you already know this)¶
Prompt 1 — why nonlinearity matters
Explain in plain words why a linear regression cannot capture a
flavour-profile scoring model that has a sweet-spot in spice-heat or
interactions between acidity and richness, but a small neural network
with ReLU activations can. Give a concrete two-dimension toy example.
Keep the answer to ~5 minutes of reading so I can return to my
notebook.
Prompt 2 — hidden layers as flavour interactions
A "hidden layer" in a neural network mixes input features into
intermediate variables. If my inputs are four dish-dimension scores,
what kinds of mixtures might a hidden layer discover that look like
the classic flavour-pairing rules a chef uses? Keep the answer to ~5
minutes of reading so I can return to my notebook.
2. The model — a small neural network¶
A neural network stacks layers of simple computations.
Each hidden layer computes h = activation(W · x + b), where
W and b are learnable weights and bias and the
activation function is a non-linearity such as ReLU
($\text{ReLU}(z) = \max(0, z)$). Without an activation function,
stacking layers would still only give you a linear model.
Our network has 4 inputs → 16 hidden units → 16 hidden units → 1 output (the predicted target).
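The claim that stacking linear layers without an activation in between only ever gives a linear model can be checked numerically. A minimal standalone sketch (made-up weights, not the network below): composing two linear maps collapses into one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between.
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

x = rng.normal(size=4)
two_layer = W2 @ (W1 @ x + b1) + b2

# The composition collapses to a single linear map W x + b.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```

The ReLU between layers breaks exactly this collapse, which is what gives the network its expressive power.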
In your field¶
Each hidden layer is like a row of sub-judges the network learns to assemble: one might pay attention to "rich and spicy", another to "acidic but low-texture". The next layer pools their opinions into a final score, much like the panel chair weights individual judges' palates. The non-linearity between layers is what lets the network capture sweet-spots in heat or interactions between fat and acid that a simple weighted average cannot.
Worked example¶
Each hidden unit is like one virtual tasting judge paying attention to a particular combination of dimensions — "rich and spicy", say. The next layer pools the judges' opinions into a final score, much like a panel chair averages sub-scores weighted by how much each judge's palate is trusted.
class RegressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.layers(x).squeeze(-1)
net = RegressionNet()
sum(p.numel() for p in net.parameters()) # total parameters
369
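The 369 breaks down layer by layer as weights plus biases:

```python
# weights + biases for each Linear layer in the 4 -> 16 -> 16 -> 1 stack
layer1 = 4 * 16 + 16    # 80
layer2 = 16 * 16 + 16   # 272
layer3 = 16 * 1 + 1     # 17
print(layer1 + layer2 + layer3)  # 369
```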
3. Training — forward pass, loss, backward pass¶
Training one batch of data has three steps:
- Forward pass — feed the features through the network to get predictions.
- Compute the loss (MSE again) by comparing predictions to targets.
- Backward pass (also called backpropagation) — compute the gradient of the loss with respect to every weight, then ask the optimizer to nudge the weights one step.
We repeat for many epochs until the loss stops improving.
In your field¶
The forward pass is "given a dish's four dimension scores, predict the panel score"; the backward pass / backpropagation is "work out how to nudge every weight so the prediction gets closer to the panel's actual score"; the optimizer (Adam) applies the nudges. One batch is a small flight of dishes processed together — the same way you might run a tasting in waves.
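Backpropagation in miniature: PyTorch's autograd computes the gradient of the loss with respect to each weight automatically. A toy sketch with one weight and one "dish" (the numbers are made up for illustration):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # one learnable weight
x, target = torch.tensor(3.0), torch.tensor(5.0)

pred = w * x                    # forward pass
loss = (pred - target) ** 2     # squared error for a single example
loss.backward()                 # backward pass fills in w.grad

# d(loss)/dw = 2 * (w*x - target) * x = 2 * (6 - 5) * 3 = 6
print(w.grad)  # tensor(6.)
```

The training loop below does exactly this, just for 369 weights at once, with the optimizer applying the nudges.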
# Train/test split.
idx = rng.permutation(N)
n_train = int(0.8 * N)
train_idx, test_idx = idx[:n_train], idx[n_train:]
X_train = torch.tensor(X[train_idx])
y_train = torch.tensor(y[train_idx])
X_test = torch.tensor(X[test_idx])
y_test = torch.tensor(y[test_idx])
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
epochs = 300
batch_size = 64
history = []
for epoch in range(epochs):
    perm = torch.randperm(X_train.shape[0])
    epoch_loss = 0.0
    for start in range(0, X_train.shape[0], batch_size):
        b = perm[start:start + batch_size]
        pred = net(X_train[b])  # forward pass
        loss = loss_fn(pred, y_train[b])
        optimizer.zero_grad()
        loss.backward()  # backward pass
        optimizer.step()
        epoch_loss += loss.detach().item() * b.shape[0]  # detach before converting to a scalar
    history.append(epoch_loss / X_train.shape[0])
print(f"final train MSE = {history[-1]:.3f}")
final train MSE = 0.184
Go deeper with an LLM (optional — skip if you already know this)¶
Prompt 1 — forward and backward pass
Walk me through one training step of a small neural network: forward
pass, MSE loss, backward pass, optimizer step. Use a tiny example
with four dish-dimension scores as inputs and one panel score as
output. Keep the answer to ~5 minutes of reading so I can return to
my notebook.
Prompt 2 — Adam vs plain SGD
What does the Adam optimizer do that plain stochastic gradient
descent does not? Frame the answer for someone who is used to
adjusting a recipe iteratively by gut feel rather than by formula.
Keep the answer to ~5 minutes of reading so I can return to my
notebook.
4. Did it actually learn?¶
We now look at the test data the network has never seen. If the test MSE is close to the training MSE, the model has learned a generalisable rule, not just memorised. If the test MSE is much higher, we have overfitting — the same enemy as in notebook 01, only sneakier in deep models with many parameters.
In your field¶
Overfitting here means a network that memorises your development kitchen's quirks — your specific panel's biases, your pantry's idiosyncrasies — and then disappoints when scoring a fresh menu in another setting. The warning sign is training MSE that keeps falling while test MSE stalls or rises.
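One way to watch for that warning sign is to evaluate the test MSE at the end of each epoch alongside the training MSE and compare the two curves. A self-contained sketch with its own tiny model and stand-in data (so the names here are not the notebook's):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny synthetic stand-in for the dish data: 4 features, 1 target.
X = torch.rand(200, 4)
y = (X * torch.tensor([1.0, 2.0, -1.0, 0.5])).sum(dim=1)
X_tr, y_tr, X_te, y_te = X[:160], y[:160], X[160:], y[160:]

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

train_curve, test_curve = [], []
for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X_tr).squeeze(-1), y_tr)
    loss.backward()
    opt.step()
    train_curve.append(loss.item())
    with torch.no_grad():  # no gradients needed for evaluation
        test_curve.append(loss_fn(model(X_te).squeeze(-1), y_te).item())

# If test_curve flattens or rises while train_curve keeps falling,
# the model is starting to memorise rather than generalise.
print(f"train {train_curve[-1]:.3f}  test {test_curve[-1]:.3f}")
```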
with torch.no_grad():
    test_pred = net(X_test)
    test_mse = float(loss_fn(test_pred, y_test))
print(f"test MSE = {test_mse:.3f}")
plt.figure(figsize=(5, 5))
plt.scatter(y_test.numpy(), test_pred.numpy(), alpha=0.6)
lims = [float(y_test.min()) - 0.5, float(y_test.max()) + 0.5]
plt.plot(lims, lims, "k--", alpha=0.5)
plt.xlabel("true target y")
plt.ylabel("predicted target y")
plt.title("Predictions vs truth (test set)")
plt.grid(True, alpha=0.3)
plt.show()
test MSE = 0.161
Points clustered along the diagonal mean the network has captured the dimension interactions and the sweet-spot in spice-heat. A simple linear scoring model on the same dishes would leave a much messier scatter — and would miss the interaction between richness and acid that makes a great dish.
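To make the linear-baseline comparison concrete, you can fit ordinary least squares on data generated with the same recipe as section 1 and check its MSE. A self-contained sketch (a fresh random draw, so numbers will differ slightly from the cells above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Regenerate data with the same ground-truth recipe as section 1.
N = 800
X = rng.uniform(0, 1, size=(N, 4))
f1, f2, f3, f4 = X.T
y = (5.0 + 3.0 * np.sin(2 * np.pi * f1) + 4.0 * f2 * f3
     - 6.0 * (f4 - 0.4) ** 2 + rng.normal(0, 0.3, size=N))

# Ordinary least squares: y ~ X w + b, no interactions, no curvature.
A = np.hstack([X, np.ones((N, 1))])       # append an intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)
mse = np.mean((A @ w - y) ** 2)
print(f"linear-baseline MSE = {mse:.3f}")  # well above the network's test MSE
```

The gap between this baseline and the network's test MSE is exactly the part of the panel score driven by interactions and sweet-spots.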
Go deeper with an LLM (optional — skip if you already know this)¶
Prompt 1 — overfitting in a recipe model
What does "overfitting" look like specifically in a neural network
trained on panel scores from a small number of dishes? Give a
concrete sign that the model is memorising your development kitchen's
quirks rather than learning a generalisable scoring rule. Keep the
answer to ~5 minutes of reading so I can return to my notebook.
Try this yourself¶
- Reduce the network to a single nn.Linear(4, 1) (no hidden layer, no activation function). How much worse is the test MSE? This is equivalent to a simple weighted-average scoring sheet — no interactions between dimensions.
- Increase the learning rate to 0.5. What happens to the loss curve? Relate it to over-adjusting a sauce: too big a change swings past the mark.
- Replace nn.ReLU() with nn.Tanh(). Does it train as well?
Recap — vocabulary you now own¶
On top of notebook 01: neural network, hidden layer, activation function, ReLU, forward pass, backward pass, backpropagation, optimizer, batch, non-linearity.
The training loop — forward, loss, backward, step — is the same loop that powers every modern model, from a four-dimension dish scorer like this one to the large AI systems behind restaurant demand-forecasting and dynamic menu pricing.