Deep Learning Regression — Predicting Bolometric Magnitude from Four Photometric Bands¶

In notebook 01 we predicted photon counts from exposure time with a straight line. Bolometric magnitude is harder: colour indices matter, and a too-blue or too-red excess shifts the answer in a non-linear way. A neural network can learn that bend.

The story, extended¶

Each row is one star, summarised by four band measurements f1, f2, f3, f4 (think u, g, r, i) and one target y — its bolometric magnitude on a 0–10 scale. The bands interact: the f2 · f3 (g·r) product carries colour information, and f4 has a sweet spot near solar-type SEDs.

Glossary — your field ↔ ML¶

Mental bridge for the deep-learning notebook. Here f1, f2, f3, f4 are the four photometric-band measurements for one star — say u (f1), g (f2), r (f3), i (f4) — and y is its bolometric magnitude on a 0–10 scale.

| Your field | ML term | Short bridge |
| --- | --- | --- |
| A vector of band magnitudes per star | feature vector | Multiple inputs combined into one prediction. |
| The bolometric magnitude estimate | target | What we want to predict. |
| An empirical SED-fitter trained on past stars | neural network | Stack of simple computations learning a smooth function. |
| Each "perceptron" combining bands | hidden layer | Intermediate stage in the network. |
| A nonlinear squashing applied per node | activation function | The curvy step that lets the network bend. |
| A piecewise-linear "off below 0, on above" | ReLU | Cheapest, most popular nonlinearity. |
| Computing a predicted magnitude from band fluxes | forward pass | Pushing inputs through the network. |
| Working out which band-weights to nudge | backward pass / backpropagation | Computing the gradient of the loss. |
| The optimisation routine that updates weights (Adam here) | optimizer | Decides how to use the gradient. |
| A handful of stars processed together each step | batch | One mini-update of the weights. |
| Why bolometric magnitude is not a linear sum of band magnitudes | non-linearity | Real SEDs curve and bands interact. |
In [1]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)
rng = np.random.default_rng(0)

1. The data¶

Each row is one past star. The four columns are the band features f1, f2, f3, f4 (rescaled to 0–1) and the column we want to predict is the target bolometric magnitude y.

From one feature to many¶

We now have a feature vector per observation, not a single number. The model still produces one prediction per observation; it just has more inputs to combine.

In your field¶

Each row is one source. The four features f1, f2, f3, f4 are its u, g, r, i magnitudes (rescaled to 0–1) and the target y is its bolometric magnitude on a 0–10 scale. The model takes the full SED snapshot (not just one band) and returns one prediction.

In [2]:
N = 800
X = rng.uniform(0, 1, size=(N, 4)).astype(np.float32)
f1, f2, f3, f4 = X[:, 0], X[:, 1], X[:, 2], X[:, 3]

# Non-linear ground truth: a smooth function with interactions.
y_true = (
    5.0
    + 3.0 * np.sin(2 * np.pi * f1)
    + 4.0 * (f2 * f3)
    - 6.0 * (f4 - 0.4) ** 2
    + rng.normal(0, 0.3, size=N).astype(np.float32)
)
y = y_true.astype(np.float32)
X.shape, y.shape
Out[2]:
((800, 4), (800,))

A linear colour-magnitude transform cannot capture this — y is not a straight-line function of any single band. We need a model that bends, and that learns interactions between bands, the way a photometric-redshift estimator must.
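To see how little a purely linear transform captures, here is a least-squares baseline on the same synthetic recipe (a standalone sketch — it redraws the data, so the numbers match this notebook in distribution, not bit-for-bit):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 800
X = rng.uniform(0, 1, size=(N, 4))
f1, f2, f3, f4 = X.T
y = (5.0 + 3.0 * np.sin(2 * np.pi * f1) + 4.0 * (f2 * f3)
     - 6.0 * (f4 - 0.4) ** 2 + rng.normal(0, 0.3, size=N))

# Best linear map y ≈ A @ coef, with a constant column for the intercept.
A = np.column_stack([X, np.ones(N)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
mse_linear = np.mean((A @ coef - y) ** 2)
print(f"linear-baseline MSE = {mse_linear:.2f}")
```

The linear fit is stuck well above the 0.09 noise floor (the variance of the 0.3-mag scatter), because the sine, the f2·f3 product, and the f4 quadratic all have components no straight line through the four bands can absorb.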

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — why nonlinearity matters

Explain in plain words why a linear combination of u, g, r, i
magnitudes cannot recover bolometric magnitude across the whole HR
diagram, but a small neural network with ReLU activations can.
Give a concrete two-band toy example where the relation kinks at
a colour boundary. Keep the answer to ~5 minutes of reading so I
can return to my notebook.

Prompt 2 — hidden layers as colour-index detectors

A "hidden layer" in a neural network mixes input features into
intermediate variables. If my inputs are four photometric bands,
what kinds of mixtures might a hidden layer discover that look
like colour indices, Balmer breaks, or other diagnostics
astronomers already use? Keep the answer to ~5 minutes of reading
so I can return to my notebook.

2. The model — a small neural network¶

A neural network stacks layers of simple computations. Each hidden layer computes h = activation(W · x + b), where W and b are learnable weights and bias and the activation function is a non-linearity such as ReLU ($\text{ReLU}(z) = \max(0, z)$). Without an activation function, stacking layers would still only give you a linear model.
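The "stacking layers without an activation stays linear" claim is easy to verify numerically: two linear layers compose into a single affine map with weight W2·W1 and bias W2·b1 + b2 (a quick NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)   # layer 1: 4 -> 16
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)    # layer 2: 16 -> 1

x = rng.uniform(size=(5, 4))                             # five "stars"
stacked = (x @ W1.T + b1) @ W2.T + b2                    # two layers, no activation
collapsed = x @ (W2 @ W1).T + (W2 @ b1 + b2)             # one equivalent layer
print(np.allclose(stacked, collapsed))                   # True
```

Insert a ReLU between the two layers and the identity breaks — that is exactly the bend the network needs.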

Our network has 4 inputs → 16 hidden units → 16 hidden units → 1 output (the predicted target).
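A quick sanity check on the size of this model, done by hand: each Linear(m, n) layer owns an m×n weight matrix plus n biases.

```python
def n_params(m, n):
    """Parameters of a Linear(m, n) layer: m*n weights + n biases."""
    return m * n + n

total = n_params(4, 16) + n_params(16, 16) + n_params(16, 1)
print(total)   # 80 + 272 + 17 = 369
```

369 parameters — tiny by deep-learning standards, but already enough to bend.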

In your field¶

Each hidden layer is a stack of basis functions the network learns by itself — a bit like the SED templates you would feed to a photo-z code, except the network discovers its own templates from the data. The non-linearity between layers is what lets it capture sharp features like the 4000 Å break that no linear combination of u, g, r, i could reproduce.

Worked example¶

Each hidden unit is like one specialist colour-index detector paying attention to a particular combination of bands — "blue excess plus moderate r", say. The next layer pools the specialists' opinions into a final magnitude, much like a SED template-fitting code combines several diagnostics by hand.

In [3]:
class RegressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.layers(x).squeeze(-1)

net = RegressionNet()
sum(p.numel() for p in net.parameters())   # total parameters
Out[3]:
369

3. Training — forward pass, loss, backward pass¶

Training one batch of data has three steps:

  1. Forward pass — feed the features through the network to get predictions.
  2. Compute the loss (MSE again) by comparing predictions to targets.
  3. Backward pass (also called backpropagation) — compute the gradient of the loss with respect to every weight, then ask the optimizer to nudge the weights one step.

We repeat for many epochs until the loss stops improving.
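The three steps can be watched one at a time on a deliberately tiny example — one learnable weight, one "star" (the numbers here are illustrative, not from our dataset):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)       # one learnable weight
x, y_target = torch.tensor(2.0), torch.tensor(7.0)

pred = w * x                                    # 1. forward pass: pred = 2
loss = (pred - y_target) ** 2                   # 2. MSE loss: (2 - 7)^2 = 25
loss.backward()                                 # 3. backward pass

# Autograd matches the hand derivative: dL/dw = 2*(w*x - y)*x = 2*(-5)*2 = -20
print(loss.item(), w.grad.item())               # 25.0 -20.0

# An optimizer would now nudge w against the gradient; plain SGD by hand:
with torch.no_grad():
    w -= 0.01 * w.grad                          # w: 1.0 -> 1.2
```

Adam does the same nudge, just with per-weight step sizes adapted from the gradient history.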

In your field¶

The forward pass is "given a source's u, g, r, i, compute a bolometric prediction"; the backward pass / backpropagation is "work out how to nudge every weight to bring that prediction closer to the truth"; the optimizer (Adam) actually applies the nudges. One batch is a chunk of sources processed together — the LSST photo-z pipelines do this thousands of times per night.

In [4]:
# Train/test split.
idx = rng.permutation(N)
n_train = int(0.8 * N)
train_idx, test_idx = idx[:n_train], idx[n_train:]

X_train = torch.tensor(X[train_idx])
y_train = torch.tensor(y[train_idx])
X_test  = torch.tensor(X[test_idx])
y_test  = torch.tensor(y[test_idx])
In [5]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)

epochs = 300
batch_size = 64
history = []

for epoch in range(epochs):
    perm = torch.randperm(X_train.shape[0])
    epoch_loss = 0.0
    for start in range(0, X_train.shape[0], batch_size):
        b = perm[start:start + batch_size]
        pred = net(X_train[b])                  # forward pass
        loss = loss_fn(pred, y_train[b])
        optimizer.zero_grad()
        loss.backward()                          # backward pass
        optimizer.step()
        epoch_loss += loss.item() * b.shape[0]   # .item() detaches the scalar
    history.append(epoch_loss / X_train.shape[0])

print(f"final train MSE = {history[-1]:.3f}")
final train MSE = 0.184
In [6]:
# Loss curve — average MSE per epoch over our batches of stars.

plt.figure(figsize=(6, 4))
plt.plot(history)
plt.xlabel("epoch")
plt.ylabel("MSE (loss)")
plt.title("Training the regression network")
plt.grid(True, alpha=0.3)
plt.show()

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — forward and backward pass

Walk me through one training step of a small neural network:
forward pass, MSE loss, backward pass, optimizer step. Use a tiny
example with four photometric bands as inputs and a magnitude as
output. Keep the answer to ~5 minutes of reading so I can return
to my notebook.

Prompt 2 — Adam vs LMfit

What does the Adam optimizer do that plain stochastic gradient
descent does not? Frame the answer for someone used to LMfit and
Levenberg–Marquardt. Keep the answer to ~5 minutes of reading so
I can return to my notebook.

4. Did it actually learn?¶

We now look at the test data the network has never seen. If the test MSE is close to the training MSE, the model has learned a generalisable rule, not just memorised. If the test MSE is much higher, we have overfitting — the same enemy as in notebook 01, only sneakier in deep models with many parameters.
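One way to provoke the failure mode deliberately (a standalone sketch, assuming the same synthetic recipe as above): train an oversized network on only 16 stars and compare train and test MSE.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
rng = np.random.default_rng(1)

def make_set(n):
    X = rng.uniform(0, 1, size=(n, 4)).astype(np.float32)
    f1, f2, f3, f4 = X.T
    y = (5 + 3 * np.sin(2 * np.pi * f1) + 4 * f2 * f3
         - 6 * (f4 - 0.4) ** 2 + rng.normal(0, 0.3, size=n)).astype(np.float32)
    return torch.tensor(X), torch.tensor(y)

X_tr, y_tr = make_set(16)      # deliberately tiny training set
X_te, y_te = make_set(200)

net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):          # long full-batch training: the net memorises
    loss = nn.functional.mse_loss(net(X_tr).squeeze(-1), y_tr)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    train_mse = float(nn.functional.mse_loss(net(X_tr).squeeze(-1), y_tr))
    test_mse = float(nn.functional.mse_loss(net(X_te).squeeze(-1), y_te))
print(f"train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The train MSE collapses below the 0.09 noise floor — the network is fitting the noise of those 16 stars — while the test MSE stays well above it. That gap is the overfitting signature.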

In your field¶

Overfitting in a deep model means the network has learned to "recognise" individual training sources rather than the underlying SED–magnitude relation, so it does well on Rubin commissioning fields but fails on a different patch of sky. The classic warning sign is training MSE that keeps falling while test MSE plateaus or rises.

In [7]:
with torch.no_grad():
    test_pred = net(X_test)
    test_mse = float(loss_fn(test_pred, y_test))
print(f"test MSE = {test_mse:.3f}")

plt.figure(figsize=(5, 5))
plt.scatter(y_test.numpy(), test_pred.numpy(), alpha=0.6)
lims = [float(y_test.min()) - 0.5, float(y_test.max()) + 0.5]
plt.plot(lims, lims, "k--", alpha=0.5)
plt.xlabel("true target y")
plt.ylabel("predicted target y")
plt.title("Predictions vs truth (test set)")
plt.grid(True, alpha=0.3)
plt.show()
test MSE = 0.161

Points clustered along the diagonal mean the network has captured the colour interactions and the sweet-spot in f4. A linear colour-magnitude transform on the same stars would leave a much messier scatter — and would systematically miss the bend at the M-dwarf end.

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — overfitting in photo-z / SED nets

What does "overfitting" look like specifically in a photometric
neural-network estimator trained on a small sample of stars? Give
a concrete sign that a model is memorising its training set
rather than learning useful colour relations. Keep the answer to
~5 minutes of reading so I can return to my notebook.

Try this yourself¶

  1. Reduce the network to a single nn.Linear(4, 1) (no hidden layer, no activation function). How much worse is the test MSE? Why is this exactly equivalent to a linear colour-magnitude transform on the four bands?
  2. Increase the learning rate to 0.5. What happens to the loss curve? Relate it to taking unrealistically large parameter steps in an LMfit run.
  3. Replace nn.ReLU() with nn.Tanh(). Does it train as well?

Recap — vocabulary you now own¶

On top of notebook 01: neural network, hidden layer, activation function, ReLU, forward pass, backward pass, backpropagation, optimizer, batch, non-linearity.

The training loop — forward, loss, backward, step — is the same loop that powers every modern model, from a four-band magnitude predictor like this one to the multi-billion-parameter networks that classify Rubin/LSST alerts in real time.