Simple Linear Regression — Estimating Photon Counts from Exposure Time¶

A first machine learning notebook, framed in the language of photometric calibration. We will treat exposure time t_exp as the input and the recorded photon count N as the quantity to predict. The definitions, formulas, and code stay the same for every learner; only this story changes.

The story¶

On a calm night you took a series of exposures of a non-saturated reference star at a range of integration times. From those past frames you have a list of (t_exp, N) pairs.

Your job in this notebook: learn a simple rule that turns a fresh t_exp into a predicted count N, just like the linearity portion of a detector calibration curve.

Glossary — your field ↔ ML¶

Mental bridge between photometric calibration and the ML vocabulary used below. In this notebook, x is exposure time t_exp (in seconds) and y is the recorded photon count N (in 10³ ADU) for a faint reference star on a clear night.

| Your field | ML term | Short bridge |
| --- | --- | --- |
| t_exp, the controlled variable per frame | feature | The input the model reads. |
| N, the measured photon count | target | The quantity to predict. |
| Calibration frames used to fit a linearity curve | training data | Frames the model adjusts on. |
| Held-out frames reserved for QC | test data | Frames the model never sees during fitting. |
| Splitting frames into "fit" and "QC" sets | train/test split | Protocol for honest evaluation. |
| The instrument response model, e.g. N = k·t_exp + bias | model | The functional form whose constants you tune. |
| Gain k and dark offset from your linearity fit | parameters / weights | The numbers learning adjusts. |
| The bias / dark offset of the detector | bias | The constant term added to the prediction. |
| The predicted count for a fresh exposure | prediction | What the model outputs for new inputs. |
| (observed − model) at each frame | residual | Per-frame error after fitting. |
| χ² (or its unweighted cousin SSR) | loss | Single number measuring total wrongness. |
| Mean of squared residuals | mean squared error / MSE | Most common regression loss. |
| ∂χ²/∂(parameter) | gradient | Direction of steepest cost increase. |
| LMfit / Levenberg–Marquardt iteration | gradient descent | Tiny parameter updates that follow the gradient downhill. |
| The step size in the iterative minimiser | learning rate | How bold each update is. |
| One full pass over all calibration frames | epoch | One sweep through training data. |
| A fit that follows individual photon shot-noise spikes | overfitting | Model fits noise, generalises poorly. |
| A pure-linear fit to a clearly nonlinear detector response | underfitting | Model too simple for the pattern. |
In [1]:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

1. Setting the scene¶

Each observation is one frame: an (x, y) pair where x is t_exp and y is N. The glossary above gives the unit mapping in full. The next cell defines the ML vocabulary.

Core vocabulary¶

  • feature — the input variable, here x.
  • target — the quantity we want to predict, here y.
  • dataset — a collection of (feature, target) pairs.
  • training data — the subset used to fit the model.
  • test data — the subset held back to evaluate the model honestly.
  • train/test split — the act of partitioning the dataset into training data and test data.

In your field¶

In this notebook, the feature x is exposure time t_exp (in seconds) and the target y is the photon count recorded in your aperture (in 10³ ADU). Your training data is the bank of calibration frames you have already reduced; the test data is a handful of frames you set aside, like the few stars you withhold from a photometric solution to check residuals. Splitting your frames into "fit" and "QC" sets is a train/test split.
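
As a concrete picture, a dataset is nothing more than paired values, and a train/test split is a partition of those pairs. A minimal sketch with made-up numbers (illustrative only, not the data generated below):

In [ ]:
# Toy dataset: five (feature, target) pairs, exposure time (s) vs.
# count (10^3 ADU). Values are invented for illustration.
t_exp = [10.0, 30.0, 60.0, 120.0, 180.0]
counts = [3.6, 8.4, 16.1, 31.0, 46.2]

dataset = list(zip(t_exp, counts))   # (feature, target) pairs
train_data = dataset[:4]             # frames the model will fit on
test_data = dataset[4:]              # frames held back for QC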

In [2]:
# Synthetic ground truth: y = 0.25 * x + 1.0 + small noise
true_slope = 0.25
true_intercept = 1.0

x_data = rng.uniform(10, 200, size=80)
noise = rng.normal(0, 3.0, size=x_data.shape)
y_data = true_slope * x_data + true_intercept + noise
In [3]:
# Each dot is one calibration frame — exposure time vs. photon count.

plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data, alpha=0.7)
plt.xlabel("feature x")
plt.ylabel("target y")
plt.title("Past observations")
plt.grid(True, alpha=0.3)
plt.show()
[Figure: "Past observations" — scatter plot of the calibration frames, feature x vs. target y, one dot per frame.]

What pattern do you see?¶

The cloud of points slopes upward — longer exposures, more photons — exactly the shape of the linear regime of a CCD response curve. Our goal is to draw the best straight line through that cloud so that for a new exposure time we can predict the count we will record.

Go deeper with an LLM (optional — skip if you already know this)¶

Paste any prompt below into ChatGPT / Claude / Mistral, explore, then come back.

Prompt 1 — feature vs target in calibration

In machine learning, what is the difference between a "feature" and
a "target"? Apply the distinction to a CCD linearity calibration
where I record photon counts at a range of exposure times. Which is
the feature, which is the target, and what would gain and bias be?
Keep the answer to ~5 minutes of reading so I can return to my
notebook.

Prompt 2 — why hold out test data

Why do machine learning workflows split data into "training data"
and "test data" before fitting? Compare with reserving a few
calibration frames for end-of-night QC. Give one realistic case
where skipping the train/test split would let me fool myself.
Keep the answer to ~5 minutes of reading so I can return to my
notebook.

2. The model — linear regression¶

A linear regression model is the rule

$$ \hat{y} = w \cdot x + b $$

where $x$ is the feature, $\hat{y}$ is the prediction, and $w$ and $b$ are numbers the model learns. They are called the model's parameters (or weights, with $b$ specifically called the bias). A small set of parameters that fits a lot of data — that is the whole idea of regression.

In your field¶

The slope w is the photon rate (counts per second) and the bias b is the bias / dark pedestal that survives even at t_exp = 0. The prediction $\hat{y} = w \cdot x + b$ is just the linear part of a CCD response curve — exactly the kind of fit you make when you build a linearity correction for a new detector.

Worked example¶

With w = 0.24 and b = 1.1, for t_exp = 100 s the predicted count is 0.24 · 100 + 1.1 = 25.1 (in 10³ ADU). The shape is identical to the gain × exposure + bias relation you fit when you calibrate a new detector.
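
A one-line check of that arithmetic (w_ex and b_ex are the example's illustrative values, not fitted parameters):

In [ ]:
# Verify the worked example: predicted count for t_exp = 100 s.
w_ex, b_ex = 0.24, 1.1
print(w_ex * 100 + b_ex)   # 25.1 (in 10^3 ADU)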

In [4]:
from sklearn.linear_model import LinearRegression

X = x_data.reshape(-1, 1)   # sklearn expects a 2-D feature matrix
y = y_data

model = LinearRegression().fit(X, y)
learned_slope = float(model.coef_[0])
learned_intercept = float(model.intercept_)
learned_slope, learned_intercept
Out[4]:
(0.25734906301152605, -0.15046939018604277)

The two numbers above are the learned parameters. Compare them with the ground-truth slope (0.25) and intercept (1.0): the slope is recovered almost exactly, and the intercept is off by about one unit, consistent with the σ = 3 noise we injected. The model has recovered the underlying rule from noisy data, the same way a clean linearity fit recovers gain and bias from noisy calibration frames.
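
To turn the fitted rule loose on a fresh exposure time, call predict. A minimal sketch (the 100 s input is an arbitrary choice):

In [ ]:
# Predicted count (10^3 ADU) for a fresh 100 s exposure.
t_new = np.array([[100.0]])     # 2-D feature matrix, as sklearn expects
print(model.predict(t_new)[0])  # learned_slope * 100 + learned_intercept ≈ 25.58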

3. How good is the fit? — the loss¶

For each training point we compute the residual, the gap between the true target and the prediction:

$$ r_i = y_i - \hat{y}_i $$

Squaring and averaging gives the mean squared error (MSE), the most common loss for regression:

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

where $n$ is the number of training points (written $n$ here to avoid a clash with the photon count $N$).

A smaller loss means a better fit. "Training a model" is the search for parameters that minimise the loss.

In your field¶

A residual is (measured counts − fitted counts) for one frame. The MSE is the average squared residual across your sample of frames — closely related to the χ² statistic from lmfit, just without the per-point weighting. A smaller loss means the linearity model hugs your calibration sequence more tightly.

In [5]:
y_pred = model.predict(X)
residuals = y - y_pred
mse = float(np.mean(residuals ** 2))
print(f"MSE = {mse:.3f}")
MSE = 7.687
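
sklearn ships the same computation as a helper; this one-liner should reproduce the number above:

In [ ]:
# Same loss via sklearn's helper, identical to np.mean(residuals ** 2).
from sklearn.metrics import mean_squared_error
print(f"MSE = {mean_squared_error(y, y_pred):.3f}")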

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — what loss really is

Explain "loss" in machine learning, focusing on MSE for regression.
Tie it to the χ² / SSR I minimise when I fit a calibration curve in
LMfit. Are the two ideas the same? Where do they differ? Keep the
answer to ~5 minutes of reading so I can return to my notebook.

Prompt 2 — residuals as a diagnostic

What can residuals tell me beyond the MSE itself? Use a calibration
example where residuals trend upward at long exposures (hint:
detector starting to saturate). Keep the answer to ~5 minutes of
reading so I can return to my notebook.

4. Gradient descent — learning by tiny steps¶

sklearn solved for the best w, b in one shot using linear algebra. For larger models we cannot do that, so we use gradient descent.

The gradient of the loss with respect to a parameter tells us which direction makes the loss grow. We step in the opposite direction. We repeat for many epochs (one epoch = one full pass over the training data), each time moving the parameters a small amount controlled by the learning rate.
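
For the MSE loss above the gradients have a closed form, and the code cell below computes exactly these:

$$ \frac{\partial \, \text{MSE}}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial \, \text{MSE}}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) $$

Each update then steps against the gradient: $w \leftarrow w - \eta \, \partial \text{MSE} / \partial w$, and likewise for $b$, where $\eta$ is the learning rate.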

In your field¶

Gradient descent is doing by hand what lmfit does inside its Levenberg–Marquardt solver: nudge each parameter a little, see whether χ² goes down, and try again. The learning rate is the step size — too small and you spend a whole observing night converging; too large and you overshoot the χ² minimum and start bouncing on the loss surface.

In [6]:
# From-scratch gradient descent for the same problem.
w, b = 0.0, 0.0          # start anywhere
learning_rate = 1e-6     # small steps so the descent is visible
epochs = 8000
history = []

for epoch in range(epochs):
    y_hat = w * x_data + b
    error = y_hat - y_data                  # shape (N,)
    grad_w = 2 * np.mean(error * x_data)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    history.append(np.mean(error ** 2))

print(f"learned w = {w:.3f}, b = {b:.3f}")
print(f"final MSE = {history[-1]:.3f}")
learned w = 0.256, b = 0.001
final MSE = 7.692
In [7]:
# Loss falling — like χ² shrinking across LMfit iterations.

# Log-scale on the y-axis spreads the descent across the whole plot,
# so we can see the slow grind that follows the initial big drop.
plt.figure(figsize=(6, 4))
plt.semilogy(history)
plt.xlabel("epoch")
plt.ylabel("MSE (loss, log scale)")
plt.title("Loss curve during gradient descent")
plt.grid(True, which="both", alpha=0.3)
plt.show()
[Figure: "Loss curve during gradient descent" — MSE vs. epoch on a log y-axis, dropping steeply at first and then flattening into a slow grind.]

Why this matters¶

Every modern deep-learning model is trained with some flavour of gradient descent. The learning rate controls how bold each step is: too small and training crawls; too large and it overshoots. Notice that after 8000 epochs b is still near zero rather than the true 1.0: the intercept's gradient is far smaller than the slope's, so at this learning rate b converges much more slowly, even though the MSE is already within a whisker of the one-shot solution's. The Levenberg–Marquardt step inside LMfit is a close cousin — a guided walk down the χ² surface, one parameter update at a time.
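
A minimal numerical picture of that trade-off, using a one-dimensional quadratic loss rather than our data:

In [ ]:
# Gradient descent on f(w) = w**2 (gradient 2w), starting from w = 1.
for lr in (0.1, 1.1):
    w_demo = 1.0
    for _ in range(5):
        w_demo -= lr * 2 * w_demo
    print(f"lr = {lr}: w after 5 steps = {w_demo:+.3f}")
# lr = 0.1 contracts toward the minimum (+0.328 and shrinking);
# lr = 1.1 overshoots on every step and diverges (-2.488 and growing).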

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — gradient descent intuition

Explain gradient descent to me as an astronomer. Walk through one
iteration on a simple loss surface — what is the gradient, why we
step in its negative direction, what the learning rate controls —
using an analogy to nudging gain and dark-current values to lower
χ² on a calibration fit. Keep the answer to ~5 minutes of reading
so I can return to my notebook.

Prompt 2 — when learning rate goes wrong

What happens to gradient descent if the learning rate is too small,
or too large? Show me a small numerical example with a quadratic
loss and relate it to step-size choices in iterative least-squares
fitters. Keep the answer to ~5 minutes of reading so I can return
to my notebook.

5. Honest evaluation — train/test split¶

A model that memorises the training data is not useful. To check whether it has truly learned the pattern, we hide some data during training and only look at it at the end. That hidden portion is the test data.

  • Underfitting: the model is too simple to capture the pattern; loss is high on both training data and test data.
  • Overfitting: the model has memorised noise; training loss is low but test loss is high.

In your field¶

Overfitting would be a linearity curve that fits every wiggle of your calibration frames perfectly but then under-predicts counts on a brand-new exposure. Underfitting is the opposite: a single-slope model when your CCD actually has a distinct non-linear shoulder near saturation. Holding back test data is the same instinct as keeping a few stars out of your photometric solution to sanity-check the fit.
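
Overfitting is easy to demonstrate in miniature. A sketch using numpy's Polynomial.fit (our choice of helper, not part of this notebook's pipeline; degree 9 is deliberately absurd for ten frames):

In [ ]:
# Overfitting in miniature: fit 10 noisy frames with a line and with a
# degree-9 polynomial, then score both on 10 fresh frames.
from numpy.polynomial import Polynomial

x_fit = np.sort(rng.uniform(10, 200, size=10))
y_fit = true_slope * x_fit + true_intercept + rng.normal(0, 3.0, size=10)
x_qc = np.sort(rng.uniform(10, 200, size=10))
y_qc = true_slope * x_qc + true_intercept + rng.normal(0, 3.0, size=10)

for deg in (1, 9):
    p = Polynomial.fit(x_fit, y_fit, deg)   # rescales x internally
    fit_mse = np.mean((p(x_fit) - y_fit) ** 2)
    qc_mse = np.mean((p(x_qc) - y_qc) ** 2)
    print(f"degree {deg}: fit MSE = {fit_mse:.2f}, QC MSE = {qc_mse:.2f}")
# The degree-9 curve passes through the 10 frames it saw (fit MSE near
# zero) but typically swings wildly between them, so the QC MSE blows up.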

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
m = LinearRegression().fit(X_train, y_train)

train_mse = float(np.mean((y_train - m.predict(X_train)) ** 2))
test_mse = float(np.mean((y_test - m.predict(X_test)) ** 2))
print(f"train MSE = {train_mse:.3f}")
print(f" test MSE = {test_mse:.3f}")
train MSE = 7.941
 test MSE = 7.144

Here the test MSE happens to come out slightly below the train MSE. With only 20 held-out frames that is ordinary statistical fluctuation, not a red flag; overfitting would show the opposite pattern, a train MSE far below the test MSE.

Try this yourself¶

  1. Re-run the data-generation cell with a larger noise standard deviation — the analogue of a higher read-noise detector. What happens to the MSE? To the recovered slope (gain)?
  2. Set the learning rate to 1e-3 — equivalent to taking giant steps in your iterative fit. Does training still converge? Why or why not?
  3. Reduce the dataset to the first 10 points, as if your calibration set had only ten frames. Compare train MSE and test MSE — do you see overfitting?

Recap — vocabulary you now own¶

feature, target, training data, test data, train/test split, model, parameters, weights, bias, prediction, residual, loss, mean squared error, MSE, gradient, gradient descent, learning rate, epoch, overfitting, underfitting.

Up next: a small neural network that predicts something a straight line cannot — stellar magnitude from four photometric bands.