Simple Linear Regression — Estimating Consumption from Disposable Income¶

A first machine learning notebook, framed in the language of consumption-function estimation. We will treat monthly household disposable income as the input and monthly household consumption as the quantity to predict — a Keynesian consumption function in miniature. The definitions, formulas, and code stay the same for every learner; only this story changes.

The story¶

From a household survey you have a cross-section of households observed in the past: a list of (income, consumption) pairs.

Your job in this notebook: learn a simple consumption function that turns a fresh income value into a predicted consumption level — exactly the OLS exercise you would run in Stata, only here we do it by hand.

Glossary — your field ↔ ML¶

Mental bridge between econometrics and the ML vocabulary used below. In this notebook, x is monthly household disposable income (in thousands of CAD) and y is monthly household consumption (also in thousands of CAD).

| Your field | ML term | Short bridge |
|---|---|---|
| Regressor / explanatory variable (income) | feature | The input the model reads. |
| Dependent variable / outcome (consumption) | target | The quantity to predict. |
| The estimation sample | training data | Observations the model fits on. |
| A held-out sample for out-of-sample evaluation | test data | Observations the model never sees during fitting. |
| Splitting the sample into estimation and validation subsamples | train/test split | Protocol for honest evaluation. |
| Your linear specification C = β₀ + β₁·Y + ε (macro notation: Y is income) | model | The functional form whose coefficients you estimate. |
| OLS coefficients β₀, β₁ | parameters / weights | The numbers learning adjusts. |
| The intercept term | bias | The constant term of the prediction. |
| Predicted consumption for a fresh income value | prediction | What the model outputs for new inputs. |
| OLS residuals c_i − ĉ_i | residual | Per-point error after fitting. |
| Sum of squared residuals (SSR) | loss | A single number measuring total wrongness. |
| Mean of squared residuals | mean squared error / MSE | The most common regression loss. |
| ∂(SSR)/∂β | gradient | Direction of steepest loss increase. |
| An iterative numerical solver (e.g. for nonlinear least squares) | gradient descent | Tiny coefficient updates that follow the gradient downhill. |
| The step size in that iterative solver | learning rate | How bold each update is. |
| One full pass over the estimation sample | epoch | One sweep through the training data. |
| A flexible model that hugs sample-specific noise | overfitting | Loses out-of-sample predictive power. |
| A linear fit to a clearly nonlinear Engel curve | underfitting | Model too simple to catch the pattern. |
In [1]:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

1. Setting the scene¶

Each observation is one household: an (x, y) pair where x is income and y is consumption. The glossary above gives the unit mapping in full; the list below pins down the core ML vocabulary.

Core vocabulary¶

  • feature — the input variable, here x.
  • target — the quantity we want to predict, here y.
  • dataset — the collection of (feature, target) pairs.
  • training data — the subset used to fit the model.
  • test data — the subset held back to evaluate the model honestly.
  • train/test split — the act of partitioning the dataset into training data and test data.

In your field¶

In this notebook, the feature x is monthly disposable income (rescaled) and the target y is monthly consumption (also rescaled). Your training data is the bulk of the household survey you already have; the test data is the set of households you held back so the fit is judged honestly. Splitting the sample into "estimation" and "holdout" subsamples is a train/test split.

In [2]:
# Synthetic ground truth: y = 0.25 * x + 1.0 + small noise
true_slope = 0.25
true_intercept = 1.0

x_data = rng.uniform(10, 200, size=80)
noise = rng.normal(0, 3.0, size=x_data.shape)
y_data = true_slope * x_data + true_intercept + noise
In [3]:
# Each dot is one household — disposable income on x, consumption on y.

plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data, alpha=0.7)
plt.xlabel("feature x")
plt.ylabel("target y")
plt.title("Past observations")
plt.grid(True, alpha=0.3)
plt.show()
[Figure: "Past observations" scatter plot, feature x (income) vs target y (consumption), one dot per household]

What pattern do you see?¶

The cloud of points slopes upward — higher income, higher consumption — exactly the linear consumption function from a first-year macro textbook. Our goal is to draw the best straight line through that cloud so that for a new income value we can predict consumption.

Go deeper with an LLM (optional — skip if you already know this)¶

Paste any prompt below into ChatGPT / Claude / Mistral, explore, then come back.

Prompt 1 — feature vs target in a regression

In machine learning, what is the difference between a "feature"
and a "target"? Apply the distinction to estimating a consumption
function from a household survey. Which is the feature, which is
the target, and how does this map onto the X / y notation in
statsmodels.OLS? Keep the answer to ~5 minutes of reading so I
can return to my notebook.

Prompt 2 — why hold out test data

Why do machine learning workflows split data into "training data"
and "test data" before fitting, while a classical econometrics
paper often estimates on the whole sample? Compare with the idea
of out-of-sample forecast evaluation. Give one realistic case
where skipping the train/test split would let me fool myself.
Keep the answer to ~5 minutes of reading so I can return to my
notebook.

2. The model — linear regression¶

A linear regression model is the rule

$$ \hat{y} = w \cdot x + b $$

where $x$ is the feature, $\hat{y}$ is the prediction, and $w$ and $b$ are numbers the model learns. They are called the model's parameters (or weights, with $b$ specifically called the bias). A small set of parameters that fits a lot of data — that is the whole idea of regression.

In your field¶

The slope w is the marginal propensity to consume (MPC) — extra spending per extra unit of disposable income — and the bias b is autonomous consumption (what households spend even at zero disposable income). The prediction $\hat{y} = w \cdot x + b$ is exactly the Keynesian consumption function you would estimate with statsmodels.OLS.

Worked example¶

With w = 0.24 and b = 1.1, a household with x = 100 (that is, 100,000 CAD of monthly disposable income) gets predicted consumption 0.24 · 100 + 1.1 = 25.1, i.e. 25,100 CAD. These are the same two coefficients — MPC and autonomous consumption — that you interpret in any introductory macro estimation.
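A quick check of that arithmetic in code (the values 0.24 and 1.1 are illustrative, not estimates from our data):

w_demo, b_demo = 0.24, 1.1        # hypothetical MPC and autonomous consumption
x_new = 100.0                     # monthly income, thousands of CAD
print(w_demo * x_new + b_demo)    # 25.1 -> predicted consumption, thousands of CAD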

In [4]:
from sklearn.linear_model import LinearRegression

X = x_data.reshape(-1, 1)   # sklearn expects a 2-D feature matrix
y = y_data

model = LinearRegression().fit(X, y)
learned_slope = float(model.coef_[0])
learned_intercept = float(model.intercept_)
learned_slope, learned_intercept
Out[4]:
(0.25734906301152605, -0.15046939018604277)

The two numbers above are the learned parameters. Compare them with the ground truth: the slope is recovered almost exactly (0.257 vs 0.25), while the intercept (−0.15 vs 1.0) is much noisier — with incomes spread from 10 to 200 and noise of standard deviation 3, the data pin down the slope far better than the level. This is exactly how OLS behaves in a well-specified Monte Carlo draw.
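To use the fitted model for its actual purpose — a prediction for a fresh income value — call predict on a 2-D array. The income of 100 below is an arbitrary new value chosen for illustration:

x_new = np.array([[100.0]])                  # one new household, income in thousands of CAD
predicted = float(model.predict(x_new)[0])
print(f"predicted consumption = {predicted:.2f}")   # roughly 0.257 * 100 - 0.15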

3. How good is the fit? — the loss¶

For each training point we compute the residual, the gap between the true target and the prediction:

$$ r_i = y_i - \hat{y}_i $$

Squaring and averaging gives the mean squared error (MSE), the most common loss for regression:

$$ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $$

A smaller loss means a better fit. "Training a model" is the search for parameters that minimise the loss.

In your field¶

A residual is (observed consumption − fitted consumption) for one household. The MSE is the average squared residual — closely related to the SSR that OLS minimises. A smaller loss means a tighter consumption function.

In [5]:
y_pred = model.predict(X)
residuals = y - y_pred
mse = float(np.mean(residuals ** 2))
print(f"MSE = {mse:.3f}")
MSE = 7.687
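To make the SSR ↔ loss row of the glossary concrete: the MSE is just the SSR divided by the sample size, so the number above can be reproduced from the SSR side in one line:

ssr = float(np.sum(residuals ** 2))              # what OLS minimises
print(f"SSR / N = {ssr / len(residuals):.3f}")   # identical to the MSE above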

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — what loss really is

Explain "loss" in machine learning, focusing on MSE for
regression. Tie it to the SSR that OLS minimises in classical
econometrics. Are the two ideas the same? Where do they differ
(think weighted least squares, robust SEs)? Keep the answer to
~5 minutes of reading so I can return to my notebook.

Prompt 2 — residuals as a diagnostic

What can residuals tell me beyond the MSE itself? Use a
consumption-function example where residuals fan out at high
income — the textbook heteroskedasticity pattern. Keep the
answer to ~5 minutes of reading so I can return to my notebook.

4. Gradient descent — learning by tiny steps¶

sklearn solved for the best w, b in one shot using linear algebra. For larger models we cannot do that, so we use gradient descent.

The gradient of the loss with respect to a parameter tells us which direction makes the loss grow. We step in the opposite direction. We repeat for many epochs (one epoch = one full pass over the training data), each time moving the parameters a small amount controlled by the learning rate.
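For our two-parameter model the gradient is available in closed form. Differentiating the MSE with respect to $w$ and $b$ gives

$$ \frac{\partial\,\text{MSE}}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\,x_i, \qquad \frac{\partial\,\text{MSE}}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i), $$

which are exactly the grad_w and grad_b lines in the cell below.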

In your field¶

Gradient descent is an iterative route to what OLS computes in closed form: nudge each coefficient in the direction that lowers SSR, repeat. The learning rate is the step size — small for steady convergence, large to cover ground quickly at the risk of bouncing past the optimum, the same step-size trade-off you face in any iterative estimator.

In [6]:
# From-scratch gradient descent for the same problem.
w, b = 0.0, 0.0          # start anywhere
learning_rate = 1e-6     # small steps so the descent is visible
epochs = 8000
history = []

for epoch in range(epochs):
    y_hat = w * x_data + b
    error = y_hat - y_data                  # shape (N,)
    grad_w = 2 * np.mean(error * x_data)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    history.append(np.mean(error ** 2))

print(f"learned w = {w:.3f}, b = {b:.3f}")
print(f"final MSE = {history[-1]:.3f}")
learned w = 0.256, b = 0.001
final MSE = 7.692
In [7]:
# Loss falling — like SSR shrinking across iterations of a numerical solver.

# Log-scale on the y-axis spreads the descent across the whole plot,
# so we can see the slow grind that follows the initial big drop.
plt.figure(figsize=(6, 4))
plt.semilogy(history)
plt.xlabel("epoch")
plt.ylabel("MSE (loss, log scale)")
plt.title("Loss curve during gradient descent")
plt.grid(True, which="both", alpha=0.3)
plt.show()
[Figure: "Loss curve during gradient descent", MSE (log scale) vs epoch]

Why this matters¶

Every modern deep-learning model is trained with some flavour of gradient descent. The learning rate controls how bold each step is: too small and training crawls; too large and it overshoots. OLS gives you the closed-form answer for free; iterative descent is what you fall back to when the objective is nonlinear in the parameters (logit, NLLS, GMM).
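To see the "too large" failure mode on this very problem, re-run a few steps with a much bolder learning rate (a sketch reusing x_data and y_data from above; with incomes up to 200, the squared-feature scale puts 1e-3 far past the stable step size, so the loss should grow at every step):

w_big, b_big = 0.0, 0.0
for step in range(5):
    error = w_big * x_data + b_big - y_data
    w_big -= 1e-3 * 2 * np.mean(error * x_data)   # same gradients as before,
    b_big -= 1e-3 * 2 * np.mean(error)            # much bigger steps
    print(f"step {step}: MSE = {np.mean((w_big * x_data + b_big - y_data) ** 2):.3e}")
# Each update overshoots the minimum and lands farther up the other side
# of the loss bowl, so the printed MSE explodes instead of falling.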

Go deeper with an LLM (optional — skip if you already know this)¶

Prompt 1 — gradient descent intuition

Explain gradient descent to me as an econometrician. Walk
through one iteration on a simple loss surface — what is the
gradient, why we step in its negative direction, what the
learning rate controls — using an analogy to nudging a coefficient
vector to lower SSR. Keep the answer to ~5 minutes of reading so
I can return to my notebook.

Prompt 2 — when learning rate goes wrong

What happens to gradient descent if the learning rate is too
small, or too large? Show a small numerical example with a
quadratic loss and relate it to step-size choices in iterative
estimators (BHHH, scoring). Keep the answer to ~5 minutes of
reading so I can return to my notebook.

5. Honest evaluation — train/test split¶

A model that memorises the training data is not useful. To check whether it has truly learned the pattern, we hide some data during training and only look at it at the end. That hidden portion is the test data.

  • Underfitting: the model is too simple to capture the pattern; loss is high on both training data and test data.
  • Overfitting: the model has memorised noise; training loss is low but test loss is high.

In your field¶

Overfitting would be a consumption function that fits last quarter's panel beautifully but mispredicts on the next wave of the survey. Underfitting is the opposite: assuming a single MPC for the whole population when permanent-vs-transitory income clearly matters. Test data is the holdout sample you kept for honest evaluation.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
m = LinearRegression().fit(X_train, y_train)

train_mse = float(np.mean((y_train - m.predict(X_train)) ** 2))
test_mse = float(np.mean((y_test - m.predict(X_test)) ** 2))
print(f"train MSE = {train_mse:.3f}")
print(f" test MSE = {test_mse:.3f}")
train MSE = 7.941
 test MSE = 7.144

Try this yourself¶

  1. Re-run the data-generation cell with a larger noise standard deviation — a higher idiosyncratic shock variance. What happens to the MSE? To the recovered slope (the MPC)?
  2. Set the learning rate to 1e-3. Does training still converge? Why or why not? (Compare with what would happen in a Newton step that overshoots.)
  3. Reduce the dataset to the first 10 points, as if your sample were a tiny pilot survey. Compare train MSE and test MSE — do you see overfitting? (A sketch after this list shows one way to set this up.)
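A minimal sketch for experiment 3, reusing the imports from the cells above (the exact numbers will vary with the random split, and with only 7 training points even a two-parameter line can start chasing noise):

X_small, y_small = X[:10], y_data[:10]   # tiny pilot survey
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X_small, y_small, test_size=0.3, random_state=0
)
m_small = LinearRegression().fit(Xs_tr, ys_tr)
print("train MSE:", float(np.mean((ys_tr - m_small.predict(Xs_tr)) ** 2)))
print(" test MSE:", float(np.mean((ys_te - m_small.predict(Xs_te)) ** 2)))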

Recap — vocabulary you now own¶

feature, target, training data, test data, train/test split, model, parameters, weights, bias, prediction, residual, loss, mean squared error, MSE, gradient, gradient descent, learning rate, epoch, overfitting, underfitting.

Up next: a small neural network that predicts something a straight line cannot — log house price from four covariates.