Bayes Linear Regression

Bayes Linear Regression is a probabilistic approach that combines Bayes' Theorem with linear regression. Instead of producing fixed point estimates of the model parameters (the regression coefficients), it treats the parameters as random variables and quantifies the uncertainty about them with probability distributions.

Mathematical Formulation

Consider the linear regression model where the target variable $y$ is predicted from a vector of features $\mathbf{x} \in \mathbb{R}^p$ (where $p$ is the number of features):

$$y_i = \beta^T \mathbf{x}_i + \epsilon_i$$

where:

  • $y_i$ is the target value for the $i$-th observation,
  • $\mathbf{x}_i$ is the feature vector for the $i$-th observation,
  • $\beta$ is the vector of unknown regression coefficients (parameters),
  • $\epsilon_i$ is the error term (residual), assumed to be normally distributed: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, i.e., the errors are independent and identically distributed with mean 0 and variance $\sigma^2$.

Thus, for each observation $i$, the conditional density of $y_i$ given the feature vector $\mathbf{x}_i$ and the parameters $\beta$ is:

$$P(y_i \mid \mathbf{x}_i, \beta, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(y_i - \beta^T \mathbf{x}_i)^2}{2 \sigma^2}\right)$$
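
This density is simply a Gaussian evaluated at the residual $y_i - \beta^T \mathbf{x}_i$. A minimal sketch in NumPy; the feature vector, coefficients, noise level, and target below are made-up toy values used only for illustration:

import numpy as np

# Toy values (illustrative assumptions, not from the text)
x_i = np.array([1.0, 2.0])    # feature vector for one observation
beta = np.array([0.5, 1.0])   # candidate coefficient vector
sigma = 1.0                   # noise standard deviation
y_i = 2.3                     # observed target

# Gaussian density of the residual y_i - beta^T x_i
residual = y_i - beta @ x_i
likelihood = np.exp(-residual**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(likelihood)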

Prior Distribution

In Bayes Linear Regression, we assume a prior distribution for the parameters $\beta$. A common choice is a Gaussian prior:

$$\beta \sim \mathcal{N}(\mathbf{0}, \tau^2 I)$$

where $\tau^2$ is the prior variance and $I$ is the identity matrix. This prior expresses the belief that the coefficients $\beta$ are likely to be close to zero, but with some uncertainty.
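
To make the role of $\tau$ concrete, the sketch below draws a few coefficient vectors from this prior; the values of $\tau$ and $p$ are arbitrary choices for illustration. Larger $\tau$ means weaker shrinkage toward zero.

import numpy as np

rng = np.random.default_rng(0)

tau = 1.0   # prior standard deviation (assumed value)
p = 2       # number of features (assumed value)

# Draw a few coefficient vectors from the prior N(0, tau^2 I)
beta_samples = rng.normal(loc=0.0, scale=tau, size=(5, p))
print(beta_samples)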

Likelihood Function

Given the assumption of normally distributed errors, the likelihood function for the observed data $\mathbf{y} = (y_1, y_2, \dots, y_n)^T$ given the feature matrix $\mathbf{X} = (\mathbf{x}_1^T, \mathbf{x}_2^T, \dots, \mathbf{x}_n^T)^T$ and parameters $\beta$ is:

$$P(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(y_i - \beta^T \mathbf{x}_i)^2}{2 \sigma^2}\right)$$

This represents the likelihood of observing the target values $\mathbf{y}$ given the feature vectors $\mathbf{X}$ and parameters $\beta$, with noise variance $\sigma^2$.
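
In practice this product of densities is evaluated on the log scale to avoid numerical underflow. A small sketch of the log-likelihood on toy data (all numbers below are illustrative):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])  # toy feature matrix
y = np.array([3.0, 5.0, 7.0])                       # toy targets
beta = np.array([1.0, 1.0])                         # candidate coefficients
sigma = 1.0                                         # noise standard deviation

# log P(y | X, beta, sigma^2) = -n/2 log(2 pi sigma^2) - sum(residuals^2) / (2 sigma^2)
residuals = y - X @ beta
n = len(y)
log_likelihood = -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(residuals**2) / (2 * sigma**2)
print(log_likelihood)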

Posterior Distribution

By Bayes' Theorem, the posterior distribution of $\beta$ given the data $(\mathbf{X}, \mathbf{y})$ is proportional to the product of the likelihood and the prior:

$$P(\beta \mid \mathbf{X}, \mathbf{y}, \sigma^2) \propto P(\mathbf{y} \mid \mathbf{X}, \beta, \sigma^2)\, P(\beta)$$

Substituting the expressions for the likelihood and prior:

$$P(\beta \mid \mathbf{X}, \mathbf{y}, \sigma^2) \propto \exp\left(-\frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - \beta^T \mathbf{x}_i)^2\right) \exp\left(-\frac{1}{2 \tau^2} \beta^T \beta \right)$$
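
Both factors are exponentials of expressions that are quadratic in $\beta$, so their product is again the exponential of a quadratic form. Collecting the terms in $\beta$ (and dropping factors that do not depend on $\beta$) gives:

$$P(\beta \mid \mathbf{X}, \mathbf{y}, \sigma^2) \propto \exp\left(-\frac{1}{2}\,\beta^T \left(\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X} + \frac{1}{\tau^2} I\right) \beta + \frac{1}{\sigma^2}\,\beta^T \mathbf{X}^T \mathbf{y}\right)$$

which is the kernel of a Gaussian in $\beta$; its covariance and mean are read off in the next section.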

Posterior Mean and Covariance

The posterior distribution of $\beta$ is Gaussian, and its mean and covariance can be obtained by completing the square in the exponent:

$$P(\beta \mid \mathbf{X}, \mathbf{y}, \sigma^2) = \mathcal{N}(\beta \mid \hat{\beta}_{\text{post}}, \Sigma_{\text{post}})$$

where the posterior mean $\hat{\beta}_{\text{post}}$ is given by:

$$\hat{\beta}_{\text{post}} = \frac{1}{\sigma^2}\,\Sigma_{\text{post}}\,\mathbf{X}^T \mathbf{y} = \left(\mathbf{X}^T \mathbf{X} + \frac{\sigma^2}{\tau^2} I\right)^{-1} \mathbf{X}^T \mathbf{y}$$

and the posterior covariance $\Sigma_{\text{post}}$ is:

$$\Sigma_{\text{post}} = \left(\frac{1}{\sigma^2}\mathbf{X}^T \mathbf{X} + \frac{1}{\tau^2} I\right)^{-1}$$
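
These two expressions translate directly into a few lines of NumPy. The sketch below uses arbitrary toy data and hyperparameters, and forms the posterior precision first, exactly as in the formula above:

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])  # toy features
y = np.array([3.0, 5.0, 7.0, 9.0])                              # toy targets
sigma, tau = 1.0, 1.0                                           # assumed noise / prior scales

# Posterior precision: (1/sigma^2) X^T X + (1/tau^2) I
precision = X.T @ X / sigma**2 + np.eye(X.shape[1]) / tau**2

# Posterior covariance and mean
Sigma_post = np.linalg.inv(precision)
beta_post = Sigma_post @ X.T @ y / sigma**2

print(beta_post)
print(Sigma_post)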

Prediction

For a new observation $\mathbf{x}_{\text{new}}$, the predictive distribution of the target $y_{\text{new}}$ is obtained by integrating out $\beta$ over its posterior:

$$P(y_{\text{new}} \mid \mathbf{x}_{\text{new}}, \mathbf{X}, \mathbf{y}) = \int P(y_{\text{new}} \mid \mathbf{x}_{\text{new}}, \beta)\, P(\beta \mid \mathbf{X}, \mathbf{y}, \sigma^2)\, d\beta$$

This integral can be evaluated, leading to the following Gaussian predictive distribution:

$$P(y_{\text{new}} \mid \mathbf{x}_{\text{new}}, \mathbf{X}, \mathbf{y}) = \mathcal{N}\left(y_{\text{new}} \mid \mathbf{x}_{\text{new}}^T \hat{\beta}_{\text{post}},\ \sigma^2 + \mathbf{x}_{\text{new}}^T \Sigma_{\text{post}} \mathbf{x}_{\text{new}}\right)$$

This provides a probabilistic prediction: the predicted value $\mathbf{x}_{\text{new}}^T \hat{\beta}_{\text{post}}$ together with the uncertainty in that prediction, represented by the variance $\sigma^2 + \mathbf{x}_{\text{new}}^T \Sigma_{\text{post}} \mathbf{x}_{\text{new}}$.
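
Continuing the toy example from the previous section, the predictive mean and variance for a hypothetical new point follow in two lines; the class-based implementation below does the same thing in a reusable form:

import numpy as np

# Same toy setup as the previous snippet (illustrative values only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
sigma, tau = 1.0, 1.0

Sigma_post = np.linalg.inv(X.T @ X / sigma**2 + np.eye(X.shape[1]) / tau**2)
beta_post = Sigma_post @ X.T @ y / sigma**2

# Predictive mean and variance for a hypothetical new point
x_new = np.array([5.0, 6.0])
pred_mean = x_new @ beta_post
pred_var = sigma**2 + x_new @ Sigma_post @ x_new
print(pred_mean, pred_var)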

Implementation

import numpy as np

class BayesLinearRegression:
    def __init__(self, tau=1.0, sigma=1.0):
        self.tau = tau           # prior standard deviation of the coefficients
        self.sigma = sigma       # noise standard deviation
        self.beta_post = None    # posterior mean of beta
        self.sigma_post = None   # posterior covariance of beta

    def fit(self, X, y):
        # Prior precision: (1 / tau^2) I
        prior_precision = np.eye(X.shape[1]) / self.tau**2

        # Posterior covariance: (X^T X / sigma^2 + I / tau^2)^{-1}
        XTX = X.T @ X
        self.sigma_post = np.linalg.inv(prior_precision + XTX / self.sigma**2)

        # Posterior mean: (1 / sigma^2) Sigma_post X^T y
        self.beta_post = self.sigma_post @ X.T @ y / self.sigma**2

    def predict(self, X_new):
        # Predictive mean: X_new beta_post
        y_pred_mean = X_new @ self.beta_post

        # Predictive variance: sigma^2 + x^T Sigma_post x for each row of X_new
        y_pred_var = self.sigma**2 + np.sum(X_new @ self.sigma_post * X_new, axis=1)

        return y_pred_mean, y_pred_var

x = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [7, 8]])
y = np.array([3, 5, 7, 9, 11, 15])
b_lr = BayesLinearRegression(tau=1.0, sigma=1.0)
b_lr.fit(x, y)
y_pred_mean, y_pred_cov = b_lr.predict(np.array([[0, 1]]))
print(y_pred_mean, y_pred_cov)
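
As a quick sanity check, the posterior mean under this zero-mean Gaussian prior coincides with the ridge regression solution with penalty $\lambda = \sigma^2 / \tau^2$. The snippet below, continuing the example above, verifies this numerically:

# Ridge solution with lambda = sigma^2 / tau^2 should match the posterior mean
lam = b_lr.sigma**2 / b_lr.tau**2
ridge_beta = np.linalg.solve(x.T @ x + lam * np.eye(x.shape[1]), x.T @ y)
print(np.allclose(ridge_beta, b_lr.beta_post))  # should print True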