
Regression Analysis for Continuous Target Variables

Unlike classification models, which predict discrete class labels, regression models predict target variables on a continuous scale. They are used to understand relationships between variables, analyse trends and make forecasts, which makes them valuable within industry.

Univariate Linear Regression

Linear regression models the relationship between one or more explanatory (exogenous) variables and a continuous target variable. Univariate (or simple) linear regression is the most basic form and uses only a single explanatory variable as a predictor of the target variable. A univariate model can be described as follows...

$$y = w_1x +b$$

Where $w_1$, the weight coefficient, represents the slope of the line and the bias unit $b$ represents the y-intercept. The goal is to optimise this equation by learning the parameters ($w_1$ and $b$) that produce the line that best describes the relationship between the predictor $x$ and the target $y$. The optimised equation can then be used to predict the target for new examples outside the observed training samples.

The best fitting line is called the regression line.

Multiple Linear Regression

The linear regression model can be generalised to multiple explanatory variables, where the equation is extended to...

$$y = w_1x_1 + ... + w_mx_m + b = \sum_{i=1}^{m} w_ix_i + b = W^{T}x + b$$
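
As a quick numeric illustration of the vectorised form $W^{T}x + b$, the sketch below evaluates the prediction for a single example using made-up weights (all values here are purely illustrative):

import numpy as np

# illustrative weights and bias for a model with three explanatory variables
w = np.array([0.5, -1.2, 3.0])
b = 10.0

# a single example with three feature values
x = np.array([2.0, 1.0, 4.0])

# prediction: w^T x + b
y_hat = np.dot(w, x) + b
print(y_hat)  # 0.5*2.0 - 1.2*1.0 + 3.0*4.0 + 10.0 = 21.8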

To explore linear models further, let's look at some data.

1. Data Analysis

The Ames housing dataset was produced by Dean De Cock in 2011. It contains information about individual residential properties in Ames, Iowa.

Documentation for this data can be found here with information on all 80 features. For simplicity, a subset of the variables will be used...

Columns

  • Overall Qual (Ordinal): Rates the overall material and finish of the house
  • Overall Cond (Ordinal): Rates the overall condition of the house
  • Gr Liv Area (Continuous): Above grade (ground) living area square feet
  • Central Air (Nominal): Central air conditioning
  • Total Bsmt SF (Continuous): Total square feet of basement area
  • SalePrice (Continuous): Sale price in US dollars
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%config InlineBackend.figure_format='retina'
url = 'http://jse.amstat.org/v19n3/decock/AmesHousing.txt'

cols = ['Overall Qual', 'Overall Cond', 'Gr Liv Area', 'Central Air', 'Total Bsmt SF', 'SalePrice']

df = pd.read_csv(url, sep='\t', usecols=cols)

# clean up column labels
df.columns = [c.lower().replace(' ', '_') for c in cols]

# encode "central air"
# Y=1, N=0
df['central_air'] = df['central_air'].map({'Y':1, 'N':0})

# drop NANs
df = df.dropna(axis=0)

df.to_csv('data/ames_housing_data.csv', index=False)
df.head()
overall_qual overall_cond gr_liv_area central_air total_bsmt_sf saleprice
0 6 5 1080.0 1 1656 215000
1 5 6 882.0 1 896 105000
2 6 6 1329.0 1 1329 172000
3 7 5 2110.0 1 2110 244000
4 5 5 928.0 1 1629 189900

1.1 Visual inspection

import seaborn as sns

g = sns.PairGrid(df)
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
g.add_legend();

Looking at the pairplot, it is easy to see that somewhat linear relationships exist between total_bsmt_sf and saleprice, and between gr_liv_area and saleprice.

Looking at the histograms of total_bsmt_sf, gr_liv_area and saleprice, all three variables appear to be skewed by outliers. Linear regression does not require that the variables be normally distributed, unless certain statistics or hypothesis tests are being computed from the fit.

1.2 Correlation

The correlation matrix is a square matrix containing the Pearson product-moment correlation coefficients (Pearson's r), a measure of the linear dependence between pairs of features.

Coefficients range from -1 (perfect negative correlation) through 0 (no correlation) to 1 (perfect positive correlation). Pearson's correlation coefficient can be calculated as the covariance between two variables ($X$ and $Y$) divided by the product of their standard deviations.

$$\rho = \frac{\text{cov}(X,Y)}{\sigma_x \sigma_y}$$
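
Before computing the full matrix, the coefficient can be checked directly from this definition for a single pair of variables. A minimal sketch for total_bsmt_sf against saleprice, using the df prepared above:

# Pearson's r from its definition: covariance over the product of the standard deviations
bsmt = df['total_bsmt_sf'].values
price = df['saleprice'].values

bsmt_c, price_c = bsmt - bsmt.mean(), price - price.mean()
r = (bsmt_c * price_c).sum() / np.sqrt((bsmt_c ** 2).sum() * (price_c ** 2).sum())
print(f'{r:.3f}')

# should agree with the corresponding entry of np.corrcoef
print(f'{np.corrcoef(bsmt, price)[0, 1]:.3f}')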

cm = np.corrcoef(df.T)

f, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(cm, annot=True, linewidths=.5, ax=ax, xticklabels=df.T.index, yticklabels=df.T.index);

Out of the continuous variables, total_bsmt_sf and gr_liv_area have the strongest correlation with the target variable, saleprice.

1.3 Fitting an OLS linear model from scratch

Ordinary least squares (OLS) is a method for estimating the parameters of the linear regression line that minimises the sum of squared vertical distances (residuals) between the regression line and the training examples.

There is a relationship between linear regression and the Adaptive Linear Neuron (Adaline) model. The Adaline model uses a linear activation function followed by a threshold function for the task of classification. By dropping the threshold function, Adaline can be used to solve OLS regression. This means gradient descent can be used to minimise the mean squared error (MSE) loss function, defined as...

$$L(w,b) = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

Where $\hat{y}$ is the predicted value $\hat{y} = W^Tx + b$.
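
Taking the partial derivatives of this loss gives the gradients that the parameter updates in the implementation below follow:

$$\frac{\partial L}{\partial w_j} = -\frac{2}{n}\sum_{i=1}^{n}x_j^{(i)}\left(y^{(i)} - \hat{y}^{(i)}\right), \qquad \frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)$$

Each epoch moves the parameters a small step (scaled by the learning rate) in the direction of the negative gradient.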

# based on implementation from the excellent book
# https://sebastianraschka.com/books/

class LinearRegressionGD():
    def __init__(self, lr=0.01, n_iter=50, random_state=1):
        self.lr = lr
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        # initialise the weights from a normal distribution and the bias at zero
        self.w_ = rgen.normal(loc=0., scale=1., size=X.shape[1])
        self.b_ = np.array([0.])
        self.losses_ = []

        for i in range(self.n_iter):
            out = self.net_input(X)
            errors = (y - out)
            # full-batch gradient descent: step in the direction of the
            # negative gradient of the MSE loss, scaled by the learning rate
            self.w_ += self.lr * 2.0 * X.T.dot(errors) / X.shape[0]
            self.b_ += self.lr * 2.0 * errors.mean()
            loss = (errors**2).mean()
            self.losses_.append(loss)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_) + self.b_

    def predict(self, X):
        return self.net_input(X)
X = df[['total_bsmt_sf']].values
y = df[['saleprice']].values

# standardise input and target
from sklearn.preprocessing import StandardScaler

sc_x, sc_y = StandardScaler(), StandardScaler()
X_std, y_std = sc_x.fit_transform(X), sc_y.fit_transform(y).flatten()
lr = LinearRegressionGD(lr=0.1)

lr.fit(X_std, y_std)
<__main__.LinearRegressionGD at 0x11d3e4a30>
plt.plot(range(1, lr.n_iter+1), lr.losses_)
plt.ylabel('MSE')
plt.xlabel('Epoch');
def regression_plot(X, y, model):
    """
    plot the line of best fit 
    from a linear regression model
    """
    fig, ax = plt.subplots(figsize=(8,6))

    ax.scatter(X,y, alpha=.6)
    ax.plot(X, model.predict(X), c='red', lw=1)
regression_plot(X_std, y_std, lr)
plt.xlabel('Total square feet of basement area')
plt.ylabel('Sale Price');
# model coefficients
print(f'Slope: {lr.w_[0]:.3f}')
print(f'Intercept: {lr.b_[0]:.3f}')
Slope: 0.707
Intercept: -0.000
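
Because the model was trained on standardised inputs and targets, predicting on the original scale means passing a new value through sc_x and mapping the prediction back through sc_y. A small sketch, where the 2,500 square feet input is an arbitrary illustrative value:

# predict the sale price for a basement area of 2,500 square feet
# (the 2,500 value is purely illustrative)
feet = np.array([[2500]])
price_std = lr.predict(sc_x.transform(feet))
price = sc_y.inverse_transform(price_std.reshape(-1, 1))
print(f'Predicted sale price: ${price.flatten()[0]:,.2f}')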

This vanilla implementation demonstrates how an OLS model can be fit using gradient descent, iteratively updating the weight and bias parameters to minimise the squared error between the predicted and actual values. Scikit-learn provides the same model along with additional features for assessing model fit and accuracy.

from sklearn.linear_model import LinearRegression

slr = LinearRegression()
slr.fit(X_std,y_std)

y_pred = slr.predict(X_std)

# model coefficients
print(f'Slope: {slr.coef_[0].item():.3f}')
print(f'Intercept: {slr.intercept_:.3f}')
Slope: 0.707
Intercept: -0.000

regression_plot(X_std, y_std, slr)
plt.xlabel('Total square feet of basement area')
plt.ylabel('Sale Price');

The results match the from-scratch implementation. In both models, the presence of outliers has pulled the line away from what looks like it could be a better fit (with a slightly steeper slope).

2. RANSAC - a robust regression model

In linear regression, outliers can have a big effect on the model fit by skewing the estimated model coefficients away from what could otherwise be the "best fit". Removing outliers involves both judgement and domain knowledge.

There is an alternative to outlier removal: the RANSAC (RANdom SAmple Consensus) algorithm deals with outliers by fitting the model to a subset of the data (the inliers).

The model is first fitted on a random subset of the data (the inliers). All other samples are then tested against the fitted model, and those falling within a user-defined tolerance are added to the inliers. The model is refitted using all of the inliers and its error is estimated. The algorithm stops once performance reaches a user-defined threshold or a fixed number of iterations has been reached.

from sklearn.linear_model import RANSACRegressor

ransac = RANSACRegressor(
    LinearRegression(),
    max_trials=100,
    min_samples=.95,
    residual_threshold=None,
    random_state=123
)

ransac.fit(X,y)
RANSACRegressor(estimator=LinearRegression(), min_samples=0.95,
                random_state=123)

Here, a linear model was fit with the RANSAC algorithm so that the minimum number of training examples used was 95% of the data (min_samples=.95). By default, the algorithm uses the median absolute deviation of the target values as the threshold for selecting the inliers (residual_threshold=None).

As expected, the model coefficients differ from the previous linear model.

print(f'Slope: {ransac.estimator_.coef_[0].item():.3f}')
print(f'Intercept: {ransac.estimator_.intercept_[0]:.3f}')
Slope: 106.348
Intercept: 20190.093

Visualising the results of the model can be useful for understanding how it all works.

inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

line_X = np.linspace(X.min(), X.max(), 100)  # evaluation points spanning the observed range of X
line_y = ransac.predict(line_X[:, np.newaxis])

# plot inliers
plt.scatter(X[inlier_mask], y[inlier_mask],
           c='steelblue', edgecolor='white',
           marker='o', label='Inliers')

# plot outliers
plt.scatter(X[outlier_mask], y[outlier_mask],
           c='green', edgecolor='white',
           marker='s', label='Outliers')

# line of fit
plt.plot(line_X, line_y, color='red', lw=2)

plt.xlabel('Total Basement Square Footage')
plt.ylabel('Sales Price in U.S. dollars')
plt.legend(loc='upper left')
plt.tight_layout();

To identify fewer outliers, residual_threshold can be set to a value greater than the median absolute deviation (MAD), defined as...

$$\operatorname{MAD} = \operatorname{median}\left(\left|X_{i}-\tilde{X}\right|\right)$$

where

$$\tilde {X} = \operatorname {median}(X)$$

def median_absolute_deviation(data):
    return np.median(np.abs(data - np.median(data)))

median_absolute_deviation(y)
37000.0
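
For example, a RANSAC model can be refitted with an explicit residual_threshold above this MAD value, which should flag fewer samples as outliers. A sketch, where the 1.5 multiplier is an arbitrary illustrative choice:

# relax the inlier criterion: residuals within 1.5x the MAD count as inliers
ransac_wide = RANSACRegressor(
    LinearRegression(),
    max_trials=100,
    min_samples=.95,
    residual_threshold=1.5 * median_absolute_deviation(y),
    random_state=123
)
ransac_wide.fit(X, y)

print(f'Inliers kept: {ransac_wide.inlier_mask_.sum()} of {len(y)}')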

3. Evaluating the performance of linear regression models

The goal of this section is to go beyond univariate models by introducing additional exogenous variables. However, with more than one feature it is no longer possible to visualise the regression line (or hyperplane) in a two-dimensional plot.

Alternatively, residual plots are commonly used to diagnose regression models. They can be used to detect non-linearity and outliers, and to assess whether the errors are randomly distributed.

# train test split
from sklearn.model_selection import train_test_split

target = 'saleprice'
features = df.columns[df.columns != target]

X = df[features].values
y = df[target].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=123)

# model
mlr = LinearRegression() 

# fit
mlr.fit(X_train, y_train)

# predict
y_train_pred = mlr.predict(X_train)
y_test_pred = mlr.predict(X_test)
# straight from the book!
# max values for hlines
x_max = np.max([np.max(y_train_pred), np.max(y_test_pred)])
x_min = np.min([np.min(y_train_pred), np.min(y_test_pred)])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 3), sharey=True)

ax1.scatter(y_test_pred, y_test_pred - y_test,
            c='green', marker='s', edgecolor='white',
            label='Test data')
ax2.scatter(y_train_pred, y_train_pred - y_train,
            c='steelblue', marker='o', edgecolor='white',
            label='Training data')
ax1.set_ylabel('Residuals')

for ax in (ax1, ax2):
    ax.set_xlabel('Predicted values')
    ax.legend(loc='upper left')
    ax.hlines(y=0, xmin=x_min, xmax=x_max, color='red', lw=1.5)

plt.tight_layout()

plt.show()

A perfect prediction would result in the residuals sitting at exactly zero. A good model would also result in the residuals being randomly scattered around 0. When this is not the case (as can be seen above) it means that the model has not captured some explanatory information. This information has leaked into the residuals.
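
The residual plots can be complemented with summary metrics. A short sketch computing the mean squared error and the coefficient of determination ($R^2$) on the training and test splits for the model fitted above:

from sklearn.metrics import mean_squared_error, r2_score

print(f'MSE train: {mean_squared_error(y_train, y_train_pred):.2f}')
print(f'MSE test:  {mean_squared_error(y_test, y_test_pred):.2f}')

print(f'R^2 train: {r2_score(y_train, y_train_pred):.3f}')
print(f'R^2 test:  {r2_score(y_test, y_test_pred):.3f}')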

ax = df[['total_bsmt_sf', 'gr_liv_area']].boxplot(figsize=(6,8))
ax.set_title('Example of outliers');

4. Modelling non-linear relationships - polynomial regression

Given the information learned above, polynomial features can be constructed (and outliers removed) to see whether a better fit can be achieved. To illustrate, overall_qual will be used, as its relationship with saleprice is clearly non-linear.

fig, ax = plt.subplots(figsize=(8,6))
plt.scatter('overall_qual', 'saleprice', data=df, alpha=.5)
plt.xlabel('overall quality')
plt.ylabel('sales price');
# set X, y
X = df['overall_qual'].values.reshape(-1, 1)
y = df['saleprice'].values
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# fit regression model
lm = LinearRegression()

# feature engineering (quadratic, cubic)
quadr = PolynomialFeatures(degree=2)
cubic = PolynomialFeatures(degree=3)

X_quad = quadr.fit_transform(X)
X_cubic = cubic.fit_transform(X)

# fit linear
X_fit = np.arange(X.min()-1, X.max()+2, 1)[:,np.newaxis]

regr = lm.fit(X, y)
y_lin_fit = regr.predict(X_fit)
linear_r2 = r2_score(y, regr.predict(X))

# fit quadratic
regr = lm.fit(X_quad, y)
y_quad_fit = regr.predict(quadr.fit_transform(X_fit))
quad_r2 = r2_score(y, regr.predict(X_quad))

# fit cubic
regr = lm.fit(X_cubic,y)
y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
cubic_r2 = r2_score(y, regr.predict(X_cubic))
fig, ax = plt.subplots(figsize=(10,8))

plt.scatter(X, y, label='Training points', color='lightgrey', alpha=.6)

# linear regression line
plt.plot(X_fit, y_lin_fit, 
         label=f'Linear (d=1), $R^2$={linear_r2:.2f}',
         color='steelblue', lw=2, linestyle=':'
        );

# cubic regression line
plt.plot(X_fit, y_cubic_fit, 
         label=f'Cubic (d=3), $R^2$={cubic_r2:.2f}',
         color='red', lw=2, linestyle='-'
        );

# quadratic regression line
plt.plot(X_fit, y_quad_fit, 
         label=f'Quadratic (d=2), $R^2$={quad_r2:.2f}',
         color='green', lw=2, linestyle='--'
        );

plt.legend(loc='upper left')
plt.xlabel('overall quality')
plt.ylabel('sales price');