
Lesson 7: Tabular Data

Pt 1

Decision Tree Ensembles

Some examples include:

  • Random Forest (Regressor / Classifier)
  • Gradient Boost (Regressor / Classifier)
  • XGBoost (Regressor / Classifier)

We will use the Scikit-Learn library for this task rather than PyTorch.

The Data

  • Blue Book for Bulldozers Kaggle Competition
  • The goal is to predict the sale price of heavy equipment at auction based on its usage, type and configuration.

Kaggle setup help

There are a number of different ways to do this. I had some trouble getting it to work, so after some reading I tried the approach below instead.

fastai helper functions

  • I had some trouble importing fastai's helper functions, and found that someone has added them all here
from pathlib import Path

from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype

from fastai.tabular.all import *
# helper functions
from fastbook import *

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor,export_graphviz

from dtreeviz.trees import *

import IPython
from IPython.display import Image, display_svg, SVG

import os

Load credentials from json file. This is a simple file with the following information...

{"username":"xxx","key":"xxx"}
import json

with open('creds.json') as f:
    creds = json.load(f)
os.environ['KAGGLE_USERNAME']=creds["username"]
os.environ['KAGGLE_KEY']=creds["key"]
from kaggle import api

api.competition_download_cli('bluebook-for-bulldozers')
 10%|█         | 5.00M/48.4M [00:00<00:01, 30.5MB/s]
Downloading bluebook-for-bulldozers.zip to /notebooks

100%|██████████| 48.4M/48.4M [00:01<00:00, 28.5MB/s]




p = Path.cwd()

for i in p.iterdir():
    print(i)
/notebooks/.ipynb_checkpoints
/notebooks/.kaggle
/notebooks/course-v4
/notebooks/fastbook
/notebooks/lesson1_assets
/notebooks/models
/notebooks/20200920_fastai_lesson_prod_app.ipynb
/notebooks/20201006_fastai_lesson_6.ipynb
/notebooks/20201026_fastai_lesson_6_collab.ipynb
/notebooks/bluebook-for-bulldozers.zip
/notebooks/lesson_7_tabular.ipynb
/notebooks/storage
/notebooks/datasets

fname = p/'bluebook-for-bulldozers.zip'

dest = p/'storage/data/bluebook'
# dest.mkdir()

# only run once!
#file_extract(fname, dest)

The Data

each row of the dataset represents the sale of a single machine at an auction

df = pd.read_csv(dest/'TrainAndValid.csv', low_memory=False)

df.head()
SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter UsageBand saledate ... Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls
0 1139246 66000.0 999089 3157 121 3.0 2004 68.0 Low 11/16/2006 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional
1 1139248 57000.0 117657 77 121 3.0 1996 4640.0 Low 3/26/2004 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional
2 1139249 10000.0 434808 7009 121 3.0 2001 2838.0 High 2/26/2004 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1139251 38500.0 1026470 332 121 3.0 2001 3486.0 High 5/19/2011 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1139253 11000.0 1057373 17311 121 3.0 2007 722.0 Medium 7/23/2009 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 53 columns

df.columns
Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
       'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
       'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
       'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
       'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
       'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
       'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
       'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
       'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
       'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
       'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
       'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
       'Travel_Controls', 'Differential_Type', 'Steering_Controls'],
      dtype='object')

Transforming data

  • convert categorical ie ProductSize
  • set order
df['ProductSize'].unique()
array([nan, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large',
       'Compact'], dtype=object)
sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
# assign the result back; the inplace argument was removed in pandas 2.0
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)

The dependent variable here is SalePrice; this is the variable we want to predict. Kaggle specifically tells us the metric to use: root mean squared log error (RMSLE). If we take the log of the dependent variable up front, we can train and evaluate with plain RMSE, because RMSE on the logged prices is the RMSLE on the original prices.

dep_var = 'SalePrice'
df[dep_var] = np.log(df[dep_var])
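
A small sketch of why logging the target lets us use RMSE in place of RMSLE. It uses plain log, as the notebook does, and the prices are made up for illustration; the helper names rmse and rmsle are my own.

```python
import math

def rmse(preds, targets):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))

def rmsle(preds, targets):
    """Root mean squared log error, computed directly on raw prices."""
    return math.sqrt(
        sum((math.log(p) - math.log(t)) ** 2 for p, t in zip(preds, targets)) / len(targets)
    )

# Hypothetical prices in dollars, for illustration only
actual = [66000.0, 57000.0, 10000.0]
predicted = [60000.0, 59000.0, 12000.0]

# Log both once, the way the notebook logs SalePrice
log_actual = [math.log(a) for a in actual]
log_predicted = [math.log(p) for p in predicted]

# RMSE on the logged values matches RMSLE on the raw values
print(rmse(log_predicted, log_actual), rmsle(predicted, actual))
```

The two printed numbers agree, which is why the rest of the notebook can work entirely in log space.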

Decision Trees

  • ask binary questions about data
    • ie is x > y
  • the trouble is we don't know in advance which binary questions to ask; the training procedure has to work out what these should be.
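
To make "deciding which question to ask" concrete, here is a rough from-scratch sketch of choosing one split on one column: try each candidate threshold and keep the one whose two groups have the lowest size-weighted squared error. This is only an illustration under my own assumptions (the function names and the data are hypothetical); scikit-learn's implementation is far more elaborate.

```python
def mse(values):
    """Mean squared error of predicting the mean for every value."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Try each midpoint between sorted unique x values as a threshold and
    return (threshold, score), where score is the size-weighted MSE of the
    two groups the threshold creates."""
    uniq = sorted(set(xs))
    best = (None, float("inf"))
    for lo, hi in zip(uniq, uniq[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: year made vs. log sale price
year_made = [1985, 1988, 1990, 1995, 2001, 2004]
log_price = [9.2, 9.3, 9.1, 10.2, 10.4, 10.5]

threshold, score = best_split(year_made, log_price)
print(threshold, round(score, 4))
```

A full tree repeats this search over every column, picks the best question overall, then recurses into each of the two groups.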

Handling Dates

df = add_datepart(df, 'saledate')
/opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/tabular/core.py:33: FutureWarning: Series.dt.weekofyear and Series.dt.week have been deprecated.  Please use Series.dt.isocalendar().week instead.
  for n in attr: df[prefix + n] = getattr(field.dt, n.lower())

df_test = pd.read_csv(dest/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')

Using TabularPandas and TabularProc

We will use two tabular transforms (TabularProcs) to modify our data:

  • Categorify - replaces a column with a numeric category code
  • FillMissing - fills any missing data with the median, and also adds a boolean column set to True for any data point that was missing

procs = [Categorify, FillMissing]
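
As a rough mental model of what these two transforms do, here is a plain-Python sketch. It is not fastai's implementation; the function names and sample values are my own, and I am assuming an alphabetically sorted vocabulary with code 0 reserved for missing values.

```python
from statistics import median

def categorify(values):
    """Map each category to an integer code; 0 is reserved for missing values."""
    vocab = ["#na#"] + sorted({v for v in values if v is not None})
    codes = {v: i for i, v in enumerate(vocab)}
    return [codes.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace missing numbers with the median and flag which rows were filled."""
    med = median(v for v in values if v is not None)
    filled = [med if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

usage_band = ["Low", "High", None, "Medium"]   # hypothetical categorical column
hours = [68.0, 4640.0, None, 722.0]            # hypothetical continuous column

codes, vocab = categorify(usage_band)
filled, was_missing = fill_missing(hours)
print(codes, vocab)
print(filled, was_missing)
```

The extra boolean column matters because "this value was missing" can itself be predictive.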

Validation Set

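Hold aside the most recent data as a validation set, in line with the competition's time-based split. The condition below sends every sale from October to December of 2011 or later to the validation set and keeps the rest for training. Do this with np.where.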

cond = (df.saleYear<2011) | (df.saleMonth<10)

train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0] # inverse condition

splits = (list(train_idx), list(valid_idx))

Here cont_cat_split is a helper function that returns column names of cont and cat variables from given df.

# define continuous and categorical columns for TabularPandas

cont,cat = cont_cat_split(df, 1, dep_var=dep_var)

create a Tabular Object (to)

to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
len(to.train), len(to.valid)
(404710, 7988)

Look at the data with to.show. This will show the data in human readable form. to.items.head will show the processed data in numerical form.

to.show(3)
UsageBand fiModelDesc fiBaseModel fiSecondaryDesc fiModelSeries fiModelDescriptor ProductSize fiProductClassDesc state ProductGroup ProductGroupDesc Drive_System Enclosure Forks Pad_Type Ride_Control Stick Transmission Turbocharged Blade_Extension Blade_Width Enclosure_Type Engine_Horsepower Hydraulics Pushblock Ripper Scarifier Tip_Control Tire_Size Coupler Coupler_System Grouser_Tracks Hydraulics_Flow Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed auctioneerID_na MachineHoursCurrentMeter_na SalesID MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear SalePrice
0 Low 521D 521 D #na# #na# #na# Wheel Loader - 110.0 to 120.0 Horsepower Alabama WL Wheel Loader #na# EROPS w AC None or Unspecified #na# None or Unspecified #na# #na# #na# #na# #na# #na# #na# 2 Valve #na# #na# #na# #na# None or Unspecified None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# Standard Conventional False False False False False False 1163635200 False False 1139246 999089 3157 121 3.0 2004 68.0 2006 11 46 16 3 320 11.097410
1 Low 950FII 950 F II #na# Medium Wheel Loader - 150.0 to 175.0 Horsepower North Carolina WL Wheel Loader #na# EROPS w AC None or Unspecified #na# None or Unspecified #na# #na# #na# #na# #na# #na# #na# 2 Valve #na# #na# #na# #na# 23.5 None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# Standard Conventional False False False False False False 1080259200 False False 1139248 117657 77 121 3.0 1996 4640.0 2004 3 13 26 4 86 10.950807
2 High 226 226 #na# #na# #na# #na# Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity New York SSL Skid Steer Loaders #na# OROPS None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# Auxiliary #na# #na# #na# #na# #na# None or Unspecified None or Unspecified None or Unspecified Standard #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# False False False False False False 1077753600 False False 1139249 434808 7009 121 3.0 2001 2838.0 2004 2 9 26 3 57 9.210340
to.items.head(3)
SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter UsageBand fiModelDesc ... saleDayofyear saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed auctioneerID_na MachineHoursCurrentMeter_na
0 1139246 11.097410 999089 3157 121 3.0 2004 68.0 2 963 ... 320 1 1 1 1 1 1 2647 1 1
1 1139248 10.950807 117657 77 121 3.0 1996 4640.0 2 1745 ... 86 1 1 1 1 1 1 2148 1 1
2 1139249 9.210340 434808 7009 121 3.0 2001 2838.0 1 336 ... 57 1 1 1 1 1 1 2131 1 1

3 rows × 67 columns

check the vocab with to.classes

to.classes['ProductSize']
(#7) ['#na#','Large','Large / Medium','Medium','Small','Mini','Compact']
# save the tabular object for later

# (dest/'to.pkl').save(to)

Decision Tree Regressor: for continuous variables

# had to copy from 
# https://github.com/anandsaha/fastai.part1.v2/blob/master/fastai/structured.py

def draw_tree(t, df, size=10, ratio=0.6, precision=0):
    """ Draws a representation of a random forest in IPython.
    Parameters:
    -----------
    t: The tree you wish to draw
    df: The data used to train the tree. This is used to get the names of the features.
    """
    s=export_graphviz(t, out_file=None, feature_names=df.columns, filled=True,
                      special_characters=True, rotate=True, precision=precision)
    IPython.display.display(graphviz.Source(re.sub('Tree {',
       f'Tree {{ size={size}; ratio={ratio}', s)))
xs,y = to.train.xs, to.train.y

valid_xs, valid_y = to.valid.xs, to.valid.y
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(xs, y);
draw_tree(m, xs, precision=2);
[Figure: decision tree drawn by draw_tree]
Coupler_System ≤ 0.5 (mse = 0.48, samples = 404710, value = 10.1)
  True:  YearMade ≤ 1991.5 (mse = 0.42, samples = 360847, value = 10.21)
    True:  leaf (mse = 0.37, samples = 155724, value = 9.97)
    False: ProductSize ≤ 4.5 (mse = 0.37, samples = 205123, value = 10.4)
      True:  leaf (mse = 0.31, samples = 182403, value = 10.5)
      False: leaf (mse = 0.17, samples = 22720, value = 9.62)
  False: leaf (mse = 0.12, samples = 43863, value = 9.21)
samp_idx = np.random.permutation(len(y))[:500]
dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
        fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
        orientation='LR')
[Figure: dtreeviz rendering of the same four-leaf tree, drawn left to right]

Using dtreeviz is a little more intuitive and provides a bit more information. For example, the YearMade plot shows that there are years equal to 1000, which is obviously an error in the data (most likely a placeholder for a missing value). Let's fix that by replacing those values with 1950.

xs.loc[xs['YearMade']<1900, 'YearMade'] = 1950
valid_xs.loc[valid_xs['YearMade']<1900, 'YearMade'] = 1950
m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs, y)

dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
        fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
        orientation='LR')
[Figure: dtreeviz rendering of the tree after fixing YearMade]
# remove max_leaf_nodes
# build a bigger tree
m = DecisionTreeRegressor()
m.fit(xs, y)
DecisionTreeRegressor()
def r_mse(preds,y): return round(math.sqrt(((preds-y)**2).mean()),6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
m_rmse(m, xs, y)
0.0
m_rmse(m, valid_xs, valid_y)
0.333069

Our training error is 0 but the validation error is 0.33, so the tree has memorised the training set rather than generalised. The reason is that there are almost as many leaves in our model as observations in our data set. To avoid this, we need to pick a stopping criterion, such as a threshold that tells the model not to split a node if that would leave fewer than x items in a leaf. Do this with min_samples_leaf.

m.get_n_leaves(), len(xs)
(324549, 404710)

we have nearly as many leaf nodes as observations in our dataset

we need to create some rules here

m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
(0.248593, 0.323391)
m.get_n_leaves()
12397

Categorical Variables

Unlike with collab filtering, we do not need to create dummy variables with categorical values because through pre-processing, we have already transformed these into numerical values.

However, you can one-hot encode if you like.

Bagging

A technique developed by professor Leo Breiman. The idea is that you can bootstrap subsets of your data, train your model, store the predictions, then average the predictions.

Steps:

  1. randomly choose a subset
  2. train a model on the subset
  3. save the model, return to step one
  4. make a prediction using all of the models, then take the average of each model's prediction
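
The steps above can be sketched in a few lines. To keep it short, the "model" here is a deliberately trivial stand-in that always predicts the mean of its training targets; the function names, sample size and data are all hypothetical.

```python
import random

def fit_mean_model(ys):
    """A deliberately trivial 'model': always predict the mean of its training targets."""
    mean = sum(ys) / len(ys)
    return lambda row: mean

def bagged_models(xs, ys, n_models=10, sample_size=4, seed=0):
    """Train n_models models, each on a random subset drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idxs = [rng.randrange(len(xs)) for _ in range(sample_size)]  # step 1
        models.append(fit_mean_model([ys[i] for i in idxs]))         # steps 2 and 3
    return models

def bagged_predict(models, row):
    """Step 4: average the predictions of all the models."""
    return sum(m(row) for m in models) / len(models)

# Hypothetical data: one feature (year made) and a log price
xs = [[1985], [1990], [1995], [2001], [2004]]
ys = [9.2, 9.1, 10.2, 10.4, 10.5]

models = bagged_models(xs, ys)
pred = bagged_predict(models, [1999])
print(pred)
```

With real decision trees in place of the mean model, each tree makes different errors because it saw different rows, and averaging tends to cancel those errors out.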

Leo took this further by selecting not only a random subset of rows, but also a random subset of columns. This is known as a Random Forest.

The function rf below uses the following arguments:

  • n_estimators defines the number of trees
  • max_samples defines how many rows to sample for training each tree
  • max_features defines how many columns to sample at each split point (0.5 means "take half the total number of columns")
  • min_samples_leaf specifies when to stop splitting the tree nodes, effectively limiting the depth of the tree
  • n_jobs=-1 uses all available CPUs to build the trees in parallel

def rf(xs, y, n_estimators=40, max_samples=200_000,
      max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
      max_samples=max_samples, max_features=max_features,
      min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)
m = rf(xs, y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
(0.171231, 0.234308)

That is much better!

Making Predictions

To understand the impact of n_estimators you can get predictions from each individual tree in the forest

preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
preds
array([[10.11121098,  9.94458659,  9.42150625, ...,  9.17998473,
         9.29954442,  9.29954442],
       [10.14101274,  9.82777866,  9.55376172, ...,  9.48364408,
         9.48364408,  9.48364408],
       [10.01018006, 10.1378665 ,  9.27639723, ...,  9.49919689,
         9.18244871,  9.18244871],
       ...,
       [ 9.88747565,  9.52539463,  9.46619672, ...,  9.24248886,
         9.22252042,  9.22252042],
       [10.5004158 ,  9.98228111,  9.20137348, ...,  9.43222591,
         9.30666413,  9.30666413],
       [10.08093796, 10.61704159,  9.33421822, ...,  9.26213868,
         9.29758778,  9.29758778]])

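preds holds one row per tree and one column per row of the validation set, so it contains every tree's prediction for every validation row. Averaging over the trees (axis 0) reproduces the forest's own prediction: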

r_mse(preds.mean(axis=0), valid_y)
0.234308

Here is how I think you make a single prediction (this still needs checking).

# slice a row of data
valid_xs.iloc[0]
UsageBand             2.0
fiModelDesc        2301.0
fiBaseModel         706.0
fiSecondaryDesc      43.0
fiModelSeries         0.0
                    ...  
saleMonth            10.0
saleWeek             40.0
saleDay               3.0
saleDayofweek         0.0
saleDayofyear       276.0
Name: 22915, Length: 66, dtype: float64
# predict requires a 2D array, reshape your data

data = valid_xs.iloc[0].values.reshape(1,-1)

m.predict(data)
array([10.0005727])

You can visualise how RMSE improves as more and more trees are added

plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(40)]);
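
The same idea on a toy scale, without NumPy: average the first n trees' predictions for each row, then score that average. The prediction values below are invented for illustration, and ensemble_mean is a hypothetical helper.

```python
import math

def r_mse(preds, targets):
    """Rounded RMSE, mirroring the notebook's r_mse helper."""
    return round(math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)), 6)

# Hypothetical predictions: 3 trees (rows) x 4 validation rows (columns)
tree_preds = [
    [10.1, 9.9, 9.4, 9.2],
    [10.3, 9.7, 9.6, 9.5],
    [10.0, 10.1, 9.3, 9.4],
]
targets = [10.2, 9.8, 9.5, 9.3]

def ensemble_mean(tree_preds, n_trees):
    """Average the first n_trees predictions for every validation row."""
    n_rows = len(tree_preds[0])
    return [sum(tree_preds[t][r] for t in range(n_trees)) / n_trees for r in range(n_rows)]

# RMSE of the ensemble as trees are added one at a time
rmse_by_n_trees = [
    r_mse(ensemble_mean(tree_preds, n), targets) for n in range(1, len(tree_preds) + 1)
]
print(rmse_by_n_trees)
```

On such a tiny example the curve is noisy; with 40 real trees it usually flattens out after the first few dozen.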

validation set is worse than the training set. Why?

  • we might be overfitting
  • the most recent auction data (the period held out for validation) may have been different somehow

Out of Bag Error (OOB error)

How can we check? We can use the out-of-bag (OOB) predictions from the model and compute RMSE between them and the training targets.

r_mse(m.oob_prediction_, y)
0.211059

What is happening here?

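OOB error gives you a sense of how much you are overfitting: it is measured on training rows, but each row is scored only by the trees that did not train on it.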
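
A rough sketch of how out-of-bag predictions can be assembled. Again the "model" is a stand-in that predicts the mean target of its in-bag rows, and all names and numbers are hypothetical; scikit-learn does the equivalent bookkeeping internally when oob_score=True.

```python
import random

def oob_predictions(ys, n_models=50, sample_size=3, seed=0):
    """For each row, average only the models whose training sample excluded it."""
    rng = random.Random(seed)
    n = len(ys)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_models):
        # Rows this model trains on (duplicates collapse in the set)
        in_bag = {rng.randrange(n) for _ in range(sample_size)}
        # Stand-in model: predict the mean target of the in-bag rows
        pred = sum(ys[i] for i in in_bag) / len(in_bag)
        for i in range(n):
            if i not in in_bag:      # row i is out-of-bag for this model
                sums[i] += pred
                counts[i] += 1
    # A row that was in every bag has no OOB prediction
    return [s / c if c else None for s, c in zip(sums, counts)]

log_prices = [9.2, 9.3, 9.1, 10.2, 10.4, 10.5]   # hypothetical targets
oob = oob_predictions(log_prices)
print(oob)
```

Because no row is ever scored by a model that saw it, the resulting error behaves like a validation error without needing a separate validation set.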

Model interpretation

For tabular data, model interpretation is particularly important. For a given model, the things we are most likely to be interested in are:

  • How confident are we in our predictions using a particular row of data?
  • For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
  • Which columns are the strongest predictors, which can we ignore?
  • Which columns are effectively redundant with each other, for purposes of prediction?
  • How do predictions vary, as we vary these columns?
preds_std = preds.std(0)
preds_std
array([0.24516726, 0.17149769, 0.12129906, ..., 0.1911843 , 0.16064838,
       0.16064838])
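
The standard deviation across trees is one way to approach the first question: when the trees disagree about a row, we should trust that prediction less. A tiny illustration with made-up numbers, using the population standard deviation (which is also what NumPy's std computes by default):

```python
from statistics import pstdev

# Hypothetical predictions: 4 trees (rows) x 3 validation rows (columns)
tree_preds = [
    [10.1, 9.4, 9.2],
    [10.1, 9.6, 9.5],
    [10.1, 9.3, 9.4],
    [10.1, 9.5, 9.9],
]

# Standard deviation across trees for each validation row
std_per_row = [pstdev(col) for col in zip(*tree_preds)]

# The row where the trees disagree the most
most_uncertain_row = max(range(len(std_per_row)), key=std_per_row.__getitem__)
print(std_per_row, most_uncertain_row)
```

Here the trees agree exactly on the first row and disagree most on the last, so the last prediction deserves the most caution.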

Feature Importance

def rf_feat_importance(m, df):
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                        ).sort_values('imp', ascending=False)
fi = rf_feat_importance(m, xs)

fi[:10]
cols imp
58 YearMade 0.175870
6 ProductSize 0.118317
30 Coupler_System 0.099115
7 fiProductClassDesc 0.071253
31 Grouser_Tracks 0.064081
55 ModelID 0.061831
50 saleElapsed 0.051997
32 Hydraulics_Flow 0.042906
3 fiSecondaryDesc 0.039384
1 fiModelDesc 0.031282
def plot_fi(fi):
    return fi.plot.barh('cols', 'imp', figsize=(12,8), legend=False)

plot_fi(fi[:30]);

Removing low-importance variables

to_keep = fi[fi.imp>0.005].cols
len(to_keep)
20

retrain model using only this subset of columns

xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]
m = rf(xs_imp, y)
m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)
(0.180874, 0.231109)
cluster_columns??
cluster_columns(xs_imp)

seems like we could remove some of the clustered columns.

Let's calculate a baseline using a sample of data

def get_oob(df):
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
        max_samples=50000, max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(df, y)
    return m.oob_score_
get_oob(xs_imp)
0.8769414512037411

now remove redundant columns one at a time

{c:get_oob(xs_imp.drop(c, axis=1)) for c in (
    'saleYear', 'saleElapsed', 'ProductGroupDesc','ProductGroup',
    'fiModelDesc', 'fiBaseModel',
    'Hydraulics_Flow','Grouser_Tracks', 'Coupler_System')}
{'saleYear': 0.8754647378064608,
 'saleElapsed': 0.8723764211506366,
 'ProductGroupDesc': 0.8765810711892142,
 'ProductGroup': 0.8773280451665235,
 'fiModelDesc': 0.8758816619476205,
 'fiBaseModel': 0.8756967579434771,
 'Hydraulics_Flow': 0.8769591574648032,
 'Grouser_Tracks': 0.876871969234521,
 'Coupler_System': 0.8762546455208067}

not much change here so try dropping multiple variables

to_drop = ['saleYear', 'ProductGroupDesc', 'fiBaseModel', 'Grouser_Tracks']
get_oob(xs_imp.drop(to_drop, axis=1))
0.875680776530026
xs_final = xs_imp.drop(to_drop, axis=1)
valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)

check rmse again to confirm accuracy hasn't really changed

m = rf(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)
(0.182544, 0.23238)

similar accuracy but fewer features!

Partial Dependence

what is the relationship between variables

p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)),c);

the largest group is actually #na#

do the same for YearMade

ax = valid_xs_final['YearMade'].hist()
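# plot_partial_dependence was removed in scikit-learn 1.2; on newer versions
# use PartialDependenceDisplay.from_estimator with the same arguments
from sklearn.inspection import plot_partial_dependence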

fig,ax = plt.subplots(figsize=(12, 4))
plot_partial_dependence(m, valid_xs_final, ['YearMade','ProductSize'],
                        grid_resolution=20, ax=ax);

What is partial dependence telling us?

We want to look at how YearMade affects the sale price, that is, all else being equal, what effect does the year have on sale price?
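
One way to picture the "all else being equal" calculation: overwrite the column of interest with a single value in every row, average the model's predictions, and repeat for each value on a grid. The sketch below assumes a toy stand-in for the trained model; toy_predict, the rows and the grid are all hypothetical.

```python
def partial_dependence(predict, rows, column, grid):
    """For each value in grid, overwrite `column` in every row with that value
    and average the model's predictions."""
    averages = []
    for value in grid:
        modified = [{**row, column: value} for row in rows]
        averages.append(sum(predict(r) for r in modified) / len(modified))
    return averages

# Hypothetical stand-in for a trained model: newer and larger machines sell for more
def toy_predict(row):
    return 9.0 + 0.05 * (row["YearMade"] - 1990) + 0.3 * row["ProductSize"]

rows = [
    {"YearMade": 1995, "ProductSize": 1},
    {"YearMade": 2004, "ProductSize": 3},
    {"YearMade": 1988, "ProductSize": 2},
]

years = [1990, 2000, 2010]
pd_year = partial_dependence(toy_predict, rows, "YearMade", years)
print(list(zip(years, pd_year)))
```

Every other column keeps its real value in each row, so the resulting curve isolates the effect of the one column being varied.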

Tree Interpreter

#!pip install treeinterpreter
#!pip install waterfallcharts
import warnings
warnings.simplefilter('ignore', FutureWarning)

from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall
row = valid_xs_final.iloc[:5]
prediction,bias,contributions = treeinterpreter.predict(m, row.values)
prediction[0], bias[0], contributions[0].sum()
(array([10.03875756]), 10.104200155980113, -0.06544259554720655)
waterfall(valid_xs_final.columns, contributions[0], threshold=0.08, 
          rotation_value=45,formatting='{:,.3f}');
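
A quick arithmetic check on the numbers printed above for the first row: the prediction should equal the bias (the average target at the root of the trees) plus the sum of the per-feature contributions. The variable names here are my own.

```python
# Values copied from the treeinterpreter output for the first row
prediction = 10.03875756
bias = 10.104200155980113
contributions_sum = -0.06544259554720655

# Rebuild the prediction from its two parts
reconstructed = bias + contributions_sum
print(round(reconstructed, 8))
```

The waterfall chart is this same sum drawn one feature at a time, starting from the bias.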

The extrapolation problem

# seed torch (not numpy), since the noise below comes from torch.randn_like
torch.manual_seed(42)
x_lin = torch.linspace(0,20, steps=40)
y_lin = x_lin + torch.randn_like(x_lin)
plt.scatter(x_lin, y_lin);
xs_lin = x_lin.unsqueeze(1)
x_lin.shape, xs_lin.shape
(torch.Size([40]), torch.Size([40, 1]))

you can do the same using None

x_lin[:,None].shape
torch.Size([40, 1])
m_lin = RandomForestRegressor().fit(x_lin[:30].reshape(-1, 1), y_lin[:30])
plt.scatter(x_lin, y_lin, 20)
plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5);

A random forest cannot extrapolate outside the bounds of the training data, because every tree predicts an average of training targets.
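
To see why, here is a minimal from-scratch regression tree on one feature (a single tree rather than a forest, and my own hypothetical implementation, not scikit-learn's). Each leaf stores the mean of the training targets that reached it, so no input can ever produce a prediction above the largest training target.

```python
def fit_tree(xs, ys, min_leaf=2):
    """Recursively split 1-D data at the threshold that minimises squared error.
    Leaves store the mean of the training targets that reach them."""
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # cannot split between equal x values
        left, right = [y for _, y in pairs[:i]], [y for _, y in pairs[i:]]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, (pairs[i - 1][0] + pairs[i][0]) / 2, i)
    if best is None:
        return sum(ys) / len(ys)                       # leaf: mean target
    _, threshold, i = best
    lx, ly = zip(*pairs[:i])
    rx, ry = zip(*pairs[i:])
    return (threshold, fit_tree(lx, ly, min_leaf), fit_tree(rx, ry, min_leaf))

def predict(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

# Train on x in [0, 14] with y = x, then ask about points beyond the training range
train_x = [float(i) for i in range(15)]
train_y = list(train_x)
tree = fit_tree(train_x, train_y)
print(predict(tree, 14.0), predict(tree, 19.0), predict(tree, 100.0))
```

All three printed values are the same: anything to the right of the training data falls into the last leaf. A forest averages many such trees, so it inherits the same ceiling.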

we need to make sure validation set does not contain out of domain data

test and training set may vary, how do we tell??

Finding out of domain data

df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))

m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:6]
cols imp
5 saleElapsed 0.915808
10 SalesID 0.069088
13 MachineID 0.011871
0 YearMade 0.000678
4 ModelID 0.000536
12 Hydraulics 0.000529
m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))

for c in ('SalesID','saleElapsed','MachineID'):
    m = rf(xs_final.drop(c,axis=1), y)
    print(c, m_rmse(m, valid_xs_final.drop(c,axis=1), valid_y))
orig 0.233484
SalesID 0.231357
saleElapsed 0.236643
MachineID 0.231104

time_vars = ['SalesID','MachineID']
xs_final_time = xs_final.drop(time_vars, axis=1)
valid_xs_time = valid_xs_final.drop(time_vars, axis=1)

m = rf(xs_final_time, y)
m_rmse(m, valid_xs_time, valid_y)
0.229127
xs['saleYear'].hist();
filt = xs['saleYear']>2004
xs_filt = xs_final_time[filt]
y_filt = y[filt]
m = rf(xs_filt, y_filt)
m_rmse(m, xs_filt, y_filt), m_rmse(m, valid_xs_time, valid_y)
(0.176904, 0.22864)

Using a Neural Net

# load data
df_nn = pd.read_csv(dest/'TrainAndValid.csv', low_memory=False)

# set ProductSize as categorical
df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')
# assign the result back; the inplace argument was removed in pandas 2.0
df_nn['ProductSize'] = df_nn['ProductSize'].cat.set_categories(sizes, ordered=True)

# take log of dependent variable
df_nn[dep_var] = np.log(df_nn[dep_var])

# do some date prep
df_nn = add_datepart(df_nn, 'saledate')
df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
cont_nn.append('saleElapsed')
cat_nn.remove('saleElapsed')
df_nn['saleElapsed'] = df_nn['saleElapsed'].astype(int)
df_nn_final[cat_nn].nunique()
YearMade                73
ProductSize              6
Coupler_System           2
fiProductClassDesc      74
ModelID               5281
Hydraulics_Flow          3
fiSecondaryDesc        177
fiModelDesc           5059
Enclosure                6
ProductGroup             6
Hydraulics              12
fiModelDescriptor      140
Drive_System             4
dtype: int64
xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
(0.178922, 0.230357)
cat_nn.remove('fiModelDescriptor')
df_nn_final['saleElapsed'].astype('int64', copy=False)

df_nn_final.dtypes
YearMade                 int64
ProductSize           category
Coupler_System          object
fiProductClassDesc      object
ModelID                  int64
saleElapsed              int64
Hydraulics_Flow         object
fiSecondaryDesc         object
fiModelDesc             object
Enclosure               object
ProductGroup            object
Hydraulics              object
fiModelDescriptor       object
Drive_System            object
SalePrice              float64
dtype: object

Normalize

Normalize subtracts the mean, then divides by the standard deviation. We didn't need this for a decision tree because its binary splits depend only on the ordering of the values, not their scale. However, we do need to normalize for neural nets, because inputs on very different scales or with extreme distributions make training harder.
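
A minimal sketch of that calculation in plain Python. I am assuming the population standard deviation here, and the input is the handful of MachineHoursCurrentMeter values visible in df.head() earlier; fastai computes its statistics on the training set.

```python
from statistics import mean, pstdev

def normalize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# MachineHoursCurrentMeter values from the first five rows of the data
hours = [68.0, 4640.0, 2838.0, 3486.0, 722.0]
scaled = normalize(hours)
print([round(s, 3) for s in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, so it sits on the same footing as every other continuous input.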

procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df=df_nn_final, 
                      procs=procs_nn, 
                      cat_names=cat_nn, 
                      cont_names=cont_nn,
                      splits=splits, 
                      y_names=dep_var)
dls = to_nn.dataloaders(1024)

This is a regression model so we want to set our y_range based on the min and max of the dependent variable.
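
As far as I understand it, fastai enforces y_range by passing the network's final output through a scaled sigmoid. The sketch below shows that calculation in plain Python with an illustrative range of (8, 12); the function mirrors what fastai calls sigmoid_range, but this version is my own.

```python
import math

def sigmoid_range(x, low, high):
    """Squash any real number into the open interval (low, high)."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low

# Raw network outputs, however extreme, land strictly inside the range
for raw in (-10.0, 0.0, 10.0):
    print(raw, round(sigmoid_range(raw, 8, 12), 4))
```

Because a sigmoid never quite reaches its limits, it helps to pick a range slightly wider than the observed minimum and maximum of the target.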

y = to_nn.train.y
y.min(),y.max()
(8.465899, 11.863583)
learn = tabular_learner(dls, y_range=(8,12), layers=[500,250],
                        n_out=1, loss_func=F.mse_loss)
learn.lr_find()
SuggestedLRs(lr_min=0.003981071710586548, lr_steep=0.00019054606673307717)
learn.fit_one_cycle(5, 1e-2)
epoch train_loss valid_loss time
0 0.069223 0.062953 00:11
1 0.056285 0.055872 00:13
2 0.048484 0.055052 00:12
3 0.043525 0.051425 00:12
4 0.040454 0.051055 00:11
preds,targs = learn.get_preds()
r_mse(preds,targs)
0.225954

Summary

Random Forests are easy to train, resilient, don't require much pre-processing, train quickly and are fairly resistant to overfitting. They can be less accurate than a neural net and can take longer at inference time to evaluate the trees.

Neural Nets are probably the fiddliest models to implement and set up but can give slightly better results.