Lesson 7: Tabular Data

Pt 1

Decision Tree Ensembles

Some examples include...
- Random Forest (Regressor / Classifier)
- Gradient Boost (Regressor / Classifier)
- XGBoost (Regressor / Classifier)

We will use the Scikit-Learn library for this task rather than PyTorch.

The Data

Blue Book for Bulldozers Kaggle Competition
The goal is to predict the sale price of heavy equipment at auction based on usage, type and configuration.

Kaggle setup help

There are a number of different ways to do this. I had some trouble with the usual setup, so after some reading I tried the approach below instead.

fastai helper functions

Had some trouble importing fastai's helper functions; found that someone has added them all here

from pathlib import Path

from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype

from fastai.tabular.all import *
# helper functions
from fastbook import *

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor,export_graphviz

from dtreeviz.trees import *

import IPython
from IPython.display import Image, display_svg, SVG

import os

Load credentials from json file. This is a simple file with the following information...

{"username":"xxx","key":"xxx"}

import json

with open('creds.json') as f:
    creds = json.load(f)

os.environ['KAGGLE_USERNAME']=creds["username"]
os.environ['KAGGLE_KEY']=creds["key"]

from kaggle import api

api.competition_download_cli('bluebook-for-bulldozers')

 10%|█         | 5.00M/48.4M [00:00<00:01, 30.5MB/s]

Downloading bluebook-for-bulldozers.zip to /notebooks

100%|██████████| 48.4M/48.4M [00:01<00:00, 28.5MB/s]

p = Path.cwd()

for i in p.iterdir():
    print(i)

/notebooks/.ipynb_checkpoints
/notebooks/.kaggle
/notebooks/course-v4
/notebooks/fastbook
/notebooks/lesson1_assets
/notebooks/models
/notebooks/20200920_fastai_lesson_prod_app.ipynb
/notebooks/20201006_fastai_lesson_6.ipynb
/notebooks/20201026_fastai_lesson_6_collab.ipynb
/notebooks/bluebook-for-bulldozers.zip
/notebooks/lesson_7_tabular.ipynb
/notebooks/storage
/notebooks/datasets

fname = p/'bluebook-for-bulldozers.zip'

dest = p/'storage/data/bluebook'
# dest.mkdir()

# only run once!
#file_extract(fname, dest)

The Data

each row of the dataset represents the sale of a single machine at an auction

df = pd.read_csv(dest/'TrainAndValid.csv', low_memory=False)

df.head()

	SalesID	SalePrice	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	UsageBand	saledate	...	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type	Backhoe_Mounting	Blade_Type	Travel_Controls	Differential_Type	Steering_Controls
0	1139246	66000.0	999089	3157	121	3.0	2004	68.0	Low	11/16/2006 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Standard	Conventional
1	1139248	57000.0	117657	77	121	3.0	1996	4640.0	Low	3/26/2004 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Standard	Conventional
2	1139249	10000.0	434808	7009	121	3.0	2001	2838.0	High	2/26/2004 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1139251	38500.0	1026470	332	121	3.0	2001	3486.0	High	5/19/2011 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	1139253	11000.0	1057373	17311	121	3.0	2007	722.0	Medium	7/23/2009 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 53 columns

df.columns

Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
       'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
       'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
       'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
       'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
       'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
       'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
       'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
       'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
       'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
       'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
       'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
       'Travel_Controls', 'Differential_Type', 'Steering_Controls'],
      dtype='object')

Transforming data

Convert a categorical column, e.g. ProductSize, to an ordered category and set the order of its levels.

df['ProductSize'].unique()

array([nan, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large',
       'Compact'], dtype=object)

sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'

df['ProductSize'] = df['ProductSize'].astype('category')
# assign the result: the inplace argument of set_categories was removed in pandas 2.0
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)

The dependent variable here is SalePrice; this is the variable we want to predict. Kaggle specifically tells us the metric to use: root mean squared log error (RMSLE). If we take the log of the dependent variable up front, we can simply use RMSE on the logged values.

dep_var = 'SalePrice'

df[dep_var] = np.log(df[dep_var])
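A quick check of that claim, as a minimal sketch with made-up prices (the numbers and the helper names rmse and rmsle are illustrative, not from the competition data). RMSLE is defined here with a plain log, matching the np.log call above.

```python
import math

def rmse(preds, targets):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

def rmsle(preds, targets):
    """Root mean squared log error, using a plain log as in the notes."""
    return math.sqrt(
        sum((math.log(p) - math.log(t)) ** 2 for p, t in zip(preds, targets)) / len(preds)
    )

# Hypothetical sale prices and predictions, in dollars
actual = [66000.0, 57000.0, 10000.0]
predicted = [60000.0, 59000.0, 12000.0]

log_actual = [math.log(v) for v in actual]
log_predicted = [math.log(v) for v in predicted]

# RMSE on the logged values matches RMSLE on the raw values
print(rmse(log_predicted, log_actual), rmsle(predicted, actual))
```

Because the two numbers agree, a model trained on the logged SalePrice can be scored with ordinary RMSE.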

Decision Trees

A decision tree asks a series of binary questions about the data
- ie is x > y
The trouble is we don't know in advance which binary questions to ask; the machine learning algorithm has to work out which questions split the data best.
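To make "deciding which question to ask" concrete, here is a toy sketch of how one split could be chosen for a single column: try each candidate threshold and keep the one that leaves the two groups with the smallest squared error around their own means. The function best_split and the six data points are made up for illustration; scikit-learn's real implementation is more elaborate.

```python
def best_split(xs, ys):
    """Return (threshold, score) for the question 'x <= threshold' that
    minimises the summed squared error around each side's mean."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best = (None, float("inf"))
    # Skip the largest value: splitting there would leave the right side empty
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = sse(left) + sse(right)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical machines: year made and log sale price
year_made = [1995, 1998, 2001, 2004, 2006, 2007]
log_price = [9.2, 9.3, 9.4, 10.9, 11.0, 11.1]

threshold, score = best_split(year_made, log_price)
print(threshold, round(score, 4))
```

A full tree repeats this search over every column, then recurses into each of the two resulting groups.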

Handling Dates

df = add_datepart(df, 'saledate')

/opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/tabular/core.py:33: FutureWarning: Series.dt.weekofyear and Series.dt.week have been deprecated.  Please use Series.dt.isocalendar().week instead.
  for n in attr: df[prefix + n] = getattr(field.dt, n.lower())

df_test = pd.read_csv(dest/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')

Using TabularPandas and TabularProc

We will use two tabular transforms to modify our data.
- Categorify replaces a column with a numeric category code
- FillMissing fills any missing data with the median and adds a boolean column that is True for any data point that was missing

procs = [Categorify, FillMissing]
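As a rough picture of what these two transforms do, here is a plain-Python sketch. The functions categorify and fill_missing are stand-ins written for this note, not fastai's implementation, and the sample values are made up.

```python
from statistics import median

def categorify(values):
    """Map each distinct non-missing value to an integer code; 0 is reserved for missing."""
    vocab = ["#na#"] + sorted({v for v in values if v is not None})
    lookup = {v: i for i, v in enumerate(vocab)}
    return [lookup.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace None with the median and return a parallel list of 'was missing' flags."""
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values], [v is None for v in values]

# Hypothetical columns with one missing entry each
usage_band = ["Low", "High", None, "Medium"]
hours = [68.0, None, 2838.0, 722.0]

codes, vocab = categorify(usage_band)
filled, was_na = fill_missing(hours)

print(vocab, codes)
print(filled, was_na)
```

The extra boolean column matters because the fact that a value was missing can itself be predictive.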

Validation Set

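Hold aside the most recent sales as a validation set, in the spirit of the competition rules, so the model is judged on auctions later than the ones it was trained on. Do this with np.where; the condition below keeps a row for training if saleYear < 2011 or saleMonth < 10, and everything else becomes validation data.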

cond = (df.saleYear<2011) | (df.saleMonth<10)

train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0] # inverse condition

splits = (list(train_idx), list(valid_idx))
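A small sketch of how this condition sorts rows, using five made-up sale dates (the arrays sale_year and sale_month are illustrative, not taken from the dataset).

```python
import numpy as np

# Hypothetical sales: (year, month)
sale_year = np.array([2010, 2011, 2011, 2011, 2012])
sale_month = np.array([11, 9, 10, 12, 2])

# Same condition as in the notes
cond = (sale_year < 2011) | (sale_month < 10)

train_idx = np.where(cond)[0]    # positions where the condition is True
valid_idx = np.where(~cond)[0]   # positions where it is False

print(train_idx, valid_idx)
```

Only the October and December 2011 rows end up in validation. Note that the hypothetical February 2012 row satisfies saleMonth < 10 and so lands in the training set, which is worth keeping in mind when reading the condition.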

Here cont_cat_split is a helper function that returns column names of cont and cat variables from given df.

# define continuous and categorical columns for TabularPandas

cont,cat = cont_cat_split(df, 1, dep_var=dep_var)

create a Tabular Object (to)

to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)

len(to.train), len(to.valid)

(404710, 7988)

Look at the data with to.show. This will show the data in human readable form. to.items.head will show the processed data in numerical form.

to.show(3)

	UsageBand	fiModelDesc	fiBaseModel	fiSecondaryDesc	fiModelSeries	fiModelDescriptor	ProductSize	fiProductClassDesc	state	ProductGroup	ProductGroupDesc	Drive_System	Enclosure	Forks	Pad_Type	Ride_Control	Stick	Transmission	Turbocharged	Blade_Extension	Blade_Width	Enclosure_Type	Engine_Horsepower	Hydraulics	Pushblock	Ripper	Scarifier	Tip_Control	Tire_Size	Coupler	Coupler_System	Grouser_Tracks	Hydraulics_Flow	Track_Type	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type	Backhoe_Mounting	Blade_Type	Travel_Controls	Differential_Type	Steering_Controls	saleIs_month_end	saleIs_month_start	saleIs_quarter_end	saleIs_quarter_start	saleIs_year_end	saleIs_year_start	saleElapsed	auctioneerID_na	MachineHoursCurrentMeter_na	SalesID	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	saleYear	saleMonth	saleWeek	saleDay	saleDayofweek	saleDayofyear	SalePrice
0	Low	521D	521	D	#na#	#na#	#na#	Wheel Loader - 110.0 to 120.0 Horsepower	Alabama	WL	Wheel Loader	#na#	EROPS w AC	None or Unspecified	#na#	None or Unspecified	#na#	#na#	#na#	#na#	#na#	#na#	#na#	2 Valve	#na#	#na#	#na#	#na#	None or Unspecified	None or Unspecified	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	Standard	Conventional	False	False	False	False	False	False	1163635200	False	False	1139246	999089	3157	121	3.0	2004	68.0	2006	11	46	16	3	320	11.097410
1	Low	950FII	950	F	II	#na#	Medium	Wheel Loader - 150.0 to 175.0 Horsepower	North Carolina	WL	Wheel Loader	#na#	EROPS w AC	None or Unspecified	#na#	None or Unspecified	#na#	#na#	#na#	#na#	#na#	#na#	#na#	2 Valve	#na#	#na#	#na#	#na#	23.5	None or Unspecified	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	Standard	Conventional	False	False	False	False	False	False	1080259200	False	False	1139248	117657	77	121	3.0	1996	4640.0	2004	3	13	26	4	86	10.950807
2	High	226	226	#na#	#na#	#na#	#na#	Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity	New York	SSL	Skid Steer Loaders	#na#	OROPS	None or Unspecified	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	Auxiliary	#na#	#na#	#na#	#na#	#na#	None or Unspecified	None or Unspecified	None or Unspecified	Standard	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	#na#	False	False	False	False	False	False	1077753600	False	False	1139249	434808	7009	121	3.0	2001	2838.0	2004	2	9	26	3	57	9.210340

to.items.head(3)

	SalesID	SalePrice	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	UsageBand	fiModelDesc	...	saleDayofyear	saleIs_month_end	saleIs_month_start	saleIs_quarter_end	saleIs_quarter_start	saleIs_year_end	saleIs_year_start	saleElapsed	auctioneerID_na	MachineHoursCurrentMeter_na
0	1139246	11.097410	999089	3157	121	3.0	2004	68.0	2	963	...	320	1	1	1	1	1	1	2647	1	1
1	1139248	10.950807	117657	77	121	3.0	1996	4640.0	2	1745	...	86	1	1	1	1	1	1	2148	1	1
2	1139249	9.210340	434808	7009	121	3.0	2001	2838.0	1	336	...	57	1	1	1	1	1	1	2131	1	1

3 rows × 67 columns

check the vocab with to.classes

to.classes['ProductSize']

(#7) ['#na#','Large','Large / Medium','Medium','Small','Mini','Compact']

# save the tabular object for later

# (dest/'to.pkl').save(to)

Decision Tree Regressor: for continuous variables

# had to copy from 
# https://github.com/anandsaha/fastai.part1.v2/blob/master/fastai/structured.py

def draw_tree(t, df, size=10, ratio=0.6, precision=0):
    """ Draws a representation of a decision tree in IPython.
    Parameters:
    -----------
    t: The tree you wish to draw
    df: The data used to train the tree. This is used to get the names of the features.
    """
    s=export_graphviz(t, out_file=None, feature_names=df.columns, filled=True,
                      special_characters=True, rotate=True, precision=precision)
    IPython.display.display(graphviz.Source(re.sub('Tree {',
       f'Tree {{ size={size}; ratio={ratio}', s)))

xs,y = to.train.xs, to.train.y

valid_xs, valid_y = to.valid.xs, to.valid.y

m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(xs, y);

draw_tree(m, xs, precision=2);

samp_idx = np.random.permutation(len(y))[:500]
dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
        fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
        orientation='LR')

Using Dtreeviz is a little more intuitive and provides a bit more information. For example, YearMade is showing that there are years equal to 1000, which is obviously an error in the data. Let's fix that by replacing those years with 1950.

xs.loc[xs['YearMade']<1900, 'YearMade'] = 1950
valid_xs.loc[valid_xs['YearMade']<1900, 'YearMade'] = 1950

m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs, y)

dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
        fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
        orientation='LR')

# remove max_leaf_nodes
# build a bigger tree
m = DecisionTreeRegressor()
m.fit(xs, y)

DecisionTreeRegressor()

def r_mse(preds,y): return round(math.sqrt(((preds-y)**2).mean()),6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)

m_rmse(m, xs, y)

0.0

m_rmse(m, valid_xs, valid_y)

0.333069

Our training error is 0 and the validation error is worse than the value observed in the original tree viz. The reason is that there are almost as many leaves in our model as observations in our data set, so the tree has effectively memorised the training data. To avoid this, we need to pick some stopping criterion, such as a threshold that tells the model not to split a node if a resulting leaf would hold fewer than x items. Do this with min_samples_leaf.

m.get_n_leaves(), len(xs)

(324549, 404710)

we have nearly as many leaf nodes as observations in our dataset

we need to create some rules here

m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)

(0.248593, 0.323391)

m.get_n_leaves()
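The same effect can be reproduced on a small synthetic dataset. This is a sketch under assumed data (a noisy sine curve invented for the purpose), not the bulldozer data; the helper rmse is defined locally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 500 evenly spaced points on a noisy sine curve
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 500).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

def rmse(model, X, y):
    """Root mean squared error of the model's predictions on (X, y)."""
    return float(np.sqrt(((model.predict(X) - y) ** 2).mean()))

# No stopping rule: the tree keeps splitting until it has memorised the noise
full = DecisionTreeRegressor(random_state=0).fit(X, y)
# Every leaf must keep at least 25 training rows
limited = DecisionTreeRegressor(min_samples_leaf=25, random_state=0).fit(X, y)

print(full.get_n_leaves(), rmse(full, X, y))
print(limited.get_n_leaves(), rmse(limited, X, y))
```

With 500 rows and at least 25 rows per leaf, the limited tree can have at most 20 leaves, so it is forced to average over noise instead of fitting it.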

Categorical Variables

Unlike with collaborative filtering, we do not need to create dummy variables for categorical values, because pre-processing has already transformed them into numerical codes that the tree can split on.

However, you can one-hot encode if you like.

Bagging

A technique developed by Professor Leo Breiman. The idea is to train a model on each of several bootstrapped subsets of the data, store the predictions, then average them.

Steps
1. randomly choose a subset of the rows
2. train a model on the subset
3. save the model, return to step one
4. make a prediction using all of the models, then take the average of the models' predictions
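The steps above can be sketched with any base model. Here a straight-line fit via np.polyfit stands in for the decision tree, purely to keep the example short; the data is synthetic and the choice of 20 models and 100 rows per subset is arbitrary.

```python
import numpy as np

# Synthetic data: y is roughly 2 * x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2 * x + rng.normal(scale=1.0, size=x.size)

models = []
for _ in range(20):
    idx = rng.choice(x.size, size=100, replace=True)  # 1. randomly choose a subset
    coefs = np.polyfit(x[idx], y[idx], deg=1)          # 2. train a model on the subset
    models.append(coefs)                               # 3. save the model and repeat

x_new = np.array([2.5, 7.5])
# 4. predict with every model, then average across models
all_preds = np.stack([np.polyval(c, x_new) for c in models])  # shape (20, 2)
bagged = all_preds.mean(axis=0)

print(bagged)
```

Each individual model sees slightly different data and makes slightly different errors; averaging tends to cancel those errors out.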

Breiman extended the idea by selecting not only a random subset of rows but also a random subset of columns at each split. This is known as a Random Forest.

The function rf below uses the following arguments
- n_estimators defines the number of trees
- max_samples defines how many rows to sample for training each tree
- max_features defines how many columns to sample at each split point (0.5 means "take half the total number of columns")
- min_samples_leaf specifies when to stop splitting the tree nodes, effectively limiting the depth of the tree
- n_jobs=-1 uses all available CPUs to build the trees in parallel

def rf(xs, y, n_estimators=40, max_samples=200_000,
      max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
      max_samples=max_samples, max_features=max_features,
      min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)

m = rf(xs, y)

m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)

(0.171231, 0.234308)

That is much better!

Making Predictions

To understand the impact of n_estimators you can get predictions from each individual tree in the forest

preds = np.stack([t.predict(valid_xs) for t in m.estimators_])

preds

array([[10.11121098,  9.94458659,  9.42150625, ...,  9.17998473,
         9.29954442,  9.29954442],
       [10.14101274,  9.82777866,  9.55376172, ...,  9.48364408,
         9.48364408,  9.48364408],
       [10.01018006, 10.1378665 ,  9.27639723, ...,  9.49919689,
         9.18244871,  9.18244871],
       ...,
       [ 9.88747565,  9.52539463,  9.46619672, ...,  9.24248886,
         9.22252042,  9.22252042],
       [10.5004158 ,  9.98228111,  9.20137348, ...,  9.43222591,
         9.30666413,  9.30666413],
       [10.08093796, 10.61704159,  9.33421822, ...,  9.26213868,
         9.29758778,  9.29758778]])

preds has one row per tree and one column per row of the validation set, so it holds every tree's prediction for every validation row.

r_mse(preds.mean(axis=0), valid_y)

0.234308

Here is how to make a single prediction: slice one row, reshape it into the 2D shape that predict expects, and pass it to the model.

# slice a row of data
valid_xs.iloc[0]

UsageBand             2.0
fiModelDesc        2301.0
fiBaseModel         706.0
fiSecondaryDesc      43.0
fiModelSeries         0.0
                    ...  
saleMonth            10.0
saleWeek             40.0
saleDay               3.0
saleDayofweek         0.0
saleDayofyear       276.0
Name: 22915, Length: 66, dtype: float64

# predict requires a 2D array, reshape your data

data = valid_xs.iloc[0].values.reshape(1,-1)

m.predict(data)

array([10.0005727])

You can visualise how RMSE improves as more and more trees are added

plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(40)]);

The validation error is worse than the training error. Why?

we might be overfitting
the held-out validation period may have been different from the training data somehow

Out of Bag Error (OOB error)

How can we check? Use the model's out-of-bag (OOB) predictions and compute RMSE against the training targets.

r_mse(m.oob_prediction_, y)

0.211059

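What is happening here?

Each tree is trained on a random subset of the rows, so for every row there are some trees that never saw it. The OOB prediction for a row averages only those trees, which makes the OOB error behave like a validation error without needing a separate validation set. Comparing it with the training and validation errors gives you a sense of how much you are overfitting. Here the OOB error (0.211) is lower than the validation error (0.234), which suggests that part of the gap comes from the validation data differing from the training data rather than from overfitting alone.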
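To see why there are always out-of-bag rows, here is a small simulation of how many rows a single tree never sees when its training rows are drawn with replacement. The function oob_fraction and the row counts are illustrative assumptions, not part of scikit-learn.

```python
import random

def oob_fraction(n_rows, n_draws, seed=0):
    """Fraction of rows never picked when drawing n_draws rows with replacement."""
    rng = random.Random(seed)
    picked = {rng.randrange(n_rows) for _ in range(n_draws)}
    return 1 - len(picked) / n_rows

# Classic bootstrap: draw as many rows as the dataset has
full_bootstrap = oob_fraction(10_000, 10_000)
# Drawing only half as many rows, similar in spirit to setting max_samples
half_sample = oob_fraction(10_000, 5_000)

print(round(full_bootstrap, 3), round(half_sample, 3))
```

Roughly a third of the rows are left out by a classic bootstrap, and more when fewer rows are drawn. With max_samples=200_000 on about 400,000 training rows, as in rf above, each tree should likewise leave out well over half of the rows, so every row has plenty of trees that can score it out of bag.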

Model interpretation

source

For tabular data, model interpretation is particularly important. For a given model, the things we are most likely to be interested in are:

How confident are we in our predictions using a particular row of data?
For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
Which columns are the strongest predictors, which can we ignore?
Which columns are effectively redundant with each other, for purposes of prediction?
How do predictions vary, as we vary these columns?

preds_std = preds.std(0)
preds_std

array([0.24516726, 0.17149769, 0.12129906, ..., 0.1911843 , 0.16064838,
       0.16064838])
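The standard deviation across trees addresses the first question in the list: when the trees disagree about a row, the forest is less sure about it. A minimal sketch with a made-up 4 x 3 prediction matrix (4 trees, 3 auctions) in place of the real preds array:

```python
import numpy as np

# Hypothetical predictions: one row per tree, one column per auction
toy_preds = np.array([[10.1,  9.9, 9.4],
                      [10.2,  9.8, 9.6],
                      [10.0, 10.1, 9.3],
                      [10.1,  9.5, 9.5]])

mean_pred = toy_preds.mean(axis=0)   # the forest's prediction per auction
std_pred = toy_preds.std(axis=0)     # how much the trees disagree per auction

# Index of the auction the trees disagree on most
least_certain = int(std_pred.argmax())

print(mean_pred, std_pred, least_certain)
```

In a production setting, rows with a high standard deviation are the ones worth a closer look before acting on the prediction.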

Feature Importance

def rf_feat_importance(m, df):
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                        ).sort_values('imp', ascending=False)

fi = rf_feat_importance(m, xs)

fi[:10]

	cols	imp
58	YearMade	0.175870
6	ProductSize	0.118317
30	Coupler_System	0.099115
7	fiProductClassDesc	0.071253
31	Grouser_Tracks	0.064081
55	ModelID	0.061831
50	saleElapsed	0.051997
32	Hydraulics_Flow	0.042906
3	fiSecondaryDesc	0.039384
1	fiModelDesc	0.031282

def plot_fi(fi):
    return fi.plot.barh('cols', 'imp', figsize=(12,8), legend=False)

plot_fi(fi[:30]);

Removing low-importance variables

to_keep = fi[fi.imp>0.005].cols
len(to_keep)

retrain model using only this subset of columns

xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]

m = rf(xs_imp, y)

m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)

(0.180874, 0.231109)

cluster_columns??

cluster_columns(xs_imp)

seems like we could remove some of the clustered columns.

Let's calculate a baseline using a sample of data

def get_oob(df):
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
        max_samples=50000, max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(df, y)
    return m.oob_score_

get_oob(xs_imp)

0.8769414512037411

now remove redundant columns one at a time

{c:get_oob(xs_imp.drop(c, axis=1)) for c in (
    'saleYear', 'saleElapsed', 'ProductGroupDesc','ProductGroup',
    'fiModelDesc', 'fiBaseModel',
    'Hydraulics_Flow','Grouser_Tracks', 'Coupler_System')}

{'saleYear': 0.8754647378064608,
 'saleElapsed': 0.8723764211506366,
 'ProductGroupDesc': 0.8765810711892142,
 'ProductGroup': 0.8773280451665235,
 'fiModelDesc': 0.8758816619476205,
 'fiBaseModel': 0.8756967579434771,
 'Hydraulics_Flow': 0.8769591574648032,
 'Grouser_Tracks': 0.876871969234521,
 'Coupler_System': 0.8762546455208067}

not much change here so try dropping multiple variables

to_drop = ['saleYear', 'ProductGroupDesc', 'fiBaseModel', 'Grouser_Tracks']
get_oob(xs_imp.drop(to_drop, axis=1))

0.875680776530026

xs_final = xs_imp.drop(to_drop, axis=1)
valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)

check rmse again to confirm accuracy hasn't really changed

m = rf(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)

(0.182544, 0.23238)

similar accuracy but fewer features!

Partial Dependence

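What is the relationship between an important variable and the dependent variable? Start by looking at how the values of ProductSize are distributed.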

p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)),c);

largest group is actually #na#

do the same for yearmade

ax = valid_xs_final['YearMade'].hist()

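# plot_partial_dependence was removed in scikit-learn 1.2;
# PartialDependenceDisplay.from_estimator (scikit-learn >= 1.0) replaces it
from sklearn.inspection import PartialDependenceDisplay

fig,ax = plt.subplots(figsize=(12, 4))
PartialDependenceDisplay.from_estimator(m, valid_xs_final, ['YearMade','ProductSize'],
                                        grid_resolution=20, ax=ax);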

What is partial dependence telling us?

We want to look at how year affects the sale price, that is, all else being equal, what effect does YearMade have on sale price?
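"All else being equal" can be computed by hand: overwrite the column of interest with one value for every row, predict, average, and repeat for the next value. The sketch below uses a made-up formula, toy_model, as a stand-in for m.predict, and a four-row toy dataset; both are assumptions for illustration.

```python
import numpy as np

def toy_model(X):
    """Stand-in for m.predict: log price rises with year and falls with size code."""
    year, size = X[:, 0], X[:, 1]
    return 0.05 * (year - 1990) - 0.2 * size + 9.0

# Hypothetical rows: [YearMade, ProductSize code]
X = np.array([[1995, 1], [2000, 3], [2005, 2], [2010, 1]], dtype=float)

def partial_dependence(predict, X, col, grid):
    """Average prediction when column `col` is forced to each value in `grid`."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, col] = value          # pretend every machine was made in this year
        averages.append(predict(X_mod).mean())
    return np.array(averages)

pdp_year = partial_dependence(toy_model, X, col=0, grid=[1995, 2000, 2005])
print(pdp_year)
```

Because every other column keeps its real values, the change in the averages isolates the effect of the one column being varied.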

Tree Interpreter

#!pip install treeinterpreter
#!pip install waterfallcharts

import warnings
warnings.simplefilter('ignore', FutureWarning)

from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall

row = valid_xs_final.iloc[:5]

prediction,bias,contributions = treeinterpreter.predict(m, row.values)

prediction[0], bias[0], contributions[0].sum()

(array([10.03875756]), 10.104200155980113, -0.06544259554720655)
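The three numbers above fit together: the prediction is the bias (the average of the dependent variable at the root of the trees) plus the sum of the per-feature contributions. A quick arithmetic check using the values printed above:

```python
# Values copied from the treeinterpreter output above
bias = 10.104200155980113
contributions_sum = -0.06544259554720655
prediction = 10.03875756   # as displayed, rounded to 8 decimal places

# prediction = bias + sum of contributions (up to display rounding)
reconstructed = bias + contributions_sum
print(reconstructed)
```

The waterfall chart that follows simply draws this sum one feature at a time.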

waterfall(valid_xs_final.columns, contributions[0], threshold=0.08, 
          rotation_value=45,formatting='{:,.3f}');

The extrapolation problem

np.random.seed(42)

x_lin = torch.linspace(0,20, steps=40)
y_lin = x_lin + torch.randn_like(x_lin)
plt.scatter(x_lin, y_lin);

xs_lin = x_lin.unsqueeze(1)
x_lin.shape, xs_lin.shape

(torch.Size([40]), torch.Size([40, 1]))

you can do the same using None

x_lin[:,None].shape

torch.Size([40, 1])

m_lin = RandomForestRegressor().fit(x_lin[:30].reshape(-1, 1), y_lin[:30])

plt.scatter(x_lin, y_lin, 20)
plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5);

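A random forest cannot extrapolate outside the bounds of the training data: each prediction is an average of training targets stored in the leaves, so it can never exceed the largest value seen in training.

We need to make sure the validation set does not contain out-of-domain data.

The test and training sets may differ, so how do we tell?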

Finding out of domain data

df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))

m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:6]

	cols	imp
5	saleElapsed	0.915808
10	SalesID	0.069088
13	MachineID	0.011871
0	YearMade	0.000678
4	ModelID	0.000536
12	Hydraulics	0.000529
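The idea behind this check is to train a model to predict whether a row belongs to the validation set; columns that make this easy are the ones whose distribution differs between the two sets. A sketch on synthetic data, where the column names elapsed and noise and all values are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_train, n_valid = 400, 100

# 'elapsed' keeps growing over time, so validation rows have larger values;
# 'noise' has the same distribution in both sets
train = np.column_stack([rng.uniform(0, 80, n_train), rng.normal(size=n_train)])
valid = np.column_stack([rng.uniform(80, 100, n_valid), rng.normal(size=n_valid)])

X = np.vstack([train, valid])
is_valid = np.array([0] * n_train + [1] * n_valid)

# Predict set membership instead of price
m_dom = RandomForestRegressor(n_estimators=40, min_samples_leaf=5,
                              random_state=0).fit(X, is_valid)
importance = dict(zip(["elapsed", "noise"], m_dom.feature_importances_))
print(importance)
```

Here almost all the importance falls on the time-like column, just as saleElapsed and SalesID dominate in the table above.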

m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))

for c in ('SalesID','saleElapsed','MachineID'):
    m = rf(xs_final.drop(c,axis=1), y)
    print(c, m_rmse(m, valid_xs_final.drop(c,axis=1), valid_y))

orig 0.233484
SalesID 0.231357
saleElapsed 0.236643
MachineID 0.231104

time_vars = ['SalesID','MachineID']
xs_final_time = xs_final.drop(time_vars, axis=1)
valid_xs_time = valid_xs_final.drop(time_vars, axis=1)

m = rf(xs_final_time, y)
m_rmse(m, valid_xs_time, valid_y)

0.229127

xs['saleYear'].hist();

filt = xs['saleYear']>2004
xs_filt = xs_final_time[filt]
y_filt = y[filt]

m = rf(xs_filt, y_filt)
m_rmse(m, xs_filt, y_filt), m_rmse(m, valid_xs_time, valid_y)

(0.176904, 0.22864)

Using a Neural Net

# load data
df_nn = pd.read_csv(dest/'TrainAndValid.csv', low_memory=False)

# set ProductSize as categorical
df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')
df_nn['ProductSize'] = df_nn['ProductSize'].cat.set_categories(sizes, ordered=True)

# take log of dependent variable
df_nn[dep_var] = np.log(df_nn[dep_var])

# do some date prep
df_nn = add_datepart(df_nn, 'saledate')

df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]

cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)

cont_nn.append('saleElapsed')
cat_nn.remove('saleElapsed')

df_nn['saleElapsed'] = df_nn['saleElapsed'].astype(int)

df_nn_final[cat_nn].nunique()

YearMade                73
ProductSize              6
Coupler_System           2
fiProductClassDesc      74
ModelID               5281
Hydraulics_Flow          3
fiSecondaryDesc        177
fiModelDesc           5059
Enclosure                6
ProductGroup             6
Hydraulics              12
fiModelDescriptor      140
Drive_System             4
dtype: int64

xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)

(0.178922, 0.230357)

cat_nn.remove('fiModelDescriptor')

df_nn_final['saleElapsed'].astype('int64', copy=False)

df_nn_final.dtypes

YearMade                 int64
ProductSize           category
Coupler_System          object
fiProductClassDesc      object
ModelID                  int64
saleElapsed              int64
Hydraulics_Flow         object
fiSecondaryDesc         object
fiModelDesc             object
Enclosure               object
ProductGroup            object
Hydraulics              object
fiModelDescriptor       object
Drive_System            object
SalePrice              float64
dtype: object

Normalize

Normalize subtracts the mean, then divides by the standard deviation. We didn't need this for a decision tree because splits depend only on the ordering of values, not their scale. However, we do need to normalize for neural nets, because inputs on very different scales or with extreme distributions make training harder.
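A minimal sketch of that calculation, using the five MachineHoursCurrentMeter values from df.head() as the training rows and two made-up validation values. Reusing the training statistics for the validation rows is the usual practice and is assumed here.

```python
import numpy as np

# Training values taken from the first five rows shown earlier
train_hours = np.array([68.0, 4640.0, 2838.0, 3486.0, 722.0])
# Hypothetical validation values
valid_hours = np.array([1500.0, 300.0])

# Statistics come from the training rows only
mean, std = train_hours.mean(), train_hours.std()

train_norm = (train_hours - mean) / std
valid_norm = (valid_hours - mean) / std   # same mean and std, not recomputed

print(train_norm.round(3), valid_norm.round(3))
```

After the transform the training column has mean 0 and standard deviation 1, so no single input dominates the first layer just because of its units.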

procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df=df_nn_final, 
                      procs=procs_nn, 
                      cat_names=cat_nn, 
                      cont_names=cont_nn,
                      splits=splits, 
                      y_names=dep_var)

dls = to_nn.dataloaders(1024)

This is a regression model so we want to set our y_range based on the min and max of the dependent variable.

y = to_nn.train.y
y.min(),y.max()

(8.465899, 11.863583)

learn = tabular_learner(dls, y_range=(8,12), layers=[500,250],
                        n_out=1, loss_func=F.mse_loss)
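The y_range setting squashes the network's raw output into the given interval with a scaled sigmoid. A plain-Python sketch of that mapping (the function below is written for this note and works on single floats rather than tensors):

```python
import math

def sigmoid_range(x, low, high):
    """Map any real number into the open interval (low, high) via a sigmoid."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low

# With y_range=(8, 12), a raw output of 0 lands in the middle of the range
middle = sigmoid_range(0.0, 8, 12)
# Very large or very small raw outputs approach, but never reach, the bounds
near_top = sigmoid_range(20.0, 8, 12)
near_bottom = sigmoid_range(-20.0, 8, 12)

print(middle, near_top, near_bottom)
```

This is why the range (8, 12) is chosen slightly wider than the observed minimum and maximum of the target: the sigmoid only approaches its bounds, so the extremes would otherwise be hard to reach.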

learn.lr_find()

SuggestedLRs(lr_min=0.003981071710586548, lr_steep=0.00019054606673307717)

learn.fit_one_cycle(5, 1e-2)

epoch	train_loss	valid_loss	time
0	0.069223	0.062953	00:11
1	0.056285	0.055872	00:13
2	0.048484	0.055052	00:12
3	0.043525	0.051425	00:12
4	0.040454	0.051055	00:11

preds,targs = learn.get_preds()
r_mse(preds,targs)

0.225954

Summary

Random Forests are easy to train, resilient to hyperparameter choices, don't require much pre-processing, train quickly and are hard to overfit given enough trees. They can be less accurate than a neural net, especially where extrapolation is needed, and can take longer at inference time to evaluate the trees.

Neural Nets are probably the fiddliest models to implement and set up but can give slightly better results.