Lesson 7: Tabular Data
Pt 1
Decision Tree Ensembles
Some examples include:
- Random Forest (Regressor / Classifier)
- Gradient Boost (Regressor / Classifier)
- XGBoost (Regressor / Classifier)
We will use the Scikit-Learn library for this task rather than PyTorch.
The Data
- Blue Book for Bulldozers Kaggle Competition
- The goal is to predict the sale price of heavy equipment at auction based on its usage, type and configuration.
Kaggle setup help
There are a number of different ways to do this. I had some trouble with it, so after some reading I tried the approach below instead.
fastai helper functions
- Had some trouble importing fastai's helper functions, found that someone has added them all here
from pathlib import Path
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
# helper functions
from fastbook import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor,export_graphviz
from dtreeviz.trees import *
import IPython
from IPython.display import Image, display_svg, SVG
import os
Load credentials from json file. This is a simple file with the following information...
{"username":"xxx","key":"xxx"}
import json
with open('creds.json') as f:
    creds = json.load(f)
os.environ['KAGGLE_USERNAME']=creds["username"]
os.environ['KAGGLE_KEY']=creds["key"]
from kaggle import api
api.competition_download_cli('bluebook-for-bulldozers')
p = Path.cwd()
for i in p.iterdir():
    print(i)
fname = p/'bluebook-for-bulldozers.zip'
dest = p/'storage/data/bluebook'
# dest.mkdir()
# only run once!
#file_extract(fname, dest)
The Data
each row of the dataset represents the sale of a single machine at an auction
df = pd.read_csv(dest/'TrainAndValid.csv', low_memory=False)
df.head()
df.columns
Transforming data
- convert to categorical, e.g. ProductSize
- set order
df['ProductSize'].unique()
sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
# set_categories no longer accepts inplace=True in recent pandas, so assign the result back
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
The dependent variable here is SalePrice, the variable we want to predict. Kaggle specifically tells us the metric to use: root mean squared log error (RMSLE). If we take the log of the dependent variable first, we can then use plain RMSE on the logged values.
dep_var = 'SalePrice'
df[dep_var] = np.log(df[dep_var])
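A small check of why taking the log lets us use RMSE: the RMSE of the logged values is the same number as a root mean squared log error on the raw prices. This is only a sketch with made-up prices. Some definitions of RMSLE use log(1 + y) rather than log(y); the plain log is assumed here to match the cell above.

```python
import math

def rmse(preds, targs):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targs)) / len(targs))

def rmsle(preds, targs):
    """Root mean squared log error, using the plain log (an assumption, see above)."""
    return math.sqrt(
        sum((math.log(p) - math.log(t)) ** 2 for p, t in zip(preds, targs)) / len(targs)
    )

# made-up sale prices, for illustration only
actual = [10_000.0, 25_000.0, 60_000.0]
predicted = [12_000.0, 20_000.0, 66_000.0]

log_actual = [math.log(v) for v in actual]
log_predicted = [math.log(v) for v in predicted]

# the two numbers agree, so once SalePrice is logged we can work with RMSE
print(rmsle(predicted, actual), rmse(log_predicted, log_actual))
```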
Decision Trees
- ask binary questions about data
- i.e. is x > y?
- the trouble is we don't know in advance which binary questions to ask; the machine learning part is working out which questions (which column, which threshold) best separate the data.
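To make "deciding which question to ask" concrete, here is a minimal sketch of a split search on a single column. It tries every midpoint between neighbouring values and keeps the one with the lowest total squared error on the two sides. The function name best_split and the data are made up for illustration, and squared error is just one common criterion, not necessarily the exact one any particular library uses.

```python
def best_split(xs, ys):
    """Return (threshold, score) for the question "is x <= threshold?" that
    minimises the summed squared error of the two resulting groups."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    uniq = sorted(set(xs))
    best = (None, float("inf"))
    # candidate thresholds sit halfway between neighbouring distinct values
    for lo, hi in zip(uniq, uniq[1:]):
        thresh = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= thresh]
        right = [y for x, y in zip(xs, ys) if x > thresh]
        score = sse(left) + sse(right)
        if score < best[1]:
            best = (thresh, score)
    return best

# made-up data: log prices jump for machines made after 2000
year_made = [1990, 1995, 1998, 2002, 2005, 2008]
log_price = [9.0, 9.1, 9.0, 10.5, 10.6, 10.4]
print(best_split(year_made, log_price))
```

A full tree repeats this search over every column, picks the best question overall, and then does the same again inside each of the two groups.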
Handling Dates
df = add_datepart(df, 'saledate')
df_test = pd.read_csv(dest/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')
Using TabularPandas and TabularProc
we will use two tabular transforms to modify our data.
- Categorify
- replaces a column with numeric category
- FillMissing
- fills any missing data with the median
- also adds a boolean column where True will be set for any data point that was missing
procs = [Categorify, FillMissing]
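Roughly what these two transforms do to a column can be sketched in plain Python. These are stand-ins written for illustration, not fastai's implementation; the function names are hypothetical, and reserving code 0 for an "#na#" entry is an assumption based on what to.classes shows later in these notes.

```python
from statistics import median

def categorify(values):
    """Map each distinct value to an integer code, keeping 0 for missing values."""
    vocab = ["#na#"] + sorted({v for v in values if v is not None})
    lookup = {v: i for i, v in enumerate(vocab)}
    codes = [lookup[v] if v is not None else 0 for v in values]
    return codes, vocab

def fill_missing(values):
    """Replace missing numbers with the median and flag which rows were filled."""
    med = median(v for v in values if v is not None)
    filled = [med if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

# made-up columns, with None marking a missing entry
sizes_col = ["Small", None, "Large", "Small"]
hours_col = [100.0, None, 300.0, 500.0]
print(categorify(sizes_col))
print(fill_missing(hours_col))
```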
Validation Set
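hold aside the most recent sales as a validation set, since the competition's test data comes later in time than its training data
- do this with np.where; the condition below sends rows with saleYear >= 2011 and saleMonth >= 10 to the validation set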
cond = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0] # inverse condition
splits = (list(train_idx), list(valid_idx))
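The same rule can be traced by hand on a few made-up dates. This sketch uses plain lists instead of np.where, and the (year, month) pairs are invented for illustration.

```python
# made-up (saleYear, saleMonth) pairs standing in for df.saleYear and df.saleMonth
sales = [(2009, 11), (2010, 12), (2011, 3), (2011, 9), (2011, 10), (2011, 12), (2012, 2)]

# same rule as the cell above: train on rows before 2011, or before October
is_train = [(year < 2011) or (month < 10) for year, month in sales]

# row positions for each split, like np.where(cond)[0] and np.where(~cond)[0]
train_rows = [i for i, keep in enumerate(is_train) if keep]
valid_rows = [i for i, keep in enumerate(is_train) if not keep]
print(train_rows, valid_rows)
```

Note that the month test applies to every year, so a sale dated February 2012 would land in the training set under this condition.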
Here cont_cat_split is a helper function that returns the column names of the continuous and categorical variables in the given df.
# define continuous and categorical columns for TabularPandas
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
create a Tabular Object (to)
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
len(to.train), len(to.valid)
Look at the data with to.show. This will show the data in human-readable form. to.items.head will show the processed data in numerical form.
to.show(3)
to.items.head(3)
check the vocab with to.classes
to.classes['ProductSize']
# save the tabular object for later
# (dest/'to.pkl').save(to)
Decision Tree Regressor: for continuous variables
# had to copy from
# https://github.com/anandsaha/fastai.part1.v2/blob/master/fastai/structured.py
import re, graphviz  # imported explicitly; draw_tree needs both

def draw_tree(t, df, size=10, ratio=0.6, precision=0):
    """ Draws a representation of a decision tree in IPython.
    Parameters:
    -----------
    t: The tree you wish to draw
    df: The data used to train the tree. This is used to get the names of the features.
    """
    # export the fitted tree to Graphviz dot format
    s = export_graphviz(t, out_file=None, feature_names=df.columns, filled=True,
                        special_characters=True, rotate=True, precision=precision)
    # inject the size and ratio settings into the dot source before rendering
    IPython.display.display(graphviz.Source(re.sub('Tree {',
                            f'Tree {{ size={size}; ratio={ratio}', s)))
xs,y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(xs, y);
draw_tree(m, xs, precision=2);
samp_idx = np.random.permutation(len(y))[:500]
dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
orientation='LR')
Using Dtreeviz is a little more intuitive and provides a bit more information. For example, YearMade is showing that there are years equal to 1000, which is obviously an error in the data. Let's fix that.
xs.loc[xs['YearMade']<1900, 'YearMade'] = 1950
valid_xs.loc[valid_xs['YearMade']<1900, 'YearMade'] = 1950
m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs, y)
dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
orientation='LR')
# remove max_leaf_nodes
# build a bigger tree
m = DecisionTreeRegressor()
m.fit(xs, y)
def r_mse(preds,y): return round(math.sqrt(((preds-y)**2).mean()),6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
m_rmse(m, xs, y)
m_rmse(m, valid_xs, valid_y)
Our training error is 0 and the validation error is worse than what we saw with the original small tree. The reason is that there are almost as many leaves in our model as observations in our data set. To avoid this, we need to pick some stopping criterion, such as a threshold that tells the model not to split a node if that would leave fewer than x items in a leaf. Do this with min_samples_leaf.
m.get_n_leaves(), len(xs)
we have nearly as many leaf nodes as observations in our dataset
we need to create some rules here
m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
m.get_n_leaves()
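The effect of that stopping rule on the number of leaves can be seen in a toy one-column tree. This is a plain-Python sketch for illustration only (build_tree, sse and count_leaves are hypothetical helpers, and the data is made up); it is not how scikit-learn grows its trees internally.

```python
def sse(rows):
    """Sum of squared errors of the y values around their mean."""
    ys = [y for _, y in rows]
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(rows, min_samples_leaf=1):
    """Grow a regression tree on (x, y) pairs; returns nested dicts."""
    ys = [y for _, y in rows]
    leaf = {"value": sum(ys) / len(ys)}
    if len(set(ys)) == 1:
        return leaf  # nothing left to separate
    xs = sorted({x for x, _ in rows})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        thresh = (lo + hi) / 2
        left = [r for r in rows if r[0] <= thresh]
        right = [r for r in rows if r[0] > thresh]
        # stopping rule: skip any split that would leave a leaf too small
        if min(len(left), len(right)) < min_samples_leaf:
            continue
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, thresh, left, right)
    if best is None:
        return leaf
    _, thresh, left, right = best
    return {"thresh": thresh,
            "left": build_tree(left, min_samples_leaf),
            "right": build_tree(right, min_samples_leaf)}

def count_leaves(node):
    """Count the leaf nodes of a tree built by build_tree."""
    if "value" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

# 12 made-up rows with noisy targets
rows = [(i, (i * 7) % 5 + i * 0.1) for i in range(12)]
print(count_leaves(build_tree(rows)), count_leaves(build_tree(rows, min_samples_leaf=4)))
```

Without a limit the toy tree ends up with one leaf per row, which is the same memorising behaviour seen above; with min_samples_leaf=4 it can have at most three leaves.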
Categorical Variables
Unlike with collab filtering, we do not need to create dummy variables with categorical values because through pre-processing, we have already transformed these into numerical values.
However, you can one-hot encode if you like.
Bagging
A technique developed by professor Leo Breiman. The idea is that you can bootstrap subsets of your data, train your model, store the predictions, then average the predictions.
Steps
1. randomly choose a subset
2. train a model on the subset
3. save the model, return to step one
4. make a prediction using all of the models, then take the average of each model's prediction
Breiman later extended the idea by selecting not only a random subset of rows for each tree, but also a random subset of columns at each split. This is known as a Random Forest.
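The four bagging steps above can be sketched in a few lines. To keep it short, a deliberately weak one-split "stump" stands in for a full decision tree, rows are drawn with replacement (the usual bootstrap), and all names and data here are made up for illustration.

```python
import random

def mean(vals):
    return sum(vals) / len(vals)

def fit_stump(rows):
    """A weak model: split at the median x, predict the mean y of each side."""
    xs = sorted(x for x, _ in rows)
    thresh = xs[len(xs) // 2]
    left = [y for x, y in rows if x < thresh]
    right = [y for x, y in rows if x >= thresh]
    right_mean = mean(right)
    left_mean = mean(left) if left else right_mean  # left is empty if the median is the minimum
    return lambda x: left_mean if x < thresh else right_mean

def bagged_fit(rows, n_models=20, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        subset = [rng.choice(rows) for _ in range(len(rows))]  # step 1: random subset
        models.append(fit_stump(subset))                       # steps 2 and 3: train and save
    return models

def bagged_predict(models, x):
    return mean([m(x) for m in models])                        # step 4: average the predictions

# made-up data where y grows with x
rows = [(x, 2.0 * x) for x in range(10)]
models = bagged_fit(rows)
print(bagged_predict(models, 0), bagged_predict(models, 9))
```

Each stump on its own only ever outputs two values, but because every stump saw a different sample, the averaged prediction is smoother than any single one.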
The function rf below uses the following arguments
- n_estimators defines the number of trees
- max_samples defines how many rows to sample for training each tree
- max_features defines how many columns to sample at each split point (0.5 means "take half the total number of columns").
- min_samples_leaf specifies when to stop splitting the tree nodes
- effectively limiting the depth of the tree
- n_jobs=-1 uses all CPUs to build the trees in parallel.
def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)
m = rf(xs, y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
That is much better!
Making Predictions
To understand the impact of n_estimators you can get predictions from each individual tree in the forest
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
preds
This represents every prediction for each and every tree for every row of data.
r_mse(preds.mean(axis=0), valid_y)
Here is how to make a single prediction, I think (this still needs checking).
# slice a row of data
valid_xs.iloc[0]
# predict requires a 2D array, reshape your data
data = valid_xs.iloc[0].values.reshape(1,-1)
m.predict(data)
You can visualise how RMSE improves as more and more trees are added
plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(40)]);
The validation error is worse than the training error. Why?
- we might be overfitting
- the most recent auction data, which makes up the validation set, may differ somehow from the earlier data
Out of Bag Error (OOB error)
How can we check which it is? We can use the out-of-bag (OOB) predictions from the model and compute RMSE on them.
r_mse(m.oob_prediction_, y)
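What is happening here?
Each tree is trained on only a sample of the rows, so every row can be predicted using just the trees that never saw it. Because those trees did not train on that row, the OOB error behaves like a validation error without needing a separate validation set, and comparing it with the training error gives you a sense of how much you are overfitting.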
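The bookkeeping behind an out-of-bag prediction can be sketched as follows. To keep it short, each "model" just predicts the mean of its bootstrap sample, which is a stand-in for a real tree; the function name and the data are made up for illustration.

```python
import math
import random

def oob_predictions(ys, n_models=50, seed=0):
    """For each row, average the predictions of only those models whose
    bootstrap sample did not contain that row."""
    rng = random.Random(seed)
    n = len(ys)
    votes = [[] for _ in range(n)]
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap row indices
        model_pred = sum(ys[i] for i in sample) / n     # "train" the stand-in model
        in_bag = set(sample)
        for i in range(n):
            if i not in in_bag:                         # this model never saw row i
                votes[i].append(model_pred)
    return [sum(v) / len(v) if v else None for v in votes]

# made-up targets
ys = [float(i) for i in range(20)]
oob = oob_predictions(ys)

# RMSE over the rows that received at least one out-of-bag vote
pairs = [(p, t) for p, t in zip(oob, ys) if p is not None]
oob_rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))
print(oob_rmse)
```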
Model interpretation
For tabular data, model interpretation is particularly important. For a given model, the things we are most likely to be interested in are:
- How confident are we in our predictions using a particular row of data?
- For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
- Which columns are the strongest predictors, which can we ignore?
- Which columns are effectively redundant with each other, for purposes of prediction?
- How do predictions vary, as we vary these columns?
preds_std = preds.std(0)
preds_std
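As a tiny illustration of what preds.std(0) measures, here is the same calculation on made-up numbers: one row of predictions per tree, one column per auction row. A large spread between trees suggests less confidence in that row's prediction.

```python
from statistics import pstdev

# made-up predictions: 3 trees (rows) by 2 auction rows (columns)
tree_preds = [
    [10.0, 9.0],
    [10.1, 11.0],
    [9.9, 10.0],
]

# population standard deviation per column, matching NumPy's default for std
spread = [pstdev(col) for col in zip(*tree_preds)]
print(spread)  # the trees agree far more on the first row than on the second
```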
Feature Importance
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                       ).sort_values('imp', ascending=False)
fi = rf_feat_importance(m, xs)
fi[:10]
def plot_fi(fi):
    return fi.plot.barh('cols', 'imp', figsize=(12,8), legend=False)
plot_fi(fi[:30]);
Removing low-importance variables
to_keep = fi[fi.imp>0.005].cols
len(to_keep)
retrain model using only this subset of columns
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]
m = rf(xs_imp, y)
# the model was trained on the reduced columns, so validate on valid_xs_imp
m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)
cluster_columns??
cluster_columns(xs_imp)
seems like we could remove some of the clustered columns.
Let's calculate a baseline using a sample of data
def get_oob(df):
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
        max_samples=50000, max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(df, y)
    return m.oob_score_
get_oob(xs_imp)
now remove redundant columns one at a time
{c:get_oob(xs_imp.drop(c, axis=1)) for c in (
'saleYear', 'saleElapsed', 'ProductGroupDesc','ProductGroup',
'fiModelDesc', 'fiBaseModel',
'Hydraulics_Flow','Grouser_Tracks', 'Coupler_System')}
not much change here so try dropping multiple variables
to_drop = ['saleYear', 'ProductGroupDesc', 'fiBaseModel', 'Grouser_Tracks']
get_oob(xs_imp.drop(to_drop, axis=1))
xs_final = xs_imp.drop(to_drop, axis=1)
valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)
check rmse again to confirm accuracy hasn't really changed
m = rf(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)
similar accuracy but fewer features!
Partial Dependence
what is the relationship between variables
p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)),c);
the largest group is actually #na#
do the same for yearmade
ax = valid_xs_final['YearMade'].hist()
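# plot_partial_dependence was removed in scikit-learn 1.2; PartialDependenceDisplay replaces it
from sklearn.inspection import PartialDependenceDisplay
fig,ax = plt.subplots(figsize=(12, 4))
PartialDependenceDisplay.from_estimator(m, valid_xs_final, ['YearMade','ProductSize'],
                                        grid_resolution=20, ax=ax);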
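What is partial dependence telling us?
We want to look at how year affects the sale price: all else being equal, what effect does YearMade have on sale price? To find out, every row has its YearMade replaced with one value at a time, and the model's predictions are averaged for each value.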
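The "all else being equal" idea can be sketched directly. The function below is a hypothetical helper, and toy_predict is a made-up formula standing in for the random forest, so the numbers mean nothing beyond showing the mechanics.

```python
def partial_dependence(predict, rows, col, grid):
    """For each grid value, overwrite `col` in every row, predict, and average."""
    curve = []
    for value in grid:
        altered = [{**row, col: value} for row in rows]  # every other column is untouched
        preds = [predict(r) for r in altered]
        curve.append(sum(preds) / len(preds))
    return curve

# hypothetical model standing in for the random forest
def toy_predict(row):
    return 9.0 + 0.05 * (row["YearMade"] - 1990) + 0.2 * row["ProductSize"]

# made-up rows
rows = [
    {"YearMade": 1995, "ProductSize": 1},
    {"YearMade": 2005, "ProductSize": 3},
]
curve = partial_dependence(toy_predict, rows, "YearMade", [1990, 2000, 2010])
print(curve)
```

Because only YearMade is changed while the other columns keep their real values, the curve isolates the effect the model attributes to that one column.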
Tree Interpreter
#!pip install treeinterpreter
#!pip install waterfallcharts
import warnings
warnings.simplefilter('ignore', FutureWarning)
from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall
row = valid_xs_final.iloc[:5]
prediction,bias,contributions = treeinterpreter.predict(m, row.values)
prediction[0], bias[0], contributions[0].sum()
waterfall(valid_xs_final.columns, contributions[0], threshold=0.08,
rotation_value=45,formatting='{:,.3f}');
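The reason the prediction equals the bias plus the sum of the contributions can be seen on a single hand-written tree. This is a sketch of the general idea (crediting each change in node value to the column that was split on), not the treeinterpreter source; the tree, its values and the explain helper are made up.

```python
# a hand-written two-level tree; "value" is the mean target of the rows at each node
tree = {
    "value": 10.0, "col": "YearMade", "thresh": 2000,
    "left": {"value": 9.5},
    "right": {"value": 10.4, "col": "ProductSize", "thresh": 2,
              "left": {"value": 10.1}, "right": {"value": 10.9}},
}

def explain(node, row):
    """Walk from the root to a leaf, recording how each split moves the value."""
    bias = node["value"]  # the root value: the prediction before asking any question
    contributions = {}
    while "col" in node:
        col = node["col"]
        child = node["left"] if row[col] <= node["thresh"] else node["right"]
        contributions[col] = contributions.get(col, 0.0) + child["value"] - node["value"]
        node = child
    return node["value"], bias, contributions

prediction, bias, contributions = explain(tree, {"YearMade": 2005, "ProductSize": 3})
print(prediction, bias, contributions)
```

For a forest, the same quantities would be averaged over all of the trees.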
The extrapolation problem
np.random.seed(42)
torch.manual_seed(42)  # torch.randn_like draws from torch's generator, so seed it too
x_lin = torch.linspace(0,20, steps=40)
y_lin = x_lin + torch.randn_like(x_lin)
plt.scatter(x_lin, y_lin);
xs_lin = x_lin.unsqueeze(1)
x_lin.shape, xs_lin.shape
you can do the same using None
x_lin[:,None].shape
m_lin = RandomForestRegressor().fit(x_lin[:30].reshape(-1, 1), y_lin[:30])
plt.scatter(x_lin, y_lin, 20)
plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5);
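A random forest cannot extrapolate outside the bounds of the training data: each prediction is an average of training targets, so it can never be higher or lower than values it has already seen.
We need to make sure the validation set does not contain out-of-domain data.
The test and training sets may differ, so how do we tell?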
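A minimal sketch of why averaging-based models flatten out beyond the training range. Fixed-width bins stand in for the leaves of a tree here, which is a simplification chosen for brevity; fit_bins and the data are made up.

```python
def fit_bins(xs, ys, width=5.0):
    """Group training rows into fixed-width bins of x and store the mean y per bin,
    a crude stand-in for the leaves of a tree."""
    bins = {}
    for x, y in zip(xs, ys):
        bins.setdefault(int(x // width), []).append(y)
    means = {b: sum(v) / len(v) for b, v in bins.items()}
    lo, hi = min(means), max(means)
    # x values outside the training range fall into the nearest end bin
    return lambda x: means[min(max(int(x // width), lo), hi)]

# train on x from 0 to 14 with a perfectly linear target y = x
train_x = [float(i) for i in range(15)]
train_y = list(train_x)
predict = fit_bins(train_x, train_y)

# inside the range the prediction tracks the data; far outside it stays flat
print(predict(12.0), predict(30.0))
```

The true value at x = 30 would be 30, but the model can only return the average of the last bin it saw during training.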
Finding out of domain data
df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))
m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:6]
m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))
for c in ('SalesID','saleElapsed','MachineID'):
    m = rf(xs_final.drop(c,axis=1), y)
    print(c, m_rmse(m, valid_xs_final.drop(c,axis=1), valid_y))
time_vars = ['SalesID','MachineID']
xs_final_time = xs_final.drop(time_vars, axis=1)
valid_xs_time = valid_xs_final.drop(time_vars, axis=1)
m = rf(xs_final_time, y)
m_rmse(m, valid_xs_time, valid_y)
xs['saleYear'].hist();
filt = xs['saleYear']>2004
xs_filt = xs_final_time[filt]
y_filt = y[filt]
m = rf(xs_filt, y_filt)
m_rmse(m, xs_filt, y_filt), m_rmse(m, valid_xs_time, valid_y)
Using a Neural Net
# load data
df_nn = pd.read_csv(dest/'TrainAndValid.csv', low_memory=False)
# set ProductSize as categorical
df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')
# assign the result back; inplace=True is no longer accepted in recent pandas
df_nn['ProductSize'] = df_nn['ProductSize'].cat.set_categories(sizes, ordered=True)
# take log of dependent variable
df_nn[dep_var] = np.log(df_nn[dep_var])
# do some date prep
df_nn = add_datepart(df_nn, 'saledate')
df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
cont_nn.append('saleElapsed')
cat_nn.remove('saleElapsed')
df_nn['saleElapsed'] = df_nn['saleElapsed'].astype(int)
df_nn_final[cat_nn].nunique()
xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
cat_nn.remove('fiModelDescriptor')
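# astype returns a new Series, so assign it back for the conversion to take effect
df_nn_final['saleElapsed'] = df_nn_final['saleElapsed'].astype('int64')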
df_nn_final.dtypes
Normalize
Normalize subtracts the mean, then divides by the standard deviation. We didn't need this for a decision tree because a tree only compares values against split thresholds, so the scale of a column doesn't matter. However, we do need to normalize for neural nets, which train poorly when inputs sit on very different scales.
procs_nn = [Categorify, FillMissing, Normalize]
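The arithmetic behind normalizing a column is short enough to write out. This is a plain-Python sketch with made-up values; using the population standard deviation is an assumption, and fastai's own Normalize may differ in small details.

```python
from statistics import mean, pstdev

def normalize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# made-up column with large raw values
machine_hours = [1000.0, 2000.0, 3000.0, 4000.0]
scaled = normalize(machine_hours)
print(scaled)  # centred on 0 with unit spread
```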
to_nn = TabularPandas(df=df_nn_final,
procs=procs_nn,
cat_names=cat_nn,
cont_names=cont_nn,
splits=splits,
y_names=dep_var)
dls = to_nn.dataloaders(1024)
This is a regression model, so we want to set our y_range based on the min and max of the dependent variable.
y = to_nn.train.y
y.min(),y.max()
learn = tabular_learner(dls, y_range=(8,12), layers=[500,250],
n_out=1, loss_func=F.mse_loss)
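As far as I understand it, y_range works by passing the network's final output through a sigmoid that is rescaled to the chosen range, so predictions can never fall outside it. Below is a plain-Python sketch of that squashing step; the function here is written out for illustration rather than imported from fastai.

```python
import math

def sigmoid_range(x, lo, hi):
    """Squash any real number into the open interval (lo, hi)."""
    return 1 / (1 + math.exp(-x)) * (hi - lo) + lo

# raw network outputs of any size end up between 8 and 12
for raw in (-5.0, 0.0, 5.0):
    print(raw, sigmoid_range(raw, 8, 12))
```

Picking a range slightly wider than the observed min and max of the log prices leaves the model room to reach the extreme values.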
learn.lr_find()
learn.fit_one_cycle(5, 1e-2)
preds,targs = learn.get_preds()
r_mse(preds,targs)
Summary
Random Forests are easy to train, resilient, don't require much pre-processing, train quickly and are fairly resistant to overfitting. They can be less accurate than a neural net and can take longer at inference time to evaluate the trees.
Neural Nets are probably the fiddliest models to implement and set up but can give slightly better results.