Explain XGBoost

Let’s create an XGBoost model (the easy way)

Setup

If {explore} is not installed, install it from CRAN (you need explore 1.2.0 or higher)

install.packages("explore") 
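You can check which version is installed:

packageVersion("explore")  # should be 1.2.0 or higher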

Then we simply load the package and create some data

library(explore)
data <- create_data_buy(obs = 1000)

Data preparation

Let’s take a look into the data

data |> describe()
# A tibble: 13 x 8
   variable        type     na na_pct unique    min      mean    max
   <chr>           <chr> <int>  <dbl>  <int>  <dbl>     <dbl>  <dbl>
 1 period          int       0      0      1 202012 202012    202012
 2 buy             int       0      0      2      0      0.51      1
 3 age             int       0      0     66     17     52.3      88
 4 city_ind        int       0      0      2      0      0.5       1
 5 female_ind      int       0      0      2      0      0.5       1
 6 fixedvoice_ind  int       0      0      2      0      0.11      1
 7 fixeddata_ind   int       0      0      1      1      1         1
 8 fixedtv_ind     int       0      0      2      0      0.4       1
 9 mobilevoice_ind int       0      0      2      0      0.63      1
10 mobiledata_prd  chr       0      0      3     NA     NA        NA
11 bbi_speed_ind   int       0      0      2      0      0.61      1
12 bbi_usg_gb      int       0      0     83      9    164.   100000
13 hh_single       int       0      0      2      0      0.37      1

So we have a dataset with 13 variables. buy is a binary variable, containing only 2 unique values: 0 and 1. We will use buy as the target.
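Before modeling, we might take a quick look at the target itself. explore() can plot a single variable (an optional step, not required for the rest of the walkthrough):

data |> explore(buy)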

XGBoost needs numeric variables only. Here we have one character variable (mobiledata_prd). We could try to convert it into a meaningful numeric variable, but in this example we will simply drop all variables that are not numeric or that have no variance.

data <- data |> 
  drop_var_no_variance() |> 
  drop_var_not_numeric()

data |> describe()
# A tibble: 10 x 8
   variable        type     na na_pct unique   min   mean    max
   <chr>           <chr> <int>  <dbl>  <int> <dbl>  <dbl>  <dbl>
 1 buy             int       0      0      2     0   0.51      1
 2 age             int       0      0     66    17  52.3      88
 3 city_ind        int       0      0      2     0   0.5       1
 4 female_ind      int       0      0      2     0   0.5       1
 5 fixedvoice_ind  int       0      0      2     0   0.11      1
 6 fixedtv_ind     int       0      0      2     0   0.4       1
 7 mobilevoice_ind int       0      0      2     0   0.63      1
 8 bbi_speed_ind   int       0      0      2     0   0.61      1
 9 bbi_usg_gb      int       0      0     83     9 164.   100000
10 hh_single       int       0      0      2     0   0.37      1

Now we have only 10 variables. period and fixeddata_ind were dropped because they contain only a constant value (no variance). mobiledata_prd was dropped because it is not numeric.
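If mobiledata_prd were expected to carry useful signal, an alternative (just a base-R sketch, not part of the package workflow shown here) would be to one-hot encode it before dropping the non-numeric columns:

# sketch only: keep mobiledata_prd by one-hot encoding it (base R),
# applied to the raw data before dropping non-numeric variables
raw     <- create_data_buy(obs = 1000)
dummies <- model.matrix(~ mobiledata_prd - 1, data = raw)  # one 0/1 column per category
raw_num <- cbind(raw[, names(raw) != "mobiledata_prd"], dummies)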

Create XGBoost

We are ready to create an XGBoost model. It is as simple as this:

data |> explain_xgboost(target = buy)

We get the feature importance of an XGBoost model explaining the target buy:

[plot: feature importance]

The most important variables are age, bbi_usg_gb, female_ind and fixedtv_ind.

Let’s take a closer look at their relationship with the target variable buy:

library(dplyr)  # for select
data |> 
  select(buy, age, bbi_usg_gb, female_ind, fixedtv_ind) |> 
  explore_all(target = buy)

[plot: explore_all]

So with the help of explain_xgboost() we can understand which variables explain the target buy.

More XGBoost

explain_xgboost() can return much more:

data |> explain_xgboost(target = buy, out = "all") -> model

Now you get more outputs. model$plot is the feature-importance plot. You also get the data behind it:

model$importance
          variable        gain       cover  frequency  importance
1:             age 0.459501045 0.338333975 0.29666667 0.459501045
2:      bbi_usg_gb 0.259520698 0.304656256 0.31333333 0.259520698
3:      female_ind 0.160985735 0.150706363 0.13333333 0.160985735
4:     fixedtv_ind 0.076391986 0.106877646 0.11333333 0.076391986
5:   bbi_speed_ind 0.017973412 0.028905647 0.03666667 0.017973412
6:        city_ind 0.017914818 0.050673036 0.05000000 0.017914818
7:       hh_single 0.004823200 0.013086040 0.03000000 0.004823200
8: mobilevoice_ind 0.001655567 0.003425110 0.01333333 0.001655567
9:  fixedvoice_ind 0.001233540 0.003335926 0.01333333 0.001233540
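Since model$importance is a plain table, you can also build your own chart from it; a minimal ggplot2 sketch (ggplot2 is not loaded by the code above):

library(ggplot2)
model$importance |>
  ggplot(aes(x = reorder(variable, importance), y = importance)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "importance (gain)")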

And the results of the hyperparameter tuning:

model$tune_plot

[plot: hyperparameter tuning]

model$tune_data
   model_nr  eta max_depth runtime iter train_auc_mean test_auc_mean
1:        1 0.30         3  0 mins   19      0.9579619     0.9252692
2:        2 0.10         3  0 mins   45      0.9572738     0.9274616
3:        3 0.01         3  0 mins  513      0.9582980     0.9266134
4:        4 0.30         5  0 mins   15      0.9796771     0.9201005
5:        5 0.10         5  0 mins   28      0.9731001     0.9232089
6:        6 0.01         5  0 mins   36      0.9572939     0.9244891
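The table can also be used programmatically, for example to pick the run with the best test AUC (a small sketch using the columns shown above):

# run with the highest test AUC (here: model_nr 2)
model$tune_data[which.max(model$tune_data$test_auc_mean), ]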

Prediction

The XGBoost model itself is available for prediction, too.

test <- create_data_buy(obs = 100, seed = 999) |>   # new data
  predict_target(model = model$model) 
test |> describe()
# A tibble: 14 x 8
   variable        type     na na_pct unique       min      mean       max
   <chr>           <chr> <int>  <dbl>  <int>     <dbl>     <dbl>     <dbl>
 1 period          int       0      0      1 202012    202012    202012   
 2 buy             int       0      0      2      0         0.49      1   
 3 age             int       0      0     43     24        52.6      82   
 4 city_ind        int       0      0      2      0         0.49      1   
 5 female_ind      int       0      0      2      0         0.51      1   
 6 fixedvoice_ind  int       0      0      2      0         0.09      1   
 7 fixeddata_ind   int       0      0      1      1         1         1   
 8 fixedtv_ind     int       0      0      2      0         0.4       1   
 9 mobilevoice_ind int       0      0      2      0         0.6       1   
10 mobiledata_prd  chr       0      0      3     NA        NA        NA   
11 bbi_speed_ind   int       0      0      2      0         0.65      1   
12 bbi_usg_gb      int       0      0     54     11      1063.   100000   
13 hh_single       int       0      0      2      0         0.25      1   
14 prediction      dbl       0      0     96      0.02      0.52      0.98

Now we can check the prediction:

test |> explore(prediction, target = buy)

[plot: prediction]

Obviously, it worked very well!
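To put a number on it, you could compute an AUC from the predictions, for example with the pROC package (an extra dependency, not used anywhere else in this post):

library(pROC)
auc(response = test$buy, predictor = test$prediction)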

Behind the scenes

If you run explain_xgboost() with the parameter log = TRUE, you see what’s going on “behind the scenes”:

data |> explain_xgboost(target = buy, log = TRUE)
train xgboost nr 1
eta=0.3, max_depth=3, gamma=0, colsample_bytree=0.8, subsample=0.8, min_child_weight=1, scale_pos_weight=1
19 iterations, training time=0 min, auc=0.9253

train xgboost nr 2
eta=0.1, max_depth=3, gamma=0, colsample_bytree=0.8, subsample=0.8, min_child_weight=1, scale_pos_weight=1
45 iterations, training time=0 min, auc=0.9275

train xgboost nr 3
eta=0.01, max_depth=3, gamma=0, colsample_bytree=0.8, subsample=0.8, min_child_weight=1, scale_pos_weight=1
513 iterations, training time=0 min, auc=0.9266

train xgboost nr 4
eta=0.3, max_depth=5, gamma=0, colsample_bytree=0.8, subsample=0.8, min_child_weight=1, scale_pos_weight=1
15 iterations, training time=0 min, auc=0.9201

train xgboost nr 5
eta=0.1, max_depth=5, gamma=0, colsample_bytree=0.8, subsample=0.8, min_child_weight=1, scale_pos_weight=1
28 iterations, training time=0 min, auc=0.9232

train xgboost nr 6
eta=0.01, max_depth=5, gamma=0, colsample_bytree=0.8, subsample=0.8, min_child_weight=1, scale_pos_weight=1
36 iterations, training time=0 min, auc=0.9245

best model found in cross-validation: xgboost nr 2: 
eta=0.1, max_depth=3, gamma=0, colsample_bytree=0.8, subsample=0.8, min_child_weight=1, scale_pos_weight=1
nrounds=45, test_auc=0.9275

train final model...
done, training time=0 min

Need for speed? Use the nthread parameter in xgboost()
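A sketch of how that could look, assuming your explore version forwards an nthread argument to xgboost (check ?explain_xgboost before relying on it):

data |> explain_xgboost(target = buy, nthread = 4)  # assumption: nthread is passed through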

Written on March 7, 2024