Tidymodels: Decision Tree Learning in R

Cianna Bedford-Petersen, Christopher Loan & Brendan Cullen

Intro

Tidyverse’s newest release has recently come together to form a cohesive suite of packages for modeling and machine learning, called {tidymodels}. The successor to Max Kuhn’s {caret} package, {tidymodels} allows for a tidy approach to your data from start to finish. We’re going to walk through the basics for getting off the ground with {tidymodels} and demonstrate its application to three different tree-based methods for predicting student test scores. For further information about the package, you can visit https://www.tidymodels.org/.

Setup

Load both the {tidyverse} and {tidymodels} packages into your environment. We’ll also load in the {skimr} package to help us with some descriptives for our data and a host of other packages that will be required to run our machine learning models.

library(tidymodels) 
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees

Import the data

We’re using simulated data which approximates reading and math scores for ~189,000 3rd-8th grade students in Oregon public schools see this Kaggle page for details. For the purpose of demonstration, we’ll be sampling 1% of the data with sample_frac() to keep computer processing time manageable. All school IDs in the data are real, so we can use that information to link the data with other sources. Specifically, we’re also going to pull in some data on student enrollment in free and reduced lunch from the National Center for Education Statistics and some ethnicity data from the Oregon Department of Education.

set.seed(100)

# import data and perform initial cleaning
# initial cleaning steps include: 
# *recode NA's for lang_cd and ayp_lep to more meaningful values
# *remove vars with entirely missing data
# Note: the data is called 'train.csv', but we will actually further split this into its own training and testing data

dat <- read_csv(here::here("static", "data", "train.csv")) %>% 
  select(-classification) %>% # remove this variable because it's redundant with `score`
  mutate(lang_cd = ifelse(is.na(lang_cd), "E", lang_cd), 
         ayp_lep = ifelse(is.na(ayp_lep), "G", ayp_lep)) %>% 
  sample_frac(.01) %>% # sample 1% of the data to reduce run time
  janitor::remove_empty(c("rows", "cols")) %>% 
  drop_na() %>% 
  select_if(~length(unique(.x)) > 1)

# import fall membership report ethcnicity data and do some basic cleaning and renaming
sheets <- readxl::excel_sheets(here::here("static", "data", "fallmembershipreport_20192020.xlsx"))

ode_schools <- readxl::read_xlsx(here::here("static", "data", "fallmembershipreport_20192020.xlsx"),
                                 sheet = sheets[4])

ethnicities <- ode_schools %>%
  select(attnd_schl_inst_id = `Attending School ID`,
         attnd_dist_inst_id = `Attending District Institution ID`,
         sch_name = `School Name`,
         contains("%")) %>%
  janitor::clean_names()

names(ethnicities) <- gsub("x2019_20_percent", "p", names(ethnicities))

# join ethnicity data with original dataset
dat <- left_join(dat, ethnicities)

# import and tidy free and reduced lunch data 
frl <- rio::import("https://nces.ed.gov/ccd/Data/zip/ccd_sch_033_1718_l_1a_083118.zip",
              setclass = "tbl_df") %>% 
  janitor::clean_names()  %>% 
  filter(st == "OR")  %>%
  select(ncessch, lunch_program, student_count)  %>% 
  mutate(student_count = replace_na(student_count, 0))  %>% 
  pivot_wider(names_from = lunch_program,
            values_from = student_count)  %>% 
  janitor::clean_names()  %>% 
  mutate(ncessch = as.double(ncessch))

# import student counts for each school across grades
stu_counts <- rio::import("https://github.com/datalorax/ach-gap-variability/raw/master/data/achievement-gaps-geocoded.csv", setclass = "tbl_df")  %>% 
  filter(state == "OR" & year == 1718)  %>% 
  count(ncessch, wt = n)  %>% 
  mutate(ncessch = as.double(ncessch))

# join frl and stu_counts data
frl <- left_join(frl, stu_counts)

# add frl data to train data
dat <- left_join(dat, frl)

After loading in our three datasets, we’ll join them together to make one cohesive data set to use for modeling. After joining, the data contains both student-level variables (e.g. gender, ethnicity, enrollment in special education/talented and gifted programs, etc.) and district-level variables (e.g. school longitude and latitude, proportion of students who qualify for free and reduced-price lunch, etc.), all of which will be included for each 3 of our {tidymodels} tree-based examples.

For a more complete description of the variables, you can download the data dictionary here.

Explore the data

We’ll use the skim() function from {skimr} to take a closer look at our variables. Many numeric predictors are clearly non-normal (see histograms below), but this is no problem as tree-based methods are robust to non-normality.

dat %>% 
  select(-contains("id"), -ncessch, -missing, -not_applicable) %>%  # remove ID and irrelevant variables
  mutate(tst_dt = lubridate::as_date(lubridate::mdy_hms(tst_dt))) %>% # covert test date to date
  modify_if(is.character, as.factor) %>%  # convert character vars to factors
  skim() %>% 
  select(-starts_with("numeric.p")) # remove quartiles
Table 1: Data summary
Name Piped data
Number of rows 1857
Number of columns 41
_______________________
Column type frequency:
Date 1
factor 25
numeric 15
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
tst_dt 0 1 2018-03-16 2018-06-07 2018-05-18 47

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gndr 0 1 FALSE 2 M: 939, F: 918
ethnic_cd 0 1 FALSE 7 W: 1151, H: 458, M: 100, A: 79
tst_bnch 0 1 FALSE 6 G6: 343, 1B: 330, G4: 304, G7: 304
migrant_ed_fg 0 1 FALSE 2 N: 1793, Y: 64
ind_ed_fg 0 1 FALSE 2 N: 1842, Y: 15
sp_ed_fg 0 1 FALSE 2 N: 1614, Y: 243
tag_ed_fg 0 1 FALSE 2 N: 1759, Y: 98
econ_dsvntg 0 1 FALSE 2 Y: 1100, N: 757
ayp_lep 0 1 FALSE 10 G: 1471, F: 164, Y: 72, E: 58
stay_in_dist 0 1 FALSE 2 Y: 1811, N: 46
stay_in_schl 0 1 FALSE 2 Y: 1803, N: 54
dist_sped 0 1 FALSE 2 N: 1846, Y: 11
trgt_assist_fg 0 1 FALSE 3 N: 1773, Y: 83, y: 1
ayp_schl_partic 0 1 FALSE 2 Y: 1846, N: 11
ayp_dist_prfrm 0 1 FALSE 2 Y: 1803, N: 54
ayp_schl_prfrm 0 1 FALSE 2 Y: 1785, N: 72
rc_schl_partic 0 1 FALSE 2 Y: 1846, N: 11
rc_dist_prfrm 0 1 FALSE 2 Y: 1803, N: 54
rc_schl_prfrm 0 1 FALSE 2 Y: 1785, N: 72
lang_cd 0 1 FALSE 2 E: 1815, S: 42
tst_atmpt_fg 0 1 FALSE 2 Y: 1853, P: 4
grp_rpt_schl_partic 0 1 FALSE 2 Y: 1846, N: 11
grp_rpt_dist_prfrm 0 1 FALSE 2 Y: 1845, N: 12
grp_rpt_schl_prfrm 0 1 FALSE 2 Y: 1834, N: 23
sch_name 1 1 FALSE 699 Hig: 14, Jud: 14, Hou: 13, Fiv: 11

Variable type: numeric

skim_variable n_missing complete_rate mean sd hist
enrl_grd 0 1 5.44 1.69 ▇▃▅▃▃
score 0 1 2495.34 115.19 ▁▁▂▇▁
lat 0 1 44.79 0.99 ▂▁▂▅▇
lon 0 1 -122.51 1.16 ▅▇▁▁▁
p_american_indian_alaska_native 1 1 0.01 0.06 ▇▁▁▁▁
p_asian 1 1 0.04 0.07 ▇▁▁▁▁
p_native_hawaiian_pacific_islander 1 1 0.01 0.01 ▇▁▁▁▁
p_black_african_american 1 1 0.02 0.04 ▇▁▁▁▁
p_hispanic_latino 1 1 0.25 0.18 ▇▅▂▁▁
p_white 1 1 0.60 0.20 ▁▃▅▇▅
p_multiracial 1 1 0.06 0.03 ▇▆▁▁▁
free_lunch_qualified 0 1 231.23 147.55 ▇▇▃▁▁
reduced_price_lunch_qualified 0 1 39.86 24.77 ▆▇▃▁▁
no_category_codes 0 1 271.09 165.44 ▆▇▃▁▁
n 0 1 816.07 536.55 ▇▃▂▁▁

While most of our predictors are categorical, we can use {corrplot} to better visualize the relationships among the numeric variables.

dat %>% 
  select(-contains("id"), -ncessch, -missing, -not_applicable) %>% 
  select_if(is.numeric) %>% 
  select(score, everything()) %>% 
  cor(use = "pairwise.complete.obs") %>% 
  corrplot::corrplot()

Split data and resample

The first step of our analysis is to split our data into two separate sets: a “training” set and a “testing” set. The training set is used to train a model and, if desired, to adjust (i.e., “tune”) the model’s hyperparameters before evaluating its final performance on our test data. By allowing us to test a model on a new sample, we assess “out of sample” accuracy (i.e., unseen data-—what all predictive models are interested in) and limit overfitting to the training set. We can do this efficiently with the initial_split() function. This comes from the {rsample} package, which is part of the {tidymodels} package that we already loaded. Defaults put 75% of the data in the training set and 25% in the test set, but this can be adjusted with the prop argument. Then, we’ll extract the training data from our split object and assign it a name.

To further prevent over-fitting, we’ll resample our data using vfold_cv(). This function outputs k-fold cross-validated versions of our training data, where k = the number of times we resample (unsure why v- is used instead of k- here). By using k = 10 data sets, we get a better estimate of the model’s out-of-sample accuracy. On top of decreasing bias from over-fitting, this is essential when tuning hyperparameters (though we plan to apply defaults and not tune here, for brevity). Though our use of 10-fold cross validation is both frequently used and effective, it should be noted that other methods (e.g., bootstrap resampling) or other k-values are sometimes used to accomplish the same goal.

set.seed(100)

# split the data
split <- initial_split(dat)

# extract the training data
train <- training(split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train)

Pre-processing

Before we add in our data to the model, we’re going to set up an object that pre-processes our data. This is called a recipe. To create a recipe, you’ll first specify a formula for your model, indicating which variable is your outcome and which are your predictors. Using ~. here will indicate that we want to use all variables other than score as predictors. Then, we can specify a series of pre-processing steps for our data that directs our recipe to assign our variables a role or performs feature engineering steps. Pre-processing may be sound uncommon, but if you’ve ever used lm() (or several other R functions) you’ve done some of this by simply calling the function (e.g., automatic dummy-coding to handle categorical data). This is beneficial because it gives the analyst more control, despite adding complexity to the process.

A complete list of possible pre-processing steps can be found here: https://recipes.tidymodels.org/articles/Custom_Steps.html

rec <- recipe(score ~ ., train) %>% 
  step_mutate(tst_dt = as.numeric(lubridate::mdy_hms(tst_dt))) %>% # convert `test date` variable to a date 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% # declare ID variables
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
  step_unknown(all_nominal()) %>% # categorizes missing categorical data (NA's) as `unknown`
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

Create a model

The last step before bringing in our data is to specify our model. This will call upon functions from the {parsnip} package, which standardizes language for specifying a multitude of statistical models. There are a few core elements that you will need to specify for each model

The type of model

This indicates what type of model you choose to fit, each of which will be a different function. We’ll be focusing on decision tree methods using bag_tree(), random_forest(), and boost_tree(). A full list of models can be found here https://www.tidymodels.org/find/parsnip/

The engine

set_engine() calls the package to support the model you specified above.

The mode

set_mode() indicates the type of prediction you’d like to use in your model, you’ll choose between regression and classification. Since we are looking to predict student scores, which is a continuous predictor, we’ll be choosing regression.

The arguments

set_args() allows you to set values for various parameters for your model, each model type will have a specific set of parameters that can be altered. For these parameters, you can either set a particular value or you can use the tune function to search for the optimal value of each parameter. Tuning requires a few extra steps, so we will leave the default arguments for clarity. For more information on tuning check out https://tune.tidymodels.org/.

Create a workflow

Up to this point we’ve been setting up a lot of individual elements and now it is time to combine them to create a cohesive framework, called a workflow, so we can run our desired models. First, we’ll use the workflow() command and then we’ll pull in the recipe and model we already created. The next section shows three examples of specifying models and creating a workflow for different decision tree methods.

Model Examples

Bagged trees

A bagged tree approach creates multiple subsets of data from the training set which are randomly chosen with replacement. Each subset of data is used to train a given decision tree. In the end, we have an ensemble of different models. The predictions from all the different trees are averaged together, giving us a stronger prediction than one tree could independently.

Specify model

set.seed(100)
mod_bag <- bag_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart", times = 10) # 10 bootstrap resamples

Create workflow

wflow_bag <- workflow() %>% 
  add_recipe(rec) %>%
  add_model(mod_bag)

Fit the model

set.seed(100)
plan(multisession)

fit_bag <- fit_resamples(
  wflow_bag,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(verbose = TRUE,
                              save_pred = TRUE,
                              extract = function(x) extract_model(x)))

Visualize

The plot below shows the root nodes from a bagged tree made of 100 trees (10 folds x 10 bootstrapped resamples). Root nodes are the 1st node in a decision tree, and they are determined by which variable best optimizes a loss function (e.g., minimizes mean square error [MSE] for continuous outcomes or Gini Index for categorical outcomes). Put roughly, the most common root nodes can be thought of as the most “important” predictors.

Random forest

Random forest is similar to bagged tree methodology but goes one step further. In addition to taking random subsets of data, the model also draws a random selection of features. Instead of utilizing all features, the random subset of features allows more predictors to be eligible root nodes. This is particularly useful for handling high dimensionality data (e.g., have more variables than participants/cases).

Specify the model

set.seed(100)
mod_rf <-rand_forest() %>%
  set_engine("ranger",
             num.threads = parallel::detectCores(), 
             importance = "permutation", 
             verbose = TRUE) %>% 
  set_mode("regression") %>% 
  set_args(trees = 1000)

Create workflow

wflow_rf <- workflow() %>% 
  add_model(mod_rf) %>% 
  add_recipe(rec)

Fit the model

set.seed(100)
plan(multisession)

fit_rf <- fit_resamples(
  wflow_rf,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(verbose = TRUE,
                              save_pred = TRUE,
                              extract = function(x) x)
)

Visualize

The plot below shows the root nodes from a random forest with 1000 trees (specified using set_args(trees = 1000) in the parsnip model object).

Boosted trees

Boosted trees, like bagged trees, are an ensemble model. Instead of applying successive models to resampled data and pooling estimates, boosted trees fit the next tree to the residuals (i.e., error term) of the prior tree. The goal is to minimize residual error through multiple trees, and is typically done with fairly “shallow” decision tree (i.e., 1-6 splits in each tree). Though each model is only slightly improving the error rate, the sequential use of many shallow trees makes computationally efficient (i.e. reduced run time) and highly accurate predictions.

Specify the model

mod_boost <- boost_tree() %>% 
  set_engine("xgboost", nthreads = parallel::detectCores()) %>% 
  set_mode("regression")

Create workflow

wflow_boost <- workflow() %>% 
  add_recipe(rec) %>% 
  add_model(mod_boost)

Fit the model

set.seed(100)
plan(multisession)

fit_boost <- fit_resamples(
  wflow_boost, 
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(verbose = TRUE,
                              save_pred = TRUE)
)

Visualize

One of the few downfalls of {tidymodels} is its (current) inability to plot these tree-based models. For the past two models, it was simpler to extract root nodes and plot them, but their interpretation (as we’re fitting to residuals instead of data sets) are not straightforward. For that reason, we don’t have any pretty plots here. Instead, we’ll skip to evaluating the metrics of all models.

Evaluate metrics

After running these three models, it’s time to evaluate their performance. We can do this with tune::collect_metrics(). The table below shows the estimate of the out-of-sample performance for each of our 3 models.

collect_metrics(fit_bag) %>% 
  bind_rows(collect_metrics(fit_rf)) %>%
  bind_rows(collect_metrics(fit_boost)) %>% 
  filter(.metric == "rmse") %>% 
  mutate(model = c("bag", "rf", "boost")) %>% 
  select(model, everything()) %>% 
  knitr::kable()
model .metric .estimator mean n std_err
bag rmse standard 96.62424 10 2.609574
rf rmse standard 95.46510 10 3.184353
boost rmse standard 94.86529 10 2.978113

Here, we are faced with a common problem in the machine learning world: choosing between models that perform similarly. Whether we would prefer random forest, bagged trees, or boosted trees may also depend on computational efficiency (i.e., time) or other factors. In practice, tuning several hyperparameters may have made one model clearly preferable over the others, but in our case - relying on all defaults - we would probably have similar performance with both models on a new data set and would prefer random forest or boosted trees models for their efficiency.

Out-of-sample performance

The final step is to apply each trained model to our test data using last_fit().

# bagged trees
final_fit_bag <- last_fit(
  wflow_bag,
  split = split
)

# random forest
final_fit_rf <- last_fit(
  wflow_rf,
  split = split
)

# boosted trees
final_fit_boost <- last_fit(
  wflow_boost,
  split = split
)

The table below shows the actual out-of-sample performance for each of our 3 models.

# show performance on test data
collect_metrics(final_fit_bag) %>% 
  bind_rows(collect_metrics(final_fit_rf)) %>%
  bind_rows(collect_metrics(final_fit_boost)) %>% 
  filter(.metric == "rmse") %>% 
  mutate(model = c("bag", "rf", "boost")) %>% 
  select(model, everything()) %>% 
  knitr::kable()
model .metric .estimator .estimate
bag rmse standard 96.19641
rf rmse standard 91.35844
boost rmse standard 90.72458

After applying our 3 trained models to the unseen test data, it looks like our boosted trees model is the winner since it has the lowest RMSE. In this example, we only used 1% of the data to train these models, which could make it difficult to meaningfully compare their performance. In practice, our random forest model actually results in the best out-of-sample prediction (RMSE = 84.08) when using all of the available data, which we did for the Kaggle competition.

Cianna Bedford-Petersen is a personality psychologist interested in online social networks, data visualization, and open science

comments powered by Disqus