7  Predictive Modeling

7.1 Feature Selection | Random Forests | Machine Learning

While correlation is a valuable tool, it does have limitations. Machine learning with random forests can offer complementary insights and uncover relationships beyond what correlation reveals, in a few key ways:

Nonlinear relationships: Correlation only detects linear relationships, where variables move proportionally together (positive) or inversely (negative). Random forests, as ensembles of decision trees, can capture nonlinear relationships. Imagine a U-shaped curve: correlation wouldn't pick this up, but a random forest could learn that a certain range of values in one variable predicts changes in another despite there being no linear trend (see the sketch following this list).

Complex interactions: Correlation looks at each variable pair individually. Random forests consider interactions between multiple variables simultaneously. This is crucial for real-world data, where factors often influence each other in intricate ways. For example, temperature and humidity might have no individual correlation with crop yield, but their combined effect could be significant, which a random forest might capture.

Feature importance: Random forests provide feature importance scores, highlighting which variables contribute most to the model’s predictions. This goes beyond correlation’s simple strength measure and helps identify key drivers in complex systems. Understanding these drivers can lead to better decision-making even if the exact relationship isn’t fully linear.

Handling diverse data: Random forests can handle diverse data types (numerical, categorical) without specific transformations, unlike some correlation methods. This flexibility allows them to analyze real-world datasets more readily.
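
To make the nonlinearity and feature-importance points concrete, here is a minimal sketch with simulated data (the variables x, y, and noise are invented for illustration): a U-shaped relationship yields a Pearson correlation near zero, yet a random forest still ranks the predictor far above an irrelevant one.

library(randomForest)

set.seed(1)
x <- runif(500, -2, 2)
y <- x^2 + rnorm(500, sd = 0.2)  # U-shaped relationship, no linear trend
noise <- rnorm(500)              # an irrelevant predictor for contrast

cor(x, y)  # near zero: a linear measure misses the U-shape

rf <- randomForest(y ~ x + noise, data = data.frame(x, y, noise))
importance(rf)  # x's node-purity score dwarfs noise's, despite cor(x, y) ~ 0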

However, it’s important to remember that:

  • Random forests themselves don’t directly tell you “what causes what”. They identify predictive relationships, but interpreting those requires domain knowledge and additional analysis.
  • Random forests can be “black boxes” - explaining their inner workings can be challenging. While feature importance gives clues, the complex decision tree structure might not be easily interpretable.

Overall, random forests and correlation are complementary tools. Correlation offers a quick, interpretable measure of linear relationships, while random forests can delve deeper into nonlinear interactions and complex data, but with less interpretability. Combining both approaches can lead to a richer understanding of your data.

We will use the NWSL match data from the 2021-2023 seasons to select the most important features for predicting the outcome of a match (i.e., Win or Loss), and to see whether we can predict those results, with the help of the Boruta package.

This method is based on Feature Selection Using R | Machine Learning Models using Boruta Package by Dr. Bharatendra Rai.

In previous explorations, it was determined that the model worked best when it was trained on, and its predictions were made for, matches that did not result in a draw. Therefore, we will filter the data to include only matches that did not end in a draw.

Note that we will eventually need to convert our target vector (result) to a factor.
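
The code in this section assumes the following packages are loaded; the setup chunk is not shown, so this is a reconstruction of the likely library calls:

# Packages used throughout this section
library(tidyverse)     # read_csv(), dplyr verbs, glimpse(), tibble helpers
library(corrr)         # correlate() (also called as corrr::correlate below)
library(gt)            # gt() display tables
library(Boruta)        # Boruta(), attStats(), get*Formula() helpers
library(randomForest)  # randomForest()
library(caret)         # confusionMatrix()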

team_stats <- read_csv("/Users/seansteele/dev/soccer/data/processed/wyscout_team_stats/nwsl_wyscout_match_stats.csv",
                       show_col_types = FALSE) |>
  filter(result != "D")

glimpse(team_stats)
Rows: 735
Columns: 111
$ date                                      <date> 2022-10-03, 2022-10-03, 202…
$ match                                     <chr> "Chicago Red Stars - Angel C…
$ competition                               <chr> "United States. NWSL", "Unit…
$ duration                                  <dbl> 100, 100, 95, 95, 94, 94, 94…
$ team                                      <chr> "Angel City", "Chicago Red S…
$ scheme                                    <chr> "4-3-1-2 (83.78%)", "3-4-2-1…
$ goals                                     <dbl> 0, 2, 1, 3, 2, 1, 0, 1, 0, 1…
$ xg                                        <dbl> 1.15, 1.67, 1.59, 1.47, 1.73…
$ shots_total                               <dbl> 6, 13, 18, 10, 14, 11, 12, 9…
$ shots_on_target                           <dbl> 1, 7, 7, 6, 4, 6, 4, 5, 3, 6…
$ shots_on_target_perc                      <dbl> 16.67, 53.85, 38.89, 60.00, …
$ passes_total                              <dbl> 413, 441, 377, 313, 374, 415…
$ passes_accurate                           <dbl> 316, 343, 278, 213, 284, 327…
$ passes_accurate_perc                      <dbl> 76.51, 77.78, 73.74, 68.05, …
$ possession_percent                        <dbl> 49.80, 50.20, 51.28, 48.72, …
$ losses_total                              <dbl> 136, 127, 127, 143, 129, 121…
$ losses_low                                <dbl> 22, 20, 19, 28, 30, 30, 25, …
$ losses_medium                             <dbl> 60, 60, 51, 63, 44, 34, 41, …
$ losses_high                               <dbl> 54, 47, 57, 52, 55, 57, 59, …
$ recoveries_total                          <dbl> 84, 112, 100, 99, 101, 86, 7…
$ recoveries_low                            <dbl> 31, 41, 43, 41, 41, 30, 34, …
$ recoveries_medium                         <dbl> 42, 53, 41, 44, 41, 38, 25, …
$ recoveries_high                           <dbl> 11, 18, 16, 14, 19, 18, 18, …
$ duels_total                               <dbl> 208, 208, 228, 228, 206, 206…
$ duels_won                                 <dbl> 91, 112, 102, 118, 95, 107, …
$ duels_won_perc                            <dbl> 43.75, 53.85, 44.74, 51.75, …
$ shots_outside_penalty_area                <dbl> 3, 5, 9, 2, 5, 5, 5, 5, 3, 4…
$ shots_outside_penalty_area_on_target      <dbl> 0, 2, 5, 2, 0, 4, 2, 2, 1, 1…
$ shots_outside_penalty_area_on_target_perc <dbl> 0.00, 40.00, 55.56, 100.00, …
$ positional_attacks_total                  <dbl> 24, 39, 36, 29, 37, 26, 36, …
$ positional_attacks_with_shots             <dbl> 4, 9, 11, 4, 9, 6, 9, 4, 4, …
$ positional_attacks_with_shots_perc        <dbl> 16.67, 23.08, 30.56, 13.79, …
$ counterattacks_total                      <dbl> 1, 0, 1, 1, 1, 0, 0, 4, 2, 2…
$ counterattacks_with_shots                 <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 1…
$ counterattacks_with_shots_perc            <dbl> 100.00, 0.00, 100.00, 0.00, …
$ set_pieces_total                          <dbl> 20, 31, 29, 23, 19, 26, 28, …
$ set_pieces_with_shots                     <dbl> 0, 3, 4, 5, 5, 3, 2, 4, 3, 4…
$ set_pieces_with_shots_perc                <dbl> 0.00, 9.68, 13.79, 21.74, 26…
$ corners_total                             <dbl> 3, 4, 6, 3, 7, 8, 4, 2, 4, 5…
$ corners_with_shots                        <dbl> 0, 1, 2, 1, 5, 3, 1, 2, 2, 3…
$ corners_with_shots_perc                   <dbl> 0.00, 25.00, 33.33, 33.33, 7…
$ free_kicks_total                          <dbl> 0, 1, 4, 0, 1, 1, 3, 1, 1, 2…
$ free_kicks_with_shots                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ free_kicks_with_shots_perc                <dbl> 0.00, 0.00, 0.00, 0.00, 0.00…
$ penalties_total                           <dbl> 1, 0, 0, 1, 0, 1, 1, 0, 0, 0…
$ penalties_converted                       <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 0…
$ penalties_converted_perc                  <dbl> 0, 0, 0, 100, 0, 100, 0, 0, …
$ crosses_total                             <dbl> 10, 8, 15, 22, 13, 24, 17, 1…
$ crosses_accurate                          <dbl> 7, 1, 4, 8, 5, 3, 3, 4, 1, 9…
$ crosses_accurate_perc                     <dbl> 70.00, 12.50, 26.67, 36.36, …
$ deep_completed_crosses                    <dbl> 3, 1, 5, 7, 3, 4, 3, 4, 1, 8…
$ deep_completed_passes                     <dbl> 2, 8, 5, 9, 6, 3, 6, 3, 4, 9…
$ penalty_area_entries_total                <dbl> 13, 24, 26, 25, 23, 33, 28, …
$ penalty_area_entries_runs                 <dbl> 4, 5, 3, 2, 3, 4, 2, 8, 2, 5…
$ penalty_area_entries_crosses              <dbl> 2, 7, 9, 12, 5, 13, 15, 7, 3…
$ touches_in_penalty_area                   <dbl> 9, 18, 19, 17, 17, 20, 16, 1…
$ offensive_duels_total                     <dbl> 64, 80, 92, 56, 81, 77, 79, …
$ offensive_duels_won                       <dbl> 20, 31, 29, 18, 21, 28, 28, …
$ offensive_duels_won_perc                  <dbl> 31.25, 38.75, 31.52, 32.14, …
$ offsides                                  <dbl> 2, 1, 4, 2, 5, 2, 0, 6, 1, 3…
$ conceded_goals                            <dbl> 2, 0, 3, 1, 1, 2, 1, 0, 1, 0…
$ shots_against_total                       <dbl> 13, 6, 10, 18, 11, 14, 9, 12…
$ shots_against_on_target                   <dbl> 7, 1, 7, 7, 6, 4, 5, 5, 6, 3…
$ shots_against_on_target_perc              <dbl> 53.85, 16.67, 70.00, 38.89, …
$ defensive_duels_total                     <dbl> 80, 64, 56, 92, 77, 81, 65, …
$ defensive_duels_won                       <dbl> 49, 44, 38, 63, 49, 60, 34, …
$ defensive_duels_won_perc                  <dbl> 61.25, 68.75, 67.86, 68.48, …
$ aerial_duels_total                        <dbl> 23, 23, 60, 60, 28, 28, 38, …
$ aerial_duels_won                          <dbl> 10, 12, 26, 28, 18, 10, 14, …
$ aerial_duels_won_perc                     <dbl> 43.48, 52.17, 43.33, 46.67, …
$ sliding_tackles_total                     <dbl> 2, 5, 3, 4, 4, 9, 2, 1, 8, 0…
$ sliding_tackles_successful                <dbl> 1, 1, 2, 3, 2, 4, 1, 0, 4, 0…
$ sliding_tackles_successful_perc           <dbl> 50.00, 20.00, 66.67, 75.00, …
$ interceptions                             <dbl> 52, 46, 44, 58, 51, 38, 18, …
$ clearances                                <dbl> 12, 15, 14, 20, 7, 17, 12, 2…
$ fouls                                     <dbl> 20, 7, 8, 15, 7, 4, 6, 12, 8…
$ yellow_cards                              <dbl> 2, 1, 1, 3, 0, 1, 0, 2, 1, 1…
$ red_cards                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ forward_passes_total                      <dbl> 160, 183, 170, 133, 162, 143…
$ forward_passes_accurate                   <dbl> 104, 122, 114, 82, 102, 98, …
$ forward_passes_accurate_perc              <dbl> 65.00, 66.67, 67.06, 61.65, …
$ back_passes_total                         <dbl> 67, 87, 42, 50, 64, 75, 52, …
$ back_passes_accurate                      <dbl> 59, 78, 36, 43, 60, 71, 45, …
$ back_passes_accurate_perc                 <dbl> 88.06, 89.66, 85.71, 86.00, …
$ lateral_passes_total                      <dbl> 118, 93, 106, 82, 86, 142, 1…
$ lateral_passes_accurate                   <dbl> 105, 79, 83, 59, 70, 121, 14…
$ lateral_passes_accurate_perc              <dbl> 88.98, 84.95, 78.30, 71.95, …
$ long_passes_total                         <dbl> 50, 44, 66, 44, 42, 26, 42, …
$ long_passes_accurate                      <dbl> 27, 25, 32, 17, 15, 16, 21, …
$ long_passes_accurate_perc                 <dbl> 54.00, 56.82, 48.48, 38.64, …
$ passes_to_final_third_total               <dbl> 52, 57, 51, 44, 40, 58, 52, …
$ passes_to_final_third_accurate            <dbl> 34, 41, 26, 30, 17, 45, 39, …
$ passes_to_final_third_accurate_perc       <dbl> 65.38, 71.93, 50.98, 68.18, …
$ progressive_passes_total                  <dbl> 76, 86, 80, 80, 61, 69, 81, …
$ progressive_passes_accurate               <dbl> 44, 50, 51, 47, 33, 53, 57, …
$ progressive_passes_accurate_perc          <dbl> 57.89, 58.14, 63.75, 58.75, …
$ smart_passes_total                        <dbl> 0, 5, 5, 1, 6, 9, 0, 6, 6, 2…
$ smart_passes_accurate                     <dbl> 0, 2, 2, 0, 3, 0, 0, 1, 3, 1…
$ smart_passes_accurate_perc                <dbl> 0.00, 40.00, 40.00, 0.00, 50…
$ throw_ins_total                           <dbl> 38, 26, 30, 24, 27, 20, 33, …
$ throw_ins_accurate                        <dbl> 33, 23, 29, 22, 25, 18, 26, …
$ throw_ins_accurate_perc                   <dbl> 86.84, 88.46, 96.67, 91.67, …
$ goal_kicks                                <dbl> 10, 5, 8, 11, 6, 8, 7, 8, 14…
$ match_tempo                               <dbl> 17.33, 18.35, 16.21, 14.16, …
$ average_passes_per_possession             <dbl> 3.56, 3.29, 2.86, 2.59, 3.31…
$ long_pass_percent                         <dbl> 12.11, 9.98, 17.51, 14.06, 1…
$ average_shot_distance                     <dbl> 18.71, 17.59, 20.01, 17.80, …
$ average_pass_length                       <dbl> 17.69, 16.80, 18.81, 19.37, …
$ ppda                                      <dbl> 6.60, 8.76, 6.69, 7.49, 8.39…
$ result                                    <chr> "L", "W", "L", "W", "W", "L"…
$ pts                                       <dbl> 0, 3, 0, 3, 3, 0, 0, 3, 0, 3…

7.2 Data Cleaning

It is recommended to remove any columns that are not useful for the model, as well as any that correlate highly with our outcome columns (result and pts). First, we remove the columns that are not useful for the model.

team_stats_clean <- team_stats |>
  select(-c(date, match, competition, duration, team, scheme))

Next, we find the correlation between each column and the result. Since result is not numeric, I will use pts as a proxy for it.

team_stats_cor <- team_stats_clean |>
  select(-c(result)) |>
  corrr::correlate(quiet = TRUE) |>
  select(term, pts) |>
  # Hack to put both positive and negative correlations together
  mutate(pts = round(pts, 2),
         abs_pts = abs(pts)) |>
  arrange(desc(abs_pts)) |>
  select(-abs_pts) |>
  slice_head(n = 10)

team_stats_cor |> gt()
term                            pts
goals                          0.68
conceded_goals                -0.68
shots_on_target                0.38
shots_against_on_target       -0.38
xg                             0.34
shots_on_target_perc           0.29
shots_against_on_target_perc  -0.26
counterattacks_with_shots      0.23
shots_against_total           -0.23
clearances                     0.23

It should come as no surprise that the goals and conceded_goals columns are highly correlated with the result/pts columns. We will remove these columns from the dataset, along with the pts column itself. Note: I have also converted the result column to a factor.

team_stats_clean <- team_stats |>
  select(-c(date, match, competition, duration, team, scheme, goals, conceded_goals, pts)) |>
  mutate(result = as.factor(result))

7.2.1 Feature Selection

We will now use the Boruta package to select the most important features for predicting the outcome of a match (i.e., Win or Loss).

set.seed(111)
boruta <- Boruta(result ~ ., 
                 data = team_stats_clean, 
                 doTrace = 0, # set to 2 to see the progress
                 maxRuns = 300) # increase to give Boruta more runs to resolve tentative attributes

print(boruta)
Boruta performed 299 iterations in 38.98194 secs.
 33 attributes confirmed important: average_passes_per_possession,
back_passes_accurate, back_passes_total, clearances,
counterattacks_total and 28 more;
 58 attributes confirmed unimportant: aerial_duels_total,
aerial_duels_won, aerial_duels_won_perc, average_pass_length,
average_shot_distance and 53 more;
 10 tentative attributes left: corners_total,
counterattacks_with_shots_perc, forward_passes_accurate,
free_kicks_total, long_pass_percent and 5 more;

Boruta allows us to visualize the results of the feature selection process.

plot(boruta, las = 2, cex.axis = 0.4)

With such dense information, the graph can be a little tricky to read. We can also view the results in a table format.

non_rejected_vars <- attStats(boruta) |>
  rownames_to_column("variable") |>
  as_tibble() |>
  arrange(desc(meanImp)) |>
  select(variable, decision) |>
  filter(decision != "Rejected")

non_rejected_vars |> 
  gt()
variable                            decision
shots_against_on_target             Confirmed
xg                                  Confirmed
shots_on_target                     Confirmed
shots_against_on_target_perc        Confirmed
shots_on_target_perc                Confirmed
clearances                          Confirmed
shots_against_total                 Confirmed
shots_total                         Confirmed
lateral_passes_total                Confirmed
losses_high                         Confirmed
throw_ins_total                     Confirmed
interceptions                       Confirmed
penalties_converted                 Confirmed
lateral_passes_accurate             Confirmed
penalties_converted_perc            Confirmed
penalty_area_entries_crosses        Confirmed
positional_attacks_with_shots_perc  Confirmed
crosses_total                       Confirmed
possession_percent                  Confirmed
average_passes_per_possession       Confirmed
counterattacks_with_shots           Confirmed
smart_passes_accurate_perc          Confirmed
passes_total                        Confirmed
throw_ins_accurate                  Confirmed
passes_accurate                     Confirmed
counterattacks_total                Confirmed
match_tempo                         Confirmed
deep_completed_passes               Confirmed
back_passes_accurate                Confirmed
passes_to_final_third_total         Confirmed
positional_attacks_with_shots       Confirmed
back_passes_total                   Confirmed
passes_accurate_perc                Confirmed
touches_in_penalty_area             Tentative
penalty_area_entries_runs           Tentative
losses_low                          Tentative
penalties_total                     Tentative
corners_total                       Tentative
forward_passes_accurate             Tentative
smart_passes_accurate               Tentative
counterattacks_with_shots_perc      Tentative
long_pass_percent                   Tentative
free_kicks_total                    Tentative

We are now going to build a model to predict the outcome of matches from these variables. We will split the data into a training set and a test set, using the training set to build the model and the test set to evaluate it.

# Data Partition: randomly assign each row to train (1) or test (2) with ~60/40 probability
set.seed(222)
ind <- sample(2, nrow(team_stats_clean), replace = TRUE, prob = c(0.6, 0.4))
train <- team_stats_clean[ind == 1, ]
test <- team_stats_clean[ind == 2, ]

First, we will see how well our model can predict a W/L outcome using all of the variables. This will be our benchmark for testing our reduced-variable models.

# Random Forest Model
set.seed(333)
rf_all_vars <- randomForest(result ~ ., data = train)
rf_all_vars

Call:
 randomForest(formula = result ~ ., data = train) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 10

        OOB estimate of  error rate: 27.59%
Confusion matrix:
    L   W class.error
L 152  68   0.3090909
W  57 176   0.2446352

Using all of the variables, the model's out-of-bag (OOB) error rate is 27.59%, meaning it correctly predicts the outcome of a match roughly 72% of the time.
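
As a quick sanity check, that figure can be recovered directly from the OOB confusion matrix above:

# OOB accuracy = correct predictions / total training observations
(152 + 176) / (152 + 68 + 57 + 176)  # 0.7241, i.e. 1 - 0.2759 OOB error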

Now let's look at how well the model predicts the test data. First, we will use all of the variables to predict the outcomes in the test set. I've split the process into two steps to show how it's done. Here, p is returned, which holds the predicted "W" or "L" for each match in the test data.

# Prediction & Confusion Matrix - Test
p <- predict(rf_all_vars, test)
p[1:10]
 1  2  3  4  5  6  7  8  9 10 
 L  W  L  W  W  L  L  W  W  L 
Levels: L W

predict() returns a vector of W/L predictions. We can use the confusionMatrix() function to see how well our model predicted the outcome of the test data.

confusionMatrix(p, test$result)
Confusion Matrix and Statistics

          Reference
Prediction   L   W
         L 113  26
         W  34 109
                                          
               Accuracy : 0.7872          
                 95% CI : (0.7348, 0.8335)
    No Information Rate : 0.5213          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.5747          
                                          
 Mcnemar's Test P-Value : 0.3662          
                                          
            Sensitivity : 0.7687          
            Specificity : 0.8074          
         Pos Pred Value : 0.8129          
         Neg Pred Value : 0.7622          
             Prevalence : 0.5213          
         Detection Rate : 0.4007          
   Detection Prevalence : 0.4929          
      Balanced Accuracy : 0.7881          
                                          
       'Positive' Class : L               
                                          

We can see that the test-set accuracy is 78.7%. This will be our benchmark for measuring our reduced-feature models.

How well does our reduced model work? We will use all of the non-rejected variables (i.e., "Confirmed" and "Tentative") from the Boruta run.

# Random Forest Models - Non-Rejected
set.seed(333)
rf_non_rej_model <- getNonRejectedFormula(boruta) |>
  as.formula() |>
  randomForest(data = train) |>
  predict(test) |>  # the pipe replaces the separate `p <- predict(...)` step above
  confusionMatrix(test$result)

rf_non_rej_model
Confusion Matrix and Statistics

          Reference
Prediction   L   W
         L 116  25
         W  31 110
                                        
               Accuracy : 0.8014        
                 95% CI : (0.75, 0.8464)
    No Information Rate : 0.5213        
    P-Value [Acc > NIR] : <2e-16        
                                        
                  Kappa : 0.6028        
                                        
 Mcnemar's Test P-Value : 0.504         
                                        
            Sensitivity : 0.7891        
            Specificity : 0.8148        
         Pos Pred Value : 0.8227        
         Neg Pred Value : 0.7801        
             Prevalence : 0.5213        
         Detection Rate : 0.4113        
   Detection Prevalence : 0.5000        
      Balanced Accuracy : 0.8020        
                                        
       'Positive' Class : L             
                                        

By reducing the number of variables in our model, we've actually improved our accuracy, from 78.7% to 80.1%.

Finally, we will build a model using only the “Confirmed” variables from the Boruta model.

# Random Forest Models - Confirmed
set.seed(333)
confirmed_model <- getConfirmedFormula(boruta) |>
  as.formula() |>
  randomForest(data = train) |>
  predict(test) |>  # again, the pipe replaces the intermediate `p`
  confusionMatrix(test$result)

confirmed_model
Confusion Matrix and Statistics

          Reference
Prediction   L   W
         L 119  25
         W  28 110
                                          
               Accuracy : 0.8121          
                 95% CI : (0.7615, 0.8559)
    No Information Rate : 0.5213          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6238          
                                          
 Mcnemar's Test P-Value : 0.7835          
                                          
            Sensitivity : 0.8095          
            Specificity : 0.8148          
         Pos Pred Value : 0.8264          
         Neg Pred Value : 0.7971          
             Prevalence : 0.5213          
         Detection Rate : 0.4220          
   Detection Prevalence : 0.5106          
      Balanced Accuracy : 0.8122          
                                          
       'Positive' Class : L               
                                          

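To compare all three models at a glance, we can pull the accuracy out of each caret confusionMatrix object via its overall component. A small sketch; it assumes we first store the benchmark result, which was only printed earlier:

# Store the benchmark confusion matrix (it was only printed above)
rf_all_cm <- confusionMatrix(p, test$result)

tibble(
  model    = c("All variables", "Non-rejected", "Confirmed only"),
  accuracy = c(rf_all_cm$overall[["Accuracy"]],
               rf_non_rej_model$overall[["Accuracy"]],
               confirmed_model$overall[["Accuracy"]])
)
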
Here are the reduced features that we can use to predict the outcome of a match.

non_rejected_vars |> 
  filter(decision == "Confirmed") |>
  pull(variable)
 [1] "shots_against_on_target"            "xg"                                
 [3] "shots_on_target"                    "shots_against_on_target_perc"      
 [5] "shots_on_target_perc"               "clearances"                        
 [7] "shots_against_total"                "shots_total"                       
 [9] "lateral_passes_total"               "losses_high"                       
[11] "throw_ins_total"                    "interceptions"                     
[13] "penalties_converted"                "lateral_passes_accurate"           
[15] "penalties_converted_perc"           "penalty_area_entries_crosses"      
[17] "positional_attacks_with_shots_perc" "crosses_total"                     
[19] "possession_percent"                 "average_passes_per_possession"     
[21] "counterattacks_with_shots"          "smart_passes_accurate_perc"        
[23] "passes_total"                       "throw_ins_accurate"                
[25] "passes_accurate"                    "counterattacks_total"              
[27] "match_tempo"                        "deep_completed_passes"             
[29] "back_passes_accurate"               "passes_to_final_third_total"       
[31] "positional_attacks_with_shots"      "back_passes_total"                 
[33] "passes_accurate_perc"              

It is recommended that we turn to other models, as well as expert domain knowledge, to further refine our feature set. Some of these features, such as throw-ins, are unlikely to be relevant in themselves, but they may point to other features that are. The data provided to the model does not include the opponent's statistics, so an item like a team's throw-ins may be serving as a proxy for opponent turnovers.