prediction of men's professional tennis match outcomes using machine learning


ursus-maritimus-714/Mens-Tennis-Prediction


Prediction of Match Outcomes in Men's Professional Tennis

Introduction

The outcome of a tennis match is influenced by a number of factors, some intrinsic and others extrinsic to the players involved. Relevant intrinsic factors include relative underlying skill levels, physical and mental stamina and fatigue, and matchup-specific considerations (e.g., a right-handed player who struggles against lefties). Factors extrinsic to the players themselves include surface (e.g., clay vs hard courts), altitude, weather variables (e.g., temperature, humidity, wind), and location (e.g., one player is playing in their home country). Singles tennis, as an individual sport with matches composed of discrete units of action (serves, points, games, sets) played across a range of largely knowable situational conditions, may be particularly amenable to multifactorial linear and nonlinear modeling for prediction of match outcomes.

Methods

Primary Objective and Data Sources

The goal of this project was to predict the outcomes of individual top-level men's professional singles tennis matches. The target feature for prediction (TF) was the percentage share of total points played in a given match won by each of the two players involved (0-100% per player). The primary data source was a recent (2009-2019) ~30,000 match subset of Jeff Sackmann's Association of Tennis Professionals (ATP) data archive. This primary source was supplemented with additions and corrections obtained directly from the ATP data site. Additionally, some predictive features were generated from implied win probabilities (IWPs) for players from matches played prior to a given match being predicted. These IWPs were, in turn, derived from historical wagering lines from a number of sportsbooks obtained from either Dan Weston or Oddsportal. All data manipulation, processing, and analyses were conducted in the Python/Pandas environment and documented with Jupyter Notebooks (anaconda3).

Predictive Feature Generation

Figure 1 summarizes the predictive features generated (N=374), by broad class. Player-level features (4 of 5 classes) were computed, per match to be predicted, as Player Value - Opponent Value (i.e., 'Differential'). Where indicated, features in different classes were subject to one or more adjustments to account for the recency of included data ('Decay-Time Weighted' [DTW]), the past performance of opposition in the feature accrual window ('Strength of Schedule [SOS] Adjusted'), and statistical assessment of the tournament conditions under which a player's matches in the feature accrual window were played ('Tourney Conditions Adjusted' [TCA]).

  • 'Player Demographic Features' (n=8) captured such parameters as age, handedness, height, tournament entry type (e.g., qualifier vs. direct entry) and the presence or absence of a "home country advantage" for a given player.
  • 'Player Conditioning Features' (n=10) included features estimating player fatigue (e.g., decay-weighted total time on court in recent days prior to a predicted match), recent travel burden, and stamina (e.g., number of previous matches in overall sample).
  • 'Tournament Conditions Features' (n=21) included parameters characterizing the success of a tournament field in specific areas (e.g., serving and returning) relative to past success of the same group of players on the same surface. In practice, as rounds of a tournament "went by", these factors were updated to provide an estimate of tournament conditions (e.g., court speed) at "present". These conditions factors for 1st Round matches in a tournament were set to 1 (i.e., 'neutral' assumptions).
  • 'Player Past Market Sentiment Features' (n=6) capture DTW/SOS-adjusted, wagering line-derived, averaged IWPs for players from matches played PRIOR TO a given match being predicted. These features were derived in both Short-Term (previous 10 matches) and Long-Term (previous 60 matches) forms. Importantly with respect to data leakage and the ultimate goal of using this model in an actionable way, these features DID NOT include data derived from wagering lines for a given match being predicted.
  • 'Player Past Performance Features' (n=329) was the largest class of predictive features. These features captured past performance on serve (total, 1st, and 2nd serve performance, break points saved, aces, double faults, etc.) and return (same, but from the returner's perspective) and previous head-to-head player matchup statistics. As with the previous class, these features were derived in both Short-Term (previous 10 matches) and Long-Term (previous 60 matches) forms. Some features in this class distinguished indoor from outdoor performance on the same surface, though a given surface model (either clay court or hard court) included both indoor and outdoor matches on that surface. Also, a number of the features in this class were (the player 'Differential' of) ratios aimed at capturing player efficiencies (e.g., ace%/double fault%). Many other features in this class captured 'Offense vs Defense' (e.g., Player A Past Serve Points Won% - Player B Return Points Won%) or 'Defense vs Offense'.
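The 'Differential' computation described above can be sketched as follows. This is a minimal illustration with hypothetical column names (the actual schema in the project's notebooks differs):

```python
import pandas as pd

# Hypothetical per-player table: one row per (match, player), with an
# already-accrued past-performance statistic for each player.
rows = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "player": ["A", "B", "C", "D"],
    "serve_won_pct": [66.0, 61.5, 70.2, 64.8],  # past serve points won %
})

# 'Differential' form: Player Value - Opponent Value within each match.
# The opponent's value is the match total minus the player's own value.
opp = rows.groupby("match_id")["serve_won_pct"].transform("sum") - rows["serve_won_pct"]
rows["serve_won_pct_diff"] = rows["serve_won_pct"] - opp
print(rows)
```

Within each match the two players' differentials are mirror images (e.g., +4.5 and -4.5), so each record carries the matchup-relative form of the underlying statistic.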

Figure 1. Overview of Predictive Features By Class

Figure 2 shows the feature generation workflow for the large majority of 'Player Past-Performance Features'. Most features in this class were time-decay weighted with continuous functions (e.g., for 'Long-Term' features, the match immediately previous to the predicted match was weighted 60, the one prior to that 59, down to a weight of 1 for the 60th match prior to the predicted match). These DTW features were then adjusted to account both for SOS faced by a player during the feature accrual window and for tournament conditions (e.g., court speed) experienced during the same interval. Finally, the 'Differential' form of each feature was generated from these DTW/SOS/TCA-adjusted player-level features. One notable exception to the general time-weighting strategy was player fatigue, for which either total time on court or total points played was accrued over a 10-day period prior to the predicted match, with decay-weighting based on how many days ago a match occurred (as opposed to how many matches "ago").
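The linear decay-weighting scheme described above (newest match weighted 60, oldest weighted 1) can be sketched as a weighted average. This is an illustrative sketch, not the project's actual implementation:

```python
import numpy as np

def decay_time_weighted(values, window=60):
    """Linearly decay-weighted average of a player's past-match values.

    `values` is ordered oldest -> newest; only the most recent `window`
    matches contribute. The immediately preceding match gets weight
    `window`, the one before it `window - 1`, down to 1 for the oldest
    match in the window (the 'Long-Term' scheme described in the text;
    'Short-Term' would use window=10).
    """
    recent = np.asarray(values[-window:], dtype=float)
    weights = np.arange(1, len(recent) + 1)  # oldest=1 ... newest=window
    return float(np.average(recent, weights=weights))

# Toy example: a steadily improving stat is pulled toward recent matches,
# landing above the unweighted mean of 55.
print(decay_time_weighted([50.0, 55.0, 60.0]))
```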

Figure 2. Overview of the Predictive Feature Generation Process

Machine Learning Regression Modeling

After predictive features were generated for each match, the best regression model for prediction of the TF was found on a surface-specific (clay or hard) basis ('Best Model'). To find 'Best Model' per surface, 4 different regression model types were explored using scikit-learn (v1.1.1): Linear, Random Forest, Gradient Boosting, and Histogram-based Gradient Boosting. For all model types, a 75/25 training/test split and 5-fold training set cross-validation were used. Additionally, hyperparameter grid search optimization was used per model as warranted (e.g., for Gradient Boosting, the grid search covered imputation type, scaler type, learning rate, maximum depth, and maximum features). 'Best Model' on both surfaces was a hyperparameter-optimized Gradient Boosting Regressor (see the 'Model Metrics' csv files in the 'Reporting' folder for details and for how this model performed compared to the other models).
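A minimal sketch of this per-surface model search, using synthetic stand-in data; the grid values below are illustrative, not the ones actually searched in the project's notebooks:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # stand-in feature matrix
y = 50 + 5 * X[:, 0] + rng.normal(size=200)   # stand-in points-won %

# 75/25 training/test split, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Pipeline so imputation/scaling choices can be grid-searched alongside
# the regressor's hyperparameters.
pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("gbr", GradientBoostingRegressor(random_state=0)),
])
grid = {
    "impute__strategy": ["mean", "median"],
    "gbr__learning_rate": [0.05, 0.1],
    "gbr__max_depth": [2, 3],
}

# 5-fold cross-validation on the training set, scored by RMSE.
search = GridSearchCV(pipe, grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_tr, y_tr)
print(search.best_params_, "CV RMSE:", -search.best_score_)
```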

Incorporating Implied Win Probabilities (IWPs) into Modeling/Model Evaluation

IWPs per player in a match were derived from historical wagering lines (see Methods/'Primary Objective and Data Sources' for sources) with well-established formulas. Part of this process was removal of the charge that betting sites apply to the betting markets they offer (i.e., the "vig"), which is incorporated directly into wagering lines. IWPs used in predictive feature generation were derived separately for both opening (Pinnacle Sports only) and closing (both Pinnacle Sports alone and an aggregate of available books) wagering lines.
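One standard way to strip the vig is proportional normalization of the reciprocal decimal odds. The sketch below assumes this variant; the project does not specify which of the well-established formulas it used (others, such as Shin's method, exist):

```python
def implied_win_probs(odds_a, odds_b):
    """Convert a pair of decimal wagering odds into vig-free implied win
    probabilities via proportional normalization.

    The raw reciprocals sum to more than 1.0; the excess ("overround")
    is the book's margin, removed here by rescaling.
    """
    raw_a, raw_b = 1.0 / odds_a, 1.0 / odds_b
    overround = raw_a + raw_b  # > 1.0 for any real wagering line
    return raw_a / overround, raw_b / overround

# e.g., decimal odds of 1.50 / 2.75 imply roughly 65% / 35% once the
# vig is stripped out.
p_a, p_b = implied_win_probs(1.50, 2.75)
print(round(p_a, 3), round(p_b, 3))
```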

IWPs for the match being predicted were NOT used in predictive feature generation, as this would have constituted "data leakage" in the context of making actionable (i.e., wagerable) match predictions. Consequently, IWP-derived features used in modeling were adjusted averages from matches played PRIOR TO a given predicted match (Short-Term and Long-Term variants were created, just as for 'Player Past-Performance Features'). However, opening and closing line-derived IWPs for a given predicted match WERE used to derive benchmark predictions of the TF, with which to compare the prediction quality of 'Best Model' per surface. Pre-match lines (particularly closing lines) represent the "wisdom of the markets", incorporating information on weather/playing conditions and "insider" information on player injury/illness status in the immediate lead-up to a match to be predicted. Thus, benchmark models derived from these lines were an extremely stringent test of the quality of models built only from information and data publicly available in advance of a given match.

Unless otherwise specified, modeling results reported below are root-mean-square error (RMSE) ± standard deviation (SD) for training set (5-fold) cross-validation for the TF (% of total points [0-100%] played in a given predicted match won by a given player).

Also unless otherwise specified, models included completed matches played 2015-2019 (with 2009-2014 additionally included for various statistical accruals during the feature development stage) where BOTH players in a given predicted match had previously played >=20 matches on the same surface as the predicted match. Additionally, matches with <12 games played were removed prior to feature development. This filtering resulted in ~5,000 total matches included in hard court modeling and ~2,250 matches in clay court modeling. For a number of sub-analyses, however, the time range inclusion for modeling was expanded to boost the number of matches included. See Supp. Figs. 1 and 2 for details on the effects of varying the time range and previous matches played thresholds on model prediction quality.
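The inclusion criteria above can be sketched in pandas. Column names here are hypothetical stand-ins for the processed ATP sample:

```python
import pandas as pd

# Toy match table standing in for the processed ATP sample.
matches = pd.DataFrame({
    "year": [2014, 2016, 2018, 2019],
    "games_played": [22, 11, 30, 25],
    "p1_prior_surface_matches": [25, 40, 18, 60],
    "p2_prior_surface_matches": [30, 35, 50, 21],
})

# Criteria from the text: completed matches from 2015-2019, at least
# 12 games played, and BOTH players with >= 20 prior matches on the
# same surface as the predicted match.
keep = (
    matches["year"].between(2015, 2019)
    & (matches["games_played"] >= 12)
    & (matches["p1_prior_surface_matches"] >= 20)
    & (matches["p2_prior_surface_matches"] >= 20)
)
modeling_sample = matches[keep]
print(len(modeling_sample), "of", len(matches), "matches retained")
```

In this toy table only the last row survives all three filters; in the real sample these filters left ~5,000 hard court and ~2,250 clay court matches.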

Key Results

1) On both surfaces, 'Best Model' ('Full Set (No IWPs)') predicted the TF more accurately than a linear model based solely on features derived from player rankings-related information ('Rankings') at the time of a given match being predicted. 'Best Model' also outperformed a model that simply guessed the mean training set TF for each individual match ('Dummy'). However, 'Best Model' predictions were NOT as accurate as those generated by linear models based solely on pre-match, market-derived Implied Win Probabilities (IWPs). (Fig. 3)

Figure 3. Best Model Prediction Quality vs Benchmark Models


2) Hard court match outcomes were modeled more accurately than those on clay courts, even when adjusting for sample size differences across the two surfaces by reducing the number of hard court matches to equal the number of clay court matches. (Fig. 4)

Figure 4. Best Model Prediction Quality By Surface

For the 'Hard' and 'Hard + Clay' models (both surfaces included in the same model) in the "Equal Samples Per Surface" analysis, 25 modeling iterations with random subsets of hard court records (each for one player in a given predicted match) were averaged.

3) On both surfaces, best of 5 sets matches (i.e., Grand Slam matches) were modeled with higher accuracy than were best of 3 sets matches from the regular ATP tour, both for 'Best Model' and for the stringent market benchmark 'Closing IWP Model'. (Fig. 5)

Figure 5. Best Model Prediction Quality by Non-Grand Slam vs. Grand Slam Matches

For this analysis, the modeling time range for both surfaces was expanded to include 2011-2019 in order to boost the number of best of 5 sets matches (Grand Slam; GS). As a result, 'Best Model' for both surfaces in this analysis had a slightly lower mean training error than for the time ranges (2015-2019 and 2012-2019 for hard and clay court modeling, respectively) used to determine best model prediction quality per surface in the core analysis (see Fig. 4 and Supp. Fig. 2).

4) 'Best Model' prediction quality continually improved for both surfaces as matches with increasingly large amounts of IWP movement from the opening to closing of pre-match wagering were removed, up until an inclusion threshold of ~3% (e.g., one player in a given match moved from 50% IWP at opening to 53% IWP at closing, with the other player moving in the other direction to 47% IWP). A similar trend was observed for the 'Closing IWP' model. (Fig. 6)

Figure 6. Best Model Prediction Quality By Opening to Closing IWP Movement Threshold

For this analysis, matches from 2012-2019 were included in the modeling stage for clay courts to increase the number of matches to a level more comparable to that of hard court modeling (2015-2019). See Supp. Fig. 1 for analysis of model prediction accuracy by surface as a function of time range inclusion. Opening-to-closing IWP movement was derived from Pinnacle Sports wagering lines. The sole input to the 'Closing IWP Model' for each surface was averaged (across a variable set of available sportsbooks per match) closing IWP for matches still included at a given movement threshold (results were extremely similar to those obtained with Pinnacle Sports closing line-derived IWPs as the sole model input).

5) Removal of 1st round matches, ~1/3 of all matches in the modeling samples, resulted in improvement of 'Best Model' prediction quality for both surfaces. The same effect was observed for 'Closing IWP Model' per surface (Fig. 7A). Mean IWP for 1st round matches on both surfaces moved more from opening to closing than that for subsequent rounds; in fact, mean IWP movement decreased for each successive round (Fig. 7B). This finding suggests an inverse relationship between model prediction quality and market volatility.

Figure 7. Best Model Prediction Quality By Tournament Match Round Inclusion

Match inclusion criteria for 'Best Model', per surface, and calculation of 'Closing IWP Model' in Fig. 7A were the same as for Fig. 6 (see footnote to Fig. 6). Implied Win Probability (IWP) Δ in Fig. 7B is the mean per-surface change in IWP from opening to closing of wagering for Pinnacle Sports across all matches in the full modeling sample for a given round of play.

6) When individual feature classes or adjustment-types were systematically "subtracted" in the modeling stage ('Subtraction Analysis'), subtraction of 'Short-Term Player Past Performance Features' resulted in the largest reduction in model prediction quality. Figure 8 shows, in increasing order of negative impact on model prediction quality, the effect of removing individual feature classes and adjustments from the full hard court 'Best Model'. Results were similar for clay courts (not shown).

Figure 8. Effect on Model Prediction Quality of Removing Individual Feature Classes or Adjustments from the Best Hard Court Model

Subtraction model variants with feature adjustment removals (hatched bars) were recomputed such that downstream adjustments in the workflow (see Fig. 1) were calculated as in the best full model, but with the removed upstream adjustment set to 1. For example, in the no Decay-Time Weighting subtraction model (hatched lavender bar), strength of schedule (SOS) adjustment was conducted on features in which every match (in either the long-term or short-term feature variant) was weighted equally.

Discussion

In this project, outcomes of individual top-level men's professional singles tennis matches were predicted with machine learning techniques, using as model inputs features derived from publicly-available historical match data. Each of the hyperparameter-optimized machine learning regression models tested (Linear, Random Forest, Gradient Boosting, HistGradient Boosting) performed similarly well in prediction of the target feature (TF; % points won by each player), with Gradient Boosting Regressor yielding a slightly better prediction quality (as measured by mean 5-fold cross validation training error) than the others for both surface models (clay and hard courts). 'Best Model' prediction quality for both surfaces, however, was not quite as high as that yielded by linear models with pre-match wagering line-derived implied win probability (IWP) as their sole inputs ('Opening IWPs' and 'Closing IWPs'). Pre-match IWPs reflect the collective "wisdom of the crowd", incorporating such factors as emergent information on injury and illness, up-to-the-minute weather information, qualitative impressions of player form from match and practice observation, and untoward events like collusive match fixing efforts. Not surprisingly, 'Closing IWP' was superior to 'Opening IWP', though even 'Opening IWP' outperformed 'Best Model' for both the hard and clay surfaces.

While 'Best Model' prediction quality was not as high as that achieved with IWP-derived models, it is encouraging that the approach taken in this project still outperformed models derived from pre-match player rankings and ranking derivatives (e.g., log of player ranking points). Rankings are relied upon by the ATP Tour to determine event seeding and player entry status. Unlike rankings, the modeling approach used in this project took into account variables such as surface-specificity, court conditions, player fatigue, short-term player performance, and matchup-specific considerations (e.g., past head-to-head outcomes).

The 'Best Model' for hard courts yielded a higher prediction quality than that for clay courts, even when the hard court sample was randomly reduced to the size of the clay court sample. Likewise, 'Opening IWPs' and 'Closing IWPs' performed better on hard court data than on clay court data. One likely reason that hard court tennis was more predictable than clay court tennis is the relative dominance of the serve on hard courts (see Supp. Fig. 3A-D below). Related to this serve dominance is the fact that matches are shorter on hard courts (Supp. Fig. 3E-F), with the implication that immediate fitness and fatigue are likely less important on hard courts. Other plausible contributors to the greater predictability of hard court vs clay court outcomes are the greater variability of surface quality on clay courts and the much greater percentage of hard court matches played indoors under "ideal" conditions (~25% vs ~4%; see Supp. Fig. 4). Additionally, there were 2x as many Best of 5 sets matches on hard courts as on clay courts in the sample. As seen in Fig. 5, on both surfaces 'Best Model' and the market models alike yielded considerably more accurate predictions for Best of 5 sets matches than for Best of 3 sets matches. Much of this greater accuracy can be attributed simply to the larger number of points played in these matches, though some is likely related to other aspects of the Grand Slam environment (e.g., better playing conditions, more transparency/media spotlight on injuries and illness, and peak motivation for all players involved).

Prediction quality for 'Best Model' improved as matches with increasingly large amounts of pre-match IWP movement were filtered out. Furthermore, IWP movement was greater for 1st Round matches than for any other tournament round, and 'Best Model' prediction quality improved with the removal of 1st Round matches. Uncertainty in player injury and illness status, as well as uncertainty around playing conditions, is highest for 1st Round matches relative to the remainder of a tournament. Collusion between players and bettors is also far more likely to occur when a specific matchup is guaranteed and there is more than a day or two to make arrangements before the match. Even when filtering out matches with large degrees of pre-match IWP movement and 1st Round matches, however, 'Best Model' per surface still did not perform as well as pre-match IWP-derived models. The market certainly doesn't "know" everything going on with players and playing conditions beyond the data, but it "knows" enough to stay ahead of stats-based models as conditions shift.

In a Subtraction Analysis, it was seen that removal of surface-specific 'Short-Term Player Past-Performance Features' (i.e., from the 10 matches immediately previous to the match being predicted) had a more negative impact on model prediction quality than did subtraction of any other feature type or feature adjustment type. Surprisingly, this had a much larger negative impact on model quality than subtraction of features based on aggregate, adjusted IWP from previous matches over both the short and long term. The implication of this finding is that the "wisdom of crowds" is superior to the match stats-based approach when it comes from immediately before the match to be predicted, but a stats-based approach is superior to market sentiment from PREVIOUS matches. This analysis also revealed that 'Decay-Time Weighting' of features derived from previous matches, 'Strength of Schedule Adjustment', and 'Player Conditioning Features' (e.g., fatigue, stamina, travel burden) were all relatively important to 'Best Model' prediction quality. Surprisingly, subtraction of 'Long-Term Player Past-Performance Features' (i.e., from the 60 surface-specific matches immediately previous to the match being predicted) had only a minimal negative impact on 'Best Model' prediction quality. It is possible that this relative lack of impact resulted from not identifying the best inclusion range or time-decay weighting function for capturing long-term performance, as these parameters were explored systematically but by no means exhaustively.

As for future improvements to this modeling approach, I am extremely interested in acquiring and adding features derived from previous match tracking data. This would include serve and ground stroke velocities, spin rates, player movement (speed, distance, and efficiency) data, and depth of shot data. These types of data would not only be useful as point outcome-independent, "under the hood" assessments of past player performance, but could also potentially be bootstrapped into better assessments of court/playing conditions in an ongoing tournament. In the current modeling approach, 'Tournament Conditions Features' were derived solely from tournament field-adjusted outcome data (e.g., ace%, service points won%) in previous rounds of an ongoing tournament. The 'Subtraction Analysis' revealed that this approach to deriving 'Tournament Conditions Features' had only a minimal effect on prediction quality relative to no adjustment. More generally, the broader goal is to convert this modeling approach into an actionable one in terms of wagering. To that end, instead of employing regression-type machine learning models to predict points won %, the next iteration will use classifier model variants to directly derive IWPs from the confidence of a win or a loss (e.g., from log-loss ratios). These predicted IWPs can then be compared to those derived from wagering lines, and rules for optimal betting thresholds and sizes, based on divergence of predicted IWPs from wagering line-derived IWPs, can be identified via simulated wagering on matches subsequent to the modeling time range.
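The proposed classifier-based next step might look like the following sketch. Everything here is a stand-in: synthetic data, a default-configured classifier, and an illustrative 10% divergence threshold (the actual thresholds would be tuned via simulated wagering, as described above):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))                          # stand-in feature matrix
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)   # 1 = player won

# Fit a classifier on earlier matches; the held-out tail stands in for
# matches subsequent to the modeling time range.
clf = GradientBoostingClassifier(random_state=0).fit(X[:250], y[:250])

# predict_proba yields a model-derived implied win probability per match,
# directly comparable to the market's wagering-line-derived IWP.
model_iwp = clf.predict_proba(X[250:])[:, 1]

# Hypothetical rule: flag matches where the model diverges from the
# market IWP by more than a tunable threshold.
market_iwp = rng.uniform(0.3, 0.7, size=50)  # stand-in market IWPs
edge = model_iwp - market_iwp
bets = np.flatnonzero(np.abs(edge) > 0.10)
print(len(bets), "candidate bets out of", len(edge))
```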

Supplementary Figures

Supplementary Figure 1. Effect of Time Range Inclusion on Best Model Accuracy

'Best Model' was found for each surface separately for the time ranges 201X-2019. Each model always included at least 2 additional years of data for predictive feature accrual (e.g., the 2011-2019 model additionally had data from 2009-2010 included during feature generation). Model accuracy peaked with the inclusion of 2015-2019 (hard court) or 2016-2019 (clay court). For clay court modeling, however, the standard deviation of the training error was substantially lower with the inclusion of a few more years of data; hence the decision in the main results to mostly include 2012-2019 in the modeling stage, despite the fact that mean training error was higher with this expanded time range. On both surfaces, a linear model with aggregate closing IWP as the sole input ('Closing IWP Model') outperformed 'Best Model' but largely tracked it qualitatively across the range of time inclusions evaluated.

Supplementary Figure 2. Effect of Previous Matches Played Threshold on Best Model Accuracy

Best models were found, per surface, separately for a number of minimum previous matches played thresholds. For a match to be included at a given threshold, BOTH players must have played the minimum number of prior matches on that surface. The hard court model variants included matches for the years 2015-2019 (2009-2014 included during predictive feature generation for accrual). The clay court model variants included matches for the years 2012-2019 (2009-2011 included during predictive feature generation for accrual). Model accuracy on hard courts peaked at a very high minimum threshold (140 matches), though a threshold of 20 matches was selected for modeling because model accuracy was sufficiently high with very low standard deviation and inclusion of the large majority of the overall sample. For clay courts, model accuracy peaked with a threshold of 60 prior matches, though a 20 match threshold resulted in minimal standard deviation and acceptable accuracy while including over half of the overall sample. On both surfaces, a linear model with aggregate closing IWP as the sole input ('Closing IWP Model') outperformed 'Best Model' but largely tracked it qualitatively across the range of thresholds evaluated.

Supplementary Figure 3. Aggregate Match Statistics for Hard Court vs Clay Court Tennis

For both surfaces, all matches played between players who BOTH had 20 previous same-surface matches in 2015-2019 were included. Taken together, these aggregate statistics demonstrate that the serve was more important to success on hard courts than on clay courts over this time range (A-D). Additionally, and related to the relative importance of the serve to player success on hard courts, both Grand Slam (GS) and non-Grand Slam (non-GS) matches were shorter on hard courts than on clay courts (E-F).

Supplementary Figure 4. Hard Court Best Model Indoor vs Outdoor Matches

For both the 2015-2019 (left panel) and 2011-2019 (right panel) model inclusion time ranges, all hard court matches played between players who BOTH had 20 previous same-surface matches in 2015-2019 were included. While, as seen in the analysis summarized in Supp. Fig. 1, overall prediction quality is slightly lower with the expanded time range, the expansion allows a larger sample of relatively scarce indoor matches to be included. In this expanded time range analysis there is a suggestion (albeit still with a large amount of error) that indoor matches, which are played under "ideal" conditions and favor the server even more strongly than hard court matches overall, are better predicted by both 'Best Model' and the "wisdom of the market" ('Closing IWP Model').

Supplementary Figure 5. Aggregate Match Statistics for Outdoor Hard Court vs Indoor Hard Court Tennis

For both outdoor and indoor hard courts, all matches played between players who BOTH had 20 previous same-surface matches in 2015-2019 were included. Taken together, these aggregate statistics demonstrate that the serve was slightly more important to success on indoor hard courts than on outdoor hard courts over this time range (A-D). Additionally, and related to the relative predominance of the serve on indoor hard courts, non-Grand Slam (non-GS) matches were shorter on indoor hard courts than on outdoor hard courts (E).
