Mathematical Models and Formulas for Soccer Betting
Instead of relying on a fixed model such as Poisson or Elo, one can use machine learning (ML) to predict match outcomes from data. The simplest and most interpretable approach is logistic regression, a statistical model commonly used for binary and multi-class prediction. In a betting context, logistic regression can estimate the probability of a home win, draw, or away win given input features describing the match.
Logistic Regression (Binary outcome): Logistic regression models the log-odds of an outcome as a linear combination of features. For example, to predict the probability p of the home team winning, a logistic model might be:
$$\text{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n \,.$$
Here $X_1, X_2, \dots, X_n$ are features (predictor variables) and $\beta_i$ are the coefficients learned from data. The model outputs $p = 1/(1 + e^{-\text{logit}(p)})$, a number between 0 and 1 representing the win probability. For multi-class problems (win/draw/lose), this can be extended (e.g. using multinomial logistic regression or training separate one-vs-rest models).
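To make the link function concrete, here is a minimal sketch in Python of how a fitted model turns features into a win probability; the feature values and coefficients are made-up numbers for illustration, not estimates from real data:

```python
import numpy as np

def win_probability(features, coefficients, intercept):
    """Apply the logistic link: p = 1 / (1 + exp(-(intercept + coefficients . features)))."""
    logit = intercept + np.dot(coefficients, features)
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical example with two features (e.g. Elo difference / 100 and recent goal-form difference)
# and made-up coefficients standing in for values learned from historical matches.
beta = np.array([0.8, 0.3])
beta0 = 0.2  # intercept, capturing a baseline home advantage in this toy setup
p_home_win = win_probability(np.array([1.5, 0.4]), beta, beta0)
print(f"P(home win) = {p_home_win:.3f}")  # ~0.82 for these made-up inputs
```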
Features: A key part of ML is deciding what features to include. Features could be team-specific stats and recent performance indicators. For instance:
Team ratings (like Elo or FIFA ranking or market odds-based ratings).
Recent form (wins/losses in last 5 games, goal difference in last few matches).
Head-to-head history, if relevant.
Offensive/defensive metrics (average goals scored/conceded, shots, expected goals metrics).
Situational factors: home/away, rest days, injuries (if quantifiable), etc.
One example approach: treat each team-match as a data point, with features like the team’s rolling average goals scored and conceded in the last 5 games, and the opponent’s corresponding stats. In a Premier League dataset, this yielded roughly 90 initial features, which can then be pruned to avoid multicollinearity or overfitting. In one case, these were reduced to 6 key features for a Bayesian logistic model due to high correlation among many of the raw stats.
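A minimal sketch of how such rolling-form features might be built with pandas; the column names (`team`, `date`, `goals_for`, `goals_against`) are assumptions about the layout of your match data:

```python
import pandas as pd

def add_rolling_form(matches: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Add rolling average goals scored/conceded over the previous `window` games per team.

    `matches` is assumed to have one row per team per match. shift(1) excludes the
    current match, so each feature only uses information available before kickoff.
    """
    matches = matches.sort_values(["team", "date"]).copy()
    grouped = matches.groupby("team")
    matches["avg_scored_last5"] = grouped["goals_for"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    matches["avg_conceded_last5"] = grouped["goals_against"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return matches
```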
Training the model: You need historical match data with outcomes to fit the logistic regression (i.e. determine the coefficients β). This typically involves maximizing the likelihood of the observed results. If using a Bayesian logistic regression, you would also specify prior distributions for the coefficients and use methods like MCMC to get posterior distributions, as was done in a recent study predicting Premier League matches. Bayesian training gives a distribution (with mean and credible interval) for each coefficient, reflecting uncertainty.
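As a minimal sketch of the (non-Bayesian) maximum-likelihood fit using scikit-learn; the feature matrix here is random placeholder data just to show the mechanics, and a Bayesian version would instead place priors on the coefficients and sample posteriors with MCMC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: one row per match with engineered features; y: 0 = away win, 1 = draw, 2 = home win.
# Random placeholders stand in for a real historical dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = rng.integers(0, 3, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Maximum-likelihood fit of the three-outcome logistic model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)   # one probability per class, columns ordered as model.classes_
print(model.classes_, probs[0])
```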
Output: The trained logistic model can output a probability for the home team win (and similarly for the away win or draw). For example, it might output p(Home Win) = 0.65 for a certain matchup given the features. That implies roughly a 65% chance of a home win, with perhaps 20% draw and 15% away making up the remainder (a multinomial model would output all three probabilities directly).
Evaluation: One must evaluate how well the model predicts outcomes (accuracy, Brier score for probabilities, calibration — do events given 60% probability happen 60% of the time? etc.). Checking for overfitting is important: if you use too many features or too complex a model relative to the data, it might fit past games very well but fail to predict future matches (overfitting). Techniques like cross-validation, regularization (penalizing large coefficients), or feature selection (as mentioned, reducing 90 features to a smaller set) help ensure the model generalizes.
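A minimal sketch of these checks, assuming `probs` and `y_test` come from a fit like the one above: a multi-class Brier score and a simple binned calibration table comparing predicted probability to observed frequency.

```python
import numpy as np

def multiclass_brier(probs: np.ndarray, outcomes: np.ndarray, n_classes: int = 3) -> float:
    """Mean squared error between predicted probability vectors and one-hot outcomes (lower is better)."""
    one_hot = np.eye(n_classes)[outcomes]
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1)))

def calibration_table(p_event: np.ndarray, happened: np.ndarray, bins: int = 10):
    """For each probability bin, compare mean predicted probability to the observed event rate."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_event >= lo) & (p_event < hi)
        if mask.any():
            rows.append((lo, hi, p_event[mask].mean(), happened[mask].mean(), int(mask.sum())))
    return rows  # (bin_low, bin_high, mean_predicted, observed_rate, n_matches)

# Example usage (class index 2 assumed to be "home win"):
# score = multiclass_brier(probs, y_test)
# table = calibration_table(probs[:, 2], (y_test == 2).astype(float))
```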
Other ML models: Beyond logistic regression, more complex models have been applied to soccer:
Random forests / Gradient boosting: These tree-based ensemble methods can capture nonlinear interactions between features. For example, a boosted decision tree model might automatically discover that when Team A’s attack rating is very high and Team B’s defense is very low, the win probability is extremely high, without you having to specify a formula. These can improve accuracy given enough data, but they risk overfitting if not tuned carefully and usually sacrifice interpretability.
Poisson regression or ordinal regression: Instead of modeling win/draw/lose directly, one can model the goals scored by each team using regression (possibly with a Poisson likelihood or a modern variant like xG models), then derive win/draw probabilities from the predicted score distributions (see the sketch after this list).
Neural networks: Some have attempted using neural nets to combine various inputs (like form, player data, etc.) to predict match outcomes. While powerful, they require large data and careful feature engineering or they might just overfit or be inconsistent week to week.
Hybrid models: For instance, you might use an Elo rating as one input feature among others in a logistic regression, capturing team strength, and let the model also account for situational factors or recent deviations in form.
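As a minimal sketch of the goal-model route mentioned in the Poisson bullet above, the predicted score distribution can be summed over a grid of scorelines to get win/draw/loss probabilities; the expected-goal figures here are illustrative assumptions:

```python
from math import exp, factorial

def poisson_pmf(k: int, mu: float) -> float:
    """Probability of exactly k goals given an expected goal rate mu."""
    return exp(-mu) * mu ** k / factorial(k)

def outcome_probs(mu_home: float, mu_away: float, max_goals: int = 10):
    """Derive P(home win), P(draw), P(away win) from independent Poisson goal models."""
    p_home = p_draw = p_away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, mu_home) * poisson_pmf(a, mu_away)
            if h > a:
                p_home += p
            elif h == a:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away

# Illustrative expected goals: home 1.6, away 1.1
print(outcome_probs(1.6, 1.1))
```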
The advantage of statistical/ML models is flexibility – you can incorporate many pieces of information. A logistic model might include a term for home advantage that it learns from data rather than fixing it; it could include interactions (maybe certain teams are especially strong at home beyond the norm, etc.). However, complexity must be managed: too many parameters can lead to overfitting, and one must always validate the model on unseen matches to gauge true predictive power.
In practical betting, some of the best strategies use a combination: e.g. start with a sound base model (like Poisson with team strengths or Elo), then use machine learning to adjust or refine predictions using additional data (like injuries or advanced stats). ML models can also estimate the uncertainty of predictions, which helps in staking decisions (e.g. if a model is only slightly confident, you might bet smaller).
Bayesian Methods and Updates
Bayesian methods in sports betting involve updating probabilities or model parameters as new information (data) arrives, using Bayes’ Theorem. A Bayesian approach is especially useful when dealing with limited data or when you have prior knowledge you want to incorporate. Two main contexts for Bayesian thinking in soccer betting are:
Bayesian updating of team ratings/strengths: Instead of assuming a team’s attack strength or Elo rating is fixed, we treat it as a random variable with a probability distribution. Before the season (or with little data), our belief about the team’s strength might be centered around some average (with uncertainty). As matches are played, we update this belief. For example, suppose prior to the season, Team X’s average goal rate is assumed to be around 1.2 goals/game with fairly broad uncertainty. After a few high-scoring games, a Bayesian update will shift the distribution of Team X’s attack strength upward (and narrow its uncertainty). This is essentially how a Bayesian rating system could work, continually updating a distribution for each team’s ability based on match results (a minimal sketch of this sequential update follows these bullets).
One could implement a hierarchical model: each team i has an attack parameter $\alpha_i$ and defense parameter $\delta_i$ (like in Dixon-Coles goal model). Put priors on these (e.g. around league average). Each match outcome gives a likelihood for the parameters of the two teams involved. Using Bayes’ theorem, you multiply the prior by the likelihood of the observed result to get a posterior. This posterior then serves as the new prior for the next update. MCMC or variational Bayes can be used to infer the posterior distributions if the model is complex.
For instance, if Team X and Team Y play a 3-3 draw unexpectedly (high score), the update might raise both teams’ attack posteriors or lower defense ratings. Bayesian models can account for uncertainty – if a weird result happens, they might widen the uncertainty if it doesn’t fit prior expectations, rather than completely shifting to a new value (depending on the weighting of prior vs data).
Bayesian inference in logistic models: As demonstrated by the example in section 2.3, one can put priors on the coefficients of a logistic regression. For example, you might have a prior that the “home advantage coefficient” is around some value (based on historical leagues) but with some variance. The data then updates this. The result is a posterior predictive distribution for match outcomes, which inherently gives you not just a single probability estimate but a distribution of possible probabilities given the uncertainty in parameters. This is valuable for understanding confidence in your predictions.
Bayesian reasoning for odds: You can use Bayes’ theorem directly to update probabilities as conditions change. For example, in-play betting often uses Bayesian updating: you have a prior probability for a team to win before the match, then if that team scores first, you update the win probability given that new evidence (essentially using a likelihood of scoring first given eventually winning, etc.). Some sophisticated models update win probabilities in real time using Bayesian networks (taking into account time remaining, score, etc. – these are used in win probability charts).
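To make the sequential-updating idea from the first bullet concrete, here is a minimal sketch using a conjugate Gamma prior on a team's goals-per-game rate, so each result updates the posterior in closed form; the prior values are illustrative assumptions, and a full hierarchical attack/defense model would need MCMC rather than this closed-form shortcut:

```python
from dataclasses import dataclass

@dataclass
class GoalRateBelief:
    """Gamma(alpha, beta) belief about a team's goals-per-game rate (Poisson likelihood)."""
    alpha: float  # shape: prior "pseudo-goals" plus goals observed so far
    beta: float   # rate: prior "pseudo-games" plus games observed so far

    def update(self, goals_scored: int, games: int = 1) -> "GoalRateBelief":
        # Conjugate Gamma-Poisson update: posterior parameters just accumulate goals and games.
        return GoalRateBelief(self.alpha + goals_scored, self.beta + games)

    @property
    def mean(self) -> float:
        return self.alpha / self.beta

# Prior: Team X is believed to score about 1.2 goals/game, worth roughly 10 games of evidence.
belief = GoalRateBelief(alpha=12.0, beta=10.0)
for goals in [3, 2, 4]:            # three unusually high-scoring games
    belief = belief.update(goals)
print(round(belief.mean, 2))       # posterior mean rises from 1.2 toward ~1.6
```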
Bayesian approach advantages:
It provides a natural way to include prior knowledge (e.g., Team A is usually strong, even if first 2 games they underperformed, you don’t throw away that prior completely).
It quantifies uncertainty. Instead of saying “Team A win probability is 60%,” a Bayesian model might say “with 90% credibility, Team A’s win probability is between 50% and 70%” given the data. If uncertainty is high, a bettor might demand a bigger edge before betting.
It can be updated sequentially. Each match you can update your model without retraining from scratch, making it adaptive.
Example – Bayesian updating of a probability: Suppose before a match you think Team A has 40% win chance (implied odds 2.50). Now you learn that the opponent’s star striker is out injured. You might want to update that 40% higher. Bayesian updating could be formalized if you had a prior distribution on something like “Team A’s advantage” and the injury info as new data modifying the likelihood of outcomes. In practice, many bettors do this subjectively (adjust odds by some factor), but a Bayesian model could integrate such information if you have data (e.g., historical impact of star player absences).
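To show what that formalization might look like, here is a minimal sketch of the Bayes'-theorem arithmetic; the conditional probabilities (how often the striker is absent in matches Team A goes on to win versus not win) are hypothetical numbers for illustration, not estimates from real data:

```python
def bayes_update(prior: float, likelihood_if_win: float, likelihood_if_not: float) -> float:
    """P(win | evidence) = P(evidence | win) * P(win) / P(evidence)."""
    numerator = likelihood_if_win * prior
    evidence = numerator + likelihood_if_not * (1.0 - prior)
    return numerator / evidence

# Prior: Team A wins 40% of the time.
# Hypothetical likelihoods: the opposing star striker was absent in 20% of Team A's wins
# but in only 10% of the matches Team A failed to win.
posterior = bayes_update(prior=0.40, likelihood_if_win=0.20, likelihood_if_not=0.10)
print(round(posterior, 3))  # 0.571: the injury news raises the estimated win chance
```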
In summary, Bayesian methods are a more advanced topic, but they underpin many modern statistical models in sports. A well-known academic example is the Bradley-Terry model (for pairwise outcomes) and its extensions; one can put a Bayesian framework on it to estimate team strengths. Another example: the TrueSkill rating system (used for online gaming rankings) is essentially a Bayesian Elo that updates a distribution for skill after each game. The key takeaway for a bettor is that Bayesian thinking encourages you to update your beliefs about teams as new information comes in, and to account for uncertainty rather than just point estimates. This can prevent overconfidence in bets – if your model has huge error bars, you might bet more conservatively (or not at all) until more data firm up your predictions.
Value Betting Techniques
A value bet occurs when you believe the true probability of an outcome is higher than what the bookmaker’s odds imply. In other words, the bookmaker is underestimating that outcome (offering odds that are “too high” given the actual likelihood). Placing bets only when you have identified value is essential for long-term profitability – it’s how you ensure you have positive expected value (+EV) on your side.
Identifying value:
Convert odds to implied probability: As discussed in Section 1.4, take the bookmaker’s odds for an outcome and convert to implied probability (e.g. decimal odds 2.80 → 35.7% implied chance).
Estimate your own probability: Using your predictive model (Poisson, Elo, machine learning, etc.), come up with your probability for the same outcome. This is your true probability estimate.
Compare the probabilities (or implied odds): If your estimated probability is higher than the bookmaker’s implied probability, the bet is favorable. Equivalently, if your “fair odds” (1/your probability) are lower than the bookie’s odds, it indicates value on that side.
Ensure positive EV: Compute the expected value to be sure. A quick test: value exists if
$$\text{EV} = p_{\text{your}} \times (\text{Payout}) - (1 - p_{\text{your}}) \times (\text{Stake}) > 0.$$
This simplifies to $p_{\text{your}} \times \text{odds} > 1$ for decimal odds, since the net payout on a win is $(\text{odds} - 1) \times \text{stake}$. Thus, a value bet means $p_{\text{your}} \times O > 1$; if instead $p_{\text{your}} \times O = 1$, it's a fair bet (zero EV), and if $p_{\text{your}} \times O < 1$ it's negative EV.
Example: You estimate Team X has a 50% chance to win (0.50). The odds on Team X are 2.20 (implied 45.5%). Since 0.50 × 2.20 = 1.10 (>1), this is a positive EV bet. The expected value per $1 is $1.10 – $1 = $0.10, or +10% of stake. If you stake $100, EV = $10 profit on average. Conversely, if odds were 1.80 (implied 55.6%) for Team X, then 0.50 × 1.80 = 0.90, a negative EV (you’d expect to lose money over time).
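A minimal helper reproducing that arithmetic, using the probability and odds from the example above:

```python
def expected_value(p: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit: win (odds - 1) * stake with probability p, lose the stake otherwise."""
    return p * (decimal_odds - 1.0) * stake - (1.0 - p) * stake

print(round(expected_value(0.50, 2.20, 100), 2))  # +10.0, the positive-EV case
print(round(expected_value(0.50, 1.80, 100), 2))  # -10.0, the negative-EV case
```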
Using EV to compare bets: Expected value can also help decide between multiple potential bets. For example, if two different bets have +EV, the one with higher EV% (profit as % of stake) is theoretically the better opportunity (though consider variance and limits too).
Real-world example (from earlier): The Smarkets odds for Manchester United to win at Arsenal were 2.78 (implied ~35.97%). If your Poisson or Elo model gave Man U a 40% chance, then:
EV = 0.40 × (Odds × Stake − Stake) − 0.60 × Stake. For a £50 stake the net payout on a win is 50 × (2.78 − 1) = £89, so EV = (0.40 × 89) − (0.60 × 50) = +£5.60 per £50 staked, or +11.2% of the stake. Initially, using the market-implied 35.97%, the EV was negative (about −£0.96 per £50), but with your higher probability the EV turns positive. This swing from -EV to +EV underscores how having a better prediction than the bookmaker can create value. In this scenario, you’d label the Man U win a value bet and expect profits if your 40% assessment is accurate over the long run.
Value betting tips:
Specialize and find discrepancies: Bookmakers are quite accurate for major markets (odds move to incorporate a lot of information). Value often exists in niche markets or less popular leagues where the bookmaker’s models are not as sharp. Value can also appear when odds first open, or when you act on news faster than the market. By specializing, you might know something the general market doesn’t, or you can develop a model that outperforms the market in that domain.
Account for bookmaker margin: Remember that the book’s odds include a margin, so the implied probabilities across a market sum to more than 100%. If your estimated probability only equals the raw implied probability (1/odds), the bet has zero expected value at best, and merely beating the no-vig “fair” probability is not enough: you need to beat it by more than the vig so that your probability exceeds 1/odds. For example, if an outcome is listed at 2.0 (50% implied) and you also make the true chance 50%, it’s a no-bet (zero EV).
Use Kelly (or a fraction of it) to size bets: Once you identify a value bet, decide the stake wisely. The Kelly criterion (Section 1.3) can be used to maximize bankroll growth by betting proportionally to your edge. For instance, if you think a bet has a 10% edge, Kelly might say to bet ~5% of bankroll (depending on the odds). Many bettors use half-Kelly for safety (see the sizing sketch after this list). This prevents over-betting on small edges or under-betting on big ones.
Keep records and calculate your realized ROI: Over a large sample of your value bets, track how you do. A positive long-term ROI confirms that you indeed have an edge (your probabilities were, on average, better than the market’s). Short-term variance is high – you can lose many bets in a row even if each had positive EV, especially at higher odds – but if after 500 bets you have a solid positive ROI, that’s a good sign your value identification is working.
Beware of biases: It’s easy to overestimate probabilities for your favorite team or fall into traps like the Gambler’s Fallacy. Always base your probability estimates on data-driven analysis or proven models, not gut feelings. If using a model, constantly validate it. If adjusting a model’s output (e.g., “I think the model doesn’t know Team A has a new striker, so I’ll bump them from 30% to 35%”), do so with caution and document these adjustments to see if they help over time.
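A minimal Kelly-sizing sketch, as referenced in the Kelly tip above; the inputs reuse the 50%-at-2.20 example, and the half-Kelly multiplier is one common conservative choice:

```python
def kelly_fraction(p: float, decimal_odds: float, fraction: float = 1.0) -> float:
    """Kelly stake as a share of bankroll: f* = (p * b - q) / b, where b = odds - 1 and q = 1 - p.

    Returns 0 when there is no edge (never stake a negative Kelly fraction).
    """
    b = decimal_odds - 1.0
    q = 1.0 - p
    f_star = (p * b - q) / b
    return max(0.0, f_star) * fraction

# 50% estimated chance at odds 2.20, half-Kelly for safety:
print(round(kelly_fraction(0.50, 2.20, fraction=0.5), 4))  # ~0.0417, i.e. bet about 4.2% of bankroll
```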
Example of value betting workflow: Imagine you run a Poisson model for every game week. You get probabilities for each match outcome. You convert all those to “fair odds”. You then scan the bookmakers’ odds for the same matches. Any outcome where Bookie Odds > Fair Odds (your model) by a decent margin is flagged. Say your model says a draw should be 3.0 (33.3%) but a book offers 3.5 (28.6% implied); that’s a value gap (your estimated probability is 5% higher than market). You double-check any team news or factors your model might have missed. If confident, you place the bet. Over the season, you monitor which types of bets yield the best returns (maybe you find home underdogs or certain mid-table matchups are where the model does best). You refine the model or your criteria as needed. This systematic approach is what many quantitative bettors use.
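A minimal sketch of that scanning step; the model probabilities, bookmaker odds, and the minimum-edge threshold are all placeholder assumptions you would replace and tune:

```python
def find_value_bets(model_probs: dict, book_odds: dict, min_edge: float = 0.03):
    """Flag outcomes where the model's probability beats the bookmaker's implied probability.

    model_probs and book_odds map outcome labels (e.g. 'home', 'draw', 'away')
    to the model probability and the bookmaker's decimal odds respectively.
    """
    flagged = []
    for outcome, p_model in model_probs.items():
        odds = book_odds[outcome]
        p_implied = 1.0 / odds
        edge = p_model - p_implied
        if edge >= min_edge:
            flagged.append((outcome, p_model, odds, round(edge, 3)))
    return flagged

# Placeholder figures echoing the draw example above (model ~33.3%, book 3.5 -> 28.6% implied):
print(find_value_bets({"home": 0.45, "draw": 0.333, "away": 0.217},
                      {"home": 2.10, "draw": 3.50, "away": 4.60}))
```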
In summary, value betting is the practice of consistently exploiting discrepancies between your probability estimates and the market’s. It requires solid models, discipline to only bet when there is value, and good bankroll management. By always asking “Is this bet +EV?” and backing it up with calculations, you tilt the odds in your favor.