Bet Wisely: Predicting the Scoreline of a Football Match using Poisson Distribution

“The biggest religion in the world is not even a religion.” - Fernando Torres

Spanish footballing giant Sevilla FC, together with FC Bengaluru United, one of India’s most exciting football teams, has launched a Football Hackathon: Data-Driven Player Performance Assessment. This hackathon is a unique opportunity to apply data science to professional football scouting and to player performance analysis and enhancement, and we are excited to have you on the journey.

Football Hackathon is LIVE NOW!

Introduction

Football is loved by all, and its beauty lies in its unpredictable nature. One thing strongly associated with the game is its fans, brooding and debating before a match over who will win. Some fans even go as far as speculating the scoreline before the match. So let’s try to answer some of these questions logically.

Getting to know Poisson

Suppose your friend says that, on average, 2 goals are scored per game. Is he right? And if he is, what are the actual chances of seeing exactly two goals in a match? This is where the Poisson distribution comes to our rescue: it gives the probability of observing ‘n’ events (read ‘n’ goals) in a fixed time period, given the expected number of events (the average number of events per time period). Let’s see it mathematically:

$$P(n) = \frac{\lambda^{n} e^{-\lambda}}{n!}$$

(where λ = average events per time period)
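
To make the friend's question concrete, here is a minimal sketch that evaluates the Poisson probability λ^n · e^(−λ) / n! directly. The helper name `poisson_pmf` is made up for this illustration, and the value λ = 2 is simply the friend's claim, not something measured from data.

```python
import math

def poisson_pmf(n, lam):
    """Probability of exactly n events when lam events are expected on average."""
    return lam ** n * math.exp(-lam) / math.factorial(n)

# The friend's claim: 2 goals per game on average
p_two_goals = poisson_pmf(2, 2.0)
print(f"P(exactly 2 goals) = {p_two_goals:.3f}")
```

So even if the friend's average is right, a match with exactly two goals would only turn up roughly 27% of the time.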

Chances of Scoring

Now let’s answer some questions with this equation. But first we need data, so I downloaded the “International football results from 1872 to 2020” dataset from Kaggle. A sample of the dataset is shown below.

code:

data.head(3)
[Image: first three rows of the dataset]

Let’s start by finding the average number of goals we can expect within 90 minutes.

For this, I created a separate dataset containing only matches played in the 21st century (2000-2020). I added home_score and away_score to get the total number of goals in each match, then took the mean of the total goals column to get the average number of goals we can expect in a match.

code: 

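# Total goals scored in each match
data['total_goals'] = data['home_score'] + data['away_score']
# Keep only the year from the 'YYYY-MM-DD' date string
data['date'] = data['date'].apply(lambda x: int(x.split('-')[0]))
# Matches played from 2000 onwards
rec_data = data.loc[data['date'] >= 2000]
# Highest-scoring match of that period (shown when run in a notebook cell)
rec_data.iloc[[rec_data.total_goals.argmax()]]
print(rec_data.total_goals.mean())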

2.744112130054189

Now, putting this expectation into the Poisson distribution formula, let’s see what the actual chances of seeing 3 goals in a match are.

Wow, only a mere 22% chance. Let’s plot the probabilities of the number of goals in a match to get a better picture.
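
The 22% figure can be checked with a few lines of Python. This is a sketch rather than the original notebook code: it assumes SciPy is installed and hard-codes the average printed earlier; the variable names are illustrative.

```python
from scipy.stats import poisson

# Mean total goals per match, copied from the output above
avg_goals = 2.744112130054189

# Probability of exactly 3 goals in a match
p_three = poisson.pmf(3, avg_goals)
print(f"P(exactly 3 goals) = {p_three:.3f}")

# Probability of each goal count from 0 to 8, the values a bar plot would show
goal_probs = {k: poisson.pmf(k, avg_goals) for k in range(9)}
for k, p in goal_probs.items():
    print(k, round(p, 3))
```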

[Plot: Poisson probability of each number of goals in a match]

Now from this, we can calculate the probability of seeing ‘x’ or fewer goals simply by adding the probability of ‘x’ goals to the probabilities of every smaller number of goals. And by subtracting this sum from 1, we get the probability of seeing more than ‘x’ goals in a match. Let’s plot this too.
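
As a rough sketch of that calculation (assuming SciPy, with x = 2 chosen purely as an example), the cumulative probability can be built by summing the individual probabilities and compared with SciPy's built-in cumulative distribution function:

```python
from scipy.stats import poisson

avg_goals = 2.744112130054189  # mean total goals per match, from earlier
x = 2                          # example threshold, chosen for illustration

# P(X <= x): add up the probabilities of 0, 1, ..., x goals
p_at_most = sum(poisson.pmf(k, avg_goals) for k in range(x + 1))

# P(X > x): everything that is left over
p_more_than = 1 - p_at_most

# SciPy's cdf gives the same cumulative probability in one call
p_at_most_cdf = poisson.cdf(x, avg_goals)

print(f"P(at most {x} goals)   = {p_at_most:.3f}")
print(f"P(more than {x} goals) = {p_more_than:.3f}")
```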

The wait is over…

Now suppose you have an impatient friend who does not want to sit through the whole game. He comes to you during a match and asks how long he has to wait to see a goal. Woah, that’s a tough question, right? But worry not: ask him to sit through 10000 games and note the time between each goal. Just kidding, obviously he would freak out. In fact, I simulated 10000 matches and found the average time.

The most likely waiting time is 2 minutes. But wait, this is not actually what I was looking for. I want the average time I have to wait to see a goal if I start watching the game at a random time. For that, I will take 10000 instances, where each instance watches 10000 games, calculates the average waiting time between goals in those 10000 games, and reports it back. Finally, I will plot the 10000 reports from my instances and find the expected average waiting time.

It looks like we have to wait approximately 33 minutes. However, we may have to wait longer; this is the classic Waiting Time Paradox.
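
The simulation code is not shown here, so the following is only a sketch under stated assumptions: goals arrive as a Poisson process at the average rate found earlier (about 2.744 per 90 minutes), and play is treated as one continuous stream, ignoring match boundaries. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

avg_goals = 2.744112130054189
mean_gap = 90 / avg_goals  # average minutes between goals, about 32.8

# In a Poisson process the gaps between consecutive goals are exponential
gaps = rng.exponential(mean_gap, size=100_000)
goal_times = np.cumsum(gaps)  # minute at which each goal is scored

# A viewer tunes in at random moments along the timeline
arrivals = rng.uniform(0, goal_times[-1], size=10_000)

# Index of the first goal at or after each arrival
next_idx = np.searchsorted(goal_times, arrivals)

wait = goal_times[next_idx] - arrivals  # minutes until the next goal
containing_gap = gaps[next_idx]         # length of the gap the viewer landed in

print(f"mean gap between goals     : {gaps.mean():.1f} min")
print(f"mean wait from random start: {wait.mean():.1f} min")
print(f"mean gap the viewer lands in: {containing_gap.mean():.1f} min")
```

Under these assumptions the average wait from a random moment comes out close to the average gap between goals, while the gap the viewer happens to land in is about twice as long on average, because long gaps are more likely to be hit. That length bias is what the waiting time paradox refers to.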

Predicting the scoreline

Finally, let’s tackle the question we started with, and the most exciting one: who will win, and what exactly will the scoreline be?

For this, I will use the history between two teams (consider them the home team and the away team), take average_home_score as the expected goals for the home team and average_away_score as the expected goals for the away team, and predict the scoreline using the Poisson distribution. If the teams have met too few times (20 matches or fewer in the code below), we will consider a few factors instead:

HS = Mean of home goals scored by the home team throughout history.

AS = Mean of away goals scored by away team throughout history.

HC = Mean of goals conceded in home matches by the home team.

AC = Mean of goals conceded in away matches by the away team.

So, the home team’s expected score is calculated as (HS + AC) / 2

and the away team’s expected score as (AS + HC) / 2

Wait, the expected score is not the predicted score. The expected score is the average number of goals we expect each team to score in a game between them. The predicted score is the single most likely number of goals under a Poisson distribution with that average.
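
A small worked example may help separate the two ideas. The averages below are hypothetical numbers picked for illustration, not values from the dataset, and the sketch assumes SciPy is available.

```python
from scipy.stats import poisson

# Hypothetical averages, for illustration only (not taken from the dataset)
HS = 2.1  # mean home goals scored by the home team
AC = 1.3  # mean goals conceded by the away team in away matches

# Expected score: an average, so it does not have to be a whole number
expected_home = (HS + AC) / 2

# Predicted score: the most likely whole number of goals for that average
probs = [poisson.pmf(k, expected_home) for k in range(10)]
predicted_home = max(range(10), key=lambda k: probs[k])

print(f"expected home goals : {expected_home:.2f}")
print(f"predicted home goals: {predicted_home}")
```

With an expected score of 1.7, the most likely single outcome is 1 goal, which is why the two quantities should not be confused.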

code:

import pandas as pd
import numpy as np
from scipy import stats

def PredictScore():
    # Team names are normalised (spaces removed, lower-cased); this assumes
    # the names in `data` were normalised the same way beforehand.
    home_team = input("Enter Home Team: ")
    ht = ''.join(home_team.split()).lower()
    away_team = input("Enter Away Team: ")
    at = ''.join(away_team.split()).lower()

    # Head-to-head matches with this exact home/away pairing
    h2h = data[(data.home_team == ht) & (data.away_team == at)]

    if len(h2h) > 20:
        # Enough shared history: use head-to-head averages as expected goals
        exp_home = h2h.home_score.mean()
        exp_away = h2h.away_score.mean()
    else:
        # Too few meetings: fall back to each team's overall record
        home_scored   = data[data.home_team == ht].home_score.mean()  # HS
        home_conceded = data[data.home_team == ht].away_score.mean()  # HC
        away_scored   = data[data.away_team == at].away_score.mean()  # AS
        away_conceded = data[data.away_team == at].home_score.mean()  # AC
        exp_home = (home_scored + away_conceded) / 2
        exp_away = (away_scored + home_conceded) / 2

    # Predicted score = most frequent value among 100000 Poisson draws
    home_goal = int(stats.mode(np.random.poisson(exp_home, 100000))[0])
    away_goal = int(stats.mode(np.random.poisson(exp_away, 100000))[0])
    avg_total_score = int(stats.mode(
        np.random.poisson(h2h.total_goals.mean(), 100000))[0])

    print(f'Expected total goals are {avg_total_score}')
    print(f'They have played {len(h2h)} matches')
    print(f'The scoreline is {home_team} {home_goal}:{away_goal} {away_team}')

Let’s try with Brazil as the home team and Mexico as the away team.

code:

PredictScore()

The Poisson distribution predicts Brazil winning with a 2-0 scoreline. I searched the net and found that the last match between them was played on 2 Jul 2018, and Brazil won 2-0. Well, I got lucky; you may not.

Conclusion

If you want to explore further, no worries, here is my code. This is just a basic way of predicting the game; nowadays classification algorithms are used to predict the outcome and regression algorithms to predict the scoreline. But that’s a topic for another day. Till then, have fun playing with this. Adios!
