Screen showing the EcoVadis integration in Aravo

Aravo and EcoVadis Partner to Support A Smarter Approach to Responsible Procurement

This partnership unlocks a smarter approach to responsible sourcing by bringing trusted sustainability ratings into the procurement, compliance, and risk management workflows within Aravo that are used to manage millions of suppliers across the globe.

Aravo, the leading provider of intelligent automation for third-party risk and performance management, and EcoVadis, the world’s most trusted provider of business sustainability ratings, today announced an integration that allows clients to achieve more impactful responsible procurement and Environmental, Social and Corporate Governance (ESG) goals.

The integration means that EcoVadis's objective measures of the quality of suppliers' social responsibility programs can feed into the assessment workflows and reporting in Aravo and automatically trigger the appropriate management actions. It allows clients to assess supplier performance across Environment, Ethics, Labor & Human Rights, and Sustainable Procurement practices throughout the lifecycle of the relationship: from onboarding, where it informs the decision on whether to do business with a supplier, through to ongoing monitoring of supplier performance over time. A core benefit is the ability to use the ratings to guide third parties toward better sustainability practices, which in turn lifts the ethical posture of clients and helps improve sustainability across the entire ecosystem.

Aravo Chief Customer Officer, David Rusher, said: “Driving sustainability and good business practices through the supply chain is vital to better business performance, but more broadly to society at large. This partnership unlocks a smarter approach to responsible sourcing, by bringing trusted sustainability ratings into the procurement, compliance and risk management workflows within Aravo, that are used to manage millions of suppliers across the globe. It means customers are positioned to be market leaders in their commitments to sustainable procurement, ESG, and responsible sourcing programs.”

“Through the Aravo integration, EcoVadis can now provide mutual customers a new level of visibility into supplier sustainability performance, such as environmental concerns, ethical business practices, human rights and more,” said Dave McClintock, marketing director at EcoVadis. “By incorporating the EcoVadis sustainability rating process into Aravo’s automated onboarding and due diligence workflows, procurement leaders are able to drive tangible impact, while significantly accelerating their sustainable procurement programs.”

A number of joint clients, including leading pharmaceutical and consumer packaged goods companies, are already benefiting from the integration.

About Aravo

Aravo delivers the market’s smartest third-party risk and performance management solutions, powered by intelligent automation.

For more than 20 years now, Aravo’s combination of award-winning technology and unrivaled domain expertise has helped the world’s most respected brands accelerate and optimize their third-party management programs, delivering better business outcomes faster and ensuring the agility to adapt as programs evolve.

With solutions built on technology designed for usability, agility, and scale, even the most complex organizations can keep pace with the high velocity of regulatory change. As a centralized system of record for all data related to third-party risk, Aravo helps organizations achieve a complete view of their third-party ecosystem throughout the lifecycle of the relationship, from intake through off-boarding and all stages in between and across all risk domains.

Aravo is trusted by the world’s leading brands, helping them manage the risk and improve the performance of more than 5 million third parties, suppliers and vendors across the globe.

Learn more at aravo.com, Twitter or LinkedIn.

About EcoVadis

EcoVadis is the world’s most trusted provider of business sustainability ratings, intelligence and collaborative performance improvement tools for global supply chains. Backed by a powerful technology platform and a global team of domain experts, EcoVadis’ easy-to-use and actionable sustainability scorecards provide detailed insight into environmental, social and ethical risks across 200 purchasing categories and 160 countries. Industry leaders such as Johnson & Johnson, Verizon, L’Oréal, Subway, Nestlé, Salesforce, Michelin and BASF are among the more than 75,000 businesses on the EcoVadis network, all working with a single methodology to evaluate, collaborate and improve sustainability performance in order to protect their brands, foster transparency and innovation, and accelerate growth. Learn more on ecovadis.com, Twitter or LinkedIn.


Source: https://www.prweb.com/releases/aravo_and_ecovadis_partner_to_support_a_smarter_approach_to_responsible_procurement/prweb18093881.htm


Maximum Likelihood Estimation - A Comprehensive Guide

This article was published as a part of the Data Science Blogathon

Introduction

The purpose of this guide is to explore the idea of Maximum Likelihood Estimation, which is perhaps the most important concept in Statistics. If you’re interested in familiarizing yourself with the mathematics behind Data Science, then maximum likelihood estimation is something you can’t miss. For most statisticians, it’s like the sine qua non of their discipline, something without which statistics would lose a lot of its power.

What is Maximum Likelihood Estimation?

So, what is Maximum Likelihood Estimation? We have to understand a few concepts before we can thoroughly answer this question. For now, we can think of it intuitively as follows:

It is a process of using data to find estimators for different parameters characterizing a distribution.

To understand it better, let’s step into the shoes of a statistician. Being a statistician, our primary job is to analyse the data that we have been presented with. Naturally, the first thing would be to identify the distribution from which we have obtained our data. Next, we need to use our data to find the parameters of our distribution. A parameter is a numerical characteristic of a distribution. Normal distributions, as we know, have mean (µ) & variance (σ2) as parameters. Binomial distributions have the number of trials (n) & probability of success (p) as parameters. Gamma distributions have shape (k) and scale (θ) as parameters. Exponential distributions have the inverse mean (λ) as the parameter. The list goes on. These parameters or numerical characteristics are vital for understanding the size, shape, spread, and other properties of a distribution. Since the data that we have is mostly randomly generated, we often don’t know the true values of the parameters characterizing our distribution.

That's when estimators step in. An estimator is a function of your data that gives you approximate values of the parameters that you're interested in. Most of us are familiar with a few common estimators. For instance, the sample-mean estimator, which is perhaps the most frequently used estimator: it's calculated by taking the mean of our observations and comes in very handy when trying to estimate parameters that represent the mean of their distribution (for example, the parameter µ of a normal distribution). Another common estimator is the sample-variance estimator, which is calculated as the variance of our observations and comes in very handy when trying to estimate parameters that represent the variance of their distribution (for example, the parameter σ2 of a normal distribution). You might be tempted to think that we can easily construct estimators for a parameter based on the numerical characteristic that the parameter represents: use the sample-mean estimator whenever the parameter is the mean of your distribution, or use the sample-mode estimator if you're trying to estimate the mode of your distribution. These are often called natural estimators. However, there are two problems with this approach:

1) Things aren’t always that simple. Sometimes, you may encounter problems involving estimating parameters that do not have a simple one-to-one correspondence with common numerical characteristics. For instance, if I give you the following distribution:

fθ(x) = θ / x^(θ+1) for x ≥ 1, and 0 otherwise

The above equation shows the probability density function of a Pareto distribution with scale = 1. It's not easy to estimate the parameter θ of this distribution using simple natural estimators, because the numerical characteristics of the distribution vary as a function of the range of the parameter. For instance, the mean of the above distribution is expressed as follows:

E[X] = θ / (θ - 1) if θ > 1, and E[X] = ∞ if 0 < θ ≤ 1

This is just one example picked from the infinitely many possible sophisticated statistical distributions. (We'll later see how we can use Maximum Likelihood Estimation to find an apt estimator for the parameter θ of the above distribution.)

2) Even if things were simple, there's no guarantee that the natural estimator would be the best one. Sometimes, other estimators give you better estimates based on your data. In the 8th section of this article, we will compute the MLE for a set of real numbers and see how accurate it is.

In this article, we'll focus on maximum likelihood estimation, which is a process of estimation that gives us an entire class of estimators called maximum likelihood estimators or MLEs. MLEs are often regarded as the most powerful class of estimators that can ever be constructed. You might have several questions in mind: What do MLEs look like? How can we find them? Are they really good?

Let’s start our journey into the magical and mystical realm of MLEs.

Pre-requisites:

1) Probability: Basic ideas about random variables, mean, variance and probability distributions. If you’re unfamiliar with these ideas, then you can read one of my articles on ‘Understanding Random Variables’ here.

2) Mathematics: Preliminary knowledge in Calculus and Linear Algebra; ability to solve simple convex optimization problems by taking partial derivatives; calculating gradients.

3) Passion: Finally, reading about something without having a passion for it is like knowing without learning. Real learning comes when you have a passion for the subject and the concept that is being taught.

Table of Contents

1) Basics of Statistical Modelling

2) Total Variation Distance

3) Kullback-Leibler Divergence

4) Deriving the Maximum Likelihood Estimator

5) Understanding and Computing the Likelihood Function

6) Computing the Maximum Likelihood Estimator for Single-Dimensional Parameters

7) Computing the Maximum Likelihood Estimator for Multi-Dimensional Parameters

8) Demystifying the Pareto Problem

Basics of Statistical Modelling for Maximum Likelihood Estimation

Statistical modelling is the process of creating a simplified model for the problem that we’re faced with. For us, it’s using the observable data we have to capture the truth or the reality (i.e., understanding those numerical characteristics). Of course, it’s not possible to capture or understand the complete truth. So, we will aim to grasp as much reality as possible.

In general, a statistical model for a random experiment is the pair:

(E, {ℙθ}θ∈Θ)

There are a lot of new variables! Let’s understand them one by one.

1) E represents the sample space of an experiment. By experiment, we mean the data that we've collected: the observable data. So, E is the range of values that our data can take (based on the distribution that we've assigned to it).

2) ℙθ represents the family of probability-measures on E. In other words, it indicates the probability distribution that we’ve assigned to our data (based on our observations).

3) θ represents the set of unknown parameters that characterize the distribution ℙθ. All those numerical features we wish to estimate are represented by θ. For now, it’s enough to think of θ as a single parameter that we’re trying to estimate. We’ll later see how to deal with multi-dimensional parameters.

4) Θ represents the parameter space i.e., the range or the set of all possible values that the parameter θ could take.

Let’s take 2 examples:

A) For Bernoulli Distribution: We know that if X is a Bernoulli random variable, then X can take only 2 possible values- 0 and 1. Thus, the sample space E is the set {0, 1}. The Bernoulli probability distribution is shown as Ber(p), where p is the Bernoulli parameter, which represents the mean or the probability of success. Since it’s a measure of probability, p always ranges between 0 and 1. Therefore, Θ = [0, 1]. Putting all of this together, we obtain the following statistical model for Bernoulli distribution:

({0, 1}, {Ber(p)}p∈[0,1])

B) For Exponential Distribution: We know that if X is an exponential random variable, then X can take any positive real value. Thus, the sample space E is [0, ∞). The exponential probability distribution is shown as Exp(λ), where λ is the exponential parameter, which represents the rate (here, the inverse of the mean). Since X is always positive, its expectation is always positive, and therefore the inverse mean λ is positive. Therefore, Θ = (0, ∞). Putting all of this together, we obtain the following statistical model for the exponential distribution:

([0, ∞), {Exp(λ)}λ∈(0,∞))

I hope you now have a decent understanding of creating formal statistical models for our data. Most of these ideas will be used only when we introduce formal definitions and go through certain examples. Once you are well versed in the process of constructing MLEs, you won't have to go through all of this.

A Note on Notations: In general, the notation for estimators is a hat over the parameter we are trying to estimate i.e. if θ is the parameter we’re trying to estimate, then the estimator for θ is represented as θ-hat. We shall use the terms estimator and estimate (the value that the estimator gives) interchangeably throughout the guide.

Before proceeding to the next section, I find it important to discuss an important assumption that we shall make throughout this article- identifiability.

Identifiability means that different values of a parameter (from the parameter space Θ) must produce different probability distributions. In other words, for two different values of a parameter (θ & θ’), there must exist two different distributions (ℙ θ & ℙ θ’). That is,

θ ≠ θ' ⇒ ℙθ ≠ ℙθ'

Equivalently,

ℙθ = ℙθ' ⇒ θ = θ'

Total Variation Distance for Maximum Likelihood Estimation

Here, we’ll explore the idea of computing distance between two probability distributions. There could be two distributions from different families such as the exponential distribution and the uniform distribution or two distributions from the same family, but with different parameters such as Ber(0.2) and Ber(0.8). The notion of distance is commonly used in statistics and machine learning- finding distance between data points, the distance of a point from a hyperplane, the distance between two planes, etc.

How can we compute the distance between two probability distributions? One of the most commonly used metrics by statisticians is Total Variation (TV) Distance, which measures the worst deviation between two probability distributions for any subset of the sample space E.

Mathematically, we define the total variation distance between two distributions ℙ and ℚ as follows:

TV(ℙ, ℚ) = max_{A ⊆ E} |ℙ(A) - ℚ(A)|

Intuitively, total variation distance between two distributions ℙ and ℚ refers to the maximum difference in their probabilities computed for any subset over the sample space for which they’re defined. To understand it better, let us assign random variables X and Y to ℙ and ℚ respectively. For all A that are subsets of E, we find ℙ(A) and ℚ(A), which represent the probability of X and Y taking a value in A. We find the absolute value of the difference between those probabilities for all A and compare them. The maximum absolute difference is the total variation distance. Let’s take an example.

Compute the total variation distance between ℙ and ℚ where the probability mass functions are as follows:

probability mass distribution

Since the observed values of the random variables corresponding to ℙ and ℚ are defined only over 1 and 2, the sample space is E = {1, 2}. What are the possible subsets? There are 3 possible subsets: {1}, {2} and {1, 2}. (We may always ignore the null set). Let’s compute the absolute difference in ℙ(A) and ℚ(A) for all possible subsets A. 

absolute difference in ℙ(A) and ℚ(A)

Therefore, we can compute the TV distance as follows:

compute TV distance

That's it. Suppose we are now asked to compute the TV distance between the Exp(1) and Exp(2) distributions. Could you find the TV distance between them using the above method? Certainly not! Exponential distributions have E = [0, ∞). There will be infinitely many subsets of E, and you can't find ℙ(A) and ℚ(A) for each of them. To deal with such situations, there's a simpler analytical formula for computing the TV distance, which is defined differently depending on whether ℙ and ℚ are discrete or continuous distributions.

A) For the discrete case,

If ℙ and ℚ are discrete distributions with probability mass functions p(x) and q(x) and sample space E, then we can compute the TV distance between them using the following equation:

TV(ℙ, ℚ) = (1/2) Σ_{x ∈ E} |p(x) - q(x)|

Let’s use the above formula to compute the TV distance between ℙ=Ber(α) and ℚ=Ber(β). The calculation is as follows:

E = {0,1} since we’re dealing with Bernoulli random variables.

p(0) = 1 - α, p(1) = α and q(0) = 1 - β, q(1) = β

Using the short-cut formula, we obtain,

TV(Ber(α), Ber(β)) = (1/2) (|α - β| + |(1 - α) - (1 - β)|) = |α - β|

That’s neater! Now, let’s talk about the continuous case.

B) For the continuous case,

If ℙ and ℚ are continuous distributions with probability density functions p(x) and q(x) and sample space E, then we can compute the TV distance between them using the following equation:

TV(ℙ, ℚ) = (1/2) ∫_E |p(x) - q(x)| dx

Let’s use the above formula to compute the TV distance between ℙ=Exp(1) and ℚ=Unif[0,1] (the uniform distribution between 0 and 1). The calculation is as follows:

p(x) = e^(-x) 𝕀{x ≥ 0} and q(x) = 𝕀{0 ≤ x ≤ 1}

We’ve used the indicator function 𝕀 above, which takes the value 1 if the condition in the curly brackets is satisfied and 0 otherwise. We could have also described the probability density functions without using the indicator function as follows:

p(x) = e^(-x) for x ≥ 0 (and 0 otherwise), and q(x) = 1 for 0 ≤ x ≤ 1 (and 0 otherwise)

The indicator functions make the calculations look neater and allow us to treat the entire real line as the sample space for the probability distributions.

Using the short-cut formula, we obtain,

TV(Exp(1), Unif[0, 1]) = (1/2) [ ∫_0^1 |e^(-x) - 1| dx + ∫_1^∞ e^(-x) dx ] = (1/2) (e^(-1) + e^(-1)) = 1/e ≈ 0.37

Thus, we’ve obtained the required value. (Even imagining doing this calculation without the analytical equation seems impossible).
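As a quick sanity check on these two results, here is a minimal R sketch (illustrative only; the values of α and β are arbitrary example choices). It evaluates the discrete short-cut formula for two Bernoulli distributions and numerically integrates |p(x) - q(x)| for Exp(1) versus Unif[0, 1]:

# TV distance between Ber(alpha) and Ber(beta) using the discrete short-cut formula
alpha <- 0.2
beta <- 0.8
tv_bernoulli <- 0.5 * (abs(alpha - beta) + abs((1 - alpha) - (1 - beta)))
tv_bernoulli    # 0.6, which equals |alpha - beta|

# TV distance between Exp(1) and Unif[0, 1] by numerical integration,
# splitting the integral at x = 1 where the uniform density drops to zero
integrand <- function(x) abs(dexp(x, rate = 1) - dunif(x, min = 0, max = 1))
tv_exp_unif <- 0.5 * (integrate(integrand, lower = 0, upper = 1)$value +
                        integrate(integrand, lower = 1, upper = Inf)$value)
tv_exp_unif     # approximately 0.3679, i.e. 1/e

Both numbers agree with the hand calculations above.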

We shall now see some mathematical properties of Total Variation Distance:

1) Symmetry:

TV(ℙ, ℚ) = TV(ℚ, ℙ)

2) Definiteness:

TV(ℙ, ℚ) = 0 if and only if ℙ = ℚ

3) Range:

0 ≤ TV(ℙ, ℚ) ≤ 1

4) Triangle Inequality:

TV(ℙ, ℚ) ≤ TV(ℙ, 𝕄) + TV(𝕄, ℚ) for any third distribution 𝕄

That almost concludes our discussion on TV distance. You might be wondering about the reason for this detour. We started our discussion with MLE and went on to talk about TV distance. What's the connection between them? Are they related to each other? Well, technically no. MLE is not based on TV distance; rather, it's based on something called Kullback-Leibler divergence, which we'll see in the next section. But an understanding of TV distance is still important for understanding the idea of MLEs.

Now for the most important and tricky part of this guide. Let’s try to construct an estimator based on the TV distance. How shall we do it?

We'll use one of the properties of TV distance that we discussed earlier: the property that tells you the value that TV distance approaches as two distributions become equal. You've guessed it right, it's definiteness. We consider the following two distributions (from the same family, but with different parameters):

ℙθ and ℙθ*, where θ is the parameter that we are trying to estimate, θ* is the true value of the parameter θ, and ℙθ* is the probability distribution of the observable data we have. From definiteness, we have,

TV(ℙθ, ℙθ*) = 0 if and only if θ = θ*

(Notice how the above equation has used identifiability). Since we had also learnt that the minimum value of TV distance is 0, we can also say:

θ* = argmin_{θ ∈ Θ} TV(ℙθ, ℙθ*)

Graphically, we may represent the same as follows:

graphical representation | maximum likelihood estimation

Image by Author

(The blue curve could be any function that ranges between 0 and 1 and attains minimum value = 0 at θ*). It will not be possible for us to compute the function TV(ℙθ, ℙθ*) in the absence of the true parameter value θ*. What if we could estimate the TV distance and let our estimator be the minimizer of the estimated TV distance between ℙθ and ℙθ*?!

In estimation, our goal is to find an estimator θ-hat for the parameter θ such that θ-hat is close to the true parameter θ*. We can see that in terms of minimizing the distance between the distributions ℙθ and ℙθ*. And that’s when TV distance comes into the picture. We want an estimator θ-hat such that when θ = θ-hat, the estimated TV distance between the probability measures under θ and θ* is minimized. That is, θ =θ-hat should be the minimizer of the estimated TV distance between ℙθ and ℙθ*. Mathematically, we can describe θ-hat as:

θ-hat = argmin_{θ ∈ Θ} TV(ℙθ, ℙθ*)-hat

Graphically,

graphically
Image by Author

We want to be able to estimate the blue curve (TV(ℙθ, ℙθ*)) to find the red curve (TV(ℙθ, ℙθ*)-hat). The value of θ that minimizes the red curve would be θ-hat which should be close to the value of θ that minimizes the blue curve i.e., θ*.

That’s the fundamental idea of MLE in a nutshell. We’ll later use this idea somewhere else and derive the maximum likelihood estimator.

So, we have TV(ℙθ, ℙθ*)-hat, which we could minimize using our tools of calculus and obtain an estimator. Problem sorted. Right? No! We have another problem- How to find TV(ℙθ, ℙθ*)-hat? And that’s a tough one. There’s no easy way that allows us to estimate the TV distance between ℙθ and ℙθ*. And that’s why this whole idea of estimating TV distance to find θ-hat fails. What can we do now?

Maybe we could find another function that is similar to TV distance and obeys definiteness, and that is, most importantly, estimable. And that brings us to the next section: Kullback-Leibler Divergence.

Kullback-Leibler Divergence

KL divergence, also known as relative entropy, is, like TV distance, defined differently depending on whether ℙ and ℚ are discrete or continuous distributions.

A) For the discrete case,

If ℙ and ℚ are discrete distributions with probability mass functions p(x) and q(x) and sample space E, then we can compute the KL divergence between them using the following equation:

KL(ℙ || ℚ) = Σ_{x ∈ E} p(x) log(p(x)/q(x))

The equation certainly looks more complex than the one for TV distance, but it’s more amenable to estimation. We’ll see this later in this section when we talk about the properties of KL divergence.

Let’s use the above formula to compute the KL divergence between ℙ=Ber(α) and ℚ=Ber(β). The calculation is as follows:

E = {0, 1}, with p(0) = 1 - α, p(1) = α and q(0) = 1 - β, q(1) = β

Using the formula, we obtain,

KL(Ber(α) || Ber(β)) = α log(α/β) + (1 - α) log((1 - α)/(1 - β))

That’s it. A more difficult computation, but we’ll see its utility later.

B) For the continuous case,

If ℙ and ℚ are continuous distributions with probability density functions p(x) and q(x) and sample space E, then we can compute the KL divergence between them using the following equation:

KL(ℙ || ℚ) = ∫_E p(x) log(p(x)/q(x)) dx

Let’s use the above formula to compute the KL divergence between ℙ=Exp(α) and ℚ=Exp(β). The calculation is as follows:

p(x) = α e^(-αx) and q(x) = β e^(-βx) for x ≥ 0

Since we’re dealing with exponential distributions, the sample space E is [0, ∞). Using the formula, we obtain,

KL(Exp(α) || Exp(β)) = ∫_0^∞ α e^(-αx) log( (α e^(-αx)) / (β e^(-βx)) ) dx

Don’t worry, I won’t make you go through the long integration by parts to solve the above integral. Just use wolfram or any integral calculator to solve it, which gives us the following result:

KL(Exp(α) || Exp(β)) = log(α/β) + β/α - 1

And we’re done. That’s how we can compute the KL divergence between two distributions. If you’d like to practice more, try computing the KL divergence between ℙ=N(α, 1) and ℚ=N(β, 1) (normal distributions with different mean and same variance). Let me know your answers in the comment section.
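If you'd like to check these closed forms numerically, here is a minimal R sketch (the parameter values α = 0.3, β = 0.6, a = 2, and b = 5 are arbitrary example choices):

# KL(Ber(alpha) || Ber(beta)) from the closed form derived above
alpha <- 0.3
beta <- 0.6
alpha * log(alpha / beta) + (1 - alpha) * log((1 - alpha) / (1 - beta))

# KL(Exp(a) || Exp(b)): numerical integration versus the closed form log(a/b) + b/a - 1
a <- 2
b <- 5
# log(p(x)/q(x)) simplifies to log(a/b) + (b - a) * x for these two exponential densities
integrand <- function(x) dexp(x, rate = a) * (log(a / b) + (b - a) * x)
kl_numeric <- integrate(integrand, lower = 0, upper = Inf)$value
kl_closed <- log(a / b) + b / a - 1
c(kl_numeric, kl_closed)    # the two values should match closely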

We'll now discuss the properties of KL divergence. These properties are going to be different from those of TV distance because KL divergence is a divergence, not a distance. Be careful with the wording. We may not expect properties such as symmetry or the triangle inequality to hold, but we do expect definiteness to hold, to allow us to construct estimators. Also, note that we'll be using only the definition of KL divergence for continuous distributions in the following sections; for discrete distributions, just replace the integral with a sum and the procedure remains the same. Following are the properties of KL divergence:

1) Asymmetry (in general):

KL(ℙ || ℚ) ≠ KL(ℚ || ℙ) in general

2) Definiteness:

KL(ℙ || ℚ) = 0 if and only if ℙ = ℚ

3) Range:

0 ≤ KL(ℙ || ℚ) ≤ ∞

(Yes, KL divergence can be greater than one because it does not represent a probability or a difference in probabilities. The KL divergence also goes to infinity for some very common distributions such as the KL divergence between two uniform distributions under certain conditions)

4) No Triangle Inequality (in general):

KL(ℙ || 𝕄) ≤ KL(ℙ || ℚ) + KL(ℚ || 𝕄) does not hold in general

5) Amenable to Estimation:

KL(ℙ || ℚ) = E_{x~ℙ}[ log(p(x)/q(x)) ]

Recall, the properties of expectation: If X is a random variable with probability density function f(x) and sample space E, then

E[X] = ∫_E x f(x) dx

If we replace x with a function of x, say g(x), we get

E[g(X)] = ∫_E g(x) f(x) dx

We’ve used just this in the expression for KL divergence. The probability density function is p(x) and g(x) is log(p(x)/q(x)). We’ve also put a subscript x~ℙ to show that we’re calculating the expectation under p(x). So we have,

KL(ℙ || ℚ) = E_{x~ℙ}[ log(p(x)/q(x)) ]

We’ll see how this makes KL divergence estimable in section 4. Now, let’s use the ideas discussed at end of section 2 to address our problem of finding an estimator θ-hat to parameter θ of a probability distribution ℙθ:

We consider the following two distributions (from the same family, but different parameters):

ℙθ and ℙθ*, where θ is the parameter that we are trying to estimate, θ* is the true value of the parameter θ, and ℙθ* is the probability distribution of the observable data we have.

From definiteness, we have,

KL(ℙθ* || ℙθ) = 0 if and only if θ = θ*

(Notice how the above equation has used identifiability). Since we had also learnt that the minimum value of KL divergence is 0, we can say:

θ* = argmin_{θ ∈ Θ} KL(ℙθ* || ℙθ)

Graphically, we may represent the same as follows:

graph KL divergence | maximum likelihood estimation
Image by Author

(The blue curve could be any function that ranges between 0 and infinity and attains minimum value = 0 at θ*). It will not be possible for us to compute the function KL(ℙθ* || ℙθ) in the absence of the true parameter value θ*. So, we estimate it and let our estimator θ-hat be the minimizer of the estimated KL divergence between ℙθ* and ℙθ.

Mathematically, 

θ-hat = argmin_{θ ∈ Θ} KL(ℙθ* || ℙθ)-hat

And that estimator is precisely the maximum likelihood estimator. We’ll simplify the above expression in the next section and understand the reasoning behind its terminology.

Graphically,

graphically

Image by Author

We want to be able to estimate the blue curve (KL(ℙθ* || ℙθ)) to find the red curve (KL(ℙθ* || ℙθ)-hat). The value of θ that minimizes the red curve would be θ-hat which should be close to the value of θ that minimizes the blue curve i.e., θ*. And the best part is, unlike TV distance, we can estimate KL divergence and use its minimizer as our estimator for θ.

That’s how we get the MLE.

Deriving the Maximum Likelihood Estimator

In the previous section, we obtained that the MLE θ-hat is calculated as:

θ-hat = argmin_{θ ∈ Θ} KL(ℙθ* || ℙθ)-hat
Equation 1

We have considered the distributions ℙθ and ℙθ*, where θ is the parameter that we are trying to estimate, θ* is the true value of the parameter θ and ℙ is the probability distribution of the observable data we have. Let the probability distribution functions (could be density or mass depending upon the nature of the distribution) be pθ(x) and pθ*(x).

(Notice that we’ve used the same letter p to denote the distribution functions as both the distributions belong to the same family ℙ. Also, the parameter has been subscripted to distinguish the parameters under which we’re calculating the distribution functions.)

We have also shown the process of expressing the KL divergence as an expectation:

KL(ℙθ* || ℙθ) = E_{x~ℙθ*}[ log( pθ*(x) / pθ(x) ) ] = E_{x~ℙθ*}[ log(pθ*(x)) ] - E_{x~ℙθ*}[ log(pθ(x)) ] = c - E_{x~ℙθ*}[ log(pθ(x)) ]

where c = E_{x~ℙθ*}[ log(pθ*(x)) ] is treated as a constant, as it is independent of θ (θ* is a constant value). We won't need this quantity at all, since we want to minimize the KL divergence over θ.

So, we can say that,

KL(ℙθ* || ℙθ) = c - E_{x~ℙθ*}[ log(pθ(x)) ]
Equation 2

How is this useful to us? Recall what the law of large numbers gives us: as our sample size (number of observations) grows, the sample mean of the observations converges to the true mean, or expectation, of the underlying distribution. That is, if Y1, Y2, …, Yn are independent and identically distributed random variables, then

(1/n) Σ_{i=1}^n Yi → E[Y1] as n → ∞

We can replace Yi with any function of a random variable, say log(pθ(Xi)). So, we get,

(1/n) Σ_{i=1}^n log(pθ(Xi)) → E_{x~ℙθ*}[ log(pθ(x)) ] as n → ∞

Thus, using our data, we can compute (1/n) Σ log(pθ(xi)) and use that as an estimator for E_{x~ℙθ*}[ log(pθ(x)) ].

Thus, we have, 

E_{x~ℙθ*}[ log(pθ(x)) ]-hat = (1/n) Σ_{i=1}^n log(pθ(xi))

Substituting this in equation 2, we obtain:

KL(ℙθ* || ℙθ)-hat = c - (1/n) Σ_{i=1}^n log(pθ(xi))

Finally, we’ve obtained an estimator for the KL divergence. We can substitute this in equation 1, to obtain the maximum likelihood estimator:

θ-hat = argmin_{θ ∈ Θ} [ c - (1/n) Σ_{i=1}^n log(pθ(xi)) ]

(Addition of a constant can only shift the function up and down, not affect the minimizer of the function)

(Finding the minimizer of negative of f(x) is equivalent to finding the maximizer of f(x))

θ-hat = argmax_{θ ∈ Θ} (1/n) Σ_{i=1}^n log(pθ(xi))

(Multiplication of a function by a positive constant does not affect its maximizer)

θ-hat = argmax_{θ ∈ Θ} Σ_{i=1}^n log(pθ(xi)) = argmax_{θ ∈ Θ} log( Π_{i=1}^n pθ(xi) )

(log(x) is an increasing function, and the maximizer of g(f(x)) is the maximizer of f(x) whenever g is an increasing function)

Thus, the maximum likelihood estimator θMLE-hat (change in notation) is defined mathematically as:

θMLE-hat = argmax_{θ ∈ Θ} Π_{i=1}^n pθ(xi)

Π_{i=1}^n pθ(xi) is called the likelihood function. Thus, the MLE is an estimator that is the maximizer of the likelihood function; that is why it is called the Maximum Likelihood Estimator. We'll understand the likelihood function in greater detail in the next section.
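Before moving on, here is a minimal R sketch of the law-of-large-numbers step used in the derivation above. It draws a sample from Exp(2) (θ* = 2 and the candidate θ = 3 are arbitrary example choices) and shows that the sample average of log(pθ(xi)) approaches the corresponding expectation, here computed by numerical integration:

set.seed(42)
theta_star <- 2                        # true rate of the data-generating Exp distribution
x <- rexp(10000, rate = theta_star)    # the observed data

theta <- 3                             # an arbitrary candidate value of the parameter
sample_average <- mean(log(dexp(x, rate = theta)))

# E[log p_theta(X)] under X ~ Exp(theta_star), via numerical integration
# (log p_theta(t) = log(theta) - theta * t for the exponential density)
integrand <- function(t) dexp(t, rate = theta_star) * (log(theta) - theta * t)
true_expectation <- integrate(integrand, lower = 0, upper = Inf)$value

c(sample_average, true_expectation)    # the two values should be close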

Understanding and Computing the Likelihood Function

The likelihood function is defined as follows:

A) For the discrete case: If X1, X2, …, Xn are identically distributed random variables with the statistical model (E, {ℙθ}θ∈Θ), where E is a discrete sample space, then the likelihood function is defined as:

L(x1, …, xn; θ) = ℙθ[X1 = x1, X2 = x2, …, Xn = xn]

Furthermore, if X1, X2, …, Xn are independent,

L(x1, …, xn; θ) = Π_{i=1}^n ℙθ[Xi = xi]

By definition of probability mass function, if X1, X2, …, Xn have probability mass function pθ(x), then, ℙθ[Xi=xi] = pθ(xi). So, we have:

L(x1, …, xn; θ) = Π_{i=1}^n pθ(xi)

B) For the continuous case: It's the same as before; we just need to replace the probability mass function with the probability density function. If X1, X2, …, Xn are independent and identically distributed random variables with the statistical model (E, {ℙθ}θ∈Θ), where E is a continuous sample space, then the likelihood function is defined as:

L(x1, …, xn; θ) = Π_{i=1}^n pθ(xi)

where pθ(xi) is the probability density function of the distribution that X1, X2, …, Xn follow.

To better understand the likelihood function, we’ll take some examples.

I) Bernoulli Distribution:

Model:

({0, 1}, {Ber(p)}p∈[0,1])

Parameter: θ=p

Probability Mass Function:

pθ(x) = p^x (1 - p)^(1-x), x ∈ {0, 1}

Likelihood Function:

L(x1, …, xn; p) = Π_{i=1}^n p^(xi) (1 - p)^(1-xi) = p^(Σ xi) (1 - p)^(n - Σ xi)

II) Poisson Distribution:

Model:

({0, 1, 2, …}, {Poi(λ)}λ∈(0,∞))

(Sample space is the set of all whole numbers)

Parameter: θ=λ

Probability Mass Function:

pθ(x) = e^(-λ) λ^x / x!, x = 0, 1, 2, …

Likelihood Function:

L(x1, …, xn; λ) = Π_{i=1}^n e^(-λ) λ^(xi) / xi! = e^(-nλ) λ^(Σ xi) / Π_{i=1}^n xi!

III) Exponential Distribution:

Model:

([0, ∞), {Exp(λ)}λ∈(0,∞))

Parameter: θ=λ

Probability Density Function:

pθ(x) = λ e^(-λx), x ≥ 0

Likelihood Function:

L(x1, …, xn; λ) = Π_{i=1}^n λ e^(-λ xi) = λ^n e^(-λ Σ xi)

IV) Uniform Distribution:

This one’s also going to be very interesting because the probability density function is defined only over a particular range, which itself depends upon the value of the parameter to be estimated.

Model:

([0, ∞), {Unif[0, α]}α∈(0,∞))

Parameter: θ=α

Probability Density Function:

pθ(x) = (1/α) 𝕀{0 ≤ x ≤ α}

(We can ignore the part where x should be more than 0 as it is independent of the parameter α)

Likelihood Function:

L(x1, …, xn; α) = Π_{i=1}^n (1/α) 𝕀{xi ≤ α} = (1/α^n) Π_{i=1}^n 𝕀{xi ≤ α}

That seems tricky. How should we take the product of indicator functions? Remember, the indicator function can take only 2 values- 1 (if the condition in the curly brackets is satisfied) and 0 (if the condition in the curly brackets is not satisfied). If all the xi’s satisfy the condition under the curly brackets, then the product of the indicator functions will also be one. But if even one of the xi’s fails to satisfy the condition, the product will become zero. Therefore, the product of these indicator functions itself can be considered as an indicator function that can take only 2 values- 1 (if the condition in the curly brackets is satisfied by all xi’s) and 0 (if the condition in the curly brackets is not satisfied by at least 1 xi). Therefore,

L(x1, …, xn; α) = (1/α^n) 𝕀{max{x1, …, xn} ≤ α}

(All xi’s are less than α if and only if max{xi} is less than α)

And this concludes our discussion on likelihood functions. Hope you had fun practicing these problems!
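Before moving on to maximization, it can help to see a likelihood evaluated numerically. The following R sketch (an illustrative example with a made-up sample) evaluates the Bernoulli likelihood p^(Σ xi) (1 - p)^(n - Σ xi) at a few candidate values of p, both from the closed form and as a product of pmf values:

x <- c(1, 0, 1, 1, 0, 1, 1, 0)     # a made-up Bernoulli sample
p_grid <- c(0.25, 0.5, 0.625, 0.75)

# Likelihood from the closed form p^(sum xi) * (1 - p)^(n - sum xi)
lik_closed <- p_grid^sum(x) * (1 - p_grid)^(length(x) - sum(x))

# Likelihood as a product of individual pmf values
lik_product <- sapply(p_grid, function(p) prod(dbinom(x, size = 1, prob = p)))

rbind(p_grid, lik_closed, lik_product)
# Both likelihood rows agree, and the largest value occurs at p = 0.625 = mean(x)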

Computing the Maximum Likelihood Estimator for Single-Dimensional Parameters

In this section, we’ll use the likelihood functions computed earlier to obtain the maximum likelihood estimators for some common distributions. This section will be heavily reliant on using tools of optimization, primarily first derivative test, second derivative tests, and so on. We won’t go into very complex calculus in this section and will restrict ourselves to single variable calculus. Multivariable calculus would be used in the next section.

Earlier on, we had obtained the maximum likelihood estimator which is defined as follows:

θMLE-hat = argmax_{θ ∈ Θ} Π_{i=1}^n pθ(xi)

We also saw that Π_{i=1}^n pθ(xi) is the likelihood function. The MLE is just the θ that maximizes the likelihood function. So our job is quite simple: just maximize the likelihood functions we computed earlier using differentiation.

Note: Sometimes differentiating the likelihood function isn’t easy. So, we often use log-likelihood instead of likelihood. Using logarithmic functions saves us from using the notorious product and division rules of differentiation. Since log(x) is an increasing function, the maximizer of log-likelihood and likelihood is the same.

θMLE-hat = argmax_{θ ∈ Θ} log( Π_{i=1}^n pθ(xi) ) = argmax_{θ ∈ Θ} Σ_{i=1}^n log(pθ(xi))

Examples:

To see the procedure in action, we'll revisit the examples from the previous section.

I) Bernoulli Distribution:

Likelihood Function:

L(x1, …, xn; p) = p^(Σ xi) (1 - p)^(n - Σ xi)

Log-likelihood Function:

ℓ(p) = log L(x1, …, xn; p) = (Σ xi) log(p) + (n - Σ xi) log(1 - p)

Maximum Likelihood Estimator:

pMLE-hat = argmax_{p ∈ [0, 1]} ℓ(p)

Calculation of the First derivative:

ℓ'(p) = (Σ xi)/p - (n - Σ xi)/(1 - p)

Calculation of Critical Points in (0, 1)

Setting ℓ'(p) = 0 gives (Σ xi)/p = (n - Σ xi)/(1 - p), i.e. p = (1/n) Σ xi
Equation 6.1

Calculation of the Second derivative:

ℓ''(p) = -(Σ xi)/p^2 - (n - Σ xi)/(1 - p)^2

Substituting equation 6.1 in the above expression, we obtain,

ℓ''(p) = -n^2/(Σ xi) - n^2/(n - Σ xi) < 0

Therefore, p = (1/n) Σ xi is the maximizer of the log-likelihood. Therefore,

pMLE-hat = (1/n) Σ_{i=1}^n xi

The MLE is the sample-mean estimator for the Bernoulli distribution! Yes, the one we talked about at the beginning of the article. Isn’t it amazing how something so natural as the mean could be produced using rigorous mathematical formulation and computation!
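Here is a minimal R sketch that checks this result on simulated data (p = 0.3 is an arbitrary example choice): it compares the sample mean with a direct numerical maximization of the Bernoulli log-likelihood.

set.seed(1)
x <- rbinom(1000, size = 1, prob = 0.3)    # simulated Bernoulli(0.3) sample

mean(x)                                    # the sample-mean estimator

# Direct numerical maximization of the Bernoulli log-likelihood over (0, 1)
loglik <- function(p) sum(x) * log(p) + (length(x) - sum(x)) * log(1 - p)
optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum

Both values should essentially coincide, and both should be close to the true p = 0.3.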

II) Poisson Distribution:

Likelihood Function:

L(x1, …, xn; λ) = e^(-nλ) λ^(Σ xi) / Π_{i=1}^n xi!

Log-likelihood Function:

ℓ(λ) = -nλ + (Σ xi) log(λ) - Σ_{i=1}^n log(xi!)

Maximum Likelihood Estimator:

λMLE-hat = argmax_{λ ∈ (0, ∞)} ℓ(λ)

Calculation of the First derivative:

ℓ'(λ) = -n + (Σ xi)/λ

Calculation of Critical Points in (0, ∞)

Setting ℓ'(λ) = 0 gives λ = (1/n) Σ xi
Equation 6.2

Calculation of the Second derivative:

ℓ''(λ) = -(Σ xi)/λ^2

Substituting equation 6.2 in the above expression, we obtain,

ℓ''(λ) = -n^2/(Σ xi) < 0

Therefore, λ = (1/n) Σ xi is the maximizer of the log-likelihood. Thus,

λMLE-hat = (1/n) Σ_{i=1}^n xi

It’s again the sample-mean estimator!
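A quick simulated check, analogous to the Bernoulli one (λ = 4 is an arbitrary example choice):

set.seed(2)
x <- rpois(1000, lambda = 4)       # simulated Poisson(4) sample

mean(x)                            # the sample-mean estimator

# Numerical maximization of the Poisson log-likelihood (the constant log(xi!) terms are dropped)
loglik <- function(lambda) -length(x) * lambda + sum(x) * log(lambda)
optimize(loglik, interval = c(1e-6, 100), maximum = TRUE)$maximum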

III) Exponential Distribution:

Likelihood Function:

L(x1, …, xn; λ) = λ^n e^(-λ Σ xi)

Log-likelihood Function:

ℓ(λ) = n log(λ) - λ Σ xi

Maximum Likelihood Estimator:

λMLE-hat = argmax_{λ ∈ (0, ∞)} ℓ(λ)

Calculation of the First derivative:

ℓ'(λ) = n/λ - Σ xi

Calculation of Critical Points in (0, ∞)

Setting ℓ'(λ) = 0 gives λ = n / Σ xi
Equation 6.3

Calculation of the Second derivative:

ℓ''(λ) = -n/λ^2

Substituting equation 6.3 in the above expression, we obtain,

ℓ''(λ) = -(Σ xi)^2 / n < 0

Therefore, λ = n / Σ xi is the maximizer of the log-likelihood. Therefore,

λMLE-hat = n / Σ_{i=1}^n xi

This is the reciprocal of the sample mean, which makes sense because λ is the inverse-mean (rate) parameter of the exponential distribution.
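And a simulated check (rate λ = 2 is an arbitrary example choice):

set.seed(3)
x <- rexp(1000, rate = 2)          # simulated Exp(2) sample

length(x) / sum(x)                 # the MLE, i.e. the reciprocal of the sample mean

# Numerical maximization of the exponential log-likelihood
loglik <- function(lambda) length(x) * log(lambda) - lambda * sum(x)
optimize(loglik, interval = c(1e-6, 100), maximum = TRUE)$maximum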

IV) Uniform Distribution:

Likelihood Function:

L(x1, …, xn; α) = (1/α^n) 𝕀{max{x1, …, xn} ≤ α}

Here, we don't need to use the log-likelihood function, nor do we have to use the tools of calculus. We'll find the maximizer of the above likelihood function using pure logic. Since n represents the sample size, n is positive; therefore, for a given sample, 1/α^n increases as α decreases, so the likelihood function is maximized at the minimum admissible value of α. What's the minimum value? It's not zero: look at the expression inside the curly brackets. The likelihood is non-zero only when α ≥ max{xi}.

Therefore, the minimum admissible value of α is max{xi}. Thus,

αMLE-hat = max{x1, …, xn}
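A simulated check for the uniform case (α = 5 is an arbitrary example choice); note that the MLE slightly underestimates α, since the sample maximum is always a little below the true endpoint:

set.seed(4)
x <- runif(1000, min = 0, max = 5)    # simulated Unif[0, 5] sample
max(x)                                # the MLE of alpha, slightly below the true value 5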

This concludes our discussion on computing the maximum likelihood estimator for statistical models with single parameters.

Computing the Maximum Likelihood Estimator for Multi-Dimensional Parameters

In this section, we'll obtain the maximum likelihood estimators for the normal distribution, which is a two-parameter model. This section requires familiarity with basic instruments of multivariable calculus, such as calculating gradients. If you find yourself unfamiliar with these tools, don't worry! You may choose to ignore the mathematical intricacies and understand only the broad idea behind the computations. We'll use all those tools only for optimizing the multidimensional functions, which you can easily do using modern calculators.

The problem we wish to address in this section is finding the MLE for a distribution that is characterized by two parameters. Since normal distributions are the most famous in this regard, we’ll go through the process of finding MLEs for the two parameters- mean (µ) and variance (σ2). The process goes as follows:

Statistical Model:

E = (-∞, ∞) as a gaussian random variable can take any value on the real line.

θ = (µ, σ2) is interpreted as a 2-dimensional parameter (Intuitively think of it as a set of 2 parameters).

Θ = (-∞, ∞) × (0, ∞) as mean (µ) can take any value in the real line and variance (σ2) is always positive.

Parameter: θ = (µ, σ2)

Probability Density Function:

pθ(x) = (1/√(2πσ2)) e^( -(x - µ)^2 / (2σ2) )

Likelihood Function:

L(x1, …, xn; µ, σ2) = Π_{i=1}^n (1/√(2πσ2)) e^( -(xi - µ)^2 / (2σ2) ) = (2πσ2)^(-n/2) e^( -Σ (xi - µ)^2 / (2σ2) )

Log-likelihood Function:

ℓ(µ, σ2) = -(n/2) log(2πσ2) - (1/(2σ2)) Σ_{i=1}^n (xi - µ)^2

We now maximize the above multi-dimensional function as follows:

Computing the Gradient of the Log-likelihood:

∇ℓ(µ, σ2) = ( ∂ℓ/∂µ , ∂ℓ/∂σ2 ) = ( (1/σ2) Σ_{i=1}^n (xi - µ) , -n/(2σ2) + Σ_{i=1}^n (xi - µ)^2 / (2(σ2)^2) )

Setting the gradient equal to the zero vector, we obtain,

( (1/σ2) Σ_{i=1}^n (xi - µ) , -n/(2σ2) + Σ_{i=1}^n (xi - µ)^2 / (2(σ2)^2) ) = (0, 0)

On comparing the first element, we obtain:

µ-hat = (1/n) Σ_{i=1}^n xi

On comparing the second element, we obtain:

σ2-hat = (1/n) Σ_{i=1}^n (xi - µ-hat)^2

Thus, we have obtained the maximum likelihood estimators for the parameters of the gaussian distribution:

µMLE-hat = (1/n) Σ_{i=1}^n xi  and  σ2MLE-hat = (1/n) Σ_{i=1}^n (xi - µMLE-hat)^2

The estimator for variance is popularly called the biased sample variance estimator.
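Here is a minimal simulated check (µ = 1 and σ2 = 4 are arbitrary example choices). Note that the MLE of the variance divides by n, whereas R's var() divides by n - 1:

set.seed(5)
x <- rnorm(1000, mean = 1, sd = 2)     # simulated N(1, 4) sample

mu_hat <- mean(x)                      # MLE of the mean
var_hat <- mean((x - mu_hat)^2)        # MLE of the variance (the biased sample variance)
c(mu_hat, var_hat)                     # should be close to 1 and 4

# R's var() divides by n - 1; rescaling it by (n - 1)/n recovers the MLE
var_hat - var(x) * (length(x) - 1) / length(x)    # essentially zero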

Demystifying the Pareto Problem w.r.t. Maximum Likelihood Estimation

One of the probability distributions that we encountered at the beginning of this guide was the Pareto distribution. Since there was no one-to-one correspondence of the parameter θ of the Pareto distribution with a numerical characteristic such as the mean or variance, we could not find a natural estimator. Now that we're equipped with the tools of maximum likelihood estimation, let's use them to find the MLE for the parameter θ of the Pareto distribution. Recall that the Pareto distribution (with scale = 1) has the following probability density function:

fθ(x) = θ / x^(θ+1) for x ≥ 1, and 0 otherwise

Graphically, it may be represented as follows (for θ=1):

pareto distribution
Image by Author

1. Model:

([1, ∞), {Par(θ)}θ∈(0,∞))

(The shape parameter (θ) is always positive. The sample space is bounded below by the scale, which is 1 in our case.)

2. Parameter: θ

3. Probability Density Function:

fθ(x) = θ / x^(θ+1), x ≥ 1

4. Likelihood Function:

L(x1, …, xn; θ) = Π_{i=1}^n θ / xi^(θ+1) = θ^n ( Π_{i=1}^n xi )^(-(θ+1))

5. Log-likelihood Function:

ℓ(θ) = n log(θ) - (θ + 1) Σ_{i=1}^n log(xi)

6. Maximum Likelihood Estimator:

θMLE-hat = argmax_{θ ∈ (0, ∞)} ℓ(θ)

7. Calculation of the First derivative:

ℓ'(θ) = n/θ - Σ_{i=1}^n log(xi)

8. Calculation of Critical Points in (0, ∞)

Setting ℓ'(θ) = 0 gives θ = n / Σ_{i=1}^n log(xi)
Equation 8.1

9. Calculation of the Second derivative:

ℓ''(θ) = -n/θ^2

Substituting equation 8.1 in the above expression, we obtain,

ℓ''(θ) = -( Σ_{i=1}^n log(xi) )^2 / n < 0

10. Result:

Therefore, θ = n/(sum(log(xi))) is the maximizer of the log likelihood. Therefore,

θMLE-hat = n / Σ_{i=1}^n log(xi)

To make things more meaningful, let’s plug in some real numbers. We’ll be using R to do the calculations.

I’ve randomly generated the following set of 50 numbers from a Pareto distribution with shape (θ)=scale=1 using the following R code:

install.packages('extremefit')
library(extremefit)
xi <- rpareto(50, 1, 0, 1)   # 50 draws from a Pareto distribution with shape 1 and scale 1

The first argument (50) shows the sample size. The second argument (1) shows the shape parameter (θ). You may ignore the third argument (it shows the location parameter, which is set to zero by default). The fourth argument (1) shows the scale parameter, which is set to 1. The following set of numbers was generated:

generated numbers | maximum likelihood estimation

Image by Author

Let’s evaluate the performance of our MLE. We should expect the MLE to be close to 1 to show that it’s a good estimator. Calculations:

n <- 50
S <- sum(log(xi))
MLE <- n / S   # the maximum likelihood estimator derived above: n / sum(log(xi))
MLE

Output: 1.007471

That’s incredibly close to 1! Indeed, the MLE is doing a great job. Go ahead, try changing the sample sizes, and calculating the MLE for different samples. You can also try changing the shape parameter or even experiment with other distributions.
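As a further sanity check (a minimal sketch reusing the xi generated above), we can maximize the Pareto log-likelihood numerically and confirm that it lands on the same value as the closed-form MLE:

# Numerical maximization of the Pareto log-likelihood n*log(theta) - (theta + 1)*sum(log(xi))
loglik <- function(theta) length(xi) * log(theta) - (theta + 1) * sum(log(xi))
optimize(loglik, interval = c(1e-6, 100), maximum = TRUE)$maximum    # matches n / sum(log(xi))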

Conclusion

The purpose of this article was to see MLEs not as abstract functions, but as mesmerizing mathematical constructs that have their roots deeply seated under solid logical and conceptual foundations. I hope you enjoyed going through this guide!

In case you have any doubts or suggestions, do reply in the comment box. Please feel free to contact me via mail.

If you liked my article and want to read more of them, visit this link.

Note: All images have been made by the author.

About the Author

I am currently a first-year undergraduate student at the National University of Singapore (NUS) and am deeply interested in Statistics, Data Science, Economics, and Machine Learning. I love working on different Data Science projects. If you’d like to see some of my projects, visit this link.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Source: https://www.analyticsvidhya.com/blog/2021/09/maximum-likelihood-estimation-a-comprehensive-guide/


Quad nations to focus on clean-energy supply chain, says Australia PM

(Reuters) – The United States, Japan, India and Australia will work to improve the security of supply chains for critical technologies such as clean energy and to ease a global semiconductor shortage, said Australia’s Prime Minister Scott Morrison.

The Quad nations, in their first in-person summit https://www.reuters.com/world/china/quad-leaders-meet-white-house-amid-shared-china-concerns-2021-09-24 on Friday in Washington, agreed on a partnership to secure critical infrastructure, the White House said.

Morrison told reporters after the meeting this will include connecting Australia’s raw minerals with manufacturing and processing capabilities, and with end users in the United States, India and Japan, according to a transcript released on Saturday by his government.

Australia is the world’s biggest supplier of rare earths outside of China, and is a major supplier of minerals used in electric vehicle batteries, such as nickel, copper and cobalt.

While the leaders did not publicly refer to China, they repeatedly insisted on rules-based behaviour in a region where China has been trying to flex its muscles. Beijing criticised the group as “doomed to fail.”

The other Quad leaders expressed appreciation for Australia’s role in supplying critical materials “because that is a necessary supply for the many industries and processing works that they operate themselves”, Morrison said.

“On critical minerals, Australia is one of the biggest producers, but we believe we can play a bigger role in a critical supply chain that is supporting the technologies of the future.”

Australia will host a clean-energy supply chain summit next year, aiming to develop a roadmap for building such supply chains in the Indo-Pacific region, Morrison said.

The Quad also discussed ways to better secure a semiconductor supply, Morrison said, as global carmakers and other manufacturers have cut production due to the shortage made worse by a COVID-19 resurgence in key Asian semiconductor production hubs.

“This is an ecosystem we want to create and we want to do that… in the region,” he said.

(Reporting by Melanie Burton in Melbourne; Editing by William Mallard)

Image Credit: Reuters


Source: https://datafloq.com/read/quad-nations-focus-clean-energy-supply-chain-says-australia-pm/18150


China welcomes Huawei executive home, but silent on freed Canadians

By David Stanway

SHANGHAI (Reuters) -Chinese state media welcomed telecoms giant Huawei’s chief financial officer, Meng Wanzhou, back to the “motherland” on Saturday, after more than 1,000 days under house arrest in Canada, on what they called unfounded charges of bank fraud.

But they have kept silent about Michael Kovrig and Michael Spavor, the two Canadians released from Chinese custody in an apparent act of reciprocation by Beijing.

Chinese state broadcaster CCTV carried a statement by the Huawei executive, written as her plane flew over the North Pole, avoiding U.S. airspace.

Her eyes were “blurring with tears” as she approached “the embrace of the great motherland”, Meng said. “Without a strong motherland, I wouldn’t have the freedom I have today.”

Meng was arrested in December 2018 in Vancouver after a New York court issued an arrest warrant, saying she tried to cover up attempts by Huawei-linked companies to sell equipment to Iran in breach of U.S. sanctions.

After more than two years of legal wrangling, she was finally allowed to leave Canada and fly back to China on Friday, after securing a deal with U.S. prosecutors.

Huawei, founded by Meng’s father Ren Zhengfei, said in a statement that it “looked forward to seeing Ms. Meng returning home safely to be reunited with her family.” It said it would continue to defend itself against U.S. charges.

Canadians Michael Kovrig and Michael Spavor, detained by Chinese authorities just days after Meng’s arrest, were released a few hours later, Prime Minister Justin Trudeau has said.

State news agency Xinhua formally acknowledged the end of Meng’s house arrest on Saturday, attributing her release to the “unremitting efforts of the Chinese government”.

Hu Xijin, editor in chief of the Global Times tabloid backed by the ruling Communist Party, wrote on Twitter that “international relations have fallen into chaos” as a result of Meng’s “painful three years”.

He added, “No arbitrary detention of Chinese people is allowed.”

However, neither Hu nor other media have mentioned the release of Spavor and Kovrig, and reactions on China’s Twitter-like Weibo social media platform have been few and far between.

The foreign ministry has not commented publicly.

China has previously denied engaging in “hostage diplomacy”, insisting that the arrest and detention of the two Canadians was not tied in any way to the extradition proceedings against Meng.

Spavor was accused of supplying photographs of military equipment to Kovrig and sentenced to 11 years in jail in August. Kovrig had still been awaiting sentencing.

(Reporting by David Stanway in Shanghai; Additional reporting by David Kirton in Shenzhen; Editing by Clarence Fernandez and William Mallard)

Image Credit: Reuters


Source: https://datafloq.com/read/china-welcomes-huawei-executive-home-silent-freed-canadians/18149


Brazil telecoms regulator says 5G auction rules to be published by Monday

SAO PAULO (Reuters) – Brazil’s government expects to attract some 50 billion reais ($9.35 billion) in bids from a planned auction of fifth generation (5G) mobile spectrum, with auction rules to be issued by Monday, telecoms regulator Anatel Superintendent Abraão Balbino said on Friday.

Balbino said that the value of the projected capital expenditures made by the companies will be discounted from the bids, with 40 billion reais in capital expenditures expected.

($1 = 5.3471 reais)

(Reporting by Alberto Alerigi)

Image Credit: Reuters


Source: https://datafloq.com/read/brazil-telecoms-regulator-says-5g-auction-rules-published-monday/18144
