Expectation-Maximization (EM) Algorithm
- Python Automation and Machine Learning for ICs -
- An Online Book -

=================================================================================

The Expectation-Maximization (EM) algorithm for estimating the parameters of a mixture model by maximum likelihood mainly involves:

          i) Soft assignment:

Instead of assigning a data point exclusively to the cluster with the maximum probability, EM assigns it to all clusters with some probability. The "softness" of the assignment is captured by these probabilities, called the responsibilities w_j^(i). The parameters are then chosen to maximize the log-likelihood of the observed data:

          ℓ(ϕ, μ, Σ) = ∑_{i=1}^{n} log p(x^(i); ϕ, μ, Σ) = ∑_{i=1}^{n} log ∑_{z^(i)=1}^{k} p(x^(i) | z^(i); μ, Σ) p(z^(i); ϕ) --------------------------------- [3696a]

where,

          ℓ(ϕ, μ, Σ) is a function representing the log-likelihood of the observed data given the parameters ϕ, μ, and Σ.

          ∑_{i=1}^{n} indicates that the operations in the EM algorithm are performed for each observation in the dataset (indexed by i).

          ii) Parameter updates:

The cluster parameters, such as the mixing coefficients (ϕ_j), means (μ_j), and covariances (Σ_j), are updated using weighted averages. In the soft-assignment case the weights are the responsibilities w_j^(i), indicating the likelihood of a data point belonging to a specific cluster; if the latent assignments z^(i) were observed, the weights would simply be indicator functions, giving the estimates:

          ϕ_j = (1/n) ∑_{i=1}^{n} 1{z^(i) = j} --------------------------------- [3696b]

          μ_j = ∑_{i=1}^{n} 1{z^(i) = j} x^(i) / ∑_{i=1}^{n} 1{z^(i) = j} --------------------------------- [3696c]

where,

  • p(x^(i), z^(i); ϕ, μ, Σ) is the joint probability of observing x^(i) and its corresponding latent variable z^(i) given the parameters ϕ, μ, and Σ.
  • ϕ_j represents the probability that a randomly selected data point belongs to the j-th component of the mixture. It is estimated by counting the proportion of data points assigned to the j-th component.
  • μ_j represents the mean of the distribution associated with the j-th component of the mixture. It is estimated by computing the weighted average of the observations assigned to the j-th component, where the weights are the probabilities of belonging to that component.

In a mixture model, we have a dataset with observations x^(i) that are assumed to be generated from a mixture of k different distributions, each associated with a latent variable z^(i). The latent variable indicates from which component of the mixture the observation is generated. The goal is to estimate the parameters ϕ, μ, and Σ that maximize the likelihood of the observed data. The likelihood function is the probability of observing the entire dataset given the parameters.
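
As a concrete illustration, the log-likelihood in Equation 3696a can be evaluated numerically. The sketch below is a minimal implementation, assuming Gaussian components and using scipy.stats.multivariate_normal for the component densities; the function name gmm_log_likelihood and the toy inputs are illustrative assumptions, not part of the original text.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, phi, mu, Sigma):
    # l(phi, mu, Sigma) = sum_i log sum_j phi_j * N(x^(i); mu_j, Sigma_j)   (Eq. 3696a)
    n, k = X.shape[0], len(phi)
    p = np.zeros((n, k))
    for j in range(k):
        # p[i, j] = p(x^(i) | z^(i) = j; mu_j, Sigma_j) * p(z^(i) = j; phi)
        p[:, j] = phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    return np.sum(np.log(p.sum(axis=1)))

# toy usage with arbitrary parameters
X = np.random.randn(100, 2)
phi = np.array([0.5, 0.5])
mu = [np.zeros(2), np.ones(2)]
Sigma = [np.eye(2), np.eye(2)]
print(gmm_log_likelihood(X, phi, mu, Sigma))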

The expressions in Equations 3696b and 3696c are the updates for the parameters ϕ_j and μ_j in the M-step of the EM algorithm for a mixture model. The term 1{z^(i) = j} is an indicator function that equals 1 if z^(i) is equal to j and 0 otherwise. The sums in the denominators are normalization terms that ensure ϕ represents a valid probability distribution and that μ_j is properly weighted.

Equations 3696b and 3696c are exactly the Gaussian Discriminant Analysis (GDA) estimates, the only difference being that y is replaced with z. That is, the main difference lies in the interpretation of the latent variable or label. In GDA, there is typically a categorical label y indicating the class to which an observation belongs; the likelihood and parameter estimation involve the class labels, and the focus is on modeling the class-conditional distributions. In a Gaussian Mixture Model (GMM) with EM, on the other hand, there is a latent variable z that represents the component from which a data point is generated; this latent variable is associated with the different components of the mixture model, each modeled as a Gaussian distribution.
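
When the assignments z^(i) are observed, the estimates in Equations 3696b and 3696c reduce to counting and averaging, just as in GDA. The sketch below is only an illustration; the helper name complete_data_mle is hypothetical, and it assumes every component receives at least one data point.

import numpy as np

def complete_data_mle(X, z, k):
    # phi_j = (1/n) sum_i 1{z^(i) = j}                        (Eq. 3696b)
    phi = np.array([np.mean(z == j) for j in range(k)])
    # mu_j = sum_i 1{z^(i) = j} x^(i) / sum_i 1{z^(i) = j}    (Eq. 3696c)
    mu = np.array([X[z == j].mean(axis=0) for j in range(k)])
    return phi, mu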

Based on the evidence lower bound (ELBO) used in variational inference, the EM algorithm gives,

                  ELBO(Q, θ) = ∑_{i=1}^{n} ∑_{z^(i)} Q_i(z^(i)) log [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ] ----------------- [3696ic]

This is the expression for the evidence lower bound, where Q_i(z^(i)) is a distribution used in variational inference to approximate the posterior distribution p(z^(i) | x^(i); θ). The goal is to maximize this lower bound during the optimization process.

For any θ and Q, the relationship of this bound to the likelihood is given by,

          ℓ(θ) = ∑_{i=1}^{n} log p(x^(i); θ) ≥ ELBO(Q, θ) ---------------------------------- [3696icd]

This inequality holds due to the nature of the variational inference framework. The evidence lower bound ELBO(Q, θ) is a lower bound on the log-likelihood ℓ(θ). The goal of variational inference is to maximize this lower bound with respect to both the model parameters θ and the variational distribution Q. The inequality ensures that improving the lower bound at each step of the optimization cannot overshoot the log-likelihood; maximizing the lower bound is therefore a surrogate for maximizing the actual log-likelihood, which may be computationally challenging.
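
A quick numeric check of Inequality 3696icd can be done with a toy one-dimensional mixture: the ELBO evaluated for an arbitrary Q never exceeds the exact log-likelihood. The parameter values and the single data point below are arbitrary illustrative choices.

import numpy as np
from scipy.stats import norm

# hypothetical two-component 1-D mixture and one observation x
phi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
sigma = np.array([1.0, 0.5])
x = 0.4

# joint p(x, z = j; theta) and exact log-likelihood log p(x; theta)
joint = phi * norm.pdf(x, loc=mu, scale=sigma)
log_lik = np.log(joint.sum())

# ELBO for an arbitrary distribution Q over z (not the posterior)
Q = np.array([0.5, 0.5])
elbo = np.sum(Q * np.log(joint / Q))

print(log_lik, elbo)   # elbo <= log_lik; equality would require Q to be the posterior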

The EM algorithm alternates between the E-step (Expectation) and M-step (Maximization):

i) The E-step involves estimating the expected value of the latent variables (in this case, z^(i)) given the observed data and the current parameter estimates:

                    w_j^(i) = Q_i(z^(i) = j) = P(z^(i) = j | x^(i); ϕ, μ, Σ) --------------------------------- [3696d]

                               w_j^(i) = p(x^(i) | z^(i) = j; μ, Σ) p(z^(i) = j; ϕ) / ∑_{l=1}^{k} p(x^(i) | z^(i) = l; μ, Σ) p(z^(i) = l; ϕ) ---------------------- [3696e]

where,

  • w_j^(i) is the posterior probability that the data point x^(i) belongs to the j-th component of the mixture, given the observed data and the current parameter estimates.

  • p(x^(i) | z^(i) = j; μ_j, σ_j^2) is the conditional probability of observing x^(i) given that it comes from the j-th component of the mixture. This is modeled as a univariate Gaussian distribution with parameters μ_j and σ_j^2:

              p(x^(i) | z^(i) = j; μ_j, σ_j^2) = (1 / (√(2π) σ_j)) exp( −(x^(i) − μ_j)^2 / (2σ_j^2) ) ----------------- [3696f]

    For the multivariate case, where x^(i) is a vector, μ_j is the mean vector of the j-th component, and Σ_j is the covariance matrix of the j-th component, the equation becomes:

            p(x^(i) | z^(i) = j; μ_j, Σ_j) = (1 / ((2π)^(d/2) |Σ_j|^(1/2))) exp( −(1/2) (x^(i) − μ_j)^T Σ_j^(−1) (x^(i) − μ_j) ) - [3696g]

  • Here, x^(i) is the observed data point, μ_j is the mean vector of the j-th component, Σ_j is the covariance matrix of the j-th component, and σ_j^2 is the variance of the j-th component (in the univariate case).

  • p(z^(i) = j; ϕ) = ϕ_j is the prior probability of a data point belonging to the j-th component of the mixture, represented by the mixing coefficient ϕ_j. ϕ is a vector of numbers that sum to 1.

We can have,

        p(x^(i), z^(i); ϕ, μ, Σ) = p(x^(i) | z^(i); μ, Σ) p(z^(i); ϕ) -------------------------- [3696gb]

        p(z^(i) = j | x^(i); ϕ, μ, Σ) = p(x^(i) | z^(i) = j; μ, Σ) p(z^(i) = j; ϕ) / ∑_{l=1}^{k} p(x^(i) | z^(i) = l; μ, Σ) p(z^(i) = l; ϕ) -------------------------- [3696gba]

Equation 3696gba represents the conditional probability of the latent variable z^(i) taking on a specific value j given the observed data x^(i) and the current parameter estimates ϕ, μ, and Σ. z^(i) is a multinomial variable (it can be denoted by z or any other symbol). In probabilistic models, the term "multinomial" here refers to a categorical distribution, which is a discrete probability distribution over a finite set of categories. The parameter ϕ represents the probabilities associated with each category in the multinomial distribution.

       In summary, in the E-step, we set the distribution over the latent variables z^(i) to be,

                    Q_i(z^(i) = j) = p(z^(i) = j | x^(i); ϕ, μ, Σ) = w_j^(i) --------------------------------- [3696gc]
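
A minimal sketch of this E-step, assuming Gaussian components and using scipy.stats.multivariate_normal, is given below; w[i, j] corresponds to w_j^(i) in Equation 3696gc, and the function name e_step is an assumption made for illustration.

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, phi, mu, Sigma):
    # w[i, j] = Q_i(z^(i) = j) = p(z^(i) = j | x^(i); phi, mu, Sigma)   (Eqs. 3696d-3696gc)
    n, k = X.shape[0], len(phi)
    w = np.zeros((n, k))
    for j in range(k):
        # numerator of Bayes' rule: p(x^(i) | z^(i) = j) * p(z^(i) = j)
        w[:, j] = multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j]) * phi[j]
    w /= w.sum(axis=1, keepdims=True)   # normalize over the k components
    return w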

ii) During the M-step, the log-likelihood ℓ(θ) of the data is optimized with respect to the model parameters; that is, the parameters are updated by maximizing the expected log-likelihood.

     The parameters ϕ_j and μ_j are among the parameters estimated during the M-step of the EM algorithm for GMMs.

                  ϕ_j = (1/n) ∑_{i=1}^{n} w_j^(i) ----------------- [3696h]

                  μ_j = ∑_{i=1}^{n} w_j^(i) x^(i) / ∑_{i=1}^{n} w_j^(i) ----------------- [3696i]

The two equations above are used in the Maximization (M-step) to update the parameters ϕ_j and μ_j. In Equation 3696h, ϕ_j represents the mixing coefficient for the j-th component of the mixture; it is updated by taking the average of the responsibilities w_j^(i) over all data points. In Equation 3696i, μ_j is the mean vector of the j-th component; it is updated by taking the weighted average of the data points x^(i) using the responsibilities as weights. Equations 3696h and 3696i ensure that ϕ remains a valid probability distribution and that μ_j is updated based on the weighted average of the data points, where the weights are given by the responsibilities. These updates are performed iteratively in the EM algorithm until convergence, aiming to find the parameters that maximize the likelihood of the observed data in the presence of latent variables.
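
The updates in Equations 3696h and 3696i translate directly into a few lines of NumPy. The sketch below assumes a responsibility matrix w produced by an E-step such as the one above, with w[i, j] = w_j^(i); the function name m_step_phi_mu is illustrative only.

import numpy as np

def m_step_phi_mu(X, w):
    n = X.shape[0]
    phi = w.sum(axis=0) / n                    # phi_j = (1/n) sum_i w_j^(i)                 (Eq. 3696h)
    mu = (w.T @ X) / w.sum(axis=0)[:, None]    # mu_j = sum_i w_j^(i) x^(i) / sum_i w_j^(i)  (Eq. 3696i)
    return phi, mu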

Inequality 3696icd tells us that in the E-step we maximize the lower bound ELBO(Q, θ) with respect to Q, while in the M-step we maximize it with respect to θ:

  • In the E-step, the goal is to compute or estimate the expected value of the log-likelihood, given the observed data and the current estimates of the parameters (θ).
  • This involves maximizing the evidence lower bound (ELBO) with respect to the distribution of the latent variables (Q), treating the parameters (θ) as fixed.

                Q_i^(t)(z^(i)) := argmax_{Q_i} ELBO(Q, θ^(t)) = p(z^(i) | x^(i); θ^(t)) ----------------- [3696ica]

  • In the M-step, the goal is to maximize the expected log-likelihood obtained from the E-step with respect to the parameters (θ).
  • This involves maximizing the evidence lower bound (ELBO) with respect to the model parameters (θ), treating the distribution of the latent variables (Q) as fixed.

                θ^(t+1) := argmax_θ ELBO(Q^(t), θ) ----------------- [3696icb]

    The EM algorithm iteratively alternates between these two steps until convergence. The E-step computes the expected value of the latent variables, and the M-step updates the parameters of the model based on this expectation. The algorithm aims to find a set of parameters that maximizes the observed-data log-likelihood, taking into account the presence of latent variables. Therefore, the EM algorithm is often referred to as a coordinate ascent algorithm.
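
Putting the two steps together, a minimal coordinate-ascent EM loop for a GMM might look like the sketch below. It is an illustrative implementation only (random initialization at data points, a fixed number of iterations, no convergence test or numerical safeguards); the function name em_gmm is an assumption.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, k, replace=False)]                 # initialize means at random data points
    Sigma = np.array([np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = p(z^(i) = j | x^(i); theta)
        w = np.stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(k)], axis=1)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate phi, mu, Sigma from the responsibilities
        Nj = w.sum(axis=0)
        phi = Nj / n
        mu = (w.T @ X) / Nj[:, None]
        Sigma = np.array([(w[:, j, None] * (X - mu[j])).T @ (X - mu[j]) / Nj[j] for j in range(k)])
    return phi, mu, Sigma

# toy usage: two well-separated clusters
X = np.vstack([np.random.randn(200, 2) + 3.0, np.random.randn(200, 2) - 3.0])
print(em_gmm(X, k=2)[0])   # estimated mixing coefficients, roughly [0.5, 0.5]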

In summary, in the M-step, we update θ by,

                  θ := argmax_θ ∑_{i=1}^{n} ∑_{z^(i)} Q_i(z^(i)) log [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ] ----------------- [3696id]

The right-hand side of the optimization problem is constructed as a lower bound for the log-likelihood. This is a key concept in variational inference and in the derivation of the evidence lower bound (ELBO). The given optimization problem is therefore associated with maximization of the ELBO: the goal is to find the parameters that maximize the ELBO. The ELBO is a lower bound for the log-likelihood because of Jensen's inequality. Specifically, for any distribution Q_i(z^(i)), Jensen's inequality implies that:

                  log p(x^(i); θ) = log ∑_{z^(i)} Q_i(z^(i)) [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ] ≥ ∑_{z^(i)} Q_i(z^(i)) log [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ] --------------- [3696ie]

Maximizing the sum over i in this optimization problem is equivalent to maximizing a lower bound on the log-likelihood. This approach is commonly used in variational inference to approximate intractable posterior distributions by optimizing a tractable lower bound. The optimization process seeks to find both the model parameters θ and the variational distributions Q_i(z^(i)).

A probabilistic model that is often used in the EM algorithm, particularly for mixture models such as Gaussian Mixture Models (GMMs), is given by the expressions below, which represent the probability distribution assumptions of the model.

                  z^(i) = j --------------------------------- [3696ida]

                  p(x^(i) | z^(i) = j; μ, Σ) = (1 / ((2π)^(d/2) |Σ_j|^(1/2))) exp( −(1/2) (x^(i) − μ_j)^T Σ_j^(−1) (x^(i) − μ_j) ) --------- [3696idb]

                  p(z^(i) = j; ϕ) = ϕ_j ----------------------------------------- [3696idc]

                  w_j^(i) = p(z^(i) = j | x^(i); ϕ, μ, Σ) ----------------------------------------- [3696idd]

Equation 3696ida represents the assignment of observation x^(i) to the j-th component of the mixture model. In GMMs, each component is associated with a Gaussian distribution. Equation 3696idb is the probability density function (PDF) of the observed data x^(i) given the assignment z^(i) = j and the model parameters μ_j and Σ_j; in the case of a Gaussian mixture model, this PDF is a multivariate Gaussian distribution with mean μ_j and covariance matrix Σ_j for the j-th component. Equation 3696idc represents the prior probability of assignment to the j-th component; in GMMs, ϕ_j is often interpreted as the weight or mixing coefficient associated with the j-th component. Equation 3696idd is the estimated probability (or responsibility) that observation x^(i) belongs to the j-th component during the E-step of the EM algorithm; it is computed as part of the EM algorithm and is proportional to the likelihood of the data given the current parameters.
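
The generative assumptions in Equations 3696ida, 3696idb, and 3696idc can be illustrated by sampling synthetic data from a hypothetical two-component model; the parameter values below are arbitrary choices made for this sketch.

import numpy as np

rng = np.random.default_rng(0)

phi = np.array([0.4, 0.6])                        # p(z^(i) = j; phi) = phi_j   (Eq. 3696idc)
mu = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])

n = 500
z = rng.choice(len(phi), size=n, p=phi)           # draw the component assignment z^(i)   (Eq. 3696ida)
X = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])   # x^(i) | z^(i) = j ~ N(mu_j, Sigma_j)   (Eq. 3696idb)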

The process of finding the maximum likelihood estimate of the mean parameter μ_j, by taking the derivative of the log-likelihood with respect to μ_j and setting it equal to 0, is given by,

                  ∇_{μ_j} ∑_{i=1}^{n} ∑_{l=1}^{k} w_l^(i) log [ p(x^(i) | z^(i) = l; μ, Σ) p(z^(i) = l; ϕ) / w_l^(i) ] = ∑_{i=1}^{n} w_j^(i) Σ_j^(−1) (x^(i) − μ_j) = 0 ----------------------------------- [3696idea]

The process involves taking the derivative of the log-likelihood function with respect to the mean parameter μ_j and setting it equal to 0 to find the values of μ_j that maximize the likelihood of the observed data.

The parameter update step in the M-step of the EM algorithm, particularly for Gaussian Mixture Models (GMMs), can be given by,

                  μ_j = ∑_{i=1}^{n} w_j^(i) x^(i) / ∑_{i=1}^{n} w_j^(i) ----------------------------------- [3696ide]

where,

            μ_j is the mean of the j-th component,

            Σ_j is its covariance matrix, and

            ϕ_j is its mixing coefficient.

Equation 3696ide is the update rule for the mean μ_j of the j-th component during the M-step. This update is based on the responsibilities w_j^(i) assigned to the j-th component during the E-step and arises from optimizing the log-likelihood (lower bound) of the data.

Similarly, update equations for the other parameters can be obtained. For instance, the covariance matrix update is typically obtained by setting the derivative of the log-likelihood with respect to Σ_j to 0. The update equation involves the weighted sum of the outer products of the centered data points, with the responsibilities w_j^(i) as weights. Therefore, Σ_j can be given by,

                  ∂/∂Σ_j [ ∑_{i=1}^{n} ∑_{l=1}^{k} w_l^(i) log ( p(x^(i) | z^(i) = l; μ, Σ) p(z^(i) = l; ϕ) / w_l^(i) ) ] = 0 ----------------------------------- [3696idf]

                  Σ_j = ∑_{i=1}^{n} w_j^(i) (x^(i) − μ_j)(x^(i) − μ_j)^T / ∑_{i=1}^{n} w_j^(i) ----------------------------------- [3696idg]

This update ensures that Σ_j maximizes the likelihood of the data under the assumption of a Gaussian distribution.
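
A sketch of the covariance update in Equation 3696idg, assuming the responsibility matrix w and the updated means mu from the earlier M-step sketch, could be:

import numpy as np

def update_sigma(X, w, mu):
    # Sigma_j = sum_i w_j^(i) (x^(i) - mu_j)(x^(i) - mu_j)^T / sum_i w_j^(i)   (Eq. 3696idg)
    k = w.shape[1]
    Sigma = []
    for j in range(k):
        diff = X - mu[j]                                   # centered data points (x^(i) - mu_j)
        Sigma.append((w[:, j, None] * diff).T @ diff / w[:, j].sum())
    return np.array(Sigma)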

The update for the mixing coefficient ϕ_j can be given by,

                  ∑_{i=1}^{n} ∑_{j=1}^{k} w_j^(i) log ϕ_j,   subject to ∑_{j=1}^{k} ϕ_j = 1 ----------------------------------- [3696idh]

                  ϕ_j = (1/n) ∑_{i=1}^{n} w_j^(i) ----------------------------------- [3696idi]

where,

             n is the total number of data points. This update ensures that ϕ_j reflects the proportion of data points assigned to the j-th component.

The mixing coefficient ϕ_j represents the prior probability of selecting the j-th component. The update for ϕ_j is obtained by setting the derivative of the log-likelihood with respect to ϕ_j to 0, subject to the constraint that the ϕ_j sum to 1.

As an example of the EM algorithm, assume we want to maximize the L(θ) of the blue curve in Figure 3696a, with the steps below:

i) In the first E-step, we randomly initiate θ, say at θ0 = 2.25, and construct a lower bound, shown in green, for the log-likelihood with the properties below:

          i.a) The lower-bound (green) curve lies at or below the blue curve everywhere.

          i.b) The blue curve and the green curve have a common point, which is the initiated point θ0 on the blue curve.

ii) In the first M-step, we find the maximum of the green curve, which is the point at Max1.

iii) In the second E-step, we use the θ value of the point at Max1 as θ1 and repeat steps i) and ii).

iv) These steps are repeated until a local maximum is found.


Figure 3696a. Expectation-Maximization (EM) algorithm (code). In this script, an optimization technique with the minimize function from the scipy.optimize module is used to find the maximum of the green curve.
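
The original script for Figure 3696a is not reproduced on this page, but the idea described in the caption can be sketched as follows: for a hypothetical one-dimensional mixture with a single unknown mean θ, the lower bound (green curve) is built from the responsibilities computed at the current θ and is then maximized with the minimize function from scipy.optimize (applied to its negative). The data, the one-parameter model, and the function names are assumptions made for illustration, not the original code.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(3.0, 1.0, 100), rng.normal(0.0, 1.0, 100)])  # toy 1-D data

phi = np.array([0.5, 0.5])              # mixing weights held fixed in this illustration

def means(theta):                       # component means: theta (unknown) and 0 (fixed)
    return np.array([theta, 0.0])

def log_lik(theta):                     # blue curve: l(theta) = sum_i log sum_j phi_j N(x_i; mu_j, 1)
    p = phi * norm.pdf(x[:, None], loc=means(theta), scale=1.0)
    return np.sum(np.log(p.sum(axis=1)))

def lower_bound(theta, theta0):         # green curve: ELBO built from responsibilities at theta0
    p0 = phi * norm.pdf(x[:, None], loc=means(theta0), scale=1.0)
    w = p0 / p0.sum(axis=1, keepdims=True)                        # E-step at theta0
    p = phi * norm.pdf(x[:, None], loc=means(theta), scale=1.0)
    return np.sum(w * (np.log(p) - np.log(w)))

theta = 2.25                            # theta0 as in Figure 3696a
for _ in range(5):                      # E-step (build the bound) + M-step (maximize it)
    res = minimize(lambda t: -lower_bound(t[0], theta), x0=[theta])
    theta = res.x[0]
print(theta, log_lik(theta))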

Equation 3696a can be modified to,

          ℓ(θ) = ∑_{i=1}^{n} log ∑_{z^(i)} p(x^(i), z^(i); θ) = ∑_{i=1}^{n} log ∑_{z^(i)} Q_i(z^(i)) [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ] --------------------------------- [3696j]

where,

          Q_i(z^(i)) represents a probability distribution over the latent variable z^(i) for the i-th observation.

         ∑_{z^(i)} Q_i(z^(i)) is the sum over all possible values of the latent variable z^(i) according to the distribution Q_i; this sum is equal to 1.

         p(x^(i), z^(i); θ) / Q_i(z^(i)) is the ratio of the joint probability to the distribution Q_i(z^(i)). It is essentially the responsibility of the latent variable according to the current model.

Because the expectation of a function f(z) of a random variable z ∼ Q is given by,

          E_{z∼Q}[ f(z) ] = ∑_{z} Q(z) f(z) --------------------------------- [3696jb]

Here, E_{z∼Q}[f(z)] represents the expected value of the function f(z) with respect to the probability distribution Q. The sum is taken over all possible values of the random variable z.

And the expectation of a random variable z is given by,

          E_{z∼Q}[ z ] = ∑_{z} Q(z) z --------------------------------- [3696jc]

Here, similarly, E_{z∼Q}[z] represents the expected value of the random variable z itself. Again, the sum is taken over all possible values of z according to the probability distribution Q.

In the EM algorithm, these formulas in Equations 3696jb and 3696jc are often used in the E-step to compute the expected values of certain functions or random variables with respect to a given probability distribution.
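
A tiny numeric illustration of Equations 3696jb and 3696jc, with an arbitrary three-point distribution Q chosen for this sketch, is:

import numpy as np

z_values = np.array([0, 1, 2])          # possible values of z
Q = np.array([0.2, 0.5, 0.3])           # hypothetical distribution Q(z)

f = lambda z: z ** 2
E_f_z = np.sum(Q * f(z_values))         # E_{z~Q}[f(z)] = sum_z Q(z) f(z)   (Eq. 3696jb)
E_z = np.sum(Q * z_values)              # E_{z~Q}[z]    = sum_z Q(z) z      (Eq. 3696jc)
print(E_f_z, E_z)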

Then, another way to represent the log-likelihood function in the EM algorithm is,

          ℓ(θ) = ∑_{i=1}^{n} log E_{z^(i)∼Q_i}[ p(x^(i), z^(i); θ) / Q_i(z^(i)) ] --------------------------------- [3696k]

The right-hand side of Equation 3696k represents the log of the expected value of the ratio p(x^(i), z^(i); θ) / Q_i(z^(i)) with respect to the distribution Q_i over the latent variable z^(i).

Based on Jensen's inequality for the log function, which is a concave function, we have,

          f(E[X]) ≥ E[f(X)] for concave f, so that log E_{z^(i)∼Q_i}[X] ≥ E_{z^(i)∼Q_i}[log X] -------------- [3696l]

Then, we have,

          ℓ(θ) ≥ ∑_{i=1}^{n} E_{z^(i)∼Q_i}[ log ( p(x^(i), z^(i); θ) / Q_i(z^(i)) ) ] = ∑_{i=1}^{n} ∑_{z^(i)} Q_i(z^(i)) log [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ] ------------ [3696m]

The right-hand side of Inequality 3696m represents the lower bound shown in Figure 3696a.

The EM algorithm involves iteratively updating the parameters ϕ, μ, and Σ to maximize this log-likelihood function. The E-step computes the expected value of the log-likelihood with respect to the current estimates of the parameters, and the M-step maximizes this expected log-likelihood to update the parameters. The process repeats until convergence. The distribution Q_i(z^(i)) plays a crucial role in this iterative process.

On a given iteration of EM, we want,

          log p(x^(i); θ) = ∑_{z^(i)} Q_i(z^(i)) log [ p(x^(i), z^(i); θ) / Q_i(z^(i)) ] ---- [3696n]

To make Equation 3696n, which is related to the evidence lower bound (ELBO), hold with equality, we need a specific relationship between the distributions involved. The equation involves an expectation with respect to the distribution Q_i(z^(i)). Because the logarithm is a strictly concave function, Jensen's inequality becomes an equality only when the quantity inside the expectation is constant. Therefore, the ratio p(x^(i), z^(i); θ) / Q_i(z^(i)) needs to be a constant with respect to the distribution Q_i(z^(i)); in other words, the ratio should not depend on the specific values of z^(i) drawn from the distribution Q_i.

In practical terms, this often occurs when dealing with exponential families of distributions, where the ratio of two probability density functions is a well-behaved function, and the logarithm of this ratio satisfies the conditions for Jensen's inequality to become an equality.

Let's set,

          Q_i(z^(i)) ∝ p(x^(i), z^(i); θ) ----------------- [3696o]

We know,

          ∑_{z^(i)} Q_i(z^(i)) = 1 ----------------- [3696p]

so normalizing Equation 3696o gives Q_i(z^(i)) = p(x^(i), z^(i); θ) / ∑_{z} p(x^(i), z; θ) = p(z^(i) | x^(i); θ); that is, choosing Q_i to be the posterior over z^(i) makes the lower bound tight at the current θ.

Table 3696. Applications of the Expectation-Maximization (EM) algorithm.

          Application: Expectation-Maximization (EM) algorithm working in Gaussian Mixture Models (GMMs)

 

=================================================================================