Expectation-Maximization (EM) Algorithm - Python Automation and Machine Learning for ICs - An Online Book
Python Automation and Machine Learning for ICs: http://www.globalsino.com/ICs/
=================================================================================

In the Expectation-Maximization (EM) algorithm for estimating the parameters of a mixture model by maximum likelihood, we mainly include:

i) Soft assignment: Instead of assigning a data point exclusively to the cluster with the maximum probability, EM assigns it to all clusters with some probability. The "softness" of the assignment is captured by these probabilities. Mathematically, the assignment is based on the responsibility wj(i):

     wj(i) = p(z(i) = j | x(i); φ, μ, Σ)

where ℓ(φ, μ, Σ) = Σi log p(x(i); φ, μ, Σ) is the log-likelihood of the observed data given the parameters φ, μ, and Σ.

ii) Parameter updates: The cluster parameters, such as the mixing coefficients (φj), means (μj), and covariances (Σj), are updated using weighted averages. The weights are the responsibilities wj(i), indicating the likelihood of a data point belonging to a specific cluster:

     φj = (1/n) Σi wj(i)
     μj = Σi wj(i) x(i) / Σi wj(i)
     Σj = Σi wj(i) (x(i) − μj)(x(i) − μj)T / Σi wj(i)

where n is the number of data points.
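The soft-assignment step can be sketched in a few lines. The toy data, the component count, and the initial parameter values below are assumptions for illustration, not values from the text:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy setup (assumed): 100 points from two 2-D Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

phi = np.array([0.5, 0.5])                 # mixing coefficients
mu = np.array([[0.0, 0.0], [3.0, 3.0]])    # component means
sigma = np.array([np.eye(2), np.eye(2)])   # component covariances

def responsibilities(X, phi, mu, sigma):
    """Soft assignment: W[i, j] = p(z(i) = j | x(i); phi, mu, sigma)."""
    dens = np.column_stack([
        phi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
        for j in range(len(phi))
    ])
    return dens / dens.sum(axis=1, keepdims=True)

W = responsibilities(X, phi, mu, sigma)
# Every point belongs to all clusters with some probability; rows sum to 1.
```

Note that each row of W is a probability distribution over components, which is exactly what "soft" assignment means.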
In a mixture model, we have a dataset with observations x(i) that are assumed to be generated from a mixture of different distributions, each associated with a latent variable z(i). The latent variable indicates from which component of the mixture the observation is generated. The goal is to estimate the parameters φ, μ, and Σ that maximize the likelihood of the observed data; the likelihood function is the probability of observing the entire dataset given the parameters.

The expressions in Equations 3696b and 3696c are the updates for the parameters φj and μj in the M-step of the EM algorithm for a mixture model:

     φj = (1/n) Σi 1{z(i) = j}    (Equation 3696b)
     μj = Σi 1{z(i) = j} x(i) / Σi 1{z(i) = j}    (Equation 3696c)

Here, 1{z(i) = j} is an indicator function that equals 1 if z(i) is equal to j and 0 otherwise. The sums in the denominators are normalization terms that ensure φ represents a valid probability distribution and μj is properly weighted.

Equations 3696b and 3696c are exactly those of Gaussian Discriminant Analysis (GDA), except that y is replaced with z. That is, the main difference lies in the interpretation of the latent variable or label. In GDA, there is typically a categorical label variable y indicating the class to which an observation belongs; the likelihood and parameter estimation involve the class labels, and the focus is on modeling the class-conditional distributions. In a Gaussian Mixture Model (GMM) with EM, on the other hand, there is a latent variable z that represents the component from which a data point is generated in the mixture. This latent variable is associated with the different components of the mixture model, each modeled as a Gaussian distribution.

Based on the evidence lower bound (ELBO) in variational inference, the EM algorithm gives

     ELBO(x; Q, θ) = Σi Σz(i) Qi(z(i)) log [ p(x(i), z(i); θ) / Qi(z(i)) ]

This is the expression for the evidence lower bound, where Qi is a distribution used in variational inference to approximate the posterior distribution p(z(i) | x(i); θ).
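When the labels are observed, as in GDA, the updates above reduce to simple counts and class means. A minimal sketch with hypothetical labeled toy data:

```python
import numpy as np

# Hard-assignment (GDA-style) M-step, i.e. Equations 3696b and 3696c with
# the indicator 1{z(i) = j}. Labels z and data X are assumed toy values.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (20, 2))])
z = np.array([0] * 30 + [1] * 20)   # known component labels

k = 2
phi = np.array([(z == j).mean() for j in range(k)])         # Equation 3696b
mu = np.array([X[z == j].mean(axis=0) for j in range(k)])   # Equation 3696c
# phi is a valid probability distribution; mu[j] is the mean of the
# points assigned to component j.
```

Replacing the 0/1 indicator with the soft responsibilities wj(i) turns these into the GMM updates.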
The goal is to maximize this lower bound during the optimization process. For any θ and Q, its relationship to the likelihood is given by

     ℓ(θ) ≥ ELBO(x; Q, θ)

This inequality holds due to the nature of the variational inference framework: the evidence lower bound is a lower bound on the log-likelihood ℓ(θ). The goal of variational inference is to maximize the lower bound with respect to both the model parameters θ and the variational distribution Q. The inequality ensures that, at each step of the optimization, the lower bound is improved, and as a result the log-likelihood is also improved. Maximizing the lower bound is a surrogate for maximizing the actual log-likelihood, which may be computationally challenging.

The EM algorithm alternates between the E-step (Expectation) and the M-step (Maximization):

i) The E-step estimates the expected value of the latent variables (in this case, z(i)) given the observed data and the current parameter estimates:

     wj(i) = Qi(z(i) = j) = p(z(i) = j | x(i); φ, μ, Σ)
Here, x(i) is the observed data point, μj is the mean vector of the j-th component, Σj is the covariance matrix of the j-th component, and σ2j is the variance of the j-th component (in the univariate case). We can then write

     p(z(i) = j | x(i); φ, μ, Σ) = p(x(i) | z(i) = j; μ, Σ) p(z(i) = j; φ) / Σl p(x(i) | z(i) = l; μ, Σ) p(z(i) = l; φ)    (Equation 3696gba)

Equation 3696gba represents the conditional probability of the latent variable z(i) taking on the value j given the observed data x(i) and the current parameter estimates φ, μ, and Σ. z(i) can be a multinomial variable denoted by z or any other symbol. In probabilistic models, the term "multinomial" refers to a categorical distribution, a discrete probability distribution over a finite set of categories; the parameter φ represents the probabilities associated with each category in the multinomial distribution.

In summary, in the E-step, we set the distribution over the latent variables z(i) to be

     Qi(z(i) = j) = p(z(i) = j | x(i); φ, μ, Σ)

ii) During the M-step, we optimize the log-likelihood ℓ(θ) of the data with respect to the model parameters: the parameters are updated by maximizing the expected log-likelihood. The parameters φj and μj are among the parameters estimated in the EM algorithm for GMMs during the M-step:

     φj = (1/n) Σi wj(i)    (Equation 3696h)
     μj = Σi wj(i) x(i) / Σi wj(i)    (Equation 3696i)

In Equation 3696h, φj represents the mixing coefficient for the j-th component of the mixture; it is updated by taking the average of the responsibilities over all data points. In Equation 3696i, μj is the mean vector of the j-th component; it is updated by taking the weighted average of the data points x(i), using the responsibilities as weights. Equations 3696h and 3696i ensure that φ remains a valid probability distribution and that μj is updated based on the weighted average of the data points, where the weights are given by the responsibilities.
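Given a matrix of responsibilities, these M-step updates are one-liners. The responsibilities and data below are made-up toy values for illustration:

```python
import numpy as np

# M-step updates of the Equation 3696h / 3696i form, given a responsibility
# matrix W with W[i, j] = wj(i). X and W are assumed toy values.
X = np.array([[0.0], [1.0], [4.0], [5.0]])
W = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])   # each row sums to 1

phi = W.mean(axis=0)                       # average responsibility per component
mu = (W.T @ X) / W.sum(axis=0)[:, None]    # responsibility-weighted mean
```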
These updates are performed iteratively in the EM algorithm until convergence, aiming to find the parameters that maximize the likelihood of the observed data in the presence of latent variables. Inequality 3696icd shows that in the E-step we maximize J with respect to Q, while in the M-step we maximize J with respect to θ:
The E-step involves maximizing the evidence lower bound J with respect to the distribution of the latent variables Q, treating the parameters θ as fixed. The M-step involves maximizing the evidence lower bound J with respect to the model parameters θ, treating the distribution of the latent variables Q as fixed. The EM algorithm iteratively alternates between these two steps until convergence: the E-step computes the expected value of the latent variables, and the M-step updates the parameters of the model based on this expectation. The algorithm aims to find a set of parameters that maximizes the observed-data log-likelihood, taking into account the presence of latent variables. Therefore, the EM algorithm is often referred to as a coordinate ascent algorithm. In summary, in the M-step we update θ by

     θ := argmaxθ Σi Σz(i) Qi(z(i)) log [ p(x(i), z(i); θ) / Qi(z(i)) ]

The right-hand side of this optimization problem is constructed as a lower bound for the log-likelihood. This is a key concept in variational inference and in the derivation of the evidence lower bound (ELBO); the optimization problem above is often associated with maximization of the ELBO, and the goal is to find the parameters that maximize it. The ELBO is a lower bound on the log-likelihood because of Jensen's inequality. Specifically, for any distribution Qi(z(i)), Jensen's inequality implies that

     log Σz(i) p(x(i), z(i); θ) ≥ Σz(i) Qi(z(i)) log [ p(x(i), z(i); θ) / Qi(z(i)) ]

Maximizing the sum over i in the optimization problem is therefore equivalent to maximizing a lower bound on the log-likelihood. This approach is commonly used in variational inference to approximate intractable posterior distributions by optimizing a tractable lower bound. The optimization process seeks to find both the model parameters θ and the variational parameters that define the distribution Qi(z(i)).

A probabilistic model often used in the EM algorithm, particularly for mixture models such as Gaussian Mixture Models (GMMs), is given by the distributional assumptions below.
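The alternation between the two maximizations can be sketched as a coordinate-ascent loop. This is a minimal 1-D, 2-component GMM sketch; the data, the initialization, and the convergence tolerance are all assumptions for illustration:

```python
import numpy as np
from scipy.stats import norm

# Coordinate-ascent EM for a 1-D, 2-component GMM (assumed toy data).
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

phi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def log_likelihood(x, phi, mu, var):
    dens = sum(phi[j] * norm.pdf(x, mu[j], np.sqrt(var[j])) for j in range(2))
    return np.log(dens).sum()

prev = -np.inf
for _ in range(100):
    # E-step: maximize J over Q with theta fixed -> posterior responsibilities.
    dens = np.column_stack([phi[j] * norm.pdf(x, mu[j], np.sqrt(var[j]))
                            for j in range(2)])
    w = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize J over theta with Q fixed.
    nj = w.sum(axis=0)
    phi = nj / len(x)
    mu = (w * x[:, None]).sum(axis=0) / nj
    var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    cur = log_likelihood(x, phi, mu, var)
    if cur - prev < 1e-8:   # each iteration cannot decrease the likelihood
        break
    prev = cur
```

Because each half-step maximizes the same objective J over one block of variables while the other is held fixed, the log-likelihood is non-decreasing across iterations, which is the coordinate-ascent view of EM.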
Equation 3696ida represents the assignment of observation i to the j-th component of the mixture model; in GMMs, each component is associated with a Gaussian distribution:

     z(i) = j    (Equation 3696ida)

Equation 3696idb is the probability density function (PDF) of the observed data x(i) given the assignment and the model parameters θ. In a Gaussian mixture model, this PDF is a multivariate Gaussian with mean μj and covariance matrix Σj for the j-th component:

     p(x(i) | z(i) = j; θ) = N(x(i); μj, Σj)    (Equation 3696idb)

Equation 3696idc represents the prior probability of assignment to the j-th component; in GMMs, φj is often interpreted as the weight or mixing coefficient associated with the j-th component:

     p(z(i) = j) = φj    (Equation 3696idc)

Equation 3696idd is the estimated probability (or responsibility) that observation i belongs to the j-th component during the E-step of the EM algorithm; it is computed as part of the EM algorithm and is proportional to the likelihood of the data given the current parameters:

     wj(i) = p(z(i) = j | x(i); θ)    (Equation 3696idd)

The maximum likelihood estimate of the mean parameter is found by taking the derivative of the log-likelihood function with respect to μj and setting it equal to 0; this yields the values of μj that maximize the likelihood of the observed data. The resulting parameter update in the M-step of the EM algorithm, particularly for GMMs, is

     μj = Σi wj(i) x(i) / Σi wj(i)    (Equation 3696ide)

where μj are the means, Σj are the covariances, and φj are the mixing coefficients. Equation 3696ide is the update rule for the mean of the j-th component during the M-step; it is based on the responsibilities assigned to the j-th component during the E-step when optimizing the log-likelihood of the data. Similarly, update equations for the other parameters can be obtained. For instance, the covariance matrix update is typically obtained by setting the derivative of the log-likelihood with respect to Σj to 0.
The update equation involves the responsibility-weighted sum of the outer products of the centered data points. Therefore, Σj can be given by

     Σj = Σi wj(i) (x(i) − μj)(x(i) − μj)T / Σi wj(i)

This update ensures that Σj maximizes the likelihood of the data under the assumption of a Gaussian distribution. The update for the mixing coefficient φj can be given by

     φj = (1/n) Σi wj(i)

where n is the total number of data points. This update ensures that φj reflects the proportion of data points assigned to the j-th component. The mixing coefficient φj represents the prior probability of selecting the j-th component; its update is obtained by setting the derivative of the log-likelihood with respect to φj to 0, subject to the constraint that the φj sum to 1.

As an example of the EM algorithm, assume we want to maximize the L(θ) of the blue curve in Figure 3696a, with the steps below:
i) In the first E-step, we randomly initialize θ, e.g. at θ0 = 2.25, and construct a lower bound (shown in green) for the log-likelihood with the properties below:
i.a) Everywhere on the lower-bound (green) curve is lower than the blue curve.
i.b) The blue curve and the green curve have a common point, which is the initialization point on the blue curve.
ii) In the first M-step, we find the maximum of the green curve, which is the point Max1.
iii) In the second E-step, we use the θ value of the point Max1 as θ1 and repeat steps i) and ii).
iv) These steps are repeated until a local maximum is found.
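The covariance update, with its weighted outer products, can be sketched as follows; X, W, and the derived means are assumed toy values:

```python
import numpy as np

# Covariance update: responsibility-weighted outer products of centered
# data points, one covariance matrix per component.
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
W = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
mu = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted means, one per component

sigma = []
for j in range(W.shape[1]):
    D = X - mu[j]                                     # center on component j
    sigma.append((W[:, j, None] * D).T @ D / W[:, j].sum())
sigma = np.array(sigma)   # sigma[j] is the covariance of component j
```

Each sigma[j] is symmetric by construction, since it is a weighted sum of outer products d d^T.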
Equation 3696a can be modified to

     ℓ(θ) = Σi log Σz(i) Qi(z(i)) [ p(x(i), z(i); θ) / Qi(z(i)) ]

where Qi(z(i)) represents a probability distribution over the latent variable z(i) for the i-th observation. The expectation of a function g(z) with respect to a distribution Q is given by

     E[g(z)] = Σz Q(z) g(z)    (Equation 3696jb)

Here, E[g(z)] represents the expected value of the function g with respect to the probability distribution Q; the sum is taken over all possible values of the random variable z. And the expectation of a random variable itself is given by

     E[z] = Σz Q(z) z    (Equation 3696jc)

Again, the sum is taken over all possible values of z according to the probability distribution Q. In the EM algorithm, the formulas in Equations 3696jb and 3696jc are often used in the E-step to compute the expected values of certain functions or random variables with respect to a given probability distribution. Another way to represent the log-likelihood function in the EM algorithm is then

     ℓ(θ) = Σi log Ez(i)~Qi [ p(x(i), z(i); θ) / Qi(z(i)) ]    (Equation 3696k)

The right-hand side of Equation 3696k represents the log of the expected value of the ratio p(x(i), z(i); θ) / Qi(z(i)) with respect to the distribution Qi over the latent variable z(i). Based on Jensen's inequality for the log function, which is concave, we have

     log E[X] ≥ E[log X]

Then, we have

     ℓ(θ) ≥ Σi Ez(i)~Qi [ log ( p(x(i), z(i); θ) / Qi(z(i)) ) ]    (Inequality 3696m)

The right-hand side of Inequality 3696m represents the lower bounds shown in Figure 3696a. The EM algorithm involves iteratively updating the parameters φ, μ, and Σ to maximize this log-likelihood function. The E-step computes the expected value of the log-likelihood with respect to the current estimates of the parameters, and the M-step maximizes this expected log-likelihood to update the parameters; the process repeats until convergence. The distribution Qi(z(i)) plays a crucial role in this iterative process. On a given iteration of EM, we want the bound to hold with equality at the current θ:

     ℓ(θ) = Σi Ez(i)~Qi [ log ( p(x(i), z(i); θ) / Qi(z(i)) ) ]    (Equation 3696n)

To enable Equation 3696n, which is related to the evidence lower bound (ELBO), to be true, we need a specific relationship between the distributions involved.
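Jensen's inequality for the concave log function can be checked numerically. The distribution Q and the values X below are arbitrary assumed values:

```python
import numpy as np

# Numeric check of Jensen's inequality for the concave log:
# log E[X] >= E[log X].
Q = np.array([0.2, 0.5, 0.3])   # a probability distribution
X = np.array([1.0, 4.0, 9.0])   # positive values of a random variable

log_of_mean = np.log(np.dot(Q, X))   # log E[X]
mean_of_log = np.dot(Q, np.log(X))   # E[log X]
# The two sides agree only when X is constant, which is exactly the
# condition that makes the EM lower bound tight.
```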
This equation involves an expectation with respect to the distribution Qi(z(i)). Jensen's inequality becomes an equality when the quantity inside the expectation is constant with respect to z(i); in practical terms, this often occurs when dealing with exponential families of distributions, where the ratio of two probability density functions is a well-behaved function. Let's set

     Qi(z(i)) = p(z(i) | x(i); θ)

We know

     p(x(i), z(i); θ) / Qi(z(i)) = p(x(i), z(i); θ) / p(z(i) | x(i); θ) = p(x(i); θ)

which does not depend on z(i), so the lower bound is tight at the current θ.
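The tightness of the bound under this choice of Qi can also be checked numerically, on an assumed toy 1-D, 2-component mixture (all parameter values below are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Check that Qi(z) = p(z | x; theta) makes p(x, z; theta) / Qi(z) constant
# in z, so Jensen's inequality holds with equality.
phi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
x = 0.5

joint = phi * norm.pdf(x, mu, 1.0)   # p(x, z = j; theta) for j = 0, 1
Q = joint / joint.sum()              # posterior p(z = j | x; theta)
ratio = joint / Q                    # equals p(x; theta) for every j

# log E_Q[ratio] equals E_Q[log ratio] because the ratio is constant.
```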
============================================