Logistic Regression
- Python for Integrated Circuits -
- An Online Book -

=================================================================================

Logistic regression is a statistical method used for binary classification, which means it is employed to predict the probability of an observation belonging to one of two possible classes or categories. It is a type of regression analysis used for predicting a binary outcome (1 / 0, Yes / No, True / False) based on one or more predictor variables. It is a widely used machine learning algorithm in various fields, including medicine, finance, and marketing, due to its simplicity and interpretability.

The key idea behind logistic regression is to model the relationship between a set of independent variables (features) and a binary dependent variable (target) by using the logistic function (also known as the sigmoid function). The logistic function maps any input value to a value between 0 and 1, namely hθ(x) ∈ [0,1], which can be interpreted as the probability of the observation belonging to one of the two classes. Then, the hypothesis function is given by:

          hθ(x) = g(θᵀx) ---------------------- [3876a]

where:

  • hθ(x) is the predicted probability that the input example belongs to the positive class (class 1).
  • g is the logistic (sigmoid) function, which maps any real number to the range [0, 1].
  • θ is the vector of model parameters, and θᵀ is its transpose.
  • x is the feature vector of the input example.
  • x ∈ ℝn+1, with x0 = 1 (a constant intercept term).

g(z) is given by,

          g(z) = 1 / (1 + e^(-z)) ---------------------- [3876b]

Overall, we have,

          hθ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx)) ---------------------- [3876c]

E[y|x; θ] can be interpreted as the mean of a Bernoulli distribution in many applications, and it is this mean that hθ(x) models.

In general, the formula for the logistic function can be given by,

          h = 1 / (1 + e^(-(β0 + β1x1 + β2x2 + ... + βnxn))) ---------------------- [3876d]

where,

  • h is the probability of the observation belonging to class 1.
  • β0 is the intercept.
  • β1, ..., βn are the coefficients associated with the independent variables x1, ..., xn.
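
As a minimal illustration of Equations 3876a through 3876d (this is not the book's own code; the function and variable names below are chosen only for clarity), the sigmoid and hypothesis functions can be sketched in NumPy:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z)), Equation 3876b
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x), Equation 3876a; x is assumed to include x0 = 1
    return sigmoid(np.dot(theta, x))

theta = np.array([-1.0, 2.0])      # [intercept, weight], arbitrary example values
x = np.array([1.0, 0.8])           # [x0 = 1, feature value]
print(hypothesis(theta, x))        # predicted probability that y = 1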

The plot in Figure 3876a shows a logistic regression decision boundary with two features.

Figure 3876a. (a) Logistic regression decision boundary with two features (Python code), and (b) Logistic (Sigmoid) Function (Python code).
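
The original Python code behind Figure 3876a is linked above; a hedged sketch of how such a decision boundary can be produced (the synthetic data and the use of scikit-learn here are assumptions of this sketch, not the book's exact script) is:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Two synthetic features and a binary label
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

# The decision boundary is the line theta0 + theta1*x1 + theta2*x2 = 0
w, b = clf.coef_[0], clf.intercept_[0]
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = -(b + w[0] * x1) / w[1]

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k')
plt.plot(x1, x2, 'k--', label='decision boundary')
plt.xlabel('feature 1'); plt.ylabel('feature 2'); plt.legend(); plt.show()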

In logistic regression, we have:

          p(y=1|x; θ) = hθ(x) ---------------------- [3876e]

          p(y=0|x; θ) = 1 - hθ(x) ---------------------- [3876f]

By combining Equations 3876e and 3876f, we can get,

          p(y|x; θ) = (hθ(x))^y (1 - hθ(x))^(1-y) ---------------------- [3876g]

The expression in Equation 3876e represents the probability that the binary outcome variable y is equal to 1 (often considered the "positive" or "success" class, i.e., Pr(y=1)) given a set of input features x and the model parameters θ. hθ(x) is the hypothesis function in logistic regression and is also referred to as the logistic or sigmoid function. The sigmoid function takes the linear combination of the input features x and model parameters θ and "squashes" it into a range between 0 and 1; its formula is given in Equation 3876b.

In Equation 3876f, the probability that y is equal to 0 (the negative class) is given by 1 minus the probability that y is equal to 1. This is because there are only two possible outcomes in binary logistic regression, and the probabilities of these two outcomes must sum to 1.

We know that the likelihood function can be given by (see page3997),

          L(θ) = p(y(1), ..., y(m) | X; θ) ---------------------- [3876h]

          L(θ) = Πi=1,...,m p(y(i) | x(i); θ) ---------------------- [3876i]

where,

  • L(θ) represents the likelihood of the parameters θ.
  • Π denotes the product of all terms in the sequence.

In logistic regression, we aim to maximize the likelihood (Equation 3876i) of the observed data, which is represented as the product of the conditional probabilities of the actual outcomes given the input data and model parameters.

Combining Equations 3876g and 3876h, we can get,

          L(θ) = Πi=1,...,m (hθ(x(i)))^y(i) (1 - hθ(x(i)))^(1-y(i)) ------------ [3876j]

Then, we have,

          ℓ(θ) = log L(θ) = Σi=1,...,m [y(i) log hθ(x(i)) + (1 - y(i)) log(1 - hθ(x(i)))] ------------ [3876k]

The goal of the calculation above is to maximize the log-likelihood ℓ(θ) by choosing a proper θ.
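
A minimal NumPy sketch of Equation 3876k, assuming a design matrix X whose first column is all ones and labels y in {0, 1} (the names below are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # l(theta) = sum over i of y(i)*log(h(x(i))) + (1 - y(i))*log(1 - h(x(i)))
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Training amounts to choosing theta that maximizes log_likelihood(theta, X, y).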

The cross-entropy loss function can be given by,

          L(y, ŷ) = -[y log(ŷ) + (1 - y) log(1 - ŷ)] -------------------------- [3876km]

where,

  • ŷ typically represents the predicted probability that a given input belongs to the positive class (class 1).
  • y represents the actual binary label (0 or 1) of the instance.
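
The cross-entropy loss is the negative log-likelihood (usually averaged over the m training examples), so minimizing it is equivalent to maximizing the likelihood. A small illustrative sketch (the clipping constant is an assumption added here to avoid log(0)):

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Average of -[y*log(y_hat) + (1 - y)*log(1 - y_hat)] over all examples
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))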

To find the optimal parameters θ, we typically maximize the log-likelihood, which is the natural logarithm of the likelihood:

          ℓ(θ) = Σi=1,...,m [y(i) log hθ(x(i)) + (1 - y(i)) log(1 - hθ(x(i)))] ---------------------- [3876kb]

With regularization taken into consideration, we have,

          ℓreg(θ) = Σi=1,...,m [y(i) log hθ(x(i)) + (1 - y(i)) log(1 - hθ(x(i)))] - λ||θ||² ---------------------- [3876kc]

This is the regularized log-likelihood function used in regularized logistic regression (also known as L2 regularization or Ridge regularization). In this case, we have an additional term, - λ||θ||², which is a penalty term on the magnitude of the model parameters θ. This term is used to prevent overfitting by discouraging overly complex models with large parameter values. The λ (lambda) parameter is a hyperparameter that controls the strength of the regularization; larger values of λ lead to stronger regularization. The difference between Equations 3876kb and 3876kc lies in the regularization term "- λ||θ||²" in the second expression. The regularization term encourages the model to have smaller parameter values, which can improve the generalization of the model by reducing overfitting. It balances the trade-off between fitting the data well and keeping the model parameters small.

Logistic regression with regularization, such as L1 or L2 regularization (e.g., Lasso or Ridge), can work well for text classification, especially when dealing with a large number of features (words or tokens). Regularization helps prevent overfitting and can improve generalization to some extent.
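
As a hedged sketch of this use case (the toy texts and labels below are invented purely for illustration), scikit-learn's LogisticRegression applies L2 regularization by default, with C acting as the inverse of the regularization strength λ:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["good wafer yield", "defect found on die", "excellent yield", "bad defect"]
labels = [1, 0, 1, 0]                      # toy binary labels

# penalty='l2' and C control the regularization term - lambda*||theta||^2
model = make_pipeline(TfidfVectorizer(),
                      LogisticRegression(penalty='l2', C=1.0))
model.fit(texts, labels)
print(model.predict(["yield looks good"]))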

Equation 3876j indicates that the general update rule for θj using the gradient of the loss function is as follows (see page3875):

          θj := θj + α ∂ℓ(θ)/∂θj ------------ [3876l]

SGD is a gradient-based optimization algorithm that updates the model parameters in small steps by iteratively adjusting the model's weights; for logistic regression, this corresponds to maximizing the log-likelihood or, equivalently, minimizing the cross-entropy loss.

This equation indicates that we need to maximize the log-likelihood. Then, the update rule of θj is given by,

          θj := θj + α (y(i) - hθ(x(i))) xj(i) ------------------------ [3876m]
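
A minimal sketch of this update applied example-by-example (stochastic gradient ascent on the log-likelihood; the array shapes and learning rate below are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_fit(X, y, alpha=0.1, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            # theta_j := theta_j + alpha * (y(i) - h_theta(x(i))) * x_j(i)
            error = y[i] - sigmoid(X[i] @ theta)
            theta += alpha * error * X[i]
    return theta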

To fit a logistic regression model, you typically use a method like maximum likelihood estimation to estimate the coefficients (θ) that maximize the likelihood of the observed data given the model. Once the model is trained, you can use it to make predictions by calculating the probability of an observation belonging to one of the classes and then applying a threshold (usually 0.5) to classify the observation into one of the two classes.

Logistic regression is particularly useful when you have a binary outcome variable and want to understand how a set of predictor variables influence the likelihood of an event occurring. It is a linear model, which means it assumes a linear relationship between the independent variables and the log-odds of the binary outcome. If there are more than two classes in the dependent variable, extensions of logistic regression such as multinomial logistic regression or ordinal logistic regression can be used.

The probability distribution of the dependent variable in logistic regression follows:

          1) A Bernoulli distribution for binary logistic regression.

          2) A categorical distribution for multinomial logistic regression.

If we have fixed parameters Φ, μ0, μ1, and ε, then p(y=1|x; Φ, μ0, μ1, ε) is a function of x, given by the conditional probability expression,

          p(y=1|x; Φ, μ0, μ1, ε) = p(x|y=1; μ1, ε) p(y=1; Φ) / p(x; Φ, μ0, μ1, ε) ------------------------ [3876n]

Equation 3876n relates the probability of y=1 given x to the likelihood of observing x given y=1, the prior probability of y=1, and the marginal probability of x.

In this expression, the parameters Φ, μ0, μ1, and ε are fixed, meaning they are treated as constants and not estimated from data. Under these conditions, the equation can be seen as a representation of Bayesian inference, where the probability of y=1 given x is expressed as a function of other known probabilities:

  1. p(y=1|x; Φ, μ0, μ1, ε) is the posterior probability of y=1 given the observed data x and the fixed parameters. It is what we want to estimate or predict.
  2. p(x|y=1; μ1, ε) represents the likelihood, which is the probability of observing the data x given that y=1, under the fixed parameters μ1 and ε.
  3. p(y=1; Φ) is the prior probability of y=1 given the fixed parameter Φ. It represents the prior belief about the probability of y=1 based on this parameter.
  4. p(x; Φ, μ0, μ1, ε) is the marginal likelihood, which is the probability of observing the data x under the fixed parameters Φ, μ0, μ1, and ε.

Figure 3876b shows the probability density functions and the probability p(y=1|x) of logistic regression with fixed Φ, μ0, μ1, and ε. In Figure 3876b (b), the logistic regression model is used to calculate p(y=1|x) and p(y=0|x) over the range of feature values (x).

Figure 3876b. Logistic regression with fixed Φ, μ0, μ1, and ε: (a) Probability Density (Python code), and (b) Probability (Python code).
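
A hedged sketch of the calculation behind a figure like 3876b for a single feature (the parameter values below are arbitrary, and ε is taken here as the common standard deviation of the two Gaussian class-conditional densities):

import numpy as np
from scipy.stats import norm

phi, mu0, mu1, eps = 0.5, -1.0, 1.0, 1.0
x = np.linspace(-5, 5, 200)

p_x_y1 = norm.pdf(x, mu1, eps)            # p(x | y=1; mu1, eps)
p_x_y0 = norm.pdf(x, mu0, eps)            # p(x | y=0; mu0, eps)
# Bayes' rule, Equation 3876n: posterior probability that y = 1
p_y1_x = p_x_y1 * phi / (p_x_y1 * phi + p_x_y0 * (1 - phi))
# p_y1_x rises from 0 to 1 as x increases, tracing the sigmoid shape in (b)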

If we have two features, then we will get the probability distributions shown in Figure 3876c.

Figure 3876c. Probability (Python code).
         

As discussed in support-vector machines (SVM), we have,

          ŷ = g(Σi=1,...,n wi xi + b) --------------------------------- [3876o]

where,

          g is the activation function.

          n is the number of input features.

With Equations 3876a and 3876o, we can see that the terms correspond as shown below:

          ŷ = σ(θᵀx) = σ(θ0 + θ1x1 + ... + θnxn)

          ŷ = σ(wᵀx + b)

so the weights w play the role of θ1, ..., θn, and the bias b plays the role of the intercept term θ0.

Equation 3876o is a basic representation of a single-layer neural network, also known as a perceptron or logistic regression model, depending on the choice of the activation function g. From Equation 3876o, we can derive different forms or variations by changing the activation function, the number of layers, or the architecture of the neural network as shown in Table 3876a.

Table 3876a. Different forms or variations of Equation 3876o.

  • Linear Regression: Set g(z) = z (the identity function). This simplifies the equation to ŷ = wᵀx + b, which is the formula for linear regression.
  • Logistic Regression: Set g(z) = 1 / (1 + e^(-z)) (the sigmoid function). This is a binary classification model, and the equation becomes the logistic regression model.
  • Multi-layer Neural Network: You can add more layers to the network by introducing new sets of weights and biases, and applying activation functions at each layer. This leads to a more complex model.
  • Different Activation Functions: You can choose different activation functions for different characteristics of your model. For example, you can use ReLU, tanh, or other non-linear activation functions instead of the sigmoid function.
  • Deep Learning Architectures: You can create more complex neural network architectures, such as convolutional neural networks (CNNs) for image data or recurrent neural networks (RNNs) for sequential data.
  • Regularization: You can add regularization terms, such as L1 or L2 regularization, to the loss function to prevent overfitting.
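
A compact sketch of the first rows of Table 3876a, swapping the activation g in Equation 3876o (the example numbers are arbitrary):

import numpy as np

def layer(x, w, b, g):
    # Single-layer model: y_hat = g(w . x + b), Equation 3876o
    return g(np.dot(w, x) + b)

identity = lambda z: z                          # linear regression
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic regression
relu     = lambda z: np.maximum(0.0, z)         # a common non-linear choice

x, w, b = np.array([0.5, -1.2]), np.array([0.8, 0.3]), 0.1
for g in (identity, sigmoid, relu):
    print(layer(x, w, b, g))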

As an example, assume we have a dataset S = {(x(i), y(i))}, where i = 1, ..., m. To maximize the posterior probability p(θ|S), we can use Bayes' theorem:

          p(θ|S) = p(S|θ) p(θ) / p(S) --------------------------------- [3876p]

where,

          p(S) is a constant.

Then, to maximize p(θ|S), we need to maximize p(S|θ) * p(θ), which is,

          θMAP = arg maxθ [Πi=1,...,m p(y(i) | x(i), θ)] p(θ) --------------------------------- [3876q]

In machine learning, with a Gaussian assumption, the prior distribution of θ can be given by,

          θ ~ N(0, τ²I) ------------------- [3876r]

This is a Gaussian (Normal) distribution with mean zero and a covariance matrix of τ2I, where I is the identity matrix. This is a common choice for the prior distribution in Bayesian regularization in machine learning.
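
A minimal sketch connecting Equations 3876q and 3876r: with the Gaussian prior, the log-posterior (up to a constant) is the log-likelihood plus -||θ||²/(2τ²), which plays the same role as the L2 penalty in Equation 3876kc (the variable names below are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_objective(theta, X, y, tau=1.0):
    h = sigmoid(X @ theta)
    log_likelihood = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    log_prior = -np.dot(theta, theta) / (2.0 * tau**2)   # from theta ~ N(0, tau^2 I)
    return log_likelihood + log_prior    # maximize this over theta (MAP estimation)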

Table 3876b. Applications and related concepts of logistic regression.

  • Batch gradient descent (Introduction)
  • Stochastic gradient descent (SGD): logistic regression can be addressed by applying SGD (Introduction)
  • Soft Margin versus Hard Margin (Introduction)
  • Discriminative algorithms (Introduction)

Note that bias (see page3352) can be introduced into a logistic regression model. As we know, the logistic regression model predicts the probability that a given input belongs to a particular class. The basic form of logistic regression is:

          P(y=1|x) = 1 / (1 + e^(-(β0 + β1x1 + ... + βnxn))) ----------------- [3876s]

Bias Introduction: If the dataset used to estimate the coefficients β has historical biases or lacks representation from all groups, the model will learn these biases. For example, if a hiring dataset has fewer examples of successful female candidates due to historical biases, the model might undervalue qualifications typically held by women.

Mitigation of bias: To avoid this, we can:

  • Resample the dataset to ensure all classes are equally represented.
  • Reweight the loss function to give higher importance to underrepresented classes (a sketch of this option follows the list).
  • Include fairness constraints during optimization that adjust for disparities in model predictions across groups.
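
As a hedged sketch of the reweighting option above, scikit-learn's LogisticRegression accepts a class_weight argument that scales each class's contribution to the loss (the training arrays are placeholders here):

from sklearn.linear_model import LogisticRegression

# 'balanced' weights each class inversely to its frequency in the training data;
# an explicit mapping such as {0: 1.0, 1: 5.0} can also be supplied.
clf = LogisticRegression(class_weight='balanced')
# clf.fit(X_train, y_train)    # X_train, y_train: the (possibly imbalanced) dataset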

=================================================================================