Cross entropy (log loss/logistic loss)

- Python for Integrated Circuits -
- An Online Book -

Python for Integrated Circuits http://www.globalsino.com/ICs/


=================================================================================

Cross entropy, also known as log loss or logistic loss, is a concept from information theory and statistics that measures the dissimilarity between two probability distributions. In machine learning and data science, it is often used as a loss function to evaluate the performance of classification models, particularly in binary and multi-class classification problems.

In machine learning, cross entropy is commonly used to compare the predicted probability distribution generated by a model with the true probability distribution of the data. The calculation proceeds as follows:

You have two probability distributions: one is the "true" distribution (usually denoted as P), and the other is the "predicted" or "estimated" distribution (usually denoted as Q).
For each event or outcome in the distribution, you calculate the probability of that event occurring according to both P and Q.
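Then, you sum up the products of the probability of each event under P and the logarithm of the probability of the same event under Q, and negate the result. Information theory usually takes the logarithm with a base of 2, while machine learning usually uses the natural logarithm. The formula for cross entropy between P and Q is often written as: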

H(P, Q) = -Σ [p(x) * log(q(x))] ----------------------------------------------------- [3979a]

$$H(P, Q) = -\sum_{x} p(x) \log q(x)$$ ----------------------------------------------------- [3979b]

Where:

H(P, Q) is the cross entropy between distributions P and Q.
p(x) is the probability of event x according to the true distribution P.
q(x) is the probability of event x according to the predicted distribution Q.
The sum is taken over all possible events or outcomes.

The result is a non-negative value, and it quantifies how well the estimated distribution Q represents the true distribution P. Lower cross-entropy indicates a better match between the two distributions.
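As a minimal sketch of Equation 3979a, the function below computes the cross entropy between two discrete distributions given as lists of probabilities. The function name `cross_entropy`, the example distributions, and the convention that events with p(x) = 0 contribute nothing are assumptions made for illustration; the natural logarithm is used.

```python
import math

def cross_entropy(p, q):
    """Return H(P, Q) = -sum_x p(x) * log(q(x)) in nats.

    p and q are sequences of probabilities over the same events.
    Events with p(x) == 0 are skipped (the usual 0 * log(...) = 0 convention).
    math.log raises ValueError if q(x) == 0 for an event with p(x) > 0.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over three events.
p_true = [0.7, 0.2, 0.1]
q_good = [0.6, 0.3, 0.1]   # close to p_true
q_poor = [0.1, 0.2, 0.7]   # far from p_true

print(cross_entropy(p_true, q_good))  # smaller value
print(cross_entropy(p_true, q_poor))  # larger value
```

The second value is larger because q_poor places little probability on the event that P considers most likely.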

Cross entropy is commonly used as the loss function in classification problems. It takes the following forms:

Binary Classification: In a binary classification problem, you have two classes, typically denoted as 0 and 1. Let's say you have a true probability distribution P(y) over these classes and a predicted probability distribution Q(y) produced by your machine learning model. The cross-entropy loss for binary classification is calculated as:

$$H(P, Q) = -\sum_{y \in \{0, 1\}} P(y) \log Q(y)$$ --------------------------- [3979c]

Here, "y" represents the class label (0 or 1), and the sum is taken over both classes. This formula penalizes the model more when it makes incorrect predictions with high confidence.
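The claim that confident mistakes are penalized more can be checked numerically. The sketch below sums over the two classes y = 0 and y = 1 for a true label of 1 and lets the predicted probability of class 1 shrink. The function name `binary_cross_entropy` and the chosen probabilities are illustrative assumptions.

```python
import math

def binary_cross_entropy(p1, q1):
    """Cross entropy over the classes {0, 1} (natural logarithm).

    p1 is the true probability of class 1 and q1 the predicted probability
    of class 1; class 0 gets the complementary probabilities.
    """
    total = 0.0
    for p, q in ((1.0 - p1, 1.0 - q1), (p1, q1)):  # y = 0, then y = 1
        if p > 0:
            total -= p * math.log(q)
    return total

# True class is 1 (p1 = 1.0); the model's confidence in class 1 varies.
for q1 in (0.9, 0.6, 0.4, 0.1, 0.01):
    print(f"Q(1) = {q1:<4}  loss = {binary_cross_entropy(1.0, q1):.4f}")
```

The loss grows slowly while Q(1) stays close to 1 and much faster as Q(1) approaches 0, which is the "incorrect with high confidence" regime.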

Multi-class Classification: In multi-class classification problems with more than two classes, the cross-entropy loss is calculated similarly, but it considers all classes. Let's say you have C classes, and P(y) is a one-hot encoded vector representing the true class, and Q(y) is the predicted probability distribution over the classes. The cross-entropy loss for multi-class classification is:

$$H(P, Q) = -\sum_{y=1}^{C} P(y) \log Q(y)$$ --------------------------------- [3979d]

Here, the sum is taken over all C classes.
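With a one-hot true vector, every term of the sum except the one for the true class is zero, so the loss reduces to the negative log of the probability predicted for the true class. The following sketch illustrates this; the function name, the choice of C = 4, and the probabilities are hypothetical.

```python
import math

def multiclass_cross_entropy(p, q):
    """Cross entropy summed over all C classes (natural logarithm)."""
    return -sum(pc * math.log(qc) for pc, qc in zip(p, q) if pc > 0)

# Hypothetical example with C = 4 classes; the true class has index 2.
p_one_hot = [0.0, 0.0, 1.0, 0.0]
q_pred = [0.1, 0.2, 0.6, 0.1]

loss = multiclass_cross_entropy(p_one_hot, q_pred)
true_index = p_one_hot.index(1.0)

# With a one-hot target, only the true class contributes to the sum,
# so both printed numbers are the same.
print(loss, -math.log(q_pred[true_index]))
```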

The goal in machine learning is to minimize the cross-entropy loss. The loss reaches its minimum when your model's predicted probabilities match the true probabilities (i.e., when Q(y) equals P(y)). That minimum equals the entropy of P, which is zero when P is one-hot encoded, as is typical for classification labels. As the predicted probabilities diverge from the true probabilities, the loss increases, indicating that the model's predictions are becoming less accurate.
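Where the minimum sits can be checked with a small experiment. Cross entropy decomposes as H(P, Q) = H(P) + KL(P || Q), and the KL term is never negative, so no Q can do better than Q = P. For a one-hot P the floor H(P) is zero; for a P that spreads probability over several classes the floor is positive. The distributions and function names below are assumptions chosen for illustration.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in nats; classes with p == 0 contribute nothing."""
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Entropy H(P), i.e. the cross entropy of P with itself."""
    return cross_entropy(p, p)

# Hypothetical "soft" true distribution over three classes.
p_true = [0.5, 0.3, 0.2]
candidates = [
    [0.5, 0.3, 0.2],   # Q equal to P
    [0.4, 0.4, 0.2],
    [0.2, 0.3, 0.5],
    [0.8, 0.1, 0.1],
]

for q in candidates:
    gap = cross_entropy(p_true, q) - entropy(p_true)  # equals KL(P || Q)
    print(q, round(cross_entropy(p_true, q), 4), round(gap, 4))

# One-hot target: the entropy, and hence the minimum loss, is zero.
print(entropy([0.0, 1.0, 0.0]))
```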

In practice, optimizing cross entropy is a common approach for training classification models using techniques like logistic regression, neural networks, and softmax regression. It encourages the model to produce accurate and well-calibrated probability estimates for each class, making it a popular choice as a loss function for classification tasks.

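For decision trees, the cross-entropy loss of a region R is given by,

$$L_{cross}(R) = -\sum_{c} \hat{p}_c \log \hat{p}_c$$ --------------------------------- [3979e]

where,

L_cross represents the cross-entropy loss.
The summation is over all the classes c in the classification problem.
p̂_c represents the predicted probability of class c, taken as the proportion of examples in the region that belong to class c.
log p̂_c is the natural logarithm of p̂_c.
The factor p̂_c that multiplies the logarithm represents the true probability of class c, which within a region is estimated by the same proportion.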

The formula is essentially calculating the average negative log-likelihood of the true class labels given the predicted probabilities. It is commonly used as a loss function in classification problems, including those involving decision trees.

Assuming there are two children regions R₁ and R₂, Figure 3979a shows the plot of the cross-entropy loss in Equation 3979e.

Figure 3979a. Plot of cross entropy loss in Equation 3979e. The grey dot represents L(R_p). The difference between the green dot and the grey dot is the change of loss.

Consider splitting a dataset into two subsets (R₁ and R₂) based on a certain feature. Assume the dataset contains 700 positive cases and 100 negative cases (R_p = 700 positive, R_n = 100 negative), and consider two possible splits:

R₁ = 600 positive, and R₂ = 100 negative.

R₁ = 400 positive and 100 negative, and R₂ = 200 positive.
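The two candidate splits can be compared with a short calculation. The sketch below makes several assumptions that are not stated in the text: the loss of a region is the cross entropy of its class proportions with themselves (−Σ p̂_c log p̂_c, natural logarithm), the loss of a split is the size-weighted average of its children's losses, and the parent of each split is taken as the union of that split's two children. All function names are hypothetical.

```python
import math

def region_loss(counts):
    """Cross-entropy loss of a region from its per-class counts.

    Assumes the predicted probability of each class is its proportion
    in the region; classes with no examples contribute nothing.
    """
    total = sum(counts)
    return sum(-(n / total) * math.log(n / total) for n in counts if n > 0)

def split_loss(children):
    """Size-weighted average of the children's losses."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * region_loss(c) for c in children)

# (positive, negative) counts for the two candidate splits in the text.
splits = {
    "split 1": [(600, 0), (0, 100)],
    "split 2": [(400, 100), (200, 0)],
}

for name, children in splits.items():
    # Assumption: the parent is the union of the two children.
    parent = tuple(sum(col) for col in zip(*children))
    decrease = region_loss(parent) - split_loss(children)
    print(f"{name}: parent={parent} children loss={split_loss(children):.4f} "
          f"decrease={decrease:.4f}")
```

Under these assumptions the first split produces two pure children, so its children loss is zero and it yields the larger decrease in loss; the second split leaves a mixed child and reduces the loss less.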

The logarithm in the cross-entropy loss function does not need to be base 2. The base of the logarithm depends on the context and the choice made during the formulation of the loss function. In most machine learning frameworks and applications, the natural logarithm (base e) is commonly used.
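Changing the base only rescales the loss by a constant factor, so it does not change which model minimizes it: a value in bits (base 2) times ln 2 gives the value in nats (base e). The sketch below checks this with hypothetical distributions; the `base` parameter is an illustrative addition.

```python
import math

def cross_entropy(p, q, base=math.e):
    """Cross entropy with a selectable logarithm base."""
    return sum(-pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true and predicted distributions.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

nats = cross_entropy(p, q)           # natural logarithm
bits = cross_entropy(p, q, base=2)   # base-2 logarithm

# The two results differ only by the constant factor ln(2):
# the first and third printed values agree.
print(nats, bits, bits * math.log(2))
```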

The cross-entropy loss for a binary classification problem is often defined as:

$$L(y, \hat{y}) = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$$ ------------- [3979f]

where,

y is the true label (either 0 or 1).

ŷ is the predicted probability of the positive class.

The logarithm in this formula is commonly the natural logarithm.
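A minimal sketch of this label-based form follows. The names `y` and `y_hat` stand for the true label and the predicted probability of the positive class, and the clipping constant `eps` is an assumption added here so that a prediction of exactly 0 or 1 does not produce log(0); it is not part of the formula itself.

```python
import math

def binary_log_loss(y, y_hat, eps=1e-15):
    """Binary cross-entropy for one example (natural logarithm).

    y is the true label (0 or 1); y_hat is the predicted probability of
    the positive class. eps is a hypothetical clipping constant that keeps
    log() away from 0 when y_hat is exactly 0 or 1.
    """
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

print(binary_log_loss(1, 0.9))   # correct and confident: small loss
print(binary_log_loss(1, 0.1))   # wrong and confident: large loss
print(binary_log_loss(0, 0.1))   # correct and confident: small loss
print(binary_log_loss(1, 0.0))   # large but finite thanks to clipping
```

Because y is either 0 or 1, only one of the two logarithm terms is active for any given example.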

For multi-class classification problems, the cross-entropy loss is extended to handle multiple classes. The formula becomes:

$$L(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i)$$ ------------------------------- [3979g]

where,

y is the true probability distribution (i.e., a one-hot encoded vector representing the true class), with entry y_i for class i.

ŷ is the predicted probability distribution, with entry ŷ_i for class i.

The use of cross-entropy as an error metric, especially in machine learning and classification tasks, is a common and widely accepted practice. Cross-entropy is related to Shannon's information theory and is particularly suited for evaluating the performance of models that output probabilities.

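The binary cross-entropy formula, commonly used for binary classification problems, is given by,

$$\text{LogLoss} = \sum_{(x, y) \in D} \left[ -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}) \right]$$ ------------------------------- [3979h]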

where,

D represents your dataset, consisting of pairs (x,y), where x is the input data and y is the corresponding true label.

ŷ is the predicted probability that the given input x belongs to the positive class (class 1).

y is the true label (either 0 or 1 in binary classification).

The formula sums over all data points in the dataset D, and for each data point, it calculates the cross-entropy loss between the true label y and the predicted probability ŷ. The loss penalizes the model more when its prediction is far from the true label.

The term on the right-hand side in Equation 3979h is essentially a combination of two log loss terms, one for each class (0 and 1). The log loss penalizes confidently wrong predictions more than uncertain or correct predictions.

The accompanying script plots predicted values versus log loss, as shown in Figure 3979b. It generates random true labels and predicted values for a binary classification problem and calculates the log loss for each data point.

Figure 3979b. Predicted values versus LogLoss.
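The following is a minimal stand-in for such a script, not the code behind Figure 3979b. It assumes ten data points, an arbitrary random seed, and predictions kept strictly between 0 and 1; it prints the (predicted value, log loss) pairs instead of plotting them.

```python
import math
import random

random.seed(0)  # arbitrary seed so the run is repeatable

def log_loss_point(y, y_hat):
    """Per-point log loss: -y*log(y_hat) - (1 - y)*log(1 - y_hat)."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)

n_points = 10
y_true = [random.randint(0, 1) for _ in range(n_points)]
# Keep predictions strictly inside (0, 1) so the logarithms are defined.
y_pred = [random.uniform(0.01, 0.99) for _ in range(n_points)]

losses = [log_loss_point(y, p) for y, p in zip(y_true, y_pred)]

# Each (predicted value, log loss) pair is one point of such a plot.
for y, p, loss in sorted(zip(y_true, y_pred, losses), key=lambda t: t[1]):
    print(f"y={y}  y_hat={p:.3f}  log loss={loss:.4f}")

print("sum over the dataset:", sum(losses))
```

Points with a true label of 1 trace a curve that rises toward the left of the predicted-value axis, and points with a true label of 0 trace one that rises toward the right.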

The log loss is calculated for each data point using the formula in Equation 3979h. It is commonly used as an evaluation metric for classification models, and it penalizes predictions based on the discrepancy between the predicted probability and the true label, for instance:

When the true label is 1 (positive class), the penalty increases as the predicted probability (ŷ) moves away from 1. On the x-axis of the plot in Figure 3979b, this corresponds to moving toward the left side, where lower predicted probabilities are plotted.

When the true label is 0 (negative class), the penalty increases as the predicted probability (ŷ) moves away from 0.

This behavior reflects the desire for the model to make confident and accurate predictions. Predictions that are closer to the true label result in lower log loss values, while predictions that are further away result in higher log loss values.

For simplification, we can ignore the sum sign in Equation 3979h. Consider a scenario with one data point from a binary classification problem:

True label (y): 1 (positive class)
Predicted probability (ŷ): 0.8

Substituting these values into the equation gives,

LogLoss = −(1)log(0.8) − (1−1)log(1−0.8) = −log(0.8) ≈ 0.2231

Therefore, for this specific data point, with a true label of 1 and a predicted probability of 0.8, the log loss is approximately 0.2231. In the plotted figure, the x-coordinate will be 0.8 (predicted probability) and the y-coordinate will be 0.2231 (log loss). This point on the graph represents the relationship between the predicted value and the log loss for this particular data point.


=================================================================================