Bayes' Theorem (Bayes' Rule or Bayes' Law) in Machine Learning
- Python for Integrated Circuits -
- An Online Book -



=================================================================================

Bayes' theorem is also known by other names, such as Bayes' rule or Bayes' law. Bayes' theorem, named after the Reverend Thomas Bayes, is a fundamental principle in probability theory and statistics that describes how to update or revise our beliefs about an event or hypothesis based on new evidence or information. It provides a way to calculate the conditional probability of an event A given that event B has occurred, in terms of the conditional probability of event B given event A and the probabilities of events A and B on their own.

Mathematically, Bayes' theorem can be expressed as:

          P(A|B) = P(B|A) * P(A) / P(B) ------------------------------------ [4016a]

Where:

  • P(A|B) is the probability of event A occurring given that event B has occurred (the posterior probability).
  • P(B|A) is the probability of event B occurring given that event A has occurred (the likelihood).
  • P(A) is the prior probability of event A occurring.
  • P(B) is the probability of event B occurring.

Equation 4016a can be obtained from conditional probability: since P(A∩B) = P(A|B) * P(B) = P(B|A) * P(A), dividing both sides by P(B) gives Equation 4016a. 

We can replace A with H, and replace B with D,

           P(H|D) = P(D|H) * P(H) / P(D) ---------------------------------------- [4016al]

where,

         H is the hypothesis.

         D is the data. 

         P(H|D) represents the posterior probability.

         P(H) represents the prior probability. 

         P(D|H) represents the likelihood. 

         P(D) is a normalizing constant. 

Figure 4016a shows an example of the application of Equation 4016al. The squares (42 in total) represent emails that have been classified as spam (gray squares) and not-spam (white squares). There are 13 spam emails and 29 not-spam emails.  

By using Bayes' theorem, we want to find the probability that an email is not spam given that it contains the word "excellent". We first calculate the prior probability, 

            P(not-spam) =  29/42

We then calculate the likelihood using the subset in the red area in Figure 4016a (b), which covers the spam and not-spam emails containing the word "excellent". The likelihood is given by, 

            P(“excellent” | not-spam) = 6/29

where,

            29 is the total number of not-spam emails, and 6 is the number of not-spam emails that contain the word "excellent".  

The normalizing constant is the fraction of all emails in the red area in Figure 4016a (b), that is, the emails containing the word "excellent", given by,

            P("excellent") = 16/42

Finally, we obtain the probability that an email is not spam given the word "excellent",

            P(not-spam | "excellent") = (29/42) * (6/29) / (16/42) = 6/16 = 37.5%

Therefore, once the prior, the likelihood, and the normalizing constant are available, we can calculate the posterior probability.  

Figure 4016a. Finding the probability that an email is not spam given the word "excellent". (a) Emails classified as spam (gray squares) and not-spam (white squares). (b) The red area marks the emails containing the word "excellent".
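
The numbers in this worked example can be reproduced with a few lines of Python. The sketch below simply plugs the counts from Figure 4016a into Equation 4016al; the variable names are illustrative only.

    # Counts taken from the worked example in Figure 4016a.
    total_emails = 42
    not_spam_emails = 29
    emails_with_excellent = 16
    not_spam_with_excellent = 6

    prior = not_spam_emails / total_emails                  # P(not-spam) = 29/42
    likelihood = not_spam_with_excellent / not_spam_emails  # P("excellent" | not-spam) = 6/29
    evidence = emails_with_excellent / total_emails         # P("excellent") = 16/42

    posterior = prior * likelihood / evidence               # P(not-spam | "excellent")
    print(round(posterior, 3))                              # 0.375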

In simple terms, Bayes' theorem provides a way to update our initial beliefs (prior probabilities) with new evidence (likelihood) to obtain a revised belief (posterior probability). It's commonly used in fields such as statistics, machine learning, and various scientific disciplines to make predictions and decisions based on uncertain information and data.

In a probabilistic model, when you have two random variables X and Y, you can use Bayes' rule for conditional probabilities to calculate the conditional probability P(Y=1|X) as follows:

          P(Y=1|X) = [P(X|Y=1) * P(Y=1)] / [P(X|Y=1) * P(Y=1) + P(X|Y=0) * P(Y=0)] ------------------------------ [4016b]

where,

  1. P(Y=1|X) is the probability of Y being 1 given X.

  2. P(X|Y=1) is the probability of observing X given that Y is 1. It represents the likelihood of X under the condition that Y is 1.

  3. P(Y=1) is the prior probability of Y being 1. It represents the initial probability of Y being 1 without considering X.

  4. P(X|Y=0) is the probability of observing X given that Y is 0. It represents the likelihood of X under the condition that Y is 0.

  5. P(Y=0) is the prior probability of Y being 0. It represents the initial probability of Y being 0 without considering X.

Bayes' rule allows you to update your belief about the probability of Y being 1 (or 0) given new information about X. It takes into account the likelihood of observing X under the different conditions of Y (P(X|Y=1) and P(X|Y=0)) and the prior probabilities of Y being 1 or 0 (P(Y=1) and P(Y=0)) to compute the posterior probability of Y being 1 after observing X (P(Y=1|X)).
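
As a quick numeric illustration of Equation 4016b, the short Python sketch below plugs in assumed values for the likelihoods and priors; the numbers are placeholders rather than data from this page.

    # Hypothetical numbers, only to illustrate Equation 4016b.
    p_x_given_y1 = 0.8   # P(X|Y=1), likelihood of X when Y is 1
    p_x_given_y0 = 0.3   # P(X|Y=0), likelihood of X when Y is 0
    p_y1 = 0.4           # P(Y=1), prior
    p_y0 = 1.0 - p_y1    # P(Y=0), prior

    # Posterior P(Y=1|X) from Bayes' rule.
    numerator = p_x_given_y1 * p_y1
    p_y1_given_x = numerator / (numerator + p_x_given_y0 * p_y0)
    print(round(p_y1_given_x, 3))   # 0.32 / (0.32 + 0.18) = 0.64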

For instance, when working with probabilistic models or Bayesian classifiers, Bayes' theorem below is used for making predictions in binary classification,

          p(y=1|x) = [p(x|y=1) * p(y=1)] / [p(x|y=1) * p(y=1) + p(x|y=0) * p(y=0)] -------------------------------- [4016c]

where,

  • p(y=1|x) is the conditional probability that the example belongs to class 1 given the observed features x. This is the probability you want to estimate.

  • p(x|y=1) is the probability distribution of the features x given that the example belongs to class 1. It represents the likelihood of observing the features x when the class is 1.

  • p(y=1) is the prior probability of the example belonging to class 1. It represents the prior belief or probability that class 1 is the correct class.

  • p(x|y=0) is the probability distribution of the features x given that the example belongs to class 0 (the other class).

  • p(y=0) is the prior probability of the example belonging to class 0.

It is used to estimate the probability of an example belonging to a specific class, typically class 1 (y=1), based on the observed features (x).
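
The sketch below applies Equation 4016c to a single numeric feature. Purely for illustration, it assumes Gaussian class-conditional densities for p(x|y); the means, standard deviations, and priors are made-up values, not part of this page.

    import math

    def gaussian_pdf(x, mean, std):
        # Normal density, used here as an assumed model for p(x|y).
        return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

    # Hypothetical class-conditional parameters and priors.
    mean1, std1, p_y1 = 5.0, 1.0, 0.5   # class y=1
    mean0, std0, p_y0 = 2.0, 1.5, 0.5   # class y=0

    x = 4.0                                     # observed feature
    likelihood1 = gaussian_pdf(x, mean1, std1)  # p(x|y=1)
    likelihood0 = gaussian_pdf(x, mean0, std0)  # p(x|y=0)

    # Equation 4016c: posterior probability of class 1.
    p_y1_given_x = likelihood1 * p_y1 / (likelihood1 * p_y1 + likelihood0 * p_y0)
    print(round(p_y1_given_x, 3))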

As an example, suppose we want to predict whether it will rain in the afternoon based on the weather conditions in the morning. We have two competing hypotheses: 

      i) Event A: It will be a cloudy morning. 

      ii) Event B: It will not be a cloudy morning. 

Now, let's assign some probabilities: 

      P(A): Probability of a cloudy morning. 

      P(B): Probability of a not cloudy morning. 

      P(Rain|A): Probability of rain in the afternoon given a cloudy morning. 

      P(Rain|B): Probability of rain in the afternoon given a not cloudy morning. 

We can use Bayes' Rule to update our belief in the probability of rain in the afternoon given the morning weather conditions:

          P(A|Rain) = P(Rain|A) * P(A) / P(Rain) ----------------------------------- [4016cl]

          P(B|Rain) = P(Rain|B) * P(B) / P(Rain) ----------------------------------- [4016cm]

where, 

          P(A∣Rain) is the probability of a cloudy morning given that it's raining in the afternoon.

          P(B∣Rain) is the probability of a not cloudy morning given that it's raining in the afternoon. 

          P(Rain) is the total probability of rain in the afternoon, P(Rain) = P(Rain|A) * P(A) + P(Rain|B) * P(B). 

Therefore, in the Bayesian framework, we update our beliefs about the morning weather conditions based on the evidence of afternoon rain. If it's more likely to rain in the afternoon when the morning is cloudy, our belief in a cloudy morning will increase, and vice versa. This is analogous to updating our predictions based on new information, making Bayes' Rule a powerful tool in probability and statistics.
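
A minimal numeric sketch of this update is shown below; the probabilities are assumed values chosen only to make the arithmetic concrete.

    # Hypothetical probabilities for the morning/afternoon rain example.
    p_a = 0.3              # P(A): cloudy morning
    p_b = 1.0 - p_a        # P(B): not cloudy morning
    p_rain_given_a = 0.6   # P(Rain|A)
    p_rain_given_b = 0.1   # P(Rain|B)

    # Total probability of afternoon rain.
    p_rain = p_rain_given_a * p_a + p_rain_given_b * p_b

    # Equations 4016cl and 4016cm: update the belief about the morning weather.
    p_a_given_rain = p_rain_given_a * p_a / p_rain
    p_b_given_rain = p_rain_given_b * p_b / p_rain
    print(round(p_a_given_rain, 2), round(p_b_given_rain, 2))   # 0.72 0.28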

Another example assumes a sentence, "This is Yougui Liao", with the label "Good". We can then apply Bayes' theorem (Equation 4016a) to this sentence by updating our belief in the correctness or "goodness" of the sentence based on the observed label:

  • A: The hypothesis that the sentence "This is Yougui Liao" is correct or "good."
  • B: The observed label "Good."
  • P(A) represents the prior probability of the sentence being correct or "good." This is your initial belief or probability before considering any new evidence. It could be based on general knowledge or historical data.
  • P(B) represents the probability of observing the label "Good." This is the evidence you have, i.e., the probability of the "Good" label given all possible sentences.
  • P(A|B) is the updated probability of the sentence being correct or "good," given the label "Good."
  • P(B|A) is the probability of observing the label "Good" if the sentence is indeed correct or "good." This is the likelihood of the evidence given the hypothesis.

Then, we can have the equation below:

         P(H|D) = P(D|H) * P(H) / P(D) -- [4016d]

where,

  • P(H) represents the prior probability of hypothesis H (in your case, the sentence being correct).
  • P(D|H) represents the likelihood of observing the sentence "This is Yougui Liao." if the hypothesis H is true.
  • P(D) is the marginal probability of observing the sentence "This is Yougui Liao." without conditioning on H. It serves as a normalization factor.

This formula allows you to calculate the updated probability of the hypothesis H being true (in this case, the sentence being correct) given the observed sentence "This is Yougui Liao." based on the prior probability and the likelihood.

Then, if you have a dataset with a large number of sentences labeled as "Good" and "Not Good," you can estimate the likelihood P(B|A) based on the frequency of "Good" labels among sentences that are correct or "good."
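
A minimal sketch of this estimation is shown below. The small list of (sentence, label, is_correct) records is entirely hypothetical, invented only to show how the prior, likelihood, and evidence could be counted from labeled data.

    # Hypothetical labeled data: (sentence, label, whether the sentence is actually "good").
    data = [
        ("This is Yougui Liao", "Good", True),
        ("Sentence two", "Good", False),
        ("Sentence three", "Not Good", True),
        ("Sentence four", "Not Good", False),
    ]

    # Prior P(A): fraction of sentences that are actually good.
    n_good = sum(1 for _, _, good in data if good)
    p_a = n_good / len(data)

    # Likelihood P(B|A): fraction of "Good" labels among sentences that are actually good.
    p_b_given_a = sum(1 for _, label, good in data if good and label == "Good") / n_good

    # Evidence P(B): overall fraction of "Good" labels.
    p_b = sum(1 for _, label, _ in data if label == "Good") / len(data)

    # Posterior P(A|B) via Equation 4016a.
    p_a_given_b = p_b_given_a * p_a / p_b
    print(round(p_a_given_b, 3))   # 0.5 for this made-up data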

The third example assumes we have the csv file presented in the code. A screenshot of part of the csv file is shown below:

          (screenshot of part of the csv file)

This code implements a simple Naive Bayes classifier for text classification (refer to page 4026):

  1. P(H|D):

    • In the context of the Python code, P(H|D) represents the conditional probability of a question being "Insincere" (target=1) given the question text. In the code, this probability is computed for each test question, and the response variable is set to "Insincere" if P(H|D) is greater than 0.5.

  2. P(H):
    • P(H) represents the prior probability of a question being "Insincere." In the code, this is calculated as the proportion of "Insincere" questions in the training data (p_insincere).

  3. P(D|H):
    • P(D|H) represents the conditional probability of the question text (D) given that the question is "Insincere" (H=1). In the code, this probability is calculated based on the frequency of words in "Insincere" questions and their conditional probabilities (cp_insincere).
    • The conditional probability P(D|H) for "Insincere" (H=1) questions is estimated based on the frequency of words in the "Insincere" questions through the steps below (a minimal code sketch follows this list):

                Tokenization: The text of each "Insincere" question is preprocessed to split it into individual words or tokens. This involves removing punctuation and splitting the text into words. This results in a list of words in each "Insincere" question.

                Stemming and Lemmatization: The code applies stemming and lemmatization to the words. This process reduces words to their root forms. Stemming and lemmatization are text normalization techniques that help group related words together. For example, "running" and "ran" might be reduced to "run."

                Frequency Count: For each word in the "Insincere" questions, the code counts how many times it appears in "Insincere" questions. This results in a dictionary (word_count_insincere) that stores the word as the key and its frequency as the value. This counts how often each word occurs in insincere questions.

                Probability Calculation: To calculate P(D|H), the code calculates the conditional probability of each word in an "Insincere" question given that the question is "Insincere" (H=1). This is done by dividing the frequency of the word in "Insincere" questions by the total number of words in "Insincere" questions. This is stored in a dictionary (cp_insincere), where the word is the key, and the conditional probability is the value.
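
A minimal sketch of these steps is shown below. It assumes a pandas DataFrame loaded from a file named train.csv with columns "question_text" and "target" (1 = insincere); the file name and column names are assumptions, and NLTK's PorterStemmer stands in for the stemming/lemmatization step.

    import re
    from collections import Counter
    import pandas as pd
    from nltk.stem import PorterStemmer

    # Assumed csv layout: columns "question_text" and "target" (1 = insincere).
    train = pd.read_csv("train.csv")
    stemmer = PorterStemmer()

    def tokenize(text):
        # Tokenization: lowercase, strip punctuation, split into words, then stem each word.
        words = re.findall(r"[a-z']+", text.lower())
        return [stemmer.stem(w) for w in words]

    # Prior P(H): proportion of insincere questions (p_insincere in the text above).
    p_insincere = (train["target"] == 1).mean()

    # Frequency count: how often each word occurs in insincere questions.
    insincere_text = train.loc[train["target"] == 1, "question_text"]
    word_count_insincere = Counter(w for text in insincere_text for w in tokenize(text))

    # Probability calculation: P(word|H=1) = word frequency / total words in insincere questions.
    total_insincere_words = sum(word_count_insincere.values())
    cp_insincere = {w: c / total_insincere_words for w, c in word_count_insincere.items()}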

The next example is to find the class that, for a given document, maximizes the posterior probability:

         c_MAP = argmax_c P(c|document) ------------------------ [4016e]

         c_MAP = argmax_c P(document|c) * P(c) / P(document) -------------- [4016f]

         c_MAP = argmax_c P(document|c) * P(c) --- [4016g]

         c_MAP = argmax_c P(word1, word2, word3, ..., wordn|c) * P(c) --------------------------- [4016h]

where,

        c_MAP is the class with the maximum posterior probability, argmax_c denotes the class c that maximizes the expression, and word1, word2, word3, ..., wordn are the words in the particular document. The denominator P(document) is dropped in Equation 4016g because it is the same for every class and therefore does not affect the argmax.

If there are too many words in the documents, then we can make two assumptions:

        i) Word order does not matter, so we use bag-of-words (BOW) representations.

        ii) Word appearances are independent of each other given a particular class. This is where the name "Naive" comes from. However, in real life, some words, e.g., "Thank" and "you", are correlated.

The Naive Bayes Classifier is given by the log formula below,

         c_NB = argmax_c [log P(c) + sum_i log P(wordi|c)] --------------------------- [4016i]
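
A minimal sketch of Equation 4016i is given below. The priors and per-class word probabilities are hypothetical, already-estimated values (in practice they would come from training data), and words unseen for a class are given a small floor probability as a crude stand-in for smoothing.

    import math

    # Hypothetical, already-estimated parameters: priors P(c) and word probabilities P(word|c).
    priors = {"Good": 0.4, "Not good": 0.6}
    word_probs = {
        "Good":     {"good": 0.30, "dog": 0.20, "this": 0.25, "is": 0.25},
        "Not good": {"not": 0.25, "raining": 0.20, "it": 0.30, "is": 0.25},
    }
    UNSEEN = 1e-3   # floor probability for words not seen with a class

    def classify(words):
        # Equation 4016i: pick the class maximizing log P(c) + sum_i log P(wordi|c).
        scores = {}
        for c, prior in priors.items():
            score = math.log(prior)
            for w in words:
                score += math.log(word_probs[c].get(w, UNSEEN))
            scores[c] = score
        return max(scores, key=scores.get)

    print(classify(["this", "is", "good"]))   # prints "Good"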

For instance, a csv file has contents below:

        Document                  Class
        This is good start        Good
        It is raining             Not good
        It is snowing             Not good
        This is a dog             Good
        He is not a friend        Not good

Then, the priors, P(c), are:

          P(Good) = 2/5

          P(Not good) = 3/5

To calculate the likelihood, P(wi|c), of each class for the given data table above, we can use a simple rule-based approach or a machine learning algorithm. In this case, we can determine the class based on the presence of certain keywords or patterns in the documents. A basic rule-based approach would involve counting the presence of certain words or phrases associated with each class.

For example, we can count the presence of positive words or phrases for the "Good" class and negative words or phrases for the "Not good" class. Let's assume that "good" and "dog" are positive indicators, and "not" and "raining" are negative indicators. We can calculate the likelihood of each class as follows:

  1. "This is good start":

    • Positive indicators (good, dog): 2
    • Negative indicators (not, raining): 0
    • Likelihood for "Good" class: (2 / (2 + 0)) = 1
  2. "It is raining":
    • Positive indicators (good, dog): 0
    • Negative indicators (not, raining): 1
    • Likelihood for "Not good" class: (1 / (0 + 1)) = 1
  3. "It is snowing":
    • Positive indicators (good, dog): 0
    • Negative indicators (not, raining): 0
    • Likelihood for "Not good" class: (0 / (0 + 0)) = Not defined
  4. "This is a dog":
    • Positive indicators (good, dog): 2
    • Negative indicators (not, raining): 0
    • Likelihood for "Good" class: (2 / (2 + 0)) = 1
  5. "He is not a friend":
    • Positive indicators (good, dog): 0
    • Negative indicators (not, raining): 2
    • Likelihood for "Not good" class: (2 / (0 + 2)) = 1

Note that, when classifying a new document, the algorithm always returns the class with the highest posterior probability, even if the words in the new document did not appear in the documents used for training. However, when a word or feature in a new document has never been seen in the training data, the Naive Bayes algorithm may assign a very low probability to it, and the probability of the document belonging to any class can be significantly affected. Other words in the document may still influence the classification, but the model's performance may be suboptimal.

However, the example above is simplified; in real-world scenarios, we would likely use more sophisticated methods, such as machine learning algorithms, to classify documents based on their content.
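
The priors and rule-based indicator counts in this example can be reproduced with the short sketch below; the document list and indicator word sets are taken directly from the table and text above.

    # Documents and labels from the table above.
    docs = [
        ("This is good start", "Good"),
        ("It is raining", "Not good"),
        ("It is snowing", "Not good"),
        ("This is a dog", "Good"),
        ("He is not a friend", "Not good"),
    ]
    positive = {"good", "dog"}      # positive indicator words
    negative = {"not", "raining"}   # negative indicator words

    # Priors P(c).
    priors = {c: sum(1 for _, label in docs if label == c) / len(docs) for c in ("Good", "Not good")}
    print(priors)   # {'Good': 0.4, 'Not good': 0.6}

    # Rule-based indicator counts per document.
    for text, label in docs:
        words = text.lower().split()
        pos = sum(1 for w in words if w in positive)
        neg = sum(1 for w in words if w in negative)
        ratio = pos / (pos + neg) if (pos + neg) else None   # None when no indicators appear
        print(text, "->", "positive:", pos, "negative:", neg, "likelihood for 'Good':", ratio)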

=================================================================================