Bellman Expectation and Bellman Optimality Equations

=================================================================================

The Bellman optimality equation expresses the relationship between the optimal value of a state, V*(s), and the optimal values of the possible next states, weighted by their transition probabilities and discounted by the factor γ. A value-iteration sketch based on this equation is given after the symbol list below.

The Bellman expectation equation for the state-value function is given by,

            V(s) = Σ_a π(a∣s) [ r(s, a) + γ Σ_s' P(s'∣s, a) V(s') ] ----------------------------------- [3667a]
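
Equation [3667a] can be turned directly into iterative policy evaluation: start from V(s) = 0 and repeatedly apply the backup until the values stop changing. The sketch below does this in NumPy on a hypothetical toy MDP; the transition model P, reward r, policy π, and discount factor γ are made-up numbers used only for illustration.

import numpy as np

# Hypothetical 3-state, 2-action MDP (toy numbers for illustration only).
n_states, n_actions = 3, 2
gamma = 0.9                                   # discount factor γ

# P[s, a, s'] = P(s'|s, a): transition probabilities.
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# r[s, a]: immediate reward for taking action a in state s.
r = np.random.uniform(0, 1, size=(n_states, n_actions))
# pi[s, a] = π(a|s): a uniform random policy.
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Iterative policy evaluation: apply the Bellman expectation backup [3667a]
# until V converges.
V = np.zeros(n_states)
for _ in range(1000):
    # V_new(s) = Σ_a π(a|s) [ r(s, a) + γ Σ_s' P(s'|s, a) V(s') ]
    V_new = (pi * (r + gamma * P @ V)).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:      # convergence check
        break
    V = V_new

print("V(s) for each state under π:", V)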

The Bellman expectation equation for the action-value function is given by,

            Q(s, a) = r(s, a) + γ Σ_s' P(s'∣s, a) Σ_a' π(a'∣s') Q(s', a') ----------------------------------- [3667b]
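
Equation [3667b] can be evaluated the same way, by iterating the backup on Q(s, a). The sketch below is a minimal example on the same kind of hypothetical toy MDP as above (all numbers are invented for illustration).

import numpy as np

# Hypothetical 3-state, 2-action MDP, as in the previous sketch.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s'|s, a)
r = np.random.uniform(0, 1, size=(n_states, n_actions))                 # r(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)                    # π(a|s)

# Iterate the Bellman expectation backup [3667b] on Q.
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    V_next = (pi * Q).sum(axis=1)        # Σ_a' π(a'|s') Q(s', a'), one value per state
    Q_new = r + gamma * P @ V_next       # backup applied to every (s, a) pair at once
    if np.max(np.abs(Q_new - Q)) < 1e-8:
        break
    Q = Q_new

print("Q(s, a) under π:")
print(Q)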

The main symbols and notations used in the Bellman equations are:

        a: Action, representing the decision or move taken by the agent in a given state.

        Fitted VI: Fitted value iteration, a variant of value iteration that represents the value function with a function approximator.

        P: Transition model, representing the probability distribution of next states given the current state and action.

        P(s'∣s, a): Transition probability, representing the probability of transitioning to state s' from state s by taking action a. 

        Q(s,a): Action-value function, representing the expected cumulative reward from being in state s, taking action a, and following a particular policy.

        Q*(s, a): Optimal action-value function, representing the maximum expected cumulative reward achievable from being in state s, taking action a, and thereafter following the optimal policy.

        r: Immediate reward, the numerical value received by the agent as feedback for taking an action in a specific state. 

        R: Return, the total cumulative reward obtained by the agent over a sequence of time steps. 

        s: State, representing the current situation or configuration of the environment.

        St, At: State and action at time step t in a trajectory. 

        T: Time step or the length of a trajectory. 

        V(s): State-value function, representing the expected cumulative reward from being in state s and following a particular policy. 

        V*(s): Optimal state-value function, representing the maximum expected cumulative reward achievable from state s onward following the optimal policy.

        V̂(s): Approximation to the optimal state-value function.

         ϵ: Exploration rate, a parameter used in epsilon-greedy strategies to balance exploration and exploitation.

        γ: Discount factor, a parameter that determines the importance of future rewards. It is a value between 0 and 1. 

        π: Policy, a mapping from states to probabilities of selecting each possible action.

       π(a∣s): Probability of taking action a in state s according to policy π.

        θ: Parameters of the function approximator.

        τ: Trajectory, a sequence of states and actions experienced by the agent during interaction with the environment. 
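
The Bellman optimality equation described at the top of this page replaces the policy-weighted average in [3667a] with a maximum over actions, V*(s) = max_a [ r(s, a) + γ Σ_s' P(s'∣s, a) V*(s') ], and repeatedly applying this backup is value iteration. The sketch below runs it on a hypothetical toy MDP (all numbers are made up for illustration) and extracts the greedy policy at the end.

import numpy as np

# Hypothetical toy MDP (illustrative numbers only).
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s'|s, a)
r = np.random.uniform(0, 1, size=(n_states, n_actions))                 # r(s, a)

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a [ r(s, a) + γ Σ_s' P(s'|s, a) V(s') ]
V = np.zeros(n_states)                 # V̂, the approximation to V*(s)
for _ in range(1000):
    Q = r + gamma * P @ V              # action values under the current V
    V_new = Q.max(axis=1)              # greedy backup over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)       # best action in each state under V̂
print("V*(s) ≈", V, "   greedy actions:", greedy_policy)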

 

=================================================================================