Bellman Expectation and Bellman Optimality Equations
Python Automation and Machine Learning for ICs (http://www.globalsino.com/ICs/) - An Online Book
The Bellman optimality equation expresses the relationship between the optimal value of a state V*(s) and the values of the possible next states, weighted by their transition probabilities and discounted by the factor γ:

     V*(s) = max_a Σ_s' P(s'|s, a)[r + γV*(s')]

The Bellman expectation equation for the state-value function, which gives the value of a state under a fixed policy π, is given by,

     V(s) = Σ_a π(a|s) Σ_s' P(s'|s, a)[r + γV(s')] ----------------------------------- [3667a]
The Bellman expectation equation for the action-value function is given by,

     Q(s, a) = Σ_s' P(s'|s, a)[r + γ Σ_a' π(a'|s') Q(s', a')] ----------------------------------- [3667b]

The main symbols and notations used in the Bellman equations are:

a: Action, the decision or move taken by the agent in a given state.
Fitted VI: Fitted value iteration, i.e., value iteration with a function approximator standing in for the tabular value function.
P: Transition model, the probability distribution over next states given the current state and action.
P(s'|s, a): Transition probability, the probability of transitioning to state s' from state s by taking action a.
Q(s, a): Action-value function, the expected cumulative reward from being in state s, taking action a, and thereafter following a particular policy.
Q*(s, a): Optimal action-value function, the maximum expected cumulative reward achievable from being in state s, taking action a, and thereafter following the optimal policy.
r: Immediate reward, the numerical feedback the agent receives for taking an action in a specific state.
R: Return, the total cumulative reward obtained by the agent over a sequence of time steps.
s: State, the current situation or configuration of the environment.
S_t, A_t: State and action at time step t in a trajectory.
T: Time step, or the length of a trajectory.
V(s): State-value function, the expected cumulative reward from being in state s and following a particular policy.
V*(s): Optimal state-value function, the maximum expected cumulative reward achievable from state s onward when following the optimal policy.
V^: The approximation to the optimal state-value function.
ϵ: Exploration rate, a parameter used in epsilon-greedy strategies to balance exploration and exploitation.
γ: Discount factor, a value between 0 and 1 that determines the importance of future rewards.
π: Policy, a mapping from states to probabilities of selecting each possible action.
π(a|s): Probability of taking action a in state s according to policy π.
θ: Parameters of the function approximator.
τ: Trajectory, a sequence of states and actions experienced by the agent during interaction with the environment.
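To make the expectation backup concrete, the following is a minimal Python/NumPy sketch of iterative policy evaluation. The 2-state, 2-action MDP below (the transition tensor P, reward table r, and the uniform random policy) is hypothetical and chosen only for illustration; it is not from this book.

import numpy as np

# Hypothetical 2-state, 2-action MDP, for illustration only.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s'] = P(s'|s, a)
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r[s, a] = immediate reward
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[s, a] = pi(a|s), uniform random policy
gamma = 0.9                                # discount factor

# Iterative policy evaluation: repeatedly apply the Bellman expectation
# backup V(s) <- sum_a pi(a|s) [ r(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = r + gamma * P @ V            # action-value backup, as in [3667b]
    V_new = (pi * Q).sum(axis=1)     # expectation over the policy, as in [3667a]
    if np.max(np.abs(V_new - V)) < 1e-8:
        break                        # converged to the fixed point V under pi
    V = V_new
print("V under pi:", V)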
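The Bellman optimality equation leads to value iteration in the same way: the expectation over the policy is replaced by a maximization over actions, and the backup is iterated until it reaches its fixed point V*. The sketch below reuses the same hypothetical MDP and reads a greedy policy off the converged action values.

import numpy as np

# Same hypothetical MDP as above.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s'] = P(s'|s, a)
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r[s, a] = immediate reward
gamma = 0.9

# Value iteration: V(s) <- max_a [ r(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = r + gamma * P @ V        # optimal action-value backup
    V_new = Q.max(axis=1)        # maximize over actions instead of averaging
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=1)        # greedy policy with respect to V*
print("V*:", V, "greedy actions:", policy)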