Bellman Expectation and Bellman Optimality Equations
Python Automation and Machine Learning for ICs (http://www.globalsino.com/ICs/) - An Online Book
The Bellman optimality equation expresses the relationship between the optimal value of a state V*(s) and the values of the possible next states, weighted by their transition probabilities and discounted by the factor γ:

     V*(s) = max_a Σ_s' P(s'|s, a)[r + γV*(s')]

The Bellman expectation equation for the state-value function, which gives the value of a state under a fixed policy π, is given by,

     V(s) = Σ_a π(a|s) Σ_s' P(s'|s, a)[r + γV(s')] ----------------------------------- [3667a]
The Bellman expectation equation for the action-value function is given by,

     Q(s, a) = Σ_s' P(s'|s, a)[r + γ Σ_a' π(a'|s') Q(s', a')] ----------------------------------- [3667b]

The main symbols and notations used in the Bellman equations are:

a: Action, the decision or move taken by the agent in a given state.
Fitted VI: Fitted value iteration, i.e., value iteration with a function approximator standing in for the tabular value function.
P: Transition model, the probability distribution over next states given the current state and action.
P(s'|s, a): Transition probability, the probability of transitioning to state s' from state s by taking action a.
Q(s, a): Action-value function, the expected cumulative reward from being in state s, taking action a, and thereafter following a particular policy.
Q*(s, a): Optimal action-value function, the maximum expected cumulative reward achievable from being in state s, taking action a, and thereafter following the optimal policy.
r: Immediate reward, the numerical feedback the agent receives for taking an action in a specific state.
R: Return, the total cumulative reward obtained by the agent over a sequence of time steps.
s: State, the current situation or configuration of the environment.
S_t, A_t: State and action at time step t in a trajectory.
T: Time step, or the length of a trajectory.
V(s): State-value function, the expected cumulative reward from being in state s and following a particular policy.
V*(s): Optimal state-value function, the maximum expected cumulative reward achievable from state s onward when following the optimal policy.
V^: The approximation to the optimal state-value function.
ϵ: Exploration rate, a parameter used in epsilon-greedy strategies to balance exploration and exploitation.
γ: Discount factor, a value between 0 and 1 that determines the importance of future rewards.
π: Policy, a mapping from states to probabilities of selecting each possible action.
π(a|s): Probability of taking action a in state s according to policy π.
θ: Parameters of the function approximator.
τ: Trajectory, a sequence of states and actions experienced by the agent during interaction with the environment.
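To make the expectation backup concrete, the following is a minimal Python/NumPy sketch of iterative policy evaluation. The 2-state, 2-action MDP below (the transition tensor P, reward table r, and the uniform random policy) is hypothetical and chosen only for illustration; it is not from this book.

import numpy as np

# Hypothetical 2-state, 2-action MDP, for illustration only.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s'] = P(s'|s, a)
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r[s, a] = immediate reward
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[s, a] = pi(a|s), uniform random policy
gamma = 0.9                                # discount factor

# Iterative policy evaluation: repeatedly apply the Bellman expectation
# backup V(s) <- sum_a pi(a|s) [ r(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = r + gamma * P @ V            # action-value backup, as in [3667b]
    V_new = (pi * Q).sum(axis=1)     # expectation over the policy, as in [3667a]
    if np.max(np.abs(V_new - V)) < 1e-8:
        break                        # converged to the fixed point V under pi
    V = V_new
print("V under pi:", V)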
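The Bellman optimality equation leads to value iteration in the same way: the expectation over the policy is replaced by a maximization over actions, and the backup is iterated until it reaches its fixed point V*. The sketch below reuses the same hypothetical MDP and reads a greedy policy off the converged action values.

import numpy as np

# Same hypothetical MDP as above.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s'] = P(s'|s, a)
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r[s, a] = immediate reward
gamma = 0.9

# Value iteration: V(s) <- max_a [ r(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = r + gamma * P @ V        # optimal action-value backup
    V_new = Q.max(axis=1)        # maximize over actions instead of averaging
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=1)        # greedy policy with respect to V*
print("V*:", V, "greedy actions:", policy)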