State-Action Rewards in Markov Decision Process (MDP)
- Python Automation and Machine Learning for ICs -
- An Online Book -



=================================================================================

A state-action reward is the immediate reward associated with taking a specific action in a specific state. In mathematical terms, R(s, a) denotes the immediate reward the agent receives when it takes action a in state s.

The goal of the agent in reinforcement learning is typically to learn a policy, a mapping from states to actions, that maximizes the expected cumulative reward over time. This involves the agent taking actions in the environment, observing the rewards and resulting states, and updating its policy to improve its future decision-making. The total expected return (or cumulative reward) for following a policy π, denoted by J(π), is the expected sum of discounted rewards over time,

         J(\pi) = \mathbb{E}\big[R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots\big] --------------------- [3666a]
where,

           γ is the discount factor, which determines the importance of future rewards compared to immediate rewards. The agent's objective is to find the policy that maximizes this expected return.

Equivalently, the total expected return for a policy π in a reinforcement learning setting can be written compactly as,

         J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right] --------------------- [3666b]

This equation considers the immediate reward R(s_t, a_t) at each time step t and discounts it by γ^t, where γ is the discount factor. This formulation takes into account the potentially infinite sequence of rewards the agent might receive as it interacts with the environment over time. The discount factor γ ensures that future rewards are weighted less than immediate rewards, reflecting the agent's preference for obtaining rewards sooner rather than later.
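As a minimal numerical sketch of Equation 3666b, the code below computes the discounted return for a short trajectory through a toy MDP. The three-state, two-action reward table, the discount factor, and the trajectory are illustrative assumptions, not values taken from this text.

import numpy as np

# Hypothetical immediate-reward table R[s, a] for a toy 3-state, 2-action MDP
# (the numbers are assumptions chosen only for illustration).
R = np.array([[1.0, 0.0],   # rewards for actions a0, a1 in state s0
              [0.0, 2.0],   # rewards for actions a0, a1 in state s1
              [5.0, 0.0]])  # rewards for actions a0, a1 in state s2

gamma = 0.9  # discount factor

# A sample trajectory of (state, action) pairs visited by the agent.
trajectory = [(0, 0), (1, 1), (2, 0)]

# Discounted return: sum of gamma^t * R(s_t, a_t) over the trajectory
# (Equation 3666b, truncated to a finite episode).
discounted_return = sum(gamma**t * R[s, a] for t, (s, a) in enumerate(trajectory))
print(discounted_return)  # 1.0 + 0.9*2.0 + 0.81*5.0 = 6.85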

Then, the Bellman optimality equation for the state-value function (V(S)) becomes,

         V(S) = \max_a \left[ R(S, a) + \gamma \sum_{S'} P_{Sa}(S') V(S') \right] --------------------- [3666c]

where,

          V(S) is the optimal value of state S.

          max_a denotes the maximum over all possible actions a that the agent can take in state S.

          R(S, a) is the immediate reward the agent receives when taking action a in state S.

          γ is the discount factor, which discounts the importance of future rewards.

          ∑_{S′} P_{Sa}(S′) represents the sum over all possible next states S′ of the probability P_{Sa}(S′) of transitioning from state S to state S′ given action a.

          V(S′) is the value of the next state S′.

In Equation 3666c, the first term on the right-hand side represents the immediate reward, and the second term represents the expected future rewards.

Note that value iteration can be applied to both the state-value and state-action value formulations. The difference lies in whether we are estimating the value of states (state-value) or state-action pairs (state-action value or Q-values).  
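As a minimal sketch of value iteration for the state-value formulation, the code below repeatedly applies the Bellman optimality update of Equation 3666c to a toy MDP. The state and action counts, reward table, transition probabilities, and convergence tolerance are all assumptions made only for illustration.

import numpy as np

# Toy MDP used only for illustration (sizes, rewards, and transition
# probabilities are assumptions, not values from this text).
n_states, n_actions = 3, 2
R = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [5.0, 0.0]])                           # R[s, a]: immediate reward
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states),
                  size=(n_states, n_actions))        # P[s, a, s']: transition probabilities
gamma = 0.9

# Value iteration: apply V(S) <- max_a [ R(S, a) + gamma * sum_S' P_Sa(S') V(S') ]
# until the state values stop changing.
V = np.zeros(n_states)
for _ in range(10_000):
    Q = R + gamma * np.einsum('san,n->sa', P, V)     # bracketed term of Equation 3666c for every (s, a)
    V_new = Q.max(axis=1)                            # maximize over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # approximate optimal state values V*(s)

The same loop can instead be written over the array Q itself, which gives the state-action (Q-value) formulation mentioned above.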

Then, the optimal policy π*(s) can be obtained from the optimal state-value function V*(s) by,

         \pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} P_{sa}(s') V^*(s') \right] --------------------- [3666d]
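A minimal sketch of this greedy policy extraction is shown below; the toy reward table, transition probabilities, and optimal values V*(s) are assumed for illustration (in practice, V*(s) would come from value iteration, as in the previous sketch).

import numpy as np

# Assumed toy MDP (same illustrative sizes as the value-iteration sketch above).
n_states, n_actions = 3, 2
R = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [5.0, 0.0]])                           # R[s, a]
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states),
                  size=(n_states, n_actions))        # P[s, a, s']
gamma = 0.9
V_star = np.array([9.0, 8.5, 12.0])                  # assumed optimal values V*(s)

# Greedy policy extraction (Equation 3666d): in each state, choose the action that
# maximizes the immediate reward plus the discounted expected value of the next state.
Q = R + gamma * np.einsum('san,n->sa', P, V_star)
pi_star = Q.argmax(axis=1)                           # pi*(s): index of the best action in each state
print(pi_star)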

 

         

 

=================================================================================