Finite-Horizon MDP (Markov Decision Process)
- Python Automation and Machine Learning for ICs -
- An Online Book -



=================================================================================

A finite-horizon MDP uses a fixed time horizon for decision-making: the agent makes decisions over a finite number of time steps, or stages. The decision process is divided into discrete time periods, and the agent's goal is typically to maximize the cumulative sum of rewards over this finite horizon.

The formulation of a finite-horizon MDP includes an additional parameter, T, representing the time horizon. The agent takes an action at each time step, and the environment responds with a state transition and a reward. The objective is to find a policy that maximizes the sum of rewards over the specified time horizon.

Formally, a finite-horizon MDP can be expressed as the tuple (S, A, P, R, T), where T is the time horizon. Solving finite-horizon MDPs typically relies on dynamic programming methods, such as backward induction or forward dynamic programming, to compute the optimal policy and the corresponding value function over the finite set of time steps. The optimal policy dictates the best action to take at each time step to maximize the expected cumulative reward by the end of the horizon.
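
As a minimal sketch of backward induction, the Python snippet below solves a small hypothetical finite-horizon MDP with two states, two actions, and horizon T = 5; the transition probabilities P and rewards R are made-up toy values used only for illustration, not taken from any particular system. The loop fills in the value function V_t(s) from the end of the horizon back to t = 0 and records the best action at each time step, which yields a time-dependent policy.

import numpy as np

# Hypothetical finite-horizon MDP (S, A, P, R, T); every number below is an
# assumed toy value used only for illustration.
n_states, n_actions = 2, 2
T = 5  # decisions are made at time steps t = 0, 1, ..., T

# P[a, s, s2] = probability of moving from state s to s2 when taking action a
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],    # action 0
              [[0.5, 0.5],
               [0.6, 0.4]]])   # action 1

# R[s, a] = immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# Backward induction: V[t, s] is the optimal expected sum of rewards from time
# t through T when starting in state s; pi[t, s] is the optimal action at time t.
V = np.zeros((T + 2, n_states))          # V[T + 1, s] = 0: no reward after the horizon
pi = np.zeros((T + 1, n_states), dtype=int)

for t in range(T, -1, -1):
    EV = P @ V[t + 1]          # EV[a, s] = sum over s2 of P(s2 | s, a) * V[t + 1, s2]
    Q = R + EV.T               # Q[s, a] = R(s, a) + expected optimal future reward
    V[t] = Q.max(axis=1)       # optimal value at time step t
    pi[t] = Q.argmax(axis=1)   # optimal (time-dependent) action at time step t

print("Optimal values V_0(s):", V[0])
print("Optimal policy pi_t(s), one row per time step t = 0..T:")
print(pi)

Because the remaining horizon shrinks as t increases, the rows of pi generally differ from one another, which is why the optimal policy is indexed by t rather than being stationary.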

The expected sum of rewards over time under a given policy is given by,

              \mathbb{E}\left[ \sum_{t=0}^{T} R(S_t, a_t) \right]  ------------------------------ [3665a]

where,

         E represents the expectation operator, indicating that the expression refers to an expected value.

         R(S_t, a_t) represents the immediate reward associated with taking action a_t in state S_t.

         S_t and a_t represent the state and action at time step t, respectively.

The sum in Equation 3665a denotes the total sum of rewards over the finite time horizon T. In the MDP, the objective is often to find a policy that maximizes this expected cumulative reward over the specified time horizon. The corresponding optimal policy π_t* is then non-stationary: it depends on the time step t, rather than being a stationary policy.
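
As a rough check of Equation 3665a, the sketch below estimates the expected cumulative reward by Monte Carlo simulation: it repeatedly rolls out episodes covering the decision steps t = 0, ..., T under a non-stationary policy and averages the summed rewards. It assumes the hypothetical arrays P, R, T, and pi from the backward-induction sketch above; with those inputs, the estimate should be close to the value V[0][s0] computed there.

import numpy as np

def estimate_return(P, R, pi, T, s0, n_episodes=10_000, seed=0):
    """Monte Carlo estimate of E[ sum_{t=0}^{T} R(S_t, a_t) ] (Equation 3665a)
    when starting in state s0 and following the non-stationary policy pi[t, s]."""
    rng = np.random.default_rng(seed)
    n_states = R.shape[0]
    total = 0.0
    for _ in range(n_episodes):
        s, episode_reward = s0, 0.0
        for t in range(T + 1):                      # decisions at t = 0, 1, ..., T
            a = pi[t, s]                            # action chosen by pi_t* in state s
            episode_reward += R[s, a]               # immediate reward R(S_t, a_t)
            s = rng.choice(n_states, p=P[a, s])     # sample S_{t+1} from P(. | s, a)
        total += episode_reward
    return total / n_episodes

# Example usage with the arrays from the backward-induction sketch above:
# print(estimate_return(P, R, pi, T, s0=0))   # should approach V[0][0]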

=================================================================================