For example, if you are one cell to the right of the goal, then the action left takes you to the cell just above the goal. Let us treat this as an undiscounted episodic task, with constant rewards of −1 until the goal state is reached. Figure 6.11 shows the result of applying ε-greedy Sarsa to this task, with ε = 0.1, α = 0.1, and the initial values Q(s, a) = 0 for all s, a. The increasing slope of the graph shows that the goal is reached more and more quickly over time.
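The task described above can be sketched in a few lines of Python. This is a minimal illustration, not the original experiment: the grid size, start and goal cells, and per-column wind strengths are assumed from the commonly cited windy gridworld layout, and the names `step`, `epsilon_greedy`, and `sarsa` are hypothetical. The default step size and exploration rate of 0.1 are likewise assumptions.

```python
import random

# Assumed layout: 7 rows x 10 columns, row 0 at the top.
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # upward push per column (assumption)
START, GOAL = (3, 0), (3, 7)           # assumed start and goal cells
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Apply the action plus the wind of the current column; reward is always -1."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row = min(max(row + d_row - WIND[col], 0), ROWS - 1)
    new_col = min(max(col + d_col, 0), COLS - 1)
    next_state = (new_row, new_col)
    return next_state, -1, next_state == GOAL


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one (ties broken randomly)."""
    if rng.random() < epsilon:
        return rng.choice(list(ACTIONS))
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def sarsa(episodes=200, alpha=0.1, epsilon=0.1, seed=0):
    """Undiscounted one-step Sarsa; returns the Q-table and the steps taken per episode."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
    steps_per_episode = []
    for _ in range(episodes):
        state = START
        action = epsilon_greedy(q, state, epsilon, rng)
        steps, done = 0, False
        while not done:
            next_state, reward, done = step(state, action)
            next_action = epsilon_greedy(q, next_state, epsilon, rng)
            # On-policy target: uses the action actually selected for the next state.
            target = reward + (0.0 if done else q[(next_state, next_action)])
            q[(state, action)] += alpha * (target - q[(state, action)])
            state, action = next_state, next_action
            steps += 1
        steps_per_episode.append(steps)
    return q, steps_per_episode


if __name__ == "__main__":
    _, steps = sarsa()
    print("first 5 episodes:", steps[:5])
    print("last 5 episodes:", steps[-5:])
```

Under these assumptions, `step((3, 8), "left")` lands on `(2, 7)`, the cell just above the goal, and the episode lengths returned by `sarsa` should shrink as learning proceeds, which is what a steepening episodes-versus-time-steps curve reflects.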

Feb 02, 2018 · As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter. Currently, there are a multitude of algorithms that can be used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa.

2.2 State-Action-Reward-State-Action (SARSA)

SARSA closely resembles Q-learning. The key difference is that SARSA is an on-policy algorithm: it updates each Q-value using the action that the current policy actually selects in the next state, whereas Q-learning updates using the greedy action regardless of what the policy does next.
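The on-policy versus greedy distinction comes down to a single term in the update target. The sketch below is illustrative only: the function names, the nested-dictionary Q-table, and the numeric values are all assumptions chosen to make the difference visible.

```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: bootstraps from the action the policy actually chose."""
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma):
    """Off-policy target: bootstraps from the greedy (maximum-value) action."""
    return reward + gamma * max(q[next_state].values())


def td_update(q, state, action, target, alpha):
    """Move Q(state, action) a fraction alpha of the way toward the target."""
    q[state][action] += alpha * (target - q[state][action])


# Hypothetical Q-table with two states and two actions each.
q = {
    "s": {"left": 0.0, "right": 0.0},
    "s2": {"left": 1.0, "right": 3.0},
}

# Suppose the behavior policy explored and picked "left" in s2.
sarsa_t = sarsa_target(q, reward=-1.0, next_state="s2", next_action="left", gamma=0.5)
q_learn_t = q_learning_target(q, reward=-1.0, next_state="s2", gamma=0.5)
print(sarsa_t)    # -0.5
print(q_learn_t)  # 0.5

td_update(q, "s", "left", sarsa_t, alpha=0.5)
print(q["s"]["left"])  # -0.25
```

When the policy happens to pick the greedy action, the two targets coincide; they differ only on exploratory steps, which is why SARSA's learned values account for the cost of exploration and Q-learning's do not.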

From the SARSA algorithm to Q-learning with ϵ-greedy exploration

