2.2 State-Action-Reward-State-Action (SARSA) SARSA closely resembles Q-learning. The key difference is that SARSA is an on-policy algorithm: it learns the Q-value based on the action actually performed by the current policy, rather than the action of the greedy policy.
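The on-policy update can be sketched as follows. This is a minimal illustration, assuming a Q-table stored as a nested dict and hypothetical parameter defaults; the essential point is that the temporal-difference target bootstraps from the next action the policy actually selects, not from the maximum over actions as in Q-learning.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """One SARSA step: Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)].

    a_next is the action chosen in s_next by the current (e.g. epsilon-greedy)
    policy, which is what makes this update on-policy.
    """
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```

Had the target used `max(Q[s_next].values())` instead of `Q[s_next][a_next]`, this would be the (off-policy) Q-learning update.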
For example, if you are one cell to the right of the goal, then the action left takes you to the cell just above the goal. Let us treat this as an undiscounted episodic task, with a constant reward of −1 on every step until the goal state is reached. Figure 6.11 shows the result of applying ε-greedy Sarsa to this task, with ε = 0.1, α = 0.5, and initial values Q(s, a) = 0 for all s, a. The increasing slope of the graph shows that the goal is reached more and more quickly over time.
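The experiment above can be sketched in a few lines. This is an illustrative implementation under assumptions not fully spelled out here: a 7×10 windy gridworld with an upward wind per column, start and goal positions, and four moves (up, down, left, right); the parameters (ε = 0.1, α = 0.5, undiscounted, reward −1 per step, Q initialized to zero) match the text.

```python
import random

def epsilon_greedy_sarsa(episodes=200, epsilon=0.1, alpha=0.5, gamma=1.0, seed=0):
    # Assumed windy-gridworld layout: 7x10 grid, per-column upward wind,
    # start at (3, 0), goal at (3, 7), reward -1 on every transition.
    rng = random.Random(seed)
    rows, cols = 7, 10
    wind = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
    start, goal = (3, 0), (3, 7)
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    Q = {(r, c): [0.0] * 4 for r in range(rows) for c in range(cols)}

    def choose(s):
        # epsilon-greedy: explore with probability epsilon, else act greedily.
        if rng.random() < epsilon:
            return rng.randrange(4)
        return max(range(4), key=lambda a: Q[s][a])

    def step(s, a):
        # Apply the move plus the wind in the current column, clipped to the grid.
        dr, dc = actions[a]
        r = min(max(s[0] + dr - wind[s[1]], 0), rows - 1)
        c = min(max(s[1] + dc, 0), cols - 1)
        return (r, c)

    steps_per_episode = []
    for _ in range(episodes):
        s, steps = start, 0
        a = choose(s)
        while s != goal:
            s2 = step(s, a)
            a2 = choose(s2)  # the policy's actual next action (on-policy)
            Q[s][a] += alpha * (-1 + gamma * Q[s2][a2] - Q[s][a])
            s, a = s2, a2
            steps += 1
        steps_per_episode.append(steps)
    return steps_per_episode
```

Plotting cumulative steps against episodes for the returned list reproduces the qualitative shape of the figure: early episodes take many steps, later ones far fewer.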