CS 188 Project 3 (RL) Q4: Asynchronous Value Iteration

Write a value iteration agent in AsynchronousValueIterationAgent, which has been partially specified for you in valueIterationAgents.py. Your value iteration agent is an offline planner, not a reinforcement learning agent, so the relevant training option is the number of iterations of value iteration it runs in its initial planning phase (option -i). AsynchronousValueIterationAgent takes an MDP on construction and runs cyclic value iteration for the specified number of iterations before the constructor returns. Note that all of this value iteration code should go in the constructor (__init__).
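In other words, all of the planning happens during construction; by the time the constructor returns, the values are final. A minimal usage sketch, assuming only the skeleton's constructor signature shown in the code below:

agent = AsynchronousValueIterationAgent(mdp, discount=0.9, iterations=1000)
# the constructor has already run 1000 cyclic updates; agent.values is ready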

The reason this class is called AsynchronousValueIterationAgent is that we update only one state in each iteration, rather than performing a batch-style update. Here is how cyclic value iteration works. In the first iteration, update only the value of the first state in the states list. In the second iteration, update only the value of the second state. Keep going until you have updated the value of each state once, then start back at the first state for the subsequent iteration. If the state picked for updating is terminal, nothing happens in that iteration. You can implement this as an index into the states variable defined in the code skeleton, as the sketch below shows.
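A minimal sketch of this cycling schedule (the variable names are illustrative, not the skeleton's):

states = mdp.getStates()              # fixed ordering of all states
for i in range(iterations):
    state = states[i % len(states)]   # iteration i updates exactly one state
    if mdp.isTerminal(state):
        continue                      # a terminal state gets no update
    # ... perform a single Bellman backup on `state` here ...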

Recall the value iteration state update equation:

$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$

Value iteration iterates the values toward a fixed point, as discussed in class. It is also possible to update the state values in other ways, such as in a random order (i.e., select a state randomly, update its value, and repeat) or in a batch style (as in Q1). In Q4 we explore the cyclic technique described above. A sketch contrasting the two update styles follows.
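To make the batch/in-place contrast concrete, here is a hedged sketch; bellmanBackup is a hypothetical helper (not from the skeleton) that returns the max over actions of the one-step lookahead for a state, reading whichever value table it is passed:

# Batch style (Q1): every update within a sweep reads the OLD values.
newValues = util.Counter()
for state in states:
    newValues[state] = bellmanBackup(state, values)   # hypothetical helper; reads the frozen table
values = newValues

# Cyclic, in-place style (Q4): each update is immediately visible to later ones.
for i in range(iterations):
    state = states[i % len(states)]
    values[state] = bellmanBackup(state, values)      # reads the live table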

AsynchronousValueIterationAgent inherits from Q1's ValueIterationAgent, so the only method you need to implement is runValueIteration. Since the superclass constructor calls runValueIteration, overriding it is sufficient to change the agent's behavior as desired.

Note: make sure to handle the case when a state has no available actions in the MDP (think about what this means for future rewards). To test your implementation, run the autograder. It should take less than a second to run; if it takes much longer, you will likely run into problems later in the project, so make your implementation more efficient now.

python autograder.py -q q4


# valueIterationAgents.py
# -----------------------
# Licensing Information:  You are free to use or extend these projects for
# educational purposes provided that (1) you do not distribute or publish
# solutions, (2) you retain this notice, and (3) you provide clear
# attribution to UC Berkeley, including a link to http://ai.berkeley.edu.
# 
# Attribution Information: The Pacman AI projects were developed at UC Berkeley.
# The core projects and autograders were primarily created by John DeNero
# (denero@cs.berkeley.edu) and Dan Klein (klein@cs.berkeley.edu).
# Student side autograding was added by Brad Miller, Nick Hay, and
# Pieter Abbeel (pabbeel@cs.berkeley.edu).

import mdp, util

# ValueIterationAgent (the Q1 agent this class extends) is defined earlier in
# this same file; only the asynchronous agent is shown here.
class AsynchronousValueIterationAgent(ValueIterationAgent):
    """
        * Please read learningAgents.py before reading this.*

        An AsynchronousValueIterationAgent takes a Markov decision process
        (see mdp.py) on initialization and runs cyclic value iteration
        for a given number of iterations using the supplied
        discount factor.
    """
    def __init__(self, mdp, discount = 0.9, iterations = 1000):
        """
          Your cyclic value iteration agent should take an mdp on
          construction, run the indicated number of iterations,
          and then act according to the resulting policy. Each iteration
          updates the value of only one state, which cycles through
          the states list. If the chosen state is terminal, nothing
          happens in that iteration.

          Some useful mdp methods you will use:
              mdp.getStates()
              mdp.getPossibleActions(state)
              mdp.getTransitionStatesAndProbs(state, action)
              mdp.getReward(state, action, nextState)
              mdp.isTerminal(state)
        """
        ValueIterationAgent.__init__(self, mdp, discount, iterations)

    def runValueIteration(self):
        "*** YOUR CODE HERE ***"
        states = self.mdp.getStates()
        for i in range(self.iterations):
            # Cyclic schedule: iteration i updates only states[i % len(states)].
            state = states[i % len(states)]
            # A terminal state gets no update; its value stays 0.
            if self.mdp.isTerminal(state):
                continue
            # One-step lookahead: compute the Q-value of every legal action.
            qValues = []
            for action in self.mdp.getPossibleActions(state):
                qValue = 0
                for nextState, prob in self.mdp.getTransitionStatesAndProbs(state, action):
                    reward = self.mdp.getReward(state, action, nextState)
                    qValue += prob * (reward + self.discount * self.values[nextState])
                qValues.append(qValue)
            # A state with no legal actions has no future reward; leave its value at 0.
            if qValues:
                self.values[state] = max(qValues)
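Two details are worth noting in this implementation. The update writes directly into self.values (the util.Counter inherited from ValueIterationAgent), so later iterations in the same cycle immediately see the fresh value, which is exactly the asynchronous behavior this question asks for. And the two guards, the isTerminal check and the empty qValues check, ensure max() is never applied to an empty list, covering states with no available actions.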
