一般MDP最优策略的唯一性

The Uniqueness of Optimal Policies for General MDP

摘要: 对于一般的MDP模型，本文证明了对任意一族依赖于历史的随机策略所导致的策略测度类的任意凸组合，存在一个随机马氏策略所导致的策略测度，使得相应于它们的平均期望目标、折扣目标以及期望总报酬目标的值均分别相等，推广了E. B. Dynkin和Yushkevich[1]，M. Puterman[2]，E. Feinberg和A. Shwartz[3]，R. Strauch[4]，以及董泽清和宋京生[5]等相应的所有结果。然后还进一步证明了关于平均期望目标、折扣目标以及期望总报酬目标的最优策略，它们要么唯一，要么有无穷多个。

Abstract: For the general MDP model, we prove that for any convex combination of the strategic measures induced by a given family of randomized history-dependent policies, there exists a strategic measure induced by a randomized Markov policy such that the values of the average expected criterion, the discounted criterion, and the expected total reward criterion under the two measures are respectively equal. This generalizes the corresponding results of E. B. Dynkin and Yushkevich [1], M. Puterman [2], E. Feinberg and A. Shwartz [3], R. Strauch [4], and Dong Zeqing and Song Jingsheng [5]. Finally, we further prove that the optimal policies for the average expected criterion, the discounted criterion, and the expected total reward criterion are either unique or infinitely many.
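
To make the first claim concrete, the following is a minimal sketch of the classical marginal-matching construction in the simplest setting. It is an illustration only and is not taken from the paper: it assumes a countable state space $X$, a countable action set $A$, a fixed initial distribution $\mu$, a bounded reward $r$ and a discount factor $\beta \in (0,1)$. All symbols ($\pi_i$, $\alpha_i$, $P$, $\sigma$) are notation introduced here; the general model treated in the paper presumably requires conditional distributions in place of the ratio below.

```latex
% Assumed setting: countable X and A, initial distribution \mu,
% policies \pi_i with weights \alpha_i \ge 0, \sum_i \alpha_i = 1.
\[
  P \;=\; \sum_i \alpha_i \, P_\mu^{\pi_i}
  \qquad \text{(convex combination of strategic measures)}
\]
% Randomized Markov policy defined from the time-t marginals of P
% (chosen arbitrarily where the denominator vanishes):
\[
  \sigma_t(a \mid x) \;=\;
  \frac{P(x_t = x,\; a_t = a)}{P(x_t = x)}
  \qquad \text{whenever } P(x_t = x) > 0 .
\]
% By induction on t, the state-action marginals agree:
\[
  P_\mu^{\sigma}(x_t = x,\; a_t = a) \;=\; P(x_t = x,\; a_t = a)
  \qquad \text{for all } t \ge 0,\ x \in X,\ a \in A .
\]
% Any criterion depending only on these marginals then takes the same
% value under both measures, e.g. the discounted criterion:
\[
  \sum_{t=0}^{\infty} \beta^{t}\, E_\mu^{\sigma}\, r(x_t, a_t)
  \;=\;
  \sum_{t=0}^{\infty} \beta^{t} \sum_i \alpha_i \, E_\mu^{\pi_i}\, r(x_t, a_t) .
\]
```

Under these assumptions the discounted value is linear in the strategic measure, so mixing two optimal policies with any weight in $(0,1)$ again yields a Markov policy attaining the optimal value. This is one plausible reading of why optimal policies would be either unique or infinitely many; the paper's own argument may differ.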