Equation (7) allows us to solve for the unbiased estimate iteratively. In practice, users inevitably exhibit more complex behavior driven by how the recommended items are combined, which breaks the assumption that the cumulative rewards can be computed independently. Nevertheless, taking this result as a starting point, we can derive recursive formulas for estimating the cumulative reward under such more complex assumptions, as sketched below.
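Since equation (7) itself is not reproduced in this excerpt, the following is only a minimal, hypothetical sketch of the kind of recursion described above: an incremental-mean update of the cumulative-reward estimate, under the simplifying assumption that item-level rewards within a recommended slate contribute independently. The function names and update rule are illustrative assumptions, not the paper's exact formula.

```python
from typing import Sequence


def update_estimate(q_prev: float, n: int, g_new: float) -> float:
    """Incrementally update an unbiased sample-mean estimate of the
    expected cumulative reward after observing the n-th return g_new."""
    return q_prev + (g_new - q_prev) / n


def slate_return(item_rewards: Sequence[float]) -> float:
    """Cumulative reward of one recommended slate under the simplifying
    assumption that item-level rewards contribute independently."""
    return sum(item_rewards)


if __name__ == "__main__":
    q, n = 0.0, 0
    # Hypothetical per-item reward observations for three recommended slates.
    for rewards in ([0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]):
        n += 1
        q = update_estimate(q, n, slate_return(rewards))
    print(f"estimated expected cumulative reward: {q:.3f}")
```

Relaxing the independence assumption would amount to replacing `slate_return` with a model of how item combinations jointly affect user behavior, while keeping the same recursive update of the estimate.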