- build baseline ranking bandits: score vector and cascading feedback (TODO)
- build ranking image (TODO)
- submit ranking train job to Vertex AI (TODO)
- serving ranking bandit with Vertex AI
- The contextual bandits approach is classified as an extension of multi-armed bandits
- A contextual multi-armed bandit problem is a simplified reinforcement learning problem in which the agent, given a context, takes one action from a set of possible actions
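As a minimal sketch of the interaction loop described above, the following toy example runs an epsilon-greedy agent on a made-up problem with two contexts and three actions. The contexts, the click probabilities, and the epsilon-greedy strategy are all assumptions chosen for illustration; none of them come from this repository or from TF-Agents.

```python
import random


def run_contextual_bandit(num_rounds=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy two-context, three-action problem."""
    rng = random.Random(seed)
    # Hypothetical environment: click probability per (context, action).
    click_prob = {
        "mobile": [0.1, 0.6, 0.2],
        "desktop": [0.7, 0.1, 0.2],
    }
    contexts = list(click_prob)
    num_actions = 3
    # Running mean reward and pull count per (context, action).
    value = {c: [0.0] * num_actions for c in contexts}
    count = {c: [0] * num_actions for c in contexts}

    for _ in range(num_rounds):
        context = rng.choice(contexts)  # the agent observes a context
        if rng.random() < epsilon:
            action = rng.randrange(num_actions)  # explore
        else:
            # Exploit: pick the action with the best estimate for this context.
            action = max(range(num_actions), key=lambda a: value[context][a])
        # The environment returns a reward only for the chosen action.
        reward = 1.0 if rng.random() < click_prob[context][action] else 0.0
        count[context][action] += 1
        # Incremental update of the running mean.
        value[context][action] += (
            reward - value[context][action]
        ) / count[context][action]

    # Best estimated action for each context after training.
    return {c: max(range(num_actions), key=lambda a: value[c][a]) for c in contexts}


best_action = run_contextual_bandit()
```

The point of the sketch is that the best action depends on the context, which is what separates a contextual bandit from a plain multi-armed bandit.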
Main differences:
- The item features are stored in the per-arm part of the observation, in the order in which they are recommended.
- Since this ordered list of items expresses what action was taken by the policy, the action value of the trajectory is not used by the agent.
Note: the "per-arm" observation received by the policy differs from the one received by the agent:
- While the agent receives the items in the recommendation slots, the policy receives the items that are available for recommendation.
- The user is responsible for converting the observation to the syntax required by the agent.
The training observation contains the global features and the features of the items in the recommendation slots
- The item features are stored in the per-arm part of the observation, in the order in which they are recommended.
- Note: since this ordered list of items expresses what action was taken by the policy, the action value of the trajectory is not used by the agent.
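The conversion from the policy's observation (all available items) to the agent's training observation (only the recommended items, in slot order) might look roughly like the sketch below. It uses plain Python lists rather than tensors. The function name `to_training_observation` and the dictionary keys `"global"` and `"per_arm"` are assumptions for illustration, not a confirmed API.

```python
def to_training_observation(global_features, available_item_features, ranked_item_indices):
    """Build a training observation from a policy-time observation.

    global_features: features shared by all items (e.g. user features).
    available_item_features: one feature list per item the policy could choose.
    ranked_item_indices: indices of the chosen items, in slot order.
    """
    # Keep only the recommended items, ordered by recommendation slot.
    slot_features = [list(available_item_features[i]) for i in ranked_item_indices]
    # Key names are hypothetical; adapt them to what the agent expects.
    return {"global": list(global_features), "per_arm": slot_features}


# Three items are available; the policy recommended item 2 first, then item 0.
training_obs = to_training_observation(
    global_features=[0.5],
    available_item_features=[[1, 0], [0, 1], [1, 1]],
    ranked_item_indices=[2, 0],
)
```

Because the slot order already encodes which items were chosen, no separate action value is needed in this representation.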
- Cascading Linear Submodular Bandits: Accounting for Position Bias and Diversity in Online Learning to Rank, G. Hiranandani, H. Singh, P. Gupta, I. A. Burhanuddin, Z. Wen and B. Kveton, 35th Conference on Uncertainty in Artificial Intelligence (2019)
  - Accounts for both position bias and diversity in forming the list of items to recommend
- Contextual Combinatorial Cascading Bandits, S. Li, B. Wang, S. Zhang, W. Chen, Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:1245-1253 (2016)
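For orientation, the cascading feedback mentioned in the task list and studied in the papers above can be sketched as follows, assuming the standard cascade model: the user scans the ranked list from the top, clicks the first attractive item, and does not examine anything below the click. The function names and attraction probabilities are hypothetical and serve only as illustration.

```python
import random


def cascading_feedback(ranked_items, attraction_prob, rng):
    """Return the index of the clicked slot, or None if nothing was clicked."""
    for slot, item in enumerate(ranked_items):
        # The user examines this slot and clicks with the item's attraction probability.
        if rng.random() < attraction_prob[item]:
            return slot  # the scan stops at the first click
    return None


def num_observed_slots(click_slot, num_slots):
    """Number of slots assumed to have been examined by the user."""
    # No click means every slot was examined and rejected.
    return num_slots if click_slot is None else click_slot + 1


rng = random.Random(0)
# Hypothetical attraction probabilities: only item "b" is ever clicked.
attraction = {"a": 0.0, "b": 1.0, "c": 0.0}
click = cascading_feedback(["a", "b", "c"], attraction, rng)
observed = num_observed_slots(click, 3)
```

In this model, items ranked below the click yield no feedback, which is one source of the position bias the first paper addresses.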