Simulate a real-world interaction environment of users and their respective preferences. To illustrate this, we take a partially labeled dataset (i.e., a dataset with feedback for only a subset of `<user, item>` pairs) and create an environment that approximates feedback/rewards for all `<user, item>` pairs.
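As a concrete illustration of the idea, below is a minimal, self-contained sketch (plain NumPy on synthetic data, not the repo's code) of approximating a reward for every `<user, item>` pair from sparse feedback with a low-rank factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_USERS, NUM_ITEMS, RANK_K = 100, 20, 5

# Sparse feedback: 0 means "no rating observed" for that <user, item> pair.
ratings = rng.integers(1, 6, size=(NUM_USERS, NUM_ITEMS)).astype(float)
ratings[rng.random(ratings.shape) < 0.8] = 0.0

# Rank-RANK_K factorization of the sparse rating matrix via truncated SVD.
u, s, vt = np.linalg.svd(ratings, full_matrices=False)
user_vectors = u[:, :RANK_K] * s[:RANK_K]   # RANK_K-dimensional user representations
item_vectors = vt[:RANK_K, :].T             # RANK_K-dimensional item representations

# Approximate reward for *any* <user, item> pair, observed or not.
approx_rewards = user_vectors @ item_vectors.T
print(approx_rewards[3, 7])                 # approximate reward for user 3, item 7
```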
The notebooks in this part of the demo:

- `01a-bandit-mf-env-simulation.ipynb`: train a bandit agent with an environment that generates training data from the approximate reward matrix
- `01b-build-training-image.ipynb`: build a Docker image for scaling training with Vertex AI
- `01c-scale-bandit-simulation-vertex.ipynb`: submit a hyperparameter tuning job, then use the best hyperparameters to launch a full-scale training job (both the tuning and full-scale jobs are submitted to Vertex AI Training)
- `01d-stationary-linear-envs.ipynb`: test different bandit agents against stationary linear environments; log training parameters and metrics to Vertex AI Experiments
- `01e-stationary-perarm-envs.ipynb`: introduce the concept of "per-arm" features and run a compatible environment simulation
To evaluate the performance of your RL model, you may need to run an offline simulation first to determine whether the model meets your production criteria. In this case, you may have a static dataset, similar to the MovieLens dataset but potentially larger, and you can construct a custom simulation environment to use in place of the MovieLens one. In the custom environment, you decide how to formulate observations and rewards: for example, how to represent users as user vectors and how to produce those vectors (perhaps via an embedding layer in a neural network). You can then apply the rest of the steps and code in this demo just as you did for MovieLens, and evaluate your model. After offline simulation, you can proceed to the next steps of launching your model, such as A/B testing.
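A hypothetical sketch of such a custom environment is shown below, assuming TF-Agents' `BanditPyEnvironment` base class (an observation spec, an action spec, and the `_observe` / `_apply_action` methods); the embedding matrices are placeholders for however you choose to represent your users and items:

```python
import numpy as np
from tf_agents.bandits.environments import bandit_py_environment
from tf_agents.specs import array_spec


class CustomDatasetEnvironment(bandit_py_environment.BanditPyEnvironment):
  """Serves user vectors as observations and dataset-derived rewards."""

  def __init__(self, user_embeddings, item_embeddings):
    self._users = user_embeddings   # shape: [num_users, emb_dim], e.g. learned embeddings
    self._items = item_embeddings   # shape: [num_items, emb_dim]
    observation_spec = array_spec.ArraySpec(
        shape=(user_embeddings.shape[1],), dtype=np.float32, name='observation')
    action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0,
        maximum=item_embeddings.shape[0] - 1, name='action')
    super().__init__(observation_spec, action_spec)

  def _observe(self):
    # Sample a user; their vector is the observation shown to the agent.
    self._current_user = np.random.randint(self._users.shape[0])
    return self._users[self._current_user].astype(np.float32)

  def _apply_action(self, action):
    # Approximate reward for recommending item `action` to the current user.
    return np.float32(np.dot(self._users[self._current_user], self._items[action]))
```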
TLDR
- Offline evaluation of RL algorithms
- Faster/cheaper/safer than live experiments
- Understanding of user/system behavior
- Questions that we typically aim to answer with simulation rarely require realistically reproducing the exact sequence of events
TODO
The MF-based environment simulates a real-world environment containing users and their respective preferences. Internally, the MovieLens simulation environment takes the user-by-movie rating matrix and performs a rank-`RANK_K` matrix factorization on it, in order to address the sparsity of the matrix.
- After this construction step, the environment can generate user vectors of dimension `RANK_K` to represent users in the simulation, and it can determine an approximate reward for any `<user, movie>` pair.
- In RL terms, the user vectors are observations, the recommended movie items are actions, and the approximate ratings are rewards.
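A short sketch of how these pieces map to code is below, assuming the `MovieLensPyEnvironment` class from `tf_agents.bandits.environments` with a `(data_path, rank_k, batch_size, num_movies, csv_delimiter)` constructor; treat the exact signature and the data path as assumptions to verify against the notebooks:

```python
from tf_agents.bandits.environments import movielens_py_environment
from tf_agents.environments import tf_py_environment

RANK_K = 20        # rank of the factorization = dimension of the user vectors
BATCH_SIZE = 8
NUM_ACTIONS = 20   # number of movie items the agent can recommend

env = movielens_py_environment.MovieLensPyEnvironment(
    'gs://your-bucket/movielens/u.data',   # placeholder path to the ratings file
    RANK_K, BATCH_SIZE, num_movies=NUM_ACTIONS, csv_delimiter='\t')
environment = tf_py_environment.TFPyEnvironment(env)

time_step = environment.reset()
print(time_step.observation.shape)   # (BATCH_SIZE, RANK_K): user vectors = observations
print(environment.action_spec())     # integer movie index in [0, NUM_ACTIONS): actions
# Stepping the environment with a chosen action returns the approximate rating as the reward.
```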
This environment therefore defines the RL problem at hand: recommend movies that maximize user ratings in a simulated world of users whose preferences are defined by the MovieLens dataset, while having zero knowledge of the environment's internal mechanism.
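Continuing from the sketch above, a minimal, purely illustrative training loop with a TF-Agents LinUCB agent might look as follows; the agent only ever sees observations (user vectors) and rewards (approximate ratings), never the environment's internal rating matrix, and the hyperparameters are placeholders:

```python
import tensorflow as tf
from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.replay_buffers import tf_uniform_replay_buffer

agent = lin_ucb_agent.LinearUCBAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec(),
    alpha=10.0,                       # exploration strength (illustrative value)
    dtype=tf.float32)

STEPS_PER_LOOP = 2
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.policy.trajectory_spec,
    batch_size=BATCH_SIZE,
    max_length=STEPS_PER_LOOP)

driver = dynamic_step_driver.DynamicStepDriver(
    env=environment,
    policy=agent.collect_policy,
    num_steps=STEPS_PER_LOOP * BATCH_SIZE,
    observers=[replay_buffer.add_batch])

for _ in range(50):                   # collect experience, then update the agent
  driver.run()
  agent.train(replay_buffer.gather_all())
  replay_buffer.clear()
```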
TODO