
01-offline-bandit-simulation

Offline simulation for training Contextual Bandits

Simulate a real-world interaction environment of users and their respective preferences. To illustrate this, we take a partially labeled dataset (i.e., a dataset with feedback for only a subset of <user, item> pairs) and build an environment that approximates feedback/rewards for all <user, item> pairs.
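For concreteness, here is a small sketch (our own illustration, not code from the notebooks) of what such a partially labeled dataset looks like when loaded as a sparse user-by-item matrix; the MovieLens 100K file path and column layout below are assumptions.

```python
# Illustration only: build the partially labeled <user, item> feedback matrix.
import pandas as pd
from scipy import sparse

ratings = pd.read_csv(
    "ml-100k/u.data",                      # hypothetical local path to the ratings file
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)

num_users = ratings["user_id"].max()
num_items = ratings["item_id"].max()

# Rows = users, columns = movie items, values = observed ratings (missing entries = no feedback).
rating_matrix = sparse.csr_matrix(
    (ratings["rating"], (ratings["user_id"] - 1, ratings["item_id"] - 1)),
    shape=(num_users, num_items),
)

observed = rating_matrix.nnz / (num_users * num_items)
print(f"Feedback observed for {observed:.1%} of all <user, item> pairs")
```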

Objectives

  • 01a-bandit-mf-env-simulation.ipynb - Train a bandit agent against an environment that generates training data from an approximate reward matrix
  • 01b-build-training-image.ipynb - Build a Docker image for scaling training with Vertex AI
  • 01c-scale-bandit-simulation-vertex.ipynb - Submit a hyperparameter tuning job, then use the best parameters to launch a full-scale training job (both submitted to Vertex AI Training)
  • 01d-stationary-linear-envs.ipynb - Test different bandit agents against stationary linear environments; log training parameters and metrics to Vertex AI Experiments
  • 01e-stationary-perarm-envs.ipynb - Introduce the concept of "per-arm" features and run a compatible environment simulation

Why environment simulation?

To evaluate the performance of your RL model, you may need to run an offline simulation first to determine whether the model meets production criteria. In this case, you may have a static dataset, similar to the MovieLens dataset but potentially larger, from which you can construct a custom simulation environment to use in place of the MovieLens one. In the custom environment, you decide how to formulate observations and rewards, for example how to represent users as user vectors and what those vectors look like (perhaps the output of an embedding layer in a neural network). You can then apply the rest of the steps and code in this demo just as you did for MovieLens and evaluate your model. After offline simulation, you can proceed to the next steps of launching your model, such as A/B testing.
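As a hedged sketch of what such a custom environment could look like (assuming the TF-Agents bandit environment API; the class name, the `user_vectors`/`item_weights` inputs, and the dot-product reward model are illustrative assumptions, not this demo's implementation):

```python
# Minimal sketch of a custom bandit environment in TF-Agents style.
import numpy as np
from tf_agents.bandits.environments import bandit_py_environment
from tf_agents.specs import array_spec


class CustomDatasetBanditEnv(bandit_py_environment.BanditPyEnvironment):
  """Serves user vectors as observations and approximate feedback as rewards."""

  def __init__(self, user_vectors, item_weights):
    # user_vectors: [num_users, dim] float32 (e.g., rows of an embedding layer).
    # item_weights: [num_items, dim] float32, one weight vector per candidate item.
    self._user_vectors = user_vectors.astype(np.float32)
    self._item_weights = item_weights.astype(np.float32)
    self._current_user = 0
    observation_spec = array_spec.ArraySpec(
        shape=(user_vectors.shape[1],), dtype=np.float32, name='observation')
    action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0,
        maximum=item_weights.shape[0] - 1, name='action')
    super(CustomDatasetBanditEnv, self).__init__(observation_spec, action_spec)

  def _observe(self):
    # Sample a user at random and return their vector as a batch of size 1.
    self._current_user = np.random.randint(self._user_vectors.shape[0])
    return self._user_vectors[self._current_user][np.newaxis, :]

  def _apply_action(self, action):
    # Approximate reward: inner product of the user vector and the chosen item's weights.
    item = int(np.ravel(action)[0])
    reward = np.dot(self._user_vectors[self._current_user], self._item_weights[item])
    return np.array([reward], dtype=np.float32)
```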

TLDR

  • Offline evaluation of RL algorithms
  • Faster/cheaper/safer than live experiments
  • Understanding of user/system behavior
  • The questions we typically aim to answer with simulation rarely require reproducing the exact sequence of real-world events

Our custom environment

TODO

The matrix factorization (MF)-based environment simulates a real-world environment containing users and their respective preferences. Internally, the MovieLens simulation environment takes the user-by-movie rating matrix and performs a RANK_K matrix factorization on it to address the matrix's sparsity.

  • After this construction step, the environment can generate user vectors of dimension RANK_K to represent users in the simulation, and can determine the approximate reward for any <user, movie item> pair.
  • In RL terms, user vectors are observations, recommended movie items are actions, and approximate ratings are rewards.
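A back-of-the-envelope illustration of that construction step (not the environment's internal code), reusing the `rating_matrix` built in the earlier sketch:

```python
# Rank-RANK_K factorization of the rating matrix, giving user vectors
# (observations) and approximate rewards for every <user, item> pair.
import numpy as np

RANK_K = 20
dense_ratings = np.asarray(rating_matrix.todense(), dtype=np.float64)

# Truncated SVD: ratings ≈ U_k @ diag(s_k) @ Vt_k.
u, s, vt = np.linalg.svd(dense_ratings, full_matrices=False)
user_vectors = u[:, :RANK_K] * s[:RANK_K]   # [num_users, RANK_K] observations
item_vectors = vt[:RANK_K, :].T             # [num_items, RANK_K]

# Dense approximate reward for any pair, including ones with no observed rating.
approx_rewards = user_vectors @ item_vectors.T

def approximate_reward(user_id, item_id):
  """Reward the simulated environment would return for this <user, item> pair."""
  return approx_rewards[user_id, item_id]
```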

This environment therefore defines the RL problem at hand:

how to recommend movies that maximize user ratings, in a simulated world of users whose preferences are defined by the MovieLens dataset, while having zero knowledge of the environment's internal mechanism
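A hedged sketch of what solving this problem looks like with TF-Agents is shown below; the constructor arguments follow the TF-Agents MovieLens bandit example, and the data path and hyperparameter values are placeholders that may differ from the notebooks' settings.

```python
# Train a LinUCB agent against the MovieLens simulation environment (sketch).
import tensorflow as tf
from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.bandits.environments import movielens_py_environment
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import tf_py_environment
from tf_agents.replay_buffers import tf_uniform_replay_buffer

BATCH_SIZE = 8
RANK_K = 20
NUM_ACTIONS = 20          # candidate movies per decision
STEPS_PER_LOOP = 2
NUM_ITERATIONS = 100

# Environment built from the rating matrix via rank-RANK_K matrix factorization.
env = movielens_py_environment.MovieLensPyEnvironment(
    'ml-100k/u.data',     # hypothetical path to the ratings file
    RANK_K,
    BATCH_SIZE,
    num_movies=NUM_ACTIONS)
environment = tf_py_environment.TFPyEnvironment(env)

# LinUCB: a linear reward model per arm with upper-confidence-bound exploration.
agent = lin_ucb_agent.LinearUCBAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec(),
    alpha=10.0,
    dtype=tf.float32)

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.policy.trajectory_spec,
    batch_size=BATCH_SIZE,
    max_length=STEPS_PER_LOOP)

driver = dynamic_step_driver.DynamicStepDriver(
    env=environment,
    policy=agent.collect_policy,
    num_steps=STEPS_PER_LOOP * BATCH_SIZE,
    observers=[replay_buffer.add_batch])

# Collect-then-train loop: the agent only ever sees observations and rewards,
# never the environment's internal reward matrix.
for _ in range(NUM_ITERATIONS):
  driver.run()
  loss_info = agent.train(replay_buffer.gather_all())
  replay_buffer.clear()
```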

Managed Tensorboard

TODO

Hyperparameter tuning jobs

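As a hedged sketch of the tuning workflow from 01c above, a hyperparameter tuning job can be submitted with the Vertex AI SDK roughly as follows; the project, bucket, image URI, metric name, and tuned parameter names are placeholders, not the notebooks' actual values.

```python
# Tune on Vertex AI, then read out the best trial's parameters (sketch).
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(
    project='your-project-id',
    location='us-central1',
    staging_bucket='gs://your-staging-bucket')

worker_pool_specs = [{
    'machine_spec': {'machine_type': 'n1-standard-8'},
    'replica_count': 1,
    'container_spec': {
        # Training image built in 01b-build-training-image.ipynb.
        'image_uri': 'gcr.io/your-project-id/bandit-training:latest',
    },
}]

custom_job = aiplatform.CustomJob(
    display_name='bandit-simulation-train',
    worker_pool_specs=worker_pool_specs)

# The training code is expected to report the metric (e.g., with the
# cloudml-hypertune library) under the same name used in metric_spec.
hpt_job = aiplatform.HyperparameterTuningJob(
    display_name='bandit-simulation-hpt',
    custom_job=custom_job,
    metric_spec={'average_reward': 'maximize'},
    parameter_spec={
        'alpha': hpt.DoubleParameterSpec(min=0.1, max=10.0, scale='log'),
        'batch_size': hpt.DiscreteParameterSpec(values=[8, 16, 32], scale='linear'),
    },
    max_trial_count=16,
    parallel_trial_count=4)

hpt_job.run()

# Use the best trial's parameters to launch the full-scale training job.
best_trial = max(
    hpt_job.trials,
    key=lambda t: float(t.final_measurement.metrics[0].value))
print([(p.parameter_id, p.value) for p in best_trial.parameters])
```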