idea 1: generate more suggestions and only send the top n_suggestions ranked by value.
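A minimal sketch of idea 1, assuming a hypothetical controller object with a `suggest_one()` sampler and a `value()` critic estimate (both names are illustrative, not an existing API):

```python
# Sketch of idea 1: oversample candidates, return only the top
# n_suggestions ranked by the controller's value estimate.
def suggest(controller, n_suggestions, oversample=4):
    # Generate more candidates than the benchmark asks for.
    candidates = [controller.suggest_one()
                  for _ in range(oversample * n_suggestions)]
    # Rank by value estimate and keep only the top n_suggestions.
    ranked = sorted(candidates, key=controller.value, reverse=True)
    return ranked[:n_suggestions]
```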
idea 2: generate n_suggestions and set the reward for the lowest m suggestions to -1 (this should happen in the observe step)
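A sketch of idea 2, assuming a hypothetical `controller.update(suggestions, rewards)` batch policy update:

```python
# Sketch of idea 2: in the observe step, overwrite the rewards of the
# m lowest-reward suggestions with -1 before updating the controller.
def observe(controller, suggestions, rewards, m):
    # Indices of the m worst suggestions in this batch.
    worst = sorted(range(len(rewards)), key=rewards.__getitem__)[:m]
    shaped = list(rewards)
    for i in worst:
        shaped[i] = -1.0  # hard penalty for the worst m suggestions
    controller.update(suggestions, shaped)
```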
idea 3: combine ideas 1 and 2 - generate an x factor more suggestions than n_suggestions and send the top n_suggestions ranked by value. Use all of the suggestions to update the controller, setting the rewards for the suggestions that didn't make it to -1 (this should happen in the suggest step)
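A sketch of idea 3, reusing the hypothetical `suggest_one`/`value`/`update` names from above; note the -1 rewards for the dropped candidates are assigned in the suggest step, as the note says, while the kept suggestions get their real reward later in observe:

```python
# Sketch of idea 3 (ideas 1 + 2 combined): oversample by a factor x,
# forward only the top n_suggestions, and immediately penalize the rest.
def suggest(controller, n_suggestions, x=4):
    candidates = [controller.suggest_one()
                  for _ in range(x * n_suggestions)]
    ranked = sorted(candidates, key=controller.value, reverse=True)
    kept, dropped = ranked[:n_suggestions], ranked[n_suggestions:]
    # Update the controller on the dropped candidates with reward -1.
    controller.update(dropped, [-1.0] * len(dropped))
    return kept  # these get their real reward in the observe step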
idea 4: index actions by algorithm/hyperparameter name. This enables the addition of arbitrary new hyperparameters while preserving the weights of the old hyperparameters (see [metalearn] support arbitrary expansion of action space in metalearn_controller #24).
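One way to realize idea 4 in PyTorch is a per-hyperparameter action head registered under a stable string key; the metalearn_controller internals may differ, and `make_head` is a hypothetical factory:

```python
import torch.nn as nn

# Sketch of idea 4: one small action head per hyperparameter, keyed by
# "algorithm/hyperparameter" name.
class NamedActionSpace(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleDict registers parameters under stable string keys.
        self.heads = nn.ModuleDict()

    def add(self, name, make_head):
        # e.g. name = "rf/max_depth"; new hyperparameters get fresh
        # heads while existing heads keep their trained weights.
        if name not in self.heads:
            self.heads[name] = make_head()

    def forward(self, name, state):
        return self.heads[name](state)
```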
idea 5: reward function engineering: keep track of the running min and max reward over the entire run, normalizing the reward for each batch to be between -1 (min) and 1 (max)
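A small sketch of idea 5 in plain Python:

```python
# Sketch of idea 5: map each batch of rewards into [-1, 1] using the
# running min/max observed over the entire run.
class RewardNormalizer:
    def __init__(self):
        self.run_min = float("inf")
        self.run_max = float("-inf")

    def __call__(self, rewards):
        self.run_min = min(self.run_min, min(rewards))
        self.run_max = max(self.run_max, max(rewards))
        span = self.run_max - self.run_min
        if span == 0.0:
            return [0.0] * len(rewards)  # all rewards identical so far
        # Linear map: run_min -> -1, run_max -> +1.
        return [2.0 * (r - self.run_min) / span - 1.0 for r in rewards]
```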
idea 6: sample hyperparameters within the bounds specified by api_config, e.g. with a torch Normal distribution: https://pytorch.org/docs/stable/distributions.html#normal
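A sketch of idea 6, where `mu`/`sigma` would come from the controller's output head; the `"range"` key follows the bayesmark-style api_config and is an assumption here, and clamping (rather than truncating the distribution) is one simple way to respect the bounds:

```python
import torch
from torch.distributions import Normal

# Sketch of idea 6: sample a continuous hyperparameter from a Normal
# distribution and clamp it into the bounds declared in api_config.
def sample_hyperparameter(mu, sigma, api_config, name):
    lo, hi = api_config[name]["range"]
    value = Normal(mu, sigma).rsample()  # reparameterized, keeps gradients
    return torch.clamp(value, lo, hi)
```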
Noting these down for the NeurIPS BBO challenge.