Support partial scale-down in RayJob #4169

Open
1 of 3 tasks
eric-higgins-ai opened this issue Feb 7, 2025 · 2 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@eric-higgins-ai

What would you like to be added:
Currently the RayJob admission webhook returns an error when the job sets enableInTreeAutoscaling: true. It makes sense that autoscaling up isn't supported, but I would expect autoscaling down could be supported via the "dynamic reclaim" feature.

I'd potentially be down to implement this - I just wanted to open an issue to see how people feel about this feature before spending time on it.

Why is this needed:
We want to run hyperparameter sweeps with Ray Tune, which can stop trials early based on metric values. As trials finish, the job may no longer need all the resources it was initially given, and we'd like to be able to reclaim them.

Completion requirements:
RayJobs support dynamic reclaiming if the Ray autoscaler indicates the job should scale down
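
As a rough illustration only (not an existing design or API), the scale-down half could be sketched as computing, per worker group, how many admitted replicas the autoscaler no longer wants. The `WorkerGroup` and `Reclaimable` types and the `ReclaimablePods` function below are hypothetical names made up for this sketch; scale-up is deliberately ignored.

```go
package main

import "fmt"

// WorkerGroup is a hypothetical, simplified view of one Ray worker group.
type WorkerGroup struct {
	Name     string
	Admitted int32 // replicas that quota was originally admitted for
	Desired  int32 // replicas the Ray autoscaler currently wants
}

// Reclaimable is a hypothetical record of replicas whose quota could be released.
type Reclaimable struct {
	Name  string
	Count int32
}

// ReclaimablePods returns, for each group that has scaled down, the number of
// admitted replicas that are no longer needed. Groups asking for the same or
// more replicas than admitted are skipped, since scale-up is out of scope here.
func ReclaimablePods(groups []WorkerGroup) []Reclaimable {
	var out []Reclaimable
	for _, g := range groups {
		desired := g.Desired
		if desired < 0 {
			desired = 0 // treat a negative request as "no replicas wanted"
		}
		if desired >= g.Admitted {
			continue
		}
		out = append(out, Reclaimable{Name: g.Name, Count: g.Admitted - desired})
	}
	return out
}

func main() {
	groups := []WorkerGroup{
		{Name: "small-workers", Admitted: 8, Desired: 3}, // most trials finished
		{Name: "gpu-workers", Admitted: 4, Desired: 4},   // unchanged
		{Name: "cpu-workers", Admitted: 2, Desired: 5},   // scale-up request, ignored
	}
	for _, r := range ReclaimablePods(groups) {
		fmt.Printf("%s: %d replicas reclaimable\n", r.Name, r.Count)
	}
}
```

How such counts would actually be reported and reconciled is exactly what the design doc would need to settle.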

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@eric-higgins-ai eric-higgins-ai added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 7, 2025
@mimowo
Contributor

mimowo commented Feb 7, 2025

> It makes sense that autoscaling up isn't supported, but I would expect autoscaling down could be supported via the "dynamic reclaim" feature.

Potentially, we also have a design (not implemented) for generic (any CRD) dynamic Jobs in #77. Maybe it is time to prioritize that work.

cc @mwielgus @mwysokin @tenzen-y

@tenzen-y
Member

> > It makes sense that autoscaling up isn't supported, but I would expect autoscaling down could be supported via the "dynamic reclaim" feature.
>
> Potentially, we also have a design (not implemented) for generic (any CRD) dynamic Jobs in #77. Maybe it is time to prioritize that work.
>
> cc @mwielgus @mwysokin @tenzen-y

Yeah, I think so too. Could we prioritize this after the next minor release (0.12)? We have a lot of alpha features and are trying to fix bugs in this release cycle, and this seems to be a fairly big feature.

3 participants