
Training added on top of flux_impl #147

Draft · wants to merge 36 commits into main

Conversation

@ksikiric (Contributor) commented Feb 12, 2025

Linked to #146

I've added the training code on top of https://github.com/AI-Hypercomputer/maxdiffusion/tree/flux_impl. This PR is meant to be merged after #146.

Along with the training code, I have also added a pipeline for Flux, which can be used for inference as well.

google-cla bot commented Feb 12, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@entrpn entrpn mentioned this pull request Feb 13, 2025
@ksikiric ksikiric force-pushed the kris/flux-impl-training branch from ba2d028 to f56234e Compare February 13, 2025 12:11
@ksikiric (Contributor, Author) commented Feb 13, 2025

Background in #148

@entrpn, I've rebased on flux_lora and aligned the pipeline with the changes you made to generate_flux.py. Inference is working as expected, but I am a bit suspicious about the training. Please try it out, and let's discuss how to move forward with this.

In the meantime, I will prepare another PR adding FA for GPUs, similar to how it is done in maxtext.

@ksikiric (Contributor, Author) commented

Hi @entrpn, have you had a chance to test this PR? I think we can try to merge it soon if it looks alright to you.

@entrpn (Collaborator) commented Feb 19, 2025

> Hi @entrpn, have you had a chance to test this PR? I think we can try to merge this soon if you think it looks alright

I started to take a look at it. The pipeline fails for me during the data pipeline due to memory constraints in my environment: during text encoding, I get an OOM. The code will need to be refactored so that this can run on CPU or with at most 32 GB of accelerator memory (preferably 16), since the T5 encoder cannot be sharded at the moment. I remember solving something similar before by batching the captions in the data pipeline. I can take a look at it next week and try to get that part working.
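The caption-batching idea above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not maxdiffusion's actual data-pipeline code: `encode_fn` stands in for the (unsharded) T5 encoder call, and all names and the batch size are assumptions. The point is simply that encoding captions in small chunks bounds peak memory instead of materializing the whole dataset's encoder inputs at once.

```python
# Sketch: batch captions through the text encoder to bound peak memory.
# All names here (batched, encode_captions, encode_fn) are illustrative.
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield fixed-size slices of `items`; the last slice may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_captions(
    captions: Sequence[str],
    encode_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 8,
) -> List[List[float]]:
    """Run the text encoder chunk by chunk so only `batch_size` captions
    are resident in accelerator memory at any time."""
    embeddings: List[List[float]] = []
    for chunk in batched(captions, batch_size):
        embeddings.extend(encode_fn(chunk))
    return embeddings


# Toy stand-in encoder: maps each caption to a trivial one-dim "embedding".
toy_encode = lambda chunk: [[float(len(c))] for c in chunk]
print(encode_captions(["a cat", "a dog on a mat", "sunset"], toy_encode, batch_size=2))
```

In the real pipeline the chunked embeddings would be written out (or streamed into the training batches) rather than accumulated in one Python list, but the chunking structure is the same.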

3 participants