Refactor monolithic ArgInfoStep into separate classes encoding different operations #2744

Open
wants to merge 6 commits into
base: main

@che-sh (Contributor) commented Feb 13, 2025

Summary:
TorchRec's rewriting logic got a bit hairy over the years. This sequence of changes refactors it to be less convoluted and more maintainable.

This change: splits the monolithic `ArgInfoStep` into multiple classes, each handling a single operation (plus the minimum data necessary to perform it).

Internal

Diff stack navigation:

  1. D69292525 and below - before refactoring
  2. D69438143 - Refactor get_node_args and friends into a class
  3. D69461227 - refactor "joint lists" in ArgInfo into a list of ArgInfoStep
  4. D69461226 - refactor _build_args_kwargs into instance methods on ArgInfo and ArgInfoStep
  5. D69461228 - split monolithic ArgInfoStep into a class hierarchy (you are here)

Differential Revision: D69461228

@facebook-github-bot added the CLA Signed label Feb 13, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D69461228

Summary:

The `_shard_modules` function is used in the fx_traceability tests for the SDD and SemiSync pipelines. It uses a default ShardingPlanner and Topology that hardcode the batch size (512) and the HBM memory limit (32 GB), respectively. This change allows callers to specify the ShardingPlanner and Topology so that they reflect the machine's capabilities more accurately. The change is intentionally limited to `_shard_modules` and does not touch the public `shard_modules`, to avoid changing the contract of the latter.
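The shape of that change can be sketched as optional parameters that fall back to the old defaults. This is only an illustration: `Topology` and `ShardingPlanner` below are simplified stand-in dataclasses written for this sketch (not the real TorchRec classes), and `shard_modules_sketch` is a hypothetical name, not the signature of `_shard_modules`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


# Stand-ins for illustration only; NOT the real TorchRec classes.
@dataclass
class Topology:
    world_size: int
    hbm_cap_bytes: int = 32 * 1024**3  # the previously hardcoded 32 GB limit


@dataclass
class ShardingPlanner:
    topology: Topology
    batch_size: int = 512  # the previously hardcoded batch size


def shard_modules_sketch(
    module_names: List[str],
    world_size: int,
    planner: Optional[ShardingPlanner] = None,
    topology: Optional[Topology] = None,
) -> Dict[str, ShardingPlanner]:
    """Hypothetical helper: use caller-supplied planner/topology, else defaults."""
    if topology is None:
        topology = Topology(world_size=world_size)
    if planner is None:
        planner = ShardingPlanner(topology=topology)
    # A real implementation would compute a sharding plan here; the sketch
    # only records which planner each module would be planned with.
    return {name: planner for name in module_names}
```

Existing callers that pass nothing keep the old behaviour, which is why such a change does not alter the contract for them.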

Reviewed By: sarckk

Differential Revision: D69163227
…stproc modules (pytorch#2733)

Summary:

Postproc modules with collection inputs (list or dict) that contain non-static elements (elements derived from the model input or from another postproc) were not properly rewritten: those elements remained fx.Nodes even during the actual model forward (i.e. outside the rewrite, during pipeline execution).

To illustrate:

```
def forward(self, model_input: ...) -> ...:
    # modified_input is derived from the model input, so it is non-static
    modified_input = model_input.float_features + 1
    sharded_module_input = self.postproc(model_input, modified_input)  # works
    sharded_module_input = self.postproc(model_input, [123])  # works
    sharded_module_input = self.postproc(model_input, [torch.ones_like(modified_input)])  # fails
    sharded_module_input = self.postproc(model_input, [modified_input])  # fails
    sharded_module_input = self.postproc(model_input, { 'a': 123 })  # works
    sharded_module_input = self.postproc(model_input, { 'a': torch.ones_like(modified_input) })  # fails
    sharded_module_input = self.postproc(model_input, { 'a': modified_input })  # fails

    return self.ebc(sharded_module_input)
```
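One way to picture the failure mode is a resolver that only substitutes top-level arguments, compared with one that recurses into lists and dicts. The sketch below is a plain-Python illustration, not the actual fix in this commit: `GraphNode` is a hypothetical stand-in for `fx.Node`, and `env` is an assumed mapping from node names to runtime values.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class GraphNode:
    """Hypothetical stand-in for a traced graph node (the role fx.Node plays)."""
    name: str


def resolve_shallow(arg: Any, env: Dict[str, Any]) -> Any:
    """Resolves only a top-level node; nodes nested in collections stay nodes."""
    return env[arg.name] if isinstance(arg, GraphNode) else arg


def resolve(arg: Any, env: Dict[str, Any]) -> Any:
    """Replaces every GraphNode in arg, at any nesting depth, with its value."""
    if isinstance(arg, GraphNode):
        return env[arg.name]
    if isinstance(arg, (list, tuple)):
        return type(arg)(resolve(item, env) for item in arg)
    if isinstance(arg, dict):
        return {key: resolve(value, env) for key, value in arg.items()}
    return arg  # static constant such as 123


env = {"modified_input": 42}
node = GraphNode("modified_input")
leaked = resolve_shallow([node], env)   # the node is still inside the list
fixed = resolve({"a": [node]}, env)     # nested node replaced by its value
```

With the shallow variant, `[123]` works because it holds no nodes, while `[modified_input]` leaks a node into the forward call, which matches the works/fails pattern in the example above.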

Differential Revision: D69292525
Summary:

TorchRec's rewriting logic got a bit hairy over the years. This sequence of changes refactors it to be less convoluted and more maintainable.

This change: `_get_node_args` and related functions pass around a lot of "context" (train_pipeline_context, streams, etc.) that rarely or never changes, plus some "state" (model, pipelined_preprocs) that is accumulated during the run. Refactoring `_get_node_args` (and friends) into a class allows passing those into the class constructor once, which simplifies the call signatures considerably.
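The pattern described above can be sketched as follows. Everything here is an assumption made for illustration: the class name `NodeArgsResolver`, its field types, and the placeholder method bodies are hypothetical and do not reproduce the actual TorchRec code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class NodeArgsResolver:
    """Hypothetical class holding what the free functions used to pass around."""

    # "Context": set once in the constructor, rarely or never changes.
    model: Any
    context: Any
    streams: Dict[str, Any]
    # "State": accumulated while nodes are processed.
    pipelined_preprocs: List[str] = field(default_factory=list)

    def get_node_args(self, node_name: str) -> List[str]:
        # Callers now pass only what varies per call (the node).
        self.pipelined_preprocs.append(node_name)
        return self._get_node_args_helper(node_name)

    def _get_node_args_helper(self, node_name: str) -> List[str]:
        # Placeholder body: a real helper would walk the graph and could
        # read self.context / self.streams without extra parameters.
        return [f"{node_name}.arg0"]


resolver = NodeArgsResolver(model=object(), context=None, streams={})
first = resolver.get_node_args("postproc")
resolver.get_node_args("other_postproc")
```

The helper methods reach shared context through `self`, so adding a new piece of context no longer means touching every function signature in the call chain.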

Internal

Diff stack navigation:
1. D69292525 and below - before refactoring
2. D69438143 - Refactor get_node_args and friends into a class (**you are here**)
3. D69461227 - refactor "joint lists" in ArgInfo into a list of ArgInfoStep
4. D69461226 - refactor `_build_args_kwargs` into instance methods on ArgInfo and ArgInfoStep
5. D69461228 - split monolithic `ArgInfoStep` into a class hierarchy

Differential Revision: D69438143
…ch#2742)

Summary:

TorchRec's rewriting logic got a bit hairy over the years. This sequence of changes refactors it to be less convoluted and more maintainable.

This change: ArgInfo uses a "synchronized lists" pattern: it has 4 attributes, each a list, that semantically represent different fields of one data structure (i.e. input_attrs[0], is_getitems[0], ... all relate to a single transformation on the input, and all lists must have the same number of elements). This diff refactors them into an actual list of instances of a (new) `ArgInfoStep` class that encapsulates the related fields.
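A minimal before/after sketch of that refactoring is shown below. Only the two list names mentioned above (`input_attrs`, `is_getitems`) are taken from the description; the other two lists are omitted, and the class layouts, the `ArgInfoParallel` name, and the `from_parallel` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


class ArgInfoParallel:
    """Before (sketch): parallel lists kept in sync only by convention."""

    def __init__(self) -> None:
        self.input_attrs: List[str] = []
        self.is_getitems: List[bool] = []
        # The remaining two lists are left out of this sketch.


@dataclass
class ArgInfoStep:
    """After (sketch): one object per transformation on the input."""
    input_attr: str
    is_getitem: bool


@dataclass
class ArgInfo:
    steps: List[ArgInfoStep]


def from_parallel(old: ArgInfoParallel) -> ArgInfo:
    """Hypothetical converter; fails loudly if the lists drifted apart."""
    if len(old.input_attrs) != len(old.is_getitems):
        raise ValueError("parallel lists are out of sync")
    pairs = zip(old.input_attrs, old.is_getitems)
    return ArgInfo([ArgInfoStep(attr, getitem) for attr, getitem in pairs])
```

With a list of step objects, the "same length" invariant cannot be violated, because each step carries all of its own fields.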

Internal

Diff stack navigation:
1. D69292525 and below - before refactoring
2. D69438143 - Refactor get_node_args and friends into a class 
3. D69461227 - refactor "joint lists" in ArgInfo into a list of ArgInfoStep (**you are here**)
4. D69461226 - refactor `_build_args_kwargs` into instance methods on ArgInfo and ArgInfoStep
5. D69461228 - split monolithic `ArgInfoStep` into a class hierarchy

Differential Revision: D69461227
…Info (pytorch#2743)

Summary:

TorchRec's rewriting logic got a bit hairy over the years. This sequence of changes refactors it to be less convoluted and more maintainable.

This change:
* almost all code in `_build_args_kwargs` deals with the fields of ArgInfoStep, and the remaining part handles looping over `ArgInfo.steps`, so this change just colocates the "behavior" (`_build_args_kwargs` logic) with the data it belongs to.
* introduces helper functions/factory methods for various types of ArgInfoStep
* encapsulates the logic of handling a `List[ArgInfo]` into a `CallArgs` class (and changes it a bit: args and kwargs are now explicitly separated, rather than differing by an empty/present `ArgInfo.name` field)
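The three points above can be illustrated together in one sketch. The class names `ArgInfoStep`, `ArgInfo`, and `CallArgs` come from the description; the method names (`process`, `from_attr`, `from_getitem`, `process_steps`, `build_args_kwargs`) and the field layout are assumptions for illustration, not the actual TorchRec signatures.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple


@dataclass
class ArgInfoStep:
    input_attr: Any
    is_getitem: bool

    def process(self, value: Any) -> Any:
        """Behavior now lives next to the fields it depends on."""
        if self.is_getitem:
            return value[self.input_attr]
        return getattr(value, self.input_attr)

    # Hypothetical factory methods for the different kinds of step.
    @classmethod
    def from_attr(cls, name: str) -> "ArgInfoStep":
        return cls(name, False)

    @classmethod
    def from_getitem(cls, key: Any) -> "ArgInfoStep":
        return cls(key, True)


@dataclass
class ArgInfo:
    steps: List[ArgInfoStep]

    def process_steps(self, value: Any) -> Any:
        for step in self.steps:
            value = step.process(value)
        return value


@dataclass
class CallArgs:
    # Positional and keyword arguments are stored separately, instead of
    # being told apart by an empty or present name field.
    args: List[ArgInfo] = field(default_factory=list)
    kwargs: Dict[str, ArgInfo] = field(default_factory=dict)

    def build_args_kwargs(self, model_input: Any) -> Tuple[List[Any], Dict[str, Any]]:
        args = [info.process_steps(model_input) for info in self.args]
        kwargs = {k: info.process_steps(model_input) for k, info in self.kwargs.items()}
        return args, kwargs


batch = SimpleNamespace(features={"dense": 1.5}, label=7)
call = CallArgs(
    args=[ArgInfo([ArgInfoStep.from_attr("features"), ArgInfoStep.from_getitem("dense")])],
    kwargs={"label": ArgInfo([ArgInfoStep.from_attr("label")])},
)
args, kwargs = call.build_args_kwargs(batch)
```

Here `batch` is a toy input; the first argument is obtained by an attribute access followed by an item lookup, mirroring how a chain of steps transforms the model input.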

Internal

Diff stack navigation:
1. D69292525 and below - before refactoring
2. D69438143 - Refactor get_node_args and friends into a class 
3. D69461227 - refactor "joint lists" in ArgInfo into a list of ArgInfoStep
4. D69461226 - refactor `_build_args_kwargs` into instance methods on ArgInfo and ArgInfoStep (**you are here**)
5. D69461228 - split monolithic `ArgInfoStep` into a class hierarchy

Differential Revision: D69461226
…ent operations (pytorch#2744)

Summary:

TorchRec's rewriting logic got a bit hairy over the years. This sequence of changes refactors it to be less convoluted and more maintainable.

This change: splits the monolithic `ArgInfoStep` into multiple classes, each handling a single operation (plus the minimum data necessary to perform it).
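A split of this kind might look roughly like the hierarchy below. The subclass names, the `process` method, and the set of operations shown are hypothetical choices for this sketch; the actual classes introduced by the commit may be named and organized differently.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, List


class BaseArgInfoStep(ABC):
    """Hypothetical base class: each subclass encodes exactly one operation."""

    @abstractmethod
    def process(self, value: Any) -> Any:
        ...


class NoopStep(BaseArgInfoStep):
    """Passes the value through; needs no data at all."""

    def process(self, value: Any) -> Any:
        return value


@dataclass
class GetAttrStep(BaseArgInfoStep):
    attr_name: str  # the only data an attribute access needs

    def process(self, value: Any) -> Any:
        return getattr(value, self.attr_name)


@dataclass
class GetItemStep(BaseArgInfoStep):
    key: Any  # the only data an item lookup needs

    def process(self, value: Any) -> Any:
        return value[self.key]


@dataclass
class ConstantStep(BaseArgInfoStep):
    constant: Any

    def process(self, value: Any) -> Any:
        return self.constant  # ignores the incoming value


def run_steps(steps: List[BaseArgInfoStep], value: Any) -> Any:
    """Applies the steps in order, feeding each result into the next step."""
    for step in steps:
        value = step.process(value)
    return value


batch = SimpleNamespace(features={"dense": 1.5})
result = run_steps([GetAttrStep("features"), GetItemStep("dense")], batch)
```

Compared with one class carrying every field plus boolean flags, each subclass here holds only the data its own operation uses, so invalid field combinations cannot be constructed.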

Internal

Diff stack navigation:
1. D69292525 and below - before refactoring
2. D69438143 - Refactor get_node_args and friends into a class 
3. D69461227 - refactor "joint lists" in ArgInfo into a list of ArgInfoStep
4. D69461226 - refactor `_build_args_kwargs` into instance methods on ArgInfo and ArgInfoStep 
5. D69461228 - split monolithic `ArgInfoStep` into a class hierarchy (**you are here**)

Differential Revision: D69461228

Labels: CLA Signed, fb-exported