Optimize UMAP Creation #213

Open
aritraghsh09 opened this issue Feb 13, 2025 · 1 comment · May be fixed by #221
aritraghsh09 (Collaborator) commented Feb 13, 2025

Generating the UMAP representation takes a considerable amount of time, given that it is not multithreaded.

Since this is trivially parallelizable, deploy multiprocessing on the following loop in https://github.com/lincc-frameworks/fibad/blob/main/src/fibad/verbs/umap.py

for batch_indexes in tqdm(
    np.array_split(all_indexes, num_batches),
    desc="Creating Lower Dimensional Representation using UMAP",
    total=num_batches,
):
    # We flatten all dimensions of the input array except the dimension
    # corresponding to batch elements. This ensures that all inputs to
    # the UMAP algorithm are flattened per input item in the batch
    batch = inference_results[batch_indexes].reshape(len(batch_indexes), -1)
    batch_ids = all_ids[batch_indexes]
    transformed_batch = reducer.transform(batch)
    umap_results.write_batch(batch_ids, transformed_batch)

The tqdm progress bar will also need to change so that it reports progress correctly under multiprocessing; a rough sketch of one possible approach is below.
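
A minimal sketch of what the parallel version could look like, assuming the fitted reducer pickles cleanly into worker processes and that the write_batch calls stay in the parent process. The names run_umap_parallel, _transform_batch, and max_workers are illustrative, not part of the fibad codebase:

from concurrent.futures import ProcessPoolExecutor

import numpy as np
from tqdm import tqdm


def _transform_batch(reducer, batch):
    # Runs in a worker process; only the CPU-bound UMAP transform happens here.
    return reducer.transform(batch)


def run_umap_parallel(reducer, inference_results, all_ids, all_indexes,
                      num_batches, umap_results, max_workers=4):
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # Submit every batch up front. Each submission pickles the reducer,
        # which is wasteful but keeps the sketch simple.
        futures = []
        for batch_indexes in np.array_split(all_indexes, num_batches):
            batch = inference_results[batch_indexes].reshape(len(batch_indexes), -1)
            futures.append((all_ids[batch_indexes], pool.submit(_transform_batch, reducer, batch)))

        # Iterate in submission order so the write_batch calls stay sequential
        # in the parent process; tqdm then only has to track a plain loop, so
        # no multiprocessing-aware progress bar is needed.
        for batch_ids, future in tqdm(
            futures,
            desc="Creating Lower Dimensional Representation using UMAP",
            total=num_batches,
        ):
            umap_results.write_batch(batch_ids, future.result())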

aritraghsh09 changed the title from "Optimize UMAP creating" to "Optimize UMAP Creation" on Feb 13, 2025
mtauraso added this to the Post MTR milestone on Feb 14, 2025
mtauraso (Collaborator) commented:

This is not as trivial as it appears, because InferenceDataSetWriter.write_batch is neither thread-safe nor multiprocessing-safe.

I think the solution here looks like:

  1. Take all the CPU-bound inference operations and parallelize them (the trivial part)
  2. Make write_batch not block on np.save, instead spinning off a subprocess or thread to wait on that call.

This way the calls to write_batch remain sequential, so the bookkeeping and overall interface of InferenceDataSetWriter don't need big changes. A sketch of point 2 is below.
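
As a rough sketch of point 2, under the assumption that write_batch keeps its bookkeeping synchronous and hands only the final np.save off to a background thread. The helper below is hypothetical; the real InferenceDataSetWriter internals may look different:

import threading

import numpy as np


class NonBlockingSaver:
    """Hypothetical helper: write_batch keeps its bookkeeping on the calling
    thread and hands only the np.save call off to this object."""

    def __init__(self):
        self._pending = []

    def save(self, path, array):
        # np.save is I/O-bound, so a thread is enough to hide the latency
        # without complicating the writer's sequential interface.
        thread = threading.Thread(target=np.save, args=(path, array))
        thread.start()
        self._pending.append(thread)

    def wait(self):
        # Join all outstanding saves, e.g. when the writer is finalized, so
        # results are guaranteed to be on disk before anyone reads them back.
        for thread in self._pending:
            thread.join()
        self._pending.clear()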

mtauraso self-assigned this on Feb 18, 2025
mtauraso linked a pull request on Feb 18, 2025 that will close this issue