Optimize UMAP Creation #213

Open
aritraghsh09 opened this issue Feb 13, 2025 · 1 comment · May be fixed by #221
aritraghsh09 (Collaborator) commented Feb 13, 2025

Generating the UMAP representation takes a considerable amount of time, given that it is not multithreaded.

Since this is trivially parallelizable, deploy multiprocessing on the following loop in https://github.com/lincc-frameworks/fibad/blob/main/src/fibad/verbs/umap.py

for batch_indexes in tqdm(
    np.array_split(all_indexes, num_batches),
    desc="Creating Lower Dimensional Representation using UMAP",
    total=num_batches,
):
    # We flatten all dimensions of the input array except the dimension
    # corresponding to batch elements. This ensures that all inputs to
    # the UMAP algorithm are flattened per input item in the batch
    batch = inference_results[batch_indexes].reshape(len(batch_indexes), -1)
    batch_ids = all_ids[batch_indexes]
    transformed_batch = reducer.transform(batch)
    umap_results.write_batch(batch_ids, transformed_batch)

The tqdm progress bar will also need to change so that it reports progress correctly under multiprocessing; a rough sketch of one possible approach is below.
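
A minimal sketch of what the parallel version could look like, assuming the fitted reducer pickles cleanly into worker processes and that the write_batch calls stay in the parent process. The names run_umap_parallel, _transform_batch, and max_workers are illustrative, not part of the fibad codebase:

from concurrent.futures import ProcessPoolExecutor

import numpy as np
from tqdm import tqdm


def _transform_batch(reducer, batch):
    # Runs in a worker process; only the CPU-bound UMAP transform happens here.
    return reducer.transform(batch)


def run_umap_parallel(reducer, inference_results, all_ids, all_indexes,
                      num_batches, umap_results, max_workers=4):
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # Submit every batch up front. Each submission pickles the reducer,
        # which is wasteful but keeps the sketch simple.
        futures = []
        for batch_indexes in np.array_split(all_indexes, num_batches):
            batch = inference_results[batch_indexes].reshape(len(batch_indexes), -1)
            futures.append((all_ids[batch_indexes], pool.submit(_transform_batch, reducer, batch)))

        # Iterate in submission order so the write_batch calls stay sequential
        # in the parent process; tqdm then only has to track a plain loop, so
        # no multiprocessing-aware progress bar is needed.
        for batch_ids, future in tqdm(
            futures,
            desc="Creating Lower Dimensional Representation using UMAP",
            total=num_batches,
        ):
            umap_results.write_batch(batch_ids, future.result())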

aritraghsh09 changed the title from "Optimize UMAP creating" to "Optimize UMAP Creation" on Feb 13, 2025
mtauraso added this to the Post MTR milestone on Feb 14, 2025
mtauraso (Collaborator) commented:

This is not as trivial as it appears, because InferenceDataSetWriter.write_batch is neither thread-safe nor multiprocessing-safe.

I think the solution here looks like:

  1. Take all the CPU-bound inference operations and parallelize them (the trivial part)
  2. Make write_batch not block on np.save, instead spinning off a subprocess or thread to wait on that call.

This way the calls to write_batch remain sequential, so the bookkeeping and overall interface of InferenceDataSetWriter don't need big changes. A sketch of point 2 is below.
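
As a rough sketch of point 2, under the assumption that write_batch keeps its bookkeeping synchronous and hands only the final np.save off to a background thread. The helper below is hypothetical; the real InferenceDataSetWriter internals may look different:

import threading

import numpy as np


class NonBlockingSaver:
    """Hypothetical helper: write_batch keeps its bookkeeping on the calling
    thread and hands only the np.save call off to this object."""

    def __init__(self):
        self._pending = []

    def save(self, path, array):
        # np.save is I/O-bound, so a thread is enough to hide the latency
        # without complicating the writer's sequential interface.
        thread = threading.Thread(target=np.save, args=(path, array))
        thread.start()
        self._pending.append(thread)

    def wait(self):
        # Join all outstanding saves, e.g. when the writer is finalized, so
        # results are guaranteed to be on disk before anyone reads them back.
        for thread in self._pending:
            thread.join()
        self._pending.clear()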

mtauraso self-assigned this on Feb 18, 2025
mtauraso linked a pull request on Feb 18, 2025 that will close this issue