customsolve.py: incorrect number of arguments to _find_pivot. #542

Open
rileyjmurray opened this issue Feb 18, 2025 · 2 comments

Labels: bug A bug or regression

@rileyjmurray (Contributor)

Here's a code snippet in pygsti/optimize/customsolve.py on the head of master.

smbuf1b = _np.zeros(1, _np.int64)
smbuf2 = _np.zeros(2, _np.int64)
smbuf3 = _np.zeros(3, _np.int64)
for icol in range(a.shape[1]):
    # Step 1: find the index of the row that is the best pivot.
    # each proc looks for its best pivot (Note: it should not consider rows already pivoted on)
    potential_pivot_indices = all_row_indices[potential_pivot_mask]
    ibest_global, ibest_local, h, k = _find_pivot(a, b, icol, potential_pivot_indices, my_row_slice,
                                                  shared_floats, shared_ints, resource_alloc, comm, host_comm,
                                                  smbuf1, smbuf2, smbuf3, host_index_buf, host_val_buf)

The smbuf1b variable is never used. It looks like it should be passed to _find_pivot immediately after smbuf1, per the signature below (the corresponding parameter names in the signature are just buf1 and buf1b):

def _find_pivot(a, b, icol, potential_pivot_inds, my_row_slice, shared_floats, shared_ints,
                resource_alloc, comm, host_comm, buf1, buf1b, buf2, buf3, best_host_indices, best_host_vals):

As a consequence of smbuf1b not being passed, we're calling _find_pivot with only 15 positional arguments when its signature requires 16, so this code path fails with a TypeError.
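
For reference, a minimal sketch of what the corrected call would presumably look like, assuming the only change needed is to insert smbuf1b after smbuf1 (all other argument names are taken verbatim from the snippet above):

    ibest_global, ibest_local, h, k = _find_pivot(a, b, icol, potential_pivot_indices, my_row_slice,
                                                  shared_floats, shared_ints, resource_alloc, comm, host_comm,
                                                  smbuf1, smbuf1b, smbuf2, smbuf3, host_index_buf, host_val_buf)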

This bug rarely surfaces because there's a dedicated code path (below) for handling linear solves with matrices of order < 10,000, and that path gets triggered in our testing and even in most large GST runs:

if comm.size < proc_threshold and a.shape[1] < 10000:
    # We're not exactly sure where scipy is better, but until we speed up / change gaussian-elim
    # alg the scipy alg is much faster for small numbers of procs and so should be used unless
    # A is too large to be gathered to the root proc.
    global_a, a_shm = ari.gather_jtj(a, return_shared=True)
    global_b, b_shm = ari.gather_jtf(b, return_shared=True)
    #global_a = ari.gather_jtj(a)
    #global_b = ari.gather_jtf(b)
    if comm.rank == 0:
        try:
            global_x = _scipy.linalg.solve(global_a, global_b, assume_a='pos')
            ok_buf[0] = 1  # ok
        except _scipy.linalg.LinAlgError as e:
            ok_buf[0] = 0  # failure!
            err = e
    else:
        global_x = None
        err = _scipy.linalg.LinAlgError("Linear solver fail on root proc!")  # just in case...
    comm.Bcast(ok_buf, root=0)
    if ok_buf[0] == 0:
        _smt.cleanup_shared_ndarray(a_shm)
        _smt.cleanup_shared_ndarray(b_shm)
        raise err  # all procs must raise in sync
    ari.scatter_x(global_x, x)
    _smt.cleanup_shared_ndarray(a_shm)
    _smt.cleanup_shared_ndarray(b_shm)
    return

Possible ways to move forward:

  1. Temporary. Just change the dimension threshold from 10,000 to 100,000. Storing one double-precision matrix of order 100,000 requires ~80 GB (100,000² entries × 8 bytes each), and the overhead of the Python call means this will require 160 GB of "free" RAM to run. That's a reasonable amount for modern compute servers.
  2. Try to fix the bug in the custom Gaussian elimination code. Have the branch on matrix size check a user-modifiable constant instead of a hard-coded value, so that some of our tests can exercise this code path with a small matrix-size threshold (a rough sketch of this idea follows the list).
  3. Ambitious. Create a new ArrayInterface class that relies on Dask arrays. Replace the custom Gaussian elimination code with calls to Dask's distributed linear algebra functions: https://docs.dask.org/en/stable/_modules/dask/array/linalg.html.
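
As a rough illustration of option 2's threshold change, here is a minimal sketch; the constant and helper names are hypothetical, not pyGSTi's actual code:

    # Hypothetical module-level constant replacing the hard-coded 10,000 literal.
    _SCIPY_SOLVE_SIZE_THRESHOLD = 10000

    def _use_gathered_scipy_solve(num_procs, num_cols, proc_threshold):
        # Decide whether to gather A to the root proc and call scipy.linalg.solve,
        # or fall through to the distributed Gaussian-elimination code.
        return num_procs < proc_threshold and num_cols < _SCIPY_SOLVE_SIZE_THRESHOLD

A test that wants to exercise the custom Gaussian-elimination path could then temporarily lower _SCIPY_SOLVE_SIZE_THRESHOLD (e.g. by monkey-patching the module) instead of constructing a 10,000-column problem.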
rileyjmurray added the bug label on Feb 18, 2025
@rileyjmurray (Contributor, Author)

It looks like the bug was introduced here:
0573832

That commit leaves the call to _find_pivot unchanged:
0573832#diff-b5f5b8b216bb22ad8b9236aee7c88e7f141562fb4a7584e15e5a885f26361cbdR166-R168
but it modifies _find_pivot to take another argument:
0573832#diff-b5f5b8b216bb22ad8b9236aee7c88e7f141562fb4a7584e15e5a885f26361cbdR213-R214

@rileyjmurray (Contributor, Author)

I've gone ahead with the second of my possible ways to move forward:

Try to fix the bug in the custom Gaussian elimination code. Have the branch on matrix size check a user-modifiable constant instead of a hard-coded value, so that some of our tests can exercise this code path with a small matrix-size threshold.

The changes, which were quite simple, are currently on the xfgst-riley branch. We can port them to develop when needed.

sserita added this to the 0.9.14 milestone on Feb 19, 2025