customsolve.py: incorrect number of arguments to _find_pivot. #542

Open
rileyjmurray opened this issue Feb 18, 2025 · 2 comments

Labels: bug A bug or regression

@rileyjmurray (Contributor)

Here's a code snippet in pygsti/optimize/customsolve.py on the head of master.

smbuf1b = _np.zeros(1, _np.int64)
smbuf2 = _np.zeros(2, _np.int64)
smbuf3 = _np.zeros(3, _np.int64)
for icol in range(a.shape[1]):
    # Step 1: find the index of the row that is the best pivot.
    # each proc looks for its best pivot (Note: it should not consider rows already pivoted on)
    potential_pivot_indices = all_row_indices[potential_pivot_mask]
    ibest_global, ibest_local, h, k = _find_pivot(a, b, icol, potential_pivot_indices, my_row_slice,
                                                  shared_floats, shared_ints, resource_alloc, comm, host_comm,
                                                  smbuf1, smbuf2, smbuf3, host_index_buf, host_val_buf)

The smbuf1b variable is never used. It looks like it should be passed to _find_pivot immediately after smbuf1, per the signature below (the corresponding parameter names in the signature are just buf1 and buf1b):

def _find_pivot(a, b, icol, potential_pivot_inds, my_row_slice, shared_floats, shared_ints,
                resource_alloc, comm, host_comm, buf1, buf1b, buf2, buf3, best_host_indices, best_host_vals):

As a consequence of smbuf1b not being passed, we're calling _find_pivot with only 15 positional arguments when its signature requires 16, so this code path fails with a TypeError.
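
For reference, a minimal sketch of what the corrected call would presumably look like, assuming the only change needed is to insert smbuf1b after smbuf1 (all other argument names are taken verbatim from the snippet above):

    ibest_global, ibest_local, h, k = _find_pivot(a, b, icol, potential_pivot_indices, my_row_slice,
                                                  shared_floats, shared_ints, resource_alloc, comm, host_comm,
                                                  smbuf1, smbuf1b, smbuf2, smbuf3, host_index_buf, host_val_buf)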

This bug rarely surfaces because there's a dedicated code path (below) for handling linear solves with matrices of order < 10,000, and that path gets triggered in our testing and even in most large GST runs:

if comm.size < proc_threshold and a.shape[1] < 10000:
    # We're not exactly sure where scipy is better, but until we speed up / change gaussian-elim
    # alg the scipy alg is much faster for small numbers of procs and so should be used unless
    # A is too large to be gathered to the root proc.
    global_a, a_shm = ari.gather_jtj(a, return_shared=True)
    global_b, b_shm = ari.gather_jtf(b, return_shared=True)
    #global_a = ari.gather_jtj(a)
    #global_b = ari.gather_jtf(b)
    if comm.rank == 0:
        try:
            global_x = _scipy.linalg.solve(global_a, global_b, assume_a='pos')
            ok_buf[0] = 1  # ok
        except _scipy.linalg.LinAlgError as e:
            ok_buf[0] = 0  # failure!
            err = e
    else:
        global_x = None
        err = _scipy.linalg.LinAlgError("Linear solver fail on root proc!")  # just in case...
    comm.Bcast(ok_buf, root=0)
    if ok_buf[0] == 0:
        _smt.cleanup_shared_ndarray(a_shm)
        _smt.cleanup_shared_ndarray(b_shm)
        raise err  # all procs must raise in sync
    ari.scatter_x(global_x, x)
    _smt.cleanup_shared_ndarray(a_shm)
    _smt.cleanup_shared_ndarray(b_shm)
    return

Possible ways to move forward:

  1. Temporary. Just change the dimension threshold from 10,000 to 100,000. Storing one double-precision matrix of order 100,000 requires ~80 GB (100,000² entries × 8 bytes each), and the overhead of the Python call means this will require 160 GB of "free" RAM to run. That's a reasonable amount for modern compute servers.
  2. Try to fix the bug in the custom Gaussian elimination code. Have the branch on matrix size check a user-modifiable constant instead of a hard-coded value, so that some of our tests can exercise this code path with a small matrix-size threshold (a rough sketch of this idea follows the list).
  3. Ambitious. Create a new ArrayInterface class that relies on Dask arrays. Replace the custom Gaussian elimination code with calls to Dask's distributed linear algebra functions: https://docs.dask.org/en/stable/_modules/dask/array/linalg.html.
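
As a rough illustration of option 2's threshold change, here is a minimal sketch; the constant and helper names are hypothetical, not pyGSTi's actual code:

    # Hypothetical module-level constant replacing the hard-coded 10,000 literal.
    _SCIPY_SOLVE_SIZE_THRESHOLD = 10000

    def _use_gathered_scipy_solve(num_procs, num_cols, proc_threshold):
        # Decide whether to gather A to the root proc and call scipy.linalg.solve,
        # or fall through to the distributed Gaussian-elimination code.
        return num_procs < proc_threshold and num_cols < _SCIPY_SOLVE_SIZE_THRESHOLD

A test that wants to exercise the custom Gaussian-elimination path could then temporarily lower _SCIPY_SOLVE_SIZE_THRESHOLD (e.g. by monkey-patching the module) instead of constructing a 10,000-column problem.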
rileyjmurray added the bug label on Feb 18, 2025
@rileyjmurray (Contributor, Author)

It looks like the bug was introduced here:
0573832

That commit leaves the call to _find_pivot unchanged:
0573832#diff-b5f5b8b216bb22ad8b9236aee7c88e7f141562fb4a7584e15e5a885f26361cbdR166-R168
but it modifies _find_pivot to take another argument:
0573832#diff-b5f5b8b216bb22ad8b9236aee7c88e7f141562fb4a7584e15e5a885f26361cbdR213-R214

@rileyjmurray (Contributor, Author)

I've gone ahead with the second of my possible ways to move forward:

Try to fix the bug in the custom Gaussian elimination code. Have the branch on matrix size check a user-modifiable constant instead of a hard-coded value, so that some of our tests can exercise this code path with a small matrix-size threshold.

The changes, which were quite simple, are currently on the xfgst-riley branch. We can port them to develop when needed.

sserita added this to the 0.9.14 milestone on Feb 19, 2025