Model does not initialize reliably #7

Open
scott-uses-git opened this issue Nov 11, 2024 · 4 comments

Comments

@scott-uses-git

Hello again,

I've been having trouble getting my model to initialize reliably. The problem arises when the initial f-est is NaN: the model gets completely stuck and won't make any progress.

Initial f-est:           nan, fit: -3.669e-01, tensor norm:  8.726e+02

The only workaround I have found is to restart the model over and over until it finds a good random starting point. Even then, the initial f-est is near the machine limit, but the algorithm is able to quickly converge to a more reasonable parameter space.

Initial f-est: 1.045070e+199, fit: -3.669e-01, tensor norm:  8.726e+02
Epoch   1: f-est =  5.316580e+06, fit =  1.938e-02, step =  1.0e-03, time = 1.24e+01 sec
Epoch   2: f-est =  4.930706e+06, fit =  2.244e-02, step =  1.0e-03, time = 2.50e+01 sec
Epoch   3: f-est =  4.551186e+06, fit =  2.439e-02, step =  1.0e-03, time = 3.80e+01 sec
Epoch   4: f-est =  4.210776e+06, fit =  2.598e-02, step =  1.0e-03, time = 5.22e+01 sec
Epoch   5: f-est =  3.899706e+06, fit =  2.746e-02, step =  1.0e-03, time = 6.45e+01 sec
Epoch   6: f-est =  3.723946e+06, fit =  2.853e-02, step =  1.0e-03, time = 7.74e+01 sec
Epoch   7: f-est =  3.554303e+06, fit =  2.937e-02, step =  1.0e-03, time = 9.13e+01 sec

Can you recommend any setting adjustments I can make to get a better model initialization?

Here is some more info on my data and model settings - unfortunately, my data is proprietary and I am not able to share it.

Sparse tensor: 
  8473 x 92 x 230 (1.79289e+08 total entries)
  761452 (0.4%) Nonzeros and 178527228 (99.6%) Zeros
  8.7e+02 Frobenius norm

Execution environment:
  MPI grid: 1 x 1 x 1 processes (1 total)
  Execution space: serial

GCP-SGD (Generalized CP Tensor Decomposition):
Generalized function type: Poisson (count)
Optimization method: adam
Max iterations (epochs): 100
Iterations per epoch: 500
Traditional annealer, learning rate: 1.0e-03, decay: 1.0e-01
  Function sampler:  stratified with 100000 nonzero and 100000 zero samples
  Gradient sampler:  stratified with 22844 nonzero and 22844 zero samples
  Gradient nonzero samples per epoch: 11422000 (1500.0%)
Gradient method: single MTTKRP

I would greatly appreciate any help or advice!

Thanks,
Scott

@etphipp
Collaborator

etphipp commented Nov 11, 2024

Sorry you are having trouble. I think it stops making progress once there is a NaN because of the way NaN arithmetic works: any operation involving a NaN produces another NaN, so the iterate never recovers. I should probably add a check that errors out if the initial f-est is NaN, and fails the step if f-est becomes NaN later.
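The guard described above could look roughly like the following. This is a hypothetical sketch of a generic SGD-style loop, not GenTen code; `step_fn` and `run_epochs` are illustrative names.

```python
import math

def run_epochs(f_est_initial, step_fn, max_epochs=100):
    """Sketch of the NaN guards described above (hypothetical, not GenTen code).

    - Abort immediately if the initial f-est is NaN.
    - Fail any epoch whose f-est comes back NaN, keeping the previous iterate.
    """
    if math.isnan(f_est_initial):
        raise ValueError("Initial f-est is NaN; aborting instead of looping forever")

    f_est = f_est_initial
    for epoch in range(1, max_epochs + 1):
        f_new = step_fn(f_est)
        if math.isnan(f_new):
            # Fail the step: reject the update rather than propagating the NaN
            print(f"Epoch {epoch}: f-est is NaN, rejecting step")
            continue
        f_est = f_new
    return f_est
```

With such a check, the stuck behavior in the log above would instead surface as an immediate error (or a rejected step), which is much easier to diagnose.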

But I think the bigger question is why you are getting NaNs in the first place, especially for the initial f-est. With Poisson loss, the most likely cause would be a negative model value, but that shouldn't happen with the random initial-guess algorithms GenTen has. It looks like GenTen currently does not have a way to save the initial guess, but I can add that. Are you using the command-line interface or the Python interface?

@etphipp
Collaborator

etphipp commented Dec 15, 2024

@scott-uses-git are you still having trouble with this? I recently added an option to save the initial guess, which would allow you to inspect it when you see NaN for the initial f-est, or reuse an initial guess that appears to be good.

Also, are you sure your tensor does not have negative values in it? That would be a sure-fire way to get NaNs with Poisson loss. To ensure the code is working correctly, you could try running an open example from FROSTT. The LBNL-network tensor is fairly small and runs pretty fast with GCP. Also, I recently pushed a python example for the streaming GCP algorithm that uses the Chicago Crime tensor here that includes calling static GCP for comparison purposes. That example uses Bernoulli loss, but could easily skip the step of setting the nonzero values to 1 and use Poisson loss instead.
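One quick sanity check along the lines suggested above is to scan the tensor's nonzero values for negatives before fitting. This is a hypothetical helper, not part of GenTen; `check_nonnegative` and the example values are illustrative.

```python
import numpy as np

def check_nonnegative(vals):
    """Poisson loss assumes nonnegative count data; return indices of any
    negative nonzero values so they can be inspected (hypothetical helper)."""
    return np.flatnonzero(np.asarray(vals) < 0)

# Example: one invalid entry at position 2
bad = check_nonnegative([3.0, 1.0, -7.0, 2.0])
print(bad)  # -> [2]
```

If the returned index array is empty, negative data values can be ruled out as the source of the NaNs.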

@scott-uses-git
Author

Hello! I'm very sorry for not responding to this issue that I opened. I figured out that there were slices of my input tensor that were extremely sparse, e.g., 10 nonzero entries in a 100 × 8000 slice. I removed these extremely sparse slices and the model was then able to initialize reliably.
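For anyone hitting the same issue, the sparse-slice screening described above can be sketched as follows. This assumes the tensor is available in COO form (`subs` holding nonzero indices); the function name and threshold are illustrative, not from GenTen.

```python
import numpy as np

def sparse_slices(subs, shape, mode, min_nnz=50):
    """Return indices of slices along `mode` with fewer than `min_nnz`
    nonzeros, so they can be dropped before fitting (hypothetical helper)."""
    counts = np.bincount(subs[:, mode], minlength=shape[mode])
    return np.flatnonzero(counts < min_nnz)

# Toy 4 x 3 x 3 COO tensor where slice 3 along mode 0 has a single nonzero
subs = np.array([[0, 0, 0], [0, 1, 2], [1, 2, 1], [1, 0, 0],
                 [2, 1, 1], [2, 2, 2], [3, 0, 1]])
print(sparse_slices(subs, (4, 3, 3), mode=0, min_nnz=2))  # -> [3]
```

Running this for each mode before fitting makes it easy to spot and remove nearly-empty slices like the 100 × 8000 ones mentioned above.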

I apologize if you spent any of your time looking into this :(

Thank you for following up.

Scott

@etphipp
Collaborator

etphipp commented Dec 17, 2024

Hmm, that's interesting, and thanks for the reply. I'll have to think about why an extremely sparse slice would cause a problem, as it isn't obvious to me why that would be problematic.
