Model does not initialize reliably #7

Open
scott-uses-git opened this issue Nov 11, 2024 · 4 comments

Comments

@scott-uses-git

Hello again,

I've been having trouble getting my model to initialize reliably. The problem arises when the initial f-est is NaN: the model gets completely stuck and won't make any progress.

Initial f-est:           nan, fit: -3.669e-01, tensor norm:  8.726e+02

The only workaround I have found is to restart the model over and over until it finds a good random starting point. Even then, the initial f-est is near the machine limit, but the algorithm is able to quickly converge to a more reasonable parameter space.

Initial f-est: 1.045070e+199, fit: -3.669e-01, tensor norm:  8.726e+02
Epoch   1: f-est =  5.316580e+06, fit =  1.938e-02, step =  1.0e-03, time = 1.24e+01 sec
Epoch   2: f-est =  4.930706e+06, fit =  2.244e-02, step =  1.0e-03, time = 2.50e+01 sec
Epoch   3: f-est =  4.551186e+06, fit =  2.439e-02, step =  1.0e-03, time = 3.80e+01 sec
Epoch   4: f-est =  4.210776e+06, fit =  2.598e-02, step =  1.0e-03, time = 5.22e+01 sec
Epoch   5: f-est =  3.899706e+06, fit =  2.746e-02, step =  1.0e-03, time = 6.45e+01 sec
Epoch   6: f-est =  3.723946e+06, fit =  2.853e-02, step =  1.0e-03, time = 7.74e+01 sec
Epoch   7: f-est =  3.554303e+06, fit =  2.937e-02, step =  1.0e-03, time = 9.13e+01 sec

Can you recommend any setting adjustments I can make to get a better model initialization?

Here is some more info on my data and model settings - unfortunately, my data is proprietary and I am not able to share it.

Sparse tensor: 
  8473 x 92 x 230 (1.79289e+08 total entries)
  761452 (0.4%) Nonzeros and 178527228 (99.6%) Zeros
  8.7e+02 Frobenius norm

Execution environment:
  MPI grid: 1 x 1 x 1 processes (1 total)
  Execution space: serial

GCP-SGD (Generalized CP Tensor Decomposition):
Generalized function type: Poisson (count)
Optimization method: adam
Max iterations (epochs): 100
Iterations per epoch: 500
Traditional annealer, learning rate: 1.0e-03, decay: 1.0e-01
  Function sampler:  stratified with 100000 nonzero and 100000 zero samples
  Gradient sampler:  stratified with 22844 nonzero and 22844 zero samples
  Gradient nonzero samples per epoch: 11422000 (1500.0%)
Gradient method: single MTTKRP

I would greatly appreciate any help or advice!

Thanks,
Scott

@etphipp
Collaborator

etphipp commented Nov 11, 2024

Sorry you are having trouble. I think it stops making progress once there is a NaN because of the way NaN arithmetic works: any operation involving a NaN produces another NaN, so the iterate never recovers. I should probably add a check that errors out if the initial f-est is NaN, and fails the step if f-est becomes NaN later.
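The guard described above could look roughly like the following. This is a hypothetical sketch of a generic SGD-style loop, not GenTen code; `step_fn` and `run_epochs` are illustrative names.

```python
import math

def run_epochs(f_est_initial, step_fn, max_epochs=100):
    """Sketch of the NaN guards described above (hypothetical, not GenTen code).

    - Abort immediately if the initial f-est is NaN.
    - Fail any epoch whose f-est comes back NaN, keeping the previous iterate.
    """
    if math.isnan(f_est_initial):
        raise ValueError("Initial f-est is NaN; aborting instead of looping forever")

    f_est = f_est_initial
    for epoch in range(1, max_epochs + 1):
        f_new = step_fn(f_est)
        if math.isnan(f_new):
            # Fail the step: reject the update rather than propagating the NaN
            print(f"Epoch {epoch}: f-est is NaN, rejecting step")
            continue
        f_est = f_new
    return f_est
```

With such a check, the stuck behavior in the log above would instead surface as an immediate error (or a rejected step), which is much easier to diagnose.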

But I think the bigger question is why you are getting NaNs in the first place, especially for the initial f-est. With Poisson loss, the most likely cause would be a negative model value, but that shouldn't happen with the random initial-guess algorithms GenTen has. It looks like GenTen currently does not have a way to save the initial guess, but I can add that. Are you using the command-line interface or the Python interface?

@etphipp
Collaborator

etphipp commented Dec 15, 2024

@scott-uses-git are you still having trouble with this? I recently added an option to save the initial guess, which would allow you to inspect it when you see NaN for the initial f-est, or reuse an initial guess that appears to be good.

Also, are you sure your tensor does not have negative values in it? That would be a sure-fire way to get NaNs with Poisson loss. To ensure the code is working correctly, you could try running an open example from FROSTT. The LBNL-network tensor is fairly small and runs pretty fast with GCP. Also, I recently pushed a python example for the streaming GCP algorithm that uses the Chicago Crime tensor here that includes calling static GCP for comparison purposes. That example uses Bernoulli loss, but could easily skip the step of setting the nonzero values to 1 and use Poisson loss instead.
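One quick sanity check along the lines suggested above is to scan the tensor's nonzero values for negatives before fitting. This is a hypothetical helper, not part of GenTen; `check_nonnegative` and the example values are illustrative.

```python
import numpy as np

def check_nonnegative(vals):
    """Poisson loss assumes nonnegative count data; return indices of any
    negative nonzero values so they can be inspected (hypothetical helper)."""
    return np.flatnonzero(np.asarray(vals) < 0)

# Example: one invalid entry at position 2
bad = check_nonnegative([3.0, 1.0, -7.0, 2.0])
print(bad)  # -> [2]
```

If the returned index array is empty, negative data values can be ruled out as the source of the NaNs.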

@scott-uses-git
Author

Hello! I'm very sorry for not responding to this issue that I opened. I figured out that there were slices of my input tensor that were extremely sparse, e.g., 10 nonzero entries in a 100 × 8000 slice. I removed these extremely sparse slices and the model was then able to initialize reliably.
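For anyone hitting the same issue, the sparse-slice screening described above can be sketched as follows. This assumes the tensor is available in COO form (`subs` holding nonzero indices); the function name and threshold are illustrative, not from GenTen.

```python
import numpy as np

def sparse_slices(subs, shape, mode, min_nnz=50):
    """Return indices of slices along `mode` with fewer than `min_nnz`
    nonzeros, so they can be dropped before fitting (hypothetical helper)."""
    counts = np.bincount(subs[:, mode], minlength=shape[mode])
    return np.flatnonzero(counts < min_nnz)

# Toy 4 x 3 x 3 COO tensor where slice 3 along mode 0 has a single nonzero
subs = np.array([[0, 0, 0], [0, 1, 2], [1, 2, 1], [1, 0, 0],
                 [2, 1, 1], [2, 2, 2], [3, 0, 1]])
print(sparse_slices(subs, (4, 3, 3), mode=0, min_nnz=2))  # -> [3]
```

Running this for each mode before fitting makes it easy to spot and remove nearly-empty slices like the 100 × 8000 ones mentioned above.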

I apologize if you spent any of your time looking into this :(

Thank you for following up.

Scott

@etphipp
Collaborator

etphipp commented Dec 17, 2024

Hmm, that's interesting, and thanks for the reply. I'll have to think about why an extremely sparse slice would cause a problem, as it isn't obvious to me why that would be problematic.
