Resume partial downloads. #317

Open
Wedge009 opened this issue Dec 17, 2024 · 16 comments
Labels
enhancement New feature or request

Comments

@Wedge009

I see particularly large result uploads (e.g. ~150 MiB) running into 'Failed response: EOF' fairly frequently. Because every failed response restarts the upload from 0%, completing a single result upload sometimes takes several hours, even more than a day, which risks missing the deadline.

I don't see any option to enable debug-level output, so I can't tell whether the issue is at my end or the server's.

I'm not sure if this is feasible or even possible with HTTP POST, but could we have the ability to ask the server what it did manage to receive and continue from there? (Or would that run into even more difficulty with respect to transmission integrity, etc.?)


Perhaps a separate issue, but I have problems uploading on a lone Windows 7 host with three results stuck at 0%. Previously I had trouble with the v8 client even downloading tasks in the first place, so I reverted to the v7 client. But since the v8.4 beta release I have managed to get some work downloaded and have now run into stuck uploads. Not sure if it could be something certificate-related in Windows 7.

@Wedge009
Author

A sample of what I'm currently (2024-12-17 02:37 UTC) experiencing (same work-unit):

20:27:41:I1:WU797:Completed 625000 out of 625000 steps (100%)
20:27:41:I1:WU797:Saving result file ../logfile_01.txt
20:27:41:I1:WU797:Saving result file frame770.gro
20:27:41:I1:WU797:Saving result file frame770.xtc
20:27:42:I1:WU797:Saving result file md.log
20:27:42:I1:WU797:Saving result file science.log
20:27:42:I1:WU797:Saving result file state.cpt
20:27:42:I1:WU797:Folding@home Core Shutdown: FINISHED_UNIT
20:27:42:I1:WU797:Core returned FINISHED_UNIT (100)
20:27:44:I1:WU797:Uploading WU results
20:28:40:I1:WU797:UPLOAD 1% 1.52MiB of 158.92MiB
20:30:13:I1:WU797:UPLOAD 2% 3.21MiB of 158.92MiB
21:26:44:I1:WU797:UPLOAD 40% 63.53MiB of 158.92MiB
21:27:46:E :OUT157:Failed response: EOF
21:27:46:I1:WU797:Uploading WU results
21:27:47:I1:OUT160:> POST https://highland3.seas.upenn.edu/api/results HTTP/1.1
21:29:05:I1:WU797:UPLOAD 1% 1.54MiB of 158.92MiB
21:30:31:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
22:27:17:I1:WU797:UPLOAD 41% 65.12MiB of 158.92MiB
22:27:47:E :OUT160:Failed response: EOF
22:27:47:I1:WU797:Uploading WU results
22:27:49:I1:OUT161:> POST https://vav24.fah.temple.edu/api/results HTTP/1.1
22:27:49:E :OUT161:Failed response: EOF
22:27:49:I1:WU797:Uploading WU results
22:27:51:I1:OUT162:> POST https://ds03.scs.illinois.edu/api/results HTTP/1.1
22:29:03:I1:WU797:UPLOAD 1% 1.51MiB of 158.92MiB
22:30:27:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
23:27:41:I1:WU797:UPLOAD 46% 73.04MiB of 158.92MiB
23:27:51:E :OUT162:Failed response: EOF
23:27:51:I1:WU797:Uploading WU results
23:27:53:I1:OUT166:> POST https://highland4.seas.upenn.edu/api/results HTTP/1.1
23:27:54:E :OUT166:Failed response: EOF
23:27:54:I1:WU797:Uploading WU results
23:27:55:I1:OUT167:> POST https://ds01.scs.illinois.edu/api/results HTTP/1.1
23:29:03:I1:WU797:UPLOAD 1% 1.54MiB of 158.92MiB
23:30:36:I1:WU797:UPLOAD 2% 3.10MiB of 158.92MiB
23:34:19:I1:WU797:UPLOAD 5% 7.92MiB of 158.92MiB
23:34:40:E :OUT167:Failed response: EOF
23:34:40:I1:WU797:Uploading WU results
23:34:41:I1:OUT168:> POST https://fahserver1.flatironinstitute.org/api/results HTTP/1.1
23:35:53:I1:WU797:UPLOAD 1% 1.51MiB of 158.92MiB
23:37:08:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
00:34:36:I1:WU797:UPLOAD 45% 71.48MiB of 158.92MiB
00:34:41:E :OUT168:Failed response: EOF
00:34:41:I1:WU797:Uploading WU results
00:34:42:I1:OUT169:> POST https://highland3.seas.upenn.edu/api/results HTTP/1.1
00:35:36:I1:WU797:UPLOAD 1% 1.51MiB of 158.92MiB
00:36:41:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
01:33:49:I1:WU797:UPLOAD 49% 77.87MiB of 158.92MiB
01:34:46:E :OUT169:Failed response: EOF
01:34:46:I1:WU797:Uploading WU results
01:34:47:I1:OUT170:> POST https://vav24.fah.temple.edu/api/results HTTP/1.1
01:34:48:E :OUT170:Failed response: EOF
01:34:48:I1:WU797:Uploading WU results
01:34:49:I1:OUT171:> POST https://ds03.scs.illinois.edu/api/results HTTP/1.1
01:35:45:I1:WU797:UPLOAD 1% 1.52MiB of 158.92MiB
01:36:37:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
02:34:42:I1:WU797:UPLOAD 74% 117.53MiB of 158.92MiB
02:34:50:E :OUT171:Failed response: EOF
02:34:50:I1:WU797:Uploading WU results
02:34:51:I1:OUT172:> POST https://highland4.seas.upenn.edu/api/results HTTP/1.1
02:34:51:E :OUT172:Failed response: EOF
02:34:51:I1:WU797:Uploading WU results
02:34:53:I1:OUT173:> POST https://ds01.scs.illinois.edu/api/results HTTP/1.1
02:36:01:I1:WU797:UPLOAD 1% 1.55MiB of 158.92MiB
02:37:26:I1:WU797:UPLOAD 2% 3.11MiB of 158.92MiB
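
A minimal sketch for measuring how long each upload attempt in a log like the one above lasted before it hit 'Failed response: EOF'. It assumes the line layout shown in the excerpt (an HH:MM:SS prefix with no date) and treats a negative difference as a roll past midnight. The function name attempt_durations is hypothetical and is not part of the client.

```python
from datetime import datetime, timedelta


def attempt_durations(lines):
    """Return (start, end, seconds) for each upload attempt that ended in EOF.

    An attempt starts at an "Uploading WU results" line and ends at the next
    "Failed response: EOF" line. Timestamps carry no date, so a negative
    difference is assumed to mean the log rolled past midnight.
    """
    attempts = []
    start = None
    for line in lines:
        line = line.strip()
        try:
            stamp = datetime.strptime(line[:8], "%H:%M:%S")
        except ValueError:
            continue  # not a timestamped log line
        if "Uploading WU results" in line:
            start = stamp
        elif "Failed response: EOF" in line and start is not None:
            delta = stamp - start
            if delta < timedelta(0):
                delta += timedelta(days=1)
            attempts.append((start.strftime("%H:%M:%S"),
                             stamp.strftime("%H:%M:%S"),
                             int(delta.total_seconds())))
            start = None
    return attempts


sample = [
    "20:27:44:I1:WU797:Uploading WU results",
    "20:28:40:I1:WU797:UPLOAD 1% 1.52MiB of 158.92MiB",
    "21:27:46:E :OUT157:Failed response: EOF",
    "23:34:40:I1:WU797:Uploading WU results",
    "00:34:41:E :OUT168:Failed response: EOF",
]
print(attempt_durations(sample))
# -> [('20:27:44', '21:27:46', 3602), ('23:34:40', '00:34:41', 3601)]
```

In the excerpt above, several of the attempts that made progress end roughly one hour after they started. That is only an observation from the timestamps, not a confirmed cause, but it may be worth checking whether a time limit exists somewhere on the path.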

@Wedge009
Author

Wedge009 commented Dec 18, 2024

"I have problems uploading on a lone Windows 7 host with three results stuck at 0%."

I think this might be a non-issue related to #130. I was sure that I saw upload progress on Windows previously, but I must have been mistaken. Being stuck on 0% must have been a factor of this issue here, where uploads were constantly restarting from 0%. And when three uploads were in progress simultaneously, no more work was being downloaded.

@kbernhagen
Contributor

Could this be the MTU size problem?
https://foldingforum.org/viewtopic.php?t=42248

@Wedge009
Author

Wedge009 commented Dec 18, 2024

All my hosts running fah are connected via Ethernet, and I verified that the MTU is the standard 1500.

I suspect the problem might be at my end: as the output above shows, the failures were on multiple servers, and I doubt they are all broken. Also, today is much milder than recent days (it was over 40 °C yesterday) and I noticed network transmission seems more stable. (That repeatedly failing upload eventually completed overnight, at 17:34:48 UTC, over 21 hours after task completion.)

So that brings back my original question on whether it's possible to implement some sort of resume functionality for potentially unreliable connections. Spending several hours on a task and then not being able to upload seems rather wasteful. (I've had a few tasks expire or be marked as 'failed'.)

jcoffland changed the title from "Is it possible to implement result upload resuming?" to "Resume partial downloads." on Dec 18, 2024
jcoffland added the enhancement (New feature or request) label on Dec 18, 2024
@jcoffland
Member

This is possible and something I've had in mind for some time now. It will require changes to all of the Work Servers (WS) so they can save and continue partial uploads. Then the client itself must also support this. It would likely result in significant network bandwidth savings but it's going to take some time and there are currently other higher priority items.
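
To illustrate the idea of a server saving a partial upload and a client continuing from it, here is a minimal in-memory sketch. It is not the Folding@home protocol or API: FakeWorkServer, upload_with_resume, the upload ID, and the chunk size are all hypothetical names and values chosen for the example. The final SHA-256 comparison is one possible way to address the transmission-integrity concern raised earlier in the thread.

```python
import hashlib


class FakeWorkServer:
    """Hypothetical stand-in for a server that keeps partial uploads in memory."""

    def __init__(self, fail_after_chunks=None):
        self.partial = {}                      # upload_id -> bytes received so far
        self.fail_after_chunks = fail_after_chunks
        self._chunks_seen = 0

    def received(self, upload_id):
        """Tell the client how many bytes of this upload are already stored."""
        return len(self.partial.get(upload_id, b""))

    def append(self, upload_id, offset, chunk):
        buf = self.partial.setdefault(upload_id, bytearray())
        if offset != len(buf):
            raise ValueError("offset does not match stored length")
        self._chunks_seen += 1
        if self.fail_after_chunks and self._chunks_seen % self.fail_after_chunks == 0:
            raise ConnectionError("EOF")       # simulated drop; chunk is not stored
        buf.extend(chunk)

    def finish(self, upload_id, sha256_hex):
        """Accept the upload only if the whole payload matches the client's hash."""
        data = bytes(self.partial.get(upload_id, b""))
        return hashlib.sha256(data).hexdigest() == sha256_hex


def upload_with_resume(server, upload_id, data, chunk_size=4, max_attempts=10):
    """Upload data in chunks, resuming from the server's offset after a failure.

    Returns the number of attempts that were needed.
    """
    digest = hashlib.sha256(data).hexdigest()
    for attempt in range(1, max_attempts + 1):
        offset = server.received(upload_id)    # ask where to continue from
        try:
            while offset < len(data):
                chunk = data[offset:offset + chunk_size]
                server.append(upload_id, offset, chunk)
                offset += len(chunk)
        except ConnectionError:
            continue                           # retry from the stored offset, not from 0
        if server.finish(upload_id, digest):
            return attempt
        raise ValueError("checksum mismatch after upload")
    raise RuntimeError("upload did not complete")


payload = bytes(range(20))
flaky = FakeWorkServer(fail_after_chunks=3)    # every third chunk is dropped
print(upload_with_resume(flaky, "wu797", payload))
# -> 3
```

With a restart-from-zero client, the same simulated failure pattern would never get past the second chunk; resuming from the stored offset lets each attempt add to what the server already holds.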

@Wedge009
Author

No worries, just good to know it's possible and part of the plan. Thanks.

@Wedge009
Author

Wedge009 commented Feb 5, 2025

I can report in a separate issue if it's suitable, but I wonder if this is related to result uploads. If a host temporarily loses connection for whatever reason, I find that even a successful upload later on seems to result in a failure return from the server.

Image
('~50 days ago' relates to the server upload problems I mentioned above in December 2024 - I really dislike relative time-stamps.)

And somewhat related to that (again, I can report separately if required), I find work-units are sometimes dumped too easily. Even after something as innocuous as a clean reboot, the client can complain on restart that it cannot find any results, so it dumps the task. Quite frustrating when progress was close to complete.

Image

These aren't huge problems, but I find the wasted energy on dumped/failed/rejected/etc work quite disappointing - and I didn't find this to be an issue with the v7 client.

@muziqaz
Contributor

muziqaz commented Feb 5, 2025

The Windows client has an issue with Windows stabbing it in the back every time the system is rebooted. So until the fix is sorted, we advise Windows users to pause their work before rebooting.
Linux has no issues there.

@Wedge009
Author

Wedge009 commented Feb 5, 2025

I did notice it was usually on Windows... a shame it's something that has to be done manually - sometimes a reboot might not come at a good time, particularly if the user is on a recent Windows that reboots automatically all the time.

@muziqaz
Contributor

muziqaz commented Feb 5, 2025

Unfortunately the issue is quite difficult to fix, since it has more to do with Windows than with FAHClient.

@Wedge009
Author

Wedge009 commented Feb 5, 2025

I feel this is getting off-topic for the actual issue about uploads, but what do you mean about 'more (to do with) Windows'? Uncontrolled rebooting? Yeah, that's a Windows problem. But the Windows fah-client not resuming properly is another matter: as I mentioned above, that didn't seem to be a problem with the v7 client. I don't know what the architectural changes are between v7 and v8, but they must be pretty significant for this to be accepted as standard for Windows hosts.

@muziqaz
Contributor

muziqaz commented Feb 5, 2025

It is not accepted as standard. The fix is just a bit elusive.

@Wedge009
Author

Wedge009 commented Feb 8, 2025

I made the mistake of rebooting during an upload and forgetting to pause the current task... 🤦

Image

@muziqaz
Contributor

muziqaz commented Feb 8, 2025

Pausing and then rebooting while uploading the WU would not have saved it.
I know the bug is FAH's fault, but at some point a bit of care and responsibility is needed when participating in projects like FAH. :)

@Wedge009
Author

Wedge009 commented Feb 8, 2025

The task with the interrupted upload would still be marked as failed, but the pause would have at least saved the other task. Fortunately progress was only 3.6%.

Most of my hosts are Linux-based, but I was in such a hurry to reboot the Windows host in this particular instance that by the time I remembered to check what FaH was doing it was too late. As I said, this was never an issue with the v7 client, so at the very least this sort of behaviour should be considered a regression.

@muziqaz
Contributor

muziqaz commented Feb 8, 2025

Make no mistake, this is considered a bug, and a very serious one. It is in no way considered normal or acceptable behaviour for any software.
