Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

limit amount of slot queue workers to 1 #988

Draft
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

markspanbroek
Copy link
Member

reason: proof generation does not happen in
separate threads yet, so to avoid taking too
long to provide initial storage proofs we only
pick up one slot at a time

@markspanbroek markspanbroek marked this pull request as ready for review November 7, 2024 13:16
Copy link
Contributor

@emizzle emizzle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mark!

collateral=200.u256,
nodes = 5,
tolerance = 2).get
nodes = 3,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, is the motivation of limiting the number of nodes to hasten the completion of this test? In the past, having 5 nodes has surfaced a few issues. In saying that, I will eventually finish this PR which will have 5 or 6 nodes, so at least we'll have one test (in the future with 5/6 nodes): #976

Copy link
Member Author

@markspanbroek markspanbroek Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it now takes about 4 minutes to fill 3 slots. Something still is off that this takes so long (most likely the downloading)

Copy link
Contributor

@emizzle emizzle Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, even on the testnet I think downloading is quick... locally as an example, download takes 410ms in one of the integration tests (8 blocks)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proof generation might take some time. Also, I noticed on the testnet that we're doing local verification of the proofs, which is not needed and takes time on each node.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. I'm running this test with some more logging on my machine and it spends most of its time in the SaleInitialProving state.

@emizzle
Copy link
Contributor

emizzle commented Nov 7, 2024

Another option to speed up tests: pass number of workers as a cli param (3 for tests, 1 for testnet). More work of course...

@markspanbroek
Copy link
Member Author

Another option to speed up tests: pass number of workers as a cli param. More work of course...

This is just a temporary fix. We should increase the number of workers as soon as we have threading for creating proofs.

@@ -16,7 +16,7 @@ template ethersuite*(name, body) =
var snapshot: JsonNode

setup:
ethProvider = JsonRpcProvider.new("ws://localhost:8545")
ethProvider = JsonRpcProvider.new("http://localhost:8545")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right change, but we also might run into the poll future failing... (something to be fixed in ethers)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the contract tests started failing with this change. Strange that works in the integration tests, but not in the contract tests. I'll revert the change on this line for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to know if the reason they are failing is due to the poll failing issue though. In saying that, I'd like to get a fix in for the poll() fail. Trying to work on it now, but unsure if I'll get something in before my flight leaves in a few hours.

@markspanbroek
Copy link
Member Author

I can't seem to get the integration tests to run stably. I fixed all the obvious cases where things could go wrong, and still they fail in unexpected places (while submitting a request, while running unit tests). I guess we really need to fix the issue with poll in nim-ethers before we can continue with this PR ☹️

@markspanbroek markspanbroek marked this pull request as draft November 8, 2024 12:53
@AuHau
Copy link
Member

AuHau commented Nov 9, 2024

I guess we really need to fix the issue with poll in nim-ethers before we can continue with this PR ☹️

Hmm now I am not sure which one exactly? The one @emizzle found related to Future being cancelled?

@markspanbroek
Copy link
Member Author

Hmm now I am not sure which one exactly? The one @emizzle found related to Future being cancelled?

Yes, I think so. It seems that subscription handlers not being called is causing the unpredictable test failures. When using only websockets the tests actually work quite reliably up to the point that a test takes longer than 5 minutes. When using http and polling the tests fail at random points. My suspicion is that this is caused by what Eric noticed; that an exception is being raised in the poll function, that silently stops the polling.

@AuHau
Copy link
Member

AuHau commented Nov 9, 2024

I see, there is the #985 which might show the error. It could be worth rebase this PR on top of that to test your hypothesis.
It is unfortunately not finished, but it should be working, there are just some polishing needed in order to merge it. I will try to finish those tomorrow, I won't make it today unfortunately.

@emizzle
Copy link
Contributor

emizzle commented Nov 9, 2024

So I've got somewhere on a solution for this, but it's not complete. The first part of the solution is to use the then util to attach callbacks to poll, allowing us to catch exceptions and then propagate them to the synchronous closure.

The second part, which I'm not sure about yet, is understanding whether or not we need to reconnect the RpcHttpClient when this exception is raised (this particular exception only or all exceptions)?

The third part is trying to understand if we need this reconnection logic from part 2 in any other areas of ethers (including websockets).

@emizzle
Copy link
Contributor

emizzle commented Nov 9, 2024

@markspanbroek here's my work on keeping the polling loop alive: https://github.com/codex-storage/nim-ethers/tree/fix/keep-polling-alive.

NOTES

  1. We cannot consume futures like we were without asyncSpawning or awaiting them. This is where the .then/.catch comes in: it's effectively an asyncSpawn but does not raise a Defect and instead exposes the exceptions.
  2. We only restart the polling loop when an RpcPostError is encountered
  3. We raise a Defect on other errors, but i'm not so sure this is the right approach. We might need to handle all possible errors differently.
  4. There may be other situations in ethers where this happens?
  5. The then.nim addition was taken from codex and was simplified a bit. It needs to be pulled into its own lib and used in both codex and ethers. The version in this branch is the most recent and will be used for the lib.

@markspanbroek
Copy link
Member Author

here's my work on keeping the polling loop alive

Thanks @emizzle, I think I've now got a version of http polling that works based on these ideas. I also found 4 other bugs that made the tests unstable.

I'll try to get PRs for all of this in tomorrow.

@markspanbroek
Copy link
Member Author

On Linux and MacOS the tests run stable now. The Windows tests are running so slow that they are even failing on the fairly simple contract tests that do not involve proofs.

I'm installing a Windows VM to see if I can reproduce this on my machine.

To work around this issue when subscriptions are
inactive for more than 5 minutes:
NomicFoundation/hardhat#2053

Use 100 millisecond polling; default polling interval
of 4 seconds is too close to the 5 second timeout for
`check eventually`.
confirm(0) doesn't wait at all, confirm(1) waits
for the transaction to be mined
includes fixes for http polling and .confirm()
allow for a bit more time to withdraw funds
to ensure that the transaction has been processed
before continuing with the test
there were two logic errors in this test:
- a slot is freed anyway at the end of the contract
- when starting the request takes a long time, the
  first slot can already be freed because there were
  too many missing proofs
currentTime() doesn't always correctly reflect
the time of the next transaction
otherwise the windows runner in the CI won't
be able to start the request before it expires
allow for a bit more time for a request to
be submitted
windows ci is so slow, it can take up to 40 seconds
just to submit a storage request to hardhat
@markspanbroek markspanbroek force-pushed the limit-slotqueue-workers branch from bf183f9 to eb1e5e2 Compare November 14, 2024 14:10
@markspanbroek markspanbroek changed the base branch from master to fix-concurrency-issues November 14, 2024 14:11
@markspanbroek
Copy link
Member Author

markspanbroek commented Nov 14, 2024

Moved most of the fixes into #993, and rebased this PR on top of it.

I'm not sure whether it still makes sense to limit the amount of slot queue workers now that I've done some measurements on proof generation. It doesn't take that much time (~3 seconds).

reason: with the increased period length of 90 seconds, it
can take longer to wait for a stable challenge at the
beginning of a period.
@markspanbroek markspanbroek force-pushed the limit-slotqueue-workers branch from eb1e5e2 to 15991d8 Compare November 14, 2024 17:31
apparently it takes windows 2-3 seconds to
resolve "localhost" to 127.0.0.1 for every
json-rpc connection that we make 🤦
reason: proof generation does not happen in
separate threads yet, so to avoid taking too
long to provide initial storage proofs we only
pick up one slot at a time
@markspanbroek markspanbroek force-pushed the limit-slotqueue-workers branch from 15991d8 to 2007867 Compare November 14, 2024 18:49
Base automatically changed from fix-concurrency-issues to master November 25, 2024 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants