Fix Bug #131: 0 rows returned when server returns 429 on first page of results. #132
base: master
Conversation
```diff
@@ -155,7 +149,7 @@ def __init__(self, client, options, fetch_function):

     def _fetch_next_block(self):
         while super(self.__class__, self)._has_more_pages() and len(self._buffer) == 0:
-            return self._fetch_items_helper_with_retries(self._fetch_function)
+            return self._fetch_items_helper_no_retries(self._fetch_function)
```
Strategy: this is the outermost level of retries, which is being removed.
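For context, a hedged sketch of roughly what that outer layer does, based on the description later in this thread; the exact arguments passed to `retry_utility._Execute` are an assumption, not taken from this diff:

```python
# Assumed shape (not verbatim pydocumentdb code): _fetch_items_helper_with_retries just
# wraps the plain helper in the retry utility, so retries happen here *and* again further
# down inside synchronized_request -- the nesting this PR removes.
def _fetch_items_helper_with_retries(self, fetch_function):
    def callback():
        return self._fetch_items_helper_no_retries(fetch_function)
    # Hypothetical argument list; shown only to illustrate the extra retry layer.
    return retry_utility._Execute(self._client, self._client._global_endpoint_manager, callback)
```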
pydocumentdb/retry_options.py
```diff
@@ -32,7 +32,7 @@ class RetryOptions(object):
     :ivar int MaxWaitTimeInSeconds:
         Max wait time in seconds to wait for a request while the retries are happening. Default value 30 seconds.
     """
-    def __init__(self, max_retry_attempt_count = 9, fixed_retry_interval_in_milliseconds = None, max_wait_time_in_seconds = 30):
+    def __init__(self, max_retry_attempt_count = 17, fixed_retry_interval_in_milliseconds = None, max_wait_time_in_seconds = 60):
```
Increasing the default number of retries, and time allowed, to reflect the fact that users may have inadvertently been relying on large numbers of retries.
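For reference, a sketch of how a caller can set the retry budget explicitly instead of relying on these defaults, assuming the usual pydocumentdb `ConnectionPolicy` wiring; the host and key values are placeholders:

```python
from pydocumentdb import document_client, documents
from pydocumentdb.retry_options import RetryOptions

# Placeholder endpoint and key -- substitute real account values.
HOST = 'https://myaccount.documents.azure.com:443/'
MASTER_KEY = '<account master key>'

connection_policy = documents.ConnectionPolicy()
# Explicitly choose the retry budget instead of relying on the library defaults
# (which this PR raises from 9 attempts / 30 s to 17 attempts / 60 s).
connection_policy.RetryOptions = RetryOptions(
    max_retry_attempt_count=17,
    fixed_retry_interval_in_milliseconds=None,
    max_wait_time_in_seconds=60)

client = document_client.DocumentClient(
    HOST, {'masterKey': MASTER_KEY}, connection_policy)
```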
```diff
@@ -221,7 +221,7 @@ def test_default_retry_policy_for_query(self):
         result_docs = list(docs)
         self.assertEqual(result_docs[0]['id'], 'doc1')
         self.assertEqual(result_docs[1]['id'], 'doc2')
-        self.assertEqual(self.counter, 12)
+        self.assertEqual(self.counter, 6)
```
This test counts how many retries were done. Before the fix, 12 HTTP requests were generated; now 6 are. This is by design, so the test is updated to match.
```python
        except errors.HTTPFailure as ex:
            self.assertEqual(ex.status_code, StatusCodes.TOO_MANY_REQUESTS)

        client._DocumentClient__Post = original_post_function
```
Mocking out the server returning a 429 error by overriding the client's __Post method. It needs to be done this low down because mocking at the _Execute level (as in the rest of this file) misses the lower level of retries.
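A rough sketch of that mocking approach (assumed shape, not the PR's exact test code); `client`, `collection_link`, and the query text are placeholders:

```python
from pydocumentdb import errors
from pydocumentdb.http_constants import StatusCodes

def _post_returning_429(*args, **kwargs):
    # Simulate the server throttling every request at the lowest HTTP layer.
    raise errors.HTTPFailure(StatusCodes.TOO_MANY_REQUESTS, "Request rate is large")

original_post_function = client._DocumentClient__Post   # name-mangled private method
client._DocumentClient__Post = _post_returning_429
try:
    # Force the query to execute; it should surface the 429 rather than return 0 rows.
    list(client.QueryDocuments(collection_link, {'query': 'SELECT * FROM root r'}))
except errors.HTTPFailure as ex:
    assert ex.status_code == StatusCodes.TOO_MANY_REQUESTS
finally:
    client._DocumentClient__Post = original_post_function
```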
Manually started CI on this PR: https://cosmos-db-sdk-public.visualstudio.com/cosmos-db-sdk-public/_build/results?buildId=1282
Failed with this error:
Any updates?
For background, read bug #131.

This happens because `base_execution_context._fetch_items_helper_with_retries()` runs the requests wrapped in a `retry_utility._Execute()` call, which does retries. However, much lower down the call stack more retries will be done, because `synchronized_request.SynchronizedRequest()` also uses `retry_utility._Execute()` to do retries.

These two nested retry policies cause the following problem: when the innermost policy decides there have been too many failures, it re-raises the `HTTPFailure` exception (`retry_utility.py:84`). However, the outermost retry policy catches that (`base_execution_context.py:132`) and attempts another round of innermost retries. The first of these sees misleading request-processing state: it has received 0 rows of results and does not have a continuation token. In this case, `self._continuation` is `None` because this is the first page of the query and the server has not returned a continuation token, and `self._has_started` is `True` because of the first round of retries. So the while loop never executes, and it returns the empty array of `fetched_items`. The client therefore (incorrectly) decides the query is complete and returns 0 rows. Bug.
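For illustration, a paraphrased sketch of the state check described above; this is not the library's verbatim code, and the option/header names are assumptions based on the description:

```python
# Paraphrased sketch (assumed, not verbatim pydocumentdb code) of the helper described
# above. On the retried first page, self._continuation is None and self._has_started is
# already True, so the while condition is False, the body never runs, and the caller
# gets back an empty list -- which it misreads as "query complete".
def _fetch_items_helper_no_retries(self, fetch_function):
    fetched_items = []
    while self._continuation or not self._has_started:
        if not self._has_started:
            self._has_started = True
        self._options['continuation'] = self._continuation
        (fetched_items, response_headers) = fetch_function(self._options)
        # Continuation header name as used by DocumentDB ('x-ms-continuation').
        self._continuation = response_headers.get('x-ms-continuation')
        if fetched_items:
            break
    return fetched_items
```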
This pull request removes that outermost layer of retries, increases the default retry limits to compensate, and adds a test covering the 429-on-first-page case.

It is safe to remove the outermost layer of retries because the library code seems to have evolved as follows. Initially, only the outermost layer of retries existed. Then support for queries failing over to another geo-location was added by putting retries down at the bottom level (in `synchronized_request`; see commit 9d1bac3). But at that time the uppermost layer of retries was not removed, so the code now does nested retries. Only the innermost layer is actually needed.

Also, notice that nested retries are another bug in themselves. When user code specifies retry options, they apply to both the inner and the outer retry policies and get compounded together. We have tests which cover this (`test/retry_policy_tests.py`), but they mock out the retry utility's `_Execute` function and so bypass the nesting, which makes the tests pass even though, when it runs for real, it won't do what the user expects. E.g. a retry limit of 8 calls will turn into 8 * 8 = 64! Users may have inadvertently been relying on this very large level of retries, which is why the default retry level has been increased: not to 64, but doubled (8 to 16).
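As a minimal, self-contained illustration (toy code, not the library's) of how two nested retry policies compound a user's retry limit:

```python
# Toy demonstration of compounded retries: an outer retry policy wrapping an inner one
# turns a limit of 8 attempts into 8 * 8 = 64 calls against the server.
calls = 0

def flaky_request():
    # Stand-in for an HTTP call that always fails with a throttling error.
    global calls
    calls += 1
    raise RuntimeError("429: request rate is large")

def with_retries(fn, attempts):
    # Simplistic retry policy: try up to `attempts` times, then give up.
    for _ in range(attempts):
        try:
            return fn()
        except RuntimeError:
            pass
    raise RuntimeError("retries exhausted")

try:
    # Nesting: the outer policy retries a callable that already retries internally.
    with_retries(lambda: with_retries(flaky_request, 8), 8)
except RuntimeError:
    pass

print(calls)  # prints 64
```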
Testing: all existing tests pass, and the newly added test passes with the fix, and fails without the fix. Additionally, we've been running this against a live CosmosDB server and it's working nicely for us.