This repository has been archived by the owner on Jan 19, 2024. It is now read-only.

Fix Bug #131: 0 rows returned when server returns 429 on first page of results. #132

Open
wants to merge 4 commits into master

Conversation

@austindonnelly (Contributor) commented Sep 5, 2018

For background, see bug #131.

This happens because base_execution_context._fetch_items_helper_with_retries() runs the requests wrapped in a retry_utility._Execute() call, which does retries. However, further down the call stack more retries are performed, because synchronized_request.SynchronizedRequest() also uses retry_utility._Execute() to do retries.

These two nested retry policies cause the following problem: when the innermost policy decides there have been too many failures, it re-raises the HTTPFailure exception (retry_utility.py:84). However, the outermost retry policy catches it (base_execution_context.py:132) and attempts another round of innermost retries. The first of these sees misleading request-processing state: it has received 0 rows of results and has no continuation token:

base_execution_context.py:116
fetched_items = []
while self._continuation or not self._has_started:
    # do the work
return fetched_items

In this case, self._continuation is None because this is the first page of the query and the server has not returned a continuation token. And self._has_started is True because of the first round of retries. So the while loop never executes, and it returns the empty array of fetched_items.

So the client incorrectly decides the query is complete and returns 0 rows. Bug.
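The failure mode above can be reproduced in miniature. This is a sketch, not library code: the attribute names (_continuation, _has_started) mirror those quoted from base_execution_context.py, but the class and the exception are simplified stand-ins.

```python
# Minimal sketch of the state bug: after the inner retry layer exhausts its
# retries, _has_started is True and _continuation is still None, so a second
# fetch attempt returns an empty result instead of an error.
class FetchState:
    def __init__(self):
        self._continuation = None   # first page: no token from the server yet
        self._has_started = False

    def _fetch_items(self):
        fetched_items = []
        while self._continuation or not self._has_started:
            self._has_started = True        # first attempt flips this
            raise RuntimeError("HTTP 429")  # server keeps throttling
        return fetched_items

state = FetchState()

# Inner retry layer: exhausts its retries and re-raises.
try:
    state._fetch_items()
except RuntimeError:
    pass

# Outer retry layer catches that and tries again. Now the while loop
# condition is False, so the call silently returns no rows:
print(state._fetch_items())  # prints [] -- mistaken for "query complete"
```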

This pull request:

  • adds a new test which repros this bug.
  • fixes the bug, by removing the outermost layer of retries.
  • increases the default level of retries, now that two retry policies are no longer nested.
  • fixes a test that counts the number of requests issued, to reflect that the two retry policies are no longer nested.

It is safe to remove the outermost layer of retries because the library code appears to have evolved as follows. Initially, only the outermost layer of retries existed. Then support for failing queries over to another geo-location was added by putting retries down at the bottom level (in synchronized_request; see commit 9d1bac3). The outermost layer of retries was not removed at that time, so the code now does nested retries. Only the innermost layer is actually needed.

Also, notice that the nesting itself is another bug. It means that when user code specifies retry options, they apply to both the inner and the outer retry policies and get compounded together. We have tests covering this (test/retry_policy_tests.py), but they mock out the retry utility's _Execute function and so bypass the nesting, which makes the tests pass even though the real code does not do what the user expects. For example, a retry limit of 8 calls turns into 8 * 8 = 64!

Users may have inadvertently been relying on this very large number of retries, which is why the default retry level has been increased: not to 64, but doubled (8 to 16).

Testing: all existing tests pass, and the newly added test passes with the fix and fails without it. Additionally, we've been running this against a live Cosmos DB server and it's working well for us.

@@ -155,7 +149,7 @@ def __init__(self, client, options, fetch_function):

 def _fetch_next_block(self):
     while super(self.__class__, self)._has_more_pages() and len(self._buffer) == 0:
-        return self._fetch_items_helper_with_retries(self._fetch_function)
+        return self._fetch_items_helper_no_retries(self._fetch_function)

Strategy: this is the outermost level of retries, which is being removed.

@@ -32,7 +32,7 @@ class RetryOptions(object):
:ivar int MaxWaitTimeInSeconds:
Max wait time in seconds to wait for a request while the retries are happening. Default value 30 seconds.
"""
-    def __init__(self, max_retry_attempt_count = 9, fixed_retry_interval_in_milliseconds = None, max_wait_time_in_seconds = 30):
+    def __init__(self, max_retry_attempt_count = 17, fixed_retry_interval_in_milliseconds = None, max_wait_time_in_seconds = 60):

Increasing the default number of retries, and time allowed, to reflect the fact that users may have inadvertently been relying on large numbers of retries.

@@ -221,7 +221,7 @@ def test_default_retry_policy_for_query(self):
result_docs = list(docs)
self.assertEqual(result_docs[0]['id'], 'doc1')
self.assertEqual(result_docs[1]['id'], 'doc2')
-        self.assertEqual(self.counter, 12)
+        self.assertEqual(self.counter, 6)

This test counts how many retries were done. Before the fix, 12 HTTP requests were generated; now 6 are. This is by design. Updating the test to match.

except errors.HTTPFailure as ex:
self.assertEqual(ex.status_code, StatusCodes.TOO_MANY_REQUESTS)

client._DocumentClient__Post = original_post_function

Mocking out the server returning a 429 error by overriding the client's __Post method. It needs to be done this low down, because mocking at the _Execute level (as in the rest of this file) misses the lower level of retries.
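The patching pattern looks roughly like this. Everything here is a stand-in: FakeClient and its __Post substitute for the real DocumentClient (whose mangled attribute is _DocumentClient__Post), and HTTPFailure is a simplified version of errors.HTTPFailure with StatusCodes.TOO_MANY_REQUESTS.

```python
# Sketch of overriding a client's name-mangled __Post method so every
# request fails with 429, then restoring the original afterwards.
class HTTPFailure(Exception):
    def __init__(self, status_code):
        self.status_code = status_code

class FakeClient:
    def __Post(self, path, body):       # stored as _FakeClient__Post
        return {"rows": ["doc1"]}

client = FakeClient()
original_post = client._FakeClient__Post  # bound method via mangled name

def throttled_post(path, body):
    raise HTTPFailure(429)              # server always throttles

client._FakeClient__Post = throttled_post  # patch at the lowest level
try:
    client._FakeClient__Post("/docs", {})
except HTTPFailure as ex:
    print(ex.status_code)               # prints 429
finally:
    client._FakeClient__Post = original_post  # always restore
```

Patching at this level exercises the real retry stack above it, which is exactly what mocking _Execute cannot do.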

@christopheranderson commented Mar 28, 2019

Failed with this error:

================================== FAILURES ===================================
_______________ Test_retry_policy_tests.test_429_on_first_page ________________

self = <test.retry_policy_tests.Test_retry_policy_tests testMethod=test_429_on_first_page>

    def test_429_on_first_page(self):
>       client = document_client.DocumentClient(Test_retry_policy_tests.host, {'masterKey': Test_retry_policy_tests.masterKey})
E       NameError: global name 'document_client' is not defined

D:\a\1\s\test\retry_policy_tests.py:254: NameError

@HyeonGook

Any updates?
