Fix Bug #131: 0 rows returned when server returns 429 on first page of results. #132
base: master
Conversation
```diff
@@ -155,7 +149,7 @@ def __init__(self, client, options, fetch_function):

     def _fetch_next_block(self):
         while super(self.__class__, self)._has_more_pages() and len(self._buffer) == 0:
-            return self._fetch_items_helper_with_retries(self._fetch_function)
+            return self._fetch_items_helper_no_retries(self._fetch_function)
```
Strategy: this is the outermost level of retries, which is being removed.
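For context, a hedged sketch of roughly what that outer layer does, based on the description later in this thread; the exact arguments passed to `retry_utility._Execute` are an assumption, not taken from this diff:

```python
# Assumed shape (not verbatim pydocumentdb code): _fetch_items_helper_with_retries just
# wraps the plain helper in the retry utility, so retries happen here *and* again further
# down inside synchronized_request -- the nesting this PR removes.
def _fetch_items_helper_with_retries(self, fetch_function):
    def callback():
        return self._fetch_items_helper_no_retries(fetch_function)
    # Hypothetical argument list; shown only to illustrate the extra retry layer.
    return retry_utility._Execute(self._client, self._client._global_endpoint_manager, callback)
```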
pydocumentdb/retry_options.py
```diff
@@ -32,7 +32,7 @@ class RetryOptions(object):
     :ivar int MaxWaitTimeInSeconds:
         Max wait time in seconds to wait for a request while the retries are happening. Default value 30 seconds.
     """
-    def __init__(self, max_retry_attempt_count = 9, fixed_retry_interval_in_milliseconds = None, max_wait_time_in_seconds = 30):
+    def __init__(self, max_retry_attempt_count = 17, fixed_retry_interval_in_milliseconds = None, max_wait_time_in_seconds = 60):
```
Increasing the default number of retries, and time allowed, to reflect the fact that users may have inadvertently been relying on large numbers of retries.
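For reference, a sketch of how a caller can set the retry budget explicitly instead of relying on these defaults, assuming the usual pydocumentdb `ConnectionPolicy` wiring; the host and key values are placeholders:

```python
from pydocumentdb import document_client, documents
from pydocumentdb.retry_options import RetryOptions

# Placeholder endpoint and key -- substitute real account values.
HOST = 'https://myaccount.documents.azure.com:443/'
MASTER_KEY = '<account master key>'

connection_policy = documents.ConnectionPolicy()
# Explicitly choose the retry budget instead of relying on the library defaults
# (which this PR raises from 9 attempts / 30 s to 17 attempts / 60 s).
connection_policy.RetryOptions = RetryOptions(
    max_retry_attempt_count=17,
    fixed_retry_interval_in_milliseconds=None,
    max_wait_time_in_seconds=60)

client = document_client.DocumentClient(
    HOST, {'masterKey': MASTER_KEY}, connection_policy)
```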
```diff
@@ -221,7 +221,7 @@ def test_default_retry_policy_for_query(self):
         result_docs = list(docs)
         self.assertEqual(result_docs[0]['id'], 'doc1')
         self.assertEqual(result_docs[1]['id'], 'doc2')
-        self.assertEqual(self.counter, 12)
+        self.assertEqual(self.counter, 6)
```
This test counts how many retries were done. Before the fix, 12 HTTP requests were generated; now 6 are. This is by design, so the test is updated to match.
```python
        except errors.HTTPFailure as ex:
            self.assertEqual(ex.status_code, StatusCodes.TOO_MANY_REQUESTS)

        client._DocumentClient__Post = original_post_function
```
Mocking out the server returning a 429 error by overriding the client's __Post method. It needs to be done this low down because mocking at the _Execute level (as in the rest of this file) misses the lower level of retries.
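A rough sketch of that mocking approach (assumed shape, not the PR's exact test code); `client`, `collection_link`, and the query text are placeholders:

```python
from pydocumentdb import errors
from pydocumentdb.http_constants import StatusCodes

def _post_returning_429(*args, **kwargs):
    # Simulate the server throttling every request at the lowest HTTP layer.
    raise errors.HTTPFailure(StatusCodes.TOO_MANY_REQUESTS, "Request rate is large")

original_post_function = client._DocumentClient__Post   # name-mangled private method
client._DocumentClient__Post = _post_returning_429
try:
    # Force the query to execute; it should surface the 429 rather than return 0 rows.
    list(client.QueryDocuments(collection_link, {'query': 'SELECT * FROM root r'}))
except errors.HTTPFailure as ex:
    assert ex.status_code == StatusCodes.TOO_MANY_REQUESTS
finally:
    client._DocumentClient__Post = original_post_function
```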
Manually started CI on this PR: https://cosmos-db-sdk-public.visualstudio.com/cosmos-db-sdk-public/_build/results?buildId=1282
Failed with this error:
Any updates?
For background, read bug #131.

This happens because `base_execution_context._fetch_items_helper_with_retries()` runs the requests wrapped in a `retry_utility._Execute()` call, which does retries. However, much lower down the call stack more retries will be done, because `synchronized_request.SynchronizedRequest()` also uses `retry_utility._Execute()` to do retries.

These two nested retry policies cause the following problem: when the innermost policy decides there have been too many failures, it re-raises the `HTTPFailure` exception (`retry_utility.py:84`). However, the outermost retry policy catches that (`base_execution_context.py:132`) and attempts another round of innermost retries. The first of these sees misleading request-processing state: it has received 0 rows of results and does not have a continuation token. In this case, `self._continuation` is `None` because this is the first page of the query and the server has not returned a continuation token, and `self._has_started` is `True` because of the first round of retries. So the while loop never executes, and it returns the empty array of `fetched_items`. The client therefore (incorrectly) decides the query is complete and returns 0 rows. Bug.
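For illustration, a paraphrased sketch of the state check described above; this is not the library's verbatim code, and the option/header names are assumptions based on the description:

```python
# Paraphrased sketch (assumed, not verbatim pydocumentdb code) of the helper described
# above. On the retried first page, self._continuation is None and self._has_started is
# already True, so the while condition is False, the body never runs, and the caller
# gets back an empty list -- which it misreads as "query complete".
def _fetch_items_helper_no_retries(self, fetch_function):
    fetched_items = []
    while self._continuation or not self._has_started:
        if not self._has_started:
            self._has_started = True
        self._options['continuation'] = self._continuation
        (fetched_items, response_headers) = fetch_function(self._options)
        # Continuation header name as used by DocumentDB ('x-ms-continuation').
        self._continuation = response_headers.get('x-ms-continuation')
        if fetched_items:
            break
    return fetched_items
```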
This pull request removes that outermost layer of retries, increases the default retry limits to compensate, and adds a test covering the 429-on-first-page case.

It is safe to remove the outermost layer of retries because the library code seems to have evolved as follows. Initially, only the outermost layer of retries existed. Then support for queries failing over to another geo-location was added by putting retries down at the bottom level (in `synchronized_request`; see commit 9d1bac3). But at that time the uppermost layer of retries was not removed, so the code now does nested retries. Only the innermost layer is actually needed.

Also, notice that nested retries are another bug in themselves. When user code specifies retry options, they apply to both the inner and the outer retry policies and get compounded together. We have tests which cover this (`test/retry_policy_tests.py`), but they mock out the retry utility's `_Execute` function and so bypass the nesting, which makes the tests pass even though, when it runs for real, it won't do what the user expects. E.g. a retry limit of 8 calls will turn into 8 * 8 = 64! Users may have inadvertently been relying on this very large level of retries, which is why the default retry level has been increased: not to 64, but doubled (8 to 16).
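As a minimal, self-contained illustration (toy code, not the library's) of how two nested retry policies compound a user's retry limit:

```python
# Toy demonstration of compounded retries: an outer retry policy wrapping an inner one
# turns a limit of 8 attempts into 8 * 8 = 64 calls against the server.
calls = 0

def flaky_request():
    # Stand-in for an HTTP call that always fails with a throttling error.
    global calls
    calls += 1
    raise RuntimeError("429: request rate is large")

def with_retries(fn, attempts):
    # Simplistic retry policy: try up to `attempts` times, then give up.
    for _ in range(attempts):
        try:
            return fn()
        except RuntimeError:
            pass
    raise RuntimeError("retries exhausted")

try:
    # Nesting: the outer policy retries a callable that already retries internally.
    with_retries(lambda: with_retries(flaky_request, 8), 8)
except RuntimeError:
    pass

print(calls)  # prints 64
```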
Testing: all existing tests pass, and the newly added test passes with the fix, and fails without the fix. Additionally, we've been running this against a live CosmosDB server and it's working nicely for us.