Refactor integration tests to remove random collection sampling #749

Draft
wants to merge 17 commits into base: main
Conversation


@mfisher87 mfisher87 commented Jul 6, 2024

Resolves #215

cc @betolink: just getting started on this. I have some sample code that can generate a list of the 100 most popular collections, in order, for a given provider.

Before I continue, I would like input!

Next steps may be:

  • Refactor things to be less quick-and-dirty. First consideration for me is the mismatch between "DAACs" in the tests and "providers" in the script that generates the popular lists.
  • Modify the tests to read from the new source instead of randomly selecting. We can use the first N rows from each file, depending on the test; currently not every test uses the same number of collections. We can create a fixture that provides the data in these files as a mapping of DAACs -> list of collections (see the sketch after this list).
  • A "blocklist" of known-bad collections
  • ?
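
For the fixture idea above, a minimal sketch. The directory name, file naming, and one-concept-ID-per-line layout are my assumptions, not settled decisions:

```python
# Sketch only: assumes files like popular_collections/<DAAC>.txt containing
# one collection concept ID per line. Names and layout are placeholders.
from pathlib import Path

import pytest

COLLECTIONS_DIR = Path(__file__).parent / "popular_collections"  # assumed location


@pytest.fixture(scope="session")
def popular_collections() -> dict[str, list[str]]:
    """Map each DAAC/provider name to its ordered list of popular collection IDs."""
    return {
        path.stem: path.read_text().splitlines()
        for path in sorted(COLLECTIONS_DIR.glob("*.txt"))
    }
```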

@mfisher87 mfisher87 changed the title Refactor integration tests Refactor integration tests to remove random collection sampling Jul 6, 2024
"page_num": 1,
"page_size": 100,
"sort_key[]": "-usage_score",
},
@mfisher87 mfisher87 Jul 7, 2024

I just copied this out of my browser dev tools' network tab after doing a similar query in the Earthdata Search client. I'm sure we can run an equivalent query with earthaccess.
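
For reference, a minimal sketch of the same query made directly against CMR with `requests`; the provider value is only an example, and generate.py may use earthaccess instead:

```python
# Sketch: query CMR for a provider's most-used collections, mirroring the
# parameters captured from the browser above. "NSIDC_ECS" is only an example.
import requests

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"

params = {
    "provider": "NSIDC_ECS",
    "page_num": 1,
    "page_size": 100,
    "sort_key[]": "-usage_score",  # highest usage score first
}

response = requests.get(CMR_COLLECTIONS_URL, params=params, timeout=30)
response.raise_for_status()

# The .json format nests results under feed.entry; each entry's "id" is the
# collection concept ID.
concept_ids = [entry["id"] for entry in response.json()["feed"]["entry"]]
print(concept_ids[:10])
```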

@mfisher87

Worked on this with @itcarroll during hack day. Notes: #755

@mfisher87

We considered the usefulness of random sampling tests. We don't think we should be doing this for integration tests, especially when they execute on every PR. We could, for example, run them on a cron job and create reports, but that seems like overkill when we have a community to help us identify datasets and connect with the right support channel if there's an issue with the provider.

We may still consider a cron job for, for example, recalculating the most popular datasets on a monthly basis.

@mfisher87

We decided we can hardcode a small number of collections and expand the list as we go. Other things, like random tests on a cron or updating the list of popular datasets on a cron, can be addressed separately.

@mfisher87

@betolink will take on work to update generate.py to generate top N collections for all providers.

@mfisher87 will continue working on test_onprem_download.py for just NSIDC_ECS for now to make it use the new source of collections.

@mfisher87

We will update the .txt files to .csv files and add a boolean field for "does the collection have a EULA?"; we'll then use that field to mark those tests as xfail.
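
A rough sketch of how a test module could consume such a CSV. The column names (`concept_id`, `has_eula`) and file path are assumptions, not decisions from this thread:

```python
# Sketch only: parametrize a test from an assumed CSV with columns
# "concept_id" and "has_eula", marking EULA-gated collections as xfail.
import csv
from pathlib import Path

import pytest

CSV_PATH = Path(__file__).parent / "popular_collections" / "NSIDC_ECS.csv"  # assumed


def collection_params():
    with CSV_PATH.open(newline="") as f:
        for row in csv.DictReader(f):
            marks = (
                [pytest.mark.xfail(reason="collection requires a EULA")]
                if row["has_eula"].strip().lower() == "true"
                else []
            )
            yield pytest.param(row["concept_id"], marks=marks)


@pytest.mark.parametrize("concept_id", collection_params())
def test_download_collection(concept_id):
    ...  # exercise the earthaccess search/download path for this collection
```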

@mfisher87

mfisher87 commented Aug 21, 2024

Two major milestones:

  1. @danielfromearth updated the script which generates the top collection lists to use all providers supported by earthaccess 🎉 Still TODO: Make them CSVs with a boolean representing whether the collection has a EULA
  2. We just got the test_onprem_download.py module working without randomization! 🎉 Still TODO: Refactor the other 3 integration test modules to share this behavior. Let's try and remove duplicate code while we're at it!

Thanks to @DeanHenze and @Sherwin-14 for collaborating on this on today's hackathon!

@@ -244,6 +244,9 @@ def _repr_html_(self) -> str:
        granule_html_repr = _repr_granule_html(self)
        return granule_html_repr

    def __hash__(self) -> int:
        return hash(self["meta"]["concept-id"])
@mfisher87 mfisher87 Aug 21, 2024

@betolink @chuckwondo This seems reasonable to me, but please validate me :)

@mfisher87 (Collaborator Author)

Thinking about it for like 5 minutes, this is obviously a bad idea. This class is subclassing dict, which is mutable. We'd need to implement something like a frozendict.
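
A toy illustration of the concern (not earthaccess code): hashing a mutable `dict` subclass by one of its values means the hash can change after the object has been stored in a set or used as a dict key.

```python
class Granule(dict):
    def __hash__(self) -> int:
        return hash(self["meta"]["concept-id"])


g = Granule({"meta": {"concept-id": "G123-PROV"}})
granules = {g}

before = hash(g)
g["meta"]["concept-id"] = "G456-PROV"  # mutate after inserting into the set
after = hash(g)

print(before == after)  # False: the hash changed while g sat in the set
print(g in granules)    # usually False now; set lookups can no longer find g
```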

@mfisher87

Also still TODO: Run generate.py in GHA on a monthly/quarterly cron and auto-open a PR with the changes to top collections?

@mfisher87

If we want to determine whether a collection has a EULA, this example was provided:

curl -i -XGET "https://cmr.earthdata.nasa.gov/search/collections.json?concept_id=C1808440897-ASF&pretty=true"

The metadata "eula_identifiers" : [ "1b454cfb-c298-4072-ae3c-3c133ce810c8" ] is present in the response. We're not 100% sure whether this can be used authoritatively. Discussion in progress: https://nsidc.slack.com/archives/C2LRKMDEV/p1724179804149239
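
A hedged sketch of the same check from Python; that `eula_identifiers` appears on each `feed.entry` item of the `collections.json` response is taken from the curl example above, not verified independently:

```python
import requests

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def collection_has_eula(concept_id: str) -> bool:
    """Return True if the collection's CMR entry lists any EULA identifiers."""
    response = requests.get(
        CMR_COLLECTIONS_URL, params={"concept_id": concept_id}, timeout=30
    )
    response.raise_for_status()
    entries = response.json()["feed"]["entry"]
    return any(entry.get("eula_identifiers") for entry in entries)


print(collection_has_eula("C1808440897-ASF"))  # expected True per the example above
```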

@mfisher87 mfisher87 added the help wanted Extra attention is needed label Sep 3, 2024
@mfisher87 (Collaborator Author)

TODO: Add tests for OBDAAC on-prem open. Related to #828 - we want to make sure the data streams successfully. Opening data from OBDAAC on-prem relies on both #828 and a (potentially) unreleased change to fsspec! (check the September release notes)

@danielfromearth

Looks like part of this issue may be related to work on EULAs in this issue.

Labels
help wanted Extra attention is needed

Successfully merging this pull request may close these issues.

Integration tests are flaky -- replace dataset sampling with top 50 datasets