
Permissions error when downloading files from OPERA_L2_RTC-S1_V1 dataset on ASF DAAC (but not OPERA_L3_DIST-ALERT-HLS_PROVISIONAL_V0 on LP DAAC) #439

Closed
cmspeed opened this issue Jan 29, 2024 · 10 comments · Fixed by #443 or #472


@cmspeed

cmspeed commented Jan 29, 2024

I am attempting to download a few granules from the OPERA_L2_RTC-S1_V1 dataset using the following:

import earthaccess
earthaccess.login()

results = earthaccess.search_data(
    short_name='OPERA_L2_RTC-S1_V1',
    cloud_hosted=True,
    bounding_box=(-117.33, 35.541, -117.880, 35.991),
    temporal=("2023-10-01", "2023-10-15"),
    count=50
)

files = earthaccess.download(results, "./data")

Even though the initial authentication step seems to be working as it should, I am encountering the following HTTP 401 error (for every granule):

Error while downloading the file OPERA_L2_RTC-S1_T136-290177-IW2_20231004T002041Z_20240123T055542Z_S1A_30_v1.0.h5

requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: ....

Interestingly, the same syntax applied to a different OPERA dataset doesn't produce the same error, and the files are downloaded as expected.

import earthaccess
earthaccess.login()

results = earthaccess.search_data(
    short_name='OPERA_L3_DIST-ALERT-HLS_PROVISIONAL_V0',
    cloud_hosted=True,
    bounding_box=(-117.33, 35.541, -117.880, 35.991),
    temporal=("2023-10-01", "2023-10-15"),
    count=50
)

files = earthaccess.download(results, "./data")

A key difference between these datasets is that the OPERA RTC product is distributed by ASF DAAC whereas the OPERA DIST-ALERT product is distributed by LP DAAC. I am wondering if the issue may stem from a difference in the way that earthaccess requests these files from the two DAACs. I am relatively new to this means of data access, so any guidance would be much appreciated!

@mfisher87
Collaborator

mfisher87 commented Jan 29, 2024

🎉 Thanks for the report!

I believe the issue is that this dataset has an End User License Agreement (EULA) that must be accepted. We need to work on our error message for this so users can more easily resolve this issue on their own (#36).

Can you please try visiting a granule URL in your browser (https://datapool.asf.alaska.edu/RTC/OPERA-S1/OPERA_L2_RTC-S1_T098-209157-IW3_20240129T101603Z_20240129T135204Z_S1A_30_v1.0_HH.tif) and navigating through the EULA, then try using earthaccess again and let us know if that helps?

@cmspeed
Author

cmspeed commented Jan 29, 2024

Thank you for the information! I have downloaded OPERA RTC data using the ASF Data Search Vertex in the past, and I agreed to the ASF EULA at that time. The granule you linked downloads automatically when clicked, but the issue persists when trying to access RTC data with earthaccess. Regarding permissions, would something need to be added to the .netrc file to indicate that the user has already agreed to the EULA?

@mfisher87
Collaborator

Shoot! That was my best guess :) @betolink thoughts on how to debug deeper?

There should be no .netrc change required to reflect EULA acceptance. Is there any possibility you have multiple Earthdata accounts, with one set up in .netrc and a different one signed in via the browser?

@betolink
Member

I'm trying to get to one of the files directly with curl and I'm getting a 401 Access Denied. If this requires a different EULA, that info is not present in the redirects. I think @jhkennedy probably knows whether this collection a) accepts EDL tokens as a valid authentication method or b) requires a particular EULA to be accepted.

 curl -Lv -H "Authorization: Bearer $EDL_TOKEN" https://datapool.asf.alaska.edu/RTC/OPERA-S1/OPERA_L2_RTC-S1_T136-290176-IW3_20231004T002039Z_20240123T060856Z_S1A_30_v1.0_mask.tif

@jhkennedy
Collaborator

jhkennedy commented Jan 29, 2024

ASF has a datapool app in front of all of our distribution endpoints so that we can provide persistent URLs no matter what happens on the backend. This was particularly beneficial when we were migrating data from on-prem to the cloud.

However, datapool does get in the way of Bearer tokens because tools like curl and requests try to do right by the user: if you send an auth header to datapool.asf.alaska.edu, they assume it was meant for datapool.asf.alaska.edu and don't forward it to cumulus.asf.alaska.edu, to prevent credential leaking.

If you hit cumulus directly, it works:

curl -L -H "Authorization: Bearer $EDL_TOKEN"  https://cumulus.asf.alaska.edu/RTC/OPERA-S1/OPERA_L2_RTC-S1/OPERA_L2_RTC-S1_T136-290176-IW3_20231004T002039Z_20240123T060856Z_S1A_30_v1.0/OPERA_L2_RTC-S1_T136-290176-IW3_20231004T002039Z_20240123T060856Z_S1A_30_v1.0_mask.tif

The curl command above also works if you add --location-trusted, which forwards the auth header through all redirects. Similarly, with requests we'd have to handle that on the client side.
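For illustration, here's a minimal sketch of that client-side handling with requests (the TRUSTED_HOSTS list and EDL_TOKEN below are assumptions, not an official list or earthaccess's actual code), similar in spirit to what asf_search does:

import requests
from urllib.parse import urlparse

# Illustrative set of hosts that may share the bearer token; not exhaustive.
TRUSTED_HOSTS = {
    "datapool.asf.alaska.edu",
    "cumulus.asf.alaska.edu",
    "urs.earthdata.nasa.gov",
}

class TrustedRedirectSession(requests.Session):
    def rebuild_auth(self, prepared_request, response):
        # requests normally drops the Authorization header when a redirect
        # changes hosts; keep it when both hosts are on the trusted list.
        old_host = urlparse(response.request.url).hostname
        new_host = urlparse(prepared_request.url).hostname
        if old_host in TRUSTED_HOSTS and new_host in TRUSTED_HOSTS:
            return
        super().rebuild_auth(prepared_request, response)

# Usage (EDL_TOKEN is assumed to hold a valid Earthdata Login bearer token):
# session = TrustedRedirectSession()
# session.headers["Authorization"] = f"Bearer {EDL_TOKEN}"
# response = session.get(granule_url)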

On the ASF side, to eliminate the need for users/applications/libraries to handle this bit of fun themselves, we'd have to either:

  • eliminate datapool, which would require updating tens to hundreds of millions of CMR records, and I wouldn't expect that to happen.
  • add authn back to datapool so it can accept the header and vend an ASF-wide cookie before the user has to be redirected anywhere, but it'd have to know a lot more about the datasets than it currently does

Notably, basic auth with a .netrc likely does work because the behavior of many tools (wget, requests) with basic auth is to wait for an HTTP 401 response before sending an Authorization header.
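As a rough sketch of that .netrc path (assuming Earthdata Login credentials for urs.earthdata.nasa.gov are present in ~/.netrc): requests re-resolves .netrc credentials for each host in the redirect chain, so the URS challenge gets answered without any bearer-token forwarding.

import requests

# Hypothetical granule URL; any datapool URL should behave similarly.
url = "https://datapool.asf.alaska.edu/RTC/OPERA-S1/<granule>.tif"

with requests.Session() as session:
    # No explicit auth: requests looks up ~/.netrc for each host it is
    # redirected to, including urs.earthdata.nasa.gov.
    response = session.get(url)
    response.raise_for_status()
    with open("granule.tif", "wb") as f:
        f.write(response.content)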

@jhkennedy
Collaborator

jhkennedy commented Jan 29, 2024

Notably, for the asf_search Python package, this is handled here:
https://github.com/asfadmin/Discovery-asf_search/blob/master/asf_search/ASFSession.py#L31

And the suggestion from @asjohnston-asf, who talked me through what's happening here, is:

My strategy would be: Ask the user for their token at init time, attempt all download requests without an auth header, if at any point you get a 401 response from a list of trusted hosts (cumulus, sentinel1, nisar, ...) re-submit that specific request with the auth header

But I don't know that we want Earthaccess to have to know that much about each DAAC
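For reference, a minimal sketch of that suggested strategy (the helper name and host list below are illustrative, not earthaccess's actual implementation):

import requests

# Illustrative trusted hosts from the suggestion above; not an official list.
TRUSTED_HOSTS = ("cumulus.asf.alaska.edu", "sentinel1.asf.alaska.edu", "nisar.asf.alaska.edu")

def download_with_fallback(url: str, token: str, dest: str) -> None:
    with requests.Session() as session:
        # First attempt: no auth header at all.
        response = session.get(url)
        if response.status_code == 401 and any(host in response.url for host in TRUSTED_HOSTS):
            # Only on a 401 from a trusted host, re-submit that specific
            # request with the bearer token attached.
            response = session.get(response.url, headers={"Authorization": f"Bearer {token}"})
        response.raise_for_status()
        with open(dest, "wb") as f:
            f.write(response.content)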

@mfisher87
Collaborator

But I don't know that we want Earthaccess to have to know that much about each DAAC

💯

@betolink
Member

I'll take a look at possible solutions so we can have the whole ASF catalog available in earthaccess.

@betolink
Member

Hi @cmspeed

We just merged a PR that fixes access to all OPERA products (or so we think). If you don't want to wait for the next release, you can install the library from the repo and test whether the fix works for you.

pip install git+https://github.com/nsidc/earthaccess.git@main

You may need to uninstall the package first and then reinstall it. Looking forward to hearing your results!

@betolink mentioned this issue Feb 28, 2024
@cmspeed
Author

cmspeed commented Apr 5, 2024

Hi @betolink - Apologies for the delay in circling back to this. Things are now working on my end with the ASF-hosted datasets after the update. Thanks for your work on this (and on earthaccess in general). Really good stuff!
