Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support endpoints that don't support range requests in asyncBufferFromUrl #57

Merged
merged 3 commits into from
Jan 16, 2025

Conversation

swlynch99
Copy link
Contributor

@swlynch99 swlynch99 commented Jan 16, 2025

Hey! We've been using hyparquet to parse some parquet files stored behind an endpoint that doesn't support range requests. This is possible to do with a custom file object but only works with asyncBufferFromUrl if the relevant file is in the browser cache which leads to some confusing "works on my machine" issues. It would be nicer if asyncBufferFromUrl just worked correctly in this case and that's what I've done in this PR.

Before this commit asyncBufferFromUrl assumes that the body of whatever successful response it gets is equivalent to the range it requested. If the origin server does not support HTTP range requests then this assumption is usually wrong and will lead to parsing failures.

This commit changes asyncBufferFromUrl to change its behaviour slightly based on the status code in the response:

  • if 200 then we got the whole parquet file as the response. Save it and use the resulting ArrayBuffer to serve all future slice calls.
  • if 206 then we got a range response and we can just return that.

I have also included some test cases to ensure that such responses are handled correctly and also tweaked other existing mocks to include the relevant status code.

The one case where this code isn't fully correct is the case of multiple concurrent calls to slice. It'll work fine if the origin supports range requests, but might end up making extra unnecessary requests if it doesn't. I scanned readGroup and I don't think it ever makes concurrent slice calls so I don't think this is an issue. I am, however, happy to fix it if you guys think it is worth doing so.

…mUrl

Before this commit asyncBufferFromUrl assumes that the body of whatever
successful response it gets is equivalent to the range it requested. If
the origin server does not support HTTP range requests then this
assumption is usually wrong and will lead to parsing failures.

This commit changes asyncBufferFromUrl to change its behaviour slightly
based on the status code in the response:
- if 200 then we got the whole parquet file as the response. Save it and
  use the resulting ArrayBuffer to serve all future slice calls.
- if 206 then we got a range response and we can just return that.

I have also included some test cases to ensure that such responses are
handled correctly and also tweaked other existing mocks to also include
the relevant status code.
Copy link
Collaborator

@platypii platypii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall.

I confirmed that common file servers returned 206 for range requests, so that's all good (s3, azure, huggingface, and github all worked).

Some style and lint comments, but once those are fixed I will approve and merge. Sorry that CI didn't run. Thanks for this PR @swlynch99! 👍

src/utils.js Outdated
switch (res.status) {
// Endpoint does not support range requests and returned the whole object
case 200:
buffer = res.arrayBuffer();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lint: should be no semi-colons at end of line. Please run npm run lint and fix the errors.

(I'm not sure why lint didn't run on this PR? It should be in the github actions?)

Copy link
Contributor

@severo severo Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the CI requires approval to run: https://github.com/hyparam/hyparquet/actions/runs/12801142297/attempts/1

This workflow is awaiting approval from a maintainer in #57

Maybe a repo setting to change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just launched the CI

@swlynch99
Copy link
Contributor Author

I've fixed all the lint errors (npm run eslint -- --fix) and swapped out the switch statement.

Since this is my first PR to the repository you're going to have to approve CI again. There is a repo setting to change this, though I don't remember what it is off the top of my head.

@platypii platypii merged commit 7255457 into hyparam:master Jan 16, 2025
3 checks passed
@platypii
Copy link
Collaborator

Published v1.8.1 to npm. Thanks again @swlynch99!

@swlynch99 swlynch99 deleted the buffer-from-url-no-range branch January 16, 2025 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants