-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support endpoints that don't support range requests in asyncBufferFromUrl
#57
Conversation
…mUrl Before this commit asyncBufferFromUrl assumes that the body of whatever successful response it gets is equivalent to the range it requested. If the origin server does not support HTTP range requests then this assumption is usually wrong and will lead to parsing failures. This commit changes asyncBufferFromUrl to change its behaviour slightly based on the status code in the response: - if 200 then we got the whole parquet file as the response. Save it and use the resulting ArrayBuffer to serve all future slice calls. - if 206 then we got a range response and we can just return that. I have also included some test cases to ensure that such responses are handled correctly and also tweaked other existing mocks to also include the relevant status code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good overall.
I confirmed that common file servers returned 206 for range requests, so that's all good (s3, azure, huggingface, and github all worked).
Some style and lint comments, but once those are fixed I will approve and merge. Sorry that CI didn't run. Thanks for this PR @swlynch99! 👍
src/utils.js
Outdated
switch (res.status) { | ||
// Endpoint does not support range requests and returned the whole object | ||
case 200: | ||
buffer = res.arrayBuffer(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lint: should be no semi-colons at end of line. Please run npm run lint
and fix the errors.
(I'm not sure why lint didn't run on this PR? It should be in the github actions?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the CI requires approval to run: https://github.com/hyparam/hyparquet/actions/runs/12801142297/attempts/1
This workflow is awaiting approval from a maintainer in #57
Maybe a repo setting to change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just launched the CI
I've fixed all the lint errors ( Since this is my first PR to the repository you're going to have to approve CI again. There is a repo setting to change this, though I don't remember what it is off the top of my head. |
Published v1.8.1 to npm. Thanks again @swlynch99! |
Hey! We've been using hyparquet to parse some parquet files stored behind an endpoint that doesn't support range requests. This is possible to do with a custom
file
object but only works withasyncBufferFromUrl
if the relevant file is in the browser cache which leads to some confusing "works on my machine" issues. It would be nicer ifasyncBufferFromUrl
just worked correctly in this case and that's what I've done in this PR.Before this commit
asyncBufferFromUrl
assumes that the body of whatever successful response it gets is equivalent to the range it requested. If the origin server does not support HTTP range requests then this assumption is usually wrong and will lead to parsing failures.This commit changes
asyncBufferFromUrl
to change its behaviour slightly based on the status code in the response:I have also included some test cases to ensure that such responses are handled correctly and also tweaked other existing mocks to include the relevant status code.
The one case where this code isn't fully correct is the case of multiple concurrent calls to
slice
. It'll work fine if the origin supports range requests, but might end up making extra unnecessary requests if it doesn't. I scannedreadGroup
and I don't think it ever makes concurrent slice calls so I don't think this is an issue. I am, however, happy to fix it if you guys think it is worth doing so.