Support endpoints that don't support range requests in `asyncBufferFromUrl` #57

swlynch99 · 2025-01-16T02:52:36Z

Hey! We've been using hyparquet to parse some parquet files stored behind an endpoint that doesn't support range requests. This is possible with a custom file object, but with asyncBufferFromUrl it only works if the relevant file is already in the browser cache, which leads to some confusing "works on my machine" issues. It would be nicer if asyncBufferFromUrl just worked correctly in this case, and that's what I've done in this PR.

Before this commit asyncBufferFromUrl assumes that the body of whatever successful response it gets is equivalent to the range it requested. If the origin server does not support HTTP range requests then this assumption is usually wrong and will lead to parsing failures.

This commit makes asyncBufferFromUrl adjust its behaviour slightly based on the status code in the response:

- if 200 then we got the whole parquet file as the response. Save it and use the resulting ArrayBuffer to serve all future slice calls.
- if 206 then we got a range response and we can just return that.

I have also included some test cases to ensure that such responses are handled correctly and also tweaked other existing mocks to include the relevant status code.
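
For readers skimming the thread, the status-code handling described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `rangeAwareBuffer` and its injectable `fetchFn` parameter are hypothetical names introduced here so the sketch can run without a network, and it assumes an object shape of `byteLength` plus `slice(start, end)`.

```javascript
// Hypothetical sketch, NOT hyparquet's actual implementation.
// `fetchFn` is an assumed, injectable stand-in for the global fetch.
function rangeAwareBuffer(url, byteLength, fetchFn = fetch) {
  // Promise for the whole file, set once a 200 response has been seen
  let whole

  return {
    byteLength,
    async slice(start, end) {
      // A previous 200 already delivered the full file: serve from it
      if (whole) return (await whole).slice(start, end)

      // HTTP byte ranges are inclusive, so the last byte is end - 1
      const endStr = end === undefined ? '' : end - 1
      const res = await fetchFn(url, { headers: { Range: `bytes=${start}-${endStr}` } })

      if (res.status === 200) {
        // Range header was ignored: the body is the entire file
        whole = res.arrayBuffer()
        return (await whole).slice(start, end)
      } else if (res.status === 206) {
        // Partial content: the body is exactly the requested range
        return res.arrayBuffer()
      }
      throw new Error(`unexpected status ${res.status}`)
    },
  }
}
```

With a server that ignores `Range`, only the first `slice` call in this sketch hits the network; later calls are answered from the saved buffer.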

The one case where this code isn't fully correct is the case of multiple concurrent calls to slice. It'll work fine if the origin supports range requests, but might end up making extra unnecessary requests if it doesn't. I scanned readGroup and I don't think it ever makes concurrent slice calls so I don't think this is an issue. I am, however, happy to fix it if you guys think it is worth doing so.
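
If the concurrent case ever does matter, one possible approach (a sketch under assumptions, not something this PR implements) is to let the first request act as a probe: later `slice` calls wait until the first response has revealed whether the server honours ranges. The names `dedupedRangeBuffer`, `probe`, and `fetchFn` are hypothetical.

```javascript
// Hypothetical sketch, NOT part of this PR or of hyparquet.
// `fetchFn` is an assumed, injectable stand-in for the global fetch.
function dedupedRangeBuffer(url, byteLength, fetchFn = fetch) {
  let whole // Promise<ArrayBuffer>, set once a 200 response is seen
  let probe // settles when the first request has completed or failed

  async function request(start, end) {
    const endStr = end === undefined ? '' : end - 1
    const res = await fetchFn(url, { headers: { Range: `bytes=${start}-${endStr}` } })
    if (res.status === 200) {
      whole = res.arrayBuffer()
      return (await whole).slice(start, end)
    }
    if (res.status === 206) return res.arrayBuffer()
    throw new Error(`unexpected status ${res.status}`)
  }

  return {
    byteLength,
    async slice(start, end) {
      if (!probe) {
        // First caller: its request doubles as the range-support probe
        const first = request(start, end)
        probe = first.then(() => {}, () => {}) // never rejects
        return first
      }
      // Later callers wait until the probe has settled
      await probe
      if (whole) return (await whole).slice(start, end)
      return request(start, end)
    },
  }
}
```

The trade-off is that calls issued while the very first request is in flight are delayed until it settles, even when the server does support ranges; after that, range requests run concurrently as before.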

platypii

This looks good overall.

I confirmed that common file servers returned 206 for range requests, so that's all good (s3, azure, huggingface, and github all worked).

Some style and lint comments, but once those are fixed I will approve and merge. Sorry that CI didn't run. Thanks for this PR @swlynch99! 👍

platypii · 2025-01-16T07:46:56Z

src/utils.js

+      switch (res.status) {
+        // Endpoint does not support range requests and returned the whole object
+        case 200:
+          buffer = res.arrayBuffer();


lint: there should be no semicolons at the end of lines. Please run `npm run lint` and fix the errors.

(I'm not sure why lint didn't run on this PR? It should be in the github actions?)

It looks like the CI requires approval to run: https://github.com/hyparam/hyparquet/actions/runs/12801142297/attempts/1

This workflow is awaiting approval from a maintainer in #57

Maybe a repo setting to change

I just launched the CI

swlynch99 · 2025-01-16T19:44:17Z

I've fixed all the lint errors (npm run eslint -- --fix) and swapped out the switch statement.

Since this is my first PR to the repository you're going to have to approve CI again. There is a repo setting to change this, though I don't remember what it is off the top of my head.

platypii · 2025-01-16T20:08:29Z

Published v1.8.1 to npm. Thanks again @swlynch99!

platypii requested changes Jan 16, 2025

swlynch99 added 2 commits January 16, 2025 11:38

Fix all lint warnings

300c7f2

replace switch with if-else

9a4b91d

platypii approved these changes Jan 16, 2025

platypii merged commit 7255457 into hyparam:master Jan 16, 2025
3 checks passed

swlynch99 deleted the buffer-from-url-no-range branch January 16, 2025 21:21
