calendar_interval in datehistogram #2459

Open
PSeitz opened this issue Jul 26, 2024 · 0 comments

PSeitz commented Jul 26, 2024

The calendar_interval parameter is currently not supported in the date histogram aggregation. This issue outlines the challenges and drafts possible solutions.
Unlike fixed_interval, calendar_interval produces buckets of different sizes, because months, years, etc. map to timestamp ranges of different lengths.

Fixed Interval DateHistogram

Currently the date histogram collection reuses the histogram implementation, which collects sparsely by default. That means we don't allocate e.g. a Vec upfront; instead we have a hash map:

buckets: FxHashMap<i64, SegmentHistogramBucketEntry>,

For every timestamp, we truncate it to the start of its bucket and collect into that bucket.
This behavior allows for "drill-down", where we apply a filter and get a high-resolution histogram. Preallocating buckets over the min-max range of the column may OOM in these cases.
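
As a rough illustration of the truncate-and-collect step, here is a minimal sketch. It is not the actual tantivy code: it uses std's HashMap instead of FxHashMap, a plain u64 count instead of SegmentHistogramBucketEntry, and the names bucket_key and collect_sparse are hypothetical. The interval is assumed to be positive.

```rust
use std::collections::HashMap;

/// Truncates `timestamp` down to the start of its fixed-width bucket.
/// `rem_euclid` keeps the result correct for negative (pre-1970) timestamps.
/// Assumes `interval > 0`.
fn bucket_key(timestamp: i64, interval: i64) -> i64 {
    timestamp - timestamp.rem_euclid(interval)
}

/// Sparse collection: a bucket only exists once a value falls into it,
/// so memory scales with the number of hit buckets, not with max - min.
fn collect_sparse(timestamps: &[i64], interval: i64) -> HashMap<i64, u64> {
    let mut buckets: HashMap<i64, u64> = HashMap::new();
    for &ts in timestamps {
        *buckets.entry(bucket_key(ts, interval)).or_insert(0) += 1;
    }
    buckets
}
```

With a one-day interval (86_400 seconds), values that are ten days apart create only the buckets they actually hit; the empty days in between cost nothing.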

Calendar Aware DateHistogram

With the calendar-aware date histogram we have two value spaces: the data stored as UTC and the data converted into a timezone. We want to avoid converting every fetched timestamp into its timezone-specific counterpart; ideally the bucket boundaries themselves should already account for the timezone.

The simplest solution for calendar_interval would be to reuse the range aggregation by preallocating the ranges. This has two problems:

  • Filter + high resolution may OOM due to too many buckets
  • A binary_search to find the bucket may be slow
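
To make the range-based approach concrete, here is a minimal sketch for monthly buckets in UTC with second-resolution timestamps. The function names are hypothetical and this is not the range aggregation's actual code; timezone offsets are ignored, and a timezone-aware version would presumably shift each boundary to the UTC instant of the local month start. days_from_civil uses the well-known civil-date-to-day-count formula.

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Days since 1970-01-01 for a proleptic Gregorian date (`month` is 1..=12).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Shift the year so that it starts in March; leap days then fall at the end.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (month + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + day - 1; // day of (shifted) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Preallocates the UTC start timestamps of `count` consecutive months,
/// beginning at `year`-`month`, plus one closing boundary.
fn month_boundaries(year: i64, month: i64, count: usize) -> Vec<i64> {
    let first = year * 12 + (month - 1);
    (0..=count as i64)
        .map(|i| {
            let m = first + i;
            days_from_civil(m.div_euclid(12), m.rem_euclid(12) + 1, 1) * SECS_PER_DAY
        })
        .collect()
}

/// Binary search for the bucket containing `ts`; `None` if out of range.
fn find_bucket(boundaries: &[i64], ts: i64) -> Option<usize> {
    // Number of bucket starts that are <= ts.
    let idx = boundaries.partition_point(|&start| start <= ts);
    if idx == 0 || idx == boundaries.len() {
        None
    } else {
        Some(idx - 1)
    }
}
```

Both problems are visible here: month_boundaries allocates one entry per bucket over the whole range regardless of how many buckets are hit, and find_bucket costs O(log n) per collected value instead of a single truncation.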

Potential Solutions

  • A multi-level data structure that preallocates the top-level and is lazy on lower levels
  • Group buckets into fixed-interval ranges and use an algorithm similar to the current one inside each group, where we truncate to the closest bucket with the help of some metadata
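
One possible reading of the second idea, as a sketch only: split the timeline into fixed-width groups that are no wider than the shortest calendar bucket (e.g. 28 days for months), so that each group contains at most one bucket boundary. A lookup is then one division plus one comparison. Group, GroupedBuckets and their fields are hypothetical names, and the groups are built eagerly here for brevity; to stay sparse they could be kept in a hash map and created on first hit.

```rust
/// Metadata for one fixed-width group of the timeline.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Group {
    /// Index of the calendar bucket that contains the start of this group.
    first_bucket: usize,
    /// Start of the next calendar bucket, if it begins inside this group.
    split: Option<i64>,
}

struct GroupedBuckets {
    origin: i64,
    end: i64,
    group_width: i64,
    groups: Vec<Group>,
}

impl GroupedBuckets {
    /// `boundaries` must be sorted and hold at least two entries (bucket starts
    /// plus one closing boundary). Every bucket must be at least `group_width`
    /// wide, so that a group contains at most one boundary.
    fn new(boundaries: &[i64], group_width: i64) -> Self {
        let origin = boundaries[0];
        let end = *boundaries.last().unwrap();
        let num_groups = (end - origin + group_width - 1) / group_width;
        let mut groups = Vec::with_capacity(num_groups as usize);
        let mut bucket = 0usize;
        for g in 0..num_groups {
            let start = origin + g * group_width;
            // Advance to the bucket containing `start`. This stays in bounds
            // because the closing boundary `end` is always greater than `start`.
            while boundaries[bucket + 1] <= start {
                bucket += 1;
            }
            let next = boundaries[bucket + 1];
            let split = if next < start + group_width { Some(next) } else { None };
            groups.push(Group { first_bucket: bucket, split });
        }
        GroupedBuckets { origin, end, group_width, groups }
    }

    /// O(1) lookup: truncate to the group, then compare against its split point.
    fn find_bucket(&self, ts: i64) -> Option<usize> {
        if ts < self.origin || ts >= self.end {
            return None;
        }
        let group = &self.groups[((ts - self.origin) / self.group_width) as usize];
        match group.split {
            Some(split) if ts >= split => Some(group.first_bucket + 1),
            _ => Some(group.first_bucket),
        }
    }
}
```

The trade-off in this sketch is that the group width is tied to the shortest bucket of the chosen calendar unit, so finer calendar units need proportionally more groups.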
@PSeitz PSeitz changed the title calendar_aware interval in datehistogram calendar_interval in datehistogram Jul 26, 2024
@PSeitz PSeitz self-assigned this Jul 26, 2024