Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to support efficient normalized ID retrieval #6061

Open
sffc opened this issue Feb 3, 2025 · 3 comments
Open

How to support efficient normalized ID retrieval #6061

sffc opened this issue Feb 3, 2025 · 3 comments
Labels
C-time-zone Component: Time Zones

Comments

@sffc
Copy link
Member

sffc commented Feb 3, 2025

See discussion in #5610. This thread is intended to discuss how we go about implementing this.

Currently (in 2.0-beta1), we have two data structs that look like the following:

pub struct IanaToBcp47MapV3<'data> {
    /// A map from normal-case IANA time zone identifiers to indexes of BCP-47 time zone
    /// identifiers along with a canonical flag. The IANA identifiers are normal-case.
    ///
    /// The `usize` values stored in the trie have the following form:
    ///
    /// - Lowest bit: 1 if canonical, 0 if not canonical
    /// - All remaining bits: index into `bcp47_ids`
    ///
    /// For example, in CLDR 44, `"Africa/Abidjan"` has value 221, which means it is canonical
    /// (low bit is 1 == odd number) and the index into `bcp47_ids` is 110 (221 >> 1).
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub map: ZeroAsciiIgnoreCaseTrie<ZeroVec<'data, u8>>,
    /// A sorted list of BCP-47 time zone identifiers.
    #[cfg_attr(feature = "serde", serde(borrow))]
    // Note: this is 9739B as `ZeroVec<TimeZoneBcp47Id>` (`ZeroVec<TinyStr8>`)
    // and 9335B as `VarZeroVec<str>`
    pub bcp47_ids: ZeroVec<'data, TimeZoneBcp47Id>,
    /// An XxHash64 checksum of [`Self::bcp47_ids`].
    pub bcp47_ids_checksum: u64,
}

pub struct Bcp47ToIanaMapV1<'data> {
    /// An XxHash64 checksum of [`IanaToBcp47MapV3::bcp47_ids`].
    ///
    /// The checksum here should match the checksum in [`IanaToBcp47MapV3`]
    /// if these were generated from the same data set.
    pub bcp47_ids_checksum: u64,
    /// The IANA time zone identifier corresponding to the BCP-47 ID in
    /// [`IanaToBcp47MapV3::bcp47_ids`].
    ///
    /// Since there can be more than one IANA identifier for a particular
    /// BCP-47 identifier, this list contains only the current canonical
    /// IANA identifier.
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub canonical_iana_ids: VarZeroVec<'data, str>,
}

With the requirement to support efficient (no-alloc) retrieval of normalized IDs, we need a new data model.

I still want the main entrypoint to be as small as possible, since this extra requirement of round-tripping normalized IANA IDs is primarily for ECMA-262 clients. So, I think IanaToBcp47MapV3 should remain the same overall.

For Bcp47ToIanaMapV1, I see two ways we could go:

  1. Store a Map<Iana, (Bcp47Index, Status)>, sorted by Iana, with enum Status { Ecma262Canonical, NonCanonical } (could also be a bool). Invariant: the map contains exactly one Status::Ecma262Canonical per Bcp47Index. Advantage: Simple, and could potentially be written to be independent of IanaToBcp47MapV3. Disadvantage: Does not support efficient lookup from Bcp47
  2. Store a Map<Bcp47Index, (Iana, Vec<Iana>)> where every Bcp47 has the canonical Iana and zero or more non-canonical Iana. Advantage: Invariants are more strictly encoded in the data. Disadvantage: More complex type.

If we can get (2) to work, it is probably the technically superior solution. Here is a zerovec for it:

pub struct Bcp47ToIanaMapV2 {
    pub bcp47_ids_checksum: u64,
    pub map: VarZeroVec<Tuple2VarULE<str, VarZeroSlice<str>>>,
}

where the indices in map are the same as the index originating from IanaToBcp47MapV3.

That struct looks a bit expensive to deserialize from postcard since it needs to do a lot of traversal to check all of the str slices. To make it slightly more efficient, we could store PotentialUtf8, and return None or Etc/Unknown if the validation fails.

Another model could be:

pub struct Bcp47ToIanaMapV2 {
    pub bcp47_ids_checksum: u64,
    pub map: ZeroVec<VarTupleULE(u16, VarZeroSlice<u16>)>,
    pub strings: VarZeroVec<str>,
}

where map returns indices into strings. This could be more efficient for validation since the entire strings vec could be validated in one go (all values concatenated together), but it is bigger since there is by definition no string deduplication. So I probably lean toward the previous model.

CC @robertbastian

@sffc sffc added the C-time-zone Component: Time Zones label Feb 3, 2025
@sffc sffc added this to the ICU4X 2.0 Stretch ⟨P2⟩ milestone Feb 3, 2025
@sffc
Copy link
Member Author

sffc commented Feb 3, 2025

Notes archive:

  • @robertbastian The "canonical IANA" is valid for only a range of TZDB versions. It requires frequent updates.

Shane's proposal:

struct TimeZoneId(pub TinyStr); // or something, this is bcp47

impl TimeZoneIdMapperBorrowed<'data> {
    /// Parses an IANA time zone ID to an ICU4X time zone.
    /// Returns `und` if the IANA time zone ID is not known.
    pub fn parse_iana(self, iana_id: &str) -> TimeZoneId
    /// Same as `parse_iana`, except:
    /// 1. Returns `None` if the IANA ID is not known
    /// 2. Returns the case-normalized IANA ID
    pub fn parse_and_normalize_iana<'s>(self, iana_id: &'s str) -> Option<(TimeZoneId, Cow<'s str>)>
    /// Same as `parse_iana`, except:
    /// 1. Returns `None` if the IANA ID is not known
    /// 2. Returns a valid ECMA-262 Primary IANA ID
    pub fn parse_as_ecma262_iana<'s>(self, iana_id: &'s str) -> Option<(TimeZoneId, Cow<'s str>)>
    /// Returns a valid ECMA-262 Primary IANA ID.
    pub fn get_ecma262_iana(self, time_zone_id: TimeZoneId) -> Option<String>
    /// Returns the time zone ID along with the IANA version in which it was added (for example, "2024b")
    pub fn get_iana_with_version(self, time_zone_id: TimeZoneId) -> Option<(String, TzdbVersion)>
}

impl TimeZoneIdMapperWithExtraDataBorrowed<'data> {
    pub fn parse_and_normalize_iana(self, iana_id: &str) -> Option<(TimeZoneId, &'data str)>
    pub fn parse_and_canonicalize_iana(self, iana_id: &str) -> Option<(TimeZoneId, &'data str)>
    pub fn get_iana(self, time_zone_id: TimeZoneId) -> Option<&'data str>
    pub fn get_iana_with_version(self, time_zone_id: TimeZoneId) -> Option<(&'data str, TzdbVersion)>
    pub fn get_iana_with_max_version(self, time_zone_id: TimeZoneId, max_version: TzdbVersion) -> Option<(&'data str, TzdbVersion)>
}

Data model:

  1. ZeroTrie<str, TimeZoneId> (multiple strings can return the same time zone ID)
  2. ZeroMap<str, { bcp: TimeZoneId, status: Status, )> (contains 600 entries)
    • enum Status { GenerallyAvailablePreviouslyCanonical, GenerallyAvailableCurrentlyCanonical, GenerallyAvailableNonCanonical, RecentNonCanonical, RecentCanonical }

Examples in 2022:

  • Europe/Belfast: GenerallyAvailableNonCanonical,
  • Europe/Kiev: GenerallyAvailablePreviouslyCanonical
  • Europe/Kyiv: RecentCanonical
  • Europe/London: GenerallyAvailableCurrentlyCanonical
  • PST8PDT: RecentNonCanonical

Invariants:

  1. A time zone has EITHER GenerallyAvailableCurrentlyCanonical OR GenerallyAvailablePreviouslyCanonical + RecentCanonical
  2. A time zone can have any number of GenerallyAvailableNonCanonical, RecentNonCanonical

In addition, other helper functions may be made available.

Rob's proposal:

Keep the current API:

  • we keep the normalization method, but it gains a big warning that normalization is not what you want: All of Europe/London, Europe/Belfast, europe/london, and europe/belfast are identical in both IANA and CLDR. ECMA-402 users can use this to do weird things, but generally this should not be used.
  • we keep the canonicalization API, but we drop the "canonicalization" phrase, because it's not version-agnostic.

@robertbastian
Copy link
Member

We can collapse

    pub map: VarZeroVec<Tuple2VarULE<str, VarZeroSlice<str>>>,

into

    pub map: VarZeroVec<Tuple2VarULE<bool, str>>,

The map would store the data like

(true, Europe/Kyiv),
(false, Europe/Kyiv),
(true, Europe/London),
(false, Europe/Belfast),
(false, Europe/Glasgow),

and I think that's smaller

@sffc
Copy link
Member Author

sffc commented Feb 3, 2025

@robertbastian That's more like my option 1. It has two problems:

  1. The index calculated from bcp47_ids is only one index, but this data model makes there be one or more entries per BCP-47 ID. Therefore, this type of model would need to include the BCP47 ID or index in the list.
  2. It doesn't enforce the invariant of one canonical ID per BCP47 ID.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-time-zone Component: Time Zones
Projects
None yet
Development

No branches or pull requests

2 participants