How to support efficient normalized ID retrieval #6061

sffc · 2025-02-03T14:26:53Z

See discussion in #5610. This thread is intended to discuss how we go about implementing this.

Currently (in 2.0-beta1), we have two data structs that look like the following:

pub struct IanaToBcp47MapV3<'data> {
    /// A map from normal-case IANA time zone identifiers to indexes of BCP-47 time zone
    /// identifiers along with a canonical flag. The IANA identifiers are normal-case.
    ///
    /// The `usize` values stored in the trie have the following form:
    ///
    /// - Lowest bit: 1 if canonical, 0 if not canonical
    /// - All remaining bits: index into `bcp47_ids`
    ///
    /// For example, in CLDR 44, `"Africa/Abidjan"` has value 221, which means it is canonical
    /// (low bit is 1 == odd number) and the index into `bcp47_ids` is 110 (221 >> 1).
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub map: ZeroAsciiIgnoreCaseTrie<ZeroVec<'data, u8>>,
    /// A sorted list of BCP-47 time zone identifiers.
    #[cfg_attr(feature = "serde", serde(borrow))]
    // Note: this is 9739B as `ZeroVec<TimeZoneBcp47Id>` (`ZeroVec<TinyStr8>`)
    // and 9335B as `VarZeroVec<str>`
    pub bcp47_ids: ZeroVec<'data, TimeZoneBcp47Id>,
    /// An XxHash64 checksum of [`Self::bcp47_ids`].
    pub bcp47_ids_checksum: u64,
}

pub struct Bcp47ToIanaMapV1<'data> {
    /// An XxHash64 checksum of [`IanaToBcp47MapV3::bcp47_ids`].
    ///
    /// The checksum here should match the checksum in [`IanaToBcp47MapV3`]
    /// if these were generated from the same data set.
    pub bcp47_ids_checksum: u64,
    /// The IANA time zone identifier corresponding to the BCP-47 ID in
    /// [`IanaToBcp47MapV3::bcp47_ids`].
    ///
    /// Since there can be more than one IANA identifier for a particular
    /// BCP-47 identifier, this list contains only the current canonical
    /// IANA identifier.
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub canonical_iana_ids: VarZeroVec<'data, str>,
}

With the requirement to support efficient (no-alloc) retrieval of normalized IDs, we need a new data model.

I still want the main entrypoint to be as small as possible, since this extra requirement of round-tripping normalized IANA IDs is primarily for ECMA-262 clients. So, I think IanaToBcp47MapV3 should remain the same overall.

For Bcp47ToIanaMapV1, I see two ways we could go:

Store a Map<Iana, (Bcp47Index, Status)>, sorted by Iana, with enum Status { Ecma262Canonical, NonCanonical } (could also be a bool). Invariant: the map contains exactly one Status::Ecma262Canonical per Bcp47Index. Advantage: Simple, and could potentially be written to be independent of IanaToBcp47MapV3. Disadvantage: Does not support efficient lookup from Bcp47
Store a Map<Bcp47Index, (Iana, Vec<Iana>)> where every Bcp47 has the canonical Iana and zero or more non-canonical Iana. Advantage: Invariants are more strictly encoded in the data. Disadvantage: More complex type.

If we can get (2) to work, it is probably the technically superior solution. Here is a zerovec for it:

pub struct Bcp47ToIanaMapV2 {
    pub bcp47_ids_checksum: u64,
    pub map: VarZeroVec<Tuple2VarULE<str, VarZeroSlice<str>>>,
}

where the indices in map are the same as the index originating from IanaToBcp47MapV3.

That struct looks a bit expensive to deserialize from postcard since it needs to do a lot of traversal to check all of the str slices. To make it slightly more efficient, we could store PotentialUtf8, and return None or Etc/Unknown if the validation fails.

Another model could be:

pub struct Bcp47ToIanaMapV2 {
    pub bcp47_ids_checksum: u64,
    pub map: ZeroVec<VarTupleULE(u16, VarZeroSlice<u16>)>,
    pub strings: VarZeroVec<str>,
}

where map returns indices into strings. This could be more efficient for validation since the entire strings vec could be validated in one go (all values concatenated together), but it is bigger since there is by definition no string deduplication. So I probably lean toward the previous model.

CC @robertbastian

The text was updated successfully, but these errors were encountered:

sffc · 2025-02-03T15:17:29Z

Notes archive:

@robertbastian The "canonical IANA" is valid for only a range of TZDB versions. It requires frequent updates.

Shane's proposal:

struct TimeZoneId(pub TinyStr); // or something, this is bcp47

impl TimeZoneIdMapperBorrowed<'data> {
    /// Parses an IANA time zone ID to an ICU4X time zone.
    /// Returns `und` if the IANA time zone ID is not known.
    pub fn parse_iana(self, iana_id: &str) -> TimeZoneId
    /// Same as `parse_iana`, except:
    /// 1. Returns `None` if the IANA ID is not known
    /// 2. Returns the case-normalized IANA ID
    pub fn parse_and_normalize_iana<'s>(self, iana_id: &'s str) -> Option<(TimeZoneId, Cow<'s str>)>
    /// Same as `parse_iana`, except:
    /// 1. Returns `None` if the IANA ID is not known
    /// 2. Returns a valid ECMA-262 Primary IANA ID
    pub fn parse_as_ecma262_iana<'s>(self, iana_id: &'s str) -> Option<(TimeZoneId, Cow<'s str>)>
    /// Returns a valid ECMA-262 Primary IANA ID.
    pub fn get_ecma262_iana(self, time_zone_id: TimeZoneId) -> Option<String>
    /// Returns the time zone ID along with the IANA version in which it was added (for example, "2024b")
    pub fn get_iana_with_version(self, time_zone_id: TimeZoneId) -> Option<(String, TzdbVersion)>
}

impl TimeZoneIdMapperWithExtraDataBorrowed<'data> {
    pub fn parse_and_normalize_iana(self, iana_id: &str) -> Option<(TimeZoneId, &'data str)>
    pub fn parse_and_canonicalize_iana(self, iana_id: &str) -> Option<(TimeZoneId, &'data str)>
    pub fn get_iana(self, time_zone_id: TimeZoneId) -> Option<&'data str>
    pub fn get_iana_with_version(self, time_zone_id: TimeZoneId) -> Option<(&'data str, TzdbVersion)>
    pub fn get_iana_with_max_version(self, time_zone_id: TimeZoneId, max_version: TzdbVersion) -> Option<(&'data str, TzdbVersion)>
}

Data model:

ZeroTrie<str, TimeZoneId> (multiple strings can return the same time zone ID)
ZeroMap<str, { bcp: TimeZoneId, status: Status, )> (contains 600 entries)
- enum Status { GenerallyAvailablePreviouslyCanonical, GenerallyAvailableCurrentlyCanonical, GenerallyAvailableNonCanonical, RecentNonCanonical, RecentCanonical }

Examples in 2022:

Europe/Belfast: GenerallyAvailableNonCanonical,
Europe/Kiev: GenerallyAvailablePreviouslyCanonical
Europe/Kyiv: RecentCanonical
Europe/London: GenerallyAvailableCurrentlyCanonical
PST8PDT: RecentNonCanonical

Invariants:

A time zone has EITHER GenerallyAvailableCurrentlyCanonical OR GenerallyAvailablePreviouslyCanonical + RecentCanonical
A time zone can have any number of GenerallyAvailableNonCanonical, RecentNonCanonical

In addition, other helper functions may be made available.

Rob's proposal:

Keep the current API:

we keep the normalization method, but it gains a big warning that normalization is not what you want: All of Europe/London, Europe/Belfast, europe/london, and europe/belfast are identical in both IANA and CLDR. ECMA-402 users can use this to do weird things, but generally this should not be used.
we keep the canonicalization API, but we drop the "canonicalization" phrase, because it's not version-agnostic.

robertbastian · 2025-02-03T15:31:46Z

We can collapse

    pub map: VarZeroVec<Tuple2VarULE<str, VarZeroSlice<str>>>,

into

    pub map: VarZeroVec<Tuple2VarULE<bool, str>>,

The map would store the data like

(true, Europe/Kyiv),
(false, Europe/Kyiv),
(true, Europe/London),
(false, Europe/Belfast),
(false, Europe/Glasgow),

and I think that's smaller

sffc · 2025-02-03T19:12:29Z

@robertbastian That's more like my option 1. It has two problems:

The index calculated from bcp47_ids is only one index, but this data model makes there be one or more entries per BCP-47 ID. Therefore, this type of model would need to include the BCP47 ID or index in the list.
It doesn't enforce the invariant of one canonical ID per BCP47 ID.

sffc added the C-time-zone Component: Time Zones label Feb 3, 2025

sffc added this to the ICU4X 2.0 Stretch ⟨P2⟩ milestone Feb 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to support efficient normalized ID retrieval #6061

How to support efficient normalized ID retrieval #6061

sffc commented Feb 3, 2025 •

edited

Loading

sffc commented Feb 3, 2025

robertbastian commented Feb 3, 2025

sffc commented Feb 3, 2025

How to support efficient normalized ID retrieval #6061

How to support efficient normalized ID retrieval #6061

Comments

sffc commented Feb 3, 2025 • edited Loading

sffc commented Feb 3, 2025

robertbastian commented Feb 3, 2025

sffc commented Feb 3, 2025

sffc commented Feb 3, 2025 •

edited

Loading