Specify language tag fallback support #17

Merged
merged 1 commit into from
Nov 27, 2024

domenic
Collaborator

@domenic domenic commented Nov 25, 2024

This relies on the infrastructure from ECMA-402 to give sensible answers about language support even in the presence of many subtags.

@aphillips, this is what I came up with after consulting with @sffc.

It has a couple of implementation-defined parts, namely the use of LookupMatchingLocaleByBestFit, and a similar operation when deciding how to allocate "base" languages between more-specific variants. (See the example given for Chinese.) Maybe the latter could be rephrased to use LookupMatchingLocaleByBestFit, to reduce this? Thoughts welcome.

My understanding is that this implementation-definedness is largely a function of everyone relying on ICU, which is not specified but which we've all kind of agreed to be fine with.

This doesn't fully solve the "language arcs" problem discussed in webmachinelearning/translation-api#11 in the context of translation. (And, I wouldn't want to close that issue until we have a full spec for translation anyway.) It's only for the summarizer API so far, which has the simpler question "is this single language supported?" The path to language arcs shouldn't be so hard from here, though.

The end result seems to be pretty reasonable. In particular, it should match ECMA-402 APIs. Since ECMA-402 allows me to do things like new Intl.Collator(["en-US-Braille-x-pirate"]) and get a resolved locale of en-US, or "ja-Bopo-BR" and get a resolved locale of ja, the proposal is that our AI APIs will do the same.
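
The ECMA-402 behavior described here can be checked directly in any engine that ships `Intl`. The sketch below is illustrative only: `resolvedCollatorLocale` is a hypothetical helper name, and the exact answers depend on the locale data the engine bundles.

```javascript
// Hypothetical helper: asks ECMA-402 which locale it actually resolves to
// for a requested language tag.
function resolvedCollatorLocale(tag) {
  return new Intl.Collator([tag]).resolvedOptions().locale;
}

for (const tag of ["en-US-Braille-x-pirate", "ja-Bopo-BR"]) {
  // The PR description reports en-US and ja for these two tags; other
  // engines or locale data sets may answer differently.
  console.log(`${tag} -> ${resolvedCollatorLocale(tag)}`);
}
```

The proposal is that the AI APIs resolve a requested tag the same way, so the extra subtags never cause a supported base language to be reported as unsupported.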



This relies on the infrastructure from ECMA-402 to give sensible answers about language support even in the presence of language subtags.
@domenic
Collaborator Author

domenic commented Nov 27, 2024

I'm going to merge this for now as I am doing some other spec restructuring and I want to put it on top of this. Regardless, any review or help is appreciated, even after merging.

@domenic domenic merged commit da6e057 into main Nov 27, 2024
2 checks passed
@domenic domenic deleted the language-tags branch November 27, 2024 06:49
@aphillips

This seems fine.

> My understanding is that this implementation-definedness is largely a function of everyone relying on ICU, which is not specified but which we've all kind of agreed to be fine with.

Yes, although ICU/CLDR is not necessarily everywhere.

> This doesn't fully solve the "language arcs" problem discussed in webmachinelearning/translation-api#11 in the context of translation.

It is different, although it has some similarities. This is a 1:1 matching problem (that is, a resource lookup problem), while language arcs have two sides to match (source and target). In this case, one has some text in a language and one wishes to use the best summarizer for it. Most language tag matching schemes match long tags to shorter ones (e.g. zh-Hant-MO-u-ca-islamic-hc-12 to zh-Hant), with some wiggle room for script subtags and the like.

However, you can also have shorter-to-longer matching, e.g. if your document is labeled fr and you have fr-FR and fr-CA summarizers and need to pick one. CLDR (and thus ICU) defines an addLikelySubtags mechanism (this also helps with zh-TW => zh-Hant-TW) which you might want to reference.
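
In JavaScript, the CLDR "add likely subtags" operation is exposed as `Intl.Locale.prototype.maximize()`. The following is a minimal sketch of shorter-to-longer matching built on it; `pickByLikelySubtags` is a hypothetical helper, not something the PR or CLDR defines, and the results depend on the CLDR data the engine ships.

```javascript
// Hypothetical helper: picks the available summarizer whose likely-subtags
// expansion matches that of the document's language tag.
function pickByLikelySubtags(documentTag, availableTags) {
  const wanted = new Intl.Locale(documentTag).maximize();
  const match = availableTags.find((tag) => {
    const candidate = new Intl.Locale(tag).maximize();
    return (
      candidate.language === wanted.language &&
      candidate.script === wanted.script &&
      candidate.region === wanted.region
    );
  });
  return match ?? null;
}

// With current CLDR likely-subtags data, zh-TW expands to zh-Hant-TW.
console.log(new Intl.Locale("zh-TW").maximize().toString());
// "fr" expands to fr-Latn-FR, so the fr-FR summarizer is chosen.
console.log(pickByLikelySubtags("fr", ["fr-CA", "fr-FR"]));
```

Requiring language, script, and region to match is deliberately strict; a real matcher would probably accept a partial match rather than return `null`.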

@domenic
Collaborator Author

domenic commented Nov 29, 2024

Thanks for the review!

> However, you can also have shorter-to-longer matching, e.g. if your document is labeled fr and you have fr-FR and fr-CA summarizers and need to pick one.

The way this is handled in the current PR is via the "language tag set completeness rules", which state that if you have fr-FR and fr-CA, you must also have a fr summarizer. We can assume implementations will meet this requirement by choosing one of the two existing ones to represent fr.

It gets trickier when you ask how we can ensure they pick the "correct" one of the two. (Which is probably fr-FR, right?) For that I have the following text, which I'm not 100% happy with; suggestions welcome:

> Append languageTag to either readilyAvailableLanguages or afterDownloadAvailableLanguages. Which of the two sets to append to is implementation-defined, and should be guided by considerations similar to that of LookupMatchingLocaleByBestFit in terms of keeping "best fallback languages" together.


> CLDR (and thus ICU) defines an addLikelySubtags mechanism (this also helps with zh-TW => zh-Hant-TW) which you might want to reference.

When I asked @sffc about this, he said:

> The BestFit matcher will inherit zh-TW from zh-Hant because that is a parent locale.

and so explicitly calling addLikelySubtags was not necessary. Do you think that's right?


1. Let |languageTag| be that language, represented as a BCP 47 language tag string. <span class="issue">Describe how to handle subtags.</span>
<div class="example" id="example-subtags-chinese">
A common setup seen in today's software is to support two types of written Chinese: "traditional Chinese" and "simplified Chinese". Let's suppose that the user agent supports summarizing text written in traditional Chinese readily, and simplified Chinese after a download.

Observation: The idea of "downloadable locales" is something I've proposed multiple times in different forms in ECMA-402, but it so far hasn't landed because of the impact it has on fingerprinting/privacy.


1. Set |availableLanguages|[|languageTag|] to "{{AICapabilityAvailability/readily}}".
One way this could be implemented would be for [=current summarizer language availabilities=] to return that « "`zh-Hant`" » is readily available, and « "`zh`", "`zh-Hans`" » is available after download. This return value conforms to the requirements of the [=language tag set completeness rules=], in ensuring that "`zh`" is present. Per <a class="allow-2119" href="#readily-or-after-download-implementation-defined">the "should"-level guidance</a>, the implementation has determined that "`zh`" belongs in the list of after-download available languages, with "`zh-Hans`", instead of in the list of readily available languages, with "`zh-Hant`".
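
The allocation in this example could be sketched as follows. This is one possible policy, not the spec's algorithm: `completeLanguageSets` is a hypothetical name, the use of `Intl.Locale.prototype.maximize()` to decide which variant a base language "belongs with" is an assumption, and only the bare base language is added.

```javascript
// Hypothetical sketch: given the specific tags an implementation supports,
// add any missing base language to one of the two sets.
function completeLanguageSets(readily, afterDownload) {
  const ready = new Set(readily);
  const later = new Set(afterDownload);
  // Iterate over a snapshot so the sets can grow inside the loop.
  for (const tag of [...ready, ...later]) {
    const base = new Intl.Locale(tag).language;
    if (ready.has(base) || later.has(base)) continue;
    // Assumed policy: the base language travels with a supported tag that
    // shares its likely script; otherwise with the tag that required it.
    const likelyScript = new Intl.Locale(base).maximize().script;
    const sameScript = (t) => {
      const locale = new Intl.Locale(t);
      return locale.language === base && locale.maximize().script === likelyScript;
    };
    if ([...ready].some(sameScript)) ready.add(base);
    else if ([...later].some(sameScript)) later.add(base);
    else (ready.has(tag) ? ready : later).add(base);
  }
  return [ready, later];
}

const [ready, later] = completeLanguageSets(["zh-Hant"], ["zh-Hans"]);
console.log([...ready], [...later]);
```

With current CLDR data, `zh` maximizes to a Hans locale, so this policy places `zh` alongside `zh-Hans` in the after-download set, matching the example.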

I see why you did this, but it seems like it shouldn't be required for zh-Hans to be supported just because zh-Hant is supported. I filed tc39/ecma402#947

Based on tc39/ecma402#947, would it be acceptable for the behavior here to be "zh implies zh-Hant if zh-Hant is available but not zh-Hans, but switches to zh-Hans if it is loaded later"?

Collaborator Author

I think the answer is yes, that is one possibility the spec allows. But the spec also allows always identifying zh with zh-Hans. And it vaguely encourages the latter, via

> Which of the two sets to append to is implementation-defined, and should be guided by considerations similar to that of LookupMatchingLocaleByBestFit in terms of keeping "best fallback languages" together.

In more detail, the spec requirements are:

  • Every language tag must be assigned one of three values: "readily", "after-download", or "no" (the default).
  • If x-Y is "readily", then x must be either "after-download" or "readily".
  • If x-Y is "after-download", then x must be either "after-download" or "readily".
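
The requirements above can be expressed as a small checker. This is a sketch under assumptions: `completenessViolations` is a hypothetical name, tags are assumed to be canonicalized already, and "x" is taken to mean the first subtag of "x-Y".

```javascript
// Hypothetical checker for the parent-language requirements above.
// `availability` is a Map from language tag to "readily", "after-download"
// or "no"; tags missing from the map default to "no".
function completenessViolations(availability) {
  const problems = [];
  for (const [tag, value] of availability) {
    if (value === "no" || !tag.includes("-")) continue;
    const base = tag.split("-")[0];
    const baseValue = availability.get(base) ?? "no";
    if (baseValue === "no") {
      problems.push(`${tag} is "${value}" but ${base} is "no"`);
    }
  }
  return problems;
}

const incomplete = new Map([["zh-Hant", "readily"]]);
console.log(completenessViolations(incomplete));
```

Note that the checker accepts both `zh` = "readily" and `zh` = "after-download" when `zh-Hant` is "readily" and `zh-Hans` is "after-download"; the rules constrain only that `zh` is not "no".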

So all of the following are possible:

  • zh-Hant is "readily" available, zh-Hans is "no". Then, the user agent fulfills the spec requirements by determining that zh is "readily" available, and uses the zh-Hant language pack.
  • zh-Hant is "readily" available, zh-Hans is "after-download". The user agent would prefer to treat zh as zh-Hans, so it reports "after-download" for zh, and will not use the zh-Hant language pack for zh.
  • zh-Hant is "readily" available, zh-Hans is "after-download". The user agent believes it is OK to identify zh with zh-Hant, despite the above-quoted spec sentence. So, the user agent reports zh as "readily" available, and uses the zh-Hant language pack for zh.
    • Extension of this example: maybe in another tab, someone causes zh-Hans to be downloaded. By this time, in the original tab, we have already resolved zh to zh-Hant, so nothing will change there. But if the user reloads the page, then the user agent might switch to choosing zh-Hans when given zh. Or it might not. It all depends on how they interpret that above-quoted sentence.

If you think disallowing this last version is a good idea, and can think of spec language that would disallow it that goes beyond what I quoted above, please let me know!

@@ -413,27 +441,66 @@ Every {{AISummarizerCapabilities}} has an <dfn for="AISummarizerCapabilities">av
</div>

<div algorithm>
-The <dfn>current summarizer language availability map</dfn> is given by the following steps. They return a [=map=] from strings representing BCP 47 language tags to {{AICapabilityAvailability}} values, or null. [[!RFC5646]]
+The <dfn>current summarizer language availabilities</dfn> are given by the following steps. They return a [=list=] containing two [=list/items=]; the items each are [=sets=] of strings representing [=Unicode canonicalized locale identifier=], or null. [[!ECMA-402]]

It's not clear to me whether this function is directly callable from client code, but in ECMA-402 we don't ever return a full list of available locales; instead, you give us a list and we filter the list. This solves a variety of issues including automatically handling fallback. See FilterLocales

Collaborator Author

Thanks for checking. Yeah, this is not directly callable. There is a languageAvailable(languageTag) method, which matches its argument against this list using LookupMatchingLocaleByBestFit.
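
A rough sketch of that testing-only shape follows. LookupMatchingLocaleByBestFit is implementation-defined, so the code substitutes plain right-to-left truncation as a stand-in; `lookupByPrefix` and `languageAvailableSketch` are hypothetical names, the two sets are passed in explicitly for illustration, and tags are assumed to be canonicalized.

```javascript
// Hypothetical stand-in for the implementation-defined
// LookupMatchingLocaleByBestFit: plain right-to-left truncation.
function lookupByPrefix(availableTags, requested) {
  let candidate = requested;
  while (true) {
    if (availableTags.has(candidate)) return candidate;
    const cut = candidate.lastIndexOf("-");
    if (cut === -1) return undefined;
    candidate = candidate.slice(0, cut);
    // Drop a dangling singleton such as "-u" or "-x".
    if (candidate.length >= 2 && candidate[candidate.length - 2] === "-") {
      candidate = candidate.slice(0, -2);
    }
  }
}

// Only a yes/no-style answer for one tag is exposed, never the full list.
function languageAvailableSketch(languageTag, readily, afterDownload) {
  const all = new Set([...readily, ...afterDownload]);
  const match = lookupByPrefix(all, languageTag);
  if (match === undefined) return "no";
  return readily.has(match) ? "readily" : "after-download";
}

const readily = new Set(["zh-Hant"]);
const afterDownload = new Set(["zh", "zh-Hans"]);
console.log(languageAvailableSketch("zh-Hant-MO-u-ca-islamic-hc-12", readily, afterDownload));
console.log(languageAvailableSketch("zh-CN", readily, afterDownload));
```

Matching against the union first, then asking which set holds the match, keeps a more specific tag such as `zh-Hans` from being answered by a less specific `zh` entry in the other set.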

(The design is changing slightly; see #22. But the principle of only exposing testing APIs remains.)

There are some speculative use cases for exposing a list of locales, which basically become "build me Google Translate using the browser's functionality". There you want a list of all supported translation source/target pairs. But we're reluctant to expose that for fingerprinting reasons, so it's currently not in any explainers. I'll be sure to circle back if we do end up with a strong need to expose that.

3 participants