What is the current status of locale support (particularly `en_US.UTF-8` and `C`)? #23010

akinomyoga · 2025-01-23T10:36:27Z


Although the default locale in Termux seems to be en_US.UTF-8, locale support in Termux appears to be incomplete. Here, I'd like to know the latest information about how it is incomplete (what is available and what is not) and how best to work around the resulting issues. There doesn't seem to be any official documentation about locale support.

Original issue

Suppose one wants to manipulate bytes in binary data (containing no NUL bytes) stored in a shell variable within a Bash script. Usually, one can achieve this by setting LC_CTYPE=C and then counting the bytes with ${#data} or accessing a single byte with ${data:index:1}. However, this doesn't seem to work in Termux, as the following example shows:

$ (LC_CTYPE=C; a=$'\xE3\x81\x82'; echo "${#a}")
1

Although we expect 3 as the result, Bash returns 1 in Termux. In all the other environments, 3 is obtained as expected.
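For the byte-counting case specifically, one possible workaround is to avoid relying on LC_CTYPE at all and let external tools that work on raw bytes do the counting. The following is only a minimal sketch: `byte_len` and `byte_at` are hypothetical helper names, and it assumes that `wc`, `od`, and `tr` are installed and accept the POSIX options used here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Bash or Termux): they operate on raw
# bytes via wc/od, so the result should not depend on the active locale.

# byte_len STRING: print the number of bytes in STRING.
byte_len() {
  printf '%s' "$1" | wc -c | tr -d '[:space:]'
}

# byte_at STRING OFFSET: print the byte at 0-based OFFSET as two hex digits.
byte_at() {
  printf '%s' "$1" | od -An -v -tx1 -j "$2" -N 1 | tr -d '[:space:]'
}

data=$'\xE3\x81\x82'
byte_len "$data"; echo     # prints 3
byte_at "$data" 1; echo    # prints 81
```

This forks several processes per call, so it is much slower than `${#data}` and `${data:index:1}` and is not a real substitute for a working C locale.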

Past discussions

There is an old discussion from 2020:

Does Termux has the 'C' locale? #5845

The issue asked whether the locale C is available. The answer was that Termux doesn't support locales. However, it is unclear what it means for no locales to be supported. If locales were literally not supported at all, many basic C APIs would be unavailable (e.g., printf, isalpha, tolower, and strftime all depend on the current locale). It is therefore reasonable to think that some locale is assumed for locale-dependent operations. Which one?

A Unix & Linux Stack Exchange question from 2021

How to detect that POSIX locale is not provided on POSIX shellscript and POSIX utilities, portablily? - Unix & Linux Stack Exchange

states that

Termux does never serve locale command nor POSIX locale; only en_US.UTF-8 is available

which implies that en_US.UTF-8 was introduced between 2020 and 2021.

There is also a comment in a discussion from 2022:

Postgres - add collation #5996 (comment)

The comment says

Bionic libc does not support locales other than UTF-8 or C (POSIX).

which contradicts the first piece of information, from 2020. Does this mean that Termux/Bionic introduced some support for the en_US.UTF-8 and C locales between 2020 and 2022?

However, as of 2025, the locale C is incomplete as illustrated in the first example. Even for en_US.UTF-8, another issue from 2023 reports that en_US.UTF-8 is unsupported (or not complete enough to pass the tests):

Configure the system locale to please `nvim +checkhealth` within `tmux`: Locale does not support UTF-8 termux-app#3187

Those four statements in past discussions aren't really consistent with each other, so I think some (or all) of them are untrustworthy. If all of them are somewhat correct, I guess it would mean that Termux supports neither en_US.UTF-8 nor C but an unspecified amalgam of the two. Or it might be switching back and forth between en_US.UTF-8 and C every year.
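To compare such reports against actual behavior, a small probe along the following lines could be run in each environment. This is a sketch under assumptions: `char_count_in_locale` is a hypothetical helper name, and the list of locale names is only a set of candidates taken from the discussions above. A result of 3 means Bash counted bytes and 1 means it decoded the sequence as a single UTF-8 character. A name the system does not provide may silently fall back to another locale, so the output shows how Bash behaves, not which locales exist.

```shell
#!/usr/bin/env bash
# char_count_in_locale NAME: print ${#s} for a 3-byte UTF-8 sequence after
# setting LC_ALL=NAME. The body runs in a subshell (note the parentheses),
# so the caller's locale settings are left untouched.
char_count_in_locale() (
  # Bash may warn on stderr if the locale cannot be set; hide that here.
  { LC_ALL=$1; } 2>/dev/null
  s=$'\xE3\x81\x82'
  printf '%s\n' "${#s}"
)

# Candidate names only; any of them may be unavailable on a given system.
for name in C POSIX C.UTF-8 en_US.UTF-8; do
  printf '%-12s %s\n' "$name" "$(char_count_in_locale "$name")"
done
```

LC_ALL is used here instead of LC_CTYPE so that an LC_ALL value already present in the environment cannot override the setting being tested.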

Bionic libc

The third source mentions Bionic libc, so can I assume that Termux packages use Bionic as the C standard library? I also tried to look up information on Bionic itself. However, Bionic doesn't seem to have a place to report issues or ask questions. Instead, I found the following comment in /libc/bionic/locale.cpp of the Bionic codebase:

// We only support two locales, the "C" locale (also known as "POSIX"),
// and the "C.UTF-8" locale (also known as "en_US.UTF-8").

This seems to imply that Bionic supports both C and en_US.UTF-8 (a synonym of C.UTF-8) separately. This comment has existed at least since 2016, which is inconsistent with the observation above.

I also found a mention of locales in the documentation (boldfaced by me):

Locales. Although bionic contains the various _l() functions, the only locale supported is a UTF-8 C/POSIX locale. Most of the POSIX APIs are insufficient to support the wide range of languages used by Android users, and apps should use icu4c (or do their i18n work in Java) instead.

This part of the documentation seems to have been introduced by commit aosp-mirror/platform_bionic@046fe15, whose commit message says

Explicitly mention bionic's single C.UTF-8 locale.

So it seems to imply that Bionic actually supports only en_US.UTF-8 (a synonym of C.UTF-8). If this is true, it seems to me that the first piece of information, the code comment in Bionic's locale.cpp, would be wrong. Or the support for the C locale might have been dropped at some point between 2015 and 2022.

I'm not sure which information to believe. In either case, the behavior is not consistent with the past reports for Termux. One possibility is that upstream Bionic and the Bionic used by Termux are actually different versions. Another is that Termux uses Bionic only partially, and the locale part has extensions or modifications.

Timeline

To summarize the timeline, we could make the following table for the locale support:

| Year | Termux | Bionic |
|------|--------|--------|
| 2015 | | `C` and `en_US.UTF-8` |
| 2020 | None | |
| 2021 | `en_US.UTF-8` | |
| 2022 | `C` or `en_US.UTF-8` | `en_US.UTF-8` |
| 2023 | ~~`en_US.UTF-8`~~ (broken) | |
| 2025 | ~~`C`~~ (broken) | |

Every piece of information is inconsistent with the others, so I'm confused about which of them is really trustworthy and about the relationship between the C library used in Termux packages and upstream Bionic.

Questions

What is the true state of locale support? In particular, I'd like trustworthy, definite answers rather than guesses like those in the contradictory information above.
What is the relation between the C library used in Termux and Bionic?
Can I use the C locale (which is separate from C.UTF-8 / en_US.UTF-8)? If not, would it be supported in the future?
