gh-126505: Do not use Unicode case folding in ASCII regexes #126544

jirkamarsik · 2024-11-07T14:25:43Z

When a pattern is being compiled in _compiler.py's optimize_charset, the RANGE opcode is translated into the RANGE_UNI_IGNORE opcode. This should be done only in regexes which set the Unicode flag, otherwise we get Unicode case folding behavior in regexes which set the ASCII or Locale mode flags.

The correct way to check for Unicode mode in optimize_charset is if fixes:, because the fixes argument is None in ASCII and Locale modes and a dict in Unicode mode. The code currently uses the condition if fixup:, but fixup is None only in Locale mode; it is a function in both ASCII and Unicode modes. As a result, the replacement also happens in ASCII mode: for character sets that include characters outside the Basic Multilingual Plane (the code path taken the second time an IndexError is raised in optimize_charset), the RANGE opcode is translated to RANGE_UNI_IGNORE.
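A minimal sketch of how the difference can be observed from the public re API. The pattern and subject character below are illustrative choices of mine, not taken from the PR or its tests: the range has a non-BMP upper bound, and 'ÿ' (U+00FF) sits just below the range while its Unicode uppercase 'Ÿ' (U+0178) falls inside it. In Unicode mode the case-insensitive match is expected to succeed. In ASCII mode it should fail, since no ASCII case folding applies to 'ÿ'; an interpreter affected by this issue may report a match instead.

```python
import re

# Illustrative range with a non-BMP upper bound (U+10000), which is the
# situation described above.
PATTERN = "[\u0100-\U00010000]"

# 'ÿ' (U+00FF) is outside the range, but 'ÿ'.upper() is 'Ÿ' (U+0178),
# which is inside it.
SUBJECT = "\u00ff"


def matches(flags):
    """Return True if SUBJECT fully matches PATTERN under the given flags."""
    return re.fullmatch(PATTERN, SUBJECT, flags) is not None


# Unicode mode is the default for str patterns.
unicode_result = matches(re.IGNORECASE)
# ASCII mode: only ASCII letters should be case-folded.
ascii_result = matches(re.IGNORECASE | re.ASCII)

print("Unicode mode:", unicode_result)
print("ASCII mode:  ", ascii_result)
```

If the ASCII-mode line prints True, the interpreter is applying Unicode case folding despite the ASCII flag.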

Issue: SRE ignores the ASCII flag on character ranges with non-BMP upper bound #126505

When an ASCII regex would use a character range that exceeds the bounds of the basic multilingual plane, it would be compiled into an opcode that performs Unicode case folding. Now, only Unicode regexes can use the Unicode-specific case folding opcode.

cpython-cla-bot · 2024-11-07T14:25:46Z

The following commit authors need to sign the Contributor License Agreement:

[email protected]

bedevere-app · 2024-11-07T14:25:47Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

ZeroIntensity

Thanks for contributing! This needs a NEWS entry, as it's a user-facing bug, and you'll also need to sign the CLA.

ZeroIntensity · 2024-11-07T15:19:53Z

Lib/test/test_re.py

+        # gh-126505
+        # should match in Unicode mode

Suggested change

-        # gh-126505
-        # should match in Unicode mode
+        # GH-126505: should match in Unicode mode

vstinner · 2024-11-07T15:26:31Z

cc @serhiy-storchaka

jirkamarsik · 2024-11-07T15:41:01Z

Closing this Pull Request in favor of @serhiy-storchaka's upcoming fix.
#126505 (comment)

bedevere-app bot added the awaiting review label Nov 7, 2024

bedevere-app bot mentioned this pull request Nov 7, 2024

SRE ignores the ASCII flag on character ranges with non-BMP upper bound #126505

Closed

ZeroIntensity reviewed Nov 7, 2024

ZeroIntensity added the topic-regex, needs backport to 3.12 (bug and security fixes), and needs backport to 3.13 (bugs and security fixes) labels Nov 7, 2024

jirkamarsik closed this Nov 7, 2024
