gh-126505: Do not use Unicode case folding in ASCII regexes #126544
When an ASCII regex would use a character range that exceeds the bounds of the basic multilingual plane, it would be compiled into an opcode that performs Unicode case folding. Now, only Unicode regexes can use the Unicode-specific case folding opcode.
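As a point of reference for what "Unicode case folding" means here, the following minimal sketch shows the intended difference between the two modes on an ordinary ASCII range. It does not reproduce the bug itself, which per the description needs a range that reaches outside the basic multilingual plane. The constant name `KELVIN_SIGN` is only illustrative.

```python
import re

KELVIN_SIGN = "\u212a"  # U+212A, a non-ASCII character whose lowercase form is "k"

# Unicode mode (the default for str patterns): case-insensitive matching
# lets the Kelvin sign match the ASCII range.
unicode_match = re.fullmatch(r"[a-z]", KELVIN_SIGN, re.IGNORECASE)

# ASCII mode: only the letters a-z and A-Z are expected to match.
ascii_match = re.fullmatch(r"[a-z]", KELVIN_SIGN, re.IGNORECASE | re.ASCII)

print(unicode_match is not None)  # True
print(ascii_match is not None)    # False
```

The bug discussed in this pull request is that an ASCII-mode pattern could end up with the first behavior instead of the second.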
The following commit authors need to sign the Contributor License Agreement: |
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Thanks for contributing! This needs a NEWS entry, as it's a user-facing bug, and you'll also need to sign the CLA.
# gh-126505
# should match in Unicode mode
- # gh-126505
- # should match in Unicode mode
+ # GH-126505: should match in Unicode mode
Closing this Pull Request in favor of @serhiy-storchaka's upcoming fix.
When a pattern is being compiled in `_compiler.py`'s `optimize_charset`, the `RANGE` opcode is translated into the `RANGE_UNI_IGNORE` opcode. This should be done only in regexes which set the Unicode flag, otherwise we get Unicode case folding behavior in regexes which set the ASCII or Locale mode flags.

The correct way to check for Unicode mode in `optimize_charset` would be to check `if fixes:`, because the `fixes` argument is `None` in ASCII and Locale modes and a `dict` in Unicode mode. The code currently uses the condition `if fixup:`, but `fixup` is `None` only in Locale mode and it is a function in both ASCII and Unicode mode. This means that this replacement is used in ASCII mode too and the `RANGE` opcode is translated to a `RANGE_UNI_IGNORE` opcode for character sets which include characters outside of the basic multilingual plane (the second time an `IndexError` is raised in `optimize_charset`).
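The difference between the two conditions can be sketched with a toy model. This is not the code from `_compiler.py`: the helper names `helpers_for` and `translates_range`, the mode strings, and the stand-in values for `fixup` and `fixes` are all assumptions made for illustration. Only the `None`/function/`dict` pattern per mode is taken from the description above.

```python
def helpers_for(mode):
    """Return a (fixup, fixes) pair shaped like the description says.

    Hypothetical stand-ins: str.lower plays the role of the lowering
    function, and the small dict plays the role of the case-fix table.
    """
    if mode == "locale":
        return None, None
    if mode == "ascii":
        return str.lower, None
    if mode == "unicode":
        return str.lower, {0x73: (0x17F,)}  # stand-in entry, not the real table
    raise ValueError(f"unknown mode: {mode!r}")


def translates_range(mode, condition):
    """Report whether RANGE would be rewritten to RANGE_UNI_IGNORE."""
    fixup, fixes = helpers_for(mode)
    if condition == "fixup":  # the condition the description calls incorrect
        return bool(fixup)
    if condition == "fixes":  # the condition the description proposes
        return bool(fixes)
    raise ValueError(f"unknown condition: {condition!r}")


for mode in ("ascii", "locale", "unicode"):
    print(mode, translates_range(mode, "fixup"), translates_range(mode, "fixes"))
```

In this model, testing `fixup` selects the Unicode opcode in both ASCII and Unicode modes, while testing `fixes` selects it in Unicode mode only, which is the behavior the description asks for.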