SRE ignores the ASCII flag on character ranges with non-BMP upper bound #126505

Closed
djoooooe opened this issue Nov 6, 2024 · 4 comments
Labels
3.12 (bugs and security fixes), 3.13 (bugs and security fixes), 3.14 (new features, bugs and security fixes), extension-modules (C modules in the Modules dir), topic-regex, type-bug (An unexpected behavior, bug, or error)

Comments

djoooooe commented Nov 6, 2024

Bug report

Bug description:

It seems like SRE ignores the ASCII flag when parsing a character range whose upper bound is beyond the BMP region:

>>> import re

# should match
>>> regex = re.compile("[\ua7aa-\uffff]", re.IGNORECASE)
>>> print(regex.match("\u0266"))
<re.Match object; span=(0, 1), match='ɦ'> 

# should not match
>>> regex = re.compile("[\ua7aa-\U00010000]", re.ASCII | re.IGNORECASE)
>>> print(regex.match("\u0266"))
<re.Match object; span=(0, 1), match='ɦ'>

# must be related to case folding, since \ua7aa folds to \u0266
>>> regex = re.compile("[\ua7ab-\U00010000]", re.ASCII | re.IGNORECASE)
>>> print(regex.match("\u0266"))
None

# correct behavior when upper bound is in BMP
>>> regex = re.compile("[\ua7aa-\uffff]", re.ASCII | re.IGNORECASE)
>>> print(regex.match("\u0266"))
None
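
The case mapping behind the third example can be checked independently of the regex engine. The following is a minimal sketch: the helper name `ascii_flag_ignored` is hypothetical, and its result depends on whether the interpreter running it already contains a fix, so no expected value is given for it.

```python
import re

# U+A7AA (LATIN CAPITAL LETTER H WITH HOOK) lower-cases to U+0266,
# which is why moving the lower bound to U+A7AB changes the result above.
print("\ua7aa".lower() == "\u0266")


def ascii_flag_ignored() -> bool:
    """Return True if this interpreter shows the behavior reported above.

    Hypothetical helper: under re.ASCII | re.IGNORECASE, a non-ASCII
    character such as U+0266 should not match case-insensitively.
    """
    pattern = re.compile("[\ua7aa-\U00010000]", re.ASCII | re.IGNORECASE)
    return pattern.match("\u0266") is not None


print("affected:", ascii_flag_ignored())
```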

CPython versions tested on:

3.12

Operating systems tested on:

Linux

@djoooooe djoooooe added the type-bug An unexpected behavior, bug, or error label Nov 6, 2024
@picnixz picnixz added extension-modules C modules in the Modules dir topic-regex labels Nov 6, 2024
@jirkamarsik

I think this could be caused by this line:

if fixup:

When the pattern is compiled, _optimize_charset in _compiler.py translates the RANGE opcode into the RANGE_UNI_IGNORE opcode. This should probably happen only in Unicode mode. The correct way to check for Unicode mode in that function would be the condition if fixes:, because the fixes argument is None in ASCII and Locale mode and a dict in Unicode mode. The code currently uses the condition if fixup: instead, but fixup is None only in Locale mode and is a function in both ASCII and Unicode mode, so the replacement is applied in ASCII mode too.
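
To make the difference between the two conditions concrete, here is a small sketch that paraphrases the description in this comment. It is not the actual _compiler.py source: `case_helpers`, `ascii_lower`, and `unicode_lower` are hypothetical stand-ins, and the single entry in the `fixes` dict is illustrative only.

```python
import re


def ascii_lower(code: int) -> int:
    """Lower-case an ASCII code point; leave everything else unchanged."""
    return code + 32 if ord("A") <= code <= ord("Z") else code


def unicode_lower(code: int) -> int:
    """Stand-in for a Unicode-aware lower-casing function."""
    lowered = chr(code).lower()
    # Keep the original code point if lower-casing expands to several characters.
    return ord(lowered) if len(lowered) == 1 else code


def case_helpers(flags: int):
    """Return (fixup, fixes) as the comment describes them for each mode."""
    fixup = fixes = None
    if flags & re.IGNORECASE and not flags & re.LOCALE:
        if flags & re.UNICODE:
            fixup = unicode_lower
            fixes = {0x73: (0x17F,)}  # illustrative entry, not the real table
        else:
            fixup = ascii_lower  # ASCII mode: fixup is set, fixes stays None
    return fixup, fixes


modes = [
    ("unicode", re.I | re.U),
    ("ascii", re.I | re.A),
    ("locale", re.I | re.L),
]
for name, flags in modes:
    fixup, fixes = case_helpers(flags)
    print(name, "if fixup:", bool(fixup), "if fixes:", bool(fixes))
```

In this model, only `if fixes:` separates Unicode mode from ASCII mode; `if fixup:` is true for both.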

@picnixz
Contributor

picnixz commented Nov 7, 2024

cc @serhiy-storchaka as the RE expert

@serhiy-storchaka
Member

Thank you for your report @jirkamarsik. Actually, this is a more complex issue.

Currently, the results of the following examples do not depend on the upper bound, as long as it is large enough.

>>> import re
>>> re.match(r'[N-\uffff]', 'A', re.I|re.A)
<re.Match object; span=(0, 1), match='A'>
>>> re.match(r'[n-\uffff]', 'Z', re.I|re.A)
<re.Match object; span=(0, 1), match='Z'>
>>> re.match(r'[N-\U00010000]', 'A', re.I|re.A)
<re.Match object; span=(0, 1), match='A'>
>>> re.match(r'[n-\U00010000]', 'Z', re.I|re.A)
<re.Match object; span=(0, 1), match='Z'>

But with the proposed fix, the last two matches would return None.
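
One way to see why all four matches are legitimate is that each range contains the lower-case ASCII counterpart of the subject character. The sketch below uses a pure-Python reference model of ASCII-only case-insensitive range matching. The function `ascii_icase_in_range` is a hypothetical helper for illustration and is not how SRE implements matching; the `re` column is printed without an expected value because it depends on the interpreter version.

```python
import re


def ascii_icase_in_range(ch: str, lo: str, hi: str) -> bool:
    """Model: does ch match [lo-hi] when only ASCII letters fold case?"""
    candidates = {ch}
    if ch.isascii():
        # Only ASCII characters get case variants in ASCII mode.
        candidates.update((ch.lower(), ch.upper()))
    return any(lo <= c <= hi for c in candidates)


cases = [
    ("N", "\uffff", "A"),
    ("n", "\uffff", "Z"),
    ("N", "\U00010000", "A"),
    ("n", "\U00010000", "Z"),
    ("\ua7aa", "\U00010000", "\u0266"),  # the case from the original report
]
for lo, hi, ch in cases:
    model = ascii_icase_in_range(ch, lo, hi)
    actual = re.match(f"[{lo}-{hi}]", ch, re.I | re.A) is not None
    print(f"[U+{ord(lo):04X}-U+{ord(hi):04X}] vs U+{ord(ch):04X}: "
          f"model={model} re={actual}")
```

Under this model the first four cases match and the last one does not, which is the behavior a fix would need to preserve.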

I am working on this.

@serhiy-storchaka serhiy-storchaka self-assigned this Nov 7, 2024
@serhiy-storchaka serhiy-storchaka added 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Nov 7, 2024
@serhiy-storchaka
Member

I have also found another bug: re.match(r'[19\U00010400]', '\U00010400', re.I) returns None.
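
For context, U+10400 is an upper-case letter outside the BMP whose lower-case form is U+10428, so a case-insensitive character class containing it would be expected to match the character itself. A minimal sketch follows; the helper name `upper_nonbmp_ignored` is hypothetical, and its result depends on the interpreter version.

```python
import re

upper = "\U00010400"  # DESERET CAPITAL LETTER LONG I
lower = "\U00010428"  # DESERET SMALL LETTER LONG I

# The two characters are a simple one-to-one case pair.
print(upper.lower() == lower, lower.upper() == upper)


def upper_nonbmp_ignored() -> bool:
    """Return True if the class fails to match its own upper-case member."""
    pattern = re.compile(r"[19\U00010400]", re.IGNORECASE)
    return pattern.match(upper) is None


print("affected:", upper_nonbmp_ignored())
```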

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Nov 7, 2024
…sses

* upper-case non-BMP character was ignored
* the ASCII flag was ignored when matching a character range whose
  upper bound is beyond the BMP region
serhiy-storchaka added a commit that referenced this issue Nov 11, 2024
…H-126557)

* upper-case non-BMP character was ignored
* the ASCII flag was ignored when matching a character range whose
  upper bound is beyond the BMP region
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Nov 11, 2024
…sses (pythonGH-126557)

* upper-case non-BMP character was ignored
* the ASCII flag was ignored when matching a character range whose
  upper bound is beyond the BMP region
(cherry picked from commit 819830f)

Co-authored-by: Serhiy Storchaka
serhiy-storchaka added a commit that referenced this issue Nov 11, 2024
…asses (GH-126557) (GH-126690)

* upper-case non-BMP character was ignored
* the ASCII flag was ignored when matching a character range whose
  upper bound is beyond the BMP region
(cherry picked from commit 819830f)

Co-authored-by: Serhiy Storchaka
serhiy-storchaka added a commit that referenced this issue Nov 11, 2024
…asses (GH-126557) (GH-126689)

* upper-case non-BMP character was ignored
* the ASCII flag was ignored when matching a character range whose
  upper bound is beyond the BMP region
(cherry picked from commit 819830f)

Co-authored-by: Serhiy Storchaka
4 participants