SRE ignores the ASCII flag on character ranges with non-BMP upper bound #126505

djoooooe · 2024-11-06T14:28:25Z

Bug report

Bug description:

It seems like SRE ignores the ASCII flag when parsing a character range whose upper bound is beyond the BMP region:

>>> import re

# should match
>>> regex = re.compile("[\ua7aa-\uffff]", re.IGNORECASE)
>>> print(regex.match("\u0266"))
<re.Match object; span=(0, 1), match='ɦ'> 

# should not match
>>> regex = re.compile("[\ua7aa-\U00010000]", re.ASCII | re.IGNORECASE)
>>> print(regex.match("\u0266"))
<re.Match object; span=(0, 1), match='ɦ'>

# must be related to case folding, since \ua7aa folds to \u0266
>>> regex = re.compile("[\ua7ab-\U00010000]", re.ASCII | re.IGNORECASE)
>>> print(regex.match("\u0266"))
None

# correct behavior when upper bound is in BMP
>>> regex = re.compile("[\ua7aa-\uffff]", re.ASCII | re.IGNORECASE)
>>> print(regex.match("\u0266"))
None
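The case-mapping relationship mentioned in the comments above can be checked directly with str.lower(). The following is only a sketch: the helper name probe is made up for illustration, and the result printed for the non-BMP range depends on whether the running interpreter already includes a fix.

```python
import re

# U+A7AA (LATIN CAPITAL LETTER H WITH HOOK) lowercases to U+0266,
# while the next code point, U+A7AB, lowercases to U+025C instead.
print("\ua7aa".lower() == "\u0266")
print("\ua7ab".lower() == "\u025c")


def probe(pattern, flags, text="\u0266"):
    """Hypothetical helper: report whether `pattern` matches `text`."""
    return re.compile(pattern, flags).match(text) is not None


# With re.ASCII, only ASCII letters should be matched case-insensitively,
# so neither range below is expected to match U+0266.
for upper in ("\uffff", "\U00010000"):
    pattern = "[\ua7aa-" + upper + "]"
    print(ascii(pattern), probe(pattern, re.ASCII | re.IGNORECASE))
```

If the second range prints True, the interpreter shows the behavior reported here.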

CPython versions tested on:

3.12

Operating systems tested on:

Linux

Linked PRs

jirkamarsik · 2024-11-06T22:00:03Z

I think this could be caused by this line:

cpython/Lib/re/_compiler.py, line 301 in a1c57bc:

if fixup:

When the pattern is compiled, optimize_charset in _compiler.py translates the RANGE opcode into the RANGE_UNI_IGNORE opcode. This should probably happen only in Unicode mode. The correct check for Unicode mode in that function would be "if fixes:", because the fixes argument is None in ASCII and locale mode and a dict in Unicode mode. The code currently uses "if fixup:", but fixup is None only in locale mode; it is a function in both ASCII and Unicode mode. As a result, the replacement is applied in ASCII mode too.
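To make the difference between the two conditions concrete, here is a simplified model of what the paragraph above describes. It is not the CPython source: case_helpers, the mode strings, and the placeholder dict entry are all assumptions made for illustration.

```python
def ascii_lower(ch):
    """Lowercase ASCII letters only; leave everything else unchanged."""
    return ch.lower() if ch.isascii() else ch


def case_helpers(mode):
    """Hypothetical stand-in for how (fixup, fixes) vary by mode.

    Follows the description in the comment above, not the real code.
    """
    if mode == "locale":
        return None, None
    if mode == "ascii":
        return ascii_lower, None
    if mode == "unicode":
        # Placeholder entry; the real dict of extra mappings is larger.
        return str.lower, {ord("i"): (0x131,)}
    raise ValueError(mode)


for mode in ("locale", "ascii", "unicode"):
    fixup, fixes = case_helpers(mode)
    print(mode, "| if fixup:", bool(fixup), "| if fixes:", bool(fixes))
```

In this model only "if fixes:" separates Unicode mode from ASCII mode, which is the point of the suggestion.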

picnixz · 2024-11-07T11:45:43Z

cc @serhiy-storchaka as the RE expert

serhiy-storchaka · 2024-11-07T15:27:43Z

Thank you for your report @jirkamarsik. Actually, this is a more complex issue.

Currently, the result in the following examples does not depend on the upper bound if it is large enough.

>>> import re
>>> re.match(r'[N-\uffff]', 'A', re.I|re.A)
<re.Match object; span=(0, 1), match='A'>
>>> re.match(r'[n-\uffff]', 'Z', re.I|re.A)
<re.Match object; span=(0, 1), match='Z'>
>>> re.match(r'[N-\U00010000]', 'A', re.I|re.A)
<re.Match object; span=(0, 1), match='A'>
>>> re.match(r'[n-\U00010000]', 'Z', re.I|re.A)
<re.Match object; span=(0, 1), match='Z'>

But with the proposed fix the last two matches will return None.

I am working on this.
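The four matches above are expected under ASCII case-insensitive matching, because each range contains the lower-case counterpart of the tested letter. A rough reference model, assuming that a character matches when it or its ASCII case counterpart falls inside the range (ascii_icase_in_range is a name invented here, not part of re):

```python
def ascii_icase_in_range(lo, hi, ch):
    """Reference model: does `ch` match [lo-hi] under re.I | re.A?

    Only the 52 ASCII letters have a case counterpart in this model.
    """
    candidates = {ch}
    if ch.isascii() and ch.isalpha():
        candidates.add(ch.swapcase())
    return any(lo <= c <= hi for c in candidates)


for lo, hi, ch in [
    ("N", "\uffff", "A"),
    ("n", "\uffff", "Z"),
    ("N", "\U00010000", "A"),
    ("n", "\U00010000", "Z"),
    ("\ua7aa", "\U00010000", "\u0266"),
]:
    print(ascii(lo), ascii(hi), ascii(ch), ascii_icase_in_range(lo, hi, ch))
```

Under this model the first four cases match regardless of the upper bound, and the last one (the case from the original report) does not.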

serhiy-storchaka · 2024-11-07T20:19:58Z

I have also found another bug: re.match(r'[19\U00010400]', '\U00010400', re.I) returns None.
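A short way to check this second case on a given interpreter is sketched below. It assumes only that a literal member of a character class should match itself; the last printed value depends on whether the interpreter includes the fix.

```python
import re

upper = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP
print(upper.isupper(), ord(upper) > 0xFFFF)
print(ascii(upper.lower()))  # its lower-case counterpart

# Without re.I the class matches its own member.
print(re.match(r'[19\U00010400]', upper) is not None)
# With re.I it should match as well; None here reproduces the bug.
print(re.match(r'[19\U00010400]', upper, re.I) is not None)
```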

…sses * upper-case non-BMP character was ignored * the ASCII flag was ignored when matching a character range whose upper bound is beyond the BMP region

…H-126557) * upper-case non-BMP character was ignored * the ASCII flag was ignored when matching a character range whose upper bound is beyond the BMP region

…sses (pythonGH-126557) * upper-case non-BMP character was ignored * the ASCII flag was ignored when matching a character range whose upper bound is beyond the BMP region (cherry picked from commit 819830f) Co-authored-by: Serhiy Storchaka <[email protected]>

…asses (GH-126557) (GH-126690) * upper-case non-BMP character was ignored * the ASCII flag was ignored when matching a character range whose upper bound is beyond the BMP region (cherry picked from commit 819830f) Co-authored-by: Serhiy Storchaka <[email protected]>

…asses (GH-126557) (GH-126689) * upper-case non-BMP character was ignored * the ASCII flag was ignored when matching a character range whose upper bound is beyond the BMP region (cherry picked from commit 819830f) Co-authored-by: Serhiy Storchaka <[email protected]>

djoooooe added the type-bug An unexpected behavior, bug, or error label Nov 6, 2024

picnixz added extension-modules C modules in the Modules dir topic-regex labels Nov 6, 2024

bedevere-app bot mentioned this issue Nov 7, 2024

gh-126505: Do not use Unicode case folding in ASCII regexes #126544

Closed

serhiy-storchaka self-assigned this Nov 7, 2024

serhiy-storchaka added 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Nov 7, 2024

bedevere-app bot mentioned this issue Nov 7, 2024

gh-126505: Fix bugs in compiling case-insensitive character classes #126557

Merged

This was referenced Nov 11, 2024

[3.13] gh-126505: Fix bugs in compiling case-insensitive character classes (GH-126557) #126689

Merged

[3.12] gh-126505: Fix bugs in compiling case-insensitive character classes (GH-126557) #126690

Merged

serhiy-storchaka closed this as completed Nov 11, 2024
