
Poor performances on CPP language #893

Open
3 of 5 tasks
gregberge opened this issue Jan 16, 2025 · 24 comments

Comments

@gregberge

Describe the bug

It takes ~10s on my MacBook M1 to highlight the following code with the JS RegExp engine, Shiki v1.27.2.

TMap<FString, FString> WsUpgradeHeaders;
WsUpgradeHeaders.Add(TEXT("Host"), TEXT("127.0.0.1:7788"));
WebSocket = FWebSocketsModule::Get().CreateWebSocket(WebSocketAddress, TEXT("ws"), WsUpgradeHeaders);

Reproduction

https://textmate-grammars-themes.netlify.app/?theme=vitesse-dark&grammar=cpp&code=TMap%3CFString%2C+FString%3E+WsUpgradeHeaders%3B%0AWsUpgradeHeaders.Add%28TEXT%28%22Host%22%29%2C+TEXT%28%22127.0.0.1%3A7788%22%29%29%3B%0AWebSocket+%3D+FWebSocketsModule%3A%3AGet%28%29.CreateWebSocket%28WebSocketAddress%2C+TEXT%28%22ws%22%29%2C+WsUpgradeHeaders%29%3B

Contributes

  • I am willing to submit a PR to fix this issue
  • I am willing to submit a PR with failing tests
@slevithan
Collaborator

slevithan commented Jan 16, 2025

Your repro link to Shiki's grammar/theme playground (which looks like it's currently running Shiki v1.26.1) gives me times of 1s to 2s on the first run, and much faster thereafter. I don't have access to a Mac to test on Safari. Maybe it's performing much worse in Safari's RegExp engine if you're seeing times of ~10s.

Regardless, more than 1s is indeed slow. I assume this is a case of one or more regexes in the C++ TextMate grammar that is/are triggering superlinear backtracking. Based on this and one previous report (#871), it seems Oniguruma might have better protections against runaway backtracking than V8/JavaScriptCore. That's not to say Oniguruma isn't also prone to such problems. Many grammars have had to fix backtracking-related perf problems in the past, unrelated to Shiki's JS engine. But different regex engines have different optimizations and mitigations in place.

So yeah, this probably needs to be fixed in the upstream grammar. I would be happy to help fix the regex(es) if someone can identify which regexes are causing the problem.


@antfu it would be great for debugging (and probably for TM grammar authors) if there was a script (or maybe it could be built into the Shiki playground?) that somehow could report on the longest running regexes for a given grammar and sample text.
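A rough way to get that information without playground support is to monkey-patch `RegExp.prototype.exec` before tokenizing. This is only an illustrative sketch: the 1ms threshold, the variable names, and the reporting format are made up, and it assumes an environment where `performance.now()` is available (browsers, or Node 16+).

```javascript
// Illustrative sketch only: wrap RegExp.prototype.exec to record the
// slowest regex calls made while highlighting. Threshold and names are
// invented for the example, not part of any Shiki API.
const slowest = [];
const originalExec = RegExp.prototype.exec;
RegExp.prototype.exec = function (str) {
  const start = performance.now();
  const result = originalExec.call(this, str);
  const elapsed = performance.now() - start;
  if (elapsed > 1) { // record calls slower than 1ms
    slowest.push({ source: this.source, elapsed, input: String(str).slice(0, 80) });
  }
  return result;
};

// ...run the highlighter on the sample text here, then report:
// slowest.sort((a, b) => b.elapsed - a.elapsed);
// console.table(slowest.slice(0, 15));
```

Restoring the original `exec` afterwards matters, since the wrapper adds overhead of its own to every regex call.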

@antfu
Member

antfu commented Jan 18, 2025

If we are on the JavaScript RegExp engine, you could run Node with --cpu-prof and inspect the regex cost in https://discoveryjs.github.io/cpupro/

@slevithan
Collaborator

slevithan commented Jan 23, 2025

CC @jeff-hykin as the creator of the upstream better-cpp-syntax grammar, in case he has a hunch for the problem source, or is interested in being aware of this.

It's certainly also possible that the perf problem is not in better-cpp-syntax but is in Oniguruma-To-ES or Shiki's JS engine. But such large effects smell like runaway backtracking, which is why that's where my assumptions went.

There's something weird going on though because cpp is the only case where Shiki's precompiled grammar is significantly slower than transpiling the regexes at runtime, which I can't explain yet. The cpp grammar seems to very heavily use possessive quantifiers and atomic groups in cases where they don't change matching behavior. These are features that are fully supported by Oniguruma-To-ES and indeed avoid runaway backtracking that might otherwise occur, but they're not cost free in the JS engine since they're emulated using backreferences to captures within lookahead and therefore lead to using EmulatedRegExp as the constructor in order to hide the injected "emulation groups" from match result arrays. The perf overhead of this is typically tiny (for both regex execution and construction), but it's possible that doing this for hundreds of regexes like cpp does (and especially for long regexes with lots of emulation groups) is a factor. If this turns out to be the problem, it's possible to make big architectural changes in Oniguruma-To-ES that should speed this up significantly. That's something I'd much prefer to avoid though since there would be lots of cascading code changes needed.
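For readers unfamiliar with the technique mentioned above, here is a minimal illustration (not Oniguruma-To-ES's actual output) of how an atomic group can be emulated in JS with a capture inside a lookahead plus a backreference:

```javascript
// Oniguruma's atomic group (?>a+) has no direct JS syntax, but a capture
// inside a lookahead followed by a backreference behaves the same way:
// once a JS lookahead succeeds, its interior choice points are discarded,
// so \1 can only consume exactly what was captured and never gives it back.
const atomicEmulated = /^(?=(a+))\1b$/;

console.log(atomicEmulated.test('aaab')); // true
// A plain /^(a+)b$/ would retry shorter runs of 'a' before failing here;
// the emulated atomic group fails immediately instead:
console.log(atomicEmulated.test('aaa')); // false

// The cost: the helper (a+) is a real capturing group, so match results
// gain an extra slot that a wrapper like EmulatedRegExp has to hide.
const m = /(?=(a+))\1b/.exec('aaab');
console.log(m[1]); // 'aaa' (the injected "emulation group")
```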

@jeff-hykin

jeff-hykin commented Jan 24, 2025

I assume this is a case of one or more regexes in the C++ TextMate grammar that is/are triggering superlinear backtracking

I mean, if the grammar is slow, that's almost always why. But something this short definitely shouldn't take 10s on an M1.

To be fair, I've only ever tested with vscode-textmate though

seems Oniguruma might have better protections

Eh, that or VS Code has some protections against runaway regexes. Like just capping the line length.

There was an update at some point, maybe 1.5 years ago, that was a major help. I think Alex Dima (the VS Code guy) knows about it.

But even before that update, an example this small would've/should've been fine.

That's not to say Oniguruma isn't also prone to such problems

Many grammars have had to fix backtracking-related perf problems in the past

✅ (it's really an inherent design flaw of TextMate/regex IMO)

So yeah, this probably needs to be fixed in the upstream grammar

While not a terrible conclusion, unfortunately I mostly disagree.

Not because the grammar is good (it's an abomination), but rather because the grammar, at its best, is only capable of a tradeoff: do you want it to be decently slow and horrifically wrong, or do you want it to be decently slow, wrong occasionally-ish, and have edge cases where it's so slow it freezes the whole process?

Not to say there's never a case of a pure performance loss that could be fixed with better grammar code (take a look at the function definition and type patterns if someone is interested in optimization), but rather that there are at least some cases where the thrashing is necessary for correctness.

Lots of atomic groups, fail-quickly lookahead patterns, and limitations on lookaheads/lookbehinds have been added to the C++ grammar to try to help with performance. Sometimes there was a small culprit that made a major impact.

The only bulletproof fix, though, in the C++ grammar would basically be a built-in line-length limiter. Which I've considered and have, as always, mixed feelings about.

TLDR

I think user preference matters for how much slowness users can tolerate (I have a very long line-length limit), which is outside the grammar. I also think the TextMate engine has a responsibility to make sure a single regex doesn't crash everything.

@jeff-hykin

jeff-hykin commented Jan 24, 2025

in case he has a hunch for the problem source, or is interested in being aware of this.

Definitely interested!

I really dug deep into what TextMate tools were available, like completely undocumented stuff, so in that sense I'm not surprised there are problems.

The good/bad news is I'm pretty sure the C++ grammar is the largest, most feature-abusing grammar out there. At least I sure hope there's nothing worse. If this can be fixed, I would feel pretty good about the others.

The cpp grammar seems to very heavily use possessive quantifiers and atomic groups in cases where they don't change matching behavior

Exactly. Ironically, this was done to improve performance. So if, somehow, this is hurting performance after precompilation, it's realistic that removing all atomic groups/possessives wouldn't change correctness. I don't think there was a single time we used possessive quantifiers for something other than performance.

Removing them and seeing what happens might be the easiest debugging step.
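As a starting point for that experiment, a hypothetical, deliberately naive textual pass could strip both constructs from a grammar's pattern strings. It does not parse the regex, so it would mangle patterns where `++`, `*+`, `?+`, `}+`, or `(?>` appear inside character classes or escapes, but it's good enough for a quick before/after perf comparison:

```javascript
// Hypothetical debugging aid: naively rewrite possessive quantifiers and
// atomic groups to their plain greedy equivalents. A blunt textual pass,
// not a parser; see the caveats above.
function stripBacktrackingControl(pattern) {
  return pattern
    .replace(/\(\?>/g, '(?:')      // atomic group (?>...) -> non-capturing (?:...)
    .replace(/([+*?}])\+/g, '$1'); // possessive a++ / a*+ / a?+ / a{m,n}+ -> greedy
}

console.log(stripBacktrackingControl('(?>\\w++)\\s*+foo{1,3}+'));
// prints: (?:\w+)\s*foo{1,3}
```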

@slevithan
Collaborator

slevithan commented Jan 24, 2025

seems Oniguruma might have better protections

Eh, that or VS code has some protections against runaway regex. Like just capping the line length.

This is super relevant, just not in this context, since Shiki's JS and WASM engines are both applied line by line. But I'm happy to see you reinforce the point (from your own much more extensive experience with TM grammars than mine) that Oniguruma is also very vulnerable. As I mentioned, although different regex engines can have different built-in optimizations that can avoid or mitigate certain cases, all backtracking regex engines are vulnerable.

So yeah, this probably needs to be fixed in the upstream grammar

While not a terrible conclusion, unfortunately I mostly disagree. [...]

My comment was based on the assumption that there are one or more regexes in the cpp grammar that are triggering runaway backtracking (in a specific way that Oniguruma is less vulnerable to than V8/JavaScriptCore). This hasn't been proven, though. But if it were true, it would likely be easy to fix the regexes upstream (possibly by simply adding e.g. an atomic group).

The more I think about this, though, the more I think there's a good chance it's a fixable (but unfortunately pretty deep) problem in the design of my EmulatedRegExp subclass of JS's RegExp, that the cpp grammar is uniquely triggering through its massive reliance on atomic groups and possessive quantifiers, including within very long patterns.

Not because the grammar is good (it's an abomination), but rather because the grammar, at its best, is only capable of a tradeoff: do you want it to be decently slow and horrifically wrong, or do you want it to be decently slow, wrong occasionally-ish, and have edge cases where it's so slow it freezes the whole process?

My claim (not deeply investigated, just a hunch based on how much atomic groups and possessive quantifiers are used) is that these backtracking-control features are being used even in many cases when they have no potential benefit and can't change what's matched or have a meaningful effect on backtracking behavior. In the context of Oniguruma (and other regex engines that support backtracking control natively), there is no cost to overusing these features like this. But alas, in the case of Oniguruma-To-ES, there is some (non-exponential) cost that can add up in extreme cases.

The only bulletproof fix, though, in the C++ grammar would basically be a built-in line-length limiter. Which I've considered and have, as always, mixed feelings about.

If I'm understanding what you mean correctly, I'd be hesitant on that as well, in large part because it wouldn't be bulletproof. It's very possible to trigger catastrophic backtracking with relatively short target strings/lines.
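For example, the textbook nested-quantifier shape shows how an input of only ~20 characters can already take on the order of a million backtracking steps to reject:

```javascript
// The classic runaway shape: nested quantifiers over the same character.
// Rejecting this 21-character input requires on the order of 2^20
// backtracking permutations, and each extra 'a' roughly doubles the work.
const evil = /^(a+)+$/;
const input = 'a'.repeat(20) + 'X';

const start = Date.now();
const matched = evil.test(input);
console.log(matched, `took ${Date.now() - start}ms`); // false, and measurably slow
```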

I also think the TextMate engine has responsibility in making sure a single regex doesnt crash everything.

I agree with you, but TextMate engines are unlikely to make fundamental changes that eliminate the risk of runaway backtracking anytime soon.

The cpp grammar seems to very heavily use possessive quantifiers and atomic groups in cases where they don't change matching behavior

Exactly. Ironically, this was done to improve performance. So if, somehow, this is hurting performance after precompilation, it's realistic that removing all atomic groups/possessives wouldn't change correctness. I don't think there was a single time we used possessive quantifiers for something other than performance.

Removing them and seeing what happens might be the easiest debugging step.

That's a great debugging step! If the perf problem for Shiki's JS engine just went away, that would indeed identify the source of the problem, and reinforce that I should do the major work that would be needed in Oniguruma-To-ES. But, it might introduce genuine runaway backtracking where there was none before (including with the WASM engine), making it no longer useful as a debugging step.

@slevithan
Collaborator

slevithan commented Jan 24, 2025

There's something weird going on though because cpp is the only case where Shiki's precompiled grammar is significantly slower than transpiling the regexes at runtime, which I can't explain yet.

Sadly (for me), the only explanation I can think of is that Shiki's standard JS engine (which transpiles regexes at runtime) never actually has to deal with transpiling the majority of regexes in the cpp grammar because they aren't triggered by the simple code snippets we're dealing with in Shiki's cpp sample or the snippet @gregberge gave. But with a precompiled grammar, simply loading the grammar will (I believe) run all the EmulatedRegExp constructor calls (which in cpp's case is hundreds of calls). So if the construction is slow (or can be slow in some extreme cases) -- as opposed to transpilation or regex execution being slow -- it would explain this.

Context: Oniguruma-To-ES doesn't use EmulatedRegExp for most regexes with most grammars. cpp is probably unique in forcing the majority of its regexes through the subclass.

But it feels like I have some work to do in rewriting the EmulatedRegExp constructor (and all of the Regex+ internals/plugins that feed into it) if I want to dramatically speed up EmulatedRegExp construction for cases like this. 😓

@slevithan
Collaborator

slevithan commented Jan 29, 2025

Profiling in the Chrome dev tools for @gregberge's grammar playground link shows that 10-15 specific regexp exec calls are taking the majority of the run time. In other words, regex search time (not transpilation or construction time) is the problem, or at least the biggest problem. That certainly sounds like runaway backtracking to me, or runaway backtracking plus other factors. Like I mentioned before, this can disproportionately hit JS performance because the V8, JavaScriptCore, and Oniguruma regex engines have different performance optimizations and mitigations, so specific problem cases don't always hit them the same way.

So yeah, that will need to be fixed upstream, but like I offered before, I can probably help fix the regexes if someone identifies the culprits. 🙂

Earlier, I talked a lot about regex construction time and the EmulatedRegExp subclass from Oniguruma-To-ES. I dug pretty deep into investigating and addressing that, but alas, that road only led to marginal improvements. So I'm back to my original hunch.

However, since I've gone through the work, I'll share the details below…

Refactoring EmulatedRegExp

Over the last few days I've made big, complicated changes for Oniguruma-To-ES v3.0.0 (coming soon) spanning multiple libraries I maintain. The purpose was to try to speed up regex construction (as opposed to regex transpilation or search performance) when using the EmulatedRegExp subclass, which Shiki's JS engine heavily relies on for the cpp grammar. EmulatedRegExp is used more often for cpp than probably any other grammar, due to indiscriminate use (IMO, overuse) of possessive quantifiers and atomic groups.

Old vs new EmulatedRegExp

The core issue I was addressing was the system of how various "emulation group" markers (which might look like e.g. ($E$…) or ($3$E$…)) were added to some transpiled patterns to indicate special behavior that would be handled in the subclass's overridden exec method (e.g., hiding certain capturing groups from match results, or transferring matched subpatterns/indices between capture slots). This extra handling is necessary to precisely emulate Oniguruma's subroutines, recursion, atomic groups, and possessive quantifiers. (The subclass is also used to handle some \G anchors, but that didn't use in-pattern markers.)

This metadata was provided in context throughout the pattern (rather than provided via additional args to EmulatedRegExp) in order to simplify the already very complex transpilation of some features. It also allowed reusing and building on an existing Regex+ system. However, this system had some downsides, including that the markers added during transpilation then immediately needed to be removed at construction time and turned into metadata before passing the cleaned up pattern to the native RegExp constructor.

Based on some of the symptoms I was seeing for cpp in this thread, I thought this construction-time overhead might be the source of the problem, given how many "emulation groups" are added to cpp patterns and how large some of the cpp patterns are that get processed in this way.

But there was a problem with that theory…

Most construction time is actually spent in RegExp

I've now done the work to remove the processing described above from regex construction, and although it dramatically improved the performance of EmulatedRegExp's construction-time overhead, it didn't meaningfully improve overall construction time for some of the monsters from the cpp grammar that I've tested. In other words, although the result of my work is that regex construction time spent within EmulatedRegExp is dramatically reduced, it doesn't matter because most of the construction time was actually spent in the native JS RegExp constructor (given the size and complexity of the regexes).

Based on these results, a meaningful issue (though not the main issue, since the main issue is how long it takes some regexes to search when called with exec) is that some of the monstrous regexes in the cpp grammar are inherently slow to construct in JS.

To illustrate the construction time issue, take just this one monster pattern from the cpp grammar which is nearly 10,000 chars long. Transpiled to an identically equivalent JS pattern string (using the Oniguruma-To-ES options that Shiki uses), it's over 17,000 chars.

Following are the rough construction times if I feed it into different constructors on my own relatively-new/fast laptop, running in Chrome. Note the perf differences between the first and subsequent calls. Subsequent calls show the time spent in the constructor itself, since the compiled regex that gets returned has already been cached by V8.

  • RegExp (not able to handle emulation group markers, so they were left out, giving RegExp an unfair advantage)
    • First call: ~40ms.
    • Subsequent call: ~0.12ms.
  • Old EmulatedRegExp from Oniguruma-To-ES v2.3.0 (with markers embedded in the pattern that need to be stripped)
    • First call: ~42ms.
    • Subsequent call: ~4ms.
  • New EmulatedRegExp from Oniguruma-To-ES v3.0.0 (no embedded markers; metadata provided in args)
    • First call: ~40ms.
    • Subsequent call: ~0.18ms.

The difference between ~4ms in v2.3.0 and ~0.18ms in v3.0.0 shows that the refactor was successful at dramatically improving performance for the processing that it was targeting. Alas, it doesn't matter when considering the unavoidable 40ms on top that comes from the native RegExp. 40ms might not sound so bad until you consider that this is just one regex out of 513 in the cpp grammar, and there are some additional cpp monsters like it.
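For anyone who wants to reproduce this kind of measurement, a sketch along these lines (with a synthetic pattern standing in for a transpiled cpp monster) separates first from repeat construction. The caching behavior described in the comments is V8-specific and approximate:

```javascript
// Synthetic stand-in for a huge transpiled pattern: a roughly 18,000-char
// alternation of 2,000 words. Timing numbers vary by machine; the point is
// the first-vs-repeat construction gap, which (in V8) comes largely from
// an internal cache of already-processed patterns.
const words = Array.from({ length: 2000 }, (_, i) => `word${i}`);
const source = `\\b(?:${words.join('|')})\\b`;

let t = performance.now();
new RegExp(source);
console.log('first construction:', (performance.now() - t).toFixed(2), 'ms');

t = performance.now();
new RegExp(source);
console.log('repeat construction:', (performance.now() - t).toFixed(2), 'ms');
```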

Aside: Here is the v3.0.0 version of the metadata that gets fed into EmulatedRegExp for this particular regex. It has 111 "hidden captures" related to its transpiled possessive quantifiers and recursion that need to be hidden from match results in order for the beginCaptures indices in the TM grammar to line up correctly:

{
  hiddenCaptures: [1,4,7,8,10,11,16,19,20,22,23,25,28,31,32,34,35,37,40,41,43,44,46,47,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,71,74,75,77,78,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,101,104,105,107,108,110,113,114,116,117,119,122,123,125,126,129,132,133,135,136,139,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,161,164,165,167,168],
  strategy: 'clip_search',
}

We can also roughly compare performance of Oniguruma-To-ES v2.3.0 and v3.0.0 using Shiki's bench script, not by comparing them against each other, but comparing to the WASM engine. Alas, both give comparable (poor) JS results when tested on Shiki's short C++ sample, with wasm running ~2.5x faster than js and ~4.5x faster than js-precompiled. (As I mentioned before, cpp is the only grammar where the precompiled JS performs significantly worse than transpiling Oniguruma patterns to JS at runtime, which seems related to construction time unless anyone can think of other theories.)

Conclusion

Based on all of the above, it doesn't sound like there's much I can do on the Oniguruma-To-ES side in a generalized way to improve JS performance for the cpp grammar. v3.0.0 will be published soon, but it won't be a major win for cpp perf.

There is likely a lot of room to improve performance upstream in the cpp grammar, if specifically trying to optimize for Shiki's JS engine. This could include identifying and fixing any runaway backtracking (that is hitting JS especially hard in this case), removing gratuitous atomic groups and possessive quantifiers that cannot possibly affect matches or improve performance (happy to advise on this in a separate issue if helpful), and simplifying and/or splitting some monster patterns.

@jeff-hykin

10-15 specific regexp exec calls are taking the majority of the run time

Are you able to print out what those regex patterns are? That'd be a huge help.

~4.5x faster than js-precompiled

Could you do an optional lazy compile based on the size of the regex string? Then get the benefits of both JIT and precompile.

Based on all of the above, it doesn't sound like there's much I can do on the Oniguruma-To-ES side in a generalized way to improve JS performance for the cpp grammar

Sounds likely. I am curious how much is just JS overhead though; is there a general comparison of vscode-oniguruma to Shiki showing how much of the cost is inherently JS?

First call: ~40ms.

I mean, I'm happy to try and find backtracking, there's probably at least some local-optima ones that still exist, and maybe one big global-optima that requires incremental breaks that all fix themselves once all the little ones are fixed (not super confident on that though).

But it sounds to me like, the patterns are just too big.

One of the other things that might play a role is TextMate-level (not Oniguruma-level) optimizations. The C++ grammar has a lot of patterns that start the same. If this "common start" pattern were matched once and then used to rule in all TM patterns with that regex start, it would be a lot faster than failing for every pattern individually.

I have no idea if VS Code is doing an optimization like that, but it's something to consider for end-user performance.

The subclass is also used to handle some \G anchors, but that didn't use in-pattern markers.

Yeah, C++ has to make heavy use of it as a hack for matching TM pattern-ranges that look the same at the start but diverge later: like a var declaration and a function definition.

What we want:

pat1Start 
    (pat2Start (pat2.1|pat2.2) endpat2) 
    (pat3Start (pat3.1|pat3.2) pat3End)
pat1End

What we have to do instead:

pat1Start (pat2With\G 
(pat2.1|pat2.2|pat3ThatLooksForPat2End
    (pat3.1|pat3.2) 
pat3End)
pat2lookAheadForPat1End)
pat1End

And that's the simple case; usually there are alternative pat2's, because a failure of pat2 to match at all means it's actually a different thing altogether.

It's really inefficient compared to what it could be.

All this to say: If you made a forked TM engine then C++ could get rid of all the \G usages, while also being more correct

@slevithan
Collaborator

slevithan commented Feb 2, 2025

10-15 specific regexp exec calls are taking the majority of the run time

Are you able to print out what those regex patterns are? That'd be a huge help.

If it's possible to identify the regexes in the performance console, I don't know how to. But, by locally hacking some timing code directly into Shiki, I identified the two worst performing regexes, at least when tested against the Shiki sample and gregberge's sample. They both get multiple calls, as well. The following links point to them in the JSON copy of the grammar that Shiki uses (not the JS/precompiled version):

I'll analyze the first one in my next comment.

Could you do an optional lazy compile based on the size of the regex string? Then get the benefits of both JIT and precompile.

That's an excellent idea! I'll definitely play around with some variations of this. 😊

I am curious how much is just JS overhead though, is there a general comparison of VS oniguruma to shiki showing how much of the cost is inherently JS?

V8's Irregexp engine (shared by all modern browsers excluding Safari -- even Firefox uses it) is much faster than Oniguruma via WASM, not because Oniguruma is slow, but because Irregexp is fast. But you can't directly compare Oniguruma and JS RegExp performance except for a small subset of features, because most regex features aren't perfectly equivalent across JS and Oniguruma. Oniguruma-To-ES prides itself on perfect translation of syntax and behavior (and it's because of this that Shiki's JS engine is able to cover the long tail of TM grammars), but the accuracy is not cost free. As a simple example, ^ in Oniguruma is equivalent to (?<=^|\n(?!$)) in JS. If you guessed that running (?<=^|\n(?!$)) is slower than ^ (and precludes many of the low-level engine optimizations that can be done for simple ^), you'd be right. At a high level, though:

  • Oniguruma via WASM is likely slower than Oniguruma native.
  • But then, whereas JS is single-threaded, I believe Oniguruma can run regexes in a scanner in parallel. That should be a huge win, except...
  • Native JS RegExp is blazing fast, so it still comes out ahead in many cases.
  • But alas, accurate translation of Oniguruma's advanced features and edge cases can significantly slow down generated patterns.
  • However, Oniguruma-To-ES is smart and can mitigate some of these effects.
    • To use the earlier example of ^, the rules.singleline option treats ^ as \A and $ as \Z. Oniguruma's \A translates to simple ^ in JS. Shiki's JS engine uses this without changing the meaning since TM grammars always search line by line.
  • Also, V8 seems to run into construction-time perf problems with extremely long patterns.
  • Etc., etc.

In the end, some grammars are faster in JS (sometimes dramatically so, and often even with the overhead of transpilation at runtime), but somewhat more grammars (as of today) are faster in Oniguruma. For most grammars, the difference is not dramatic, and so the fact that the JS engine with precompiled grammars only needs ~3 kb of tree-shaken code from Oniguruma-To-ES to run (and can drop the large WASM dependency that's hundreds of kb) is a major win.
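To make the earlier `^` example concrete, here is the translated lookbehind in action (illustrative only):

```javascript
// Oniguruma's line-start ^ translated for JS: match at the string start,
// or after any \n that is not the string's final character.
const onigCaretInJs = /(?<=^|\n(?!$))\w+/g;

console.log('alpha\nbeta\ngamma\n'.match(onigCaretInJs));
// prints: [ 'alpha', 'beta', 'gamma' ]
```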

First call: ~40ms.

[...] it sounds to me like, the patterns are just too big.

I think V8 (not sure about Safari's JavaScriptCore) does struggle with construction of at least some extremely long patterns, for whatever reason. This might only be a meaningful issue for a small number of the very longest patterns in the cpp grammar though.

The subclass is also used to handle some \G anchors, but that didn't use in-pattern markers.

Yeah C++ has to make heavy use of it as a hack for matching TM pattern-ranges that look the same at the start but diverge later

No worries, since Oniguruma-To-ES knows how to translate \G and does so in a reasonably fast way. A fair number of \G usages don't even require subclass-based emulation and are extremely fast. Until now, I don't think people thought it was possible to fully emulate \G, but then Oniguruma-To-ES has a lot of firsts for JS. 😊

@slevithan
Collaborator

slevithan commented Feb 2, 2025

I identified the two worst performing regexes, at least when tested against the Shiki sample and gregberge's sample. They both get multiple calls, as well. The following links point to them in the JSON copy of the grammar that Shiki uses (not the JS/precompiled version):

I'll analyze the first one in my next comment.

Okay, so, with the function_call regex, failing to match something very simple like HelloWorld takes ~1,000 backtracking steps just for the attempt at position 0, and similar for each subsequent position. And successfully matching HelloWorld( also takes ~1,000 steps.

At least this backtracking isn't exponential. But it is excessive, and increases somewhat faster than linearly.

There are two main contributors to this excessive backtracking that I see...

Excessive backtracking contributor 1

First is the massive number of alternation options within two different negative lookarounds. The negative lookahead that contains a list starting with __has_cpp_attribute doesn't need to change, given that it's probably best to avoid micro-optimizations that hurt readability. But the negative lookbehind that starts with \Wreinterpret_cast can be improved a lot.

Here it is with JSON string escaping removed:

(?<!\Wreinterpret_cast|^reinterpret_cast|\Watomic_noexcept|^atomic_noexcept|\Wuint_least16_t|^uint_least16_t|\Wuint_least32_t|^uint_least32_t|\Wuint_least64_t|^uint_least64_t|\Watomic_cancel|^atomic_cancel|\Watomic_commit|^atomic_commit|\Wuint_least8_t|^uint_least8_t|\Wuint_fast16_t|^uint_fast16_t|\Wuint_fast32_t|^uint_fast32_t|\Wint_least16_t|^int_least16_t|\Wint_least32_t|^int_least32_t|\Wint_least64_t|^int_least64_t|\Wuint_fast64_t|^uint_fast64_t|\Wthread_local|^thread_local|\Wint_fast16_t|^int_fast16_t|\Wint_fast32_t|^int_fast32_t|\Wint_fast64_t|^int_fast64_t|\Wsynchronized|^synchronized|\Wuint_fast8_t|^uint_fast8_t|\Wdynamic_cast|^dynamic_cast|\Wint_least8_t|^int_least8_t|\Wint_fast8_t|^int_fast8_t|\Wstatic_cast|^static_cast|\Wsuseconds_t|^suseconds_t|\Wconst_cast|^const_cast|\Wuseconds_t|^useconds_t|\Wconstinit|^constinit|\Wco_return|^co_return|\Wuintmax_t|^uintmax_t|\Wuintmax_t|^uintmax_t|\Wuintmax_t|^uintmax_t|\Wconstexpr|^constexpr|\Wconsteval|^consteval|\Wconstexpr|^constexpr|\Wconstexpr|^constexpr|\Wconsteval|^consteval|\Wprotected|^protected|\Wnamespace|^namespace|\Wblksize_t|^blksize_t|\Wco_return|^co_return|\Win_addr_t|^in_addr_t|\Win_port_t|^in_port_t|\Wuintptr_t|^uintptr_t|\Wtemplate|^template|\Wnoexcept|^noexcept|\Wnoexcept|^noexcept|\Wcontinue|^continue|\Wco_await|^co_await|\Wco_yield|^co_yield|\Wunsigned|^unsigned|\Wu_quad_t|^u_quad_t|\Wblkcnt_t|^blkcnt_t|\Wuint16_t|^uint16_t|\Wuint32_t|^uint32_t|\Wuint64_t|^uint64_t|\Wintptr_t|^intptr_t|\Wintmax_t|^intmax_t|\Wintmax_t|^intmax_t|\Wvolatile|^volatile|\Wregister|^register|\Wrestrict|^restrict|\Wexplicit|^explicit|\Wvolatile|^volatile|\Wnoexcept|^noexcept|\Woperator|^operator|\Wdecltype|^decltype|\Wtypename|^typename|\Wrequires|^requires|\Wco_await|^co_await|\Wco_yield|^co_yield|\Wreflexpr|^reflexpr|\Wswblk_t|^swblk_t|\Wvirtual|^virtual|\Wssize_t|^ssize_t|\Wconcept|^concept|\Wmutable|^mutable|\Wfixpt_t|^fixpt_t|\Wint16_t|^int16_t|\Wint32_t|^int32_t|\Wint64_t|^int64_t|\Wuint8_t|^uint8_t|\Wtypedef|^typedef|\Wdaddr_t|^daddr_t|\Wcaddr_t|^caddr_t|\Wqaddr_t|^qaddr_t|\Wdefault|^default|\Wnlink_t|^nlink_t|\Wsegsz_t|^segsz_t|\Wu_short|^u_short|\Wwchar_t|^wchar_t|\Wprivate|^private|\W__asm__|^__asm__|\Walignas|^alignas|\Walignof|^alignof|\Wmutable|^mutable|\Wnullptr|^nullptr|\Wclock_t|^clock_t|\Wmode_t|^mode_t|\Wpublic|^public|\Wsize_t|^size_t|\Wdouble|^double|\Wquad_t|^quad_t|\Wstatic|^static|\Wtime_t|^time_t|\Wmodule|^module|\Wimport|^import|\Wexport|^export|\Wextern|^extern|\Winline|^inline|\Wxor_eq|^xor_eq|\Wand_eq|^and_eq|\Wreturn|^return|\Wfriend|^friend|\Wnot_eq|^not_eq|\Wsigned|^signed|\Wstruct|^struct|\Wint8_t|^int8_t|\Wushort|^ushort|\Wswitch|^switch|\Wu_long|^u_long|\Wtypeid|^typeid|\Wu_char|^u_char|\Wsizeof|^sizeof|\Wbitand|^bitand|\Wdelete|^delete|\Wino_t|^ino_t|\Wkey_t|^key_t|\Wpid_t|^pid_t|\Woff_t|^off_t|\Wuid_t|^uid_t|\Wshort|^short|\Wbreak|^break|\Wcatch|^catch|\Wcompl|^compl|\Wwhile|^while|\Wfalse|^false|\Wclass|^class|\Wunion|^union|\Wconst|^const|\Wor_eq|^or_eq|\Wconst|^const|\Wthrow|^throw|\Wbitor|^bitor|\Wu_int|^u_int|\Wusing|^using|\Wdiv_t|^div_t|\Wdev_t|^dev_t|\Wgid_t|^gid_t|\Wfloat|^float|\Wlong|^long|\Wgoto|^goto|\Wuint|^uint|\Wid_t|^id_t|\Wcase|^case|\Wauto|^auto|\Wvoid|^void|\Wenum|^enum|\Wtrue|^true|\Wchar|^char|\Wid_t|^id_t|\WNULL|^NULL|\Wthis|^this|\Wbool|^bool|\Welse|^else|\Wfor|^for|\Wnew|^new|\Wnot|^not|\Wxor|^xor|\Wand|^and|\Wasm|^asm|\Wint|^int|\Wtry|^try|\Wdo|^do|\Wif|^if|\Wor|^or)

That can be replaced with:

(?<!(?:\W|^)(?:reinterpret_cast|atomic_noexcept|uint_least16_t|uint_least32_t|uint_least64_t|atomic_cancel|atomic_commit|uint_least8_t|uint_fast16_t|uint_fast32_t|int_least16_t|int_least32_t|int_least64_t|uint_fast64_t|thread_local|int_fast16_t|int_fast32_t|int_fast64_t|synchronized|uint_fast8_t|dynamic_cast|int_least8_t|int_fast8_t|static_cast|suseconds_t|const_cast|useconds_t|constinit|co_return|uintmax_t|uintmax_t|uintmax_t|constexpr|consteval|constexpr|constexpr|consteval|protected|namespace|blksize_t|co_return|in_addr_t|in_port_t|uintptr_t|template|noexcept|noexcept|continue|co_await|co_yield|unsigned|u_quad_t|blkcnt_t|uint16_t|uint32_t|uint64_t|intptr_t|intmax_t|intmax_t|volatile|register|restrict|explicit|volatile|noexcept|operator|decltype|typename|requires|co_await|co_yield|reflexpr|swblk_t|virtual|ssize_t|concept|mutable|fixpt_t|int16_t|int32_t|int64_t|uint8_t|typedef|daddr_t|caddr_t|qaddr_t|default|nlink_t|segsz_t|u_short|wchar_t|private|__asm__|alignas|alignof|mutable|nullptr|clock_t|mode_t|public|size_t|double|quad_t|static|time_t|module|import|export|extern|inline|xor_eq|and_eq|return|friend|not_eq|signed|struct|int8_t|ushort|switch|u_long|typeid|u_char|sizeof|bitand|delete|ino_t|key_t|pid_t|off_t|uid_t|short|break|catch|compl|while|false|class|union|const|or_eq|const|throw|bitor|u_int|using|div_t|dev_t|gid_t|float|long|goto|uint|id_t|case|auto|void|enum|true|char|id_t|NULL|this|bool|else|for|new|not|xor|and|asm|int|try|do|if|or))

This has multiple benefits:

  • The number of alternatives is cut in half, leading to less backtracking.
  • Each of those Oniguruma \W tokens gets expanded to [^\p{L}\p{M}\p{N}\p{Pc}] in JS to make it exactly equivalent. So having only one instance of that instead of 176 is significantly beneficial for JS.
  • Significantly cutting down on length like this should also improve compile time in V8.
  • It increases readability (though perhaps that's not relevant if this list is being generated by something).
  • The leading (?:\W|^) in the updated version might allow some engines to optimize the search by immediately failing if this condition is impossible to satisfy. Not likely for this pattern for a few reasons, but it's generally a good idea to think about clues for the regex engine like this, since engine implementors put a lot of work into these kinds of optimizations.

Excessive backtracking contributor 2

The segment >)\s*+)?::)*\s*+ can be changed to >)\s*)?+::)*+\s* to drop a lot of needless backtracking. AFAICT, this matches the same strings, but since the regex is complex and I don't know much about C++ syntax you might want to double check that this change doesn't prevent any complex cases that it should match.

Notice that of the four quantifiers in this segment, I changed all of them. Both of the possessive quantifiers changed to non-possessive (since I don't think making these whitespace tokens possessive was helping much if at all) and I made the two non-possessive quantifiers possessive. That change cuts backtracking roughly in half for a case like ::test<"hi">::test<"hi"> -- from more than 2,000 backtracks in order to fail at position 0, down to ~1,300.

The idea here is you want to lock in your <…> matches when you find them and then again lock in the trailing :: if you find it, before moving on to the next portion of the pattern.
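
As an aside for readers following along in JS: JavaScript has no possessive quantifiers, so they have to be emulated. One standard technique (a sketch of the general idea, not necessarily oniguruma-to-es's exact output) is a capturing lookahead plus a backreference, `(?=(X*))\1` for `X*+` -- since JS lookarounds are atomic, the greedy match inside the lookahead is locked in and the backreference consumes exactly that text:

```javascript
// Greedy version: \d* backtracks and gives a digit back so the
// trailing \d can match.
console.log(/^\d*\d/.test('123'));          // true

// Possessive emulation of ^\d*+\d: the lookahead captures all three
// digits atomically, so the trailing \d has nothing left to match.
console.log(/^(?=(\d*))\1\d/.test('123'));  // false
```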


I think both of these improvements also apply to the second slow regex (function_definition), but I haven't closely analyzed it like I did with this one.

I suspect that simply applying these two improvements to these two slowest regexes will lead to a significant improvement for the C++ grammar in JS. All the better if there are additional regexes with these same opportunities for improvement in the grammar.

Aside: Although not necessarily perf related, you can also replace your use of (?:\/\*(?:[^\*]++|\*+(?!\/))*+\*\/) with /\*(?~\*/)\*/. This uses an Oniguruma-specific "absent repeater" (?~…).

All of the improvements suggested here will benefit not only Shiki's JS engine, but also Oniguruma. 😊

@slevithan
Copy link
Collaborator

slevithan commented Feb 3, 2025

Could you do an optional lazy compile based on the size of the regex string? Then get the benefits of both JIT and precompile.

That's an excellent idea! I'll definitely play around with some variations of this. 😊

I've just published Oniguruma-To-ES v3.1.0 (which should be in the next Shiki release). It includes lazy compilation as a feature of its EmulatedRegExp class, and allows configuring the pattern length (based on the transpiled output) that triggers lazy compilation. Based on some very rough testing, I'm planning to set lazyCompileLength to 3000 in Shiki's JS engine. As of now, that flags 28 patterns in the C++ grammar.

Based on this discussion, I'm optimistic that C++ will perform at least reasonably well in Shiki's JS engine after the following are completed, without the need for any complex refactoring in better-cpp-syntax:

For me in oniguruma-to-es/Shiki:

  • Support and use lazy compilation, to avoid a perf penalty for precompiled grammars that contain extremely long patterns that aren't always used.

For @jeff-hykin in better-cpp-syntax:

  • Update the two regexes that are by far the worst performing (function_call and function_definition) using my recommendations above.
  • Update all other regexes using any sort of grouping with (\Wfoo|^foo|\Wbar|^bar|\W…|^…) to instead use ((?:\W|^)(?:foo|bar|…)).

@jeff-hykin
Copy link

jeff-hykin commented Feb 3, 2025

That can be replaced with:

(?<!(?:\W|^)

Are you sure? That's how I originally tried it years ago and ran into issues (groups in lookaround not allowed). After hitting that, I went and made a helper function that generates all that junk for any list of keywords I give it.

I can test it tomorrow and see what happens. Maybe I was originally using it for a lookahead.

I use that helper function EVERYWHERE; it's what prevents normal identifiers from matching keywords. So this change would be a performance enhancement to the whole codebase. Actually, it would be a performance enhancement to basically all the grammars I've made.

BTW, here's what the source code looks like for function_definition, if you're curious.

@jeff-hykin
Copy link

The segment >)\s*+)?::)*\s*+ can be changed to >)\s*)?+::)*+\s*

This one could be big, and help with some known perf Issues on VS Code. It won't be a problem to test so I'll try it whenever I get a chance.

I don't understand why not make all of them possessive though?

@jeff-hykin
Copy link

(?:\/\*(?:[^\*]++|\*+(?!\/))*+\*\/)

This was actually our biggest performance improvement. That block comment pattern was used all over and without the possessives in there, it caused exponential backtracking.

/\*(?~\*/)\*/. This uses an Oniguruma-specific "absent repeater" (?~…).

Cool!

@jeff-hykin
Copy link

The \W|^ thing makes me realize I should ask you something: do you know of any regex optimizers? Basically precompiling of regexes, and/or minifying them. For example, getting rid of unnecessary non-capture groups.

I have looked for one in the past and haven't found any. If there was one, it could automatically optimize that repetition of \W|^ along with probably plenty of other things that are the result of regexes being generated.

@slevithan
Copy link
Collaborator

slevithan commented Feb 3, 2025

That can be replaced with:

(?<!(?:\W|^)

Are you sure? That's how I originally tried it years ago, and I ran into issues (groups in lookaround were not allowed).

Trust me bro. 😆

Oniguruma used to have finicky restrictions on when it allowed variable-length lookbehind (lookahead never had a problem). E.g. when I published this page https://stevenlevithan.com/regex/tests/lookbehind.html back in 2012, Oniguruma allowed (?<=a|bc) and (?<=a(?:b|c)|a), but not (?<=ab?) or (?<=a(?:b|cd)). So this must be what you were running into.

But sometime over the years, Oniguruma upgraded its support for variable-length lookbehind! The current restrictions for lookaround contents are not based on length variability:

  • Lookahead invalid within lookbehind.
  • Absent stoppers and clearers invalid within lookbehind.
  • Capturing groups invalid within negative lookbehind.
  • Negative lookbehind invalid within positive lookbehind.

Oniguruma-To-ES validates all of this perfectly, so you can use its demo page as a tester. If a lookaround can be transpiled, Oniguruma supports it.

I've tested this in Oniguruma 6.9.8 (released Apr 2022), which is the version used in vscode-oniguruma 1.7.0 through the current 2.0.1. And, via vscode-oniguruma 1.7.0, it's the version used in VS Code and Shiki. The lookbehind change most likely came earlier than 6.9.8, but the change doesn't seem to be mentioned in Oniguruma's release notes going back years more. I can tell you that there are definitely other grammars relying on modern Oniguruma's improved lookbehind rules.

Modern Oniguruma's support for variable-length lookbehind is among the best of the regex flavors, although .NET and JavaScript (ES2018) still beat it by offering lookbehind without any restrictions.

I use that helper function EVERYWHERE; it's what prevents normal identifiers from matching keywords. So this change would be a performance enhancement to the whole codebase. Actually, it would be a performance enhancement to basically all the grammars I've made.

Nice! 🥳

@slevithan
Copy link
Collaborator

slevithan commented Feb 4, 2025

This one could be big, and help with some known perf Issues on VS Code. It won't be a problem to test so I'll try it whenever I get a chance.

I don't understand why not make all of them possessive though?

It's largely a personal preference. I like to be very intentional about every feature in a regex. If I make something possessive or atomic, it's because doing so alters either (1) backtracking behavior or (2) what the regex matches. That way it serves as documentation about intent.

Backtracking behavior can definitely be hard to reason about in complex regexes like this one, so I understand your preference might lean the other way, to try to protect against undesired backtracking even in situations where it's not a risk -- so that you don't have to prove to yourself it's not a risk.

But then, I'm also partly motivated by making these regexes faster in JS, and since each possessive quantifier has a small search-time cost to emulate in JS, I'd prefer to remove them in already-slow regexes where they're identified to not be meaningfully beneficial.

Up to you, though. Making the other changes should still be beneficial even if you prefer to not de-possessivize anything.

@slevithan
Copy link
Collaborator

slevithan commented Feb 4, 2025

do you know of any regex optimizers? Basically precompiling of regex, and/or minifing it. For example, getting rid of unnecessary non-capture groups.

I have looked for one in the past, and haven't found any. If there was one, it could automatically optimize that repetion of \W|^ along with probably plently of other things that are the result of regex being generated.

To be safe, an optimizer has to be specific to the regex flavor it's targeting. I'm not aware of any Oniguruma regex optimizers.

But here are a couple for other flavors:

  • regexp-tree's Optimizer module is a nice optimizer for JS regexes. Note that it doesn't support JS's v flag, and I've found some bugs in it. And alas, it doesn't deal with cases like \Wfoo|^foo|\Wbar|^bar. But it might still be useful for some things, or for inspiration. I included it as an experimental option in Regex+'s Babel plugin.
  • Regexp-Optimizer is an old one for Perl that I haven't used.
  • The narrower concept of a "regex trie" that builds an efficient regex from large word lists has been done a number of times. E.g. in JS, in Python.
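
As a toy illustration of the regex-trie concept (trieRegex is a hypothetical helper written for this sketch, not taken from any of the libraries above): build a prefix tree from the word list and emit a pattern that shares common prefixes, so the engine never retries them per alternative.

```javascript
// Build a single pattern from a word list via a prefix tree (trie).
function trieRegex(words) {
  const root = {};
  for (const w of words) {
    let node = root;
    for (const ch of w) node = node[ch] ??= {};
    node.$ = true; // end-of-word marker
  }
  const esc = ch => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  function emit(node) {
    const keys = Object.keys(node).filter(k => k !== '$');
    if (keys.length === 0) return '';            // leaf: word ends here
    const alts = keys.map(k => esc(k) + emit(node[k]));
    if (node.$) alts.push('');                   // a word also ends here
    return alts.length === 1 ? alts[0] : '(?:' + alts.join('|') + ')';
  }
  return emit(root);
}

console.log(trieRegex(['int', 'int8_t', 'inline']));
// The shared "in" prefix is matched once: in(?:t(?:8_t|)|line)
```

To use it as a keyword matcher you'd still wrap the result in boundary assertions, e.g. `\b(?:…)\b`.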

I have some very basic optimizations already built into Oniguruma-To-ES. Contributions that add more would be welcome, but then, the library also has to weigh other factors like library weight and transpilation performance.

But if you wanted to build your own optimizer, Oniguruma-To-ES's toOnigurumaAst would be exactly what you need to get started! And as a separate library, you could go nuts. Basically, you'd start by generating an OnigurumaAst (that Oniguruma-To-ES gives you for free), and then create an Oniguruma-to-Oniguruma transformer that applies your optimizations (you'd already get some freebies out of the box like group unwrapping), and then an Oniguruma-to-Oniguruma generator to get your new pattern back from the AST. These would be fairly straightforward since the OnigurumaAst format is simple, easy to work with, and custom-designed around Oniguruma features/syntax.

If you wanted to build this as open source:

  • I'd probably be happy to help, time permitting.
  • I'd bet @antfu would be happy to use it in shikijs/textmate-grammars-themes, which would not only increase the impact but also give you a good testing ground to prove the correctness of your optimizer since it would have to avoid breaking any of the grammars there.
  • It's possible that VS Code and GitHub's Linguist would also adopt it, to optimize/minify all of their TM grammars.

So it sounds like a cool project that could have a lot of impact, if you or anyone else wanted to take this on.

@slevithan
Copy link
Collaborator

slevithan commented Feb 4, 2025

This was actually our biggest performance improvement. That block comment pattern was used all over and without the possessives in there, it caused exponential backtracking.

It didn't have to. The problem, no doubt, was not that the quantifiers weren't possessive, but that there were overlapping quantifiers that offered multiple ways to split up the work of matching the same strings (leading to trying every possible permutation before giving up on a failing match).

So, for example, another way to solve the backtracking problem would have been to use /\*(?:[^*]|\*(?!/))*\*/. Without the inner quantifiers, all paths through this regex are now mutually exclusive (since the two alts in the group are mutually exclusive), so there's no potential for runaway backtracking. That said, making the (?:…) group's * quantifier possessive (*+) would further cut backtracking nearly in half on failing matches that start with '/*' because, after reaching the end of the string and not finding a '*/' to match, it wouldn't have to first try the alternation option it didn't try at each step in the group loop the first time through.

@slevithan
Copy link
Collaborator

slevithan commented Feb 4, 2025

  • Update all other regexes using any sort of grouping with (\Wfoo|^foo|\Wbar|^bar|\W…|^…) to instead use ((?:\W|^)(?:foo|bar|…)).

Even better, for cases that appear in lookbehind and where the alternatives all start with a word char, you can just use \b instead of (?:\W|^), without changing what it matches.

@RedCMD
Copy link

RedCMD commented Feb 5, 2025

function definition is definitely a known culprit
previous performance issues:

when running the entire grammar on that code snippet
I'm seeing

  • First call: ~250ms (loading and building grammar from disk)
  • Subsequent calls: ~4-5ms

with function definition and function call taking the majority

as @slevithan stated, oniguruma now allows variable length inside lookbehinds

I agree with the optimization of (?<!(?:\W|^) and /\*(?~\*/)\*/ etc

one of the optimizations that vscode-oniguruma/textmate deploys
is that all regexes that need to be tested on a string are batched and run together (one after another, though)
then the next time that same batch of regexes is run, only some of them are rerun, as it uses the results and caches from the previous run
quite often leading to 0.001-0.010ms times, even when a line is 500,000 characters long

@RedCMD
Copy link

RedCMD commented Feb 6, 2025

I have a very basic optimizer in my extension
it just removes (?:non-capture-groups), (?x)extended-comments, (?#group-comments) and unneeded backslashes \\\/


knocks nearly 20,000 characters off the cpp grammar alone, from 421,487 to 401,753

@slevithan
Copy link
Collaborator

I have a very basic optimizer in my extension it just removes (?:non-capture-groups), (?x)extended-comments, (?#group-comments) and unneeded backslashes \\\/

Nice, and good to know.

Although Oniguruma-To-ES can't currently be used as an Oniguruma optimizer (because it only outputs JS regex syntax), the JS it outputs also includes those optimizations and several more. So people are currently getting those effects in Shiki's precompiled JS grammars.
