Poor performances on CPP language #893
Your repro link to Shiki's grammar/theme playground (which looks like it's currently running Shiki v1.26.1) gives me times of 1s to 2s on the first run, and much faster thereafter. I don't have access to a Mac to test on Safari. Maybe it's performing much worse in Safari's RegExp engine if you're seeing times of ~10s. Regardless, more than 1s is indeed slow.

I assume this is a case of one or more regexes in the C++ TextMate grammar that is/are triggering superlinear backtracking. Based on this and one previous report (#871), it seems Oniguruma might have better protections against runaway backtracking than V8/JavaScriptCore. That's not to say Oniguruma isn't also prone to such problems. Many grammars have had to fix backtracking-related perf problems in the past, unrelated to Shiki's JS engine. But different regex engines have different optimizations and mitigations in place. So yeah, this probably needs to be fixed in the upstream grammar. I would be happy to help fix the regex(es) if someone can identify which regexes are causing the problem.

@antfu it would be great for debugging (and probably for TM grammar authors) if there were a script (or maybe it could be built into the Shiki playground?) that could somehow report on the longest-running regexes for a given grammar and sample text.
If we are on JavaScript RegExp, running node with
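A rough sketch of the kind of instrumentation being asked for above (purely illustrative; the patching approach and names like `reportSlowest` are my own, not Shiki's internals): wrap `RegExp.prototype.exec`, accumulate elapsed time per pattern, and report the worst offenders.

```javascript
// Illustrative only: time every RegExp#exec call and tally per pattern.
// All names here are hypothetical; this is not part of Shiki.
const timings = new Map();
const originalExec = RegExp.prototype.exec;

RegExp.prototype.exec = function (str) {
  const start = performance.now();
  const result = originalExec.call(this, str);
  const elapsed = performance.now() - start;
  const key = this.source.slice(0, 80); // truncate long grammar patterns
  timings.set(key, (timings.get(key) ?? 0) + elapsed);
  return result;
};

// Report the n slowest patterns by total accumulated time.
function reportSlowest(n = 10) {
  return [...timings.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}
```

Run the highlighter, then call `reportSlowest()`; restoring `originalExec` afterward avoids skewing unrelated code.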
CC @jeff-hykin as the creator of the upstream better-cpp-syntax grammar, in case he has a hunch for the problem source, or is interested in being aware of this.

It's certainly also possible that the perf problem is not in

There's something weird going on though because
I mean, if the grammar is slow, that's almost always why. But something this short definitely shouldn't take 10s on an M1. To be fair, I've only ever tested with vscode-textmate though.
Eh, that or VS Code has some protections against runaway regex, like just capping the line length. There was an update at some point, like 1.5 years ago, that was a major help. I think Alex Dima (the VS Code guy) knows about it. But even before that update, an example this small would've/should've been fine.
✅
✅ (it's really a TextMate/regex inherent design flaw IMO)
While not a terrible conclusion, unfortunately I mostly disagree. Not because the grammar is good (it's an abomination), but rather because the grammar, at its best, is only capable of a tradeoff: do you want it to be decently slow and horrifically wrong, or do you want it to be decently slow, wrong occasionally-ish, and have edge cases where it's so slow it freezes the whole process? Not to say there's never a case of purely a performance loss that could be fixed with better grammar code (take a look at the function definition and type patterns if someone is interested in optimization), but rather that there are at least some cases where the thrashing is necessary for correctness.

Lots of atomic groups, fail-quickly lookahead patterns, and limitations on lookaheads/behinds have been added to the C++ grammar to try and help with performance. Sometimes there was a small culprit that made a major impact. The only bulletproof fix in the C++ grammar, though, would be, basically, a built-in line-length limiter. Which I've considered and have, as always, mixed feelings about.

TL;DR: I think user preference matters with how much slowness they can tolerate (I have a very long line-length limit), which is outside the grammar. I also think the TextMate engine has responsibility in making sure a single regex doesn't crash everything.
Definitely interested! I really dug deep into what TextMate tools were available, like completely undocumented stuff, so in that sense I'm not surprised there's problems. The good/bad news is I'm pretty sure the C++ grammar is the largest, most feature-abusing grammar out there. At least I sure hope there's nothing worse. If this can be fixed, I would feel pretty good about the others.
Exactly. Ironically, this was done to improve performance. So if, somehow, this is hurting performance after precompilation, it's realistic that removing all atomic groups/possessives wouldn't change correctness. I don't think there was a single time we used possessive quantifiers for something other than performance. Removing them and seeing what happens might be the easiest debugging step.
This is super relevant, just not in this context, since Shiki's JS and WASM engines are both being applied line by line. But I'm happy to see you reinforce the point (from your own much more extensive experience with TM grammars than mine) that Oniguruma is also very vulnerable. As I mentioned, although different regex engines can have different built-in optimizations that can avoid or mitigate certain cases, all backtracking regex engines are vulnerable.
My comment was based on the assumption that there are one or more regexes in the C++ grammar triggering superlinear backtracking.

The more I think about this, though, the more I think there's a good chance it's a fixable (but unfortunately pretty deep) problem in the design of my Oniguruma-To-ES transpiler.
My claim (not deeply investigated, just a hunch based on how much atomic groups and possessive quantifiers are used) is that these backtracking-control features are being used even in many cases when they have no potential benefit and can't change what's matched or have a meaningful effect on backtracking behavior. In the context of Oniguruma (and other regex engines that support backtracking control natively), there is no cost to overusing these features like this. But alas, in the case of Oniguruma-To-ES, there is some (non-exponential) cost that can add up in extreme cases.
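For background on that emulation cost: without native atomic groups (pre-ES2025 JS), the standard workaround is the capture-inside-lookahead trick, sketched below with a toy pattern (my own example, not one from the grammar).

```javascript
// (?=(a+))\1 acts like possessive a++: lookarounds aren't backtracked into,
// so the greedy capture can't be shortened after the fact.
const possessiveLike = /^(?=(a+))\1a$/; // demands one extra 'a' after a++
const ordinary = /^a+a$/;               // a+ can give back an 'a'
```

On `'aaa'`, `ordinary` matches (the `a+` backtracks) while `possessiveLike` fails, which is exactly the behavioral difference a transpiler has to preserve.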
If I'm understanding what you mean correctly, I'd be hesitant on that as well, in large part because it wouldn't be bulletproof. It's very possible to trigger catastrophic backtracking with relatively short target strings/lines.
I agree with you, but TextMate engines are unlikely to make fundamental changes that eliminate the risk of runaway backtracking anytime soon.
That's a great debugging step! If the perf problem for Shiki's JS engine just went away, that would indeed identify the source of the problem, and reinforce that I should do the major work that would be needed in Oniguruma-To-ES. But, it might introduce genuine runaway backtracking where there was none before (including with the WASM engine), making it no longer useful as a debugging step.
Sadly (for me), the only explanation I can think of is that Shiki's standard JS engine (which transpiles regexes at runtime) never actually has to deal with transpiling the majority of regexes in the C++ grammar.

Context: Oniguruma-To-ES doesn't use

But it feels like I have some work to do in rewriting the
Profiling in the Chrome dev tools for @gregberge's grammar playground link shows that 10-15 specific regexp

So yeah, that will need to be fixed upstream, but like I offered before, I can probably help fix the regexes if someone identifies the culprits. 🙂

Earlier, I talked a lot about regex construction time and the

However, since I've gone through the work, I'll share the details below…

Refactoring
Are you able to print out what those regex patterns are? That'd be a huge help.
Could you do an optional lazy compile based on the size of the regex string? Then get the benefits of both JIT and precompile.
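A minimal sketch of that suggestion (hypothetical API; Oniguruma-To-ES's actual implementation may differ): construct small regexes eagerly and defer large ones until first use, gated on source length.

```javascript
// Hypothetical size-gated lazy compilation. Small sources pay the (cheap)
// construction cost up front; huge sources defer it until first exec().
function makeLazyRegExp(source, flags = '', threshold = 1000) {
  if (source.length < threshold) return new RegExp(source, flags);
  let compiled = null;
  return {
    exec(str) {
      compiled ??= new RegExp(source, flags); // compile once, on demand
      return compiled.exec(str);
    },
  };
}
```

Callers get the JIT benefit for the common small patterns and skip up-front construction of the pathological multi-kilobyte ones.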
Sounds likely. I am curious how much is just JS overhead though. Is there a general comparison of vscode-oniguruma vs. Shiki showing how much of the cost is inherently JS?
I mean, I'm happy to try and find backtracking; there's probably at least some local-optima ones that still exist, and maybe one big global-optima one that requires incremental breaks that all fix themselves once all the little ones are fixed (not super confident on that though). But it sounds to me like the patterns are just too big.

One of the other things that might play a role is in the TextMate (not Oniguruma) optimizations. The C++ grammar has a lot of patterns that start the same. If this "common start" pattern were matched once, and then ruled in all TM patterns with that regex start, it would be a lot faster than failing for every pattern individually. I have no idea if VS Code is doing an optimization like that, but it's something to consider for end-user performance.
Yeah, C++ has to make heavy use of it as a hack for matching TM pattern-ranges that look the same at the start but diverge later, like a var declaration and function definition. What we want:
What we have to do instead:
And that's the simple case; usually there are alternative pat2's, because a failure of pat2 to match at all means it's actually a different thing altogether. It's really inefficient compared to what it could be. All this to say: if you made a forked TM engine, then C++ could get rid of all the \G usages, while also being more correct.
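For readers unfamiliar with `\G`: it anchors a match at the position where the search began. JavaScript has no `\G`, but the sticky `y` flag is the closest analogue, as this small illustration (my own toy example, unrelated to the actual grammar) shows.

```javascript
// A sticky regex matches only at exactly lastIndex, similar to how \G
// pins Oniguruma matches to the search start position.
const sticky = /\w+/y;
sticky.lastIndex = 4;                    // 'foo' begins at index 4
const hit = sticky.exec('int foo = 1;');
sticky.lastIndex = 3;                    // index 3 is a space: no match
const miss = sticky.exec('int foo = 1;');
```

A non-sticky regex would simply skip ahead to the next word in the `miss` case; the sticky one refuses to, which is the continuation behavior grammars lean on `\G` for.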
If it's possible to identify the regexes in the performance console, I don't know how to. But, by locally hacking some timing code directly into Shiki, I identified the two worst performing regexes, at least when tested against the Shiki sample and gregberge's sample. They both get multiple calls, as well. The following links point to them in the JSON copy of the grammar that Shiki uses (not the JS/precompiled version):
I'll analyze the first one in my next comment.
That's an excellent idea! I'll definitely play around with some variations of this. 😊
V8's Irregexp engine (shared by all modern browsers excluding Safari -- even Firefox uses it) is much faster than Oniguruma via WASM, not because Oniguruma is slow, but because Irregexp is fast. But you can't directly compare Oniguruma and JS RegExp performance except for a small subset of features, because most regex features aren't perfectly equivalent across JS and Oniguruma. Oniguruma-To-ES prides itself on perfect translation of syntax and behavior (and it's because of this that Shiki's JS engine is able to cover the long tail of TM grammars), but the accuracy is not cost free. As a simple example,
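One concrete case (my own example, not the one the comment was going to give): Oniguruma's `\A` and `\z` string anchors have no JS token, so a faithful transpiler must rewrite them, and the rewrite is only valid under certain flag conditions.

```javascript
// Oniguruma: \Afoo\z — anchored to the absolute start/end of the subject.
// A hand-translated JS equivalent (valid only because the m flag is absent;
// with m, ^ and $ would also match at line breaks, changing the meaning):
const translated = /^foo$/;
```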
In the end, some grammars are faster in JS (sometimes dramatically so, and often even with the overhead of transpilation at runtime), but somewhat more grammars (as of today) are faster in Oniguruma. For most grammars, the difference is not dramatic, and so the fact that the JS engine with precompiled grammars only needs ~3 kb of tree-shaken code from Oniguruma-To-ES to run (and can drop the large WASM dependency that's hundreds of kb) is a major win.
I think V8 (not sure about Safari's JavaScriptCore) does struggle with construction of at least some extremely long patterns, for whatever reason. This might only be a meaningful issue for a small number of the very longest patterns in the C++ grammar.
No worries, since Oniguruma-To-ES knows how to translate
Okay, so, with the

At least this backtracking isn't exponential. But it is excessive, and increases somewhat faster than linearly. There are two main contributors to this excessive backtracking that I see...

Excessive backtracking contributor 1

First is the massive number of alternation options within two different negative lookarounds. The negative lookahead that contains a list starting with

Here it is with JSON string escaping removed:
That can be replaced with:
This has multiple benefits:
Excessive backtracking contributor 2

The segment

Notice that of the four quantifiers in this segment, I changed all of them. Both of the possessive quantifiers changed to non-possessive (since I don't think making these whitespace tokens possessive was helping much if at all), and I made the two non-possessive quantifiers possessive. That change cuts backtracking roughly in half for a case like

The idea here is you want to lock in your

I think both of these improvements also apply to the second slow regex. I suspect that simply applying these two improvements to these two slowest regexes will lead to a significant improvement for the C++ grammar in JS. All the better if there are additional regexes with these same opportunities for improvement in the grammar.

Aside: Although not necessarily perf related, you can also replace your use of \W|^.

All of the improvements suggested here will benefit not only Shiki's JS engine, but also Oniguruma. 😊
I've just published Oniguruma-To-ES v3.1.0 (which should be in the next Shiki release). It includes lazy compilation as a feature of its

Based on this discussion, I'm optimistic that C++ will perform at least reasonably well in Shiki's JS engine after the following are completed, without the need for any complex refactoring in Oniguruma-To-ES.

For me, in Oniguruma-To-ES:
For @jeff-hykin, in better-cpp-syntax:
Are you sure? That's how I originally tried it years ago and ran into issues (groups in lookaround not allowed). After hitting that I went and made a helper function that generates all that junk for any list of keywords I give it. I can test it tomorrow and see what happens. Maybe I was originally using it for a lookahead.

I use that helper function EVERYWHERE; it's what prevents normal identifiers from matching keywords. So this change would be a performance enhancement to the whole codebase. Actually, it would be a performance enhancement to basically all the grammars I've made.

BTW, here's what the source code looks like for function_definition, if you're curious.
This one could be big, and help with some known perf issues on VS Code. It won't be a problem to test, so I'll try it whenever I get a chance. I don't understand why not make all of them possessive, though?
This was actually our biggest performance improvement. That block comment pattern was used all over and without the possessives in there, it caused exponential backtracking.
Cool!
The \W|^ thing makes me realize I should ask you something: do you know of any regex optimizers? Basically precompiling of regex, and/or minifying it. For example, getting rid of unnecessary non-capture groups. I have looked for one in the past and haven't found any. If there was one, it could automatically optimize that repetition of \W|^ along with probably plenty of other things that are the result of regex being generated.
Trust me bro. 😆 Oniguruma used to have finicky restrictions on when it allowed variable-length lookbehind (lookahead never had a problem). E.g. when I published this page https://stevenlevithan.com/regex/tests/lookbehind.html back in 2012, Oniguruma allowed But sometime over the years, Oniguruma upgraded its support for variable-length lookbehind! The current restrictions for lookaround contents are not based on length variability:
Oniguruma-To-ES validates all of this perfectly, so you can use its demo page as a tester. If a lookaround can be transpiled, Oniguruma supports it. I've tested this in Oniguruma 6.9.8 (released Apr 2022), which is the version used in Modern Oniguruma's support for variable-length lookbehind is among the best of the regex flavors, although .NET and JavaScript (ES2018) still beat it by offering lookbehind without any restrictions.
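For comparison, a quick demonstration (my own example) that JavaScript's ES2018 lookbehind is fully unrestricted with respect to length variability:

```javascript
// The lookbehind subpattern ab+ has no fixed length, which some engines
// (and older Oniguruma versions) reject; modern JS accepts it natively.
const afterWord = /(?<=ab+)c/;
```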
Nice! 🥳
It's largely a personal preference. I like to be very intentional about every feature in a regex. If I make something possessive or atomic, it's because doing so alters either (1) backtracking behavior or (2) what the regex matches. That way it serves as documentation about intent.

Backtracking behavior can definitely be hard to reason about in complex regexes like this one, so I understand your preference might lean the other way: to try to protect against undesired backtracking even in situations where it's not a risk, so that you don't have to prove to yourself it's not a risk. But then, I'm also partly motivated by making these regexes faster in JS, and since each possessive quantifier has a small search-time cost to emulate in JS, I'd prefer to remove them in already-slow regexes where they're identified to not be meaningfully beneficial.

Up to you, though. Making the other changes should still be beneficial even if you prefer to not de-possessivize anything.
To be safe, an optimizer has to be specific to the regex flavor it's targeting. I'm not aware of any Oniguruma regex optimizers. But here are a couple for other flavors:
I have some very basic optimizations already built into Oniguruma-To-ES. Contributions that add more would be welcome, but then, the library also has to weigh other factors like library weight and transpilation performance. But if you wanted to build your own optimizer, Oniguruma-To-ES's

If you wanted to build this as open source:
So it sounds like a cool project that could have a lot of impact, if you or anyone else wanted to take this on.
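As a toy illustration of the kind of pass such an optimizer would implement (a deliberately naive string hack of my own; a real tool needs a full parser to handle escapes, nesting, and flavor-specific semantics):

```javascript
// Unwrap non-capturing groups that contain a single word character, e.g.
// (?:a) → a. Breaks on escaped parens and nested groups; illustration only.
function unwrapTrivialGroups(source) {
  return source.replace(/\(\?:(\w)\)/g, '$1');
}
```

Generated grammars accumulate exactly this kind of redundant wrapping, which is why a parser-based version of such a pass could pay off across many patterns.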
It didn't have to. The problem, no doubt, was not that the quantifiers weren't possessive, but that there were overlapping quantifiers that offered multiple ways to split up the work of matching the same strings (leading to trying every possible permutation before giving up on a failing match). So, for example, another way to solve the backtracking problem would have been to use
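A classic textbook illustration of that overlapping-quantifier point (a toy pattern, not the grammar's actual block-comment regex):

```javascript
// (?:a+)+ lets the engine partition a run of 'a's in exponentially many
// ways, all of which get explored when the required trailing 'b' is absent.
const pathological = /^(?:a+)+b$/;
// With the overlap removed, each 'a' can only be consumed one way:
const restructured = /^a+b$/;
```

Both regexes match exactly the same strings; only the amount of work done on failing inputs differs (try each against a long run of `a`'s with no `b`).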
Even better, for cases that appear in lookbehind and where the alternatives all start with a word char, you can just use |
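Presumably the elided suggestion here is a lone `\b` (an assumption on my part, since the token was lost in extraction). An illustrative before/after with hypothetical keywords:

```javascript
// Before: each use of the list drags along a non-word-or-start prefix.
const verbose = /(?<=(?:\W|^)(?:if|else|while))\(/;
// After: since every alternative begins with a word char, one \b suffices
// and, unlike \W|^, it consumes nothing.
const simpler = /(?<=\b(?:if|else|while))\(/;
```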
When running the entire grammar on that code snippet: as @slevithan stated, Oniguruma now allows variable length inside lookbehinds. I agree with the optimization; it's one of the optimizations that vscode-oniguruma/textmate deploys.
Nice, and good to know. Although Oniguruma-To-ES can't currently be used as an Oniguruma optimizer (because it only outputs JS regex syntax), the JS it outputs also includes those optimizations and several more. So people are currently getting those effects in Shiki's precompiled JS grammars. |
Describe the bug
It takes ~10s on my MacBook M1 to highlight that code with JS RegExp engine, shiki v1.27.2.
Reproduction
https://textmate-grammars-themes.netlify.app/?theme=vitesse-dark&grammar=cpp&code=TMap%3CFString%2C+FString%3E+WsUpgradeHeaders%3B%0AWsUpgradeHeaders.Add%28TEXT%28%22Host%22%29%2C+TEXT%28%22127.0.0.1%3A7788%22%29%29%3B%0AWebSocket+%3D+FWebSocketsModule%3A%3AGet%28%29.CreateWebSocket%28WebSocketAddress%2C+TEXT%28%22ws%22%29%2C+WsUpgradeHeaders%29%3B