Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scribble lexer returns an empty token #14

Open
rfindler opened this issue Nov 30, 2022 · 8 comments
Open

scribble lexer returns an empty token #14

rfindler opened this issue Nov 30, 2022 · 8 comments

Comments

@rfindler
Copy link
Member

This was found by random testing.

Here's the (non-random) test case that illustrates the bug. The function try should never raise an error (it is a simplification of the checks done by the option in the contract). The string big-s is the one that drdr found and it shrinks to "^|{|\\".

#lang racket
(require syntax-color/scribble-lexer)

(define (try s)
  (define 3ary-lexer (make-scribble-inside-lexer #:command-char #\^))
  (define in (open-input-string s))
  (port-count-lines! in)
  (let loop ([mode #f][offset 0])
    (define-values (txt type paren start end backup new-mode)
      (3ary-lexer in offset mode))
    (cond
      [(equal? type 'eof) #t]
      [else
       (unless (< start end)
         (error 'whoops! "start ~s end ~s" start end))
       (loop new-mode end)])))

;; returns #t if the test case failed, #f if it didn't fail
;; so `s` is helpful when this returns #t
(define (try/bool s)
  (with-handlers ([(λ (x) (and (exn:fail? x)
                               (regexp-match #rx"whoops" (exn-message x))))
                   (λ (x) #t)])
    (try s)
    #f))

(define big-s
  "\"\u0099\"\u009E\u0095Ýäλ|Á5|\u008D)è_)4\u0097(¬{\u009D })Ó|@^|{E cTuÐÆ}\u009E|\\|\u0017)\u0099j\u009A")

(define (get-candidates s)
  (append
   (for/list ([i (in-range (string-length s))])
     (string-append (substring s 0 i)
                    (substring s (+ i 1) (string-length s))))
   (for/list ([i (in-range (string-length s))]
              #:unless (equal? (string-ref s i) #\a))
     (string-append (substring s 0 i)
                    "a"
                    (substring s (+ i 1) (string-length s))))))

(define (shrink s)
  (let loop ([s s])
    (cond
      [(try/bool s)
       (define candidates (get-candidates s))
       (cond
         [(null? candidates) s]
         [else
          (or (for/or ([candidate (in-list candidates)])
                (loop candidate))
              s)])]
      [else #f])))

(shrink big-s)
@mflatt
Copy link
Member

mflatt commented Dec 1, 2022

I think the problem is line 195 of scribble-lexer.rkt as modified by commit fa9de50. There's a regexp-quote there whose result is used in [...], but quoting rules are different in [...] than expected by regexp-quote.

One possible solution is to make that line

(regexp-quote at-bytes) #"|{)|(?="

This change would also fix a problem with an at character that encodes to multiple bytes.

Meanwhile, I think the regexp-quote on line 40 has the same issue, but it's in [^...], so a different strategy seems necessary. A regexp-quote-for-range function that works on a single character might be a better approach there.

@rfindler
Copy link
Member Author

rfindler commented Dec 2, 2022

It looks like a small variation on that change fixes the bug identified here. Concretely, these changes the tests pass (which includes a new one covering the test found by random testing):

$  git diff
diff --git a/syntax-color-lib/syntax-color/scribble-lexer.rkt b/syntax-color-lib/syntax-color/scribble-lexer.rkt
index 1db28ff..97b4928 100644
--- a/syntax-color-lib/syntax-color/scribble-lexer.rkt
+++ b/syntax-color-lib/syntax-color/scribble-lexer.rkt
@@ -192,7 +192,7 @@
                                      (byte-rx #"^[|]" re-opener #"{")
                                      (byte-rx #".*?(?:(?=[|]"
                                               re-opener
-                                              #"[" (regexp-quote at-bytes) #"{])|(?="
+                                              #"(" (regexp-quote at-bytes) #"|{))|(?="
                                               closer
                                               #")|(?=[\r\n])|$)")
                                      '|{|  ;; Better complex paren?
diff --git a/syntax-color-test/tests/syntax-color/scribble-lexer.rkt b/syntax-color-test/tests/syntax-color/scribble-lexer.rkt
index 6e7c315..48ce2bb 100644
--- a/syntax-color-test/tests/syntax-color/scribble-lexer.rkt
+++ b/syntax-color-test/tests/syntax-color/scribble-lexer.rkt
@@ -276,3 +276,5 @@
                                            (4 parenthesis)))
 
 (test/chars '(#\^) "\\" '((1 text)))
+
+(test/chars (remove #\\ chars) "@|{|\\" '((3 parenthesis) (2 text)))

But, unfortunately, adding #\λ to the definition of chars on line 59 of tests/syntax-color/scribble-lexer causes a new test case to fail:

FAILED, line 274, current-lexer-char #\λ
 result
'((1 3 parenthesis 0)
  (3 6 parenthesis 2)
  (6 9 text 0)
  (9 11 parenthesis 0)
  (11 13 parenthesis 0))
 is not expected
'((1 6 parenthesis 0) (6 9 text 0) (9 13 parenthesis 0))

This doesn't seem to be an issue with line 40, tho, because regexp-quote does nothing to "λ", which should work out in this case.

@rfindler
Copy link
Member Author

rfindler commented Dec 2, 2022

Oh, and actually I'm not sure that this added test case is correct -- the actual scribble reader doesn't seem to like @|{|\ so maybe the colorer should be turning something red.

@mflatt
Copy link
Member

mflatt commented Dec 3, 2022

I think the test on line 275 fails because the test assumes that the "at" character is not #\λ. Using #\λ as the canonical Unicode character is like using 42 as the canonical integer. :)

I guess line 40 works out in the case of #\λ, because characters in non-ASCII encodings are already ruled out. Also, regexp-quote only adds backslashes, and backslash is already ruled out, so another one won't hurt. So, it's subtle and depends on implementation details that regexp-quote doesn't guarantee, but I guess there's no test that will fail.

I don't think any part of @|{|\ should be red. It can be completed with }|, and the added test case looks right to me.

rfindler added a commit that referenced this issue Dec 4, 2022
@rfindler
Copy link
Member Author

rfindler commented Dec 4, 2022

I think the test on line 275 fails because the test assumes that the "at" character is not #\λ. Using #\λ as the canonical Unicode character is like using 42 as the canonical integer. :)

Oh! I totally didn't realize that \u3BB is λ! Duh. Thanks, I've pushed that. I've also added #\α to give the random tester a multi-byte something to work with.

I guess line 40 works out in the case of #\λ, because characters in non-ASCII encodings are already ruled out. Also, regexp-quote only adds backslashes, and backslash is already ruled out, so another one won't hurt. So, it's subtle and depends on implementation details that regexp-quote doesn't guarantee, but I guess there's no test that will fail.

I think I'm with you, but here's my reasoning and a suggested change. First off, it looks to me like this one doesn't have the multi-byte problem because it is a string regexp. Looking over the syntax for regexps, anything that matches the riliteral part of the grammar should be fine, which leaves us only to worry about ], -, and ^. Since this is going to go in the first position following the [^ the - and ] are both going to be fine, just as is.

Staring at the grammar, I think that what happens with ^ is that it actually gets turned into \^ by regexp-quote, which means it ends up ruling out both \ and ^, going via this expansion of the productions:

<atom> ->
[^<rng>] -> 
[^<mrng>] -> 
[^<lirng>] -> 
[^<lirng><lrng>] -> 
[^<riliteral><lrng>] ->
[^\<lrng>] ->
[^\^]

(where I'm leaving out all the other stuff in the set that follows.)

Given all that, what do you think about changing that line 40 from this:

(regexp (format "^[|]([^~aa-zA-Z0-9 \t\r\n\f\\\177-\377{]*){" (regexp-quote (string at))))

to this:

  (regexp (format "^[|]([^~aa-zA-Z0-9 \t\r\n\f\\\177-\377]*){"
                  (if (equal? at #\^)
                      "{^"
                      (format "~a{" at))))

that is, if we have a ^ as the at character, then we prefix it with { (just to get something in there so ^ can go second), otherwise we put the at character first (in case it is - or ] (.... Hm, maybe ] isn't allowed anyway?)) and follow it with {. It didn't have to be { but I just grabbed that one from the end to have something to put there. This change passes the tests, FWIW.

@rfindler
Copy link
Member Author

rfindler commented Dec 4, 2022

I don't think any part of @|{|\ should be red. It can be completed with }|, and the added test case looks right to me.

When I run this program:

#lang scribble/base
@|{|\

I get a read error: "unbalanced |" raised from scribble/reader, looks like on line 566. I am not sure I'm parsing this monstrosity correctly, tho, so I might be invoking something wrong when trying to run a test? My best guess is that we've got an opener spelled |{ which needs a |} closer and I agree that wouldn't be a lexer error (although the scribble lexer is already matching parens somehow I suppose). But the |\ that follows looks like it needs a | and that might be a legit lexer error?

(probably) Relatedly, in this program:

#lang scribble/base
@(define x 1)
@|{ x }|

I see an arrow from the definition of x to the use of x, but the use is colored green, not blue. Maybe that's the real problem here?

@mflatt
Copy link
Member

mflatt commented Dec 4, 2022

For line 40, I think ^ is ok just like -, since it will be after a ^ within [...]. As far as I can tell now, just changing (regexp-quote (string at)) to at would work, but I wouldn't be surprised if I'm missing something.

For the @|{\| example, it turns out that the readers for #lang at-exp racket and #lang scribble treat it differently, which surprises me and probably was not intended. The difference is whether the at-exp reader starts in S-expression or text mode. I was viewing @|{ as an opener, which is what happens in S-expression mode like starting in #lang at-exp racket. But another interpretation is @ followed by an S-expression within |...|, which is the parse that text mode and starting in #lang scribble implement. Probably it's a bad idea to change the reader, and then the Scribble lexer needs to change to implement the different choice in text mode.

@rfindler
Copy link
Member Author

rfindler commented Dec 4, 2022

For line 40, I think ^ is ok just like -, since it will be after a ^ within [...]. As far as I can tell now, just changing (regexp-quote (string at)) to at would work, but I wouldn't be surprised if I'm missing something.

The test cases all pass when I make that change, but I also see that they pass when I change (regexp-quote (string at)) to "" too, so perhaps we need more tests.

I see what you're saying about ^ working in regexps like [^^], tho. Indeed, it looks like [^^] accepts any one character except ^. Perhaps the documentation needs an adjustment? I see that it isn't a simple change, because this regexp: [^^]] does not mean "any character except two characters ^ and ]". So there is something special about ^ in rng somehow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants