scribble lexer returns an empty token #14

rfindler · 2022-11-30T16:56:39Z

Here's the (non-random) test case that illustrates the bug. The function try should never raise an error (it is a simplification of the checks done by the option in the contract). The string big-s is the one that drdr found and it shrinks to "^|{|\\".

#lang racket
(require syntax-color/scribble-lexer)

(define (try s)
  (define 3ary-lexer (make-scribble-inside-lexer #:command-char #\^))
  (define in (open-input-string s))
  (port-count-lines! in)
  (let loop ([mode #f][offset 0])
    (define-values (txt type paren start end backup new-mode)
      (3ary-lexer in offset mode))
    (cond
      [(equal? type 'eof) #t]
      [else
       (unless (< start end)
         (error 'whoops! "start ~s end ~s" start end))
       (loop new-mode end)])))

;; returns #t if the test case failed, #f if it didn't fail
;; so `s` is helpful when this returns #t
(define (try/bool s)
  (with-handlers ([(λ (x) (and (exn:fail? x)
                               (regexp-match #rx"whoops" (exn-message x))))
                   (λ (x) #t)])
    (try s)
    #f))

(define big-s
  "\"\u0099\"\u009E\u0095Ýäλ|Á5|\u008D)è_)4\u0097(¬{\u009D })Ó|@^|{E cTuÐÆ}\u009E|\\|\u0017)\u0099j\u009A")

(define (get-candidates s)
  (append
   (for/list ([i (in-range (string-length s))])
     (string-append (substring s 0 i)
                    (substring s (+ i 1) (string-length s))))
   (for/list ([i (in-range (string-length s))]
              #:unless (equal? (string-ref s i) #\a))
     (string-append (substring s 0 i)
                    "a"
                    (substring s (+ i 1) (string-length s))))))

(define (shrink s)
  (let loop ([s s])
    (cond
      [(try/bool s)
       (define candidates (get-candidates s))
       (cond
         [(null? candidates) s]
         [else
          (or (for/or ([candidate (in-list candidates)])
                (loop candidate))
              s)])]
      [else #f])))

(shrink big-s)

The text was updated successfully, but these errors were encountered:

mflatt · 2022-12-01T03:42:46Z

I think the problem is line 195 of scribble-lexer.rkt as modified by commit fa9de50. There's a regexp-quote there whose result is used in [...], but quoting rules are different in [...] than expected by regexp-quote.

One possible solution is to make that line

(regexp-quote at-bytes) #"|{)|(?="

This change would also fix a problem with an at character that encodes to multiple bytes.

Meanwhile, I think the regexp-quote on line 40 has the same issue, but it's in [^...], so a different strategy seems necessary. A regexp-quote-for-range function that works on a single character might be a better approach there.

related to #14

rfindler · 2022-12-02T15:18:38Z

It looks like a small variation on that change fixes the bug identified here. Concretely, these changes the tests pass (which includes a new one covering the test found by random testing):

$  git diff
diff --git a/syntax-color-lib/syntax-color/scribble-lexer.rkt b/syntax-color-lib/syntax-color/scribble-lexer.rkt
index 1db28ff..97b4928 100644
--- a/syntax-color-lib/syntax-color/scribble-lexer.rkt
+++ b/syntax-color-lib/syntax-color/scribble-lexer.rkt
@@ -192,7 +192,7 @@
                                      (byte-rx #"^[|]" re-opener #"{")
                                      (byte-rx #".*?(?:(?=[|]"
                                               re-opener
-                                              #"[" (regexp-quote at-bytes) #"{])|(?="
+                                              #"(" (regexp-quote at-bytes) #"|{))|(?="
                                               closer
                                               #")|(?=[\r\n])|$)")
                                      '|{|  ;; Better complex paren?
diff --git a/syntax-color-test/tests/syntax-color/scribble-lexer.rkt b/syntax-color-test/tests/syntax-color/scribble-lexer.rkt
index 6e7c315..48ce2bb 100644
--- a/syntax-color-test/tests/syntax-color/scribble-lexer.rkt
+++ b/syntax-color-test/tests/syntax-color/scribble-lexer.rkt
@@ -276,3 +276,5 @@
                                            (4 parenthesis)))
 
 (test/chars '(#\^) "\\" '((1 text)))
+
+(test/chars (remove #\\ chars) "@|{|\\" '((3 parenthesis) (2 text)))

But, unfortunately, adding #\λ to the definition of chars on line 59 of tests/syntax-color/scribble-lexer causes a new test case to fail:

FAILED, line 274, current-lexer-char #\λ
 result
'((1 3 parenthesis 0)
  (3 6 parenthesis 2)
  (6 9 text 0)
  (9 11 parenthesis 0)
  (11 13 parenthesis 0))
 is not expected
'((1 6 parenthesis 0) (6 9 text 0) (9 13 parenthesis 0))

This doesn't seem to be an issue with line 40, tho, because regexp-quote does nothing to "λ", which should work out in this case.

rfindler · 2022-12-02T15:21:07Z

Oh, and actually I'm not sure that this added test case is correct -- the actual scribble reader doesn't seem to like @|{|\ so maybe the colorer should be turning something red.

mflatt · 2022-12-03T23:44:43Z

I think the test on line 275 fails because the test assumes that the "at" character is not #\λ. Using #\λ as the canonical Unicode character is like using 42 as the canonical integer. :)

I guess line 40 works out in the case of #\λ, because characters in non-ASCII encodings are already ruled out. Also, regexp-quote only adds backslashes, and backslash is already ruled out, so another one won't hurt. So, it's subtle and depends on implementation details that regexp-quote doesn't guarantee, but I guess there's no test that will fail.

I don't think any part of @|{|\ should be red. It can be completed with }|, and the added test case looks right to me.

related to #14

rfindler · 2022-12-04T01:07:47Z

I think the test on line 275 fails because the test assumes that the "at" character is not #\λ. Using #\λ as the canonical Unicode character is like using 42 as the canonical integer. :)

Oh! I totally didn't realize that \u3BB is λ! Duh. Thanks, I've pushed that. I've also added #\α to give the random tester a multi-byte something to work with.

I guess line 40 works out in the case of #\λ, because characters in non-ASCII encodings are already ruled out. Also, regexp-quote only adds backslashes, and backslash is already ruled out, so another one won't hurt. So, it's subtle and depends on implementation details that regexp-quote doesn't guarantee, but I guess there's no test that will fail.

I think I'm with you, but here's my reasoning and a suggested change. First off, it looks to me like this one doesn't have the multi-byte problem because it is a string regexp. Looking over the syntax for regexps, anything that matches the riliteral part of the grammar should be fine, which leaves us only to worry about ], -, and ^. Since this is going to go in the first position following the [^ the - and ] are both going to be fine, just as is.

Staring at the grammar, I think that what happens with ^ is that it actually gets turned into \^ by regexp-quote, which means it ends up ruling out both \ and ^, going via this expansion of the productions:

<atom> ->
[^<rng>] -> 
[^<mrng>] -> 
[^<lirng>] -> 
[^<lirng><lrng>] -> 
[^<riliteral><lrng>] ->
[^\<lrng>] ->
[^\^]

(where I'm leaving out all the other stuff in the set that follows.)

Given all that, what do you think about changing that line 40 from this:

(regexp (format "^[|]([^~aa-zA-Z0-9 \t\r\n\f\\\177-\377{]*){" (regexp-quote (string at))))

to this:

  (regexp (format "^[|]([^~aa-zA-Z0-9 \t\r\n\f\\\177-\377]*){"
                  (if (equal? at #\^)
                      "{^"
                      (format "~a{" at))))

that is, if we have a ^ as the at character, then we prefix it with { (just to get something in there so ^ can go second), otherwise we put the at character first (in case it is - or ] (.... Hm, maybe ] isn't allowed anyway?)) and follow it with {. It didn't have to be { but I just grabbed that one from the end to have something to put there. This change passes the tests, FWIW.

rfindler · 2022-12-04T01:10:00Z

I don't think any part of @|{|\ should be red. It can be completed with }|, and the added test case looks right to me.

When I run this program:

#lang scribble/base
@|{|\

I get a read error: "unbalanced |" raised from scribble/reader, looks like on line 566. I am not sure I'm parsing this monstrosity correctly, tho, so I might be invoking something wrong when trying to run a test? My best guess is that we've got an opener spelled |{ which needs a |} closer and I agree that wouldn't be a lexer error (although the scribble lexer is already matching parens somehow I suppose). But the |\ that follows looks like it needs a | and that might be a legit lexer error?

(probably) Relatedly, in this program:

#lang scribble/base
@(define x 1)
@|{ x }|

I see an arrow from the definition of x to the use of x, but the use is colored green, not blue. Maybe that's the real problem here?

mflatt · 2022-12-04T01:51:14Z

For line 40, I think ^ is ok just like -, since it will be after a ^ within [...]. As far as I can tell now, just changing (regexp-quote (string at)) to at would work, but I wouldn't be surprised if I'm missing something.

For the @|{\| example, it turns out that the readers for #lang at-exp racket and #lang scribble treat it differently, which surprises me and probably was not intended. The difference is whether the at-exp reader starts in S-expression or text mode. I was viewing @|{ as an opener, which is what happens in S-expression mode like starting in #lang at-exp racket. But another interpretation is @ followed by an S-expression within |...|, which is the parse that text mode and starting in #lang scribble implement. Probably it's a bad idea to change the reader, and then the Scribble lexer needs to change to implement the different choice in text mode.

rfindler · 2022-12-04T02:00:18Z

For line 40, I think ^ is ok just like -, since it will be after a ^ within [...]. As far as I can tell now, just changing (regexp-quote (string at)) to at would work, but I wouldn't be surprised if I'm missing something.

The test cases all pass when I make that change, but I also see that they pass when I change (regexp-quote (string at)) to "" too, so perhaps we need more tests.

I see what you're saying about ^ working in regexps like [^^], tho. Indeed, it looks like [^^] accepts any one character except ^. Perhaps the documentation needs an adjustment? I see that it isn't a simple change, because this regexp: [^^]] does not mean "any character except two characters ^ and ]". So there is something special about ^ in rng somehow.

rfindler added a commit that referenced this issue Dec 2, 2022

add a little bit more information into the test failure messages

06f5504

related to #14

rfindler added a commit that referenced this issue Dec 4, 2022

Fix bug found by random testing

6c3cc96

related to #14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scribble lexer returns an empty token #14

scribble lexer returns an empty token #14

rfindler commented Nov 30, 2022

mflatt commented Dec 1, 2022

rfindler commented Dec 2, 2022

rfindler commented Dec 2, 2022

mflatt commented Dec 3, 2022 •

edited

Loading

rfindler commented Dec 4, 2022

rfindler commented Dec 4, 2022

mflatt commented Dec 4, 2022

rfindler commented Dec 4, 2022

scribble lexer returns an empty token #14

scribble lexer returns an empty token #14

Comments

rfindler commented Nov 30, 2022

mflatt commented Dec 1, 2022

rfindler commented Dec 2, 2022

rfindler commented Dec 2, 2022

mflatt commented Dec 3, 2022 • edited Loading

rfindler commented Dec 4, 2022

rfindler commented Dec 4, 2022

mflatt commented Dec 4, 2022

rfindler commented Dec 4, 2022

mflatt commented Dec 3, 2022 •

edited

Loading