
Natural language queries exhibit unexpected behavior when processing Chinese text. #2472

Open
MochiXu opened this issue Aug 8, 2024 · 1 comment


MochiXu commented Aug 8, 2024

Describe the bug

Currently, the natural language query feature works well in an English environment. For example, Tantivy parses a query like "(Who is Obama) OR (good boy)" into a BooleanQuery, with each subquery composed of TermQuerys:

BooleanQuery {
    subqueries: [
        (Should, BooleanQuery {
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))), 
                (Should, TermQuery(Term(field=1, type=Str, "is"))), 
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ] 
        }), 
        (Should, BooleanQuery { 
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "good"))), 
                (Should, TermQuery(Term(field=1, type=Str, "boy")))
            ] })
    ] 
}

This looks quite reasonable. However, in a Chinese language environment, unexpected behavior occurs. For example, when parsing the query "(Who is Obama) OR 伊文斯隐瞒秘密", Tantivy interprets the Chinese part as a PhraseQuery:

BooleanQuery {
    subqueries: [
        (Should, BooleanQuery { 
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))), 
                (Should, TermQuery(Term(field=1, type=Str, "is"))), 
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ] 
        }), 
        (Should, PhraseQuery { 
             field: Field(1), phrase_terms: [
                 (0, Term(field=1, type=Str, "伊文")), 
                 (1, Term(field=1, type=Str, "伊文斯")), 
                 (2, Term(field=1, type=Str, "隐瞒")), 
                 (3, Term(field=1, type=Str, "秘密"))], slop: 0 
         })
] }
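For context on why the phrase appears: once a tokenizer splits a single query literal into several tokens, the parser's multi-token path produces a PhraseQuery. Below is a minimal, self-contained model of that decision; the names (`parse_literal`, the whitespace `tokenize`) are illustrative stand-ins, not Tantivy's actual API.

```rust
// A minimal model (NOT Tantivy's actual code) of how the query parser
// handles one query literal: a single token becomes a TermQuery, while
// multiple tokens from the same literal become a PhraseQuery.
#[derive(Debug, PartialEq)]
enum Query {
    Term(String),
    Phrase(Vec<String>), // tokens must occur adjacently, in order
}

// Stand-in for a CJK-aware tokenizer; real tokenizers like Cang-jie split
// on word boundaries, we simply split on whitespace for illustration.
fn tokenize(literal: &str) -> Vec<String> {
    literal.split_whitespace().map(str::to_owned).collect()
}

fn parse_literal(literal: &str) -> Query {
    let mut tokens = tokenize(literal);
    if tokens.len() == 1 {
        Query::Term(tokens.pop().unwrap())
    } else {
        Query::Phrase(tokens)
    }
}

fn main() {
    // A single-token literal stays a term...
    assert_eq!(parse_literal("obama"), Query::Term("obama".into()));
    // ...but once a CJK literal is split into several tokens, the
    // multi-token path turns it into a phrase, as in the output above.
    assert_eq!(
        parse_literal("伊文 隐瞒 秘密"),
        Query::Phrase(vec!["伊文".into(), "隐瞒".into(), "秘密".into()])
    );
}
```

With the default tokenizer the whole CJK string stays one token, so the issue only surfaces after a Chinese tokenizer is plugged in.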

This behavior differs from what we expect. When parsing Chinese, we expect the individual tokens to also be combined with Should, as demonstrated in the expected output below.

BooleanQuery {
    subqueries: [
        (Should, BooleanQuery { 
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))), 
                (Should, TermQuery(Term(field=1, type=Str, "is"))), 
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ] 
        }), 
        (Should, BooleanQuery { 
             subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "伊文"))), 
                (Should, TermQuery(Term(field=1, type=Str, "伊文斯"))), 
                (Should, TermQuery(Term(field=1, type=Str, "隐瞒"))),
                (Should, TermQuery(Term(field=1, type=Str, "秘密")))
             ]
         })
] }

Which version of tantivy are you using?
Our tantivy-search is based on Tantivy 0.21.1.

To Reproduce

Out of the box, Tantivy does not ship a Chinese tokenizer: the default tokenizer treats "伊文斯隐瞒秘密" as a single token. We have integrated the Cang-jie and ICU tokenizers into tantivy-search, which can properly tokenize Chinese text.

To reproduce the abnormal parsing behavior of natural language queries for Chinese, you may need to first integrate a simple Cang-jie tokenizer into Tantivy. Then, use the following code to recreate the scenario:

let sentence = "(Who is Obama) OR 伊文斯隐瞒秘密";
let text_query: Box<dyn Query> = parser.parse_query(sentence).unwrap();
println!("{:?}", text_query);

MochiXu commented Aug 8, 2024

Is it possible to consider adding a new subquery type (such as a TermsQuery) to LogicalLiteral, and introducing a special character in the natural language query syntax to mark languages such as Chinese and Japanese? These are the only potential solutions I can think of at the moment.
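To make the proposal concrete, here is a self-contained sketch of what a `TermsQuery`-style literal could lower to; the names and shapes are hypothetical (this is not an existing Tantivy API), but the idea is to combine the tokens of one literal with Should instead of turning them into a phrase.

```rust
// Hypothetical lowering for a TermsQuery-style logical literal: the tokens
// of one literal become a Should-combined BooleanQuery, not a PhraseQuery.
#[derive(Debug, PartialEq)]
enum Occur {
    Should,
}

#[derive(Debug, PartialEq)]
enum Query {
    Term(String),
    Boolean(Vec<(Occur, Query)>),
}

// Turn a list of tokens into a disjunction of TermQuerys.
fn terms_to_should(tokens: Vec<String>) -> Query {
    Query::Boolean(
        tokens
            .into_iter()
            .map(|t| (Occur::Should, Query::Term(t)))
            .collect(),
    )
}

fn main() {
    let q = terms_to_should(vec!["伊文".into(), "隐瞒".into(), "秘密".into()]);
    match q {
        Query::Boolean(subs) => {
            // Every token ends up as its own Should-combined TermQuery,
            // matching the expected output shown earlier in this issue.
            assert_eq!(subs.len(), 3);
            assert!(subs.iter().all(|(occ, _)| *occ == Occur::Should));
        }
        _ => panic!("expected a BooleanQuery"),
    }
}
```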
