Describe the bug
Currently, when using the natural language query feature, it works well in an English environment. For example, with a query like "(Who is Obama) OR (good boy)", Tantivy parses it into a BooleanQuery, with each subquery composed using TermQuery:
This looks quite reasonable. However, in a Chinese language environment, unexpected behavior occurs. For example, when parsing the query "(Who is Obama) OR 伊文斯隐瞒秘密", Tantivy interprets the Chinese part as a PhraseQuery:
This behavior differs from what we expect. When parsing Chinese, we expect the parser to also combine the individual tokens with Should, rather than wrapping them in a PhraseQuery.
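To make the difference concrete, here is a minimal, self-contained sketch of the two behaviors. It does not use Tantivy: `SketchQuery`, `literal_as_phrase`, and `literal_as_should` are hypothetical names invented for illustration, and the segmentation of "伊文斯隐瞒秘密" into three tokens is an assumption about what a Chinese tokenizer might return, not actual Cang-jie output.

```rust
// Simplified stand-in for a query tree. These are NOT Tantivy's types.
#[derive(Debug, PartialEq)]
enum SketchQuery {
    Term(String),
    Phrase(Vec<String>),
    Should(Vec<SketchQuery>),
}

// Reported behavior: a literal that yields several tokens becomes a phrase.
fn literal_as_phrase(tokens: &[&str]) -> SketchQuery {
    match tokens {
        [single] => SketchQuery::Term(single.to_string()),
        _ => SketchQuery::Phrase(tokens.iter().map(|t| t.to_string()).collect()),
    }
}

// Expected behavior: each token becomes its own term, combined with Should.
fn literal_as_should(tokens: &[&str]) -> SketchQuery {
    match tokens {
        [single] => SketchQuery::Term(single.to_string()),
        _ => SketchQuery::Should(
            tokens
                .iter()
                .map(|t| SketchQuery::Term(t.to_string()))
                .collect(),
        ),
    }
}
```

With the assumed tokens `["伊文斯", "隐瞒", "秘密"]`, `literal_as_phrase` returns one `Phrase` node, which requires the tokens to appear adjacent and in order, while `literal_as_should` returns three independent `Term` nodes, so a document matching any one of them can be returned.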
Which version of tantivy are you using?
Our tantivy-search is based on Tantivy 0.21.1.
To Reproduce
Tantivy does not appear to ship a Chinese tokenizer in its current code. With the default tokenizer, "伊文斯隐瞒秘密" is treated as a single token. We have integrated the Cang-jie and ICU tokenizers into tantivy-search, which can properly tokenize Chinese text.
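The single-token result can be illustrated without Tantivy. The sketch below assumes the default tokenizer splits on non-alphanumeric characters (that is our understanding, and it ignores other steps such as lowercasing); `split_on_non_alphanumeric` is a hypothetical helper, not a Tantivy function.

```rust
// Standalone approximation: split text wherever a character is not
// alphanumeric, and drop the empty pieces. This is NOT Tantivy's code.
fn split_on_non_alphanumeric(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|piece| !piece.is_empty())
        .collect()
}
```

Because Rust's `char::is_alphanumeric` is true for CJK ideographs and Chinese text has no spaces between words, "伊文斯隐瞒秘密" comes back as one piece, whereas "Who is Obama" is split into three.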
To reproduce the unexpected parsing of Chinese natural language queries, you may first need to integrate a simple Cang-jie tokenizer into Tantivy. Then use the following code to recreate the scenario:
let sentence = "(Who is Obama) OR 伊文斯隐瞒秘密";
let text_query: Box<dyn Query> = parser.parse_query(sentence).unwrap();
println!("{:?}", text_query);
Is it possible to consider adding a new subquery type (such as a TermsQuery) to LogicalLiteral and introducing a special character in the natural language query to represent special languages (such as Chinese, Japanese, etc.)? Currently, these are the only potential solutions I can think of.