[doc] support ngram_search function (#899)

apache/doris#38226
apache · Sep 27, 2024 · e676a91 · e676a91
1 parent 240c0e7
commit e676a91
Show file tree

Hide file tree

Showing 9 changed files with 403 additions and 0 deletions.
diff --git a/docs/sql-manual/sql-functions/string-functions/ngram-search.md b/docs/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,67 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "en"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
+Calculate the N-gram similarity between `text` and `pattern`. The similarity ranges from 0 to 1, where a higher similarity indicates greater similarity between the two strings. 
+
+Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, return 0.
+
+N-gram similarity is a method for calculating text similarity based on N-grams. An N-gram is a set of continuous N characters or words extracted from a text string. For example, for the string "text" with N=2 (bigram), the bigrams are: {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as:
+
+2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the N-grams of `text` and `pattern`, and `Intersection` is the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical.
+
+Only supports ASCII encoding.
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH
diff --git a/...-content-docs/current/sql-manual/sql-functions/string-functions/ngram-search.md b/...-content-docs/current/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,67 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+计算 text 和 pattern 的 N-gram 相似度。相似度从 0 到 1，相似度越高证明两个字符串越相似。
+其中`pattern`，`gram_num`必须为常量。
+如果`text`或者`pattern`的长度小于`gram_num`，返回 0。
+
+N-gram 相似度（N-gram similarity）是一种基于 N-gram（N 元语法）的文本相似度计算方法。N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如，对于字符串“text”，当 N=2 时，其二元组（bi-gram）为：{“te”, “ex”, “xt”}。
+
+N-gram 相似度的计算为 2 * |Intersection| / (|text set| + |pattern set|)
+
+其中|text set|，|pattern set|为 text 和 pattern 的 N-gram，`Intersection`为两个集合的交集。
+
+注意，根据定义，相似度为 1 不代表两个字符串相同。
+
+仅支持 ASCII 编码。
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH
diff --git a/...tent-docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md b/...tent-docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,65 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
+计算 text 和 pattern 的 N-gram 相似度。相似度从 0 到 1，相似度越高证明两个字符串越相似。
+其中`pattern`，`gram_num`必须为常量。
+如果`text`或者`pattern`的长度小于`gram_num`，返回 0。
+
+N-gram 相似度（N-gram similarity）是一种基于 N-gram（N 元语法）的文本相似度计算方法。N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如，对于字符串“text”，当 N=2 时，其二元组（bi-gram）为：{“te”, “ex”, “xt”}。
+
+N-gram 相似度的计算为 2 * |Intersection| / (|text set| + |pattern set|)
+
+其中|text set|，|pattern set|为 text 和 pattern 的 N-gram，`Intersection`为两个集合的交集。
+
+注意，根据定义，相似度为 1 不代表两个字符串相同。
+
+仅支持 ASCII 编码。
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH
diff --git a/...tent-docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md b/...tent-docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,65 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
+计算 text 和 pattern 的 N-gram 相似度。相似度从 0 到 1，相似度越高证明两个字符串越相似。
+其中`pattern`，`gram_num`必须为常量。
+如果`text`或者`pattern`的长度小于`gram_num`，返回 0。
+
+N-gram 相似度（N-gram similarity）是一种基于 N-gram（N 元语法）的文本相似度计算方法。N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如，对于字符串“text”，当 N=2 时，其二元组（bi-gram）为：{“te”, “ex”, “xt”}。
+
+N-gram 相似度的计算为 2 * |Intersection| / (|text set| + |pattern set|)
+
+其中|text set|，|pattern set|为 text 和 pattern 的 N-gram，`Intersection`为两个集合的交集。
+
+注意，根据定义，相似度为 1 不代表两个字符串相同。
+
+仅支持 ASCII 编码。
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH
diff --git a/sidebars.json b/sidebars.json
@@ -895,6 +895,7 @@
                                 "sql-manual/sql-functions/string-functions/split-by-regexp",
                                 "sql-manual/sql-functions/string-functions/substring-index",
                                 "sql-manual/sql-functions/string-functions/money-format",
+                                "sql-manual/sql-functions/string-functions/ngram-search",
                                 "sql-manual/sql-functions/string-functions/parse-url",
                                 "sql-manual/sql-functions/string-functions/quote",
                                 "sql-manual/sql-functions/string-functions/url-decode",

diff --git a/...oned_docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md b/...oned_docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,69 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "en"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+Calculate the N-gram similarity between `text` and `pattern`. The similarity ranges from 0 to 1, where a higher similarity indicates greater similarity between the two strings. 
+
+Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, return 0.
+
+N-gram similarity is a method for calculating text similarity based on N-grams. An N-gram is a set of continuous N characters or words extracted from a text string. For example, for the string "text" with N=2 (bigram), the bigrams are: {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as:
+
+2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the N-grams of `text` and `pattern`, and `Intersection` is the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical.
+
+Only supports ASCII encoding.
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH