diff --git a/docs/sql-manual/sql-functions/string-functions/ngram-search.md b/docs/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000000000..ae42731b9904d
--- /dev/null
+++ b/docs/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,67 @@
+---
+{
+ "title": "NGRAM_SEARCH",
+ "language": "en"
+}
+---
+
+
+
+## Description
+
+Calculates the N-gram similarity between `text` and `pattern`. The result ranges from 0 to 1; a higher value means the two strings are more similar.
+
+Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, the function returns 0.
+
+N-gram similarity is a text-similarity measure based on N-grams. An N-gram is a sequence of N consecutive characters or words extracted from a text string. For example, for the string "text" with N=2, the bigrams are {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as:
+
+2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the sizes of the N-gram sets of `text` and `pattern`, and |Intersection| is the size of the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean that the two strings are identical.
+
+Only ASCII-encoded strings are supported.
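Review note: the formula in the Description above can be sketched in a few lines of Python. This is only an illustration of the set-based formula as documented, not the engine's actual implementation. The helper names `ngrams` and `ngram_similarity` are hypothetical, and the guard for a non-positive `gram_num` is an assumption of this sketch rather than documented behavior.

```python
def ngrams(s: str, n: int) -> set:
    """Return the set of distinct length-n substrings of s (character N-grams)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(text: str, pattern: str, gram_num: int) -> float:
    """Compute 2 * |Intersection| / (|text set| + |pattern set|)."""
    # Assumption of this sketch: treat a non-positive gram_num like a too-short input.
    if gram_num <= 0 or len(text) < gram_num or len(pattern) < gram_num:
        return 0.0
    text_set = ngrams(text, gram_num)
    pattern_set = ngrams(pattern, gram_num)
    shared = text_set & pattern_set  # N-grams that occur in both strings
    return 2 * len(shared) / (len(text_set) + len(pattern_set))


print(ngram_similarity("123456789", "12345", 3))    # 0.6
print(ngram_similarity("abababab", "babababa", 2))  # 1.0
```

Both values agree with the results shown in the Example section of this file. The second call also illustrates why a similarity of 1 does not imply identical strings: `abababab` and `babababa` differ, yet both produce the same bigram set {"ab", "ba"}.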
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+ NGRAM_SEARCH,NGRAM,SEARCH
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/string-functions/ngram-search.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000000000..1a2eecc3cb20b
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,67 @@
+---
+{
+ "title": "NGRAM_SEARCH",
+ "language": "zh-CN"
+}
+---
+
+
+
+## Description
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+计算 text 和 pattern 的 N-gram 相似度。相似度从 0 到 1,相似度越高证明两个字符串越相似。
+其中`pattern`,`gram_num`必须为常量。
+如果`text`或者`pattern`的长度小于`gram_num`,返回 0。
+
+N-gram 相似度(N-gram similarity)是一种基于 N-gram(N 元语法)的文本相似度计算方法。N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如,对于字符串“text”,当 N=2 时,其二元组(bi-gram)为:{“te”, “ex”, “xt”}。
+
+N-gram 相似度的计算为 2 * |Intersection| / (|text set| + |pattern set|)
+
+其中|text set|,|pattern set|为 text 和 pattern 的 N-gram,`Intersection`为两个集合的交集。
+
+注意,根据定义,相似度为 1 不代表两个字符串相同。
+
+仅支持 ASCII 编码。
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+ NGRAM_SEARCH,NGRAM,SEARCH
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000000000..e080165939e41
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,65 @@
+---
+{
+ "title": "NGRAM_SEARCH",
+ "language": "zh-CN"
+}
+---
+
+
+
+## Description
+
+计算 text 和 pattern 的 N-gram 相似度。相似度从 0 到 1,相似度越高证明两个字符串越相似。
+其中`pattern`,`gram_num`必须为常量。
+如果`text`或者`pattern`的长度小于`gram_num`,返回 0。
+
+N-gram 相似度(N-gram similarity)是一种基于 N-gram(N 元语法)的文本相似度计算方法。N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如,对于字符串“text”,当 N=2 时,其二元组(bi-gram)为:{“te”, “ex”, “xt”}。
+
+N-gram 相似度的计算为 2 * |Intersection| / (|text set| + |pattern set|)
+
+其中|text set|,|pattern set|为 text 和 pattern 的 N-gram,`Intersection`为两个集合的交集。
+
+注意,根据定义,相似度为 1 不代表两个字符串相同。
+
+仅支持 ASCII 编码。
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+ NGRAM_SEARCH,NGRAM,SEARCH
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000000000..e080165939e41
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,65 @@
+---
+{
+ "title": "NGRAM_SEARCH",
+ "language": "zh-CN"
+}
+---
+
+
+
+## Description
+
+计算 text 和 pattern 的 N-gram 相似度。相似度从 0 到 1,相似度越高证明两个字符串越相似。
+其中`pattern`,`gram_num`必须为常量。
+如果`text`或者`pattern`的长度小于`gram_num`,返回 0。
+
+N-gram 相似度(N-gram similarity)是一种基于 N-gram(N 元语法)的文本相似度计算方法。N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如,对于字符串“text”,当 N=2 时,其二元组(bi-gram)为:{“te”, “ex”, “xt”}。
+
+N-gram 相似度的计算为 2 * |Intersection| / (|text set| + |pattern set|)
+
+其中|text set|,|pattern set|为 text 和 pattern 的 N-gram,`Intersection`为两个集合的交集。
+
+注意,根据定义,相似度为 1 不代表两个字符串相同。
+
+仅支持 ASCII 编码。
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+ NGRAM_SEARCH,NGRAM,SEARCH
diff --git a/sidebars.json b/sidebars.json
index 4f7831a16a06c..fc9c9711ae583 100644
--- a/sidebars.json
+++ b/sidebars.json
@@ -895,6 +895,7 @@
 "sql-manual/sql-functions/string-functions/split-by-regexp",
 "sql-manual/sql-functions/string-functions/substring-index",
 "sql-manual/sql-functions/string-functions/money-format",
+ "sql-manual/sql-functions/string-functions/ngram-search",
 "sql-manual/sql-functions/string-functions/parse-url",
 "sql-manual/sql-functions/string-functions/quote",
 "sql-manual/sql-functions/string-functions/url-decode",
diff --git a/versioned_docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md b/versioned_docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000000000..a39c0f6a0976d
--- /dev/null
+++ b/versioned_docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,69 @@
+---
+{
+ "title": "NGRAM_SEARCH",
+ "language": "en"
+}
+---
+
+
+
+## Description
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+Calculates the N-gram similarity between `text` and `pattern`. The result ranges from 0 to 1; a higher value means the two strings are more similar.
+
+Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, the function returns 0.
+
+N-gram similarity is a text-similarity measure based on N-grams. An N-gram is a sequence of N consecutive characters or words extracted from a text string. For example, for the string "text" with N=2, the bigrams are {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as:
+
+2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the sizes of the N-gram sets of `text` and `pattern`, and |Intersection| is the size of the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean that the two strings are identical.
+
+Only ASCII-encoded strings are supported.
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+ NGRAM_SEARCH,NGRAM,SEARCH
diff --git a/versioned_docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md b/versioned_docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000000000..ae42731b9904d
--- /dev/null
+++ b/versioned_docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md
@@ -0,0 +1,67 @@
+---
+{
+ "title": "NGRAM_SEARCH",
+ "language": "en"
+}
+---
+
+
+
+## Description
+
+Calculates the N-gram similarity between `text` and `pattern`. The result ranges from 0 to 1; a higher value means the two strings are more similar.
+
+Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, the function returns 0.
+
+N-gram similarity is a text-similarity measure based on N-grams. An N-gram is a sequence of N consecutive characters or words extracted from a text string. For example, for the string "text" with N=2, the bigrams are {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as:
+
+2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the sizes of the N-gram sets of `text` and `pattern`, and |Intersection| is the size of the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean that the two strings are identical.
+
+Only ASCII-encoded strings are supported.
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+ NGRAM_SEARCH,NGRAM,SEARCH
diff --git a/versioned_sidebars/version-2.1-sidebars.json b/versioned_sidebars/version-2.1-sidebars.json
index 07464482c3c0f..7572a6b544ded 100644
--- a/versioned_sidebars/version-2.1-sidebars.json
+++ b/versioned_sidebars/version-2.1-sidebars.json
@@ -840,6 +840,7 @@
 "sql-manual/sql-functions/string-functions/split-by-string",
 "sql-manual/sql-functions/string-functions/substring-index",
 "sql-manual/sql-functions/string-functions/money-format",
+ "sql-manual/sql-functions/string-functions/ngram-search",
 "sql-manual/sql-functions/string-functions/parse-url",
 "sql-manual/sql-functions/string-functions/quote",
 "sql-manual/sql-functions/string-functions/url-decode",
diff --git a/versioned_sidebars/version-3.0-sidebars.json b/versioned_sidebars/version-3.0-sidebars.json
index f82c6e85055e2..d7a7efcfcacb2 100644
--- a/versioned_sidebars/version-3.0-sidebars.json
+++ b/versioned_sidebars/version-3.0-sidebars.json
@@ -885,6 +885,7 @@
 "sql-manual/sql-functions/string-functions/split-by-string",
 "sql-manual/sql-functions/string-functions/substring-index",
 "sql-manual/sql-functions/string-functions/money-format",
+ "sql-manual/sql-functions/string-functions/ngram-search",
 "sql-manual/sql-functions/string-functions/parse-url",
 "sql-manual/sql-functions/string-functions/quote",
 "sql-manual/sql-functions/string-functions/url-decode",