Release 2.3.5

rodrigopivi committed Oct 3, 2019
1 parent 5877011 commit 7275948

Showing 11 changed files with 2,806 additions and 1,921 deletions.
12 changes: 9 additions & 3 deletions gatsby-config.js
@@ -4,14 +4,20 @@ module.exports = {
title: 'Chatito'
},
plugins: [
'gatsby-plugin-typescript',
{
resolve: `gatsby-plugin-typescript`,
options: {},
},
{
resolve: 'gatsby-plugin-page-creator',
options: {
path: `${__dirname}/web/pages`
}
},
'gatsby-plugin-react-helmet',
'gatsby-plugin-styled-components'
]
{
resolve: 'gatsby-plugin-styled-components',
options: {},
},
],
};
4,537 changes: 2,703 additions & 1,834 deletions package-lock.json

Large diffs are not rendered by default.

50 changes: 24 additions & 26 deletions package.json
@@ -1,6 +1,6 @@
{
"name": "chatito",
"version": "2.3.4",
"version": "2.3.5",
"description": "Generate training datasets for NLU chatbots using a simple DSL",
"bin": {
"chatito": "./dist/bin.js"
@@ -48,9 +48,8 @@
"license": "MIT",
"homepage": "https://github.com/rodrigopivi/Chatito",
"dependencies": {
"chance": "1.0.18",
"minimist": "1.2.0",
"wink-tokenizer": "5.2.1"
"chance": "1.1.0",
"minimist": "1.2.0"
},
"jest": {
"transform": {
@@ -74,43 +73,42 @@
]
},
"devDependencies": {
"@babel/core": "7.4.5",
"@types/chance": "1.0.5",
"@babel/core": "7.6.2",
"@types/chance": "1.0.6",
"@types/file-saver": "2.0.1",
"@types/jest": "24.0.15",
"@types/node": "12.0.10",
"@types/react": "16.8.22",
"@types/react-dom": "16.8.4",
"@types/react-helmet": "5.0.8",
"@types/react-router-dom": "4.3.4",
"@types/wink-tokenizer": "4.0.0",
"@types/jest": "24.0.18",
"@types/node": "12.7.9",
"@types/react": "16.9.4",
"@types/react-dom": "16.9.1",
"@types/react-helmet": "5.0.11",
"@types/react-router-dom": "5.1.0",
"babel-loader": "8.0.6",
"babel-plugin-import": "1.12.0",
"babel-plugin-import": "1.12.2",
"babel-plugin-styled-components": "1.10.6",
"codeflask": "1.4.1",
"core-js": "3.1.4",
"core-js": "3.2.1",
"file-saver": "2.0.2",
"gatsby": "2.12.0",
"gatsby-link": "2.2.0",
"gatsby-plugin-react-helmet": "3.1.0",
"gatsby-plugin-styled-components": "3.1.0",
"gatsby-plugin-typescript": "2.1.0",
"gh-pages": "2.0.1",
"jest": "24.8.0",
"gh-pages": "2.1.1",
"jest": "24.9.0",
"pegjs": "0.10.0",
"prettier": "1.18.2",
"react": "16.8.6",
"react-dom": "16.8.6",
"react": "16.10.1",
"react-dom": "16.10.1",
"react-helmet": "5.2.1",
"react-json-view": "1.19.1",
"react-router-dom": "5.0.1",
"regenerator-runtime": "0.13.2",
"styled-components": "4.3.2",
"ts-jest": "24.0.2",
"ts-node": "8.3.0",
"tslint": "5.18.0",
"react-router-dom": "5.1.2",
"regenerator-runtime": "0.13.3",
"styled-components": "4.4.0",
"ts-jest": "24.1.0",
"ts-node": "8.4.1",
"tslint": "5.20.0",
"tslint-config-prettier": "1.18.0",
"tslint-plugin-prettier": "2.0.1",
"typescript": "3.5.2"
"typescript": "3.6.3"
}
}
79 changes: 42 additions & 37 deletions readme.md
@@ -12,7 +12,6 @@ https://img.shields.io/circleci/project/github/RedSparr0w/node-csgo-parser/maste

[Try the online IDE!](https://rodrigopivi.github.io/Chatito/)


## Overview
Chatito helps you generate datasets for training and validating chatbot models using a simple DSL.

@@ -27,27 +26,30 @@ This project contains the:
### Chatito language
For the full language specification and documentation, please refer to the [DSL spec document](https://github.com/rodrigopivi/Chatito/blob/master/spec.md).

### Adapters
The language is independent from the generated output format and because each model can receive different parameters and settings, this are the currently implemented data formats, if your provider is not listed, at the Tools and resources section there is more information on how to support more formats.
## Tips

NOTE: Samples are not shuffled between intents for easier review and because some adapters stream samples directly to the file.
### Prevent overfit

#### Default format
Use the default format if you plan to train a custom model or if you are writing a custom adapter. This is the most flexible format because you can annotate `Slots` and `Intents` with custom entity arguments, and they all will be present at the generated output, so for example, you could also include dialog/response generation logic with the DSL. E.g.:
[Overfitting](https://en.wikipedia.org/wiki/Overfitting) is a problem that can be prevented by using Chatito correctly. The idea behind this tool is to have an intersection between data augmentation and a description of possible sentence combinations. It is not intended to generate deterministic datasets that may overfit a single-sentence model; in those cases, you can take some control over the generation paths and only pull samples as required.

```
%[some intent]('context': 'some annotation')
@[some slot] ~[please?]
### Tools and resources

@[some slot]('required': 'true', 'type': 'some type')
~[some alias here]
- [Visual Studio Code syntax highlighting plugin](https://marketplace.visualstudio.com/items?itemName=nimfin.chatito) Thanks to [Yuri Golobokov](https://github.com/nimf) for his [work on this](https://github.com/nimf/chatito-vscode).

```
- [AI Blueprints: How to build and deploy AI business projects](https://books.google.com.pe/books?id=sR2CDwAAQBAJ) implements practical full chatbot examples using chatito in chapter 7.

Custom entities like 'context', 'required' and 'type' will be available at the output so you can handle this custom arguments as you want.
- [3 steps to convert chatbot training data between different NLP Providers](https://medium.com/@benoit.alvarez/3-steps-to-convert-chatbot-training-data-between-different-nlp-providers-fa235f67617c) details a simple way to convert the data format to adapters that are not implemented. You can use a generated dataset with providers like DialogFlow, Wit.ai and Watson.

- [Aida-nlp](https://github.com/rodrigopivi/aida) is a tiny experimental NLP deep learning library for text classification and NER. Built with Tensorflow.js, Keras and Chatito. Implemented in JS and Python.

## Adapters
The language is independent of the generated output format. Because each model can receive different parameters and settings, these are the currently implemented data formats; if your provider is not listed, the Tools and resources section has more information on how to support other formats.

NOTE: Samples are not shuffled between intents, both for easier review and because some adapters stream samples directly to the file. It is recommended to split intents into different files for easier review and maintenance.

### [Rasa](https://rasa.com/docs/rasa/)
[Rasa](https://rasa.com/docs/rasa/) is an open source machine learning framework for automated text and voice-based conversations. Understand messages, hold conversations, and connect to messaging channels and APIs. Chatito can help you build a dataset for the [Rasa NLU](https://rasa.com/docs/rasa/nlu/about/) component.

#### [Rasa NLU](https://rasa.com/docs/nlu/)
[Rasa NLU](https://rasa.com/docs/nlu/) is a great open source framework for training NLU models.
One particular behavior of the Rasa adapter is that when a slot definition sentence only contains one alias, and that alias defines the 'synonym' argument with 'true', the generated Rasa dataset will map the alias as a synonym. e.g.:

```
@@ -64,21 +66,21 @@

In this example, the generated Rasa dataset will contain the `entity_synonyms` of `synonym 1` and `synonym 2` mapping to `some slot synonyms`.
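A minimal sketch of such a definition, per the behavior just described (the slot, alias and value names are made up):

```
@[some slot]
    ~[some slot synonyms]

~[some slot synonyms]('synonym': 'true')
    some slot synonyms
    synonym 1
    synonym 2
```

Because the slot sentence contains only the alias, and the alias sets 'synonym' to 'true', the alias variations are mapped as synonyms rather than as independent slot values.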

#### [Flair](https://github.com/zalandoresearch/flair)
[Flair](https://github.com/zalandoresearch/flair) A very simple framework for state-of-the-art NLP. Developed by Zalando Research. It provides state of the art (GPT, BERT, ELMo, etc...) pre trained models and embeddings for many languages that work out of the box. This adapter supports the `text classification` dataset in FastText format and the `named entity recognition` dataset in two column [BIO](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) annotated words, as documented at [flair corpus documentation](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md). This two data formats are very common and with many other providers or models.
### [Flair](https://github.com/zalandoresearch/flair)
[Flair](https://github.com/zalandoresearch/flair) is a very simple framework for state-of-the-art NLP, developed by Zalando Research. It provides state-of-the-art [(GPT, BERT, RoBERTa, XLNet, ELMo, etc.)](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md) pre-trained embeddings for many languages that work out of the box. This adapter supports the `text classification` dataset in FastText format and the `named entity recognition` dataset as two-column [BIO](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) annotated words, as documented in the [flair corpus documentation](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md). These two data formats are very common and compatible with many other providers or models.

The NER dataset requires a word tokenization processing that is currently done using [wink-tokenizer](https://github.com/winkjs/wink-tokenizer) npm package. Extending the adapter to add PoS tagging can be explored in the future, but it's not implemented.
The NER dataset requires word tokenization, which is currently done using a [simple tokenizer](https://github.com/rodrigopivi/Chatito/tree/master/src).

NOTE: The Flair adapter is only available for the Node.js NPM CLI package, not for the IDE.

#### [LUIS](https://www.luis.ai/)
### [LUIS](https://www.luis.ai/)
[LUIS](https://www.luis.ai/) is part of Microsoft's Cognitive services. Chatito supports training a LUIS NLU model through its [batch add labeled utterances endpoint](https://westus.dev.cognitive.microsoft.com/docs/services/5890b47c39e2bb17b84a55ff/operations/5890b47c39e2bb052c5b9c09), and its [batch testing api](https://docs.microsoft.com/en-us/azure/cognitive-services/LUIS/luis-how-to-batch-test).

To train a LUIS model, you will need to post the utterance in batches to the relevant API for training or testing.

Reference issue: [#61](https://github.com/rodrigopivi/Chatito/issues/61)

#### [Snips NLU](https://snips-nlu.readthedocs.io/en/latest/)
### [Snips NLU](https://snips-nlu.readthedocs.io/en/latest/)
[Snips NLU](https://snips-nlu.readthedocs.io/en/latest/) is another great open source framework for NLU. One particular behavior of the Snips adapter is that you can define entity types for the slots. e.g.:

```
@@ -92,9 +94,23 @@

In the previous example, all `@[date]` values will be tagged with the `snips/datetime` entity tag.
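A minimal sketch of such an entity annotation (the intent, slot and values are made up):

```
%[ask weather]('training': '2')
    will it rain @[date] ?

@[date]('entity': 'snips/datetime')
    today
    tomorrow
```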

### NPM package
### Default format
Use the default format if you plan to train a custom model or if you are writing a custom adapter. This is the most flexible format because you can annotate `Slots` and `Intents` with custom entity arguments, and they will all be present in the generated output; for example, you could also include dialog/response generation logic with the DSL. E.g.:

Chatito supports Node.js `v8.11.2 LTS` or higher.
```
%[some intent]('context': 'some annotation')
@[some slot] ~[please?]
@[some slot]('required': 'true', 'type': 'some type')
~[some alias here]
```

Custom entities like 'context', 'required' and 'type' will be available in the output, so you can handle these custom arguments as you want.
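As an illustrative sketch of consuming such output from a custom model pipeline, here is how a reader might reconstruct sentences and slots from a dataset in this spirit. Note the JSON layout below is an assumption based on the description above, not the exact schema; check a generated file for the real key names.

```python
import json

# Hypothetical sample in the spirit of the default format described above:
# a map from intent name to a list of generated examples, where each
# example is a list of tagged tokens. Key names are illustrative assumptions.
raw = """
{
  "some intent": [
    [
      {"type": "Text", "value": "turn on "},
      {"type": "Slot", "value": "the lights", "slot": "some slot",
       "args": {"required": "true", "type": "some type"}}
    ]
  ]
}
"""

dataset = json.loads(raw)

def to_sentence(example):
    # Reconstruct the plain sentence by concatenating token values.
    return "".join(token["value"] for token in example)

def slots_of(example):
    # Collect (slot name, value) pairs; custom args stay available on the token.
    return [(t["slot"], t["value"]) for t in example if t["type"] == "Slot"]

for intent, examples in dataset.items():
    for ex in examples:
        print(intent, "|", to_sentence(ex), "|", slots_of(ex))
```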

## NPM package

Chatito supports Node.js `>= v8.11`.

Install it with yarn or npm:
```
@@ -113,7 +129,7 @@ The generated dataset should be available next to your definition file.

Here is the full npm generator options:
```
npx chatito <pathToFileOrDirectory> --format=<format> --formatOptions=<formatOptions> --outputPath=<outputPath> --trainingFileName=<trainingFileName> --testingFileName=<testingFileName> --defaultDistribution=<defaultDistribution>
npx chatito <pathToFileOrDirectory> --format=<format> --formatOptions=<formatOptions> --outputPath=<outputPath> --trainingFileName=<trainingFileName> --testingFileName=<testingFileName> --defaultDistribution=<defaultDistribution> --autoAliases=<autoAliases>
```

- `<pathToFileOrDirectory>` path to a `.chatito` file or a directory that contains chatito files. If it is a directory, it will search recursively for all `*.chatito` files inside and use them to generate the dataset. e.g.: `lightsChange.chatito` or `./chatitoFilesFolder`
@@ -126,19 +142,8 @@ npx chatito <pathToFileOrDirectory> --format=<format> --formatOptions=<formatOpt

- `<autoAliases>` Optional. The generator behavior when finding an undefined alias. Valid options are `allow`, `warn` and `restrict`. Defaults to `allow`.
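For instance, a hypothetical invocation combining some of these options (the folder and output names are made up):

```
npx chatito ./chatitoFilesFolder --format=rasa --outputPath=./generated --defaultDistribution=even --autoAliases=warn
```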

### Notes to prevent overfitting

[Overfitting](https://en.wikipedia.org/wiki/Overfitting) is a problem that can be prevented if we use Chatito correctly. The idea behind this tool, is to have an intersection between data augmentation and a probabilistic description of possible sentences combinations. It is not intended to generate deterministic datasets, you should avoid generating all possible combinations.

### Tools and resources

- [Visual Studio Code syntax highlighting plugin](https://marketplace.visualstudio.com/items?itemName=nimfin.chatito) Thanks to [Yuri Golobokov](https://github.com/nimf) for his [work on this](https://github.com/nimf/chatito-vscode).

- [AI Blueprints: How to build and deploy AI business projects](https://books.google.com.pe/books?id=sR2CDwAAQBAJ) implements practical full chatbot examples using chatito at chapter 7.
### Author and maintainer
[Rodrigo Pimentel](https://www.linkedin.com/in/rodrigo-pimentel-550430b7/)

- [3 steps to convert chatbot training data between different NLP Providers](https://medium.com/@benoit.alvarez/3-steps-to-convert-chatbot-training-data-between-different-nlp-providers-fa235f67617c) details a simple way to convert the data format to non implemented adapters. You can use a generated dataset with providers like DialogFlow, Wit.ai and Watson.
sr.rodrigopv[at]gmail

- [Aida-nlp](https://github.com/rodrigopivi/aida) is a tiny experimental NLP deep learning library for text classification and NER. Built with Tensorflow.js, Keras and Chatito. Implemented in JS and Python.

### Author and maintainer
Rodrigo Pimentel
19 changes: 9 additions & 10 deletions spec.md
@@ -2,16 +2,16 @@

## 1 - Overview

Chatito is a domain specific language designed to simplify the process of creating, extending and maintaining
datasets for training natural language processing (NLP) models for text classification, named entity recognition, slot filling or equivalent tasks.
Chatito is a powerful domain-specific language designed to simplify the process of creating, extending and maintaining datasets for training and validating natural language processing (NLP) models for text classification, named entity recognition, slot filling or equivalent tasks.

Chatito design principles:

- Simplicity: should be understandable by someone looking at it for the first time
- Practicality: this tool is meant to help people who use it, the design should be guided by the community needs

- Speed: generate samples by pulling them from a cloud of probabilities on demand

- Practicality: this tool is meant to help people who use it, the design should be guided by the community needs
- Simplicity: should be understandable by someone looking at it for the first time


Following those principles this is an example of the language and its generated output:

@@ -33,7 +33,7 @@ Following those principles this is an example of the language and its generated
```

This code could produce a maximum of 18 examples, the output format is independent from the DSL language,
although it is recommended to use a newline delimited format to just stream results to a file, a format like ndjson is recommended over plain json and using the `training` entity argument to limit the dataset size is recommended for large dataset where there should be no need to generate all variations.
although it is recommended to use a newline-delimited format to stream results to a file; a format like ndjson is recommended over plain JSON, and using the `training` entity argument to limit the dataset size is recommended for large datasets, where there should be no need to generate all combinations.

That said, the earlier DSL code generates two training examples for the `greet` intent. Here are the `Newline Delimited JSON` (ndjson.org) examples generated from the previous code:

@@ -46,8 +46,7 @@ Given this principles in mind, this document is the specification of such langua

## 2 - Language

A chatito file, is a document containing the grammar definitions. Because of the different encoding formats and range of
non printable characters, this are the requirements of document source text and some terminology:
A chatito file is a document containing the grammar definitions. Because of the different encoding formats and the range of non-printable characters, these are the requirements for the document source text, along with some terminology:

- Format: UTF-8
- Valid characters: Allow international language characters.
@@ -204,11 +203,11 @@ The text next to the import statement should be a relative path from the main fi
Note: Chatito will throw an exception if two imports define the same entity.
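A sketch of an import (the file and entity names are made up; the path is relative to the importing file):

```
import ./common/aliases.chatito

%[greet]
    ~[greeting words] ~[name?]
```

Here `~[greeting words]` and `~[name]` would be defined in `aliases.chatito`.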


### 2.2 - Controlling probabilities
### 2.3 - Controlling probabilities

The way Chatito works is like pulling samples from a cloud of possible combinations while avoiding duplicates. Once the sentence definitions gain complexity, the maximum number of possible combinations increases exponentially, causing a problem where the generator will most likely pick sentences that have more possible combinations and omit some sentences that may be more important to the dataset. To overcome this problem, semantics for controlling the data generation probabilities are provided.

#### 2.2.1 - Frequency distribution strategies
#### 2.3.1 - Frequency distribution strategies

When generating samples for an entity, the generator will randomly pick a sentence model using one of the two frequency distribution strategies available: `regular` or `even`.
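As a hedged sketch, the strategy can be selected per entity through a 'distribution' entity argument (the argument name is assumed from this section; the intent and sentences are made up):

```
%[greet]('distribution': 'even')
    ~[hi] ~[name?]
    hello there
```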

@@ -256,7 +255,7 @@ For `even` distribution using the previous example:
| sentence 3 | 400 | 1 | 33.3333% |


#### 2.2.1 - Sentence probability operator
#### 2.3.2 - Sentence probability operator

The sentence probability operator is defined by the `*[` symbol at the start of a sentence, followed by the probability value and `]`. The probability value may be expressed in two ways: as a plain number (considered a weighted probability, e.g.: `1`) or as a percentage value (a number ending with `%`, e.g.: `33.3333%`). Once an entity defines a probability as either weight or percentage, all the other sentences for that entity should use the same type. Inconsistencies when declaring entity sentence probability values should be considered an input error, and if the value is not a valid integer, float or percentage value, the input should be considered simple text and not a sentence probability definition.
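A sketch of the operator using percentage values (the intent and sentences are made up; note that all sentences of the entity use the same probability type):

```
%[greet]
    *[50%] hi there
    *[30%] hello
    *[20%] good morning
```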
