Xparser is a C++ parsing library that turns text into a syntax tree according to a grammar you define.
To install Xparser, download the source code and add it to your project.
To use Xparser, define your grammar in a JSON file.
A grammar allows Xparser to transform a sequence of characters into a syntax tree.
All JSON Xparser grammar files must have the following structure:
{
    "name": "nameOfYourGrammar",
    "terminals": [
        {
            "name": "nameOfTerminalRule",
            "regex": "ECMAScript regular expression"
        }
    ],
    "rules": [
        {
            "name": "ruleName",
            "expressions": [
                "[b]def<identifier>():"
            ]
        }
    ]
}
You can also specify the JSON schema so that you don't run into errors:
{
    "$schema": "https://raw.githubusercontent.com/SimoneAncona/xparser/main/schemas/schema.json"
}
As mentioned above, grammars are used to generate an abstract syntax tree (AST). You can do this from your C++ project:
#include "xparser.hh"
#include <fstream>
#include <sstream>
#include <string>
#include <stdexcept>
#include <iostream>

std::string read_json_file(const std::string& filename);

int main(int argc, char** argv)
{
    Xpp::Parser parser(read_json_file("myGrammar.json")); // import the grammar file
    Xpp::AST ast = parser.generate_ast("parse this string"); // parse a string and generate the AST
    std::cout << ast.to_json().to_string() << std::endl; // see the JSON string representation of the AST
    return 0;
}

// Reads the whole file into a string; throws if the file cannot be opened.
std::string read_json_file(const std::string& filename)
{
    std::ifstream file;
    std::stringstream buff;
    file.open(filename);
    if (file.fail())
        throw std::runtime_error("Cannot open the file: " + filename);
    buff << file.rdbuf();
    return buff.str();
}
A terminal is always a leaf node in the AST. A terminal value can be a literal number, a literal string or an identifier. There are 3 types of terminals:
- Predefined: terminals that are built in, such as integer or identifier.
- User-defined: terminals that are defined in the terminals property of the grammar JSON file.
- Constant: terminals that are defined in rule expressions; we will see later what this means.
A terminal is defined by a name and a regular expression, except for constant terminals.
There are 12 built-in terminals:
- integer: equivalent to the regular expression [-|+]?\d+
- identifier: equivalent to [_a-zA-Z][_a-zA-Z0-9]*
- real: equivalent to [+|-]?\d+(\.\d+)?
- alpha: equivalent to [a-zA-Z]
- alnum: equivalent to [a-zA-Z0-9]
- digit: equivalent to [0-9]
- hexDigit: equivalent to [0-9a-fA-F]
- octalDigit: equivalent to [0-7]
- space: equivalent to [^\S\r\n]
- newLine: equivalent to \r?\n
- any: equivalent to .
- eof: End Of File.
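Because the built-in terminals are described by equivalent regular expressions, they can be tried out with std::regex, whose default grammar is ECMAScript. The following is a minimal sketch and not part of Xparser: the names builtin_patterns and matches_builtin are hypothetical, and the table covers only some of the built-ins listed above.

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical table: a few built-in terminal names mapped to the
// regular expressions listed above. Not taken from Xparser's source.
const std::map<std::string, std::string>& builtin_patterns()
{
    static const std::map<std::string, std::string> patterns = {
        {"integer", R"([-|+]?\d+)"},
        {"identifier", R"([_a-zA-Z][_a-zA-Z0-9]*)"},
        {"real", R"([+|-]?\d+(\.\d+)?)"},
        {"hexDigit", R"([0-9a-fA-F])"},
        {"space", R"([^\S\r\n])"},
        {"newLine", R"(\r?\n)"},
    };
    return patterns;
}

// Returns true when the whole text matches the named pattern.
// Throws std::out_of_range for names missing from the table.
bool matches_builtin(const std::string& name, const std::string& text)
{
    const std::regex re(builtin_patterns().at(name)); // ECMAScript by default
    return std::regex_match(text, re);
}
```

For instance, matches_builtin("identifier", "_foo1") holds while matches_builtin("identifier", "1foo") does not, since an identifier cannot start with a digit.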
User-defined terminals are defined in the JSON grammar file under the terminals property. A terminal is defined by specifying its name and an ECMAScript regular expression.
NOTE: regular expressions are written as JSON strings, so every backslash must be doubled. To represent the expression /[^\S\r\n]/ you must write "[^\\S\\r\\n]".
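The escaping rule in the note can be applied mechanically. The sketch below uses a hypothetical helper, to_json_string, that wraps a regular expression in a JSON string literal by escaping backslashes and double quotes. It assumes the expression contains no control characters, which would need further escaping.

```cpp
#include <string>

// Hypothetical helper (not part of Xparser): wraps a regular expression
// in a JSON string literal by escaping backslashes and double quotes.
// Assumes the input holds no control characters.
std::string to_json_string(const std::string& regex)
{
    std::string out = "\"";
    for (char c : regex) {
        if (c == '\\' || c == '"')
            out += '\\'; // JSON requires a backslash before \ and "
        out += c;
    }
    out += '"';
    return out;
}
```

Calling to_json_string(R"([^\S\r\n])") yields the text "[^\\S\\r\\n]", including the surrounding quotes, ready to be used as the value of a regex property.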
A user-defined terminal could look like the following:
{
    "terminals": [
        {
            "name": "binaryNumber",
            "regex": "[01]+"
        }
    ]
}
NOTE: The order of the terminals in the array defines their hierarchy: the topmost terminals are parsed first.
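To illustrate why the order matters, here is a minimal sketch of first-match-wins classification. It is a simplified model, not Xparser's implementation: the Terminal struct and classify function are hypothetical, and the sketch assumes the lexeme has already been isolated from the input.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical mirror of one entry in the "terminals" array.
struct Terminal {
    std::string name;
    std::string regex;
};

// Returns the name of the first terminal (in array order) whose regular
// expression matches the whole lexeme, or an empty string if none does.
std::string classify(const std::vector<Terminal>& terminals,
                     const std::string& lexeme)
{
    for (const Terminal& t : terminals) {
        if (std::regex_match(lexeme, std::regex(t.regex)))
            return t.name; // earlier entries win over later ones
    }
    return "";
}
```

With a binary terminal listed before a decimal one, the lexeme 101 is classified as binary. Swapping the two entries makes the same lexeme decimal, because both expressions accept it and the first one listed wins.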
A rule defines the syntax of the language and specifies how elements of the language are combined. Rules are defined under the rules property in the JSON grammar.
Each rule has a name and a set of expressions which specify the syntax.
{
    "rules": [
        {
            "name": "variableDeclaration",
            "expressions": [
                "[b]var<identifier><newLine|eof>"
            ]
        }
    ]
}
NOTE: The order of the rules in the array defines a reverse hierarchy: rules lower in the array are parsed first.
The rule expression language allows you to specify the syntax of a rule. There are 3 elements in the rule expression language:
- Constant terminals: strings or sequences of characters that must match exactly in order to form a valid expression or sentence.
- References: references to other rules or terminals. References are delimited by <>.
- Flags: flags are always specified at the beginning of the expression and are delimited by [].
As mentioned above, constant terminals tell the parser to match a character sequence exactly. For example:
"[b]if<space*>(<condition>)"
In this expression, if is a constant terminal and tells the parser to match exactly the string "if".
To use <, [, | and other characters that have a special meaning in the rule expression language inside a constant terminal, you need to escape them with the \ character.
NOTE: in the JSON file the escape character must be written as \\. Example:
❌"[s]def \< <identifier> \>"
✔️"[s]def \\< <identifier> \\>"
A reference points to another rule or terminal and tells the parser to match a string that follows the referenced rule.
A rule can reference itself, provided that its expression array contains at least one expression made up only of terminal references or constant terminals.
Using the previous example:
"[b]if<space*>(<condition>)"
<condition> is a reference to a rule called condition.
A reference can be quantified. There are 5 quantifiers:
- ?: zero or 1.
- *: zero or more.
- +: 1 or more.
- {x}: exactly x.
- {x:y}: a range from x to y (inclusive).
Quantifiers are placed at the end of the reference like this:
"4letters:<alpha{4}>"
The example above matches a string that starts with "4letters:" followed by exactly 4 alphabetic characters.
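One way to reason about the quantifiers is as inclusive repetition ranges. The following sketch translates quantifier text into a minimum and maximum count. The Repeat struct, the kUnbounded constant and parse_quantifier are hypothetical names for illustration, and treating a reference without a quantifier as "exactly once" is an assumption.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical marker for "no upper limit".
constexpr std::size_t kUnbounded = std::numeric_limits<std::size_t>::max();

// Hypothetical representation of a quantifier as an inclusive range.
struct Repeat {
    std::size_t min;
    std::size_t max;
};

// Translates the quantifier text that follows a reference name
// ("", "?", "*", "+", "{x}" or "{x:y}") into a Repeat.
Repeat parse_quantifier(const std::string& q)
{
    if (q.empty()) return {1, 1}; // assumption: no quantifier means once
    if (q == "?") return {0, 1};
    if (q == "*") return {0, kUnbounded};
    if (q == "+") return {1, kUnbounded};
    if (q.size() < 3 || q.front() != '{' || q.back() != '}')
        throw std::invalid_argument("unknown quantifier: " + q);

    const std::string body = q.substr(1, q.size() - 2);
    const std::size_t colon = body.find(':');
    if (colon == std::string::npos) {
        const std::size_t x = std::stoul(body); // "{x}": exactly x
        return {x, x};
    }
    const std::size_t x = std::stoul(body.substr(0, colon));
    const std::size_t y = std::stoul(body.substr(colon + 1));
    if (x > y)
        throw std::invalid_argument("empty range: " + q);
    return {x, y}; // "{x:y}": from x to y, both included
}
```

For the expression above, parse_quantifier("{4}") gives a minimum and maximum of 4, so exactly four alpha terminals are required.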
References can be alternated. Alternative matches are separated by the | character. Each alternative represents a different way to match a part of the expression. For example:
"4letters_or_5num:<alpha{4}|digit{5}>"
In this example we match all strings that start with "4letters_or_5num:" followed by 4 alphabetic characters or 5 decimal digits.
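For readers more familiar with regular expressions, this expression behaves roughly like the std::regex below, where alpha and digit are replaced by the character classes listed for the built-in terminals. This is only an approximation for illustration; matches_example is a hypothetical name and Xparser does not necessarily evaluate expressions this way.

```cpp
#include <regex>
#include <string>

// Rough std::regex counterpart of the expression
// "4letters_or_5num:<alpha{4}|digit{5}>".
bool matches_example(const std::string& text)
{
    static const std::regex re("4letters_or_5num:([a-zA-Z]{4}|[0-9]{5})");
    return std::regex_match(text, re); // the whole text must match
}
```

Each alternative carries its own quantifier, so four letters or five digits are accepted, but not four digits or five letters.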
Flags are specified at the beginning of the expression and can change how the expression is evaluated.
There are 4 flags:
- s for ignore spaces: if this flag is set, every space between two adjacent elements of the expression (terminals or rule references) is ignored and not evaluated as a constant terminal.
- b for boundary: this flag guarantees that there is at least 1 space between adjacent terminals or rules that have the same expressions or regular expressions.
- i for case-insensitive: all constant terminals are case insensitive.
- I for case-insensitive (uniform case): all characters of a constant terminal are lower case or all are upper case, not a mix.
NOTE: you cannot specify both the i and I flags.
Example:
"[Isb]foreach<space*>(<identifier> in <identifier>)"
That expression can match:
- FOREACH (el in els)
- foreach( el IN els)
That expression doesn't match:
- Foreach(el in els)
- foreach(elinels)
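The difference between the i and I flags can be sketched as two comparison functions. The names matches_i and matches_I are hypothetical and the code only models how a single constant terminal is compared. It is not Xparser's implementation.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Case conversion helpers. The unsigned char parameter avoids undefined
// behaviour in std::tolower/std::toupper for negative char values.
std::string to_lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

std::string to_upper(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

// 'i' flag: any mix of upper and lower case is accepted.
bool matches_i(const std::string& constant, const std::string& text)
{
    return to_lower(text) == to_lower(constant);
}

// 'I' flag: the text must be entirely lower case or entirely upper case.
bool matches_I(const std::string& constant, const std::string& text)
{
    return text == to_lower(constant) || text == to_upper(constant);
}
```

Under this model, matches_I("foreach", "Foreach") is false, which is why Foreach(el in els) is rejected in the example, while matches_i("foreach", "Foreach") would be true.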
Depending on the flags, spaces are either evaluated as constant terminals or ignored. Let's see the difference:

"hello world<letter{4}> <number{6}>"
      ┃                ┃
      ┃                ┃
      ┗━━━━━━━━━━━━━━━━┻━━ These spaces are constant terminals.

"[s]hello world<letter{4}> <number{6}>"
         ┃                ┃
         ┃                ┗ This space will be ignored.
         ┗ This space is part of the constant terminal.

"[s]hello <identifier> world"
         ┃            ┃
         ┃            ┃
         ┗━━━━━━━━━━━━┻━━ These spaces will be ignored.
                          Add <space> if a separator is required: 'hello'
                          and 'world' can themselves be read as identifiers,
                          and 's' does not guarantee that any space is
                          present.

"[sb]hello <identifier> world"
          ┃            ┃
          ┃            ┃
          ┗━━━━━━━━━━━━┻━━ These spaces will be ignored.
                           However, the 'b' flag ensures that there is at
                           least one space between constant terminals and
                           references.

"[b]hello <identifier> world"
         ┃            ┃
         ┃            ┃
         ┗━━━━━━━━━━━━┻━━ These spaces will not be ignored.
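The interplay of the s and b flags can be approximated with regular expressions. The sketch below is an interpretation of the behaviour described above, not Xparser's actual algorithm: matches_s and matches_sb are hypothetical names, <identifier> and the space terminal are replaced by their listed regular expressions, and "ignored" is modelled as "zero or more spaces".

```cpp
#include <regex>
#include <string>

// Approximation of "[s]hello <identifier> world": spaces are optional.
bool matches_s(const std::string& text)
{
    static const std::regex re(
        R"(hello[^\S\r\n]*[_a-zA-Z][_a-zA-Z0-9]*[^\S\r\n]*world)");
    return std::regex_match(text, re);
}

// Approximation of "[sb]hello <identifier> world": at least one space
// is required on each side of the identifier.
bool matches_sb(const std::string& text)
{
    static const std::regex re(
        R"(hello[^\S\r\n]+[_a-zA-Z][_a-zA-Z0-9]*[^\S\r\n]+world)");
    return std::regex_match(text, re);
}
```

Under this approximation, matches_s accepts helloxworld because nothing forces a separator, whereas matches_sb rejects it and still accepts hello x world with any number of spaces.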