tiny-parser enables you to write arbitrary, use-case-specific parsers within minutes.
It ships with a collection of predefined language definitions I started to write.
Defining the grammar of JSON using tiny-parser looks like this:
import tinyparser
from tinyparser import Rule, Token, Language
json = Language({
"root.number.": [Token.NUMBER],
"root.string.": [Token.STRING],
"root.list.": [Token.LEFT_SQUARE_BRACKET, "list.", Token.RIGHT_SQUARE_BRACKET],
"root.object.": [Token.LEFT_CURLY_BRACKET, "object.", Token.RIGHT_CURLY_BRACKET],
"list.nonempty.multiple.": ["root.", Token.COMMA, "list.nonempty."],
"list.nonempty.single.": ["root."],
"list.empty.": [],
"object.nonempty.multiple.": ["attribute.", Token.COMMA, "object.nonempty."],
"object.nonempty.single.": ["attribute."],
"object.empty.": [],
"attribute.": [Token.STRING, Token.COLON, "root."],
})
That's it!
If you'd like to parse some JSON now, you can do it like this:
# Parse input into ast
ast = tinyparser.parse(json, '{"Hello": "World"}')
# Inspection:
tinyparser.print_ast(ast)
""" Output:
<root> = [root. > root.object.]
.children = [
.1 = [Token.LEFT_CURLY_BRACKET] = '{'
.2 = [object. > object.nonempty.single.]
.children = [
.1 = [attribute.]
.children = [
.1 = [Token.STRING] = 'Hello'
.2 = [Token.COLON] = ':'
.3 = [root. > root.string.]
.children = [
.1 = [Token.STRING] = 'World'
]
]
]
.3 = [Token.RIGHT_CURLY_BRACKET] = '}'
]
"""
While this parsing result has all the necessary information, it also carries a lot of unnecessary detail. To improve this, tiny-parser allows you to post-process intermediate parsing results into the data structure of your choice. Whether it's custom classes, dictionaries, lists... you name it.
Since JSON is primarily a data-description language, why shouldn't we simply turn the string input into the corresponding Python data structure? In order to do this, our language grammar needs some meta information on how to process each rule (don't worry, everything you'll see will be explained later):
json = Language({
"root.number.": (eval, (Token.NUMBER, (None, "value"))),
"root.string.": (None, (Token.STRING, (None, "value"))),
"root.list.": ("#", Token.LEFT_SQUARE_BRACKET, ("list.", "#"), Token.RIGHT_SQUARE_BRACKET),
"root.object.": ("#", Token.LEFT_CURLY_BRACKET, ("object.", "#"), Token.RIGHT_CURLY_BRACKET),
"list.nonempty.multiple.": ([], "root.", (Token.COMMA, []), "list.nonempty."),
"list.nonempty.single.": ([], "root."),
"list.empty.": ([]),
"object.nonempty.multiple.": ({}, ("attribute.", ""), Token.COMMA, ("object.nonempty.", "")),
"object.nonempty.single.": ({}, ("attribute.", "")),
"object.empty.": ({}),
"attribute.": ({}, (Token.STRING, (None, "value")), Token.COLON, ("root.", 0)),
})
Now we can do:
# Prints: {'Hello': 'World'}
print( tinyparser.parse(json, '{"Hello" : "World"}') )
The first constructor argument to the class `tinyparser.Language` is the grammar - a Python dictionary containing all grammar rules. Each dictionary key maps a rule identifier to its definition.
grammar = {
"root.option-A": ["number."]
, "root.option-B": ["string."]
, "number.": [Token.NUMBER]
, "string.": [Token.STRING]
# And so on...
}
In principle, you can name your rules the way you like. For most cases, however, you'll want a hierarchical key structure.
By doing this, you can reference groups of rules and thus express disjunctions.
This is because a tiny-parser rule reference matches every rule whose key starts with the given prefix.
For example, a rule reference of "expression." will match all rules with a dictionary key starting with "expression.", such as "expression.empty.", "expression.nonempty.single." or "expression.nonempty.multiple.".
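To make this concrete, here is a sketch of a hypothetical grammar for a parenthesized, comma-separated list of numbers; the single reference "expression." in the first rule stands in for all three alternatives at once:
# A sketch with hypothetical rule names: the reference "expression." accepts
# any rule whose key starts with that prefix.
grammar = {
    "root.list.": [Token.LEFT_PARENTHESIS, "expression.", Token.RIGHT_PARENTHESIS]
    , "expression.nonempty.multiple.": [Token.NUMBER, Token.COMMA, "expression.nonempty."]
    , "expression.nonempty.single.": [Token.NUMBER]
    , "expression.empty.": []
}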
By convention, all rule identifiers should end in the separator character you use (in our case "."). This is because a reference should not have to care whether it points to a group of rules or to a single rule (separation of interface from implementation).
Note: For readability, all rule identifiers in this document are words. When you ship your code and/or parsing speed matters, numbers suit the purpose just as well and are quicker to parse. That is, the shorter your identifiers, the quicker tiny-parser can resolve each reference.
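For example, the small grammar from above could just as well be written with shorter identifiers (a sketch; the "root." prefix is kept since, as the JSON example suggests, parsing starts at the "root." reference):
# The grammar from above, with shorter rule identifiers (a sketch):
grammar = {
    "root.1.": ["1."]        # was "root.option-A"
    , "root.2.": ["2."]      # was "root.option-B"
    , "1.": [Token.NUMBER]   # was "number."
    , "2.": [Token.STRING]   # was "string."
}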
Rule definitions are either of the form `[steps...]` or `(target, steps...)`. If a rule is defined to match nothing (the empty string) and therefore has no steps, you may just specify the target on its own (wrapped in neither a tuple nor a list). I.e., you may as well pass `None`.
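The three shapes are easiest to see side by side; the following lines are taken from the two JSON grammars above (a sketch, not a complete grammar):
# The three shapes of a rule definition:
rule_shapes = {
    "root.string.": [Token.STRING]                              # [steps...] - steps only, no target
    , "root.number.": (eval, (Token.NUMBER, (None, "value")))   # (target, steps...)
    , "object.empty.": {}                                       # bare target - the rule matches nothing
}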
This chapter will be all about the matching steps, i.e. what you can match.
Language grammars come in different formats: BNF, EBNF, graphical control flow, etc. Common to all of them is what they are made of:
- Tokens (i.e. string content matching some part of the input), e.g. `}`, `&&` or `const`, and
- References to other rules.
Essentially, the "steps" you pass as arguments to the definition of each rule will mostly consist of these two things: references to other rules and tokens you want to read from the input.
tiny-parser will parse your input in two stages: 1. tokenization, 2. rule matching. Tokenization is a common preparation step in parsing; most compilers and source code analysis tools do it. The input is broken up into its atomic components (tokens) because this eases the process of rule matching immensely.
Tokenization happens linearly from the beginning of the input to the end. You can compare it to identifying the "building blocks" of written English within a given sentence:
- words (made of characters of the English alphabet, terminated by anything that's not part of the English alphabet),
- dots,
- dashes,
- parentheses,
- numbers (starting with a digit, terminated by anything that's neither a digit nor a decimal dot).
You can probably already see how this eases the further comprehension of the input string.
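To illustrate the idea, here is a sketch of such a linear scan in plain Python, using a few of the default regular expressions from the token table below (this is not tiny-parser's internal tokenizer, just an illustration):
import re

# A few of the default token patterns (see the table below). Order matters:
# at each position, the first pattern that matches wins.
TOKEN_PATTERNS = [
    ("STRING", r'"(?P<value>([^"]|\\")+)"'),
    ("NUMBER", r"(\+|-)?([1-9][0-9](\.[0-9])?\b|\.[0-9]+\b|0\b)"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*\b"),
    ("LEFT_PARENTHESIS", r"\("),
    ("RIGHT_PARENTHESIS", r"\)"),
    ("COMMA", r","),
]

def sketch_tokenize(text):
    """Scan the input from left to right and yield (token_name, matched_text) pairs."""
    position = 0
    while position < len(text):
        if text[position].isspace():  # whitespace merely separates tokens
            position += 1
            continue
        for name, pattern in TOKEN_PATTERNS:
            match = re.match(pattern, text[position:])
            if match:
                yield name, match.group(0)
                position += match.end()
                break
        else:
            raise ValueError(f"no token matches at position {position}")

# Prints: [('IDENTIFIER', 'foo'), ('LEFT_PARENTHESIS', '('), ('NUMBER', '42'),
#          ('COMMA', ','), ('STRING', '"bar"'), ('RIGHT_PARENTHESIS', ')')]
print(list(sketch_tokenize('foo(42, "bar")')))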
tiny-parser by default employs a basic tokenization that will suffice for many occasions.
It's defined by the enum `tinyparser.Token`, which derives from the class `tinyparser.TokenType`.
This basic tokenization allows you to match certain tokens just by passing the enum member of the token type you'd like to match. For example, the rule:
"root.parentheses.empty.": [Token.LEFT_PARENTHESIS, Token.RIGHT_PARENTHESIS]
has two steps that together match "()", even with whitespace in between, as in "( )".
In some cases, merely specifying the type of token that you want to match is not precise enough.
To match a token with specific content, for example the identifier `func`, you can use the function `exactly`:
"root.function.": [Token.exactly("func"), Token.IDENTIFIER]
You reference rules (or groups of rules) simply by writing their identifier or a common prefix as a string. For parsing a simple list of numbers, each separated by a comma, you'd write:
grammar = {
"list.nonempty.multiple.": ["list-element.", Token.COMMA, "list.nonempty."]
, "list.nonempty.single.": ["list-element."]
, "list.nonempty.single.": []
, "list-element.": [Token.NUMBER]
}
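Wrapped in a Language, this grammar can be used right away (a sketch; a "root." rule is added here because, as the JSON example suggests, parsing starts at the "root." reference):
numbers = Language({
    "root.": ["list."]
    , "list.nonempty.multiple.": ["list-element.", Token.COMMA, "list.nonempty."]
    , "list.nonempty.single.": ["list-element."]
    , "list.empty.": []
    , "list-element.": [Token.NUMBER]
})

# No targets are defined, so - as in the first JSON example - the result is a raw AST.
ast = tinyparser.parse(numbers, "10, 20, 30")
tinyparser.print_ast(ast)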
The `target` of a rule specifies the rule's return value once it has been matched - its result, so to speak.
For reference, these are the token types of the default tokenization and the regular expressions they match:

Token Name | Regular Expression |
---|---|
NEWLINE | \r\n\|\r\|\n |
DOUBLE_EQUAL | == |
EXCLAMATION_EQUAL | != |
LESS_EQUAL | <= |
GREATER_EQUAL | >= |
AND_EQUAL | &= |
OR_EQUAL | |= |
XOR_EQUAL | ^= |
PLUS_EQUAL | += |
MINUS_EQUAL | -= |
TIMES_EQUAL | *= |
DIVIDES_EQUAL | /= |
DOUBLE_AND | && |
DOUBLE_OR | \|\| |
DOUBLE_PLUS | \+\+ |
DOUBLE_MINUS | -- |
PLUS | \+ |
MINUS | - |
TIMES | * |
DIVIDES | / |
POWER | \^ |
LESS | < |
GREATER | > |
LEFT_PARENTHESIS | \( |
RIGHT_PARENTHESIS | \) |
LEFT_SQUARE_BRACKET | \[ |
RIGHT_SQUARE_BRACKET | \] |
LEFT_CURLY_BRACKET | \{ |
RIGHT_CURLY_BRACKET | \} |
SEMICOLON | ; |
COLON | : |
COMMA | , |
HAT | \^ |
DOT | \. |
IDENTIFIER | [a-zA-Z_][a-zA-Z0-9_]*\b |
NUMBER | (\+\|-)?([1-9][0-9](\.[0-9])?\b\|\.[0-9]+\b\|0\b) |
STRING | "(?P<value>([^"]\|\\")+)" |
UNKNOWN | . |