libparsing

C & Python Parsing Elements Grammar Library

Version :  0.7.0
URL     :  http://github.com/sebastien/parsing
README  :  https://cdn.rawgit.com/sebastien/libparsing/master/README.html

libparsing is a parsing element grammar (PEG) library written in C with Python bindings. It offers decent performance while allowing for a lot of flexibility. It is mainly intended to be used to create programming languages and software engineering tools.

As opposed to more traditional parsing techniques, the grammar is not compiled but constructed using an API that allows dynamic update of the grammar.

The parser does not do any tokeninzation, the instead input stream is consumed and parsing elements are dynamically asked to match the next element of it. Once parsing elements match, the resulting matched input is processed and an action is triggered.

libparsing supports the following features:

backtracking, ie. going back in the input stream if a match is not found
cherry-picking, ie. skipping unrecognized input
contextual rules, ie. a rule that will match or not depending on external variables

Parsing elements are usually slower than compiled or FSM-based parsers as they trade performance for flexibility. It's probably not a great idea to use libparsing if the parsing has to happen as fast as possible (ie. a protocol implementation), but it is a great use for programming languages, as it opens up the door to dynamic syntax plug-ins and multiple language embedding.

If you're interested in PEG, you can start reading Brian Ford's original article. Projects such as PEG/LEG by Ian Piumarta http://piumarta.com/software/peg/ ,OMeta by Alessandro Warth http://www.tinlizzie.org/ometa/ or Haskell's Parsec library https://www.haskell.org/haskellwiki/Parsec are of particular interest in the field.

Here is a short example of what creating a simple grammar looks like in Python:

g = Grammar()
s = g.symbols
g.token("WS",       "\s+")
g.token("NUMBER",   "\d+(\.\d+)?")
g.token("VARIABLE", "\w+")
g.token("OPERATOR", "[\/\+\-\*]")
g.group("Value",     s.NUMBER, s.VARIABLE)
g.rule("Suffix",     s.OPERATOR._as("operator"), s.Value._as("value"))
g.rule("Expression", s.Value, s.Suffix.zeroOrMore())
g.axiom(s.Expression)
g.skip(s.WS)
match = g.parseString("10 + 20 / 5")

and the equivalent code in C

Grammar* g = Grammar_new()
SYMBOL(WS,         TOKEN("\\s+"))
SYMBOL(NUMBER,     TOKEN("\\d+(\\.\\d+)?"))
SYMBOL(VARIABLE,   TOKEN("\\w+"))
SYMBOL(OPERATOR,   GROUP("[\\/\\+\\-\\*]"))
SYMBOL(Value,      GOUP(_S(NUMBER), _S(VARIABLE)))
SYMBOL(Suffix,     RULE(_AS(_S(OPERATOR), "operator"), _AS(_S(Value), "value")))
SYMBOL(Expression, RULE(_S(Value), _MO(Suffix))
g->axiom = s_Expression;
g->skip(s_WS);
Grammar_prepare(g);
Match* match = Grammar_parseString(g, "10 + 20 / 5")

Installing

To install the Python parsing module:

easy_install libparsing    # From Setuptools
pip install  libparsing    # From PIP

Note that for the above to work, you'll need a C compiler libffi-dev and libpcre-dev. On Ubuntu, do sudo apt install build-essential libffi-dev libprcre-dev.

To compile the C parsing module:

git clone http://github.com/sebastien/libparsing
cd libparsing
make
make install               # You can set PREFIX

libparsing works with GCC4 and Clang and is written following the c11 standard.

C API

Input data

The parsing library is configured at compile-time to iterate on specific elements of input, typically char. You can redefine the macro ITERATION_UNIT to the type you'd like to iterate on.

By default, the ITERATION_UNIT is a char, which works both for ASCII and UTF8. On the topic of Unicode/UTF8, the parsing library only uses functions that are UTF8-savvy.

#ifndef ITERATION_UNIT
#define ITERATION_UNIT char
#endif

typedef ITERATION_UNIT iterated_t;

Input data is acquired through iterators. Iterators wrap an input source (the default input is a FileInput) and a move callback that updates the iterator's offset. The iterator will build a buffer of the acquired input and maintain a pointer for the current offset within the data acquired from the input stream.

You can get an iterator on a file by doing:

Iterator* iterator = Iterator_Open("example.txt");

type `Iterator`

typedef struct Iterator {
	char           status;    // The status of the iterator, one of STATUS_{INIT|PROCESSING|INPUT_ENDED|ENDED}
	char*          buffer;    // The buffer to the read data, note how it is a (void*) and not an `iterated_t`
	iterated_t*    current;   // The pointer current offset within the buffer
	iterated_t     separator; // The character for line separator, `\n` by default.
	size_t         offset;    // Offset in input (in bytes), might be different from `current - buffer` if some input was freed.
	size_t         lines;     // Counter for lines that have been encountered
	size_t         capacity;  // Content capacity (in bytes), might be bigger than the data acquired from the input
	size_t         available; // Available data in buffer (in bytes), always `<= capacity`
	bool           freeBuffer;
	void*          input;     // Pointer to the input source
	bool          (*move) (struct Iterator*, int n); // Plug-in function to move to the previous/next positions
} Iterator;

type `FileInput`

The file input wraps information about the input file, such as the FILE object and the path.

typedef struct FileInput {
	FILE*        file;
	const char*  path;
} FileInput;

shared `EOL`

The EOL character used to count lines in an iterator context.

extern iterated_t         EOL;

operation `Iterator_Open`

Returns a new iterator instance with the given open file as input

Iterator* Iterator_Open(const char* path);

operation `Iterator_FromString`

Returns a new iterator instance with the text

Iterator* Iterator_FromString(const char* text);

constructor `Iterator`

Iterator* Iterator_new(void);

destructor `Iterator_free`

void      Iterator_free(Iterator* this);

method `Iterator_open`

Makes the given iterator open the file at the given path. This will automatically assign a FileInput to the iterator as an input source.

bool Iterator_open( Iterator* this, const char *path );

method `Iterator_hasMore`

Tells if the iterator has more available data. This means that there is available data after the current offset.

bool Iterator_hasMore( Iterator* this );

method `Iterator_remaining`

Returns the number of bytes available from the current iterator's position up to the last available data. For dynamic streams, where the length is unknown, this should be lesser or equalt to ITERATOR_BUFFER_AHEAD.

size_t Iterator_remaining( Iterator* this );

method `Iterator_moveTo`

Moves the iterator to the given offset

bool Iterator_moveTo ( Iterator* this, size_t offset );

method `String_move`

bool String_move ( Iterator* this, int offset );

define `ITERATOR_BUFFER_AHEAD`

The number of iterated_t that should be loaded after the iterator's current position. This limits the numbers of iterated_t that a Token could match.

#define ITERATOR_BUFFER_AHEAD 64000

constructor `FileInput`

FileInput* FileInput_new(const char* path );

destructor `FileInput_free`

void       FileInput_free(FileInput* this);

method `FileInput_preload`

Preloads data from the input source so that the buffer has up to ITERATOR_BUFFER_AHEAD characters ahead.

size_t FileInput_preload( Iterator* this );

method `FileInput_move`

Advances/rewinds the given iterator, loading new data from the file input whenever there is not ITERATOR_BUFFER_AHEAD data elements ahead of the iterator's current position.

bool FileInput_move   ( Iterator* this, int n );

Grammar

The Grammar is the concrete definition of the language you're going to parse. It is defined by an axiom and input data that can be skipped, such as white space.

The axiom and skip properties are both references to parsing elements.

typedef struct ParsingContext ParsingContext;
typedef struct ParsingElement ParsingElement;
typedef struct ParsingResult  ParsingResult;
typedef struct Reference      Reference;
typedef struct Match          Match;
typedef void                  Element;

typedef struct Element { char type; // Type is used du differentiate ParsingElement from Reference int id; // The ID, assigned by the grammar, as the relative distance to the axiom } Element;

type `Grammar`

typedef struct Grammar {
	ParsingElement*  axiom;       // The axiom
	ParsingElement*  skip;        // The skipped element
	int              axiomCount;  // The count of parsing elemetns in axiom
	int              skipCount;   // The count of parsing elements in skip
	Element**        elements;    // The set of all elements in the grammar
	bool             isVerbose;
} Grammar;

constructor `Grammar`

Grammar* Grammar_new(void);

destructor `Grammar_free`

void Grammar_free(Grammar* this);

method `Grammar_prepare`

void Grammar_prepare ( Grammar* this );

method `Grammar_symbolsCount`

int Grammar_symbolsCount ( Grammar* this );

method `Grammar_parseIterator`

ParsingResult* Grammar_parseIterator( Grammar* this, Iterator* iterator );

method `Grammar_parsePath`

ParsingResult* Grammar_parsePath( Grammar* this, const char* path );

method `Grammar_parseString`

ParsingResult* Grammar_parseString( Grammar* this, const char* text );

method `Grammar_freeElements`

void Grammar_freeElements(Grammar* this);

Elements

callback `WalkingCallback`

typedef int (*WalkingCallback)(Element* this, int step, void* context);

method `Element_walk`

int Element_walk( Element* this, WalkingCallback callback, void* context);

method `Element__walk`

int Element__walk( Element* this, WalkingCallback callback, int step, void* context);

Parsing Elements

Parsing elements are the core elements that recognize and process input data. There are 4 basic types: Work, Token, Group and Rule.

Parsing elements offer two main operations: recognize and process. The recognize method generates a Match object (that might be the FAILURE singleton if the data was not recognized). The process method tranforms corresponds to a user-defined action that transforms the Match object and returns the generated value.

Parsing element are assigned an id that corresponds to their breadth-first distance to the axiom. Before parsing, the grammar will re-assign the parsing element's id accordingly.

type `Match`

typedef struct Match {
	char            status;     // The status of the match (see STATUS_XXX)
	size_t          offset;     // The offset of `iterated_t` matched
	size_t          length;     // The number of `iterated_t` matched
	Element*        element;
	ParsingContext* context;
	void*           data;      // The matched data (usually a subset of the input stream)
	struct Match*   next;      // A pointer to the next  match (see `References`)
	struct Match*   children;  // A pointer to the child match (see `References`)
	void*           result;    // A pointer to the result of the match
} Match;

define `STATUS_INIT`

The different values for a match (or iterator)'s status

#define STATUS_INIT        '-'

define `STATUS_PROCESSING`

#define STATUS_PROCESSING  '~'

define `STATUS_MATCHED`

#define STATUS_MATCHED     'M'

define `STATUS_SUCCESS`

#define STATUS_SUCCESS     'S'

define `STATUS_PARTIAL`

#define STATUS_PARTIAL     's'

define `STATUS_FAILED`

#define STATUS_FAILED      'X'

define `STATUS_INPUT_ENDED`

#define STATUS_INPUT_ENDED '.'

define `STATUS_ENDED`

#define STATUS_ENDED       'E'

define `TYPE_ELEMENT`

#define TYPE_ELEMENT    'E'

define `TYPE_WORD`

#define TYPE_WORD       'W'

define `TYPE_TOKEN`

#define TYPE_TOKEN      'T'

define `TYPE_GROUP`

#define TYPE_GROUP      'G'

define `TYPE_RULE`

#define TYPE_RULE       'R'

define `TYPE_CONDITION`

#define TYPE_CONDITION  'c'

define `TYPE_PROCEDURE`

#define TYPE_PROCEDURE  'p'

define `TYPE_REFERENCE`

#define TYPE_REFERENCE  '#'

define `ID_UNBOUND`

A parsing element that is not bound to a grammar will have ID_UNBOUND by default.

#define ID_UNBOUND      -10

define `ID_BINDING`

A parsing element that being bound to a grammar (see Grammar_prepare) will have an id of ID_BINDING temporarily.

#define ID_BINDING       -1

singleton `FAILURE_S`

A specific match that indicates a failure

extern Match FAILURE_S;

shared `FAILURE`

extern Match* FAILURE;

operation `Match_Empty`

Creates new empty (successful) match

Match* Match_Empty(Element* element, ParsingContext* context);

operation `Match_Success`

Creates a new successful match of the given length

Match* Match_Success(size_t length, Element* element, ParsingContext* context);

constructor `Match`

Match* Match_new(void);

destructor `Match_free`

Frees the given match. If the match is FAILURE, then it won't be feed. This means that most of the times you won't need to free a failed match, as it's likely to be the FAILURE singleton.

void Match_free(Match* this);

method `Match_isSuccess`

bool Match_isSuccess(Match* this);

method `Match_getOffset`

int Match_getOffset(Match* this);

method `Match_getLength`

int Match_getLength(Match* this);

method `Match__walk`

int Match__walk(Match* this, WalkingCallback callback, int step, void* context );

method `Match_countAll`

int Match_countAll(Match* this);

type `ParsingElement`

typedef struct ParsingElement {
	char           type;       // Type is used du differentiate ParsingElement from Reference
	int            id;         // The ID, assigned by the grammar, as the relative distance to the axiom
	const char*    name;       // The parsing element's name, for debugging
	void*          config;     // The configuration of the parsing element
	struct Reference*     children;   // The parsing element's children, if any
	struct Match*         (*recognize) (struct ParsingElement*, ParsingContext*);
	struct Match*         (*process)   (struct ParsingElement*, ParsingContext*, Match*);
	void                  (*freeMatch) (Match*);
} ParsingElement;

operation `ParsingElement_Is`

Tells if the given pointer is a pointer to a ParsingElement.

bool         ParsingElement_Is(void *);

constructor `ParsingElement`

Creates a new parsing element and adds the given referenced parsing elements as children. Note that this is an internal constructor, and you should use the specialized versions instead.

ParsingElement* ParsingElement_new(Reference* children[]);

destructor `ParsingElement_free`

void ParsingElement_free(ParsingElement* this);

method `ParsingElement_add`

Adds a new reference as child of this parsing element. This will only be effective for composite parsing elements such as Rule or Token.

ParsingElement* ParsingElement_add(ParsingElement *this, Reference *child);

method `ParsingElement_clear`

ParsingElement* ParsingElement_clear(ParsingElement *this);

method `ParsingElement_clear`

Returns the match for this parsing element for the given iterator's state. inline Match* ParsingElement_recognize( ParsingElement* this, ParsingContext* context );

method `ParsingElement_process`

Processes the given match once the parsing element has fully succeeded. This is where user-bound actions will be applied, and where you're most likely to do things such as construct an AST.

Match* ParsingElement_process( ParsingElement* this, Match* match );

method `ParsingElement_name`

Transparently sets the name of the element

ParsingElement* ParsingElement_name( ParsingElement* this, const char* name );

Word

Words recognize a static string at the current iterator location.

type `WordConfig`

The parsing element configuration information that is used by the Token methods.

typedef struct WordConfig {
	const char* word;
	size_t      length;
} WordConfig;

constructor `ParsingElement`

ParsingElement* Word_new(const char* word);

destructor `Word_free`

void Word_free(ParsingElement* this);

method `Word_recognize`

The specialized match function for token parsing elements.

Match*          Word_recognize(ParsingElement* this, ParsingContext* context);

method `Word_word`

const char* Word_word(ParsingElement* this);

method `WordMatch_group`

const char* WordMatch_group(Match* match);

Tokens

Tokens are regular expression based parsing elements. They do not have any children and test if the regular expression matches exactly at the iterator's current location.

type `TokenConfig`

The parsing element configuration information that is used by the Token methods.

typedef struct TokenConfig {
	const char* expr;
#ifdef WITH_PCRE
	pcre*       regexp;
	pcre_extra* extra;
#endif
} TokenConfig;

type `TokenMatch`

typedef struct TokenMatch {
	int             count;
	const char**    groups;
} TokenMatch;

method `Token_new`

Creates a new token with the given POSIX extended regular expression

ParsingElement* Token_new(const char* expr);

destructor `Token_free`

void Token_free(ParsingElement*);

method `Token_recognize`

The specialized match function for token parsing elements.

Match*          Token_recognize(ParsingElement* this, ParsingContext* context);

method `Token_expr`

const char* Token_expr(ParsingElement* this);

method `TokenMatch_free`

Frees the TokenMatch created in Token_recognize

void TokenMatch_free(Match* match);

method `TokenMatch_group`

const char* TokenMatch_group(Match* match, int index);

method `TokenMatch_count`

int TokenMatch_count(Match* match);

References

We've seen that parsing elements can have children. However, a parsing element's children are not directly parsing elements but rather parsing elements' References. This is why the ParsingElement_add takes a Reference object as parameter.

References allow to share a single parsing element between many different composite parsing elements, while decorating them with additional information such as their cardinality (ONE, OPTIONAL, MANY and MANY_OPTIONAL) and a name that will allow process actions to easily access specific parts of the parsing element.

type `Reference`

typedef struct Reference {
	char            type;            // Set to Reference_T, to disambiguate with ParsingElement
	int             id;              // The ID, assigned by the grammar, as the relative distance to the axiom
	char            cardinality;     // Either ONE (default), OPTIONAL, MANY or MANY_OPTIONAL
	const char*     name;            // The name of the reference (optional)
	struct ParsingElement* element;  // The reference to the parsing element
	struct Reference*      next;     // The next child reference in the parsing elements
} Reference;

define `CARDINALITY_OPTIONAL`

The different values for the Reference cardinality.

#define CARDINALITY_OPTIONAL      '?'

define `CARDINALITY_ONE`

#define CARDINALITY_ONE           '1'

define `CARDINALITY_MANY_OPTIONAL`

#define CARDINALITY_MANY_OPTIONAL '*'

define `CARDINALITY_MANY`

#define CARDINALITY_MANY          '+'

operation `Reference_Is`

Tells if the given pointer is a pointer to Reference

bool Reference_Is(void * this);

operation `Reference_Ensure`

Ensures that the given element (or reference) is a reference.

Reference* Reference_Ensure(void* elementOrReference);

operation `Reference_FromElement`

Returns a new reference wrapping the given parsing element

Reference* Reference_FromElement(ParsingElement* element);

constructor `Reference`

References are typically owned by their single parent composite element.

Reference* Reference_new(void);

destructor `Reference_free`

void Reference_free(Reference* this);

method `Reference_cardinality`

Sets the cardinality of this reference, returning it transprently.

Reference* Reference_cardinality(Reference* this, char cardinality);

method `Reference_name`

Reference* Reference_name(Reference* this, const char* name);

method `Reference_hasNext`

bool Reference_hasNext(Reference* this);

method `Reference_hasElement`

bool Reference_hasElement(Reference* this);

method `Reference__walk`

int Reference__walk( Reference* this, WalkingCallback callback, int step, void* nothing );

method `Reference_recognize`

Returns the matched value corresponding to the first match of this reference. OPTIONAL references might return EMPTY, ONE references will return a match with a next=NULL while MANY may return a match with a next pointing to the next match.

Match* Reference_recognize(Reference* this, ParsingContext* context);

Groups

Groups are composite parsing elements that will return the first matching reference's match. Think of it as a logical or.

constructor `ParsingElement`

ParsingElement* Group_new(Reference* children[]);

method `Group_recognize`

Match*          Group_recognize(ParsingElement* this, ParsingContext* context);

Rules

Groups are composite parsing elements that only succeed if all their matching reference's.

constructor `ParsingElement`

ParsingElement* Rule_new(Reference* children[]);

method `Rule_recognize`

Match*          Rule_recognize(ParsingElement* this, ParsingContext* context);

Procedures

Procedures are parsing elements that do not consume any input, always succeed and usually have a side effect, such as setting a variable in the parsing context.

callback `ProcedureCallback`

typedef void (*ProcedureCallback)(ParsingElement* this, ParsingContext* context);

callback `MatchCallback`

typedef void (*MatchCallback)(Match* m);

constructor `ParsingElement`

ParsingElement* Procedure_new(ProcedureCallback c);

method `Procedure_recognize`

Match*          Procedure_recognize(ParsingElement* this, ParsingContext* context);

Conditions

Conditions, like procedures, execute arbitrary code when executed, but they might return a FAILURE.

callback `ConditionCallback`

typedef Match* (*ConditionCallback)(ParsingElement*, ParsingContext*);

constructor `ParsingElement`

ParsingElement* Condition_new(ConditionCallback c);

method `Condition_recognize`

Match*          Condition_recognize(ParsingElement* this, ParsingContext* context);

The parsing process

The parsing itself is the process of taking a grammar and applying it to an input stream of data, represented by the iterator.

The grammar's axiom will be matched against the iterator's current position, and if necessary, the grammar's skip parsing element will be applied to advance the iterator.

typedef struct ParsingStep    ParsingStep;
typedef struct ParsingOffset  ParsingOffset;

type `ParsingStats`

typedef struct ParsingStats {
	size_t   bytesRead;
	double   parseTime;
	size_t   symbolsCount;
	size_t*  successBySymbol;
	size_t*  failureBySymbol;
	size_t   failureOffset;   // A reference to the deepest failure
	size_t   matchOffset;
	size_t   matchLength;
	Element* failureElement;  // A reference to the failure element
} ParsingStats;

constructor `ParsingStats`

ParsingStats* ParsingStats_new(void);

destructor `ParsingStats_free`

void ParsingStats_free(ParsingStats* this);

method `ParsingStats_setSymbolsCount`

void ParsingStats_setSymbolsCount(ParsingStats* this, size_t t);

method `ParsingStats_registerMatch`

Match* ParsingStats_registerMatch(ParsingStats* this, Element* e, Match* m);

type `ParsingContext`

typedef struct ParsingContext {
	struct Grammar*              grammar;      // The grammar used to parse
	struct Iterator*             iterator;     // Iterator on the input data
	struct ParsingOffset* offsets;      // The parsing offsets, starting at 0
	struct ParsingOffset* current;      // The current parsing offset
	struct ParsingStats*  stats;
} ParsingContext;

constructor `ParsingContext`

ParsingContext* ParsingContext_new( Grammar* g, Iterator* iterator );

method `ParsingContext_text`

iterated_t* ParsingContext_text( ParsingContext* this );

destructor `ParsingContext_free`

void ParsingContext_free( ParsingContext* this );

type `ParsingResult`

typedef struct ParsingResult {
	char            status;
	Match*          match;
	ParsingContext* context;
} ParsingResult;

constructor `ParsingResult`

ParsingResult* ParsingResult_new(Match* match, ParsingContext* context);

method `ParsingResult_free`

Frees this parsing result instance as well as all the matches it referes to.

void ParsingResult_free(ParsingResult* this);

method `ParsingResult_isFailure`

bool ParsingResult_isFailure(ParsingResult* this);

method `ParsingResult_isPartial`

bool ParsingResult_isPartial(ParsingResult* this);

method `ParsingResult_isComplete`

bool ParsingResult_isComplete(ParsingResult* this);

method `ParsingResult_text`

iterated_t* ParsingResult_text(ParsingResult* this);

method `ParsingResult_textOffset`

int ParsingResult_textOffset(ParsingResult* this);

method `ParsingResult_remaining`

size_t ParsingResult_remaining(ParsingResult* this);

The result of recognizing parsing elements at given offsets within the input stream is stored in ParsingOffset. Each parsing offset is a stack of ParsingStep, corresponding to successive attempts at matching parsing elements at the current position.

The parsing offset is a stack of parsing steps, where the tail is the most specific parsing step. By following the tail's previous parsing step, you can unwind the stack.

The parsing steps each have an offset within the iterated stream. Offsets where data has been fully extracted (ie, a leaf parsing element has matched and processing returned a NOTHING) can be freed as they are not necessary any more.

type `ParsingOffset`

typedef struct ParsingOffset {
	size_t                offset; // The offset matched in the input stream
	ParsingStep*          last;   // The last matched parsing step (ie. corresponding to the most specialized parsing element)
	struct ParsingOffset* next;   // The link to the next offset (if any)
} ParsingOffset;

constructor `ParsingOffset`

ParsingOffset* ParsingOffset_new( size_t offset );

destructor `ParsingOffset_free`

void ParsingOffset_free( ParsingOffset* this );

The parsing step allows to memoize the state of a parsing element at a given offset. This is the data structure that will be manipulated and created/destroyed the most during the parsing process.

typedef struct ParsingStep {
	ParsingElement*     element;       // The parsing elemnt we're matching
	char                step;          // The step corresponds to current child's index (0 for token/word)
	unsigned int        iteration;     // The current iteration (on the step)
	char                status;        // Match status `STATUS_{INIT|PROCESSING|FAILED}`
	Match*              match;         // The corresponding match, if any.
	struct ParsingStep* previous;      // The previous parsing step on the parsing offset's stack
} ParsingStep;

constructor `ParsingStep`

ParsingStep* ParsingStep_new( ParsingElement* element );

destructor `ParsingStep_free`

void ParsingStep_free( ParsingStep* this );

Processor

typedef struct Processor Processor;

callback `ProcessorCallback`

typedef void (*ProcessorCallback)(Processor* processor, Match* match);

typedef struct Processor {
	ProcessorCallback   fallback;
	ProcessorCallback*  callbacks;
	int                 callbacksCount;
} Processor;

constructor `Processor`

Processor* Processor_new( );

method `Processor_free`

void Processor_free(Processor* this);

method `Processor_register`

void Processor_register (Processor* this, int symbolID, ProcessorCallback callback) ;

method `Processor_process`

int Processor_process (Processor* this, Match* match, int step);

Utilities

method `Utilities_indent`

void Utilities_indent( ParsingElement* this, ParsingContext* context );

method `Utilities_dedent`

void Utilities_dedent( ParsingElement* this, ParsingContext* context );

method `Utilites_checkIndent`

Match* Utilites_checkIndent( ParsingElement *this, ParsingContext* context );

Syntax Sugar

The parsing library provides a set of macros that make defining grammars a much easier task. A grammar is usually defined in the following way:

leaf symbols (words & tokens) are defined ;
compound symbolds (rules & groups) are defined.

Let's take as simple grammar and define it with the straight API:

// Leaf symbols
ParsingElement* s_NUMBER   = Token_new("\\d+");
ParsingElement* s_VARIABLE = Token_new("\\w+");
ParsingElement* s_OPERATOR = Token_new("[\\+\\-\\*\\/]");

// We also attach names to the symbols so that debugging will be easier
ParsingElement_name(s_NUMBER,   "NUMBER");
ParsingElement_name(s_VARIABLE, "VARIABLE");
ParsingElement_name(s_OPERATOR, "OPERATOR");

// Now we defined the compound symbols
ParsingElement* s_Value    = Group_new((Reference*[3]),{
Reference_cardinality(Reference_Ensure(s_NUMBER),   CARDINALITY_ONE),
Reference_cardinality(Reference_Ensure(s_VARIABLE), CARDINALITY_ONE)
NULL
});
ParsingElement* s_Suffix    = Rule_new((Reference*[3]),{
Reference_cardinality(Reference_Ensure(s_OPERATOR),  CARDINALITY_ONE),
Reference_cardinality(Reference_Ensure(s_Value),     CARDINALITY_ONE)
NULL
});
* ParsingElement* s_Expr    = Rule_new((Reference*[3]),{
Reference_cardinality(Reference_Ensure(s_Value),  CARDINALITY_ONE),
Reference_cardinality(Reference_Ensure(s_Suffix), CARDINALITY_MANY_OPTIONAL)
NULL
});

// We define the names as well
ParsingElement_name(s_Value,  "Value");
ParsingElement_name(s_Suffix, "Suffix");
ParsingElement_name(s_Expr, "Expr");

As you can see, this is quite verbose and makes reading the grammar declaration a difficult task. Let's introduce a set of macros that will make expressing grammars much easier.

Symbol declaration & creation

macro `SYMBOL`

Declares a symbol of name n as being parsing element e.

#define SYMBOL(n,e)       ParsingElement* s_ ## n = ParsingElement_name(e, #n);

macro `WORD`

Creates a Word parsing element with the given regular expression

#define WORD(v)           Word_new(v)

macro `TOKEN`

Creates a Token parsing element with the given regular expression

#define TOKEN(v)          Token_new(v)

macro `RULE`

Creates a Rule parsing element with the references or parsing elements as children.

#define RULE(...)         Rule_new((Reference*[(VA_ARGS_COUNT(__VA_ARGS__)+1)]){__VA_ARGS__,NULL})

macro `GROUP`

Creates a Group parsing element with the references or parsing elements as children.

#define GROUP(...)        Group_new((Reference*[(VA_ARGS_COUNT(__VA_ARGS__)+1)]){__VA_ARGS__,NULL})

macro `PROCEDURE`

Creates a Procedure parsing element

#define PROCEDURE(f)      Procedure_new(f)

macro `CONDITION`

Creates a Condition parsing element

#define CONDITION(f)      Condition_new(f)

Symbol reference & cardinality

macro `_S`

Refers to symbol n, wrapping it in a CARDINALITY_ONE reference

#define _S(n)             ONE(s_ ## n)

macro `_O`

Refers to symbol n, wrapping it in a CARDINALITY_OPTIONAL reference

#define _O(n)             OPTIONAL(s_ ## n)

macro `_M`

Refers to symbol n, wrapping it in a CARDINALITY_MANY reference

#define _M(n)             MANY(s_ ## n)

macro `_MO`

Refers to symbol n, wrapping it in a CARDINALITY_MANY_OPTIONAL reference

#define _MO(n)            MANY_OPTIONAL(s_ ## n)

macro `_AS`

Sets the name of reference r to be v

#define _AS(r,v)          Reference_name(Reference_Ensure(r), v)

Supporting macros

The following set of macros is mostly used by the set of macros above. You probably won't need to use them directly.

macro `NAME`

Sets the name of the given parsing element e to be the name n.

#define NAME(n,e)         ParsingElement_name(e,n)

macro `ONE`

Sets the given reference or parsing element's reference to CARDINALITY_ONE If a parsing element is given, it will be automatically wrapped in a reference.

#define ONE(v)            Reference_cardinality(Reference_Ensure(v), CARDINALITY_ONE)

macro `OPTIONAL`

Sets the given reference or parsing element's reference to CARDINALITY_OPTIONAL If a parsing element is given, it will be automatically wrapped in a reference.

#define OPTIONAL(v)       Reference_cardinality(Reference_Ensure(v), CARDINALITY_OPTIONAL)

macro `MANY`

Sets the given reference or parsing element's reference to CARDINALITY_MANY If a parsing element is given, it will be automatically wrapped in a reference.

#define MANY(v)           Reference_cardinality(Reference_Ensure(v), CARDINALITY_MANY)

macro `MANY_OPTIONAL`

Sets the given reference or parsing element's reference to CARDINALITY_MANY_OPTIONAL If a parsing element is given, it will be automatically wrapped in a reference.

#define MANY_OPTIONAL(v)  Reference_cardinality(Reference_Ensure(v), CARDINALITY_MANY_OPTIONAL)

Grammar declaration with macros

The same grammar that we defined previously can now be expressed in the following way:

SYMBOL(NUMBER,   TOKEN("\\d+"))
SYMBOL(VAR,      TOKEN("\\w+"))
SYMBOL(OPERATOR, TOKEN("[\\+\\-\\*\\/]"))

SYMBOL(Value,  GROUP( _S(NUMBER),   _S(VAR)     ))
SYMBOL(Suffix, RULE(  _S(OPERATOR), _S(Value)   ))
SYMBOL(Expr,   RULE(  _S(Value),    _MO(Suffix) ))

All symbols will be define as s_XXX, so that you can do:

ParsingGrammar* g = Grammar_new();
g->axiom = s_Expr;

License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the FFunction inc (CANADA) nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Name		Name	Last commit message	Last commit date
Latest commit History 351 Commits
bin		bin
example		example
src		src
test		test
.gitignore		.gitignore
Makefile		Makefile
NOTES.md		NOTES.md
README.html		README.html
README.md		README.md
setup.py		setup.py

sebastien/libparsing

Folders and files

Latest commit

History

Repository files navigation

libparsing

C & Python Parsing Elements Grammar Library

Installing

C API

Input data

Grammar

Elements

Parsing Elements

Word

Tokens

References

Groups

Rules

Procedures

Conditions

The parsing process