NativeXML


XML parsing library written in Julia

Description

This is an XML parsing library written in Julia; see Extensible Markup Language (XML) 1.0 (Fifth Edition). The library is written as a series of stages, each stage feeding into the next.

Lexical

The first stage is the lexical layer; see src/Lexical.jl. It takes the raw character data and converts it to a stream of tokens. Each token consists of a token type, a token value, and a token position. (As of this writing, the token position is not implemented, and a placeholder value is used instead.) Most token values are one or two characters long; the exceptions are text and ws tokens (plain text and white space, respectively), which can be arbitrarily long. The set of possible token types is listed at the top of src/Lexical.jl. The names used for token types come from the SGML specification, with comments in the code indicating any differences.
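
To make the token shape concrete, here is a minimal sketch of a lexer in the same spirit. It is not the code in src/Lexical.jl: the `Token` struct, its field names, the `tokenize` function, and the token type names `:stago`, `:tagc`, `:text`, and `:ws` are all assumptions chosen for illustration, and only two delimiters are recognized.

```julia
# Hypothetical token type; the real definitions live in src/Lexical.jl.
struct Token
    kind::Symbol     # token type, e.g. :stago, :tagc, :text, :ws (assumed names)
    value::String    # the characters covered by the token
    position::Int    # placeholder value, since positions are not implemented
end

# Toy lexer: recognizes '<', '>', runs of white space, and runs of other text.
function tokenize(input::AbstractString)
    chars = collect(input)
    n = length(chars)
    tokens = Token[]
    i = 1
    while i <= n
        c = chars[i]
        if c == '<'
            push!(tokens, Token(:stago, "<", -1))
            i += 1
        elseif c == '>'
            push!(tokens, Token(:tagc, ">", -1))
            i += 1
        elseif isspace(c)
            j = i
            while j <= n && isspace(chars[j])
                j += 1
            end
            push!(tokens, Token(:ws, join(chars[i:j-1]), -1))
            i = j
        else
            j = i
            while j <= n && chars[j] != '<' && chars[j] != '>' && !isspace(chars[j])
                j += 1
            end
            push!(tokens, Token(:text, join(chars[i:j-1]), -1))
            i = j
        end
    end
    return tokens
end

tokens = tokenize("<e> hi")
```

Under these assumptions, delimiter tokens stay one character long, while `:text` and `:ws` tokens grow to cover the whole run they match.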

Events

The second stage is the events layer; see src/Events.jl. This stage takes the token stream created by the lexical stage and converts it to a stream of events. Each event represents a single point in the XML input: examples are element start, comment end, element declaration, and so on. Each event contains the information gathered from the input: for example, element start (ElementStart) contains

  • is_recovery, indicating whether or not this event was created during error recovery,
  • is_empty, indicating whether or not the element start tag used XML's empty tag feature (see Extensible Markup Language (XML) 1.0 (Fifth Edition), production [44]),
  • name, giving the element name, and
  • attributes, giving the set of provided attributes.
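
As a rough picture of such an event, the sketch below models `ElementStart` as a plain struct. The field names come from the list above; the field types, and the choice to store attributes as name/value pairs, are assumptions and may differ from src/Events.jl.

```julia
# Illustrative only: field types are assumed, not copied from src/Events.jl.
struct ElementStart
    is_recovery::Bool                        # created during error recovery?
    is_empty::Bool                           # written with the empty tag form?
    name::String                             # element name
    attributes::Vector{Pair{String,String}}  # attribute name => value
end

# The event one might expect for the input <br class="x"/>
ev = ElementStart(false, true, "br", ["class" => "x"])
```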

The events stage can also emit markup errors, which indicate syntactic issues discovered while generating events; see any occurrence of MarkupError in src/Events.jl. With few exceptions, these errors are not well-formedness constraints or validity constraints, as defined by Extensible Markup Language (XML) 1.0 (Fifth Edition); they cover only cases where the input token stream does not conform to the syntax defined by the grammar productions of the specification. Typical exceptions are

  • an SGML keyword is recognized, a MarkupError is emitted, and the keyword is converted to data content, and
  • an XML keyword or reserved name that is not all uppercase triggers a MarkupError indicating the issue, but the keyword is accepted as if it had appeared in all uppercase.
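
The second exception can be sketched as follows. This is a hypothetical helper, not the library's implementation: `recognize_keyword`, the `XML_KEYWORDS` list, the `MarkupError` field, and the error message are all assumptions.

```julia
# Hypothetical error event; the real MarkupError is defined in src/Events.jl.
struct MarkupError
    message::String
end

# A few XML declaration keywords, used here only as sample data.
const XML_KEYWORDS = ("DOCTYPE", "ELEMENT", "ATTLIST", "ENTITY", "NOTATION")

# Returns the canonical (uppercase) keyword, or nothing if the word is not a
# keyword, together with any errors raised while recognizing it.
function recognize_keyword(word::AbstractString)
    errors = MarkupError[]
    canonical = uppercase(word)
    if canonical in XML_KEYWORDS
        if word != canonical
            # Strict in reporting: the wrong case is flagged...
            push!(errors, MarkupError("keyword '$(word)' must be all uppercase"))
        end
        # ...but forgiving in parsing: the keyword is accepted anyway.
        return canonical, errors
    end
    return nothing, errors
end

keyword, errors = recognize_keyword("doctype")
```

Here `keyword` is the uppercase form and `errors` holds one `MarkupError`, so a caller can both continue parsing and report the problem.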

This last point highlights a general strategy in the events stage for errors and error recovery: the stage is strict about reporting errors but forgiving when parsing, in that it almost always assumes the encountered error was a typo and proceeds accordingly. For example, an element start tag such as <e, where the closing tagc (i.e., >) is missing, is parsed as if it were <e>, generating two events:

  • an ElementStart event, and
  • a MarkupError event indicating the error encountered.

Since the ElementStart event was emitted during error recovery, its is_recovery field would be set to true.
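
The recovery described above might look roughly like this. The sketch handles only a bare start tag without attributes; the `Token`, `ElementStart`, and `MarkupError` definitions, the `element_start` function, the token type names, and the order in which the two events are emitted are assumptions made for illustration.

```julia
# Hypothetical stand-ins for the types in src/Lexical.jl and src/Events.jl.
struct Token
    kind::Symbol
    value::String
    position::Int
end

struct ElementStart
    is_recovery::Bool
    is_empty::Bool
    name::String
    attributes::Vector{Pair{String,String}}
end

struct MarkupError
    message::String
end

# Expects tokens of the form [stago, text] or [stago, text, tagc].
function element_start(tokens::Vector{Token})
    events = Any[]
    name = tokens[2].value
    no_attributes = Pair{String,String}[]
    if length(tokens) >= 3 && tokens[3].kind == :tagc
        # Well-formed start tag: a single, ordinary event.
        push!(events, ElementStart(false, false, name, no_attributes))
    else
        # Missing tagc: assume a typo, emit the element anyway, and report it.
        push!(events, ElementStart(true, false, name, no_attributes))
        push!(events, MarkupError("start tag '$(name)' is missing its closing '>'"))
    end
    return events
end

# The input <e with no closing >
events = element_start([Token(:stago, "<", -1), Token(:text, "e", -1)])
```

A consumer that only wants clean input can filter on `is_recovery`, while one that wants a best-effort parse can use the recovered `ElementStart` as is.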

Parser

The third stage is the parser stage (in progress). This stage takes the event stream created by the events stage and enforces the well-formedness constraints and validity constraints defined by Extensible Markup Language (XML) 1.0 (Fifth Edition), § 1.2.
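
Since this stage is still in progress, the following is only a sketch of the kind of check it could perform: verifying that start and end tags match, in the spirit of the specification's element type match well-formedness constraint. The `ElementEnd` event, the `check_nesting` function, and the problem messages are hypothetical, and the `ElementStart` field types are assumed.

```julia
# Hypothetical event types; the real ones are defined in src/Events.jl.
struct ElementStart
    is_recovery::Bool
    is_empty::Bool
    name::String
    attributes::Vector{Pair{String,String}}
end

struct ElementEnd
    name::String
end

# Walks an event stream and returns a description of each nesting problem.
function check_nesting(events)
    problems = String[]
    open_elements = String[]          # stack of element names awaiting an end tag
    for ev in events
        if ev isa ElementStart
            # Empty-element tags open and close in one step.
            ev.is_empty || push!(open_elements, ev.name)
        elseif ev isa ElementEnd
            if isempty(open_elements)
                push!(problems, "unexpected end tag </$(ev.name)>")
            else
                expected = pop!(open_elements)
                if expected != ev.name
                    push!(problems, "end tag </$(ev.name)> does not match <$(expected)>")
                end
            end
        end
    end
    for name in reverse(open_elements)
        push!(problems, "element <$(name)> is never closed")
    end
    return problems
end

start(name) = ElementStart(false, false, name, Pair{String,String}[])
problems = check_nesting([start("a"), start("b"), ElementEnd("a")])
```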