State machine approach to parsing well-formed XML (and adding missing bits along the way). Working example of this utility:
http://learningbywrote.com/demo/read_bychar.php
The SGML standard has an appendix describing the distinction between structure-controlled applications and markup-sensitive applications. The principles apply to XML as well as SGML applications. The key chapter is available as a public recommendation for a revision; the relevant material is under the heading "Attachment 1: The ISO 8879 Element Structure Information Set (ESIS)". This attachment does not define a syntax for the information set, but James Clark created a usable one for his original NSGMLS parser (see "NSGMLS: An SGML System Conforming to International Standard ISO 8879").
This web application is a PHP parser that takes a finite state machine approach: it reads a document character by character, and its state transitions represent ESIS events, which are output in a manner similar to that of James Clark's NSGMLS parser. Well-formed content always leaves the element stack empty (a depth of 0) once every element has been popped. With appropriate input rules that define empty elements, this parser can effectively turn non-well-formed content into a well-formed result stream that can then be used as "Source Tree" input to an XSLT transformation to other result formats (such as DITA, which was the impetus for my writing this parser).
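The character-by-character approach described above can be sketched as follows. This is a minimal illustration in Python, not the utility's PHP source: the function name `esis_events` and the state names are hypothetical, and the output lines only imitate the NSGMLS conventions (`A` for an attribute, `(` for an element start, `)` for an element end, `-` for data). It assumes quoted attribute values and ignores comments, processing instructions, doctype declarations, and entity references.

```python
def esis_events(text):
    """Read `text` one character at a time; return (event lines, open-element depth)."""
    events, stack = [], []           # output lines; names of currently open elements
    state, buf = "DATA", ""          # current state; characters of the current token
    name, attr, quote, attrs = "", "", "", []

    def flush_data():
        nonlocal buf
        if buf:                      # escape backslash and newline in data lines
            events.append("-" + buf.replace("\\", "\\\\").replace("\n", "\\n"))
        buf = ""

    def open_element(empty):
        for a, v in attrs:           # attribute lines precede the element start
            events.append(f"A{a} CDATA {v}")
        events.append("(" + name)
        if empty:                    # <x/> starts and ends in one step
            events.append(")" + name)
        else:
            stack.append(name)

    for ch in text:
        if state == "DATA":
            if ch == "<":
                flush_data()
                state = "TAG_OPEN"
            else:
                buf += ch
        elif state == "TAG_OPEN":
            if ch == "/":
                state = "END_NAME"
            else:
                buf, attrs, state = ch, [], "START_NAME"
        elif state == "START_NAME":
            if ch.isspace():
                name, buf, state = buf, "", "IN_TAG"
            elif ch == ">":
                name, buf, state = buf, "", "DATA"
                open_element(False)
            elif ch == "/":
                name, buf, state = buf, "", "EMPTY_CLOSE"
            else:
                buf += ch
        elif state == "IN_TAG":
            if ch == ">":
                state = "DATA"
                open_element(False)
            elif ch == "/":
                state = "EMPTY_CLOSE"
            elif not ch.isspace():
                buf, state = ch, "ATTR_NAME"
        elif state == "ATTR_NAME":
            if ch == "=":
                attr, buf, state = buf, "", "ATTR_VALUE_START"
            else:
                buf += ch
        elif state == "ATTR_VALUE_START":
            if ch in "\"'":          # remember which delimiter opened the value
                quote, state = ch, "ATTR_VALUE"
        elif state == "ATTR_VALUE":
            if ch == quote:
                attrs.append((attr, buf))
                buf, state = "", "IN_TAG"
            else:
                buf += ch
        elif state == "EMPTY_CLOSE":
            if ch == ">":
                state = "DATA"
                open_element(True)
        elif state == "END_NAME":
            if ch == ">":
                name, buf, state = buf, "", "DATA"
                if stack and stack[-1] == name:
                    stack.pop()      # pop only on a matching end tag
                events.append(")" + name)
            else:
                buf += ch
    if state == "DATA":
        flush_data()
    return events, len(stack)


events, depth = esis_events('<doc id="d1"><p>Hi <br/>there</p></doc>')
print("\n".join(events))
print("stack depth:", depth)
```

In this sketch a returned depth of 0 corresponds to the empty element stack that well-formed input should leave behind; an input such as `<a><b>x</b>` leaves one element open.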
Notes:
- This is not necessarily a fully conforming parser, even for well-formed XML. Notably, end-of-line handling still needs to be implemented, as do many other type-checking events during the parse (e.g., detecting attribute values in the absence of quote marks).
- If the doctype represents XHTML, the document is expected to be well-formed XML; otherwise, a language rule is invoked to generate closing markup events for normally empty start tags with no closing delimiter.
- If the attribute value has no LIT or LITA delimiters, a special mode treats ' ' or '>' as the end of the value.
- If end tags can be implied by the known rules of a language type, stop conditions can be used as "end tag" events.
- If the markup is HTML, the parser normalizes the element case to lowercase to ensure more consistent closures during the parse.
- If invoked with no parameter, a default topic ("Dictation Task.html") will be parsed.
- Use the ?infile= query parameter at the end of this URL to pass the URL of an HTML target for parsing. For example, try some of these:
- ?infile=http://core.jumpchart.com/help/article/21/
- ?infile=http://www.sgmlsource.com/8879/n1035.htm (has known unclosed definition terms and other atrocious HTML in it)
- ?infile=http://www.jclark.com/sp/sgmlsout.htm (also has unclosed definition terms)
- ?infile=
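
Several of the notes above (unquoted attribute values, closing events for normally empty elements, stop conditions for implied end tags, and lowercase normalization) can be sketched as follows. This is an illustration in Python under stated assumptions, not the utility's PHP code: the names `scan_attr_value`, `apply_language_rules`, `VOID`, and `STOPS` are hypothetical, the rule tables are small illustrative subsets of HTML, and the input is assumed to be HTML rather than XHTML.

```python
# Illustrative subsets only; a real rule table would be much larger.
VOID = {"br", "hr", "img", "input", "link", "meta"}
# STOPS[open_name] = start tags that imply the end of an open element of that name.
STOPS = {"li": {"li"}, "dt": {"dt", "dd"}, "dd": {"dt", "dd"}, "p": {"p"}}


def scan_attr_value(text, pos):
    """Return (value, next_pos) for the attribute value that starts at text[pos]."""
    if text[pos] in "\"'":                       # LIT (") or LITA (') delimiter
        end = text.index(text[pos], pos + 1)     # find the matching delimiter
        return text[pos + 1:end], end + 1
    end = pos                                    # no delimiter: ' ' or '>' ends the value
    while end < len(text) and text[end] not in " >":
        end += 1
    return text[pos:end], end


def apply_language_rules(events):
    """Turn (kind, value) events from HTML into a balanced event list.

    `kind` is "start", "end", or "data". Returns (events, open-element depth).
    """
    out, stack = [], []

    def close_top():
        out.append(("end", stack.pop()))

    for kind, value in events:
        if kind == "data":
            out.append((kind, value))
            continue
        name = value.lower()                     # normalize element case
        if kind == "start":
            if stack and name in STOPS.get(stack[-1], ()):
                close_top()                      # stop condition acts as an end tag
            out.append(("start", name))
            if name in VOID:
                out.append(("end", name))        # normally empty: close immediately
            else:
                stack.append(name)
        elif name not in VOID:                   # end tags of void elements are ignored
            while stack and stack[-1] != name and stack[-1] in STOPS:
                close_top()                      # close impliable children first
            if stack and stack[-1] == name:
                close_top()
    return out, len(stack)


print(scan_attr_value("href=index.html>", 5))
balanced, depth = apply_language_rules([
    ("start", "DL"), ("start", "DT"), ("data", "term"),
    ("start", "dd"), ("data", "def"), ("start", "BR"), ("end", "dl"),
])
for event in balanced:
    print(event)
print("stack depth:", depth)
```

The definition-list input mirrors the unclosed definition terms mentioned for the sample pages: the `dd` start tag closes the open `dt`, and the `dl` end tag closes the open `dd`, so the resulting stream is balanced.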