Before getting started, note the following requirements:
- The parser is written in C++ and requires C++17 or later, because it uses C++17 features such as structured bindings and the `std::filesystem` library.
- A suitable C++ compiler must be installed. This project has been compiled and run successfully with g++ v12.1.0, but it should work with any standards-compliant compiler that supports C++17.
- An understanding of XML is assumed. Please refer to online resources if you are unsure how XML works, since that is beyond the scope of this guide.
Provided these requirements are satisfied, the parser can be used in a C++ program as follows:
- Download the files in the `src` folder and keep them in a suitable location. All header (`.h`) files must be kept in the same folder.
- The only public header designed for direct use is `xml.h`.
- When compiling, include all the `.cpp` files of the parser in the compilation process.
That is all you need to set up the parser in a project. The next step is using the parser in source code.
Before beginning, note that all functions, classes and constants of the library are wrapped in the `xml` namespace to avoid name collisions.
The relevant header to include is `src/xml.h`, which contains a function called `xml::parse` with two overloads.
There are two ways of inputting XML data into the parser:
- A `std::string` object that contains all the XML data to parse. Parsing starts at the first character and concludes after the last character of the string.
- A `std::istream` object that contains XML data. Parsing starts at the initial stream position and concludes when the end of the stream is reached. Any class derived from `std::istream` is accepted; for example, a `std::ifstream` (input file stream) can be provided.
The `xml::parse` function accepts three parameters. In positional order, they are:
- Either a `const std::string&` or a `std::istream&` object (derived stream classes are acceptable).
- A Boolean indicating whether to validate all elements against their ELEMENT declarations (true by default). For more information on ELEMENT declarations, see: https://www.w3.org/TR/xml/#elemdecls
- A Boolean indicating whether to validate the attributes of all elements against ATTLIST declarations (true by default). For more information on ATTLIST declarations, see: https://www.w3.org/TR/xml/#attdecls
Once `xml::parse` is called, parsing begins. All parsing follows the standard at https://www.w3.org/TR/xml, except for the limitations listed in the README document.
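The two overloads and the positional flags described above can be sketched as follows. This is only an illustration of the call shape: the `xml` namespace below is a stub standing in for `#include "xml.h"` so that the snippet compiles on its own, the parameter names are assumptions, and the `source` field exists only in this stub to show which overload was selected.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Stand-in for `#include "xml.h"`. These stubs mirror the call shape described
// in this guide but perform no parsing at all.
namespace xml {
    struct Document {
        std::string source;  // stub-only field: records which overload ran
    };

    // Parameter names are assumptions; the guide only fixes their order.
    Document parse(const std::string& text,
                   bool validate_elements = true,
                   bool validate_attributes = true) {
        (void)text; (void)validate_elements; (void)validate_attributes;
        return Document{"string"};
    }

    Document parse(std::istream& stream,
                   bool validate_elements = true,
                   bool validate_attributes = true) {
        (void)stream; (void)validate_elements; (void)validate_attributes;
        return Document{"stream"};
    }
}

// Parse from a string, leaving both validation flags at their defaults.
xml::Document parse_from_string() {
    const std::string text = "<greeting>hello</greeting>";
    return xml::parse(text);
}

// Parse from a stream derived from std::istream, with element validation
// enabled (second argument) and attribute validation disabled (third).
xml::Document parse_from_stream() {
    std::istringstream stream("<greeting>hello</greeting>");
    return xml::parse(stream, true, false);
}
```

In a real project, the stub namespace would be removed and replaced with the include of `xml.h`; the two calls themselves should stay the same, assuming the signature matches the description above.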
In terms of validation performed, note the following:
- If there is no DOCTYPE declaration, no validation takes place, since a document cannot be 'valid' without a DTD. This is not a problem: only well-formedness is required, not validity. Validation is useful because it promotes data integrity, but it is irrelevant when no DTD is available, and many XML documents omit the DTD because it is unnecessary in many scenarios.
- If a DOCTYPE declaration exists, all validation described in the standard is performed, and parsing fails if the document is not valid. However, note the following:
  - If element validation is disabled, elements in the document are not validated against ELEMENT declarations in the DTD.
  - If attribute validation is disabled, attributes of elements in the document are not validated against ATTLIST declarations in the DTD.
Output takes the form of a single `xml::Document` object. The object has several public attributes, which can be freely modified since the parser does not need to use the object afterwards.
Before explaining each attribute of `xml::Document`, here are the typedefs and utility classes it uses:
- `xml::Char` is a typedef of type `int`, ensuring all UTF-8 characters can be properly represented.
- `xml::String` is inherited from `std::vector<xml::Char>`, ensuring a UTF-8 string is available. Additionally, it can be cast to a `std::string` object if needed.
- `xml::ExternalID` - an external ID in the document, such as in an external ID declaration:
  - `type` (type `xml::ExternalIDType`) - the type of external ID, one of: `xml::ExternalIDType::system` (SYSTEM), `xml::ExternalIDType::public_` (PUBLIC) or `xml::ExternalIDType::none` (no external ID provided).
  - `system_id` (type `std::filesystem::path`) - the system ID (not relevant if `type` is `xml::ExternalIDType::none`).
  - `public_id` (type `std::filesystem::path`) - the public ID (only relevant if `type` is `xml::ExternalIDType::public_`).
- `xml::ProcessingInstruction` - a processing instruction:
  - `target` (type `xml::String`) - name of the target application the instruction is directed to.
  - `instruction` (type `xml::String`) - contents of the instruction.
- `xml::ElementDeclaration` - an ELEMENT declaration in the DTD:
  - `name` (type `xml::String`) - the name of the element the element declaration relates to.
  - `type` (type `xml::ElementType`) - the type of element. One of the following `xml::ElementType::` values: `empty` (no content), `any` (any content), `mixed` (character data and certain child elements), `children` (child elements only).
  - `element_content` (type `xml::ElementContentModel`) - the element content model (only relevant if the element is of type `xml::ElementType::children`):
    - `count` (type `xml::ElementContentCount`) - allowed consecutive frequency of the element content. One of the following `xml::ElementContentCount::` values: `one`, `zero_or_more` (`*`), `zero_or_one` (`?`), `one_more_more` (`+`).
    - `is_name` (type `bool`) - whether the element content is at a leaf, where a single element name is to be matched.
    - `name` (type `xml::String`) - the name of the element to match (only relevant if `is_name` is `true`).
    - `is_sequence` (type `bool`) - if `true`, all sub-contents must occur in order (comma-separated); if `false`, exactly one sub-content must be matched (bar-separated). Only relevant if `is_name` is `false`.
    - `parts` (type `std::vector<xml::ElementContentModel>`) - the sub-element content models, only for use if `is_name` is `false`.
  - `mixed_content` (type `xml::MixedContentModel`) - the mixed content model (only relevant if the element is of type `xml::ElementType::mixed`):
    - `choices` (type `std::set<xml::String>`) - the allowed child element names.
- `xml::AttributeDeclaration` - declaration info for a single attribute of a given element type:
  - `name` (type `xml::String`) - name of the attribute.
  - `type` (type `xml::AttributeType`) - type of the attribute. One of the following `xml::AttributeType::` values: `cdata`, `id`, `idref`, `idrefs`, `entity`, `entities`, `nmtoken`, `nmtokens`, `notation`, `enumeration`.
  - `presence` (type `xml::AttributePresence`) - presence requirements of the attribute. One of the following `xml::AttributePresence::` values: `required` (attribute must be present), `implied` (optional attribute), `fixed` (attribute must take fixed value), `relaxed` (optional attribute with default).
  - `notations` (type `std::set<xml::String>`) - the notation names, only relevant if `type` is `xml::AttributeType::notation`.
  - `enumeration` (type `std::set<xml::String>`) - the enumeration values, only relevant if `type` is `xml::AttributeType::enumeration`.
  - `has_default_value` (type `bool`) - whether a default value has been specified for the attribute.
  - `default_value` (type `xml::String`) - the default value of the attribute, if `has_default_value` is `true`.
  - `from_external` (type `bool`) - whether the attribute was declared in the external DTD subset or an external parameter entity.
- `xml::AttributeListDeclaration` - a typedef of type `std::map<xml::String, xml::AttributeDeclaration>`, mapping each attribute name to its corresponding attribute declaration.
- `xml::Entity` - base class containing attributes common to both general and parameter entities:
  - `name` (type `xml::String`) - the name of the entity.
  - `value` (type `xml::String`) - the value (text) of the entity (empty if the entity is external).
  - `is_external` (type `bool`) - `true` indicates the entity is external, `false` indicates the entity is internal.
  - `external_id` (type `xml::ExternalID`) - the external ID, only relevant if the entity is external.
  - `from_external` (type `bool`) - whether the entity was declared in the external DTD subset or an external parameter entity.
- `xml::GeneralEntity` - represents a general entity, inherited from `xml::Entity`, and contains the following additional attributes:
  - `is_unparsed` (type `bool`) - whether the entity is an unparsed entity, in which case `value` will be empty.
  - `notation_name` (type `xml::String`) - the name of the notation (only if the entity is unparsed).
- `xml::ParameterEntity` - represents a parameter entity, inherited from `xml::Entity`, and contains no additional attributes.
- `xml::NotationDeclaration` - a NOTATION declaration in the DTD:
  - `name` (type `xml::String`) - the name of the notation.
  - `has_system_id` (type `bool`) - whether the notation has a system ID.
  - `has_public_id` (type `bool`) - whether the notation has a public ID.
  - `system_id` (type `xml::String`) - the system ID, if available.
  - `public_id` (type `xml::String`) - the public ID, if available.
- `xml::ElementDeclarations` - a typedef of type `std::map<xml::String, xml::ElementDeclaration>`, mapping each element name to its corresponding declaration.
- `xml::AttributeListDeclarations` - a typedef of type `std::map<xml::String, xml::AttributeListDeclaration>`, mapping each element name to its corresponding attribute list.
- `xml::GeneralEntities` - a typedef of type `std::map<xml::String, xml::GeneralEntity>`, mapping each general entity name to its corresponding object.
- `xml::ParameterEntities` - a typedef of type `std::map<xml::String, xml::ParameterEntity>`, mapping each parameter entity name to its corresponding object.
- `xml::NotationDeclarations` - a typedef of type `std::map<xml::String, xml::NotationDeclaration>`, mapping each notation name to its corresponding declaration.
- `xml::Element` - represents an element in the document:
  - `text` (type `xml::String`) - all character data in the element, excluding text in child elements.
  - `tag` (type `xml::Tag`) - info about the element as seen in its start tag or empty-element tag:
    - `name` (type `xml::String`) - the tag/element name.
    - `attributes` (type `xml::Attribute`, which is a typedef of `std::map<xml::String, xml::String>`) - attribute names mapped to corresponding values.
    - `tag_type` (type `xml::TagType`) - the tag type (either `xml::TagType::start` or `xml::TagType::empty` in this context).
  - `children` (type `std::vector<xml::Element>`) - list of child elements in order of occurrence.
  - `processing_instructions` (type `std::vector<xml::ProcessingInstruction>`) - list of processing instructions within the element in order of occurrence.
  - `is_empty` (type `bool`) - whether the element contains no content.
  - `children_only` (type `bool`) - whether the element contains child elements only (except interspersed whitespace).
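To make the element tree described above more concrete, here is a minimal sketch that walks it depth-first and collects element names. The types in the `sketch` namespace are simplified stand-ins that mirror only the fields used here; they are not the real definitions from `xml.h`. The `to_utf8` and `from_ascii` helpers are hypothetical, written on the assumption that casting an `xml::String` to `std::string` yields UTF-8 bytes.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

namespace sketch {
    using Char = int;                   // one code point, as with xml::Char
    using String = std::vector<Char>;   // simplified stand-in for xml::String

    struct Tag {
        String name;
        std::map<String, String> attributes;
    };

    struct Element {
        String text;
        Tag tag;
        std::vector<Element> children;
    };

    // Hypothetical helper: encode code points as UTF-8 bytes.
    std::string to_utf8(const String& s) {
        std::string out;
        for (Char c : s) {
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else if (c < 0x800) {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                out += static_cast<char>(0xE0 | (c >> 12));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (c >> 18));
                out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }

    // Hypothetical helper: build a String from ASCII-only text.
    String from_ascii(const std::string& s) {
        return String(s.begin(), s.end());
    }

    // Depth-first walk: record this element's name, then visit each child
    // in order of occurrence.
    void collect_names(const Element& element, std::vector<std::string>& out) {
        out.push_back(to_utf8(element.tag.name));
        for (const Element& child : element.children) {
            collect_names(child, out);
        }
    }
}
```

With the real library, the same recursion would presumably start from the root element of the parsed document and read `tag.name` and `children` in the same way.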
With these definitions in place, `xml::Document` consists of the following attributes:
- `version` (type `xml::String`) - the document version specified in the XML declaration. If there is no XML declaration, this attribute has a value of `"1.0"`. Note that, regardless of the specified version, the document is parsed as per XML v1.0.
- `encoding` (type `xml::String`) - the document encoding in lower case. Since only UTF-8 is currently supported by the parser, this attribute always has a value of `"utf-8"`.
- `standalone` (type `bool`) - indicates whether the document has been specified as `standalone="yes"` in the XML declaration. A document is not considered standalone by default, so this attribute is `false` unless specified otherwise. For further information, see: https://www.w3.org/TR/xml/#sec-rmd
- `doctype_declaration` (type `DoctypeDeclaration`) - information about the DOCTYPE declaration of the document:
  - `exists` (type `bool`) - whether a DTD exists in the document. If `false`, all other attributes are irrelevant.
  - `root_name` (type `xml::String`) - the name of the root element.
  - `external_id` (type `xml::ExternalID`) - the external ID to the external DTD subset.
  - `processing_instructions` (type `std::vector<xml::ProcessingInstruction>`) - a list of processing instructions in parse order.
  - `element_declarations` (type `xml::ElementDeclarations`)
  - `attribute_list_declarations` (type `xml::AttributeListDeclarations`)
  - `general_entities` (type `xml::GeneralEntities`)
  - `parameter_entities` (type `xml::ParameterEntities`)
  - `notation_declarations` (type `xml::NotationDeclarations`)
- `root` (type `xml::Element`) - the root element of the document.
- `processing_instructions` (type `std::vector<xml::ProcessingInstruction>`) - a list of processing instructions that occur at the top level of the document.
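The element declarations stored under `doctype_declaration` carry the recursive content model described earlier (`count`, `is_name`, `name`, `is_sequence`, `parts`). As a rough illustration of how those fields fit together, the sketch below renders a content model back into DTD syntax. The `sketch` types are simplified stand-ins rather than the real `xml.h` definitions, `std::string` is used in place of `xml::String` for brevity, and `leaf`, `group`, `suffix` and `to_dtd` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace sketch {
    // Enumerator names are spelled as in the list in this guide.
    enum class ElementContentCount { one, zero_or_more, zero_or_one, one_more_more };

    struct ElementContentModel {
        ElementContentCount count = ElementContentCount::one;
        bool is_name = false;
        std::string name;                        // used only when is_name is true
        bool is_sequence = true;                 // used only when is_name is false
        std::vector<ElementContentModel> parts;  // used only when is_name is false
    };

    // Hypothetical helper: a leaf that matches a single element name.
    ElementContentModel leaf(const std::string& name,
                             ElementContentCount count = ElementContentCount::one) {
        ElementContentModel model;
        model.count = count;
        model.is_name = true;
        model.name = name;
        return model;
    }

    // Hypothetical helper: a sequence (is_sequence == true) or choice group.
    ElementContentModel group(bool is_sequence,
                              std::vector<ElementContentModel> parts,
                              ElementContentCount count = ElementContentCount::one) {
        ElementContentModel model;
        model.count = count;
        model.is_sequence = is_sequence;
        model.parts = std::move(parts);
        return model;
    }

    // Map the count to the DTD occurrence indicator.
    std::string suffix(ElementContentCount count) {
        switch (count) {
            case ElementContentCount::zero_or_more: return "*";
            case ElementContentCount::zero_or_one:  return "?";
            case ElementContentCount::one_more_more: return "+";
            default: return "";
        }
    }

    // Render the model recursively: names at the leaves, parenthesised
    // comma- or bar-separated groups everywhere else.
    std::string to_dtd(const ElementContentModel& model) {
        std::string out;
        if (model.is_name) {
            out = model.name;
        } else {
            out = "(";
            for (std::size_t i = 0; i < model.parts.size(); ++i) {
                if (i > 0) {
                    out += (model.is_sequence ? "," : "|");
                }
                out += to_dtd(model.parts[i]);
            }
            out += ")";
        }
        return out + suffix(model.count);
    }
}
```

For example, a sequence of `title`, a repeated choice between `author` and `editor`, and any number of `chapter` elements renders as `(title,(author|editor)+,chapter*)`.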
Whenever the parser throws an error, it is of type `xml::XmlError`, which is inherited from `std::runtime_error`.
A suitable error message is generated, with the line number and position included where appropriate and possible.
Note that if an error of any type other than `xml::XmlError` occurs, it was not raised explicitly by the parser and can be considered a bug.
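One way to act on this distinction is to catch the parser's error type first and treat anything else as unexpected. The sketch below uses a stand-in `XmlError` class declared locally (mirroring only the stated inheritance from `std::runtime_error`) and hypothetical functions in place of real parse calls; the messages are placeholders, not actual parser output.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

namespace sketch {
    // Stand-in for xml::XmlError, which this guide says derives from
    // std::runtime_error.
    struct XmlError : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    // Hypothetical stand-ins for calls into the parser.
    void failing_parse() { throw XmlError("placeholder message"); }
    void unrelated_failure() { throw std::logic_error("not raised by the parser"); }
    void succeeds() {}

    // Run an action and classify the outcome. The more specific handler
    // must come first, because XmlError is also a std::exception.
    std::string run(void (*action)()) {
        try {
            action();
            return "ok";
        } catch (const XmlError& error) {
            return std::string("xml error: ") + error.what();
        } catch (const std::exception& error) {
            return std::string("unexpected: ") + error.what();
        }
    }
}
```

With the real library, the first handler would catch `xml::XmlError` around the call to `xml::parse`, and the second would flag a likely parser bug.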
Finally, if a `std::istream` is passed in for parsing and no error occurs during parsing (other than a post-validation failure), the stream will be at the end. However, if an error occurs during the actual reading of characters, the stream will be left at an arbitrary position, so take care before reusing it.
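If the same stream needs to be read again afterwards, one option (for seekable streams only) is to clear its state flags and seek back to the start. The sketch below uses only the standard library; `rewind_stream` and `read_all` are hypothetical helpers, and `read_all` merely stands in for any consumer that reads to the end of the stream.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Reset a seekable stream to its beginning so it can be read again.
// Returns false if the stream could not be repositioned.
bool rewind_stream(std::istream& stream) {
    stream.clear();                  // drop eofbit/failbit left by earlier reads
    stream.seekg(0, std::ios::beg);  // has no effect if the flags were not cleared
    return stream.good();
}

// Stand-in for a consumer that reads every remaining character.
std::string read_all(std::istream& stream) {
    std::string out;
    char c;
    while (stream.get(c)) {
        out += c;
    }
    return out;
}
```

Clearing the flags before seeking matters here: once a read has failed at the end of the stream, `seekg` does nothing until the failure state is reset.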
Examples of the XML parser in action can be found in the tests in `test/test_document.cpp` and `test/test_document_files.cpp`. Overall, experiment with the parser; it should be fairly intuitive to use.