package core:encoding/xml
Overview
XML 1.0 / 1.1 parser
A from-scratch XML implementation, loosely modelled on the spec.
Features: Supports enough of the XML 1.0/1.1 spec to handle the 99.9% of XML documents in common current usage. Simple to understand and use. Small.
Caveats: We do NOT support HTML in this package, as that may or may not be valid XML. If it works, great. If it doesn't, that's not considered a bug.
We do NOT support UTF-16. If you have a UTF-16 XML file, please convert it to UTF-8 first. Also, our condolences. <[!ELEMENT and <[!ATTLIST are not supported, and will be either ignored or return an error depending on the parser options.
MAYBE: XML writer? Serialize/deserialize Odin types?
For a full example, see: core/encoding/xml/example
2021-2022 Jeroen van Rijn <nom@duclavier.com>. available under Odin's BSD-3 license. List of contributors: - Jeroen van Rijn: Initial implementation.
Index
Constants (5)
Variables (0)
This section is empty.
Procedures (32)
- advance_rune
- check_duplicate_attributes
- default_error_handler
- destroy
- error
- expect
- find_attribute_val_by_key
- find_child_by_ident
- init
- is_letter
- is_valid_identifier_rune
- likely
- load_from_file
- new_element
- parse_attribute
- parse_attributes
- parse_bytes
- parse_doctype
- parse_prologue
- parse_string
- peek
- peek_byte
- print_element
- scan
- scan_comment
- scan_identifier
- scan_string
- skip_cdata
- skip_element
- skip_whitespace
- validate_options
Procedure Groups (1)
Types
Attributes ¶
Attributes :: [dynamic]Attribute
Document ¶
Document :: struct { elements: [dynamic]Element, element_count: u32, prologue: [dynamic]Attribute, encoding: Encoding, doctype: struct { // We only scan the <!DOCTYPE IDENT part and skip the rest. ident: string, rest: string, }, // If we encounter comments before the root node, and the option to intern comments is given, this is where they'll live. // Otherwise they'll be in the element tree. comments: [dynamic]string, // Internal tokenizer: ^Tokenizer, allocator: runtime.Allocator, // Input. Either the original buffer, or a copy if `.Input_May_Be_Modified` isn't specified. input: []u8, strings_to_free: [dynamic]string, }
Related Procedures With Parameters
- destroy
- find_attribute_val_by_key
- find_child_by_ident
- new_element
- parse_attribute
- parse_attributes
- parse_doctype
- parse_prologue
- print_element
Related Procedures With Returns
- load_from_file
- parse_bytes
- parse_string
- parse (procedure groups)
Element_ID ¶
Element_ID :: u32
Encoding ¶
Encoding :: enum int { Unknown, UTF_8, ISO_8859_1, // Aliases LATIN_1 = 2, }
Error ¶
Error :: enum int { // General return values. None = 0, General_Error, Unexpected_Token, Invalid_Token, // Couldn't find, open or read file. File_Error, // File too short. Premature_EOF, // XML-specific errors. No_Prolog, Invalid_Prolog, Too_Many_Prologs, No_DocType, Too_Many_DocTypes, DocType_Must_Preceed_Elements, // If a DOCTYPE is present _or_ the caller // asked for a specific DOCTYPE and the DOCTYPE // and root tag don't match, we return `.Invalid_DocType`. Invalid_DocType, Invalid_Tag_Value, Mismatched_Closing_Tag, Unclosed_Comment, Comment_Before_Root_Element, Invalid_Sequence_In_Comment, Unsupported_Version, Unsupported_Encoding, // <!FOO are usually skipped. Unhandled_Bang, Duplicate_Attribute, Conflicting_Options, }
Related Procedures With Returns
Error_Handler ¶
Related Procedures With Parameters
Option_Flag ¶
Option_Flag :: enum int { // If the caller says that input may be modified, we can perform in-situ parsing. // If this flag isn't provided, the XML parser first duplicates the input so that it can. Input_May_Be_Modified, // Document MUST start with `<?xml` prologue. Must_Have_Prolog, // Document MUST have a `<!DOCTYPE`. Must_Have_DocType, // By default we skip comments. Use this option to intern a comment on a parented Element. Intern_Comments, // How to handle unsupported parts of the specification, like <! other than <!DOCTYPE and <![CDATA[ Error_on_Unsupported, Ignore_Unsupported, // By default CDATA tags are passed-through as-is. // This option unwraps them when encountered. Unbox_CDATA, // By default SGML entities like `>`, ` ` and ` ` are passed-through as-is. // This option decodes them when encountered. Decode_SGML_Entities, // If a tag body has a comment, it will be stripped unless this option is given. Keep_Tag_Body_Comments, }
Option_Flags ¶
Option_Flags :: bit_set[Option_Flag; u16]
Options ¶
Options :: struct { flags: bit_set[Option_Flag; u16], expected_doctype: string, }
Related Procedures With Parameters
- load_from_file
- parse_bytes
- parse_string
- validate_options
- parse (procedure groups)
Related Constants
Pos ¶
Pos :: struct { file: string, offset: int, // starting at 0 line: int, // starting at 1 column: int, }
Related Procedures With Parameters
Token ¶
Token :: struct { kind: Token_Kind, text: string, pos: Pos, }
Related Procedures With Returns
Token_Kind ¶
Token_Kind :: enum int { Invalid, Ident, Literal, Rune, String, Double_Quote, // " Single_Quote, // ' Colon, // : Eq, // = Lt, // < Gt, // > Exclaim, // ! Question, // ? Hash, // # Slash, // / Dash, // - Open_Bracket, // [ Close_Bracket, // ] EOF, }
Related Procedures With Parameters
Tokenizer ¶
Tokenizer :: struct { // Immutable data path: string, src: string, err: Error_Handler, // Tokenizing state ch: rune, offset: int, read_offset: int, line_offset: int, line_count: int, // Mutable data error_count: int, }
Related Procedures With Parameters
Constants
CDATA_END ¶
CDATA_END :: "]]>"
CDATA_START ¶
CDATA_START :: "<![CDATA["
COMMENT_END ¶
COMMENT_END :: "-->"
COMMENT_START ¶
COMMENT_START :: "<!--"
DEFAULT_OPTIONS ¶
DEFAULT_OPTIONS :: Options{flags = {.Ignore_Unsupported}, expected_doctype = ""}
Variables
This section is empty.
Procedures
advance_rune ¶
advance_rune :: proc(t: ^Tokenizer) {…}
expect ¶
expect :: proc(t: ^Tokenizer, kind: Token_Kind, multiline_string: bool = false) -> (tok: Token, err: Error) {…}
find_attribute_val_by_key ¶
find_attribute_val_by_key :: proc(doc: ^Document, parent_id: u32, key: string) -> (val: string, found: bool) {…}
Find an attribute by key.
find_child_by_ident ¶
find_child_by_ident :: proc(doc: ^Document, parent_id: u32, ident: string, nth: int = 0) -> (res: u32, found: bool) {…}
Find parent's nth child with a given ident.
init ¶
init :: proc(t: ^Tokenizer, src: string, path: string, err: Error_Handler = default_error_handler) {…}
likely ¶
likely :: intrinsics.expect
load_from_file ¶
load_from_file :: proc(filename: string, options: Options = DEFAULT_OPTIONS, error_handler: proc(pos: Pos, msg: string, .. args: ..any) = default_error_handler, allocator := context.allocator) -> (doc: ^Document, err: Error) {…}
Load an XML file
print ¶
Just for debug purposes.
scan_comment ¶
A comment ends when we see -->, preceded by a character that's not a dash. "For compatibility, the string "--" (double-hyphen) must not occur within comments."
See: https://www.w3.org/TR/2006/REC-xml11-20060816/#dt-comment
Thanks to the length (4) of the comment start, we also have enough lookback,
and the peek at the next byte asserts that there's at least one more character
that's a >
.
skip_whitespace ¶
skip_whitespace :: proc(t: ^Tokenizer) {…}
Procedure Groups
parse ¶
parse :: proc{ parse_string, parse_bytes, }
Source Files
Generation Information
Generated with odin version dev-2024-12 (vendor "odin") Windows_amd64 @ 2024-12-20 21:10:45.897697100 +0000 UTC