package core:encoding/xml

    Overview

    XML 1.0 / 1.1 parser

    2021-2022 Jeroen van Rijn <nom@duclavier.com>. Available under Odin's BSD-3 license.

    A from-scratch XML implementation, loosely modelled on the [spec](https://www.w3.org/TR/2006/REC-xml11-20060816).

    Features: Supports enough of the XML 1.0/1.1 spec to handle the 99.9% of XML documents in common current usage. Simple to understand and use. Small.

    Caveats: We do NOT support HTML in this package, as that may or may not be valid XML. If it works, great. If it doesn't, that's not considered a bug.

    We do NOT support UTF-16. If you have a UTF-16 XML file, please convert it to UTF-8 first. Also, our condolences.

    `<!ELEMENT` and `<!ATTLIST` are not supported, and will either be ignored or return an error, depending on the parser options.

    MAYBE: XML writer? Serialize/deserialize Odin types?

    List of contributors: Jeroen van Rijn: Initial implementation.

    Types

    Attribute ¶

    Attribute :: struct {
    	key: string,
    	val: string,
    }

    Attributes ¶

    Attributes :: [dynamic]Attribute

    Document ¶

    Document :: struct {
    	elements:        [dynamic]Element,
    	element_count:   u32,
    	prologue:        [dynamic]Attribute,
    	encoding:        Encoding,
    	doctype:         struct {
    		// We only scan the <!DOCTYPE IDENT part and skip the rest.
    		ident: string,
    		rest:  string,
    	},
    	// If we encounter comments before the root node, and the option to intern comments is given, this is where they'll live.
    	// Otherwise they'll be in the element tree.
    	comments:        [dynamic]string,
    	// Internal
    	tokenizer:       ^Tokenizer,
    	allocator:       runtime.Allocator,
    	// Input. Either the original buffer, or a copy if `.Input_May_Be_Modified` isn't specified.
    	input:           []u8,
    	strings_to_free: [dynamic]string,
    }

    Element ¶

    Element :: struct {
    	ident:   string,
    	value:   [dynamic]Value,
    	attribs: [dynamic]Attribute,
    	kind:    enum int {
    		Element = 0, 
    		Comment, 
    	},
    	parent:  u32,
    }

    Element_ID ¶

    Element_ID :: u32

    Encoding ¶

    Encoding :: enum int {
    	Unknown, 
    	UTF_8, 
    	ISO_8859_1, 
    	// Aliases
    	LATIN_1    = 2, 
    }

    Error ¶

    Error :: enum int {
    	// General return values.
    	None                          = 0, 
    	General_Error, 
    	Unexpected_Token, 
    	Invalid_Token, 
    	// Couldn't find, open or read file.
    	File_Error, 
    	// File too short.
    	Premature_EOF, 
    	// XML-specific errors.
    	No_Prolog, 
    	Invalid_Prolog, 
    	Too_Many_Prologs, 
    	No_DocType, 
    	Too_Many_DocTypes, 
    	DocType_Must_Preceed_Elements, 
    	// If a DOCTYPE is present _or_ the caller
    	// asked for a specific DOCTYPE and the DOCTYPE
    	// and root tag don't match, we return `.Invalid_DocType`.
    	Invalid_DocType, 
    	Invalid_Tag_Value, 
    	Mismatched_Closing_Tag, 
    	Unclosed_Comment, 
    	Comment_Before_Root_Element, 
    	Invalid_Sequence_In_Comment, 
    	Unsupported_Version, 
    	Unsupported_Encoding, 
    	// <!FOO are usually skipped.
    	Unhandled_Bang, 
    	Duplicate_Attribute, 
    	Conflicting_Options, 
    }

    Error_Handler ¶

    Error_Handler :: proc(pos: Pos, fmt: string, args: ..any)
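
A custom handler can be passed to the parse procedures in place of `default_error_handler`. The following is a minimal sketch, not the package's own handler: `error_count`, `counting_handler` and the path `"inline.xml"` are hypothetical names chosen for illustration, and the input is deliberately malformed.

```odin
package xml_example

import "core:encoding/xml"
import "core:fmt"

// Hypothetical package-level counter; a handler is a plain procedure,
// so any state it keeps has to live outside of it.
error_count: int

// Matches the Error_Handler signature: a position, a format string and its arguments.
counting_handler :: proc(pos: xml.Pos, msg: string, args: ..any) {
	error_count += 1
	fmt.eprintf("%s(%d:%d) ", pos.file, pos.line, pos.column)
	fmt.eprintf(msg, ..args)
}

main :: proc() {
	// The closing tag does not match the open <b>, so parsing should fail.
	doc, err := xml.parse_string("<a><b></a>", path = "inline.xml", error_handler = counting_handler)
	defer if doc != nil {
		xml.destroy(doc)
	}

	fmt.eprintln("result:", err, "errors reported:", error_count)
}
```

Whether every `Error` value is also reported through the handler is not stated here, so treat the counter as informational rather than as a substitute for checking `err`.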

    Option_Flag ¶

    Option_Flag :: enum int {
    	// If the caller says that input may be modified, we can perform in-situ parsing.
    	// If this flag isn't provided, the XML parser first duplicates the input so that it can.
    	Input_May_Be_Modified, 
    	// Document MUST start with `<?xml` prologue.
    	Must_Have_Prolog, 
    	// Document MUST have a `<!DOCTYPE`.
    	Must_Have_DocType, 
    	// By default we skip comments. Use this option to intern a comment on a parented Element.
    	Intern_Comments, 
    	// How to handle unsupported parts of the specification, like <! other than <!DOCTYPE and <![CDATA[
    	Error_on_Unsupported, 
    	Ignore_Unsupported, 
    	// By default CDATA tags are passed-through as-is.
    	// This option unwraps them when encountered.
    	Unbox_CDATA, 
    	// By default SGML entities like `&gt;`, `&#32;` and `&#x20;` are passed-through as-is.
    	// This option decodes them when encountered.
    	Decode_SGML_Entities, 
    	// If a tag body has a comment, it will be stripped unless this option is given.
    	Keep_Tag_Body_Comments, 
    }

    Option_Flags ¶

    Option_Flags :: bit_set[Option_Flag; u16]

    Options ¶

    Options :: struct {
    	flags:            bit_set[Option_Flag; u16],
    	expected_doctype: string,
    }
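
As a rough illustration of how `Options` and `Option_Flag` fit together, the sketch below builds a stricter configuration than `DEFAULT_OPTIONS` and passes it to `parse_string`. The helper name `parse_strict` and the sample input are assumptions made for this example.

```odin
package xml_example

import "core:encoding/xml"
import "core:fmt"

// parse_strict is a hypothetical helper: it parses `src` with a stricter
// set of flags and only reports the resulting error value.
parse_strict :: proc(src: string) -> xml.Error {
	strict := xml.Options{
		// Require a `<?xml` prologue and report unsupported constructs
		// instead of ignoring them.
		flags            = {.Must_Have_Prolog, .Error_on_Unsupported},
		expected_doctype = "",
	}

	doc, err := xml.parse_string(src, strict)
	defer if doc != nil {
		xml.destroy(doc)
	}
	return err
}

main :: proc() {
	// No prologue here, so with .Must_Have_Prolog this should be rejected.
	fmt.println(parse_strict("<note/>"))
}
```

`.Error_on_Unsupported` and `.Ignore_Unsupported` describe opposite behaviours, so setting only one of them at a time is the safer choice.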

    Pos ¶

    Pos :: struct {
    	file:   string,
    	offset: int, // starting at 0
    	line:   int, // starting at 1
    	column: int,
    }

    Token ¶

    Token :: struct {
    	kind: Token_Kind,
    	text: string,
    	pos:  Pos,
    }

    Token_Kind ¶

    Token_Kind :: enum int {
    	Invalid, 
    	Ident, 
    	Literal, 
    	Rune, 
    	String, 
    	Double_Quote,  // "
    	Single_Quote,  // '
    	Colon,         // :
    	Eq,            // =
    	Lt,            // <
    	Gt,            // >
    	Exclaim,       // !
    	Question,      // ?
    	Hash,          // #
    	Slash,         // /
    	Dash,          // -
    	Open_Bracket,  // [
    	Close_Bracket, // ]
    	EOF, 
    }

    Tokenizer ¶

    Tokenizer :: struct {
    	// Immutable data
    	path:        string,
    	src:         string,
    	err:         Error_Handler,
    	// Tokenizing state
    	ch:          rune,
    	offset:      int,
    	read_offset: int,
    	line_offset: int,
    	line_count:  int,
    	// Mutable data
    	error_count: int,
    }

    Value ¶

    Value :: union {
    	string, 
    	u32, 
    }
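
Since `Element.value` holds `Value` entries that are either text or the ID of a child element, a type switch is one way to walk an element's body. This is a sketch under the assumption that a `u32` entry indexes into `doc.elements` and that the root element has ID 0; `count_children` is a hypothetical helper.

```odin
package xml_example

import "core:encoding/xml"
import "core:fmt"

// count_children is a hypothetical helper: it prints each entry in the
// body of element `id` and counts text runs and child elements.
count_children :: proc(doc: ^xml.Document, id: xml.Element_ID) -> (texts, children: int) {
	for value in doc.elements[id].value {
		switch v in value {
		case string:
			// A run of text inside the element body.
			fmt.println("text: ", v)
			texts += 1
		case xml.Element_ID:
			// Assumption: the ID is an index into doc.elements.
			fmt.println("child:", doc.elements[v].ident)
			children += 1
		}
	}
	return
}

main :: proc() {
	doc, err := xml.parse_string("<a><b/><c/></a>")
	if err != .None {
		fmt.eprintln("parse failed:", err)
		return
	}
	defer xml.destroy(doc)

	// Assumption: element 0 is the document root.
	texts, children := count_children(doc, 0)
	fmt.println(texts, "text runs,", children, "child elements")
}
```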

    Constants

    CDATA_END ¶

    CDATA_END :: "]]>"

    CDATA_START ¶

    CDATA_START :: "<![CDATA["

    COMMENT_END ¶

    COMMENT_END :: "-->"

    COMMENT_START ¶

    COMMENT_START :: "<!--"

    DEFAULT_OPTIONS ¶

    DEFAULT_OPTIONS :: Options{flags = {.Ignore_Unsupported}, expected_doctype = ""}

    Variables

    This section is empty.

    Procedures

    advance_rune ¶

    advance_rune :: proc(t: ^Tokenizer) {…}

    check_duplicate_attributes ¶

    check_duplicate_attributes :: proc(t: ^Tokenizer, attribs: [dynamic]Attribute, attr: Attribute, offset: int) -> (err: Error) {…}

    default_error_handler ¶

    default_error_handler :: proc(pos: Pos, msg: string, args: ..any) {…}

    destroy ¶

    destroy :: proc(doc: ^Document) {…}

    error ¶

    error :: proc(t: ^Tokenizer, offset: int, msg: string, args: ..any) {…}

    expect ¶

    expect :: proc(t: ^Tokenizer, kind: Token_Kind) -> (tok: Token, err: Error) {…}

    find_attribute_val_by_key ¶

    find_attribute_val_by_key :: proc(doc: ^Document, parent_id: u32, key: string) -> (val: string, found: bool) {…}
     

    Find an attribute by key.

    find_child_by_ident ¶

    find_child_by_ident :: proc(doc: ^Document, parent_id: u32, ident: string, nth: int = 0) -> (res: u32, found: bool) {…}
     

    Find parent's nth child with a given ident.
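
The two lookup procedures combine naturally: find a child by name, then read one of its attributes. The sketch below assumes that the root element has ID 0 and that `nth` is zero-based, as its default of 0 suggests. `nth_book_has_id` and the sample document are hypothetical.

```odin
package xml_example

import "core:encoding/xml"
import "core:fmt"

// nth_book_has_id is a hypothetical helper: it reports whether the nth
// <book> child of the root carries the attribute id="want".
nth_book_has_id :: proc(doc: ^xml.Document, nth: int, want: string) -> bool {
	// Assumption: element 0 is the document root.
	root: xml.Element_ID = 0

	book, found := xml.find_child_by_ident(doc, root, "book", nth)
	if !found {
		return false
	}

	id, has_id := xml.find_attribute_val_by_key(doc, book, "id")
	return has_id && id == want
}

main :: proc() {
	doc, err := xml.parse_string(`<catalog><book id="b1"/><book id="b2"/></catalog>`)
	if err != .None {
		fmt.eprintln("parse failed:", err)
		return
	}
	defer xml.destroy(doc)

	fmt.println(nth_book_has_id(doc, 1, "b2"))
}
```

Both procedures return a `found` flag rather than an `Error`, so a missing child or attribute is an ordinary result and not a parse failure.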

    init ¶

    init :: proc(t: ^Tokenizer, src: string, path: string, err: Error_Handler = default_error_handler) {…}

    is_letter ¶

    is_letter :: proc(r: rune) -> bool {…}

    is_valid_identifier_rune ¶

    is_valid_identifier_rune :: proc(r: rune) -> bool {…}

    likely ¶

    likely :: intrinsics.expect

    load_from_file ¶

    load_from_file :: proc(filename: string, options: Options = DEFAULT_OPTIONS, error_handler: proc(pos: Pos, msg: string, args: ..any) = default_error_handler, allocator := context.allocator) -> (doc: ^Document, err: Error) {…}
     

    Load an XML file

    new_element ¶

    new_element :: proc(doc: ^Document) -> (id: u32) {…}

    parse_attribute ¶

    parse_attribute :: proc(doc: ^Document) -> (attr: Attribute, offset: int, err: Error) {…}

    parse_attributes ¶

    parse_attributes :: proc(doc: ^Document, attribs: ^[dynamic]Attribute) -> (err: Error) {…}

    parse_bytes ¶

    parse_bytes :: proc(data: []u8, options: Options = DEFAULT_OPTIONS, path: string = "", error_handler: proc(pos: Pos, msg: string, args: ..any) = default_error_handler, allocator := context.allocator) -> (doc: ^Document, err: Error) {…}

    parse_doctype ¶

    parse_doctype :: proc(doc: ^Document) -> (err: Error) {…}

    parse_prologue ¶

    parse_prologue :: proc(doc: ^Document) -> (err: Error) {…}

    parse_string ¶

    parse_string :: proc(data: string, options: Options = DEFAULT_OPTIONS, path: string = "", error_handler: proc(pos: Pos, msg: string, args: ..any) = default_error_handler, allocator := context.allocator) -> (doc: ^Document, err: Error) {…}
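
A minimal sketch of the parse-then-destroy lifecycle, using only the default arguments. `count_elements` is a hypothetical helper; it returns a number rather than a string because strings in the document may point into memory owned by the `Document`, which is released by `destroy`.

```odin
package xml_example

import "core:encoding/xml"
import "core:fmt"

// count_elements is a hypothetical helper: it parses `src` with
// DEFAULT_OPTIONS and returns how many elements the document holds.
count_elements :: proc(src: string) -> (count: int, err: xml.Error) {
	doc := xml.parse_string(src) or_return
	// Release everything the document owns once we are done with it.
	defer xml.destroy(doc)

	return int(doc.element_count), .None
}

main :: proc() {
	n, err := count_elements("<a><b/><c/></a>")
	if err != .None {
		fmt.eprintln("parse failed:", err)
		return
	}
	fmt.println("elements:", n)
}
```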

    peek ¶

    peek :: proc(t: ^Tokenizer) -> (token: Token) {…}

    peek_byte ¶

    peek_byte :: proc(t: ^Tokenizer, offset: int = 0) -> u8 {…}

    print ¶

    print :: proc(writer: io.Stream, doc: ^Document) -> (written: int, err: io.Error) {…}
     

    Just for debug purposes.
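
For a quick look at a parsed tree, `print` can be pointed at standard output. This sketch assumes `os.stream_from_handle` from `core:os` is available to wrap `os.stdout` in an `io.Stream`; `dump_to_stdout` is a hypothetical helper.

```odin
package xml_example

import "core:encoding/xml"
import "core:fmt"
import "core:os"

// dump_to_stdout is a hypothetical helper: it parses `src` and writes the
// debug representation of the document to standard output.
dump_to_stdout :: proc(src: string) -> (ok: bool) {
	doc, err := xml.parse_string(src)
	if err != .None {
		fmt.eprintln("parse failed:", err)
		return false
	}
	defer xml.destroy(doc)

	// Assumption: os.stream_from_handle wraps a file handle in an io.Stream.
	_, print_err := xml.print(os.stream_from_handle(os.stdout), doc)
	return print_err == .None
}

main :: proc() {
	dump_to_stdout(`<catalog><book id="b1">Odin</book></catalog>`)
}
```

The output format is meant for debugging only, so it should not be relied on as a way to serialize XML.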

    print_element ¶

    print_element :: proc(writer: io.Stream, doc: ^Document, element_id: u32, indent: int = 0) -> (written: int, err: io.Error) {…}

    scan ¶

    scan :: proc(t: ^Tokenizer) -> Token {…}

    scan_comment ¶

    scan_comment :: proc(t: ^Tokenizer) -> (comment: string, err: Error) {…}
     

    A comment ends when we see -->, preceded by a character that's not a dash.

    "For compatibility, the string "--" (double-hyphen) must not occur within comments."
    
    See: https://www.w3.org/TR/2006/REC-xml11-20060816/#dt-comment
    
    Thanks to the length (4) of the comment start, we also have enough lookback,
    and the peek at the next byte asserts that there's at least one more character
    that's a `>`.
    

    scan_identifier ¶

    scan_identifier :: proc(t: ^Tokenizer) -> string {…}

    scan_string ¶

    scan_string :: proc(t: ^Tokenizer, offset: int, close: rune = '<', consume_close: bool = false, multiline: bool = true) -> (value: string, err: Error) {…}

    skip_cdata ¶

    skip_cdata :: proc(t: ^Tokenizer) -> (err: Error) {…}
     

    Skip CDATA

    skip_element ¶

    skip_element :: proc(t: ^Tokenizer) -> (err: Error) {…}

    skip_whitespace ¶

    skip_whitespace :: proc(t: ^Tokenizer) {…}

    validate_options ¶

    validate_options :: proc(options: Options) -> (validated: Options, err: Error) {…}

    Procedure Groups

    Source Files

    Generation Information

    Generated with odin version dev-2024-04 (vendor "odin") Windows_amd64 @ 2024-04-19 21:09:18.667103500 +0000 UTC