package core:text/regex/virtual_machine

⌘K
Ctrl+K
or
/

    Overview

    package regex_vm implements a threaded virtual machine for interpreting regular expressions, based on the designs described by Russ Cox and attributed to both Ken Thompson and Rob Pike.

    The virtual machine executes all threads in lock step, i.e. the string pointer does not advance until all threads have finished processing the current rune. The algorithm does not look backwards.

    Threads merge when splitting or jumping to positions already visited by another thread, based on the observation that each thread having visited one PC (Program Counter) state will execute identically to the previous thread.

    Each thread keeps a save state of its capture groups, and thread priority is used to allow higher precedence operations to complete first with correct save states, such as greedy versus non-greedy repetition.

    For more information, see: https://swtch.com/~rsc/regexp/regexp2.html

    Implementation Details:

    Each opcode is 8 bits in size, and most instructions have no operands.

    All operands larger than u8 are read in system endian order.

    Jump and Split instructions operate on absolute positions in u16 operands.

    Classes such as [0-9] are stored in a RegEx-specific slice of structs which are then dereferenced by a u8 index from the Rune_Class instructions.

    Each Byte and Rune opcode have their operands stored inline after the opcode, sized u8 and i32 respectively.

    A bitmap is used to determine which PC positions are occupied by a thread to perform merging. The bitmap is cleared with every new frame.

    The VM supports two modes: ASCII and Unicode, decided by a compile-time boolean constant argument provided to run. The procedure differs only in string decoding. This was done for the sake of performance.

    No allocations are ever freed; the VM expects an arena or temporary allocator to be used in the context preceding it.

    Opcode Reference:

    (0x00) Match
    
    The terminal opcode which ends a thread. This always comes at the end of
    the program.
    
    (0x01) Match_And_Exit
    
    A modified version of Match which stops the virtual machine entirely. It is
    only compiled for `No_Capture` expressions, as those expressions do not
    need to determine which thread may have saved the most appropriate capture
    groups.
    
    (0x02) Byte
    
    Consumes one byte from the text using its operand, which is also a byte.
    
    (0x03) Rune
    
    Consumes one Unicode codepoint from the text using its operand, which is
    four bytes long in a system-dependent endian order.
    
    (0x04) Rune_Class
    
    Consumes one character (which may be an ASCII byte or Unicode codepoint,
    wholly dependent on which mode the virtual machine is running in) from the
    text.
    
    The actual data storing what runes and ranges of runes apply to the class
    are stored alongside the program in the Regular_Expression structure and
    the operand for this opcode is a single byte which indexes into a
    collection of these data structures.
    
    (0x05) Rune_Class_Negated
    
    A modified version of Rune_Class that functions the same, save for how it
    returns the opposite of what Rune_Class matches.
    
    (0x06) Wildcard
    
    Consumes one byte or one Unicode codepoint, depending on the VM mode.
    
    (0x07) Jump
    
    Sets the Program Counter of a VM thread to the operand, which is a u16.
    This opcode is used to implement Alternation (coming at the end of the left
    choice) and Repeat_Zero (to cause the thread to loop backwards).
    
    (0x08) Split
    
    Spawns a new thread for the X operand and causes the current thread to jump
    to the Y operand. This opcode is used to implement Alternation, all the
    Repeat variations, and the Optional nodes.
    
    Splitting threads is how the virtual machine is able to execute optional
    control flow paths, letting it evaluate different possible ways to match
    text.
    
    (0x09) Save
    
    Saves the current string index to a slot on the thread dictated by the
    operand. These values will be used later to reconstruct capture groups.
    
    (0x0A) Assert_Start
    
    Asserts that the thread is at the beginning of a string.
    
    (0x0B) Assert_End
    
    Asserts that the thread is at the end of a string.
    
    (0x0C) Assert_Word_Boundary
    
    Asserts that the thread is on a word boundary, which can be the start or
    end of the text. This examines both the current rune and the next rune.
    
    (0x0D) Assert_Non_Word_Boundary
    
    A modified version of Assert_Word_Boundary that returns the opposite value.
    
    (0x0E) Multiline_Open
    
    This opcode is compiled in only when the `Multiline` flag is present, and
    it replaces both `^` and `$` text anchors.
    
    It asserts that either the current thread is on one of the string
    boundaries, or it consumes a `\n` or `\r` character.
    
    If a `\r` character is consumed, the PC will be advanced to the sibling
    `Multiline_Close` opcode to optionally consume a `\n` character on the next
    frame.
    
    (0x0F) Multiline_Close
    
    This opcode is always present after `Multiline_Open`.
    
    It handles consuming the second half of a complete newline, if necessary.
    For example, Windows newlines are represented by the characters `\r\n`,
    whereas UNIX newlines are `\n` and Macintosh newlines are `\r`.
    
    (0x10) Wait_For_Byte
    (0x11) Wait_For_Rune
    (0x12) Wait_For_Rune_Class
    (0x13) Wait_For_Rune_Class_Negated
    
    These opcodes are an optimization around restarting threads on failed
    matches when the beginning to a pattern is predictable and the Global flag
    is set.
    
    They will cause the VM to wait for the next rune to match before splitting,
    as would happen in the un-optimized version.
    
    (0x14) Match_All_And_Escape
    
    This opcode is an optimized version of `.*$` or `.+$` that causes the
    active thread to immediately work on escaping the program by following all
    Jumps out to the end.
    
    While running through the rest of the program, the thread will trigger on
    every Save instruction it passes to store the length of the string.
    
    This way, any time a program hits one of these `.*$` constructs, the
    virtual machine can exit early, vastly improving processing times.
    
    Be aware, this opcode is not compiled in if the `Multiline` flag is on, as
    the meaning of `$` changes with that flag.
    

    Index

    Constants (0)

    This section is empty.

    Variables (0)

    This section is empty.

    Procedure Groups (0)

    This section is empty.

    Types

    Machine ¶

    Machine :: struct {
    	// Program state
    	memory:            string,
    	class_data:        []Rune_Class_Data,
    	code:              []Opcode,
    	// Thread state
    	top_thread:        int,
    	threads:           [^]Thread,
    	next_threads:      [^]Thread,
    	// The busy map is used to merge threads based on their program counters.
    	busy_map:          []u64,
    	// Global state
    	string_pointer:    int,
    	current_rune:      rune,
    	current_rune_size: int,
    	next_rune:         rune,
    	next_rune_size:    int,
    }
    Related Procedures With Parameters
    Related Procedures With Returns

    Opcode ¶

    Opcode :: enum u8 {
    	// | [ operands ]
    	Match                       = 0,  // |
    	Match_And_Exit              = 1,  // |
    	Byte                        = 2,  // | u8
    	Rune                        = 3,  // | i32
    	Rune_Class                  = 4,  // | u8
    	Rune_Class_Negated          = 5,  // | u8
    	Wildcard                    = 6,  // |
    	Jump                        = 7,  // | u16
    	Split                       = 8,  // | u16, u16
    	Save                        = 9,  // | u8
    	Assert_Start                = 10, // |
    	Assert_End                  = 11, // |
    	Assert_Word_Boundary        = 12, // |
    	Assert_Non_Word_Boundary    = 13, // |
    	Multiline_Open              = 14, // |
    	Multiline_Close             = 15, // |
    	Wait_For_Byte               = 16, // | u8
    	Wait_For_Rune               = 17, // | i32
    	Wait_For_Rune_Class         = 18, // | u8
    	Wait_For_Rune_Class_Negated = 19, // | u8
    	Match_All_And_Escape        = 20, // |
    }
    Related Procedures With Parameters
    Related Procedures With Returns

    Opcode_Iterator ¶

    Opcode_Iterator :: struct {
    	code: []Opcode,
    	pc:   int,
    }
    Related Procedures With Parameters

    Program ¶

    Program :: []Opcode

    Rune_Class_Data ¶

    Rune_Class_Data :: struct {
    	runes:  []rune,
    	ranges: []regex_parser.Rune_Class_Range,
    }
     

    NOTE: This structure differs intentionally from the one in regex/parser, as this data doesn't need to be a dynamic array once it hits the VM.

    Rune_Class_Range ¶

    Rune_Class_Range :: regex_parser.Rune_Class_Range

    Thread ¶

    Thread :: struct {
    	pc:    int,
    	saved: ^[20]int,
    }

    Constants

    This section is empty.

    Variables

    This section is empty.

    Procedures

    add_thread ¶

    add_thread :: proc(vm: ^Machine, saved: ^[20]int, pc: int) {…}

    check_busy_map ¶

    check_busy_map :: proc "contextless" (vm: ^Machine, pc: int) -> bool {…}

    create ¶

    create :: proc(code: []Opcode, str: string) -> (vm: Machine) {…}

    is_word_class ¶

    is_word_class :: proc "contextless" (r: rune) -> bool {…}
     

    @MetaCharacter NOTE: This must be kept in sync with the compiler & tokenizer.

    iterate_opcodes ¶

    iterate_opcodes :: proc(iter: ^Opcode_Iterator) -> (opcode: Opcode, pc: int, ok: bool) {…}

    opcode_count ¶

    opcode_count :: proc(code: []Opcode) -> (opcodes: int) {…}

    opcode_to_name ¶

    opcode_to_name :: proc(opcode: Opcode) -> (str: string) {…}

    run ¶

    run :: proc(vm: ^Machine, $UNICODE_MODE: bool = ) -> (saved: ^[20]int, ok: bool) {…}

    set_busy_map ¶

    set_busy_map :: proc "contextless" (vm: ^Machine, pc: int) -> bool {…}

    Procedure Groups

    This section is empty.

    Source Files

    Generation Information

    Generated with odin version dev-2024-11 (vendor "odin") Windows_amd64 @ 2024-11-16 21:10:10.883157800 +0000 UTC