package core:text/regex
Overview
package regex implements a complete suite for using Regular Expressions to match and capture text.
A regular expression is a pattern, written in a small pattern language, that describes which pieces of text it matches.
Odin's regex library implements the following features:
- Alternation: `apple|cherry`
- Classes: `[0-9_]`
- Classes, negated: `[^0-9_]`
- Shorthands: `\d\s\w`
- Shorthands, negated: `\D\S\W`
- Wildcards: `.`
- Repeat, optional: `a*`
- Repeat, at least once: `a+`
- Repetition: `a{1,2}`
- Optional: `a?`
- Group, capture: `([0-9])`
- Group, non-capture: `(?:[0-9])`
- Start & End Anchors: `^hello$`
- Word Boundaries: `\bhello\b`
- Non-Word Boundaries: `hello\B`
These specifiers can be composed together, such as an optional group:
(?:hello)?
This package also supports the non-greedy variants of the repeating and optional specifiers by appending a `?` to them.
All of the supported shorthand classes are ASCII-based, even when compiling in Unicode mode. This is for the sake of general performance and simplicity: thousands of Unicode codepoints qualify as a digit, space, or word character, and many of them could be irrelevant depending on what is being matched.
Here are the shorthand class equivalencies:
\d: [0-9]
\s: [\t\n\f\r ]
\w: [0-9A-Z_a-z]
If you need your own shorthands, you can compose strings together like so:
MY_HEX :: "[0-9A-Fa-f]"
PATTERN :: MY_HEX + "-" + MY_HEX
The compiler will handle turning multiple identical classes into references to the same set of matching runes, so there's no penalty for doing it like this.
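As a minimal sketch of that composition, the composed constant can be compiled and matched like any other pattern. The `is_hex_pair` helper and the added `^`/`$` anchors are illustrative assumptions, not part of the package.

```odin
package main

import "core:fmt"
import "core:text/regex"

MY_HEX :: "[0-9A-Fa-f]"
// Constant strings are concatenated at compile time.
// The anchors are an addition for this sketch so the whole input must match.
PATTERN :: "^" + MY_HEX + "-" + MY_HEX + "$"

// Hypothetical helper: true if `text` is two hex digits separated by a dash.
is_hex_pair :: proc(text: string) -> bool {
	re, err := regex.create(PATTERN)
	if err != nil {
		return false
	}
	defer regex.destroy_regex(re)

	capture, ok := regex.match_and_allocate_capture(re, text)
	defer regex.destroy_capture(capture)
	return ok
}

main :: proc() {
	fmt.println(is_hex_pair("a-F"))
	fmt.println(is_hex_pair("a-G"))
	// The matcher uses context.temp_allocator for scratch data by default.
	free_all(context.temp_allocator)
}
```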
``Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.'' - Jamie Zawinski
Regular expressions have gathered a reputation over the decades for often being chosen as the wrong tool for the job. Here, we will clarify a few cases in which RegEx might be good or bad.
When is it a good time to use RegEx?
- You don't know at compile-time what patterns of text the program will need to match when it's running.
- As an example, you are making a client which can be configured by the user to trigger on certain text patterns received from a server.
- For another example, you need a way for users of a text editor to compose matching strings that are more intricate than a simple substring lookup.
- The text you're matching against is small (< 64 KiB) and your patterns aren't overly complicated with branches (alternations, repeats, and optionals).
- None of the above applies, but your project doesn't warrant long-term maintenance.
When is it a bad time to use RegEx?
- You know at compile-time the grammar you're parsing; a hand-made parser has the potential to be more maintainable and readable.
- The grammar you're parsing has certain validation steps that lend themselves to forming complicated expressions, such as e-mail addresses, URIs, dates, postal codes, credit cards, et cetera. Using RegEx to validate these structures is almost always a bad sign.
- The text you're matching against is big (> 1 MiB); you would be better served by first dividing the text into manageable chunks and using some heuristic to locate the most likely location of a match before applying RegEx against it.
- You value high performance and low memory usage; RegEx will always have a certain overhead which increases with the complexity of the pattern.
The implementation of this package has been optimized, but it will never be as thoroughly performant as a hand-made parser. In comparison, there are just too many intermediate steps, assumptions, and generalizations in what it takes to handle a regular expression.
Types
Capture
This struct corresponds to a set of string captures from a RegEx match.
`pos` will contain the start and end positions for each string in `groups`, such that `str[pos[0][0]:pos[0][1]] == groups[0]`.
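The relationship between `pos` and `groups` can be checked directly. The following is a small sketch; `capture_is_consistent` is a hypothetical helper written for this example, and the pattern and input are arbitrary.

```odin
package main

import "core:fmt"
import "core:text/regex"

// Hypothetical helper: checks that every entry in `groups` is exactly the
// slice of `str` described by the matching entry in `pos`.
capture_is_consistent :: proc(capture: regex.Capture, str: string) -> bool {
	for group, i in capture.groups {
		start, end := capture.pos[i][0], capture.pos[i][1]
		if str[start:end] != group {
			return false
		}
	}
	return true
}

main :: proc() {
	str :: "2024-12"

	re, err := regex.create(`(\d+)-(\d+)`)
	if err != nil {
		fmt.eprintln("could not compile pattern:", err)
		return
	}
	defer regex.destroy_regex(re)

	capture, ok := regex.match_and_allocate_capture(re, str)
	defer regex.destroy_capture(capture)

	if ok {
		for group, i in capture.groups {
			fmt.println(i, capture.pos[i], group)
		}
		fmt.println("consistent:", capture_is_consistent(capture, str))
	}
	free_all(context.temp_allocator)
}
```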
Related Procedures With Parameters
- destroy_capture
- match_with_preallocated_capture
- destroy (procedure groups)
- match (procedure groups)
Related Procedures With Returns
- match_and_allocate_capture
- preallocate_capture
- match (procedure groups)
Compiler_Error
Compiler_Error :: regex_compiler.Error
Creation_Error
Creation_Error :: enum int {
	None,
	// A `\` was supplied as the delimiter to `create_by_user`.
	Bad_Delimiter,
	// A pair of delimiters for `create_by_user` was not found.
	Expected_Delimiter,
	// An unknown letter was supplied to `create_by_user` after the last delimiter.
	Unknown_Flag,
}
Error
Error :: union { regex_parser.Error, regex_compiler.Error, Creation_Error, }
Related Procedures With Returns
- create
- create_by_user
Parser_Error
Parser_Error :: regex_parser.Error
Regular_Expression
Regular_Expression :: struct { flags: bit_set[regex_common.Flag; u8] `fmt:"-"`, class_data: []regex_vm.Rune_Class_Data `fmt:"-"`, program: []regex_vm.Opcode `fmt:"-"`, }
A compiled Regular Expression value, to be used with the `match_*` procedures.
Related Procedures With Parameters
- destroy_regex
- match_and_allocate_capture
- match_with_preallocated_capture
- destroy (procedure groups)
- match (procedure groups)
Related Procedures With Returns
- create
- create_by_user
Constants
This section is empty.
Variables
This section is empty.
Procedures
create
create :: proc(pattern: string, flags: bit_set[regex_common.Flag; u8] = {}, permanent_allocator := context.allocator, temporary_allocator := context.temp_allocator) -> (result: Regular_Expression, err: Error) {…}
Create a regular expression from a string pattern and a set of flags.
Allocates Using Provided Allocators
Inputs:
pattern: The pattern to compile.
flags: A `bit_set` of RegEx flags.
permanent_allocator: The allocator to use for the final regular expression. (default: context.allocator)
temporary_allocator: The allocator to use for the intermediate compilation stages. (default: context.temp_allocator)
Returns:
result: The regular expression.
err: An error, if one occurred.
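A short sketch of a typical `create` call with a flag set. The `matches_ignoring_case` helper is hypothetical, and the flag name `Case_Insensitive` is taken from the flag list documented for `create_by_user`.

```odin
package main

import "core:fmt"
import "core:text/regex"

// Hypothetical helper (not part of the package): reports whether `text`
// matches `pattern` when letter case is ignored.
matches_ignoring_case :: proc(pattern, text: string) -> bool {
	// The flag set is inferred from the parameter type of `create`.
	re, err := regex.create(pattern, {.Case_Insensitive})
	if err != nil {
		fmt.eprintln("could not compile pattern:", err)
		return false
	}
	// The compiled expression lives in context.allocator by default.
	defer regex.destroy_regex(re)

	capture, ok := regex.match_and_allocate_capture(re, text)
	defer regex.destroy_capture(capture)
	return ok
}

main :: proc() {
	fmt.println(matches_ignoring_case("hellope", "HELLOPE"))
	// Intermediate compilation and matching data use context.temp_allocator.
	free_all(context.temp_allocator)
}
```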
create_by_user
create_by_user :: proc(pattern: string, permanent_allocator := context.allocator, temporary_allocator := context.temp_allocator) -> (result: Regular_Expression, err: Error) {…}
Create a regular expression from a delimited string pattern, such as one provided by users of a program or those found in a configuration file.
They are in the form of:
[DELIMITER] [regular expression] [DELIMITER] [flags]
For example, the following strings are valid:
/hellope/i
#hellope#i
•hellope•i
つhellopeつi
The delimiter is determined by the very first rune in the string.
The only restriction is that the delimiter cannot be `\`, as that rune is used to escape the delimiter if found in the middle of the string.
All runes after the closing delimiter will be parsed as flags:
'g': Global
'm': Multiline
'i': Case_Insensitive
'x': Ignore_Whitespace
'u': Unicode
'n': No_Capture
'-': No_Optimization
Allocates Using Provided Allocators
Inputs:
pattern: The delimited pattern with optional flags to compile.
permanent_allocator: The allocator to use for the final regular expression. (default: context.allocator)
temporary_allocator: The allocator to use for the intermediate compilation stages. (default: context.temp_allocator)
Returns:
result: The regular expression.
err: An error, if one occurred.
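The following sketch compiles a delimited pattern as it might arrive from a configuration file and shows the error path for a `\` delimiter. `user_pattern_matches` is a hypothetical helper, and the input strings are made up for illustration.

```odin
package main

import "core:fmt"
import "core:text/regex"

// Hypothetical helper: compiles a user-supplied delimited pattern and
// tests it against `text`. Compilation errors are passed to the caller.
user_pattern_matches :: proc(user_pattern, text: string) -> (matched: bool, err: regex.Error) {
	re := regex.create_by_user(user_pattern) or_return
	defer regex.destroy_regex(re)

	capture, ok := regex.match_and_allocate_capture(re, text)
	defer regex.destroy_capture(capture)
	return ok, nil
}

main :: proc() {
	// `/` is the delimiter and `i` selects Case_Insensitive.
	matched, err := user_pattern_matches("/hellope/i", "HELLOPE")
	fmt.println(matched, err)

	// A backslash cannot be the delimiter, so an error is expected here.
	_, bad_err := user_pattern_matches("\\hellope\\i", "hellope")
	fmt.println(bad_err)

	free_all(context.temp_allocator)
}
```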
destroy_capture
destroy_capture :: proc(capture: Capture, allocator := context.allocator) {…}
Free all data allocated by the `match_and_allocate_capture` procedure.
Frees Using Provided Allocator
Inputs:
capture: A Capture.
allocator: The allocator used to free the capture data. (default: context.allocator)
destroy_regex
destroy_regex :: proc(regex: Regular_Expression, allocator := context.allocator) {…}
Free all data allocated by the `create*` procedures.
Frees Using Provided Allocator
Inputs:
regex: A regular expression.
allocator: The allocator used to free the regular expression data. (default: context.allocator)
match_and_allocate_capture
match_and_allocate_capture :: proc(regex: Regular_Expression, str: string, permanent_allocator := context.allocator, temporary_allocator := context.temp_allocator) -> (capture: Capture, success: bool) {…}
Match a regular expression against a string and allocate the results into the returned `capture` structure.
The resulting capture strings will be slices to the string `str`, not wholly copied strings, so they won't need to be individually deleted.
Allocates Using Provided Allocators
Inputs:
regex: The regular expression.
str: The string to match against.
permanent_allocator: The allocator to use for the capture results. (default: context.allocator)
temporary_allocator: The allocator to use for the virtual machine. (default: context.temp_allocator)
Returns:
capture: The capture groups found in the string.
success: True if the regex matched the string.
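Because the capture strings are slices of `str`, a group can outlive the `Capture` arrays that pointed at it. The sketch below relies on that; `matched_text` is a hypothetical helper, and it assumes that `groups[0]` holds the overall match.

```odin
package main

import "core:fmt"
import "core:text/regex"

// Hypothetical helper: returns `groups[0]` of the match, assumed here to be
// the overall matched text. The returned string is a view into `str`, so it
// remains valid after the capture's own arrays are freed.
matched_text :: proc(re: regex.Regular_Expression, str: string) -> (text: string, ok: bool) {
	capture, matched := regex.match_and_allocate_capture(re, str)
	// Frees only the `groups` and `pos` arrays, not the string data.
	defer regex.destroy_capture(capture)

	if !matched || len(capture.groups) == 0 {
		return "", false
	}
	return capture.groups[0], true
}

main :: proc() {
	re, err := regex.create("hellope")
	if err != nil {
		fmt.eprintln("could not compile pattern:", err)
		return
	}
	defer regex.destroy_regex(re)

	if text, ok := matched_text(re, "hellope"); ok {
		fmt.println(text)
	}
	free_all(context.temp_allocator)
}
```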
match_with_preallocated_capture
match_with_preallocated_capture :: proc(regex: Regular_Expression, str: string, capture: ^Capture, temporary_allocator := context.temp_allocator) -> (num_groups: int, success: bool) {…}
Match a regular expression against a string and save the capture results into the provided `capture` structure.
The resulting capture strings will be slices to the string `str`, not wholly copied strings, so they won't need to be individually deleted.
Allocates Using Provided Allocator
Inputs:
regex: The regular expression.
str: The string to match against.
capture: A pointer to a Capture structure with `groups` and `pos` already allocated.
temporary_allocator: The allocator to use for the virtual machine. (default: context.temp_allocator)
Returns:
num_groups: The number of capture groups set into `capture`.
success: True if the regex matched the string.
preallocate_capture
preallocate_capture :: proc(allocator := context.allocator) -> (result: Capture) {…}
Allocate a `Capture` in advance for use with `match`. This can save some time if you plan on performing several matches at once and only need the results between matches.
Inputs:
allocator: The allocator to use for the capture. (default: context.allocator)
Returns:
result: The `Capture` with the maximum number of groups allocated.
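As a sketch of that reuse pattern, one preallocated `Capture` can serve a whole batch of matches. `count_matching_lines` is a hypothetical helper, and the pattern and lines are made up for illustration.

```odin
package main

import "core:fmt"
import "core:text/regex"

// Hypothetical helper: counts how many of `lines` match `re`, reusing a
// single preallocated Capture instead of allocating one per match.
count_matching_lines :: proc(re: regex.Regular_Expression, lines: []string) -> int {
	capture := regex.preallocate_capture()
	defer regex.destroy_capture(capture)

	count := 0
	for line in lines {
		// Each call overwrites the results of the previous one.
		_, ok := regex.match_with_preallocated_capture(re, line, &capture)
		if ok {
			count += 1
		}
	}
	return count
}

main :: proc() {
	re, err := regex.create("^hellope$")
	if err != nil {
		fmt.eprintln("could not compile pattern:", err)
		return
	}
	defer regex.destroy_regex(re)

	lines := []string{"hellope", "goodbye", "hellope"}
	fmt.println(count_matching_lines(re, lines))

	// The virtual machine's scratch data lives in context.temp_allocator.
	free_all(context.temp_allocator)
}
```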
Procedure Groups
destroy
destroy :: proc{ destroy_regex, destroy_capture, }
match
match :: proc{ match_and_allocate_capture, match_with_preallocated_capture, }
Source Files
Generation Information
Generated with odin version dev-2024-12 (vendor "odin") Windows_amd64 @ 2024-12-17 21:11:02.074207400 +0000 UTC