Regex-Toolkit

Regex-Toolkit provides tools for creating RE and RE2 expressions.

Requirements:

Regex-Toolkit requires Python 3.10 or higher, is platform independent, and has no outside dependencies.

Issue reporting

If you discover an issue with Regex-Toolkit, please report it at https://github.com/Phosmic/regex-toolkit/issues.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Installing

Latest stable version from PyPI:

python3 -m pip install regex-toolkit

Development version from GitHub:

git clone https://github.com/Phosmic/regex-toolkit.git
cd regex-toolkit
python3 -m pip install -e .

Usage

To use the toolkit, import the regex engine(s) you need along with the toolkit itself:

import re
# and/or
import re2
import regex_toolkit as rtk
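Because the imports above list `re2` as "and/or", one way to cope with a missing RE2 binding is a guarded import. This is only a minimal sketch and not part of the toolkit; the `HAVE_RE2` and `engine` names are hypothetical.

```python
import re

try:
    # The RE2 bindings are a separate install and may be absent.
    import re2
    HAVE_RE2 = True
except ImportError:
    re2 = None
    HAVE_RE2 = False

# Pick whichever engine is available (hypothetical selection logic).
engine = re2 if HAVE_RE2 else re
pattern = engine.compile(r"a+b")
print(pattern.search("xaab") is not None)
```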

Why Use `regex_toolkit`?

Regex definitions vary across languages and versions. The toolkit helps you build expressions with more consistent and comprehensive Unicode support across them. It is especially useful for supplementing base Unicode sets with the latest definitions from other languages and standards.

RE2 Overview

RE2 focuses on safely processing regular expressions, particularly those from untrusted input. It guarantees match time linear in the size of the input and keeps memory usage bounded. It is not always faster than other engines, and it intentionally omits features that can only be implemented with backtracking, such as backreferences and look-around assertions.

A brief rundown of RE2 terminology:

BitState: An execution engine that uses backtracking search.
bytecode: The set of instructions that form an automaton.
DFA: The engine for Deterministic Finite Automaton searches.
NFA: Implements the Nondeterministic Finite Automaton search method.
OnePass: A one-pass search execution engine.
pattern: The textual form of a regex.
Prog: The compiled version of a regex.
Regexp: The parsed version of a regex.
Rune: A character in terms of encoding, essentially a code point.

For an in-depth exploration, please refer to the RE2 documentation.

Library

`regex_toolkit.utils`

`resolve_flavor`

def resolve_flavor(potential_flavor: int | RegexFlavor | None) -> RegexFlavor

Resolve a regex flavor.

If the flavor is an integer, it is validated and returned. If the flavor is a RegexFlavor, it is returned. If the flavor is None, the default flavor is returned. To change the default flavor, set default_flavor.

import regex_toolkit as rtk

rtk.base.default_flavor = 2
assert rtk.utils.resolve_flavor(None) == rtk.enums.RegexFlavor.RE2

Arguments:

potential_flavor int | RegexFlavor | None - Potential regex flavor.

Returns:

RegexFlavor - Resolved regex flavor.

Raises:

ValueError - Invalid regex flavor.
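The resolution rules described above can be sketched with a small stand-in enum. This is an illustration under stated assumptions, not the library's code: `Flavor`, `DEFAULT_FLAVOR`, and `resolve` are hypothetical names, and the default of RE is assumed for the example.

```python
from enum import Enum


class Flavor(int, Enum):
    """Hypothetical stand-in for RegexFlavor (1 for RE, 2 for RE2)."""
    RE = 1
    RE2 = 2


DEFAULT_FLAVOR = Flavor.RE  # assumed default, for illustration only


def resolve(potential_flavor=None):
    """Resolve None, an int, or a Flavor to a Flavor member."""
    if potential_flavor is None:
        return DEFAULT_FLAVOR
    try:
        # Flavor(1) and Flavor(Flavor.RE) both return Flavor.RE.
        return Flavor(potential_flavor)
    except ValueError as err:
        raise ValueError(f"Invalid regex flavor: {potential_flavor!r}") from err


assert resolve(None) is Flavor.RE
assert resolve(2) is Flavor.RE2
assert resolve(Flavor.RE2) is Flavor.RE2
```

Passing any other integer, such as `resolve(3)`, raises `ValueError` in this sketch, which mirrors the documented behavior.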

`iter_sort_by_len`

def iter_sort_by_len(texts: Iterable[str],
                     *,
                     reverse: bool = False) -> Generator[str, None, None]

Iterate strings sorted by length.

Arguments:

texts Iterable[str] - Strings to sort.
reverse bool, optional - Sort in descending order (longest to shortest). Defaults to False.

Yields:

str - Strings sorted by length.

`sort_by_len`

def sort_by_len(texts: Iterable[str],
                *,
                reverse: bool = False) -> tuple[str, ...]

Sort strings by length.

Arguments:

texts Iterable[str] - Strings to sort.
reverse bool, optional - Sort in descending order (longest to shortest). Defaults to False.

Returns:

tuple[str, ...] - Strings sorted by length.
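One likely reason to sort by length is building alternations: Python's `re` tries alternatives left to right, so a short string listed first can shadow a longer one. The sketch below uses only the standard library, with `sorted(..., key=len, reverse=True)` standing in for `sort_by_len(texts, reverse=True)`.

```python
import re

words = ["a", "abc", "ab"]

# Stand-in for sort_by_len(words, reverse=True): longest first.
longest_first = tuple(sorted(words, key=len, reverse=True))

# Alternatives are tried left to right, so order changes the match.
naive = re.compile("|".join(map(re.escape, words)))
ordered = re.compile("|".join(map(re.escape, longest_first)))

print(naive.match("abc").group())    # a
print(ordered.match("abc").group())  # abc
```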

`ord_to_cpoint`

def ord_to_cpoint(ordinal: int) -> str

Character ordinal to character codepoint.

The codepoint is always 8 characters long (zero-padded).

Example:

ord_to_cpoint(97)
# Output: '00000061'

Arguments:

ordinal int - Character ordinal.

Returns:

str - Character codepoint.

`cpoint_to_ord`

def cpoint_to_ord(cpoint: str) -> int

Character codepoint to character ordinal.

Arguments:

cpoint str - Character codepoint.

Returns:

int - Character ordinal.
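The two conversions are inverses of each other. A minimal sketch of the round trip, assuming the codepoint string is plain zero-padded hexadecimal as the `'00000061'` example suggests; `ord_to_hex8` and `hex8_to_ord` are hypothetical names, and the case of the hex digits (lowercase here) is an assumption.

```python
def ord_to_hex8(ordinal: int) -> str:
    """Hypothetical equivalent of ord_to_cpoint: 8 hex digits, zero-padded."""
    return format(ordinal, "08x")


def hex8_to_ord(cpoint: str) -> int:
    """Hypothetical equivalent of cpoint_to_ord: parse the hex string."""
    return int(cpoint, 16)


cpoint = ord_to_hex8(ord("a"))
print(cpoint)                    # 00000061
print(chr(hex8_to_ord(cpoint)))  # a
```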

`char_to_cpoint`

def char_to_cpoint(char: str) -> str

Character to character codepoint.

Example:

char_to_cpoint("a")
# Output: '00000061'

Arguments:

char str - Character.

Returns:

str - Character codepoint.

`to_nfc`

def to_nfc(text: str) -> str

Normalize a Unicode string to Normalization Form C (NFC).

Form C favors the use of a fully combined character.

Example:

to_nfc("e\u0301") == "é"
# Output: True

Arguments:

text str - String to normalize.

Returns:

str - Normalized string.
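The same normalization can be reproduced with the standard library's `unicodedata.normalize`, which shows what "fully combined" means in practice. Whether `to_nfc` wraps this exact call is an assumption.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True
```

Normalizing before building an expression helps ensure that visually identical strings produce the same pattern.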

`iter_char_range`

def iter_char_range(first_char: str,
                    last_char: str) -> Generator[str, None, None]

Iterate all characters within a range of characters (inclusive).

Example:

tuple(iter_char_range("a", "c"))
# Output: ('a', 'b', 'c')

tuple(iter_char_range("c", "a"))
# Output: ('c', 'b', 'a')

Arguments:

first_char str - Starting (first) character.
last_char str - Ending (last) character.

Yields:

str - Characters within a range of characters.

`char_range`

def char_range(first_char: str, last_char: str) -> tuple[str, ...]

Tuple of all characters within a range of characters (inclusive).

Example:

char_range("a", "d")
# Output: ('a', 'b', 'c', 'd')

char_range("d", "a")
# Output: ('d', 'c', 'b', 'a')

Arguments:

first_char str - Starting (first) character.
last_char str - Ending (last) character.

Returns:

tuple[str, ...] - Characters within a range of characters.
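An inclusive range that works in both directions can be sketched with `ord`, `chr`, and a signed step. This is an illustrative reimplementation, not the library's code; `chars_between` is a hypothetical name.

```python
def chars_between(first_char: str, last_char: str) -> tuple[str, ...]:
    """Hypothetical equivalent of char_range, inclusive in both directions."""
    first, last = ord(first_char), ord(last_char)
    step = 1 if first <= last else -1
    # Extend the stop value by one step so the last character is included.
    return tuple(chr(o) for o in range(first, last + step, step))


print(chars_between("a", "d"))  # ('a', 'b', 'c', 'd')
print(chars_between("d", "a"))  # ('d', 'c', 'b', 'a')
```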

`mask_span`

def mask_span(text: str,
              span: list[int] | tuple[int, int],
              mask: str | None = None) -> str

Slice and mask a string using a single span.

Arguments:

text str - String to slice.
span list[int] | tuple[int, int] - Domain of index positions (start, end) to mask.
mask str, optional - Mask to insert after slicing. Defaults to None.

Returns:

str - String with span replaced with the mask text.
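Masking a single span amounts to slicing around it. The sketch below assumes the span is end-exclusive, like the value returned by `re.Match.span()`, and that a mask of None removes the span; `mask_one` is a hypothetical name, not the library function.

```python
import re
from typing import Optional


def mask_one(text: str, span: tuple[int, int], mask: Optional[str] = None) -> str:
    """Replace text[start:end] with the mask (or nothing if mask is None)."""
    start, end = span
    return text[:start] + (mask or "") + text[end:]


text = "id 12345 ok"
m = re.search(r"\d+", text)
print(mask_one(text, m.span(), "#"))  # id # ok
```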

`mask_spans`

def mask_spans(text: str,
               spans: Iterable[list[int] | tuple[int, int]],
               masks: Iterable[str] | None = None) -> str

Slice and mask a string using multiple spans.

Todo: Add support for overlapping (and unordered?) spans.

Arguments:

text str - String to slice.
spans Iterable[list[int] | tuple[int, int]] - Domains of index positions (start, end) to mask within the text.
masks Iterable[str], optional - Masks to insert when slicing. Defaults to None.

Returns:

str - String with all spans replaced with the mask text.
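With several spans, each replacement shifts the indexes that follow it. One common way around this is to apply the spans from right to left. The sketch below assumes ordered, non-overlapping, end-exclusive spans, consistent with the Todo note above; `mask_many` is a hypothetical name.

```python
import re


def mask_many(text, spans, masks=None):
    """Replace each (start, end) span with its mask, or remove it if no masks."""
    spans = list(spans)
    masks = list(masks) if masks is not None else [""] * len(spans)
    # Work right to left so earlier indexes stay valid after each replacement.
    for (start, end), mask in reversed(list(zip(spans, masks))):
        text = text[:start] + mask + text[end:]
    return text


text = "call 555-1234 or 555-9876"
spans = [m.span() for m in re.finditer(r"\d{3}-\d{4}", text)]
print(mask_many(text, spans, ["<A>", "<B>"]))  # call <A> or <B>
```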

`regex_toolkit.base`

`escape`

def escape(char: str, flavor: int | None = None) -> str

Create a regex expression that exactly matches a character.

Arguments:

char str - Character to match.
flavor int | None, optional - Regex flavor (1 for RE, 2 for RE2). Defaults to None.

Returns:

str - Expression that exactly matches the original character.

Raises:

ValueError - Invalid regex flavor.

`string_as_exp`

def string_as_exp(text: str, flavor: int | None = None) -> str

Create a regex expression that exactly matches a string.

Arguments:

text str - String to match.
flavor int | None, optional - Regex flavor (1 for RE, 2 for RE2). Defaults to None.

Returns:

str - Expression that exactly matches the original string.

Raises:

ValueError - Invalid regex flavor.

`strings_as_exp`

def strings_as_exp(texts: Iterable[str], flavor: int | None = None) -> str

Create a regex expression that exactly matches any one string.

Arguments:

texts Iterable[str] - Strings to match.
flavor int | None, optional - Regex flavor (1 for RE, 2 for RE2). Defaults to None.

Returns:

str - Expression that exactly matches any one of the original strings.

Raises:

ValueError - Invalid regex flavor.
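To illustrate what "exactly matches" means for strings containing metacharacters, here is a rough standard-library analogue for the RE flavor. It is a sketch under assumptions, not the toolkit's implementation: `any_of` is a hypothetical name, and the use of `re.escape` and longest-first ordering is assumed.

```python
import re


def any_of(texts):
    """Hypothetical analogue of strings_as_exp for Python's re flavor."""
    # Longest first, then alphabetical, so the result is deterministic.
    ordered = sorted(set(texts), key=lambda t: (-len(t), t))
    return "|".join(re.escape(t) for t in ordered)


exp = any_of(["a.b", "a+b"])

print(bool(re.fullmatch(exp, "a.b")))  # True
print(bool(re.fullmatch(exp, "axb")))  # False
```

Without escaping, the `.` in `a.b` would match any character, so `axb` would also match.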

`make_exp`

def make_exp(chars: Iterable[str], flavor: int | None = None) -> str

Create a regex expression that exactly matches a list of characters.

The characters are sorted and grouped into ranges where possible. The expression is returned without enclosing square brackets, so it can be used as part of a larger character class.

Example:

exp = "[" + make_exp(["a", "b", "c", "z", "y", "x"]) + "]"
# Output: '[a-cx-z]'

Arguments:

chars Iterable[str] - Characters to match.
flavor int | None, optional - Regex flavor (1 for RE, 2 for RE2). Defaults to None.

Returns:

str - Expression that exactly matches the original characters.

Raises:

ValueError - Invalid regex flavor.
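The sort-and-group step can be sketched as follows. This is an illustrative reimplementation for the RE flavor only; `chars_to_class_body` is a hypothetical name, and both the use of `re.escape` and the choice to write a range only for runs of three or more characters are assumptions.

```python
import re


def chars_to_class_body(chars):
    """Hypothetical sketch of make_exp: collapse sorted characters into ranges."""
    ordinals = sorted({ord(c) for c in chars})
    parts = []
    i = 0
    while i < len(ordinals):
        # Extend j while the next ordinal is consecutive.
        j = i
        while j + 1 < len(ordinals) and ordinals[j + 1] == ordinals[j] + 1:
            j += 1
        first = re.escape(chr(ordinals[i]))
        last = re.escape(chr(ordinals[j]))
        if j - i >= 2:
            parts.append(f"{first}-{last}")  # run of 3+: write as a range
        elif j > i:
            parts.append(first + last)       # run of 2: list both characters
        else:
            parts.append(first)              # single character
        i = j + 1
    return "".join(parts)


body = chars_to_class_body(["a", "b", "c", "z", "y", "x"])
print("[" + body + "]")  # [a-cx-z]
```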

`regex_toolkit.enums`

Enums.

`RegexFlavor` Objects

class RegexFlavor(int, Enum)

Regex flavors.

Attributes:

RE int - Standard Python regex flavor.
RE2 int - Google RE2 regex flavor.
