Prerequisites

Before diving into the techniques for optimizing regular expressions covered in this article, it is recommended to have:

  • A basic understanding of regular expressions
  • A basic understanding of finite automata

Introduction

Regular Expressions, commonly known as regex or regexp, are incredibly powerful for searching and manipulating text strings, especially in processing text files. A single line of regex can easily replace several dozen lines of code.

However, their use can sometimes lead to performance issues, particularly when regex patterns are poorly designed or applied to large datasets. Optimizing regular expressions becomes crucial to speed up code execution, reduce resource consumption, and improve system responsiveness.

In this article, we will explore different techniques to make your regular expressions more efficient. We will discuss concepts such as avoiding redundancy, reducing backtracking, and using quantifiers effectively. Whether you are a novice or an expert in regex, these tips will help you improve your patterns to make them faster.

Regex Engines: NFA vs. DFA

It is difficult to discuss optimization without talking about regex engines, which fall into two main categories: NFA (Non-deterministic Finite Automaton) and DFA (Deterministic Finite Automaton). These two methods differ fundamentally in their approach to regex processing.

An NFA engine uses backtracking, exploring multiple paths to find a match. While this approach is flexible, it can be inefficient or even exponentially slow with complex expressions. DFA engines, on the other hand, determine a unique state for each character without backtracking, ensuring faster, linear-time execution at the cost of higher memory consumption.

DFA engines are exceptionally fast because they operate like a lookup table, without needing to make decisions or backtrack. However, converting an NFA with n states into a DFA can produce up to 2^n states, because each DFA state corresponds to a set of NFA states (the subset construction).

The NFA architecture is advantageous for expressions requiring advanced features (such as lookaheads) and complex syntax. In contrast, the DFA is often preferred in applications where speed is critical, such as firewalls or spam filters, because it does constant work per input character — and therefore scans in linear time — regardless of the expression’s complexity. State diagrams can help illustrate these differences: NFA engines handle multiple matching paths, while DFA engines build a state table that uniquely directs each transition without backtracking.

The efficiency of regex engines thus depends on choosing the appropriate type for each application. NFA engines are more flexible and suitable for feature-rich environments, while DFA engines, despite their limitations, are ideal for applications requiring consistent and predictable performance.

In Summary:

NFA (Non-deterministic Finite Automaton):

  • The engine tries all possible branches until it finds a match, backtracking as needed.
  • Supports complex constructs such as capturing groups and lookaheads/lookbehinds.

DFA (Deterministic Finite Automaton):

  • The engine does not backtrack. It processes each character in a single pass, eliminating performance issues related to backtracking.
  • Generally faster for simple regex but does not support some complex constructs (like lookaheads/lookbehinds and capturing groups).

Regex Engine Performance Comparison

This benchmark compares the performance of various regex engines across different programming languages using three common patterns: email, URI, and IPv4 addresses. The results indicate that engines implemented in lower-level languages, such as Rust and C++, outperform the others in speed, with Rust showing notable efficiency even on complex expressions.

Patterns Used:

  • Email: [\w\.+-]+@[\w\.-]+\.[\w\.-]+
  • URI: [\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?
  • IPv4: (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])

Language        Email (ms)   URI (ms)   IPv4 (ms)   Total (ms)
Rust                 45.62       44.01        6.88       96.52
C++ Boost            63.79       61.56       19.92      145.27
Javascript           80.60       63.68        1.85      146.13
PHP                  68.54       72.33        7.14      148.00
Perl                146.34       85.91       28.35      260.61
C PCRE2             178.46      153.83       16.25      348.54
C# .Net Core        172.74      146.99       62.49      382.22
Ruby                310.11      277.34       55.97      643.42
Python 2            297.27      180.01      398.35      875.63
Java                273.64      312.58      372.36      958.59
Python 3            369.28      270.05      426.22     1065.55
Go                  321.04      309.70      476.92     1107.65
C++ STL             561.59      448.54      322.84     1332.97
C# Mono            3739.35     3171.64      191.41     7102.39

Test Environment: MSI Prestige 15, Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz, 16 GB 2667 MHz DDR4 with Ubuntu 24.04.

The results indicate that the choice of regex engine depends on the application’s speed and memory requirements. While DFA engines may suffer in multi-pattern benchmarks due to the exponential increase in states, they excel in processing long, simple strings. NFA engines, with their flexibility, are often used for more complex patterns.

General Regex Optimization Techniques

Note: From this point on, the article will focus on NFA engines, which are the most commonly used regex engines.

Avoid Greedy Quantifiers

The quantifiers *, +, and ? are “greedy” by default, meaning they try to match as many characters as possible before backtracking to satisfy the rest of the regex. This behavior can lead to excessive backtracking.

Solution:

Use non-greedy quantifiers (*?, +?, ??) to limit this excessive search. For example:

  • Greedy: <.+>
  • Non-greedy: <.+?>

In this case, the non-greedy quantifier +? stops as soon as it finds the first closing tag, whereas the greedy + will try to match everything before backtracking.
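
As a quick illustration, here is a minimal C# sketch of the difference (the HTML fragment is just sample data):

    using System;
    using System.Text.RegularExpressions;

    string html = "<b>bold</b> and <i>italic</i>";

    // Greedy: .+ grabs as much as possible, so the match runs from the
    // first '<' to the last '>' in the string.
    Console.WriteLine(Regex.Match(html, "<.+>").Value);
    // => <b>bold</b> and <i>italic</i>

    // Lazy: .+? stops at the first '>' that lets the pattern succeed.
    Console.WriteLine(Regex.Match(html, "<.+?>").Value);
    // => <b>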

Use Non-Capturing Groups Instead of Capturing Groups

Capturing groups can introduce unnecessary overhead, especially when not needed. If you don’t need to capture the group’s content, use non-capturing groups (?:...).

Example:

  • Capturing: (\d{4})-(\d{2})-(\d{2})
  • Non-capturing: (?:\d{4})-(?:\d{2})-(?:\d{2})

Non-capturing groups slightly improve performance, especially when the captured substrings are not needed.
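
A small C# sketch of the two variants (the date string is illustrative); when you never read the groups, the non-capturing form skips the per-group bookkeeping:

    using System;
    using System.Text.RegularExpressions;

    string date = "2024-06-15";

    // Capturing groups: the engine records the text of every group.
    Match m1 = Regex.Match(date, @"(\d{4})-(\d{2})-(\d{2})");
    Console.WriteLine($"{m1.Value} (year group: {m1.Groups[1].Value})");

    // Non-capturing groups: same match, no per-group bookkeeping.
    Match m2 = Regex.Match(date, @"(?:\d{4})-(?:\d{2})-(?:\d{2})");
    Console.WriteLine(m2.Value); // 2024-06-15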

Limit the Scope of Quantifiers

Quantifiers like .* or .+ can cause catastrophic performance issues when used carelessly, especially with large input strings.

Solution:

Restrict the scope of quantifiers. For instance, if you expect a specific number of characters, specify the range:

  • Non-optimal: .*
  • Optimal (if the character count is limited): .{1,100}

This reduces the chances of excessive backtracking, especially when the expression fails after a long match.
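
A minimal C# sketch of the idea; the 1-to-100 bound and the sample line are arbitrary choices for illustration, so substitute whatever limit your data actually guarantees:

    using System;
    using System.Text.RegularExpressions;

    string line = "key=" + new string('x', 50);

    // Unbounded: .* can run to the end of a very long line and then
    // backtrack if a later part of the pattern fails.
    Console.WriteLine(Regex.IsMatch(line, @"^key=.*$"));       // True

    // Bounded: the engine never considers more than 100 characters here.
    Console.WriteLine(Regex.IsMatch(line, @"^key=.{1,100}$")); // True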

Use Specific Character Classes

Instead of using broad patterns like .*, specify the exact characters expected using character classes. This narrows down the search and improves performance.

Example:

  • Non-specific: .*\d.*
  • Specific: \w*\d\w*

The second regex is more efficient because it is more precise about the characters it searches for.
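
Here is a small C# sketch contrasting the two patterns on a made-up input line:

    using System;
    using System.Text.RegularExpressions;

    string input = "order ref ABC123 received";

    // Broad: .* on both sides of the digit swallows the whole line.
    Console.WriteLine(Regex.Match(input, @".*\d.*").Value);
    // => order ref ABC123 received

    // Narrow: \w* only walks over word characters around the digit.
    Console.WriteLine(Regex.Match(input, @"\w*\d\w*").Value);
    // => ABC123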

Anchor Your Regular Expressions

When you know where a match should occur (beginning or end of a string), use anchors like ^ (beginning) and $ (end) to limit the number of possible matches.

Example:

  • Without anchoring: \d{4}
  • Anchored at the start of the line: ^\d{4}

By using ^, the regex starts searching immediately at the beginning of the string, avoiding scanning the entire string for potential matches.
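
A brief C# sketch (the sample strings are invented) showing how the anchor lets a non-matching input be rejected without trying every position:

    using System;
    using System.Text.RegularExpressions;

    string startsWithYear = "2024 report, final version";
    string noYearAtStart  = "final version of the 2024 report";

    // Anchored: the pattern can only succeed at position 0, so a
    // non-matching string is rejected without scanning further.
    Console.WriteLine(Regex.IsMatch(startsWithYear, @"^\d{4}")); // True
    Console.WriteLine(Regex.IsMatch(noYearAtStart,  @"^\d{4}")); // False

    // Unanchored: the engine retries the match at every position.
    Console.WriteLine(Regex.IsMatch(noYearAtStart,  @"\d{4}"));  // True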

Prefer Ordered Alternatives

When using alternatives (|), place the most common options first to reduce the number of comparisons.

Example:

  • Non-optimal if ‘foo’ is the most common match: (baz|bar|foo)
  • Optimal: (foo|bar|baz)

Regex engines test each alternative from left to right. If the first option is the most likely, the number of comparisons is minimized.
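
A minimal C# sketch under the assumption that “INFO” is by far the most frequent log level (the log lines are invented):

    using System;
    using System.Text.RegularExpressions;

    // Assume "INFO" is by far the most frequent level in these logs.
    string[] lines = { "INFO start", "INFO step 1", "WARN slow query", "INFO done" };

    // Most common alternative first: the engine usually succeeds on the
    // first branch instead of failing through the rarer ones.
    var level = new Regex("^(?:INFO|WARN|ERROR)");

    foreach (string line in lines)
        Console.WriteLine(level.Match(line).Value);
    // => INFO, INFO, WARN, INFO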

Use Lookahead and Lookbehind Sparingly

Assertions such as lookaheads ((?=...)) and lookbehinds ((?<=...)) do not consume characters, which makes them useful for expressing conditions precisely, but each assertion is an extra sub-expression the engine must evaluate at every candidate position. Use them sparingly and only when truly necessary.

Example:

Use lookahead to check for a suffix without consuming it: \w+(?=ing\b)

This regex matches the stem of words ending in “ing” without including “ing” itself in the match, because the lookahead only checks for the suffix without consuming it.
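
A short C# sketch with an illustrative sentence:

    using System;
    using System.Text.RegularExpressions;

    string text = "running, jumped, resting";

    // The lookahead checks for the "ing" suffix without consuming it,
    // so only the stem ends up in the match.
    foreach (Match m in Regex.Matches(text, @"\w+(?=ing\b)"))
        Console.WriteLine(m.Value);
    // => runn
    // => rest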

Use Atomic Groups to Limit Backtracking

Atomic groups (?>...) prevent backtracking within the group: once the group has matched, the engine discards the backtracking positions saved inside it, which can improve performance considerably.

Example:

  • Without atomic group: (a+|b+)+c
  • With atomic group: (?>a+|b+)+c

In the version with an atomic group, once the group has matched, the engine will not re-enter it to try shorter matches or the other alternative when the rest of the pattern fails. This reduces unnecessary matching attempts, especially with alternations or nested repetitions. Use atomic groups when you know that giving characters back would never lead to a match.
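
As a rough illustration in C#, the sketch below uses a match timeout to make the difference visible on a failing input (the 200 ms budget and the 30-character input are arbitrary choices):

    using System;
    using System.Text.RegularExpressions;

    string input = new string('a', 30); // no trailing 'c', so both patterns fail

    // Backtracking version: on failure the engine retries every way of
    // splitting the a's between the alternatives, which explodes quickly.
    var slow = new Regex("(a+|b+)+c", RegexOptions.None, TimeSpan.FromMilliseconds(200));
    try { slow.IsMatch(input); }
    catch (RegexMatchTimeoutException) { Console.WriteLine("backtracking version timed out"); }

    // Atomic version: once (?>a+|b+) has matched, the engine never
    // re-enters the group, so the failure is reported almost immediately.
    var fast = new Regex("(?>a+|b+)+c", RegexOptions.None, TimeSpan.FromMilliseconds(200));
    Console.WriteLine(fast.IsMatch(input)); // False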

Note: While powerful, atomic groups should be used carefully. They completely eliminate backtracking for the enclosed pattern, so if you need flexibility in matching or have complex patterns where partial matches are useful, an atomic group might prevent legitimate backtracking attempts.

  • Matching Rigidity: By using an atomic group, you may inadvertently prevent the engine from finding a valid solution that would require minor backtracking. This makes the regex more rigid.
  • Limited Use Cases: Atomic groups are mainly useful in scenarios with repetitive patterns where backtracking significantly impacts performance. If your regex does not suffer from backtracking issues, their use may be unnecessary.

Case Study: (a+a+)+b

To illustrate the optimization of an inefficient regex, we’ll use the example (a+a+)+b, which is often prone to catastrophic backtracking issues. We will benchmark using the .NET 7.0 (C#) regex engine and the regex101 website, with the test string "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".
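
If you want to get a feel for the effect locally, a rough .NET sketch along these lines is enough (illustrative only; the Stopwatch timings will vary by machine and will not match the figures below exactly):

    using System;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    // A long run of 'a's with no 'b', mirroring the case-study input.
    string input = new string('a', 33);
    var regex = new Regex("(a+a+)+b");

    var sw = Stopwatch.StartNew();
    bool matched = regex.IsMatch(input); // fails, triggering heavy backtracking
    sw.Stop();

    Console.WriteLine($"matched: {matched}, elapsed: {sw.Elapsed.TotalMilliseconds} ms");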

Without Optimization (Regex: (a+a+)+b)

We begin by testing the original regex without any optimization. Here are the results obtained:

4 iterations in 10s
avg: 1.39s, min: 1.39s, max: 1.4s

These results reveal catastrophic backtracking, where the regex engine explores an enormous number of potential paths, causing severe slowdowns.

First Optimization: Removing Redundancy

The pattern (a+a+) is redundant: a+a+ matches exactly the same strings as aa+ (two or more consecutive a's). We therefore simplify the regex by replacing (a+a+) with (aa+), resulting in (aa+)+b.

4 iterations in 10s
avg: 1.41s, min: 1.39s, max: 1.43s

Unfortunately, this simplification does not lead to any notable improvement. The engine still suffers from backtracking.

Optimizing Groups: Using Non-Capturing Groups

Since we do not need to capture the groups in this example, we can use non-capturing groups. The regex then becomes (?:aa+)+b.

639 iterations in 10s
avg: 102.5μs, min: 60μs, max: 660μs

This optimization significantly improves performance, increasing from 4 to 639 iterations in 10 seconds. The reduction in backtracking by eliminating unnecessary captures is key here.

Advanced Optimization: Using Atomic Groups

We replace the non-capturing groups with atomic groups (?>...), which completely block backtracking within the group. The regex becomes (?>aa+)+b.

639 iterations in 10s
avg: 102.88μs, min: 60μs, max: 680μs

The use of atomic groups maintains the same performance with a slight variation, indicating that the engine does not need to backtrack in this context.

Final Optimization: Removing the Unnecessary Quantifier

We notice that the + quantifier on the group is unnecessary: (?>aa+) already matches any run of two or more a's, so repeating the group adds nothing. Removing it gives the final regex: (?>aa+)b.

640 iterations in 10s
avg: 96.78μs, min: 60μs, max: 660μs

This final optimization shows a slight performance improvement, with 640 iterations in 10 seconds, confirming that backtracking has been completely eliminated.

Final Result

Here is a summary of the optimizations and iterations obtained:

Regex         Iterations in 10s
(a+a+)+b                      4
(aa+)+b                       4
(?:aa+)+b                   639
(?>aa+)+b                   639
(?>aa+)b                    640

By progressively optimizing the regex, we significantly reduced execution time by limiting backtracking at each step, improving performance by a factor of 160.

Tools and Best Practices

When it comes to optimizing regular expressions, tools like regex101 are essential for profiling and benchmarking. This site offers an interactive interface that allows for real-time regex testing, pattern behavior analysis, and detailed backtracking observation. By visualizing the execution steps, users can identify the parts of their regex that cause slowdowns and adjust their approach accordingly.

Additionally, it is crucial to implement unit tests when optimizing regex patterns. These tests ensure that performance-enhancing changes do not alter the regex’s behavior. By systematically verifying that the use cases work as expected, developers can be confident that their optimizations are effective while preserving the original matching logic and accuracy.
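
As a hedged illustration of that last point, a small test along these lines (shown here with xUnit, but any test framework would do) can pin the original and optimized patterns from the case study above to the same expectations:

    using System.Text.RegularExpressions;
    using Xunit;

    public class RegexOptimizationTests
    {
        // The original and final optimized patterns from the case study.
        private static readonly Regex Original  = new("(a+a+)+b");
        private static readonly Regex Optimized = new("(?>aa+)b");

        [Theory]
        [InlineData("aab", true)]
        [InlineData("aaaaab", true)]
        [InlineData("ab", false)]   // needs at least two 'a's
        [InlineData("aaaa", false)] // no trailing 'b'
        public void OptimizedRegexKeepsTheSameBehaviour(string input, bool expected)
        {
            Assert.Equal(expected, Original.IsMatch(input));
            Assert.Equal(expected, Optimized.IsMatch(input));
        }
    }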

Conclusion

This article explored various regex optimization techniques, highlighting the significant impact these improvements can have on regex engine performance. We covered strategies such as avoiding greedy quantifiers, using non-capturing and atomic groups, and limiting the scope of quantifiers. Each of these methods aims to reduce backtracking and increase efficiency, leading to noticeable performance gains.

To validate the implemented optimizations, it is strongly recommended to use benchmarking tools like regex101 and perform unit testing. These practices help ensure that changes do not affect the expected behavior of regular expressions while guaranteeing faster and more efficient execution. By integrating these techniques and tools into your workflow, you can transform your regex patterns into powerful text-processing assets, maximizing their potential while minimizing resource usage.
