Prerequisites

Because this article is meant as a deep dive into regular expressions, it is best to already have a basic understanding of them.

Introduction

Regular Expressions, commonly known as regex, are a powerful tool for pattern matching in text.

However, not all regex engines behave the same way. Under the hood, regex engines rely on different implementations of finite automata – namely, non-deterministic finite automata (NFA) and deterministic finite automata (DFA).

This article will explain the key differences between NFAs and DFAs, show why knowing the regex engine you’re using matters, and offer practical advice for writing safe and efficient patterns.

Basics of finite automata

What is a finite automaton

A finite automaton, or finite-state automaton (FSA), is a mathematical model of computation. An FSA is formally defined by:

  • A finite set of states $Q$,
  • A finite set of input symbols called the alphabet $\Sigma$,
  • A set of transitions, formally defined by a transition function $\delta: Q \times \Sigma \to P(Q)$, where $P(Q)$ denotes the power set of $Q$,
  • An initial state $q_0 \in Q$, and
  • A set of final states $F \subseteq Q$.

It is an abstract machine that can be in exactly one of the states of $Q$ at any given time. The FSA changes from one state to another in response to inputs; these changes are called transitions.

“An example of a simple mechanism that can be modeled by a state machine is a turnstile. A turnstile, used to control access to subways and amusement park rides, is a gate with three rotating arms at waist height, one across the entryway. Initially the arms are locked, blocking the entry, preventing patrons from passing through. Depositing a coin or token in a slot on the turnstile unlocks the arms, allowing a single customer to push through. After the customer passes through, the arms are locked again until another coin is inserted.

Considered as a state machine, the turnstile has two possible states: Locked and Unlocked. There are two possible inputs that affect its state: putting a coin in the slot (coin) and pushing the arm (push). In the locked state, pushing on the arm has no effect; no matter how many times the input push is given, it stays in the locked state. Putting a coin in – that is, giving the machine a coin input – shifts the state from Locked to Unlocked. In the unlocked state, putting additional coins in has no effect; that is, giving additional coin inputs does not change the state. A customer pushing through the arms gives a push input and resets the state to Locked.”

— Wikipedia

```mermaid
---
config:
    theme: 'forest'
---
flowchart LR
    u(("Unlocked")) & l(("Locked")) --Coin--> u
    u & l --Push--> l
```
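The turnstile above maps naturally to code. Here is a minimal sketch of it as a transition table in Python (the lowercase state and event names are mine):

```python
# Turnstile state machine: each (state, event) pair maps to the next state.
TRANSITIONS = {
    ("locked", "coin"): "unlocked",    # a coin unlocks the arms
    ("locked", "push"): "locked",      # pushing a locked arm does nothing
    ("unlocked", "coin"): "unlocked",  # extra coins do not change the state
    ("unlocked", "push"): "locked",    # a customer passes, arms lock again
}

def run(events, state="locked"):
    """Feed a sequence of events to the machine, return the final state."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state

print(run(["coin", "push"]))  # locked
print(run(["coin"]))          # unlocked
```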

How this relates to regular expressions

According to Kleene’s Theorem, a language is regular if and only if it can be recognized by a finite automaton. This means that every regular expression corresponds to some finite automaton, and vice-versa.

For example, the regular expression /ab*/ could be represented by the following automaton:

```mermaid
---
config:
    theme: 'forest'
---
flowchart LR
    s(("start")) --> a((" ")) --a--> b(("end")) --b--> b
```
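You can check this correspondence with any regex engine; a quick sketch using Python's `re` module:

```python
import re

# The automaton accepts exactly the language of /ab*/:
# an "a" followed by zero or more "b"s.
for s in ["a", "ab", "abbb"]:
    assert re.fullmatch(r"ab*", s)       # accepted

for s in ["b", "ba", "abc"]:
    assert re.fullmatch(r"ab*", s) is None  # rejected
```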

Deterministic Finite Automata (DFA) vs Non-Deterministic Finite Automata (NFA)

NFA Definition

An NFA allows a state to have multiple transitions for the same input symbol. It also allows empty transitions, denoted as $\epsilon$ transitions, that the automaton can choose to follow without consuming any input. In such cases, the automaton needs to decide which transition to follow.

Although the process is nondeterministic, the result is deterministic: either a match is found or it isn’t. The engine tries every viable path until one leads to a match or all fail. This is commonly implemented through backtracking, though some engines simulate multiple concurrent sub-automata instead.

For example, if we are trying to match all strings ending in 101 (assuming our strings contain only 0s and 1s), we could use the regex (0|1)*101. Applying Thompson’s construction yields the following NFA:

```mermaid
---
config:
    theme: 'forest'
---
flowchart LR
    s(("start")) --> a((" "))
    a --ϵ--> b((" ")) & f((" "))
    b --1--> c((" ")) --0--> d((" ")) --1--> e(("end"))
    f --ϵ--> g((" ")) --1--> h((" ")) --ϵ--> i((" ")) --ϵ--> b & f
    f --ϵ--> j((" ")) --0--> k((" ")) --ϵ--> i
```
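The "concurrent sub-automata" strategy mentioned earlier can be sketched by tracking the set of states the NFA could be in after each symbol. Below is a hand-built NFA for the same "ends in 101" language, deliberately smaller than the Thompson construction above (the state numbering is mine, not taken from the diagram):

```python
# NFA for "binary strings ending in 101", simulated with the
# set-of-states technique: no backtracking, one pass over the input.
NFA = {
    (0, "0"): {0},
    (0, "1"): {0, 1},  # nondeterministic: guess that the final 101 starts here
    (1, "0"): {2},
    (2, "1"): {3},
}
START, ACCEPT = {0}, {3}

def nfa_match(text: str) -> bool:
    current = set(START)
    for ch in text:
        nxt = set()
        for state in current:
            nxt |= NFA.get((state, ch), set())
        current = nxt
    return bool(current & ACCEPT)

print(nfa_match("1101"))  # True
print(nfa_match("1011"))  # False
```

Note that this simulation already runs in linear time; it is the backtracking implementation of nondeterminism, not nondeterminism itself, that can blow up.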

DFA Definition

A DFA processes input through a single, unambiguous path. For every state and input symbol, there is exactly one transition. This enables fast, linear-time matching with no backtracking.

As in the NFA example above, if we are trying to match all strings ending in 101, we can use the powerset construction to generate a DFA from our NFA. It would look like this:

```mermaid
---
config:
    theme: 'forest'
---
flowchart LR
    s(("start")) --> a((" "))
    a & d --0--> b((" "))
    b --0--> b
    c & e --0--> d((" "))
    a & b & c & e --1--> c((" "))
    d --1--> e(("end"))
```
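Read as code, this DFA really is a lookup table. A sketch of the diagram above in Python (the single-letter state names are mine, assigned left to right, since the diagram's intermediate states are unlabeled):

```python
# Table-driven matcher for the "ends in 101" DFA: every (state, symbol)
# pair has exactly one successor, so matching is a single linear scan.
DFA = {
    ("a", "0"): "b", ("a", "1"): "c",
    ("b", "0"): "b", ("b", "1"): "c",
    ("c", "0"): "d", ("c", "1"): "c",
    ("d", "0"): "b", ("d", "1"): "e",
    ("e", "0"): "d", ("e", "1"): "c",
}

def dfa_match(text: str) -> bool:
    state = "a"
    for ch in text:
        state = DFA[(state, ch)]   # exactly one choice, never backtrack
    return state == "e"            # "e" is the accepting "end" state

print(dfa_match("1101"))  # True
```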

Comparison

Memory vs Speed

One key takeaway is that NFAs and DFAs are functionally equivalent: they recognize the same class of regular languages. The difference lies in a classic computing tradeoff: memory versus speed.

DFAs are fast because they never explore multiple paths or backtrack. In practice, a DFA can be seen as a glorified lookup table, matching any input in linear time. That speed comes at a cost: because every state must have exactly one transition per input symbol, the automaton may need far more states than an equivalent NFA. If you transform an NFA with $n$ states into a DFA using the powerset construction, the resulting DFA may have up to $2^n$ states in the worst case.

It is important to note that DFA minimization algorithms exist: except in worst-case scenarios, or if you’re running regex on a microcontroller (but why would you?), the memory tradeoff is rarely something to worry about.

NFAs, on the other hand, can represent patterns more compactly. They use less memory but may need to backtrack or simulate multiple states, which can hurt performance.

However, this leads us to a critical drawback of NFA engines: catastrophic backtracking.

Catastrophic backtracking occurs when the engine faces a regex with nested or ambiguous quantifiers that create a large number of possible matching paths. In worst-case scenarios, the engine explores these paths one by one, consuming exponential time. This behavior can dramatically slow down the matching process or, even worse, cause a crash. It’s a well-known security vulnerability: attackers can exploit poorly written regexes to launch ReDoS (Regular Expression Denial of Service) attacks.
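A minimal demonstration with Python's backtracking `re` engine, using the classic ReDoS-prone pattern `(a+)+$` (the input is kept short on purpose so the example finishes quickly):

```python
import re
import time

# The nested quantifier in (a+)+$ gives exponentially many ways to split
# a run of "a"s between the inner and outer "+", and a backtracking
# engine tries them all when the overall match fails.
evil = re.compile(r"(a+)+$")
safe = re.compile(r"a+$")     # same language, no nested quantifier

subject = "a" * 15 + "b"      # short on purpose; try 30+ "a"s at your peril

t0 = time.perf_counter()
assert evil.search(subject) is None   # explores thousands of failing paths
evil_time = time.perf_counter() - t0

t0 = time.perf_counter()
assert safe.search(subject) is None   # fails after a linear scan
safe_time = time.perf_counter() - t0
```

Each extra "a" roughly doubles the work for the evil pattern; around 30 characters the match takes seconds, and it only gets worse from there.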

DFA engines avoid this entirely, as they never backtrack.

So… looks like we have a clear winner.

DFAs are clearly better in most situations.

The end.

Modern regular expressions

Well… not really. You see, up to this point we have been working with the formal definition of these engines. What we call “regular expressions” nowadays no longer describes strictly regular languages. The set of constructs available in a regular language, in the strict sense of the term, is quite restricted: you don’t have access to what most modern engines offer, such as capturing groups, backreferences, or lookaround assertions. And because those features have become essential in any programmer’s toolbox, you obviously want to use an engine that offers them.

When it comes to modern regex, NFAs have an incredible advantage: they are far more intuitive and aligned with the structure of the regex itself. NFA-based engines are often described as regex-directed while DFA-based ones are text-directed. The distinction lies in how they process input: an NFA effectively mirrors the regex pattern, following its structure step by step, so the execution feels like a direct walk through the expression.

DFAs, on the other hand, restructure the pattern internally. Their states often appear scrambled, as they precompute all possible paths in advance to ensure deterministic processing. This compression makes DFAs fast but harder to interpret and extend.

These differences can be clearly seen in the example graphs above. Notice how the NFA is easy to read while the DFA seems jumbled.

Lazy quantifiers

The structural clarity of NFAs also makes them much easier to enhance with advanced features.

Lazy quantifiers are a perfect example of this.

By default, quantifiers (such as *, +, or ?) are greedy: they will consume as many characters as possible. For example, suppose you are trying to match HTML tags with <.+>. Take the string: <h1>Hello World!</h1>.

You may think that the regex would only match <h1> and </h1>. However, because quantifiers are greedy, it will match from the first < to the last >, meaning it will match <h1>Hello World!</h1> instead of the tags you wanted.

This is a classic beginner mistake and this is where lazy quantifiers (*?, +?, ??) come into play. Instead of consuming as many characters as possible, they will try to consume as few characters as possible. The pattern can be changed to <.+?> and it will now stop at the first > it sees, matching both <h1> and </h1> as intended.

Laziness requires backtracking: the engine must try the smallest possible match first, check whether the rest of the pattern succeeds, and consume more input only if it doesn’t. That inherently goes against the DFA model. As such, lazy quantifiers are supported only by NFA-based engines.

That’s why it’s so important to understand the capabilities of your regex engine. If it doesn’t support lazy quantifiers, you’ll need to find other ways to write your pattern. For example, here you might use something like <[^<>]+> to avoid greedy matching.
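Here is the whole progression in Python's `re` module, which is NFA-based and supports lazy quantifiers:

```python
import re

html = "<h1>Hello World!</h1>"

# Greedy: .+ grabs as much as it can, spanning both tags.
print(re.findall(r"<.+>", html))      # ['<h1>Hello World!</h1>']

# Lazy: .+? stops at the first closing ">".
print(re.findall(r"<.+?>", html))     # ['<h1>', '</h1>']

# Laziness-free alternative that also works on DFA engines:
print(re.findall(r"<[^<>]+>", html))  # ['<h1>', '</h1>']
```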

Atomic groups

This feature disparity can also be seen with atomic groups.

An atomic group is a non-capturing group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Since DFAs don’t use backtracking, atomic groups don’t even make sense in them.

In Summary

Here is a simple table to succinctly compare DFAs and NFAs:

| Aspect | NFA | DFA |
| --- | --- | --- |
| Matching Algorithm | Backtracking or simulating sub-automata | Single, predictable transition per input |
| Execution Time | Can be exponential in worst cases | Always linear |
| Memory Usage | Typically lower | Can be very high (up to $2^n$ states) |
| Catastrophic Backtracking | Possible; exploitable in ReDoS attacks | Not possible |
| Feature Support | Lazy quantifiers, atomic groups, etc. | Limited to core regex features |
| Best Use Case | Rich, flexible pattern matching | Safe, high-performance matching |

Modern Regex Engines in practice: RE2 and PCRE2

We have seen the formal differences between NFAs and DFAs. In particular, we looked at their trade-offs in memory, speed, and feature support. Now, we are ready to explore how these concepts appear in real-world regex engines.

Many modern engines blur the line between pure DFA and pure NFA, choosing hybrid approaches to balance safety, performance and flexibility. Two prominent examples are RE2 and PCRE2, each embodying one end of the spectrum.

How RE2 prioritizes safety by using a DFA / NFA hybrid

Google’s RE2 is a modern regex engine explicitly designed with safety as its raison d’être. Its core objective is to eliminate the risk of handling untrusted regular expressions. As such, the RE2 algorithm is immune to ReDoS.

A secondary goal is to guarantee that match time is always linear, regardless of the pattern’s complexity. It mostly uses DFAs (yes, multiple DFAs!) with no backtracking. Specifically, RE2 maintains up to four separate DFAs: two for forward execution (one for first-match, one for longest-match) and two for reverse execution (used in unanchored searches). Because of this DFA-driven design, RE2 does not support constructs for which only backtracking solutions are known. Thus, backreferences and lookaround assertions are not supported.

However, if all of its DFAs exceed their memory budget, RE2 falls back on an NFA implementation. Unlike most NFAs, RE2’s NFA algorithm cannot take exponential time to match its input. Thus, what could be seen as the only drawback of DFAs in terms of security is avoided.

In short, RE2 offers a neat balance between performance and safety. This makes it ideal for environments where reliability trumps expressiveness.

How PCRE2 offers powerful features by using an augmented NFA

On the other end of the spectrum, PCRE2 (Perl Compatible Regular Expressions) is a powerful, feature-rich engine. It is widely used in systems that prioritize flexibility and expressive pattern matching.

Unlike RE2, PCRE2 is based on a backtracking NFA model, which allows it to support a broad range of advanced features such as lazy quantifiers, backreferences, and lookaround assertions. This makes PCRE2 extremely versatile and well aligned with the Perl-style regular expressions that many developers are familiar with.

However, this expressive power comes with trade-offs. Because it relies on backtracking to explore all possible matching paths, PCRE2 can suffer from catastrophic backtracking on poorly constructed patterns, making it vulnerable to ReDoS. That said, PCRE2 includes multiple matchers to fit your needs: the default backtracking NFA for feature-rich expressions, a JIT-compiled (just-in-time) version of it for improved performance, or a DFA engine (but then you would probably be better off with RE2).

PCRE2 is an ideal choice when full-featured pattern support is essential and performance can be managed through careful pattern design.

Ultimately, both RE2 and PCRE2 embody different philosophies. RE2 favors safety and predictability, while PCRE2 emphasizes expressiveness and feature richness. But what does this mean for you as a developer, writing real-world regex patterns?

Practical Implications: How engine type affects regex writing

Knowing whether your regex engine is NFA-based, DFA-based, or a hybrid can significantly influence how you write, optimize, and debug your patterns. It isn’t just a matter of performance: it can change which patterns are even possible to write.

Feature availability varies by engine

As we’ve seen, different engines offer different levels of expressive power. This means you must adapt your pattern design based on the engine’s capabilities. For instance, if you want to match repeated characters in an NFA-based engine, such as PCRE2, you can do so using backreferences: (.)\1+. This allows you to match strings such as aa or bbbb.

In a DFA-based engine, such as RE2, you can’t rely on backreferences. In this instance, you would need to fall back on regular code to handle the pattern matching logic.
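A sketch of both approaches in Python (the `repeated_runs` helper is a hypothetical stand-in for the "regular code" fallback you would write alongside an RE2-style engine):

```python
import re

# NFA-style engine: a backreference finds runs of a repeated character.
runs = [m.group() for m in re.finditer(r"(.)\1+", "aabbbbc")]
print(runs)  # ['aa', 'bbbb']

# RE2-style fallback: no backreferences, so do it in plain code.
def repeated_runs(text):
    runs, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                 # extend the run of text[i]
        if j - i >= 2:
            runs.append(text[i:j])
        i = j
    return runs

print(repeated_runs("aabbbbc"))  # ['aa', 'bbbb']
```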

In short: always check what your regex engine actually supports before designing complex patterns. If you want an extensive comparison of features available per regex engine, I encourage you to skim through this handy table.

Performance tips

When performance is critical, or when processing untrusted user input, it’s important to write regex patterns with a clear understanding of the engine behind them.

In NFA-based engines, such as PCRE2, patterns can backtrack extensively. To avoid unnecessary slowdowns, it’s best to minimize greedy quantifiers, make use of atomic groups where possible, and anchor your patterns to avoid ambiguous matches. Using specific character classes instead of broad wildcards like .* also helps narrow the search space and improve performance.

If you’d like to dive deeper into optimization strategies for NFA engines, be sure to check out this related article on the Gistre Blog, which covers a wide range of practical tips.

DFA-based engines, like RE2, typically offer fast and predictable performance out of the box. However, not all patterns are equally efficient. Many of the best practices for NFAs, such as anchoring and avoiding excessive use of wildcards, still apply. One additional consideration unique to DFA engines is memory usage. Since DFAs may precompute large state tables, it’s important to remain mindful of input size and pattern complexity to avoid exceeding memory budgets.

Across all engines, one universal recommendation stands out: if your environment allows it, always precompile and reuse your regex patterns. This can significantly reduce overhead and make your code more efficient in repeated matching scenarios.
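In Python, for example, this means hoisting `re.compile` out of the hot path (the `WORD` pattern and `count_words` helper are purely illustrative):

```python
import re

# Compiled once at module load; every call below reuses the compiled
# pattern instead of re-parsing and re-compiling it.
WORD = re.compile(r"\w+")

def count_words(lines):
    return sum(len(WORD.findall(line)) for line in lines)

print(count_words(["hello world", "regex engines"]))  # 4
```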

Conclusion

Regular expressions are far more than just simple pattern-matching tools. Behind every match lies a powerful engine whose design choices can greatly influence the behavior, performance, and safety of your code. While it’s easy to treat regex as a black box, understanding whether the engine is based on a deterministic or non-deterministic finite automaton can make a significant difference in how you write and reason about your patterns.

By understanding the underlying model of your regex engine, you gain the ability to make more informed choices. That can mean optimizing for performance, avoiding common pitfalls like catastrophic backtracking, or simply ensuring that your patterns are even valid in the first place.

In the end, mastering regular expressions means not only knowing the syntax, but also understanding the machinery that brings them to life.

Sources

NFA vs DFA

RE2

PCRE2

Regex Features and caveats