Black-Box Testing
Certain types of software are inherently opaque: it can be difficult or, in some cases, impossible to access the complete source code of the program being used. When this occurs, the system is often described as functioning like a “black box”.
This lack of transparency makes it challenging to ensure the software’s reliability and to predict its outcomes accurately, which may lead to significant safety concerns in critical applications. In this article, we will explore a machine learning technique known as active learning. It enables the reconstruction of the full behavior of a black-box system by analyzing a finite set of input-output pairs.
This approach also facilitates the assessment of “equivalence” between two software systems, even when they are implemented in different programming languages, through the abstraction of their behavior into state machines.
To fully understand this article, a background in graph theory, automata theory, and machine learning is recommended.
Table of Contents
 Different categories of programs
 Interesting properties of regular languages
 The L* algorithm
 Example run
 Extension on VPA
 The beauty of abstraction
 Applications and limitations
Different categories of programs
In order to adapt our approaches effectively, we need to distinguish between three categories of programs:
1. DFA
The first category includes programs that behave like regular expressions, such as a binary that checks a password and returns 1 if the password is correct, and 0 otherwise. These programs can be effectively modeled using simple Deterministic Finite Automata (DFA), as the operations performed on the program stack during execution, as well as function calls, are irrelevant for capturing their behavior.
Example:
A function like is_even()
can be modeled as the DFA below.


It is then plain to see that such a function behaves as the regular expression:
[E = [0-9]^*[02468]^+]
Finding such a DFA (or equivalently its regular expression) is sufficient to identify the program’s whole behavior.
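As an illustration, here is a minimal sketch of how such a two-state DFA could be simulated, assuming is_even() reads a decimal string digit by digit (the function body and names here are our own reconstruction, since the original binary is opaque):

```python
# Hypothetical simulation of the is_even() DFA: the current state records
# whether the digits read so far form an even number, i.e. whether the
# most recent digit is even.
EVEN_DIGITS = set("02468")

def is_even_dfa(word: str) -> bool:
    state = False                      # initial state: nothing accepted yet
    for ch in word:
        if ch not in "0123456789":
            return False               # symbol outside the alphabet: reject
        state = ch in EVEN_DIGITS      # transition depends only on the digit
    return state                       # accept iff we end in the "even" state

print(is_even_dfa("42"))   # True
print(is_even_dfa("7"))    # False
```

Note how the DFA never needs a stack: a single bit of state suffices, which is exactly why this class of programs is captured by regular languages.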
2. VPA
The second category comprises independent deterministic programs with near-isomorphism, which are programs that do not interact with the environment and exhibit a finite, predictable set of behaviors for each possible input. These programs can be modeled as Visibly Pushdown Automata (VPA) [1].
Such an automaton recognizes the set of sequences of statements that are valid runs of the program. A word w belongs to the language of the minimal VPA of a program if and only if it describes the sequence of statements performed by that program for some input.
In this case, the automaton’s stack corresponds to the function call stack used by the actual program. The automaton is termed “visibly” pushdown because the operations of pushing and popping to the stack are constrained by specific symbols, such as function calls and returns [2]. As a result, the stack’s behavior depends solely on the input, significantly simplifying the process of comparing two VPAs.
Example:
A function like power
can be modeled as the VPA below. The transitions in green (resp. red) correspond to pushing (resp. popping) operations on the machine’s stack.


A run of the automaton corresponds to a sequence of statements, and is accepted if and only if said sequence can be performed by the program for some input. For example,
[0 → 1 → 5 → 0 → 1 → 5 → 0 → 1 → 2 → 4 → 6 → 9 → 10 → 6 → 7 → 8]
is the sequence of statements performed by (power(3,2)) and is therefore accepted.
Now, let the following symbols represent specific program instructions or conditions:
 a = int res;
 b = n == 0
 c = n != 0
 d = call(power)
 e = return 1;
 f = return (power)
 g = n % 2 == 0
 h = n % 2 != 0
 i = return res * res;
 j = return res * res * x;
Then the VPA above describes the non-regular expression:
[E = [(acd)^k abef (gif + hjf )^{k−1}(gi + hj)] + abe, k \geq 1]
We then fall into the same situation as with our DFAs, where finding this expression is sufficient to describe the entire program’s behavior.
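For concreteness, the statements a–j above are consistent with a fast-exponentiation routine along the following lines (a plausible reconstruction on our part, not the actual source of the program):

```python
def power(x: int, n: int) -> int:
    # b / e: base case, exponent exhausted
    if n == 0:
        return 1
    # a / c / d: declare res and recurse, pushing a frame on the call stack
    res = power(x, n // 2)
    # g / i: even exponent, square the half-power
    if n % 2 == 0:
        return res * res
    # h / j: odd exponent, square and multiply by x once more
    return res * res * x

print(power(3, 2))   # 9
```

Each recursive call corresponds to a push transition of the VPA, and each return to a pop, which is exactly the "visible" stack discipline described above.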
3. Turing Machines
The third and final category encompasses general programs that interact with the environment, exhibit truly random behavior, lack observable outputs, or have unbounded complexity.
These programs can only be modeled by Turing machines, and as such, they fall outside the scope of active learning algorithms as understood by the current state of the art.
Consequently, attempting to learn the behavior of complex systems like the Linux kernel or large-scale software such as The Elder Scrolls V: Skyrim through active learning methods is entirely impractical.
Interesting properties of regular languages
Now that we understand that programs can be represented as state machines (or equivalently regular expressions), we can turn our attention to the challenge of extracting these representations from a black-box system. To address this problem, it is essential to examine some key properties of regular languages.
Given two words ((w_1, w_2)) over the alphabet (\Sigma) of a language L, we define a distinguishing extension as a word c such that:
[ (w_1 \bullet c \in L) \neq (w_2 \bullet c \in L) ]
We then define the Myhill-Nerode relation (\sim_L) as follows:
[ \forall (w_1, w_2) \in (\Sigma^*)^2, w_1 \sim_L w_2 \iff \nexists c \in \Sigma^*, (w_1 \bullet c \in L) \neq (w_2 \bullet c \in L) ]
The Myhill-Nerode theorem [3] shows that this relation partitions (\Sigma^*) into equivalence classes and that L is regular if and only if this partition is of finite order.
These properties enable regular languages, which may have infinite cardinality, to be described using a finite set of elements by identifying equivalence classes based on the Myhill-Nerode theorem [4]. This approach is sufficient for programs that behave like DFA, as their associated languages are regular by definition. Moreover, an extension of this theorem exists for VPA, even though they can represent context-free languages [5].
The goal of active learning algorithms is to identify the distinct partitions of the target language by querying inputs to the observed program and analyzing the corresponding outputs. Through this process, it becomes possible to construct a state machine that accurately captures any program’s behavior in finite time, even if the latter can perform an infinite variety of instruction sequences.
The L* algorithm
The L* algorithm was one of the earliest active learning algorithms developed [6] and is capable of learning a DFA by querying an information source known as the teacher. In our context, the teacher could be represented by a binary executable, with the L* algorithm querying it by providing various inputs to observe its corresponding outputs. Through this interaction, L* gradually constructs a model of the DFA that describes the program.
In order to classify the information given by the teacher, L* maintains an observation table. An observation table over an alphabet (\Sigma) is a tuple ((S, E, T)) with S a finite prefix-closed set of words over (\Sigma), E a finite suffix-closed set of words over (\Sigma), and T a mapping from (((S \cup S.\Sigma).E)) to ({ 0, 1}). A set of words X is said to be prefix-closed (resp. suffix-closed) iff all prefixes (resp. suffixes) of all elements of X are also in X.
For any (s \in (S \cup S.\Sigma)) we define row(s) as the function (f:e \rightarrow T(s.e)).
For example, with (S = E = { \epsilon, a, b}) and (T) such that (T(x) = 1) for (x \in { \epsilon, a, aa, aaa}) and (T(x) = 0) otherwise, we have the following observation table:
ϵ  a  b  

S: ϵ  1  1  0 
S: a  1  1  0 
S: b  0  0  0 
aa  1  1  0 
ab  0  0  0 
ba  0  0  0 
bb  0  0  0 
Intuitively, a cell tells us whether the word formed by concatenating the row label ((s \in S)) and the column label ((e \in E)) is a word of the language.
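The table above can also be written down directly as data; the sketch below (our own illustrative encoding, not part of the algorithm proper) reproduces it from the mapping T:

```python
# Reconstruction of the example observation table: T(x) = 1 exactly for
# x in {ε, a, aa, aaa}; a row is the tuple of T(s + e) over the suffixes E.
S = ["", "a", "b"]
E = ["", "a", "b"]

def T(w: str) -> int:
    return 1 if w in {"", "a", "aa", "aaa"} else 0

def row(s: str) -> tuple:
    return tuple(T(s + e) for e in E)

print(row("a"))    # (1, 1, 0) -- the "S: a" line of the table
print(row("ab"))   # (0, 0, 0) -- the "ab" line of the table
```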
By constructing S, the learner identifies the different equivalence classes of the target language with respect to the Myhill-Nerode relation. E then represents the set of distinguishing extensions. In a similar manner, it identifies the transition function by constructing T.
An observation table is said to be closed iff for every (t \in S.\Sigma) there exists an (s \in S) such that (row(t) = row(s)). An observation table is said to be consistent if and only if:
[ \forall (s_1, s_2) \in S^2, row(s_1) = row(s_2) \implies \forall a \in \Sigma, row(s_1.a) = row(s_2.a) ]
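Both conditions are mechanical to check. A minimal sketch, reusing the example table from above (helper names are our own):

```python
# Closure: every row of S.Σ already appears as a row of S.
# Consistency: rows that agree on S must keep agreeing after one more symbol.

def T(w: str) -> int:                      # membership mapping of the example
    return 1 if w in {"", "a", "aa", "aaa"} else 0

def make_row(E):
    return lambda s: tuple(T(s + e) for e in E)

def is_closed(S, E, sigma) -> bool:
    row = make_row(E)
    rows_of_S = {row(s) for s in S}
    return all(row(s + a) in rows_of_S for s in S for a in sigma)

def is_consistent(S, E, sigma) -> bool:
    row = make_row(E)
    return not any(
        row(s1) == row(s2) and row(s1 + a) != row(s2 + a)
        for s1 in S for s2 in S for a in sigma
    )

S, E, SIGMA = ["", "a", "b"], ["", "a", "b"], "ab"
print(is_closed(S, E, SIGMA))      # True
print(is_consistent(S, E, SIGMA))  # True
```

With (S = E = {\epsilon}) only, the same table would not be closed, which is what forces the learner to grow S.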
Given a closed, consistent observation table ((S, E, T)) we define an acceptor (M((S, E, T))) as a tuple ((Q, q_0, F, \delta)) with (Q = { row(s), s \in S}) a set of states, (q_0 = row(\epsilon)) the unique initial state, (F) the subset of (Q) of final states, and (\delta) the transition function such that (\delta(row(s), a) = row(s.a)). (M((S, E, T))) is a DFA by construction and has one state per distinct row of (S).
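Continuing with the same example table, the acceptor can be assembled as follows (a sketch under our assumed encoding of rows as states):

```python
def T(w: str) -> int:                      # membership mapping of the example
    return 1 if w in {"", "a", "aa", "aaa"} else 0

def make_acceptor(S, E, sigma):
    row = lambda s: tuple(T(s + e) for e in E)
    q0 = row("")                                      # initial state
    finals = {row(s) for s in S if T(s) == 1}         # accepting rows
    delta = {(row(s), a): row(s + a) for s in S for a in sigma}
    return q0, finals, delta

def accepts(acceptor, w: str) -> bool:
    q, finals, delta = acceptor
    for a in w:
        q = delta[(q, a)]      # total function because the table is closed
    return q in finals

A = make_acceptor(["", "a", "b"], ["", "a", "b"], "ab")
print(accepts(A, "aaaa"))   # True
print(accepts(A, "ab"))     # False
```

Notice that this hypothesis accepts (a^*) rather than just the sampled words {ε, a, aa, aaa}: the table cannot distinguish them yet, and it is precisely a counterexample from the teacher that would trigger further refinement.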
The algorithm starts with (S = E = {\epsilon}). It then loops while the observation table is non-closed or non-consistent, remedying one of these properties with each iteration. Enforcing closure is done by adding elements to S until the condition is met. Enforcing consistency is done by adding elements to E that distinguish elements that were indistinguishable beforehand. Once the table is closed and consistent, the learner submits an acceptor A corresponding to the table as a hypothesis to the teacher. The acceptor is then compared to the teacher; this is done in a finite number of queries using characterization sets [7]. If we find no counterexample, then A is the minimal target DFA and the algorithm terminates. Otherwise, the learner extracts a new distinguishing element from the given counterexample and continues checking for closure and consistency.
Given n the number of states of the minimal DFA equivalent to the model and k the maximum length of a counterexample handed by the teacher, L* terminates in (O(n \cdot k)) [6].
Example run
The following example outlines how the L* algorithm can be used to reverse-engineer the behavior of a black-box program that checks for valid passwords of the form ( ab^*a ). The program is compiled, and its source code is discarded, yet L* is able to reconstruct its behavior using queries and counterexamples from a teacher.
Here is a breakdown of the process:

Initial Setup: The algorithm begins with an observation table where both (S) (the set of prefixes) and (E) (the set of suffixes) are initialized with ( \epsilon ) (the empty string). The teacher provides responses for each row:
ϵ
S: ϵ  0
a  0
b  0
The table is consistent and closed, so the algorithm constructs the DFA below, which recognizes the empty language.
However, upon comparing this to the teacher using the W-method [7], a counterexample is found. This method allows comparing two state machines in finite time by constructing a minimal test suite (we won’t go into too much detail here for the sake of brevity, but curious readers can find out more in [14]). With this method, the teacher identifies the counterexample “aa”.
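The W-method itself builds a finite characterization set; as a rough stand-in for an equivalence query, one can exhaustively compare the hypothesis and the teacher on all words up to a length bound (an approximation we use here purely for illustration):

```python
import re
from itertools import product

def find_counterexample(hypothesis, teacher, alphabet="ab", max_len=5):
    """Return the first word (in length order) where the two machines differ."""
    for k in range(max_len + 1):
        for w in map("".join, product(alphabet, repeat=k)):
            if hypothesis(w) != teacher(w):
                return w
    return None                     # no difference found up to the bound

teacher = lambda w: re.fullmatch("ab*a", w) is not None  # the hidden program
empty_hypothesis = lambda w: False  # first hypothesis: rejects everything

print(find_counterexample(empty_hypothesis, teacher))    # aa
```

On the first hypothesis, which rejects everything, the shortest disagreement is indeed “aa”, matching the counterexample returned by the teacher in this run.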

Counterexample Handling: The counterexample “aa” shows that the DFA is incorrect. To resolve this, the context “a” is added to the set (E), and the table is updated:
ϵ  a
S: ϵ  0  0
a  0  1
b  0  0
Since the table is now non-closed (no row in S matches ( \text{row}(a) = 01 )), “a” is added to (S), and further updates occur:
ϵ  a
S: ϵ  0  0
S: a  0  1
b  0  0
aa  1  0
ab  0  1
Further Refinement: After the table is updated, the row ( \text{row}(aa) = 10 ) still lacks a match, prompting the addition of “aa” to (S), and another update is made:
ϵ  a
S: ϵ  0  0
S: a  0  1
S: aa  1  0
b  0  0
ab  0  1
aaa  0  0
aab  0  0
The table is now closed and consistent, and a DFA is generated:
However, a new counterexample, “aaaaa”, is provided by the teacher.

Final Refinement: The counterexample “aaaaa” leads to the extraction of context “aa,” which is added to (E), and the table is updated:
ϵ  a  aa
S: ϵ  0  0  1
S: a  0  1  0
S: aa  1  0  0
b  0  0  0
ab  0  1  0
aaa  0  0  0
aab  0  0  0
The table is non-closed because ( \text{row}(b) = 000 ) matches no row of (S), so “b” is added to (S), and the final update occurs:
ϵ  a  aa
S: ϵ  0  0  1
S: a  0  1  0
S: aa  1  0  0
S: b  0  0  0
ab  0  1  0
aaa  0  0  0
aab  0  0  0
ba  0  0  0
bb  0  0  0
This final table is both closed and consistent, and the DFA below is constructed.

Final DFA:
After comparing the generated DFA to the teacher, no further counterexamples are found, indicating that the DFA now correctly models the password-checking program. The DFA represents the exact regular expression ( ab^*a ), corresponding to the valid password structure expected by the program.
Through this step-by-step querying and refinement, L* successfully reverse-engineers the behavior of the black-box password checker.
Extension on VPA
The L* algorithm is sufficient for the first class of programs, where the alphabet symbols directly correspond to the input symbols.
However, in the more complex and interesting case of the second class of programs, the alphabet symbols represent statements used by the program. Here, the “output” observed by the algorithm is the state of the program’s stack, meaning the algorithm must track the stack’s behavior during each function call and return.
Since the target programs are deterministic, there exists a bijection between possible inputs and the corresponding sequences of stack states. This allows the same strategy as L* to be employed for learning these programs.
Because the equivalence classes of the languages of VPA can be identified in finite time, learning such machines using this approach becomes feasible [2]. The process involves identifying equivalence classes of sequences of statements that the program can potentially execute, effectively capturing and describing its entire behavior. By grouping sequences that lead to the same outcomes or system states, the algorithm can represent the program’s functionality in a compact and comprehensive manner, allowing for a full abstraction of its operational dynamics.
We won’t dive into this extension of the L* algorithm in this article, as doing so would lead us deep into complex algebraic concepts.
Breaking down something like the power function example mentioned earlier would take an extremely long time, given that such problems are NP-complete [2].
Just keep in mind that these problems can still be solved in finite time, using techniques closely related to the L* algorithm.
The beauty of abstraction
One of the most intriguing aspects of using state machines to model programs is how they abstract away the specific meanings of individual code statements. Rather than concerning themselves with the details of what each command or operation does at a granular level, state machines concentrate on the highlevel behavior of a system. In this approach, the exact functionality of the program’s actions becomes irrelevant. What matters instead is the overall structure, i.e. the flow of control and how the system transitions from one state to another over time.
This abstraction shifts the focus from the intricate semantics of individual statements to the overarching logic governing the program’s execution. By emphasizing the sequence of state transitions rather than the internal workings of specific instructions, state machines provide a powerful framework for understanding the system’s behavior from an algebraic and formal perspective. This allows for a cleaner and more rigorous analysis of the program’s behavior without getting bogged down in the complexities of lowlevel implementation details.
In essence, state machines enable us to model the behavior of software systems in a way that isolates their logical flow, making it easier to reason about how the system responds to different inputs and events. This not only facilitates debugging and optimization but also aids in proving properties about the system, such as correctness, termination, or consistency, by focusing solely on the transition between states rather than the detailed workings of individual operations.
Applications and limitations
As we have seen in the preceding sections, active learning can be employed to construct state machines that model the behavior of computer programs. This technique proves valuable in several scenarios: it can be used to deduce the behavior of a program whose source code is inaccessible, to determine whether two programs written in different languages exhibit identical behavior, and to generate a program equivalent to a given source but with the minimal number of statements necessary to achieve the same functionality.
In real-world applications, active learning is integrated into certain model-checking algorithms to generate models on which system properties can be verified [8], a technique particularly useful for analyzing timed systems [9]. It is also used to extract automata models of trained recurrent neural networks (RNNs) in order to test their properties [10].
It is also widely used for the analysis of communication protocols. For example, active learning has been employed to test the behavior of various implementations of the TLS [11] and SSH [12] protocols. Additionally, it has been used in the testing of IoT communication systems, including a comprehensive analysis of the MQTT protocol in 2017 [13].
On a purely practical level, LearnLib is an open-source Java framework designed for automata learning, providing tools and algorithms to infer models of systems like state machines through active learning. It is widely used for testing, verification, and reverse engineering of protocols and software systems.
There are, however, certain limitations to the contexts in which active learning can be effectively employed.
First, it is essential to know the alphabet of the target language in advance, as active learning algorithms require knowledge of the possible symbols in order to query them. For instance, when the L* algorithm attempts to learn the behavior of a computer program, it must have prior knowledge of all the statements that the program could potentially use.
Another limitation is the need to correctly bound the maximal complexity of the target system. In order to compute equivalence between the generated state machine and the actual program, the complexity (such as the number of states in the target machine) must be appropriately constrained. For example, to learn a state machine A, L* must be able to determine an upper bound on the number of states in the minimal machine equivalent to A. Without such bounds, the learning process becomes impossible.
The processing time for membership queries can also be considered a significant limitation of this method. The impact of an equivalence query is directly tied to the complexity of the teacher, which, by definition, is not computable since the teacher is a black box. In practice, such queries can be highly time-consuming or even cost money. Consequently, the complexity of different solutions is evaluated based on the number of membership queries required to learn a teacher with n states. The underlying philosophy is that, since the cost of a query cannot be predicted, efforts are made to minimize the number of queries on average.
It is important to emphasize that these methods are still in their early stages of development. The primary focus is currently on identifying algorithms that reliably terminate in finite time. Should these solutions be adopted on an industrial scale, the issue of time complexity in their implementation will likely become a significant concern.
References
[1] A Random Testing Approach Using Pushdown Automata, Dreyfus, Héam, Kouchnarenko, Masson, 2014
[2] Learning Deterministic Visibly Pushdown Automata Under Accessible Stack, Michaliszyn, Otop, 2022
[3] Linear Automaton Transformations, Nerode, 1958
[4] A Note on the Number of Queries Needed to Identify Regular Languages, Angluin, 1981
[5] Towards an Algebraic Theory of Context-Free Languages, Berstel, Boasson, 1996
[6] Learning Regular Sets from Queries and Counterexamples, Angluin, 1987
[7] Testing Software Design Modeled by Finite-State Machines, Chow, 1978
[8] Black Box Checking, Peled, Vardi, Yannakakis, 1999
[9] Learning Assumptions for Compositional Verification of Timed Systems, Lin, Andre, Liu, Sun, Song Dong, 2023
[10] Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples, Weiss, Goldberg, Yahav, 2022
[11] A Tale of the OpenSSL State Machine: A Large-Scale Black-Box Analysis, de Ruiter, 2015
[12] Inferring SSH State Machines Using Protocol State Fuzzing, Verleg, 2016
[13] Model-Based Testing IoT Communication via Active Automata Learning, Aichernig, Bloem, Tappler, 2017
[14] Nominal Techniques and Black Box Testing for Automata Learning, Moerman, 2019