Black-box Testing

Certain types of software are inherently opaque, meaning that it can be difficult or, in some cases, impossible to access the complete source code of the program being utilized. When this occurs, the system is often described as functioning like a “black box”.

This lack of transparency introduces challenges in ensuring the software’s reliability and accurately predicting its outcomes, which may lead to significant safety concerns in critical applications. In this article, we will explore a machine learning technique known as active learning. It enables the reconstruction of the full behavior of a black-box system by analyzing a finite set of input-output pairs.

This approach also facilitates the assessment of “equivalence” between two software systems, even when they are implemented in different programming languages, through the abstraction of their behavior into state machines.

To fully understand this article, a background in graph theory, automata theory, and machine learning is recommended.

Different categories of programs
Interesting properties of regular languages
The L* algorithm
Example run
Extension on VPA
The beauty of abstraction
Applications and limitations

Different categories of programs

In order to adapt our approaches effectively, we need to distinguish between three categories of programs:

1. DFA

The first category includes programs that behave like regular expressions, such as a binary that checks a password and returns 1 if the password is correct, and 0 otherwise. These programs can be effectively modeled using simple Deterministic Finite Automata (DFA), as the operations performed on the program stack during execution, as well as function calls, are irrelevant for capturing their behavior.

Example:

A function like is_even() can modeled as the DFA below.

1
2
3
4
5
6


int is_even(unsigned int n){
if (n % 2 == 0)
  return 1;
else
  return 0;
}

DFA

It is then plain to see that such function behaves as the regular expression:

[E = -?[0-9]^*[0|2|4|6|8]^+]

Finding such DFA (or equivalently its regular expression) is sufficient to indentify the program’s whole behavior .

2. VPA

The second category comprises independent deterministic programs with near-isomorphism, which are programs that do not interact with the environment and exhibit a finite, predictable set of behaviors for each possible input. These programs can be modeled as Visibly Pushdown Automata (VPA) [1].

Such automaton recognizes the set of sequences of statements that are valid runs of the program. A word w belongs to the language of the minimal VPA of a program if and only if it describes the sequences of statements performed by the latter for some input.

In this case, the automaton’s stack corresponds to the function call stack used by the actual program. The automaton is termed “visibly” pushdown because the operations of pushing and popping to the stack are constrained by specific symbols, such as function calls and returns [2]. As a result, the stack’s behavior depends solely on the input, significantly simplifying the process of comparing two VPAs.

Example:

A function like power can be modeled as the VPA below. The transitions in green (resp. red) correspond to pushing (resp. poping) operations on the machine’s stack.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


int power(float x, int n) {
int res;
if (n == 0) {
  return 1;
} else {
  res = power(x, n / 2);
  if (n % 2 == 0) {
    return res * res;
  } else {
    return res * res * x;
  }
}
}

VPA

A run of the automaton corresponds to a sequence of statement, and is accepted if and only if the said sequence can be performed by the program for some input. For example

[0 → 1 → 5 → 0 → 1 → 5 → 0 → 1 → 2 → 4 → 6 → 9 → 10 → 6 → 7 → 8]

is the sequence of statements performed by (power(3,2)) and is therefore accepted.

Now, let the following symbols represent specific program instructions or conditions:

a = int res;
b = n == 0
c = n != 0
d = call(power)
e = return 1;
f = return (power)
g = n % 2 == 0
h = n % 2 != 0
i = return res * res;
j = return res * res * x;

Then the VPA above describes the nonregular expression:

[E = [(acd)^k abef (gif + hjf )^{k−1}(gi + hj)] + abe, k \geq 1]

We then fall into the same situation as with our DFAs, where finding this expression is sufficient to describe the entire program’s behavior.

3. Turing Machines

The third and final category encompasses general programs that interact with the environment, exhibit truly random behavior, lack observable outputs, or have unbounded complexity.

These programs can only be modeled by Turing machines, and as such, they fall outside the scope of active learning algorithms as understood by the current state of the art.

Consequently, attempting to learn the behavior of complex systems like the Linux kernel or large-scale software such as The Elder Scrolls V: Skyrim through active learning methods is entirely impractical.

Interesting properties of regular languages

Now that we understand that programs can be represented as state machines (or equivalently regular expressions), we can turn our attention to the challenge of extracting these representations from a black-box system. To address this problem, it is essential to examine some key properties of regular languages.

Given two words ((w_{1}, w_{2})) of a language L, we define a distinguishing extension as a word c such that:

[ w_1 \bullet c \in L \neq w_2 \bullet c \in L ]

We then define the Myhill-Nerode relationship (\sim_L) as follows:

[ \forall (w_1, w_2) \in L^2, w_1 \sim_L w_2 \iff \nexists c \in \Sigma^*, w_1 \bullet c \in L \neq w_2 \bullet c \in L ]

The Myhill-Nerode theorem [3] shows that this relationship partitions L into equivalence classes and that L is regular if and only if this partition is of finite order.

These properties enable regular languages, which may have infinite cardinality, to be described using a finite set of elements by identifying equivalence classes based on the Myhill-Nerode theorem [4]. This approach is sufficient for programs that behave like DFA, as their associated languages are regular by definition. Moreover, an extension of this theorem exists for VPA, even though they can represent context-free languages [5].

The goal of active learning algorithms is to identify the distinct partitions of the target language by querying inputs to the observed program and analyzing the corresponding outputs. Through this process, it becomes possible to construct a state machine that accurately captures any program’s behavior in finite time, even if the latter can perform an infinite variety of instruction sequences.

The L* algorithm

The L* algorithm was one of the earliest active learning algorithms developed [6] and is capable of learning a DFA by querying an information source known as the teacher. In our context, the teacher could be represented by a binary executable, with the L* algorithm querying it by providing various inputs to observe its corresponding outputs. Through this interaction, L* gradually constructs a model of the DFA that describes the program.

In order to classify the information given by the teacher, L* maintains an observation table. An observation table over an alphabet (\Sigma) is a tuple ((S, E, T)) with S a finite prefix-closed set of words over (\Sigma), E a finite suffix-closed set of words over (\Sigma) and T a mapping function from (((S \cup S.\Sigma).E)) to ({ 0, 1}). A set of words X is said to be prefix-closed (resp. suffix-closed) iff all prefixes (resp. suffixes) of all elements of X are also in X.

For any (s \in (S \cup S.\Sigma)) we define row(s) as the function (f:e \rightarrow T(s.e)).

For example, with (S = E = { \epsilon, a, b}) and (T) such that (T(x) = 1) for (x \in { \epsilon, a, aa, aaa}) and (T(x) = 0) otherwise, we have the following observation table:

	ϵ	a
S: ϵ	1	1
S: a	1	1
S: b	0	0
aa	1	1
ab	0	0
ba	0	0
bb	0	0

Instinctively, a cell tells us whether the word formed by the concatenation of the row label ((s \in S)) and the column label (( e \in E)) is a word of the language.

By constructing S, the learner identifies the different equivalence classes of the target language in regards to the Myhill-Nerode relationship. E then represents the set of distinguishing extensions. In a similar manner, it identifies the transition functions by constructing T.

An observation table is said to be closed iff for every (t \in S.\Sigma) there exists a (s \in S) such that (row(t) = row(s)). An observation table is said to be consistent if and only if:

[ \forall (s_1, s_2) \in S^2, row(s_1) = row(s_2), \forall a \in \Sigma, row(s_1.a) = row(s_2.a) ]

Given a closed, consistent observation table ((S, E, T)) we define an acceptor (M((S, E, T))) as a tuple ((Q, q_0, F, \delta)) with (Q = { row(s), s \in S}) a set of states, (q_0) the unique initial state, (F) the subset of (Q) of final states and (\delta) the transition function such that (\delta(row(s), a) = row(s + a)). (M((S, E, T))) is a DFA by construction and has exactly (|S|) states.

The algorithm starts with (S = E = {\epsilon}). It will then loop while the observation table is non-closed or non-consistent and will try to remedy one of these properties with each iteration. Enforcing closure is done by adding elements to S until the condition is met. Enforcing consistency is done by adding elements to E that distinguishes elements that were indistinguishable beforehand. Once the table is closed and consistent, the learner will submit an acceptor A corresponding to the table as an hypothesis to the teacher. The acceptor is then compared to the teacher, this is done in a finite number of queries using characterization sets [7]. If we find no counterexample, then A is the minimal target DFA and the algorithm terminates. Otherwise, the learner will extract a new distinguishing element from the given counterexample and will continue checking for closure and consistency.

Given n the number of states of the minimal DFA equivalent to the model and k the maximum length of a counter-exemple handed by the teacher, L* terminates in (O(n*k)) [6].

Example run

The following example outlines how the L* algorithm can be used to reverse-engineer the behavior of a black-box program that checks for valid passwords in the form of ( ab^*a ). The program is compiled, and its source code is discarded, yet L* is able to reconstruct its behavior using queries and counter-examples from a teacher.

Here is a breakdown of the process:

Initial Setup: The algorithm begins with an observation table where both (S) (the set of prefixes) and (E) (the set of suffixes) are initialized with ( \epsilon ) (the empty string). The teacher provides responses for each row:

ϵ

S: ϵ 0

a 0

b 0

The table is consistent and closed, so the algorithm constructs the DFA below that recognizes the empty language.

However, upon comparing this to the teacher using w-method [7], a counter-example is found. Such method allows comparing two state machines in finite time by constructing a minimal test suite (We won’t go into too much detail here for the sake of brevity, but curious readers can find out more at [14]). With such method, the teacher identifies the counter-example “aa”.
Counter-Example Handling: The counter-example “aa” suggests that the DFA is incorrect. To resolve this, the context “a” is added to the set (E), and the table is updated:

ϵ a

S: ϵ 0 0

a 0 1

b 0 0

Since the table is now non-closed (no row matches ( \text{row}(a) = 01 )), “a” is added to (S), and further updates occur:

ϵ a

S: ϵ 0 0

S: a 0 1

b 0 0

aa 1 0

ab 0 1
Further Refinement: After the table is updated, the row ( \text{row}(aa) = 10 ) still lacks a match, prompting the addition of “aa” to (S), and another update is made:

ϵ a

S: ϵ 0 0

S: a 0 1

S: aa 1 0

b 0 0

ab 0 1

aaa 0 0

aab 0 0

The table is now closed and consistent, and a DFA is generated:

However, a new counter-example, “aaaaa”, is provided by the teacher.
Final Refinement: The counter-example “aaaaa” leads to the extraction of context “aa,” which is added to (E), and the table is updated:

ϵ a aa

S: ϵ 0 0 1

S: a 0 1 0

S: aa 1 0 0

b 0 0 0

ab 0 1 0

aaa 0 0 0

aab 0 0 0

The table is non-closed because ( \text{row}(b) = 000 ), so “b” is added to (S), and the final update occurs:

ϵ a aa

S: ϵ 0 0 1

S: a 0 1 0

S: aa 1 0 0

S: b 0 0 0

ab 0 1 0

aaa 0 0 0

aab 0 0 0

ba 0 0 0

bb 0 0 0

This final table is both consistent and closed, and the DFA below is constructed.
Final DFA:

After comparing the generated DFA to the teacher, no further counter-examples are found, indicating that the DFA now correctly models the password-checking program. The DFA represents the exact regular expression ( ab^*a ), corresponding to the valid password structure expected by the program.

	ϵ
S: ϵ	0
a	0
b	0

	ϵ	a
S: ϵ	0	0
a	0	1
b	0	0

	ϵ	a
S: ϵ	0	0
S: a	0	1
b	0	0
aa	1	0
ab	0	1

	ϵ	a
S: ϵ	0	0
S: a	0	1
S: aa	1	0
b	0	0
ab	0	1
aaa	0	0
aab	0	0

	ϵ	a	aa
S: ϵ	0	0	1
S: a	0	1	0
S: aa	1	0	0
b	0	0	0
ab	0	1	0
aaa	0	0	0
aab	0	0	0

	ϵ	a	aa
S: ϵ	0	0	1
S: a	0	1	0
S: aa	1	0	0
S: b	0	0	0
ab	0	1	0
aaa	0	0	0
aab	0	0	0
ba	0	0	0
bb	0	0	0

Through this step-by-step querying and refinement, L* successfully reverse-engineers the behavior of the black-box password checker.

Extension on VPA

The L* algorithm is sufficient for the first class of programs, where the alphabet symbols directly correspond to the input symbols.

However, in the more complex and interesting case of the second class of programs, the alphabet symbols represent statements used by the program. Here, the “output” observed by the algorithm is the state of the program’s stack, meaning the algorithm must track the stack’s behavior during each function call and return.

Since the target programs are deterministic, there exists a bijection between possible inputs and their corresponding stack states sequence. This allows the same strategy as L* to be employed for learning these programs.

Because the equivalence classes of the languages of VPA can be identified in finite time, learning such machines using this approach becomes feasible [2]. The process involves identifying equivalence classes of sequences of statements that the program can potentially execute, effectively capturing and describing its entire behavior. By grouping sequences that lead to the same outcomes or system states, the algorithm can represent the program’s functionality in a compact and comprehensive manner, allowing for a full abstraction of its operational dynamics.

We won’t dive into this extension of the L* algorithm in this article, as doing so would lead us deep into complex algebraic concepts. Breaking down something like the power function example mentioned earlier would take an extremely long time, given that such problems are NP-complete [2]. Just keep in mind that these problems can still be solved in finite time, using techniques closely related to the L* algorithm.

The beauty of abstraction

One of the most intriguing aspects of using state machines to model programs is how they abstract away the specific meanings of individual code statements. Rather than concerning themselves with the details of what each command or operation does at a granular level, state machines concentrate on the high-level behavior of a system. In this approach, the exact functionality of the program’s actions becomes irrelevant. What matters instead is the overall structure, i.e. the flow of control and how the system transitions from one state to another over time.

This abstraction shifts the focus from the intricate semantics of individual statements to the overarching logic governing the program’s execution. By emphasizing the sequence of state transitions rather than the internal workings of specific instructions, state machines provide a powerful framework for understanding the system’s behavior from an algebraic and formal perspective. This allows for a cleaner and more rigorous analysis of the program’s behavior without getting bogged down in the complexities of low-level implementation details.

In essence, state machines enable us to model the behavior of software systems in a way that isolates their logical flow, making it easier to reason about how the system responds to different inputs and events. This not only facilitates debugging and optimization but also aids in proving properties about the system, such as correctness, termination, or consistency, by focusing solely on the transition between states rather than the detailed workings of individual operations.

Applications and limitations

As we have seen in the preceding sections, active learning can be employed to construct state machines that model the behavior of computer programs. This technique proves valuable in several scenarios: it can be used to deduce the behavior of a program whose source code is inaccessible, to determine whether two programs written in different languages exhibit identical behavior, and to generate a program equivalent to a given source but with the minimal number of statements necessary to achieve the same functionality.

In real-world applications, active learning is integrated into certain model-checking algorithms to generate models on which system properties can be verified [8], a technique particularly useful for analyzing timed systems [9]. It is also used to extract automata models of trained recurrent neural network (RNN) in order to test their properties [10].

It is also widely used for the analysis of communication protocols. For example, active learning has been employed to test the behavior of various implementations of the TLS [11] and SSH [12] protocols. Additionally, it has been used in the testing of IoT communication systems, including a comprehensive analysis of the MQTT protocol in 2017 [13].

On a purely practical level, LearnLib is an open-source Java framework designed for automata learning, providing tools and algorithms to infer models of systems like state machines through active learning. It is widely used for testing, verification, and reverse engineering of protocols and software systems.

There are, however, certain limitations to the contexts in which active learning can be effectively employed.

First, it is essential to know the alphabet of the target language in advance, as active learning algorithms require knowledge of the possible symbols in order to query them. For instance, when the L* algorithm attempts to learn the behavior of a computer program, it must have prior knowledge of all the statements that the program could potentially use.

Another limitation is the need to correctly bound the maximal complexity of the target system. In order to compute equivalence between the generated state machine and the actual program, the complexity (such as the number of states in the target machine) must be appropriately constrained. For example, to learn a state machine A, L* must be able to determine an upper bound on the number of states in the minimal machine equivalent to A. Without such bounds, the learning process becomes impossible.

The processing time for membership queries can also be considered a significant limitation of this method. The impact of an equivalence query is directly tied to the complexity of the teacher, which, by definition, is not computable since the teacher is a black box. In practice, such queries can be highly time-consuming or even cost money. Consequently, the complexity of different solutions is evaluated based on the number of membership queries required to learn a teacher with n states. The underlying philosophy is that, since the cost of a query cannot be predicted, efforts are made to minimize the number of queries on average.

It is important to emphasize that these methods are still in their early stages of development. The primary focus is currently on identifying algorithms that reliably terminate in finite time. Should these solutions be adopted on an industrial scale, the issue of time complexity in their implementation will likely become a significant concern.