This means that those useful abstractions are not present when
it comes time to analyze an executable program, such as mal-
ware, without access to its source code.
Traditionally, security practitioners would reverse engineer
executables by using a disassembler to represent the semantics
of the program as assembly code. While better than nothing,
assembly code is still far from readable. Decompilers fill this
gap by analyzing an executable’s behavior and attempting to
recover a plausible source code representation of the behavior.
Despite a great deal of work, decompilation is a notoriously
difficult problem and even state-of-the-art decompilers emit
source code that is a mere shell of its former self [18, 26, 33, 38, 39]. Even so, decompilers remain among the most popular tools used by reverse engineers.
Figure 1 shows an example of a decompiled function and
its original source code definition. Although the decompiled
code is C source code,1 it is arguably quite different from the
original. We say that decompiled C code is not idiomatic; that
is, though it is grammatically legal C code, it does not use
common conventions for ensuring that source code is readable.
Further, as Figure 1 also illustrates, decompiler output may
be incorrect; that is, it may be semantically nonequivalent to
the code in its executable form. We collectively call these
readability and correctness issues fidelity issues because they
do not faithfully represent the software as intended by its
authors. (See Section 3.1 for a discussion of fidelity).
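Figure 1 itself is not reproduced here, but the kind of divergence described above can be sketched with a hypothetical pair of functions (the names `sum_positive` and `sub_401000` are illustrative inventions, not taken from the paper's Figure 1):

```c
/* Hypothetical original function: idiomatic C, with a meaningful name
 * and a structured for loop. */
int sum_positive(const int *vals, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (vals[i] > 0)
            sum += vals[i];
    }
    return sum;
}

/* The same behavior as a decompiler might plausibly render it:
 * generic identifiers, raw pointer arithmetic, and a goto in place
 * of the structured loop. Semantically equivalent in this sketch,
 * but far less readable -- i.e., non-idiomatic. */
int sub_401000(const int *a1, int a2) {
    int v1 = 0;          /* accumulated sum */
    const int *v2 = a1;  /* element cursor */
LABEL_2:
    if (v2 >= a1 + a2)
        return v1;
    if (*v2 > 0)
        v1 += *v2;
    ++v2;
    goto LABEL_2;
}
```

Both functions compute the same result; the second merely lacks the naming and control-flow abstractions that make the first easy to comprehend.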
Fidelity issues are problematic because decompiled code
is usually created to be manually read by reverse engineers.
Reverse engineering is a painstaking process which involves
much time spent rebuilding high-level program design as the
reverse engineer develops an understanding of what the executable binary does [36]. Code that is more faithful to the
original source contains more of the abstractions designed to
assist with human comprehension of code. Thus, the fidelity
of decompiled code to the original source matters, as it can sig-
nificantly impact reverse engineers’ productivity. Evaluating
the products of decompilation based on fidelity to the original
source is common in existing work [9, 11, 15, 21, 22, 24].
Improving the functionality and usability of decompilers
has long been an active research area, with many contemporary efforts [7, 13, 14, 30, 37]. A recent trend in this direction is using statistical methods such as deep learning-based techniques to improve the process of decompilation [12, 15, 19, 21, 32, 42], or augment the output of traditional decompilers [2, 4, 11, 24, 31]. The latter strands of
work have the potential benefit of building on top of ma-
ture tools like Hex-Rays and Ghidra instead of operating on
binaries, and have already seen promising results for recov-
ering missing variable names and types. Here, researchers
have been developing models that learn to suggest meaning-
ful information in a given context with high accuracy, after
seeing many examples of original source code drawn from open-source repositories like the ones hosted on GitHub.
1 Decompiled code is not always syntactically correct C code.
However, while variable names and types are certainly im-
portant for program comprehension, including in a reverse
engineering context [7, 36, 40], there are many more fidelity
issues in decompiled code, and there is relatively little knowl-
edge of what they are, how they vary across decompilers, and
what the implications are for learning-based approaches aim-
ing to improve the fidelity of decompiled code to the original
source.
We argue that before designing more advanced solutions,
we first need a deeper understanding of the problem. Conse-
quently, in this paper we set out with the Research Goal of
developing a comprehensive taxonomy of fidelity issues in
decompiled code. Concretely, we start by curating a sample
of open-source functions decompiled with the Hex-Rays,2 Ghidra,3 retdec,4 and angr5 [35] decompilers. Next, we use thematic analysis, a qualitative research method for systematically identifying, organizing, and offering insights into patterns of meaning (themes) across a dataset [6], to analyze the decompiled functions for fidelity defects, using those functions'
original source code as an oracle. To minimize subjectivity,
we develop a novel abstraction for determining correspon-
dence between code pairs, which we call alignment. Using
this abstraction, we define fidelity defects in decompiled code,
creating a taxonomy consisting of 15 top-level issue categories, 52 categories in total. We then use our taxonomy to suggest
how the issues could be addressed, framing our discussion
around the role that deterministic static analysis and learning-
based approaches could play.
In this study, we focus primarily on decompiled code from
C-language binaries; that is, those that were built from C-
language source. C is a common source language for malware
and other binaries targeted by reverse engineering efforts. Fur-
ther, most decompilers of machine-code executables generate
output in terms of C pseudocode regardless of the language
in which the source code for the executable was written. It
is unclear what it would mean for a C decompiler to cor-
rectly decompile a Golang executable into C, for example,
since Golang contains concepts and features without a direct
equivalent in C.
We also examine Java decompilation. However, we find
that decompiled Java code is of very high fidelity and thus
there are few fidelity issues to classify.
Our results are robust both across different researchers and across the four decompilers we considered.
In summary, we make the following contributions:
• A comprehensive, hierarchical taxonomy of fidelity issues in decompiled code beyond names and types.
• 235 coded decompiled/original function pairs, identify-
2 https://hex-rays.com/decompiler/
3 https://ghidra-sre.org/
4 https://github.com/avast/retdec
5 https://angr.io/