Decompiler
Encyclopedia
A decompiler is the name given to a computer program
Computer program
A computer program is a sequence of instructions written to perform a specified task with a computer. A computer requires programs to function, typically executing the program's instructions in a central processor. The program has an executable form that the computer can use directly to execute...

 that performs, as far as possible, the reverse operation to that of a compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...

. That is, it translates a file containing information at a relatively low level of abstraction (usually designed to be computer readable rather than human readable) into a form having a higher level of abstraction (usually designed to be human readable). The decompiler does not reconstruct the original source code, and its output is far less intelligible to a human than original source code.

Introduction

The term decompiler is most commonly applied to a program which translates executable
Executable
In computing, an executable file causes a computer "to perform indicated tasks according to encoded instructions," as opposed to a data file that must be parsed by a program to be meaningful. These instructions are traditionally machine code instructions for a physical CPU...

 programs (the output from a compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...

) into source code
Source code
In computer science, source code is text written using the format and syntax of the programming language that it is being written in. Such a language is specially designed to facilitate the work of computer programmers, who specify the actions to be performed by a computer mostly by writing source...

 in a (relatively) high level language which, when compiled, will produce an executable whose behavior is the same as the original executable program. By comparison, a disassembler
Disassembler
A disassembler is a computer program that translates machine language into assembly language—the inverse operation to that of an assembler. A disassembler differs from a decompiler, which targets a high-level language rather than an assembly language...

 translates an executable program into assembly language
Assembly language
An assembly language is a low-level programming language for computers, microprocessors, microcontrollers, and other programmable devices. It implements a symbolic representation of the machine codes and other constants needed to program a given CPU architecture...

 (and an assembler could be used to assemble it back into an executable program).

Decompilation is the act of using a decompiler, although the term can also refer to the output of a decompiler. It can be used for the recovery of lost source code, and is also useful in some cases for computer security
Computer security
Computer security is a branch of computer technology known as information security as applied to computers and networks. The objective of computer security includes protection of information and property from theft, corruption, or natural disaster, while allowing the information and property to...

, interoperability
Interoperability
Interoperability is a property referring to the ability of diverse systems and organizations to work together . The term is often used in a technical systems engineering sense, or alternatively in a broad sense, taking into account social, political, and organizational factors that impact system to...

 and error correction. The success of decompilation depends on the amount of information present in the code being decompiled and the sophistication of the analysis performed on it. The bytecode formats used by many virtual machines (such as the Java Virtual Machine
Java Virtual Machine
A Java virtual machine is a virtual machine capable of executing Java bytecode. It is the code execution component of the Java software platform. Sun Microsystems stated that there are over 4.5 billion JVM-enabled devices.-Overview:...

 or the .NET Framework
.NET Framework
The .NET Framework is a software framework that runs primarily on Microsoft Windows. It includes a large library and supports several programming languages which allows language interoperability...

 Common Language Runtime
Common Language Runtime
The Common Language Runtime is the virtual machine component of Microsoft's .NET framework and is responsible for managing the execution of .NET programs. In a process known as just-in-time compilation, the CLR compiles the intermediate language code known as CIL into the machine instructions...

) often include extensive metadata
Metadata
The term metadata is an ambiguous term which is used for two fundamentally different concepts . Although the expression "data about data" is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at...

 and high-level features that make decompilation quite feasible. The presence of debug data
Debugger
A debugger or debugging tool is a computer program that is used to test and debug other programs . The code to be examined might alternatively be running on an instruction set simulator , a technique that allows great power in its ability to halt when specific conditions are encountered but which...

 can make it possible to reproduce the original variable and structure names and even the line numbers. Machine language without such metadata or debug data is much harder to decompile.

Some compilers and post-compilation tools produce obfuscated code
Obfuscated code
Obfuscated code is source or machine code that has been made difficult to understand for humans. Programmers may deliberately obfuscate code to conceal its purpose or its logic to prevent tampering, deter reverse engineering, or as a puzzle or recreational challenge for someone reading the source...

 (that is, they attempt to produce output that is very difficult to decompile). This is done to make it more difficult to reverse engineer the executable.

Design

Decompilers can be thought of as composed of a series of phases each of which contributes specific aspects of the overall decompilation process.

Loader

The first decompilation phase loads and parses the input machine code or intermediate language
Intermediate language
In computer science, an intermediate language is the language of an abstract machine designed to aid in the analysis of computer programs. The term comes from their use in compilers, where a compiler first translates the source code of a program into a form more suitable for code-improving...

 program's binary file format. It should be able to discover basic facts about the input program, such as the architecture (Pentium, PowerPC, etc), and the entry point. In many cases, it should be able to find the equivalent of the main function of a C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

 program, which is the start of the user written code. This excludes the runtime initialization code, which should not be decompiled if possible. If available the symbol tables and debug data are also loaded. The front end may be able to identify the libraries used even if they are linked with the code, this will provide library interfaces. If it can determine the compiler or compilers used it may provide useful information in identifying code idioms.

Disassembly

The next logical phase is the disassembly of machine code instructions into a machine independent intermediate representation (IR). For example, the Pentium machine instruction
mov eax, [ebx+0x04]
might be translated to the IR
eax := m[ebx+4];

Idioms

Idiomatic machine code sequences are sequences of code whose combined semantics is not immediately apparent from the instructions' individual semantics. Either as part of the disassembly phase, or as part of later analyses, these idiomatic sequences need to be translated into known equivalent IR. For example, the x86 assembly code
X86 assembly language
x86 assembly language is a family of backward-compatible assembly languages, which provide some level of compatibility all the way back to the Intel 8008. x86 assembly languages are used to produce object code for the x86 class of processors, which includes Intel's Core series and AMD's Phenom and...

:
cdq eax ; edx is set to the sign-extension of eax
xor eax, edx
sub eax, edx
could be translated to
eax := abs(eax);

Some idiomatic sequences are machine independent; some involve only one instruction. For example, xor eax, eax clears the eax register (sets it to zero). This can be implemented with a machine independent simplification rule, such as a xor a = 0.

In general, it is best to delay detection of idiomatic sequences if possible, to later stages that are less affected by instruction ordering. For example, the instruction scheduling phase of a compiler may insert other instructions into an idiomatic sequence, or change the ordering of instructions in the sequence. A pattern matching process in the disassembly phase would probably not recognize the altered pattern. Later phases group instruction expressions into more complex expressions, and modify them into a canonical (standardized) form, making it more likely that even the altered idiom will match a higher level pattern later in the decompilation.

It is particularly important to recognize the compiler idioms for subroutine
Subroutine
In computer science, a subroutine is a portion of code within a larger program that performs a specific task and is relatively independent of the remaining code....

 calls, exception handling
Exception handling
Exception handling is a programming language construct or computer hardware mechanism designed to handle the occurrence of exceptions, special conditions that change the normal flow of program execution....

, and switch statement
Switch statement
In computer programming, a switch, case, select or inspect statement is a type of selection control mechanism that exists in most imperative programming languages such as Pascal, Ada, C/C++, C#, Java, and so on. It is also included in several other types of languages...

s. Some languages also have extensive support for string
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

s or long integers.

Program analysis

Various program analyses can be applied to the IR. In particular, expression propagation combines the semantics of several instructions into more complex expressions. For example,
mov eax,[ebx+0x04]
add eax,[ebx+0x08]
sub [ebx+0x0C],eax
could result in the following IR after expression propagation:
m[ebx+12] := m[ebx+12] - (m[ebx+4] + m[ebx+8]);
The resulting expression is more like high level language, and has also eliminated the use of the machine register eax . Later analyses may eliminate the ebx register.

Data flow analysis

The places where register contents are defined and used must be traced using data flow analysis. The same analysis can be applied to locations that are used for temporaries and local data. A different name can then be formed for each such connected set of value definitions and uses. It is possible that the same local variable location was used for more than one variable in different parts of the original program. Even worse it is possible for the data flow analysis to identify a path whereby a value may flow between two such uses even though it would never actually happen or matter in reality. This may in bad cases lead to needing to define a location as a union of types. The decompiler may allow the user to explicitly break such unnatural dependencies which will lead to clearer code. This of course means a variable is potentially used without being initialized and so indicates a problem in the original program.

Type analysis

A good machine code decompiler will perform type analysis. Here, the way registers or memory locations are used result in constraints on the possible type of the location. For example, an and instruction implies that the operand is an integer; programs do not use such an operation on floating point
Floating point
In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16...

 values (except in special library code) or on pointers. An add instruction results in three constraints, since the operands may be both integer, or one integer and one pointer (with integer and pointer results respectively; the third constraint comes from the ordering of the two operands when the types are different).

Various high level expressions can be recognized which trigger recognition of structures or arrays. However, it is difficult to distinguish many of the possibilities, because of the freedom that machine code or even some high level languages such as C allow with casts and pointer arithmetic.

The example from the previous section could result in the following high level code:

struct T1 *ebx;
struct T1 {
int v0004;
int v0008;
int v000C;
};
ebx->v000C -= ebx->v0004 + ebx->v0008;

Structuring

The penultimate decompilation phase involves structuring of the IR into higher level constructs such as while loops and if/then/else conditional statements. For example, the machine code
xor eax, eax
l0002:
or ebx, ebx
jge l0003
add eax,[ebx]
mov ebx,[ebx+0x4]
jmp l0002
l0003:
mov [0x10040000],eax

could be translated into:

eax = 0;
while (ebx < 0) {
eax += ebx->v0000;
ebx = ebx->v0004;
}
v10040000 = eax;

Unstructured code is more difficult to translate into structured code than already structured code. Solutions include replicating some code, or adding boolean variables.

Code generation

The final phase is the generation of the high level code in the back end of the decompiler. Just as a compiler may have several back ends for generating machine code for different architectures, a decompiler may have several back ends for generating high level code in different high level languages.

Just before code generation, it may be desirable to allow an interactive editing of the IR, perhaps using some form of graphical user interface
Graphical user interface
In computing, a graphical user interface is a type of user interface that allows users to interact with electronic devices with images rather than text commands. GUIs can be used in computers, hand-held devices such as MP3 players, portable media players or gaming devices, household appliances and...

. This would allow the user to enter comments, and non-generic variable and function names. However, these are almost as easily entered in a post decompilation edit. The user may want to change structural aspects, such as converting a while loop to a for loop. These are less readily modified with a simple text editor, although source code refactoring tools may assist with this process. The user may need to enter information that failed to be identified during the type analysis phase, e.g. modifying a memory expression to an array or structure expression. Finally, incorrect IR may need to be corrected, or changes made to cause the output code to be more readable.

Legality

The majority of computer programs are covered by copyright
Copyright
Copyright is a legal concept, enacted by most governments, giving the creator of an original work exclusive rights to it, usually for a limited time...

 laws. Although the precise scope of what is covered by copyright differs from region to region, copyright law generally provides the author (the programmer(s) or employer) with a collection of exclusive rights to the program. These rights include the right to make copies, including copies made into the computer's RAM.
Since the decompilation process involves making multiple such copies, it is generally prohibited without the authorization of the copyright holder. However, because decompilation is often a necessary step in achieving software interoperability
Interoperability
Interoperability is a property referring to the ability of diverse systems and organizations to work together . The term is often used in a technical systems engineering sense, or alternatively in a broad sense, taking into account social, political, and organizational factors that impact system to...

, copyright laws in both the United States and Europe permit decompilation to a limited extent.

In the United States, the copyright fair use
Fair use
Fair use is a limitation and exception to the exclusive right granted by copyright law to the author of a creative work. In United States copyright law, fair use is a doctrine that permits limited use of copyrighted material without acquiring permission from the rights holders...

 defense has been successfully invoked in decompilation cases. For example, in Sega v. Accolade
Sega v. Accolade
Sega Enterprises Ltd. v. Accolade, Inc., 977 F.2d 1510 , is a significant case in American intellectual property law. The case involved several overlapping issues, including the scope of copyright, permissible uses for trademarks, and the scope of fair use for computer code.This case had two...

, the court held that Accolade could lawfully engage in decompilation in order to circumvent the software locking mechanism used by Sega's game consoles.

In Europe, the 1991 Software Directive explicitly provides for a right to decompile in order to achieve interoperability. The result of a heated debate between, on the one side, software protectionists, and, on the other, academics as well as independent software developers, Article 6 permits decompilation only if a number of conditions are met:
  • First, a person or entity must have a license
    Software license agreement
    A software license agreement is a contract between the "licensor" and purchaser of the right to use software. The license may define ways under which the copy can be used, in addition to the automatic rights of the buyer including the first sale doctrine and .Many form contracts are only contained...

     to use the program to be decompiled.
  • Second, decompilation must be necessary to achieve interoperability
    Interoperability
    Interoperability is a property referring to the ability of diverse systems and organizations to work together . The term is often used in a technical systems engineering sense, or alternatively in a broad sense, taking into account social, political, and organizational factors that impact system to...

     with the target program or other programs. Interoperability information should therefore not be readily available, such as through manuals or API
    Application programming interface
    An application programming interface is a source code based specification intended to be used as an interface by software components to communicate with each other...

     documentation. This is an important limitation. The necessity must be proven by the decompiler. The purpose of this important limitation is primarily to provide an incentive for developers to document and disclose their products' interoperability information.
  • Third, the decompilation process must, if possible, be confined to the parts of the target program relevant to interoperability. Since one of the purposes of decompilation is to gain an understanding of the program structure, this third limitation may be difficult to meet. Again, the burden of proof is on the decompiler.


In addition, Article 6 prescribes that the information obtained through decompilation may not be used for other purposes and that it may not be given to others.

Overall, the decompilation right provided by Article 6 codifies what is claimed to be common practice in the software industry. Few European lawsuits are known to have emerged from the decompilation right. This could be interpreted as meaning either one of two things: 1) the decompilation right is not used frequently and the decompilation right may therefore have been unnecessary, or 2) the decompilation right functions well and provides sufficient legal certainty not to give rise to legal disputes. In a recent report regarding implementation of the Software Directive by the European member states, the European Commission
European Commission
The European Commission is the executive body of the European Union. The body is responsible for proposing legislation, implementing decisions, upholding the Union's treaties and the general day-to-day running of the Union....

 seems to support the second interpretation.

See also

  • Linker (computing)
  • Interpreter
  • Abstract interpretation
    Abstract interpretation
    In computer science, abstract interpretation is a theory of sound approximation of the semantics of computer programs, based on monotonic functions over ordered sets, especially lattices. It can be viewed as a partial execution of a computer program which gains information about its semantics In...

  • Boomerang decompiler
    Boomerang (decompiler)
    Boomerang is a GPL decompiler, which allows programmers to translate a binary program into C-like source code. Boomerang uses Static Single Assignment Form to represent code....

  • Mocha decompiler
    Mocha (decompiler)
    Mocha is a Java decompiler, which allows programmers to translate a program's bytecode into source code.A beta version of Mocha was released in 1996, by Dutch developer Hanpeter van Vliet, alongside an obfuscator named Crema. A controversy erupted and he temporarily withdrew Mocha from public...

  • JAD decompiler
    JAD (JAva Decompiler)
    Jad is a currently unmaintained decompiler for the Java programming language.Jad provides a command-line user interface to extract source code from class files.The most popular GUI for Jad is...

  • .NET Reflector
    .NET Reflector
    .NET Reflector is a proprietary software utility for Microsoft .NET combining class browsing, static analysis and decompilation, originally written by Lutz Roeder. MSDN Magazine named it as one of the Ten Must-Have utilities for developers, and Scott Hanselman listed it as part of his "Big Ten...


External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK