Decompiler



 
 


A decompiler is a computer program that performs the reverse operation to that of a compiler. That is, it translates a file containing information at a relatively low level of abstraction (usually designed to be computer readable rather than human readable) into a form having a higher level of abstraction (usually designed to be human readable).

Introduction


The term "decompiler" is most commonly applied to a program which translates executable programs (the output from a compiler) into source code in a (relatively) high level language which, when compiled, will produce an executable whose behavior is the same as the original executable program. By comparison, a disassembler translates an executable program into assembly language (and an assembler could be used to assemble it back into an executable program).

Decompilation is the act of using a decompiler, although the term, when used as a noun, can also refer to the output of a decompiler. It can be used for the recovery of lost source code, and is also useful in some cases for computer security, interoperability and error correction. The success of decompilation depends on the amount of information present in the code being decompiled and the sophistication of the analysis performed on it. The bytecode formats used by many virtual machines (such as the Java Virtual Machine or the .NET Framework Common Language Runtime) often include extensive metadata and high-level features that make decompilation quite feasible. Machine language typically has much less metadata, and is therefore much harder to decompile.

Some compilers and post-compilation tools produce obfuscated code (that is, they attempt to produce output that is very difficult to decompile). This is done to make it more difficult to reverse engineer the executable.

Design


Decompilers can be thought of as composed of a series of phases, each of which contributes specific aspects of the overall decompilation process.

Loader


The first decompilation phase is the loader, which parses the input machine code or intermediate language program's binary file format. The loader should be able to discover basic facts about the input program, such as the architecture (Pentium, PowerPC, etc.) and the entry point. In many cases, it should be able to find the equivalent of the main function of a C program, which is the start of the user-written code. This excludes the runtime initialization code, which should not be decompiled if possible.
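As a rough illustration of what a loader discovers, the following sketch reads the architecture and entry point from the header of an ELF file. ELF is only one possible input format and is an assumption made here for the example; the function name `load_elf_header` and the small `MACHINES` table are hypothetical, and locating the equivalent of main is not attempted.

```python
import struct

# Small subset of ELF e_machine codes; a real loader would know many more.
MACHINES = {3: "x86", 20: "PowerPC", 62: "x86-64"}

def load_elf_header(data: bytes) -> dict:
    """Return architecture, word size, byte order and entry point of an ELF image."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}[data[4]]        # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = {1: "<", 2: ">"}[data[5]]    # EI_DATA: 1 = little, 2 = big endian
    # e_type and e_machine are 16-bit fields at offset 16; e_version is 32-bit.
    _e_type, e_machine, _e_version = struct.unpack_from(endian + "HHI", data, 16)
    # e_entry follows at offset 24 and is as wide as an address.
    (entry,) = struct.unpack_from(endian + ("I" if bits == 32 else "Q"), data, 24)
    return {
        "arch": MACHINES.get(e_machine, f"unknown({e_machine})"),
        "bits": bits,
        "byte_order": "little" if endian == "<" else "big",
        "entry": entry,
    }
```

Calling `load_elf_header(open(path, "rb").read())` on an executable would give the later phases a starting address for disassembly; note that the ELF entry point is normally the runtime start-up code, not the user-written main function.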

Disassembly


The next logical phase is the disassembly of machine code instructions into a machine-independent intermediate representation (IR). For example, the Pentium machine instruction

    mov    eax, [ebx+0x04]

might be translated to the IR

    eax := m[ebx+4];
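The translation of a single instruction can be sketched as below. This is a minimal illustration that works on instruction text rather than on machine code bytes and handles only the `mov` instruction; the function names `operand_to_ir` and `lift_mov` are hypothetical and the IR is kept as plain text.

```python
import re

def parse_int(text: str) -> int:
    """Parse a decimal or 0x-prefixed hexadecimal displacement."""
    return int(text, 16) if text.lower().startswith("0x") else int(text, 10)

def operand_to_ir(operand: str) -> str:
    """Translate one Intel-syntax operand into IR text."""
    operand = operand.strip()
    match = re.fullmatch(r"\[(\w+)\s*\+\s*(0x[0-9a-fA-F]+|\d+)\]", operand)
    if match:  # memory operand of the form [reg+disp]
        return f"m[{match.group(1)}+{parse_int(match.group(2))}]"
    match = re.fullmatch(r"\[(\w+)\]", operand)
    if match:  # memory operand of the form [reg]
        return f"m[{match.group(1)}]"
    return operand  # register or immediate, passed through unchanged

def lift_mov(instruction: str) -> str:
    """Translate a single `mov dst, src` instruction into an IR assignment."""
    mnemonic, operands = instruction.split(None, 1)
    if mnemonic != "mov":
        raise NotImplementedError(mnemonic)
    dst, src = operands.split(",")
    return f"{operand_to_ir(dst)} := {operand_to_ir(src)};"
```

With these definitions, `lift_mov("mov eax, [ebx+0x04]")` returns the string `eax := m[ebx+4];`.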

Idioms


Idiomatic machine code sequences are sequences of code whose combined semantics is not immediately apparent from the instructions' individual semantics. Either as part of the disassembly phase, or as part of later analyses, these idiomatic sequences need to be translated into known equivalent IR. For example, the x86 assembly code

    cdq                ; edx is set to the sign-extension of eax
    xor    eax, edx
    sub    eax, edx

could be translated to

    eax := abs(eax);

Some idiomatic sequences are machine independent; some involve only one instruction. For example, xor eax, eax clears the eax register (sets it to zero). This can be implemented with a machine independent simplification rule, such as a xor a = 0.

In general, it is best to delay detection of idiomatic sequences if possible, to later stages that are less affected by instruction ordering. For example, the instruction scheduling phase of a compiler may insert other instructions into an idiomatic sequence, or change the ordering of instructions in the sequence. A pattern matching process in the disassembly phase would probably not recognize the altered pattern. Later phases group instruction expressions into more complex expressions, and modify them into a canonical (standardized) form, making it more likely that even the altered idiom will match a higher level pattern later in the decompilation.
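The two kinds of idiom handling discussed in this section can be sketched as follows. This is an illustrative assumption about representation, not a description of any particular decompiler: instructions and expressions are modelled as Python tuples, and the names `simplify` and `replace_abs_idiom` are hypothetical. The literal three-instruction match is exactly the kind of early pattern matching that breaks when the instructions are reordered, and it ignores the fact that the sequence also overwrites edx.

```python
def simplify(expr):
    """Apply the rule `a xor a = 0` bottom-up to a tuple-based expression tree."""
    if not isinstance(expr, tuple):
        return expr  # register name or constant
    op, *args = expr
    args = [simplify(a) for a in args]
    if op == "xor" and len(args) == 2 and args[0] == args[1]:
        return 0
    return (op, *args)

# The absolute-value idiom as a fixed instruction sequence.
ABS_IDIOM = [("cdq",), ("xor", "eax", "edx"), ("sub", "eax", "edx")]

def replace_abs_idiom(instrs):
    """Replace each literal occurrence of the idiom by eax := abs(eax)."""
    out, i = [], 0
    while i < len(instrs):
        if list(instrs[i:i + 3]) == ABS_IDIOM:
            out.append(("assign", "eax", ("abs", "eax")))
            i += 3
        else:
            out.append(instrs[i])
            i += 1
    return out
```

Here `simplify(("xor", "eax", "eax"))` yields `0`, while a sequence in which a scheduler has moved an unrelated instruction between `cdq` and `xor` is left unchanged by `replace_abs_idiom`.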

Program analysis


Various program analyses can be applied to the IR. In particular, expression propagation combines the semantics of several instructions into more complex expressions. For example,

    mov    eax, [ebx+0x04]
    add    eax, [ebx+0x08]
    sub    [ebx+0x0C], eax

could result in the following IR after expression propagation:

    m[ebx+12] := m[ebx+12] - (m[ebx+4] + m[ebx+8]);

The resulting expression is more like high level language, and has also eliminated the use of the machine register eax. Later analyses may eliminate the ebx register.
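Expression propagation on the three-instruction example can be sketched as a forward substitution over IR statements. The sketch assumes straight-line code, assumes that stores do not alias the memory locations already loaded, and assumes that registers are dead after the sequence; the tuple encoding and the names `substitute` and `propagate` are hypothetical.

```python
REGISTERS = {"eax", "ebx", "ecx", "edx"}

def substitute(expr, defs):
    """Replace registers in expr by their known defining expressions."""
    if isinstance(expr, str):
        return defs.get(expr, expr)
    if isinstance(expr, tuple):
        op, *args = expr
        return (op, *(substitute(a, defs) for a in args))
    return expr  # integer constant

def propagate(stmts):
    """Fold register definitions into later uses; keep only memory stores."""
    defs, out = {}, []
    for dst, src in stmts:
        src = substitute(src, defs)
        if dst in REGISTERS:
            defs[dst] = src                       # remember, do not emit
        else:
            out.append((substitute(dst, defs), src))
    return out

# IR for: mov eax,[ebx+4] / add eax,[ebx+8] / sub [ebx+12],eax
ir = [
    ("eax", ("mem", "ebx", 4)),
    ("eax", ("+", "eax", ("mem", "ebx", 8))),
    (("mem", "ebx", 12), ("-", ("mem", "ebx", 12), "eax")),
]
result = propagate(ir)
```

After the call, `result` holds a single statement whose destination is `("mem", "ebx", 12)` and whose source is the nested expression for `m[ebx+12] - (m[ebx+4] + m[ebx+8])`; the register eax no longer appears.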

Type analysis


A good machine code decompiler will perform type analysis. Here, the way registers or memory locations are used results in constraints on the possible type of the location. For example, an and instruction implies that the operand is an integer; programs do not use such an operation on floating point values (except in special library code) or on pointers. An add instruction results in three constraints, since the operands may be both integer, or one integer and one pointer (with integer and pointer results respectively; the third constraint comes from the ordering of the two operands when the types are different).
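The constraints described for and and add can be written down as alternatives over (left operand, right operand, result) and narrowed as types become known. This is a simplified sketch with only two type names, "int" and "ptr"; the function names `constraints` and `refine` are hypothetical.

```python
# The three alternatives for add: int+int, ptr+int, int+ptr.
ADD_ALTERNATIVES = [
    ("int", "int", "int"),
    ("ptr", "int", "ptr"),
    ("int", "ptr", "ptr"),
]

def constraints(mnemonic):
    """Return the allowed (left, right, result) type combinations."""
    if mnemonic == "and":
        return [("int", "int", "int")]
    if mnemonic == "add":
        return list(ADD_ALTERNATIVES)
    raise NotImplementedError(mnemonic)

def refine(alternatives, left=None, right=None, result=None):
    """Keep only the alternatives consistent with already-known types."""
    known = (left, right, result)
    return [
        alt for alt in alternatives
        if all(k is None or k == a for k, a in zip(known, alt))
    ]
```

For instance, once another instruction has shown that the left operand of an add is a pointer, `refine(constraints("add"), left="ptr")` leaves a single alternative in which the right operand is an integer and the result is a pointer.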

Various high level expressions can be recognized which trigger recognition of structures or arrays. However, it is difficult to distinguish many of the possibilities, because of the freedom that machine code or even some high level languages such as C allow with casts and pointer arithmetic.

The example from the previous section could result in the following high level code:

    struct T1 *ebx;
    struct T1 {
        int v0004;
        int v0008;
        int v000C;
    };
    ebx->v000C -= ebx->v0004 + ebx->v0008;

Structuring


The penultimate decompilation phase involves structuring of the IR into higher level constructs such as while loops and if/then/else conditional statements. For example, the machine code

        xor    eax, eax
    l0002:
        or     ebx, ebx
        jge    l0003
        add    eax, [ebx]
        mov    ebx, [ebx+0x4]
        jmp    l0002
    l0003:
        mov    [0x10040000], eax

could be translated into:

    eax = 0;
    while (ebx < 0) {
        eax += ebx->v0000;
        ebx = ebx->v0004;
    }
    v10040000 = eax;

Unstructured code is more difficult to translate into structured code than already structured code. Solutions include replicating some code, or adding boolean variables.
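One common first step towards recovering a while loop is to find back edges in the control flow graph. The sketch below uses a depth-first search for this, which is a standard technique but not necessarily what any given decompiler does; the block names and the function `find_back_edges` are hypothetical, and the graph is a hand-written model of the loop example in this section.

```python
def find_back_edges(cfg, entry):
    """Return edges (u, v) where v is still on the DFS stack when u is visited."""
    state = {}        # node -> "active" (on the stack) or "done"
    back_edges = []

    def visit(node):
        state[node] = "active"
        for succ in cfg.get(node, []):
            if state.get(succ) == "active":
                back_edges.append((node, succ))   # jump back to a loop header
            elif succ not in state:
                visit(succ)
        state[node] = "done"

    visit(entry)
    return back_edges

# Basic blocks of the example: the block at l0002 tests ebx, "body" ends in jmp l0002.
cfg = {
    "entry": ["l0002"],
    "l0002": ["l0003", "body"],
    "body": ["l0002"],
    "l0003": [],
}
loops = find_back_edges(cfg, "entry")
```

The only back edge found is from `body` to `l0002`. Because the loop header `l0002` is also the block that tests the exit condition, the loop can be emitted as a while loop.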

Code generation

The final phase is the generation of the high level code in the back end of the decompiler. Just as a compiler may have several back ends for generating machine code for different architectures, a decompiler may have several back ends for generating high level code in different high level languages.
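The idea of several back ends sharing one IR can be sketched with two small emitters. The statement encoding and the functions `emit_c` and `emit_pascal` are hypothetical, and expressions are kept as plain strings that happen to be valid in both target languages, which a real back end could not rely on.

```python
def emit_c(stmt, indent=0):
    """Emit one structured IR statement as a list of C source lines."""
    pad = "    " * indent
    if stmt[0] == "assign":
        _, target, expr = stmt
        return [f"{pad}{target} = {expr};"]
    if stmt[0] == "while":
        _, cond, body = stmt
        lines = [f"{pad}while ({cond}) {{"]
        for inner in body:
            lines += emit_c(inner, indent + 1)
        lines.append(f"{pad}}}")
        return lines
    raise NotImplementedError(stmt[0])

def emit_pascal(stmt, indent=0):
    """Emit the same IR statement as a list of Pascal-style source lines."""
    pad = "    " * indent
    if stmt[0] == "assign":
        _, target, expr = stmt
        return [f"{pad}{target} := {expr};"]
    if stmt[0] == "while":
        _, cond, body = stmt
        lines = [f"{pad}while {cond} do", f"{pad}begin"]
        for inner in body:
            lines += emit_pascal(inner, indent + 1)
        lines.append(f"{pad}end;")
        return lines
    raise NotImplementedError(stmt[0])

loop = ("while", "ebx < 0", [("assign", "eax", "eax + 1")])
c_lines = emit_c(loop)
pascal_lines = emit_pascal(loop)
```

Both emitters walk the same `loop` value; only the surface syntax of the generated lines differs.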

Just before code generation, it may be desirable to allow interactive editing of the IR, perhaps using some form of graphical user interface. This would allow the user to enter comments, and non-generic variable and function names. However, these are almost as easily entered in a post-decompilation edit. The user may want to change structural aspects, such as converting a while loop to a for loop. These are less readily modified with a simple text editor, although source code refactoring tools may assist with this process. The user may need to enter information that failed to be identified during the type analysis phase, e.g. modifying a memory expression to an array or structure expression. Finally, incorrect IR may need to be corrected, or changes made to cause the output code to be more readable.

Legality

The majority of computer programs are covered by copyright laws. Although the precise scope of what is covered by copyright differs from region to region, copyright law generally provides the author (the programmer(s) or employer) with a collection of exclusive rights to the program. These rights include the right to make copies, including copies made into the computer's RAM. Since the decompilation process involves making multiple such copies, it is generally prohibited without the authorization of the copyright holder. However, because decompilation is often a necessary step in achieving software interoperability, copyright laws in both the United States and Europe permit decompilation to a limited extent.

In the United States, the copyright fair use defense has been successfully invoked in decompilation cases. For example, in a case between Sega and Accolade, the court held that Accolade could lawfully engage in decompilation in order to circumvent the software locking mechanism used by Sega's game consoles.

In Europe, the Software Directive explicitly provides for a right to decompile in order to achieve interoperability. The result of a heated debate between, on the one side, software protectionists, and, on the other, academics as well as independent software developers, Article 6 of the directive permits decompilation only if a number of conditions are met:

  • First, the decompiler (here, the person or entity performing the decompilation) must have a license to use the program to be decompiled.


  • Second, decompilation must be necessary to achieve interoperability with the target program or other programs. The interoperability information must therefore not already be readily available, for example through manuals or API documentation. This is an important limitation, and the necessity must be proven by the decompiler. Its purpose is primarily to provide an incentive for developers to document and disclose their products' interoperability information.


  • Third, the decompilation process must, if possible, be confined to the parts of the target program relevant to interoperability. Since one of the purposes of decompilation is to gain an understanding of the program structure, this third limitation may be difficult to meet. Again, the burden of proof is on the decompiler.


In addition, Article 6 prescribes that the information obtained through decompilation may not be used for other purposes and that it may not be given to others.

Overall, the decompilation right provided by Article 6 codifies what is claimed to be common practice in the software industry. Few European lawsuits are known to have emerged from the decompilation right. This could be interpreted in one of two ways: 1) the decompilation right is not used frequently and may therefore have been unnecessary, or 2) the decompilation right functions well and provides sufficient legal certainty not to give rise to legal disputes. In reviewing the implementation of the Software Directive by the European member states, the European Commission seems to support the second interpretation.

See also

  • Disassembler
  • Compiler
  • Linker
  • Interpreter
  • Abstract interpretation
  • Obfuscated code
  • Reverse engineering


External links