Dataflow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program. A program's control flow graph (CFG) is used to determine those parts of a program to which a particular value assigned to a variable might propagate. The information gathered is often used by compilers when optimizing a program. A canonical example of a dataflow analysis is reaching definitions.
A simple way to perform dataflow analysis of programs is to set up dataflow equations for each node of the control flow graph and solve them by repeatedly calculating the output from the input locally at each node until the whole system stabilizes, i.e., it reaches a fixpoint. This general approach was developed by Gary Kildall while teaching at the Naval Postgraduate School.
Basic principles
Dataflow analysis is the process of collecting information about the way variables are defined and used in the program. It attempts to obtain particular information at each point in a procedure. Usually, it is enough to obtain this information at the boundaries of basic blocks, since from that it is easy to compute the information at points within the basic block. In forward flow analysis, the exit state of a block is a function of the block's entry state. This function is the composition of the effects of the statements in the block. The entry state of a block is a function of the exit states of its predecessors. This yields a set of dataflow equations:
For each block b:

 out_b = trans_b(in_b)
 in_b = join_{p in pred(b)} out_p

In this, trans_b is the transfer function of the block b. It works on the entry state in_b, yielding the exit state out_b. The join operation combines the exit states of the predecessors pred(b) of b, yielding the entry state of b.
After solving this set of equations, the entry and/or exit states of the blocks can be used to derive properties of the program at the block boundaries. The transfer function of each statement separately can be applied to get information at a point inside a basic block.
Each particular type of dataflow analysis has its own specific transfer function and join operation. Some dataflow problems require backward flow analysis. This follows the same plan, except that the transfer function is applied to the exit state yielding the entry state, and the join operation works on the entry states of the successors to yield the exit state.
The entry point (in forward flow) plays an important role: since it has no predecessors, its entry state is well defined at the start of the analysis. For instance, the set of local variables with known values is empty. If the control flow graph does not contain cycles (i.e., the procedure has no explicit or implicit loops), solving the equations is straightforward. The control flow graph can then be topologically sorted; running in the order of this sort, the entry states can be computed at the start of each block, since all predecessors of that block have already been processed, so their exit states are available. If the control flow graph does contain cycles, a more advanced algorithm is required.
An iterative algorithm
The most common way of solving the dataflow equations is by using an iterative algorithm. It starts with an approximation of the in-state of each block. The out-states are then computed by applying the transfer functions on the in-states. From these, the in-states are updated by applying the join operations. The latter two steps are repeated until we reach the so-called
fixpoint: the situation in which the in-states (and the out-states in consequence) do not change.
A basic algorithm for solving dataflow equations is the
round-robin iterative algorithm:
 for i ← 1 to N
     initialize node i
 while (sets are still changing)
     for i ← 1 to N
         recompute sets at node i
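The round-robin scheme above can be sketched in Python for a forward problem. The function names and the diamond-shaped example CFG below are illustrative assumptions, not part of any particular compiler's API:

```python
def round_robin_forward(blocks, preds, init, transfer, join):
    """Round-robin iterative solver for a forward dataflow problem (a sketch)."""
    out_state = {b: init for b in blocks}   # initial approximation for every node
    changed = True
    while changed:                          # "while (sets are still changing)"
        changed = False
        for b in blocks:                    # "for i <- 1 to N"
            in_state = join(out_state[p] for p in preds[b])
            new_out = transfer(b, in_state)
            if new_out != out_state[b]:
                out_state[b] = new_out
                changed = True
    return out_state

# Illustrative reaching-definitions instance on a diamond-shaped CFG:
# b1 -> b2, b1 -> b3, b2 -> b4, b3 -> b4
blocks = ["b1", "b2", "b3", "b4"]
preds = {"b1": [], "b2": ["b1"], "b3": ["b1"], "b4": ["b2", "b3"]}
gen  = {"b1": set(), "b2": {"d1"}, "b3": {"d2"}, "b4": set()}
kill = {"b1": set(), "b2": {"d2"}, "b3": {"d1"}, "b4": set()}

result = round_robin_forward(
    blocks, preds, frozenset(),
    transfer=lambda b, s: (s - kill[b]) | gen[b],   # out = (in - kill) ∪ gen
    join=lambda states: frozenset().union(*states), # join = set union
)
# Both definitions reach the join block b4.
```

The `transfer` and `join` callbacks are exactly the per-problem pieces the text describes; only they change between, say, reaching definitions and live variables.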
Convergence
To be usable, the iterative approach should actually reach a fixpoint. This can be guaranteed
by imposing constraints on the combination of the value domain of the states, the transfer functions and the join operation.
The value domain should be a partial order with finite height (i.e., there are no infinite ascending chains x_1 < x_2 < ...). The combination of the transfer function and the join operation should be monotonic with respect to this partial order. Monotonicity ensures that on each iteration the value will either stay the same or will grow larger, while finite height ensures that it cannot grow indefinitely. Thus we will ultimately reach a situation where T(x) = x for all x, which is the fixpoint.
The work list approach
It is easy to improve on the algorithm above by noticing that the in-state of a block will not change if the out-states of its predecessors don't change. Therefore, we introduce a
work list: a list of blocks that still need to be processed. Whenever the out-state of a block changes, we add its successors to the work list. In each iteration, a block is removed from the work list. Its out-state is computed. If the out-state changed, the block's successors are added to the work list. For efficiency, a block should not be in the work list more than once.
The algorithm is started by putting the entry point in the work list. It terminates when the
work list is empty.
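The work-list refinement can be sketched as follows; as above, the block names and callback signatures are illustrative assumptions:

```python
from collections import deque

def worklist_forward(blocks, preds, succs, init, transfer, join):
    """Work-list solver for a forward dataflow problem (a sketch)."""
    out_state = {b: init for b in blocks}
    # Seed the work list with the entry point, as in the text.
    worklist = deque(b for b in blocks if not preds[b])
    on_list = set(worklist)                 # keep each block on the list at most once
    while worklist:                         # terminate when the work list is empty
        b = worklist.popleft()
        on_list.discard(b)
        in_state = join(out_state[p] for p in preds[b])
        new_out = transfer(b, in_state)
        if new_out != out_state[b]:         # out-state changed, so ...
            out_state[b] = new_out
            for s in succs[b]:              # ... the successors must be revisited
                if s not in on_list:
                    worklist.append(s)
                    on_list.add(s)
    return out_state

# Same diamond-shaped CFG as before, with a definition in the entry block.
blocks = ["b1", "b2", "b3", "b4"]
preds = {"b1": [], "b2": ["b1"], "b3": ["b1"], "b4": ["b2", "b3"]}
succs = {"b1": ["b2", "b3"], "b2": ["b4"], "b3": ["b4"], "b4": []}
gen  = {"b1": {"d0"}, "b2": {"d1"}, "b3": {"d2"}, "b4": set()}
kill = {"b1": set(), "b2": set(), "b3": set(), "b4": set()}

result = worklist_forward(
    blocks, preds, succs, frozenset(),
    transfer=lambda b, s: (s - kill[b]) | gen[b],
    join=lambda states: frozenset().union(*states),
)
```

Note that seeding only the entry point relies on its out-state changing on the first visit; many implementations conservatively seed the work list with all blocks instead.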
The order matters
The efficiency of iteratively solving dataflow equations is influenced by the order in which the nodes are visited. Furthermore, it depends on whether the dataflow equations are used for forward or backward dataflow analysis over the CFG.
Intuitively, in a forward flow problem, it would be fastest if all
predecessors of a block have been processed before the block itself, since then the iteration will use the latest information. In the absence of loops it is possible to order the blocks in such a way that the correct out-states are computed by processing each block only once.
In the following, a few iteration orders for solving dataflow equations are discussed (a concept related to the iteration order of a CFG is tree traversal of a tree).
 Random order  This iteration order is not aware of whether the dataflow equations solve a forward or backward dataflow problem. Therefore, the performance is relatively poor compared to specialized iteration orders.
 Postorder  This is a typical iteration order for backward dataflow problems. In postorder iteration, a node is visited after all its successor nodes have been visited. Typically, the postorder iteration is implemented with the depth-first strategy.
 Reverse postorder  This is a typical iteration order for forward dataflow problems. In reverse-postorder iteration, a node is visited before all its successor nodes have been visited, except when the successor is reached by a back edge. (Note that this is not the same as preorder.)
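A reverse postorder can be computed with a depth-first walk over the CFG; this sketch assumes the CFG is given as a successor map:

```python
def postorder(succs, entry):
    """Return the CFG nodes in postorder: each node after its DFS children."""
    order, visited = [], set()
    def dfs(node):
        visited.add(node)
        for s in succs.get(node, ()):
            if s not in visited:
                dfs(s)
        order.append(node)              # emitted only after its successors
    dfs(entry)
    return order

def reverse_postorder(succs, entry):
    """Visit each node before its successors (back edges excepted)."""
    return list(reversed(postorder(succs, entry)))

# Diamond-shaped CFG: b1 -> {b2, b3} -> b4
succs = {"b1": ["b2", "b3"], "b2": ["b4"], "b3": ["b4"], "b4": []}
rpo = reverse_postorder(succs, "b1")    # entry first, the join block b4 last
```

For a forward problem, iterating the round-robin loop in `rpo` order means every block (except via back edges) sees up-to-date predecessor out-states on each pass.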
Initialization
The initial value of the in-states is important to obtain correct and accurate results.
If the results are used for compiler optimizations, they should provide
conservative information, i.e. when applying the information, the program should not change semantics.
The iteration of the fixpoint algorithm will take the values in the direction of the maximum element. Initializing all blocks with the maximum element is therefore not useful. At least one block starts in a state with a value less than the maximum. The details depend on the
dataflow problem. If the minimum element represents totally conservative information, the results can be used safely even during the dataflow iteration. If it represents the most accurate information, fixpoint should be reached before the results can be applied.
Examples
The following are examples of properties of computer programs that can be calculated by dataflow analysis.
Note that the properties calculated by dataflow analysis are typically only approximations of the real
properties. This is because dataflow analysis operates on the syntactical structure of the CFG without
simulating the exact control flow of the program.
However, to still be useful in practice, a dataflow analysis algorithm is typically designed to calculate an upper or lower approximation of the real program properties, as appropriate for the problem.
Forward Analysis
The reaching definitions analysis calculates for each program point the set of definitions that may potentially reach this program point.
1: if b == 4 then
2: a = 5;
3: else
4: a = 3;
5: endif
6:
7: if a < 4 then
8: ...

The reaching definition of variable "a" at line 7 is the set of assignments a=5 at line 2 and a=3 at line 4.
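For this snippet, the analysis can be reproduced with a small fixpoint loop; the block names (`cond`, `then`, `els`, `after`) and the definition labels d1, d2 are illustrative:

```python
# Each assignment to "a" is a named definition; the kill set of a block
# removes the other definitions of the variables it assigns.
gen  = {"cond": set(), "then": {"d1"}, "els": {"d2"}, "after": set()}
kill = {"cond": set(), "then": {"d2"}, "els": {"d1"}, "after": set()}
preds = {"cond": [], "then": ["cond"], "els": ["cond"], "after": ["then", "els"]}
blocks = ["cond", "then", "els", "after"]

out_state = {b: set() for b in blocks}
changed = True
while changed:
    changed = False
    for b in blocks:
        in_state = set().union(*(out_state[p] for p in preds[b]))
        new_out = (in_state - kill[b]) | gen[b]     # out = (in - kill) ∪ gen
        if new_out != out_state[b]:
            out_state[b], changed = new_out, True

# Definitions reaching line 7: the assignments from both branches.
reaching_line7 = set().union(*(out_state[p] for p in preds["after"]))
```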

Backward Analysis
The live variable analysis calculates for each program point the variables that may be potentially read afterwards before their next write update. The result is typically used by dead code elimination to remove statements that assign to a variable whose value is not used afterwards.
The in-state of a block is the set of variables that are live at the end of the block. Its out-state is the set of variables that are live at the start of it. The in-state is the union of the out-states of the block's successors. The transfer function of a statement is applied by making the variables that are written dead, then making the variables that are read live.
Initial Code:
b1: a = 3;
b = 5;
d = 4;
if a > b then
b2: c = a + b;
d = 2;
b3: endif
c = 4;
return b * d + c;

Backward Analysis:
// out: {}
b1: a = 3;
b = 5;
d = 4;
if a > b then
// in: {a,b,d}
// out: {a,b}
b2: c = a + b;
d = 2;
// in: {b,d}
// out: {b,d}
b3: endif
c = 4;
return b * d + c;
// in:{}

The out-state of b3 only contains b and d, since c has been written. The in-state of b1 is the union of the out-states of b2 and b3. The definition of c in b2 can be removed, since c is not live immediately after the statement.
Solving the dataflow equations starts with initializing all in-states and out-states to the empty set. The work list is initialized by inserting the exit point (b3) in the work list (typical for backward flow). Its computed out-state differs from the previous one, so its predecessors b1 and b2 are inserted and the process continues. The progress is summarized in the table below.
 processing | in-state | old out-state | new out-state | work list
 b3         | {}       | {}            | {b,d}         | (b1,b2)
 b1         | {b,d}    | {}            | {}            | (b2)
 b2         | {b,d}    | {}            | {a,b}         | (b1)
 b1         | {a,b,d}  | {}            | {}            | ()
Note that b1 was entered in the list before b2, which forced processing b1 twice (b1 was re-entered as a predecessor of b2). Inserting b2 before b1 would have allowed earlier completion.
Initializing with the empty set is an optimistic initialization: all variables start out as dead. Note that the out-states cannot shrink from one iteration to the next, although the out-state can be smaller than the in-state. This can be seen from the fact that after the first iteration the out-state can only change by a change of the in-state. Since the in-state starts as the empty set, it can only grow in further iterations.
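The table above can be reproduced with a backward work-list solver. This sketch follows the document's naming convention (the in-state is the union of the successors' out-states, and the transfer function maps in-state to out-state); the `use`/`defs` sets were read off blocks b1..b3 by hand:

```python
from collections import deque

# use: variables read before any write in the block; defs: variables written.
use  = {"b1": set(),             "b2": {"a", "b"}, "b3": {"b", "d"}}
defs = {"b1": {"a", "b", "d"},   "b2": {"c", "d"}, "b3": {"c"}}
succs = {"b1": ["b2", "b3"], "b2": ["b3"], "b3": []}
preds = {"b1": [], "b2": ["b1"], "b3": ["b1", "b2"]}

out_state = {b: set() for b in succs}   # out-state: variables live at block start
worklist = deque(["b3"])                # start from the exit point (backward flow)
while worklist:                         # terminate when the work list is empty
    b = worklist.popleft()
    in_state = set().union(*(out_state[s] for s in succs[b]))
    new_out = (in_state - defs[b]) | use[b]
    if new_out != out_state[b]:
        out_state[b] = new_out
        worklist.extend(preds[b])       # predecessors must be revisited
```

Running this yields exactly the final column of the table: b3 and b2 converge to {b,d} and {a,b}, while b1 converges to the empty set.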
Other approaches
In 2002, Markus Mohnen described a new method of dataflow analysis that does not require the explicit construction of a dataflow graph, instead relying on abstract interpretation of the program and keeping a working set of program counters. At each conditional branch, both targets are added to the working set. Each path is followed for as many instructions as possible (until end of program or until it has looped with no changes), and then removed from the set and the next program counter retrieved.
Bit vector problems
The examples above are problems in which the dataflow value is a set, e.g. the set of reaching definitions (using a bit for a definition position in the program), or the set of live variables. These sets can be represented efficiently as bit vectors, in which each bit represents set membership of one particular element. Using this representation, the join and transfer functions can be implemented as bitwise logical operations. The join operation is typically union or intersection, implemented by bitwise logical or and logical and.
The transfer function for each block can be decomposed into so-called gen and kill sets.
As an example, in live-variable analysis, the join operation is union. The kill set is the set of variables that are written in a block, whereas the gen set is the set of variables that are read without being written first. The dataflow equations become

 out(b) = union of in(s) over all successors s of b
 in(b) = (out(b) - kill(b)) union gen(b)

In logical operations, this reads as
 out(b) = 0
 for s in succ(b)
     out(b) = out(b) or in(s)
 in(b) = (out(b) and not kill(b)) or gen(b)
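Using Python integers as arbitrary-width bit vectors, the transfer function above becomes a handful of bitwise operators. The variable numbering is an illustrative assumption; the gen/kill sets are those of block b2 from the liveness example:

```python
# Variables a..d are numbered 0..3; bit i set means "variable i is in the set".
var_bit = {v: 1 << i for i, v in enumerate("abcd")}

def to_bits(names):
    bits = 0
    for v in names:
        bits |= var_bit[v]
    return bits

def live_transfer(out_bits, gen_bits, kill_bits):
    # in(b) = (out(b) and not kill(b)) or gen(b), as bitwise operations
    return (out_bits & ~kill_bits) | gen_bits

# Block b2 reads {a, b} (gen) and writes {c, d} (kill).
gen_b2  = to_bits("ab")
kill_b2 = to_bits("cd")

out_b2 = to_bits("bd")                  # variables live after b2
in_b2 = live_transfer(out_b2, gen_b2, kill_b2)   # variables live before b2
```

One word-sized `and`/`or` thus processes as many set elements as the word has bits, which is what makes bit-vector problems cheap in practice.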
Sensitivities
Dataflow analysis is inherently flow-sensitive. Dataflow analysis is typically path-insensitive, though it is possible to define dataflow equations that yield a path-sensitive analysis.
 A flow-sensitive analysis takes into account the order of statements in a program. For example, a flow-insensitive pointer alias analysis may determine "variables x and y may refer to the same location", while a flow-sensitive analysis may determine "after statement 20, variables x and y may refer to the same location".
 A path-sensitive analysis computes different pieces of analysis information dependent on the predicates at conditional branch instructions. For instance, if a branch contains a condition x>0, then on the fall-through path the analysis would assume that x<=0 and on the target of the branch it would assume that indeed x>0 holds.
 A context-sensitive analysis is an interprocedural analysis that considers the calling context when analyzing the target of a function call. In particular, using context information one can jump back to the original call site, whereas without that information, the analysis information has to be propagated back to all possible call sites, potentially losing precision.
List of dataflow analyses
 Reaching definitions
 Liveness analysis
 Definite assignment analysis
 Available expressions
Further reading
 Cooper, Keith D. and Torczon, Linda. Engineering a Compiler. Morgan Kaufmann. 2005.
 Muchnick, Steven S. Advanced Compiler Design and Implementation. Morgan Kaufmann. 1997.
 Hecht, Matthew S. Flow Analysis of Computer Programs. Elsevier North-Holland Inc. 1977.
 Khedker, Uday P., Sanyal, Amitabha, and Karkare, Bageshri. Data Flow Analysis: Theory and Practice. CRC Press (Taylor and Francis Group). 2009.
 Flemming Nielson, Hanne Riis Nielson, Chris Hankin. Principles of Program Analysis. Springer. 2005.