Tagged union
Encyclopedia
In computer science
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

, a tagged union, also called a variant
Variant type
Variant is a data type in certain programming languages, particularly Visual Basic and C++ when using the Component Object Model.In Visual Basic the Variant data type is a tagged union that can be used to represent any other data type except fixed-length string type and...

, variant record, discriminated union, or disjoint union
Disjoint union
In mathematics, the term disjoint union may refer to one of two different concepts:* In set theory, a disjoint union is a modified union operation that indexes the elements according to which set they originated in; disjoint sets have no element in common.* In probability theory , a disjoint union...

, is a data structure
Data structure
In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks...

 used to hold a value that could take on several different, but fixed types. Only one of the types can be in use at any one time, and a tag field explicitly indicates which one is in use. It can be thought of as a type which has several "cases," each of which should be handled correctly when that type is manipulated. Like ordinary unions
Union (computer science)
In computer science, a union is a value that may have any of several representations or formats; or a data structure that consists of a variable which may hold such a value. Some programming languages support special data types, called union types, to describe such values and variables...

, tagged unions can save storage by overlapping storage areas for each type, since only one is in use at a time.

Tagged unions are most important in functional languages such as ML
ML programming language
ML is a general-purpose functional programming language developed by Robin Milner and others in the early 1970s at the University of Edinburgh, whose syntax is inspired by ISWIM...

 and Haskell
Haskell (programming language)
Haskell is a standardized, general-purpose purely functional programming language, with non-strict semantics and strong static typing. It is named after logician Haskell Curry. In Haskell, "a function is a first-class citizen" of the programming language. As a functional programming language, the...

, where they are called datatypes (see algebraic data type
Algebraic data type
In computer programming, particularly functional programming and type theory, an algebraic data type is a datatype each of whose values is data from other datatypes wrapped in one of the constructors of the datatype. Any wrapped datum is an argument to the constructor...

) and the compiler is able to verify that all cases of a tagged union are always handled, avoiding many types of errors. They can, however, be constructed in nearly any language
Programming language
A programming language is an artificial language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs that control the behavior of a machine and/or to express algorithms precisely....

, and are much safer than untagged unions, often simply called unions, which are similar but do not explicitly keep track of which member of the union is currently in use.

Tagged unions are often accompanied by the concept of a constructor, which is similar but not the same as a constructor
Constructor (computer science)
In object-oriented programming, a constructor in a class is a special type of subroutine called at the creation of an object. It prepares the new object for use, often accepting parameters which the constructor uses to set any member variables required when the object is first created...

 for a class. Constructors produce a tagged union value, given the initial tag value and a value of the corresponding type.

Mathematically, tagged unions correspond to disjoint
Disjoint union
In mathematics, the term disjoint union may refer to one of two different concepts:* In set theory, a disjoint union is a modified union operation that indexes the elements according to which set they originated in; disjoint sets have no element in common.* In probability theory , a disjoint union...

or discriminated unions, usually written using +. Given an element of a disjoint union A + B, it is possible to determine whether it came from A or B. If an element lies in both, there will be two effectively distinct copies of the value in A + B, one from A and one from B.

In type theory
Type theory
In mathematics, logic and computer science, type theory is any of several formal systems that can serve as alternatives to naive set theory, or the study of such formalisms in general...

, a tagged union is called a sum type. Notations vary, but usually the sum type comes with two introduction forms and . The elimination form is case analysis, known as pattern matching
Pattern matching
In computer science, pattern matching is the act of checking some sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact. The patterns generally have the form of either sequences or tree structures...

 in ML-style programming languages: if has type and and have type under the assumptions and respectively, then the term
has type . The sum type corresponds to intuitionistic logical disjunction
Logical disjunction
In logic and mathematics, a two-place logical connective or, is a logical disjunction, also known as inclusive disjunction or alternation, that results in true whenever one or more of its operands are true. E.g. in this context, "A or B" is true if A is true, or if B is true, or if both A and B are...

 under the Curry-Howard correspondence.

An enumerated type
Enumerated type
In computer programming, an enumerated type is a data type consisting of a set of named values called elements, members or enumerators of the type. The enumerator names are usually identifiers that behave as constants in the language...

 can be seen as a degenerate case: a tagged union of unit type
Unit type
In the area of mathematical logic, and computer science known as type theory, a unit type is a type that allows only one value . The carrier associated with a unit type can be any singleton set. There is an isomorphism between any two such sets, so it is customary to talk about the unit type and...

s. It corresponds to a set of nullary constructors and may be implemented as a simple tag variable, since it holds no additional data besides the value of the tag.

Advantages and disadvantages

The primary advantage of a tagged union over an untagged union is that all accesses are safe, and the compiler can even check that all cases are handled. Untagged unions depend on program logic to correctly identify the currently active field, which may result in strange behavior and hard-to-find bugs if that logic fails.

The primary advantage of a tagged union over a simple record
Record (computer science)
In computer science, a record is an instance of a product of primitive data types called a tuple. In C it is the compound data in a struct. Records are among the simplest data structures. A record is a value that contains other values, typically in fixed number and sequence and typically indexed...

 containing a field for each type is that it saves storage by overlapping storage for all the types. Some implementations reserve enough storage for the largest type, while others dynamically adjust the size of a tagged union value as needed. When the value is immutable
Immutable object
In object-oriented and functional programming, an immutable object is an object whose state cannot be modified after it is created. This is in contrast to a mutable object, which can be modified after it is created...

, it is simple to allocate just as much storage as is needed.

The main disadvantage of tagged unions is that the tag occupies space. Since there are usually a small number of alternatives, the tag can often be squeezed into 2 or 3 bits wherever space can be found, but sometimes even these bits are not available. In this case, a helpful alternative may be folded, computed or encoded tags, where the tag value is dynamically computed from the contents of the union field. Common examples of this are the use of reserved values, where, for example, a function returning a positive number may return -1 to indicate failure, and sentinel value
Sentinel value
In computer programming, a sentinel value is a special value whose presence guarantees termination of a loop that processes structured data...

s, most often used in tagged pointer
Tagged pointer
In computer science, a tagged pointer is a common example of a tagged union, where the primary type of data to be stored in the union is a pointer...

s.

Sometimes, untagged unions are used to perform bit-level conversions between types, called reinterpret casts in C++. Tagged unions are not intended for this purpose; typically a new value is assigned whenever the tag is changed.

Many languages support, to some extent, a universal data type
Top type
The top type in type theory, commonly abbreviated as top or by the down tack symbol , is the universal type—that type which contains every possible object in the type system of interest. The top type is sometimes called the universal supertype as all other types in any given type system are...

, which is a type that includes every value of every other type, and often a way is provided to test the actual type of a value of the universal type. These are sometimes referred to as variants. While universal data types are comparable to tagged unions in their formal definition, typical tagged unions include a relatively small number of cases, and these cases form different ways of expressing a single coherent concept, such as a data structure node or instruction. Also, there is an expectation that every possible case of a tagged union will be dealt with when it is used. The values of a universal data type are not related and there is no feasible way to deal with them all.

Like option type
Option type
In programming languages and type theory, an option type or maybe type is a polymorphic type that represents encapsulation of an optional value; e.g. it is used as the return type of functions which may or may not return a meaningful value when they're applied...

s and exception handling
Exception handling
Exception handling is a programming language construct or computer hardware mechanism designed to handle the occurrence of exceptions, special conditions that change the normal flow of program execution....

, tagged unions are sometimes used to handle the occurrence of exceptional results. Often these tags are folded into the type as "reserved values", and their occurrence is not consistently checked: this is a fairly common source of programming errors. This use of tagged unions can be formalized as a monad
Monads in functional programming
In functional programming, a monad is a programming structure that represents computations. Monads are a kind of abstract data type constructor that encapsulate program logic instead of data in the domain model...

 with the following functions:


where "value" and "err" are the constructors of the union type, A and B are valid result types and E is the type of error conditions. Alternately, the same monad may be described by return and two additional functions, fmap and join:

Examples

Say we wanted to build a binary tree
Binary tree
In computer science, a binary tree is a tree data structure in which each node has at most two child nodes, usually distinguished as "left" and "right". Nodes with children are parent nodes, and child nodes may contain references to their parents. Outside the tree, there is often a reference to...

 of integers. In ML, we would do this by creating a datatype like this:


datatype tree = Leaf
| Node of (int * tree * tree)


This is a tagged union with two cases: one, the leaf, is used to terminate a path of the tree, and functions much like a null value would in imperative languages. The other branch holds a node, which contains an integer and a left and right subtree. Leaf and Node are the constructors, which enable us to actually produce a particular tree, such as:


Node(5, Node(1,Leaf,Leaf), Node(3, Leaf, Node(4, Leaf, Leaf)))


which corresponds to this tree:



Now we can easily write a typesafe function that, say, counts the number of nodes in the tree:


fun countNodes(Leaf) = 0
| countNodes(Node(int,left,right)) =
1 + countNodes(left) + countNodes(right)

1960s

In ALGOL 68
ALGOL 68
ALGOL 68 isan imperative computerprogramming language that was conceived as a successor to theALGOL 60 programming language, designed with the goal of a...

, tagged unions are called united modes, the tag is implicit, and the case construct is used to determine which field is tagged:

mode node = union (real, int, compl, string);

Usage example for union case of node:


node n := "1234";
 
case n in
(real r): print(("real:", r)),
(int i): print(("int:", i)),
(compl c): print(("compl:", c)),
(string s): print(("string:", s))
out print(("?:", n))
esac

1970s & 1980s

Although primarily only functional languages such as ML
ML programming language
ML is a general-purpose functional programming language developed by Robin Milner and others in the early 1970s at the University of Edinburgh, whose syntax is inspired by ISWIM...

 and Haskell
Haskell (programming language)
Haskell is a standardized, general-purpose purely functional programming language, with non-strict semantics and strong static typing. It is named after logician Haskell Curry. In Haskell, "a function is a first-class citizen" of the programming language. As a functional programming language, the...

 (from 1990s) give a central role to tagged unions and have the power to check that all cases are handled, other languages have support for tagged unions as well. However, in practice they can be less efficient in non-functional languages due to optimizations enabled by functional language compilers that can eliminate explicit tag checks and avoid explicit storage of tags.

Pascal, Ada, and Modula-2
Modula-2
Modula-2 is a computer programming language designed and developed between 1977 and 1980 by Niklaus Wirth at ETH Zurich as a revision of Pascal to serve as the sole programming language for the operating system and application software for the personal workstation Lilith...

 call them variant records (formally discriminated type in Ada), and require the tag field to be manually created and the tag values specified, as in this Pascal example:


type shapeKind = (square, rectangle, circle);
shape = record
centerx : integer;
centery : integer;
case kind : shapeKind of
square : (side : integer);
rectangle : (length, height : integer);
circle : (radius : integer);
end;


And this Ada example:

-- Ada example of discriminated record.

type Discriminant_Kind is (A_Foo, A_Bar, A_Who_Know_What);

type Discriminated_Type (Discriminant : Discriminant_Kind) is record
case Discriminant is
when A_Foo => Foo : Foo_Type;
when A_Bar => Bar : Bar_Type;
when A_Who_Know_What => Who_Know_What : Who_Know_What_Type;
end case;
end record;

-- Any attempt to access a member whose existence depends
-- on a particular value of the discriminant, while the
-- discriminant is not the expected one, raise an error.


In C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

 and C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

, a tagged union can be created from untagged unions using a strict access discipline where the tag is always checked:


enum ShapeKind { Square, Rectangle, Circle };

struct Shape {
int centerx;
int centery;
enum ShapeKind kind;
union {
struct { int side; } squareData;
struct { int length, height; } rectangleData;
struct { int radius; } circleData;
} shapeKindData;
};

int getSquareSide(struct Shape* s) {
assert(s->kind Square);
return s->shapeKindData.squareData.side;
}

void setSquareSide(struct Shape* s, int side) {
s->kind = Square;
s->shapeKindData.squareData.side = side;
}

/* and so on */


As long as the union fields are only accessed through the functions, the accesses will be safe and correct. The same approach can be used for encoded tags; we simply decode the tag and then check it on each access. If the inefficiency of these tag checks is a concern, they may be automatically removed in the final version.

C and C++ also have language support for one particular tagged union: the possibly-null pointer. This may be compared to the option type in ML or the Maybe type in Haskell, and can be seen as a tagged pointer
Tagged pointer
In computer science, a tagged pointer is a common example of a tagged union, where the primary type of data to be stored in the union is a pointer...

: a tagged union (with an encoded tag) of two types:
  • Valid pointers,
  • A type with only one value, null, indicating an exceptional condition.

Unfortunately, C compilers do not verify that the null case is always handled, and this is a particularly prevalent source of errors in C code, since there is a tendency to ignore exceptional cases.

2000s

One advanced dialect of C called Cyclone
Cyclone programming language
The Cyclone programming language is intended to be a safe dialect of the C language. Cyclone is designed to avoid buffer overflows and other vulnerabilities that are endemic in C programs, without losing the power and convenience of C as a tool for system programming.Cyclone development was started...

 has extensive built-in support for tagged unions. See the tagged union section of the on-line manual for more information.
The variant library from Boost has demonstrated it was possible to implement a safe tagged union as a library in C++, visitable using functors.

struct display : boost::static_visitor
{
void operator(int i)
{
std::cout << "It's an int, with value " << i << std::endl;
}

void operator(const std::string& s)
{
std::cout << "It's a string, with value " << s << std::endl;
}
};

boost::variant v = 42;
boost::apply_visitor(display, v);

boost::variant v = "hello world";
boost::apply_visitor(display, v);

Scala has case classes:

sealed abstract class Tree
case object Leaf extends Tree
case class Node(value: Int, left: Tree, right: Tree) extends Tree

val tree = Node(5, Node(1,Leaf,Leaf), Node(3, Leaf, Node(4, Leaf, Leaf)))


Because the class hierarchy is sealed, the compiler can check that all cases are handled in a pattern match:

tree match {
case Node(x, _, _) => println("top level node value: " + x)
case Leaf => println("top level node is a leaf")
}


Scala's case classes also permit reuse through subtyping:


sealed abstract class Shape(centerX: Int, centerY: Int)
case class Square(side: Int, centerX: Int, centerY: Int) extends Shape(centerX, centerY)
case class Rectangle(length: Int, height: Int, centerX: Int, centerY: Int) extends Shape(centerX, centerY)
case class Circle(radius: Int, centerX: Int, centerY: Int) extends Shape(centerX, centerY)

Class hierarchies as tagged unions
In a typical class hierarchy
Class hierarchy
As in taxonomy, the classifications of species, a class hierarchy in computer science is a classification of object types, denoting objects as the instantiations of classes inter-relating the various classes by relationships such as "inherits", "extends", "is an abstraction of", "an interface...

 in object-oriented programming
Object-oriented programming
Object-oriented programming is a programming paradigm using "objects" – data structures consisting of data fields and methods together with their interactions – to design applications and computer programs. Programming techniques may include features such as data abstraction,...

, each subclass can encapsulate data unique to that class. The metadata used to perform virtual method lookup (for example, the object's vtable
Virtual method table
A virtual method table, virtual function table, dispatch table, or vtable, is a mechanism used in a programming language to support dynamic dispatch ....

 pointer in most C++ implementations) identifies the subclass and so effectively acts as a tag identifying the particular data stored by the instance (see RTTI).
An object's constructor
Constructor (computer science)
In object-oriented programming, a constructor in a class is a special type of subroutine called at the creation of an object. It prepares the new object for use, often accepting parameters which the constructor uses to set any member variables required when the object is first created...

 sets this tag, and it remains constant throughout the object's lifetime.

Nevertheless, a class hierarchy involves true subtype polymorphism
Subtype polymorphism
In programming language theory, subtyping or subtype polymorphism is a form of type polymorphism in which a subtype is a datatype that is related to another datatype by some notion of substitutability, meaning that program constructs, typically subroutines or functions, written to operate on...

; it can be extended by creating further subclasses of the same base type, which could not be handled correctly under a tag/dispatch model. Hence, it is usually not possible to do case analysis or dispatch on a subobject's 'tag' as one would for tagged unions. Some languages such as Scala allow base classes to be "sealed", and unify tagged unions with sealed base classes.
External links
  • boost::variant is a C++ typesafe discriminated union
  • std.variant is an implementation of variant type in D
    D (programming language)
    The D programming language is an object-oriented, imperative, multi-paradigm, system programming language created by Walter Bright of Digital Mars. It originated as a re-engineering of C++, but even though it is mainly influenced by that language, it is not a variant of C++...

    2.0
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK