S-expression

# S-expression

Discussion

Encyclopedia
S-expressions or sexps are list-based data structures that represent semi-structured data
Semi-structured data
Semi-structured data is a form of structured data that does not conform with the formal structure of tables and data models associated with relational databases but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data...

. An S-expression may be a nested list of smaller S-expressions. S-expressions are probably best known for their use in the Lisp
Lisp programming language
Lisp is a family of computer programming languages with a long history and a distinctive, fully parenthesized syntax. Originally specified in 1958, Lisp is the second-oldest high-level programming language in widespread use today; only Fortran is older...

family of programming languages. They are typically represented in text by parenthesized, whitespace-separated sequences of character strings, as in `(= 4 (+ 2 2))`, which represents the Boolean
Boolean logic
Boolean algebra is a logical calculus of truth values, developed by George Boole in the 1840s. It resembles the algebra of real numbers, but with the numeric operations of multiplication xy, addition x + y, and negation −x replaced by the respective logical operations of...

expression commonly written as `4 (2 + 2)` in C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

and related programming languages.

In Lisp, the first element of every S-expression is an operator and any remaining elements are treated as data. This is called "prefix notation" or "Cambridge Polish notation
Polish notation
Polish notation, also known as prefix notation, is a form of notation for logic, arithmetic, and algebra. Its distinguishing feature is that it places operators to the left of their operands. If the arity of the operators is fixed, the result is a syntax lacking parentheses or other brackets that...

". Due to the close association between S-expressions and written representations of data, the term "S-expression" in casual conversation often refers to the written representation of an S-expression.

Other uses of S-expressions are in Lisp-derived languages such as DSSSL
Document Style Semantics and Specification Language
Document Style Semantics and Specification Language is a computer language for specifying stylesheets for SGML documents, based on a subset of the Scheme programming language. It is specified by the standard ISO/IEC 10179:1996...

, and as mark-up in communications protocols like IMAP
Internet Message Access Protocol
Internet message access protocol is one of the two most prevalent Internet standard protocols for e-mail retrieval, the other being the Post Office Protocol...

and John McCarthy
John McCarthy (computer scientist)
John McCarthy was an American computer scientist and cognitive scientist. He coined the term "artificial intelligence" , invented the Lisp programming language and was highly influential in the early development of AI.McCarthy also influenced other areas of computing such as time sharing systems...

's CBCL
The Common Business Communication Language is a communications language proposed by John McCarthy that foreshadowed much of XML. The language consists of a basic framework of hierarchical markup derived from S-expressions, coupled with some general principles about use and extensibility...

. The details of the syntax
Syntax
In linguistics, syntax is the study of the principles and rules for constructing phrases and sentences in natural languages....

and supported data types vary in the different languages, but the most common feature among these languages is the use of S-expressions and prefix notation.

In Lisp, S-expressions are used to store both source code and data (see McCarthy Recursive Functions of Symbolic Expressions). S-expressions were originally intended only for data to be manipulated by M-expression
M-expression
In computer programming, M-expressions were intended to be the expressions used to write functions in the Lisp programming language. Data to be manipulated using M-expressions was to be written using S-expressions...

s, but the first implementation of Lisp was an interpreter of S-expression encodings of M-expressions, and Lisp programmers soon became accustomed to using S-expressions for both code and data.
This means that Lisp is homoiconic, that is, the primary representation of programs is also a data structure in a primitive type of the language itself.
Definition
S-expressions are defined recursively
Recursion
Recursion is the process of repeating items in a self-similar way. For instance, when the surfaces of two mirrors are exactly parallel with each other the nested images that occur are a form of infinite recursion. The term has a variety of meanings specific to a variety of disciplines ranging from...

as either single data objects called "atoms" or lists of S-expressions. Numbers, arrays, character strings, and symbols
Symbol (Lisp)
A symbol in computer programming is a primitive datatype whose instances have a unique human-readable form. Symbols can be used as identifiers. In some programming languages, they are called atoms....

are all atoms.

An S-expression may be a list of other S-expressions, which may also be lists, so S-expressions may be arbitrarily nested to any depth. In Lisp, S-expression lists are constructed from small data structures called cons pairs, written as `(x . y)`. The first element of the cons (x) is the first element of the list, and the second element (y) is the remainder of the list. Longer lists are made up of nested cons pairs, for example
`(1 . (2 . (3 . nil)))`. The special symbol `nil` marks the end of the list. Cons-based lists are usually written more compactly as `(1 2 3)`. Nested lists can also be written as S-expressions: `((milk juice) (honey marmalade))` is a two-element S-expression whose elements are also two-element S-expressions. The whitespace-separated notation used in Lisp (and this article) is typical. Line breaks (newline characters) usually qualify as separators.

A non-atomic S-expression in which the second element is an atom other than `nil` is a "dotted pair" (e.g., `(a . b)`); a "dotted list" (e.g., `(a b c . d)`) is defined recursively as an s-expression in which the second element is either a dotted pair or another dotted list, while a "list" (e.g., `(a b c)`, which is equivalent to ` (a b c . nil)`) is defined as either the reserved atom `nil`, or a non-atomic s-expression in which the second element is a list.

### Example

This is a simple context-free grammar
Context-free grammar
In formal language theory, a context-free grammar is a formal grammar in which every production rule is of the formwhere V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals ....

written as an s-expression (Gazdar/Melish, Natural Language Processing in Lisp):

(((S) (NP) (VP))
((VP) (V))
((VP) (V) (NP))
((V) died)
((V) employed)
((NP) nurses)
((NP) patients)
((NP) Medicenter)
((NP) Dr Chan))

S-expressions in Lisp
Program code can be written in S-expressions, usually using prefix notation.

Example in Common Lisp
Common Lisp
Common Lisp, commonly abbreviated CL, is a dialect of the Lisp programming language, published in ANSI standard document ANSI INCITS 226-1994 , . From the ANSI Common Lisp standard the Common Lisp HyperSpec has been derived for use with web browsers...

:

(defun factorial (x)
(if (zerop x)
1
(* x (factorial (- x 1)))))

S-expressions can be read in Lisp using the function READ. READ reads the textual representation of an s-expression and returns Lisp data. The function PRINT can be used to output an s-expression. The output then can be read with the function READ, when all printed data objects have a readable representation. Lisp has readable representations for numbers, strings, symbols, lists and many other data types. Program code can be formatted as pretty printed S-expressions using the function PPRINT.

Lisp programs are valid s-expressions, but not all s-expressions are valid Lisp programs. `(1.0 + 3.1)` is a valid s-expression, but not a valid Lisp program, since Lisp uses prefix notation and a floating point number (here 1.0) is not valid as an operation (the first element of the expression).

An S-expression preceded by a single quotation mark, as in `'x`, is syntactic sugar
Syntactic sugar
Syntactic sugar is a computer science term that refers to syntax within a programming language that is designed to make things easier to read or to express....

for a quoted S-expression, in this case `(quote x)`.
Standardization
Standards for some Lisp-derived programming languages include a specification for their S-expression syntax. These include Common Lisp
Common Lisp
Common Lisp, commonly abbreviated CL, is a dialect of the Lisp programming language, published in ANSI standard document ANSI INCITS 226-1994 , . From the ANSI Common Lisp standard the Common Lisp HyperSpec has been derived for use with web browsers...

(ANSI standard document ANSI INCITS 226-1994 (R2004)), Scheme (R5RS and R6RS), and ISLISP
ISLISP
ISLISP is a programming language in the LISP family standardized by ISO working group ISO/IEC JTC 1/SC 22/WG 16 . The primary output of this working group was an International Standard, ISO/IEC 13816:1997, published by ISO. The standard was updated in 2007 and republished as ISO/IEC 13816:2007...

.

In May 1997, Ron Rivest
Ron Rivest
Ronald Linn Rivest is a cryptographer. He is the Andrew and Erna Viterbi Professor of Computer Science at MIT's Department of Electrical Engineering and Computer Science and a member of MIT's Computer Science and Artificial Intelligence Laboratory...

submitted an Internet-Draft to be considered for publication as an RFC
In computer network engineering, a Request for Comments is a memorandum published by the Internet Engineering Task Force describing methods, behaviors, research, or innovations applicable to the working of the Internet and Internet-connected systems.Through the Internet Society, engineers and...

. The draft defined a syntax based on Lisp S-expressions but intended for general-purpose data storage and exchange (similar to XML
XML
Extensible Markup Language is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards....

) rather than specifically for programming. It was never approved as an RFC, but it has since been cited and used by other RFCs (e.g. RFC 2693) and several other publications. It was originally intended for use in SPKI
Simple public key infrastructure
Simple public key infrastructure was born out of a joint effort to overcome the overcomplication and scalability problems of traditional X.509 public key infrastructure. It is specified in two Internet Engineering Task Force Request For Comments specifications—RFC 2692 and RFC 2693—from the IETF...

.

Rivest's format defines an S-expression as being either an octet-string (a series of byte
Byte
The byte is a unit of digital information in computing and telecommunications that most commonly consists of eight bits. Historically, a byte was the number of bits used to encode a single character of text in a computer and for this reason it is the basic addressable element in many computer...

s) or a finite list of other S-expressions. It describes three interchange formats for expressing this structure. One is the "advanced transport", which is very flexible in terms of formatting, and is syntactically similar to Lisp-style expressions, but they are not identical. The advanced transport, for example, allows octet-strings to be represented verbatim (the string's length followed by a colon and the entire raw string), a quoted form allowing escape characters, hexadecimal
In mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen...

, Base64
Base64
Base64 is a group of similar encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation...

, or placed directly as a "token" if it meets certain conditions. (Rivest's tokens differ from Lisp tokens in that the former are just for convenience and aesthetics, and treated exactly like other strings, while the latter have specific syntactical meaning.) Another interchange format, intended to be more compact, easier to parse, and unique for any abstract S-expression, is the "canonical representation" which only allows verbatim strings, and prohibits whitespace as formatting outside strings. Finally there is the "basic transport representation", which is either the canonical form or the same encoded as Base64 and surrounded by braces
Bracket
Brackets are tall punctuation marks used in matched pairs within text, to set apart or interject other text. In the United States, "bracket" usually refers specifically to the "square" or "box" type.-List of types:...

, the latter intended to safely transport a canonically-encoded S-expression in a system which might change spacing (e.g. an email system which has 80-character-wide lines and wraps anything longer than that).

This format has not been widely adapted for use outside of SPKI. Rivest's S-expressions web page provides C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

source code for a parser and generator, which could theoretically be adapted and embedded into other programs, though licensing
Copyright is a legal concept, enacted by most governments, giving the creator of an original work exclusive rights to it, usually for a limited time...

on these programs is unclear. However, there are no restrictions on independently implementing the format.