Delimiter



 
 
A delimiter is a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data stream. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values.

Delimiters represent one of various means to specify boundaries in a data stream. There are alternate means as well. Declarative notation, for example, is an alternate method that uses a length field at the start of a data stream to specify the number of characters that the data stream contains.

This article emphasizes the use of delimiters in computing. For more general treatment of delimiters in written human languages, see interword separation.
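
The contrast between delimited and declarative (length-prefixed) notation can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the `<length>:<payload>` layout is an assumed framing chosen for the example, applied per field rather than once per stream.

```python
def parse_delimited(stream, delimiter=","):
    """Split a stream into regions at each occurrence of the delimiter."""
    return stream.split(delimiter)


def encode_length_prefixed(fields):
    """Prefix each field with its character count and a colon (assumed layout)."""
    return "".join(f"{len(field)}:{field}" for field in fields)


def parse_length_prefixed(stream):
    """Read '<length>:<payload>' records until the stream is exhausted."""
    fields = []
    pos = 0
    while pos < len(stream):
        colon = stream.index(":", pos)       # end of the length field
        length = int(stream[pos:colon])      # number of characters that follow
        start = colon + 1
        fields.append(stream[start:start + length])
        pos = start + length
    return fields


encoded = encode_length_prefixed(["nancy", "$30,000"])
```

Because the payload length is declared up front, the payload may contain any character, including commas and colons, without being misread as a boundary.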

Overview

Delimiters can be broken down into:

  • Field and record delimiters; and
  • Bracket delimiters.


Field and record delimiters

Field delimiters separate data fields. Record delimiters separate groups of fields.

For example, the comma-separated values (CSV) file format uses a comma as the delimiter between fields, and an end-of-line indicator as the delimiter between records. For instance:

    fname,lname,age,salary
    nancy,davolio,33,$30000
    erin,borakova,28,$25250
    tony,raphael,35,$28700

specifies a simple flat file database table using the comma-separated values (CSV) file format.
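
As a minimal sketch, the table above can be read with Python's standard `csv` module, which treats the comma as the field delimiter and the line ending as the record delimiter. The names `TABLE` and `read_records` are chosen for this example only.

```python
import csv
import io

TABLE = """fname,lname,age,salary
nancy,davolio,33,$30000
erin,borakova,28,$25250
tony,raphael,35,$28700
"""


def read_records(text):
    """Return one dict per record, keyed by the field names in the header row."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(TABLE)
for record in records:
    print(record["fname"], record["salary"])
```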

Conventions


Computing platforms historically use certain delimiters by convention. The following tables depict just a few examples for comparison.

Programming languages (see also Comparison of programming languages (syntax)):

              String literal
    Pascal    singlequote
    C         doublequote, singlequote


Field and record delimiters (see also ASCII, Control character):

                           End of field               End of record                End of file
    Unix (and Mac OS X)    Tab                        LF
    Windows                Tab                        CRLF                         Control-Z (optional)
    Classic Mac OS         Tab                        CR
    Unicode                UNIT SEPARATOR             RECORD SEPARATOR             FILE SEPARATOR
                           Position 31 (U+001F)       Position 30 (U+001E)         Position 28 (U+001C)
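
The dedicated separator characters listed above can be used directly as field and record delimiters. The following Python sketch assumes a simple in-memory table of strings; the helper names `dump` and `load` are hypothetical.

```python
UNIT_SEPARATOR = "\x1f"     # U+001F, placed between fields
RECORD_SEPARATOR = "\x1e"   # U+001E, placed between records


def dump(rows):
    """Join fields with the unit separator and records with the record separator."""
    return RECORD_SEPARATOR.join(UNIT_SEPARATOR.join(row) for row in rows)


def load(text):
    """Reverse dump(): split into records first, then into fields."""
    return [record.split(UNIT_SEPARATOR) for record in text.split(RECORD_SEPARATOR)]


rows = [["nancy", "$30,000"], ["erin", "line one\nline two"]]
restored = load(dump(rows))
```

Since these control characters rarely occur in ordinary text, commas, tabs, and line breaks inside a field do not collide with the delimiters in this scheme.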


Bracket delimiters


Bracket delimiters (also block delimiters, region delimiters, balanced delimiters) mark both the start and end of a region of text. They are used in almost all programming languages, including Wikicode.

Common examples of bracket delimiters include:

    Delimiters    Description
    ( and )       Parentheses. The Lisp programming language syntax is cited as recognizable primarily from its use of parentheses.
    { and }       Curly brackets.
    [ and ]       Square brackets.
    < and >       Angle brackets.
    " and "       Commonly used to denote string literals.
    ' and '       Commonly used to denote string literals.
    <? and ?>     Used to indicate XML processing instructions.
    /* and */     Used to denote comments in some programming languages.
    {| and |}     Used to indicate a table in Wikicode.
    <% and %>     Used in some web templates to specify language boundaries. These are also called template delimiters.
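
Because bracket delimiters mark both the start and the end of a region, a common check is whether every opening delimiter is matched by the correct closing one. The sketch below is a simplified illustration: it handles only single-character parentheses, square brackets, and curly brackets, and ignores string literals and comments. The name `is_balanced` is hypothetical.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}   # closing delimiter -> opening delimiter
OPENERS = set(PAIRS.values())


def is_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)                 # remember the region we entered
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False                 # wrong or unexpected closer
    return not stack                         # leftover openers mean unclosed regions
```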

Delimiter collision

Delimiter collision is a problem that occurs when an author or programmer introduces delimiters into text without actually intending them to be interpreted as boundaries between separate regions. In the case of comma-separated values files, for example, this can occur whenever an author attempts to include a comma as part of a field value (e.g., salary = "$30,000"). In the case of XML, for example, this can occur whenever an author attempts to specify an angle bracket character.

In some contexts, a malicious user or attacker may seek to exploit this problem intentionally. Consequently, delimiter collision can be the source of security vulnerabilities and exploits. Malicious users can take advantage of delimiter collision in languages such as SQL and HTML to deploy such well-known attacks as SQL injection and cross-site scripting, respectively.
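
The comma-separated values case can be reproduced with a short Python sketch. The row below is an assumed example built around the "$30,000" salary mentioned above; the naive join-and-split approach splits the salary in two, whereas the standard `csv` module quotes the field so that it survives a round trip.

```python
import csv
import io

row = ["nancy", "davolio", "33", "$30,000"]

# Naive approach: join with commas, then split on commas again.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")     # the salary collides with the delimiter

# The csv module quotes any field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_line = buffer.getvalue().strip()
parsed_fields = next(csv.reader(io.StringIO(quoted_line)))

print(len(naive_fields), len(parsed_fields))
```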

Solutions


Because delimiter collision is a very common problem, various methods for avoiding it have been invented. Some authors may attempt to avoid the problem by choosing a delimiter character (or sequence of characters) that is not likely to appear in the data stream itself. This ad-hoc approach may be suitable, but it necessarily depends on a correct guess of what will appear in the data stream. Other, more formal conventions are therefore applied as well.

Escape character
One method for avoiding delimiter collision is to use escape characters. From a language design standpoint, these are adequate, but they have drawbacks:

  • text can be rendered unreadable when littered with numerous escape characters;
  • they require a mechanism to 'escape the escapes' when not intended as escape characters; and
  • although easy to type, they can be cryptic to someone unfamiliar with the language.
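
A minimal Python sketch of the escape-character approach follows, assuming a backslash as the escape character and a comma as the delimiter (both choices, and the function names, are made for this example). It also shows the "escape the escapes" requirement listed above: a literal backslash has to be doubled.

```python
ESCAPE = "\\"
DELIMITER = ","


def escape_field(field):
    """Escape the escape character first, then the delimiter."""
    return field.replace(ESCAPE, ESCAPE + ESCAPE).replace(DELIMITER, ESCAPE + DELIMITER)


def split_escaped(line):
    """Split on unescaped delimiters and drop the escape characters."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == ESCAPE:
            current.append(next(chars, ""))   # take the next character literally
        elif ch == DELIMITER:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


row = ["$30,000", "C:\\temp", "plain"]
line = DELIMITER.join(escape_field(field) for field in row)
```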


Escape sequence
Escape sequences are similar to escape characters, except they usually consist of some kind of mnemonic instead of just a single character. One use is in string literals that include a doublequote (") character. For example in Perl, the code:

print "Nancy said \x22Hello World!\x22 to the crowd."; ### use \x22

produces the same output as:

print "Nancy said \"Hello World!\" to the crowd."; ### use escape char

One drawback of escape sequences, when used by people, is the need to memorize the codes that represent individual characters (see also: character entity reference, numeric character reference).
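
For comparison, Python string literals accept several spellings of the same doublequote character: an escape character, a hexadecimal or Unicode code, and a named (mnemonic) escape. The variable names below are illustrative only.

```python
with_escape_char = "Nancy said \"Hello World!\" to the crowd."
with_hex_escape = "Nancy said \x22Hello World!\x22 to the crowd."
with_unicode_escape = "Nancy said \u0022Hello World!\u0022 to the crowd."
with_named_escape = "Nancy said \N{QUOTATION MARK}Hello World!\N{QUOTATION MARK} to the crowd."

# All four literals denote the same string.
print(with_escape_char == with_hex_escape == with_unicode_escape == with_named_escape)
```

The numeric forms require remembering that 22 (hexadecimal) stands for the doublequote, which is the memorization burden noted above; the named form trades that for length.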

Dual quoting delimiters
In contrast to escape sequences and escape characters, dual delimiters provide yet another way to avoid delimiter collision. Some languages, for example, allow the use of either a singlequote (') or a doublequote (") to specify a string literal. For example in Perl:

print 'Nancy said "Hello World!" to the crowd.';

produces the desired output without requiring escapes. This approach, however, only works when the string does not contain both types of quotation marks.

Padding quoting delimiters
In contrast to escape sequences and escape characters, padding delimiters provide yet another way to avoid delimiter collision. Visual Basic, for example, uses double quotes as string delimiters and represents a literal double quote inside a string by doubling it. This is similar to escaping the delimiter.

print "Nancy said ""Hello World!"" to the crowd."

produces the desired output without requiring escapes. Like regular escaping, however, it can become confusing when many quotes are used. The code to print the above source code looks more confusing:

print "print ""Nancy said """"Hello World!"""" to the crowd."""
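
The doubling rule is mechanical enough to sketch in Python. The helpers below are hypothetical and only mimic the quoting convention; they are not part of any Visual Basic tooling. Applying the quoting step twice reproduces the nested example above.

```python
def quote_doubled(text):
    """Wrap text in double quotes, doubling every embedded double quote."""
    return '"' + text.replace('"', '""') + '"'


def unquote_doubled(literal):
    """Reverse quote_doubled(): strip the outer quotes and undouble the rest."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a double-quoted literal")
    return literal[1:-1].replace('""', '"')


inner = "print " + quote_doubled('Nancy said "Hello World!" to the crowd.')
outer = "print " + quote_doubled(inner)
```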

Multiple quoting delimiters
In contrast to dual delimiters, multiple delimiters are even more flexible for avoiding delimiter collision.

For example in Perl:

print qq^Nancy doesn't want to say "Hello World!" anymore.^;
print qq@Nancy doesn't want to say "Hello World!" anymore.@;
print qq§Nancy doesn't want to say "Hello World!" anymore.§;

all produce the desired output through use of the quotelike operator, which allows any convenient character to act as a delimiter. Although this method is more flexible, few languages support it. Perl and Ruby are two that do.

Content boundary
A content boundary is a special type of delimiter that is specifically designed to resist delimiter collision. It works by allowing the author to specify a long sequence of characters that is guaranteed to always indicate a boundary between parts in a multi-part message, with no other possible interpretation.

This is usually done by specifying a random sequence of characters followed by an identifying mark such as a UUID, a timestamp, or some other distinguishing mark (see, e.g., MIME and here documents).
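
The idea can be sketched in Python with a randomly generated boundary. This is a simplified, MIME-like framing invented for illustration, not an implementation of the MIME standard; the function names and the `boundary-` prefix are assumptions.

```python
import uuid


def make_boundary(parts):
    """Pick a random marker, retrying in the unlikely case a part contains it."""
    while True:
        boundary = "boundary-" + uuid.uuid4().hex
        if not any(boundary in part for part in parts):
            return boundary


def join_parts(parts):
    """Frame each part with '--<boundary>' lines and close with '--<boundary>--'."""
    boundary = make_boundary(parts)
    marker = "--" + boundary
    body = "".join(marker + "\n" + part + "\n" for part in parts)
    return boundary, body + marker + "--\n"


def split_parts(boundary, body):
    """Recover the original parts from a body produced by join_parts()."""
    marker = "--" + boundary
    closing = marker + "--\n"
    if not body.endswith(closing):
        raise ValueError("missing closing boundary")
    chunks = body[: -len(closing)].split(marker + "\n")
    return [chunk[:-1] for chunk in chunks[1:]]   # drop the newline added per part


parts = ["a,b\n--x", 'second "part" <with> delimiters']
boundary, body = join_parts(parts)
```

The parts may contain commas, quotes, angle brackets, and line breaks, because none of them can be mistaken for the long random boundary.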

Whitespace or indentation
Some programming and computer languages allow the use of whitespace delimiters or indentation as a means of specifying boundaries between independent regions in text. Python and YAML are prominent examples.

Regular expression syntax


In specifying a regular expression, alternate delimiters may also be used to simplify the syntax for match and substitution operations in Perl.

For example, a simple match operation may be specified in Perl with the following syntax:

$string1 = 'Nancy said "Hello World!" to the crowd.';    # specify a target string
print $string1 =~ m/[aeiou]+/;                           # match one or more vowels

The syntax is flexible enough to specify match operations with alternate delimiters, making it easy to avoid delimiter collision:

$string1 = 'Nancy said "httq://Hello/World.htm" is not a valid address.'; # target string

print $string1 =~ m@httq://@;    # match using alternate regular expression delimiter
print $string1 =~ m{httq://};    # same as previous, but different delimiter
print $string1 =~ m!httq://!;    # same as previous, but different delimiter

ASCII armor

Although principally used as a mechanism for text encoding of binary data, "ASCII armoring" is a programming and systems administration technique that also helps to avoid delimiter collision in some circumstances. This technique contrasts with the other approaches described above because it is more complicated, and therefore not suitable for small applications and simple data storage formats. The technique employs a special encoding scheme, such as base64, to ensure that delimiter characters do not appear in transmitted data.

This technique is used, for example, in Microsoft's ASP.NET web development technology, and is closely associated with the "VIEWSTATE" component of that system.

Example

The following simplified example demonstrates how this technique works in practice.

The first code fragment shows a simple HTML tag in which the VIEWSTATE value contains characters that are incompatible with the delimiters of the HTML tag itself:



This first code fragment is not well-formed, and would therefore not work properly in a "real world" deployed system.

In contrast, the second code fragment shows the same HTML tag, except this time incompatible characters in the VIEWSTATE value are removed through the application of base64 encoding:



This prevents delimiter collision and ensures that incompatible characters will not appear inside the HTML code, regardless of what characters appear in the original (decoded) text.
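
The effect can be reproduced with Python's standard `base64` module. The sample value, the `__VIEWSTATE` field name, and the `hidden_field` helper are assumptions made for this sketch, not fragments taken from a real ASP.NET page.

```python
import base64

# Hypothetical value containing characters that collide with HTML attribute quoting.
view_state = 'BookTitle:Nancy doesn\'t say "Hello World!" anymore.'


def hidden_field(value):
    """Build a hidden input tag; the value is inserted without any escaping."""
    return '<input type="hidden" name="__VIEWSTATE" value="' + value + '" />'


# Unsafe: the quotes inside the value terminate the attribute early.
unsafe_tag = hidden_field(view_state)

# Armored: base64 output uses only A-Z, a-z, 0-9, '+', '/' and '='.
encoded = base64.b64encode(view_state.encode("utf-8")).decode("ascii")
safe_tag = hidden_field(encoded)

# The receiver decodes the value to recover the original text.
decoded = base64.b64decode(encoded).decode("utf-8")
```

The trade-off is that the encoded value is longer than the original and is no longer human-readable.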

See also

  • Delimiter-separated values
  • String literal
  • CamelCase (used in WikiWikiWeb as an alternate method of link creation that does not require delimiters to indicate links)
  • Federal Standard 1037C (contains a simple definition for "delimiter")
  • Naming collision
  • Sigil