String literal

 



A string literal is the representation of a string value within the source code of a computer program. There are numerous alternate notations for specifying string literals, and the exact notation depends on the individual programming language in question. Nevertheless, there are some general guidelines that most modern programming languages follow.

Specifically, most string literals can be specified using:

  • declarative notation;
  • whitespace delimiters (indentation);
  • bracketed delimiters (quoting);
  • escape characters; or
  • a combination of some or all of the above


Declarative notation


In the original FORTRAN programming language (for example), string literals were written in so-called Hollerith notation, where a decimal count of the number of characters was followed by the letter H, and then the characters of the string:

27HAn example Hollerith string

This declarative notation style is contrasted with bracketed delimiter quoting, because it does not require the use of balanced "bracketed" characters on either side of the string.

Advantages:
  • eliminates text searching (for the delimiter character) and therefore requires significantly less overhead
  • avoids the (100% programmer induced) problem of delimiter collision
  • enables the inclusion of metacharacters that might otherwise be mistaken as commands
  • can be used for quite effective data compression of plain text strings


Drawbacks:
  • this type of notation is error-prone if used as manual entry by programmers
This is, however, not a drawback when the prefix is generated by an algorithm, as is most likely the case.
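The point about algorithmically generated prefixes can be illustrated with a minimal Python sketch. The function names `to_hollerith` and `parse_hollerith` are hypothetical, chosen for this illustration only; the sketch models the count-then-H layout described above and is not a full FORTRAN lexer.

```python
def to_hollerith(text: str) -> str:
    """Encode text as a Hollerith constant: decimal length, 'H', then the characters."""
    return f"{len(text)}H{text}"


def parse_hollerith(source: str) -> tuple[str, str]:
    """Read one Hollerith constant from the start of source.

    Returns (string value, remaining source). No delimiter search is needed:
    the count says exactly how many characters to take.
    """
    digits = 0
    while digits < len(source) and source[digits] in "0123456789":
        digits += 1
    if digits == 0 or digits >= len(source) or source[digits] not in "Hh":
        raise ValueError("not a Hollerith constant")
    count = int(source[:digits])
    start = digits + 1
    if start + count > len(source):
        raise ValueError("count exceeds available characters")
    return source[start:start + count], source[start + count:]


encoded = to_hollerith("An example Hollerith string")
# The trailing text contains 'H' and ')' characters; they cause no collision.
value, rest = parse_hollerith(encoded + ", 3HH)H")
```

Because the length is computed rather than typed, the count cannot drift out of step with the text, which is the error the drawback above refers to.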

Whitespace delimiters


In YAML, string literals may be specified by the relative positioning of whitespace and indentation.

- title: An example multi-line string in YAML
  body : |
    This is a multi-line string.
    "special" metacharacters may appear here.
    The extent of this string is indicated by indentation.

Bracketed delimiters


Most modern programming languages use bracket delimiters (also balanced delimiters, or quoting) to specify string literals. Double quotes are the most common quoting delimiters used:

"Hi There!"

Some languages also allow the use of single quotes as an alternative to double quotes (though the string must begin and end with the same kind of quotation mark):

'Hi There!'

Note that these quotation marks are unpaired (the same character is used as an opener and a closer), which is a hangover from the typewriter technology which was the precursor of the earliest computer input and output devices. The Unicode character set includes paired (separate opening and closing) versions of both single and double quotes:

“Hi There!”
‘Hi There!’

The paired double quotes can be used in Visual Basic .NET.

The PostScript programming language uses parentheses, with embedded newlines allowed, and also embedded unescaped parentheses provided they are properly paired:

(The quick (brown fox))

Similarly, the Tcl programming language uses braces (embedded newlines allowed, embedded unescaped braces allowed provided properly paired):

{The quick {brown fox}}

This practice derives on one hand from the single quotes in Unix shells (these are raw strings) and on the other from the use of braces in C for compound statements, since blocks of code are in Tcl syntactically the same thing as string literals. That the delimiters are paired is essential for making this feasible.
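Why pairing matters can be seen in a small scanner sketch in Python. The function `read_balanced` is a hypothetical helper, and the sketch assumes single-character delimiters and ignores escape handling; it is not how any particular PostScript or Tcl implementation works.

```python
def read_balanced(source: str, opener: str = "(", closer: str = ")") -> tuple[str, str]:
    """Read a literal such as (The quick (brown fox)) from the start of source.

    Returns (contents without the outer delimiters, remaining source).
    Escape handling is deliberately omitted from this sketch.
    """
    if not source.startswith(opener):
        raise ValueError("literal must start with the opening delimiter")
    depth = 0
    for index, char in enumerate(source):
        if char == opener:
            depth += 1          # a nested opener must be matched before the literal ends
        elif char == closer:
            depth -= 1
            if depth == 0:      # the closer that matches the very first opener
                return source[1:index], source[index + 1:]
    raise ValueError("unbalanced delimiters")


postscript_value, postscript_rest = read_balanced("(The quick (brown fox)) show")
tcl_value, _ = read_balanced("{The quick {brown fox}}", "{", "}")
```

With an unpaired delimiter such as a double quote there is no way to tell an opener from a closer, so this depth counting would not be possible.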

Delimiter collision


Delimiter collision is a common problem for string literal notations that use balanced delimiters and quoting. The problem occurs when a programmer attempts to use a quoting character as part of the string literal itself. Because this is a very common problem, a number of methods for avoiding delimiter collision have been invented.

Dual quoting style


Some languages (e.g. Modula-2, JavaScript) attempt to avoid the delimiter collision problem by allowing a dual quoting style. Typically, this consists of allowing the programmer to use either single quotes or double quotes interchangeably.

"This is John's apple."
'I said, "Can you hear me?"'

One problem with dual quoting is that it doesn't allow for the inclusion of both styles of quotes at once within the same literal (unless escaped, see below).
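Python also accepts either kind of quote, so it can serve for a brief illustration of both the convenience and the limitation. The variable names are arbitrary.

```python
# Each literal uses the quote character that does not occur in its text.
possessive = "This is John's apple."
question = 'I said, "Can you hear me?"'

# When both kinds of quote occur in the text, dual quoting alone is not
# enough: one of them has to be escaped.
both = 'John\'s reply was "yes".'
```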

Some programming languages allow subtle variations on dual quoting, treating single quotes and double quotes slightly differently (e.g. sh, Perl).

Escape character


One method for avoiding delimiter collision is to use escape characters:

"I said, \"Can you hear me?\""

The most commonly used escape character for this purpose is the backslash "\", the tradition for which originated on Unix. From a language design standpoint, this approach is adequate, but there are drawbacks:

  • text can be rendered unreadable when littered with numerous escape characters
  • escape characters are required to be escaped, when not intended as escape characters
  • although easy to type, they can be cryptic to someone unfamiliar with the language


"I said, \"The Windows path is C:\\Foo\\Bar\\Baz\""

The confusing presence of too many escape and slash characters in a string is commonly disparaged as leaning toothpick syndrome.
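The growth in backslashes can be made concrete with a short Python sketch. The helper `escape_for_double_quotes` is hypothetical and handles only the backslash and the double quote; real languages define more escapes than this.

```python
def escape_for_double_quotes(text: str) -> str:
    """Produce a double-quoted literal for text.

    Backslashes are escaped first, so that the backslashes added for the
    quote characters are not escaped a second time.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


plain_text = 'I said, "The Windows path is C:\\Foo\\Bar\\Baz"'
literal = escape_for_double_quotes(plain_text)
```

Here `plain_text` holds three backslashes, while `literal` needs one extra for each of them plus one for each embedded quote.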

Escape sequence


An extended concept of the escape character, an escape sequence is also a means of avoiding delimiter collision. An escape sequence consists of two or more consecutive characters that can have special meaning when used in the context of a string literal.

"I said, \x22Can you hear me?\x22"

Escape sequences can also be used for purposes other than avoiding delimiter collision, and can also include metacharacters (see Metacharacters below).

Double-up escape sequence


Some languages (such as Pascal, BASIC and DCL) avoid delimiter collision by doubling up on the quotation marks that are intended to be part of the string literal itself:

  'This Pascal string''contains two apostrophes'''
  "I said, ""Can you hear me?"""

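The doubling rule is mechanical enough to sketch in a few lines of Python. The names `double_up` and `undouble` are hypothetical, and `undouble` assumes its input is already a well-formed literal; it does not report a stray single quote inside the text.

```python
def double_up(text: str, quote: str = "'") -> str:
    """Wrap text in quote characters, doubling every quote inside it."""
    return quote + text.replace(quote, quote * 2) + quote


def undouble(literal: str, quote: str = "'") -> str:
    """Recover the text of a doubled-quote literal (assumed well formed)."""
    if len(literal) < 2 or literal[0] != quote or literal[-1] != quote:
        raise ValueError("literal must start and end with the quote character")
    return literal[1:-1].replace(quote * 2, quote)


pascal_style = double_up("This Pascal string'contains two apostrophes'")
basic_style = double_up('I said, "Can you hear me?"', quote='"')
```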

Extended quoting styles

Some languages extend the previously mentioned quoting conventions even further. These extended approaches provide an even more flexible style of notation for avoiding delimiter collision.

Triple quoting: One such extension, the use of triple quoting, is used in Python:

'''This is John's apple.'''

"""John is Nancy's so-called "boyfriend"."""

Triple quoted string literals may be delimited by """ or '''. Triple quoting in Python also has the added benefit of allowing string literals to span more than one physical line of source code.
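A short Python sketch shows both properties together; the variable names are arbitrary.

```python
# Single and double quotes can appear unescaped inside a triple-quoted literal.
single_inside = '''This is John's apple.'''
both_inside = """John is Nancy's so-called "boyfriend"."""

# A triple-quoted literal may span several physical lines; the line break
# becomes a newline character in the string value.
multi_line = """First line
second line"""
```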

Multiple quoting: Another such extension is the use of multiple quoting, which allows the author to choose which characters should specify the bounds of a string literal.

For example in Perl:

qq^I said, "Can you hear me?"^

qq@I said, "Can you hear me?"@

qq§I said, "Can you hear me?"§

all produce the desired result. Although this notation is more flexible, few languages support it. Perl and Ruby are two that do.

Here documents


A here document is an alternate quoting notation that allows the programmer to specify an arbitrary unique identifier as a content boundary for a string literal. This avoids delimiter collision, and also preserves newlines in the source code as newlines in the string literal itself.
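Python has no here documents, so the following is only a sketch of how a reader for them might work. It assumes a shell-like `<<TAG` marker on the introducing line and a closing line consisting of the tag alone; the function name `read_here_document` and the sample script are hypothetical.

```python
def read_here_document(lines: list[str], start: int) -> tuple[str, int]:
    """Read the here document introduced on lines[start].

    Returns (string value, index of the first line after the boundary).
    Every body line is kept verbatim and ends with a newline in the value.
    """
    marker = "<<"
    header = lines[start]
    position = header.find(marker)
    if position == -1:
        raise ValueError("no here-document marker on this line")
    tag = header[position + len(marker):].strip()
    if not tag:
        raise ValueError("missing boundary identifier")
    body = []
    for index in range(start + 1, len(lines)):
        if lines[index] == tag:     # the boundary line ends the literal
            return "".join(line + "\n" for line in body), index + 1
        body.append(lines[index])
    raise ValueError(f"boundary {tag!r} never found")


script = [
    "cat <<END_OF_TEXT",
    'She said, "It\'s here."',
    "Backslashes \\ and $signs are plain text in this sketch.",
    "END_OF_TEXT",
    "echo done",
]
text, next_line = read_here_document(script, 0)
```

Since the programmer picks the boundary identifier, it can always be chosen so that it does not occur as a line of the text.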

Metacharacters

Many languages support the use of metacharacters inside string literals. Metacharacters have varying interpretations depending on the context and language, but are generally a kind of 'processing command' for representing printing or nonprinting characters.

For instance, in a C string literal, if the backslash is followed by a letter such as "b", "n" or "t", then this represents a nonprinting backspace, newline or tab character respectively. Or if the backslash is followed by one to three octal digits, then this sequence is interpreted as representing the arbitrary character with the specified ASCII code. This was later extended to allow more modern hexadecimal character code notation:

"I said,\t\t\x22Can you hear me?\x22\n"

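Python string literals accept the same backslash notations, so the example can be checked there. This sketch shows Python's behaviour, not C's, although the escapes used here coincide.

```python
text = "I said,\t\t\x22Can you hear me?\x22\n"

# The same double quote character, written in octal and in hexadecimal.
octal_quote = "\042"
hex_quote = "\x22"

# Letter escapes each stand for a single nonprinting character.
backspace, newline, tab = "\b", "\n", "\t"
```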
Raw strings

A few languages provide a method of specifying that a literal is to be processed without any language-specific interpretation.

For example, in Python 'raw strings' are preceded by an r. In such strings backslashes are not interpreted as escape sequences, making it simpler to write DOS/Windows paths and regular expressions:

r"The Windows path is C:\Foo\Bar\Baz\ "

C#'s notation is called @-quoting:

@"C:\Foo\Bar\Baz\"

which also allows double-up quotes:

@"I said, ""Hello there."""
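The Python case can be sketched as follows; the variable names are arbitrary and the regular expression is only an illustration.

```python
import re

# The same path written with escapes and as a raw string.
escaped_path = "C:\\Foo\\Bar\\Baz"
raw_path = r"C:\Foo\Bar\Baz"

# In a raw string, \n is two characters (a backslash and an n), not a newline.
two_characters = r"\n"

# Regular expressions benefit as well: in a raw string, \b reaches the
# regex engine as a word boundary instead of becoming a backspace.
word_boundary = re.compile(r"\bBaz\b")
match = word_boundary.search(raw_path)
```

A Python raw string cannot end in a single backslash, which is why the Python example above places a space before the closing quote.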

In XML documents, CDATA sections allow use of characters such as & and < without an XML parser attempting to interpret them as part of the structure of the document itself. This can be useful when including literal text and scripting code, to keep the document well formed.

Variable interpolation

Languages differ on whether and how to interpret string literals as either 'raw' or 'variable interpolated'. Variable interpolation is the process of evaluating an expression containing one or more variables, and returning output where the variables are replaced with their corresponding values in memory. In sh-compatible Unix shells, quote-delimited (") strings are interpolated, while apostrophe-delimited (') strings are not.

For example, the following Perl code:

$sName = "Nancy";
$sGreet = "Hello World";
print "$sName said $sGreet to the crowd of people.";

produces the output:

Nancy said Hello World to the crowd of people.

The sigil character ($) is interpreted to indicate variable interpolation.

Similarly, the printf function produces the same output using notation such as:

printf "%s said %s to the crowd of people.", ($sName,$sGreet);

The metacharacters (%s) indicate variable interpolation.

This is contrasted with "raw" strings:

print '$sName said $sGreet to the crowd of people.';

which produces output like:

$sName said $sGreet to the crowd of people.

Here the $ characters are not sigils, and are not interpreted to have any meaning other than plain text.
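For comparison, the same three cases can be sketched in Python, where interpolation is requested by an f prefix on the literal rather than by the choice of quote character. The variable names mirror the Perl example but are otherwise arbitrary.

```python
name = "Nancy"
greeting = "Hello World"

# Interpolated literal: the braces are replaced by the variables' values.
interpolated = f"{name} said {greeting} to the crowd of people."

# printf-style formatting: %s placeholders are filled from the tuple.
formatted = "%s said %s to the crowd of people." % (name, greeting)

# Ordinary literal: without the f prefix the braces are plain text.
plain = "{name} said {greeting} to the crowd of people."
```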

Binary and hexadecimal strings


REXX uses suffix characters to specify characters or strings using their hexadecimal or binary code. E.g.,

'20'x
"0010 0000"b
"00100000"b
all yield the space character, avoiding the function call X2C(20).
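What such suffixed literals denote can be sketched in Python. The helpers `from_hex` and `from_binary` are hypothetical, and the sketch is simplified: it requires whole bytes and does not reproduce REXX's exact rules for grouping and padding digits.

```python
def from_hex(digits: str) -> str:
    """Interpret pairs of hexadecimal digits as character codes."""
    return bytes.fromhex(digits).decode("latin-1")


def from_binary(digits: str) -> str:
    """Interpret groups of eight binary digits as character codes."""
    bits = digits.replace(" ", "")
    if len(bits) % 8 != 0:
        raise ValueError("this sketch expects a whole number of bytes")
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))


space_from_hex = from_hex("20")
space_from_binary = from_binary("0010 0000")
```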

Embedding source code in string literals


Languages that lack flexibility in specifying string literals make it particularly cumbersome to write programming code that generates other programming code. This is particularly true when the generation language is the same as or similar to the output language.

For example:
  • writing code to produce quines
  • generating an output language from within a web template
  • using XSLT to generate XSLT, or SQL to generate more SQL
  • generating a PostScript representation of a document for printing purposes, from within a document-processing application written in C or some other language


Nevertheless, some languages are particularly well-adapted to produce this sort of self-similar output, especially those that support multiple options for avoiding delimiter collision.

Using string literals as code that generates other code may have adverse security implications, especially if the output is based at least partially on untrusted user input. This is particularly acute in the case of Web-based applications, where malicious users can take advantage of such weaknesses to subvert the operation of the application, for example by mounting an SQL injection attack.
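The risk can be demonstrated with a minimal sketch using Python's standard sqlite3 module and an in-memory database. The table, its rows and the `untrusted` value are invented for this illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (name TEXT, secret TEXT)")
connection.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

# Hypothetical input from an untrusted source.
untrusted = "nobody' OR '1'='1"

# Unsafe: the input is pasted into an SQL string literal, so its quote
# characters end the literal early and the remainder is parsed as SQL.
unsafe_query = "SELECT name FROM users WHERE name = '" + untrusted + "'"
unsafe_rows = connection.execute(unsafe_query).fetchall()

# Safer: a placeholder passes the input as data; it never becomes SQL text.
safe_rows = connection.execute(
    "SELECT name FROM users WHERE name = ?", (untrusted,)
).fetchall()
```

The concatenated query matches every row because the injected condition is always true, whereas the parameterized query looks for a user literally named with the whole input and finds none. This is a delimiter collision exploited on purpose.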

See also



  • Sigil (computer programming)