String (computer science)

 



In computer programming and some branches of mathematics, a string is an ordered sequence of symbols. These symbols are chosen from a predetermined set or alphabet.

In computer programming, a string is generally understood as a data type storing a sequence of data values, usually bytes, in which elements usually stand for characters according to a character encoding, which differentiates it from the more general array data type. In this context, the terms binary string and byte string are used to suggest strings in which the stored data does not (necessarily) represent text.

A variable declared to have a string data type usually causes storage to be allocated in memory that is capable of holding some predetermined number of symbols. When a string appears literally in source code, it is known as a string literal and has a representation that denotes it as such.



Formal theory

Let S be an alphabet, a non-empty finite set. Elements of S are called symbols or characters. A string (or word) over S is any finite sequence of characters from S. For example, if S = {0, 1}, then 0101 is a string over S.

The length of a string is the number of characters in the string (the length of the sequence) and can be any non-negative integer. The empty string is the unique string over S of length 0, and is denoted ε or λ.

The set of all strings over S of length n is denoted S^n. For example, if S = {0, 1}, then S^2 = {00, 01, 10, 11}. Note that S^0 = {ε}, the set containing only the empty string, for any alphabet S.

The set of all strings over S of any length is the Kleene closure of S and is denoted S*. In terms of S^n, S* is the union of S^n over all n ≥ 0, that is, S* = S^0 ∪ S^1 ∪ S^2 ∪ …. For example, if S = {0, 1}, S* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, …}, where ε is the empty string. Although S* itself is countably infinite, all elements of S* have finite length.
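
As a rough illustration of S^n and S*, the following Python sketch enumerates strings over a small alphabet. The function names `strings_of_length` and `kleene_star` are made up for this example. Since S* is infinite, it is exposed as a generator and only a finite prefix is ever taken.

```python
from itertools import count, islice, product

def strings_of_length(alphabet, n):
    """Return every string of length n over the alphabet (the set S^n)."""
    return ["".join(chars) for chars in product(alphabet, repeat=n)]

def kleene_star(alphabet):
    """Yield the elements of S*, shortest strings first."""
    for n in count(0):
        yield from strings_of_length(alphabet, n)

S = ["0", "1"]
print(strings_of_length(S, 2))          # ['00', '01', '10', '11']
print(strings_of_length(S, 0))          # [''], only the empty string
print(list(islice(kleene_star(S), 7)))  # ['', '0', '1', '00', '01', '10', '11']
```

Enumerating by increasing length is one possible order; any order that eventually reaches every finite string would do.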

A set of strings over S (i.e. any subset of S*) is called a formal language over S. For example, if S = {0, 1}, the set of strings with an even number of zeros is a formal language over S.
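
In code, a formal language can be modelled as a membership predicate on strings. The sketch below, using the hypothetical name `has_even_zeros`, tests membership in the language of binary strings with an even number of zeros.

```python
def has_even_zeros(s):
    """Membership test for binary strings containing an even number of zeros."""
    if any(ch not in "01" for ch in s):
        raise ValueError("not a string over the alphabet {0, 1}")
    return s.count("0") % 2 == 0

candidates = ["", "1", "0", "00", "010", "0111"]
print([s for s in candidates if has_even_zeros(s)])  # ['', '1', '00', '010']
```

The empty string belongs to the language because zero is an even number.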

Concatenation and substrings

Concatenation is an important binary operation on S*. For any two strings s and t in S*, their concatenation is defined as the sequence of characters in s followed by the sequence of characters in t, and is denoted st. For example, if S = {a, b, …, z}, s = bear, and t = hug, then st = bearhug and ts = hugbear.

String concatenation is an associative, but non-commutative, operation. The empty string serves as the identity element; writing ε for the empty string, εs = sε = s for any string s. Therefore, the set S* and the concatenation operation form a monoid, the free monoid generated by S. In addition, the length function defines a monoid homomorphism from S* to the non-negative integers under addition.
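
The monoid laws can be spot-checked on a few sample strings. This is only a sanity check on a small, arbitrarily chosen list called `words`, not a proof; Python's `+` on `str` plays the role of concatenation and `""` the role of the empty string.

```python
from itertools import product

words = ["", "bear", "hug", "s"]

# Associativity: (st)u == s(tu) for every choice of s, t, u.
assert all((s + t) + u == s + (t + u) for s, t, u in product(words, repeat=3))

# Identity: the empty string leaves any string unchanged on either side.
assert all("" + s == s == s + "" for s in words)

# Non-commutativity: order matters in general.
assert "bear" + "hug" != "hug" + "bear"

# Length is a homomorphism: len(st) == len(s) + len(t), and len("") == 0.
assert all(len(s + t) == len(s) + len(t) for s, t in product(words, repeat=2))
assert len("") == 0
```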

A string s is said to be a substring or factor of t if there exist (possibly empty) strings u and v such that t = usv. The relation "is a substring of" defines a partial order on S*, the least element of which is the empty string.

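The definition t = usv can be made concrete by listing every way a string s occurs inside t. The helper name `decompositions` is an assumption made for this sketch; it relies only on the standard `str.find` method.

```python
def decompositions(s, t):
    """Return all pairs (u, v) with t == u + s + v."""
    pairs = []
    start = t.find(s)
    while start != -1:
        pairs.append((t[:start], t[start + len(s):]))
        start = t.find(s, start + 1)
    return pairs

print(decompositions("an", "banana"))  # [('b', 'ana'), ('ban', 'a')]
print(decompositions("", "ab"))        # [('', 'ab'), ('a', 'b'), ('ab', '')]
print(decompositions("x", "banana"))   # []
```

A non-empty result means s is a substring of t. The empty string always yields at least one decomposition, which matches its role as the least element.
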
Lexicographical ordering

It is often necessary to define an ordering on the set of strings. If the alphabet S has a total order (cf. alphabetical order), one can define a total order on S* called lexicographical order. Since S is finite, it can always be given a total order (indeed a well-ordering), so a lexicographical order on S* can always be defined. The lexicographical order on S* is itself not a well-ordering when S has more than one symbol: with 0 < 1, the chain 1 > 01 > 001 > … descends forever. For example, if S = {0, 1} and 0 < 1, then the lexicographical ordering of S* is ε < 0 < 00 < 000 < … < 011 < 0110 < … < 01111 < … < 1 < 10 < 100 < … < 101 < … < 111 …, where ε is the empty string.
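
Python compares `str` values lexicographically by code point, so for strings over {0, 1} the built-in ordering coincides with the lexicographical order for 0 < 1. The sketch below sorts all binary strings of length at most 3 and also checks the start of a strictly descending chain.

```python
from itertools import product

# All binary strings of length at most 3, including the empty string.
words = ["".join(p) for n in range(4) for p in product("01", repeat=n)]

print(sorted(words))
# ['', '0', '00', '000', '001', '01', '010', '011',
#  '1', '10', '100', '101', '11', '110', '111']

# The chain 1 > 01 > 001 > 0001 keeps descending as zeros are prepended.
chain = ["0" * k + "1" for k in range(4)]
assert all(a > b for a, b in zip(chain, chain[1:]))
```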

String operations

A number of additional operations on strings commonly occur in the formal theory. These are given in the article on string operations.

Topology

Strings admit the following interpretation as nodes on a graph:
  • Fixed length strings can be viewed as nodes on a hypercube;
  • Variable length strings (of finite length) can be viewed as nodes on the k-ary tree, where k is the number of symbols in S;
  • Infinite strings can be viewed as infinite paths on the k-ary tree.
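
A small sketch of the first two interpretations, for the binary alphabet. It assumes the usual reading in which two fixed length binary strings are adjacent on the hypercube when they differ in exactly one position, and in which the children of a tree node are its one-symbol extensions. Both function names are hypothetical.

```python
def hypercube_neighbours(word):
    """Binary strings that differ from word in exactly one position."""
    flip = {"0": "1", "1": "0"}
    return [word[:i] + flip[ch] + word[i + 1:] for i, ch in enumerate(word)]

def tree_children(word, alphabet):
    """Children of a node in the k-ary tree: the word extended by one symbol."""
    return [word + a for a in alphabet]

print(hypercube_neighbours("010"))   # ['110', '000', '011']
print(tree_children("01", "01"))     # ['010', '011']
print(tree_children("", "01"))       # ['0', '1'], the root is the empty string
```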


The natural topology on the set of fixed length strings or variable length strings is the discrete topology, but the natural topology on the set of infinite strings is the limit topology, viewing the set of infinite strings as the inverse limit of the sets of finite strings. This is the construction used for the p-adic numbers and some constructions of the Cantor set, and yields the same topology.

String datatypes

A string datatype is a datatype modeled on the idea of a formal string. Strings are such an important and useful datatype that they are implemented in nearly every programming language. In some languages they are available as primitive types and in others as composite types. The syntax of most high-level programming languages allows for a string, usually quoted in some way, to represent an instance of a string datatype; such a meta-string is called a literal or string literal.

String length

Although formal strings can have an arbitrary (but finite) length, the length of strings in real languages is often constrained to an artificial maximum. In general, there are two types of string datatypes: fixed length strings, which have a fixed maximum length and use the same amount of memory whether this maximum is reached or not, and variable length strings, whose length is not arbitrarily fixed and which use varying amounts of memory depending on their actual size. Most strings in modern programming languages are variable length strings. Despite the name, even variable length strings are limited in length, although the limit generally depends only on the amount of memory available.

Character encoding

Historically, string datatypes allocated one byte per character. Although the exact character set varied by region, character encodings were similar enough that programmers could generally get away with ignoring this: groups of character sets used by the same system in different regions usually either had a character in the same place, or did not have it at all. These character sets were typically based on ASCII or EBCDIC.

Logographic languages such as Chinese, Japanese, and Korean (known collectively as CJK) need far more than 256 characters (the limit of an encoding that uses one 8-bit byte per character) for reasonable representation. The normal solutions involved keeping single-byte representations for ASCII and using two-byte representations for CJK ideographs. Use of these with existing code led to problems with matching and cutting of strings, the severity of which depended on how the character encoding was designed. Some encodings such as the EUC family guarantee that a byte value in the ASCII range will only represent that ASCII character, making the encoding safe for systems that use those characters as field separators. Other encodings such as ISO-2022 and Shift-JIS do not make such guarantees, making matching on byte codes unsafe. Another issue is that if the beginning of a string is deleted, important instructions for the decoder or information on position in a multibyte sequence may be lost. A further issue is that if strings are joined together (especially after having their ends truncated by code not aware of the encoding), the first string may not leave the encoder in a state suitable for dealing with the second string.
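
The difference between the two families can be seen on a single character. The sketch below uses Python's standard `shift_jis` and `euc_jp` codecs on the katakana letter SO (U+30BD), chosen for illustration because its Shift-JIS form ends in the byte 0x5C, the ASCII backslash.

```python
text = "\u30bd"                    # katakana letter SO

sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")

print(sjis.hex(), euc.hex())       # 835c a5bd
print(b"\\" in sjis)               # True: the second byte is 0x5C, ASCII backslash
print(b"\\" in euc)                # False: both bytes are outside the ASCII range

# Cutting a multibyte character in half leaves bytes that no longer decode.
try:
    sjis[:1].decode("shift_jis")
except UnicodeDecodeError as exc:
    print("truncated:", exc.reason)
```

Code that searches the Shift-JIS bytes for a backslash separator would therefore find a false match inside this character, while the EUC-JP bytes are safe from that particular problem.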

Unicode has simplified the picture somewhat. Most programming languages have a datatype for Unicode strings (usually UTF-16, as it was usually added before the Unicode supplementary planes were introduced). Converting between Unicode and local encodings requires an understanding of the local encoding, which may be problematic for existing systems where strings of various encodings are being transmitted together with no real marking as to what encoding they are in.
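
A short sketch of why unmarked encodings are a problem: the same bytes produce different text depending on which encoding the receiver assumes. The sample word is arbitrary.

```python
data = "caf\u00e9".encode("utf-8")   # the word "café" as UTF-8 bytes: b'caf\xc3\xa9'

print(data.decode("utf-8"))          # café
print(data.decode("latin-1"))        # cafÃ©, because the wrong encoding was assumed
```

Neither call fails, so the receiver has no way to detect the mistake from the bytes alone.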

Implementations

Some languages like C++ implement strings as templates that can be used with any datatype, but this is the exception, not the rule.

If an object-oriented language represents strings as objects, they are called mutable if the value can change at runtime and immutable if the value is frozen after creation. For example, Ruby has mutable strings, while Python's strings are immutable.
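
Immutability is easy to observe in Python. In the sketch below, an attempt to change a character in place fails, while operations such as `upper` return a new string. A `bytearray` is shown for contrast as a mutable sequence of bytes.

```python
s = "string"
try:
    s[0] = "S"                # str objects do not support item assignment
except TypeError as exc:
    print("immutable:", exc)

t = s.upper()                 # operations return a new string instead
print(s, t)                   # string STRING

buf = bytearray(b"string")    # a mutable sequence of bytes, for contrast
buf[0] = ord("S")
print(buf.decode("ascii"))    # String
```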

Other languages, most notably Prolog and Erlang, avoid implementing a string datatype, instead adopting the convention of representing strings as lists of character codes.
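
The list-of-codes convention can be imitated in Python, used here only for illustration rather than Prolog or Erlang themselves, by converting a string to a list of integer character codes and back.

```python
codes = [ord(ch) for ch in "hello"]
print(codes)                            # [104, 101, 108, 108, 111]
print("".join(chr(c) for c in codes))   # hello
```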

Representations

Representations of strings depend heavily on the choice of character repertoire and the method of character encoding. Older string implementations were designed to work with the repertoire and encoding defined by ASCII, or more recent extensions like the ISO 8859 series. Modern implementations often use the extensive repertoire defined by Unicode along with a variety of complex encodings such as UTF-8 and UTF-16.

Most string implementations are very similar to variable-length arrays with the entries storing the character codes of corresponding characters. The principal difference is that, with certain encodings, a single logical character may take up more than one entry in the array. This happens for example with UTF-8, where single characters can take anywhere from one to four bytes. In these cases, the logical length of the string differs from the length of the array.
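
The gap between logical length and array length can be measured directly. In this sketch `len` on a Python `str` counts code points, while `len` on the encoded `bytes` counts array entries. The sample characters are chosen to cover the one to four byte cases.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]   # A, é, €, and an emoji
lengths = [(len(ch), len(ch.encode("utf-8"))) for ch in samples]
print(lengths)                                 # [(1, 1), (1, 2), (1, 3), (1, 4)]

word = "na\u00efve"                            # "naïve"
print(len(word), len(word.encode("utf-8")))    # 5 6
```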

The length of a string can be stored implicitly by using a special terminating character; often this is the null character having value zero, a convention used and perpetuated by the popular C programming language. Hence, this representation is commonly referred to as a C string. The length of a string can also be stored explicitly, for example by prefixing the string with the length as a byte value, a convention used in Pascal; consequently some people call it a P-string.

In terminated strings, the terminating code is not an allowable character in any string.

The term bytestring usually indicates a general-purpose string of bytes — rather than strings of only (readable) characters, strings of bits, or such. Byte strings often imply that bytes can take any value and any data can be stored as-is, meaning that there should be no value interpreted as a termination value.

Here is an example of a null-terminated string stored in a 10-byte buffer, along with its ASCII representation:

  Character:  F   R   A   N   K   NUL  k   e   f   w
  Hex code:   46  52  41  4E  4B  00   6B  65  66  77

The length of the string in the above example is 5 characters, but it occupies 6 bytes. Characters after the terminator do not form part of the representation; they may be either part of another string or just garbage. (Strings of this form are sometimes called ASCIZ strings, after the original assembly language directive used to declare them.)

Here is the equivalent (old style) Pascal string stored in a 10-byte buffer, along with its ASCII representation:

  Character:  length  F   R   A   N   K   k   e   f   w
  Hex code:   05      46  52  41  4E  4B  6B  65  66  77

The four bytes after K lie beyond the stored length and are not part of the string.

Both character termination and length codes limit strings. For example, C character arrays that contain NUL characters cannot be handled directly by C string library functions, and strings using a length code are limited to the maximum value of the length code.
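
Both layouts, and both limitations, can be sketched over raw byte buffers. The functions `read_c_string`, `read_p_string` and `write_p_string` are hypothetical helpers written for this example, assuming ASCII text and a one-byte length prefix.

```python
def read_c_string(buffer):
    """Decode the bytes before the first NUL (0x00) terminator."""
    end = buffer.index(0)          # raises ValueError if no terminator exists
    return buffer[:end].decode("ascii")

def read_p_string(buffer):
    """Decode a string whose first byte holds its length."""
    length = buffer[0]
    return buffer[1:1 + length].decode("ascii")

def write_p_string(text):
    """Encode text with a one-byte length prefix (at most 255 bytes)."""
    data = text.encode("ascii")
    if len(data) > 255:
        raise ValueError("length does not fit in a one-byte prefix")
    return bytes([len(data)]) + data

c_buffer = b"FRANK\x00kefw"        # 10 bytes: 5 characters, NUL, 4 leftover bytes
p_buffer = b"\x05FRANKkefw"        # 10 bytes: length byte, 5 characters, leftovers

print(read_c_string(c_buffer))          # FRANK
print(read_p_string(p_buffer))          # FRANK
print(read_c_string(b"AB\x00CD\x00"))   # AB, data after an embedded NUL is lost
```

The terminated form silently drops everything after an embedded NUL, and the prefixed form refuses anything longer than the prefix can count.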

Both of these limitations can be overcome by clever programming, of course, but such workarounds are by definition not standard.
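The two layouts and their limitations can be illustrated with a short Python sketch. The helper names (`read_null_terminated`, `read_length_prefixed`, `make_length_prefixed`) are hypothetical and chosen only for this illustration; the filler bytes `kefw` stand in for whatever leftover data follows the string in the buffer.

```python
def read_null_terminated(buf):
    """Return the bytes before the first NUL (0x00), as C string functions would see them."""
    end = buf.index(0)  # raises ValueError if the buffer holds no terminator
    return buf[:end]

def read_length_prefixed(buf):
    """Return the payload of an old-style Pascal string: one length byte, then the data."""
    length = buf[0]
    return buf[1:1 + length]

def make_length_prefixed(data):
    """Build a length-prefixed string; a one-byte length code cannot exceed 255."""
    if len(data) > 255:
        raise ValueError("payload too long for a one-byte length code")
    return bytes([len(data)]) + data

c_buffer = b"FRANK\x00kefw"       # 5 characters + terminator + 4 leftover bytes
pascal_buffer = b"\x05FRANKkefw"  # length byte + 5 characters + 4 leftover bytes

print(read_null_terminated(c_buffer))       # b'FRANK'
print(read_length_prefixed(pascal_buffer))  # b'FRANK'

# An embedded NUL cuts a null-terminated string short.
print(read_null_terminated(b"AB\x00CD\x00"))  # b'AB'
```

Both buffers are 10 bytes long, yet only the first 6 bytes of each carry the string and its bookkeeping.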

Historically, rough equivalents of the C termination method appear in both hardware and software. For example, "data processing" machines like the IBM 1401 used a special word mark bit to delimit strings at the left, where the operation would start at the right. This meant that while the IBM 1401 had a seven-bit word in "reality", almost no one ever thought to use this as a feature and override the assignment of the seventh bit to (for example) handle ASCII codes.

It is possible to create data structures, and functions that manipulate them, that do not have the problems associated with character termination and can in principle overcome length code bounds. It is also possible to optimize the string representation using techniques from run-length encoding (replacing repeated characters by the character value and a length) and Hamming encoding.
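As a rough illustration of the run-length idea, the following Python sketch stores each run of repeated characters as a (character, count) pair. The function names `rle_encode` and `rle_decode` are hypothetical, and the pair-list format is only one of many possible encodings.

```python
from itertools import groupby

def rle_encode(text):
    """Replace each run of repeated characters by a (character, run length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (character, run length) pairs."""
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("aaaabbbcca")
print(encoded)              # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
print(rle_decode(encoded))  # aaaabbbcca
```

The encoding only saves space when the input contains long runs; for text with few repeats the pair list is larger than the original string.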

While these representations are common, others are possible. Using ropes makes certain string operations, such as insertions, deletions, and concatenations, more efficient.
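A minimal sketch of the concatenation-tree idea behind ropes follows, in Python. The class and function names (`Leaf`, `Concat`, `char_at`, `to_str`) are hypothetical, and the tree is never rebalanced, which a practical rope implementation would need to do.

```python
class Leaf:
    """A leaf holds an ordinary short string."""
    def __init__(self, text):
        self.text = text
        self.length = len(text)

class Concat:
    """An inner node joins two ropes without copying their characters."""
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length

def char_at(node, i):
    """Find the character at index i by walking down the tree."""
    while isinstance(node, Concat):
        if i < node.left.length:
            node = node.left
        else:
            i -= node.left.length
            node = node.right
    return node.text[i]

def to_str(node):
    """Flatten the rope into an ordinary string."""
    if isinstance(node, Leaf):
        return node.text
    return to_str(node.left) + to_str(node.right)

rope = Concat(Concat(Leaf("Hello, "), Leaf("rope ")), Leaf("world"))
print(rope.length)       # 17
print(char_at(rope, 7))  # r
print(to_str(rope))      # Hello, rope world
```

Concatenation here only allocates one new node, whereas joining two flat strings copies every character of both.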

Vectors

While character strings are a very common use of strings, a string in computer science may refer generically to any vector of homogeneously typed data. A string of bits or bytes, for example, may be used to represent data retrieved from a communications medium. This data may or may not be represented by a string-specific datatype, depending on the needs of the application, the desire of the programmer, and the capabilities of the programming language being used.

String processing algorithms

There are many algorithms for processing strings, each with various trade-offs. Some categories of algorithms include:
  • string searching algorithms for finding a given substring or pattern;
  • string manipulation algorithms;
  • sorting algorithms;
  • regular expression algorithms; and
  • parsing a string.
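As one example from the first category, the simplest string searching approach compares the pattern against every possible position in the text. The Python sketch below uses a hypothetical function name, `find_all`, and is the naive method rather than any of the faster published algorithms.

```python
def find_all(text, pattern):
    """Return every index at which pattern occurs in text, using a naive scan."""
    if not pattern:
        return []
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        if text[start:start + len(pattern)] == pattern:
            hits.append(start)
    return hits

print(find_all("abracadabra", "abra"))  # [0, 7]
print(find_all("aaaa", "aa"))           # [0, 1, 2]
```

In the worst case this scan does work proportional to the product of the two lengths, which is the trade-off that more elaborate searching algorithms aim to avoid.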


Advanced string algorithms often employ complex mechanisms and data structures, among them suffix trees and finite state machines.

Character string oriented languages and utilities

Character strings are such a useful datatype that several languages have been designed in order to make string processing applications easy to write. Examples include the following languages:

  • awk
  • Icon
  • MUMPS
  • Perl
  • Rexx
  • Ruby
  • sed
  • SNOBOL
  • Tcl


Many UNIX utilities perform simple string manipulations and can be used to easily program some powerful string processing algorithms. Files and finite streams may be viewed as strings.

Some application programming interfaces, such as Multimedia Control Interface, embedded SQL, or printf, use strings to hold commands that will be interpreted.
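Python's `%` string operator follows printf-style conventions and can serve as a small illustration: the format string acts as a set of instructions that is interpreted at run time to lay out the other values. The variable names below are hypothetical.

```python
# Each directive in the template is a command: field width, alignment, precision.
template = "%-6s|%5d|%8.3f"
row = template % ("disk", 42, 3.14159)

print(row)  # disk  |   42|   3.142
```

Changing only the template string changes the output layout, without touching the code that supplies the values.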

Recent scripting programming languages, including Perl, Python, Ruby, and Tcl, employ regular expressions to facilitate text operations.
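A brief sketch using Python's standard `re` module shows the kind of text operation meant here. The sample log line and variable names are made up for the illustration.

```python
import re

log_line = "2009-03-14 error: disk /dev/sda1 at 91% capacity"

# Extract the first date-shaped substring (four digits, two digits, two digits).
date = re.search(r"\d{4}-\d{2}-\d{2}", log_line).group()

# Capture the number that directly precedes a percent sign.
percent = int(re.search(r"(\d+)%", log_line).group(1))

# Collapse every run of whitespace into a single space.
tidy = re.sub(r"\s+", " ", "too   many \t spaces")

print(date)     # 2009-03-14
print(percent)  # 91
print(tidy)     # too many spaces
```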

Some languages such as Perl and Ruby support string interpolation, which permits arbitrary expressions to be evaluated and included in string literals.
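Python's formatted string literals (f-strings) behave in a comparable way and are used below as a stand-in for illustration; the Perl and Ruby syntax differs. The variable names are hypothetical.

```python
name = "world"
items = [3, 4, 5]

# Expressions inside the braces are evaluated and their results placed in the string.
greeting = f"Hello, {name}! The sum of {items} is {sum(items)}."

print(greeting)  # Hello, world! The sum of [3, 4, 5] is 12.
```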

Character string functions

String functions are used to manipulate a string or to change its contents. They are also used to query information about a string. They are usually used within the context of a computer programming language.

The most basic example of a string function is the length(string) function, which returns the length of a string (not counting any terminator characters or any of the string's internal structural information) and does not modify the string. For example, length("hello world") returns 11.

Many string functions exist in several languages with similar or identical syntax and parameters. For example, in many languages the length function is written len(string). Useful as these functions are, a programmer should keep in mind that a string function in one language may behave differently in another, or may have a different name, parameters, syntax, or results.
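The following Python sketch shows the length function on the example above, and one way in which "length" can vary: Python's `len` counts characters (code points) in a text string, whereas the same text encoded as bytes may be longer. The variable names are hypothetical.

```python
text = "hello world"
print(len(text))  # 11

# A string with one non-ASCII character: five characters, six bytes in UTF-8.
accented = "na\u00efve"
print(len(accented))                  # 5
print(len(accented.encode("utf-8")))  # 6
```

A function that counts bytes, such as C's `strlen`, would therefore report a different length for the same UTF-8 encoded text.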

See also

  • Bitstring
  • Rope