Simplified molecular input line entry specification

The simplified molecular input line entry specification or SMILES is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.

The original SMILES specification was developed by Arthur Weininger and David Weininger in the late 1980s. It has since been modified and extended by others, most notably by Daylight Chemical Information Systems Inc. In 2007, an open standard was developed by the open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc).

In August 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being slightly more human-readable than InChI; it also has a wide base of software support with extensive theoretical (e.g., graph theory) backing.

Terminology

The term SMILES refers to a line notation for encoding molecular structures, and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms Canonical and Isomeric can lead to some confusion when applied to SMILES. They describe different attributes of SMILES strings and are not mutually exclusive.

Typically, a number of equally valid SMILES can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to ensure the same SMILES is generated for a molecule regardless of the order of atoms in the structure. This SMILES is unique for each structure, although dependent on the canonicalisation algorithm used to generate it, and is termed the Canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure and do not simply manipulate strings as is sometimes thought. Several algorithms for generating Canonical SMILES have been developed. A common application of Canonical SMILES is for indexing and ensuring uniqueness of molecules in a database.
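The idea can be illustrated with a minimal sketch. It is a toy canonicalisation written for this example, not any of the published algorithms, and it assumes acyclic molecules built from one-letter atoms and single bonds only. The function names are hypothetical. Each string is first parsed into a graph, and the output is generated from the graph rather than from the input text.

```python
def parse_acyclic(smiles):
    """Toy parser: one-letter atoms, implicit single bonds, parentheses for branches."""
    atoms, neighbours, stack, previous = [], [], [], None
    for ch in smiles:
        if ch == "(":
            stack.append(previous)        # remember the branch point
        elif ch == ")":
            previous = stack.pop()        # return to the branch point
        else:
            atoms.append(ch)
            neighbours.append([])
            index = len(atoms) - 1
            if previous is not None:
                neighbours[previous].append(index)
                neighbours[index].append(previous)
            previous = index
    return atoms, neighbours


def rooted_string(atoms, neighbours, atom, parent=None):
    """Write the tree rooted at `atom`, with child strings in sorted order."""
    parts = sorted(
        rooted_string(atoms, neighbours, n, atom)
        for n in neighbours[atom]
        if n != parent
    )
    branches = "".join("(" + part + ")" for part in parts[:-1])
    return atoms[atom] + branches + (parts[-1] if parts else "")


def toy_canonical(smiles):
    """Pick the lexicographically smallest string over all possible root atoms."""
    atoms, neighbours = parse_acyclic(smiles)
    return min(rooted_string(atoms, neighbours, root) for root in range(len(atoms)))


# The three ethanol strings collapse to one form; dimethyl ether stays distinct.
print({toy_canonical(s) for s in ("CCO", "OCC", "C(O)C")})
print(toy_canonical("COC"))
```

The particular string chosen depends entirely on the ordering rule used here, which mirrors the point that a Canonical SMILES is unique only with respect to the algorithm that generated it.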

SMILES notation allows the specification of configuration at tetrahedral centers, and double bond geometry. These are structural features that cannot be specified by connectivity alone and SMILES which encode this information are termed Isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term Isomeric SMILES is also applied to SMILES in which isotopes are specified.

Graph-based definition

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
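The procedure described above can be sketched as follows. This is a simplified illustration, not a complete SMILES writer: it assumes a hydrogen-suppressed graph with single bonds only and fewer than ten ring closures, and the function name and input format are hypothetical.

```python
def write_smiles(symbols, bonds, start=0):
    """Toy SMILES writer for a hydrogen-suppressed graph with single bonds only."""
    neighbours = {i: [] for i in range(len(symbols))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    visited = set()
    children = {i: [] for i in neighbours}   # spanning-tree bonds
    closures = {i: [] for i in neighbours}   # ring-closure labels per atom
    ring_bonds = set()
    label = 0

    # Pass 1: depth-first search. A bond leading to an already visited atom
    # is "broken" and replaced by a numeric label on both of its atoms.
    def explore(atom, parent):
        nonlocal label
        visited.add(atom)
        for nxt in neighbours[atom]:
            if nxt == parent:
                continue
            if nxt in visited:
                bond = frozenset((atom, nxt))
                if bond not in ring_bonds:
                    ring_bonds.add(bond)
                    label += 1
                    closures[atom].append(label)
                    closures[nxt].append(label)
            else:
                children[atom].append(nxt)
                explore(nxt, atom)

    # Pass 2: print the tree; every child except the last goes in parentheses.
    def emit(atom):
        text = symbols[atom] + "".join(str(n) for n in closures[atom])
        for kid in children[atom][:-1]:
            text += "(" + emit(kid) + ")"
        if children[atom]:
            text += emit(children[atom][-1])
        return text

    explore(start, None)
    return emit(start)


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(write_smiles(["C"] * 6, ring))        # cyclohexane
print(write_smiles(list("OCCOCC"), ring))   # dioxane
```

Starting the traversal from a different atom gives a different but equally valid string for the same graph, which is why a separate canonicalisation step is needed when a unique string is required.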

Examples

Atoms

Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. The hydroxide anion is [OH-]. Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance the SMILES for water is simply O.
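The implicit hydrogen rule can be sketched in a few lines. The table of normal valences below is an assumption taken from commonly published descriptions of SMILES rather than from the text above, and the function name is hypothetical. The sketch covers only unbracketed, uncharged, non-aromatic atoms.

```python
# Assumed normal valences for the organic subset (not listed in the text above).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum):
    """Hydrogens needed to reach the lowest normal valence that fits the bonds."""
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # more bonds than any normal valence: no hydrogens are added


print(implicit_hydrogens("O", 0))  # water, written as O
print(implicit_hydrogens("C", 1))  # a terminal carbon, as in CCO
```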

Bonds

Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES. For example, the SMILES for ethanol can be written as CCO. Ring closure labels are used to indicate connectivity between non-adjacent atoms in the SMILES, which for cyclohexane and dioxane can be written as C1CCCCC1 and O1CCOCC1 respectively. Double and triple bonds are represented by the symbols '=' and '#' respectively as illustrated by the SMILES O=C=O and C#N.
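As a rough sketch of how these rules translate into a bond list, the following toy reader handles only what this section describes: unbracketed aliphatic atoms, bonds implied by adjacency, the symbols '-', '=' and '#', and single-digit ring closure labels. Branches, aromatic atoms and brackets are deliberately not supported, and the function name is hypothetical.

```python
BOND_ORDERS = {"-": 1, "=": 2, "#": 3}


def parse_bonds(smiles):
    """Return (atoms, bonds) for an unbranched SMILES; bonds are (i, j, order)."""
    atoms, bonds = [], []
    open_rings = {}   # ring-closure digit -> index of the atom that opened it
    previous = None   # index of the most recently read atom
    pending = 1       # order of the next bond; single unless a symbol says otherwise
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDERS:
            pending = BOND_ORDERS[ch]
        elif ch.isdigit():
            if ch in open_rings:
                # Second occurrence of the label: connect the two atoms.
                bonds.append((open_rings.pop(ch), previous, pending))
            else:
                open_rings[ch] = previous
            pending = 1
        elif ch.isalpha():
            two = smiles[i:i + 2]
            symbol = two if two in ("Cl", "Br") else ch
            atoms.append(symbol)
            if previous is not None:
                bonds.append((previous, len(atoms) - 1, pending))
            previous = len(atoms) - 1
            pending = 1
            i += len(symbol) - 1
        else:
            raise ValueError("unsupported character: " + ch)
        i += 1
    return atoms, bonds


print(parse_bonds("O=C=O"))
print(parse_bonds("C1CCCCC1")[1][-1])  # the ring closure bond
```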

Aromaticity

Aromatic C, O, S and N atoms are shown in their lower case 'c', 'o', 's' and 'n' respectively. Benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1. Bonds between aromatic atoms are, by default, aromatic although these can be specified explicitly using the ':' symbol. Aromatic atoms can be singly bonded to each other and biphenyl can be represented by c1ccccc1-c2ccccc2. Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH] and imidazole is written in SMILES notation as n1c[nH]cc1.

Algorithms for generating canonical SMILES can differ in their treatment of aromaticity.

Branching

Branches are described with parentheses, as in CCC(=O)O for propionic acid and C(F)(F)F for fluoroform. Substituted rings can be written with the branching point in the ring as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.
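The two cyanoanisole strings can be checked with a small sketch that reads branches with a stack and then measures how far apart the two substituted ring atoms are. This is a toy reader written for this example (one-letter atoms, single-digit ring labels, bond orders ignored), and the function names are hypothetical.

```python
from collections import deque


def parse(smiles):
    """Toy reader returning (atoms, adjacency lists); bond orders are ignored."""
    atoms, adjacency = [], []
    previous, stack, open_rings = None, [], {}
    for ch in smiles:
        if ch in "-=#":
            continue                      # connectivity only
        if ch == "(":
            stack.append(previous)        # a branch starts at the current atom
        elif ch == ")":
            previous = stack.pop()        # the branch ends; go back to its root
        elif ch.isdigit():
            if ch in open_rings:
                other = open_rings.pop(ch)
                adjacency[other].append(previous)
                adjacency[previous].append(other)
            else:
                open_rings[ch] = previous
        else:
            atoms.append(ch)
            adjacency.append([])
            index = len(atoms) - 1
            if previous is not None:
                adjacency[previous].append(index)
                adjacency[index].append(previous)
            previous = index
    return atoms, adjacency


def distance(adjacency, start, goal):
    """Number of bonds on the shortest path between two atoms (breadth-first)."""
    seen, queue = {start: 0}, deque([start])
    while queue:
        atom = queue.popleft()
        if atom == goal:
            return seen[atom]
        for nxt in adjacency[atom]:
            if nxt not in seen:
                seen[nxt] = seen[atom] + 1
                queue.append(nxt)
    return None


# In both strings atom 2 carries the methoxy group and atom 7 the nitrile.
for smiles in ("COc(c1)cccc1C#N", "COc(cc1)ccc1C#N"):
    atoms, adjacency = parse(smiles)
    print(smiles, "ring position", distance(adjacency, 2, 7) + 1)
```

A separation of two bonds corresponds to the 3-isomer and three bonds to the 4-isomer.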

Stereochemistry

Configuration around double bonds is specified using the characters "/" and "\". For example, F/C=C/F is one representation of trans-difluoroethene, in which the fluorine atoms are on opposite sides of the double bond, whereas F/C=C\F is one possible representation of cis-difluoroethene, in which the Fs are on the same side of the double bond, as shown in the figure.

Configuration at tetrahedral carbon is specified by @ or @@. L-Alanine, the more common enantiomer of the amino acid alanine, can be written as N[C@@H](C)C(=O)O. The @@ specifier indicates that, when viewed from nitrogen along the bond to the chiral center, the sequence of substituents hydrogen (H), methyl (C) and carboxylate (C(=O)O) appears clockwise. D-Alanine can be written as N[C@H](C)C(=O)O. The order of the substituents in the SMILES string is very important and D-alanine can also be encoded as N[C@@H](C(=O)O)C.
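The dependence on substituent order can be made concrete with a short sketch. It relies on the standard observation that swapping any two neighbours of a tetrahedral center inverts the meaning of the @/@@ mark, so an odd permutation of the written order flips the configuration and an even one preserves it. The neighbour labels and function names are hypothetical and chosen only for this illustration.

```python
def permutation_parity(reference, other):
    """Return 0 if `other` is an even permutation of `reference`, 1 if odd."""
    positions = [reference.index(item) for item in other]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    return inversions % 2


def same_configuration(order_a, mark_a, order_b, mark_b):
    """Compare two descriptions of one center: neighbour order plus @ or @@."""
    flipped_by_order = permutation_parity(order_a, order_b) == 1
    flipped_by_mark = mark_a != mark_b
    return flipped_by_order == flipped_by_mark


# Neighbour order as written in each alanine string from the text.
l_alanine = (["N", "H", "CH3", "COOH"], "@@")   # N[C@@H](C)C(=O)O
d_alanine_1 = (["N", "H", "CH3", "COOH"], "@")  # N[C@H](C)C(=O)O
d_alanine_2 = (["N", "H", "COOH", "CH3"], "@@") # N[C@@H](C(=O)O)C

print(same_configuration(*l_alanine, *d_alanine_1))    # different mark, same order
print(same_configuration(*d_alanine_1, *d_alanine_2))  # both changes cancel out
```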

Isotopes

Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.
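A bracketed atom such as [14c], [2H], [OH-] or [C@@H] can be split into its parts with a regular expression. The pattern below is a simplified sketch covering only the fields used in this article (isotope, symbol, chirality mark, hydrogen count and charge); it is not a full grammar for bracket atoms, and the names are hypothetical.

```python
import re

# Simplified pattern: [isotope? symbol chirality? hydrogens? charge?]
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<chiral>@{1,2})?"
    r"(?P<hcount>H\d*)?"
    r"(?P<charge>[+-]\d*)?\]"
)


def parse_bracket_atom(text):
    """Split one bracketed atom into a dictionary of its fields."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError("not a supported bracket atom: " + text)

    hydrogens = 0
    if match["hcount"] is not None:
        hydrogens = int(match["hcount"][1:] or 1)   # "H" means one hydrogen

    charge = 0
    if match["charge"] is not None:
        sign = match["charge"]
        charge = int(sign + "1") if len(sign) == 1 else int(sign)

    return {
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "symbol": match["symbol"],
        "chirality": match["chiral"],
        "hydrogens": hydrogens,
        "charge": charge,
    }


print(parse_bracket_atom("[14c]"))
print(parse_bracket_atom("[OH-]"))
```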

Other examples of SMILES

The SMILES notation is described extensively in its reference documentation, which presents a number of illustrative examples. Daylight also provides users with the means to check their own examples of SMILES, which is a valuable educational tool.

Extensions

SMARTS is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations which are searched for subgraph isomorphism. A related line notation exists for specifying reaction transforms.
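The difference between string matching and graph matching can be shown with a small sketch. It uses plain SMILES fragments as patterns rather than real SMARTS, handles only acyclic structures with one-letter atoms, and searches by brute force, so it illustrates the principle only. The function names are hypothetical.

```python
from itertools import permutations


def parse_tree(smiles):
    """Toy reader: one-letter atoms, implicit single bonds, parentheses for branches."""
    atoms, bonds, stack, previous = [], [], [], None
    for ch in smiles:
        if ch == "(":
            stack.append(previous)
        elif ch == ")":
            previous = stack.pop()
        else:
            atoms.append(ch)
            if previous is not None:
                bonds.append((previous, len(atoms) - 1))
            previous = len(atoms) - 1
    return atoms, bonds


def graph_contains(target, pattern):
    """Brute-force subgraph search: try every assignment of pattern atoms to target atoms."""
    t_atoms, t_bond_list = parse_tree(target)
    p_atoms, p_bonds = parse_tree(pattern)
    t_bonds = {frozenset(bond) for bond in t_bond_list}
    for mapping in permutations(range(len(t_atoms)), len(p_atoms)):
        atoms_match = all(t_atoms[mapping[i]] == p_atoms[i] for i in range(len(p_atoms)))
        bonds_match = all(frozenset((mapping[a], mapping[b])) in t_bonds for a, b in p_bonds)
        if atoms_match and bonds_match:
            return True
    return False


print("OC" in "CCO")                # substring test on the text
print(graph_contains("CCO", "OC"))  # the same question asked of the graphs
```

Ethanol written as CCO does not contain the text OC, yet its graph does contain an oxygen bonded to a carbon, which is why practical substructure searching works on graphs.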

Conversion

SMILES can be converted back to 2-dimensional representations using Structure Diagram Generation algorithms (Helson, 1999). This conversion is not always unambiguous. Conversion to 3-dimensional representation is achieved by energy minimization approaches. There are many downloadable and web-based conversion utilities.

See also

  • Smiles arbitrary target specification: SMARTS language for specification of substructural queries
  • SYBYL Line Notation (another line notation)
  • Molecular Query Language: query language allowing also numerical properties, e.g. physicochemical values or distances
  • Chemistry Development Kit (2D layout and conversion)
  • International Chemical Identifier (InChI), the free and open alternative to SMILES by the IUPAC
  • OpenBabel, JOELib, OELib (conversion)
