





 

Simplified molecular input line entry specification



 
 


The simplified molecular input line entry specification or SMILES is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.

The original SMILES specification was developed by Arthur Weininger and David Weininger in the late 1980s. It has since been modified and extended by others, most notably by Daylight Chemical Information Systems Inc. In 2007, an open standard was developed by the open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc).

In August 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being slightly more human-readable than InChI; it also has a wide base of software support with extensive theoretical (e.g., graph theory) backing.

Terminology


The term SMILES refers to a line notation for encoding molecular structures, and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings, and the exact meaning is usually apparent from the context. The terms Canonical and Isomeric can lead to some confusion when applied to SMILES. They describe different attributes of SMILES strings and are not mutually exclusive.

Typically, a number of equally valid SMILES can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to ensure the same SMILES is generated for a molecule regardless of the order of atoms in the structure. This SMILES is unique for each structure, although dependent on the canonicalisation algorithm used to generate it, and is termed the Canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure and do not simply manipulate strings as is sometimes thought. Several organisations have developed algorithms for generating Canonical SMILES. A common application of Canonical SMILES is for indexing and ensuring uniqueness of molecules in a database.
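The idea of parsing to an internal structure and then writing a single agreed string can be sketched for a very small case. The following is a toy illustration only, not any published canonicalisation algorithm: it assumes acyclic molecules, single-letter atom symbols and single bonds, and the function names are hypothetical.

```python
def parse(smiles):
    """Parse a toy SMILES subset: single-letter atoms, single bonds, branches."""
    atoms, bonds, stack, prev = [], [], [], None
    for ch in smiles:
        if ch == "(":
            stack.append(prev)          # remember the branch point
        elif ch == ")":
            prev = stack.pop()          # return to the branch point
        else:
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
    return atoms, bonds


def rooted(atoms, adjacency, node, parent):
    """Write the tree below `node` with its child strings in sorted order."""
    parts = sorted(
        rooted(atoms, adjacency, nbr, node)
        for nbr in adjacency[node]
        if nbr != parent
    )
    branches = "".join(f"({p})" for p in parts[:-1])
    return atoms[node] + branches + (parts[-1] if parts else "")


def canonical(smiles):
    """Return one string per structure: the smallest rooted form over all roots."""
    atoms, bonds = parse(smiles)
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return min(rooted(atoms, adjacency, i, None) for i in adjacency)


for text in ("CCO", "OCC", "C(O)C"):
    print(text, "->", canonical(text))
```

All three spellings of ethanol give the same output string because the comparison is made on the parsed graph rather than on the input text. A different choice of ordering rule would give a different, equally valid canonical form.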

SMILES notation allows the specification of configuration at tetrahedral centers and double bond geometry. These are structural features that cannot be specified by connectivity alone, and SMILES which encode this information are termed Isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term Isomeric SMILES is also applied to SMILES in which isotopes are specified.

Graph-based definition

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
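The procedure described above can be sketched as a depth-first traversal that records which bonds close rings. This is a minimal illustration under stated assumptions: the graph is connected, hydrogens have already been removed, every bond is single, and there are at most nine ring closures. The function name `to_smiles` is hypothetical.

```python
def to_smiles(atoms, bonds, start=0):
    """Write a SMILES-like string for a connected, hydrogen-free graph.

    atoms: list of element symbols; bonds: list of (i, j) single bonds.
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    visited, used_edges = set(), set()
    children = {i: [] for i in adjacency}   # spanning-tree edges
    labels = {i: [] for i in adjacency}     # ring-closure digits per atom
    ring_count = 0

    def explore(node):
        """Depth-first search that splits bonds into tree and ring-closure bonds."""
        nonlocal ring_count
        visited.add(node)
        for nbr in adjacency[node]:
            edge = frozenset((node, nbr))
            if edge in used_edges:
                continue
            used_edges.add(edge)
            if nbr in visited:
                # This bond would close a cycle: break it and label both ends.
                ring_count += 1
                labels[nbr].append(str(ring_count))
                labels[node].append(str(ring_count))
            else:
                children[node].append(nbr)
                explore(nbr)

    def write(node):
        """Print the spanning tree; all but the last child go in parentheses."""
        text = atoms[node] + "".join(labels[node])
        kids = children[node]
        for kid in kids[:-1]:
            text += "(" + write(kid) + ")"
        if kids:
            text += write(kids[-1])
        return text

    explore(start)
    return write(start)


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(to_smiles(["C"] * 6, ring))                                  # cyclohexane
print(to_smiles(["C", "C", "C", "C"], [(0, 1), (1, 2), (1, 3)]))   # branched chain
```

Starting the traversal at a different atom, or visiting neighbours in a different order, gives a different but equally valid string for the same graph.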

Examples


Atoms

Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. The hydroxide anion is [OH-]. Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance the SMILES for water is simply O.
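The implicit hydrogen rule can be illustrated with a small lookup. The valence table below is an assumption: it lists the default valences commonly documented for the organic subset and is not given in this article. The function name is hypothetical.

```python
# Assumed default valences for unbracketed organic-subset atoms.
DEFAULT_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum):
    """Fill up to the lowest default valence that fits the explicit bonds."""
    for valence in DEFAULT_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # more bonds than any default valence: no hydrogens added


print(implicit_hydrogens("O", 0))  # the lone O in the SMILES for water
print(implicit_hydrogens("C", 1))  # a terminal carbon in CCO
```

With this table the single atom O receives two hydrogens, which is why O alone is enough to describe water.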

Bonds

Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES. For example, the SMILES for ethanol can be written as CCO. Ring closure labels are used to indicate connectivity between non-adjacent atoms in the SMILES, which for cyclohexane and dioxane can be written as C1CCCCC1 and O1CCOCC1 respectively. Double and triple bonds are represented by the symbols '=' and '#' respectively, as illustrated by the SMILES O=C=O (carbon dioxide) and C#N (hydrogen cyanide).

Aromaticity

Aromatic C, O, S and N atoms are shown in their lower case 'c', 'o', 's' and 'n' respectively. Benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1. Bonds between aromatic atoms are, by default, aromatic, although these can be specified explicitly using the ':' symbol. Aromatic atoms can be singly bonded to each other, and biphenyl can be represented by c1ccccc1-c2ccccc2. Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH], and imidazole is written in SMILES notation as n1c[nH]cc1.

Different algorithms for generating canonical SMILES differ in their treatment of aromaticity.

Branching

Branches are described with parentheses, as in CCC(=O)O for propionic acid and C(F)(F)F for fluoroform. Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.
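The rules for atoms, bonds, ring closures and branches can be combined into a small reader that turns a SMILES string into lists of atoms and bonds. This is a sketch only: it assumes single-digit ring labels, does not interpret stereochemistry or the contents of bracket atoms, and the names `TOKEN` and `parse_smiles` are hypothetical.

```python
import re

# Organic-subset atoms, aromatic atoms, bracket atoms, bond symbols,
# branch parentheses and single-digit ring-closure labels.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|\[[^\]]+\]|[-=#:]|[()]|\d")


def parse_smiles(smiles):
    """Return (atoms, bonds); each bond is (i, j, symbol or None)."""
    atoms, bonds = [], []
    prev, pending = None, None           # previous atom, explicit bond symbol
    branch_stack, open_rings = [], {}
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at {pos}")
        tok, pos = m.group(), m.end()
        if tok == "(":
            branch_stack.append(prev)    # a branch starts at the current atom
        elif tok == ")":
            prev = branch_stack.pop()    # resume from the branch point
        elif tok in "-=#:":
            pending = tok
        elif tok.isdigit():
            if tok in open_rings:        # second occurrence closes the ring
                other, symbol = open_rings.pop(tok)
                bonds.append((other, prev, pending or symbol))
            else:                        # first occurrence opens it
                open_rings[tok] = (prev, pending)
            pending = None
        else:
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending))
            prev, pending = len(atoms) - 1, None
    return atoms, bonds


atoms, bonds = parse_smiles("CCC(=O)O")
print(atoms)
print(bonds)
```

A bond symbol of None stands for the default bond, which is single between aliphatic atoms and aromatic between aromatic atoms.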

Stereochemistry

Configuration around double bonds is specified using the characters "/" and "\". For example, F/C=C/F is one representation of trans-difluoroethene, in which the fluorine atoms are on opposite sides of the double bond, whereas F/C=C\F is one possible representation of cis-difluoroethene, in which the Fs are on the same side of the double bond, as shown in the figure.
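For the simplest case, where one substituent is written before the double bond and one after it, the geometry follows from whether the two slash characters match. The sketch below handles only strings of the form X/C=C/Y and is not a general stereochemistry reader; the function name is hypothetical.

```python
import re

# Matches only the simple form <substituent><slash>C=C<slash><substituent>.
PATTERN = re.compile(r"^(\w+)([/\\])C=C([/\\])(\w+)$")


def double_bond_geometry(smiles):
    """Return 'trans' when both directional bonds match, otherwise 'cis'."""
    m = PATTERN.match(smiles)
    if m is None:
        raise ValueError("only the simple form X/C=C/Y is handled in this sketch")
    return "trans" if m.group(2) == m.group(3) else "cis"


print(double_bond_geometry("F/C=C/F"))
print(double_bond_geometry("F/C=C\\F"))
```

Reversing both slashes describes the same molecule, which is why the text calls each string only one possible representation.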

Configuration at tetrahedral carbon is specified by @ or @@. L-Alanine, the more common enantiomer of the amino acid alanine, can be written as N[C@@H](C)C(=O)O. The @@ specifier indicates that, when viewed from nitrogen along the bond to the chiral center, the sequence of substituents hydrogen (H), methyl (C) and carboxylate (C(=O)O) appears clockwise. D-Alanine can be written as N[C@H](C)C(=O)O. The order of the substituents in the SMILES string is very important, and D-alanine can also be encoded as N[C@@H](C(=O)O)C.
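The dependence on substituent order can be expressed as a parity rule: exchanging any two neighbours of the chiral center switches @ and @@, so an even number of exchanges leaves the symbol unchanged. The helper below is an illustrative sketch with a hypothetical name; the neighbour labels are arbitrary strings.

```python
def reorder_chirality(mark, old_order, new_order):
    """Return '@' or '@@' after the neighbours are rewritten in new_order."""
    positions = [old_order.index(x) for x in new_order]
    # Count inversions to get the parity of the permutation.
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    if inversions % 2 == 0:
        return mark
    return "@" if mark == "@@" else "@@"


# D-alanine as N[C@H](C)C(=O)O: neighbours in written order.
old = ["N", "H", "CH3", "COOH"]
# Rewrite with the carboxylate before the methyl group.
new = ["N", "H", "COOH", "CH3"]
print(reorder_chirality("@", old, new))
```

Swapping the methyl and carboxylate groups is a single exchange, so @ becomes @@, in agreement with the two D-alanine strings given above.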

Isotopes

Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.
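The bracket atoms used in this article (isotope, element symbol, chirality, hydrogen count and charge) can be picked apart with a regular expression. This is a simplified sketch, not a full grammar: it assumes the fields appear in that order, accepts a charge only as a sign followed by optional digits, and the names `BRACKET` and `parse_bracket_atom` are hypothetical.

```python
import re

BRACKET = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z]{1,2})"
    r"(?P<chiral>@{1,2})?"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>[+-]\d*)?\]"
)


def parse_bracket_atom(text):
    """Split one bracket atom such as [14c] or [OH-] into its fields."""
    m = BRACKET.fullmatch(text)
    if m is None:
        raise ValueError(f"not a bracket atom this sketch understands: {text!r}")
    hydrogens = int(m["hcount"][1:] or 1) if m["hcount"] else 0
    charge = 0
    if m["charge"]:
        sign = 1 if m["charge"][0] == "+" else -1
        charge = sign * int(m["charge"][1:] or 1)
    return {
        "isotope": int(m["isotope"]) if m["isotope"] else None,
        "symbol": m["symbol"],
        "chirality": m["chiral"],
        "hydrogens": hydrogens,
        "charge": charge,
    }


for atom in ("[14c]", "[2H]", "[OH-]", "[nH]", "[C@@H]", "[Au]"):
    print(atom, parse_bracket_atom(atom))
```

A lower-case symbol marks an aromatic atom, so [14c] is read as an aromatic carbon with isotope 14.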

Other examples of SMILES

The SMILES notation is described extensively in published documentation, in which a number of illustrative examples are presented. Daylight provides users with a means to check their own examples of SMILES, which is a valuable educational tool.

Extensions

SMARTS is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations which are searched for subgraph isomorphism. A related line notation exists for specifying reaction transforms.
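Searching graphs rather than strings can be illustrated with a brute-force subgraph match. This sketch is for very small molecules only: it tries every assignment of pattern atoms to molecule atoms, treats "*" as a wildcard atom, and ignores bond orders. Practical substructure search relies on far more efficient algorithms, and the function name is hypothetical.

```python
from itertools import permutations


def has_substructure(mol_atoms, mol_bonds, pat_atoms, pat_bonds):
    """Return True if the pattern graph can be mapped onto the molecule graph."""
    mol_edges = {frozenset(bond) for bond in mol_bonds}
    for mapping in permutations(range(len(mol_atoms)), len(pat_atoms)):
        atoms_ok = all(
            p in ("*", mol_atoms[m]) for p, m in zip(pat_atoms, mapping)
        )
        bonds_ok = all(
            frozenset((mapping[a], mapping[b])) in mol_edges
            for a, b in pat_bonds
        )
        if atoms_ok and bonds_ok:
            return True
    return False


# Ethanol (CCO) as a graph of heavy atoms.
ethanol_atoms, ethanol_bonds = ["C", "C", "O"], [(0, 1), (1, 2)]
print(has_substructure(ethanol_atoms, ethanol_bonds, ["C", "O"], [(0, 1)]))
print(has_substructure(ethanol_atoms, ethanol_bonds, ["N", "*"], [(0, 1)]))
```

Because the match is made on atoms and bonds, the result does not depend on which of the equivalent SMILES strings was used to describe the molecule.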

Conversion

SMILES can be converted back to 2-dimensional representations using Structure Diagram Generation algorithms (Helson, 1999). This conversion is not always unambiguous. Conversion to 3-dimensional representation is achieved by energy minimization approaches. There are many downloadable and web-based conversion utilities.

See also

  • Smiles arbitrary target specification (SMARTS), a language for specification of substructural queries
  • SYBYL Line Notation (another line notation)
  • Molecular Query Language, a query language that also allows numerical properties, e.g. physicochemical values or distances
  • Chemistry Development Kit (2D layout and conversion)
  • International Chemical Identifier (InChI), the free and open alternative to SMILES by the IUPAC
  • OpenBabel, JOELib, OELib (conversion)

