Quantitative structure-activity relationship
Quantitative structure–activity relationship (QSAR), or quantitative structure–property relationship (QSPR), is the process by which chemical structure is quantitatively correlated with a well-defined process, such as biological activity or chemical reactivity.

For example, biological activity can be expressed quantitatively as the concentration of a substance required to give a certain biological response. Additionally, when physicochemical properties or structures are expressed by numbers, one can form a mathematical relationship, or quantitative structure-activity relationship, between the two. The mathematical expression can then be used to predict the biological response of other chemical structures.
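
As a minimal sketch of expressing activity as a number, the snippet below converts a concentration into a logarithmic activity value. It assumes the response is reported as an IC50 in nmol/L and uses the common pIC50 convention; the function name and the compound values are hypothetical and not taken from the article.

```python
import math

def pic50(ic50_nanomolar):
    """Convert an IC50 given in nmol/L to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nanomolar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nanomolar * 1e-9)

# Hypothetical IC50 values in nM for three compounds (not real measurements).
for name, ic50 in [("cmpd_A", 1000.0), ("cmpd_B", 100.0), ("cmpd_C", 10.0)]:
    print(name, round(pic50(ic50), 2))
```

On this scale a lower required concentration gives a larger value, so more potent compounds score higher, which is convenient when fitting a relationship to descriptors.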

QSAR's most general mathematical form is:

Activity = f(physicochemical properties and/or structural properties)

SAR and the SAR paradox

The basic assumption for all molecule-based hypotheses is that similar molecules have similar activities. This principle is also called Structure–Activity Relationship (SAR). The underlying problem is therefore how to define a small difference on a molecular level, since each kind of activity, e.g. reaction ability, biotransformation ability, solubility, target activity, and so on, might depend on another difference. A good example was given in the bioisosterism review of Patani and LaVoie.

In general, one is more interested in finding strong trends. Hypotheses are usually built from a finite amount of chemical data. The induction principle should therefore be respected, to avoid overfitted hypotheses and the overfitted, useless interpretations of structural/molecular data derived from them.

The SAR paradox refers to the fact that not all similar molecules have similar activities.
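
One way to make the paradox concrete is to look for pairs of compounds that are structurally similar yet differ strongly in activity. The sketch below assumes fingerprints represented as sets of bit indices and uses the Tanimoto coefficient as the similarity measure; the compounds, activities and both thresholds are hypothetical choices for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

# Hypothetical fingerprints (sets of bit indices) and activities (pIC50).
compounds = {
    "A": ({1, 2, 3, 4, 5, 6, 7, 8, 9}, 7.9),
    "B": ({1, 2, 3, 4, 5, 6, 7, 8, 10}, 5.1),
    "C": ({1, 2, 20, 21, 22}, 5.0),
}

def activity_cliffs(data, min_similarity=0.75, min_gap=2.0):
    """Return pairs that are structurally similar but differ strongly in activity."""
    names = sorted(data)
    cliffs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = tanimoto(data[a][0], data[b][0])
            gap = abs(data[a][1] - data[b][1])
            if sim >= min_similarity and gap >= min_gap:
                cliffs.append((a, b, round(sim, 2), round(gap, 2)))
    return cliffs

print(activity_cliffs(compounds))
```

Here A and B share eight of ten bits but differ by almost three log units, so they are reported, while C is dissimilar to both and is not.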

Fragment based (group contribution)

It has been shown that the logP of a compound can be determined by summing the contributions of its fragments. Fragmentary logP values have been determined statistically. This method gives mixed results and is generally not trusted to have accuracy of more than ±0.1 units.
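
The additive idea can be sketched as a table lookup followed by a sum. The fragment names and contribution values below are placeholders invented for the example, not entries from any published fragment table, and the decomposition of the molecule into fragments is assumed to be given.

```python
# Hypothetical fragment contributions to logP; illustrative numbers only,
# not taken from any published fragment table.
FRAGMENT_LOGP = {"CH3": 0.5, "CH2": 0.5, "OH": -1.0, "C6H5": 1.9}

def estimate_logp(fragment_counts, table=FRAGMENT_LOGP):
    """Sum fragment contributions weighted by how often each fragment occurs."""
    missing = [f for f in fragment_counts if f not in table]
    if missing:
        raise KeyError(f"no contribution known for fragments: {missing}")
    return sum(table[f] * n for f, n in fragment_counts.items())

# 1-butanol decomposed as CH3 + 3 x CH2 + OH
butanol = {"CH3": 1, "CH2": 3, "OH": 1}
print(estimate_logp(butanol))
```

Raising an error for unknown fragments, rather than silently treating them as zero, keeps a gap in the table from being mistaken for a prediction.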

Group- or fragment-based QSAR is also known as GQSAR. GQSAR allows flexibility to study various molecular fragments of interest in relation to the variation in biological response. The molecular fragments can be substituents at various substitution sites in a congeneric set of molecules, or can be defined by pre-defined chemical rules in the case of a non-congeneric set. GQSAR also considers cross-term fragment descriptors, which can help identify key fragment interactions that determine variation in activity.
Lead discovery using Fragnomics is an emerging paradigm. In this context, FB-QSAR proves to be a promising strategy for fragment library design and in fragment-to-lead identification endeavours.

3D-QSAR

3D-QSAR refers to the application of force field calculations requiring three-dimensional structures, e.g. based on protein crystallography or molecule superimposition. It uses computed potentials, e.g. the Lennard-Jones potential, rather than experimental constants and is concerned with the overall molecule rather than a single substituent. It examines the steric fields (shape of the molecule) and the electrostatic fields based on the applied energy function.
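
A steric field of this kind can be sketched by placing a probe at grid points around the molecule and summing a 12-6 Lennard-Jones term over all atoms. The coordinates, the grid, and the `epsilon` and `sigma` values below are placeholders chosen for illustration, not parameters from any real force field, and the function names are hypothetical.

```python
import math

def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy for a pair separated by distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def steric_field(atoms, grid_points, epsilon=0.1, sigma=3.4):
    """Sum the probe-atom Lennard-Jones energy at each grid point.

    epsilon and sigma are placeholder values, not force-field parameters.
    """
    field = []
    for point in grid_points:
        energy = 0.0
        for atom in atoms:
            r = math.dist(point, atom)
            # Guard against a probe sitting exactly on an atom.
            energy += lennard_jones(max(r, 1e-6), epsilon, sigma)
        field.append(energy)
    return field

atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]        # hypothetical coordinates in angstrom
grid = [(x, 0.0, 4.0) for x in (-2.0, 0.0, 2.0)]  # three probe positions
print(steric_field(atoms, grid))
```

Each grid value then becomes one column of the descriptor matrix, which is why the resulting data space is large and needs to be reduced before learning.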

The resulting data space is then usually reduced by a subsequent feature extraction step (see also dimensionality reduction). The subsequent learning method can be any of the machine learning methods mentioned in this article, e.g. support vector machines.

In the literature it can often be found that chemists have a preference for partial least squares (PLS) methods, since they apply the feature extraction and induction in one step.
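
To illustrate what "in one step" means, the sketch below computes only the first latent variable of a single-response PLS (PLS1) model in plain Python. The weight vector compresses the descriptors (feature extraction) and the regression coefficient on the resulting score does the learning (induction). The data and function names are hypothetical, and a practical model would extract further components with deflation.

```python
def centered(values):
    """Return the values with their mean subtracted."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def pls1_first_component(descriptor_columns, activity):
    """First PLS1 latent variable: unit weights, compound scores, coefficient q."""
    X = [centered(col) for col in descriptor_columns]
    y = centered(activity)
    # Each descriptor is weighted by its covariance with the activity.
    cov = [sum(a * b for a, b in zip(col, y)) for col in X]
    norm = sum(c * c for c in cov) ** 0.5
    weights = [c / norm for c in cov]
    # Score of each compound: projection of its descriptors onto the weights.
    scores = [sum(w * col[i] for w, col in zip(weights, X)) for i in range(len(y))]
    # Regress the centered activity on the single score.
    q = sum(s * t for s, t in zip(scores, y)) / sum(s * s for s in scores)
    return weights, scores, q

# Hypothetical data: three descriptors (one list per descriptor), four compounds.
columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0], [0.1, 0.1, 0.2, 0.1]]
activity = [1.0, 2.1, 2.9, 4.2]
weights, scores, q = pls1_first_component(columns, activity)
print(weights, q)
```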

On June 18, 2011, the CoMFA patent dropped any restriction on the use of GRID and PLS technologies, and the RCMD team (www.rcmd.it) opened a 3D-QSAR web server (www.3d-qsar.com).

Data mining

For the coding, usually a relatively large number of features or molecular descriptors are calculated, which can lack structural interpretability. In combination with the later applied learning method, or as a preprocessing step, a feature selection problem arises.
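
A simple filter is one way to approach feature selection as a preprocessing step: rank descriptors by the absolute Pearson correlation with the activity and keep the best few. This is only one of many possible strategies, chosen here for brevity; the descriptor names and all values are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return sxy / (sxx * syy) ** 0.5

def select_descriptors(descriptors, activity, keep=2):
    """Return the names of the `keep` descriptors most correlated with activity."""
    ranked = sorted(
        descriptors,
        key=lambda name: abs(pearson(descriptors[name], activity)),
        reverse=True,
    )
    return ranked[:keep]

# Hypothetical descriptor table (one list per descriptor) and activities.
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "n_rings": [2, 2, 2, 2, 2],
    "polar_area": [90.0, 70.0, 75.0, 40.0, 30.0],
    "noise": [0.3, -0.1, 0.2, 0.4, -0.2],
}
activity = [4.1, 4.9, 6.1, 6.9, 8.0]
print(select_descriptors(descriptors, activity))
```

A filter like this ignores redundancy between descriptors, so in practice it is often combined with, or replaced by, selection inside the learning method itself.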

A typical data mining based prediction uses e.g. support vector machines, decision trees, or neural networks for inducing a predictive learning model.

Molecule mining approaches, a special case of structured data mining approaches, apply a similarity matrix based prediction or an automatic fragmentation scheme into molecular substructures. Furthermore, there also exist approaches using maximum common subgraph searches or graph kernels.

Judging the quality of QSAR models

QSARs represent predictive models derived from the application of statistical tools correlating biological activity (including desirable therapeutic effects and undesirable side effects) of chemicals (drugs/toxicants/environmental pollutants) with descriptors representative of molecular structure and/or properties. QSARs are being applied in many disciplines, for example risk assessment, toxicity prediction, and regulatory decisions, in addition to drug discovery and lead optimization. Obtaining a good quality QSAR model depends on many factors, such as the quality of biological data, the choice of descriptors and statistical methods. Any QSAR modeling should ultimately lead to statistically robust models capable of making accurate and reliable predictions of biological activities of new compounds.

For validation of QSAR models usually four strategies are adopted:
  1. internal validation or cross-validation;
  2. validation by dividing the data set into training and test compounds;
  3. true external validation by application of the model to external data; and
  4. data randomization or Y-scrambling.
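
Strategies 1 and 4 from the list above can be sketched together for a one-descriptor linear model: leave-one-out cross-validation gives a cross-validated q², and Y-scrambling repeats the same procedure with shuffled activities to see what score chance alone produces. The data set, the function names, and the number of scrambling rounds are assumptions made for this example.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def loo_q2(x, y):
    """Leave-one-out q2 = 1 - PRESS / total sum of squares."""
    press = 0.0
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    mean_y = sum(y) / len(y)
    return 1.0 - press / sum((v - mean_y) ** 2 for v in y)

def y_scrambling(x, y, rounds=100, seed=0):
    """Mean q2 obtained after randomly permuting the activities."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        shuffled = y[:]
        rng.shuffle(shuffled)
        scores.append(loo_q2(x, shuffled))
    return sum(scores) / len(scores)

# Hypothetical descriptor values and activities for eight compounds.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
print("q2:", loo_q2(x, y))
print("mean scrambled q2:", y_scrambling(x, y))
```

A model worth keeping should score clearly better on the real activities than on the scrambled ones; a small gap suggests the apparent fit is a chance correlation.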


The success of any QSAR model depends on the accuracy of the input data, the selection of appropriate descriptors and statistical tools, and most importantly the validation of the developed model. Validation is the process by which the reliability and relevance of a procedure are established for a specific purpose. Leave-one-out cross-validation generally leads to an overestimation of predictive capacity, and even with external validation, no one can be sure whether the selection of training and test sets was manipulated to maximize the predictive capacity of the model being published. Aspects of QSAR model validation that need attention include the methods for selecting training set compounds, the choice of training set size, and the impact of variable selection for training set models on the quality of prediction. Development of novel validation parameters for judging the quality of QSAR models is also important.

Chemical

One of the first historical QSAR applications was to predict boiling points.

It is well known, for instance, that within a particular family of chemical compounds, especially in organic chemistry, there are strong correlations between structure and observed properties. A simple example is the relationship between the number of carbons in alkanes and their boiling points. There is a clear trend of increasing boiling point with an increasing number of carbons, and this serves as a means for predicting the boiling points of higher alkanes.
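
The alkane trend can be sketched as a straight-line fit. The boiling points below are approximate, rounded literature values for the unbranched alkanes pentane to octane, and the reference value for nonane is likewise approximate; the choice of a linear model over this narrow range is an assumption made for simplicity.

```python
# Approximate normal boiling points (degrees Celsius) of n-alkanes, rounded.
carbons = [5, 6, 7, 8]
boiling_points = [36.1, 68.7, 98.4, 125.6]

n = len(carbons)
mean_c = sum(carbons) / n
mean_bp = sum(boiling_points) / n
slope = (
    sum((c - mean_c) * (bp - mean_bp) for c, bp in zip(carbons, boiling_points))
    / sum((c - mean_c) ** 2 for c in carbons)
)
intercept = mean_bp - slope * mean_c

# Extrapolate one carbon beyond the training range.
predicted_nonane = slope * 9 + intercept
print(f"slope: {slope:.2f} degrees C per carbon")
print(f"predicted bp of nonane: {predicted_nonane:.1f} degrees C (measured: about 150.8)")
```

The fit overshoots nonane by a few degrees because the increase per added carbon shrinks slightly along the series, a reminder that extrapolating beyond the training range is less reliable than interpolating within it.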

Other applications that remain of interest are the Hammett equation, the Taft equation and pKa prediction methods.
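
As a small worked sketch of the Hammett relationship, log10(K/K0) = ρσ, the snippet below estimates the pKa of substituted benzoic acids from substituent constants. The σ values are approximate literature constants and should be treated as illustrative; ρ is 1 by definition for the ionization of benzoic acids in water at 25 °C, and the function name is hypothetical.

```python
PKA_BENZOIC_ACID = 4.20  # benzoic acid in water at 25 degrees C
RHO = 1.0                # reaction constant; 1 by definition for this reference reaction

# Approximate literature substituent constants (sigma); illustrative only.
SIGMA = {
    "H": 0.0,
    "p-NO2": 0.78,
    "m-NO2": 0.71,
    "p-Cl": 0.23,
    "p-CH3": -0.17,
    "p-OCH3": -0.27,
}

def predicted_pka(substituent, pka_parent=PKA_BENZOIC_ACID, rho=RHO):
    """Hammett: log10(K/K0) = rho * sigma, hence pKa = pKa0 - rho * sigma."""
    return pka_parent - rho * SIGMA[substituent]

for name in SIGMA:
    print(name, round(predicted_pka(name), 2))
```

Electron-withdrawing groups (positive σ) lower the predicted pKa, and electron-donating groups (negative σ) raise it.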

Biological

The biological activity of molecules is usually measured in assays to establish the level of inhibition of particular signal transduction or metabolic pathways. Chemicals can also be biologically active by being toxic. Drug discovery often involves the use of QSAR to identify chemical structures that could have good inhibitory effects on specific targets and have low toxicity (non-specific activity). Of special interest is the prediction of the partition coefficient log P, which is an important measure used in identifying "druglikeness" according to Lipinski's Rule of Five.
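
The role of log P in the Rule of Five can be sketched as a simple check. The thresholds are the commonly cited ones (molecular weight at most 500 Da, log P at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors, with one violation tolerated); the function names and the two example compounds are hypothetical.

```python
def rule_of_five_violations(mol_weight, log_p, h_donors, h_acceptors):
    """List which of the four Rule of Five criteria a compound violates."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if log_p > 5:
        violations.append("log P > 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations

def is_druglike(mol_weight, log_p, h_donors, h_acceptors):
    """A compound passes if it violates at most one criterion."""
    return len(rule_of_five_violations(mol_weight, log_p, h_donors, h_acceptors)) <= 1

# Hypothetical property values for two compounds.
small = dict(mol_weight=350.0, log_p=2.5, h_donors=2, h_acceptors=5)
greasy = dict(mol_weight=650.0, log_p=6.2, h_donors=3, h_acceptors=8)
print(is_druglike(**small), rule_of_five_violations(**small))
print(is_druglike(**greasy), rule_of_five_violations(**greasy))
```

In a QSAR setting the `log_p` input would typically be a predicted value, which is why the accuracy of log P models matters for this kind of filter.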

While many quantitative structure–activity relationship analyses involve the interactions of a family of molecules with an enzyme or receptor binding site, QSAR can also be used to study the interactions between the structural domains of proteins. Protein-protein interactions can be quantitatively analyzed for structural variations resulting from site-directed mutagenesis.

It is part of the machine learning method to reduce the risk of a SAR paradox, especially taking into account that only a finite amount of data is available (see also MVUE). In general, all QSAR problems can be divided into coding and learning.

Applicability domain

As the use of (Q)SAR models for chemical risk management increases steadily and they are also used for regulatory purposes (in the EU: Registration, Evaluation, Authorisation and Restriction of Chemicals), it is of crucial importance to be able to assess the reliability of predictions. The chemical descriptor space spanned by a particular training set of chemicals is called the Applicability Domain. It offers the opportunity to assess whether a compound can be reliably predicted.
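
One of the simplest ways to approximate the descriptor space spanned by a training set is a bounding box: a new compound counts as inside the domain only if each of its descriptors lies within the range seen during training. This is an assumption chosen for brevity, not the only or the standard definition; the descriptor values and function names are hypothetical.

```python
def descriptor_ranges(training_set):
    """Per-descriptor (min, max) over the training compounds."""
    return [(min(column), max(column)) for column in zip(*training_set)]

def in_domain(compound, ranges):
    """True if every descriptor of the compound lies within the training range."""
    return all(low <= value <= high for value, (low, high) in zip(compound, ranges))

# Hypothetical training set: (logP, molecular weight) per compound.
training_set = [(1.0, 200.0), (3.5, 320.0), (2.2, 410.0)]
ranges = descriptor_ranges(training_set)

print(in_domain((2.0, 300.0), ranges))  # inside both ranges
print(in_domain((6.0, 300.0), ranges))  # logP outside the training range
```

A bounding box ignores empty regions inside the ranges, so distance-based or density-based measures are often used when a tighter description of the domain is needed.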

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 