Semantic similarity

Semantic similarity or semantic relatedness is a concept whereby a set of documents or terms within term lists are assigned a metric based on the likeness of their meaning / semantic content.

Concretely, this can be achieved for instance by defining a topological similarity, by using ontologies to define a distance between words (a naive metric for terms arranged as nodes in a directed acyclic graph like a hierarchy would be the minimal distance, in separating edges, between the two term nodes), or using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus (co-occurrence).
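The naive edge-counting metric mentioned above can be sketched in a few lines. This is only an illustration: the toy hierarchy and the function name edge_distance are invented here, and edges are treated as undirected so that a path may pass through a common ancestor.

```python
from collections import deque

# Hypothetical toy hierarchy: each term maps to its parent terms (a DAG).
PARENTS = {
    "dog": ["mammal"],
    "cat": ["mammal"],
    "mammal": ["animal"],
    "bird": ["animal"],
    "animal": [],
}

def edge_distance(parents, a, b):
    """Return the minimal number of edges separating a and b, or None."""
    # Build an undirected adjacency map so paths can go up and down.
    adj = {node: set() for node in parents}
    for child, ps in parents.items():
        for p in ps:
            adj[child].add(p)
            adj.setdefault(p, set()).add(child)
    if a not in adj or b not in adj:
        return None
    # Breadth-first search reaches nodes in order of increasing edge count.
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(edge_distance(PARENTS, "dog", "cat"))
print(edge_distance(PARENTS, "dog", "bird"))
```

A smaller value means the two terms sit closer together in the hierarchy; turning the distance into a similarity score (for example by inverting it) is a separate design choice not shown here.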

Taxonomy

The concept of semantic similarity is more specific than semantic relatedness, as the latter includes concepts such as antonymy and meronymy, while similarity does not. However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all mean, "How much does term A have to do with term B?" The answer to this question is usually a number between -1 and 1, or between 0 and 1, where 1 signifies extremely high similarity/relatedness, and 0 signifies little-to-none.
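As one hedged illustration of where such a score can come from, the sketch below compares two terms by the cosine of their context-count vectors. The counts are invented for the example. With non-negative counts the cosine falls between 0 and 1; vectors with negative components can give values down to -1.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Invented counts of how often each term appears near a context word.
car = {"road": 4, "engine": 3, "drive": 5}
truck = {"road": 3, "engine": 4, "drive": 2, "cargo": 6}
banana = {"fruit": 5, "yellow": 3}

print(cosine_similarity(car, truck))   # shares contexts, so above 0
print(cosine_similarity(car, banana))  # no shared contexts
```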

Visualisation

An intuitive way of visualising the semantic similarity of terms is by grouping closely related terms together and spacing more distantly related ones further apart. This is also common, if sometimes subconscious, practice for mind maps and concept maps.

Biomedical Informatics

Semantic similarity measures have been applied and developed in biomedical ontologies, namely, the Gene Ontology (GO). They are mainly used to compare genes and proteins based on the similarity of their functions rather than on their sequence similarity, but they are also being extended to other bioentities, such as chemical compounds and diseases.

These comparisons can be done using tools freely available on the web:

ProteInOn can be used to find interacting proteins, find assigned GO terms and calculate the functional semantic similarity of UniProt proteins, and to get the information content and calculate the functional semantic similarity of GO terms.
CMPSim provides a functional similarity measure between chemical compounds and metabolic pathways using ChEBI-based semantic similarity measures.
CESSM provides a tool for the automated evaluation of GO-based semantic similarity measures.

GeoInformatics

Similarity is also applied to find similar geographic features or feature types:

SIM-DL similarity server can be used to compute similarities between concepts stored in geographic feature type ontologies.
Geo-Net-PT Similarity Calculator can be used to compute how well related two geographic concepts are in the Geo-Net-PT ontology.

Linguistics

Several metrics use WordNet: (+) humanly constructed; (−) humanly constructed (not automatically learned), cannot measure relatedness between multi-word terms, non-incremental vocabulary

Topological similarity

There are essentially two types of approaches that calculate topological similarity between ontological concepts:

Edge-based: which use the edges and their types as the data source;
Node-based: in which the main data sources are the nodes and their properties.

Other measures calculate the similarity between ontological instances:

Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent
Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent

Some examples:

Edge-based

IntelliGO:

Node-based

Resnik: based on the notion of information content
Lin
Jiang and Conrath
DiShIn: Disjunctive Shared Information between Ontology Concepts
Other alternative: GraSM (Graph-based Similarity Measure)
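The first three node-based measures are commonly written in terms of information content, IC(c) = -log p(c), evaluated at the most informative common ancestor of the two concepts. The sketch below follows those textbook formulas on an invented taxonomy with invented counts. It assumes a single root so that any two concepts share at least one ancestor, and it does not implement DiShIn or GraSM. Note that Jiang and Conrath's measure is a distance, so smaller means more similar.

```python
import math

# Hypothetical taxonomy (child -> parents) and invented corpus counts.
PARENTS = {
    "animal": [],
    "mammal": ["animal"],
    "bird": ["animal"],
    "dog": ["mammal"],
    "cat": ["mammal"],
}
COUNTS = {"animal": 2, "mammal": 2, "bird": 4, "dog": 4, "cat": 4}

def ancestors(concept):
    """Return the concept itself plus every concept reachable upward."""
    found = {concept}
    stack = [concept]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def information_content(concept):
    """IC(c) = -log p(c), where p(c) counts c and all of its descendants."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return -math.log(subsumed / total)

def shared_ic(a, b):
    """IC of the most informative common ancestor of a and b."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(c) for c in common)

def resnik(a, b):
    return shared_ic(a, b)

def lin(a, b):
    denom = information_content(a) + information_content(b)
    # Both concepts at the root carry no information; treat them as identical.
    return 2 * shared_ic(a, b) / denom if denom else 1.0

def jiang_conrath_distance(a, b):
    return information_content(a) + information_content(b) - 2 * shared_ic(a, b)

print(resnik("dog", "cat"), lin("dog", "cat"), jiang_conrath_distance("dog", "cat"))
```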

Pairwise

maximum of the pairwise similarities
composite average in which only the best-matching pairs are considered (best-match average)
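Both pairwise strategies can be sketched over any term-to-term similarity function. The version of the best-match average below (mean of the two directional averages) is one common formulation; other variants exist. The similarity table and the helper toy_sim are invented for illustration.

```python
def max_pairwise(terms_a, terms_b, sim):
    """Highest similarity found among all cross pairs."""
    return max(sim(a, b) for a in terms_a for b in terms_b)

def best_match_average(terms_a, terms_b, sim):
    """Average each term's best match, taken in both directions."""
    best_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2

# Invented similarities between the terms of two annotated instances.
TABLE = {
    ("t1", "t3"): 0.9, ("t1", "t4"): 0.2,
    ("t2", "t3"): 0.1, ("t2", "t4"): 0.5,
}

def toy_sim(a, b):
    return TABLE[(a, b)]

print(max_pairwise(["t1", "t2"], ["t3", "t4"], toy_sim))
print(best_match_average(["t1", "t2"], ["t3", "t4"], toy_sim))
```

The maximum rewards a single strong pair, whereas the best-match average also takes into account terms that have no good counterpart.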

Groupwise

Jaccard index
simGIC
simLP
simUI
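The Jaccard index itself is the size of the intersection divided by the size of the union. As usually described, simUI applies the same ratio to the two term sets extended with their ancestors, and simGIC additionally weights each term by its information content. The sketch below shows the plain and the weighted ratio on invented sets and weights; building the ancestor-extended sets is assumed to happen beforehand, and simLP is not covered.

```python
def jaccard(set_a, set_b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def weighted_jaccard(set_a, set_b, weight):
    """Same ratio, but each term contributes its weight instead of 1."""
    a, b = set(set_a), set(set_b)
    total = sum(weight[t] for t in a | b)
    return sum(weight[t] for t in a & b) / total if total else 1.0

# Invented term sets and per-term weights (e.g. information content).
terms_x = {"a", "b", "c"}
terms_y = {"b", "c", "d"}
weights = {"a": 0.0, "b": 1.0, "c": 3.0, "d": 4.0}

print(jaccard(terms_x, terms_y))
print(weighted_jaccard(terms_x, terms_y, weights))
```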

Statistical similarity

LSA (Latent semantic analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
PMI (Pointwise mutual information) (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
SOC-PMI (Second-order co-occurrence pointwise mutual information) (+) sort lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
NGD (Normalized Google distance) (+) large vocab, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007)
ESA (Explicit Semantic Analysis) based on Wikipedia and the ODP (Open Directory Project)
n° of Wikipedia (noW), inspired by the game Six Degrees of Wikipedia, is a distance metric based on the hierarchical structure of Wikipedia. A directed acyclic graph is first constructed, and Dijkstra's shortest path algorithm is then employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
VGEM (Vector Generation of an Explicitly-defined Multidimensional Semantic Space) (+) incremental vocab, can compare multi-word terms; (−) performance depends on choosing specific dimensions
BLOSSOM (Best path Length On a Semantic Self-Organizing Map) (+) uses a Self Organizing Map to reduce high dimensional spaces, can use different vector representations (VGEM or word-document matrix), provides 'concept path linking' from one word to another; (−) highly experimental, requires nontrivial SOM calculation
SimRank
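Two of the count-based measures in the list above, PMI and NGD, reduce to short formulas over occurrence counts. The sketch below uses the standard definitions, with NGD as given by Cilibrasi & Vitanyi (2007). The function and parameter names are chosen here for illustration, and how the counts are obtained (a corpus scan or search-engine hit counts) is left open.

```python
import math

def pmi(count_x, count_y, count_xy, total):
    """log2( p(x, y) / (p(x) * p(y)) ), estimated from raw counts."""
    if count_xy == 0:
        return float("-inf")  # the terms never co-occur
    return math.log2(count_xy * total / (count_x * count_y))

def normalized_google_distance(hits_x, hits_y, hits_xy, index_size):
    """NGD: 0 for terms that always co-occur, larger for unrelated terms."""
    if hits_xy == 0:
        return float("inf")  # the terms never co-occur
    log_x, log_y = math.log(hits_x), math.log(hits_y)
    log_xy = math.log(hits_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(index_size) - min(log_x, log_y))

# Invented counts: two terms seen 100 times each, 10 times together,
# in a collection of 10,000 pages.
print(pmi(100, 100, 10, 10_000))
print(normalized_google_distance(100, 100, 10, 10_000))
```

PMI is an association score (higher means more related), whereas NGD is a distance (lower means more related), so the two are not directly comparable without a conversion.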

Software

WordNet-Similarity, an open source package for computing the similarity and relatedness of concepts found in WordNet
UMLS-Similarity, an open source package for computing the similarity and relatedness of concepts found in the Unified Medical Language System (UMLS)

Web Services

Measures of Semantic Relatedness (MSR)
WordNet-Similarity, a web interface to WordNet-Similarity
UMLS-Similarity, a web interface to UMLS-Similarity

External links

List of related literature
WordNet::Similarity (using WordNet as an ontology)
WordNet Explorer (interactive graphic WordNet database editor)
Similarity-based Learning Methods for the Semantic Web (C. d'Amato, PhD Thesis)
Survey on Semantic Similarity Measures (C. d'Amato, S. Staab, N. Fanizzi, EKAW 2008, Springer-Verlag)
Algorithm, Implementation and Application of the SIM-DL Similarity Server (Introduction to the SIM-DL Similarity Server)

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.