# Semantic similarity

Semantic similarity or semantic relatedness is a concept whereby a set of documents or terms within term lists are assigned a metric based on the likeness of their meaning or semantic content.

Concretely, this can be achieved, for instance, by defining a topological similarity, using ontologies to define a distance between words (a naive metric for terms arranged as nodes in a directed acyclic graph, such as a hierarchy, would be the minimal distance, counted in separating edges, between the two term nodes), or by statistical means, such as a vector space model that correlates words and textual contexts from a suitable text corpus (co-occurrence).
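
The naive edge-counting metric mentioned above can be sketched as a breadth-first search over the hierarchy. This is only a minimal illustration: the toy hierarchy `PARENTS` and the function names are assumptions made for the example, and edges are treated as undirected so that two sibling terms can be connected through their shared parent.

```python
from collections import deque

# Hypothetical toy hierarchy (not taken from any real ontology):
# each term maps to the list of its direct parents.
PARENTS = {
    "animal": [],
    "mammal": ["animal"],
    "bird": ["animal"],
    "dog": ["mammal"],
    "cat": ["mammal"],
    "sparrow": ["bird"],
}


def build_undirected(parents):
    """Turn child -> parents links into an undirected adjacency map."""
    adjacency = {}
    for child, direct_parents in parents.items():
        adjacency.setdefault(child, set())
        for parent in direct_parents:
            adjacency.setdefault(parent, set())
            adjacency[child].add(parent)
            adjacency[parent].add(child)
    return adjacency


def edge_distance(parents, term_a, term_b):
    """Return the minimal number of edges separating two terms, or None."""
    adjacency = build_undirected(parents)
    if term_a not in adjacency or term_b not in adjacency:
        raise KeyError("both terms must be nodes of the hierarchy")
    queue = deque([(term_a, 0)])
    seen = {term_a}
    while queue:
        node, distance = queue.popleft()
        if node == term_b:
            return distance
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # the two terms are not connected


print(edge_distance(PARENTS, "dog", "cat"))      # dog -> mammal -> cat
print(edge_distance(PARENTS, "dog", "sparrow"))  # path goes through "animal"
```

A smaller distance is read as higher similarity; how the distance is converted into a similarity score is a separate design choice not fixed by this sketch.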

## Taxonomy

The concept of semantic similarity is more specific than semantic relatedness, as the latter includes concepts such as antonymy and meronymy, while similarity does not. However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all mean, "How much does term A have to do with term B?" The answer to this question is usually a number between -1 and 1, or between 0 and 1, where 1 signifies extremely high similarity/relatedness, and 0 signifies little-to-none.
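
One common way such a score arises is the cosine of the angle between two term vectors. The sketch below is an illustration only, with made-up vectors and an assumed function name `cosine_similarity`: for vectors of non-negative counts the result falls between 0 and 1, while vectors with signed components can yield values down to -1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical co-occurrence counts (non-negative), so the score is in [0, 1].
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # nothing shared

# Hypothetical signed components, so the score can be negative.
print(cosine_similarity([1, -1], [-1, 1]))      # opposite direction
```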

## Visualisation

An intuitive way of visualising the semantic similarity of terms is to group closely related terms together and to space more distantly related ones further apart. This is also common practice, if sometimes subconscious, in mind maps and concept maps.

### Biomedical Informatics

Semantic similarity measures have been applied and developed in biomedical ontologies, namely the Gene Ontology (GO). They are mainly used to compare genes and proteins based on the similarity of their functions rather than on their sequence similarity, but they are also being extended to other bioentities, such as chemical compounds and diseases.

These comparisons can be done using tools freely available on the web:
• ProteInOn can be used to find interacting proteins, find assigned GO terms, calculate the functional semantic similarity of UniProt proteins, and get the information content and calculate the functional semantic similarity of GO terms.
• CMPSim provides a functional similarity measure between chemical compounds and metabolic pathways using ChEBI-based semantic similarity measures.
• CESSM provides a tool for the automated evaluation of GO-based semantic similarity measures.

### GeoInformatics

Similarity is also applied to find similar geographic features or feature types.

### Linguistics

Several metrics use WordNet: (+) humanly constructed; (−) humanly constructed (not automatically learned), cannot measure relatedness between multi-word terms, non-incremental vocabulary

### Topological similarity

There are essentially two types of approaches that calculate topological similarity between ontological concepts:
• Edge-based: which use the edges and their types as the data source;
• Node-based: in which the main data sources are the nodes and their properties.

Other measures calculate the similarity between ontological instances:
• Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent
• Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent

Some examples:

• IntelliGO:

#### Node-based

• Resnik: based on the notion of information content
• Lin
• Jiang and Conrath
• DiShIn (Disjunctive Shared Information between Ontology Concepts); another alternative is GraSM (Graph-based Similarity Measure)
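
The first three measures in this list can be sketched from their usual textbook definitions, where the information content of a concept is IC(c) = -log p(c) and the comparison is driven by the most informative common ancestor of the two concepts. The ontology `PARENTS` and the probabilities `PROB` below are hypothetical values invented for the example, and the function names are likewise assumptions.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents.
PARENTS = {
    "entity": set(),
    "animal": {"entity"},
    "plant": {"entity"},
    "dog": {"animal"},
    "cat": {"animal"},
    "oak": {"plant"},
}

# Hypothetical probabilities p(c) of meeting a concept (or a descendant)
# in some annotation corpus; the root has probability 1.
PROB = {"entity": 1.0, "animal": 0.5, "plant": 0.5,
        "dog": 0.25, "cat": 0.25, "oak": 0.5}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    found, stack = set(), [term]
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(PARENTS[node])
    return found


def ic(term):
    """Information content: rarer concepts are more informative."""
    return -math.log(PROB[term])


def mica_ic(a, b):
    """IC of the most informative common ancestor of a and b."""
    return max(ic(c) for c in ancestors(a) & ancestors(b))


def resnik(a, b):
    return mica_ic(a, b)


def lin(a, b):
    denominator = ic(a) + ic(b)
    if denominator == 0:
        return 0.0  # convention chosen here when both terms carry no information
    return 2 * mica_ic(a, b) / denominator


def jiang_conrath_distance(a, b):
    return ic(a) + ic(b) - 2 * mica_ic(a, b)


print(resnik("dog", "cat"), lin("dog", "cat"), jiang_conrath_distance("dog", "cat"))
print(resnik("dog", "oak"), lin("dog", "oak"), jiang_conrath_distance("dog", "oak"))
```

Note that Resnik and Lin return similarities (larger means more alike), whereas the Jiang and Conrath formula as written here is a distance (smaller means more alike).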

#### Pairwise

• maximum of the pairwise similarities
• composite average in which only the best-matching pairs are considered (best-match average)
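
Both pairwise strategies can be sketched on top of any term-level similarity function. In the example below the term names, the lookup table `TOY_SCORES` and the function `toy_term_sim` are hypothetical stand-ins for a real measure, and the best-match average is taken as the mean of the two directional averages, which is one of several variants in use.

```python
# Hypothetical term-to-term similarities, standing in for a real measure.
TOY_SCORES = {
    frozenset(("A", "B")): 0.2,
    frozenset(("A", "C")): 0.5,
    frozenset(("B", "C")): 0.4,
}


def toy_term_sim(a, b):
    """Symmetric lookup; identical terms are maximally similar."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


def similarity_matrix(terms_a, terms_b, term_sim):
    """All pairwise similarities between two non-empty term lists."""
    if not terms_a or not terms_b:
        raise ValueError("both instances need at least one term")
    return [[term_sim(a, b) for b in terms_b] for a in terms_a]


def max_similarity(terms_a, terms_b, term_sim):
    """Maximum of the pairwise similarities."""
    matrix = similarity_matrix(terms_a, terms_b, term_sim)
    return max(max(row) for row in matrix)


def best_match_average(terms_a, terms_b, term_sim):
    """Average over best-matching pairs, taken in both directions."""
    matrix = similarity_matrix(terms_a, terms_b, term_sim)
    row_best = [max(row) for row in matrix]        # best partner for each term of A
    col_best = [max(col) for col in zip(*matrix)]  # best partner for each term of B
    row_avg = sum(row_best) / len(row_best)
    col_avg = sum(col_best) / len(col_best)
    return (row_avg + col_avg) / 2


instance_1 = ["A", "B"]  # e.g. terms annotating one gene product
instance_2 = ["B", "C"]  # terms annotating another
print(max_similarity(instance_1, instance_2, toy_term_sim))
print(best_match_average(instance_1, instance_2, toy_term_sim))
```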

#### Groupwise

• Jaccard index
• simGIC
• simLP
• simUI
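
As a rough sketch of the groupwise idea, the example below computes a plain Jaccard index, a simUI-style score (Jaccard over term sets extended with their ancestors) and a simGIC-style score (the same ratio weighted by information content). The formulas follow the commonly cited definitions, but the ontology `PARENTS`, the `IC` values and all function names are assumptions made for the illustration.

```python
# Hypothetical toy ontology: term -> set of direct parents.
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "a1": {"a"},
    "a2": {"a"},
    "b1": {"b"},
}

# Hypothetical information content values (the root carries none).
IC = {"root": 0.0, "a": 1.0, "b": 1.0, "a1": 2.0, "a2": 2.0, "b1": 2.0}


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def with_ancestors(terms, parents):
    """Extend a set of terms with all of their ancestors."""
    found, stack = set(), list(terms)
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents[node])
    return found


def sim_ui(terms_a, terms_b, parents):
    """Jaccard index over the ancestor-extended term sets."""
    return jaccard(with_ancestors(terms_a, parents),
                   with_ancestors(terms_b, parents))


def sim_gic(terms_a, terms_b, parents, ic):
    """Like sim_ui, but each shared or unioned term is weighted by its IC."""
    ext_a = with_ancestors(terms_a, parents)
    ext_b = with_ancestors(terms_b, parents)
    union_weight = sum(ic[t] for t in ext_a | ext_b)
    if union_weight == 0:
        return 0.0
    return sum(ic[t] for t in ext_a & ext_b) / union_weight


print(jaccard({"x", "y"}, {"y", "z"}))
print(sim_ui({"a1"}, {"a2"}, PARENTS))
print(sim_gic({"a1"}, {"a2"}, PARENTS, IC))
```

Weighting by information content lowers the contribution of generic shared ancestors such as the root, which is why the simGIC-style score for the two sibling terms is smaller than the simUI-style one in this toy setup.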

### Statistical similarity

• LSA (Latent semantic analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
• PMI (Pointwise mutual information) (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
• SOC-PMI (Second-order co-occurrence pointwise mutual information) (+) sorts lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
• GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
• ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
• NGD (Normalized Google distance) (+) large vocab, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007)
• ESA (Explicit Semantic Analysis) based on Wikipedia and the ODP
• n° of Wikipedia (noW), inspired by the game Six Degrees of Wikipedia, is a distance metric based on the hierarchical structure of Wikipedia. A directed acyclic graph is first constructed, and Dijkstra's shortest path algorithm is then employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
• VGEM (Vector Generation of an Explicitly-defined Multidimensional Semantic Space) (+) incremental vocab, can compare multi-word terms; (−) performance depends on choosing specific dimensions
• BLOSSOM (Best path Length On a Semantic Self-Organizing Map) (+) uses a Self Organizing Map to reduce high dimensional spaces, can use different vector representations (VGEM or word-document matrix), provides 'concept path linking' from one word to another; (−) highly experimental, requires nontrivial SOM calculation
• SimRank
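
Two of the count-based measures in this list, PMI and the normalized Google distance, reduce to short formulas once occurrence counts are available. The sketch below uses the standard definitions; the counts passed in are hypothetical numbers rather than real search-engine results, and the function names and zero-count conventions are assumptions made for the example.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) from raw counts.

    count_xy: contexts containing both terms
    count_x, count_y: contexts containing each term
    total: number of contexts overall
    """
    if count_xy == 0:
        return float("-inf")  # the terms never co-occur
    return math.log2(count_xy * total / (count_x * count_y))


def normalized_google_distance(hits_x, hits_y, hits_xy, index_size):
    """Normalized Google distance from hit counts; smaller means more related."""
    if hits_xy == 0:
        return float("inf")  # no page mentions both terms
    log_x, log_y, log_xy = math.log(hits_x), math.log(hits_y), math.log(hits_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(index_size) - min(log_x, log_y))


# Hypothetical counts: terms co-occurring exactly as often as chance predicts.
print(pmi(count_xy=10, count_x=100, count_y=100, total=1000))

# Hypothetical counts: terms co-occurring five times more often than chance.
print(pmi(count_xy=50, count_x=100, count_y=100, total=1000))

# Hypothetical hit counts for two terms and their conjunction.
print(normalized_google_distance(hits_x=1000, hits_y=100, hits_xy=10,
                                 index_size=10**6))
```

A PMI of zero indicates independence and positive values indicate association, whereas the normalized Google distance is zero for terms that always appear together and grows as they appear together less often.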

## Software

• WordNet-Similarity, an open source package for computing the similarity and relatedness of concepts found in WordNet
• UMLS-Similarity, an open source package for computing the similarity and relatedness of concepts found in the Unified Medical Language System (UMLS)

## Web Services

## See also

• Terminology extraction
• Coherence (linguistics)
• Analogy
• Semantic differential