DiShIn

DiShIn (Disjunctive Shared Information) is a method that exploits multiple inheritance when calculating the shared information content between two ontology concepts being compared by node-based semantic similarity
measures. DiShIn redefines the shared information content between two concepts as the average information content of all their disjunctive common ancestors.
A common ancestor is disjunctive if the difference between the number of distinct paths to it from each of the two concepts
differs from that of every more informative common ancestor. In other words,
a disjunctive ancestor is the most informative ancestor representing a given set of parallel interpretations.
DiShIn improves on GraSM
in terms of computational efficiency and in the management of parallel interpretations.
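
The definition above can be written compactly. This is only a sketch of the notation, and the symbols are assumptions introduced here for illustration: CA(c1, c2) is the set of common ancestors of the two concepts, P(c, a) is the number of distinct paths from concept c to ancestor a, PD is the path difference, and DCA is the set of disjunctive common ancestors.

```latex
\begin{aligned}
\mathrm{PD}(c_1, c_2, a) &= \lvert P(c_1, a) - P(c_2, a) \rvert \\
\mathrm{DCA}(c_1, c_2) &= \{\, a \in \mathrm{CA}(c_1, c_2) :
    \forall a' \in \mathrm{CA}(c_1, c_2),\;
    \mathrm{IC}(a') > \mathrm{IC}(a) \Rightarrow
    \mathrm{PD}(c_1, c_2, a') \neq \mathrm{PD}(c_1, c_2, a) \,\} \\
\mathrm{DiShIn}(c_1, c_2) &= \frac{1}{\lvert \mathrm{DCA}(c_1, c_2) \rvert}
    \sum_{a \in \mathrm{DCA}(c_1, c_2)} \mathrm{IC}(a)
\end{aligned}
```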

Example

For example, palladium, platinum, silver and gold are considered to be precious metals, while silver, gold and copper are considered to be coinage metals. Thus, we have:

                    metal
                  /       \
            precious     coinage
           /  |   \  \    /  /  \
          /   |    \  gold  /    \
 palladium platinum  silver     copper

When calculating the semantic similarity between platinum and gold,
DiShIn starts by calculating the difference in the number of paths for each of their common ancestors:
gold -> coinage -> metal
gold -> precious -> metal
platinum -> precious -> metal

gold -> precious
platinum -> precious

For metal there are two paths from gold and one from platinum, giving a path difference of one.
For precious there is one path from each concept, giving a path difference of zero.

Since their path differences are distinct, both common ancestors, metal and precious, are considered to be disjunctive common ancestors.
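
The path counting above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PARENTS` dictionary encodes the example hierarchy, and the helper names `count_paths`, `ancestors` and `path_differences` are hypothetical.

```python
from functools import lru_cache

# Example hierarchy from the text: each concept maps to its direct parents.
PARENTS = {
    "metal": (),
    "precious": ("metal",),
    "coinage": ("metal",),
    "palladium": ("precious",),
    "platinum": ("precious",),
    "silver": ("precious", "coinage"),
    "gold": ("precious", "coinage"),
    "copper": ("coinage",),
}


@lru_cache(maxsize=None)
def count_paths(concept, ancestor):
    """Number of distinct upward paths from concept to ancestor."""
    if concept == ancestor:
        return 1
    return sum(count_paths(parent, ancestor) for parent in PARENTS[concept])


def ancestors(concept):
    """All proper ancestors of a concept (the concept itself is excluded)."""
    found = set()
    for parent in PARENTS[concept]:
        found.add(parent)
        found |= ancestors(parent)
    return found


def path_differences(c1, c2):
    """Map each common ancestor to |paths from c1 - paths from c2|."""
    common = ancestors(c1) & ancestors(c2)
    return {a: abs(count_paths(c1, a) - count_paths(c2, a)) for a in sorted(common)}


print(path_differences("platinum", "gold"))  # {'metal': 1, 'precious': 0}
```

Counting paths recursively through the parents keeps the sketch short; for a large ontology the counts would more likely be precomputed, as the implementation section of this article suggests.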

When calculating the semantic similarity between platinum and palladium,
DiShIn starts by calculating the difference in the number of paths for each of their common ancestors:
palladium -> precious -> metal
platinum -> precious -> metal

palladium -> precious
platinum -> precious

For both metal and precious, we have only one path from each concept, so we have a path difference of zero for both common ancestors.
Thus, only the common ancestor precious (the most informative) is considered to be a disjunctive common ancestor.

Node-based semantic similarity measures are proportional to the average information content of the disjunctive common ancestors: metal and precious in the case of platinum and gold, and precious alone in the case of platinum and palladium. Since precious is more informative than metal, the average is higher for platinum and palladium, which means that for DiShIn palladium and platinum are more similar than platinum and gold.
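
The selection and averaging steps can be sketched as follows. The information content values are an assumption made only for this illustration: each concept's IC is taken as the negative logarithm of the fraction of leaf concepts it covers, which is not something the example above specifies. The function names are hypothetical as well.

```python
import math
from functools import lru_cache

# Example hierarchy from the text: each concept maps to its direct parents.
PARENTS = {
    "metal": (),
    "precious": ("metal",),
    "coinage": ("metal",),
    "palladium": ("precious",),
    "platinum": ("precious",),
    "silver": ("precious", "coinage"),
    "gold": ("precious", "coinage"),
    "copper": ("coinage",),
}
# Leaves are the concepts that are nobody's parent.
LEAVES = [c for c in PARENTS if all(c not in ps for ps in PARENTS.values())]


def ancestors(concept):
    """All proper ancestors of a concept."""
    found = set()
    for parent in PARENTS[concept]:
        found.add(parent)
        found |= ancestors(parent)
    return found


@lru_cache(maxsize=None)
def count_paths(concept, ancestor):
    """Number of distinct upward paths from concept to ancestor."""
    if concept == ancestor:
        return 1
    return sum(count_paths(parent, ancestor) for parent in PARENTS[concept])


def information_content(concept):
    """ASSUMED IC: -log of the fraction of leaves under (or equal to) the concept."""
    covered = sum(1 for leaf in LEAVES if leaf == concept or concept in ancestors(leaf))
    return -math.log(covered / len(LEAVES))


def disjunctive_ancestors(c1, c2):
    """Keep, for each distinct path difference, the most informative common ancestor."""
    best = {}  # path difference -> most informative ancestor seen so far
    for a in ancestors(c1) & ancestors(c2):
        diff = abs(count_paths(c1, a) - count_paths(c2, a))
        if diff not in best or information_content(a) > information_content(best[diff]):
            best[diff] = a
    return set(best.values())


def dishin(c1, c2):
    """Average IC of the disjunctive common ancestors."""
    selected = disjunctive_ancestors(c1, c2)
    return sum(information_content(a) for a in selected) / len(selected)


print(disjunctive_ancestors("platinum", "palladium"))  # {'precious'}
print(dishin("platinum", "palladium") > dishin("platinum", "gold"))  # True
```

Grouping by path difference and keeping the maximum IC per group mirrors the definition of a disjunctive ancestor given earlier; any other IC estimate could be substituted for `information_content` without changing the selection logic.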

When calculating the semantic similarity between silver and gold,
DiShIn starts by calculating the difference in the number of paths for each of their common ancestors:
gold -> coinage -> metal
gold -> precious -> metal
silver -> coinage -> metal
silver -> precious -> metal

gold -> precious
silver -> precious

gold -> coinage
silver -> coinage

As in the case of platinum and palladium, all common ancestors here have a path difference of zero, since silver and gold share the same relationships
and therefore have parallel interpretations.
Thus, only the most informative common ancestor (precious or coinage, whichever has the higher information content) is considered to be a disjunctive common ancestor.
This means that for DiShIn the similarity between silver and gold is greater than or equal to the similarity between any other pair of leaf concepts.
Thus, DiShIn does not penalize parallel interpretations as GraSM did.

Implementation

After estimating the information content for each concept and the number of distinct paths from one concept to another,
DiShIn can be implemented as a single SQL query, described in the authors' publication in the Journal of Biomedical Semantics.

SQL Implementation for the Gene Ontology MySQL release

Paths and IC are computed on the fly, so no preliminary calculations are required.

Usage example:
mysql -hmysql.ebi.ac.uk -ugo_select -pamigo -P4085 go_latest < DiShIn.sql
or use a local installation from pre-built database dumps http://www.geneontology.org/GO.downloads.database.shtml

Input GO terms to calculate shared information:
SET @t1Id = (SELECT id FROM term WHERE acc='GO:0060255'),
@t2Id = (SELECT id FROM term WHERE acc='GO:0031326');

or for example the terms used in http://dx.doi.org/10.1186/2041-1480-2-5
SET @t1Id = (SELECT id FROM term WHERE acc='GO:0008387'),
@t2Id = (SELECT id FROM term WHERE acc='GO:0008396');

or for example the terms used in http://dx.doi.org/10.1016/j.datak.2006.05.003
SET @t1Id = (SELECT id FROM term WHERE acc='GO:0008387'),
@t2Id = (SELECT id FROM term WHERE acc='GO:0008396');

Calculation of the maximum frequency for a term, assuming the number of gene products as the maximum frequency possible
SET @maxFreq = (SELECT COUNT(*) FROM gene_product);

Calculation of the information content of input term @t1Id
SET @t1IC = (
  SELECT -LOG(COUNT(DISTINCT a.gene_product_id)/@maxFreq) as ic
  FROM graph_path gp
  INNER JOIN association a ON (gp.term2_id = a.term_id)
  WHERE gp.term1_id = @t1Id
    AND a.is_not = 0
    AND gp.relationship_type_id IN (SELECT id FROM term WHERE name='part_of' OR name='is_a')
);

Calculation of the information content of input term @t2Id
SET @t2IC = (
  SELECT -LOG(COUNT(DISTINCT a.gene_product_id)/@maxFreq) as ic
  FROM graph_path gp
  INNER JOIN association a ON (gp.term2_id = a.term_id)
  WHERE gp.term1_id = @t2Id
    AND a.is_not = 0
    AND gp.relationship_type_id IN (SELECT id FROM term WHERE name='part_of' OR name='is_a')
);

Calculation of the disjunctive shared information (DiShIn) without requiring preliminary calculations.
It assumes that the difference of the number of distinct paths can be estimated on-the-fly by the difference of the number of distinct nodes in the paths.
SET @dishin = (
  SELECT AVG(dishin.ic)
  FROM (
    SELECT MAX(ca_ic.ic) as ic
    FROM (
      SELECT ca.term_id, ca.diff,
             -LOG(COUNT(DISTINCT a.gene_product_id)/@maxFreq) as ic
      FROM (
        SELECT ca.term_id, ABS(ca.ca_t1_number - ca.ca_t2_number) as diff
        FROM (
          SELECT ca.ancestor as term_id,
                 COUNT(DISTINCT ca_t1_nodes.term2_id) as ca_t1_number,
                 COUNT(DISTINCT ca_t2_nodes.term2_id) as ca_t2_number
          FROM (
            SELECT p1.term1_id as ancestor
            FROM graph_path p1, graph_path p2
            WHERE p1.term2_id = @t1Id
              AND p2.term2_id = @t2Id
              AND p1.term1_id = p2.term1_id
              AND p1.relationship_type_id IN (SELECT id FROM term WHERE name='part_of' OR name='is_a')
              AND p2.relationship_type_id IN (SELECT id FROM term WHERE name='part_of' OR name='is_a')
          ) as ca
          INNER JOIN graph_path ca_t1_nodes
            ON (ca.ancestor = ca_t1_nodes.term1_id)
          INNER JOIN graph_path ca_t2_nodes
            ON (ca.ancestor = ca_t2_nodes.term1_id)
          WHERE ca_t1_nodes.term2_id IN
            (
              SELECT p2.term1_id as ancestor
              FROM graph_path p2
              WHERE p2.term2_id = @t1Id
            )
            AND ca_t2_nodes.term2_id IN
            (
              SELECT p2.term1_id as ancestor
              FROM graph_path p2
              WHERE p2.term2_id = @t2Id
            )
            AND ca_t1_nodes.relationship_type_id IN (SELECT id FROM term WHERE name='part_of' OR name='is_a')
            AND ca_t2_nodes.relationship_type_id IN (SELECT id FROM term WHERE name='part_of' OR name='is_a')
          GROUP BY ca.ancestor
        ) as ca
      ) as ca
      INNER JOIN graph_path gp ON (ca.term_id = gp.term1_id)
      INNER JOIN association a ON (gp.term2_id = a.term_id)
      WHERE a.is_not = 0
        AND gp.relationship_type_id IN (SELECT id FROM term WHERE name='part_of' OR name='is_a')
      GROUP BY ca.term_id, ca.diff
    ) as ca_ic
    GROUP BY ca_ic.diff
  ) as dishin
);

Information content normalization to a [0..1] interval
SET @maxIC = ( SELECT -LOG(1/@maxFreq) );
SET @t1IC_norm = ( SELECT @t1IC/@maxIC );
SET @t2IC_norm = ( SELECT @t2IC/@maxIC );
SET @dishin_norm = ( SELECT @dishin/@maxIC );

Calculation of the semantic similarity measures using DiShIn:

Resnik

SELECT @dishin_norm as Sim_resnik;

Jiang & Conrath (a distance, so lower values mean more similar terms)

SELECT @t1IC_norm + @t2IC_norm - 2*@dishin_norm as Dist_jc;

Lin

SELECT (2*@dishin_norm) / (@t1IC_norm + @t2IC_norm) as Sim_lin;
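
The normalization and the three measures computed by the script can also be sketched outside the database. The Python below is an illustration under stated assumptions: the function names are hypothetical, the annotation counts and the shared information value are made-up sample numbers, and the root case where both information contents are zero is not handled.

```python
import math


def normalized_ic(frequency, max_frequency):
    """IC = -log(frequency / max_frequency), divided by the maximum IC -log(1 / max_frequency)."""
    ic = -math.log(frequency / max_frequency)
    max_ic = -math.log(1 / max_frequency)
    return ic / max_ic


def sim_resnik(shared_ic):
    """Resnik similarity: the shared information content itself."""
    return shared_ic


def dist_jc(ic1, ic2, shared_ic):
    """Jiang & Conrath distance: lower values mean more similar terms."""
    return ic1 + ic2 - 2 * shared_ic


def sim_lin(ic1, ic2, shared_ic):
    """Lin similarity: shared information relative to the terms' own information."""
    return 2 * shared_ic / (ic1 + ic2)


# Hypothetical sample numbers: 1000 gene products in total,
# 10 annotated to term 1 and 100 annotated to term 2.
ic1 = normalized_ic(10, 1000)
ic2 = normalized_ic(100, 1000)
shared = 0.25  # stands in for the normalized DiShIn value

print(sim_resnik(shared), dist_jc(ic1, ic2, shared), sim_lin(ic1, ic2, shared))
```

All three functions take the same normalized shared information value, so switching from DiShIn to another definition of shared information only changes the `shared` argument.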

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.