Concept Mining - AbsoluteAstronomy.com

Concept mining is an activity that results in the extraction of concepts from artifacts. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining. Because artifacts are typically a loosely structured sequence of words and other symbols (rather than concepts), the problem is nontrivial, but it can provide powerful insights into the meaning, provenance and similarity of documents.

Methods

Traditionally, the conversion of words to concepts has been performed using a thesaurus, and for computational techniques the tendency is to do the same. The thesauri used are either specially created for the task, or a pre-existing language model, usually related to Princeton's WordNet.

The mappings of words to concepts are often ambiguous. Typically each word in a given language will relate to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, where available. Machine translation systems cannot easily infer context.

For the purposes of concept mining, however, these ambiguities tend to be less important than they are in machine translation, because in large documents the ambiguities tend to even out, much as is the case with text mining.
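The thesaurus lookup and the evening-out effect described above can be illustrated with a minimal sketch. The thesaurus here (`TOY_THESAURUS`), its concept labels and the helper `concept_counts` are hypothetical stand-ins invented for this example; a real system would draw on WordNet or a purpose-built thesaurus. Each ambiguous word splits its vote evenly across its candidate concepts, and over a whole document the intended concept tends to accumulate the largest total.

```python
from collections import Counter

# Hypothetical toy thesaurus: each word maps to every concept it might denote.
TOY_THESAURUS = {
    "bank": ["financial_institution", "river_edge"],
    "loan": ["financial_institution"],
    "interest": ["financial_institution", "curiosity"],
    "deposit": ["financial_institution", "sediment"],
    "river": ["river_edge"],
}

def concept_counts(text):
    """Count candidate concepts over a text, splitting each word's vote evenly."""
    counts = Counter()
    for word in text.lower().split():
        candidates = TOY_THESAURUS.get(word, [])  # unknown words contribute nothing
        for concept in candidates:
            counts[concept] += 1.0 / len(candidates)
    return counts

doc = "bank loan interest deposit bank interest loan"
counts = concept_counts(doc)
dominant = counts.most_common(1)[0][0]
print(dominant, dict(counts))
```

No individual word is disambiguated in this sketch; the dominant concept emerges only from aggregation, which is why the effect depends on the document being reasonably large.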

There are many techniques for disambiguation that may be used. Examples are linguistic analysis of the text and the use of word and concept association frequency information that may be inferred from large text corpora. Recently, techniques based on semantic similarity between the possible concepts and the context have appeared and gained interest in the scientific community.
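As a rough illustration of context-based disambiguation, the sketch below picks the candidate concept whose associated words overlap most with the surrounding context. Plain word overlap is used here as a simple stand-in for a semantic similarity measure; the signatures in `CONCEPT_SIGNATURES`, the candidate table `CANDIDATES` and the function `disambiguate` are all hypothetical and chosen only for this example.

```python
# Hypothetical signatures: words assumed to co-occur with each concept.
CONCEPT_SIGNATURES = {
    "financial_institution": {"money", "loan", "account", "deposit"},
    "river_edge": {"river", "water", "shore", "fishing"},
}

# Hypothetical candidate concepts for an ambiguous word.
CANDIDATES = {"bank": ["financial_institution", "river_edge"]}

def disambiguate(word, context_words):
    """Return the candidate concept sharing the most words with the context.

    Returns None when the word has no known candidates.
    """
    context = {w.lower() for w in context_words}
    best, best_score = None, -1
    for concept in CANDIDATES.get(word, []):
        score = len(CONCEPT_SIGNATURES[concept] & context)
        if score > best_score:
            best, best_score = concept, score
    return best

sense_a = disambiguate("bank", "she opened an account at the bank to deposit money".split())
sense_b = disambiguate("bank", "they went fishing on the river bank".split())
print(sense_a, sense_b)
```

In practice the signatures would be derived from corpus frequency information or from the thesaurus itself rather than written by hand.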

Detecting and indexing similar documents in large corpora

One of the spin-offs of calculating document statistics in the concept domain, rather than the word domain, is that concepts form natural tree structures based on hypernymy and meronymy. These structures can be used to produce simple tree membership statistics that locate any document in a Euclidean concept space. If the size of a document is also considered as another dimension of this space, then an extremely efficient indexing system can be created. This technique is currently in commercial use locating similar legal documents in a 2.5 million document corpus.
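One possible reading of "tree membership statistics" is sketched below: for each node of a concept tree, compute the fraction of a document's concepts that fall under that node, then append the document size as an extra coordinate. The tiny hypernym tree (`PARENT`), the axis order (`AXES`) and the function `membership_vector` are hypothetical, and the use of raw size as the final coordinate is an assumption; the source does not say how the statistics or the size dimension are actually computed.

```python
import math

# Hypothetical hypernym tree: child concept -> parent concept.
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "trout": "fish",
    "fish": "animal",
}

# Fixed axis order for the concept space (one axis per tree node).
AXES = ["animal", "mammal", "fish", "dog", "cat", "trout"]

def membership_vector(concepts):
    """Map a document (a list of concepts from the tree) to a point in concept space.

    Each coordinate is the share of the document's concepts lying under that
    node; the last coordinate is the document size (an assumed choice).
    """
    counts = dict.fromkeys(AXES, 0)
    for concept in concepts:
        node = concept
        while node is not None:  # walk up to the root, crediting each ancestor
            counts[node] += 1
            node = PARENT.get(node)
    size = len(concepts)
    scale = max(size, 1)  # avoid division by zero for empty documents
    return [counts[axis] / scale for axis in AXES] + [float(size)]

doc_a = ["dog", "cat", "dog"]
doc_b = ["cat", "dog", "cat"]
doc_c = ["trout", "trout", "trout"]
vec_a, vec_b, vec_c = (membership_vector(d) for d in (doc_a, doc_b, doc_c))
near = math.dist(vec_a, vec_b)  # two documents about mammals
far = math.dist(vec_a, vec_c)   # mammals versus fish
print(near, far)
```

Because every document becomes a short fixed-length vector, similar documents can be found with ordinary nearest-neighbour search. A real index would likely rescale the size coordinate so that it does not dominate the distance.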

Clustering documents by topic

Standard numeric clustering techniques may be used in "concept space" as described above to locate and index documents by the inferred topic. These are numerically far more efficient than their text mining cousins, and tend to behave more intuitively, in that they map better to the similarity measures a human would generate.
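As an example of a standard numeric clustering technique applied in concept space, the sketch below runs a plain k-means over a handful of document vectors. The source does not name a specific algorithm, so k-means is an assumption, as are the two-axis vectors in `points` (imagined shares of a "law" concept and a "medicine" concept) and the deterministic choice of the first k points as starting centroids.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means with a deterministic start from the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical concept-space vectors: (share of "law", share of "medicine").
points = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

The vectors have one coordinate per concept rather than one per distinct word, so each distance computation touches far fewer dimensions than a word-level representation would.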

See also